# Final Masters Project

## Name: Sreekanth Palagiri, Student ID: R00184198

## Project Topic: Evaluation of Ensemble Approach for Sentiment Analysis on a Small Dataset

## NoteBook1: Trainer XLNet


### **Mount google drive**

In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
!ls "gdrive/My Drive/Colab Notebooks/Masters Project/VMDataset/"

 Bert.ipynb
 Ensemble_model-V1.ipynb
 Ensemble_model-V2.ipynb
 Ensemble_model-V3.ipynb
 Ensemble_model-V4.ipynb
 Export_loop-sentiment-pos-neg-train_05112020000000.csv
 Flair.ipynb
'Logistic Regression.ipynb'
 LSTM.ipynb
 Models
'Naive Bayees.ipynb'
 Roberta.ipynb
 XLNet.ipynb


### **Install Transformers**

In [None]:
!pip install sentencepiece
!pip install transformers



### **Setting Seed**

In [None]:
import random
import torch
import numpy as np

seed_val = 1               # a fixed seed ensures the samples are drawn in the same order on every training run
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
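
To illustrate what fixing the seed buys, here is a minimal sketch using only the standard library. `draw_order` is a hypothetical helper, not part of the notebook; it shows that the same seed reproduces the same shuffled order. Full determinism on a GPU may need additional settings that this notebook does not touch.

```python
import random

def draw_order(seed, n=10):
    """Return the indices 0..n-1 shuffled with a generator seeded by `seed`."""
    rng = random.Random(seed)   # independent generator, global state is untouched
    order = list(range(n))
    rng.shuffle(order)
    return order

first = draw_order(1)
second = draw_order(1)

# Same seed, same order: this is what makes training runs comparable.
print(first == second)
```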

### **Load Data and Preprocess**

In [None]:
import pandas as pd
import numpy as np

df=pd.read_csv("/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/sentimentpolarity.csv")
print(df.groupby(['label']).size())
df.head()

label
0    1000
1    1000
dtype: int64


Unnamed: 0,text,label
0,[ferrera] has the charisma of a young woman wh...,1
1,"both flawed and delayed , martin scorcese's ga...",1
2,"for his first attempt at film noir , spielberg...",1
3,easily one of the best and most exciting movie...,1
4,this director's cut -- which adds 51 minutes -...,0


**Preprocessor to Remove all special characters except emoticons**

In [None]:
import re

def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)                              # drop HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)    # collect emoticons such as :-) or ;D
    text = re.sub(r"[^A-Za-z0-9']+", ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')                         # re-append emoticons without the "nose"
    return text


print(df['text'][19])
print(preprocessor(df['text'][19]))

the only fun part of the movie is playing the obvious game . you try to guess the order in which the kids in the house will be gored . 
the only fun part of the movie is playing the obvious game you try to guess the order in which the kids in the house will be gored 


In [None]:
df['text'] = df['text'].apply(preprocessor)
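
The sample row above contains no markup or emoticons, so it only shows punctuation being dropped. The sketch below restates the preprocessor (with raw-string patterns so the block runs on its own) and applies it to a made-up review that contains an HTML tag and an emoticon. The input string is an illustration, not a row from the dataset.

```python
import re

def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)                            # drop HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)  # collect emoticons
    text = re.sub(r"[^A-Za-z0-9']+", ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')                       # append them at the end
    return text

sample = "<br />Great movie :-) but too long!!"
cleaned = preprocessor(sample)
print(repr(cleaned))  # 'great movie but too long :)'
```

The emoticon is moved to the end of the string and its hyphen is removed, while apostrophes inside words are preserved.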

### **Split into Train and test datasets**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test= train_test_split(df['text'], 
                                                   df['label'], 
                                                   random_state=1, 
                                                   test_size=0.15, 
                                                   shuffle=False)
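
With `shuffle=False` the split is purely positional: the first rows become the training set and the tail becomes the test set, so `random_state` has no effect here. The sketch below mimics that behaviour in plain Python. `sequential_split` is a hypothetical helper, and the rounding (ceiling for the test size) is an assumption about how scikit-learn sizes the split.

```python
import math

def sequential_split(items, test_size):
    """Split without shuffling: head is train, tail is test."""
    n_test = math.ceil(len(items) * test_size)
    n_train = len(items) - n_test
    return items[:n_train], items[n_train:]

rows = list(range(2000))                 # stand-in for the 2000 dataset rows
train_rows, test_rows = sequential_split(rows, 0.15)
print(len(train_rows), len(test_rows))   # 1700 300
```

Because nothing is shuffled or stratified, the class balance of the test tail depends on the row order in the CSV; the confusion matrix at the end of the notebook shows 143 negative and 157 positive test rows.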

### **XLNet Model**

**Define XLNet Tokenizer from pre-trained models**

In [None]:
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased") 

**Encode Test and Train DataSets**

In [None]:
encoded_data_train=tokenizer.batch_encode_plus(   
                        X_train.values,    
                        add_special_tokens=True,     
                        return_attention_mask=True,  
                        padding='longest',           # pad every sequence to the longest one in the set
                        truncation=True,    
                        max_length=256,   
                        return_tensors='pt') 


encoded_data_test=tokenizer.batch_encode_plus(
                        X_test.values,              # same encoding settings as for the training set
                        add_special_tokens=True,
                        return_attention_mask=True,
                        padding='longest',
                        max_length=256,
                        truncation=True,
                        return_tensors='pt')
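
What `padding='longest'` and `return_attention_mask=True` produce can be sketched in plain Python. The token ids and the `pad_id` of 0 below are made up for illustration, and `pad_to_longest` is a hypothetical helper rather than a library call. Left padding is used as the default because XLNet tokenizers in the transformers library are generally configured to pad on the left; treat that as an assumption.

```python
def pad_to_longest(sequences, pad_id=0, side="left"):
    """Pad every sequence to the longest length and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        pads, ones = [pad_id] * n_pad, [1] * len(seq)
        if side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append([0] * n_pad + ones)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(ones + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_to_longest([[11, 12, 13], [21]])
print(ids)   # [[11, 12, 13], [0, 0, 21]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

The mask marks real tokens with 1 and padding with 0, so the model can ignore the padded positions.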

### **Create Torch Datasets in the format required for HuggingFace**

In [None]:
from torch.utils.data import TensorDataset

input_ids_train= encoded_data_train['input_ids']
input_ids_test= encoded_data_test['input_ids']


attention_masks_train= encoded_data_train['attention_mask']
attention_masks_test= encoded_data_test['attention_mask']


labels_train= torch.tensor(Y_train.values)
labels_test= torch.tensor(Y_test.values)


dataset_train= TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_test= TensorDataset(input_ids_test, attention_masks_test, labels_test)

print(len(dataset_train),len(dataset_test))


1700 300


**Load Predefined XLNet Model for Classification**

In [None]:
from transformers import XLNetForSequenceClassification

model= XLNetForSequenceClassification.from_pretrained('xlnet-base-cased',                               # XLNet pre-trained model
        num_labels= 2,
        output_attentions=False,
        output_hidden_states=False
    )

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.weight', 'logits_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions a

**Prepare Torch DataLoader for Training Process**

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import AdamW, get_linear_schedule_with_warmup

batch_size=4 
epochs= 8

dataloader_train= DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)
# dataloader_train draws batches of `batch_size` samples in random order from the tokenized training data.

dataloader_test= DataLoader(
    dataset_test,
    sampler=RandomSampler(dataset_test),
    batch_size= 32
)

optimizer = AdamW(
    model.parameters(), # optimize all parameters of the model defined above
    lr=2.0e-5,      # per the paper, a learning rate between 2e-5 and 5e-5 is recommended; vary it to find the best accuracy
    eps=1e-8        # epsilon, a small constant for numerical stability
)

scheduler= get_linear_schedule_with_warmup(
    optimizer,                                        # the optimizer whose learning rate is scheduled
    num_warmup_steps=0,                               # no warmup: the learning rate starts at its full value
    num_training_steps=len(dataloader_train)*epochs   # total optimizer steps = batches per epoch * number of epochs

)
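
The effect of the linear schedule can be sketched without any library. The function below is a plain-Python restatement of the usual linear warmup and decay rule, written as an assumption about what `get_linear_schedule_with_warmup` computes; `linear_schedule_factor` is a hypothetical name. With 1700 training samples and a batch size of 4 there are 425 steps per epoch, so 8 epochs give 3400 steps.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2.0e-5
total_steps = 425 * 8   # 425 batches per epoch, 8 epochs

for step in (0, 1700, 3400):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With no warmup the learning rate starts at `2e-5`, is halved after four epochs, and reaches zero at the last step.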

**Checking Availability for GPU**

In [None]:
device= torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


**Define Evaluation Function**

In [None]:
from tqdm.notebook import tqdm  

def evaluate(dataloader_val):

    model.eval()                            # switch the model to evaluation mode (disables dropout)
    
    loss_val_total = 0
    predictions, true_vals = [], []         # collect logits and true labels batch by batch
    
    for batch in tqdm(dataloader_val):      # iterate over all validation batches
        
        batch = tuple(b.to(device) for b in batch) # move the batch tensors to the same device as the model
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],  # passing labels makes the model return the loss as well
                 }

        with torch.no_grad():               # no gradients are needed for evaluation
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()       # accumulate the loss of each batch

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)            # store the logits of this batch
        true_vals.append(label_ids)           # store the true labels of this batch
    
    loss_val_avg = loss_val_total/len(dataloader_val)  # average loss per batch
    
    predictions = np.concatenate(predictions, axis=0) # one array of logits for the whole set
    true_vals = np.concatenate(true_vals, axis=0)     # one array of true labels for the whole set
            
    return loss_val_avg, predictions, true_vals

### **Training XLNet Model**


In [None]:
from sklearn.metrics import accuracy_score

def scoring_func(preds, labels):
    preds_flat= np.argmax(preds, axis=1).flatten() # the index of the larger logit is the predicted label (0 or 1)
    labels_flat=labels.flatten()                   # true labels in the same flat shape for comparison
    return accuracy_score(labels_flat, preds_flat)
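
As a small worked example of what `scoring_func` computes, the sketch below takes four made-up logit rows, picks the index of the larger logit as the prediction, and compares it with made-up labels. It uses plain Python instead of NumPy and scikit-learn so that it runs on its own; `accuracy_from_logits` is a hypothetical helper.

```python
def argmax(row):
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, labels):
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.1]]  # illustrative values
labels = [0, 1, 0, 0]

acc = accuracy_from_logits(logits, labels)
print(acc)  # 0.75, the third row is predicted as 1 but labelled 0
```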

In [None]:

for epoch in tqdm(range(1, epochs+1)): # range excludes its upper bound, hence epochs+1
    model.train()                      # switch the model to training mode
    
    loss_train_total=0                 # running sum of the batch losses for this epoch
    
    progress_bar= tqdm(dataloader_train, 
                       desc='Epoch {:1d}'.format(epoch),
                       leave=False,
                       disable=False
                      )               # progress bar over the batches of the current epoch
    for batch in progress_bar:
        
        model.zero_grad()             # reset the gradients left over from the previous batch
        
        batch= tuple(b.to(device) for b in batch)
        
        inputs={
            'input_ids'     : batch[0],
            'attention_mask': batch[1],
            'labels'        : batch[2]         # passing labels makes the model return the loss
        }
        
        outputs= model(**inputs)
        
        loss= outputs[0]
        loss_train_total += loss.item()  # accumulate the batch loss
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip the gradient norm to 1.0
        
        optimizer.step()               # update the weights
        scheduler.step()               # update the learning rate
        
        # the loss is already averaged over the batch, so it is reported as-is
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
        
    torch.save(model.state_dict(), f'/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/XLnet_ft_epoch{epoch}.model') # save a checkpoint per epoch, named by epoch number
    
    tqdm.write('\nEpoch {}'.format(epoch))                          # tqdm.write prints without breaking the progress bar
    
    loss_train_avg= loss_train_total/len(dataloader_train) # average training loss per batch
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_test)
    val_acc= scoring_func(predictions, true_vals) # accuracy on the held-out set
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'Accuracy Score: {val_acc}')

Epoch 1
Training loss: 0.6489298375060453
Validation loss: 0.6273404791951179
Accuracy Score: 0.83

Epoch 2
Training loss: 0.5486109047002323
Validation loss: 0.704378516972065
Accuracy Score: 0.8466666666666667

Epoch 3
Training loss: 0.35452115848057847
Validation loss: 0.6633993246592581
Accuracy Score: 0.86

Epoch 4
Training loss: 0.20809048327244134
Validation loss: 0.9409565962851048
Accuracy Score: 0.85

Epoch 5
Training loss: 0.11916632933285334
Validation loss: 0.9939370483160019
Accuracy Score: 0.8566666666666667

Epoch 6
Training loss: 0.06387580710571081
Validation loss: 1.056897723674774
Accuracy Score: 0.87

Epoch 7
Training loss: 0.052318067769699744
Validation loss: 1.1516981601715088
Accuracy Score: 0.8566666666666667

Epoch 8
Training loss: 0.010574011055702132
Validation loss: 1.2206550985574722
Accuracy Score: 0.8566666666666667
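
The checkpoint loaded in the next section is the one from epoch 6. A minimal sketch of how that choice follows from the log, assuming the selection criterion is the highest validation accuracy (the numbers are copied from the output above; `history` is a hypothetical variable):

```python
# epoch -> (validation loss, accuracy), copied from the training log
history = {
    1: (0.6273404791951179, 0.83),
    2: (0.704378516972065, 0.8466666666666667),
    3: (0.6633993246592581, 0.86),
    4: (0.9409565962851048, 0.85),
    5: (0.9939370483160019, 0.8566666666666667),
    6: (1.056897723674774, 0.87),
    7: (1.1516981601715088, 0.8566666666666667),
    8: (1.2206550985574722, 0.8566666666666667),
}

best_epoch = max(history, key=lambda e: history[e][1])         # highest accuracy
lowest_loss_epoch = min(history, key=lambda e: history[e][0])  # lowest validation loss

print(best_epoch, lowest_loss_epoch)  # 6 1
```

The two criteria disagree: validation loss is lowest after the first epoch and rises afterwards while training loss keeps falling, which suggests the model becomes over-confident on the small training set even though accuracy stays in the 0.85 to 0.87 range.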



### **Load the best model, predict and evaluate**

In [None]:
import torch
from transformers import XLNetForSequenceClassification 

best_model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased',
                                                      num_labels=len(df.label.unique()),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

best_model.load_state_dict(torch.load('/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/XLnet_ft_epoch6.model'))

device= torch.device('cuda' if torch.cuda.is_available() else 'cpu')
best_model.to(device)


XLNetForSequenceClassification(
  (transformer): XLNetModel(
    (word_embedding): Embedding(32000, 768)
    (layer): ModuleList(
      (0): XLNetLayer(
        (rel_attn): XLNetRelativeAttention(
          (layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (ff): XLNetFeedForward(
          (layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (layer_1): Linear(in_features=768, out_features=3072, bias=True)
          (layer_2): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (1): XLNetLayer(
        (rel_attn): XLNetRelativeAttention(
          (layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (ff): XLNetFeedForward(
          (layer_norm): LayerNorm((768,), eps=1e

In [None]:
dataloader_test = DataLoader(
    dataset_test, 
    sampler=SequentialSampler(dataset_test), 
    batch_size=4
    )

In [None]:
import torch.nn.functional as F

def predict_xlnet(dataloader_test):
  
    best_model.eval()
    all_logits = []
    
    for batch in dataloader_test:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {
            'input_ids':      batch[0],
            'attention_mask': batch[1],
            }

        with torch.no_grad():        
            outputs = best_model(**inputs)
            
        # no labels are passed, so the first element of the output is the logits
        logits = outputs[0]
        all_logits.append(logits)
    
    all_logits = torch.cat(all_logits, dim=0)

    # the predicted class is the index of the largest logit
    preds_flat = np.argmax(all_logits.cpu().numpy(), axis=1).flatten()

    # softmax turns the logits into class probabilities
    probs = F.softmax(all_logits, dim=1).cpu().numpy()
    
    return preds_flat, probs
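
The relation between the logits and the probabilities returned by `predict_xlnet` can be shown with a small plain-Python softmax. This is a sketch for a single made-up logit pair, not the PyTorch implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)                       # subtracting the max avoids overflow
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 3.0])                   # illustrative logits for classes 0 and 1
predicted = max(range(len(probs)), key=lambda i: probs[i])

print(probs, predicted)
```

Softmax preserves the ordering of the logits, so taking the argmax of the logits or of the probabilities gives the same predicted class. The probabilities are what an ensemble can average across models.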

In [None]:
Y_pred, probs=predict_xlnet(dataloader_test)

In [None]:
from sklearn import metrics

print('F1 Score:',metrics.f1_score(Y_test,Y_pred),
      'Precision:',metrics.precision_score(Y_test,Y_pred),
      'Recall:',metrics.recall_score(Y_test,Y_pred),
      'Accuracy:',metrics.accuracy_score(Y_test,Y_pred))

F1 Score: 0.8737864077669903 Precision: 0.8881578947368421 Recall: 0.8598726114649682 Accuracy: 0.87


In [None]:
print(metrics.confusion_matrix(Y_test, Y_pred))

[[126  17]
 [ 22 135]]
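
The reported scores can be recomputed from the confusion matrix by hand. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]` for the positive class 1.

```python
cm = [[126, 17],
      [22, 135]]

tn, fp = cm[0]   # true label 0: predicted 0, predicted 1
fn, tp = cm[1]   # true label 1: predicted 0, predicted 1

precision = tp / (tp + fp)                      # 135 / 152
recall = tp / (tp + fn)                         # 135 / 157
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)      # 261 / 300

print(precision, recall, f1, accuracy)
```

These values agree with the printed F1 score, precision, recall and accuracy above.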
