# Final Masters Project

## Name: Sreekanth Palagiri, Student ID: R00184198

## Project Topic: Evaluation of Ensemble Approach for Sentiment Analysis on a Small Dataset

## Notebook 1: Flair Trainer


### **Mount google drive**

In [1]:
from google.colab import drive 
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [2]:
!ls "gdrive/My Drive/Colab Notebooks/Masters Project"

'Airline Tweets dataset'  'Sentence Polarity Dataset'
 glove.6B.300d.txt	   VMDataset


### **Install Flair**

In [3]:
!pip install flair



### **Load Data and Preprocess**

In [4]:
import pandas as pd
import numpy as np

df=pd.read_csv("/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/sentimentpolarity.csv")
print(df.groupby(['label']).size())
df.head()

label
0    1000
1    1000
dtype: int64


| | text | label |
|---|---|---|
| 0 | [ferrera] has the charisma of a young woman wh... | 1 |
| 1 | both flawed and delayed , martin scorcese's ga... | 1 |
| 2 | for his first attempt at film noir , spielberg... | 1 |
| 3 | easily one of the best and most exciting movie... | 1 |
| 4 | this director's cut -- which adds 51 minutes -... | 0 |


**Preprocessor: strip HTML tags and special characters, keeping apostrophes and emoticons**

In [5]:
import re

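def preprocessor(text):
    # Drop HTML-style tags.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is stripped.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse every run of characters other than letters, digits and
    # apostrophes into one space, then append the emoticons without the "-" nose.
    text = re.sub(r"[^A-Za-z0-9']+", ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    return text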

print(df['text'][19])
print(preprocessor(df['text'][19]))

the only fun part of the movie is playing the obvious game . you try to guess the order in which the kids in the house will be gored . 
the only fun part of the movie is playing the obvious game you try to guess the order in which the kids in the house will be gored 
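
To make the emoticon handling concrete, here is a minimal, self-contained sketch that reruns the same function on two made-up strings (the inputs are illustrative and not taken from the dataset). It also surfaces a quirk: emoticons are appended directly after the cleaned text, so when the text does not end in punctuation or whitespace, the first emoticon is glued to the last word.

```python
import re

def preprocessor(text):
    # Same logic as the notebook cell above.
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r"[^A-Za-z0-9']+", ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    return text

# Hypothetical inputs, chosen only to exercise the regexes.
with_tags = preprocessor("<b>Great</b> movie :-) but too long...")
mid_sentence = preprocessor("nice :) film")

print(repr(with_tags))     # 'great movie but too long :)'
print(repr(mid_sentence))  # 'nice film:)'
```

If the glued form matters for a downstream tokenizer, inserting a single space before the joined emoticons would be one way to avoid it; the notebook keeps the original behaviour.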


In [6]:
df['text'] = df['text'].apply(preprocessor)

In [7]:
from sklearn.model_selection import train_test_split

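# shuffle=False keeps the original row order, so the test set is the last 15% of
# the rows and random_state would have no effect.
df_train, df_test, sentiment_train, sentiment_test = train_test_split(
    df['text'], df['label'], test_size=0.15, shuffle=False)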


print('Length of train set:',len(df_train),'Length of test set:',len(df_test))

Length of train set: 1700 Length of test set: 300


In [8]:
# Map the integer labels to strings in a single assignment. Chained indexing such
# as df['label'][mask] = ... triggers pandas' SettingWithCopyWarning.
df['label'] = df['label'].map({0: 'Negative', 1: 'Positive'})


**Encode labels with the `__label__` prefix that Flair expects**

In [9]:
df['label'] = '__label__' + df['label'].astype(str)
df=df[['label','text']]
df.head()

| | label | text |
|---|---|---|
| 0 | __label__Positive | ferrera has the charisma of a young woman who... |
| 1 | __label__Positive | both flawed and delayed martin scorcese's gang... |
| 2 | __label__Positive | for his first attempt at film noir spielberg p... |
| 3 | __label__Positive | easily one of the best and most exciting movie... |
| 4 | __label__Negative | this director's cut which adds 51 minutes take... |


**Flair needs separate train, test, and validation (dev) files.**

**Create tab-separated files with an 85-10-5 split (train, test, validation), taken in row order.**

In [10]:
df.iloc[0:int(len(df)*0.85)].to_csv('/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/train.csv', sep='\t', 
                                                index = False, header = False)

df.iloc[int(len(df)*0.85):int(len(df)*0.95)].to_csv('/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/test.csv', sep='\t', 
                                                                      index = False, header = False)

df.iloc[int(len(df)*0.95):].to_csv('/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/val.csv', sep='\t', 
                                               index = False, header = False)
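
For the 2,000-row dataset, the slice boundaries used above can be worked out by hand. The sketch below repeats that arithmetic with plain ranges. It also compares the result with the 300-row hold-out set created earlier, under the assumption (labelled in the code) that `train_test_split` with `shuffle=False` keeps the final rows as its test portion.

```python
n_rows = 2000  # rows in the sentence polarity dataset loaded above

# Same cut points as the to_csv cell: 85% train, next 10% test, last 5% validation.
train_end = int(n_rows * 0.85)
test_end = int(n_rows * 0.95)

train_rows = range(0, train_end)
test_rows = range(train_end, test_end)
val_rows = range(test_end, n_rows)

# Assumption: with shuffle=False and test_size=0.15, train_test_split keeps
# the final 300 rows (the length printed earlier) as its test portion.
sklearn_test_rows = range(n_rows - 300, n_rows)

print(len(train_rows), len(test_rows), len(val_rows))
print(set(sklearn_test_rows) == set(test_rows) | set(val_rows))
```

If that assumption holds, the rows scored in the final section are the union of Flair's test and validation files, so the 100 validation rows seen during training also contribute to the reported metrics.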

**Create a corpus from the train, test, and validation files for model training.**

In [11]:
from flair.data import Corpus
from flair.datasets import ClassificationCorpus

# Folder in which the train, test and dev files reside
data_folder = '/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/'

corpus: Corpus = ClassificationCorpus(data_folder,
                                      test_file='test.csv',
                                      dev_file='val.csv',
                                      train_file='train.csv')

2021-04-27 21:44:09,851 Reading data from /content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models
2021-04-27 21:44:09,852 Train: /content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/train.csv
2021-04-27 21:44:09,857 Dev: /content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/val.csv
2021-04-27 21:44:09,860 Test: /content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/test.csv


### **Train Flair Model**


In [12]:
from torch.optim import Adam
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# 1. the corpus was created in the previous cell

# 2. create the label dictionary
label_dict = corpus.make_label_dictionary()

# 3. initialize transformer document embeddings (many models are available)
document_embeddings = TransformerDocumentEmbeddings('distilbert-base-uncased', fine_tune=True)

# 4. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

# 5. initialize the text classifier trainer with Adam optimizer
trainer = ModelTrainer(classifier, corpus, optimizer=Adam)

# 6. start the training
trainer.train('/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/resources/taggers/trec',
              learning_rate=3e-5, # use very small learning rate
              mini_batch_size=16,
              mini_batch_chunk_size=4, # optionally set this if transformer is too much for your machine
              max_epochs=5, # terminate after 5 epochs
              )

2021-04-27 21:44:09,901 Computing label dictionary. Progress:


100%|██████████| 1900/1900 [00:01<00:00, 1623.75it/s]

2021-04-27 21:44:11,189 [b'Positive', b'Negative']





2021-04-27 21:44:17,303 ----------------------------------------------------------------------------------------------------
2021-04-27 21:44:17,309 Model: "TextClassifier(
  (document_embeddings): TransformerDocumentEmbeddings(
    (model): DistilBertModel(
      (embeddings): Embeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (transformer): Transformer(
        (layer): ModuleList(
          (0): TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=True)
              (k_lin): Linear(in_features=768, out_features=768, bias=True)
              (v_lin): Linear(in_features=768, out_features=768, bias=True)
              (out_lin): Linear(in

{'dev_loss_history': [0.48591241240501404,
  0.8493568301200867,
  1.0318491458892822,
  1.2852423191070557,
  1.4093979597091675],
 'dev_score_history': [0.79, 0.79, 0.79, 0.79, 0.79],
 'test_score': 0.84,
 'train_loss_history': [0.5959003552724825,
  0.27639020471801007,
  0.07315631609567262,
  0.04205196625952059,
  0.010748998561521105]}
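
The history above shows the training loss falling every epoch while the dev loss rises and the dev score stays at 0.79, a pattern usually read as overfitting on a small dataset. A minimal sketch for inspecting it, using the values from the output rounded to four decimals:

```python
# Values copied from the training output above, rounded to four decimals.
train_loss = [0.5959, 0.2764, 0.0732, 0.0421, 0.0107]
dev_loss = [0.4859, 0.8494, 1.0318, 1.2852, 1.4094]

# Epochs are numbered from 1, as in the training log.
best_epoch = min(range(len(dev_loss)), key=dev_loss.__getitem__) + 1

# A growing gap between dev and train loss is the usual overfitting signal.
gaps = [round(d - t, 4) for t, d in zip(train_loss, dev_loss)]

print("Epoch with lowest dev loss:", best_epoch)
print("Dev minus train loss per epoch:", gaps)
```

Which epoch `best-model.pt` corresponds to depends on the trainer's selection criterion, which this output does not show.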

### **Predict and Score**

In [15]:
from flair.data import Sentence

model_flair=TextClassifier.load('/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/resources/taggers/trec/best-model.pt')

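results = []  # per sentence: [P(Negative), P(Positive)]
Y_pred = []   # predicted class: 1 = Positive, 0 = Negative
for text in df_test:
    sentence = Sentence(text)
    model_flair.predict(sentence)
    label = sentence.get_labels()[0]
    if label.value == 'Positive':
        # The score belongs to the predicted label, so P(Negative) is its complement.
        neg_score = 1 - label.score
        Y_pred.append(1)
    else:
        neg_score = label.score
        Y_pred.append(0)
    results.append([neg_score, 1 - neg_score])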

2021-04-27 21:47:57,802 loading file /content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/resources/taggers/trec/best-model.pt
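
The loop above turns the single predicted label and its confidence into a two-column score. The same mapping can be sketched without Flair, assuming the reported score is the probability of the predicted label; `to_neg_pos` is a hypothetical helper name, and the inputs are made up rather than model output.

```python
def to_neg_pos(label_value, label_score):
    """Turn one predicted label and its confidence into [P(Negative), P(Positive)]."""
    if label_value == 'Positive':
        return [1 - label_score, label_score]
    return [label_score, 1 - label_score]

# Hypothetical predictions, not model output.
pos_example = to_neg_pos('Positive', 0.75)
neg_example = to_neg_pos('Negative', 0.75)

print(pos_example)  # [0.25, 0.75]
print(neg_example)  # [0.75, 0.25]
```

Keeping the columns in a fixed order makes these scores directly comparable with the probability outputs of other classifiers in an ensemble.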


In [16]:
from sklearn import metrics

Y_test=sentiment_test
print('F1 Score:',metrics.f1_score(Y_test,Y_pred),
      'Precision:',metrics.precision_score(Y_test,Y_pred),
      'Recall:',metrics.recall_score(Y_test,Y_pred),
      'Accuracy:',metrics.accuracy_score(Y_test,Y_pred))

F1 Score: 0.812720848056537 Precision: 0.9126984126984127 Recall: 0.732484076433121 Accuracy: 0.8233333333333334


In [17]:
print(metrics.confusion_matrix(Y_test, Y_pred))

[[132  11]
 [ 42 115]]
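
The scores printed earlier can be recomputed from this confusion matrix by hand. The sketch below does so, relying on scikit-learn's layout of true classes in rows and predicted classes in columns, with class 0 (Negative) first.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 132, 11   # true Negative row
fn, tp = 42, 115   # true Positive row

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.4f}  Recall: {recall:.4f}")
print(f"F1: {f1:.4f}  Accuracy: {accuracy:.4f}")
```

The 42 false negatives against 11 false positives explain why recall (about 0.73) is well below precision (about 0.91): the model misses positive reviews more often than it mislabels negative ones.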
