# Final Masters Project

## Name: Sreekanth Palagiri, Student ID: R00184198

## Project Topic: Evaluation of Ensemble Approach for Sentiment Analysis on a Small Dataset

## Notebook 1: Trainer Logistic Regression


### **Mount google drive**

In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
!ls "gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset"

 Bert.ipynb		   Flair.ipynb		        rt-polarity.neg
 datapreperation.ipynb	  'Logistic Regression.ipynb'   rt-polarity.pos
 Ensemble_model-V1.ipynb   LSTM.ipynb		        sentimentpolarity.csv
 Ensemble_model-V2.ipynb   Models		        XLNet.ipynb
 Ensemble_model-V3.ipynb  'Naive Bayees.ipynb'
 Ensemble_model-V4.ipynb   Roberta.ipynb


### **Load Data and Preprocess**

In [None]:
import pandas as pd
import numpy as np

df=pd.read_csv("/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/sentimentpolarity.csv")
print(df.groupby(['label']).size())
df.head()

label
0    1000
1    1000
dtype: int64


Unnamed: 0,text,label
0,[ferrera] has the charisma of a young woman wh...,1
1,"both flawed and delayed , martin scorcese's ga...",1
2,"for his first attempt at film noir , spielberg...",1
3,easily one of the best and most exciting movie...,1
4,this director's cut -- which adds 51 minutes -...,0


**Preprocessor: strip HTML tags and special characters, keeping apostrophes and emoticons**

In [None]:
import re

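def preprocessor(text):
    # Strip HTML-style tags.
    text = re.sub(r'<[^>]*>', '', text)
    # Capture emoticons such as :-) ;( =D before punctuation is removed.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse every run of characters other than letters, digits and
    # apostrophes into one space, then append the emoticons without their "nose".
    text = re.sub(r"[^A-Za-z0-9']+", ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    return text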

print(df['text'][19])
print(preprocessor(df['text'][19]))

the only fun part of the movie is playing the obvious game . you try to guess the order in which the kids in the house will be gored . 
the only fun part of the movie is playing the obvious game you try to guess the order in which the kids in the house will be gored 
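
The sample row above contains no tags or emoticons, so it only shows the punctuation being removed. As a minimal sketch of the other two branches, the snippet below runs the same function on a made-up sentence (the `sample` string is hypothetical and not taken from the dataset).

```python
import re


def preprocessor(text):
    # Strip HTML-style tags.
    text = re.sub(r"<[^>]*>", "", text)
    # Capture emoticons such as :-) ;( =D before punctuation is removed.
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # Lowercase, collapse runs of other characters into one space,
    # then append the emoticons without their "nose".
    text = re.sub(r"[^A-Za-z0-9']+", " ", text.lower()) + \
        " ".join(emoticons).replace("-", "")
    return text


# Hypothetical input: one tag pair, one emoticon, trailing punctuation.
sample = "<b>Great</b> movie :-) but a bit long..."
cleaned = preprocessor(sample)
print(cleaned)  # great movie but a bit long :)
```

Emoticons are moved to the end of the string, so their original position in the sentence is lost while their presence is kept as a token.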


In [None]:
df['text'] = df['text'].apply(preprocessor)

In [None]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()


def stemm(text):
  return ' '.join([stemmer.stem(word) for word in text.split()])

print(stemm(df['text'][19]))

the onli fun part of the movi is play the obviou game you tri to guess the order in which the kid in the hous will be gore


In [None]:
df['text'] = df['text'].apply(stemm)

### **Separate Into Train and Test Sets**

In [None]:
from sklearn.model_selection import train_test_split

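# shuffle=False keeps the original row order: the first 85% of rows are used for
# training and the last 15% for testing, so random_state has no effect here.
df_train, df_test, sentiment_train, sentiment_test = train_test_split(df['text'], df['label'], 
                                                                      random_state=1, test_size=0.15, 
                                                                      shuffle=False)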


print('Length of train set:',len(df_train),'Length of test set:',len(df_test))

Length of train set: 1700 Length of test set: 300
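
Because the split is made with `shuffle=False`, it is purely positional. The sketch below illustrates this with a stand-in array of 2000 row positions (`rows` is an assumption for illustration, not the notebook's dataframe): the test set is simply the last 300 positions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the row positions of a 2000-row dataset.
rows = np.arange(2000)

# Same settings as the notebook: 15% test size, no shuffling.
train_rows, test_rows = train_test_split(rows, test_size=0.15, shuffle=False)

print(len(train_rows), len(test_rows))  # 1700 300
print(test_rows[0], test_rows[-1])      # 1700 1999
```

With a positional split, the class balance of the test set depends on how the rows are ordered in the CSV file. The confusion matrix later in this notebook shows 143 negative and 157 positive test rows.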


### **Logistic Regression Model**

**Define Tokenizer**

In [None]:
def tokenizer(text):
  # The text is already cleaned and stemmed, so splitting on whitespace is enough.
  return text.split()

**Fit TF-IDF Vectorizer**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None,
                        tokenizer= tokenizer,
                        use_idf=True,
                        norm='l2',
                        smooth_idf=True
                       )

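# Note: the vectorizer is fitted on the full dataset (train and test rows), so the
# vocabulary and IDF weights also reflect the test texts.
tfidf.fit(df['text'])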



TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=False, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function tokenizer at 0x7f55aa6b79e0>, use_idf=True,
                vocabulary=None)
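
To make the `use_idf=True`, `smooth_idf=True` and `norm='l2'` settings concrete, here is a small sketch on a three-document toy corpus (the `docs` list is hypothetical). It recomputes the smoothed IDF by hand, `ln((1 + n) / (1 + df)) + 1`, and compares it with what the vectorizer learned. `token_pattern=None` is added here only to signal that the custom tokenizer replaces the default pattern.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def tokenizer(text):
    # Whitespace tokenizer, as in the notebook.
    return text.split()


# Hypothetical stemmed documents, for illustration only.
docs = ["good movi", "bad movi", "good good plot"]

tfidf = TfidfVectorizer(lowercase=False, tokenizer=tokenizer,
                        token_pattern=None, use_idf=True,
                        norm="l2", smooth_idf=True)
X = tfidf.fit_transform(docs)

# Terms ordered by their column index in the matrix.
terms = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)

# Document frequency: number of documents containing each term.
doc_freq = np.array([sum(t in tokenizer(d) for d in docs) for t in terms])

# Smoothed IDF computed by hand.
n_docs = len(docs)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# With norm='l2', every document vector has unit Euclidean length.
row_norms = np.linalg.norm(X.toarray(), axis=1)

print(terms)
print(np.allclose(tfidf.idf_, manual_idf))
print(row_norms)
```

Rare terms such as `bad` and `plot` receive a larger IDF than terms that appear in most documents, and the L2 normalization keeps long and short reviews on the same scale.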

In [None]:
X_train=tfidf.transform(df_train)
Y_train=sentiment_train
X_test=tfidf.transform(df_test)
Y_test=sentiment_test

**Fit Logistic Regressor**

In [None]:
from sklearn.linear_model import LogisticRegressionCV

clf= LogisticRegressionCV(cv=5,
                          scoring='accuracy',
                          random_state=0,
                          n_jobs=-1,
                          verbose=3,
                          max_iter=300).fit(X_train, Y_train)
clf

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    5.5s finished


LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=300, multi_class='auto', n_jobs=-1, penalty='l2',
                     random_state=0, refit=True, scoring='accuracy',
                     solver='lbfgs', tol=0.0001, verbose=3)
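
The `Cs=10` in the output above means that ten candidate regularization strengths are tried, and cross-validation keeps the one with the best mean accuracy. The sketch below shows how the grid and the selected value could be inspected. It uses synthetic data from `make_classification` as a stand-in for the TF-IDF features, and `np.ravel` is used because the shape of `C_` may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for the TF-IDF matrix and labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

clf = LogisticRegressionCV(cv=5, scoring="accuracy",
                           random_state=0, max_iter=300).fit(X, y)

# The ten candidate values of C, log-spaced between 1e-4 and 1e4.
candidate_Cs = clf.Cs_

# The value selected by cross-validation (binary problem: a single value).
chosen_C = float(np.ravel(clf.C_)[0])

print(candidate_Cs)
print(chosen_C)
```

A small selected `C` corresponds to strong L2 regularization, which is worth checking when the train accuracy is much higher than the test accuracy.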

**Train and Test Scores**

In [None]:
print('Train Accuracy Score:',clf.score(X_train, Y_train))
print('Test Accuracy Score:',clf.score(X_test, Y_test))

Train Accuracy Score: 0.9788235294117648
Test Accuracy Score: 0.7266666666666667


In [None]:
from sklearn import metrics

Y_pred=clf.predict(X_test)
print('F1 Score:',metrics.f1_score(Y_test,Y_pred),
      'Precision:',metrics.precision_score(Y_test,Y_pred),
      'Recall:',metrics.recall_score(Y_test,Y_pred),
      'Accuracy:',metrics.accuracy_score(Y_test,Y_pred))

F1 Score: 0.7388535031847133 Precision: 0.7388535031847133 Recall: 0.7388535031847133 Accuracy: 0.7266666666666667


In [None]:
print(metrics.confusion_matrix(Y_test, Y_pred))

[[102  41]
 [ 41 116]]
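
The scores reported above can be recovered directly from this confusion matrix. The short sketch below does the arithmetic, reading the matrix in scikit-learn's layout (rows are true labels, columns are predicted labels, label 0 first). Precision and recall coincide here because the number of false positives equals the number of false negatives (41 each).

```python
# Confusion matrix from the cell above: rows = true label, columns = predicted.
cm = [[102, 41],
      [41, 116]]
(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)              # 116 / 157
recall = tp / (tp + fn)                 # 116 / 157
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)  # 218 / 300

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
# 0.7389 0.7389 0.7389 0.7267
```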


### **Save the Model**

In [None]:
from joblib import dump

dump(clf, '/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/clf_logistic.joblib') 

dump(tfidf, '/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/tfidf_logistic.joblib') 

['/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/tfidf_logistic.joblib']
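
For completeness, the following sketch shows how the two saved artifacts could be loaded again and used together. It is an illustration under assumptions: the toy texts, the temporary directory and the file names are hypothetical, and `token_pattern=r"\S+"` is a regex stand-in for the notebook's whitespace `tokenizer`. joblib stores a custom tokenizer function by reference, so the notebook that loads `tfidf_logistic.joblib` needs a `tokenizer` function defined under the same name before calling `load`.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the stemmed reviews.
texts = ["good movi", "great plot", "bad movi", "dull plot"]
labels = [1, 1, 0, 0]

# Whitespace tokenization expressed as a regex, so no custom function is pickled.
tfidf = TfidfVectorizer(lowercase=False, token_pattern=r"\S+")
clf = LogisticRegression().fit(tfidf.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as model_dir:
    clf_path = os.path.join(model_dir, "clf.joblib")
    tfidf_path = os.path.join(model_dir, "tfidf.joblib")
    dump(clf, clf_path)
    dump(tfidf, tfidf_path)

    # Reload both objects, as a downstream notebook would.
    loaded_clf = load(clf_path)
    loaded_tfidf = load(tfidf_path)

# New text must go through the same vectorizer before prediction.
new_texts = ["great movi", "dull movi"]
original_pred = clf.predict(tfidf.transform(new_texts))
reloaded_pred = loaded_clf.predict(loaded_tfidf.transform(new_texts))

print((original_pred == reloaded_pred).all())
```

Saving the vectorizer alongside the classifier matters because the classifier's coefficients are tied to the column order of the fitted vocabulary.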