# Movie sentiment analysis

**Data:**

The data come from the Kaggle movie sentiment project and consist of a training set and a test set. The training data has a Phrase column and a Sentiment column scored from 0 to 4 (negative to positive). The test data has only a Phrase column, and the task is to predict the sentiment of each phrase.

**Goal:**


Perform NLP analysis on the training data, compare the performance of different machine learning models, and predict the sentiment of each phrase in the test data.

**Highlight:** 

- **Show the NLP process step by step**
- **Streamline the NLP process with an ML Pipeline**
- **Use object-oriented programming to build a class that runs multiple ML models efficiently**

**NLP process steps:**

- Remove punctuation
- Tokenize sentence
- Remove stopwords
- Stem or lemmatize words:
  - Both methods reduce words to a base form (if using both, lemmatize first and then stem).
     - Stemming changes words by applying string rules, e.g. deleting the 's' at the end of a noun. Its main limitation is that the result may not be a real word, or may no longer carry the original meaning. Because it only applies string rules, it runs fast and is a good choice when processing time is a concern.
          - There are three common stemmers: Porter, Snowball (Porter2), and Lancaster. Porter is the original and the gentlest, but is generally considered the most computationally intensive. Snowball improves on Porter, is a little less intensive, and is the common choice. Lancaster is the most aggressive and the fastest, but the resulting words can be obscure.
     - Lemmatization changes words by looking them up in a dictionary, e.g. "went" becomes "go". It uses the word type (verb, noun) to choose the right base form, which helps with disambiguation, but it demands more computational power. It is the better fit when the output must consist of real dictionary words, for example in a full NLP system.
- Calculate TFIDF 
- Train ML models
- Compare model results and predict on the test data
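
The first four steps above can be sketched on a single phrase using only the standard library. The stopword set, suffix rules, and lemma dictionary below are tiny hypothetical stand-ins for the NLTK resources used later in this notebook. They are chosen only to show why a rule-based stemmer and a dictionary-based lemmatizer give different results.

```python
import string

# Hypothetical miniature resources (NLTK ships far larger ones).
TOY_STOPWORDS = {"the", "a", "to", "and", "was"}
TOY_SUFFIXES = ("ing", "ed", "s")  # rule-based stemming: strip a suffix
TOY_LEMMAS = {"went": "go", "movies": "movie", "was": "be"}  # dictionary lookup


def remove_punctuation(text):
    """Drop punctuation characters and lowercase the text."""
    return text.translate(str.maketrans("", "", string.punctuation)).lower()


def tokenize(text):
    """Split on whitespace; split() without arguments skips empty tokens."""
    return text.split()


def remove_stopwords(tokens):
    """Keep only the tokens that are not in the toy stopword set."""
    return [t for t in tokens if t not in TOY_STOPWORDS]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_lemmatize(token):
    """Look the token up in the dictionary; unknown words pass through."""
    return TOY_LEMMAS.get(token, token)


phrase = "The critics went to the movies, and loved it!"
tokens = remove_stopwords(tokenize(remove_punctuation(phrase)))
stemmed = [toy_stem(t) for t in tokens]
lemmatized = [toy_lemmatize(t) for t in tokens]

print(tokens)      # ['critics', 'went', 'movies', 'loved', 'it']
print(stemmed)     # ['critic', 'went', 'movie', 'lov', 'it']
print(lemmatized)  # ['critics', 'go', 'movie', 'loved', 'it']
```

The toy stemmer turns "loved" into "lov", which is not a word, and cannot relate "went" to "go". The toy lemmatizer handles "went" but only knows the words in its dictionary.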



**Reference:**


NLP process: 
https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1

TFIDF:
https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089

Hyperparameter on machine learning models:  https://github.com/davidsbatista/machine-learning-notebooks/blob/master/hyperparameter-across-models.ipynb



# Build ML models step by step
## Step 1: Exploratory Data Analysis

In [None]:
## Import basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
## Read data
train = pd.read_csv('/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip',sep="\t") 
test = pd.read_csv('/kaggle/input/sentiment-analysis-on-movie-reviews/test.tsv.zip',sep="\t") 

In [None]:
train.head()

In [None]:
train.info()

In [None]:
## Show the class distribution
plt.figure(figsize=(10,5))
ax=plt.axes()
ax.set_title('Number of sentiment class')
sns.countplot(x='Sentiment',data=train)

## Step 2: NLP process (step by step)
### Remove punctuation and lowercase

In [None]:
train.Phrase[:10]

In [None]:
import string
string.punctuation

In [None]:
train.Phrase=train.Phrase.apply(lambda x: x.translate(str.maketrans('','',string.punctuation)).lower())

In [None]:
train.Phrase[:10]

### Tokenize sentence

In [None]:
## split() without a separator splits on any whitespace and drops empty tokens
train.Phrase=train.Phrase.str.split()

In [None]:
train.Phrase[:10]

### Remove stopwords

In [None]:
from nltk.corpus import stopwords
stopwords_e=stopwords.words('english')

In [None]:
## Each row is a list of tokens, so filter the tokens inside every row
train.Phrase=train.Phrase.apply(lambda x: [w for w in x if w not in stopwords_e])
train.Phrase.head()

### Lemmatize words

In [None]:
import nltk
## Run once if the NLTK data is missing:
## nltk.download('stopwords'); nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer
lemmar=WordNetLemmatizer()

In [None]:
train.Phrase=train.Phrase.apply(lambda x: [lemmar.lemmatize(w) for w in x])

### Stemming words

In [None]:
## Method 1: Porter stemmer
from nltk.stem import PorterStemmer
porter=PorterStemmer()

In [None]:
train.Phrase=train.Phrase.apply(lambda x: [porter.stem(w) for w in x])

In [None]:
## Method 2: Snowball stemmer (an alternative to Method 1; running both cells stems the words twice)
from nltk.stem import SnowballStemmer
snow=SnowballStemmer('english')

In [None]:
train.Phrase=train.Phrase.apply(lambda x: [snow.stem(w) for w in x])

### TFIDF vectorize

TF-IDF: term frequency-inverse document frequency


**Formula:** 
TF-IDF = term frequency * inverse document frequency



- Term frequency (TF): count of word w in a document / total number of words in that document

- Document frequency (DF): number of documents that contain word w. With N the total number of documents, the inverse document frequency (IDF) is N/DF

- N/DF becomes very large for rare words in a big corpus, so we take its log: log(N/DF). A word that appears in no document has DF = 0, and we cannot divide by 0, so we add 1 to the denominator. The formula becomes: TF * log(N/(DF+1)) [More info in references]
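
The formula TF * log(N/(DF+1)) can be worked through by hand on a toy corpus. The sketch below is a plain-Python illustration under that exact formula. The function name `tf_idf` and the four example documents are made up for this example.

```python
import math


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)

    # Document frequency: number of documents that contain each term.
    df = {}
    for doc in documents:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1

    weights = []
    for doc in documents:
        doc_weights = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)           # share of the document
            idf = math.log(n_docs / (df[term] + 1))   # +1 avoids division by zero
            doc_weights[term] = tf * idf
        weights.append(doc_weights)
    return weights


docs = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "plot"],
    ["dull", "story"],
]
scores = tf_idf(docs)

# "bad" occurs in one document, "movie" in two, so "bad" gets the larger weight.
print(scores[1])
```

With this smoothing, a term that occurs in every document gets a slightly negative weight, because N/(DF+1) is then below 1. scikit-learn's `TfidfVectorizer` uses a differently smoothed IDF and normalizes each row, so its numbers will not match this sketch.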


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer,TfidfTransformer
vector=TfidfVectorizer(stop_words='english')

In [None]:
train.Phrase=train.Phrase.apply(lambda x: ' '.join(x))

In [None]:
vector1=vector.fit(train.Phrase)

In [None]:
train_feature=vector1.transform(train.Phrase)

In [None]:
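## Show only the first rows: converting the full sparse matrix to a dense array can exhaust memory
train_feature[:5].toarray()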

## Step3: Build ML models on train dataset

### Multi-class logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
lr=LogisticRegression(multi_class='ovr')

In [None]:
train.head()

In [None]:
train.info()

In [None]:
lr=lr.fit(train_feature,train.Sentiment)

In [None]:
## Coefficient
lr.coef_

In [None]:
## Evaluate on the training data, since the test set has no labels (this overstates performance on unseen data)
train_predict=lr.predict(train_feature)

In [None]:
## Number of rows in each true class
train.Sentiment.value_counts().sort_index()

In [None]:
## Number of rows in each predicted class
np.unique(train_predict,return_counts=True)

In [None]:
## Plot the predicted classes
plt.figure(figsize=(10,5))
ax=plt.axes()
ax.set_title('Number of sentiment class')
sns.countplot(x=train_predict)

In [None]:
## classification_report expects (y_true, y_pred); swapping them swaps precision and recall
print(classification_report(train.Sentiment, train_predict))

### Multi-class SVM

In [None]:
from sklearn import svm

In [None]:
svm1=svm.SVC(decision_function_shape='ovo')

In [None]:
svm1.fit(train_feature, train.Sentiment)

In [None]:
svm_train_pred=svm1.predict(train_feature)

In [None]:
## Number of rows in each predicted class
np.unique(svm_train_pred,return_counts=True)

In [None]:
plt.figure(figsize=(10,5))
ax=plt.axes()
ax.set_title('Number of sentiment class')
sns.countplot(x=svm_train_pred)

In [None]:
## classification_report expects (y_true, y_pred)
print(classification_report(train.Sentiment, svm_train_pred))

### Decision tree model

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
ds=DecisionTreeClassifier()
ds.fit(train_feature, train.Sentiment)

In [None]:
print(ds.feature_importances_)

In [None]:
ds_train_pred=ds.predict(train_feature)

In [None]:
train.Sentiment.value_counts().sort_index()

In [None]:
## Number of rows in each predicted class
np.unique(ds_train_pred,return_counts=True)

In [None]:
plt.figure(figsize=(10,5))
ax=plt.axes()
ax.set_title('Number of sentiment class')
sns.countplot(x=ds_train_pred)

In [None]:
## classification_report expects (y_true, y_pred)
print(classification_report(train.Sentiment, ds_train_pred))

### Random forest model

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf=RandomForestClassifier()
rf.fit(train_feature, train.Sentiment)

In [None]:
print(rf.feature_importances_)

In [None]:
rf_train_pred=rf.predict(train_feature)

In [None]:
plt.figure(figsize=(10,5))
ax=plt.axes()
ax.set_title('Number of sentiment class')
sns.countplot(x=rf_train_pred)

In [None]:
## classification_report expects (y_true, y_pred)
print(classification_report(train.Sentiment, rf_train_pred))

# Streamline the process 

## Method 1: Pipeline 
Only the LR model is used as an example.

In [None]:
def data_preprocess(text):
    text_nonpunc=[w.lower() for w in text if w not in string.punctuation]
    text_nonpunc=''.join(text_nonpunc)
    text_rmstop=[x for x in text_nonpunc.split(' ') if x not in stopwords_e]
    text_stem=[snow.stem(w) for w in text_rmstop]
    text1=' '.join(text_stem)
    return (text1)

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
# TfidfVectorizer() did not work here; see:
# https://stackoverflow.com/questions/50192763/python-sklearn-pipiline-fit-attributeerror-lower-not-found
# TfidfTransformer must be combined with CountVectorizer().
# data_preprocess returns a string, so it is passed as `preprocessor`;
# an `analyzer` callable would have to return a list of tokens instead.
lrpipeline=Pipeline([('preprocess',CountVectorizer(preprocessor=data_preprocess)),
                  ('Tfidf',TfidfTransformer()),
                  ('classify',LogisticRegression())])

In [None]:
lrpipeline.fit(train.Phrase,train.Sentiment)

In [None]:
## The fitted pipeline keeps the training vocabulary, so it can be applied to the test phrases directly
result=lrpipeline.predict(test['Phrase'])

In [None]:
np.unique(result)

In [None]:
plt.figure(figsize=(10,5))
ax=plt.axes()
ax.set_title('Number of sentiment class')
sns.countplot(x=result)

## Method 2: OOP, build a class that runs all models

Run the top 3 models (based on accuracy on the training data) through one class. 

In [None]:
## Import all packages
import string
import pandas as pd
from scipy import stats
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer,TfidfTransformer

## Shared objects used by data_preprocess and EstimatorSelection
stopwords_e=stopwords.words('english')
snow=SnowballStemmer('english')
vector=TfidfVectorizer(stop_words='english')

In [None]:
## Preprocess function
def data_preprocess(text):
    text_nonpunc=[w.lower() for w in text if w not in string.punctuation]
    text_nonpunc=''.join(text_nonpunc)
    text_rmstop=[x for x in text_nonpunc.split(' ') if x not in stopwords_e]
    text_stem=[snow.stem(w) for w in text_rmstop]
    text1=' '.join(text_stem)
    return (text1)

In [None]:
## OOP class: fits every model on the same TF-IDF features and collects reports and predictions
## Note: it relies on the global `vector` and `data_preprocess` defined above
class EstimatorSelection:
    
    def __init__(self, models):
        self.models=models
        self.keys=models.keys()
        self.results={}
        self.modelfit={}
        self.modelpredict={}
    def fit(self, x, y):
        x1=x.apply(lambda i: data_preprocess(i))
        x_feature1=vector.fit_transform(x1)
        for key in self.keys:
            model=self.models[key]
            self.modelfit[key]=model.fit(x_feature1,y)
            y_pred=model.predict(x_feature1)
            self.results[key]=classification_report(y, y_pred,output_dict=True)
    def predict(self,test_x):
        test_x1=test_x.apply(lambda i: data_preprocess(i))
        test_feature1=vector.transform(test_x1)
        test_frames=[]
        for key in self.keys:
            modelfit=self.modelfit[key]
            test_y=modelfit.predict(test_feature1)
            test_frame=pd.DataFrame(test_y,columns=[key])
            test_frames.append(test_frame)
        predict_frame=pd.concat(test_frames,axis=1)            
        return(predict_frame)     
    def summary(self):
        Frames=[]
        for key in self.keys:
            result=self.results[key]
            Frame=pd.DataFrame(result['macro avg'], index=[key])
            Frames.append(Frame)
        result_sum=pd.concat(Frames)
        return result_sum.iloc[:,:3]

In [None]:
## Models to fit and then apply to the test data
models = { 
    'LogisticClassifier': LogisticRegression(multi_class='ovr'),
    'RandomforestClassifier':RandomForestClassifier(),
    'DecisionTreeClassifier':DecisionTreeClassifier()
}

In [None]:
model_compare=EstimatorSelection(models)

In [None]:
model_compare.fit(train.Phrase, train.Sentiment)

### Compare model performance
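
The summary reports the `macro avg` precision, recall, and F1 from `classification_report`. As a rough sketch of what a macro average means, the plain-Python code below computes per-class precision and recall and takes their unweighted mean. The helper names `per_class_metrics` and `macro_average` and the label lists are made up for this example; they are not scikit-learn functions.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall)} for every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    metrics = {}
    for label in labels:
        # True positives: rows where truth and prediction both equal the label.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = sum(1 for t in y_true if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics


def macro_average(y_true, y_pred):
    """Unweighted mean of the per-class precision and recall."""
    metrics = per_class_metrics(y_true, y_pred)
    n = len(metrics)
    macro_precision = sum(p for p, _ in metrics.values()) / n
    macro_recall = sum(r for _, r in metrics.values()) / n
    return macro_precision, macro_recall


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 1, 2]

p, r = macro_average(y_true, y_pred)
p_swapped, r_swapped = macro_average(y_pred, y_true)
print(round(p, 4), round(r, 4))
print(round(p_swapped, 4), round(r_swapped, 4))
```

Every class counts equally in a macro average, so a small class that is predicted poorly lowers the score as much as a large one. Passing the arguments in the wrong order swaps precision and recall, which is why the true labels must come first.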

In [None]:
summary=model_compare.summary()
summary

In [None]:
predict_result=model_compare.predict(test.Phrase)
predict_result

## Reshape the result dataframe for plotting
Method: `melt()` followed by `groupby()`

In [None]:
predict_result1=predict_result.reset_index().rename(columns={'index':'case'})

In [None]:
predict_result2=pd.melt(predict_result1,id_vars='case', value_vars=['LogisticClassifier', 'RandomforestClassifier', 'DecisionTreeClassifier'])
predict_result2

In [None]:
predict_result3=predict_result2.groupby(['variable','value']).size().reset_index().rename(columns={0:'count'})
predict_result3

## Compare the models' predictions

In [None]:
plt.figure(figsize=(10,5))
ax=plt.axes()
ax.set_title('Number of predictions per class for each model')
sns.barplot(x='value', y='count', hue='variable', data=predict_result3)

## Get the final result as the mode of the three models' predictions
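
Taking the mode of three predictions is a majority vote. The sketch below shows the idea in plain Python, including what happens when all three models disagree. The helper `majority_vote` and the example rows are made up for this example. Its tie rule (return the smallest label) mirrors the documented behaviour of `scipy.stats.mode`, which the next cell uses.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label; ties resolve to the smallest label."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


# One row per phrase, one value per model (hypothetical predictions).
rows = [
    (2, 2, 3),  # two models agree on 2
    (0, 1, 4),  # all three disagree: the smallest label wins
    (3, 1, 3),  # two models agree on 3
]
final = [majority_vote(row) for row in rows]
print(final)  # [2, 0, 3]
```

When all three models disagree, this rule always picks the lowest, most negative sentiment. Only the model prediction columns should vote; an index column such as `case` must be excluded.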

In [None]:
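Final_results=[]
## Vote over predict_result (model columns only); predict_result1 also holds the 'case' index column, which must not vote
for i in range(predict_result.shape[0]):
    Final_result=stats.mode(predict_result.iloc[i].to_numpy()).mode.item()
    Final_results.append(Final_result)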

In [None]:
predict_result1['Final_result']=Final_results
predict_result1

In [None]:
test['Sentiment']=Final_results
test

## Submission

In [None]:
#make the predictions with trained model and submit the predictions.
sub_file = pd.read_csv('/kaggle/input/sentiment-analysis-on-movie-reviews/sampleSubmission.csv',sep=',')
sub_file.Sentiment=Final_results
sub_file.to_csv('Submission.csv',index=False)