<a href="https://colab.research.google.com/github/toraaglobal/tutorial/blob/text_mining/sentiment_analysis_movie_review_kaggle_MNB_SVM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Sentiment Analysis of Movie Reviews
**Comparing MNB and SVMs for Kaggle Sentiment Classification**
***

The Rotten Tomatoes movie review dataset is a corpus of movie reviews used for sentiment analysis, originally collected by Pang and Lee. In their work on sentiment treebanks, Socher et al. used Amazon's Mechanical Turk to create fine-grained labels for all parsed phrases in the corpus. This competition presents a chance to benchmark your sentiment-analysis ideas on the Rotten Tomatoes dataset. You are asked to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive. Obstacles like sentence negation, sarcasm, terseness, language ambiguity, and many others make this task very challenging.


[Data Source](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/overview)

**Evaluation**
Submissions are evaluated on classification accuracy (the percent of labels that are predicted correctly) for every parsed phrase. The sentiment labels are:

* 0 - negative
* 1 - somewhat negative
* 2 - neutral
* 3 - somewhat positive
* 4 - positive
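
As a minimal sketch of this metric, accuracy is the fraction of phrases whose predicted label equals the true label. The function name `accuracy` and the toy labels below are illustrative assumptions, not part of the competition code.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels on the 0-4 sentiment scale (made up for illustration)
toy_true = [2, 2, 3, 0, 4]
toy_pred = [2, 1, 3, 0, 2]
print(accuracy(toy_true, toy_pred))
```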

**Import Packages**
The following Python packages are used for this analysis.

In [1]:
## Packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
import re

## vectorization
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

## model
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

## model evaluation
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix  # confusion matrix, model evaluation


from sklearn.model_selection import train_test_split
from scipy.sparse import csr_matrix
import nltk


%matplotlib inline



**Mount Gdrive**

The movie review data is stored on Google Drive.

In [2]:

## Mount the gdrive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)


## change directory to the gdrive

os.chdir('./drive/My Drive/Colab Notebooks/data')

Mounted at /content/drive


**Read the datasets**

In [3]:
## Read Data

train = pd.read_csv('./kaggle-sentiment/train.tsv', sep='\t')
test = pd.read_csv('./kaggle-sentiment/test.tsv', sep='\t')

print('Train shape : {}'.format(train.shape))
print('Test shape : {}'.format(test.shape))


**View Train Data**

In [None]:
train.head(5)

In [None]:
# pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='Sentiment', data=train)

**Get the baseline accuracy**

In [None]:
## count the proportion of each class for the baseline accuracy

unique, count = np.unique(train.Sentiment, return_counts=True)

for label, n in zip(unique, count):
  print('{} : {}%'.format(label, (n/len(train) * 100)))

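The baseline accuracy is about 51%: the share of the most frequent class, which is what a classifier that always predicts that class would achieve.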
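
The same idea can be sketched without NumPy: count the labels, take the most frequent one, and report its share. The helper name `majority_baseline` and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Toy labels on the 0-4 scale (made up; not the real class distribution)
toy_labels = [2, 2, 2, 1, 3, 2, 4, 0, 3, 2]
baseline_label, baseline_acc = majority_baseline(toy_labels)
print(baseline_label, baseline_acc)
```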

**Vectorization**

MNB (Multinomial Naive Bayes) takes count vectors as input. The same n-gram count vectorizers will be used for both MNB and SVM to compare their accuracy in classifying sentiment. SVM does not require any special form of vectorization as input.


In [None]:
# n-gram term frequency vectorizers (1-2, 1-3, 2-3 and 2-2 grams), minimum document frequency set to 5
gram12_count_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(1,2), min_df=5, stop_words='english')
gram13_count_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(1,3), min_df=5, stop_words='english')
gram23_count_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(2,3), min_df=5, stop_words='english')
gram22_count_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(2,2), min_df=5, stop_words='english')
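
To make the `ngram_range` and `stop_words` settings concrete, here is a small sketch on two made-up phrases. It also shows a side effect worth knowing: scikit-learn's built-in English stop word list contains "not", so the token and any bigram built from it disappear when `stop_words='english'` is set. The toy phrases and variable names are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_phrases = ["good movie", "not good movie"]

# Unigrams and bigrams, no stop word removal
plain_vec = CountVectorizer(ngram_range=(1, 2))
plain_vec.fit(toy_phrases)
plain_vocab = set(plain_vec.vocabulary_)

# Same n-gram range, but with the built-in English stop word list
stop_vec = CountVectorizer(ngram_range=(1, 2), stop_words='english')
stop_vec.fit(toy_phrases)
stop_vocab = set(stop_vec.vocabulary_)

print(sorted(plain_vocab))
print(sorted(stop_vocab))
```

With stop words removed, both toy phrases map to the same features, which is one motivation for adding an explicit negation feature later on.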

**Get train and target features**

In [None]:
## get the phrase and the sentiment
features = train['Phrase'].values
target = train['Sentiment'].values

**Cross Validation Score of the model performance**

* create a pipeline

In [None]:
## create a pipeline
def score_model_pipeline(model, vectorizer, X,y, cv=5):
  pipe = Pipeline([('vect', vectorizer), ('model', model)])
  scores = cross_val_score(pipe, X,y,cv=cv)
  print('Avg Score: {}'.format(sum(scores)/len(scores)))
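
Wrapping the vectorizer and the classifier in one `Pipeline` means the vocabulary is refitted on the training part of every fold, so the held-out fold never influences the features. A minimal sketch on made-up data follows; the toy texts and labels are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up two-class data, six examples per class
toy_texts = [
    "good film", "great fun", "loved it", "fine acting", "really good", "great story",
    "bad film", "awful plot", "hated it", "boring acting", "really bad", "awful story",
]
toy_labels = [1] * 6 + [0] * 6

# The vectorizer is a pipeline step, so it is fitted inside each fold
toy_pipe = Pipeline([("vect", CountVectorizer()), ("model", MultinomialNB())])
toy_scores = cross_val_score(toy_pipe, toy_texts, toy_labels, cv=3)
print(toy_scores.mean())
```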

**Initial Modeling**

Cross Val Score

In [None]:

## create a model container
model = {}

## add SVM and multinomial naive Bayes to the model container
model['SVM'] = LinearSVC()
model['MNB'] = MultinomialNB()

## create a vectorization container
vec = {}

## add vectorizer to the container
vec['ngram12'] =  CountVectorizer(input="content",encoding='latin-1', ngram_range=(1,2), min_df=5, stop_words='english')
vec['ngram13'] =   CountVectorizer(input="content",encoding='latin-1', ngram_range=(1,3), min_df=5, stop_words='english')
vec['ngram23'] = CountVectorizer(input="content",encoding='latin-1', ngram_range=(2,3), min_df=5, stop_words='english')
vec['ngram22'] =  CountVectorizer(input="content",encoding='latin-1', ngram_range=(2,2), min_df=5, stop_words='english')



## k-fold cross validation function (3 folds by default)
## create a pipeline
def score_model_pipeline(model, vectorizer, X,y, cv=3):
  '''k-fold cross validation pipeline (default cv=3); returns the average score'''
  nbc = Pipeline([('vect', vectorizer), ('nb', model)])
  scores = cross_val_score(nbc, X,y,cv=cv)
  print('Avg Score: {}'.format(sum(scores)/len(scores)))
  return sum(scores)/len(scores)


## get features and label from dataframe
def get_X_y_from_df(df, X='text', y='label'):
  '''get the text,features and the label, the target. return features and target'''
  X = list(df[X].values)
  y = df[y].values
  return X,y

## create empty lists to store the scores
score = []
vec_use = []
model_use= []

## get X and y
X,y = get_X_y_from_df(train, X='Phrase', y='Sentiment')

## run the cross validation using the pipeline
for mod in model:
  # loop through the model in the model container
  for v in vec:
    # loop through the vectorizer in the vectorizer container
    cv = score_model_pipeline(model[mod], vec[v], X,y)
    score.append(cv)  # append the score
    vec_use.append(v)  # append the vectorization used
    model_use.append(mod) # append the model used
    
    
  
## create a data frame of the cross validation scores of the classifiers
## note: the column is named '10 fold Avg Score', but the scores above come from cv=3
result = {'Model': model_use, 'Vectorization': vec_use, '10 fold Avg Score': score}
sentiment_df = pd.DataFrame(result)
sentiment_df

In [None]:
sentiment_df

**Train Test Split**
* Create a train and test set. The test set is used to evaluate the model performance

In [None]:
## Train test split

X_train, X_test, y_train,y_test = train_test_split(features, target, test_size=0.3, random_state=0)

print("X_train: {}".format(X_train.shape))
print("X_test: {}".format(X_test.shape))
print("y_train: {}".format(y_train.shape))
print("y_test: {}".format(y_test.shape))

**Data Prep**



In [None]:
## vectorization

X_train_vec = gram12_count_vectorizer.fit_transform(X_train)

X_test_vec = gram12_count_vectorizer.transform(X_test)



## get feature names (get_feature_names() was removed in recent scikit-learn versions)
feature_names = gram12_count_vectorizer.get_feature_names_out()


In [None]:

sns.lineplot(x=sentiment_df.Vectorization, y=sentiment_df['10 fold Avg Score'])


In [None]:
#KDEPlot: Kernel Density Estimate Plot
fig = plt.figure(figsize=(15,4))
ax=sns.kdeplot(sentiment_df.loc[(sentiment_df['Model'] == 'SVM'),'10 fold Avg Score'] , color='b',fill=True, label='SVM')
ax=sns.kdeplot(sentiment_df.loc[(sentiment_df['Model'] == 'MNB'),'10 fold Avg Score'] , color='r',fill=True, label='MNB')
plt.title('SVM vs MNB')

**Modeling**

### MNB

In [None]:
## MNB
mnb_model =  MultinomialNB()

mnb_model.fit(X_train_vec, y_train)

mnb_prediction = mnb_model.predict(X_test_vec)

## Classification Report (true labels first, then predictions)
report = classification_report(y_test, mnb_prediction)
print(report)

In [None]:
## Confusion matrix (rows: true labels, columns: predicted labels)
mnb_cm  = confusion_matrix(y_test, mnb_prediction)
print(mnb_cm)
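
scikit-learn's `classification_report` and `confusion_matrix` take the true labels first and the predictions second, and the rows of the matrix correspond to true labels. Swapping the two arguments transposes the matrix and exchanges precision with recall. The sketch below rebuilds the counts by hand on made-up labels to show the difference; `confusion_counts` is a hypothetical helper, not a library function.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Count matrix with rows indexed by true label and columns by predicted label."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Made-up labels for three classes
toy_true = [0, 0, 1, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 1, 2]
cm = confusion_counts(toy_true, toy_pred, 3)

# Recall divides by the row sum (all true class-1 examples);
# precision divides by the column sum (all class-1 predictions).
recall_1 = cm[1][1] / sum(cm[1])
precision_1 = cm[1][1] / sum(row[1] for row in cm)
print(cm, recall_1, precision_1)
```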

**Feature Importance**: MNB

In [None]:

# feature importance
featLogProb = []
ind = 0
for feats in feature_names:
    ## the following line takes the absolute difference between log P(feature | class 1)
    ## and log P(feature | class 0); it measures how well the feature separates these two classes.
    featLogProb.append(abs(mnb_model.feature_log_prob_[1,ind] - mnb_model.feature_log_prob_[0,ind]))
    s = ""
    s += (feats)
    s +="  " 
    s += str(featLogProb[ind])
    s += "\n" 
    print(s)
    ind = ind + 1
    



In [None]:


#feats_sorted = sorted(featLogProb , reverse = True)
## Sort features based on importance!
sort_inds = sorted(range(len(featLogProb)), key=featLogProb.__getitem__, reverse = True)
for i in range(10):
    s = ""
    s += feature_names[sort_inds[i]]
    s += ":  "
    s += str(featLogProb[sort_inds[i]])
    s += "\n"
    print(s)

In [None]:


#feats_sorted = sorted(featLogProb , reverse = True)
## Sort in ascending order to list the 10 least important features
sort_inds = sorted(range(len(featLogProb)), key=featLogProb.__getitem__, reverse = False)
for i in range(10):
    s = ""
    s += feature_names[sort_inds[i]]
    s += ":  "
    s += str(featLogProb[sort_inds[i]])
    s += "\n"
    print(s)
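
The ranking above can also be written in vectorized form. This is a sketch on a made-up 2-class log-probability table standing in for `feature_log_prob_`; the feature names and probabilities are assumptions chosen for illustration.

```python
import numpy as np

toy_features = np.array(["bad", "film", "good", "the"])

# Made-up P(feature | class) rows, standing in for exp(feature_log_prob_)
toy_log_prob = np.log(np.array([
    [0.60, 0.20, 0.05, 0.15],   # class 0
    [0.10, 0.20, 0.55, 0.15],   # class 1
]))

# Absolute log-probability difference between the two classes, per feature
importance = np.abs(toy_log_prob[1] - toy_log_prob[0])

# Indices from most to least important
order = np.argsort(importance)[::-1]
top_two = toy_features[order[:2]]
print(list(top_two))
```

Features with identical probabilities in both classes ("film", "the") score zero, so they end up at the bottom of the ranking.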

### SVM

In [None]:
# initialize the LinearSVC model
svm_model = LinearSVC(C=1)


# use the training data to train the model
svm_model.fit(X_train_vec,y_train)


# make prediction
svm_prediction = svm_model.predict(X_test_vec)

# get classification report (true labels first, then predictions)
print(classification_report(y_test, svm_prediction))

In [None]:
## print confusion matrix (rows: true labels, columns: predicted labels)
svm_cm = confusion_matrix(y_test, svm_prediction)
print(svm_cm)

**Feature Importance**

In [None]:
## LinearSVC trains one binary classifier per class; its weights rank all features by their contribution to each classifier
## For category "0" (very negative), get all features and their weights and sort them in increasing order
feature_ranks = sorted(zip(svm_model.coef_[0],gram12_count_vectorizer.get_feature_names_out()))


## get the 10 features that are best indicators of very negative sentiment (they are at the bottom of the ranked list)
very_negative_10 = feature_ranks[-10:]
print("Very negative words")
for i in range(0, len(very_negative_10)):
    print(very_negative_10[i])
print()

In [None]:
## get the 10 features with the most negative weights for "very negative" sentiment (they are at the top of the ranked list)
not_very_negative_10 = feature_ranks[:10]
print("not very negative words")
for i in range(0, len(not_very_negative_10)):
    print(not_very_negative_10[i])
print()

In [None]:
## LinearSVC trains one binary classifier per class; its weights rank all features by their contribution to each classifier
## For category "4" (very positive), get all features and their weights and sort them in increasing order
feature_ranks = sorted(zip(svm_model.coef_[4],gram12_count_vectorizer.get_feature_names_out()))


## get the 10 features that are best indicators of very positive sentiment (they are at the bottom of the ranked list)
very_positive_10 = feature_ranks[-10:]
print("Very positive words")
for i in range(0, len(very_positive_10)):
    print(very_positive_10[i])
print()

In [None]:
## get the 10 features with the most negative weights for "very positive" sentiment (they are at the top of the ranked list)
not_very_positive_10 = feature_ranks[:10]
print("not very positive words")
for i in range(0, len(not_very_positive_10)):
    print(not_very_positive_10[i])
print()

**Interpretation**

In [None]:
## get the confidence scores for all test examples from each of the five binary classifiers
svm_confidence_scores = svm_model.decision_function(X_test_vec)

## get the confidence score for the first test example
print(svm_confidence_scores[0])

## sample output: array([-1.05306321, -0.62746206,  0.31074854, -0.89709483, -1.08343089])
## because the confidence score is the highest for category 2, 
## the prediction should be 2. 

## Confirm by printing out the actual prediction, followed by the true label
print(svm_prediction[0])
print(y_test[0])
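
The rule described in the comments, pick the class whose binary classifier gives the highest score, can be checked on the sample scores quoted above. This is a plain-Python sketch; the variable names are assumptions.

```python
# Sample decision scores for one test example, one score per class 0-4
sample_scores = [-1.05306321, -0.62746206, 0.31074854, -0.89709483, -1.08343089]

# The predicted class is the index of the largest score
predicted_class = max(range(len(sample_scores)), key=sample_scores.__getitem__)
print(predicted_class)
```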

**Error Analysis**

In [None]:
# print out specific type of error for further analysis

# print out the very positive examples that are mistakenly predicted as negative
# according to the confusion matrix, there should be 5 such examples
# note if you use a different vectorizer option, your result might be different

err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==4 and svm_prediction[i]==0):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)

In [None]:
# very positive examples that are mistakenly predicted as somewhat negative


err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==4 and svm_prediction[i]==1):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)
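
Since the same loop is repeated for several (true label, predicted label) pairs, it can be wrapped in a small helper. The sketch below is one possible form; `collect_errors` and the toy data are hypothetical and not part of the original analysis.

```python
def collect_errors(texts, y_true, y_pred, true_label, pred_label):
    """Return the texts whose true label is true_label but were predicted as pred_label."""
    return [
        text
        for text, t, p in zip(texts, y_true, y_pred)
        if t == true_label and p == pred_label
    ]


# Made-up phrases, labels and predictions
toy_texts = ["a joy to watch", "dull and lifeless", "a masterpiece", "fine but forgettable"]
toy_true = [4, 0, 4, 2]
toy_pred = [4, 0, 1, 2]

errors_4_as_1 = collect_errors(toy_texts, toy_true, toy_pred, true_label=4, pred_label=1)
print(errors_4_as_1)
print("errors:", len(errors_4_as_1))
```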

**Feature Engineering**

In [None]:
## function for negation detection

def has_negation(post):
    pattern_neg_1 = re.compile(r'\b(not|no|never)\b')
    pattern_neg_2 = re.compile(r'\b([a-z]+less)\b')
    if pattern_neg_1.search(post.lower()) or pattern_neg_2.search(post.lower()):
        return 1
    else: 
        return 0

In [None]:
txts = ['this is good', 'this is bad', 'this is not good', 'this is not bad', 'this is useless']
df = pd.DataFrame({'text':txts})
pattern_neg = re.compile(r'\b(not|no|never)\b')
print(df)

In [None]:
## apply the function (has_negation already returns 1 or 0)

df['neg'] = df['text'].apply(has_negation)
print(df)
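
The two patterns inside `has_negation` behave differently, which matters when only the `not|no|never` pattern is used. The sketch below returns both flags side by side; `negation_flags` is a hypothetical helper written for this comparison. It also shows a limitation of the `-less` suffix rule: ordinary words such as "unless" match it.

```python
import re

SIMPLE_NEG = re.compile(r'\b(not|no|never)\b')
LESS_SUFFIX = re.compile(r'\b[a-z]+less\b')


def negation_flags(text):
    """Return (simple, full): the explicit-negator flag and the flag that also uses the -less rule."""
    lowered = text.lower()
    simple = int(bool(SIMPLE_NEG.search(lowered)))
    full = int(bool(simple or LESS_SUFFIX.search(lowered)))
    return simple, full


for phrase in ["this is good", "this is not good", "this is useless", "unless it rains"]:
    print(phrase, negation_flags(phrase))
```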

**Now vectorize the text and combine the word vectors with the negation feature values.**

In [None]:
from scipy import sparse

vecs = gram12_count_vectorizer.transform(df['text']).astype(float)
#print(vecs)
X_dense = df[['neg']]
X_sparse = vecs
X = sparse.hstack([X_sparse, X_dense]).tocsr()
print(X)
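
What `sparse.hstack` does here is easiest to see on a tiny matrix: the extra feature becomes one more column appended to the right of the word counts. The numbers below are made up; converting the dense column to a sparse matrix first is a choice made in this sketch, not something the original code does.

```python
import numpy as np
from scipy import sparse

# Two documents, three word-count features (made-up values)
word_counts = sparse.csr_matrix(np.array([[1, 0, 2], [0, 1, 0]], dtype=float))

# One extra column: the negation flag for each document
neg_flag = sparse.csr_matrix(np.array([[0], [1]], dtype=float))

combined = sparse.hstack([word_counts, neg_flag]).tocsr()
print(combined.shape)
print(combined.toarray())
```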

**Now Apply the function to the Data**

In [None]:
y=train['Sentiment']

pattern_neg = re.compile(r'\b(not|no|never)\b')
train['neg'] = train['Phrase'].apply(lambda x: 1 if pattern_neg.search(x.lower()) else 0)

X_dense = train[['neg']]
X_sparse = gram12_count_vectorizer.fit_transform(train['Phrase']).astype(float)
X = sparse.hstack([X_sparse, X_dense]).tocsr()

In [None]:
%%time
# test the model with negation detection

svm_model = LinearSVC()
scores = cross_val_score(svm_model, X, y, cv=3, n_jobs=3)
avg=sum(scores)/len(scores)
print(avg)

In [None]:
%%time
# test the model without negation detection
# note this cross validation is not the standard pipeline method
# but a cut-corner version that does vectorization first and then train/test models
# this cut-corner version would allow the model to see the text of the test data, 
# but the model would still not see the labels of the test data
svm_model2= LinearSVC()
scores2 = cross_val_score(svm_model2, X_sparse, y, cv=3, n_jobs=3)
avg2=sum(scores2)/len(scores2)
print(avg2)

In [None]:
svm_model = LinearSVC()

svm_model.fit(X,y)

########## submit to Kaggle submission


# preserve the id column of the test examples
kaggle_ids= test['PhraseId'].values

# read in the text content of the examples
kaggle_X_test=test['Phrase'].values

# vectorize the test examples using the vocabulary fitted on the full training data
kaggle_X_test_vec= gram12_count_vectorizer.transform(kaggle_X_test)


## add the negation feature to X_test
pattern_neg = re.compile(r'\b(not|no|never)\b')
test['neg'] = test['Phrase'].apply(lambda x: 1 if pattern_neg.search(x.lower()) else 0)

X_dense = test[['neg']]

kaggle_X = sparse.hstack([kaggle_X_test_vec, X_dense]).tocsr()


# predict 
kaggle_pred=svm_model.predict(kaggle_X)

# write the header, then one "PhraseId,Sentiment" line per test example;
# the with-block closes the output file automatically
with open('kaggle_submission_linearSVC.csv', 'w') as outf:
    outf.write('PhraseId,Sentiment\n')
    for phrase_id, pred in zip(kaggle_ids, kaggle_pred):
        outf.write(str(phrase_id) + ',' + str(pred) + '\n')
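
The submission format is a header row followed by one `PhraseId,Sentiment` pair per line. A sketch using the standard `csv` module is shown below; it writes to an in-memory buffer so the result can be inspected. The helper name `write_submission` and the ids and predictions are made up for illustration.

```python
import csv
import io


def write_submission(handle, phrase_ids, predictions):
    """Write a header row and one (PhraseId, Sentiment) row per prediction."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["PhraseId", "Sentiment"])
    writer.writerows(zip(phrase_ids, predictions))


# Made-up ids and predictions; an open file handle can be passed instead of the buffer
buffer = io.StringIO()
write_submission(buffer, [101, 102], [2, 3])
print(buffer.getvalue())
```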

**Kaggle test score:** 61.037% 

**Model Optimization**

In [None]:
%%time

c = [0.1,0.2, 0.3, 0.5, 1, 2,5,7]

score = []

for i in c:
  svm_model = LinearSVC(C=i)
  scores = cross_val_score(svm_model, X, y, cv=3, n_jobs=3)
  avg=sum(scores)/len(scores)
  score.append(avg)


plt.figure()
plt.plot(c,score)
plt.show()

In [None]:
sns.lineplot(x=c, y=score)

In [None]:
%%time

c = [0.001, 0.002, 0.004, 0.005, 0.008, 0.01, 0.02,0.04, 0.1, 0.2, 0.4,0.5,1,2]

score = []

for i in c:
  svm_model = LinearSVC(C=i)
  scores = cross_val_score(svm_model, X, y, cv=3, n_jobs=3)
  avg=sum(scores)/len(scores)
  score.append(avg)
  
  
sns.lineplot(x=c, y=score)
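
The manual loop over `C` values can also be expressed with scikit-learn's `GridSearchCV`, which runs the cross validation for each candidate and records the mean score. The sketch below uses made-up toy data and a small grid; it is an alternative formulation under those assumptions, not the procedure used to produce the scores in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up two-class data, six examples per class
toy_texts = [
    "good film", "great fun", "loved it", "fine acting", "really good", "great story",
    "bad film", "awful plot", "hated it", "boring acting", "really bad", "awful story",
]
toy_labels = [1] * 6 + [0] * 6

c_grid = [0.1, 1, 2]
search_pipe = Pipeline([("vect", CountVectorizer()), ("svm", LinearSVC())])

# "svm__C" addresses the C parameter of the pipeline step named "svm"
grid = GridSearchCV(search_pipe, param_grid={"svm__C": c_grid}, cv=3)
grid.fit(toy_texts, toy_labels)

print(grid.best_params_)
print(grid.cv_results_["mean_test_score"])
```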

**C=0.2**

In [None]:
svm_model = LinearSVC(C=0.2)

svm_model.fit(X,y)

########## submit to Kaggle submission


# preserve the id column of the test examples
kaggle_ids= test['PhraseId'].values

# read in the text content of the examples
kaggle_X_test=test['Phrase'].values

# vectorize the test examples using the vocabulary fitted on the full training data
kaggle_X_test_vec= gram12_count_vectorizer.transform(kaggle_X_test)


## add the negation feature to X_test
pattern_neg = re.compile(r'\b(not|no|never)\b')
test['neg'] = test['Phrase'].apply(lambda x: 1 if pattern_neg.search(x.lower()) else 0)

X_dense = test[['neg']]

kaggle_X = sparse.hstack([kaggle_X_test_vec, X_dense]).tocsr()


# predict 
kaggle_pred=svm_model.predict(kaggle_X)

# write the header, then one "PhraseId,Sentiment" line per test example;
# the with-block closes the output file automatically
with open('kaggle_submission_linearSVC-0.2.csv', 'w') as outf:
    outf.write('PhraseId,Sentiment\n')
    for phrase_id, pred in zip(kaggle_ids, kaggle_pred):
        outf.write(str(phrase_id) + ',' + str(pred) + '\n')

**Kaggle test score:** 60.945%

**C=0.1**

In [None]:
svm_model = LinearSVC(C=0.1)

svm_model.fit(X,y)

########## submit to Kaggle submission


# preserve the id column of the test examples
kaggle_ids= test['PhraseId'].values

# read in the text content of the examples
kaggle_X_test=test['Phrase'].values

# vectorize the test examples using the vocabulary fitted on the full training data
kaggle_X_test_vec= gram12_count_vectorizer.transform(kaggle_X_test)


## add the negation feature to X_test
pattern_neg = re.compile(r'\b(not|no|never)\b')
test['neg'] = test['Phrase'].apply(lambda x: 1 if pattern_neg.search(x.lower()) else 0)

X_dense = test[['neg']]

kaggle_X = sparse.hstack([kaggle_X_test_vec, X_dense]).tocsr()


# predict 
kaggle_pred=svm_model.predict(kaggle_X)

# write the header, then one "PhraseId,Sentiment" line per test example;
# the file name now reflects C=0.1 so the C=0.2 submission is not overwritten
with open('kaggle_submission_linearSVC-0.1.csv', 'w') as outf:
    outf.write('PhraseId,Sentiment\n')
    for phrase_id, pred in zip(kaggle_ids, kaggle_pred):
        outf.write(str(phrase_id) + ',' + str(pred) + '\n')

**Kaggle test score:** 60.366%

**C=2**

In [None]:
svm_model = LinearSVC(C=2)

svm_model.fit(X,y)

########## submit to Kaggle submission


# preserve the id column of the test examples
kaggle_ids= test['PhraseId'].values

# read in the text content of the examples
kaggle_X_test=test['Phrase'].values

# vectorize the test examples using the vocabulary fitted on the full training data
kaggle_X_test_vec= gram12_count_vectorizer.transform(kaggle_X_test)


## add the negation feature to X_test
pattern_neg = re.compile(r'\b(not|no|never)\b')
test['neg'] = test['Phrase'].apply(lambda x: 1 if pattern_neg.search(x.lower()) else 0)

X_dense = test[['neg']]

kaggle_X = sparse.hstack([kaggle_X_test_vec, X_dense]).tocsr()


# predict 
kaggle_pred=svm_model.predict(kaggle_X)

# write the header, then one "PhraseId,Sentiment" line per test example;
# the file name now reflects C=2 so the C=0.2 submission is not overwritten
with open('kaggle_submission_linearSVC-2.csv', 'w') as outf:
    outf.write('PhraseId,Sentiment\n')
    for phrase_id, pred in zip(kaggle_ids, kaggle_pred):
        outf.write(str(phrase_id) + ',' + str(pred) + '\n')

**Kaggle test score:** 60.47%

**C=1 (the `LinearSVC` default) has the highest score, 61.037%, on the Kaggle test set**

**Tune Penalty**

In [None]:
%%time
# test the model with negation detection

svm_model = LinearSVC(C=1)
scores = cross_val_score(svm_model, X, y, cv=3, n_jobs=3)
avg=sum(scores)/len(scores)
print(avg)

## Unigram and Bigram

In [None]:

## create a model container
model = {}

## add SVM and multinomial naive Bayes to the model container
model['SVM'] = LinearSVC()
model['MNB'] = MultinomialNB()

## create a vectorization container
vec = {}

## add vectorizer to the container
vec['ngram12'] =  CountVectorizer(input="content",encoding='latin-1', ngram_range=(1,2), min_df=5, stop_words='english')
vec['unigram'] =   CountVectorizer(input="content",encoding='latin-1', binary=False, min_df=5, stop_words='english')


## k-fold cross validation function (3 folds by default)
## create a pipeline
def score_model_pipeline(model, vectorizer, X,y, cv=3):
  '''k-fold cross validation pipeline (default cv=3); returns the average score'''
  nbc = Pipeline([('vect', vectorizer), ('nb', model)])
  scores = cross_val_score(nbc, X,y,cv=cv)
  print('Avg Score: {}'.format(sum(scores)/len(scores)))
  return sum(scores)/len(scores)


## get features and label from dataframe
def get_X_y_from_df(df, X='text', y='label'):
  '''get the text,features and the label, the target. return features and target'''
  X = list(df[X].values)
  y = df[y].values
  return X,y

## create empty lists to store the scores
score = []
vec_use = []
model_use= []

## get X and y
X,y = get_X_y_from_df(train, X='Phrase', y='Sentiment')

## run the cross validation using the pipeline
for mod in model:
  # loop through the model in the model container
  for v in vec:
    # loop through the vectorizer in the vectorizer container
    cv = score_model_pipeline(model[mod], vec[v], X,y)
    score.append(cv)  # append the score
    vec_use.append(v)  # append the vectorization used
    model_use.append(mod) # append the model used
    
    
  
## create a data frame of the cross validation scores of the classifiers
## note: the column is named '10 fold Avg Score', but the scores above come from cv=3
result = {'Model': model_use, 'Vectorization': vec_use, '10 fold Avg Score': score}
sentiment_df = pd.DataFrame(result)
sentiment_df

* C = 1
* bigram
* feature engineering
* error analysis
* SVM (C=1)

In [None]:
svm_model = LinearSVC(C=1)


# use the training data to train the model
svm_model.fit(X_train_vec,y_train)


# make prediction
svm_prediction = svm_model.predict(X_test_vec)

# get classification report (true labels first, then predictions)
print(classification_report(y_test, svm_prediction))

In [None]:
## print confusion matrix (rows: true labels, columns: predicted labels)
svm_cm = confusion_matrix(y_test, svm_prediction)
print(svm_cm)

### **Negative Error Analysis**

In [None]:
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==0 and svm_prediction[i]==4):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)

In [None]:
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==0 and svm_prediction[i]==3):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)

In [None]:
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==0 and svm_prediction[i]==2):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)

In [None]:
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==1 and svm_prediction[i]==4):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)

In [None]:
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==1 and svm_prediction[i]==3):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)

**Observation**

### **Positive Error Analysis**

In [None]:
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==4 and svm_prediction[i]==0):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)

In [None]:
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==4 and svm_prediction[i]==1):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)

In [None]:
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==4 and svm_prediction[i]==2):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)

In [None]:
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==3 and svm_prediction[i]==0):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)

In [None]:
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==3 and svm_prediction[i]==1):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)

**Positive Error Observation**

**Using nltk stemmer to see if it will improve the performance of the model**

In [None]:
## English stemmer
english_stemmer = nltk.stem.SnowballStemmer('english')

## class that stems the analyzer output before counting
## note: the default analyzer already yields n-grams, so with ngram_range=(1,2)
## each bigram is passed to the stemmer as a single string
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: [english_stemmer.stem(w) for w in analyzer(doc)]

In [None]:
##  stem vectorizer
stem_ngram_vectorizer = StemmedCountVectorizer(input="content",encoding='latin-1', ngram_range=(1,2), min_df=5, stop_words='english', analyzer="word")

In [None]:
y=train['Sentiment']

pattern_neg = re.compile(r'\b(not|no|never)\b')
train['neg'] = train['Phrase'].apply(lambda x: 1 if pattern_neg.search(x.lower()) else 0)

X_dense = train[['neg']]
X_sparse = stem_ngram_vectorizer.fit_transform(train['Phrase']).astype(float)
X = sparse.hstack([X_sparse, X_dense]).tocsr()

In [None]:
svm_model = LinearSVC(C=1)

svm_model.fit(X,y)

########## submit to Kaggle submission


# preserve the id column of the test examples
kaggle_ids= test['PhraseId'].values

# read in the text content of the examples
kaggle_X_test=test['Phrase'].values

# vectorize the test examples using the vocabulary fitted on the full training data
kaggle_X_test_vec= stem_ngram_vectorizer.transform(kaggle_X_test)


## add the negation feature to X_test
pattern_neg = re.compile(r'\b(not|no|never)\b')
test['neg'] = test['Phrase'].apply(lambda x: 1 if pattern_neg.search(x.lower()) else 0)

X_dense = test[['neg']]

kaggle_X = sparse.hstack([kaggle_X_test_vec, X_dense]).tocsr()


# predict 
kaggle_pred=svm_model.predict(kaggle_X)

# write the header, then one "PhraseId,Sentiment" line per test example;
# the with-block closes the output file automatically
with open('kaggle_submission_linearSVC-stem.csv', 'w') as outf:
    outf.write('PhraseId,Sentiment\n')
    for phrase_id, pred in zip(kaggle_ids, kaggle_pred):
        outf.write(str(phrase_id) + ',' + str(pred) + '\n')

**Kaggle test score:** 60.7%

### **Highest Test Score Model**

In [None]:
y=train['Sentiment']

pattern_neg = re.compile(r'\b(not|no|never)\b')
train['neg'] = train['Phrase'].apply(lambda x: 1 if pattern_neg.search(x.lower()) else 0)

X_dense = train[['neg']]
X_sparse = gram12_count_vectorizer.fit_transform(train['Phrase']).astype(float)
X = sparse.hstack([X_sparse, X_dense]).tocsr()




svm_model = LinearSVC()

svm_model.fit(X,y)

########## submit to Kaggle submission


# preserve the id column of the test examples
kaggle_ids= test['PhraseId'].values

# read in the text content of the examples
kaggle_X_test=test['Phrase'].values

# vectorize the test examples using the vocabulary fitted on the full training data
kaggle_X_test_vec= gram12_count_vectorizer.transform(kaggle_X_test)


## add the negation feature to X_test
pattern_neg = re.compile(r'\b(not|no|never)\b')
test['neg'] = test['Phrase'].apply(lambda x: 1 if pattern_neg.search(x.lower()) else 0)

X_dense = test[['neg']]

kaggle_X = sparse.hstack([kaggle_X_test_vec, X_dense]).tocsr()


# predict 
kaggle_pred=svm_model.predict(kaggle_X)

# write the header, then one "PhraseId,Sentiment" line per test example;
# the with-block closes the output file automatically
with open('kaggle_submission_linearSVC-final.csv', 'w') as outf:
    outf.write('PhraseId,Sentiment\n')
    for phrase_id, pred in zip(kaggle_ids, kaggle_pred):
        outf.write(str(phrase_id) + ',' + str(pred) + '\n')

**Kaggle test score:** 61.037%