In [1]:
import os
# Raw string so the backslashes in the Windows path are not read as escape sequences
os.chdir(r"D:\E-Commerce\sentiment labelled sentences")

In [2]:
import numpy as np
import pandas as pd

In [3]:
data = pd.read_csv('imdb_labelled.txt', header=None, sep=r"\t", engine='python')
data.columns = ['review','sentiment']

In [4]:
data.head(n=5)

,review,sentiment
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
review       1000 non-null object
sentiment    1000 non-null int64
dtypes: int64(1), object(1)
memory usage: 15.7+ KB


In [11]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
corpus, test_corpus, y, yt = train_test_split(data.iloc[:, 0], data.iloc[:, 1],
                                              test_size=0.25, random_state=101)

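### After splitting the data, the code transforms the text with several NLP techniques: token counts, unigrams and bigrams, stop-word removal, text length normalization (the default L2 norm of TfidfTransformer), and the TF-IDF transformation. The vectorizer and the TF-IDF transformer are fitted on the training corpus only and then applied to the test corpus.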

In [7]:
from sklearn.feature_extraction import text
vectorizer = text.CountVectorizer(ngram_range=(1,2),
                                  stop_words='english').fit(corpus)
TfidF = text.TfidfTransformer()
X = TfidF.fit_transform(vectorizer.transform(corpus))
Xt = TfidF.transform(vectorizer.transform(test_corpus))
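
To make the transformation above more concrete, here is a minimal sketch on a two-sentence toy corpus. The sentences and the names `toy_corpus`, `toy_vectorizer`, `toy_vocab`, `toy_X`, and `row_norms` are hypothetical and not part of the original notebook; only the `CountVectorizer` and `TfidfTransformer` settings mirror the cell above.

```python
import numpy as np
from sklearn.feature_extraction import text

# Hypothetical two-review corpus, used only for illustration
toy_corpus = ["the movie was slow", "the acting was great"]

# Same settings as the notebook: unigrams + bigrams, English stop words removed
toy_vectorizer = text.CountVectorizer(ngram_range=(1, 2),
                                      stop_words='english').fit(toy_corpus)

# Stop words are dropped before n-grams are built, so the remaining
# tokens of each sentence are paired into bigrams
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)

# TF-IDF weighting; TfidfTransformer L2-normalizes each row by default
toy_X = text.TfidfTransformer().fit_transform(
    toy_vectorizer.transform(toy_corpus))
row_norms = np.linalg.norm(toy_X.toarray(), axis=1)
print(row_norms)
```

Because every row ends up with unit length, long and short reviews become comparable, which is what the text length normalization step refers to.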

## LinearSVC applies L2 regularization by default, and its C parameter controls the regularization strength (a smaller C means stronger regularization), so the code searches for the best C value using the grid search approach.

In [12]:
from sklearn.svm import LinearSVC
# sklearn.grid_search was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1.0, 10.0, 100.0]}

clf = GridSearchCV(LinearSVC(loss='hinge',
                             random_state=101), param_grid)

clf = clf.fit(X, y)

print("Best parameters: %s" % clf.best_params_)

Best parameters: {'C': 1.0}
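
The grid search reports only the winning value, but the cross-validated score of every candidate is also available. The following is a minimal sketch of how to inspect them, assuming a synthetic dataset from `make_classification` in place of the review features; the names `X_demo`, `y_demo`, `search`, and `scores_by_C`, the explicit `cv=5`, and the raised `max_iter` are assumptions for illustration and do not appear in the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: not the IMDb reviews)
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     random_state=101)

param_grid = {'C': [0.01, 0.1, 1.0, 10.0, 100.0]}

# max_iter is raised only to reduce convergence warnings in this sketch
search = GridSearchCV(LinearSVC(loss='hinge', random_state=101,
                                max_iter=10000),
                      param_grid, cv=5)
search.fit(X_demo, y_demo)

# Mean cross-validated accuracy for each candidate C
scores_by_C = {params['C']: score
               for params, score in zip(search.cv_results_['params'],
                                        search.cv_results_['mean_test_score'])}
for C, score in scores_by_C.items():
    print("C=%-6s mean CV accuracy: %0.3f" % (C, score))

print("Best parameters: %s" % search.best_params_)
```

Looking at the whole table shows whether the best C sits at the edge of the grid, in which case widening the grid may be worthwhile.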


## Now that the code has determined the best hyperparameter for the problem, you can test performance on the test set using accuracy, the percentage of test reviews whose sentiment the classifier predicts correctly.

In [9]:
from sklearn.metrics import accuracy_score

solution = clf.predict(Xt)

print("Achieved accuracy: %0.3f" % accuracy_score(yt, solution))

Achieved accuracy: 0.816
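
As a quick sanity check on what this number means, accuracy is simply the fraction of matching labels. The sketch below computes it by hand on made-up label arrays (`y_true` and `y_pred` are hypothetical), then converts the reported score into a count, assuming the test set holds 250 reviews (25 percent of the 1,000 rows).

```python
import numpy as np

# Hypothetical labels, for illustration only
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

# Accuracy = share of positions where prediction equals the true label
accuracy = np.mean(y_true == y_pred)
print("Accuracy: %0.3f" % accuracy)

# Reported score translated into counts, assuming 250 test reviews
n_test = 250
n_correct = round(0.816 * n_test)
n_wrong = n_test - n_correct
print("Correct: %d, misclassified: %d" % (n_correct, n_wrong))
```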


## The results indicate an accuracy higher than 80 percent, but it is interesting to determine which phrases tricked the algorithm into making a wrong prediction. You can print the misclassified texts and consider what the learning algorithm is missing when it learns from text.

In [10]:
print(test_corpus[yt!=solution])

601    There is simply no excuse for something this p...
32     This is the kind of money that is wasted prope...
887    At any rate this film stinks, its not funny, a...
668    Speaking of the music, it is unbearably predic...
408         It really created a unique feeling though.  
413         The camera really likes her in this movie.  
138    I saw "Mirrormask" last night and it was an un...
132    This was a poor remake of "My Best Friends Wed...
291                               Rating: 1 out of 10.  
904    I'm so sorry but I really can't recommend it t...
410    A world better than 95% of the garbage in the ...
55     But I recommend waiting for their future effor...
826    The film deserves strong kudos for taking this...
100            I don't think you will be disappointed.  
352                                    It is shameful.  
171    This movie now joins Revenge of the Boogeyman ...
814    You share General Loewenhielm's exquisite joy ...
218    It's this pandering to t