# About Dataset

IMDB movie reviews dataset : http://ai.stanford.edu/~amaas/data/sentiment
<br>Contains 25000 positive and 25000 negative reviews
<br>Contains at most 30 reviews per movie
<br>At least 7 stars out of 10  →  positive (label = 1)
<br>At most 4 stars out of 10  →  negative (label = 0)

# 1. Importing Data

In [1]:
import pandas as pd
df = pd.read_csv('movie_data.csv')
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [3]:
df['review'][1]

"OK... so... I really like Kris Kristofferson and his usual easy going delivery of lines in his movies. Age has helped him with his soft spoken low energy style and he will steal a scene effortlessly. But, Disappearance is his misstep. Holy Moly, this was a bad movie! <br /><br />I must give kudos to the cinematography and and the actors, including Kris, for trying their darndest to make sense from this goofy, confusing story! None of it made sense and Kris probably didn't understand it either and he was just going through the motions hoping someone would come up to him and tell him what it was all about! <br /><br />I don't care that everyone on this movie was doing out of love for the project, or some such nonsense... I've seen low budget movies that had a plot for goodness sake! This had none, zilcho, nada, zippo, empty of reason... a complete waste of good talent, scenery and celluloid! <br /><br />I rented this piece of garbage for a buck, and I want my money back! I want my 2 hou

# 2. Data Preparation

Cleaning and pre-processing text is a vital step in data analysis, and especially in natural language processing tasks.
Using regular expressions, we strip HTML tags and punctuation from the reviews, convert the text to lowercase, and move any emoticons to the end of the string so that they are kept as tokens.


In [13]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re

In [8]:
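def preprocessor(text):
    # Remove HTML markup such as <br />.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons like :) ;-( =D before punctuation is stripped.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters to spaces, then append the emoticons without the "-" nose.
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text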
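
A quick check of the preprocessor on a made-up string may help show what it does. The function is repeated here, with raw-string patterns, only so that the snippet runs on its own; the sample text is an assumption and not taken from the dataset.

```python
import re

def preprocessor(text):
    # Remove HTML markup.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is stripped.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace runs of non-word characters with a space, append emoticons.
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text

sample = '</a>This :) is :( a test :-)!'  # hypothetical input
cleaned = preprocessor(sample)
print(cleaned)  # this is a test :) :( :)
```

The HTML tag and the exclamation mark are gone, the text is lowercased, and the three emoticons are appended at the end with the "-" removed.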

In [9]:
df['review'] = df['review'].apply(preprocessor)

In [10]:
from nltk.stem.porter import PorterStemmer
port = PorterStemmer()

In [11]:
def tokenizer_stemmed(text):
    return [port.stem(word) for word in text.split()]

In [12]:
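from nltk.corpus import stopwords
# Requires the corpus to be downloaded once: nltk.download('stopwords')
# Note: `stop` is not passed to the vectorizer below, so stop words are not removed.
stop = stopwords.words('english')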

# 3. Transform Text Data into TF-IDF Vectors

In information retrieval and text mining, some words occur in nearly every document of the corpus. Such words can hurt performance at training and test time because they usually carry little discriminative information. Term frequency-inverse document frequency (tf-idf) is a statistical technique that downweights this class of words in the feature vector representation. The tf-idf value of a term is the product of its term frequency and its inverse document frequency. We use scikit-learn’s TfidfVectorizer to convert the review text into vectors of tf-idf values and apply L2 normalization to each vector.
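
To make the weighting concrete, here is a minimal pure-Python sketch of tf-idf with L2 normalization. It assumes the smoothed idf formula that scikit-learn documents for `smooth_idf=True`, namely `ln((1 + n) / (1 + df)) + 1`, and whitespace tokenization. The function name `tfidf_l2` and the three toy documents are hypothetical; this is an illustration of the idea, not the library's implementation.

```python
import math
from collections import Counter

def tfidf_l2(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # Divide by the Euclidean length so every non-empty vector has norm 1.
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors

docs = ["the movie was great", "the movie was bad", "the plot was great fun"]
vectors = tfidf_l2(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the second toy document, "bad" appears in only one document and so receives a larger weight than "the", which appears in all three.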

In [14]:
tfidf = TfidfVectorizer(strip_accents=None,
                       lowercase=False,
                       preprocessor=None,
                       tokenizer=tokenizer_stemmed,
                       use_idf=True,
                       norm='l2',
                       smooth_idf=True)

In [15]:
y = df.sentiment.values
x = tfidf.fit_transform(df.review)

# 4. Document Classification using Logistic Regression

First, we split the data into training and test sets of equal size. Then we fit a logistic regression model with LogisticRegressionCV, which selects the regularization strength by 5-fold cross-validation on the training set, scored by accuracy. Finally, we save the fitted model using pickle.


In [27]:
from sklearn.model_selection import train_test_split

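# shuffle=False keeps the original row order: first half for training, second half for testing.
# random_state has no effect when shuffle=False.
X_train, X_test, y_train, y_test = train_test_split(x, y,
                                                    random_state=30,
                                                    test_size=0.5,
                                                    shuffle=False)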
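
With `shuffle=False`, the split above is a plain cut of the rows in their stored order. The following sketch mimics that behavior with list slicing; `split_in_order` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
import math

def split_in_order(items, test_size=0.5):
    """Split a sequence into (train, test) without shuffling."""
    n_test = math.ceil(len(items) * test_size)
    n_train = len(items) - n_test
    # The leading rows go to training, the trailing rows to testing.
    return items[:n_train], items[n_train:]

rows = list(range(10))  # stand-in for row indices
train_rows, test_rows = split_in_order(rows)
print(train_rows)  # [0, 1, 2, 3, 4]
print(test_rows)   # [5, 6, 7, 8, 9]
```

Because no shuffling takes place, the class balance of each half depends on how the rows are ordered in the CSV file.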

In [20]:
import pickle
from sklearn.linear_model import LogisticRegressionCV

clf = LogisticRegressionCV(cv = 5,
                        scoring = 'accuracy', 
                        random_state = 30,
                        n_jobs = -1,
                        verbose = 3,
                        max_iter = 300).fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  1.6min remaining:  2.5min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  2.2min finished


In [21]:
# Persist the fitted classifier; the context manager closes the file even if dump fails.
with open('sentiment_analysis_using_logistic_regression.sav', 'wb') as saved_model:
    pickle.dump(clf, saved_model)
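
A saved model can be read back with `pickle.load`. The sketch below shows the round trip using a stand-in dictionary and a temporary file so that it runs without a trained classifier; the helper names `save_model` and `load_model` and the file name `model.sav` are assumptions made for this example.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    # Write the object to disk in binary mode.
    with open(path, 'wb') as f:
        pickle.dump(model, f)

def load_model(path):
    # Read the object back; only unpickle files from a trusted source.
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in object; in the notebook this would be the fitted classifier.
stand_in = {"name": "placeholder", "coef": [0.1, -0.2]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.sav")
    save_model(stand_in, path)
    restored = load_model(path)

print(restored == stand_in)  # True
```

The same loading pattern should apply to the `.sav` file written above, provided a compatible scikit-learn version is installed when the file is loaded.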

# 5. Model Evaluation

We measure how well the model classifies the sentiment of reviews from the held-out test set. Accuracy is computed in two ways, with accuracy_score and with the classifier's own score method, and both give the same value.

In [28]:
from sklearn.metrics import accuracy_score
print("Accuracy is : ", accuracy_score(y_test, clf.predict(X_test)))

Accuracy is :  0.89592


In [29]:
print("Accuracy is : ", clf.score(X_test, y_test))

Accuracy is :  0.89592
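
Both calls above report classification accuracy, which is the fraction of test reviews whose predicted label matches the true label. A minimal sketch of that computation, using made-up labels and a hypothetical `accuracy` helper:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0]  # made-up labels
y_pred = [1, 0, 0, 1, 0]  # made-up predictions, one mistake
print(accuracy(y_true, y_pred))  # 0.8
```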
