<h2> Sentiment analysis of movie reviews using scikit-learn. The program predicts whether a movie review is positive or negative. </h2>

<h4> Importing necessary libraries </h4>

In [1]:
import nltk

In [2]:
from nltk.corpus import movie_reviews as mr

In [3]:
import pandas as pd

In [4]:
from nltk.corpus import stopwords #for stopwords

In [5]:
from nltk.stem import PorterStemmer

In [6]:
from nltk import word_tokenize

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
import numpy as np

In [9]:
from sklearn import preprocessing

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
from sklearn.svm import LinearSVC

<h4> Extract the sentiment tags and review texts from the NLTK movie_reviews corpus into a DataFrame, with one labelled row per review. </h4>

In [12]:
reviews = []
for ids in mr.fileids():
    tag, filename = ids.split('/')
    reviews.append((tag, mr.raw(ids)))

dataframe = pd.DataFrame(reviews, columns=['tag', 'text'])

In [13]:
dataframe.iloc[0:5]

Unnamed: 0,tag,text
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...
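
The loop above relies on each corpus file id having the form `tag/filename`. A minimal sketch of that parsing step, using made-up file ids and review text (`example_ids` and `example_texts` are hypothetical stand-ins for `mr.fileids()` and `mr.raw(...)`):

```python
# Hypothetical file ids in the "tag/filename" form that the loop assumes
example_ids = ["neg/review_a.txt", "pos/review_b.txt"]
# Hypothetical stand-in for mr.raw(file_id)
example_texts = {
    "neg/review_a.txt": "a dull and lifeless film .",
    "pos/review_b.txt": "a warm , funny and moving film .",
}

rows = []
for file_id in example_ids:
    # The part before the slash is the sentiment tag; the rest is the file name
    tag, filename = file_id.split("/")
    rows.append((tag, example_texts[file_id]))

print(rows[0])
```

Each tuple then becomes one DataFrame row with a `tag` column and a `text` column.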


<h4> Function to preprocess the data: lowercase, tokenize, remove stopwords, and stem. </h4>

In [14]:
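def preprossing(row):
    """Lowercase, tokenize, drop English stopwords, and stem one review."""
    # A set makes each membership test O(1) instead of scanning a list
    stopword = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    # Tokenize the lowercased text into word and punctuation tokens
    words = word_tokenize(row.lower())

    # Keep only non-stopword tokens, reduced to their Porter stems
    sentence = [stemmer.stem(token) for token in words if token not in stopword]

    # Re-join into one string, the input format CountVectorizer expects
    return " ".join(sentence)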
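
As a rough illustration of what the function does to a review, the sketch below runs the same lowercase, filter, stem, and join steps with deliberately simple stand-ins: whitespace splitting instead of `word_tokenize`, a tiny hand-picked stopword set instead of the NLTK list, and a naive suffix rule instead of `PorterStemmer`. All three stand-ins (`TOY_STOPWORDS`, `toy_stem`, `toy_preprocess`) are assumptions made for this example and will not reproduce NLTK's output exactly.

```python
# Hypothetical, tiny stopword set; the notebook uses NLTK's English list
TOY_STOPWORDS = {"the", "was", "a", "and", "is"}

def toy_stem(token):
    """Naive stand-in for a stemmer: strip a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def toy_preprocess(text):
    """Lowercase, split on whitespace, drop stopwords, stem, and re-join."""
    tokens = text.lower().split()
    kept = [toy_stem(tok) for tok in tokens if tok not in TOY_STOPWORDS]
    return " ".join(kept)

print(toy_preprocess("The acting was amazing and the jokes land"))
```

The output is a single space-separated string of stems, which is the shape of input the vectorizer later consumes.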

<h4> Apply preprocessing to the DataFrame. </h4>

In [15]:
dataframe["text"] = dataframe["text"].apply(preprossing)

In [16]:
sentiment = dataframe["tag"]

In [17]:
# Despite the name, these texts have already been preprocessed above
raw_reviews = dataframe["text"]

<h4> Split the data into training and test sets. </h4>

In [18]:
review_train, review_test, labels_train, labels_test = train_test_split(
    raw_reviews, sentiment, test_size=0.25, random_state=0)

In [19]:
review_train.shape

(1500,)

In [20]:
review_test.shape

(500,)

In [21]:
labels_train.shape

(1500,)

In [22]:
labels_test.shape

(500,)
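
The shapes above follow from `test_size = 0.25` applied to the 2,000 reviews in the corpus. The following is a simplified sketch of a shuffled hold-out split, not scikit-learn's actual implementation; the helper name `holdout_split` is hypothetical, and its shuffle order will differ from `train_test_split` even with the same seed.

```python
import math
import random

def holdout_split(items, test_size=0.25, seed=0):
    """Shuffle item indices and hold out a fraction of them for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

# 2,000 placeholder reviews, matching the corpus size used in the notebook
train, test = holdout_split([f"review {i}" for i in range(2000)])
print(len(train), len(test))  # 1500 500
```

Fixing the seed makes the split reproducible, which is the role `random_state = 0` plays in the notebook.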

<h4> Encode the string labels as integers with LabelEncoder.</h4>

In [23]:
label_processing = preprocessing.LabelEncoder()

In [24]:
label_train_enc = label_processing.fit_transform(labels_train)
label_test_enc = label_processing.transform(labels_test)

In [25]:
label_train_enc.shape

(1500,)

In [26]:
label_test_enc.shape

(500,)
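
`LabelEncoder` assigns integer codes to the sorted unique labels, so with the tags `neg` and `pos` the mapping is `neg` to 0 and `pos` to 1. A small pure-Python sketch of that behaviour (the helper names `fit_classes` and `encode` are hypothetical, not part of scikit-learn):

```python
def fit_classes(labels):
    """Return the sorted unique labels; a label's position is its integer code."""
    return sorted(set(labels))

def encode(labels, classes):
    """Map each label to its index in the fitted class list."""
    index = {label: code for code, label in enumerate(classes)}
    return [index[label] for label in labels]

classes = fit_classes(["pos", "neg", "neg", "pos"])
codes = encode(["pos", "neg", "neg"], classes)
print(classes, codes)  # ['neg', 'pos'] [1, 0, 0]
```

Fitting on the training labels and only transforming the test labels, as the notebook does, keeps the mapping identical for both sets.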

<h4> Generate bag-of-words count vectors for review_train and review_test. </h4>

In [27]:
vectorizer = CountVectorizer()

In [28]:
vectorizer.fit(review_train)

CountVectorizer()

In [29]:
f_matrix_train = vectorizer.transform(review_train)
f_matrix_test = vectorizer.transform(review_test)

In [30]:
f_matrix_train.shape

(1500, 24362)

In [31]:
f_matrix_test.shape

(500, 24362)
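
Each row of these matrices counts how often each vocabulary term occurs in one review. The vocabulary is learned from the training reviews only, which is why both matrices have 24,362 columns. The sketch below imitates that behaviour in plain Python with a simple whitespace tokenizer; `build_vocabulary` and `count_vector` are hypothetical helpers, and `CountVectorizer` itself applies its own tokenization and returns a sparse matrix.

```python
from collections import Counter

def build_vocabulary(train_docs):
    """Map each distinct training token to a column index, in sorted order."""
    tokens = sorted({tok for doc in train_docs for tok in doc.split()})
    return {tok: col for col, tok in enumerate(tokens)}

def count_vector(doc, vocabulary):
    """Count vocabulary tokens in one document; unseen tokens are ignored."""
    counts = Counter(tok for tok in doc.split() if tok in vocabulary)
    return [counts[tok] for tok in vocabulary]

vocab = build_vocabulary(["good good film", "bad film"])
print(vocab)
print(count_vector("good film film unseen", vocab))
```

Because the token `unseen` never appeared in the training documents, it has no column and is dropped, just as words absent from `review_train` are dropped from `f_matrix_test`.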

<h4> Initialize the linear support vector classifier and fit it to the training data. </h4>

In [32]:
svc_model = LinearSVC(C=2, max_iter=500)
svc_model.fit(f_matrix_train, label_train_enc)

LinearSVC(C=2, max_iter=500)
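
Once fitted, a linear SVM classifies a review by taking a weighted sum of its term counts plus an intercept and checking the sign. The sketch below shows that decision rule with made-up numbers; `toy_weights`, `toy_intercept`, and `predict_linear` are hypothetical and are not values or functions taken from the trained `svc_model`.

```python
def predict_linear(features, weights, intercept):
    """Return 1 if the linear score is positive, otherwise 0."""
    score = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1 if score > 0 else 0

# Hypothetical weights for a three-term vocabulary: bad, film, good
toy_weights = [-1.5, 0.0, 1.2]
toy_intercept = -0.1

print(predict_linear([0, 2, 1], toy_weights, toy_intercept))  # one "good"
print(predict_linear([2, 1, 0], toy_weights, toy_intercept))  # two "bad"
```

Terms with positive weights push a review towards class 1 (`pos`) and terms with negative weights push it towards class 0 (`neg`).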

<h4> Read a review from a file and predict its sentiment with the trained model. </h4>

In [33]:
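# Read the review to classify; the context manager closes the file afterwards
with open('Test_file.txt', 'r') as test_file:
    review_text = test_file.read()

# Apply the same preprocessing and fitted vocabulary used for the training data
file_preprocess = [preprossing(review_text)]
file_transform = vectorizer.transform(file_preprocess)

predictions = svc_model.predict(file_transform)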

In [34]:
#print("Accuracy on Train: " + str(svc_model.score(f_matrix_train,label_train_enc)))
#print("Accuracy on Test: " + str(svc_model.score(f_matrix_test,label_test_enc)))
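
The commented-out calls use `score`, which for a classifier reports mean accuracy: the fraction of predictions that match the true labels. A short sketch of that calculation on made-up labels (the function name `accuracy` and the sample values are hypothetical):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

Comparing the training and test values is a common way to check whether the model generalizes beyond the reviews it was fitted on.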

<h4> Display prediction </h4>

In [35]:
# LabelEncoder sorts classes alphabetically, so 0 is "neg" and 1 is "pos"
if predictions[0] == 0:
    print("Negative")
else:
    print("Positive")

Negative
