<h1>Movie Review Sentiment Analysis</h1>
<p>In this kernel I walk through a basic NLP workflow and the steps involved in text preprocessing.</p>
<p>So let's get started...</p>

<h3>We start by importing our favourite data science libraries, along with the libraries required for NLP</h3>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

%matplotlib inline

<p>Setting global parameters for Seaborn and Matplotlib</p>

In [None]:
plt.rcParams["figure.figsize"] = (16,9)
sns.set_style('whitegrid')

<h3>Now let's import our data...</h3>

In [None]:
train_df = pd.read_csv('../input/train.tsv',delimiter='\t')
test_df = pd.read_csv('../input/test.tsv',delimiter='\t')

<h3>Now, let's analyse how our data looks...</h3>

In [None]:
train_df.head()

In [None]:
train_df.shape

In [None]:
train_df.isnull().sum()

<p>So we don't have any null values in our dataset.</p>
<p>Now let's see the number of reviews in each category...</p>

In [None]:
sns.countplot(x='Sentiment',data=train_df)

<h4>Takeaway: Neutral reviews are the most common in the dataset.</h4>

<p>This step is not required, but knowing which label each numeric sentiment code stands for helps in understanding the dataset better.</p>

In [None]:
#label mapping
labels = ["Negative","Somewhat negative","Neutral","Somewhat positive","Positive"]
sentiment_code = [0,1,2,3,4]

labels_df = pd.DataFrame({"Label":labels,"Code":sentiment_code})

In [None]:
labels_df
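
<p>The same mapping can also be attached to the data itself. The sketch below is illustrative only: <code>code_to_label</code> and the toy series are hypothetical names that are not part of the original kernel, and the toy series stands in for <code>train_df['Sentiment']</code>.</p>

```python
import pandas as pd

# Hypothetical lookup built from the label table above (code -> label).
code_to_label = {
    0: "Negative",
    1: "Somewhat negative",
    2: "Neutral",
    3: "Somewhat positive",
    4: "Positive",
}

# Toy stand-in for train_df['Sentiment'].
toy_sentiment = pd.Series([2, 4, 0, 3, 2])

# Series.map replaces each numeric code with its readable label.
toy_labels = toy_sentiment.map(code_to_label)
print(toy_labels.value_counts())
```

<p>On the real data, the same <code>map</code> call on <code>train_df['Sentiment']</code> would give readable labels for plots and tables.</p>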

<p>Let's get the length of each review, in characters...</p>

In [None]:
train_df['Phrase Length'] = train_df['Phrase'].apply(len)

<p>Let's analyse the distribution of review length in the dataset...</p>

In [None]:
# Note: distplot is deprecated since seaborn 0.11; sns.histplot is its replacement for histograms.
sns.distplot(train_df['Phrase Length'],bins=80,kde=False,hist_kws={"edgecolor":"blue"})

<h4>Takeaway: Most reviews in the dataset are between 0 and 50 characters long.</h4>

In [None]:
train_df.hist(column='Phrase Length',by='Sentiment',bins=80,edgecolor='black')

<h4>Takeaway: Reviews in categories 1 and 3 are much longer than reviews in the other categories.</h4>
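
<p>Per-category histograms can be hard to compare by eye, so a numeric summary may help to check a takeaway like this one. The sketch below uses a made-up toy frame (<code>toy_df</code> is a hypothetical stand-in for <code>train_df</code>); the numbers it prints say nothing about the real dataset.</p>

```python
import pandas as pd

# Toy stand-in for train_df; the phrases are invented for illustration.
toy_df = pd.DataFrame({
    "Phrase": [
        "bad",
        "a fairly dull and slow film",
        "fine",
        "quite an enjoyable ride overall",
        "superb",
    ],
    "Sentiment": [0, 1, 2, 3, 4],
})
toy_df["Phrase Length"] = toy_df["Phrase"].apply(len)

# Count, mean and median phrase length for each sentiment code.
length_summary = (
    toy_df.groupby("Sentiment")["Phrase Length"].agg(["count", "mean", "median"])
)
print(length_summary)
```

<p>Running the same <code>groupby</code> on <code>train_df</code> would give one row per sentiment code to compare directly.</p>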

<p>Initializing the stemmer object</p>

In [None]:
ps = PorterStemmer()
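
<p>The stemmer is created here, but the stemming line in the preprocessing function is commented out, so it is never applied in this kernel. As a rough illustration of what it would do, the sketch below stems a few sample tokens; <code>ps_demo</code> and <code>sample_tokens</code> are hypothetical names used only for this example.</p>

```python
from nltk.stem import PorterStemmer

# Separate instance so the example does not depend on the kernel's `ps`.
ps_demo = PorterStemmer()

# Inflected forms of the same word are reduced to a shared stem.
sample_tokens = ["running", "runs", "movies"]
stemmed_tokens = [ps_demo.stem(token) for token in sample_tokens]
print(stemmed_tokens)
```

<p>Stems are not always dictionary words, which is one reason a kernel might leave stemming switched off.</p>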

<p>Function for text preprocessing, which involves removing all punctuation marks and stopwords.</p>

In [None]:
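# Build the stopword set once; calling stopwords.words('english') for every token is very slow.
STOPWORDS = set(stopwords.words('english'))

def text_processing(comment):
    # Drop punctuation characters, then rebuild the string.
    nopunc = ''.join(char for char in comment if char not in string.punctuation)
    # Split on whitespace and drop English stopwords (case-insensitive match).
    clean_text = [text for text in nopunc.split() if text.lower() not in STOPWORDS]
    # Stemming is left disabled:
    #final_text = [ps.stem(text) for text in clean_text]
    return clean_text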

<h4>Text Preprocessing Part - 1</h4>
<p>Applying the text_processing function to the first five phrases of the dataset.</p>

In [None]:
train_df['Phrase'].head(5).apply(text_processing)

<h4>Text Preprocessing Part - 2</h4>
<p>Applying bag-of-words vectorization and tf-idf (term frequency-inverse document frequency) to the dataset.</p>

In [None]:
bow_transformer = CountVectorizer(analyzer=text_processing).fit(train_df['Phrase'])
phrases_bow = bow_transformer.transform(train_df['Phrase'])
tfidf_transformer = TfidfTransformer().fit(phrases_bow)
phrases_tfidf = tfidf_transformer.transform(phrases_bow)
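
<p>To make the two steps above more concrete, here is a minimal sketch on a three-phrase toy corpus. The corpus and the <code>toy_*</code> names are assumptions made for illustration, and the default scikit-learn tokenizer is used instead of <code>text_processing</code> so that the example needs no NLTK data.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Invented toy corpus, standing in for train_df['Phrase'].
toy_phrases = ["good movie", "bad movie", "good good plot"]

# Step 1: bag of words. Each column is a vocabulary token, each cell a count.
toy_counts = CountVectorizer().fit(toy_phrases)
toy_bow = toy_counts.transform(toy_phrases)

# Step 2: tf-idf. Counts are reweighted so that tokens appearing in fewer
# phrases get a larger weight, and each row is scaled to unit length.
toy_tfidf = TfidfTransformer().fit_transform(toy_bow)

vocab = toy_counts.vocabulary_  # token -> column index
print(sorted(vocab))
print(toy_bow.toarray())
print(toy_tfidf.toarray().round(2))
```

<p>In the phrase "bad movie", the token "bad" occurs in one phrase while "movie" occurs in two, so "bad" ends up with the larger tf-idf weight.</p>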

<p>Splitting the labelled data into train and hold-out test sets...</p>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(phrases_tfidf, train_df['Sentiment'], test_size=0.3, random_state=42)

<h3>Now, applying the Multinomial Naive Bayes algorithm to the dataset.</h3>

In [None]:
sentiment_detect_model = MultinomialNB().fit(X_train, y_train)

In [None]:
predictions = sentiment_detect_model.predict(X_test)

<h3>Now let's analyse the performance of our model.</h3>

In [None]:
print(classification_report(y_test, predictions))
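
<p>One caveat about the evaluation above: the vocabulary and IDF weights were fitted on all of <code>train_df</code> before the split, so the hold-out rows influenced the features. A possible alternative, sketched here on invented toy data, is to split the raw phrases first and wrap the steps in a scikit-learn <code>Pipeline</code>, so that everything is fitted on the training split only. The toy phrases and the name <code>sentiment_pipeline</code> are assumptions and not part of the original kernel.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data, standing in for train_df['Phrase'] and train_df['Sentiment'].
toy_phrases = [
    "a great film", "great acting", "truly great fun", "a fun film",
    "a dull film", "dull acting", "truly dull plot", "a boring plot",
]
toy_sentiment = [4, 4, 4, 4, 0, 0, 0, 0]

# Split the raw text first, so the hold-out phrases stay unseen.
X_tr, X_val, y_tr, y_val = train_test_split(
    toy_phrases, toy_sentiment,
    test_size=0.25, random_state=42, stratify=toy_sentiment,
)

# fit() learns vocabulary, IDF weights and the classifier from X_tr only;
# predict() reuses those fitted steps on the hold-out phrases.
sentiment_pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])
sentiment_pipeline.fit(X_tr, y_tr)
val_predictions = sentiment_pipeline.predict(X_val)
print(val_predictions)
```

<p>With the real data, passing <code>analyzer=text_processing</code> to <code>CountVectorizer</code> would keep the same preprocessing as the rest of the kernel.</p>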

<h2>Let's apply our model to the test set of the competition</h2>

In [None]:
test_df.head()

In [None]:
# Reuse the vectorizer and tf-idf transformer fitted on the training data,
# so the test features share the training vocabulary and IDF weights.
test_bow = bow_transformer.transform(test_df['Phrase'])
test_tfidf = tfidf_transformer.transform(test_bow)

In [None]:
test_predictions = sentiment_detect_model.predict(test_tfidf)

In [None]:
test_df['Sentiment'] = test_predictions

In [None]:
submission_df = test_df[['PhraseId','Sentiment']]

In [None]:
submission_df.to_csv('submission.csv',index=False)