
## Practice: Sentiment Analysis Classification

In this notebook we solve a classification problem: labeling movie reviews as negative or positive based on their sentiment. The problem is presented in its simplest form. Unlike more sophisticated sentiment analysis, the classification here relies only on the presence or absence of specific words. We will use scikit-learn's data loading functionality to build the training and testing data.

The notebook is partially complete. Look for "Your code here" to complete the partial code.

-----

**Activity 1:** Load the data from `'../../datasets/movie_reviews'` into the `mvr` variable.

In [16]:
from sklearn.datasets import load_files

data_dir = '../../datasets/movie_reviews'
# <Your code here to load the movie reviews data in above path>
mvr = load_files(data_dir)

#help(mvr)

# <Your code here to print the number of reviews>
print('Number of Reviews: {0}'.format(len(mvr['data'])))

Number of Reviews: 2000
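
As a rough illustration of what `load_files` expects, the sketch below builds a tiny throwaway corpus with one sub-folder per class and loads it. The folder and file names are made up for the example; the assumption is that the real `movie_reviews` directory follows the same `neg`/`pos` layout, which is what the `target_names` in the later classification report suggest.

```python
import os
import tempfile
from sklearn.datasets import load_files

# Hypothetical miniature corpus: one sub-folder per class, one file per review
with tempfile.TemporaryDirectory() as root:
    for label, text in [('neg', 'a dull, lifeless film'),
                        ('pos', 'a warm and funny film')]:
        os.makedirs(os.path.join(root, label))
        path = os.path.join(root, label, 'review_0.txt')
        with open(path, 'w', encoding='utf-8') as fh:
            fh.write(text)

    # File contents are read eagerly, so the data outlives the temp folder
    toy = load_files(root)

print(toy.target_names)    # class names taken from the sub-folder names
print(len(toy.data))       # one entry per file
print(type(toy.data[0]))   # raw bytes unless an encoding is passed to load_files
```

Because the documents come back as bytes by default, the vectorizers used later decode them (UTF-8 by default) before tokenizing.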


**Activity 2:** Split the data in `mvr.data` into train (`mvr_train`) and test (`mvr_test`) datasets.

In [27]:
from sklearn.model_selection import train_test_split

mvr_train, mvr_test, y_train, y_test = train_test_split(
    mvr.data, 
    mvr.target, 
    test_size=0.25, random_state=23)
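
To make the effect of `test_size=0.25` concrete, here is a small sketch on stand-in data of the same size as the corpus. The `docs` and `labels` lists are placeholders invented for the example, not the real reviews.

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 2000 placeholder documents with alternating labels
docs = ['review {}'.format(i) for i in range(2000)]
labels = [i % 2 for i in range(2000)]

d_train, d_test, l_train, l_test = train_test_split(
    docs, labels, test_size=0.25, random_state=23)

# A quarter of the documents is held out for testing
print(len(d_train), len(d_test))
```

The split is random rather than stratified unless `stratify` is passed, so the two classes are not guaranteed to be equally represented in the test set.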

-----

Now that the training and testing data have been loaded into the notebook, we can build a simple pipeline by using a `CountVectorizer` and `MultinomialNB` to build a document-term matrix and to perform a Naive Bayes classification.
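
For intuition about what a document-term matrix is, the following plain-Python sketch counts words in three made-up sentences. It mimics the default `CountVectorizer` behavior (lowercasing, tokens of two or more word characters) but is only an approximation for illustration, not the library's implementation.

```python
import re
from collections import Counter

# Made-up documents for illustration
docs = ['a great great film', 'not a great film', 'a boring film']

# Same default token pattern as CountVectorizer: words of 2+ characters
token_pattern = re.compile(r'(?u)\b\w\w+\b')
tokenized = [token_pattern.findall(d.lower()) for d in docs]

# Columns of the matrix: every distinct term, in alphabetical order
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# Rows of the matrix: per-document counts for each vocabulary term
dtm = [[Counter(toks)[term] for term in vocabulary] for toks in tokenized]

print(vocabulary)
for row in dtm:
    print(row)
```

Each row is one document and each column one term; a Naive Bayes classifier then learns from these counts.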

-----

**Activity 3:** Build a pipeline by using a `CountVectorizer` and `MultinomialNB` to build a document-term matrix and to perform a Naive Bayes classification. Print the metrics of the classification results.

In [29]:
# Build simple pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

pipeline = Pipeline([('cv', CountVectorizer()), ('nb', MultinomialNB())])

# Build DTM and classify data
pipeline.fit(mvr_train, y_train)

# Predict the reviews on mvr_test data
y_pred = pipeline.predict(mvr_test)

# Print the prediction results
print(metrics.classification_report(y_test, y_pred, target_names = mvr.target_names))

              precision    recall  f1-score   support

         neg       0.78      0.81      0.80       235
         pos       0.83      0.80      0.81       265

   micro avg       0.81      0.81      0.81       500
   macro avg       0.81      0.81      0.81       500
weighted avg       0.81      0.81      0.81       500



**Activity 4:** Use stop words in the `CountVectorizer` above. Build the document-term matrix and perform a Naive Bayes classification again. Print the metrics of the new classification results.

In [24]:
pipeline = Pipeline([
    ('cv', CountVectorizer()),
    ('nb', MultinomialNB()),
])

# The 'cv__' prefix must match the name of the CountVectorizer step above
pipeline.set_params(cv__stop_words='english',
                    cv__ngram_range=(1, 2),
                    cv__lowercase=True)

# Build DTM and classify data
pipeline.fit(mvr_train, y_train)
y_pred = pipeline.predict(mvr_test)
print(metrics.classification_report(y_test, y_pred, target_names = mvr.target_names))
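
The prefix used in `set_params` has to match the name given to the step when the pipeline is built. The sketch below, which uses a throwaway pipeline named `demo`, lists the valid keys for a step called `cv` and shows that a prefix matching no step is rejected.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

demo = Pipeline([('cv', CountVectorizer()), ('nb', MultinomialNB())])

# Valid keys have the form '<step name>__<parameter>'
cv_keys = sorted(k for k in demo.get_params() if k.startswith('cv__'))
print(cv_keys)

# These keys resolve because the step is named 'cv'
demo.set_params(cv__stop_words='english', cv__ngram_range=(1, 2))

# A prefix that matches no step name raises a ValueError
try:
    demo.set_params(vect__stop_words='english')
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```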

**Activity 5:** Change the vectorizer to TF-IDF. Perform a Naive Bayes classification again. Print the metrics of the new classification results.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Name the TfidfVectorizer step 'tf' so that the set_params call below resolves
pipeline = Pipeline(<Your code here to build pipeline for TF-IDF and MultinomialNB>)

pipeline.set_params(tf__stop_words = 'english')


# Build DTM and classify data
pipeline.fit(mvr_train, y_train)
y_pred = pipeline.predict(mvr_test)
print(metrics.classification_report(y_test, y_pred, target_names = mvr.target_names))
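
To show what TF-IDF changes compared with raw counts, here is a plain-Python sketch of the weighting on a made-up, pre-tokenized corpus. It assumes the default `TfidfVectorizer` settings (smoothed IDF, L2-normalized rows) and is meant as an illustration of the formula, not as a replacement for the library.

```python
import math
from collections import Counter

# Made-up, already tokenized documents
docs = [['great', 'film'], ['boring', 'film'], ['great', 'film', 'cast']]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Weight each term by count * smoothed IDF, then L2-normalize."""
    counts = Counter(doc)
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
           for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[1])
print(weights)
```

In the second document, `boring` occurs in only one document while `film` occurs in all three, so `boring` receives the larger weight even though both appear once.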

**Activity 6:** Change the TF-IDF parameters, such as `max_features` and `lowercase`. Perform a Naive Bayes classification again. Print the metrics of the new classification results.

Note: See the [TfidfVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) for the valid values of `max_features` and `lowercase`. Experiment with these parameters to see how the results change.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tools = [('tf', TfidfVectorizer()), ('nb', MultinomialNB())]
clf = Pipeline(tools)
clf.<Your code to use max_features and lowercase with TfidfVectorizer>


# Build DTM and classify data
clf.fit(mvr_train, y_train)
y_pred = clf.predict(mvr_test)
print(metrics.classification_report(y_test, y_pred, target_names = mvr.target_names))
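
As a rough picture of what these two parameters control, the sketch below builds a vocabulary by hand from three made-up sentences. The `vocabulary` helper is hypothetical and splits on whitespace for simplicity, so it only approximates how the vectorizer selects terms.

```python
from collections import Counter

# Made-up documents for illustration
docs = ['Great film great cast', 'Great plot', 'great film']

def vocabulary(docs, lowercase=True, max_features=None):
    """Return the kept terms: the most frequent ones across the corpus."""
    counts = Counter()
    for d in docs:
        counts.update((d.lower() if lowercase else d).split())
    # most_common(None) keeps every term
    return sorted(term for term, _ in counts.most_common(max_features))

print(vocabulary(docs))                    # case-folded, all terms
print(vocabulary(docs, lowercase=False))   # 'Great' and 'great' stay separate
print(vocabulary(docs, max_features=2))    # only the two most frequent terms
```

Turning off lowercasing enlarges the vocabulary, while a small `max_features` shrinks it to the most frequent terms; both change which columns the classifier gets to see.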

**Activity 7:** Change the classifier to logistic regression. Print the metrics of the results.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


tools = <Your code here to build pipeline for TF-IDF and LogisticRegression>
clf = Pipeline(tools)
clf.set_params(tf__stop_words = 'english')


# Build DTM and classify data
clf.fit(mvr_train, y_train)
y_pred = clf.predict(mvr_test)
print(metrics.classification_report(y_test, y_pred, target_names = mvr.target_names))
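
For intuition about how logistic regression scores a review, the sketch below passes a weighted sum of term features through the sigmoid function. The weights, bias, and feature values are invented for the example and are not learned from the movie review data.

```python
import math

# Hypothetical weights for a few TF-IDF features (not fitted values)
weights = {'great': 1.8, 'boring': -2.1, 'film': 0.1}
bias = 0.0

def prob_positive(features):
    """Sigmoid of the weighted feature sum: estimated P(review is positive)."""
    score = bias + sum(weights.get(t, 0.0) * v for t, v in features.items())
    return 1.0 / (1.0 + math.exp(-score))

print(prob_positive({'great': 0.7, 'film': 0.7}))
print(prob_positive({'boring': 0.7, 'film': 0.7}))
```

A probability above 0.5 is read as a positive review and one below 0.5 as negative; fitting the model amounts to finding the weights that best separate the training reviews.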