## Goal:
My goal is to take in a dataset containing reviews and create a sentiment analysis model to determine whether those reviews are negative or positive. I will then analyse the model to see how I could improve it.

## Steps:
This is the method I will be following for this project:
1. Load in the data set
2. Split the data into training and testing
3. Create model
4. Train and test the model
5. Analyse the results 
6. Make improvements
7. Real-world Applications

In [1]:
import nltk
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

### 1.0 Importing the data
The dataset I will be using in this project is from The University of Michigan who asked their students to review a list of films. I have two datasets here, the training and the testing data. The training data contains a binary feature that denotes if the review was positive or negative. The testing data does not have this feature.

I will be importing the csv files as data frames, making sure I separate out the different features by tab.

In [2]:
train = pd.read_csv('trainingdata.csv',header=None,sep='\t',names=['target','text'])
test = pd.read_csv('testdata.csv',sep='\t',header=None,names=['text'])

### 2.0 Splitting the data
I will be splitting the 'training' dataset into training and testing data.

In [3]:
x = train.drop('target',axis=1)
y = train['target']

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.33)

### 3.0 Create the model
To create this model I will be using the pipeline feature from sklearn. Within this pipeline I will include a CountVectorizer to clean my input text (getting rid of stopwords and punctuation) and convert it to vector format, TfidfTransformer to add weights to important words, and a MultinomialNB to classify the data.

### 3.1 Cleaning the text
To improve the accuracy of the model I will first create a function that will remove all the stopwords and punctuation from the reviews. Stopwords are words such as: the, when, if, that etc. these words don't tell us anything about the text so they will be removed.

In [4]:
from nltk.corpus import stopwords
from string import punctuation

In [5]:
def clean_text(string):
    
    nopunc = [char for char in string if char not in punctuation]
    nopunc = ''.join(nopunc)
    
    return [word.lower() for word in nopunc if word not in stopwords.words('english')]

This function will take in a string, remove the punctuation, stopwords and then return the cleaned string.

### 3.2 Creating the pipeline
Now it's time to create my pipeline. Within this pipeline I will include 3 elements:
1. CountVectorizer.
This will convert the strings into a matrix of tokens, whilst also using the function I defined in 3.1 to clean the string.
2. TfidfTransformer.
This will transform the matrix of tokens into a normalized tf-idf representation. Effectively adding weights to words if they are more important in the corpus.
3. MultinomialNB.
This is my classifier for my model. 

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

In [7]:
pipeline = Pipeline([
    ('vect',CountVectorizer(analyzer=clean_text)),
    ('tfidf',TfidfTransformer()),
    ('classifier',MultinomialNB()),
])

### 4.0 Train and test my model
Now that I created my pipeline, it is time to train and test the model.

In [8]:
pipeline.fit(x_train.values,y_train.values)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=<function clean_text at 0x102ac7ea0>, binary=False,
        decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None,
...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [9]:
predictions_one = pipeline.predict(x_test.values)

### 5.0 Analyse the results
Let's see how our model has performed.

In [30]:
from sklearn.metrics import classification_report,confusion_matrix

print(confusion_matrix(y_test.values,predictions_one))
print('\n')
print(classification_report(y_test.values,predictions_one))
print('\n')
print("Accuracy: {:.4%}".format(np.mean(predictions_one == y_test.values)))

[[ 544  467]
 [  56 1216]]


             precision    recall  f1-score   support

          0       0.91      0.54      0.68      1011
          1       0.72      0.96      0.82      1272

avg / total       0.80      0.77      0.76      2283



Accuracy: 77.0915%


For a first pass at this model, it's a good set of results. It would appear that our model struggled with precision of positive reviews and the recall of negative reviews. Let's see if we can improve it.

## 6.0 Improving the model
When using machine learning it is important to always asses and improve the model. To find areas of improvement I will try the following:
1. Use GridSearchCV to find optimal parameters
2. Use a different classifier in the pipeline

### 6.1 GridSearchCV with original model
To see if I can increase the accuracy of my original model pipeline I will use GridSearchCV to find the optimal parameters I pass through. 

The parameters that I will be tweaking in this model are the following:

In [13]:
from sklearn.model_selection import GridSearchCV
parameters = {
    'vect__ngram_range': [(1,1),(1,2)],
    'tfidf__use_idf': (True, False),
    'classifier__alpha': (1,1e-1,1e-2,1e-3),
}

In [14]:
gs_clf = GridSearchCV(pipeline, parameters, verbose=1)
gs_clf = gs_clf.fit(x_train.values,y_train.values)

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed: 64.8min finished


In [55]:
# The best parameters for my pipeline are the following

gs_clf.best_params_

{'classifier__alpha': 0.1, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}

In [20]:
predictions_two = gs_clf.predict(x_test.values)

In [29]:
print(confusion_matrix(y_test.values,predictions_two))
print('\n')
print(classification_report(y_test.values,predictions_two))
print('\n')
print("Accuracy: {:.4%}".format(np.mean(predictions_two == y_test.values)))

[[ 545  466]
 [  58 1214]]


             precision    recall  f1-score   support

          0       0.90      0.54      0.68      1011
          1       0.72      0.95      0.82      1272

avg / total       0.80      0.77      0.76      2283



Accuracy: 77.0477%


The only parameters that it suggested I change was the classifier alpha from 1.0 to 0.1. The accuracy was not increased.

### 6.2 Using BernoulliNB

Next I will try to use a different classifier, BernoulliNB. From my research I believe that this should improve my results due to it working well on binary features, such as the word vectors in this project.

In [36]:
from sklearn.naive_bayes import BernoulliNB

In [56]:
pipeline = Pipeline([
    ('vect',CountVectorizer(analyzer=clean_text)),
    ('tfidf',TfidfTransformer()),
    ('classifier',BernoulliNB()),
])

In [57]:
pipeline.fit(x_train.values,y_train.values)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=<function clean_text at 0x102ac7ea0>, binary=False,
        decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None,
..._idf=True)), ('classifier', BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))])

Now to see if my change of classifier will improve our accuracy results.

In [61]:
predictions_three = pipeline.predict(x_test.values)

In [62]:
print(confusion_matrix(y_test.values,predictions_three))
print('\n')
print(classification_report(y_test.values,predictions_three))
print('\n')
print("Accuracy: {:.4%}".format(np.mean(predictions_three == y_test.values)))

[[885 126]
 [285 987]]


             precision    recall  f1-score   support

          0       0.76      0.88      0.81      1011
          1       0.89      0.78      0.83      1272

avg / total       0.83      0.82      0.82      2283



Accuracy: 81.9974%


We can see that our accuracy was increased by a few percentage points. What is also interesting about these results is that our model was much more balanced in terms of predicting and recalling both positive and negative reviews.

### 6.3 Final accuracy results
Now we have tried our three methods here, the results for accuracy come out as the following:

In [53]:
acc_results = {'MultinomialNB': (np.mean(predictions_one == y_test.values)), 'Multinomial NB after GridSearchCV': (np.mean(predictions_two == y_test.values)), 'BernoulliNB': (np.mean(predictions_three == y_test.values))}
acc_results

{'BernoulliNB': 0.8199737187910644,
 'Multinomial NB after GridSearchCV': 0.77047744196233026,
 'MultinomialNB': 0.77091546211125717}

BernoulliNB is clearly the preferred choice.

### 6.3.1 Where I could go next
There are other areas that I could explore next if I were to continue. I could tweak the parameters even more on both the BernoulliNB and MultinomialNB models, or possibly try other machine learning models away from the Naive Bays package. I could also analyse the predictions to see if there was a trend in the reviews that were misclassified.

## 7.0 Real-world applications
This has been an interesting model to work on, but what are the practical applications of a model like this? Here are a couple of areas where I believe we could apply it:

1. This model doesn't have to be limited to movie reviews. This could be inbuilt into a chatbot system to determine the sentiment of a conversation between a customer and a customer service representative. 
2. Internal business communication could be anonymously fed into the model to track the overall mood within the business. If there is a long trend of negative sentiment within the business, this could then be addressed to avoid long term issues.