# Goal:
My goal is to load a dataset containing reviews and create a sentiment analyser model to determine whether those reviews are negative or positive. I will then analyse the model to see how I can improve it.

## Process:
When I undertake a machine learning project, I always lay out my process to improve my workflow and ability to explain the project findings. For this project, my workflow process will be:
1. Load in the data set
2. Split the data into training and testing
3. Create model
4. Train and test the model
5. Analyse the results 
6. Make improvements
7. Apply to test data set and export
8. Real-world Applications

In [1]:
import nltk
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

### 1.0 Importing the data
The dataset I will be using in this project is from The University of Michigan who asked their students to review a list of films. I have two datasets here, the training and the testing data. The training data contains a binary feature that denotes if the review was positive or negative. The testing data does not have this feature.

I will be importing the csv files as data frames, making sure I separate out the different features by tab.

In [2]:
train = pd.read_csv('trainingdata.csv',header=None,sep='\t',names=['target','text'])
test = pd.read_csv('testdata.csv',sep='\t',header=None,names=['text'])

What does my data look like?

In [42]:
train.head()

Unnamed: 0,target,text
0,1,The Da Vinci Code book is just awesome.
1,1,this was the first clive cussler i've ever rea...
2,1,i liked the Da Vinci Code a lot.
3,1,i liked the Da Vinci Code a lot.
4,1,I liked the Da Vinci Code but it ultimatly did...


### 2.0 Splitting the data
I will be splitting the 'training' dataset into training and testing data.

In [3]:
x = train.drop('target',axis=1)
y = train['target']

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.33)

### 3.0 Create the model
To create this model I will be using the pipeline feature from sklearn. Within this pipeline I will include a CountVectorizer to clean my input text (getting rid of stopwords and punctuation) and convert it to vector format, TfidfTransformer to add weights to important words, and a MultinomialNB to classify the data.

### 3.1 Cleaning the text
To improve the accuracy of the model I will first create a function that will remove all the stopwords and punctuation from the reviews. Stopwords are words such as: the, when, if, that etc. these words don't tell us anything about the text so they will be removed.

In [4]:
from nltk.corpus import stopwords
from string import punctuation

In [5]:
def clean_text(string):
    
    nopunc = [char for char in string if char not in punctuation]
    nopunc = ''.join(nopunc)
    
    return [word.lower() for word in nopunc if word not in stopwords.words('english')]

This function will take in a string, remove the punctuation, stopwords and then return the cleaned string.

### 3.2 Creating the pipeline
Now it's time to create my pipeline. Within this pipeline I will include 3 elements:
1. CountVectorizer.
This will convert the strings into a matrix of tokens, whilst also using the function I defined in 3.1 to clean the string.
2. TfidfTransformer.
This will transform the matrix of tokens into a normalized tf-idf representation, effectively adding weights to words if they are more important in the corpus.
3. MultinomialNB.
This is my classifier for my model. 

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

In [7]:
pipeline = Pipeline([
    ('vect',CountVectorizer(analyzer=clean_text)),
    ('tfidf',TfidfTransformer()),
    ('classifier',MultinomialNB()),
])

### 4.0 Train and test my model
Now that I created my pipeline, it is time to train and test the model.

In [8]:
pipeline.fit(x_train.values,y_train.values)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=<function clean_text at 0x1a0c974840>, binary=False,
        decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None,...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [9]:
predictions_one = pipeline.predict(x_test.values)

### 5.0 Analyse the results
Let's see how our model has performed.

In [10]:
from sklearn.metrics import classification_report,confusion_matrix

print(confusion_matrix(y_test.values,predictions_one))
print('\n')
print(classification_report(y_test.values,predictions_one))
print('\n')
print("Accuracy: {:.4%}".format(np.mean(predictions_one == y_test.values)))

[[ 553  434]
 [  49 1247]]


             precision    recall  f1-score   support

          0       0.92      0.56      0.70       987
          1       0.74      0.96      0.84      1296

avg / total       0.82      0.79      0.78      2283



Accuracy: 78.8436%


For a first pass at this model, it's a good set of results. It would appear that our model struggled with precision of positive reviews and the recall of negative reviews. Let's see if we can improve it.

### 6.0 Improving the model
When using machine learning it is important to assess and improve the model. To find areas of improvement I will try the following:
1. Use GridSearchCV to find optimal parameters
2. Use a different classifier in the pipeline

### 6.1 GridSearchCV with original model
To see if I can increase the accuracy of my original model pipeline I will use GridSearchCV to find the optimal parameters I pass through. 

I need to tell GridSearchCV which parameters to test first. Below are the parameters that I'm asking it to test.

In [11]:
from sklearn.model_selection import GridSearchCV
parameters = {
    'vect__ngram_range': [(1,1),(1,2)],
    'tfidf__use_idf': (True, False),
    'classifier__alpha': (1,1e-1,1e-2,1e-3),
}

In [12]:
gs_clf = GridSearchCV(pipeline, parameters, verbose=1)
gs_clf = gs_clf.fit(x_train.values,y_train.values)

Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed: 80.2min finished


In [13]:
# The best parameters for my pipeline are the following

gs_clf.best_params_

{'classifier__alpha': 0.1, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}

In [14]:
predictions_two = gs_clf.predict(x_test.values)

In [15]:
print(confusion_matrix(y_test.values,predictions_two))
print('\n')
print(classification_report(y_test.values,predictions_two))
print('\n')
print("Accuracy: {:.4%}".format(np.mean(predictions_two == y_test.values)))

[[ 552  435]
 [  52 1244]]


             precision    recall  f1-score   support

          0       0.91      0.56      0.69       987
          1       0.74      0.96      0.84      1296

avg / total       0.82      0.79      0.77      2283



Accuracy: 78.6684%


The only parameters that it suggested I change was the classifier alpha from 1.0 to 0.1. The accuracy was not increased.

### 6.2 Using BernoulliNB

Next I will try to use a different classifier, BernoulliNB. From my research I believe that this should improve my results due to it working well on binary features, such as the word vectors in this project.

In [16]:
from sklearn.naive_bayes import BernoulliNB

In [17]:
pipeline = Pipeline([
    ('vect',CountVectorizer(analyzer=clean_text)),
    ('tfidf',TfidfTransformer()),
    ('classifier',BernoulliNB()),
])

In [18]:
pipeline.fit(x_train.values,y_train.values)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=<function clean_text at 0x1a0c974840>, binary=False,
        decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None,..._idf=True)), ('classifier', BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))])

Now to see if my change of classifier will improve our accuracy results.

In [19]:
predictions_three = pipeline.predict(x_test.values)

In [20]:
print(confusion_matrix(y_test.values,predictions_three))
print('\n')
print(classification_report(y_test.values,predictions_three))
print('\n')
print("Accuracy: {:.4%}".format(np.mean(predictions_three == y_test.values)))

[[ 857  130]
 [ 254 1042]]


             precision    recall  f1-score   support

          0       0.77      0.87      0.82       987
          1       0.89      0.80      0.84      1296

avg / total       0.84      0.83      0.83      2283



Accuracy: 83.1800%


We can see that our accuracy was increased by a few percentage points. What is also interesting about these results is that our model was much more balanced in terms of predicting and recalling both positive and negative reviews.

### 6.3 Final accuracy results
I have now created three models. The first being my original model, the second after adjusting parameters, and the third after changing my classifier. 

The accuracy results for those three models are:

In [21]:
acc_results = {'MultinomialNB': (np.mean(predictions_one == y_test.values)), 'Multinomial NB after GridSearchCV': (np.mean(predictions_two == y_test.values)), 'BernoulliNB': (np.mean(predictions_three == y_test.values))}
acc_results

{'BernoulliNB': 0.83180026281208941,
 'Multinomial NB after GridSearchCV': 0.7866841874726237,
 'MultinomialNB': 0.78843626806833111}

From these results I can say that the BernoulliNB classifier is the better choice for this set of data.

### 6.3.1 Where I could go next
There are other areas that I could explore next if I were to continue. I could tweak the parameters even more on both the BernoulliNB and MultinomialNB models, or possibly try other machine learning models away from the Naive Bayes library. I could also analyse the predictions to see if there was a trend in the reviews that were misclassified.

### 7.0 Apply the model to test data
Now that I have created my model I will apply it to my test dataset and export the results as a CSV.

In [23]:
test_data_predictions = pipeline.predict(test.values)

In [32]:
test['classification'] = test_data_predictions

In [38]:
test.head()

Unnamed: 0,text,classification
0,"I don't care what anyone says, I like Hillary...",0
1,"harvard is dumb, i mean they really have to be...",1
2,I'm loving Shanghai > > > ^ _ ^.,1
3,harvard is for dumb people.,1
4,"As i stepped out of my beautiful Toyota, i hea...",0


In [39]:
test.to_csv('test_data_predictions.csv',index=False)

### 8.0 Real-world applications
This has been an interesting model to work on, but what are the practical applications of a model like this? Here are a couple of areas where I believe we could apply it:

1. This model doesn't have to be limited to movie reviews. This could be inbuilt into a chatbot system to determine the sentiment of a conversation between a customer and a customer service representative. 
2. Internal business communication could be anonymously fed into the model to track the overall mood within the business. If there is a long trend of negative sentiment within the business, this could then be addressed to avoid long term issues.