# <font color="red">Machine Learning and NLP Exercises</font>

# Introduction

We will use the same review data set from Kaggle for this exercise. The product we'll focus on this time is a cappuccino cup. This week's goal is not only to preprocess the data, but also to classify reviews as positive or negative based on the review text.

The following code loads the data.


In [12]:
import nltk
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
data = pd.read_csv('coffee.csv')
data.head()

Unnamed: 0,user_id,stars,reviews
0,A2XP9IN4JOMROD,1,I wanted to love this. I was even prepared for...
1,A2TS09JCXNV1VD,5,Grove Square Cappuccino Cups were excellent. T...
2,AJ3L5J7GN09SV,2,I bought the Grove Square hazelnut cappuccino ...
3,A3CZD34ZTUJME7,1,"I love my Keurig, and I love most of the Keuri..."
4,AWKN396SHAQGP,1,It's a powdered drink. No filter in k-cup.<br ...


# Question 1 

- Determine how many reviews there are in total.


Use the preprocessing code below to clean the reviews data before moving on to modeling.


In [4]:
# Text preprocessing steps: remove tokens containing digits, lowercase the text, and replace punctuation with spaces
import re
import string

alphanumeric = lambda x: re.sub(r"""\w*\d\w*""", ' ', x)
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())

data['reviews'] = data.reviews.map(alphanumeric).map(punc_lower)
data.head()

Unnamed: 0,user_id,stars,reviews
0,A2XP9IN4JOMROD,1,i wanted to love this i was even prepared for...
1,A2TS09JCXNV1VD,5,grove square cappuccino cups were excellent t...
2,AJ3L5J7GN09SV,2,i bought the grove square hazelnut cappuccino ...
3,A3CZD34ZTUJME7,1,i love my keurig and i love most of the keuri...
4,AWKN396SHAQGP,1,it s a powdered drink no filter in k cup br ...
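
To see what the two cleaning steps do to a single string, here is a minimal sketch. The helper name `clean_review` and the sample sentence are made up for illustration; the two regular expressions are the ones used in the cell above.

```python
import re
import string

def clean_review(text):
    """Apply the same two steps as the lambdas above to a single string."""
    # Drop any token that contains a digit (e.g. "12oz")
    text = re.sub(r"\w*\d\w*", " ", text)
    # Lowercase, then replace every punctuation character with a space
    return re.sub("[%s]" % re.escape(string.punctuation), " ", text.lower())

sample = "It's a 12oz K-cup!"  # made-up sentence, not taken from the data set
cleaned = clean_review(sample)
print(cleaned.split())
```

Note that apostrophes and hyphens are replaced with spaces too, which is why "It's" shows up as "it s" and "k-cup" as "k cup" in the cleaned reviews.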


In [5]:
len(data)

542

# Question 2: Classification *(20% testing, 80% training)*

Steps for classification 

### Step 1: Prepare the data (identify the feature and label)

In [6]:
x = data['reviews']
y = data['stars']
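
The label above is the raw star rating (1 to 5), while the stated goal is a positive/negative classification. One possible way to derive a binary label is sketched below. The cut-off of 4 stars, the `sentiment` column name, and the toy rows are assumptions for illustration, not part of the exercise; how to treat 3-star reviews in particular is a judgment call.

```python
import pandas as pd

# Toy frame with the same column names as the notebook; the rows are made up
toy = pd.DataFrame({
    "stars": [1, 2, 3, 4, 5],
    "reviews": ["bad", "weak", "okay", "good", "great"],
})

# Assumed cut-off: 4 and 5 stars count as positive, everything else as negative
POSITIVE_THRESHOLD = 4
toy["sentiment"] = (toy["stars"] >= POSITIVE_THRESHOLD).map(
    {True: "positive", False: "negative"}
)
print(toy[["stars", "sentiment"]])
```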

### Step 2: Vectorize the feature

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer='word')

x = vectorizer.fit_transform(x)
print(x.shape)

(542, 2320)


### Step 3: Split the data into training and testing sets

In [16]:
# Split the data: 80% training, 20% testing, stratified by star rating
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=0.20, random_state=42)

print(x_train.shape)
print(x_test.shape)


(433, 2320)
(109, 2320)
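
In the cells above, the TF-IDF vectorizer is fitted on all 542 reviews, so its vocabulary and IDF weights also reflect the test reviews. A common alternative is to split the raw text first and fit the vectorizer on the training portion only. The sketch below shows that ordering on a made-up miniature corpus; the variable names and texts are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up miniature corpus standing in for data['reviews'] and data['stars']
texts = [
    "great coffee", "love this cup", "excellent flavor", "really tasty",
    "terrible taste", "too greasy", "awful powder", "weak and bland",
]
labels = [5, 5, 5, 5, 1, 1, 1, 1]

# Split the raw text first so the test reviews stay unseen
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, stratify=labels, test_size=0.25, random_state=42
)

tfidf = TfidfVectorizer(analyzer="word")
X_tr = tfidf.fit_transform(text_train)  # vocabulary and IDF learned from training text only
X_te = tfidf.transform(text_test)       # test text is only transformed, never fitted

print(X_tr.shape, X_te.shape)
```

Both matrices end up with the same number of columns because the test text is projected onto the training vocabulary; words that appear only in the test reviews are ignored.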


### Step 4: Identify the model/classifier to be used. Feed the training data into the model

### - Decision Tree

In [20]:
from sklearn.tree import DecisionTreeClassifier

DT_classifier = DecisionTreeClassifier()
DT_classifier.fit(x_train, y_train)
DT_classifier.predict(x_test)

array([5, 1, 5, 5, 3, 5, 5, 1, 4, 5, 4, 2, 5, 5, 5, 5, 1, 5, 5, 5, 1, 5,
       5, 5, 1, 5, 5, 1, 5, 5, 1, 5, 1, 5, 1, 1, 3, 1, 1, 5, 5, 4, 3, 2,
       5, 1, 3, 5, 5, 1, 1, 5, 5, 1, 5, 5, 5, 5, 5, 4, 5, 5, 1, 1, 1, 5,
       5, 5, 5, 2, 1, 5, 2, 1, 5, 5, 5, 1, 5, 5, 1, 5, 4, 5, 5, 5, 5, 4,
       5, 4, 1, 5, 5, 1, 5, 5, 5, 5, 5, 2, 5, 5, 5, 5, 5, 5, 1, 1, 5],
      dtype=int64)

### - Random Forest

In [21]:
rf_classifier = RandomForestClassifier()

rf_classifier.fit(x_train, y_train)

rf_classifier.predict(x_test)

array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 5, 5, 5, 5, 5, 1, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 1, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 1, 5, 5, 5, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 5],
      dtype=int64)
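
The Random Forest output above is almost entirely 5s, which may indicate that 5-star reviews dominate the training data. A quick check of the class shares, plus one possible mitigation (`class_weight="balanced"`), is sketched below. The toy features and labels are made up, and whether reweighting helps on the real data is not something this sketch can show.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up imbalanced labels: far more 5-star than 1-star examples
rng = np.random.default_rng(0)
X_toy = rng.random((100, 4))
y_toy = pd.Series([5] * 80 + [1] * 20)

# Share of each class; a dominant class makes "always predict 5" look attractive
class_share = y_toy.value_counts(normalize=True)
print(class_share)

# class_weight="balanced" weights classes inversely to their frequency
balanced_rf = RandomForestClassifier(class_weight="balanced", random_state=42)
balanced_rf.fit(X_toy, y_toy)
print(balanced_rf.classes_)
```

In the notebook, the same check would be `y_train.value_counts(normalize=True)`.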

# Question 3 
Generate the accuracy scores for Decision Tree and Random Forest.  

In [22]:
y_pred_dt = DT_classifier.predict(x_test)
print ('Decision Tree Accuracy:{}'.format(accuracy_score(y_test, y_pred_dt)))

y_pred_rf = rf_classifier.predict(x_test)
print ('Random Forest Classifier:{}'.format(accuracy_score(y_test, y_pred_rf)))


Decision Tree Accuracy:0.5137614678899083
Random Forest Classifier:0.6055045871559633
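
Accuracy alone does not show which star ratings a model confuses. The sketch below uses `confusion_matrix` (already imported at the top of the notebook) and compares accuracy against a majority-class baseline. The label arrays here are made up; in the notebook they would be `y_test` and `y_pred_dt` or `y_pred_rf`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions for illustration
y_true = np.array([5, 5, 5, 1, 1, 3, 4, 5])
y_hat = np.array([5, 5, 5, 5, 1, 5, 5, 5])

# Rows are true star ratings, columns are predicted star ratings
star_labels = [1, 2, 3, 4, 5]
cm = confusion_matrix(y_true, y_hat, labels=star_labels)
print(cm)

# Accuracy of always predicting the most frequent true class
values, counts = np.unique(y_true, return_counts=True)
majority_class = values[np.argmax(counts)]
baseline = accuracy_score(y_true, np.full_like(y_true, majority_class))

model_accuracy = accuracy_score(y_true, y_hat)
print(model_accuracy, baseline)
```

If a model's accuracy is close to the baseline, it has learned little beyond the class distribution. The matrix can also be drawn with `sns.heatmap(cm, annot=True)`.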


# Question 4
Predict the rating of this review using the Decision Tree and Random Forest classifiers:

<font color="blue">__"I dislike this coffee, terrible taste and very greasy."__</font>

In [23]:
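ts = "I dislike this coffee, terrible taste and very greasy."

# Apply the same cleaning as the training reviews
ts = re.sub(r"""\w*\d\w*""", ' ', ts)
ts = re.sub('[%s]' % re.escape(string.punctuation), ' ', ts.lower())
# Vectorize with the already-fitted TF-IDF vectorizer (transform, not fit_transform)
ts = vectorizer.transform([ts])
# Only the Random Forest prediction is displayed; DT_classifier.predict(ts) gives the Decision Tree's
rf_classifier.predict(ts)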

array([5], dtype=int64)
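
The question asks for predictions from both classifiers, and the output above shows only one. A small helper that cleans a new review, vectorizes it, and queries every model is sketched below. It is self-contained and trains on a made-up toy corpus, so the names `clean_review`, `predict_stars`, `toy_vectorizer`, and `toy_models` are illustrative assumptions, and its predictions say nothing about the real data set.

```python
import re
import string

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

def clean_review(text):
    """Same cleaning as the notebook: drop digit tokens, lowercase, strip punctuation."""
    text = re.sub(r"\w*\d\w*", " ", text)
    return re.sub("[%s]" % re.escape(string.punctuation), " ", text.lower())

# Made-up training data standing in for the cleaned reviews and their stars
train_texts = [
    "Excellent coffee, love it!", "Great taste and smooth.",
    "Wonderful flavor, will buy again.", "Terrible taste, very greasy.",
    "Awful powder, I dislike it.", "Bad coffee, weak and bland.",
]
train_stars = [5, 5, 5, 1, 1, 1]

toy_vectorizer = TfidfVectorizer(analyzer="word")
X_toy = toy_vectorizer.fit_transform([clean_review(t) for t in train_texts])

toy_models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42).fit(X_toy, train_stars),
    "Random Forest": RandomForestClassifier(random_state=42).fit(X_toy, train_stars),
}

def predict_stars(text):
    """Return each model's predicted star rating for one raw review string."""
    features = toy_vectorizer.transform([clean_review(text)])
    return {name: int(model.predict(features)[0]) for name, model in toy_models.items()}

predictions = predict_stars("I dislike this coffee, terrible taste and very greasy.")
print(predictions)
```

In the notebook, the same pattern would use the fitted `vectorizer`, `DT_classifier`, and `rf_classifier` instead of the toy objects.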