# Module 6 Assignment

A few things you should keep in mind when working on assignments:

1. Run the first code cell to import modules needed by this assignment before proceeding to problems.
2. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
3. Each problem has an autograder cell below the answer cell. Run the autograder cell to check your answer. If there's anything wrong in your answer, the autograder cell will display error messages.
4. Before you submit your assignment, make sure everything runs as expected. Go to the menubar, select Kernel, then Restart & Run All. If the notebook runs through the last code cell without an error message, you've answered all problems correctly.
5. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

-----

# Run Me First!

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

# We do this to ignore warnings
import warnings
warnings.filterwarnings("ignore")

---
# Prepare Yelp Review Data

In this assignment, you will use the Yelp review dataset. Before we attempt to build a model, we first prepare the data.

The Yelp review dataset contains 1000 customer reviews of a group of restaurants. It has two columns: **stars**, the star rating, and __text__, the review text. The dataset contains only 1-star and 5-star reviews, 500 of each.

Please run the next Code cell before proceeding to Problem 1.


In [2]:
# Load the Yelp review dataset
df = pd.read_csv('data/yelp_reviews.csv')
rating = df['stars']
data = df['text']
sample_review = data[0]
print(f'Counts of Reviews by Star Rating:\n{rating.value_counts()}\n')
print(f'Sample Review({rating[0]} stars):\n{sample_review}')
print('-'*80)
print('\nDataset Sample:')
df.sample(2, random_state=2)

Counts of Reviews by Star Rating:
1    500
5    500
Name: stars, dtype: int64

Sample Review(5 stars):
I love love LOVE this place. My boss (who is into healthy eating) recommended this place. I went over with some highly skeptical friends and one dinner was enough to convert them into believers! The food here is so good! We had the Shrimp dumplings and the Onion tart as starters. We ordered the Shirataki noodles and street tacos as entrees. So also ordered the Kale-aid. All of the dishes were yummy. 
I have gone back many times since then and have never been disappointed! I have gone after yoga to get some Kale salad or the chicken chopped salad. I always have to get the Kale aid. 
Once, a guy at the next table, uprooted a whole plant by mistake (on the patio) and was highly embarrassed as was his date! Ever since, I have very careful not to throw my arms around as I can be quite clumsy sometimes! I do NOT want to be banned from my favorite place for my clumsiness! I don't think I can

Unnamed: 0,stars,text
37,5,Celebrated my anniversary here. Everything was...
726,1,I had such high expectaions after reading revi...


---
# Problem 1: Get Top Words from a Sample Review

Get the most-used words in a sample review.

For this problem you will use the string **sample_review** created above.

To solve this problem do the following:
- Use the regular expression `r'[^\w]'` to replace every character in sample_review that is not a word character (a letter, digit, or underscore) with a space, and assign the new string to sample_review_ns.
- Convert sample_review_ns to **all lower case** with the string method lower(), and assign the new string to sample_review_nsl.
- Use the string method split() to split sample_review_nsl into a list of words, and use the list to construct a `Counter` object from the `collections` module.
- Get the **top 10** words from the Counter object with its `most_common()` method and assign the result to variable __top_10_words__.

After this problem, there's a new variable **top_10_words** defined.
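
As a rough illustration of the steps listed for this problem, here is a minimal sketch on a made-up string. The text and the names `toy_text`, `toy_clean`, `toy_words`, and `toy_top` are assumptions for this example only and are not part of the assignment data.

```python
import re
from collections import Counter

# Hypothetical toy string, not taken from the Yelp dataset.
toy_text = "Good food, good service. Good!"

# Replace every non-word character (anything other than a letter,
# digit, or underscore) with a space.
toy_clean = re.sub(r'[^\w]', ' ', toy_text)

# Lower-case the text so "Good" and "good" count as the same word,
# then split on whitespace to get a list of words.
toy_words = toy_clean.lower().split()

# Count the words and keep the two most frequent ones.
toy_top = Counter(toy_words).most_common(2)
print(toy_top)
```

`most_common()` returns a list of `(word, count)` tuples sorted by count. Words with equal counts are returned in the order they were first encountered, which is why the order of tied words in a result is stable for a given text.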

---

In [3]:
import re
import collections as cl

### BEGIN SOLUTION
sample_review_ns = re.sub(r'[^\w]', ' ', sample_review)
sample_review_nsl = sample_review_ns.lower()
c = cl.Counter(sample_review_nsl.split())
top_10_words = c.most_common(10)
### END SOLUTION

In [4]:
assert_equal(type(top_10_words), list, msg='top_10_words should be a list')
assert_equal(top_10_words, [('i', 10),('the', 10),('and', 5),('to', 5),('have', 5),('my', 4),
                            ('as', 4),('love', 3),('place', 3),('was', 3)], msg='top_10_words list is not correct')
print('Top 10 Words:')
top_10_words

Top 10 Words:


[('i', 10),
 ('the', 10),
 ('and', 5),
 ('to', 5),
 ('have', 5),
 ('my', 4),
 ('as', 4),
 ('love', 3),
 ('place', 3),
 ('was', 3)]

---
# Problem 2: Prepare Data for Text Classification

Create labels and split the text dataset into training and testing sets.

For this problem you will use **data** and __rating__ created above.

To solve this problem do the following:
- Create **label** from __rating__. Set label to 1 if the rating is 5 stars and 0 otherwise. (Hint: use a lambda function: `label = rating.map(lambda x: 1 if x==5 else 0)`.)
- Split data and label into training and testing sets using `train_test_split`.
 - Set test_size to 0.4.
 - Set random_state to 23.
 - Assign return values to **d_train, d_test, l_train, l_test**.

After this problem, there are four new variables defined: **d_train, d_test, l_train** and __l_test__.
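
The labeling and splitting steps for this problem can be sketched on a small made-up table. The DataFrame `toy` and the names `toy_label`, `x_train`, `x_test`, `y_train`, and `y_test` are hypothetical and chosen only for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten short reviews with 1-star or 5-star ratings.
toy = pd.DataFrame({
    'stars': [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],
    'text': [f'review {i}' for i in range(10)],
})

# Map 5-star ratings to 1 and every other rating to 0.
toy_label = toy['stars'].map(lambda x: 1 if x == 5 else 0)

# Hold out 40% of the rows for testing. A fixed random_state makes
# the shuffle reproducible from run to run.
x_train, x_test, y_train, y_test = train_test_split(
    toy['text'], toy_label, test_size=0.4, random_state=23)

print(len(x_train), len(x_test))
```

Because the text and the labels are passed to `train_test_split` together, they are shuffled with the same permutation, so each training review stays paired with its own label.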

---

In [5]:
from sklearn.model_selection import train_test_split

### BEGIN SOLUTION
label = rating.map(lambda x: 1 if x==5 else 0)
d_train, d_test, l_train, l_test = train_test_split(data, label, test_size=0.4, random_state=23)
### END SOLUTION

In [6]:
assert_equal(len(d_train), 600, msg='training set size is not correct')
assert_equal(len(d_test), 400, msg='testing set size is not correct')
assert_equal(l_train.iloc[1], 1, msg='l_train is not correct')
assert_equal(l_train.iloc[100], 1, msg='l_train is not correct')
assert_equal(l_test.iloc[0], 0, msg='l_test is not correct')
assert_equal(l_test.iloc[100], 1, msg='l_test is not correct')

---

# Problem 3: Train a Multinomial Naive Bayes Model

For this problem, use d_train, l_train, d_test and l_test created above.

Your task for this problem is to build and train a `MultinomialNB` estimator to make predictions on the Yelp review dataset. 

To solve this problem do the following:
- Create a `TfidfVectorizer` object, set `stop_words` to 'english'.
- Fit the `TfidfVectorizer` object with d_train.
- Transform d_train and d_test with the `TfidfVectorizer` object to get **train_dtm** and __test_dtm__.
- Create a `MultinomialNB` estimator **nb_model**. Accept default values for all hyperparameters.
- Fit the `MultinomialNB` estimator using train_dtm and l_train.

After this problem, there will be a trained multinomial Naive Bayes model **nb_model** defined, as well as two document-term matrices, __train_dtm, test_dtm__.
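
To show why the vectorizer is fitted on the training text only, here is a minimal sketch on a made-up corpus. The documents, labels, and all `toy_*` names are assumptions for illustration and are unrelated to the Yelp data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels (1 = positive, 0 = negative).
toy_train = ['great food great service', 'terrible food rude staff',
             'great place', 'terrible experience']
toy_labels = [1, 0, 1, 0]
toy_test = ['great staff']

# Learn the vocabulary and IDF weights from the training text only,
# then reuse them to transform the test text.
toy_vec = TfidfVectorizer(stop_words='english')
toy_train_dtm = toy_vec.fit_transform(toy_train)
toy_test_dtm = toy_vec.transform(toy_test)

# Fit a multinomial Naive Bayes model with default hyperparameters.
toy_model = MultinomialNB().fit(toy_train_dtm, toy_labels)
toy_pred = toy_model.predict(toy_test_dtm)

print(toy_train_dtm.shape, toy_test_dtm.shape)
```

Both matrices have the same number of columns because the test text is mapped onto the vocabulary learned from the training text. Words that appear only in the test text are ignored.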

-----

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

### BEGIN SOLUTION
tf_cv = TfidfVectorizer(stop_words='english')
train_dtm = tf_cv.fit_transform(d_train)
test_dtm = tf_cv.transform(d_test)

# Fit model
nb_model = MultinomialNB()
nb_model = nb_model.fit(train_dtm, l_train)
### END SOLUTION

In [8]:
assert_equal(type(nb_model), type(MultinomialNB()), msg="nb_model is not defined as a MultinomialNB model")
assert_almost_equal(train_dtm[0, 132], 0.05808376842859404, msg="train_dtm is not correct")
assert_almost_equal(test_dtm[0, 157], 0.12459374859215833, msg="test_dtm is not correct")


---

# Problem 4: Get Text Classification Metrics

For this problem, you will compute the classification metrics of the nb_model created in problem 3.  

To solve this problem, do the following:

- Apply the `predict` method of nb_model to test_dtm, created in Problem 3, to get the predicted labels, and assign them to variable **l_pred**.
- Use l_test and l_pred to calculate:
 - The mean accuracy score, using the `accuracy_score` function in the `metrics` module, and save the value to **mas_score**.
 - The classification report, using the `classification_report` function in the `metrics` module, and save it to variable **c_report**.

After this problem, there will be two new variables, **mas_score** and __c_report__ defined.
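
The two metrics can be sketched on a handful of made-up labels. The lists `toy_true` and `toy_pred` and the names `toy_acc` and `toy_report` are hypothetical and used only for this example.

```python
from sklearn import metrics

# Hypothetical true and predicted labels for eight test documents.
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Accuracy is the fraction of predictions that match the true labels:
# here 5 of the 8 predictions are correct.
toy_acc = metrics.accuracy_score(toy_true, toy_pred)

# The report is a formatted string with per-class precision, recall,
# F1-score, and support (the number of true samples in each class).
toy_report = metrics.classification_report(toy_true, toy_pred)

print(f'Accuracy: {toy_acc:.3f}')
print(toy_report)
```

Both functions take the true labels first and the predicted labels second. Since `classification_report` returns a string by default, values in it can be checked with substring tests such as `'0.94' in c_report`.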

-----

In [9]:
from sklearn import metrics

### BEGIN SOLUTION
l_pred = nb_model.predict(test_dtm)
mas_score = metrics.accuracy_score(l_test, l_pred)
c_report = metrics.classification_report(l_test, l_pred)
### END SOLUTION

In [10]:
assert_equal(mas_score, 0.885, msg='Accuracy score is not correct')
assert_true('0.94' in c_report, msg='Classification report is not correct')
assert_true('194' in c_report, msg='Classification report is not correct')
print(f"Accuracy Score: {mas_score*100:4.1f}%")
print(f"Classification Report\n{c_report}")

Accuracy Score: 88.5%
Classification Report
              precision    recall  f1-score   support

           0       0.84      0.94      0.89       194
           1       0.94      0.83      0.88       206

    accuracy                           0.89       400
   macro avg       0.89      0.89      0.88       400
weighted avg       0.89      0.89      0.88       400

