# Module 6 Assignment

A few things you should keep in mind when working on assignments:

1. Run the first code cell to import modules needed by this assignment before proceeding to problems.
2. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
3. Each problem has an autograder cell below the answer cell. Run the autograder cell to check your answer. If there's anything wrong in your answer, the autograder cell will display error messages.
4. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel → Restart & Run All. If the notebook runs through the last code cell without an error message, you've answered all problems correctly.
5. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

-----

# Run Me First!

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

# We do this to ignore warnings
import warnings
warnings.filterwarnings("ignore")

---
# Prepare Yelp Review Data

This assignment will use the Yelp review dataset. Before we attempt to build a model, we first prepare the data.

The Yelp review dataset contains 1000 customer reviews of a group of restaurants. The dataset has two columns: **stars**, the star rating, and __text__, the review text. It contains only 1-star and 5-star reviews, 500 of each.

Please run the next code cell before proceeding to Problem 1.

In [None]:
# Load the Yelp review dataset
df = pd.read_csv('data/yelp_reviews.csv')
rating = df['stars']
data = df['text']
sample_review = data[0]
print(f'Counts of Reviews by Star Rating:\n{rating.value_counts()}\n')
print(f'Sample Review({rating[0]} stars):\n{sample_review}')
print('-'*80)
print('\nDataset Sample:')
df.sample(2, random_state=2)

---
# Problem 1: Prepare Data for Sentiment Analysis
Create labels and split the text dataset into training and testing sets.

For this problem you will use **data** and __rating__ created above.

To solve this problem do the following:
- Create **label** from __rating__. Set the label to 1 (positive) if the rating is 5 stars and 0 (negative) otherwise. (Hint: use a lambda function: `label = rating.apply(lambda x: 1 if x==5 else 0)`)
- Split data and label into training and testing sets using `train_test_split`.
    - set test_size to 0.4
    - set random_state to 23
    - assign return values to **d_train, d_test, l_train, l_test**

After this problem, there are four new variables defined: **d_train, d_test, l_train** and __l_test__.
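
As a rough illustration of the two steps above, the following sketch builds binary labels and splits a small toy dataset. The names `toy_text`, `toy_rating`, and the `x_`/`y_` variables are hypothetical stand-ins, not the assignment's variables, and the `random_state` here is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the assignment's `data` and `rating` Series (hypothetical values).
toy_text = pd.Series([f"review {i}" for i in range(10)])
toy_rating = pd.Series([5, 1, 5, 1, 5, 1, 5, 1, 5, 1])

# Map star ratings to binary labels: 1 for 5 stars, 0 otherwise.
toy_label = toy_rating.apply(lambda x: 1 if x == 5 else 0)

# train_test_split returns the pieces in this order:
# features-train, features-test, labels-train, labels-test.
x_train, x_test, y_train, y_test = train_test_split(
    toy_text, toy_label, test_size=0.4, random_state=0
)

print(len(x_train), len(x_test))
```

Because both Series are split with the same shuffle, each text stays paired with its label: `x_train` and `y_train` share the same index.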

---

In [None]:
from sklearn.model_selection import train_test_split

# YOUR CODE HERE


In [None]:
assert_equal(len(d_train), 600, msg='training set size is not correct')
assert_equal(len(d_test), 400, msg='testing set size is not correct')
assert_equal(l_train.iloc[1], 1, msg='l_train is not correct')
assert_equal(l_train.iloc[100], 1, msg='l_train is not correct')
assert_equal(l_test.iloc[0], 0, msg='l_test is not correct')
assert_equal(l_test.iloc[100], 1, msg='l_test is not correct')

---

# Problem 2: Train a LogisticRegression Model

For this problem, use d_train, l_train, d_test and l_test created above.

Your task for this problem is to build and train a `LogisticRegression` estimator to make predictions on the Yelp review dataset. 

To solve this problem do the following:
- Create a `TfidfVectorizer` object **tf_cv**, set `stop_words` to 'english'.
- Fit the `TfidfVectorizer` object with d_train.
- Transform d_train and d_test with the `TfidfVectorizer` object to get **train_dtm** and __test_dtm__.
- Create a `LogisticRegression` estimator **lr_model**. Set `C` to `1E6` (1000000). Accept default values for all other hyperparameters.
- Fit the `LogisticRegression` estimator using train_dtm and l_train.
- Calculate the mean accuracy score of **lr_model**:
    - Apply the lr_model `predict` function to test_dtm created above to get the predicted labels, and assign them to variable **l_pred**.
    - Use the `accuracy_score` function in the `metrics` module on l_test and l_pred to calculate the mean accuracy score, and assign the score to **mas_score_lr**.


After this problem, there will be a trained LogisticRegression model **lr_model** defined, as well as two document term matrices, __train_dtm, test_dtm__, and the mean accuracy score **mas_score_lr**.
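
The vectorize, fit, predict, and score sequence described above can be sketched on a tiny made-up corpus. All documents, labels, and variable names below (`vec`, `clf`, `acc`, and so on) are hypothetical; the point is only the order of the calls, with the vectorizer fitted on the training text alone and then reused on the test text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Hypothetical toy corpus; not taken from the Yelp dataset.
train_docs = [
    "great food and great service",
    "terrible food",
    "great place",
    "terrible service and rude staff",
]
train_labels = [1, 0, 1, 0]
test_docs = ["a great place", "terrible and rude"]
test_labels = [1, 0]

# Learn the vocabulary and IDF weights from the training text only.
vec = TfidfVectorizer(stop_words="english")
train_matrix = vec.fit_transform(train_docs)

# Reuse the fitted vocabulary so test columns line up with training columns.
test_matrix = vec.transform(test_docs)

# A large C means very weak regularization.
clf = LogisticRegression(C=1e6)
clf.fit(train_matrix, train_labels)

pred = clf.predict(test_matrix)
acc = metrics.accuracy_score(test_labels, pred)
print(f"accuracy: {acc:.2f}")
```

Calling `fit_transform` on the test text instead of `transform` would build a different vocabulary, and the model's coefficients would no longer match the matrix columns.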

-----

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# YOUR CODE HERE


In [None]:
assert_equal(type(lr_model), type(LogisticRegression()), msg="lr_model is not defined as a LogisticRegression model")
assert_equal(lr_model.get_params()['C'], 1E6, msg="lr_model is not created with C = 1E6")
assert_almost_equal(train_dtm[0, 132], 0.05808376842859404, msg="train_dtm is not correct")
assert_almost_equal(test_dtm[0, 157], 0.12459374859215833, msg="test_dtm is not correct")
assert_equal(mas_score_lr, 0.89, msg="Mean accuracy score is not correct")
print(f'Logistic Regression accuracy score: {mas_score_lr*100:4.1f}%')

---

# Problem 3: Get Top 10 Words in Positive Reviews

Get the top 10 words in positive reviews.

This problem will use **lr_model** and __tf_cv__ created in problem 2.

To solve this problem do the following:
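- Use the tf_cv `get_feature_names` function to get all tokens (words) from the TfidfVectorizer object (in scikit-learn 1.2 and later, this method is named `get_feature_names_out`). Convert the token list to a numpy array and assign the numpy array to variable **all_words**.
- Sort the **lr_model** `coef_` attribute with the numpy `argsort` function (`coef_` has shape `(1, n_features)`, so sort its first row), get the indices of the last 10 items, and assign them to **top_words_index**.
- Get the top 10 words using all_words and top_words_index, and assign the top 10 words as a list to **top10_positive_words**.
- Reverse top10_positive_words so that the word with the largest coefficient comes first.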

After this problem, there's a new variable **top10_positive_words** defined.
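
The `argsort` step above can be illustrated with a made-up coefficient array. The words and values below are hypothetical; `coef` is shaped `(1, n_features)` to mimic the `coef_` attribute of a binary `LogisticRegression` model.

```python
import numpy as np

# Hypothetical vocabulary and coefficients, one coefficient per word.
words = np.array(["bland", "tasty", "rude", "fresh", "okay"])
coef = np.array([[-1.2, 2.5, -3.0, 1.7, 0.1]])

# argsort returns the indices that would sort the values in ascending order,
# so the last k indices point at the k largest coefficients.
top_idx = np.argsort(coef[0])[-3:]

# Index the word array with those positions; the result is still ascending.
top_words = words[top_idx].tolist()

# Reverse so the strongest word comes first.
top_words.reverse()
print(top_words)  # ['tasty', 'fresh', 'okay']
```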

-----

In [None]:
# YOUR CODE HERE


In [None]:
assert_equal(top10_positive_words, ['best', 'love', 'amazing', 'great', 'delicious', 'awesome', 'good', 'little', 'highly', 'included'],
             msg='Top 10 words in positive reviews are not correct')
print(f"Top 10 words in positive reviews:\n {top10_positive_words}")

---

# Problem 4: Get Top 10 Words in Negative Reviews

Get the top 10 words in negative reviews.

This problem will use **lr_model, all_words, train_dtm, l_train** created above.

To solve this problem do the following:
- Create another training label **l_train_2** by flipping the values of **l_train**: change 0 to 1 and 1 to 0. Do not modify **l_train** itself.
- Fit the **lr_model** with __train_dtm__ and **l_train_2**.
- Get the top 10 words from **lr_model**:
    - Sort the **lr_model** `coef_` attribute, get the indices of the last 10 items, and assign them to **top_words_index**.
    - Get the top 10 words using all_words created in problem 3 and top_words_index, and assign the top 10 words as a list to **top10_negative_words**.
    - Reverse top10_negative_words.

After this problem, there's a new variable **top10_negative_words** defined.
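
One way to flip binary labels without touching the original Series is to subtract them from 1, as sketched below on hypothetical data. The second half of the sketch shows a related idea using a made-up coefficient array: the most negative coefficients of the original model are found at the front of the `argsort` result. The assignment itself asks for a refit on the flipped labels, so treat this only as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical labels; subtraction builds a new Series and leaves `labels` unchanged.
labels = pd.Series([0, 1, 1, 0, 1])
flipped = 1 - labels
print(flipped.tolist())  # [1, 0, 0, 1, 0]

# Hypothetical vocabulary and coefficients shaped like a binary model's coef_.
words = np.array(["bland", "tasty", "rude", "fresh", "okay"])
coef = np.array([[-1.2, 2.5, -3.0, 1.7, 0.1]])

# The first k indices of an ascending argsort point at the most negative values.
bottom_idx = np.argsort(coef[0])[:2]
bottom_words = words[bottom_idx].tolist()
print(bottom_words)  # ['rude', 'bland']
```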

-----

In [None]:
# YOUR CODE HERE


In [None]:
assert_equal(l_train[:5].tolist(), [0, 1, 1, 1, 1], 
             msg="You can't change l_train in this problem. Please fix the problem and run from problem 1 to reset l_train.")
assert_equal(top10_negative_words, ['bad', 'worst', 'horrible', 'poor', 'went', 'problem', 'dirty', 'tasted', 'ordered', 'company'],
             msg='Top 10 words in negative reviews are not correct')
print(f"Top 10 words in negative reviews:\n {top10_negative_words}")

---

# Problem 5: Use Custom Stop Words

Add two of the top 10 words in negative reviews, "went" and "company", to the English stop words and train a LogisticRegression model. These two words are rather neutral, so we hope to get a better classification result by excluding them.

This problem will use **d_train, d_test, l_train, l_test** created in problem 1.

To solve this problem do the following:
- Get all English stop words from `stopwords` in the `nltk.corpus` module, and assign the stop words list to variable **stop_words**.
- Add the two words "went" and "company" to stop_words.
- Create a `TfidfVectorizer` object **tf_cv_2**, set `stop_words` to stop_words created in step 2.
- Fit tf_cv_2 with d_train.
- Transform d_train and d_test with tf_cv_2 to get **train_dtm_2** and __test_dtm_2__.
- Create a `LogisticRegression` estimator **lr_model_2**. Set `C` to `1E6` (1000000). Accept default values for all other hyperparameters.
- Fit lr_model_2 using train_dtm_2 and l_train.
- Calculate the mean accuracy score of **lr_model_2**:
    - Apply the lr_model_2 `predict` function to test_dtm_2 to get the predicted labels, and assign them to variable l_pred.
    - Use the `accuracy_score` function in the `metrics` module on l_test and l_pred to calculate the mean accuracy score, and assign the score to **mas_score_lr_2**.


After this problem, there will be a trained LogisticRegression model **lr_model_2** defined, as well as two new document term matrices, __train_dtm_2, test_dtm_2__, and mean accuracy score **mas_score_lr_2**.
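
To show how a custom stop word list changes the vocabulary, here is a minimal sketch that passes a plain Python list to `TfidfVectorizer`. The short `base_stop_words` list is a hypothetical stand-in for the NLTK English list used in the assignment, and the two documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical short stop word list standing in for the NLTK English list.
base_stop_words = ["the", "a", "and", "was", "to"]

# Extend a copy with the extra words to exclude.
custom_stop_words = base_stop_words + ["went", "company"]

docs = [
    "went to the company and the food was bad",
    "the food was great",
]

# stop_words accepts a list of strings; those tokens never enter the vocabulary.
vec = TfidfVectorizer(stop_words=custom_stop_words)
matrix = vec.fit_transform(docs)

vocab = sorted(vec.vocabulary_)
print(vocab)  # ['bad', 'food', 'great']
```

Because the vocabulary shrinks, column positions and TF-IDF values in the resulting matrix generally differ from those produced with a different stop word list.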

-----

In [None]:
from nltk.corpus import stopwords
# YOUR CODE HERE


In [None]:
assert_equal(type(lr_model_2), type(LogisticRegression()), msg="lr_model_2 is not defined as a LogisticRegression model")
assert_equal(lr_model_2.get_params()['C'], 1E6, msg="lr_model_2 is not created with C = 1E6")
assert_almost_equal(train_dtm_2[0, 132], 0.056447087318836305, msg="train_dtm_2 is not correct")
assert_almost_equal(test_dtm_2[0, 157], 0.11448748543343225, msg="test_dtm_2 is not correct")
assert_equal(mas_score_lr_2, 0.8975, msg="Mean accuracy score is not correct")
print(f'Logistic Regression with custom stop words accuracy score: {mas_score_lr_2*100:5.2f}%')