# Module 6 Assignment


A few things you should keep in mind when working on assignments:

1. Before you submit your assignment, make sure everything runs as expected. Go to the menubar, select Kernel, and Restart & Run all. 
2. Make sure that you save your work.
3. Upload your notebook to Compass.

-----

# Prepare Yelp Review Data

This assignment will use the Yelp review dataset. Before we attempt to build a model, we first prepare the data.

The Yelp review dataset contains 1000 customer reviews of a group of restaurants. There are two columns in the dataset, column **stars**, which is the star rating, and column __text__, which is the review text. The dataset only contains 1-star and 5-star reviews. There are 500 1-star reviews and 500 5-star reviews.

Please run the next code cell before proceeding to Problem 1.

In [8]:
import pandas as pd
#Load yelp review dataset
df = pd.read_csv('yelp_reviews.csv')
print(f'Counts of Reviews by Star Rating:\n{df.stars.value_counts()}\n')
print(f'Sample Review({df.stars[0]} stars):\n{df.text[0]}')
print('-'*80)
print('\nDataset Sample:')

df.sample(2, random_state=2)

Counts of Reviews by Star Rating:
5    500
1    500
Name: stars, dtype: int64

Sample Review(5 stars):
I love love LOVE this place. My boss (who is into healthy eating) recommended this place. I went over with some highly skeptical friends and one dinner was enough to convert them into believers! The food here is so good! We had the Shrimp dumplings and the Onion tart as starters. We ordered the Shirataki noodles and street tacos as entrees. So also ordered the Kale-aid. All of the dishes were yummy. 
I have gone back many times since then and have never been disappointed! I have gone after yoga to get some Kale salad or the chicken chopped salad. I always have to get the Kale aid. 
Once, a guy at the next table, uprooted a whole plant by mistake (on the patio) and was highly embarrassed as was his date! Ever since, I have very careful not to throw my arms around as I can be quite clumsy sometimes! I do NOT want to be banned from my favorite place for my clumsiness! I don't think I can

Unnamed: 0,stars,text
37,5,Celebrated my anniversary here. Everything was...
726,1,I had such high expectaions after reading revi...


---
# Problem 1: Prepare Data for Sentiment Analysis
Create a label column and split the text data into training and testing sets.

For this problem you will use `df` created above.

To solve this problem do the following:
1. Import needed modules.
2. Create a `label` column in DataFrame `df`. Set label to 1 (positive) if stars is 5, and 0 (negative) otherwise. Display 5 random rows of `df` to verify the column is created correctly.
3. Create variable **label**, which is the `label` column in `df`, and variable __data__, which is the `text` column in `df`.
4. Split `data` and `label` into training and testing sets using `train_test_split`.
 - Set test_size to 0.4.
 - Assign return values to **d_train, d_test, l_train, l_test**.

After this problem, four new variables are defined: **d_train, d_test, l_train** and __l_test__.

Feel free to add extra code cells if needed.

---

In [9]:
# Your answer
df['label'] = 0
df.loc[df.stars==5, 'label'] = 1
df.sample(5)

Unnamed: 0,stars,text,label
757,1,What was Dunkin Donuts thinking when they took...,0
606,1,I guess this is an east coast/midwest thing be...,0
509,1,After landing in PHX but before embarking on a...,0
723,1,I am totally baffled why this place gets such ...,0
561,1,I was looking into gyms around the area. Upon ...,0


In [10]:
from sklearn.model_selection import train_test_split

label = df.label
data = df.text

d_train, d_test, l_train, l_test = train_test_split(data, label, test_size=0.4)
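
The split above is shuffled differently on every run, so the accuracy scores in later problems can vary. As a minimal sketch (not required by the assignment), the example below shows how `random_state` and `stratify` could make a split repeatable and class-balanced. The toy DataFrame, the variable names, and the seed value 23 are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Yelp data (hypothetical rows, not from the dataset)
toy = pd.DataFrame({
    'text': [f'review {i}' for i in range(10)],
    'label': [0, 1] * 5,
})

# random_state fixes the shuffle; stratify keeps the 0/1 ratio in both parts
d_tr, d_te, l_tr, l_te = train_test_split(
    toy.text, toy.label, test_size=0.4, random_state=23, stratify=toy.label
)

print(len(d_tr), len(d_te))
print(l_te.value_counts().to_dict())
```

With 10 rows and `test_size=0.4`, 6 rows go to training and 4 to testing, and stratification puts two rows of each class in the test set.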

---

# Problem 2: Train a LogisticRegression Model

For this problem, use d_train, l_train, d_test and l_test created in problem 1.

Your task for this problem is to build and train a `LogisticRegression` estimator to make predictions on the Yelp review dataset. 

To solve this problem do the following:
1. Import needed modules.
2. Create a `TfidfVectorizer` object **tf_cv**, set `stop_words` to 'english'.
3. Fit the `TfidfVectorizer` object with d_train.
4. Transform d_train and d_test with the `TfidfVectorizer` object to get **train_dtm** and __test_dtm__.
5. Create a `LogisticRegression` estimator **lr_model**. Set `C` to `1E6`(1000000). Accept default values for all other hyperparameters.
6. Fit the `LogisticRegression` estimator using train_dtm and l_train.
7. Calculate the mean accuracy score of **lr_model**.
    - Apply lr_model `predict` function on test_dtm to get predicted label, assign it to variable **l_pred**.
    - Use `accuracy_score` function in `metrics` module on l_test and l_pred to calculate the mean accuracy score.
8. Display the mean accuracy score.

-----

In [11]:
# Your answer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

tf_cv = TfidfVectorizer(stop_words='english')
train_dtm = tf_cv.fit_transform(d_train)
test_dtm = tf_cv.transform(d_test)

# Fit model, predict, and calculate accuracy score
lr_model = LogisticRegression(C=1E6)
lr_model = lr_model.fit(train_dtm, l_train)
l_pred = lr_model.predict(test_dtm)
mas_score_lr = metrics.accuracy_score(l_test, l_pred)
mas_score_lr

0.875
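
The vectorize-then-fit sequence in this problem could also be bundled into a single scikit-learn `Pipeline`, which fits the vectorizer on the training text only and reuses it at prediction time. The sketch below is an alternative illustration, not the assignment's required solution. The four training sentences and the name `pipe` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews: two positive (1), two negative (0)
train_text = [
    'great food and friendly staff',
    'amazing place, great service',
    'terrible food and rude staff',
    'horrible place, terrible service',
]
train_label = [1, 1, 0, 0]

# The pipeline applies TF-IDF, then logistic regression, in one object
pipe = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    LogisticRegression(C=1E6),
)
pipe.fit(train_text, train_label)

# Raw text goes in; the fitted vectorizer is applied automatically
pred = pipe.predict(['great food', 'terrible service'])
print(pred)
```

Because the pipeline owns the vectorizer, there is no separate document-term matrix to keep in sync between training and testing.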

---

# Problem 3: Get Top 10 Words in Positive Reviews

Get top 10 words in positive reviews.

This problem will use **lr_model** and __tf_cv__ created in problem 2.

To solve this problem do the following:
1. Import needed modules.
2. Use tf_cv `get_feature_names` function to get all tokens (words) from the TfidfVectorizer object. Convert the token list to numpy array and assign the numpy array to variable **all_words**.
3. Sort **lr_model** `coef_` attribute with `numpy argsort` function and get the last 10 items' index and assign to **top_words_index**.
4. Get top 10 words using all_words and top_words_index and assign top 10 words list to **top10_positive_words**.
5. Reverse top10_positive_words.
6. Display the top 10 words in positive reviews.

-----

In [12]:
# Your answer
import numpy as np

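# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
all_words = np.array(tf_cv.get_feature_names())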
top_words_index = np.argsort(lr_model.coef_[0])[-10:]
top10_positive_words = [word for word in all_words[top_words_index]]
top10_positive_words.reverse()
top10_positive_words

['great',
 'best',
 'amazing',
 'love',
 'awesome',
 'friendly',
 'good',
 'favorite',
 'delicious',
 'stop']
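
To make the indexing in this problem easier to follow, here is a small sketch of the `argsort` pattern on made-up values. The arrays `words` and `coefs` are hypothetical stand-ins for `all_words` and `lr_model.coef_[0]`.

```python
import numpy as np

# Hypothetical vocabulary and coefficients (one coefficient per word)
words = np.array(['bad', 'good', 'great', 'okay', 'rude'])
coefs = np.array([-2.0, 1.5, 3.0, 0.1, -1.0])

# argsort returns indices in ascending order of value,
# so the last k indices point at the k largest coefficients
top_idx = np.argsort(coefs)[-3:]

# Reverse so the word with the largest coefficient comes first
top_words = words[top_idx][::-1].tolist()
print(top_words)  # ['great', 'good', 'okay']
```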

---

# Problem 4: Get Top 10 Words in Negative Reviews

Get top 10 words in negative reviews.

This problem will use **lr_model, all_words, train_dtm, l_train** created above.

To solve this problem do the following:
1. Invert the **l_train** values (change 0 to 1 and 1 to 0) to create another training label, **l_train_2**.
2. Fit the **lr_model** with __train_dtm__ and **l_train_2**.
3. Get top 10 words from **lr_model**.
    - Sort **lr_model** `coef_` attribute and get the last 10 items' index and assign to **top_words_index**.
    - Get top 10 words using **all_words** created in problem 3 and top_words_index and assign top 10 words list to **top10_negative_words**.
    - Reverse top10_negative_words.
4. Display the top 10 words in negative reviews.

-----

In [13]:
# Your answer
l_train_2 = [0 if y==1 else 1 for y in l_train]
lr_model = lr_model.fit(train_dtm, l_train_2)

top_words_index = np.argsort(lr_model.coef_[0])[-10:]
top10_negative_words = [word for word in all_words[top_words_index]]
top10_negative_words.reverse()

top10_negative_words

['bad',
 'horrible',
 'worst',
 'don',
 'told',
 'poor',
 'rude',
 'asked',
 'ordered',
 'didn']
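
For a binary model, training on flipped labels should in principle only flip the sign of the coefficients, up to solver tolerance. Under that assumption, the negative words could also be read from the most negative coefficients of the original model, without refitting. The sketch below illustrates the idea with plain NumPy. `words` and `coefs` are hypothetical values, not taken from the fitted model.

```python
import numpy as np

# Hypothetical vocabulary and coefficients of a positive-label model
words = np.array(['bad', 'good', 'great', 'okay', 'rude'])
coefs = np.array([-2.0, 1.5, 3.0, 0.1, -1.0])

# Most negative first: the first k entries of an ascending argsort
neg_words = words[np.argsort(coefs)[:2]].tolist()

# Same result from sign-flipped coefficients, mimicking a refit on flipped labels
flipped_words = words[np.argsort(-coefs)[-2:]][::-1].tolist()

print(neg_words, flipped_words)
```

Note that refitting `lr_model` as done in this problem overwrites the model trained in Problem 2, whereas reading from the existing coefficients leaves it unchanged.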

---

# Problem 5: Use Custom Stop Words

Add two words, "went" and "company", to the English stop words and train a LogisticRegression model. The two words are rather neutral, so we hope to get a better classification result with this change.

This problem will use **d_train, d_test, l_train, l_test** created in problem 1.

To solve this problem do the following:
1. Import needed modules.
    - If needed, install nltk package in a terminal with command `conda install nltk`.
2. Get all English stop words from `stopwords` in `nltk.corpus` module, assign stop words list to variable **stop_words**.
3. Add two words "went" and "company" into stop_words.
4. Create a `TfidfVectorizer` object **tf_cv_2**, set `stop_words` to the stop_words list extended in step 3.
5. Fit the tf_cv_2 with d_train.
6. Transform d_train and d_test with tf_cv_2 to get **train_dtm_2** and __test_dtm_2__.
7. Create a `LogisticRegression` estimator **lr_model_2**. Set `C` to `1E6`(1000000). Accept default values for all other hyperparameters.
8. Fit lr_model_2 using train_dtm_2 and l_train.
9. Calculate the mean accuracy score.
    - Apply lr_model_2 `predict` function on test_dtm_2 to get predicted label, assign it to variable l_pred.
    - Use `accuracy_score` function in `metrics` module on l_test and l_pred to calculate the mean accuracy score.
10. Display the mean accuracy score.

-----

In [14]:
# Your answer
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
#add to stop words
stop_words.extend(['went', 'company'])
tf_cv_2 = TfidfVectorizer(stop_words=stop_words)
train_dtm_2 = tf_cv_2.fit_transform(d_train)
test_dtm_2 = tf_cv_2.transform(d_test)

# Fit model, predict, and calculate accuracy score
lr_model_2 = LogisticRegression(C=1E6)
lr_model_2 = lr_model_2.fit(train_dtm_2, l_train)
l_pred = lr_model_2.predict(test_dtm_2)
mas_score_lr_2 = metrics.accuracy_score(l_test, l_pred)
mas_score_lr_2


0.8825
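
As a quick check that custom stop words are actually dropped from the vocabulary, the sketch below extends a stop-word list and inspects the fitted vectorizer. It swaps in scikit-learn's built-in `ENGLISH_STOP_WORDS` instead of the NLTK list, only to avoid a corpus download. The two sentences and the names `custom_stop_words`, `docs`, and `vec` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Assumption: scikit-learn's built-in list replaces NLTK's list in this sketch.
# TfidfVectorizer expects a list here, so the frozenset is converted.
custom_stop_words = list(ENGLISH_STOP_WORDS.union({'went', 'company'}))

# Hypothetical reviews
docs = ['We went to the company dinner', 'The dinner was great']

vec = TfidfVectorizer(stop_words=custom_stop_words)
vec.fit(docs)

# vocabulary_ maps each kept token to its column index
vocab = sorted(vec.vocabulary_)
print(vocab)
```

The same check, `'went' in tf_cv_2.vocabulary_`, could be run against the vectorizer fitted in this problem.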