# Module 6 Assignment


A few things you should keep in mind when working on assignments:

1. Before you submit your assignment, make sure everything runs as expected. In the menu bar, select Kernel, then Restart & Run All.
2. Make sure that you save your work.
3. Upload your notebook to Compass.

-----

# Prepare Yelp Review Data

This assignment will use the Yelp review dataset. Before we attempt to build a model, we first prepare the data.

The Yelp review dataset contains 1000 customer reviews of a group of restaurants. There are two columns in the dataset, column **stars**, which is the star rating, and column __text__, which is the review text. The dataset only contains 1-star and 5-star reviews. There are 500 1-star reviews and 500 5-star reviews.

Please run the next code cell before proceeding to Problem 1.

In [None]:
import pandas as pd
# Load the Yelp review dataset
df = pd.read_csv('yelp_reviews.csv')
print(f'Counts of Reviews by Star Rating:\n{df.stars.value_counts()}\n')
print(f'Sample Review({df.stars[0]} stars):\n{df.text[0]}')
print('-'*80)
print('\nDataset Sample:')

df.sample(2, random_state=2)

---
# Problem 1: Prepare Data for Sentiment Analysis
Create a label column and split the text data into training and testing sets.

For this problem you will use `df` created above.

To solve this problem do the following:
1. Import needed modules.
2. Create a `label` column in DataFrame `df`. Set `label` to 1 (positive) if `stars` is 5, and 0 (negative) otherwise. Display 5 random rows of `df` to verify that the column was created correctly.
3. Create variable **label**, which is the `label` column in `df`, and variable __data__, which is the `text` column in `df`.
4. Split `data` and `label` into training and testing sets using `train_test_split`.
 - Set test_size to 0.4.
 - Assign return values to **d_train, d_test, l_train, l_test**.

After this problem, four new variables are defined: **d_train, d_test, l_train** and **l_test**.

Feel free to add extra code cells if needed.
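
The labeling and splitting steps above can be sketched on a small stand-in table. This is only an illustration under stated assumptions: the `toy` DataFrame, its rows, the `x_`/`y_` variable names, and the `random_state` value are hypothetical and are not part of the assignment data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Yelp data (not rows from yelp_reviews.csv)
toy = pd.DataFrame({
    'stars': [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],
    'text': [f'review {i}' for i in range(10)],
})

# 1 (positive) for 5-star reviews, 0 (negative) otherwise
toy['label'] = (toy['stars'] == 5).astype(int)

# Hold out 40% of the rows for testing; random_state is an arbitrary choice
x_train, x_test, y_train, y_test = train_test_split(
    toy['text'], toy['label'], test_size=0.4, random_state=0
)
print(len(x_train), len(x_test))
```

`train_test_split` returns the pieces in the order train data, test data, train labels, test labels, which is why the four names are unpacked in that order.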

---

In [None]:
# Your answer


---

# Problem 2: Train a LogisticRegression Model

For this problem, use d_train, l_train, d_test and l_test created in Problem 1.

Your task for this problem is to build and train a `LogisticRegression` estimator to make predictions on the Yelp review dataset. 

To solve this problem do the following:
1. Import needed modules.
2. Create a `TfidfVectorizer` object **tf_cv** and set `stop_words` to 'english'.
3. Fit the `TfidfVectorizer` object with d_train.
4. Transform d_train and d_test with the `TfidfVectorizer` object to get **train_dtm** and __test_dtm__.
5. Create a `LogisticRegression` estimator **lr_model**. Set `C` to `1E6` (1,000,000). Accept default values for all other hyperparameters.
6. Fit the `LogisticRegression` estimator using train_dtm and l_train.
7. Calculate the mean accuracy score of **lr_model**.
    - Call the `predict` method of lr_model on test_dtm to get the predicted labels, and assign them to variable **l_pred**.
    - Use the `accuracy_score` function in the `metrics` module on l_test and l_pred to calculate the mean accuracy score.
8. Display the mean accuracy score.
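
The vectorize, fit, predict, and score sequence above can be tried on a tiny corpus first. The sketch below is an assumption-laden illustration: the documents, labels, and the names `vec`, `clf`, `pred`, and `acc` are hypothetical and deliberately differ from the names the assignment asks for.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical toy reviews; 1 = positive, 0 = negative
train_docs = ['great food', 'awful service', 'great staff', 'awful food']
train_labels = [1, 0, 1, 0]
test_docs = ['great lunch', 'awful lunch']
test_labels = [1, 0]

vec = TfidfVectorizer(stop_words='english')
train_matrix = vec.fit_transform(train_docs)  # learn vocabulary and IDF weights from training text
test_matrix = vec.transform(test_docs)        # reuse the fitted vocabulary; unseen words are ignored

# A very large C means very weak regularization
clf = LogisticRegression(C=1e6)
clf.fit(train_matrix, train_labels)

pred = clf.predict(test_matrix)
acc = accuracy_score(test_labels, pred)
print(acc)
```

The vectorizer is fitted on the training text only, so the test documents are mapped onto the same columns as the training matrix.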

-----

In [None]:
# Your answer


---

# Problem 3: Get Top 10 Words in Positive Reviews

Get the top 10 words in positive reviews.

This problem will use **lr_model** and **tf_cv** created in Problem 2.

To solve this problem do the following:
1. Import needed modules.
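2. Use the tf_cv `get_feature_names` method to get all tokens (words) from the TfidfVectorizer object. In scikit-learn 1.0 and later, use `get_feature_names_out` instead; `get_feature_names` was removed in version 1.2. Convert the token list to a NumPy array and assign it to variable **all_words**.
3. Sort the **lr_model** `coef_` attribute with the NumPy `argsort` function, take the indices of the last 10 items, and assign them to **top_words_index**. For a binary model, `coef_` has shape (1, n_features), so sort its first row, `coef_[0]`.
4. Use all_words and top_words_index to get the top 10 words, and assign the list to **top10_positive_words**.
5. Reverse top10_positive_words so that the word with the largest coefficient comes first.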
6. Display the top 10 words in positive reviews.
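
The indexing pattern in the steps above can be checked with NumPy alone. In this sketch the vocabulary, the coefficient values, and the names `words`, `coef`, `top_index`, and `top_words` are made up for illustration; only the shape mimics the `coef_` attribute of a binary model.

```python
import numpy as np

# Hypothetical vocabulary and coefficients, shaped (1, n_features)
# like the coef_ attribute of a binary LogisticRegression
words = np.array(['bland', 'tasty', 'cold', 'friendly', 'fresh'])
coef = np.array([[-1.2, 2.1, -0.7, 1.5, 0.9]])

# argsort returns indices in ascending order of coefficient value,
# so the last k indices point at the k largest coefficients
k = 3
top_index = np.argsort(coef[0])[-k:]

# Reverse so the word with the largest coefficient comes first
top_words = words[top_index][::-1]
print(top_words)
```

Because `argsort` sorts in ascending order, the slice `[-k:]` alone would list the strongest word last, which is why the result is reversed.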

-----

In [None]:
# Your answer


---

# Problem 4: Get Top 10 Words in Negative Reviews

Get top 10 words in negative reviews.

This problem will use **lr_model, all_words, train_dtm, l_train** created above.

To solve this problem, do the following:
1. Flip the **l_train** values (change 0 to 1 and 1 to 0) to create a second set of training labels, **l_train_2**.
2. Refit **lr_model** with __train_dtm__ and **l_train_2**.
3. Get the top 10 words from **lr_model**.
    - Sort the **lr_model** `coef_` attribute with `argsort`, take the indices of the last 10 items, and assign them to **top_words_index**.
    - Use **all_words** (created in Problem 3) and top_words_index to get the top 10 words, and assign the list to **top10_negative_words**.
    - Reverse top10_negative_words.
4. Display the top 10 words in negative reviews.
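
The label flip, and why it surfaces negative words, can be illustrated without refitting anything. This sketch rests on an assumption stated here and in the code: a binary logistic regression refit on flipped labels is expected to learn approximately negated coefficients, so the negation below stands in for the refit. All values and names are hypothetical.

```python
import numpy as np

# Flipping binary labels: 0 becomes 1 and 1 becomes 0
labels = np.array([1, 0, 0, 1, 1])
flipped = 1 - labels
print(flipped)

# Hypothetical vocabulary and coefficients from a model trained on the original labels
words = np.array(['bland', 'tasty', 'cold', 'friendly', 'fresh'])
coef = np.array([-1.2, 2.1, -0.7, 1.5, 0.9])

# Assumption: negation mimics the coefficients a refit on flipped labels would give
flipped_coef = -coef

# The largest flipped coefficients belong to the most negative original words
top_negative = words[np.argsort(flipped_coef)[-2:]][::-1]
print(top_negative)
```

The expression `1 - labels` works the same way on a pandas Series of 0/1 values.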

-----

In [None]:
# Your answer


---

# Problem 5: Use Custom Stop Words

Add two words, "went" and "company", to the English stop words and train a LogisticRegression model. Because the two words are rather neutral, we hope this change improves the classification result.

This problem will use **d_train, d_test, l_train, l_test** created in Problem 1.

To solve this problem do the following:
1. Import needed modules.
    - If needed, install the nltk package from a terminal with the command `conda install nltk`. If the stop word corpus is missing, download it once with `nltk.download('stopwords')`.
2. Get all English stop words from `stopwords` in the `nltk.corpus` module, and assign the stop word list to variable **stop_words**.
3. Add the two words "went" and "company" to stop_words.
4. Create a `TfidfVectorizer` object **tf_cv_2** and set `stop_words` to the extended stop_words list from step 3.
5. Fit the tf_cv_2 with d_train.
6. Transform d_train and d_test with tf_cv_2 to get **train_dtm_2** and __test_dtm_2__.
7. Create a `LogisticRegression` estimator **lr_model_2**. Set `C` to `1E6` (1,000,000). Accept default values for all other hyperparameters.
8. Fit lr_model_2 using train_dtm_2 and l_train.
9. Calculate the mean accuracy score.
    - Call the `predict` method of lr_model_2 on test_dtm_2 to get the predicted labels, and assign them to variable l_pred.
    - Use `accuracy_score` function in `metrics` module on l_test and l_pred to calculate the mean accuracy score.
10. Display the mean accuracy score.
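
The effect of passing a custom stop word list to `TfidfVectorizer` can be seen on two short sentences. This is a sketch under assumptions: `base_stop_words` is a short hypothetical list standing in for the NLTK English stop words, and the documents and the names `custom_stop_words`, `vec`, and `vocab` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical short base list standing in for NLTK's English stop words
base_stop_words = ['the', 'to', 'we', 'was']

# Extend the base list with the two extra words
custom_stop_words = base_stop_words + ['went', 'company']

docs = ['we went to the company dinner', 'the dinner was great']

# stop_words accepts a plain list of strings as well as 'english'
vec = TfidfVectorizer(stop_words=custom_stop_words)
vec.fit(docs)

# Only words outside the custom list remain in the vocabulary
vocab = sorted(vec.vocabulary_)
print(vocab)
```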

-----

In [None]:
# Your answer

