# Project 2 - Pandas & Visualization 101

---

- Your Name Here:
- WFU Email Address:
- Submission Date:

# Instructions

1. Download the notebook `pandas_viz_101_yournamehere.ipynb` and the dataset `AmazonReviews.json` from the course website.

2. Open the notebook on your local computer, or upload it to Google Colab and open it there.

3. Replace the placeholder text above with your name, email address, and submission date.

4. This is a simple project made up of mini-tasks. Simply write the code to answer the question, and be sure to display your results!

5. Please submit your notebook in **HTML** on Canvas.

# Amazon Reviews

The `AmazonReviews.json` dataset contains over 370,000 reviews of products in Beauty and Personal Care. The data was initially scraped, munged, and prepped by Jianmo Ni (https://nijianmo.github.io/), an NLP researcher/engineer at Google.


Your challenge is to further prepare the data (easy) and to create summaries and charts that answer various questions about it (also easy). To complete this project, please follow these steps:

## Step 0. Load Libraries

Load the following libraries; you may need to install them first.
- Data Manipulation: `pandas`, `numpy`
- Data Visualization: `seaborn`, `matplotlib.pyplot`

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # aliased as sns, which the plotting hints below rely on
import warnings
warnings.filterwarnings("ignore", category=FutureWarning) # suppress the FutureWarning

## Step 1. Stage

1. Use `pd.read_json()` from `pandas` to read in the `AmazonReviews.json` data file and store it as a new data frame called `df_review`

2. Display the first 7 records

3. Check the shape and data types using `info()`

In [2]:
df_review = pd.read_json('/content/AmazonReviews.json')

In [3]:
df_review.head(7)

| | reviewerID | reviewerName | reviewTime | itemID | reviewText | overallRating | summary | verified |
|---|---|---|---|---|---|---|---|---|
| 0 | A1V6B6TNIC10QE | theodore j bigham | 02 19, 2015 | 143026860 | great | 1 | One Star | True |
| 1 | A2F5GHSXFQ0W6J | Mary K. Byke | 12 18, 2014 | 143026860 | My husband wanted to reading about the Negro ... | 4 | ... to reading about the Negro Baseball and th... | True |
| 2 | A1572GUYS7DGSR | David G | 08 10, 2014 | 143026860 | This book was very informative, covering all a... | 4 | Worth the Read | True |
| 3 | A1PSGLFK1NSVO | TamB | 03 11, 2013 | 143026860 | I am already a baseball fan and knew a bit abo... | 5 | Good Read | True |
| 4 | A6IKXKZMTKGSC | shoecanary | 12 25, 2011 | 143026860 | This was a good story of the Black leagues. I ... | 5 | More than facts, a good story read! | True |
| 5 | A36NF437WZLQ9E | W. Powell | 02 26, 2010 | 143026860 | Today I gave a book about the Negro Leagues of... | 5 | The Gift of Black Baseball | False |
| 6 | A10Q8NIFOVOHFV | Robert S. Clay Jr. | 03 7, 2001 | 143026860 | The story of race relations in American histor... | 4 | Baseball, America, and racism. | False |
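
The staging calls in this step can be tried on a tiny in-memory sample before running them on the full file. The sketch below is only an illustration: the two records and the name `toy` are made up, and the real dataset has more columns.

```python
import io
import pandas as pd

# Two made-up records standing in for AmazonReviews.json (hypothetical data).
raw = io.StringIO(
    '[{"reviewerID": "A1", "reviewText": "great", "overallRating": 5, "verified": true},'
    ' {"reviewerID": "A2", "reviewText": null, "overallRating": 3, "verified": false}]'
)
toy = pd.read_json(raw)

print(toy.head(7))  # shows at most the first 7 records
print(toy.shape)    # (number of rows, number of columns)
toy.info()          # column names, non-null counts, and dtypes
```

`info()` prints its report directly, so it does not need to be wrapped in `print()`.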


## Step 2. Structure & Transform (Part I)

Apply the following data cleaning methods to the dataset `df_review`:

1. Summarize NAs: Report the **number** and **percentage** of missing values for each column

2. Clean NAs: Drop any rows where `reviewText` is None

3. Drop the `summary` and `reviewerName` columns

4. Double-check NAs: Given the processed data, double-check the **number** and **percentage** of missing values for each column
> Hint: There should be no NAs anymore

5. Explore the `verified` column and report the percentage of rows that are verified using `value_counts()`

6. Filter and keep the subset of data where `verified==True`. Store this as a new data frame `df_review_prep`
> Hint: There should be 322098 rows
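
One possible shape for these cleaning steps, shown on a made-up four-row frame rather than the real data. The helper `na_summary` and the names `toy`, `clean`, and `verified_only` are hypothetical, chosen only for this sketch.

```python
import pandas as pd

# Hypothetical toy frame reusing a few column names from the project data.
toy = pd.DataFrame({
    "reviewerName": ["Ann", None, "Cy", "Di"],
    "reviewText": ["good", "bad", None, "great"],
    "summary": ["ok", "meh", "?", None],
    "verified": [True, False, True, True],
})

def na_summary(df):
    """Return the count and percentage of missing values per column."""
    return pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })

print(na_summary(toy))

# Drop rows without review text, then drop the two unneeded columns.
clean = toy.dropna(subset=["reviewText"]).drop(columns=["summary", "reviewerName"])
print(na_summary(clean))

# Share of verified vs. unverified rows, as percentages.
print(clean["verified"].value_counts(normalize=True) * 100)

# Keep verified rows only; .copy() gives an independent frame to add columns to later.
verified_only = clean[clean["verified"]].copy()
print(verified_only.shape)
```

Taking a `.copy()` after filtering is a design choice that keeps later column assignments from triggering `SettingWithCopyWarning`.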

## Step 3. Structure & Transform (Part II)

Given the data frame `"df_review_prep"` we created, please further transform the data by performing the following:

1. Create binary indicators (aka flag variables or dummy variables)

- Create a flag variable `good_flag`: set it to 1 if `reviewText` mentions the word "good", regardless of case; otherwise default it to 0
> There should be around 43831 references to “good”.

- Create a flag variable `great_flag`: set it to 1 if `reviewText` mentions the word "great", regardless of case; otherwise default it to 0
> There should be around 62080 references to “great”.

- Create a flag variable `bad_flag`: set it to 1 if `reviewText` mentions the word "bad", regardless of case; otherwise default it to 0
> There should be around 6710 references to “bad”.

> One approach to create flag variables (there are many):
```python
# str.contains gives True/False (na=False treats missing text as no match); cast to 1/0
df['good_flag'] = df['text'].str.contains('good', case=False, na=False).astype(int)
```

2. Create another variable `review_len` that holds the number of characters in each review
> Hint:
```python
df['text_len'] = df['text'].str.len()
```

3. List the first 5 records of `"df_review_prep"` to make sure your code works
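
As a rough illustration of this step, the sketch below builds all three flags in a loop and adds the length column on a made-up four-row frame (`toy` and its review strings are invented). Note that `str.contains` matches substrings, so a word such as "goodness" would also set `good_flag`.

```python
import pandas as pd

# Made-up reviews, for illustration only.
toy = pd.DataFrame(
    {"reviewText": ["Good value", "GREAT scent", "not bad, not good", "fine"]}
)

for word in ["good", "great", "bad"]:
    # Case-insensitive substring match, converted from True/False to 1/0.
    toy[f"{word}_flag"] = toy["reviewText"].str.contains(word, case=False).astype(int)

# Number of characters in each review, spaces and punctuation included.
toy["review_len"] = toy["reviewText"].str.len()

print(toy.head(5))
print(toy[["good_flag", "great_flag", "bad_flag"]].sum())  # number of flagged rows
```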

## Step 4. Frequency Analysis & Graphs

1. Create 4 basic bar charts using `sns.countplot()` to explore `overallRating`. Set the `hue` argument (i.e., the color grouping) to each of the following in turn: `None`, `good_flag`, `great_flag`, `bad_flag`.

> Hint: Expected charts can be created using code like this
```python
plt.figure(figsize=(6, 4))
factor = 'your_flag_variable'
sns.countplot(data=df, x="your_main_variable", hue=factor).set_title('your title')
plt.xlabel('your x label')
plt.ylabel('your y label')
plt.show()
```


2. To further explore `overallRating`, let's create stacked bar charts that present data as percentages. This method is particularly effective for comparing the proportion each subgroup contributes to the total.

   In a 100% stacked bar chart, each bar will represent a unique rating value (1-5 stars) under `overallRating`. The segments of each bar will show the percentage of reviews at that rating for which the chosen flag ('good', 'great', or 'bad') is set (1) versus not set (0).

   Let's create **three** 100% stacked bar charts - each one visualizes the relationship between `overallRating` and one of the binary flag variables respectively (`good_flag`, `great_flag`, `bad_flag`).

   **Describe** the insights or patterns you find.

> Hint: Expected charts can be created using code like this
```python
main_category = 'your_main_variable'
binary_category = 'your_flag_variable'
df_grouped = df.groupby(main_category)[binary_category].value_counts(normalize=True).unstack()
df_grouped.plot(kind='bar', stacked=True, figsize=(6,3))
plt.ylabel('Percentage')
plt.show()
```
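
To see what the grouped table in the hint contains before it is plotted, here is a small numeric sketch on made-up data (`toy` and `shares` are hypothetical names). Each row is one rating, each column is one flag value, and every row sums to 1.

```python
import pandas as pd

# Made-up ratings and flags, for illustration only.
toy = pd.DataFrame({
    "overallRating": [1, 1, 5, 5, 5, 5],
    "good_flag":     [0, 0, 1, 0, 1, 1],
})

# Proportion of each flag value within each rating; missing combinations become 0.
shares = (
    toy.groupby("overallRating")["good_flag"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(shares)
print(shares.sum(axis=1))  # every row adds up to 1.0
```

Passing `fill_value=0` is optional: without it, a rating with no flagged reviews shows `NaN` in the table, which the stacked bar plot simply leaves empty.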

## Step 5. Does Time Matter?

1. Clean datetimes: Change the data type of `reviewTime` to a standard type `datetime`
> Hint: Consider using `pd.to_datetime()`

2. Create 4 new variables to store the year, month, day, and day of week by applying transformation to `reviewTime`
> Hint:
```python
df['year'] = df['date_variable'].dt.year
df['month'] = df['date_variable'].dt.month
df['day'] = df['date_variable'].dt.day
df['day_of_week'] = df['date_variable'].dt.weekday # where 0 represents Monday
```
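
A minimal sketch of the conversion, assuming the dates look like the `reviewTime` values in the preview above (month, day, comma, year). The explicit `format` string is an assumption based on that preview; `toy` is a made-up frame.

```python
import pandas as pd

# Two date strings copied from the preview of the first 7 records.
toy = pd.DataFrame({"reviewTime": ["02 19, 2015", "03 7, 2001"]})

# Parse "MM DD, YYYY" strings into datetime64 values.
toy["reviewTime"] = pd.to_datetime(toy["reviewTime"], format="%m %d, %Y")

toy["year"] = toy["reviewTime"].dt.year
toy["month"] = toy["reviewTime"].dt.month
toy["day"] = toy["reviewTime"].dt.day
toy["day_of_week"] = toy["reviewTime"].dt.weekday  # 0 = Monday, 6 = Sunday

print(toy.dtypes)
print(toy)
```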

3. Group by `day_of_week`, then report the following summaries in a single data frame named `df_summary`:
  - count the total number of reviews
  - calculate the mean of `overallRating`
  - calculate the mean of `review_len`
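
One way to collect several summaries in a single frame is named aggregation with `groupby().agg()`. The sketch below uses a made-up three-row frame; the output column names `n_reviews`, `mean_rating`, and `mean_review_len` are arbitrary choices for this example.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "day_of_week":   [0, 0, 1],
    "overallRating": [5, 3, 4],
    "review_len":    [10, 20, 30],
})

# Each keyword is an output column: (input column, aggregation function).
summary = toy.groupby("day_of_week").agg(
    n_reviews=("overallRating", "count"),
    mean_rating=("overallRating", "mean"),
    mean_review_len=("review_len", "mean"),
)
print(summary)
```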


4. Create three bar charts using `sns.barplot()` to answer the following questions:

- What is the most/least reviewed year?

- What year are you most likely to get the highest/lowest mean review length?

- For the data in July 2018, what day of week are you most likely to get the highest/lowest mean overall rating?
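
For the last question, the data first has to be narrowed to a single month. The sketch below shows that filtering and the per-day mean on made-up rows (`toy`, `july_2018`, and `by_day` are hypothetical names); plotting is left out so the example runs with `pandas` alone.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "year":          [2018, 2018, 2018, 2017],
    "month":         [7, 7, 6, 7],
    "day_of_week":   [0, 0, 1, 0],
    "overallRating": [5, 3, 4, 1],
})

# Combine two boolean conditions with & (each wrapped in parentheses).
july_2018 = toy[(toy["year"] == 2018) & (toy["month"] == 7)]

# Mean rating per day of week, kept as ordinary columns for plotting.
by_day = july_2018.groupby("day_of_week", as_index=False)["overallRating"].mean()
print(by_day)
```

A frame like `by_day` can then be passed to `sns.barplot(data=by_day, x="day_of_week", y="overallRating")`. Alternatively, `sns.barplot()` can take the filtered rows directly, since it plots the mean of `y` for each `x` by default.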

## Step 6. Chat with your Data

In this final step, let's build a naive chatbot that can answer simple questions about your data using a **While Loop** and **If-Else Structures**. A good reference to complete this task is [here](https://colab.research.google.com/github/MonkeyWrenchGang/PythonBootcamp/blob/main/day_3/3_4_Journey_into_WHILE_Loops.ipynb). Please follow the instructions and code the missing parts.

- Create a greeting for the user when the chatbot starts, and use the `print()` function to display it. For example:
```python
print("Hello! I am your friendly chatbot that can answer simple questions about your data.")
```
- Use a while loop to start a continuous conversation. The conversation should continue until the user types 'exit'.
- Inside the while loop, use the `input()` function to get the name of a variable from the user, store it as `input_variable`. For example:
```python
input_variable = input("What variable would you like to ask me? (Type 'exit' to end the conversation) ")
```
- Check if the user's input is a valid column in the data frame `df_review_prep`.
 - If it is, print a message indicating that it's a valid variable and display its data type. For example:
 ```python
print("Good!", input_variable, "is a valid variable in this dataset")
print('Its data type:', df_review_prep[input_variable].dtypes)
 ```
 - If it isn't, the user should be notified and asked to try again. For example:
 ```python
print('Sorry, your input', input_variable,'is not a variable of the dataset. Please try it again :)')
 ```

- If the user's chosen variable is an integer type 'int64'
  - Use another `input()` to ask what statistic to report. The options should be 'count', 'mean', 'max', 'min', or 'all'. For example:
  ```python
  summary_stats = input("What statistic would you like to report? Type one from: [count, mean, max, min, all] ")
  ```
  - Depending on the user's input, calculate and display the requested statistic for this variable.
  - If the input isn't recognized, display an error message. For example:
  ```python
  print('Please input a valid statistic.')
  ```

- If the user's chosen variable is NOT an integer type 'int64', display a placeholder message. For example:
```python
print('Under development.. Stay tuned :)')
```

Example conversation:
```
Hello! I am your friendly chatbot that can answer simple questions about your data.
What variable would you like to ask me? (Type 'exit' to end the conversation) overallRating
Good! overallRating is a valid variable in this dataset
Its data type: int64
What statistic would you like to report? Type one from: [count, mean, max, min, all] all
count    322098.000000
mean          4.113881
std           1.361587
min           1.000000
25%           4.000000
50%           5.000000
75%           5.000000
max           5.000000
Name: overallRating, dtype: float64
What variable would you like to ask me? (Type 'exit' to end the conversation) exit
Goodbye! It was nice chatting with you.
```
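
The pieces above could be assembled roughly as follows. This is a sketch, not the reference solution: the function name `chat` and its `ask` parameter are inventions of this example, added so the loop can be driven by scripted answers instead of live keyboard input, and `toy` is a made-up frame.

```python
import pandas as pd

def chat(df, ask=input):
    """Run the question loop; `ask` defaults to input() but can be scripted."""
    print("Hello! I am your friendly chatbot that can answer simple questions about your data.")
    while True:
        input_variable = ask("What variable would you like to ask me? (Type 'exit' to end the conversation) ")
        if input_variable == "exit":
            print("Goodbye! It was nice chatting with you.")
            break
        if input_variable not in df.columns:
            print("Sorry, your input", input_variable, "is not a variable of the dataset. Please try it again :)")
            continue
        print("Good!", input_variable, "is a valid variable in this dataset")
        print("Its data type:", df[input_variable].dtypes)
        if str(df[input_variable].dtype) != "int64":
            print("Under development.. Stay tuned :)")
            continue
        summary_stats = ask("What statistic would you like to report? Type one from: [count, mean, max, min, all] ")
        if summary_stats == "all":
            print(df[input_variable].describe())
        elif summary_stats in ("count", "mean", "max", "min"):
            # Look up the Series method by name, e.g. df[col].mean()
            print(summary_stats, ":", getattr(df[input_variable], summary_stats)())
        else:
            print("Please input a valid statistic.")

# Scripted run on made-up data; call chat(toy) instead to type answers yourself.
toy = pd.DataFrame({"overallRating": [5, 3, 4], "reviewText": ["a", "b", "c"]})
answers = iter(["price", "reviewText", "overallRating", "mean", "exit"])
chat(toy, ask=lambda prompt: next(answers))
```

Using `continue` after each early outcome keeps the loop body flat; nested `if`/`else` blocks would work equally well.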


# Finally

**Important**: Wrap this up in a notebook and convert it to **HTML**. To exceed the bar, make sure things look good.