# **Amazon Product Review: Business Problem**
### _Authors: Magdalena Szymanowska & Sofia Llàcer Caro_
**Perform a sentiment analysis of Amazon reviews using NLTK and other required Python
packages, and solve the business problem for Amazon as stated through the questions below.**

To prepare this report, we first preprocessed the review data with the script `preprocessing_pipeline.py`. This produces a file containing only the fields relevant to the questions, so that this notebook can focus on how those fields are used. For details on the preprocessing pipeline, how the data was generated and the tools used, see the Python source file included in this directory.

In [51]:
import pandas as pd

# Import data
df = pd.read_csv("preprocessed_data.csv")
df.head()

| Unnamed: 0 | product | avg_rating | num_ratings | percent_neg | percent_pos | percent_neu | percent_winter | product_name |
|---|---|---|---|---|---|---|---|---|
| 0 | AVpgNzjwLJeJML43Kpxn | 4.44804 | 8343 | 0.02565 | 0.461704 | 0.512645 | 0.0 | AmazonBasics AAA Performance Alkaline Batterie... |
| 1 | AVpe7nGV1cnluZ0-aG2o | 4.25 | 4 | 0.0 | 0.5 | 0.5 | 0.0 | AmazonBasics Nylon CD/DVD Binder (400 Capacity) |
| 2 | AVpfl8cLLJeJML43AE3S | 5.0 | 2 | 0.0 | 0.0 | 1.0 | 0.0 | Amazon Echo – White |
| 3 | AWK8z0pOIwln0LfXlSxH | 5.0 | 1 | 0.0 | 0.0 | 1.0 | 0.0 | Amazon Echo Show - Black |
| 4 | AWYAV-i9Iwln0LfXqrUq | 4.5 | 2 | 0.0 | 0.0 | 1.0 | 0.0 | Echo Spot Pair Kit (Black) |
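
The `percent_neg`, `percent_pos` and `percent_neu` columns are produced by `preprocessing_pipeline.py`, which is not shown here. As a rough illustration only, the sketch below shows one way such shares could be derived, assuming each review already carries a VADER-style compound score in [-1, 1] and using the conventional ±0.05 cut-offs. The function names, the cut-offs and the example scores are assumptions, not taken from the pipeline.

```python
from collections import Counter

# Assumption: each review already has a VADER-style compound score in [-1, 1].
# The +/-0.05 cut-offs are the conventional VADER thresholds; they are not
# confirmed by preprocessing_pipeline.py.
POS_CUTOFF = 0.05
NEG_CUTOFF = -0.05


def label_sentiment(compound):
    """Map a compound score to 'pos', 'neg' or 'neu'."""
    if compound >= POS_CUTOFF:
        return "pos"
    if compound <= NEG_CUTOFF:
        return "neg"
    return "neu"


def sentiment_shares(compound_scores):
    """Return the share of negative, positive and neutral reviews for one product."""
    total = len(compound_scores)
    if total == 0:
        return {"percent_neg": 0.0, "percent_pos": 0.0, "percent_neu": 0.0}
    counts = Counter(label_sentiment(score) for score in compound_scores)
    return {
        "percent_neg": counts["neg"] / total,
        "percent_pos": counts["pos"] / total,
        "percent_neu": counts["neu"] / total,
    }


# Hypothetical compound scores for one product with four reviews
example_shares = sentiment_shares([0.8, 0.0, -0.4, 0.6])
print(example_shares)
```

As in the preprocessed file, the three values are fractions between 0 and 1 that sum to 1 for each product.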


## 1. Which products should be kept?
We decided to keep the products whose reviews are mostly good, as these are the ones consumers like most. The filter below keeps every product with an average rating above 3 stars; no minimum number of reviews is applied here.

In [66]:
# Thresholds
q1_rating_threshold = 3

# Filtering results
q1_filtered_df = df[df['avg_rating'] > q1_rating_threshold]
q1_kept = q1_filtered_df['product_name']
q1_answer = f"The following products should be kept:\n{q1_kept}"
print(q1_answer)

The following products should be kept:
0     AmazonBasics AAA Performance Alkaline Batterie...
1       AmazonBasics Nylon CD/DVD Binder (400 Capacity)
2                                   Amazon Echo – White
3                              Amazon Echo Show - Black
4                            Echo Spot Pair Kit (Black)
                            ...                        
58    Kindle E-reader - White, 6 Glare-Free Touchscr...
59    Amazon Fire TV Gaming Edition Streaming Media ...
61    All-New Kindle Oasis E-reader - 7 High-Resolut...
62    AmazonBasics Bluetooth Keyboard for Android De...
64    Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...
Name: product_name, Length: 62, dtype: object


## 2. Which products should be dropped?
Dropping a product is a critical decision, since it removes the product from the market, so here we add the condition that the number of reviews must exceed a threshold. This gives a chance to products that have low ratings but only a few reviews. We realise that the kept and dropped sets are therefore not exact opposites of each other: products with a low average rating and a low number of reviews fall into neither, which acts as a kind of probation period between the two categories.

In [72]:
# Thresholds
q2_rating_threshold = 3
q2_num_reviews_threshold = 10

# Filtering results
q2_filtered_df = df[(df['num_ratings'] > q2_num_reviews_threshold) & (df['avg_rating'] < q2_rating_threshold)]
q2_dropped = q2_filtered_df['product_name']
q2_answer = f"The following products should be dropped:\n{q2_dropped}"
print(q2_answer)

The following products should be dropped:
Series([], Name: product_name, dtype: object)
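
To make the relationship between questions 1 and 2 explicit, the sketch below combines both filters into a single per-product decision. It reuses the thresholds from the cells above (average rating 3, more than 10 ratings); the function name `product_status` and the labels `keep`, `drop` and `probation` are hypothetical and only serve the illustration.

```python
def product_status(avg_rating, num_ratings,
                   rating_threshold=3, num_reviews_threshold=10):
    """Classify one product as 'keep', 'drop' or 'probation' (hypothetical labels)."""
    # Question 1: keep anything rated above the threshold.
    if avg_rating > rating_threshold:
        return "keep"
    # Question 2: drop only when the low rating is backed by enough ratings.
    if avg_rating < rating_threshold and num_ratings > num_reviews_threshold:
        return "drop"
    # Everything else (too few ratings, or exactly at the threshold) waits.
    return "probation"


# Hypothetical products: (avg_rating, num_ratings)
print(product_status(4.25, 4))   # well rated
print(product_status(2.0, 50))   # poorly rated, many ratings
print(product_status(2.0, 3))    # poorly rated, few ratings
```

Applied row by row, for example with `df.apply(lambda r: product_status(r['avg_rating'], r['num_ratings']), axis=1)`, this would label every product in the preprocessed file, including those that neither filter selects.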


## 3. Which products are junk?
Here we chose a strategy similar to the one in the previous question. However, a product can be deemed junk without necessarily deciding to drop it. That is why the filter has fewer conditions in this case: the only requirement is an average rating under 2 stars out of 5.

In [73]:
# Thresholds
q3_rating_threshold = 2

# Filtering results
q3_filtered_df = df[df['avg_rating'] < q3_rating_threshold]
q3_junk = q3_filtered_df['product_name']
q3_answer = f"The following products are junk:\n{q3_junk}"
print(q3_answer)

The following products are junk:
23    Oem Amazon Kindle Power Usb Adapter Wall Trave...
Name: product_name, dtype: object


## 4. Which product should be recommended to customer?
We consider that the `doRecommend` field in the original raw data (before preprocessing) would have been very useful here. However, it was empty in all cases, so this information could not be used. That is why we resort in this case to products with good reviews (average rating above 4.5) and a large number of them (more than 100).

In [74]:
# Thresholds
q4_rating_threshold = 4.5
q4_num_reviews_threshold = 100

# Filtering results
q4_filtered_df = df[(df['num_ratings'] > q4_num_reviews_threshold) & (df['avg_rating'] > q4_rating_threshold)]
q4_recommended = q4_filtered_df['product_name']
q4_answer = f"The following products should be recommended to the customer:\n{q4_recommended}"
print(q4_answer)

The following products should be recommended to the customer:
19    Amazon Tap Smart Assistant Alexaenabled (black...
26    Kindle Voyage E-reader, 6 High-Resolution Disp...
30    All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
31    Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16...
32    All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
33    All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
37    Fire Tablet, 7 Display, Wi-Fi, 16 GB - Include...
39    All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
42    Amazon Fire HD 8 with Alexa (8" HD Display Tab...
46    All-New Fire HD 8 Kids Edition Tablet, 8 HD Di...
47    All-New Fire HD 8 Kids Edition Tablet, 8 HD Di...
48    All-New Fire HD 8 Tablet with Alexa, 8 HD Disp...
49    All-New Fire HD 8 Tablet with Alexa, 8 HD Disp...
50    Fire Tablet with Alexa, 7 Display, 16 GB, Mage...
52    Fire Tablet with Alexa, 7 Display, 16 GB, Blue...
55    Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16...
56    Fire HD 8 Tablet with Alexa, 8 HD Di

## 5. Which consumer products are the best products?
We consider these to be the products with an average rating above 4.5, a large number of reviews (more than 500, a higher bar than in the previous question) and a high proportion of positive reviews (more than half). The review-count requirement is stricter here because applying such a high bar to recommendations would be unfair to products that are good but have slightly fewer reviews: they would not be recommended, and the gap between them and heavily reviewed products would widen. We also add a constraint on positive review sentiment here, unlike in the previous question. We left it out of the recommendations because we did not want to influence the consumer with other consumers' opinions, which would make early reviews, to some extent, more valuable, important or authentic than later ones.


In [75]:
# Thresholds
q5_rating_threshold = 4.5
q5_num_reviews_threshold = 500
q5_review_analysis_threshold = 0.5

# Filtering results
q5_filtered_df = df[(df['num_ratings'] > q5_num_reviews_threshold) & (df['avg_rating'] > q5_rating_threshold) & (df['percent_pos'] > q5_review_analysis_threshold)]
q5_best = q5_filtered_df['product_name']
q5_answer = f"The following are the best products:\n{q5_best}"
print(q5_answer)

The following are the best products:
30    All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
64    Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...
Name: product_name, dtype: object


## 6. Which products should be planned for inventory for coming winter? 
We decided to choose the products that have a good average rating and are mostly bought in winter. For this, we selected those for which the ratio of purchases in winter months to all purchases is higher than 0.5 (meaning more than half of the purchases of this product happen in winter). We also required a minimum number of reviews (more than 20), because otherwise a product with a single purchase that just happened to fall in winter would qualify. Finally, we required a relatively positive rating (average above 4), as people are likely to order a product again next year if they were happy with it before. Conversely, if the reviews from the previous year are bad, demand is less likely, as users will see the rating and not want to order it. For more details on the calculations done on the raw data and on adapting them to other time periods, refer to `preprocessing_pipeline.py`.

In [79]:
# Thresholds
q6_season_threshold = 0.5
q6_num_reviews = 20
q6_rating_threshold = 4

# Filtering results
q6_filtered_df = df[(df['percent_winter'] > q6_season_threshold) & (df['num_ratings'] > q6_num_reviews) & (df['avg_rating'] > q6_rating_threshold)]
q6_best = q6_filtered_df['product_name']
q6_answer = f"The following are products to be planned in inventory for winter:\n{q6_best}"
print(q6_answer)

The following are products to be planned in inventory for winter:
6     AmazonBasics AA Performance Alkaline Batteries...
10         AmazonBasics 15.6-Inch Laptop and Tablet Bag
14    Amazon 9W PowerFast Official OEM USB Charger a...
19    Amazon Tap Smart Assistant Alexaenabled (black...
31    Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16...
35    Kindle Voyage E-reader, 6 High-Resolution Disp...
54    All-New Kindle Oasis E-reader - 7 High-Resolut...
55    Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16...
Name: product_name, dtype: object
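
The actual `percent_winter` calculation lives in `preprocessing_pipeline.py`. As a minimal sketch of the idea, the code below computes a per-product winter share from (product, date) pairs. It assumes that the review date stands in for the purchase date and that winter means December, January and February; both are assumptions, as are the names `WINTER_MONTHS` and `winter_share` and the example data.

```python
from collections import defaultdict
from datetime import date

# Assumption: "winter" means December, January and February.
# Changing this set adapts the calculation to another time period.
WINTER_MONTHS = {12, 1, 2}


def winter_share(reviews):
    """Return {product_id: share of reviews dated in a winter month}.

    `reviews` is an iterable of (product_id, review_date) pairs, where the
    review date is used as a proxy for the purchase date (an assumption).
    """
    totals = defaultdict(int)
    winter = defaultdict(int)
    for product_id, review_date in reviews:
        totals[product_id] += 1
        if review_date.month in WINTER_MONTHS:
            winter[product_id] += 1
    return {pid: winter[pid] / totals[pid] for pid in totals}


# Hypothetical reviews for two products
example_reviews = [
    ("A", date(2017, 12, 24)),
    ("A", date(2018, 1, 3)),
    ("A", date(2017, 7, 15)),
    ("B", date(2017, 6, 1)),
]
example_winter = winter_share(example_reviews)
print(example_winter)
```

In this example, product A has two of its three reviews in winter and would pass the 0.5 threshold, while product B would not.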


## 7. Which products require advertisement?
We established that the products requiring advertisement are those with good ratings and positive reviews, but a low number of them. Advertising would compensate for these low numbers: products with these characteristics would be part of a nudge program.

In [85]:
# Thresholds
q7_review_analysis_threshold = 0.5
q7_num_reviews = 20
q7_rating_threshold = 4.5

# Filtering results
q7_filtered_df = df[(df['percent_pos'] > q7_review_analysis_threshold) & (df['num_ratings'] < q7_num_reviews) & (df['avg_rating'] > q7_rating_threshold)]
q7_best = q7_filtered_df['product_name']
q7_answer = f"The following are products that require advertisement:\n{q7_best}"
print(q7_answer)

The following are products that require advertisement:
15    Kindle PowerFast International Charging Kit (f...
24        AmazonBasics 16-Gauge Speaker Wire - 100 Feet
Name: product_name, dtype: object


## 8. In the list of opinions O, how many quintuples have positive sentiment s?