# **Amazon Product Review: Business problem**
### _Authors: Magdalena Szymanowska & Sofia Llàcer Caro_
**Perform a sentiment analysis of Amazon reviews using NLTK and other required Python
packages, and solve the business problem for Amazon as stated through the questions below.**

To prepare this report, we first preprocessed the review data with the script `preprocessing_pipeline.py`. This produces a file containing only the fields relevant to the questions, so that this notebook can focus on how those fields are used. For more details on the preprocessing pipeline, how the data was generated and the tools used, see the Python source file included in this directory.

In [5]:
import pandas as pd

# Import data
df = pd.read_csv("preprocessed_data.csv")
df.head()

| Unnamed: 0 | product | avg_rating | num_ratings | percent_neg | percent_pos | percent_neu | percent_winter | product_name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | AVpgNzjwLJeJML43Kpxn | 4.44804 | 8343 | 0.02565 | 0.461704 | 0.512645 | 0.0 | AmazonBasics AAA Performance Alkaline Batterie... |
| 1 | AVpe7nGV1cnluZ0-aG2o | 4.25 | 4 | 0.0 | 0.5 | 0.5 | 0.0 | AmazonBasics AAA Performance Alkaline Batterie... |
| 2 | AVpfl8cLLJeJML43AE3S | 5.0 | 2 | 0.0 | 0.0 | 1.0 | 0.0 | AmazonBasics AAA Performance Alkaline Batterie... |
| 3 | AWK8z0pOIwln0LfXlSxH | 5.0 | 1 | 0.0 | 0.0 | 1.0 | 0.0 | AmazonBasics AAA Performance Alkaline Batterie... |
| 4 | AWYAV-i9Iwln0LfXqrUq | 4.5 | 2 | 0.0 | 0.0 | 1.0 | 0.0 | AmazonBasics AAA Performance Alkaline Batterie... |
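
The per-product columns shown above could, in principle, be obtained by aggregating a table with one row per review. The following is only a minimal sketch of that idea, not the actual content of `preprocessing_pipeline.py`: the per-review column names (`rating`, `sentiment`), the labels (`pos`, `neu`, `neg`) and the toy values are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical per-review table; column names and values are illustrative only.
reviews = pd.DataFrame({
    "product": ["A", "A", "A", "B", "B"],
    "rating": [5, 4, 1, 5, 5],
    "sentiment": ["pos", "neu", "neg", "neu", "neu"],
})

# One row per product: mean rating and number of ratings.
per_product = reviews.groupby("product").agg(
    avg_rating=("rating", "mean"),
    num_ratings=("rating", "count"),
)

# Share of reviews per sentiment label, as fractions between 0 and 1.
shares = pd.crosstab(reviews["product"], reviews["sentiment"], normalize="index")
shares = shares.reindex(columns=["neg", "pos", "neu"], fill_value=0.0)
shares.columns = ["percent_neg", "percent_pos", "percent_neu"]

# Combine both pieces into a single per-product summary.
summary = per_product.join(shares).reset_index()
print(summary)
```

In this sketch the `percent_*` columns are fractions rather than percentages, which matches the value ranges visible in the preview above.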


## 1. Which products should be kept?
We decided to keep the products that have many reviews, most of them good, as these are the ones consumers like best. In the filter below, the only condition applied is an average rating above 2 stars; no minimum number of reviews is enforced.

In [33]:
# Thresholds
q1_rating_threshold = 2

# Filtering results
q1_filtered_df = df[df['avg_rating'] > q1_rating_threshold]
q1_kept = q1_filtered_df['product_name']
q1_answer = f"The following products should be kept:\n{q1_kept}"
print(q1_answer)

The following products should be kept:
0     AmazonBasics AAA Performance Alkaline Batterie...
1     AmazonBasics AAA Performance Alkaline Batterie...
2     AmazonBasics AAA Performance Alkaline Batterie...
3     AmazonBasics AAA Performance Alkaline Batterie...
4     AmazonBasics AAA Performance Alkaline Batterie...
                            ...                        
60    AmazonBasics AAA Performance Alkaline Batterie...
61    AmazonBasics AAA Performance Alkaline Batterie...
62    AmazonBasics AAA Performance Alkaline Batterie...
63    AmazonBasics AAA Performance Alkaline Batterie...
64    AmazonBasics AAA Performance Alkaline Batterie...
Name: product_name, Length: 64, dtype: object


## 2. Which products should be dropped?
Dropping a product is a critical decision, as it means removing it from the market, so here we also require the number of reviews to exceed a threshold. That way, products with low ratings but few reviews still get a chance. We realise that the kept and dropped sets are therefore not exact opposites of each other: the constraint on the number of ratings is applied only here, so that products with a low number of reviews go through a probation-like transition period instead of moving directly from one category to the other.

In [35]:
# Thresholds
q2_rating_threshold = 2
q2_num_reviews_threshold = 20

# Filtering results
q2_filtered_df = df[(df['num_ratings'] > q2_num_reviews_threshold) & (df['avg_rating'] < q2_rating_threshold)]
q2_dropped = q2_filtered_df['product_name']
q2_answer = f"The following products should be dropped:\n{q2_dropped}"
print(q2_answer)

The following products should be dropped:
Series([], Name: product_name, dtype: object)
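
The probation-like transition described above can be made explicit by assigning each product to one of three categories. This is a minimal sketch on toy data, reusing the thresholds from questions 1 and 2; the function name `categorise`, the category labels and the toy products are hypothetical and not part of the original analysis.

```python
import pandas as pd

# Toy data; product names and values are illustrative only.
toy = pd.DataFrame({
    "product_name": ["Product A", "Product B", "Product C"],
    "avg_rating": [4.5, 1.5, 1.0],
    "num_ratings": [300, 50, 3],
})

def categorise(row, rating_threshold=2, num_reviews_threshold=20):
    """Return 'keep', 'drop' or 'probation' for one product row."""
    if row["avg_rating"] > rating_threshold:
        return "keep"
    if row["avg_rating"] < rating_threshold and row["num_ratings"] > num_reviews_threshold:
        return "drop"
    # Low rating with too few reviews, or a rating exactly at the threshold.
    return "probation"

toy["category"] = toy.apply(categorise, axis=1)
print(toy[["product_name", "category"]])
```

Because both filters use strict inequalities, a product with an average rating of exactly 2 is neither kept nor dropped; in this sketch it ends up on probation as well.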


## 3. Which products are junk?
Here, we chose a strategy similar to the one in the previous question. However, a product can be deemed junk without us necessarily deciding to drop it. That is why the filtering is less restrictive in this case: the only condition is an average rating under 2 stars out of 5.

In [38]:
# Thresholds
q3_rating_threshold = 2

# Filtering results
q3_filtered_df = df[df['avg_rating'] < q3_rating_threshold]
q3_junk = q3_filtered_df['product_name']
q3_answer = f"The following products are junk:\n{q3_junk}"
print(q3_answer)

The following products are junk:
23    AmazonBasics AAA Performance Alkaline Batterie...
Name: product_name, dtype: object


## 4. Which product should be recommended to customer?
The `doRecommend` field in the original raw data (before preprocessing) would have been very useful here. However, it was empty in all cases, so this information could not be used. We therefore resort to products with good ratings (average rating above 4.5) and many of them (more than 100).

In [47]:
# Thresholds
q4_rating_threshold = 4.5
q4_num_reviews_threshold = 100

# Filtering results
q4_filtered_df = df[(df['num_ratings'] > q4_num_reviews_threshold) & (df['avg_rating'] > q4_rating_threshold)]
q4_recommended = q4_filtered_df['product_name']
q4_answer = f"The following products should be recommended to the customer:\n{q4_recommended}"
print(q4_answer)

The following products should be recommended to the customer:
19    AmazonBasics AAA Performance Alkaline Batterie...
26    AmazonBasics AAA Performance Alkaline Batterie...
30    AmazonBasics AAA Performance Alkaline Batterie...
31    AmazonBasics AAA Performance Alkaline Batterie...
32    AmazonBasics AAA Performance Alkaline Batterie...
33    AmazonBasics AAA Performance Alkaline Batterie...
37    AmazonBasics AAA Performance Alkaline Batterie...
39    AmazonBasics AAA Performance Alkaline Batterie...
42    AmazonBasics AAA Performance Alkaline Batterie...
46    AmazonBasics AAA Performance Alkaline Batterie...
47    AmazonBasics AAA Performance Alkaline Batterie...
48    AmazonBasics AAA Performance Alkaline Batterie...
49    AmazonBasics AAA Performance Alkaline Batterie...
50    AmazonBasics AAA Performance Alkaline Batterie...
52    AmazonBasics AAA Performance Alkaline Batterie...
55    AmazonBasics AAA Performance Alkaline Batterie...
56    AmazonBasics AAA Performance Alkalin

## 5. Which consumer products are the best products?
We consider these to be the products with an average rating above 4.5, many reviews (more than in the previous question) and a high proportion of positive reviews. The requirement on the number of reviews is stricter here than for recommendations: a high threshold there would leave good products with slightly fewer reviews unrecommended, which would be unfair to them and would widen the gap between the two groups. Finally, the positive-sentiment constraint is applied here but not in the previous question, because we did not want recommendations to be influenced by other consumers' opinions; otherwise early reviews would become, to some extent, more valuable, important or authentic than later ones.


In [49]:
# Thresholds
q5_rating_threshold = 4.5
q5_num_reviews_threshold = 500
q5_review_analysis_threshold = 0.5

# Filtering results
q5_filtered_df = df[(df['num_ratings'] > q5_num_reviews_threshold) & (df['avg_rating'] > q5_rating_threshold) & (df['percent_pos'] > q5_review_analysis_threshold)]
q5_best = q5_filtered_df['product_name']
q5_answer = f"The following are the best products:\n{q5_best}"
print(q5_answer)

The following are the best products:
30    AmazonBasics AAA Performance Alkaline Batterie...
64    AmazonBasics AAA Performance Alkaline Batterie...
Name: product_name, dtype: object


## 6. Which products should be planned for inventory for coming winter? 
(Good reviews and more buying during winter)
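
This question could be approached with the same filtering pattern as the previous ones, using the `percent_winter` column of the preprocessed data. The sketch below runs on toy data and assumes that `percent_winter` is the fraction of a product's reviews written during winter months, which this notebook does not state explicitly. Both thresholds are hypothetical and would need to be tuned on the real data.

```python
import pandas as pd

# Toy data mirroring the relevant columns; values are illustrative only.
toy = pd.DataFrame({
    "product_name": ["Product A", "Product B", "Product C"],
    "avg_rating": [4.6, 4.8, 3.1],
    "percent_winter": [0.40, 0.05, 0.60],
})

# Thresholds (assumed for illustration)
q6_rating_threshold = 4
q6_winter_threshold = 0.25

# Filtering results: good ratings and a sizeable share of winter reviews
q6_filtered_df = toy[(toy["avg_rating"] > q6_rating_threshold) & (toy["percent_winter"] > q6_winter_threshold)]
q6_winter = q6_filtered_df["product_name"]
q6_answer = f"The following products should be planned for winter inventory:\n{q6_winter}"
print(q6_answer)
```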


## 7. Which products require advertisement? 
(Few reviews, good ratings, would recommend)
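
One possible reading of this hint is to look for products with good ratings but few reviews, again with the filtering pattern used in the earlier questions. The sketch below uses toy data and, as an assumption, reuses the thresholds from question 4 with the review-count condition inverted. The "would recommend" criterion is left out, since the `doRecommend` field was empty in the raw data.

```python
import pandas as pd

# Toy data mirroring the relevant columns; values are illustrative only.
toy = pd.DataFrame({
    "product_name": ["Product A", "Product B", "Product C"],
    "avg_rating": [4.7, 4.9, 2.5],
    "num_ratings": [12, 900, 8],
})

# Thresholds (assumed: same values as in question 4)
q7_rating_threshold = 4.5
q7_num_reviews_threshold = 100

# Filtering results: good ratings but fewer reviews than the threshold
q7_filtered_df = toy[(toy["num_ratings"] < q7_num_reviews_threshold) & (toy["avg_rating"] > q7_rating_threshold)]
q7_advertise = q7_filtered_df["product_name"]
q7_answer = f"The following products require advertisement:\n{q7_advertise}"
print(q7_answer)
```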


## 8. In the list of opinions O, how many quintuples have positive sentiment s?
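
Assuming the question refers to the usual opinion quintuple (entity, aspect, sentiment, holder, time), counting the positive ones amounts to a filter over the list. The sketch below uses a hypothetical `Opinion` named tuple and made-up opinions; how the quintuples would actually be extracted from the review text is outside its scope.

```python
from collections import namedtuple

# Hypothetical quintuple layout: (entity, aspect, sentiment, holder, time).
Opinion = namedtuple("Opinion", ["entity", "aspect", "sentiment", "holder", "time"])

# Made-up list of opinions O, for illustration only.
O = [
    Opinion("batteries", "duration", "positive", "reviewer_1", "2017-01-05"),
    Opinion("batteries", "price", "negative", "reviewer_2", "2017-02-11"),
    Opinion("batteries", "packaging", "positive", "reviewer_3", "2017-03-20"),
    Opinion("batteries", "duration", "neutral", "reviewer_4", "2017-04-02"),
]

# Count the quintuples whose sentiment component s is positive.
num_positive = sum(1 for opinion in O if opinion.sentiment == "positive")
print(f"{num_positive} of {len(O)} quintuples have positive sentiment")
```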