# Rating Product and Sorting Reviews in Amazon

### Business Problem

📌 One of the most important problems in e-commerce is calculating post-purchase product ratings correctly. Solving it means greater customer satisfaction for the e-commerce site, better visibility for sellers' products, and a seamless shopping experience for buyers. A second problem is ordering product reviews correctly. Misleading reviews that rise to the top directly affect a product's sales, causing both financial loss and lost customers. Solving these two problems helps e-commerce sites and sellers increase sales and lets customers complete their purchases without trouble.





### Dataset Story

📌 This dataset contains Amazon product data, including product categories and various metadata. It covers the user ratings and reviews of the most-reviewed product in the Electronics category.

Variables:

reviewerID: User ID

asin: Product ID

reviewerName: Username

helpful: Helpfulness rating of the review

reviewText: Review

overall: Product rating

summary: Review summary

unixReviewTime: Review time (Unix timestamp)

reviewTime: Raw review time

day_diff: Number of days since the review

helpful_yes: The number of times the review was found helpful

total_vote: Number of votes given to the review

## Rating Products

In [42]:
# Reading the Data Set
import numpy as np
import pandas as pd
pd.set_option("display.max_columns",None)
pd.set_option("display.max_rows",500)
pd.set_option("display.expand_frame_repr",False)
pd.set_option("display.float_format",lambda x: '%.5f' % x)
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/amazon_review.csv")
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote
0,A3SBTW3WS4IQSN,B007WTAJTO,,"[0, 0]",No issues.,4.0,Four Stars,1406073600,2014-07-23,138,0,0
1,A18K1ODH1I2MVB,B007WTAJTO,0mie,"[0, 0]","Purchased this for my device, it worked as adv...",5.0,MOAR SPACE!!!,1382659200,2013-10-25,409,0,0
2,A2FII3I2MBMUIA,B007WTAJTO,1K3,"[0, 0]",it works as expected. I should have sprung for...,4.0,nothing to really say....,1356220800,2012-12-23,715,0,0
3,A3H99DFEG68SR,B007WTAJTO,1m2,"[0, 0]",This think has worked out great.Had a diff. br...,5.0,Great buy at this price!!! *** UPDATE,1384992000,2013-11-21,382,0,0
4,A375ZM4U047O79,B007WTAJTO,2&amp;1/2Men,"[0, 0]","Bought it with Retail Packaging, arrived legit...",5.0,best deal around,1373673600,2013-07-13,513,0,0


In [43]:
df["overall"].mean()

4.587589013224822

### Calculation of the Average Score of the Product

In [44]:
def average_score_product(dataframe,column_time,column_rating):
  # Quartile cut points of the time column (with day_diff, small values are recent reviews)
  time = dataframe[column_time]
  q25, q50, q75 = time.quantile([0.25, 0.50, 0.75])
  # Mean rating per time bucket. The middle buckets use strict inequalities,
  # so rows that fall exactly on the 50% cut point are left out of every bucket.
  AV_0_25 = dataframe.loc[time <= q25, column_rating].mean()
  AV_25_50 = dataframe.loc[(time > q25) & (time < q50), column_rating].mean()
  AV_50_75 = dataframe.loc[(time > q50) & (time < q75), column_rating].mean()
  AV_75_100 = dataframe.loc[time >= q75, column_rating].mean()
  return AV_0_25,AV_25_50,AV_50_75,AV_75_100

In [45]:
average_score_product(df,"day_diff","overall")

(4.6957928802588995, 4.637335526315789, 4.571428571428571, 4.446791226645004)
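
Because the middle buckets in `average_score_product` use strict inequalities, a review whose `day_diff` equals the median belongs to no bucket. As a hedged alternative, the sketch below uses `pd.qcut` to assign every row to exactly one quartile before averaging. The helper name `quartile_means` and the toy data are illustrative and not part of the original notebook; on the real data its results may differ slightly from the figures above.

```python
import pandas as pd

def quartile_means(dataframe, column_time, column_rating):
    # Hypothetical helper: label each row with its quartile of column_time,
    # then average column_rating within each quartile.
    buckets = pd.qcut(dataframe[column_time], q=4,
                      labels=["0-25", "25-50", "50-75", "75-100"])
    return dataframe.groupby(buckets, observed=True)[column_rating].mean()

# Toy data (made up for illustration): eight reviews, two per quartile
toy = pd.DataFrame({"day_diff": [1, 2, 3, 4, 5, 6, 7, 8],
                    "overall":  [5, 5, 5, 4, 4, 4, 3, 3]})
means = quartile_means(toy, "day_diff", "overall")
print(means)
```

If many rows share the same `day_diff`, `pd.qcut` can raise an error about duplicate bin edges; passing `duplicates="drop"` (and dropping the fixed labels) is one way around that.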

### Time Based Weighted Average

In [46]:
def time_based_weighted_average(dataframe,column_time,column_rating,w1=28,w2=26,w3=24,w4=22):
  # Weights are percentages and should sum to 100. w1 applies to the lowest quartile
  # of column_time, which is the most recent reviews when column_time is day_diff.
  time = dataframe[column_time]
  q25, q50, q75 = time.quantile([0.25, 0.50, 0.75])
  return dataframe.loc[time <= q25, column_rating].mean() * w1/100 + \
         dataframe.loc[(time > q25) & (time < q50), column_rating].mean() * w2/100 + \
         dataframe.loc[(time > q50) & (time < q75), column_rating].mean() * w3/100 + \
         dataframe.loc[time >= q75, column_rating].mean() * w4/100

In [47]:
time_based_weighted_average(df,"day_diff","overall")

4.595966170319355
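
The weighted result can be checked by hand from the four quartile means printed earlier. The short sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Quartile means copied from the average_score_product output in this notebook
quartile_means = [4.6957928802588995, 4.637335526315789,
                  4.571428571428571, 4.446791226645004]
weights = [28, 26, 24, 22]  # default w1..w4, in percent

# The weights must add up to 100 for the result to be a weighted average
assert sum(weights) == 100

weighted = sum(m * w / 100 for m, w in zip(quartile_means, weights))
print(round(weighted, 5))  # 4.59597
```

The result sits slightly above the plain mean of 4.58759 because the most recent quartile has the highest average rating and also the largest weight.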

## Sorting Reviews

In [48]:
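# Preprocessing
# Votes that did not find the review helpful
df["helpful_no"] = df["total_vote"] - df["helpful_yes"]
# Keep only the columns needed for sorting; .copy() avoids a SettingWithCopyWarning when new columns are added later
df = df[["reviewerName", "overall", "summary", "helpful_yes", "helpful_no", "total_vote", "reviewTime"]].copy()
df.head()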

Unnamed: 0,reviewerName,overall,summary,helpful_yes,helpful_no,total_vote,reviewTime
0,,4.0,Four Stars,0,0,0,2014-07-23
1,0mie,5.0,MOAR SPACE!!!,0,0,0,2013-10-25
2,1K3,4.0,nothing to really say....,0,0,0,2012-12-23
3,1m2,5.0,Great buy at this price!!! *** UPDATE,0,0,0,2013-11-21
4,2&amp;1/2Men,5.0,best deal around,0,0,0,2013-07-13


### Up-Down Diff Score (score positive-negative diff)

In [49]:
def up_down_diff_score(up, down):
  return up - down

In [50]:
df["score_pos_neg_diff"] = df.apply(lambda x: up_down_diff_score(x["helpful_yes"],x["helpful_no"]),axis=1)

In [51]:
df.sort_values("score_pos_neg_diff",ascending=False).head()

Unnamed: 0,reviewerName,overall,summary,helpful_yes,helpful_no,total_vote,reviewTime,score_pos_neg_diff
2031,"Hyoun Kim ""Faluzure""",5.0,UPDATED - Great w/ Galaxy S4 & Galaxy Tab 4 10...,1952,68,2020,2013-01-05,1884
4212,SkincareCEO,1.0,1 Star reviews - Micro SDXC card unmounts itse...,1568,126,1694,2013-05-08,1442
3449,NLee the Engineer,5.0,Top of the class among all (budget-priced) mic...,1428,77,1505,2012-09-26,1351
317,"Amazon Customer ""Kelly""",1.0,"Warning, read this!",422,73,495,2012-02-09,349
3981,"R. Sutton, Jr. ""RWSynergy""",5.0,"Resolving confusion between ""Mobile Ultra"" and...",112,27,139,2012-10-22,85
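
The difference score rewards sheer vote volume. A quick check with two rows from this notebook's output tables shows a review with a lower share of helpful votes outranking one with a higher share. This is only a sketch; the dictionary and variable names are illustrative.

```python
# (helpful_yes, helpful_no) pairs taken from the output tables in this notebook
reviews = {
    'Amazon Customer "Kelly"': (422, 73),
    "Twister": (45, 4),
}

scores = {}
for name, (up, down) in reviews.items():
    diff = up - down            # up-down difference score
    share = up / (up + down)    # share of helpful votes
    scores[name] = (diff, share)
    print(f"{name}: diff={diff}, helpful share={share:.5f}")
```

Sorting by the difference puts the first review ahead (349 against 41), even though the second has the larger helpful share.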


### Average Rating Score


In [52]:
def average_rating_score(up, down):
  if up+down==0:
    return 0
  return up / (up+down)

In [53]:
df["score_average_rating"] = df.apply(lambda x: average_rating_score(x["helpful_yes"],x["helpful_no"]),axis=1)

In [54]:
df.sort_values("score_average_rating", ascending=False).head()

Unnamed: 0,reviewerName,overall,summary,helpful_yes,helpful_no,total_vote,reviewTime,score_pos_neg_diff,score_average_rating
4277,S. Q.,5.0,Perfect!!,1,0,1,2012-12-19,1,1.0
2881,Lou Thomas,5.0,Nexus One Loves This Card!,1,0,1,2012-01-10,1,1.0
1073,C. Sanchez,5.0,Tons of space for phone,1,0,1,2013-08-13,1,1.0
445,"Apache ""Elizabeth""",4.0,Amazon Great Prices,1,0,1,2013-12-18,1,1.0
3923,Rock Your Roots,5.0,What more to say?,1,0,1,2013-12-30,1,1.0
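
The top of this ranking is filled with reviews that received a single helpful vote. The sketch below repeats the notebook's function on two vote counts from the data to show why: the ratio ignores how many votes stand behind it.

```python
def average_rating_score(up, down):
    # Same logic as the notebook's function: share of helpful votes, 0 if no votes
    if up + down == 0:
        return 0
    return up / (up + down)

single_vote = average_rating_score(1, 0)      # one helpful vote, nothing else
many_votes = average_rating_score(1952, 68)   # the review with the most votes

print(single_vote, round(many_votes, 5))  # 1.0 0.96634
```

A review with one vote therefore outranks a review that 1952 of 2020 voters found helpful.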


### Wilson Lower Bound Score


In [55]:
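import scipy.stats as st
import math
def wilson_lower_bound(up, down, confidence=0.95):
    # Lower bound of the Wilson score interval for the share of helpful votes,
    # treating each vote as a Bernoulli trial (up = helpful, down = not helpful)
    n = up + down
    if n == 0:
        return 0  # no votes: rank at the bottom
    z = st.norm.ppf(1 - (1 - confidence) / 2)  # two-sided z-score for the confidence level
    phat = 1.0 * up / n  # observed share of helpful votes
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)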

In [56]:
df["wilson_lower_bound"] = df.apply(lambda x: wilson_lower_bound(x["helpful_yes"],x["helpful_no"]),axis=1)

In [58]:
df.sort_values("wilson_lower_bound",ascending=False).head()

Unnamed: 0,reviewerName,overall,summary,helpful_yes,helpful_no,total_vote,reviewTime,score_pos_neg_diff,score_average_rating,wilson_lower_bound
2031,"Hyoun Kim ""Faluzure""",5.0,UPDATED - Great w/ Galaxy S4 & Galaxy Tab 4 10...,1952,68,2020,2013-01-05,1884,0.96634,0.95754
3449,NLee the Engineer,5.0,Top of the class among all (budget-priced) mic...,1428,77,1505,2012-09-26,1351,0.94884,0.93652
4212,SkincareCEO,1.0,1 Star reviews - Micro SDXC card unmounts itse...,1568,126,1694,2013-05-08,1442,0.92562,0.91214
317,"Amazon Customer ""Kelly""",1.0,"Warning, read this!",422,73,495,2012-02-09,349,0.85253,0.81858
4672,Twister,5.0,Super high capacity!!! Excellent price (on Am...,45,4,49,2014-07-03,41,0.91837,0.80811
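
To see how the lower bound treats small samples, the sketch below applies the same formula to three vote counts from this notebook. As an assumption made only to keep the snippet dependency-free, it takes the z-score from the standard library's `statistics.NormalDist` instead of `scipy.stats`.

```python
import math
from statistics import NormalDist

def wilson_lower_bound(up, down, confidence=0.95):
    # Same formula as the notebook's function; NormalDist().inv_cdf stands in
    # for st.norm.ppf here only to avoid the scipy dependency.
    n = up + down
    if n == 0:
        return 0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = up / n
    return (phat + z * z / (2 * n)
            - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)

# (helpful_yes, helpful_no) pairs that appear in the tables above
bounds = {(up, down): wilson_lower_bound(up, down)
          for up, down in [(1, 0), (45, 4), (1952, 68)]}
for (up, down), wlb in bounds.items():
    print(up, down, round(wlb, 5))
```

The single-vote review, which scored 1.0 on the average rating, drops to roughly 0.21 here, while the heavily voted reviews keep a bound close to their observed share.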
