# Exploratory analysis of Amazon instrument reviews
First we are going to take a look at the given dataset. I don't know how the data was created or whether it was preprocessed before it was uploaded, but maybe we can draw some conclusions about prior processing from the data itself.

The data table contains 9 columns and 10261 rows. 

* reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
* asin - ID of the product, e.g. 0000013714
* reviewerName - name of the reviewer
* helpful - helpfulness rating of the review, e.g. 2/3
* reviewText - text of the review
* overall - rating of the product
* summary - summary of the review
* unixReviewTime - time of the review (unix time)
* reviewTime - time of the review (raw)

Let's start with a quick first look at the data to see whether anything is missing.

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
######## HELPER FUNCTIONS ##########
def look_at_categories(df,col):
    # general look
    print(df[col].describe())
    # how often do reviewers review?
    count_reviews = df[col].value_counts()

    #print(count_reviews)
    print('mean',count_reviews.mean())
    print('min',count_reviews.min())
    print('max',count_reviews.max())

    print('0.5',count_reviews.quantile(0.5))
    print('0.95',count_reviews.quantile(0.95))
    print('0.99',count_reviews.quantile(0.99))

    return count_reviews
    
def look_at_numerics(df,col):
    print(df[col].describe())
    print('mean',df[col].mean())
    print('min',df[col].min())
    print('max',df[col].max())

    print('0.5',df[col].quantile(0.5))
    print('0.95',df[col].quantile(0.95))
    print('0.99',df[col].quantile(0.99))
    
    return df[col]

    
def draw_hist(count_reviews,ylabel,xlabel='number of reviews',bins=30):

    fig  = plt.figure()
    #hist
    plt.hist(count_reviews,bins=bins)
    # vertical lines
    plt.axvline(count_reviews.mean(),color='yellow',label='mean')
    plt.axvline(count_reviews.quantile(0.95),color='orange',label='Top 5%')
    plt.axvline(count_reviews.quantile(0.99),color='red',label='Top 1%')
    # labels
    plt.legend()
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    # set size
    fig.set_size_inches(18.5, 10.5)

In [None]:
# read data
df = pd.read_csv('/kaggle/input/amazon-music-reviews/Musical_instruments_reviews.csv')
# Is the shape as anticipated?
print('Is the shape as anticipated? ',df.shape==(10261,9))
# Are the columns as anticipated and what kind of index is used? 
print('Are the columns as anticipated and what kind of index is used?')
print('index: ',df.head().index)
print('columns: ', df.columns)
# Are there any NaN entries?
print('Number of nan entries? \n',df.isna().sum())
# Are there any duplicated rows?
print('Are there any duplicated rows? ',df.duplicated().sum()>0)

# First Look
## Has the data the expected shape?
Yes, it's (10261, 9).
## What do the columns and index look like?
The columns are exactly as described and the index is just a running integer index.
## Are there any NaN entries in the table?
Yes. There are 27 missing reviewer names and 7 missing review texts.

### The number of columns is very manageable, so we can take a quick look at all of them.

In [None]:
# Can I fill in the missing names from other rows of the same reviewer?
print('Can I fill in the missing names?')
no_names_id = df[df['reviewerName'].isna()]['reviewerID'].unique()
i_can = 'No'
for rev_id in no_names_id:
    # does this reviewer ID appear with a non-missing name in any row?
    if df[df['reviewerID'] == rev_id]['reviewerName'].notna().any():
        i_can = 'Yes'
print(i_can)
    
###
count_reviews_rvid = look_at_categories(df,'reviewerID')


# Reviewer ID
## How many reviewers are there?
There are 1429 reviewers writing 10261 reviews.
## How many reviews are individual reviewers writing?
Reviewers write between 5 and 42 reviews. Oddly, there aren't any reviewers who wrote fewer than 5 reviews. It seems that either all reviewers without at least 5 reviews were removed, or the method of obtaining the reviews didn't capture infrequent reviewers.
**This means we are most likely looking only at frequent reviewers instead of all reviewers**. For the rest of the analysis we assume that reviewers with fewer than 5 reviews do exist, but were removed from the dataset or not captured.
#### Reviewers in numbers
* Minimum: 5
* Mean: 7.18
* Median: 6
* 95% Quantile: 14
* 99% Quantile: 23
* Maximum: 42



##### The data is visualized in the following histogram.



In [None]:
draw_hist(count_reviews_rvid,'number of reviewers')

The histogram shows that 5 is by far the most common number of reviews per reviewer. These reviewers make up more than a third of all reviewers (558 of 1429).

My first thought was that there might be an error in the data acquisition that converts reviewers with fewer than 5 reviews into reviewers with 5 reviews. In that case there would be a huge drop in reviewers from 5 to 6, but there are 558 reviewers with 5 reviews and 296 reviewers with 6 reviews, which is a very reasonable progression.

Another notable thing is that the span of the upper 1% (23 to 42, a range of 19) is bigger than the span of the lower 99% (5 to 23, a range of 18).
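The counts quoted above (how many reviewers wrote exactly 5, 6, ... reviews) can be read off by applying `value_counts` twice. The following is only a small sketch on made-up toy data; the `toy` frame and its IDs are hypothetical and stand in for `df['reviewerID']`.

```python
import pandas as pd

# Hypothetical toy data: reviewer a and b wrote 5 reviews, c wrote 6, d wrote 9.
toy = pd.DataFrame({
    'reviewerID': list('aaaaa') + list('bbbbb') + list('cccccc') + list('ddddddddd')
})

# number of reviews per reviewer
reviews_per_reviewer = toy['reviewerID'].value_counts()

# number of reviewers for each review count (5 -> 2 reviewers, 6 -> 1, 9 -> 1)
reviewers_per_count = reviews_per_reviewer.value_counts().sort_index()

# share of reviewers sitting exactly at the minimum review count
share_at_minimum = (reviews_per_reviewer == reviews_per_reviewer.min()).mean()

print(reviewers_per_count)
print('share of reviewers at the minimum:', share_at_minimum)
```

On the real data the same two lines would be applied to `count_reviews_rvid`, which already holds the reviews per reviewer.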

### Conclusions:
1. The data doesn't contain any reviewers with fewer than 5 reviews
2. The most common number of reviews per reviewer is 5
3. The top 1% of reviewers span a wider range of review counts than the lower 99%


In [None]:
count_reviews_asin = look_at_categories(df,'asin')

# Can a reviewer review a product multiple times? 
# (is there a duplicated row when only looking at asin and reviewerID?)
print('Can a reviewer review a product multiple times? ',df[['asin','reviewerID']].duplicated().sum()>0)


# ASIN
## How many different products are in the dataset?
There are 900 different products in the dataset with at least 5 reviews each. Again it seems like all products that don't have at least 5 reviews were left out of the dataset, either through the method of obtaining the data or through preprocessing.
It's a bit odd that there are exactly 900 different products. An exactly round number in a dataset tends to have a systemic reason, be a human-set limit, or just be a product of randomness. I don't think that there is a systemic reason within Amazon to have a round number of musical instruments, and it also seems odd for a human to pick 900 as a limit instead of 1000. So I guess this 900 just occurs randomly, but there is a chance that there are actually more than 900 instruments on Amazon.
## Can a reviewer review a product multiple times?
Maybe they can, but no one did in this dataset.
## How many reviews are there per product?
#### Products in numbers
* Minimum: 5
* Mean: 11.4
* Median: 8
* 95% Quantile: 29
* 99% Quantile: 67
* Maximum: 163



##### The data is visualized in the following histogram.

In [None]:
draw_hist(count_reviews_asin,'number of products')

We get the same picture we already got from looking at the reviewers. Most products get a low number of reviews, and the products with the most reviews vary strongly in their number of reviews.


### Conclusions:
1. The data doesn't contain any products or any reviewers with fewer than 5 reviews
2. The most common number of reviews per product and per reviewer is 5
3. The top 1% of products and reviewers span a wider range of review counts than the lower 99%


In [None]:


# how often do reviewers review?
count_reviews_rvn = look_at_categories(df,'reviewerName')

# Reviewer name
You would expect reviewer name to behave very much like reviewer ID, because it should just be a worse way of identifying a reviewer.

### Reviewer name in numbers compared
  \  | Reviewer name | Reviewer ID
---|-------------|-------------
Minimum: | 1 | 5
Mean: | 7.32 | 7.18
Median: | 6 | 6
95% Quantile: | 14 | 14
99% Quantile: | 25 | 23
Maximum: | 66 | 42


In [None]:
fig,ax = plt.subplots(2,sharex=True)


#fig  = plt.figure()
#hist
ax[0].hist(count_reviews_rvn,bins=60,label = 'number of reviewer names')
# vertical lines
ax[0].axvline(count_reviews_rvn.mean(),color='yellow',label='mean')
ax[0].axvline(count_reviews_rvn.quantile(0.95),color='orange',label='Top 5%')
ax[0].axvline(count_reviews_rvn.quantile(0.99),color='red',label='Top 1%')
# labels
ax[0].legend()


#hist
ax[1].hist(count_reviews_rvid,bins=40,label='number of reviewer IDs')
# vertical lines
ax[1].axvline(count_reviews_rvid.mean(),color='yellow',label='mean')
ax[1].axvline(count_reviews_rvid.quantile(0.95),color='orange',label='Top 5%')
ax[1].axvline(count_reviews_rvid.quantile(0.99),color='red',label='Top 1%')
# labels
ax[1].legend()

ax.flat[0].set( ylabel='number of reviewer names')
ax.flat[1].set(xlabel='number of reviews', ylabel='number of reviewer IDs')

# set size
fig.set_size_inches(16.5, 12.5)

The histograms are at least very similar. The minimum number of reviews per reviewer name is smaller than the minimum per reviewer ID, so there have to be reviewer IDs with multiple names. Additionally, the maximum number of reviews per reviewer name is higher than the maximum per reviewer ID, so there have to be reviewer names with multiple IDs.

Alright, let's take a look at the names with multiple IDs.

In [None]:
## names with multiple IDs
for name in df['reviewerName'].unique():
    if df[df['reviewerName']==name]['reviewerID'].unique().shape[0]>1:
        print(name)



OK, so the reviewer names aren't unique usernames: different users simply share the same reviewer name.

So let's look at the IDs that have multiple names.

In [None]:
        
## multiple names with one ID
for ID in df['reviewerID'].unique():
    if df[df['reviewerID']==ID]['reviewerName'].unique().shape[0]>1:
        print(df[df['reviewerID']==ID]['reviewerName'].unique())


OK, so most reviewer IDs with multiple names are caused by NaNs, and on some occasions it's just a change of name like 'caffeinebrain' --> 'coffeebrain'.

I just had wrong assumptions about the reviewer names.
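The two loops above scan the whole table once per name or ID. The same question can probably be answered with a single `groupby(...).nunique()` per direction. This is a sketch on hypothetical toy rows, not the real data. Note that `nunique` ignores NaN by default, so IDs whose only "second name" is a missing value are not reported here, unlike in the loop version.

```python
import pandas as pd

# Hypothetical toy rows: A1 changed its name, 'John' is shared by A2 and A3,
# and A4 has no name at all.
toy = pd.DataFrame({
    'reviewerID':   ['A1', 'A1', 'A2', 'A3', 'A4'],
    'reviewerName': ['caffeinebrain', 'coffeebrain', 'John', 'John', None],
})

# number of distinct IDs behind each name, and distinct names behind each ID
ids_per_name = toy.groupby('reviewerName')['reviewerID'].nunique()
names_per_id = toy.groupby('reviewerID')['reviewerName'].nunique()

shared_names = ids_per_name[ids_per_name > 1].index.tolist()
renamed_ids = names_per_id[names_per_id > 1].index.tolist()

print('names used by several IDs:', shared_names)
print('IDs with several names:', renamed_ids)
```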



### Conclusions:
1. The data doesn't contain any products or any reviewers with fewer than 5 reviews
2. The most common number of reviews per product and per reviewer is 5
3. The top 1% of products and reviewers span a wider range of review counts than the lower 99%
4. Reviewer names are not unique and can be changed


In [None]:
### helpful
# make helpful ratio
from ast import literal_eval

def ratio_fun(x):
    evaled = literal_eval(x)
    if evaled[1] == 0:
        return float('NaN')
    else:
        return evaled[0]/evaled[1]

def all_votes_fun(x):
    return literal_eval(x)[1]

df['helpful_ratio'] = df['helpful'].apply(ratio_fun)
df['helpful_all_votes'] = df['helpful'].apply(all_votes_fun)
count_reviews_hpr = look_at_numerics(df,'helpful_ratio')
count_reviews_hpav = look_at_numerics(df,'helpful_all_votes')



# Helpful
Helpful is a string of the form "[\*upvotes\*, \*all votes\*]" (at least this seems to make the most sense, because the first number is never larger than the second). Because we have two numbers in relation to each other, we are going to look at the ratio between them. In the case [0,0] we just return NaN.\
This means the helpful ratio is between 0 and 1 and the number of entries is the number of reviews that received votes. Looking only at the ratio doesn't show how the total votes are distributed over the reviews, so we are also going to look at the number of all votes.
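To make the parsing concrete, here is a small sketch of how such strings can be split into two columns and turned into a ratio, with [0,0] mapped to NaN. The toy strings and the column names `up` and `total` are assumptions made for this illustration only.

```python
import pandas as pd
from ast import literal_eval

# Hypothetical toy strings in the same "[upvotes, all votes]" form.
toy = pd.DataFrame({'helpful': ['[0, 0]', '[2, 3]', '[5, 5]']})

# parse each string into a Python list, then spread it over two columns
parsed = toy['helpful'].apply(literal_eval)
votes = pd.DataFrame(parsed.tolist(), columns=['up', 'total'], index=toy.index)

# replace a total of 0 by NaN first, so that [0, 0] yields NaN instead of a division by zero
ratio = votes['up'] / votes['total'].where(votes['total'] > 0)

print(pd.concat([votes, ratio.rename('ratio')], axis=1))
```

The check `(votes['up'] <= votes['total']).all()` is one way to test the reading that the first number counts upvotes and the second counts all votes.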

### Helpful in numbers:
\  | Helpful ratio | All votes
----|--------|-----
 Minimum:| 0 | 0
 Mean:| 0.78 | 1.86
 Median:| 1 | 0
 95% Quantile:| 1 | 7
 99% Quantile:| 1 | 30
 Maximum:| 1 | 300
 
There are 3465 (33.8%) reviews with votes out of a total of 10261 reviews.

##### The data is visualized in the following histogram.

In [None]:

draw_hist(df['helpful_ratio'].dropna(),'number of reviews','helpful ratios')


We are looking at around 1/3 of all reviews, and a majority of them are rated completely helpful. We already know that we are only looking at frequent reviewers, so we can say frequent reviewers are getting mostly positive feedback. If we assume that people vote at the same rate whether they found a review helpful or unhelpful, we can conclude that frequent reviewers write helpful reviews. I am not really willing to make that assumption, because I believe that people who read an unhelpful review just keep looking for a helpful one, while people who read a really helpful review might give it an upvote. BUT this belief is mostly based on how I behave.

In [None]:
draw_hist(df[df['helpful_all_votes']>0]['helpful_all_votes'],'number of reviews', 'all votes',bins=80)
_= look_at_numerics(df[df['helpful_all_votes']>0],'helpful_all_votes')

Looking at the distribution of votes over the reviews with at least one vote, we see that the median number of votes is still only 2 and the top 25% get at least 4 votes. Given that we already removed the 2/3 of reviews that didn't get any votes, we can say that reviews rarely get a lot of feedback.
So we can update our list of conclusions.



### Conclusions:
1. The data doesn't contain any products or any reviewers with fewer than 5 reviews
2. The most common number of reviews per product and per reviewer is 5
3. The top 1% of products and reviewers span a wider range of review counts than the lower 99%
4. Reviewer names are not unique and can be changed
5. Reviews rarely get more than 4 helpful votes, but the votes given are overwhelmingly positive


In [None]:
count_reviews_ov = look_at_numerics(df,'overall')
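# are all entries integers? (print the result, otherwise only the last expression of the cell is shown)
print('integer-valued entries: ', (df['overall'] == df['overall'].astype(int)).sum(), 'of', len(df))
# how many 5 star reviews are there?
print(df['overall'].value_counts())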

# Overall (product rating)
Overall shows the product rating as an integer between 1 and 5. It's the rating given by the corresponding review.

### Overall in numbers:
  \  | Overall
---|-------------
Minimum: | 1
Mean: | 4.48
Median: | 5
95% Quantile: | 5
99% Quantile: | 5
Maximum: | 5


##### The data is visualized in the following histogram.

In [None]:
draw_hist(df['overall'],'number of reviews','overall')


We can see the average rating given by a frequent reviewer is pretty high at around 4.5, and nearly 7000 out of 10261 reviews give the product 5 stars. At the lower end there are barely any 1-2 star reviews.
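The share of each star rating, and the combined share of 4-5 star reviews, can be computed directly rather than read off the histogram. A minimal sketch on invented ratings (the `toy` values are hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy ratings, stored as floats like the 'overall' column.
toy = pd.DataFrame({'overall': [5.0, 5.0, 5.0, 4.0, 4.0, 3.0, 2.0, 1.0, 5.0, 5.0]})

# relative frequency of every star rating
rating_share = toy['overall'].value_counts(normalize=True).sort_index()

# share of reviews that give 4 or 5 stars
share_4_or_5 = (toy['overall'] >= 4).mean()

print(rating_share)
print('share of 4-5 star reviews:', share_4_or_5)
```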




### Conclusions:
1. The data doesn't contain any products or any reviewers with fewer than 5 reviews
2. The most common number of reviews per product and per reviewer is 5
3. The top 1% of products and reviewers span a wider range of review counts than the lower 99%
4. Reviewer names are not unique and can be changed
5. Reviews rarely get more than 4 helpful votes, but the votes given are overwhelmingly positive
6. Frequent reviewers give a 4-5 star review in ~90% of cases



In [None]:
df['reviewText_len'] = df['reviewText'].apply(lambda x: len(str(x)))
_ = look_at_numerics(df,'reviewText_len')
# Minimum without "NaN"
no_na_review = df[['reviewText']].dropna()
no_na_review['reviewText_len'] = no_na_review['reviewText'].apply(lambda x: len(str(x)))
print('Minimum without "NaN": ',no_na_review['reviewText_len'].min())
# Whats the shortest review?
print('Whats the shortest review? \n',no_na_review[no_na_review['reviewText_len'] == no_na_review['reviewText_len'].min()]['reviewText'])
# Is the shortest review helpful?
print('Is the shortest review helpful? ',df[df['reviewText_len'] == no_na_review['reviewText_len'].min()]['helpful'])
# Whats the longest review?
#print('Whats the longest review? \n',no_na_review[no_na_review['reviewText_len'] == no_na_review['reviewText_len'].max()]['reviewText'].values)


# ReviewText
A quick evaluation of text is somewhat difficult. If you wanted to check whether a text is just nonsense, you could run it through a pretrained English language model and look at the average likelihood of the words given their position. I have never tried this. Here we are just going to look at the length of each review, measured as the number of characters.
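One detail matters when measuring length: `len(str(x))` turns a missing review into the string `'nan'` and reports a length of 3, while the `.str.len()` accessor keeps missing values as NaN. A short sketch on made-up texts (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy reviews; the middle one is missing, as read_csv would produce it.
toy = pd.DataFrame({'reviewText': ['excellent', float('nan'), 'Works fine.']})

# casting to str first counts the 3 characters of 'nan' for the missing review
len_via_str_cast = toy['reviewText'].apply(lambda x: len(str(x)))

# the string accessor leaves the missing review as NaN
len_via_accessor = toy['reviewText'].str.len()

print(len_via_str_cast.tolist())
print(len_via_accessor.tolist())
print('minimum ignoring missing reviews:', len_via_accessor.min())
```

Which variant is preferable depends on whether missing reviews should show up in the statistics at all; with the accessor they are simply skipped by `min`, `mean` and the quantiles.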


### Review text length in numbers:
  \  | Review text length
---|-------------
Minimum: | 3 (9)*
Mean: | 485.93
Median: | 284
95% Quantile: | 1552
99% Quantile: | 3027
Maximum: | 11310

\* 3 is the length of the string 'nan' produced for missing reviews; the shortest actual review has 9 characters

##### The data is visualized in the following histogram.

In [None]:
draw_hist(df['reviewText_len'],'number of reviews','reviewer text length',bins=80)

Most reviews have fewer than 300 characters, with the shortest one being just the word "excellent" and the longest one being a comparison between multiple articles. I looked at a few of the reviews and I didn't see any nonsense or anything that would point towards a systemic error. So there seems to be no problem with the review texts.


In [None]:
df['summary_len'] = df['summary'].apply(lambda x: len(str(x)))
_ = look_at_numerics(df,'summary_len')
# Whats a summary of a NaN review?
print(' Whats a summary of a NaN review? \n',df[df['reviewText'].isna()]['summary'])
# Whats the shortest summary?
print('Whats the shortest summary? \n',df[df['summary_len']==df['summary_len'].min()]['summary'])
# Whats the longest summary?
print('Whats the longest summary? \n',df[df['summary_len']==df['summary_len'].max()]['summary'].values)
#


# Summaries
Summaries are like review texts, but shorter. There are no NaN entries here, which means that the missing reviews all have summaries, and they look completely normal, like:
* The Pop Rocks with the Yeti
* No power = No Sound, But It Sounds GREAT!

We are going to do the same as before and look at the summary length.


### Summary length in numbers:
  \  |Summary length
---|-------------
 Minimum: | 1
Mean: | 24.34
Median: | 21 
95% Quantile: | 55
99% Quantile: | 74 
Maximum: | 128

##### The data is visualized in the following histogram.

In [None]:
draw_hist(df['summary_len'],'number of reviews','length of summary',bins=60)

Most summaries are below 25 characters. Some short ones are just "F", "A+", "OK" or a single "-". The longest one is:

"Excellent, best design ever, but only get the SILVER aluminum one, not the thermoplastic as they break-- I've had 3 break on me."

The longest one is 128 characters long, which might be the limit, because it's a power of 2.
Otherwise there are no data irregularities to see here.

In [None]:
df['reviewTime_dt'] = pd.to_datetime(df['reviewTime'])
_ = look_at_numerics(df,'reviewTime_dt')
# Is unixTime = Time? (compare the calendar date of every row, not just whether any row matches)
print('Is unixTime = Time? ',(pd.to_datetime(df['unixReviewTime'],unit='s').dt.normalize() == df['reviewTime_dt']).all())
# Where is 95% of the data?
print('Where is 95% of the data? past ', df['reviewTime_dt'].quantile(0.05))


# Time (unixReviewTime, ReviewTime)
Unix time and review time point to the same date. We are going to look at the number of reviews over time as datetime.
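A row-by-row version of this comparison could look like the sketch below. The two toy rows are hypothetical examples written in the same layout as the dataset, and the explicit `format` string is an assumption about how `reviewTime` is written (month, day, comma, year).

```python
import pandas as pd

# Hypothetical toy rows: unix seconds and the raw date text for the same days.
toy = pd.DataFrame({
    'unixReviewTime': [1393545600, 1363392000],
    'reviewTime': ['02 28, 2014', '03 16, 2013'],
})

from_unix = pd.to_datetime(toy['unixReviewTime'], unit='s')
from_text = pd.to_datetime(toy['reviewTime'], format='%m %d, %Y')

# normalize() drops the time of day, so only the calendar dates are compared
same_date = bool((from_unix.dt.normalize() == from_text).all())

print('Do both columns point to the same date in every row?', same_date)
```

Using `.all()` answers whether every row agrees, whereas a check of the form `sum(...) > 0` would already be satisfied by a single matching row.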



### ReviewTime numbers:
  \  |ReviewTime
---|-------------
 First: | 2004-09-18
Mean: | 2013-02-11 
Median: | 2013-05-14 
95% Quantile: | 2014-06-03
99% Quantile: | 2014-07-08 
Last: | 2014-07-22

##### The data is visualized in the following histogram.

In [None]:
draw_hist(df['reviewTime_dt'],'number of reviews','timeline')

So the data has reviews between 2004 and 2014, with hardly any reviews before 2008. Half of the reviews were written between 2013-05-14 and 2014-07-22.
This seems to fit my assumptions about Amazon's usage over time.
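Statements like "x% of the reviews were written after a given date" can be checked with a boolean mean, and a per-year count gives a coarser view than the histogram. A small sketch with four invented dates (the cutoff `2011-01-01` and the `dates` series are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy review dates.
dates = pd.to_datetime(pd.Series(['2004-09-18', '2011-06-01', '2013-05-14', '2014-07-22']))

# share of reviews written on or after the cutoff
cutoff = pd.Timestamp('2011-01-01')
share_after_cutoff = (dates >= cutoff).mean()

# number of reviews per calendar year
reviews_per_year = dates.dt.year.value_counts().sort_index()

print('share of reviews since', cutoff.date(), ':', share_after_cutoff)
print(reviews_per_year)
```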


### Conclusions:
1. The data doesn't contain any products or any reviewers with fewer than 5 reviews
2. The most common number of reviews per product and per reviewer is 5
3. The top 1% of products and reviewers span a wider range of review counts than the lower 99%
4. Reviewer names are not unique and can be changed
5. Reviews rarely get more than 4 helpful votes, but the votes given are overwhelmingly positive
6. Frequent reviewers give a 4-5 star review in ~90% of cases
7. Around 95% of all reviews were written after 2011


# Relations

Finally we are going to look at the relationships between the numerical features.

In [None]:
sns.pairplot(df)

Looking at the unixReviewTime column we can see:
* before 2010 there were close to no 1-2 star ratings (but there were also barely any reviews before 2010, and 1-2 star reviews are rare overall, at least from frequent reviewers)
* the helpful ratio before 2010 is on average higher than after 2010
* the number of all helpful votes seems to drop near the end of the timeline (but these reviews might have had too little time to accumulate a higher number of votes)
* review text length and summary length increase over time

Other interesting things are:
* long summaries and long reviews don't get a lot of helpful votes
* reviews with a low overall rating seem to be short
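The observations above are read off the pair plot by eye. One way to back them up with numbers would be a rank correlation matrix over the same numerical columns. This is only a sketch: the toy values are invented so the code runs on its own, and they say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy values using the column names created earlier in this notebook.
toy = pd.DataFrame({
    'overall':           [1, 2, 3, 4, 5, 5],
    'reviewText_len':    [40, 80, 150, 300, 500, 450],
    'helpful_all_votes': [0, 1, 0, 3, 2, 10],
})

# Spearman correlation works on ranks, so it is less sensitive to the long tails
# seen in the text length and vote count histograms than Pearson correlation
corr = toy.corr(method='spearman')

print(corr.round(2))
```

On the real data the call would be applied to the numerical columns of `df`; a value near 0 for a pair would suggest the pattern seen in the scatter plot is weak.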

# All Conclusions
* The data doesn't contain any products or any reviewers with fewer than 5 reviews
* The most common number of reviews per product and per reviewer is 5
* The top 1% of products and reviewers span a wider range of review counts than the lower 99%
* Reviewer names are not unique and can be changed
* Reviews rarely get more than 4 helpful votes, but the votes given are overwhelmingly positive
* Frequent reviewers give a 4-5 star review in ~90% of cases
* Around 95% of all reviews were written after 2011
* Before 2010 there were close to no 1-2 star ratings (but there were also barely any reviews before 2010, and 1-2 star reviews are rare overall, at least from frequent reviewers)
* The helpful ratio before 2010 is on average higher than after 2010
* The number of all helpful votes seems to drop near the end of the timeline (but these reviews might have had too little time to accumulate a higher number of votes)
* Review text length and summary length increase over time
* Long summaries and long reviews don't get a lot of helpful votes
* Reviews with a low overall rating seem to be short