In [5]:
import pandas as pd

In [6]:
amazon_df = pd.read_parquet("amazon_reviews_with_embeddings.parquet")

In [7]:
print(f"Shape of Amazon Reviews Data: {amazon_df.shape}")
print(f"Columns of Amazon Reviews Data: {amazon_df.columns}")

Shape of Amazon Reviews Data: (8201231, 16)
Columns of Amazon Reviews Data: Index(['overall', 'vote', 'verified', 'reviewTime', 'reviewerID', 'asin',
       'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'image',
       'style', 'review_len_words', 'review_len_chars', 'reviewText_clean',
       'embedding'],
      dtype='object')


## Drop Null Values

In [9]:
amazon_df.dropna(subset=['summary','reviewText', 'reviewerName'],inplace=True)

In [10]:
amazon_df.drop(columns=['image', 'vote', 'style', 'reviewTime'],inplace=True)

In [11]:
amazon_df.shape

(8191295, 12)

After dropping rows with null review text, check the embeddings for missing values:

In [13]:
amazon_df['embedding'].isna().sum()

156

There are still 156 rows with null embeddings:

In [15]:
print(amazon_df.loc[amazon_df['embedding'].isna(), 'reviewText'])

32620      <div id="video-block-RZJE0FUS9Q294" class="a-s...
39561      <div id="video-block-R140VRJ98DUX7I" class="a-...
141368     <div id="video-block-RWO16ORWOSEG" class="a-se...
480058     <div id="video-block-R3OS0H9AM1JAE2" class="a-...
550180     <div id="video-block-R1T6JKTB1KCZ0F" class="a-...
                                 ...                        
7994999    <div id="video-block-R2V63EVDOHZLHF" class="a-...
8055368                                                     
8084287    <div id="video-block-RBEWWMCKKIY6M" class="a-s...
8128175                                                     
8186277    <div id="video-block-R7NSNC4PCI6VZ" class="a-s...
Name: reviewText, Length: 156, dtype: object


In [16]:
print(amazon_df.loc[32620,'reviewText'])

<div id="video-block-RZJE0FUS9Q294" class="a-section a-spacing-small a-spacing-top-mini video-block"></div><input type="hidden" name="" value="https://images-na.ssl-images-amazon.com/images/I/B1m9qVXPZTS.mp4" class="video-url"><input type="hidden" name="" value="https://images-na.ssl-images-amazon.com/images/I/51YwxGYUp6S.png" class="video-slate-img-url">


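This is a video review: the text is only an HTML video block, not written content. A few of the other rows appear blank in the output above. We will drop all rows with null embeddings.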

In [18]:
amazon_df.dropna(subset=['embedding'], inplace=True)
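Before dropping, it can help to confirm what kind of text sits behind the null embeddings. The sketch below is a minimal illustration on a toy frame (`toy_df` and its rows are invented for the example, not taken from the dataset). It assumes that video reviews can be recognized by the `video-block` substring seen in the output above.

```python
import pandas as pd

# Toy stand-in for amazon_df; these rows are invented for illustration.
toy_df = pd.DataFrame({
    "reviewText": [
        '<div id="video-block-X1" class="a-section video-block"></div>',
        "   ",
        "Great puzzle for a 4 year old.",
        '<div id="video-block-X2" class="a-section video-block"></div>',
    ],
    "embedding": [None, None, [0.1, 0.2], None],
})

# Look only at the rows whose embedding is missing.
missing_text = toy_df.loc[toy_df["embedding"].isna(), "reviewText"]

# Assumption: a video review is identified by the 'video-block' marker.
is_video = missing_text.str.contains("video-block", regex=False)
is_blank = missing_text.str.strip().eq("")

summary = pd.Series({
    "video_block": int(is_video.sum()),
    "blank_text": int(is_blank.sum()),
    "other": int((~is_video & ~is_blank).sum()),
})
print(summary)
```

A non-zero `other` count on the real data would suggest that some rows with usable text also lack embeddings and may deserve a closer look before being dropped.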

## Analyze Single-Review Products

#### Print Some Reviews from Single-Review Products

In [21]:
review_counts = amazon_df.groupby('asin')['reviewText'].count()
one_review_asins = review_counts[review_counts == 1].index
one_review_df = amazon_df[amazon_df['asin'].isin(one_review_asins)].copy()

In [22]:
for i, review in enumerate(one_review_df['reviewText'].head(20), 1):
    print(f"--- Review {i} ---\n{review}\n")

--- Review 1 ---
After being inundated with Barbie, I had not paid much attention to these dolls and their accessories until my child recently pointed them out in the newspaper insert and then wanted to look at them online.  This one was specifically liked because of the appealing clothing.  After receiving it and seeing just how cute it is, I don't mind investing a little in the My Scene stuff.  I think I might even have fun playing with the stuff.

--- Review 2 ---
Great product, thank you! Our son loved the puzzles.  They have large pieces yet they are still challenging for a 4 year old.

--- Review 3 ---
A1

--- Review 4 ---
A very cute little toy, though very expensive for its size.

--- Review 5 ---
This book has so many reviews mine is not necessary; but I feel so strongly about this book I had to share.  I have two boys- 3yrs & 1yr- They have actually read our book so much it fell apart.  I am purchasing  another copy.  They LOVE this book.  A great gift idea for baby showers--

Most of these reviews carry a lot of information. Let's print the one-word reviews from single-review products.

In [28]:
one_review_df['word_count'] = one_review_df['reviewText'].str.split().apply(len)

amazon_df_sorted = one_review_df.sort_values(by='word_count')

amazon_df_sorted = amazon_df_sorted.reset_index(drop=True)

for i, row in amazon_df_sorted.head(10).iterrows():
    print(f"--- Review {i+1} ---")
    print(f"Word Count: {row['word_count']}")
    print(f"{row['reviewText']}\n")

--- Review 1 ---
Word Count: 1
AWESOME!!!!

--- Review 2 ---
Word Count: 1
JOKE

--- Review 3 ---
Word Count: 1
A++++++

--- Review 4 ---
Word Count: 1
good

--- Review 5 ---
Word Count: 1
Stunning.

--- Review 6 ---
Word Count: 1
Darling.

--- Review 7 ---
Word Count: 1
good.

--- Review 8 ---
Word Count: 1
nice

--- Review 9 ---
Word Count: 1
AA

--- Review 10 ---
Word Count: 1
Perfect!



Since we use sentence embeddings, one-word reviews from single-review products provide little meaningful information and instead introduce noise. Therefore, we will eliminate such products from our dataset.

In [30]:
one_word_review_asins = one_review_df[one_review_df['word_count'] <= 1]['asin']

amazon_df_cleaned = amazon_df[~amazon_df['asin'].isin(one_word_review_asins)]

In [32]:
amazon_df_cleaned.shape

(8181798, 12)
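The same filter can be expressed as a single boolean mask, which avoids building the intermediate `one_review_df`. This is a sketch on a toy frame (`toy_df` and its contents are invented for the example) and not a replacement for the cells above. It relies on `groupby(...).transform("count")` to attach the per-product review count to every row.

```python
import pandas as pd

# Toy stand-in for amazon_df; these rows are invented for illustration.
toy_df = pd.DataFrame({
    "asin": ["A", "B", "B", "C", "D"],
    "reviewText": [
        "good",                          # single-review product, one word
        "nice",                          # one word, but product B has two reviews
        "Works well for the price.",
        "Sturdy and easy to assemble.",  # single-review product, several words
        "AWESOME!!!!",                   # single-review product, one word
    ],
})

# Words per review (whitespace-separated tokens).
word_count = toy_df["reviewText"].str.split().str.len()

# Number of reviews per product, broadcast back to each row.
reviews_per_asin = toy_df.groupby("asin")["reviewText"].transform("count")

# Drop a row only if its product has a single review AND that review is one word.
drop_mask = (reviews_per_asin == 1) & (word_count <= 1)
toy_cleaned = toy_df[~drop_mask]

print(toy_cleaned["asin"].tolist())
```

In the toy data, products A and D are removed, while the one-word review of product B stays because that product has a second, longer review.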

In [34]:
unique_asins = amazon_df_cleaned['asin'].unique()

In [36]:
# Load labels
amazon_df_labels = pd.read_pickle("Data/amazon_df_labels.pkl")

In [38]:
# Keep only the label rows whose 'asin' appears in amazon_df_cleaned (a filter, not a merge)
asin_labels_clean_review_df = amazon_df_labels[amazon_df_labels['asin'].isin(amazon_df_cleaned['asin'].unique())]

In [40]:
asin_labels_clean_review_df = asin_labels_clean_review_df.drop_duplicates(subset='asin')

In [44]:
print(f"After dropping, the size of the data is: {len(asin_labels_clean_review_df)}")

After dropping, the size of the data is: 614658
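The two steps above (filter with `isin`, then `drop_duplicates`) can be illustrated on a toy label table. The frame below is invented for the example and uses only the `asin` and `match` columns; the third column of the real label file is not shown in this notebook and is left out.

```python
import pandas as pd

# Toy stand-in for amazon_df_labels; these rows are invented for illustration.
toy_labels = pd.DataFrame({
    "asin": ["A", "A", "B", "C"],
    "match": [0, 0, 1, 0],
})

# Toy stand-in for the asins that survived review cleaning.
kept_asins = pd.Series(["A", "B"])

# Step 1: keep label rows whose asin is still present after cleaning.
filtered = toy_labels[toy_labels["asin"].isin(kept_asins)]

# Step 2: keep one row per asin (the first occurrence by default).
deduped = filtered.drop_duplicates(subset="asin")

print(deduped)
```

Note that `drop_duplicates` keeps the first row for each `asin`. If duplicate rows ever carried different labels, the kept label would depend on row order, so checking for conflicting duplicates beforehand may be worthwhile.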


In [46]:
asin_labels_clean_review_df['match'].value_counts()

match
0    613217
1      1441
Name: count, dtype: int64
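The counts above show that rows with `match == 1` are rare. The short sketch below turns those two counts into a proportion; the numbers are copied from the output above, and the `match_counts` name is introduced only for this example.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
match_counts = pd.Series({0: 613217, 1: 1441}, name="count")

# Share of rows with match == 1, and negatives per positive.
positive_rate = match_counts.loc[1] / match_counts.sum()
negatives_per_positive = match_counts.loc[0] / match_counts.loc[1]

print(f"Share of match == 1: {positive_rate:.4%}")
print(f"Rows with match == 0 per row with match == 1: {negatives_per_positive:.1f}")
```

On the real frame, `asin_labels_clean_review_df['match'].value_counts(normalize=True)` gives the same proportions directly.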

In [48]:
asin_labels_clean_review_df.shape

(614658, 3)

In [50]:
# Save cleaned asins and labels:
asin_labels_clean_review_df.to_csv("Data/asin_labels_clean_review_df.csv", index=False)