**Table of contents**<a id='toc0_'></a>    
- [Introduction](#toc1_1_1_)    
      - [Importing Python Libraries](#toc1_1_1_1_)    
      - [Loading Clean Dataset](#toc1_1_1_2_)    
    - [Merge Review and Meta data](#toc1_1_2_)    
    - [Word Counts in Reviews](#toc1_1_3_)    
    - [Aggregated Statistics by Store](#toc1_1_4_)    
    - [Aggregated Statistics based on Category](#toc1_1_5_)    
    - [Tokenizing Review texts and titles](#toc1_1_6_)    
    - [Combine store and category aggregated columns](#toc1_1_7_)    
    - [Data Dictionary](#toc1_1_8_)    
- [Data Dictionary](#toc2_)    
    - [Pickling the dataframe](#toc2_1_1_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

### <a id='toc1_1_1_'></a>[Introduction](#toc0_)

In this notebook, we first merge the product metadata with the reviews DataFrame, then pre-process the user-generated reviews and ratings to make them suitable for modeling.

#### <a id='toc1_1_1_1_'></a>[Importing Python Libraries](#toc0_)

Importing necessary libraries for data pre-processing

In [110]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import regex as re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS,TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
import string
import spacy

# Ignore all warnings to avoid cluttering the output
import warnings
warnings.filterwarnings("ignore")

In [111]:
#Useful settings

plt.rcParams['figure.figsize'] = (8.0, 6.0) # set matplotlib global settings eg. figsize
# plt.rcParams['font.size']=17              # useful when saving figures
sns.set_style("darkgrid")                   #Setting grid style in seaborn

# Uncomment if necessary
#pd.set_option('display.max_columns', None) # show all dataframe columns
#pd.set_option('display.max_colwidth', 1000)  # display long column titles

# from warnings import filterwarnings
# filterwarnings(action='ignore')

#### <a id='toc1_1_1_2_'></a>[Loading Clean Dataset](#toc0_)

In [112]:
# Here we load the pickled review DataFrame which has undergone basic cleaning   
review_df = pd.read_pickle('../data/review_sample_handmade.pkl')

#display first 5 rows
review_df.head()

Unnamed: 0,rating_by_user,title,text_review,images,product_id,parent_asin,user_id,time_of_review,helpful_vote,verified_purchase
0,5,Beautiful colors,I bought one for myself and one for my grandda...,[],B08GPJ1MSN,B08GPJ1MSN,AF7OANMNHQJC3PD4HRPX2FATECPA,2021-05-21 14:31:35.111,1,True
1,5,You simply must order order more than one!,I’ve ordered three bows so far. Have not been ...,[],B084TWHS7W,B084TWHS7W,AGMJ3EMDVL6OWBJF7CA5RGJLXN5A,2020-04-24 21:15:46.965,0,True
2,5,Great,As pictured. Used a frame from the dollar stor...,[],B07V3NRQC4,B07V3NRQC4,AEYORY2AVPMCPDV57CE337YU5LXA,2020-06-06 13:09:11.297,0,True
3,5,Well made and so beautiful,"This is beyond beautiful. So shiny, the size ...",[],B071ZMDK26,B071ZMDK26,AEINY4XOINMMJCK5GZ3M6MMHBN6A,2019-06-02 01:14:39.784,2,True
4,5,Smells just like the real thing!,Oh wow what a pleasant surprise! This smells g...,[],B01MPVZ4YP,B01MPVZ4YP,AGCPAPUHXYA3EEIL2KGSQTGO5HRA,2019-01-08 00:12:11.674,1,True


In [113]:
# Here we load the pickled meta DataFrame which has undergone basic cleaning, EDA and preprocessing 
meta_df = pd.read_pickle('../data/meta_sample_preprocessed.pkl')
meta_df.head()

Unnamed: 0,product_price,highly_rated_product,product_title_length,log_store_grouped_total_products,average_rating,num_product_images,log_store_grouped_weighted_mean_rating,store_grouped_std_rating,has_package_density,package_density,...,subcategory1_total_products,subcategory1_mean_rating,subcategory1_std_rating,subcategory1_std_rating_number,combined_category_weighted_mean_rating,combined_category_total_products,combined_category_mean_rating_number,combined_category_std_rating_number,spacy_tokenized_features,combined_category
0,24.0,0,33,1.716003,4.4,3,0.651539,0.380259,1,0.074682,...,11593,4.383533,0.58814,146.375686,4.420524,2899,37.654709,174.315571,sterling silver hammered cuff simple solid han...,Jewelry Earrings
1,12.95,0,143,2.93852,4.5,8,0.643441,0.383498,1,0.162845,...,16302,4.534591,0.506527,117.700437,4.517265,7350,39.613605,119.932873,humorous wall decor home office apartment deco...,Home & Kitchen Artwork
2,31.0,0,172,1.643453,4.4,8,0.668089,0.325217,0,0.105715,...,16302,4.534591,0.506527,117.700437,4.582926,1839,37.506253,166.538104,whiskey glass black lantern floral glasses flo...,Home & Kitchen Dining
3,3.99,0,169,4.053962,4.1,2,0.652882,0.568533,0,0.041667,...,461,4.458134,0.536782,770.436923,4.572816,206,35.650485,112.597206,love print heart sticker decal compatible ipad...,Electronics Accessories Laptop
4,15.59,1,181,1.079181,4.6,8,0.66197,0.317543,0,0.019231,...,3870,4.499871,0.529016,93.262382,4.463906,1147,25.355711,53.194126,bachelorette party shirts soft crew neck custo...,Clothing Shoes & Accessories Men


### <a id='toc1_1_2_'></a>[Merge Review and Meta data](#toc0_)

In [114]:
# Merge the 'review_df' DataFrame and 'meta_df' DataFrame based on the common column 'parent_asin'
merged_df = pd.merge(meta_df, review_df, on='parent_asin')#.reset_index(drop=True)

# Display the first few rows of the 'merged_df' DataFrame
merged_df.head()

Unnamed: 0,product_price,highly_rated_product,product_title_length,log_store_grouped_total_products,average_rating,num_product_images,log_store_grouped_weighted_mean_rating,store_grouped_std_rating,has_package_density,package_density,...,combined_category,rating_by_user,title,text_review,images,product_id,user_id,time_of_review,helpful_vote,verified_purchase
0,24.0,0,33,1.716003,4.4,3,0.651539,0.380259,1,0.074682,...,Jewelry Earrings,5,Perfect,It is very pretty. It can be adjusted a littl...,[],B0178HXZUY,AHFF5VKSO3EHKKRXLSO7D3PTYONA,2022-11-22 11:34:28.627,0,True
1,24.0,0,33,1.716003,4.4,3,0.651539,0.380259,1,0.074682,...,Jewelry Earrings,4,Thinner sterling silver,"The silver is thin, but the price is great",[],B0178HXZUY,AFAII5DNHRUXEEEDQA5ZPYONLB4Q,2022-05-17 20:44:54.272,0,True
2,24.0,0,33,1.716003,4.4,3,0.651539,0.380259,1,0.074682,...,Jewelry Earrings,5,I really liked it and then I lost it☹️,I had never worn one before and it took me a l...,[],B0178HXZUY,AER6VFZ6C6CNDMIR5GVHRZHS2L5Q,2020-06-07 16:47:42.152,0,True
3,24.0,0,33,1.716003,4.4,3,0.651539,0.380259,1,0.074682,...,Jewelry Earrings,5,Beautiful Hammered Ear Cuff,Beautiful and very well made. I love it. Super...,[],B0178HXZUY,AFBMULI3HP7MUV7DQQGLYAXCKIKA,2019-01-31 20:35:35.277,2,True
4,24.0,0,33,1.716003,4.4,3,0.651539,0.380259,1,0.074682,...,Jewelry Earrings,5,Lovedit,I loved this little cuff. And then I lost it. ...,[],B0178HXZUY,AF2GWPXV2TNQBBLRRBHMCDKGN7LA,2020-04-23 05:03:24.699,2,True


In [115]:
print(f'The shape of the merged dataframe is {merged_df.shape}.')

The shape of the merged dataframe is (224702, 32).
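Each product in the metadata can have many reviews, so the merge above is expected to be one-to-many on `parent_asin`. The sketch below uses made-up toy frames (not the real data) to show how the `validate` and `indicator` parameters of `pd.merge` could be used to check that assumption and to see which rows matched. The outer join is used here only to make unmatched rows visible.

```python
import pandas as pd

# Toy stand-ins for meta_df and review_df (values are illustrative only)
toy_meta = pd.DataFrame({"parent_asin": ["A1", "A2", "A3"],
                         "product_price": [24.0, 12.95, 31.0]})
toy_reviews = pd.DataFrame({"parent_asin": ["A1", "A1", "A2", "A9"],
                            "rating_by_user": [5, 4, 5, 1]})

# validate raises a MergeError if parent_asin is duplicated on the metadata side
toy_merged = pd.merge(toy_meta, toy_reviews, on="parent_asin",
                      how="outer", validate="one_to_many", indicator=True)

# The _merge column records whether a row came from both frames or only one
n_both = int((toy_merged["_merge"] == "both").sum())
n_left_only = int((toy_merged["_merge"] == "left_only").sum())
n_right_only = int((toy_merged["_merge"] == "right_only").sum())
print(n_both, n_left_only, n_right_only)
```

With the default inner join used in the notebook, only the rows marked `both` would be kept.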


In [116]:
#display info
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224702 entries, 0 to 224701
Data columns (total 32 columns):
 #   Column                                  Non-Null Count   Dtype         
---  ------                                  --------------   -----         
 0   product_price                           224702 non-null  float64       
 1   highly_rated_product                    224702 non-null  int64         
 2   product_title_length                    224702 non-null  int64         
 3   log_store_grouped_total_products        224702 non-null  float64       
 4   average_rating                          224702 non-null  float64       
 5   num_product_images                      224702 non-null  int64         
 6   log_store_grouped_weighted_mean_rating  224702 non-null  float64       
 7   store_grouped_std_rating                224702 non-null  float64       
 8   has_package_density                     224702 non-null  int64         
 9   package_density                      

In [117]:
#checking null values 
merged_df.isna().sum().loc[lambda x: x> 0]

Series([], dtype: int64)

### <a id='toc1_1_3_'></a>[Word Counts in Reviews](#toc0_)

In [118]:
# Function to clean text: replace HTML tags, punctuation and emojis with spaces.
# Splitting into words happens later via .str.split() when counting words.
def clean_and_split(text):
    # Remove HTML tags like <br />
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r"[^\w\s]", " ", text)  # Remove punctuation
    text = re.sub(r"[\U00010000-\U0010FFFF]", " ", text, flags=re.UNICODE)  # Remove emojis outside the Basic Multilingual Plane
    return text

#apply to data frame
merged_df['review_title_cleaned'] = merged_df['title'].apply(clean_and_split)

merged_df['review_title_cleaned'].head()

0                                   Perfect
1                   Thinner sterling silver
2    I really liked it and then I lost it ️
3               Beautiful Hammered Ear Cuff
4                                   Lovedit
Name: review_title_cleaned, dtype: object

In [119]:
#clean review text

merged_df['review_text_cleaned'] = merged_df['text_review'].apply(clean_and_split)

merged_df['review_text_cleaned'].head()

0    It is very pretty   It can be adjusted a littl...
1           The silver is thin  but the price is great
2    I had never worn one before and it took me a l...
3    Beautiful and very well made  I love it  Super...
4    I loved this little cuff  And then I lost it  ...
Name: review_text_cleaned, dtype: object

In [120]:
#calculate the word counts in text and title of reviews

# Apply to the cleaned review text column
merged_df['review_text_word_counts'] = merged_df['review_text_cleaned'].str.split().apply(len)

# Apply to the cleaned title column
merged_df['review_title_word_counts'] = merged_df['review_title_cleaned'].str.split().apply(len)

merged_df[['review_text_word_counts','review_title_word_counts']].describe()

Unnamed: 0,review_text_word_counts,review_title_word_counts
count,224702.0,224702.0
mean,25.507143,3.413267
std,29.770351,2.717292
min,0.0,0.0
25%,8.0,2.0
50%,17.0,2.0
75%,33.0,4.0
max,1496.0,36.0
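To make the cleaning and counting steps concrete, here is a minimal sketch on a single made-up review string. It uses the standard-library `re` module rather than the third-party `regex` module imported in this notebook, and `clean_text` is a hypothetical simplified stand-in for `clean_and_split`, so edge cases (for example emoji variation selectors) may behave slightly differently.

```python
import re  # standard library; the notebook imports the third-party regex module under this name

def clean_text(text):
    """Replace HTML tags and punctuation with spaces (simplified stand-in for clean_and_split)."""
    text = re.sub(r"<.*?>", " ", text)    # HTML tags such as <br />
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation and other symbols
    return text

sample = "Love it!<br />Well-made, fast shipping."
cleaned = clean_text(sample)

# Word count = number of whitespace-separated pieces, as in .str.split().apply(len)
words = cleaned.split()
word_count = len(words)
print(words, word_count)
```

Because punctuation is replaced by spaces, a hyphenated word such as "Well-made" is counted as two words, and a review consisting only of punctuation or emojis gets a word count of 0, which matches the minimum of 0 in the summary above.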


### <a id='toc1_1_4_'></a>[Aggregated Statistics by Store](#toc0_)

While individual reviews cannot be directly used as features in our model, we can leverage aggregated store-level features to predict whether a product will be highly rated. This is because seller characteristics may significantly influence a product's overall rating.  

In [121]:
#create store based aggregates of  user based ratings, total reviews, helpful votes and review/title  text

#filter rows where verified purchase = 1
filtered_rows = merged_df['verified_purchase'] == 1 

store_stats = merged_df[filtered_rows].groupby('store_grouped',as_index=False).agg(
    num_reviews_store_grouped=('rating_by_user', 'count'),
    avg_user_rating_store_grouped=('rating_by_user', 'mean'),                            #group user ratings per store
    std_user_rating_store_grouped=('rating_by_user', 'std'),                            
    review_title_word_counts_store_grouped = ('review_title_word_counts', 'mean'),       #group word counts per store
    review_text_word_counts_store_grouped = ('review_text_word_counts', 'mean'),
    review_text_store_grouped  = ('review_text_cleaned', lambda x: ' '.join(x)),         #concatenate review text per store
    review_title_store_grouped = ('review_title_cleaned', lambda x: ' '.join(x))   
)

store_stats.head()

Unnamed: 0,store_grouped,num_reviews_store_grouped,avg_user_rating_store_grouped,std_user_rating_store_grouped,review_title_word_counts_store_grouped,review_text_word_counts_store_grouped,review_text_store_grouped,review_title_store_grouped
0,12:13 jewelry,47,4.851064,0.658679,3.382979,23.361702,So pretty I get compliments on this necklace ...,Cute necklace love it Beautiful Great price q...
1,3d woodworker,516,4.903101,0.508482,3.02907,23.218992,I ordered this plaque as a wedding gift for a ...,A perfect gift Awesome Beautiful Absolutely A...
2,48 hour monogram,692,4.760116,0.796319,3.043353,23.656069,Fast and beautiful Exactly what I needed Lov...,Quick service Beautiful piece SHIPS QUICK Came...
3,6grape,61,4.819672,0.695418,3.065574,35.819672,It s perfect Love it so much For b...,Just as described Better than expected MmVery...
4,8 track romeo,16,4.625,1.024695,2.3125,28.5625,The sign was perfect it came with text exactl...,Great Great quality and fast Don t get it ...


In [122]:
#ratio of verified purchases and total reviews per store

verified_purchase_mapping = merged_df.groupby('store_grouped')['verified_purchase'].mean()  #create a mapping per store
store_stats['verified_purchase_ratio_store_grouped'] = store_stats['store_grouped'].map(verified_purchase_mapping) #apply mapping

store_stats['verified_purchase_ratio_store_grouped'].describe() 

count    976.000000
mean       0.957311
std        0.092411
min        0.105263
25%        0.954545
50%        0.978367
75%        1.000000
max        1.000000
Name: verified_purchase_ratio_store_grouped, dtype: float64
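The ratio above works because the mean of a boolean column is the fraction of `True` values. A small sketch on made-up data, with a hypothetical `toy_store_stats` frame standing in for `store_stats`:

```python
import pandas as pd

toy = pd.DataFrame({
    "store_grouped": ["a", "a", "b", "b", "b", "b"],
    "verified_purchase": [True, True, True, False, False, True],
})

# Mean of booleans per store = share of verified purchases among all reviews
ratio = toy.groupby("store_grouped")["verified_purchase"].mean()

# Map the per-store ratio onto a store-level frame
toy_store_stats = pd.DataFrame({"store_grouped": ["a", "b"]})
toy_store_stats["verified_ratio"] = toy_store_stats["store_grouped"].map(ratio)
print(toy_store_stats)
```

Note that in the notebook the ratio is computed over all reviews, while `store_stats` itself is built from verified reviews only, so a store with no verified reviews never appears in `store_stats`.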

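Some reviews receive more helpful votes than others, regardless of whether they are one-star or five-star ratings. To better assess a seller's performance, we compute a signed helpfulness score: each review's helpful votes count positively when the rating is 4 stars or higher and negatively otherwise, and the signed votes are then averaged per store over verified purchases.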

In [123]:
merged_df['rating_polarity'] = merged_df['rating_by_user'].apply(lambda x: 1 if x >= 4 else -1)
merged_df["weighted_helpfulness"] = merged_df["helpful_vote"] * merged_df["rating_polarity"]

#Aggregate by store considering only verified purchases
helpfulness_mapping = merged_df.loc[filtered_rows].groupby("store_grouped")["weighted_helpfulness"].mean()
store_stats["weighted_helpfulness_store_grouped"] = store_stats["store_grouped"].map(helpfulness_mapping)

store_stats["weighted_helpfulness_store_grouped"].describe()

count    976.000000
mean       0.331593
std        0.842167
min       -3.708333
25%        0.071347
50%        0.218333
75%        0.445607
max       22.653846
Name: weighted_helpfulness_store_grouped, dtype: float64
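As a worked example of the score computed above, the sketch below applies the same two steps (sign by rating polarity, then average per store) to a made-up toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "store_grouped": ["a", "a", "a", "b", "b"],
    "rating_by_user": [5, 4, 1, 2, 5],
    "helpful_vote": [2, 0, 6, 3, 1],
})

# +1 for ratings of 4 or 5 stars, -1 otherwise
polarity = toy["rating_by_user"].apply(lambda r: 1 if r >= 4 else -1)
toy["weighted_helpfulness"] = toy["helpful_vote"] * polarity

# Store a: (2 + 0 - 6) / 3; store b: (-3 + 1) / 2
score = toy.groupby("store_grouped")["weighted_helpfulness"].mean()
print(score)
```

A single heavily upvoted negative review can pull a store's score below zero even when most of its ratings are positive, which is consistent with the negative minimum in the summary above.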

Additionally, we can analyze the ratio of one-star to five-star ratings for each store, which may serve as an indicator of seller reputation and popularity.

In [124]:
#create a map of count of different types of ratings per store
store_ratings_map = merged_df.loc[filtered_rows].groupby("store_grouped")["rating_by_user"].value_counts().unstack()

#take the ratio of one to five star
one_to_five_star = store_ratings_map[1]/store_ratings_map[5]

#the ratio is NaN when a store has no 1-star or no 5-star ratings; fill those with zero
store_stats["one_to_five_star_store_grouped"] = store_stats["store_grouped"].map(one_to_five_star).fillna(0)

store_stats["one_to_five_star_store_grouped"].describe()

count    976.000000
mean       0.096927
std        0.167605
min        0.000000
25%        0.009901
50%        0.050569
75%        0.115744
max        2.500000
Name: one_to_five_star_store_grouped, dtype: float64
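The sketch below, on made-up data, shows where the missing values in this ratio come from. After `value_counts().unstack()`, a rating that never occurs for a store is `NaN` rather than 0, so the ratio is `NaN` both for stores without 1-star reviews and for stores without 5-star reviews.

```python
import pandas as pd

toy = pd.DataFrame({
    "store_grouped": ["a", "a", "a", "b", "b", "c", "c"],
    "rating_by_user": [5, 5, 1, 5, 4, 1, 2],
})

# Rows = stores, columns = rating values, cells = counts (NaN where a rating never occurs)
counts = toy.groupby("store_grouped")["rating_by_user"].value_counts().unstack()

ratio = counts[1] / counts[5]  # a: 1/2, b: NaN/1, c: 1/NaN
filled = ratio.fillna(0)
print(filled)
```

Store `c` has only low ratings but ends up with the same value of 0 as store `b`, which has no 1-star reviews at all. Whether this matters depends on how many stores lack 5-star reviews, which this notebook does not report.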

In [125]:
#combine title and text of the reviews in a single column

store_stats["review_store_grouped"] = store_stats["review_title_store_grouped"] + " " + store_stats["review_text_store_grouped"] 

#drop original columns

store_stats.drop(columns =["review_title_store_grouped", "review_text_store_grouped"],inplace=True)

In [126]:
#check null values
store_stats.isna().sum()

store_grouped                             0
num_reviews_store_grouped                 0
avg_user_rating_store_grouped             0
std_user_rating_store_grouped             0
review_title_word_counts_store_grouped    0
review_text_word_counts_store_grouped     0
verified_purchase_ratio_store_grouped     0
weighted_helpfulness_store_grouped        0
one_to_five_star_store_grouped            0
review_store_grouped                      0
dtype: int64

### <a id='toc1_1_5_'></a>[Aggregated Statistics based on Category](#toc0_)

In [127]:
#create sub-category based aggregates of  user based ratings, total reviews, helpful votes and review/title  text

#filter rows where verified purchase = 1
filtered_rows = merged_df['verified_purchase'] == 1 

category_stats = merged_df[filtered_rows].groupby('combined_category',as_index=False).agg(
    num_reviews_combined_category_grouped=('rating_by_user', 'count'),
    avg_user_rating_combined_category_grouped=('rating_by_user', 'mean'),                            #group user ratings per combined_category
    std_user_rating_combined_category_grouped=('rating_by_user', 'std'),                            
    review_title_word_counts_combined_category_grouped = ('review_title_word_counts', 'mean'),       #group word counts per combined_category
    review_text_word_counts_combined_category_grouped = ('review_text_word_counts', 'mean'),
    review_text_combined_category_grouped  = ('review_text_cleaned', lambda x: ' '.join(x)),         #concatenate review text per combined_category
    review_title_combined_category_grouped = ('review_title_cleaned', lambda x: ' '.join(x))   
)

category_stats.head()

Unnamed: 0,combined_category,num_reviews_combined_category_grouped,avg_user_rating_combined_category_grouped,std_user_rating_combined_category_grouped,review_title_word_counts_combined_category_grouped,review_text_word_counts_combined_category_grouped,review_text_combined_category_grouped,review_title_combined_category_grouped
0,Baby Diaper Changing,53,4.679245,0.893859,3.283019,24.018868,The sizes are too small and they won t refund ...,Don t buy they don t refund Size Toaster Cover...
1,Baby Health Bathing & Skin Care,43,4.27907,1.201697,3.372093,29.976744,Designed specifically for my granddaughter Wo...,Great pillow Smoothe and lasts all day No sme...
2,Baby Nursery,854,4.550351,1.09851,3.564403,28.901639,Perfectly done I love how mine turned out it ...,Good buy Absolutely perfect Adorable STELLAR C...
3,Baby Nursing & Feeding,49,4.530612,1.226479,3.285714,25.591837,Absolutely BEAUTIFUL This is so lovely th...,Professional job Beautiful Vintage Best Just ...
4,Baby Pacifiers & Teethers,87,4.747126,0.810087,3.103448,15.505747,This is so beautiful omg im mad i bought it su...,Beautiful Hermoso Same as any other clip ha...


In [128]:
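# Recheck the store-level weighted helpfulness distribution (no category-level equivalent is added to category_stats)
merged_df.loc[filtered_rows].groupby("store_grouped")["weighted_helpfulness"].mean().describe()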

count    976.000000
mean       0.331593
std        0.842167
min       -3.708333
25%        0.071347
50%        0.218333
75%        0.445607
max       22.653846
Name: weighted_helpfulness, dtype: float64

In [129]:
#create a map of count of different types of ratings per category
category_ratings_map = merged_df.loc[filtered_rows].groupby("combined_category")["rating_by_user"].value_counts().unstack()

#take the ratio of one to five star
one_to_five_star = category_ratings_map[1]/category_ratings_map[5]

#the ratio is NaN when a category has no 1-star or no 5-star ratings; fill those with zero
category_stats["one_to_five_star_combined_category_grouped"] = category_stats["combined_category"].map(one_to_five_star).fillna(0)

category_stats["one_to_five_star_combined_category_grouped"].describe()

count    64.000000
mean      0.080043
std       0.055919
min       0.011236
25%       0.046723
50%       0.066318
75%       0.099946
max       0.347452
Name: one_to_five_star_combined_category_grouped, dtype: float64

In [130]:
#combine title and text of the reviews in a single column

category_stats["review_combined_category_grouped"] = category_stats["review_title_combined_category_grouped"] + " " + category_stats["review_text_combined_category_grouped"] 

#drop original columns

category_stats.drop(columns =["review_title_combined_category_grouped", "review_text_combined_category_grouped"],inplace=True)

In [131]:
#check null values
category_stats.isna().sum()

combined_category                                     0
num_reviews_combined_category_grouped                 0
avg_user_rating_combined_category_grouped             0
std_user_rating_combined_category_grouped             0
review_title_word_counts_combined_category_grouped    0
review_text_word_counts_combined_category_grouped     0
one_to_five_star_combined_category_grouped            0
review_combined_category_grouped                      0
dtype: int64

### <a id='toc1_1_6_'></a>[Tokenizing Review texts and titles](#toc0_)

In this section, we explore token creation using store-grouped and category-grouped reviews. We analyze the top tokens using the TF-IDF vectorizer. However, since incorporating these tokens as features does not significantly improve our final model, we exclude this approach from the final workflow.  

In [132]:
# # Download the spaCy model if it is not installed (this should match the model loaded below)
# #!python -m spacy download en_core_web_sm -q

# # Load the small English pipeline without the parser and NER components
# nlp = spacy.load('en_core_web_sm',disable=["parser", "ner"])
# nlp.max_length = 8_000_000  # allow very long concatenated review strings
# # Function to generate tokens after lemmatization
# def tokenizer(text):
#     """
#     Tokenizes and lemmatizes a review string, removing stopwords.

#     Args:
#         text (str): Concatenated review text for one store or category.

#     Returns:
#         List[str]: Unique lemmatized tokens, in order of first appearance.
#     """
#     parsed = nlp(text)

#     # Convert to lowercase, keep only alphabetic words, and remove stopwords
#     tok_lemmas = [
#         token.lemma_.lower()    # Convert lemma to lowercase
#         for token in parsed
#         if token.is_alpha       # Check if it's alphabetic
#         and not token.is_stop   # and not a stopword
#         and len(token) >2       # and token is more than 2 letters
#         and token.pos_ in ("ADJ", "NOUN","PROPN")    #extract nouns, proper nouns and adjectives
#     ]

#     # Remove duplicate tokens while maintaining order
#     unique_tokens = list(dict.fromkeys(tok_lemmas))

#     return unique_tokens  # output is a list



In [133]:
# #apply custom tokenizer on store grouped reviews
# store_stats['tokenized_review_store_grouped'] = store_stats['review_store_grouped'].apply(tokenizer) 

# #Join the tokens in each list to form a single string
# store_stats["tokenized_review_store_grouped"] = store_stats["tokenized_review_store_grouped"].apply(lambda x: ' '.join(x))

In [134]:
# store_stats["tokenized_review_store_grouped"].head()

In [135]:
# #trying tfidf vectorization of tokenized reviews  to have a quick look at the generated features
# tfidf_vect = TfidfVectorizer(      token_pattern=r"\b[a-z]{3,}\b",
#                                    lowercase=True,
#                                    min_df=0.05,
#                                    max_df=0.90,                  
#                                    stop_words='english',   
#                                    )         #

# # Fit 
# tfidf_vect.fit(store_stats['tokenized_review_store_grouped'])

# #Transform - output is a sparse matrix
# transformed_features = tfidf_vect.transform(store_stats['tokenized_review_store_grouped'])

# #convert to dataframe
# store_tokens_df  = pd.DataFrame(transformed_features.toarray(), columns= tfidf_vect.get_feature_names_out())

# print(f'Shape of transformed array of store-grouped review token features: {store_tokens_df.shape}')


In [136]:
# #check the 20 tokens with the highest total TF-IDF weight in the store grouped reviews
# store_tokens_df.sum(axis=0).sort_values(ascending=False).head(20)

In [137]:
#drop original columns
store_stats.drop(columns=["review_store_grouped"],inplace=True)

In [138]:
# #apply custom tokenizer on category grouped reviews
# category_stats['tokenized_review_combined_category_grouped'] = category_stats['review_combined_category_grouped'].apply(tokenizer) 

# #Join the tokens in each list to form a single string
# category_stats["tokenized_review_combined_category_grouped"] = category_stats["tokenized_review_combined_category_grouped"].apply(lambda x: ' '.join(x))

In [139]:
# #trying tfidf vectorization of tokenized reviews  to have a quick look at the generated features
# tfidf_vect = TfidfVectorizer(      token_pattern=r"\b[a-z]{3,}\b",
#                                    lowercase=True,
#                                    min_df=0.05,
#                                    max_df=0.90,                  
#                                    stop_words='english',   
#                                    )         #

# # Fit 
# tfidf_vect.fit(category_stats['tokenized_review_combined_category_grouped'])

# #Transform - output is a sparse matrix
# transformed_features = tfidf_vect.transform(category_stats['tokenized_review_combined_category_grouped'])

# #convert to dataframe
# cat_tokens_df  = pd.DataFrame(transformed_features.toarray(), columns= tfidf_vect.get_feature_names_out())

# print(f'Shape of transformed array of category-grouped review token features: {cat_tokens_df.shape}')

In [140]:
# #check the 20 tokens with the highest total TF-IDF weight in the category grouped reviews
# cat_tokens_df.sum(axis=0).sort_values(ascending=False).head(20)

In [141]:
#drop original columns
category_stats.drop(columns=["review_combined_category_grouped"],inplace=True)

### <a id='toc1_1_7_'></a>[Combine store and category aggregated columns](#toc0_)

In [142]:
# Merge the results back into meta_df
meta_store_df = meta_df.merge(store_stats, on='store_grouped').reset_index(drop=True)

merged_final_df = meta_store_df.merge(category_stats, on='combined_category').reset_index(drop=True)
merged_final_df.head()

Unnamed: 0,product_price,highly_rated_product,product_title_length,log_store_grouped_total_products,average_rating,num_product_images,log_store_grouped_weighted_mean_rating,store_grouped_std_rating,has_package_density,package_density,...,review_text_word_counts_store_grouped,verified_purchase_ratio_store_grouped,weighted_helpfulness_store_grouped,one_to_five_star_store_grouped,num_reviews_combined_category_grouped,avg_user_rating_combined_category_grouped,std_user_rating_combined_category_grouped,review_title_word_counts_combined_category_grouped,review_text_word_counts_combined_category_grouped,one_to_five_star_combined_category_grouped
0,24.0,0,33,1.716003,4.4,3,0.651539,0.380259,1,0.074682,...,23.365044,0.980477,0.772124,0.051213,20154,4.372829,1.243143,3.288975,22.939615,0.105545
1,12.95,0,143,2.93852,4.5,8,0.643441,0.383498,1,0.162845,...,18.927813,0.961375,0.081823,0.138847,41110,4.54809,1.09433,3.300122,21.584164,0.072742
2,31.0,0,172,1.643453,4.4,8,0.668089,0.325217,0,0.105715,...,21.185315,0.888199,0.230769,0.019084,10528,4.646657,0.956326,3.435695,24.045118,0.04604
3,3.99,0,169,4.053962,4.1,2,0.652882,0.568533,0,0.041667,...,24.487361,0.970938,0.366214,0.073569,1040,4.501923,1.146613,3.480769,23.256731,0.082536
4,15.59,1,181,1.079181,4.6,8,0.66197,0.317543,0,0.019231,...,23.785714,0.982456,0.25,0.021277,3953,4.431318,1.188636,3.405009,24.865166,0.088787


In [143]:
merged_final_df.isna().sum().loc[lambda x: x >0]

Series([], dtype: int64)

In [144]:
#drop rows corresponding to null values if any
merged_final_df = merged_final_df.dropna().reset_index(drop=True)
merged_final_df.isna().sum().loc[lambda x: x >0]

Series([], dtype: int64)

In [145]:
merged_final_df.head()

Unnamed: 0,product_price,highly_rated_product,product_title_length,log_store_grouped_total_products,average_rating,num_product_images,log_store_grouped_weighted_mean_rating,store_grouped_std_rating,has_package_density,package_density,...,review_text_word_counts_store_grouped,verified_purchase_ratio_store_grouped,weighted_helpfulness_store_grouped,one_to_five_star_store_grouped,num_reviews_combined_category_grouped,avg_user_rating_combined_category_grouped,std_user_rating_combined_category_grouped,review_title_word_counts_combined_category_grouped,review_text_word_counts_combined_category_grouped,one_to_five_star_combined_category_grouped
0,24.0,0,33,1.716003,4.4,3,0.651539,0.380259,1,0.074682,...,23.365044,0.980477,0.772124,0.051213,20154,4.372829,1.243143,3.288975,22.939615,0.105545
1,12.95,0,143,2.93852,4.5,8,0.643441,0.383498,1,0.162845,...,18.927813,0.961375,0.081823,0.138847,41110,4.54809,1.09433,3.300122,21.584164,0.072742
2,31.0,0,172,1.643453,4.4,8,0.668089,0.325217,0,0.105715,...,21.185315,0.888199,0.230769,0.019084,10528,4.646657,0.956326,3.435695,24.045118,0.04604
3,3.99,0,169,4.053962,4.1,2,0.652882,0.568533,0,0.041667,...,24.487361,0.970938,0.366214,0.073569,1040,4.501923,1.146613,3.480769,23.256731,0.082536
4,15.59,1,181,1.079181,4.6,8,0.66197,0.317543,0,0.019231,...,23.785714,0.982456,0.25,0.021277,3953,4.431318,1.188636,3.405009,24.865166,0.088787


In [146]:
print(f'The final dataframe ready for modeling has {merged_final_df.shape[0]} rows and {merged_final_df.shape[1]} columns.')

The final dataframe ready for modeling has 38127 rows and 37 columns.
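Both merges above use the pandas default `how='inner'`, so any product whose store or category has no row in the aggregated tables is dropped. The sketch below illustrates this on made-up toy frames and contrasts it with a left join; it is not a statement about how many rows were dropped in the real data.

```python
import pandas as pd

toy_meta = pd.DataFrame({"parent_asin": ["A1", "A2", "A3"],
                         "store_grouped": ["a", "b", "z"]})
# Store "z" has no verified reviews, so it has no row in the aggregates
toy_store_stats = pd.DataFrame({"store_grouped": ["a", "b"],
                                "num_reviews_store_grouped": [10, 4]})

# Default inner join: the product from store "z" disappears
inner = toy_meta.merge(toy_store_stats, on="store_grouped")

# Left join: the product is kept, with NaN in the aggregated column
left = toy_meta.merge(toy_store_stats, on="store_grouped", how="left")
print(len(inner), len(left))
```

Comparing the row count before and after the merge is a quick way to see how many products were lost this way.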


In [147]:
merged_final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38127 entries, 0 to 38126
Data columns (total 37 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   product_price                                       38127 non-null  float64
 1   highly_rated_product                                38127 non-null  int64  
 2   product_title_length                                38127 non-null  int64  
 3   log_store_grouped_total_products                    38127 non-null  float64
 4   average_rating                                      38127 non-null  float64
 5   num_product_images                                  38127 non-null  int64  
 6   log_store_grouped_weighted_mean_rating              38127 non-null  float64
 7   store_grouped_std_rating                            38127 non-null  float64
 8   has_package_density                                 38127 non-null  int64  


### <a id='toc1_1_8_'></a>[Data Dictionary](#toc0_)

# <a id='toc2_'></a>[Data Dictionary](#toc0_)

| #  | Column | Data Type | Description |
|----|--------|-----------|-------------|
| 0  | product_price | float64 | Price of the product |
| 1  | highly_rated_product | int64 | Binary flag (1 = highly rated, 0 = not highly rated) |
| 2  | product_title_length | int64 | Length of the product title (number of characters) |
| 3  | log_store_grouped_total_products | float64 | Log-transformed total number of products sold by the store |
| 4  | average_rating | float64 | Average rating of the product |
| 5  | num_product_images | int64 | Number of images associated with the product |
| 6  | log_store_grouped_weighted_mean_rating | float64 | Log-transformed weighted mean rating of all products in the store |
| 7  | store_grouped_std_rating | float64 | Standard deviation of product ratings in the store |
| 8  | has_package_density | int64 | Binary flag (1 = product has package density, 0 = no) |
| 9  | package_density | float64 | Density of the product packaging |
| 10 | product_age_days | int64 | Number of days since the product was first listed |
| 11 | store_grouped | object | Store name (grouped by common identifier) |
| 12 | parent_asin | object | Parent ASIN (Amazon Standard Identification Number) |
| 13 | subcategory1_total_products | int64 | Total number of products in subcategory 1 |
| 14 | subcategory1_mean_rating | float64 | Mean rating of products in subcategory 1 |
| 15 | subcategory1_std_rating | float64 | Standard deviation of ratings in subcategory 1 |
| 16 | subcategory1_std_rating_number | float64 | Standard deviation of rating count in subcategory 1 |
| 17 | combined_category_weighted_mean_rating | float64 | Weighted mean rating of all products in the combined category |
| 18 | combined_category_total_products | int64 | Total number of products in the combined category |
| 19 | combined_category_mean_rating_number | float64 | Mean number of ratings per product in the combined category |
| 20 | combined_category_std_rating_number | float64 | Standard deviation of number of ratings in the combined category |
| 21 | spacy_tokenized_features | object | Tokenized product features using spaCy |
| 22 | combined_category | object | Category name after combining multiple levels |
| 23 | num_reviews_store_grouped | int64 | Number of verified-purchase reviews for the store |
| 24 | avg_user_rating_store_grouped | float64 | Average user rating for the store |
| 25 | std_user_rating_store_grouped | float64 | Standard deviation of user ratings in the store |
| 26 | review_title_word_counts_store_grouped | float64 | Average word count of review titles in the store |
| 27 | review_text_word_counts_store_grouped | float64 | Average word count of review texts in the store |
| 28 | verified_purchase_ratio_store_grouped | float64 | Share of the store's reviews that are verified purchases |
| 29 | weighted_helpfulness_store_grouped | float64 | Mean helpful votes per review in the store, signed by rating polarity (positive for ratings of 4 or 5, negative otherwise) |
| 30 | one_to_five_star_store_grouped | float64 | Ratio of 1-star to 5-star reviews in the store |
| 31 | num_reviews_combined_category_grouped | int64 | Number of verified-purchase reviews in the combined category |
| 32 | avg_user_rating_combined_category_grouped | float64 | Average user rating in the combined category |
| 33 | std_user_rating_combined_category_grouped | float64 | Standard deviation of user ratings in the combined category |
| 34 | review_title_word_counts_combined_category_grouped | float64 | Average word count of review titles in the combined category |
| 35 | review_text_word_counts_combined_category_grouped | float64 | Average word count of review texts in the combined category |
| 36 | one_to_five_star_combined_category_grouped | float64 | Ratio of 1-star to 5-star reviews in the combined category |

### <a id='toc2_1_1_'></a>[Pickling the dataframe](#toc0_)

In [148]:
# Pickle the DataFrame
merged_final_df.to_pickle('../data/meta_review_sample_preprocessed.pkl')