# Product Metadata, Feature Merging, and Stratified Group Splitting

This notebook performs the following steps for Amazon product data:

1. **Data Import and Cleaning**
   - Loads cleaned product metadata and merges with match/component labels.
   - Ensures only products with cleaned reviews are included.
   - Merges in review-based features and text embeddings.

2. **Class Balancing**
   - Randomly subsamples the majority class (`match==0`).

3. **Stratified Group Splitting**
   - Uses a custom `StratifiedGroupSplit` function to split the data into train/test/validation sets, preserving both label proportions and component integrity.

4. **Split Validation**
   - Checks that splits are balanced in terms of size and label distribution.
   - Ensures no group/component appears in more than one split.

5. **Saving Outputs**
   - Saves the resulting splits and their indices to Parquet files for downstream modeling.

6. **Cross-Validation Split**
   - Performs 3-fold cross-validation splitting with the custom splitter, ensuring stratification and component integrity.

## Input Files

- `../Data/metadata_cleaned.parquet`  
   Cleaned product metadata. Generated by `meta_cleaning.ipynb`.
- `../Data/amazon_df_labels.csv`    
   Match and component labels. Generated by `../matching/MatchWithPretrainedModelandLLM.ipynb`.
- `../Data/asin_labels_clean_review_df.csv`    
   List of cleaned reviews. Generated by `CleanReviewsData.ipynb`
- `../Data/final_reviews.parquet`  
    Review-based features. Generated by `reviews_features.ipynb`.
- `../Data/agg_summary_embeddings.pkl`  
    Summary text embeddings. The notebook that generates this file is not documented.
- `../Data/reviewtext_features_df.pkl`    
  Review text embeddings and sentiment scores. Generated by `SimilarityScore.ipynb`

## Output Files

- `../Data/test_v3.parquet` (test split)
- `../Data/train_final_v3.parquet` (training split)
- `../Data/validationA_v3.parquet` (split for validation)
- `../Data/validationB_v3.parquet` (split for calibration)
- `../Data/CV_val_split.parquet` (indices to perform 3-fold cross-validation)

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.neighbors import KNeighborsRegressor
from tqdm import tqdm
from ast import literal_eval
import os
tqdm.pandas()

In [2]:
import sys
from pathlib import Path

project_root = Path().resolve().parent   
sys.path.insert(0, str(project_root / "src"))

##### Import product metadata

In [3]:
# Import the cleaned metadata.
# Generated by meta_cleaning.ipynb
combined_df=pd.read_parquet("../Data/metadata_cleaned.parquet")

##### Add matches and components

In [4]:
matches_df=pd.read_csv("../Data/amazon_df_labels.csv")

In [5]:
# Drop duplicate rows
matches_df=matches_df.drop_duplicates()

In [6]:
# Find and label components
from generate_component_nums import get_components
matches_df['component_no']=get_components(matches_df)
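
`get_components` is imported from the project's `src` directory and its implementation is not shown in this notebook. As a rough illustration of what component labelling means, the sketch below assumes the matches can be written as pairs of ASINs (an assumption; the real input format may differ) and uses union-find to give every connected group one number. `label_components` and its arguments are hypothetical names.

```python
def label_components(asins, pairs):
    """Assign a component number to every ASIN; linked ASINs share a number."""
    parent = {a: a for a in asins}

    def find(a):
        # Walk up to the root, shortening the path along the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Merge the two groups of every matched pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the roots in order of first appearance.
    numbers = {}
    labels = {}
    for a in asins:
        root = find(a)
        labels[a] = numbers.setdefault(root, len(numbers))
    return labels


# Hypothetical toy input: A1-A2 and A2-A3 are matched, A4 stands alone.
example = label_components(
    ["A1", "A2", "A3", "A4"],
    [("A1", "A2"), ("A2", "A3")],
)
```

Products that are never matched end up in a component of their own, which is what lets the later splits treat a component as an indivisible group.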

In [7]:
# Merge matches with combined_df
combined_df=combined_df.merge(matches_df,on='asin')

In [8]:
# Import the asins that were kept after cleaning the review data.
review_cleaned_asins=pd.read_csv("../Data/asin_labels_clean_review_df.csv")

# Check no duplicate asin
assert not review_cleaned_asins.asin.duplicated().any()

In [9]:
# Drop entries from combined_df that don't appear in review_cleaned_asins
combined_df=combined_df.merge(review_cleaned_asins[['asin']],on='asin')

##### Add reviews features

In [10]:
# Load Reviews_df (from reviews_features.ipynb)
reviews_features_df=pd.read_parquet("../Data/final_reviews.parquet")

# Verify no duplicated asins
assert not reviews_features_df.asin.duplicated().any()

In [11]:
# Merge meta and review datasets
combined_df=combined_df.merge(reviews_features_df,on='asin')
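
The inner merges above double as filters: a product survives only if its `asin` appears on both sides. A small sketch of the same pattern on made-up data, using the `validate` argument of `DataFrame.merge` so that a repeated key raises an error instead of silently duplicating rows. The frames `meta` and `feats` are hypothetical stand-ins.

```python
import pandas as pd

meta = pd.DataFrame({"asin": ["A1", "A2", "A3"],
                     "category": ["Games", "Puzzles", "Hobbies"]})
feats = pd.DataFrame({"asin": ["A1", "A3"], "avg_rating": [4.8, 4.2]})

# validate="one_to_one" raises pandas.errors.MergeError if either side repeats a key.
merged = meta.merge(feats, on="asin", how="inner", validate="one_to_one")

# With unique keys an inner merge can only drop rows, never add them.
rows_lost = len(meta) - len(merged)
```

Logging `rows_lost` after each merge is one way to see how many products each filtering step removes.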

In [12]:
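# Preview combined_df without num_of_rating, which is identical to reviews_per_product.
# The result is not assigned, so the column stays in combined_df until the
# "Drop unneeded columns" step removes it.
combined_df.drop(columns=['num_of_rating'])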

| | asin | category | title | missing_price | item_rank | match | incident_indices | component_no | avg_rating | min_rating | ... | avg_verified_reviewers | min_date | max_date | product_lifespan | num_bots_per_asin | unique_reviewer_count | avg_reviews_per_day | reviews_per_product | avg_review_length_words | avg_review_length_chars |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0000191639 | Puzzles | Dr. Suess 19163 Dr. Seuss Puzzle 3 Pack Bundle | True | 2230717.0 | 0 | [] | 449 | 5.000000 | 5 | ... | 1.000000 | 2013-12-26 | 2013-12-26 | 0 days | 0 | 1 | 3.000000 | 1 | 23.000000 | 125.000000 |
| 1 | 0004983289 | Games | Dutch Blitz Card Game | False | 376337.0 | 0 | [] | 452 | 4.800000 | 5 | ... | 1.000000 | 2016-12-10 | 2018-03-28 | 473 days | 0 | 5 | 0.801198 | 5 | 15.400000 | 107.200000 |
| 2 | 0020232233 | Grown-Up Toys | Dungeons &amp; Dragons - &quot;Storm Kings Thu... | False | 178217.0 | 0 | [] | 454 | 4.130435 | 5 | ... | 0.782609 | 2016-09-12 | 2018-04-06 | 571 days | 0 | 23 | 0.642518 | 23 | 59.260870 | 329.608696 |
| 3 | 0096737581 | Arts & Crafts | NUM NOMS figures Storage Case Organizer - hold... | True | 989767.0 | 0 | [] | 455 | 2.333333 | 5 | ... | 0.666667 | 2016-12-03 | 2017-07-08 | 217 days | 0 | 3 | 0.671922 | 3 | 22.666667 | 126.000000 |
| 4 | 014002316X | Toy Remote Control & Play Vehicles | UDI U806 Infrared Remote Control Helicopter W/... | True | 3687991.0 | 0 | [] | 456 | 1.000000 | 1 | ... | 1.000000 | 2015-09-25 | 2015-09-25 | 0 days | 0 | 1 | 1.000000 | 1 | 12.000000 | 52.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 547713 | B01HJC53OE | Action Figures & Statues | Marvel Funko Pop Black Suit Spider-Man #79 (Gl... | True | 482069.0 | 0 | [] | 623284 | 4.500000 | 5 | ... | 1.000000 | 2017-07-23 | 2018-03-09 | 229 days | 0 | 4 | 1.010479 | 4 | 21.750000 | 109.750000 |
| 547714 | B01HJDFWDK | Hobbies | Geilienergy 6V 2000mAh NiMH RX Battery Packs w... | False | 350404.0 | 0 | [] | 623285 | 4.400000 | 5 | ... | 0.966667 | 2016-10-06 | 2018-10-04 | 728 days | 0 | 30 | 0.615794 | 30 | 16.333333 | 86.133333 |
| 547715 | B01HJDGVFS | Sports & Outdoor Play | Micord Baby Float Toddler Swimming Inflatable ... | True | 652169.0 | 0 | [] | 623286 | 4.200000 | 5 | ... | 1.000000 | 2016-08-26 | 2018-08-13 | 717 days | 0 | 5 | 0.603825 | 5 | 11.600000 | 57.400000 |
| 547716 | B01HJDUNRU | Sports & Outdoor Play | Premium Swimming Pool Float Hammock, Inflatabl... | False | 253066.0 | 0 | [] | 623287 | 4.222222 | 5 | ... | 0.944444 | 2016-08-13 | 2018-09-05 | 753 days | 0 | 36 | 0.760222 | 36 | 38.944444 | 204.194444 |


##### Add summary and review embeddings

In [13]:
# Load summary embeddings
summary_embeddings=pd.read_pickle("../Data/agg_summary_embeddings.pkl")

# Verify no duplicated asins
assert not summary_embeddings.asin.duplicated().any()

In [14]:
combined_df=combined_df.merge(summary_embeddings,on='asin',how='left').set_axis(combined_df.index)

In [15]:
# Load review embeddings
review_embeddings=pd.read_pickle("../Data/reviewtext_features_df.pkl")

# Verify no duplicated asins
assert not review_embeddings.asin.duplicated().any()

In [16]:
combined_df=combined_df.merge(review_embeddings,on='asin',how='left').set_axis(combined_df.index)
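
The embeddings are attached with `how='left'` so that products without an embedding are kept, with missing values in the new columns. `merge` always returns a fresh `0..n-1` index, and the `set_axis` call puts the original row labels back. A toy illustration of that behaviour (the frames and values below are made up):

```python
import pandas as pd

left = pd.DataFrame({"asin": ["A1", "A2", "A3"]}, index=[10, 20, 30])
right = pd.DataFrame({"asin": ["A3", "A1"], "embed_0": [0.3, 0.1]})

# A left merge keeps every left row in its original order but resets the index.
merged = left.merge(right, on="asin", how="left")

# Because the right-hand keys are unique, rows line up one-to-one with `left`,
# so the original labels can be restored directly.
restored = merged.set_axis(left.index)
```

This only works when the right-hand keys are unique, which is what the duplicate checks before each merge guarantee; otherwise the merged frame would have more rows than `left` and `set_axis` would raise a length-mismatch error.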

##### Randomly dropping items labelled zero

To obtain a more manageable data set, we randomly drop items whose ``match`` column is zero.

In [17]:
num_zeros=(combined_df.match==0).sum()
num_ones=(combined_df.match==1).sum()
print(f"There are {num_zeros} products labelled 0.")
print(f"There are {num_ones} products labelled 1.")

There are 546348 products labelled 0.
There are 1370 products labelled 1.


In [18]:
# This is the number of 0s in our final dataset.
TARGET=200_000

# Check TARGET is less than the total number of zeros
assert(TARGET<num_zeros)

In [19]:
# indices of items labelled zero
zero_indices=combined_df[combined_df.match==0].index
# indices of items labelled one
one_indices=combined_df[combined_df.match==1].index

# random subset of TARGET items labelled zero
rng=np.random.default_rng(seed=1067)
random_subset=rng.choice(zero_indices,TARGET,replace=False)

# total list of indices
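# np.concatenate is available in every NumPy version; np.concat is an alias added in NumPy 2.0
indices=np.concatenate([random_subset,one_indices])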

In [20]:
# Drop rows of combined_df with index not in indices
combined_df=combined_df.loc[indices]

In [21]:
print(combined_df.match.value_counts())

match
0    200000
1      1370
Name: count, dtype: int64
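
The subsampling steps above can be bundled into one function, which makes the seed and target explicit and easy to reuse. This is a minimal sketch of the same logic; the function name `subsample_majority` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def subsample_majority(df, label_col, target, seed):
    """Keep `target` random rows with label 0 and every row with label 1."""
    zero_idx = df.index[df[label_col] == 0]
    one_idx = df.index[df[label_col] == 1]
    if target >= len(zero_idx):
        raise ValueError("target must be smaller than the number of zeros")
    # A seeded generator makes the subsample reproducible.
    rng = np.random.default_rng(seed)
    keep_zero = rng.choice(zero_idx, size=target, replace=False)
    return df.loc[np.concatenate([keep_zero, one_idx])]


# Toy data: 8 zeros and 2 ones, reduced to 4 zeros and 2 ones.
toy = pd.DataFrame({"match": [0] * 8 + [1] * 2})
small = subsample_majority(toy, "match", target=4, seed=0)
```

As in the notebook, the returned rows are ordered with the sampled zeros first and the ones last, so any downstream step that depends on row order sees the same layout.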


##### Stratified train-test split preserving groups

In [22]:
from custom_ttsplit import StratifiedGroupSplit

In [23]:
df_train,df_test=StratifiedGroupSplit(combined_df,'match','component_no',test_size=0.2,random_state=1066)
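
The source of `StratifiedGroupSplit` lives in `custom_ttsplit` and is not shown here. The sketch below is not that implementation; it only illustrates one way a split can respect both constraints, under the assumption that every component carries a single label. For each label it shuffles the components and greedily moves whole components into the test set until that label's quota is filled. All names are hypothetical.

```python
import numpy as np
import pandas as pd


def stratified_group_split_sketch(df, label_col, group_col, test_size, seed):
    """Greedy split: whole groups go to test until each label's quota is met."""
    rng = np.random.default_rng(seed)
    test_groups = []
    for _, part in df.groupby(label_col):
        sizes = part[group_col].value_counts()    # rows per group for this label
        quota = round(test_size * len(part))      # test rows wanted for this label
        taken = 0
        for g in rng.permutation(sizes.index.to_numpy()):
            # Take a group only if it fits entirely within the remaining quota.
            if taken + sizes.loc[g] <= quota:
                test_groups.append(g)
                taken += sizes.loc[g]
    in_test = df[group_col].isin(test_groups)
    return df[~in_test], df[in_test]


# Toy data: 5 ones in components of size 2, 2, 1 and 15 singleton zeros.
toy = pd.DataFrame({
    "match": [1, 1, 1, 1, 1] + [0] * 15,
    "component_no": [0, 0, 1, 1, 2] + list(range(3, 18)),
})
train, test = stratified_group_split_sketch(
    toy, "match", "component_no", test_size=0.2, seed=0
)
```

A greedy pass like this can undershoot a quota when the remaining components are all too large, so the checks in the following cells are still needed.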

In [24]:
# Check test ratio
df_test.shape[0]/combined_df.shape[0]

0.2

In [25]:
# Check ratio of 1s in set before split
combined_df[(combined_df.match)==1].shape[0]/combined_df.shape[0]

0.0068033967323831756

In [26]:
# Check ratio of 1s in test set
df_test[(df_test.match)==1].shape[0]/df_test.shape[0]

0.0068033967323831756

In [27]:
# Check ratio of 1s in training set
df_train[(df_train.match)==1].shape[0]/df_train.shape[0]

0.0068033967323831756

These ratios are identical, so the split preserves the proportion of 1s exactly.

In [28]:
# Check there are no component overlaps
comps_in_test=set(df_test.component_no.unique())
comps_in_train=set(df_train.component_no.unique())
assert(comps_in_train.intersection(comps_in_test)==set())
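
The ratio and overlap checks above can be collected into one reusable helper. This is a sketch under the assumption that the label column holds only 0 and 1, so its mean equals the proportion of 1s; `check_split` and the tolerance `tol` are hypothetical.

```python
import pandas as pd


def check_split(full, parts, label_col, group_col, tol=1e-3):
    """Assert that parts keep the label ratio, share no group, and cover `full`."""
    base_rate = full[label_col].mean()
    seen = set()
    for part in parts:
        # The proportion of 1s in each part should match the full data within tol.
        assert abs(part[label_col].mean() - base_rate) <= tol
        groups = set(part[group_col].unique())
        # No group may appear in more than one part.
        assert seen.isdisjoint(groups)
        seen |= groups
    # Together the parts should account for every row.
    assert sum(len(p) for p in parts) == len(full)
    return base_rate


# Toy data: two halves, each with one positive out of five rows.
toy = pd.DataFrame({
    "match": [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    "component_no": range(10),
})
rate = check_split(toy, [toy.iloc[:5], toy.iloc[5:]], "match", "component_no")
```

The same call could be reused for the later train/validation and cross-validation splits by passing their parts as the list.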

In [29]:
# Do not change these files unless the train-test split is changed
# df_train[['asin']].to_parquet("../Data/asins_in_splits/train_asins.parquet",compression='gzip')
# df_test[['asin']].to_parquet("../Data/asins_in_splits/test_asins.parquet",compression='gzip')

In [30]:
# Check we haven't changed the train-test split asins from the split on 13 Jun
saved_train=pd.read_parquet("../Data/asins_in_splits/train_asins.parquet")
saved_test=pd.read_parquet("../Data/asins_in_splits/test_asins.parquet")
assert(saved_train.shape[0]==df_train.shape[0])
assert((saved_train.asin!=df_train.asin).sum()==0)
assert(saved_test.shape[0]==df_test.shape[0])
assert((saved_test.asin!=df_test.asin).sum()==0)

##### Additional train_final,validation split

In [31]:
df_train_final,df_validation=StratifiedGroupSplit(df_train,'match','component_no',test_size=0.3,random_state=1043)

In [32]:
df_validation_1,df_validation_2=StratifiedGroupSplit(df_validation,'match','component_no',test_size=0.5,random_state=801)

In [33]:
# Check we haven't changed the split asins from the split on 16 Jun
# Comment out this cell if we're generating these files with new data.
saved_train_final=pd.read_parquet("../Data/asins_in_splits/train_final_2split_asins.parquet")
saved_validA=pd.read_parquet("../Data/asins_in_splits/validationA_asins.parquet")
saved_validB=pd.read_parquet("../Data/asins_in_splits/validationB_asins.parquet")
assert(saved_train_final.shape[0]==df_train_final.shape[0])
assert((saved_train_final.asin!=df_train_final.asin).sum()==0)
assert(saved_validA.shape[0]==df_validation_1.shape[0])
assert((saved_validA.asin!=df_validation_1.asin).sum()==0)
assert(saved_validB.shape[0]==df_validation_2.shape[0])
assert((saved_validB.asin!=df_validation_2.asin).sum()==0)

##### Drop unneeded columns

In [34]:
columns=combined_df.columns

In [35]:
columns[0:30]

Index(['asin', 'category', 'title', 'missing_price', 'item_rank', 'match',
       'incident_indices', 'component_no', 'avg_rating', 'min_rating',
       'num_of_rating', 'percent_positive', 'percent_negative',
       'avg_verified_reviewers', 'min_date', 'max_date', 'product_lifespan',
       'num_bots_per_asin', 'unique_reviewer_count', 'avg_reviews_per_day',
       'reviews_per_product', 'avg_review_length_words',
       'avg_review_length_chars', 'embed_0', 'embed_1', 'embed_2', 'embed_3',
       'embed_4', 'embed_5', 'embed_6'],
      dtype='object')

In [36]:
indices_to_drop=[0,2,6,7,10,22]
print("Columns to drop:")
print(columns[indices_to_drop])

Columns to drop:
Index(['asin', 'title', 'incident_indices', 'component_no', 'num_of_rating',
       'avg_review_length_chars'],
      dtype='object')


In [37]:
columns_to_keep=[col for i,col in enumerate(columns) if i not in indices_to_drop ]
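
Dropping by position breaks silently if an upstream merge adds or reorders columns. A sketch of a name-based alternative, using the six column names printed above; the helper `keep_columns` is hypothetical and raises if an expected column is missing.

```python
import pandas as pd

DROP_COLUMNS = ["asin", "title", "incident_indices", "component_no",
                "num_of_rating", "avg_review_length_chars"]


def keep_columns(df, drop=DROP_COLUMNS):
    """Return the columns of df that are not in `drop`, preserving order."""
    missing = [c for c in drop if c not in df.columns]
    if missing:
        # Fail loudly instead of dropping the wrong column by accident.
        raise KeyError(f"columns not found: {missing}")
    return [c for c in df.columns if c not in drop]


# Toy frame with the six columns to drop plus two to keep.
toy = pd.DataFrame(columns=["asin", "category", "title", "match",
                            "incident_indices", "component_no",
                            "num_of_rating", "avg_review_length_chars"])
kept = keep_columns(toy)
```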

##### Save to parquet files

In [38]:
df_test[columns_to_keep].to_parquet("../Data/test_v3.parquet", compression='gzip')

In [39]:
df_train_final[columns_to_keep].to_parquet("../Data/train_final_v3.parquet", compression='gzip')

In [40]:
df_validation_1[columns_to_keep].to_parquet("../Data/validationA_v3.parquet", compression='gzip')
df_validation_2[columns_to_keep].to_parquet("../Data/validationB_v3.parquet", compression='gzip')

In [41]:
# # Do not change these files unless the train-test split is changed
# df_train_final[['asin']].to_parquet("../Data/asins_in_splits/train_final_2split_asins.parquet",compression='gzip')
# df_validation_1[['asin']].to_parquet("../Data/asins_in_splits/validationA_asins.parquet",compression='gzip')
# df_validation_2[['asin']].to_parquet("../Data/asins_in_splits/validationB_asins.parquet",compression='gzip')

In [42]:
# # Save md5sums of files
# import hashlib

# def calculate_md5(filepath):
#     md5_hash = hashlib.md5()
#     with open(filepath, "rb") as file:
#         # Read the file in chunks to handle large files
#         for chunk in iter(lambda: file.read(4096), b""):
#             md5_hash.update(chunk)
#     return md5_hash.hexdigest()

# os.chdir("../Data/")
# file_list=["test_v3.parquet","train_final_v3.parquet","validationA_v3.parquet","validationB_v3.parquet"]
# output_file = "../Data/md5_checksums.txt"
# with open(output_file, "w") as f:
#     for file_path in file_list:
#         md5_value = calculate_md5(file_path)
#         f.write(f"{md5_value}  {file_path}\n")
# os.chdir("../feature_extractions")

##### Cross-Validation Split

Finally, we split `df_train_final` into 3 equal groups in order to perform cross-validation.

In [43]:
from sklearn.model_selection import StratifiedGroupKFold
splits=StratifiedGroupKFold(3,shuffle=True,random_state=932).split(df_train_final,y=df_train_final.match,groups=df_train_final.component_no)

In [44]:
train_sets=[]
valid_sets=[]
for train,test in splits:
    train_sets.append(train)
    valid_sets.append(test)


In [45]:
sizes=[]
num_ones=[]
num_components=[]
for i in range(3):
    valid=df_train_final.iloc[valid_sets[i]]
    ones=valid[valid.match==1].shape[0]
    comps=valid[valid.match==1].component_no.unique().shape[0]
    size=valid.shape[0]

    num_ones.append(ones)
    num_components.append(comps)
    sizes.append(size)

split_data=pd.DataFrame({'size':sizes,'num_ones':num_ones,'num_components':num_components})
split_data

| | size | num_ones | num_components |
|---|---|---|---|
| 0 | 37611 | 279 | 64 |
| 1 | 37660 | 329 | 65 |
| 2 | 37497 | 160 | 65 |


The ones are unevenly distributed across the folds (279, 329, and 160), so we use the custom function instead.

In [46]:
CV_splits=StratifiedGroupSplit(df_train_final,label_col='match',component_col='component_no',n_splits=3,random_state=1023)

In [47]:
sizes=[]
num_ones=[]
num_components=[]
for i in range(3):
    valid=CV_splits[i]
    ones=valid[valid.match==1].shape[0]
    comps=valid[valid.match==1].component_no.unique().shape[0]
    size=valid.shape[0]
    sizes.append(size)
    num_ones.append(ones)
    num_components.append(comps)

split_data=pd.DataFrame({'size':sizes,'num_ones':num_ones,'num_components':num_components})
split_data

| | size | num_ones | num_components |
|---|---|---|---|
| 0 | 37590 | 256 | 65 |
| 1 | 37589 | 256 | 82 |
| 2 | 37589 | 256 | 47 |


In [48]:
cv_split_df=pd.DataFrame({'cv_index':np.zeros(df_train_final.shape[0])},index=df_train_final.index)
cv_split_df.loc[CV_splits[1].index]=1
cv_split_df.loc[CV_splits[2].index]=2
cv_split_df=cv_split_df.astype(int)

In [49]:
cv_split_df.to_parquet("../Data/CV_val_split.parquet")
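
The saved file holds one `cv_index` value (0, 1, or 2) per training row. A downstream notebook could turn that column back into train/validation folds roughly as follows; `iter_cv_folds` is a hypothetical helper and the small frame stands in for the real file.

```python
import pandas as pd


def iter_cv_folds(cv_index):
    """Yield (fold, train_labels, valid_labels) for each distinct fold id."""
    for fold in sorted(cv_index.unique()):
        valid_mask = cv_index == fold
        # Rows in the current fold validate; all other rows train.
        yield fold, cv_index.index[~valid_mask], cv_index.index[valid_mask]


# Stand-in for pd.read_parquet("../Data/CV_val_split.parquet")
cv_split_df = pd.DataFrame({"cv_index": [0, 1, 2, 0, 1, 2]},
                           index=[11, 12, 13, 14, 15, 16])
folds = [(fold, list(train), list(valid))
         for fold, train, valid in iter_cv_folds(cv_split_df["cv_index"])]
```

The yielded values are index labels, so they should be used with `.loc` on a frame that shares the index of `df_train_final`.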