# Product Metadata, Feature Merging, and Stratified Group Splitting

This notebook performs the following steps for Amazon product data:

1. **Data Import and Cleaning**
   - Loads cleaned product metadata and merges with match/component labels.
   - Ensures only products with cleaned reviews are included.
   - Merges in review-based features and text embeddings.

2. **Class Balancing**
   - Randomly subsamples the majority class (`match==0`).

3. **Stratified Group Splitting**
   - Uses a custom `StratifiedGroupSplit` function to split the data into train/test/validation sets, preserving both label proportions and component integrity.

4. **Split Validation**
   - Checks that splits are balanced in terms of size and label distribution.
   - Ensures no group/component appears in more than one split.

5. **Saving Outputs**
   - Saves the resulting splits and their indices to Parquet files for downstream modeling.

6. **Cross-Validation Split**
   - Performs a 3-fold cross-validation split with the custom splitter, preserving stratification and component integrity.

**Outputs:**
- Parquet files for train, test, validation, and cross-validation splits.


In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.neighbors import KNeighborsRegressor
from tqdm import tqdm
from ast import literal_eval
import os
tqdm.pandas()

##### Import product metadata

In [3]:
# Import the cleaned metadata.
# Generated by meta_cleaning.ipynb
combined_df=pd.read_parquet("../Data/metadata_cleaned.parquet")

##### Add matches and components

In [4]:
matches_df=pd.read_csv("../Data/amazon_df_labels.csv")

In [5]:
# Drop duplicate rows
matches_df=matches_df.drop_duplicates()

In [None]:
# Find and label components
from generate_component_nums import get_components
matches_df['component_no']=get_components(matches_df)
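
The implementation of `get_components` is not shown in this notebook. As a rough illustration of the idea only, the sketch below labels connected components with union-find, assuming matches can be expressed as pairs of asins. The helper `label_components` and the toy `nodes`/`edges` are hypothetical and are not part of `generate_component_nums`.

```python
def label_components(nodes, edges):
    """Assign a component number to every node using union-find (hypothetical helper)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two components joined by each edge.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the components in order of first appearance.
    numbering = {}
    labels = {}
    for n in nodes:
        root = find(n)
        if root not in numbering:
            numbering[root] = len(numbering)
        labels[n] = numbering[root]
    return labels


# Toy data: A1, A2, A3 are linked by matches; A4 and A5 are isolated.
nodes = ["A1", "A2", "A3", "A4", "A5"]
edges = [("A1", "A2"), ("A2", "A3")]
labels = label_components(nodes, edges)
```

Products that share a component number are later kept in the same split, which is why the labelling has to cover isolated products too.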

In [None]:
# Import matches with components
matches_df_old=pd.read_csv("../Data/amazon_df_labels_with_comps.csv",index_col=0)
matches_df_old.incident_indices=matches_df_old.incident_indices.apply(literal_eval)

In [None]:
# Merge matches with combined_df
combined_df=combined_df.merge(matches_df,on='asin')

In [None]:
# Import the asins that remain after cleaning the review data.
review_cleaned_asins=pd.read_csv("../Data/asin_labels_clean_review_df.csv")

# Check there are no duplicate asins
assert not review_cleaned_asins.asin.duplicated().any()

In [None]:
# Drop entries from combined_df that don't appear in review_cleaned_asins
combined_df=combined_df.merge(review_cleaned_asins[['asin']],on='asin')
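
The merge above acts as a filter: an inner join on `asin` keeps only the rows whose key appears in both frames. A minimal sketch on toy data follows; the frames `meta` and `kept` and the `price` column are made up for illustration. Passing `validate="one_to_one"` is an optional extra that makes pandas raise a `MergeError` if either side has duplicate keys.

```python
import pandas as pd

# Toy stand-ins for combined_df and review_cleaned_asins.
meta = pd.DataFrame({"asin": ["a", "b", "c"], "price": [1.0, 2.0, 3.0]})
kept = pd.DataFrame({"asin": ["a", "c", "d"]})

# Fail fast if the filter key is duplicated.
assert not kept["asin"].duplicated().any()

# The inner merge keeps only asins present in both frames.
filtered = meta.merge(kept[["asin"]], on="asin", how="inner", validate="one_to_one")
print(filtered)
```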

##### Add reviews features

In [None]:
# Load Reviews_df (from reviews_features.ipynb)
reviews_features_df=pd.read_parquet("final_reviews.parquet")

# Verify no duplicated asins
assert not reviews_features_df.asin.duplicated().any()

In [None]:
# Merge meta and review datasets
combined_df=combined_df.merge(reviews_features_df,on='asin')

In [None]:
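# Drop num_of_rating column as it is identical to reviews_per_product.
# drop() returns a new frame, so the result must be assigned back.
# Note: this shifts column positions, so any later positional column selection should be rechecked.
combined_df=combined_df.drop(columns=['num_of_rating'])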

##### Add summary and review embeddings

In [None]:
# Load summary embeddings
summary_embeddings=pd.read_pickle("../Data/agg_summary_embeddings.pkl")

# Verify no duplicated asins
assert not summary_embeddings.asin.duplicated().any()

In [None]:
combined_df=combined_df.merge(summary_embeddings,on='asin',how='left').set_axis(combined_df.index)

In [None]:
# Load review embeddings
review_embeddings=pd.read_pickle("../Data/reviewtext_features_df.pkl")

# Verify no duplicated asins
assert not review_embeddings.asin.duplicated().any()

In [None]:
combined_df=combined_df.merge(review_embeddings,on='asin',how='left').set_axis(combined_df.index)
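
The `.set_axis(combined_df.index)` call matters because `merge` returns a frame with a fresh `0..n-1` index. With `how='left'` and unique keys on the right, the row count and order of the left frame are preserved, so the original index can be put back. A small sketch with made-up frames `left` and `emb`:

```python
import pandas as pd

# Left frame with a non-default index, as after earlier filtering steps.
left = pd.DataFrame({"asin": ["a", "b", "c"]}, index=[10, 20, 30])
# Right frame with unique keys; "b" has no embedding.
emb = pd.DataFrame({"asin": ["a", "c"], "emb_0": [0.1, 0.3]})

merged = left.merge(emb, on="asin", how="left")
# The merge resets the index to 0, 1, 2.
restored = merged.set_axis(left.index)
# Rows are in the same order, so the old labels line up again.
print(restored)
```

If the right-hand keys were not unique, the left merge could produce extra rows and `set_axis` would raise a length mismatch, which is one reason for the duplicate checks before each merge.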

##### Randomly dropping items labelled zero

To obtain a more manageable dataset, we randomly drop items whose `match` column is zero.

In [None]:
num_zeros=(combined_df.match==0).sum()
num_ones=(combined_df.match==1).sum()
print(f"There are {num_zeros} products labelled 0.")
print(f"There are {num_ones} products labelled 1.")

In [None]:
# This is the number of 0s in our final dataset.
TARGET=200_000

# Check TARGET is less than the total number of zeros
assert(TARGET<num_zeros)

In [None]:
# indices of items labelled zero
zero_indices=combined_df[combined_df.match==0].index
# indices of items labelled one
one_indices=combined_df[combined_df.match==1].index

# random subset of TARGET items labelled zero
rng=np.random.default_rng(seed=1067)
random_subset=rng.choice(zero_indices,TARGET,replace=False)

# total list of indices
indices=np.concatenate([random_subset,one_indices])

In [None]:
# Drop rows of combined_df with index not in indices
combined_df=combined_df.loc[indices]

In [None]:
print(combined_df.match.value_counts())
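
The subsampling cells above can be packaged as one reusable function. The sketch below is an illustration on toy data; the name `subsample_majority` and its parameters are hypothetical and do not appear elsewhere in this project.

```python
import numpy as np
import pandas as pd


def subsample_majority(df, label_col, majority_label, target, seed):
    """Keep `target` random majority-class rows plus every other row (hypothetical helper)."""
    is_major = df[label_col] == majority_label
    major_idx = df.index[is_major]
    if target > len(major_idx):
        raise ValueError("target exceeds the number of majority-class rows")
    rng = np.random.default_rng(seed)
    # Sample without replacement so no row is selected twice.
    keep = rng.choice(major_idx, size=target, replace=False)
    return df.loc[np.concatenate([keep, df.index[~is_major]])]


# Toy frame: eight zeros and two ones.
toy = pd.DataFrame({"match": [0] * 8 + [1] * 2})
small = subsample_majority(toy, "match", 0, target=3, seed=1067)
print(small.match.value_counts())
```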

##### Stratified train-test split preserving groups

In [None]:
from custom_ttsplit import StratifiedGroupSplit

In [None]:
df_train,df_test=StratifiedGroupSplit(combined_df,'match','component_no',test_size=0.2,random_state=1066)
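
The source of `StratifiedGroupSplit` lives in `custom_ttsplit` and is not reproduced here. To make the idea concrete, the following is one possible way to split whole groups while roughly preserving label proportions. It is a hypothetical sketch, not the project's implementation, and it assumes every row in a group carries the same label.

```python
import numpy as np
import pandas as pd


def group_stratified_split(df, label_col, group_col, test_size, seed):
    """Hypothetical sketch: assign whole groups to test, stratified by group label."""
    rng = np.random.default_rng(seed)
    # One row per group: number of rows and label (assumes one label per group).
    groups = df.groupby(group_col)[label_col].agg(n_rows="count", label="max")
    test_groups = []
    for _, stratum in groups.groupby("label"):
        ids = rng.permutation(stratum.index.to_numpy())
        sizes = stratum.loc[ids, "n_rows"].to_numpy()
        # Take shuffled groups until this stratum's share of test rows is reached.
        quota = test_size * sizes.sum()
        n_take = int(np.searchsorted(np.cumsum(sizes), quota)) + 1
        test_groups.extend(ids[:n_take])
    in_test = df[group_col].isin(test_groups)
    return df[~in_test], df[in_test]


# Toy data: three matched components of two items each, ten unmatched singletons.
toy = pd.DataFrame({
    "match": [1] * 6 + [0] * 10,
    "component_no": [0, 0, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
})
train, test = group_stratified_split(toy, "match", "component_no", test_size=0.5, seed=0)
```

Because groups are never divided, the achieved test fraction can overshoot `test_size` slightly, which is why the actual ratios are checked after the split.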

In [None]:
# Check test ratio
df_test.shape[0]/combined_df.shape[0]

In [None]:
# Check ratio of 1s in set before split
combined_df[(combined_df.match)==1].shape[0]/combined_df.shape[0]

In [None]:
# Check ratio of 1s in test set
df_test[(df_test.match)==1].shape[0]/df_test.shape[0]

In [None]:
# Check ratio of 1s in training set
df_train[(df_train.match)==1].shape[0]/df_train.shape[0]

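The proportion of 1s is nearly identical in the full data set, the test set, and the training set, so the split preserves the label distribution.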

In [None]:
# Check there are no component overlaps
comps_in_test=set(df_test.component_no.unique())
comps_in_train=set(df_train.component_no.unique())
assert(comps_in_train.intersection(comps_in_test)==set())

In [None]:
# Do not change these files unless the train-test split is changed
# df_train[['asin']].to_parquet("../Data/asins_in_splits/train_asins.parquet",compression='gzip')
# df_test[['asin']].to_parquet("../Data/asins_in_splits/test_asins.parquet",compression='gzip')

In [None]:
# Check we haven't changed the train-test split asins from the split on 13 Jun
saved_train=pd.read_parquet("../Data/asins_in_splits/train_asins.parquet")
saved_test=pd.read_parquet("../Data/asins_in_splits/test_asins.parquet")
assert(saved_train.shape[0]==df_train.shape[0])
assert((saved_train.asin!=df_train.asin).sum()==0)
assert(saved_test.shape[0]==df_test.shape[0])
assert((saved_test.asin!=df_test.asin).sum()==0)
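
The checks above compare two Series element by element, which in pandas requires both to carry identical index labels. If the saved files were ever written without the original index, a value-based comparison is an alternative. A sketch, with the helper name `same_asins` being hypothetical:

```python
import numpy as np
import pandas as pd


def same_asins(saved, current):
    """True if both frames list the same asins in the same order, ignoring index labels."""
    return np.array_equal(saved["asin"].to_numpy(), current["asin"].to_numpy())


# Toy frames with equal values but different index labels.
saved = pd.DataFrame({"asin": ["a", "b"]}, index=[0, 1])
current = pd.DataFrame({"asin": ["a", "b"]}, index=[7, 3])
changed = pd.DataFrame({"asin": ["a", "c"]}, index=[7, 3])

print(same_asins(saved, current))
print(same_asins(saved, changed))
```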

##### Additional train_final,validation split

In [None]:
df_train_final,df_validation=StratifiedGroupSplit(df_train,'match','component_no',test_size=0.3,random_state=1043)

In [None]:
df_validation_1,df_validation_2=StratifiedGroupSplit(df_validation,'match','component_no',test_size=0.5,random_state=801)

In [None]:
# Check we haven't changed the split asins from the split on 16 Jun
# Comment out this cell if we're regenerating the splits with new data.
saved_train_final=pd.read_parquet("../Data/asins_in_splits/train_final_2split_asins.parquet")
saved_validA=pd.read_parquet("../Data/asins_in_splits/validationA_asins.parquet")
saved_validB=pd.read_parquet("../Data/asins_in_splits/validationB_asins.parquet")
assert(saved_train_final.shape[0]==df_train_final.shape[0])
assert((saved_train_final.asin!=df_train_final.asin).sum()==0)
assert(saved_validA.shape[0]==df_validation_1.shape[0])
assert((saved_validA.asin!=df_validation_1.asin).sum()==0)
assert(saved_validB.shape[0]==df_validation_2.shape[0])
assert((saved_validB.asin!=df_validation_2.asin).sum()==0)

##### Drop unneeded columns

In [None]:
columns=combined_df.columns

In [None]:
columns[0:30]

In [None]:
indices_to_drop=[0,2,6,7,10,22]
print("Columns to drop:")
print(columns[indices_to_drop])

In [None]:
columns_to_keep=[col for i,col in enumerate(columns) if i not in indices_to_drop ]
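
Positional indices are fragile: they silently point at different columns if an earlier step adds or removes one. One way to guard against this is to resolve the positions to names once and filter by name afterwards. The column names in this sketch are made up for illustration.

```python
import pandas as pd

# Empty toy frame; only the column labels matter here.
df = pd.DataFrame(columns=["id", "asin", "raw_title", "price", "match"])

positions_to_drop = [0, 2]
# Resolve positions to names so the choice can be reviewed and reused.
names_to_drop = list(df.columns[positions_to_drop])
columns_to_keep = [c for c in df.columns if c not in names_to_drop]

print(names_to_drop)
print(columns_to_keep)
```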

##### Save to parquet files

In [None]:
df_test[columns_to_keep].to_parquet("../Data/test_v3.parquet", compression='gzip')

In [None]:
df_train_final[columns_to_keep].to_parquet("../Data/train_final_v3.parquet", compression='gzip')

In [None]:
df_validation_1[columns_to_keep].to_parquet("../Data/validationA_v3.parquet", compression='gzip')
df_validation_2[columns_to_keep].to_parquet("../Data/validationB_v3.parquet", compression='gzip')

In [None]:
# # Do not change these files unless the train-test split is changed
# df_train_final[['asin']].to_parquet("../Data/asins_in_splits/train_final_2split_asins.parquet",compression='gzip')
# df_validation_1[['asin']].to_parquet("../Data/asins_in_splits/validationA_asins.parquet",compression='gzip')
# df_validation_2[['asin']].to_parquet("../Data/asins_in_splits/validationB_asins.parquet",compression='gzip')

In [None]:
# # Save md5sums of files
# import hashlib

# def calculate_md5(filepath):
#     md5_hash = hashlib.md5()
#     with open(filepath, "rb") as file:
#         # Read the file in chunks to handle large files
#         for chunk in iter(lambda: file.read(4096), b""):
#             md5_hash.update(chunk)
#     return md5_hash.hexdigest()

# os.chdir("../Data/")
# file_list=["test_v3.parquet","train_final_v3.parquet","validationA_v3.parquet","validationB_v3.parquet"]
# output_file = "../Data/md5_checksums.txt"
# with open(output_file, "w") as f:
#     for file_path in file_list:
#         md5_value = calculate_md5(file_path)
#         f.write(f"{md5_value}  {file_path}\n")
# os.chdir("../feature_extractions")

##### Cross-Validation Split

Finally, we split `df_train_final` into 3 groups of roughly equal size in order to perform cross-validation.

In [None]:
from sklearn.model_selection import StratifiedGroupKFold
sgkf=StratifiedGroupKFold(n_splits=3,shuffle=True,random_state=932)
splits=sgkf.split(df_train_final,y=df_train_final.match,groups=df_train_final.component_no)

In [None]:
train_sets=[]
valid_sets=[]
for train,test in splits:
    train_sets.append(train)
    valid_sets.append(test)


In [None]:
sizes=[]
num_ones=[]
num_components=[]
for i in range(3):
    valid=df_train_final.iloc[valid_sets[i]]
    ones=valid[valid.match==1].shape[0]
    comps=valid[valid.match==1].component_no.unique().shape[0]
    size=valid.shape[0]

    num_ones.append(ones)
    num_components.append(comps)
    sizes.append(size)

split_data=pd.DataFrame({'size':sizes,'num_ones':num_ones,'num_components':num_components})
split_data

The number of ones is uneven across the folds, so we use the custom function instead.

In [None]:
CV_splits=StratifiedGroupSplit(df_train_final,label_col='match',component_col='component_no',n_splits=3,random_state=1023)

In [None]:
sizes=[]
num_ones=[]
num_components=[]
for i in range(3):
    valid=CV_splits[i]
    ones=valid[valid.match==1].shape[0]
    comps=valid[valid.match==1].component_no.unique().shape[0]
    size=valid.shape[0]
    sizes.append(size)
    num_ones.append(ones)
    num_components.append(comps)

split_data=pd.DataFrame({'size':sizes,'num_ones':num_ones,'num_components':num_components})
split_data

In [None]:
cv_split_df=pd.DataFrame({'cv_index':np.zeros(df_train_final.shape[0])},index=df_train_final.index)
cv_split_df.loc[CV_splits[1].index]=1
cv_split_df.loc[CV_splits[2].index]=2
cv_split_df=cv_split_df.astype(int)

In [None]:
cv_split_df.to_parquet("../Data/CV_val_split.parquet")
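
The saved frame holds one `cv_index` per training row, so a downstream notebook can rebuild the folds from it. A possible way to consume it is sketched below on toy data; `iter_folds` is a hypothetical helper, not something defined in this project.

```python
import pandas as pd

# Toy stand-in for the frame stored in CV_val_split.parquet.
cv = pd.DataFrame({"cv_index": [0, 1, 2, 0, 1, 2]}, index=[11, 12, 13, 14, 15, 16])


def iter_folds(cv_frame, n_splits=3):
    """Yield (train_index, valid_index) label pairs, one per fold."""
    for k in range(n_splits):
        is_valid = cv_frame["cv_index"] == k
        yield cv_frame.index[~is_valid], cv_frame.index[is_valid]


folds = list(iter_folds(cv))
for train_idx, valid_idx in folds:
    print(len(train_idx), len(valid_idx))
```

The yielded labels can be passed to `.loc` on the training frame, since `cv_split_df` shares its index with `df_train_final`.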