---------------------------------------------------------------------------------------------------------------------
- # ****Time-aware splitting****

- ## Since purchasing behavior is chronological, we avoided random row-level splitting, which could lead to ****(Data Leakage)****. Instead, we adopted a ****(User-Based Splitting)**** strategy, where the model is trained on one group of users and its predictive ability is tested on a completely different group. This tests whether the model can ****(Generalize)**** to users it has never seen when predicting their future orders.

- ## Training on raw ****(Transactional Data)**** produces repeated rows for the same ****(User-Product)**** pair, inflating the data without adding new predictive information. Instead, we performed ****feature engineering**** to transform the data to the ****(User-Product level)****. This doesn't reduce data quality; rather, it condenses the data into ****(Behavioral Features)**** that describe the user's relationship with the product over time, which is the level at which repurchase decisions are predicted.
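
As a minimal sketch of this aggregation step, the snippet below collapses a toy transaction log into one row per user-product pair. The input columns `order_number` and `reordered`, the toy values, and the output names `up_first_order`, `up_last_order` and `up_reorder_count` are assumptions for illustration; `up_purchase_count` mirrors a feature name that appears in this notebook's correlation output, but the notebook's actual feature definitions may differ.

```python
import pandas as pd

# Toy transaction log: one row per purchased item (hypothetical columns and values)
transactions = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 2, 2],
    "product_id":   [10, 10, 20, 10, 10, 30],
    "order_number": [1, 2, 2, 3, 1, 2],
    "reordered":    [0, 1, 0, 1, 0, 0],
})

# Collapse to one row per (user_id, product_id) pair
user_product = (
    transactions
    .groupby(["user_id", "product_id"], as_index=False)
    .agg(
        up_purchase_count=("order_number", "count"),  # how often the user bought the product
        up_first_order=("order_number", "min"),       # first order containing the product
        up_last_order=("order_number", "max"),        # most recent order containing it
        up_reorder_count=("reordered", "sum"),        # purchases flagged as reorders
    )
)

print(user_product)
```

Here six transaction rows become four user-product rows: repeated purchases turn into counts instead of duplicate rows.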

In [None]:
# Obtain the array of unique user IDs
all_users = My_Data_Aggregated['user_id'].unique()

In [None]:
# Shuffle the users once, in place, with a fixed seed so the split is reproducible
np.random.seed(42)
np.random.shuffle(all_users)

In [None]:
# Compute the split points: 80% train, 10% validation, 10% test
train_end = int(0.8 * len(all_users))
val_end = int(0.9 * len(all_users))

In [None]:
# Segmenting the user list
train_users = all_users[:train_end]
val_users = all_users[train_end:val_end]
test_users = all_users[val_end:]

In [None]:
# Extracting actual data based on users
train_df = My_Data_Aggregated[My_Data_Aggregated['user_id'].isin(train_users)]
val_df = My_Data_Aggregated[My_Data_Aggregated['user_id'].isin(val_users)]
test_df = My_Data_Aggregated[My_Data_Aggregated['user_id'].isin(test_users)]

In [None]:
# Total rows, then "rows users" for the train, validation and test splits
print(len(My_Data_Aggregated))
print("-"*20)
print(f"{len(train_df)} {len(train_users)}")
print(f"{len(val_df)} {len(val_users)}")
print(f"{len(test_df)} {len(test_users)}")

13514162
--------------------
10824880 164967
1355278 20621
1334004 20621
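
A quick sanity check for a split like this is to confirm that the three user sets are pairwise disjoint and that every row lands in exactly one split. The sketch below wraps the same shuffle-and-slice logic in a hypothetical helper, `split_users` (not part of the notebook), and runs it on toy data. It uses `np.random.default_rng` rather than the global seed, so it would not reproduce the exact split shown above.

```python
import numpy as np
import pandas as pd

def split_users(df, train_frac=0.8, val_frac=0.1, seed=42):
    """Split df into train/val/test so that each user lands in exactly one part."""
    users = df["user_id"].unique()
    rng = np.random.default_rng(seed)
    users = rng.permutation(users)  # shuffled copy; the input array is untouched

    train_end = int(train_frac * len(users))
    val_end = int((train_frac + val_frac) * len(users))
    parts = {
        "train": users[:train_end],
        "val": users[train_end:val_end],
        "test": users[val_end:],
    }
    return {name: df[df["user_id"].isin(ids)] for name, ids in parts.items()}

# Toy data (hypothetical): 20 users with 3 rows each
toy = pd.DataFrame({"user_id": np.repeat(np.arange(20), 3), "x": np.arange(60)})
splits = split_users(toy)

train_ids = set(splits["train"]["user_id"])
val_ids = set(splits["val"]["user_id"])
test_ids = set(splits["test"]["user_id"])

# No user may appear in more than one split
assert train_ids.isdisjoint(val_ids)
assert train_ids.isdisjoint(test_ids)
assert val_ids.isdisjoint(test_ids)

# Every row is assigned to exactly one split
assert sum(len(part) for part in splits.values()) == len(toy)
```

The same assertions can be run against `train_users`, `val_users` and `test_users` directly by wrapping each in `set(...)`.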


- ## To avoid ****(Data leakage)****, we split the data based on ****(User IDs)**** rather than randomly splitting rows. This ensures that the model is trained on one set of users and tested on a completely independent set, so no user's rows appear in both. The time-aware aspect comes from the features themselves: each row summarizes a user's aggregated ****(past history)**** and the target refers to ****(future orders)****, which mimics the real-world prediction task. Note that the user split guarantees no user overlap between the training and test sets; it does not by itself guarantee that the two sets cover different time periods.

In [None]:
# Rank features by their correlation with the target and show the top 10
correlation = train_df.corr()['target'].sort_values(ascending=False)
print(correlation.head(10))

target                         1.000000
up_reorder_probability         0.281705
up_last_3_purchase_count       0.259557
up_purchase_count              0.195469
reordered                      0.164602
product_reorder_rate           0.133168
product_id_kfold_te            0.108807
log_total_product_sales        0.095508
aisle_id_kfold_te              0.082753
up_days_since_last_purchase    0.068042
Name: target, dtype: float64
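
The ranking above sorts by signed correlation, so a strongly negative feature would fall to the bottom of the list instead of appearing in the top 10. A minimal sketch of ranking by absolute correlation instead, on toy data with hypothetical column names (`f_pos`, `f_neg`, `f_weak`):

```python
import pandas as pd

# Toy frame (hypothetical columns): f_pos rises with target, f_neg falls with it
toy = pd.DataFrame({
    "target": [0, 0, 1, 1, 0, 1],
    "f_pos":  [1, 2, 5, 6, 1, 7],
    "f_neg":  [9, 8, 2, 1, 9, 0],
    "f_weak": [3, 1, 2, 3, 2, 1],
})

# Pearson correlation of every feature with the target, excluding the target itself
corr_with_target = toy.drop(columns="target").corrwith(toy["target"])

# Order by magnitude, but keep the sign visible in the output
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Using `corrwith` also avoids computing the full feature-by-feature correlation matrix when only the target column is needed.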
