## **Dimensionality & collinearity**
We now compute the Variance Inflation Factor (VIF) for each feature to identify and remove those with severe multicollinearity ($VIF > 10$).

Process:

Sample the data: Calculating VIF on all 32M rows is computationally expensive, so we take a random sample of 100,000 rows.

Calculate VIF: For each continuous numerical feature, we measure how much the variance of its regression coefficient is inflated because the feature can be linearly predicted from the other features.

Action: Features with high VIF scores (meaning they are largely redundant given the others) are candidates for removal to streamline the model.
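
To make the quantity concrete: the VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that feature on all the others. The following is a minimal sketch on synthetic data, not the notebook's pipeline; `vif_from_r2` and the arrays `a`, `b`, `c` are hypothetical names introduced for illustration. It includes an intercept in each auxiliary regression, which statsmodels' `variance_inflation_factor` does not add on its own, so the numbers can differ from the statsmodels output when the data are not centered.

```python
import numpy as np


def vif_from_r2(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    # Add an intercept column so R^2 is the usual centered R^2.
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)


rng = np.random.default_rng(42)
a = rng.normal(size=5000)
b = rng.normal(size=5000)
c = a + b + rng.normal(scale=0.1, size=5000)  # nearly a linear combination of a and b

independent = np.column_stack([a, b])
collinear = np.column_stack([a, b, c])

vif_independent = [vif_from_r2(independent, j) for j in range(independent.shape[1])]
vif_collinear = [vif_from_r2(collinear, j) for j in range(collinear.shape[1])]

print("independent features:", vif_independent)
print("with a near-duplicate:", vif_collinear)
```

With only the two independent columns the VIFs stay close to 1; once `c` is added, all three columns are almost perfectly predictable from the other two and their VIFs rise far above the threshold of 10.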

In [None]:
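import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Identifiers, the target, and eval-set flags are not candidate predictors.
exclude_cols = ['user_id', 'product_id', 'order_id', 'reordered', 'eval_set_test', 'eval_set_train']

# Also skip encoded columns, matched by name pattern.
exclude_cols += [col for col in My_Data.columns
                 if '_kfold_te' in col or 'order_dow_' in col
                 or 'department_id_' in col or 'order_hour_' in col]

numeric_cols = My_Data.select_dtypes(include=['float32', 'float64', 'int32', 'int64', 'int8']).columns.tolist()

features_for_vif = list(set(numeric_cols) - set(exclude_cols))

print(f"Features selected for VIF: {len(features_for_vif)}")
print(features_for_vif)

# Work on a fixed-seed random sample to keep the computation tractable.
vif_data = My_Data[features_for_vif].sample(100000, random_state=42).copy()

# Replace infinities and missing values with 0 so the regressions can run.
vif_data = vif_data.replace([np.inf, -np.inf], 0).fillna(0)

# Note: variance_inflation_factor does not add an intercept,
# so the VIFs below are computed without one.
vif_df = pd.DataFrame()
vif_df["feature"] = vif_data.columns
vif_df["VIF"] = [variance_inflation_factor(vif_data.values, i)
                 for i in range(len(vif_data.columns))]

print(vif_df.sort_values(by="VIF", ascending=False))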

Features selected for VIF: 22
['add_to_cart_order', 'user_reorder_ratio', 'product_reorder_rate', 'days_since_prior_order', 'up_days_since_last_purchase', 'is_in_last_order', 'order_dow', 'up_reorder_probability', 'user_mean_days_between_orders', 'is_weekend', 'avg_pos_in_cart', 'log_total_product_sales', 'up_purchase_count', 'product_total_purchases', 'user_avg_basket_size', 'user_total_orders', 'order_number', 'day_period', 'product_avg_hour_of_day', 'up_last_3_purchase_count', 'aisle_id', 'product_name']


  return 1 - self.ssr/self.uncentered_tss


                          feature         VIF
18        product_avg_hour_of_day  160.510168
10                avg_pos_in_cart   49.239503
11        log_total_product_sales   43.889388
13        product_total_purchases   42.396481
21                   product_name   41.027708
2            product_reorder_rate   40.343477
1              user_reorder_ratio   23.149167
15              user_total_orders   13.842369
8   user_mean_days_between_orders   13.381530
7          up_reorder_probability   11.871686
6                       order_dow    8.326852
14           user_avg_basket_size    7.437191
17                     day_period    5.937772
9                      is_weekend    4.864623
4     up_days_since_last_purchase    4.853919
20                       aisle_id    4.837858
12              up_purchase_count    4.726452
19       up_last_3_purchase_count    3.276808
16                   order_number    2.165246
3          days_since_prior_order    1.606060
0               add_to_cart_order 
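
The $VIF > 10$ rule can be applied to this table directly. The snippet below is a small sketch that rebuilds a few rows of the output above (values rounded to two decimals) and lists the features that exceed the threshold; `VIF_THRESHOLD` and `flagged` are hypothetical names, and in the notebook itself the filter would be applied to the full `vif_df`.

```python
import pandas as pd

# A few rows copied from the VIF table above, rounded to two decimals.
vif_df = pd.DataFrame({
    "feature": [
        "product_avg_hour_of_day",
        "avg_pos_in_cart",
        "up_reorder_probability",
        "order_dow",
        "days_since_prior_order",
    ],
    "VIF": [160.51, 49.24, 11.87, 8.33, 1.61],
})

VIF_THRESHOLD = 10  # cutoff stated at the top of this section

# Features whose VIF exceeds the cutoff are removal candidates.
flagged = vif_df.loc[vif_df["VIF"] > VIF_THRESHOLD, "feature"].tolist()
print(flagged)
```

Exceeding the threshold only marks a feature as a candidate. Which of the flagged features to drop remains a judgment call, since removing one member of a correlated group usually lowers the VIF of the others.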

#### Based on the VIF results, we detected severe multicollinearity ($VIF > 10$) in several features. To improve model stability and training speed, we remove four of the five features with the highest VIF scores.


In [None]:
import gc

# High-VIF features selected for removal.
features_to_drop = [
    'product_avg_hour_of_day',
    'product_total_purchases',
    'product_name',
    'avg_pos_in_cart'
]

# Only drop columns that still exist, so the cell can be re-run safely.
cols_present = [c for c in features_to_drop if c in My_Data.columns]

if cols_present:
    My_Data.drop(columns=cols_present, inplace=True)
    print(f"Successfully dropped high VIF features: {cols_present}")
    print(f"New column count: {len(My_Data.columns)}")
else:
    print("Columns already dropped or not found.")

# Trigger garbage collection after dropping the columns.
gc.collect()

Successfully dropped high VIF features: ['product_avg_hour_of_day', 'product_total_purchases', 'product_name', 'avg_pos_in_cart']
New column count: 79


0