## At least one engineered non-linear feature (log transforms, polynomial, interaction terms)

## Log Transform

In [None]:
# Convert the total product purchases to a logarithmic scale.
# np.log1p computes log(1 + x), so a count of 0 maps to 0 instead of -inf.
My_Data['log_total_product_sales'] = np.log1p(My_Data['product_total_purchases']).astype('float32')
print(My_Data.head())
gc.collect()

   order_id  user_id  order_number  days_since_prior_order  product_id  \
0   2539329      1.0     -1.015522                      -1       196.0   
1   2539329      1.0     -1.015522                      -1     14084.0   
2   2539329      1.0     -1.015522                      -1     12427.0   
3   2539329      1.0     -1.015522                      -1     26088.0   
4   2539329      1.0     -1.015522                      -1     26405.0   

   add_to_cart_order  reordered  product_name  aisle_id  eval_set_test  ...  \
0          -1.147241          0         35791        77              0  ...   
1          -0.982113          0         15935        91              0  ...   
2          -0.816986          0          6476        23              0  ...   
3          -0.651859          0          2523        23              0  ...   
4          -0.486731          0          1214        54              0  ...   

   up_purchase_count  up_reorder_probability  up_days_since_last_purchase  \
0  

0
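To illustrate why `np.log1p` is used here rather than a plain logarithm, the sketch below applies the same transform to a toy purchase-count column. The DataFrame `toy` and its values are made up for illustration; only the column names follow the cell above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): purchase counts with a zero and a heavy right tail.
toy = pd.DataFrame({"product_total_purchases": [0, 1, 10, 1_000, 100_000]})

# log1p(x) = log(1 + x): defined at x = 0 and compresses large counts.
toy["log_total_product_sales"] = (
    np.log1p(toy["product_total_purchases"]).astype("float32")
)

# expm1 is the inverse of log1p, so the original counts can be recovered.
recovered = np.expm1(toy["log_total_product_sales"].astype("float64"))

print(toy)
print(recovered.round().astype("int64").tolist())
```

Because the transform is monotonic and invertible, it changes the scale of the feature without changing the ordering of products by purchase volume.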

### The dataset now has 83 columns, which fall into 5 groups:

- Identifiers: user_id, product_id, order_id (linking keys).
- User Profile: total_orders, avg_basket, reorder_ratio (customer history and loyalty).
- Product Stats: reorder_rate, popularity_score (item characteristics).
- Interaction: up_purchase_count, up_prob, in_last_3_orders (user-specific habits).
- Context: hour, day, weekend_flag (temporal patterns).

In [None]:
print(len(My_Data.columns))
print(My_Data.columns.tolist())

83
['order_id', 'user_id', 'order_number', 'days_since_prior_order', 'product_id', 'add_to_cart_order', 'reordered', 'product_name', 'aisle_id', 'eval_set_test', 'eval_set_train', 'order_dow_1', 'order_dow_2', 'order_dow_3', 'order_dow_4', 'order_dow_5', 'order_dow_6', 'department_id_1', 'department_id_2', 'department_id_3', 'department_id_4', 'department_id_5', 'department_id_6', 'department_id_7', 'department_id_8', 'department_id_9', 'department_id_10', 'department_id_11', 'department_id_12', 'department_id_13', 'department_id_14', 'department_id_15', 'department_id_16', 'department_id_17', 'department_id_18', 'department_id_19', 'department_id_20', 'department_id_21', 'order_hour_of_day_1', 'order_hour_of_day_2', 'order_hour_of_day_3', 'order_hour_of_day_4', 'order_hour_of_day_5', 'order_hour_of_day_6', 'order_hour_of_day_7', 'order_hour_of_day_8', 'order_hour_of_day_9', 'order_hour_of_day_10', 'order_hour_of_day_11', 'order_hour_of_day_12', 'order_hour_of_day_13', 'order_hour_of_d
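One way to see how the columns fall into these groups is to assign each column a group from its name. The sketch below is only an illustration: the prefix rules in `PREFIX_GROUPS` and the helper `assign_group` are assumptions, not part of the notebook, and the sample list contains just a few of the column names printed above.

```python
from collections import Counter

IDENTIFIERS = {"order_id", "user_id", "product_id"}

# Assumed prefix-to-group rules; the real naming scheme may need more entries.
PREFIX_GROUPS = [
    ("up_", "Interaction"),
    ("product_", "Product Stats"),
    ("order_dow_", "Context"),
    ("order_hour_of_day_", "Context"),
]


def assign_group(column: str) -> str:
    """Return the feature group for a column name, or 'Unassigned'."""
    # Identifiers are checked first so that product_id is not caught by "product_".
    if column in IDENTIFIERS:
        return "Identifiers"
    for prefix, group in PREFIX_GROUPS:
        if column.startswith(prefix):
            return group
    return "Unassigned"


sample_columns = [
    "order_id", "user_id", "product_id",
    "order_dow_1", "order_hour_of_day_13",
    "product_reorder_rate", "up_purchase_count",
    "reordered",
]

group_counts = Counter(assign_group(c) for c in sample_columns)
print(dict(group_counts))
```

On the real frame the same call would be `Counter(assign_group(c) for c in My_Data.columns)`. Anything reported as `Unassigned` (for example the target column `reordered`) needs to be placed by hand.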

### Final Data Cleaning: Addressing Residual NaNs
Upon final inspection, 206,209 rows (< 1% of the data) contain NaN values in the product-specific features.

Audit Findings: These rows correspond to the train/test entries from orders.csv, which represent future orders. By definition, these orders have no product history attached by the merge with the prior dataset.

Action Plan: We cannot drop these rows because they are the target inputs for our model. Instead, we apply a domain-aware imputation strategy:

- Missing Counts/Rates: Fill with 0, which implies that no history exists yet for this specific context.

Result: No rows are dropped (100% of the dataset is retained), so every row remains valid for prediction.

In [None]:
nan_counts = My_Data.isnull().sum()
print(nan_counts[nan_counts > 0])

product_reorder_rate           206209
product_total_purchases        206209
avg_pos_in_cart                206209
product_avg_hour_of_day        206209
up_purchase_count              206209
up_reorder_probability         206209
up_days_since_last_purchase    206209
log_total_product_sales        206209
dtype: int64
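The audit finding above can be checked directly rather than assumed. This is a minimal sketch on made-up data: it assumes the one-hot flags `eval_set_train` and `eval_set_test` (both appear in the column list) mark the train/test rows, and that rows from the prior set have both flags equal to 0.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): two prior rows, one train row, one test row.
toy = pd.DataFrame({
    "eval_set_train": [0, 0, 1, 0],
    "eval_set_test": [0, 0, 0, 1],
    "product_reorder_rate": [0.4, 0.7, np.nan, np.nan],
    "up_purchase_count": [3.0, 1.0, np.nan, np.nan],
})

feature_cols = ["product_reorder_rate", "up_purchase_count"]
has_nan = toy[feature_cols].isnull().any(axis=1)
is_train_or_test = (toy["eval_set_train"] == 1) | (toy["eval_set_test"] == 1)

# If the audit finding holds, no NaN row lies outside the train/test rows.
nan_outside_train_test = int((has_nan & ~is_train_or_test).sum())

print(pd.crosstab(is_train_or_test, has_nan,
                  rownames=["train_or_test"], colnames=["has_nan"]))
print(nan_outside_train_test)
```

A non-zero count in the last line would mean that some missing values come from another source, such as an unmatched merge key, and filling them with 0 would hide that problem.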


### Final Data Sanitization: Handling Residual Nulls
After merging all engineered features, we perform a final audit for missing values.

Observation: We expect to find NaN values in product-interaction features (e.g., up_purchase_count, product_reorder_rate) specifically for the Train/Test set rows.

Reason: These rows represent new orders where the specific user-product interaction might not have occurred in the prior history, or the product features could not be mapped (cold start).

In [None]:
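# Count missing values per column and keep only the columns that have any
nan_summary = My_Data.isnull().sum()
nan_only = nan_summary[nan_summary > 0]
print(nan_only)

# Fill the affected columns with 0 (no history is available for these rows)
if not nan_only.empty:
    cols_to_fix = nan_only.index.tolist()
    My_Data[cols_to_fix] = My_Data[cols_to_fix].fillna(0)

# Number of columns that needed filling
print(len(nan_only))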

product_reorder_rate           206209
product_total_purchases        206209
avg_pos_in_cart                206209
product_avg_hour_of_day        206209
up_purchase_count              206209
up_reorder_probability         206209
up_days_since_last_purchase    206209
log_total_product_sales        206209
dtype: int64
8
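The fill step can also be wrapped in a small helper with a post-condition, so that re-running the cell is harmless and any remaining NaN fails loudly. This is a sketch on made-up data; the function name `fill_residual_nans` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def fill_residual_nans(df: pd.DataFrame, fill_value: float = 0.0) -> list:
    """Fill NaNs in place and return the names of the columns that were filled."""
    nan_counts = df.isnull().sum()
    cols_to_fix = nan_counts[nan_counts > 0].index.tolist()
    if cols_to_fix:
        df[cols_to_fix] = df[cols_to_fix].fillna(fill_value)
    # Post-condition: nothing is left to impute.
    assert not df.isnull().values.any(), "NaNs remain after filling"
    return cols_to_fix


# Toy frame (hypothetical values) with NaNs in two feature columns.
toy = pd.DataFrame({
    "order_id": [1, 2, 3],
    "up_purchase_count": np.array([2.0, np.nan, 5.0], dtype="float32"),
    "log_total_product_sales": np.array([1.5, np.nan, 0.7], dtype="float32"),
})

fixed = fill_residual_nans(toy)
print(fixed)                     # columns filled on the first call
print(fill_residual_nans(toy))   # a second call finds nothing left to fill
```

Filling `log_total_product_sales` with 0 is consistent with the log transform, since `log1p(0)` is 0. For columns such as `avg_pos_in_cart` or `product_avg_hour_of_day`, 0 acts as a placeholder rather than a natural "no history" value (hour 0 is midnight), so a different `fill_value` may be worth considering for those.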
