## User-level features

User-level features were engineered by aggregating data per `user_id` to capture individual customer behavior:

- **Total number of orders**: Count of unique orders per user.
- **Average basket size**: Mean number of items per order.
- **Reorder ratio**: Proportion of reordered items across all orders (total reordered items / total items purchased).
- **Mean days between orders**: Average `days_since_prior_order` for the user.
- **Last order recency**: The `days_since_prior_order` value of the user's most recent order, i.e. the gap in days between the user's last two orders (0 when that value is missing).

These features were computed using `groupby('user_id')` on the `orders` and `order_products_prior` tables, then merged back into the main dataset (`My_Data`).
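As a minimal sketch of this aggregate-then-merge pattern, the toy example below computes a reorder ratio per user and attaches it to an item-level table. The frames `toy_items`, `toy_ratio`, and `toy_merged` and all their values are hypothetical; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical item-level rows: one row per product in an order.
toy_items = pd.DataFrame({
    "user_id":   [1, 1, 1, 1, 2, 2],
    "order_id":  [10, 10, 11, 11, 20, 20],
    "reordered": [0, 1, 1, 1, 0, 1],
})

# Aggregate per user: the mean of a 0/1 flag equals
# (total reordered items) / (total items purchased).
toy_ratio = (
    toy_items.groupby("user_id")["reordered"]
    .mean()
    .rename("user_reorder_ratio")
    .reset_index()
)

# Merge the user-level feature back onto every item row of that user.
toy_merged = toy_items.merge(toy_ratio, on="user_id", how="left")
print(toy_merged)
```

Every row of a given user receives the same feature value, and the left join keeps the row count of the item-level table unchanged.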


### Total number of orders per user



In [None]:
user_total_orders = orders.groupby('user_id')['order_id'].nunique().rename('user_total_orders').reset_index()

print("Total number of orders per user" , "\n")
print(user_total_orders.head())

Total number of orders per user 

   user_id  user_total_orders
0        1                 11
1        2                 15
2        3                 13
3        4                  6
4        5                  5


### Average basket size per user


The merge between `orders` and `basket_sizes` on `order_id` was necessary to attach the basket size (number of items per order) to each order record in the `orders` DataFrame.

- `basket_sizes` was derived from `order_products_prior` using groupby on `order_id` to count items per order.
- The `orders` DataFrame contains user and order metadata (including `user_id`) but does not include basket size information.
- Merging on `order_id` (left join) added the `basket_size` column to each order, enabling subsequent groupby on `user_id` to compute the **average basket size per user**.
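One detail of the steps above is worth illustrating: a left join keeps orders that have no rows in the items table, leaving their `basket_size` as NaN, and pandas' grouped `mean()` skips NaN values. The sketch below uses small hypothetical frames (`toy_orders`, `toy_prior`) to show this; it is an illustration under those assumptions, not the notebook's data.

```python
import pandas as pd

# Hypothetical orders: order 12 has no rows in the prior items table.
toy_orders = pd.DataFrame({"order_id": [10, 11, 12], "user_id": [1, 1, 1]})
toy_prior = pd.DataFrame({
    "order_id":   [10, 10, 10, 11],
    "product_id": [101, 102, 103, 101],
})

# Count items per order, as in the basket_sizes step.
toy_basket = (
    toy_prior.groupby("order_id")["product_id"]
    .count()
    .rename("basket_size")
    .reset_index()
)

# Left join: order 12 is kept, with basket_size = NaN.
toy_joined = toy_orders.merge(toy_basket, on="order_id", how="left")

# mean() ignores the NaN, so the average is taken over orders 10 and 11 only.
toy_avg = toy_joined.groupby("user_id")["basket_size"].mean()
print(toy_joined)
print(toy_avg)
```

This means a user's average basket size can be based on fewer orders than their total order count when some of their orders have no item rows in `order_products_prior`.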



In [None]:
basket_sizes = order_products_prior.groupby('order_id')['product_id'].count().rename('basket_size').reset_index()
orders_with_basket = orders.merge(basket_sizes, on='order_id', how='left')
user_avg_basket = orders_with_basket.groupby('user_id')['basket_size'].mean().rename('user_avg_basket_size').reset_index()

print("Average basket size per user" , "\n")
print(user_avg_basket.head())

Average basket size per user 

   user_id  user_avg_basket_size
0        1              5.900000
1        2             13.928571
2        3              7.333333
3        4              3.600000
4        5              9.250000


### Reorder ratio per user

In [None]:
order_to_user = orders.set_index('order_id')['user_id'].to_dict()
order_products_prior['user_id'] = order_products_prior['order_id'].map(order_to_user)
user_reorder = order_products_prior.groupby('user_id')['reordered'].mean().rename('user_reorder_ratio').reset_index()

print("Reorder ratio per user" , "\n")
print(user_reorder.head())

Reorder ratio per user 

   user_id  user_reorder_ratio
0        1            0.694915
1        2            0.476923
2        3            0.625000
3        4            0.055556
4        5            0.378378


### Mean days between orders per user

In [None]:
user_mean_days = orders.groupby('user_id')['days_since_prior_order'].mean().rename('user_mean_days_between_orders').reset_index()

print("Mean days between orders per user" , "\n")
print(user_mean_days.head())

Mean days between orders per user 

   user_id  user_mean_days_between_orders
0        1                      19.000000
1        2                      16.285715
2        3                      12.000000
3        4                      17.000000
4        5                      11.500000


### Last order recency per user

In [None]:
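# The highest order_number per user identifies the most recent order.
max_order_num = orders.groupby('user_id')['order_number'].max().reset_index()

# Pull the full order row for each user's most recent order.
last_orders = max_order_num.merge(orders, on=['user_id', 'order_number'], how='left')

# Recency = days_since_prior_order of that order; NaN (no prior order) becomes 0.
last_orders['user_last_order_recency'] = last_orders['days_since_prior_order'].fillna(0)

user_recency = last_orders[['user_id', 'user_last_order_recency']].drop_duplicates()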

print("Last order recency per user" , "\n")
print(user_recency.head())

Last order recency per user 

   user_id  user_last_order_recency
0        1                     14.0
1        2                     30.0
2        3                     11.0
3        4                     30.0
4        5                      6.0




### Merging all user-level features

In [None]:
import gc

My_Data['user_id'] = My_Data['user_id'].astype('int32')

user_tables = [user_total_orders, user_avg_basket, user_reorder, user_mean_days, user_recency]

for df in user_tables:
    if 'user_id' in df.columns:
        df['user_id'] = df['user_id'].astype('int32')

    current_float_cols = df.select_dtypes(include=['float64']).columns
    df[current_float_cols] = df[current_float_cols].astype('float32')

merge_all_features = (
    user_total_orders
    .merge(user_avg_basket, on='user_id')
    .merge(user_reorder, on='user_id')
    .merge(user_mean_days, on='user_id')
    .merge(user_recency, on='user_id')
)

My_Data = My_Data.merge(merge_all_features, on='user_id', how='left')

del merge_all_features, user_total_orders, user_avg_basket, user_reorder, user_mean_days, user_recency
gc.collect()

print(My_Data.shape)

(32640698, 69)


In [None]:
print(My_Data.columns)
print("Final shape:", My_Data.shape)

Index(['order_id', 'user_id', 'order_number', 'days_since_prior_order',
       'product_id', 'add_to_cart_order', 'reordered', 'product_name',
       'aisle_id', 'eval_set_test', 'eval_set_train', 'order_dow_1',
       'order_dow_2', 'order_dow_3', 'order_dow_4', 'order_dow_5',
       'order_dow_6', 'department_id_1', 'department_id_2', 'department_id_3',
       'department_id_4', 'department_id_5', 'department_id_6',
       'department_id_7', 'department_id_8', 'department_id_9',
       'department_id_10', 'department_id_11', 'department_id_12',
       'department_id_13', 'department_id_14', 'department_id_15',
       'department_id_16', 'department_id_17', 'department_id_18',
       'department_id_19', 'department_id_20', 'department_id_21',
       'order_hour_of_day_1', 'order_hour_of_day_2', 'order_hour_of_day_3',
       'order_hour_of_day_4', 'order_hour_of_day_5', 'order_hour_of_day_6',
       'order_hour_of_day_7', 'order_hour_of_day_8', 'order_hour_of_day_9',
       'order_hour