# Main Program

In [1]:
import numpy as np
import pandas as pd
import features

# Load datasets
orders = pd.read_parquet("orders.parquet")
order_products_denormalized = pd.read_csv("order_products_denormalized.csv", dtype={'order_id': 'int64'})
tips_public = pd.read_csv("tips_public.csv", dtype={'order_id': 'int64'}).drop(columns=["Unnamed: 0"])

# Optimize memory usage by converting to categorical types
order_products_denormalized['department'] = order_products_denormalized['department'].astype('category')
order_products_denormalized['aisle'] = order_products_denormalized['aisle'].astype('category')

# Ensure consistent data types
orders['order_id'] = orders['order_id'].astype('int64')
orders['user_id'] = orders['user_id'].astype('int64')
order_products_denormalized['product_id'] = order_products_denormalized['product_id'].astype('int64')
tips_public['order_id'] = tips_public['order_id'].astype('int64')

# Feature Engineering

## Feature Overview

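The following table lists all features engineered in this notebook, including their level, output columns, data types, and descriptions. Order-level features are computed per `order_id`; user-level features are computed per `user_id` and merged onto each of that user's orders, so the final DataFrame has one row per `order_id`.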

| **Feature Name** | **Level** | **Output Columns** | **Data Type** | **Description** |
|------------------|-----------|--------------------|---------------|-----------------|
| `user_alcohol_purchase_count` | User | `[user_id, user_alcohol_purchase_count]` | Integer | Counts the total number of alcohol products purchased by each user across all orders, merged via user_id. |
| `user_total_purchase_count` | User | `[user_id, user_total_purchase_count]` | Integer | Counts the total number of products purchased by each user across all orders, merged via user_id. |
| `user_unique_product_count` | User | `[user_id, user_unique_product_count]` | Integer | Counts the number of unique products purchased by each user, merged via user_id. |
| `user_unique_to_total_ratio` | User | `[user_id, user_unique_to_total_ratio]` | Float | Calculates the ratio of unique products to total products purchased by each user, merged via user_id. |
| `user_frequent_purchase_hour` | User | `[user_id, user_frequent_purchase_hour]` | Integer (0–23) | Identifies the hour of the day when the user places the most orders, defaulting to 12 (noon) if missing, merged via user_id. |
| `user_frequent_purchase_dow` | User | `[user_id, user_frequent_purchase_dow]` | Integer (0–6) | Identifies the day of the week (0=Monday, 6=Sunday) when the user places the most orders, defaulting to 0 (Monday), merged via user_id. |
| `user_avg_order_interval_hours` | User | `[user_id, user_avg_order_interval_hours]` | Float | Calculates the average time (in hours) between consecutive orders for each user, using the dataset median for users with one order, merged via user_id. |
| `user_frequent_hour_sin`, `user_frequent_hour_cos` | User | `[user_id, user_frequent_hour_sin, user_frequent_hour_cos]` | Float (-1 to 1) | Applies sine-cosine transformation to the most frequent purchase hour to capture its cyclical nature, merged via user_id. |
| `user_frequent_season_sin`, `user_frequent_season_cos` | User | `[user_id, user_frequent_season_sin, user_frequent_season_cos]` | Float (-1 to 1) | Applies sine-cosine transformation to the most frequent purchase month to capture seasonal cyclicality, defaulting to January, merged via user_id. |
| `order_has_alcohol` | Order | `[order_id, order_has_alcohol]` | Integer (0 or 1) | Flags whether an order contains any alcohol products (1 if yes, 0 if no). |
| `order_product_count` | Order | `[order_id, order_product_count]` | Integer | Counts the total number of items (products) in each order. |
| `order_unique_dept_count` | Order | `[order_id, order_unique_dept_count]` | Integer | Counts the number of unique departments in each order. |
| `order_unique_aisle_count` | Order | `[order_id, order_unique_aisle_count]` | Integer | Counts the number of unique aisles in each order. |
| `order_unique_dept_ratio` | Order | `[order_id, order_unique_dept_ratio]` | Float | Calculates the ratio of unique departments to total items in each order. |
| `order_unique_aisle_ratio` | Order | `[order_id, order_unique_aisle_ratio]` | Float | Calculates the ratio of unique aisles to total items in each order. |
| `order_dept_tip_rate` | Order | `[order_id, order_dept_tip_rate]` | Float (0 to 1) | Computes the average tip rate for the departments in an order based on prior orders, defaulting to 0.500111 for no history. |
| `order_aisle_tip_rate` | Order | `[order_id, order_aisle_tip_rate]` | Float (0 to 1) | Computes the average tip rate for the aisles in an order based on prior orders, defaulting to 0.500111 for no history. |
| `order_placed_hour` | Order | `[order_id, order_placed_hour]` | Integer (0–23) | Extracts the hour of the day when the order was placed. |
| `order_placed_dow` | Order | `[order_id, order_placed_dow]` | Integer (0–6) | Extracts the day of the week (0=Monday, 6=Sunday) when the order was placed. |
| `order_is_weekend` | Order | `[order_id, order_is_weekend]` | Integer (0 or 1) | Flags whether the order was placed on a weekend (Saturday or Sunday). |
| `order_placed_hour_sin`, `order_placed_hour_cos` | Order | `[order_id, order_placed_hour_sin, order_placed_hour_cos]` | Float (-1 to 1) | Applies sine-cosine transformation to the order’s hour to capture its cyclical nature. |
| `order_placed_season_sin`, `order_placed_season_cos` | Order | `[order_id, order_placed_season_sin, order_placed_season_cos]` | Float (-1 to 1) | Applies sine-cosine transformation to the order’s month to capture seasonal cyclicality. |
| `order_time_since_last_hours` | Order | `[order_id, order_time_since_last_hours]` | Float | Calculates the time (in hours) since the user’s previous order, using the dataset median for first orders. |
| `user_total_product_purchase_count` | User | `[user_id, user_total_product_purchase_count]` | Integer | Total count of products purchased by each user, aggregated from user-product level, merged via user_id. |
| `user_product_tip_prob` | Order | `[order_id, user_product_tip_prob]` | Float (0 to 1) | Average tip probability for user-product pairs in an order, aggregated to order_id, defaulting to 0.500111 for no history. |
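The sine-cosine features in the table can be sketched as follows. This is a minimal illustration, not the code in the `features` module (which is not shown in this notebook): the helper name `add_cyclical_encoding` and the periods (24 for hours, 12 for months) are assumptions. The assumed formula does agree with the displayed output, where a most frequent hour of 16 maps to roughly (-0.866025, -0.5).

```python
import numpy as np
import pandas as pd


def add_cyclical_encoding(df, column, period, prefix):
    """Return a copy of df with <prefix>_sin and <prefix>_cos columns.

    Hypothetical helper: maps a cyclical value (hour, month, ...) onto the
    unit circle so that the end of a cycle sits next to its start.
    """
    angle = 2 * np.pi * df[column] / period
    out = df.copy()
    out[f"{prefix}_sin"] = np.sin(angle)
    out[f"{prefix}_cos"] = np.cos(angle)
    return out


# Toy data, not taken from the real dataset.
example = pd.DataFrame({"order_placed_hour": [0, 6, 16, 23]})
encoded = add_cyclical_encoding(
    example, "order_placed_hour", period=24, prefix="order_placed_hour"
)
print(encoded)
```

With this encoding, hours 23 and 0 end up close together on the circle, which a raw integer hour cannot express.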


In [2]:
# Generate and display the combined feature DataFrame
all_features_df = features.combine_all_features(orders, order_products_denormalized, tips_public)
display(all_features_df)


  merged['order_count'] = merged.groupby('department').cumcount().astype('int32')
  merged['tip_cumsum_before'] = merged.groupby('department')['tip'].cumsum() - merged['tip']
  merged['order_count'] = merged.groupby('aisle').cumcount().astype('int32')
  merged['tip_cumsum_before'] = merged.groupby('aisle')['tip'].cumsum() - merged['tip']
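The `cumcount()` and `cumsum() - merged['tip']` pattern in the lines above suggests a "prior orders only" tip rate: for each row, count and sum the tips of earlier rows in the same group, excluding the current row. A minimal sketch of that idea follows. The function name, the `prior_tip_rate` column, and the requirement that rows are already sorted chronologically are assumptions; only the default of 0.500111 comes from the feature table.

```python
import pandas as pd

DEFAULT_TIP_RATE = 0.500111  # fallback stated in the feature table


def prior_tip_rate(df, group_col, default=DEFAULT_TIP_RATE):
    """Tip rate per group using only rows that appear earlier in df.

    Hypothetical helper; assumes df is already sorted in time order.
    """
    out = df.copy()
    tip = out["tip"].astype("int64")
    grouped = tip.groupby(out[group_col], observed=True)
    prior_count = grouped.cumcount()      # number of earlier rows in the group
    prior_tips = grouped.cumsum() - tip   # tips on earlier rows only
    rate = prior_tips / prior_count       # 0 / 0 becomes NaN for a first row
    out["prior_tip_rate"] = rate.fillna(default)
    return out


# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "department": ["produce", "produce", "dairy", "produce", "dairy"],
    "tip": [True, False, True, True, False],
})
result = prior_tip_rate(toy, "department")
print(result)
```

Subtracting the current row's tip keeps the target out of its own feature, which is the usual reason for computing the cumulative sum this way.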


Merging feature with columns ['order_id', 'order_has_alcohol'], order_id dtype: int64
Merging feature with columns ['order_id', 'order_product_count'], order_id dtype: int64
Merging feature with columns ['order_id', 'order_unique_dept_count'], order_id dtype: int64
Merging feature with columns ['order_id', 'order_unique_aisle_count'], order_id dtype: int64
Merging feature with columns ['order_id', 'order_unique_dept_ratio'], order_id dtype: int64
Merging feature with columns ['order_id', 'order_unique_aisle_ratio'], order_id dtype: int64
Merging feature with columns ['order_id', 'order_dept_tip_rate'], order_id dtype: int64
Merging feature with columns ['order_id', 'order_aisle_tip_rate'], order_id dtype: int64
Merging feature with columns ['order_id', 'order_placed_hour'], order_id dtype: int64
Merging feature with columns ['order_id', 'order_placed_dow'], order_id dtype: int64
Merging feature with columns ['order_id', 'order_is_weekend'], order_id dtype: int64
Merging feature with co
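The log lines above suggest that feature frames are merged onto the orders one at a time by `order_id`. The sketch below shows one way such a loop could look. It is not the implementation of `features.combine_all_features`, which is not shown here; the function name `merge_order_features`, the left join, and the `validate` check are all assumptions.

```python
import pandas as pd


def merge_order_features(base, feature_frames):
    """Left-merge each feature frame onto base by order_id, one at a time.

    Hypothetical helper; each frame is assumed to have one row per order_id.
    """
    result = base
    for frame in feature_frames:
        print(
            f"Merging feature with columns {list(frame.columns)}, "
            f"order_id dtype: {frame['order_id'].dtype}"
        )
        # validate raises if order_id is duplicated in the feature frame
        result = result.merge(
            frame, on="order_id", how="left", validate="many_to_one"
        )
    return result


# Toy data, not taken from the real dataset.
base = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 10, 11]})
product_count = pd.DataFrame({"order_id": [1, 2], "order_product_count": [10, 9]})
has_alcohol = pd.DataFrame({"order_id": [1, 2, 3], "order_has_alcohol": [0, 1, 0]})
combined = merge_order_features(base, [product_count, has_alcohol])
print(combined)
```

In this toy example, order 3 has no row in `product_count`, so the left merge fills in `NaN` and the column becomes float. A similar effect may explain why integer features such as `order_product_count` appear as `10.0` in the displayed output.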

|  | order_id | user_id | order_has_alcohol | order_product_count | order_unique_dept_count | order_unique_aisle_count | order_unique_dept_ratio | order_unique_aisle_ratio | order_dept_tip_rate | order_aisle_tip_rate | ... | user_frequent_purchase_hour | user_frequent_purchase_dow | user_avg_order_interval_hours | user_frequent_hour_sin | user_frequent_hour_cos | user_frequent_season_sin | user_frequent_season_cos | user_total_product_purchase_count | user_product_tip_prob | tip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1374495 | 3 | 0.0 | 10.0 | 3.0 | 4.0 | 0.300000 | 0.400000 | 0.516237 | 0.523879 | ... | 16 | 5 | 288.102417 | -0.866025 | -0.500000 | 1.224647e-16 | -1.000000 | 88 | 0.500111 | True |
| 1 | 444309 | 3 | 0.0 | 9.0 | 5.0 | 9.0 | 0.555556 | 1.000000 | 0.502990 | 0.499790 | ... | 16 | 5 | 288.102417 | -0.866025 | -0.500000 | 1.224647e-16 | -1.000000 | 88 | 1.000000 | True |
| 2 | 3002854 | 3 | 0.0 | 6.0 | 4.0 | 6.0 | 0.666667 | 1.000000 | 0.510314 | 0.507816 | ... | 16 | 5 | 288.102417 | -0.866025 | -0.500000 | 1.224647e-16 | -1.000000 | 88 | 1.000000 | True |
| 3 | 2037211 | 3 | 0.0 | 5.0 | 4.0 | 5.0 | 0.800000 | 1.000000 | 0.442459 | 0.457083 | ... | 16 | 5 | 288.102417 | -0.866025 | -0.500000 | 1.224647e-16 | -1.000000 | 88 | 1.000000 | True |
| 4 | 2710558 | 3 | 0.0 | 11.0 | 4.0 | 7.0 | 0.363636 | 0.636364 | 0.518061 | 0.508472 | ... | 16 | 5 | 288.102417 | -0.866025 | -0.500000 | 1.224647e-16 | -1.000000 | 88 | 1.000000 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1463627 | 3059777 | 206208 | 0.0 | 7.0 | 4.0 | 6.0 | 0.571429 | 0.857143 | 0.542463 | 0.544971 | ... | 15 | 0 | 176.729568 | -0.707107 | -0.707107 | 5.000000e-01 | 0.866025 | 677 | 0.558718 | False |
| 1463628 | 2239861 | 206208 | 0.0 | 23.0 | 6.0 | 14.0 | 0.260870 | 0.608696 | 0.528861 | 0.534091 | ... | 15 | 0 | 176.729568 | -0.707107 | -0.707107 | 5.000000e-01 | 0.866025 | 677 | 0.321781 | True |
| 1463629 | 1285346 | 206208 | 0.0 | 8.0 | 4.0 | 6.0 | 0.500000 | 0.750000 | 0.520542 | 0.527629 | ... | 15 | 0 | 176.729568 | -0.707107 | -0.707107 | 5.000000e-01 | 0.866025 | 677 | 0.366172 | True |
| 1463630 | 1882108 | 206208 | 0.0 | 17.0 | 6.0 | 12.0 | 0.352941 | 0.705882 | 0.530163 | 0.523381 | ... | 15 | 0 | 176.729568 | -0.707107 | -0.707107 | 5.000000e-01 | 0.866025 | 677 | 0.383274 | True |
