- ## Encoding Categorical Variables


In [None]:
encoding_stats = My_Data[['eval_set','product_name','product_id', 'user_id', 'aisle_id', 'department_id', 'order_dow', 'order_hour_of_day']].nunique()

print("--- Cardinality Report ---")
print(encoding_stats)

--- Cardinality Report ---
eval_set                  3
product_name          49678
product_id            49678
user_id              206209
aisle_id                135
department_id            22
order_dow                 7
order_hour_of_day        24
dtype: int64


- ## **One-Hot Encoding**
  - ***Low-cardinality columns were chosen to ensure that the data dimensions did not become so large as to hinder processing.***
  - ***With `drop_first=True`, a column with k categories becomes k - 1 indicator columns; the dropped first category is represented by all zeros.***
  - ***(eval_set)*** : 3 categories, converted to 2 columns
  - ***(order_dow)*** : 7 categories, converted to 6 columns
  - ***(department_id)*** : 22 categories, converted to 21 columns
  - ***(order_hour_of_day)*** : 24 categories, converted to 23 columns

In [None]:
low_card_cols = ['eval_set', 'order_dow', 'department_id', 'order_hour_of_day']
# drop_first=True drops the first category of each column, leaving k-1 indicator columns.
My_Data = pd.get_dummies(My_Data, columns=low_card_cols, drop_first=True, dtype='int8')

# Drop the text label columns; aisle_id is kept because it is encoded later.
cols_to_drop = ['department', 'aisle']
My_Data.drop(columns=cols_to_drop, inplace=True, errors='ignore')

print("Done! One-Hot Encoding complete.")
print(f"New shape: {My_Data.shape}")

Done! One-Hot Encoding complete.
New shape: (32640698, 61)
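
The k - 1 column counts listed above can be checked on a small frame. The sketch below uses a hypothetical toy DataFrame (its values are made up for illustration, not taken from `My_Data`) and the same `pd.get_dummies` arguments as the cell above.

```python
import pandas as pd

# Hypothetical toy frame: one 3-category column and one 7-category column.
toy = pd.DataFrame({
    "eval_set": ["prior", "train", "test", "prior", "train", "test", "prior"],
    "order_dow": [0, 1, 2, 3, 4, 5, 6],
})

encoded = pd.get_dummies(
    toy, columns=["eval_set", "order_dow"], drop_first=True, dtype="int8"
)

# drop_first=True removes the first category in sorted order
# ('prior' for eval_set, 0 for order_dow), leaving k-1 indicators per column.
print(list(encoded.columns))
print(encoded.shape)
```

A row whose indicators are all zero belongs to the dropped baseline category, which is why `eval_set_prior` does not appear among the columns.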


In [None]:
print(My_Data.shape)

(32640698, 61)


- ## **Target Encoding**
   -  ***These columns were selected because they have too many unique values for One-Hot Encoding (aisle_id: 135, product_id: 49,678, user_id: 206,209).***
   -  ***(product_id) , (user_id) , (aisle_id)***
   -  Target encoding replaces each arbitrary integer ID with a float between 0 and 1: the smoothed mean of `reordered` for that category. The value estimates the reorder probability of each product, user, or aisle, which helps the model separate frequently reordered products from one-off purchases.

In [None]:
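# NOTE: this cell is disabled, so product_id, user_id and aisle_id stay as integers in this run.
# Assumed import (the cols= and smoothing= arguments match the category_encoders package):
#from category_encoders import TargetEncoder

#high_card_cols = ['product_id', 'user_id', 'aisle_id']

# Fitting on the full frame uses each row's own target; fit on training folds only when enabling this.
#te = TargetEncoder(cols=high_card_cols, smoothing=10)
#My_Data[high_card_cols] = te.fit_transform(My_Data[high_card_cols], My_Data['reordered'])

#print(My_Data[high_card_cols].head())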
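
To make the idea of a smoothed reorder probability concrete, here is a minimal sketch of additive smoothing written with plain pandas. It is an illustration under assumptions: the helper `smoothed_target_mean`, the weight `m`, and the toy data are hypothetical, and the formula may differ from the one the commented-out `TargetEncoder` applies for its `smoothing` argument.

```python
import pandas as pd

def smoothed_target_mean(df, col, target, m=10.0):
    """Shrink each category's target mean toward the global mean.

    Hypothetical helper: m acts like m pseudo-observations at the global mean.
    """
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(["sum", "count"])
    smoothed = (stats["sum"] + m * global_mean) / (stats["count"] + m)
    return df[col].map(smoothed)

# Toy data (made up): product 2 appears once, so its raw mean of 1.0 is unreliable.
toy = pd.DataFrame({
    "product_id": [1, 1, 1, 1, 2, 3, 3],
    "reordered":  [1, 1, 1, 0, 1, 0, 0],
})
toy["product_id_te"] = smoothed_target_mean(toy, "product_id", "reordered", m=2.0)
print(toy)
```

The single-row product is pulled from its raw mean of 1.0 toward the global mean, which is the effect smoothing is meant to have on rare categories. This toy version still computes statistics from each row's own target, so it illustrates smoothing only, not leakage prevention.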

- ## **Frequency Encoding**
- ***Frequency encoding was chosen for the product_name column because it gives the model a new statistical feature: product popularity. Unlike the hashing trick, which assigns arbitrary values that can collide, frequency encoding ties each value to how often the product appears, which may correlate with the probability of it being repurchased, while keeping memory use low.***
- Converting long text names into counts that express the "weight" or "popularity" of a category helps the model pick up patterns associated with bestsellers.

In [None]:
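# Count how many rows each product appears in (its popularity).
product_counts = My_Data['product_name'].value_counts()

# Overwrite each name with its count. Run this cell once: re-running would count the counts.
My_Data['product_name'] = My_Data['product_name'].map(product_counts)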
print(My_Data[['product_name']].head())

   product_name
0         35791
1         15935
2          6476
3          2523
4          1214


In [None]:
print(My_Data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32640698 entries, 0 to 32640697
Data columns (total 61 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int32  
 1   user_id                 int32  
 2   order_number            float64
 3   days_since_prior_order  float64
 4   product_id              int32  
 5   add_to_cart_order       float64
 6   reordered               int32  
 7   product_name            int64  
 8   aisle_id                int32  
 9   eval_set_test           int8   
 10  eval_set_train          int8   
 11  order_dow_1             int8   
 12  order_dow_2             int8   
 13  order_dow_3             int8   
 14  order_dow_4             int8   
 15  order_dow_5             int8   
 16  order_dow_6             int8   
 17  department_id_1         int8   
 18  department_id_2         int8   
 19  department_id_3         int8   
 20  department_id_4         int8   
 21  department_id_5         int8 

- ## **Comparison of Encoding Approaches**

  -  **Key Observations:**

    - One-Hot Encoding preserved full interpretability but increased dimensionality (52 indicator columns in place of 4 original columns).
    - Target Encoding keeps each high-cardinality feature as a single column while injecting predictive signal (reorder probability), with smoothing=10 applied to limit overfitting on rare categories.
    - Frequency Encoding provided a simple, meaningful feature (popularity count) that does not use the target, so it carries no target leakage and adds no columns.

**The combined approach achieved good memory efficiency (~3.5 GB total) while introducing strong predictive features, making the dataset suitable for both linear and tree-based models. Target leakage in Target Encoding is avoided by restricting fitting to training folds only within cross-validation pipelines.**
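
A memory figure like the one above can be measured with `DataFrame.memory_usage`. The sketch below runs on a hypothetical toy frame rather than `My_Data`, and shows why storing indicator columns as `int8` instead of the default `int64` matters.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the same indicator stored at two widths.
toy = pd.DataFrame({
    "flag_int64": np.zeros(1_000_000, dtype="int64"),
    "flag_int8": np.zeros(1_000_000, dtype="int8"),
})

# Bytes per column; index=False leaves out the index itself.
per_column = toy.memory_usage(index=False, deep=True)
print(per_column)
print(f"Total: {per_column.sum() / 1024**2:.2f} MiB")
```

Calling the same method on the full frame and dividing the sum by `1024**3` gives the total in GiB.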

- ## **Target Encoding – Leakage Prevention**

  - Target encoding may introduce data leakage if category statistics are computed using the full dataset.
  - To prevent this, **K-fold target encoding** is applied within the training data: target means are computed using K−1 folds and applied to the held-out fold.
  - This ensures that each sample is encoded without using its own target value.
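
The K-fold procedure described above can be sketched as follows. This is an illustration under assumptions, not the notebook's implementation: the helper `kfold_target_encode`, the additive smoothing weight `m`, the random fold assignment, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def kfold_target_encode(df, col, target, n_splits=5, m=10.0, seed=0):
    """Out-of-fold target encoding: each row is encoded from the other K-1 folds."""
    encoded = pd.Series(np.nan, index=df.index, dtype="float64")
    rng = np.random.default_rng(seed)
    # Shuffle row positions and cut them into K roughly equal folds.
    folds = np.array_split(rng.permutation(len(df)), n_splits)
    for holdout_idx in folds:
        fit_mask = np.ones(len(df), dtype=bool)
        fit_mask[holdout_idx] = False
        fit_part = df[fit_mask]
        prior = fit_part[target].mean()
        stats = fit_part.groupby(col)[target].agg(["sum", "count"])
        smoothed = (stats["sum"] + m * prior) / (stats["count"] + m)
        # Categories absent from the K-1 fitting folds fall back to the prior.
        values = df.iloc[holdout_idx][col].map(smoothed).fillna(prior)
        encoded.iloc[holdout_idx] = values.to_numpy()
    return encoded

# Toy data (randomly generated for illustration only).
data_rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "product_id": data_rng.integers(0, 5, size=200),
    "reordered": data_rng.integers(0, 2, size=200),
})
toy["product_id_te"] = kfold_target_encode(toy, "product_id", "reordered")
print(toy.head())
```

Because the statistics for a held-out fold never include that fold's rows, the same product can receive slightly different encoded values in different folds. Validation or test rows would then be encoded with statistics fitted on the full training data.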
