- # ****Imbalanced data handling (classification)****


In [None]:
# Reading the train table
train_orders = pd.read_csv('/kaggle/input/instacart-market-basket-analysis/order_products__train.csv', usecols=['order_id', 'product_id', 'reordered'])

# Since we want a prediction for each (user, product) pair,
# we join on 'user_id' and 'product_id' (the train table must first be linked to the orders table to get the user_id)
orders = pd.read_csv('/kaggle/input/instacart-market-basket-analysis/orders.csv', usecols=['order_id', 'user_id', 'eval_set'])
train_full = train_orders.merge(orders[['order_id', 'user_id']], on='order_id', how='left')

# Any record present in My_Data but not in train_full will be assigned NaN and then converted to 0
My_Data = My_Data.merge(train_full[['user_id', 'product_id', 'reordered']],
                        on=['user_id', 'product_id'],
                        how='left',
                        suffixes=('', '_final_target'))

# Cleaning the new target column
My_Data['target'] = My_Data['reordered_final_target'].fillna(0).astype('int8')
My_Data.drop(['reordered_final_target'], axis=1, inplace=True)

In [None]:
del train_orders
del orders
import gc
gc.collect()

0

- ### ****Examine the Target (Class Distribution)****

In [None]:
print(My_Data['target'].value_counts())
print(My_Data['target'].value_counts(normalize=True))

target
0    28402695
1     4238003
Name: count, dtype: int64
target
0    0.870162
1    0.129838
Name: proportion, dtype: float64


- ### ****Row Count Check****

In [None]:
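# Row count after the merge; a left merge should not add rows as long as
# each (user_id, product_id) pair appears at most once in train_full
print(len(My_Data))
print(len(My_Data))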

32640698
32640698
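The row-count and null checks guard against two common side effects of a left merge: duplicated rows when the right-hand keys are not unique, and missing labels for unmatched pairs. The sketch below reproduces the target construction on a toy table. The `history` and `train` frames are hypothetical stand-ins for `My_Data` and `train_full`, and `validate="many_to_one"` is an optional pandas safeguard that is not used in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for My_Data: one row per historical (user, product) purchase
history = pd.DataFrame({
    "user_id":    [1, 1, 2, 2],
    "product_id": [196, 26088, 196, 500],
})

# Hypothetical stand-in for train_full: labels from each user's train order
train = pd.DataFrame({
    "user_id":    [1, 2],
    "product_id": [196, 500],
    "reordered":  [1, 0],
})

rows_before = len(history)

# validate="many_to_one" raises if a (user_id, product_id) key is duplicated in `train`
labelled = history.merge(
    train,
    on=["user_id", "product_id"],
    how="left",
    validate="many_to_one",
)

# Pairs absent from the train order get NaN, which becomes the negative class
labelled["target"] = labelled["reordered"].fillna(0).astype("int8")
labelled = labelled.drop(columns="reordered")

print(len(labelled) == rows_before)
print(labelled["target"].tolist())
```

If the right-hand table did contain duplicate keys, the merge would raise instead of silently inflating the row count.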


- ### ****Nulls Check****

In [None]:
print(My_Data['target'].isnull().sum())

0


- ### ****Logic Spot-check****

In [None]:
print(My_Data[My_Data['target'] == 1][['user_id', 'product_id', 'target']].head())

   user_id  product_id  target
0      1.0       196.0       1
3      1.0     26088.0       1
4      1.0     26405.0       1
5      1.0       196.0       1
6      1.0     10258.0       1


- #### Analysis of the target variable revealed a class imbalance of approximately ****1:6.7**** (roughly 1:7). The data contains nearly seven times as many negative samples as positive ones, which justifies using imbalance correction techniques so that the model does not simply favor the majority class when predicting reorders

## $$\frac{\text{Zeros}}{\text{Ones}} = \frac{28{,}402{,}695}{4{,}238{,}003} \approx 6.7019$$
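The same arithmetic can be reproduced directly from the printed class counts. This is a minimal sketch in plain Python; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
negatives = 28_402_695
positives = 4_238_003

# Negatives per positive, and the share of positives in the whole table
ratio = negatives / positives
positive_share = positives / (negatives + positives)

print(f"zeros / ones   = {ratio:.4f}")
print(f"positive share = {positive_share:.6f}")
```

A negatives-to-positives ratio of this kind is also the usual starting value for XGBoost's `scale_pos_weight` parameter.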

-----------------------------------------------------------------------------------------------------------
- ## ****Aggregation Strategy****

  - ### We will use an ****Aggregation Strategy**** instead of dealing with the entire dataset, because:
    -  *Transitioning from the order level to the (user-product) level*
    -  *Noise Reduction*
    -  *Computational Efficiency*
    -  *Feature Signal*

In [None]:
unique_pairs = My_Data.groupby(['user_id', 'product_id']).ngroups
print(f"Number of uniques:{unique_pairs}")

Number of uniques:13514162


- ### *By aggregating to one row per (user, product) pair, we reduce the data size by approximately 58.6% (from 32.6 million to 13.5 million records) while keeping every distinct user-product pair. This not only improves memory efficiency but also makes imbalance handling more precise, because it focuses on the unique user-product pairs*

In [None]:
My_Data_Aggregated = My_Data.groupby(['user_id', 'product_id']).last().reset_index()

print( My_Data_Aggregated.shape)
print("-" * 30)
print(My_Data_Aggregated['target'].value_counts(normalize=True))

(13514162, 80)
------------------------------
target
0    0.93867
1    0.06133
Name: proportion, dtype: float64


- ### *After aggregation, the share of the positive class dropped to 6.1%. This reflects the actual nature of the problem: we are moving from predicting at the level of 'every historical purchase' to predicting at the level of 'product repurchase opportunity'. This makes the task more challenging and more strongly justifies the imbalance handling techniques that we will now apply*

## $$\frac{\text{Zeros}}{\text{Ones}} = \frac{0.93867}{0.06133} \approx 15.3$$
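One way to see why the positive share falls after aggregation: the target is attached per (user, product) pair, so a pair with many historical rows is counted many times at row level but only once after `groupby(...).last()`. The toy table below is a sketch under that assumption; the `order_number` column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical order-level rows: the positive pair (1, 196) was bought three times
rows = pd.DataFrame({
    "user_id":      [1, 1, 1, 2, 2, 3],
    "product_id":   [196, 196, 196, 500, 500, 196],
    "order_number": [1, 2, 3, 1, 2, 1],
    "target":       [1, 1, 1, 0, 0, 0],
})

# Sort so that .last() keeps the most recent row of each pair
rows = rows.sort_values(["user_id", "product_id", "order_number"])
pairs = rows.groupby(["user_id", "product_id"]).last().reset_index()

row_level_rate = rows["target"].mean()    # every historical purchase counts
pair_level_rate = pairs["target"].mean()  # every pair counts once

print(len(rows), "rows ->", len(pairs), "pairs")
print(f"positive share: {row_level_rate:.3f} -> {pair_level_rate:.3f}")
```

In this toy case the frequently purchased pair dominates the row-level share, and collapsing it to a single row pulls the positive share down, which is the same direction of change seen in the notebook output.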

---------------------------------------------------------------------------------------------------------------
- ## ****Starting the experimentation process****

- ### ****Class Weights****

In [None]:
from xgboost import XGBClassifier

# Weight calculation: negative share / positive share (approximately 15.3)
pos_weight = 0.93867 / 0.06133
model_class_weights = XGBClassifier(scale_pos_weight=pos_weight, n_estimators=100, tree_method='hist')

- ### ****Undersampling****

In [None]:
# Separating the two classes
minority = My_Data_Aggregated[My_Data_Aggregated['target'] == 1]
majority = My_Data_Aggregated[My_Data_Aggregated['target'] == 0]

# Taking a random sample from the majority class, equal in size to the minority class
majority_under = majority.sample(len(minority), random_state=42)

# Combine them into a new, separate table for the experiment
df_under = pd.concat([minority, majority_under])

print("Undersampling:")
print(df_under['target'].value_counts(normalize=True))

Undersampling:
target
1    0.5
0    0.5
Name: proportion, dtype: float64
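When an undersampled table is later used for evaluation, a common precaution is to hold out a test split first and balance only the training part, so that the test split keeps the real class distribution. The following is a minimal sketch of that idea on synthetic data. It is not what the notebook does; the `feature` column, the 6% positive rate, and the 20% split are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic imbalanced table (assumed ~6% positives, one dummy feature)
rng = np.random.default_rng(42)
n = 10_000
df = pd.DataFrame({
    "feature": rng.normal(size=n),
    "target": (rng.random(n) < 0.06).astype("int8"),
})

# Hold out 20% BEFORE any resampling; this split keeps the real distribution
test = df.sample(frac=0.2, random_state=42)
train = df.drop(test.index)

# Undersample the majority class in the training split only
minority = train[train["target"] == 1]
majority = train[train["target"] == 0]
majority_under = majority.sample(len(minority), random_state=42)

# Concatenate and shuffle the balanced training table
train_under = pd.concat([minority, majority_under]).sample(frac=1, random_state=42)

print("train (balanced) positive share:", train_under["target"].mean())
print("test (untouched) positive share:", round(test["target"].mean(), 4))
```

A model fitted on `train_under` and scored on `test` would then report precision and recall at the original class ratio.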


- ### ****SMOTE****

In [None]:
from imblearn.over_sampling import SMOTE

# Work on a 1,000,000-row sample to keep memory usage manageable
df_sample_smote = My_Data_Aggregated.sample(1000000, random_state=42)
X_sample = df_sample_smote.drop('target', axis=1)
y_sample = df_sample_smote['target']

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_sample, y_sample)

print("SMOTE:")
print(y_res.value_counts(normalize=True))

SMOTE:
target
0    0.5
1    0.5
Name: proportion, dtype: float64


- ### Because of the significant class imbalance in the data, we adopted the ****F1-Score**** as the primary criterion for evaluating our experiments. It balances the ability to detect reorders (recall) against the correctness of the positive predictions (precision), and so avoids the misleading picture that plain accuracy gives on imbalanced data

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score
from xgboost import XGBClassifier


def get_results(model, X, y, title):
    # Fit on (X, y) and score on the same data, so the printed values are
    # in-sample (training-set) metrics, not held-out estimates
    model.fit(X, y)
    preds = model.predict(X)
    print(f"--- {title} ---")
    print(f"F1-Score: {f1_score(y, preds):.4f}")
    print(f"Recall: {recall_score(y, preds):.4f}")
    print(f"Precision: {precision_score(y, preds):.4f}\n")


# Baseline: original class distribution (100,000-row sample)
sample_orig = My_Data_Aggregated.sample(100000)
X_orig = sample_orig.drop('target', axis=1)
y_orig = sample_orig['target']
get_results(XGBClassifier(), X_orig, y_orig, "Baseline")

# SMOTE: X_res and y_res are already resampled, so the same helper applies
get_results(XGBClassifier(), X_res, y_res, "SMOTE")

# Class weights
get_results(XGBClassifier(scale_pos_weight=15), X_orig, y_orig, "Class Weights")

# Undersampling
get_results(XGBClassifier(), df_under.drop('target', axis=1), df_under['target'], "Undersampling")

--- Baseline ---
F1-Score: 0.3643
Recall: 0.2263
Precision: 0.9334

--- SMOTE ---
F1-Score: 0.9655
Recall: 0.9359
Precision: 0.9969

--- Class Weights ---
F1-Score: 0.4455
Recall: 0.9406
Precision: 0.2919

--- Undersampling ---
F1-Score: 0.7534
Recall: 0.7605
Precision: 0.7465
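Since F1 is the harmonic mean of precision and recall, the reported F1 values can be cross-checked from the printed precision and recall alone. The sketch below does that in plain Python; the helper name `f1_from_precision_recall` is illustrative, and small differences in the last digit come from the inputs being rounded to four decimals.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the printed results above
reported = {
    "Baseline":      (0.9334, 0.2263, 0.3643),
    "SMOTE":         (0.9969, 0.9359, 0.9655),
    "Class Weights": (0.2919, 0.9406, 0.4455),
    "Undersampling": (0.7465, 0.7605, 0.7534),
}

computed = {}
for name, (precision, recall, f1_reported) in reported.items():
    computed[name] = f1_from_precision_recall(precision, recall)
    print(f"{name}: computed {computed[name]:.4f}, reported {f1_reported:.4f}")
```

The harmonic mean is pulled toward the smaller of the two inputs, which is why the Baseline and Class Weights runs score low on F1 despite one very high component each.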



- ## ****Analyze Numbers****

- ### *Baseline (unprocessed): ****Recall**** is very poor (0.23), meaning the model missed roughly 77% of reorders. It is "cautious" (high precision, 0.93) but "afraid" to predict 1*
- ----------------------------------------------------------------------------------------------------------
- ### *SMOTE: The scores are near-perfect (F1 0.97, Recall 0.94), but they were measured on the same small, artificially balanced sample the model was trained on, synthetic rows included. They illustrate the "maximum potential" if huge computing resources were available rather than the performance to expect on real data*
- ----------------------------------------------------------------------------------------------------------
- ### *Class Weights: Recall jumped to 0.94. The model became "very bold" in hunting for the positive class, but at the expense of precision, which fell to 0.29. This is the expected trade-off when reweighting the classes*
- ----------------------------------------------------------------------------------------------------------
- ### *Undersampling: It gave the best realistic balance; Precision and Recall are close (0.75 and 0.76), and the F1-Score is high (0.75). These scores are measured on the balanced 50/50 table*
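Precision depends on how common the positive class is, so a precision measured on a balanced 50/50 table does not carry over unchanged to data with the original class ratio. The sketch below shows the relationship for a hypothetical classifier with fixed true-positive and false-positive rates. The rates 0.8 and 0.2 are invented for illustration and are not taken from the experiments; only the 6.133% prevalence comes from the notebook.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Expected precision when positives make up `prevalence` of the data."""
    true_pos = tpr * prevalence            # share of all rows that are true positives
    false_pos = fpr * (1.0 - prevalence)   # share of all rows that are false positives
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: catches 80% of positives, flags 20% of negatives
tpr, fpr = 0.8, 0.2

balanced = precision_at_prevalence(tpr, fpr, 0.5)      # 50/50 table
skewed = precision_at_prevalence(tpr, fpr, 0.06133)    # aggregated class ratio

print(f"precision on balanced data: {balanced:.3f}")
print(f"precision at 6.1% positives: {skewed:.3f}")
```

Recall is unaffected by the class ratio in this calculation; only precision shifts, because the same false-positive rate applies to a much larger pool of negatives.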

- ## ****Pros & Cons****
----------------------------------------------------------------------------------------------------------

- ## ****Class Weights****

  - ### ****Pros****: *We haven't lost a single record from the 13 million rows*
  - ### ****Cons****: *This led to a significant increase in False Positives (predicting that the customer will buy when they will not)*

- ## ****Undersampling****

  - ### ****Pros****: *Very fast training and a well-balanced performance between detecting positives (recall) and predicting them correctly (precision).*
  - ### ****Cons****: *We removed millions of records from class 0, which may prevent the model from learning some special cases.*

- ## ****SMOTE****

  - ### ****Pros****: *It addresses the scarcity of positive samples by generating synthetic samples instead of simply duplicating existing ones.*
  - ### ****Cons****: *High memory consumption; it cannot be applied to the entire aggregated Instacart data (13M rows) in a limited Kaggle environment*
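To make "generating synthetic samples" concrete: the core idea of SMOTE is to pick a minority point, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. The NumPy sketch below is a simplified illustration of that single step, not the `imblearn` implementation; the function name `smote_like_sample` and the toy points are assumptions.

```python
import numpy as np


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating towards a near neighbour."""
    rng = np.random.default_rng() if rng is None else rng

    # Pick a random minority point as the starting point
    i = rng.integers(len(minority))

    # Euclidean distance from that point to every minority point
    dists = np.linalg.norm(minority - minority[i], axis=1)

    # The k closest points, skipping position 0 (the point itself, distance 0)
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)

    # Random position on the segment between the point and its neighbour
    lam = rng.random()
    return minority[i] + lam * (minority[j] - minority[i])


# Toy minority class: the four corners of the unit square
minority_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(minority_points, k=2, rng=np.random.default_rng(42))
print(synthetic)
```

Because every synthetic row requires neighbour distances among minority points, the cost in time and memory grows quickly with the number of rows, which is consistent with the memory limits noted above.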