<hr>

# 🤖 MACHINE LEARNING 🤖

<style>
h1 {
    text-align: center;
    color: hotpink;
    font-weight: bold;
}
</style>

<style>
h2 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h3 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h4 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<hr>

In [1]:
# import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

plt.style.use('ggplot')
pd.set_option('display.max_columns', 200)

<hr>

# FEATURE INSIGHTS


<style>
h1 {
    text-align: center;
    color: purple;
    font-weight: bold;
}
</style>

<style>
h2 {
    text-align: left;
    color: purple;
    font-weight: bold;
}
</style>

<style>
h3 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h4 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>
<hr>

## Machine Learning Data Preparation

For machine learning, the dataset needs to be prepared so that all features are numeric and compatible with ML models.

---
### ⚙️ ML-Friendly Data Types

| Column | ML-Friendly Type | Reason |
|--------|-----------------|--------|
| Administrative | int64 | Already numeric |
| Administrative_Duration | float64 | Continuous numeric |
| Informational | int64 | Numeric |
| Informational_Duration | float64 | Continuous numeric |
| ProductRelated | int64 | Numeric |
| ProductRelated_Duration | float64 | Continuous numeric |
| BounceRates | float64 | Continuous numeric |
| ExitRates | float64 | Continuous numeric |
| PageValues | float64 | Continuous numeric |
| SpecialDay | float64 | Continuous numeric (0–1 scale) |
| Month | category → One-Hot Encoded | Few unique categories → convert for ML |
| OperatingSystems | category → One-Hot Encoded | Discrete IDs → convert |
| Browser | category → One-Hot Encoded | Discrete IDs → convert |
| Region | category → One-Hot Encoded | Discrete IDs → convert |
| TrafficType | category → One-Hot Encoded | Discrete IDs → convert |
| VisitorType | category → One-Hot Encoded | Few types → convert |
| Weekend | bool → int | Convert True/False to 1/0 |
| Revenue | bool → int | Target: 1 = Buy, 0 = Not Buy |
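
The conversions in the table can be sketched on a tiny hand-made frame. The rows below are invented for illustration and are not taken from the dataset; only the column names come from the table.

```python
import pandas as pd

# Toy rows mimicking a few columns from the table (values are made up).
toy = pd.DataFrame({
    "Month": ["Feb", "Mar", "Feb"],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor"],
    "Weekend": [False, True, False],
    "Revenue": [False, True, False],
    "BounceRates": [0.2, 0.0, 0.05],
})

# bool -> int: True/False become 1/0
toy["Weekend"] = toy["Weekend"].astype(int)
toy["Revenue"] = toy["Revenue"].astype(int)

# category -> one-hot: each category value becomes its own 0/1 column
encoded = pd.get_dummies(toy, columns=["Month", "VisitorType"], dtype=int)

# Every remaining column should now be numeric
print(encoded.dtypes)
```

After this step every column is an integer or float, which is the form most scikit-learn estimators expect.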


### 🔑 Key Points for ML

| Feature Type | Recommended Preprocessing | Notes |
|--------------|---------------------------|-------|
| Categorical variables | One-Hot Encoding or Label Encoding | One-hot is the safer choice for linear and distance-based models; label encoding is often acceptable for tree-based models |
| Boolean columns | Convert to 0/1 | E.g., `Weekend`, `Revenue` |
| Continuous numeric features | Keep as `float64` | Scaling optional (helpful for gradient-based models) |
| Target column (`Revenue`) | Binary 0/1 | **1 = Buy**, **0 = Not Buy** |
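
One possible way to apply these recommendations together is scikit-learn's `ColumnTransformer`. The sketch below is an assumption about how the steps could be wired up, not the pipeline used in this notebook; the toy values are invented.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one categorical, one 0/1, and two continuous columns (made-up values).
toy = pd.DataFrame({
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor", "Other"],
    "Weekend": [0, 1, 0, 1],
    "PageValues": [0.0, 12.5, 3.0, 0.0],
    "ExitRates": [0.20, 0.01, 0.05, 0.10],
})

preprocess = ColumnTransformer(
    transformers=[
        # Categorical column -> one indicator column per category
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["VisitorType"]),
        # Continuous columns -> zero mean, unit variance
        ("scale", StandardScaler(), ["PageValues", "ExitRates"]),
    ],
    remainder="passthrough",  # 0/1 columns such as Weekend pass through unchanged
    sparse_threshold=0,       # always return a dense array
)

X = preprocess.fit_transform(toy)
print(X.shape)
```

The output columns are ordered as listed in `transformers` (one-hot columns, then scaled columns), followed by the passthrough columns.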


<hr>

# FEATURE ENGINEERING


<hr>

The following feature engineering steps are applied to enhance model performance:

- Handling categorical variables (Month, VisitorType) → one-hot encoding.
- Boolean columns (Weekend, Revenue) → convert to integers (0/1).
- Aggregating / combining features → like totals, ratios, or averages.
- Binning / scaling numeric features → e.g., Administrative_Duration, BounceRates.
- Date/time handling → Month can be converted to ordinal numbers.

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler, KBinsDiscretizer

# Load the processed dataset into a DataFrame called df
df = pd.read_csv("../data/processed/online_shoppers_intention_01_standard.csv")

# ----------- Boolean to int -----------
# Column names use the same casing as the rest of this cell (e.g. 'VisitorType',
# 'Administrative'); adjust them if the processed CSV names its columns differently.
df['Weekend'] = df['Weekend'].astype(int)
df['Revenue'] = df['Revenue'].astype(int)

# ----------- Month Encoding -----------
# Convert months to numeric (ordinal)
month_mapping = {
    'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5, 'June':6,
    'Jul':7, 'Aug':8, 'Sep':9, 'Oct':10, 'Nov':11, 'Dec':12
}
df['Month'] = df['Month'].map(month_mapping)

# ----------- One-Hot Encoding for VisitorType -----------
df = pd.get_dummies(df, columns=['VisitorType'], drop_first=True)

# ----------- Feature Combinations / Ratios -----------
# Total page interactions
df['Total_Page_Views'] = df['Administrative'] + df['Informational'] + df['ProductRelated']
df['Total_Duration'] = df['Administrative_Duration'] + df['Informational_Duration'] + df['ProductRelated_Duration']

# Average duration per page type
df['Avg_Admin_Duration'] = df['Administrative_Duration'] / (df['Administrative'] + 1e-5)
df['Avg_Info_Duration'] = df['Informational_Duration'] / (df['Informational'] + 1e-5)
df['Avg_Product_Duration'] = df['ProductRelated_Duration'] / (df['ProductRelated'] + 1e-5)

# Ratios
df['Bounce_to_Exit_Ratio'] = df['BounceRates'] / (df['ExitRates'] + 1e-5)
df['PageValue_per_Product'] = df['PageValues'] / (df['ProductRelated'] + 1e-5)

# ----------- Scaling numeric features -----------
numeric_cols = [
    'Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration',
    'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', 
    'PageValues', 'SpecialDay', 'Total_Page_Views', 'Total_Duration',
    'Avg_Admin_Duration', 'Avg_Info_Duration', 'Avg_Product_Duration',
    'Bounce_to_Exit_Ratio', 'PageValue_per_Product'
]

scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# ----------- Optional: Bin numeric features -----------
# Example: bin ProductRelated into 5 quantile bins
kbins = KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='quantile')
product_bins = kbins.fit_transform(df[['ProductRelated']])
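# Reuse df's index so the concat below aligns rows even if the index is not 0..n-1
product_bins_df = pd.DataFrame(
    product_bins,
    columns=[f'ProductRelated_bin_{i}' for i in range(product_bins.shape[1])],
    index=df.index,
)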
df = pd.concat([df, product_bins_df], axis=1)

print(df.head())
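
The cell above fits `StandardScaler` on the full dataset. If the data is later divided with `train_test_split` (imported at the top of the notebook), a common alternative is to fit the scaler on the training rows only, so that test-set statistics do not leak into the features. The sketch below illustrates that ordering on synthetic data; the values, split size, and random seeds are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric features and the binary target.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ProductRelated_Duration": rng.exponential(scale=500.0, size=200),
    "ExitRates": rng.uniform(0.0, 0.2, size=200),
    "Revenue": rng.integers(0, 2, size=200),
})

X = toy.drop(columns="Revenue")
y = toy["Revenue"]

# Split first, keeping the class balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit on the training rows only, then apply the same transform to the test rows
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The same ordering would apply to `KBinsDiscretizer`, since its bin edges are also learned from the data it is fitted on.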
