# Phase 1: Setup, Data Loading, Initial Exploration & Train/Val/Test Split

## Objective
Predict whether an online shopping session ends in a purchase (binary classification).  
**Dataset**: [UCI Online Shoppers Purchasing Intention](https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset)

### Pipeline Order (No Data Leakage)
1. Load raw data & quick inspection  
2. **Split into Train (70%) / Validation (15%) / Test (15%)**  
3. Fit preprocessing on Train only, then apply it unchanged to Validation and Test (Phase 2)
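
As a rough illustration of why the split comes first, the sketch below fits a `StandardScaler` on the training rows only and reuses those statistics for the held-out rows. The data is synthetic and the names (`X_demo`, `X_tr`, `X_held`) are placeholders for this example, not the notebook's Phase 2 code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the raw feature matrix (assumption: not the real dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
y_demo = rng.integers(0, 2, size=1000)

# 1) Split first ...
X_tr, X_held, y_tr, y_held = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# 2) ... then fit the scaler on the training rows only
scaler = StandardScaler().fit(X_tr)

# 3) Apply the train-derived statistics to every split
X_tr_scaled = scaler.transform(X_tr)
X_held_scaled = scaler.transform(X_held)

print("Scaler means come from train only:",
      np.allclose(scaler.mean_, X_tr.mean(axis=0)))
print("Held-out column means after scaling:",
      X_held_scaled.mean(axis=0).round(3))
```

The held-out means land near zero rather than exactly on it, which is expected: the held-out rows never contributed to the statistics used to scale them.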

## 1.1 — Import Libraries

In [None]:
# ── Core Libraries ──
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings, time, os, sys

# ── Sklearn ──
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix, f1_score,
    precision_score, recall_score, accuracy_score, roc_auc_score
)

# ── Deep Learning ──
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# ── Settings ──
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
warnings.filterwarnings("ignore")
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)

print("✅ All libraries imported successfully")
print(f"Python {sys.version}")
print(f"PyTorch {torch.__version__}")
print(f"Random seed fixed at {RANDOM_SEED}")

## 1.2 — Load Dataset

In [None]:
# Download from UCI repository
from ucimlrepo import fetch_ucirepo

online_shoppers = fetch_ucirepo(id=468)
df = online_shoppers.data.original.copy()

print(f"Dataset shape: {df.shape}")
print("Target column: 'Revenue'")
df.head()

## 1.3 — Initial Raw Data Inspection
Quick look at types, missing values, and class distribution **before any transformation**.

In [None]:
# ── Data types & missing values ──
print("=" * 60)
print("DATA TYPES & MISSING VALUES")
print("=" * 60)
info_df = pd.DataFrame({
    "dtype": df.dtypes,
    "non_null": df.count(),
    "null_count": df.isnull().sum(),
    "null_%": (df.isnull().sum() / len(df) * 100).round(2)
})
print(info_df)

print("\n" + "=" * 60)
print("BASIC STATISTICS (Numerical)")
print("=" * 60)
df.describe().T

In [None]:
# ── Class distribution (target: Revenue) ──
print("CLASS DISTRIBUTION")
print("=" * 40)
class_counts = df["Revenue"].value_counts()
class_pct = df["Revenue"].value_counts(normalize=True) * 100
print(pd.DataFrame({"count": class_counts, "percentage": class_pct.round(2)}))

fig, ax = plt.subplots(1, 1, figsize=(6, 4))
sns.countplot(x="Revenue", data=df, palette="viridis", ax=ax)
ax.set_title("Target Distribution — Class Imbalance Check")
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())} ({p.get_height()/len(df)*100:.1f}%)',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom', fontsize=11)
plt.tight_layout()
plt.show()

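# Count classes by label rather than by position, so the ratio stays
# negative:positive even if the class order in value_counts() changes
n_pos = int(df["Revenue"].sum())
n_neg = len(df) - n_pos
print(f"\n⚠️  Imbalance ratio: ~{n_neg / n_pos:.1f}:1 (negative:positive)")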
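
To make the practical meaning of the imbalance ratio concrete, here is a small worked calculation. The class counts are the ones reported on the UCI dataset page and are an assumption here; in the notebook the values from `class_counts` should be used instead.

```python
# Class counts as reported on the UCI dataset page (assumption: the
# downloaded data matches them; in the notebook, use class_counts instead)
n_negative = 10_422   # sessions without a purchase
n_positive = 1_908    # sessions with a purchase

imbalance_ratio = n_negative / n_positive

# Accuracy of a trivial model that always predicts "no purchase"
majority_baseline_acc = n_negative / (n_negative + n_positive)

print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
print(f"Always-negative baseline accuracy: {majority_baseline_acc:.1%}")
print("Recall of that baseline on the positive class: 0.0%")
```

A model that never predicts a purchase would already look accurate under these counts, which is why accuracy alone is a weak yardstick for this target.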

In [None]:
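# ── Identify column types (by dtype only) ──
# Note: OperatingSystems, Browser, Region and TrafficType are integer-coded
# categories, so this dtype-based selection lists them as numerical.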
categorical_cols = df.select_dtypes(include=["object", "bool"]).columns.tolist()
numerical_cols = df.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Remove target from feature lists
if "Revenue" in categorical_cols:
    categorical_cols.remove("Revenue")
if "Revenue" in numerical_cols:
    numerical_cols.remove("Revenue")

print(f"Categorical features ({len(categorical_cols)}): {categorical_cols}")
print(f"Numerical features  ({len(numerical_cols)}): {numerical_cols}")
print("Target: Revenue")
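
Selecting by dtype alone puts integer-coded identifiers such as `Browser` into the numerical list. The sketch below shows one possible way to override that; the helper `split_feature_types` and its `coded_categoricals` parameter are hypothetical names introduced for this example, and the frame is a tiny synthetic one that only borrows a few of the dataset's column names.

```python
import pandas as pd

def split_feature_types(frame, target, coded_categoricals=()):
    """Return (categorical, numerical) feature name lists.

    Columns named in coded_categoricals are treated as categorical even
    though their dtype is numeric.
    """
    features = frame.drop(columns=[target])
    # Everything that is not numeric (strings, booleans) counts as categorical
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    numerical = features.select_dtypes(include="number").columns.tolist()
    for col in coded_categoricals:
        if col in numerical:
            numerical.remove(col)
            categorical.append(col)
    return categorical, numerical

# Tiny synthetic frame reusing a few of the dataset's column names
demo = pd.DataFrame({
    "Month": ["Feb", "Mar", "May"],
    "Weekend": [False, True, False],
    "Browser": [1, 2, 2],               # integer code, not a quantity
    "BounceRates": [0.20, 0.00, 0.05],
    "Revenue": [False, False, True],
})

cat_cols, num_cols = split_feature_types(
    demo, "Revenue", coded_categoricals=["Browser"]
)
print("Categorical:", cat_cols)
print("Numerical:  ", num_cols)
```

Whether the coded columns are better one-hot encoded or left as integers is a Phase 2 decision; the point here is only that the dtype does not settle it.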

## 1.4 — Train / Validation / Test Split (BEFORE Preprocessing)

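**Critical**: We split the raw data first so that nothing learned during preprocessing (means, scales, category vocabularies) can be influenced by validation or test rows.  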
- **Train**: 70% — used to fit preprocessing & train models  
- **Validation**: 15% — used for hyperparameter tuning & model selection  
- **Test**: 15% — used **once** for final evaluation only  

We use **stratified** splitting to preserve class proportions across all sets.
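
The 70/15/15 arithmetic can be checked by hand. The sketch below assumes the 12,330 sessions reported on the UCI dataset page, and assumes that the held-out part of each step is rounded up, which is how scikit-learn's `train_test_split` is understood to treat a float `test_size`. The helper `split_sizes` is a hypothetical name used only for this illustration.

```python
import math

def split_sizes(n_samples, temp_frac=0.30, test_frac_of_temp=0.50):
    """Return (train, val, test) row counts for a two-step split.

    Assumption: the held-out part of each step is rounded up.
    """
    n_temp = math.ceil(n_samples * temp_frac)        # step 1: train vs. temp
    n_train = n_samples - n_temp
    n_test = math.ceil(n_temp * test_frac_of_temp)   # step 2: val vs. test
    n_val = n_temp - n_test
    return n_train, n_val, n_test

# 12,330 is the session count reported on the UCI page (assumption: the
# downloaded frame has the same number of rows)
N_ROWS = 12_330
n_train, n_val, n_test = split_sizes(N_ROWS)
for name, n in [("Train", n_train), ("Validation", n_val), ("Test", n_test)]:
    print(f"{name:12s}: {n:5d} rows ({n / N_ROWS:.1%})")
```

Because the 30% remainder has an odd number of rows under these assumptions, validation and test differ by one row; both still round to 15.0%.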

In [None]:
# ── Separate features and target ──
X = df.drop(columns=["Revenue"])
y = df["Revenue"].astype(int)  # Ensure binary int (0/1)

# ── Step 1: Split into Train (70%) and Temp (30%) ──
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=0.30,
    random_state=RANDOM_SEED,
    stratify=y
)

# ── Step 2: Split Temp into Validation (15%) and Test (15%) ──
# 50% of 30% = 15% each
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.50,
    random_state=RANDOM_SEED,
    stratify=y_temp
)

# ── Verify split sizes and class proportions ──
print("=" * 60)
print("SPLIT SUMMARY (Raw data, NO preprocessing applied yet)")
print("=" * 60)
for name, X_set, y_set in [("Train", X_train, y_train),
                             ("Validation", X_val, y_val),
                             ("Test", X_test, y_test)]:
    pos_rate = y_set.mean() * 100
    print(f"  {name:12s}: {X_set.shape[0]:5d} samples ({X_set.shape[0]/len(df)*100:.1f}%) | "
          f"Positive class: {pos_rate:.1f}%")

print(f"\n  Total: {len(X_train)+len(X_val)+len(X_test)} samples")
print(f"  Original: {len(df)} samples")
print("\n✅ Split done BEFORE any preprocessing — no data leakage.")
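
Matching totals do not by themselves prove that the three sets are disjoint. A minimal sketch of a stricter check follows; `check_split_integrity` is a hypothetical helper, and it assumes row labels are unique in the full dataset.

```python
def check_split_integrity(full_index, *split_indexes):
    """Raise AssertionError if the splits overlap or fail to cover full_index.

    Assumption: row labels in full_index are unique.
    """
    seen = set()
    for idx in split_indexes:
        labels = set(idx)
        assert len(labels) == len(idx), "duplicate row labels within a split"
        assert seen.isdisjoint(labels), "a row appears in more than one split"
        seen |= labels
    assert seen == set(full_index), "splits do not cover the full dataset"
    return True

# Toy example with plain lists standing in for DataFrame indexes
all_rows = list(range(10))
train_rows, val_rows, test_rows = [0, 1, 2, 3, 4, 5, 6], [7, 8], [9]
print(check_split_integrity(all_rows, train_rows, val_rows, test_rows))
```

In the notebook this could be called as `check_split_integrity(df.index, X_train.index, X_val.index, X_test.index)`, since a pandas index supports both `len()` and iteration.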

In [None]:
# ── Visual check: class proportion preserved across splits ──
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

for ax, (name, y_set) in zip(axes, [("Train", y_train), ("Validation", y_val), ("Test", y_test)]):
    # Sort by class label so the order is always 0 then 1, matching the bar labels
    counts = y_set.value_counts().sort_index()
    ax.bar(["No Purchase (0)", "Purchase (1)"], counts.values, color=["#3498db", "#e74c3c"])
    ax.set_title(f"{name} Set (n={len(y_set)})")
    ax.set_ylabel("Count")
    for i, v in enumerate(counts.values):
        ax.text(i, v + 5, f"{v}\n({v/len(y_set)*100:.1f}%)", ha="center", fontsize=10)

plt.suptitle("Stratified Split Verification — Class Proportions", fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

print("✅ Phase 1 Complete — Data loaded, inspected, and split.")
print("   Preprocessing will be fitted on TRAIN set only in Phase 2.")