# Lesson 04: Data Preprocessing

**What you'll learn:**
- Encode categorical variables (text to numbers)
- Scale numeric features
- Split data into training and testing
- Apply preprocessing steps in the correct order

---

## Section 1: Why Preprocess?

### READ

Most ML algorithms need NUMERIC data. Raw data often has problems:
- **Text values** that need to be converted to numbers
- **Different scales** (age: 0-100, salary: 0-100000)
- **Missing values** that need to be filled

Preprocessing fixes these problems!
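
As a rough illustration, the three problems above can be spotted with a few pandas calls. The toy DataFrame and its column names are made up for this sketch and are not part of the lesson dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that shows all three problems at once
toy = pd.DataFrame({
    "city": ["Paris", "Berlin", "Madrid", "Paris"],   # text values
    "age": [25, 47, 38, 61],                          # small scale
    "salary": [32000.0, 85000.0, np.nan, 54000.0],    # large scale + one missing value
})

# Problem 1: columns that are not numeric
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# Problem 2: very different ranges (missing values are skipped by min/max)
ranges = toy[["age", "salary"]].agg(["min", "max"])

# Problem 3: missing values per column
missing_counts = toy.isna().sum()

print("Text columns:", text_cols)
print("\nNumeric ranges:")
print(ranges)
print("\nMissing values per column:")
print(missing_counts)
```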

### TRY IT

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv('../datasets/churn_modelling.csv')
print("Original data:")
print(df.head())
print(f"\nData types:")
print(df.dtypes)

---

## Section 2: Encoding Categorical Variables

### READ

**Categorical variables** hold category labels, usually stored as text (like "red", "blue", "green").

**One-Hot Encoding**: Create a new column for each category
```
Color column: "red", "blue"
     ↓
Color_red: 0 or 1
Color_blue: 0 or 1
```
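
The diagram above can be reproduced on a tiny example. This is a minimal sketch with a made-up `Color` column. The `dtype=int` argument is used here so the output shows 0/1 (newer pandas versions show True/False by default).

```python
import pandas as pd

# Hypothetical data: one text column with two categories
colors = pd.DataFrame({"Color": ["red", "blue", "red"]})

# One new column per category, named <column>_<category>
encoded = pd.get_dummies(colors, columns=["Color"], dtype=int)
print(encoded)
```

pandas sorts the new columns by category name, so `Color_blue` comes before `Color_red`.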

### TRY IT

In [None]:
# Find categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns
print("Categorical columns:", categorical_cols.tolist())

# Check unique values
for col in categorical_cols:
    print(f"\n{col}: {df[col].unique()}")

In [None]:
# One-Hot Encoding with pandas
# dtype=int gives 0/1 columns (pandas 2.0+ defaults to True/False)
df_encoded = pd.get_dummies(df, columns=['Geography', 'Gender'], dtype=int)

print("After One-Hot Encoding:")
print(df_encoded.columns.tolist())
print(f"\nOriginal columns: {df.shape[1]}")
print(f"After encoding: {df_encoded.shape[1]}")

In [None]:
# See the encoded columns
print(df_encoded[['Geography_France', 'Geography_Germany', 'Geography_Spain', 'Gender_Female', 'Gender_Male']].head())

### EXPLAIN

- `get_dummies()` creates binary columns for each category
- Each row has 1 in its category column, 0 elsewhere
- Geography (3 values) → 3 columns
- Gender (2 values) → 2 columns
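
The points above can be checked on a few rows. This sketch uses made-up rows that mimic the `Geography` and `Gender` columns, not the real dataset.

```python
import pandas as pd

# Hypothetical rows that mimic the Geography and Gender columns
people = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Male", "Female", "Female", "Male"],
})
encoded = pd.get_dummies(people, columns=["Geography", "Gender"], dtype=int)

# Group the new columns by the original column they came from
geo_cols = [c for c in encoded.columns if c.startswith("Geography_")]
gender_cols = [c for c in encoded.columns if c.startswith("Gender_")]
print(len(geo_cols), "Geography columns,", len(gender_cols), "Gender columns")

# Every row should have exactly one 1 within each group
geo_row_sums = encoded[geo_cols].sum(axis=1).tolist()
gender_row_sums = encoded[gender_cols].sum(axis=1).tolist()
print("Geography row sums:", geo_row_sums)
print("Gender row sums:", gender_row_sums)
```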

---

## Section 3: Feature Scaling

### READ

**Problem**: Features on different scales can cause issues.
- Age: 18-92
- Balance: 0-250,000

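Models that rely on distances or gradients (like k-nearest neighbors or logistic regression) can treat Balance as more important just because the numbers are bigger! Tree-based models are mostly unaffected by scale.

**MinMaxScaler**: Rescales each feature to the 0-1 range (the smallest value becomes 0, the largest becomes 1)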
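
To see what the scaler does under the hood, here is a minimal sketch that applies the min-max formula `(x - min) / (max - min)` by hand and compares it with `MinMaxScaler`. The age values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages: one feature column, three rows
ages = np.array([[18.0], [40.0], [92.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same calculation done by scikit-learn
scaled = MinMaxScaler().fit_transform(ages)

print("Manual: ", manual.ravel())
print("Sklearn:", scaled.ravel())
```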

### TRY IT

In [None]:
# Before scaling
print("Before Scaling:")
print(f"Age range: {df['Age'].min()} - {df['Age'].max()}")
print(f"Balance range: {df['Balance'].min()} - {df['Balance'].max()}")

In [None]:
# Apply MinMaxScaler
scaler = MinMaxScaler()

# Get numeric columns to scale
numeric_cols = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

# Scale
df_scaled = df_encoded.copy()
df_scaled[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])

print("After Scaling (0-1 range):")
print(f"Age range: {df_scaled['Age'].min():.2f} - {df_scaled['Age'].max():.2f}")
print(f"Balance range: {df_scaled['Balance'].min():.2f} - {df_scaled['Balance'].max():.2f}")

### EXPLAIN

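- `fit_transform()` learns each column's min/max AND transforms the data in one call
- Now every column in `numeric_cols` is between 0 and 1
- Min-max scaling is linear, so the order and relative spacing of values within each column are preserved
- Note: here the scaler is fit on the FULL dataset to keep the demo simple. Section 5 explains why it should be fit on training data only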
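
A small sketch of what the scaler remembers after fitting, using made-up balance values. `data_min_`, `data_max_`, and `inverse_transform()` are standard `MinMaxScaler` features, though this lesson does not use them elsewhere.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical balances: one feature column
balances = np.array([[0.0], [50000.0], [125000.0], [250000.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(balances)

# The scaler stores what it learned during fit
print("Learned min:", scaler.data_min_)
print("Learned max:", scaler.data_max_)

# A value halfway between min and max lands at 0.5
print("Scaled:", scaled.ravel())

# Scaling can be undone, so no information is lost
restored = scaler.inverse_transform(scaled)
print("Restored:", restored.ravel())
```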

---

## Section 4: Train-Test Split

### READ

We split data into:
- **Training set** (80%): The model learns from this
- **Test set** (20%): Checks how well the model performs on NEW data

**NEVER test on training data!** That's like testing a student on questions they've already seen.
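
To make this concrete, here is a rough sketch that scores a model on the data it trained on and on held-out data. The synthetic dataset (`make_classification`) and the `DecisionTreeClassifier` are assumptions chosen only for illustration. Models are covered properly in the next lesson.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y), used only for this demo
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# An unrestricted decision tree can memorize its training data
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # questions the "student" has already seen
test_acc = model.score(X_test, y_test)     # NEW questions

print(f"Accuracy on training data: {train_acc:.2f}")
print(f"Accuracy on test data:     {test_acc:.2f}")
```

The training score looks perfect because the tree has memorized those rows. Only the test score says anything about new data.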

### TRY IT

In [None]:
# Prepare features (X) and target (y)
# Drop ID-like columns (no predictive value) and the target itself
X = df_scaled.drop(['RowNumber', 'CustomerId', 'Surname', 'Exited'], axis=1)
y = df_scaled['Exited']

print(f"Features (X): {X.shape}")
print(f"Target (y): {y.shape}")

In [None]:
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for testing
    random_state=42,    # For reproducibility
    stratify=y          # Keep class proportions
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
print(f"\nClass distribution in training:")
print(y_train.value_counts(normalize=True).round(3))
print(f"\nClass distribution in testing:")
print(y_test.value_counts(normalize=True).round(3))

### EXPLAIN

- `test_size=0.2`: 20% goes to test set
- `random_state=42`: Makes the split reproducible
- `stratify=y`: Keeps same class proportions in train and test
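
The effect of `stratify` and `random_state` can be checked on a small made-up dataset. This sketch uses 100 hypothetical samples with a 20% positive class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 20 positives, 80 negatives
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# stratify=y keeps the 20% positive rate in both parts
print("Positive rate in train:", y_tr.mean())
print("Positive rate in test: ", y_te.mean())

# Same random_state -> exactly the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
same_split = np.array_equal(X_te, X_te2)
print("Same test rows on second run:", same_split)
```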

---

## Section 5: The CORRECT Order

### READ

**IMPORTANT: Order matters!**

```
CORRECT ORDER:
1. Split into train/test FIRST
2. Fit scaler on TRAINING data only
3. Transform BOTH train and test with same scaler
```

**Why?** If you scale before splitting, the scaler's min and max are computed from the test rows too, so information from test data "leaks" into training. This can make results look better than they will be on truly new data.
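
The leak is easy to see with numbers. In this minimal sketch (all values are made up), a single large test value changes what the scaler learns and therefore changes the scaled training data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[10.0], [20.0], [30.0]])  # hypothetical training values
test = np.array([[100.0]])                  # hypothetical test value, much larger

# WRONG: the scaler sees the test row
leaky = MinMaxScaler().fit(np.vstack([train, test]))

# RIGHT: the scaler sees training rows only
clean = MinMaxScaler().fit(train)

print("Max learned with leak:   ", leaky.data_max_[0])
print("Max learned without leak:", clean.data_max_[0])

leaky_train = leaky.transform(train).ravel()
clean_train = clean.transform(train).ravel()
print("Training data scaled with leak:   ", leaky_train.round(3))
print("Training data scaled without leak:", clean_train.round(3))
```

With the leak, the training values are squeezed toward 0 because the scaler already "knows" about the large test value.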

### TRY IT - The RIGHT Way

In [None]:
# Start fresh
df = pd.read_csv('../datasets/tomatjus.csv')
X = df.drop('quality', axis=1)
y = df['quality']

# STEP 1: Split FIRST (before any preprocessing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Step 1: Split data")
print(f"  Training: {len(X_train)} samples")
print(f"  Testing: {len(X_test)} samples")

In [None]:
# STEP 2: Fit scaler on TRAINING data only
scaler = MinMaxScaler()
scaler.fit(X_train)  # Only learn from training!
print("Step 2: Fit scaler on training data only")

In [None]:
# STEP 3: Transform BOTH sets with the SAME scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Step 3: Transform both sets")
print(f"\nTraining data (first row):")
print(X_train_scaled[0])
print(f"\nTest data (first row):")
print(X_test_scaled[0])

### EXPLAIN

**Key Points:**
- `fit()` learns parameters from training data
- `transform()` applies those parameters
- Use `fit()` only on training, `transform()` on both
- This prevents data leakage!
- Test values outside the training min/max will scale below 0 or above 1. That is expected
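
One consequence worth seeing once: because the scaler only knows the training min and max, scaled test values are not guaranteed to stay inside 0-1. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[20.0], [40.0], [60.0]])  # hypothetical training values
X_test = np.array([[10.0], [50.0], [80.0]])   # hypothetical test values

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform, training only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max

print("Train:", X_train_scaled.ravel())
print("Test: ", X_test_scaled.ravel())
```

Here 10 is below the training minimum and 80 is above the training maximum, so they scale to a negative value and a value above 1. The model simply sees them as more extreme than anything in training.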

---

## Quick Reference

| Task | Code |
|------|------|
| One-hot encode | `pd.get_dummies(df, columns=['col'])` |
| Create scaler | `scaler = MinMaxScaler()` |
| Fit on training | `scaler.fit(X_train)` |
| Transform | `X_scaled = scaler.transform(X)` |
| Fit + Transform (training only) | `X_train_scaled = scaler.fit_transform(X_train)` |
| Train-test split | `train_test_split(X, y, test_size=0.2)` |

---

## Next Lesson

In **Lesson 05: Your First ML Model**, you'll learn:
- How to train different classifiers
- The fit-predict pattern
- Making predictions on new data