# Session 6: Train/Test Split and Putting It Together

Last session we learned to encode text into numbers with one-hot encoding.

Now our data is ALL numbers. Almost ready for Machine Learning!

But first, we need to answer: **How do we know if the AI actually learned, or just memorized?**

Today we'll learn:
- **Why** we split data into training and testing sets
- **How** to split with `train_test_split()`
- The **complete pipeline** from raw data to ML-ready data

---
## 1. The Studying Analogy

Imagine you're studying for a math test:

- **Option A:** You practice 100 problems, then the test uses **the same 100 problems**.
  - You got 100%. But did you actually learn math? Or just memorize answers?

- **Option B:** You practice 80 problems, then the test has **20 NEW problems** you've never seen.
  - If you score well on problems you've never seen, you actually **learned**.

Machine Learning works the same way:

- **Training set** (usually about 80%) = the practice problems. The AI studies these.
- **Test set** (usually about 20%) = the final exam. We hide these and check later.

If the AI predicts well on cars it has **never seen before**, it actually learned the patterns.
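
The analogy can be sketched in plain Python. The `memorizer` below is a hypothetical stand-in for a model that only stores answers; it is not a real ML algorithm, and the toy "doubling" problems are invented for illustration.

```python
# Toy "practice problems": each question is paired with its correct answer (double it).
practice = {1: 2, 2: 4, 3: 6, 4: 8}
new_problems = {5: 10, 6: 12}  # never seen during practice

def memorizer(question):
    """Looks up a stored answer; has nothing to offer for unseen questions."""
    return practice.get(question)  # returns None for unseen questions

def learner(question):
    """Applies the underlying rule, so it also works on new questions."""
    return question * 2

def score(predict, problems):
    """Fraction of problems answered correctly."""
    correct = sum(predict(q) == answer for q, answer in problems.items())
    return correct / len(problems)

print("Memorizer on practice problems:", score(memorizer, practice))
print("Memorizer on new problems:     ", score(memorizer, new_problems))
print("Learner on new problems:       ", score(learner, new_problems))
```

Testing only on the practice problems cannot tell the memorizer and the learner apart. Testing on new problems can.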

---
## 2. X and y: Features vs Target

Before splitting, we need to separate our data into two parts:

- **X** = the **features** (the information the AI uses to predict). Think: "the questions on the exam."
- **y** = the **target** (the value the AI tries to predict). Think: "the answers."

For our car project:
- X = year, mileage, brand columns (what we know about the car)
- y = price (what we want to predict)

In [None]:
import pandas as pd

# Start with the same car data we encoded in Session 5
data = {
    "brand":   ["Ford", "Toyota", "BMW", "Ford", "Toyota", "Tesla", "BMW", "Ford", "BMW", "Ford"],
    "year":    [2018, 2015, 2019, 2012, 2020, 2022, 2017, 2016, 2021, 2003],
    "mileage": [45000, 80000, 30000, 120000, 15000, 5000, 55000, 70000, 25000, 152000],
    "price":   [18000, 12000, 35000, 5000, 25000, 55000, 22000, 13000, 42000, 8000]
}

df = pd.DataFrame(data)

# Encode brand (same as Session 5)
df_encoded = pd.get_dummies(df, columns=["brand"])

print("--- Encoded data ---")
print(df_encoded)
print()
print("Columns:", list(df_encoded.columns))
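
Depending on your pandas version, the new brand columns may print as `True`/`False` instead of `1`/`0`, because recent pandas releases return boolean columns by default. Booleans behave like 1 and 0 in arithmetic, so the pipeline still works. If you prefer to see numbers, one option is the `dtype` argument of `pd.get_dummies`, sketched here on a made-up three-row table:

```python
import pandas as pd

# Invented mini-table, just for this illustration
small = pd.DataFrame({
    "brand": ["Ford", "Toyota", "BMW"],
    "price": [18000, 12000, 35000],
})

# dtype=int asks get_dummies to produce 0/1 integers for the new columns
small_encoded = pd.get_dummies(small, columns=["brand"], dtype=int)

print(small_encoded)
print(small_encoded.dtypes)
```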

In [None]:
# Separate features (X) and target (y)
X = df_encoded.drop("price", axis=1)  # Everything EXCEPT price
y = df_encoded["price"]                # ONLY price

print("--- X (Features: what the AI looks at) ---")
print(X)
print()
print("--- y (Target: what the AI predicts) ---")
print(y)

In [None]:
# Check the shapes
print(f"X shape: {X.shape} — {X.shape[0]} cars, {X.shape[1]} features each")
print(f"y shape: {y.shape} — {y.shape[0]} prices to predict")

---
## 3. train_test_split: Dividing the Data

Now we split X and y into:
- `X_train`, `y_train` — for the AI to study
- `X_test`, `y_test` — for the final exam

We use sklearn's `train_test_split` function.
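
Conceptually, the function shuffles the row positions and then cuts the shuffled list in two. The sketch below mimics that idea in plain Python. `simple_split` is a hypothetical helper written for this lesson, not scikit-learn's actual implementation, and it will not pick the same rows as `train_test_split`.

```python
import math
import random

def simple_split(rows, test_size=0.2, seed=42):
    """Shuffle row positions with a fixed seed, then cut off a test portion."""
    positions = list(range(len(rows)))
    random.Random(seed).shuffle(positions)      # same seed -> same shuffle

    n_test = math.ceil(test_size * len(rows))   # round the test count up
    test_positions = positions[:n_test]
    train_positions = positions[n_test:]

    train = [rows[i] for i in train_positions]
    test = [rows[i] for i in test_positions]
    return train, test

cars = ["car0", "car1", "car2", "car3", "car4",
        "car5", "car6", "car7", "car8", "car9"]

train, test = simple_split(cars, test_size=0.2, seed=42)
print("Train:", train)
print("Test: ", test)
```

Every car lands in exactly one of the two lists, and calling the function again with the same seed gives the same result.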

In [None]:
from sklearn.model_selection import train_test_split

# Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% goes to the test set
    random_state=42     # The "shuffle seed" — makes the split reproducible
)

print(f"Training set: {len(X_train)} cars (the AI studies these)")
print(f"Test set:     {len(X_test)} cars (the final exam)")

### Understanding the arguments

| Argument | What it does | Example |
|----------|-------------|--------|
| `X, y` | The data to split | Our features and prices |
| `test_size=0.2` | How much to save for testing | 0.2 = 20%, 0.3 = 30% |
| `random_state=42` | The shuffle seed | Same number = same split every time |

The function returns **4 things** (in this exact order):
1. `X_train` — training features
2. `X_test` — test features
3. `y_train` — training prices
4. `y_test` — test prices
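
A common slip is unpacking the four results in the wrong order, for example swapping `X_test` and `y_train`. One way to check that each feature row still sits next to its own price is to compare the row labels. The small `features` and `prices` tables below are invented for this check.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented mini-dataset: 5 cars
features = pd.DataFrame({
    "year":    [2018, 2015, 2019, 2012, 2020],
    "mileage": [45000, 80000, 30000, 120000, 15000],
})
prices = pd.Series([18000, 12000, 35000, 5000, 25000], name="price")

# Unpack in the documented order: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    features, prices, test_size=0.2, random_state=42
)

# Each feature row keeps its matching price: the row labels line up.
print("Train rows match:", X_train.index.equals(y_train.index))
print("Test rows match: ", X_test.index.equals(y_test.index))

# No car appears in both sets.
print("Overlap:", set(X_train.index) & set(X_test.index))
```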

In [None]:
# Let's look at what the AI gets to study
print("--- Training features (X_train) ---")
print(X_train)
print()
print("--- Training prices (y_train) ---")
print(y_train)

In [None]:
# And the final exam — the AI has NEVER seen these
print("--- Test features (X_test) ---")
print(X_test)
print()
print("--- Test prices (y_test) — the correct answers ---")
print(y_test)

---
## 4. Experimenting with the Split

Let's see how changing the arguments affects the split.
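
Because cars come in whole numbers, a fraction like 20% sometimes has to be rounded. To our understanding, when only `test_size` is given as a fraction, scikit-learn rounds the test count up and gives the remaining rows to training. The helper `split_counts` below is a hypothetical function written for this lesson that reproduces that arithmetic without calling the library.

```python
import math

def split_counts(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test count up."""
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test

for n_rows in [10, 6]:
    for size in [0.2, 0.25, 0.5]:
        n_train, n_test = split_counts(n_rows, size)
        print(f"{n_rows} rows, test_size={size} -> {n_train} train, {n_test} test")
```

With very small datasets the real proportions can drift away from the requested ones, so it is worth printing the counts after every split.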

In [None]:
# Try different test sizes
for size in [0.1, 0.2, 0.3, 0.5]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=size, random_state=42)
    pct = round(size * 100)  # round() is safer than int() with floats
    print(f"test_size={size} → {len(X_tr)} train, {len(X_te)} test ({pct}% for testing)")

In [None]:
# What does random_state do?
# Same random_state = same split every time
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.2, random_state=99)

print("random_state=42 (first time), test rows:", list(X_test_a.index))
print("random_state=42 (second time), test rows:", list(X_test_b.index))
print("random_state=99, test rows:", list(X_test_c.index))
print()
print("Same seed = same shuffle. Different seed = different shuffle.")
print("This way, your results are reproducible (anyone can get the same split).")

---
## 5. The Complete Pipeline (All Steps Together)

Let's build the entire data preparation pipeline from scratch, step by step.

This is what the n2 notebook does — and soon, you'll be able to build this from memory.

In [None]:
# ============================================
# THE COMPLETE ML DATA PIPELINE
# ============================================

import pandas as pd
from sklearn.model_selection import train_test_split

# --- STEP 1: Create or load the data ---
data = {
    "brand":   ["Ford", "Toyota", "BMW", "Ford", "Toyota", "Tesla", "BMW", "Ford", "BMW", "Ford"],
    "year":    [2018, 2015, 2019, 2012, 2020, 2022, 2017, 2016, 2021, 2003],
    "mileage": [45000, 80000, 30000, 120000, 15000, 5000, 55000, 70000, 25000, 152000],
    "price":   [18000, 12000, 35000, 5000, 25000, 55000, 22000, 13000, 42000, 8000]
}

df = pd.DataFrame(data)
print("STEP 1 — Raw data:")
print(df)
print()

In [None]:
# --- STEP 2: Encode text columns ---
df_encoded = pd.get_dummies(df, columns=["brand"])
print("STEP 2 — After encoding:")
print(df_encoded)
print()

In [None]:
# --- STEP 3: Separate features (X) and target (y) ---
X = df_encoded.drop("price", axis=1)
y = df_encoded["price"]
print("STEP 3 — X and y:")
print(f"  X shape: {X.shape} (features)")
print(f"  y shape: {y.shape} (target)")
print()

In [None]:
# --- STEP 4: Split into train and test ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("STEP 4 — Train/Test Split:")
print(f"  Training: {len(X_train)} cars")
print(f"  Testing:  {len(X_test)} cars")
print()
print("READY FOR MACHINE LEARNING!")
print("Next session: we give X_train and y_train to the AI, and it learns to predict prices.")

### The Pipeline Summary

```
Raw data (dictionary/CSV)
    ↓
DataFrame
    ↓
Encode text → pd.get_dummies()
    ↓
Split X and y → df.drop() / df["column"]
    ↓
Split train/test → train_test_split()
    ↓
Ready for ML!
```

Most ML projects that predict a value follow this pattern. The details change, but the core steps stay the same.
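
Since the steps repeat from dataset to dataset, they can be wrapped in a single function. This is only a sketch: `prepare_data` and its parameter names are hypothetical, and the `pets` table is invented to show the call on something other than cars.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def prepare_data(df, target, text_columns, test_size=0.2, random_state=42):
    """Encode text columns, separate X and y, then split into train/test."""
    encoded = pd.get_dummies(df, columns=text_columns)   # Step 2: encode
    X = encoded.drop(target, axis=1)                      # Step 3: features
    y = encoded[target]                                   #         target
    return train_test_split(                              # Step 4: split
        X, y, test_size=test_size, random_state=random_state
    )

# Invented example data, just to show the call
pets = pd.DataFrame({
    "species": ["cat", "dog", "dog", "cat", "bird"],
    "age":     [2, 5, 1, 7, 3],
    "weight":  [4.0, 20.5, 8.0, 5.5, 0.3],
})

X_train, X_test, y_train, y_test = prepare_data(
    pets, target="weight", text_columns=["species"]
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

While you are learning, writing the four steps out by hand (as in the cells above) is still the better practice. A helper like this only pays off once the steps feel automatic.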

---
## 6. Recap: What Each Variable Means

After the pipeline, you have 8 key variables:

| Variable | What it is | Shape in our example |
|----------|-----------|---------------------|
| `df` | The original raw data | 10 rows, 4 columns |
| `df_encoded` | Data with text converted to numbers | 10 rows, 7 columns |
| `X` | All features (everything except price) | 10 rows, 6 columns |
| `y` | The target (prices) | 10 values |
| `X_train` | Features the AI studies | 8 rows, 6 columns |
| `X_test` | Features for the final exam | 2 rows, 6 columns |
| `y_train` | Prices the AI studies | 8 values |
| `y_test` | Correct prices for the final exam | 2 values |

---
---
# CHALLENGES

### Challenge 1: Split and Explore

Using the encoded data from Section 5 (already created above as `X` and `y`):

1. Split with `test_size=0.3` (30% testing) and `random_state=7`
2. Print how many cars are in the training set and test set
3. Print the test set features (`X_test`) — which cars ended up in the test set?
4. Print the test set prices (`y_test`) — what are their actual prices?

In [None]:
from sklearn.model_selection import train_test_split

# X and y already exist from Section 5

# YOUR CODE HERE


### Challenge 2: Build the Pipeline from a New Dataset

Build the complete pipeline from scratch using this data:

```python
phones = {
    "brand":   ["Apple", "Samsung", "Apple", "Google", "Samsung", "Apple", "Google", "Samsung"],
    "storage": [128, 256, 256, 128, 128, 512, 256, 512],
    "age":     [1, 2, 1, 3, 2, 1, 1, 2],
    "price":   [800, 600, 900, 350, 450, 1100, 500, 700]
}
```

Steps:
1. Create a DataFrame
2. Encode the brand column with `pd.get_dummies()`
3. Separate X (features) and y (price)
4. Split into train/test with `test_size=0.25, random_state=42`
5. Print the shape of each: `X_train`, `X_test`, `y_train`, `y_test`

This is a completely different dataset — but the pipeline is identical!

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

phones = {
    "brand":   ["Apple", "Samsung", "Apple", "Google", "Samsung", "Apple", "Google", "Samsung"],
    "storage": [128, 256, 256, 128, 128, 512, 256, 512],
    "age":     [1, 2, 1, 3, 2, 1, 1, 2],
    "price":   [800, 600, 900, 350, 450, 1100, 500, 700]
}

# YOUR CODE HERE


### Challenge 3: Build from Memory

This is the big one. **Without looking at the code above**, build the complete pipeline for this car data:

```python
cars = {
    "brand":   ["Honda", "Ford", "Toyota", "Honda", "Ford", "Tesla"],
    "year":    [2020, 2018, 2019, 2017, 2021, 2023],
    "mileage": [30000, 55000, 42000, 78000, 12000, 3000],
    "price":   [23000, 15000, 19000, 11000, 28000, 52000]
}
```

From memory, do all 4 steps:
1. DataFrame
2. Encode
3. X/y split
4. Train/test split (`test_size=0.2`, `random_state=42`)

Print the number of training and test cars at the end.

If you get stuck, look back — but try at least 3 minutes on your own first!

In [None]:
# Build the COMPLETE pipeline from memory!

cars = {
    "brand":   ["Honda", "Ford", "Toyota", "Honda", "Ford", "Tesla"],
    "year":    [2020, 2018, 2019, 2017, 2021, 2023],
    "mileage": [30000, 55000, 42000, 78000, 12000, 3000],
    "price":   [23000, 15000, 19000, 11000, 28000, 52000]
}

# YOUR CODE HERE


### Challenge 4: The test_size Experiment

Using the car data from Challenge 3, try 5 different `test_size` values: `0.1, 0.2, 0.3, 0.4, 0.5`.

For each one, print how many cars go to training and how many go to testing.

Then answer as a comment:
- What happens if `test_size` is too small (like 0.1)?
- What happens if `test_size` is too large (like 0.5)?
- Why is 0.2 (20%) a common choice?

In [None]:
# YOUR CODE HERE

# Answers:
# Too small test_size: 
# Too large test_size: 
# Why 0.2 is common: 


### Challenge 5: Pipeline with Real Data

Apply the pipeline to the real CSV dataset:

1. Load `../data/usedcarprices_sujayr_train.csv`
2. Select only these columns: `Year`, `Kilometers_Driven`, `Fuel_Type`, `Transmission`, `Price`
   - Hint: `df = df[["Year", "Kilometers_Driven", "Fuel_Type", "Transmission", "Price"]]`
3. Drop rows with missing values: `df = df.dropna()`
4. Encode `Fuel_Type` and `Transmission` with `pd.get_dummies()`
5. Create X (drop Price) and y (just Price)
6. Split into train/test (80/20, random_state=42)
7. Print the shapes of X_train and X_test

You just prepared a **real dataset** for Machine Learning!
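
Step 3 uses `dropna()`, which has not appeared in this session yet. Here is a small sketch of what it does, on an invented mini-table (not the real CSV). Counting the missing values first with `isna().sum()` shows how many rows you are about to lose.

```python
import pandas as pd

# Invented mini-table with gaps (None becomes a missing value)
sample = pd.DataFrame({
    "Year":  [2015, 2018, None, 2020],
    "Price": [4.5, None, 6.0, 9.2],
})

print("Missing values per column:")
print(sample.isna().sum())

cleaned = sample.dropna()   # keeps only rows with no missing values
print(f"Rows before: {len(sample)}, rows after: {len(cleaned)}")
```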

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# YOUR CODE HERE
