# Session 5: One-Hot Encoding (The Conceptual Bridge)

Last session we learned pandas — loading data, filtering, exploring.

But there's a problem: **Most machine learning models only work with numbers.**

Our dataset has text columns like `brand` ("Ford", "Toyota", "BMW"). How do we turn words into numbers?

Today we'll learn:
- **Why** we need to convert text to numbers
- **How** one-hot encoding works (by building it ourselves)
- The **pandas shortcut** that does it in one line
- How the `n2_data_transformation` notebook uses encoding in a real pipeline

---
## 1. The Problem: Computers Can't Multiply Words

Imagine you're the AI. Someone gives you this data and says "predict the price":

| brand  | year | mileage |
|--------|------|---------|
| Ford   | 2018 | 45000   |
| Toyota | 2015 | 80000   |
| BMW    | 2019 | 30000   |

Year and mileage are numbers — you can do math with them.

But **"Ford"**? Can you multiply "Ford" times 3? Can you add "Toyota" plus "BMW"?

No. We need a way to turn brands into numbers.

### Why not just number them?

You might think: Ford = 1, Toyota = 2, BMW = 3. Simple!

But that creates a **fake relationship**. The AI would think:
- BMW (3) is "greater than" Ford (1)
- Toyota (2) is "between" Ford and BMW
- BMW minus Ford = Toyota (3 - 1 = 2)

None of that is true! Brands don't have an order or distance between them.
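
Here is a small sketch of that problem, assuming the numbering suggested above. The `codes` dictionary is made up for illustration only:

```python
# Hypothetical label codes, used only to show the problem
codes = {"Ford": 1, "Toyota": 2, "BMW": 3}

# Python will happily compare and subtract the codes,
# even though the results mean nothing for car brands
bmw_greater_than_ford = codes["BMW"] > codes["Ford"]
bmw_minus_ford = codes["BMW"] - codes["Ford"]

print("BMW > Ford?", bmw_greater_than_ford)
print("BMW - Ford equals Toyota?", bmw_minus_ford == codes["Toyota"])
```

Both lines print `True`. A model fed these codes would see the same "facts" and could try to learn from them.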

We need a smarter approach: **one-hot encoding**.

---
## 2. One-Hot Encoding: The Idea

Instead of one column with text, we create **a separate column for each brand**.

Each column asks a yes/no question: **"Is this car a Ford?"** → 1 = Yes, 0 = No.

| brand  | → | is_Ford | is_Toyota | is_BMW |
|--------|---|---------|-----------|--------|
| Ford   | → | 1       | 0         | 0      |
| Toyota | → | 0       | 1         | 0      |
| BMW    | → | 0       | 0         | 1      |
| Ford   | → | 1       | 0         | 0      |

Each row has exactly **one** 1 and the rest are 0s. That's why it's called "**one**-hot" — one value is "hot" (1), the rest are "cold" (0).

Now the AI sees only numbers! And no fake ordering.
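
To check the "exactly one 1 per row" rule, here is a minimal sketch with the table above typed in as plain lists. The variable names are only for illustration:

```python
# Each inner list is one row of the table: [is_Ford, is_Toyota, is_BMW]
rows = [
    [1, 0, 0],  # Ford
    [0, 1, 0],  # Toyota
    [0, 0, 1],  # BMW
    [1, 0, 0],  # Ford
]

# In a valid one-hot encoding, every row adds up to exactly 1
row_sums = []
for row in rows:
    row_sums.append(sum(row))

print("Row sums:", row_sums)
```

If any row summed to 0 or 2, the encoding would be broken: that car would have no brand, or two brands at once.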

---
## 3. Let's Build It Ourselves

Before using the pandas shortcut, let's understand what's happening under the hood.

We'll write the encoding step by step, using only things we learned in Sessions 1-3.

In [None]:
# Our data: a list of car brands
brands = ["Ford", "Toyota", "BMW", "Ford", "Toyota"]

### Step 1: Find all the unique brands

We need to know what columns to create. If there are 3 unique brands, we need 3 new columns.

In [None]:
# Find unique values (no duplicates)
unique_brands = []

for brand in brands:
    if brand not in unique_brands:
        unique_brands.append(brand)

print("Unique brands found:", unique_brands)
print(f"We need {len(unique_brands)} new columns.")

### Step 2: Create a column of 1s and 0s for ONE brand

Let's start with just Ford. For each car, ask: "Is it a Ford?" → 1 or 0.

In [None]:
# Create the "is_Ford" column
is_ford = []

for brand in brands:
    if brand == "Ford":
        is_ford.append(1)   # Yes, it's a Ford!
    else:
        is_ford.append(0)   # No, it's not a Ford

print("Original brands:", brands)
print("is_Ford column: ", is_ford)

### Step 3: Do the same for EVERY unique brand

Now we repeat Step 2 for each unique brand and store the result in a dictionary.

In [None]:
# Build the full one-hot encoded result
result = {}  # A dictionary to hold our new columns

for unique_brand in unique_brands:
    # Create a column name like "brand_Ford"
    column_name = f"brand_{unique_brand}"
    
    # Build the 1s and 0s for this brand
    column = []
    for brand in brands:
        if brand == unique_brand:
            column.append(1)
        else:
            column.append(0)
    
    # Store it in our dictionary
    result[column_name] = column
    print(f"{column_name}: {column}")

print()
print("Full result dictionary:")
print(result)

That dictionary of lists is exactly the structure we learned in Session 3 — and it can become a DataFrame!

In [None]:
import pandas as pd

# Turn our result into a table
encoded_df = pd.DataFrame(result)

print("--- Our hand-built one-hot encoding ---")
print(encoded_df)

---
## 4. The Function in some_utilities.py

The code we just wrote is packaged as a function in `../notebooks/some_utilities.py`.

Let's import it and see that it does the same thing.

In [None]:
# Add the notebooks folder to our path so we can import from it
import sys
sys.path.append("../notebooks")

from some_utilities import get_dummies

In [None]:
# Use the function from some_utilities.py
brands = ["Ford", "Toyota", "BMW", "Ford", "Toyota"]
encoded = get_dummies(brands, "brand")

# Show the result as a table
print(pd.DataFrame(encoded))

Same result as what we built by hand! The function just wraps our Steps 1-3 into a reusable block.

Let's trace through the function to make sure we understand every line:

```python
def get_dummies(data_list, column_name):
    # STEP 1: Find unique values
    unique_values = []
    for item in data_list:              # Loop through all brands
        if item not in unique_values:    # If we haven't seen it before...
            unique_values.append(item)   # ...add it to our list
    
    # STEP 2: Create a new column for each unique value
    result = {}
    for unique_val in unique_values:                  # For each unique brand...
        new_column_name = f"{column_name}_{unique_val}"  # Make a name like "brand_Ford"
        new_column = []
        for item in data_list:           # Go through ALL brands again
            if item == unique_val:        # Does this car match?
                new_column.append(1)      # Yes → 1
            else:
                new_column.append(0)      # No → 0
        result[new_column_name] = new_column
    
    return result
```

The **outer loop** picks each unique brand. The **inner loop** goes through every car and asks: "Is this car that brand?"
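
For comparison, the same two loops can be squeezed into a few lines. This is only a sketch: `get_dummies_compact` is a hypothetical name, not something in `some_utilities.py`, and it uses comprehensions, which we have not covered yet.

```python
def get_dummies_compact(data_list, column_name):
    """Hypothetical compact variant of get_dummies, for illustration only."""
    # dict.fromkeys keeps the first occurrence of each value, in order
    unique_values = list(dict.fromkeys(data_list))

    # Outer loop: one column per unique value
    # Inner loop: 1 if the item matches that value, otherwise 0
    return {
        f"{column_name}_{value}": [1 if item == value else 0 for item in data_list]
        for value in unique_values
    }


encoded_compact = get_dummies_compact(
    ["Ford", "Toyota", "BMW", "Ford", "Toyota"], "brand"
)
print(encoded_compact)
```

The longer version is easier to read while learning. The compact one shows that the logic really is just "for each unique value, compare every item".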

---
## 5. The Pandas Shortcut: pd.get_dummies()

Building it by hand taught us the concept. But in real projects, pandas does it in **one line**.

In [None]:
# Create a DataFrame with our brands
df = pd.DataFrame({"brand": ["Ford", "Toyota", "BMW", "Ford", "Toyota"]})

print("--- BEFORE encoding ---")
print(df)
print()

In [None]:
# One line to encode!
df_encoded = pd.get_dummies(df, columns=["brand"])

print("--- AFTER encoding ---")
print(df_encoded)

The same encoding as our hand-built version! `pd.get_dummies()` is just a faster, fancier version of what we coded ourselves.

Notice:
- The original `brand` column is **gone** (replaced by the new columns)
- New columns are named `brand_BMW`, `brand_Ford`, `brand_Toyota` (pandas sorts them alphabetically, while our version kept the order of first appearance)
- Each row has exactly one `True` and the rest are `False` in pandas 2.0 and later; older versions show 1 and 0 instead (True/False work like 1/0 in math)
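
If you would rather see 1s and 0s no matter which pandas version you have, `pd.get_dummies()` accepts a `dtype` argument. A minimal sketch (the `df_demo` names are only for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({"brand": ["Ford", "Toyota", "BMW", "Ford", "Toyota"]})

# dtype=int asks pandas for 1/0 instead of True/False
df_demo_encoded = pd.get_dummies(df_demo, columns=["brand"], dtype=int)

print(df_demo_encoded)
print("Columns:", list(df_demo_encoded.columns))
```

The column names and the encoding do not change. Only the way the values are stored and displayed does.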

### Encoding with a full dataset

Let's try it on a more realistic dataset with multiple columns.

In [None]:
# A small dealership dataset
data = {
    "brand":   ["Ford",  "Toyota", "BMW",   "Ford",  "Toyota"],
    "status":  ["Used",  "Used",   "New",   "Used",  "New"],
    "year":    [2018,    2015,     2019,    2012,    2020],
    "mileage": [45000,   80000,    30000,   120000,  15000],
    "price":   [18000,   12000,    35000,   5000,    25000]
}

df = pd.DataFrame(data)
print("--- Original data ---")
print(df)
print()
print("Column types:")
print(df.dtypes)

In [None]:
# Encode ALL text columns at once
df_encoded = pd.get_dummies(df, columns=["brand", "status"])

print("--- After encoding brand AND status ---")
print(df_encoded)
print()
print("Columns:", list(df_encoded.columns))

The text columns `brand` and `status` are gone. In their place, we have `brand_BMW`, `brand_Ford`, `brand_Toyota`, `status_New`, `status_Used`. No text left, only values a model can do math with!

The number columns (`year`, `mileage`, `price`) stayed the same — they didn't need encoding.
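
With more columns, typing the list of text columns by hand gets tedious. One possible approach, sketched below, is to ask pandas which columns are not numeric and pass that list to `pd.get_dummies()`. The `df_small` dataset and variable names are made up for this example:

```python
import pandas as pd

df_small = pd.DataFrame({
    "brand":  ["Ford", "Toyota", "BMW"],
    "status": ["Used", "Used", "New"],
    "year":   [2018, 2015, 2019],
    "price":  [18000, 12000, 35000],
})

# Columns that are NOT numeric are the ones that need encoding
text_columns = list(df_small.select_dtypes(exclude="number").columns)
print("Text columns:", text_columns)

# Encode exactly those columns
df_small_encoded = pd.get_dummies(df_small, columns=text_columns)
print("Columns after encoding:", list(df_small_encoded.columns))
```

It is still worth looking at the list before encoding. A column with hundreds of unique values would create hundreds of new columns.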

---
## 6. Walking Through n2: The Full Picture

Now let's see how encoding fits into the real ML pipeline. This follows the `n2_data_transformation` notebook.

The pipeline has 3 steps:
1. **Create the dataset** (or load from CSV)
2. **Encode** text columns to numbers
3. **Split** into features (X) and target (y)

In [None]:
import pandas as pd

# STEP 1: Create the dataset (same as n2)
data = {
    "brand":   ["Ford", "Toyota", "BMW", "Ford", "Toyota", "Tesla", "BMW", "Ford", "BMW", "Ford"],
    "year":    [2018, 2015, 2019, 2012, 2020, 2022, 2017, 2016, 2021, 2003],
    "mileage": [45000, 80000, 30000, 120000, 15000, 5000, 55000, 70000, 25000, 152000],
    "price":   [18000, 12000, 35000, 5000, 25000, 55000, 22000, 13000, 42000, 8000]
}

df = pd.DataFrame(data)
print("STEP 1 — Original data:")
print(df)
print()

In [None]:
# STEP 2: Encode the brand column
df_encoded = pd.get_dummies(df, columns=["brand"])

print("STEP 2 — After encoding:")
print(df_encoded)
print()
print("No text columns left!")
print("Columns:", list(df_encoded.columns))

In [None]:
# STEP 3: Split into X (features) and y (target)
X = df_encoded.drop("price", axis=1)  # Everything EXCEPT price
y = df_encoded["price"]                # Just the price

print("STEP 3 — Features (X):")
print(X)
print()
print("Target (y):")
print(y)
print()
print(f"X has {X.shape[0]} rows and {X.shape[1]} columns (features).")
print(f"y has {len(y)} values (prices to predict).")

This is the complete data preparation pipeline:

**Raw data** → **Encode text** → **Split X and y** → Ready for Machine Learning!
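
The encode and split steps can also be wrapped in one reusable function. This is a sketch under assumptions: `prepare_features` is a hypothetical helper, not something defined in the n2 notebook, and the small `cars_demo` dataset is made up.

```python
import pandas as pd


def prepare_features(df, target, text_columns):
    """Hypothetical helper: encode text columns, then split into X and y."""
    encoded = pd.get_dummies(df, columns=text_columns)
    X = encoded.drop(columns=[target])  # Everything except the target
    y = encoded[target]                 # Just the target
    return X, y


cars_demo = pd.DataFrame({
    "brand": ["Ford", "Toyota", "BMW", "Ford"],
    "year":  [2018, 2015, 2019, 2012],
    "price": [18000, 12000, 35000, 5000],
})

X_demo, y_demo = prepare_features(cars_demo, target="price", text_columns=["brand"])
print("X shape:", X_demo.shape)
print("y shape:", y_demo.shape)
```

Here `X_demo` has 4 rows and 4 columns (`year` plus three brand columns), and `y_demo` has the 4 prices.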

Next session, we'll add one more step before training: splitting into train and test sets.

---
## 7. Quick Reference

| What | Code | When to use |
|------|------|------------|
| Our manual version | `get_dummies(list, name)` | To understand how it works |
| Pandas shortcut | `pd.get_dummies(df, columns=[...])` | In real projects |
| Check column types | `df.dtypes` | To find which columns are text |
| Count unique values | `df["col"].nunique()` | To see how many categories exist |
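
The last row of the table is handy for predicting how wide the encoded table will be. A minimal sketch with a made-up `df_ref` dataset:

```python
import pandas as pd

df_ref = pd.DataFrame({
    "fuel": ["Gas", "Diesel", "Gas", "Electric"],
    "year": [2020, 2018, 2022, 2019],
})

# How many categories does the text column have?
n_categories = df_ref["fuel"].nunique()

df_ref_encoded = pd.get_dummies(df_ref, columns=["fuel"])

# One text column is removed and n_categories new columns are added
expected_columns = df_ref.shape[1] - 1 + n_categories

print("Unique fuel types:", n_categories)
print("Columns after encoding:", df_ref_encoded.shape[1])
print("Columns we expected:", expected_columns)
```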

---
---
# CHALLENGES

### Challenge 1: Encode by Hand (on Paper, then in Code)

Given this list of colors:

```python
colors = ["Red", "Blue", "Red", "Green", "Blue"]
```

**First**, on paper (or as a comment), write out the one-hot encoded table:
- How many unique colors are there?
- What are the column names?
- What are the 1s and 0s for each row?

**Then** verify your answer by creating a DataFrame and using `pd.get_dummies()`.

In [None]:
# Write your paper answer as comments first:
# Unique colors: ???
# Columns: ???
#
# Row 0 (Red):   ???
# Row 1 (Blue):  ???
# Row 2 (Red):   ???
# Row 3 (Green): ???
# Row 4 (Blue):  ???

# Now verify with code:
import pandas as pd

# YOUR CODE HERE


### Challenge 2: Use Our get_dummies Function

Use the `get_dummies()` function from `some_utilities.py` to encode this list of fuel types:

```python
fuels = ["Gas", "Diesel", "Electric", "Gas", "Gas", "Diesel"]
```

1. Call `get_dummies(fuels, "fuel")`
2. Convert the result to a DataFrame and print it
3. Compare: how many columns did it create? Does that match the number of unique fuel types?

In [None]:
import sys
sys.path.append("../notebooks")
from some_utilities import get_dummies

fuels = ["Gas", "Diesel", "Electric", "Gas", "Gas", "Diesel"]

# YOUR CODE HERE


### Challenge 3: Full Encoding Pipeline

Given this dataset:

```python
cars = {
    "brand":  ["Honda", "Ford", "Toyota", "Honda", "Ford"],
    "fuel":   ["Gas", "Gas", "Electric", "Diesel", "Gas"],
    "year":   [2020, 2018, 2022, 2019, 2016],
    "price":  [22000, 15000, 45000, 24000, 12000]
}
```

1. Create a DataFrame
2. Check which columns are text (`df.dtypes`)
3. Encode **both** text columns using `pd.get_dummies()`
4. Print the result — how many columns are there now?
5. Create X (drop price) and y (just price) from the encoded DataFrame

In [None]:
import pandas as pd

cars = {
    "brand":  ["Honda", "Ford", "Toyota", "Honda", "Ford"],
    "fuel":   ["Gas", "Gas", "Electric", "Diesel", "Gas"],
    "year":   [2020, 2018, 2022, 2019, 2016],
    "price":  [22000, 15000, 45000, 24000, 12000]
}

# YOUR CODE HERE


### Challenge 4: Why Not Just Use Numbers?

This is a thinking challenge.

Imagine someone says: "Instead of one-hot encoding, just replace Ford=1, Toyota=2, BMW=3, Tesla=4."

Write your answers as comments:
1. What would the AI think about the relationship between Ford (1) and Tesla (4)?
2. What would the AI calculate for BMW (3) minus Ford (1)?
3. Why is one-hot encoding better for brands?
4. Can you think of a column where numbering (1, 2, 3) WOULD make sense? (Hint: think about something that has a natural order.)

In [None]:
# Write your answers as comments:
# 1. 
# 2. 
# 3. 
# 4. 


### Challenge 5: Encode the Real Dataset

Load the real CSV file and explore which columns need encoding:

1. Load `../data/usedcarprices_sujayr_train.csv`
2. Print the column types (`.dtypes`) — which columns are `object` (text)?
3. For each text column, print how many unique values it has (`.nunique()`)
4. Encode only the `Fuel_Type` column using `pd.get_dummies()`
5. Print the shape before and after encoding — how many columns were added?

**Bonus question:** The `Name` column has many unique values. Would it make sense to one-hot encode it? Why or why not?

In [None]:
import pandas as pd

real_data = pd.read_csv("../data/usedcarprices_sujayr_train.csv")

# YOUR CODE HERE
