# The Data Pit Stop (Data Transformation)

Before a race, we need to tune the engine.

In Machine Learning, we call this "Data Preprocessing".

We use the Pandas library to handle our data like a pro.

In [1]:
import pandas as pd

### 1. CREATING OUR DATASET (Mock version of a USA Cars Dataset)
Imagine this was a CSV file downloaded from Kaggle.

In [3]:
data = {
    'brand': ['Ford', 'Toyota', 'BMW', 'Ford', 'Toyota', 'Tesla', 'BMW', 'Ford', 'BMW', 'Ford'],
    'year': [2018, 2015, 2019, 2012, 2020, 2022, 2017, 2016, 2021, 2003],
    'mileage': [45000, 80000, 30000, 120000, 15000, 5000, 55000, 70000, 25000, 152000],
    'price': [18000, 12000, 35000, 5000, 25000, 55000, 22000, 13000, 42000, 8000]
}

# Create a DataFrame (Basically a spreadsheet/table)
df = pd.DataFrame(data)

print("--- 1. THE DATASET ---")
print(df)
print()

--- 1. THE DATASET ---
    brand  year  mileage  price
0    Ford  2018    45000  18000
1  Toyota  2015    80000  12000
2     BMW  2019    30000  35000
3    Ford  2012   120000   5000
4  Toyota  2020    15000  25000
5   Tesla  2022     5000  55000
6     BMW  2017    55000  22000
7    Ford  2016    70000  13000
8     BMW  2021    25000  42000
9    Ford  2003   152000   8000
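
If the data really did arrive as a CSV file, loading it might look roughly like the sketch below. The CSV text here is an in-memory stand-in built from the first three rows above (an assumption for illustration, not a real Kaggle file); with a real download you would pass the file path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Stand-in for a downloaded file: the first three rows of our mock dataset as CSV text
csv_text = """brand,year,mileage,price
Ford,2018,45000,18000
Toyota,2015,80000,12000
BMW,2019,30000,35000
"""

# pd.read_csv accepts a file path or any file-like object
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_csv)
print(df_from_csv.dtypes)  # brand is text (object), the rest are integers
```

Either way, the result is the same kind of DataFrame we built by hand with `pd.DataFrame(data)`.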



### 2. ENCODING: WORDS TO NUMBERS

AI models are like calculators; they don't understand the word "Ford".

We use "One-Hot Encoding": each brand gets its own column, which is True (1) for cars of that brand and False (0) for every other car.

In [4]:
df_encoded = pd.get_dummies(df, columns=['brand'])

print("--- 2. ENCODED DATA (Numbers only!) ---")
print(df_encoded.head())
print("\nNotice how 'brand_Ford' is True (1) if it's a Ford, and False (0) if it's not.")
print()

--- 2. ENCODED DATA (Numbers only!) ---
   year  mileage  price  brand_BMW  brand_Ford  brand_Tesla  brand_Toyota
0  2018    45000  18000      False        True        False         False
1  2015    80000  12000      False       False        False          True
2  2019    30000  35000       True       False        False         False
3  2012   120000   5000      False        True        False         False
4  2020    15000  25000      False       False        False          True

Notice how 'brand_Ford' is True (1) if it's a Ford, and False (0) if it's not.
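
Recent pandas versions return True/False columns from `get_dummies`, as in the output above. If you would rather see literal 0s and 1s, one option is the `dtype` parameter. The sketch below uses a small stand-in table (`cars` is a made-up name for illustration) rather than the full dataset.

```python
import pandas as pd

# Small stand-in for the brand column of the dataset above
cars = pd.DataFrame({'brand': ['Ford', 'Toyota', 'BMW', 'Ford'],
                     'year': [2018, 2015, 2019, 2012]})

# dtype=int asks get_dummies for 0/1 integers instead of True/False
cars_encoded = pd.get_dummies(cars, columns=['brand'], dtype=int)

print(cars_encoded)
```

Both forms carry the same information; models treat True as 1 and False as 0.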



In [None]:
def udai_get_dummies(df, column):
    # Exercise: build df_encoded here by one-hot encoding `column` yourself
    return df_encoded
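
One possible way to write a hand-made encoder like the stub above is sketched here. The name `manual_get_dummies` and the small `cars` table are hypothetical, chosen only for this illustration; the idea is simply to compare the text column against each unique value and store the result as a new column.

```python
import pandas as pd

def manual_get_dummies(df, column):
    """Hypothetical helper: one-hot encode `column` without pd.get_dummies."""
    # Start from a new DataFrame without the text column
    encoded = df.drop(columns=[column])
    # Add one True/False column per unique value, in sorted order
    for value in sorted(df[column].unique()):
        encoded[f"{column}_{value}"] = df[column] == value
    return encoded

cars = pd.DataFrame({'brand': ['Ford', 'Toyota', 'BMW', 'Ford'],
                     'year': [2018, 2015, 2019, 2012]})
print(manual_get_dummies(cars, 'brand'))
```

The original table is left untouched, because `drop` returns a new DataFrame instead of modifying `df`.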

### 3. SPLITTING THE DATA

We need to hold back some data that the AI never sees while training, so we can "test" it later.

X = Features (Year, Mileage, Brand)

y = Target (The Price we want to predict)


In [8]:
from sklearn.model_selection import train_test_split

X = df_encoded.drop('price', axis=1) # The "Questions"
y = df_encoded['price']              # The "Answers"

In [11]:
# Split: 80% to study (Train), 20% for the final exam (Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("--- 3. READY TO TRAIN ---")
print(f"Number of cars for the AI to study: {len(X_train)}")
print(f"Number of cars for the final exam: {len(X_test)}")

--- 3. READY TO TRAIN ---
Number of cars for the AI to study: 8
Number of cars for the final exam: 2
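
A quick sanity check on a split like this, as a sketch: no car should land in both sets, and re-running with the same `random_state` should give the same split. The names `X_demo` and `y_demo` are stand-ins made up for this example, using only the mileage and price values from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten stand-in rows; only the row labels matter for this check
X_demo = pd.DataFrame({'mileage': [45000, 80000, 30000, 120000, 15000,
                                   5000, 55000, 70000, 25000, 152000]})
y_demo = pd.Series([18000, 12000, 35000, 5000, 25000,
                    55000, 22000, 13000, 42000, 8000], name='price')

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# No car appears in both sets, and together they cover all ten rows
no_overlap = set(X_tr.index).isdisjoint(X_te.index)
covers_all = sorted(X_tr.index.tolist() + X_te.index.tolist()) == list(range(10))

# Same random_state -> same split when the cell is re-run
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = X_te.index.equals(X_te2.index)

print(no_overlap, covers_all, same_split)
```

Fixing `random_state` is what makes the "final exam" repeatable; leave it out and each run shuffles the cars differently.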


In [13]:
X_test

   year  mileage  brand_BMW  brand_Ford  brand_Tesla  brand_Toyota
0  2018    45000      False        True        False         False
9  2003   152000      False        True        False         False
