<a href="https://colab.research.google.com/github/wasiqs-classics/Code-Camp-Python-for-Data-Science-and-Machine-Learning/blob/master/ML_Project_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Make sure it's all numerical
In this project, we will learn how to convert categorical data into numeric data and how to fill in missing values using pandas and Scikit-Learn. These ninja techniques are prerequisites for almost every machine learning project, and you will find yourself using them frequently. Let's dive in!

**Computers love numbers!**

*   So one thing you'll often have to make sure of is that your datasets are in numerical form.
*   This even goes for datasets which contain non-numerical features that you may want to include in a model.

Let's figure it out.

First, we'll import the `car-sales-extended.csv` dataset.

**Use this link:**

https://raw.githubusercontent.com/wasiqs-classics/Code-Camp-Python-for-Data-Science-and-Machine-Learning/refs/heads/master/car-sales-extended.csv


Now let's do the maths!






In [None]:
# Standard imports first!

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import sklearn

In [None]:
# Import car-sales-extended.csv

car_sales = pd.read_csv("https://raw.githubusercontent.com/wasiqs-classics/Code-Camp-Python-for-Data-Science-and-Machine-Learning/refs/heads/master/car-sales-extended.csv")
car_sales

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043
...,...,...,...,...,...
995,Toyota,Black,35820,4,32042
996,Nissan,White,155144,3,5716
997,Nissan,Blue,66604,4,31570
998,Honda,White,215883,4,4001


Now if we look at the data, we have 1000 rows and 5 columns! But but but ... our dataset contains non-numerical data.
Check the `Make` and `Colour` columns!

We can check the dataset types with `.dtypes`.

In [None]:
car_sales.dtypes

Unnamed: 0,0
Make,object
Colour,object
Odometer (KM),int64
Doors,int64
Price,int64


Notice the `Make` and `Colour` features are of `dtype=object` (*they're strings*) whereas the rest of the columns are of `dtype=int64`.

If we want to use the **Make** and **Colour** features in our model, we'll need to figure out how to turn them into numerical form. Because otherwise they will produce errors!
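
A quick way to list the columns that still need encoding is to ask pandas for every non-numeric column. The sketch below uses a tiny hand-made DataFrame (`toy_cars` is a made-up stand-in for `car_sales`, not the real dataset):

```python
import pandas as pd

# Hypothetical miniature version of the car sales data
toy_cars = pd.DataFrame({
    "Make": ["Honda", "BMW"],
    "Colour": ["White", "Blue"],
    "Odometer (KM)": [35431, 192714],
    "Doors": [4, 5],
})

# Select every column whose dtype is not numeric (here: the string columns)
non_numeric_columns = toy_cars.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)
```

Running the same `select_dtypes` call on `car_sales` should point at `Make` and `Colour`.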

Let's try this data directly on **RandomForestRegressor** Model.

In [None]:
# First import train / testing split module
from sklearn.model_selection import train_test_split

# Now Split into X & y and train/test in 80/20 ratio
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
# Try to predict with random forest on price column
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: could not convert string to float: 'Honda'

Look at the **error**!

`ValueError: could not convert string to float: 'Honda'`

What does it say?

The error message `"ValueError: could not convert string to float: 'Honda'"` indicates that the **RandomForestRegressor** is encountering a string value ('Honda' in this case) in one of the features (columns) of your `X_train` DataFrame. **RandomForestRegressor**, like most machine learning models in scikit-learn, expects numerical input. This means it cannot directly handle categorical data like car makes (e.g., 'Honda', 'Toyota', 'Ford') that are represented as strings.

Looking at the traceback, the error occurs during the `model.fit(X_train, y_train)` call, which is when the model is being trained. It happens because one or more columns in `X_train` (your features) contain string values that need to be converted to numerical representations before the model can be trained.

So here is the thing again...

**Computers love numbers!**

Machine learning models prefer to work with numbers rather than text. So we'll have to convert the non-numerical features into numbers first.


# Encoding
The process of turning categorical features into numbers is often referred to as **encoding**.

Scikit-Learn has a fantastic in-depth guide on [Encoding categorical features](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features).

But today we will focus on one of the most common approaches, known as **one-hot encoding**.

In machine learning, one-hot encoding creates one new column per category. Each sample gets a value of 1 in the column that matches its category (the target value) and a value of 0 in all the other columns.

For example, let's say we had five samples and three car make options, Honda, Toyota, BMW.

And our samples were:

*   Honda
*   BMW
*   BMW
*   Toyota
*   Toyota

If we were to one-hot encode these, it would look like:

| Sample | Honda | Toyota | BMW |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 |
| 5 | 0 | 1 | 0 |

Notice how there's a 1 for each target value but a 0 for each other value.
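
To make the table concrete, here is a minimal sketch that reproduces it with `pd.get_dummies()` (a pandas function covered later in this notebook). The variable names are made up for illustration, and the column order is rearranged by hand to match the table:

```python
import pandas as pd

# The five example samples from the list above
samples = pd.Series(["Honda", "BMW", "BMW", "Toyota", "Toyota"], name="Make")

# One column per category, 1 where the sample matches and 0 elsewhere
one_hot_samples = pd.get_dummies(samples, dtype=int)

# Reorder the columns to match the table (get_dummies sorts them alphabetically)
one_hot_samples = one_hot_samples[["Honda", "Toyota", "BMW"]]

# Number the samples 1 to 5 like the table does
one_hot_samples.index = range(1, 6)
print(one_hot_samples)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.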

We can use the following steps to one-hot encode our dataset:

1. Import `sklearn.preprocessing.OneHotEncoder` to one-hot encode our features and `sklearn.compose.ColumnTransformer` to target the specific columns of our DataFrame to transform.
2. Define the categorical features we'd like to transform.
3. Create an instance of the `OneHotEncoder`.
4. Create an instance of `ColumnTransformer` and feed it the transforms we'd like to make.
5. Fit the instance of the `ColumnTransformer` to our data and transform it with the `fit_transform(X)` method.

> **Note:** In Scikit-Learn, the term "transformer" is often used to refer to something that transforms data.

Let's implement!

In [None]:
# 1. Import OneHotEncoder and ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# 2. Define the categorical features to transform
categorical_features = ["Make", "Colour", "Doors"]

# 3. Create an instance of OneHotEncoder
one_hot = OneHotEncoder()

# 4. Create an instance of ColumnTransformer
transformer = ColumnTransformer([("one_hot", # name
                                  one_hot, # transformer
                                  categorical_features)], # columns to transform
                                  remainder="passthrough") # what to do with the rest of the columns? ("passthrough" = leave unchanged)

# 5. Turn the categorical features into numbers (this returns a NumPy array or a SciPy sparse matrix, not a DataFrame)
transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

**Note:** You might be thinking why we considered Doors as a categorical variable. Which is a good question considering Doors is already numerical. Well, the answer is that Doors could be either numerical or categorical. However, I've decided to go with categorical, since where I'm from, number of doors is often a different category of car. For example, you can shop for 4-door cars or shop for 5-door cars (which always confused me since where's the 5th door?). However, you could experiment with treating this value as numerical or categorical, training a model on each, and then see how each model performs.

Woah! Looks like our samples are all numerical, what did our data look like previously?

In [None]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


It seems `OneHotEncoder` and `ColumnTransformer` have turned all of our data samples into numbers.

Let's check out the first transformed sample.

In [None]:
# View first transformed sample
transformed_X[0]


array([0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00,
       1.0000e+00, 0.0000e+00, 3.5431e+04])

And what were these values originally?

In [None]:
# View original first sample
X.iloc[0]

Unnamed: 0,0
Make,Honda
Colour,White
Odometer (KM),35431
Doors,4


# Numerically encoding data with pandas
Another way we can numerically encode data is directly with pandas.

We can use the `pandas.get_dummies()` function (`pd.get_dummies()` for short) and pass it our target columns.

In return, we'll get a one-hot encoded version of our target columns.

Let's remind ourselves of what our DataFrame looks like.

In [None]:
# View head of original DataFrame
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


Now let's use `pd.get_dummies()` to turn our categorical variables into one-hot encoded variables.

In [None]:
# One-hot encode categorical variables
categorical_variables = ["Make", "Colour", "Doors"]
dummies = pd.get_dummies(data=car_sales[categorical_variables])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,False,True,False,False,False,False,False,False,True
1,5,True,False,False,False,False,True,False,False,False
2,4,False,True,False,False,False,False,False,False,True
3,4,False,False,False,True,False,False,False,False,True
4,3,False,False,True,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...
995,4,False,False,False,True,True,False,False,False,False
996,3,False,False,True,False,False,False,False,False,True
997,4,False,False,True,False,False,True,False,False,False
998,4,False,True,False,False,False,False,False,False,True


Nice!

Notice how there's a new column for each categorical option (e.g. `Make_BMW`, `Make_Honda`, etc).

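But notice how it left the `Doors` column unchanged?

This is because `Doors` is already numeric, and by default `pd.get_dummies()` only encodes columns with an object, string or category dtype. To have it encode `Doors` as well, we can change the column to type `object`.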

By default, `pd.get_dummies()` also turns all of the values to bools (True or False). We can get the returned values as 0 or 1 by setting `dtype=float`.

In [None]:
# Have to convert doors to object for dummies to work on it...
car_sales["Doors"] = car_sales["Doors"].astype(object)
dummies = pd.get_dummies(data=car_sales[["Make", "Colour", "Doors"]],
                         dtype=float)
dummies

Unnamed: 0,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White,Doors_3,Doors_4,Doors_5
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


**Woohoo!**

We've now turned our data into fully numeric form using *Scikit-Learn* and *pandas*.

Now you might be wondering...

**Should you use Scikit-Learn or pandas for turning data into numerical form?**

And the answer is either.

But as a rule of thumb:

* If you're performing *quick data analysis and running small modelling experiments*, use **pandas** as it's generally quite fast to get up and running.
* If you're performing a *larger scale modelling experiment* or would like to put your data processing steps into a production pipeline, I'd recommend leaning towards **Scikit-Learn**, specifically a [Scikit-Learn Pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline) (chaining together multiple estimator/modelling steps).
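
As a rough illustration of the second option, the sketch below chains the one-hot encoding step and the model into a single `Pipeline`. It runs on synthetic data generated on the spot (the `toy_cars` DataFrame, its made-up price formula and all the variable names are assumptions for illustration, not the real dataset), so the printed score says nothing about real car prices:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (NOT the real car sales dataset)
rng = np.random.default_rng(42)
n_samples = 200
toy_cars = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Toyota", "Nissan"], size=n_samples),
    "Colour": rng.choice(["White", "Blue", "Black"], size=n_samples),
    "Odometer (KM)": rng.integers(10_000, 250_000, size=n_samples),
    "Doors": rng.choice([3, 4, 5], size=n_samples),
})
# Made-up price: falls as the odometer reading rises, plus some noise
toy_cars["Price"] = (30_000
                     - 0.08 * toy_cars["Odometer (KM)"]
                     + rng.normal(0, 1_000, size=n_samples))

X_toy = toy_cars.drop("Price", axis=1)
y_toy = toy_cars["Price"]
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

# Step 1: one-hot encode the categorical columns, leave the rest unchanged
preprocessor = ColumnTransformer(
    [("one_hot", OneHotEncoder(handle_unknown="ignore"), ["Make", "Colour", "Doors"])],
    remainder="passthrough",
)

# Step 2: chain the preprocessing and the model together
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(random_state=42)),
])

# The pipeline takes the raw DataFrame and encodes it internally
pipeline.fit(X_train_toy, y_train_toy)
toy_score = pipeline.score(X_test_toy, y_test_toy)
print(toy_score)
```

A nice side effect of this design is that the encoder is fitted on the training data only and then reused unchanged on the test data.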

Since we've turned our data into numerical form, how about we try and fit our model again?

Let's recreate a train/test split except this time we'll use transformed_X instead of X.

In [None]:
np.random.seed(42)

# Create train and test splits with transformed_X
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    test_size=0.2)

# Create the model instance
model = RandomForestRegressor()

# Fit the model on the numerical data (this errored before since our data wasn't fully numeric)
model.fit(X_train, y_train)

# Score the model (returns r^2 metric by default, also called coefficient of determination, higher is better)
model.score(X_test, y_test)

0.3235867221569877

**What is the r^2 metric?**

The **r^2 metric**, also known as the **coefficient of determination**, is a statistical measure that indicates how well a regression model fits the observed data. It essentially tells you the proportion of the variance in the dependent variable (in your case, 'Price') that is predictable from the independent variables (the features like 'Make', 'Colour', 'Doors', 'Odometer (KM)', etc.).

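**How do we read it?**

* The best possible value is 1, which indicates that the model perfectly explains all the variability of the response data around its mean.
* A value of 0 means the model doesn't explain any of the variability of the response data around its mean (it does no better than always predicting the mean of the targets).
* On unseen data the value can also be negative, which means the model does worse than always predicting the mean.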

**Implementation**
1. `model.score(X_test, y_test)`: This line of code calculates the r^2 score of your trained RandomForestRegressor model on the test data (`X_test, y_test`).
2. The `score` method for regression models in scikit-learn returns the r^2 score by default.
3. **Interpretation:** The higher the r^2 score, the better your model is at predicting the 'Price' of cars based on the given features. A higher score indicates that a larger proportion of the variance in 'Price' is explained by your model.

**In Simple Terms:**

Imagine you're trying to draw a line through a scatter plot of data points. The `r^2 value` tells you how well that line fits the data. If the line goes perfectly through all the points, the r^2 value would be 1 (perfect fit). If the line is completely random and doesn't follow the data at all, the r^2 value would be closer to 0 (poor fit).

**The r^2 score is a measure of how well your RandomForestRegressor model can predict car prices based on the features you've provided.** A higher score means a better prediction capability.
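
For the curious, the score can be computed by hand as `1 - (sum of squared errors) / (sum of squared differences from the mean)`. The sketch below checks this against `sklearn.metrics.r2_score` on a few made-up prices (all the numbers and variable names are invented for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true prices and two sets of made-up predictions
y_true = np.array([10_000.0, 15_000.0, 20_000.0, 25_000.0])
y_good = np.array([12_000.0, 14_000.0, 19_000.0, 27_000.0])  # close to the truth
y_bad = np.array([25_000.0, 20_000.0, 15_000.0, 10_000.0])   # gets the order backwards

# r^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_good) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2)                 # 0.92
print(r2_score(y_true, y_good))  # same value from Scikit-Learn
print(r2_score(y_true, y_bad))   # -3.0, worse than predicting the mean
```

The last line shows why a score is not guaranteed to stay between 0 and 1: predictions that are worse than simply guessing the mean produce a negative value.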

# Dealing with Missing Values in our Dataset
Holes in the data means holes in the patterns your machine learning model can learn.

Many machine learning models don't work well or produce errors when they're used on datasets with missing values.

A missing value can appear as a blank, as a `NaN` or something similar.

There are two main options when dealing with missing values:

1. **Fill them with some given or calculated value (imputation)** - For example, you might fill missing values of a numerical column with the mean of all the other values. The practice of calculating or figuring out how to fill missing values in a dataset is called imputing. For a great resource on imputing missing values, I'd recommend referring to the [Scikit-Learn user guide](https://scikit-learn.org/stable/modules/impute.html).
2. **Remove them** - If a row or sample has missing values, you may opt to remove it from your dataset completely. However, this potentially results in using less data to build your model.

> **Note:** Dealing with missing values differs from problem to problem, meaning there's no 100% best way to fill missing values across datasets and problem types. It will often take careful experimentation and practice to figure out the best way to deal with missing values in your own datasets.
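
Here is a minimal sketch of both options on a three-row made-up DataFrame (`toy` and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing odometer reading and one missing price
toy = pd.DataFrame({
    "Odometer (KM)": [10_000.0, np.nan, 30_000.0],
    "Price": [20_000.0, 15_000.0, np.nan],
})

# Option 1: fill (impute) the missing odometer value with the column mean
filled = toy.fillna({"Odometer (KM)": toy["Odometer (KM)"].mean()})

# Option 2: remove every row that has at least one missing value
dropped = toy.dropna()

print(filled)
print(dropped)
```

Filling keeps all three rows (the missing `Price` is still there because only `Odometer (KM)` was filled), while dropping leaves just the one complete row.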

To practice dealing with missing values, let's import a version of the `car_sales` dataset with several missing values (namely `car-sales-extended-missing-data.csv`).

**Use this link:** https://raw.githubusercontent.com/wasiqs-classics/Code-Camp-Python-for-Data-Science-and-Machine-Learning/refs/heads/master/car-sales-extended-missing-data.csv



In [None]:
# Import car sales dataframe with missing values
car_sales_missing = pd.read_csv("https://raw.githubusercontent.com/wasiqs-classics/Code-Camp-Python-for-Data-Science-and-Machine-Learning/refs/heads/master/car-sales-extended-missing-data.csv")
car_sales_missing.head(10)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
5,Honda,Red,42652.0,4.0,23883.0
6,Toyota,Blue,163453.0,4.0,8473.0
7,Honda,White,,4.0,20306.0
8,,White,130538.0,4.0,9374.0
9,Honda,Blue,51029.0,4.0,26683.0


Notice the `NaN` value in *row 7* for the `Odometer (KM)` column, that means pandas has detected a missing value there.

However, if your dataset is large, it's likely you aren't going to go through it sample by sample to find the missing values.

Luckily, pandas has a method called **pd.DataFrame.isna()** which is able to detect missing values.

Let's try it on our DataFrame.

In [None]:
# Get the sum of all missing values
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,49
Colour,50
Odometer (KM),50
Doors,50
Price,50


**Hmm...** seems there's about 50 or so missing values per column.

How about we try and split the data into features and labels, then convert the categorical data to numbers, then split the data into training and test and then try and fit a model on it (*just like we did before*)?

In [None]:
# Create features
X_missing = car_sales_missing.drop("Price", axis=1)
print(f"Number of missing X values:\n{X_missing.isna().sum()}")

Number of missing X values:
Make             49
Colour           50
Odometer (KM)    50
Doors            50
dtype: int64


In [None]:
# Create labels
y_missing = car_sales_missing["Price"]
print(f"Number of missing y values: {y_missing.isna().sum()}")

Number of missing y values: 50


Now we can convert the categorical columns into one-hot encodings (just as before).

In [None]:
# Let's convert the categorical columns to one hot encoded (code copied from above)
# Turn the categories (Make, Colour and Doors) into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()

transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder="passthrough",
                                sparse_threshold=0) # 0 = always return a dense NumPy array (never a sparse matrix)

transformed_X_missing = transformer.fit_transform(X_missing)
transformed_X_missing

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 2.48360e+05]])

Finally, let's split the missing data samples into **train and test sets** and then try to fit and score a model on them.

In [None]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(transformed_X_missing,
                                                    y_missing,
                                                    test_size=0.2)

# Fit and score a model
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: Input y contains NaN.

`ValueError: Input y contains NaN.`

The error "ValueError: Input y contains NaN" arises because the target variable (`y_train` in this case) that you are using to train your `RandomForestRegressor` model contains missing values (NaN). `RandomForestRegressor`, like many machine learning models, cannot handle missing values in the target variable during training.

Looks like if we want to use RandomForestRegressor, we'll have to either fill or remove the missing values.

> **Note:** Scikit-Learn does have a list of models which can handle NaNs or missing values in the features directly, such as `sklearn.ensemble.HistGradientBoostingClassifier` or `sklearn.ensemble.HistGradientBoostingRegressor`.

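As an experiment, you may want to try the following (on your own!). Keep in mind that these models accept missing values in the features only, so the samples without a `Price` label still have to be removed first:


```python
from sklearn.ensemble import HistGradientBoostingRegressor

# NaNs are allowed in the features (X) but not in the labels (y),
# so keep only the samples that have a Price
has_price = y_missing.notna().to_numpy()
X_train, X_test, y_train, y_test = train_test_split(transformed_X_missing[has_price],
                                                    y_missing[has_price],
                                                    test_size=0.2)

# Try a model that can handle NaNs natively
nan_model = HistGradientBoostingRegressor()
nan_model.fit(X_train, y_train)
nan_model.score(X_test, y_test)
```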

Anyways, we had some missing values... let's see them again!



In [None]:
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,49
Colour,50
Odometer (KM),50
Doors,50
Price,50


How can we fill (impute) or remove these?

# Fill missing data with pandas

Let's see how we might fill missing values with pandas.

For categorical values, one of the simplest ways is to fill the missing fields with the string `"missing"`.

We could do this for the `Make` and `Colour` features.

As for the `Doors` feature, we could use `"missing"` or we could fill it with the most common option of 4.

With the `Odometer (KM)` feature, we can use the mean value of all the other values in the column.

And finally, for those samples which are missing a `Price` value, we can remove them (since `Price` is the target value, removing probably causes less harm than imputing, however, you could design an experiment to test this).

In summary:

|Column/Feature|Fill missing value with|
|--------------|-----------------------|
|Make|"missing"|
|Colour|"missing"|
|Doors|4 (most common value)|
|Odometer (KM)|mean of Odometer (KM)|
|Price (target)|NA, remove samples missing `Price`|



> Note: The practice of filling missing data with given or calculated values is called [imputation](https://scikit-learn.org/stable/modules/impute.html). And it's important to remember there's no perfect way to fill missing data (unless it's with data that should've actually been there in the first place). The methods we're using are only one of many. The techniques you use will depend heavily on your dataset. A good place to look would be searching for "data imputation techniques".

Let's start with the `Make` column.

We can use the pandas method `fillna(value="missing")` and assign the result back to the column to fill all the missing values with the string `"missing"`.


In [None]:
# Fill the missing values in the Make column
# Note: In previous versions of pandas, inplace=True was possible, however this will be changed in a future version, can use reassignment instead.
# car_sales_missing["Make"].fillna(value="missing", inplace=True)

car_sales_missing["Make"] = car_sales_missing["Make"].fillna(value="missing")

And we can do the same with the `Colour` column.

In [None]:
# Note: In previous versions of pandas, inplace=True was possible, however this will be changed in a future version, can use reassignment instead.
# car_sales_missing["Colour"].fillna(value="missing", inplace=True)

# Fill the Colour column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna(value="missing")

How many missing values do we have now?

In [None]:
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,0
Colour,0
Odometer (KM),50
Doors,50
Price,50


Wonderful! We're making some progress.

Now let's fill the `Doors` column with 4 (the most common value). For this column, that is the same as filling it with the median or the mode.

In [None]:
# Find the most common value of the Doors column
car_sales_missing["Doors"].value_counts()

Doors,count
4.0,811
5.0,75
3.0,64


In [None]:
# Fill the Doors column with the most common value
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(value=4)
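
Rather than reading the most common value off the `value_counts()` output and typing it in by hand, it can also be looked up in code with `Series.mode()`. A minimal sketch on a made-up column (`toy_doors` is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical Doors column with two missing values
toy_doors = pd.Series([4.0, 4.0, 5.0, np.nan, 3.0, 4.0, np.nan], name="Doors")

# mode() returns a Series (there can be ties), so take the first entry
most_common = toy_doors.mode()[0]

# Fill the gaps with the most common value
toy_doors_filled = toy_doors.fillna(most_common)
print(most_common)
print(toy_doors_filled.isna().sum())
```

This keeps the fill value in sync with the data if the dataset changes later.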

Next, we'll fill the `Odometer (KM)` column with the mean value of itself.

In [None]:
# Fill the Odometer (KM) column
# Old: car_sales_missing["Odometer (KM)"].fillna(value=car_sales_missing["Odometer (KM)"].mean(), inplace=True)

car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(value=car_sales_missing["Odometer (KM)"].mean())

How many missing values do we have now?

In [None]:
# Check the number of missing values
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,0
Colour,0
Odometer (KM),0
Doors,0
Price,50


**Woohoo!** That's looking a lot better.

Finally, we can remove the rows which are missing the target value `Price`.

> **Note:** Another option would be to impute the Price value with the mean or median or some other calculated value (such as by using similar cars to estimate the price), however, to keep things simple and prevent introducing too many fake labels to the data, we'll remove the samples missing a Price value.

We can remove rows with missing values in place from a pandas DataFrame with the `pandas.DataFrame.dropna(inplace=True)` method.

In [None]:
# Remove rows with missing Price labels
car_sales_missing.dropna(inplace=True)

There should be no more missing values!

In [None]:
# Check the number of missing values
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,0
Colour,0
Odometer (KM),0
Doors,0
Price,0


Since we removed samples missing a `Price` value, there are now fewer samples overall but none of them have missing values.

In [None]:
# Check the number of total samples (previously was 1000)
len(car_sales_missing)

950

**Can we fit a model now?**

Let's try! **Create-Convert-Split-Fit**

* First we'll **create** the features and labels.

* Then we'll **convert** categorical variables into numbers via `one-hot encoding`.

* Then we'll **split** the data into training and test sets just like before.

* Finally, we'll try to **fit** a `RandomForestRegressor()` model to the newly filled data.

In [None]:
# Create features
X_missing = car_sales_missing.drop("Price", axis=1)
print(f"Number of missing X values:\n{X_missing.isna().sum()}")

# Create labels
y_missing = car_sales_missing["Price"]
print(f"Number of missing y values: {y_missing.isna().sum()}")

Number of missing X values:
Make             0
Colour           0
Odometer (KM)    0
Doors            0
dtype: int64
Number of missing y values: 0


In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()

transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder="passthrough",
                                sparse_threshold=0) # 0 = always return a dense NumPy array (never a sparse matrix)

transformed_X_missing = transformer.fit_transform(X_missing)
transformed_X_missing

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [None]:
# Split data into training and test sets
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X_missing,
                                                    y_missing,
                                                    test_size=0.2)

# Fit and score a model
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.22011714008302485

**Fantastic!!!**

Looks like filling the missing values with pandas worked!

Our model can be fit to the data without issues.

# Filling missing data and transforming categorical data with Scikit-Learn

Now we've filled the missing columns using pandas functions, you might be thinking, "Why pandas? I thought this was a Scikit-Learn introduction?".

Not to worry, Scikit-Learn provides a class called `sklearn.impute.SimpleImputer()` which allows us to do a similar thing.

**SimpleImputer()** transforms data by filling missing values with a given strategy parameter.

And we can use it to fill the missing values in our DataFrame as above.

But remember, at the moment, our DataFrame has no missing values.

Let's reimport it so it has missing values and we can fill them with Scikit-Learn.

In [None]:
# Reimport the DataFrame (so that all the missing values are back)
car_sales_missing = pd.read_csv("https://raw.githubusercontent.com/wasiqs-classics/Code-Camp-Python-for-Data-Science-and-Machine-Learning/refs/heads/master/car-sales-extended-missing-data.csv")
car_sales_missing.isna().sum()

To begin, we'll remove the rows which are missing a `Price` value.

In [None]:
# Drop the rows with missing in the Price column
car_sales_missing.dropna(subset=["Price"], inplace=True)

Now there are no rows missing a `Price` value.

In [None]:
car_sales_missing.isna().sum()

Since we don't have to fill any `Price` values, let's split our data into features (X) and labels (y).

We'll also split the data into training and test sets.

In [None]:
# Split into X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

# Split data into train and test
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2)

> **Note:** We've split the data into train & test sets here first to perform filling missing values on them separately. This is best practice as the test set is supposed to emulate data the model has never seen before. For categorical variables, it's generally okay to fill values across the whole dataset. However, for numerical variables, you should **only fill values on the test set that have been computed from the training set**.

Training and test sets created!

Let's now setup a few instances of `SimpleImputer()` to fill various missing values.

We'll use the following strategies and fill values:

* For categorical columns (`Make`, `Colour`), `strategy="constant"`, `fill_value="missing"` (fill the missing samples with a consistent value of `"missing"`).
* For the `Doors` column, `strategy="constant"`, `fill_value=4` (fill the missing samples with a consistent value of `4`).
* For the numerical column (`Odometer (KM)`), `strategy="mean"` (fill the missing samples with the mean of the target column).
* Note: There are *more strategies* and fill options in the `SimpleImputer()` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html).
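
To get a feel for what each strategy does, here is a small sketch on a made-up three-row DataFrame (`toy_X` and its values are invented for illustration). It also tries `strategy="most_frequent"` for `Doors`, one of the extra strategies from the documentation, as an alternative to hard-coding `4`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
toy_X = pd.DataFrame({
    "Make": pd.Series(["Honda", np.nan, "BMW"], dtype=object),
    "Doors": [4.0, np.nan, 4.0],
    "Odometer (KM)": [10_000.0, 30_000.0, np.nan],
})

# Constant fill for the categorical column
make_filled = SimpleImputer(strategy="constant",
                            fill_value="missing").fit_transform(toy_X[["Make"]])

# Most frequent value for the Doors column
doors_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy_X[["Doors"]])

# Mean for the numerical column
km_filled = SimpleImputer(strategy="mean").fit_transform(toy_X[["Odometer (KM)"]])

print(make_filled.ravel())
print(doors_filled.ravel())
print(km_filled.ravel())
```

Each call returns a NumPy array with the gaps filled: `"missing"` for `Make`, `4.0` for `Doors` and `20000.0` (the mean of the two known readings) for `Odometer (KM)`.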

In [None]:
from sklearn.impute import SimpleImputer

# Create categorical variable imputer
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")

# Create Door column imputer
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

# Create Odometer (KM) column imputer
num_imputer = SimpleImputer(strategy="mean")

Imputers created!

Now let's define which columns we'd like to impute on.

Why?

Because we'll need these shortly (I'll explain in the next text cell).

In [None]:
# Define different column features
categorical_features = ["Make", "Colour"]
door_feature = ["Doors"]
numerical_feature = ["Odometer (KM)"]

Columns defined!

Now how might we transform our columns?

Hint: we can use the `sklearn.compose.ColumnTransformer` class from Scikit-Learn, in a similar way to what we did before to get our data to all numeric values.

That's the reason we defined the columns we'd like to transform.

So we can use the `ColumnTransformer()` class.

`ColumnTransformer()` takes as input a list of tuples in the form `(name_of_transform, transformer_to_use, columns_to_transform)` specifying which columns to transform and how to transform them.

For example:

```python
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, categorical_features)
])
```

In this case, the variables in the tuple are:

* `name_of_transform` = `"cat_imputer"`
* `transformer_to_use` = `cat_imputer` (the instance of `SimpleImputer()` we defined above)
* `columns_to_transform` = `categorical_features` (the list of categorical features we defined above).

Let's expand upon this by extending the example.

In [None]:
from sklearn.compose import ColumnTransformer

# Create series of column transforms to perform
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, categorical_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, numerical_feature)])

Nice!

The next step is to fit our `ColumnTransformer()` instance (`imputer`) to the training data and transform the testing data.

In other words we want to:

1. Learn the imputation values from the training set.
2. Fill the missing values in the training set with the values learned in 1.
3. Fill the missing values in the testing set with the values learned in 1.

Why this way?

In our case, the only value we're calculating is the mean of the `Odometer (KM)` column (the other fill values are constants). Still, remember that the test set should always remain unseen data.

So **when filling values in the test set, only use values calculated from the training set.**

We can achieve steps 1 & 2 simultaneously with the `ColumnTransformer.fit_transform()` method (`fit` = find the values to fill, `transform` = fill them).

And then we can perform step 3 with the `ColumnTransformer.transform()` method (we only want to transform the test set, not learn different values to fill).
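
Here's a small sketch of these fit and transform steps on made-up data, so we can see the learned value directly. The names `toy_train`, `toy_test` and `toy_imputer` are hypothetical. `named_transformers_` and `statistics_` are attributes Scikit-Learn sets once the transformer has been fitted.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up train/test data for illustration (not the car sales data)
toy_train = pd.DataFrame({"Odometer (KM)": [100.0, 200.0, np.nan, 300.0]})
toy_test = pd.DataFrame({"Odometer (KM)": [np.nan, 1000.0]})

toy_imputer = ColumnTransformer([
    ("num_imputer", SimpleImputer(strategy="mean"), ["Odometer (KM)"])
])

# Steps 1 & 2: learn the mean from the training data and fill the training data
toy_train_filled = toy_imputer.fit_transform(toy_train)

# The learned fill value is stored on the fitted SimpleImputer
learned_mean = toy_imputer.named_transformers_["num_imputer"].statistics_[0]
print(learned_mean)

# Step 3: fill the test data with the training mean (no re-fitting)
toy_test_filled = toy_imputer.transform(toy_test)
print(toy_test_filled)
```

The missing test value is filled with `200.0` (the training mean), not a value influenced by the `1000.0` in the test set.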

In [None]:
# Find values to fill and transform training data
filled_X_train = imputer.fit_transform(X_train)

# Fill values in the test set with values learned from the training set
filled_X_test = imputer.transform(X_test)

# Check filled X_train
filled_X_train

Wonderful!

Let's now turn our `filled_X_train` and `filled_X_test` arrays into DataFrames to inspect their missing values.

In [None]:
# Get our transformed data arrays back into DataFrames
filled_X_train_df = pd.DataFrame(filled_X_train,
                                 columns=["Make", "Colour", "Doors", "Odometer (KM)"])

filled_X_test_df = pd.DataFrame(filled_X_test,
                                columns=["Make", "Colour", "Doors", "Odometer (KM)"])

# Check missing data in training set
filled_X_train_df.isna().sum()

And is there any missing data in the test set?

In [None]:
# Check missing data in the testing set
filled_X_test_df.isna().sum()

What about the original?

In [None]:
# Check to see the original... still missing values
car_sales_missing.isna().sum()

Perfect!

No more missing values in our filled training and test sets (the original DataFrame is left untouched)!

But wait...

Is our data all numerical?

In [None]:
filled_X_train_df.head()

Ahh... looks like our `Make` and `Colour` columns are still strings.

Let's **one-hot encode** them along with the `Doors` column to make sure they're numerical, just as we did previously.

In [None]:
# Now let's one hot encode the features with the same code as before
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()

transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder="passthrough",
                                sparse_threshold=0) # 0 = always return a dense array

# Fit the encoder on the training data, then transform train and test separately
transformed_X_train = transformer.fit_transform(filled_X_train_df)
transformed_X_test = transformer.transform(filled_X_test_df)

# Check transformed and filled X_train
transformed_X_train

Nice!

Now our data is:

1. All numerical
2. No missing values
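
One thing to watch for with the encoding step above: by default, `OneHotEncoder()` raises an error during `transform()` if the test data contains a category it never saw during `fit()`. Below is a minimal sketch of one way to handle that, using made-up colours and the `handle_unknown="ignore"` option. The toy DataFrames are hypothetical, and whether ignoring unknown categories is the right choice depends on your problem.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: "Green" appears only in the test set
toy_colour_train = pd.DataFrame({"Colour": ["Red", "Blue", "Red"]})
toy_colour_test = pd.DataFrame({"Colour": ["Blue", "Green"]})

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time
toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_encoder.fit(toy_colour_train)

# transform() returns a sparse matrix by default, so convert it for printing
encoded_test = toy_encoder.transform(toy_colour_test).toarray()
print(toy_encoder.categories_)
print(encoded_test)
```

The learned categories are `Blue` and `Red`, so the `Blue` row gets a `1` in the first column and the unseen `Green` row becomes all zeros.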

Let's try and fit a model!

In [None]:
# Now we've transformed X, let's see if we can fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

# Make sure to use the transformed data (filled and one-hot encoded X data)
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test, y_test)

You might have noticed this result is slightly different to before.

**Why do you think this is?**

It's because we've created our training and testing sets differently.

We split the data into training and test sets before filling the missing values.

Previously, we did the reverse, filled missing values before splitting the data into training and test sets.

Doing this can lead to information from the test set leaking into the training set, because the fill values (such as the mean) are calculated from data that includes the test samples.

Remember, one of the most important concepts in machine learning is making sure your model doesn't see any testing data before evaluation.
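
One way to make this separation harder to get wrong is Scikit-Learn's `Pipeline` class, which we haven't used in this project. The sketch below is only an illustration under assumptions: the dataset is small and synthetic (made-up values), and the step names (`"impute"`, `"one_hot"`, `"preprocess"`, `"regressor"`) are arbitrary labels. It bundles the imputing, encoding and model steps so that calling `fit()` only ever learns from the training data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Small synthetic dataset (made-up values, not the car sales data)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Toyota"], size=n).astype(object),
    "Odometer (KM)": rng.integers(10_000, 250_000, size=n).astype(float),
})
toy["Price"] = 30_000 - 0.08 * toy["Odometer (KM)"] + rng.normal(0, 1_000, size=n)

# Knock out some feature values to simulate missing data
toy.loc[toy.index[::10], "Make"] = np.nan
toy.loc[toy.index[5::10], "Odometer (KM)"] = np.nan

X_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

# Categorical columns: fill with "missing", then one-hot encode
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])

# Numerical columns: fill with the (training) mean
numerical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
])

preprocessor = ColumnTransformer([
    ("cat", categorical_pipeline, ["Make"]),
    ("num", numerical_pipeline, ["Odometer (KM)"]),
])

toy_model = Pipeline([
    ("preprocess", preprocessor),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=42)),
])

# fit() learns fill values and categories from the training data only
toy_model.fit(X_toy_train, y_toy_train)

# score() reuses those learned values to transform the test data
toy_score = toy_model.score(X_toy_test, y_toy_test)
print(toy_score)
```

Because every preprocessing step lives inside the pipeline, there's no separate moment where fill values could accidentally be calculated from the test set.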

We'll keep practicing but for now, some of the **main takeaways** are:

* Keep your training and test sets separate.
* Most datasets you come across won't be ready to use with machine learning models straight away, and some will take more preparation than others.
* For most machine learning models, your data has to be numerical. This will involve converting whatever you're working with into numbers. This process is often referred to as **feature engineering** or **feature encoding.**
* Some machine learning models aren't compatible with missing data. The process of filling missing data is referred to as **data imputation**.