## CONVERT STRING OR OBJECT TO NUMERICAL VALUES.

### SINCE MODEL ONLY UNDERSTANDS NUMBERS

LOAD MISSING DATA AND TRY TO FIT TO MODEL

In [2]:
import pandas as pd
data = pd.read_csv("./car-sales-extended-missing-data.csv")
data.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [5]:
data.dtypes

Make              object
Colour            object
Odometer (KM)    float64
Doors            float64
Price            float64
dtype: object

In [10]:
X = data.drop('Price', axis=1)
Y = data['Price']


Make and Colour are strings, but the model needs numerical values such as 0 or 1. There are two ways to convert them:

1. Scikit-Learn

2. Pandas

Should you use Scikit-Learn or pandas for turning data into numerical form?

And the answer is either.

But as a rule of thumb:

1. If you're performing quick data analysis and running small modelling experiments, use pandas, as it's generally quick to get up and running.
2. If you're performing a larger-scale modelling experiment or would like to put your data processing steps into a production pipeline, I'd recommend leaning towards Scikit-Learn, specifically a Scikit-Learn Pipeline (chaining together multiple estimator/modelling steps).

In [17]:
# 1. Scikit-Learn
# 1. Import OneHotEncoder and ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# 2. Create an instance of the one-hot encoder
one_hot = OneHotEncoder()

# 3. Define the categorical features we want to convert to numerical values
categorical_features = ["Make", "Colour"]

# 4. Create an instance of ColumnTransformer with our encoder and categorical_features
transformer = ColumnTransformer(
    [(
        "one_hot",             # name
        one_hot,               # transformer
        categorical_features   # columns to transform
    )],
    remainder="passthrough"    # what to do with the rest of the columns ("passthrough" = leave unchanged)
)

transformed_x = transformer.fit_transform(X)
transformed_x[0]  # the result is an array with no head(), so check the first row by index


array([0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
       0.0000e+00, 3.5431e+04, 4.0000e+00])

In [21]:
# 2. Pandas

# 1. define categorical features
categorical_features = ["Make", "Colour"]

# 2. Pass to pandas to convert
dummies = pd.get_dummies(
     data=data[categorical_features], # columns we need to convert
     dtype=float # default output is True/False; float gives 0.0/1.0 instead
     )

dummies

Unnamed: 0,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
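
The dummies table above only contains the encoded columns. To keep the numerical columns alongside them, one option is to pass the whole DataFrame and name the columns to encode. A small sketch on made-up data (the `toy` values are illustrative):

```python
import pandas as pd

# Made-up data, not the car sales CSV
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota"],
    "Colour": ["White", "Blue", "White"],
    "Doors": [4, 5, 4],
})

# Encode only Make and Colour; Doors is kept as-is
toy_encoded = pd.get_dummies(toy, columns=["Make", "Colour"], dtype=float)
print(toy_encoded)
```

The untouched columns come first in the result, followed by one column per category.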


In [29]:
# 3. Split the numerically encoded data into training and test sets
import numpy as np
np.random.seed(42)

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(transformed_x, Y, test_size=0.3)

x_train.shape, x_test.shape, y_train.shape, y_test.shape

((700, 13), (300, 13), (700,), (300,))

In [30]:
# Fit the model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

model.fit(x_train, y_train)

# model.score(x_test, y_test)

ValueError: Input y contains NaN.
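
The error comes from the labels, not the features. A quick way to confirm this before fitting is to count the missing labels. The sketch below uses made-up data (the `toy_` names are illustrative) and catches the error rather than letting the cell fail:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up features and labels; one label is missing
toy_X = pd.DataFrame({"Doors": [4.0, 5.0, 4.0, 3.0]})
toy_y = pd.Series([15000.0, np.nan, 13000.0, 14000.0], name="Price")

# Count missing labels before trying to fit
n_missing_labels = int(toy_y.isna().sum())
print(f"Missing labels: {n_missing_labels}")

# Fitting with a NaN label raises a ValueError
try:
    RandomForestRegressor(n_estimators=5, random_state=0).fit(toy_X, toy_y)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print(f"Fit failed: {err}")
```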

## FILL MISSING VALUES: in two ways

### 1. Pandas
### 2. Scikit-Learn

What if there were missing values in the data?
Holes in the data means holes in the patterns your machine learning model can learn.

Many machine learning models don't work well or produce errors when they're used on datasets with missing values.

A missing value can appear as a blank, as a NaN or something similar.

There are two main options when dealing with missing values:

1. Fill them with some given or calculated value (imputation) - For example, you might fill missing values of a numerical column with the mean of all the other values. The practice of calculating or figuring out how to fill missing values in a dataset is called imputing. For a great resource on imputing missing values, I'd recommend referring to the Scikit-Learn user guide.

2. Remove them - If a row or sample has missing values, you may opt to remove it from your dataset completely. However, this potentially results in using less data to build your model.

Note: Dealing with missing values differs from problem to problem, meaning there's no 100% best way to fill missing values across datasets and problem types. It will often take careful experimentation and practice to figure out the best way to deal with missing values in your own datasets.
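
To make the two options concrete, here is a small sketch on a made-up one-column table (values are illustrative only). Imputing keeps every row, removing shrinks the table:

```python
import numpy as np
import pandas as pd

# Made-up odometer readings with one missing value
toy = pd.DataFrame({"Odometer (KM)": [10000.0, np.nan, 30000.0, 50000.0]})

# Option 1: impute with the mean of the known values (10000, 30000, 50000)
filled = toy.fillna(toy["Odometer (KM)"].mean())

# Option 2: remove rows that contain a missing value
dropped = toy.dropna()

print(len(filled), len(dropped))
```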

To practice dealing with missing values, let's import a version of the car_sales dataset with several missing values (namely car-sales-extended-missing-data.csv).

In [35]:
import pandas as pd

missing_data = pd.read_csv("./car-sales-extended-missing-data.csv")
missing_data

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [36]:
missing_data.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

Hmm... seems there are about 50 or so missing values per column.

We already tried splitting this data into features and labels, converting the categorical data to numbers, splitting it into training and test sets and fitting a model on it, and the fit failed because the labels contain NaN values.

Let's see how we might fill missing values with pandas.

For categorical values, one of the simplest ways is to fill the missing fields with the string "missing".

We could do this for the Make and Colour features.

As for the Doors feature, we could use "missing" or we could fill it with the most common option of 4.

With the Odometer (KM) feature, we can use the mean value of all the other values in the column.

And finally, for those samples which are missing a Price value, we can remove them. Since Price is the target value, removing these rows probably causes less harm than imputing them (though you could design an experiment to test this).

In summary:


| Column/Feature | Fill missing value with |
| --- | --- |
| Make | "missing" |
| Colour | "missing" |
| Doors | 4 (most common value) |
| Odometer (KM) | mean of Odometer (KM) |
| Price (target) | NA, remove samples missing Price |

Note: The practice of filling missing data with given or calculated values is called imputation. And it's important to remember there's no perfect way to fill missing data (unless it's with data that should've actually been there in the first place). The methods we're using are only one of many. The techniques you use will depend heavily on your dataset. A good place to look would be searching for "data imputation techniques".
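
The fill plan above can also be written as a single mapping from column name to fill value. This is only a sketch on made-up data (the `toy` table and `fill_values` name are illustrative); the cells that follow fill each column one at a time instead:

```python
import numpy as np
import pandas as pd

# Made-up table with one missing value in each column
toy = pd.DataFrame({
    "Make": ["Honda", np.nan, "Toyota", "BMW"],
    "Colour": ["White", "Blue", np.nan, "Red"],
    "Odometer (KM)": [10000.0, 30000.0, np.nan, 50000.0],
    "Doors": [4.0, np.nan, 4.0, 5.0],
    "Price": [15000.0, 20000.0, 13000.0, np.nan],
})

# One fill value per feature column
fill_values = {
    "Make": "missing",
    "Colour": "missing",
    "Doors": 4,
    "Odometer (KM)": toy["Odometer (KM)"].mean(),
}

# Fill the features, then drop rows that are missing the target
toy_filled = toy.fillna(value=fill_values).dropna(subset=["Price"])
print(toy_filled)
```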

Let's start with the Make column.

We can use the pandas method fillna(value="missing") and assign the result back to the column to fill all the missing values with the string "missing".

In [38]:
# Fill the missing values in the Make column
# Note: Older pandas code often called fillna with inplace=True on a selected column. That pattern is being phased out, so we reassign instead.
# missing_data["Make"].fillna(value="missing", inplace=True)

missing_data['Make'] = missing_data["Make"].fillna(value="missing")

In [40]:
# Note: As with Make, we reassign the column rather than using inplace=True.
# missing_data["Colour"].fillna(value="missing", inplace=True)

# Fill the Colour column
missing_data["Colour"] = missing_data["Colour"].fillna(value="missing")

In [42]:
missing_data.isna().sum()


Make              0
Colour            0
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

Wonderful! We're making some progress.

Now let's fill the Doors column with 4 (the most common value). For this column, that is the same as filling it with the median or mode.
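
Rather than hard-coding 4, the most common value can be looked up from the column itself. A minimal sketch on a made-up Doors column (values are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up door counts with two missing values
doors = pd.Series([4.0, 4.0, 5.0, np.nan, 3.0, 4.0, np.nan], name="Doors")

# mode() returns a Series because there can be ties; take the first entry
most_common = doors.mode()[0]
median_doors = doors.median()

# Fill the gaps with the most common value
doors_filled = doors.fillna(most_common)
print(most_common, median_doors)
```

Both mode() and median() skip missing values by default, so the gaps don't affect the result.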


In [43]:
missing_data["Doors"] = missing_data["Doors"].fillna(value=4)

In [44]:
# Fill the Odometer (KM) column with the mean of the column
missing_data["Odometer (KM)"] = missing_data["Odometer (KM)"].fillna(value=missing_data["Odometer (KM)"].mean())

How many missing values do we have now?



In [46]:
# Check the number of missing values
missing_data.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

Woohoo! That's looking a lot better.

Finally, we can remove the rows which are missing the target value Price.

Note: Another option would be to impute the Price value with the mean or median or some other calculated value (such as by using similar cars to estimate the price), however, to keep things simple and prevent introducing too many fake labels to the data, we'll remove the samples missing a Price value.

We can remove rows with missing values in place from a pandas DataFrame with the pandas.DataFrame.dropna(inplace=True) method.

In [48]:
# Remove rows with missing Price labels
missing_data.dropna(inplace=True)

There should be no more missing values!



In [49]:
missing_data.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

Since we removed samples missing a Price value, there are now fewer samples overall, but none of them have missing values.



In [51]:
# Check the number of total samples (previously was 1000)
len(missing_data)

950

Can we fit a model now?

Let's try!

First we'll create the features and labels.

Then we'll convert categorical variables into numbers via one-hot encoding.

Then we'll split the data into training and test sets just like before.

Finally, we'll try to fit a RandomForestRegressor() model to the newly filled data.

In [52]:
#  Create features
X_missing = missing_data.drop("Price", axis=1)
print(f"Number of missing X values:\n{X_missing.isna().sum()}")

# Create labels
y_missing = missing_data["Price"]
print(f"Number of missing y values: {y_missing.isna().sum()}")

Number of missing X values:
Make             0
Colour           0
Odometer (KM)    0
Doors            0
dtype: int64
Number of missing y values: 0


In [60]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([(
        'one_hot_encoder',
        one_hot,
        categorical_features
        
    )], remainder="passthrough",
    sparse_threshold=0 # 0 = always return a dense array instead of a sparse matrix
    )

transformed_missing_x = transformer.fit_transform(X_missing)
transformed_missing_x[0]

array([0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
       0.0000e+00, 3.5431e+04, 4.0000e+00])
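
The sparse_threshold setting decides whether ColumnTransformer returns a dense array or a sparse matrix. The sketch below compares two settings on made-up data (the `encode` helper and `toy_X` table are hypothetical names for illustration):

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up data, not the car sales CSV
toy_X = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota", "Nissan"],
    "Doors": [4, 5, 4, 3],
})

def encode(sparse_threshold):
    """One-hot encode Make with the given sparse_threshold (illustrative helper)."""
    transformer = ColumnTransformer(
        [("one_hot", OneHotEncoder(), ["Make"])],
        remainder="passthrough",
        sparse_threshold=sparse_threshold,
    )
    return transformer.fit_transform(toy_X)

dense_result = encode(0)      # 0: never stack as sparse
sparse_result = encode(1.0)   # 1.0: stack as sparse whenever the density is below 1

print(sparse.issparse(dense_result), sparse.issparse(sparse_result))
```

A dense array is easier to inspect by index, which is why the cell above sets sparse_threshold=0.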

In [63]:
# Fit to the model here 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(transformed_missing_x, y_missing, test_size=0.3)
model = RandomForestRegressor()

model.fit(x_train, y_train)
model.score(x_test, y_test)

0.26183319644339753

Fantastic!!!

Looks like filling the missing values with pandas worked!

Our model can be fit to the data without issues.

## Filling missing data and transforming categorical data with Scikit-Learn


Now that we've filled the missing values using pandas functions, you might be thinking, "Why pandas? I thought this was a Scikit-Learn introduction?".

Not to worry, Scikit-Learn provides a class called sklearn.impute.SimpleImputer() which allows us to do a similar thing.

SimpleImputer() transforms data by filling missing values with a given strategy parameter.

And we can use it to fill the missing values in our DataFrame as above.

At the moment, our DataFrame has no missing values.

Let's reimport it so it has missing values and we can fill them with Scikit-Learn.



In [64]:
# Reimport the DataFrame (so that all the missing values are back)
car_sales_missing = pd.read_csv("./car-sales-extended-missing-data.csv") # read from local directory (source: https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/data/car-sales-extended-missing-data.csv)
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [67]:
# Drop rows where the target (Price) is NaN
car_sales_missing.dropna(subset=["Price"], inplace=True)

# Now there are no rows missing a Price value.
car_sales_missing.isna().sum()


Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [68]:
# Split into X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

# Split data into train and test
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2)

Note: We've split the data into train and test sets first so that missing values are filled on each separately. This is best practice, as the test set is supposed to emulate data the model has never seen before. For categorical variables, it's generally okay to fill values across the whole dataset. However, for numerical variables, any values filled in the test set should be computed from the training set only.
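
Here is a small sketch of what "computed from the training set only" means in practice, using made-up odometer readings (the array names are illustrative). The imputer learns the mean from the training rows and reuses it for the test rows:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training and test readings, each with one missing value
train_km = np.array([[10000.0], [30000.0], [np.nan], [50000.0]])
test_km = np.array([[np.nan], [200000.0]])

km_imputer = SimpleImputer(strategy="mean")

# fit_transform learns the mean from the training rows (10000, 30000, 50000)
train_filled = km_imputer.fit_transform(train_km)

# transform reuses the training mean; the test value of 200000 plays no part in it
test_filled = km_imputer.transform(test_km)

print(km_imputer.statistics_, test_filled[0, 0])
```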

Training and test sets created!

Let's now set up a few instances of SimpleImputer() to fill various missing values.

We'll use the following strategies and fill values:

1. For categorical columns (Make, Colour), strategy="constant", fill_value="missing" (fill the missing samples with a consistent value of "missing").
2. For the Doors column, strategy="constant", fill_value=4 (fill the missing samples with a consistent value of 4).
3. For the numerical column (Odometer (KM)), strategy="mean" (fill the missing samples with the mean of that column).

Note: There are more strategy and fill options in the SimpleImputer() documentation.

In [None]:
from sklearn.impute import SimpleImputer

# Create categorical variable imputer
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")

# Create Doors column imputer (Doors is float64, so the fill value must be numeric, not a string)
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

# Create Odometer (KM) column imputer
num_imputer = SimpleImputer(strategy="mean")


Imputers created!

Now let's define which columns we'd like to impute on.

Why?

Because ColumnTransformer needs to know which columns each imputer should be applied to.

In [71]:
# Define different column features
categorical_features = ["Make", "Colour"]
door_feature = ["Doors"]
numerical_feature = ["Odometer (KM)"]

In [77]:
from sklearn.compose import ColumnTransformer

# Create series of column transforms to perform
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, categorical_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, numerical_feature)
    ])

In [78]:
# Find values to fill and transform training data
filled_X_train = imputer.fit_transform(X_train)

# Fill values in the test set with values learned from the training set
filled_X_test = imputer.transform(X_test)

# Check filled X_train
filled_X_train

Wonderful!

Let's now turn our filled_X_train and filled_X_test arrays into DataFrames to inspect their missing values.

In [79]:
# Get our transformed data arrays back into DataFrames
# (columns follow the order of the imputers in the ColumnTransformer)
filled_X_train_df = pd.DataFrame(filled_X_train, 
                                 columns=["Make", "Colour", "Doors", "Odometer (KM)"])

filled_X_test_df = pd.DataFrame(filled_X_test, 
                                columns=["Make", "Colour", "Doors", "Odometer (KM)"])

# Check missing data in training set
filled_X_train_df.isna().sum()

In [80]:
filled_X_test_df.isna().sum()

In [81]:
# Now let's one hot encode the features with the same code as before 
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()

transformer = ColumnTransformer([("one_hot", 
                                  one_hot, 
                                  categorical_features)],
                                remainder="passthrough",
                                sparse_threshold=0) # 0 = always return a dense array

# Fit the encoder on the training set, then apply it to both sets
transformed_X_train = transformer.fit_transform(filled_X_train_df)
transformed_X_test = transformer.transform(filled_X_test_df)

# Check transformed and filled X_train
transformed_X_train

In [82]:
# Now we've transformed X, let's see if we can fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

# Make sure to use the transformed data (filled and one-hot encoded X data)
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test, y_test)

When this cell runs, you might notice the score is slightly different from before.

Why do you think this is?

It's because we've created our training and testing sets differently.

We split the data into training and test sets before filling the missing values.

Previously, we did the reverse: we filled missing values before splitting the data into training and test sets.

Doing this can leak information between the training and test sets, because fill values such as the Odometer (KM) mean are computed from rows that later end up in the test set.

Remember, one of the most important concepts in machine learning is making sure your model doesn't see any testing data before evaluation.

We'll keep practicing, but for now, some of the main takeaways are:

1. Keep your training and test sets separate.
2. Most datasets you come across won't be in a form ready to immediately start using them with machine learning models. And some may take more preparation than others to get ready to use.
3. For most machine learning models, your data has to be numerical. This will involve converting whatever you're working with into numbers. This process is often referred to as feature engineering or feature encoding.
4. Some machine learning models aren't compatible with missing data. The process of filling missing data is referred to as data imputation.
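
One way to put these takeaways together is a Scikit-Learn Pipeline, which chains imputation, encoding and the model so that everything is learned from the training set only. The following is a sketch under assumptions: the data is randomly generated (not the car sales CSV), and the step names such as "cat", "num" and "regressor" are arbitrary labels chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Randomly generated stand-in data with some missing feature values
rng = np.random.default_rng(42)
n_rows = 60
toy = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Toyota"], size=n_rows),
    "Odometer (KM)": rng.uniform(10_000, 200_000, size=n_rows),
    "Price": rng.uniform(5_000, 30_000, size=n_rows),
})
toy.loc[::10, "Make"] = np.nan
toy.loc[5::10, "Odometer (KM)"] = np.nan

X_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]

# Categorical columns: fill with "missing", then one-hot encode
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])

# Numerical columns: fill with the mean
num_pipe = Pipeline([("impute", SimpleImputer(strategy="mean"))])

preprocess = ColumnTransformer([
    ("cat", cat_pipe, ["Make"]),
    ("num", num_pipe, ["Odometer (KM)"]),
])

toy_model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=42)),
])

# Split first; fit() then learns fill values and categories from the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
toy_model.fit(X_tr, y_tr)
toy_score = toy_model.score(X_te, y_te)
toy_predictions = toy_model.predict(X_te)
```

Because the targets here are random, the score itself is meaningless. The point is that the test rows are only ever passed through transform steps that were fitted on the training rows.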