In [44]:
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

# Preprocessing - Categorical Data

## Categorical Data

When your data contains categories represented as strings, it is difficult to use them to train machine learning models, which often accept only numeric data.

Instead of ignoring the categorical data and excluding that information from your model, you can transform the data so that it can be used in your models.

Take a look at the table below. It is the same data set that we used in the multiple regression chapter.

In [40]:
df = pd.read_csv('../../data.csv')
df.head(10)

Unnamed: 0,Car,Model,Volume,Weight,CO2
0,Toyoty,Aygo,1000,790,99
1,Mitsubishi,Space Star,1200,1160,95
2,Skoda,Citigo,1000,929,95
3,Fiat,500,900,865,90
4,Mini,Cooper,1500,1140,105
5,VW,Up!,1000,929,105
6,Skoda,Fabia,1400,1109,90
7,Mercedes,A-Class,1500,1365,92
8,Ford,Fiesta,1500,1112,98
9,Audi,A1,1600,1150,99


In the multiple regression chapter, we tried to predict the CO2 emitted based on the volume of the engine and the weight of the car, but we excluded information about the car brand and model.

The information about the car brand or the car model might help us make a better prediction of the CO2 emitted.

## One Hot Encoding

We cannot make use of the Car or Model column in our data since they are not numeric. A linear relationship between a categorical variable, Car or Model, and a numeric variable, CO2, cannot be determined.

To fix this issue, we must have a numeric representation of the categorical variable. One way to do this is to have a column representing each group in the category.

For each column, the values will be 1 or 0 where 1 represents the inclusion of the group and 0 represents the exclusion. This transformation is called one hot encoding.

You do not have to do this manually. The pandas module has a function called `get_dummies()` which does one hot encoding.
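
As a minimal sketch of the `get_dummies()` route, the snippet below encodes a tiny frame. The rows are made up for illustration and are not the chapter's data set; the `prefix` and `dtype` arguments are only there to mirror the `Car_...` column names and float values used later on.

```python
import pandas as pd

# Tiny illustrative frame; these values are made up for this sketch.
cars = pd.DataFrame({
    "Car": ["Skoda", "Fiat", "Skoda", "VW"],
    "Volume": [1000, 900, 1400, 1000],
})

# One indicator column per brand, named Car_<brand>.
dummies = pd.get_dummies(cars["Car"], prefix="Car", dtype=float)

# Replace the string column with its indicator columns.
encoded = pd.concat([dummies, cars.drop(columns="Car")], axis=1)
print(encoded)
```

Each row has exactly one `1.0` among the `Car_...` columns, marking the brand that row belongs to.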

Here I'm using scikit-learn's `OneHotEncoder` to do the same thing.

In [41]:
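# Encode the Car column as one 0/1 column per brand (dense array output)
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc_data = enc.fit_transform(df[['Car']])

# categories_ holds one array of category labels per encoded column
categories = enc.categories_

feature_names = [f'Car_{category}' for category in categories[0]]

# Put the encoded columns first and drop the original string columns
df = pd.concat([pd.DataFrame(enc_data, columns=feature_names), df.drop(['Car', 'Model'], axis=1)], axis=1)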
df

Unnamed: 0,Car_Audi,Car_BMW,Car_Fiat,Car_Ford,Car_Honda,Car_Hundai,Car_Hyundai,Car_Mazda,Car_Mercedes,Car_Mini,Car_Mitsubishi,Car_Opel,Car_Skoda,Car_Suzuki,Car_Toyoty,Car_VW,Car_Volvo,Volume,Weight,CO2
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1000,790,99
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1200,1160,95
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1000,929,95
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,900,865,90
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1500,1140,105
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1000,929,105
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1400,1109,90
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1500,1365,92
8,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1500,1112,98
9,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1600,1150,99


## Predict CO2
We can use this additional information alongside the volume and weight to predict CO2.

In [91]:
X = df.drop(['CO2'], axis=1)
y = df['CO2']

In [92]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('l_reg', LinearRegression())
])

In [93]:
pipeline.fit(X.values, y.values)

Finally, we can predict the CO2 emissions based on the car's weight, volume, and manufacturer.

In [98]:
# Predict the CO2 emission of a VW with a weight of 2300 kg and an engine volume of 1300 cm3.
# The input follows the column order of X: the 17 Car_ columns, then Volume, then Weight.
predictedCO2 = pipeline.predict([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1300, 2300]])
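
Typing out a row of zeros by hand makes it easy to put the `1` in the wrong position. One way to avoid that, sketched below, is to build the row from the column names. The `make_row` helper is hypothetical and not part of the notebook; the column list is copied from the encoded table above so the snippet runs on its own, and in the notebook `list(X.columns)` would give the same list.

```python
# Column order copied from the encoded table above (CO2 excluded).
feature_names = [
    "Car_Audi", "Car_BMW", "Car_Fiat", "Car_Ford", "Car_Honda",
    "Car_Hundai", "Car_Hyundai", "Car_Mazda", "Car_Mercedes", "Car_Mini",
    "Car_Mitsubishi", "Car_Opel", "Car_Skoda", "Car_Suzuki", "Car_Toyoty",
    "Car_VW", "Car_Volvo", "Volume", "Weight",
]


def make_row(brand, volume, weight):
    """Hypothetical helper: build one input row in feature_names order."""
    column = f"Car_{brand}"
    if column not in feature_names:
        raise ValueError(f"unknown brand: {brand}")
    row = dict.fromkeys(feature_names, 0)  # start with every feature at 0
    row[column] = 1                        # switch on the chosen brand
    row["Volume"] = volume
    row["Weight"] = weight
    # Return a 2D list (one row), the shape predict() expects.
    return [[row[name] for name in feature_names]]


vw_row = make_row("VW", 1300, 2300)
print(vw_row)
```

The resulting `vw_row` could then be passed to `pipeline.predict(vw_row)` in place of the hand-written list.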

In [104]:
predictedCO2, pipeline['l_reg'].coef_

(array([122.65107203]),
 array([-0.22198006,  0.43700382, -0.02688432, -0.3999056 , -1.23991712,
        -0.87580315,  0.91002911, -0.67073278, -0.50590307,  0.9337549 ,
        -0.23173598, -0.74704008, -0.41164575,  0.87948196,  1.41186994,
         2.15577401,  0.20355378,  3.98855386,  2.53037811]))

We now have a coefficient for the volume, the weight, and each car brand in the dataset.
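
The raw array is hard to read on its own, so it can help to label each coefficient with its feature name. The sketch below does this with the values printed above, copied in so that it runs standalone; in the notebook, `pd.Series(pipeline['l_reg'].coef_, index=X.columns)` should give the same result. Because the pipeline standardizes every feature before fitting, each coefficient describes the change in predicted CO2 per one standard deviation of that feature, not per kg, per cm3, or per 0/1 switch.

```python
import pandas as pd

# Column order of X, copied from the encoded table above.
feature_names = [
    "Car_Audi", "Car_BMW", "Car_Fiat", "Car_Ford", "Car_Honda",
    "Car_Hundai", "Car_Hyundai", "Car_Mazda", "Car_Mercedes", "Car_Mini",
    "Car_Mitsubishi", "Car_Opel", "Car_Skoda", "Car_Suzuki", "Car_Toyoty",
    "Car_VW", "Car_Volvo", "Volume", "Weight",
]

# Coefficients copied from the output above, in the same order.
coefficients = [
    -0.22198006, 0.43700382, -0.02688432, -0.3999056, -1.23991712,
    -0.87580315, 0.91002911, -0.67073278, -0.50590307, 0.9337549,
    -0.23173598, -0.74704008, -0.41164575, 0.87948196, 1.41186994,
    2.15577401, 0.20355378, 3.98855386, 2.53037811,
]

# Label each coefficient and sort from largest to smallest.
coef_table = pd.Series(coefficients, index=feature_names).sort_values(ascending=False)
print(coef_table)
```

Sorted this way, `Volume` and `Weight` come out on top, followed by `Car_VW`, while `Car_Honda` has the most negative coefficient.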