# Notebook 1: Prepare Dataset

Readme: This notebook preprocesses the CO2 emissions data and saves two datasets, one for each of the two downstream workstreams.

- **Content in this notebook:** Data Processing

    - Feature Engineering: Split two features, `Transmission` and `Model`, into derived features

    - Data Encoding: Convert all categorical variables into dummy variables

    - Dataset Division: Split each of the two datasets into train and test sets (with a fixed random seed for reproducibility)

    - Dataset Saving: Save four files: Dataset1_train, Dataset1_test, Dataset2_train, Dataset2_test.

- **Input/Output of this notebook:**

     - Input: *CO2 Emissions_Canada.csv*

     - Output: four CSV files: *Dataset1_train.csv*, *Dataset1_test.csv*, *Dataset2_train.csv*, *Dataset2_test.csv*

Note: Data standardization is not performed in this notebook.

## Import Modules

In [29]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## Import Data

In [30]:
df_co2 = pd.read_csv('CO2 Emissions_Canada.csv')
df_co2.head(3)
df_co2.shape # (7385, 12)

(7385, 12)

## Feature Manipulation

### Drop Natural Gas Fuel Type

Vehicles with the natural gas fuel type (`N`) are excluded from the following analysis. This removes one row (7385 to 7384).

In [31]:
df_co2 = df_co2[df_co2["Fuel Type"] != "N"]
df_co2.shape # (7384, 12)

(7384, 12)
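
A quick way to confirm that a filter like this removes only the intended rows is to compare the per-category counts with the number of rows dropped. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the CSV):

```python
import pandas as pd

# Toy frame standing in for df_co2; the values are made up for illustration.
toy = pd.DataFrame({
    "Fuel Type": ["X", "Z", "N", "E", "D", "X"],
    "CO2 Emissions(g/km)": [196, 221, 213, 250, 180, 205],
})

# Count rows per fuel type before filtering.
before = toy["Fuel Type"].value_counts()

# Same boolean-mask filter as in the cell above.
filtered = toy[toy["Fuel Type"] != "N"]

# The number of rows removed should equal the number of "N" rows.
removed = len(toy) - len(filtered)
print(before.get("N", 0), removed)
```

If the two printed numbers differ, the filter is dropping something other than the `N` rows.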

### Transmission Feature Split

In [32]:
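def gearbox_type(data):
    # Encode a transmission code as 1 (automatic) or 0 (manual).
    # A code containing both letters (e.g. "AM6") ends up as 0 because
    # the manual check runs last; 9999 flags a code with neither letter.
    Automatic = "A"
    Manual = "M"
    encoded = 9999
    if Automatic in data:
        encoded = 1
    if Manual in data:
        encoded = 0
    return encoded

def gear_number(data):
    # Read the gear count from the last character of the code.
    encoded = 9999  # sentinel for an unrecognized code
    number = data[-1]
    if number in "456789":
        return float(number)
    elif number == "0":
        # A trailing "0" is treated as 10 gears (e.g. "A10").
        return 10.0
    elif number == "V":
        # Codes ending in "V" carry no gear count; use a fixed fill value (about 6.854).
        return 1926 / 281
    else:
        return encoded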

In [33]:
df_co2["Gearbox_Type"] = df_co2["Transmission"].map(gearbox_type)
df_co2["Gearbox_Type"].value_counts()

Gearbox_Type
1    5553
0    1831
Name: count, dtype: int64

In [34]:
df_co2["Gearbox_Number"] = df_co2["Transmission"].map(gear_number)
df_co2["Gearbox_Number"].value_counts()

Gearbox_Number
6.000000     3258
8.000000     1802
7.000000     1026
9.000000      419
5.000000      307
6.854093      295
10.000000     210
4.000000       67
Name: count, dtype: int64
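
The two functions above read the transmission code by position: `gearbox_type` looks for the letters `A` and `M`, and `gear_number` looks only at the last character. As an alternative illustration, the sketch below splits a code into its letter prefix and numeric suffix with a regular expression. `split_transmission` is a hypothetical helper, not part of the notebook, and it assumes every code is uppercase letters optionally followed by digits.

```python
import re

# Hypothetical helper (not part of the notebook): split a transmission code
# such as "AS6" into its letter prefix and its trailing gear count.
_CODE = re.compile(r"^([A-Z]+)(\d+)?$")

def split_transmission(code):
    match = _CODE.match(code)
    if match is None:
        raise ValueError(f"unexpected transmission code: {code!r}")
    prefix, gears = match.groups()
    # Codes without digits (for example "AV") have no explicit gear count.
    return prefix, int(gears) if gears else None

for code in ["A4", "AS10", "AM7", "AV", "M6"]:
    print(code, split_transmission(code))
```

Returning `None` for a missing gear count keeps the decision about how to fill it (for example with a fixed value, as `gear_number` does) separate from the parsing step.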

### Vehicle Model Features

In [35]:
def vehicle_feature(data):
    # Tag a model name with a drivetrain, fuel, or wheelbase feature.
    # Note: `code` is reset for every token, so only the last
    # space-separated token of the model name decides the result.
    data_string = data.split(" ")
    for item in data_string:
        code = None
        if "4WD" in item:
            code = "Four-wheel drive"
        if "4X4" in item:
            code = "Four-wheel drive"
        if "FFV" in item:
            code = "Flexible-fuel vehicle"
        if "SWB" in item:
            code = "Short wheelbase"
        if "LWB" in item:
            code = "Long wheelbase"
        if "EWB" in item:
            code = "Extended wheelbase"
    return code

In [36]:
df_co2["Model Features"] = df_co2["Model"].map(vehicle_feature)
df_co2["Model Features"].value_counts()

Model Features
Four-wheel drive         549
Flexible-fuel vehicle    496
Long wheelbase            26
Short wheelbase           15
Extended wheelbase        15
Name: count, dtype: int64
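
Because `vehicle_feature` resets `code` on every token, a tag is picked up only when it appears in the last token of the model name. The sketch below shows one possible variant that scans all tokens and returns the first tag it finds. `first_model_feature` and `TAGS` are hypothetical names, and the model strings are made up for illustration; swapping this variant in would likely change the counts shown above.

```python
# Hypothetical variant (not the notebook's function): scan every token and
# return the label of the first tag found, wherever it appears in the name.
TAGS = {
    "4WD": "Four-wheel drive",
    "4X4": "Four-wheel drive",
    "FFV": "Flexible-fuel vehicle",
    "SWB": "Short wheelbase",
    "LWB": "Long wheelbase",
    "EWB": "Extended wheelbase",
}

def first_model_feature(model):
    for token in model.split(" "):
        for tag, label in TAGS.items():
            if tag in token:
                return label
    return None  # no known tag in the model name

# Made-up model names for illustration.
print(first_model_feature("TRUCK FFV 4X4"))
print(first_model_feature("SEDAN 4WD LIMITED"))
print(first_model_feature("COUPE"))
```

A model name can carry more than one tag, so a single-label column necessarily loses information; one column per tag would be another option.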

## Set Up Datasets for the Following Workstreams

To create the datasets for the two workstreams:

- Drop `Model`, which is replaced by `Model Features`

- Drop `Transmission`, which is replaced by `Gearbox_Type` and `Gearbox_Number`

- **Dataset1**: Used for Current Vehicle Manufacturer Benchmarking; keeps all fuel consumption columns.

- **Dataset2**: Used for Future Vehicle Design Evaluation, where fuel-efficiency measurements are not yet available, so the columns `Fuel Consumption City (L/100 km)`, `Fuel Consumption Hwy (L/100 km)`, `Fuel Consumption Comb (L/100 km)`, and `Fuel Consumption Comb (mpg)` are **dropped**.

In [37]:
# For workstream 1, current state benchmarking
dataset1 = df_co2[['Model Features', 'Vehicle Class', 'Engine Size(L)', 'Cylinders',
       'Gearbox_Type', 'Gearbox_Number', 'Fuel Type', 'Fuel Consumption City (L/100 km)',
       'Fuel Consumption Hwy (L/100 km)', 'Fuel Consumption Comb (L/100 km)',
       'Fuel Consumption Comb (mpg)', 'CO2 Emissions(g/km)']]  

# For workstream 2, future design evaluation
dataset2 = df_co2[['Model Features', 'Vehicle Class', 'Engine Size(L)', 'Cylinders',
       'Gearbox_Type', 'Gearbox_Number', 'Fuel Type', 'CO2 Emissions(g/km)']]
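
The two column lists differ only in the fuel consumption columns, so one way to keep them in sync is to derive the second selection from the first. A minimal sketch on a toy frame (the column subset and values are made up, and `toy_set1`/`toy_set2` are illustrative names):

```python
import pandas as pd

# Toy frame with a subset of the real column names; values are made up.
toy = pd.DataFrame({
    "Engine Size(L)": [2.0, 3.5],
    "Fuel Consumption City (L/100 km)": [9.9, 12.1],
    "Fuel Consumption Hwy (L/100 km)": [6.7, 8.7],
    "Fuel Consumption Comb (L/100 km)": [8.5, 10.6],
    "Fuel Consumption Comb (mpg)": [33, 27],
    "CO2 Emissions(g/km)": [196, 244],
})

# Every column whose name starts with "Fuel Consumption" is a measured
# efficiency figure that would not exist for a future design.
fuel_cols = [c for c in toy.columns if c.startswith("Fuel Consumption")]

toy_set1 = toy.copy()                    # workstream 1 keeps everything
toy_set2 = toy.drop(columns=fuel_cols)   # workstream 2 drops the fuel columns
print(list(toy_set2.columns))
```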


Set random seed to guarantee reproducibility

In [38]:
import random

# Set Random seeds
random.seed(42)
np.random.seed(42)
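
The global seeds above affect only code that draws from the global `random` and NumPy generators. The `train_test_split` calls later in this notebook pass `random_state=42` explicitly, which makes those splits reproducible on their own. The idea of a split driven by its own seeded generator can be sketched with the standard library; `seeded_split` is a hypothetical function for illustration and is not how scikit-learn implements its split.

```python
import random

def seeded_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items with a private RNG and cut off a test share."""
    rng = random.Random(seed)   # independent of the global random state
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Two calls with the same seed give the same partition.
train_a, test_a = seeded_split(range(10))
train_b, test_b = seeded_split(range(10))
print(test_a == test_b)
```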

Encode categorical features

In [39]:
cat_features_set1 = dataset1.select_dtypes(include=['object']).columns  # select categorical columns
cat_features_set2 = dataset2.select_dtypes(include=['object']).columns  # select categorical columns

dataset1_encoded = pd.get_dummies(dataset1, columns=cat_features_set1, drop_first=True, dtype=int)
dataset2_encoded = pd.get_dummies(dataset2, columns=cat_features_set2, drop_first=True, dtype=int)

dataset1_encoded['Make'] = df_co2['Make']
dataset2_encoded['Make'] = df_co2['Make']
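
One consequence of `drop_first=True` is worth checking on a small example: the first category of each column (in sorted order) gets no dummy column, and missing values get no dummy column either, so both end up as rows of zeros. The sketch below uses a made-up frame with the same column names; `toy` and `encoded` are illustrative names.

```python
import pandas as pd

# Made-up rows; None stands for a model name without any feature tag.
toy = pd.DataFrame({
    "Fuel Type": ["D", "E", "X", "Z"],
    "Model Features": ["Four-wheel drive", None, "Long wheelbase", None],
})

encoded = pd.get_dummies(
    toy,
    columns=["Fuel Type", "Model Features"],
    drop_first=True,   # drop the first category of each column
    dtype=int,
)

print(encoded.columns.tolist())
print(encoded)
```

In this toy frame, the row tagged `Four-wheel drive` (the dropped category) and the rows with no tag all have 0 in `Model Features_Long wheelbase`. If that distinction matters downstream, filling missing tags with an explicit label before encoding would be one option.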

In [40]:
dataset1_encoded.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7384 entries, 0 to 7384
Data columns (total 32 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Engine Size(L)                          7384 non-null   float64
 1   Cylinders                               7384 non-null   int64  
 2   Gearbox_Type                            7384 non-null   int64  
 3   Gearbox_Number                          7384 non-null   float64
 4   Fuel Consumption City (L/100 km)        7384 non-null   float64
 5   Fuel Consumption Hwy (L/100 km)         7384 non-null   float64
 6   Fuel Consumption Comb (L/100 km)        7384 non-null   float64
 7   Fuel Consumption Comb (mpg)             7384 non-null   int64  
 8   CO2 Emissions(g/km)                     7384 non-null   int64  
 9   Model Features_Flexible-fuel vehicle    7384 non-null   int32  
 10  Model Features_Four-wheel drive         7384 non-null   int32  
 

Split into train and test datasets

In [41]:
from sklearn.model_selection import train_test_split

# Prepare for Dataset 1
dataset1_train_features = dataset1_encoded.drop(columns=['CO2 Emissions(g/km)', 'Make'])
dataset1_train_y = dataset1_encoded['CO2 Emissions(g/km)']

dataset1_features_train, dataset1_features_test, dataset1_y_train, dataset1_y_test = train_test_split(
    dataset1_train_features, 
    dataset1_train_y, 
    test_size=0.2, random_state=42)

# Combine X & y into one data set
Dataset1_train = pd.concat([dataset1_features_train, dataset1_y_train], axis=1)
Dataset1_test = pd.concat([dataset1_features_test, dataset1_y_test], axis=1)

Dataset1_test.head()

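|  | Engine Size(L) | Cylinders | Gearbox_Type | Gearbox_Number | Fuel Consumption City (L/100 km) | Fuel Consumption Hwy (L/100 km) | Fuel Consumption Comb (L/100 km) | Fuel Consumption Comb (mpg) | Model Features_Flexible-fuel vehicle | Model Features_Four-wheel drive | ... | Vehicle Class_SUBCOMPACT | Vehicle Class_SUV - SMALL | Vehicle Class_SUV - STANDARD | Vehicle Class_TWO-SEATER | Vehicle Class_VAN - CARGO | Vehicle Class_VAN - PASSENGER | Fuel Type_E | Fuel Type_X | Fuel Type_Z | CO2 Emissions(g/km) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5632 | 6.2 | 8 | 0 | 7.0 | 18.2 | 12.5 | 15.6 | 18 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 368 |
| 1550 | 3.6 | 6 | 1 | 6.0 | 14.8 | 9.9 | 12.6 | 22 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 290 |
| 1128 | 4.2 | 8 | 0 | 6.0 | 20.5 | 11.7 | 16.6 | 17 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 382 |
| 6498 | 2.0 | 4 | 1 | 8.0 | 10.3 | 7.5 | 9.0 | 31 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 211 |
| 3270 | 1.8 | 4 | 1 | 6.0 | 9.4 | 6.8 | 8.2 | 34 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 193 |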


In [42]:

# Prepare for Dataset 2
dataset2_train_features = dataset2_encoded.drop(columns=['CO2 Emissions(g/km)', 'Make'])
dataset2_train_y = dataset2_encoded['CO2 Emissions(g/km)']

dataset2_features_train, dataset2_features_test, dataset2_y_train, dataset2_y_test = train_test_split(
    dataset2_train_features, 
    dataset2_train_y, 
    test_size=0.2, random_state=42)

# Combine X & y into one data set
Dataset2_train = pd.concat([dataset2_features_train, dataset2_y_train], axis=1)
Dataset2_test = pd.concat([dataset2_features_test, dataset2_y_test], axis=1)

Dataset2_test.head()

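|  | Engine Size(L) | Cylinders | Gearbox_Type | Gearbox_Number | Model Features_Flexible-fuel vehicle | Model Features_Four-wheel drive | Model Features_Long wheelbase | Model Features_Short wheelbase | Vehicle Class_FULL-SIZE | Vehicle Class_MID-SIZE | ... | Vehicle Class_SUBCOMPACT | Vehicle Class_SUV - SMALL | Vehicle Class_SUV - STANDARD | Vehicle Class_TWO-SEATER | Vehicle Class_VAN - CARGO | Vehicle Class_VAN - PASSENGER | Fuel Type_E | Fuel Type_X | Fuel Type_Z | CO2 Emissions(g/km) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5632 | 6.2 | 8 | 0 | 7.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 368 |
| 1550 | 3.6 | 6 | 1 | 6.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 290 |
| 1128 | 4.2 | 8 | 0 | 6.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 382 |
| 6498 | 2.0 | 4 | 1 | 8.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 211 |
| 3270 | 1.8 | 4 | 1 | 6.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 193 |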


Save Output

In [43]:
Dataset1_train.to_csv('Dataset1_train.csv', index=False)
Dataset1_test.to_csv('Dataset1_test.csv', index=False)

Dataset2_train.to_csv('Dataset2_train.csv', index=False)
Dataset2_test.to_csv('Dataset2_test.csv', index=False)
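
Because the files are written with `index=False`, the shuffled row labels left behind by the split are not saved, and a later `pd.read_csv` assigns a fresh 0-based index. A small round-trip sketch using an in-memory buffer instead of a file (`toy_train` and its values are made up for illustration):

```python
import io
import pandas as pd

# Toy frame with shuffled row labels, as a train/test split leaves behind.
toy_train = pd.DataFrame(
    {"Engine Size(L)": [2.0, 3.5], "CO2 Emissions(g/km)": [196, 244]},
    index=[5632, 1550],
)

# Write to an in-memory buffer with the same option as the cell above.
buffer = io.StringIO()
toy_train.to_csv(buffer, index=False)
buffer.seek(0)

# Reading back yields the same columns with a new default index.
reloaded = pd.read_csv(buffer)
print(reloaded.index.tolist())
print(reloaded.columns.tolist())
```

If the original row labels are needed later (for example to look up `Make` in the source data), writing the index as a column would preserve them.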