# Notebook1 Prepare Dataset

Readme: This notebook preprocesses the CO2 emissions data and saves two datasets, one for each of the two downstream workstreams.

- **Content in this notebook:** Data Processing

    - Feature Engineering: Derive new features from two columns, `Transmission` and `Model`

    - Data Encoding: Convert the categorical variables into dummy variables (`Make` is kept as a plain label column)

    - Dataset Division: Split the two datasets into train and test sets (a random seed is set for reproducibility)

    - Dataset Saving: Save four files: dataset1-train, dataset1-test, dataset2-train, dataset2-test.

- **Input/Output of this notebook:**

     - Input: *CO2 Emissions_Canada.csv*

     - Output: four csv files: 
     
        *Dataset1_train.csv*, *Dataset1_test.csv*,
        
        *Dataset2_train.csv*, *Dataset2_test.csv*

Note: Data standardization is not included in this notebook.

## Import Modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## Import Data

In [2]:
df_co2 = pd.read_csv('CO2 Emissions_Canada.csv')
df_co2.head(3)

Unnamed: 0,Make,Model,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption City (L/100 km),Fuel Consumption Hwy (L/100 km),Fuel Consumption Comb (L/100 km),Fuel Consumption Comb (mpg),CO2 Emissions(g/km)
0,ACURA,ILX,COMPACT,2.0,4,AS5,Z,9.9,6.7,8.5,33,196
1,ACURA,ILX,COMPACT,2.4,4,M6,Z,11.2,7.7,9.6,29,221
2,ACURA,ILX HYBRID,COMPACT,1.5,4,AV7,Z,6.0,5.8,5.9,48,136


## Transmission Feature

In [3]:
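def gearbox_type(data):
    """Encode a transmission code as 1 (contains "A") or 0 (contains "M")."""
    Automatic = "A"
    Manual = "M"
    encoded = 9999  # sentinel for codes matching neither letter
    if Automatic in data:
        encoded = 1
    # Checked last, so a code containing both letters is encoded as 0.
    if Manual in data:
        encoded = 0
    return encoded

def gear_number(data):
    """Return the gear count read from the last character of a transmission code."""
    encoded = 9999  # sentinel for unrecognised codes
    number = data[-1]
    if number in "456789":
        return float(number)
    elif number == "0":
        # A trailing "0" is read as 10 gears.
        return 10.0
    elif number == "V":
        # Codes ending in "V" have no trailing digit; a fixed value (about 6.854) is used.
        return 1926 / 281
    else:
        return encoded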

In [4]:
df_co2["Gearbox_Type"] = df_co2["Transmission"].map(gearbox_type)
df_co2["Gearbox_Type"].value_counts()

Gearbox_Type
1    5554
0    1831
Name: count, dtype: int64

In [5]:
df_co2["Gearbox_Number"] = df_co2["Transmission"].map(gear_number)
df_co2["Gearbox_Number"].value_counts()

Gearbox_Number
6.000000     3259
8.000000     1802
7.000000     1026
9.000000      419
5.000000      307
6.854093      295
10.000000     210
4.000000       67
Name: count, dtype: int64
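
To make the encoding concrete, the sketch below re-implements the two helpers in compact form and applies them to a handful of transmission codes. `AS5`, `M6` and `AV7` come from the preview above; the other codes are illustrative assumptions.

```python
def gearbox_type(code):
    """1 if the code contains "A", overridden to 0 if it contains "M"."""
    encoded = 9999  # sentinel for codes matching neither letter
    if "A" in code:
        encoded = 1
    if "M" in code:
        encoded = 0
    return encoded


def gear_number(code):
    """Gear count taken from the last character of the code."""
    last = code[-1]
    if last in "456789":
        return float(last)
    if last == "0":
        return 10.0
    if last == "V":
        return 1926 / 281  # fixed substitute used by the notebook
    return 9999


# The last three codes are assumed examples, not taken from the output above.
samples = ["AS5", "M6", "AV7", "AM6", "A10", "AV"]
encoded = {c: (gearbox_type(c), gear_number(c)) for c in samples}

for code, (kind, gears) in encoded.items():
    print(f"{code:>4}  type={kind}  gears={gears:.3f}")
```

A code containing both letters, such as the assumed `AM6`, ends up as 0 because the manual check runs last.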

## Vehicle Model Features

In [6]:
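def vehicle_feature(data):
    """Map a model name to a drivetrain, fuel, or wheelbase label (None if no tag is found)."""
    data_string = data.split(" ")
    for item in data_string:
        # `code` is reset for every token, so only the last token of the
        # model name determines the returned label.
        code = None
        if "4WD" in item:
            code = "Four-wheel drive"
        if "4X4" in item:
            code = "Four-wheel drive"
        if "FFV" in item:
            code = "Flexible-fuel vehicle"
        if "SWB" in item:
            code = "Short wheelbase"
        if "LWB" in item:
            code = "Long wheelbase"
        if "EWB" in item:
            code = "Extended wheelbase"
    return code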

In [7]:
df_co2["Model Features"] = df_co2["Model"].map(vehicle_feature)
df_co2["Model Features"].value_counts()

Model Features
Four-wheel drive         549
Flexible-fuel vehicle    496
Long wheelbase            26
Short wheelbase           15
Extended wheelbase        15
Name: count, dtype: int64
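
Because `code` is reset for each token, `vehicle_feature` reports only a tag found in the last token of the model name. The sketch below contrasts that behaviour with a hypothetical variant, `first_feature`, which returns the first tag found anywhere in the name. The tag table mirrors the function above; the model names, `last_token_feature` and `first_feature` are made up for illustration, and switching to the variant would change the counts shown above.

```python
TAGS = {
    "4WD": "Four-wheel drive",
    "4X4": "Four-wheel drive",
    "FFV": "Flexible-fuel vehicle",
    "SWB": "Short wheelbase",
    "LWB": "Long wheelbase",
    "EWB": "Extended wheelbase",
}


def last_token_feature(name):
    """Same result as the notebook's function: only the last token counts."""
    last = name.split(" ")[-1]
    code = None
    for tag, label in TAGS.items():
        if tag in last:
            code = label  # later tags override earlier ones, as in the original
    return code


def first_feature(name):
    """Hypothetical variant: return the first tag found in any token."""
    for token in name.split(" "):
        for tag, label in TAGS.items():
            if tag in token:
                return label
    return None


# Made-up model names, chosen only to expose the difference.
names = ["PICKUP FFV", "PICKUP 4WD CREW", "PICKUP FFV 4X4"]
for name in names:
    print(f"{name!r}: last={last_token_feature(name)!r}, first={first_feature(name)!r}")
```

Which behaviour is preferable depends on whether a model carrying several tags should count under the first or the last one; the notebook as written uses the last token.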

## Build Datasets for the Following Workstreams

To create the datasets for the two workstreams:

- Drop `Model`, which is replaced by `Model Features`

- Drop `Transmission`, which is replaced by `Gearbox_Type` and `Gearbox_Number`

- **Dataset1**: Used for current vehicle manufacturer benchmarking; all fuel consumption columns are kept.

- **Dataset2**: Used for future vehicle design evaluation, where fuel consumption measurements are not available, so the columns `Fuel Consumption City (L/100 km)`, `Fuel Consumption Hwy (L/100 km)`, `Fuel Consumption Comb (L/100 km)`, and `Fuel Consumption Comb (mpg)` are **dropped**.

In [8]:
# For workstream 1, current state benchmarking
dataset1 = df_co2[['Model Features', 'Vehicle Class', 'Engine Size(L)', 'Cylinders',
       'Gearbox_Type', 'Gearbox_Number', 'Fuel Type', 'Fuel Consumption City (L/100 km)',
       'Fuel Consumption Hwy (L/100 km)', 'Fuel Consumption Comb (L/100 km)',
       'Fuel Consumption Comb (mpg)', 'CO2 Emissions(g/km)']]  

# For workstream 2, future design evaluation
dataset2 = df_co2[['Model Features', 'Vehicle Class', 'Engine Size(L)', 'Cylinders',
       'Gearbox_Type', 'Gearbox_Number', 'Fuel Type', 'CO2 Emissions(g/km)']]
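
An equivalent way to obtain the second dataset is to drop the fuel consumption columns from the first one instead of listing the kept columns twice. The sketch below shows this on a two-row toy frame built from the preview rows above; `toy`, `fuel_cols` and `toy_dataset2` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Two rows copied from the preview of the raw data, reduced to a few columns.
toy = pd.DataFrame({
    "Engine Size(L)": [2.0, 2.4],
    "Fuel Consumption City (L/100 km)": [9.9, 11.2],
    "Fuel Consumption Hwy (L/100 km)": [6.7, 7.7],
    "Fuel Consumption Comb (L/100 km)": [8.5, 9.6],
    "Fuel Consumption Comb (mpg)": [33, 29],
    "CO2 Emissions(g/km)": [196, 221],
})

# Every column whose name starts with "Fuel Consumption" is removed.
fuel_cols = [c for c in toy.columns if c.startswith("Fuel Consumption")]
toy_dataset2 = toy.drop(columns=fuel_cols)

print(fuel_cols)
print(toy_dataset2.columns.tolist())
```

Selecting by prefix keeps the two column lists from drifting apart, at the cost of relying on the naming pattern of the source file.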


Set random seed to guarantee reproducibility

In [9]:
import random

# Set Random seeds
random.seed(42)
np.random.seed(42)

Encode categorical features

In [10]:
cat_features_set1 = dataset1.select_dtypes(include=['object']).columns  # select categorical columns
cat_features_set2 = dataset2.select_dtypes(include=['object']).columns  # select categorical columns

# One-hot encode each dataset using its own categorical columns
dataset1_encoded = pd.get_dummies(dataset1, columns=cat_features_set1, drop_first=True, dtype=int)
dataset2_encoded = pd.get_dummies(dataset2, columns=cat_features_set2, drop_first=True, dtype=int)

# Re-attach the manufacturer as a plain (unencoded) label column
dataset1_encoded['Make'] = df_co2['Make']
dataset2_encoded['Make'] = df_co2['Make']
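
The effect of `drop_first=True` on a column that also contains missing labels is easy to overlook, so here is a minimal sketch on a three-row toy frame (the frame and its values are assumptions for illustration). The first category in sorted order is dropped, and rows with a missing label get zeros in every dummy column, so the two cases are indistinguishable after encoding.

```python
import pandas as pd

toy = pd.DataFrame({
    "Cylinders": [4, 6, 4],
    "Model Features": ["Four-wheel drive", None, "Flexible-fuel vehicle"],
})

# Categories are sorted, so "Flexible-fuel vehicle" is the dropped first level.
toy_encoded = pd.get_dummies(toy, columns=["Model Features"], drop_first=True, dtype=int)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

If the distinction between "no tag" and the dropped category matters downstream, one option is to fill missing labels with an explicit placeholder before encoding.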


In [11]:
dataset1_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7385 entries, 0 to 7384
Data columns (total 33 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Engine Size(L)                          7385 non-null   float64
 1   Cylinders                               7385 non-null   int64  
 2   Gearbox_Type                            7385 non-null   int64  
 3   Gearbox_Number                          7385 non-null   float64
 4   Fuel Consumption City (L/100 km)        7385 non-null   float64
 5   Fuel Consumption Hwy (L/100 km)         7385 non-null   float64
 6   Fuel Consumption Comb (L/100 km)        7385 non-null   float64
 7   Fuel Consumption Comb (mpg)             7385 non-null   int64  
 8   CO2 Emissions(g/km)                     7385 non-null   int64  
 9   Model Features_Flexible-fuel vehicle    7385 non-null   int32  
 10  Model Features_Four-wheel drive         7385 non-null   int3

Divide Train Dataset and Test Dataset

In [12]:
from sklearn.model_selection import train_test_split

# Passing both frames in one call splits them on the same rows.
Dataset1_train, Dataset1_test, Dataset2_train, Dataset2_test = train_test_split(
    dataset1_encoded, dataset2_encoded, test_size=0.2, random_state=42
)

Dataset1_train.head()

Unnamed: 0,Engine Size(L),Cylinders,Gearbox_Type,Gearbox_Number,Fuel Consumption City (L/100 km),Fuel Consumption Hwy (L/100 km),Fuel Consumption Comb (L/100 km),Fuel Consumption Comb (mpg),CO2 Emissions(g/km),Model Features_Flexible-fuel vehicle,...,Vehicle Class_SUV - SMALL,Vehicle Class_SUV - STANDARD,Vehicle Class_TWO-SEATER,Vehicle Class_VAN - CARGO,Vehicle Class_VAN - PASSENGER,Fuel Type_E,Fuel Type_N,Fuel Type_X,Fuel Type_Z,Make
6590,3.0,6,1,8.0,11.4,8.1,9.9,29,231,0,...,0,0,0,0,0,0,0,0,1,BMW
6274,4.0,6,1,5.0,14.7,10.3,12.7,22,299,0,...,0,0,0,0,0,0,0,1,0,NISSAN
2251,3.0,6,0,6.0,13.8,9.0,11.7,24,273,0,...,0,0,0,0,0,0,0,0,1,AUDI
3149,3.4,6,0,7.0,11.3,7.9,9.8,29,230,0,...,0,0,1,0,0,0,0,0,1,PORSCHE
4362,2.0,4,1,8.0,10.1,7.0,8.7,32,204,0,...,0,0,0,0,0,0,0,0,1,VOLVO
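
The following sketch checks two properties of the split on toy data: passing two frames to `train_test_split` in a single call assigns the same rows to train and test in both, and a fixed `random_state` reproduces the same split on a second call. The frames `a` and `b` are placeholders standing in for the two encoded datasets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Two toy frames with the same row index, standing in for the two datasets.
a = pd.DataFrame({"x": range(10)})
b = pd.DataFrame({"y": range(100, 110)})

a_train, a_test, b_train, b_test = train_test_split(
    a, b, test_size=0.2, random_state=42
)

# A second call with the same random_state yields the same partition.
a_train_again, a_test_again, _, _ = train_test_split(
    a, b, test_size=0.2, random_state=42
)

print(a_train.index.equals(b_train.index))
print(a_train.index.equals(a_train_again.index))
```

Splitting both datasets in one call keeps the two workstreams comparable, since each model is evaluated on the same held-out vehicles.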


Save Output

In [13]:
Dataset1_train.to_csv('Dataset1_train.csv', index=False)
Dataset1_test.to_csv('Dataset1_test.csv', index=False)

Dataset2_train.to_csv('Dataset2_train.csv', index=False)
Dataset2_test.to_csv('Dataset2_test.csv', index=False)
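
Since the files are written with `index=False`, the shuffled row labels from the split are not stored, and a later `read_csv` assigns a fresh 0-based index. The sketch below shows this round trip using an in-memory buffer in place of a file; the two-row frame reuses values from the preview above and the variable names are illustrative.

```python
import io

import pandas as pd

# Two rows with the shuffled labels seen in the train preview.
toy_train = pd.DataFrame(
    {"Cylinders": [6, 4], "Make": ["BMW", "VOLVO"]},
    index=[6590, 4362],
)

buffer = io.StringIO()  # stands in for a CSV file on disk
toy_train.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded)
```

If the original row labels are needed later, for example to trace a prediction back to the raw file, they would have to be saved as a regular column before writing.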

In [14]:
Dataset1_train

Unnamed: 0,Engine Size(L),Cylinders,Gearbox_Type,Gearbox_Number,Fuel Consumption City (L/100 km),Fuel Consumption Hwy (L/100 km),Fuel Consumption Comb (L/100 km),Fuel Consumption Comb (mpg),CO2 Emissions(g/km),Model Features_Flexible-fuel vehicle,...,Vehicle Class_SUV - SMALL,Vehicle Class_SUV - STANDARD,Vehicle Class_TWO-SEATER,Vehicle Class_VAN - CARGO,Vehicle Class_VAN - PASSENGER,Fuel Type_E,Fuel Type_N,Fuel Type_X,Fuel Type_Z,Make
6590,3.0,6,1,8.000000,11.4,8.1,9.9,29,231,0,...,0,0,0,0,0,0,0,0,1,BMW
6274,4.0,6,1,5.000000,14.7,10.3,12.7,22,299,0,...,0,0,0,0,0,0,0,1,0,NISSAN
2251,3.0,6,0,6.000000,13.8,9.0,11.7,24,273,0,...,0,0,0,0,0,0,0,0,1,AUDI
3149,3.4,6,0,7.000000,11.3,7.9,9.8,29,230,0,...,0,0,1,0,0,0,0,0,1,PORSCHE
4362,2.0,4,1,8.000000,10.1,7.0,8.7,32,204,0,...,0,0,0,0,0,0,0,0,1,VOLVO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5191,2.0,4,0,6.000000,10.3,7.4,9.0,31,210,0,...,0,0,0,0,0,0,0,0,1,MINI
5226,3.5,6,1,7.000000,10.6,7.3,9.1,31,214,0,...,0,0,0,0,0,0,0,1,0,NISSAN
5390,3.5,6,1,8.000000,11.7,8.8,10.4,27,242,0,...,0,1,0,0,0,0,0,1,0,TOYOTA
860,2.5,4,1,6.854093,9.5,7.4,8.6,33,198,0,...,1,0,0,0,0,0,0,1,0,NISSAN
