# Python ML final: Fuel efficiency prediction with One Hot Encoding
 
**By 陳冠元 (10701054), third-year medical student, National Yang Ming Chiao Tung University**

 Inspired by https://www.freecodecamp.org/news/how-to-build-your-first-neural-network-to-predict-house-prices-with-keras-f8db83049159/
 
 **Why I chose to do this project:**
> It would be interesting to be able to predict the fuel efficiency of a vehicle just by looking at its specs and categorical data. This is actually not very difficult for people who know about cars (yep, I am sort of an automobile fan), but I think it would be interesting to see if the same "educated guess" could be made using Python deep learning.

>I have seen projects that tried to analyze this dataset with regression models, but not with one hot encoding.

>The dataset I used contained **7385 car models** produced in the past decade cataloged by the Canadian government.

>>**The dataset: https://www.kaggle.com/debajyotipodder/co2-emission-by-vehicles**

>I put the sources that helped me along the way below:

# Set our Fuel Efficiency target

According to the US Environmental Protection Agency,
the average fuel efficiency of a typical passenger vehicle is **10.69 L/100 km, or 9.35 km/L**.

https://www.epa.gov/greenvehicles/greenhouse-gas-emissions-typical-passenger-vehicle
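
The two units are reciprocals scaled by 100, so converting between them is a single division. A minimal sketch of that conversion follows; the helper name `km_per_l_to_l_per_100km` is hypothetical and only used for illustration.

```python
def km_per_l_to_l_per_100km(km_per_l: float) -> float:
    """Convert fuel efficiency (km/L) to fuel consumption (L/100 km)."""
    return 100.0 / km_per_l

# The target quoted above: 9.35 km/L corresponds to roughly 10.7 L/100 km.
consumption = km_per_l_to_l_per_100km(9.35)
print(round(consumption, 2))

# The same formula converts back, because the relationship is symmetric.
print(round(km_per_l_to_l_per_100km(consumption), 2))
```

This is the same `100/target_fuel_efficiency` expression that the notebook uses later when it compares each vehicle against the target.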

In [None]:
# Please input the target fuel efficiency below:
# Accuracy and AUC of the model would vary given the different fuel efficiency goals.
target_fuel_efficiency = 9.35
#Unit:km/L

# Import Data
Import the csv file to take a look at the dataset.

In [None]:
# These are the things imported by Kaggle by default...

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import csv

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Import the dataset so we can take a look at what we are dealing with...
data = pd.read_csv('../input/co2-emission-by-vehicles/CO2 Emissions_Canada.csv')

# Print out the dataset, a gigantic csv file.
data

# Model
#    4WD/4X4 = Four-wheel drive
#    AWD = All-wheel drive
#    FFV = Flexible-fuel vehicle
#    SWB = Short wheelbase
#    LWB = Long wheelbase
#    EWB = Extended wheelbase
    
# Transmission
#    A = automatic
#    AM = automated manual
#    AS = automatic with select shift
#    AV = continuously variable
#    M = manual
#    3 - 10 = Number of gears
    
# Fuel type
#    X = regular gasoline
#    Z = premium gasoline
#    D = diesel
#    E = ethanol (E85)
#    N = natural gas
    
# Fuel consumption: City and highway fuel consumption ratings are shown in litres per 100 kilometres (L/100 km)
# - the combined rating (55% city, 45% hwy) is shown in L/100 km and in miles per imperial gallon (mpg)
# CO2 emissions: the tailpipe emissions of carbon dioxide (in grams per kilometre) for combined city and highway driving


# Preprocessing by removing columns
The make and model of the vehicle are removed since they could overwhelm the neural network.

The other fuel consumption columns are removed due to the algebraic relationship between fuel consumption and CO2 emission.

In [None]:
# Cannot quantify Make and Model of an automobile.
data.pop('Make')
data.pop('Model')

# These four columns have a tight algebraic relationship with the predicted value.
data.pop('Fuel Consumption City (L/100 km)')
data.pop('Fuel Consumption Hwy (L/100 km)')
data.pop('Fuel Consumption Comb (mpg)')
data.pop('CO2 Emissions(g/km)')

# I thought removing these categorical columns would increase model accuracy, but it did not.
#data.pop('Cylinders')
#data.pop('Transmission')
#data.pop('Fuel Type')

# Save the modified csv file as a new file.
data.to_csv('preprocessed_fuel_efficiency.csv', index=False)
pd.read_csv('preprocessed_fuel_efficiency.csv')

# Make sure the columns are properly removed.
data

# Preprocessing with Microsoft Excel

Microsoft Excel function for averaging a column according to the condition of another column:

https://www.excelforum.com/excel-general/541631-calculate-average-in-a-column-based-on-criteria-in-another-column.html
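
For comparison, the same kind of per-category average could also be computed in pandas. This is only a sketch on a tiny made-up sample: the column names follow the Kaggle CSV, the numbers are invented, and the exact layout of the `means of groups.csv` file used below is not reproduced here.

```python
import pandas as pd

# Tiny made-up sample; only the column names match the real dataset.
sample = pd.DataFrame({
    "Vehicle Class": ["COMPACT", "COMPACT", "MID-SIZE", "FULL-SIZE"],
    "CO2 Emissions(g/km)": [200, 220, 240, 280],
})

# Average the emissions column within each vehicle class.
class_means = sample.groupby("Vehicle Class")["CO2 Emissions(g/km)"].mean()
print(class_means)
```

Repeating the `groupby` for `Transmission` and `Fuel Type` would give the averages for the other two categorical columns.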

In [None]:
# Read in the csv file that contains the means of the different groups.
means_of_groups = pd.read_csv('../input/means-of-groups/means of groups.csv')

# Convert the DataFrame into a numpy array, and then into a list.
# I really don't know why, but lists behave better in loops and appending compared to numpy arrays.
MOG = np.ndarray.tolist(means_of_groups.values)

means_of_groups
# What I really did was use Microsoft Excel to calculate the average CO2 emission for each category of the categorical data.
# I know this could be done with a loop and some appending, but Excel is just so much faster...
# This csv file is crucial for the ONE HOT ENCODING process performed below.
# This also shows that categorical data can be a clue when predicting the fuel efficiency of automobiles.
# For example, COMPACT < MID-SIZE < FULL-SIZE when it comes to the CO2 emissions of different vehicles.

In [None]:
# The new csv dataset with several columns kicked out is opened and converted to numpy array.
new_data = pd.read_csv("preprocessed_fuel_efficiency.csv")
dataset = new_data.values

# Print out the array and see if it is converted correctly.
dataset

# One Hot Encoding

https://www.kite.com/python/answers/how-to-append-to-a-2d-list-in-python

https://numpy.org/doc/stable/reference/generated/numpy.ndarray.tolist.html

https://www.w3schools.com/python/python_lists_remove.asp

https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
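
As a point of comparison with the hand-written loop in the next cell, pandas offers `get_dummies` for one hot encoding. The sketch below uses a small made-up frame rather than the real dataset, and it is not the method this notebook relies on.

```python
import pandas as pd

# Made-up rows; the column names follow the Kaggle CSV.
sample = pd.DataFrame({
    "Engine Size(L)": [2.0, 3.5, 1.5],
    "Fuel Type": ["X", "Z", "X"],
})

# Each distinct fuel type becomes its own 0/1 column.
encoded = pd.get_dummies(sample, columns=["Fuel Type"], dtype=int)
print(encoded)
```

Note two differences from the loop: `get_dummies` only creates columns for categories that actually occur in the data, and it prefixes the new column names with the original column name.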

In [None]:
# This part took me quite a lot of time to think through and to compose the loop.
# The loop kept misbehaving, and the cause turned out to be the numpy array.
# Therefore, I converted the array to a list.
# To prevent confusion, dataset is called datalist after the conversion.
datalist = dataset.tolist()

# I came up with this by myself, but I have no programming background...
# I really wonder if there is a more elegant solution.

# MOG rows 0-15 hold the vehicle classes, rows 16-42 the transmissions, rows 43-47 the fuel types.
for a in range(len(datalist)):
    for b in range(48):
        # The vehicle class sits at index 0. Once it is deleted, the transmission
        # and then the fuel type each end up at index 2 in turn.
        col = 0 if b < 16 else 2
        # The categorical data is compared in the order of the "means of groups" (MOG) list.
        if MOG[b][0] == datalist[a][col]:
            datalist[a].append(1)  # If the contents match, a 1 is appended to "datalist".
            del datalist[a][col]   # The categorical value is no longer useful, so it is deleted.
        else:
            datalist[a].append(0)  # If the contents don't match, a 0 is appended.
    
# The results are not all printed otherwise the page will contain something really long without abbreviating...
# But the first, middle, last row is printed to check if everything is okay.
# I think these are the common places loops go wrong...
print(datalist[0])
print(datalist[int(round(len(datalist)/2, 0))])
print(datalist[-1]) # [-1] means the last row.

# Converting the fuel efficiency outcomes to binary

https://stackoverflow.com/questions/18716564/python-cant-assign-to-literal

https://www.guru99.com/variables-in-python.html

https://www.geeksforgeeks.org/python-output-formatting/

In [None]:
# This loop simplifies the fuel efficiency to 0 = does not meet the target; 1 = meets the target.
threshold = 100 / target_fuel_efficiency  # The units have to be converted from km/L to L/100 km.
y = 0
for row in datalist:
    if row[2] < threshold:
        row.append(1)
        y = y + 1
    else:
        row.append(0)
    del row[2]  # The raw fuel consumption value is no longer needed.
z = y / len(datalist)  # Fraction of automobile models that meet the fuel efficiency target.

#Just to examine the percentage of car models that perform above average in fuel efficiency.
print('Approximately {} % of vehicle models meet the fuel efficiency requested.'.format(round(z*100, 2)))
print()

In [None]:
# See how the attribute information and fuel efficiency data have been modified.
print(datalist[0])
print(datalist[int(round(len(datalist)/2, 0))])
print(datalist[-1])

In [None]:
# The reason I save the data back to a csv file is that lists were rejected by the deep learning model.
# Since numpy arrays are accepted, and Kaggle displays csv files and numpy arrays better, I convert the datalist to csv and then to a numpy array.

# The heading would be lost, so it has to be added back manually.
# comments="" stops np.savetxt from prefixing the header line with "# ".
dataset_csv = np.asarray(datalist)
np.savetxt("preprocessed_fuel_efficiency.csv", dataset_csv, delimiter=",", comments="",
           header='Engine Size(L),Cylinders,COMPACT,FULL-SIZE,MID-SIZE,MINICOMPACT,MINIVAN,PICKUP TRUCK - SMALL,PICKUP TRUCK - STANDARD,SPECIAL PURPOSE VEHICLE,STATION WAGON - MID-SIZE,STATION WAGON - SMALL,SUBCOMPACT,SUV - SMALL,SUV - STANDARD,TWO-SEATER,VAN - CARGO,VAN - PASSENGER,A4,A5,A6,A7,A8,A9,A10,AM5,AM6,AM7,AM8,AM9,AS4,AS5,AS6,AS7,AS8,AS9,AS10,AV,AV6,AV7,AV8,AV10,M5,M6,M,D,E,N,X,Z,Above average fuel efficiency')
new_data = pd.read_csv("preprocessed_fuel_efficiency.csv")

#This is the dataset AFTER PREPROCESSING.
# Categorical data is turned into binary.
# Fuel efficiency outcomes are saved as binary also.
new_data

Splitting the fuel efficiency results into a separate array.

In [None]:
#Converting the preprocessed csv file to array...again.
dataset = new_data.values

# Splitting the results
X = dataset[:,0:50] # These are the inputs that will be fed into the deep learning model.
Y = dataset[:,50] # These are the answers for validation and testing.

print(X)
print(Y)

# Printed this way, there is a dot after every element, since the elements are stored as floats.
# Therefore the data can be accepted by the model.

In [None]:
# Scale the values of every column to between 0 and 1, a common preprocessing step for deep learning models.
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
X_scale = min_max_scaler.fit_transform(X)

# See if the data is scaled properly
X_scale
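
What min-max scaling does to each column can be written out by hand. The sketch below uses a made-up array and assumes every column has distinct minimum and maximum values, so no division by zero occurs.

```python
import numpy as np

# Made-up data: 3 rows, 2 columns.
X_demo = np.array([[1.0, 4.0],
                   [2.0, 6.0],
                   [4.0, 8.0]])

# Per-column minimum and maximum.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)

# Each value is shifted by the column minimum and divided by the column range.
X_demo_scaled = (X_demo - col_min) / (col_max - col_min)
print(X_demo_scaled)
```

After scaling, the smallest value in each column becomes 0 and the largest becomes 1. The one hot columns are already 0 or 1, so in this notebook the scaling mainly affects engine size and cylinders.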

In [None]:
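# Since N=7385, I think holding out 10% (about 5% for validation and 5% for testing) is quite enough.
from sklearn.model_selection import train_test_split

# Note: the unscaled X is split here, not X_scale, so the raw engine size and
# cylinder values remain available for the error report further down.
X_train, X_val_and_test, Y_train, Y_val_and_test = train_test_split(X, Y, test_size=0.1)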
X_val, X_test, Y_val, Y_test = train_test_split(X_val_and_test, Y_val_and_test, test_size=0.5)

print(X_train.shape, X_val.shape, X_test.shape, Y_train.shape, Y_val.shape, Y_test.shape)
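
The shapes printed above can be anticipated with a little arithmetic. This sketch assumes that scikit-learn rounds the held-out portion up and gives the remainder to the other part; the variable names are only for illustration.

```python
import math

n_samples = 7385  # Number of car models in the dataset.

# First split: test_size=0.1 holds out 10% for validation and testing combined.
n_holdout = math.ceil(0.1 * n_samples)
n_train = n_samples - n_holdout

# Second split: test_size=0.5 divides the held-out rows roughly in half.
n_test = math.ceil(0.5 * n_holdout)
n_val = n_holdout - n_test

print(n_train, n_val, n_test)
```

Under that assumption the model trains on 6646 rows, validates on 369, and is tested on 370.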

# Deep Learning Model

https://keras.io/api/layers/activations/
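
The two activations used in the model below, `relu` and `sigmoid`, are simple functions. Here is a minimal numpy sketch of their definitions, written for illustration rather than taken from Keras itself.

```python
import numpy as np

def relu(x):
    """Pass positive values through and clip negative values to 0."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))
print(sigmoid(z))
```

Because `sigmoid` outputs a value between 0 and 1, it suits the final layer of a model whose answer is a 0/1 label.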

In [None]:
from keras.models import Sequential
from keras.layers import Dense

In [None]:
# 50 inputs, 4 hidden layers, and 1 output.
model = Sequential([
    Dense(50, activation='relu', input_shape=(50,)),
    Dense(64, activation='relu'),
    Dense(64, activation='relu'),
    Dense(10, activation='relu'),
    Dense(1, activation='sigmoid'),
])

# Have a general idea of what the model is like.
model.summary()
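
The parameter counts reported by `model.summary()` can be checked by hand: a `Dense` layer has one weight per input-output pair plus one bias per output. A short sketch of that arithmetic for the layer sizes defined above:

```python
# Input width followed by the width of each Dense layer in the model.
layer_sizes = [50, 50, 64, 64, 10, 1]

# Weights (n_in * n_out) plus biases (n_out) for each consecutive pair of layers.
params = [n_in * n_out + n_out
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(params)

print(params)
print(total_params)
```

For this architecture the total comes to 10635 trainable parameters.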

https://keras.io/api/metrics/

https://keras.io/api/losses/probabilistic_losses/

https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
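
To make the two candidate loss functions concrete, here is a small numpy sketch that evaluates mean squared error and binary cross-entropy on the same made-up predictions. It illustrates the formulas only and says nothing about which loss trains better on this dataset.

```python
import numpy as np

# Made-up labels and predicted probabilities.
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4])

# Mean squared error: average squared distance between label and probability.
mse = np.mean((y_true - y_prob) ** 2)

# Binary cross-entropy: average negative log-probability assigned to the true label.
bce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(mse, bce)
```

Cross-entropy penalizes confident wrong answers much more heavily than MSE does, which is one reason the two losses can behave differently during training.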

In [None]:
# The adam optimizer somehow works better than sgd.
model.compile(
    optimizer='adam',
    loss='mse', # Works better than binary_crossentropy somehow...
    metrics=['AUC','accuracy'])

In [None]:
# I find a batch size of 32 works best, and training takes at least 100 epochs.
hist = model.fit(X_train, Y_train,
          batch_size=32, epochs=100,
          validation_data=(X_val, Y_val))

# Model Evaluation

The model reaches an accuracy above 90%. I find this satisfactory, since the categorical data is overwhelming.

In [None]:
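# evaluate() returns [loss, AUC, accuracy], following the order of the metrics given to compile().
test_loss, test_auc, test_accuracy = model.evaluate(X_test, Y_test)
print('Test accuracy:', test_accuracy)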
print(X_test)
print(Y_test)

# Error report

The vehicles for which the model made an incorrect prediction are displayed and saved as a csv file.

https://www.geeksforgeeks.org/python-save-list-to-csv/
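
The model outputs a probability between 0 and 1 for each vehicle, so it has to be turned into a 0/1 label before it can be compared with `Y_test`. A small sketch on made-up probabilities, showing rounding next to an explicit threshold:

```python
import numpy as np

# Made-up sigmoid outputs with the (n, 1) shape that model.predict returns.
probs = np.array([[0.12], [0.50], [0.87]])

# np.round rounds exact halves to the nearest even number, so 0.50 becomes 0.
labels_round = np.round(probs, 0).astype(int).ravel()

# An explicit threshold makes the cut-off visible in the code.
labels_threshold = (probs > 0.5).astype(int).ravel()

print(labels_round, labels_threshold)
```

Both approaches give the same labels here; the explicit threshold is simply easier to adjust if a cut-off other than 0.5 is wanted.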

In [None]:
# First, convert the numpy array into a list
prediction = np.ndarray.tolist(np.round(model.predict(X_test),0))
X_test_list = np.ndarray.tolist(X_test)
converted=[]

# This loop with 4 variables has two functions:
## 1.Rebuild the categorical data based on the one hot encoding results.
## 2.Check if the accuracy reported from the trained model is correct.
i,j,k,l=0,0,0,0
for i in range(len(Y_test)):
    if Y_test[i]==prediction[i][0]:
        j=j+1
    else:
        converted.append([int(i)])
        for l in range(len(MOG)):
            if l in range(16):
                if int(X_test[i][l+2])==1:
                    converted[k].append(str(MOG[l][0]))
                    converted[k].append(X_test_list[i][0])
                    converted[k].append(X_test_list[i][1])
            elif l in range(43):
                if int(X_test[i][l+2])==1:
                    converted[k].append(str(MOG[l][0]))
            elif l in range(48):
                if int(X_test[i][l+2])==1:
                    converted[k].append(str(MOG[l][0]))
                    converted[k].append(int(Y_test[i]))
                    converted[k].append(int(prediction[i][0]))  # Reuse the prediction computed above instead of calling model.predict again.
            l=l+1
        k=k+1
    i=i+1


# This tool is handy when it comes to saving list to csv file.
import csv
  
fields = ['X_test index', 'Vehicle class', 'Engine size', 'Cylinders', 'Transmission', 'Fuel type', 'Y_test', 'MODEL PREDICTION'] 
    
rows = converted # data rows of csv file 
  
with open('error_report.csv', 'w', newline='') as f:  # newline='' avoids blank rows on some platforms
      
    write = csv.writer(f)# using csv.writer method from CSV package
      
    write.writerow(fields)
    write.writerows(rows)

error_report_csv = pd.read_csv("error_report.csv")

# Make sure the accuracy of the model is properly displayed
print('confirmed accuracy:',round(j/i, 4))
print()
error_report_csv

# Overestimation and underestimation of the vehicle's fuel efficiency

https://www.geeksforgeeks.org/python-ways-to-remove-duplicates-from-list/

https://www.geeksforgeeks.org/python-list-sort/
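
In classification terms, an underestimated vehicle is one the model labeled 0 although it meets the target, and an overestimated vehicle is one labeled 1 although it does not. The sketch below shows that split on made-up labels with numpy; it is an illustration, not the matching procedure used in the next cell.

```python
import numpy as np

# Made-up true labels and model predictions for six vehicles.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1])

# Predicted 0 but actually 1: efficiency was underestimated.
underestimated = np.where((y_pred == 0) & (y_true == 1))[0]

# Predicted 1 but actually 0: efficiency was overestimated.
overestimated = np.where((y_pred == 1) & (y_true == 0))[0]

accuracy = np.mean(y_pred == y_true)
print(underestimated, overestimated, accuracy)
```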

In [None]:
# The original dataset is reloaded.
data2 = pd.read_csv('../input/co2-emission-by-vehicles/CO2 Emissions_Canada.csv')
datalist2 = np.ndarray.tolist(data2.values)

underestimate_efficiency = []
overestimate_efficiency = []

# If all the elements match, the index from the original dataset will be appended.
p,q=0,0
for p in range(len(converted)):
    for q in range(len(datalist2)):
        if (converted[p][1]==datalist2[q][2] and converted[p][2]==datalist2[q][3]
                and converted[p][3]==datalist2[q][4] and converted[p][4]==datalist2[q][5]
                and converted[p][5]==datalist2[q][6]):
            if (converted[p][7]==0 and datalist2[q][9]<(100/target_fuel_efficiency)): 
                underestimate_efficiency.append(q)
            elif (converted[p][7]==1 and datalist2[q][9]>(100/target_fuel_efficiency)):
                overestimate_efficiency.append(q)
        q=q+1
    p=p+1

underestimate_efficiency2 = []
for i in underestimate_efficiency:
    if i not in underestimate_efficiency2:
        underestimate_efficiency2.append(i)
        
overestimate_efficiency2 = []
for i in overestimate_efficiency:
    if i not in overestimate_efficiency2:
        overestimate_efficiency2.append(i)

# The results are sorted in ascending order.        
underestimate_efficiency2.sort()
overestimate_efficiency2.sort()
        
print('Models that had its fuel efficiency underestimated: ')
print(underestimate_efficiency2)
print()
print('Models that had its fuel efficiency overestimated: ') 
print(overestimate_efficiency2)

# Fuel efficiency underestimated

In [None]:
underestimate_error = []

a=0
for a in range(len(underestimate_efficiency2)):
    A= underestimate_efficiency2[a]
    underestimate_error.append(datalist2[A])
    a=a+1

import csv
  
fields = ['Make','Model','Vehicle Class','Engine Size(L)','Cylinders','Transmission','Fuel Type','Fuel Consumption City (L/100 km)','Fuel Consumption Hwy (L/100 km)','Fuel Consumption Comb (L/100 km)','Fuel Consumption Comb (mpg)','CO2 Emissions(g/km)'] 
rows = underestimate_error 
  
with open('underestimate_error.csv', 'w', newline='') as f:
    write = csv.writer(f) 
    write.writerow(fields)
    write.writerows(rows)

underestimate_error_csv = pd.read_csv("underestimate_error.csv")
underestimate_error_csv

# Fuel efficiency overestimated

In [None]:
overestimate_error = []

b=0
for b in range(len(overestimate_efficiency2)):
    B=overestimate_efficiency2[b]
    overestimate_error.append(datalist2[B])
    b=b+1

import csv
   
fields = ['Make','Model','Vehicle Class','Engine Size(L)','Cylinders','Transmission','Fuel Type','Fuel Consumption City (L/100 km)','Fuel Consumption Hwy (L/100 km)','Fuel Consumption Comb (L/100 km)','Fuel Consumption Comb (mpg)','CO2 Emissions(g/km)']    
rows = overestimate_error 
  
with open('overestimate_error.csv', 'w', newline='') as f:
    write = csv.writer(f)
    write.writerow(fields)
    write.writerows(rows)

overestimate_error_csv = pd.read_csv("overestimate_error.csv")
overestimate_error_csv

# Visualization of the model

In [None]:
# Plot the training history recorded above.
import matplotlib.pyplot as plt
plt.plot(hist.history['val_loss'])
plt.plot(hist.history['loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Val', 'Train'], loc='upper right')
plt.show()

In [None]:
plt.plot(hist.history['val_auc'])
plt.plot(hist.history['auc'])
plt.plot(hist.history['val_accuracy'])
plt.plot(hist.history['accuracy'])
plt.title('Model AUC and Accuracy')
plt.ylabel('AUC and Accuracy')
plt.xlabel('Epoch')
plt.legend(['Val AUC', 'Train AUC','Val Accuracy', 'Train Accuracy'], loc='lower right')
plt.show()

# My previous attempt to deal with categorical data:

**Link to my previous project:** https://www.kaggle.com/galenchen/python-ml-final-fuel-efficiency-prediction

>Basically, I **took the average of the CO2 emissions of different categories**, then **replaced the categorical data with the means** I calculated.
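
That earlier approach, replacing each category with the mean of its group, can be sketched in pandas as follows. The data is made up, the new column name is hypothetical, and the previous project did this step with Excel rather than with this code.

```python
import pandas as pd

# Made-up rows; the column names follow the Kaggle CSV.
sample = pd.DataFrame({
    "Fuel Type": ["X", "Z", "X", "D"],
    "CO2 Emissions(g/km)": [200, 260, 220, 240],
})

# Mean CO2 emission for each fuel type.
fuel_means = sample.groupby("Fuel Type")["CO2 Emissions(g/km)"].mean()

# Replace each category label with the mean of its group.
sample["Fuel Type (mean CO2)"] = sample["Fuel Type"].map(fuel_means)
print(sample)
```

Unlike one hot encoding, this produces a single numeric column per categorical feature, so the model sees far fewer inputs.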

The results, with a similar deep learning model:

>With **one hot encoding**, the accuracy is about **93%** (loss = **mean_squared_error** works better)

>With **means of each category**, the accuracy is about **87%** (loss = **binary_crossentropy** works poorly)

**In conclusion, one hot encoding is more accurate when it comes to predicting the fuel efficiency of automobiles.**