# Course 7: Machine Learning introduction (Regression)

Objectives: In this course and notebook, the objective will be for you to be able to load, manipulate and export a dataset. After that you will be building your first machine learning model and we will test it in an interactive Python web app done in streamit.

At the end of this course, you will be able to:
- **EXTRACT, TRANSFORM and LOAD** the data like if the last courses
- **BUILD A REGRESSION MODEL** using sklearn
- **APPLY THIS MODEL** to new data

<center><img src="https://www.eurixgroup.com/wp-content/uploads/2021/01/ml-e1610553826718.jpg" title="Python Logo"/></center>



## 0. Presentation of the dataset

We will be working on the dataset called the **CARS_POLLUTION**. It is dataset containing the information of different cars that drives in France. The objective is to predict the quantity of CO2 that a car will emit. The dataset can be found at this address: https://www.data.gouv.fr/fr/datasets/r/6ff09b59-84ca-4346-a8d1-3587ed94da15

<center><img src="https://s.france24.com/media/display/c41bbe3a-0942-11e9-87bc-005056bff430/w:1280/p:16x9/pollution-paris-circulation.webp" title="Python Logo" width = 400/></center>



## 1. Load the libraries

First we will start by loading the libraries that we need for this course

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk

## 2. Load the Data

As usual, start by loading your data into your Python environment

In [7]:
df_car_pollution = pd.read_csv("CARS_POLLUTION.csv", sep = ';',encoding='latin-1')

And let's have a look at the data

In [8]:
df_car_pollution.head()

Unnamed: 0,Marque,Modèle dossier,Modèle UTAC,Désignation commerciale,CNIT,Type Variante Version (TVV),Carburant,Hybride,Puissance administrative,Puissance maximale (kW),...,HC (g/km),NOX (g/km),HC+NOX (g/km),Particules (g/km),masse vide euro min (kg),masse vide euro max (kg),Champ V9,Date de mise à jour,Carrosserie,gamme
0,ALFA-ROMEO,159,159,159 1750 Tbi (200ch),M10ALFVP000G340,939AXN1B52C,ES,non,12,147.0,...,0.052,0.032,,0.002,1505,1505,715/2007*692/2008EURO5,juin-13,BREAK,MOY-SUPER
1,ALFA-ROMEO,159,159,159 2.0 JTDm (170ch) ECO,M10ALFVP000U221,939AXP1B54C,GO,non,9,125.0,...,,0.169,0.19,0.003,1565,1565,715/2007*692/2008EURO5,juin-13,BERLINE,MOY-SUPER
2,ALFA-ROMEO,159,159,159 2.0 JTDm (136ch),M10ALFVP000E302,939AXR1B64,GO,non,7,100.0,...,,0.149,0.175,0.001,1565,1565,715/2007*692/2008EURO5,juin-13,BERLINE,MOY-SUPER
3,ALFA-ROMEO,159,159,159 2.0 JTDm (136ch),M10ALFVP000F303,939AXR1B64B,GO,non,7,100.0,...,,0.149,0.175,0.001,1565,1565,715/2007*692/2008EURO5,juin-13,BERLINE,MOY-SUPER
4,ALFA-ROMEO,159,159,159 2.0 JTDm (170ch),M10ALFVP000G304,939AXS1B66,GO,non,9,125.0,...,,0.164,0.193,0.001,1565,1565,715/2007*692/2008EURO5,juin-13,BERLINE,MOY-SUPER


## 3. Analyze the data

### 3.1 Analyze Data

As usual, when loading data, it is important to make a fast analysis to understand on what we will be working on

In [11]:
df_car_pollution.describe(include = "all")

Unnamed: 0,Marque,Modèle dossier,Modèle UTAC,Désignation commerciale,CNIT,Type Variante Version (TVV),Carburant,Hybride,Puissance administrative,Puissance maximale (kW),...,HC (g/km),NOX (g/km),HC+NOX (g/km),Particules (g/km),masse vide euro min (kg),masse vide euro max (kg),Champ V9,Date de mise à jour,Carrosserie,gamme
count,44850,44850,44850,44850,44850,44850,44850,44850,44850.0,44850.0,...,10403.0,44547.0,34191.0,41708.0,44850.0,44850.0,44615,44850,44850,44850
unique,51,458,419,3582,44191,28781,13,2,,,...,,,,,,,13,3,10,7
top,MERCEDES-BENZ,VIANO,VIANO,VIANO 2.2 CDI,M10LADVP000T028,263AXG1B05,GO,non,,,...,,,,,,,715/2007*692/2008EURO5,juin-13,MINIBUS,MOY-INFER
freq,38450,14031,14031,5874,16,32,37778,44593,,,...,,,,,,,26426,43910,32744,20428
mean,,,,,,,,,11.018997,124.780834,...,0.030499,0.311837,0.224788,0.000961,2070.96165,2169.545284,,,,
std,,,,,,,,,5.554475,49.158804,...,0.018408,0.463112,0.041681,0.006469,342.872975,410.600541,,,,
min,,,,,,,,,1.0,10.0,...,0.008,0.001,0.038,0.0,825.0,825.0,,,,
25%,,,,,,,,,9.0,100.0,...,0.008,0.158,0.201,0.0,1976.0,2043.5,,,,
50%,,,,,,,,,10.0,120.0,...,0.031,0.197,0.22,0.001,2076.0,2185.0,,,,
75%,,,,,,,,,11.0,125.0,...,0.044,0.228,0.248,0.001,2256.0,2355.0,,,,


In [15]:
df_car_pollution.shape

(44850, 26)

In [17]:
df_car_pollution.columns

Index(['Marque', 'Modèle dossier', 'Modèle UTAC', 'Désignation commerciale',
       'CNIT', 'Type Variante Version (TVV)', 'Carburant', 'Hybride',
       'Puissance administrative', 'Puissance maximale (kW)',
       'Boîte de vitesse', 'Consommation urbaine (l/100km)',
       'Consommation extra-urbaine (l/100km)', 'Consommation mixte (l/100km)',
       'CO2 (g/km)', 'CO type I (g/km)', 'HC (g/km)', 'NOX (g/km)',
       'HC+NOX (g/km)', 'Particules (g/km)', 'masse vide euro min (kg)',
       'masse vide euro max (kg)', 'Champ V9', 'Date de mise à jour',
       'Carrosserie', 'gamme'],
      dtype='object')

In [18]:
df_car_pollution["Type Variante Version (TVV)"].value_counts()

263AXG1B05                  32
906AC35KNG71349EMCE21WA9    16
312PXA1AP0F                 16
S-DKZ111AACD7A9BDA5         16
S-DKZ111AADD7A9BDA5         16
                            ..
906AC35KMMC1349NSEB25UA7     1
906AC35KMMC1349NSEB23WA7     1
906AC35KMMC1349NSEB22WA7     1
906AC35KMMC1349NSEA25WX7     1
BZ90H6                       1
Name: Type Variante Version (TVV), Length: 28781, dtype: int64

### 3.2 Pre-process the data

Delete `nan` values in the objective

In [22]:
df_drop_nan = df_car_pollution.dropna()

Delete the columns `Date de mise à jour`

In [23]:
df_cleaned = df_drop_nan.drop(columns=['Date de mise à jour'])

Convert into onehot encoding

In [27]:
def get_dummies(df_input):
    df = df_input.copy()
    from sklearn import preprocessing
    encoder = preprocessing.LabelEncoder()
    for column_name in df.columns:
        if df[column_name].dtype == object:
            df[column_name] = encoder.fit_transform(df[column_name])
        else:
            pass
    return df, encoder


df_get_dummies, encoder = get_dummies(df_cleaned)

Drop lines containing at least a `nan`

In [29]:
df = df_get_dummies.dropna()
df

Unnamed: 0,Marque,Modèle dossier,Modèle UTAC,Désignation commerciale,CNIT,Type Variante Version (TVV),Carburant,Hybride,Puissance administrative,Puissance maximale (kW),...,CO type I (g/km),HC (g/km),NOX (g/km),HC+NOX (g/km),Particules (g/km),masse vide euro min (kg),masse vide euro max (kg),Champ V9,Carrosserie,gamme
2416,0,2,2,4,7,33,0,0,15,175.0,...,0.348,0.035,0.164,0.202,0.001,1845,1925,0,0,1
2417,0,2,2,4,8,34,0,0,15,175.0,...,0.348,0.035,0.164,0.202,0.001,1935,1955,0,0,1
2522,1,0,0,1,22,30,0,0,8,100.0,...,0.272,0.022,0.121,0.154,0.0,1680,1730,1,1,3
2523,1,0,0,2,23,31,0,0,10,120.0,...,0.272,0.022,0.121,0.154,0.0,1680,1730,1,1,3
2526,1,0,0,0,6,32,0,0,8,100.0,...,0.272,0.022,0.121,0.154,0.0,1609,1678,1,1,3
2549,1,4,4,10,0,1,0,0,12,147.0,...,0.109,0.023,0.135,0.161,0.0,1945,2033,2,1,3
2550,1,4,4,10,18,1,0,0,12,147.0,...,0.109,0.023,0.135,0.161,0.0,1945,2033,2,1,3
2551,1,4,4,9,1,2,0,0,13,147.0,...,0.109,0.023,0.135,0.161,0.0,1945,2033,2,1,3
2552,1,4,4,9,20,2,0,0,13,147.0,...,0.109,0.023,0.135,0.161,0.0,1945,2033,2,1,3
2553,1,4,4,14,15,3,0,0,12,147.0,...,0.1,0.02,0.165,0.188,0.0,1945,2033,2,1,3


In [30]:
df.columns

Index(['Marque', 'Modèle dossier', 'Modèle UTAC', 'Désignation commerciale',
       'CNIT', 'Type Variante Version (TVV)', 'Carburant', 'Hybride',
       'Puissance administrative', 'Puissance maximale (kW)',
       'Boîte de vitesse', 'Consommation urbaine (l/100km)',
       'Consommation extra-urbaine (l/100km)', 'Consommation mixte (l/100km)',
       'CO2 (g/km)', 'CO type I (g/km)', 'HC (g/km)', 'NOX (g/km)',
       'HC+NOX (g/km)', 'Particules (g/km)', 'masse vide euro min (kg)',
       'masse vide euro max (kg)', 'Champ V9', 'Carrosserie', 'gamme'],
      dtype='object')

## 4. Make your first Machine Learning model

### 4.1 Build a first machine learning model

Now you will be creating your first machine learning model: a RandomForest regression model. We will see in the next course a detailed comprehension of the algorithm. For now, we will be using the algorithm to train a model, then to predict the output and compute the performance of the model

In [31]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# 0. Define X and y
col_X = ['Marque', 'Modèle dossier', 'Modèle UTAC', 'Désignation commerciale',
       'CNIT', 'Type Variante Version (TVV)', 'Carburant', 'Hybride',
       'Puissance administrative', 'Puissance maximale (kW)',
       'Boîte de vitesse', 'Consommation urbaine (l/100km)',
       'Consommation extra-urbaine (l/100km)', 'Consommation mixte (l/100km)',
       'CO type I (g/km)', 'HC (g/km)', 'NOX (g/km)',
       'HC+NOX (g/km)', 'Particules (g/km)', 'masse vide euro min (kg)',
       'masse vide euro max (kg)', 'Champ V9', 'Carrosserie', 'gamme']
X = df[col_X]
y = df["CO2 (g/km)"].values.ravel() #values.ravel is used to make the dataset 2D to 1D

# 1. Define the model
model = LogisticRegression(max_iter=1000)

# 2. Learn the model
model.fit(X,y)

# 3. Predict on data
y_pred = model.predict(X)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Show the prediction

In [32]:
print(y_pred)

[199. 199. 172. 172. 161. 187. 187. 209. 209. 187. 209. 213. 213. 227.
 227. 213. 213. 187. 209. 213. 227. 217. 217. 217. 217. 217. 217. 217.
 230. 230. 230. 230. 230. 217. 230. 230. 109. 109. 109. 109. 114.]


Compute the accuracy score. You will need the function `sklearn.metrics.r2_score`. Documentation can be found: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html

In [33]:
from sklearn.metrics import accuracy_score
accuracy_score(y, y_pred)

0.8536585365853658

### 4.2 Save your machine learning model

In [34]:
import joblib
# Save the model as a pickle in a file
joblib.dump(model, "model.pkl")

['model.pkl']

## 5. Make a robust Machine Learning model

We have seen in the Course that Learning an Testing on the same data is a very bad practice in Machine Learning because you will be overfitting your data. Remember, what is the best model below :

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Overfitting.svg/1200px-Overfitting.svg.png" title="Python Logo" width = 300/></center>



### 5.1 Train / Test Split

First we need to divide our dataset into a train / test datasets. The function you will be using is `sklearn.model_selection.train_test_split`. Documentation can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [36]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% training and 20% test


### 5.2 Learn the model on the train data

Like in the previous section, you can now train your model

In [38]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)

### 5.3 Apply your model on the test data

Now apply you model to the test data

In [39]:
df_y_predicted = model.predict(X_test)

### 5.4 Compute the score

Compute the score

In [43]:
from sklearn.metrics import r2_score

r2 = r2_score(df_y_predicted, y_test)
print("Score:", r2)

Score: 0.9403929391867896


### 5.5 Save your machine learning model

Finally, save your machine learning model in pickle format

In [44]:
joblib.dump(model, "model.pkl")

['model.pkl']

<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=4159b792-047c-4d4b-8828-e08b006ee629' target="_blank">
 </img>
Created in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>