<a href="https://githubtocolab.com/alevant/mlcourse/MultivariateReression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>

# Multivariate regression exercise

October 2021  
Data Science study group
  
Practical example of multivariate regression to illustrate good practices in notebooks.

## Libraries

In [None]:
!pip install shap


In [None]:
!git clone https://github.com/alevant/mlcourse mlcourse

In [None]:
#General purpose
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

#Visualization
import seaborn as sns
import matplotlib.pyplot as plt

#Encoding
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import LabelEncoder

#Scaler
from sklearn.preprocessing import StandardScaler

#Model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

#Metrics
from sklearn import metrics
import shap

## Import data

In [None]:
df = pd.read_csv('CO2_Emissions_Canada.csv')

In [None]:
df.head()

Dataset description  

Make -> Company  
Vehicle Class -> Class of vehicle, based on utility, capacity and weight  
Transmission -> Transmission type and number of gears  
Fuel Consumption City (L/100 km) -> Fuel consumption on city roads (L/100 km)  
Fuel Consumption Hwy (L/100 km) -> Fuel consumption on highways (L/100 km)  
Fuel Consumption Comb (L/100 km) -> Combined fuel consumption (55% city, 45% highway), in L/100 km
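
As a quick illustration of the 55/45 weighting in the combined rating, the sketch below recomputes a combined value from a city and a highway figure. The function name `combined_consumption` and the sample numbers are hypothetical; values in the dataset may differ slightly because of rounding.

```python
def combined_consumption(city_l_100km, hwy_l_100km):
    """Weighted combined rating: 55% city, 45% highway (both in L/100 km)."""
    return 0.55 * city_l_100km + 0.45 * hwy_l_100km

# Hypothetical vehicle: 10.0 L/100 km in the city, 7.0 L/100 km on the highway
example = combined_consumption(10.0, 7.0)
print(round(example, 2))
```

With the real data, the same formula can be applied to the two `Fuel Consumption ... (L/100 km)` columns and compared against the combined column.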

## EDA

In [None]:
df.info()

In [None]:
df.describe()

### Univariate analysis

In [None]:
def bar_graph_values(col):
    sns.countplot(y =col,
                  data=df,
                  orient = "h",
                  order=df[col].value_counts().index)

In [None]:
bar_graph_values('Vehicle Class')

In [None]:
bar_graph_values('Engine Size(L)')

In [None]:
#More detailed description
def generate_stats(col):
    print(col+" statistics")
    print("Max value: ",df[col].max())
    print("Min value: ",df[col].min())
    print("Mode: ",df[col].mode())
    print("Avg value: ",df[col].mean())
    print("Std value: ",df[col].std())

In [None]:
generate_stats('Engine Size(L)')

In [None]:
bar_graph_values('Cylinders')

In [None]:
bar_graph_values('Transmission')

In [None]:
bar_graph_values('Fuel Type')

In [None]:
df['CO2 Emissions(g/km)'].hist()

### Bivariate analysis

In [None]:
def box_graph_bivar(colx,coly):
    plt.figure(figsize = (10,10))
    sns.boxplot(data = df, x=colx, y=coly, palette = 'cubehelix')
    plt.xticks(rotation = 90)
    plt.show()

In [None]:
box_graph_bivar('Make', 'CO2 Emissions(g/km)')

In [None]:
box_graph_bivar('Vehicle Class', 'CO2 Emissions(g/km)')

In [None]:
box_graph_bivar('Fuel Type', 'CO2 Emissions(g/km)')

In [None]:
# City Fuel Consumption vs Highway Fuel Consumption with Fuel Category
plt.figure(figsize = (10,10))
sns.scatterplot(data=df, 
                x='Fuel Consumption City (L/100 km)', 
                y='Fuel Consumption Hwy (L/100 km)',
                hue='Fuel Type')
plt.show()

In [None]:
df.groupby(by = 'Fuel Type')['Fuel Consumption Comb (L/100 km)'].mean()

In [None]:
# Pivot table with Cylinders, Fuel Type and CO2 Emissions
df.pivot_table(values = ['CO2 Emissions(g/km)'], index = ['Cylinders','Fuel Type'], aggfunc = 'mean')

In [None]:
# Heatmap for correlation values
plt.figure(figsize = (10,10))
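# Restrict to numeric columns; recent pandas versions raise on text columns in corr()
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True)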
plt.show()

All numerical variables are highly correlated with CO2 emissions.
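
To check a claim like this numerically rather than reading it off the heatmap, one option is to rank the numeric columns by absolute correlation with the target. The sketch below uses a small made-up DataFrame (`toy`), not the course dataset; the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the dataset
toy = pd.DataFrame({
    "engine": [1.5, 2.0, 2.5, 3.0, 3.5],
    "mpg": [40, 35, 30, 26, 22],
    "label": ["a", "b", "c", "d", "e"],  # non-numeric, dropped below
    "co2": [150, 180, 210, 240, 275],
})

# Keep numeric columns, correlate with the target, rank by absolute value
numeric = toy.select_dtypes(include="number")
ranking = (numeric.corr()["co2"]
           .drop("co2")
           .abs()
           .sort_values(ascending=False))
print(ranking)
```

With the real data, replacing `toy` with `df` and `"co2"` with `'CO2 Emissions(g/km)'` should give the same kind of ranking.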

## ETL

In [None]:
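#From transmission, I only need the number of gears (trailing digits)
#Taking only the last character would turn 10 gears into 0
#Transmissions without a gear count (e.g. 'AV') are encoded as 0
df['Gears'] = (df['Transmission'].str.extract(r'(\d+)$', expand=False)
                                 .fillna('0')
                                 .astype(int))
bar_graph_values('Gears')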

### Outlier Treatment
Tukey's fences (1.5 × IQR rule). Only the upper fence is applied here.

In [None]:
#Relevant quartiles
Q1=df['CO2 Emissions(g/km)'].quantile(0.25)
Q3=df['CO2 Emissions(g/km)'].quantile(0.75)
#Interquartile range
IQR=Q3-Q1
#Filter: keep rows at or below the upper fence
df = df.loc[df['CO2 Emissions(g/km)']<= (Q3+1.5*IQR)]
#Final shape
df.shape

After filtering, the DataFrame still has enough rows to fit a regression.

### Select meaningful data

In [None]:
cols = ['Vehicle Class',
       'Engine Size(L)',
       'Cylinders',
       'Gears',
       'Fuel Type',
       'Fuel Consumption Comb (mpg)',
       'CO2 Emissions(g/km)']
#Copy so the encoding steps below do not trigger SettingWithCopyWarning
model_df = df[cols].copy()
model_df.head()

### Encoding

In [None]:
#Vehicle Class
le_vehicle = LabelEncoder()
model_df['Vehicle Class'] = le_vehicle.fit_transform(model_df['Vehicle Class'])

In [None]:
#Fuel Type
le_fuel = LabelEncoder()
model_df['Fuel Type'] = le_fuel.fit_transform(model_df['Fuel Type'])

### Scaler

In [None]:
scaler = StandardScaler()
scaler.fit(model_df)
scaler_df = pd.DataFrame(scaler.transform(model_df), 
                      columns = model_df.columns)
scaler_df.head()

## Model

### Split datasets

In [None]:
target = scaler_df['CO2 Emissions(g/km)']
features = scaler_df[scaler_df.columns.drop('CO2 Emissions(g/km)')]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2, random_state = 0)

### Building & Training

In [None]:
regr = LinearRegression()
regr.fit(X_train, y_train)

## Evaluation

In [None]:
y_pred = regr.predict(X_test)

In [None]:
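#MAE, MSE and RMSE are in standardized units because the target was scaled
print("MAE: ", metrics.mean_absolute_error(y_test, y_pred))
print("MSE: ", metrics.mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("R2:  ", metrics.r2_score(y_test, y_pred))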

### Shapley

SHAP (SHapley Additive exPlanations) draws on game theory to explain black-box functions, such as the `predict` method of a machine learning model.

It measures how much each feature contributes to the output of a given prediction.
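
To make the idea of a per-feature contribution concrete, here is a minimal brute-force sketch of Shapley values for a toy linear model. It is an illustration under stated assumptions, not how the `shap` library computes its values internally: the weights, the background rows and the helper `value` are all made up, and features outside a coalition are simply averaged over the background rows.

```python
from itertools import combinations
from math import factorial

import numpy as np

# Toy linear model f(x) = w . x + b (all numbers are made up)
w = np.array([2.0, -1.0, 0.5])
b = 3.0
background = np.array([[0.0, 1.0, 2.0],
                       [2.0, 3.0, 4.0],
                       [4.0, 5.0, 0.0]])
x = np.array([1.0, 0.0, 5.0])  # the instance to explain

def value(subset):
    """Mean prediction when the features in `subset` are fixed to x
    and the remaining features keep their background values."""
    idx = np.array(list(subset), dtype=int)
    data = background.copy()
    data[:, idx] = x[idx]
    return float(np.mean(data @ w + b))

n = len(x)
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for size in range(n):
        for s in combinations(others, size):
            # Shapley weight of a coalition of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            # Marginal contribution of feature i to coalition s
            phi[i] += weight * (value(s + (i,)) - value(s))

print(phi)                                         # per-feature contributions
print(phi.sum(), value(range(n)) - value(()))      # additivity check
```

The contributions add up to the difference between the prediction for `x` and the average prediction over the background rows, which is the additivity property the waterfall plot visualizes. For a linear model under this setup, each contribution reduces to the coefficient times the feature's deviation from its background mean.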

In [None]:
explainer = shap.Explainer(regr.predict, X_train)
shap_values = explainer(X_train)

In [None]:
shap.plots.waterfall(shap_values[0])

In [None]:
shap.plots.bar(shap_values)

As expected, the most important feature for predicting emissions is Fuel Consumption Comb (mpg).