# Predict the miles per gallon for a car
In this lab you'll be building a model to predict the miles per gallon for a car.
The lab is based on an open dataset from [the UCI machine learning database](https://archive.ics.uci.edu/ml/index.php).

At the end of this lab you will have a working model that you can deploy and use.
Follow the steps in this lab to train the model.

## Step 1: Load the data 
The first step in the process of building the model is to load the data from disk. The data is contained in a file called `auto-mpg.csv`. Use the pandas function [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) to do this.

In [2]:
import pandas as pd

In [3]:
df_data = pd.read_csv('auto-mpg.csv',na_values=['?'])
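
To see what the `na_values` argument does, here is a minimal sketch on a made-up three-row sample. The sample rows and the names `sample_csv` and `df_sample` are assumptions for illustration only; they are not taken from `auto-mpg.csv`.

```python
import io
import pandas as pd

# Hypothetical sample that mimics two columns of the real file.
sample_csv = io.StringIO(
    "mpg,horsepower\n"
    "18.0,130\n"
    "25.0,?\n"
    "15.0,165\n"
)

# na_values=['?'] tells pandas to treat the string '?' as a missing value.
df_sample = pd.read_csv(sample_csv, na_values=['?'])

print(df_sample.dtypes)        # column types after parsing
print(df_sample.isna().sum())  # number of missing values per column
```

Without `na_values`, the `?` entry would force pandas to read the whole `horsepower` column as text.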

## Step 2: Prep the data
In the previous step you've loaded the dataset as a pandas dataframe. 
Let's see what the quality of the data is. Are there any columns with null values? What are the datatypes of the columns? Are there other things that need to be fixed?

Remember, you can use the [info](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) method on a dataframe to get information about null values. Also, you can use the [describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method to get some statistics about the data.

In [4]:
df_data.describe()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin |
|---|---|---|---|---|---|---|---|---|
| count | 398.0 | 398.0 | 398.0 | 392.0 | 398.0 | 398.0 | 398.0 | 398.0 |
| mean | 23.514573 | 5.454774 | 193.425879 | 104.469388 | 2970.424623 | 15.56809 | 76.01005 | 1.572864 |
| std | 7.815984 | 1.701004 | 104.269838 | 38.49116 | 846.841774 | 2.757689 | 3.697627 | 0.802055 |
| min | 9.0 | 3.0 | 68.0 | 46.0 | 1613.0 | 8.0 | 70.0 | 1.0 |
| 25% | 17.5 | 4.0 | 104.25 | 75.0 | 2223.75 | 13.825 | 73.0 | 1.0 |
| 50% | 23.0 | 4.0 | 148.5 | 93.5 | 2803.5 | 15.5 | 76.0 | 1.0 |
| 75% | 29.0 | 8.0 | 262.0 | 126.0 | 3608.0 | 17.175 | 79.0 | 2.0 |
| max | 46.6 | 8.0 | 455.0 | 230.0 | 5140.0 | 24.8 | 82.0 | 3.0 |


In [5]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      392 non-null float64
weight          398 non-null int64
acceleration    398 non-null float64
model year      398 non-null int64
origin          398 non-null int64
car name        398 non-null object
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB


In [6]:
df_data.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |


### Let's take a look at the datatype 
Look at the data column types. Which column stands out and why? Fix the column and convert it to datatype float64.
Take a look at [the selection methods](https://pandas.pydata.org/pandas-docs/stable/indexing.html) that pandas provides to get access to specific columns. Use the [astype](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html) method on a series to convert its type.

In [7]:
df_data = df_data.dropna()

## Solution 
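In the raw file the `horsepower` column contains the value '?'. This puts two datatypes in one column: strings and numbers. We can fix this in several ways. In this case only 6 of the 398 entries are '?' (you can check this with `value_counts()`), so we can simply remove those entries and then change the column type to int64. Note that because the data was loaded with `na_values=['?']`, pandas has already turned these entries into null values, and the `dropna()` call above removes them.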

In [None]:
df_data = df_data[df_data.horsepower != '?']

In [None]:
df_data.horsepower.unique()

In [None]:
df_data['horsepower'] = df_data['horsepower'].astype('int64')

In [None]:
df_data.info()
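
The solution can also be tried on a small made-up frame in which the '?' was not converted while loading, so the column really does hold text. The rows and the name `df_toy` below are assumptions for illustration, not data from the lab file.

```python
import pandas as pd

# Made-up rows: horsepower holds text because of the '?' entry.
df_toy = pd.DataFrame({
    'mpg': [18.0, 25.0, 15.0, 16.0],
    'horsepower': ['130', '?', '165', '150'],
})

# Count how often each value occurs, including the '?' placeholder.
print(df_toy['horsepower'].value_counts())

# Keep only the rows with a real value, then convert the column to integers.
df_toy = df_toy[df_toy['horsepower'] != '?'].copy()
df_toy['horsepower'] = df_toy['horsepower'].astype('int64')

print(df_toy.dtypes)
```

The `.copy()` call is there so that the assignment on the next line works on an independent frame rather than on a slice of the original.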

## Step 3: Analyse the data 
Linear regression models rely on the assumption that there is a linear relationship between the input variables you choose and the output that you want to predict. So before we actually build the model, we need to select features that are correlated with the MPG value of a car.

The first step is to look at the correlation between the inputs and the outputs. A good way to see correlations between the input features and the output is to make a correlation heatmap.

Use the [corr](http://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.corr.html) method on your dataframe to calculate the Pearson correlation coefficient. Then use the [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) function from the seaborn package to visualize the correlations.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
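# Use only the numeric columns; 'car name' is text and has no correlation.
df_data.select_dtypes(include='number').corr(method='pearson')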

In [None]:
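# Use only the numeric columns; 'car name' is text and has no correlation.
sns.heatmap(data=df_data.select_dtypes(include='number').corr())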

Which correlations look strong and could be useful features for our model?
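
One way to answer this question is to rank the features by the absolute value of their correlation with the target. The sketch below does this on synthetic data; the frame `df_demo`, its columns, and the numbers used to generate it are assumptions for illustration and do not come from the lab dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'weight' drives 'mpg', 'noise' is unrelated to it.
rng = np.random.default_rng(0)
weight = rng.uniform(1600, 5000, size=200)
df_demo = pd.DataFrame({
    'weight': weight,
    'mpg': 50 - 0.008 * weight + rng.normal(0, 2, size=200),
    'noise': rng.normal(0, 1, size=200),
})

# Correlation of every feature with the target, without the target itself.
corr_with_mpg = df_demo.corr(method='pearson')['mpg'].drop('mpg')

# Sort by strength; the sign only tells the direction of the relationship.
order = corr_with_mpg.abs().sort_values(ascending=False).index
ranked = corr_with_mpg.reindex(order)
print(ranked)
```

A strongly negative value is just as useful as a strongly positive one, which is why the ranking uses the absolute value.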

## Step 4: Let's build a model

For this regression problem we are going to build a linear regression model. As discussed during the workshop, there are three possibilities.

* Standard linear regression
* Ridge 
* Lasso 

For more information about linear regression, see [this, somewhat statistical, blog](https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-lasso-regression/).

Let's first build the model with standard linear regression! Do you remember the steps?

* Define the features and target
* Scale the data 
* Split the data into train and test sets 
* Build the model 
* Validate the model with the root mean squared error
* Optimize the model 

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
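
The steps listed above, up to the validation with the root mean squared error, can be sketched as follows. This is only an illustration on synthetic data, not the lab solution: the arrays `X` and `y` and the coefficients used to generate them are made up. The sketch also splits the data before scaling and fits the scaler on the training set only, which is a deliberate change in the order of the steps so that no information from the test set leaks into the scaling.

```python
import numpy as np
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define the features and target (synthetic stand-ins for weight and horsepower).
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(1600, 5000, 300), rng.uniform(46, 230, 300)])
y = 50 - 0.006 * X[:, 0] - 0.04 * X[:, 1] + rng.normal(0, 2, 300)

# Split the data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale the data: fit on the training set, reuse the same scaling for the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build the model.
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Validate the model with the root mean squared error on the test set.
rmse = sqrt(mean_squared_error(y_test, model.predict(X_test_scaled)))
print(f"RMSE: {rmse:.2f}")
```

The RMSE is expressed in the same unit as the target, so for the real dataset it would read as an average error in miles per gallon.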

## Step 5: Validate your model
Now that you have a model it is important to test whether the model fits the data well enough. For this we're going to use the `score` method on the model that you've trained. This returns [the R-squared score](http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit). This score expresses how well the observations fit the line you've trained.

You'll probably notice that the score is pretty good for the model. This makes sense, since you're using the whole dataset to validate your model. Remember from the slides that this is a bad idea. So go ahead, add another cell to the top of this notebook and [split the dataset](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Use the training set to retrain your model and use the test set to validate it.
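
A minimal sketch of this comparison, assuming synthetic data in place of the lab dataset (the arrays `X` and `y` and the names `r2_train` and `r2_test` are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with one informative feature.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 1))
y = 3 * X[:, 0] + rng.normal(size=200)

# Hold out a quarter of the rows for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=3
)

model = LinearRegression().fit(X_train, y_train)

# score() returns the R-squared value for the data you pass in.
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(f"R^2 on training data: {r2_train:.3f}")
print(f"R^2 on test data:     {r2_test:.3f}")
```

The score on the test set is the more honest estimate, because the model has not seen those rows during training.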

## Step 6: Optimize the model
As explained before, linear regression models are sensitive to outliers. When you visualize the data for the model you will see that there are a few outliers. To lessen the effects of outliers on the model you can use a slightly different linear regression model, called ridge regression, which adds a penalty on large coefficients.

Try to train and validate a ridge regression model and see if that improves the situation.

In [None]:
from sklearn.linear_model import Ridge
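
As a rough sketch of what training and validating a ridge model could look like, the example below compares it with plain linear regression on synthetic data with two nearly identical features. The data and the choice `alpha=1.0` are assumptions for illustration; a suitable `alpha` for the lab dataset has to be found by experiment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: x2 is almost a copy of x1, only x1 drives the target.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Same data, two models: ordinary least squares and ridge.
ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("OLS R^2 on test:   ", ols.score(X_test, y_test))
print("Ridge R^2 on test: ", ridge.score(X_test, y_test))
```

The ridge penalty pulls the coefficients toward zero, so the ridge coefficient vector is never longer than the one from ordinary least squares on the same data.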

### Another alternative model
Ridge regression is one alternative to regular linear regression. There's another, called [lasso regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html), that uses a slightly different approach to fix outliers and other problems. Give lasso regression a shot as well and see how that model scores.

In [None]:
from sklearn.linear_model import Lasso
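
A small sketch of lasso regression on synthetic data, where only the first of three features influences the target. The data and the value `alpha=0.5` are assumptions chosen to make the effect visible; they are not recommendations for the lab dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature carries signal.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = 4 * X[:, 0] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)

# The L1 penalty can set the coefficients of unhelpful features to exactly zero.
lasso = Lasso(alpha=0.5).fit(X_train, y_train)

print("Lasso coefficients:", lasso.coef_)
print("Lasso R^2 on test: ", lasso.score(X_test, y_test))
```

Because lasso can drop features entirely, the fitted coefficients double as a rough form of feature selection.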

## Step 7: Perform extra validation using cross validation
The previous validation steps work, but you can get unlucky with the random split of the data. A good trick to dampen the effects of the random split is to use [cross validation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html). Give it a shot!

In [None]:
from sklearn.model_selection import cross_val_score
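
A minimal sketch of cross validation with `cross_val_score`, again on synthetic data (the arrays `X` and `y` and the choice of 5 folds are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with two informative features.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

# Train and score the model 5 times, each time holding out a different fold.
# For regressors the default score is R-squared.
scores = cross_val_score(LinearRegression(), X, y, cv=5)

print("Score per fold:", scores)
print("Mean score:    ", scores.mean())
print("Std of scores: ", scores.std())
```

A large spread between the folds is a hint that a single train and test split could have given a misleading score.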