# Linear Regression in scikit-learn - Introduction

This dataset concerns housing values in the suburbs of Boston. The original dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. Here it is downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/).

Your goal is to create and train a model that estimates the median value of owner-occupied homes (`MEDV`) from the other columns.

### Dataset description (columns)

     1. CRIM     per capita crime rate by town
     2. ZN       proportion of residential land zoned for lots over 
                 25,000 sq.ft.
     3. INDUS    proportion of non-retail business acres per town
     4. CHAS     Charles River dummy variable (= 1 if tract bounds 
                 river; 0 otherwise)
     5. NOX      nitric oxides concentration (parts per 10 million)
     6. RM       average number of rooms per dwelling
     7. AGE      proportion of owner-occupied units built prior to 1940
     8. DIS      weighted distances to five Boston employment centres
     9. RAD      index of accessibility to radial highways
    10. TAX      full-value property-tax rate per 10,000 USD
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks 
                 by town
    13. LSTAT    % lower status of the population
    14. MEDV     Median value of owner-occupied homes in 1000's of dollars
    

In [None]:
import pandas as pd
import numpy as np

Load and display data.

In [None]:
# Run this cell if you are using Google Colab; it downloads housing.csv
!wget https://raw.githubusercontent.com/PrzemekSekula/DeepLearningClasses1/master/LinearRegressionSKLearn/housing.csv

--2020-11-11 20:43:07--  https://raw.githubusercontent.com/PrzemekSekula/DeepLearningClasses1/master/LinearRegressionSKLearn/housing.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38448 (38K) [text/plain]
Saving to: ‘housing.csv’


2020-11-11 20:43:08 (2.38 MB/s) - ‘housing.csv’ saved [38448/38448]



In [None]:
df = pd.read_csv('housing.csv')

print(df.shape)
df.head()

(506, 14)


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


### Task 1
Select X (columns `['CRIM', 'TAX', 'RM']`) and y (column `MEDV`)

In [None]:
x = df[['CRIM', 'TAX', 'RM']]
x.head()

In [None]:
y = df['MEDV']
y.head()

### Task 2
Split the data into two subsets:
- train subset: 70% of data
- test subset: 30% of data
- set random_state to 1

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, random_state=1)
print("X train: ", x_train.shape)
print("X test: ", x_test.shape)
print("Y train: ", y_train.shape)
print("Y test: ", y_test.shape)
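
The role of `test_size` and `random_state` can be illustrated with a NumPy-only sketch. It is not scikit-learn's implementation: the helper name `shuffled_split` is hypothetical, and because it uses a different random number generator it will not select the same rows as `train_test_split`. It only shows the idea of a seeded shuffle followed by a cut, with the test share rounded up, which is consistent with the 354 / 152 shapes reported later in this notebook.

```python
import math
import numpy as np

def shuffled_split(n_samples, test_size=0.3, seed=1):
    """Hypothetical helper: return (train_idx, test_idx) from a seeded shuffle."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)         # shuffled row indices
    n_test = math.ceil(test_size * n_samples)  # test share, rounded up
    return order[n_test:], order[:n_test]

train_idx, test_idx = shuffled_split(506)
print(len(train_idx), len(test_idx))

# The same seed reproduces the same split; a different seed generally does not.
_, test_idx_again = shuffled_split(506)
print(np.array_equal(test_idx, test_idx_again))
```

Fixing the seed is what makes results comparable between runs and between students working on the same task.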

### Task 3
Create and train a linear regression model.

In [None]:
from sklearn.linear_model import LinearRegression

# Fit ordinary least squares on the training subset
model = LinearRegression().fit(x_train, y_train)
model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
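
`LinearRegression` fits an ordinary least squares model, here with an intercept (`fit_intercept=True` in the output above). As a rough sketch of what that means, the block below solves the same kind of problem with `np.linalg.lstsq` on synthetic, noise-free data. The data, the coefficient values and all variable names are assumptions made for illustration; they are not taken from the housing dataset.

```python
import numpy as np

# Synthetic data with known coefficients (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y = X @ true_coef + true_intercept  # no noise, so the fit should be exact

# Prepend a column of ones so the first fitted weight acts as the intercept
X_with_ones = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_with_ones, y, rcond=None)
intercept, coef = beta[0], beta[1:]

print("intercept:", intercept)
print("coefficients:", coef)
```

On a fitted scikit-learn model the corresponding values are available as `model.intercept_` and `model.coef_`.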

### Task 4
Compute the $R^2$ coefficient for the train and test datasets. Use `model.score()` to do it.

$$R^2=1-\frac{\sum{(y-\hat{y})^2}}{\sum{(y-\overline{y})^2}}$$

Where:
- $y$ - real `y` values
- $\hat{y}$ - model predictions
- $\overline{y}$ - mean value of `y`
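
The formula can be checked by hand with a few lines of NumPy. This is a minimal sketch on made-up numbers; the function name `r2_manual` and the toy arrays are hypothetical and only mirror the definition above.

```python
import numpy as np

def r2_manual(y, y_pred):
    """R^2 = 1 - sum((y - y_pred)^2) / sum((y - mean(y))^2)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]
r2_value = r2_manual(y_true, y_hat)
print(r2_value)

# Always predicting the mean of y gives R^2 = 0 by construction
r2_baseline = r2_manual(y_true, [np.mean(y_true)] * len(y_true))
print(r2_baseline)
```

A value of 1 means perfect predictions, 0 means the model does no better than predicting the mean, and negative values are possible for models that do worse than that.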

In [None]:
print("R2 train:", model.score(x_train,y_train))
print("R2 test:", model.score(x_test, y_test))

R2 train: 0.5096603576929335
R2 test: 0.6901893330926419


### MAPE - Mean Absolute Percentage Error

$$MAPE = \frac{1}{n} \sum{ \left\lvert{\frac{y-\hat{y}}{y}}\right\rvert}$$

Where:
- $y$ - real `y` values
- $\hat{y}$ - model predictions
- $n$ - number of samples
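
A small worked example may help to read the formula. The three values below are invented for illustration: the individual errors are 20%, 10% and 0%, so their mean is 10%.

```python
import numpy as np

# Toy values (not from the housing dataset)
y_true = np.array([10.0, 20.0, 40.0])
y_hat = np.array([12.0, 18.0, 40.0])

abs_pct_errors = np.abs((y_true - y_hat) / y_true)  # 0.2, 0.1, 0.0
mape_toy = abs_pct_errors.mean()

print("MAPE [%]:", 100 * mape_toy)
```

Because each error is divided by $y$, MAPE is undefined when a true value is zero. That is not an issue for home values, which are positive.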

In [None]:
y_pred = model.predict(x_train)
mape_train = np.mean(np.abs((y_train - y_pred) / y_train))
print("Train MAPE [%]:", mape_train * 100)

Train MAPE [%]: 21.552430568659016


### Task 5
Create a function `mape` that returns the $MAPE$ value given $X$, $y$ and the model used to create the $\hat{y}$ estimates. Then use your function to compute $MAPE$ for the train and test datasets.

In [None]:
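def mape(model, X, y):
    """Return the mean absolute percentage error of `model` on (X, y), in percent."""
    y_pred = model.predict(X)
    # Take the absolute value of the whole ratio, as in the formula above
    return 100 * np.mean(np.abs((y - y_pred) / y))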

In [None]:
print ('Train MAPE:', mape(model, x_train, y_train))
print ('Test MAPE:', mape(model, x_test, y_test))

Train MAPE: 21.552430568659016
Test MAPE: 20.78375470750852


## Random forest regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor().fit(x_train,y_train)

print ('Train MAPE:', mape(model, x_train, y_train))
print ('Test MAPE:', mape(model, x_test, y_test))
# The model overfits here: train MAPE is much lower than test MAPE

Train MAPE: 7.6961234623951995
Test MAPE: 16.99907377803256


### Task 6
Experiment with the `min_samples_leaf` parameter to reduce overfitting.

In [None]:
model = RandomForestRegressor(min_samples_leaf=15).fit(x_train,y_train)

print ('Train MAPE:', mape(model, x_train, y_train))
print ('Test MAPE:', mape(model, x_test, y_test))

Train MAPE: 17.274656322415417
Test MAPE: 18.725356761040242
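
One simple way to quantify overfitting is the gap between test and train error. The sketch below only does arithmetic on the MAPE values printed by the two random forest runs above; the dictionary layout and names are hypothetical. Since no `random_state` was passed to `RandomForestRegressor`, rerunning the notebook will give slightly different numbers.

```python
# MAPE values [%] copied from the two random forest runs above
reported = {
    "default": {"train": 7.6961234623951995, "test": 16.99907377803256},
    "min_samples_leaf=15": {"train": 17.274656322415417, "test": 18.725356761040242},
}

# Gap between test and train error, in percentage points
gaps = {name: r["test"] - r["train"] for name, r in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: test - train = {gap:.2f} percentage points")
```

The gap shrinks from about 9.3 to about 1.5 percentage points, so the regularized forest overfits far less. Its test MAPE is nevertheless higher (18.7% versus 17.0%), which shows that a small gap alone does not make a model better.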


# Part 2

### Task 7
Select all 13 features as $X$ and split the dataset into two subsets, using the same split ratio and random state as in Task 2.

In [None]:
df.head()


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


In [None]:
x = df[['CRIM','ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']]
x.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33


In [None]:
y = df['MEDV']
y.head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: MEDV, dtype: float64

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, random_state=1)
print("X train: ", x_train.shape)
print("X test: ", x_test.shape)
print("Y train: ", y_train.shape)
print("Y test: ", y_test.shape)

X train:  (354, 13)
X test:  (152, 13)
Y train:  (354,)
Y test:  (152,)


### Task 8
Train and test a linear regression model. Compare the results with the previous ones.

In [None]:
model = LinearRegression().fit(x_train,y_train)

print ('Train MAPE:', mape(model, x_train, y_train))
print ('Test MAPE:', mape(model, x_test, y_test))

Train MAPE: 16.714624689642484
Test MAPE: 16.207536032281517
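
For the comparison, the test MAPE of the three-feature linear model from Task 5 can be set against the value just printed. The snippet is plain arithmetic on those two reported numbers; the variable names are hypothetical.

```python
# Test MAPE [%] of the linear models, copied from the outputs in this notebook
test_mape_3_features = 20.78375470750852
test_mape_13_features = 16.207536032281517

abs_drop = test_mape_3_features - test_mape_13_features  # percentage points
rel_drop = abs_drop / test_mape_3_features                # fraction of the original error

print(f"Absolute improvement: {abs_drop:.2f} percentage points")
print(f"Relative improvement: {100 * rel_drop:.1f}%")
```

Using all 13 features lowers the test error by roughly 4.6 percentage points, which is about a fifth of the original error.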


### Task 9
Train and test a random forest model (keep all parameters at their defaults). Does your model suffer from overfitting or underfitting?

In [None]:
model = RandomForestRegressor().fit(x_train,y_train)

print ('Train MAPE:', mape(model, x_train, y_train))
print ('Test MAPE:', mape(model, x_test, y_test))

Train MAPE: 4.245851611741198
Test MAPE: 11.559038472404808


### Task 10
Try to modify the `min_samples_leaf` parameter to get the best model possible.

In [None]:
model = RandomForestRegressor(min_samples_leaf=11).fit(x_train,y_train)

print ('Train MAPE:', mape(model, x_train, y_train))
print ('Test MAPE:', mape(model, x_test, y_test))

Train MAPE: 10.894179800515962
Test MAPE: 13.932058830126358
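
Instead of trying single values by hand, the search can be written as a loop. The sketch below is one possible approach, not the notebook's method. It runs on synthetic data so that it is self-contained, and the candidate values, the `n_estimators=50` setting and the names `X_val`, `results` and `best_leaf` are all assumptions. It scores each candidate on a validation split rather than on the test set, so that the test set stays untouched for the final evaluation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def mape(model, X, y):
    """Mean absolute percentage error of `model` on (X, y), in percent."""
    y_pred = model.predict(X)
    return 100 * np.mean(np.abs((y - y_pred) / y))

# Synthetic regression data with strictly positive targets (illustrative only)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 3))
y = 30 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=2.0, size=300)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

results = {}
for leaf in [1, 3, 5, 11, 15]:
    forest = RandomForestRegressor(n_estimators=50, min_samples_leaf=leaf, random_state=1)
    forest.fit(X_train, y_train)
    results[leaf] = (mape(forest, X_train, y_train), mape(forest, X_val, y_val))
    print(f"min_samples_leaf={leaf}: train {results[leaf][0]:.2f}%, validation {results[leaf][1]:.2f}%")

# Pick the candidate with the lowest validation error
best_leaf = min(results, key=lambda k: results[k][1])
print("Best min_samples_leaf on the validation split:", best_leaf)
```

With the housing data, the same loop would use the notebook's `x_train` and `y_train`, further split into training and validation parts.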
