<a href="https://colab.research.google.com/github/ralsouza/machine_learning_python/blob/master/notebooks/03_machine_learning_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Define the business problem
Let's create a predictive model that is able to predict the house's prices based on a set of variables about in several houses in an neibourhood in Boston.

Dataset: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

Variables
There are 14 attributes in each case of the dataset. They are:
* CRIM - per capita crime rate by town
* ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS - proportion of non-retail business acres per town.
* CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
* NOX - nitric oxides concentration (parts per 10 million)
* RM - average number of rooms per dwelling
* AGE - proportion of owner-occupied units built prior to 1940
* DIS - weighted distances to five Boston employment centres
* RAD - index of accessibility to radial highways
* TAX - full-value property-tax rate per $10,000
* PTRATIO - pupil-teacher ratio by town
* B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* LSTAT - % lower status of the population
* MEDV - Median value of owner-occupied homes in $1000's

# 2. Model Evaluation
https://scikit-learn.org/stable/modules/model_evaluation.html

## 2.1 Metrics to Regression Algorithms

- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- R Squared (R²)
- Adjusted R Squared (R²)
- Mean Square Percentage Error (MSPE)
- Mean Absolute Percentage Error (MAPE)
- Root Mean Squared Logarithmic Error (RMSLE)


### 2.1.1 MSE
Probably the most simple and comum metric to regression evaluation, but useless too. The MSE measures the mean squared error. To each point, calculates the squared difference between predictions and the real value of target variable and then calculates the mean of these values.

How bigger this value, worse the model will be. This value never be negative, since we are elevating the individual prediction errors to square, but could be zero to a perfect model.

The higher the percentage, the worse the performance.

In [None]:
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

# Loading data
file = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
col = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = col)
array = data.values


# Separating the array on input and output components
X = array[:,0:13]
Y = array[:,13]

# Define train and teste datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)

# Making model
model = LinearRegression()

# Training the model
model.fit(X_train,Y_train)

# Making Predictions
Y_pred = model.predict(X_test)

# Results
mse = mean_squared_error(Y_test,Y_pred)
print('MSE: ', mse)

MSE:  28.53045876597476


### 2.1.2 MAE
Mean Absolute Error, is the difference of the sum between predictions and real values. Provides how wrong are our predictions, the value `0` indicates that there are no errors.

In [None]:
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression

# Loading data
file = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
col = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = col)
array = data.values


# Separating the array on input and output components
X = array[:,0:13]
Y = array[:,13]

# Define train and teste datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)

# Making model
model = LinearRegression()

# Training the model
model.fit(X_train,Y_train)

# Making Predictions
Y_pred = model.predict(X_test)

# Results
mae = mean_absolute_error(Y_test,Y_pred)
print('MAE: ', mae)

MAE:  3.455034932248358


### 2.1.3 Rˆ2
This metric provides the precision level about the observated values. Also called coefficient of determination.
Values between `0`and `1`, being `0` the ideial.

In [None]:
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

# Loading data
file = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
col = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = col)
array = data.values


# Separating the array on input and output components
X = array[:,0:13]
Y = array[:,13]

# Define train and teste datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)

# Making model
model = LinearRegression()

# Training the model
model.fit(X_train,Y_train)

# Making Predictions
Y_pred = model.predict(X_test)

# Results
r2 = r2_score(Y_test,Y_pred)
print('r2 score: ', r2)

r2 score:  0.6956551656111588


# 3. Regression Algorithms
Assumes that the data are in Normal Distribuition, that the values are relevant to model and aren't collinears (variables with high correlation). The data scientist must deliver relevant variables to the algorithm.


## 3.1 Linear Regression

In [4]:
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

# Loading data
file = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = cols)
array = data.values

# Separating the array on input and output components
X = array[:,0:13]
Y = array[:,13]

# Separate the data into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33)

# Create model
model = LinearRegression()

# Train model
model.fit(X_train, Y_train)

# Make prediction
Y_pred = model.predict(X_test)

# Result
mse = mean_squared_error(Y_test, Y_pred)
print("MSE is:", mse)

MSE is: 25.659298567970456


## 3.2 Ridge Regression
An extension of linear regression where the `loss function` is modified to minimize the model complexity.

In [5]:
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge

# Loading data
file = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = cols)
array = data.values

# Separating the array on input and output components
X = array[:,0:13]
Y = array[:,13]

# Separate the data into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33)

# Create model
model = Ridge()

# Train model
model.fit(X_train, Y_train)

# Make prediction
Y_pred = model.predict(X_test)

# Result
mse = mean_squared_error(Y_test, Y_pred)
print("MSE is:", mse)

MSE is: 28.42802678081147


## 3.3 Lasso Regression
Lasso (Least Absolute Shrinkage and Selection Operator) Regression is a customization of linear regression and just as Ridge, the loss function is modified to minimize the model complexity.

In [6]:
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso

# Loading data
file = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = cols)
array = data.values

# Separating the array on input and output components
X = array[:,0:13]
Y = array[:,13]

# Separate the data into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33)

# Create model
model = Lasso()

# Train model
model.fit(X_train, Y_train)

# Make prediction
Y_pred = model.predict(X_test)

# Result
mse = mean_squared_error(Y_test, Y_pred)
print("MSE is:", mse)

MSE is: 18.989093683107722


## 3.4 ElasticNet Regression
Is on way to regularize of regression that combine the propriets of Ridge and LASS. The goal is minimize the model complexity, penalizing the model using the sum of the squares of the coefficients.

In [7]:
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import ElasticNet

# Loading data
file = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = cols)
array = data.values

# Separating the array on input and output components
X = array[:,0:13]
Y = array[:,13]

# Separate the data into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33)

# Create model
model = ElasticNet()

# Train model
model.fit(X_train, Y_train)

# Make prediction
Y_pred = model.predict(X_test)

# Result
mse = mean_squared_error(Y_test, Y_pred)
print("MSE is:", mse)

MSE is: 28.189466092930203


## 3.5 KNN to Regression

In [9]:
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Loading data
file = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = cols)
array = data.values

# Separating the array on input and output components
X = array[:,0:13]
Y = array[:,13]

# Separate the data into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33)

# Create model
model = KNeighborsRegressor()

# Train model
model.fit(X_train, Y_train)

# Make prediction
Y_pred = model.predict(X_test)

# Result
mse = mean_squared_error(Y_test, Y_pred)
print("MSE is:", mse)

MSE is: 43.98385389221558


## 3.6 CART

In [10]:
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Loading data
file = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = cols)
array = data.values

# Separating the array on input and output components
X = array[:,0:13]
Y = array[:,13]

# Separate the data into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33)

# Create model
model = DecisionTreeRegressor()

# Train model
model.fit(X_train, Y_train)

# Make prediction
Y_pred = model.predict(X_test)

# Result
mse = mean_squared_error(Y_test, Y_pred)
print("MSE is:", mse)

MSE is: 15.947784431137729


## 3.7 SVM to Regression

In [11]:
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# Loading data
file = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = cols)
array = data.values

# Separating the array on input and output components
X = array[:,0:13]
Y = array[:,13]

# Separate the data into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33)

# Create model
model = SVR()

# Train model
model.fit(X_train, Y_train)

# Make prediction
Y_pred = model.predict(X_test)

# Result
mse = mean_squared_error(Y_test, Y_pred)
print("MSE is:", mse)

MSE is: 74.91295307495547


# 4. Optimize Model
All of machine learning algorithm are configurable, that means that it can be adjusted thought parameters tuning. Your job is to find the best parameters to each algorithm.

This process is also called `hyperparameters optimization`, the scikit-learn offers two methods: `Grid Search Parameter Tuning` and `Random Search Parameter Tuning`.

## 4.1 Grid Search Parameter Tuning
This method creates methodically a combination between all parameters, making a grid. Bellow the value 1 reached the best performance. 

In [12]:
# Import modules
from pandas import read_csv
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# Loading data
file = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = cols)
array = data.values

# Separating the array on input and output components
X = array[:,0:13]
Y = array[:,13]

# Define the values that will be tested
alpha_values = np.array([1,0.1,0.01,0.001,0.0001,0])
grid_values = dict(alpha = alpha_values)

# Create model
model = Ridge()

# Create the grid
grid = GridSearchCV(estimator = model, param_grid = grid_values)
grid.fit(X, Y)

# Print result
print("Best parameters:\n", grid.best_estimator_)

Best parameters:
 Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)


## 4.2 Random Search Parameter Tuning
This method generate parameters samples from a randomic uniform distribuition to a fix iteration number. The model is built and tested to each parameter combination. Bellow the near value of 1 will show the best result.

In [15]:
# Import dos módulos
from pandas import read_csv
import numpy as np
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Carregando os dados
arquivo = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
colunas = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
dados = read_csv(arquivo, delim_whitespace = True, names = colunas)
array = dados.values

# Separando o array em componentes de input e output
X = array[:,0:8]
Y = array[:,8]

# Definindo os valores que serão testados
valores_grid = {'alpha': uniform()}
seed = 7

# Criando o modelo
modelo = Ridge()
iterations = 100
rsearch = RandomizedSearchCV(estimator = modelo, 
                             param_distributions = valores_grid, 
                             n_iter = iterations, 
                             random_state = seed)
rsearch.fit(X, Y)

# Print do resultado
print("Melhores Parâmetros do Modelo:\n", rsearch.best_estimator_)

Melhores Parâmetros do Modelo:
 Ridge(alpha=0.9779895119966027, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)


# 5. Saving Results

In [16]:
# Import dos módulos
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
import pickle

# Carregando os dados
arquivo = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
colunas = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
dados = read_csv(arquivo, delim_whitespace = True, names = colunas)
array = dados.values

# Separando o array em componentes de input e output
X = array[:,0:13]
Y = array[:,13]

# Definindo os valores para o número de folds
teste_size = 0.35
seed = 7

# Criando o dataset de treino e de teste
X_treino, X_teste, Y_treino, Y_teste = train_test_split(X, Y, test_size = teste_size, random_state = seed)

# Criando o modelo
modelo = Ridge()

# Treinando o modelo
modelo.fit(X_treino, Y_treino)

# Salvando o modelo
arquivo = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/modelo_regressor_final.sav'
pickle.dump(modelo, open(arquivo, 'wb'))
print("Modelo salvo!")

# Carregando o arquivo
modelo_regressor_final = pickle.load(open(arquivo, 'rb'))
print("Modelo carregado!")

# Print do resultado
# Fazendo previsões
Y_pred = modelo_regressor_final.predict(X_test)

# Resultado
mse = mean_squared_error(Y_test, Y_pred)
print("O MSE do modelo é:", mse)

Modelo salvo!
Modelo carregado!
O MSE do modelo é: 30.524598974146706
