<a href="https://colab.research.google.com/github/ralsouza/machine_learning_python/blob/master/notebooks/03_machine_learning_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Define the business problem
Let's create a predictive model that estimates house prices from a set of variables describing houses in Boston neighbourhoods.

Dataset: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

Variables: there are 14 attributes in each case of the dataset. They are:
* CRIM - per capita crime rate by town
* ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS - proportion of non-retail business acres per town.
* CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
* NOX - nitric oxides concentration (parts per 10 million)
* RM - average number of rooms per dwelling
* AGE - proportion of owner-occupied units built prior to 1940
* DIS - weighted distances to five Boston employment centres
* RAD - index of accessibility to radial highways
* TAX - full-value property-tax rate per $10,000
* PTRATIO - pupil-teacher ratio by town
* B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* LSTAT - % lower status of the population
* MEDV - Median value of owner-occupied homes in $1000's

# 2. Model Evaluation
https://scikit-learn.org/stable/modules/model_evaluation.html

## 2.1 Metrics for Regression Algorithms

- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- R Squared (R²)
- Adjusted R Squared (adjusted R²)
- Mean Squared Percentage Error (MSPE)
- Mean Absolute Percentage Error (MAPE)
- Root Mean Squared Logarithmic Error (RMSLE)
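Several of the metrics in this list are not computed later in this notebook. As a rough sketch, the snippet below works out RMSE, MAPE, MSPE and RMSLE by hand in plain Python. The `y_true` and `y_pred` values are made up for illustration and do not come from the Boston dataset, and the RMSLE line assumes the common `log(1 + x)` form, which requires non-negative values.

```python
import math

# Toy values chosen for illustration only (not from the Boston dataset)
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
n = len(y_true)

# Prediction errors (actual minus predicted)
errors = [t - p for t, p in zip(y_true, y_pred)]

# RMSE: square root of the mean squared error, in the same units as the target
mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)

# MAPE: mean of absolute errors relative to the actual values, as a percentage
mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n * 100

# MSPE: mean of squared relative errors, as a percentage
mspe = sum((e / t) ** 2 for e, t in zip(errors, y_true)) / n * 100

# RMSLE: RMSE computed on log(1 + value), so it penalises relative differences
rmsle = math.sqrt(
    sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / n
)

print('RMSE: ', rmse)
print('MAPE: ', mape)
print('MSPE: ', mspe)
print('RMSLE:', rmsle)
```

MAPE and MSPE divide by the actual values, so they are undefined when any actual value is zero.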


### 2.1.1 MSE
Probably the simplest and most common metric for regression evaluation, but also one of the least informative on its own, because its value is expressed in squared units of the target. The MSE measures the mean squared error: for each point, it calculates the squared difference between the prediction and the real value of the target variable, and then takes the mean of these values.

The larger this value, the worse the model. It can never be negative, since the individual prediction errors are squared, and it is zero only for a perfect model.
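To make the definition concrete, here is a minimal hand-written version of the calculation. The function name `mean_squared_error_manual` and the toy values are only illustrative; in practice `sklearn.metrics.mean_squared_error` does the same job.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values chosen for illustration only
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

# Errors are 1, 0, -2, -1, so the squared errors are 1, 0, 4, 1 and their mean is 1.5
mse = mean_squared_error_manual(y_true, y_pred)
print('MSE: ', mse)

# A perfect model reproduces the targets exactly and scores 0
perfect = mean_squared_error_manual(y_true, y_true)
print('MSE of a perfect model: ', perfect)
```

Because the errors are squared, a single large miss weighs much more than several small ones.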

```python
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

# Load the data (whitespace-delimited file without a header row)
file = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
col = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
data = read_csv(file, sep=r'\s+', names=col)
array = data.values

# Split the array into input features (first 13 columns) and target (MEDV)
X = array[:, 0:13]
Y = array[:, 13]

# Define train and test datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=5)

# Create the model
model = LinearRegression()

# Train the model
model.fit(X_train, Y_train)

# Make predictions on the test set
Y_pred = model.predict(X_test)

# Results
mse = mean_squared_error(Y_test, Y_pred)
print('MSE: ', mse)
```

```
MSE:  28.53045876597476
```


### 2.1.2 MAE
The Mean Absolute Error is the mean of the absolute differences between predictions and real values. It tells us how wrong our predictions are on average, in the same units as the target; a value of `0` indicates that there are no errors.
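As a small sketch of the calculation, the snippet below computes the MAE by hand on made-up values. The helper name `mean_absolute_error_manual` is hypothetical; `sklearn.metrics.mean_absolute_error` is the library equivalent.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values chosen for illustration only
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

# Absolute errors are 1, 0, 2, 1, so the mean is 1.0
mae = mean_absolute_error_manual(y_true, y_pred)
print('MAE: ', mae)
```

Unlike the MSE, every error contributes in proportion to its size, so the MAE is less sensitive to a few large misses.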

```python
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression

# Load the data (whitespace-delimited file without a header row)
file = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
col = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
data = read_csv(file, sep=r'\s+', names=col)
array = data.values

# Split the array into input features (first 13 columns) and target (MEDV)
X = array[:, 0:13]
Y = array[:, 13]

# Define train and test datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=5)

# Create the model
model = LinearRegression()

# Train the model
model.fit(X_train, Y_train)

# Make predictions on the test set
Y_pred = model.predict(X_test)

# Results
mae = mean_absolute_error(Y_test, Y_pred)
print('MAE: ', mae)
```

```
MAE:  3.455034932248358
```


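### 2.1.3 R²
This metric, also called the coefficient of determination, indicates how much of the variance in the observed values is explained by the model.
Values usually fall between `0` and `1`, with `1` being the ideal. A value of `0` means the model does no better than always predicting the mean of the target, and the score can be negative for a model that does worse than that.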
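The coefficient of determination can be written as `1 - SS_res / SS_tot`, where `SS_res` is the sum of squared prediction errors and `SS_tot` is the sum of squared deviations of the real values from their mean. The sketch below computes it by hand on made-up values and also shows the usual adjusted R² formula, `1 - (1 - R²)(n - 1) / (n - p - 1)`, for `n` samples and `p` features. The helper names are hypothetical.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R², which penalises R² for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Toy values chosen for illustration only
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

# SS_res = 6 and SS_tot = 14.75, so R² = 1 - 6 / 14.75
r2 = r2_score_manual(y_true, y_pred)
print('r2 score: ', r2)

# Always predicting the mean of the targets (4.25) gives exactly 0
r2_mean = r2_score_manual(y_true, [4.25] * len(y_true))
print('r2 score (mean predictor): ', r2_mean)

# Predictions worse than the mean give a negative score
r2_bad = r2_score_manual(y_true, [7.0, 2.0, 5.0, 3.0])
print('r2 score (bad predictor): ', r2_bad)

# Adjusted R², assuming a single feature for this toy example
r2_adj = adjusted_r2(r2, n_samples=len(y_true), n_features=1)
print('adjusted r2 score: ', r2_adj)
```

Adjusted R² is never higher than R², and the gap grows as more features are added relative to the number of samples.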

```python
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

# Load the data (whitespace-delimited file without a header row)
file = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/boston-houses.csv'
col = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
data = read_csv(file, sep=r'\s+', names=col)
array = data.values

# Split the array into input features (first 13 columns) and target (MEDV)
X = array[:, 0:13]
Y = array[:, 13]

# Define train and test datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=5)

# Create the model
model = LinearRegression()

# Train the model
model.fit(X_train, Y_train)

# Make predictions on the test set
Y_pred = model.predict(X_test)

# Results
r2 = r2_score(Y_test, Y_pred)
print('r2 score: ', r2)
```

```
r2 score:  0.6956551656111588
```
