# Introduction to Machine Learning

## First of all, what is Machine Learning?

- Generally speaking, it is the field of computer science and engineering that studies the ability of computers to acquire knowledge from data.


<img src="Images/curiosa-1.gif" width="500" align="center">

## A New Approach

<figure class="image">
  <img src="Images/Traditional_Approach.png" align="center" width="500">
  <figcaption> Traditional Approach in Programming (Source: "Hands-On Machine Learning with Scikit-Learn
and TensorFlow", Aurélien Géron, O'REILLY, 2017). </figcaption>
</figure>

<figure class="image">
  <img src="Images/Machine_Learning.png" align="center" width="500">
  <figcaption> Machine Learning Approach (Source: "Hands-On Machine Learning with Scikit-Learn
and TensorFlow", Aurélien Géron, O'REILLY, 2017). </figcaption>
</figure>

## What are the applications?

<figure class="image">
    <img src="Images/SgML.gif" align="center" width="500">
  <figcaption> - Email Spam Detection. </figcaption>
    <img src="Images/ImML.gif" align="center" width="500">
  <figcaption> - Image Classification. </figcaption>
    <img src="Images/MdML.gif" align="center" width="500">
  <figcaption> - Medical Diagnosis. </figcaption>
    <img src="Images/StML.gif" align="center" width="500">
  <figcaption> - Stock Market Predictions. </figcaption>
    <img src="Images/ScML.gif" align="center" width="500">
  <figcaption> - Real-Time Object Detection and Self-Driving Cars. </figcaption>
</figure>

## Domains of Machine Learning

#### Supervised Learning

- In Supervised Learning, all training data instances include the expected solutions/output (labels).

<figure class="image">
  <img src="Images/SupervisedLabel.png" align="center" width="800">
  <figcaption> Supervised Learning Approach (Source: "Hands-On Machine Learning with Scikit-Learn
and TensorFlow", Aurélien Géron, O'REILLY, 2017). </figcaption>
</figure>

It can be subdivided into the following types of tasks:

- Classification (Binary/Multiclass);
- Regression.

#### Unsupervised Learning

- In Unsupervised Learning, as opposed to the previous case, the training data do not have any prior labelling. We want to find patterns in the unlabeled data.

<figure class="image">
  <img src="Images/Unsupervised.png" align="center" width="700">
  <figcaption> Set of unlabeled data (Source: "Hands-On Machine Learning with Scikit-Learn
and TensorFlow", Aurélien Géron, O'REILLY, 2017). </figcaption>
</figure>

#### Reinforcement Learning

- Reinforcement Learning is a more specialized Machine Learning field, with ties to information theory and game theory. It models the interaction between an agent and an environment: the agent receives penalties/rewards and optimizes its strategy for choosing actions (policy) to maximize the rewards.

<figure class="image">
  <img src="Images/Reinforcement.png" align="center" width="700">
  <figcaption> Agent-Environment Interaction (Source: "Reinforcement Learning: An Introduction", Richard S. Sutton and Andrew G. Barto, 2018, MIT Press). </figcaption>
</figure>

## Examples

### Supervised Learning

<img src="Images/Exemplo1.png" width="700" align="center">

<img src="Images/Exemplo2.gif" width="700" align="center">

### Unsupervised Learning

<img src="Images/Exemplo3.png" width="700" align="center">

<img src="Images/Exemplo4.jpg" width="700" align="center">

### Reinforcement Learning

<img src="Images/Exemplo5.gif" width="700" align="center">

<img src="Images/Exemplo7.gif" width="700" align="center">

## Standard Procedures in Machine Learning

#### Implementing any Machine Learning solution normally involves the following steps:

- Identifying the problem you want to solve;
- Fitting the problem into the relevant Machine Learning domain;
- Collecting data;
- Structuring/cleaning the obtained data and gathering the first insights;
- Selecting the most important components (features) of your data;
- Building the model;
- Evaluating the model and interpreting the results;
- Deploying the model.

### In this session, we will focus on building and training a model. More specifically, we will solve a regression task to predict the price of residential homes in the United States.

#### Let us first recap what regression tasks are and how we generally train our models.

### Regression Modeling - Linear Regression

- As opposed to binary or multiclass classification, regression tasks deal with continuous target values. When predicting prices, we are trying to estimate a value in the real domain, $y \in \mathbb{R}$. Since we are given a set of input variables rather than a single one, we are dealing with multivariate analysis.

- Let us first start with the simplest regression model available, the linear regression.

In Linear Regression, the model makes a prediction by calculating the weighted sum of the $n$ features plus a constant term (the intercept), $\theta_{0}$:

<font size="20">
$y = \theta_{0} + \theta_{1}x_{1} + \dots + \theta_{n}x_{n}$
</font>

where $y$ is the output we want to predict, $n$ is the number of input variables in our data, $x_{i}$ is the value of the $i$-th input variable, and $\theta_{i}$ is the corresponding model parameter (weight).
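
To make the formula concrete, here is a minimal sketch of a single prediction computed as a weighted sum. The parameter and feature values are hypothetical, chosen only so the arithmetic is easy to follow; they do not come from the housing data.

```python
import numpy as np

# Hypothetical parameters for a model with n = 3 features (illustrative values only).
theta_0 = 50.0                        # constant term (intercept)
theta = np.array([2.0, -1.0, 0.5])    # weights theta_1 ... theta_n

# One instance with feature values x_1 ... x_n.
x = np.array([10.0, 4.0, 8.0])

# y = theta_0 + theta_1 * x_1 + ... + theta_n * x_n
y = theta_0 + np.dot(theta, x)
print(y)  # 70.0
```

Each feature contributes its value times its weight (20, -4 and 4 here), and the intercept shifts the result by a constant.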

<figure class="image">
  <img src="Images/Linear.gif" align="center" width="500">
  <figcaption> "Linear Regression: The Easier Way", Towards Data Science, Sagar Sharma, 2017 </figcaption>
</figure>

- We train the model by finding the parameter values that best fit the training data. In practice, this means finding the values of $\theta$ that minimize an error measure such as the Root Mean Square Error (RMSE).
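
As an illustration of what finding the best $\theta$ means, the sketch below fits a linear model to synthetic data with NumPy's least-squares solver. The data, the "true" parameters, and the variable names are assumptions made for this example. Minimizing the sum of squared errors gives the same $\theta$ as minimizing the RMSE, since the square root and the division by the number of observations do not move the minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 instances, 2 features, generated from known parameters.
X = rng.normal(size=(100, 2))
true_theta_0, true_theta = 3.0, np.array([1.5, -2.0])
y = true_theta_0 + X @ true_theta + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so the intercept theta_0 is estimated with the weights.
X_b = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: the theta that minimizes the sum of squared errors.
theta_hat, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta_hat)  # close to [3.0, 1.5, -2.0]
```

scikit-learn's `LinearRegression`, used later in this notebook, also fits ordinary least squares, so it should recover very similar parameters on the same data.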

The Root Mean Square Error is a standard way of evaluating the error of a regression model. It is defined as

<font size="100">
$RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (\hat y_{i}-y_{i})^{2}}$
</font>

where $\hat y_{i}$ are the values predicted by the model, $y_{i}$ the real output observations, and $m$ the total number of observations (written $m$ here to distinguish it from $n$, the number of features).
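
The definition can be checked by hand on a few numbers. The sketch below uses four made-up observations and also computes the Mean Absolute Error (MAE) for comparison, because the two metrics are easy to confuse.

```python
import numpy as np

# Hypothetical observed values and model predictions (illustrative only).
y_true = np.array([200.0, 150.0, 300.0, 250.0])
y_pred = np.array([210.0, 140.0, 290.0, 270.0])

errors = y_pred - y_true                # 10, -10, -10, 20

# RMSE: square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean(errors ** 2))    # sqrt(700 / 4) = sqrt(175), about 13.23

# MAE: average of the absolute errors.
mae = np.mean(np.abs(errors))           # 50 / 4 = 12.5

print(rmse, mae)
```

Because the errors are squared before averaging, the RMSE penalizes large errors more heavily than the MAE does, and it is expressed in the same unit as the target.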

### How to train Regression Models

First of all, we need to "show" the model examples of what we want it to predict. As previously mentioned, we want a price prediction for every new house. After feeding the model labelled examples (the training process), we need to evaluate its performance on data it has not seen during training (evaluation on a test set), so that we know how well it behaves when new instances are given.
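
The train-then-evaluate procedure described above can be sketched on synthetic data before applying it to the real dataset. Everything in this example (the single "living area" feature, the price formula, the noise level) is a made-up assumption used only to show the workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data: one hypothetical feature (living area) and a noisy linear price.
X = rng.uniform(50, 250, size=(200, 1))
y = 1000.0 * X[:, 0] + 20000.0 + rng.normal(scale=5000.0, size=200)

# Hold out 20% of the examples; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Training: the model only sees the training examples.
model = LinearRegression().fit(X_train, y_train)

# Evaluation: measure the error on examples the model has never seen.
y_pred = model.predict(X_test)
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

print(len(X_train), len(X_test))  # 160 40
print(rmse)
```

Evaluating on held-out data gives an estimate of how the model behaves on new houses, which the training error alone cannot provide.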

<img src="Images/DataCamp.png" width="500" align="center">

#### Now it is time to train a model!

<img src="Images/Train.gif" width="500" align="center">

## Training a Regression Model

Let's define the Machine Learning problem we want to solve. Using the attributes of individual houses, we want to train a model that, based on previous examples, predicts a price for a given house. Since it predicts a value $y \in \mathbb{R}$, we will use a regression model: Linear Regression.

#### 1. Importing all necessary libraries.

In [1]:
# Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from pylab import rcParams
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
import joblib  # standalone package; sklearn.externals.joblib was removed from scikit-learn



In [2]:
# NumPy output options

np.set_printoptions(edgeitems=3)

# Pandas output options

pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 100)

#### First, let's take a look at our data. As previously mentioned, we are building a model to predict the price of residential homes. 

#### 2. Gather the data.

In [3]:
%time

# Importing the data with pandas

data_train = pd.read_csv('train.csv', index_col=0)
print(data_train[:10])

Wall time: 0 ns
    MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
Id                                                                    
1           60       RL         65.0     8450   Pave   NaN      Reg   
2           20       RL         80.0     9600   Pave   NaN      Reg   
3           60       RL         68.0    11250   Pave   NaN      IR1   
4           70       RL         60.0     9550   Pave   NaN      IR1   
5           60       RL         84.0    14260   Pave   NaN      IR1   
6           50       RL         85.0    14115   Pave   NaN      IR1   
7           20       RL         75.0    10084   Pave   NaN      Reg   
8           60       RL          NaN    10382   Pave   NaN      IR1   
9           50       RM         51.0     6120   Pave   NaN      Reg   
10         190       RL         50.0     7420   Pave   NaN      Reg   

   LandContour Utilities LotConfig  ... PoolArea PoolQC  Fence MiscFeature  \
Id                                  ...              

#### 3. General Statistics and Visualization

In [4]:
# Check data types

print('Feature Types: \n', data_train.dtypes.value_counts())

Feature Types: 
 object     43
int64      34
float64     3
dtype: int64


In [5]:
# Check all nan values

print('Data NaN values: \n', data_train.isna().sum())

Data NaN values: 
 MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual           37
BsmtCond           37
BsmtExposure       38
BsmtFinType1       37
BsmtFinSF1          0
BsmtFinType2       38
BsmtFinSF2          0
BsmtUnfSF           0
TotalBsmtSF         0
Heating             0
HeatingQC           0
CentralAir          0
Electrical          1
1stFlrSF            0
2ndFlrSF            0
LowQualFinSF 

In [6]:
# Checking for duplicated rows

duplicated_values = data_train.duplicated().sum()
print('There are {} duplicated rows.'.format(duplicated_values))

There are 0 duplicated rows.


In [7]:
%matplotlib notebook

# Data distribution

data_train.hist(figsize=(27, 35), bins=50, xlabelsize=10, ylabelsize=10)
plt.show()

<IPython.core.display.Javascript object>

In [8]:
%matplotlib notebook

# Price distribution

sn.histplot(data_train['SalePrice'], kde=True, stat='density')  # distplot is deprecated in recent seaborn
plt.title('Sale Price Distribution')
plt.show()

<IPython.core.display.Javascript object>

#### This dataset contains a considerably large feature space $X$. The $SalePrice$ column is our target variable, $Y$. For this initial approach, we will use only the numerical features.

#### 4. Setting the training set.

In [9]:
# Selecting only numerical features

data_train = data_train.select_dtypes(include=['int64', 'float64'])

In [10]:
# Handling missing values: replace NaN with 0, then drop columns that contain only zeros

data_train = data_train.fillna(0)
data_train = data_train.loc[:, data_train.any()]
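
The small sketch below shows, on a toy DataFrame, what the cleaning calls do. The column names mirror the dataset but the values are invented for illustration. Note that `dropna()` returns a new DataFrame, so calling it without keeping the result leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one all-zero column (illustrative values).
df = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0],
    'LotArea': [8450, 9600, 11250],
    'PoolArea': [0, 0, 0],
})

# dropna() returns a copy; the result is discarded here, so df still has its NaN.
df.dropna()
remaining_nan = int(df.isna().sum().sum())
print(remaining_nan)  # 1

# Replace missing values with 0, keeping the result.
filled = df.fillna(0)

# DataFrame.any() is True for columns with at least one non-zero value,
# so this selection drops columns that are entirely zero.
kept = filled.loc[:, filled.any()]
print(list(kept.columns))  # ['LotFrontage', 'LotArea']
```

Filling with 0 is a simple choice that keeps every row. Whether 0 is a sensible stand-in depends on the feature, so other strategies (such as the column median) may be worth trying.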

#### 5. Setting train and test sets.

In [11]:
# Getting features and label data

X = data_train.drop(['SalePrice'], axis = 1)
Y = data_train['SalePrice']


# Train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

print('Train Feature Set \n', X_train)
print('Train Label Set \n', y_train)

Train Feature Set 
       MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
Id                                                                            
1162          20          0.0    14778            6            7       1954   
1000          20         64.0     6762            7            5       2006   
766           20         75.0    14587            9            5       2008   
1250          20         60.0     7200            5            7       1950   
341           60         85.0    14191            8            5       2002   
...          ...          ...      ...          ...          ...        ...   
642           60          0.0     7050            7            5       2001   
1344          50         57.0     7558            6            6       1928   
779           90         60.0     8400            5            5       1977   
1307         120         48.0     6955            7            5       2005   
1192         160         24.0   

#### We are now ready to train our model. 

#### 7. Setting Model.

In [12]:
# Defining the model

model = LinearRegression()

# Training the model

model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

#### 8. Evaluate model parameters.

In [13]:
# Getting the model weights

print('Model Weights \n', model.coef_)
print('Total Parameter Weights \n', len(model.coef_))

Model Weights 
 [-1.82427433e+02 -1.59407097e+01  2.52117788e-01  1.81071130e+04
  4.38229248e+03  2.92163775e+02  1.11493286e+02  2.75572555e+01
  8.26894503e+00 -2.95632325e+00  2.74788196e-01  5.58740997e+00
  1.89989801e+01  1.79568391e+01 -1.00870986e+01  2.68687206e+01
  1.03845544e+04  4.23068162e+03  6.09156692e+03 -1.74529931e+03
 -1.11499096e+04 -1.74421788e+04  6.08248025e+03  4.68049523e+03
 -1.55579523e+01  1.54395710e+04  9.71901102e+00  2.83838400e+01
 -1.88483701e+01  5.61868676e+00  1.78289818e+01  6.45521689e+01
 -3.12322639e+00 -1.94407496e-01 -1.45871223e+02 -3.60081988e+02]
Total Parameter Weights 
 36
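
A bare array of weights is hard to read. One way to inspect it, sketched here on synthetic data with two hypothetical features, is to pair `coef_` with the column names of the feature DataFrame. The data and the price formula are assumptions for the example, not results from the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative values).
X = pd.DataFrame({
    'OverallQual': rng.integers(1, 11, size=200),
    'LotArea': rng.integers(5000, 15000, size=200),
})
y = 20000.0 * X['OverallQual'] + 2.0 * X['LotArea'] + rng.normal(scale=1000.0, size=200)

model = LinearRegression().fit(X, y)

# Label each weight with its feature name and sort by absolute size.
weights = pd.Series(model.coef_, index=X.columns)
order = weights.abs().sort_values(ascending=False).index
print(weights.reindex(order))
print('Intercept:', model.intercept_)
```

The size of a weight depends on the unit of its feature: `LotArea` gets a small weight here only because its values are in the thousands, so raw weights are not a direct measure of feature importance.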


#### After training our model, we need to evaluate it. For that, we will use some performance metrics, including the RMSE. 

#### 10. Model Performance Metrics.

In [14]:
# Testing the model on the test set

y_prediction = model.predict(X_test)

# RMSE metric: square root of the mean squared error

rmse = np.sqrt(np.mean((y_test - y_prediction) ** 2))
print('RMSE \n', rmse)

# R2 Score
print('R2 Score \n', r2_score(y_test, y_prediction))

R2 Score 
 0.8573284253656729
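
To see what the R2 score measures, the sketch below computes it by hand on four made-up values and compares the result with `r2_score`. It uses the usual definition, one minus the ratio between the model's squared error and the squared error of always predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed values and predictions (illustrative only).
y_true = np.array([200.0, 150.0, 300.0, 250.0])
y_pred = np.array([210.0, 140.0, 290.0, 270.0])

# Squared error of the model: 100 + 100 + 100 + 400 = 700
ss_res = np.sum((y_true - y_pred) ** 2)

# Squared error of a baseline that always predicts the mean (225): 12500
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot   # 1 - 700 / 12500, approximately 0.944
print(r2_manual, r2_score(y_true, y_pred))
```

A score close to 1 means the model explains most of the variance in the target, and a score of 0 means it does no better than predicting the mean.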


#### 11. Visualization of the model performance.

In [15]:
%matplotlib notebook

# Expected vs Predicted Values with Linear Regression Model plot

sn.regplot(x=y_test, y=y_prediction, fit_reg=True)
plt.title('Expected vs Predicted Values with Linear Regression Model')
plt.xlabel('Expected Price')
plt.ylabel('Predicted Price')
plt.show()

<IPython.core.display.Javascript object>

#### Well done! We now have a linear regression model that predicts house sale prices. It is ready to be delivered to any application!

<img src="Images/Well_Done.gif" width="500" align="center">

#### We can also save it for use in future applications.

#### 12. Saving and Loading the model.

In [16]:
# Save model in PKL file

model_file = 'Linear_Regression.pkl'  
joblib.dump(model, model_file)

['Linear_Regression.pkl']

In [17]:
# Load Model

load_model = joblib.load(model_file)

# Make new predictions

load_model.predict(X_test)

array([138041.26171422,  71932.81023537, 358810.11801288, 207421.5813202 ,
       166595.89021925, 111557.7653868 , 167471.56991732, 185401.5328418 ,
       194694.43790652, 194558.19739132, 134861.57970046, 159765.87204092,
       107372.18233853, 239343.92597709,  62984.66991012, 321509.83699484,
       243338.84586615, 143190.51650365, 144181.70683432,  70646.41466108,
       174852.55849233,  42033.4110129 ,  88963.92925927, 218569.61268791,
       106838.52747483, 277860.64483903, 221038.26202485, 194638.37128238,
       106148.1201809 , 191258.38366388, 260585.09406965, 128353.9667647 ,
       101744.20843584, 326008.69548451, 111301.63319925, 154938.65251606,
       227199.05551571, 136421.93166338, 208252.02351751, 139681.13728225,
       144534.98825074, 195900.30547044, 119138.23555408, 209596.92246047,
       296329.84448487,  83259.57977224, 182284.43714587, 151000.75024481,
       188193.81396162, 112579.31265185, 160618.92887407, 249824.71656297,
       154208.91806627, 1
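
A quick way to gain confidence in the save/load step is a round trip: save a model, load it back, and check that both copies give the same predictions. The sketch below does this with a tiny made-up dataset and a temporary directory. The file name is arbitrary.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset following y = 2x + 1 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
model = LinearRegression().fit(X, y)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print(same)  # True
```

Files written by `joblib.dump` are pickle-based, so only load model files from sources you trust, and preferably with the same scikit-learn version that saved them.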

##### You are now ready to make your first Machine Learning model. See you in the "Intro to Machine Learning - Exercises" section!  👍