In [83]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split



#### Machine Learning
As a programmer, my job is to write the rules that tell a computer exactly how to solve a specific problem. Machine learning (ML) is a different approach: the machine itself 'learns' the rules for solving a problem without being explicitly programmed.
Check out this [article](https://medium.com/machine-learning-for-humans/why-machine-learning-matters-6164faf1df12) for more.

I will be using supervised learning, which is the branch of ML where the computer learns how to perform a function by looking at labeled training data.

We train a supervised learning model by giving it data and showing it what the correct output should be for that data. The machine learning algorithm then uses that data to generalize rules that reproduce the same results.

Imagine that you're a real estate agent with years of experience selling houses. Thanks to that experience, you can estimate the price of any house almost instantly. Your business is growing, and you hire a couple of interns to help manage all your clients. The problem is that your trainees don't have the knowledge and skills that you have. To help the interns, we want to write a program that can estimate the value of a house based on certain parameters (features, or independent variables) such as the number of bedrooms and bathrooms, the total size in square feet, its neighborhood, and so on. We can do this with supervised machine learning. First, we'll collect data on every house sold in our area over the last three months. For each house, we'll write down its basic characteristics, such as the number of bedrooms, the size in square feet, the neighborhood the house is in, and so on.

But most importantly, we'll write down the final sales price of the house. This is our training data. To build our program, we'll feed the training data into a machine learning algorithm and the algorithm will work out how to come up with the correct answer for each house. This is supervised machine learning. We call it learning, because the computer is learning how to model the price of a house based on the values we're feeding into it. We say it's supervised, because we're giving the computer the correct answer for each house's value. All the computer has to do is work out the relationship between the input data and the final price.
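
To make the idea concrete, here is a minimal sketch of "learning the rule from labeled examples". The five houses and their prices are made up for illustration, and plain least squares (`np.linalg.lstsq`) stands in for the learning algorithm; it is not the model used later in this notebook.

```python
import numpy as np

# Hypothetical training data: [number of bedrooms, size in sqft] per house.
features = np.array([
    [2, 900],
    [3, 1500],
    [3, 1200],
    [4, 2000],
    [5, 4000],
], dtype=float)

# The "correct answers" (labels). For this toy example they were generated
# with a hidden rule: price = 30000 + 10000 * bedrooms + 100 * sqft
prices = 30000 + 10000 * features[:, 0] + 100 * features[:, 1]

# Add a column of ones so the model can learn a base price (intercept).
design = np.column_stack([np.ones(len(features)), features])

# "Training": find the weights that best reproduce the known prices.
weights, *_ = np.linalg.lstsq(design, prices, rcond=None)

# "Prediction": apply the learned weights to a house the model has not seen.
new_house = np.array([1.0, 4, 1800])  # [intercept term, bedrooms, sqft]
predicted_price = float(new_house @ weights)

print('Learned weights:', np.round(weights, 2))
print('Predicted price: {:.0f}'.format(predicted_price))
```

The algorithm was never told the hidden rule; it recovered the weights from the examples alone. Real sales data is noisy and far less regular, which is why a more flexible model is used further down.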

I will use [NumPy](http://numpy.org), which provides data structures and algorithms for fast numerical computation; [pandas](http://pandas.pydata.org), which makes cleaning up the dataset easier; and [scikit-learn](https://scikit-learn.org/stable/), a Swiss Army knife for machine learning.

#### Naive approach.
Let's build a simple program to estimate the price of a house given just two attributes.


In [5]:
def estimate_home_val(number_of_bedrooms, size_in_sqft):
    # Assume all homes are worth at least $30,000
    value = 30000
    
    # Adjust the cost based on the size sqft
    value += size_in_sqft*100
    
    # Adjust the cost based on the number of bedrooms
    value += number_of_bedrooms*10000
    
    return value
# Estimate the value of the house:
# 5 bedrooms
# 4000 sq ft
# Actual value: $400,000

value = estimate_home_val(number_of_bedrooms = 5, size_in_sqft = 4000 )
print('Estimated value: {}'.format(value))
print('Actual value: {}'.format(400000))

Estimated value: 480000
Actual value: 400000
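
As a small follow-up sketch, the size of the miss can be quantified. The function body below restates the hand-written rule from the cell above so that the snippet runs on its own.

```python
def estimate_home_val(number_of_bedrooms, size_in_sqft):
    """Hand-written rule: base value plus fixed amounts per sqft and per bedroom."""
    return 30000 + size_in_sqft * 100 + number_of_bedrooms * 10000

actual_value = 400000
estimated_value = estimate_home_val(number_of_bedrooms=5, size_in_sqft=4000)

# How far off is the estimate, in dollars and relative to the actual price?
absolute_error = abs(estimated_value - actual_value)
percent_error = 100 * absolute_error / actual_value

print('Absolute error: {}'.format(absolute_error))
print('Percent error: {:.1f}%'.format(percent_error))
```

The hard-coded weights miss this house by $80,000, or 20% of its actual value. Tuning such weights by hand for every case does not scale, which is the motivation for letting an algorithm learn them from data.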


#### Training Data
The data I will be using is in the ml_house_data_set.csv file. [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) stands for comma-separated values; such files can be opened and edited in any spreadsheet software, such as Excel.

In [161]:
import pandas as pd
# Read the data
df = pd.read_csv('data/ml_house_data_set.csv')

# Check the columns

print('Total number of columns: {}'.format(len(df.columns.values)))
print('Last column \'{}\', is called target variable'.format(df.columns.values[-1])) # Value we will be predicting with our model, also called Y (dependent variable)
print('Rest of the columns are called features variables',df.columns.values) # Values we will be feeding into our ML model, also called X (independent variables); note that this prints every column, including the target


Total number of columns: 20
Last column 'sale_price', is called target variable
Rest of the columns are called features variables ['year_built' 'stories' 'num_bedrooms' 'full_bathrooms' 'half_bathrooms'
 'livable_sqft' 'total_sqft' 'garage_type' 'garage_sqft' 'carport_sqft'
 'has_fireplace' 'has_pool' 'has_central_heating' 'has_central_cooling'
 'house_number' 'street_name' 'unit_number' 'city' 'zip_code' 'sale_price']


In [27]:
# Check the first 5 rows 
print(df.head(5))

   year_built  stories  num_bedrooms  full_bathrooms  half_bathrooms  \
0        1978        1             4               1               1   
1        1958        1             3               1               1   
2        2002        1             3               2               0   
3        2004        1             4               2               0   
4        2006        1             4               2               0   

   livable_sqft  total_sqft garage_type  garage_sqft  carport_sqft  \
0          1689        1859    attached          508             0   
1          1984        2002    attached          462             0   
2          1581        1578        none            0           625   
3          1829        2277    attached          479             0   
4          1580        1749    attached          430             0   

   has_fireplace  has_pool  has_central_heating  has_central_cooling  \
0           True     False                 True                 True   
1 

In [28]:
# Check the index of the dataframe
print(df.index)

RangeIndex(start=0, stop=42703, step=1)


In [29]:
# We see that there are some missing values in the unit_number attribute
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42703 entries, 0 to 42702
Data columns (total 20 columns):
year_built             42703 non-null int64
stories                42703 non-null int64
num_bedrooms           42703 non-null int64
full_bathrooms         42703 non-null int64
half_bathrooms         42703 non-null int64
livable_sqft           42703 non-null int64
total_sqft             42703 non-null int64
garage_type            42703 non-null object
garage_sqft            42703 non-null int64
carport_sqft           42703 non-null int64
has_fireplace          42703 non-null bool
has_pool               42703 non-null bool
has_central_heating    42703 non-null bool
has_central_cooling    42703 non-null bool
house_number           42703 non-null int64
street_name            42703 non-null object
unit_number            3088 non-null float64
city                   42703 non-null object
zip_code               42703 non-null int64
sale_price             42703 non-null float64
dtypes: b

In [30]:
# We see some interesting statistics: for example, some houses have 31 bedrooms.
df.describe()

| | year_built | stories | num_bedrooms | full_bathrooms | half_bathrooms | livable_sqft | total_sqft | garage_sqft | carport_sqft | house_number | unit_number | zip_code | sale_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 42703.0 | 42703.0 | 42703.0 | 42703.0 | 42703.0 | 42703.0 | 42703.0 | 42703.0 | 42703.0 | 42703.0 | 3088.0 | 42703.0 | 42703.0 |
| mean | 1990.993209 | 1.365759 | 3.209283 | 1.923659 | 0.527153 | 1987.758986 | 2127.155446 | 455.8498 | 41.656324 | 18211.767347 | 2027.395402 | 11030.991476 | 413507.1 |
| std | 19.199987 | 0.513602 | 1.043396 | 0.759699 | 0.499268 | 846.76627 | 922.807342 | 243.453463 | 168.715867 | 27457.109993 | 1141.38377 | 573.576228 | 318549.7 |
| min | 1852.0 | 0.0 | 0.0 | 0.0 | 0.0 | -3.0 | 5.0 | -4.0 | 0.0 | 0.0 | 3.0 | 10004.0 | 626.0 |
| 25% | 1980.0 | 1.0 | 3.0 | 1.0 | 0.0 | 1380.0 | 1466.0 | 412.0 | 0.0 | 674.0 | 1063.0 | 10537.0 | 270899.0 |
| 50% | 1994.0 | 1.0 | 3.0 | 2.0 | 1.0 | 1808.0 | 1937.0 | 464.0 | 0.0 | 4530.0 | 2033.0 | 11071.0 | 378001.0 |
| 75% | 2005.0 | 2.0 | 4.0 | 2.0 | 1.0 | 2486.0 | 2640.0 | 606.0 | 0.0 | 24844.5 | 2921.0 | 11510.0 | 497697.0 |
| max | 2017.0 | 4.0 | 31.0 | 8.0 | 1.0 | 12406.0 | 15449.0 | 8318.0 | 9200.0 | 99971.0 | 3998.0 | 11989.0 | 21042000.0 |


#### How much data?
Ideally, the data should cover as many different combinations of features as possible.
If the data set has no data point for a certain combination of features, the ML model won't be able to make a good prediction for it.
A rule of thumb is to aim for at least 10 times as many data points as features. Our dataset has 19 features, so at least 190 data points would be a starting point.
In most cases, more data is better.
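
The rule of thumb is simple arithmetic, sketched below. The helper name `min_rows_needed` is made up for this example; the row and column counts come from the `df.info()` output above.

```python
def min_rows_needed(n_features, factor=10):
    """Rule of thumb: at least `factor` data points per feature."""
    return factor * n_features

n_rows = 42703      # rows reported by df.info()
n_features = 19     # 20 columns minus the sale_price target

required = min_rows_needed(n_features)
has_enough = n_rows >= required

print('Required rows: {}'.format(required))
print('Rows available: {} (enough: {})'.format(n_rows, has_enough))
```

Keep in mind that the feature count grows once categorical columns are one-hot encoded, so the same check is worth repeating on the encoded data.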

#### Feature engineering ?!
When using supervised learning to solve a problem, we show examples (X: features, Y: target) to a machine learning algorithm, and the algorithm learns rules to predict the correct output based on those examples. In practice, not all features are useful for modeling the problem, so it can be better to drop or combine some of them.
[One-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f) is a way to represent categorical data in a form that the machine learning model can understand. It creates a new feature in our data set for each unique category in the categorical data; a zip code is one example of such data. Let's look at the data set and do some feature engineering.
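
Before applying it to the real data, here is a minimal sketch of one-hot encoding on a three-row toy frame. The rows are made up for illustration; only the column names mirror the data set.

```python
import pandas as pd

# Toy data: one numeric column and one categorical column.
toy = pd.DataFrame({
    'livable_sqft': [1689, 1581, 1829],
    'garage_type': ['attached', 'none', 'detached'],
})

# get_dummies replaces garage_type with one 0/1 column per category.
# astype(int) keeps the output as 0/1 regardless of the pandas version.
encoded = pd.get_dummies(toy, columns=['garage_type']).astype(int)

print(encoded)
```

Each row has exactly one of the new `garage_type_*` columns set to 1, and the numeric column is left untouched.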

In [172]:
# Note that garage_type is categorical data with the values none, attached, and detached (meaning a separate building). Preprocess it using one-hot encoding.
# has_fireplace, has_pool, has_central_heating, has_central_cooling are fine as they are: True/False values work with scikit-learn, so no preprocessing is needed.
# House number and unit number are useless features, so we drop them.
# Location has a big influence on the value; as a starting point, let's include only the city in our model and drop street_name and zip_code.
df = pd.read_csv('data/ml_house_data_set.csv')
Y_prep = df[['sale_price']]
X_prep = df.drop(['house_number','street_name','unit_number','zip_code','sale_price'],axis = 1)
#print(len(X_prep.columns.values))
#print(len(Y_prep.columns.values))

# Replace categorical data with one-hot encoded data
X_one_hot = pd.get_dummies(X_prep,columns = ['garage_type','city'])
# Drop one dummy column per encoded feature; the dropped category becomes the reference level
X_one_hot = X_one_hot.drop(['city_Toddshire','garage_type_detached'],axis = 1)




In [173]:
Y_arr = np.array(Y_prep).reshape(-1,1)
X_arr = np.array(X_one_hot.values)

In [177]:
from joblib import dump
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

X_train, X_test, y_train, y_test = train_test_split(X_arr, Y_arr.ravel() , test_size = 0.3, random_state= 42)
y_train = y_train.astype(int,copy = True)



In [189]:
model = GradientBoostingRegressor(n_estimators = 1000, learning_rate = 0.1, max_depth = 6,min_samples_leaf=9,max_features=0.1)
#model = GradientBoostingRegressor()

In [190]:
model.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=6, max_features=0.1,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=9,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=1000, n_iter_no_change=None, presort='auto',
             random_state=None, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)

In [191]:
from sklearn.metrics import mean_absolute_error
# Mean absolute error on the training set (the variable is named mse_*, but the metric is MAE)
mse_train = mean_absolute_error(y_train, model.predict(X_train))

In [192]:
# Mean absolute error on the testing set (the variable is named mse_*, but the metric is MAE)
mse_test = mean_absolute_error(y_test, model.predict(X_test))

In [183]:
print('Training set mean Absolute Error: %.4f' % mse_train)
print('Testing set mean Absolute Error: %.4f' % mse_test)

Training set mean Absolute Error: 70503.2175
Testing set mean Absolute Error: 73070.2331


In [188]:
print('Training set mean Absolute Error: %.4f' % mse_train)
print('Testing set mean Absolute Error: %.4f' % mse_test)

Training set mean Absolute Error: 70412.9899
Testing set mean Absolute Error: 73411.8743


In [193]:
print('Training set mean Absolute Error: %.4f' % mse_train)
print('Testing set mean Absolute Error: %.4f' % mse_test)

Training set mean Absolute Error: 50435.3333
Testing set mean Absolute Error: 65394.0667
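
For readers unfamiliar with the metric, the sketch below shows what a mean absolute error computes. It is a plain-Python illustration with made-up prices, not scikit-learn's implementation.

```python
def mean_abs_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical sale prices and model predictions for four houses.
actual_prices = [400000, 250000, 300000, 200000]
predicted_prices = [480000, 240000, 310000, 180000]

mae = mean_abs_error(actual_prices, predicted_prices)
print('Mean absolute error: {:.1f}'.format(mae))
```

The result is in the same unit as the target, so a testing set error of about 65,000 means the predictions are off by roughly $65,000 on average.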


#### Overfitting vs Underfitting
<dl>
    <dt>Training set error very low </dt>
    <dt>Test set error very high </dt>
 </dl>
When the training set error is very low but the test set error is very high, the model is overfit. Models that are too complex will overfit; we can fix this by making the model less complex: using fewer decision trees, making each decision tree smaller, or preferring simple decision trees over complex ones.

It's also possible that the model is overfitting because we don't have enough training data. If reducing the complexity of the model doesn't help, you might not have enough training data to solve the problem.

If the error rates for both the training set and the test set are high, the model is underfit: it didn't capture the patterns in the data set very well. Models that are too simple will underfit, so you need to make the model more complex. You can make a gradient boosting model more complex by using more decision trees or by making each decision tree deeper.

If the error rates for both the training set and the test set are low, the model is working well. It is accurate on both the training data and the test data, which means it has learned the real patterns behind the data.
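
One way to apply this to the three runs printed above is to look at the gap between the two errors. The sketch below only computes the gap and the ratio; the helper name `compare_errors` is made up, and what counts as "too large" a gap is a judgment call that depends on the problem.

```python
def compare_errors(train_error, test_error):
    """Return the absolute gap and the test/train ratio of two error values."""
    gap = test_error - train_error
    ratio = test_error / train_error
    return gap, ratio

# (training MAE, testing MAE) copied from the three outputs above.
runs = {
    'In [183]': (70503.2175, 73070.2331),
    'In [188]': (70412.9899, 73411.8743),
    'In [193]': (50435.3333, 65394.0667),
}

results = {}
for label, (train_err, test_err) in runs.items():
    results[label] = compare_errors(train_err, test_err)
    gap, ratio = results[label]
    print('{}: gap = {:.1f}, test/train ratio = {:.2f}'.format(label, gap, ratio))
```

The last run has the lowest errors overall but also the widest gap between training and testing error, which suggests it fits the training data more closely than the first two runs do.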

By tuning the hyperparameters of the model, we can fix underfitting and overfitting issues and end up with a model that fits well.

Often the best way to find good settings is trial and error, but it can take a lot of work to try all the possible combinations. We have six different parameters that we can tune, and most of them accept any number, so the number of combinations we could try is effectively unlimited.

A solution to this problem is a grid search. In a grid search you list a range of settings you want to try for each parameter, and you try them all: you train and test the model for every combination of parameters. The combination of parameters that generates the best predictions is the one you should use for your real model. Scikit-learn automates this whole process.

The [param grid](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterGrid.html) has an array for each parameter. For each setting, we add the range of values that we want to try. The ranges we have here are good values to try for most problems. A good strategy is to try a few values for each parameter, where it increases or decreases by a significant amount, like 1.0 to 0.3 to 0.1, like we have here. There's not much point in trying values that are very close, like 1.0 to 0.95, since the results probably won't be that much different.
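
To see how quickly a grid grows, the sketch below enumerates every combination with `itertools.product`. The grid values match the grid search cell in this notebook; the fold count of 3 is the default mentioned in the text and is treated here as an assumption.

```python
from itertools import product

param_grid = {
    'n_estimators': [500, 1000, 3000],
    'max_depth': [4, 6, 8],
    'min_samples_leaf': [3, 5, 9, 17],
    'learning_rate': [0.1, 0.05, 0.02, 0.01],
    'max_features': [1.0, 0.3, 0.1],
    'loss': ['ls', 'lad', 'huber'],
}

# Build one dict per combination: every value of every parameter
# paired with every value of every other parameter.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_combinations = len(combinations)
cv_folds = 3  # assumed number of cross-validation folds
n_fits = n_combinations * cv_folds

print('Combinations: {}'.format(n_combinations))
print('Model fits with {} folds: {}'.format(cv_folds, n_fits))
print('First combination: {}'.format(combinations[0]))
```

Adding one more value to any list multiplies the total, which is why a few widely spaced values per parameter are preferable to many closely spaced ones.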

Next, define the grid search using the [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class. It takes the model object, the param grid, and the number of jobs to run in parallel (`n_jobs`). If your computer has more than one CPU core, you can speed things up by using all of them. Next, we call `fit` on the grid search object to run the grid search. It's very important that we pass only the training data into `GridSearchCV`; we don't give it access to our test data set. The CV in GridSearchCV stands for cross-validation. It automatically slices the training data into smaller subsets, using part of the data to train different models and a different part to test those models.
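
The slicing can be pictured with a small sketch. This is a simplified illustration of k-fold splitting written with NumPy, not scikit-learn's internal code; the function name `k_fold_indices` is made up for this example.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, validation_idx) pairs, one per fold."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, n_folds)
    for i in range(n_folds):
        # Fold i is held out for validation; the rest is used for training.
        validation_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, validation_idx

splits = list(k_fold_indices(n_samples=10, n_folds=3))
for train_idx, validation_idx in splits:
    print('train:', train_idx, 'validate:', validation_idx)
```

Every sample is used for validation exactly once, and no sample is in the training and validation parts of the same fold. The held-out test set never enters this loop.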

This means that the model is configured without ever seeing our test data. The test data stays totally hidden, so the final model gets a completely blind test. Running the grid search will take a long time, since it trains a model for every possible combination of parameters in the param grid several times (by default `cv = 3` in the scikit-learn version used here; since version 0.22 the default is 5).

In [None]:
from sklearn.model_selection import GridSearchCV


# Hyperparameters we will be searching through; every combination will be used. Total = 3*3*4*4*3*3 = 1296 combinations.
param_grid = {
    'n_estimators': [500,1000,3000],
    'max_depth':[4,6,8],
    'min_samples_leaf': [3,5,9,17],
    'learning_rate': [0.1,0.05,0.02,0.01],
    'max_features':[1.0,0.3,0.1],
    'loss':['ls','lad','huber']  # scikit-learn 1.0 and later name these 'squared_error', 'absolute_error' and 'huber'
}

# Define the grid search. To run in parallel use n_jobs = 4
gs_cv = GridSearchCV(model, param_grid,n_jobs = 4)
# Run the grid search on the training set only
gs_cv.fit(X_train, y_train)
# Print best parameters
print(gs_cv.best_params_)




#### Retraining the Estimator
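
A minimal sketch of what retraining could look like, assuming the best parameters found by the grid search are simply passed back into a fresh estimator. It uses a small synthetic data set and a deliberately tiny grid so that it runs quickly; none of the numbers come from the house data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the house data (assumption, for illustration only).
rng = np.random.RandomState(42)
X = rng.uniform(0, 1, size=(200, 3))
y = 100 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Tiny grid so the example finishes in seconds.
small_grid = {'n_estimators': [50, 100], 'max_depth': [2, 3]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), small_grid, cv=3)
search.fit(X_train, y_train)

# Retrain a fresh estimator on the full training set with the best settings.
final_model = GradientBoostingRegressor(random_state=0, **search.best_params_)
final_model.fit(X_train, y_train)

# Only now is the held-out test set used, once, for the final check.
test_mae = mean_absolute_error(y_test, final_model.predict(X_test))
print('Best parameters: {}'.format(search.best_params_))
print('Testing set mean absolute error: {:.4f}'.format(test_mae))
```

With the default `refit=True`, `GridSearchCV` already refits the best configuration on the whole training set and exposes it as `best_estimator_`, so the explicit retraining step above is mainly useful when you want to change something else before the final fit.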

#### Next Steps