# Building Machine Learning Pipelines: Model Build Phase

In this and the upcoming videos we will focus on building machine learning pipelines that cover the full life cycle of a data science project. This should be especially useful for professionals who have not yet worked with large datasets.

## Project Name: House Prices: Advanced Regression Techniques

The main aim of this project is to predict house prices from a range of features, which we will discuss as we go.

#### The dataset can be downloaded from the link below
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

### The Life Cycle of a Data Science Project
1. Data Analysis
2. Feature Engineering
3. Feature Selection
4. Model Building
5. Model Deployment

### Contents 

1st Pipeline : House Price - Exploratory Data Analysis 

1. Data Analysis aka EDA ( Exploratory Data Analysis )
    a.   Missing Values
              i)   Explore how missing values affect our data
              ii)  Check the relationship between features with missing values and the dependent feature ( SalePrice ) using a bar plot
    b.   Numerical Variables
              i)   Explore numerical features in our data
              ii)  Explore Temporal Variable ( Eg : Datetime )
              iii) Types of Numerical Variables
              iv)  Distribution of Numerical Variables in our data
              v)   Log Transformation of Variable
    c.   Categorical Variables and their cardinality  
              i)   Explore categorical features in our data
              ii)  Relationship between Categorical Variables and dependent feature
    d.   Outliers 


2nd Pipeline : House Price - Feature Engineering  

2. Feature Engineering  
    a.   Handling Missing Values  
              i)   Handling Categorical Values
              ii)  Handling Numerical Values
    b.   Handling Temporal Variables  
    c.   Handling Numerical Variables  
              i)  Skewed Data to Log Normal Distribution
    d.   Handling Categorical Variables  
              i)  Rare Category Elimination
              ii) Conversion of Category to Numerical Values
    e.   Standardize the values of Variables  

3rd Pipeline : House Price - Feature Selection

3. Feature Selection  
    a.   Lasso and SelectFromModel to select the features
    
4th Pipeline : House Price - Build Model  

4. Building Model   
    a.   Model Building using XGBOOST  
    b.   Hyperparameter tuning for XGBOOST  

In [1]:
#Importing Libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

pd.set_option('display.max_columns',None)
pd.set_option("display.max_rows",None)

#For feature selection
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

In [17]:
#Importing Dataset
X_train = pd.read_csv('X_train_after_Feature_Selection.csv')
y_train = pd.read_csv('y_train_after_Feature_Selection.csv')
X_test = pd.read_csv('X_test_after_Feature_Selection.csv')

In [18]:
X_train.head()

,MSSubClass,MSZoning,Neighborhood,OverallQual,YearRemodAdd,RoofStyle,BsmtQual,BsmtExposure,HeatingQC,CentralAir,1stFlrSF,GrLivArea,BsmtFullBath,KitchenQual,Fireplaces,FireplaceQu,GarageType,GarageFinish,GarageCars,PavedDrive,SaleCondition
0,0.235294,0.75,0.636364,0.666667,0.098361,0.0,0.75,0.25,1.0,1,0.356155,0.577712,0.333333,0.666667,0.0,0.2,0.8,0.666667,0.5,1.0,0.75
1,0.0,0.75,0.5,0.555556,0.52459,0.0,0.75,1.0,1.0,1,0.503056,0.470245,0.0,0.333333,0.333333,0.6,0.8,0.666667,0.5,1.0,0.75
2,0.235294,0.75,0.636364,0.666667,0.114754,0.0,0.75,0.5,1.0,1,0.383441,0.593095,0.333333,0.666667,0.333333,0.6,0.8,0.666667,0.5,1.0,0.75
3,0.294118,0.75,0.727273,0.666667,0.606557,0.0,0.5,0.25,0.75,1,0.399941,0.579157,0.333333,0.666667,0.333333,0.8,0.4,0.333333,0.75,1.0,0.0
4,0.235294,0.75,1.0,0.777778,0.147541,0.0,0.75,0.75,1.0,1,0.466237,0.666523,0.333333,0.666667,0.333333,0.6,0.8,0.666667,0.75,1.0,0.75


In [19]:
y_train.head()

,SalePrice
0,12.247694
1,12.109011
2,12.317167
3,11.849398
4,12.429216


In [20]:
X_test.head()

,MSSubClass,MSZoning,Neighborhood,OverallQual,YearRemodAdd,RoofStyle,BsmtQual,BsmtExposure,HeatingQC,CentralAir,1stFlrSF,GrLivArea,BsmtFullBath,KitchenQual,Fireplaces,FireplaceQu,GarageType,GarageFinish,GarageCars,PavedDrive,SaleCondition
0,0.0,0.6,0.5,0.444444,0.822581,0.2,1.0,1.0,1.0,1.0,0.312253,0.312253,0.0,1.0,0.0,0.6,0.166667,1.0,0.2,1.0,0.8
1,0.0,0.8,0.5,0.555556,0.870968,0.6,1.0,1.0,1.0,1.0,0.468253,0.468253,0.0,0.5,0.0,0.6,0.166667,1.0,0.2,1.0,0.8
2,0.235294,0.8,0.333333,0.444444,0.225806,0.2,0.5,1.0,0.5,1.0,0.326139,0.548792,0.0,1.0,0.25,1.0,0.166667,0.0,0.4,1.0,0.8
3,0.235294,0.8,0.333333,0.555556,0.225806,0.2,1.0,1.0,0.0,1.0,0.325285,0.542672,0.0,0.5,0.25,0.4,0.166667,0.0,0.4,1.0,0.8
4,0.588235,0.8,0.916667,0.777778,0.322581,0.2,0.5,1.0,0.0,1.0,0.453388,0.453388,0.0,0.5,0.0,0.6,0.166667,0.666667,0.4,1.0,0.8
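Because the model is trained on one frame and predicts on another, it can help to confirm that both expose the same feature columns in the same order. The sketch below is one way to check this. `compare_feature_columns` is a hypothetical helper that is not part of the original notebook, and the tiny frames are made-up stand-ins for `X_train` and `X_test`.

```python
import pandas as pd

def compare_feature_columns(train, test):
    """Report column differences between two feature frames."""
    train_cols, test_cols = list(train.columns), list(test.columns)
    return {
        "missing_in_test": [c for c in train_cols if c not in test_cols],
        "extra_in_test": [c for c in test_cols if c not in train_cols],
        "same_order": train_cols == test_cols,
    }

# Made-up stand-ins for X_train and X_test
train_demo = pd.DataFrame({"OverallQual": [0.67, 0.56], "GrLivArea": [0.58, 0.47]})
test_demo = pd.DataFrame({"OverallQual": [0.44], "GrLivArea": [0.31]})

report = compare_feature_columns(train_demo, test_demo)
print(report)
```

With the real data the call would be `compare_feature_columns(X_train, X_test)`; any non-empty list or a `False` for `same_order` is worth investigating before calling `predict`.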


In [22]:
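import xgboost
# Despite the variable name, this is a regression model (XGBRegressor), fitted with default settings
classifier = xgboost.XGBRegressor()
classifier.fit(X_train,y_train)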



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

In [23]:
import pickle
filename = 'finalized_model.pkl'
# Use a context manager so the file handle is closed after writing
with open(filename,'wb') as f:
    pickle.dump(classifier,f)
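A saved model is only useful if it can be loaded back later, for example at deployment time. The sketch below shows the full round trip with `pickle`. `save_model` and `load_model` are hypothetical helper names, and a plain dictionary stands in for the fitted model so the example runs without a trained estimator. Only unpickle files you trust, since `pickle.load` can execute arbitrary code.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Serialize any picklable object to the given path."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Load a previously pickled object from the given path."""
    with open(path, "rb") as f:
        return pickle.load(f)

# A dictionary stands in for the fitted model in this sketch
stand_in_model = {"name": "demo", "n_estimators": 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "finalized_model.pkl")
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored == stand_in_model)
```

With the real model, loading it in a fresh session requires `xgboost` to be installed there as well, ideally in the same version that was used for training.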

In [24]:
y_pred = classifier.predict(X_test)

In [25]:
y_pred

array([11.785125, 11.97815 , 11.995672, ..., 12.099141, 11.829487,
       12.210889], dtype=float32)
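The predictions are around 12, which matches the scale of `SalePrice` shown in `y_train.head()` rather than prices in dollars. Assuming the target was transformed with the natural logarithm (`np.log`) in the feature engineering pipeline, which this notebook does not show, a sketch of the inverse transform looks like this. `log_to_price` is a hypothetical helper name.

```python
import numpy as np

def log_to_price(log_predictions):
    """Invert a natural-log transform of the target (assumes np.log was used)."""
    return np.exp(np.asarray(log_predictions, dtype=np.float64))

# Round trip on a made-up price to show that the transform is inverted
example_price = 200000.0
recovered = log_to_price(np.log([example_price]))
print(recovered)
```

If the earlier pipeline used `np.log1p` instead, the matching inverse would be `np.expm1`.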

In [26]:
#Create sample submission file and submit
pred = pd.DataFrame(y_pred)
sub_df = pd.read_csv('sample_submission.csv')

In [28]:
datasets = pd.concat([sub_df['Id'],pred],axis=1)
datasets

,Id,0
0,1461,11.785125
1,1462,11.97815
2,1463,11.995672
3,1464,12.006305
4,1465,12.238935
5,1466,12.01582
6,1467,11.936601
7,1468,11.910883
8,1469,12.003639
9,1470,11.779385


In [29]:
datasets.columns=['Id','SalePrice']

In [30]:
# Note: this overwrites the sample_submission.csv file that was read above
datasets.to_csv('sample_submission.csv',index=False)
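The cells above build the submission in several steps and write the result over the sample file. As an alternative sketch, the helper below assembles the two-column frame in one step and leaves the output file name to the caller. `build_submission` and the file name `submission.csv` are hypothetical, the prices are made-up values, and the function expects prices that have already been converted back to the original scale if the target was log-transformed.

```python
import numpy as np
import pandas as pd

def build_submission(ids, sale_prices):
    """Return a DataFrame with the Id and SalePrice columns used for submission."""
    ids = np.asarray(ids)
    sale_prices = np.asarray(sale_prices, dtype=np.float64).ravel()
    if ids.shape[0] != sale_prices.shape[0]:
        raise ValueError("ids and sale_prices must have the same length")
    return pd.DataFrame({"Id": ids, "SalePrice": sale_prices})

# Made-up ids and prices, for illustration only
submission = build_submission([1461, 1462, 1463], [131000.0, 159000.0, 162000.0])
print(submission)

# To write the file without the index column:
# submission.to_csv("submission.csv", index=False)
```

Naming the columns at construction time avoids the intermediate column called `0`, and writing to a separate file keeps the original sample file intact.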