After exploring the data, we're going to find out how much of it is relevant for our decision trees. This is a critical point in every Data Science project, since too many irrelevant features can easily result in poor model generalisation (accuracy on test/real/unseen observations). 

In this IPython notebook, I am going to cover Random Forest. Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It can also help with dimensionality reduction (through variable importance), and it is fairly robust to outliers and, in some implementations, to missing values, so it takes care of several essential steps of data exploration and does a fairly good job. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.
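
To make the ensemble idea concrete, here is a minimal hand-rolled sketch of bagging: each tree is trained on a bootstrap sample and the predictions are averaged. It is only an illustration of the principle, not how `RandomForestRegressor` is implemented internally. The synthetic data from `make_regression`, the number of trees and the `max_features=0.5` setting are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration (not the house-price data).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_fit, y_fit = X[:200], y[:200]
X_new = X[200:]

rng = np.random.default_rng(0)
n_trees = 25  # assumed value, small enough to run quickly
tree_predictions = []

for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X_fit), size=len(X_fit))
    # max_features=0.5 lets each split consider a random half of the features,
    # which decorrelates the trees from one another.
    tree = DecisionTreeRegressor(max_features=0.5,
                                 random_state=int(rng.integers(1_000_000)))
    tree.fit(X_fit[idx], y_fit[idx])
    tree_predictions.append(tree.predict(X_new))

# The ensemble prediction is the average over all individual trees.
ensemble_prediction = np.mean(tree_predictions, axis=0)
print(ensemble_prediction.shape)
```

Averaging many trees that each saw slightly different data is what reduces the variance of a single deep tree.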

Advantages of Random Forest:

1) This algorithm can solve both types of problems, i.e. classification and regression, and gives decent estimates on both fronts.
      
2) One of the benefits of Random Forest which excites me most is <B> its power to handle large data sets with high dimensionality. It can handle thousands of input variables and identify the most significant ones, so it is considered one of the dimensionality reduction methods. </B>  Further, the model outputs the importance of each variable, which can be a very handy feature (on some random data set).

3) It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.

4) It has methods for balancing errors in data sets where classes are imbalanced.

5) The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.

6) Random Forest involves sampling the input data with replacement, which is called bootstrap sampling. For each tree, roughly one third of the data is not used for training and can be used for testing. These are called the out-of-bag samples, and the error estimated on them is known as the out-of-bag error. Studies of out-of-bag error estimates give evidence that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate can remove the need for a set-aside test set.
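
The out-of-bag idea in point 6 can be sketched as follows. The first part draws one bootstrap sample and measures how many rows were left out; the second part asks scikit-learn for the out-of-bag score via `oob_score=True`. The synthetic data and all parameter values are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Part 1: fraction of rows left out of a single bootstrap sample.
rng = np.random.default_rng(0)
n_rows = 10_000
sample = rng.integers(0, n_rows, size=n_rows)  # indices drawn with replacement
oob_fraction = 1 - len(np.unique(sample)) / n_rows
# For large n this approaches 1/e, i.e. a little over one third.
print("Out-of-bag fraction: %.3f" % oob_fraction)

# Part 2: out-of-bag score on synthetic data (not the house-price data).
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
rf_oob = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf_oob.fit(X, y)
# For a regressor, oob_score_ is the R^2 computed on out-of-bag predictions.
print("OOB R^2: %.3f" % rf_oob.oob_score_)
```

Each row is scored only by the trees that did not see it during training, which is why the estimate behaves like a validation score.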





In [73]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skew
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated private path


%config InlineBackend.figure_format = 'retina' #set 'png' here when working on notebook
%matplotlib inline

In [62]:
train_df = pd.read_csv('train_cleaned.csv')
test_df  = pd.read_csv('test_cleaned.csv')

id = test_df['Id']
train_df.drop('Id',axis = 1, inplace = True)
test_df.drop('Id',axis = 1 , inplace = True)
y_train = train_df['SalePrice']
x_train = train_df.drop('SalePrice', axis = 1)

test_df.head()


Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,Functional_Min2,BldgType_Twnhs,RoofStyle_Mansard,RoofMatl_CompShg,SaleCondition_Partial,GarageCond_Ex,Functional_Maj2,SaleType_New,GarageType_BuiltIn,Exterior2nd_CBlock
0,3.044522,80,9.360741,5,6,1961,1961,0.0,6.150603,4.976734,...,0,0,0,1,0,0,0,0,0,0
1,3.044522,81,9.565775,6,6,1958,1958,4.691348,6.828712,0.0,...,0,0,0,1,0,0,0,0,0,0
2,4.110874,74,9.534668,5,5,1997,1998,0.0,6.674561,0.0,...,0,0,0,1,0,0,0,0,0,0
3,4.110874,78,9.208238,6,6,1998,1998,3.044522,6.401917,0.0,...,0,0,0,1,0,0,0,0,0,0
4,4.795791,43,8.518392,8,5,1992,1992,0.0,5.575949,0.0,...,0,0,0,1,0,0,0,0,0,0


In [65]:
y_train.head()
feat_labels = train_df.columns.values
#feat_labels
x_train.isnull().values.any()

False

Since Random Forest is a good regression method that can handle a large amount of high-dimensional data (here we have ~280 columns as features), we are going to construct a function which fits the model and lists the features that are most important for prediction. 

In [91]:
from sklearn.ensemble import RandomForestRegressor

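def printFeature(model, x_train, y_train, performCV=True, cvFolds=10, top_n=20):
    # performCV and cvFolds are unused here; they are kept so existing calls still work.
    # top_n limits the report to the most important features (plotting ~280 bars is unreadable).
    model.fit(x_train, y_train)
    feature_importances = model.feature_importances_
    feat_labels = list(x_train.columns.values)
    feat_imp = pd.Series(feature_importances, index=feat_labels).sort_values(ascending=False)
    print(feat_imp.head(top_n))
    feat_imp.head(top_n).plot(kind='bar')
    plt.ylabel('Feature Importances')
    plt.show()
    return feat_imp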
    
def modelFit(model, x_train, y_train, performCV=True, cvFolds=10):
    # Fit on the full training set and report in-sample and cross-validated scores.
    model.fit(x_train, y_train)
    x_train_predictions = model.predict(x_train)

    # In-sample metrics compare the true targets with the predictions on the training data.
    print("\n Model Report")
    print("\n Mean Squared Error = %.4g" % metrics.mean_squared_error(y_train, x_train_predictions))
    print("\n R2 score = %.4g" % metrics.r2_score(y_train, x_train_predictions))

    if performCV:
        # scikit-learn scorers follow "higher is better", so MSE is exposed as
        # 'neg_mean_squared_error'; the sign is flipped back to get a positive MSE.
        cv_score_meanSqError = -cross_val_score(model, x_train, y_train, cv=cvFolds, scoring='neg_mean_squared_error')
        cv_score_r2 = cross_val_score(model, x_train, y_train, cv=cvFolds, scoring='r2')

        print("CV Score (Mean Square Error) : Mean - %.7g | Std - %.7g | Min - %.7g | Max - %.7g" % (np.mean(cv_score_meanSqError),np.std(cv_score_meanSqError),np.min(cv_score_meanSqError),np.max(cv_score_meanSqError)))
        print("CV Score ( R square        ): Mean - %.7g | Std - %.7g | Min - %.7g | Max - %.7g" % (np.mean(cv_score_r2),np.std(cv_score_r2),np.min(cv_score_r2),np.max(cv_score_r2)))




In [None]:
#defining the model
n_estimators = 1000
random_state = 0
n_jobs = -1
rf = RandomForestRegressor(n_estimators = n_estimators,random_state = random_state,n_jobs = n_jobs) 
printFeature(rf,x_train,y_train)
import warnings
warnings.filterwarnings("ignore")
modelFit(rf,x_train,y_train,performCV = True, cvFolds = 10)

<B> Cross Validation </B>  is a model validation technique that splits the training dataset into a given number of "folds". Each split uses different data for training and testing purposes, so the model is trained and tested on different data each time. Across all folds, every observation is used for both training and testing, which reduces the bias of any single split and gives a good idea of how well the chosen model generalises. The main downside is that Cross Validation requires the model to be trained once per fold, so the computational cost can be very high for complex models or huge datasets.
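
The fold-by-fold procedure described above can be sketched with scikit-learn's `KFold`. This is only an illustration on synthetic data; the number of folds, the forest size and the use of `shuffle=True` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data, used only for illustration (not the house-price data).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mse = []

for train_idx, val_idx in kfold.split(X):
    # A fresh model is trained on each fold's training part...
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # ...and scored on the part that was held out for that fold.
    fold_mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

print("MSE per fold:", np.round(fold_mse, 2))
print("Mean MSE: %.2f" % np.mean(fold_mse))
```

The model is fitted five times here, once per fold, which is the computational cost mentioned above.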

Here, I am going to use the K-Fold Cross Validation technique. I am going to construct a function which performs K-Fold cross validation on the training data set and returns the mean squared error.