# Section III d - Random Forest Regressions

A random forest is a form of ensemble learning; gradient boosting is another example. Ensemble learning combines multiple algorithms, or multiple runs of the same algorithm, into a single model that is more powerful than any one of its parts. Here we are in essence running many decision trees: instead of taking one prediction from a single tree, we combine the predictions from every tree in the forest.

Pros: A powerful and accurate algorithm that can tackle many types of problems, including nonlinear ones.

Cons: The model loses interpretability, and overfitting can easily occur.

* **Principal Component Analysis**: Before fitting the random forest, we apply PCA, varying the number of components over N=10, N=20, and N=50.
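One way to judge a choice of N is to look at how much variance the retained components explain. The sketch below is illustrative only: it runs on synthetic data, and the names `X_demo` and `retained` are assumptions made for this example rather than part of the original notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 60))

# PCA is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X_demo)

retained = {}
for n in (10, 20, 50):
    pca = PCA(n_components=n).fit(X_scaled)
    # Fraction of total variance captured by the first n components
    retained[n] = pca.explained_variance_ratio_.sum()
    print(n, retained[n])
```

On real data the curve usually flattens well before all components are used, which is one reason to compare a few values of N rather than fix a single one.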


#### Steps of Random Forest
1. Pick K data points at random from the training set.
2. Build the decision tree associated with these K data points.
3. Choose the number of trees, Ntree, that you want to build, and repeat Steps 1 and 2 for each tree.
4. For a new data point, have each of the Ntree trees predict the value of Y, and assign the new data point the average of all the predicted Y values.
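The four steps above can be sketched by hand with scikit-learn's `DecisionTreeRegressor`. This is a minimal illustration on synthetic data, not the implementation inside `RandomForestRegressor`, which can additionally consider a random subset of features at each split. The helper names `fit_forest` and `predict_forest` are hypothetical, and sampling with replacement is an assumption about how K points are drawn.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=10, sample_size=None, seed=0):
    """Fit n_trees decision trees, each on a random sample of the rows."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    k = n if sample_size is None else sample_size
    trees = []
    for _ in range(n_trees):                      # Step 3: repeat for each tree
        idx = rng.integers(0, n, size=k)          # Step 1: K random rows (with replacement)
        tree = DecisionTreeRegressor(random_state=int(rng.integers(0, 2**31 - 1)))
        tree.fit(X[idx], y[idx])                  # Step 2: one tree per sample
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Step 4: average the per-tree predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)

# Synthetic nonlinear data (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)

forest = fit_forest(X_demo, y_demo, n_trees=10)
y_hat = predict_forest(forest, X_demo)
print(y_hat.shape)
```

Averaging is what separates the forest from a single tree: each tree overfits its own sample, and the mean smooths out much of that variance.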

### Random Forest Regression with PCA (n=10)

In [1]:
import numpy as np   #Mathematics library
import matplotlib.pyplot as plt # for plotting
import pandas as pd  #manage datasets
import seaborn as sea


df = pd.read_csv('FinishMissing.csv')
df=df.drop('Unnamed: 0',axis=1)

In [2]:
# Drop outliers before splitting into X and y
avg = df['logerror'].mean()
std = df['logerror'].std()
upper_outlier = avg + 2*std
lower_outlier = avg - 2*std
# Filter with hand-rounded cutoffs (adjusted until the result looked reasonable) rather than the exact bounds above
df=df[ df.logerror > -0.32 ]
df=df[ df.logerror < 0.34 ]
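The cell above computes mean ± 2 standard deviation bounds but filters on fixed cutoffs. As a hedged alternative, the sketch below filters on the computed bounds directly. It uses a synthetic `logerror` column, and the names `demo` and `trimmed` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real logerror column (illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"logerror": rng.normal(0.01, 0.16, size=1000)})

avg = demo["logerror"].mean()
std = demo["logerror"].std()
lower, upper = avg - 2 * std, avg + 2 * std

# Keep only rows strictly inside the two-standard-deviation band
trimmed = demo[(demo["logerror"] > lower) & (demo["logerror"] < upper)]
print(len(demo), len(trimmed))
```

Using the computed bounds keeps the filter in step with the data if the input file changes, at the cost of cutoffs that are less easy to read at a glance.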

In [3]:
############### Create dummy variables for categorical data
categorical_columns = [
    'taxdelinquencyflag', 'fireplaceflag', 'propertyzoningdesc', 'propertycountylandusecode',
    'hashottuborspa', 'airconditioningtypeid', 'architecturalstyletypeid', 'buildingqualitytypeid',
    'buildingclasstypeid', 'decktypeid', 'fips', 'heatingorsystemtypeid',
    'pooltypeid10', 'pooltypeid2', 'pooltypeid7', 'propertylandusetypeid',
    'regionidcounty', 'regionidcity', 'regionidzip', 'regionidneighborhood',
    'storytypeid', 'typeconstructiontypeid', 'month', 'day',
]
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
dataset = df

In [4]:
#split into response and features, skip to next for testing
X = dataset.iloc[:, 2:].values
y = dataset.iloc[:, 1].values

In [5]:
###################### Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)


In [6]:
##################### Feature Scaling required for Neural Network & Feature Extraction
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [7]:
##### DIMENSIONALITY REDUCTION : PRINCIPAL COMPONENT ANALYSIS(PCA)
# Applying PCA * requires feature scaling
from sklearn.decomposition import PCA
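pca = PCA(n_components = 10) # number of principal components to keep
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
# Stack train and test rows for cross-validation; reorder y the same way so rows stay aligned
X = np.concatenate((X_train,X_test),axis=0)
y = np.concatenate((y_train,y_test),axis=0)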

In [8]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)

In [9]:
# Predicting a new result
y_pred = regressor.predict(X_test)


# Mean Square Error
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred, sample_weight=None, multioutput='uniform_average')
print(mse)
rmse = np.sqrt(mse)
print(rmse)

0.0065062360553
0.0806612425846


In [10]:
# 4-fold cross-validation on the PCA-transformed features
from sklearn.model_selection import cross_val_score
scores = cross_val_score(regressor, X, y, cv=4, scoring='neg_mean_squared_error')
mse_scores = -scores
# calculate the average MSE
print(mse_scores.mean())
rmse_kfold = np.sqrt(mse_scores.mean())
print(rmse_kfold)



0.00678085814639
0.0823459661816


### Random Forest Regression with PCA (n=20)

In [11]:
dataset=df

#split into response and features, skip to next for testing
X = dataset.iloc[:, 2:].values
y = dataset.iloc[:, 1].values


In [12]:
###################### Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)


In [13]:
##################### Feature Scaling required for Neural Network & Feature Extraction
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [14]:
##### DIMENSIONALITY REDUCTION : PRINCIPAL COMPONENT ANALYSIS(PCA)
# Applying PCA * requires feature scaling
from sklearn.decomposition import PCA
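pca = PCA(n_components = 20) # number of principal components to keep
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
# Stack train and test rows for cross-validation; reorder y the same way so rows stay aligned
X = np.concatenate((X_train,X_test),axis=0)
y = np.concatenate((y_train,y_test),axis=0)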


In [15]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)

In [16]:
# Predicting a new result
y_pred = regressor.predict(X_test)
# Mean Square Error
mse = mean_squared_error(y_test, y_pred, sample_weight=None, multioutput='uniform_average')
print(mse)
rmse = np.sqrt(mse)
print(rmse)


0.00649311269898
0.0805798529347


In [17]:
# 4-fold cross-validation on the PCA-transformed features
from sklearn.model_selection import cross_val_score
scores = cross_val_score(regressor, X, y, cv=4, scoring='neg_mean_squared_error')
mse_scores = -scores
# calculate the average MSE
print(mse_scores.mean())
rmse_kfold = np.sqrt(mse_scores.mean())
print(rmse_kfold)

0.00683408730047
0.0826685387585


### Random Forest Regression with PCA (n=50)

In [18]:
dataset=df

#split into response and features, skip to next for testing
X = dataset.iloc[:, 2:].values
y = dataset.iloc[:, 1].values


In [19]:
###################### Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)


In [20]:
##################### Feature Scaling required for Neural Network & Feature Extraction
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


In [21]:
##### DIMENSIONALITY REDUCTION : PRINCIPAL COMPONENT ANALYSIS(PCA)
# Applying PCA * requires feature scaling
from sklearn.decomposition import PCA
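pca = PCA(n_components = 50) # number of principal components to keep
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
# Stack train and test rows for cross-validation; reorder y the same way so rows stay aligned
X = np.concatenate((X_train,X_test),axis=0)
y = np.concatenate((y_train,y_test),axis=0)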


In [22]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X_train, y_train)


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)

In [23]:
# Predicting a new result
y_pred = regressor.predict(X_test)



# Mean Square Error
mse = mean_squared_error(y_test, y_pred, sample_weight=None, multioutput='uniform_average')
print(mse)
rmse = np.sqrt(mse)
print(rmse)


0.00649319397194
0.0805803572339


In [24]:
# 4-fold cross-validation on the PCA-transformed features
from sklearn.model_selection import cross_val_score
scores = cross_val_score(regressor, X, y, cv=4, scoring='neg_mean_squared_error')
mse_scores = -scores
# calculate the average MSE
print(mse_scores.mean())
rmse_kfold = np.sqrt(mse_scores.mean())
print(rmse_kfold)

0.00690225572011
0.0830798153591
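The three runs above differ only in `n_components`, so they could be folded into one loop. The sketch below is one possible way to do that with a scikit-learn pipeline, under the assumption that scaling and PCA should be refit inside each cross-validation fold. It runs on synthetic data, and the names `X_demo`, `y_demo`, and `results` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 60))
y_demo = 0.1 * X_demo[:, 0] + rng.normal(0, 0.05, size=400)

results = {}
for n in (10, 20, 50):
    # Scaler and PCA are fit on the training part of each fold only
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=n),
        RandomForestRegressor(n_estimators=10, random_state=0),
    )
    scores = cross_val_score(model, X_demo, y_demo, cv=4,
                             scoring="neg_mean_squared_error")
    # scikit-learn returns negated MSE, so flip the sign before the square root
    results[n] = np.sqrt(-scores.mean())
    print(n, results[n])
```

Because the pipeline is passed to `cross_val_score` as a single estimator, the held-out fold never influences the scaling or the principal components, and X and y stay in their original row order.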
