# Section III c -Decision Tree Regressions

The Decision Tree Regression uses non linear methods to classify the dataset. The Decision tree builds regression model by breaking down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. Decisions trees takes each feature and makes a decision / split on a value that provides the most information gain(minimum entropy). Then it alternates to another feature and makes the same split decision. It continues to do so until no further information gain is available. The overall structure of a decision tree includes a decision node and a leaf node. The decision has two or more branches that each represent the value of the feature tested while the leaf node represents the decision target. Decision trees can handle both numerical and categorical data.



![](./images/15.jpg "")
![](./images/16.jpg "")

When classifying a new data point, take the average values of all the data points in that decision section.

Pros: There's no need for feature scaling so results can be interpretable and works on non linear data. 

Cons: Does not work well if dataset is too small and overfitting can easilty occur

* **Principal Component Analysis** — Before feeding into the Decision Tree, we will do a PCA; varying the number of components from N=10 to N=50.

### Decision Tree Regression with PCA (n=10)

In [11]:
import numpy as np   #Mathematics library
import matplotlib.pyplot as plt # for plotting
import pandas as pd  #manage datasets
import seaborn as sea


df = pd.read_csv('FinishMissing.csv')
df=df.drop('Unnamed: 0',axis=1)

In [12]:
#Drop outliers before splitting ex and y
avg = df['logerror'].mean()
std = df['logerror'].std()
upper_outlier = avg + 2*std
lower_outlier = avg - 2*std
#round up to drop outliers, til reasonable
df=df[ df.logerror > -0.32 ]
df=df[ df.logerror < 0.34 ]
df.to_csv('OutlierRemoved.csv')  #big file



In [13]:
###############Create Dummy variables for Categorical data
df=pd.get_dummies(df,columns=['taxdelinquencyflag','fireplaceflag','propertyzoningdesc','propertycountylandusecode','hashottuborspa','airconditioningtypeid','architecturalstyletypeid','buildingqualitytypeid','buildingclasstypeid','decktypeid','fips','heatingorsystemtypeid','pooltypeid10','pooltypeid2','pooltypeid7','propertylandusetypeid','regionidcounty','regionidcity','regionidzip','regionidneighborhood','storytypeid','typeconstructiontypeid','month','day'],drop_first=True)



In [14]:
dataset=df

#split into response and features, skip to next for testing
X = dataset.iloc[:, 2:].values
y = dataset.iloc[:, 1].values

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)


In [16]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [17]:
# Applying PCA * requires feature scaling
from sklearn.decomposition import PCA
pca = PCA(n_components = 10) # number of principal components explain variance, use '0' first
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
X = np.concatenate((X_train,X_test),axis=0)


In [18]:
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=0,
           splitter='best')

In [19]:
# Predicting a new result
y_pred = regressor.predict(X_test)

In [20]:
# Mean Square Error
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred, sample_weight=None, multioutput='uniform_average')
print(mse)
rmse = np.sqrt(mse)
print(rmse)

0.0123499774861
0.11113045256


In [21]:
# 10-fold cross-validation with all three features
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(regressor, X, y, cv=4, scoring='neg_mean_squared_error')
mse_scores = -scores
# calculate the average MSE
print(mse_scores.mean())
rmse_kfold = np.sqrt(mse_scores.mean())
print(rmse_kfold)



0.0125399244758
0.111981804218


### Decision Tree Regression with PCA (n=20)

In [22]:
dataset=df

#split into response and features, skip to next for testing
X = dataset.iloc[:, 2:].values
y = dataset.iloc[:, 1].values


In [23]:
###################### Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)


In [24]:
##################### Feature Scaling required for Neural Network & Feature Extraction
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [25]:
##### DIMENSIONALITY REDUCTION : PRINCIPAL COMPONENT ANALYSIS(PCA)
# Applying PCA * requires feature scaling
from sklearn.decomposition import PCA
pca = PCA(n_components = 20) # number of principal components explain variance, use '0' first
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
X = np.concatenate((X_train,X_test),axis=0)

In [26]:
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=0,
           splitter='best')

In [27]:
# Predicting a new result
y_pred = regressor.predict(X_test)

In [28]:
# Mean Square Error
mse = mean_squared_error(y_test, y_pred, sample_weight=None, multioutput='uniform_average')
print(mse)
rmse = np.sqrt(mse)
print(rmse)

0.0125167714828
0.111878378084


In [29]:
# 10-fold cross-validation with all three features
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(regressor, X, y, cv=4, scoring='neg_mean_squared_error')
mse_scores = -scores
# calculate the average MSE
print(mse_scores.mean())
rmse_kfold = np.sqrt(mse_scores.mean())
print(rmse_kfold)

0.0129311123273
0.113715048816


### Decision Tree Regression with PCA (n=50)

In [30]:
################PCA N=50
dataset=df

#split into response and features, skip to next for testing
X = dataset.iloc[:, 2:].values
y = dataset.iloc[:, 1].values


In [31]:
###################### Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)


In [32]:
##################### Feature Scaling required for Neural Network & Feature Extraction
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


In [33]:
##### DIMENSIONALITY REDUCTION : PRINCIPAL COMPONENT ANALYSIS(PCA)
# Applying PCA * requires feature scaling
from sklearn.decomposition import PCA
pca = PCA(n_components = 50) # number of principal components explain variance, use '0' first
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
X = np.concatenate((X_train,X_test),axis=0)

In [34]:
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=0,
           splitter='best')

In [35]:
# Predicting a new result
y_pred = regressor.predict(X_test)

In [36]:
# Mean Square Error
mse = mean_squared_error(y_test, y_pred, sample_weight=None, multioutput='uniform_average')
print(mse)
rmse = np.sqrt(mse)
print(rmse)

0.0127332906904
0.112841883582


In [37]:
# 10-fold cross-validation with all three features
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(regressor, X, y, cv=4, scoring='neg_mean_squared_error')
mse_scores = -scores
# calculate the average MSE
print(mse_scores.mean())
rmse_kfold = np.sqrt(mse_scores.mean())
print(rmse_kfold)

0.0130116648027
0.114068684584
