# Lab 2, Part 1 - Decision Tree & Random Forest

## Lab Instruction 

In this lab, you will build a Decision Tree model and a Random Forest model to predict the sale price class of houses, **SalePrice = (Low, Medium, High)**, from a given set of attributes. <br>

The data file is `lab2_dataset.csv`. <br>

Note that you need to explore the data, process or drop attributes, <br>
and map the numerical sale prices to categorical values (Low, Medium, High).
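
The mapping from numerical prices to three classes can be sketched with quantile binning. This is only an illustration on made-up prices: the toy values and the 33%/66% cut points are assumptions, not a prescribed choice for the lab.

```python
import pandas as pd

# Toy prices standing in for the SalePrice column (values are made up).
prices = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 900], name="SalePrice")

# Cut at the 33rd and 66th percentiles so each class gets roughly a third of the rows.
levels = pd.qcut(prices, q=[0, 0.33, 0.66, 1], labels=["Low", "Medium", "High"])

# Count how many rows fall into each class.
counts = levels.value_counts()
print(counts)
```

Quantile-based cuts keep the three classes roughly balanced, which makes accuracy easier to interpret later on.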

###  1. Import the Dataset and Learn About the Data

In [31]:
import pandas as pd
df = pd.read_csv("lab2_dataset.csv")
print(pd.__version__) # You should use version 0.21+

0.23.4


In [32]:
pd.options.display.max_rows = None
print("NA Percentage(%)")
print((df.isna().sum()/df.shape[0])*100)

NA Percentage(%)
Id                0.000000
MSSubClass        0.000000
MSZoning          0.000000
LotFrontage      17.739726
LotArea           0.000000
Street            0.000000
Alley            93.767123
LotShape          0.000000
LandContour       0.000000
Utilities         0.000000
LotConfig         0.000000
LandSlope         0.000000
Neighborhood      0.000000
Condition1        0.000000
Condition2        0.000000
BldgType          0.000000
HouseStyle        0.000000
OverallQual       0.000000
OverallCond       0.000000
YearBuilt         0.000000
YearRemodAdd      0.000000
RoofStyle         0.000000
RoofMatl          0.000000
Exterior1st       0.000000
Exterior2nd       0.000000
MasVnrType        0.547945
MasVnrArea        0.547945
ExterQual         0.000000
ExterCond         0.000000
Foundation        0.000000
BsmtQual          2.534247
BsmtCond          2.534247
BsmtExposure      2.602740
BsmtFinType1      2.534247
BsmtFinSF1        0.000000
BsmtFinType2      2.602740
BsmtFinSF2 

In [33]:
pd.options.display.max_rows = 10

###  2. Preprocessing
Think about what kinds of features the model accepts and how the model computes on them. Then use the techniques you have learned to preprocess the data.

**For example:** 
-  Remove non-informative features
-  Remove features with too many NA
-  Remove rows with incomplete data
-  Remove features with highly unbalanced labels
-  Encode categorical variables as appropriate

Then, create one dataframe for the features and another frame for the output variable.
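
Two of the steps listed above, dropping features with too many NA values and dropping features dominated by a single value, can be sketched as follows. The toy frame and both thresholds (`NA_THRESHOLD`, `DOMINANCE_THRESHOLD`) are assumptions chosen for illustration; suitable cutoffs depend on the data.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the lab data.
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Pave", np.nan, np.nan],
    "Street": ["Pave", "Pave", "Pave", "Pave", "Pave", "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "LotShape": ["Reg", "IR1", "Reg", "IR1", "Reg", "IR1"],
})

NA_THRESHOLD = 0.5         # assumed cutoff: drop columns with more than 50% NA
DOMINANCE_THRESHOLD = 0.8  # assumed cutoff: drop categoricals dominated by one value

# Fraction of missing values per column.
na_frac = toy.isna().mean()
many_na_cols = na_frac[na_frac > NA_THRESHOLD].index.tolist()
reduced = toy.drop(columns=many_na_cols)

# Share of the most frequent value in each remaining non-numeric column.
cat_cols = reduced.select_dtypes(exclude="number").columns
top_share = {c: reduced[c].value_counts(normalize=True).iloc[0] for c in cat_cols}
unbalanced_cols = [c for c, share in top_share.items() if share > DOMINANCE_THRESHOLD]
reduced = reduced.drop(columns=unbalanced_cols)

print(many_na_cols, unbalanced_cols, list(reduced.columns))
```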

In [34]:
drop_noninfo = df.drop(columns=['Id'])

In [35]:
drop_many_na_col = drop_noninfo.drop(columns=['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'])
# drop_many_na_col.info()

In [36]:
drop_many_na_row = drop_many_na_col.dropna(how='any')
drop_many_na_row.shape

(1094, 75)

In [38]:
# Encode with one hot
data_encode_onehot =  pd.get_dummies(drop_many_na_row) 
data_encode_onehot.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1094 entries, 0 to 1459
Columns: 260 entries, MSSubClass to SaleCondition_Partial
dtypes: float64(3), int64(34), uint8(223)
memory usage: 563.0 KB


In [41]:
data_encode_onehot['SaleLevel'] = pd.qcut(data_encode_onehot['SalePrice'], q=[0,0.33,0.66,1], labels=['Low','Median','High'])
data_encode_onehot.shape

(1094, 261)

In [45]:
cleaned_dataset = data_encode_onehot.drop(columns=['SalePrice'])
print(cleaned_dataset.shape)
# for i in cleaned_dataset.columns:
#     print(i)

(1094, 260)


In [46]:
dataset_x = cleaned_dataset.drop(columns=['SaleLevel']).copy()
dataset_y = cleaned_dataset.SaleLevel.copy()

In [47]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(dataset_x,dataset_y,test_size=0.2)

## Decision Tree Classifier

Use both hold-out and K-fold CV to evaluate your model.

Analyze the model results. Do you think the model is good enough? <br>
Does it overfit or underfit the data? <br>
Explain and provide evidence to support your claims.
Look at the classification metrics and confusion matrices of both the train and test sets.
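
One way to gather evidence about overfitting is to compare train and test accuracy while varying the tree depth. The sketch below uses synthetic data from `make_classification` as a stand-in for the lab features, and the depth values and `random_state` settings are assumptions for illustration. A large gap between train and test accuracy suggests overfitting; low accuracy on both suggests underfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the lab's features and labels.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = []
for depth in [2, 4, 8, None]:  # None lets the tree grow until its leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc = clf.score(X_tr, y_tr)
    test_acc = clf.score(X_te, y_te)
    results.append((depth, train_acc, test_acc))
    print(f"max_depth={depth}: train={train_acc:.3f} "
          f"test={test_acc:.3f} gap={train_acc - test_acc:.3f}")
```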



In [48]:
from sklearn.tree import DecisionTreeClassifier 

dt_clf = DecisionTreeClassifier() 
dt_clf = dt_clf.fit(x_train, y_train)

In [49]:
yhat_train = dt_clf.predict(x_train) 
yhat_train_prob = dt_clf.predict_proba(x_train)

In [50]:
yhat_test = dt_clf.predict(x_test) 
yhat_test_prob = dt_clf.predict_proba(x_test)

### 3.1 Hold-out evaluation

Evaluate and analyse the results using the `classification_report` function and the confusion matrix.
- See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

### Train Set

In [51]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

print('Train Accuracy: %s'%accuracy_score(y_train, yhat_train))
print(classification_report(y_train, yhat_train))
print('Confusion Matrix')
print(confusion_matrix(y_train, yhat_train , labels=["Low", "Median","High"]))

Train Accuracy: 1.0
             precision    recall  f1-score   support

       High       1.00      1.00      1.00       296
        Low       1.00      1.00      1.00       295
     Median       1.00      1.00      1.00       284

avg / total       1.00      1.00      1.00       875

Confusion Matrix
[[295   0   0]
 [  0 284   0]
 [  0   0 296]]


### Test Set

In [52]:
print('Test Accuracy: %s'%(accuracy_score(y_test, yhat_test)*100))
print(classification_report(y_test, yhat_test))
print('Confusion Matrix')
print(confusion_matrix(y_test, yhat_test , labels=["Low", "Median","High"]))

Test Accuracy: 78.99543378995433
             precision    recall  f1-score   support

       High       0.84      0.85      0.85        75
        Low       0.79      0.90      0.84        69
     Median       0.72      0.63      0.67        75

avg / total       0.79      0.79      0.79       219

Confusion Matrix
[[62  7  0]
 [16 47 12]
 [ 0 11 64]]


### Visualize  Tree 
- We will use https://github.com/xflr6/graphviz

To install this package, type the following command in the Anaconda Prompt.

```conda install -c conda-forge graphviz python-graphviz ```

Run the following cell to visualize your graph

In [16]:
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(dt_clf, out_file=None,
                                feature_names=list(x_train.columns),
                                # class_names must follow dt_clf.classes_ (sorted order: High, Low, Median)
                                class_names=list(dt_clf.classes_),
                                filled=True, 
                                rounded=True, 
                                special_characters=True, 
                                proportion=True)
graph = graphviz.Source(dot_data) 
graph.render("tree")

'tree.pdf'

The output of the code above is a PDF file containing a visualization of your decision tree model. Open the PDF file and analyse your model.
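
If Graphviz is not available, a fitted tree can also be inspected as text. The sketch below is an alternative, not part of the lab's required steps: it assumes scikit-learn 0.21 or later (for `export_text`) and uses the Iris dataset purely as a stand-in so that it runs without the lab CSV.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris is only a stand-in dataset so the sketch runs without the lab CSV.
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Plain-text view of the split rules.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# Features ranked by impurity-based importance (importances sum to 1).
ranked = sorted(zip(iris.feature_names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```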

### 3.2 K-Fold CV 


### Train Set

In [17]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
dt_score_train = cross_val_score(DecisionTreeClassifier(), x_train, y_train, cv = 5)
print(dt_score_train)
print("5-Fold Cross Validation Accuracy : %1.4f" % dt_score_train.mean())

[0.74431818 0.74431818 0.76       0.74712644 0.70689655]
5-Fold Cross Validation Accuracy : 0.7405


### Test Set

In [18]:
dt_score_test = cross_val_score(DecisionTreeClassifier(), x_test, y_test, cv = 5)
print(dt_score_test)
print("5-Fold Cross Validation Accuracy : %1.4f" % dt_score_test.mean())

[0.6        0.68181818 0.63636364 0.70454545 0.61904762]
5-Fold Cross Validation Accuracy : 0.6484
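
Cross-validation can also report the score on the training folds, which makes the train/validation gap visible within a single run. The sketch below is a hedged variant of the cells above: it uses synthetic data as a stand-in, and the explicit `StratifiedKFold` with shuffling and a fixed `random_state` is an assumption added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the lab's features and labels.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_out = cross_validate(DecisionTreeClassifier(random_state=0), X, y,
                        cv=cv, return_train_score=True)

print("train folds     :", cv_out["train_score"])
print("validation folds:", cv_out["test_score"])
print("mean gap        : %.4f"
      % (cv_out["train_score"].mean() - cv_out["test_score"].mean()))
```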


## Random Forest Classifier

Use both Hold-out and K-fold CV to evaluate the classifier. 

Analyze the model results. 
- Do you think the model is good enough? Does it overfit or underfit the data? 
- How does it perform compared to the basic decision tree classifier? <br>

Explain and provide evidence to support your claims.
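
A fair comparison between the two classifiers evaluates them on the same folds. The following is a minimal sketch under stated assumptions: synthetic stand-in data, `n_estimators=100`, and fixed `random_state` values are illustrative choices, not the lab's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the lab's features and labels.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Reusing one splitter with a fixed seed gives both models identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {name: cross_val_score(model, X, y, cv=cv) for name, model in models.items()}
for name, fold_scores in scores.items():
    print(f"{name}: mean={fold_scores.mean():.4f} std={fold_scores.std():.4f}")
```

Reporting the standard deviation alongside the mean helps judge whether a difference between the two models is larger than the fold-to-fold variation.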

In [19]:
from sklearn.ensemble import RandomForestClassifier 
rf_clf = RandomForestClassifier() 
rf_clf = rf_clf.fit(x_train, y_train) 
rf_clf

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [20]:
yrf_train_pred = rf_clf.predict(x_train) 
yrf_train_prob = rf_clf.predict_proba(x_train)

yrf_test_pred = rf_clf.predict(x_test) 
yrf_test_prob = rf_clf.predict_proba(x_test)

### 4.1 Hold-out evaluation

Evaluate and analyse your results using the `classification_report` function and the confusion matrix.
- See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

### Train Set

In [21]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

print('Train Accuracy: %s'%accuracy_score(y_train, yrf_train_pred))
print(classification_report(y_train, yrf_train_pred))
print('Confusion Matrix')
print(confusion_matrix(y_train, yrf_train_pred , labels=["Low", "Median","High"]))

Train Accuracy: 0.9931428571428571
             precision    recall  f1-score   support

       High       1.00      1.00      1.00       292
        Low       0.99      1.00      0.99       293
     Median       1.00      0.98      0.99       290

avg / total       0.99      0.99      0.99       875

Confusion Matrix
[[293   0   0]
 [  4 285   1]
 [  0   1 291]]


### Test Set

In [22]:
print('Test Accuracy: %s'%accuracy_score(y_test, yrf_test_pred))
print(classification_report(y_test, yrf_test_pred))
print('Confusion Matrix')
print(confusion_matrix(y_test, yrf_test_pred , labels=["Low", "Median","High"]))

Test Accuracy: 0.7488584474885844
             precision    recall  f1-score   support

       High       0.85      0.86      0.86        79
        Low       0.75      0.83      0.79        71
     Median       0.62      0.54      0.57        69

avg / total       0.74      0.75      0.74       219

Confusion Matrix
[[59 12  0]
 [20 37 12]
 [ 0 11 68]]


### 4.2 K-Fold CV with accuracy metric


### Train Set

In [23]:
from sklearn.model_selection import cross_val_score

scoreTrain = cross_val_score(RandomForestClassifier(), x_train, y_train, cv = 5)
print(scoreTrain) 
print("5-Fold Cross Validation Accuracy : %1.4f" % scoreTrain.mean())

[0.80113636 0.77840909 0.85142857 0.78735632 0.81609195]
5-Fold Cross Validation Accuracy : 0.8069


### Test Set

In [24]:
scoreTest = cross_val_score(RandomForestClassifier(), x_test, y_test, cv = 5)
print(scoreTest) 
print("5-Fold Cross Validation Accuracy : %1.4f" % scoreTest.mean())

[0.66666667 0.68181818 0.72727273 0.68181818 0.71428571]
5-Fold Cross Validation Accuracy : 0.6944


### 4.3 Evaluate using multiple metrics
- precision: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html
- recall: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

In [25]:
from sklearn.metrics import precision_score,recall_score
# average=None returns one score per class, in sorted label order (High, Low, Median)
print('Precision[train]'+"\t"+str(precision_score(y_train, yrf_train_pred, average=None))) 
print("Recall[train]:" + "\t\t"+str(recall_score(y_train, yrf_train_pred, average=None)))
print('Precision[test]'+"\t\t"+str(precision_score(y_test, yrf_test_pred, average=None))) 
print("Recall[test]:" + "\t\t"+str(recall_score(y_test, yrf_test_pred, average=None)))

Precision[train]	[0.99657534 0.98653199 0.9965035 ]
Recall[train]:		[0.99657534 1.         0.98275862]
Precision[test]		[0.85       0.74683544 0.61666667]
Recall[test]:		[0.86075949 0.83098592 0.53623188]
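
Several metrics can also be collected across cross-validation folds in one call. The sketch below assumes macro averaging (the unweighted mean over classes), synthetic stand-in data, and `n_estimators=50`; all three are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic 3-class data standing in for the lab's features and labels.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Macro averaging weights every class equally, regardless of its size.
scoring = ["accuracy", "precision_macro", "recall_macro"]
cv_out = cross_validate(RandomForestClassifier(n_estimators=50, random_state=0),
                        X, y, cv=5, scoring=scoring)

for metric in scoring:
    fold_scores = cv_out[f"test_{metric}"]
    print(f"{metric}: mean={fold_scores.mean():.4f} std={fold_scores.std():.4f}")
```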


### 4.4 Parameter Tuning using GridSearch

Try a grid search over the parameters max_depth, max_features, and n_estimators.
Determine which parameter set achieves the best result. <br>


- See more: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
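
A grid that covers all three parameters named above might look like the sketch below. The candidate values, the synthetic stand-in data, `cv=3`, and the fixed `random_state` are assumptions for illustration. The final line scores the refitted best model on a held-out split that was not used during the search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class data standing in for the lab's features and labels.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Assumed candidate values; widen or narrow the grid as needed.
param_grid = {
    "max_depth": [3, 5],
    "max_features": [5, 10],
    "n_estimators": [10, 50],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_, round(search.best_score_, 4))

# GridSearchCV refits the best parameter set on all of X_tr by default.
test_acc = search.score(X_te, y_te)
print("held-out accuracy: %.4f" % test_acc)
```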

In [26]:
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth':[3,5,7], 'max_features':[5, 10]} 
clf = GridSearchCV(RandomForestClassifier(), param_grid) 
clf.fit(x_train, y_train)

s = pd.DataFrame(clf.cv_results_) 
s.sort_values('mean_test_score',ascending=False)



| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_max_features | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.017483 | 0.002307 | 0.001995 | 4.052337e-07 | 5 | 10 | {'max_depth': 5, 'max_features': 10} | 0.744027 | 0.818493 | 0.803448 | 0.788571 | 0.032195 | 1 | 0.893471 | 0.852487 | 0.878632 | 0.874863 | 0.016942 |
| 5 | 0.016612 | 0.000464 | 0.002004 | 1.365593e-05 | 7 | 10 | {'max_depth': 7, 'max_features': 10} | 0.740614 | 0.794521 | 0.831034 | 0.788571 | 0.037142 | 1 | 0.95189 | 0.93825 | 0.935043 | 0.941728 | 0.007304 |
| 1 | 0.020267 | 0.002863 | 0.002661 | 0.0009441457 | 3 | 10 | {'max_depth': 3, 'max_features': 10} | 0.774744 | 0.767123 | 0.748276 | 0.763429 | 0.011114 | 3 | 0.810997 | 0.778731 | 0.806838 | 0.798855 | 0.014331 |
| 4 | 0.015286 | 0.000949 | 0.002343 | 0.0004722527 | 7 | 5 | {'max_depth': 7, 'max_features': 5} | 0.754266 | 0.773973 | 0.748276 | 0.758857 | 0.010973 | 4 | 0.927835 | 0.886792 | 0.876923 | 0.897184 | 0.022045 |
| 2 | 0.025912 | 0.002141 | 0.003012 | 0.001415787 | 5 | 5 | {'max_depth': 5, 'max_features': 5} | 0.744027 | 0.739726 | 0.703448 | 0.729143 | 0.018176 | 5 | 0.835052 | 0.787307 | 0.813675 | 0.812011 | 0.019527 |
| 0 | 0.017967 | 0.004959 | 0.001992 | 4.440535e-06 | 3 | 5 | {'max_depth': 3, 'max_features': 5} | 0.699659 | 0.743151 | 0.7 | 0.714286 | 0.020429 | 6 | 0.745704 | 0.744425 | 0.757265 | 0.749132 | 0.005775 |


In [27]:
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
print()
print(clf.best_params_, clf.best_score_)

0.714 (+/-0.041) for {'max_depth': 3, 'max_features': 5}
0.763 (+/-0.022) for {'max_depth': 3, 'max_features': 10}
0.729 (+/-0.036) for {'max_depth': 5, 'max_features': 5}
0.789 (+/-0.064) for {'max_depth': 5, 'max_features': 10}
0.759 (+/-0.022) for {'max_depth': 7, 'max_features': 5}
0.789 (+/-0.074) for {'max_depth': 7, 'max_features': 10}

{'max_depth': 5, 'max_features': 10} 0.7885714285714286
