<a href="https://colab.research.google.com/github/arbi11/YCBS-272/blob/master/Tree_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tree-Based Methods

### Load file
Two libraries are commonly used to load a CSV file:
- NumPy functions `np.loadtxt` and `np.genfromtxt`
- pandas function `pd.read_csv`

Here we use pandas.
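
As a rough illustration of why pandas is convenient here, the sketch below reads the same small CSV with both libraries. The CSV text and its column names are hypothetical and are not taken from `spamdata.csv`.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory CSV, used only for illustration
csv_text = (
    "word_freq_free,char_freq_dollar,spam\n"
    "0.10,0.00,0\n"
    "0.32,0.18,1\n"
    "0.00,0.05,0\n"
)

# pandas keeps the header as column names and infers a dtype per column
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo.dtypes)

# NumPy returns a plain float array; the header row must be skipped explicitly
arr_demo = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
print(arr_demo.shape)
```

With pandas the column names travel with the data, which is what later cells rely on when they pass `X_train.columns` as feature names.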

In [0]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

# graphviz is not installed by default with Anaconda.
# To install it, run this line in the command line:
# conda install python-graphviz

import graphviz

%matplotlib inline

In [0]:
! git clone https://github.com/arbi11/YCBS-272.git


In [0]:
! ls


In [0]:
%cd YCBS-272/


In [0]:
! ls


In [0]:

df1 = pd.read_csv('spamdata.csv')

In [0]:
# Only the last expression of a cell is displayed, so print the preview explicitly
print(df1.head())
df1.info()

## 1-Fitting Classification Trees
The sklearn library has many useful tools for constructing classification and regression trees.

Import the necessary libraries
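
The cells in this notebook call several scikit-learn names that are never imported. A minimal set of imports that would cover them, assuming a reasonably recent scikit-learn is installed, might look like this:

```python
# Splitting data into training and test sets
from sklearn.model_selection import train_test_split

# Tree models and the Graphviz exporter used to draw them
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz

# Metrics used for classification (first two) and regression (last one)
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error
```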

We'll start by using classification trees to analyze the Spam data set. In order to properly evaluate the performance of a classification tree on the data, we must estimate the test error rather than simply computing the training error. We first split the observations into a training set and a test set:

In [0]:
X = df1.iloc[:,:57]
y = df1.iloc[:,-1]

X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size = 0.2, random_state=1)

Build a decision tree for classification with scikit-learn and name it `classification_tree_spam`.

Use `DecisionTreeClassifier()` to fit a classification tree that predicts spam from the training set. Use the `max_depth` argument to limit the depth of the tree.
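
The sketch below shows the general pattern on synthetic data rather than on the spam set. The names `X_demo`, `y_demo` and `demo_tree` are placeholders chosen for this example; on the real data the same two steps would be applied to `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the spam features (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# max_depth caps the number of splits along any root-to-leaf path
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=1)
demo_tree.fit(X_tr, y_tr)

fitted_depth = demo_tree.get_depth()   # never larger than max_depth
test_acc = demo_tree.score(X_te, y_te) # mean accuracy on the held-out split
print(fitted_depth, test_acc)
```

A smaller `max_depth` gives a tree that is easier to read and less prone to overfitting, at the cost of a coarser fit.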

One of the most attractive properties of trees is that they can be displayed graphically. In sklearn this is a bit of a roundabout process: we use the `export_graphviz()` function to write the tree structure to a .dot file, and the `graphviz.Source()` function to display the image:

In [0]:
export_graphviz(classification_tree_spam, 
                out_file = "spam_tree.dot", 
                feature_names = X_train.columns)

with open("spam_tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

The most important indicator of spam emails appears to be the word Dollar.
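
If Graphviz is not available, a text view of the tree can serve the same purpose. The sketch below uses `sklearn.tree.export_text` on synthetic data and reads off the feature used at the root split, which is the kind of "most important indicator" statement made above. The data and the `feature_i` names are placeholders for this example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with made-up feature names (assumption: not the spam set)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Plain-text rendering of the split rules, one line per node
tree_text = export_text(demo_tree, feature_names=feature_names)
print(tree_text)

# Node 0 is the root; tree_.feature holds the column index each node splits on
root_feature = feature_names[demo_tree.tree_.feature[0]]
print("Root split uses:", root_feature)
```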

Finally, let's evaluate the tree's performance on the test data with the `predict()` method. We can then build a confusion matrix, which shows that we are making correct predictions for around 87.8% of the test data set (the exact figure depends on the `max_depth` you chose):

In [0]:
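pred = classification_tree_spam.predict(X_test)
# confusion_matrix returns rows = true classes, columns = predicted classes;
# after .T the rows are predicted classes and the columns are true classes
cm = pd.DataFrame(confusion_matrix(y_test, pred).T, 
                  index = ['No', 'Yes'], 
                  columns = ['No', 'Yes'])
print(cm)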

In [0]:
# accuracy_score computes the fraction of correct predictions directly
accuracy_score(y_test, pred)
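
The accuracy and the confusion matrix are two views of the same counts: accuracy is the sum of the diagonal divided by the total number of observations. The small example below checks this on hypothetical labels that are not taken from the spam data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, used only for illustration
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# Untransposed: rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Correct predictions sit on the diagonal
acc_from_cm = np.trace(cm_demo) / cm_demo.sum()
acc_direct = accuracy_score(y_true_demo, y_pred_demo)

print(cm_demo)
print(acc_from_cm, acc_direct)  # both are 0.75 (6 correct out of 8)
```

Transposing the matrix, as the cell above does, swaps rows and columns but leaves the diagonal, and therefore the accuracy, unchanged.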

#### Exercise

Now try different tree depths and compare their accuracy on the test set.

In [0]:
# You may play with the depth to prune the tree at different levels
dt10 = DecisionTreeClassifier(max_depth=10)
dt11 = DecisionTreeClassifier(max_depth=11)
dt12 = DecisionTreeClassifier(max_depth=12)

dt10.fit(X_train, y_train)
dt11.fit(X_train, y_train)
dt12.fit(X_train, y_train)



In [0]:

y10_pred = dt10.predict(X_test)
y11_pred = dt11.predict(X_test)
y12_pred = dt12.predict(X_test)

accuracy_score(y_test, y10_pred)

In [0]:
accuracy_score(y_test, y11_pred)

In [0]:
accuracy_score(y_test, y12_pred)

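## For Advanced Usage

Skip ahead to the regression section if this material is new to you.

### "Cross-Validation" for Depth of the Classification Tree

Note that the loop in this section scores each depth on the test set rather than running a true k-fold cross-validation, hence the quotation marks.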

In [0]:
# Initialize the accuracy_score vector
acc = []
depth = np.arange(1, 25)
# Calculate accuracy score on the test set for different depths of the tree
for i in depth:
    # Fit the classification tree
    dt = DecisionTreeClassifier(max_depth=i)
    dt.fit(X_train,y_train)
    # Predict on the test set
    y_pred = dt.predict(X_test)
    # Compute the accuracy
    score = accuracy_score(y_test, y_pred)  
    acc.append(score)
# Plot results    
plt.plot(depth, acc, '-')
plt.xlabel('Depth of the tree')
plt.ylabel('Accuracy')
plt.title('spam');
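
For comparison, a genuine k-fold cross-validation picks the depth using the training data only, so the test set stays untouched until the final evaluation. The sketch below uses `cross_val_score` on synthetic data. The data, the choice of 5 folds and the depth range are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption: not the spam set)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=1)

depths = np.arange(1, 11)
cv_acc = []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=int(d), random_state=1)
    # 5-fold CV: fit on 4 folds, score on the held-out fold, repeat 5 times
    fold_scores = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_acc.append(fold_scores.mean())

# Depth with the highest mean cross-validated accuracy
best_depth = int(depths[int(np.argmax(cv_acc))])
print(best_depth, max(cv_acc))
```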

### Bagging Classifier

In [0]:
#X = df1.iloc[:,:57]
#y = df1.iloc[:,-1]

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1)

from sklearn.ensemble import BaggingClassifier

# Base learner: a single tree of depth 10
classification_tree_spam = DecisionTreeClassifier(max_depth = 10)

# Bag 100 trees, each fit on a bootstrap sample of the training set
bag = BaggingClassifier(classification_tree_spam, n_estimators=100,
                        random_state=1)
bag.fit(X_train, y_train)

# Accuracy on the training set
y_hat = bag.predict(X_train)
accuracy_score(y_train, y_hat)

In [0]:
y_hat = bag.predict(X_test)
accuracy_score(y_test, y_hat)

## 2-Fitting Regression Trees

In [0]:

df2 = pd.read_csv('Auto.csv', na_values=['?'], na_filter=True)
df2 = df2.dropna() # drop every row that has at least one missing value

In [0]:
df2.head()

In [0]:
X = df2[['cylinders', 'displacement', 'horsepower', \
         'weight', 'acceleration']]
y = df2['mpg']
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size = 0.2, random_state = 0)

Write code to create a `DecisionTreeRegressor` object, then train it on `X_train`, `y_train`.

Name your tree `regr_tree_auto`.
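
The regression case follows the same pattern as the classifier. The sketch below fits a shallow regression tree on synthetic data. The names `X_demo`, `y_demo` and `demo_regr` and the depth of 2 are placeholders for this example, not the settings intended for the Auto data.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Auto features (assumption: not the real data)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# A depth-2 tree has at most 4 leaves, each predicting a constant value
demo_regr = DecisionTreeRegressor(max_depth=2, random_state=0)
demo_regr.fit(X_tr, y_tr)

n_leaves = demo_regr.get_n_leaves()
test_mse = mean_squared_error(y_te, demo_regr.predict(X_te))
print(n_leaves, test_mse)
```

Each leaf predicts the mean response of the training points that fall into it, so a shallow tree produces only a handful of distinct predicted values.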

Let's take a look at the tree:

In [0]:
export_graphviz(regr_tree_auto, 
                out_file = "auto_tree.dot", 
                feature_names = X_train.columns)

with open("auto_tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

Now let's see how it does on the test data:

In [0]:
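pred = regr_tree_auto.predict(X_test)

plt.scatter(pred, 
            y_test, 
            label = 'mpg')

# Reference line y = x in data coordinates: points on it are predicted exactly
lims = [min(pred.min(), y_test.min()), max(pred.max(), y_test.max())]
plt.plot(lims, 
         lims, 
         '--k')

plt.xlabel('pred')
plt.ylabel('y_test')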

In [0]:
mean_squared_error(y_test, pred)

#### Exercise

Now try different tree depths and compare their mean squared error (MSE) on the test set.

Define regression trees of varying depth.

In [0]:

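# You may play with the depth
dt10 = DecisionTreeRegressor(max_depth=10)
dt11 = DecisionTreeRegressor(max_depth=11)
dt12 = DecisionTreeRegressor(max_depth=12)

dt10.fit(X_train, y_train)
dt11.fit(X_train, y_train)
dt12.fit(X_train, y_train)

y10_pred = dt10.predict(X_test)
y11_pred = dt11.predict(X_test)
y12_pred = dt12.predict(X_test)

mean_squared_error(y_test, y10_pred)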

In [0]:
mean_squared_error(y_test, y11_pred)

In [0]:
mean_squared_error(y_test, y12_pred)

### "Cross-Validation" for Depth of the Regression Tree

In [0]:
#X = df2[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration']]
#y = df2['mpg']
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Initialize the MSE vector
mse = []
depth = np.arange(1, 11)

# Calculate MSE on the test set for different depths of the tree
for i in depth:
    # Fit the Regression Tree
    dt = DecisionTreeRegressor(max_depth=i)
    dt.fit(X_train,y_train)
    # Predict on the test set
    y_pred = dt.predict(X_test)
    # Compute the MSE
    score = mean_squared_error(y_test, y_pred)
    
    mse.append(score)
    
# Plot results    
plt.plot(depth, mse, '-')
plt.xlabel('Depth of the tree')
plt.ylabel('MSE')
plt.title('mpg');

### Advanced Usage
### Bagging Regressor

In [0]:
#X = df2[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration']]
#y = df2['mpg']

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [0]:
regr_tree_auto = DecisionTreeRegressor(max_depth = 2)

from sklearn.ensemble import BaggingRegressor
bag = BaggingRegressor(regr_tree_auto, n_estimators=100, random_state=1)
bag.fit(X_train, y_train)

y_hat = bag.predict(X_train)
mean_squared_error(y_train, y_hat)

In [0]:
y_hat = bag.predict(X_test)
mean_squared_error(y_test, y_hat)