# Adapted from CS109a Introduction to Data Science
## Seminar 8, Exercise 3: Bagging Classification with Decision Boundary

## Description:
The goal of this exercise is to use **Bagging** (Bootstrap Aggregating) to solve a classification problem and to visualize the influence of Bagging on trees with varying depths.
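
To make the "bootstrap" part concrete, here is a minimal sketch on made-up data (the array `data`, its size, and the seed are illustrative assumptions, not part of the exercise). A bootstrap sample draws as many rows as the original set, with replacement, so on average roughly 63% of the original rows appear at least once.

```python
import numpy as np

# Hypothetical toy data: 10,000 row indices standing in for a training set
rng = np.random.default_rng(0)
n_rows = 10_000
data = np.arange(n_rows)

# One bootstrap sample: same size as the original, drawn with replacement
boot_idx = rng.choice(n_rows, size=n_rows, replace=True)
boot_sample = data[boot_idx]

# Fraction of distinct original rows that made it into the sample
unique_fraction = np.unique(boot_idx).size / n_rows
print(f"Unique rows in bootstrap sample: {unique_fraction:.3f}")
```

Each tree in a bagged ensemble is trained on a different sample of this kind, which is what makes the trees differ from one another.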

## Instructions:

- Read the dataset `Crimes_-_Map.csv`.
- Assign the predictor and response variables as `X` and `y`.
- Split the data into train and test sets with `test_size=0.2` and `random_state=44`.
- Fit a single `DecisionTreeClassifier()` and find the accuracy of your prediction.
- Complete the helper function `prediction_by_bagging()` to find the average predictions for a given number of bootstraps.
- Perform `Bagging` using the helper function, and compute the new accuracy.
- Plot the accuracy as a function of the number of bootstraps.
- Use the helper code to plot the decision boundaries for varying `max_depth` and number of bootstraps (`numboot` in the helper code). Investigate the effect of increasing the number of bootstraps on the variance.

<a href="https://data.cityofchicago.org/Public-Safety/Crimes-Map/dfnk-7re6" target="_blank"> Chicago crime data</a>

## Hints:

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn-tree-decisiontreeclassifier" target="_blank">sklearn.tree.DecisionTreeClassifier()</a>
A decision tree classifier.

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit" target="_blank">DecisionTreeClassifier.fit()</a>
Build a decision tree classifier from the training set (X, y).

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict" target="_blank">DecisionTreeClassifier.predict()</a>
Predict class or regression value for X.

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html" target="_blank">train_test_split()</a>
Split arrays or matrices into random train and test subsets.

<a href="https://docs.scipy.org/doc//numpy-1.10.4/reference/generated/numpy.random.choice.html" target="_blank">np.random.choice</a>
Generates a random sample from a given 1-D array.

<a href="https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.subplots.html" target="_blank">plt.subplots()</a>
Create a figure and a set of subplots.

<a href="https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.axes.Axes.plot.html" target="_blank">ax.plot()</a>
Plot y versus x as lines and/or markers.

In [1]:
# Import necessary libraries
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn import metrics
import scipy.optimize as opt
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Used for plotting later
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#F7345E','#80C3BD'])
cmap_light = ListedColormap(['#FFF4E5','#D2E3EF'])


In [2]:
# Read the file 'Crimes_-_Map.csv' as a Pandas dataframe
df = pd.read_csv('C:\\Users\\wirze\\DataspellProjects\\DS_DAPS\\8-lab\\data\\Crimes_-_Map.csv')


df = df.dropna()
# Take a quick look at the data
# Note that the latitude & longitude values are not normalized
df.head()


Unnamed: 0,CASE#,DATE OF OCCURRENCE,BLOCK,IUCR,PRIMARY DESCRIPTION,SECONDARY DESCRIPTION,LOCATION DESCRIPTION,ARREST,DOMESTIC,BEAT,WARD,FBI CD,X COORDINATE,Y COORDINATE,LATITUDE,LONGITUDE,LOCATION
5,JE266628,06/15/2021 09:30:00 AM,080XX S DREXEL AVE,0820,THEFT,$500 AND UNDER,STREET,N,N,631,8.0,06,1183633.0,1851786.0,41.748486,-87.602675,"(41.748486365, -87.602675062)"
6,JE266536,06/15/2021 07:50:00 AM,042XX W MADISON ST,0560,ASSAULT,SIMPLE,SIDEWALK,N,N,1115,28.0,08A,1148227.0,1899678.0,41.880661,-87.731186,"(41.880660786, -87.731186405)"
8,JE267466,06/15/2021 09:01:00 PM,007XX S KEDZIE AVE,051B,ASSAULT,AGGRAVATED - OTHER FIREARM,SIDEWALK,Y,N,1134,24.0,04A,1155154.0,1896404.0,41.87154,-87.705839,"(41.87154041, -87.705838807)"
9,JE266473,06/15/2021 07:47:00 AM,062XX S MORGAN ST,0110,HOMICIDE,FIRST DEGREE MURDER,APARTMENT,N,N,712,16.0,01A,1170714.0,1863474.0,41.780851,-87.649674,"(41.780850996, -87.649674221)"
10,JE267222,06/15/2021 01:55:00 AM,015XX S KENNETH AVE,4386,OTHER OFFENSE,VIOLATION OF CIVIL NO CONTACT ORDER,APARTMENT,N,Y,1012,24.0,26,1146970.0,1892136.0,41.859989,-87.735995,"(41.859988733, -87.73599476)"


In [3]:
# Set the values of latitude & longitude predictor variables
X = df[['LATITUDE', 'LONGITUDE']].values

# Use the column "ARREST" as the response variable
y = df['ARREST'].values
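
The `ARREST` column holds the strings `'Y'` and `'N'`. Scikit-learn classifiers accept string labels, but colour maps and contour plots need numbers. A minimal sketch of one way to encode the labels, using a made-up array (`y_labels` is hypothetical) rather than the real column:

```python
import numpy as np

# Hypothetical labels in the same 'Y'/'N' format as the ARREST column
y_labels = np.array(['N', 'Y', 'N', 'N', 'Y'])

# Boolean comparison, then cast: 'Y' -> 1, 'N' -> 0
y_numeric = (y_labels == 'Y').astype(int)
print(y_numeric)
```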


In [4]:
# Split data into train and test sets, with test size = 0.2
# and set random state as 44
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=44)


In [5]:
# Define the max_depth of the decision tree
max_depth = 20

# Define a decision tree classifier with a max depth as defined above
# and set the random_state as 44
clf = DecisionTreeClassifier(max_depth=max_depth, random_state=44)

# Fit the model on the training data
clf.fit(X_train, y_train)


DecisionTreeClassifier(max_depth=20, random_state=44)

In [6]:
# Use the trained model to predict on the test set
prediction = clf.predict(X_test)

# Calculate the accuracy of the test predictions of a single tree
single_acc = accuracy_score(y_test, prediction)

# Print the accuracy of the tree
print(f'Single tree Accuracy is {single_acc*100}%')


Single tree Accuracy is 87.10627799468251%
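
An accuracy figure is easier to interpret next to a trivial baseline, since arrest labels are likely to be imbalanced. The sketch below computes the accuracy of always predicting the most frequent class. The label array `y_demo` and its 90/10 split are assumptions made for illustration, not the real class ratio of this dataset.

```python
import numpy as np

# Hypothetical test labels: 9 'N' for every 1 'Y' (not the real class ratio)
y_demo = np.array(['N'] * 90 + ['Y'] * 10)

# Accuracy of a "model" that always predicts the most frequent class
labels, counts = np.unique(y_demo, return_counts=True)
majority_label = labels[np.argmax(counts)]
baseline_acc = np.mean(y_demo == majority_label)
print(f"Always predicting '{majority_label}' gives {baseline_acc:.2f} accuracy")
```

Running the same three lines on `y_test` would show how much of the tree's accuracy comes from the class balance alone.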


In [7]:
# Complete the function below to get the prediction by bagging

# Inputs: X_train, y_train: data used to train each tree
# X_to_evaluate: samples that you are going to predict (evaluate)
# num_bootstraps: how many trees you want to train
# Output: a 2D array of predicted classes for X_to_evaluate, one row per tree

def prediction_by_bagging(X_train, y_train, X_to_evaluate, num_bootstraps):

    # List to store every array of predictions
    predictions = []

    # Generate num_bootstraps number of trees
    for i in range(num_bootstraps):

        # Draw one bootstrap sample. We resample row indices (with replacement)
        # so that X_train and y_train are subset with the same rows
        resample_indexes = np.random.choice(np.arange(y_train.shape[0]), size=y_train.shape[0])

        # Get a bootstrapped version of the data using the above indices
        X_boot = X_train[resample_indexes]
        y_boot = y_train[resample_indexes]

        # Initialize a Decision Tree on bootstrapped data
        # Use the same max_depth and random_state as above
        clf = DecisionTreeClassifier(max_depth=max_depth, random_state=44)

        # Fit the model on bootstrapped training set
        clf.fit(X_boot,y_boot)

        # Use the trained model to predict on X_to_evaluate samples
        pred = clf.predict(X_to_evaluate)

        # Append the predictions to the predictions list
        predictions.append(pred)

    # The list "predictions" has [prediction_array_0, prediction_array_1, ..., prediction_array_n]
    # Stack the arrays into one 2D array of shape (num_bootstraps, n_samples).
    # The majority vote (average the 'Y' votes and threshold at 0.5)
    # is computed from this array in the cells that follow
    stacked_predictions = np.stack(predictions, axis=0)

    # Return the stacked per-tree predictions
    return stacked_predictions


In [8]:
# Define the number of bootstraps
num_bootstraps = 200

# Calling the prediction_by_bagging function with appropriate parameters
y_pred = prediction_by_bagging(X_train,y_train,X_test,num_bootstraps=num_bootstraps)

In [9]:
def voteban(x):
    if x >= 0.5:
        return 'Y'
    else:
        return 'N'

In [10]:
predictions_df = pd.DataFrame(np.transpose(y_pred))\
    .replace('Y', 1)\
    .replace('N', 0)\
    .mean(axis=1)

predictions_df.head()

0    0.065
1    0.080
2    0.000
3    0.000
4    0.000
dtype: float64

In [11]:
predictions_df = predictions_df.apply(voteban)
predictions_df.head()

0    N
1    N
2    N
3    N
4    N
dtype: object
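
The voting done in cells 9 to 11 can also be written directly in NumPy. The sketch below uses a small made-up array (`stacked` is hypothetical, with one row per tree and one column per sample) and applies the same 0.5 threshold as `voteban()`.

```python
import numpy as np

# Hypothetical stacked predictions: 3 trees (rows) x 4 samples (columns)
stacked = np.array([
    ['Y', 'N', 'N', 'Y'],
    ['Y', 'N', 'Y', 'N'],
    ['N', 'N', 'Y', 'Y'],
])

# Fraction of trees voting 'Y' for each sample
vote_share = (stacked == 'Y').mean(axis=0)

# Majority vote with the same 0.5 threshold as voteban()
majority = np.where(vote_share >= 0.5, 'Y', 'N')
print(vote_share, majority)
```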

In [12]:
# Compare the average predictions to the true test set values
# and compute the accuracy
bagging_accuracy = accuracy_score(y_test, predictions_df.values)

# Print the bagging accuracy
print(f'Accuracy with Bootstrapped Aggregation is  {bagging_accuracy*100}%')

Accuracy with Bootstrapped Aggregation is  88.39253634817601%
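
As a cross-check on a hand-written helper, scikit-learn ships `sklearn.ensemble.BaggingClassifier`. The sketch below runs it on synthetic data (`X_demo` and `y_demo` are made up, and the diagonal decision rule is an assumption for illustration). Note that `BaggingClassifier` averages predicted class probabilities when the base estimator provides them, so its output may differ slightly from a hard majority vote.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (latitude, longitude) -> arrest data
rng = np.random.default_rng(44)
X_demo = rng.uniform(size=(500, 2))
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 1.0, 'Y', 'N')

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=44)

# 50 depth-limited trees, each fit on a bootstrap sample of the training set
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=5), n_estimators=50, random_state=44)
bag.fit(Xtr, ytr)
demo_acc = accuracy_score(yte, bag.predict(Xte))
print(f"BaggingClassifier accuracy on synthetic data: {demo_acc:.3f}")
```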


In [None]:
# Helper code to plot accuracy vs number of bagged trees

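n = np.linspace(1,250,250).astype(int)
acc = []
for n_i in n:
    # prediction_by_bagging returns one row of 'Y'/'N' predictions per tree,
    # so take the majority vote across trees before comparing with y_test
    tree_preds = prediction_by_bagging(X_train, y_train, X_test, n_i)
    majority = np.where((tree_preds == 'Y').mean(axis=0) >= 0.5, 'Y', 'N')
    acc.append(np.mean(majority == y_test))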
plt.figure(figsize=(10,8))
plt.plot(n,acc,alpha=0.7,linewidth=3,color='#50AEA4', label='Model Prediction')
plt.title('Accuracy vs. Number of trees in Bagging ',fontsize=24)
plt.xlabel('Number of trees',fontsize=16)
plt.ylabel('Accuracy',fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='best',fontsize=12)
plt.show();
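
The helper loop refits every ensemble from scratch for each ensemble size, which gets slow. One possible shortcut, sketched here on made-up predictions (`stacked` and `y_true` are hypothetical), is to train the largest ensemble once and score the first 1, 2, ..., n trees with a running vote. This reuses the same trees for every size, so the resulting curve is not identical to refitting each time.

```python
import numpy as np

# Hypothetical stacked predictions from 4 trees on 3 test samples
stacked = np.array([
    ['Y', 'N', 'N'],
    ['N', 'N', 'Y'],
    ['Y', 'N', 'Y'],
    ['Y', 'Y', 'Y'],
])
y_true = np.array(['Y', 'N', 'Y'])

# Running share of 'Y' votes after 1, 2, ..., n trees (one row per ensemble size)
votes = (stacked == 'Y').astype(float)
n_trees = np.arange(1, stacked.shape[0] + 1)
running_share = np.cumsum(votes, axis=0) / n_trees[:, None]

# Majority vote and accuracy for every ensemble size at once
running_pred = np.where(running_share >= 0.5, 'Y', 'N')
acc_by_size = (running_pred == y_true).mean(axis=1)
print(acc_by_size)
```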


## Bagging Visualization

Bagging helps reduce overfitting, but only up to a certain extent.

Vary the `max_depth` and `numboot` variables to see how Bagging helps reduce overfitting with the help of the visualization below.
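
One standard way to see why the benefit levels off: if each of B models has variance sigma^2 and any two models have correlation rho, the variance of their average is rho * sigma^2 + (1 - rho) * sigma^2 / B. Adding trees shrinks the second term, but the first term stays. The sketch below only evaluates that expression. The values of `sigma2` and `rho` are illustrative assumptions, not estimates from this dataset.

```python
def variance_of_average(sigma2, rho, n_models):
    """Variance of the mean of n_models predictors, each with variance sigma2
    and pairwise correlation rho (identity for equicorrelated variables)."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_models

# Illustrative values only: sigma2 and rho are assumptions
for n_models in (1, 10, 100, 1000):
    print(n_models, round(variance_of_average(1.0, 0.3, n_models), 4))
```

Trees fit on bootstrap samples of the same data are correlated, so the floor set by the first term is generally above zero.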

In [None]:
# Making plots for three different values of `max_depth`
fig,axes = plt.subplots(1,3,figsize=(20,6))

# Make a list of three max_depths to investigate
# Note: this rebinds the global `max_depth` that prediction_by_bagging() reads,
# so set it back to an integer before calling that function again
max_depth = [2,5,100]

# Fix the number of bootstraps
numboot = 100

for index,ax in enumerate(axes):

    for i in range(numboot):
        # Bootstrap the full dataframe and fit one tree on the resample
        df_new = df.sample(frac=1,replace=True)
        # Encode ARREST ('Y'/'N') as 1/0 so it can be used for colors and contours
        y_boot = (df_new.ARREST.values == 'Y').astype(int)
        X_boot = df_new[['LATITUDE', 'LONGITUDE']].values
        dtree = DecisionTreeClassifier(max_depth=max_depth[index])
        dtree.fit(X_boot, y_boot)
        ax.scatter(X_boot[:, 0], X_boot[:, 1], c=y_boot, s=50,alpha=0.5,edgecolor="k",cmap=cmap_bold)
        # Latitude and longitude are not normalized, so use a fine grid step
        plot_step_x1= 0.005
        plot_step_x2= 0.005
        x1min, x1max= X_boot[:,0].min(), X_boot[:,0].max()
        x2min, x2max= X_boot[:,1].min(), X_boot[:,1].max()
        x1, x2 = np.meshgrid(np.arange(x1min, x1max, plot_step_x1), np.arange(x2min, x2max, plot_step_x2) )
        # Re-cast every coordinate in the meshgrid as a 2D point
        Xplot= np.c_[x1.ravel(), x2.ravel()]

        # Predict the class at every grid point
        y_grid = dtree.predict( Xplot )
        y_grid = y_grid.reshape( x1.shape )
        cs = ax.contourf(x1, x2, y_grid, alpha=0.02)

    ax.set_xlabel('Latitude',fontsize=14)
    ax.set_ylabel('Longitude',fontsize=14)
    ax.set_title(f'Max depth = {max_depth[index]}',fontsize=20)
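
The grid resolution matters for this plot because raw latitude and longitude for a single city span well under one degree, so a fixed step can leave only a handful of grid points. A possible alternative, sketched below, fixes the number of grid points per axis with `np.linspace`. The coordinate ranges and `n_points` are assumptions chosen for illustration, not values read from the dataset.

```python
import numpy as np

# Hypothetical coordinate ranges, roughly the size of a city in degrees
x1min, x1max = 41.65, 42.02
x2min, x2max = -87.93, -87.53

# Fixed number of grid points per axis instead of a fixed step size
n_points = 200
x1, x2 = np.meshgrid(np.linspace(x1min, x1max, n_points),
                     np.linspace(x2min, x2max, n_points))

# Each row of Xplot is one (x1, x2) location to classify
Xplot = np.c_[x1.ravel(), x2.ravel()]
print(x1.shape, Xplot.shape)
```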



## Mindchow 🍲
Play around with the following parameters:

- max_depth
- numboot

Based on your observations, answer the questions below:

- How does the plot change with varying `max_depth`?

- How does the plot change with varying `numboot`?

- How are the three plots essentially different?

- Do more bootstraps reduce overfitting for
    - High depth?
    - Low depth?