# Homework 7 - Decision Trees and K-Means Clustering

## 0 Load Data

Setting the scene: you recently picked up tennis and want to play more. The only problem is that you don't want to play if the weather is bad, and you're super indecisive. To get around these problems, you decide to build a decision tree from data that you collected during past tennis sessions. This tree will help you decide whether to play tennis on any given day.

Run the following code block to load in the data. Take note of the different features.


In [None]:
# ------------------ RUN THIS CODE BLOCK ---------------------
import pandas as pd
data = [['Sunny', 'Hot', 'High', 'Weak', 'No'],
       ['Sunny', 'Hot', 'High', 'Strong', 'No'],
       ['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
       ['Rain', 'Mild', 'High', 'Weak', 'Yes'],
       ['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
       ['Rain', 'Cool', 'Normal', 'Strong', 'No'],
       ['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
       ['Sunny', 'Mild', 'High', 'Weak', 'No'],
       ['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
       ['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
       ['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
       ['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
       ['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
       ['Rain', 'Mild', 'High', 'Strong', 'No']]
columns = ['Outlook', 'Temperature', 'Humidity', 'Wind', 'Play']
dataset = pd.DataFrame(data, columns=columns)
display(dataset)
# -------------------------------------------------------------

## 1 Decision Tree

Unlike much of the data that we have worked with in this class so far, this data is categorical. This is extremely common in real-world data applications. Unfortunately, the decision tree implementation in our favorite package (sklearn) does not support categorical data.

There are many ways to get around this, such as 'encoding' the categorical values as numerical values. This is beyond the scope of this homework, but feel free to read about an example technique here: https://en.wikipedia.org/wiki/One-hot
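For the curious, here is a minimal sketch of what one-hot encoding looks like using pandas' `get_dummies`. The three-row `toy` frame is a made-up sample, not the homework data, and nothing in this homework depends on it.

```python
import pandas as pd

# Hypothetical three-row sample with a single categorical column.
toy = pd.DataFrame({"Outlook": ["Sunny", "Overcast", "Rain"]})

# get_dummies creates one 0/1 indicator column per category value.
encoded = pd.get_dummies(toy, columns=["Outlook"], dtype=int)
print(encoded)
```

Each row ends up with a single 1 in the column that matches its original category and 0 everywhere else.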

Regardless, we are nice, so we've included a custom decision tree classifier that works with categorical data. If you are interested, you can take a look at the code in `hw7TreeCode.py`. If not... that's fine, the main points are below:

### 1.1 Implementing the Tree with depth 5

We have imported the classifier `DecisionTree` from `hw7TreeCode` for you. It works mostly the same as the classifiers you have seen so far. You need to create an instance of the class and pass along some initializing conditions. The class also has `fit` and `predict` functions.

In this section you will fit a tree with a maximum depth of 5 and use it to determine whether or not you want to play tennis.

**NOTE** For the printing of the tree, the first branch refers to the outcome for True and the second refers to the outcome for False.

Example:

0 - Goes to JHU = Yes

$\quad$ 1 - Is a Blue Jay

$\quad$ 1 - Is not a Blue Jay

Here if the student goes to JHU they are a Blue Jay, otherwise they are not a Blue Jay.

The number in front represents the depth of that branch. 

**Tasks**

- [2 pt] Instantiate `dt5` as a `DecisionTree` object. Set `data = dataset`, `label = "Play"` and `max_depth = 5`
- [1 pt] Call the `.fit()` method of the class. Note that because we set the data and label (X and y) in step 1, you don't need to pass anything to this function.
- [1 pt] Call the `.print_tree()` method of the class to print the decision tree

In [None]:
from hw7TreeCode import DecisionTree
# TODO


### 1.2 Discussion

[2 pt] Based on the fitted decision tree, which feature and value provided the most information gain about whether or not you would play tennis on any given day? In other words, what was the biggest predictor of you playing tennis?
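As a reminder of what "information gain" measures, here is a small sketch on made-up data that has nothing to do with the tennis dataset. It assumes the common entropy-based definition (parent entropy minus the weighted entropy of the two branches); the `DecisionTree` class in `hw7TreeCode` may compute its splits differently, and the helper names below are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature, value, label):
    """Entropy reduction from splitting rows on `row[feature] == value`."""
    match = [r[label] for r in rows if r[feature] == value]
    rest = [r[label] for r in rows if r[feature] != value]
    parent = entropy([r[label] for r in rows])
    # Weight each non-empty branch by the fraction of rows it receives.
    weighted = sum(len(part) / len(rows) * entropy(part) for part in (match, rest) if part)
    return parent - weighted

# Hypothetical toy rows: the feature separates the label perfectly.
toy_rows = [
    {"Raining": "Yes", "Umbrella": "Yes"},
    {"Raining": "Yes", "Umbrella": "Yes"},
    {"Raining": "No", "Umbrella": "No"},
    {"Raining": "No", "Umbrella": "No"},
]
gain = information_gain(toy_rows, "Raining", "Yes", "Umbrella")
print(gain)
```

Because the split separates the two labels perfectly, the gain here equals the full parent entropy of 1 bit. A split that leaves both branches mixed would score lower.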

**Ans** 

## 2 Random Forest (sklearn)

In this section you will use a random forest classifier to determine the most important features for predicting breast cancer, using the Breast Cancer Wisconsin (Diagnostic) dataset (https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic).

To do this you will need to split the data, fit a classifier, then determine the most important features. 

Run the following code to load in the data.

In [43]:
from sklearn.datasets import load_breast_cancer
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

### 2.1 Splitting data

**Tasks**

- [2 pt] Using sklearn's `train_test_split`, split X and y into `X_train`, `X_test`, `y_train`, `y_test` with a `test_size` of 30% and a `random_state` of 42
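If you have not used `train_test_split` before, the sketch below shows the call pattern on made-up arrays. The names `X_toy` and `y_toy` and the parameter values are deliberately different from what the task asks for.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Hold out 20% of the samples; a fixed random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)
```

The function returns the pieces in the order train features, test features, train labels, test labels.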

In [44]:
from sklearn.model_selection import train_test_split
# TODO: Split the data

### 2.2 Train and predict

**Tasks**
- [1 pt] Use `RandomForestClassifier` to create `rf_clf`. The forest should include 100 trees. Set random state to 42
- [1 pt] Fit the `rf_clf` on the training data
- [1 pt] Predict the labels for `X_test` and store them in `y_pred_rf`
- [1 pt] Print the accuracy of the classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# TODO: Train Random Forest Classifier

# TODO: Predict the test set results

# TODO: Print

### 2.3 Extract Feature importances 

You may need to read the sklearn documentation for this section. https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html

**Tasks**

- [1 pt] Extract the feature importances from `rf_clf`
- [1 pt] Get the list of features from `data`
- [2 pt] Print the top 5 most important feature names and their importances:
    - ex: "Feature 1: _______, Importance: _______"
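As a rough illustration of the pattern, the sketch below ranks features of a small forest trained on the iris dataset, which is not the dataset used in this homework. It relies on the fitted classifier's `feature_importances_` attribute and the dataset's `feature_names`; the variable names are placeholders.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(iris.data, iris.target)

# One importance value per input feature; the values sum to 1.
importances = forest.feature_importances_

# argsort is ascending, so reverse it to rank from most to least important.
order = np.argsort(importances)[::-1]
for rank, idx in enumerate(order[:2], start=1):
    print(f"{rank}. {iris.feature_names[idx]}: {importances[idx]:.3f}")
```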

In [None]:
# TODO: Extract feature Importance

# TODO: Print 5 most important features


## 3 K-Means Clustering

In this section you are going to fill in some missing code for a manual K-Means implementation.

First, run the following code block to generate the clusters (no funny backstory for this data).

In [None]:
# --------------- RUN THIS CODE ------------------------------------
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

def plot_clusters(centroids, clusters):
    colors = ['r', 'g', 'b', 'y', 'c', 'm']
    for i, cluster in enumerate(clusters):
        cluster = np.array(cluster)
        plt.scatter(cluster[:, 0], cluster[:, 1], color=colors[i % len(colors)], label=f'Cluster {i+1}')
    plt.scatter(centroids[:, 0], centroids[:, 1], s=100, c='black', marker='o', label='Centroids')
    plt.legend()
    plt.show()

X, y = make_blobs(n_samples=1000, centers=5, n_features=2, random_state=1, cluster_std=.75)
plt.scatter(X[:,0], X[:,1])
plt.show()
# ---------------------------------------------------------------------

### 3.1 Euclidean Distance

In your manual KMeans implementation, you will measure the distance between the centroids and the data points using the Euclidean distance.

**Tasks** 

- [3 pt] Given a point $a$ and a point $b$, the following function should return the distance between them in Euclidean space.
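For reference, the standard definition for two points $a, b \in \mathbb{R}^n$ (the symbols $n$ and $d$ are notation introduced here) is:

$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

For example, with $a = (0, 0)$ and $b = (3, 4)$ this gives $d(a, b) = \sqrt{9 + 16} = 5$.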

In [48]:
def euclidean_distance(a, b):
    # TODO

### 3.2 Manual KMeans 

Provide the missing code from the following manual KMeans implementation

**Tasks** 

- [2 pt] Use the `euclidean_distance` function to calculate each point's distance from every centroid. The result should be an array of shape (k,) (or alternatively a list), where k is the number of centroids.
- [2 pt] Assign each point to the cluster whose centroid is closest to it
- [2 pt] Update the centroids to the average of each cluster

In [49]:
def kmeans(data, k, max_iters=100, random_state = 4):
    # Randomly initialize the centroids
    np.random.seed(random_state)
    centroids = data[np.random.choice(data.shape[0], k, replace=False)] 
    
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for point in data:
            distances =  # TODO: Calculate the distances from the centroid for each point
            
            # TODO: Assign each point to a cluster

        new_centroids =  # TODO: Update the centroids to the average of each cluster
        
        centroids = new_centroids
    
    return centroids, clusters

### 3.3 Plotting

**Tasks**
- [1 pt] Calculate the centroids and clusters using the `kmeans` function you just created. Choose an appropriate number of clusters
- [2 pt] Use the `plot_clusters` function to plot the clusters and centroids

In [None]:
# TODO
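If you want a sanity check for a manual implementation, one option is to compare against sklearn's own `KMeans`. The sketch below runs it on separate, made-up blob data (`X_demo`), not on the homework data. Note that sklearn initializes centroids with k-means++ by default, so its centroids need not match those of a randomly initialized implementation exactly.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: three tight blobs, unrelated to the homework set.
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=2,
                       random_state=0, cluster_std=0.5)

reference = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)
print(reference.cluster_centers_)  # one row per centroid
print(reference.labels_[:10])      # cluster index assigned to the first ten points
```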

### 3.4 Discussion

[2 pt] Experiment with different random_state values. What do you notice? What impact does the random state have on the performance of the clustering?

**Ans**

[1 pt] Thought experiment time! Imagine we had categorical data like in the decision tree problem. How could we alter the KMeans algorithm to cluster categorical data?
- There is no one correct answer

**Ans** 

<font color = "red">

# MAKE SURE TO SET random_state = 4 FOR 3.3 BEFORE SUBMITTING