In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab16.ipynb")

# Lab 16: Decision trees (10 pts)
Please work with your final project partner.
  
**Submission instructions**: Please create a zip file (using the export cell at the end of this notebook) and a PDF via File -> Print (or cmd + P on Mac), and upload both to Gradescope. Even if you decide to finish the lab at home, please submit what you have by the end of class to make sure you get some credit!


In [None]:
# edit these names to your name and your final project partner's name
me = ["Rick Marks"]
partner = ["Piper Marks"]
...

In [None]:
grader.check("name")

## Part A: Preparing data

Our goal today is to figure out the defining characteristics of each penguin species, so that if we encounter a new penguin in the wild, we can predict its species. Start by answering the following questions:

- Is this more of a **classification** problem or a **regression** problem? Choose the closest answer.
- What are the **predictor variables** (or features or X)?
- What are the **target variables** (or labels or y)?

[Your answers]

Run the following cell to load the penguin dataset as a `pandas` `DataFrame` called `penguins`. I've also supplied code to shorten the penguin species names for convenient exploration and plotting.

In [None]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
pd.set_option("future.no_silent_downcasting", True)

penguins = pd.read_csv("palmer_penguins.csv")

# shorten the species name
penguins["Species"] = penguins["Species"].str.split().str.get(0)

For today's exercise, keep only the following columns: `'Species'`, `'Island'`, `'Culmen Length (mm)'`, `'Culmen Depth (mm)'`, `'Flipper Length (mm)'`, `'Body Mass (g)'`, `'Sex'`. Calling `penguins.filter([...])` with the column names inside should make this happen. Reassign this table to `penguins`. The updated `penguins` table should have 344 rows and 7 columns.
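
As a minimal sketch of how `filter` picks out columns, here is a toy example. The table `toy` and its column names are made up for illustration and are not part of the penguin data.

```Python
import pandas as pd

# Hypothetical toy table, not the penguin data
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "height": [1.0, 2.0, 3.0],
    "weight": [10, 20, 30],
    "notes": ["x", "y", "z"],
})

# filter keeps only the listed columns, in the order given
toy_small = toy.filter(["name", "weight"])
print(toy_small.shape)
```

Note that `filter` silently skips any name that does not match a column, so checking `.shape` afterwards is a quick way to catch a typo in a column name.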

In [None]:
# Write your code here
...



You might have noticed that your table contains rows with `NaN` values. Calling `penguins.dropna()` will remove these rows. Do this below, and reassign the result back to `penguins`. Your updated `penguins` table should have 334 rows and 7 columns.
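
If you want to see what `dropna` does before applying it, the sketch below uses a small invented table (`toy` is hypothetical). It counts the missing values per column, then drops every row that contains at least one of them.

```Python
import numpy as np
import pandas as pd

# Hypothetical toy table with two incomplete rows
toy = pd.DataFrame({
    "height": [1.0, np.nan, 3.0, 4.0],
    "group": ["a", "b", None, "b"],
})

print(toy.isna().sum())    # number of missing values in each column
toy_clean = toy.dropna()   # returns a new table; `toy` itself is unchanged
print(toy.shape, toy_clean.shape)
```

Because `dropna` returns a new table rather than modifying the original, the result only sticks if you assign it to a name.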

In [None]:
# Write your code here
...


In an ideal world, we would train our model on the entire dataset, collect data from new penguins, then test the model on the new data. However, that is not feasible here. In situations like this, the usual approach is to randomly split the existing samples into a train set and a test set, and "pretend" that the samples in the test set come from penguins we haven't met yet.

Fill in the blank such that this sentence describes what the code does:

    We will randomly put 80% of the ___1____ into the __2___ set, and put the remaining __1_____ into the ____2___ set.

- Options for 1: rows or columns
- Options for 2: train or test

The corresponding code:

```Python
train = penguins.sample(frac=0.8)
test = penguins.drop(index=train.index)
print(train.shape, test.shape)
```
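
To convince yourself that `sample` followed by `drop` produces two pieces that do not overlap, here is a sketch on a made-up table. The names `toy`, `sampled`, and `rest` are hypothetical, and `random_state` is an optional argument that the lab's code does not use; it is included here only so that the draw is the same on every run.

```Python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})  # hypothetical table with 10 rows

# random_state is optional; fixing it gives the same draw on every run
sampled = toy.sample(frac=0.8, random_state=0)
rest = toy.drop(index=sampled.index)

print(sampled.shape, rest.shape)
# The two pieces share no index labels
print(set(sampled.index) & set(rest.index))
```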

[Your answers]

In [None]:
# Copy and run the code here
...



## Part B: Manual decision tree

We'll first approach this problem manually, meaning that you'll be the one designing the prediction algorithm, not your computer.

Calculate the mean of each numeric variable in the table PER penguin species in your `train` data.
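
One possible approach is `groupby` followed by `mean`. The sketch below shows the pattern on an invented table (`toy` and its columns are hypothetical); the `numeric_only=True` argument tells pandas to skip text columns instead of trying to average them.

```Python
import pandas as pd

# Hypothetical toy table with one grouping column and one text column
toy = pd.DataFrame({
    "kind": ["a", "a", "b", "b"],
    "length": [1.0, 3.0, 10.0, 20.0],
    "mass": [100, 300, 1000, 2000],
    "place": ["x", "y", "x", "x"],  # text column, skipped by numeric_only
})

# One row per group, one column per numeric variable
means = toy.groupby("kind").mean(numeric_only=True)
print(means)
```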

In [None]:
# write your code here
...


For the categorical variable `Island`, I'll give you the code. Copy and run the following code:

```Python
island_counts = pd.DataFrame(train.groupby('Species').Island.value_counts())

sns.barplot(island_counts, x='Species', hue='Island', y='count')

island_counts
```

In [None]:
# copy and run the code here
...

Based on your findings from these tables and the bar plot, propose a miniature decision tree to help distinguish between the penguin species. Your decision tree might have rules like the following:

1. First, check the island on which the penguin was found.
    1. If Torgersen, then check the body mass.
        1. If the body mass is over 4,000 g, then guess Adelie.
        1. Otherwise, guess Chinstrap.
    1. If Biscoe, then check the sex of the penguin.
        1. If female, guess Gentoo.
        1. Otherwise, guess Chinstrap.
    1. If Dream, then guess Adelie.
      
Your decision tree should operate using **no more than three columns** from the data. 

Below your decision tree, write an explanation of how you came up with it and how the tables that you created above informed your choices. 

[Your solution here]

Write your decision tree directly as a Python function. This example algorithm would look like this:

```Python
def decision_tree(island, mass, sex):
    if island == "Torgersen":
        if mass > 4000:
            return "Adelie"
        else:
            return "Chinstrap"
    elif island == "Biscoe":
        if sex == "FEMALE":
            return "Gentoo"
        else:
            return "Chinstrap"
    else: 
        return "Adelie"
    
decision_tree("Biscoe", 5000, "MALE")
```
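
One way to check how well a hand-written tree does is to apply it to every row of a table and compare the guesses with the true species. The sketch below repeats the example function and runs it on three made-up rows; `toy` and its values are invented for illustration, and only the column names mirror the penguin table.

```Python
import pandas as pd

def decision_tree(island, mass, sex):
    if island == "Torgersen":
        if mass > 4000:
            return "Adelie"
        else:
            return "Chinstrap"
    elif island == "Biscoe":
        if sex == "FEMALE":
            return "Gentoo"
        else:
            return "Chinstrap"
    else:
        return "Adelie"

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Species": ["Adelie", "Gentoo", "Chinstrap"],
    "Island": ["Torgersen", "Biscoe", "Dream"],
    "Body Mass (g)": [4200, 4800, 3700],
    "Sex": ["MALE", "FEMALE", "FEMALE"],
})

# Call the function once per row (axis=1) and collect the guesses
guesses = toy.apply(
    lambda row: decision_tree(row["Island"], row["Body Mass (g)"], row["Sex"]),
    axis=1,
)

# Fraction of rows where the guess matches the true species
accuracy = (guesses == toy["Species"]).mean()
print(accuracy)
```

The same pattern should work with your own function on `train` or `test`, as long as you pass the columns your function expects.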

In [None]:
# implement your decision tree function here
# note: your decision tree should NOT be the same as the example provided
...

## Part C: Automated decision tree

Now let's see what the automated version looks like.

Once again, these `scikit-learn` functions don't know how to handle text variables like `Island` and `Sex`, so we'll have to turn them into numbers first. You could use boolean indexing like we did in lecture, but the pandas `replace` method is the quickest way:

```Python
train['Species'] = train.Species.replace({'Adelie': 0, 'Chinstrap': 1, 'Gentoo': 2})
train['Island'] = train.Island.replace({'Dream': 0, 'Biscoe': 1, 'Torgersen': 2})
train['Sex'] = train.Sex.replace({'MALE': 0, 'FEMALE': 1, '.' : 2})

test['Species'] = test.Species.replace({'Adelie': 0, 'Chinstrap': 1, 'Gentoo': 2})
test['Island'] = test.Island.replace({'Dream': 0, 'Biscoe': 1, 'Torgersen': 2})
test['Sex'] = test.Sex.replace({'MALE': 0, 'FEMALE': 1, '.' : 2})

train = train.astype(float)
test = test.astype(float)
```
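
The sketch below shows the same `replace` then `astype` pattern on an invented table (`toy` and its columns are hypothetical), so you can see what each step does to the values and the column types.

```Python
import pandas as pd

# Hypothetical toy table with one text column
toy = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1.5, 2.5, 3.5]})

# Each text label is swapped for the number given in the dictionary
toy["color"] = toy["color"].replace({"red": 0, "blue": 1})

# After replace the column may still have dtype object, so convert explicitly
toy = toy.astype(float)
print(toy.dtypes)
print(toy)
```

Any label that is missing from the dictionary is left as text by `replace`, and the later `astype(float)` will then fail on it, so the dictionary needs an entry for every value that appears in the column.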

In [None]:
# copy the code and run it here
...


Each of your tables needs to be split into two parts (`X` and `y`) for the automated algorithm to understand. Remember that `X` corresponds to a table where each column is a feature or a predictor variable, and `y` corresponds to an array with the target variable or the labels.

I've given you partial code that creates four new variables `y_train`, `X_train`, `X_test`, `y_test`. **Fill in the missing parts marked with ...**, then copy and run the code. The answer is a single column name that is the same in all four places. Pause and make sure you understand what is going on. 

```Python
y_train = train[...] # select column with target variable
X_train = train.drop(columns=[...]) # keep all other columns with predictor variables
print(X_train.shape, y_train.shape)

y_test = test[...]
X_test = test.drop(columns=[...]) 
print(X_test.shape, y_test.shape)
```
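
To see the shapes this kind of split produces, here is a sketch on an invented table. The names `toy`, `X_toy`, `y_toy`, and the column `label` are hypothetical and chosen only for illustration.

```Python
import pandas as pd

# Hypothetical toy table: two feature columns and one label column
toy = pd.DataFrame({
    "f1": [0.1, 0.2, 0.3],
    "f2": [1.0, 2.0, 3.0],
    "label": [0.0, 1.0, 0.0],
})

y_toy = toy["label"]                 # a single column, so a 1-D Series
X_toy = toy.drop(columns=["label"])  # every other column, still a table
print(X_toy.shape, y_toy.shape)
```

`X_toy` keeps the same number of rows as `toy` but has one column fewer, while `y_toy` is one-dimensional, so its shape contains a single number.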

In [None]:
# copy and fill in the code here, then run it
...



Almost done! 

I've also given you mostly finished code that will automatically create a decision tree classifier. Put `X_train`, `y_train`, `X_test`, and `y_test` in the appropriate places, then copy and run the code.

```Python
from sklearn.tree import DecisionTreeClassifier, plot_tree

T = DecisionTreeClassifier(max_depth=3)
T.fit(..., ...) # train the model

print('Score on train:', T.score(..., ...)) # evaluate on train data
print('Score on test:', T.score(..., ...)) # evaluate on test data

fig, ax = plt.subplots(1, figsize = (20, 20))
p = plot_tree(T, filled = True, feature_names = list(X_train.columns)) # a plain list works across scikit-learn versions
```
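
If the fit and score steps feel abstract, the sketch below runs them on a tiny invented dataset. The names `X_toy`, `y_toy`, and `model` are hypothetical, and `export_text` is a `scikit-learn` helper that the lab's code does not use; it is shown here as a text alternative to the plotted tree.

```Python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy data: the two classes are fully separated by "length"
X_toy = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    "width": [5.0, 4.0, 6.0, 5.0, 4.0, 6.0],
})
y_toy = pd.Series([0, 0, 0, 1, 1, 1])

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_toy, y_toy)  # learn the splitting rules from features and labels

# score returns the fraction of rows whose label is predicted correctly
print(model.score(X_toy, y_toy))

# export_text prints the learned rules as indented text
print(export_text(model, feature_names=list(X_toy.columns)))
```

Scoring on the same rows used for fitting tends to be optimistic, which is why the lab's code also reports the score on the test set.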

In [None]:
# copy and fill in the code here, then run it
...



What do you think about this tree? Do you think it does a good job of classifying penguin species? Did the computer create an algorithm similar to your manual one, or something very different?

[Write your answer here]

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Submit the zip and PDF files to Gradescope under Lab 16.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)