In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab19.ipynb")

# Lab 19: One-hot encoding and neural networks (10 pts)
Please work with your final project partner.
  
**Submission instructions**: Please create a zip file (using the export cell at the end of this notebook) and a PDF via File -> Print (or Cmd + P on Mac), and upload both to Gradescope. Even if you decide to finish at home, please submit what you have by the end of class to make sure you get some credit!


In [None]:
# edit these names to your name and your final project partner's name
me = ["Rick Marks"]
partner = ["Piper Marks"]
...

In [None]:
grader.check("name")

In this lab, we are going to use k-nearest neighbors to predict Species from Island, Sex, and Body Mass.  Let's prepare the data first:

## Part A: Preparing data



Run the following cell to load the penguin dataset as a `pandas` `DataFrame` called `penguins`. I've also supplied code to shorten the penguin species names for convenient exploration and plotting.

In [None]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
pd.set_option("future.no_silent_downcasting", True)

penguins = pd.read_csv("palmer_penguins.csv")

# shorten the species name
penguins["Species"] = penguins["Species"].str.split().str.get(0)
rstate = 0 # random state for reproducibility

Write code to keep the relevant columns (Species, Island, Sex, and Body Mass) and drop rows with NA values, storing the result in `penguins_clean`. (You should know how to do this for the exam.)
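
As a reminder of the general pattern, here is a minimal sketch on a made-up table. The `toy` DataFrame and its column names are hypothetical and are not the penguin data: select the columns you want with a list of names, then call `dropna()`.

```python
import numpy as np
import pandas as pd

# hypothetical toy table, not the penguin data
toy = pd.DataFrame({
    "city": ["Oslo", "Lima", None, "Cairo"],
    "temp": [3.5, np.nan, 18.0, 30.2],
    "notes": ["a", "b", "c", "d"],
})

# keep only the columns of interest, then drop any row with a missing value
toy_clean = toy[["city", "temp"]].dropna()
print(toy_clean)
```

Only the rows with no missing value in the kept columns remain.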

In [None]:
# clean the data: keep only the columns we need, and drop rows with missing data
...

Make a boxplot of Species vs. Body Mass, with hue representing the Island.

In [None]:
# your boxplot code here
...

What Island(s) do Chinstrap penguins come from?  Does that mean if a penguin is from there, it is a Chinstrap penguin?

*Your answer here*

Now draw a boxplot of Species vs. Body Mass, with hue representing the Sex.

In [None]:
# your boxplot code here
...

Does it look like Body Mass will be a good way to distinguish between Adelie and Chinstrap penguins?  Why or why not?

*Your answer here*

Looking at the legend, notice the green indicator. Do you see any green in your boxplot?

*Your answer here*

A good way to check what values are in a categorical feature is to use `value_counts()`:
```Python
penguins_clean['Sex'].value_counts()
```

In [None]:
# copy the above code and run it
...

Note that only one penguin has a Sex of `'.'`. Let's treat this as a likely data entry error and use boolean indexing to drop that row below.
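
If boolean indexing feels rusty, the sketch below shows the idea on a made-up table. The `toy` DataFrame, its columns, and the `"?"` placeholder are all hypothetical: build a True/False mask, then index the DataFrame with it.

```python
import pandas as pd

# hypothetical toy table, not the penguin data
toy = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Di"],
    "grade": ["A", "?", "B", "A"],
})

# build a boolean mask that is True for the rows we want to keep
keep = toy["grade"] != "?"

# index with the mask: only rows where the mask is True survive
toy_filtered = toy[keep]
print(toy_filtered["grade"].value_counts())
```

Running `value_counts()` again afterwards is a quick way to confirm the unwanted value is gone.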

In [None]:
# write your code here to drop the row
penguins_clean = ...

We are nearly ready!  Run the code below to map the categorical labels like we have been doing in other labs.

In [None]:
# run this code
penguins_mapped = penguins_clean.copy()
penguins_mapped['Sex'] = penguins_clean['Sex'].map({'MALE':0, 'FEMALE':1}) 
penguins_mapped['Island'] = penguins_clean['Island'].map({'Biscoe':0, 'Dream':1, 'Torgersen':2})


## Part B: Cross-validation and k-nearest neighbor

In [None]:

# run this to conduct 5-fold cross-validation for a k-nearest neighbor model
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=5, shuffle=True, random_state=rstate)

from sklearn.neighbors import KNeighborsClassifier
y = penguins_mapped['Species'] # select column with target variable
X = penguins_mapped.drop(columns=['Species']) # keep all other columns with predictor variables
scores = []
for k in range(1, 31):
    model = KNeighborsClassifier(n_neighbors=k)
    scores.append(cross_val_score(model, X, y, cv=kf).mean())
plt.plot(range(1, 31), scores)
plt.xlabel("k (number of neighbors)")
plt.ylabel("mean cross-validation accuracy")



From the plot, what value of k would you use for the best results?

*Your answer here*

Oops!  k-NN uses distances, and we forgot to normalize again!  Write code to normalize the predictor variables in X using min-max normalization and copy the cross-validation code above to see how the results change.
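
Min-max normalization rescales each column to the range 0 to 1 using (x - min) / (max - min). Here is a minimal sketch on made-up predictors; `toy_X` and its column names are hypothetical and are not the penguin data.

```python
import pandas as pd

# hypothetical toy predictors, not the penguin data
toy_X = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 190.0],
    "shoe_size": [36.0, 38.0, 40.0, 44.0],
})

# (x - min) / (max - min), applied column by column
toy_X_norm = (toy_X - toy_X.min()) / (toy_X.max() - toy_X.min())
print(toy_X_norm)
```

Because `min()` and `max()` return one value per column, pandas lines them up with the matching columns automatically.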

In [None]:
# write your normalization code here
...

Did the results change? Did they get better? Did the value of k you would choose change?

*Your answer here*

## Part C: One-hot encoding and k-nearest neighbor
The last code cell of Part A performs ***label encoding***, which assigns a numeric value to each label. For the Island feature, however, there are 3 nominal categories that we assigned the numbers 0, 1, and 2. For no good reason, we implicitly told the model that Biscoe and Torgersen are twice as far from each other as Dream is from either, which could affect our results.

An alternative way to handle this situation is ***one-hot encoding***, as mentioned in lecture. This makes 3 new columns, one per category, each holding a boolean value. Pandas has a quick way to do this:
```Python
penguins_onehot = pd.get_dummies(penguins_clean, columns=['Island'])
penguins_onehot.head()
```
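
To make the distance argument concrete, the sketch below compares pairwise Euclidean distances between the three islands under the label encoding from Part A (0, 1, 2) and under a one-hot encoding. The two dictionaries and the `pairwise_distances` helper are written by hand for illustration only.

```python
import math
from itertools import combinations

# label encoding from Part A: one number per island
label = {"Biscoe": [0], "Dream": [1], "Torgersen": [2]}

# one-hot encoding: one 0/1 column per island
onehot = {"Biscoe": [1, 0, 0], "Dream": [0, 1, 0], "Torgersen": [0, 0, 1]}

def pairwise_distances(encoding):
    """Euclidean distance between every pair of islands."""
    return {
        (a, b): math.dist(encoding[a], encoding[b])
        for a, b in combinations(encoding, 2)
    }

label_d = pairwise_distances(label)
onehot_d = pairwise_distances(onehot)
print(label_d)   # Biscoe-Torgersen is twice Biscoe-Dream
print(onehot_d)  # every pair is the same distance apart
```

With one-hot encoding, no island is treated as "closer" to another, which is what we want for a nominal feature.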

In [None]:
# copy the code above and run it
...

Also note that we still need to encode Sex, because it is also nominal and we are computing distances. We can use one-hot encoding again, but since there are only 2 values, we don't need 2 new columns; 1 column fully represents the information. We've done this before with boolean indexing or label mapping, but you can also do it with one-hot encoding by adding an extra argument:
```Python
penguins_onehot = pd.get_dummies(penguins_onehot, columns=['Sex'], drop_first=True)
penguins_onehot.head()
```
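
To see what `drop_first=True` changes, here is a small sketch on a made-up two-category column. The `toy` DataFrame and its `answer` column are hypothetical.

```python
import pandas as pd

# hypothetical toy column with exactly two categories
toy = pd.DataFrame({"answer": ["yes", "no", "no", "yes"]})

# without drop_first: one column per category (two redundant columns)
both = pd.get_dummies(toy, columns=["answer"])

# with drop_first: the first category in sorted order ("no") is dropped
single = pd.get_dummies(toy, columns=["answer"], drop_first=True)

print(list(both.columns))
print(list(single.columns))
```

The single remaining column carries all the information: a row that is not "yes" must be "no".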

In [None]:
# copy the code above and run it
...

Finally, do cross-validation for k-nearest neighbors one more time.  First, create X and y again from the new table: 

In [None]:
y = penguins_onehot['Species'] # select column with target variable
X = penguins_onehot.drop(columns=['Species']) # keep all other columns with predictor variables
X.head()

Note that True should evaluate to 1 and False should evaluate to 0, so the new columns are already normalized! But to make this happen, we need to tell pandas to store them as float instead of bool:

In [None]:
X = X.astype(float)
X.head()

Don't forget to normalize Body Mass!  You can just normalize all of X again (it won't affect the dummy columns).  Finish the code below to add normalization.
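
The sketch below illustrates why min-max normalization leaves a 0/1 dummy column untouched. The `toy_X` table and its column names are hypothetical and are not the penguin data.

```python
import pandas as pd

# hypothetical toy predictors: one continuous column and one 0/1 dummy column
toy_X = pd.DataFrame({
    "mass": [3000.0, 4000.0, 5000.0],
    "is_red": [1.0, 0.0, 1.0],
})

toy_X_norm = (toy_X - toy_X.min()) / (toy_X.max() - toy_X.min())

# the dummy column already has min 0 and max 1, so (x - 0) / (1 - 0) == x
print(toy_X_norm["is_red"].equals(toy_X["is_red"]))
```

One caveat: this relies on the dummy column containing both a 0 and a 1. A column that is constant would give max - min = 0 and produce NaN values.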

In [None]:
# run cross-validation for k-nearest neighbors one more time
X_normalized = ...
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=5, shuffle=True, random_state=rstate)
from sklearn.neighbors import KNeighborsClassifier

scores = []
for k in range(1, 31):
    model = KNeighborsClassifier(n_neighbors=k)
    scores.append(cross_val_score(model, X_normalized, y, cv=kf).mean())
    
plt.plot(range(1, 31), scores)
plt.xlabel("k (number of neighbors)")
plt.ylabel("mean cross-validation accuracy")

Did the results change?  Did the best choice for k change?

*Your answer here*

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Submit the zip file and the PDF to Gradescope under Lab 19.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)