# Palmer Penguins Modeling

Import the Palmer Penguins dataset and print out the first few rows.

Suppose we want to predict `species` using the other variables in the dataset.

**Dummify** all variables that require this.
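
As a quick illustration of what dummifying means, here is a minimal sketch using `pandas.get_dummies` on a tiny hand-made frame. The three rows are illustrative values, not rows taken from the dataset, and the name `toy` is only for this example.

```python
import pandas as pd

# Tiny hand-made frame (illustrative values, not rows from the dataset)
toy = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Dream"],
    "sex": ["male", "female", "female"],
    "bill_length_mm": [39.1, 46.5, 49.0],
})

# One 0/1 indicator column per category level; the numeric column is left unchanged
dummies = pd.get_dummies(toy, columns=["island", "sex"], dtype=int)
print(dummies.columns.tolist())
```

Each categorical column becomes one indicator column per level, so every row has exactly one `1` among the `island_*` columns and one among the `sex_*` columns.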

In [21]:
# pip install palmerpenguins

In [22]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from palmerpenguins import load_penguins
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.compose import make_column_selector, ColumnTransformer

In [23]:
pen = load_penguins()
pen = pen.dropna()
pen.head()

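|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 |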


Let's use the other variables to predict `species`. Prepare your data and fit the following models on the entire dataset:

* Two kNN models (for different values of K)
* Two decision tree models (for different complexities of trees)

Compute the following, for each of your models, on test data. Keep in mind that you may need to stratify your creation of the training and test data.

* Confusion matrix
* Overall Accuracy
* Precision, Recall, AUC, and F1-score for each species
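
The label-based metrics in this list can be computed with `sklearn.metrics`. The sketch below uses short made-up label lists (not model output) so the numbers are easy to check by hand. AUC needs predicted probabilities rather than labels, so it is not covered here.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

labels = ["Adelie", "Chinstrap", "Gentoo"]

# Made-up true and predicted labels, purely for illustration
y_true = ["Adelie", "Adelie", "Adelie", "Chinstrap", "Chinstrap",
          "Gentoo", "Gentoo", "Gentoo"]
y_pred = ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Chinstrap",
          "Gentoo", "Gentoo", "Adelie"]

# Rows are true species, columns are predicted species (in the order of `labels`)
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Share of all predictions that are correct
acc = accuracy_score(y_true, y_pred)

# One value per species for each metric, in the order of `labels`
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

print(cm)
print(f"accuracy = {acc:.2f}")
for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name}: precision = {p:.2f}, recall = {r:.2f}, F1 = {f:.2f}")
```

Precision for a species is the share of predictions of that species that were correct (a column of the matrix), while recall is the share of true members of that species that were found (a row of the matrix).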

Create one ROC plot for the species of your choice.
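
With three species, an ROC curve is usually drawn one-vs-rest: pick one species as the positive class and score each test row by its predicted probability for that species. The sketch below assumes synthetic data from `make_blobs` as a stand-in for the penguin features (these are not real measurements), and the choice of Gentoo and of K = 10 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the penguin features (NOT real measurements)
X_demo, y_int = make_blobs(n_samples=300, centers=3, cluster_std=3.0, random_state=0)
species_names = np.array(["Adelie", "Chinstrap", "Gentoo"])
y_demo = species_names[y_int]

# Stratify so each species keeps the same share in the training and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

clf = KNeighborsClassifier(n_neighbors=10).fit(X_tr, y_tr)

# One-vs-rest: treat the chosen species as the positive class
target = "Gentoo"
col = list(clf.classes_).index(target)
scores = clf.predict_proba(X_te)[:, col]

# False positive rate and true positive rate at each probability threshold
fpr, tpr, thresholds = roc_curve(y_te == target, scores)
roc_auc = auc(fpr, tpr)
print(f"AUC for {target} vs. rest: {roc_auc:.3f}")
```

Passing `fpr` and `tpr` to a plotting call such as matplotlib's `plt.plot(fpr, tpr)` draws the curve. Repeating the same steps with each species as `target` gives the per-species AUC values.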

In [24]:
# Predictors are every column except the target
X = pen.drop(columns=["species"])
y = pen["species"]

# Stratify on species so each class keeps its share in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=34
)

# Dummify the categorical columns and standardize the numeric ones
ct = ColumnTransformer(
    [
        ("dummify",
         OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
         make_column_selector(dtype_include=object)),
        ("standardize",
         StandardScaler(),
         make_column_selector(dtype_include=np.number)),
    ],
    remainder="passthrough",
)

In [25]:
# species is categorical, so use a classifier rather than a regressor
knn_pipe_1 = Pipeline(
  [("preprocessing", ct),
  ("knn", KNeighborsClassifier(n_neighbors=3))]
)

In [26]:
# Larger K for comparison (smoother decision boundary)
knn_pipe_2 = Pipeline(
  [("preprocessing", ct),
  ("knn", KNeighborsClassifier(n_neighbors=10))]
)

In [27]:
# Shallow classification tree (lower complexity)
tree_pipeline_1 = Pipeline(
    [("preprocessing", ct),
    ('tree', DecisionTreeClassifier(max_depth=3))]
)

In [28]:
# Deeper classification tree (higher complexity)
tree_pipeline_2 = Pipeline(
    [("preprocessing", ct),
    ('tree', DecisionTreeClassifier(max_depth=10))]
)
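
The pipelines above are only defined, not yet fit or scored. Here is a minimal sketch of one way to fit and evaluate four pipelines of this shape, using classifiers since `species` is categorical. To keep the block runnable without the `palmerpenguins` package, it assumes randomly generated stand-in data (not real measurements) with explicitly named columns. The helpers `make_pipeline_for` and `evaluate` are hypothetical names introduced for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Random stand-in data shaped loosely like the penguin frame (NOT real measurements)
rng = np.random.default_rng(0)
names = np.array(["Adelie", "Chinstrap", "Gentoo"])
idx = rng.integers(0, 3, size=150)
demo = pd.DataFrame({
    "island": rng.choice(["Biscoe", "Dream", "Torgersen"], size=150),
    "bill_length_mm": 40.0 + 5.0 * idx + rng.normal(0, 2, size=150),
    "flipper_length_mm": 190.0 + 10.0 * idx + rng.normal(0, 5, size=150),
    "species": names[idx],
})
X_demo, y_demo = demo.drop(columns="species"), demo["species"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=34
)


def make_pipeline_for(model):
    """Hypothetical helper: dummify + standardize, then the given classifier."""
    ct = ColumnTransformer([
        ("dummify", OneHotEncoder(handle_unknown="ignore"), ["island"]),
        ("standardize", StandardScaler(), ["bill_length_mm", "flipper_length_mm"]),
    ])
    return Pipeline([("preprocessing", ct), ("model", model)])


def evaluate(pipe):
    """Hypothetical helper: fit on the training split, score on the held-out split."""
    pipe.fit(X_tr, y_tr)
    y_pred = pipe.predict(X_te)
    return {
        "confusion_matrix": confusion_matrix(y_te, y_pred, labels=pipe.classes_),
        "accuracy": accuracy_score(y_te, y_pred),
        "report": classification_report(y_te, y_pred, zero_division=0),
    }


models = {
    "knn_k3": KNeighborsClassifier(n_neighbors=3),
    "knn_k10": KNeighborsClassifier(n_neighbors=10),
    "tree_depth3": DecisionTreeClassifier(max_depth=3, random_state=0),
    "tree_depth10": DecisionTreeClassifier(max_depth=10, random_state=0),
}
results = {name: evaluate(make_pipeline_for(model)) for name, model in models.items()}
for name, res in results.items():
    print(f"{name}: accuracy = {res['accuracy']:.3f}")
```

With the real data, the same `evaluate` pattern would be applied to `knn_pipe_1`, `knn_pipe_2`, `tree_pipeline_1`, and `tree_pipeline_2` with `X_train`, `X_test`, `y_train`, and `y_test`. The `report` entry holds per-species precision, recall, and F1 as text.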