# Palmer Penguins Modeling

Import the Palmer Penguins dataset and print out the first few rows.

Suppose we want to predict `bill_depth_mm` using the other variables in the dataset.

**Dummify** all variables that require this.
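
As a minimal sketch of what dummifying means, the toy frame below (made-up values, not the penguins data) has one categorical column, which `pd.get_dummies` expands into one 0/1 indicator column per category. The pipelines in this notebook get the same effect with `OneHotEncoder`; the names `toy` and `toy_dummies` are only for illustration.

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "body_mass_g": [3750.0, 5000.0, 3800.0, 3700.0],
})

# Replace the categorical column with one 0/1 column per category.
# Numeric columns are left untouched.
toy_dummies = pd.get_dummies(toy, columns=["species"], dtype=int)
print(toy_dummies)
```

Only text-valued columns such as `species`, `island`, and `sex` need this treatment; numeric columns can be used as they are or standardized.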

In [22]:
# pip install palmerpenguins

In [23]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from palmerpenguins import load_penguins
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score

In [24]:
pen = load_penguins()
pen = pen.dropna()
pen.head()

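|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 |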


Let's use the other variables to predict `bill_depth_mm`. Prepare your data and fit the following models on the entire dataset:

* Your best multiple linear regression model from before
* Two kNN models (for different values of K)
* A decision tree model

Create a plot like the right-hand panel of Fig. 1 in our `Model Validation` chapter, showing the training error and the test error for each of your four models.

Which of your models was best?

In [25]:
# Predictors are every column except the target.
X = pen.drop(columns=["bill_depth_mm"])
y = pen["bill_depth_mm"]

# Dummify the text columns and standardize the numeric ones.
ct = ColumnTransformer(
  [
    ("dummify",
    OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
    make_column_selector(dtype_include=object)),
    ("standardize",
    StandardScaler(),
    make_column_selector(dtype_include=np.number))
  ],
  remainder="passthrough"
)

In [26]:
lin_pipe = Pipeline(
  [("preprocessing", ct),
  ("linear_regression", LinearRegression())]
)

# 5-fold cross-validation; sklearn reports negated MSE, so flip the sign.
scores = cross_val_score(lin_pipe, X, y, cv=5, scoring='neg_mean_squared_error')
-scores.mean()

0.8658344032118516
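
A short aside on what this call computes. With `scoring='neg_mean_squared_error'`, `cross_val_score` returns one negated MSE per fold, so negating the mean gives an ordinary MSE and its square root is in the units of the target. With an integer `cv` and a regressor, scikit-learn uses consecutive, unshuffled folds; if the rows are ordered (for example, grouped by species), it may be worth passing a shuffled `KFold` instead. The sketch below uses synthetic stand-in data, and `X_demo`, `y_demo`, and `folds` are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not the penguins): 100 rows, 3 numeric features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# A KFold object lets us shuffle the rows before they are split into folds.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=folds, scoring="neg_mean_squared_error",
)

mse = -neg_mse.mean()  # flip the sign to get an ordinary MSE
rmse = np.sqrt(mse)    # same units as the target
print(neg_mse.shape, mse, rmse)
```

Passing `cv=folds` in place of `cv=5` to the calls in this notebook would apply the same shuffling to the penguins pipelines; the reported scores would then differ from the outputs shown here.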

In [27]:
knn_pipe_1 = Pipeline(
  [("preprocessing", ct),
  ("knn", KNeighborsRegressor(n_neighbors=3))]
)

scores = cross_val_score(knn_pipe_1, X, y, cv=5, scoring='neg_mean_squared_error')
-scores.mean()

1.1505668626564147

In [28]:
knn_pipe_2 = Pipeline(
  [("preprocessing", ct),
  ("knn", KNeighborsRegressor(n_neighbors=12))]
)

scores = cross_val_score(knn_pipe_2, X, y, cv=5, scoring='neg_mean_squared_error')
-scores.mean()

1.2116957353384588

In [29]:
tree_pipeline = Pipeline(
    [("preprocessing", ct),
    ('tree', DecisionTreeRegressor(max_depth=3))]
)

scores = cross_val_score(tree_pipeline, X, y, cv=5, scoring='neg_mean_squared_error')
-scores.mean()

0.8042738970010828

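My best model was the decision tree (`max_depth=3`), which had the lowest 5-fold cross-validated MSE (about 0.80). Linear regression came next (about 0.87), followed by kNN with K=3 (about 1.15) and kNN with K=12 (about 1.21).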
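
The cross-validated scores do not show training error, which the requested plot needs. The following is a rough sketch of one way to get both numbers: hold out a test set, fit each of the four model types on the rest, and plot training and test MSE side by side. It runs on synthetic stand-in data so that it is self-contained; `X_demo`, `y_demo`, `with_preprocessing`, and the 25% test size are assumptions for illustration, and the layout is only a guess at the chapter's figure.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the penguins data; all values are made up.
rng = np.random.default_rng(0)
n = 200
species = rng.choice(["Adelie", "Chinstrap", "Gentoo"], size=n)
mass = rng.normal(4200.0, 800.0, size=n)
depth = (17.0 - 2.5 * (species == "Gentoo")
         + 0.0005 * (mass - 4200.0) + rng.normal(0.0, 0.8, size=n))
X_demo = pd.DataFrame({"species": species, "body_mass_g": mass})
y_demo = pd.Series(depth, name="bill_depth_mm")


def with_preprocessing(model):
    # Same idea as the notebook's ColumnTransformer, with explicit columns.
    prep = ColumnTransformer([
        ("dummify", OneHotEncoder(handle_unknown="ignore"), ["species"]),
        ("standardize", StandardScaler(), ["body_mass_g"]),
    ])
    return make_pipeline(prep, model)


models = {
    "linear": with_preprocessing(LinearRegression()),
    "kNN (K=3)": with_preprocessing(KNeighborsRegressor(n_neighbors=3)),
    "kNN (K=12)": with_preprocessing(KNeighborsRegressor(n_neighbors=12)),
    "tree (depth 3)": with_preprocessing(
        DecisionTreeRegressor(max_depth=3, random_state=0)),
}

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Fit on the training rows only, then score on both splits.
errors = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    errors[name] = (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
    )

names = list(errors)
fig, ax = plt.subplots()
ax.plot(names, [errors[k][0] for k in names], marker="o", label="training MSE")
ax.plot(names, [errors[k][1] for k in names], marker="o", label="test MSE")
ax.set_xlabel("model")
ax.set_ylabel("mean squared error")
ax.legend()
```

To apply this to the penguins, one would swap in `X` and `y` from the notebook and the four pipelines defined above (`lin_pipe`, `knn_pipe_1`, `knn_pipe_2`, `tree_pipeline`). A model whose training error is far below its test error is likely overfitting.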