# Palmer Penguins Modeling

Import the Palmer Penguins dataset and print out the first few rows.

Suppose we want to predict `bill_depth_mm` using the other variables in the dataset.

Which variables would we need to **dummify**?

In [98]:
# pip install palmerpenguins

In [99]:
import pandas as pd
from palmerpenguins import load_penguins
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

In [100]:
pen = load_penguins()
pen.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |


In [101]:
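# Note: get_dummies encodes a missing `sex` as all-False dummies, so the later
# dropna() no longer removes rows whose only missing value was `sex`.
pen = pd.get_dummies(pen, columns=['species', 'island', 'sex'], drop_first=True)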
pen.dtypes

| column | dtype |
|---|---|
| bill_length_mm | float64 |
| bill_depth_mm | float64 |
| flipper_length_mm | float64 |
| body_mass_g | float64 |
| year | int64 |
| species_Chinstrap | bool |
| species_Gentoo | bool |
| island_Dream | bool |
| island_Torgersen | bool |
| sex_male | bool |


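We need to dummify `species`, `island`, and `sex`, since they are the only categorical (string-valued) columns; the remaining variables are already numeric.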
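
As a rough illustration of how the columns to dummify can be found programmatically, the sketch below selects the non-numeric columns of a small hand-made frame. The `toy` DataFrame and its values are made up for illustration; they are not the penguins data.

```python
import pandas as pd

# Toy stand-in for the penguins data (values are made up for illustration).
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "sex": ["male", "female", None, "female"],
    "bill_length_mm": [39.1, 46.5, 49.0, 36.7],
    "year": [2007, 2008, 2009, 2007],
})

# Non-numeric columns are the ones that need dummy encoding.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Drop incomplete rows first: get_dummies encodes a missing category as all
# zeros, which would otherwise hide the missing value from a later dropna().
encoded = pd.get_dummies(toy.dropna(), columns=categorical_cols, drop_first=True)
print(categorical_cols)
print(encoded.columns.tolist())
```

Dropping incomplete rows before encoding is one possible ordering; it keeps rows with an unknown category from being silently treated as the baseline level.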

Let's use `bill_length_mm` to predict `bill_depth_mm`. Prepare your data and fit the following models on the entire dataset:

* Simple linear regression (i.e. straight-line) model
* Quadratic (degree 2 polynomial) model
* Cubic (degree 3 polynomial) model
* Degree 10 polynomial model

Make predictions for each model and plot your fitted models on the scatterplot.

In [102]:
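lr = LinearRegression()
pen = pen.dropna()  # drop rows with missing measurements
y = pen[['bill_depth_mm']]   # target
X = pen[['bill_length_mm']]  # single predictor
# No random_state is set, so the split (and the scores below) change on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)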
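
Because `train_test_split` shuffles the rows at random, the scores in this notebook will differ from run to run. A minimal sketch of a reproducible split is shown here on toy arrays; the seed value `42` is an arbitrary choice, not something the assignment specifies.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the predictor and target.
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# Passing the same random_state yields the same partition every time.
split_a = train_test_split(X, y, test_size=0.25, random_state=42)
split_b = train_test_split(X, y, test_size=0.25, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```

Reusing one fixed split for every model would also make the test scores of the different polynomial degrees directly comparable.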

## SIMPLE LINEAR REGRESSION

In [103]:
lr_fit = lr.fit(X_train, y_train)
train_preds = lr_fit.predict(X_train)
test_preds = lr_fit.predict(X_test)

In [104]:
r2_score(y_train, train_preds)

0.04569054570566922

In [105]:
r2_score(y_test, test_preds)

0.07767961027878456

In [106]:
mean_squared_error(y_train, train_preds)

3.8061942946837806

In [107]:
mean_squared_error(y_test, test_preds)

3.288933902970652

## QUADRATIC MODEL

In [108]:
pen["x_sq"] = pen["bill_length_mm"]**2
X = pen[['bill_length_mm', 'x_sq']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [109]:
lr_fit = lr.fit(X_train, y_train)
train_preds = lr_fit.predict(X_train)
test_preds = lr_fit.predict(X_test)

In [110]:
r2_score(y_train, train_preds)

0.10576419156317562

In [111]:
r2_score(y_test, test_preds)

0.12434268851702668

In [112]:
mean_squared_error(y_train, train_preds)

3.643659425324497

In [113]:
mean_squared_error(y_test, test_preds)

2.8963952084597056

## CUBIC MODEL

In [114]:
pen["x_cube"] = pen["bill_length_mm"]**3
X = pen[['bill_length_mm', 'x_sq', "x_cube"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [115]:
lr_fit = lr.fit(X_train, y_train)
train_preds = lr_fit.predict(X_train)
test_preds = lr_fit.predict(X_test)

In [116]:
r2_score(y_train, train_preds)

0.14357674736373005

In [117]:
r2_score(y_test, test_preds)

0.10716993252812157

In [118]:
mean_squared_error(y_train, train_preds)

3.546394639994234

In [119]:
mean_squared_error(y_test, test_preds)

2.7745104787768864

## DEGREE 10 POLYNOMIAL MODEL

In [120]:
# Add the 4th through 10th powers of bill_length_mm as columns x_4 ... x_10
for power in range(4, 11):
    pen[f"x_{power}"] = pen["bill_length_mm"]**power
X = pen[['bill_length_mm', 'x_sq', "x_cube", "x_4", "x_5", "x_6", "x_7", "x_8", "x_9", "x_10"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
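
An alternative to building each power by hand is to let scikit-learn generate them inside a `Pipeline`. The sketch below is one way this might look, fitted on synthetic data; the data-generating formula, the helper name `poly_model`, and the added `StandardScaler` step are assumptions for illustration. Scaling is included because raw tenth powers of values near 40 to 60 are on the order of 1e16 to 1e17, which can make the least-squares problem poorly conditioned.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data in a bill-length-like range (made up, not the penguins data).
rng = np.random.default_rng(0)
x = rng.uniform(32, 60, size=200).reshape(-1, 1)
y = 17 + 0.002 * (x[:, 0] - 45) ** 2 + rng.normal(0, 0.5, size=200)

def poly_model(degree):
    """Hypothetical helper: polynomial expansion, scaling, then OLS."""
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("scale", StandardScaler()),
        ("ols", LinearRegression()),
    ])

# Fit one pipeline per degree used in this notebook.
models = {d: poly_model(d).fit(x, y) for d in (1, 2, 3, 10)}

# Number of generated polynomial columns for each degree.
n_features = {d: m.named_steps["poly"].n_output_features_ for d, m in models.items()}
print(n_features)
```

With this layout the only thing that changes between models is the `degree` argument, so the same data and split can be reused for all four fits.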

In [121]:
lr_fit = lr.fit(X_train, y_train)
train_preds = lr_fit.predict(X_train)
test_preds = lr_fit.predict(X_test)

In [122]:
r2_score(y_train, train_preds)

0.2738347110352868

In [123]:
r2_score(y_test, test_preds)

0.31759097141943426

In [124]:
mean_squared_error(y_train, train_preds)

2.7200943295870377

In [125]:
mean_squared_error(y_test, test_preds)

2.9427809830003855
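
The prompt also asks for the fitted models to be drawn on the scatterplot. A minimal sketch of such a plot follows, using synthetic data and `numpy.polynomial.Polynomial.fit` as a stand-in for `LinearRegression` on the hand-built power columns; both are least-squares polynomial fits, but this is a swapped-in technique and the data here are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic stand-in for bill_length_mm (x) and bill_depth_mm (y).
rng = np.random.default_rng(1)
x = rng.uniform(32, 60, size=150)
y = 17 + 0.002 * (x - 45) ** 2 + rng.normal(0, 0.5, size=150)

# Evenly spaced x values so each fitted curve is drawn smoothly.
grid = np.linspace(x.min(), x.max(), 200)

fig, ax = plt.subplots(figsize=(7, 4))
ax.scatter(x, y, s=10, alpha=0.4, color="gray", label="data")

curves = {}
for degree in (1, 2, 3, 10):
    # Polynomial.fit rescales x internally, which keeps high degrees stable.
    fitted = Polynomial.fit(x, y, deg=degree)
    curves[degree] = fitted(grid)
    ax.plot(grid, curves[degree], label=f"degree {degree}")

ax.set_xlabel("bill_length_mm")
ax.set_ylabel("bill_depth_mm")
ax.legend()
```

Plotting predictions over a sorted grid rather than over the raw observations avoids the zig-zag lines that appear when unsorted points are connected.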

* Are any of the models above underfitting the data? If so, which ones and how can you tell?
* Are any of the models above overfitting the data? If so, which ones and how can you tell?
* Which of the above models do you think fits the data best and why?

The simple linear regression model is clearly underfitting the data: its R^2 is very low (< 0.1) on both the training and test predictions, and its MSE is the highest of the four models on both sets (3.81 train, 3.29 test). The degree 10 polynomial is likely overfitting, since its test MSE (2.94) is greater than its training MSE (2.72), though only by a little. That evidence is weak, however: its test R^2 (0.32) is actually higher than its training R^2 (0.27), and each model above was scored on a different random train/test split, so differences between test scores are partly noise. If I had to pick one of these models as the best fit, I would go with the cubic model because it had the lowest test MSE (2.77).
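
Since a single random split gives noisy test scores, one way to compare the four degrees more reliably is k-fold cross-validation. The sketch below is a hedged illustration on synthetic data, not a rerun of the results above; the fold count, seeds, and data-generating formula are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data in a bill-length-like range (made up, not the penguins data).
rng = np.random.default_rng(2)
X = rng.uniform(32, 60, size=(200, 1))
y = 17 + 0.002 * (X[:, 0] - 45) ** 2 + rng.normal(0, 0.5, size=200)

# The same shuffled folds are reused for every degree, so scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for degree in (1, 2, 3, 10):
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    # scikit-learn reports negated MSE so that larger is better; flip the sign.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
print(cv_mse, best_degree)
```

A degree whose cross-validated MSE is clearly worse than that of a simpler model is a more convincing sign of overfitting than a small train/test gap from one split.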