<a href="https://colab.research.google.com/github/zmy2338/Machine-Learning-AWS/blob/main/TRAIN_AWS_P1_Day_11_Projects_%5BSTUDENTS%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Day #11 Project: Analysis of Palmer Archipelago (Antarctica) Penguins**
---

### **Description** 
In this notebook, you will apply everything you have learned to process, explore, and model a dataset collected on penguin species in Antarctica using pandas, matplotlib, and sklearn as well as advanced techniques such as specialized encodings and K-Folds Cross Validation.

<br>

### **Overview**
You will analyze a dataset containing information on the Palmer Archipelago Penguins in Antarctica. In particular, you will wrangle and explore this dataset. You will then train ML models to classify penguins by species and predict their mass. This is real data collected by biologists as part of the [Long Term Ecological Research Network](https://lternet.edu/).

*Thank you to Dr. Gorman, Palmer Station LTER, the LTER Network, and Marty Downs—Director, LTER Network Office.*

<br>

**Codebook:**

Below is a list of variables in this dataset. Unless you are a biologist familiar with penguins, you will likely not understand the meaning of every single variable. It is typical for ML practitioners to go into a project unaware of the full context within which their data lives. As you go through this Part, consider if it would be helpful to learn more about any information you are unfamiliar with.
* `studyName`: name of the specific study for which the data was collected
* `Sample Number`: unique number identifying each data point
* `Species`: the species of each penguin
* `Region`: the region where this data was collected
* `Island`: the island in this region where this data was collected
* `Stage`: the stage of life that the penguin was in that was sampled
* `Individual_ID`: unique label identifying each penguin
* `Clutch Completion`: whether or not the nest had a "full clutch", meaning 2 eggs
* `Date Egg`: date of data collection
* `Culmen_Length (mm)`: length of the dorsal ridge of a bird's bill
* `Culmen_Depth (mm)`: depth of the dorsal ridge of a bird's bill
* `Flipper Length (mm)`: length of the flipper
* `Body Mass (g)`: mass of the penguin
* `Sex`: the sex of the penguin, labeled as MALE or FEMALE
* `Delta 15 N (o/oo)`: ratio of stable isotopes 15N:14N
* `Delta 13 C (o/oo)`: ratio of stable isotopes 13C:12C
* `Comments`: any extra information provided by the researchers

<br>

### **Key questions to answer:**
1. What features most strongly predict a penguin's body mass?
2. What features most strongly predict a penguin's island of origin?
3. How do KNN and Logistic Regression compare when both are applied to the same task?

<br>

### **Goals:**
By the end of these projects, you will have:
1. Visualized relationships between various variables in the data.
2. Visualized the behavior of variables across features (e.g. grouped bar graphs, etc.).
3. Implemented linear regression, KNN, and logistic regression models tuned to best fit this dataset for a variety of tasks.

<br>

### **Lab Structure**
**Part 1**:  [Data Exploration, Wrangling, and Visualization](#p1)

**Part 2**:  [Predicting Body Mass](#p2)

> **Part 2.1**:  [Using All Numerical Features and Label Encodings](#p21)

> **Part 2.2**:  [Using All Numerical Features and Dummy Variable Encodings](#p22)

> **Part 2.3**:  [Using the 4 Best Numerical Features and the Best Encodings Where Relevant](#p23)

> **Part 2**:  [Wrapup](#p2w)

**Part 3**:  [Predicting Island of Origin](#p3)

> **Part 3.1**:  [Using All Numerical Features and Label Encodings](#p31)

> **Part 3.2**:  [Using All Numerical Features and Dummy Variable Encodings](#p32)

> **Part 3.3**:  [Using the 4 Best Numerical Features and the Best Encodings Where Relevant](#p33)

> **Part 3**:  [Wrapup](#p3w)



<br>

### **Cheat Sheets**

* [pandas Commands](https://docs.google.com/document/d/1v-MZCgoZJGRcK-69OOu5fYhm58x2G0JUWyi2H53j8Ls/edit)

* [Feature Engineering and Selection with pandas](https://docs.google.com/document/d/191CH-X6zf4lESuThrdIGH6ovzpHK6nb9NRlqSIl30Ig/edit?usp=sharing)

* [Standardization, Encoding, and K-Folds with sklearn](https://docs.google.com/document/d/1wu_J33O9PooGahfrnyyN2-Mwza869Ab8GnzDypqjTaw/edit?usp=sharing)

* [Data Visualizations with matplotlib](https://docs.google.com/document/d/1EC3tTjRRL5ruNjc1n8UmJNGvN82_S-7rx7LLkMvv1Qk/edit?usp=share_link)

* [Linear Regression with sklearn](https://docs.google.com/document/d/1oucIbrFgNu6rYbHqCwqKh_XWM8CXreyrPdtgUJwKyk0/edit?usp=sharing)

* [K-Nearest Neighbors with sklearn](https://docs.google.com/document/d/1fCZ1Gp9eM-Oxs_qb6cOiyPpwkqz155L0GMJl2oxQfXo/edit?usp=share_link)

* [Logistic Regression with sklearn](https://docs.google.com/document/d/1rLTuWGgx9E-K1pgWYxUF4B1ExKKxt6MVSkgEKoUbhuE/edit?usp=sharing)


<br>

**Before starting, run the code below to import all necessary functions, libraries, and data.**

In [None]:
#!pip install scikit-learn

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import *

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier



url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRt7krGK0b_0oRATwhWaVDxyKt2jzOQnoP7JaNmGrMgVcbXDpoyxhbXFsF1XV4GCcU1x9e-3i7-ab4p/pub?gid=828946676&single=true&output=csv'
penguin_df = pd.read_csv(url)

<a name="p1"></a>

---
## **Part 1: Data Exploration, Wrangling, and Visualization**
---

First, you need to explore and wrangle this dataset.

### **Problem #1**
---

Using pandas functions, look at the first few rows of data.

### **Problem #2**
---

This data currently has no consistent naming convention for columns, which makes it harder to work with and more error-prone. Rename each column to the style `'Column Name'`, or `'Column Name (unit)'` if the variable has an associated unit such as `'mm'`: separate words with a single space (not an underscore, slash, or anything else) and capitalize the first letter of each word. Also make sure all words are spelled correctly.

<br>

**Hint**: It may make your life easier to quickly print the current column names here using the `.columns` attribute and then copy the names that need to be updated.
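
As a minimal sketch of the renaming pattern (on a small made-up frame, not the real dataset), `rename(columns=...)` takes a dictionary from old names to new names. The frame `toy_df` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with inconsistently named columns (values are made up).
toy_df = pd.DataFrame({
    "studyName": ["S1", "S2"],
    "Culmen_Length (mm)": [39.1, 46.5],
})

# Map each old column name to its cleaned-up replacement.
new_names = {
    "studyName": "Study Name",
    "Culmen_Length (mm)": "Culmen Length (mm)",
}
toy_df = toy_df.rename(columns=new_names)

print(list(toy_df.columns))
```

Columns that are not listed in the dictionary keep their current names, so only the ones that need fixing have to be included.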

### **Problem #3**
---

Drop any duplicate rows.

### **Problem #4**
---

Determine the datatypes and number of non-null values in each column.

### **Problem #5**
---

You should have seen from Problem #4 that there are some data points containing null values. Decide to either drop or impute these so that there are no remaining null values.

<br>

For `Comments` in particular, there are only values for the data points where a researcher specifically added a comment. To make this clearer, fill each null value with the string `'no comments'`.
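
The sketch below shows, on a made-up frame, one possible order of operations: fill the comment column first, then impute or drop whatever is still missing. Whether to impute or drop in the real dataset is your decision; the frame `toy_df` and its values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (all values are made up).
toy_df = pd.DataFrame({
    "Body Mass (g)": [3750.0, np.nan, 4200.0, 5000.0],
    "Sex": ["MALE", "FEMALE", None, "FEMALE"],
    "Comments": [None, "example note", None, None],
})

# Fill the free-text column first so rows are not dropped just for lacking a comment.
toy_df["Comments"] = toy_df["Comments"].fillna("no comments")

# Option A: impute a numerical column with its mean.
toy_df["Body Mass (g)"] = toy_df["Body Mass (g)"].fillna(toy_df["Body Mass (g)"].mean())

# Option B: drop any row that still contains a null value.
toy_df = toy_df.dropna()

print(toy_df.isna().sum())
```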

In [None]:
# COMPLETE THIS CODE

penguin_df.info()

### **Problem #6**
---

Using pandas functions, determine the mean (average) and std (spread/standard deviation) of the numerical variables.

<br>

**Consider the results and decide if you think the data should be standardized before modeling.**

### **Problem #7**
---

Convert the following columns from string to numerical values:
* `Sex` from `'FEMALE'` and `'MALE'` to 0 and 1 respectively.
* `Clutch Completion` from `'No'` and `'Yes'` to 0 and 1 respectively.

**NOTE**: Recheck for null values after converting and, if necessary, handle them as in Problem #5.
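
To see why the recheck matters, here is a small sketch using `Series.map(...)` on made-up values. Any value that is not a key in the mapping becomes `NaN`; the stray `'.'` entry below is a hypothetical example of such a value.

```python
import pandas as pd

# Toy column with one unexpected entry (hypothetical).
toy_sex = pd.Series(["MALE", "FEMALE", ".", "FEMALE"])

# Values missing from the dictionary are mapped to NaN.
sex_codes = toy_sex.map({"FEMALE": 0, "MALE": 1})

print(sex_codes.isna().sum())
```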

In [None]:
# Sex

In [None]:
# Clutch Completion

### **Problem #8**
---

Using pandas functions, determine all the regions, islands, species, and stages that were included in this dataset.

In [None]:
# Region

In [None]:
# Island

In [None]:
# Species

In [None]:
# Stage

### **Problem #9**
---

Based on the results above, for each column either:
* Encode it if it seems to offer valuable information for distinguishing data points. Create both a label encoding and a dummy variable encoding as new columns within the original dataset; you will try both encodings later on.
* Drop the column if not.
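
The two encodings can be sketched as follows on a made-up column. The island names, the integer codes, and the new column names are illustrative assumptions; this is one possible approach, not the required one.

```python
import pandas as pd

# Toy categorical column (values chosen for illustration).
toy_df = pd.DataFrame({"Island": ["Dream", "Torgersen", "Biscoe", "Dream"]})

# Label encoding: one integer per category, stored in a single new column.
island_codes = {"Biscoe": 0, "Dream": 1, "Torgersen": 2}
toy_df["Island Label"] = toy_df["Island"].map(island_codes)

# Dummy variable encoding: one 0/1 column per category.
dummies = pd.get_dummies(toy_df["Island"], prefix="Island", prefix_sep=" ", dtype=int)
toy_df = pd.concat([toy_df, dummies], axis=1)

print(toy_df)
```

Keeping both encodings side by side in the same frame makes it possible to pick one or the other when choosing features later.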

In [None]:
# Region

In [None]:
# Island

In [None]:
# Species

In [None]:
# Stage

### **Problem #10**
---

The columns listed below are purely human labels for organizing their data collection, but have nothing to do with the biological samples they collected. To avoid our models picking up on patterns that aren't there, let's drop these columns:
* `Study Name`
* `Sample Number`
* `Individual ID`
* `Date Egg`

### **Problem #11**
---

Create separate scatterplots for each of the following relationships:

1. `Flipper Length (mm)` and `Body Mass (g)` with each `Sex` colored differently.
2. `Culmen Depth (mm)` and `Body Mass (g)` with each `Sex` colored differently.
3. `Culmen Depth (mm)` and `Body Mass (g)` with each `Species` colored differently.
4. `Culmen Length (mm)` and `Body Mass (g)` with each `Species` colored differently.
5. `Culmen Length (mm)` and `Body Mass (g)` with each `Island` colored differently.

<br>

**Make sure to include a meaningful title, x label, y label, and legend for all plots.**

#### **1. `Flipper Length (mm)` and `Body Mass (g)` with each `Sex` colored differently.**

#### **2. `Culmen Depth (mm)` and `Body Mass (g)` with each `Sex` colored differently.**

#### **3. `Culmen Depth (mm)` and `Body Mass (g)` with each `Species` colored differently.**

#### **4. `Culmen Length (mm)` and `Body Mass (g)` with each `Species` colored differently.**

#### **5. `Culmen Length (mm)` and `Body Mass (g)` with each `Island` colored differently.**

### **Problem #12**
---

Create either 3 bar graphs or one grouped bar graph to show how the following variables are distributed across species:
* Average `Culmen Depth (mm)`
* Average `Culmen Length (mm)`
* Average `Flipper Length (mm)`

<br>

You will need to do this in two parts:
1. Calculate the average of each variable by species.
2. Plot these averages.

#### **1. Calculate the average of each variable by species.**

**Hint**: One approach would be to use `groupby(...)` and calculate the averages.
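
A minimal sketch of the `groupby(...)` approach on made-up measurements; the species labels `'A'` and `'B'` and all numbers are placeholders.

```python
import pandas as pd

# Toy measurements for two placeholder species.
toy_df = pd.DataFrame({
    "Species": ["A", "A", "B", "B"],
    "Culmen Depth (mm)": [18.0, 19.0, 14.0, 15.0],
    "Flipper Length (mm)": [190.0, 194.0, 215.0, 221.0],
})

# Average each selected column within each species.
species_means = toy_df.groupby("Species")[["Culmen Depth (mm)", "Flipper Length (mm)"]].mean()

print(species_means)
```

The result is a DataFrame indexed by species, which can be passed to a plotting call directly.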

#### **2. Plot these averages.**

<a name="p2"></a>

---
## **Part 2: Predicting Body Mass**
---

Now you will create, train, and evaluate an ML model to predict the body mass of a penguin based on the provided numerical features. It is up to you to determine whether Linear Regression, KNN, or Logistic Regression is appropriate here.

<br>

You will do this three times as follows:

**Part 2.1**: Using All Numerical Features and Label Encodings

**Part 2.2**: Using All Numerical Features and Dummy Variable Encodings

**Part 2.3**: Using the 4 Best Numerical Features and the Best Encodings Where Relevant

<br>

You will follow the 8-step process for implementing ML models that you have learned over the last week:

1. Load in the data (**this has already been done in Part 1**)
2. Decide independent and dependent variables
3. Split the data into training and testing datasets
4. Import an ML algorithm
5. Set the model’s parameters
6. Fit the model on the training set and test the model on the test dataset. Draw a visualization (if applicable to the model)
7. Evaluate the model’s performance
8. Apply your model

<a name="p21"></a>

---
### **Part 2.1: Using All Numerical Features and Label Encodings**
---

#### **Step #2: Decide independent and dependent variables**
---

Complete the code below to decide the independent and dependent variables. Make sure to only use label encodings where relevant instead of other representations of the same variable(s).

<br>

**NOTE**: The dependent variable (label) for all of Part 2 is `Body Mass (g)`. Using one of several pandas functions, you can determine the numerical features available and use them all as the independent variables.

In [None]:
penguin_df.# COMPLETE THIS LINE

In [None]:
x = # COMPLETE THIS LINE
y = # COMPLETE THIS LINE

#### **Step #3: Split data into training and testing data and standardize appropriately**
---

Complete the code below to split the data, using 80% for training and 20% for testing.
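
As a hedged sketch of this step on synthetic data, the split comes first and the `StandardScaler` is fit on the training portion only, so that no information from the test set leaks into the scaling. The synthetic features, the target formula, and `random_state=42` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the penguin dataset).
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "Flipper Length (mm)": rng.normal(200, 14, size=50),
    "Culmen Length (mm)": rng.normal(44, 5, size=50),
})
toy_y = 50 * toy_X["Flipper Length (mm)"] - 5800 + rng.normal(0, 300, size=50)

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```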

#### **Step #4: Import the algorithm**
---

#### **Step #5:  Initialize the model and set hyperparameters**
---

Specifically,
* For Linear Regression, there are no hyperparameters to set.
* For KNN, choose a reasonable value for `n_neighbors`. You are encouraged to repeat Steps #5 - 7 for several values and pick the model with the highest performance.
* For Logistic Regression, decide whether you need to specify `multi_class = 'ovr'` or not.

<br>

**NOTE**: For classification tasks, both KNN and Logistic Regression apply, so you should always try modeling the data with *both* and see which one works best after hyperparameter tuning.
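
One way to compare several `n_neighbors` values is sketched below on synthetic two-cluster data. It uses `cross_val_score` as a shortcut for the fold loop; the candidate values, the 5 folds, and the data are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Two made-up, well-separated clusters with labels 0 and 1.
rng = np.random.default_rng(0)
toy_X = np.vstack([
    rng.normal(0, 1, size=(60, 2)),
    rng.normal(4, 1, size=(60, 2)),
])
toy_y = np.array([0] * 60 + [1] * 60)

# Record the mean cross-validated accuracy for each candidate k.
results = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    results[k] = cross_val_score(knn, toy_X, toy_y, cv=5).mean()

best_k = max(results, key=results.get)
print(results, best_k)
```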

In [None]:
model_1 = # COMPLETE THIS LINE

#### **Steps #6 - 7: Fit your model, evaluating using 10-Folds Cross Validation. Create a visualization if applicable**
---

Specifically,

1. Fit the model to the training data, determining an average relevant evaluation metric using 10-Folds Cross Validation.
2. Train the final model and visualize the results.

##### **1. Fit the model to the training data, determining an average relevant evaluation metric using 10-Folds Cross Validation.**


**NOTE**: The y-data in this section is still in the form of a pandas DataFrame, so to access a specific index you need to use `.iloc[...]`.
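
The fold loop might look roughly like the sketch below, shown here with Linear Regression and R² on synthetic data. The model, the metric, and the data are assumptions; substitute whichever model and metric fit your task. Note the use of `.iloc[...]` to index the pandas labels by position.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic features (array) and labels (pandas Series).
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 3))
toy_y = pd.Series(toy_X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=100))

folds = KFold(n_splits=10, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in folds.split(toy_X):
    # Train a fresh model on 9 folds and score it on the held-out fold.
    model = LinearRegression()
    model.fit(toy_X[train_idx], toy_y.iloc[train_idx])
    val_pred = model.predict(toy_X[val_idx])
    scores.append(r2_score(toy_y.iloc[val_idx], val_pred))

print(np.mean(scores))
```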

##### **2. Visualize the results.**

Code is provided for both a regression and a classification visualization, but it is up to you to decide which one makes the most sense here.

###### **Regression Visualization**

In [None]:
# Visualize comparison of predictions vs. actual values
plt.scatter(y_test, pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color = 'black', label='Correct Predictions')


plt.xlabel('True Value')
plt.ylabel('Predicted Value')
plt.title('True vs. Predicted Value')
plt.legend()

plt.show()

###### **Classification Visualization**

In [None]:
feature_1_name = 'Flipper Length (mm)'
feature_2_name = 'Culmen Length (mm)'


# Make the same scatter plot of the training data
fig, ax = plt.subplots(figsize=(10,6))

xx, yy = np.meshgrid(np.arange(penguin_df[feature_1_name].min() - 1, penguin_df[feature_1_name].max() + 1, 10),
                     np.arange(penguin_df[feature_2_name].min() - 1, penguin_df[feature_2_name].max() + 1, 10))

means = x.mean()
inputs = [[means[0], y, means[2], x, means[4], means[5], means[6], means[7], means[8]] for (x, y) in np.c_[xx.ravel(), yy.ravel()]]
z = model_1.predict(scaler.transform(inputs))
z = z.reshape(xx.shape)
z = [[island_map[island] for island in islands] for islands in z]

ax.pcolormesh(xx, yy, z, alpha=0.1)

for label, data in penguin_df.groupby('Body Mass (g)'):
  ax.scatter(data[feature_1_name], data[feature_2_name], label=label)

ax.set_title("Decision Boundary of the Classifier")
ax.set_xlabel(feature_1_name)
ax.set_ylabel(feature_2_name)
ax.legend()
plt.show()

#### **Step #8: Use the model**
---

Specifically,

1. Predict the body mass of two new penguins.

2. Visualize the modeled relationship between `Body Mass (g)` and `Flipper Length (mm)` to see if a qualitative relationship can be inferred.

3. *If you used linear regression*, look at the coefficients and intercept to determine the modeled relationships quantitatively.

##### **1. Predict the body mass of these new penguins:**

**Penguin 1**

* `Clutch Completion`: Yes
* `Culmen Length (mm)`: 33
* `Culmen Depth (mm)`: 15
* `Flipper Length (mm)`: 250
* `Sex`: `'MALE'`
* `Delta 15 N (o/oo)`: 8.7
* `Delta 13 C (o/oo)`: -25.6
* `Species`: `'Gentoo penguin (Pygoscelis papua)'`
* `Island`: `'Dream'`

<br>

**Penguin 2**

* `Clutch Completion`: No
* `Culmen Length (mm)`: 47
* `Culmen Depth (mm)`: 18
* `Flipper Length (mm)`: 175
* `Sex`: `'FEMALE'`
* `Delta 15 N (o/oo)`: 8.7
* `Delta 13 C (o/oo)`: -25.6
* `Species`: `'Chinstrap penguin (Pygoscelis antarctica)'`
* `Island`: `'Torgersen'`

<br>

**NOTE**: You will need to use your `StandardScaler` to transform these new points *and* you will need to determine how these species are represented in this given encoding.
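
A rough sketch of predicting for a new data point: build a one-row DataFrame with the same columns, in the same order, as the training features, then pass it through the already-fitted scaler before calling `predict`. The two-feature training set and the new penguin's values below are made up.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Tiny made-up training set with one measurement and one encoded category.
train_X = pd.DataFrame({
    "Flipper Length (mm)": [180.0, 190.0, 200.0, 210.0, 220.0],
    "Sex": [0, 1, 0, 1, 1],
})
train_y = pd.Series([3400.0, 3900.0, 4100.0, 4800.0, 5300.0])

scaler = StandardScaler()
model = LinearRegression().fit(scaler.fit_transform(train_X), train_y)

# The new point reuses the training columns and the fitted scaler (transform, not fit).
new_penguin = pd.DataFrame([[205.0, 1]], columns=train_X.columns)
prediction = model.predict(scaler.transform(new_penguin))

print(prediction)
```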

##### **2. Visualize the modeled relationship between `Body Mass (g)` and `Flipper Length (mm)` in the test dataset to see if a qualitative relationship can be inferred.**


**NOTE**: Here you can use the non-standardized, and more interpretable, `X_test` since the predictions have already been made.

##### **3. *If you used linear regression*, complete the cells below to look at the coefficients and intercept to determine the modeled relationships quantitatively.**

In [None]:
coefficients = model_1.# COMPLETE THIS LINE
intercept = model_1.# COMPLETE THIS LINE

coefficients = pd.DataFrame([coefficients], columns = X_test.columns)
intercept = pd.DataFrame([intercept], columns = ["Body Mass (g)"])

In [None]:
print("Coefficients:")
coefficients.head()

In [None]:
print("\nIntercept:")
intercept.head()

<a name="p22"></a>

---
### **Part 2.2: Using All Numerical Features and Dummy Variable Encodings**
---

#### **Step #2: Decide independent and dependent variables**
---

Complete the code below to decide the independent and dependent variables. Make sure to only use dummy variable encodings where relevant instead of other representations of the same variable(s).

<br>

**NOTE**: The dependent variable (label) for all of Part 2 is `Body Mass (g)`. Using one of several pandas functions, you can determine the numerical features available and use them all as the independent variables.

In [None]:
penguin_df.# COMPLETE THIS LINE

In [None]:
x = # COMPLETE THIS LINE
y = # COMPLETE THIS LINE

#### **Step #3: Split data into training and testing data and standardize appropriately**
---

Complete the code below to split the data, using 80% for training and 20% for testing.

#### **Step #4: Import the algorithm**
---

#### **Step #5:  Initialize the model and set hyperparameters**
---

Specifically,
* For Linear Regression, there are no hyperparameters to set.
* For KNN, choose a reasonable value for `n_neighbors`. You are encouraged to repeat Steps #5 - 7 for several values and pick the model with the highest performance.
* For Logistic Regression, decide whether you need to specify `multi_class = 'ovr'` or not.

<br>

**NOTE**: For classification tasks, both KNN and Logistic Regression apply, so you should always try modeling the data with *both* and see which one works best after hyperparameter tuning.

In [None]:
model_2 = # COMPLETE THIS LINE

#### **Steps #6 - 7: Fit your model, evaluating using 10-Folds Cross Validation. Create a visualization if applicable**
---

Specifically,

1. Fit the model to the training data, determining an average relevant evaluation metric using 10-Folds Cross Validation.
2. Train the final model and visualize the results.

##### **1. Fit the model to the training data, determining an average relevant evaluation metric using 10-Folds Cross Validation.**


**NOTE**: The y-data in this section is still in the form of a pandas DataFrame, so to access a specific index you need to use `.iloc[...]`.

##### **2. Visualize the results.**

Code is provided for both a regression and a classification visualization, but it is up to you to decide which one makes the most sense here.

###### **Regression Visualization**

In [None]:
# Visualize comparison of predictions vs. actual values
plt.scatter(y_test, pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color = 'black', label='Correct Predictions')


plt.xlabel('True Value')
plt.ylabel('Predicted Value')
plt.title('True vs. Predicted Value')
plt.legend()

plt.show()

###### **Classification Visualization**

In [None]:
feature_1_name = 'Flipper Length (mm)'
feature_2_name = 'Culmen Length (mm)'


# Make the same scatter plot of the training data
fig, ax = plt.subplots(figsize=(10,6))

xx, yy = np.meshgrid(np.arange(penguin_df[feature_1_name].min() - 1, penguin_df[feature_1_name].max() + 1, 10),
                     np.arange(penguin_df[feature_2_name].min() - 1, penguin_df[feature_2_name].max() + 1, 10))

means = x.mean()
inputs = [[means[0], y, means[2], x, means[4], means[5], means[6], means[7], means[8]] for (x, y) in np.c_[xx.ravel(), yy.ravel()]]
z = model_2.predict(scaler.transform(inputs))
z = z.reshape(xx.shape)
z = [[island_map[island] for island in islands] for islands in z]

ax.pcolormesh(xx, yy, z, alpha=0.1)

for label, data in penguin_df.groupby('Body Mass (g)'):
  ax.scatter(data[feature_1_name], data[feature_2_name], label=label)

ax.set_title("Decision Boundary of the Classifier")
ax.set_xlabel(feature_1_name)
ax.set_ylabel(feature_2_name)
ax.legend()
plt.show()

#### **Step #8: Use the model**
---

Specifically,

1. Predict the body mass of two new penguins.

2. Visualize the modeled relationship between `Body Mass (g)` and `Flipper Length (mm)` to see if a qualitative relationship can be inferred.

3. *If you used linear regression*, look at the coefficients and intercept to determine the modeled relationships quantitatively.

##### **1. Predict the body mass of these new penguins:**

**Penguin 1**

* `Clutch Completion`: Yes
* `Culmen Length (mm)`: 33
* `Culmen Depth (mm)`: 15
* `Flipper Length (mm)`: 250
* `Sex`: `'MALE'`
* `Delta 15 N (o/oo)`: 8.7
* `Delta 13 C (o/oo)`: -25.6
* `Species`: `'Gentoo penguin (Pygoscelis papua)'`
* `Island`: `'Dream'`

<br>

**Penguin 2**

* `Clutch Completion`: No
* `Culmen Length (mm)`: 47
* `Culmen Depth (mm)`: 18
* `Flipper Length (mm)`: 175
* `Sex`: `'FEMALE'`
* `Delta 15 N (o/oo)`: 8.7
* `Delta 13 C (o/oo)`: -25.6
* `Species`: `'Chinstrap penguin (Pygoscelis antarctica)'`
* `Island`: `'Torgersen'`

<br>

**NOTE**: You will need to use your `StandardScaler` to transform these new points *and* you will need to determine how these species are represented in this given encoding.

##### **2. Visualize the modeled relationship between `Body Mass (g)` and `Flipper Length (mm)` in the test dataset to see if a qualitative relationship can be inferred.**


**NOTE**: Here you can use the non-standardized, and more interpretable, `X_test` since the predictions have already been made.

##### **3. *If you used linear regression*, complete the cells below to look at the coefficients and intercept to determine the modeled relationships quantitatively.**

<a name="p23"></a>

---
### **Part 2.3: Using the 4 Best Numerical Features and the Best Encodings Where Relevant**
---

#### **Step #2: Decide independent and dependent variables**
---

Complete the code below to decide the independent and dependent variables, specifically choosing the 4 best features according to `SelectKBest(...)`. Make sure to only use the best performing encodings where relevant instead of other representations of the same variable(s).

<br>

**NOTE**: The dependent variable (label) for all of Part 2 is `Body Mass (g)`. Using one of several pandas functions, you can determine the numerical features available and use them all as the independent variables.
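
Here is a minimal sketch of feature selection with `SelectKBest(...)` on synthetic data. The scoring function `f_regression` is an assumption that suits a numerical label; the placeholder feature names and the data are made up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Six synthetic features, of which four actually drive the label.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(
    rng.normal(size=(200, 6)),
    columns=[f"feature_{i}" for i in range(6)],
)
toy_y = (
    3.0 * toy_X["feature_0"]
    - 2.0 * toy_X["feature_2"]
    + 2.0 * toy_X["feature_3"]
    + 2.5 * toy_X["feature_5"]
    + rng.normal(0, 0.1, size=200)
)

# Score every feature against the label and keep the 4 highest-scoring ones.
selector = SelectKBest(score_func=f_regression, k=4)
selector.fit(toy_X, toy_y)

# get_support() returns a boolean mask over the original columns.
best_features = toy_X.columns[selector.get_support()]
print(list(best_features))
```

Indexing the original columns with the mask keeps the feature names, which makes it easy to select the same columns again when splitting the data.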

In [None]:
penguin_df.# COMPLETE THIS LINE

In [None]:
x = # COMPLETE THIS LINE
y = # COMPLETE THIS LINE

# COMPLETE THIS CODE

best_features = # COMPLETE THIS CODE

#### **Step #3: Split data into training and testing data and standardize appropriately**
---

Complete the code below to split the data, using 80% for training and 20% for testing. **Make sure to use only the best 4 features found above.**

#### **Step #4: Import the algorithm**
---

#### **Step #5:  Initialize the model and set hyperparameters**
---

Specifically,
* For Linear Regression, there are no hyperparameters to set.
* For KNN, choose a reasonable value for `n_neighbors`. You are encouraged to repeat Steps #5 - 7 for several values and pick the model with the highest performance.
* For Logistic Regression, decide whether you need to specify `multi_class = 'ovr'` or not.

<br>

**NOTE**: For classification tasks, both KNN and Logistic Regression apply, so you should always try modeling the data with *both* and see which one works best after hyperparameter tuning.

In [None]:
model_3 = # COMPLETE THIS LINE

#### **Steps #6 - 7: Fit your model, evaluating using 10-Folds Cross Validation. Create a visualization if applicable**
---

Specifically,

1. Fit the model to the training data, determining an average relevant evaluation metric using 10-Folds Cross Validation.
2. Train the final model and visualize the results.

##### **1. Fit the model to the training data, determining an average relevant evaluation metric using 10-Folds Cross Validation.**


**NOTE**: The y-data in this section is still in the form of a pandas DataFrame, so to access a specific index you need to use `.iloc[...]`.

##### **2. Visualize the results.**

Code is provided for both a regression and a classification visualization, but it is up to you to decide which one makes the most sense here.

###### **Regression Visualization**

In [None]:
# Visualize comparison of predictions vs. actual values
plt.scatter(y_test, pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color = 'black', label='Correct Predictions')


plt.xlabel('True Value')
plt.ylabel('Predicted Value')
plt.title('True vs. Predicted Value')
plt.legend()

plt.show()

###### **Classification Visualization**

In [None]:
feature_1_name = 'Flipper Length (mm)'
feature_2_name = 'Culmen Length (mm)'


# Make the same scatter plot of the training data
fig, ax = plt.subplots(figsize=(10,6))

xx, yy = np.meshgrid(np.arange(penguin_df[feature_1_name].min() - 1, penguin_df[feature_1_name].max() + 1, 10),
                     np.arange(penguin_df[feature_2_name].min() - 1, penguin_df[feature_2_name].max() + 1, 10))

means = x.mean()
inputs = [[means[0], y, means[2], x, means[4], means[5], means[6], means[7], means[8]] for (x, y) in np.c_[xx.ravel(), yy.ravel()]]
z = model_3.predict(scaler.transform(inputs))
z = z.reshape(xx.shape)
z = [[island_map[island] for island in islands] for islands in z]

ax.pcolormesh(xx, yy, z, alpha=0.1)

for label, data in penguin_df.groupby('Body Mass (g)'):
  ax.scatter(data[feature_1_name], data[feature_2_name], label=label)

ax.set_title("Decision Boundary of the Classifier")
ax.set_xlabel(feature_1_name)
ax.set_ylabel(feature_2_name)
ax.legend()
plt.show()

#### **Step #8: Use the model**
---

Specifically,

1. Predict the body mass of two new penguins.

2. Visualize the modeled relationship between `Body Mass (g)` and `Flipper Length (mm)` to see if a qualitative relationship can be inferred.

3. *If you used linear regression*, look at the coefficients and intercept to determine the modeled relationships quantitatively.

##### **1. Predict the body mass of these new penguins:**

**Penguin 1**

* `Clutch Completion`: Yes
* `Culmen Length (mm)`: 33
* `Culmen Depth (mm)`: 15
* `Flipper Length (mm)`: 250
* `Sex`: `'MALE'`
* `Delta 15 N (o/oo)`: 8.7
* `Delta 13 C (o/oo)`: -25.6
* `Species`: `'Gentoo penguin (Pygoscelis papua)'`

<br>

**Penguin 2**

* `Clutch Completion`: No
* `Culmen Length (mm)`: 47
* `Culmen Depth (mm)`: 18
* `Flipper Length (mm)`: 175
* `Sex`: `'FEMALE'`
* `Delta 15 N (o/oo)`: 8.7
* `Delta 13 C (o/oo)`: -25.6
* `Species`: `'Chinstrap penguin (Pygoscelis antarctica)'`

<br>

**NOTE**: You will need to use your `StandardScaler` to transform these new points *and* you will need to determine how these species are represented in this given encoding.

##### **2. Visualize the modeled relationship between `Body Mass (g)` and `Flipper Length (mm)` in the test dataset to see if a qualitative relationship can be inferred.**


**NOTE**: Here you can use the non-standardized, and more interpretable, `X_test` since the predictions have already been made.

##### **3. *If you used linear regression*, complete the cells below to look at the coefficients and intercept to determine the modeled relationships quantitatively.**

<a name="p2w"></a>

---
### **Part 2: Wrapup**
---

Now that you have trained several models to accomplish this task, answer the following questions:

1. Is Linear Regression, KNN, or Logistic Regression better suited to this task? Why?
2. Were there any hyperparameters you needed to tune and, if so, what were the best values you found?
3. Did selecting a smaller number of features improve or reduce the performance of your model?
4. What 4 variables seem to play the largest role in determining the `Body Mass (g)` based on your work in this part?

<a name="p3"></a>

---
## **Part 3: Predicting Island of Origin**
---

Now you will create, train, and evaluate an ML model to predict the island of origin of a penguin based on the provided numerical features. It is up to you to determine whether Linear Regression, KNN, or Logistic Regression is appropriate here.

<br>

You will do this three times as follows:

**Part 3.1**: Using All Numerical Features and Label Encodings

**Part 3.2**: Using All Numerical Features and Dummy Variable Encodings

**Part 3.3**: Using the 4 Best Numerical Features and the Best Encodings Where Relevant

<br>

You will follow the 8-step process for implementing ML models that you have learned over the last week:

1. Load in the data (**this has already been done in Part 1**)
2. Decide independent and dependent variables
3. Split the data into training and testing datasets
4. Import an ML algorithm
5. Set the model’s parameters
6. Fit the model on the training set and test the model on the test dataset. Draw a visualization (if applicable to the model)
7. Evaluate the model’s performance
8. Apply your model

<a name="p31"></a>

---
### **Part 3.1: Using All Numerical Features and Label Encodings**
---

#### **Step #2: Decide independent and dependent variables**
---

Complete the code below to decide the independent and dependent variables. Make sure to only use label encodings where relevant instead of other representations of the same variable(s). Furthermore, make sure not to include any encodings of the dependent variable (label) as features since this gives the model the answer!

<br>

**NOTE**: The dependent variable (label) for all of Part 3 is `Island`. Using one of several pandas functions, you can determine the numerical features available and use them all as the independent variables.

In [None]:
penguin_df.# COMPLETE THIS LINE

In [None]:
x = # COMPLETE THIS LINE
y = # COMPLETE THIS LINE

#### **Step #3: Split data into training and testing data and standardize appropriately**
---

Complete the code below to split the data, using 80% for training and 20% for testing.

#### **Step #4: Import the algorithm**
---

#### **Step #5:  Initialize the model and set hyperparameters**
---

Specifically,
* For Linear Regression, there are no hyperparameters to set.
* For KNN, choose a reasonable value for `n_neighbors`. You are encouraged to repeat Steps #5 - 7 for several values and pick the model with the highest performance.
* For Logistic Regression, decide whether you need to specify `multi_class = 'ovr'` or not.

<br>

**NOTE**: For classification tasks, both KNN and Logistic Regression apply, so you should always try modeling the data with *both* and see which one works best after hyperparameter tuning.

In [None]:
model_1 = # COMPLETE THIS LINE

#### **Steps #6 - 7: Fit your model, evaluating using 10-Folds Cross Validation. Create a visualization if applicable**
---

Specifically,

1. Fit the model to the training data, determining an average relevant evaluation metric using 10-Folds Cross Validation.
2. Train the final model and visualize the results.

##### **1. Fit the model to the training data, determining an average relevant evaluation metric using 10-Folds Cross Validation.**


**NOTE**: The y-data in this section is still in the form of a pandas DataFrame, so to access a specific index you need to use `.iloc[...]`.

##### **2. Visualize the results.**

Code is provided for both a regression and a classification visualization, but it is up to you to decide which one makes the most sense here.

###### **Regression Visualization**

In [None]:
# Visualize comparison of predictions vs. actual values
plt.scatter(y_test, pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color = 'black', label='Correct Predictions')


plt.xlabel('True Value')
plt.ylabel('Predicted Value')
plt.title('True vs. Predicted Value')
plt.legend()

plt.show()

###### **Classification Visualization**

In [None]:
feature_1_name = 'Flipper Length (mm)'
feature_2_name = 'Culmen Length (mm)'


# Make the same scatter plot of the training data
fig, ax = plt.subplots(figsize=(10,6))

xx, yy = np.meshgrid(np.arange(penguin_df[feature_1_name].min() - 1, penguin_df[feature_1_name].max() + 1, 0.1),
                     np.arange(penguin_df[feature_2_name].min() - 1, penguin_df[feature_2_name].max() + 1, 0.1))

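means = x.mean()
# Hold every other feature at its mean; vary only the two plotted features (f1 on the x-axis, f2 on the y-axis)
inputs = [[means.iloc[0], f2, means.iloc[2], f1, means.iloc[4], means.iloc[5], means.iloc[6], means.iloc[7], means.iloc[8]] for (f1, f2) in np.c_[xx.ravel(), yy.ravel()]]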
z = model_1.predict(scaler.transform(inputs))
z = z.reshape(xx.shape)
z = [[island_map[island] for island in islands] for islands in z]

ax.pcolormesh(xx, yy, z, alpha=0.1)

for label, data in penguin_df.groupby('Island'):
  ax.scatter(data[feature_1_name], data[feature_2_name], label=label)

ax.set_title("Decision Boundary of the Classifier")
ax.set_xlabel(feature_1_name)
ax.set_ylabel(feature_2_name)
ax.legend()
plt.show()

#### **Step #8: Use the model**
---

Specifically,

1. Predict the island of origin of two new penguins.

2. Visualize the modeled relationship between `Island` and `Flipper Length (mm)` to see if a qualitative relationship can be inferred.

3. *If you used linear regression*, look at the coefficients and intercept to determine the modeled relationships quantitatively.

##### **1. Predict the island of origin of these new penguins:**

**Penguin 1**

* `Clutch Completion`: Yes
* `Culmen Length (mm)`: 33
* `Culmen Depth (mm)`: 15
* `Flipper Length (mm)`: 250
* `Body Mass (g)`: 4715
* `Sex`: `'MALE'`
* `Delta 15 N (o/oo)`: 8.7
* `Delta 13 C (o/oo)`: -25.6
* `Species`: `'Gentoo penguin (Pygoscelis papua)'`

<br>

**Penguin 2**

* `Clutch Completion`: No
* `Culmen Length (mm)`: 47
* `Culmen Depth (mm)`: 18
* `Flipper Length (mm)`: 175
* `Body Mass (g)`: 3600
* `Sex`: `'FEMALE'`
* `Delta 15 N (o/oo)`: 8.7
* `Delta 13 C (o/oo)`: -25.6
* `Species`: `'Chinstrap penguin (Pygoscelis antarctica)'`

<br>

**NOTE**: You will need to use your `StandardScaler` to transform these new points *and* you will need to determine how these species are represented in this given encoding.
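The sketch below shows one possible way to encode, scale, and predict new points. Everything about the encodings here is an assumption: the `clutch_map`, `sex_map`, and `species_map` dictionaries, the Adelie species string, the column order, and the made-up training data are placeholders, so check them against the mappings and columns in your own notebook.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

columns = ["Clutch Completion", "Culmen Length (mm)", "Culmen Depth (mm)",
           "Flipper Length (mm)", "Body Mass (g)", "Sex",
           "Delta 15 N (o/oo)", "Delta 13 C (o/oo)", "Species"]

# ASSUMED label encodings; verify against the encodings used in your notebook.
clutch_map = {"No": 0, "Yes": 1}
sex_map = {"FEMALE": 0, "MALE": 1}
species_map = {
    "Adelie Penguin (Pygoscelis adeliae)": 0,
    "Chinstrap penguin (Pygoscelis antarctica)": 1,
    "Gentoo penguin (Pygoscelis papua)": 2,
}

# Made-up training data so the sketch runs on its own.
rng = np.random.default_rng(0)
n = 60
X_toy = pd.DataFrame({
    "Clutch Completion": rng.integers(0, 2, n),
    "Culmen Length (mm)": rng.normal(44, 5, n),
    "Culmen Depth (mm)": rng.normal(17, 2, n),
    "Flipper Length (mm)": rng.normal(200, 14, n),
    "Body Mass (g)": rng.normal(4200, 800, n),
    "Sex": rng.integers(0, 2, n),
    "Delta 15 N (o/oo)": rng.normal(8.7, 0.5, n),
    "Delta 13 C (o/oo)": rng.normal(-25.7, 0.8, n),
    "Species": rng.integers(0, 3, n),
})[columns]
y_toy = rng.choice(["Biscoe", "Dream", "Torgersen"], size=n)

toy_scaler = StandardScaler().fit(X_toy)
toy_model = KNeighborsClassifier(n_neighbors=5).fit(toy_scaler.transform(X_toy), y_toy)

# Encode the two new penguins in the same column order used for training.
new_penguins = pd.DataFrame([
    [clutch_map["Yes"], 33, 15, 250, 4715, sex_map["MALE"], 8.7, -25.6,
     species_map["Gentoo penguin (Pygoscelis papua)"]],
    [clutch_map["No"], 47, 18, 175, 3600, sex_map["FEMALE"], 8.7, -25.6,
     species_map["Chinstrap penguin (Pygoscelis antarctica)"]],
], columns=columns)

# Scale with the already-fitted scaler (transform, not fit_transform), then predict.
predictions = toy_model.predict(toy_scaler.transform(new_penguins))
print(predictions)
```

Because the training data here is random, the predicted islands carry no meaning; the point is only the order of operations: encode, scale with the fitted scaler, then predict.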

##### **2. Visualize the modeled relationship between `Island` and `Flipper Length (mm)` in the test dataset to see if a qualitative relationship can be inferred.**


**NOTE**: Here you can use the non-standardized, and more interpretable, `X_test` since the predictions have already been made.
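One way to sketch such a plot is to place the feature on the x-axis and the predicted island on the y-axis. The names `X_test_toy` and `pred_toy` and their random contents are made up for illustration; in the notebook you would use your own `X_test` and predictions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up test features and predictions so the sketch runs on its own.
rng = np.random.default_rng(0)
X_test_toy = pd.DataFrame({"Flipper Length (mm)": rng.normal(200, 14, size=30)})
pred_toy = rng.choice(["Biscoe", "Dream", "Torgersen"], size=30)

fig, ax = plt.subplots(figsize=(10, 6))

# One scatter series per predicted island.
for island in np.unique(pred_toy):
    mask = pred_toy == island
    ax.scatter(X_test_toy.loc[mask, "Flipper Length (mm)"], pred_toy[mask], label=island)

ax.set_title("Predicted Island vs. Flipper Length")
ax.set_xlabel("Flipper Length (mm)")
ax.set_ylabel("Predicted Island")
ax.legend()
plt.show()
```

If the predicted islands occupy clearly different ranges along the x-axis, that suggests the model relies on this feature; heavy overlap suggests it does not.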

##### **3. *If you used linear regression*, complete the cells below to look at the coefficients and intercept to determine the modeled relationships quantitatively.**

In [None]:
coefficients = model_1.# COMPLETE THIS LINE
intercept = model_1.# COMPLETE THIS LINE

coefficients = pd.DataFrame([coefficients], columns = X_test.columns)
intercept = pd.DataFrame([intercept], columns = ["Island"])

In [None]:
print("Coefficients:")
coefficients.head()

In [None]:
print("\nIntercept:")
intercept.head()
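
For reference, a fitted scikit-learn `LinearRegression` exposes its learned parameters as the attributes `coef_` and `intercept_`. The sketch below uses made-up data generated from a known relationship (`feature_a` and `feature_b` are hypothetical names) so that the recovered values can be checked by eye.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data following y = 2*a - 3*b + 5 exactly.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"feature_a": rng.normal(size=40), "feature_b": rng.normal(size=40)})
y_toy = 2.0 * X_toy["feature_a"] - 3.0 * X_toy["feature_b"] + 5.0

toy_model = LinearRegression().fit(X_toy, y_toy)

# One coefficient per feature, in the same order as the training columns.
coef_table = pd.DataFrame([toy_model.coef_], columns=X_toy.columns)
print(coef_table.round(3))
print(round(float(toy_model.intercept_), 3))
```

Note that when the target is passed as a single-column DataFrame instead of a Series, `coef_` is two-dimensional with shape `(1, n_features)`, which affects how it should be wrapped in a DataFrame.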

<a name="p32"></a>

---
### **Part 3.2: Using All Numerical Features and Dummy Variable Encodings**
---

#### **Step #2: Decide independent and dependent variables**
---

Complete the code below to decide the independent and dependent variables. Make sure to only use dummy variable encodings where relevant instead of other representations of the same variable(s).

<br>

**NOTE**: The dependent variable (label) for all of Part 3 is `Island`. Using one of several pandas functions, you can determine the numerical features available and use them all as the independent variables.
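If the dummy variable columns still need to be created or inspected, `pd.get_dummies` is one common way to produce them. The sketch below uses a small made-up DataFrame; the resulting column names follow pandas' default `<column>_<category>` pattern, which may differ from the names already present in your `penguin_df`.

```python
import pandas as pd

# Small made-up DataFrame with two categorical columns.
toy_df = pd.DataFrame({
    "Sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
    "Species": [
        "Gentoo penguin (Pygoscelis papua)",
        "Chinstrap penguin (Pygoscelis antarctica)",
        "Gentoo penguin (Pygoscelis papua)",
        "Chinstrap penguin (Pygoscelis antarctica)",
    ],
    "Body Mass (g)": [4715, 3600, 5000, 3700],
})

# One 0/1 column per category; the original categorical columns are replaced.
dummies = pd.get_dummies(toy_df, columns=["Sex", "Species"], dtype=int)
print(dummies.columns.tolist())
```

Each category becomes its own 0/1 column, so a feature with three categories contributes three columns instead of one label-encoded column.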

In [None]:
penguin_df.# COMPLETE THIS LINE

In [None]:
x = # COMPLETE THIS LINE
y = # COMPLETE THIS LINE

#### **Step #3: Split data into training and testing data and standardize appropriately**
---

Complete the code below to split the data, using 80% for training and 20% for testing.

#### **Step #4: Import the algorithm**
---

#### **Step #5:  Initialize the model and set hyperparameters**
---

Specifically,
* For Linear Regression, there are no hyperparameters to set.
* For KNN, choose a reasonable value for `n_neighbors`. You are encouraged to try Steps #5 - 7 for several values and pick the model with the highest performance.
* For Logistic Regression, decide whether you need to specify `multi_class = 'ovr'` or not.

<br>

**NOTE**: Since both KNN and Logistic Regression are used for classification, you should always try modeling the data with *both* and see which one works best after hyperparameter tuning.
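A comparison like this can be sketched with `cross_val_score`, which runs the cross validation loop for you. The toy clusters, the candidate values for `n_neighbors`, and the variable names are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Three made-up clusters standing in for standardized features and island labels.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])
y_toy = np.repeat([0, 1, 2], 40)
X_toy = centers[y_toy] + rng.normal(size=(120, 2))

# Mean 10-fold accuracy for several candidate values of n_neighbors.
results = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    results[k] = cross_val_score(knn, X_toy, y_toy, cv=10).mean()

best_k = max(results, key=results.get)

# Mean 10-fold accuracy for logistic regression with default settings.
logreg_score = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=10).mean()

print(f"Best n_neighbors: {best_k} (accuracy {results[best_k]:.3f})")
print(f"Logistic Regression accuracy: {logreg_score:.3f}")
```

Comparing mean cross validation scores, instead of a single train/test split, makes the choice between models less sensitive to how the data happened to be divided.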

In [None]:
model_2 = # COMPLETE THIS LINE

#### **Steps #6 - 7: Fit your model, evaluating using 10-Folds Cross Validation. Create a visualization if applicable**
---

Specifically,

1. Fit the model to the training data, determining an average relevant evaluation metric using 10-Folds Cross Validation.
2. Train the final model and visualize the results.

##### **1. Fit the model to the training data, determining an average relevant evaluation metric using 10-Folds Cross Validation.**


**NOTE**: The y-data in this section is still in the form of a pandas DataFrame, so to access a specific index you need to use `.iloc[...]`.

##### **2. Visualize the results.**

Code is provided for both a regression visualization and a classification visualization, but it is up to you to decide which one makes the most sense for the model you chose.

###### **Regression Visualization**

In [None]:
# Visualize comparison of predictions vs. actual values
plt.scatter(y_test, pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color = 'black', label='Correct Predictions')


plt.xlabel('True Value')
plt.ylabel('Predicted Value')
plt.title('True vs. Predicted Value')
plt.legend()

plt.show()

###### **Classification Visualization**

In [None]:
feature_1_name = 'Flipper Length (mm)'
feature_2_name = 'Culmen Length (mm)'


# Make the same scatter plot of the training data
fig, ax = plt.subplots(figsize=(10,6))

xx, yy = np.meshgrid(np.arange(penguin_df[feature_1_name].min() - 1, penguin_df[feature_1_name].max() + 1, 0.1),
                     np.arange(penguin_df[feature_2_name].min() - 1, penguin_df[feature_2_name].max() + 1, 0.1))

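means = x.mean()
# Hold every other feature at its mean; vary only the two plotted features (f1 on the x-axis, f2 on the y-axis)
inputs = [[means.iloc[0], f2, means.iloc[2], f1, means.iloc[4], means.iloc[5], means.iloc[6], means.iloc[7], means.iloc[8], means.iloc[9]] for (f1, f2) in np.c_[xx.ravel(), yy.ravel()]]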
z = model_2.predict(scaler.transform(inputs))
z = z.reshape(xx.shape)
z = [[island_map[island] for island in islands] for islands in z]

ax.pcolormesh(xx, yy, z, alpha=0.1)

for label, data in penguin_df.groupby('Island'):
  ax.scatter(data[feature_1_name], data[feature_2_name], label=label)

ax.set_title("Decision Boundary of the Classifier")
ax.set_xlabel(feature_1_name)
ax.set_ylabel(feature_2_name)
ax.legend()
plt.show()

#### **Step #8: Use the model**
---

Specifically,

1. Predict the island of origin of two new penguins.

2. Visualize the modeled relationship between `Island` and `Flipper Length (mm)` to see if a qualitative relationship can be inferred.

3. *If you used linear regression*, look at the coefficients and intercept to determine the modeled relationships quantitatively.

##### **1. Predict the island of origin of these new penguins:**

**Penguin 1**

* `Clutch Completion`: Yes
* `Culmen Length (mm)`: 33
* `Culmen Depth (mm)`: 15
* `Flipper Length (mm)`: 250
* `Body Mass (g)`: 4715
* `Sex`: `'MALE'`
* `Delta 15 N (o/oo)`: 8.7
* `Delta 13 C (o/oo)`: -25.6
* `Species`: `'Gentoo penguin (Pygoscelis papua)'`

<br>

**Penguin 2**

* `Clutch Completion`: No
* `Culmen Length (mm)`: 47
* `Culmen Depth (mm)`: 18
* `Flipper Length (mm)`: 175
* `Body Mass (g)`: 3600
* `Sex`: `'FEMALE'`
* `Delta 15 N (o/oo)`: 8.7
* `Delta 13 C (o/oo)`: -25.6
* `Species`: `'Chinstrap penguin (Pygoscelis antarctica)'`

<br>

**NOTE**: You will need to use your `StandardScaler` to transform these new points *and* you will need to determine how these species are represented in this given encoding.

##### **2. Visualize the modeled relationship between `Island` and `Flipper Length (mm)` in the test dataset to see if a qualitative relationship can be inferred.**


**NOTE**: Here you can use the non-standardized, and more interpretable, `X_test` since the predictions have already been made.

##### **3. *If you used linear regression*, complete the cells below to look at the coefficients and intercept to determine the modeled relationships quantitatively.**

In [None]:
coefficients = model_2.# COMPLETE THIS LINE
intercept = model_2.# COMPLETE THIS LINE

coefficients = pd.DataFrame([coefficients], columns = X_test.columns)
intercept = pd.DataFrame([intercept], columns = ["Island"])

In [None]:
print("Coefficients:")
coefficients.head()

In [None]:
print("\nIntercept:")
intercept.head()

<a name="p33"></a>

---
### **Part 3.3: Using the 4 Best Numerical Features and the Best Encodings Where Relevant**
---

#### **Step #2: Decide independent and dependent variables**
---

Complete the code below to decide the independent and dependent variables, specifically choosing the 4 best features according to `SelectKBest(...)`. Make sure to only use the best performing encodings where relevant instead of other representations of the same variable(s).

<br>

**NOTE**: The dependent variable (label) for all of Part 3 is `Island`. Using one of several pandas functions, you can determine the numerical features available and pass them all to `SelectKBest(...)` as candidate independent variables.
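A minimal sketch of feature selection with `SelectKBest` follows. The scoring function `f_classif` is an assumption (the notebook does not say which one to use), and the toy columns named `informative_*` and `noise_*` are made up so that the expected selection is obvious.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Made-up data: four columns that differ by island and three that are pure noise.
rng = np.random.default_rng(0)
labels = np.repeat(["Biscoe", "Dream", "Torgersen"], 30)
shift = np.repeat([0.0, 3.0, 6.0], 30)

x_toy = pd.DataFrame({f"informative_{i}": shift + rng.normal(size=90) for i in range(4)})
for i in range(3):
    x_toy[f"noise_{i}"] = rng.normal(size=90)

# Score every feature against the label and keep the 4 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=4)
selector.fit(x_toy, labels)

# get_support() returns a boolean mask over the input columns.
best_columns = x_toy.columns[selector.get_support()].tolist()
best_features_toy = x_toy[best_columns]
print(best_columns)
```

Indexing the original DataFrame with the selected column names keeps the result as a DataFrame with readable headers, which is convenient for the later split and visualization steps.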

In [None]:
penguin_df.# COMPLETE THIS LINE

In [None]:
x = # COMPLETE THIS LINE
y = # COMPLETE THIS LINE

# COMPLETE THIS CODE

best_features = # COMPLETE THIS CODE

#### **Step #3: Split data into training and testing data and standardize appropriately**
---

Complete the code below to split the data, using 80% for training and 20% for testing. **Make sure to use only the best 4 features found above.**

#### **Step #4: Import the algorithm**
---

#### **Step #5:  Initialize the model and set hyperparameters**
---

Specifically,
* For Linear Regression, there are no hyperparameters to set.
* For KNN, choose a reasonable value for `n_neighbors`. You are encouraged to try Steps #5 - 7 for several values and pick the model with the highest performance.
* For Logistic Regression, decide whether you need to specify `multi_class = 'ovr'` or not.

<br>

**NOTE**: Since both KNN and Logistic Regression are used for classification, you should always try modeling the data with *both* and see which one works best after hyperparameter tuning.

In [None]:
model_3 = # COMPLETE THIS LINE

#### **Steps #6 - 7: Fit your model, evaluating using 10-Folds Cross Validation. Create a visualization if applicable**
---

Specifically,

1. Fit the model to the training data, determining an average relevant evaluation metric using 10-Folds Cross Validation.
2. Train the final model and visualize the results.

##### **1. Fit the model to the training data, determining an average relevant evaluation metric using 10-Folds Cross Validation.**


**NOTE**: The y-data in this section is still in the form of a pandas DataFrame, so to access a specific index you need to use `.iloc[...]`.

##### **2. Visualize the results.**

Code is provided for both a regression visualization and a classification visualization, but it is up to you to decide which one makes the most sense for the model you chose.

###### **Regression Visualization**

In [None]:
# Visualize comparison of predictions vs. actual values
plt.scatter(y_test, pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color = 'black', label='Correct Predictions')


plt.xlabel('True Value')
plt.ylabel('Predicted Value')
plt.title('True vs. Predicted Value')
plt.legend()

plt.show()

###### **Classification Visualization**

In [None]:
feature_1_name = 'Body Mass (g)'
feature_2_name = 'Delta 15 N (o/oo)'

# Make the same scatter plot of the training data
fig, ax = plt.subplots(figsize=(10,6))

xx, yy = np.meshgrid(np.arange(penguin_df[feature_1_name].min() - 1, penguin_df[feature_1_name].max() + 1, 100),
                     np.arange(penguin_df[feature_2_name].min() - 1, penguin_df[feature_2_name].max() + 1, 0.1))

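means = best_features.mean()
# Vary the two plotted features (f1 on the x-axis, f2 on the y-axis); hold the remaining two at their means
inputs = [[f1, f2, means.iloc[2], means.iloc[3]] for (f1, f2) in np.c_[xx.ravel(), yy.ravel()]]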
z = model_3.predict(scaler.transform(inputs))
z = z.reshape(xx.shape)
z = [[island_map[island] for island in islands] for islands in z]

ax.pcolormesh(xx, yy, z, alpha=0.1)

for label, data in penguin_df.groupby('Island'):
  ax.scatter(data[feature_1_name], data[feature_2_name], label=label)

ax.set_title("Decision Boundary of the Classifier")
ax.set_xlabel(feature_1_name)
ax.set_ylabel(feature_2_name)
ax.legend()
plt.show()

#### **Step #8: Use the model**
---

Specifically,

1. Predict the island of origin of two new penguins.

2. Visualize the modeled relationship between `Island` and `Body Mass (g)` to see if a qualitative relationship can be inferred.

3. *If you used linear regression*, look at the coefficients and intercept to determine the modeled relationships quantitatively.

##### **1. Predict the island of origin of these new penguins:**

**Penguin 1**

* `Clutch Completion`: Yes
* `Culmen Length (mm)`: 33
* `Culmen Depth (mm)`: 15
* `Flipper Length (mm)`: 250
* `Body Mass (g)`: 4715
* `Sex`: `'MALE'`
* `Delta 15 N (o/oo)`: 8.7
* `Delta 13 C (o/oo)`: -25.6
* `Species`: `'Gentoo penguin (Pygoscelis papua)'`

<br>

**Penguin 2**

* `Clutch Completion`: No
* `Culmen Length (mm)`: 47
* `Culmen Depth (mm)`: 18
* `Flipper Length (mm)`: 175
* `Body Mass (g)`: 3600
* `Sex`: `'FEMALE'`
* `Delta 15 N (o/oo)`: 8.7
* `Delta 13 C (o/oo)`: -25.6
* `Species`: `'Chinstrap penguin (Pygoscelis antarctica)'`

<br>

**NOTE**: You will need to use your `StandardScaler` to transform these new points *and* you will need to determine how these species are represented in this given encoding.

In [None]:
X_train.columns

##### **2. Visualize the modeled relationship between `Island` and `Body Mass (g)` in the test dataset to see if a qualitative relationship can be inferred.**


**NOTE**: Here you can use the non-standardized, and more interpretable, `X_test` since the predictions have already been made.

##### **3. *If you used linear regression*, complete the cells below to look at the coefficients and intercept to determine the modeled relationships quantitatively.**

In [None]:
coefficients = model_3.# COMPLETE THIS LINE
intercept = model_3.# COMPLETE THIS LINE

coefficients = pd.DataFrame([coefficients], columns = X_test.columns)
intercept = pd.DataFrame([intercept], columns = ["Island"])

In [None]:
print("Coefficients:")
coefficients.head()

In [None]:
print("\nIntercept:")
intercept.head()

<a name="p3w"></a>

---
### **Part 3: Wrapup**
---

Now that you have trained several models to accomplish this task, answer the following questions:

1. Is Linear Regression, KNN, or Logistic Regression better suited to this task? Why?
2. Were there any hyperparameters you needed to tune and, if so, what were the best values you found?
3. Did selecting a smaller number of features improve or worsen the performance of your model?
4. What 4 variables seem to play the largest role in determining the `Island` based on your work in this part?

# End of Notebook

---
© 2023 The Coding School, All rights reserved