![ine-divider](https://user-images.githubusercontent.com/7065401/92672068-398e8080-f2ee-11ea-82d6-ad53f7feb5c0.png)
<hr>

### Classification Algorithms

# Classifying Penguins with Machine Learning

In this project, you will classify penguin species and practice the classification algorithms you saw during the course.

The data you will use is a penguins dataset containing information about penguins from 3 different species: Adelie, Chinstrap, and Gentoo. The data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/).

<img width="440" src="images/penguins.png"></img>

<h2 style="font-weight: bold;">
    Penguins Dataset
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


The dataset consists of 7 columns.

- species: penguin species (Chinstrap, Adélie, or Gentoo)
- culmen_length_mm: culmen length (mm)
- culmen_depth_mm: culmen depth (mm)
- flipper_length_mm: flipper length (mm)
- body_mass_g: body mass (g)
- island: island name (Dream, Torgersen, or Biscoe) in the Palmer Archipelago (Antarctica)
- sex: penguin sex

Penguins' wings are called flippers. They are flat, thin, and broad, with a long, tapered shape and a blunt, rounded tip.

<img width="400" src="images/penguin_cols.jpg"></img>

**Your task:** Read the `penguins.csv` dataset, which is located in the `data/` folder.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# your code goes here
penguins = None
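
A minimal sketch of the loading step, assuming pandas is installed. To stay runnable without the project files, it reads a tiny made-up CSV from an in-memory string (the three rows are illustrative values, not taken from the real dataset). For the project itself, the same `pd.read_csv` call would take the path `data/penguins.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row sample with the same seven columns as the project
# dataset. The values are made up for illustration only.
sample_csv = io.StringIO(
    "species,island,culmen_length_mm,culmen_depth_mm,"
    "flipper_length_mm,body_mass_g,sex\n"
    "Adelie,Torgersen,39.0,18.5,181.0,3750.0,MALE\n"
    "Chinstrap,Dream,46.5,17.9,192.0,3500.0,FEMALE\n"
    "Gentoo,Biscoe,46.1,13.2,211.0,4500.0,FEMALE\n"
)

# For the project itself: penguins = pd.read_csv("data/penguins.csv")
penguins_sample = pd.read_csv(sample_csv)

# Quick sanity checks: dimensions, column types, missing values per column.
print(penguins_sample.shape)
print(penguins_sample.dtypes)
print(penguins_sample.isna().sum())
```

Checking the shape and the missing-value counts right after loading is a cheap way to catch a wrong path or an unexpected file layout before plotting.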

<div style="width: 100%; background-color: #222; text-align: center">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Data Visualization
</h1>

<br><br> 
</div>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

The first thing you'll need to do is analyze and understand the penguins' characteristics.

To do that, we will begin by creating some plots.


<h2 style="font-weight: bold;">
    Flipper Length Visualization
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

**Your task**: Create a plot showing the distribution of flipper length per species.

In [3]:
# your code goes here


<h2 style="font-weight: bold;">
    Body Mass Visualization
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

And what about body mass?

**Your task**: Create a plot showing the distribution of penguin body mass per species.


In [5]:
# your code goes here


<h2 style="font-weight: bold;">
    Bill Length Visualization
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

**Your task**: Create a scatterplot showing the relationship between bill length and bill depth (the `culmen_length_mm` and `culmen_depth_mm` columns).


In [7]:
# your code goes here
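
A minimal scatterplot sketch, again on a made-up frame named `toy` rather than the real data. Colouring the points by species with `hue` is an assumption of this sketch. The task only asks for the two bill measurements, but the colour makes it easier to judge how well they separate the classes.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up toy rows (not real measurements) using the dataset's column names.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "culmen_length_mm": [38.5, 40.1, 48.0, 50.2, 46.0, 49.5],
    "culmen_depth_mm": [18.2, 19.0, 18.0, 19.1, 14.0, 15.3],
})

fig, ax = plt.subplots(figsize=(6, 5))

# Each point is one penguin; colour encodes the species.
sns.scatterplot(
    data=toy,
    x="culmen_length_mm",
    y="culmen_depth_mm",
    hue="species",
    ax=ax,
)

ax.set_xlabel("Bill (culmen) length (mm)")
ax.set_ylabel("Bill (culmen) depth (mm)")
```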


<div style="width: 100%; background-color: #222; text-align: center">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Data Preparation
</h1>

<br><br> 
</div>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

<h2 style="font-weight: bold;">
    Convert Categorical Features to One-Hot Encoding
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

As we saw in previous lessons, before using our data we need to separate it into the **features** _(X)_ and the **label** _(y)_ we want to predict, and then encode the features so that all values are numeric.

**Your task:** Create the _X_ features matrix containing the penguins' characteristics, and the _y_ response vector containing the expected species.

In [9]:
# your code goes here
X = None
y = None

**Your task:** Now convert all the categorical features to numerical ones using one-hot encoding.

In [11]:
# your code goes here
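
The two steps above (separating features from the label, then one-hot encoding) can be sketched with `pd.get_dummies`. The frame below is a made-up four-row stand-in that keeps only one numeric column for brevity. Treating `island` and `sex` as the categorical columns is an assumption based on the column list given earlier.

```python
import pandas as pd

# Made-up toy rows (not real measurements) using the dataset's column names.
toy = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo", "Adelie"],
    "island": ["Torgersen", "Dream", "Biscoe", "Dream"],
    "culmen_length_mm": [39.0, 46.5, 46.1, 38.2],
    "sex": ["MALE", "FEMALE", "FEMALE", "FEMALE"],
})

# Features: every column except the target. Label: the species column.
X = toy.drop(columns="species")
y = toy["species"]

# Replace each categorical column with one indicator column per category.
# Numeric columns pass through unchanged.
X_encoded = pd.get_dummies(X, columns=["island", "sex"])

print(X_encoded.columns.tolist())
```

The label `y` is left as strings here, since scikit-learn classifiers accept string class labels directly.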


<h2 style="font-weight: bold;">
    Train Test Split
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

We need to do one more thing with our data: split it into train and test partitions.

**Your task:** Split data into train and test sets.

In [13]:
from sklearn.model_selection import train_test_split

# your code goes here
X_train, X_test, y_train, y_test = [None, None, None, None]
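
A minimal sketch of the split on toy arrays. The `test_size`, `random_state`, and `stratify` values are choices of this sketch, not values prescribed by the project. Stratifying keeps the species proportions similar in both partitions, which can matter when classes are unevenly represented.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins (made up): 12 samples, 2 numeric features, 3 balanced classes.
X = np.arange(24).reshape(12, 2)
y = np.array(["Adelie", "Chinstrap", "Gentoo"] * 4)

# Hold out 25% of the rows for testing; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```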

We will not create a validation set for this project because we will use the models' default hyperparameters rather than fine-tuning them.

<div style="width: 100%; background-color: #222; text-align: center">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Model Building
</h1>

<br><br> 
</div>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)



<h2 style="font-weight: bold;">
    Train Models
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Now it is time to start building our predictive models. First, let's import them:

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

**Your task:** Initialize the four imported models.

In [16]:
# your code goes here
logreg = None
knn = None
svm = None
decision_tree = None

**Your task:** Now that your models are created, let's train them on the training data!

In [18]:
# your code goes here
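
Initializing and training the four models can be sketched as a loop over a dictionary. The data here is synthetic, generated with `make_classification` as a stand-in for the encoded penguin features. The `models` dictionary and the `random_state` values are conveniences of this sketch. `max_iter=1000` is also an assumption, raised from the default only to reduce the chance of a convergence warning; drop it to stay strictly on default hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the encoded penguin features.
X, y = make_classification(
    n_samples=150, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# One entry per model, keyed by the variable names used in this project.
models = {
    "logreg": LogisticRegression(max_iter=1000),  # max_iter is an assumption
    "knn": KNeighborsClassifier(),
    "svm": SVC(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Every scikit-learn estimator exposes the same fit(X, y) interface.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "fitted")
```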


<div style="width: 100%; background-color: #222; text-align: center">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Model Evaluation
</h1>

<br><br> 
</div>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)



<h2 style="font-weight: bold;">
    Evaluate Models
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

The hard work is done, congratulations! Let's evaluate the models to see which one performs best.

**Your task:** For each model, compute all the evaluation metrics.

In [20]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

results = pd.DataFrame(columns=["logreg", "knn", "svm", "decision_tree"], 
                       index=["accuracy", "precision", "recall", "f1"])

# your code goes here
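
Because species is a multi-class label, `precision_score`, `recall_score`, and `f1_score` need an explicit `average` argument; their default, `average="binary"`, raises an error on targets with more than two classes. The sketch below fills a results table from made-up predictions for two hypothetical models (`model_a`, `model_b`). Using `average="macro"` is an assumption of this sketch; the project does not state which averaging to use.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up true labels and predictions for two hypothetical models.
y_test = ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"]
predictions = {
    "model_a": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "model_b": ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo", "Gentoo"],
}

results = pd.DataFrame(
    columns=list(predictions),
    index=["accuracy", "precision", "recall", "f1"],
    dtype=float,
)

for name, y_pred in predictions.items():
    results.loc["accuracy", name] = accuracy_score(y_test, y_pred)
    # Macro averaging: compute the metric per class, then take the plain mean.
    results.loc["precision", name] = precision_score(
        y_test, y_pred, average="macro", zero_division=0
    )
    results.loc["recall", name] = recall_score(
        y_test, y_pred, average="macro", zero_division=0
    )
    results.loc["f1", name] = f1_score(
        y_test, y_pred, average="macro", zero_division=0
    )

print(results)
```

With trained models, each `y_pred` would come from `model.predict(X_test)` instead of a hard-coded list.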





In [22]:
pd.options.display.float_format = "{:.3f}".format
results

| metric | logreg | knn | svm | decision_tree |
|---|---|---|---|---|
| accuracy | 0.982 | 0.718 | 0.732 | 0.964 |
| precision | 0.975 | 0.662 | 0.505 | 0.960 |
| recall | 0.982 | 0.674 | 0.614 | 0.953 |
| f1 | 0.978 | 0.666 | 0.551 | 0.956 |


We will use the **F1 score** to compare the models; the other metrics are included for reference. For this particular dataset, the logistic regression model achieved the highest F1 score.

<h2 style="font-weight: bold;">
    Wrapping Up
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

**Your task:** Which model performs best?


In [23]:
# your answer goes here
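
One way to answer this programmatically is to rank the columns of the results table by their F1 row. The sketch below rebuilds the table from the values shown earlier in this notebook; with a live `results` frame, only the last three lines are needed.

```python
import pandas as pd

# Values copied from the results table shown in this notebook.
results = pd.DataFrame(
    {
        "logreg": [0.982, 0.975, 0.982, 0.978],
        "knn": [0.718, 0.662, 0.674, 0.666],
        "svm": [0.732, 0.505, 0.614, 0.551],
        "decision_tree": [0.964, 0.960, 0.953, 0.956],
    },
    index=["accuracy", "precision", "recall", "f1"],
)

# idxmax returns the column label holding the largest value in the f1 row.
best_model = results.loc["f1"].idxmax()
print(best_model)  # logreg

# Full ranking, best first.
print(results.loc["f1"].sort_values(ascending=False))
```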


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)
