## Imports:

In [None]:
import numpy as np
import pandas as pd
import copy
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
import tensorflow as tf
from tensorflow import keras
from keras.layers import Dense, Input
from keras.models import Model
from keras.optimizers import Adam

# Project Description

This project uses the Palmer Archipelago penguin dataset to predict the species of penguins.
Several classification methods are compared, including:
* Logistic Regression,
* K-Nearest Neighbors,
* Naive Bayes,
* Linear Discriminant Analysis,
* Support Vector Machines, and
* Neural Networks

These were either taught or discussed throughout the course.
Throughout this notebook, pandas and scikit-learn are used instead of the models we implemented by hand in class. I made this choice because the scikit-learn implementations are more likely to be used in future applications and careers, and they are also my personal preference.

# Data Exploration
We begin by loading the data and looking at the variables available. Pandas provides useful methods for doing this.

In [None]:
#load data 
data_dir = '../input/palmer-archipelago-antarctica-penguin-data/penguins_size.csv'
df = pd.read_csv(data_dir)

In [None]:
#view column names, data types, and missing values
df.info()

In [None]:
numeric_cols = list(df.dtypes[df.dtypes != 'object'].index) #to be used for scaling later
numeric_cols

In [None]:
#let's take a look at the rows with missing values
df[df.isna().any(axis=1)]

Above we see that the most common missing value is sex, while a few other rows are almost completely blank. Rows with missing values make up a small portion of the dataset, so they can be removed later.
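To put a number on "a small portion", the missing values can be counted per column and per row. The sketch below uses a made-up toy frame rather than the penguin data; the names `toy`, `missing_per_column`, and `share_affected` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the penguin data (values are made up for illustration).
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "body_mass_g": [3750.0, np.nan, 3800.0, 4100.0],
    "sex": ["MALE", None, "FEMALE", None],
})

missing_per_column = toy.isna().sum()        # number of missing entries in each column
rows_with_missing = toy.isna().any(axis=1)   # True for rows with at least one gap
share_affected = rows_with_missing.mean()    # fraction of rows that would be dropped

print(missing_per_column)
print(f"{share_affected:.0%} of rows have at least one missing value")
```

Running the same three lines on `df` would show how many rows `dropna` is going to remove before it is called.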

Next, let's take a look at the distribution of the variables. For numerical columns we can use the describe method, and for categorical columns we use frequency counts.

In [None]:
df.describe()

In [None]:
df['species'].value_counts()

In [None]:
df['island'].value_counts()

In [None]:
df['sex'].value_counts()

In [None]:
df[df.sex == '.']

From the outputs above we see that the dataset is reasonably well balanced, with the categories being split somewhat evenly.
Strangely, one of the rows has sex recorded as '.'. We will remove this row along with those having an NA value.

## Data Preparation

Now that we have had a chance to look over the dataset, it is necessary to modify it before training models. This includes removing rows with missing values, changing categorical variables to numeric, subsetting into training and validation datasets, and scaling the numerical columns.

First we remove the rows that have missing values. We want to do this before other operations so that these rows do not waste computation or interfere with the scaling process.

In [None]:
df.dropna(inplace=True)
df.drop(df[df.sex == '.'].index, inplace=True) #drop the row that has sex == '.' (index 336) without hardcoding its label

Earlier, the info method showed that the species, island, and sex columns have the object dtype. This means they hold strings, i.e. categorical variables. It is necessary to convert these to numeric values so they can be used with all algorithms.

We use different techniques for each of these columns:

* Because species is our response variable, it does not need an encoding that is numerically meaningful to the models. We simply map the 3 species of penguins to the labels 0, 1, 2.

* The island variable is going to be used as a predictor variable. In its current state, it is a categorical variable with 3 different islands as levels. To convert this to numeric, one-hot encoding is used: a column is created for each island, and a 0 or 1 indicates whether the penguin belongs to that island. Each row has exactly one of the newly created columns set to 1.

* The sex variable can easily be converted to a binary number, with 0 representing MALE and 1 representing FEMALE.

In [None]:
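#assign the mapped columns back; calling replace(..., inplace=True) on a selected column is unreliable in recent pandas
df['species'] = df['species'].map({'Adelie': 0, 'Gentoo': 1, 'Chinstrap': 2})
df[['Biscoe','Dream','Torgersen']] = pd.get_dummies(df.island, dtype=int) #integer 0/1 columns rather than booleans
df['sex'] = df['sex'].map({'MALE': 0, 'FEMALE': 1})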

In [None]:
df.drop(['island'],axis = 1,inplace = True) #original island variable is no longer needed

A look at the new dataframe:

In [None]:
df.head()

### Train-Test split

It is now necessary to split the dataset. This is vital for evaluating how our models perform on data they have not seen. Note that we stratify on the species variable so that the training and validation sets have similar class distributions. We do this before applying the last preprocessing step: standardization.

We fit the scaler to the training dataset and then use it to transform the validation data. It is important not to fit the scaler to the whole dataset, because statistics from the validation data would then leak into training.
Also note that only the original numeric columns are scaled. If the encoded columns were scaled, they would lose their intended meaning.
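An alternative way to express the same idea, shown here only as a sketch, is to wrap the scaler in a scikit-learn `ColumnTransformer` inside a pipeline, so that the scaler is fitted on whatever data the pipeline is trained on and the remaining columns pass through untouched. This is not the approach used in the rest of the notebook; the data is synthetic and the column names (`length_mm`, `mass_g`) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 120
# Synthetic stand-in data; column names are hypothetical.
X_demo = pd.DataFrame({
    "length_mm": rng.normal(45, 5, n),
    "mass_g": rng.normal(4200, 800, n),
    "sex": rng.integers(0, 2, n),
})
y_demo = (X_demo["length_mm"] > 45).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Scale only the continuous columns; the binary column is passed through unchanged.
scale_numeric = ColumnTransformer(
    [("scale", StandardScaler(), ["length_mm", "mass_g"])],
    remainder="passthrough",
)
model = make_pipeline(scale_numeric, LogisticRegression())
model.fit(X_tr, y_tr)  # the scaler sees the training rows only

val_accuracy = model.score(X_va, y_va)

# The fitted scaler's means match the training-set means, not the full-data means.
fitted_scaler = model.named_steps["columntransformer"].named_transformers_["scale"]
print(fitted_scaler.mean_, val_accuracy)
```

The benefit of this design is that the same pipeline object can be handed to cross-validation, where the scaler is refitted inside every fold automatically.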

In [None]:
df = df.sample(frac=1) #shuffle rows
X = df.drop('species',axis=1) #predictors
Y = df['species'] #labels

In [None]:
X_train, X_validation, Y_train, Y_validation = train_test_split(X,Y,stratify=Y)
X_train.index = range(X_train.shape[0]) #reset index
X_validation.index = range(X_validation.shape[0]) #reset index

In [None]:
scaler = StandardScaler()
X_train_numeric = scaler.fit_transform(X_train[numeric_cols])
X_validation_numeric = scaler.transform(X_validation[numeric_cols])

In [None]:
X_train = pd.concat([pd.DataFrame(X_train_numeric,columns=numeric_cols), X_train.drop(numeric_cols,axis=1)], axis = 1)
X_train.head()

In [None]:
X_validation = pd.concat([pd.DataFrame(X_validation_numeric,columns=numeric_cols), X_validation.drop(numeric_cols,axis=1)], axis = 1)
X_validation.head()

In [None]:
Y_train.value_counts(normalize=True)

In [None]:
Y_validation.value_counts(normalize=True)

# Model Definition:

Several models are trained below, and each is evaluated by its accuracy on the training and validation sets. A confusion matrix is also shown so we can see which errors are most common.

## Logistic Regression

In [None]:
logistic_regression = LogisticRegression(C=1e5)
logistic_regression.fit(X_train,Y_train)

In [None]:
logistic_regression.score(X_train,Y_train), logistic_regression.score(X_validation,Y_validation)

In [None]:
confusion_matrix(Y_validation,logistic_regression.predict(X_validation))

The logistic regression model exhibits very high performance, with 100% accuracy on both the training and validation sets.

## K-Nearest Neighbors

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,Y_train)

In [None]:
knn.score(X_train, Y_train), knn.score(X_validation, Y_validation)

In [None]:
confusion_matrix(Y_validation,knn.predict(X_validation))

The K-nearest neighbors algorithm is somewhat simple in comparison to the other models, but it still proves to be effective.
With ~99% accuracy on the training and validation sets, this model is certainly suitable for the task at hand. Only one validation observation is misclassified: it is predicted as Adelie when the true species is Chinstrap.

## Naive Bayes

In [None]:
naive_bayes = GaussianNB()
naive_bayes.fit(X_train,Y_train)

In [None]:
naive_bayes.score(X_train, Y_train), naive_bayes.score(X_validation, Y_validation)

In [None]:
confusion_matrix(Y_validation,naive_bayes.predict(X_validation))

The naive Bayes model is the worst performer so far, with ~70% accuracy on both the training and validation sets. This is surprising, as the naive Bayes model should not have any problems with there being multiple classes as opposed to a simpler binary classification. Here we see that the Adelie penguins are frequently misclassified, with the model correctly classifying only 45% of the Adelie penguins in the validation set.
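A per-class figure like the one quoted above can be read directly off a confusion matrix: the diagonal entry of a row divided by the row sum is the recall for that class. The sketch below uses made-up labels, not the notebook's predictions, so the numbers are purely illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels for illustration (0 = Adelie, 1 = Gentoo, 2 = Chinstrap).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 2, 2, 1, 1, 1, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class

# Recall per class: correctly predicted count / number of true members of the class.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

# scikit-learn computes the same quantity with average=None.
assert np.allclose(per_class_recall, recall_score(y_true, y_pred, average=None))
print(per_class_recall)
```

Applied to the notebook's variables, `y_true` would be `Y_validation` and `y_pred` would be `naive_bayes.predict(X_validation)`.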

## Linear Discriminant Analysis

In [None]:
lda = LinearDiscriminantAnalysis()
lda.fit(X_train,Y_train)

In [None]:
lda.score(X_train, Y_train), lda.score(X_validation, Y_validation)

In [None]:
confusion_matrix(Y_validation,lda.predict(X_validation))

The linear discriminant analysis model exhibits very high performance, with 100% accuracy on the training set and ~99% on the validation set. We see only one misclassification, identical to the one made by the KNN model.

## Support Vector Machine

In class we manually constructed three different SVM classifiers to use for this problem.
scikit-learn implements the SVM model such that multiple binary classifiers are created automatically when performing a multiclass task.
Thus, we only need one model here. Internally, SVC uses the one-vs-one strategy, so with 3 classes 3 pairwise classifiers are created in the backend; the default decision_function_shape='ovr' only changes the shape of the decision function output.
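As a rough illustration of how `SVC` handles multiclass problems, the sketch below fits it on synthetic data with 4 classes. scikit-learn's `SVC` always trains one-vs-one classifiers internally, and `decision_function_shape` only controls how the scores are reported, so 4 classes give 6 pairwise scores under `'ovo'` and 4 aggregated scores under `'ovr'`. With the 3 penguin species both shapes would have 3 columns, which is why 4 classes are used here. The data and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic 4-class data, so that pairwise (6) and per-class (4) score counts differ.
X_blobs, y_blobs = make_blobs(n_samples=80, centers=4, random_state=0)

ovo = SVC(decision_function_shape="ovo").fit(X_blobs, y_blobs)
ovr = SVC(decision_function_shape="ovr").fit(X_blobs, y_blobs)

ovo_shape = ovo.decision_function(X_blobs).shape  # one column per pair of classes
ovr_shape = ovr.decision_function(X_blobs).shape  # one column per class

# The fitted model is the same either way, so the predicted labels agree.
same_predictions = np.array_equal(ovo.predict(X_blobs), ovr.predict(X_blobs))
print(ovo_shape, ovr_shape, same_predictions)
```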

In [None]:
svc = SVC(gamma='auto')
svc.fit(X_train,Y_train)

In [None]:
svc.score(X_train, Y_train), svc.score(X_validation, Y_validation)

In [None]:
confusion_matrix(Y_validation,svc.predict(X_validation))

Yet another model achieves near-perfect performance.

## Neural Network

Note: this is an extremely simple network because this is not a difficult task. Thus, it is similar to using logistic regression, but I wanted to include it as an option.

In [None]:
def neural_network():
    input_layer = Input(shape=(8,)) #shape must be a tuple: 8 predictor columns
    x = Dense(64, activation = 'sigmoid')(input_layer)
    output_layer = Dense(3, activation = 'softmax')(x)

    nn = Model(inputs = [input_layer], outputs = [output_layer])
    nn.compile(Adam(learning_rate=.001), loss = 'sparse_categorical_crossentropy', metrics =['accuracy']) #lr is deprecated in favor of learning_rate
    return nn
nn = neural_network()

In [None]:
nn.summary()

In [None]:
nn_fit = nn.fit(X_train, Y_train, batch_size=1, epochs = 20)

In [None]:
nn.evaluate(X_validation,Y_validation)

In [None]:
confusion_matrix(Y_validation,np.argmax(nn.predict(X_validation), axis = 1))

Yet again, only one of the validation datapoints is misclassified.

# Comparison and Conclusion

Many of the models shown above have very good accuracy. The only model that did not seem to fit well to this dataset was the Naive Bayes model, achieving an accuracy below 80% while all others reached around 99-100%. 

From the MATH 3094 course, we learned many of the introductory techniques for modeling. These were demonstrated within this notebook, and each was shown to be effective.
It is important to fit multiple models to the same dataset in order to choose the most effective one for the task. Learning about all of the above models throughout the semester was beneficial for solving problems effectively, and these techniques will be used in future applications.
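One way to compare several models on equal footing, sketched here under assumptions, is k-fold cross-validation with more than one metric. The example uses `cross_validate` together with `make_scorer` and `f1_score` on synthetic data from `make_classification`; the dataset, the choice of 5 folds, and the names `candidates` and `results` are illustrative and not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem standing in for the penguin data.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# Macro-averaged F1 weights every class equally, regardless of its size.
macro_f1 = make_scorer(f1_score, average="macro")

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "naive_bayes": GaussianNB(),
}

results = {}
for name, estimator in candidates.items():
    scores = cross_validate(
        estimator, X_syn, y_syn, cv=5,
        scoring={"accuracy": "accuracy", "macro_f1": macro_f1},
    )
    # Average each metric over the 5 validation folds.
    results[name] = {
        "accuracy": scores["test_accuracy"].mean(),
        "macro_f1": scores["test_macro_f1"].mean(),
    }

for name, metrics in results.items():
    print(name, metrics)
```

Averaging over folds makes the comparison less sensitive to a single lucky or unlucky train-validation split than the single split used above.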