# Introduction
In this notebook, we will analyse the Palmer Penguins dataset. We'll work through the following stages of the ML pipeline and build a species classifier:
- EDA
- Feature Engineering
- Modelling

# Importing some libraries

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.experimental import enable_iterative_imputer  
from sklearn.impute import IterativeImputer
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings('ignore')

# Common functions
def apply_pca(X, standardize=True):
    # Standardize
    if standardize:
        X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Create principal components
    pca = PCA()
    X_pca = pca.fit_transform(X)
    # Convert to dataframe
    component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
    X_pca = pd.DataFrame(X_pca, columns=component_names)
    # Create loadings
    loadings = pd.DataFrame(
        pca.components_.T,  # transpose the matrix of loadings
        columns=component_names,  # so the columns are the principal components
        index=X.columns,  # and the rows are the original features
    )
    return pca, X_pca, loadings

def plot_variance(pca, width=8, dpi=100):
    # Create figure
    fig, axs = plt.subplots(1, 2)
    n = pca.n_components_
    grid = np.arange(1, n + 1)
    # Explained variance
    evr = pca.explained_variance_ratio_
    axs[0].bar(grid, evr)
    axs[0].set(
        xlabel="Component", title="% Explained Variance", ylim=(0.0, 1.0)
    )
    # Cumulative Variance
    cv = np.cumsum(evr)
    axs[1].plot(np.r_[0, grid], np.r_[0, cv], "o-")
    axs[1].set(
        xlabel="Component", title="% Cumulative Variance", ylim=(0.0, 1.0)
    )
    # Set up figure
    fig.set(figwidth=width, dpi=dpi)
    return axs


from sklearn.feature_selection import mutual_info_classif

def make_mi_scores(X, y):
    X = X.copy()
    for colname in X.select_dtypes(["object", "category"]):
        X[colname], _ = X[colname].factorize()
    # The target (species) is categorical, so we use the classification estimator;
    # discrete_features=False treats every feature column as continuous
    mi_scores = mutual_info_classif(X, y, discrete_features=False, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

# Exploratory Data Analysis
Let's take a quick look at the dataset and see which columns we have. Looking at the sample data, we can see some NaN values; we will fix these later. Aside from that, we can start by dropping some columns, such as studyName, since they do not provide any useful information for training.

In [None]:
# Read the dataset
penguins_data = pd.read_csv("../input/palmer-archipelago-antarctica-penguin-data/penguins_lter.csv")

print(penguins_data.info())

In [None]:
# Let's pick some features as our training variables
feature_columns = ['Species', 'Island', 'Stage',
        'Culmen Length (mm)','Culmen Depth (mm)',
        'Flipper Length (mm)', 'Body Mass (g)', 'Sex']
# Take a copy so that later assignments don't write to a view of penguins_data
X = penguins_data[feature_columns].copy()

# This is likely a data entry mistake, hence we'll set it as a NaN for now
X.loc[X.Sex == '.', 'Sex'] = np.nan

In [None]:
# Let's see the correlation heatmap
fig, axs = plt.subplots(nrows=1, figsize=(10, 10))

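# Only numeric columns can be correlated; Species, Island, Stage and Sex are still strings here
sns.heatmap(X.select_dtypes(include='number').corr(), ax=axs, annot=True, square=True, cmap='coolwarm', annot_kws={'size': 14})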

axs.tick_params(axis='x', labelsize=10)
axs.tick_params(axis='y', labelsize=10)
    
axs.set_title('Correlations', size=15)

plt.show()

## Missing Values

We can see that there are missing values in the dataset. We will work with the full dataset (excluding the target variable) when fixing missing values, else we risk overfitting to the train/test set with the filled data. From the code run below, we can see that there isn't any major missing data aside from the comments column. We'll now start working on restoring the values one by one.

In [None]:
def display_missing(df):    
    for col in df.columns.tolist():          
        print('{} column missing values: {}'.format(col, df[col].isnull().sum()))
    print('\n')
    
display_missing(X)

### Length & Depth Missing Values
Looking at the rows where culmen length is null, we can see that most of the other features are null as well. Because these rows carry almost no information, we can safely drop them.

In [None]:
# Inspect the rows with a missing culmen length before dropping them
print(X[X['Culmen Length (mm)'].isnull()])
X = X.dropna(subset=['Culmen Length (mm)'])
display_missing(X)

### Sex & Delta 
A very simple way to fill in the missing sex values is to impute them with the modal sex of the dataset. However, let's try to impute them in a smarter fashion. One option is k-nearest neighbours imputation. In this notebook, we will instead use a technique called MICE, which stands for Multiple Imputation by Chained Equations. Each feature with missing values is modelled as a regression on the other features, and the missing entries are re-estimated in round-robin fashion over several iterations. Full MICE repeats this to produce several imputed datasets and pools the results; scikit-learn's ``IterativeImputer``, used below, returns a single imputation by default.
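
To illustrate the idea on its own, here is a minimal sketch on hypothetical toy data (the array `toy` and its values are made up for illustration, not taken from the penguins dataset). One column is exactly twice the other, so a regression-based imputer should be able to recover the missing entry much better than a plain mean would.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical toy data: column 1 is exactly twice column 0, with one value missing
a = np.arange(1.0, 11.0)
toy = np.column_stack([a, 2 * a])
toy[4, 1] = np.nan  # the true value would be 10.0

imputer = IterativeImputer(random_state=0)
toy_imputed = imputer.fit_transform(toy)

print("iterative imputation:", toy_imputed[4, 1])
print("mean imputation would give:", np.nanmean(toy[:, 1]))
```

The iterative estimate uses the relationship between the two columns, whereas the mean ignores it.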

In [None]:
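# Label Encoding
lbl = LabelEncoder()
label_encode_features = ['Stage', 'Species']
for feature in label_encode_features:
    X[feature] = lbl.fit_transform(X[feature])

# Encode Sex by hand so that missing values stay NaN for the imputer;
# LabelEncoder would not leave them as missing
X['Sex'] = X['Sex'].map({'FEMALE': 0, 'MALE': 1})

# One-hot encoding
X = pd.get_dummies(X, prefix=['Island'], columns=['Island'])

# MICE Impute
imp = IterativeImputer(verbose=1)
imputed_X = imp.fit_transform(X)
imputed_X = pd.DataFrame(imputed_X, columns=X.columns)
# The imputer returns continuous estimates, so round Sex back to 0/1
imputed_X['Sex'] = imputed_X['Sex'].round().clip(0, 1)
X = imputed_X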

### Failed: Culmen Length (mm) 
> NOTE: It is better to drop the rows entirely in this case because of the lack of data (which I did not check in the first place). The following section is left here as it is still useful information.

When it comes to data imputation, the most straightforward way to work on this is to just fill in the missing values with its mean or median. However, we can definitely improve upon this. Since Flipper Length has the highest correlation with Culmen Length, we can apply a regression approach to impute the missing culmen length.

In [None]:
"""
# We can try to group it by species and sex as well.
culmen_length_grouped = X.groupby(['Sex', 'Species'])['Culmen Length (mm)'].median()

for species in X.Species.unique():
    for sex in ['MALE', 'FEMALE']:
        print('Median culmen length of {} {}s: {}'.format(species, sex, culmen_length_grouped[sex][species]))
print('Median culmen length of all penguins: {}'.format(X['Culmen Length (mm)'].median()))
X['Culmen Length (mm)'] = X.groupby(['Sex', 'Species'])['Culmen Length (mm)'].transform(lambda x: x.fillna(x.median()))
"""

#df_all_corr = X.corr().abs().unstack().sort_values(kind="quicksort", ascending=False).reset_index()
#df_all_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True)
#df_all_corr[df_all_corr['Feature 1'] == 'Culmen Length (mm)']

In [None]:
def imputation_regressor(X, y):
    """
    Fit a linear regression model that can predict missing values of y from X
    """
    # Keep only rows that are complete in both X and y so the two stay aligned
    complete = X.notnull().all(axis=1) & y.notnull()
    regress_train_X, regress_test_X, regress_train_y, regress_test_y = train_test_split(X[complete], y[complete], random_state=1)
    clf = LinearRegression()
    clf.fit(regress_train_X, regress_train_y)
    # For a regressor, score() returns the R2 score on the given data
    print(f'R2 Score: {clf.score(regress_test_X, regress_test_y)}')
    return clf

#regress_X = X[['Flipper Length (mm)', 'Body Mass (g)']]
#regress_y = X['Culmen Length (mm)']
#clf = imputation_regressor(regress_X, regress_y)

# This doesn't work because flipper length and body mass are all null, still useful information nonetheless
# X['Culmen Length (mm)'] = clf.predict(regress_X)[X['Culmen Length (mm)'].isnull()]

The R2 score measures how much better linear regression performs than a baseline, where the baseline is a flat regression against the mean. The baseline performance (an R2 of 0) therefore corresponds to replacing the missing values with the mean of the observed values. Even if the R2 score is fairly low, any value above 0 means the regression predicts held-out values better than the mean does, so it should introduce less bias than naively filling in the mean or median of the dataset.
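
As a small sketch of that definition, the snippet below computes R2 by hand as 1 - SS_res / SS_tot. The helper `r2` and the numbers in `y_true` are hypothetical and only serve to show that predicting the mean everywhere scores exactly 0, while perfect predictions score 1.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean baseline
    return 1.0 - ss_res / ss_tot

y_true = np.array([38.0, 41.0, 45.0, 50.0])  # hypothetical culmen lengths
mean_baseline = np.full_like(y_true, y_true.mean())
perfect = y_true.copy()

print(r2(y_true, mean_baseline))  # 0.0
print(r2(y_true, perfect))        # 1.0
```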

# Feature Engineering
With missing values fixed, we can now proceed to feature engineering. Whatever relationships the model can't learn, we can provide ourselves through transformations. That way, we can improve the predictive accuracy of our model overall. First, let's choose features that correlate well with species (target variable).

### PCA 
Although one of the most common uses of PCA is for dimensionality reduction, we can also explore the loadings to create new features. For this notebook, we'll do a PCA based on the continuous features ``Culmen Length``, ``Culmen Depth``, ``Flipper Length``, and ``Body Mass`` since these are physically measurable features. Notably, we'll standardize these features first since these features aren't on the same scale.
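
To see why standardizing matters, here is a rough sketch on synthetic data (the two columns are randomly generated stand-ins, not the real measurements). Without standardization, the feature with the largest numeric scale dominates the first principal component almost entirely.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical, independent features on very different scales
body_mass = rng.normal(4000, 800, size=200)  # grams
culmen_len = rng.normal(45, 5, size=200)     # millimetres
raw = np.column_stack([body_mass, culmen_len])

# Same standardization idea as in apply_pca: zero mean, unit variance per column
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

evr_raw = PCA().fit(raw).explained_variance_ratio_
evr_std = PCA().fit(standardized).explained_variance_ratio_
print("raw:", evr_raw)
print("standardized:", evr_std)
```

On the raw data nearly all variance is attributed to PC1 simply because body mass is measured in larger numbers; after standardization the two features contribute on comparable terms.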

In [None]:
# We set the species column as the target variable to be predicted
y = X.Species
X.drop(columns=['Species'], inplace=True)

pca_features = ['Culmen Length (mm)', 'Flipper Length (mm)', 'Culmen Depth (mm)', 'Body Mass (g)']
pca, X_pca, loadings = apply_pca(X[pca_features])

In [None]:
mi_scores = make_mi_scores(X_pca, y)
mi_scores

We can see that PC4 is insignificant in this case, hence we'll ignore the last principal component.
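
For intuition about what a low or high MI score means, the following sketch uses made-up data: one feature that closely tracks a three-class label and one that is pure noise. It uses scikit-learn's `mutual_info_classif`, which is the variant intended for categorical targets; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)                # three hypothetical classes
informative = labels + rng.normal(0, 0.1, size=300)  # tracks the class closely
noise = rng.normal(0, 1, size=300)                   # unrelated to the class
features = np.column_stack([informative, noise])

scores = mutual_info_classif(features, labels, discrete_features=False, random_state=0)
print("informative:", scores[0])
print("noise:", scores[1])
```

A score near zero, as for the noise column, suggests the feature tells us little about the target on its own.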

##### Loadings
We will try and interpret these loadings prior to deriving new features from this information. Let's look at the principal components one by one.
1. In PC1, we can see that it has something to do with the total span of the penguin in proportion to its body mass.
2. In PC2, flipper length and body mass are insignificant. Since we can discard those, we can see that the 2nd principal component has something to do with the culmen size.
3. In PC3, it has something to do with the span of the culmen in proportion to its entire body.

Now that we interpreted these principal components, we will try to make some new features through transformations.

In [None]:
print(loadings)
X['Culmen Span'] = X['Culmen Length (mm)']/X['Culmen Depth (mm)']
X['Culmen Size'] = X['Culmen Length (mm)'] * X['Culmen Depth (mm)']

# Modelling
Now that all of the ground work has been established, we can now start modelling.

### Random forest classifier
We will use a random forest classifier for this notebook. One reason is that it is straightforward, and simplicity is key (as opposed to an ANN).
We use a random forest as it is generally more accurate than a single decision tree. Feature standardization is also an important step in machine learning that we will not skip. It helps many algorithms learn a better solution more quickly, although tree-based models such as random forests are largely insensitive to feature scale.

#### Cross Validation
We will use a 5-fold cross validation to measure our classifier's accuracy. As we can see, we are achieving >98% accuracy on average.
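
As a side note, one common way to combine scaling with cross validation is to wrap both steps in a pipeline, so the scaler is refit on the training part of every fold. The sketch below is not this notebook's own cell: it runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo` and `pipeline` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the penguin features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# The scaler is fit only on the training part of each fold
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(scores, scores.mean())
```

For a random forest the scaling step changes little, but the same pattern carries over unchanged to scale-sensitive models.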

#### Confusion Matrix
We can also look at the classifier's confusion matrix. As expected from the cross validation performance, we only see a single misclassification, which is very impressive.

In [None]:
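from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix, removed in scikit-learn 1.2

model = RandomForestClassifier(n_estimators=10)
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model.fit(X_train, y_train)

# cross_val_score clones the model and refits it on each fold of the (unscaled) data
print(cross_val_score(model, X, y, cv=5))
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.show()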