"I have followed a code-first approach: I perform some analysis/coding and then summarize it in the text below it." <br>
"I will update the kernel as I experiment with this dataset. Please upvote if you liked it."

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('../input/palmer-archipelago-antarctica-penguin-data/penguins_size.csv')
df.head()

In [None]:
df.shape

# 1-INTRODUCTION
* This is a toy dataset, like the IRIS flower species dataset. Such datasets (low in dimensionality and sample count) are great for understanding various visualization and classification techniques.
* Palmer Archipelago, also known as Antarctic Archipelago, is a group of islands off the northwestern coast of the Antarctic Peninsula. (Trivia: the islands in this study are around 64°S 64°W; you can check them on the map.)
* We will be focusing on the "penguins_size.csv" data file; the other file ("penguins_lter.csv") contains some additional variables.
* Here we have 344 penguin records from the Antarctic.

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df_species_count = df.species.value_counts()
# df_species_perc = 100.0*df_species_count / len(df)
df_species_perc = 100.0*df_species_count / df_species_count.sum()

df_species_table = pd.concat([df_species_count, df_species_perc], axis=1)
df_species_table.columns = ['df_species_count','df_species_perc']

df_species_table

In [None]:
df_count = df.island.value_counts()
df_perc = 100.0*df_count / df_count.sum()

df_table = pd.concat([df_count, df_perc], axis=1)
df_table.columns = ['df_island_count','df_island_perc']

df_table

In [None]:
df_count = df.sex.value_counts(dropna=False)
df_perc = 100.0*df_count / df_count.sum()

df_table = pd.concat([df_count, df_perc], axis=1)
df_table.columns = ['df_sex_count','df_sex_perc']

df_table

# 2-DATA OVERVIEW
* Here we are dealing with 7 variables ('species', 'island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex')
* For our study, let's consider 'species' as the dependent variable / target and the rest as the independent variables, which will be used as the features to predict the species of the penguin under study.
* Variables explained -<br>
>     species: penguin's species (Chinstrap, Adélie, Gentoo)
>     culmen_X: culmen measurements in millimeters (the culmen is the upper ridge of the penguin's beak)
>     flipper_X: flipper measurements in millimeters (penguin's wings)
>     body_mass_g: body mass in grams
>     island: island name (Dream, Torgersen, Biscoe)
>     sex: penguin's sex
* Here we have 3 variables in string format ('species', 'island', 'sex'), which we will later encode while creating the feature vector for our classification task.
* The target class is imbalanced (unlike the IRIS dataset), with the following ratios: (Adelie, Gentoo, Chinstrap) = (44%, 36%, 20%)
* The distribution of records across islands and sex can also be read from the tables above.
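The three count/percentage tables above all follow the same pattern, so they could be produced by one small helper. The sketch below is only an illustration: the function name `count_and_percent` and the toy DataFrame are assumptions, not part of the original kernel.

```python
import pandas as pd


def count_and_percent(series: pd.Series, name: str, dropna: bool = True) -> pd.DataFrame:
    """Return value counts and percentage shares for one column (hypothetical helper)."""
    counts = series.value_counts(dropna=dropna)
    perc = 100.0 * counts / counts.sum()
    table = pd.concat([counts, perc], axis=1)
    table.columns = [f"{name}_count", f"{name}_perc"]
    return table


# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({"species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"]})
species_table = count_and_percent(toy["species"], "species")
print(species_table)
```

On the real data this would be called as, for example, `count_and_percent(df.sex, "sex", dropna=False)` to keep the NaN row visible.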


In [None]:
df.isnull().sum()

In [None]:
df[df.culmen_length_mm.isna()]

In [None]:
df = df.drop(df[df.culmen_length_mm.isna()].index)

In [None]:
df.sex.unique()

In [None]:
df[df.sex=="."]

In [None]:
df = df.drop(df[df.sex=="."].index)

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df[df.sex.isna()]

**So we dropped 3 rows and are still left with missing values in SEX, which we will impute. We can either use mode imputation, which would replace all NaN with MALE, or analyse males and females across the other variables and impute smartly.**

In [None]:
df[df.sex=="MALE"].culmen_length_mm.mean() , df[df.sex=="FEMALE"].culmen_length_mm.mean()

In [None]:
df[df.sex=="MALE"].culmen_depth_mm.mean() , df[df.sex=="FEMALE"].culmen_depth_mm.mean()

In [None]:
df[df.sex=="MALE"].flipper_length_mm.mean() , df[df.sex=="FEMALE"].flipper_length_mm.mean()

In [None]:
df[df.sex=="MALE"].body_mass_g.mean() , df[df.sex=="FEMALE"].body_mass_g.mean()
# df[df.sex=="MALE"].body_mass_g.describe() , df[df.sex=="FEMALE"].body_mass_g.describe()

**Flipper length and body mass may be good base features for imputing sex. This imputation could have been more robust if we had a feature for age.**
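The idea of imputing from the group means can be sketched as a nearest-mean rule: assign each unlabelled record the sex whose mean body mass is closer. This is a minimal illustration on made-up numbers, assuming body mass alone is used; with a single feature it is equivalent to thresholding at the midpoint of the two means.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the penguin data (values are made up for illustration).
toy = pd.DataFrame({
    "body_mass_g": [4600.0, 4500.0, 3800.0, 3900.0, 4700.0, 3700.0],
    "sex": ["MALE", "MALE", "FEMALE", "FEMALE", np.nan, np.nan],
})

# Mean body mass of the labelled males and females.
male_mean = toy.loc[toy.sex == "MALE", "body_mass_g"].mean()
female_mean = toy.loc[toy.sex == "FEMALE", "body_mass_g"].mean()

# For each unlabelled row, pick the sex whose mean is closer (ties go to FEMALE).
missing = toy.sex.isna()
dist_male = (toy.loc[missing, "body_mass_g"] - male_mean).abs()
dist_female = (toy.loc[missing, "body_mass_g"] - female_mean).abs()
toy.loc[missing, "sex"] = np.where(dist_male < dist_female, "MALE", "FEMALE")

print(toy)
```

Body mass also differs strongly between species, so computing the means per species might give a sharper rule; that variant is not tried here.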

In [None]:
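### Midpoint between the mean body mass of males and females.
### A record is closer to the FEMALE mean if its body mass is below this value.
male_mean = df.loc[df.sex == "MALE", "body_mass_g"].mean()
female_mean = df.loc[df.sex == "FEMALE", "body_mass_g"].mean()
round((male_mean + female_mean) / 2, 2)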

In [None]:
# df.loc[df['sex'] == 'FEMALE', 'body_mass_g'].mean()

df.loc[(df.sex.isna()) & (df.body_mass_g <= 4203.98) , "sex"] = "FEMALE"

In [None]:
df['sex'] = df['sex'].fillna("MALE")

In [None]:
df.sex.value_counts(dropna=False)

# 3-DATA CLEANING
* There are 2 missing values in each of the culmen, flipper and body mass variables, and 10 in sex.
* The 2 records concerned have missing values across all 4 numeric variables ---> DROP THEM
* As it is a small dataset, we cannot afford to drop the remaining records where only sex is missing.
* Instead, we use the remaining variables (here, body mass) to impute sex for those records.
* There is one garbage record (".") in the sex variable ---> DROP IT
* Now the dataset is clean and ready for EDA

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
plt.figure(figsize = (8,8))
sns.violinplot(x="species", y="culmen_length_mm", data=df)
plt.show()

In [None]:
plt.figure(figsize = (8,8))
sns.violinplot(x="species", y="culmen_depth_mm", data=df)
plt.show()

In [None]:
plt.figure(figsize = (8,8))
sns.violinplot(x="species", y="flipper_length_mm", data=df)
plt.show()

**The class medians of flipper_length_mm are quite well separated**

In [None]:
plt.figure(figsize = (8,8))
sns.violinplot(x="species", y="body_mass_g", data=df)
plt.show()

In [None]:
# pairplot creates its own figure, so size it via `height` instead of plt.figure
sns.pairplot(df, hue="species", diag_kind="kde", height=3)
plt.show()

In [None]:
# to be updated ...

# 4-EDA
* Significant separation is seen in the following 3 plots: ("culmen_length_mm" vs "culmen_depth_mm"), ("culmen_length_mm" vs "flipper_length_mm"), ("culmen_length_mm" vs "body_mass_g")

In [None]:
le = preprocessing.LabelEncoder()

In [None]:
le.fit(df.species)
df.species = le.transform(df.species)

In [None]:
class_names = list(le.classes_)

In [None]:
# try not using drop_first for ISLAND variable
df = pd.get_dummies(df,columns=["island","sex"], drop_first=True)

In [None]:
df.shape

In [None]:
df.head(10)

In [None]:
y = df.species.values

In [None]:
X = df.drop(columns=["species"])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
# dt = DecisionTreeClassifier(criterion='entropy',max_depth=3, random_state=42)
dt = DecisionTreeClassifier(criterion='entropy', random_state=42)
# dt = DecisionTreeClassifier(random_state=42)

In [None]:
dt.fit(X_train, y_train)

In [None]:
y_pred_train = dt.predict(X_train)
accuracy = metrics.accuracy_score(y_train, y_pred_train)
print("Accuracy: {:.2f}".format(accuracy))
cm=confusion_matrix(y_train,y_pred_train)
print('Confusion Matrix: \n', cm)
print(classification_report(y_train, y_pred_train, target_names=class_names))

In [None]:
y_pred_test = dt.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred_test)
print("Accuracy: {:.2f}".format(accuracy))
cm=confusion_matrix(y_test,y_pred_test)
print('Confusion Matrix: \n', cm)
print(classification_report(y_test, y_pred_test, target_names=class_names))

# 5-MODELLING
* Label Encoding for the multiclass target SPECIES
* OHE for categorical variables ISLAND and SEX
* I have used the entire dataset for EDA; I should have kept the test data aside and not looked at it at all. But the main aim here was data exploration and visualization.
* As I have used a tree model, there is no need to scale the numerical variables.
* Data imbalance can be tackled by using class weights while calculating the loss (unequal penalties) OR by using SMOTE.
* Data imbalance must also be taken into consideration while splitting, so that the class ratio remains the same in the TRAIN and TEST sets.
* Here we get a testing accuracy of 97% with each F1 score above 0.9
* Handling imbalance and hyperparameter tuning can be explored in the next phase.
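The two imbalance remedies mentioned above that need no extra library (stratified splitting and class weights) can be sketched with scikit-learn as follows. This is only an illustration on synthetic data, not a result from this kernel: the arrays `X_toy` and `y_toy` are made up, with class sizes chosen to mimic the 44/36/20 ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (made up): 3 classes, 2 features.
rng = np.random.default_rng(42)
y_toy = np.repeat([0, 1, 2], [44, 36, 20])
X_toy = rng.normal(loc=y_toy[:, None] * 3.0, scale=1.0, size=(100, 2))

# stratify=y_toy keeps the class ratios the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# class_weight="balanced" weights each class inversely to its frequency.
clf = DecisionTreeClassifier(criterion="entropy", class_weight="balanced", random_state=42)
clf.fit(X_tr, y_tr)

print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```

In the notebook itself this would amount to passing `stratify=y` to the existing `train_test_split` call and `class_weight="balanced"` to the existing `DecisionTreeClassifier`; whether it changes the reported scores has not been checked here.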

"I will update the kernel as I experiment with this dataset (with more visualizations and modelling techniques). Please upvote if you liked it."