## Penguin Classification

### Simplified data from the original penguin data sets. It contains the following variables:

1. *species*: penguin species (Chinstrap, Adélie, or Gentoo)
2. *culmen_length_mm*: culmen length (mm)
3. *culmen_depth_mm*: culmen depth (mm)
4. *flipper_length_mm*: flipper length (mm)
5. *body_mass_g*: body mass (g)
6. *island*: island name (Dream, Torgersen, or Biscoe) in the Palmer Archipelago (Antarctica)
7. *sex*: penguin sex


<img src="https://imgur.com/orZWHly.png" alt="Palmer penguins" style='width: 800px;'>

**The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.**


**Aside: That’s right, developers – Gentoo Linux is named after penguins!**

In [23]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 
plt.rc("font", size=14)
# a second sns.set call overrides the first, so only the whitegrid style is set
sns.set(style="whitegrid", color_codes=True)
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

In [24]:
data_size = pd.read_csv('/kaggle/input/palmer-archipelago-antarctica-penguin-data/penguins_size.csv')

In [25]:
data_size.head()

In [26]:
data_size.describe()

In [27]:
sns.countplot(x='species', data=data_size, palette='hls')
# save before plt.show(); saving after show() writes an empty figure
plt.savefig('count_plot')
plt.show()

In [28]:
# sns.catplot(x="island", y="species", data=data_size)
sns.catplot(x="species", y="body_mass_g", jitter=False, data=data_size)

In [29]:
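# numeric_only=True skips the text columns (island, sex), which cannot be averaged
data_size.groupby('species').mean(numeric_only=True)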

*The island and sex variables are categorical, so we convert them into dummy/indicator variables using pandas `get_dummies`.*

In [30]:
data_df = data_size.copy()
cat_vars = ['island', 'sex']
for var in cat_vars:
    # one indicator column per category, named <var>_<category>
    dummies = pd.get_dummies(data_df[var], prefix=var)
    data_df = data_df.join(dummies)
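
*As a rough illustration of what `get_dummies` produces, here is a minimal sketch on a made-up four-row frame (the `toy` frame and its rows are invented for this example, not taken from the penguin data). It uses the `columns=` argument, which replaces the listed columns with indicator columns in one call.*

```python
import pandas as pd

# Toy frame (made-up rows) with the same two categorical columns as the notebook.
toy = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen", "Dream"],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# One indicator column per category; the column name is <column>_<category>.
dummies = pd.get_dummies(toy, columns=["island", "sex"])
print(dummies.columns.tolist())
```

*Each original column is replaced by one column per distinct value, which is why the penguin data also picks up a `sex_.` column for its `.` entry.*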

In [31]:
data_vars=data_df.columns.values.tolist()
to_keep=[i for i in data_vars if i not in cat_vars]

In [32]:
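# .copy() gives an independent frame, so later column assignments do not trigger SettingWithCopyWarning
data_final=data_df[to_keep].copy()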
data_final.columns.values

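***Here we drop one indicator column per categorical variable (`island_Torgersen` and `sex_MALE`) to avoid the dummy variable trap: keeping every indicator makes the columns perfectly collinear. We also drop `sex_.`, which comes from a `.` entry in the sex column.***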

In [33]:
data_final_vars = ['culmen_length_mm', 'culmen_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'island_Biscoe',
       'island_Dream', 'sex_FEMALE']
y=['species']
X=[i for i in data_final_vars if i not in y]
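
*The redundancy behind the dummy variable trap can be seen on a small made-up example: with every indicator kept, each row sums to exactly 1, so any one column is determined by the others. The sketch below also shows `drop_first=True`, a pandas option that drops one category automatically. Note that it drops the first category alphabetically, whereas the notebook drops `island_Torgersen` by hand; either choice removes the redundancy.*

```python
import pandas as pd

# Made-up island values, for illustration only.
toy = pd.DataFrame({"island": ["Biscoe", "Dream", "Torgersen", "Dream"]})

# Full set of indicators: exactly one of them is set in every row.
full = pd.get_dummies(toy["island"], prefix="island").astype(int)
row_sums = full.sum(axis=1)

# drop_first=True keeps k-1 columns for k categories.
reduced = pd.get_dummies(toy["island"], prefix="island", drop_first=True).astype(int)

print(row_sums.tolist())
print(reduced.columns.tolist())
```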

In [34]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# standardize the four numeric columns to zero mean and unit variance
scaled = scaler.fit_transform(data_size[['culmen_length_mm','culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']])
scaled_df = pd.DataFrame(scaled, columns = ['culmen_length_mm','culmen_depth_mm', 'flipper_length_mm', 'body_mass_g'])
scaled_df.head()
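
*To make the scaling step concrete, this sketch checks on made-up body masses that `StandardScaler` matches the formula (x - mean) / std, where std is the population standard deviation (`ddof=0`, NumPy's default). The `mass` array is invented for the example.*

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up body masses in grams, shaped as a single feature column.
mass = np.array([[3500.0], [4000.0], [4500.0], [5000.0]])

scaled = StandardScaler().fit_transform(mass)

# Same result by hand: subtract the column mean, divide by the population std.
manual = (mass - mass.mean(axis=0)) / mass.std(axis=0)

print(np.allclose(scaled, manual))
```

*After scaling, each column has mean 0 and standard deviation 1, so features measured in grams and in millimetres end up on a comparable scale.*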

In [35]:
data_final[['culmen_length_mm','culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']] = scaled_df[['culmen_length_mm','culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']]

In [36]:
data_final.head(2)

#### Removing missing values using dropna() 

In [37]:
# reassigning avoids modifying a slice of data_df in place
data_final = data_final.dropna()
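
*Before dropping rows, it can help to see how many values are missing in each column. A minimal sketch on a made-up frame (the values are illustrative, not real measurements):*

```python
import numpy as np
import pandas as pd

# Made-up measurements with two gaps.
toy = pd.DataFrame({
    "culmen_length_mm": [39.1, np.nan, 46.5],
    "body_mass_g": [3750.0, 3800.0, np.nan],
})

# Count of missing values per column.
missing_per_column = toy.isna().sum()

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()

print(missing_per_column.tolist(), len(cleaned))
```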

In [38]:
X

*Using these columns, we create X_data and y_data:*

y = ['species']

X = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', 'island_Biscoe', 'island_Dream', 'sex_FEMALE']

In [39]:
X_data = data_final[X]
y_data = data_final[y]

Dividing X and y into train and test data.

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, random_state = 0)
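
*When no `test_size` is given, `train_test_split` holds out 25% of the rows. The sketch below, on made-up data, spells that default out and adds `stratify`, an optional argument the notebook does not use, which keeps the class proportions similar in both parts. The arrays `X_toy` and `y_toy` are invented for the example.*

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and an imbalanced set of species labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array(["Adelie"] * 10 + ["Gentoo"] * 6 + ["Chinstrap"] * 4)

# test_size=0.25 is the default; stratify=y_toy preserves the class mix.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

print(len(X_tr), len(X_te), sorted(set(y_te)))
```

*Stratifying may matter here because Chinstrap is the smallest class, and an unlucky random split could leave very few of them in the test set.*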

In [41]:
# training a DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
dtree_model = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)
dtree_predictions = dtree_model.predict(X_test)
 
# creating a confusion matrix
cm = confusion_matrix(y_test, dtree_predictions)
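
*The row and column order of a confusion matrix matters when labelling it. When `labels` is not passed, `confusion_matrix` uses the sorted order of the labels; passing `labels` fixes the order explicitly. A small sketch with made-up predictions:*

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted species.
y_true = ["Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo"]
y_pred = ["Gentoo", "Adelie", "Adelie", "Adelie", "Gentoo"]

# Default: rows/columns follow sorted labels (Adelie, Chinstrap, Gentoo).
cm_sorted = confusion_matrix(y_true, y_pred)

# Explicit order: rows/columns follow the given list.
labels = ["Adelie", "Gentoo", "Chinstrap"]
cm_custom = confusion_matrix(y_true, y_pred, labels=labels)

print(cm_sorted)
print(cm_custom)
```

*Rows are the true classes and columns the predicted ones, so any list of class names used for plotting has to match the order the matrix was built with.*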

In [42]:
from sklearn.metrics import f1_score
score = f1_score(y_test, dtree_predictions, average='macro')
score

**Here we achieve a macro-averaged F1 score of 0.98 on the test set.**
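
*With `average='macro'`, the F1 score is computed separately for each class and then averaged without weighting, so a small class counts as much as a large one. A minimal sketch with made-up labels (not the notebook's predictions) that checks this relationship:*

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up true and predicted species, two samples per class.
y_true = ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"]
y_pred = ["Adelie", "Adelie", "Chinstrap", "Adelie", "Gentoo", "Gentoo"]

# average=None returns one F1 score per class, in sorted label order.
per_class = f1_score(y_true, y_pred, average=None)

# average='macro' is the unweighted mean of those per-class scores.
macro = f1_score(y_true, y_pred, average="macro")

print(per_class, macro)
```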

In [43]:
# Source code credit for this function: https://gist.github.com/shaypal5/94c53d765083101efc0240d776a23823
def print_confusion_matrix(confusion_matrix, class_names, figsize = (10,7), fontsize=14):
    """Prints a confusion matrix, as returned by sklearn.metrics.confusion_matrix, as a heatmap.
    
    Arguments
    ---------
    confusion_matrix: numpy.ndarray
        The numpy.ndarray object returned from a call to sklearn.metrics.confusion_matrix. 
        Similarly constructed ndarrays can also be used.
    class_names: list
        An ordered list of class names, in the order they index the given confusion matrix.
    figsize: tuple
        A 2-long tuple, the first value determining the horizontal size of the outputted figure,
        the second determining the vertical size. Defaults to (10,7).
    fontsize: int
        Font size for axes labels. Defaults to 14.
        
    Returns
    -------
    None
        The heatmap is drawn on a new matplotlib figure; nothing is returned.
    """
    df_cm = pd.DataFrame(
        confusion_matrix, index=class_names, columns=class_names, 
    )
    fig = plt.figure(figsize=figsize)
    try:
        heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
    except ValueError:
        raise ValueError("Confusion matrix values must be integers.")
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=fontsize)
    plt.ylabel('Truth')
    plt.xlabel('Prediction')

In [44]:
# class names must follow the sorted label order used by confusion_matrix
print_confusion_matrix(cm,['Adelie', 'Chinstrap', 'Gentoo'])