### This is my first Kaggle notebook! I will be performing a simple EDA on the Palmer penguins dataset.

#### The EDA covers the following topics:
1. Descriptive statistics
2. Handling missing values
3. Univariate analysis
4. Handling outliers
5. Bivariate analysis - numerical variables
6. Bivariate analysis - categorical variables
7. Converting categorical variables into dummy variables

#### The EDA does not cover the following topics:
1. Determining dependent and independent variables.
2. Analysis between categorical and numerical variables.
3. Removal of redundant variables.

#### I hope you enjoy going through the notebook!

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [None]:
# Importing the dataset
raw_data = pd.read_csv('../input/palmer-archipelago-antarctica-penguin-data/penguins_size.csv')
raw_data.head()

## Basic exploration of data

In [None]:
# Number of rows and columns
raw_data.shape

In [None]:
# Descriptive stats for numerical and categorical variables
raw_data.describe(include='all')

From the 'count' row we can observe that there are some missing values.

In [None]:
# Total number of observations for each species
raw_data['species'].value_counts()

In [None]:
# Total number of observations for each island
raw_data['island'].value_counts()

In [None]:
# Total number of observations for each gender
raw_data['sex'].value_counts()

The 'sex' column contains one invalid entry: a single dot ('.').
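One way to handle such a placeholder entry, sketched here under the assumption that '.' never stands for a real value, is to have `pd.read_csv` treat it as missing through its `na_values` parameter. The row then disappears together with the other missing values. The small in-memory CSV below is made up for illustration and is not the penguins file.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the real file (illustration only)
csv_text = """species,sex,body_mass_g
Adelie,MALE,3750
Gentoo,.,4875
Chinstrap,FEMALE,3500
Adelie,,3700
"""

# na_values adds '.' to the strings pandas already reads as NaN
demo = pd.read_csv(io.StringIO(csv_text), na_values=['.'])

# Both the '.' entry and the empty field are now counted as missing
missing_sex = int(demo['sex'].isna().sum())
print(missing_sex)

# A single dropna() call then removes both rows
demo_clean = demo.dropna().reset_index(drop=True)
print(demo_clean)
```

With this approach no row has to be dropped by a hard-coded index later on.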

## Handling Missing Values

In [None]:
# Finding the count of missing values
raw_data.isna().sum()

In [None]:
# Displaying all the rows where at least one value is missing
raw_data[raw_data.isna().any(axis=1)]

We can observe that every row with missing values in the numerical variables is also missing the gender variable. We'll remove all the rows that contain missing values.
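The observation above can also be checked programmatically rather than by eye. The sketch below uses a small made-up DataFrame (not the penguins data) with the same column names. It tests whether any row has a gap in a numerical column while still having a recorded sex.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    'culmen_length_mm': [39.1, np.nan, 46.5, 50.0],
    'body_mass_g': [3750.0, np.nan, 4875.0, 3500.0],
    'sex': ['MALE', np.nan, np.nan, 'FEMALE'],
})

numeric_cols = ['culmen_length_mm', 'body_mass_g']
missing_numeric = demo[numeric_cols].isna().any(axis=1)
missing_sex = demo['sex'].isna()

# True when no row has a numerical gap while still having a recorded sex
numeric_gaps_are_subset = bool((missing_numeric & ~missing_sex).sum() == 0)
print(numeric_gaps_are_subset)
```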

In [None]:
# Removing the rows where missing values are present
raw_data = raw_data.dropna()
raw_data.head()

In [None]:
# Finding the count of missing values again
raw_data.isna().sum()

No missing values in the dataset! Yay! :)

In [None]:
# There is one row in the gender column with invalid data
raw_data[raw_data['sex'] == '.']

In [None]:
# Deleting the row with the invalid '.' entry (index 336 in this dataset)
raw_data = raw_data[raw_data['sex'] != '.']

In [None]:
# Resetting the indices as some of the rows were deleted and assigning it to a new variable
data_no_mv = raw_data.reset_index(drop=True)

In [None]:
# Looking at the summary of the dataset again
data_no_mv.describe(include='all')

## Univariate Analysis

In [None]:
# Culmen Length distribution
sns.displot(data_no_mv['culmen_length_mm'], kde=True)
plt.show()

In [None]:
# Culmen Depth distribution
sns.displot(data_no_mv['culmen_depth_mm'], kde=True)
plt.show()

In [None]:
# Flipper Length distribution
sns.displot(data_no_mv['flipper_length_mm'], kde=True)
plt.show()

In [None]:
# Body Mass distribution
sns.displot(data_no_mv['body_mass_g'], kde=True)
plt.show()

Univariate analysis conclusion: Judging by eye from the histograms, all the numerical variables seem to be roughly normally distributed (no formal normality test was run). So we're good to go.
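A visual impression of normality can be backed up with a formal test. The sketch below applies the Shapiro-Wilk test from `scipy.stats` to two synthetic samples (not penguin data): one bell-shaped and one with two modes, as can happen when several groups are pooled into one histogram. The sample sizes and means are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples: one bell-shaped, one mixing two separated groups
bell_shaped = rng.normal(loc=0.0, scale=1.0, size=300)
two_modes = np.concatenate([rng.normal(0.0, 1.0, 150),
                            rng.normal(8.0, 1.0, 150)])

p_values = {}
for name, sample in [('bell_shaped', bell_shaped), ('two_modes', two_modes)]:
    statistic, p_value = stats.shapiro(sample)
    # A small p-value is evidence against normality
    p_values[name] = float(p_value)
    print(name, round(float(statistic), 3), p_value)
```

The same call could be applied to each numerical column, for example `stats.shapiro(data_no_mv['body_mass_g'])`.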

## Handling Outliers

In [None]:
# Checking outliers in culmen length variable
sns.boxplot(x = data_no_mv['culmen_length_mm'])
plt.show()

In [None]:
# Checking outliers in culmen depth variable
sns.boxplot(x = data_no_mv['culmen_depth_mm'])
plt.show()

In [None]:
# Checking outliers in flipper length variable
sns.boxplot(x = data_no_mv['flipper_length_mm'])
plt.show()

In [None]:
# Checking outliers in body mass variable
sns.boxplot(x = data_no_mv['body_mass_g'])
plt.show()

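Outliers analysis conclusion: None of the boxplots shows points beyond the whiskers, so there are no outliers by the 1.5 × IQR rule that seaborn's boxplot uses by default.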
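The boxplot check can be reproduced numerically. A boxplot with default settings flags points lying more than 1.5 × IQR below the first quartile or above the third quartile. The helper `count_iqr_outliers` below is a hypothetical name introduced for this sketch, and the input values are made up.

```python
import pandas as pd

def count_iqr_outliers(values, whis=1.5):
    """Count values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((values < lower) | (values > upper)).sum())

# Made-up values: 100 lies far above the rest
demo = pd.Series([1, 2, 3, 4, 5, 100])
n_with_extreme = count_iqr_outliers(demo)             # 100 is flagged
n_without_extreme = count_iqr_outliers(demo.iloc[:5])  # nothing is flagged
print(n_with_extreme, n_without_extreme)
```

Applied to this notebook, `count_iqr_outliers(data_no_mv['body_mass_g'])` should return 0 if the boxplot shows no points beyond the whiskers.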

## Bivariate Analysis - Numerical Variables

In [None]:
# Building scatterplots between all combinations of numerical variables
(fig, ((ax1, ax2, ax3), (ax4, ax5, ax6))) = plt.subplots(2, 3, figsize=(15,10))
ax1.scatter(data_no_mv['culmen_length_mm'], data_no_mv['culmen_depth_mm'])
ax1.set_title('Culmen Length and Culmen Depth')
ax2.scatter(data_no_mv['culmen_length_mm'], data_no_mv['flipper_length_mm'])
ax2.set_title('Culmen Length and Flipper Length')
ax3.scatter(data_no_mv['culmen_length_mm'], data_no_mv['body_mass_g'])
ax3.set_title('Culmen Length and Body Mass')
ax4.scatter(data_no_mv['culmen_depth_mm'], data_no_mv['flipper_length_mm'])
ax4.set_title('Culmen Depth and Flipper Length')
ax5.scatter(data_no_mv['culmen_depth_mm'], data_no_mv['body_mass_g'])
ax5.set_title('Culmen Depth and Body Mass')
ax6.scatter(data_no_mv['flipper_length_mm'], data_no_mv['body_mass_g'])
ax6.set_title('Flipper Length and Body Mass')

plt.show()

In [None]:
# Correlation Coefficients
cc_cul_len_cul_dep = stats.pearsonr(data_no_mv['culmen_length_mm'], data_no_mv['culmen_depth_mm'])[0]
cc_cul_len_fli_len = stats.pearsonr(data_no_mv['culmen_length_mm'], data_no_mv['flipper_length_mm'])[0]
cc_cul_len_body_mass = stats.pearsonr(data_no_mv['culmen_length_mm'], data_no_mv['body_mass_g'])[0]
cc_cul_dep_fli_len = stats.pearsonr(data_no_mv['culmen_depth_mm'], data_no_mv['flipper_length_mm'])[0]
cc_cul_dep_body_mass = stats.pearsonr(data_no_mv['culmen_depth_mm'], data_no_mv['body_mass_g'])[0]
cc_fli_len_body_mass = stats.pearsonr(data_no_mv['flipper_length_mm'], data_no_mv['body_mass_g'])[0]

df_col_1 = ['Culmen Length', 'Culmen Length', 'Culmen Length',
            'Culmen Depth', 'Culmen Depth', 'Flipper Length']
df_col_2 = ['Culmen Depth', 'Flipper Length', 'Body Mass',
            'Flipper Length', 'Body Mass', 'Body Mass']
df_col_3 = [cc_cul_len_cul_dep, cc_cul_len_fli_len, cc_cul_len_body_mass,
            cc_cul_dep_fli_len, cc_cul_dep_body_mass, cc_fli_len_body_mass]

df_cc = pd.DataFrame({'Variable 1': df_col_1, 'Variable 2': df_col_2, 'Correlation Coefficient': df_col_3})
df_cc

There is high correlation between the following variables:
1. Flipper Length and Body Mass

There is moderate correlation between the following variables:
1. Culmen Length and Flipper Length
2. Culmen Length and Body Mass
3. Culmen Depth and Flipper Length
4. Culmen Depth and Body Mass

There is low correlation between the following variables:
1. Culmen Length and Culmen Depth
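The six pairwise coefficients can also be obtained in one call, since `DataFrame.corr()` returns the full Pearson correlation matrix by default. The sketch below uses made-up columns to show this. It also shows why strength should be judged on the absolute value: a coefficient close to -1 is as strong as one close to +1.

```python
import pandas as pd

# Made-up measurements (illustration only)
demo = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [2.0, 4.0, 6.0, 8.0],   # rises exactly with x
    'z': [4.0, 3.0, 2.0, 1.0],   # falls exactly as x rises
})

corr_matrix = demo.corr()        # Pearson by default
print(corr_matrix)

# Rank pairs by strength using the absolute value
strength = corr_matrix.abs()
print(strength)
```

For this notebook the call would be made on the four numerical columns only, for example `data_no_mv[['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']].corr()`.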


## Exploring clusters with respect to Species

In [None]:
(fig, ((ax1, ax2, ax3), (ax4, ax5, ax6))) = plt.subplots(2, 3, figsize=(15,10))
scatter = ax1.scatter(data_no_mv['culmen_length_mm'], data_no_mv['culmen_depth_mm'],
                      c=data_no_mv['species'].astype('category').cat.codes, cmap = 'viridis')
ax1.set_title('Culmen Length and Culmen Depth')
ax2.scatter(data_no_mv['culmen_length_mm'], data_no_mv['flipper_length_mm'],
                      c=data_no_mv['species'].astype('category').cat.codes, cmap = 'viridis')
ax2.set_title('Culmen Length and Flipper Length')
ax3.scatter(data_no_mv['culmen_length_mm'], data_no_mv['body_mass_g'],
                      c=data_no_mv['species'].astype('category').cat.codes, cmap = 'viridis')
ax3.set_title('Culmen Length and Body Mass')
ax4.scatter(data_no_mv['culmen_depth_mm'], data_no_mv['flipper_length_mm'],
                      c=data_no_mv['species'].astype('category').cat.codes, cmap = 'viridis')
ax4.set_title('Culmen Depth and Flipper Length')
ax5.scatter(data_no_mv['culmen_depth_mm'], data_no_mv['body_mass_g'],
                      c=data_no_mv['species'].astype('category').cat.codes, cmap = 'viridis')
ax5.set_title('Culmen Depth and Body Mass')
ax6.scatter(data_no_mv['flipper_length_mm'], data_no_mv['body_mass_g'],
                      c=data_no_mv['species'].astype('category').cat.codes, cmap = 'viridis')
ax6.set_title('Flipper Length and Body Mass')

plt.show()

Clusters can be clearly identified in the following scatter plots:
1. Culmen Length and Culmen Depth
2. Culmen Length and Flipper Length
3. Culmen Length and Body Mass

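One species separates clearly from the other two in the following scatter plots. The plots have no legend, so the colours should be matched to species via the category codes; in this dataset the well-separated group is expected to be Gentoo (shallower culmen, longer flippers, heavier body) rather than Chinstrap: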
1. Culmen Depth and Flipper Length
2. Culmen Depth and Body Mass
3. Flipper Length and Body Mass
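Because the scatter plots colour points by integer category codes and have no legend, it helps to print which code belongs to which species before reading the colours. The sketch below uses a short made-up Series. It relies on the fact that `astype('category')` orders string categories alphabetically.

```python
import pandas as pd

# Made-up species labels (illustration only)
species = pd.Series(['Gentoo', 'Adelie', 'Chinstrap', 'Adelie', 'Gentoo'])

as_category = species.astype('category')

# Code -> label mapping used when colouring by .cat.codes
code_to_species = dict(enumerate(as_category.cat.categories))
print(code_to_species)

# The integer codes that would be passed to the scatter plot's c= argument
codes = as_category.cat.codes.tolist()
print(codes)
```

With the 'viridis' colormap the lowest code is drawn in dark purple and the highest in yellow.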

## Exploring clusters with respect to Island

In [None]:
(fig, ((ax1, ax2, ax3), (ax4, ax5, ax6))) = plt.subplots(2, 3, figsize=(15,10))
scatter = ax1.scatter(data_no_mv['culmen_length_mm'], data_no_mv['culmen_depth_mm'],
                      c=data_no_mv['island'].astype('category').cat.codes, cmap = 'plasma')
ax1.set_title('Culmen Length and Culmen Depth')
ax2.scatter(data_no_mv['culmen_length_mm'], data_no_mv['flipper_length_mm'],
                      c=data_no_mv['island'].astype('category').cat.codes, cmap = 'plasma')
ax2.set_title('Culmen Length and Flipper Length')
ax3.scatter(data_no_mv['culmen_length_mm'], data_no_mv['body_mass_g'],
                      c=data_no_mv['island'].astype('category').cat.codes, cmap = 'plasma')
ax3.set_title('Culmen Length and Body Mass')
ax4.scatter(data_no_mv['culmen_depth_mm'], data_no_mv['flipper_length_mm'],
                      c=data_no_mv['island'].astype('category').cat.codes, cmap = 'plasma')
ax4.set_title('Culmen Depth and Flipper Length')
ax5.scatter(data_no_mv['culmen_depth_mm'], data_no_mv['body_mass_g'],
                      c=data_no_mv['island'].astype('category').cat.codes, cmap = 'plasma')
ax5.set_title('Culmen Depth and Body Mass')
ax6.scatter(data_no_mv['flipper_length_mm'], data_no_mv['body_mass_g'],
                      c=data_no_mv['island'].astype('category').cat.codes, cmap = 'plasma')
ax6.set_title('Flipper Length and Body Mass')

plt.show()

## Exploring clusters with respect to Gender

In [None]:
(fig, ((ax1, ax2, ax3), (ax4, ax5, ax6))) = plt.subplots(2, 3, figsize=(15,10))
scatter = ax1.scatter(data_no_mv['culmen_length_mm'], data_no_mv['culmen_depth_mm'],
                      c=data_no_mv['sex'].astype('category').cat.codes, cmap = 'Purples')
ax1.set_title('Culmen Length and Culmen Depth')
ax2.scatter(data_no_mv['culmen_length_mm'], data_no_mv['flipper_length_mm'],
                      c=data_no_mv['sex'].astype('category').cat.codes, cmap = 'Purples')
ax2.set_title('Culmen Length and Flipper Length')
ax3.scatter(data_no_mv['culmen_length_mm'], data_no_mv['body_mass_g'],
                      c=data_no_mv['sex'].astype('category').cat.codes, cmap = 'Purples')
ax3.set_title('Culmen Length and Body Mass')
ax4.scatter(data_no_mv['culmen_depth_mm'], data_no_mv['flipper_length_mm'],
                      c=data_no_mv['sex'].astype('category').cat.codes, cmap = 'Purples')
ax4.set_title('Culmen Depth and Flipper Length')
ax5.scatter(data_no_mv['culmen_depth_mm'], data_no_mv['body_mass_g'],
                      c=data_no_mv['sex'].astype('category').cat.codes, cmap = 'Purples')
ax5.set_title('Culmen Depth and Body Mass')
ax6.scatter(data_no_mv['flipper_length_mm'], data_no_mv['body_mass_g'],
                      c=data_no_mv['sex'].astype('category').cat.codes, cmap = 'Purples')
ax6.set_title('Flipper Length and Body Mass')

plt.show()

## Bivariate Analysis - Categorical Variables

In [None]:
# Performing Chi-Square test and getting p-value
def get_p_value(variable_1, variable_2):
    crosstab = pd.crosstab(variable_1, variable_2)
    chi_square_test_result = stats.chi2_contingency(crosstab)
    p_value = chi_square_test_result[1]
    return np.around(p_value, 3)

In [None]:
# Performing Chi-Square test for all combinations of categorical variables
p_val_species_island = get_p_value(data_no_mv['species'], data_no_mv['island'])
p_val_species_sex = get_p_value(data_no_mv['species'], data_no_mv['sex'])
p_val_island_sex = get_p_value(data_no_mv['island'], data_no_mv['sex'])

df_p_val_1 = pd.DataFrame({'Feature 1': ['Species', 'Species', 'Island'],
                           'Feature 2': ['Island', 'Sex', 'Sex'],
                           'p-value': [p_val_species_island, p_val_species_sex, p_val_island_sex]})
df_p_val_1

Since the p-value from the Chi-Square test between the Species and Island columns is less than 0.05, we can conclude that there is a statistically significant association between these two categorical variables.
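A p-value says whether an association is detectable, not how strong it is. One common effect-size measure for a Chi-Square test is Cramér's V, which ranges from 0 (no association) to 1 (perfect association). The sketch below is not part of the original analysis: `cramers_v` is a hypothetical helper name, and the two made-up examples are the extreme cases.

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(variable_1, variable_2):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    table = pd.crosstab(variable_1, variable_2).to_numpy()
    # Note: for 2x2 tables chi2_contingency applies Yates' correction by default
    chi2, p_value, dof, expected = stats.chi2_contingency(table)
    n = table.sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Made-up labels: each group lives on exactly one site
group = pd.Series(['a'] * 10 + ['b'] * 10 + ['c'] * 10)
site_same = pd.Series(['p'] * 10 + ['q'] * 10 + ['r'] * 10)
# Made-up labels: every group is split evenly across two sites
site_mixed = pd.Series(['p', 'q'] * 15)

v_perfect = cramers_v(group, site_same)
v_none = cramers_v(group, site_mixed)
print(v_perfect, v_none)
```

In this notebook the helper could be called as `cramers_v(data_no_mv['species'], data_no_mv['island'])`.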

## Creating Dummy Variables

In [None]:
preprocessed_dataset = pd.get_dummies(data_no_mv, drop_first=True)
preprocessed_dataset.head()
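To see what `drop_first=True` does, the sketch below encodes a small made-up DataFrame with and without it. For each categorical column the first level in sorted order is dropped, since it is implied when all the remaining dummies are zero. Numerical columns pass through unchanged.

```python
import pandas as pd

# Made-up rows (illustration only)
demo = pd.DataFrame({
    'body_mass_g': [3750, 4875, 3500],
    'sex': ['MALE', 'FEMALE', 'MALE'],
    'island': ['Biscoe', 'Dream', 'Torgersen'],
})

all_levels = pd.get_dummies(demo)
dropped_first = pd.get_dummies(demo, drop_first=True)

print(list(all_levels.columns))
print(list(dropped_first.columns))

# Columns removed by drop_first: one reference level per categorical variable
removed = set(all_levels.columns) - set(dropped_first.columns)
print(removed)
```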

### Overall conclusions:
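1. The dataset seems to be fit for cluster analysis - the Species column can serve as a reference label for checking the clusters (or as the target variable if the problem is treated as classification instead).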
2. Since body mass is highly correlated with flipper length, we can drop the body mass variable.
3. Since island is significantly associated with the species variable (per the Chi-Square test), we can drop the island variable (in the case where both are considered to be dependent variables).