# Exploratory Data Analysis for the Penguin Dataset


### Dataset Description

This dataset contains body measurements for three penguin species.

The columns in the dataset are as follows:

* species: The species of the observed penguin (Adelie, Gentoo, Chinstrap).

* island: The island where the penguin was observed (Biscoe, Dream, Torgersen).

* bill_length_mm: The length of the penguin's bill (beak) in millimeters.

* bill_depth_mm: The depth (height) of the penguin's bill in millimeters.

* flipper_length_mm: The length of the penguin's flipper (wing) in millimeters.

* body_mass_g: The body mass of the penguin in grams.

* sex: The sex of the penguin (male or female).

### Importing the Required Libraries

In [30]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.stats import kurtosis, skew, zscore
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

#### Reading the Dataset and displaying some rows

In [29]:
df = pd.read_csv('dataset.csv')
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female


### Before Cleaning the dataset

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     337 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                328 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [34]:
df.describe()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,337.0,342.0,342.0,342.0
mean,43.876855,17.15117,200.915205,4273.976608
std,5.4786,1.974793,14.061714,1119.229602
min,32.1,13.1,172.0,2700.0
25%,39.2,15.6,190.0,3550.0
50%,44.1,17.3,197.0,4050.0
75%,48.5,18.7,213.0,4793.75
max,59.6,21.5,231.0,15000.0


#### Q1: Identify missing or incorrect data in the dataset and apply appropriate preprocessing steps to clean it

Code

In [31]:
# Identifying missing data
missing_data = df.isnull().sum()
print(missing_data)

species               0
island                0
bill_length_mm        7
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  16
dtype: int64
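
To judge how heavily each column is affected, the counts can also be expressed as a share of the rows (in the output above, for example, 16 of 344 rows for sex is about 4.7%). A minimal sketch on made-up data follows; the frame `toy` and its values are assumptions for illustration, not rows from the penguin dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the penguin dataset
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "sex": ["male", "female", None, None],
})

# Absolute number of missing values per column
missing_count = toy.isnull().sum()

# Share of missing values per column, in percent
missing_pct = toy.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```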


In [32]:
df_clean = df.copy()

# Impute numeric columns with the column mean.
# Assign the result back: chained fillna(..., inplace=True) is deprecated in recent pandas.
numeric_cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
for col in numeric_cols:
    df_clean[col] = df_clean[col].fillna(df_clean[col].mean())

# Impute the categorical column with its most frequent value (mode)
df_clean['sex'] = df_clean['sex'].fillna(df_clean['sex'].mode()[0])
df_clean.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
3,Adelie,Torgersen,43.876855,17.15117,200.915205,4273.976608,male
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female


#### Explanation

We find the missing data with the isnull() method of the DataFrame, summing the result per column.
Five of the seven columns contain missing values: the four numeric measurements and sex.

In general, there are two ways to clean the data:

1. Dropping rows that contain missing data.
2. Imputing missing values with an appropriate statistic:

    * For numerical columns (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g), we use the column mean.
    
    * For categorical columns (sex), we use the mode.

I have adopted the second method here because the dataset is small (344 rows), and dropping rows would shrink it further.
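
The trade-off between the two choices can be sketched on a small made-up frame. The name `toy` and its values are illustrative assumptions, not rows from the penguin dataset: dropping removes every record with any gap, while imputation keeps the row count unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the penguin dataset
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "body_mass_g": [3750.0, 3800.0, np.nan, 3450.0],
    "sex": ["male", "female", None, "female"],
})

# Choice 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Choice 2: impute numeric columns with the mean, the categorical column with the mode
imputed = toy.copy()
for col in ["bill_length_mm", "body_mass_g"]:
    imputed[col] = imputed[col].fillna(imputed[col].mean())
imputed["sex"] = imputed["sex"].fillna(imputed["sex"].mode()[0])

print(len(toy), len(dropped), len(imputed))
```

Mean imputation keeps every row but pulls the spread inward: in the describe() outputs of this notebook, the standard deviation of bill_length_mm falls from 5.4786 before cleaning to 5.422408 after.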

### After cleaning the Dataset

In [39]:
df_clean.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
3,Adelie,Torgersen,43.876855,17.15117,200.915205,4273.976608,male
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female


In [37]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     344 non-null    float64
 3   bill_depth_mm      344 non-null    float64
 4   flipper_length_mm  344 non-null    float64
 5   body_mass_g        344 non-null    float64
 6   sex                344 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [38]:
df_clean.describe()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,344.0,344.0,344.0,344.0
mean,43.876855,17.15117,200.915205,4273.976608
std,5.422408,1.969027,14.020657,1115.961772
min,32.1,13.1,172.0,2700.0
25%,39.275,15.6,190.0,3550.0
50%,43.876855,17.3,197.0,4050.0
75%,48.4,18.7,213.0,4781.25
max,59.6,21.5,231.0,15000.0


#### Q2: What is the average body_mass_g for Gentoo penguins?

In [40]:
df_gentoo = df_clean[df_clean['species'] == 'Gentoo'].reset_index()
df_gentoo

Unnamed: 0,index,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,152,Gentoo,Biscoe,46.100000,13.20000,211.000000,4500.000000,female
1,153,Gentoo,Biscoe,50.000000,16.30000,230.000000,5700.000000,male
2,154,Gentoo,Biscoe,48.700000,14.10000,210.000000,4450.000000,female
3,155,Gentoo,Biscoe,50.000000,15.20000,218.000000,5700.000000,male
4,156,Gentoo,Biscoe,47.600000,14.50000,215.000000,5400.000000,male
...,...,...,...,...,...,...,...,...
119,271,Gentoo,Biscoe,43.876855,17.15117,200.915205,4273.976608,male
120,272,Gentoo,Biscoe,46.800000,14.30000,215.000000,4850.000000,female
121,273,Gentoo,Biscoe,50.400000,15.70000,222.000000,5750.000000,male
122,274,Gentoo,Biscoe,45.200000,14.80000,212.000000,5200.000000,female


In [41]:
gentoo_avg = df_gentoo['body_mass_g'].mean()
gentoo_avg

5126.806262969251
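
As a cross-check, the same kind of average can be computed for every species at once with `groupby`. The sketch below uses made-up values; the frame `toy` and the name `gentoo_avg_toy` are assumptions for illustration, not the real data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the penguin dataset
toy = pd.DataFrame({
    "species": ["Gentoo", "Gentoo", "Adelie", "Adelie", "Chinstrap"],
    "body_mass_g": [5000.0, 5400.0, 3700.0, 3900.0, 3800.0],
})

# Mean body mass for every species in one pass
mass_by_species = toy.groupby("species")["body_mass_g"].mean()

# Boolean-mask filter for a single species, as in the Gentoo cell above
gentoo_avg_toy = toy.loc[toy["species"] == "Gentoo", "body_mass_g"].mean()

print(mass_by_species)
print(gentoo_avg_toy)
```

Note that the notebook's figure is computed on df_clean, so it includes imputed rows such as index 271, whose body_mass_g holds the overall column mean (4273.976608) rather than a measured Gentoo value.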

#### Q3: How do the distributions of bill_length_mm and bill_depth_mm differ between the three penguin species? Analyze the skewness and kurtosis of each feature for different species. (code and explanation)

In [28]:
from scipy.stats import skew, kurtosis