# Exploratory Data Analysis
- Exploratory data analysis (EDA) is an approach to analysing data sets in order to summarise their main characteristics, often using graphs, plots, charts and other data visualisation methods.
- EDA contrasts with conventional hypothesis testing: it is used primarily to see what the data can tell us beyond formal modelling.
- John Tukey, a renowned statistician, promoted the idea of EDA from 1970 onwards.
- His aim was to encourage statisticians to explore the data and possibly formulate hypotheses, which can in turn guide new data collection and experiments.
- EDA differs from **initial data analysis (IDA)**, which focuses more narrowly on handling missing values, transforming variables as necessary, and checking the assumptions needed for model fitting and hypothesis testing.
- IDA is included in EDA.


## Basic exploration
- The aim of basic exploration is to use Python functions to identify erroneous information in the raw data, such as missing values and outliers.
- The functions used in this notebook may need to change for a different data file.
- A good working knowledge of Python and of libraries such as NumPy and pandas is a must.

In [1]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import warnings
warnings.filterwarnings("ignore") # ignoring warnings
# Call this only once the project is finished; until then, leave warnings on and try to understand them.
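A gentler alternative to switching warnings off for the whole notebook is to silence them only around one call. The sketch below uses the standard `warnings` module; `noisy_step` is a hypothetical stand-in for any call that emits a warning.

```python
import warnings

def noisy_step():
    # Hypothetical function standing in for any call that emits a warning
    warnings.warn("this call is deprecated", DeprecationWarning)
    return 42

# Warnings are ignored only inside this block
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_step()

# Outside that block the filter no longer applies; record the warning to confirm
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_step()

print(result, len(caught))
```

Scoping the filter this way keeps warnings from the rest of the analysis visible.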


In [3]:
penguin_raw = pd.read_excel('ML101 Dataset_2 penguin_manipulated_data_set.xlsx') # reading the dataset from a local Excel file

In [4]:
penguin = penguin_raw.copy() # creating a copy of read file
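The reason for working on a copy can be shown on a toy frame (the values below are made up): changes to the copy leave the raw data untouched, so the original is always available for comparison.

```python
import pandas as pd

# Toy stand-in for the raw data; values are illustrative only
raw = pd.DataFrame({"Body Mass (g)": [3750, 3800]})

work = raw.copy()                 # independent copy of the data
work.loc[0, "Body Mass (g)"] = 0  # edit the working copy only

print(raw.loc[0, "Body Mass (g)"], work.loc[0, "Body Mass (g)"])
```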

In [5]:
penguin_raw.shape # exploring number of observations and variables

(344, 17)

In [6]:
penguin.head()

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
0,PAL0708,1.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,,,181.0,3750.0,MALE,,,Not enough blood for isotopes.
1,PAL0708,2.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
2,PAL0708,3.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,
3,PAL0708,4.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,,,,.,,,,Adult not sampled.
4,PAL0708,5.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193.0,3450.0,FEMALE,8.76651,-25.32426,


We have 344 observations and 17 variables. 

### Examining a few observations and features

In [7]:
penguin_raw.head() # head function to spot any erroneous values
# The point is not the first 5 or 10 rows themselves;
# the aim is to check whether the data looks sensible

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
0,PAL0708,1.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,,,181.0,3750.0,MALE,,,Not enough blood for isotopes.
1,PAL0708,2.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
2,PAL0708,3.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,
3,PAL0708,4.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,,,,.,,,,Adult not sampled.
4,PAL0708,5.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193.0,3450.0,FEMALE,8.76651,-25.32426,


In [17]:
penguin["Species"].value_counts()

Adelie Penguin (Pygoscelis adeliae)          152
Gentoo penguin (Pygoscelis papua)            124
Chinstrap penguin (Pygoscelis antarctica)     68
Name: Species, dtype: int64

- Several columns have missing values, which appear either as NaN or as a '.' placeholder.
- The row at index 3 is missing most of its measurements.
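Both kinds of gaps can be counted per column. The following is a minimal sketch on a toy frame with made-up values; the column names are borrowed from the data set, and `summary` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the two kinds of gaps seen above
toy = pd.DataFrame({
    "Flipper Length (mm)": [181.0, 186.0, np.nan, 193.0],
    "Body Mass (g)": [3750.0, 3800.0, ".", 3450.0],
    "Sex": ["MALE", "FEMALE", np.nan, "."],
})

nan_counts = toy.isna().sum()    # genuine NaN values per column
dot_counts = (toy == ".").sum()  # '.' placeholders per column

summary = pd.DataFrame({"NaN": nan_counts, "dot": dot_counts})
print(summary)
```

Passing `na_values='.'` to `pd.read_excel` would be one way to turn the placeholders into NaN at load time, assuming '.' never carries a real meaning in the file.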

In [8]:
penguin.tail() # tail function to spot any erroneous values
# The point is not the last 5 or 10 rows themselves;
# the aim is to check whether the data looks sensible

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
339,PAL0910,120.0,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N38A2,No,2009-12-01,,,,.,,,,
340,PAL0910,121.0,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N39A1,Yes,2009-11-22,46.8,14.3,215.0,4850.0,FEMALE,8.41151,-26.13832,
341,PAL0910,122.0,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N39A2,Yes,2009-11-22,50.4,15.7,222.0,5750.0,MALE,8.30166,-26.04117,
342,PAL0910,123.0,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N43A1,Yes,2009-11-22,45.2,14.8,212.0,.,FEMALE,8.24246,-26.11969,
343,PAL0910,124.0,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N43A2,Yes,2009-11-22,49.9,16.1,213.0,5400.0,MALE,8.3639,-26.15531,


- Several columns have missing values, which appear either as NaN or as a '.' placeholder.
- The row at index 339 is missing most of its measurements.
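Rows such as this one can also be found programmatically instead of by eye. Here is a small sketch on made-up values; the "at least half the columns" threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Toy frame; the index labels imitate those in the tail() output
toy = pd.DataFrame({
    "Culmen Length (mm)": [39.5, np.nan, 46.8],
    "Flipper Length (mm)": [186.0, np.nan, 215.0],
    "Body Mass (g)": [3800.0, ".", 4850.0],
    "Sex": ["FEMALE", np.nan, "FEMALE"],
}, index=[1, 339, 340])

# A cell counts as missing if it is NaN or the '.' placeholder
is_missing = toy.isna() | (toy == ".")
missing_per_row = is_missing.sum(axis=1)

# Rows where at least half of the columns are missing (arbitrary threshold)
sparse_rows = missing_per_row[missing_per_row >= toy.shape[1] / 2]
print(sparse_rows)
```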

### Examining column names

In [9]:
penguin.columns # checking column names
# some column names to be renamed 

Index(['studyName', 'Sample Number', 'Species', 'Region', 'Island', 'Stage',
       'Individual ID', 'Clutch Completion', 'Date Egg', 'Culmen Length (mm)',
       'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)', 'Sex',
       'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)', 'Comments'],
      dtype='object')
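The notebook does not show the renaming itself. One possible approach, sketched here with a hypothetical helper `to_snake_case`, is to convert every name to lower-case snake case so that columns are easier to type.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Hypothetical helper: 'Culmen Length (mm)' becomes 'culmen_length_mm'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name)  # runs of other characters become '_'
    return cleaned.strip("_").lower()

# Empty toy frame with three of the column names listed above
toy = pd.DataFrame(columns=["studyName", "Culmen Length (mm)", "Delta 15 N (o/oo)"])

renamed = toy.rename(columns=to_snake_case)  # rename accepts a function
print(list(renamed.columns))
# ['studyname', 'culmen_length_mm', 'delta_15_n_o_oo']
```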

### Examining data types

In [10]:
penguin.dtypes # Checking whether the data types were read correctly.
# We first have to know which data type each column should have, then compare that with the dtypes result.
# If they match we are good; otherwise we have to change the data type.

studyName                      object
Sample Number                 float64
Species                        object
Region                         object
Island                         object
Stage                          object
Individual ID                  object
Clutch Completion              object
Date Egg               datetime64[ns]
Culmen Length (mm)            float64
Culmen Depth (mm)             float64
Flipper Length (mm)           float64
Body Mass (g)                  object
Sex                            object
Delta 15 N (o/oo)             float64
Delta 13 C (o/oo)             float64
Comments                       object
dtype: object

In [19]:
penguin["Body Mass (g)"].unique()

array([3750.0, 3800.0, 3250.0, '.', 3450.0, 3650.0, 3625.0, 4675.0, 0.0,
       4250.0, 3300.0, 3700.0, 3200.0, 4400.0, 4500.0, 3325.0, 4200.0,
       3400.0, 3600.0, 3950.0, 3150.0, 3900.0, 4150.0, 4650.0, 3100.0,
       3000.0, 4600.0, 3425.0, 2975.0, 3500.0, 4300.0, 2900.0, 3550.0,
       2850.0, 4050.0, 3350.0, 4100.0, 3050.0, 4450.0, 4000.0, 4700.0,
       4350.0, 3725.0, 4725.0, 3075.0, 2925.0, 3175.0, 4775.0, 3825.0,
       4275.0, 4075.0, 3775.0, 3875.0, 3275.0, 4475.0, 3975.0, 3475.0,
       3525.0, 3575.0, 4550.0, 3850.0, 4800.0, 2700.0, 3675.0, 5700.0,
       5400.0, 5200.0, 5150.0, 5550.0, 5850.0, 6300.0, 5350.0, 5000.0,
       5050.0, 5100.0, 5650.0, 5250.0, 6050.0, 4950.0, 4750.0, 4900.0,
       5300.0, 4850.0, 5800.0, 6000.0, 5950.0, 4625.0, 5450.0, 5600.0,
       4875.0, 4925.0, 4975.0, 5500.0, 4575.0, 4375.0, 5750.0],
      dtype=object)

- Body Mass (g) was expected to be a continuous (float) feature but was read as an object. The cause is the '.' placeholder among its values.
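One way to locate the offending entries, and an alternative to filtering rows out, is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into NaN. This is a sketch on made-up values.

```python
import pandas as pd

# Toy column of object dtype mixing numbers with the '.' placeholder
mass = pd.Series([3750.0, ".", 3450.0, 0.0], dtype=object, name="Body Mass (g)")

as_number = pd.to_numeric(mass, errors="coerce")  # unparseable entries become NaN

# Entries that were present in the original but failed to convert
bad_values = mass[as_number.isna() & mass.notna()]

print(as_number.dtype)
print(bad_values.tolist())
```

Unlike dropping rows, this keeps the other measurements of the affected penguins; which option is better depends on how the missing values will be handled later.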

In [20]:
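# keep only rows whose body mass is not the '.' placeholder;
# .copy() avoids a SettingWithCopyWarning when the column is reassigned later
penguin = penguin[penguin["Body Mass (g)"]!="."].copy()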

In [23]:
penguin["Body Mass (g)"].unique()

array([3750.0, 3800.0, 3250.0, 3450.0, 3650.0, 3625.0, 4675.0, 0.0,
       4250.0, 3300.0, 3700.0, 3200.0, 4400.0, 4500.0, 3325.0, 4200.0,
       3400.0, 3600.0, 3950.0, 3150.0, 3900.0, 4150.0, 4650.0, 3100.0,
       3000.0, 4600.0, 3425.0, 2975.0, 3500.0, 4300.0, 2900.0, 3550.0,
       2850.0, 4050.0, 3350.0, 4100.0, 3050.0, 4450.0, 4000.0, 4700.0,
       4350.0, 3725.0, 4725.0, 3075.0, 2925.0, 3175.0, 4775.0, 3825.0,
       4275.0, 4075.0, 3775.0, 3875.0, 3275.0, 4475.0, 3975.0, 3475.0,
       3525.0, 3575.0, 4550.0, 3850.0, 4800.0, 2700.0, 3675.0, 5700.0,
       5400.0, 5200.0, 5150.0, 5550.0, 5850.0, 6300.0, 5350.0, 5000.0,
       5050.0, 5100.0, 5650.0, 5250.0, 6050.0, 4950.0, 4750.0, 4900.0,
       5300.0, 4850.0, 5800.0, 6000.0, 5950.0, 4625.0, 5450.0, 5600.0,
       4875.0, 4925.0, 4975.0, 5500.0, 4575.0, 4375.0, 5750.0],
      dtype=object)

In [24]:
penguin["Body Mass (g)"] = penguin["Body Mass (g)"].astype("int")

In [26]:
penguin.dtypes

Species                 object
Island                  object
Clutch Completion       object
Culmen Length (mm)     float64
Culmen Depth (mm)      float64
Flipper Length (mm)    float64
Body Mass (g)            int32
Sex                     object
dtype: object

### Dropping few unnecessary columns
- We usually drop columns at the end; however, domain knowledge often tells us in advance that some columns are unnecessary.

In [27]:
# dropping unnecessary columns
# errors = 'ignore' skips labels that are already gone, so the cell can be re-run safely
penguin = penguin.drop(['studyName', 'Sample Number', 'Stage', 'Region', 'Date Egg', 'Individual ID', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)', 'Comments'], axis = 1, errors = 'ignore')

- Without `errors = 'ignore'`, re-running this cell raises `KeyError: "[...] not found in axis"`, because the columns were already dropped by an earlier run (the `dtypes` output above lists only the eight remaining columns).
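The `KeyError` behaviour is easy to reproduce on a toy frame (made-up values): dropping labels that no longer exist fails, unless `errors="ignore"` is passed to `DataFrame.drop`.

```python
import pandas as pd

# Toy frame; column names borrowed from the data set, values made up
toy = pd.DataFrame({"Species": ["Adelie"], "Region": ["Anvers"], "Comments": [None]})
to_drop = ["Region", "Comments"]

first_run = toy.drop(columns=to_drop)

# A second plain drop fails because the labels no longer exist
try:
    first_run.drop(columns=to_drop)
    raised = False
except KeyError:
    raised = True

# errors="ignore" skips missing labels instead of raising
second_run = first_run.drop(columns=to_drop, errors="ignore")

print(raised, list(second_run.columns))
```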

### Examining unique values
- This checks whether each column holds the expected values, especially the categorical variables.

In [28]:
penguin.nunique() # nunique function to count unique values in each column

Species                  3
Island                   3
Clutch Completion        2
Culmen Length (mm)     142
Culmen Depth (mm)       74
Flipper Length (mm)     55
Body Mass (g)           95
Sex                      3
dtype: int64

In [35]:
penguin["Sex"].unique()

array(['MALE', 'FEMALE', nan, '.'], dtype=object)

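- We expected two unique values in the Sex column (MALE and FEMALE), but `nunique()` reports three: the extra one is the '.' placeholder. `unique()` additionally lists NaN, which `nunique()` does not count by default.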
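A possible way to inspect and tidy such a column is sketched below on made-up values: `value_counts(dropna=False)` shows NaN alongside the other categories, and `where` keeps only the values from an allowed set. The set `valid` is an assumption about which labels are legitimate.

```python
import numpy as np
import pandas as pd

# Toy column with the same kinds of values as the Sex column
sex = pd.Series(["MALE", "FEMALE", np.nan, ".", "MALE"], name="Sex")

counts = sex.value_counts(dropna=False)  # NaN gets its own row

valid = ["MALE", "FEMALE"]               # assumed set of legitimate labels
cleaned = sex.where(sex.isin(valid))     # everything else becomes NaN

print(counts)
print(cleaned.nunique(), cleaned.isna().sum())
```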

### Extracting data with multiple conditions

In [44]:
penguin.loc[(penguin["Sex"]!="MALE") & (penguin["Sex"]!="FEMALE")]

Unnamed: 0,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
8,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,34.1,18.1,193.0,0,
9,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,42.0,20.2,190.0,4250,
10,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,,,186.0,3300,
11,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,,,180.0,3700,
47,Adelie Penguin (Pygoscelis adeliae),Dream,Yes,,,179.0,2975,
246,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,44.5,14.3,216.0,4100,
286,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,46.2,14.4,214.0,4650,
324,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,47.3,13.8,216.0,4725,
336,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,44.5,15.7,217.0,4875,.
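The same selection can be written with `isin`, which scales better as the list of accepted values grows. The sketch below compares both forms on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["MALE", np.nan, "FEMALE", "."],
    "Body Mass (g)": [3750, 4250, 3800, 4875],
})

# Form used above: two conditions joined with &, each wrapped in parentheses
by_and = toy.loc[(toy["Sex"] != "MALE") & (toy["Sex"] != "FEMALE")]

# Equivalent form: negate membership in the list of accepted values
by_isin = toy.loc[~toy["Sex"].isin(["MALE", "FEMALE"])]

print(by_and.equals(by_isin), list(by_isin.index))
```

Both forms keep the NaN rows, because NaN compares unequal to every string and is not a member of the list.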


In [45]:
penguin[penguin.duplicated()] # rows that repeat an earlier row

Unnamed: 0,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
206,Chinstrap penguin (Pygoscelis antarctica),Dream,Yes,42.5,,187.0,3350,FEMALE


In [46]:
penguin.drop_duplicates() # returns a new DataFrame without the duplicate rows; penguin itself is unchanged unless the result is assigned back

Unnamed: 0,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,,,181.0,3750,MALE
1,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,39.5,17.4,186.0,3800,FEMALE
2,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,40.3,18.0,195.0,3250,FEMALE
4,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,36.7,19.3,193.0,3450,FEMALE
5,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,,,190.0,3650,MALE
...,...,...,...,...,...,...,...,...
337,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,48.8,16.2,222.0,6000,MALE
338,Gentoo penguin (Pygoscelis papua),Biscoe,No,47.2,13.7,214.0,4925,FEMALE
340,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,46.8,14.3,215.0,4850,FEMALE
341,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,50.4,15.7,222.0,5750,MALE
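Two details of duplicate handling are worth a small sketch on made-up values: `duplicated(keep=False)` shows every copy of a repeated row, not only the later ones, and `drop_duplicates` has no effect on the original frame until its result is assigned.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Species": ["Chinstrap", "Chinstrap", "Gentoo"],
    "Culmen Depth (mm)": [np.nan, np.nan, 14.3],
    "Body Mass (g)": [3350, 3350, 4850],
})

# keep=False marks all members of a duplicate group
both_copies = toy[toy.duplicated(keep=False)]

# Assign the result to keep the de-duplicated frame
deduped = toy.drop_duplicates().reset_index(drop=True)

print(len(toy), len(both_copies), len(deduped))
```

Note that rows whose NaN values sit in the same positions are treated as equal here, which matches the duplicate reported above.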


### Summary

In [47]:
penguin.describe(include = 'all')

Unnamed: 0,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
count,334,334,334,280.0,227.0,329.0,334.0,326
unique,3,3,2,,,,,3
top,Adelie Penguin (Pygoscelis adeliae),Biscoe,Yes,,,,,MALE
freq,146,163,299,,,,,164
mean,,,,44.528571,17.194273,201.033435,4081.961078,
std,,,,5.238174,2.006894,14.075215,1098.042129,
min,,,,32.1,13.2,172.0,0.0,
25%,,,,40.275,15.55,190.0,3500.0,
50%,,,,45.45,17.5,197.0,4000.0,
75%,,,,49.0,18.8,214.0,4768.75,
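The summary reports a minimum body mass of 0 g, which is not physically plausible and points to a placeholder or an entry error. A minimal sketch for flagging such values follows, on made-up numbers; treating every non-positive mass as missing is an assumption.

```python
import pandas as pd

# Toy column containing one impossible value
mass = pd.Series([3750, 0, 3450, 4850], name="Body Mass (g)")

suspicious = mass[mass <= 0]  # rows to inspect by hand

# Convert to float so NaN can be stored, then blank out the impossible values
mass_clean = mass.astype("float64").mask(mass <= 0)

print(list(suspicious.index), mass_clean.min())
```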


In [48]:
penguin["Species"].value_counts()

Adelie Penguin (Pygoscelis adeliae)          146
Gentoo penguin (Pygoscelis papua)            121
Chinstrap penguin (Pygoscelis antarctica)     67
Name: Species, dtype: int64