# Programming for Data Analysis - Project

# **Data Set Simulation Based on the Palmer Penguins Data**

**Instructions:**

Create a data set by simulating a real-world phenomenon. 
- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables
- Investigate the types of variables involved, their likely distributions, and their relationships with each other
- Synthesise/simulate a data set as closely matching their properties as possible
- Detail your research and implement the simulation in a Jupyter notebook - the data set itself can simply be displayed in an output cell within the notebook




## **Table of Contents**

[**1.0 Intro/Overview of Selected Data Set**](#part1.0)<br/>
[**2.0 Accessing the Original Data Set**](#part2.0) <br/>
[**3.0 Preparing/Cleansing the Original Data Set**](#part3.0)<br/>
[**3.1 Remove missing values**](#part3.1)<br/>
[**3.2 Remove "Year" Column**](#part3.2)<br/>
[**4.0 Original Data Set Investigation**](#part4.0)<br/>
[**4.1 Identifying values for categorical attributes**](#part4.1)<br/>
[**4.2 Investigating Numerical Values**](#part4.2)<br/>
[**4.3 Subsets of data by species**](#part4.3)<br/>
[**4.4 Subsets of data by sex**](#part4.4)<br/>
[**5.0 Data Set Simulation**](#part5.0)<br/>
[**6.0 Data Set Comparison/Validation**](#part6.0)<br/>

<a id='part1.0'></a>
## **1.0 Intro/Overview of Selected Data Set**
This notebook investigates the Palmer Penguins data set and, based on the findings, creates a simulated data set matching the properties of the original data. 
I selected the penguins data set because it contains several attributes that can be used to classify the three different species. The original research paper and its accompanying data set are readily available, as are cleansed and simplified versions of it, for example as a Python library. 


The original research paper published in 2014 is an investigation of size differences between male and female penguins of the Pygoscelis genus, so-called sexual dimorphism. 

The research was carried out on three different species of penguins nesting on three islands of the Palmer Archipelago in Antarctica between 2007 and 2009, during the egg-laying period. 
- Islands: Biscoe, Torgersen and Dream
- Species: Adelie, Gentoo, Chinstrap
- The sample size for Chinstrap is reduced because of the overall smaller population on the island (page 2)

Measurements taken (page 3): 
- Blood samples for determination of sex and stable isotope (SI) analysis
- Measurements of structural size and body mass:
  - culmen length
  - culmen depth
  - right flipper
  - body mass
- Sea ice concentration (SIC), used to calculate the average sea ice area and the duration of the winter ice season

Sample sizes differ because of weather conditions hindering access to penguins (p3)

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081# (18/12/2021)

Results overview: 
"Our results demonstrate inter-specific differences in best morphological predictors of sex among Pygoscelis penguins. Adelie penguin body mass and culmen length were the strongest predictors of sex, while body mass and culmen depth were best predictors of male and female gentoo penguins. For chinstrap penguins, body mass was the least predictive structural feature, while culmen length and depth were similarly strong predictors of sex. Species-specific models based on these best morphological predictors correctly classified a high percentage of individuals from independent datasets (i.e., 89–94%). Interestingly, flipper length was not a strong predictor of sex for any of the three species. Culmen features and body mass are structures important during penguin courtship [52], and therefore, likely targets of sexual selection, which may be why these parameters are strong predictors of sex across Pygoscelis species." (page 9)


Links: 

Data set investigation
https://github.com/allisonhorst/palmerpenguins (18/12/2021)

https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data (18/12/2021)

https://towardsdatascience.com/penguins-dataset-overview-iris-alternative-9453bb8c8d95 (18/12/2021)

https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris (18/12/2021)

https://www.gabemednick.com/post/penguin/ (18/12/2021)

https://github.com/mcnakhaee/palmerpenguins (18/12/2021)

https://github.com/YuOlvera/PalmerPenguins (18/12/2021)

https://www.python-graph-gallery.com/web-ggbetweenstats-with-matplotlib (18/12/2021)

https://inria.github.io/scikit-learn-mooc/python_scripts/trees_dataset.html (18/12/2021)



https://www.youtube.com/watch?v=Eai1jaZrRDs (18/12/2021)

https://www.youtube.com/watch?v=6kD2OzF2uoU (18/12/2021)

https://www.youtube.com/watch?v=uiYgZomY-v4 (18/12/2021)



Penguins and Antarctica: 
https://en.wikipedia.org/wiki/Ad%C3%A9lie_penguin (18/12/2021)

https://en.wikipedia.org/wiki/Chinstrap_penguin (18/12/2021)

https://en.wikipedia.org/wiki/Gentoo_penguin (18/12/2021)

https://en.wikipedia.org/wiki/Pygoscelis (18/12/2021)

https://en.wikipedia.org/wiki/Palmer_Archipelago (18/12/2021)


<a id='part2.0'></a>
## **2.0 Accessing the Original Data Set**

There are several ways to load the data set into Python, for example: 
- Using the `read_csv()` function from an online csv version of the data set
- Importing the Python library palmerpenguins
- Importing the data set from the seaborn library

For this notebook the data set will be accessed from the palmerpenguins library. To be able to use it, the library first needs to be installed, for example using the command `pip install palmerpenguins`. 

https://pypi.org/project/palmerpenguins (18/12/2021)
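
As a hedged illustration of the first option, the sketch below reads CSV data with `read_csv()`. To stay runnable offline, it parses an in-memory string holding three rows copied from the data set instead of a real file or URL; the names `csv_text` and `sample` are illustrative only. For the seaborn option, `seaborn.load_dataset("penguins")` should return a similar data frame, although its columns and labels may differ slightly from the palmerpenguins version.

```python
import io
import pandas as pd

# Three rows copied from the data set; a real call would pass a file path or URL
csv_text = """species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
Adelie,Torgersen,,,,,,2007
"""

# read_csv accepts any file-like object; empty fields are parsed as NaN
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
```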

In [1]:
# Importing the penguins dataset
from palmerpenguins import load_penguins

# Importing pandas for working with datasets
import pandas as pd

The dataset can then be defined as a variable.

In [2]:
# Defining the dataset
penguins = load_penguins()

<a id='part3.0'></a>
## **3.0 Preparing/Cleansing the Original Data Set**

The palmerpenguins library contains a modified version of the original dataset; for easier usage, some irrelevant columns have been removed. 

https://allisonhorst.github.io/palmerpenguins/#about-the-data (19/12/2021)

Further cleanup of the data set might be required ahead of starting the analysis. To get a first understanding of the data, the pandas library offers a number of options, for example using the `info()` function for an overview of columns, the number of values and their data type. 

In [3]:
# Overview of the data set
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB


For a visual inspection of the data set, the `head()` function can be useful. It displays the first rows of the data set; the number of rows can be set by passing a parameter (the default is 5 rows). 

In [4]:
# Displaying the first ten rows of the data set
penguins.head(10)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,female,2007
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,male,2007
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,,2007
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,,2007


<a id='part3.1'></a>
### **3.1 Remove missing values** <br>
The output of the `info()` function shows that there are 344 data records in total. Some of the columns contain null values: bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g and sex. Some of the affected rows are also visible in the first 10 rows (for example index 3).

To get a better understanding of the full picture, variations of the `isna()` function can be used to determine the number of instances of NaN records and identify affected records. 

https://datatofish.com/check-nan-pandas-dataframe/ (18/12/2021)

https://datatofish.com/rows-with-nan-pandas-dataframe/ (18/12/2021)


In [5]:
# Determine if there are NaN values in whole data set
penguins.isna().values.any()

True

In [6]:
# Count NaN values in whole data set
penguins.isna().sum().sum()

19

In [7]:
# Display number of Nan values by column
penguins.isna().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64

In [8]:
# Display all rows with one or more values = NaN
penguins[penguins.isna().any(axis=1)]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
3,Adelie,Torgersen,,,,,,2007
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,,2007
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,,2007
10,Adelie,Torgersen,37.8,17.1,186.0,3300.0,,2007
11,Adelie,Torgersen,37.8,17.3,180.0,3700.0,,2007
47,Adelie,Dream,37.5,18.9,179.0,2975.0,,2007
178,Gentoo,Biscoe,44.5,14.3,216.0,4100.0,,2007
218,Gentoo,Biscoe,46.2,14.4,214.0,4650.0,,2008
256,Gentoo,Biscoe,47.3,13.8,216.0,4725.0,,2009
268,Gentoo,Biscoe,44.5,15.7,217.0,4875.0,,2009


There are different methods available for populating missing values, for example using the column mean for numerical values and the most frequent value (mode) for categorical values. 
https://www.geeksforgeeks.org/python-replace-nan-values-with-average-of-columns/?ref=rp (19/12/2021) 
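
A minimal sketch of such an imputation is shown here for illustration only; it is not the approach taken in this notebook. The small frame `sample` and the name `filled` are hypothetical, with values copied from rows 0, 1, 3, 5 and 8 of the output above.

```python
import pandas as pd

# Small illustrative frame with gaps (rows 0, 1, 5, 3 and 8 of the data set)
sample = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 39.3, None, 34.1],
    "body_mass_g": [3750.0, 3800.0, 3650.0, None, 3475.0],
    "sex": ["male", "female", "male", None, None],
})

filled = sample.copy()

# Numerical columns: replace NaN with the column mean
numeric_cols = filled.select_dtypes("number").columns
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

# Categorical column: replace NaN with the most frequent value (mode)
filled["sex"] = filled["sex"].fillna(filled["sex"].mode()[0])

print(filled)
```

Imputation keeps all rows, but the filled-in values pull the column towards its mean and mode, which can understate the real spread of the data.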
<br>
For this project, rows with NaN values will be removed using the `dropna()` function.

https://datatofish.com/dropna/ (18/12/2021)

In [9]:
# Removing all rows with NaN values
penguins = penguins.dropna()

# Resetting the index column
penguins = penguins.reset_index(drop=True)

In [10]:
# Re-run info() to verify new size of dataset
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
 7   year               333 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 20.9+ KB


<br>
<a id='part3.2'></a>

### **3.2 Remove "Year" Column** 
<br>
The data set provided in Python contains a "year" column which indicates the year in which the data sample was collected. As this column will not be relevant for the data simulation, the column can be removed using the `drop()` function. 

https://pythonexamples.org/pandas-dataframe-delete-column/ (18/12/2021)

In [11]:
# Remove the "year" column
penguins = penguins.drop(['year'], axis=1)

In [12]:
# Display the first five rows to check that the year column has been removed
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
3,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female
4,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male


*Remove island?*

<a id='part4.0'></a>

## **4.0 Original Data Set Investigation**

In [13]:
# Overview of dataset: Count of records and data type of each column
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.3+ KB


The (cleansed) data set contains 333 rows and 7 columns. 
3 columns contain object data types (= categorical data)
- species
- island
- sex

4 columns contain floating-point numbers
- bill_length_mm
- bill_depth_mm
- flipper_length_mm
- body_mass_g
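
The same split can be obtained programmatically, which may be handy later when the simulation treats the two column types differently. The sketch below uses `select_dtypes()` on a hypothetical two-row frame `sample` copied from the cleansed data set; on the full data set the same calls would be applied to `penguins`.

```python
import pandas as pd

# Two rows copied from the cleansed data set, for illustration only
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie"],
    "island": ["Torgersen", "Torgersen"],
    "bill_length_mm": [39.1, 39.5],
    "bill_depth_mm": [18.7, 17.4],
    "flipper_length_mm": [181.0, 186.0],
    "body_mass_g": [3750.0, 3800.0],
    "sex": ["male", "female"],
})

# Columns holding numbers versus everything else (here: text labels)
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(categorical_cols)
print(numerical_cols)
```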

<a id='part4.1'></a>
### **4.1 Identifying values for categorical attributes**

To identify the values of the three categorical attributes, the functions `nunique()` and `unique()` can be used to count and display the unique values.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html (19/12/2021)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html (19/12/2021)

In [14]:
# Count unique records of each column
penguins.nunique()

species                3
island                 3
bill_length_mm       163
bill_depth_mm         79
flipper_length_mm     54
body_mass_g           93
sex                    2
dtype: int64

In [15]:
# Display unique values for "Species"
species = penguins['species'].unique()
print(species)

['Adelie' 'Gentoo' 'Chinstrap']


In [16]:
# Count instances for "Species"
penguins['species'].value_counts()

Adelie       146
Gentoo       119
Chinstrap     68
Name: species, dtype: int64

In [17]:
# Display unique values for "Island"
islands = penguins['island'].unique()
print(islands)

['Torgersen' 'Biscoe' 'Dream']


In [18]:
# Count instances for "Island"
penguins['island'].value_counts()

Biscoe       163
Dream        123
Torgersen     47
Name: island, dtype: int64

In [19]:
# Display unique values for "Sex"
sexes = penguins['sex'].unique()
print(sexes)

['male' 'female']


In [20]:
# Count instances for "Sex"
penguins['sex'].value_counts()

male      168
female    165
Name: sex, dtype: int64

<a id='part4.2'></a>
### **4.2 Investigating Numerical Values**

In [21]:
penguins.describe()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,333.0,333.0,333.0,333.0
mean,43.992793,17.164865,200.966967,4207.057057
std,5.468668,1.969235,14.015765,805.215802
min,32.1,13.1,172.0,2700.0
25%,39.5,15.6,190.0,3550.0
50%,44.5,17.3,197.0,4050.0
75%,48.6,18.7,213.0,4775.0
max,59.6,21.5,231.0,6300.0


In [22]:
# Overview of dataset: Statistical values for the 4 columns with numerical data
penguins.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,333.0,43.992793,5.468668,32.1,39.5,44.5,48.6,59.6
bill_depth_mm,333.0,17.164865,1.969235,13.1,15.6,17.3,18.7,21.5
flipper_length_mm,333.0,200.966967,14.015765,172.0,190.0,197.0,213.0,231.0
body_mass_g,333.0,4207.057057,805.215802,2700.0,3550.0,4050.0,4775.0,6300.0


In [23]:
# Investigate Bill Length by species
penguins.groupby('species')['bill_length_mm'].describe()

species,count,mean,std,min,25%,50%,75%,max
Adelie,146.0,38.823973,2.662597,32.1,36.725,38.85,40.775,46.0
Chinstrap,68.0,48.833824,3.339256,40.9,46.35,49.55,51.075,58.0
Gentoo,119.0,47.568067,3.106116,40.9,45.35,47.4,49.6,59.6


In [24]:
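# Correlations between the numerical columns (text columns are excluded explicitly)
penguins.select_dtypes('number').corr()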

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,-0.228626,0.653096,0.589451
bill_depth_mm,-0.228626,1.0,-0.577792,-0.472016
flipper_length_mm,0.653096,-0.577792,1.0,0.872979
body_mass_g,0.589451,-0.472016,0.872979,1.0


<a id='part4.3'></a>
### **4.3 Subsets of data by species**
Create variables for the species to be able to filter the results by species

In [25]:
# Filter variable for Adelie species
adelie = penguins[penguins['species'] == 'Adelie']

# Filter variable for Chinstrap species
chinstrap = penguins[penguins['species'] == 'Chinstrap']

# Filter variable for Gentoo species
gentoo = penguins[penguins['species'] == 'Gentoo']

In [26]:
# Correlations for species Adelie
adelie.select_dtypes('number').corr()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.385813,0.332274,0.544276
bill_depth_mm,0.385813,1.0,0.310897,0.580156
flipper_length_mm,0.332274,0.310897,1.0,0.464854
body_mass_g,0.544276,0.580156,0.464854,1.0


In [27]:
# Correlations for species Chinstrap
chinstrap.select_dtypes('number').corr()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.653536,0.471607,0.513638
bill_depth_mm,0.653536,1.0,0.580143,0.604498
flipper_length_mm,0.471607,0.580143,1.0,0.641559
body_mass_g,0.513638,0.604498,0.641559,1.0


In [28]:
# Correlations for species Gentoo
gentoo.select_dtypes('number').corr()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.654023,0.664205,0.66673
bill_depth_mm,0.654023,1.0,0.710642,0.722967
flipper_length_mm,0.664205,0.710642,1.0,0.711305
body_mass_g,0.66673,0.722967,0.711305,1.0
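
As an alternative to filtering each species separately, the per-species correlations can be computed in one call with `groupby()`. This is only a sketch: the frame `sample` is a hypothetical stand-in built from a few rows shown in earlier outputs, and it is limited to two numerical columns to keep it short.

```python
import pandas as pd

# Four Adelie and four Gentoo rows copied from outputs shown earlier
sample = pd.DataFrame({
    "species": ["Adelie"] * 4 + ["Gentoo"] * 4,
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 44.5, 46.2, 47.3, 44.5],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0, 4100.0, 4650.0, 4725.0, 4875.0],
})

numeric_cols = ["bill_length_mm", "body_mass_g"]

# One correlation matrix per species, stacked in a single frame
# indexed by (species, column name)
by_species = sample.groupby("species")[numeric_cols].corr()

print(by_species)
```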


<a id='part4.4'></a>
### **4.4 Subsets of data by sex**

#### Adelie

In [29]:
male_adelie = adelie[adelie['sex'] == 'male']
female_adelie = adelie[adelie['sex'] == 'female']

In [30]:
male_adelie.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,73.0,40.390411,2.277131,34.6,39.0,40.6,41.5,46.0
bill_depth_mm,73.0,19.072603,1.018886,17.0,18.5,18.9,19.6,21.5
flipper_length_mm,73.0,192.410959,6.599317,178.0,189.0,193.0,197.0,210.0
body_mass_g,73.0,4043.493151,346.811553,3325.0,3800.0,4000.0,4300.0,4775.0


In [31]:
male_adelie.select_dtypes('number').corr()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,-0.038247,0.313488,0.22037
bill_depth_mm,-0.038247,1.0,0.185328,0.159558
flipper_length_mm,0.313488,0.185328,1.0,0.360434
body_mass_g,0.22037,0.159558,0.360434,1.0


In [32]:
female_adelie.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,73.0,37.257534,2.028883,32.1,35.9,37.0,38.8,42.2
bill_depth_mm,73.0,17.621918,0.942993,15.5,17.0,17.6,18.3,20.7
flipper_length_mm,73.0,187.794521,5.595035,172.0,185.0,188.0,191.0,202.0
body_mass_g,73.0,3368.835616,269.380102,2850.0,3175.0,3400.0,3550.0,3900.0


#### Chinstrap

In [33]:
male_chinstrap = chinstrap[chinstrap['sex'] == 'male']
female_chinstrap = chinstrap[chinstrap['sex'] == 'female']

In [34]:
male_chinstrap.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,34.0,51.094118,1.564558,48.5,50.05,50.95,51.975,55.8
bill_depth_mm,34.0,19.252941,0.761273,17.5,18.8,19.3,19.8,20.8
flipper_length_mm,34.0,199.911765,5.976558,187.0,196.0,200.5,203.0,212.0
body_mass_g,34.0,3938.970588,362.13755,3250.0,3731.25,3950.0,4100.0,4800.0


In [35]:
female_chinstrap.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,34.0,46.573529,3.108669,40.9,45.425,46.3,47.375,58.0
bill_depth_mm,34.0,17.588235,0.781128,16.4,17.0,17.65,18.05,19.4
flipper_length_mm,34.0,191.735294,5.754096,178.0,187.25,192.0,195.75,202.0
body_mass_g,34.0,3527.205882,285.333912,2700.0,3362.5,3550.0,3693.75,4150.0


#### Gentoo

In [36]:
male_gentoo = gentoo[gentoo['sex'] == 'male']
female_gentoo = gentoo[gentoo['sex'] == 'female']

In [37]:
male_gentoo.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,61.0,49.47377,2.720594,44.4,48.1,49.5,50.5,59.6
bill_depth_mm,61.0,15.718033,0.74106,14.1,15.2,15.7,16.1,17.3
flipper_length_mm,61.0,221.540984,5.673252,208.0,218.0,221.0,225.0,231.0
body_mass_g,61.0,5484.836066,313.158596,4750.0,5300.0,5500.0,5700.0,6300.0


In [38]:
female_gentoo.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,58.0,45.563793,2.051247,40.9,43.85,45.5,46.875,50.5
bill_depth_mm,58.0,14.237931,0.540249,13.1,13.8,14.25,14.6,15.5
flipper_length_mm,58.0,212.706897,3.897856,203.0,210.0,212.0,215.0,222.0
body_mass_g,58.0,4679.741379,281.578294,3950.0,4462.5,4700.0,4875.0,5200.0


In [39]:
penguins.groupby(['sex', 'island'])['flipper_length_mm'].mean()

sex     island   
female  Biscoe       205.687500
        Dream        190.016393
        Torgersen    188.291667
male    Biscoe       213.289157
        Dream        196.306452
        Torgersen    194.913043
Name: flipper_length_mm, dtype: float64

Does the island have an impact on the measurements of the Adelie species? 

In [40]:
penguins.groupby(['species','island', 'sex']).mean()

species,island,sex,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
Adelie,Biscoe,female,37.359091,17.704545,187.181818,3369.318182
Adelie,Biscoe,male,40.590909,19.036364,190.409091,4050.0
Adelie,Dream,female,36.911111,17.618519,187.851852,3344.444444
Adelie,Dream,male,40.071429,18.839286,191.928571,4045.535714
Adelie,Torgersen,female,37.554167,17.55,188.291667,3395.833333
Adelie,Torgersen,male,40.586957,19.391304,194.913043,4034.782609
Chinstrap,Dream,female,46.573529,17.588235,191.735294,3527.205882
Chinstrap,Dream,male,51.094118,19.252941,199.911765,3938.970588
Gentoo,Biscoe,female,45.563793,14.237931,212.706897,4679.741379
Gentoo,Biscoe,male,49.47377,15.718033,221.540984,5484.836066


In [41]:
male_adelie.groupby('island').agg({'bill_length_mm': ['mean', 'std']})

island,bill_length_mm mean,bill_length_mm std
Biscoe,40.590909,2.006634
Dream,40.071429,1.748196
Torgersen,40.586957,3.027496


In [42]:
# Comparing averages for species Adelie
adelie.groupby(['sex', 'island']).mean(numeric_only=True)

sex,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
female,Biscoe,37.359091,17.704545,187.181818,3369.318182
female,Dream,36.911111,17.618519,187.851852,3344.444444
female,Torgersen,37.554167,17.55,188.291667,3395.833333
male,Biscoe,40.590909,19.036364,190.409091,4050.0
male,Dream,40.071429,18.839286,191.928571,4045.535714
male,Torgersen,40.586957,19.391304,194.913043,4034.782609


<a id='part5.0'></a>
## **5.0 Data Set Simulation**
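
One possible approach, sketched here for a single group only, is to draw the four measurements from a multivariate normal distribution built from the summary statistics in section 4.4. The means, standard deviations and correlations below are copied from the male Adelie outputs; the joint normality of the measurements, the seed and the name `sim_male_adelie` are assumptions made for this sketch, and the island column is left out.

```python
import numpy as np
import pandas as pd

columns = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]

# Summary statistics for male Adelie penguins, copied from section 4.4
means = np.array([40.390411, 19.072603, 192.410959, 4043.493151])
stds = np.array([2.277131, 1.018886, 6.599317, 346.811553])
corr = np.array([
    [1.0, -0.038247, 0.313488, 0.220370],
    [-0.038247, 1.0, 0.185328, 0.159558],
    [0.313488, 0.185328, 1.0, 0.360434],
    [0.220370, 0.159558, 0.360434, 1.0],
])

# Covariance matrix: cov[i, j] = corr[i, j] * std[i] * std[j]
cov = corr * np.outer(stds, stds)

# Assumption: the four measurements are jointly normal within this group
rng = np.random.default_rng(seed=42)
values = rng.multivariate_normal(means, cov, size=73)

# Build a frame with the same layout as the original (without island)
sim_male_adelie = pd.DataFrame(values, columns=columns).round(1)
sim_male_adelie.insert(0, "species", "Adelie")
sim_male_adelie["sex"] = "male"

print(sim_male_adelie.head())
```

Repeating this for each species and sex combination, with the group sizes from section 4.4, and concatenating the results with `pd.concat()` would give a full simulated data set under the same assumptions.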

<a id='part6.0'></a>
## **6.0 Data Set Comparison/Validation**
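
A simple way to validate a simulated data set is to compare its summary statistics with those of the original. The helper `compare_summary()` below is a hypothetical sketch, not part of pandas; the two frames are stand-ins for illustration (five rows copied from the cleansed data set, and random draws based on that small sample's own mean and standard deviation).

```python
import numpy as np
import pandas as pd

def compare_summary(original, simulated):
    """Return mean and std of both frames side by side, plus the mean difference in percent."""
    cols = original.select_dtypes("number").columns
    summary = pd.DataFrame({
        "orig_mean": original[cols].mean(),
        "sim_mean": simulated[cols].mean(),
        "orig_std": original[cols].std(),
        "sim_std": simulated[cols].std(),
    })
    summary["mean_diff_pct"] = (
        100 * (summary["sim_mean"] - summary["orig_mean"]) / summary["orig_mean"]
    )
    return summary

# Stand-in for the original data: five rows copied from the cleansed data set
original = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 39.3],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0, 3650.0],
})

# Stand-in for the simulated data: normal draws using the sample's mean and std
rng = np.random.default_rng(seed=0)
simulated = pd.DataFrame({
    col: rng.normal(original[col].mean(), original[col].std(), size=200)
    for col in original.columns
})

print(compare_summary(original, simulated))
```

Comparing means and standard deviations only checks each column on its own; the correlation matrices of both data sets could be compared in the same way to check the relationships between columns.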

## **References Used:**

*Placeholder*