# Programming for Data Analysis - Project

# **Data Set Simulation Based on the Palmer Penguins Data**

**Instructions:**

Create a data set by simulating a real-world phenomenon. 
- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables
- Investigate the types of variables involved, their likely distributions, and their relationships with each other
- Synthesise/simulate a data set as closely matching their properties as possible
- Detail your research and implement the simulation in a Jupyter notebook - the data set itself can simply be displayed in an output cell within the notebook




## **Table of Contents**

[**1.0 Intro/Overview of Selected Data Set**](#part1.0)<br/>
[**2.0 Accessing the Original Data Set**](#part2.0) <br/>
[**3.0 Preparing the Original Data Set for Analysis**](#part3.0)<br/>
[**3.1 Remove missing values**](#part3.1)<br/>
[**3.2 Remove "Year" Column**](#part3.2)<br/>
[**4.0 Original Data Set Investigation**](#part4.0)<br/>
[**4.1 Identifying values for categorical attributes**](#part4.1)<br/>
[**4.2 Investigating Numerical Values**](#part4.2)<br/>
[**4.3 Subsets of data by species**](#part4.3)<br/>
[**4.4 Subsets of data by sex**](#part4.4)<br/>
[**5.0 Data Set Simulation**](#part5.0)<br/>
[**6.0 Data Set Comparison/Validation**](#part6.0)<br/>

<a id='part1.0'></a>
## **1.0 Intro/Overview of Selected Data Set**
This notebook investigates the Palmer Penguins data set and, based on the findings, creates a simulated data set that matches the properties of the original data. 
I selected the Palmer Penguins data set because it contains several attributes that can be used to classify the three different species. The original research paper and its accompanying data set are easily available, as are cleansed and simplified versions of the data, for example as a Python library. 


The original research paper, published in 2014, investigates size differences between male and female penguins of the Pygoscelis genus, known as sexual dimorphism. 

The research was carried out on three species of penguins nesting on three islands of the Palmer Archipelago in Antarctica between 2007 and 2009 during the egg-laying period (page 2): 
- Islands: Biscoe, Torgersen and Dream
- Species: Adelie, Gentoo and Chinstrap
- The sample size for Chinstrap is reduced due to the overall smaller population on the island

Measurements taken (page 3): 
- Blood samples for determination of sex and for stable isotope (SI) analysis
- Measurements of structural size and body mass: 
  - Culmen length
  - Culmen depth
  - Right flipper
  - Body mass
- Sea ice concentration (SIC), used to calculate the average sea ice area and the duration of the winter ice season

Sample sizes differ because weather conditions hindered access to the penguins (page 3). 

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081# (18/12/2021)

Results overview: 
"Our results demonstrate inter-specific differences in best morphological predictors of sex among Pygoscelis penguins. Adelie penguin body mass and culmen length were the strongest predictors of sex, while body mass and culmen depth were best predictors of male and female gentoo penguins. For chinstrap penguins, body mass was the least predictive structural feature, while culmen length and depth were similarly strong predictors of sex. Species-specific models based on these best morphological predictors correctly classified a high percentage of individuals from independent datasets (i.e., 89–94%). Interestingly, flipper length was not a strong predictor of sex for any of the three species. Culmen features and body mass are structures important during penguin courtship [52], and therefore, likely targets of sexual selection, which may be why these parameters are strong predictors of sex across Pygoscelis species." (page 9)


Links: 

Data set investigation
https://github.com/allisonhorst/palmerpenguins (18/12/2021)

https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data (18/12/2021)

https://towardsdatascience.com/penguins-dataset-overview-iris-alternative-9453bb8c8d95 (18/12/2021)

https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris (18/12/2021)

https://www.gabemednick.com/post/penguin/ (18/12/2021)

https://github.com/mcnakhaee/palmerpenguins (18/12/2021)

https://github.com/YuOlvera/PalmerPenguins (18/12/2021)

https://www.python-graph-gallery.com/web-ggbetweenstats-with-matplotlib (18/12/2021)

https://inria.github.io/scikit-learn-mooc/python_scripts/trees_dataset.html (18/12/2021)



https://www.youtube.com/watch?v=Eai1jaZrRDs (18/12/2021)

https://www.youtube.com/watch?v=6kD2OzF2uoU (18/12/2021)

https://www.youtube.com/watch?v=uiYgZomY-v4 (18/12/2021)



Penguins and Antarctica: 
https://en.wikipedia.org/wiki/Ad%C3%A9lie_penguin (18/12/2021)

https://en.wikipedia.org/wiki/Chinstrap_penguin (18/12/2021)

https://en.wikipedia.org/wiki/Gentoo_penguin (18/12/2021)

https://en.wikipedia.org/wiki/Pygoscelis (18/12/2021)

https://en.wikipedia.org/wiki/Palmer_Archipelago (18/12/2021)


<a id='part2.0'></a>
## **2.0 Accessing the Original Data Set**

There are several options for loading the Palmer Penguins data set in Python. A selection is listed below: 
- Using the `read_csv()` function from an online csv version of the data set
- Importing the Python library palmerpenguins
- Importing the data set from the seaborn library
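
As a minimal sketch of the first option, the example below passes CSV text to `read_csv()`. To keep it self-contained, the text is an in-memory copy of the first four rows shown later in this notebook; the variable names `csv_text` and `penguins_sample` are only illustrative. With a real file, the argument would be a file path or URL instead.

```python
import io

import pandas as pd

# First rows of the data set as displayed later in this notebook, stored as CSV text.
# With a real file, a path or URL would be passed to read_csv() instead.
csv_text = """species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
Adelie,Torgersen,,,,,,2007
"""

# Empty fields are read as NaN
penguins_sample = pd.read_csv(io.StringIO(csv_text))

print(penguins_sample.shape)
print(penguins_sample.isna().sum().sum())
```
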

For this notebook the data set will be accessed from the palmerpenguins library. To be able to use it, the library first needs to be installed, for example using the command `pip install palmerpenguins`. 

[[x] Python Software Foundation, 2021: *palmerpenguins 0.1.4*](https://pypi.org/project/palmerpenguins) (Accessed 18 December 2021)

In [1]:
# Importing the penguins dataset
from palmerpenguins import load_penguins

# Importing pandas for working with datasets
import pandas as pd

The dataset can then be defined as a variable.

In [2]:
# Defining the dataset
penguins = load_penguins()

<a id='part3.0'></a>
## **3.0 Preparing the Original Data Set for Analysis**

The palmerpenguins library contains a modified version of the original dataset. For easier usage some columns have been removed which were considered not relevant for data analysis on classification, for example identifiers of the original study and dates of egg-laying. 

[[x] HORST, A.M., HILL, A.P., GORMAN, K.B., 2020: *About the data*](https://allisonhorst.github.io/palmerpenguins/#about-the-data) (Accessed 19 December 2021)

Further cleanup of the data set might be required ahead of starting the analysis. To get a first understanding of the data, the pandas library offers a number of options, for example using the `info()` function for an overview of the columns, the number of values and their data type. 

In [3]:
# Overview of the data set
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB


The above output shows that the data set consists of 8 columns and 344 rows. The columns contain three different data types: 

**object (= text values, here used for categorical data):**
- species
- island
- sex

**float64 (= floating-point numbers):**
- bill_length_mm
- bill_depth_mm
- flipper_length_mm
- body_mass_g

**int64 (= whole numbers):**
- year


It also shows that some of the columns have missing (null) values. As these can have an impact on the accuracy of the data analysis, they need to be investigated further. 
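
The split into text and numerical columns can also be obtained programmatically. The sketch below uses `select_dtypes()` on a small illustrative DataFrame (`sample`, built from two rows of the data set) rather than on the full data; the variable names are assumptions made for this example.

```python
import pandas as pd

# Small illustrative frame with the same kinds of columns as the penguins data
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie"],
    "bill_length_mm": [39.1, 39.5],
    "sex": ["male", "female"],
    "year": [2007, 2007],
})

# Columns holding numbers (float64 and int64)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# All remaining columns, here the text (categorical) ones
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```
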

For a visual inspection of the data set, the `head()` function is useful: it displays the first rows of the data set. The number of rows to display can be set by passing a parameter. If left blank, the first five rows are displayed by default. 

In [4]:
# Displaying the first ten rows of the data set
penguins.head(10)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,female,2007
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,male,2007
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,,2007
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,,2007


<a id='part3.1'></a>
### **3.1 Remove missing values** <br>
The output of the `info()` function shows that there are 344 data records in total. Some of the columns contain null values, namely bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g and sex. One of the affected rows is also visible when displaying the first 10 rows (see row index 3). 

To get a better understanding of the full picture, variations of the `isna()` function can be used to determine the number of instances of NaN values and identify affected records. 

[[x] Data to Fish, 2021: *Check for NaN in Pandas DataFrame*](https://datatofish.com/check-nan-pandas-dataframe/) (Accessed 18 December 2021)

[[x] Data to Fish, 2021: *Select all Rows with NaN Values in Pandas DataFrame*](https://datatofish.com/rows-with-nan-pandas-dataframe/) (Accessed 18 December 2021)


In [5]:
# Determine if there are NaN values in whole data set
penguins.isna().values.any()

True

In [6]:
# Count NaN values in whole data set
penguins.isna().sum().sum()

19

In [7]:
# Display number of NaN values by column
penguins.isna().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64

In [8]:
# Display all rows with one or more values = NaN
penguins[penguins.isna().any(axis=1)]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
3,Adelie,Torgersen,,,,,,2007
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,,2007
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,,2007
10,Adelie,Torgersen,37.8,17.1,186.0,3300.0,,2007
11,Adelie,Torgersen,37.8,17.3,180.0,3700.0,,2007
47,Adelie,Dream,37.5,18.9,179.0,2975.0,,2007
178,Gentoo,Biscoe,44.5,14.3,216.0,4100.0,,2007
218,Gentoo,Biscoe,46.2,14.4,214.0,4650.0,,2008
256,Gentoo,Biscoe,47.3,13.8,216.0,4725.0,,2009
268,Gentoo,Biscoe,44.5,15.7,217.0,4875.0,,2009


Different methods are available for dealing with missing values in a data set, for example using imputation to substitute missing values with an estimate. For numerical variables this can be achieved by using the column mean, for categorical values by using the most frequent value (mode). 

[[x] garg_ak0109, 2019: *Python | Replace NaN values with average of columns*](https://www.geeksforgeeks.org/python-replace-nan-values-with-average-of-columns/?ref=rp) (Accessed 19 December 2021) 

[[x] Wikipedia Contributors, 2021: *Imputation (statistics)*](https://en.wikipedia.org/wiki/Imputation_(statistics)) (Accessed 25 December 2021)
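
As an illustration of the imputation approach described above (which is not the approach used in this notebook), the following sketch fills a numerical column with its mean and a categorical column with its mode. The DataFrame `toy` is a made-up miniature example, not the penguins data.

```python
import numpy as np
import pandas as pd

# Made-up miniature data set with one missing value per column
toy = pd.DataFrame({
    "body_mass_g": [3750.0, 3800.0, np.nan, 3450.0],
    "sex": ["male", "female", "female", None],
})

imputed = toy.copy()

# Numerical column: replace NaN with the column mean
imputed["body_mass_g"] = imputed["body_mass_g"].fillna(imputed["body_mass_g"].mean())

# Categorical column: replace missing values with the most frequent value (mode)
imputed["sex"] = imputed["sex"].fillna(imputed["sex"].mode()[0])

print(imputed)
```

Imputation keeps all rows but changes the distribution of the filled columns slightly, which is one reason why simply dropping incomplete rows can be the simpler choice for a small number of missing values.
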
<br>
For this project, rows with NaN values will be removed using the `dropna()` function.

[[x] Data to Fish, 2021: *How to Drop Rows with NaN Values in Pandas DataFrame*](https://datatofish.com/dropna/) (Accessed 18 December 2021)

In [9]:
# Removing all rows with NaN values
penguins = penguins.dropna()

# Resetting the index column
penguins = penguins.reset_index(drop=True)

In [10]:
# Re-run info() to verify new size of dataset and all values are non-null
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
 7   year               333 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 20.9+ KB


<br>
<a id='part3.2'></a>

### **3.2 Remove the "Year" Column** 
<br>

The data set provided in Python contains a "year" column which indicates the year in which the data sample was collected. As this column will not be relevant for the data simulation, the column can be removed using the `drop()` function. 

[[x] Python Examples, : *How to Delete Column(s) of Pandas DataFrame?*](https://pythonexamples.org/pandas-dataframe-delete-column/) (Accessed 18 December 2021)

In [11]:
# Remove the "year" column
penguins = penguins.drop(['year'], axis=1)

In [12]:
# Display the first five rows to check that the "year" column has been removed
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
3,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female
4,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male


*Remove island?*

<a id='part4.0'></a>

## **4.0 Original Data Set Investigation**

In [13]:
# Overview of dataset: Count of records and data type of each column
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.3+ KB


The (cleansed) data set contains 333 rows and 7 columns. 

3 columns contain object data types (= categorical data):
- species
- island
- sex

4 columns contain floating-point numbers:
- bill_length_mm
- bill_depth_mm
- flipper_length_mm
- body_mass_g

<a id='part4.1'></a>
### **4.1 Identifying values for categorical attributes**

To identify the different values of the three categorical variables, the `nunique()` and `unique()` functions can be used to count and display the unique values. 

[[x] The pandas development team, 2021: *pandas.DataFrame.nunique*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html) (Accessed 19 December 2021)

[[x] The pandas development team, 2021: *pandas.Series.unique*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html) (Accessed 19 December 2021)

In [14]:
# Count unique records of each column
penguins.nunique()

species                3
island                 3
bill_length_mm       163
bill_depth_mm         79
flipper_length_mm     54
body_mass_g           93
sex                    2
dtype: int64

#### **4.1.1 Identifying Values for "Species"**

In [15]:
# Display unique values for "Species"
species = penguins['species'].unique()
print(species)

['Adelie' 'Gentoo' 'Chinstrap']


In [16]:
# Count instances for "Species"
penguins['species'].value_counts()

Adelie       146
Gentoo       119
Chinstrap     68
Name: species, dtype: int64

#### **4.1.2 Identifying Values for "Island"**

In [17]:
# Display unique values for "Island"
islands = penguins['island'].unique()
print(islands)

['Torgersen' 'Biscoe' 'Dream']


In [18]:
# Count instances for "Island"
penguins['island'].value_counts()

Biscoe       163
Dream        123
Torgersen     47
Name: island, dtype: int64

#### **4.1.3 Identifying Values for "Sex"**

In [19]:
# Display unique values for "Sex"
sexes = penguins['sex'].unique()
print(sexes)

['male' 'female']


In [20]:
# Count instances for "Sex"
penguins['sex'].value_counts()

male      168
female    165
Name: sex, dtype: int64

<a id='part4.2'></a>
### **4.2 Investigating Numerical Values**

In [21]:
penguins.describe()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,333.0,333.0,333.0,333.0
mean,43.992793,17.164865,200.966967,4207.057057
std,5.468668,1.969235,14.015765,805.215802
min,32.1,13.1,172.0,2700.0
25%,39.5,15.6,190.0,3550.0
50%,44.5,17.3,197.0,4050.0
75%,48.6,18.7,213.0,4775.0
max,59.6,21.5,231.0,6300.0


The `describe()` function provides an overview of descriptive statistics for columns with numerical data. 

[[x] The pandas development team, 2021: *pandas.DataFrame.describe*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) (Accessed 19 December 2021)

In [22]:
# Overview of dataset: Statistical values for the 4 columns with numerical data
penguins.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,333.0,43.992793,5.468668,32.1,39.5,44.5,48.6,59.6
bill_depth_mm,333.0,17.164865,1.969235,13.1,15.6,17.3,18.7,21.5
flipper_length_mm,333.0,200.966967,14.015765,172.0,190.0,197.0,213.0,231.0
body_mass_g,333.0,4207.057057,805.215802,2700.0,3550.0,4050.0,4775.0,6300.0


The `corr()` function can be used to identify correlations between variables. 



[[x] W3Schools, 2021: *Pandas - Data Correlations*](https://www.w3schools.com/python/pandas/pandas_correlations.asp) (Accessed 25 December 2021) 

[[x] The pandas development team, 2021: *pandas.DataFrame.corr*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) (Accessed 25 December 2021)
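
By default, `corr()` calculates the Pearson correlation coefficient, which is the covariance of two variables divided by the product of their standard deviations. The following sketch computes this value by hand with NumPy and compares it with the pandas result. It uses only the flipper length and body mass of the first five complete rows shown by `head()` earlier, so the resulting number is not representative of the full data set.

```python
import numpy as np
import pandas as pd

# Flipper length (mm) and body mass (g) of the first five complete rows
flipper = np.array([181.0, 186.0, 195.0, 193.0, 190.0])
mass = np.array([3750.0, 3800.0, 3250.0, 3450.0, 3650.0])

# Deviations from the respective means
dx = flipper - flipper.mean()
dy = mass - mass.mean()

# Pearson's r: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same value calculated by pandas
r_pandas = pd.Series(flipper).corr(pd.Series(mass))

print(round(r_manual, 4), round(r_pandas, 4))
```
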



In [23]:
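# Display correlations between the numerical variables
# (numeric_only=True skips the text columns; pandas 2.0+ raises an error without it)
penguins.corr(numeric_only=True)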

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,-0.228626,0.653096,0.589451
bill_depth_mm,-0.228626,1.0,-0.577792,-0.472016
flipper_length_mm,0.653096,-0.577792,1.0,0.872979
body_mass_g,0.589451,-0.472016,0.872979,1.0


In [24]:
# Investigate bill length by species
penguins.groupby('species')['bill_length_mm'].describe()

species,count,mean,std,min,25%,50%,75%,max
Adelie,146.0,38.823973,2.662597,32.1,36.725,38.85,40.775,46.0
Chinstrap,68.0,48.833824,3.339256,40.9,46.35,49.55,51.075,58.0
Gentoo,119.0,47.568067,3.106116,40.9,45.35,47.4,49.6,59.6


<a id='part4.3'></a>
### **4.3 Subsets of data by species**
A subset is created for each species so that the results can be filtered by species.

In [25]:
# Filter variable for Adelie species
adelie = penguins[penguins['species'] == 'Adelie']

# Filter variable for Chinstrap species
chinstrap = penguins[penguins['species'] == 'Chinstrap']

# Filter variable for Gentoo species
gentoo = penguins[penguins['species'] == 'Gentoo']

In [26]:
# View correlations for species Adelie
adelie.corr(numeric_only=True)

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.385813,0.332274,0.544276
bill_depth_mm,0.385813,1.0,0.310897,0.580156
flipper_length_mm,0.332274,0.310897,1.0,0.464854
body_mass_g,0.544276,0.580156,0.464854,1.0


In [27]:
# View correlations for species Chinstrap
chinstrap.corr(numeric_only=True)

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.653536,0.471607,0.513638
bill_depth_mm,0.653536,1.0,0.580143,0.604498
flipper_length_mm,0.471607,0.580143,1.0,0.641559
body_mass_g,0.513638,0.604498,0.641559,1.0


In [28]:
# View correlations for species Gentoo
gentoo.corr(numeric_only=True)

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.654023,0.664205,0.66673
bill_depth_mm,0.654023,1.0,0.710642,0.722967
flipper_length_mm,0.664205,0.710642,1.0,0.711305
body_mass_g,0.66673,0.722967,0.711305,1.0


<a id='part4.4'></a>
### **4.4 Subsets of data by sex**

#### Adelie

In [29]:
male_adelie = adelie[adelie['sex'] == 'male']
female_adelie = adelie[adelie['sex'] == 'female']

In [30]:
male_adelie.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,73.0,40.390411,2.277131,34.6,39.0,40.6,41.5,46.0
bill_depth_mm,73.0,19.072603,1.018886,17.0,18.5,18.9,19.6,21.5
flipper_length_mm,73.0,192.410959,6.599317,178.0,189.0,193.0,197.0,210.0
body_mass_g,73.0,4043.493151,346.811553,3325.0,3800.0,4000.0,4300.0,4775.0


In [31]:
male_adelie.corr(numeric_only=True)

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,-0.038247,0.313488,0.22037
bill_depth_mm,-0.038247,1.0,0.185328,0.159558
flipper_length_mm,0.313488,0.185328,1.0,0.360434
body_mass_g,0.22037,0.159558,0.360434,1.0


In [32]:
female_adelie.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,73.0,37.257534,2.028883,32.1,35.9,37.0,38.8,42.2
bill_depth_mm,73.0,17.621918,0.942993,15.5,17.0,17.6,18.3,20.7
flipper_length_mm,73.0,187.794521,5.595035,172.0,185.0,188.0,191.0,202.0
body_mass_g,73.0,3368.835616,269.380102,2850.0,3175.0,3400.0,3550.0,3900.0


In [33]:
female_adelie.corr(numeric_only=True)

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.160636,-0.03724,0.170095
bill_depth_mm,0.160636,1.0,0.064044,0.396937
flipper_length_mm,-0.03724,0.064044,1.0,0.26293
body_mass_g,0.170095,0.396937,0.26293,1.0


#### Chinstrap

In [34]:
male_chinstrap = chinstrap[chinstrap['sex'] == 'male']
female_chinstrap = chinstrap[chinstrap['sex'] == 'female']

In [35]:
male_chinstrap.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,34.0,51.094118,1.564558,48.5,50.05,50.95,51.975,55.8
bill_depth_mm,34.0,19.252941,0.761273,17.5,18.8,19.3,19.8,20.8
flipper_length_mm,34.0,199.911765,5.976558,187.0,196.0,200.5,203.0,212.0
body_mass_g,34.0,3938.970588,362.13755,3250.0,3731.25,3950.0,4100.0,4800.0


In [36]:
male_chinstrap.corr(numeric_only=True)

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.44627,0.169109,0.238285
bill_depth_mm,0.44627,1.0,0.421323,0.345404
flipper_length_mm,0.169109,0.421323,1.0,0.664588
body_mass_g,0.238285,0.345404,0.664588,1.0


In [37]:
female_chinstrap.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,34.0,46.573529,3.108669,40.9,45.425,46.3,47.375,58.0
bill_depth_mm,34.0,17.588235,0.781128,16.4,17.0,17.65,18.05,19.4
flipper_length_mm,34.0,191.735294,5.754096,178.0,187.25,192.0,195.75,202.0
body_mass_g,34.0,3527.205882,285.333912,2700.0,3362.5,3550.0,3693.75,4150.0


In [38]:
female_chinstrap.corr(numeric_only=True)

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.256317,0.121909,0.275594
bill_depth_mm,0.256317,1.0,0.135474,0.391344
flipper_length_mm,0.121909,0.135474,1.0,0.24215
body_mass_g,0.275594,0.391344,0.24215,1.0


#### Gentoo

In [39]:
male_gentoo = gentoo[gentoo['sex'] == 'male']
female_gentoo = gentoo[gentoo['sex'] == 'female']

In [40]:
male_gentoo.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,61.0,49.47377,2.720594,44.4,48.1,49.5,50.5,59.6
bill_depth_mm,61.0,15.718033,0.74106,14.1,15.2,15.7,16.1,17.3
flipper_length_mm,61.0,221.540984,5.673252,208.0,218.0,221.0,225.0,231.0
body_mass_g,61.0,5484.836066,313.158596,4750.0,5300.0,5500.0,5700.0,6300.0


In [41]:
male_gentoo.corr(numeric_only=True)

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.306767,0.520438,0.39131
bill_depth_mm,0.306767,1.0,0.470975,0.253457
flipper_length_mm,0.520438,0.470975,1.0,0.330452
body_mass_g,0.39131,0.253457,0.330452,1.0


In [42]:
female_gentoo.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,58.0,45.563793,2.051247,40.9,43.85,45.5,46.875,50.5
bill_depth_mm,58.0,14.237931,0.540249,13.1,13.8,14.25,14.6,15.5
flipper_length_mm,58.0,212.706897,3.897856,203.0,210.0,212.0,215.0,222.0
body_mass_g,58.0,4679.741379,281.578294,3950.0,4462.5,4700.0,4875.0,5200.0


In [43]:
female_gentoo.corr(numeric_only=True)

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.430444,0.206881,0.271926
bill_depth_mm,0.430444,1.0,0.307793,0.371881
flipper_length_mm,0.206881,0.307793,1.0,0.487618
body_mass_g,0.271926,0.371881,0.487618,1.0


In [44]:
penguins.groupby(['sex', 'island'])['flipper_length_mm'].mean()

sex     island   
female  Biscoe       205.687500
        Dream        190.016393
        Torgersen    188.291667
male    Biscoe       213.289157
        Dream        196.306452
        Torgersen    194.913043
Name: flipper_length_mm, dtype: float64

### **4.5 Subsets of data by island**

For the Adelie species, data samples were collected on three different islands. This section investigates whether the island has an impact on the measurements of the Adelie species. 

In [45]:
# Grouping the penguin data set by species, island and sex: 
penguins.groupby(['species','island', 'sex']).mean()

species,island,sex,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
Adelie,Biscoe,female,37.359091,17.704545,187.181818,3369.318182
Adelie,Biscoe,male,40.590909,19.036364,190.409091,4050.0
Adelie,Dream,female,36.911111,17.618519,187.851852,3344.444444
Adelie,Dream,male,40.071429,18.839286,191.928571,4045.535714
Adelie,Torgersen,female,37.554167,17.55,188.291667,3395.833333
Adelie,Torgersen,male,40.586957,19.391304,194.913043,4034.782609
Chinstrap,Dream,female,46.573529,17.588235,191.735294,3527.205882
Chinstrap,Dream,male,51.094118,19.252941,199.911765,3938.970588
Gentoo,Biscoe,female,45.563793,14.237931,212.706897,4679.741379
Gentoo,Biscoe,male,49.47377,15.718033,221.540984,5484.836066


In [46]:
male_adelie.groupby('island').agg({'bill_length_mm': ['mean', 'std']})

island,bill_length_mm (mean),bill_length_mm (std)
Biscoe,40.590909,2.006634
Dream,40.071429,1.748196
Torgersen,40.586957,3.027496
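
One rough way to judge whether the island matters is to compare the spread of the island means with the variation within each island. The sketch below does this for the bill length of male Adelie penguins, using the values from the table above. The variable names and the choice of this simple ratio are assumptions made for illustration; it is not a formal statistical test.

```python
import pandas as pd

# Mean and standard deviation of bill length (mm) for male Adelie penguins per island,
# copied from the table above
stats = pd.DataFrame(
    {
        "mean": [40.590909, 40.071429, 40.586957],
        "std": [2.006634, 1.748196, 3.027496],
    },
    index=["Biscoe", "Dream", "Torgersen"],
)

# Largest difference between any two island means
spread = stats["mean"].max() - stats["mean"].min()

# Spread of the means relative to the smallest within-island standard deviation
ratio = spread / stats["std"].min()

print(round(spread, 3), round(ratio, 3))
```

A ratio well below 1 means that the island means differ by less than the typical variation within a single island, which would suggest that the island has little influence on this measurement.
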


In [47]:
# Comparing averages for species Adelie
adelie.groupby(['sex', 'island']).mean(numeric_only=True)

sex,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
female,Biscoe,37.359091,17.704545,187.181818,3369.318182
female,Dream,36.911111,17.618519,187.851852,3344.444444
female,Torgersen,37.554167,17.55,188.291667,3395.833333
male,Biscoe,40.590909,19.036364,190.409091,4050.0
male,Dream,40.071429,18.839286,191.928571,4045.535714
male,Torgersen,40.586957,19.391304,194.913043,4034.782609


In [48]:
# Comparing standard deviation for species Adelie
adelie.groupby(['sex', 'island']).std(numeric_only=True)

sex,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
female,Biscoe,1.762212,1.091298,6.744567,343.470715
female,Dream,2.089043,0.897448,5.510156,212.056475
female,Torgersen,2.207887,0.879723,4.638958,259.144356
male,Biscoe,2.006634,0.879689,6.463517,355.567956
male,Dream,1.748196,1.033276,6.803749,330.547636
male,Torgersen,3.027496,1.082469,5.915412,372.471714


In [49]:
# Comparing minimum values for species Adelie
adelie.groupby(['sex', 'island']).min()

sex,island,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
female,Biscoe,Adelie,34.5,16.0,172.0,2850.0
female,Dream,Adelie,32.1,15.5,178.0,2900.0
female,Torgersen,Adelie,33.5,15.9,176.0,2900.0
male,Biscoe,Adelie,37.6,17.2,180.0,3550.0
male,Dream,Adelie,36.3,17.0,178.0,3425.0
male,Torgersen,Adelie,34.6,17.6,181.0,3325.0


In [50]:
# Comparing maximum values for species Adelie
adelie.groupby(['sex', 'island']).max()

sex,island,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
female,Biscoe,Adelie,40.5,20.7,199.0,3900.0
female,Dream,Adelie,42.2,19.3,202.0,3700.0
female,Torgersen,Adelie,41.1,19.3,196.0,3800.0
male,Biscoe,Adelie,45.6,21.1,203.0,4775.0
male,Dream,Adelie,44.1,21.2,208.0,4650.0
male,Torgersen,Adelie,46.0,21.5,210.0,4700.0


In [51]:
# Subset of data for Adelie on Biscoe Point
female_adelie_biscoe = female_adelie[female_adelie['island'] == 'Biscoe']
male_adelie_biscoe = male_adelie[male_adelie['island'] == 'Biscoe']

In [52]:
female_adelie_biscoe.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,22.0,37.359091,1.762212,34.5,35.75,37.75,38.475,40.5
bill_depth_mm,22.0,17.704545,1.091298,16.0,17.0,17.7,18.25,20.7
flipper_length_mm,22.0,187.181818,6.744567,172.0,184.25,187.0,191.75,199.0
body_mass_g,22.0,3369.318182,343.470715,2850.0,3150.0,3375.0,3693.75,3900.0


In [53]:
male_adelie_biscoe.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,22.0,40.590909,2.006634,37.6,39.025,40.8,41.55,45.6
bill_depth_mm,22.0,19.036364,0.879689,17.2,18.6,18.9,19.5,21.1
flipper_length_mm,22.0,190.409091,6.463517,180.0,185.75,191.0,194.75,203.0
body_mass_g,22.0,4050.0,355.567956,3550.0,3800.0,4000.0,4268.75,4775.0


In [54]:
female_adelie_biscoe.corr(numeric_only=True)

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.10831,-0.000146,0.099711
bill_depth_mm,0.10831,1.0,-0.051228,0.359601
flipper_length_mm,-0.000146,-0.051228,1.0,0.417752
body_mass_g,0.099711,0.359601,0.417752,1.0


In [55]:
male_adelie_biscoe.corr(numeric_only=True)

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.206565,0.466214,0.549775
bill_depth_mm,0.206565,1.0,0.456207,0.524467
flipper_length_mm,0.466214,0.456207,1.0,0.616937
body_mass_g,0.549775,0.524467,0.616937,1.0


In [56]:
# Subset of data for Adelie on Dream
female_adelie_dream = female_adelie[female_adelie['island'] == 'Dream']
male_adelie_dream = male_adelie[male_adelie['island'] == 'Dream']

In [57]:
female_adelie_dream.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,27.0,36.911111,2.089043,32.1,36.0,36.8,37.85,42.2
bill_depth_mm,27.0,17.618519,0.897448,15.5,17.05,17.8,18.45,19.3
flipper_length_mm,27.0,187.851852,5.510156,178.0,185.0,188.0,191.0,202.0
body_mass_g,27.0,3344.444444,212.056475,2900.0,3212.5,3400.0,3487.5,3700.0


In [58]:
male_adelie_dream.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,28.0,40.071429,1.748196,36.3,39.15,40.25,41.1,44.1
bill_depth_mm,28.0,18.839286,1.033276,17.0,18.1,18.65,19.275,21.2
flipper_length_mm,28.0,191.928571,6.803749,178.0,188.5,190.5,196.0,208.0
body_mass_g,28.0,4045.535714,330.547636,3425.0,3875.0,3987.5,4300.0,4650.0


In [59]:
female_adelie_dream.corr(numeric_only=True)

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.446291,-0.095079,0.385415
bill_depth_mm,0.446291,1.0,0.102465,0.532589
flipper_length_mm,-0.095079,0.102465,1.0,0.321025
body_mass_g,0.385415,0.532589,0.321025,1.0


In [60]:
male_adelie_dream.corr(numeric_only=True)

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.080813,0.255781,0.052007
bill_depth_mm,0.080813,1.0,0.140551,0.097043
flipper_length_mm,0.255781,0.140551,1.0,0.312754
body_mass_g,0.052007,0.097043,0.312754,1.0


In [61]:
# Subset of data for Adelie on Torgersen
female_adelie_torgersen = female_adelie[female_adelie['island'] == 'Torgersen']
male_adelie_torgersen = male_adelie[male_adelie['island'] == 'Torgersen']

In [62]:
female_adelie_torgersen.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,24.0,37.554167,2.207887,33.5,35.85,37.6,39.125,41.1
bill_depth_mm,24.0,17.55,0.879723,15.9,17.0,17.45,17.925,19.3
flipper_length_mm,24.0,188.291667,4.638958,176.0,186.0,189.0,191.0,196.0
body_mass_g,24.0,3395.833333,259.144356,2900.0,3200.0,3400.0,3606.25,3800.0


In [63]:
male_adelie_torgersen.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,23.0,40.586957,3.027496,34.6,38.85,41.1,42.65,46.0
bill_depth_mm,23.0,19.391304,1.082469,17.6,18.55,19.2,20.15,21.5
flipper_length_mm,23.0,194.913043,5.915412,181.0,192.0,195.0,198.0,210.0
body_mass_g,23.0,4034.782609,372.471714,3325.0,3787.5,4000.0,4275.0,4700.0


In [64]:
female_adelie_torgersen.corr()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,-0.082935,-0.015193,0.042966
bill_depth_mm,-0.082935,1.0,0.224263,0.350438
flipper_length_mm,-0.015193,0.224263,1.0,-0.065854
body_mass_g,0.042966,0.350438,-0.065854,1.0


In [65]:
male_adelie_torgersen.corr()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,-0.296162,0.312882,0.141502
bill_depth_mm,-0.296162,1.0,-0.103054,-0.022327
flipper_length_mm,0.312882,-0.103054,1.0,0.236101
body_mass_g,0.141502,-0.022327,0.236101,1.0
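
The correlation tables above could be carried into a simulation by drawing all four measurements jointly rather than one at a time. The following is only a minimal sketch of that idea, assuming a multivariate normal model; it is not the approach used in section 5, which draws each variable independently. The means, standard deviations and correlations are copied from the male Adelie Torgersen outputs above, and the names `rng_mv`, `cov` and `sim_mv` are hypothetical.

```python
import numpy as np
import pandas as pd

columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

# Summary statistics copied from male_adelie_torgersen.describe()
mean = np.array([40.586957, 19.391304, 194.913043, 4034.782609])
std = np.array([3.027496, 1.082469, 5.915412, 372.471714])

# Correlation matrix copied from male_adelie_torgersen.corr()
corr = np.array([
    [1.0, -0.296162, 0.312882, 0.141502],
    [-0.296162, 1.0, -0.103054, -0.022327],
    [0.312882, -0.103054, 1.0, 0.236101],
    [0.141502, -0.022327, 0.236101, 1.0],
])

# Covariance = correlation scaled by the product of the standard deviations
cov = corr * np.outer(std, std)

# Separate generator (hypothetical name) so the notebook's rng is not affected
rng_mv = np.random.default_rng(seed=5)
draws = rng_mv.multivariate_normal(mean, cov, size=23)

sim_mv = pd.DataFrame(draws, columns=columns)
print(sim_mv.corr().round(2))
```

With only 23 rows the sample correlations of `sim_mv` will scatter around the target values; a larger `size` brings them closer.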


<a id='part5.0'></a>
## **5.0 Data Set Simulation**

The simulated data set is created with the numpy.random package. 

In [66]:
# Importing NumPy
import numpy as np

In [67]:
# Define the random number generator (RNG)
rng = np.random.default_rng(seed=5)
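
Seeding the generator makes the simulated values reproducible. As a small illustration (the names `rng_a` and `rng_b` are hypothetical and not part of the notebook), two generators created with the same seed return identical draws:

```python
import numpy as np

# Two independent generators created with the same seed
rng_a = np.random.default_rng(seed=5)
rng_b = np.random.default_rng(seed=5)

# The same call on each generator yields the same values
draw_a = rng_a.normal(loc=45.6, scale=2.1, size=3)
draw_b = rng_b.normal(loc=45.6, scale=2.1, size=3)

same = np.array_equal(draw_a, draw_b)
print(same)
```

Because the notebook shares a single `rng` across all cells, the values it produces also depend on the order in which the cells are run.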

**Gentoo Female**

Simulation of 90 female Gentoo penguins, based on the summary statistics for Biscoe island. 

In [68]:
female_gentoo.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,58.0,45.563793,2.051247,40.9,43.85,45.5,46.875,50.5
bill_depth_mm,58.0,14.237931,0.540249,13.1,13.8,14.25,14.6,15.5
flipper_length_mm,58.0,212.706897,3.897856,203.0,210.0,212.0,215.0,222.0
body_mass_g,58.0,4679.741379,281.578294,3950.0,4462.5,4700.0,4875.0,5200.0


In [75]:
# Simulation of bill_length_mm
sim_female_gentoo_bill_l = rng.normal(loc = 45.6, scale = 2.1, size = 90)

In [76]:
# Display bill_length_mm
sim_female_gentoo_bill_l

array([45.47742008, 49.49198399, 50.13595899, 44.49786448, 43.65488761,
       51.25436161, 43.54253627, 44.39585576, 45.67682117, 46.61481711,
       47.76097465, 46.42105217, 43.76799555, 46.66340289, 46.12482692,
       49.54152988, 45.56850042, 42.79235691, 43.40545818, 48.64532142,
       44.46572447, 41.18062138, 44.38053062, 45.60003171, 48.09654442,
       43.46961691, 47.00003498, 47.27012811, 44.13128455, 45.20606162,
       49.3158455 , 49.21301797, 47.39659621, 46.29708735, 47.9904502 ,
       45.3045777 , 45.40030456, 43.79324371, 45.61168311, 45.42776509,
       51.4246265 , 45.19502935, 48.26871778, 48.37304428, 45.20839473,
       48.0557341 , 41.02869134, 45.79934709, 47.40024894, 40.56448293,
       43.16574048, 47.8177546 , 45.06027739, 43.29378274, 44.8154491 ,
       44.46056456, 47.12133936, 46.55316035, 45.01152049, 44.20278036,
       45.48686111, 48.38282745, 46.33295484, 46.94178575, 45.28356865,
       42.88718587, 44.10722858, 46.96759679, 45.81675382, 43.70

In [77]:
# Simulation of bill_depth_mm
sim_female_gentoo_bill_d = rng.normal(loc = 14.2, scale = 0.5, size = 90)

In [78]:
# Display bill_depth_mm
sim_female_gentoo_bill_d

array([14.20887996, 14.55433491, 13.53384732, 15.06649628, 14.56930994,
       14.13696291, 14.02400598, 14.25694794, 14.11663433, 14.06065494,
       13.64020001, 14.83140472, 14.18021116, 14.23731782, 14.24645799,
       13.02089133, 14.46357073, 14.03542253, 14.05538577, 14.23489394,
       13.7137622 , 14.6216904 , 13.85206769, 13.74817097, 13.7689052 ,
       14.27937635, 14.57288974, 13.90247338, 13.44490666, 14.73124535,
       13.6674582 , 14.79741518, 14.17285206, 13.85082505, 13.87998003,
       14.43093309, 14.6446859 , 14.10091489, 14.67346719, 15.2434891 ,
       14.09584583, 13.62527668, 13.67279983, 13.50483232, 13.88016727,
       13.83811038, 13.53744834, 14.46509213, 12.8943305 , 13.53960051,
       15.2076259 , 14.71890121, 14.46139584, 14.42862953, 15.08546859,
       14.46781482, 14.23878854, 13.58491992, 14.16316659, 13.833058  ,
       13.84691236, 13.84288053, 14.17767657, 14.03514459, 14.63555526,
       13.53690928, 14.1152649 , 15.12974875, 14.21944867, 14.79

In [79]:
# Simulation of flipper_length_mm
sim_female_gentoo_flipper = rng.normal(loc = 212.7, scale = 3.9, size = 90)

In [81]:
# Display flipper_length_mm
sim_female_gentoo_flipper

array([214.73919649, 214.09569174, 210.11389629, 210.29562603,
       214.39597   , 211.93153316, 212.53170834, 202.98121852,
       202.16959448, 214.01487795, 213.98527876, 218.1343843 ,
       215.51253616, 213.56478132, 212.76381358, 207.13571233,
       216.14438597, 210.34601454, 214.65968937, 213.48016242,
       213.76117838, 216.18143334, 210.78952216, 211.18978279,
       209.36214025, 211.22910711, 219.19716878, 211.2830572 ,
       217.32390362, 215.92144459, 215.54419899, 215.83175903,
       208.74432144, 214.15644595, 218.89480718, 209.98639585,
       212.51628237, 210.01227929, 213.93634489, 208.49373249,
       218.9777512 , 211.23034432, 206.68361987, 219.68408455,
       216.48503929, 214.72219478, 214.77709142, 211.30458777,
       211.82914716, 215.523675  , 216.27873734, 216.33718837,
       213.52894884, 210.88108644, 211.95085942, 217.02743887,
       218.31568065, 211.36888168, 210.83851439, 217.56544371,
       215.52223119, 215.41930694, 207.58189871, 211.24

In [82]:
# Simulation of body_mass_g
sim_female_gentoo_body = rng.normal(loc = 4679.7, scale = 281.6, size = 90)

In [131]:
# Display body_mass_g
sim_female_gentoo_body

array([4899.27099204, 4381.66758509, 4739.91599956, 4356.62486363,
       4775.60466231, 4542.79399095, 4499.19509931, 4695.5538069 ,
       5191.49675394, 4394.78789558, 4510.24702649, 4980.35422344,
       4560.42362573, 5177.92482046, 4742.3643646 , 4705.22194702,
       4780.1985735 , 5170.01617438, 4864.24980811, 4825.20613663,
       4750.6231461 , 4239.72285654, 4430.40739687, 5097.40970448,
       4911.14093827, 4822.20786509, 4859.09686413, 4526.62419653,
       4774.87073778, 4426.03623209, 4789.9076297 , 4478.59955282,
       4619.48833611, 4807.3816203 , 4587.67658273, 4862.05906948,
       4808.12037859, 4780.92341139, 5052.65424541, 4264.65407511,
       4230.18691028, 4853.2336886 , 4944.62761716, 5246.17794444,
       4892.6619146 , 4913.48854353, 4431.0246575 , 4891.63829254,
       4869.45304496, 4382.36836273, 4427.29698945, 4648.43361329,
       4760.87225634, 3850.44839232, 4699.49803167, 4744.7086261 ,
       4629.90288109, 4754.8378913 , 4808.08319482, 4394.82272

In [110]:
# Populate the data frame with simulated data for female Gentoo penguins
# Floating-point values for bill_length_mm and bill_depth_mm are rounded to 1 decimal place
# Floating-point values for flipper_length_mm and body_mass_g are rounded to whole numbers to match the original data set

sim_female_gentoo = pd.DataFrame(data = {'species': 'Gentoo', 'island': 'Biscoe', 'bill_length_mm': np.round(sim_female_gentoo_bill_l, 1), 'bill_depth_mm': np.round(sim_female_gentoo_bill_d, 1), 'flipper_length_mm': np.round(sim_female_gentoo_flipper,0), 'body_mass_g': np.round(sim_female_gentoo_body, 0), 'sex': 'female'})

In [111]:
sim_female_gentoo

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Gentoo,Biscoe,45.5,14.2,215.0,4899.0,female
1,Gentoo,Biscoe,49.5,14.6,214.0,4382.0,female
2,Gentoo,Biscoe,50.1,13.5,210.0,4740.0,female
3,Gentoo,Biscoe,44.5,15.1,210.0,4357.0,female
4,Gentoo,Biscoe,43.7,14.6,214.0,4776.0,female
...,...,...,...,...,...,...,...
85,Gentoo,Biscoe,46.8,14.0,208.0,4351.0,female
86,Gentoo,Biscoe,48.0,14.0,210.0,4524.0,female
87,Gentoo,Biscoe,43.9,14.2,216.0,5002.0,female
88,Gentoo,Biscoe,41.9,14.0,216.0,4761.0,female
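
The same four draws and the rounding step are repeated for every group, so they could be wrapped in a small helper. This is a sketch only: `simulate_group` and the `params` layout are hypothetical names introduced for illustration, the means and standard deviations are the rounded values used in the cells above, and a fresh generator is used, so the output need not match the table shown.

```python
import numpy as np
import pandas as pd

def simulate_group(rng, species, island, sex, params, size):
    """Draw independent normal samples for one group of penguins.

    params maps a column name to (mean, std, decimals to round to).
    """
    data = {'species': species, 'island': island}
    for column, (mean, std, decimals) in params.items():
        values = rng.normal(loc=mean, scale=std, size=size)
        data[column] = np.round(values, decimals)
    data['sex'] = sex
    return pd.DataFrame(data)

# Parameters taken from the female Gentoo simulation cells
female_gentoo_params = {
    'bill_length_mm': (45.6, 2.1, 1),
    'bill_depth_mm': (14.2, 0.5, 1),
    'flipper_length_mm': (212.7, 3.9, 0),
    'body_mass_g': (4679.7, 281.6, 0),
}

demo = simulate_group(np.random.default_rng(seed=5), 'Gentoo', 'Biscoe',
                      'female', female_gentoo_params, size=90)
print(demo.head())
```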


In [115]:
female_gentoo.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,58.0,45.563793,2.051247,40.9,43.85,45.5,46.875,50.5
bill_depth_mm,58.0,14.237931,0.540249,13.1,13.8,14.25,14.6,15.5
flipper_length_mm,58.0,212.706897,3.897856,203.0,210.0,212.0,215.0,222.0
body_mass_g,58.0,4679.741379,281.578294,3950.0,4462.5,4700.0,4875.0,5200.0


In [114]:
sim_female_gentoo.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,90.0,45.861111,2.166688,40.6,44.5,45.6,47.375,51.4
bill_depth_mm,90.0,14.13,0.483538,12.9,13.8,14.1,14.5,15.2
flipper_length_mm,90.0,213.277778,3.693382,202.0,211.0,214.0,216.0,221.0
body_mass_g,90.0,4680.011111,281.712719,3850.0,4461.0,4739.5,4861.25,5246.0
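
The two summaries above can be compared more directly by expressing the difference as a percentage of the original value. A minimal sketch, with the means and standard deviations copied by hand from the two outputs above (the names `original`, `simulated` and `pct_diff` are hypothetical):

```python
import pandas as pd

index = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

# mean and std copied from female_gentoo.describe()
original = pd.DataFrame({
    'mean': [45.563793, 14.237931, 212.706897, 4679.741379],
    'std': [2.051247, 0.540249, 3.897856, 281.578294],
}, index=index)

# mean and std copied from sim_female_gentoo.describe()
simulated = pd.DataFrame({
    'mean': [45.861111, 14.13, 213.277778, 4680.011111],
    'std': [2.166688, 0.483538, 3.693382, 281.712719],
}, index=index)

# Relative difference of the simulated statistics, in percent
pct_diff = ((simulated - original) / original * 100).round(2)
print(pct_diff)
```

In a live notebook the same table could be built from the `describe()` results directly instead of typed values.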


In [97]:
# Create a blank pandas data frame with the required columns
sim_df = pd.DataFrame(columns = ['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex'])

In [98]:
sim_df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex


**Gentoo Male**

Simulation of 90 male Gentoo penguins, based on the summary statistics for Biscoe island. 

In [117]:
male_gentoo.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bill_length_mm,61.0,49.47377,2.720594,44.4,48.1,49.5,50.5,59.6
bill_depth_mm,61.0,15.718033,0.74106,14.1,15.2,15.7,16.1,17.3
flipper_length_mm,61.0,221.540984,5.673252,208.0,218.0,221.0,225.0,231.0
body_mass_g,61.0,5484.836066,313.158596,4750.0,5300.0,5500.0,5700.0,6300.0


In [118]:
# Simulation of bill_length_mm
sim_male_gentoo_bill_l = rng.normal(loc = 49.5, scale = 2.7, size = 90)

In [119]:
# Display bill_length_mm
sim_male_gentoo_bill_l

array([44.15673195, 44.4800441 , 51.47340144, 50.42225627, 48.70064441,
       48.38497632, 44.83952291, 50.50228088, 52.29293616, 53.93386164,
       49.63267247, 48.94866896, 46.54080798, 52.8027579 , 54.10850993,
       47.84444098, 50.28504868, 49.74480929, 47.95066623, 42.75688618,
       48.51180242, 51.14009919, 45.60308452, 47.23502058, 40.63438668,
       52.96822483, 50.8507462 , 49.62279203, 48.76972185, 47.77933542,
       49.28722311, 48.66214968, 48.82943685, 49.40870999, 49.83925761,
       53.68224173, 45.58653251, 50.20731529, 47.81368579, 46.4234008 ,
       43.10593883, 51.13860879, 46.82597605, 47.44884787, 50.74167247,
       46.54762287, 49.54332185, 49.27082291, 51.63797715, 52.20173054,
       47.13250002, 51.94937988, 49.21178672, 49.80455342, 48.89921324,
       51.56695701, 46.2000589 , 50.17522752, 52.64760543, 48.70709123,
       54.94798567, 47.48605421, 50.04997955, 48.16995697, 47.80237416,
       46.46041832, 48.01482053, 49.3319599 , 47.30463762, 51.85

In [122]:
# Simulation of bill_depth_mm
sim_male_gentoo_bill_d = rng.normal(loc = 15.7, scale = 0.7, size = 90)

In [123]:
# Display bill_depth_mm
sim_male_gentoo_bill_d

array([15.90136142, 15.87691135, 16.91102213, 16.27437781, 15.01786018,
       15.02466761, 16.66621969, 15.84400156, 16.3209406 , 15.62552028,
       15.0353564 , 15.92694099, 16.83324084, 15.97939509, 14.95410467,
       15.52789888, 15.4001111 , 15.87314081, 16.4686379 , 17.13800293,
       15.92014385, 15.46445938, 15.82011871, 15.77760632, 15.46914256,
       15.78302291, 15.76670411, 14.94363688, 16.15067716, 15.51330821,
       15.07857915, 16.11018825, 15.42588492, 16.98677865, 16.0996347 ,
       15.81804572, 15.23395662, 15.47712616, 16.3548698 , 15.27825976,
       16.29054889, 14.28742517, 15.89631681, 16.32671151, 15.99140581,
       15.75091813, 16.46043802, 15.15449616, 15.06445667, 15.82614986,
       16.59433659, 16.05455097, 15.61643234, 15.96255012, 16.06201387,
       14.88067945, 13.78402908, 15.03457277, 14.70622261, 15.34648648,
       14.91073925, 15.14611391, 16.05614435, 15.8214571 , 15.73735963,
       16.91807245, 15.4350293 , 15.49218278, 16.64580564, 14.98

In [128]:
# Simulation of flipper_length_mm
sim_male_gentoo_flipper = rng.normal(loc = 221.5, scale = 5.7, size = 90)

In [129]:
# Display flipper_length_mm
sim_male_gentoo_flipper

array([228.344812  , 222.40174957, 217.05201682, 225.2085522 ,
       218.11659153, 222.9563456 , 211.39829534, 219.71842625,
       224.8801747 , 217.98434113, 221.53170376, 225.04253174,
       222.05930672, 231.84962642, 220.1905984 , 220.64144706,
       220.74657932, 210.39367069, 212.9170022 , 224.6955674 ,
       219.77982385, 229.57209059, 221.20993904, 220.68532485,
       227.51800529, 223.23001362, 230.79613836, 215.62958008,
       224.16354005, 225.53215525, 224.83158268, 228.23436487,
       227.54601106, 221.8994172 , 218.52631325, 227.31103072,
       222.0299145 , 227.23529684, 229.50497811, 222.97343781,
       216.71690873, 220.42440505, 226.38927843, 220.5668338 ,
       228.65553887, 222.19558519, 222.74485898, 218.44055194,
       221.45362634, 228.26176698, 220.07053175, 220.11404923,
       217.56184295, 226.43751654, 213.30299927, 218.23300476,
       227.12862442, 230.03469901, 220.64703502, 227.39370663,
       210.58455836, 226.19583906, 223.53648654, 214.89

In [130]:
# Simulation of body_mass_g
sim_male_gentoo_body = rng.normal(loc = 5484.8, scale = 313.2, size = 90)

In [132]:
# Display body_mass_g
sim_male_gentoo_body

array([5739.33940704, 5448.19837045, 5420.61124516, 5252.51505707,
       5413.12249946, 5248.37669658, 5887.7150541 , 6007.6290082 ,
       5409.12486439, 5484.92600591, 4902.94351395, 5331.55364806,
       5602.93922853, 5581.98391114, 5534.83685401, 5760.72887699,
       5524.42426246, 5374.3093319 , 5996.66260225, 4895.23299567,
       5133.0441643 , 5263.66575969, 5265.9045876 , 5251.31901574,
       5422.58783224, 5269.43487906, 5473.97065845, 5675.73979784,
       5499.44927254, 5531.71222301, 5197.35830204, 5134.50151608,
       5397.74971726, 5778.00369683, 5606.0034649 , 5827.82021961,
       5798.35889914, 5838.76929893, 5236.34444593, 5874.59152543,
       5724.51860188, 6135.4043159 , 5672.00280745, 5589.86620776,
       5448.02002719, 5539.76266874, 4973.48726978, 5410.62233982,
       5476.63617533, 5796.91189808, 5284.18185822, 5529.81000086,
       6057.18890604, 5323.80545425, 5750.98125504, 5322.54132353,
       5457.49668148, 5632.00010517, 5763.30130589, 5335.47256

In [133]:
# Populate the data frame with simulated data for male Gentoo penguins
# Floating-point values for bill_length_mm and bill_depth_mm are rounded to 1 decimal place
# Floating-point values for flipper_length_mm and body_mass_g are rounded to whole numbers to match the original data set

sim_male_gentoo = pd.DataFrame(data = {'species': 'Gentoo', 'island': 'Biscoe', 'bill_length_mm': np.round(sim_male_gentoo_bill_l, 1), 'bill_depth_mm': np.round(sim_male_gentoo_bill_d, 1), 'flipper_length_mm': np.round(sim_male_gentoo_flipper,0), 'body_mass_g': np.round(sim_male_gentoo_body, 0), 'sex': 'male'})

In [134]:
sim_male_gentoo

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Gentoo,Biscoe,44.2,15.9,228.0,5739.0,male
1,Gentoo,Biscoe,44.5,15.9,222.0,5448.0,male
2,Gentoo,Biscoe,51.5,16.9,217.0,5421.0,male
3,Gentoo,Biscoe,50.4,16.3,225.0,5253.0,male
4,Gentoo,Biscoe,48.7,15.0,218.0,5413.0,male
...,...,...,...,...,...,...,...
85,Gentoo,Biscoe,55.0,16.4,217.0,4988.0,male
86,Gentoo,Biscoe,52.2,16.4,221.0,5916.0,male
87,Gentoo,Biscoe,52.9,16.5,226.0,5810.0,male
88,Gentoo,Biscoe,48.9,15.2,216.0,5502.0,male
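
Once the per-group frames exist, they could be stacked into a single data set. The sketch below assumes `pd.concat` is a suitable way to do this; the two small stand-in frames reuse the first rows of the outputs above, and the names `sim_female_demo`, `sim_male_demo` and `sim_combined` are hypothetical.

```python
import pandas as pd

# Stand-ins for sim_female_gentoo and sim_male_gentoo (first two rows of each)
sim_female_demo = pd.DataFrame({
    'species': 'Gentoo', 'island': 'Biscoe',
    'bill_length_mm': [45.5, 49.5], 'bill_depth_mm': [14.2, 14.6],
    'flipper_length_mm': [215.0, 214.0], 'body_mass_g': [4899.0, 4382.0],
    'sex': 'female',
})
sim_male_demo = pd.DataFrame({
    'species': 'Gentoo', 'island': 'Biscoe',
    'bill_length_mm': [44.2, 44.5], 'bill_depth_mm': [15.9, 15.9],
    'flipper_length_mm': [228.0, 222.0], 'body_mass_g': [5739.0, 5448.0],
    'sex': 'male',
})

# Stack the groups and renumber the rows from 0
sim_combined = pd.concat([sim_female_demo, sim_male_demo], ignore_index=True)
print(sim_combined)
```

Using `ignore_index=True` avoids duplicate row labels, since each per-group frame starts its index at 0.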


<a id='part6.0'></a>
## **6.0 Data Set Comparison/Validation**

## **References Used:**

*Placeholder*