# Programming for Data Analysis - Project

# **Data Set Simulation Based on the Palmer Penguins Data**

**Instructions:**

Create a data set by simulating a real-world phenomenon. 
- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables
- Investigate the types of variables involved, their likely distributions, and their relationships with each other
- Synthesise/simulate a data set as closely matching their properties as possible
- Detail your research and implement the simulation in a Jupyter notebook - the data set itself can simply be displayed in an output cell within the notebook




## **Table of Contents**

*Placeholder*

## **1.0 Intro/Overview of Selected Data Set**
This notebook investigates the Palmer Penguins data set and, based on the findings, creates a simulated data set that matches the properties of the original data. 
I selected the penguins data set because it contains several attributes that can be used to classify the three species. The original research paper and its accompanying data set are easily available, as are cleansed and simplified versions of it, for example as a Python library. 


The original research paper, published in 2014, investigates size differences between male and female penguins of the genus Pygoscelis, so-called sexual dimorphism. 

The research was carried out on three species of penguins nesting on three islands of the Palmer Archipelago in Antarctica, between 2007 and 2009, during the egg-laying period (page 2): 
- Islands: Biscoe, Torgersen and Dream
- Species: Adelie, Gentoo and Chinstrap
- The sample size for Chinstrap penguins is smaller because of the overall smaller population on the island

Measurements taken (page 3): 
- Blood samples, for determination of sex and for stable isotope (SI) analysis
- Structural size and body mass:
  - Culmen length
  - Culmen depth
  - Right flipper length
  - Body mass
- Sea ice concentration (SIC), used to calculate the average sea ice area and the duration of the winter ice season

Sample sizes differ because weather conditions hindered access to the penguins (page 3).

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081# (18/12/2021)

Results overview: 
"Our results demonstrate inter-specific differences in best morphological predictors of sex among Pygoscelis penguins. Adelie penguin body mass and culmen length were the strongest predictors of sex, while body mass and culmen depth were best predictors of male and female gentoo penguins. For chinstrap penguins, body mass was the least predictive structural feature, while culmen length and depth were similarly strong predictors of sex. Species-specific models based on these best morphological predictors correctly classified a high percentage of individuals from independent datasets (i.e., 89–94%). Interestingly, flipper length was not a strong predictor of sex for any of the three species. Culmen features and body mass are structures important during penguin courtship [52], and therefore, likely targets of sexual selection, which may be why these parameters are strong predictors of sex across Pygoscelis species." (page 9)


Links: 

Data set investigation
https://github.com/allisonhorst/palmerpenguins (18/12/2021)

https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data (18/12/2021)

https://towardsdatascience.com/penguins-dataset-overview-iris-alternative-9453bb8c8d95 (18/12/2021)

https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris (18/12/2021)

https://www.gabemednick.com/post/penguin/ (18/12/2021)

https://github.com/mcnakhaee/palmerpenguins (18/12/2021)

https://github.com/YuOlvera/PalmerPenguins (18/12/2021)

https://www.python-graph-gallery.com/web-ggbetweenstats-with-matplotlib (18/12/2021)

https://inria.github.io/scikit-learn-mooc/python_scripts/trees_dataset.html (18/12/2021)



https://www.youtube.com/watch?v=Eai1jaZrRDs (18/12/2021)

https://www.youtube.com/watch?v=6kD2OzF2uoU (18/12/2021)

https://www.youtube.com/watch?v=uiYgZomY-v4 (18/12/2021)



Penguins and Antarctica: 
https://en.wikipedia.org/wiki/Ad%C3%A9lie_penguin (18/12/2021)

https://en.wikipedia.org/wiki/Chinstrap_penguin (18/12/2021)

https://en.wikipedia.org/wiki/Gentoo_penguin (18/12/2021)

https://en.wikipedia.org/wiki/Pygoscelis (18/12/2021)

https://en.wikipedia.org/wiki/Palmer_Archipelago (18/12/2021)


## **2.0 Accessing the Original Data Set**

There are several options for loading the data set into Python, for example: 
- Using the pandas `read_csv()` function to read an online CSV version of the data set
- Importing the Python library palmerpenguins
- Importing the data set from the seaborn library

For this notebook the data set will be accessed from the palmerpenguins library. To be able to use it, the library first needs to be installed, for example using the command `pip install palmerpenguins`. 

https://pypi.org/project/palmerpenguins (18/12/2021)
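
As a minimal sketch of the first option: `read_csv()` accepts a URL, a local path or any file-like object. To keep the example runnable offline, the snippet reads from an in-memory string instead of a real URL. The five rows are copied from the `head()` output shown in section 3.0, and the names `csv_text` and `sample` are only illustrative.

```python
import io

import pandas as pd

# Stand-in for an online CSV file (assumption: a real URL would be passed
# to read_csv() in exactly the same way). Empty fields are missing values.
csv_text = """species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
"""

# Parse the CSV text into a DataFrame; empty fields become NaN
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.dtypes)
```

Whichever loading option is used, the result is a pandas DataFrame, so the remaining steps in this notebook would stay the same.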

In [1]:
# Importing the penguins dataset
from palmerpenguins import load_penguins

# Importing pandas for working with datasets
import pandas as pd

The dataset can then be loaded and assigned to a variable.

In [2]:
# Defining the dataset
penguins = load_penguins()

## **3.0 Preparing/Cleansing the Original Data Set**

The palmerpenguins library contains a modified version of the original dataset. For easier usage, some irrelevant columns have been removed. 

https://allisonhorst.github.io/palmerpenguins/#about-the-data (19/12/2021)

Further cleanup of the data set might be required before starting the analysis. To get a first understanding of the data, the pandas library offers a number of options, for example the `info()` method, which gives an overview of the columns, the number of non-null values in each and their data types. 

In [3]:
# Overview of the data set
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB


For a visual inspection of the data set, the `head()` method can be useful. It displays the first rows of the data set; the number of rows can be set by passing a parameter (the default is 5 rows). 

In [4]:
# Displaying the first ten rows of the data set
penguins.head(10)

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 |
| 6 | Adelie | Torgersen | 38.9 | 17.8 | 181.0 | 3625.0 | female | 2007 |
| 7 | Adelie | Torgersen | 39.2 | 19.6 | 195.0 | 4675.0 | male | 2007 |
| 8 | Adelie | Torgersen | 34.1 | 18.1 | 193.0 | 3475.0 | NaN | 2007 |
| 9 | Adelie | Torgersen | 42.0 | 20.2 | 190.0 | 4250.0 | NaN | 2007 |


### Remove missing values
The output of `info()` shows that there are 344 data records in total. Five columns contain null values: bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g (two each) and sex (eleven). Three of the affected rows are visible in the first ten rows: index 3 has no measurements at all, while indexes 8 and 9 are missing only the sex.

To get a better understanding of the full picture, variations of the `isna()` function can be used to determine the number of instances of NaN records and identify affected records. 

https://datatofish.com/check-nan-pandas-dataframe/ (18/12/2021)

https://datatofish.com/rows-with-nan-pandas-dataframe/ (18/12/2021)


In [5]:
# Determine if there are NaN values in whole data set
penguins.isna().values.any()

True

In [6]:
# Count NaN values in whole data set
penguins.isna().sum().sum()

19
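
The total of 19 agrees with the `info()` output above: four measurement columns with 344 - 342 = 2 missing values each, plus 344 - 333 = 11 missing values for sex, gives 4 × 2 + 11 = 19. A per-column breakdown can be obtained with a single `sum()` instead of two. The sketch below uses a small hand-built sample (rows 2, 3, 8 and 9 from the `head()` output) rather than the full data set, so the numbers it prints apply to the sample only; `sample`, `nan_per_column` and `rows_with_nan` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hand-built sample: rows 2, 3, 8 and 9 of the original data set
sample = pd.DataFrame(
    {
        "species": ["Adelie"] * 4,
        "island": ["Torgersen"] * 4,
        "bill_length_mm": [40.3, np.nan, 34.1, 42.0],
        "bill_depth_mm": [18.0, np.nan, 18.1, 20.2],
        "flipper_length_mm": [195.0, np.nan, 193.0, 190.0],
        "body_mass_g": [3250.0, np.nan, 3475.0, 4250.0],
        "sex": ["female", np.nan, np.nan, np.nan],
        "year": [2007] * 4,
    },
    index=[2, 3, 8, 9],
)

# Number of NaN values in each column
nan_per_column = sample.isna().sum()
print(nan_per_column)

# Number of rows that contain at least one NaN value
rows_with_nan = sample.isna().any(axis=1).sum()
print(rows_with_nan)
```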

In [7]:
# Display all rows with one or more values = NaN
penguins[penguins.isna().any(axis=1)]

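| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 8 | Adelie | Torgersen | 34.1 | 18.1 | 193.0 | 3475.0 | NaN | 2007 |
| 9 | Adelie | Torgersen | 42.0 | 20.2 | 190.0 | 4250.0 | NaN | 2007 |
| 10 | Adelie | Torgersen | 37.8 | 17.1 | 186.0 | 3300.0 | NaN | 2007 |
| 11 | Adelie | Torgersen | 37.8 | 17.3 | 180.0 | 3700.0 | NaN | 2007 |
| 47 | Adelie | Dream | 37.5 | 18.9 | 179.0 | 2975.0 | NaN | 2007 |
| 178 | Gentoo | Biscoe | 44.5 | 14.3 | 216.0 | 4100.0 | NaN | 2007 |
| 218 | Gentoo | Biscoe | 46.2 | 14.4 | 214.0 | 4650.0 | NaN | 2008 |
| 256 | Gentoo | Biscoe | 47.3 | 13.8 | 216.0 | 4725.0 | NaN | 2009 |
| 268 | Gentoo | Biscoe | 44.5 | 15.7 | 217.0 | 4875.0 | NaN | 2009 |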


There are different methods for populating missing values, for example replacing missing numerical values with the column mean and missing categorical values with the most frequent value (mode). 
https://www.geeksforgeeks.org/python-replace-nan-values-with-average-of-columns/?ref=rp (19/12/2021) 

For this project, rows with NaN values will be removed using the `dropna()` function.

https://datatofish.com/dropna/ (18/12/2021)
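
For comparison, the imputation approach that is not used in this project could look roughly like the sketch below. It works on a small hand-built sample (rows 0, 1, 2, 3 and 8 from the `head()` output, reduced to two columns), and the names `sample` and `filled` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hand-built sample: one numerical and one categorical column
sample = pd.DataFrame(
    {
        "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3475.0],
        "sex": ["male", "female", "female", np.nan, np.nan],
    },
    index=[0, 1, 2, 3, 8],
)

filled = sample.copy()

# Numerical columns: replace NaN with the column mean
numeric_cols = filled.select_dtypes(include="number").columns
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

# Categorical column: replace NaN with the most frequent value (mode)
filled["sex"] = filled["sex"].fillna(filled["sex"].mode()[0])

print(filled)
```

Filling with the mean or mode keeps every row, but it pulls the filled values towards the centre of the distribution, which is one reason why dropping the few incomplete rows can be the simpler choice when the data is later used as a template for a simulation.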

In [10]:
# Removing all rows with NaN values
penguins = penguins.dropna()

# Resetting the index column
penguins = penguins.reset_index(drop=True)
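
The reason for resetting the index can be seen on a small example: `dropna()` removes rows but keeps the original index labels, which leaves gaps. The sketch below uses two columns from the first five rows of the data set; `sample`, `dropped` and `cleaned` are illustrative names.

```python
import numpy as np
import pandas as pd

# First five rows of the data set, reduced to two columns
sample = pd.DataFrame(
    {
        "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
        "sex": ["male", "female", "female", np.nan, "female"],
    }
)

# After dropna() the label 3 is gone, so the index has a gap
dropped = sample.dropna()
print(list(dropped.index))  # [0, 1, 2, 4]

# reset_index(drop=True) renumbers the rows and discards the old labels
cleaned = dropped.reset_index(drop=True)
print(list(cleaned.index))  # [0, 1, 2, 3]
```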

In [11]:
# Re-run info() to verify new size of dataset
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
 7   year               333 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 20.9+ KB


## **4.0 Original Data Set Investigation**

## **5.0 Data Set Simulation**

## **6.0 Data Set Comparison/Validation**

## **References Used:**

*Placeholder*