# Palmer Penguins

***
![Penguins](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png)

This notebook contains my analysis of the well-known Palmer Penguins dataset.

The dataset is available [on GitHub](https://allisonhorst.github.io/palmerpenguins/).

## Imports

***

We use pandas for the DataFrame data structure.

Among other things, it lets us load and inspect CSV files.

In [1]:
# Data frames
import pandas as pd

## Load Data

Load the Palmer Penguins dataset from a URL.

In [2]:
# Load the penguins dataset
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv')

The data is now loaded and we can inspect it.

In [3]:
# Let's have a look
df

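|     | species | island    | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex    |
|----:|---------|-----------|---------------:|--------------:|------------------:|------------:|--------|
| 0   | Adelie  | Torgersen | 39.1           | 18.7          | 181.0             | 3750.0      | MALE   |
| 1   | Adelie  | Torgersen | 39.5           | 17.4          | 186.0             | 3800.0      | FEMALE |
| 2   | Adelie  | Torgersen | 40.3           | 18.0          | 195.0             | 3250.0      | FEMALE |
| 3   | Adelie  | Torgersen | NaN            | NaN           | NaN               | NaN         | NaN    |
| 4   | Adelie  | Torgersen | 36.7           | 19.3          | 193.0             | 3450.0      | FEMALE |
| ... | ...     | ...       | ...            | ...           | ...               | ...         | ...    |
| 339 | Gentoo  | Biscoe    | NaN            | NaN           | NaN               | NaN         | NaN    |
| 340 | Gentoo  | Biscoe    | 46.8           | 14.3          | 215.0             | 4850.0      | FEMALE |
| 341 | Gentoo  | Biscoe    | 50.4           | 15.7          | 222.0             | 5750.0      | MALE   |
| 342 | Gentoo  | Biscoe    | 45.2           | 14.8          | 212.0             | 5200.0      | FEMALE |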


## Inspecting Data

In [4]:
# look at the first row
df.iloc[0]

species                 Adelie
island               Torgersen
bill_length_mm            39.1
bill_depth_mm             18.7
flipper_length_mm        181.0
body_mass_g             3750.0
sex                       MALE
Name: 0, dtype: object

In [5]:
# Sex of penguins
df['sex']

0        MALE
1      FEMALE
2      FEMALE
3         NaN
4      FEMALE
        ...  
339       NaN
340    FEMALE
341      MALE
342    FEMALE
343      MALE
Name: sex, Length: 344, dtype: object

In [6]:
# count the number of penguins of each sex
df['sex'].value_counts()

sex
MALE      168
FEMALE    165
Name: count, dtype: int64
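
The two counts above add up to 333, while the `sex` column has 344 entries, so 11 values are missing. By default `value_counts()` leaves missing values out. The sketch below shows how `dropna=False` makes them visible. It uses a small hand-built `sample` frame (an assumption made so the example runs without downloading anything) containing a few of the rows shown earlier.

```python
import pandas as pd

# A few rows copied from the output above; None stands in for a missing value.
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie", "Gentoo"],
    "sex": ["MALE", "FEMALE", "FEMALE", None, "MALE"],
})

# By default, value_counts() skips missing values.
counts = sample["sex"].value_counts()

# dropna=False adds an entry for the missing values as well.
counts_with_missing = sample["sex"].value_counts(dropna=False)

print(counts)
print(counts_with_missing)
```

Running the same call on the full data, `df['sex'].value_counts(dropna=False)`, should list the missing entries alongside MALE and FEMALE.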

In [7]:
# Describe the data set
df.describe()

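|       | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|-------|---------------:|--------------:|------------------:|------------:|
| count | 342.0          | 342.0         | 342.0             | 342.0       |
| mean  | 43.92193       | 17.15117      | 200.915205        | 4201.754386 |
| std   | 5.459584       | 1.974793      | 14.061714         | 801.954536  |
| min   | 32.1           | 13.1          | 172.0             | 2700.0      |
| 25%   | 39.225         | 15.6          | 190.0             | 3550.0      |
| 50%   | 44.45          | 17.3          | 197.0             | 4050.0      |
| 75%   | 48.5           | 18.7          | 213.0             | 4750.0      |
| max   | 59.6           | 21.5          | 231.0             | 6300.0      |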


## Tables

***

| Species  |Bill Length (mm)|Body Mass (g)|
|--------- |----------------:|-------------:|
|Adelie    |            38.8|         3701|
|Chinstrap |            48.8|         3733|
|Gentoo    |            47.5|         5076|
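
The notebook does not say how the figures in this table were produced. They look like per-species averages, but that is an assumption. Under that assumption, a summary of this shape could be built with `groupby` and `mean`. The sketch below uses only a handful of rows copied from the output earlier in the notebook, so its numbers will not match the table, and the `sample` frame is purely illustrative.

```python
import pandas as pd

# Seven rows copied from the DataFrame output shown earlier (no Chinstrap rows appear there).
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 46.8, 50.4, 45.2],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0, 4850.0, 5750.0, 5200.0],
})

# Group the rows by species and average the two measurements.
summary = (
    sample.groupby("species")[["bill_length_mm", "body_mass_g"]]
    .mean()
    .round(1)
)

print(summary)
```

Applying the same `groupby` to the full `df` would give one row per species, including Chinstrap.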


## Presentation of the Dataset

The Palmer Penguins dataset is a popular dataset used for data analysis and machine learning tasks. It contains measurements and other attributes of penguins collected from three species: Adelie, Chinstrap, and Gentoo, on three islands in the Palmer Archipelago, Antarctica: Dream, Torgersen and Biscoe.

### Overview of the variables:

**species:** The species of penguin. It can be one of three values: "Adelie", "Chinstrap", or "Gentoo".  
**island:** The island where the penguin was observed. It can be one of three values: "Biscoe", "Dream", or "Torgersen".  
**bill_length_mm:** The length of the penguin's bill in millimeters.  
**bill_depth_mm:** The depth of the penguin's bill in millimeters.  
**flipper_length_mm:** The length of the penguin's flipper in millimeters.  
**body_mass_g:** The body mass of the penguin in grams.  
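**sex:** The sex of the penguin. In the CSV loaded above it is either "MALE" or "FEMALE", with a missing value (NaN) where the sex is unknown.  
**year:** The year in which the data was collected. This column is part of the original palmerpenguins data but is not present in the CSV loaded above.  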


### Notes:
- The dataset contains missing values, shown as NaN in pandas ("NA" in the original data). In the output above, two of the 344 rows have no measurements, and `sex` is missing for 11 rows (168 + 165 = 333 recorded values).
- The measurements (bill length, bill depth, flipper length, body mass) are numeric variables representing physical characteristics of the penguins.
- The categorical variables (species, island, sex) describe the species, habitat, and sex of the penguins.
- The dataset is often used for classification tasks (e.g., predicting the species of a penguin from its physical characteristics) and for exploratory data analysis.
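
A quick way to see where values are missing is to count them per column with `isna().sum()`. The sketch below does that and then drops incomplete rows with `dropna()`. It uses a small `sample` frame built from the first five rows shown earlier. Dropping rows is one possible way to handle missing data, not necessarily the right one for every analysis.

```python
import numpy as np
import pandas as pd

# The first five rows from the output above; row 3 has no measurements.
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
    "sex": ["MALE", "FEMALE", "FEMALE", np.nan, "FEMALE"],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Keep only the rows where every column has a value.
complete_rows = sample.dropna()

print(missing_per_column)
print(len(sample), "rows before,", len(complete_rows), "rows after dropna()")
```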



To model the variables in the Palmer Penguins dataset in Python, I have considered the following:

- Categorical Variables:  
  - Species: This variable represents different penguin species (Adelie, Chinstrap, or Gentoo). It should be treated as a categorical variable because each species represents a distinct category with no inherent ordering.  
  - Island: The island where the penguin was observed (Dream, Torgersen, or Biscoe). Like species, this variable should also be treated as categorical since the islands represent distinct categories without any natural ordering.  
  - Sex: The sex of the penguin (MALE, FEMALE, or NaN if unknown). This is also a categorical variable since each sex represents a distinct category.
- Numeric Variables:   
  - Bill Length (bill_length_mm): This represents the length of the penguin's bill in millimeters. It is a numeric variable that can be used for modeling, such as regression analysis.  
  - Bill Depth (bill_depth_mm): Similar to bill length, this represents the depth of the penguin's bill in millimeters and can also be treated as a numeric variable.  
  - Flipper Length (flipper_length_mm): The length of the penguin's flipper in millimeters. This is another numeric variable suitable for modeling.
  - Body Mass (body_mass_g): The body mass of the penguin in grams. This numeric variable provides information about the weight of the penguins and can be useful for modeling purposes.  
- Target Variable (Optional):  
  - Depending on your analysis goals, you may choose to define a target variable for predictive modeling tasks. For example, if you want to predict the species of a penguin based on its physical characteristics, then species can be considered as the target variable for classification tasks.  
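
As a minimal sketch of the typing described in the list above, the categorical columns can be converted to the pandas `category` dtype while the measurements stay as floats. The `sample` frame and the names `categorical_columns`, `typed`, and `numeric_columns` are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Four rows copied from the output earlier in the notebook.
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe"],
    "bill_length_mm": [39.1, 39.5, 46.8, 50.4],
    "body_mass_g": [3750.0, 3800.0, 4850.0, 5750.0],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# Convert the three label columns to the unordered 'category' dtype.
categorical_columns = ["species", "island", "sex"]
typed = sample.astype({column: "category" for column in categorical_columns})

# The remaining numeric columns can be picked out by dtype.
numeric_columns = typed.select_dtypes(include="number").columns.tolist()

print(typed.dtypes)
print(typed["species"].cat.categories.tolist())
print(numeric_columns)
```

Categories created this way are unordered by default, which matches the point that species, island, and sex have no natural ordering.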


Storing categorical variables with a categorical type makes it explicit that their values are distinct labels with no inherent order, rather than numbers whose size means something.  
Keeping the measurements numeric allows you to perform mathematical operations and statistical analyses on them. This enables tasks like regression analysis, correlation analysis, and visualization.  
The choice of variables depends on the objective of the analysis. For example, to understand the relationship between penguins' physical characteristics and their species, you would choose species as the target variable and use the other variables as predictors.
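
To illustrate the predictor and target split, the sketch below separates the four measurements from the species label and computes the pairwise correlations between the measurements. It stops short of fitting a model. The names `X`, `y`, and `predictor_columns` are conventions chosen for this example, and `sample` holds seven rows copied from the output earlier in the notebook.

```python
import pandas as pd

# Seven complete rows copied from the DataFrame output shown earlier.
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 46.8, 50.4, 45.2],
    "bill_depth_mm": [18.7, 17.4, 18.0, 19.3, 14.3, 15.7, 14.8],
    "flipper_length_mm": [181.0, 186.0, 195.0, 193.0, 215.0, 222.0, 212.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0, 4850.0, 5750.0, 5200.0],
})

# Predictors: the numeric measurements. Target: the species label.
predictor_columns = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
X = sample[predictor_columns]
y = sample["species"]

# Pairwise Pearson correlations between the predictors.
correlations = X.corr()

print(X.shape, y.unique().tolist())
print(correlations.round(2))
```

With only seven rows the correlations are not meaningful in themselves. On the full data, rows with missing measurements would need to be handled first.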


***

### End