Daniel Rocha Ruiz, MSc in Data Science and Business Analytics

Dataset:
- https://allisonhorst.github.io/palmerpenguins/

# Set-up
## Import packages

In [1]:
# penguins dataset
from palmerpenguins import load_penguins
#import sys
#!{sys.executable} -m pip install palmerpenguins

# general
import pandas as pd

# scikit-learn
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split

## Load data

In [None]:
# load the full dataset as a single DataFrame (return_X_y=True would return an (X, y) tuple instead)
df = load_penguins()

# EDA

## Basic evaluation

This dataset contains 8 variables.

There are 4 discrete variables, each with 2-3 unique values.
- Species, Island, and Sex are strings.
- Year is integer.

There are 4 continuous variables. They reflect measurements.
- Bill length, bill depth, and flipper length are in millimeters.
- Body mass is in grams.
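
The split between discrete and continuous variables can be checked programmatically. The snippet below is a minimal sketch on a small toy frame that only mirrors the column names of the penguins dataset; the values are illustrative, and the rule "float columns are continuous" is an assumption that happens to fit this dataset.

```python
import pandas as pd

# Toy rows that mirror the penguins columns; the values are illustrative only.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "bill_length_mm": [39.1, 46.2, 49.6],
    "bill_depth_mm": [18.7, 14.4, 18.2],
    "flipper_length_mm": [181.0, 214.0, 193.0],
    "body_mass_g": [3750.0, 4650.0, 3775.0],
    "sex": ["male", "female", "female"],
    "year": [2007, 2008, 2009],
})

# Float columns are the continuous measurements; everything else is treated as discrete.
continuous = toy.select_dtypes(include="float64").columns.tolist()
discrete = [c for c in toy.columns if c not in continuous]

# Number of distinct values per discrete column.
levels = toy[discrete].nunique()
print(continuous)
print(levels)
```

Run against the real `df`, the same two steps should give four columns in each group.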

In [4]:
print("Species:", df["species"].unique().tolist())
print("Island:", df["island"].unique().tolist())
print("Sex:", df["sex"].unique().tolist())
print("Year:", df["year"].unique().tolist())

Species: ['Adelie', 'Gentoo', 'Chinstrap']
Island: ['Torgersen', 'Biscoe', 'Dream']
Sex: ['male', 'female', nan]
Year: [2007, 2008, 2009]


## Missing observations
There are 11 observations with missing data. Nine of them have all the measurements but lack the sex of the penguin; the other 2 contain neither the measurements nor the sex.
- The suggested path would be to discard the 2 observations with barely any data, and then use a missing-data imputation algorithm to impute the missing sex.
- For simplicity, given that there are only 11 observations with missing data, we simply drop them.
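
The suggested two-step path could be sketched as follows. This is not what the notebook does (it drops all incomplete rows); it is a hedged illustration on a toy frame with made-up values. The choice of `KNeighborsClassifier` with one neighbour is an assumption made for brevity; any classifier could take its place.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

measures = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]

# Toy frame with made-up values: row 2 has no measurements, row 5 lacks only the sex.
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, np.nan, 46.1, 50.0, 47.3],
    "bill_depth_mm": [18.7, 17.4, np.nan, 13.2, 16.3, 13.8],
    "flipper_length_mm": [181.0, 186.0, np.nan, 211.0, 230.0, 216.0],
    "body_mass_g": [3750.0, 3800.0, np.nan, 4500.0, 5700.0, 4725.0],
    "sex": ["male", "female", np.nan, "female", "male", np.nan],
})

# Step 1: discard rows that carry no measurements at all.
toy = toy.dropna(subset=measures, how="all").copy()

# Step 2: fit a classifier on the rows with a known sex ...
known = toy["sex"].notna()
model = KNeighborsClassifier(n_neighbors=1)
model.fit(toy.loc[known, measures], toy.loc[known, "sex"])

# ... and fill in the rows where the sex is missing.
toy.loc[~known, "sex"] = model.predict(toy.loc[~known, measures])
print(toy)
```

On real data the measurements should be scaled before a distance-based model is fitted, since body mass in grams would otherwise dominate the distances.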

In [5]:
df.isnull().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64

In [6]:
df[df.isnull().sum(axis=1)>0]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
3,Adelie,Torgersen,,,,,,2007
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,,2007
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,,2007
10,Adelie,Torgersen,37.8,17.1,186.0,3300.0,,2007
11,Adelie,Torgersen,37.8,17.3,180.0,3700.0,,2007
47,Adelie,Dream,37.5,18.9,179.0,2975.0,,2007
178,Gentoo,Biscoe,44.5,14.3,216.0,4100.0,,2007
218,Gentoo,Biscoe,46.2,14.4,214.0,4650.0,,2008
256,Gentoo,Biscoe,47.3,13.8,216.0,4725.0,,2009
268,Gentoo,Biscoe,44.5,15.7,217.0,4875.0,,2009


In [None]:
df = df.dropna()

## Univariate EDA

In [None]:
#####

## Bivariate EDA

In [None]:
#####

# Variable Encoding

In [7]:
le = preprocessing.LabelEncoder()
le.fit(df["species"].unique().tolist())
print("Classes:", list(le.classes_))
print("Transform:", le.transform(list(le.classes_)))
print("Inverse transform", le.inverse_transform(le.transform(list(le.classes_))))

Classes: ['Adelie', 'Chinstrap', 'Gentoo']
Transform: [0 1 2]
Inverse transform ['Adelie' 'Chinstrap' 'Gentoo']


In [8]:
# encoding the species
df["species"] = le.transform(df["species"])

# dummy names should be indicative and interpretable
df["male"] = (df["sex"] == "male").astype(int)

# dummifying the islands
df["dream"] = (df["island"] == "Dream").astype(int)
df["biscoe"] = (df["island"] == "Biscoe").astype(int)

# dummifying the years
df["y2008"] = (df["year"] == 2008).astype(int)
df["y2009"] = (df["year"] == 2009).astype(int)

# dropping the original columns that are now encoded as dummies
df = df.drop(columns=["island", "sex", "year"])

# the base category is female on Torgersen island in 2007
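
The same dummy encoding can also be produced with `pd.get_dummies`. The sketch below uses a toy frame with made-up rows; the generated column names (`island_Biscoe`, `sex_male`, ...) follow the pandas default prefix convention and differ from the hand-written names used in the notebook.

```python
import pandas as pd

# Toy frame with made-up rows covering every island, sex, and year.
toy = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
    "sex": ["female", "male", "female", "male"],
    "year": [2007, 2008, 2009, 2007],
})

# One indicator column per category, stored as 0/1 integers.
dummies = pd.get_dummies(toy, columns=["island", "sex", "year"], dtype=int)

# Drop the base categories explicitly so that the baseline stays
# female on Torgersen island in 2007.
dummies = dummies.drop(columns=["island_Torgersen", "sex_female", "year_2007"])
print(dummies.columns.tolist())
```

The base columns are dropped by name rather than with `drop_first=True`, because `drop_first` removes the first category in sorted order, which would make Biscoe the base island instead of Torgersen.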

In [16]:


X, y = load_penguins(return_X_y=True)
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [19]:
print(X.shape)
print(y.shape)

(344, 4)
(344,)


In [20]:
X

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,39.1,18.7,181.0,3750.0
1,39.5,17.4,186.0,3800.0
2,40.3,18.0,195.0,3250.0
3,,,,
4,36.7,19.3,193.0,3450.0
...,...,...,...,...
339,55.8,19.8,207.0,4000.0
340,43.5,18.1,202.0,3400.0
341,49.6,18.2,193.0,3775.0
342,50.8,19.0,210.0,4100.0


In [None]:
# create the transformer
# X_train holds only the four numeric measurements, so mean imputation is the only step needed here
numeric_cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
transformer = ColumnTransformer(transformers=[
    ("impute", SimpleImputer(strategy="mean"), numeric_cols),
], remainder="passthrough")

x_train_transform = transformer.fit_transform(X_train)
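
For a frame that mixes numeric and categorical penguin columns, a `ColumnTransformer` could be wired up roughly as below. This is a sketch on a toy frame with made-up values, not the notebook's pipeline; the step names `impute` and `onehot` are arbitrary, and `sparse_threshold=0` is used to force a dense result without relying on encoder arguments that changed name between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ["bill_length_mm", "body_mass_g"]
categorical_cols = ["island", "sex"]

# Toy frame with made-up values and one missing measurement.
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 46.2, 49.6],
    "body_mass_g": [3750.0, 3800.0, 4650.0, 3775.0],
    "island": ["Torgersen", "Biscoe", "Biscoe", "Dream"],
    "sex": ["male", "female", "female", "male"],
})

transformer = ColumnTransformer(
    transformers=[
        # Replace missing measurements with the column mean.
        ("impute", SimpleImputer(strategy="mean"), numeric_cols),
        # One indicator per category, dropping the first level of each column.
        ("onehot", OneHotEncoder(drop="first"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = transformer.fit_transform(toy)
print(encoded.shape)
```

The transformer should be fitted on the training split only and then applied to the test split with `transform`, so that the imputation means and category levels do not leak information from the test data.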

In [None]:
df.to_parquet("dataset.parquet")