<a href="https://colab.research.google.com/github/rhodes-byu/cs180-winter25/blob/main/notebooks/06d-eda-penguins-example-plots.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><p><b>After clicking the "Open in Colab" link, copy the notebook to your own Google Drive before getting started, or it will not save your work</b></p>

# Palmer Penguins Dataset: Overview

The Palmer Penguins dataset contains data about three species of penguins observed in the Palmer Archipelago, Antarctica. The dataset was collected by Dr. Kristen Gorman and the Palmer Station LTER (Long Term Ecological Research) Program. It serves as a popular alternative to the Iris dataset for data exploration, statistical analysis, and machine learning practice due to its richer set of features and categorical variables.

Objective:

The primary goal of working with the Palmer Penguins dataset is to explore relationships between penguin species and their physical characteristics, as well as to perform classification tasks, such as predicting the species of a penguin based on measurable features. The dataset also offers an opportunity to practice data cleaning and handling missing values, as there are some missing entries.

Dataset Features:

The dataset consists of 344 rows and 7 columns. The columns are:

1. `species`: Categorical feature indicating the penguin species (Adélie, Chinstrap, Gentoo).
2. `island`: Categorical feature representing the island where the penguin was observed (Biscoe, Dream, Torgersen).
3. `bill_length_mm`: Continuous numerical feature representing the length of the penguin’s bill (in millimeters).
4. `bill_depth_mm`: Continuous numerical feature representing the depth of the penguin’s bill (in millimeters).
5. `flipper_length_mm`: Continuous numerical feature representing the penguin’s flipper length (in millimeters).
6. `body_mass_g`: Continuous numerical feature representing the penguin’s body mass (in grams).
7. `sex`: Categorical feature indicating the penguin’s sex (male or female), though some entries are missing.
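
The categorical/continuous split described above can also be checked programmatically. The sketch below is only an illustration: it builds a tiny hand-made frame with the same column names (the values are made up, not taken from the dataset) and uses `select_dtypes` to separate the two groups.

```python
import pandas as pd

# Toy frame with the same columns as the dataset; values are made up for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "bill_length_mm": [39.0, 47.5, 49.0],
    "bill_depth_mm": [18.5, 15.0, 18.0],
    "flipper_length_mm": [181.0, 217.0, 196.0],
    "body_mass_g": [3750.0, 5100.0, 3700.0],
    "sex": ["Male", "Female", None],
})

# Split the columns by dtype: numeric measurements vs. everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

The same two calls should work on the full `penguins` frame once it is loaded.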


Analysis Steps:

1. Understand the Data:
    * Load the dataset from seaborn.
    * Print the first few rows and the data types.
    * Check for missing values.
2. Detect and Address Outliers and Missing Values:
    * Visualize missing values using a heatmap.
    * Use boxplots to detect potential outliers.
    * Decide whether to drop or impute missing values.
3. Describe Shape of Data using Univariate Analysis:
    * Histograms for numerical variables to check distributions.
    * Count plots for categorical variables like species.
4. Identify Feature Relationships using Bivariate Analysis:
    * Scatter plots and pair plots to visualize relationships between numerical variables and categories.
    * Correlation matrix heatmap to examine the relationships between numerical variables.
5. Multivariate Analysis:
    * Grouped box plots to compare flipper lengths across species and sex.
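
Step 2 relies on boxplots to flag potential outliers. Boxplots conventionally draw their whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles, and the same rule can be applied numerically. The following is a minimal sketch on a made-up series; `iqr_outliers` is a hypothetical helper written for this example, not a pandas or seaborn function.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    clean = values.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return clean[(clean < lower) | (clean > upper)]

# Made-up flipper lengths (mm) with one implausible value and one missing entry
flipper = pd.Series([181.0, 186.0, 190.0, 193.0, 195.0, 197.0, 210.0, 320.0, None])
outliers = iqr_outliers(flipper)
print(outliers)
```

Points flagged this way are candidates for inspection rather than automatic removal; whether a value is an error or a genuine extreme is a judgment call.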


In [None]:
# Importing necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Palmer Penguins dataset from seaborn
penguins = sns.load_dataset('penguins')

# Display the first few rows of the dataset
penguins.head()

In [None]:
# Inspect data structure and data types
print("\nData Info:")
penguins.info()  # info() prints its summary directly and returns None

# Descriptive statistics for numerical columns
print("\nDescriptive Statistics:")
print(penguins.describe())


In [None]:
penguins.dtypes

In [None]:
# Checking for missing values
print("\nMissing Values:")
print(penguins.isnull().sum())

In [None]:
# Visualize missing data using a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(penguins.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in the Dataset')
plt.show()

In [None]:
# Univariate Analysis

# Histogram for numerical variables
penguins.hist(figsize=(10, 8), bins=20)
plt.suptitle("Histograms of Numerical Features")
plt.show()

In [None]:
# Histograms of each numerical feature, grouped by species
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
titles = ['Bill Length by Species', 'Bill Depth by Species', 'Flipper Length by Species', 'Body Mass by Species']

for ax, feature, title in zip(axes.flatten(), features, titles):
    sns.histplot(data=penguins, x=feature, hue='species', kde=True, ax=ax)
    ax.set_title(title)

plt.tight_layout()
plt.show()


In [None]:
penguins.columns

In [None]:
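# body_mass_g is measured in grams, so its values are far larger than the
# millimeter measurements; drop it before plotting the features on a shared axis
data2 = penguins.drop('body_mass_g', axis=1)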
data2.head()

In [None]:
# Boxplots for detecting outliers
plt.figure(figsize=(10, 6))
data2 = penguins.drop('body_mass_g', axis=1)
sns.boxplot(data=data2.dropna())
plt.title('Boxplots of Numerical Features')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Count plot for categorical variables
plt.figure(figsize=(10, 6))
sns.countplot(x='species', hue='species', data=penguins)
plt.title('Count of Penguin Species')
plt.show()

In [None]:
# Calculate descriptive statistics by species
penguins[["bill_length_mm", "species"]].groupby(by="species").describe()

In [None]:
# Bivariate Analysis

# Pairplot of numerical variables colored by species
plt.rcParams.update({'font.size': 14})
sns.pairplot(penguins.dropna(), hue='species', markers=["o", "s", "D"])
plt.suptitle('Pairplot of Numerical Features by Species', y=1.02)
plt.show()

In [None]:
# Does island affect the relationships?
plt.rcParams.update({'font.size': 14})
p = sns.pairplot(penguins, hue='island', markers=['o', 's', 'v'])
plt.show()

In [None]:
# Combine sex and species in a single scatterplot
p = sns.scatterplot(
    data=penguins, x='bill_length_mm', y='body_mass_g', hue='species', style='sex'
)
# Place the legend in the upper-left corner with font size 10
plt.legend(loc='upper left', prop={'size': 10})
plt.show()

In [None]:
# Contingency table for species and sex
pd.crosstab(index=penguins['species'], columns=penguins['sex'])
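
Raw counts can be hard to compare when the species differ in sample size. As a hedged illustration, the sketch below passes `normalize="index"` to `pd.crosstab` so that each row is expressed as proportions; the frame is a made-up toy, not the real data.

```python
import pandas as pd

# Made-up species/sex pairs for illustration only
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "sex": ["Male", "Female", "Female", "Male", "Male", "Female"],
})

# Row proportions: each species row sums to 1
proportions = pd.crosstab(index=toy["species"], columns=toy["sex"], normalize="index")
print(proportions.round(2))
```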

In [None]:
# Scatter plot of bill_length_mm vs bill_depth_mm
plt.figure(figsize=(8, 6))
sns.scatterplot(x='bill_length_mm', y='bill_depth_mm', hue='species', data=penguins)
plt.title('Bill Length vs Bill Depth')
plt.show()

In [None]:
# Distribution of bill length by species
p = sns.histplot(data=penguins, x='bill_length_mm', hue='species')
p.set_xlabel('Bill length (mm)', fontsize=14)
p.set_ylabel('Count', fontsize=14)
plt.show()

In [None]:
# Kernel density estimate (KDE) of bill length by species
p = sns.kdeplot(data=penguins, x='bill_length_mm', hue='species', linewidth=2)
p.set_xlabel('Bill length (mm)', fontsize=14)
p.set_ylabel('Density', fontsize=14)
plt.show()

In [None]:
# Correlation Heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(penguins.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap of Numerical Variables')
plt.show()

In [None]:
# Multivariate Analysis

# Boxplot of flipper length by species and sex
plt.figure(figsize=(10, 6))
sns.boxplot(x='species', y='flipper_length_mm', hue='sex', data=penguins)
plt.title('Flipper Length by Species and Sex')
plt.show()

In [None]:
# Handling Missing Data (if needed)

# Strategy 1: Drop rows with missing values
penguins_cleaned = penguins.dropna()
print(f"\nShape of data after dropping missing values: {penguins_cleaned.shape}")

In [None]:
# Strategy 2: Fill missing values (Example: filling with mean)
penguins_filled = penguins.fillna(penguins.mean(numeric_only=True))
print(f"\nMissing values after filling numeric NaNs with mean:\n{penguins_filled.isnull().sum()}")
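
Strategy 2 fills only the numeric columns, so a categorical column such as `sex` still contains missing values afterward. One common option (an assumption here, not something this notebook prescribes) is to fill a categorical column with its most frequent value. The sketch below shows the idea on a made-up frame.

```python
import pandas as pd

# Made-up frame with one missing numeric value and one missing category
toy = pd.DataFrame({
    "body_mass_g": [3750.0, None, 5100.0, 4200.0],
    "sex": ["Male", "Female", None, "Female"],
})

# Numeric columns: fill with the column mean (same idea as Strategy 2)
filled = toy.fillna(toy.mean(numeric_only=True))

# Categorical column: fill with the mode (most frequent value)
most_common_sex = filled["sex"].mode().iloc[0]
filled["sex"] = filled["sex"].fillna(most_common_sex)

print(filled)
```

Mode imputation is simple but can distort group proportions, so dropping the affected rows may be preferable when only a few entries are missing.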


In [None]:
# Count plot for the "island" feature
plt.figure(figsize=(8, 6))
sns.countplot(x='island', hue='island', data=penguins)
plt.title('Count of Penguins by Island')
plt.show()
