# 🐧 End-to-End Data Science Lifecycle Notebook

## 1. Introduction

This notebook demonstrates a complete, reproducible data science workflow using the Palmer Penguins dataset. The process covers data collection, cleaning, exploratory data analysis (EDA), and a simple baseline modeling task. The primary goal is to predict a penguin's species from its physical measurements.

### Dataset Provenance

-   **Title:** Palmer Archipelago (Antarctica) penguin data
-   **Source:** Allison Horst, Alison Hill, Kristen Gorman
-   **Source URL:** https://github.com/allisonhorst/palmerpenguins/raw/main/data/penguins.csv
-   **License:** Creative Commons Zero v1.0 Universal (CC0 1.0) Public Domain Dedication
-   **Access Date:** 2025-09-09

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
import requests
import os

# Set a random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Define directory structure
DATA_DIR = "data"
RAW_DATA_DIR = os.path.join(DATA_DIR, "raw")
REPORTS_DIR = "reports"
FIGURES_DIR = os.path.join(REPORTS_DIR, "figures")

os.makedirs(RAW_DATA_DIR, exist_ok=True)
os.makedirs(FIGURES_DIR, exist_ok=True)

print("Setup complete. Directories created.")

## 2. Data Collection

We will download the dataset directly from the source URL and save a local copy to ensure reproducibility.

In [None]:
# Download the dataset
DATASET_URL = "https://github.com/allisonhorst/palmerpenguins/raw/main/data/penguins.csv"
FILE_NAME = "penguins.csv"
raw_data_path = os.path.join(RAW_DATA_DIR, FILE_NAME)

if not os.path.exists(raw_data_path):
    print("Downloading data...")
    r = requests.get(DATASET_URL, timeout=30)
    r.raise_for_status()  # Fail loudly instead of saving an HTTP error page as the CSV
    with open(raw_data_path, 'wb') as f:
        f.write(r.content)
    print(f"Data downloaded and saved to {raw_data_path}")
else:
    print(f"Data already exists at {raw_data_path}")

# Load the dataset into a pandas DataFrame
df = pd.read_csv(raw_data_path)

print(f"Data loaded successfully. Initial shape: {df.shape}")
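
Saving a local copy only helps reproducibility if later runs can tell whether the file has changed. One possible check, sketched below, is to record a SHA-256 digest of the downloaded file. The helper name `sha256_of_file` and the throwaway demo file are illustrative assumptions, not part of the original workflow.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file so the sketch runs without network access.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "wb") as f:
        f.write(b"species,island\nAdelie,Torgersen\n")
    demo_digest = sha256_of_file(demo_path)

print(f"SHA-256: {demo_digest}")
```

In this notebook, one could call `sha256_of_file(raw_data_path)` right after the download and compare the result against a digest recorded on the access date.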

## 3. Data Cleaning & Preparation

This phase involves inspecting the data for quality issues such as missing values, duplicates, and incorrect data types. We'll then apply basic cleaning steps.

### Data Dictionary

| Column Name | Data Type | Description |
|-------------|-----------|-------------|
| species     | object    | Penguin species (Adélie, Gentoo, Chinstrap) |
| island      | object    | Island where the penguin was observed (Torgersen, Biscoe, Dream) |
| culmen_length_mm | float64 | Culmen length (mm) |
| culmen_depth_mm  | float64 | Culmen depth (mm) |
| flipper_length_mm| float64 | Flipper length (mm) |
| body_mass_g | float64   | Body mass (g) |
| sex         | object    | Penguin sex (male, female) |
| year        | int64     | Year of data collection |


In [None]:
# Inspect the data schema
print("\n--- Initial Data Info ---\n")
df.info()
print("\n--- First 5 rows ---\n")
print(df.head())
print("\n--- Missing Values ---\n")
print(df.isnull().sum())
print("\n--- Duplicate Rows ---\n")
print(f"Number of duplicate rows: {df.duplicated().sum()}")

**Observations:**
-   We have 8 columns, matching the data dictionary above. Some of them have missing values, most notably `sex`, `culmen_length_mm`, `culmen_depth_mm`, `flipper_length_mm`, and `body_mass_g`.
-   There are no duplicate rows.
-   The `culmen`, `flipper`, and `body_mass_g` columns are numeric (`float64`), as expected. The `sex` column is an object type with missing values.

In [None]:
# Drop rows with any missing values, as they are a small fraction of the total dataset.
df = df.dropna()
print(f"Shape after dropping rows with missing values: {df.shape}")

# Check value counts for 'sex' column to handle potential inconsistencies
print("\n--- 'sex' value counts ---\n")
print(df['sex'].value_counts())

# We will drop the single 'sex' row that has a '.' value as it is likely a data entry error.
df = df[df['sex'] != '.']
print(f"Shape after handling '.' in 'sex' column: {df.shape}")

# Re-check for missing values and duplicates after cleaning
print("\n--- Final check for missing values and duplicates ---\n")
print(f"Remaining missing values: {df.isnull().sum().sum()}")
print(f"Remaining duplicate rows: {df.duplicated().sum()}")

penguins_df = df.copy()
print(f"\nCleaned data shape: {penguins_df.shape}")
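
Because each cleaning step silently removes rows, it can help to log how many rows each one costs. The sketch below applies the same two steps (drop missing rows, then drop the `'.'` placeholder in `sex`) to a toy frame with made-up values. The function name `clean_with_report` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values that mimics the issues described above:
# a missing measurement, a missing sex, and a '.' placeholder in `sex`.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
        "flipper_length_mm": [181.0, np.nan, 217.0, 215.0, 195.0],
        "sex": ["male", "female", ".", np.nan, "female"],
    }
)


def clean_with_report(df, placeholder="."):
    """Drop rows with missing values, then rows whose `sex` is a placeholder.

    Returns the cleaned frame and a dict of row counts after each step.
    """
    report = {"start": len(df)}
    df = df.dropna()
    report["after_dropna"] = len(df)
    df = df[df["sex"] != placeholder]
    report["after_placeholder_filter"] = len(df)
    return df.reset_index(drop=True), report


toy_clean, toy_report = clean_with_report(toy)
print(toy_report)
```

A report like this makes it easy to confirm that the dropped rows really are a small fraction of the dataset before moving on.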

## 4. Exploratory Data Analysis (EDA)

In this section, we'll visualize the data to understand distributions, relationships between variables, and potential insights. We'll focus on the key features and the target variable, `species`.

In [None]:
# Descriptive statistics for numeric features
print("\n--- Descriptive Statistics for Numeric Features ---\n")
print(penguins_df.describe())

# Value counts for categorical features
print("\n--- Value Counts for Categorical Features ---\n")
print("Species:\n", penguins_df['species'].value_counts())
print("\nIsland:\n", penguins_df['island'].value_counts())
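
The overall `describe()` output pools all species together, which can hide between-species differences. A per-species summary is one way to see them. The sketch below uses a toy frame with made-up numbers and placeholder species labels; in the notebook the same `groupby` call would be applied to `penguins_df`.

```python
import pandas as pd

# Made-up measurements for illustration only; not taken from the dataset.
toy = pd.DataFrame(
    {
        "species": ["A", "A", "B", "B"],
        "body_mass_g": [3500.0, 3700.0, 5000.0, 5200.0],
        "flipper_length_mm": [185.0, 195.0, 215.0, 225.0],
    }
)

# Mean and standard deviation of each measurement within each species.
per_species = toy.groupby("species")[["body_mass_g", "flipper_length_mm"]].agg(
    ["mean", "std"]
)
print(per_species.round(1))
```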

In [None]:
# Plot 1: Histogram of body mass by species
plt.figure(figsize=(10, 6))
sns.histplot(data=penguins_df, x='body_mass_g', hue='species', kde=True)
plt.title('Distribution of Body Mass by Species')
plt.xlabel('Body Mass (g)')
plt.ylabel('Count')
plt.grid(True)
plt.savefig(os.path.join(FIGURES_DIR, 'body_mass_histogram.png'))
plt.show()

In [None]:
# Plot 2: Scatter plot of flipper length vs. culmen length by species
plt.figure(figsize=(10, 6))
sns.scatterplot(data=penguins_df, x='flipper_length_mm', y='culmen_length_mm', hue='species', style='species', s=100)
plt.title('Flipper Length vs. Culmen Length by Species')
plt.xlabel('Flipper Length (mm)')
plt.ylabel('Culmen Length (mm)')
plt.grid(True)
plt.savefig(os.path.join(FIGURES_DIR, 'flipper_culmen_scatter.png'))
plt.show()

In [None]:
# Plot 3: Box plot to visualize flipper length outliers by species
plt.figure(figsize=(10, 6))
sns.boxplot(data=penguins_df, x='species', y='flipper_length_mm')
plt.title('Box Plot of Flipper Length by Species')
plt.xlabel('Species')
plt.ylabel('Flipper Length (mm)')
plt.grid(True)
plt.savefig(os.path.join(FIGURES_DIR, 'flipper_length_boxplot.png'))
plt.show()

### Key Insights from EDA

1.  **Species are Distinct:** The scatter plot of `flipper_length_mm` vs. `culmen_length_mm` shows clear separation between the three species, suggesting these features will be highly predictive for our model.
2.  **Size Differences:** The `Gentoo` species generally has a larger body mass and flipper length than the `Adélie` and `Chinstrap` species, indicating it's the largest of the three.
3.  **No Significant Outliers:** The box plots show a few potential outliers, but they are not extreme and are likely natural variations within the dataset, not data entry errors. The overall distributions are clean.
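
The outlier observation can also be checked numerically. The sketch below flags values using the 1.5 × IQR rule, the same convention that box-plot whiskers use by default. The helper `iqr_outliers` and the sample values are assumptions made for illustration.

```python
import pandas as pd


def iqr_outliers(values, whis=1.5):
    """Return the values lying more than `whis` * IQR beyond the quartiles."""
    values = pd.Series(values, dtype="float64")
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]


# Made-up flipper lengths (mm); the last value is deliberately extreme.
toy_flippers = [190, 192, 195, 197, 198, 200, 202, 250]
flagged = iqr_outliers(toy_flippers)
print(flagged.tolist())
```

In the notebook, something like `penguins_df.groupby('species')['flipper_length_mm'].apply(iqr_outliers)` would list the flagged points per species, so they can be inspected rather than judged by eye.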

## 5. Modeling (Baseline)

We will now build a simple baseline classification model. The problem is to predict the `species` of a penguin based on its physical measurements. We'll use **Logistic Regression**, a standard choice for a baseline classification model, due to its simplicity and interpretability.

In [None]:
# Define features and target
features = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']
target = 'species'

X = penguins_df[features]
y = penguins_df[target]

print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")

In [None]:
# Create a train/test split with stratification and a fixed random state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y
)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
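
Stratification is meant to keep class proportions the same in the training and test sets. The sketch below checks this on synthetic, imbalanced labels; the label names and counts are made up and only stand in for `species`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for `species` (counts are made up).
labels = pd.Series(["A"] * 60 + ["B"] * 30 + ["C"] * 10)
dummy_X = np.arange(len(labels)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    dummy_X, labels, test_size=0.2, random_state=42, stratify=labels
)

# Compare the share of each class in the full data and in each split.
proportions = pd.DataFrame(
    {
        "full": labels.value_counts(normalize=True),
        "train": y_tr.value_counts(normalize=True),
        "test": y_te.value_counts(normalize=True),
    }
)
print(proportions)
```

Running the same comparison on `y`, `y_train`, and `y_test` would confirm that the split above preserved the species balance.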

In [None]:
# Train a Logistic Regression model
model = LogisticRegression(random_state=RANDOM_SEED, max_iter=1000)
print("\nTraining the Logistic Regression model...")
model.fit(X_train, y_train)
print("Model training complete.")

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print("\nBaseline Model Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"F1-Score (Weighted): {f1:.4f}")
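
Accuracy and weighted F1 are single summary numbers and do not show which species get confused with each other. A confusion matrix and a per-class report, sketched below on made-up labels, give that breakdown. In the notebook, `y_test` and `y_pred` would take the place of the toy lists.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true and predicted labels, for illustration only.
class_names = ["Adelie", "Chinstrap", "Gentoo"]
toy_true = ["Adelie", "Adelie", "Adelie", "Chinstrap", "Chinstrap",
            "Gentoo", "Gentoo", "Gentoo"]
toy_pred = ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Adelie",
            "Gentoo", "Gentoo", "Gentoo"]

# Rows are true classes, columns are predicted classes, in `class_names` order.
cm = confusion_matrix(toy_true, toy_pred, labels=class_names)
print(cm)

# Precision, recall, and F1 for each class separately.
print(classification_report(toy_true, toy_pred, labels=class_names, zero_division=0))
```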

## 6. Results and Next Steps

### Interpretation of Results

The Logistic Regression model performed exceptionally well, achieving near-perfect accuracy and F1 scores on the held-out test set. This is likely because the chosen features (culmen, flipper, and body mass measurements) differ markedly between penguin species, as observed in our EDA. This simple model provides a strong baseline against which more complex models can be compared.

### Limitations and Future Work

**One Limitation:** The current model's success depends heavily on clean, highly predictive features. It has not been tested against noisy measurements, more complex relationships, or situations where the classes are less clearly separated. For this dataset the problem is straightforward, but for a more complex task this simple model might not be sufficient.

**Next Steps:**
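1.  **Feature Scaling:** Standardize or normalize the numeric features (`culmen_length_mm`, `culmen_depth_mm`, etc.) to improve the performance of scale-sensitive models like Support Vector Machines (SVM) or K-Nearest Neighbors. Scaling can also help the Logistic Regression solver converge, since `body_mass_g` is on a much larger scale than the millimetre measurements.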
2.  **Hyperparameter Tuning:** Use techniques like `GridSearchCV` to find the optimal hyperparameters for the Logistic Regression model.
3.  **Cross-Validation:** Implement k-fold cross-validation to get a more robust estimate of the model's performance, reducing dependence on a single train/test split.
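
The three next steps can be combined in one place. The following is a minimal sketch, not a tuned solution: it uses synthetic data from `make_classification` as a stand-in for the penguin features, and the grid of `C` values is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data with 4 features, standing in for the penguin measurements.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

# Scaling lives inside the pipeline so it is re-fit on each training fold only,
# which keeps information from the validation fold out of the scaler.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified 5-fold cross-validation with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Search over the regularization strength C (grid values are illustrative).
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1_weighted",
    cv=cv,
)
search.fit(X_demo, y_demo)

print(f"Best C: {search.best_params_['logisticregression__C']}")
print(f"Mean cross-validated weighted F1: {search.best_score_:.4f}")
```

To apply this to the notebook's data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, keeping the test set untouched until the final evaluation.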