🐧 The Penguin Predictor

Objective: Build a machine learning model to identify penguin species based on their physical measurements.

The Workflow:

    -Load Data

    -Clean Data (Handle missing values)

    -Split Data (Training vs Testing)

    -Train Model (Random Forest)

    -Predict

In [16]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

print("Libraries imported successfully!")

Libraries imported successfully!


1. Load the Data

We will use the Palmer Penguins dataset. It contains measurements for three species: Adélie, Chinstrap, and Gentoo.

In [17]:
# Load dataset from seaborn
df = sns.load_dataset('penguins')

# Display the first 5 rows
df.head()

   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0   Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1   Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2   Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
3   Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
4   Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female


2. Data Cleaning

Many machine learning models cannot handle missing values (NaNs), so let's check our data quality before training.

In [18]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

# Drop rows with missing values
df_clean = df.dropna()

print(f"\nOriginal shape: {df.shape}")
print(f"Cleaned shape:  {df_clean.shape}")

Missing values per column:
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

Original shape: (344, 7)
Cleaned shape:  (333, 7)
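
The counts above show that `sex` accounts for most of the missing values (11 rows), while each measurement column is missing only 2. Because `sex` is not among the features used later, a narrower drop might keep more rows. The following is a minimal sketch on a made-up toy frame (the `toy` name and its values are illustrative, not taken from the dataset), using the `subset` argument of pandas' `dropna`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking a few penguin columns; values are made up.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 46.1, 49.0],
    "body_mass_g": [3750.0, np.nan, 4500.0, 3800.0],
    "sex": ["Male", None, None, "Female"],
})

# Rows that contain at least one missing value
rows_with_nan = toy[toy.isnull().any(axis=1)]

# dropna() removes every row that has any missing value
dropped_all = toy.dropna()

# subset= only checks the listed columns, so a missing `sex` alone does not drop a row
dropped_subset = toy.dropna(subset=["bill_length_mm", "body_mass_g"])

print(len(rows_with_nan), len(dropped_all), len(dropped_subset))
```

In the notebook, the equivalent call would be something like `df.dropna(subset=[...])` with the four measurement columns listed, which trades a slightly larger training set for keeping rows whose `sex` is unknown.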


3. Prepare Features & Targets

X (Features): The physical measurements (Bill length, Flipper length, etc.)

y (Target): The species we want to predict.

In [19]:
# Select only numerical features for simplicity
feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

X = df_clean[feature_columns]
y = df_clean['species']

# Preview the input data
X.head()

   bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
0            39.1           18.7              181.0       3750.0
1            39.5           17.4              186.0       3800.0
2            40.3           18.0              195.0       3250.0
4            36.7           19.3              193.0       3450.0
5            39.3           20.6              190.0       3650.0
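
The notebook keeps only the numerical columns. If the categorical columns (`island`, `sex`) were to be included as well, they would first need a numeric encoding. One possible approach, sketched here on a made-up toy frame (the `toy` and `encoded` names and the values are illustrative), is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 46.1, 49.0],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "sex": ["Male", "Female", "Female"],
})

# Each category becomes its own indicator column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["island", "sex"])

print(encoded.columns.tolist())
```

This is optional for the workflow here; the four measurements alone are what the rest of the notebook trains on.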


4. Train/Test Split

We split the data into training (80%) and testing (20%) sets so that the model is evaluated on samples it never saw during training.

In [20]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples:  {len(X_test)}")

Training samples: 266
Testing samples:  67
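
A plain random split does not guarantee that each species appears in the test set in the same proportion as in the full data. A possible refinement, sketched below on made-up labels (the `labels` and `features` names and the class sizes are hypothetical), is to pass `stratify=` to `train_test_split`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 of class "A", 30 of "B", 10 of "C"
labels = pd.Series(["A"] * 60 + ["B"] * 30 + ["C"] * 10)
features = pd.DataFrame({"x": range(100)})

# stratify=labels keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

test_counts = y_te.value_counts().to_dict()
print(test_counts)
```

In the notebook this would correspond to adding `stratify=y` to the existing call; note that doing so changes which rows land in each split, so the reported numbers would likely differ.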


5. Model Training

We will use a Random Forest Classifier, which trains many decision trees and combines their votes into a single prediction.

In [None]:
# Initialize and train the model
# n_estimators=100 builds 100 decision trees; random_state=42 makes the result reproducible
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Model trained successfully!")

Model trained successfully!
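
To make the "voting" idea concrete: in scikit-learn, the forest's vote is weighted by each tree's class-probability estimate, so the forest's probabilities are the average over its trees. The sketch below checks this on synthetic stand-in data from `make_blobs` (not the penguin dataset; the `forest`, `X_demo`, and `sample` names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the penguin dataset): 3 clusters, 4 features
X_demo, y_demo = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

sample = X_demo[:1]

# Each fitted tree lives in estimators_; average their class probabilities by hand
tree_probs = np.mean(
    [tree.predict_proba(sample) for tree in forest.estimators_], axis=0
)

# The forest's own estimate should match that average
print(len(forest.estimators_))
print(np.allclose(tree_probs, forest.predict_proba(sample)))
```

The predicted class is then the one with the highest averaged probability.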


6. Evaluation

Let's see how well our model performs on data it has never seen before (the Test set).

In [24]:
# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

print("\nDetailed Report:")
print(classification_report(y_test, y_pred))

Model Accuracy: 100.00%

Detailed Report:
              precision    recall  f1-score   support

      Adelie       1.00      1.00      1.00        31
   Chinstrap       1.00      1.00      1.00        13
      Gentoo       1.00      1.00      1.00        23

    accuracy                           1.00        67
   macro avg       1.00      1.00      1.00        67
weighted avg       1.00      1.00      1.00        67
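
A perfect score on a single 67-sample test set reflects one particular split. One way to get a broader picture is k-fold cross-validation, which repeats the train/evaluate cycle on several different splits. The following is a minimal sketch with `cross_val_score` on synthetic stand-in data (not the penguin dataset; `clf`, `X_demo`, and `y_demo` are illustrative names):

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the penguin dataset): 3 clusters, 4 features
X_demo, y_demo = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=42)

# cv=5 fits and scores the model on 5 different train/validation splits
scores = cross_val_score(clf, X_demo, y_demo, cv=5)

print(f"Fold accuracies: {scores}")
print(f"Mean accuracy:   {scores.mean():.3f} (std {scores.std():.3f})")
```

In the notebook, passing `X` and `y` in place of the synthetic arrays would give per-fold accuracies for the penguin data; a spread between folds would suggest the single-split score is optimistic.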



7. Live Prediction

Let's test the model with a custom penguin!

In [35]:
import pandas as pd

# Extra measurement sets to try, with the species each is expected to resemble
samples = [
    [50.0, 15.0, 220.0, 5000.0],  # Gentoo
    [38.0, 18.0, 180.0, 3500.0],  # Adelie
    [45.0, 16.0, 200.0, 4200.0],  # Gentoo
    [42.0, 20.0, 190.0, 3800.0],  # Adelie
]

# bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g
new_penguin = pd.DataFrame(
    [[50.0, 15.0, 220.0, 5000.0]],
    columns=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
)

prediction = model.predict(new_penguin)
print(f"This looks like a {prediction[0]} penguin")


This looks like a Gentoo penguin
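
The `samples` list in the cell above is defined but only one penguin is actually predicted. A possible extension is to score all of them at once and look at the class probabilities as well. So that the sketch runs on its own, it trains a throwaway `demo_model` on made-up cluster centres (these are NOT real penguin measurements; the `centres`, `train`, and `demo_model` names are hypothetical). In the notebook, the already trained `model` would be used instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

# Stand-in training data: made-up cluster centres, not real measurements
rng = np.random.default_rng(0)
centres = {
    "Adelie": [39, 18, 190, 3700],
    "Chinstrap": [49, 18, 196, 3700],
    "Gentoo": [47, 15, 217, 5000],
}
frames = []
for name, centre in centres.items():
    block = pd.DataFrame(
        rng.normal(centre, [2, 1, 5, 300], size=(50, 4)), columns=feature_columns
    )
    block["species"] = name
    frames.append(block)
train = pd.concat(frames, ignore_index=True)

demo_model = RandomForestClassifier(n_estimators=100, random_state=42)
demo_model.fit(train[feature_columns], train["species"])

# Wrap all samples in one DataFrame so the column names match the training data
samples = pd.DataFrame(
    [
        [50.0, 15.0, 220.0, 5000.0],
        [38.0, 18.0, 180.0, 3500.0],
        [45.0, 16.0, 200.0, 4200.0],
        [42.0, 20.0, 190.0, 3800.0],
    ],
    columns=feature_columns,
)

predicted = demo_model.predict(samples)
proba = demo_model.predict_proba(samples)

# One row per sample: probability for each species plus the predicted label
result = pd.DataFrame(proba, columns=demo_model.classes_).round(2)
result["predicted"] = predicted
print(result)
```

Looking at the probabilities alongside the label helps spot borderline cases, where the forest is split between two species.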
