### Penguins binary classification dataset

The dataset used in this notebook contains measurements and categorical attributes of penguins: species, island of origin, bill length and depth, flipper length, body mass, and year of observation. It is used here for a binary classification task: predicting whether a penguin is an Adelie or a Gentoo.


In [22]:
import pandas as pd

# Forward slashes keep the relative path portable across operating systems
df = pd.read_csv("datasets/penguins_binary_classification.csv")
print("Dataset Shape:", df.shape)
print("\nFirst few rows of the dataset:")
df.head()

Dataset Shape: (274, 7)

First few rows of the dataset:


,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,2007
3,Adelie,Torgersen,36.7,19.3,193.0,3450.0,2007
4,Adelie,Torgersen,39.3,20.6,190.0,3650.0,2007


#### Check if there are missing values in the dataset.

In [38]:
missing_values = df.isna().sum()
if missing_values.any():
    missing_info = pd.DataFrame({"Missing Values": missing_values})
    print("Columns with missing values:")
    print(missing_info[missing_info["Missing Values"] > 0])
else:
    print("No missing values in the dataset.")

No missing values in the dataset.


#### Check if the dataset has fully duplicated rows.

In [24]:
duplicated_rows = df[df.duplicated(keep=False)]
print(f"Number of fully duplicated rows: {len(duplicated_rows)}")

if duplicated_rows.empty:
    print("No fully duplicated rows found in the dataset.")
else:
    print("Duplicated rows:")
    print(duplicated_rows.sort_values(by=df.columns.tolist()))

Number of fully duplicated rows: 0
No fully duplicated rows found in the dataset.


#### Check the correlation between the penguins' bill depth and body mass.

In [25]:
bill_depth = df["bill_depth_mm"]
body_mass = df["body_mass_g"]

correlation = bill_depth.corr(body_mass)
print(f"Correlation between bill depth and body mass: {correlation:.4f}")

Correlation between bill depth and body mass: -0.4832


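The correlation is moderately negative (about -0.48): across the pooled dataset, penguins with deeper bills tend to have lower body mass. Because the data combine two species, this pooled value may reflect differences between Adelie and Gentoo penguins rather than a relationship that holds within each species.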
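
When two groups are pooled, the overall correlation can differ in sign from the correlation inside each group. The sketch below illustrates this on a small hand-made table; the `toy` values are hypothetical and are not taken from the penguins file. It compares the pooled Pearson coefficient with per-species coefficients.

```python
import pandas as pd

# Hypothetical toy data: within each species, deeper bills go with higher mass,
# but the heavier species has shallower bills overall.
toy = pd.DataFrame({
    "species": ["Adelie"] * 4 + ["Gentoo"] * 4,
    "bill_depth_mm": [17.0, 18.0, 19.0, 20.0, 14.0, 15.0, 16.0, 17.0],
    "body_mass_g": [3400.0, 3600.0, 3800.0, 4000.0, 4600.0, 4900.0, 5200.0, 5500.0],
})

# Correlation over all rows, ignoring species
pooled_corr = toy["bill_depth_mm"].corr(toy["body_mass_g"])

# Correlation computed separately inside each species
per_species_corr = {
    name: group["bill_depth_mm"].corr(group["body_mass_g"])
    for name, group in toy.groupby("species")
}

print(f"Pooled correlation: {pooled_corr:.4f}")
for name, value in per_species_corr.items():
    print(f"{name} correlation: {value:.4f}")
```

Running the same `groupby` comparison on `df` would show whether the negative pooled value also holds within Adelie and Gentoo separately.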

#### Dropping `year` and `island` columns.

In [26]:
df = df.drop(columns=["year", "island"])
df.head()

,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,Adelie,39.1,18.7,181.0,3750.0
1,Adelie,39.5,17.4,186.0,3800.0
2,Adelie,40.3,18.0,195.0,3250.0
3,Adelie,36.7,19.3,193.0,3450.0
4,Adelie,39.3,20.6,190.0,3650.0


#### Encode the `species` column as a binary label.

In [27]:
# Map `species` to a binary label: Adelie = 0, Gentoo = 1
df["species"] = df["species"].map({"Adelie": 0, "Gentoo": 1})
df.head()


,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,0,39.1,18.7,181.0,3750.0
1,0,39.5,17.4,186.0,3800.0
2,0,40.3,18.0,195.0,3250.0
3,0,36.7,19.3,193.0,3450.0
4,0,39.3,20.6,190.0,3650.0
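
For contrast, the sketch below shows a binary label mapping with `Series.map`, as applied to `species` above, next to a one-hot encoding produced by `pd.get_dummies`. The `toy_species` series is a made-up example, not the notebook's data.

```python
import pandas as pd

# Hypothetical example values
toy_species = pd.Series(["Adelie", "Gentoo", "Adelie", "Gentoo"], name="species")

# Binary label mapping: one integer column, suitable as a classification target
binary_labels = toy_species.map({"Adelie": 0, "Gentoo": 1})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy_species)

print(binary_labels.tolist())
print(one_hot.columns.tolist())
```

A single 0/1 column is the usual choice for a two-class target; one-hot encoding is more commonly applied to categorical input features.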


#### Split the dataset into train data and test data.

In [28]:
from sklearn.model_selection import train_test_split

# Separate features and target
X = df.drop("species", axis=1)
y = df["species"]  # target

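X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=16)

print("Train shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)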

Train shape: (205, 4) (205,)
Test shape: (69, 4) (69,)


#### Count Gentoo penguins in the train and test sets.

In [29]:
# Count number of Gentoo penguins (species == 1) in train and test sets
gentoo_train_count = (y_train == 1).sum()
gentoo_test_count = (y_test == 1).sum()

print(f"Gentoo count in train set: {gentoo_train_count}")
print(f"Gentoo count in test set: {gentoo_test_count}")

Gentoo count in train set: 90
Gentoo count in test set: 33
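
The Gentoo share differs a little between the two sets (90 of 205 in train, 33 of 69 in test). If matching class proportions are wanted, `train_test_split` accepts a `stratify` argument. The following is a minimal sketch on hypothetical toy arrays (`X_toy`, `y_toy`), not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 40 rows, 20 per class
X_toy = np.arange(80, dtype=float).reshape(40, 2)
y_toy = np.array([0] * 20 + [1] * 20)

# stratify=y_toy keeps the class proportions equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=16
)

print("Class 1 share in train:", y_tr.mean())
print("Class 1 share in test:", y_te.mean())
```

Applying `stratify=y` to the penguin split would change which rows land in each set, so the accuracies reported later would not necessarily be reproduced.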


#### Fit classification models without data scaling.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# KNN classifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)

# Decision Tree classifier
dt_clf = DecisionTreeClassifier(random_state=16)
dt_clf.fit(X_train, y_train)

# Logistic Regression classifier
lr_clf = LogisticRegression(random_state=16, max_iter=1000)
lr_clf.fit(X_train, y_train)

#### Getting predictions on test data for all models.

In [32]:
knn_predictions = knn_clf.predict(X_test)
dt_predictions = dt_clf.predict(X_test)
lr_predictions = lr_clf.predict(X_test)

#### Fit classification models with data scaling.

1. Scale the data first

In [33]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

2. Fit the models with scaled data

In [34]:
knn_clf_scaled = KNeighborsClassifier()
knn_clf_scaled.fit(X_train_scaled, y_train)

dt_clf_scaled = DecisionTreeClassifier(random_state=16)
dt_clf_scaled.fit(X_train_scaled, y_train)

lr_clf_scaled = LogisticRegression(random_state=16, max_iter=1000)
lr_clf_scaled.fit(X_train_scaled, y_train)
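
An alternative to scaling by hand is to chain the scaler and the model with `make_pipeline`, so the scaler is fitted only on the data passed to `fit` and reused automatically in `predict`. This is a sketch under assumptions: the toy arrays and the name `knn_pipeline` are hypothetical, and `n_neighbors=3` is chosen only because the toy set is tiny.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: bill depth (mm) and body mass (g) on very different scales
X_toy = np.array([
    [18.0, 3500.0], [19.0, 3700.0], [18.5, 3600.0],  # class 0
    [14.0, 5000.0], [15.0, 5200.0], [14.5, 5100.0],  # class 1
])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# The pipeline standardizes features, then fits KNN on the standardized values
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_pipeline.fit(X_toy, y_toy)

# New rows are scaled with the statistics learned during fit
X_new = np.array([[18.2, 3550.0], [14.2, 5050.0]])
toy_predictions = knn_pipeline.predict(X_new)
print(toy_predictions)
```

Keeping the scaler inside the pipeline reduces the risk of accidentally fitting it on test data.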


#### Getting predictions on scaled test data for all models.

In [35]:
knn_predictions_scaled = knn_clf_scaled.predict(X_test_scaled)
dt_predictions_scaled = dt_clf_scaled.predict(X_test_scaled)
lr_predictions_scaled = lr_clf_scaled.predict(X_test_scaled)


#### Calculating accuracy metric for all models (not scaled).

In [37]:
from sklearn.metrics import accuracy_score

knn_accuracy = accuracy_score(y_test, knn_predictions)
dt_accuracy = accuracy_score(y_test, dt_predictions)
lr_accuracy = accuracy_score(y_test, lr_predictions)

print(f"KNN (not scaled) accuracy: {knn_accuracy:.4f}")
print(f"Decision Tree (not scaled) accuracy: {dt_accuracy:.4f}")
print(f"Logistic Regression (not scaled) accuracy: {lr_accuracy:.4f}")


KNN (not scaled) accuracy: 0.9565
Decision Tree (not scaled) accuracy: 0.9855
Logistic Regression (not scaled) accuracy: 1.0000
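
With a small test set, each accuracy value corresponds to a whole number of misclassified rows. The check below is plain arithmetic and assumes the test-set size of 69 shown in the split output.

```python
# Test-set size taken from the split output above
n_test = 69

# Accuracy for 0, 1, and 3 misclassified rows, formatted like the notebook output
accuracy_by_errors = {
    errors: f"{(n_test - errors) / n_test:.4f}" for errors in (0, 1, 3)
}

for errors, accuracy in accuracy_by_errors.items():
    print(f"{errors} misclassified -> accuracy {accuracy}")
# 0 misclassified -> accuracy 1.0000
# 1 misclassified -> accuracy 0.9855
# 3 misclassified -> accuracy 0.9565
```

So 0.9855 corresponds to one misclassified penguin and 0.9565 to three, which is worth keeping in mind when comparing models on 69 rows.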


#### Calculating accuracy metric for all models (scaled).

In [36]:
from sklearn.metrics import accuracy_score

knn_accuracy_scaled = accuracy_score(y_test, knn_predictions_scaled)
dt_accuracy_scaled = accuracy_score(y_test, dt_predictions_scaled)
lr_accuracy_scaled = accuracy_score(y_test, lr_predictions_scaled)

print(f"KNN (scaled) accuracy: {knn_accuracy_scaled:.4f}")
print(f"Decision Tree (scaled) accuracy: {dt_accuracy_scaled:.4f}")
print(f"Logistic Regression (scaled) accuracy: {lr_accuracy_scaled:.4f}")


KNN (scaled) accuracy: 1.0000
Decision Tree (scaled) accuracy: 0.9855
Logistic Regression (scaled) accuracy: 1.0000
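
One plausible reading of these numbers is that KNN relies on distances, so a feature with a large numeric range can dominate the neighbour search on unscaled data, while a decision tree splits on per-feature thresholds and is largely unaffected by standardization. The toy example below uses hypothetical data built to show that effect; it is not derived from the penguins dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: feature 0 (small range) separates the classes,
# feature 1 (large range) carries no class information.
X_toy = np.array([
    [1.0, 1000.0], [1.1, 3000.0], [0.9, 5000.0],  # class 0
    [2.0, 1100.0], [2.1, 3100.0], [1.9, 5100.0],  # class 1
])
y_toy = np.array([0, 0, 0, 1, 1, 1])
query = np.array([[1.0, 3090.0]])  # feature 0 says class 0

scaler = StandardScaler().fit(X_toy)
X_toy_scaled = scaler.transform(X_toy)
query_scaled = scaler.transform(query)

# 1-nearest-neighbour: the unscaled distance is dominated by feature 1
knn_raw_pred = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy).predict(query)[0]
knn_scaled_pred = (
    KNeighborsClassifier(n_neighbors=1).fit(X_toy_scaled, y_toy).predict(query_scaled)[0]
)

# Decision tree: the threshold on feature 0 is found with or without scaling
tree_raw_pred = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy).predict(query)[0]
tree_scaled_pred = (
    DecisionTreeClassifier(random_state=0).fit(X_toy_scaled, y_toy).predict(query_scaled)[0]
)

print("KNN raw / scaled:", knn_raw_pred, knn_scaled_pred)
print("Tree raw / scaled:", tree_raw_pred, tree_scaled_pred)
```

On this toy set the unscaled KNN picks a neighbour that is close only in the large-range feature, while the tree gives the same answer either way.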
