In [None]:
import numpy as np
import pandas as pd
import requests
import zipfile
import altair as alt
from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer

## Title : Abalone Age Prediction based on Physical Measurements and Sex

## Summary - 1:

In this project, we aimed to build a regression model using the k-Nearest Neighbours (k-NN) algorithm to predict the age of an abalone using its physical characteristics and sex. Since determining abalone age traditionally requires cutting the shell and counting rings (a destructive and time-consuming process), machine learning methods offer a non-destructive alternative for estimating age from easily measurable features. Age in this dataset is represented as Rings + 1.5, where each ring corresponds roughly to one year of growth and the additional 1.5 accounts for early development. For example, an abalone with 10 rings would be approximately 11.5 years old.

Our k-NN Regressor (with $k=5$) produced an RMSE of approximately 2.2 rings on the test set, meaning the model’s predictions are typically off from the true ring count by roughly 2 rings (and, because age is Rings + 1.5, by roughly 2 years). The scatter plot of actual versus predicted values shows a generally increasing trend but with noticeable spread, especially among older abalones. This indicates that while k-NN captures broad growth patterns, predicting exact age remains challenging due to natural biological variability and overlapping physical features. Nevertheless, the model provides a reasonable baseline and demonstrates that physical measurements contain meaningful information about abalone age. Further improvement could be achieved by tuning $k$ or exploring more complex models.

## Introduction - 2

#### 2.1 Project Goal:
Determine whether physical features and sex can accurately predict the age of an abalone.

#### 2.2 Background:
Abalones are a type of marine mollusk widely harvested for food and shell products. Understanding their age is important for marine biology research, sustainable fisheries management, and conservation efforts. However, determining the age of an abalone is not straightforward. The most accurate method requires cutting the shell and counting the number of rings inside under a microscope—an approach that is destructive, labor-intensive, and not feasible at scale for monitoring wild populations.

The Abalone dataset from the UCI Machine Learning Repository provides measurements of physical characteristics of abalone. These features include:
- Categorical: Sex (M, F, I)
- Continuous physical measurements: Length, Diameter, Height, Whole weight, Shucked weight, Viscera weight, Shell weight

The dataset contains 4,177 observations and 8 predictor variables, representing a reasonably large sample for training and evaluating regression models. The measurements reflect biological growth patterns, making them strong candidates for predicting age. In addition, these variables are non-destructive measurements that can be collected easily and consistently. Since abalone age is strongly related to its size and mass, this dataset provides an opportunity to use machine learning to predict an abalone's age from its measurable physical features.

Age in this dataset is recorded using the variable Rings, where Age = Rings + 1.5 years, reflecting the biological growth process of shell formation. Because Rings is a numeric count that can be treated as a continuous output, predicting age is naturally formulated as a regression problem. Using measurable physical features to estimate age has the potential to provide a non-destructive, scalable tool for resource management and marine science applications.
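
The conversion between ring count and age can be sketched as a pair of small helpers. The function names `rings_to_age` and `age_to_rings` are our own illustrative choices and are not used elsewhere in this notebook.

```python
def rings_to_age(rings: float) -> float:
    """Convert a ring count to an approximate age in years (Age = Rings + 1.5)."""
    return rings + 1.5


def age_to_rings(age: float) -> float:
    """Inverse of rings_to_age."""
    return age - 1.5


print(rings_to_age(10))  # 11.5
```

Because the offset is a constant, an error measured in rings has the same size when expressed in years, so a model can be trained and scored on Rings directly.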

## Methods - 3

#### 3.1 Model Selection:

We chose the k-Nearest Neighbours (k-NN) Regressor to predict the age of an abalone from its physical characteristics and sex.

This model estimates the age of a new abalone by finding the k most similar abalones in the training set and averaging their observed ring counts. Similarity is measured using Euclidean distance in the standardized feature space.
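
To make the prediction rule concrete, here is a minimal from-scratch sketch using only the standard library. The helper `knn_predict` and the toy numbers are hypothetical; the actual analysis below relies on scikit-learn's `KNeighborsRegressor` instead.

```python
import math


def knn_predict(train_X, train_y, query, k=5):
    """Predict a target by averaging the targets of the k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(row, query) for row in train_X]
    # Indices of the k smallest distances
    nearest = sorted(range(len(train_X)), key=lambda i: distances[i])[:k]
    return sum(train_y[i] for i in nearest) / k


# Made-up standardized features (two columns) and ring counts
train_X = [[-1.0, -0.9], [-0.8, -1.1], [0.0, 0.1], [0.9, 1.0], [1.1, 0.8]]
train_y = [5, 6, 9, 12, 14]

prediction = knn_predict(train_X, train_y, query=[1.0, 0.9], k=2)
print(prediction)  # 13.0
```

The two closest training points have 12 and 14 rings, so the prediction is their mean, 13.0.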

We selected k-NN because it is a simple, easily interpretable, and non-parametric model that does not assume linear relationships between predictors and age. Because abalone growth patterns are expected to be non-linear — with shell dimensions increasing quickly when young and more slowly with age — a flexible model such as k-NN is well suited to capture these unknown relationships between physical characteristics, sex, and age.

#### 3.2 Preprocessing Steps:

Prior to model fitting, all numeric features were standardized with scikit-learn's StandardScaler so that measurements on different scales contribute equally to the distance calculations. The categorical Sex feature was one-hot encoded.
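
One possible way to bundle these two preprocessing steps with the model is a scikit-learn pipeline. The sketch below is an alternative arrangement, not the code used later in this notebook (which applies `pd.get_dummies` and a standalone scaler). The toy DataFrame is made up and only mimics the column layout.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up sample with a subset of the abalone columns
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M", "F", "I"],
    "Length": [0.45, 0.53, 0.33, 0.44, 0.56, 0.30],
    "Whole_weight": [0.51, 0.68, 0.20, 0.52, 0.80, 0.15],
    "Rings": [15, 9, 7, 10, 11, 6],
})
X = toy.drop(columns="Rings")
y = toy["Rings"]

preprocessor = make_column_transformer(
    (StandardScaler(), ["Length", "Whole_weight"]),     # scale numeric columns only
    (OneHotEncoder(handle_unknown="ignore"), ["Sex"]),  # encode the categorical column
)
model = make_pipeline(preprocessor, KNeighborsRegressor(n_neighbors=3))
model.fit(X, y)
print(model.predict(X.head(2)))
```

Keeping the scaler inside the pipeline means it is fit only on whatever data is passed to `fit`, which helps avoid leaking test-set statistics into training.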

For simplicity, $k$ is fixed at 5 in this milestone. Hyperparameter tuning for $k$ is deferred to future analysis, in line with the Milestone 1 project instructions.
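
For reference, a future tuning step might look roughly like the sketch below. It uses synthetic data rather than the abalone dataset, and the candidate values of `n_neighbors` and the 5-fold setting are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (not the abalone data)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
param_grid = {"kneighborsregressor__n_neighbors": [1, 3, 5, 10, 20]}

# scikit-learn maximizes scores, so RMSE is exposed in negated form
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)

best_k = search.best_params_["kneighborsregressor__n_neighbors"]
best_rmse = -search.best_score_
print(best_k, best_rmse)
```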

The model is scored with Root Mean Squared Error (RMSE), since the target variable (Rings) is treated as a continuous variable. RMSE also penalizes large errors more heavily, which is desirable for age estimation: predicting an abalone to be far older or younger than it truly is causes more problems than a small deviation.
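
The effect of that penalty can be seen by comparing RMSE with mean absolute error (MAE) on two made-up sets of predictions that have the same total absolute error. MAE is shown only for contrast; it is not used elsewhere in this analysis.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error, shown for comparison."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [10, 10, 10, 10]
small_errors = [12, 8, 12, 8]      # every prediction is off by 2 rings
one_big_error = [10, 10, 10, 18]   # a single prediction is off by 8 rings

print(mae(actual, small_errors), rmse(actual, small_errors))    # 2.0 2.0
print(mae(actual, one_big_error), rmse(actual, one_big_error))  # 2.0 4.0
```

Both cases have the same MAE, but the single large miss doubles the RMSE.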

## Discussion - 4:

#### 4.1 Exploratory Data Analysis Discussion
The exploratory data analysis reveals that most physical features, including length, diameter, height, and the different weight measurements, are positively associated with the number of rings. In general, larger and heavier abalones tend to be older. However, these relationships are clearly nonlinear: growth is rapid when abalones are young and gradually tapers off, producing curved patterns in the scatterplot matrix. As measurements increase, the variability in ring counts also widens, indicating that abalones of similar size can still differ noticeably in age. This suggests that simple linear models may struggle to capture the underlying structure.

Among the predictors, height appears to be the least informative due to its narrow range and low variability. By contrast, the various weight variables (whole weight, shucked weight, viscera weight, and shell weight) show strong and highly correlated patterns, reinforcing the need for feature scaling to prevent any single weight variable from dominating distance calculations later in modeling.
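
To illustrate why scaling matters for a distance-based model, the sketch below uses made-up measurements in deliberately mismatched units (these are not abalone values). Before standardizing, the large-valued column decides which point is nearest. After standardizing, both columns contribute and the nearest neighbour changes.

```python
import math
import statistics

# Hypothetical raw measurements in very different units
lengths = [90.0, 100.0, 110.0, 120.0]   # e.g. millimetres
weights = [0.2, 0.9, 0.3, 1.0]          # e.g. kilograms


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def nearest_to_first(points):
    """Index of the point closest (Euclidean) to points[0]."""
    distances = [math.dist(points[0], p) for p in points[1:]]
    return distances.index(min(distances)) + 1


raw = list(zip(lengths, weights))
scaled = list(zip(standardize(lengths), standardize(weights)))

print(nearest_to_first(raw))     # 1
print(nearest_to_first(scaled))  # 2
```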

The three sex categories (Male, Female, and Infant) overlap heavily across all measurements, with no visually distinct patterns separating them. At Milestone 1 of this project, for simplicity, we assume that "sex" alone is not a strong indicator of age. However, it is still included as a categorical feature via one-hot encoding, as we want to retain this feature for future analysis.

Overall, the nonlinear growth patterns, strong correlations among weight-based predictors, and high degree of feature overlap support our choice of a k-Nearest Neighbours model. k-NN is well suited to datasets where local structure matters and where the relationship between predictors and the target does not follow a simple linear form.

## Importing Dataset 

#### Steps 
1. Prepare data validation Schema using Pandera
2. Load Data via URL 
3. Apply the Pandera Schema to validate the dataframe

##### Checking if packages are imported correctly

In [None]:
import pandera.pandas as pa
import deepchecks
import pointblank as pb

# Checking if packages are imported correctly
print("Deepchecks:", deepchecks.__version__)
print("Pointblank:", pb.__version__)

##### Setting up Data Checking Schema

Data types are taken from the original dataset source.

We will use Pandera to check the following:

1. Correct data file format
2. Correct column names
3. No empty observations
4. Missingness not beyond expected threshold
5. Correct data types in each column
6. No duplicate observations
7. No outlier or anomalous values
8. Correct category levels (i.e., no string mismatches or single values)

We will use Deepchecks to check the following:

9. Target/response variable follows expected distribution
10. No anomalous correlations between target/response variable and features/explanatory variables
11. No anomalous correlations between features/explanatory variables


In [None]:
from pandera import Column, Check, DataFrameSchema

# Allowed categories for Sex
SEX_CATEGORIES = ["M", "F", "I"]

abalone_schema = DataFrameSchema(
    {
        "Sex": Column(
            str,
            Check.isin(SEX_CATEGORIES),
            nullable=False
        ),
        "Length": Column(
            float,
            Check.ge(0.0),
            nullable=False
        ),
        "Diameter": Column(
            float,
            Check.ge(0.0),
            nullable=False
        ),
        "Height": Column(
            float,
            Check.ge(0.0),
            nullable=False
        ),
        "Whole_weight": Column(
            float,
            Check.ge(0.0),
            nullable=False
        ),
        "Shucked_weight": Column(
            float,
            Check.ge(0.0),
            nullable=False
        ),
        "Viscera_weight": Column(
            float,
            Check.ge(0.0),
            nullable=False
        ),
        "Shell_weight": Column(
            float,
            Check.ge(0.0),
            nullable=False
        ),
        "Rings": Column(
            int,
            Check.between(1, 30),
            nullable=False
        )
    },
    checks=[
        # Use `not` rather than `~`: on a plain Python bool, `~` is bitwise and always truthy
        Check(lambda df: not df.duplicated().any(), error="Duplicate rows found."),
        Check(lambda df: not df.isna().all(axis=1).any(), error="Empty rows found."),
        Check(lambda df: (df.isna().mean() <= 0.05).all(),
              error="Missingness exceeds 5% threshold.")
    ]
)


#### Loading Data

In [None]:
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"

column_names = [
    "Sex", "Length", "Diameter", "Height",
    "Whole_weight", "Shucked_weight",
    "Viscera_weight", "Shell_weight", "Rings"
]

# Wrap loading and validation in a function so the data is
# freshly downloaded and re-validated every time it is called.

def load_and_validate_abalone() -> pd.DataFrame:
    #1) Loading data 
    abalone_raw = pd.read_csv(url, header=None, names=column_names)

    #2) validation with pandera
    abalone_validated = abalone_schema.validate(abalone_raw, lazy=True)

    return abalone_validated

#### Validation with Pandera

In [None]:
import os

abalone = load_and_validate_abalone()

# Save the validated dataset (create the output folder if it does not exist yet)
os.makedirs("data", exist_ok=True)
abalone.to_csv("data/abalone_validated.csv", index=False)

# Peek at the dataframe
abalone

### Exploratory Data Analysis

We will create a pairwise scatterplot matrix below to assess feature correlations.

In [None]:
# Excluding column: Sex for cleaner display of graphs
new_column_names = ["Length", "Diameter", "Height",
    "Whole_weight", "Shucked_weight",
    "Viscera_weight", "Shell_weight", "Rings"
]

# Plot all variables against one another for EDA
chart = alt.Chart(abalone, width=150, height=100).mark_point().encode(
    alt.X(alt.repeat('row'), type='quantitative'),
    alt.Y(alt.repeat('column'), type='quantitative'),
    color='Sex:N'
).repeat(
    column=new_column_names,
    row=new_column_names
).properties(title="Scatterplot matrix")

chart

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# 1. One-hot encode the Sex categorical variable
# get_dummies() converts categories (F, I, M) -> columns Sex_I and Sex_M
# drop_first=True drops the first level (Sex_F) to avoid a redundant dummy column

abalone_converted = pd.get_dummies(abalone, columns=["Sex"], drop_first=True)

# 2. Split predictors (X) and target variable (y)
# "Rings" is the target we want to predict (continuous -> regression)

X = abalone_converted.drop("Rings", axis=1)
y = abalone_converted["Rings"]
abalone_converted.head()

In [None]:
# 3. Train-test split (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

#### Validations with Deepchecks

In [None]:
# validate training data for anomalous correlations between target/response variable 
# and features/explanatory variables, 
# as well as anomalous correlations between features/explanatory variables
# Do these on training data as part of EDA
from deepchecks.tabular.checks import FeatureLabelCorrelation, FeatureFeatureCorrelation, MultivariateDrift, LabelDrift
from deepchecks.tabular import Dataset

# Combine X_train and y_train to form training dataset
abalone_train = X_train.copy()
abalone_train["Rings"] = y_train


# Set up the Dataset object for Deepchecks (one-hot Sex columns are excluded)
abalone_train_ds = Dataset(
    abalone_train.drop(columns=["Sex_I", "Sex_M"]), 
    label="Rings",
    cat_features=[]
)

# 1. Feature–Label Correlation Check
#    Ensures no single feature is too predictive of the target.
#    PPS (predictive power score) must remain < 0.9.
check_feat_lab_corr = FeatureLabelCorrelation().add_condition_feature_pps_less_than(0.9)
check_feat_lab_corr_result = check_feat_lab_corr.run(dataset=abalone_train_ds)

# 2. Feature–Feature Correlation Check
#    Ensures no pair of features has correlation above 0.99.
#    n_pairs = 0 means absolutely no correlated pairs allowed.
check_feat_feat_corr = FeatureFeatureCorrelation().add_condition_max_number_of_pairs_above_threshold(
    threshold = 0.99,
    n_pairs = 0
)
check_feat_feat_corr_result = check_feat_feat_corr.run(dataset=abalone_train_ds)

# 3. Target Distribution Check (Label Drift)
#    Compares the distribution of the target ("Rings") between the training
#    and test sets to confirm that both come from the same population.
#    This guards against unexpected label shifts (drift score must stay < 0.2).

# Build Deepchecks dataset for test data
abalone_test = X_test.copy()
abalone_test["Rings"] = y_test

abalone_test_ds = Dataset(
    abalone_test.drop(columns=["Sex_I", "Sex_M"]),
    label="Rings",
    cat_features=[]
)

# Run the target distribution drift check
check_label_drift = LabelDrift().add_condition_drift_score_less_than(0.2)
check_label_drift_result = check_label_drift.run(
    train_dataset=abalone_train_ds,
    test_dataset=abalone_test_ds
)


if not check_feat_lab_corr_result.passed_conditions():
    raise ValueError("Feature–Label correlation exceeds the acceptable threshold.")

if not check_feat_feat_corr_result.passed_conditions():
    raise ValueError("Feature–Feature correlation exceeds the acceptable threshold.")

if not check_label_drift_result.passed_conditions():
    raise ValueError("Target variable distribution drift detected.")

Even though the feature–feature correlation check raises an error, the highly correlated pairs are all expected given the biology of abalone growth. As an abalone gets older and larger, all of the weight-related measurements (whole weight, shucked weight, viscera weight, and shell weight) naturally increase together, so strong correlations among them are expected.
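
As a rough illustration of what such a check looks for, the sketch below lists feature pairs whose absolute Pearson correlation exceeds a threshold. It runs on synthetic data in which two weight-like columns are driven by the same underlying size. The column values, the `Noise` column, and the 0.9 threshold are all assumptions, and Deepchecks may compute associations differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
size = rng.uniform(0.1, 0.8, n)  # synthetic body size, not real data

columns = {
    "Whole_weight": size**3 + rng.normal(0, 0.005, n),
    "Shucked_weight": 0.4 * size**3 + rng.normal(0, 0.005, n),
    "Noise": rng.normal(0, 1, n),  # unrelated column for contrast
}
names = list(columns)

# Correlation matrix with one row/column per feature
corr = np.corrcoef(np.column_stack([columns[c] for c in names]), rowvar=False)

threshold = 0.9
flagged = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(flagged)
```

Columns that grow with the same underlying quantity end up strongly correlated with each other, while the unrelated column does not.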

In [None]:
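# 4. Standardize features: fit the scaler on the training set only, then apply it to both sets
#    (note: X also contains the one-hot Sex columns, so they are rescaled too)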
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)


# 5. Define the k-NN Regressor model
# weights="uniform" -> all neighbors contribute equally
# metric="minkowski", p=2 -> Euclidean distance
knn = KNeighborsRegressor(
    n_neighbors = 5,
    weights = "uniform",
    metric = "minkowski",
    p=2
)

# 6. Fit the model, then make predictions on both train and test sets
knn.fit(X_train_scaled, y_train)
y_train_pred = knn.predict(X_train_scaled)
y_test_pred  = knn.predict(X_test_scaled)

# 7. Evaluate performance using RMSE (Root Mean Squared Error)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse  = np.sqrt(mean_squared_error(y_test, y_test_pred))

print("=== KNN Regression performance (k=5) ===")
print(f"Train RMSE : {train_rmse:.4f}")
print(f"Test  RMSE : {test_rmse:.4f}")

In [None]:
#8. Plot Actual vs Predicted Rings for visual inspection
plt.figure(figsize=(7,5))
plt.scatter(y_test, y_test_pred, alpha=0.5)

plt.plot(
    [y_test.min(), y_test.max()],
    [y_test.min(), y_test.max()],
    linestyle="--",
    color="black"
)

plt.xlabel("Actual Rings")
plt.ylabel("Predicted Rings")
plt.title("KNN Regression: Actual vs Predicted")
plt.show()

## References: 

Dua, D., & Graff, C. (1995). Abalone Data Set. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/1/abalone

Nash, W. J., Sellers, T. L., Talbot, S. R., Cawthorn, A. J., & Ford, W. B. (1994). The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait. Sea Fisheries Division Technical Report No. 48.

Waugh, S. (1995). Extending and benchmarking Cascade-Correlation (PhD thesis). Department of Computer Science, University of Tasmania.

Clark, D., Schreter, Z., & Adams, A. (1996). A Quantitative Comparison of Dystal and Backpropagation. Proceedings of the Australian Conference on Neural Networks (ACNN’96).