In [16]:
import numpy as np
import pandas as pd
import requests
import zipfile
import altair as alt
from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, make_scorer

## Title: Abalone Age Prediction Based on Physical Measurements and Sex

### Summary 

## Introduction

#### Project Goal:
Understand if physical features and sex can accurately predict the age of an abalone.

#### Background:
Abalones are a type of marine mollusk widely harvested for food and shell products. Understanding their age is important for marine biology research, sustainable fisheries management, and conservation efforts. However, determining the age of an abalone is not straightforward. The most accurate method requires cutting the shell and counting the number of rings inside under a microscope—an approach that is destructive, labor-intensive, and not feasible at scale for monitoring wild populations.

The Abalone dataset from the UCI Machine Learning Repository provides measurements of physical characteristics such as shell length, diameter, height, whole weight, shucked weight, viscera weight, and shell weight. These variables are non-destructive measurements that can be collected easily and consistently. Since abalone age is strongly related to its size and mass, this dataset provides an opportunity to use machine learning to predict an abalone’s age from its measurable physical features.

Age in this dataset is recorded using the variable Rings, where Age = Rings + 1.5 years, reflecting the biological growth process of shell formation. Because Rings is a numeric count that can be treated as a continuous output, predicting age is naturally formulated as a regression problem.
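As a small illustration of the conversion above, the sketch below derives an age column from ring counts. The `Age` column name and the three-row sample are assumptions made for this example only.

```python
import pandas as pd

# Illustrative sample: three ring counts (not the full dataset).
sample = pd.DataFrame({"Rings": [15, 7, 9]})

# Age in years is the ring count plus 1.5, as stated above.
sample["Age"] = sample["Rings"] + 1.5
print(sample)
```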


## Methods

#### Model Selection:

The k-Nearest Neighbours (k-NN) Regressor is our chosen model to predict the age of an abalone from its physical characteristics and sex.

This model estimates the age of a new abalone by finding the k most similar abalones in the training set and averaging their observed ring counts. Similarity is measured using Euclidean distance in the standardized feature space.
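To make the prediction rule concrete, here is a minimal from-scratch sketch using NumPy. The helper `knn_predict` and the toy values are hypothetical and only illustrate the idea; the analysis itself relies on scikit-learn.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict a target as the mean of the k nearest training targets."""
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k training points with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Average the ring counts of those neighbours
    return y_train[nearest].mean()

# Made-up standardized features (two columns) and ring counts.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [5.0, 5.0]])
y_train = np.array([7.0, 9.0, 10.0, 15.0])

# The three nearest neighbours of (1, 1) have 7, 9, and 10 rings.
prediction = knn_predict(X_train, y_train, np.array([1.0, 1.0]), k=3)
print(prediction)
```

The far-away point with 15 rings does not influence the prediction, which is what makes k-NN a local method.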

We selected k-NN because it is a simple, easily interpretable, and non-parametric model that does not assume linear relationships between predictors and age. Because abalone growth patterns are expected to be non-linear — with shell dimensions increasing quickly when young and more slowly with age — a flexible model such as k-NN is well suited to capture these unknown relationships between physical characteristics, sex, and age.

#### Preprocessing Steps:

Prior to model fitting, all numeric features were standardized with `StandardScaler` to ensure that measurements on different scales contributed equally to the distance calculations. The categorical Sex feature was one-hot encoded.
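The preprocessing described above could be wired into a scikit-learn pipeline roughly as follows. This is a sketch under assumptions: it uses only two numeric columns and six rows taken from the data for brevity, and `n_neighbors=3` and `handle_unknown="ignore"` are placeholder choices rather than settings from our analysis.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small illustrative frame with a subset of columns (not the full dataset).
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I", "F"],
    "Length": [0.455, 0.350, 0.530, 0.440, 0.330, 0.565],
    "Whole_weight": [0.5140, 0.2255, 0.6770, 0.5160, 0.2050, 0.8870],
    "Rings": [15, 7, 9, 10, 7, 11],
})
X, y = toy.drop(columns="Rings"), toy["Rings"]

# Standardize numeric columns and one-hot encode Sex.
preprocessor = make_column_transformer(
    (StandardScaler(), ["Length", "Whole_weight"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
)

# n_neighbors=3 is an arbitrary placeholder value for this sketch.
knn_pipeline = make_pipeline(preprocessor, KNeighborsRegressor(n_neighbors=3))
knn_pipeline.fit(X, y)
predictions = knn_pipeline.predict(X)

# Two scaled columns plus three one-hot columns (F, I, M).
print(knn_pipeline[:-1].transform(X).shape)
```

Keeping the scaler and encoder inside the pipeline means they are fitted only on the data passed to `fit`, which avoids leaking information from held-out data.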

For simplicity, hyperparameter tuning for k is deferred to a future analysis, in line with the Milestone 1 project instructions.
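For reference, one way the future tuning of k might look is a cross-validated grid search. The sketch below runs on synthetic stand-in data, and the candidate values of k, the five folds, and the scoring string are assumptions rather than decisions made in this milestone.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the abalone data): one noisy non-linear feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = 10 * np.sqrt(X[:, 0]) + rng.normal(0, 0.5, size=200)

pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Candidate values of k are an arbitrary choice for this sketch.
param_grid = {"kneighborsregressor__n_neighbors": [1, 5, 15, 30]}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # scores are negated RMSE values
)
search.fit(X, y)

best_k = search.best_params_["kneighborsregressor__n_neighbors"]
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_k, best_rmse)
```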

Scoring of the model will be based on Root Mean Squared Error (RMSE), since the target variable (Rings) is a continuous variable. RMSE also penalizes large errors more heavily, which is desirable for age estimation because predicting an abalone to be far older or younger than it truly is causes more problems than small deviations.
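The following sketch shows how RMSE weights large errors more heavily. The `rmse` helper and the made-up predictions are for illustration only: both prediction sets have the same total absolute error of 4 rings, but the one with a single large miss gets the higher RMSE.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = [10, 10, 10, 10]
small_errors = [11, 9, 11, 9]        # four errors of 1 ring each
one_large_error = [10, 10, 10, 14]   # a single error of 4 rings

rmse_small = rmse(y_true, small_errors)     # sqrt(4 / 4) = 1.0
rmse_large = rmse(y_true, one_large_error)  # sqrt(16 / 4) = 2.0
print(rmse_small, rmse_large)
```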

#### Discussion and Results

The exploratory analysis shows that most physical features, such as length, diameter, and the various weight measurements, are positively associated with the number of rings, meaning larger and heavier abalones tend to be older. These relationships are nonlinear, and the variability in age increases as the measurements increase, suggesting that simple linear assumptions may not fit the data well. Height appears to be the weakest predictor due to its narrow range and minimal variation. The three sex categories also overlap heavily across all features, indicating that sex is unlikely to meaningfully distinguish age on its own. Thus, in this project, we will be focusing on numerical features. Because the features are measured on different scales and many weight variables are strongly correlated, feature scaling will be important to prevent any one feature from dominating distance calculations. Overall, the structure of the data (nonlinear trends, local patterns, and high feature overlap) supports our choice of k-Nearest Neighbours, as k-NN can adapt to complex relationships without assuming a specific functional form.

In [17]:
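import pandas as pd
from pathlib import Path

# Raw data file from the UCI Machine Learning Repository (it has no header row)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"

column_names = [
    "Sex", "Length", "Diameter", "Height",
    "Whole_weight", "Shucked_weight",
    "Viscera_weight", "Shell_weight", "Rings"
]
abalone = pd.read_csv(url, header=None, names=column_names)

# Save a local copy, creating the data/ directory if it does not exist yet
Path("data").mkdir(parents=True, exist_ok=True)
abalone.to_csv("data/abalone.csv", index=False)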

In [18]:
abalone

Unnamed: 0,Sex,Length,Diameter,Height,Whole_weight,Shucked_weight,Viscera_weight,Shell_weight,Rings
0,M,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.1500,15
1,M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,7
2,F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,9
3,M,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,10
4,I,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550,7
...,...,...,...,...,...,...,...,...,...
4172,F,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,11
4173,M,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,10
4174,M,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,9
4175,F,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,10


In [33]:
# Exclude the categorical Sex column from the axes for a cleaner display
new_column_names = [
    "Length", "Diameter", "Height",
    "Whole_weight", "Shucked_weight",
    "Viscera_weight", "Shell_weight", "Rings"
]

# Plot all numeric variables against one another for EDA, colored by Sex
chart = alt.Chart(abalone, width=150, height=100).mark_point().encode(
    alt.X(alt.repeat('row'), type='quantitative'),
    alt.Y(alt.repeat('column'), type='quantitative'),
    color='Sex:N'
).repeat(column=new_column_names, row=new_column_names)

chart




#### Discussion:
summarize what you found
discuss whether this is what you expected to find?
discuss what impact could such findings have?
discuss what future questions could this lead to?

#### References: 
at least 4 citations relevant to the project (format is your choice, just be consistent across the references).