# Predictive Models of the Age of Abalones Based on Physical Characteristics

#### In this report, we aim to create an effective model that can accurately predict the age of abalone from individual physical characteristics. We perform exploratory data analysis and visualization, test both linear and non-linear models on our data, then compare the models using regression metrics.

Authors: Serene Zha, Mehmet Imga, Claudia Liauw, Wendy Frankel

# Introduction

Abalone are marine mollusks that are commercially important in fisheries and aquaculture, particularly in regions such as Tasmania. Estimating the age structure of abalone populations is essential for setting sustainable harvest limits and monitoring stock health. However, the standard method for determining age requires cutting the shell through the cone, staining it, and counting growth rings under a microscope—a destructive, time-consuming, and labor-intensive procedure[1]. Because of this, methods that can infer age from simple, non-destructive measurements of the animal are of practical interest to biologists, fisheries managers, and growers.


Here, we ask whether we can use a machine learning model to predict the age of an abalone from basic physical measurements. Specifically, we fit a linear regression model and two non-linear models (a random forest and a support vector regressor) in Python to relate age to sex, shell length, diameter, height, and several weight measurements. To investigate this question, we use the UCI Abalone dataset, which contains 4,177 abalones with eight predictor variables and a target variable, “Rings.” Each row corresponds to one abalone, and the recorded features include sex (male, female, infant), three shell size measurements (length, diameter, height), and four weight measurements (whole, shucked, viscera, and shell weight). The number of rings serves as a proxy for age, with age in years given approximately by Rings + 1.5[2]. By building and evaluating these models on this dataset, we aim to understand how well these readily obtained physical measurements can predict abalone age and what this implies for practical, non-destructive age estimation.
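The ring-to-age conversion described above is simple enough to state as code. The following is a minimal sketch; the function name `rings_to_age` is our own and is not part of the dataset or of any library. The ring counts are taken from the first rows of the dataset printed later in this report.

```python
def rings_to_age(rings):
    """Approximate age in years from a ring count (age is roughly Rings + 1.5)."""
    return rings + 1.5


# Ring counts from the first three rows of the dataset
example_rings = [15, 7, 9]
example_ages = [rings_to_age(r) for r in example_rings]
print(example_ages)  # [16.5, 8.5, 10.5]
```

Because the conversion is a constant shift, predicting Rings and predicting age are equivalent problems: error metrics such as RMSE are unchanged by the offset.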

# Methods and Results

### 1. Load Necessary Packages:

In [27]:
from ucimlrepo import fetch_ucirepo 
import pandas as pd
import altair as alt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

### 2. Load data

In [None]:
# import requests
# import zipfile

# url = "https://archive.ics.uci.edu/static/public/1/abalone.zip"

# request = requests.get(url)
# with open("../data/raw/abalone.zip", 'wb') as f:
#     f.write(request.content)

# with zipfile.ZipFile("../data/raw/abalone.zip", 'r') as zip_ref:
#     zip_ref.extractall("../data/raw")

In [21]:
# Fetch the Abalone dataset (UCI ML Repository id=1)
abalone = fetch_ucirepo(id=1) 

# Extract features and targets
X = abalone.data.features
y = abalone.data.targets

# Combine into a single DataFrame for easier initial handling
df = pd.concat([X, y], axis=1)
df.head()

|   | Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |


### 3. Data Wrangling and Cleaning

#### Checking that there are no null values

In [22]:
missing_values = df.isnull().sum()
print("Missing values per column:", missing_values[missing_values > 0])

df.info()

Missing values per column: Series([], dtype: int64)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole_weight    4177 non-null   float64
 5   Shucked_weight  4177 non-null   float64
 6   Viscera_weight  4177 non-null   float64
 7   Shell_weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


### 4. Split data

In [10]:
from sklearn.model_selection import train_test_split
import pandas as pd

# Get features and targets
X = abalone.data.features
y = abalone.data.targets

# Split data 80/20 with a fixed random_state so every model sees the same split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=522
)

train_df = pd.concat([X_train, y_train], axis=1)
# train_df.to_csv('../data/processed/abalone_train.csv', index=False)

# ravel y for sklearn
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

print("Data loaded and split.")

train_df

Data loaded and split.


|   | Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 2194 | I | 0.430 | 0.325 | 0.110 | 0.3675 | 0.1355 | 0.0935 | 0.1200 | 13 |
| 3996 | I | 0.315 | 0.230 | 0.000 | 0.1340 | 0.0575 | 0.0285 | 0.3505 | 6 |
| 3329 | F | 0.545 | 0.435 | 0.150 | 0.6855 | 0.2905 | 0.1450 | 0.2250 | 10 |
| 492 | F | 0.655 | 0.510 | 0.155 | 1.2895 | 0.5345 | 0.2855 | 0.4100 | 11 |
| 241 | I | 0.270 | 0.200 | 0.070 | 0.1000 | 0.0340 | 0.0245 | 0.0350 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3956 | F | 0.515 | 0.395 | 0.140 | 0.6860 | 0.2810 | 0.1255 | 0.2200 | 12 |
| 154 | F | 0.565 | 0.450 | 0.135 | 0.9885 | 0.3870 | 0.1495 | 0.3100 | 12 |
| 3360 | F | 0.580 | 0.440 | 0.175 | 1.0730 | 0.4005 | 0.2345 | 0.3350 | 19 |
| 1899 | M | 0.575 | 0.450 | 0.130 | 0.7850 | 0.3180 | 0.1930 | 0.2265 | 9 |


In [3]:
# test_df = pd.concat([X_test, y_test], axis=1)
# test_df.to_csv('../data/processed/abalone_test.csv', index=False)

### 5. EDA
* Mostly numerical variables except sex.
* No missing values.
* Target (Rings) ranges from 1 to 29. Mostly normal, slight right skew.
* Sex needs to be one-hot encoded, the rest should be scaled.
* Numeric variables are moderately positively correlated with target.

#### Summary

In [7]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3341 entries, 2194 to 3988
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             3341 non-null   object 
 1   Length          3341 non-null   float64
 2   Diameter        3341 non-null   float64
 3   Height          3341 non-null   float64
 4   Whole_weight    3341 non-null   float64
 5   Shucked_weight  3341 non-null   float64
 6   Viscera_weight  3341 non-null   float64
 7   Shell_weight    3341 non-null   float64
 8   Rings           3341 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 390.1+ KB


In [8]:
train_df.describe().round(2)

|   | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings |
|---|---|---|---|---|---|---|---|---|
| count | 3341.0 | 3341.0 | 3341.0 | 3341.0 | 3341.0 | 3341.0 | 3341.0 | 3341.0 |
| mean | 0.52 | 0.41 | 0.14 | 0.82 | 0.36 | 0.18 | 0.24 | 9.93 |
| std | 0.12 | 0.1 | 0.04 | 0.49 | 0.22 | 0.11 | 0.14 | 3.25 |
| min | 0.08 | 0.06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 0.45 | 0.35 | 0.12 | 0.44 | 0.18 | 0.09 | 0.13 | 8.0 |
| 50% | 0.55 | 0.42 | 0.14 | 0.8 | 0.33 | 0.17 | 0.23 | 9.0 |
| 75% | 0.62 | 0.48 | 0.16 | 1.14 | 0.5 | 0.25 | 0.32 | 11.0 |
| max | 0.82 | 0.65 | 0.52 | 2.83 | 1.49 | 0.76 | 1.0 | 29.0 |


#### EDA Visualization:

In [6]:
from ydata_profiling import ProfileReport
ProfileReport(train_df)




#### Further EDA Visualization:

In [23]:
# Correlation matrix heatmap
corr_matrix = train_df.select_dtypes(include=['float64', 'int64']).corr().reset_index()
corr_df = pd.melt(corr_matrix, id_vars='index', var_name='variable2', value_name='correlation')

heatmap = alt.Chart(corr_df).mark_rect().encode(
    x=alt.X('index', title=None),
    y=alt.Y('variable2', title=None),
    color=alt.Color('correlation', scale=alt.Scale(scheme='blueorange', domain=[-1, 1])),
    tooltip=['index', 'variable2', 'correlation']
).properties(
    title='Correlation Matrix',
    width=400,
    height=400
)

heatmap

Figure 1: Heatmap showing the correlation between different numerical features and the target variable Rings.
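To read the heatmap numerically, one option is to list each numeric feature's correlation with Rings, strongest first. The sketch below is illustrative only: `toy_df` is a hypothetical stand-in built from the five rows printed by `df.head()` above (two feature columns only), which is far too few rows for meaningful correlations. In the notebook, `train_df` would be used instead.

```python
import pandas as pd

# Stand-in data: the five rows shown by df.head(), restricted to two features.
# This is for illustration only; use train_df in the actual analysis.
toy_df = pd.DataFrame({
    "Length": [0.455, 0.35, 0.53, 0.44, 0.33],
    "Shell_weight": [0.15, 0.07, 0.21, 0.155, 0.055],
    "Rings": [15, 7, 9, 10, 7],
})

# Pearson correlation of each feature with the target, strongest first
target_corr = (
    toy_df.corr()["Rings"]
    .drop("Rings")
    .sort_values(ascending=False)
)
print(target_corr.round(2))
```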

Below, we investigate the possible relationship between sex of adults (M/F), Infants, and number of rings, as the relationship may differ between those categories.

In [11]:
train_df = pd.concat([X_train, pd.DataFrame(y_train, columns=['Rings'], index=X_train.index)], axis=1)

base = alt.Chart(train_df).mark_circle(opacity=0.3).encode(
    x='Shell_weight',
    y='Rings',
    color='Sex'
)

lines = base.transform_regression(
    'Shell_weight', 'Rings', groupby=['Sex']
).mark_line().encode(
    color='Sex'
)

(base + lines).properties(
    title="Rings vs Shell Weight by Sex (with Regression Lines)",
    width=500
)

Figure 2. Number of rings (target variable) by shell weight (grams), grouped into Male (M), Female (F), and Infant (I). Linear regression lines for each group are shown.

Observation: The slopes appear slightly different, particularly for Infants (I) compared to adults (M/F). Infants seem to have a steeper growth curve in this dimension.
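One way a linear model could capture sex-specific slopes like those in Figure 2 is through Sex × Shell_weight interaction columns. The sketch below is not part of the analysis in this report. It uses synthetic data (not the abalone data) in which infants are given a steeper slope by construction, and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the abalone data): infants get a steeper slope
rng = np.random.default_rng(0)
n = 300
sex = rng.choice(["M", "F", "I"], size=n)
shell_weight = rng.uniform(0.05, 0.6, size=n)
true_slope = np.where(sex == "I", 30.0, 10.0)
rings = 5.0 + true_slope * shell_weight + rng.normal(0.0, 0.5, size=n)

# One indicator column per sex, then one Sex x Shell_weight column per sex
dummies = pd.get_dummies(pd.Series(sex, name="Sex"), prefix="Sex").astype(float)
interactions = pd.DataFrame(
    dummies.to_numpy() * shell_weight[:, None],
    columns=[f"{col}_x_Shell_weight" for col in dummies.columns],
)
design = pd.concat([dummies, interactions], axis=1)

# No global intercept: each sex gets its own intercept and its own slope
model = LinearRegression(fit_intercept=False).fit(design, rings)
slopes = {name: float(coef) for name, coef in zip(design.columns, model.coef_)}
print({k: round(v, 1) for k, v in slopes.items() if k.endswith("_x_Shell_weight")})
```

With this design matrix, the fitted interaction coefficients are the per-sex slopes, so a difference between them can be inspected directly instead of being read off a plot.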

### 6. Baseline Model: Linear Regression

We will use a Linear Regression model to predict the number of rings. We construct a pipeline that:

1. One-hot encodes the categorical Sex feature.
2. Scales the numerical features.
3. Applies Linear Regression.

In [39]:
# Define features
numeric_features = ['Length', 'Diameter', 'Height', 'Whole_weight', 'Shucked_weight', 'Viscera_weight', 'Shell_weight']
categorical_features = ['Sex']

# Create preprocessor
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(drop='if_binary'), categorical_features)
)

# Create and fit pipeline
lr_model = make_pipeline(preprocessor, LinearRegression())
lr_model.fit(X_train, y_train)

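# Make predictions
lr_y_pred = lr_model.predict(X_test)

# Evaluate on the test set
lr_rmse = np.sqrt(mean_squared_error(y_test, lr_y_pred))
lr_r2 = r2_score(y_test, lr_y_pred)

print(f"Linear Regression Model - RMSE: {lr_rmse:.4f}, R2: {lr_r2:.4f}")

Linear Regression Model - RMSE: 2.3419, R2: 0.4427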


### 7. Visual Evaluation of the model:

We evaluate the model by plotting the Predicted vs. Actual values.

In [41]:
# Calculate metrics
lr_rmse = np.sqrt(mean_squared_error(y_test, lr_y_pred))
lr_r2 = r2_score(y_test, lr_y_pred)

print(f"Root Mean Squared Error (RMSE): {lr_rmse:.4f}")
print(f"R-squared (R2): {lr_r2:.4f}")

# Visualization
results_df = pd.DataFrame({
    'Actual': y_test.flatten(),
    'Predicted': lr_y_pred.flatten()
})

pred_chart = alt.Chart(results_df).mark_circle(opacity=0.5).encode(
    x=alt.X('Actual', title='Actual Rings'),
    y=alt.Y('Predicted', title='Predicted Rings')
).properties(
    title=f'Actual vs Predicted Rings (R2 = {lr_r2:.2f})',
    width=500,
    height=500
)

line = alt.Chart(pd.DataFrame({'x': [0, 30], 'y': [0, 30]})).mark_line(color='red', strokeDash=[5,5]).encode(
    x='x',
    y='y'
)

pred_chart + line

Root Mean Squared Error (RMSE): 2.3419
R-squared (R2): 0.4427


Figure 3: Actual Rings (x-axis) vs Predicted Rings (y-axis). The red dashed line is the identity line (Predicted = Actual), on which perfect predictions would fall. Points above the line indicate over-prediction, while points below indicate under-prediction.
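For reference, the two metrics reported above can be computed directly from their definitions. The sketch below uses made-up values rather than model output, and the variable names are our own.

```python
import numpy as np

# Made-up values for illustration only (not model output)
actual = np.array([5.0, 10.0, 15.0, 20.0])
predicted = np.array([7.0, 8.0, 17.0, 18.0])

residuals = actual - predicted
mse = np.mean(residuals ** 2)   # mean squared error, in squared units
rmse = np.sqrt(mse)             # root of the MSE, in the units of the target

ss_res = np.sum(residuals ** 2)                  # variation left unexplained
ss_tot = np.sum((actual - actual.mean()) ** 2)   # variation around the mean
r2 = 1.0 - ss_res / ss_tot

print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}, R2: {r2:.4f}")
# MSE: 4.0000, RMSE: 2.0000, R2: 0.8720
```

MSE and RMSE are easy to confuse when copying numbers between cells: an RMSE of 2.3419 rings corresponds to an MSE of about 5.48.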

In [50]:
# Extracting feature names from the preprocessor
ohe = lr_model.named_steps['columntransformer'].named_transformers_['onehotencoder']
ohe_features = list(ohe.get_feature_names_out(categorical_features))

# combining to get all features
all_features = numeric_features + ohe_features

# Extract coefficients from linear regression
coefficients = lr_model.named_steps['linearregression'].coef_

coef_df = pd.DataFrame({
    "Feature": all_features,
    "Coefficient": coefficients
}).sort_values("Coefficient", ascending=False)

coef_df

|   | Feature | Coefficient |
|---|---|---|
| 3 | Whole_weight | 4.483371 |
| 1 | Diameter | 1.153338 |
| 6 | Shell_weight | 1.020152 |
| 2 | Height | 0.808758 |
| 9 | Sex_M | 0.306958 |
| 7 | Sex_F | 0.208326 |
| 0 | Length | -0.265891 |
| 8 | Sex_I | -0.515284 |
| 5 | Viscera_weight | -1.18449 |
| 4 | Shucked_weight | -4.397524 |


Table 1. Linear Regression coefficient values showing the estimated contribution of each feature to the predicted target after preprocessing and one-hot encoding.

### 8. Analysis 2: Non-Linear Models

#### Model A: Random Forest Regressor

In [42]:
rf_preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features), # Scaling does not affect tree-based models; kept so all pipelines share the same preprocessing
    (OneHotEncoder(drop='if_binary'), categorical_features)
)

rf_model = make_pipeline(rf_preprocessor, RandomForestRegressor(n_estimators=100, random_state=522))
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print(f"Random Forest - RMSE: {rmse_rf:.4f}, R2: {r2_rf:.4f}")

Random Forest - RMSE: 2.1817, R2: 0.5163


#### Model B: Support Vector Regression (SVR)

In [43]:
svr_preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features), # Scaling is CRITICAL for SVR
    (OneHotEncoder(drop='if_binary'), categorical_features)
)

svr_model = make_pipeline(svr_preprocessor, SVR(kernel='rbf', C=1.0, epsilon=0.1))
svr_model.fit(X_train, y_train)
y_pred_svr = svr_model.predict(X_test)

rmse_svr = np.sqrt(mean_squared_error(y_test, y_pred_svr))
r2_svr = r2_score(y_test, y_pred_svr)

print(f"SVR (RBF Kernel) - RMSE: {rmse_svr:.4f}, R2: {r2_svr:.4f}")

SVR (RBF Kernel) - RMSE: 2.1673, R2: 0.5227


### 9. Comparison of Models

In [47]:
results = pd.DataFrame({
    'Model': ['Baseline (Linear)', 'Random Forest', 'SVR (RBF)'],
    'RMSE': [lr_rmse, rmse_rf, rmse_svr],
    'R2 Score': [lr_r2, r2_rf, r2_svr]
})

print(results.round(4))

# Visualize Comparison
base = alt.Chart(results).encode(x=alt.X('Model', sort='-y'))

bar_r2 = base.mark_bar().encode(
    y=alt.Y('R2 Score', title='R2 Score'),
    color=alt.Color('Model', legend=None)
).properties(title='Model Performance Comparison (R2)')

bar_rmse = (
    alt.Chart(results)
        .encode(
            x=alt.X('Model', sort='-y'),
            y=alt.Y('RMSE', title='RMSE'),
            color=alt.Color('Model', legend=None)
        )
        .mark_bar()
        .properties(title='Model Performance Comparison (RMSE)')
)

bar_rmse | bar_r2

               Model    RMSE  R2 Score
0  Baseline (Linear)  2.3419    0.4427
1      Random Forest  2.1817    0.5163
2          SVR (RBF)  2.1673    0.5227


Figure 4. RMSE and R^2 of each model tested. 

# Discussion

In this report, we tested three models to find which best predicts the number of rings (a proxy for age) of abalone. The baseline linear regression explains about 44% of the variance in ring count on the test set (R^2 = 0.44) using sex, size, and weight measurements, with a root mean squared error (RMSE) of approximately 2.34 rings (a mean squared error of approximately 5.48). The other two models were non-linear. The Random Forest model achieved a higher R^2 (0.52) and a lower RMSE (2.18; MSE 4.76) than the linear model, which suggests that the relationship between ring count and the features is not purely linear. Support vector regression with an RBF kernel gave a similar improvement over the baseline (RMSE of 2.17, MSE of 4.70, and R^2 of 0.52), but can be more sensitive to scaling and hyperparameters than Random Forest. The two non-linear models performed almost identically, with SVR marginally ahead, and both outperformed the baseline. This suggests that future work should focus on flexible models and possibly add environmental features (e.g., location, temperature).
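Since SVR can be sensitive to its hyperparameters, a natural follow-up is a cross-validated grid search over `C` and `epsilon`. The sketch below was not run as part of this report. It uses synthetic data (not the abalone data), and the grid values and variable names are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the abalone data) with a non-linear signal
rng = np.random.default_rng(522)
X_demo = rng.uniform(0.0, 1.0, size=(200, 3))
y_demo = 5.0 + 10.0 * np.sqrt(X_demo[:, 0]) + rng.normal(0.0, 0.5, size=200)

# Scaling lives inside the pipeline so it is refit on each training fold
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Hypothetical grid; values chosen for illustration only
param_grid = {
    "svr__C": [0.1, 1.0, 10.0],
    "svr__epsilon": [0.05, 0.1, 0.5],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Cross-validated RMSE: {-search.best_score_:.3f}")
```

Tuning on cross-validation folds of the training set, rather than on the test set, keeps the test metrics an honest estimate of performance on unseen data.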



# References


[1] Dua, D., & Graff, C. (2019). UCI Machine Learning Repository: Abalone Data Set. University of California, Irvine, School of Information and Computer Science. Retrieved from the UCI Machine Learning Repository.

[2] Nash, W. J., Sellers, T. L., Talbot, S. R., Cawthorn, A. J., & Ford, W. B. (1994). The population biology of abalone (Haliotis species) in Tasmania. I. Blacklip abalone (H. rubra) from the north coast and islands of Bass Strait. Sea Fisheries Division Technical Report No. 48.