### Abalone Traits
- Build a machine learning model that predicts 'Rings', the measure used to determine the age of an abalone.
    - EDA
    - Feature Selection
    - Model Selection and training
    - Model Evaluation
- Predict the 'Rings', which determine the age of the abalone, for the testing dataset (i.e. Test.csv).
    - Apply the trained model to predict 'Rings' in the test dataset; the predicted values are then used to determine the age of each abalone.
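
As a minimal sketch of how a predicted 'Rings' value maps to an age: the snippet below assumes the convention from the original UCI Abalone dataset description, where age in years is approximately Rings + 1.5. The helper name `rings_to_age`, the constant `AGE_OFFSET_YEARS`, and the demo values are illustrative and not part of this notebook.

```python
import pandas as pd

# Assumption: UCI Abalone convention, age (years) is roughly Rings + 1.5
AGE_OFFSET_YEARS = 1.5


def rings_to_age(rings):
    """Convert ring counts to an approximate age in years."""
    return rings + AGE_OFFSET_YEARS


# Illustrative ring counts, not taken from Test.csv
demo = pd.DataFrame({"Rings": [11, 6, 10]})
demo["Age (years)"] = rings_to_age(demo["Rings"])
print(demo)
```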

In [12]:
# Importing all necessary libraries and environments here
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import numpy as np
from scipy.stats import f_oneway
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
import joblib

##### Basic Data Exploration

In [13]:
# Loading the dataset
file_path = r"D:\Data Mining\Data Mining Exam\Part B\Train.csv"
df = pd.read_csv(file_path)

#check the structure before transformation
df.info() # 90615 rows, 10 columns

#Display first 5 rows
df.head()
print(df.columns)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90615 entries, 0 to 90614
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              90615 non-null  int64  
 1   Sex             90615 non-null  object 
 2   Length          90615 non-null  float64
 3   Diameter        90615 non-null  float64
 4   Height          90615 non-null  float64
 5   Whole weight    90615 non-null  float64
 6   Whole weight.1  90615 non-null  float64
 7   Whole weight.2  90615 non-null  float64
 8   Shell weight    90615 non-null  float64
 9   Rings           90615 non-null  int64  
dtypes: float64(7), int64(2), object(1)
memory usage: 6.9+ MB
Index(['id', 'Sex', 'Length', 'Diameter', 'Height', 'Whole weight',
       'Whole weight.1', 'Whole weight.2', 'Shell weight', 'Rings'],
      dtype='object')


##### Data Pre-processing

In [14]:
# Check for missing values
missing_count = df.isnull().sum() # Count total missing values
rows_with_missing = df.isnull().any(axis=1).sum()  # Count rows with missing values

if rows_with_missing > 0:
    df = df.dropna()
    print(f"Dropped {rows_with_missing} rows containing missing values.")
else:
    print("No missing values found. No rows were dropped.")

No missing values found. No rows were dropped.


In [15]:
# Check for and remove duplicates (rows with the same values across all columns, including 'id')
duplicates = df[df.duplicated(subset=['id', 'Sex', 'Length', 'Diameter', 'Height', 'Whole weight',
       'Whole weight.1', 'Whole weight.2', 'Shell weight', 'Rings'], keep=False)]

# Report the number of duplicated rows
print(f"Total duplicated rows: {duplicates.shape[0]}")

# Removing duplicates while keeping the first occurrence
df = df.drop_duplicates(subset=['id', 'Sex', 'Length', 'Diameter', 'Height', 'Whole weight',
       'Whole weight.1', 'Whole weight.2', 'Shell weight', 'Rings'], keep="first")

df.info()

Total duplicated rows: 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90615 entries, 0 to 90614
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              90615 non-null  int64  
 1   Sex             90615 non-null  object 
 2   Length          90615 non-null  float64
 3   Diameter        90615 non-null  float64
 4   Height          90615 non-null  float64
 5   Whole weight    90615 non-null  float64
 6   Whole weight.1  90615 non-null  float64
 7   Whole weight.2  90615 non-null  float64
 8   Shell weight    90615 non-null  float64
 9   Rings           90615 non-null  int64  
dtypes: float64(7), int64(2), object(1)
memory usage: 6.9+ MB


In [16]:
# Converting "Sex" (categorical) to numerical

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply LabelEncoder to the 'Sex' column
df['Sex'] = label_encoder.fit_transform(df['Sex'])

print("Label encoding applied to 'Sex' column.")
df.head()

Label encoding applied to 'Sex' column.


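|   | id | Sex | Length | Diameter | Height | Whole weight | Whole weight.1 | Whole weight.2 | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.55 | 0.43 | 0.15 | 0.7715 | 0.3285 | 0.1465 | 0.24 | 11 |
| 1 | 1 | 0 | 0.63 | 0.49 | 0.145 | 1.13 | 0.458 | 0.2765 | 0.32 | 11 |
| 2 | 2 | 1 | 0.16 | 0.11 | 0.025 | 0.021 | 0.0055 | 0.003 | 0.005 | 6 |
| 3 | 3 | 2 | 0.595 | 0.475 | 0.15 | 0.9145 | 0.3755 | 0.2055 | 0.25 | 10 |
| 4 | 4 | 1 | 0.555 | 0.425 | 0.13 | 0.782 | 0.3695 | 0.16 | 0.1975 | 9 |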
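
To make the integer codes in the 'Sex' column easier to read, the sketch below shows how `LabelEncoder` assigns them: labels are sorted, and each label's position in `classes_` becomes its code. The toy list stands in for the real column and is an assumption for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for the 'Sex' column (illustrative only)
sex_values = ["M", "F", "I", "M", "I"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sex_values)

# classes_ holds the original labels in sorted order; the index is the code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'F': 0, 'I': 1, 'M': 2}
print(codes.tolist())  # [2, 0, 1, 2, 1]
```

Because the codes depend on which labels the encoder saw during fitting, the same fitted encoder should be applied to any later data so the mapping stays identical.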


#### Feature Selection

In [17]:
# Continuous and categorical features
continuous_features = ["Length", "Diameter", "Height", "Whole weight", "Whole weight.1", "Whole weight.2", "Shell weight", "Rings"]
categorical_features = ["Sex"]

# Correlation matrix (for numerical variables)
correlation_matrix = df[continuous_features].corr()

# Correlation values for 'Rings' (for numericals)
print("Correlation Matrix:\n")
print(correlation_matrix["Rings"].sort_values(ascending=False))

# ANOVA test for the 'Sex' categorical feature with 'Rings' (continuous)
groups = [df[df["Sex"] == sex]["Rings"] for sex in df["Sex"].unique()]
F, p_value = f_oneway(*groups)

# Print the p-value for the ANOVA test
print(f"\nANOVA test p-value for 'Sex' vs 'Rings': {p_value}")

Correlation Matrix:

Rings             1.000000
Shell weight      0.694766
Height            0.665772
Diameter          0.636832
Length            0.623786
Whole weight      0.617274
Whole weight.2    0.588954
Whole weight.1    0.515067
Name: Rings, dtype: float64

ANOVA test p-value for 'Sex' vs 'Rings': 0.0


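- Results for Feature Selection
    - All seven numeric features correlate positively with "Rings" (Pearson r between about 0.52 and 0.69). "Shell weight" (0.69) and "Height" (0.67) are the strongest predictors, followed by "Diameter", "Length", "Whole weight" and "Whole weight.2"; "Whole weight.1" (0.52) is the weakest. The ANOVA p-value prints as 0.0, meaning it is too small to be represented in floating point, so the mean "Rings" differs significantly across the "Sex" groups. All of these features are therefore kept for modelling.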
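
With a sample this large, a p-value of 0.0 says that a difference exists but not how large it is. One common complement is an effect size such as eta squared (the share of total variance explained by the groups). The sketch below uses synthetic groups rather than the real data; the helper `eta_squared` and the group means are assumptions for illustration.

```python
import numpy as np
from scipy.stats import f_oneway


def eta_squared(groups):
    """Share of total variance explained by group membership (0 to 1)."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total


rng = np.random.default_rng(0)
# Synthetic stand-ins for 'Rings' per 'Sex' group (not the real data)
groups = [rng.normal(loc=m, scale=3.0, size=30000) for m in (7.5, 10.5, 11.0)]

F, p_value = f_oneway(*groups)
effect = eta_squared(groups)
print(f"p-value: {p_value}")
print(f"eta squared: {effect:.3f}")
```

On the notebook's data the same helper could be called with the `groups` list built for the ANOVA test.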

#### Prediction Model Training and Evaluation

In [19]:
#Separating the target variable (Rings) from the features
X = df.drop(columns=['Rings', 'id'])  # Features
y = df['Rings']  # Target

# Train-test Split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Training a RandomForest Regressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_valid)
print("Model Performance:")
print(f"MAE: {mean_absolute_error(y_valid, y_pred)}")
print(f"R² Score: {r2_score(y_valid, y_pred)}")

Model Performance:
MAE: 1.2949787562765547
R² Score: 0.6479998060436999


##### Cross-validation of model's performance

In [20]:
# Performing 5-fold cross-validation of the model's performance.

cv_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
print("Cross-Validation MAE Scores:", -cv_scores)
print("Average Cross-Validation MAE:", -cv_scores.mean())

Cross-Validation MAE Scores: [1.2764912  1.29288308 1.28362247 1.26761298 1.27611212]
Average Cross-Validation MAE: 1.2793443690338244
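
The fold scores above are close to each other, which suggests the model's error is stable across splits. A small sketch of summarising that spread, using the five MAE values printed above (the variable names are illustrative):

```python
import numpy as np

# Fold MAE values copied from the cross-validation output above
fold_mae = np.array([1.2764912, 1.29288308, 1.28362247, 1.26761298, 1.27611212])

mean_mae = fold_mae.mean()
std_mae = fold_mae.std(ddof=1)  # sample standard deviation across folds
spread = fold_mae.max() - fold_mae.min()

print(f"Mean MAE: {mean_mae:.4f}")
print(f"Std of MAE across folds: {std_mae:.4f}")
print(f"Range (max - min): {spread:.4f}")
```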


#####  Saving the prediction model

In [21]:
joblib.dump(model, "Abalone_Age_predictor.joblib")
print("Model saved as 'Abalone_Age_predictor.joblib'")

Model saved as 'Abalone_Age_predictor.joblib'
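
A saved model can later be restored with `joblib.load`. The sketch below checks the round trip on a small synthetic model written to a temporary directory; the demo data, `demo_model`, and the file name are assumptions for illustration and do not touch 'Abalone_Age_predictor.joblib'.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.random((200, 3))
y_demo = X_demo @ np.array([5.0, 3.0, 1.0]) + rng.normal(0, 0.1, 200)

demo_model = RandomForestRegressor(n_estimators=10, random_state=42)
demo_model.fit(X_demo, y_demo)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original
same = np.allclose(demo_model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same)
```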


##### Predicting the 'Rings' which determine age of the Abalone species in the testing dataset (i.e. Test.csv)

In [22]:
#Applying the model to the test data (Test.csv) for making predictions

# Loading test data
test_df = pd.read_csv('Test.csv')
#test_df.info()

# Resolve the identifier column name ('id' or 'Id')
id_column = 'id' if 'id' in test_df.columns else 'Id'

# Encode 'Sex' if it exists, reusing the encoder fitted on the training data
# so the label-to-integer mapping matches what the model was trained on
if 'Sex' in test_df.columns:
    test_df['Sex'] = label_encoder.transform(test_df['Sex'])  # Encode

# Prepare test features
X_test = test_df.drop(columns=[col for col in ['Rings', id_column] if col in test_df.columns])

# Make predictions
predictions = model.predict(X_test)

# Restore original 'Sex' values (optional, in case needed in output)
if 'Sex' in test_df.columns:
    test_df['Sex'] = label_encoder.inverse_transform(test_df['Sex'])  # Decode back

# Append predictions to test data
# Note: astype(int) drops the fractional part (e.g. 9.8 becomes 9) rather than rounding
test_df['Predicted Rings'] = predictions.astype(int) # 'Rings' in the training dataset are of type int

# Save as CSV
test_df.to_csv('Rings_prediction.csv', index=False)

print("\nPredictions saved successfully as 'Rings_prediction.csv'")


Predictions saved successfully as 'Rings_prediction.csv'
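
The integer conversion above uses `astype(int)`, which truncates toward zero. The sketch below contrasts truncation with rounding to the nearest integer on a few made-up prediction values, so the effect on the saved 'Predicted Rings' column is easy to see. Rounding is shown only as an alternative, not as what the notebook does.

```python
import numpy as np

# Made-up raw regressor outputs (illustrative only)
raw = np.array([9.2, 9.8, 10.4, 6.99])

truncated = raw.astype(int)          # drops the fractional part
rounded = np.rint(raw).astype(int)   # rounds to the nearest integer first

print("Truncated:", truncated.tolist())  # [9, 9, 10, 6]
print("Rounded:  ", rounded.tolist())    # [9, 10, 10, 7]
```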


In [23]:
test_df.head()

|   | id | Sex | Length | Diameter | Height | Whole weight | Whole weight.1 | Whole weight.2 | Shell weight | Predicted Rings |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90615 | M | 0.645 | 0.475 | 0.155 | 1.238 | 0.6185 | 0.3125 | 0.3005 | 10 |
| 1 | 90616 | M | 0.58 | 0.46 | 0.16 | 0.983 | 0.4785 | 0.2195 | 0.275 | 9 |
| 2 | 90617 | M | 0.56 | 0.42 | 0.14 | 0.8395 | 0.3525 | 0.1845 | 0.2405 | 10 |
| 3 | 90618 | M | 0.57 | 0.49 | 0.145 | 0.874 | 0.3525 | 0.1865 | 0.235 | 10 |
| 4 | 90619 | I | 0.415 | 0.325 | 0.11 | 0.358 | 0.1575 | 0.067 | 0.105 | 7 |
