# Data Preprocessing and Splitting for Abalone Dataset

This notebook covers the steps for preprocessing the abalone dataset and splitting it into training, testing, and validation sets.


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

abalone_data_path = '../abalone/abalone.data'
column_names = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']
abalone_df = pd.read_csv(abalone_data_path, header=None, names=column_names)
abalone_df['Age'] = abalone_df['Rings'] + 1.5  # Age in years, estimated as Rings + 1.5


## One-Hot Encoding the 'Sex' Column

The 'Sex' column is categorical (M, F, and I for infant), so it must be converted to a numerical format before most machine learning algorithms can use it. One-hot encoding replaces it with one binary (0 or 1) column per category.


In [5]:
# One-hot encode the 'Sex' column
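# `sparse_output` replaces the older `sparse` argument in scikit-learn 1.2 and later;
# on earlier versions, use OneHotEncoder(sparse=False) instead
encoder = OneHotEncoder(sparse_output=False)
sex_encoded = encoder.fit_transform(abalone_df[['Sex']])
# Reuse the original index so the concat below aligns row by row
sex_encoded_df = pd.DataFrame(sex_encoded, columns=encoder.get_feature_names_out(['Sex']), index=abalone_df.index)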

# Drop the original 'Sex' column and concatenate the encoded one
abalone_df = abalone_df.drop('Sex', axis=1)
abalone_df = pd.concat([abalone_df, sex_encoded_df], axis=1)




## Splitting the Dataset into Features and Target

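We separate the dataset into features (`X`) and the target variable (`y`). The target is 'Rings', from which the age of the abalone is estimated (Rings + 1.5). Because 'Age' is computed directly from 'Rings', it must also be excluded from the features; leaving it in would leak the target into `X`.


In [6]:
# Splitting the dataset into features (X) and target variable (y)
# Drop 'Age' along with 'Rings': Age = Rings + 1.5, so keeping it would leak the target
X = abalone_df.drop(['Rings', 'Age'], axis=1)
y = abalone_df['Rings']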
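Any column derived directly from the target, such as 'Age' (computed earlier as Rings + 1.5), would let a model read off the answer if it remained among the features. The following is a minimal sketch of a sanity check on synthetic data. The frame `demo`, the 0.999 threshold, and the variable names are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the abalone file)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Length': rng.uniform(0.1, 0.8, size=50),
    'Rings': rng.integers(1, 30, size=50),
})
demo['Age'] = demo['Rings'] + 1.5  # derived from the target, as in the notebook

target = 'Rings'
features = demo.drop(columns=[target])

# Flag features whose correlation with the target is (numerically) perfect
corr = features.corrwith(demo[target])
leaky = corr[corr.abs() > 0.999].index.tolist()
print(leaky)
```

A perfect correlation is only a heuristic signal, but it is enough to catch a column that is a simple shift or rescaling of the target.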


## Splitting the Data into Training, Testing, and Validation Sets

We split the data into three parts:
1. Training set (80% of the data): Used to fit the model.
2. Testing set (10% of the data): Held out to estimate the model's performance on unseen data once training and tuning are finished.
3. Validation set (10% of the data): Used to tune the model's hyperparameters and compare candidate models.


In [7]:
# Splitting the dataset into training (80%), and a temporary set (20%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)

# Splitting the temporary set into testing and validation sets (50% each of the temporary set, 10% each of the total dataset)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)


## Feature Scaling

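We standardize the features to zero mean and unit variance so that features with large numeric ranges do not dominate those with small ranges. This matters for models that are sensitive to the scale of their inputs, such as regularized linear models (ridge, lasso), k-nearest neighbors, and support vector machines. The scaler is fitted on the training set only and then applied to the testing and validation sets, so no information from the held-out data enters the preprocessing.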


In [8]:
# Scaling the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_val_scaled = scaler.transform(X_val)
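The fit-on-train, transform-everywhere pattern can be checked on synthetic data. This is a sketch under assumed inputs: the arrays `train` and `held_out` are randomly generated for illustration and are not the abalone features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (two features, arbitrary mean and spread)
rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
held_out = rng.normal(loc=10.0, scale=3.0, size=(50, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)     # learns mean and std from train only
held_out_scaled = scaler.transform(held_out)   # reuses the training statistics

# Training columns end up with mean ~0 and std ~1
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
# Held-out columns are close to, but not exactly, mean 0 and std 1
print(held_out_scaled.mean(axis=0), held_out_scaled.std(axis=0))
```

The held-out statistics differ slightly from 0 and 1 because they are computed with the training mean and standard deviation, which is the intended behavior.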


## Verifying the Sizes of the Splits

We print the sizes of each split to ensure the data has been divided correctly.


In [9]:
# Verifying the sizes of the splits
print(f"Training Set: {len(X_train)} samples")
print(f"Testing Set: {len(X_test)} samples")
print(f"Validation Set: {len(X_val)} samples")


Training Set: 3341 samples
Testing Set: 418 samples
Validation Set: 418 samples
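The printed sizes can be reproduced without the data file. The sketch below applies the same two-stage split to a plain array of row indices; the total of 4177 rows is inferred from the three printed counts, and the array `rows` is an illustrative stand-in for the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 3341 + 418 + 418  # total implied by the printed sizes
rows = np.arange(n_samples)   # stand-in for the dataset rows

# Stage 1: 80% training, 20% temporary
train, temp = train_test_split(rows, test_size=0.2, random_state=42)

# Stage 2: split the temporary set in half
test, val = train_test_split(temp, test_size=0.5, random_state=42)

print(len(train), len(test), len(val))
```

With a fractional `test_size`, `train_test_split` rounds the held-out count up, so 20% of 4177 becomes 836 rows, which then divide evenly into 418 and 418.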
