# Data Preprocessing and Splitting for Abalone Dataset

This notebook covers the steps for preprocessing the abalone dataset and splitting it into training, testing, and validation sets.


In [25]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from data_loader import load_abalone_data
import numpy as np

abalone_df = load_abalone_data()



## One-Hot Encoding the 'Sex' Column

The 'Sex' column is a categorical feature and needs to be converted into a numerical format for machine learning algorithms. We use one-hot encoding to achieve this, resulting in separate columns for each category with binary values (0 or 1).


In [26]:
# One-hot encode the 'Sex' column
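# `sparse_output` replaced the deprecated `sparse` argument in scikit-learn 1.2;
# on older versions, use OneHotEncoder(sparse=False) instead
encoder = OneHotEncoder(sparse_output=False)
sex_encoded = encoder.fit_transform(abalone_df[['Sex']])
# Reuse the original index so the concat below aligns row by row
sex_encoded_df = pd.DataFrame(
    sex_encoded,
    columns=encoder.get_feature_names_out(['Sex']),
    index=abalone_df.index,
)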

# Drop the original 'Sex' column and concatenate the encoded one
abalone_df = abalone_df.drop('Sex', axis=1)
abalone_df = pd.concat([abalone_df, sex_encoded_df], axis=1)




## Splitting the Dataset into Features and Target

We separate the dataset into features (`X`) and the target variable (`y`). The target variable in our case is 'Rings', the number of shell rings, which serves as a proxy for the abalone's age (age in years is commonly estimated as rings + 1.5).


In [27]:
# Splitting the dataset into features (X) and target variable (y)
X = abalone_df.drop('Rings', axis=1)
y = abalone_df['Rings']


## Splitting the Data into Training, Testing, and Validation Sets

We split the data into three parts:
1. Training set (70% of the data): Used to train the model.
2. Testing set (15% of the data): Used for a final evaluation of the trained model on data it has not seen.
3. Validation set (15% of the data): Used to tune the model's hyperparameters and compare candidate models.
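
The resulting set sizes follow from simple arithmetic. The sketch below assumes the `test_size=0.3` and `test_size=0.5` values used in the split cell of this notebook, and assumes that the held-out portion is rounded up at each stage, which matches the shapes printed further down. The function name `two_stage_split_sizes` is hypothetical.

```python
import math


def two_stage_split_sizes(n_samples, temp_fraction=0.3, val_fraction_of_temp=0.5):
    """Return (train, test, validation) sizes for a two-stage split.

    Assumption: the held-out portion is rounded up at each stage.
    """
    n_temp = math.ceil(n_samples * temp_fraction)     # temporary set (test + validation)
    n_train = n_samples - n_temp                      # remainder goes to training
    n_val = math.ceil(n_temp * val_fraction_of_temp)  # second stage: held-out half
    n_test = n_temp - n_val
    return n_train, n_test, n_val


n_total = 2923 + 627 + 627  # total rows, taken from the shapes printed in this notebook
sizes = two_stage_split_sizes(n_total)
print(sizes)
print([round(s / n_total, 3) for s in sizes])
```

The fractions come out close to 0.70, 0.15, and 0.15 of the full dataset.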


In [28]:

# Splitting the dataset into training (70%) and a temporary set (30%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)

# Creating a new y_train for classifying abalone by age group:
# (0, 8] rings -> 'young', (8, 15] -> 'middle_age', more than 15 -> 'old'
bins = [0, 8, 15, np.inf]
labels = ['young', 'middle_age', 'old']

# Create the new classification target variable
y_train_classification = pd.cut(y_train, bins, labels=labels, right=True)
print(y_train_classification)

# Encode the class labels as integers. LabelEncoder sorts classes alphabetically,
# so the codes are middle_age=0, old=1, young=2 (not ordered by age)
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train_classification)
print(y_train_encoded)

# Splitting the temporary set into testing and validation sets (50% each of the temporary set, 15% each of the total dataset)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Printing the shapes of the resulting sets
print('Training set:', X_train.shape, y_train.shape)
print('Testing set:', X_test.shape, y_test.shape)
print('Validation set:', X_val.shape, y_val.shape)

# Printing some rows from each created set
print('\nTraining set:')
print(X_train.head())
print('\nTesting set:')
print(X_test.head())
print('\nValidation set:')
print(X_val.head())



2830    middle_age
925          young
3845    middle_age
547          young
2259    middle_age
           ...    
3444    middle_age
466     middle_age
3092    middle_age
3772    middle_age
860          young
Name: Rings, Length: 2923, dtype: category
Categories (3, object): ['young' < 'middle_age' < 'old']
Training set: (2923, 11) (2923,)
Testing set: (627, 11) (627,)
Validation set: (627, 11) (627,)

Training set:
      Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  \
2830   0.525     0.430   0.135        0.8435          0.4325          0.1800   
925    0.430     0.325   0.100        0.3645          0.1575          0.0825   
3845   0.455     0.350   0.105 
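
One detail of the age-class encoding is worth seeing on a small example: `LabelEncoder` assigns integer codes in sorted (alphabetical) label order, not in the 'young' < 'middle_age' < 'old' order of the categorical. The sketch below uses five made-up ring counts with the same bins and labels as the cell above. Using `.cat.codes` is shown as one possible alternative when order-preserving codes are wanted; it is not what this notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy ring counts (hypothetical), binned with the same edges as above
rings = pd.Series([5, 9, 16, 8, 15], name='Rings')
bins = [0, 8, 15, np.inf]
labels = ['young', 'middle_age', 'old']
age_class = pd.cut(rings, bins, labels=labels, right=True)

# LabelEncoder: codes follow alphabetical order of the label strings
label_encoder = LabelEncoder()
alphabetical_codes = label_encoder.fit_transform(age_class.astype(str))
print(list(label_encoder.classes_), alphabetical_codes)

# Alternative: categorical codes follow the order given in `labels`
ordered_codes = age_class.cat.codes.to_numpy()
print(labels, ordered_codes)
```

Either encoding works for classifiers that treat the target as unordered classes; the difference matters only if the integer codes are later interpreted as an ordering.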

## Feature Scaling

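We standardize the features to zero mean and unit variance so that features measured on different scales contribute comparably. This matters most for models that are sensitive to the scale of the input data, such as regularized linear regression, k-nearest neighbors, and support vector machines; plain least-squares linear regression gives the same predictions with or without scaling. The scaler is fitted on the training set only and then applied to the testing and validation sets, so that no information from the held-out data leaks into preprocessing.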
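
In formula terms, standardization computes z = (x - mean) / std per feature, with the mean and standard deviation taken from the training data. The NumPy sketch below illustrates this on made-up numbers; the array names are hypothetical. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

# Toy data (hypothetical): two features on very different scales
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
held_out = np.array([[2.0, 40.0]])

# Statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
held_out_scaled = (held_out - mean) / std  # reuse training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(held_out_scaled)
```

The scaled training columns have mean 0 and standard deviation 1; held-out rows generally do not, because they are transformed with the training statistics.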


In [29]:
# Scaling the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_val_scaled = scaler.transform(X_val)

# Printing the first 5 rows of the scaled sets
print('\nFirst 5 rows of the scaled training set:\n', X_train_scaled[:5])
print('\nFirst 5 rows of the scaled testing set:\n', X_test_scaled[:5])
print('\nFirst 5 rows of the scaled validation set:\n', X_val_scaled[:5])



First 5 rows of the scaled training set:
 [[-0.00954585  0.20680221 -0.12072394  0.01457825  0.31105444 -0.02098237
  -0.42540763 -0.30383887  1.46390258 -0.67933791 -0.75989604]
 [-0.80318028 -0.85404855 -0.94465076 -0.95900917 -0.9194155  -0.91375895
  -0.96974532 -0.92144737 -0.68310557  1.47202148 -0.75989604]
 [-0.59432912 -0.60146504 -0.82694693 -0.85433328 -0.89704332 -0.78098705
  -0.68512431  0.31376963 -0.68310557 -0.67933791  1.31596949]
 [-2.68284079 -2.57161646 -2.2393929  -1.61348756 -1.54807379 -1.61882353
  -1.60658483 -0.92144737 -0.68310557 -0.67933791  1.31596949]
 [ 0.53346719  0.56041914  0.46779521  0.53694144  0.6399255   0.64287714
   0.38220449  0.93137813  1.46390258 -0.67933791 -0.75989604]]

First 5 rows of the scaled testing set:
 [[-2.59930032 -2.47058306 -2.00398524 -1.57995062 -1.50556664 -1.55472675
  -1.58523826 -1.53905587 -0.68310557  1.47202148 -0.75989604]
 [ 0.65877789  0.50990243  0.70320287  0.58064109 -0.07150985  1.11902465
   0.524515    0.9
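
Because the scaler above is fitted on all of `X_train`, the one-hot 'Sex' columns are standardized along with the continuous measurements. If you would rather keep the 0/1 indicators unchanged, one option (an alternative, not what this notebook does) is scikit-learn's `ColumnTransformer`. The toy frame and the choice of which columns count as continuous are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy training frame (hypothetical values) with two continuous and two one-hot columns
toy_train = pd.DataFrame({
    'Length': [0.35, 0.45, 0.55, 0.65],
    'Height': [0.09, 0.10, 0.14, 0.15],
    'Sex_F': [1.0, 0.0, 0.0, 1.0],
    'Sex_M': [0.0, 1.0, 1.0, 0.0],
})
continuous_cols = ['Length', 'Height']

# Scale only the continuous columns; pass the remaining columns through untouched
preprocessor = ColumnTransformer(
    transformers=[('scale', StandardScaler(), continuous_cols)],
    remainder='passthrough',
)
toy_scaled = preprocessor.fit_transform(toy_train)
print(toy_scaled)
```

In the result, the scaled columns come first and the passed-through columns follow in their original order. As with `StandardScaler` alone, call `fit_transform` on the training set and `transform` on the testing and validation sets.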

## Verifying the Sizes of the Splits

We print the sizes of each split to ensure the data has been divided correctly.


In [30]:
# Verifying the sizes of the splits
print(f"Training Set: {len(X_train)} samples")
print(f"Testing Set: {len(X_test)} samples")
print(f"Validation Set: {len(X_val)} samples")


Training Set: 2923 samples
Testing Set: 627 samples
Validation Set: 627 samples
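
Sample counts confirm the proportions, but not that the three sets are disjoint. A stricter check, sketched below on toy data, compares the row indices of the splits. The helper name `check_splits` and the toy frame are hypothetical; the check relies on `train_test_split` preserving the pandas index of each row.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical) split the same way as in this notebook
toy_X = pd.DataFrame({'feature': np.arange(100)})
toy_y = pd.Series(np.arange(100) % 7, name='target')

X_tr, X_tmp, y_tr, y_tmp = train_test_split(toy_X, toy_y, test_size=0.3, random_state=42)
X_te, X_va, y_te, y_va = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)


def check_splits(full_index, *parts):
    """Assert that the parts are pairwise disjoint and together cover full_index."""
    combined = [i for part in parts for i in part.index.tolist()]
    assert len(combined) == len(set(combined)), "a row appears in more than one split"
    assert set(combined) == set(full_index.tolist()), "splits do not cover the full dataset"
    return True


ok = check_splits(toy_X.index, X_tr, X_te, X_va)
print(ok, len(X_tr), len(X_te), len(X_va))
```

The same helper could be called as `check_splits(X.index, X_train, X_test, X_val)` on the notebook's own splits.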
