# Crab Age Competition Notebook

*Created by Taylor Daugherty*

*Created: 6/11/2023    Last updated: 6/12/2023*

This is my solution for the Crab Age Competition. 

In this notebook, I experimented with different levels of normalizing the dataset. This includes not normalizing any data, normalizing only the features, and normalizing the features and the target.

## Table of Contents

1. Import modules

2. Create universal information

3. No normalization

4. Normalize features

5. Normalize all data

6. Evaluate methods

7. Submission

8. Results

## Import modules

In this notebook the following imports are needed to build and evaluate a linear regression model

1. `Numpy`: used for matrix operations and manipulation

2. `Pandas`: used for dataframe creation and manipulation

3. `train_test_split`: used to split the data to evaluate model's performance on existing data

4. `LinearRegression`: used to build, fit, and make predictions using a linear regression model

5. `mean_absolute_error`: used to evaluate the model's performance on the validation data

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

## Create universal information

The information in the following section will pertain to multiple regression methods. 

These have been added here for easier access and readability

Set the file paths to a variable so all methods can access the training and testing sets

In [2]:
train_filepath = '/kaggle/input/playground-series-s3e16/train.csv'
test_filepath = '/kaggle/input/playground-series-s3e16/test.csv'

### Normalization

These functions will help normalize and undo normalizing for the age values

**Normalize** will normalize an entire feature 

In [3]:
def normalize(feature):
    '''
    This function normalizes a Series
    
    Input: A feature of type Series
    
    Output: The normalized feature of type Series
    '''
    return (feature - feature.mean())/feature.std()

**Normalize_features** will normalize all features in a given dataframe

In [4]:
def normalize_features(df):
    '''
    This function normalizes all features in a dataframe
    
    Input: A pandas dataframe
    
    Output: The normalized dataframe
    '''
    for column in df.columns:
        df[column] = normalize(df[column])
    return df

**Revert_normalize** will undo the normalization on features

In [5]:
def revert_normalize(predictions, mean, std):
    '''
    This function reverts the changes caused by normalizing a feature
    
    Input: A set of predictions of type Series
    
    Output: A set of predictions that are no longer normalized of type Series
    '''
    return predictions * std + mean

## No normalization

This model will not normalize any of the data points to evaluate the baseline performance of a linear regression model on this dataset

### Access data as pandas dataframes

Use `read_csv` to transform the data into a dataframe.

Use the colume `'id'` as the index, since that is the role of this column

In [6]:
df_train = pd.read_csv(train_filepath, index_col='id')
df_test = pd.read_csv(test_filepath, index_col='id')

Since `LinearRegression` does not take categorical data as an input, all categorical columns must be converted to integer or floating point values.

In this dataset, the only categorical feature is 'Sex' which has three possible values ('I', 'M', 'F'). These values can simply be converted to 0, 1, and 2 respectively, then stored as an integer type

In [7]:
df_train['Sex'] = df_train['Sex'].map({'I':0, 'M':1, 'F':2}).astype(int)
df_test['Sex'] = df_test['Sex'].map({'I':0, 'M':1, 'F':2}).astype(int)

Separate the data into the features and target using common names 'X' and 'y'

Once divided, further separate the data into training and validation sets

In [8]:
X = df_train.drop('Age', axis=1)
y = df_train['Age']

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=10)

### ML Model

Create a linear regression model and fit it to the training data. Then use the model to make predictions on the validation data

In [9]:
clf_noNormalized = LinearRegression().fit(X_train, y_train)
predictions_noNormalized = clf_noNormalized.predict(X_val)

Evaluate the accuracy of the model using `mean_absolute_error`

In [10]:
accuracy_noNormalized = mean_absolute_error(y_val, predictions_noNormalized)
print('Mean absolute error:', accuracy_noNormalized)

Mean absolute error: 1.4917817961878985


The mean absolute error using this model and not normalizing any of the dataset yielded a 1.492. This is the base to compare the other methods to

## Normalize features

In this method only the features will be normalized

### Access data as pandas dataframes

Use `read_csv` to transform the data into a dataframe.

Use the colume `'id'` as the index, since that is the role of this column

In [11]:
df_train = pd.read_csv(train_filepath, index_col='id')
df_test = pd.read_csv(test_filepath, index_col='id')

Since `LinearRegression` does not take categorical data as an input, all categorical columns must be converted to integer or floating point values.

In this dataset, the only categorical feature is 'Sex' which has three possible values ('I', 'M', 'F'). These values can simply be converted to 0, 1, and 2 respectively, then stored as an integer type

In [12]:
df_train['Sex'] = df_train['Sex'].map({'I':0, 'M':1, 'F':2}).astype(int)
df_test['Sex'] = df_test['Sex'].map({'I':0, 'M':1, 'F':2}).astype(int)

Separate the data into the features and target using common names 'X' and 'y'

Once divided, further separate the data into training and validation sets

In [13]:
X = df_train.drop('Age', axis=1)
y = df_train['Age']

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=10)

Normalize all features using the functions created in the Universal Information section at the beginning of the notebook

In [14]:
X_train = normalize_features(X_train)
X_val = normalize_features(X_val)
df_test = normalize_features(df_test)

### ML Model

Create a linear regression model and fit it to the training data. Then use the model to make predictions on the validation data

In [15]:
clf_featureNormalized = LinearRegression().fit(X_train, y_train)
predictions_featureNormalized = clf_featureNormalized.predict(X_val)

In [16]:
print('Predicted:\n',predictions_featureNormalized[:5])
print('\nTrue:\n',y_val[:5])

Predicted:
 [ 9.40244233 11.20461771 13.32516142  9.52746728  4.74511222]

True:
 id
15752     9
48375    18
63342    11
68606     7
4306      4
Name: Age, dtype: int64


Evaluate the accuracy of the model using `mean_absolute_error`

In [17]:
accuracy_featureNormalized = mean_absolute_error(y_val, predictions_featureNormalized)
print('Mean absolute error:', accuracy_featureNormalized)

Mean absolute error: 1.4858929226773459


The mean absolute error using this model and not normalizing any of the dataset yielded a 1.486

## Normalize all data

In this method the entire dataframe will be normalized

### Access data as pandas dataframes

Use `read_csv` to transform the data into a dataframe.

Use the colume `'id'` as the index, since that is the role of this column

In [18]:
df_train = pd.read_csv(train_filepath, index_col='id')
df_test = pd.read_csv(test_filepath, index_col='id')

Since `LinearRegression` does not take categorical data as an input, all categorical columns must be converted to integer or floating point values.

In this dataset, the only categorical feature is 'Sex' which has three possible values ('I', 'M', 'F'). These values can simply be converted to 0, 1, and 2 respectively, then stored as an integer type

In [19]:
df_train['Sex'] = df_train['Sex'].map({'I':0, 'M':1, 'F':2}).astype(int)
df_test['Sex'] = df_test['Sex'].map({'I':0, 'M':1, 'F':2}).astype(int)

Separate the data into the features and target using common names 'X' and 'y'

Once divided, further separate the data into training and validation sets

In [20]:
X = df_train.drop('Age', axis=1)
y = df_train['Age']

X_train, X_test, y_train, y_val = train_test_split(X, y, random_state=10)

Since the true mean and standard deviation for age will be lost after normalization, these values are stored for when the ages need to be reverted back to the true values/predictions

In [21]:
mean_age_train = y_train.mean()
std_age_train = y_train.std()

mean_age_val = y_val.mean()
std_age_val = y_val.std()

Normalize all features and targets using the functions created in the Universal Information section at the beginning of the notebook.

It is important to keep the training and validation data separate to simulate a true testing environment. This means that the validation data has no impact on the testing data

In [22]:
X_train = normalize_features(X_train)
X_val = normalize_features(X_val)

y_train = normalize(y_train)
y_val = normalize(y_val)

df_test = normalize_features(df_test)

### ML Model

Create a linear regression model and fit it to the training data. Then use the model to make predictions on the validation data

In [23]:
clf_allNormalized = LinearRegression().fit(X_train, y_train)
predictions_allNormalized = clf_allNormalized.predict(X_val)

Revert the predicted and true age values to the correct range. This is accomplished using the `revert_normalize` function created in the Universal Information at the beginning of this notebook

In [24]:
predictions_allNormalized_train = revert_normalize(predictions_allNormalized, mean_age_train, std_age_train)
predictions_allNormalized_val = revert_normalize(predictions_allNormalized, mean_age_val, std_age_val)

y_val = revert_normalize(y_val, mean_age_val, std_age_val)

In [25]:
print('Predictions (train):\n', predictions_allNormalized_train[:5])
print('Predictions (validation):\n', predictions_allNormalized_val[:5])

print('\nTrue values:\n', y_val[:5])

Predictions (train):
 [ 9.40244233 11.20461771 13.32516142  9.52746728  4.74511222]
Predictions (validation):
 [ 9.40317831 11.20048583 13.31530173  9.52786555  4.75842813]

True values:
 id
15752     9.0
48375    18.0
63342    11.0
68606     7.0
4306      4.0
Name: Age, dtype: float64


Evaluate the accuracy of the model using `mean_absolute_error`

In [26]:
accuracy_allNormalized_train = mean_absolute_error(y_val, predictions_allNormalized_train)
accuracy_allNormalized_val = mean_absolute_error(y_val, predictions_allNormalized_val)

print('Mean absolute error (train):', accuracy_allNormalized_train)
print('Mean absolute error (validation):', accuracy_allNormalized_val)

Mean absolute error (train): 1.485892922677346
Mean absolute error (validation): 1.485751892389333


The mean absolute error using this model and not normalizing any of the dataset yielded a 1.486

## Evaluate Methods

The next section will compare the performances of all methods tested so far

Consolidate all MAE scores into one location and determine the minimum. The minimum of these scores will be the most accurate method and will be submitted to the competition.

In [27]:
print('No normalized score:',accuracy_noNormalized)
print('Features normalized score:',accuracy_featureNormalized)
print('All normalized score (train):',accuracy_allNormalized_train)
print('All normalized score (validation):',accuracy_allNormalized_val)

min_score = min(accuracy_noNormalized, 
                min(accuracy_featureNormalized, 
                    min(accuracy_allNormalized_train, accuracy_allNormalized_val)))

print('\nmin_accuracy score =',min_score)

No normalized score: 1.4917817961878985
Features normalized score: 1.4858929226773459
All normalized score (train): 1.485892922677346
All normalized score (validation): 1.485751892389333

min_accuracy score = 1.485751892389333


The minimum score from the three methods is 1.485751892389333 (the method where all data was normalized, then reverted using the validation age mean and std). Therefore, this is the method that will be used to make predictions for the test set and have the predictions submitted to the comeptition

## Submissions

These are files that have been submitted into the competition

### Normalized features

Generate predictions using the normalized feature linear regression model. 

To make sure these predictions are reasonable, view the first 5

In [28]:
test_predictions = clf_featureNormalized.predict(df_test)
print('Test Predictions:\n',test_predictions[:5])

Test Predictions:
 [ 7.78896891  7.80360933 10.46703751  9.52078342  7.60673022]


Put the predictions into a pandas dataframe with the index being the `'id'` column in the original file

In [29]:
submission2 = pd.DataFrame(data={'Age':test_predictions}, index=df_test.index)

submission2.head()

Unnamed: 0_level_0,Age
id,Unnamed: 1_level_1
74051,7.788969
74052,7.803609
74053,10.467038
74054,9.520783
74055,7.60673


Create a `.csv` file with the resulting dataframe. This is the file that will be submitted to the competition

In [30]:
submission2.to_csv('Submission2.csv')

#### Result

The MAE for this set of predictions is 1.48523

### Normalized all data - training age

This submission uses the result of normalizing all data. To convert the predictions to reasonable values, the mean and std from the training data was used to undo the normalization.

Generate predictions using the normalized feature linear regression model. 

To make sure these predictions are reasonable, view the first 5

In [31]:
test_predictions = clf_allNormalized.predict(df_test)
test_predictions = revert_normalize(test_predictions, mean_age_val, std_age_val)
print('Test Predictions:\n',test_predictions[:5])

Test Predictions:
 [ 7.79406305  7.80866392 10.46489791  9.52119975  7.6123166 ]


Put the predictions into a pandas dataframe with the index being the `'id'` column in the original file

In [32]:
submission3 = pd.DataFrame(data={'Age':test_predictions}, index=df_test.index)

submission3.head()

Unnamed: 0_level_0,Age
id,Unnamed: 1_level_1
74051,7.794063
74052,7.808664
74053,10.464898
74054,9.5212
74055,7.612317


Create a `.csv` file with the resulting dataframe. This is the file that will be submitted to the competition

In [33]:
submission3.to_csv('Submission3.csv')

#### Result

The MAE for this set of predictions is 1.48523

### Normalized all data - validation age

This submission uses the result of normalizing all data. To convert the predictions to reasonable values, the mean and std from the validation data was used to undo the normalization.

Generate predictions using the normalized feature linear regression model. 

To make sure these predictions are reasonable, view the first 5

In [34]:
test_predictions = clf_allNormalized.predict(df_test)
test_predictions = revert_normalize(test_predictions, mean_age_val, std_age_val)
print('Test Predictions:\n',test_predictions[:5])

Test Predictions:
 [ 7.79406305  7.80866392 10.46489791  9.52119975  7.6123166 ]


Put the predictions into a pandas dataframe with the index being the `'id'` column in the original file

In [35]:
submission4 = pd.DataFrame(data={'Age':test_predictions}, index=df_test.index)

submission4.head()

Unnamed: 0_level_0,Age
id,Unnamed: 1_level_1
74051,7.794063
74052,7.808664
74053,10.464898
74054,9.5212
74055,7.612317


Create a `.csv` file with the resulting dataframe. This is the file that will be submitted to the competition

In [36]:
submission4.to_csv('Submission4.csv')

#### Result

The MAE for this set of predictions is 1.48513

## Results

The results from the submissions are very similar to the simulated results.

The best normalization method was to normalize all data and use the validataion mean and std for age to revert the predictions.

The best MAE was **1.48513**