# Data Processing

This notebook processes our abalone dataset and prepares it for modelling.

In [None]:
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split

In [None]:
# Let's ingest our data
df = pd.read_csv(
  'data/abalone.data',
  names=[
    'sex',
    'length',
    'diameter',
    'height',
    'whole_weight',
    'shucked_weight',
    'viscera_weight',
    'shell_weight',
    'rings'
  ]
)

In [None]:
# Let's take a quick look at our data
df.head()

In [None]:
# What does the shape of our data look like?
print(f'Our dataset has {df.shape[0]} rows and {df.shape[1]} columns')

1. Clean the data (e.g. convert M, F and I to 0, 1 and 2). You can do this with code or simple find and replace (2 Marks).

`M = Male (0)`, `F = Female (1)`, `I = Infant (2)`

In [None]:
df['sex'] = df['sex'].map({'M': 0, 'F': 1, 'I': 2})
df.head()
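
One caveat with `Series.map` and a dictionary: any value that is not a key of the dictionary becomes `NaN`, so re-running the cell on an already encoded column would blank it out. The following is a minimal sketch of a sanity check on a small made-up frame; `toy`, `SEX_CODES` and the `'X'` category are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy data; 'X' stands in for an unexpected category
toy = pd.DataFrame({'sex': ['M', 'F', 'I', 'X']})

# Same encoding as the notebook: M = 0, F = 1, I = 2
SEX_CODES = {'M': 0, 'F': 1, 'I': 2}
encoded = toy['sex'].map(SEX_CODES)

# Values that are not keys of the dictionary come back as NaN
unmapped = toy.loc[encoded.isna(), 'sex'].tolist()
print(unmapped)
```

An empty `unmapped` list on the real column would suggest that every row was encoded as intended.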

2. Develop a correlation map using a heatmap and discuss major observations (2 Marks).

In [None]:
df.corr().style.background_gradient(cmap='coolwarm')

From this correlation map we can observe the following:
- A value of `1` indicates a perfect positive correlation, `0` indicates no linear correlation and `-1` indicates a perfect negative correlation.
- There is a diagonal line of `1` values from the top left to the bottom right. This is expected, as each variable is perfectly positively correlated with itself.
- All of the size and weight features are strongly positively correlated with each other. The strongest correlation is between length and diameter (`0.986`) and the weakest is between height and shucked weight (`0.775`).
- Sex is negatively correlated with every other column. Given the encoding above (`M = 0`, `F = 1`, `I = 2`), this mainly reflects infants being smaller and lighter than adults. Because sex is a categorical variable with an arbitrary numeric encoding, these coefficients should be read with caution.
- Every feature except sex is positively correlated with the number of rings. Shell weight and diameter have the strongest positive correlations, at `0.628` and `0.575` respectively.
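
Reading the ranking off a coloured table can be error-prone, so the correlations with `rings` can also be pulled out and sorted directly. This is a rough sketch on made-up numbers (the `toy` frame and its values are assumptions for illustration, not the real abalone measurements):

```python
import pandas as pd

# Hypothetical toy values, not the real abalone measurements
toy = pd.DataFrame({
    'shell_weight': [0.10, 0.15, 0.20, 0.30, 0.35, 0.50],
    'diameter':     [0.30, 0.25, 0.40, 0.35, 0.50, 0.45],
    'sex':          [2, 2, 0, 1, 0, 1],
    'rings':        [5, 7, 8, 10, 11, 15],
})

# Pearson correlation of each feature with rings
ring_corr = toy.corr()['rings'].drop('rings')

# Order by absolute strength so strong negative correlations are not overlooked
ranked = ring_corr.reindex(ring_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Applied to the notebook's `df`, the same two lines would list the features in order of how strongly they correlate with `rings`.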

3. Pick two of the most correlated features (negative or positive) and create a scatter plot with ring-age. Discuss major observations (2 Marks). 

In [None]:
# Again, our two features most correlated with rings are shell_weight and diameter
# Let's create a scatter plot visualisation of each of these vs rings

# First create a scatter plot for shell_weight
shell_weight_scatter = alt.Chart(df, title="Shell Weight vs Rings").mark_circle().encode(
  x=alt.X('shell_weight', title="Shell Weight (grams)", scale=alt.Scale(domain=(0, 1.1))),
  y=alt.Y('rings', title="Rings (count)"),
  color=alt.ColorValue('#7BB2D9')
).properties(
  width=480,
  height=300
)

# Create a regression line for shell_weight
shell_weight_scatter_regression = shell_weight_scatter.transform_regression(
  'shell_weight',
  'rings',
  method='linear'
).mark_line(strokeWidth=2.5).encode(
  color=alt.ColorValue('#FFDC00'),
  opacity=alt.value(1),
)

# Create a scatter plot for diameter
diameter_scatter = alt.Chart(df, title="Diameter vs Rings").mark_circle().encode(
  x=alt.X('diameter', title="Diameter (mm)", scale=alt.Scale(domain=(0, 1.1))),
  y=alt.Y('rings', title="Rings (count)"),
  color=alt.ColorValue('#D62828')
).properties(
  width=480,
  height=300,
)

# Create a regression line for diameter
diameter_scatter_regression = diameter_scatter.transform_regression(
  'diameter',
  'rings',
  method='linear'
).mark_line(strokeWidth=2.5).encode(
  color=alt.ColorValue('#FFDC00'),
  opacity=alt.value(1),
)

shell_weight_diameter_scatter_seperate = ((shell_weight_scatter + shell_weight_scatter_regression) |  (diameter_scatter + diameter_scatter_regression))
shell_weight_diameter_scatter_seperate.save('assets/shell_weight_diameter_scatter_seperate.png', ppi=300)
shell_weight_diameter_scatter_seperate

From these scatter plots we can observe:
- The positive correlation with rings is again reinforced by the upward slope of both regression lines.
- Both scatter plots are denser towards the lower range of rings, suggesting that younger abalone with a smaller shell weight and diameter may be over-represented in the dataset.
- Both scatter plots contain some outliers, indicating that some abalone don't follow the general trend.
- Finally, the diameter vs rings plot appears tighter, with fewer outliers, which might suggest it is a better predictor of rings, even though shell weight has the higher correlation coefficient.
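
A visual impression of how tightly points hug a regression line can be checked numerically, for example by comparing the spread of the residuals around a least-squares fit. The sketch below uses `numpy.polyfit` on made-up values; `residual_std` is a hypothetical helper and the numbers say nothing about the real dataset.

```python
import numpy as np

# Hypothetical toy values, not the real abalone measurements
rings = np.array([5, 7, 8, 10, 11, 15], dtype=float)
shell_weight = np.array([0.10, 0.15, 0.20, 0.30, 0.35, 0.50])
diameter = np.array([0.30, 0.25, 0.40, 0.35, 0.50, 0.45])


def residual_std(x: np.ndarray, y: np.ndarray) -> float:
    """Standard deviation of the residuals around a least-squares line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(np.std(y - (slope * x + intercept)))


# A smaller value means the points sit closer to the fitted line
print(residual_std(shell_weight, rings))
print(residual_std(diameter, rings))
```

Because both fits predict the same target, the two values are directly comparable; the feature with the smaller residual spread gives the tighter linear fit.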

4. Create histograms of the two most correlated features, and the ring-age. What are the major observations?  (2 Marks)

In [None]:
# Now let's take a look at histograms of these two correlated features

# Create a histogram for shell_weight
shell_weight_histogram = alt.Chart(df, title="Shell Weight Distribution").mark_bar().encode(
  x=alt.X('shell_weight', title="Shell Weight (grams)"),
  y=alt.Y('count(rings):Q', title="Number of abalone", scale=alt.Scale(domain=(0, 150))),
  color=alt.ColorValue('#7BB2D9'),
).properties(
  width=480,
  height=300
)

# Create a histogram for diameter
diameter_histogram = alt.Chart(df, title="Diameter Distribution").mark_bar().encode(
  x=alt.X('diameter', title="Diameter (mm)"),
  y=alt.Y('count(rings):Q', title="Number of abalone", scale=alt.Scale(domain=(0, 150))),
  color=alt.ColorValue('#D62828')
).properties(
  width=480,
  height=300,
)

shell_weight_diameter_histogram_seperate = (shell_weight_histogram | diameter_histogram)
shell_weight_diameter_histogram_seperate.save('assets/shell_weight_diameter_histogram_seperate.png', ppi=300)
shell_weight_diameter_histogram_seperate

From this we can observe:
- The shell weight histogram is broad, suggesting significant variability in shell weight within the population.
- There is a noticeable right skew in the shell weight distribution: most abalone fall within a common range, with a tail of much heavier shells.
- The diameter histogram, while also broad, is narrower than the shell weight distribution, which points to less variability in diameter.
- The diameter distribution has steeper sides, which means that there are fewer extreme diameter values.
- The diameter distribution appears to be left-skewed, with a longer tail towards smaller diameters, in contrast to the right-skewed shell weight distribution.
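
Skew direction is easy to misjudge by eye, so it can help to confirm it with `Series.skew()`, where a positive value indicates a longer right tail and a negative value a longer left tail. The two series below are made-up illustrations, not abalone data:

```python
import pandas as pd

# Hypothetical toy series: one with a long right tail, one with a long left tail
right_tailed = pd.Series([1, 1, 1, 2, 2, 3, 10])
left_tailed = pd.Series([1, 8, 9, 9, 10, 10, 10])

print(right_tailed.skew())  # positive: tail extends towards larger values
print(left_tailed.skew())   # negative: tail extends towards smaller values
```

Calling `.skew()` on `df['shell_weight']` and `df['diameter']` would give the equivalent check for the two histograms above.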

In [None]:
# Create a histogram for shell_weight
shell_weight_histogram = alt.Chart(df, title="Shell Weight Distribution").mark_bar().encode(
  x=alt.X('shell_weight', title="Shell Weight (grams)"),
  y=alt.Y('count(rings):Q', title="Number of abalone", scale=alt.Scale(domain=(0, 150))),
  color=alt.ColorValue('#7BB2D9'),
  opacity=alt.value(0.35)
).properties(
  width=480,
  height=300
)

# Create a histogram for diameter
diameter_histogram = alt.Chart(df, title="Diameter Distribution").mark_bar().encode(
  x=alt.X('diameter', title="Diameter (mm)"),
  y=alt.Y('count(rings):Q', title="Number of abalone", scale=alt.Scale(domain=(0, 150))),
  color=alt.ColorValue('#D62828'),
  opacity=alt.value(0.35)
).properties(
  width=480,
  height=300,
)

shell_weight_diameter_histogram_together = (diameter_histogram + shell_weight_histogram)
shell_weight_diameter_histogram_together.save('assets/shell_weight_diameter_histogram_together.png', ppi=300)
shell_weight_diameter_histogram_together

Finally, this superimposed visualisation again reinforces the differences between the two distributions.

5. Create a 60/40 train/test split - which takes a random seed based on the experiment number to create a new dataset for every experiment (2 Marks).

In [None]:
# Let's create a function that splits our data into a 60/40 train/test split
# We also need to make sure the function can accept a random seed
def get_train_test_split(df: pd.DataFrame, random_seed: int) -> tuple[pd.DataFrame, pd.DataFrame]:
  """
  A function that splits our data into a 60/40 train/test split and accepts a random seed

  Args:
    df (pd.DataFrame): The dataframe to split
    random_seed (int): The random seed to use for reproducibility and experimentation

  Returns:
    tuple[pd.DataFrame, pd.DataFrame]: A tuple containing the train and test dataframes respectively
  """
  # Call the sklearn train_test_split function (it returns a list, so unpack it into a tuple)
  train_df, test_df = train_test_split(df, test_size=0.4, random_state=random_seed)
  return train_df, test_df
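
As a usage sketch, the function can be called once per experiment with the experiment number as the seed. The block below repeats a compact version of the function so that it runs on its own, and uses a made-up ten-row frame (`toy` and `experiment_number` are illustrative names):

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def get_train_test_split(df: pd.DataFrame, random_seed: int):
    """Compact copy of the notebook's function: 60/40 split with a seed."""
    return train_test_split(df, test_size=0.4, random_state=random_seed)


# Hypothetical toy frame standing in for the abalone data
toy = pd.DataFrame({'x': range(10), 'rings': range(10, 20)})

# Use the experiment number as the seed for each experiment
for experiment_number in range(3):
    train_df, test_df = get_train_test_split(toy, random_seed=experiment_number)
    print(experiment_number, len(train_df), len(test_df), sorted(test_df.index))
```

Reusing a seed reproduces the same split exactly, which makes it possible to rerun an individual experiment later.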

Finally, before we move on to the modelling notebook, let's save our processed dataframe.

In [None]:
# Save the dataframe to a parquet file
df.to_parquet('data/abalone_processed.parquet')