FJ- lecture 7

Let's revisit the Palmer penguins dataset and apply some basic linear regression. Our task is to evaluate the goodness of the model; we'll discuss the theory of linear regression next week. For now, it suffices to think of this as "fitting a line through a set of points".

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

In [None]:
# Load and preprocess the dataset (drop rows with missing values) 
df = pd.read_csv('penguins_size.csv').dropna()

# Split the dataset into training and test sets
df_train, df_test = train_test_split(df, random_state=1234)

In [None]:
# Number of rows in the full dataset and in each split
n_total = len(df)
n_train = len(df_train)
n_test = len(df_test)

print(f"Dataset (length: {n_total}) split into\ntraining set (length: {n_train}) and test set (length: {n_test})")

* df_train: The training set (default 75% of the data).
* df_test: The test set (default 25% of the data).
* The random_state=1234 ensures reproducibility so the same split occurs each time you run the code.
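
To make these defaults concrete, here is a minimal sketch on a made-up 100-row table (the `toy` DataFrame and its columns are placeholders, not the penguin data). It shows the default 75/25 split, an explicit 80/20 split via `test_size`, and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in table with 100 rows (not the real penguin data)
toy = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2.0})

# Default behaviour: 75% of the rows for training, 25% for testing
toy_train, toy_test = train_test_split(toy, random_state=1234)

# Explicit split: hold out 20% of the rows instead
toy_train_80, toy_test_20 = train_test_split(toy, test_size=0.2, random_state=1234)

# The same random_state gives the same rows on a second call
toy_train_again, _ = train_test_split(toy, random_state=1234)
same_split = toy_train.index.equals(toy_train_again.index)

print(len(toy_train), len(toy_test))        # 75 25
print(len(toy_train_80), len(toy_test_20))  # 80 20
print(same_split)                           # True
```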

In [None]:
# Initialize the Linear Regression model
lr = LinearRegression()

# Define the independent and dependent variables
independent_variables = ['culmen_length_mm','culmen_depth_mm','flipper_length_mm'] # x1,x2,x3
dependent_variable = 'body_mass_g'                                                 # y

# Prepare the training set
X_train = df_train[independent_variables]  # Features (independent variables) for training
y_train = df_train[dependent_variable]     # Target variable (dependent) for training

# Prepare the test set
X_test = df_test[independent_variables]    # Features for testing
y_test = df_test[dependent_variable]       # Target for testing

# Fit the model to the training data
lr.fit(X_train, y_train)  # trains the linear regression model on the provided data

In [None]:
# Make predictions on the test set
y_predicted = lr.predict(X_test)

# y_predicted: values predicted by the regression model, one for each row in X_test
# y_test:      the actual observed values from the test set

In [None]:
X_train

The model aims to predict the body mass of a penguin by using a linear combination of the other numerical features (culmen length/depth and flipper length).
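
As a rough illustration of what "linear combination" means here, the sketch below fits `LinearRegression` on synthetic, noise-free data with three made-up features and checks that a prediction is simply the intercept plus a weighted sum of the features. The arrays `X_toy` and `y_toy` and the chosen weights are assumptions for the example, not values from the penguin model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 rows, 3 features (placeholders for the three measurements)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))

# Target built as an exact linear combination: weights 2, -1, 0.5 and intercept 10
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + 0.5 * X_toy[:, 2] + 10.0

lr_toy = LinearRegression().fit(X_toy, y_toy)

# The fitted weights and intercept recover the ones used to build y_toy
print(lr_toy.coef_, lr_toy.intercept_)

# A prediction is intercept + sum of (weight * feature value)
manual_pred = X_toy @ lr_toy.coef_ + lr_toy.intercept_
matches = np.allclose(manual_pred, lr_toy.predict(X_toy))
print(matches)
```

On the notebook's own model, the same attributes (`lr.coef_` and `lr.intercept_`) hold one weight per entry of `independent_variables`.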

Let's start by plotting the absolute and relative error. Note that we are using a signed version of the error (predicted minus actual), so over- and under-predictions are kept apart. If the model makes sense, we should expect the errors to be strongly concentrated around 0 and distributed somewhat symmetrically around their mean.
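
A small sketch of the two quantities, using three made-up body masses in grams (the arrays `actual` and `predicted` are illustrative, not model output). The sign tells us whether a prediction is too high or too low, and a mean signed error near zero suggests no systematic over- or under-prediction.

```python
import numpy as np

# Hypothetical observed and predicted body masses in grams
actual = np.array([3000.0, 4000.0, 5000.0])
predicted = np.array([3150.0, 3800.0, 5050.0])

# Signed error: positive means the model predicted too high
signed_error = predicted - actual

# Relative error: signed error as a fraction of the actual value
relative_error = signed_error / actual

print(signed_error)         # over- and under-predictions keep their sign
print(relative_error)
print(signed_error.mean())  # 0.0 here: the errors cancel out on average
```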

In [None]:
fig, axs = plt.subplots(2) 
# Plot absolute error (difference between predicted and actual values)
axs[0].hist(y_predicted - y_test, bins=10)  # signed error: predicted minus actual
axs[0].set_xlabel('Absolute error')
axs[0].set_ylabel('Count')

# Plot relative error (signed error divided by the actual value)
# Body mass is strictly positive, so dividing by y_test is safe
axs[1].hist((y_predicted - y_test) / y_test, bins=10)
axs[1].set_xlabel('Relative error')
axs[1].set_ylabel('Count')

fig.suptitle('Absolute and relative error for Palmer penguins')
fig.tight_layout()
plt.show()

This looks roughly like a bell curve. It's difficult to determine whether the error is truly skewed because the number of samples is small. Let's compute some summary statistics: the Mean Squared Error (MSE) and the Root Mean Squared Error (RMSE).
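
Before applying them to the model, here is a minimal sketch of both statistics computed by hand on four made-up values and compared with `mean_squared_error` from scikit-learn. The arrays `y_true_toy` and `y_pred_toy` are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted body masses in grams
y_true_toy = np.array([3000.0, 4000.0, 5000.0, 4500.0])
y_pred_toy = np.array([3100.0, 3900.0, 5200.0, 4500.0])

# MSE: the mean of the squared errors (units: grams squared)
mse_manual = np.mean((y_pred_toy - y_true_toy) ** 2)

# RMSE: the square root of the MSE (units: grams)
rmse_manual = np.sqrt(mse_manual)

# scikit-learn computes the same MSE
mse_sklearn = mean_squared_error(y_true_toy, y_pred_toy)

print(mse_manual)   # 15000.0
print(mse_sklearn)  # 15000.0
print(rmse_manual)  # about 122.47
```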

In [None]:
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_predicted)
print(f"MSE: {mse:.4f}")

In [None]:
# Calculate Root Mean Square Error (RMSE)
rms = np.sqrt(mse)
print(f"RMSE: {rms:.4f}")

The RMSE is expressed in the same units as the target, so we can read it as follows: the typical prediction is off by about 384 grams. Treat this as an order of magnitude, with the typical precision of the model being around 400 grams.
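
One way to judge whether such an error is large or small is to compare it with a naive baseline that always predicts the mean of the training targets. The sketch below does this on synthetic data; the single feature, the coefficients, and the noise level are all made up for the illustration and are not estimates from the penguin dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one feature in mm and a noisy linear target in grams
rng = np.random.default_rng(1234)
feature = rng.uniform(170, 230, size=200)
target = 50.0 * feature - 5800.0 + rng.normal(0, 300, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    feature.reshape(-1, 1), target, random_state=1234
)

# RMSE of the fitted linear model on the test set
model = LinearRegression().fit(X_tr, y_tr)
rmse_model = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

# Baseline: always predict the mean target of the training set
baseline_pred = np.full_like(y_te, y_tr.mean())
rmse_baseline = np.sqrt(mean_squared_error(y_te, baseline_pred))

print(f"Model RMSE:    {rmse_model:.1f}")
print(f"Baseline RMSE: {rmse_baseline:.1f}")
```

If the model's RMSE is not clearly below the baseline's, the features are adding little beyond the average.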

Let's see if we can do better by focusing on a smaller subset. The dataset contains three different species of penguins and two sexes, both of which factor heavily into, well, how heavy the penguins are.

In [None]:
# Filter the DataFrame for Adelie male penguins
df_adelie_male = df.query('species == "Adelie" and sex == "MALE"')

# Split the dataset into training and test sets for Adelie males
# (fixed random_state so the split is reproducible, as above)
df_adelie_male_train, df_adelie_male_test = train_test_split(df_adelie_male, random_state=1234)

# Extract independent variables (features) from the training set
X_adelie_male_train = df_adelie_male_train[independent_variables]

# Extract the dependent variable (target) from the training set
y_adelie_male_train = df_adelie_male_train[dependent_variable]

# Extract independent variables (features) from the test set
X_adelie_male_test = df_adelie_male_test[independent_variables]

# Extract the dependent variable (target) from the test set
y_adelie_male_test = df_adelie_male_test[dependent_variable]

# Create a Linear Regression model for Adelie males
lr_adelie_male = LinearRegression()

# Fit the model using the training data
lr_adelie_male.fit(X_adelie_male_train,y_adelie_male_train)

# Predict the dependent variable values using the test data
y_adelie_male_pred = lr_adelie_male.predict(X_adelie_male_test)

# Save the full-dataset error figure created earlier (fig still refers to it here)
fig.savefig('palmer_penguins_error.pdf')

In [None]:
fig, axs = plt.subplots(2) 

# Plot the histogram of absolute errors on the first subplot
axs[0].hist(y_adelie_male_pred-y_adelie_male_test,bins=5)

# Plot the histogram of relative errors on the second subplot
axs[1].hist((y_adelie_male_pred-y_adelie_male_test)/y_adelie_male_test,bins=5)

axs[0].set_xlabel('Absolute error')
axs[0].set_ylabel('Count')
axs[1].set_xlabel('Relative error')
axs[1].set_ylabel('Count')
fig.suptitle('Absolute and relative error for male Adélie penguins')
fig.tight_layout()

In [None]:
# Calculate the MSE by hand: the mean of the squared errors
mse_adelie_male = np.mean((y_adelie_male_pred - y_adelie_male_test) ** 2)
print(f"MSE: {mse_adelie_male:.4f}")

In [None]:
# Calculate the RMSE for the Adelie male model
rms_adelie_male = np.sqrt(mse_adelie_male)
print(f"RMSE: {rms_adelie_male:.4f}")

In [None]:
# Mean body mass (g) for each species
df.groupby('species')['body_mass_g'].mean()

In [None]:
# Mean body mass (g) of Gentoo penguins, by sex
df.query('species == "Gentoo"').groupby('sex')['body_mass_g'].mean()

### Now it is time for you to play with data :)