# Lab 4 - Predicting a Continuous Target with Regression

Author: Craig Wilcox

Date: 4/2/2025

Introduction: The goal of this lab is to predict passenger fare, a continuous numeric target, from the Titanic dataset using regression models.

## Section 1 - Imports and Inspect Data

In [1]:
# Imports
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Load Titanic dataset from seaborn 
titanic = sns.load_dataset("titanic")
titanic.head()

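| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |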


## Section 2 - Data Exploration and Preparation

In [3]:
# Fill missing age values with the median age
titanic['age'] = titanic['age'].fillna(titanic['age'].median())

# Drop rows where fare is missing
titanic = titanic.dropna(subset=['fare'])

# Create a new feature, family_size (siblings/spouses + parents/children + the passenger)
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1
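
To make the three preparation steps easier to verify, here is a minimal sketch that applies them to a tiny hand-made DataFrame. The name `toy` and all of its values are invented for illustration and are not Titanic records.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (invented values, not real passengers)
toy = pd.DataFrame({
    "age":   [22.0, np.nan, 30.0, 40.0],
    "sibsp": [1, 0, 2, 1],
    "parch": [0, 0, 1, 2],
    "fare":  [7.25, 8.05, np.nan, 53.1],
})

# Step 1: fill the missing age with the median of the known ages (22, 30, 40)
toy["age"] = toy["age"].fillna(toy["age"].median())

# Step 2: drop the row whose fare is missing
toy = toy.dropna(subset=["fare"]).copy()

# Step 3: family size = siblings/spouses + parents/children + the passenger
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

print(toy)
```

The `.copy()` after `dropna` is a defensive choice in this sketch so that the later column assignment clearly targets a standalone frame.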

## Section 3 - Feature Selection and Justification

In [4]:
# Case 1 age only
X1 = titanic[['age']]
y1 = titanic['fare']

In [5]:
# Case 2 family_size only
X2 = titanic[['family_size']]
y2 = titanic['fare']

In [6]:
# Case 3 age and family_size
X3 = titanic[['age', 'family_size']]
y3 = titanic['fare']

In [7]:
# Case 4 pclass only
X4 = titanic[['pclass']]
y4 = titanic['fare']

### Reflection 3

Why might these features affect a passenger’s fare: I am not sure that age necessarily affects fare, but it will be interesting to find out. It would make sense for a larger family to pay a higher fare, and passenger class should affect the fare.

List all available features: survived, pclass, sex, age, sibsp, parch, fare, embarked, class, who, adult_male, deck, embark_town, alive, alone

Which other features could improve predictions and why: sibsp could help for the same reason that family_size does, although it is already one component of family_size. embarked or embark_town could also help, since fares may have differed by port of embarkation.

How many variables are in your Case 4: Just one, pclass.

Which variable(s) did you choose for Case 4 and why do you feel those could make good inputs: Only pclass, since I expected passenger class to be a good indicator of fare.

## Section 4 - Train, Evaluate, and Report Regression Model

In [8]:
# Split data

X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=123)

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=123)

X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=123)

X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, test_size=0.2, random_state=123)

In [9]:
# Create models 
lr_model1 = LinearRegression().fit(X1_train, y1_train)
lr_model2 = LinearRegression().fit(X2_train, y2_train)
lr_model3 = LinearRegression().fit(X3_train, y3_train)
lr_model4 = LinearRegression().fit(X4_train, y4_train)

# Predictions
y_pred_train1 = lr_model1.predict(X1_train)
y_pred_test1 = lr_model1.predict(X1_test)

y_pred_train2 = lr_model2.predict(X2_train)
y_pred_test2 = lr_model2.predict(X2_test)

y_pred_train3 = lr_model3.predict(X3_train)
y_pred_test3 = lr_model3.predict(X3_test)

y_pred_train4 = lr_model4.predict(X4_train)
y_pred_test4 = lr_model4.predict(X4_test)

In [16]:
# Report performance
# Note: mean_squared_error returns MSE (squared fare units); take np.sqrt of it to get RMSE
# Case 1
print("Case 1: Training R²:", r2_score(y1_train, y_pred_train1))
print("Case 1: Test R²:", r2_score(y1_test, y_pred_test1))
print("Case 1: Test MSE:", mean_squared_error(y1_test, y_pred_test1))
print("Case 1: Test MAE:", mean_absolute_error(y1_test, y_pred_test1))

# Case 2
print("Case 2: Training R²:", r2_score(y2_train, y_pred_train2))
print("Case 2: Test R²:", r2_score(y2_test, y_pred_test2))
print("Case 2: Test MSE:", mean_squared_error(y2_test, y_pred_test2))
print("Case 2: Test MAE:", mean_absolute_error(y2_test, y_pred_test2))

# Case 3
print("Case 3: Training R²:", r2_score(y3_train, y_pred_train3))
print("Case 3: Test R²:", r2_score(y3_test, y_pred_test3))
print("Case 3: Test MSE:", mean_squared_error(y3_test, y_pred_test3))
print("Case 3: Test MAE:", mean_absolute_error(y3_test, y_pred_test3))

# Case 4
print("Case 4: Training R²:", r2_score(y4_train, y_pred_train4))
print("Case 4: Test R²:", r2_score(y4_test, y_pred_test4))
print("Case 4: Test MSE:", mean_squared_error(y4_test, y_pred_test4))
print("Case 4: Test MAE:", mean_absolute_error(y4_test, y_pred_test4))


Case 1: Training R²: 0.009950688019452314
Case 1: Test R²: 0.0034163395508415295
Case 1: Test MSE: 1441.8455811188421
Case 1: Test MAE: 25.28637293162364
Case 2: Training R²: 0.049915792364760736
Case 2: Test R²: 0.022231186110131973
Case 2: Test MSE: 1414.6244812277246
Case 2: Test MAE: 25.025348159416414
Case 3: Training R²: 0.07347466201590014
Case 3: Test R²: 0.04978483276307333
Case 3: Test MSE: 1374.7601875944656
Case 3: Test MAE: 24.284935030470688
Case 4: Training R²: 0.3005588037487471
Case 4: Test R²: 0.30160172341699276
Case 4: Test MSE: 1010.4344561482964
Case 4: Test MAE: 20.65370367148405
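
The third metric printed for each case comes from `mean_squared_error`, which with its default arguments returns the mean squared error in squared fare units rather than the root. A minimal sketch of converting the four reported values into RMSE, so they sit on the same scale as MAE, might look like this. The dictionary name `reported_squared_error` is an illustrative assumption; the numbers are copied from the output above.

```python
import math

# Values printed above by mean_squared_error for each case's test split
reported_squared_error = {
    "Case 1": 1441.8455811188421,
    "Case 2": 1414.6244812277246,
    "Case 3": 1374.7601875944656,
    "Case 4": 1010.4344561482964,
}

# RMSE is the square root of MSE, which puts it back in fare units
rmse = {case: math.sqrt(value) for case, value in reported_squared_error.items()}

for case, value in rmse.items():
    print(f"{case}: Test RMSE: {value:.2f}")
```

Taking the square root does not change the ranking of the cases; it only makes the error size directly comparable to the MAE values.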


### Reflection 4

Compare the train vs test results for each.

Did Case 1 overfit or underfit? Explain: Extreme underfit. R² is below 1% on both train and test, and the test errors are large (MAE of about 25.3).

Did Case 2 overfit or underfit? Explain: Underfit again, though with a small improvement. A single feature is too simple a model to capture what drives fare.

Did Case 3 overfit or underfit? Explain: There is a modest improvement over the first two cases, but the model still underfits: test R² is only about 5% and the test errors barely change (MAE of about 24.3).

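Did Case 4 overfit or underfit? Explain: This was the best performer of the four cases, explaining about 30% of the variance with much lower test errors (MAE of about 20.7). Train and test R² are nearly identical, so it is not overfitting, but with roughly 70% of the variance unexplained it still underfits.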
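
The overfit/underfit calls above rest on two checks: how high the scores are, and how far the training score sits above the test score. A minimal sketch of that comparison, using the R² values reported earlier (rounded to four decimals; the name `r2_scores` is an illustrative assumption), could be:

```python
# (train R², test R²) for each case, rounded from the reported output
r2_scores = {
    "Case 1": (0.0100, 0.0034),
    "Case 2": (0.0499, 0.0222),
    "Case 3": (0.0735, 0.0498),
    "Case 4": (0.3006, 0.3016),
}

# A large positive gap (train far above test) would point to overfitting;
# low scores on both splits with a small gap point to underfitting.
gaps = {case: train - test for case, (train, test) in r2_scores.items()}

for case, (train, test) in r2_scores.items():
    print(f"{case}: train={train:.4f} test={test:.4f} gap={gaps[case]:+.4f}")
```

All four gaps are small, which is consistent with none of these models overfitting; the differences between cases come from how much variance each one explains.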

Adding Age

Did adding age improve the model: Yes, slightly. Test R² rose from about 2.2% with family_size alone (Case 2) to about 5.0% with age added (Case 3), but that is not enough to call the improvement significant.

Propose a possible explanation (consider how age might affect ticket price, and whether the data supports that): Fares may have differed by passenger age in some cases, but not consistently, and any such effect could be driven by other variables that these regressions do not capture.

Worst

Which case performed the worst: Case 1

How do you know: R² was below 1% on both train and test, and the test error values were the highest of any case.

Do you think adding more training data would improve it (and why/why not): No. Age is clearly a poor predictor of fare, and nothing in the results suggests that additional data would yield a more accurate model.

Best

Which case performed the best: Case 4

How do you know: R² is around 30% for both train and test, the highest of the four cases, and the test error values (including MAE) are the lowest, indicating the best accuracy.

Do you think adding more training data would improve it (and why/why not): Yes. The train and test scores are similar, which suggests the model is learning a real pattern, so I believe adding more data would improve it.
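
One way to test the "more training data" question rather than reason about it is a learning curve, which refits the model on progressively larger training subsets and scores each fit with cross-validation. The sketch below uses scikit-learn's `learning_curve` on synthetic data, not the Titanic data: the relationship between `pclass` and `fare`, the noise level, and the sample size are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (assumption: an invented linear trend plus noise)
rng = np.random.default_rng(123)
n_rows = 600
pclass = rng.integers(1, 4, size=n_rows)                 # classes 1, 2, 3
fare = 90 - 25 * pclass + rng.normal(0, 30, size=n_rows)
X = pclass.reshape(-1, 1)

# Fit on 10% to 100% of each training fold, scoring R² with 5-fold CV
sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X, fare,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="r2",
)

mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)
for size, tr, te in zip(sizes, mean_train, mean_test):
    print(f"n={size}: train R²={tr:.3f}, validation R²={te:.3f}")
```

If the validation score has already flattened by the largest training size, more rows alone are unlikely to help and additional features would be the next thing to try; if it is still rising, more data may be worthwhile.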