
# Coursework COMP3611 (Winter 2025)
**A pdf brief with submission instructions can be found in the same folder as this notebook. Please follow these instructions carefully**

This coursework notebook contains 2 questions. The total of all subquestions is worth 30 marks.

**All your answers need text cells!** *Comments in code do not count as answers*. Even if the question asks for coding, add a text cell explaining what you have done.

## Question 1 - Principal Component Analysis

**Q1 a (4 marks)**

You are given a data array called "shape_array.npy" that comprises 7 samples organised as columns in the array, where each column corresponds to one sample. The data format in each column is: [x_1, y_1, z_1, x_2, y_2, z_2, ………, x_N, y_N, z_N], where (x_i, y_i, z_i) corresponds to the i-th 3D point of a blood vessel. By plotting all 3D points in one column, you can obtain the shape of a blood vessel of that sample.

Plot seven figures to show the 3D blood vessel shape for each sample separately. Also plot two arbitrary shapes on top of each other to get a feeling of how similar or dissimilar the shapes are.

In [79]:
import plotly.express as px
import numpy as np
import pandas as pd

shape_array = np.load('shape_array.npy')
print(shape_array.shape)

def get_display_datapoint(data):

  x = data[0::3]
  y = data[1::3]
  z = data[2::3]

  df = pd.DataFrame({"x": x, "y": y, "z": z})
  return df


for i in range(7):
  df = get_display_datapoint(shape_array.T[i])
  fig = px.line_3d(df, x = "x", y = "y", z = "z")
  fig.show()

(1845, 7)


In the code above, we take every third element of each column, starting at offsets 0, 1 and 2, to extract the x, y and z coordinates separately.
For each of the seven columns we then construct a DataFrame of coordinates and plot it as a 3D line.
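
The strided slicing is equivalent to reshaping the flat column into an N x 3 array of points. A minimal sketch on a made-up vector (the values in `flat` are illustrative and are not taken from `shape_array.npy`):

```python
import numpy as np

# Hypothetical flattened sample: [x_1, y_1, z_1, x_2, y_2, z_2, ...]
flat = np.arange(12, dtype=float)

# Strided slicing, as in get_display_datapoint
x, y, z = flat[0::3], flat[1::3], flat[2::3]

# Equivalent view: one row per 3D point, columns are x, y, z
points = flat.reshape(-1, 3)

print(points.shape)
```

Either form gives the same coordinates; the reshape is convenient when a single (N, 3) array is easier to pass around than three separate vectors.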

In [80]:
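df1 = get_display_datapoint(shape_array.T[0])
df2 = get_display_datapoint(shape_array.T[1])

# label each shape so the two lines are drawn and coloured separately
df1["shape"] = "sample 0"
df2["shape"] = "sample 1"

df = pd.concat([df1, df2])
fig = px.line_3d(df, x = "x", y = "y", z = "z", color = "shape")
fig.show()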

In the code above we construct a dataframe for two arbitrary shapes and give them distinct line names. We then concatenate these into one dataframe which can be plotted in 3D.

**Q1 b (10 marks)**

Next, perform eigendecomposition of the covariance matrix estimated from the given data array. Finally, project original data onto lower-dimensional space and reconstruct data.

Proceed as follows:

1. Subtract the mean from the data, so that it is centered around the origin.

2. Estimate the covariance matrix from the centred data.

3. Calculate eigenvectors and eigenvalues using numpy functions

4. Project centered data (1845 dimension) into a lower-dimension space (You need to choose a reasonable dimension).

5. Reconstruct the blood vessel shape from the lower dimension data in step 4.

As a sanity check plot a blood vessel shape reconstructed from the eigenvectors on top of the original blood vessel shape. Explain how much data reduction you have achieved. Comment on your results.

In [81]:
print(shape_array.shape)
data_points = shape_array.T
print(data_points.shape)

#center datapoints
means = np.mean(data_points, axis = 0)
print(means.shape)
data_points_centered = data_points - means
print(data_points_centered.shape)

df = get_display_datapoint(data_points_centered[2])
fig = px.line_3d(df, x = "x", y = "y", z = "z")
fig.show()

(1845, 7)
(7, 1845)
(1845,)
(7, 1845)


In [82]:
#find covariance matrix
covariance_estimate = np.cov(data_points_centered.T)
print(covariance_estimate.shape)
#find eigen values and vectors
eig_vals, eig_vecs = np.linalg.eig(covariance_estimate)
print(eig_vals.shape)

(1845, 1845)
(1845,)


In [83]:
#express dataset in eigenbasis
eigenbasis = eig_vecs.T.dot(data_points_centered.T).real
print(eigenbasis.shape)

(1845, 7)


### Get lowest eigenvalues

In [84]:
real_eig_vals = eig_vals.real
# Get the total number of eigenvalues
n_total_eig_vals = len(real_eig_vals)

reduction = 0.9

# Calculate the number of eigenvalues corresponding to the lowest n%
n_lowest_percent = int(np.floor(reduction * n_total_eig_vals))

# Get the indices that would sort the eigenvalues in ascending order
# These are the original indices of the smallest eigenvalues first
indices_sorted_ascending = np.argsort(real_eig_vals)

# Select the original indices of the lowest eigenvalues
lowest_percent_indices = indices_sorted_ascending[:n_lowest_percent]

print(f"Total number of eigenvalues: {n_total_eig_vals}")
print(f"Number of eigenvalues considered in the lowest 90%: {n_lowest_percent}")
print(f"Original indices of the lowest {n_lowest_percent} eigenvalues:\n{lowest_percent_indices}")
print(len(lowest_percent_indices))


Total number of eigenvalues: 1845
Number of eigenvalues considered in the lowest 90%: 1660
Original indices of the lowest 1660 eigenvalues:
[  7   8   9 ... 502 589 544]
1660
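
One common way to choose the reduced dimension is to keep the smallest number of components that explains a target share of the total variance, rather than a fixed fraction of the eigenvalue count. The sketch below assumes a hypothetical helper `n_components_for_variance` and made-up eigenvalues; note also that centred data from 7 samples has rank at most 6, so at most 6 eigenvalues of the covariance matrix can be non-zero.

```python
import numpy as np

def n_components_for_variance(eig_vals, target=0.95):
    """Smallest k whose top-k eigenvalues explain at least `target` of the variance."""
    # Sort descending and clip tiny negative values caused by round-off
    vals = np.sort(np.clip(np.real(eig_vals), 0.0, None))[::-1]
    cumulative = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(cumulative, target) + 1)
    return min(k, len(vals))

# Made-up eigenvalues for illustration only
example_vals = np.array([5.0, 3.0, 1.0, 0.5, 0.5, 0.0, 0.0])
k = n_components_for_variance(example_vals, target=0.85)
print(k)
```

In the notebook this could be called on `eig_vals`, with the target share treated as a tunable assumption.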


### Perform reduction by setting to 0 at low eigenvalue positions

In [85]:
#set eigenbasis rows to 0 at indices of lowest eigenvalues
eigenbasis[lowest_percent_indices, :] = 0

print(eigenbasis.shape)
print(eigenbasis)

(1845, 7)
[[ 8.65868278e-03 -4.78102045e-02 -3.52272572e-02 ...  4.86487354e-02
  -9.27925616e-05  2.93052925e-02]
 [ 8.03143855e-03  2.49408721e-02 -1.52638448e-02 ...  1.21018169e-02
   8.21503214e-03 -4.12392512e-03]
 [-5.70059139e-03  2.03563476e-03 -8.10759034e-03 ...  1.02909283e-02
   9.55762714e-03 -2.03369421e-02]
 ...
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00  0.00000000e+00]]


### Project reduced dataset back to original co-ordinates

In [86]:
#get inverse of eigvecs
einvec_inv = np.linalg.inv(eig_vecs)

#apply inverse of eigvecs to 'unrotate'
reduced_shapes = einvec_inv.T.dot(eigenbasis).real
print(reduced_shapes.shape)

(1845, 7)


In [87]:
#re-add the means
reduced_shapes_un_centered = reduced_shapes + means[:, np.newaxis]
print(reduced_shapes_un_centered.shape)

(1845, 7)


In [88]:
for i in range(7):
  df = get_display_datapoint(reduced_shapes_un_centered.T[i])
  fig = px.line_3d(df, x = "x", y = "y", z = "z")
  fig.show()

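We start by centring the data: the mean of each coordinate across the seven samples is subtracted, so that the dataset is centred on the origin.
The covariance matrix is then estimated by summing the outer products $x_i x_i^T$ of the centred sample vectors $x_i$ and dividing by the number of samples minus one, which is the default normalisation of `np.cov`.
We then use `np.linalg.eig()` to calculate the eigenvalues and eigenvectors.
The centred data are expressed in the eigenvector basis, the coefficients belonging to the 1660 smallest eigenvalues are set to zero, and the result is transformed back and the means re-added to give the reconstructed shapes.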
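
The five steps can be sketched compactly on synthetic data. This is a sketch under assumptions rather than the notebook's exact method: it uses `np.linalg.eigh` (the solver for symmetric matrices, which returns real values) in place of `np.linalg.eig`, and a random 7 x 30 array stands in for the real shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 30))              # synthetic stand-in: 7 samples, 30 features

mean = X.mean(axis=0)
Xc = X - mean                             # step 1: centre each feature

cov = np.cov(Xc, rowvar=False)            # step 2: (30, 30), normalised by n - 1

eig_vals, eig_vecs = np.linalg.eigh(cov)  # step 3: eigenvalues in ascending order
order = np.argsort(eig_vals)[::-1]        # reorder to descending
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

k = 6                                     # centred data from 7 samples has rank <= 6
W = eig_vecs[:, :k]
scores = Xc @ W                           # step 4: project to (7, k)
X_rec = scores @ W.T + mean               # step 5: reconstruct to (7, 30)

print(np.allclose(X, X_rec))
```

Because the centred data has rank at most 6, keeping 6 components already reproduces every sample up to floating-point error; fewer components trade accuracy for a smaller representation.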

**Q1 c (4 marks)**

Research PCA analysis using the *scikit-learn* library. Perform PCA analysis and show the reconstructed data of any blood vessel shape on top of the
original blood vessel shape. There are variables in the PCA object that correspond to the eigenvalues used for choosing projection eigenvectors. Compare the eigenvalues  and eigenvectors you have computed in the previous question with the eigenvalues  and the eigenvectors computed by the *scikit-learn* library. Compare the reconstructed coordinates from both methods. Comment on your results.
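
A minimal sketch of such a comparison, on synthetic data rather than the blood vessel shapes: `PCA.explained_variance_` should match the largest eigenvalues of the sample covariance matrix, and the rows of `PCA.components_` should match the corresponding eigenvectors up to sign. The array `X` and the choice `k = 3` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 30))              # synthetic stand-in: 7 samples, 30 features

k = 3
pca = PCA(n_components=k)
scores = pca.fit_transform(X)             # PCA centres X internally
X_rec = pca.inverse_transform(scores)     # reconstruction in the original space

# Manual eigendecomposition of the sample covariance for comparison
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eig_vals)[::-1][:k]    # indices of the k largest eigenvalues

eigenvalues_match = np.allclose(pca.explained_variance_, eig_vals[order])

# Eigenvectors are only defined up to sign, so compare |dot product| with 1
dots = np.abs(np.sum(pca.components_ * eig_vecs[:, order].T, axis=1))
eigenvectors_match = np.allclose(dots, 1.0)

print(eigenvalues_match, eigenvectors_match)
```

The sign ambiguity matters when comparing raw eigenvectors, but it cancels out in the reconstruction, so reconstructed coordinates from both methods can be compared directly.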


## Question 2: Predict Cancer Mortality Rates in US Counties

The provided dataset comprises data collected from multiple counties in the US. The regression task for this assessment is to predict cancer mortality rates in "unseen" US counties, given some training data. The training data ('Training_data.csv') comprises various features/predictors related to socio-economic characteristics, amongst other types of information for specific counties in the country. The corresponding target variables for the training set are provided in a separate CSV file ('Training_data_targets.csv'). Use the notebooks provided for lab sessions throughout this module to provide solutions to the exercises listed below. Throughout all exercises, text describing your code and answering any questions included in the exercise descriptions should be included as part of your submitted solution.


The predictors/features available in this data set are described below:

**Data Dictionary**

avgAnnCount: Mean number of reported cases of cancer diagnosed annually

avgDeathsPerYear: Mean number of reported mortalities due to cancer

incidenceRate: Mean per capita (100,000) cancer diagnoses

medianIncome: Median income per county

popEst2015: Population of county

povertyPercent: Percent of populace in poverty

MedianAge: Median age of county residents

MedianAgeMale: Median age of male county residents

MedianAgeFemale: Median age of female county residents

AvgHouseholdSize: Mean household size of county

PercentMarried: Percent of county residents who are married

PctNoHS18_24: Percent of county residents ages 18-24 highest education attained: less than high school

PctHS18_24: Percent of county residents ages 18-24 highest education attained: high school diploma

PctSomeCol18_24: Percent of county residents ages 18-24 highest education attained: some college

PctBachDeg18_24: Percent of county residents ages 18-24 highest education attained: bachelor's degree

PctHS25_Over: Percent of county residents ages 25 and over highest education attained: high school diploma

PctBachDeg25_Over: Percent of county residents ages 25 and over highest education attained: bachelor's degree

PctEmployed16_Over: Percent of county residents ages 16 and over employed

PctUnemployed16_Over: Percent of county residents ages 16 and over unemployed

PctPrivateCoverage: Percent of county residents with private health coverage

PctPrivateCoverageAlone: Percent of county residents with private health coverage alone (no public assistance)

PctEmpPrivCoverage: Percent of county residents with employee-provided private health coverage

PctPublicCoverage: Percent of county residents with government-provided health coverage

PctPubliceCoverageAlone: Percent of county residents with government-provided health coverage alone

PctWhite: Percent of county residents who identify as White

PctBlack: Percent of county residents who identify as Black

PctAsian: Percent of county residents who identify as Asian

PctOtherRace: Percent of county residents who identify in a category which is not White, Black, or Asian

PctMarriedHouseholds: Percent of married households

BirthRate: Number of live births relative to number of women in county

In [None]:
import os
import pandas as pd

## Define paths to the training data and targets files
root_dir = './'
training_data_path = root_dir + 'Training_data.csv'
training_targets_path = root_dir + 'Training_data_targets.csv'

**Q2 a**

Read in the training data and targets files. The training data comprises features/predictors while the targets file comprises the targets (i.e. cancer mortality rates in US counties) you need to train models to predict. Plot histograms of all features to visualise their distributions and identify outliers. Do you notice any unusual values for any of the features? If so, comment on these in the text accompanying your code. Compute correlations of all features with the target variable (across the data set) and sort them according to the strength of correlations. Which are the top five features with strongest correlations to the targets? Plot these correlations using the scatter matrix plotting function available in pandas and comment on at least two sets of features that show visible correlations to each other.

**(4 marks)**

Unusual values:

avgAnnCount: one value of 25k, an outlier

avgDeathsPerYear: one value of 10k, an outlier

incidenceRate: one value of 1200, an outlier

StudyPerCap: outliers around 9k; this feature is not described in the data dictionary

MedianAge: some values fall in the 350-600 range, which is impossible for an age in years

AvgHouseholdSize: 48 counties have a value of 0

BirthRate: values of up to 21, which looks unusually high
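
One possible way to handle impossible values is to mark them as missing so that a later imputation step can fill them. The sketch below uses made-up rows and hypothetical plausibility limits; the thresholds in `limits` are assumptions and do not come from the brief.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic two of the anomalies listed above
df = pd.DataFrame({
    "MedianAge": [38.5, 41.2, 498.0, 35.0],
    "AvgHouseholdSize": [2.5, 0.0, 2.7, 2.4],
})

# Hypothetical plausibility limits (assumptions, not from the brief)
limits = {"MedianAge": (0, 120), "AvgHouseholdSize": (1, 10)}

cleaned = df.copy()
for col, (low, high) in limits.items():
    out_of_range = ~cleaned[col].between(low, high)
    cleaned.loc[out_of_range, col] = np.nan   # leave these for an imputer

print(cleaned.isna().sum())
```

Genuine but extreme values, such as a very populous county with a large `avgAnnCount`, are a different case and may be better left in place.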

In [None]:
training_data = pd.read_csv(training_data_path)
training_data_targets = pd.read_csv(training_targets_path)

for column in training_data.columns:
    fig = px.histogram(training_data, x=column, nbins=50, title=f"Histogram of {column}")
    fig.show()


In [None]:
#finding correlation
correlation_matrix = training_data.corrwith(training_data_targets['TARGET_deathRate'])

top5 = correlation_matrix.reindex(
    correlation_matrix.abs().sort_values(ascending=False).head(5).index
)

print(top5)


In [None]:
# Combine top 5 features and the target into a single DataFrame for plotting
features_to_plot = top5.index.tolist()
plot_df = training_data[features_to_plot].copy()
plot_df['TARGET_deathRate'] = training_data_targets['TARGET_deathRate']

# Plot individual scatter plots for each of the top 5 features against TARGET_deathRate
for feature in features_to_plot:
    fig = px.scatter(plot_df, x=feature, y='TARGET_deathRate', title=f"Scatter Plot of {feature} vs. TARGET_deathRate")
    fig.show()

The values above are the five features most strongly correlated with the target (positively or negatively).

incidenceRate and TARGET_deathRate have a clear visual correlation: especially if the outlier data points are ignored, the scatter plot shows a clear upward trend.

PctBachDeg25_Over also shows a visible negative correlation with the death rate, which falls off quickly around 5-10% and then flattens out.
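
The question asks for the pandas scatter matrix, which also shows how the features relate to each other, not only to the target. A minimal sketch on synthetic data follows; the column names `feature_a` and `feature_b` are placeholders, and in the notebook the same call could be applied to `plot_df`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)

# Synthetic stand-in for the top features plus the target
demo = pd.DataFrame(
    rng.normal(size=(100, 3)),
    columns=["feature_a", "feature_b", "TARGET_deathRate"],
)

# One scatter plot per pair of columns, histograms on the diagonal
axes = scatter_matrix(demo, figsize=(8, 8), diagonal="hist", alpha=0.5)
print(axes.shape)
```

Each off-diagonal panel plots one column against another, so strongly related feature pairs show up as narrow diagonal bands.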

**Q2 b**

Create an ML pipeline using scikit-learn (as demonstrated in the lab notebooks) to pre-process the training data.

**(3 marks)**

In [None]:
from sklearn.model_selection import train_test_split

Test_size = 0.2
seed = 67

X_train, X_test, y_train, y_test = train_test_split(training_data, training_data_targets, test_size = Test_size, random_state = seed)
print("X_train: ", X_train.shape)
print("X_test: ", X_test.shape)
print("y_train: ", y_train.shape)
print("y_test: ", y_test.shape)

In [None]:
print(X_train)
print(X_test)

Next, we impute missing values using the median of each feature, with the imputer fitted on the training set only.

In [None]:
from sklearn.impute import SimpleImputer

imputer=SimpleImputer(strategy="median")
imputer.fit(X_train)

X_train_tr = imputer.transform(X_train)
print(X_train_tr)
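
The imputation step can also be wrapped in a scikit-learn `Pipeline`, which keeps every pre-processing step fitted on the training data only. The sketch below is an assumption about what such a pipeline might look like: the `StandardScaler` step is not part of the code above, and `X_demo` is made-up data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Median imputation followed by standardisation (the scaler is an added assumption)
preprocess = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Made-up data with missing entries
X_demo = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [np.nan, 40.0],
])

X_demo_tr = preprocess.fit_transform(X_demo)   # fit on training data
print(X_demo_tr.shape)
```

In the notebook, `preprocess.fit_transform(X_train)` followed by `preprocess.transform(X_test)` would apply the same fitted medians and scaling to both sets.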

**Q2 c**

Perform linear regression on the target data. Make a training/test split of 80/20. Train a linear regression model on the training set. Evaluate the root mean squared error on both the test and training set. For both the training and the test data, plot the predicted value vs the actual target value. Discuss on the basis of these results whether you believe that the model overfits, underfits or fits correctly.

**(5 marks)**



In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(X_train_tr, y_train)

train_predictions = lin_reg.predict(X_train_tr)
train_rmse = np.sqrt(mean_squared_error(y_train, train_predictions))
print(train_rmse)

In [None]:
test_predictions = lin_reg.predict(imputer.transform(X_test))
test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
print(test_rmse)

In [None]:
import matplotlib.pyplot as plt

# Create a DataFrame for plotting
train_plot_df = pd.DataFrame({
    'Actual': y_train['TARGET_deathRate'].values,
    'Predicted': train_predictions.flatten()
})

# Plotting predictions vs target values
fig = px.scatter(train_plot_df, x='Actual', y='Predicted', title='Predicted vs. Actual Target Values (Training Set)')
fig.update_layout(xaxis_title='Actual TARGET_deathRate', yaxis_title='Predicted TARGET_deathRate')
fig.show()

# Create a DataFrame for plotting
test_plot_df = pd.DataFrame({
    'Actual': y_test['TARGET_deathRate'].values,
    'Predicted': test_predictions.flatten()
})

# Plotting predictions vs target values
fig = px.scatter(test_plot_df, x='Actual', y='Predicted', title='Predicted vs. Actual Target Values (Test Set)')
fig.update_layout(xaxis_title='Actual TARGET_deathRate', yaxis_title='Predicted TARGET_deathRate')
fig.show()

The plots above and the RMSE values show that the linear regression model performs well on both the training and test sets. This indicates that a good fit has been found for the data: if we were overfitting, we would see good performance on the training set and poor performance on the test set, and if we were underfitting, we would see poor performance on both. In this instance the RMSE is actually slightly lower on the test predictions than on the training predictions.
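
One way to make the over/underfitting judgement more concrete is to compare the train and test RMSE with each other and with a trivial baseline that always predicts the training mean. The sketch below uses synthetic data, and the helper `rmse` is a hypothetical convenience function; none of the numbers relate to the cancer dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic linear data with unit-variance noise
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

def rmse(y_true, y_pred):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

train_rmse = rmse(y_tr, model.predict(X_tr))
test_rmse = rmse(y_te, model.predict(X_te))
# Baseline: always predict the mean of the training targets
baseline_rmse = rmse(y_te, np.full_like(y_te, y_tr.mean()))

print(f"train={train_rmse:.2f} test={test_rmse:.2f} baseline={baseline_rmse:.2f}")
```

A small gap between train and test RMSE argues against overfitting, while both being well below the baseline argues against underfitting.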