<a href="https://colab.research.google.com/github/sc22lg/COMP3611_Machine_Learning/blob/main/COMP3611_202526_Set.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coursework COMP3611 (Winter 2025)
**A PDF brief with submission instructions can be found in the same folder as this notebook. Please follow these instructions carefully.**

This coursework notebook contains 2 questions. The total of all subquestions is worth 30 marks.

**All your answers need text cells!** *Comments in code do not count as answers*. Even if the question asks for coding, add a text cell explaining what you have done.

## Question 1 - Principal Component Analysis

**Q1 a (4 marks)**

You are given a data array called "shape_array.npy" that comprises 7 samples organised as columns in the array, where each column corresponds to one sample. The data format in each column is: [x_1, y_1, z_1, x_2, y_2, z_2, ………, x_N, y_N, z_N], where (x_i, y_i, z_i) corresponds to the i-th 3D point of a blood vessel. By plotting all 3D points in one column, you can obtain the shape of a blood vessel of that sample.

Plot seven figures to show the 3D blood vessel shape for each sample separately. Also plot two arbitrary shapes on top of each other to get a feeling of how similar or dissimilar the shapes are.

In [7]:
import plotly.express as px
import numpy as np
import pandas as pd

shape_array = np.load('shape_array.npy')
#print("array: ", shape_array)

x = shape_array[0::3]
y = shape_array[1::3]
z = shape_array[2::3]

#print("X: ", x)
#print("Y: ", y)
#print("Z: ", z)
print(x.shape)

for i in range(7):
  df = pd.DataFrame({"x": x[:, i], "y": y[:, i], "z": z[:, i]})
  fig = px.line_3d(df, x = "x", y = "y", z = "z")
  fig.show()



(615, 7)


In the code above we take every third row of the array, starting at offsets 0, 1 and 2, to separate the X, Y and Z coordinates. Each resulting array has one row per point and one column per sample.
Then, for each of the 7 columns, we build a dataframe of coordinates and plot it as a 3D line.
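
The strided slicing can be checked on a small array. The sketch below uses a made-up `toy` array (2 points, 2 samples) with the same interleaved layout as `shape_array`; the reshape at the end is an equivalent alternative, not the method used in the cell above.

```python
import numpy as np

# Toy array with the same layout as shape_array: rows are interleaved
# coordinates [x_1, y_1, z_1, x_2, y_2, z_2], columns are samples.
toy = np.array([
    [1.0, 10.0],   # x_1
    [2.0, 20.0],   # y_1
    [3.0, 30.0],   # z_1
    [4.0, 40.0],   # x_2
    [5.0, 50.0],   # y_2
    [6.0, 60.0],   # z_2
])

# Strided slicing: every third row, starting at offsets 0, 1 and 2.
x_toy = toy[0::3]
y_toy = toy[1::3]
z_toy = toy[2::3]

# Equivalent view: reshape to (points, 3, samples) and index the middle axis.
points = toy.reshape(-1, 3, toy.shape[1])
print(x_toy)
print(points[:, 0, :])
```

Both routes give arrays of shape (points, samples), which is why `x.shape` is `(615, 7)` for the real data.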

In [79]:
# Label each line with the column index it is actually taken from
df1 = pd.DataFrame({"x": x[:, 6], "y": y[:, 6], "z": z[:, 6], "line": "Blood vessel 6"})
df2 = pd.DataFrame({"x": x[:, 5], "y": y[:, 5], "z": z[:, 5], "line": "Blood vessel 5"})

df = pd.concat([df1, df2])
fig = px.line_3d(df, x = "x", y = "y", z = "z", color = "line")
fig.show()

In the code above we construct a dataframe for two arbitrary shapes and give them distinct line names. We then concatenate these into one dataframe which can be plotted in 3D.

**Q1 b (10 marks)**

Next, perform an eigendecomposition of the covariance matrix estimated from the given data array. Finally, project the original data onto a lower-dimensional space and reconstruct the data.

Proceed as follows:

1. Subtract the mean from the data, so that it is centered around the origin.

2. Estimate the covariance matrix from the centred data.

3. Calculate eigenvectors and eigenvalues using numpy functions.

4. Project the centred data (1845 dimensions) into a lower-dimensional space (you need to choose a reasonable dimension).

5. Reconstruct the blood vessel shape from the lower-dimensional data in step 4.

As a sanity check plot a blood vessel shape reconstructed from the eigenvectors on top of the original blood vessel shape. Explain how much data reduction you have achieved. Comment on your results.

In [28]:
# Get the mean of each axis (x, y, z) for each sample (column)
x_mean = np.mean(x, axis = 0)
y_mean = np.mean(y, axis = 0)
z_mean = np.mean(z, axis = 0)

x_centered = x - x_mean
y_centered = y - y_mean
z_centered = z - z_mean


matrix = np.zeros((3, 3))
for i in range(len(x)):
  for j in range(7):
    coordinate = np.array([x_centered[i][j], y_centered[i][j], z_centered[i][j]])
    matrix += np.outer(coordinate, coordinate.T)

matrix = matrix / (7*len(x))

print(matrix)

eigenvalues, eigenvectors = np.linalg.eig(matrix)
print(eigenvalues)
print(eigenvectors)

[[ 5.86308571e-06  7.26350546e-06  3.97254993e-06]
 [ 7.26350546e-06  4.44825738e-05 -1.09550351e-05]
 [ 3.97254993e-06 -1.09550351e-05  1.11978777e-04]]
[4.21517204e-06 4.43243204e-05 1.13784944e-04]
[[-0.97984087  0.1980758   0.0260353 ]
 [ 0.19187902  0.96935251 -0.15342147]
 [ 0.05562646  0.145333    0.98781781]]


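We start by centring the coordinates: for each sample, the mean of each axis is subtracted from that sample's x, y and z values.
Across the whole dataset (all 7 samples, each with $N = 615$ points) we then sum the outer products

$$\sum_{j=1}^{7} \sum_{i=1}^{N} \mathbf{p}_{ij}\,\mathbf{p}_{ij}^{T}, \qquad \mathbf{p}_{ij} = [x_{ij},\, y_{ij},\, z_{ij}]^{T},$$

and divide by the number of terms, $7N$, which gives a $3 \times 3$ covariance matrix.
We then use `np.linalg.eig()` to calculate its eigenvalues and eigenvectors.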
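
The sum of outer products can also be written as a single matrix product. The sketch below is an illustration on random stand-in values (the names `xc`, `yc`, `zc` and the small sizes are assumptions, not the coursework data); it checks that the loop and the vectorised form agree, and uses `np.linalg.eigh`, which is intended for symmetric matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_samples = 5, 3  # small stand-ins for 615 and 7

# Hypothetical stand-ins for the centred coordinates, shape (points, samples).
xc = rng.normal(size=(n_points, n_samples))
yc = rng.normal(size=(n_points, n_samples))
zc = rng.normal(size=(n_points, n_samples))

# Loop version: accumulate one 3x3 outer product per point.
cov_loop = np.zeros((3, 3))
for i in range(n_points):
    for j in range(n_samples):
        p = np.array([xc[i, j], yc[i, j], zc[i, j]])
        cov_loop += np.outer(p, p)
cov_loop /= n_points * n_samples

# Vectorised version: stack every point as a row of an (n, 3) matrix.
P = np.column_stack([xc.ravel(), yc.ravel(), zc.ravel()])
cov_vec = P.T @ P / P.shape[0]

# eigh returns real eigenvalues in ascending order for symmetric input.
eigenvalues_sym, eigenvectors_sym = np.linalg.eigh(cov_vec)
print(np.allclose(cov_loop, cov_vec))
```

Because `eigh` sorts eigenvalues in ascending order, the largest component is the last column of `eigenvectors_sym`.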

In [69]:
idx = np.argsort(eigenvalues)[::-1]
eigvals = eigenvalues[idx]
eigvecs = eigenvectors[:, idx]

# choose top two vectors for 2D projection
W = eigvecs[:, :2]
datapoints = []

print(np.shape(x_centered))

for i in range(7):
  for j in range(len(x_centered)):
    point = np.dot([x_centered[j][i], y_centered[j][i], z_centered[j][i]], W)
    datapoints.append(point.tolist())

print(np.shape(datapoints))

(615, 7)
(4305, 2)


Next, we sort the eigenvalues and their eigenvectors from largest to smallest, and keep the top two eigenvectors to perform a 2D projection.
Each point in the dataset is projected into two dimensions by taking its dot product with these two eigenvectors.
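
One way to judge whether two components are a reasonable choice is the fraction of variance each eigenvalue explains. The sketch below reuses the three eigenvalues printed by the eigendecomposition cell; the variable names are illustrative.

```python
import numpy as np

# Eigenvalues as printed by the eigendecomposition cell above.
printed_eigenvalues = np.array([4.21517204e-06, 4.43243204e-05, 1.13784944e-04])

# Sort from largest to smallest and normalise to get variance fractions.
sorted_vals = np.sort(printed_eigenvalues)[::-1]
explained_ratio = sorted_vals / sorted_vals.sum()
cumulative = np.cumsum(explained_ratio)

print(explained_ratio)
print(cumulative)
```

With these values the top two components account for roughly 97% of the total variance, so dropping the third discards only a small share.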


In [73]:
datapoints_array = np.array(datapoints)
datapoints_array = datapoints_array.reshape(7, 615, 2)

print(datapoints_array.shape)
for i in range(7):
  df = pd.DataFrame({"x": 0, "y": datapoints_array[i, :, 1], "z": datapoints_array[i, :, 0]})
  fig = px.line_3d(df, x = "x", y = "y", z = "z")
  fig.show()

(7, 615, 2)


We reshape the data into the correct format (7 samples of 615 2D data points) and draw a figure for each 2D-projected sample.

In [89]:
df_whole = pd.DataFrame({"x": x[:, 0], "y": y[:, 0], "z": z[:, 0], "line": "Blood vessel 0"})
df_flat = pd.DataFrame({"x": 0, "y": datapoints_array[0, :, 1], "z": datapoints_array[0, :, 0], "line": "Flat Blood vessel"})
#df_flat = pd.DataFrame({"x": 0, "y": y[:, 0], "z": z[:, 0], "line": "Flat Blood vessel"})

# Compare the projection with the original shape of the same sample (column 0)
df = pd.concat([df_whole, df_flat])
fig = px.line_3d(df, x = "x", y = "y", z = "z", color = "line")
fig.show()
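
The projected coordinates live in the basis of the two chosen eigenvectors. To compare them with the original shape in the original axes, they can be mapped back by multiplying with the transpose of the projection matrix and adding the mean again. The sketch below illustrates this on synthetic, nearly planar points; the data and variable names are assumptions, not the coursework arrays.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for one sample: 615 points in 3D, nearly planar.
points = rng.normal(size=(615, 3)) * np.array([3.0, 1.0, 0.05])
mean = points.mean(axis=0)
centred = points - mean

# Covariance and eigenvectors, sorted from largest to smallest eigenvalue.
cov = centred.T @ centred / centred.shape[0]
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

W = vecs[:, :2]                          # (3, 2) projection matrix
projected = centred @ W                  # (615, 2) low-dimensional coordinates
reconstructed = projected @ W.T + mean   # back to (615, 3) in the original axes

# The mean squared reconstruction error equals the discarded eigenvalue.
mse = np.mean(np.sum((points - reconstructed) ** 2, axis=1))
print(mse, vals[2])
```

Plotting `reconstructed` on top of `points` gives a like-for-like comparison, since both are expressed in the same x, y, z axes.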

**Q1 c (4 marks)**

Research PCA analysis using the *scikit-learn* library. Perform PCA analysis and show the reconstructed data of any blood vessel shape on top of the
original blood vessel shape. There are variables in the PCA object that correspond to the eigenvalues used for choosing projection eigenvectors. Compare the eigenvalues  and eigenvectors you have computed in the previous question with the eigenvalues  and the eigenvectors computed by the *scikit-learn* library. Compare the reconstructed coordinates from both methods. Comment on your results.
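
As a starting point for the comparison, the sketch below runs `sklearn.decomposition.PCA` next to a manual eigendecomposition on synthetic data (the array `X` is a stand-in, not the coursework data). In scikit-learn the eigenvalues are exposed as `explained_variance_` and the eigenvectors as the rows of `components_`. Note that scikit-learn normalises the covariance by $n - 1$, so a manual estimate that divides by $n$ gives eigenvalues that differ by the factor $(n - 1)/n$.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Hypothetical stand-in data: 200 observations (rows) of 3 features.
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])

# Manual route; dividing by (n - 1) matches scikit-learn's normalisation.
X_mean = X.mean(axis=0)
Xc = X - X_mean
cov = Xc.T @ Xc / (X.shape[0] - 1)
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# scikit-learn route.
pca = PCA(n_components=2)
Z = pca.fit_transform(X)
X_rec = pca.inverse_transform(Z)

# Eigenvalues should agree directly.
same_vals = np.allclose(pca.explained_variance_, vals[:2])
# Eigenvectors are only defined up to sign, so compare absolute dot products.
dots = np.abs(np.sum(pca.components_ * vecs[:, :2].T, axis=1))
same_vecs = np.allclose(dots, 1.0)

# Manual reconstruction from the same two components.
W = vecs[:, :2]
X_rec_manual = Xc @ W @ W.T + X_mean
print(same_vals, same_vecs, np.allclose(X_rec, X_rec_manual))
```

A sign flip in an eigenvector does not change the reconstruction, because the projection and the back-projection both use the same vector.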


## Question 2 - Predict Cancer Mortality Rates in US Counties

The provided dataset comprises data collected from multiple counties in the US. The regression task for this assessment is to predict cancer mortality rates in "unseen" US counties, given some training data. The training data ('Training_data.csv') comprises various features/predictors related to socio-economic characteristics, amongst other types of information, for specific counties in the country. The corresponding target variables for the training set are provided in a separate CSV file ('Training_data_targets.csv'). Use the notebooks provided for lab sessions throughout this module to provide solutions to the exercises listed below. For every exercise, your submitted solution should include text that describes your code and answers any questions in the exercise description.


The list of predictors/features available in this data set are described below:

**Data Dictionary**

avgAnnCount: Mean number of reported cases of cancer diagnosed annually

avgDeathsPerYear: Mean number of reported mortalities due to cancer

incidenceRate: Mean per capita (100,000) cancer diagnoses

medianIncome: Median income per county

popEst2015: Population of county

povertyPercent: Percent of populace in poverty

MedianAge: Median age of county residents

MedianAgeMale: Median age of male county residents

MedianAgeFemale: Median age of female county residents

AvgHouseholdSize: Mean household size of county

PercentMarried: Percent of county residents who are married

PctNoHS18_24: Percent of county residents ages 18-24 highest education attained: less than high school

PctHS18_24: Percent of county residents ages 18-24 highest education attained: high school diploma

PctSomeCol18_24: Percent of county residents ages 18-24 highest education attained: some college

PctBachDeg18_24: Percent of county residents ages 18-24 highest education attained: bachelor's degree

PctHS25_Over: Percent of county residents ages 25 and over highest education attained: high school diploma

PctBachDeg25_Over: Percent of county residents ages 25 and over highest education attained: bachelor's degree

PctEmployed16_Over: Percent of county residents ages 16 and over employed

PctUnemployed16_Over: Percent of county residents ages 16 and over unemployed

PctPrivateCoverage: Percent of county residents with private health coverage

PctPrivateCoverageAlone: Percent of county residents with private health coverage alone (no public assistance)

PctEmpPrivCoverage: Percent of county residents with employee-provided private health coverage

PctPublicCoverage: Percent of county residents with government-provided health coverage

PctPubliceCoverageAlone: Percent of county residents with government-provided health coverage alone

PctWhite: Percent of county residents who identify as White

PctBlack: Percent of county residents who identify as Black

PctAsian: Percent of county residents who identify as Asian

PctOtherRace: Percent of county residents who identify in a category which is not White, Black, or Asian

PctMarriedHouseholds: Percent of married households

BirthRate: Number of live births relative to number of women in county

In [None]:
import os
import pandas as pd

## Define paths to the training data and targets files
root_dir = './'
training_data_path = root_dir + 'Training_data.csv'
training_targets_path = root_dir + 'Training_data_targets.csv'

**Q2 a**

Read in the training data and targets files. The training data comprises features/predictors while the targets file comprises the targets (i.e. cancer mortality rates in US counties) you need to train models to predict. Plot histograms of all features to visualise their distributions and identify outliers. Do you notice any unusual values for any of the features? If so, comment on these in the text accompanying your code. Compute correlations of all features with the target variable (across the data set) and sort them according to the strength of the correlations. Which are the top five features with the strongest correlations to the targets? Plot these correlations using the scatter matrix plotting function available in pandas and comment on at least two sets of features that show visible correlations to each other.

**(4 marks)**
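
The correlation ranking can be sketched as follows. The frame below is synthetic: the three feature names are taken from the data dictionary, but their values, the target series and its name `target` are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200

# Hypothetical stand-in features; the real ones come from Training_data.csv.
features = pd.DataFrame({
    "incidenceRate": rng.normal(450, 50, n),
    "povertyPercent": rng.normal(17, 6, n),
    "medianIncome": rng.normal(47000, 12000, n),
})
# Hypothetical target with a synthetic dependence on two of the features.
target = pd.Series(
    0.2 * features["incidenceRate"]
    + 1.5 * features["povertyPercent"]
    + rng.normal(0, 10, n),
    name="target",
)

# Correlation of every feature with the target, ranked by absolute strength.
correlations = features.corrwith(target)
ranked_index = correlations.abs().sort_values(ascending=False).index
ranked = correlations.reindex(ranked_index)
top_features = ranked.index[:2].tolist()
print(ranked)
```

Ranking by absolute value keeps strong negative correlations near the top. The selected columns can then be passed to `pandas.plotting.scatter_matrix` together with the target.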

**Q2 b**

Create an ML pipeline using scikit-learn (as demonstrated in the lab notebooks) to pre-process the training data.

**(3 marks)**
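
A minimal pre-processing pipeline might look like the sketch below. The choice of steps (median imputation followed by standardisation) and the small frame are assumptions for illustration; the lab notebooks may use different steps.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in frame with one missing value.
df = pd.DataFrame({
    "medianIncome": [40000.0, 52000.0, np.nan, 61000.0],
    "povertyPercent": [20.0, 15.0, 18.0, 9.0],
})

preprocess = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill gaps with column medians
    ("scaler", StandardScaler()),                   # zero mean, unit variance
])

prepared = preprocess.fit_transform(df)
print(prepared)
```

Fitting the pipeline on the training split only, and then calling `transform` on the test split, avoids leaking test statistics into the pre-processing.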

**Q2 c**

Perform linear regression on the target data. Make a training/test split of 80/20. Train a linear regression model on the training set. Evaluate the root mean squared error on both the test and training sets. For both the training and the test data, plot the predicted value vs the actual target value. Discuss, on the basis of these results, whether you believe that the model overfits, underfits or fits correctly.

**(5 marks)**
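
The split, fit and evaluation steps can be sketched on synthetic data as follows. The arrays `X` and `y`, the `random_state` value and the helper `rmse` are illustrative assumptions, not part of the coursework files.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Hypothetical stand-in data: 500 rows, 3 features, noisy linear target.
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=500)

# 80/20 training/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


rmse_train = rmse(y_train, model.predict(X_train))
rmse_test = rmse(y_test, model.predict(X_test))
print(rmse_train, rmse_test)
```

A test error far above the training error would suggest overfitting, while similarly high errors on both sets would point towards underfitting.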

