# PHAY0076: Computational Pharmaceutics — Coursework 1 (2026)

## Tablet formulation analysis: data cleaning → EDA → statistics → unsupervised learning → supervised learning

**IMPORTANT**
- Work individually.
- Save your notebook with your **examination number** at the start of the filename (e.g., `12345678_PHAY0076_Coursework1.ipynb`).
- Export an **HTML** version: `File → Save and Export Notebook As… → HTML`.
- Upload the `.html` file to Moodle.


- Feel free to alter the notebook, for example by adding extra cells and breaking your code up into sections.

---

## Context

A pharmaceutical company is developing an **immediate‑release oral tablet** containing a poorly soluble API.  
You have been given a screening dataset of ~120 formulations with measured outcomes. Your job is to:

1. Clean and validate the dataset (quality checks)
2. Explore the data using plots and descriptive statistics
3. Use statistical tests to support (or refute) observed patterns
4. Use unsupervised learning to identify structure in formulation space
5. Train a simple predictive model and evaluate performance
6. Summarise findings and suggest next experimental directions

This is a simulation of a real task a computationally competent formulation scientist might be asked to do in an industrial or academic lab.

---

## Dataset

The data are provided as a .csv file:

`tablet_formulation_screen_2026.csv`

Place the file in the **same folder** as this notebook before running.

### Expected columns

**Inputs (formulation / process)**
- `API_pct`
- `Filler_pct`
- `Binder_pct`
- `Disintegrant_pct`
- `Lubricant_pct`
- `Polymer` (categorical)
- `Compression_kN`
- `Coating` (`Yes`/`No`)

**Outputs (measurements)**
- `Diss_30min`
- `Stability_6mo`
- `Bioavail_AUC`

### API group definition

As part of step 1, you will be asked to create a categorical variable `API_group` from `API_pct`:

- **Low:** `API_pct ≤ 15`
- **Medium:** `15 < API_pct ≤ 30`
- **High:** `API_pct > 30`
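
One possible way to implement these boundaries is `pd.cut`, sketched below on a handful of made-up `API_pct` values (the toy dataframe is purely illustrative and is not the coursework data). Other approaches, such as `np.select` or a small function with `apply`, would work equally well.

```python
import numpy as np
import pandas as pd

# Toy values chosen to sit on and around the group boundaries (not real data).
toy = pd.DataFrame({"API_pct": [5.0, 15.0, 15.1, 30.0, 30.5, 45.0]})

# right=True (the default) makes each bin closed on the right,
# so 15 falls in Low and 30 falls in Medium, matching the definition above.
toy["API_group"] = pd.cut(
    toy["API_pct"],
    bins=[-np.inf, 15, 30, np.inf],
    labels=["Low", "Medium", "High"],
)

print(toy)
```

It is worth checking the boundary values (15 and 30) explicitly, since an off-by-one choice of open or closed interval is easy to miss.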

---

## Marking (100%)

- Section 1: Data loading & cleaning — **15%**
- Section 2: Visualisation & EDA — **20%**
- Section 3: Statistical tests — **20%**
- Section 4: Unsupervised learning — **15%**
- Section 5: Supervised learning — **20%**
- Section 6: Summary & recommendations — **10%**

---

## Notes

- You can double-click on any cell to edit it, including the markdown cells.
- You are **not** expected to tune models extensively. You will not be scored on the accuracy of your models - just on implementation and interpretation.
- You may use any reasonable Python approach. Use of `pandas`, `scipy`, `seaborn`, and `scikit-learn` is recommended but not required.
- Where the coursework asks you to justify choices, a short explanation (2–4 sentences) is sufficient.

---


Insert your examination number here:

**Exam number:** 

In [2]:
# =========================
# Setup (some initial libraries - you will need to load more for downstream tasks but this can get you started)
# =========================

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

## For loading the file in colab - use the below command:
# data = pd.read_csv("https://raw.githubusercontent.com/shorthouse-lab/computational_pharmaceutics/refs/heads/main/Workshop5_coursework1/2026/tablet_formulation_screen_2026.csv")

# Section 1 — Data loading and cleaning (15%)

## Tasks:

1. Load the dataset from `tablet_formulation_screen_2026.csv` into a dataframe `df`.
2. Inspect the dataframe structure and contents:
   - `head()`, `shape`, `info()`, `describe()`
3. Create the new column `API_group` as defined:

- **Low:** `API_pct ≤ 15`
- **Medium:** `15 < API_pct ≤ 30`
- **High:** `API_pct > 30`

4. Perform basic data quality checks:
   - missing values (per column)
   - duplicate rows
   - impossible / suspicious values (e.g. negative percentages)
   - whether the composition columns approximately sum to 100% (optional but encouraged)
5. I recommend you create a cleaned dataframe `df_clean` for use in later analysis.
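
The checks in task 4 could look roughly like the sketch below. It runs on a small invented dataframe, not the coursework file; the 2-point tolerance on the composition total and the decision to drop (rather than impute) incomplete rows are assumptions made for illustration, and you should justify your own choices.

```python
import numpy as np
import pandas as pd

# Toy rows (not real data): row 2 duplicates row 1, row 3 has a negative
# percentage, and row 4 has a missing filler value.
toy = pd.DataFrame({
    "API_pct": [10.0, 25.0, 25.0, -5.0, 40.0],
    "Filler_pct": [80.0, 65.0, 65.0, 95.0, np.nan],
    "Binder_pct": [5.0, 5.0, 5.0, 5.0, 5.0],
    "Disintegrant_pct": [4.0, 4.0, 4.0, 4.0, 4.0],
    "Lubricant_pct": [1.0, 1.0, 1.0, 1.0, 1.0],
})
pct_cols = list(toy.columns)

# Missing values per column and number of fully duplicated rows.
missing_per_column = toy.isna().sum()
n_duplicates = int(toy.duplicated().sum())

# Rows containing at least one negative percentage.
has_negative = (toy[pct_cols] < 0).any(axis=1)

# skipna=False keeps the total as NaN when any component is missing.
total_pct = toy[pct_cols].sum(axis=1, skipna=False)
off_total = (total_pct - 100).abs() > 2  # tolerance is an arbitrary choice

# One possible cleaning rule: keep only complete, unique, plausible rows.
keep = ~toy.duplicated() & ~has_negative & toy.notna().all(axis=1) & ~off_total
toy_clean = toy[keep].reset_index(drop=True)

print(missing_per_column)
print("duplicates:", n_duplicates)
print(toy_clean)
```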

## Questions (write your answers in Markdown)

**Q1 (5%)** Are there missing values? If yes, how did you handle them and why?  
*(Hint: consider dataset size and whether missingness might bias results.)*

**Q2 (10%)** From summary statistics (mean, standard deviation, and range), make **two observations about the dataset** focusing on data properties (e.g., variability, spread, outliers, constraints).  

In [1]:
# Your code for Section 1 goes here

# 1) Load data into a dataframe

# 2) Inspect data

# 3) Create API_group

# 4) Data quality checks

# 5) Create df_clean



You can write your answers to the questions in a markdown cell like this

# Section 2 — Visualisation and Exploratory Data Analysis (20%)

## Tasks

Using your cleaned dataframe (`df_clean`):

1. Plot distributions (histogram or KDE) for:
   - `Diss_30min`
   - `Stability_6mo`
   - `Bioavail_AUC`

2. Create scatter plots (include a fitted trend line if you wish):
   - `API_pct` vs `Diss_30min`
   - `Compression_kN` vs `Diss_30min`
   - `Binder_pct` vs `Stability_6mo`
   - `Diss_30min` vs `Bioavail_AUC`

3. Create box plots:
   - `Stability_6mo` grouped by `API_group`
   - `Diss_30min` grouped by `Polymer`

4. Create a correlation heatmap for numeric columns.

## Questions

**Q3 (10%)** What do the distributions suggest (e.g., skew, outliers, multimodality)? What does this imply for later statistical analysis?  
*(Example: whether parametric tests might be reasonable.)*

**Q4 (10%)** Identify **two relationships** between **inputs and outputs** supported by your plots (direction + brief description).  
*(Example structure: “As X increases, Y tends to increase/decrease, supported by plot Z”.)*


In [3]:
# Your code for Section 2 goes here



You can write your answers to the questions in a markdown cell like this

# Section 3 — Statistical tests (20%)

## Tasks

Using `df_clean`:

1. Perform a **one-way ANOVA** to test whether `Stability_6mo` differs across `API_group` (Low/Medium/High).
2. If your ANOVA is significant, perform a **post-hoc** comparison between groups (any reasonable approach is acceptable).
3. Calculate correlation coefficients and **p-values** for:
   - `Binder_pct` vs `Stability_6mo`
   - `API_pct` vs `Diss_30min`
   - `Diss_30min` vs `Bioavail_AUC`
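
The three tasks could be approached with `scipy.stats` roughly as follows. This is a sketch on synthetic data with invented group means; the post-hoc step shown here (Welch t-tests with a Bonferroni correction) is just one acceptable option, and Tukey's HSD would be another.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for df_clean (random numbers, not the coursework data).
rng = np.random.default_rng(1)
group_means = {"Low": 95.0, "Medium": 92.0, "High": 88.0}  # invented values
toy = pd.DataFrame({
    "API_group": np.repeat(list(group_means), 40),
    "Binder_pct": rng.uniform(2, 10, 120),
})
toy["Stability_6mo"] = (
    toy["API_group"].map(group_means)
    + 0.5 * toy["Binder_pct"]
    + rng.normal(0, 1.5, 120)
)

# 1) One-way ANOVA across the three API groups.
samples = [g["Stability_6mo"].to_numpy() for _, g in toy.groupby("API_group")]
f_stat, p_anova = stats.f_oneway(*samples)

# 2) Post-hoc: pairwise Welch t-tests with a Bonferroni correction.
pairs = list(combinations(group_means, 2))
posthoc = {}
for a, b in pairs:
    xa = toy.loc[toy["API_group"] == a, "Stability_6mo"]
    xb = toy.loc[toy["API_group"] == b, "Stability_6mo"]
    _, p_raw = stats.ttest_ind(xa, xb, equal_var=False)
    posthoc[(a, b)] = min(p_raw * len(pairs), 1.0)

# 3) Correlation coefficients with p-values (Pearson and Spearman).
r_pearson, p_pearson = stats.pearsonr(toy["Binder_pct"], toy["Stability_6mo"])
rho_spearman, p_spearman = stats.spearmanr(toy["Binder_pct"], toy["Stability_6mo"])

print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3g}")
print(posthoc)
print(f"Pearson r = {r_pearson:.2f} (p = {p_pearson:.3g})")
print(f"Spearman rho = {rho_spearman:.2f} (p = {p_spearman:.3g})")
```

Pearson assumes an approximately linear relationship, whereas Spearman only assumes a monotonic one, so your Section 2 plots are a reasonable basis for choosing between them.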

## Questions

**Q5 (10%)** Is there evidence that stability differs between API groups? Report the test statistic and p-value and interpret in plain language.

**Q6 (10%)** Which correlations are statistically significant? Briefly comment on which are likely to be practically meaningful.  
Also state whether you used Pearson or Spearman correlation for each test and why.


In [None]:
# Your code for Section 3 goes here



You can write your answers to the questions in a markdown cell like this

# Section 4 — Unsupervised learning (15%)

## Tasks

Using `df_clean`:

1. Build an input feature table `X` using:
   - Numeric inputs: `API_pct`, `Filler_pct`, `Binder_pct`, `Disintegrant_pct`, `Lubricant_pct`, `Compression_kN`
   - Categorical inputs: `Polymer`, `Coating`


2. Preprocess:
   - standardise numeric columns and remove non-numeric columns


3. Perform **PCA** and plot **PC1 vs PC2**:
   - colour points by `API_group`

4. Perform **KMeans clustering** (choose a reasonable k) and plot clusters on the PCA scatter plot.
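
A minimal sketch of the preprocessing, PCA, and KMeans steps with `scikit-learn` is shown below. It uses a synthetic dataframe with only a few of the input columns, and `k = 3` is an arbitrary choice made for illustration. Points are coloured by cluster here; colouring by `API_group` works the same way once the groups are mapped to colours.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df_clean (random numbers, not the coursework data).
rng = np.random.default_rng(2)
n = 90
toy = pd.DataFrame({
    "API_pct": rng.uniform(5, 45, n),
    "Binder_pct": rng.uniform(2, 10, n),
    "Compression_kN": rng.uniform(5, 25, n),
    "Polymer": rng.choice(["A", "B", "C"], n),  # invented labels
})

# Keep numeric columns only, then standardise to zero mean and unit variance.
numeric_cols = toy.select_dtypes(include="number").columns
X_scaled = StandardScaler().fit_transform(toy[numeric_cols])

# Project onto the first two principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

# Loadings: contribution of each original variable to each component.
loadings = pd.DataFrame(pca.components_.T, index=numeric_cols, columns=["PC1", "PC2"])

# Cluster in the standardised feature space (k = 3 is an assumption).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X_scaled)

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], c=clusters)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")

print(pca.explained_variance_ratio_)
print(loadings)
```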

## Questions

**Q7 (10%)** What does PCA show? What do PC1 and PC2 represent in terms of formulation/process variables?  
*(Hint: you can use PCA loadings to support your interpretation, or plot numeric values on the PCA to see the trends.)*

**Q8 (5%)** Do your clusters align with `API_group` or other variables (Polymer/Coating)? What might that suggest about formulation “families”?


In [4]:
# Your code for Section 4 goes here



You can write your answers to the questions in a markdown cell like this

# Section 5 — Supervised learning (20%)

Build a simple predictive model to predict dissolution (`Diss_30min`) from formulation inputs.

## Tasks

1. Define:
   - `X` = inputs (same as Section 4)
   - `y` = `Diss_30min`

2. Split data into training and test sets (e.g., 80/20).

3. Train two models:
   - Linear regression
   - Random forest regression (or another model of your choice)

4. Evaluate test performance using:
   - MAE
   - RMSE
   - A comparison of these scores against a baseline that simply predicts the mean of the dataset

5. Plot **predicted vs observed** for the test set.
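
The workflow above could be sketched as follows, again on synthetic data with an invented relationship between inputs and dissolution. The mean baseline here uses the training-set mean, which is one interpretation of "predicting the mean" that avoids using test data; the hyperparameters are untuned defaults.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_clean (random numbers, not the coursework data).
rng = np.random.default_rng(3)
n = 120
toy = pd.DataFrame({
    "API_pct": rng.uniform(5, 45, n),
    "Compression_kN": rng.uniform(5, 25, n),
    "Coating": rng.choice(["Yes", "No"], n),
})
toy["Diss_30min"] = (
    90 - 0.8 * toy["API_pct"] - 1.0 * toy["Compression_kN"] + rng.normal(0, 2, n)
)

# One-hot encode categorical inputs so both models receive numeric features.
X = pd.get_dummies(toy.drop(columns="Diss_30min"), drop_first=True, dtype=float)
y = toy["Diss_30min"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def report(y_true, y_pred):
    """Return MAE and RMSE for a set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return {"MAE": mae, "RMSE": rmse}


# Baseline: always predict the training-set mean.
baseline_pred = np.full(len(y_test), y_train.mean())
results = {"Mean baseline": report(y_test, baseline_pred)}

models = {
    "Linear regression": LinearRegression(),
    "Random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = report(y_test, model.predict(X_test))

print(pd.DataFrame(results).T.round(2))
```

A model is only informative if its MAE and RMSE are clearly lower than those of the mean baseline.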

## Questions

**Q9 (10%)** How well can dissolution be predicted from the available inputs? Does your model beat just predicting the mean of the dataset? Discuss limitations.

**Q10 (10%)** Which features appear important (if your model provides feature importance or coefficients)? Do they make formulation sense?


In [3]:
# Your code for Section 5 goes here


You can write your answers to the questions in a markdown cell like this



# Section 6 — Summary & recommendations (10%)

**Q11 (10%)** Write ~5 sentences summarising:

- key EDA findings
- key statistical results (include p-values where relevant)
- PCA/clustering findings
- model performance (include at least one metric)
- two practical recommendations for formulation development


---

## Academic integrity reminder

Use of AI tools is recommended for debugging/syntax, but make sure you understand any code you run before submitting. Your analysis and interpretations should be (at least mostly!) your own.  
