<img src="https://teaching.bowyer.ai/sdsai/resources/0/img/IMPERIAL_logo_RGB_Blue_2024.svg" alt="Imperial Logo" width="500"/><br /><br />

Manipulating and Visualising Data - Tutorial Exercises
==============
### SURG70098 - Surgical Data Science and AI
### Stuart Bowyer

# Exercise 2.1
## Computing Average CRP Values

You have been given the (fake) data of five patients' C-Reactive Protein (CRP) tests.  
Write a function that computes and returns the **average (mean) CRP value** for a given **patient of interest (POI)**.

Each patient has multiple CRP values recorded in the dataset, and all of the values are mixed together randomly.  
To identify which value belongs to which patient, you are also given an aligned list of patient names.

### Example

The first four entries of the CRP value list, `crp`, look like this:

```python
crp = ['0.0 mg/L', '0.1301 mg/L', '1.4855 mg/L', '0.6009 mg/L', ...]
```

And the first four entries of the patient list, `patients`, look like this:

```python
patients = ['Liam', 'Emma', 'Oliver', 'Oliver', ...]
```

This means that:
- The first value (`0.0 mg/L`) was from *Liam*  
- The second (`0.1301 mg/L`) was from *Emma*  
- The third (`1.4855 mg/L`) and fourth (`0.6009 mg/L`) were from *Oliver*
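
The alignment described above can be sketched with the built-in `zip`, using only the four example entries (the real lists are much longer). This is one possible way to pair the lists, not the only one:

```python
# The four example entries shown above
crp = ['0.0 mg/L', '0.1301 mg/L', '1.4855 mg/L', '0.6009 mg/L']
patients = ['Liam', 'Emma', 'Oliver', 'Oliver']

# zip() walks both lists in step, so each value stays with its patient
pairs = list(zip(patients, crp))
for name, value in pairs:
    print(name, '->', value)

# Keep only the values that belong to one patient
oliver_values = [value for name, value in pairs if name == 'Oliver']
print(oliver_values)
```

Note that the values are still strings at this point, so they cannot yet be averaged.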

### Additional Requirements

Your function should:

- Be called **`average_crp`**  
- Take **three inputs**:  
  `crp` (the full list of CRP values),  
  `patients` (the full list of patient names associated with the values),  
  and `poi` (the name of a single patient of interest)  
- **Return the average CRP value** that patient `poi` has in the dataset  

You do **not** need to use any additional Python modules to complete this exercise.  
You will learn how to do this more efficiently with external libraries (e.g., `pandas`) later.
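
Because each value is stored as a string such as `'0.6009 mg/L'`, it must be converted to a number before it can be averaged. The following is a minimal sketch using only built-in Python; the helper name `parse_crp` is hypothetical and not part of the exercise:

```python
def parse_crp(text):
    """Convert a string such as '1.4855 mg/L' to the float 1.4855."""
    # Drop the unit suffix, then let float() convert the remaining digits
    number_part = text.replace(' mg/L', '')
    return float(number_part)


# Average two of the example values by hand: sum divided by count
values = [parse_crp(v) for v in ['1.4855 mg/L', '0.6009 mg/L']]
mean_value = sum(values) / len(values)
print(mean_value)
```

The division fails with a `ZeroDivisionError` if the list is empty, so consider what your function should do when `poi` has no values in the dataset.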

### Notes
- This exercise helps you practise **writing and using functions** in Python.  
- It also reinforces how to **associate related data** across lists.  
- Once you are comfortable with these ideas, you could extend this to identify “elevated” CRP tests (i.e., those results more than 1 mg/L above the mean).


In [None]:
# This code loads the sample data for the tutorial
# YOU DO NOT NEED TO UNDERSTAND THIS YET #######################################
%pip install pandas
import pandas as pd
crp_data = pd.read_csv('https://teaching.bowyer.ai/sdsai/resources/1/data/dummy_crp_data.csv')
crp = crp_data['crp'].tolist()
patients = crp_data['patients'].tolist()
names = list(set(patients))
names.sort()
################################################################################

print(names)          # These are the names of patients in the dataset
print(crp[0:10])      # These are the CRP values
print(patients[0:10]) # These are the patients for which each CRP value was taken

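def average_crp(crp, patients, poi):
    # TODO: replace `pass` with your implementation.
    # Until this function returns a number, the test loop below raises a TypeError.
    pass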

# Test loop. This code:
#  * loops through all of the patient names
#  * calls your function with each name
#  * prints the result
for name in names:
    avg = average_crp(crp, patients, name)
    print(name, 'has an average CRP of', round(avg, 2), 'mg/L')

# Exercise 2.2
## Identifying Elevated CRP Values

Now that you’ve learned how to use **pandas**, let’s revisit the CRP dataset. This time, you’ll use a `DataFrame` instead of basic Python lists.

You will use pandas to:
1. Load and inspect the CRP dataset.
2. Compute the **average CRP** value for each patient.
3. Identify which patients have **elevated CRP** values.

The data is here: `https://teaching.bowyer.ai/sdsai/resources/1/data/dummy_crp_data.csv`

### Data Description
The dataset `dummy_crp_data.csv` contains two columns:
- `patients` — the name of the patient
- `crp` — the CRP test result, stored as a string (e.g., `"1.4855 mg/L"`)

### Tasks

#### Part A – Load and Clean the Data
1. Load the dataset into a pandas `DataFrame` called `df`.
2. Inspect the first few rows using `df.head()`.
3. Convert the `crp` column to a numeric type (remove the `' mg/L'` text and store as `float`).

*Hint:* You can use `.str.replace`.
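
As a hedged illustration of the hint, the sketch below cleans a small hand-made `DataFrame` (named `toy` here, built from the example values in Exercise 2.1) rather than the real dataset:

```python
import pandas as pd

# A tiny stand-in for the real data, using the same column names
toy = pd.DataFrame({
    'patients': ['Liam', 'Emma', 'Oliver', 'Oliver'],
    'crp': ['0.0 mg/L', '0.1301 mg/L', '1.4855 mg/L', '0.6009 mg/L'],
})

# regex=False treats ' mg/L' as literal text rather than a pattern
toy['crp'] = toy['crp'].str.replace(' mg/L', '', regex=False).astype(float)

print(toy.dtypes)  # the crp column should now be float64
print(toy.head())
```

Checking `dtypes` after the conversion is a quick way to confirm that the column is numeric before computing any statistics.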

#### Part B – Compute Averages
1. Use `groupby` to calculate the **mean CRP** value for each patient.
2. Store the result in a new `DataFrame` called `avg_crp`.
3. Print `avg_crp` so you can see each patient’s mean CRP value.
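
One possible shape for this step, shown on a small hand-made `DataFrame` (`toy`) with an already numeric `crp` column. Calling `reset_index()` is one way to get a `DataFrame` back, since a grouped mean of a single column is otherwise a `Series`:

```python
import pandas as pd

toy = pd.DataFrame({
    'patients': ['Liam', 'Emma', 'Oliver', 'Oliver'],
    'crp': [0.0, 0.1301, 1.4855, 0.6009],
})

# Group rows by patient, average the crp column within each group,
# then turn the patient names back into an ordinary column
avg_crp = toy.groupby('patients')['crp'].mean().reset_index()
print(avg_crp)
```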

#### Part C – Identify Elevated CRP Levels
In clinical contexts, CRP levels **greater than 5 mg/L** often indicate inflammation or infection.

1. Add a new column called `'elevated'` to your main `df`:
   - It should contain `True` if the CRP value is greater than 5.0, and `False` otherwise.
2. Count how many elevated CRP results each patient has.
3. Print a summary table with each patient’s:
   - average CRP
   - number of elevated results

*Hint:* You can combine `groupby` and `agg` for this:
```python
summary = df.groupby('patients').agg(
    average_crp=('crp', 'mean'),
    elevated_count=('elevated', 'sum')
).reset_index()
```
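
To see what the hint produces, here is a sketch that applies it to invented values (the patient labels `A` and `B` and their results are made up for illustration and do not come from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    'patients': ['A', 'A', 'B', 'B', 'B'],
    'crp': [2.0, 7.5, 6.0, 5.0, 9.0],
})

# A comparison on a column gives one True/False per row
toy['elevated'] = toy['crp'] > 5.0

# Summing booleans counts the True values (True counts as 1)
summary = toy.groupby('patients').agg(
    average_crp=('crp', 'mean'),
    elevated_count=('elevated', 'sum'),
).reset_index()
print(summary)
```

A result of exactly 5.0 is not flagged here, because the threshold is strictly greater than 5.0.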

In [None]:
import pandas as pd

# Load the dataset from the provided URL

# Inspect the first few rows

# Remove the ' mg/L' text and convert to float

# Check the cleaning

# Verify data types

# Group by patient and compute average CRP

# Print the results

# Add a column to identify elevated CRP values (>5.0 mg/L)

# Count elevated CRP values per patient

# Print the counts of elevated CRP values per patient

# Exercise 2.3
## Pulse Pressure Analysis
In this lecture, we looked at some string data that represent physiological measurements.

For this exercise, compute the mean, median and standard deviation of the pulse pressure values derived from the data in the `str_data.csv` file. Then plot the distribution of those values in whichever way you think is most appropriate.

Hints:
*   The data are here: `https://teaching.bowyer.ai/sdsai/resources/2/data/str_data.csv`
*   The blood pressure data are in the `bp` column
*   Both systolic and diastolic values are stored in the same column, so you will need to think about string manipulation to parse the numerical values out
*   We define pulse pressure as the difference between the systolic and diastolic blood pressure values
*   For the plot, we are interested in seeing the distribution of pulse pressure values across the whole population, i.e., which values are more/less common
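
The parsing step in the hints can be sketched as follows. This assumes, without having confirmed it against the file, that each `bp` entry looks like `'120/80'` (systolic, a slash, then diastolic); inspect the real column first and adapt the separator if it differs. The values below are invented:

```python
import pandas as pd

# Invented readings in the ASSUMED 'systolic/diastolic' format
toy = pd.DataFrame({'bp': ['120/80', '135/85', '110/70']})

# Split each string at the slash into two columns, then convert to numbers
parts = toy['bp'].str.split('/', expand=True).astype(float)
toy['systolic'] = parts[0]
toy['diastolic'] = parts[1]

# Pulse pressure is systolic minus diastolic
toy['pulse_pressure'] = toy['systolic'] - toy['diastolic']

print(toy['pulse_pressure'].mean())
print(toy['pulse_pressure'].median())
print(toy['pulse_pressure'].std())  # pandas uses the sample formula (ddof=1)
```

For the distribution, a histogram such as `toy['pulse_pressure'].plot.hist()` is one reasonable starting point, although other plot types may suit the data better.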

# Exercise 2.4
## Visualising Demographics Data
Previously in this lecture, we joined the demographics and recovery time datasets so that we could identify associations between them, e.g., whether one gender typically has longer recovery times.

For this exercise, you are required to visualise the associations between age, ethnicity, gender and post-op laboratory test values with recovery time. For each of the visualisations, you should carefully think about what conclusions one can draw from them.

### Part A
To start, produce appropriate visualisations (plots) for each of the following, to demonstrate any associations:
*   age vs recovery time
*   ethnicity vs recovery time
*   gender vs recovery time

### Part B
Having produced these visualisations, you will note that there are many ethnicity codes, and each has relatively few patients. This can limit the information you can take from a plot. In these cases, it is sometimes beneficial to aggregate groups of patients together.

It is possible to group these ethnicities to a higher level and thus increase the number of people per group.

Therefore, you now need to produce a visualisation of the association between grouped ethnicity and recovery time. Suggested groups [from the NHS](https://digital.nhs.uk/data-and-information/data-collections-and-data-sets/data-sets/mental-health-services-data-set/submit-data/data-quality-of-protected-characteristics-and-other-vulnerable-groups/ethnicity): `White`, `Mixed`, `Asian or Asian British`, `Black or Black British`, `Other Ethnic Groups`, `Not Stated/Known`
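
One way to do this grouping is with a lookup dictionary and `Series.map`. The sketch below assumes that the ethnicity column is called `ethnicity` and holds single-letter NHS codes; both are assumptions, and the code-to-group mapping should be checked against the linked NHS page and the codes actually present in the data. The rows are invented:

```python
import pandas as pd

# ASSUMED mapping from single-letter codes to the suggested groups
codes_by_group = {
    'White': 'ABC',
    'Mixed': 'DEFG',
    'Asian or Asian British': 'HJKL',
    'Black or Black British': 'MNP',
    'Other Ethnic Groups': 'RS',
    'Not Stated/Known': 'Z',
}
code_to_group = {
    code: group for group, codes in codes_by_group.items() for code in codes
}

toy = pd.DataFrame({
    'ethnicity': ['A', 'C', 'H', 'N', 'Z', '99'],
    'recovery_time': [5, 7, 6, 9, 4, 8],
})

# map() looks up each code; anything unmapped becomes NaN, which is
# then filled with the catch-all group
toy['ethnicity_group'] = (
    toy['ethnicity'].map(code_to_group).fillna('Not Stated/Known')
)
group_sizes = toy['ethnicity_group'].value_counts()
print(group_sizes)
```

Printing the group sizes before plotting helps confirm that the aggregation has actually increased the number of patients per group.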

### Part C
The laboratory tests dataset contains results for a range of lab tests taken after surgery, in a long format.

Now, you need to produce visualisations that show the associations between each of these test types (individually) and recovery time.
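
Long-format data has one row per patient per test, so it often helps to reshape it to one column per test before joining it to the recovery times. The sketch below uses invented data and hypothetical column names (`patient_id`, `test_name`, `value`, `recovery_time`); the real files may use different names, so inspect them first:

```python
import pandas as pd

# Invented long-format lab results: one row per patient per test
labs_long = pd.DataFrame({
    'patient_id': [1, 1, 2, 2],
    'test_name': ['crp', 'wbc', 'crp', 'wbc'],
    'value': [3.2, 7.1, 8.4, 11.0],
})
recovery = pd.DataFrame({'patient_id': [1, 2], 'recovery_time': [4, 9]})

# Reshape so that each test type becomes its own column
labs_wide = labs_long.pivot(
    index='patient_id', columns='test_name', values='value'
).reset_index()

# Join to the recovery times on the shared patient identifier
merged = labs_wide.merge(recovery, on='patient_id')
print(merged)
```

`pivot` raises an error if a patient has more than one result for the same test. In that case `pivot_table` with an aggregation function is an alternative.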

### Part D
Finally, you should be aware that an association between two observations does not necessarily illustrate causality. In these cases, confounding variables might be influencing the association.

To explore confounding, you can stratify your visualisations by a suspected confounding variable. Do some independent research on stratified plots and produce one for each of the lab test vs recovery time visualisations from Part C.
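
Stratifying means splitting the data by the suspected confounder and examining the association within each subset. The numeric sketch below uses invented data and hypothetical column names (`gender`, `value`, `recovery_time`), and computes a correlation per stratum as a companion to a plot. In a stratified plot, each subset would get its own panel or colour instead:

```python
import pandas as pd

# Invented data in which the overall trend differs from the per-group trend
toy = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M', 'M'],
    'value': [1, 2, 3, 4, 5, 6],
    'recovery_time': [3, 2, 1, 6, 5, 4],
})

# Association across everyone, ignoring the suspected confounder
overall = toy['value'].corr(toy['recovery_time'])

# Association within each stratum of the suspected confounder
per_stratum = {
    name: subset['value'].corr(subset['recovery_time'])
    for name, subset in toy.groupby('gender')
}

print('overall:', overall)
print('per stratum:', per_stratum)
```

In this invented example the overall correlation is positive while the correlation within each group is negative, which is the kind of pattern a stratified plot is designed to reveal.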

### Hints
*   The data are here:
    *   `https://teaching.bowyer.ai/sdsai/resources/2/data/demographics.csv`
    *   `https://teaching.bowyer.ai/sdsai/resources/2/data/simple_recovery.csv`
    *   `https://teaching.bowyer.ai/sdsai/resources/2/data/laboratory_tests.csv`
*   These data are all fake/semi-randomly generated, so for now, please do not think too hard about the real clinical meaning of each bit
*   We have not looked at the `laboratory_tests` data in this lecture, so you will probably want to inspect it to understand its structure before going too far