# Data Visualisation Notebook

As we learnt in the data science and AI lecture, visualisation is a very important part of analysing data. In this notebook we will learn how to visualise data, mostly using seaborn.
We already came across seaborn in the previous semester. Here we will use an osteoradionecrosis (ORN) dataset to draw graphs that help us understand the data and the relationships between factors.

The url for the file is:

https://raw.githubusercontent.com/rrr-uom-projects/MPiCRT-AI/refs/heads/main/Data/ORN_PMCID_PMC8906058.csv 

And the data we will find inside is:

Outcomes:
- ORN_binary: any grade of osteoradionecrosis
- ORN Grade: grade of the osteoradionecrosis

Patient data:
- Age: age in years at treatment
- Gender: sex of the patient
- Subsite: tumour location, oral cavity vs. oropharynx vs. hypopharynx/unknown-primary/larynx/nasopharynx
- T_stage: tumour stage, related to tumour size
- N_stage: nodal stage, related to nodal involvement
- Dental_extraction: 0 no, 1 yes
- Smoking_status: current, former, or never smoker
- pack_years: smoking exposure in pack-years (packs of cigarettes smoked per day multiplied by years of smoking)
- chemotherapy: chemotherapy info, in particular timing 
- postop_RT: definitive vs. postoperative radiotherapy

DVH (dose-volume histogram) parameters from the mandible:
- mean, minGy, maxGy: mean, minimum and maximum dose within the mandible bone, respectively
- Vx: DVH metric, the volume within the mandible receiving at least x Gy, where x goes from 5 to 70 every 5 Gy
- Dy: DVH metric, the highest dose that y% of the mandible received, where y is 2, 5 to 95 every 5, 97, 98, 99 
- Volume: volume of the mandible bone



What do we need to import to load the data and visualise it?

In [None]:
# add code here!

Let's load the dataframe into a variable named **orn**.

In [None]:
# add code here!

Let's set the categorical variables as categorical.  These include:
- ORN_binary
- ORN Grade
- Gender
- Subsite (check for consistency)
- T_stage
- N_stage
- Dental_extraction
- Smoking_status
- chemotherapy (check for consistency)
- postop_RT

In [None]:
# add code here!
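One possible way to approach this step is sketched below. It runs on a small made-up frame named `toy` (a hypothetical stand-in, not values from the real file), so the labels and the inconsistency shown are assumptions for illustration only. The idea is to inspect the raw labels, tidy any inconsistent spellings, and then convert the columns to the `category` dtype.

```python
import pandas as pd

# Made-up stand-in rows; NOT values from the real ORN file.
toy = pd.DataFrame({
    "ORN_binary": [0, 1, 0, 1],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Subsite": ["Oral cavity", "oral cavity ", "Oropharynx", "Oropharynx"],
    "Age": [61, 55, 70, 48],
})

# Inspect the raw labels first: inconsistent spellings show up as extra levels.
print(toy["Subsite"].value_counts(dropna=False))

# Tidy free-text labels (trim whitespace, lower-case) before converting.
toy["Subsite"] = toy["Subsite"].str.strip().str.lower()

# Convert the chosen columns to the category dtype; Age stays numeric.
cat_cols = ["ORN_binary", "Gender", "Subsite"]
toy = toy.astype({col: "category" for col in cat_cols})

print(toy.dtypes)
```

On the real data you would replace `toy` with `orn` and extend `cat_cols` with the columns listed above; whether any cleaning is needed depends on what `value_counts()` shows.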

Let's start exploring the variables. What is the distribution of ages? Does it differ between females and males?

In [None]:
# add code here!

Do you see any pattern between age and T_stage?

In [None]:
# add code here!

Do females and males have different dental extraction counts?

In [None]:
# add code here!
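For two categorical variables, a count plot with `hue` is one reasonable option, and a cross-tabulation gives the same numbers as a table. The sketch below uses a made-up `toy` frame; the `Gender` labels are an assumption, while the 0/1 coding of `Dental_extraction` follows the description above.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in rows; NOT values from the real ORN file.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female",
               "Male", "Male", "Male", "Male", "Male"],
    "Dental_extraction": [1, 0, 1, 0, 0, 1, 0, 1],  # 0 = no, 1 = yes
})

# Table of counts: one row per gender, one column per extraction status.
counts = pd.crosstab(toy["Gender"], toy["Dental_extraction"])
print(counts)

# Same information as grouped bars.
ax = sns.countplot(data=toy, x="Gender", hue="Dental_extraction")
ax.set_ylabel("Number of patients")
```

The same pattern works for smoking status by swapping the `x` column, assuming the column names match those in the file.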

What about smoking status and dental extraction counts?

In [None]:
# add code here!

Do you see any pattern between age and mandible volume?

In [None]:
# add code here!
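For two continuous variables, a scatter plot is a natural first look, and a correlation coefficient gives a rough numeric summary. The sketch below uses a made-up `toy` frame; the numbers are invented and carry no clinical meaning.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in rows; NOT values from the real ORN file.
toy = pd.DataFrame({
    "Age": [48, 52, 55, 59, 61, 63, 66, 70],
    "Volume": [62.0, 58.5, 66.1, 55.0, 60.3, 52.8, 57.4, 50.9],
})

# One point per patient.
ax = sns.scatterplot(data=toy, x="Age", y="Volume")
ax.set_title("Mandible volume against age")

# Pearson correlation (the pandas default) as a rough summary of linear trend.
r = toy["Age"].corr(toy["Volume"])
print(f"Pearson r = {r:.2f}")
```

A correlation close to zero does not rule out a non-linear pattern, so it is worth looking at the plot before trusting the number.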

What about mandible volume and dental extraction? 

In [None]:
# add code here!

## Outcome (ORN_binary and ORN Grade) and other variables

Let's see whether we can discover patterns in the data when looking at the outcome (both ORN binary and grade).

For example: 
Is there any pattern between dental extraction and ORN? 

In [None]:
# add code here!

What about mandible volume? Is it different for females and males?

In [None]:
# add code here!

Let's explore the doses. Is the mean dose higher for those who experienced ORN? Is it different if they had a dental extraction?

In [None]:
# add code here!
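One way to look at a continuous variable against two categorical ones is a box plot with `hue`, backed by a grouped summary. The sketch below uses a made-up `toy` frame; the dose values are invented, and the 0/1 coding of `ORN_binary` is an assumption.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in rows; NOT values from the real ORN file.
toy = pd.DataFrame({
    "ORN_binary": [0, 0, 0, 0, 1, 1, 1, 1],
    "Dental_extraction": [0, 1, 0, 1, 0, 1, 0, 1],
    "mean": [30.0, 34.0, 32.0, 36.0, 44.0, 48.0, 46.0, 50.0],
})

# Average of the mean mandible dose in each outcome/extraction group.
summary = toy.groupby(["ORN_binary", "Dental_extraction"])["mean"].mean()
print(summary)

# Box plot: outcome on the x axis, one box per extraction status.
ax = sns.boxplot(data=toy, x="ORN_binary", y="mean", hue="Dental_extraction")
ax.set_ylabel("Mean mandible dose (Gy)")
```

If the grouping columns have already been converted to `category`, `groupby` may emit a warning about the `observed` argument depending on the pandas version; passing `observed=True` is the usual way to silence it.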

What other patterns can you see?

In [None]:
# add code here!

In [None]:
# add more code here!

In [None]:
# and even more code here!

## Freebie

Here we have many continuous variables that we would like to check for patterns with respect to the outcome. For this, pairplot() is a good tool.

Let's first choose all the dosimetric columns:
- mean, minGy, maxGy: mean, minimum and maximum dose within the mandible bone, respectively
- Vx: DVH metric, the volume within the mandible receiving at least x Gy, where x goes from 5 to 70 every 5 Gy
- Dy: DVH metric, the highest dose that y% of the mandible received, where y is 2, 5 to 95 every 5, 97, 98, 99

Then we plot their pairwise relationships, stratifying by the outcome.

This section assumes you have already run the following code:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data directly from GitHub.
dataurl = 'https://raw.githubusercontent.com/rrr-uom-projects/MPiCRT-AI/refs/heads/main/Data/ORN_PMCID_PMC8906058.csv'
orn = pd.read_csv(dataurl, sep=',')
# Treat both outcome columns as categorical so seaborn colours them as groups.
orn['ORN_binary'] = orn['ORN_binary'].astype('category')
orn['ORN Grade'] = orn['ORN Grade'].astype('category')
```

In [None]:
# Summary dose metrics, plus the outcome column used for colouring
scols = ['ORN_binary', 'mean', 'minGy', 'maxGy']
sdoses = orn.loc[:, scols]
sns.pairplot(sdoses, hue='ORN_binary')

In [None]:
# A subset of the Dy metrics, plus the outcome column used for colouring
dcols = ['ORN_binary', 'D2', 'D10', 'D20', 'D30', 'D40', 'D50', 'D60', 'D70', 'D80', 'D90', 'D98']
ddoses = orn.loc[:, dcols]
sns.pairplot(ddoses, hue='ORN_binary')

In [None]:
# A subset of the Vx metrics, plus the outcome column used for colouring
vcols = ['ORN_binary', 'V5', 'V10', 'V20', 'V30', 'V40', 'V50', 'V60', 'V70']
vdoses = orn.loc[:, vcols]
sns.pairplot(vdoses, hue='ORN_binary')

Here we have stratified by ORN status (any grade). You can change this to any other categorical variable to quickly assess whether there are patterns within your data.  