# Clinical data for the project

In the file ClinicalData.csv in the data folder you will find a subset of clinical variables for the patient population we are exploring. All of these variables are stored numerically, even though some of them are categorical.

In this notebook, you will be able to load the clinical variables from a csv file, explore the meaning of each of them, as well as visualize them. 

At the end of the notebook, you are requested to select the variable you want to keep for the next part of the project (CNNs for classification/regression) and store it as a new csv.  

The first step is to initialize the notebook:

In [None]:
%%javascript
// this cell stops the notebook from putting output in scrolling frames
IPython.OutputArea.prototype._should_scroll = function(lines){return false;}

In [None]:
%matplotlib notebook
# alternative backend: %matplotlib inline

## Reading and accessing data using pandas

We will use pandas for data IO and simple visualization.  Some info for reference: 
- [API reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) 
- [User guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html).

Let's start by importing the libraries and setting up the basefolder:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

basefolder="T:/Eliana/RADACol2020/MoveToVM/"

Let's read the file using the read_csv() function and then show the loaded data. The result of calling read_csv() is a [dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html).

Note that by setting index_col=0 in read_csv(), we request that the values stored in the first column (xID) be used as the row labels. Remember that Python is 0-based, meaning that the index of the first column or row is 0.
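
As a minimal sketch of what index_col=0 does, the example below reads a tiny in-memory CSV. The three rows are made up for illustration and are not taken from ClinicalData.csv; only the column names xID, status and follow_up come from this notebook.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the real file has more rows and columns.
csv_text = "xID,status,follow_up\n1,0,30\n2,1,12\n3,0,8\n"

# index_col=0 turns the first column (xID) into the row labels.
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(toy.index.tolist())    # row labels taken from the xID column
print(toy.columns.tolist())  # xID is no longer listed as a regular column
```

Without index_col=0, pandas would instead create its own 0-based row labels and keep xID as an ordinary column.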

In [None]:
# first column, xID, used as indices of the dataframe:
clinicalfile = basefolder + "ClinicalData.csv"
cdata = pd.read_csv(clinicalfile, index_col=0)
cdata

In [None]:
cdata.info()

### Accessing data in the pandas dataframe

To access a particular value in the dataframe, you can use either of two indexers (note the square brackets, they are not function calls):
- loc[]: takes the row's label and the column's name
- iloc[]: takes the row's and the column's integer positions
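
The difference between label-based and position-based access is easiest to see on a small table. The sketch below uses a made-up dataframe (the values are hypothetical) whose row labels start at 1, as the xID labels do here.

```python
import pandas as pd

# Hypothetical data; row labels start at 1, positions start at 0.
toy = pd.DataFrame(
    {"status": [0, 1, 0], "follow_up": [30, 12, 8]},
    index=pd.Index([1, 2, 3], name="xID"),
)

by_label = toy.loc[1, "follow_up"]  # row labelled 1, column named follow_up
by_position = toy.iloc[0, 1]        # first row, second column
print(by_label, by_position)        # both point at the same cell

# Slices behave differently: loc includes the end label,
# while iloc excludes the end position.
print(len(toy.loc[1:2]))   # rows labelled 1 and 2
print(len(toy.iloc[1:2]))  # only the row at position 1
```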

In [None]:
print( cdata.loc[1,"follow_up"] ) # xID starts at 1, so the first row label is 1 rather than 0.
print( cdata.iloc[0,1] )

You can also access the values of a whole column or a whole row:

In [None]:
cdata["follow_up"]

In [None]:
cdata.loc[1]

You can also select a subset of values using array slicing:

In [None]:
cdata.iloc[100:110,0:3]

In [None]:
# print the position (column number) of each column
for counter, coli in enumerate(cdata.columns):
    print('column ' + str(counter) + ' --> ' + coli)


# Clinical variables available

## Survival data: status, follow_up

These two variables work together to determine whether a patient was dead or alive at a given time (e.g. 12 months).
- **status**: indicates whether the patient was alive (0) or dead (1) at the time of the last follow-up appointment.
- **follow_up**: the time, in months, at which the patient was last seen (if status is 0) or at which the patient died (if status is 1).


- **Location in dataframe:** columns 0 and 1
- **Missing value flag:** -1 (none of the values are missing for these two variables)
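
The -1 flag is used for every variable described in this notebook, so it can help to handle it in one place. The sketch below, on made-up values, counts the flags per column and then replaces them with NaN so that pandas ignores them in summary statistics. MISSING_FLAG is a name introduced here for illustration only.

```python
import numpy as np
import pandas as pd

MISSING_FLAG = -1  # hypothetical constant name; the value comes from this notebook

# Hypothetical values for two of the columns described in this notebook.
toy = pd.DataFrame({"age": [63, -1, 71, 58], "tumour_size": [12.5, 40.0, -1, -1]})

# Count the flagged entries in every column at once.
missing_per_column = (toy == MISSING_FLAG).sum()
print(missing_per_column)

# Replace the flag with NaN; mean(), boxplots, etc. then skip those entries.
toy_nan = toy.replace(MISSING_FLAG, np.nan)
print(toy_nan["age"].mean())
```

Note that replacing -1 by NaN is only safe for columns where -1 means "missing"; the status_at_[t] column created later in this notebook uses -1 for "unknown", which you may want to keep as a separate class.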

In [None]:
print( 'status, in column 0, has ' + str((cdata.iloc[:,0] == -1).sum()) + ' missing values.' )
print( 'follow_up, in column 1, has ' + str((cdata.iloc[:,1] == -1).sum()) + ' missing values.' )

cdata.boxplot(column="follow_up", by='status')
plt.ylabel('Follow up (months)')

### Binarizing survival data
Special care should be taken when classifying patients as 'dead' or 'alive' at a given time 't'.  The key point is: it is impossible to know whether a patient is dead or alive at t when t is larger than their follow_up (t > follow_up) and they were alive at the last follow-up appointment (status == 0).  The way to deal with these 'unknown' statuses in statistical analysis is via [censoring](https://en.wikipedia.org/wiki/Censoring_(statistics)).

Therefore, whenever you binarize survival data at a given time 't', you will have up to three resulting statuses:
- Alive
- Dead
- Unknown
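
As a compact illustration of these three outcomes, the sketch below applies the rules to four made-up patients. The helper status_at is hypothetical (it is not part of the project code) and uses numpy.select, which is one of several possible ways to write it.

```python
import numpy as np
import pandas as pd

def status_at(df, t):
    """Return 0 (alive), 1 (dead) or -1 (unknown) at time t for each row."""
    # Dead at t: the patient died at or before t.
    dead = (df["status"] == 1) & (df["follow_up"] <= t)
    # Alive at t: seen or died after t, or seen alive exactly at t.
    alive = (df["follow_up"] > t) | ((df["follow_up"] == t) & (df["status"] == 0))
    # Everything else (last seen alive before t) is unknown.
    return pd.Series(np.select([dead, alive], [1, 0], default=-1), index=df.index)

# Hypothetical patients: status and follow_up in months.
toy = pd.DataFrame({"status": [0, 1, 0, 1], "follow_up": [30, 12, 8, 40]})
result = status_at(toy, t=24)
print(result.tolist())
```

The third patient was last seen alive at 8 months, so nothing is known about month 24. The fourth patient died at 40 months and was therefore still alive at 24.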

Here we show you how to do binarization for a given time t. We will be storing the values in a new column named "status_at_[t]". The values in the column are: 0 for alive, 1 for dead, -1 for unknown.

In [None]:
t = 24 # months
newcolname = 'status_at_' + str(t)

cdata[newcolname] = -1 # let's assume that we don't know the status of anyone.

# if follow_up == t, the status at t is the same as the value in status.
at_t = cdata['follow_up'] == t
cdata.loc[at_t, newcolname] = cdata.loc[at_t, 'status']
print(cdata[newcolname].value_counts().sort_index())

# if follow_up > t, the patient was alive at t.
cdata.loc[cdata['follow_up'] > t, newcolname] = 0
print(cdata[newcolname].value_counts().sort_index())

# if follow_up < t and status == 1, the patient was already dead at t.
cdata.loc[(cdata['follow_up'] < t) & (cdata['status'] == 1), newcolname] = 1
print(cdata[newcolname].value_counts().sort_index())

# if follow_up < t and status == 0, the status at t is unknown.
# This step can be skipped, as the new column was initialized with -1.
cdata.loc[(cdata['follow_up'] < t) & (cdata['status'] == 0), newcolname] = -1
print(cdata[newcolname].value_counts().sort_index())

# we can now plot the proportions of patients dead/alive/unknown using a pie!
plt.figure()
cdata[newcolname].value_counts().sort_index().plot.pie()

## Tumour size
It is the volume of the visible tumour, as segmented for treatment planning (in cubic cm). It includes the primary tumour and visibly affected lymph nodes.

- **Location in dataframe:** column 2
- **Missing value flag:** -1

In [None]:
print( 'Tumour_size, in column 2, has ' + str((cdata.iloc[:,2] == -1).sum()) + ' missing values.' )

cdata.plot.box(y='tumour_size') # alternatively cdata.plot.box(y=2)
plt.ylabel('Tumour volume (cm3)')

## Age
It is the age of the patient at the moment that treatment started (in years).

- **Location in dataframe:** column 3
- **Missing value flag:** -1

In [None]:
print( 'Age, in column 3, has ' + str((cdata.iloc[:,3] == -1).sum()) + ' missing values.' )

cdata.plot(y='age',style='k.')
axis = plt.axis()
print(axis)
plt.plot( axis[0:2], [0,0], 'r-' )
plt.ylabel('Age (y)')

# values under the red line (age == -1) are missing
# ageNoMissingData = cdata.loc[cdata['age'] != -1, 'age']


## Gender
Biological sex of the patient.  Categorical variable stored as numeric: 1 is female, 2 is male

- **Location in dataframe:** column 4
- **Missing value flag:** -1
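
Because the categories are stored as numbers, plots show the codes rather than their meaning. One possible way to get readable labels is to map the codes before counting, as sketched below on made-up values. The dictionary gender_labels is introduced here for illustration, with the coding taken from the description above.

```python
import pandas as pd

# Coding from this notebook: 1 is female, 2 is male, -1 is missing.
gender_labels = {1: "female", 2: "male", -1: "missing"}

# Hypothetical values standing in for cdata["gender"].
toy = pd.Series([1, 2, 2, -1, 1, 2], name="gender")

labelled = toy.map(gender_labels)  # replace each code by its label
counts = labelled.value_counts()
print(counts)
```

Calling counts.plot.pie() on the result would then label the slices with the category names instead of the numeric codes.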

In [None]:
print( 'Gender, in column 4, has ' + str((cdata.iloc[:,4] == -1).sum()) + ' missing values.' )

#value is stored as numerical type
plt.figure()
cdata["gender"].value_counts().plot.pie(figsize=[3,3])

## Performance status

It is a score that assesses the fitness of the patient before treatment starts.  The scale used is the [ECOG/WHO system](https://ecog-acrin.org/resources/ecog-performance-status), which ranges from 0 to 5:
- 0: fully active,
- 1: unable to do strenuous activities (heavy work), but otherwise ok,
- 2: able to walk and manage self-care but not able to work,
- 3: confined to bed/chair for more than 50% of waking hours,
- 4: totally confined to bed/chair,
- 5: dead.

It has a degree of subjectivity.

- **Location in dataframe:** column 5
- **Missing value flag:** -1

In [None]:
print( 'performance_status, in column 5, has ' + str((cdata.iloc[:,5] == -1).sum()) + ' missing values.' )

#value is stored as numerical type
plt.figure()
cdata["performance_status"].value_counts().sort_index().plot.bar()
plt.ylabel('Count')

## TNM stage
Classification which describes the extent of spread of cancer.
- **T** describes the size of the primary tumour and whether it has invaded nearby tissue.  Values are X (stored as 0), 1 to 4.
- **N** describes nearby lymph nodes that are involved. Values are X (stored as 0), 1 to 3.
- **M** describes whether there is distant metastasis. Values are 0 or 1 (boolean).

Specific meaning of the TNM stage can vary depending on the site, e.g. [lung cancer patients](https://www.cancerresearchuk.org/about-cancer/lung-cancer/stages-types-grades/tnm-staging).

- **Location in dataframe:** columns 6, 7 and 8
- **Missing value flag:** -1

In [None]:
print( 't_stage, in column 6, has ' + str((cdata.iloc[:,6] == -1).sum()) + ' missing values.' )
print( 'n_stage, in column 7, has ' + str((cdata.iloc[:,7] == -1).sum()) + ' missing values.' )
print( 'm_stage, in column 8, has ' + str((cdata.iloc[:,8] == -1).sum()) + ' missing values.' )


#value is stored as numerical type
plt.figure(); cdata["t_stage"].value_counts().sort_index().plot.pie(figsize=[3,3])
plt.figure(); cdata["n_stage"].value_counts().sort_index().plot.pie(figsize=[3,3])
plt.figure(); cdata["m_stage"].value_counts().sort_index().plot.pie(figsize=[3,3])

# YOUR TURN
1. Decide which variable you want to explore in the second part of the project. We recommend that you start with survival at 12 months, as this corresponds to the [original analysis](https://www.sciencedirect.com/science/article/pii/S0959804917312017). Check the section on binarizing survival data.
2. Select the clinical variable for the next part of the analysis and create a new dataframe.
3. Save the new dataframe as a new csv.  Check the function [to_csv()](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-store-in-csv)
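
As a rough sketch of steps 2 and 3, the example below selects one column from a made-up dataframe and writes it to a csv file. The data, the file name selected_variable.csv and the temporary output folder are all assumptions made for illustration; in the notebook you would use cdata and a path of your choice.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for cdata after binarizing survival at 12 months.
toy = pd.DataFrame(
    {"status_at_12": [0, 1, -1], "age": [63, 70, 58]},
    index=pd.Index([1, 2, 3], name="xID"),
)

# Double brackets keep the selection as a dataframe rather than a series.
selected = toy[["status_at_12"]]

# Hypothetical output location; the xID index is written as the first column.
out_path = os.path.join(tempfile.mkdtemp(), "selected_variable.csv")
selected.to_csv(out_path)

# Reading the file back with index_col=0 restores xID as the row labels.
reloaded = pd.read_csv(out_path, index_col=0)
print(reloaded)
```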

Later, you can come back to this code and generate other outputs.