# Exploratory Data Analysis - Patients

This notebook contains the exploratory data analysis of the patients' label data.
The data set also contains diagnostic data of the patients, data on the MRI machines used and, for some cases (such as external admissions), lab data.
Aggregations are always performed per individual patient.


## Imports and Preprocessing

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# set default plt figsize to (12,6)
plt.rcParams["figure.figsize"] = (12,6)
pd.set_option('display.colheader_justify', 'center')


In [None]:
# runs the clean and preprocessing notebook
%run "clean_preprocessing.ipynb"

In [None]:
# runs the data_partitioning notebook
%run "data_partitioning.ipynb"

In [None]:
# read train data set
df_mris = pd.read_csv(r'../data/train_data.csv')

In [None]:
# keep one row per patient (the first row of each Patient_ID)
df_patients = df_mris.drop_duplicates(subset='Patient_ID').copy()
df_patients
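
How `duplicated()` behaves as a row filter can be checked on a toy frame. The following is a minimal sketch; the IDs and dates are made up for illustration and are not taken from the data set. It compares filtering with `duplicated()` against `drop_duplicates()`.

```python
import pandas as pd

# toy data (made up): patient 1 has three MRIs, patient 2 has one, patient 3 has two
toy = pd.DataFrame({
    "Patient_ID": [1, 1, 1, 2, 3, 3],
    "Date_MRI": ["2020-01-01", "2020-06-01", "2021-01-01",
                 "2020-03-01", "2019-05-01", "2019-09-01"],
})

# duplicated() marks every occurrence after the first one,
# so this filter keeps only the repeated rows
repeats = toy[toy["Patient_ID"].duplicated()]

# drop_duplicates() keeps the first row of every patient
one_per_patient = toy.drop_duplicates(subset="Patient_ID")

print(repeats["Patient_ID"].tolist())
print(one_per_patient["Patient_ID"].tolist())
```

In this toy example the first variant drops patient 2 entirely and lists patient 1 twice, while the second variant yields exactly one row per patient.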

In [None]:
# make datetime values
df_patients["Date_MRI"] = pd.to_datetime(df_patients["Date_MRI"])
df_patients["Entry_date"] = pd.to_datetime(df_patients["Entry_date"])
df_patients["Operation_date"] = pd.to_datetime(df_patients["Operation_date"])
# convert categorical columns to the pandas category dtype
df_patients['ID_MRI_Machine'] = df_patients['ID_MRI_Machine'].astype('category')
df_patients['Adenoma_size'] = df_patients['Adenoma_size'].astype('category')
df_patients['Label_Quality'] = df_patients['Label_Quality'].astype('category')
df_patients['Diagnosis'] = df_patients['Diagnosis'].astype('category')
df_patients['Category'] = df_patients['Category'].astype('category')

## Dataframe Summary

In [None]:
df_patients.head()

In [None]:
df_patients.tail()

In [None]:
print("Total Dataframe rows:", len(df_patients))
print("Total Dataframe columns:", len(df_patients.columns))
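
Besides the shape, a per-column overview of data types and missing values can be helpful at this point. The following is a minimal sketch on a made-up frame; the column names follow the ones used in this notebook, the values are invented.

```python
import pandas as pd
import numpy as np

# made-up example rows, not taken from the data set
toy = pd.DataFrame({
    "Patient_age": [34, 51, np.nan, 47],
    "Prolactin": [12.5, np.nan, np.nan, 230.0],
    "Category": ["prolactinoma", "non-prolactinoma", "non-prolactinoma", "prolactinoma"],
})

# one row per column: data type, number and share of missing values
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
    "missing_pct": toy.isna().mean() * 100,
})
print(overview)
```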

## Distribution Analysis
We will take a look at the distributions of the variables and check for outliers as well.

### Data of MRI
First, we look at the distribution of the dates on which the MRIs were performed.

In [None]:
column = 'Date_MRI'
print("Range of MRI date:", df_patients[column].min().strftime('%d.%m.%Y'), "to", df_patients[column].max().strftime('%d.%m.%Y'))
print("Missing values:", sum(df_patients[column].isna()))

In [None]:
sns.histplot(df_patients[column],bins=50)
plt.title(f"Distribution of Date of MRI")
plt.xlabel('Date of MRI')
plt.show()

### Data of MRI Machines

We analyse the distribution of the MRI machines used. Since they are set up identically, we expect them to have little influence on the classification.


In [None]:
column= 'ID_MRI_Machine'
print("Unique MRI Machines:", df_patients[column].unique())
print("Missing values:", sum(df_patients[column].isna()))

In [None]:
sns.histplot(df_patients[column])
plt.title(f"Distribution of counts of MRI machines used")
plt.xlabel('ID of MRI Machine')
plt.show()

### Data Features (screening data)

#### Adenoma Size
The 'Adenoma_size' column describes whether an adenoma was labeled as micro or macro.
A microadenoma is smaller than 10 mm; a macroadenoma is 10 mm or larger.
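
The threshold can be illustrated with a short sketch. It assumes a hypothetical series of measured sizes in millimetres; the data set itself only contains the resulting category, so the numbers below are made up.

```python
import pandas as pd

# hypothetical sizes in millimetres (made up for illustration)
sizes_mm = pd.Series([3.0, 9.9, 10.0, 25.0])

# right=False makes the bins left-closed: [0, 10) -> micro, [10, inf) -> macro
size_category = pd.cut(
    sizes_mm,
    bins=[0, 10, float("inf")],
    labels=["micro", "macro"],
    right=False,
)
print(size_category.tolist())
```

With `right=False` a size of exactly 10 mm falls into the macro bin, which matches the definition above.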

In [None]:
column= 'Adenoma_size'
print("Summary Statistics:\n",df_patients[column].describe())
print("Percentage Distribution:\n",df_patients[column].value_counts(normalize=True) * 100)
print("Missing values Adenoma size:", sum(df_patients[column].isna()))

In [None]:
sns.histplot(df_patients[column])
plt.title(f"Distribution of Adenoma size categories")
plt.xlabel('Adenoma size category')
plt.show()

#### Pre Operation hormonal dysfunctions
The 'Pre_' columns (originally 'Ausfälle prä', i.e. pre-operative deficits) describe whether a patient had an under- or overproduction of a hormone produced by the pituitary gland before an operation.
An overproduction of prolactin (hyperprolactinemia) is the value that indicates a prolactinoma.


In [None]:
# define all "pre op" columns
pre_op_columns = [col for col in df_patients.columns if "Pre_" in col]

In [None]:
# summarise and sort the pre op column values
summary=df_patients[pre_op_columns].sum().sort_values(ascending=False)
sns.barplot(x=summary.index, y=summary.values)
plt.title("Distribution of pre OP hormonal dysfunctions")
plt.xlabel("Pre OP hormonal dysfunctions")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

In [None]:
# correlate all pre op columns to each other
correlation_matrix = df_patients[pre_op_columns].corr()
# Create a heatmap using Seaborn
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix of pre OP hormonal dysfunctions")
plt.xticks(rotation=45)
plt.show()

#### Post Operation hormonal dysfunctions
The 'Post_' columns (originally 'Ausfälle post', i.e. post-operative deficits) describe whether a patient had an under- or overproduction of a hormone produced by the pituitary gland after an operation.
These are mostly used to confirm a successful operation.

In [None]:
# define all "post op" columns
post_op_columns = [col for col in df_patients.columns if "Post_" in col]

In [None]:
# summarise and sort the post op column values
summary=df_patients[post_op_columns].sum().sort_values(ascending=False)
sns.barplot(x=summary.index, y=summary.values)
plt.title("Distribution of post OP hormonal dysfunctions")
plt.xlabel("Post OP hormonal dysfunctions")
plt.ylabel("Count")
plt.xticks(rotation=45)

plt.show()

In [None]:
# correlate all post op columns to each other
correlation_matrix = df_patients[post_op_columns].corr()
# Create a heatmap using Seaborn
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix of post OP hormonal dysfunctions")
plt.xticks(rotation=45)
plt.show()

#### Pre and Post Operation hormonal dysfunctions
We also take a look at the correlations between pre- and post-operative hormonal dysfunctions.

In [None]:
# correlate all pre and post columns against each other
correlation_matrix = df_patients[pre_op_columns+post_op_columns].corr()
# keep only the pre columns as rows (y-axis) and the post columns as columns (x-axis), both sorted alphabetically
pre_post_correlation = correlation_matrix.loc[sorted(pre_op_columns), sorted(post_op_columns)]
# Create a heatmap using Seaborn
sns.heatmap(pre_post_correlation, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix of pre OP to post OP hormonal dysfunctions")
plt.xticks(rotation=45)
plt.show()

#### Data Label Quality
The label data contains a column with comments on the quality of the data used for labeling. These comments note whether the labeling decision was complicated, whether the decision was made with low confidence, or whether other data quality issues were found.



In [None]:
column= 'Label_Quality'
print("Summary Statistics Data Quality:\n", df_patients[column].describe())
print("Missing values:", sum(df_patients[column].isna()))

In [None]:
sns.histplot(df_patients[column])
plt.title(f"Distribution of Data Quality Comments")
plt.xticks(rotation=45)
plt.xlabel('Data Quality Comment')
plt.show()

#### Date of Entry and Operation
The columns 'Entry_date' and 'Operation_date' (originally 'Eintrittsdatum' and 'Operationdatum') describe when a patient was admitted to the hospital and when the operation took place.
The difference between the two dates can indicate how urgent the operation was.
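
As a minimal sketch of this difference, the following example computes it in days on made-up dates and flags rows where the operation date lies before the entry date. Treating such rows as implausible is an assumption made for this illustration; the column name `days_to_operation` is hypothetical.

```python
import pandas as pd

# made-up dates, not taken from the data set
toy = pd.DataFrame({
    "Entry_date": pd.to_datetime(["2020-01-10", "2021-03-01", "2019-07-15"]),
    "Operation_date": pd.to_datetime(["2020-01-12", "2021-02-27", "2019-07-15"]),
})

# subtracting two datetime columns yields timedeltas; .dt.days extracts whole days
toy["days_to_operation"] = (toy["Operation_date"] - toy["Entry_date"]).dt.days

# assumption: an operation before the entry date points to a data entry issue
implausible = toy[toy["days_to_operation"] < 0]

print(toy["days_to_operation"].tolist())
print(len(implausible))
```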

In [None]:
column = 'Entry_date'
print("Range of Entry date:", df_patients[column].min().strftime('%d.%m.%Y'), "to", df_patients[column].max().strftime('%d.%m.%Y'))
print("Missing values:", sum(df_patients[column].isna()))

In [None]:
sns.histplot(df_patients[column],bins=20)
plt.title(f"Distribution of Date of Patient Entry")
plt.xlabel('Date of Patient Entry')
plt.show()

In [None]:
column = 'Operation_date'
print("Range of Operation date:", df_patients[column].min().strftime('%d.%m.%Y'), "to", df_patients[column].max().strftime('%d.%m.%Y'))
print("Missing values:", sum(df_patients[column].isna()))

In [None]:
sns.histplot(df_patients[column],bins=20)
plt.title(f"Distribution of Date of Patient Operation")
plt.xlabel('Date of Patient Operation')
plt.show()

In [None]:
# calculate time difference in years between operation and entry date
df_patients['EDDate_OPDate_Difference'] = (df_patients['Operation_date'] - df_patients['Entry_date']).dt.days /365
# Create a histogram to visualize the time differences
sns.histplot(df_patients['EDDate_OPDate_Difference'], bins=20)
plt.title("Time Difference between Entry Date and Operation Date Histogram")
plt.xlabel("Time Difference (years)")
plt.ylabel("Count")
plt.show()

In [None]:
sns.stripplot(y=df_patients['EDDate_OPDate_Difference'], jitter=True, legend=False,alpha=0.7,label="Patients")
sns.boxplot(y=df_patients['EDDate_OPDate_Difference'], width=0.3)
plt.title("Scatterplot with Boxplot for a Time difference between OP Date and Entry Date")
plt.ylabel("Time Difference (years)")
plt.show()

#### Patient Age
The patient's age is also part of the data set. It might be needed to impute certain missing hormone values.
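
One way such an age-based imputation could look is sketched below. The age bands, the choice of the median and the column name `IGF1_imputed` are assumptions made for this illustration, and the values are made up; the method actually used for the models may differ.

```python
import pandas as pd
import numpy as np

# made-up example values
toy = pd.DataFrame({
    "Patient_age": [25, 28, 31, 62, 65, 70],
    "IGF1": [300.0, np.nan, 280.0, 120.0, 140.0, np.nan],
})

# hypothetical age bands, chosen only for this illustration
toy["age_band"] = pd.cut(toy["Patient_age"], bins=[0, 40, 120], labels=["<=40", ">40"])

# median of the observed values within each age band, aligned to the rows
band_median = toy.groupby("age_band", observed=True)["IGF1"].transform("median")

# fill each missing value with the median of its own age band
toy["IGF1_imputed"] = toy["IGF1"].fillna(band_median)
print(toy["IGF1_imputed"].tolist())
```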


In [None]:
column = 'Patient_age'
print("Range of ages:", df_patients[column].min(), "to", df_patients[column].max())
print("Mean Patient Age:", df_patients[column].mean())
print("Median Patient Age:", df_patients[column].median())
print("Missing values:", sum(df_patients[column].isna()))

In [None]:
sns.histplot(df_patients[column],bins=30)
plt.title(f"Distribution of Patient age")
plt.xlabel('Patient age')
plt.show()

In [None]:
sns.stripplot(y=df_patients[column], jitter=True, legend=False,alpha=0.7,label="Patients")
sns.boxplot(y=df_patients[column], width=0.3)
plt.title("Scatterplot with Boxplot for Patient Age")
plt.ylabel("Patient Age")
plt.legend()
plt.show()

### Data Additional Laboratory Data (hormonal data)

Some patients are missing lab values because they were transferred from external facilities such as Kantonsspital Baden (KSB).
If the data was found by the labelers, we can include it in the models.

#### Prolactin (hormone)
The 'Prolactin' column contains the patient's measured prolactin values.


In [None]:
column = 'Prolactin'
print("Range of Prolactin:", df_patients[column].min(), "to", df_patients[column].max())
print("Mean Prolactin:", df_patients[column].mean())
print("Median Prolactin:", df_patients[column].median())
print("Missing values:", sum(df_patients[column].isna()))

In [None]:
sns.histplot(df_patients[column])
plt.title(f"Distribution of {column}")
plt.xlabel('Prolactin (μg/l)')
plt.show()

#### TSH (hormone)
The 'TSH' column contains the patient's measured values of thyroid-stimulating hormone.


In [None]:
column = 'TSH'
print("Range of TSH:", df_patients[column].min(), "to", df_patients[column].max())
print("Mean TSH:", df_patients[column].mean())
print("Median TSH:", df_patients[column].median())
print("Missing values:", sum(df_patients[column].isna()))

In [None]:
sns.histplot(df_patients[column])
plt.title(f"Distribution of {column}")
plt.xlabel('TSH (mU/l)')
plt.show()

#### IGF1 (hormone)
The 'IGF1' column contains the patient's measured values of insulin-like growth factor 1.


In [None]:
column = 'IGF1'
print("Range of IGF1:", df_patients[column].min(), "to", df_patients[column].max())
print("Mean IGF1:", df_patients[column].mean())
print("Median IGF1:", df_patients[column].median())
print("Missing values:", sum(df_patients[column].isna()))

In [None]:
sns.histplot(df_patients[column])
plt.title(f"Distribution of {column}")
plt.xlabel('IGF1 (μg/l)')
plt.show()

In [None]:
#TODO: add additional lab values if they are generated
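
Should further lab values be added, the repeated summary of range, mean, median and missing values could be produced by a small helper. The sketch below is one possible shape; `summarise_numeric` is a hypothetical name and the values are made up.

```python
import pandas as pd
import numpy as np

def summarise_numeric(df, columns):
    """Return min, max, mean, median and missing count for each given column."""
    # agg() returns one row per statistic; transpose to get one row per column
    stats = df[columns].agg(["min", "max", "mean", "median"]).T
    stats["missing"] = df[columns].isna().sum()
    return stats

# made-up example values
toy = pd.DataFrame({
    "Prolactin": [10.0, 250.0, np.nan, 40.0],
    "TSH": [1.2, 2.4, 0.6, np.nan],
})

summary = summarise_numeric(toy, ["Prolactin", "TSH"])
print(summary)
```

Applied to the notebook, the call would take `df_patients` and the list of lab columns in place of the toy frame.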

### Data Labels (Medical diagnosis data)

#### Category Prolactinoma (binary Label)
The 'Category' column contains the actual label: whether an adenoma is a prolactinoma or a non-prolactinoma.


In [None]:
column= 'Category'
print("Summary Statistics:\n",df_patients[column].describe())
print("Percentage Distribution:\n",df_patients[column].value_counts(normalize=True) * 100)
print("Missing values Adenoma category:", sum(df_patients[column].isna()))

In [None]:
sns.histplot(df_patients[column])
plt.title("Distribution of Adenoma Category (binary label for classification)")
plt.xlabel('Category of Adenoma')
plt.show()

#### Diagnosis Prolactinoma (adenoma description)
The 'Diagnosis' column contains the detailed diagnosis description of the adenoma.


In [None]:
column= 'Diagnosis'
print("Summary Statistics:\n",df_patients[column].describe())
print("Percentage Distribution:\n",df_patients[column].value_counts(normalize=True) * 100)
print("Missing values Diagnosis:", sum(df_patients[column].isna()))


In [None]:
sns.histplot(df_patients[column])
plt.title(f"Distribution of Diagnosis Description")
plt.xticks(rotation=45)
plt.xlabel('Diagnosis description')
plt.show()

## Correlation of Features to the Adenoma Category

In [None]:
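# encode categorical features as integer codes (missing values stay missing), keep numeric features unchanged
features = ['Category','Adenoma_size','Prolactin', 'TSH', 'IGF1'] + pre_op_columns + post_op_columns
df_encoded = df_patients[features].apply(lambda x: x if pd.api.types.is_numeric_dtype(x) else x.astype('category').cat.codes.where(x.notna()))
# correlate the features with each other and keep only the correlations to the label column
df_patients_corr = df_encoded.corr(method='pearson', min_periods=1)
correlation_matrix = df_patients_corr[['Category']]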

# Create a heatmap using Seaborn
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix of Features to the Adenoma Category")
plt.xticks(rotation=45)
plt.show()

## MRI per Patient

In [None]:
summary = df_mris.groupby('Patient_ID')['Patient_ID'].count().sort_values(ascending=False)
#TODO: maybe more eda needed

In [None]:
sns.stripplot(y=summary, jitter=True, legend=False,alpha=0.7,label="Patients")
sns.boxplot(y=summary, width=0.3)
plt.title("Scatterplot with Boxplot for MRI count per Patient")
plt.ylabel("MRI count per Patient")
plt.show()
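
As a possible extension of this section, the counts can be condensed into a table of how many patients have one, two, three or more MRIs. This is a minimal sketch on made-up IDs; the variable names are illustrative.

```python
import pandas as pd

# made-up IDs: patient 1 has three MRIs, patient 3 has two, patients 2 and 4 have one
toy_mris = pd.DataFrame({"Patient_ID": [1, 1, 1, 2, 3, 3, 4]})

# number of MRIs per patient
mri_per_patient = toy_mris["Patient_ID"].value_counts()

# number of patients per MRI count, sorted by the MRI count
patients_per_count = mri_per_patient.value_counts().sort_index()
print(patients_per_count.to_dict())
```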