# Data Cleaning

This notebook cleans all data sources and combines them into one coherent dataset.

We have three different data groups:
* Label & Patient Data
* Lab Data
* MRI Data

Each group is cleaned, reduced to the relevant columns and then merged into one dataset.

## Imports

In [None]:
import pandas as pd
import datetime
import numpy as np
import re

# Data Cleaning and Selection Label & Patient-data
First we clean the label and patient data. This covers basic cleaning and data type matching, checking for anomalies, and defining the output format. We use the prepared 'no duplicate PID' sheet with labels from KSA.

In [None]:
print("Start Clean and Preprocessing patients-data")

In [None]:
df_patients = pd.read_excel(r'../raw_data/Hypophysenpatienten.xlsx',sheet_name='no duplicate PID')
df_patients.head()

In [None]:
df_patients.tail()

In [None]:
df_patients.columns

## Basic Cleaning, Column Selection, Anomaly Correction and Format definition


### Column Selection
First we select only the columns that are valuable for our model or our analysis. Some columns are renamed to make their content more intuitive.

In [None]:
# define needed columns
column_list = ['PID','Fall Nr.',"Datum/Zeit","Arbeitsplatz.Kürzel",'Grösse',
       'Ausfälle prä', 'Qualität', 'ED','OP Datum',
       'Diagnose', 'Kategorie', 'Patient Alter',
       'Prolaktin',"IGF1", 'Cortisol','fT4','weiteres Labor']
df_patients = df_patients[column_list]
# rename columns
df_patients= df_patients.rename(columns={"Fall Nr.": "Case_ID","PID": "Patient_ID",
                       "Datum/Zeit": "Date_Case","ED": "Entry_date", "OP Datum": "Operation_date",
                       "Arbeitsplatz.Kürzel":"ID_MRI_Machine","Grösse": "Adenoma_size","Qualität": "Label_Quality",
                       "Patient Alter":"Patient_age","Kategorie":"Category","Diagnose":"Diagnosis",
                       "Prolaktin":"Prolactin","weiteres Labor":"Lab_additional"})

### Check for Anomalies and correct them
There are some anomalies, mostly in the datetime columns (e.g. operation date before entry date). These are corrected here or were corrected by the KSA after feedback from us.

In [None]:
# TODO: check with KSA
# correct entry dates that could not be parsed
# df_patients.loc[3,'Entry_date'] = datetime.datetime(2006,1,1,0,0,0,0)
# df_patients.loc[12,'Entry_date'] = datetime.datetime(2008,1,1,0,0,0,0)

# TODO: check with KSA
# correct an operation date that is not datetime parseable
# df_patients.loc[df_patients['Operation_date'] == '2006, 2009', 'Operation_date'] = datetime.datetime(2006,1,1,0,0,0,0)

In [None]:
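# TODO: anomaly? check with KSA
# rows where the operation date lies before the entry date
# the columns are not converted yet, so parse them here; unparseable values become NaT
entry_dates = pd.to_datetime(df_patients['Entry_date'], errors='coerce')
operation_dates = pd.to_datetime(df_patients['Operation_date'], errors='coerce')
df_patients.loc[operation_dates < entry_dates, ['Entry_date','Operation_date']]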

### Data Type Definition
Now we check the column data types and convert each column to its respective type if it is not already correct.


In [None]:
# make datetime values
df_patients["Date_Case"] = pd.to_datetime(df_patients["Date_Case"])
df_patients["Entry_date"] = pd.to_datetime(df_patients["Entry_date"])
df_patients["Operation_date"] = pd.to_datetime(df_patients["Operation_date"])
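Before converting, it can help to list the values that cannot be parsed at all, such as the `'2006, 2009'` entry mentioned in the TODO cell. The following is a minimal sketch on made-up data; the sample values and the explicit `format` are assumptions and not taken from the real sheet.

```python
import pandas as pd

# Hypothetical raw column; '2006, 2009' mimics the unparseable entry from the TODO cell
raw_dates = pd.Series(["2006-01-01", "2006, 2009", None, "2010-05-17"])

# errors="coerce" turns every value that does not match the format into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# values that were present in the raw data but could not be parsed
unparseable = raw_dates[parsed.isna() & raw_dates.notna()]
print(unparseable.tolist())
```

Entries found this way can be reported back to KSA or corrected by hand before the real conversion runs.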

In [None]:
# set the pandas category dtype for the categorical columns, then check all dtypes
for column in ['ID_MRI_Machine', 'Adenoma_size', 'Label_Quality', 'Diagnosis', 'Category']:
    df_patients[column] = df_patients[column].astype('category')
df_patients.dtypes

### Check Duplicates
Check if a patient is duplicated.

In [None]:
# patient ID duplicate check
assert not df_patients["Patient_ID"].duplicated().any(), "duplicate Patient_ID found"

### Replace some Spelling mistakes

In [None]:
# correct misspellings introduced by the labelers
# "intakt?" matches both "intak" and "intakt", so already correct values stay unchanged
df_patients["Ausfälle prä"] = df_patients["Ausfälle prä"].str.replace(r"intakt?", "intakt", regex=True)
df_patients["Ausfälle prä"] = df_patients["Ausfälle prä"].str.replace("goando", "gonado", regex=False)

## One Hot Encode Categorical Values

To use and analyse the categorical data we need to one-hot encode it. We split the comma-separated strings into individual values and create one one-hot-encoded column per value. These columns are then added to the original dataframe.

In [None]:
# Split the 'Ausfälle prä' column into separate strings
df_patients['Ausfälle prä'] = df_patients['Ausfälle prä'].str.split(', ')
# Create a set to store all unique disfunctions
unique_disfunctions = set()

# Iterate over the 'Ausfälle prä' column to gather unique disfunctions
for value in df_patients['Ausfälle prä']:
    if isinstance(value, list):
        unique_disfunctions.update(value)
    elif isinstance(value, str):
        unique_disfunctions.add(value)

# Iterate over the unique disfunctions and create one-hot encoded columns
for disfunction in unique_disfunctions:
    df_patients["Pre_OP_hormone_"+ disfunction] = df_patients['Ausfälle prä'].apply(lambda x: 1 if (isinstance(x, list) and disfunction in x) or (x == disfunction) else 0)
# drop the original 'Ausfälle prä' column
df_patients = df_patients.drop('Ausfälle prä', axis=1)
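The same encoding can probably be expressed more compactly with `Series.str.get_dummies`. This is only a sketch on made-up values, not the notebook's method; the sample strings are assumptions, while the `Pre_OP_hormone_` prefix follows the cell above.

```python
import pandas as pd

# Hypothetical sample of the comma-separated deficit strings
deficits = pd.Series(["gonado, thyreo", "intakt", None, "cortico"])

# split each string on ", " and build one 0/1 column per distinct value;
# missing entries end up as a row of zeros
one_hot = deficits.str.get_dummies(sep=", ").add_prefix("Pre_OP_hormone_")
print(one_hot)
```

The resulting frame could be joined back with `pd.concat([df, one_hot], axis=1)`. Unlike the loop above, the column order is alphabetical rather than set-dependent.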

In [None]:
df_patients.columns

## Remove All NA and not needed Labels

In [None]:
# keep only the labels 'prolaktinom' and 'non-prolaktinom'
df_patients = df_patients[df_patients['Category'].isin(['non-prolaktinom','prolaktinom'])]
assert len(df_patients['Category'].unique()) == 2

In [None]:
df_patients.to_csv(r'../raw_data/label_data.csv',index=False)

In [None]:
print("End Clean and Preprocessing patient data")

# Data Cleaning and Selection MRI-data

Now we clean the MRI data.

In [None]:
print("Start Clean and Preprocessing mri data")

### Column Selection
Select only the relevant columns for the MRIs.

In [None]:
column_list_mri = ['PID','Fall Nr.',"Datum/Zeit","Arbeitsplatz.Kürzel"]

In [None]:
df_mri = pd.read_excel(r'../raw_data/Hypophysenpatienten.xlsx',sheet_name='w duplicates')
# select and rename columns
df_mri = df_mri[column_list_mri]
df_mri= df_mri.rename(columns={"Fall Nr.": "Case_ID","PID": "Patient_ID",
                       "Datum/Zeit": "Date_Case","Arbeitsplatz.Kürzel":"ID_MRI_Machine"})

In [None]:
df_mri.head()

In [None]:
df_mri['Case_ID'] = df_mri['Case_ID'].replace('-','',regex=True)

In [None]:
df_mri['Case_ID']=df_mri['Case_ID'].astype(int)

## Remove same Day MRI and take only newest

In [None]:
# count the rows before deduplication, otherwise the report below always prints 0
n_cases = len(df_mri)
# keep one row per patient and case; max() yields the newest Date_Case
df_mri_clean = df_mri.groupby(["Patient_ID","Case_ID"])[["Date_Case",'ID_MRI_Machine']].max().reset_index()
print(f"{n_cases-len(df_mri_clean)} Cases were deleted, because they were same-day duplicates.")
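Note that `groupby(...).max()` takes the maximum of each column independently, so the date and the machine ID in one output row may stem from different input rows. If the machine of the newest scan is what matters, a sort followed by `drop_duplicates` keeps whole rows together. A minimal sketch on made-up data (all values below are assumptions):

```python
import pandas as pd

# Hypothetical MRI rows: case 11 was scanned twice on the same day
mri = pd.DataFrame({
    "Patient_ID": [1, 1, 2],
    "Case_ID": [11, 11, 22],
    "Date_Case": pd.to_datetime(
        ["2020-01-01 08:00", "2020-01-01 15:30", "2021-06-01 09:00"]
    ),
    "ID_MRI_Machine": ["MR2", "MR1", "MR1"],
})

# sort by time, then keep the last (newest) complete row of every case
newest = (
    mri.sort_values("Date_Case")
       .drop_duplicates(subset=["Patient_ID", "Case_ID"], keep="last")
       .reset_index(drop=True)
)
print(f"{len(mri) - len(newest)} rows dropped")
```

In this sample the per-column maximum would pair the 15:30 scan with machine `MR2`, whereas the row-wise variant keeps `MR1`, the machine that actually produced the newest scan.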

In [None]:
df_mri_clean.to_csv(r'../raw_data/mri_data.csv',index=False)

In [None]:
print("End Clean and Preprocessing mri data")

# Data Cleaning and Selection Lab-data

Now we clean the lab data from the dedicated KSA export.

In [None]:
print("Start Clean and Preprocessing lab-data")

## Read

In [None]:
lab_data = pd.read_excel("../raw_data/extract_pit.xlsx",
                         usecols=['PATIENT_NR','FALL_NR','Analyse-ID','Resultat','Datum_Resultat']).rename(
                             columns={"PATIENT_NR":"Patient_ID","FALL_NR":"Case_ID","Analyse-ID":"Lab_ID",})

In [None]:
lab_data.columns

### Clean and select Labs
The export contains many labs, and we do not need all of them.

In [None]:
# remove not needed labs
lab_data= lab_data[~lab_data['Lab_ID'].isin(['ABTEST','TBILHB'])].copy()

In [None]:
# map integer lab IDs to their string names
lab_data['Lab_ID'] = lab_data['Lab_ID'].replace({20396:'IGF1',24382:'PROL',24384:'PROL',24383:'PROL'})

In [None]:
lab_data['Case_ID']= lab_data['Case_ID'].replace('#','',regex=True)
lab_data['Case_ID']=lab_data['Case_ID'].astype(int)

In [None]:
# repair mis-encoded umlauts from the export
ids = {'Ã¼': 'ü', 'Ã¤': 'ä', "Ã„":"Ä","√§":"ä"}

for column in lab_data.columns[~lab_data.columns.isin(["Case_ID","Patient_ID","Datum_Resultat","Auftragsdatum"])]:
    for old, new in ids.items():
        lab_data[column] = lab_data[column].str.replace(old, new, regex=False)

# strip the greater-than and less-than characters and any other non-numeric text with a regex
# (handles "< number", "> number" and "1 A number")
clean_result = lambda result: re.sub(r'(?<!\d)\.', '', re.sub(r'[^\d.]', '', str(result)))
lab_data["Resultat"] = lab_data["Resultat"].apply(clean_result)
# remove empty results
lab_data = lab_data[lab_data["Resultat"] != ""].copy()
lab_data["Resultat"] = lab_data["Resultat"].astype(float)
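To make the effect of the two nested substitutions easier to follow, here is the same logic written as a plain function and applied to a few made-up inputs (the sample strings are assumptions, not values from the export).

```python
import re


def clean_result(result):
    """Keep digits and decimal points, then drop every point not preceded by a digit."""
    digits_and_points = re.sub(r"[^\d.]", "", str(result))
    return re.sub(r"(?<!\d)\.", "", digits_and_points)


# Hypothetical raw results
for raw in ["< 0.5", "> 200", "12.3", "n.a.", "1,5"]:
    print(repr(raw), "->", repr(clean_result(raw)))
```

One caveat of this approach: a decimal comma is removed like any other non-numeric character, so `"1,5"` becomes `"15"`. If the export can contain decimal commas, they would need to be replaced by points first.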

In [None]:
# sanity check: all result dates were read correctly and lie after 1995-01-01
assert lab_data["Datum_Resultat"].min() > pd.to_datetime("1995-01-01")

In [None]:
# mean of results of same date
lab_data = lab_data.groupby(["Patient_ID","Lab_ID","Datum_Resultat"])["Resultat"].agg(['mean']).reset_index()

## Merge Cases with Patient Cases

In [None]:
lab_data = pd.merge(lab_data,df_mri_clean,on="Patient_ID",how = "right")
lab_data = lab_data[lab_data["Date_Case"] >= lab_data["Datum_Resultat"]].drop(columns="Date_Case")

In [None]:
# compute the newest result date for each patient, lab and case
max_dates = lab_data.groupby(['Patient_ID', "Lab_ID","Case_ID"])['Datum_Resultat'].max().reset_index()
# merge with the original DataFrame to keep only the rows with the newest date
lab_data = pd.merge(lab_data, max_dates, on=['Patient_ID', 'Lab_ID', 'Datum_Resultat',"Case_ID"])
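The two cells above select, for every case and lab, the most recent result that is not newer than the MRI. The sketch below reproduces that idea on made-up data with a sort and `drop_duplicates` instead of the second merge; all values are assumptions.

```python
import pandas as pd

# Hypothetical lab results for one patient
labs = pd.DataFrame({
    "Patient_ID": [1, 1, 1],
    "Lab_ID": ["PROL", "PROL", "PROL"],
    "Datum_Resultat": pd.to_datetime(["2020-01-01", "2020-03-01", "2020-09-01"]),
    "mean": [10.0, 20.0, 30.0],
})
# Hypothetical MRI case of the same patient
cases = pd.DataFrame({
    "Patient_ID": [1],
    "Case_ID": [11],
    "Date_Case": pd.to_datetime(["2020-06-01"]),
})

# attach every lab result to every case of the patient, drop results newer than the MRI
merged = labs.merge(cases, on="Patient_ID", how="right")
merged = merged[merged["Date_Case"] >= merged["Datum_Resultat"]]

# keep the newest remaining result per case and lab
latest = (
    merged.sort_values("Datum_Resultat")
          .drop_duplicates(subset=["Patient_ID", "Case_ID", "Lab_ID"], keep="last")
)
print(latest[["Case_ID", "Lab_ID", "Datum_Resultat", "mean"]])
```

The result from September is discarded because it was measured after the MRI, and of the two earlier results only the one from March is kept.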

In [None]:
# check for any duplicate Values
assert len(lab_data.loc[:,["Case_ID","Lab_ID"]].drop_duplicates()) == len(lab_data)

In [None]:
# make dataframe wide
lab_data = lab_data.pivot(index=["Patient_ID","Case_ID"],values = ['mean'], columns = ['Lab_ID'])
lab_data.columns = lab_data.columns.droplevel()
lab_data = lab_data.reset_index()
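As an illustration of the long-to-wide step, the following sketch pivots a small made-up table. Passing `values` as a plain string instead of a list avoids the extra column level, so no `droplevel` is needed; the sample values are assumptions.

```python
import pandas as pd

# Hypothetical long-format lab data: one row per case and lab
long_labs = pd.DataFrame({
    "Patient_ID": [1, 1, 2],
    "Case_ID": [11, 11, 22],
    "Lab_ID": ["PROL", "IGF1", "PROL"],
    "mean": [10.0, 25.0, 7.5],
})

# one row per case, one column per lab; labs without a result become NaN
wide = long_labs.pivot(
    index=["Patient_ID", "Case_ID"], columns="Lab_ID", values="mean"
).reset_index()
wide.columns.name = None
print(wide)
```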

### Create LabData from label data

In [None]:
df_additional_lab = pd.read_csv(r'../raw_data/label_data.csv').rename(
    columns={'Cortisol':'COR60','fT4':'FT4','Prolactin':'PROL'})
df_additional_lab = df_additional_lab[['Patient_ID','Case_ID','COR60','FT4','PROL','IGF1','Lab_additional']]
# keep only rows in which all four lab values are present
df_additional_lab = df_additional_lab.dropna(subset=['PROL','IGF1','COR60','FT4']).reset_index(drop=True)


In [None]:
df_additional_lab['Lab_additional'] = df_additional_lab['Lab_additional'].fillna('')
# copy the free-text entry into a dedicated column for every lab it mentions
for i in ['Test', 'LH','FSH']:
    df_additional_lab[i] = ''
    mask = df_additional_lab['Lab_additional'].str.contains(i, regex=False)
    df_additional_lab.loc[mask, i] = df_additional_lab.loc[mask, 'Lab_additional']



In [None]:
df_additional_lab= df_additional_lab.drop(columns=['Lab_additional'])
df_additional_lab= df_additional_lab.rename(columns={'Test':'TEST'})

In [None]:
df_additional_lab

#### Missing Rows   

In [None]:
# TODO: ask Tristan
df_additional_lab[df_additional_lab['Patient_ID'] ==300071920]

In [None]:
df_additional_lab= df_additional_lab.drop(df_additional_lab[df_additional_lab['Patient_ID'] ==300071920].index)

#### Testosterone Cleaning

In [None]:
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r' nmol/l','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r'nmol/l','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r'nmol','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r'Testo','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r'Test','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r',','.',regex=True)

In [None]:
df_additional_lab.loc[df_additional_lab['TEST'] == '', 'TEST'] = np.nan

In [None]:
df_additional_lab['TEST']= df_additional_lab['TEST'].astype(float)

#### LH Cleaning

In [None]:
df_additional_lab.loc[df_additional_lab['LH'] == '', 'LH'] = np.nan

#### FSH Cleaning

In [None]:
df_additional_lab['FSH'] = df_additional_lab['FSH'].replace(r'FSH','',regex=True)
df_additional_lab['FSH'] = df_additional_lab['FSH'].replace(r'U/L','',regex=True)

In [None]:
df_additional_lab.loc[df_additional_lab['FSH'] == '', 'FSH'] = np.nan

#### Cortisol Cleaning 

In [None]:
df_additional_lab['COR60'] = df_additional_lab['COR60'].replace(r' nmol/l','',regex=True)

#### FT4 Cleaning

In [None]:
df_additional_lab['FT4'] = df_additional_lab['FT4'].replace(r' pmol/l','',regex=True)

#### Prolactin Cleaning and Conversion

In [None]:
df_additional_lab["PROL"]= df_additional_lab["PROL"].str.replace("ug/L","ug/l")
df_additional_lab["PROL"]= df_additional_lab["PROL"].str.replace("mu/L","mU/l")

In [None]:
# get the indices of the rows that are given in mU/l and need to be converted
indices_to_convert = df_additional_lab.loc[df_additional_lab["PROL"].str.contains('mU/l'),'PROL'].index
# remove the units; note that str.rstrip strips a set of characters, not a suffix,
# so one call with all unit characters (and the space) covers mU/l, ug/l and ug/L
df_additional_lab['PROL'] = df_additional_lab['PROL'].str.rstrip('mUugL/l ')
df_additional_lab['PROL'] = df_additional_lab['PROL'].astype(float)
# mU/l -> ug/l (mU/l * 0.048)
df_additional_lab.loc[indices_to_convert,'PROL'] = df_additional_lab.loc[indices_to_convert,'PROL'] * 0.048
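Because `str.rstrip` removes any trailing characters from a given set rather than a literal suffix, an alternative is to split each entry into a number and a unit explicitly. The sketch below does this with `str.extract` on made-up values; the sample strings and the regular expression are assumptions, and only the factor 0.048 is taken from the cell above.

```python
import pandas as pd

# Hypothetical raw prolactin entries with mixed units and spacing
prol = pd.Series(["250 mU/l", "12.5 ug/l", "480mU/L"])

# split each entry into a numeric part and a unit part
parts = prol.str.extract(r"^\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\S+)\s*$")
value = parts["value"].astype(float)
unit = parts["unit"].str.lower()

# mU/l -> ug/l with the factor 0.048; values already in ug/l are kept as they are
prol_ug_l = value.where(unit != "mu/l", value * 0.048)
print(prol_ug_l.tolist())
```

Entries that do not match the pattern come out as NaN instead of raising during the float conversion, which makes unexpected formats easy to spot.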


#### IGF1 Cleaning and Conversion

In [None]:
df_additional_lab["IGF1"]= df_additional_lab["IGF1"].str.replace("ug/L","ug/l")
df_additional_lab["IGF1"]= df_additional_lab["IGF1"].str.replace(",",".")

# get the indices of the rows that are given in ng/ml and need to be converted
indices_to_convert = df_additional_lab.loc[df_additional_lab["IGF1"].str.contains('ng/ml'),'IGF1'].index
# remove the units; str.rstrip strips a set of characters, not a suffix,
# so one call with all unit characters (and the space) covers ng/ml, nmol and nmol/l
df_additional_lab['IGF1'] = df_additional_lab['IGF1'].str.rstrip('ngmol/ ')
df_additional_lab['IGF1'] = df_additional_lab['IGF1'].astype(float)
# ng/ml -> nmol/l (ng/ml * 0.13)
df_additional_lab.loc[indices_to_convert,'IGF1'] = df_additional_lab.loc[indices_to_convert,'IGF1'] * 0.13


In [None]:
# combine additional lab and lab data
lab_data = lab_data.set_index(['Patient_ID','Case_ID']).combine_first(df_additional_lab.set_index(['Patient_ID','Case_ID'])).reset_index()
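`combine_first` keeps the values of the calling frame and only fills its gaps (missing cells, rows and columns) from the other frame, so the KSA export takes precedence over the label sheet. A minimal sketch with made-up frames (all values are assumptions):

```python
import pandas as pd

keys = ["Patient_ID", "Case_ID"]

# Hypothetical export data: PROL is missing for case 22
export = pd.DataFrame(
    {"Patient_ID": [1, 2], "Case_ID": [11, 22], "PROL": [10.0, None]}
).set_index(keys)

# Hypothetical label-sheet data: adds a value for case 22, a new case and a new column
labels = pd.DataFrame(
    {"Patient_ID": [2, 3], "Case_ID": [22, 33], "PROL": [7.5, 4.0], "FT4": [15.0, 12.0]}
).set_index(keys)

# export values win; gaps are filled from the label sheet
combined = export.combine_first(labels).reset_index()
print(combined)
```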

In [None]:
# share of missing values over all lab columns, in percent (each column counted once)
lab_columns = ['COR60', 'FSH', 'LH', 'FT4', 'IGF1', 'PROL', 'TEST']
spar = round(lab_data[lab_columns].isna().mean().mean() * 100, 1)
print(f"Sparsity of lab data: {spar} % (only cases with lab values)")
print(f"No lab values exist for {len(df_mri_clean)-len(lab_data)} cases.")

In [None]:
lab_data.to_csv(r'../raw_data/lab_data.csv',index=False)

In [None]:
print("End Clean and Preprocessing lab data")

# Full Merge

In [None]:
# get only newest mri
# df_mri_clean = df_mri.sort_values(['Patient_ID','Date_Case'],ascending=False).drop_duplicates('Patient_ID')

In [None]:
df_temp = pd.merge(df_mri_clean, lab_data, on=['Patient_ID','Case_ID'])

In [None]:
patient_columns = ['Patient_ID', 'Entry_date', 'Operation_date', 'Adenoma_size',
       'Diagnosis', 'Category', 'Patient_age','Pre_OP_hormone_cortico',
       'Pre_OP_hormone_gonado', 'Pre_OP_hormone_somato',
       'Pre_OP_hormone_thyreo', 'Pre_OP_hormone_hyperprolaktin',
       'Pre_OP_hormone_keine', 'Pre_OP_hormone_intakt', 'Label_Quality']
full_merged = pd.merge(df_temp, df_patients[patient_columns], how='inner', on='Patient_ID')

In [None]:
full_merged = full_merged.sort_values('Patient_ID')

In [None]:
assert full_merged.duplicated(subset=['Patient_ID','Case_ID']).sum() == 0

In [None]:
full_merged.to_csv(r'../raw_data/data_full_merge.csv',index=False)