# Data Cleaning

This notebook cleans and combines all data sources into one coherent dataset.

We have three different data groups:
* Label & Patient Data
* Lab Data
* MRI Data

Each group is cleaned, reduced to the relevant columns and then combined into one dataset.

## Imports

In [1]:
import pandas as pd
import datetime
import numpy as np
import re

# Data Cleaning and Selection Label & Patient-data
First we clean the label and patient data. This covers basic cleaning and data type matching, checking for anomalies and defining the output format. We use the prepared 'no duplicate PID' sheet with labels from KSA.

In [2]:
print("Start Clean and Preprocessing patients-data")

Start Clean and Preprocessing patients-data


In [3]:
df_patients = pd.read_excel(r'../raw_data/Hypophysenpatienten.xlsx',sheet_name='no duplicate PID')
df_patients.head()

Unnamed: 0,%ID,Fall Nr.,Datum/Zeit,Modalität,Exam Code,Exam Name,Abteilung,Arbeitsplatz.Kürzel,Aufnahmeart,PID,...,OP Datum,Ausfälle post,Diagnose,Kategorie,Patient Alter,Zuweiser,AnforderungDatum,ÜberweiserIntern.Bereich,ÜberweiserIntern.Klinik,Gender
0,7271667,4210945,2016-01-19 11:45:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI2,Amb,249222,...,2012-01-09,"gonado, cortico, thyreo, somato",inaktiv,non-prolaktinom,77,Berkmann Sven,2015-11-06 14:25:17.0000000,KCH,Neurochirurgie,female
1,7247536,4153000,2016-02-08 11:20:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI3,Amb,154372,...,2002-11-02,gonado,prolaktinom,prolaktinom,25,MU_Kinderklinik Ambulatorium,-,FKL,Kinderklinik,male
2,7317245,40026051,2016-02-16 16:25:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI3,Stat,682611,...,NaT,,,,69,MU_101 (Notfallstation),2016-02-16 15:12:05.0000000,INZ,Notfallstation,
3,7346392,40059944,2016-04-21 12:30:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI2,Amb,509136,...,NaT,keine,prolaktinom,prolaktinom,27,Nebiker Piera,2016-04-20 08:11:06.0000000,MUK,Endokrinologie,female
4,7332424,4191409,2016-05-06 11:50:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI3,Amb,293138,...,NaT,,,,62,MU_Medizinisches Ambulatorium,2016-03-18 15:04:41.0000000,MUK,Allgemeine Innere und Notfallmedizin,male


In [4]:
df_patients.tail()

Unnamed: 0,%ID,Fall Nr.,Datum/Zeit,Modalität,Exam Code,Exam Name,Abteilung,Arbeitsplatz.Kürzel,Aufnahmeart,PID,...,OP Datum,Ausfälle post,Diagnose,Kategorie,Patient Alter,Zuweiser,AnforderungDatum,ÜberweiserIntern.Bereich,ÜberweiserIntern.Klinik,Gender
521,8930425,41843364,2023-05-02 12:04:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI3,Amb,300302329,...,2022-01-12,intakt,keine,non-prolaktinom,56,Hirntumorzen,2023-03-02 15:30:07.0000000,KCH,Hirntumorzentrum,male
522,8930353,41725372,2023-05-05 07:54:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI2,Amb,543641,...,2009-06-04,keine,acth,non-prolaktinom,39,Endo,2023-03-02 14:52:07.0000000,MUK,Endokrinologie,female
523,8688141,41892695,2023-05-05 14:19:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI3,Amb,365189,...,NaT,,,,32,Neurologie,2022-05-05 15:42:32.0000000,MUK,Neurologie,female
524,8947649,41708812,2023-05-06 08:12:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI3,Amb,762512,...,2018-09-19,thyreo,gh,non-prolaktinom,66,Endo,2023-03-21 10:12:56.0000000,MUK,Endokrinologie,female
525,8921949,41835743,2023-05-11 09:00:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI3,Amb,300146159,...,2021-09-17,gonado,inaktiv (gonado),non-prolaktinom,57,Hirntumorzen,2023-02-21 14:41:09.0000000,KCH,Hirntumorzentrum,male


In [5]:
df_patients.columns

Index(['%ID', 'Fall Nr.', 'Datum/Zeit', 'Modalität', 'Exam Code', 'Exam Name',
       'Abteilung', 'Arbeitsplatz.Kürzel', 'Aufnahmeart', 'PID', 'Grösse',
       'Ausfälle prä', 'Prolaktin', 'IGF1', 'Cortisol', 'fT4',
       'weiteres Labor', 'Qualität', 'ED', 'OP Datum', 'Ausfälle post',
       'Diagnose', 'Kategorie', 'Patient Alter', 'Zuweiser',
       'AnforderungDatum', 'ÜberweiserIntern.Bereich',
       'ÜberweiserIntern.Klinik', 'Gender'],
      dtype='object')

## Basic Cleaning, Column Selection, Anomaly Correction and Format definition


### Column Selection
First we select only the columns that are valuable for our model or our analysis. The selected columns are then renamed to make their content more intuitive.

In [6]:
# define needed columns
column_list = ['PID','Fall Nr.',"Datum/Zeit","Arbeitsplatz.Kürzel",'Grösse',
       'Ausfälle prä', 'Qualität', 'ED','OP Datum',
       'Diagnose', 'Kategorie', 'Patient Alter',
       'Prolaktin',"IGF1", 'Cortisol','fT4','weiteres Labor','Gender']
df_patients = df_patients[column_list]
# rename columns
df_patients= df_patients.rename(columns={"Fall Nr.": "Case_ID","PID": "Patient_ID",
                       "Datum/Zeit": "Date_Case","ED": "Entry_date", "OP Datum": "Operation_date",
                       "Arbeitsplatz.Kürzel":"ID_MRI_Machine","Grösse": "Adenoma_size","Qualität": "Label_Quality",
                       "Patient Alter":"Patient_age","Kategorie":"Category","Diagnose":"Diagnosis",
                       "Prolaktin":"Prolactin","weiteres Labor":"Lab_additional", 'Gender':"Patient_gender"})

### Check for Anomalies and correct them
There are some anomalies, mostly in the datetime columns (e.g. an operation date before the entry date). These were either corrected by us or corrected by the KSA after our feedback.

In [7]:
# there must be no rows where the operation date lies before the entry date
assert len(df_patients[df_patients['Operation_date'] < df_patients['Entry_date']][['Entry_date','Operation_date']]) ==0
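When an assertion like this fails, it helps to see the offending rows. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration); it assumes the same column names as above. Rows with missing dates (`NaT`) never compare as smaller and are therefore not flagged.

```python
import pandas as pd

# Hypothetical toy data, not taken from the KSA export
toy = pd.DataFrame({
    "Patient_ID": [1, 2, 3],
    "Entry_date": pd.to_datetime(["2012-01-01", "2010-05-01", None]),
    "Operation_date": pd.to_datetime(["2012-01-09", "2009-12-31", None]),
})

# Select the rows where the operation lies before the entry date
bad = toy[toy["Operation_date"] < toy["Entry_date"]]
print(bad[["Patient_ID", "Entry_date", "Operation_date"]])
```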

### Data Type Definition
Now we check the column data types and parse each column into its respective type if it is not already correct.


In [8]:
# make datetime values
df_patients["Date_Case"] = pd.to_datetime(df_patients["Date_Case"])
df_patients["Entry_date"] = pd.to_datetime(df_patients["Entry_date"])
df_patients["Operation_date"] = pd.to_datetime(df_patients["Operation_date"])

In [9]:
# set category data type in pandas, check datatypes
df_patients['ID_MRI_Machine'] = df_patients['ID_MRI_Machine'].astype('category')
df_patients['Adenoma_size'] = df_patients['Adenoma_size'].astype('category')
df_patients['Diagnosis'] = df_patients['Diagnosis'].astype('category')
df_patients['Category'] = df_patients['Category'].astype('category')
df_patients['Patient_gender'] = df_patients['Patient_gender'].astype('category')
df_patients.dtypes

Patient_ID                 int64
Case_ID                    int64
Date_Case         datetime64[ns]
ID_MRI_Machine          category
Adenoma_size            category
Ausfälle prä              object
Label_Quality             object
Entry_date        datetime64[ns]
Operation_date    datetime64[ns]
Diagnosis               category
Category                category
Patient_age                int64
Prolactin                 object
IGF1                      object
Cortisol                  object
fT4                       object
Lab_additional            object
Patient_gender          category
dtype: object

### Check Duplicates
Check if a patient is duplicated.

In [10]:
# Patient ID Duplicate Check
assert len(df_patients[df_patients["Patient_ID"].duplicated()]) == 0

In [11]:
df_patients

Unnamed: 0,Patient_ID,Case_ID,Date_Case,ID_MRI_Machine,Adenoma_size,Ausfälle prä,Label_Quality,Entry_date,Operation_date,Diagnosis,Category,Patient_age,Prolactin,IGF1,Cortisol,fT4,Lab_additional,Patient_gender
0,249222,4210945,2016-01-19 11:45:00,MRI2,makro,"gonado, cortico, thyreo",Einblutung,2012-01-01,2012-01-09,inaktiv,non-prolaktinom,77,,,,,,female
1,154372,4153000,2016-02-08 11:20:00,MRI3,makro,gonado,,2002-11-01,2002-11-02,prolaktinom,prolaktinom,25,,,,,,male
2,682611,40026051,2016-02-16 16:25:00,MRI3,,,kein Adenom,NaT,NaT,,,69,,,,,,
3,509136,40059944,2016-04-21 12:30:00,MRI2,mikro,gonado,,2008-07-01,NaT,prolaktinom,prolaktinom,27,,,,,,female
4,293138,4191409,2016-05-06 11:50:00,MRI3,,,kein Adenom,NaT,NaT,,,62,,,,,,male
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
521,300302329,41843364,2023-05-02 12:04:00,MRI3,makro,keine,,2021-10-01,2022-01-12,keine,non-prolaktinom,56,,,,,,male
522,543641,41725372,2023-05-05 07:54:00,MRI2,mikro,keine,,2006-01-01,2009-06-04,acth,non-prolaktinom,39,,,,,,female
523,365189,41892695,2023-05-05 14:19:00,MRI3,,,keine daten,NaT,NaT,,,32,,,,,,female
524,762512,41708812,2023-05-06 08:12:00,MRI3,makro,keine,,2018-09-01,2018-09-19,gh,non-prolaktinom,66,,,,,,female


### Check Diagnosis

In [12]:
df_patients['Diagnosis'].unique()

['inaktiv', 'prolaktinom', NaN, 'rathke', 'gh', ..., 'normal', 'empty sella', 'inaktiv (gh)', 'inaktiv (acth)', 'keine']
Length: 41
Categories (40, object): ['Akromegalie', 'Akromegalie, gh', 'Hypophyseninfarkt', 'Mikro-/Akromegalie', ..., 'substituiert alle Achsen', 'supprimiertes prolaktin', 'teils inaktiv, intra und supraselläres zystis..., 'zystisch']

## One Hot Encode Categorical Values

To use and analyse the categorical data, we one-hot encode it. Each comma-separated string is split into its individual values, and for every distinct value a 0/1 column is added to the dataframe. The original column is then dropped.

In [13]:
df_patients["Ausfälle prä"]= df_patients["Ausfälle prä"].str.replace(' ', '')
df_patients["Ausfälle prä"]= df_patients["Ausfälle prä"].str.lower()
# Split the 'Ausfälle prä' column into separate strings
df_patients['Ausfälle prä'] = df_patients['Ausfälle prä'].str.split(',')

# Create a set to store all unique disfunctions
unique_disfunctions = set()

# Iterate over the 'Ausfälle prä' column to gather unique disfunctions
for value in df_patients['Ausfälle prä']:
    if isinstance(value, list):
        unique_disfunctions.update(value)
    elif isinstance(value, str):
        unique_disfunctions.add(value)

# Iterate over the unique disfunctions and create one-hot encoded columns
for disfunction in unique_disfunctions:
    df_patients["Pre_OP_hormone_"+ disfunction] = df_patients['Ausfälle prä'].apply(lambda x: 1 if (isinstance(x, list) and disfunction in x) or (x == disfunction) else 0)
# drop the original 'Ausfälle prä' column
df_patients = df_patients.drop('Ausfälle prä', axis=1)

In [14]:
df_patients["Diagnosis"]= df_patients["Diagnosis"].str.replace(' ', '')
df_patients["Diagnosis"]= df_patients["Diagnosis"].str.lower()
# Split the 'Diagnosis' column into separate strings
df_patients['Diagnosis'] = df_patients['Diagnosis'].str.split(',')

# Create a set to store all unique diagnoses
unique_disfunctions = set()

# Iterate over the 'Diagnosis' column to gather unique diagnoses
for value in df_patients['Diagnosis']:
    if isinstance(value, list):
        unique_disfunctions.update(value)
    elif isinstance(value, str):
        unique_disfunctions.add(value)

# Iterate over the unique diagnoses and create one-hot encoded columns
for disfunction in unique_disfunctions:
    df_patients["Diagnosis_"+ disfunction] = df_patients['Diagnosis'].apply(lambda x: 1 if (isinstance(x, list) and disfunction in x) or (x == disfunction) else 0)
# drop the original 'Diagnosis' column
df_patients = df_patients.drop('Diagnosis', axis=1)
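The split-and-encode steps of the two cells above can also be sketched more compactly with `Series.str.get_dummies`. This is only an illustration on a hypothetical toy frame (`toy` and its values are assumptions), not the code used to produce the dataset. Missing values yield all-zero rows, as in the loop version.

```python
import pandas as pd

# Hypothetical toy column with comma-separated values and a missing entry
toy = pd.DataFrame({"Ausfälle prä": ["gonado, cortico", "Gonado", None, "keine"]})

# Normalise: remove spaces and lower-case, as in the cells above
normalized = toy["Ausfälle prä"].str.replace(" ", "", regex=False).str.lower()

# One 0/1 column per distinct value, split on commas
dummies = normalized.str.get_dummies(sep=",").add_prefix("Pre_OP_hormone_")

# Replace the original column by the encoded columns
toy_encoded = pd.concat([toy.drop(columns="Ausfälle prä"), dummies], axis=1)
print(toy_encoded)
```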

## Remove All NA and not needed Labels

In [15]:
# remove all labels which are not prolaktinom or non-prolaktinom
df_patients = df_patients[df_patients['Category'].isin(['non-prolaktinom','prolaktinom'])]
assert len(df_patients['Category'].unique()) == 2

In [16]:
df_patients.to_csv(r'../raw_data/label_data.csv',index=False)

In [17]:
print("End Clean and Preprocessing patient data")

End Clean and Preprocessing patient data


# Data Cleaning and Selection MRI-data

Now we clean the MRI data.

In [18]:
print("Start Clean and Preprocessing mri data")

Start Clean and Preprocessing mri data


### Column Selection
Select only the columns that are relevant for the MRIs.

In [19]:
column_list_mri = ['PID','Fall Nr.',"Datum/Zeit","Arbeitsplatz.Kürzel",'%ID']

In [20]:
df_mri = pd.read_excel(r'../raw_data/Hypophysenpatienten.xlsx',sheet_name='w duplicates')
# select and rename columns
df_mri = df_mri[column_list_mri]
df_mri= df_mri.rename(columns={"Fall Nr.": "Case_ID","PID": "Patient_ID",
                       "Datum/Zeit": "Date_Case","Arbeitsplatz.Kürzel":"ID_MRI_Machine",'%ID':"MRI_Case_ID",})

In [21]:
df_mri.head()

Unnamed: 0,Patient_ID,Case_ID,Date_Case,ID_MRI_Machine,MRI_Case_ID
0,300146159,41835743,2023-05-11 09:00:00,MRI3,8921949
1,762512,41708812,2023-05-06 08:12:00,MRI3,8947649
2,365189,41892695,2023-05-05 14:19:00,MRI3,8688141
3,543641,41725372,2023-05-05 07:54:00,MRI2,8930353
4,300302329,41843364,2023-05-02 12:04:00,MRI3,8930425


In [22]:
df_mri.tail()

Unnamed: 0,Patient_ID,Case_ID,Date_Case,ID_MRI_Machine,MRI_Case_ID
1193,112374,4213315,2016-01-19 07:45:00,MRI3,7279605
1194,153807,4211936,2016-01-12 11:40:00,MRI2,7272498
1195,719666,4180070,2016-01-07 11:45:00,MRI2,7247437
1196,313269,4139115,2016-01-05 07:00:00,MRI3,7272286
1197,637049,4210941,2016-01-04 12:30:00,MRI1,7271507


In [23]:
df_mri['Case_ID'] = df_mri['Case_ID'].replace('-','',regex=True)

In [24]:
df_mri['Case_ID']=df_mri['Case_ID'].astype(int)

## Remove Same-Day MRIs and Keep Only the Newest

In [25]:
n_cases = len(df_mri)
# keep one row per patient and case; max() is evaluated separately for each column
df_mri_clean = df_mri.groupby(["Patient_ID","Case_ID"])[["Date_Case",'ID_MRI_Machine',"MRI_Case_ID"]].max().reset_index()
print(f"{n_cases-len(df_mri_clean)} Cases were deleted, because they were same-day duplicates.")

69 Cases were deleted, because they were same-day duplicates.
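Because `groupby(...).max()` takes the maximum of each column independently, the resulting row can combine values from different MRI records. If complete rows should be preserved, one possible alternative is to sort by date and keep the last row per patient and case. The sketch below uses hypothetical toy data (`toy_mri`) and is not the method used to produce the numbers above.

```python
import pandas as pd

# Hypothetical toy data: patient 1 has two MRI records for the same case
toy_mri = pd.DataFrame({
    "Patient_ID": [1, 1, 2],
    "Case_ID": [10, 10, 20],
    "Date_Case": pd.to_datetime(["2020-01-01 08:00", "2020-01-01 09:00", "2021-05-05 10:00"]),
    "ID_MRI_Machine": ["MRI3", "MRI2", "MRI1"],
    "MRI_Case_ID": [100, 101, 200],
})
keys = ["Patient_ID", "Case_ID"]

# Column-wise maximum: may mix values from different rows
colwise = toy_mri.groupby(keys)[["Date_Case", "ID_MRI_Machine", "MRI_Case_ID"]].max().reset_index()

# Row-wise alternative: keep the complete newest record per patient and case
newest = (toy_mri.sort_values("Date_Case")
          .drop_duplicates(subset=keys, keep="last")
          .reset_index(drop=True))

print(colwise)
print(newest)
```

In the toy data the column-wise variant reports machine `MRI3` for the 09:00 scan, although that scan was recorded on `MRI2`.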


In [26]:
df_mri_clean.to_csv(r'../raw_data/mri_data.csv',index=False)

In [27]:
print("End Clean and Preprocessing mri data")

End Clean and Preprocessing mri data


# Data Cleaning and Selection Lab-data

Now the lab data from the explicit KSA export will be cleaned.

In [28]:
print("Start Clean and Preprocessing lab-data")

Start Clean and Preprocessing lab-data


## Read

In [29]:
lab_data = pd.read_excel("../raw_data/extract_pit.xlsx",
                         usecols=['PATIENT_NR','FALL_NR','Analyse-ID','Resultat','Datum_Resultat']).rename(
                             columns={"PATIENT_NR":"Patient_ID","FALL_NR":"Case_ID","Analyse-ID":"Lab_ID",})

In [30]:
lab_data.columns

Index(['Case_ID', 'Patient_ID', 'Lab_ID', 'Datum_Resultat', 'Resultat'], dtype='object')

### Clean and Select Labs
The export contains a multitude of labs, and we do not need all of them.

In [31]:
# remove not needed labs
lab_data= lab_data[~lab_data['Lab_ID'].isin(['ABTEST','TBILHB'])].copy()

In [32]:
# rename labs which are integer based with a string name
lab_data['Lab_ID'] = lab_data['Lab_ID'].replace({20396:'IGF1',24382:'PROL',24384:'PROL',24383:'PROL'})

In [33]:
# replace some not used characters in the case ids
lab_data['Case_ID'] = lab_data['Case_ID'].replace('#','',regex=True)
lab_data['Case_ID'] = lab_data['Case_ID'].astype(int)

In [34]:
# clean result column
lab_data['Resultat'] = lab_data['Resultat'].replace(',','.',regex=True)
lab_data['Resultat'] = lab_data['Resultat'].replace('>','',regex=True)
lab_data['Resultat'] = lab_data['Resultat'].replace('<','',regex=True)
lab_data['Resultat'] = lab_data['Resultat'].replace('¬†','',regex=True)
lab_data['Resultat']= lab_data['Resultat'].astype(float)

In [35]:
# replace export anomalies (mis-decoded umlauts)
ids = {'Ã¼': 'ü', 'Ã¤': 'ä', "Ã„":"Ä","√§":"ä"}

# skip ID and date columns; regex=True replaces substrings, regex=False would only match whole cells
for column in lab_data.columns[~lab_data.columns.isin(["Case_ID","Patient_ID","Datum_Resultat","Auftragsdatum"])]:
    for old, new in ids.items():
        lab_data[column] = lab_data[column].replace(old, new, regex=True)

# keep only digits and decimal points (e.g. "< 5" -> "5"); values without any digit become ""
clean_result = lambda result: re.sub(r'(?<!\d)\.', '', re.sub(r'[^\d.]', '', str(result)))
lab_data["Resultat"] = lab_data["Resultat"].apply(clean_result)
# remove empty results
lab_data = lab_data[lab_data["Resultat"] != ""]
lab_data["Resultat"] = lab_data["Resultat"].astype(float)
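The result cleaning can also be sketched with `pd.to_numeric`, which turns anything that cannot be parsed into `NaN` instead of raising an error. The example values in `raw` are hypothetical and only illustrate the kinds of entries handled above (decimal commas, `<` and `>` prefixes).

```python
import pandas as pd

# Hypothetical raw result strings
raw = pd.Series(["<0,5", "> 200", "12,3", "n/a", None])

# Decimal comma -> decimal point, then drop comparison signs and whitespace
cleaned = (raw.str.replace(",", ".", regex=False)
              .str.replace(r"[<>\s]", "", regex=True))

# Unparseable entries become NaN and can be dropped afterwards
numeric = pd.to_numeric(cleaned, errors="coerce")
print(numeric.dropna())
```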

In [36]:
# sanity check: all result dates must lie after 1995-01-01
assert lab_data["Datum_Resultat"].min() > pd.to_datetime("1995-01-01")

In [37]:
# average multiple results of the same patient, lab and date
lab_data = lab_data.groupby(["Patient_ID","Lab_ID","Datum_Resultat"])["Resultat"].agg(['mean']).reset_index()

## Merge Cases with Patient Cases

In [38]:
lab_data = pd.merge(lab_data,df_mri_clean.loc[:,["Patient_ID","Case_ID","Date_Case"]],on="Patient_ID",how = "right")
lab_data = lab_data[lab_data["Date_Case"] >= lab_data["Datum_Resultat"]].drop(columns="Date_Case")

In [39]:
# Compute the newest result date for each patient, lab and case
max_dates = lab_data.groupby(['Patient_ID', "Lab_ID","Case_ID"])['Datum_Resultat'].max().reset_index()
# Merge with the original DataFrame to keep only the rows with the newest date
lab_data = pd.merge(lab_data, max_dates, on=['Patient_ID', 'Lab_ID', 'Datum_Resultat',"Case_ID"])
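The effect of the two cells above is that every MRI case gets, per lab, the most recent result that was available on or before the MRI date. A minimal sketch of the same idea on hypothetical toy data (`labs`, `cases` and their values are assumptions), here using `idxmax` instead of a second merge:

```python
import pandas as pd

# Hypothetical toy data: three prolactin results and one MRI case
labs = pd.DataFrame({
    "Patient_ID": [1, 1, 1],
    "Lab_ID": ["PROL", "PROL", "PROL"],
    "Datum_Resultat": pd.to_datetime(["2020-01-01", "2020-03-01", "2020-09-01"]),
    "mean": [10.0, 20.0, 30.0],
})
cases = pd.DataFrame({
    "Patient_ID": [1],
    "Case_ID": [10],
    "Date_Case": pd.to_datetime(["2020-06-01"]),
})

# Attach every lab result to every case of the same patient
merged = labs.merge(cases, on="Patient_ID", how="inner")
# Drop results that were measured after the MRI
before = merged[merged["Datum_Resultat"] <= merged["Date_Case"]]
# Keep the newest remaining result per patient, case and lab
newest_idx = before.groupby(["Patient_ID", "Case_ID", "Lab_ID"])["Datum_Resultat"].idxmax()
latest = before.loc[newest_idx]
print(latest)
```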

In [40]:
# check for any duplicate Values
assert len(lab_data.loc[:,["Case_ID","Lab_ID"]].drop_duplicates()) == len(lab_data)

In [41]:
# make dataframe wide
lab_data = lab_data.pivot(index=["Patient_ID","Case_ID"],values = ['mean'], columns = ['Lab_ID'])
lab_data.columns = lab_data.columns.droplevel()
lab_data = lab_data.reset_index()

### Create LabData from label data

In [42]:
df_additional_lab = (pd.read_csv(r'../raw_data/label_data.csv')
                     .rename(columns={'Cortisol':'COR60','fT4':'FT4','Prolactin':'PROL'})
                     [['Patient_ID','Case_ID','COR60','FT4','PROL','IGF1','Lab_additional']])
# keep only rows that contain at least one lab value
df_additional_lab = df_additional_lab.dropna(subset=['PROL','IGF1','COR60','FT4','Lab_additional'], how='all').reset_index(drop=True)

In [43]:
df_additional_lab.head()

Unnamed: 0,Patient_ID,Case_ID,COR60,FT4,PROL,IGF1,Lab_additional
0,595661,40076776,,,750ug/l,,
1,300071920,40323241,1150.0,11.8,,25.4nmol/l,"FSH 123U/L, LH 35U/L"
2,562753,40377401,704.0,13.0,15,"17,6nmol","Testo 8,3nmol/l"
3,404252,40285535,588.0,23.3,7.4,"11,2nmol",
4,570064,40375426,,,6.5,,


In [44]:
df_additional_lab.tail()

Unnamed: 0,Patient_ID,Case_ID,COR60,FT4,PROL,IGF1,Lab_additional
33,300008318,41467570,,,,,nihil
34,36127,41579190,110.0,7.3,687mU/l,75.4ng/ml,Testo 0.3nmol/l
35,300311713,41683224,,,,,nihil
36,300312446,41718174,271.0,8.4,743mU/l,20.2nmol/l,
37,300228153,41707994,329.0,10.1,173mU/l,6.3nmol/l,Testo 14.1nmol/l


In [45]:
df_additional_lab['Lab_additional'] = df_additional_lab['Lab_additional'].fillna('')
for i in ['Test', 'LH','FSH']:
    df_additional_lab[i] = ''
    # copy the free-text entry into the new column if it mentions this lab
    indices = df_additional_lab[df_additional_lab['Lab_additional'].str.contains(i)].index
    df_additional_lab.loc[indices,i] = df_additional_lab.loc[indices,'Lab_additional']

In [46]:
df_additional_lab = df_additional_lab.drop(columns=['Lab_additional'])
df_additional_lab = df_additional_lab.rename(columns={'Test':'TEST'})

#### Testosterone Cleaning

In [47]:
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r' nmol/l','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r'nmol/l','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r'nmol','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r'Testo','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r'Test','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r',','.',regex=True)
df_additional_lab.loc[df_additional_lab['TEST'] == '', 'TEST'] = np.nan
df_additional_lab['TEST']= df_additional_lab['TEST'].astype(float)

#### LH Cleaning

In [48]:
df_additional_lab.loc[df_additional_lab['LH'] == '', 'LH'] = np.nan
df_additional_lab['LH'] = df_additional_lab['LH'].replace(r'FSH \d*U\/L,','',regex=True)
df_additional_lab['LH'] = df_additional_lab['LH'].replace(r'LH','',regex=True)
df_additional_lab['LH'] = df_additional_lab['LH'].replace(r'U/L','',regex=True)
df_additional_lab['LH']= df_additional_lab['LH'].astype(float)

#### FSH Cleaning

In [49]:
df_additional_lab['FSH'] = df_additional_lab['FSH'].replace(r'FSH','',regex=True)
df_additional_lab['FSH'] = df_additional_lab['FSH'].replace(r'U/L','',regex=True)
df_additional_lab['FSH'] = df_additional_lab['FSH'].replace(r', LH \d*','',regex=True)

df_additional_lab.loc[df_additional_lab['FSH'] == '', 'FSH'] = np.nan
df_additional_lab['FSH']= df_additional_lab['FSH'].astype(float)

#### Cortisol Cleaning 

In [50]:
df_additional_lab['COR60'] = df_additional_lab['COR60'].replace(r' nmol/l','',regex=True)
df_additional_lab['COR60']= df_additional_lab['COR60'].astype(float)

#### FT4 Cleaning

In [51]:
df_additional_lab['FT4'] = df_additional_lab['FT4'].replace(r' pmol/l','',regex=True)
df_additional_lab['FT4']= df_additional_lab['FT4'].astype(float)

#### Prolactin Cleaning and Conversion

In [52]:
df_additional_lab["PROL"]= df_additional_lab["PROL"].str.replace("ug/L","ug/l")
df_additional_lab["PROL"]= df_additional_lab["PROL"].str.replace("mu/L","mU/l")


In [53]:
# get indices which need to be converted
# indices_to_divide = df_additional_lab.loc[df_additional_lab["PROL"].str.contains('mU/l'),'PROL'].index 
indices_to_divide = df_additional_lab[~df_additional_lab["PROL"].isna() & df_additional_lab['PROL'].str.contains('mU/l')].index 
# remove units and strings
df_additional_lab['PROL'] = df_additional_lab['PROL'].str.rstrip(r'mU/l')
df_additional_lab['PROL'] = df_additional_lab['PROL'].str.rstrip(r'ug/l')
df_additional_lab['PROL'] = df_additional_lab['PROL'].str.rstrip(r'ug/L')
df_additional_lab['PROL'] = df_additional_lab['PROL'].astype(float)
# mU/l -> ug/l (mU/l * 0.048)
df_additional_lab.loc[indices_to_divide,'PROL'] = df_additional_lab.loc[indices_to_divide,'PROL'] * 0.048 
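Note that `str.rstrip` removes any trailing characters from the given set rather than a literal suffix; this works here because digits are never part of the set. A more explicit variant separates number and unit with `str.extract`. The sketch below is an illustration on hypothetical values (`prol`), reusing the conversion factor 0.048 from the cell above:

```python
import pandas as pd

# Hypothetical prolactin entries with and without unit
prol = pd.Series(["750ug/l", "687mU/l", "15", None])

# Split each entry into a numeric part and a unit part
parts = prol.str.extract(r"^\s*(?P<value>\d+(?:[.,]\d+)?)\s*(?P<unit>\D*)$")
value = pd.to_numeric(parts["value"].str.replace(",", ".", regex=False), errors="coerce")

# Convert only the entries given in mU/l; all other entries are left unchanged
is_mu = parts["unit"].fillna("").str.lower() == "mu/l"
prol_ug = value.where(~is_mu, value * 0.048)
print(prol_ug)
```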


#### IGF1 Cleaning and Conversion

In [54]:
# df_additional_lab["IGF1"]= df_additional_lab["IGF1"].str.replace("ug/L","ug/l")
df_additional_lab["IGF1"]= df_additional_lab["IGF1"].str.replace("ug/l","")
df_additional_lab["IGF1"]= df_additional_lab["IGF1"].str.replace(",",".")

# get indices which need to be converted
# indices_to_divide = df_additional_lab.loc[df_additional_lab["IGF1"].str.contains('ng/ml'),'IGF1'].index 
indices_to_divide = df_additional_lab[~df_additional_lab["IGF1"].isna() & df_additional_lab['IGF1'].str.contains('ng/ml')].index
# remove units and strings
df_additional_lab['IGF1'] = df_additional_lab['IGF1'].str.rstrip(r'ng/ml')
df_additional_lab['IGF1'] = df_additional_lab['IGF1'].str.rstrip(r'nmol')
df_additional_lab['IGF1'] = df_additional_lab['IGF1'].str.rstrip(r'nmol/l')
df_additional_lab['IGF1'] = df_additional_lab['IGF1'].astype(float)
# ng/ml -> nmol/l (ng/ml * 0.13)
df_additional_lab.loc[indices_to_divide,'IGF1'] = df_additional_lab.loc[indices_to_divide,'IGF1'] * 0.13


In [55]:
# combine additional lab and lab data
lab_data = lab_data.set_index(['Patient_ID','Case_ID']).combine_first(df_additional_lab.set_index(['Patient_ID','Case_ID'])).reset_index()
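`combine_first` gives priority to the calling frame: values from the lab export are kept, and only missing cells, rows or columns are filled from the additional lab data. A small sketch with hypothetical frames (`export` and `manual` are made-up names and values):

```python
import pandas as pd

# Hypothetical stand-ins for the lab export and the manually recorded labs
export = pd.DataFrame({"Patient_ID": [1, 2], "Case_ID": [10, 20], "PROL": [5.0, None]})
manual = pd.DataFrame({"Patient_ID": [2, 3], "Case_ID": [20, 30],
                       "PROL": [7.0, 9.0], "TEST": [1.5, 2.5]})
keys = ["Patient_ID", "Case_ID"]

# Values of `export` win; gaps are filled from `manual`
combined = export.set_index(keys).combine_first(manual.set_index(keys)).reset_index()
print(combined)
```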

In [56]:
# note: 'LH' is listed twice and therefore counts double in the average
spar = round(lab_data[['COR60', 'FSH','LH', 'FT4', 'IGF1', 'LH', 'PROL','TEST']].isna().mean().mean(),3)
print(f"Sparsity of lab data: {spar} (fraction of missing values, only cases with lab values)")
print(f"There are no lab values for {len(df_mri_clean)-len(lab_data)} cases.")

Sparsity of lab data: 0.315 (fraction of missing values, only cases with lab values)
There are no lab values for 509 cases.


In [57]:
lab_data.to_csv(r'../raw_data/lab_data.csv',index=False)

In [58]:
print("End Clean and Preprocessing lab data")

End Clean and Preprocessing lab data


# Full Merge

In [59]:
# keep the MRI cases whose patient ID and case ID also appear in the lab data
df_temp = pd.merge(df_mri_clean,lab_data,on=['Patient_ID','Case_ID'])

In [60]:
# match this data with the patients data
full_merged = pd.merge(df_temp,df_patients[['Patient_ID','Category','Patient_gender','Patient_age','Adenoma_size']+
                                           list(df_patients.columns[df_patients.columns.str.contains('Pre')])+
                                           list(df_patients.columns[df_patients.columns.str.contains('Diagnosis_')])+
                                              ['Operation_date', 'Entry_date']]
                                              ,how='inner',left_on=['Patient_ID'],right_on=['Patient_ID'])
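For a merge like this, pandas can verify the expected key relationship and report which rows found no partner. The following sketch uses hypothetical toy frames (`cases`, `patients`) and the `validate` and `indicator` parameters of `DataFrame.merge`; it is an optional check, not part of the pipeline above.

```python
import pandas as pd

# Hypothetical toy data: several cases per patient, one label row per patient
cases = pd.DataFrame({"Patient_ID": [1, 1, 2, 4], "Case_ID": [10, 11, 20, 40]})
patients = pd.DataFrame({"Patient_ID": [1, 2, 3],
                         "Category": ["prolaktinom", "non-prolaktinom", "prolaktinom"]})

# Raises a MergeError if a patient appears more than once in `patients`
merged = cases.merge(patients, on="Patient_ID", how="inner", validate="many_to_one")

# A left merge with indicator shows which cases have no patient label
audit = cases.merge(patients, on="Patient_ID", how="left", indicator=True)
unmatched = audit.loc[audit["_merge"] == "left_only", "Case_ID"].tolist()
print(len(merged), unmatched)
```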

In [61]:
full_merged = full_merged.sort_values('Patient_ID')

### Delete Cases

No MRI files exist for the following cases, so they are removed.

In [62]:
no_mri_cases = [41040108,40632245]
full_merged = full_merged[~full_merged["Case_ID"].isin(no_mri_cases)]

In [63]:
assert full_merged.duplicated(subset=['Patient_ID','Case_ID']).sum() == 0

In [64]:
full_merged.to_csv(r'../raw_data/data_full_merge.csv',index=False)