# Data Cleaning

This notebook cleans and combines all data sources into one coherent dataset.

We have three different data groups:
* Label & Patient Data
* Lab Data
* MRI Data

All of these have to be cleaned, selected and combined into one dataset.

## Imports

In [1]:
import pandas as pd
import datetime
import numpy as np
import re

# Data Cleaning and Selection Label & Patient-data
First we clean the label and patient data. This covers basic cleaning and data type matching, checking for anomalies, and defining the output format. We use the prepared 'no duplicate PID' sheet with labels from KSA.

In [2]:
print("Start Clean and Preprocessing patients-data")

Start Clean and Preprocessing patients-data


In [3]:
df_patients = pd.read_excel(r'../raw_data/Hypophysenpatienten.xlsx',sheet_name='no duplicate PID')
df_patients.head()

Unnamed: 0,%ID,Fall Nr.,Datum/Zeit,Modalität,Exam Code,Exam Name,Abteilung,Arbeitsplatz.Kürzel,Aufnahmeart,PID,...,ED,OP Datum,Ausfälle post,Diagnose,Kategorie,Patient Alter,Zuweiser,AnforderungDatum,ÜberweiserIntern.Bereich,ÜberweiserIntern.Klinik
0,8921949,41835743,2023-05-11 09:00:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI3,Amb,300146159,...,2021-09-01 00:00:00,2021-09-17,gonado,inaktiv (gonado),non-prolaktinom,57,Hirntumorzen,2023-02-21 14:41:09.0000000,KCH,Hirntumorzentrum
1,8947649,41708812,2023-05-06 08:12:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI3,Amb,762512,...,2018-09-01 00:00:00,2018-09-19,thyreo,gh,non-prolaktinom,66,Endo,2023-03-21 10:12:56.0000000,MUK,Endokrinologie
2,8688141,41892695,2023-05-05 14:19:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI3,Amb,365189,...,,NaT,,,,32,Neurologie,2022-05-05 15:42:32.0000000,MUK,Neurologie
3,8930353,41725372,2023-05-05 07:54:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI2,Amb,543641,...,2006-01-01 00:00:00,2009-06-04,keine,acth,non-prolaktinom,39,Endo,2023-03-02 14:52:07.0000000,MUK,Endokrinologie
4,8930425,41843364,2023-05-02 12:04:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI3,Amb,300302329,...,2021-10-01 00:00:00,2022-01-12,intakt,inaktiv,non-prolaktinom,56,Hirntumorzen,2023-03-02 15:30:07.0000000,KCH,Hirntumorzentrum


In [4]:
df_patients.tail()

Unnamed: 0,%ID,Fall Nr.,Datum/Zeit,Modalität,Exam Code,Exam Name,Abteilung,Arbeitsplatz.Kürzel,Aufnahmeart,PID,...,ED,OP Datum,Ausfälle post,Diagnose,Kategorie,Patient Alter,Zuweiser,AnforderungDatum,ÜberweiserIntern.Bereich,ÜberweiserIntern.Klinik
521,7332424,4191409,2016-05-06 11:50:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI3,Amb,293138,...,,NaT,,,,62,MU_Medizinisches Ambulatorium,2016-03-18 15:04:41.0000000,MUK,Allgemeine Innere und Notfallmedizin
522,7346392,40059944,2016-04-21 12:30:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI2,Amb,509136,...,2008-07-01 00:00:00,NaT,keine,prolaktinom,prolaktinom,27,Nebiker Piera,2016-04-20 08:11:06.0000000,MUK,Endokrinologie
523,7317245,40026051,2016-02-16 16:25:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI3,Stat,682611,...,,NaT,,,,69,MU_101 (Notfallstation),2016-02-16 15:12:05.0000000,INZ,Notfallstation
524,7247536,4153000,2016-02-08 11:20:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI3,Amb,154372,...,2002-11-01 00:00:00,2002-11-02,gonado,prolaktinom,prolaktinom,25,MU_Kinderklinik Ambulatorium,-,FKL,Kinderklinik
525,7271667,4210945,2016-01-19 11:45:00,M-MR,MNDSHYP,MR Hypophyse,MRN,MRI2,Amb,249222,...,2012-01-01 00:00:00,2012-01-09,"gonado, cortico, thyreo, somato",inaktiv,non-prolaktinom,77,Berkmann Sven,2015-11-06 14:25:17.0000000,KCH,Neurochirurgie


In [5]:
df_patients.columns

Index(['%ID', 'Fall Nr.', 'Datum/Zeit', 'Modalität', 'Exam Code', 'Exam Name',
       'Abteilung', 'Arbeitsplatz.Kürzel', 'Aufnahmeart', 'PID', 'Grösse',
       'Ausfälle prä', 'Prolaktin', 'IGF1', 'Cortisol', 'fT4',
       'weiteres Labor', 'Qualität', 'ED', 'OP Datum', 'Ausfälle post',
       'Diagnose', 'Kategorie', 'Patient Alter', 'Zuweiser',
       'AnforderungDatum', 'ÜberweiserIntern.Bereich',
       'ÜberweiserIntern.Klinik'],
      dtype='object')

## Basic Cleaning, Column Selection, Anomaly Correction and Format definition


### Column Selection
First we select only the columns that are valuable for our model or our analysis. Some columns are also renamed to make their content more intuitive.

In [6]:
# define needed columns
column_list = ['PID','Fall Nr.',"Datum/Zeit","Arbeitsplatz.Kürzel",'Grösse',
       'Ausfälle prä', 'Qualität', 'ED','OP Datum',
       'Diagnose', 'Kategorie', 'Patient Alter',
       'Prolaktin',"IGF1", 'Cortisol','fT4','weiteres Labor']
df_patients = df_patients[column_list]
# rename columns
df_patients= df_patients.rename(columns={"Fall Nr.": "Case_ID","PID": "Patient_ID",
                       "Datum/Zeit": "Date_Case","ED": "Entry_date", "OP Datum": "Operation_date",
                       "Arbeitsplatz.Kürzel":"ID_MRI_Machine","Grösse": "Adenoma_size","Qualität": "Label_Quality",
                       "Patient Alter":"Patient_age","Kategorie":"Category","Diagnose":"Diagnosis",
                       "Prolaktin":"Prolactin","weiteres Labor":"Lab_additional"})

### Check for Anomalies and correct them
There are some anomalies, mostly in the datetime columns (e.g. an operation date before the entry date). These are corrected here or were corrected by the KSA after feedback from us.

In [7]:
#TODO: check KSA
# correct values that could not be parsed as datetime
# df_patients.loc[3,'Entry_date'] = datetime.datetime(2006,1,1,0,0,0,0)
# df_patients.loc[12,'Entry_date'] = datetime.datetime(2008,1,1,0,0,0,0)

#TODO: check KSA
# correct a value which is not datetime parseable
# df_patients.loc[df_patients['Operation_date'] == '2006, 2009', 'Operation_date'] = datetime.datetime(2006,1,1,0,0,0,0)
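
One way to locate such values before correcting them by hand is to coerce the column to datetimes and list the entries that fail to parse. This is a minimal sketch on made-up data; the frame `demo` and its values are illustrative assumptions, not rows from the KSA sheet.

```python
import pandas as pd

# Hypothetical example values, not taken from the real sheet
demo = pd.DataFrame({"Entry_date": ["2006-01-01", "Ende 80er", None, "2018-09-01"]})

# errors="coerce" turns every unparseable entry into NaT instead of raising
parsed = pd.to_datetime(demo["Entry_date"], errors="coerce")

# Rows that had a value but could not be parsed need a manual correction
unparseable = demo[parsed.isna() & demo["Entry_date"].notna()]
print(unparseable)
```

Entries that were empty from the start are excluded, so only genuine free-text dates are listed.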

In [None]:
#TODO: check KSA
# show the row whose entry date is the free-text value 'Ende 80er' (late 1980s)
df_patients[df_patients['Entry_date'] == 'Ende 80er']

     Patient_ID   Case_ID           Date_Case ID_MRI_Machine Adenoma_size  \
340       45424  40817457 2019-11-02 10:53:00           MRI3        mikro   

         Ausfälle prä Label_Quality Entry_date Operation_date    Diagnosis  \
340  Hyperprolaktinom           NaN  Ende 80er            NaT  prolaktinom   

        Category  Patient_age Prolactin IGF1 Cortisol  fT4 Lab_additional  
340  prolaktinom           58       NaN  NaN      NaN  NaN            NaN  

In [8]:
# 'Ende 80er' cannot be parsed as a date; it is approximated as 1990-01-01
df_patients.loc[df_patients['Entry_date'] == 'Ende 80er', 'Entry_date'] = datetime.datetime(1990,1,1,0,0,0,0)


In [9]:
# TODO: anomaly? check KSA
# rows where Entry Date is after Operationdate?
df_patients[df_patients['Operation_date'] < df_patients['Entry_date']][['Entry_date','Operation_date']]

Unnamed: 0,Entry_date,Operation_date


### Data Type Definition
Now we check the column data types and parse them into their respective type if they are not already correct.


In [10]:
# make datetime values
df_patients["Date_Case"] = pd.to_datetime(df_patients["Date_Case"])
df_patients["Entry_date"] = pd.to_datetime(df_patients["Entry_date"])
df_patients["Operation_date"] = pd.to_datetime(df_patients["Operation_date"])

In [11]:
# set category data type in pandas, check datatypes
df_patients['ID_MRI_Machine'] = df_patients['ID_MRI_Machine'].astype('category')
df_patients['Adenoma_size'] = df_patients['Adenoma_size'].astype('category')
df_patients['Label_Quality'] = df_patients['Label_Quality'].astype('category')
df_patients['Diagnosis'] = df_patients['Diagnosis'].astype('category')
df_patients['Category'] = df_patients['Category'].astype('category')
df_patients.dtypes

Patient_ID                 int64
Case_ID                    int64
Date_Case         datetime64[ns]
ID_MRI_Machine          category
Adenoma_size            category
Ausfälle prä              object
Label_Quality           category
Entry_date        datetime64[ns]
Operation_date    datetime64[ns]
Diagnosis               category
Category                category
Patient_age                int64
Prolactin                 object
IGF1                      object
Cortisol                  object
fT4                       object
Lab_additional            object
dtype: object

### Check Duplicates
Check if a patient is duplicated.


In [12]:
# Patient ID Duplicate Check
assert len(df_patients[df_patients["Patient_ID"].duplicated()]) == 0
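
If the assertion ever fails, it helps to see which rows are involved. The following sketch uses a small made-up frame (`demo` is an assumption for illustration) to list every occurrence of a repeated ID.

```python
import pandas as pd

# Hypothetical example data
demo = pd.DataFrame({"Patient_ID": [101, 102, 101, 103], "Case_ID": [1, 2, 3, 4]})

# keep=False marks every occurrence of a repeated ID, not only the later ones
dupes = demo[demo["Patient_ID"].duplicated(keep=False)]
print(dupes)
```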

### Correct Spelling Mistakes

In [13]:
# correct misspellings made by the labelers
# "intak" -> "intakt" also turns a correct "intakt" into "intaktt"; the second replace repairs that
df_patients["Ausfälle prä"]= df_patients["Ausfälle prä"].str.replace("intak","intakt")
df_patients["Ausfälle prä"]= df_patients["Ausfälle prä"].str.replace("intaktt","intakt")
df_patients["Ausfälle prä"]= df_patients["Ausfälle prä"].str.replace("goando","gonado")
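
A possible alternative is a single correction table applied with whole-word matching, which avoids touching spellings that are already correct. This is only a sketch: the `corrections` dictionary is an assumption and would need to be extended with the misspellings actually present in the data.

```python
import re
import pandas as pd

# Hypothetical correction table (misspelling -> intended spelling)
corrections = {"intak": "intakt", "goando": "gonado"}

# \b...\b matches whole words only, so a correct "intakt" is left untouched
pattern = re.compile(r"\b(" + "|".join(map(re.escape, corrections)) + r")\b")

demo = pd.Series(["intak", "intakt", "goando, cortico", None])
fixed = demo.str.replace(pattern, lambda m: corrections[m.group(0)], regex=True)
print(fixed.tolist())
```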

## One Hot Encode Categorical Values

To use and analyse the categorical data, we need to one-hot encode it. We split the comma-separated strings into individual values and create a one-hot encoded column for each value. These columns are then added to the original dataframe.

In [14]:
# Split the 'Ausfälle prä' column into separate strings
df_patients['Ausfälle prä'] = df_patients['Ausfälle prä'].str.split(', ')
# Create a set to store all unique disfunctions
unique_disfunctions = set()

# Iterate over the 'Ausfälle prä' column to gather unique disfunctions
for value in df_patients['Ausfälle prä']:
    if isinstance(value, list):
        unique_disfunctions.update(value)
    elif isinstance(value, str):
        unique_disfunctions.add(value)

# Iterate over the unique disfunctions and create one-hot encoded columns
for disfunction in unique_disfunctions:
    df_patients["Pre_OP_hormone_"+ disfunction] = df_patients['Ausfälle prä'].apply(lambda x: 1 if (isinstance(x, list) and disfunction in x) or (x == disfunction) else 0)
# drop the original 'Ausfälle prä' column
df_patients = df_patients.drop('Ausfälle prä', axis=1)
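
For comparison, pandas offers `Series.str.get_dummies`, which performs the split and the encoding in one step. The sketch below runs on made-up values and is not a drop-in replacement for the cell above; the `demo` series is an assumption for illustration.

```python
import pandas as pd

# Hypothetical example values in the same comma-separated format
demo = pd.Series(["gonado, cortico", "keine", None, "gonado"])

# str.get_dummies splits on the separator and builds one 0/1 column per value
one_hot = demo.str.get_dummies(sep=", ").add_prefix("Pre_OP_hormone_")
print(one_hot)
```

Missing entries produce a row of zeros, which matches the behaviour of the loop-based encoding.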

In [15]:
df_patients.columns

Index(['Patient_ID', 'Case_ID', 'Date_Case', 'ID_MRI_Machine', 'Adenoma_size',
       'Label_Quality', 'Entry_date', 'Operation_date', 'Diagnosis',
       'Category', 'Patient_age', 'Prolactin', 'IGF1', 'Cortisol', 'fT4',
       'Lab_additional', 'Pre_OP_hormone_',
       'Pre_OP_hormone_hyperprolaktin stressbedingt', 'Pre_OP_hormone_somato',
       'Pre_OP_hormone_prolaktin', 'Pre_OP_hormone_gonado',
       'Pre_OP_hormone_hormonelle Defizite auf diversen Achsen',
       'Pre_OP_hormone_hypothyreo', 'Pre_OP_hormone_hyperprolkatin',
       'Pre_OP_hormone_Hyperprolaktinom', 'Pre_OP_hormone_hyperprolakin',
       'Pre_OP_hormone_SIADH', 'Pre_OP_hormone_keine',
       'Pre_OP_hormone_hypogonado', 'Pre_OP_hormone_hyperprolaktin',
       'Pre_OP_hormone_gonato', 'Pre_OP_hormone_ADH',
       'Pre_OP_hormone_somatotrop',
       'Pre_OP_hormone_stressbedinge hyperprolakin',
       'Pre_OP_hormone_Hyperprolaktin', 'Pre_OP_hormone_cortico',
       'Pre_OP_hormone_gondao', 'Pre_OP_hormone_coritc

## Remove Missing and Unneeded Labels

In [16]:
# remove all rows whose label is neither prolaktinom nor non-prolaktinom
df_patients = df_patients[df_patients['Category'].isin(['non-prolaktinom','prolaktinom'])]
assert len(df_patients['Category'].unique()) == 2

In [17]:
df_patients.to_csv(r'../raw_data/label_data.csv',index=False)

In [18]:
print("End Clean and Preprocessing patient data")

End Clean and Preprocessing patient data


# Data Cleaning and Selection MRI-data

Now we clean the MRI data.

In [19]:
print("Start Clean and Preprocessing mri data")

Start Clean and Preprocessing mri data


### Column Selection
Select only the columns that are relevant for the MRIs.

In [20]:
column_list_mri = ['PID','Fall Nr.',"Datum/Zeit","Arbeitsplatz.Kürzel"]

In [21]:
df_mri = pd.read_excel(r'../raw_data/Hypophysenpatienten.xlsx',sheet_name='w duplicates')
# select and rename columns
df_mri = df_mri[column_list_mri]
df_mri= df_mri.rename(columns={"Fall Nr.": "Case_ID","PID": "Patient_ID",
                       "Datum/Zeit": "Date_Case","Arbeitsplatz.Kürzel":"ID_MRI_Machine"})

In [22]:
df_mri.head()

Unnamed: 0,Patient_ID,Case_ID,Date_Case,ID_MRI_Machine
0,300146159,41835743,2023-05-11 09:00:00,MRI3
1,762512,41708812,2023-05-06 08:12:00,MRI3
2,365189,41892695,2023-05-05 14:19:00,MRI3
3,543641,41725372,2023-05-05 07:54:00,MRI2
4,300302329,41843364,2023-05-02 12:04:00,MRI3


In [23]:
df_mri['Case_ID'] = df_mri['Case_ID'].replace('-','',regex=True)

In [24]:
df_mri['Case_ID']=df_mri['Case_ID'].astype(int)

## Remove Same-Day MRIs and Keep Only the Newest

In [25]:
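# keep one row per patient and case, taking the maximum of Date_Case and ID_MRI_Machine
df_mri_clean = df_mri.groupby(["Patient_ID","Case_ID"])[["Date_Case",'ID_MRI_Machine']].max().reset_index()

# group again and report how many rows this second pass removes
# (n_cases is counted after the first groupby, so this pass cannot remove anything)
n_cases = len(df_mri_clean)
df_mri_clean = df_mri_clean.groupby(["Patient_ID","Case_ID"])[["Date_Case",'ID_MRI_Machine']].max().reset_index()
print(f"{n_cases-len(df_mri_clean)} Cases were deleted, because they were same-day duplicates.")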

0 Cases were deleted, because they were same-day duplicates.
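
Note that `groupby(...).max()` takes the maximum of each column separately, so the date and the machine ID can come from different rows. A row-wise alternative is sketched below on made-up data (the frame `demo` is an assumption): sort by time and keep the last row per patient and case, counting the rows before the deduplication.

```python
import pandas as pd

# Hypothetical example: patient 1 has two scans in the same case
demo = pd.DataFrame({
    "Patient_ID": [1, 1, 2],
    "Case_ID": [10, 10, 20],
    "Date_Case": pd.to_datetime(["2023-05-01 08:00", "2023-05-01 12:00", "2023-05-02 09:00"]),
    "ID_MRI_Machine": ["MRI2", "MRI3", "MRI3"],
})

n_before = len(demo)
# Sort by time, then keep the last (newest) row per patient and case
newest = (demo.sort_values("Date_Case")
              .drop_duplicates(["Patient_ID", "Case_ID"], keep="last")
              .reset_index(drop=True))
print(f"{n_before - len(newest)} rows removed")
```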


In [26]:
df_mri_clean.to_csv(r'../raw_data/mri_data.csv',index=False)

In [27]:
print("End Clean and Preprocessing mri data")

End Clean and Preprocessing mri data


# Data Cleaning and Selection Lab-data

Now the lab data from the explicit KSA export will be cleaned.

In [28]:
print("Start Clean and Preprocessing lab-data")

Start Clean and Preprocessing lab-data


## Read

In [29]:
lab_data = pd.read_excel("../raw_data/extract_pit.xlsx",
                         usecols=['PATIENT_NR','FALL_NR','Analyse-ID','Resultat','Datum_Resultat']).rename(
                             columns={"PATIENT_NR":"Patient_ID","FALL_NR":"Case_ID","Analyse-ID":"Lab_ID",})

In [30]:
lab_data.columns

Index(['Case_ID', 'Patient_ID', 'Lab_ID', 'Datum_Resultat', 'Resultat'], dtype='object')

### Clean and select Labs
The export contains a multitude of lab analyses. We do not need all of them.

In [31]:
# remove not needed labs
lab_data= lab_data[~lab_data['Lab_ID'].isin(['ABTEST','TBILHB'])].copy()

In [32]:
# rename labs which are integer based with a string name
lab_data['Lab_ID'] = lab_data['Lab_ID'].replace({20396:'IGF1',24382:'PROL',24384:'PROL',24383:'PROL'})

In [33]:
lab_data['Case_ID']= lab_data['Case_ID'].replace('#','',regex=True)
lab_data['Case_ID']=lab_data['Case_ID'].astype(int)

In [34]:
# replace export anomalies 
ids = {'Ã¼': 'ü', 'Ã¤': 'ä', "Ã„":"Ä","√§":"ä"}

for column in lab_data.columns[~lab_data.columns.isin(["Case_ID","Patient_ID","Datum_Resultat","Auftragsdatum"])]:
    for old, new in ids.items():
        lab_data[column] = lab_data[column].str.replace(old, new, regex=False)

# strip the greater-than and less-than characters (and any other non-numeric text) with regex
clean_result = lambda result: re.sub(r'(?<!\d)\.', '', re.sub(r'[^\d.]', '', str(result))) # cleans "< number" / "> number" / "1 A number"
lab_data["Resultat"] = lab_data["Resultat"].apply(clean_result) 
# remove empty results
lab_data = lab_data[lab_data["Resultat"] != ""]
lab_data["Resultat"] = lab_data["Resultat"].astype(float)
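
To make the effect of the two nested substitutions easier to follow, here is the same logic written as a named function and applied to a few made-up inputs. The sample strings are assumptions for illustration, not values from the export.

```python
import re

def clean_result(result):
    """Keep digits and decimal points, then drop any point not preceded by a digit."""
    digits_only = re.sub(r"[^\d.]", "", str(result))
    return re.sub(r"(?<!\d)\.", "", digits_only)

# Hypothetical raw values
for raw in ["< 0.5", "> 100", "12.3", "n.a.", "1 A 23"]:
    print(repr(raw), "->", repr(clean_result(raw)))
```

Text without any digit collapses to an empty string, which is why the empty results are filtered out afterwards. Digits separated by text are concatenated, so an input such as `"1 A 23"` becomes `"123"`.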

In [35]:
# check if the datetime was correctly fixed
assert lab_data["Datum_Resultat"].min() > pd.to_datetime("1995-01-01")

In [36]:
# mean of results of same date
lab_data = lab_data.groupby(["Patient_ID","Lab_ID","Datum_Resultat"])["Resultat"].agg(['mean']).reset_index()

## Merge Cases with Patient Cases

In [37]:
lab_data = pd.merge(lab_data,df_mri_clean,on="Patient_ID",how = "right")
lab_data = lab_data[lab_data["Date_Case"] >= lab_data["Datum_Resultat"]].drop(columns="Date_Case")

In [38]:
# Compute the newest date for each patient, analysis and case
max_dates = lab_data.groupby(['Patient_ID', "Lab_ID","Case_ID"])['Datum_Resultat'].max().reset_index()
# Merge with the original DataFrame to keep only the rows with the newest dates
lab_data = pd.merge(lab_data, max_dates, on=['Patient_ID', 'Lab_ID', 'Datum_Resultat',"Case_ID"])
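
The selection of the newest lab result on or before each MRI can be illustrated on a small made-up example. The frames `labs` and `cases` below are assumptions; the sketch uses sorting and `drop_duplicates` instead of the second merge, which is one of several equivalent ways to do it.

```python
import pandas as pd

# Hypothetical lab results for one patient
labs = pd.DataFrame({
    "Patient_ID": [1, 1, 1],
    "Lab_ID": ["PROL", "PROL", "PROL"],
    "Datum_Resultat": pd.to_datetime(["2023-01-10", "2023-04-20", "2023-06-01"]),
    "mean": [12.0, 15.0, 9.0],
})
# Hypothetical MRI case for the same patient
cases = pd.DataFrame({
    "Patient_ID": [1],
    "Case_ID": [10],
    "Date_Case": pd.to_datetime(["2023-05-01"]),
})

merged = labs.merge(cases, on="Patient_ID")
# Drop results measured after the MRI
merged = merged[merged["Datum_Resultat"] <= merged["Date_Case"]]
# Keep the newest remaining result per case and analysis
latest = (merged.sort_values("Datum_Resultat")
                .drop_duplicates(["Patient_ID", "Case_ID", "Lab_ID"], keep="last"))
print(latest[["Case_ID", "Lab_ID", "Datum_Resultat", "mean"]])
```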

In [39]:
# check for any duplicate Values
assert len(lab_data.loc[:,["Case_ID","Lab_ID"]].drop_duplicates()) == len(lab_data)

In [40]:
# make dataframe wide
lab_data = lab_data.pivot(index=["Patient_ID","Case_ID"],values = ['mean'], columns = ['Lab_ID'])
lab_data.columns = lab_data.columns.droplevel()
lab_data = lab_data.reset_index()
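
The long-to-wide step can be seen in isolation on a tiny made-up table (`long_df` is an assumption for illustration). Passing `values` as a single column name yields flat column labels, so no `droplevel` is needed in this variant.

```python
import pandas as pd

# Hypothetical long-format lab data: one row per case and analysis
long_df = pd.DataFrame({
    "Patient_ID": [1, 1, 2],
    "Case_ID": [10, 10, 20],
    "Lab_ID": ["PROL", "IGF1", "PROL"],
    "mean": [15.0, 22.5, 7.0],
})

# One row per case, one column per analysis; missing combinations become NaN
wide = long_df.pivot(index=["Patient_ID", "Case_ID"], columns="Lab_ID", values="mean").reset_index()
wide.columns.name = None
print(wide)
```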

### Create LabData from label data

In [41]:
df_additional_lab = pd.read_csv(r'../raw_data/label_data.csv').rename(columns={'Cortisol':'COR60','fT4':'FT4','Prolactin':'PROL'})[['Patient_ID','Case_ID','COR60','FT4','PROL','IGF1','Lab_additional']]
df_additional_lab.columns
df_additional_lab = df_additional_lab.dropna(subset=['PROL','IGF1','COR60','FT4',]).reset_index(drop=True)


In [42]:
df_additional_lab['Lab_additional'] =df_additional_lab['Lab_additional'].fillna('')
for i in ['Test', 'LH','FSH']:
    df_additional_lab[i] = ''
    indices = df_additional_lab[df_additional_lab['Lab_additional'].str.contains(i)].index
    df_additional_lab.loc[indices,i]  = df_additional_lab.iloc[indices]['Lab_additional']



In [43]:
df_additional_lab= df_additional_lab.drop(columns=['Lab_additional'])
df_additional_lab= df_additional_lab.rename(columns={'Test':'TEST'})

In [44]:
df_additional_lab

Unnamed: 0,Patient_ID,Case_ID,COR60,FT4,PROL,IGF1,TEST,LH,FSH
0,300228153,41707994,329,10.1,173mU/l,6.3nmol/l,Testo 14.1nmol/l,,
1,300312446,41718174,271,8.4,743mU/l,20.2nmol/l,,,
2,36127,41579190,110,7.3,687mU/l,75.4ng/ml,Testo 0.3nmol/l,,
3,300291886,41169249,311 nmol/l,14.6 pmol/l,7.8 ug/l,208 ng/ml,,,
4,560863,40469555,607,11.4,269ug/l,22.7nmol,,,
5,459429,40603831,703,13.2,7ug/l,32.5nmol/l,,,
6,17081,40573077,766,14.3,13.5ug/L,16.6nmol,,,
7,112374,40541632,334,11,381ug/l,15.9nmol,Testo 3.8nmol,,
8,113792,40525843,1380,11.8,366ug/l,14.9nmol,,,FSH 0.4U/L
9,242880,40419128,1213,17,954ug/l,22nmol,,,FSH 0.7U/L


#### Missing Rows   

In [45]:
#TODO: ask Tristan
df_additional_lab[df_additional_lab['Patient_ID'] ==300071920]

Unnamed: 0,Patient_ID,Case_ID,COR60,FT4,PROL,IGF1,TEST,LH,FSH
21,300071920,40323241,1150,1900-01-11 19:12:00,missing,25.4nmol/l,,"FSH 123U/L, LH 35U/L","FSH 123U/L, LH 35U/L"


In [46]:
df_additional_lab= df_additional_lab.drop(df_additional_lab[df_additional_lab['Patient_ID'] ==300071920].index)

#### Testosterone Cleaning

In [47]:
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r' nmol/l','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r'nmol/l','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r'nmol','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r'Testo','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r'Test','',regex=True)
df_additional_lab['TEST'] = df_additional_lab['TEST'].replace(r',','.',regex=True)

In [48]:
df_additional_lab.loc[df_additional_lab['TEST'] == '', 'TEST'] = np.nan

In [49]:
df_additional_lab['TEST']= df_additional_lab['TEST'].astype(float)

#### LH Cleaning

In [50]:
df_additional_lab.loc[df_additional_lab['LH'] == '', 'LH'] = np.nan

#### FSH Cleaning

In [51]:
df_additional_lab['FSH'] = df_additional_lab['FSH'].replace(r'FSH','',regex=True)
df_additional_lab['FSH'] = df_additional_lab['FSH'].replace(r'U/L','',regex=True)

In [52]:
df_additional_lab.loc[df_additional_lab['FSH'] == '', 'FSH'] = np.nan

#### Cortisol Cleaning 

In [53]:
df_additional_lab['COR60'] = df_additional_lab['COR60'].replace(r' nmol/l','',regex=True)

#### FT4 Cleaning

In [54]:
df_additional_lab['FT4'] = df_additional_lab['FT4'].replace(r' pmol/l','',regex=True)

#### Prolactin Cleaning and Conversion

In [55]:
df_additional_lab["PROL"]= df_additional_lab["PROL"].str.replace("ug/L","ug/l")
df_additional_lab["PROL"]= df_additional_lab["PROL"].str.replace("mu/L","mU/l")

In [56]:
# get indices which need to be converted
indices_to_divide = df_additional_lab.loc[df_additional_lab["PROL"].str.contains('mU/l'),'PROL'].index 
# remove units and strings
# remove the trailing unit (letters and '/'); str.rstrip() strips a set of characters, not a suffix
df_additional_lab['PROL'] = df_additional_lab['PROL'].str.replace(r'\s*[A-Za-z/]+\s*$', '', regex=True)
df_additional_lab['PROL'] = df_additional_lab['PROL'].astype(float)
# mU/l -> ug/l (mU/l * 0.048)
df_additional_lab.loc[indices_to_divide,'PROL'] = df_additional_lab.loc[indices_to_divide,'PROL'] * 0.048 
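
The unit handling can also be expressed by splitting each entry into a number and a unit first. The sketch below works on made-up strings in the formats seen above; the series `raw`, the regular expression and the constant name `MU_TO_UG` are assumptions, while the factor 0.048 is the one used in this notebook.

```python
import pandas as pd

# Hypothetical raw prolactin entries
raw = pd.Series(["173mU/l", "7.8 ug/l", "13.5ug/L", "missing"])

# Split each entry into a number and a unit; non-matching entries become NaN
parts = raw.str.extract(r"^\s*(?P<value>\d+(?:[.,]\d+)?)\s*(?P<unit>[A-Za-z/]+)\s*$")
value = pd.to_numeric(parts["value"].str.replace(",", ".", regex=False))
unit = parts["unit"].str.lower()

# Factor taken from the notebook: mU/l * 0.048 -> ug/l
MU_TO_UG = 0.048
prol_ug = value.mask(unit.isin(["mu/l"]), value * MU_TO_UG)
print(prol_ug.tolist())
```

Keeping the unit in its own column makes it explicit which rows are converted, and entries without a recognisable number stay missing instead of raising an error.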


#### IGF1 Cleaning and Conversion

In [57]:
df_additional_lab["IGF1"]= df_additional_lab["IGF1"].str.replace("ug/L","ug/l")
df_additional_lab["IGF1"]= df_additional_lab["IGF1"].str.replace(",",".")

# get indices which need to be converted
indices_to_divide = df_additional_lab.loc[df_additional_lab["IGF1"].str.contains('ng/ml'),'IGF1'].index 
# remove units and strings
# remove the trailing unit (letters and '/'); str.rstrip() strips a set of characters, not a suffix
df_additional_lab['IGF1'] = df_additional_lab['IGF1'].str.replace(r'\s*[A-Za-z/]+\s*$', '', regex=True)
df_additional_lab['IGF1'] = df_additional_lab['IGF1'].astype(float)
# ng/ml -> nmol/l (ng/ml * 0.13)
df_additional_lab.loc[indices_to_divide,'IGF1'] = df_additional_lab.loc[indices_to_divide,'IGF1'] * 0.13


In [58]:
# combine additional lab and lab data
lab_data = lab_data.set_index(['Patient_ID','Case_ID']).combine_first(df_additional_lab.set_index(['Patient_ID','Case_ID'])).reset_index()
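
`combine_first` keeps the values of the calling frame and fills only its gaps and missing rows from the other frame. A small made-up example (the frames `export` and `manual` are assumptions) shows the effect:

```python
import pandas as pd

idx = ["Patient_ID", "Case_ID"]
# Hypothetical values from the lab export; case 20 has no prolactin value
export = pd.DataFrame({"Patient_ID": [1, 2], "Case_ID": [10, 20], "PROL": [15.0, None]}).set_index(idx)
# Hypothetical values from the label sheet
manual = pd.DataFrame({"Patient_ID": [2, 3], "Case_ID": [20, 30], "PROL": [7.0, 9.0]}).set_index(idx)

# Values from `export` win; gaps and missing rows are filled from `manual`
combined = export.combine_first(manual).reset_index()
print(combined)
```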

In [59]:
spar = round(lab_data[['COR60', 'FSH','LH', 'FT4', 'IGF1', 'LH', 'PROL','TEST']].isna().mean().mean(),3)
print(f"Sparsity of lab data: {spar} (fraction of missing values, only cases with lab values)")
print(f"{len(df_mri_clean)-len(lab_data)} cases have no lab values.")

Sparsity of lab data: 0.31 (fraction of missing values, only cases with lab values)
519 cases have no lab values.


In [60]:
lab_data.to_csv(r'../raw_data/lab_data.csv',index=False)

In [61]:
print("End Clean and Preprocessing lab data")

End Clean and Preprocessing lab data


# Full Merge

In [62]:
# get only newest mri
# df_mri_clean = df_mri.sort_values(['Patient_ID','Date_Case'],ascending=False).drop_duplicates('Patient_ID')

In [63]:
df_temp = pd.merge(df_mri_clean,lab_data,on=['Patient_ID','Case_ID'])

In [64]:
full_merged = pd.merge(df_temp,df_patients[['Patient_ID', 'Entry_date', 'Operation_date', 'Adenoma_size',
       'Diagnosis', 'Category', 'Patient_age','Pre_OP_hormone_cortico',
       'Pre_OP_hormone_gonado', 'Pre_OP_hormone_somato',
       'Pre_OP_hormone_thyreo', 'Pre_OP_hormone_hyperprolaktin',
       'Pre_OP_hormone_keine', 'Pre_OP_hormone_intakt', 'Label_Quality']],how='inner',left_on=['Patient_ID'],right_on=['Patient_ID'])

In [65]:
full_merged = full_merged.sort_values('Patient_ID')

In [66]:
assert full_merged.duplicated(subset=['Patient_ID','Case_ID']).sum() == 0
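
The uniqueness assumption can also be checked during the merge itself with the `validate` argument of `DataFrame.merge`. A minimal sketch on made-up frames (`cases` and `patients` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical cases (several per patient) and patient labels (one per patient)
cases = pd.DataFrame({"Patient_ID": [1, 1, 2], "Case_ID": [10, 11, 20]})
patients = pd.DataFrame({"Patient_ID": [1, 2], "Category": ["prolaktinom", "non-prolaktinom"]})

# validate="many_to_one" raises a MergeError if a Patient_ID repeats on the right side
merged = cases.merge(patients, on="Patient_ID", how="inner", validate="many_to_one")
assert not merged.duplicated(subset=["Patient_ID", "Case_ID"]).any()
print(merged)
```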

In [67]:
full_merged.to_csv(r'../raw_data/data_full_merge.csv',index=False)