<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Version-Control" data-toc-modified-id="Version-Control-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Version Control</a></span></li><li><span><a href="#Import-Data" data-toc-modified-id="Import-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import Data</a></span></li><li><span><a href="#Exploration-of-df" data-toc-modified-id="Exploration-of-df-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Exploration of df</a></span></li><li><span><a href="#Preparation-for-laboratory-test-items" data-toc-modified-id="Preparation-for-laboratory-test-items-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Preparation for laboratory test items</a></span></li><li><span><a href="#Bin-the-lab-measurement-data-from-df-to-the-right-columns-in-df2" data-toc-modified-id="Bin-the-lab-measurement-data-from-df-to-the-right-columns-in-df2-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Bin the lab measurement data from df to the right columns in df2</a></span></li><li><span><a href="#Integrate-members'-gender-into-lab-dataset" data-toc-modified-id="Integrate-members'-gender-into-lab-dataset-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Integrate members' gender into lab dataset</a></span></li><li><span><a href="#Save-Lab-dataset-as-output-file" data-toc-modified-id="Save-Lab-dataset-as-output-file-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Save Lab dataset as output file</a></span></li></ul></div>

# Capstone (WhiteCoat) Lab Data Integration

## Version Control  

Prepared by: Lien Wee Liang (NUS-ISS)    

v1.0: 07 Aug 2022: First release

Description:  
This code takes the raw tbl_Lab_Reports dataframe and deduplicates it so that each row of the output dataframe holds one unique lab report number.  
All laboratory measurements are re-tabulated into columns named 'category|line_item|unit'.  
Each member's gender is also extracted from tbl_Patient and merged into the output file, as it is required for EDA in a subsequent program.  


## Import Data

In [1]:
import pickle
import pandas as pd
import numpy as np
#import string

#pd.set_option('display.max_colwidth', None)
#pd.set_option("display.max_rows", 1000)

In [2]:
INPUTFILE_Lab = "tbl_Lab_Reports.csv"
INPUTFILE_Mem = "tbl_Patient.csv"
OUTPUTFILE_Lab = "ByLabRptNumRaw_v0p0"

In [3]:
# Import Lab data 
print("Lab Report Input File:", INPUTFILE_Lab) 

# If csv format
df = pd.read_csv(INPUTFILE_Lab)

# If pkl format
#with open(INPUTFILE_Lab, 'rb') as pklfile:
#        df = pickle.load(pklfile)

Lab Report Input File: tbl_Lab_Reports.csv


In [4]:
# Import Member gender data
print("Member Input File:", INPUTFILE_Mem) 

# If csv format
df_mem = pd.read_csv(INPUTFILE_Mem, header=0, usecols = ['mem_id','gender'])

# If pkl format
#with open(INPUTFILE_Mem, 'rb') as pklfile:
#        df_mem = pickle.load(pklfile)

Member Input File: tbl_Patient.csv


## Exploration of df

In [5]:
#df.info()

In [6]:
#df.columns.tolist()

In [7]:
# Show the numeric and non-numeric columns
#print("numeric", df.select_dtypes(include=[np.number]).columns.values, "\n")
#print("non-numeric", df.select_dtypes(exclude=[np.number]).columns.values, "\n")

In [8]:
#print("Data Header Samples")
#df.head()

## Preparation for laboratory test items

In [9]:
df_sorted = df.sort_values(['category','line_item','unit'])

In [10]:
df_sorted_unique = df_sorted[['category','line_item','unit']].drop_duplicates()
print(df_sorted_unique.head(120))

print("\nTotal number of category+line_item+unit:\n", df_sorted_unique.count())

                 category                      line_item   unit
31    BONE/JOINT FUNCTION                        Calcium  mg/dL
30    BONE/JOINT FUNCTION                      Phosphate  mg/dL
33    BONE/JOINT FUNCTION              Rheumatoid Factor  IU/mL
32    BONE/JOINT FUNCTION                      Uric Acid  mg/dL
2751      CARDIAC MARKERS                             CK    U/L
...                   ...                            ...    ...
90       URINE MICROSCOPY                         Others    NaN
83       URINE MICROSCOPY                Red Blood Cells    /uL
82       URINE MICROSCOPY              White Blood Cells    /uL
87       URINE MICROSCOPY                          Yeast    NaN
561         VITAMIN STUDY  Vitamin D Total (Immunoassay)  ng/mL

[109 rows x 3 columns]

Total number of category+line_item+unit:
 category     109
line_item    109
unit          77
dtype: int64
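
The `unit` count (77) is lower than the row count (109) because `DataFrame.count()` skips missing values, so 32 of the combinations have no unit. A minimal sketch on made-up rows (the values are illustrative, not taken from tbl_Lab_Reports) shows how `count()` and `len()` differ:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lab report table (illustrative values only)
toy = pd.DataFrame({
    "category": ["URINE MICROSCOPY", "URINE MICROSCOPY",
                 "BONE/JOINT FUNCTION", "BONE/JOINT FUNCTION"],
    "line_item": ["Yeast", "Red Blood Cells", "Calcium", "Calcium"],
    "unit": [np.nan, "/uL", "mg/dL", "mg/dL"],
})

toy_unique = toy.drop_duplicates()

n_rows = len(toy_unique)                               # every row, missing unit or not
non_null = toy_unique.count()                          # non-missing values per column
n_missing_unit = int(toy_unique["unit"].isna().sum())  # rows without a unit

print(n_rows, non_null["unit"], n_missing_unit)
```

If the goal is the number of combinations, `len(df_sorted_unique)` is the more direct measure; `count()` is useful when the per-column gaps matter.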


In [11]:
# Build one column name per unique test, in the form 'category|line_item|unit'
# (astype(str) turns a missing unit into the literal string 'nan')
df_sorted_unique['unit'] = df_sorted_unique['unit'].astype(str)
new_col = df_sorted_unique.apply(lambda row: row['category'] + '|' + row['line_item'] + '|' + row['unit'], axis=1)

In [12]:
print(new_col)
print("\nTotal number of new columns:\n",new_col.count())

31                      BONE/JOINT FUNCTION|Calcium|mg/dL
30                    BONE/JOINT FUNCTION|Phosphate|mg/dL
33            BONE/JOINT FUNCTION|Rheumatoid Factor|IU/mL
32                    BONE/JOINT FUNCTION|Uric Acid|mg/dL
2751                               CARDIAC MARKERS|CK|U/L
                              ...                        
90                            URINE MICROSCOPY|Others|nan
83                   URINE MICROSCOPY|Red Blood Cells|/uL
82                 URINE MICROSCOPY|White Blood Cells|/uL
87                             URINE MICROSCOPY|Yeast|nan
561     VITAMIN STUDY|Vitamin D Total (Immunoassay)|ng/mL
Length: 109, dtype: object

Total number of new columns:
 109
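
In the output above, a test with no unit ends up with a name ending in `|nan`. As a hedged sketch, the same names could be built with vectorized string concatenation, making the placeholder for a missing unit explicit. The helper `make_column_names` and the alternative placeholder `"no_unit"` are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd


def make_column_names(tests: pd.DataFrame, missing_unit: str = "nan") -> pd.Series:
    """Join category, line_item and unit with '|', filling a missing unit first."""
    unit = tests["unit"].fillna(missing_unit)
    return tests["category"] + "|" + tests["line_item"] + "|" + unit


# Made-up rows for illustration
tests = pd.DataFrame({
    "category": ["URINE MICROSCOPY", "BONE/JOINT FUNCTION"],
    "line_item": ["Yeast", "Calcium"],
    "unit": [np.nan, "mg/dL"],
})

names_default = make_column_names(tests).tolist()
names_custom = make_column_names(tests, missing_unit="no_unit").tolist()

print(names_default)
print(names_custom)
```

Changing the placeholder would change the output column names, so any downstream program that expects the `|nan` suffix would need the same change.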


In [13]:
lab_report_unique = df[['lab_report_no','mem_id','requested_date']]
# lab_report_unique_list = df['lab_report_no'].unique()

In [14]:
# Keep only the unique rows; .copy() makes df2 an independent DataFrame,
# so the column assignments below do not raise SettingWithCopyWarning
df2 = lab_report_unique.drop_duplicates(keep="first").copy()

In [15]:
# print(df2)

# Confirm the number of unique reports
print("Total number of rows (reports):",len(df2))

Total number of rows (reports): 225


In [16]:
# Append the new column names to df2, filled with NaN throughout
for col in new_col:
    df2[col] = np.nan

In [17]:
# Confirm that the new columns are created accordingly
df2.head()

Unnamed: 0,lab_report_no,mem_id,requested_date,BONE/JOINT FUNCTION|Calcium|mg/dL,BONE/JOINT FUNCTION|Phosphate|mg/dL,BONE/JOINT FUNCTION|Rheumatoid Factor|IU/mL,BONE/JOINT FUNCTION|Uric Acid|mg/dL,CARDIAC MARKERS|CK|U/L,DIABETES MELLITUS PROFILE|Glucose (Fasting)|mg/dL,DIABETES MELLITUS PROFILE|Glucose (Random)|mg/dL,...,URINE MICROSCOPY|Bacteria|nan,URINE MICROSCOPY|Casts|nan,URINE MICROSCOPY|Crystals|nan,URINE MICROSCOPY|Epithelial Cells|/uL,URINE MICROSCOPY|Mucus Threads|nan,URINE MICROSCOPY|Others|nan,URINE MICROSCOPY|Red Blood Cells|/uL,URINE MICROSCOPY|White Blood Cells|/uL,URINE MICROSCOPY|Yeast|nan,VITAMIN STUDY|Vitamin D Total (Immunoassay)|ng/mL
0,8002323,7AB3B6EF-6283-4889-BF40-C246D1FC9B5A,2021-03-31,,,,,,,,...,,,,,,,,,,
15,8700830,658911EE-72B6-48B7-ADDA-F6F401B2B77A,2021-05-07,,,,,,,,...,,,,,,,,,,
91,8700919,8B88E7DB-85FA-44F6-AEC0-5862D33F3687,2021-05-08,,,,,,,,...,,,,,,,,,,
171,8700066,FB819814-94C6-4826-9026-E91263C4E824,2021-04-21,,,,,,,,...,,,,,,,,,,
247,8701219,CE0D0293-E60F-4E22-918A-EFB7B566CD34,2021-05-28,,,,,,,,...,,,,,,,,,,


## Bin the lab measurement data from df to the right columns in df2

In [18]:
df['unit'] = df['unit'].astype(str)

# Copy each test result from df into the matching 'category|line_item|unit'
# column on its report's row in df2.
# df keeps the default 0..n-1 index from read_csv, so position i is also its label.
for i, result in enumerate(df['test_result']):
    rpt_num = df['lab_report_no'][i]
    # Index label of the first df2 row holding this report number
    df2_index = df2.index[df2.lab_report_no == rpt_num][0]
    colname = df['category'][i] + "|" + df['line_item'][i] + "|" + df['unit'][i]
    df2.loc[df2_index, colname] = result

In [19]:
df2.head()

Unnamed: 0,lab_report_no,mem_id,requested_date,BONE/JOINT FUNCTION|Calcium|mg/dL,BONE/JOINT FUNCTION|Phosphate|mg/dL,BONE/JOINT FUNCTION|Rheumatoid Factor|IU/mL,BONE/JOINT FUNCTION|Uric Acid|mg/dL,CARDIAC MARKERS|CK|U/L,DIABETES MELLITUS PROFILE|Glucose (Fasting)|mg/dL,DIABETES MELLITUS PROFILE|Glucose (Random)|mg/dL,...,URINE MICROSCOPY|Bacteria|nan,URINE MICROSCOPY|Casts|nan,URINE MICROSCOPY|Crystals|nan,URINE MICROSCOPY|Epithelial Cells|/uL,URINE MICROSCOPY|Mucus Threads|nan,URINE MICROSCOPY|Others|nan,URINE MICROSCOPY|Red Blood Cells|/uL,URINE MICROSCOPY|White Blood Cells|/uL,URINE MICROSCOPY|Yeast|nan,VITAMIN STUDY|Vitamin D Total (Immunoassay)|ng/mL
0,8002323,7AB3B6EF-6283-4889-BF40-C246D1FC9B5A,2021-03-31,,,,,,83.0,,...,,,,,,,,,,
15,8700830,658911EE-72B6-48B7-ADDA-F6F401B2B77A,2021-05-07,9.3,3.25,< 10,5.38,,85.0,,...,Nil,Nil,Nil,0.0,Nil,Nil,0.0,0.0,Nil,
91,8700919,8B88E7DB-85FA-44F6-AEC0-5862D33F3687,2021-05-08,9.0,3.13,< 10,6.39,,86.0,,...,Nil,Nil,Nil,0.0,Nil,Nil,0.0,0.0,Nil,
171,8700066,FB819814-94C6-4826-9026-E91263C4E824,2021-04-21,10.2,3.47,< 10,5.55,,90.0,,...,Nil,Nil,Nil,0.0,Nil,Nil,0.0,0.0,Nil,
247,8701219,CE0D0293-E60F-4E22-918A-EFB7B566CD34,2021-05-28,9.7,3.25,< 10,4.2,,83.0,,...,+,Nil,Nil,15.0,+,Nil,0.0,3.0,Nil,
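
The row-by-row binning in this section could also be expressed as a single reshape from long to wide format. The following is a sketch on made-up rows, not the method this notebook uses. It assumes that when a report repeats a test, the last value should win (as later writes in a loop would overwrite earlier ones), and the names `long_df`, `deduped` and `wide` are illustrative.

```python
import pandas as pd

# Made-up long-format rows: one row per test result (illustrative values only)
long_df = pd.DataFrame({
    "lab_report_no": [1001, 1001, 1002, 1002],
    "category": ["BONE/JOINT FUNCTION", "URINE MICROSCOPY",
                 "BONE/JOINT FUNCTION", "URINE MICROSCOPY"],
    "line_item": ["Calcium", "Yeast", "Calcium", "Yeast"],
    "unit": ["mg/dL", None, "mg/dL", None],
    "test_result": ["9.3", "Nil", "9.0", "+"],
})

# Column name in the 'category|line_item|unit' form; a missing unit becomes 'nan'
long_df["colname"] = (
    long_df["category"] + "|" + long_df["line_item"] + "|" + long_df["unit"].fillna("nan")
)

# pivot() requires unique (report, column) pairs, so keep the last row of any repeat
deduped = long_df.drop_duplicates(subset=["lab_report_no", "colname"], keep="last")

# Reshape: one row per report, one column per test
wide = deduped.pivot(index="lab_report_no", columns="colname", values="test_result").reset_index()
wide.columns.name = None

print(wide)
```

Unlike the pre-built empty columns in df2, a pivot only creates columns for tests that actually occur, and `mem_id` and `requested_date` would have to be merged back on `lab_report_no` afterwards.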


## Integrate members' gender into lab dataset

In [20]:
# Merge the gender column into df2
df2 = pd.merge(df2, df_mem, how="left", on="mem_id")

In [21]:
# Move gender column beside mem_id column
col_list = df2.columns.tolist()
col_list.insert(2, col_list.pop(col_list.index('gender')))
df2 = df2.reindex(columns=col_list)

In [22]:
df2.head()

Unnamed: 0,lab_report_no,mem_id,gender,requested_date,BONE/JOINT FUNCTION|Calcium|mg/dL,BONE/JOINT FUNCTION|Phosphate|mg/dL,BONE/JOINT FUNCTION|Rheumatoid Factor|IU/mL,BONE/JOINT FUNCTION|Uric Acid|mg/dL,CARDIAC MARKERS|CK|U/L,DIABETES MELLITUS PROFILE|Glucose (Fasting)|mg/dL,...,URINE MICROSCOPY|Bacteria|nan,URINE MICROSCOPY|Casts|nan,URINE MICROSCOPY|Crystals|nan,URINE MICROSCOPY|Epithelial Cells|/uL,URINE MICROSCOPY|Mucus Threads|nan,URINE MICROSCOPY|Others|nan,URINE MICROSCOPY|Red Blood Cells|/uL,URINE MICROSCOPY|White Blood Cells|/uL,URINE MICROSCOPY|Yeast|nan,VITAMIN STUDY|Vitamin D Total (Immunoassay)|ng/mL
0,8002323,7AB3B6EF-6283-4889-BF40-C246D1FC9B5A,Female,2021-03-31,,,,,,83.0,...,,,,,,,,,,
1,8700830,658911EE-72B6-48B7-ADDA-F6F401B2B77A,Male,2021-05-07,9.3,3.25,< 10,5.38,,85.0,...,Nil,Nil,Nil,0.0,Nil,Nil,0.0,0.0,Nil,
2,8700919,8B88E7DB-85FA-44F6-AEC0-5862D33F3687,Male,2021-05-08,9.0,3.13,< 10,6.39,,86.0,...,Nil,Nil,Nil,0.0,Nil,Nil,0.0,0.0,Nil,
3,8700066,FB819814-94C6-4826-9026-E91263C4E824,Male,2021-04-21,10.2,3.47,< 10,5.55,,90.0,...,Nil,Nil,Nil,0.0,Nil,Nil,0.0,0.0,Nil,
4,8701219,CE0D0293-E60F-4E22-918A-EFB7B566CD34,Female,2021-05-28,9.7,3.25,< 10,4.2,,83.0,...,+,Nil,Nil,15.0,+,Nil,0.0,3.0,Nil,
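
A left merge adds rows whenever the right-hand table holds more than one row for a key, which would break the one-report-per-row layout. As a hedged sketch on made-up data, the `validate="many_to_one"` argument of `pd.merge` asks pandas to raise an error if `mem_id` is not unique in the member table. The notebook itself does not pass this argument, and the frames `reports`, `members` and `members_dup` are illustrative.

```python
import pandas as pd

reports = pd.DataFrame({
    "lab_report_no": [1001, 1002, 1003],
    "mem_id": ["A", "B", "A"],
})
members = pd.DataFrame({
    "mem_id": ["A", "B"],
    "gender": ["Female", "Male"],
})

# many_to_one: several reports may share a member, but each member appears once
merged = pd.merge(reports, members, how="left", on="mem_id", validate="many_to_one")

# A duplicated member row is rejected instead of silently duplicating reports
members_dup = pd.concat([members, members.iloc[[0]]], ignore_index=True)
try:
    pd.merge(reports, members_dup, how="left", on="mem_id", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged)
print("Duplicate member detected:", duplicate_detected)
```

Comparing `len(df2)` before and after the merge is a simpler check that serves the same purpose.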


## Save Lab dataset as output file

In [23]:
# Saving of df2 to csv format
df2.to_csv(OUTPUTFILE_Lab+".csv", index=False, header=True)

# Saving of df2 to pkl format
df2.to_pickle(OUTPUTFILE_Lab+".pkl")
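
The two formats differ in what they preserve: CSV stores text and re-infers column types on load, while pickle keeps the DataFrame's dtypes. A small sketch with a temporary directory illustrates this. The values are made up (including a datetime column that the lab dataset may not have in that form), and the file names are placeholders.

```python
import os
import tempfile

import pandas as pd

out = pd.DataFrame({
    "lab_report_no": [1001, 1002],
    "requested_date": pd.to_datetime(["2021-03-31", "2021-05-07"]),
    "BONE/JOINT FUNCTION|Rheumatoid Factor|IU/mL": ["< 10", "12"],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "example_output.csv")  # placeholder file name
    pkl_path = os.path.join(tmp, "example_output.pkl")  # placeholder file name

    out.to_csv(csv_path, index=False, header=True)
    out.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

# The pickle round trip restores the datetime column; the CSV round trip reads it back as text
pkl_is_datetime = pd.api.types.is_datetime64_any_dtype(from_pkl["requested_date"])
csv_is_datetime = pd.api.types.is_datetime64_any_dtype(from_csv["requested_date"])

print(pkl_is_datetime, csv_is_datetime)
```

A subsequent program that reads the CSV may therefore need to restate types (for example with `parse_dates` in `pd.read_csv`), whereas the pickle file can be loaded as-is.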