# Healthcare Lab (Stage 3)

**Learning Objectives:**
  * Practice the application of functions to DataFrames
  * Gain exposure to healthcare-related datasets

## Context of the dataset

### 1. The dataset consists of records corresponding to medical events.
### 2. Each medical event is uniquely identified by `MedicalClaim`.
### 3. A given medical event might involve several medical procedures.
### 4. Each medical procedure is uniquely identified by `ClaimItem`.
### 5. A given medical procedure is characterized by `PrincipalDiagnosisDesc`, `PrincipalDiagnosis`, `RevenueCodeDesc`, `RevenueCode`, `TypeFlag` and `TotalExpenses`.

### 6. Each medical procedure involves: `MemberName`, `MemberID`, `County`, `HospitalName`, `HospitalType`, `StartDate`, `EndDate`.
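To make the claim/item structure described above concrete, here is a minimal sketch on a hypothetical toy DataFrame (the values and the names `toy_claims` and `claim_summary` are invented for illustration, not taken from the real dataset). It collapses the item-level rows into one row per medical event.

```python
import pandas as pd

# Hypothetical toy records: one row per medical procedure (ClaimItem),
# several rows can share the same medical event (MedicalClaim)
toy_claims = pd.DataFrame({
    "MedicalClaim": ["claim_a", "claim_a", "claim_a", "claim_b"],
    "ClaimItem": [18, 21, 10, 1],
    "TotalExpenses": [15.0, 3.0, 124.0, 50.0],
})

# One row per medical event: number of procedures and total cost
claim_summary = toy_claims.groupby("MedicalClaim").agg(
    n_items=("ClaimItem", "count"),
    total_expenses=("TotalExpenses", "sum"),
)
print(claim_summary)
```

The same `groupby` pattern should carry over to the full dataset, assuming its columns behave as described in the list above.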


## 1. Library Import

In [40]:
import pandas as pd
import warnings
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

In [41]:
warnings.simplefilter('ignore')

## 2. Data loading and DataFrame creation

In [42]:
HealthCareDataSet=pd.read_csv("https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/HealthcareDataset_PublicRelease.csv",sep=',',parse_dates=['StartDate','EndDate','BirthDate'])
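The `parse_dates` argument asks `read_csv` to convert the listed columns to datetimes while loading. A small sketch, assuming an in-memory CSV with invented contents (`csv_text` and `toy` are illustrative names), shows the effect on the column dtype:

```python
import io
import pandas as pd

# Hypothetical two-line CSV standing in for the real file
csv_text = "MemberID,StartDate,EndDate\n6a380a28,2020-01-08,2020-01-08\n"

toy = pd.read_csv(
    io.StringIO(csv_text),
    sep=",",
    parse_dates=["StartDate", "EndDate"],
)

# The parsed columns hold datetimes rather than plain strings
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["StartDate"])
print(toy.dtypes)
```

Having real datetime columns is what later allows date arithmetic such as `EndDate - StartDate`.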

In [43]:
HealthCareDataSet.head(3)

| | Id | MemberName | MemberID | County | MedicalClaim | ClaimItem | HospitalName | HospitalType | StartDate | EndDate | PrincipalDiagnosisDesc | PrincipalDiagnosis | RevenueCodeDesc | RevenueCode | TypeFlag | BirthDate | TotalExpenses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 634363 | e659f3f4 | 6a380a28 | 6f943458 | c1e3436737c77899 | 18 | 04b77561 | HOSPITAL | 2020-01-08 | 2020-01-08 | Epigastric pain | R10.13 | DRUGS REQUIRE SPECIFIC ID: DRUGS REQUIRING DET... | 636.0 | ER | 1967-05-13 | 15.148 |
| 1 | 634364 | e659f3f4 | 6a380a28 | 6f943458 | c1e3436737c77899 | 21 | 04b77561 | HOSPITAL | 2020-01-08 | 2020-01-08 | Epigastric pain | R10.13 | DRUGS REQUIRE SPECIFIC ID: DRUGS REQUIRING DET... | 636.0 | ER | 1967-05-13 | 3.073 |
| 2 | 634387 | e659f3f4 | 6a380a28 | 6f943458 | c1e3436737c77899 | 10 | 04b77561 | HOSPITAL | 2020-01-08 | 2020-01-08 | Epigastric pain | R10.13 | LABORATORY - CLINICAL DIAGNOSTIC: HEMATOLOGY | 305.0 | ER | 1967-05-13 | 123.9 |


In [44]:
HealthCareDataSet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52563 entries, 0 to 52562
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Id                      52563 non-null  int64         
 1   MemberName              52563 non-null  object        
 2   MemberID                52563 non-null  object        
 3   County                  52563 non-null  object        
 4   MedicalClaim            52563 non-null  object        
 5   ClaimItem               52563 non-null  int64         
 6   HospitalName            52563 non-null  object        
 7   HospitalType            52563 non-null  object        
 8   StartDate               52563 non-null  datetime64[ns]
 9   EndDate                 52563 non-null  datetime64[ns]
 10  PrincipalDiagnosisDesc  52563 non-null  object        
 11  PrincipalDiagnosis      52563 non-null  object        
 12  RevenueCodeDesc         52561 non-null  object

## 3. Applying Built-In Functions to a DataFrame (Column-wise)

In [45]:
## Let's apply the np.log() function to the column `TotalExpenses`. We save the results in a new column `TotalExpensesLog`
# Series.apply() works element-wise and takes no axis argument, so none is passed here
HealthCareDataSet['TotalExpensesLog']=HealthCareDataSet['TotalExpenses'].apply(np.log)
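Because `np.log` is a NumPy universal function, it can also be called directly on a whole Series. The sketch below uses a small Series built from the first three `TotalExpenses` values shown earlier (the variable names are illustrative) to check that both routes agree.

```python
import numpy as np
import pandas as pd

# First three TotalExpenses values from the head() output
expenses = pd.Series([15.148, 3.073, 123.9], name="TotalExpenses")

log_via_apply = expenses.apply(np.log)   # function applied through apply()
log_vectorized = np.log(expenses)        # ufunc called on the whole Series

same = np.allclose(log_via_apply, log_vectorized)
print(same)
```

On large columns the direct call is usually the faster option, while `apply` remains the general tool for functions that are not vectorized.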

In [46]:
# We apply the function join to merge two string columns into a new one
HealthCareDataSet['PrincipalDiagnosisJoined']=HealthCareDataSet[['PrincipalDiagnosisDesc','PrincipalDiagnosis']].apply('---'.join,axis=1)
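With `axis=1`, `apply` hands each row to `'---'.join`, which concatenates the row's string values. As a sketch on a hypothetical two-row DataFrame (`toy` is an invented name), the same result can be obtained with the vectorized `Series.str.cat` method:

```python
import pandas as pd

# Hypothetical toy frame with the two string columns used above
toy = pd.DataFrame({
    "PrincipalDiagnosisDesc": ["Epigastric pain", "Iron deficiency anemia se"],
    "PrincipalDiagnosis": ["R10.13", "D50.0"],
})

# Row-wise: each row (a Series of two strings) is passed to '---'.join
joined_apply = toy.apply("---".join, axis=1)

# Column-wise alternative: concatenate two string columns with a separator
joined_cat = toy["PrincipalDiagnosisDesc"].str.cat(toy["PrincipalDiagnosis"], sep="---")

print(joined_apply.tolist())
```

Note that `join` only works when every value in the row is a string; a missing value in either column would raise an error in the row-wise version.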

## 4. Applying Regular Functions to a DataFrame (Column-wise)

In [47]:
def lowerToUpper(string):
    return string.upper()


In [48]:
lowerToUpper('hi there')

'HI THERE'

In [49]:
HealthCareDataSet['PrincipalDiagnosisDescUPPER']=HealthCareDataSet['PrincipalDiagnosisDesc'].apply(lowerToUpper)
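A function such as `lowerToUpper` assumes every value is a string. `PrincipalDiagnosisDesc` has no missing values according to `info()`, but other text columns may. The sketch below, on a hypothetical Series containing one missing value, shows two ways to stay safe: the built-in `.str.upper()` accessor and a guarded lambda. All names here are illustrative.

```python
import pandas as pd

# Hypothetical text column with one missing value
descriptions = pd.Series(["Epigastric pain", None, "Iron deficiency anemia se"])

# The .str accessor skips missing values and leaves them missing
upper_vectorized = descriptions.str.upper()

# Equivalent apply() call that only upper-cases actual strings
upper_guarded = descriptions.apply(
    lambda value: value.upper() if isinstance(value, str) else value
)

print(upper_vectorized)
```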

In [50]:
def roundingFunction(number):
    return round(number)

In [51]:
# Apply the user-defined function defined above
HealthCareDataSet['TotalExpensesRounded']=HealthCareDataSet['TotalExpenses'].apply(roundingFunction)
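One detail worth knowing when rounding expenses: Python's built-in `round` sends exact halves to the nearest even integer, and `Series.round` behaves the same way for halves. A short sketch with invented values (`halves` is an illustrative name):

```python
import pandas as pd

halves = pd.Series([0.5, 1.5, 2.5, 3.073])

# Built-in round(), applied one value at a time
rounded_builtin = halves.apply(round)

# Vectorized rounding; the result stays floating point
rounded_series = halves.round()

# Keeping two decimals instead of rounding to whole numbers
rounded_cents = halves.round(2)

print(rounded_builtin.tolist(), rounded_series.tolist())
```

If "round half up" is required for financial reporting, a different approach (for example the `decimal` module) would be needed; that is outside the scope of this lab.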

## 4. Applying Lambda Functions to a DataFrame (Column-wise)

In [52]:
## a lambda function which simply outputs the input
HealthCareDataSet['TotalExpenses'].apply(lambda x:x)

0          15.148
1           3.073
2         123.900
3           7.511
4           8.631
           ...   
52558    2436.000
52559    2075.500
52560     865.900
52561     665.000
52562    4587.800
Name: TotalExpenses, Length: 52563, dtype: float64

In [53]:
## a lambda function which adds 10 to the input
HealthCareDataSet['TotalExpenses'].apply(lambda x:x+10)

0          25.148
1          13.073
2         133.900
3          17.511
4          18.631
           ...   
52558    2446.000
52559    2085.500
52560     875.900
52561     675.000
52562    4597.800
Name: TotalExpenses, Length: 52563, dtype: float64

In [54]:
## a lambda function which applies the np.log() function to the input
HealthCareDataSet['TotalExpenses'].apply(lambda x:np.log(x))

0        2.717869
1        1.122654
2        4.819475
3        2.016369
4        2.155360
           ...   
52558    7.798113
52559    7.637957
52560    6.763769
52561    6.499787
52562    8.431156
Name: TotalExpenses, Length: 52563, dtype: float64

In [55]:
## a lambda function which applies the upper() function to the input
HealthCareDataSet['PrincipalDiagnosisDesc'].apply(lambda x:x.upper())

0                  EPIGASTRIC PAIN
1                  EPIGASTRIC PAIN
2                  EPIGASTRIC PAIN
3                  EPIGASTRIC PAIN
4                  EPIGASTRIC PAIN
                   ...            
52558    TRAUMATIC SUBARACHNOID HE
52559    IRON DEFICIENCY ANEMIA SE
52560    IRON DEFICIENCY ANEMIA SE
52561    IRON DEFICIENCY ANEMIA SE
52562    IRON DEFICIENCY ANEMIA SE
Name: PrincipalDiagnosisDesc, Length: 52563, dtype: object

In [56]:
## a lambda function which applies the split() function to the input
HealthCareDataSet['PrincipalDiagnosisDesc'].apply(lambda x:x.split())

0                    [Epigastric, pain]
1                    [Epigastric, pain]
2                    [Epigastric, pain]
3                    [Epigastric, pain]
4                    [Epigastric, pain]
                      ...              
52558     [Traumatic, subarachnoid, he]
52559    [Iron, deficiency, anemia, se]
52560    [Iron, deficiency, anemia, se]
52561    [Iron, deficiency, anemia, se]
52562    [Iron, deficiency, anemia, se]
Name: PrincipalDiagnosisDesc, Length: 52563, dtype: object

In [58]:
## a lambda function that outputs its input; with the default axis=0, apply() passes each column (not each row) to the lambda
HealthCareDataSet[['PrincipalDiagnosisDesc','PrincipalDiagnosis']].apply(lambda row:row)

| | PrincipalDiagnosisDesc | PrincipalDiagnosis |
|---|---|---|
| 0 | Epigastric pain | R10.13 |
| 1 | Epigastric pain | R10.13 |
| 2 | Epigastric pain | R10.13 |
| 3 | Epigastric pain | R10.13 |
| 4 | Epigastric pain | R10.13 |
| ... | ... | ... |
| 52558 | Traumatic subarachnoid he | S06.6X0A |
| 52559 | Iron deficiency anemia se | D50.0 |
| 52560 | Iron deficiency anemia se | D50.0 |
| 52561 | Iron deficiency anemia se | D50.0 |


In [75]:
HealthCareDataSet[['PrincipalDiagnosisDesc','PrincipalDiagnosis']].apply(lambda x:x)

| | PrincipalDiagnosisDesc | PrincipalDiagnosis |
|---|---|---|
| 0 | Epigastric pain | R10.13 |
| 1 | Epigastric pain | R10.13 |
| 2 | Epigastric pain | R10.13 |
| 3 | Epigastric pain | R10.13 |
| 4 | Epigastric pain | R10.13 |
| ... | ... | ... |
| 52558 | Traumatic subarachnoid he | S06.6X0A |
| 52559 | Iron deficiency anemia se | D50.0 |
| 52560 | Iron deficiency anemia se | D50.0 |
| 52561 | Iron deficiency anemia se | D50.0 |


In [102]:
## a lambda function that takes a DataFrame row and concatenates its columns
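# Access row values by column label; positional access such as row[0] is deprecated in recent pandas versions
HealthCareDataSet[['PrincipalDiagnosisDesc','PrincipalDiagnosis']].apply(lambda row:row['PrincipalDiagnosisDesc']+' -- '+row['PrincipalDiagnosis'],axis=1)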

0                    Epigastric pain -- R10.13
1                    Epigastric pain -- R10.13
2                    Epigastric pain -- R10.13
3                    Epigastric pain -- R10.13
4                    Epigastric pain -- R10.13
                         ...                  
52558    Traumatic subarachnoid he -- S06.6X0A
52559       Iron deficiency anemia se -- D50.0
52560       Iron deficiency anemia se -- D50.0
52561       Iron deficiency anemia se -- D50.0
52562       Iron deficiency anemia se -- D50.0
Length: 52563, dtype: object
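The difference between the default `axis=0` and `axis=1` is easy to miss, because an identity lambda returns the same DataFrame either way. The sketch below, on a hypothetical two-row DataFrame, makes the difference visible by returning the `name` of whatever Series the lambda receives.

```python
import pandas as pd

# Hypothetical toy frame with the two columns used above
toy = pd.DataFrame({
    "PrincipalDiagnosisDesc": ["Epigastric pain", "Iron deficiency anemia se"],
    "PrincipalDiagnosis": ["R10.13", "D50.0"],
})

# axis=0 (default): the lambda is called once per column,
# so .name is the column label
per_column = toy.apply(lambda col: col.name)

# axis=1: the lambda is called once per row,
# so .name is the row's index label
per_row = toy.apply(lambda row: row.name, axis=1)

print(per_column.tolist())
print(per_row.tolist())
```

This is why combining values from several columns of the same record requires `axis=1`.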

## 5. Challenge Yourself !!

### (1) Develop a function that given a sentence returns the first two words of the sentence. (2) Apply that function to extract the first two words of the column `RevenueCodeDesc`
#### Tip: consider using the split function defined in Python.
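One possible approach, offered as a sketch rather than the reference solution: split the sentence on whitespace, keep the first two pieces, and join them back. The helper name `first_two_words` and the sample Series are hypothetical. The guard for non-string values is there because `info()` reports two missing entries in `RevenueCodeDesc`.

```python
import pandas as pd

def first_two_words(sentence):
    """Return the first two whitespace-separated words of a sentence."""
    if not isinstance(sentence, str):
        return sentence  # leave missing values untouched
    return " ".join(sentence.split()[:2])

# Hypothetical sample: one real description, one invented, one missing
revenue_desc = pd.Series([
    "LABORATORY - CLINICAL DIAGNOSTIC: HEMATOLOGY",
    "EMERGENCY ROOM",
    None,
])

first_two = revenue_desc.apply(first_two_words)
print(first_two.tolist())
```

Note that a standalone hyphen counts as a word under whitespace splitting; whether that is desirable depends on how the result will be used.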


### Develop a lambda function to compute the duration of any given medical event
#### Tip: bear in mind that duration can be computed by subtracting the column `StartDate` from the column `EndDate`

0       0 days
1       0 days
2       0 days
3       0 days
4       0 days
         ...  
52558   7 days
52559   4 days
52560   4 days
52561   4 days
52562   4 days
Length: 52563, dtype: timedelta64[ns]

In [112]:
HealthCareDataSet[['StartDate','EndDate']]

| | StartDate | EndDate |
|---|---|---|
| 0 | 2020-01-08 | 2020-01-08 |
| 1 | 2020-01-08 | 2020-01-08 |
| 2 | 2020-01-08 | 2020-01-08 |
| 3 | 2020-01-08 | 2020-01-08 |
| 4 | 2020-01-08 | 2020-01-08 |
| ... | ... | ... |
| 52558 | 2020-12-02 | 2020-12-09 |
| 52559 | 2020-12-18 | 2020-12-22 |
| 52560 | 2020-12-18 | 2020-12-22 |
| 52561 | 2020-12-18 | 2020-12-22 |


In [113]:
HealthCareDataSet['EndDate']-HealthCareDataSet['StartDate']

0       0 days
1       0 days
2       0 days
3       0 days
4       0 days
         ...  
52558   7 days
52559   4 days
52560   4 days
52561   4 days
52562   4 days
Length: 52563, dtype: timedelta64[ns]
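The column subtraction above can also be written as a row-wise lambda, which is what the challenge asks for. The sketch below uses a hypothetical three-row DataFrame with dates taken from the output above (`stays` and the other variable names are illustrative) and shows how to turn the resulting timedeltas into whole days.

```python
import pandas as pd

# Hypothetical toy frame reusing a few date pairs shown above
stays = pd.DataFrame({
    "StartDate": pd.to_datetime(["2020-01-08", "2020-12-02", "2020-12-18"]),
    "EndDate": pd.to_datetime(["2020-01-08", "2020-12-09", "2020-12-22"]),
})

# Row-wise lambda: each row is a Series indexed by column label
duration_lambda = stays.apply(
    lambda row: row["EndDate"] - row["StartDate"], axis=1
)

# Vectorized equivalent, as in the cell above
duration_vectorized = stays["EndDate"] - stays["StartDate"]

# Whole days as integers, convenient for further numeric analysis
duration_days = duration_vectorized.dt.days
print(duration_days.tolist())
```

Both forms should give the same durations; the vectorized subtraction is typically faster on the full 52563-row dataset.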