# Healthcare Lab (Text Processing)

**Learning Objectives:**
  * Practice text processing operations (regular expressions on string columns)
  * Gain exposure to healthcare-related datasets

## Context of the dataset

### 1. The dataset consists of records corresponding to medical events.
### 2. Each medical event is uniquely identified by `MedicalClaim`.
### 3. A given medical event might involve several medical procedures.
### 4. Within a medical event, each medical procedure is uniquely identified by `ClaimItem`.
### 5. A given medical procedure is characterized by `PrincipalDiagnosisDesc`, `PrincipalDiagnosis`, `RevenueCodeDesc`, `RevenueCode`, `TypeFlag` and `TotalExpenses`.
### 6. Each medical procedure involves: `MemberName`, `MemberID`, `County`, `HospitalName`, `HospitalType`, `StartDate`, `EndDate`.
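
The relationship between events and procedures described above can be sketched on a toy table. This is a minimal illustration, not part of the lab data: the claim identifiers, expense values and the names `toy` and `per_claim` are hypothetical, and it assumes `ClaimItem` numbers restart within each `MedicalClaim`.

```python
import pandas as pd

# Hypothetical toy records: two medical events, the first with three procedures.
toy = pd.DataFrame({
    "MedicalClaim": ["c1", "c1", "c1", "c2"],
    "ClaimItem": [1, 2, 3, 1],
    "TotalExpenses": [10.0, 20.5, 5.0, 100.0],
})

# Collapse to one row per medical event:
# number of distinct procedures and the summed expenses.
per_claim = toy.groupby("MedicalClaim").agg(
    n_items=("ClaimItem", "nunique"),
    total_expenses=("TotalExpenses", "sum"),
)
print(per_claim)
```

The same `groupby` could be applied to the full DataFrame once it is loaded, to check how many procedures each event contains.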


## 1. Library Import

In [1]:
import pandas as pd
import warnings
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

In [2]:
warnings.simplefilter('ignore')

## 2. Data loading and DataFrame creation

In [12]:
HealthCareDataSet = pd.read_csv(
    "https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/HealthcareDataset_PublicRelease.csv",
    sep=',',
    parse_dates=['StartDate', 'EndDate', 'BirthDate'],
)

In [82]:
HealthCareDataSet.head(10)

Unnamed: 0,Id,MemberName,MemberID,County,MedicalClaim,ClaimItem,HospitalName,HospitalType,StartDate,EndDate,PrincipalDiagnosisDesc,PrincipalDiagnosis,RevenueCodeDesc,RevenueCode,TypeFlag,BirthDate,TotalExpenses,PrincipalDiagnosisSplit,PrincipalDiagnosisDescExtract,RevenueCodeDescExtract
0,634363,e659f3f4,6a380a28,6f943458,c1e3436737c77899,18,04b77561,HOSPITAL,2020-01-08,2020-01-08,Epigastric pain,R10.13,DRUGS REQUIRE SPECIFIC ID: DRUGS REQUIRING DET...,636.0,ER,1967-05-13,15.148,"[R10, 13]",,
1,634364,e659f3f4,6a380a28,6f943458,c1e3436737c77899,21,04b77561,HOSPITAL,2020-01-08,2020-01-08,Epigastric pain,R10.13,DRUGS REQUIRE SPECIFIC ID: DRUGS REQUIRING DET...,636.0,ER,1967-05-13,3.073,"[R10, 13]",,
2,634387,e659f3f4,6a380a28,6f943458,c1e3436737c77899,10,04b77561,HOSPITAL,2020-01-08,2020-01-08,Epigastric pain,R10.13,LABORATORY - CLINICAL DIAGNOSTIC: HEMATOLOGY,305.0,ER,1967-05-13,123.9,"[R10, 13]",,
3,634388,e659f3f4,6a380a28,6f943458,c1e3436737c77899,20,04b77561,HOSPITAL,2020-01-08,2020-01-08,Epigastric pain,R10.13,DRUGS REQUIRE SPECIFIC ID: DRUGS REQUIRING DET...,636.0,ER,1967-05-13,7.511,"[R10, 13]",,
4,634389,e659f3f4,6a380a28,6f943458,c1e3436737c77899,19,04b77561,HOSPITAL,2020-01-08,2020-01-08,Epigastric pain,R10.13,DRUGS REQUIRE SPECIFIC ID: DRUGS REQUIRING DET...,636.0,ER,1967-05-13,8.631,"[R10, 13]",,
5,634390,e659f3f4,6a380a28,6f943458,c1e3436737c77899,2,04b77561,HOSPITAL,2020-01-08,2020-01-08,Epigastric pain,R10.13,LABORATORY - CLINICAL DIAGNOSTIC: CHEMISTRY,301.0,ER,1967-05-13,263.2,"[R10, 13]",,
6,634391,e659f3f4,6a380a28,6f943458,c1e3436737c77899,6,04b77561,HOSPITAL,2020-01-08,2020-01-08,Epigastric pain,R10.13,LABORATORY - CLINICAL DIAGNOSTIC: CHEMISTRY,301.0,ER,1967-05-13,44.1,"[R10, 13]",,
7,634392,e659f3f4,6a380a28,6f943458,c1e3436737c77899,12,04b77561,HOSPITAL,2020-01-08,2020-01-08,Epigastric pain,R10.13,RADIOLOGY - DIAGNOSTIC: CHEST X-RAY,324.0,ER,1967-05-13,364.0,"[R10, 13]",,
8,634393,e659f3f4,6a380a28,6f943458,c1e3436737c77899,15,04b77561,HOSPITAL,2020-01-08,2020-01-08,Epigastric pain,R10.13,EMERGENCY ROOM,450.0,ER,1967-05-13,789.6,"[R10, 13]",,
9,634394,e659f3f4,6a380a28,6f943458,c1e3436737c77899,17,04b77561,HOSPITAL,2020-01-08,2020-01-08,Epigastric pain,R10.13,EMERGENCY ROOM,450.0,ER,1967-05-13,478.1,"[R10, 13]",,


In [5]:
HealthCareDataSet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52563 entries, 0 to 52562
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Id                      52563 non-null  int64         
 1   MemberName              52563 non-null  object        
 2   MemberID                52563 non-null  object        
 3   County                  52563 non-null  object        
 4   MedicalClaim            52563 non-null  object        
 5   ClaimItem               52563 non-null  int64         
 6   HospitalName            52563 non-null  object        
 7   HospitalType            52563 non-null  object        
 8   StartDate               52563 non-null  datetime64[ns]
 9   EndDate                 52563 non-null  datetime64[ns]
 10  PrincipalDiagnosisDesc  52563 non-null  object        
 11  PrincipalDiagnosis      52563 non-null  object        
 12  RevenueCodeDesc         52561 non-null  object

## 3. Let's find all records containing the word Sprain in `PrincipalDiagnosisDesc`
### Tip: Use ChatGPT to find the right regular expression: https://chatgpt.com/share/95a71cec-b278-45c4-b4c8-865ff085c80a


In [49]:
regexPattern=r'\b[sS]prain\b'

In [51]:
HealthCareDataSet[HealthCareDataSet['PrincipalDiagnosisDesc'].str.contains(regexPattern, regex=True)]

Unnamed: 0,Id,MemberName,MemberID,County,MedicalClaim,ClaimItem,HospitalName,HospitalType,StartDate,EndDate,PrincipalDiagnosisDesc,PrincipalDiagnosis,RevenueCodeDesc,RevenueCode,TypeFlag,BirthDate,TotalExpenses
918,635736,eec5c2b9,558f84ad,89e38653,090c146fa3932f54,2,b592f5ae,HOSPITAL,2020-01-20,2020-01-20,Sprain of unspecified lig,S93.401A,RADIOLOGY - DIAGNOSTIC,320.0,ER,1964-08-29,1059.156
919,635737,eec5c2b9,558f84ad,89e38653,090c146fa3932f54,3,b592f5ae,HOSPITAL,2020-01-20,2020-01-20,Sprain of unspecified lig,S93.401A,EMERGENCY ROOM,450.0,ER,1964-08-29,2098.096
920,635738,eec5c2b9,558f84ad,89e38653,090c146fa3932f54,1,b592f5ae,HOSPITAL,2020-01-20,2020-01-20,Sprain of unspecified lig,S93.401A,PHARMACY,250.0,ER,1964-08-29,2.975
2145,638036,588584f1,b9f9e2d3,02af982d,ca62615a0bdccc7b,2,88b42459,HOSPITAL,2020-01-01,2020-01-01,Unspecified sprain of rig,S63.501A,RADIOLOGY - DIAGNOSTIC,320.0,ER,1961-03-05,347.200
2158,638059,588584f1,b9f9e2d3,02af982d,ca62615a0bdccc7b,1,88b42459,HOSPITAL,2020-01-01,2020-01-01,Unspecified sprain of rig,S63.501A,PHARMACY,250.0,ER,1961-03-05,0.364
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49084,733413,25b294ad,b41fcbc8,02af982d,5b88bbc6c91d1dfe,1,4d103af0,HOSPITAL,2020-12-24,2020-12-24,Sprain of unspecified lig,S93.401A,RADIOLOGY - DIAGNOSTIC,320.0,ER,1939-08-02,618.800
49085,733425,25b294ad,b41fcbc8,02af982d,5b88bbc6c91d1dfe,2,4d103af0,HOSPITAL,2020-12-24,2020-12-24,Sprain of unspecified lig,S93.401A,EMERGENCY ROOM,450.0,ER,1939-08-02,2774.842
49495,734523,eec5c2b9,558f84ad,89e38653,3dfc8d295cff09f7,1,b592f5ae,HOSPITAL,2020-12-06,2020-12-06,Sprain of unspecified lig,S93.402A,PHARMACY,250.0,ER,1964-08-29,2.044
49496,734524,eec5c2b9,558f84ad,89e38653,3dfc8d295cff09f7,2,b592f5ae,HOSPITAL,2020-12-06,2020-12-06,Sprain of unspecified lig,S93.402A,RADIOLOGY - DIAGNOSTIC,320.0,ER,1964-08-29,1059.156
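
To see what the word-boundary pattern does and does not match, here is a small sketch on made-up descriptions. The sample strings and the variable names `desc`, `strict` and `relaxed` are assumptions for illustration. The case-insensitive variant is an alternative, not the pattern used in the lab.

```python
import re
import pandas as pd

# Hypothetical diagnosis descriptions, not taken from the dataset.
desc = pd.Series([
    "Sprain of unspecified lig",
    "Unspecified sprain of rig",
    "SPRAIN of ankle",
    "Sprained wrist",
    None,
])

# Same pattern as in the lab: "Sprain" or "sprain" as a whole word.
# na=False makes missing descriptions count as non-matches.
strict = desc.str.contains(r"\b[sS]prain\b", regex=True, na=False)

# Case-insensitive variant; it still requires a whole-word match,
# so "Sprained" is rejected by the trailing \b.
relaxed = desc.str.contains(r"\bsprain\b", flags=re.IGNORECASE, regex=True, na=False)

print(strict.tolist())   # [True, True, False, False, False]
print(relaxed.tolist())  # [True, True, True, False, False]
```

Passing `na=False` matters when a column has missing values, because a boolean mask containing `NaN` cannot be used to filter rows.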


## 4. Let's split the field `PrincipalDiagnosis` into two.
### Tip: Use ChatGPT to find the right regular expression: https://chatgpt.com/share/045d60d0-6a4f-4288-b73d-501d29b619f6

In [52]:
regexPattern=r'\.'

In [57]:
HealthCareDataSet['PrincipalDiagnosisSplit']=HealthCareDataSet['PrincipalDiagnosis'].str.split(regexPattern)
HealthCareDataSet['PrincipalDiagnosisSplit']

0          [R10, 13]
1          [R10, 13]
2          [R10, 13]
3          [R10, 13]
4          [R10, 13]
            ...     
52558    [S06, 6X0A]
52559       [D50, 0]
52560       [D50, 0]
52561       [D50, 0]
52562       [D50, 0]
Name: PrincipalDiagnosisSplit, Length: 52563, dtype: object
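
The split above stores a Python list in each cell. If two separate columns are more convenient, one possible variant uses `expand=True`. This is a sketch on made-up codes: the sample values and the column names `Category` and `Detail` are hypothetical.

```python
import pandas as pd

# Hypothetical sample codes; the last one has no dot on purpose.
codes = pd.Series(["R10.13", "S06.6X0A", "D50.0", "I10"])

# A single-character pattern is treated as a literal string, so "." needs no escaping here.
# n=1 splits on the first dot only; expand=True returns one column per part.
parts = codes.str.split(".", n=1, expand=True)
parts.columns = ["Category", "Detail"]  # hypothetical column names
print(parts)
```

Codes without a dot get a missing value in the second column rather than raising an error.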

## 5. Given the column `RevenueCodeDesc`, let's extract all words following the text: "LABORATORY - CLINICAL DIAGNOSTIC"

### Tip: Use ChatGPT to find the right regular expression: https://chatgpt.com/share/16f39595-847a-4844-8d82-a021ee02af9c


In [83]:
regexPattern = r'LABORATORY - CLINICAL DIAGNOSTIC:\s*(.*)'

In [84]:
HealthCareDataSet['RevenueCodeDescExtract']=HealthCareDataSet['RevenueCodeDesc'].str.extract(regexPattern)
HealthCareDataSet['RevenueCodeDescExtract']


0               NaN
1               NaN
2        HEMATOLOGY
3               NaN
4               NaN
            ...    
52558           NaN
52559    HEMATOLOGY
52560           NaN
52561           NaN
52562     CHEMISTRY
Name: RevenueCodeDescExtract, Length: 52563, dtype: object
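
The behaviour of the capture group can be checked on a few made-up strings. In this sketch the sample values and the names `rev` and `extracted` are assumptions. `expand=False` is used so that a single capture group comes back as a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical revenue code descriptions, including a non-match and a missing value.
rev = pd.Series([
    "LABORATORY - CLINICAL DIAGNOSTIC: HEMATOLOGY",
    "EMERGENCY ROOM",
    "LABORATORY - CLINICAL DIAGNOSTIC:   CHEMISTRY",
    None,
])

# \s* consumes any whitespace after the colon; (.*) captures the rest of the string.
pattern = r"LABORATORY - CLINICAL DIAGNOSTIC:\s*(.*)"

# Rows that do not match the pattern (or are missing) yield NaN.
extracted = rev.str.extract(pattern, expand=False)
print(extracted.tolist())
```

Non-matching rows become `NaN`, which is why most rows in the lab output are `NaN`.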

In [85]:
HealthCareDataSet.groupby(['RevenueCodeDescExtract']).count()

RevenueCodeDescExtract,Id,MemberName,MemberID,County,MedicalClaim,ClaimItem,HospitalName,HospitalType,StartDate,EndDate,PrincipalDiagnosisDesc,PrincipalDiagnosis,RevenueCodeDesc,RevenueCode,TypeFlag,BirthDate,TotalExpenses,PrincipalDiagnosisSplit,PrincipalDiagnosisDescExtract
BACTERIOLOGY/MICROBIOLOGY,1831,1831,1831,1831,1831,1831,1831,1831,1831,1831,1831,1831,1831,1831,1831,1831,1831,1831,0
CHEMISTRY,5857,5857,5857,5857,5857,5857,5857,5857,5857,5857,5857,5857,5857,5857,5857,5857,5857,5857,0
HEMATOLOGY,4037,4037,4037,4037,4037,4037,4037,4037,4037,4037,4037,4037,4037,4037,4037,4037,4037,4037,0
IMMUNOLOGY,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,0
OTHER LABORATORY,91,91,91,91,91,91,91,91,91,91,91,91,91,91,91,91,91,91,0
UROLOGY,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,0
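
`groupby(...).count()` repeats the same count in every column. When only the number of rows per extracted category is needed, `value_counts()` on the column is a more compact alternative. This is a sketch on made-up values; the Series contents and the name `counts` are hypothetical.

```python
import pandas as pd

# Hypothetical extracted categories; None stands for rows that did not match.
extract = pd.Series(
    ["CHEMISTRY", "HEMATOLOGY", None, "CHEMISTRY", None],
    name="RevenueCodeDescExtract",
)

# value_counts() drops missing values by default and sorts by frequency.
counts = extract.value_counts()
print(counts)
```

On the lab DataFrame the equivalent call would be `HealthCareDataSet['RevenueCodeDescExtract'].value_counts()`.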
