# Healthcare Lab (Multiple Regression Model Inference)

**Learning Objectives:**
  * Define and fit linear regression models with multiple factors, and compute the associated t-tests and F-tests.
  
  * Gain exposure to healthcare-related datasets.

## Context of the dataset

### 1. The dataset consists of records corresponding to medical events.
### 2. Each medical event is uniquely identified by `MedicalClaim`.
### 3. A given medical event might involve several medical procedures.
### 4. Each medical procedure is uniquely identified by `ClaimItem`.
### 5. A given medical procedure is characterized by `PrincipalDiagnosisDesc`, `PrincipalDiagnosis`, `RevenueCodeDesc`, `RevenueCode`, `TypeFlag` and `TotalExpenses`.

### 6. Each medical procedure involves: `MemberName`, `MemberID`, `County`, `HospitalName`, `HospitalType`, `StartDate`, `EndDate`.
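
The claim/item structure described above can be checked with a short sketch. It uses a toy frame, not the real dataset: the first three rows reuse values from the notebook's own output, while the last row and the names `toy`, `items_per_claim` and `has_duplicates` are hypothetical and introduced only for illustration.

```python
import pandas as pd

# Toy records: three items of one claim (values taken from the notebook output)
# plus one hypothetical claim with a single item.
toy = pd.DataFrame({
    "MedicalClaim": ["c1e3436737c77899"] * 3 + ["hypothetical_claim"],
    "ClaimItem": [18, 21, 10, 1],
    "TotalExpenses": [15.148, 3.073, 123.9, 50.0],
})

# Number of distinct procedures (items) recorded under each medical event.
items_per_claim = toy.groupby("MedicalClaim")["ClaimItem"].nunique()

# True if any (MedicalClaim, ClaimItem) pair appears more than once.
has_duplicates = toy.duplicated(subset=["MedicalClaim", "ClaimItem"]).any()

print(items_per_claim)
print("Duplicated claim/item pairs:", has_duplicates)
```

The same two lines could be run on `HealthCareDataSet` once it is loaded to see how many items each claim contains.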


## 1. Library Import

In [1]:
import pandas as pd
import warnings
import numpy as np
import seaborn as sns
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt


In [2]:
warnings.simplefilter('ignore')

## 2. Data loading and DataFrame creation

In [3]:
HealthCareDataSet=pd.read_csv("https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/HealthcareDataset_PublicRelease.csv",sep=',',parse_dates=['StartDate','EndDate','BirthDate'])

In [4]:
HealthCareDataSet.head(3)

Unnamed: 0,Id,MemberName,MemberID,County,MedicalClaim,ClaimItem,HospitalName,HospitalType,StartDate,EndDate,PrincipalDiagnosisDesc,PrincipalDiagnosis,RevenueCodeDesc,RevenueCode,TypeFlag,BirthDate,TotalExpenses
0,634363,e659f3f4,6a380a28,6f943458,c1e3436737c77899,18,04b77561,HOSPITAL,2020-01-08,2020-01-08,Epigastric pain,R10.13,DRUGS REQUIRE SPECIFIC ID: DRUGS REQUIRING DET...,636.0,ER,1967-05-13,15.148
1,634364,e659f3f4,6a380a28,6f943458,c1e3436737c77899,21,04b77561,HOSPITAL,2020-01-08,2020-01-08,Epigastric pain,R10.13,DRUGS REQUIRE SPECIFIC ID: DRUGS REQUIRING DET...,636.0,ER,1967-05-13,3.073
2,634387,e659f3f4,6a380a28,6f943458,c1e3436737c77899,10,04b77561,HOSPITAL,2020-01-08,2020-01-08,Epigastric pain,R10.13,LABORATORY - CLINICAL DIAGNOSTIC: HEMATOLOGY,305.0,ER,1967-05-13,123.9


In [5]:
HealthCareDataSet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52563 entries, 0 to 52562
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Id                      52563 non-null  int64         
 1   MemberName              52563 non-null  object        
 2   MemberID                52563 non-null  object        
 3   County                  52563 non-null  object        
 4   MedicalClaim            52563 non-null  object        
 5   ClaimItem               52563 non-null  int64         
 6   HospitalName            52563 non-null  object        
 7   HospitalType            52563 non-null  object        
 8   StartDate               52563 non-null  datetime64[ns]
 9   EndDate                 52563 non-null  datetime64[ns]
 10  PrincipalDiagnosisDesc  52563 non-null  object        
 11  PrincipalDiagnosis      52563 non-null  object        
 12  RevenueCodeDesc         52561 non-null  object

In [6]:
# We need to compute the variable: "AgeAtMedicalEvent"
HealthCareDataSet['AgeAtMedicalEvent']=(HealthCareDataSet['StartDate']-HealthCareDataSet['BirthDate'])
HealthCareDataSet['AgeAtMedicalEvent']=HealthCareDataSet['AgeAtMedicalEvent'].dt.total_seconds() / (365.25 * 24 * 60 * 60)

In [7]:
## We need to compute the duration of each Medical Treatment
HealthCareDataSet['MedicalTreatmentDuration']=(HealthCareDataSet['EndDate']-HealthCareDataSet['StartDate']).dt.days
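
The two derived variables can be sanity-checked on a toy frame. This is a minimal sketch: the first row reuses the dates of the first record shown by `head(3)`, the second row is hypothetical, and the names `toy` and `seconds_per_year` are introduced only for illustration.

```python
import pandas as pd

# Row 0 reuses dates from the head(3) output; row 1 is hypothetical.
toy = pd.DataFrame({
    "BirthDate": pd.to_datetime(["1967-05-13", "1950-01-01"]),
    "StartDate": pd.to_datetime(["2020-01-08", "2020-03-01"]),
    "EndDate": pd.to_datetime(["2020-01-08", "2020-03-08"]),
})

# Age in years: elapsed seconds divided by the seconds in an average year.
seconds_per_year = 365.25 * 24 * 60 * 60
toy["AgeAtMedicalEvent"] = (
    (toy["StartDate"] - toy["BirthDate"]).dt.total_seconds() / seconds_per_year
)

# Duration in whole days between the start and the end of the treatment.
toy["MedicalTreatmentDuration"] = (toy["EndDate"] - toy["StartDate"]).dt.days

print(toy[["AgeAtMedicalEvent", "MedicalTreatmentDuration"]])
```

Note the units: `AgeAtMedicalEvent` is measured in years and `MedicalTreatmentDuration` in days, so a treatment that starts and ends on the same date has a duration of 0.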

### 3. Impact of age and medical treatment duration on total costs
#### We are interested in determining the impact of two factors, age and medical treatment duration, on the total cost of medical interventions. To do this we fit a model regressing the total expenses (aggregated per medical claim) on these two variables.

In [8]:
HealthCareDataSet[['MedicalClaim','TotalExpenses','AgeAtMedicalEvent','MedicalTreatmentDuration']]

Unnamed: 0,MedicalClaim,TotalExpenses,AgeAtMedicalEvent,MedicalTreatmentDuration
0,c1e3436737c77899,15.148,52.657084,0
1,c1e3436737c77899,3.073,52.657084,0
2,c1e3436737c77899,123.900,52.657084,0
3,c1e3436737c77899,7.511,52.657084,0
4,c1e3436737c77899,8.631,52.657084,0
...,...,...,...,...
52558,90e8ae169cbba3bd,2436.000,80.637919,7
52559,8b6a8d2720d16e97,2075.500,70.258727,4
52560,8b6a8d2720d16e97,865.900,70.258727,4
52561,8b6a8d2720d16e97,665.000,70.258727,4


In [9]:
# We compute the total expenses incurred by each MedicalClaim (sum over its items);
# duration and age are averaged over the items of the claim
HealthCareDataSetGroupedByMedicalClaim=HealthCareDataSet.groupby(['MedicalClaim']).agg({'TotalExpenses':'sum','MedicalTreatmentDuration':'mean','AgeAtMedicalEvent':'mean'})
HealthCareDataSetGroupedByMedicalClaim.rename(columns={'TotalExpenses':'TotalExpensesPerClaim'},inplace=True)
HealthCareDataSetGroupedByMedicalClaim.head(3)

Unnamed: 0_level_0,TotalExpensesPerClaim,MedicalTreatmentDuration,AgeAtMedicalEvent
MedicalClaim,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0012a8eb3c2be5f5,4668.692,0.0,64.232717
002fd7d73d8060f1,53501.259,6.0,74.863792
003886fc8ec986d4,17115.714,0.0,64.380561
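
Taking the mean of the duration and the age only makes sense if both are constant within a claim. The sketch below shows one way to check that, on a toy frame whose claim labels `A` and `B` are hypothetical; it also uses pandas named aggregation as an alternative to `agg` followed by `rename`. It is an illustration under these assumptions, not a replacement for the cell above.

```python
import pandas as pd

# Toy items for two hypothetical claims, "A" and "B".
toy = pd.DataFrame({
    "MedicalClaim": ["A", "A", "B", "B"],
    "TotalExpenses": [15.148, 3.073, 2075.5, 865.9],
    "MedicalTreatmentDuration": [0, 0, 4, 4],
    "AgeAtMedicalEvent": [52.657084, 52.657084, 70.258727, 70.258727],
})

# Count distinct values per claim; 1 everywhere means the mean changes nothing.
distinct = toy.groupby("MedicalClaim")[
    ["MedicalTreatmentDuration", "AgeAtMedicalEvent"]
].nunique()
constant_within_claim = bool((distinct == 1).all().all())

# Named aggregation: the output column names are set directly.
per_claim = toy.groupby("MedicalClaim").agg(
    TotalExpensesPerClaim=("TotalExpenses", "sum"),
    MedicalTreatmentDuration=("MedicalTreatmentDuration", "mean"),
    AgeAtMedicalEvent=("AgeAtMedicalEvent", "mean"),
)

print("Constant within claim:", constant_within_claim)
print(per_claim)
```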


#### 3.1. Model Fit
##### We impose a simple linear model.
##### We specify `TotalExpensesPerClaim` as the response variable and set `AgeAtMedicalEvent` and `MedicalTreatmentDuration` as the independent variables.
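
Written out for claim $i$, the model specified above takes the following form. The symbols $\beta_0$, $\beta_1$, $\beta_2$ (the unknown coefficients) and the subscript $i$ are standard notation assumed here; $u_i$ is the error term.

```latex
TotalExpensesPerClaim_i = \beta_0 + \beta_1 \, AgeAtMedicalEvent_i + \beta_2 \, MedicalTreatmentDuration_i + u_i
```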

In [10]:

reg = smf.ols(formula='TotalExpensesPerClaim ~ AgeAtMedicalEvent+MedicalTreatmentDuration', data=HealthCareDataSetGroupedByMedicalClaim)

In [11]:
#We fit the model
results = reg.fit()

In [12]:
b = results.params
print(f'b: \n{b}\n')

b: 
Intercept                    7505.618861
AgeAtMedicalEvent              82.368437
MedicalTreatmentDuration    11281.455338
dtype: float64



In [13]:
results.rsquared

0.6172571150786659

#### 3.2. Model Interpretation
##### Based on the previous results, we have fitted the following model:

$ TotalExpensesPerClaim = 7505.62 + 82.37 \cdot AgeAtMedicalEvent + 11281.46 \cdot MedicalTreatmentDuration + u $

#### This means that, holding `MedicalTreatmentDuration` constant, an increment of one unit (one year) in the variable `AgeAtMedicalEvent` is associated with an increase of about 82.37 US Dollars in the variable `TotalExpensesPerClaim`.

#### Likewise, holding `AgeAtMedicalEvent` constant, an increment of one unit (one day) in the variable `MedicalTreatmentDuration` is associated with an increase of about 11281.46 US Dollars in the variable `TotalExpensesPerClaim`.


#### The value of R-squared is 0.617, which means that our model explains about 62% of the total variance of `TotalExpensesPerClaim`.
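
As a worked example of how the coefficients are read, the sketch below plugs the fitted values printed in cell 12 into the model equation. The patient profile (65 years old, 3-day treatment) and the helper `predict_expenses` are hypothetical, chosen only for illustration.

```python
# Fitted coefficients, copied from the output of `results.params`.
intercept = 7505.618861
b_age = 82.368437
b_duration = 11281.455338

def predict_expenses(age_years, duration_days):
    """Point prediction of TotalExpensesPerClaim from the fitted linear model."""
    return intercept + b_age * age_years + b_duration * duration_days

# Hypothetical claim: a 65-year-old member with a 3-day treatment.
prediction = predict_expenses(65, 3)

# One extra day, age held constant, shifts the prediction by the duration coefficient.
extra_day_effect = predict_expenses(65, 4) - predict_expenses(65, 3)

print(f"Predicted expenses: {prediction:.2f}")
print(f"Effect of one additional day: {extra_day_effect:.2f}")
```

With the fitted `results` object available, `results.predict` on a small DataFrame holding the same two columns should give the same point prediction.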


#### 3.3. t-Test
##### We perform a t-test for each independent variable under consideration. The good news is that the library `statsmodels` does it for us.


In [14]:
results.summary()

0,1,2,3
Dep. Variable:,TotalExpensesPerClaim,R-squared:,0.617
Model:,OLS,Adj. R-squared:,0.617
Method:,Least Squares,F-statistic:,2708.0
Date:,"Tue, 30 Jul 2024",Prob (F-statistic):,0.0
Time:,10:52:55,Log-Likelihood:,-40622.0
No. Observations:,3361,AIC:,81250.0
Df Residuals:,3358,BIC:,81270.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,7505.6189,4164.397,1.802,0.072,-659.393,1.57e+04
AgeAtMedicalEvent,82.3684,58.001,1.420,0.156,-31.353,196.090
MedicalTreatmentDuration,1.128e+04,154.568,72.987,0.000,1.1e+04,1.16e+04

0,1,2,3
Omnibus:,2391.549,Durbin-Watson:,2.007
Prob(Omnibus):,0.0,Jarque-Bera (JB):,136842.851
Skew:,2.774,Prob(JB):,0.0
Kurtosis:,33.763,Cond. No.,406.0


#### Based on the previous table we conclude that:
#### (1) the p-value associated with the factor `AgeAtMedicalEvent` (0.156) is larger than the usual 0.05 significance level, and therefore we are unable to reject the null hypothesis that its coefficient is zero. In practice this means that the factor `AgeAtMedicalEvent` is not statistically significant as far as determining the Total Expenses per Claim is concerned, once `MedicalTreatmentDuration` is accounted for.

#### (2) the p-value associated with the factor `MedicalTreatmentDuration` is close to zero (reported as 0.000) and therefore we REJECT the null hypothesis that its coefficient is zero. In practice this means that the factor `MedicalTreatmentDuration` IS statistically significant as far as determining the Total Expenses per Claim is concerned.
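
The t statistics in the summary table can be reproduced by hand: each one is the coefficient divided by its standard error, and the two-sided p-value comes from a Student t distribution with the residual degrees of freedom. The sketch below copies its inputs from the table above; the use of `scipy.stats` and the variable names are assumptions made for this illustration.

```python
from scipy import stats

# (coefficient, standard error) pairs copied from the summary table.
estimates = {
    "AgeAtMedicalEvent": (82.3684, 58.001),
    "MedicalTreatmentDuration": (11281.455338, 154.568),
}
df_resid = 3358  # "Df Residuals" in the summary table

t_stats = {}
p_values = {}
for name, (coef, std_err) in estimates.items():
    # t statistic for H0: coefficient = 0
    t_stats[name] = coef / std_err
    # Two-sided p-value from the t distribution's survival function.
    p_values[name] = 2 * stats.t.sf(abs(t_stats[name]), df_resid)
    print(f"{name}: t = {t_stats[name]:.3f}, p = {p_values[name]:.3f}")
```

On the fitted model, the same quantities are available as `results.tvalues` and `results.pvalues`.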



#### 3.4. F-Test
##### We perform an F-test for the whole model. The good news is that the library `statsmodels` does it for us.

In [15]:
results.summary()

0,1,2,3
Dep. Variable:,TotalExpensesPerClaim,R-squared:,0.617
Model:,OLS,Adj. R-squared:,0.617
Method:,Least Squares,F-statistic:,2708.0
Date:,"Tue, 30 Jul 2024",Prob (F-statistic):,0.0
Time:,10:52:56,Log-Likelihood:,-40622.0
No. Observations:,3361,AIC:,81250.0
Df Residuals:,3358,BIC:,81270.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,7505.6189,4164.397,1.802,0.072,-659.393,1.57e+04
AgeAtMedicalEvent,82.3684,58.001,1.420,0.156,-31.353,196.090
MedicalTreatmentDuration,1.128e+04,154.568,72.987,0.000,1.1e+04,1.16e+04

0,1,2,3
Omnibus:,2391.549,Durbin-Watson:,2.007
Prob(Omnibus):,0.0,Jarque-Bera (JB):,136842.851
Skew:,2.774,Prob(JB):,0.0
Kurtosis:,33.763,Cond. No.,406.0


#### Given the F-statistic (2708) and its associated p-value (reported as 0.00) shown in the previous table, we conclude that the model is statistically significant: we reject the null hypothesis that all slope coefficients (Betas) are simultaneously zero.
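
The overall F statistic can also be recovered from R-squared, the number of observations and the number of regressors. The sketch below uses the values reported above; `scipy.stats` and the variable names are assumptions made for this illustration.

```python
from scipy import stats

r_squared = 0.6172571150786659  # from results.rsquared
n_obs = 3361                    # "No. Observations" in the summary table
k = 2                           # number of regressors ("Df Model")

# Residual degrees of freedom: observations minus regressors minus the intercept.
df_resid = n_obs - k - 1

# F statistic for H0: all slope coefficients are zero.
f_stat = (r_squared / k) / ((1 - r_squared) / df_resid)

# p-value: upper tail of the F distribution with (k, df_resid) degrees of freedom.
p_value = stats.f.sf(f_stat, k, df_resid)

print(f"F = {f_stat:.1f}, p-value = {p_value:.3g}")
```

On the fitted model, `results.fvalue` and `results.f_pvalue` expose the same two numbers.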