# Healthcare Investments: Visualization & EDA

## **Content**
1. [Preparation](#1)
1. [Data overview](#2)
1. [EDA](#3)
1. [Modeling](#4)

The average length of stay (ALOS) is used as an indicator of hospital efficiency. A shorter ALOS suggests that admitted patients move through the hospital more efficiently and that fewer patients have to wait a long time for admission. It also tends to reduce hospital costs.

### Reference

* [OECD Data](https://data.oecd.org/healthcare/length-of-hospital-stay.htm)
* [Health at a Glance 2019 : OECD Indicators](https://www.oecd-ilibrary.org/sites/0d8bb30a-en/index.html?itemId=/content/component/0d8bb30a-en)
* [Decreasing the Patient Length of Stay](https://centrak.com/blog-decreasing-patient-length-of-stay/#:~:text=The%20average%20length%20of%20stay,cost%20of%20%2410%2C400%20per%20day.)

<a id="1"></a> <br>
# <div class="alert alert-block alert-info">Preparation</div>

First, I'll install and import the required libraries.

In [None]:
!pip install linearmodels

In [None]:
import math
import warnings
warnings.filterwarnings('ignore')

from IPython.display import YouTubeVideo
from linearmodels.panel import PanelOLS, RandomEffects
from linearmodels.panel.data import PanelData
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

%matplotlib inline

The dataset consists of a single CSV file.

In [None]:
! ls ../input/healthcare-investments-and-length-of-hospital-stay/

I'll load this file into a DataFrame.

In [None]:
df_hihs = pd.read_csv("../input/healthcare-investments-and-length-of-hospital-stay/Healthcare_Investments_and_Hospital_Stay.csv")

<a id="2"></a> <br>
# <div class="alert alert-block alert-success">Data overview</div>

Let's look at the data.

This is panel data. As we will see later, the panel is unbalanced because the number of observations varies by country.
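
As a quick illustration of what "unbalanced" means here, the following is a minimal sketch on a toy DataFrame. The values are made up and only the column names `Location`, `Time` and `Hospital_Stay` are borrowed from the dataset. It counts the periods observed per country and checks whether the counts agree.

```python
import pandas as pd

# Toy panel with made-up values (not the real dataset):
# country A has 3 years of data, country B only 2.
toy = pd.DataFrame({
    "Location": ["A", "A", "A", "B", "B"],
    "Time": [2000, 2001, 2002, 2001, 2002],
    "Hospital_Stay": [8.0, 7.8, 7.5, 6.1, 6.0],
})

# Number of distinct periods observed for each country.
obs_per_country = toy.groupby("Location")["Time"].nunique()

# A balanced panel needs (at least) the same number of periods for every country.
is_balanced = obs_per_country.nunique() == 1

print(obs_per_country.to_dict())
print("balanced:", is_balanced)
```

The same two lines should work on `df_hihs` in place of `toy`, assuming the columns are named as above.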

In [None]:
df_hihs.head()

There are 518 records with 6 columns each, and there are no missing values.

In [None]:
df_hihs.info()

### Location

Location indicates which country each record belongs to.

We can see which countries are included and how many records each one has.

In [None]:
locations = set(df_hihs["Location"])
print(f"The data covers {len(locations)} countries.\n")

print("The country breakdown is as follows:")
print(locations)

In [None]:
fig = plt.figure(figsize=(15, 4))
g = sns.countplot(data=df_hihs, x="Location",
                  order = df_hihs['Location'].value_counts().index)
g.set_xticklabels(g.get_xticklabels(), rotation=90)
g.set_title("Records for each location")

### Time

The year in which the data was recorded. More recent years have more records.

In [None]:
print(f'Minimum year: {min(df_hihs["Time"])}')
print(f'Maximum year: {max(df_hihs["Time"])}')

In [None]:
# sns.distplot is deprecated; sns.histplot (seaborn >= 0.11) draws a count histogram
g = sns.histplot(df_hihs["Time"], kde=False, color="darkgoldenrod")
g.set_title("Time distribution")

### ALOS

ALOS is the indicator introduced above. It is generally measured by dividing the total number of days stayed by all inpatients during a year by the number of admissions or discharges. Day cases are excluded. The indicator is presented both for all acute care cases and for childbirth without complications.
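
To make the definition concrete, here is a small worked example. The function name and the hospital figures are hypothetical and only illustrate the ratio described above.

```python
def average_length_of_stay(total_inpatient_days, discharges):
    """ALOS = total inpatient days in a year / number of discharges (or admissions)."""
    if discharges <= 0:
        raise ValueError("discharges must be positive")
    return total_inpatient_days / discharges


# Hypothetical hospital: 36,000 inpatient days and 4,800 discharges in one year.
alos = average_length_of_stay(36_000, 4_800)
print(alos)  # 7.5
```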

In [None]:
g = sns.histplot(df_hihs["Hospital_Stay"], kde=False, color="b")
g.set_title("Hospital_Stay distribution")

### MRI_Units

This indicator is measured as the number of units per 1,000,000 inhabitants.

I've also included a reference video in case anyone is not familiar with MRI.

In [None]:
YouTubeVideo('kmfmGhI8l9E')

In [None]:
g = sns.histplot(df_hihs["MRI_Units"], kde=False, color="g")
g.set_title("MRI_Units distribution")

### CT Scanners

This indicator is measured as the number of units per 1,000,000 inhabitants.

I've also included a reference video for CT scanners.

In [None]:
YouTubeVideo('l9swbAtRRbg')

In [None]:
g = sns.histplot(df_hihs["CT_Scanners"], kde=False, color="y")
g.set_title("CT_Scanners distribution")

### Hospital Beds

The indicator is presented as a total and separately for curative care and psychiatric care. It is measured as the number of beds per 1,000 inhabitants.
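
Note that the equipment indicators are expressed per 1,000,000 inhabitants while beds are expressed per 1,000. The sketch below shows the arithmetic for both scales. The helper `rate_per` and all the numbers are hypothetical and are not taken from the dataset.

```python
def rate_per(count, population, per):
    """Return `count` expressed per `per` inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * per / population


# Hypothetical country with 10 million inhabitants (made-up numbers).
population = 10_000_000
mri_per_million = rate_per(120, population, per=1_000_000)
beds_per_thousand = rate_per(45_000, population, per=1_000)

print(mri_per_million)    # 12.0
print(beds_per_thousand)  # 4.5
```

Because the scales differ by a factor of 1,000, the raw magnitudes of the bed and equipment columns should not be compared directly.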

In [None]:
g = sns.histplot(df_hihs["Hospital_Beds"], kde=False, color="orange")
g.set_title("Hospital_Beds distribution")

<a id="3"></a> <br>
# <div class="alert alert-block alert-success">EDA</div>

First, we look at the correlation between indicators.

No indicator appears to correlate strongly with ALOS. On the other hand, the indicators other than ALOS seem to be highly correlated with one another.

In [None]:
g = sns.heatmap(df_hihs[["Hospital_Beds", "CT_Scanners", "MRI_Units", "Hospital_Stay"]].corr(),
                annot=True, cmap="YlGnBu")
g.set_title("Correlation matrix for each indicator")

Let's look at the distributions in detail.

In [None]:
g = sns.pairplot(df_hihs[["Hospital_Beds", "CT_Scanners", "MRI_Units", "Hospital_Stay"]])
#ax = plt.gca()
#ax.set_title("Pair plot for each indicator")

The scatter plots of ALOS against the other indicators have a curious shape. At first glance they look like a jumble of data points, but they also look like a collection of several roughly proportional data series.

Let's look at this in more detail. Using Plotly, we can find out which points belong to which countries.

In [None]:
fig = px.scatter(df_hihs, x="Time", y="Hospital_Stay",color="Location",
                 hover_data=['Time', "Location"])
fig.show()

Very interestingly, our intuition seems to have been right. Coloring the data points by country shows that each country's data forms a series that slopes downward over time.

Next, we plot the data for each country separately. Assuming that each series can be approximated by a straight line, we also draw a fitted line. Of course, this assumption holds only locally, since no matter how much healthcare modernizes, ALOS will never go below zero.

In [None]:
for loc in locations:
    df_hihs_tmp = df_hihs[df_hihs["Location"]==loc]
    sns.lmplot(x="Time", y="Hospital_Stay", data=df_hihs_tmp).set(xlim=(1990, 2018))
    ax = plt.gca()
    ax.set_title(f"Hospital_Stay vs Time of {loc}")

Overall, the hypothesis that each country's data is approximately linear, at least locally, seems to hold, and the slope is negative. This makes sense: as healthcare modernizes, medical standards and hospitalization systems improve, and ALOS decreases.

However, in some countries the data varies widely. In such countries, the data may not have been collected correctly, or there may be reasons why ALOS has not decreased over time.
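
To put a number on the slopes that the plots above show visually, one option is a degree-1 least-squares fit per country. This is a minimal sketch on made-up toy data; the helper `trend_slope` is hypothetical, and `np.polyfit` is just one of several ways to fit a line.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values: two countries with exactly linear trends.
toy = pd.DataFrame({
    "Location": ["A"] * 4 + ["B"] * 4,
    "Time": [2000, 2001, 2002, 2003] * 2,
    "Hospital_Stay": [9.0, 8.8, 8.6, 8.4, 6.0, 5.9, 5.8, 5.7],
})


def trend_slope(group):
    # Centre the years so the fit stays well conditioned; the slope is unchanged.
    years = group["Time"] - group["Time"].mean()
    slope, _intercept = np.polyfit(years, group["Hospital_Stay"], deg=1)
    return float(slope)


# Change in ALOS (days) per year for each country.
slopes = {loc: trend_slope(group) for loc, group in toy.groupby("Location")}
print(slopes)
```

Applied to `df_hihs`, sorting the resulting slopes would make it easy to spot countries whose ALOS has not decreased.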

## Log Transformation

In [None]:
df_hihs_log = df_hihs.copy()
df_hihs_log["Hospital_Stay"] = np.log(df_hihs_log["Hospital_Stay"])
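
One common reason to work with log(ALOS) is that a linear trend on the log scale can never predict a negative stay, which was the limitation of the straight-line assumption noted earlier. The sketch below uses hypothetical coefficients (`a` and `b` are assumptions, not fitted values) to show the back-transformation and how the slope reads as a percentage change per year.

```python
import math

# Hypothetical trend on the log scale: log(ALOS) = a + b * (year - 2000).
a = math.log(8.0)   # assumed ALOS of 8 days in the year 2000
b = -0.02           # assumed slope per year


def alos_from_log_trend(year):
    # Back-transform: exp() is always positive, so the prediction never crosses zero.
    return math.exp(a + b * (year - 2000))


yearly_factor = math.exp(b)                      # each year multiplies ALOS by this factor
pct_change_per_year = (yearly_factor - 1) * 100  # roughly b * 100 when b is small

print(round(pct_change_per_year, 2))
print(alos_from_log_trend(2000), alos_from_log_trend(2300))
```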

In [None]:
fig = px.scatter(df_hihs_log, x="Time", y="Hospital_Stay",color="Location",
                 hover_data=['Time', "Location"])
fig.show()

In [None]:
for loc in locations:
    df_hihs_tmp = df_hihs_log[df_hihs_log["Location"]==loc]
    sns.lmplot(x="Time", y="Hospital_Stay", data=df_hihs_tmp).set(xlim=(1990, 2018))
    ax = plt.gca()
    ax.set_title(f"log(Hospital_Stay) vs Time of {loc}")

<a id="4"></a> <br>
# <div class="alert alert-block alert-success">Modeling</div>

In [None]:
Location = pd.Categorical(df_hihs_log.Location)
# linearmodels expects panel data indexed by (entity, time)
df_hihs_log = df_hihs_log.set_index(['Location', 'Time'])

In [None]:
df_hihs_log

In [None]:
formula_fe = 'Hospital_Stay ~ CT_Scanners + Hospital_Beds + EntityEffects'
mod_fe = PanelOLS.from_formula(formula_fe, data=df_hihs_log)
result_fe = mod_fe.fit()
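
For intuition about what `EntityEffects` does, the sketch below applies the within transformation by hand: each variable is demeaned by country before the slope is estimated. It uses made-up toy data with a single regressor `x` and is only an illustration of the idea, not a reproduction of how `PanelOLS` is implemented.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values: within each country y falls by 1 per unit of x,
# but the two countries sit at very different levels.
toy = pd.DataFrame({
    "Location": ["A", "A", "A", "B", "B", "B"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [10.0, 9.0, 8.0, 3.0, 2.0, 1.0],
})

# Pooled OLS slope, ignoring the country structure.
beta_pooled = float(np.polyfit(toy["x"], toy["y"], deg=1)[0])

# Within transformation: subtract each country's mean from x and y.
x_within = toy["x"] - toy.groupby("Location")["x"].transform("mean")
y_within = toy["y"] - toy.groupby("Location")["y"].transform("mean")

# OLS slope on the demeaned data (no intercept is needed after demeaning).
beta_within = float((x_within * y_within).sum() / (x_within ** 2).sum())

print(beta_pooled, beta_within)
```

In this toy example the pooled slope is steeper than the within-country slope because it also picks up the level difference between the two countries, which is the kind of effect the entity fixed effects are meant to absorb.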

In [None]:
print(result_fe.summary.tables[1])

In [None]:
durbin_watson(result_fe.resids)
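
As a reminder of what this statistic measures, here is a minimal sketch that computes the Durbin-Watson formula by hand on two made-up residual sequences. The helper name `durbin_watson_stat` is hypothetical. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.

```python
import numpy as np


def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))


# Made-up residual sequences for illustration.
alternating = [1.0, -1.0, 1.0, -1.0]  # sign flips every step
persistent = [1.0, 1.0, -1.0, -1.0]   # neighbouring values tend to agree

dw_alternating = durbin_watson_stat(alternating)
dw_persistent = durbin_watson_stat(persistent)

print(dw_alternating)  # 3.0
print(dw_persistent)   # 1.0
```

One caveat, stated as an assumption about the residual ordering: if the panel residuals are stacked country by country, the successive differences also include jumps from the last year of one country to the first year of the next, so the value is best read as a rough diagnostic.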

In [None]:
exog_vars = ['CT_Scanners', 'Hospital_Beds']
exog = sm.add_constant(df_hihs_log[exog_vars])
mod_ra = RandomEffects(df_hihs_log.Hospital_Stay, exog)
result_ra = mod_ra.fit()

In [None]:
print(result_ra)

In [None]:
durbin_watson(result_ra.resids)