# Multi-level Modeling

From [http://openonlinecourses.com/causalanalysis/Multi%20Level%20Regression.asp](http://openonlinecourses.com/causalanalysis/Multi%20Level%20Regression.asp).

## Question 1

### Question 1.1

Load the data.

In [1]:
import pandas as pd

df = pd.read_excel('./TraumaData.xlsx').drop(columns=['Unnamed: 0'])
df.shape

(50, 6)

Notice that `Hosp` is categorical.

In [2]:
df.head()

,Prob Survival,Severe Burn,Head Injury,65+ Years,Male,Hosp
0,0.694551,1,1,1,1,A
1,0.733619,1,1,1,0,A
2,0.785537,1,1,0,1,A
3,0.81877,1,0,1,1,A
4,0.868275,1,0,0,1,A


We need to one-hot encode the hospitals in the `Hosp` column.

In [3]:
ohe_df = pd.get_dummies(df['Hosp'], prefix='hosp')
ohe_df.shape

(50, 5)

In [4]:
ohe_df.head()

,hosp_A,hosp_B,hosp_C,hosp_D,hosp_E
0,1,0,0,0,0
1,1,0,0,0,0
2,1,0,0,0,0
3,1,0,0,0,0
4,1,0,0,0,0


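Now, join the original data to the one-hot encoded data. We drop the original categorical field `Hosp` and one of the one-hot encoded fields, `hosp_A`. Dropping `hosp_A` makes hospital A the reference category and avoids perfect collinearity with the intercept.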

In [5]:
df = df.join(ohe_df).drop(columns=['Hosp', 'hosp_A'])
df.shape

(50, 9)

In [6]:
df.head()

,Prob Survival,Severe Burn,Head Injury,65+ Years,Male,hosp_B,hosp_C,hosp_D,hosp_E
0,0.694551,1,1,1,1,0,0,0,0
1,0.733619,1,1,1,0,0,0,0,0
2,0.785537,1,1,0,1,0,0,0,0
3,0.81877,1,0,1,1,0,0,0,0
4,0.868275,1,0,0,1,0,0,0,0
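
The encode-then-drop step can also be done in a single call. As a minimal sketch on a made-up frame (the `toy` name and its values are illustrative, not the trauma data), `pd.get_dummies` accepts `drop_first=True`, which omits the first level so that it serves as the reference category:

```python
import pandas as pd

# Hypothetical toy data, not the trauma dataset.
toy = pd.DataFrame({'Hosp': ['A', 'B', 'C', 'A']})

# drop_first=True omits the first level ('A'), which becomes the reference.
# dtype=int keeps the 0/1 display used in this notebook.
encoded = pd.get_dummies(toy['Hosp'], prefix='hosp', drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Rows belonging to hospital A are then all zeros across the remaining dummy columns.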


Let's pull out our design matrix `X` and response variable `y` from the dataframe.

In [7]:
X, y = df[[c for c in df.columns if c != 'Prob Survival']], df['Prob Survival']

X.shape, y.shape

((50, 8), (50,))

Now, perform regression.

In [8]:
from sklearn.linear_model import LinearRegression

m1 = LinearRegression()
m1.fit(X, y)

m1.intercept_, m1.coef_

(1.1621835879267253,
 array([-0.24542462, -0.18178176, -0.10929294, -0.02670298, -0.11793209,
        -0.22613064, -0.27645   , -0.34113592]))

We can align the variable names with the coefficients.

In [9]:
coefficients = pd.Series(m1.coef_, X.columns)
coefficients

Severe Burn   -0.245425
Head Injury   -0.181782
65+ Years     -0.109293
Male          -0.026703
hosp_B        -0.117932
hosp_C        -0.226131
hosp_D        -0.276450
hosp_E        -0.341136
dtype: float64
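
Because `hosp_A` was dropped, the hospital coefficients are offsets from hospital A rather than absolute levels. A small numeric sketch with invented numbers (the arrays below are hypothetical, not the trauma data) shows the general pattern: when the only predictors are group dummies, the intercept is the reference group's mean and each coefficient is that group's mean minus the reference mean.

```python
import numpy as np

# Hypothetical outcomes for three groups A, B, C (two observations each).
y = np.array([1.0, 0.8, 0.6, 0.4, 0.3, 0.1])
is_b = np.array([0, 0, 1, 1, 0, 0])
is_c = np.array([0, 0, 0, 0, 1, 1])

# Design matrix: intercept plus dummies for B and C; A is the reference.
X = np.column_stack([np.ones_like(y), is_b, is_c])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

intercept, coef_b, coef_c = beta
print(intercept, coef_b, coef_c)
```

Here the group means are 0.9, 0.5 and 0.2, so the fit recovers an intercept of 0.9 and offsets of -0.4 and -0.7.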

### Question 1.2

In [10]:
hosp_df = pd.DataFrame({
    'tertiary_center': [1, 1, 0, 0, 0],
    'burn_center': [0, 1, 0, 0, 0],
    'y': [0] + list(m1.intercept_ - coefficients.iloc[4:])
}, index=['A', 'B', 'C', 'D', 'E'])

hosp_df.shape

(5, 3)

In [11]:
hosp_df

,tertiary_center,burn_center,y
A,1,0,0.0
B,1,1,1.280116
C,0,0,1.388314
D,0,0,1.438634
E,0,0,1.50332


In [12]:
X, y = hosp_df[['tertiary_center', 'burn_center']], hosp_df['y']

X.shape, y.shape

((5, 2), (5,))

In [13]:
m2 = LinearRegression()
m2.fit(X, y)

m2.intercept_, m2.coef_

(1.4434224440280423, array([-1.44342244,  1.28011568]))
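
For comparison, a common convention for turning dummy-coded coefficients into one level per hospital is to place the reference hospital at the intercept and every other hospital at intercept plus its coefficient. This is an assumption on my part and differs from the `intercept - coefficient` construction used in cell 10; the sketch below only illustrates that alternative, using the rounded values reported by the first regression.

```python
import pandas as pd

# Values copied (rounded) from the first regression's output above.
intercept = 1.162184
hosp_coef = pd.Series({'B': -0.117932, 'C': -0.226131,
                       'D': -0.276450, 'E': -0.341136})

# Assumed convention: reference hospital A sits at the intercept,
# every other hospital at intercept + its dummy coefficient.
adjusted = pd.concat([pd.Series({'A': intercept}), intercept + hosp_coef])
print(adjusted.round(6))
```

Under this convention the ordering of hospitals follows the sign of the first-stage coefficients directly.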

### Question 1.3

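- First regression: relative to hospital A (the omitted reference), hospitals B through E all have negative coefficients, so each is associated with a lower probability of survival for patients with the same characteristics.
- Second regression: with patient characteristics already adjusted for in the first regression, tertiary-center status has a negative coefficient (-1.44) and burn-center status a positive one (1.28) on the hospital-level response `y`.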

## Question 2

Since the Excel file is password-protected, export it to CSV format first and then use it.

In [14]:
df = pd.read_csv('./LungCancerComorbiditesWithMedicalCenters.csv').set_index('ID')
df.shape

(829799, 45)

Take a peek at the data.

In [15]:
df.head()

ID,Dead in 6 Months,Medical Center,S-Days,LungCancer,I4019,I496,I2724,I3051,I486,I53081,...,I7866,I51889,I78659,I78791,IV4581,IE8490,I07054,I30390,I2875,IV4582
538831,0,Center1,1.333334,0,1,0,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
476484,0,Center 1,0.0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
777073,0,Center1,0.0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
238242,0,Center1,0.0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
175895,1,Center1,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Check the types. Notice `Medical Center` is categorical. We will have to one-hot encode that field.

In [16]:
df.dtypes

Dead in 6 Months      int64
Medical Center       object
S-Days              float64
LungCancer            int64
I4019                 int64
I496                  int64
I2724                 int64
I3051                 int64
I486                  int64
I53081                int64
I41401                int64
I2859                 int64
I42731                int64
I60000                int64
I311                  int64
I49121                int64
I2761                 int64
I4280                 int64
I27651                int64
I2768                 int64
I5990                 int64
I40390                int64
IE8497                int64
I30981                int64
I5859                 int64
I30000                int64
I41400                int64
I4439                 int64
I2449                 int64
I7242                 int64
IV5861                int64
I25000                int64
I42789                int64
I78820                int64
I2809                 int64
I7866               

Let's look at the distribution of values for `Medical Center`. One row is labeled `Center 1` (with a space) instead of `Center1`, so the values need to be normalized.

In [17]:
df['Medical Center'].value_counts()

Center4     189294
Center3     120954
Center7     118147
Center5     113861
Center1      96810
Center6      95373
Center2      95359
Center 1         1
Name: Medical Center, dtype: int64

In [18]:
df['Medical Center'] = df['Medical Center'].apply(lambda s: s.strip().replace('Center', '').strip())

I can live with these values: the stray `Center 1` row is now counted with center `1`.

In [19]:
df['Medical Center'].value_counts()

4    189294
3    120954
7    118147
5    113861
1     96811
6     95373
2     95359
Name: Medical Center, dtype: int64
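
The same cleanup can be written with pandas' vectorized string methods instead of `apply` with a lambda. A minimal sketch on a hypothetical Series (the `labels` values are made up to mimic the inconsistency above):

```python
import pandas as pd

# Hypothetical labels mimicking the inconsistency seen above.
labels = pd.Series(['Center1', 'Center 1', ' Center2'])

# Remove the literal prefix, then trim any leftover whitespace.
cleaned = labels.str.replace('Center', '', regex=False).str.strip()
print(cleaned.tolist())
```

Unlike the lambda, the `.str` methods pass missing values through as missing rather than raising an error.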

One-hot encode `Medical Center`.

In [20]:
ohe_df = pd.get_dummies(df['Medical Center'], prefix='medcenter')
ohe_df.shape

(829799, 7)

In [21]:
ohe_df.head()

ID,medcenter_1,medcenter_2,medcenter_3,medcenter_4,medcenter_5,medcenter_6,medcenter_7
538831,1,0,0,0,0,0,0
476484,1,0,0,0,0,0,0
777073,1,0,0,0,0,0,0
238242,1,0,0,0,0,0,0
175895,1,0,0,0,0,0,0


Join the raw data with the one-hot encoded data. As before, we drop the original column `Medical Center` and one of the one-hot encoded columns, `medcenter_1`, which makes center 1 the reference category.

In [22]:
df = df.join(ohe_df).drop(columns=['Medical Center', 'medcenter_1'])
df.shape

(829799, 50)

In [23]:
df.head()

ID,Dead in 6 Months,S-Days,LungCancer,I4019,I496,I2724,I3051,I486,I53081,I41401,...,I07054,I30390,I2875,IV4582,medcenter_2,medcenter_3,medcenter_4,medcenter_5,medcenter_6,medcenter_7
538831,0,1.333334,0,1,0,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
476484,0,0.0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
777073,0,0.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
238242,0,0.0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
175895,1,,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Some rows contain null values (for example, `S-Days` is missing for ID 175895). Let's drop those rows too.

In [24]:
df = df.dropna()
df.shape

(700589, 50)
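
The shapes before and after imply that 829799 - 700589 = 129210 rows were dropped. Before dropping, it can help to see which columns the nulls come from. A minimal sketch on a hypothetical frame (the `toy` data is invented; only the column names echo the real ones):

```python
import pandas as pd

# Hypothetical frame with one missing S-Days value.
toy = pd.DataFrame({'S-Days': [1.5, None, 0.0], 'LungCancer': [0, 1, 0]})

# Count missing values per column before deciding what to drop.
missing_per_column = toy.isna().sum()

# Drop incomplete rows and record how many were removed.
complete = toy.dropna()
rows_dropped = len(toy) - len(complete)
print(missing_per_column.to_dict(), rows_dropped)
```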

Create your `X` and `y`.

In [25]:
X, y = df[[c for c in df.columns if c != 'S-Days']], df['S-Days']

X.shape, y.shape

((700589, 49), (700589,))

Apply linear regression.

In [26]:
m1 = LinearRegression()
m1.fit(X, y)

m1.intercept_, m1.coef_

(-0.2346861402446493,
 array([ 0.71056139, -0.68318404,  0.10749041,  0.19285794,  0.12254476,
         0.30161609,  0.25023024,  0.25255974,  0.09059173,  0.34512468,
         0.04303098,  0.18489607,  0.43137038,  0.21238871,  0.16642979,
         0.17940978,  0.35198414,  0.29858265,  0.51888277,  0.07999001,
         0.48999786,  0.30304943,  0.24558814,  0.38395177,  0.19998797,
         0.21354118,  0.14161079,  0.38028697,  0.39644764,  0.1226767 ,
         0.29992952,  0.24649921,  0.35576579, -0.14849044,  0.30384105,
         0.3498786 ,  0.4199639 ,  0.04807463,  0.55396408,  0.51581564,
         0.74957829,  0.26709767,  0.21159049,  0.00557548, -0.00075857,
         0.00318654,  0.00202615,  0.00334539,  0.00577512]))

Align the field names with coefficients.

In [27]:
s = pd.Series(m1.coef_, X.columns)
s

Dead in 6 Months    0.710561
LungCancer         -0.683184
I4019               0.107490
I496                0.192858
I2724               0.122545
I3051               0.301616
I486                0.250230
I53081              0.252560
I41401              0.090592
I2859               0.345125
I42731              0.043031
I60000              0.184896
I311                0.431370
I49121              0.212389
I2761               0.166430
I4280               0.179410
I27651              0.351984
I2768               0.298583
I5990               0.518883
I40390              0.079990
IE8497              0.489998
I30981              0.303049
I5859               0.245588
I30000              0.383952
I41400              0.199988
I4439               0.213541
I2449               0.141611
I7242               0.380287
IV5861              0.396448
I25000              0.122677
I42789              0.299930
I78820              0.246499
I2809               0.355766
I7866              -0.148490
I51889        

In [28]:
hosp_df = pd.DataFrame({
    'avg_travel_distance': [50, 80, 70, 70, 80, 70, 80],
    'pct_satisfied': [79, 82, 80, 79, 79, 83, 81],
    'y': [0] + list(m1.intercept_ - s.iloc[43:])
}, index=[f'Hosp{i}' for i in range(1, 8)])

hosp_df.shape

(7, 3)

In [29]:
hosp_df

,avg_travel_distance,pct_satisfied,y
Hosp1,50,79,0.0
Hosp2,80,82,-0.240262
Hosp3,70,80,-0.233928
Hosp4,70,79,-0.237873
Hosp5,80,79,-0.236712
Hosp6,70,83,-0.238032
Hosp7,80,81,-0.240461
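
The medical-center coefficients were picked out above by position (index 43 onward), which silently breaks if the column order changes. One alternative, sketched here on a hypothetical coefficient Series (the `coefs` values are illustrative), is to select them by name with `Series.filter`:

```python
import pandas as pd

# Hypothetical coefficient series; names follow the pattern used above.
coefs = pd.Series({'LungCancer': -0.68, 'I4019': 0.11,
                   'medcenter_2': 0.006, 'medcenter_3': -0.001})

# Keep only the entries whose label contains 'medcenter_'.
center_coefs = coefs.filter(like='medcenter_')
print(center_coefs.index.tolist())
```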


In [30]:
X, y = hosp_df[['avg_travel_distance', 'pct_satisfied']], hosp_df['y']

m2 = LinearRegression()
m2.fit(X, y)

m2.intercept_, m2.coef_

(0.7814067077284637, array([-0.00718334, -0.00587113]))

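What can we say?

- Regression 1: the medical-center coefficients are all very small (at most about 0.006 in absolute value), so relative to center 1 the centers differ very little in the response (`S-Days`) once the other predictors are accounted for.
- Regression 2: the coefficients for average travel distance and percent satisfied are also very small.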

## Question 3

Manually construct the data.

In [31]:
clinician_df = pd.DataFrame({
    'previous_mi': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    'chf': [1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1],
    'los': [6, 5, 6, 6, 6, 5, 6, 5, 6, 5, 6, 4, 4, 4, 6, 6, 6, 5, 5, 6]
})

peer_df = pd.DataFrame({
    'previous_mi': [1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
    'chf': [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1],
    'los': [5, 5, 4, 3, 4, 4, 5, 5, 5, 5, 5, 3, 4, 4, 4, 4, 4, 3, 4, 5, 5, 5, 4, 4]
})

In [32]:
clinician_df.head()

,previous_mi,chf,los
0,1,1,6
1,1,0,5
2,1,1,6
3,1,1,6
4,1,1,6


In [33]:
peer_df.head()

,previous_mi,chf,los
0,1,1,5
1,1,1,5
2,0,1,4
3,0,0,3
4,0,1,4


Regress $y \sim X$ for each of the data sets.

In [34]:
X, y = clinician_df[['previous_mi', 'chf']], clinician_df['los']

m1 = LinearRegression()
m1.fit(X, y)

m1.intercept_, m1.coef_

(3.0, array([2., 1.]))

In [35]:
X, y = peer_df[['previous_mi', 'chf']], peer_df['los']

m2 = LinearRegression()
m2.fit(X, y)

m2.intercept_, m2.coef_

(3.0, array([1., 1.]))

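Notice that the intercepts are the same (3.0), and so are the `chf` coefficients (1.0). For patients with no previous MI, the two groups have the same expected length of stay. The `previous_mi` coefficient differs, though: 2 for the clinician versus 1 for the peers, so the clinician's patients with a previous MI stay about one day longer than comparable peer patients.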
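
The two fits share an intercept and a `chf` coefficient but not the `previous_mi` coefficient (2 versus 1). One way to summarize what that means for the clinician's own case mix, offered here as a sketch rather than as the course's prescribed method, is to apply the peer model to the clinician's patients and compare the expected length of stay with the observed one. The peer coefficients are taken from the second fit above.

```python
import numpy as np

# Clinician's patients, copied from the data constructed above.
previous_mi = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1])
chf = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1])
los = np.array([6, 5, 6, 6, 6, 5, 6, 5, 6, 5, 6, 4, 4, 4, 6, 6, 6, 5, 5, 6])

# Peer model from the second fit above: LOS = 3 + 1*previous_mi + 1*chf.
peer_intercept, peer_mi, peer_chf = 3.0, 1.0, 1.0

# Observed mean LOS versus the mean LOS peers would be expected to
# produce for the same patients.
observed = los.mean()
expected = (peer_intercept + peer_mi * previous_mi + peer_chf * chf).mean()
excess = observed - expected
print(observed, expected, excess)
```

The excess comes entirely from the patients with a previous MI, since that is the only coefficient on which the two models disagree.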