# Hospitalization Hypothesis Testing

## Assignment



As a data scientist at Apollo, your goal is to extract meaningful, actionable insights from patient-level data. These insights can help Apollo hospitals run more efficiently, inform diagnostic and treatment processes, and map the spread of a pandemic.

One of the best examples of data scientists making a meaningful difference at a global level is the response to the COVID-19 pandemic, where they improved information collection, provided ongoing and accurate estimates of infection spread and health system demand, and assessed the effectiveness of government policies.

The company wants to know:

- Which variables are significant in predicting the reason for hospitalization in different regions.
- How well variables such as viral load, smoking, and severity level describe the hospitalization charges.

## Import Necessary Libraries

In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


## Load Data

In [48]:
data = pd.read_csv('/Users/rosiebai/Downloads/datasets-3/apollo_data.csv')

In [49]:
data.head()

Unnamed: 0.1,Unnamed: 0,age,sex,smoker,region,viral load,severity level,hospitalization charges
0,0,19,female,yes,southwest,9.3,0,42212
1,1,18,male,no,southeast,11.26,1,4314
2,2,28,male,no,southeast,11.0,3,11124
3,3,33,male,no,northwest,7.57,0,54961
4,4,32,male,no,northwest,9.63,0,9667
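
The `Unnamed: 0` column in the preview is most likely the row index that was written out when the CSV was saved. A minimal sketch, assuming that is the case: passing `index_col=0` to `pd.read_csv` reads that column as the index so it does not appear as a feature. The inline CSV text is a stand-in built from the five rows shown above, not the real file.

```python
import io

import pandas as pd

# Stand-in for apollo_data.csv: the five preview rows, with the saved index
# stored in an unnamed first column (an assumption about the file layout).
csv_text = """,age,sex,smoker,region,viral load,severity level,hospitalization charges
0,19,female,yes,southwest,9.3,0,42212
1,18,male,no,southeast,11.26,1,4314
2,28,male,no,southeast,11.0,3,11124
3,33,male,no,northwest,7.57,0,54961
4,32,male,no,northwest,9.63,0,9667
"""

# index_col=0 tells pandas to use the first column as the row index
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample.columns.tolist())
```

If the real file were loaded this way, any later `drop(columns=[...])` call would no longer need to list `Unnamed: 0`.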


## Data Exploration (EDA)

In [50]:
data.isnull().sum()

Unnamed: 0                 0
age                        0
sex                        0
smoker                     0
region                     0
viral load                 0
severity level             0
hospitalization charges    0
dtype: int64

In [51]:
data['severity level'].value_counts()

severity level
0    574
1    324
2    240
3    157
4     25
5     18
Name: count, dtype: int64

In [52]:
data.groupby('region')['viral load'].mean()

region
northeast     9.724722
northwest     9.733508
southeast    11.118516
southwest    10.198985
Name: viral load, dtype: float64

In [53]:
data.groupby('region')['severity level'].mean()

region
northeast    1.046296
northwest    1.147692
southeast    1.049451
southwest    1.141538
Name: severity level, dtype: float64

In [54]:
data.groupby('region')['age'].mean()

region
northeast    39.268519
northwest    39.196923
southeast    38.939560
southwest    39.455385
Name: age, dtype: float64

In [55]:
data.groupby('region')['hospitalization charges'].mean()

region
northeast    33515.966049
northwest    31043.941538
southeast    36838.541209
southwest    30867.332308
Name: hospitalization charges, dtype: float64
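
The four per-region means above can also be computed in one call by selecting several columns before aggregating. A small sketch on the five preview rows (stand-in data, so the numbers differ from the full-dataset outputs above):

```python
import pandas as pd

# The five rows shown by data.head(), used here as stand-in data
sample = pd.DataFrame({
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "age": [19, 18, 28, 33, 32],
    "viral load": [9.3, 11.26, 11.0, 7.57, 9.63],
    "severity level": [0, 1, 3, 0, 0],
    "hospitalization charges": [42212, 4314, 11124, 54961, 9667],
})

# Select all numeric columns of interest, then take the mean per region
cols = ["viral load", "severity level", "age", "hospitalization charges"]
region_means = sample.groupby("region")[cols].mean()
print(region_means.round(2))
```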

In [56]:
data['region'].value_counts()

region
southeast    364
southwest    325
northwest    325
northeast    324
Name: count, dtype: int64

In [57]:
data['sex'].value_counts()

sex
male      676
female    662
Name: count, dtype: int64

In [58]:
data['age'].describe()

count    1338.000000
mean       39.207025
std        14.049960
min        18.000000
25%        27.000000
50%        39.000000
75%        51.000000
max        64.000000
Name: age, dtype: float64

In [59]:
severity_by_region_cnt = pd.crosstab(data['region'], data['severity level'], margins = True)
severity_by_region_cnt

region / severity level,0,1,2,3,4,5,All
northeast,147,77,51,39,7,3,324
northwest,132,74,66,46,6,1,325
southeast,157,95,66,35,5,6,364
southwest,138,78,57,37,7,8,325
All,574,324,240,157,25,18,1338


In [60]:
severity_by_region_pct = round(pd.crosstab(data['region'], data['severity level'], normalize='index'),2)
severity_by_region_pct

region / severity level,0,1,2,3,4,5
northeast,0.45,0.24,0.16,0.12,0.02,0.01
northwest,0.41,0.23,0.2,0.14,0.02,0.0
southeast,0.43,0.26,0.18,0.1,0.01,0.02
southwest,0.42,0.24,0.18,0.11,0.02,0.02
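
The severity mix looks similar across regions. One way to check this formally is a chi-square test of independence on the count table. This is a sketch rather than part of the original analysis: the counts are copied from the `severity_by_region_cnt` output (without the `All` margins), and `stats.chi2_contingency` is the SciPy function used. The expected counts for severity level 5 fall slightly below 5, so the chi-square approximation is rough for those cells.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the crosstab (rows: regions, columns: severity 0-5)
observed = np.array([
    [147, 77, 51, 39, 7, 3],   # northeast
    [132, 74, 66, 46, 6, 1],   # northwest
    [157, 95, 66, 35, 5, 6],   # southeast
    [138, 78, 57, 37, 7, 8],   # southwest
])

# H0: severity level is independent of region
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.4f}")
print(f"smallest expected count: {expected.min():.2f}")
```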


## Hypothesis Testing

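Viral load may help explain why hospitalization differs across regions. To check this, test whether the mean viral load is the same in all four regions. The null hypothesis is that the regional means are equal; the alternative is that at least one region's mean differs.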

In [61]:
data.groupby('region')['viral load'].mean()

region
northeast     9.724722
northwest     9.733508
southeast    11.118516
southwest    10.198985
Name: viral load, dtype: float64

In [62]:
northeast_vl = data[data['region'] == 'northeast']['viral load']
northwest_vl = data[data['region'] == 'northwest']['viral load']
southeast_vl = data[data['region'] == 'southeast']['viral load']
southwest_vl = data[data['region'] == 'southwest']['viral load']

### Perform One-Way ANOVA

In [63]:
# Perform One-Way ANOVA
f_stat, p_value = stats.f_oneway(northeast_vl,northwest_vl,southeast_vl,southwest_vl)
print(f_stat, p_value)
# Interpretation
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference among the groups.")
else:
    print("Fail to reject the null hypothesis: No significant difference among the groups.")

39.46870879747586 1.9508165724451252e-24
Reject the null hypothesis: There is a significant difference among the groups.
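
To make the F-statistic less of a black box, the sketch below computes it by hand as the ratio of the between-group to the within-group mean square and compares the result with `stats.f_oneway`. The data are synthetic: the group means and sizes mimic the regional viral-load means and counts reported in the EDA, but the spread (standard deviation 2) is an assumption, so the numbers will not match the real output.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the four regional samples (std of 2.0 is assumed)
groups = [
    rng.normal(9.72, 2.0, 324),   # northeast
    rng.normal(9.73, 2.0, 325),   # northwest
    rng.normal(11.12, 2.0, 364),  # southeast
    rng.normal(10.20, 2.0, 325),  # southwest
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n = len(groups), all_values.size

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}")
```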


### Two-Sample t-Test

In [64]:
# Perform two-sample t-test
t_stat, p_value = stats.ttest_ind(northeast_vl, southwest_vl, equal_var=False)  # Welch's t-test

# Print results
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis: No significant difference between the two groups.")

T-statistic: -3.1153
P-value: 0.0019
Reject the null hypothesis: There is a significant difference between the two groups.


In [66]:
# Perform two-sample t-test
t_stat, p_value = stats.ttest_ind(northwest_vl, southeast_vl, equal_var=False)  # Welch's t-test

# Print results
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis: No significant difference between the two groups.")

T-statistic: -9.3750
P-value: 0.0000
Reject the null hypothesis: There is a significant difference between the two groups.


In [67]:
# Perform two-sample t-test
t_stat, p_value = stats.ttest_ind(northwest_vl, southwest_vl, equal_var=False)  # Welch's t-test

# Print results
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis: No significant difference between the two groups.")

T-statistic: -3.2831
P-value: 0.0011
Reject the null hypothesis: There is a significant difference between the two groups.
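
Running several pairwise tests at a significance level of 0.05 each raises the chance of at least one false positive. A minimal sketch of one common remedy, the Bonferroni correction, applied to Welch's t-tests over all six region pairs. The helper name `pairwise_welch` is hypothetical and the data are synthetic (means and sizes taken from the EDA, standard deviation assumed), so the printed values are illustrative only.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def pairwise_welch(samples, alpha=0.05):
    """Welch's t-test for every pair of groups, with Bonferroni-adjusted p-values."""
    pairs = list(combinations(samples, 2))  # all unordered pairs of group names
    results = []
    for a, b in pairs:
        t_stat, p_value = stats.ttest_ind(samples[a], samples[b], equal_var=False)
        # Bonferroni: multiply each p-value by the number of tests, capped at 1
        p_adj = min(p_value * len(pairs), 1.0)
        results.append((a, b, t_stat, p_adj, bool(p_adj < alpha)))
    return results


rng = np.random.default_rng(0)
# Synthetic stand-ins for the regional viral-load samples (std of 2.0 is assumed)
samples = {
    "northeast": rng.normal(9.72, 2.0, 324),
    "northwest": rng.normal(9.73, 2.0, 325),
    "southeast": rng.normal(11.12, 2.0, 364),
    "southwest": rng.normal(10.20, 2.0, 325),
}

results = pairwise_welch(samples)
for a, b, t_stat, p_adj, reject in results:
    print(f"{a} vs {b}: t = {t_stat:.3f}, adjusted p = {p_adj:.4f}, reject H0: {reject}")
```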


In [68]:
data.groupby('region')['hospitalization charges'].mean()

region
northeast    33515.966049
northwest    31043.941538
southeast    36838.541209
southwest    30867.332308
Name: hospitalization charges, dtype: float64

## Regression Analysis

### Model 1

In [70]:
data['gender'] = np.where(data['sex'] == 'male', 1, 0)
data['smoke_ind'] = np.where(data['smoker'] == 'yes', 1, 0)

In [71]:
data.head()

Unnamed: 0.1,Unnamed: 0,age,sex,smoker,region,viral load,severity level,hospitalization charges,gender,smoke_ind
0,0,19,female,yes,southwest,9.3,0,42212,0,1
1,1,18,male,no,southeast,11.26,1,4314,1,0
2,2,28,male,no,southeast,11.0,3,11124,1,0
3,3,33,male,no,northwest,7.57,0,54961,1,0
4,4,32,male,no,northwest,9.63,0,9667,1,0


In [72]:
data.columns

Index(['Unnamed: 0', 'age', 'sex', 'smoker', 'region', 'viral load',
       'severity level', 'hospitalization charges', 'gender', 'smoke_ind'],
      dtype='object')

In [73]:
# One-Hot Encode Categorical Variables
data_encoded = pd.get_dummies(data, columns=['region'], drop_first=True)

# Define X (independent variables) and y (dependent variable)
X = data_encoded.drop(columns=['Unnamed: 0','hospitalization charges','sex','smoker'])
y = data_encoded['hospitalization charges']

# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print model coefficients
print(f"Intercept: {model.intercept_:.2f}")
print(f"Coefficients: {dict(zip(X.columns, model.coef_))}")

# Evaluate model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

Intercept: -29836.81
Coefficients: {'age': 642.4145536917176, 'viral load': 2529.0607534343785, 'severity level': 1063.368287089856, 'gender': -46.04709582221805, 'smoke_ind': 59128.33064777131, 'region_northwest': -926.8100263707785, 'region_southeast': -1644.6564946337726, 'region_southwest': -2024.6388080945576}
Mean Squared Error: 209984443.82
R² Score: 0.78
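
`LinearRegression` reports coefficients but not whether they are statistically significant, which is what the first question asks about. The sketch below shows how standard errors, t-statistics and p-values follow from ordinary least squares under the usual assumptions (independent, homoscedastic, roughly normal errors). The function name `ols_summary` is hypothetical and the data are synthetic, so this illustrates the calculation rather than reproducing the model above.

```python
import numpy as np
from scipy import stats


def ols_summary(X, y):
    """Return OLS coefficients, standard errors, t-statistics and two-sided p-values."""
    n = X.shape[0]
    X1 = np.column_stack([np.ones(n), X])          # prepend an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares coefficients
    resid = y - X1 @ beta
    dof = n - X1.shape[1]                          # residual degrees of freedom
    sigma2 = resid @ resid / dof                   # estimated error variance
    cov = sigma2 * np.linalg.inv(X1.T @ X1)        # covariance of the coefficients
    se = np.sqrt(np.diag(cov))
    t_stats = beta / se
    p_values = 2 * stats.t.sf(np.abs(t_stats), dof)
    return beta, se, t_stats, p_values


rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
# Synthetic target: only the first feature has a real effect
y = 3.0 + 2.0 * X[:, 0] + rng.normal(size=n)

beta, se, t_stats, p_values = ols_summary(X, y)
for name, b, p in zip(["intercept", "x1", "x2"], beta, p_values):
    print(f"{name}: coef = {b:.3f}, p-value = {p:.4f}")
```

A coefficient with a small p-value is unlikely to be zero given the data; a large coefficient alone, such as the one on `smoke_ind`, does not establish that without its standard error.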


### Model 2

In [77]:

# Define X (independent variables) and y (dependent variable)
X = data[['viral load','smoke_ind','severity level']]
y = data['hospitalization charges']

# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print model coefficients
print(f"Intercept: {model.intercept_:.2f}")
print(f"Coefficients: {dict(zip(X.columns, model.coef_))}")

# Evaluate model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

Intercept: -10697.33
Coefficients: {'viral load': 2989.0339418169224, 'smoke_ind': 57959.3291831184, 'severity level': 1531.822525621188}
Mean Squared Error: 297022745.73
R² Score: 0.69
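
Mean squared error is in squared charge units, which makes it hard to read. Taking the square root gives the RMSE in the same units as `hospitalization charges`. Using the two MSE values printed above:

```python
import math

# Test-set MSE values copied from the Model 1 and Model 2 outputs
mse_model_1 = 209984443.82
mse_model_2 = 297022745.73

# RMSE is the square root of MSE, expressed in the units of the target
rmse_model_1 = math.sqrt(mse_model_1)
rmse_model_2 = math.sqrt(mse_model_2)

print(f"Model 1 RMSE: {rmse_model_1:,.0f}")
print(f"Model 2 RMSE: {rmse_model_2:,.0f}")
```

Model 1's typical test error is roughly 14,500, against roughly 17,200 for Model 2, which is consistent with its higher R² (0.78 versus 0.69).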
