# COVID-19 India : Intelligent Resource Predictor

### Finding the gap between the beds required for COVID care and the beds available in each state's medical facilities.

# Flow 
1. Predict the number of COVID cases for a state on a given day.
2. From that total, derive the number of cases that require hospitalization.
3. Find the gap between the number of hospitalizations required and the number of vacant beds in the state, based on recent data.
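
The three steps above can be sketched for a single state as follows. This is a minimal illustration, not the notebook's model: the function name `bed_gap`, its parameters, and the sample numbers are hypothetical, and the 65% hospitalization rate is the assumption stated later in this notebook.

```python
def bed_gap(predicted_cases, available_beds, hospitalization_rate=0.65):
    """Return (beds_required, gap); a positive gap means a shortage."""
    # Step 2: share of predicted cases assumed to need a hospital bed
    beds_required = round(predicted_cases * hospitalization_rate)
    # Step 3: beds required minus beds available
    return beds_required, beds_required - available_beds

# Hypothetical figures for one state (not taken from the datasets)
required, gap = bed_gap(predicted_cases=1000, available_beds=500)
print(required, gap)
```

Step 1, the case prediction itself, is what the LSTM model in the rest of the notebook provides.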

# Contents
1. Datasets
2. Data Preparation - Train/test
3. Model Training
4. Predicting the number of COVID cases 7 days from present day
5. Backtesting
6. Model Evaluation
7. Finding No. of patients to be hospitalised
8. Finding Gap in availability of No. of beds for COVID cases

### Training data -
Data span - 2nd April to 15th May
Using the past 30 days' data, predict the estimated number of positive COVID cases on the 7th day from the present day.

### Test data -
Using a 30-day rolling window of past data from 15th April onwards, predict the count of positive cases from 22nd May to 3rd June for each state in India.


### Assumptions for deriving requirement of bed resources
The assumed age distribution of positive COVID cases -

Category 1 - 25% - young children 

Category 2 - 45% - Youth/Middle aged

Category 3 - 30% - Senior  Citizens

We include the 1st and 3rd category (65% of total positive cases) as cases that require hospitalization with priority over the 2nd category.

In [None]:
import numpy as np
np.random.seed(1)

import tensorflow as tf
tf.random.set_seed(2)

import math
from datetime import datetime
import pandas as pd
import time
import warnings
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')
import os

from tensorflow import keras
from tensorflow.keras import backend
from tensorflow.keras.models import Sequential,load_model
from tensorflow.keras.layers import Dense,LSTM,Dropout,Flatten
from tensorflow.keras import optimizers
from sklearn.metrics import mean_absolute_error, mean_squared_error,r2_score

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

The features in the dataset below are daily values derived from the COVID19 dataset, which contains cumulative state-wise counts of new, recovered and deceased cases.


In [None]:
covid_features = pd.read_csv("/kaggle/input/features-covid/statewise_features.csv")
beds_dataset = pd.read_csv("/kaggle/input/covid19-in-india/HospitalBedsIndia.csv", index_col=0)
containment_zones = pd.read_csv("/kaggle/input/district-containment-zones/district-containment-zones-2020-04-30.csv")

In [None]:
covid_features.head()

In [None]:
covid_features['Date'] = pd.to_datetime(covid_features['Date'], format="%Y-%m-%d")

covid_features['Date'].min(), covid_features['Date'].max()

In [None]:
covid_features.info()

In [None]:
covid_features['State'].nunique()

In [None]:
# No. of steps for LSTM to look back for making predictions

timesteps = 30

# Data Preparation

In [None]:
def transformDF(df,state,date_start,date_end,date_col="Date"):
    # fetch selective columns
    df_columns = ["State",date_col,"scaled_pop_density","new_cases","samples_tested"]
    df = df[df_columns]
    state_df = df[df['State'] == state].copy()   # copy so the assignment below does not write to a slice of df
    state_df[date_col] = pd.to_datetime(state_df[date_col])
    # Generate valid dates in the specified data range
    all_dates_df = pd.DataFrame(pd.date_range(start=date_start,end=date_end,freq='D'),columns=['date'])
    all_days_state_df = all_dates_df.merge(state_df,
                                                how='left',
                                                left_on='date',
                                                right_on=date_col)
    pop_density = all_days_state_df[~all_days_state_df['scaled_pop_density'].isnull()]["scaled_pop_density"].unique()[0]
    values = {"State":state,
             "new_cases":0,
             "samples_tested":0,
             "scaled_pop_density":pop_density}
    final_df = all_days_state_df.fillna(value=values)\
                               .drop(date_col,axis=1)
    return final_df
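
The gap-filling idea in `transformDF` can be illustrated without pandas. The sketch below is a simplified stand-in, not the function above: `fill_missing_days` and its sample data are hypothetical, and it handles a single count column only.

```python
from datetime import date, timedelta

def fill_missing_days(daily_counts, start, end, fill_value=0):
    """Return a list of (day, count) for every day in [start, end].

    Days absent from `daily_counts` get `fill_value`, mirroring how
    missing rows are filled with 0 new cases.
    """
    n_days = (end - start).days + 1
    all_days = [start + timedelta(days=i) for i in range(n_days)]
    return [(day, daily_counts.get(day, fill_value)) for day in all_days]

# Hypothetical reports: nothing was recorded on 3rd April
reports = {date(2020, 4, 2): 5, date(2020, 4, 4): 8}
filled = fill_missing_days(reports, date(2020, 4, 2), date(2020, 4, 4))
for day, count in filled:
    print(day.isoformat(), count)
```

Filling with zero treats a missing report as "no new cases", which is the same choice the notebook makes for `new_cases` and `samples_tested`.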

In [None]:
def getPreparedData(timesteps,df):
    # Preparing the data
    input_data = []              # historic input values 
    actual_data = []             # actual values at current time instance as target values
    scalers = []  
    N_STEPS = timesteps          # No. of timesteps back in time that the LSTM model sees
    # df = df[['pop_density','new_cases','samples_tested']]
    df = df[['scaled_pop_density','new_cases','samples_tested']]
    lookahead_day = 7            # The Nth day in future for which to make prediction
    nfeatures = len(df.columns)
    
    feature_index = 1            #  index of feature - new_cases
    ## At each prediction timestep, we provide values from n_steps historical timesteps
    for i in range(N_STEPS,len(df) - lookahead_day):  
        historical_x_val = df.values[i-N_STEPS:i]
        y_val = df.values[i+lookahead_day][feature_index]
        minmax_scaler = MinMaxScaler(feature_range=(0,1))
        scaled_x = historical_x_val.copy()   # explicit copy: [:] on a NumPy array is only a view
        
        # 1st feature is scaled already. Skipping this feature in scaling step below
        scaled_x[:,1:] = minmax_scaler.fit_transform(historical_x_val[:,1:])     
        
        # Scaling target values
        scaling_shape = (1, nfeatures - 1)                        # subtract 1 from num features scaled
        reshaped_y = np.broadcast_to(y_val,scaling_shape)         # Make its shape compatible for minmax scale transformation
        scaled_y = max(0, minmax_scaler.transform(reshaped_y)[0][0])
    
        input_data.append(scaled_x) 
        actual_data.append(scaled_y)
        scalers.append(minmax_scaler)
    train_x = np.array(input_data)
    train_y = np.array(actual_data)
    assert len(train_x.shape) == 3 , "x_train is expected to be 3-Dimensional for LSTM"
    return train_x,train_y,scalers
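
The windowing and per-window scaling in `getPreparedData` can be illustrated on a single series. This is a simplified sketch in plain Python, not the function above: `make_windows` and `inverse_scale` are hypothetical helpers, the min-max arithmetic is written by hand instead of using `MinMaxScaler`, and the window is kept short for readability.

```python
def make_windows(series, timesteps, lookahead):
    """Build (scaled_window, scaled_target, (lo, hi)) samples from a 1-D series.

    Mirrors the indexing used above: the window covers series[i-timesteps:i]
    and the target is series[i + lookahead].
    """
    samples = []
    for i in range(timesteps, len(series) - lookahead):
        window = series[i - timesteps:i]
        lo, hi = min(window), max(window)
        span = (hi - lo) or 1.0          # avoid division by zero on flat windows
        scaled_window = [(v - lo) / span for v in window]
        # The target is scaled with the window's own min/max and clipped at 0
        scaled_target = max(0.0, (series[i + lookahead] - lo) / span)
        samples.append((scaled_window, scaled_target, (lo, hi)))
    return samples

def inverse_scale(value, lo, hi):
    """Map a scaled value back to the original range of its window."""
    return value * ((hi - lo) or 1.0) + lo

# Hypothetical daily counts
series = [10, 12, 15, 20, 26, 30, 41, 50]
samples = make_windows(series, timesteps=3, lookahead=2)
window, target, (lo, hi) = samples[0]
print(window, target, inverse_scale(target, lo, hi))
```

Because the target lies outside the window it is scaled with, its scaled value can exceed 1 when cases are rising. Keeping each window's minimum and maximum is what makes it possible to map predictions back to case counts later.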



### Training data

In [None]:
# Scaling population density using data from all states
density_scaler = MinMaxScaler((0,1))
covid_features['scaled_pop_density'] = density_scaler.fit_transform(covid_features['pop_density'].values.reshape(-1,1))

In [None]:
# Use data from 2nd April to 15th May for training
# Combining data from multiple states to prepare the training dataset with 30-day input windows

start_time = time.time()
trainX = trainY = TrainScalers = None

for i,state in enumerate(covid_features['State'].unique()):
    state_df = transformDF(covid_features,state,date_start='2020-04-02',date_end='2020-05-15',date_col="Date")
    # print(state_df)
    train_x,train_y,train_scaler = getPreparedData(timesteps=timesteps\
                                                 ,df=state_df)
    
    if trainX is None:
        trainX = train_x
        trainY = train_y
        TrainScalers = train_scaler
    else:
        trainX = np.vstack((trainX,train_x))
        trainY = np.hstack((trainY,train_y))
        TrainScalers = TrainScalers + train_scaler

print("Training data prepared in {} seconds".format(time.time() - start_time))

In [None]:
trainX.shape, trainY.shape, len(TrainScalers)

### Testing data

In [None]:
# Use data from 15th April to 3rd June for testing
# Combining data from multiple states to prepare the test dataset with 30-day input windows

nrecords = 13   # No. of test data points per state
start_time = time.time()
testX  = testY = testScalers = testStateDf = target_y = None

for i,state in enumerate(covid_features['State'].unique()):
    state_df = transformDF(covid_features,state,date_start='2020-04-15',date_end='2020-06-03',date_col="Date")
    
    test_X,test_Y,test_scalers = getPreparedData(timesteps=timesteps\
                                                 ,df=state_df)
    target_Y = state_df['new_cases'].values[-nrecords: ]
    if testX is None:
        testX = test_X
        testY = test_Y
        testScalers = test_scalers 
        testStateDf = state_df      # include the first state as well
        target_y = target_Y
    else:
        testX = np.vstack((testX,test_X))
        testY = np.hstack((testY,test_Y))
        testScalers = testScalers + test_scalers
        testStateDf = pd.concat([testStateDf,state_df])
        target_y = np.hstack((target_y,target_Y))
        
print("Test data prepared in {} seconds".format(time.time() - start_time))

The test data spans 15th April to 3rd June. With 30 look-back timesteps, the first input window covers 15th April to 14th May. Predicting 7 days ahead from 15th May onwards, the targets run from 22nd May to 3rd June. This gives 13 test data points for each of the 35 states.
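
The number of samples per state follows directly from the loop bounds in `getPreparedData`, `range(timesteps, n_days - lookahead)`. A small check of that arithmetic, where `n_samples` is a hypothetical helper and the date ranges are the ones passed to `transformDF` above:

```python
from datetime import date

def n_samples(start, end, timesteps=30, lookahead=7):
    """Number of (window, target) pairs produced by range(timesteps, n_days - lookahead)."""
    n_days = (end - start).days + 1          # both endpoints are included
    return max(0, n_days - lookahead - timesteps)

train_count = n_samples(date(2020, 4, 2), date(2020, 5, 15))   # training span
test_count = n_samples(date(2020, 4, 15), date(2020, 6, 3))    # test span
print(train_count, test_count)
```

The training span of 44 days yields 7 samples per state, and the test span of 50 days yields 13, which matches `nrecords`.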

# Model training

In [None]:
bs = 16
epoch = 400
lr = 0.001
lstm_dropout = 0.2
lstm_nodes = 8
optimizer = keras.optimizers.Adam(learning_rate=lr)
num_features = 3     # no. of positive cases, total samples tested on a day & population density of state
training_loss = tf.keras.losses.MAE

checkpoint_path = "./covid_adam_tanh_relu_checkpoint"
logdir = "./logs/covid_adam_tanh_relu" + datetime.now().strftime("%Y%m%d-%H-%M-%S")

# for visualizing training metrics on tensorboard
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir,update_freq='epoch')
checkpoint_callback = keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                        monitor='loss',   # no validation data is passed to fit(), so track training loss
                                                        verbose=2,
                                                        period=20,
                                                        save_best_only=True)

In [None]:
## Setting up the forecasting model using LSTM
model = Sequential()
model.add(LSTM(units=lstm_nodes, input_shape=(timesteps, num_features), return_sequences=True,dropout=0.0))
model.add(LSTM(units=lstm_nodes, return_sequences=False,dropout=0.0))
model.add(Dropout(rate=lstm_dropout))
model.add(Dense(units=20,activation='tanh',name="dense_1"))
model.add(Dense(units=1,activation='relu',name="dense_2_output"))
model.summary()
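
As a sanity check on the output of `model.summary()`, the parameter counts can be worked out by hand. The sketch below assumes the standard LSTM formulation with bias terms, which is the Keras default; `lstm_params` and `dense_params` are hypothetical helper names. Dropout layers have no trainable parameters.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of a standard LSTM layer with bias.

    Four gates, each with input weights, recurrent weights and a bias.
    """
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input-unit pair plus one bias per unit."""
    return input_dim * units + units

layers = [
    lstm_params(input_dim=3, units=8),    # first LSTM: 3 input features
    lstm_params(input_dim=8, units=8),    # second LSTM: fed by the first
    dense_params(input_dim=8, units=20),  # dense_1
    dense_params(input_dim=20, units=1),  # dense_2_output
]
print(layers, sum(layers))
```

With roughly a thousand parameters the network is small, which suits the limited number of training windows available per state.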

In [None]:
model.compile(optimizer=optimizer,loss=training_loss,metrics=['mae','mse'])

In [None]:
%time history = model.fit(x=trainX\
                        ,y=trainY\
                        ,batch_size=bs\
                        ,epochs=epoch\
                        ,verbose=2\
                        ,callbacks=[tensorboard_callback,checkpoint_callback]\
                        ,shuffle=True)

In [None]:
# Plotting loss function during training phase

plt.plot(history.history['loss'])
plt.show()

# Predicting the number of COVID cases 7 days from present day

In [None]:
training_orginal_scaled_predictions = []
training_predictions = model.predict(trainX,batch_size=bs)

# Making shape compatible for scaling back into original range
train_size = trainX.shape[0]
reshaped_tr_predictions = np.broadcast_to(training_predictions, (train_size, num_features - 1))
reshaped_tr_predictions[:10]

# scaling the predictions back to original scale
for i in range(len(trainX)):
    training_inverse_scaled_predictions = TrainScalers[i].inverse_transform(reshaped_tr_predictions[i].reshape(1,-1))[:,0]
    training_orginal_scaled_predictions.append(training_inverse_scaled_predictions)

# Backtesting

In [None]:
test_inverse_scaled_predictions = []
test_original_scaled_predictions = []

test_predictions = model.predict(testX, batch_size=bs)

test_size = testX.shape[0]
reshaped_test_predictions = np.broadcast_to(test_predictions, (test_size, num_features - 1 ))

# scaling the predictions back to original scale
for i in range(len(testScalers)):
    prediction = testScalers[i].inverse_transform(reshaped_test_predictions[i].reshape(1,-1))[:,0]
    test_original_scaled_predictions.append(prediction)


### Rounding off the actual and predicted values to the nearest whole number

In [None]:
target_y = np.round(target_y,0)
predictions_all_states = np.round(test_original_scaled_predictions,0)
predictions_all_states[predictions_all_states < 0] = 0.0

In [None]:
# List of dates for which the predictions are made

dates = covid_features.sort_values('Date')['Date'].astype(str).unique()[-nrecords:].tolist()

In [None]:
predictions_df = None

index = 0
states = covid_features['State'].unique()
for i in range(0,len(testScalers),nrecords):
    test_state_df = pd.DataFrame(predictions_all_states[i:i+nrecords], columns=['predictions'])
    test_state_df['State'] = states[index]
    test_state_df['Date'] = dates
    test_state_df['target_values'] = target_y[i:i+nrecords]
    test_state_df[["predictions","target_values"]].plot(kind='bar',figsize=(6,4),grid=True)
    plt.xlabel("State: {}".format(states[index]))
    plt.ylabel("No. of Positive COVID cases ")
    plt.xticks(range(nrecords),dates)
    index += 1
    predictions_df = pd.concat([predictions_df,test_state_df])
    plt.show()

The visualizations above show a high deviation between model predictions and the actual count of confirmed cases for Nagaland, Punjab, and Andaman and Nicobar Islands. This can be explained by the past trend of counts in these states, plotted below.

In [None]:
covid_features.info()

In [None]:
predictions_df['Date'] = pd.to_datetime(predictions_df['Date'], format="%Y-%m-%d")


In [None]:
# Combining predictions with original features

covid_features['Date'] = covid_features.Date.astype(str)
predictions_df['Date'] = predictions_df.Date.astype(str)
features_with_predictions = covid_features.merge(predictions_df[['predictions','Date','State']], 
                     how='left', 
                    on=['Date','State'])
features_with_predictions.drop("scaled_pop_density", axis=1, inplace=True)

In [None]:
state = "Nagaland"
# pass figsize to plot() directly; a separate plt.figure() call would leave an empty figure behind
features_with_predictions[features_with_predictions['State']==state].set_index('Date')[['new_cases','predictions']][-30:].plot(kind="bar", figsize=(10,4))
plt.show()

In [None]:
state = "Punjab"
# pass figsize to plot() directly; a separate plt.figure() call would leave an empty figure behind
features_with_predictions[features_with_predictions['State']==state].set_index('Date')[['new_cases','predictions']][-30:].plot(kind="bar", figsize=(10,4))
plt.show()

In [None]:
state = "Andaman and Nicobar Islands"
# pass figsize to plot() directly; a separate plt.figure() call would leave an empty figure behind
features_with_predictions[features_with_predictions['State']==state].set_index('Date')[['new_cases','predictions']][-30:].plot(kind="bar", figsize=(10,4))
plt.show()

# Model Evaluation 

In [None]:
# Mean absolute error

mae = mean_absolute_error(target_y, predictions_all_states)
mae

In [None]:
# Root mean squared error (square root of the mean squared error)
mse = mean_squared_error(target_y, predictions_all_states)
rmse = math.sqrt(mse)
rmse
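
To make the two metrics concrete, here is a small hand computation. The helper names and the sample numbers are hypothetical, and the notebook itself uses the scikit-learn functions shown above.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

# Hypothetical counts with one large miss in the last position
actual = [10, 20, 30, 40]
predicted = [12, 18, 30, 50]
print(mae(actual, predicted), rmse(actual, predicted))
```

A single large miss raises RMSE well above MAE, so comparing the two gives a rough sense of whether errors are spread evenly or concentrated in a few state-days.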


# Number of patients to be hospitalized

We assume that 65% of the predicted positive cases require hospitalization. These patients may be senior citizens or people with pre-existing health conditions.
The total number of hospitalizations expected is then - 

In [None]:
statewise_hosp_requirement = predictions_df[['Date','State','predictions']].copy()   # copy before adding columns
statewise_hosp_requirement.columns = ['Date', 'State', 'predicted_case_count']
statewise_hosp_requirement['num_hosp_required'] = np.round(statewise_hosp_requirement['predicted_case_count']*0.65, 0)

In [None]:
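# Sum only the numeric columns; the Date column holds strings and should not be aggregated
statewise_hosp_requirement = statewise_hosp_requirement.groupby('State')[['predicted_case_count','num_hosp_required']].sum().reset_index()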

In [None]:
statewise_hosp_requirement.head()

# Gap in availability of No. of beds for COVID cases

In [None]:
beds_dataset.head()

In [None]:
# Assuming 5% of total beds are reserved for COVID care.
# A further assumption, that only 2% of those beds are vacant for newly diagnosed
# patients, is left commented out below and is not applied.
# The available bed count is - 

beds_dataset['total_beds'] = beds_dataset['NumPublicBeds_HMIS'] + beds_dataset['NumRuralBeds_NHP18'] + beds_dataset['NumUrbanBeds_NHP18']
beds_dataset['beds_for_covid'] = np.round(beds_dataset['total_beds']*0.05, 0)
# beds_dataset['vacant_beds_for_covid'] = np.round(beds_dataset['beds_for_covid']*0.02, 0)

In [None]:
covid_beds_data = beds_dataset[['State/UT','beds_for_covid']].copy()   # copy before cleaning the names in place

In [None]:
# Data cleaning for consistent data before joining

covid_beds_data['State/UT'] = covid_beds_data['State/UT'].str.replace('&', 'and')
covid_beds_data.loc[covid_beds_data['State/UT'].str.contains('Diu'), 'State/UT'] = "Dadra and Nagar Haveli and Daman and Diu"
covid_beds_data.loc[covid_beds_data['State/UT'].str.contains('Dadra'), 'State/UT'] = "Dadra and Nagar Haveli and Daman and Diu"
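
The same cleaning rules can be written as one function that is easy to test on its own. This is a sketch only: `normalize_state` is a hypothetical helper, the input strings are illustrative rather than read from the dataset, and the `strip()` call is an extra precaution that the cell above does not apply.

```python
MERGED_UT = "Dadra and Nagar Haveli and Daman and Diu"

def normalize_state(name):
    """Normalize a State/UT label so both datasets use the same key."""
    name = name.strip().replace("&", "and")
    # Both former union territories map to the merged name
    if "Diu" in name or "Dadra" in name:
        return MERGED_UT
    return name

for raw in ["Andaman & Nicobar Islands", "Daman & Diu", "Dadra & Nagar Haveli", "Kerala"]:
    print(raw, "->", normalize_state(raw))
```

Mapping both labels to one name means the subsequent `groupby` adds their bed counts together.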

In [None]:
covid_beds_data = covid_beds_data.groupby('State/UT').sum().reset_index()
covid_beds_data.columns = ['State', 'beds_for_covid']

### Merging the details of requirements per state and available hospital beds

In [None]:
statewise_avail_req = statewise_hosp_requirement.merge(covid_beds_data, 
                                on='State',
                                how='left')

In [None]:
statewise_avail_req.tail()

Over the 13-day span, subtracting the beds available for COVID from the total number of hospitalizations required gives the gap in bed resources. A positive value denotes a shortage.

In [None]:
statewise_avail_req['gap_in_req_beds'] = statewise_avail_req['num_hosp_required'] - statewise_avail_req['beds_for_covid']
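
The sign convention and the ranking step can be checked on toy numbers. In this sketch `rank_by_shortage` is a hypothetical helper and the state labels and counts are made up, not taken from the datasets.

```python
def rank_by_shortage(required, available, top=4):
    """Return [(state, gap)] sorted by gap, largest shortage first.

    gap = required - available, so a positive gap is a shortage.
    States missing from `available` are skipped rather than guessed.
    """
    gaps = {s: required[s] - available[s] for s in required if s in available}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)[:top]

# Hypothetical numbers
required = {"A": 900, "B": 300, "C": 1200}
available = {"A": 500, "B": 700, "C": 1250}
ranking = rank_by_shortage(required, available)
print(ranking)
```

Only state "A" has a positive gap here; the negative gaps for the other two indicate spare capacity under the stated assumptions.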


## Top 4 states falling short of bed resources

In [None]:
statewise_avail_req.sort_values('gap_in_req_beds', ascending=False).head(4)