# Adjusting Mortality Rates for COVID-19

### The problems with current mortality rates
I take issue with the way mortality rates are commonly calculated for COVID-19. I believe the simple calculation in common use understates them, and that it should be altered to better reflect the reality of this virus. The mortality rate is usually reported as:

$mortality = \frac{deaths_{n}}{confirmed_{n}}$

Where $n$ is the current day, and $deaths_{n}$ and $confirmed_{n}$ are the cumulative deaths and confirmed cases respectively. 

However, this ignores how many of the current cases will end in death. I propose comparing the number of deaths with the total number of cases reported $k$ days earlier, where $k$ is the average number of days it takes for a patient to die after being confirmed positive. The formula would then look like:

$mortality = \frac{deaths_{n}}{confirmed_{n-k}}$

This shift should reduce the bias introduced by newly confirmed cases and better reflect the probability of survival.
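To make the comparison concrete, here is a minimal sketch on made-up cumulative counts. The numbers, the lag of 2 days, and the helper names `naive_rate` and `lagged_rate` are illustrative assumptions, not values from this dataset. It divides the deaths on day $n$ by the confirmed cases on day $n-k$.

```python
import pandas as pd

# Made-up cumulative counts for six consecutive days (illustrative only).
toy = pd.DataFrame({
    "confirmed": [10, 20, 40, 80, 160, 320],
    "deceased": [0, 0, 1, 2, 4, 8],
})

def naive_rate(df):
    """Cumulative deaths divided by cumulative confirmed cases on the same day."""
    return df["deceased"] / df["confirmed"]

def lagged_rate(df, k):
    """Cumulative deaths on day n divided by cumulative confirmed cases on day n - k."""
    return df["deceased"] / df["confirmed"].shift(k)

k = 2  # hypothetical lag in days
toy["naive"] = naive_rate(toy)
toy["lagged"] = lagged_rate(toy, k)
print(toy)
```

In this toy series the confirmed count doubles every day, so the same-day rate stays at 0.025 while the lagged rate is 0.1. The first `k` rows of the lagged column are NaN because there is no confirmed count from `k` days earlier to compare against.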

In [None]:
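import os
import datetime

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# 'display.precision' is the full option name; the bare 'precision' is ambiguous in newer pandas
pd.set_option('display.precision', 8)
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 90)

# Load every file under /kaggle/input into a dict keyed by file name without the extension
csv_files = {}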
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        csv_files[filename.replace('.csv', '')] = pd.read_csv(os.path.join(dirname, filename))

### Time Series Data

In [None]:
time = csv_files['Time']
time['mortality_rate'] = time['deceased'] / time['confirmed']
print(time.tail(10))

This is the way mortality rates are being represented in the media. It's a simple calculation, but it ignores the fact that deaths lag behind newly confirmed cases. Simply put, newly confirmed patients haven't had the chance to die yet. One way to adjust for this is to compare deaths today with confirmed cases some number of days ago. To do this we need to explore the details of deceased patients.

## Patient Information

This CSV file contains the details of patients, including their confirmed dates and, where applicable, their deceased dates.

In [None]:
patient = csv_files['PatientInfo']
patient = patient[['sex', 'age', 'country', 'confirmed_date', 'deceased_date', 'state']]

patient_deaths = patient[patient['deceased_date'].notnull()].copy()

patient_deaths['deceased_date'] = pd.to_datetime(patient_deaths['deceased_date'])
patient_deaths['confirmed_date'] = pd.to_datetime(patient_deaths['confirmed_date'])
patient_deaths['days_til_deceased'] = patient_deaths['deceased_date'] - patient_deaths['confirmed_date']

print(patient_deaths)

There are a few rows here where `deceased_date` is earlier than `confirmed_date`. This could be a mistake, or the patients may have been confirmed positive post mortem. The tests can take more than a day, so this is a definite possibility. I'm going to take the liberty of converting these values to 0. Since there are only two such rows, changing them to 0 won't have a large impact on our average.
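Setting the negative durations to zero is one option; dropping those rows is another. The sketch below compares the two on made-up durations (the values are illustrative assumptions, not the actual patient data) to show how the choice moves the mean.

```python
import pandas as pd

# Made-up confirmed-to-deceased durations in days, including two negative values.
durations = pd.Series(pd.to_timedelta([-1, -2, 0, 3, 5, 10], unit="D"))

zero = pd.Timedelta(0)
# Option 1: keep the rows but replace negative durations with 0 days.
clamped = durations.where(durations >= zero, zero)
# Option 2: exclude the rows with negative durations entirely.
dropped = durations[durations >= zero]

print("mean with negatives set to 0:", clamped.mean())
print("mean with negatives dropped: ", dropped.mean())
```

With so few rows the toy numbers exaggerate the gap. On a larger sample with only two negative rows, the two means should sit much closer together.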

In [None]:
negatives = patient_deaths[patient_deaths['days_til_deceased'] < datetime.timedelta(days=0)]
patient_deaths.loc[negatives.index, 'days_til_deceased'] = datetime.timedelta(days=0)

print(patient_deaths)

Now we can calculate the average number of days for a patient to go from confirmed to deceased.

In [None]:
days_to_death = patient_deaths['days_til_deceased'].mean()
print(days_to_death)

So the offset I'm suggesting for comparing deaths to confirmed cases is about 3 days and 10 hours. Since our time series data is in increments of whole days, we will round down and shift things by 3 days. 
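As a small sketch of that rounding step, a pandas `Timedelta` can be turned into a whole number of days in a couple of ways. The value of 3 days and 10 hours below is a stand-in taken from the text, and the names `mean_delay`, `k_floor`, and `k_nearest` are hypothetical.

```python
import pandas as pd

# Stand-in for the mean confirmed-to-deceased delay described in the text.
mean_delay = pd.Timedelta(days=3, hours=10)

# The .days attribute keeps only the whole-day component, i.e. rounds down.
k_floor = mean_delay.days

# Dividing by one day gives a float number of days, which can be rounded to the nearest day.
k_nearest = int(round(mean_delay / pd.Timedelta(days=1)))

print(k_floor, k_nearest)
```

For this value both approaches give the same shift. They would differ once the fractional part reaches half a day.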

## Adjusting the Mortality Rate

In [None]:
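# Align each day's confirmed count with the deaths reported 3 days later.
# shift(-3) returns a new Series, so time['deceased'] itself is left unchanged.
time['shifted_deaths'] = time['deceased'].shift(-3)
time['adjusted_mortality'] = time['shifted_deaths'] / time['confirmed']
# Note: this is the ratio of adjusted to reported mortality, not an absolute difference.
time['mortality_difference'] = time['adjusted_mortality'] / time['mortality_rate']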
differences = time['mortality_difference'].replace([np.inf], np.nan).dropna()
print(time[['date', 'confirmed', 'deceased', 'mortality_rate', 'shifted_deaths', 'adjusted_mortality', 'mortality_difference']].tail(20))

print('\nAverage difference in mortality rate: ', differences.mean())

This 3-day shift increases the mortality rate by an average of 92%. That average is heavily skewed by some of the earlier data, though. The difference appears to converge to somewhere between 10% and 20% above the reported mortality rate.
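Because the early days dominate the mean, a median or a mean over only the most recent days may summarize the ratio more fairly. The sketch below uses made-up ratios (not the values from this notebook) and a hypothetical window of the last 4 days.

```python
import numpy as np
import pandas as pd

# Made-up ratios of adjusted to reported mortality; early values are inflated on purpose.
ratio = pd.Series([6.0, 4.0, np.inf, 2.0, 1.2, 1.15, 1.1, 1.15, np.nan])

# Drop infinities (division by a zero rate) and missing values before summarizing.
clean = ratio.replace([np.inf, -np.inf], np.nan).dropna()

overall_mean = clean.mean()         # pulled upward by the early outliers
overall_median = clean.median()     # less sensitive to the early outliers
recent_mean = clean.tail(4).mean()  # hypothetical window: the 4 most recent days

print(overall_mean, overall_median, recent_mean)
```

In this toy series the overall mean is well above both the median and the recent-window mean, which is the same pattern described above for the real data.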

## Visualizing the adjusted rate

In [None]:
plt.figure(figsize=(7, 7))
plt.plot(time['mortality_rate'], label='previous mortality rate')
plt.plot(time['adjusted_mortality'], label='adjusted mortality rate')
plt.legend(loc="upper left")
plt.xlabel('Days')
plt.ylabel('Mortality Rate')
plt.title('Mortality vs Adjusted Mortality')
plt.show()

This is my first Kaggle notebook, so I would love any feedback on how to improve. I plan on doing more explorations of all the COVID-19 data out there. 