# Which patient populations pass away from COVID-19?

Task Details:

The Roche Data Science Coalition is a group of like-minded public and private organizations with a common mission and vision to bring actionable intelligence to patients, frontline healthcare providers, institutions, supply chains, and government. The tasks associated with this dataset were developed and evaluated by global frontline healthcare providers, hospitals, suppliers, and policy makers. They represent key research questions where insights developed by the Kaggle community can be most impactful in the areas of at-risk population evaluation and capacity management.

Analysis of the question:
Patient population refers to the demographics and other particulars of a population being serviced – for example, a population’s ethnicity, socioeconomic status, or population density.

Individuals can also be grouped into a patient population based on their health characteristics, such as people with asthma or women who are pregnant.
https://www.paubox.com/blog/patient-population/


Method:

Using the DS4C dataset, specifically the patient information dataset, I will analyse how different demographic features potentially contribute to the death of COVID-19 patients in South Korea.

Potential areas of analysis:
Testing whether the time between the confirmed date and the released/deceased date differs significantly between recovered and deceased patients.

Restricting the analysis to released and deceased patients, excluding isolated patients since their outcome is still unknown.
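
The two ideas above could be sketched roughly as follows. This is only an illustration on a tiny made-up frame, not the real PatientInfo.csv. The column names `confirmed_date`, `released_date`, `deceased_date` and `state` are assumed to match the DS4C patient file, and `days_to_outcome` is a hypothetical helper column introduced here.

```python
import pandas as pd

# Tiny made-up sample; the values are illustrative only, not taken from DS4C.
sample = pd.DataFrame({
    "state": ["released", "deceased", "isolated", "released"],
    "confirmed_date": ["2020-02-01", "2020-02-03", "2020-02-05", "2020-02-10"],
    "released_date": ["2020-02-15", None, None, "2020-03-01"],
    "deceased_date": [None, "2020-02-08", None, None],
})

# Parse the date columns so they can be subtracted from each other.
for col in ["confirmed_date", "released_date", "deceased_date"]:
    sample[col] = pd.to_datetime(sample[col])

# Keep only patients whose outcome is known (drop isolated patients).
closed = sample[sample["state"].isin(["released", "deceased"])].copy()

# The outcome date is the released date if present, otherwise the deceased date.
outcome_date = closed["released_date"].fillna(closed["deceased_date"])
closed["days_to_outcome"] = (outcome_date - closed["confirmed_date"]).dt.days

# Average number of days from confirmation to outcome, per state.
print(closed.groupby("state")["days_to_outcome"].mean())
```

Whether the difference between the two groups is significant would still need a statistical test on the real data; the sketch only prepares the quantity to compare.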

Information on dataset:
https://www.kaggle.com/kimjihoo/ds4c-what-is-this-dataset-detailed-description

In [None]:
import pandas as pd                
import numpy as np    

import matplotlib.pyplot as plt     
import plotly.express as px       
import plotly.offline as py       
import seaborn as sns             
import plotly.graph_objects as go 
from plotly.subplots import make_subplots
import matplotlib.ticker as ticker
import matplotlib.animation as animation
from matplotlib.pyplot import figure, show

import warnings
warnings.filterwarnings('ignore')
from IPython.display import HTML
import matplotlib.colors as mc
import colorsys
from random import randint
import re
import missingno as msno


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
path = '/kaggle/input/coronavirusdataset/'

case = pd.read_csv(path+'Case.csv')
p_info = pd.read_csv(path+'PatientInfo.csv')
p_route = pd.read_csv(path+'PatientRoute.csv')
time = pd.read_csv(path+'Time.csv')
t_age = pd.read_csv(path+'TimeAge.csv')
t_gender = pd.read_csv(path+'TimeGender.csv')
t_provin = pd.read_csv(path+'TimeProvince.csv')
region = pd.read_csv(path+'Region.csv')
weather = pd.read_csv(path+'Weather.csv')
search = pd.read_csv(path+'SearchTrend.csv')
floating = pd.read_csv(path+'SeoulFloating.csv')
policy = pd.read_csv(path+'Policy.csv')

In [None]:
p_info.head()

# Checking null data

In [None]:
p_info.isnull().sum()

In [None]:
for col in p_info.columns:
    msg = 'column: {:>10}\t Percent of NaN value: {:.2f}%'.format(col, 100 * (p_info[col].isnull().sum() / p_info[col].shape[0]))
    print(msg)

In [None]:
msno.matrix(df=p_info.iloc[:, :], figsize=(8,8), color=(0.7,0.9,0.2))


Observations:

A large amount of data is missing. The features with the highest share of missing values are disease (99.49%), infection order (99.12%), deceased_date (98.24%) and symptom_onset_date (85.93%). That said, I assume that a missing disease value means the patient had no underlying disease, so I will fill those nulls with 'False'. Similarly, a missing deceased_date simply means the patient was discharged or is still isolated.

However, with so much data missing in features like symptom_onset_date and infection order, I may choose to leave these features out of the analysis.

Notably, there are patients for whom sex, birth_year and age are all missing.

Overall, the analysis may be limited by the large amount of missing data.
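
The missing-value percentages quoted above can also be gathered into one sorted table. The sketch below assumes a hypothetical helper named `missing_summary` and runs on a made-up frame rather than the real `p_info`.

```python
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column, largest first."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(2),
    })
    return summary.sort_values("pct_missing", ascending=False)

# Made-up frame for illustration; not the real PatientInfo.csv.
demo_missing = pd.DataFrame({
    "sex": ["male", "female", None, "female"],
    "disease": [None, None, None, True],
    "state": ["released", "released", "isolated", "deceased"],
})
print(missing_summary(demo_missing))
```

Calling `missing_summary(p_info)` in the notebook should give the same percentages as the loop above, with the sparsest columns listed first.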

# Checking the target label (state)

In [None]:
f, ax = plt.subplots(1,2, figsize=(18,8))
p_info['state'].value_counts().plot.pie(explode=[0, 0.1,0], autopct='%1.1f%%', ax=ax[0], shadow=False)
ax[0].set_title('Pie plot -state')
ax[0].set_ylabel('')
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='state', data=p_info, ax=ax[1])
ax[1].set_title("Count plot- state")

plt.show()

Observation:

Only 2.0% of the patients in the dataset have died. The target label (released/deceased/isolated) is heavily imbalanced.
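
The share of each outcome can also be read directly from the label column instead of from the pie chart. The sketch below uses made-up labels; in the notebook the series would be `p_info['state']`.

```python
import pandas as pd

# Made-up labels for illustration; the real column is p_info['state'].
state = pd.Series(["released"] * 6 + ["isolated"] * 3 + ["deceased"] * 1)

# Share of each outcome as a percentage of all patients.
share = (state.value_counts(normalize=True) * 100).round(1)
print(share)
```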

# Features

# Age

In [None]:
order = ['0s','10s','20s','30s','40s','50s','60s','70s','80s','90s','100s']
figure(figsize=(20,20))
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
graph = sns.countplot(x='age', hue='state', data=p_info, order=order)
for p in graph.patches:
    height = p.get_height()
    graph.text(p.get_x()+p.get_width()/2., height + 0.1,height ,ha="center")
show()

In [None]:
print('Mortality rate within the age group')
print('Mortality rate of 30s: {:.2f}%'.format(1*100/(1+169+282)))
print('Mortality rate of 40s: {:.2f}%'.format(2*100/(2+150+308)))
print('Mortality rate of 50s: {:.2f}%'.format(7*100/(7+375+218)))
print('Mortality rate of 60s: {:.2f}%'.format(12*100/(12+190+203)))
print('Mortality rate of 70s: {:.2f}%'.format(19*100/(19+81+104)))
print('Mortality rate of 80s: {:.2f}%'.format(23*100/(23+84+49)))
print('Mortality rate of 90s: {:.2f}%'.format(7*100/(7+13+25)))


Observations:
1) From the 30s to the 80s, the absolute number of deaths rises with each age group. Overall, deaths trend upwards from the 0s to the 90s.

2) The mortality rate within each age group rises with age. There were no deaths among patients in the 0s to 20s age groups. There were also no deaths in the 100s age group, but only 1 patient belongs to that group, so its mortality rate may be inaccurate.

3) We can infer from the above observations that the older the patient's age group, the higher the chance of the patient passing away from COVID-19.
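
The mortality rates above were typed in by hand from the count plot. A rough sketch of how they could be computed from the data instead is shown below. The helper `mortality_by_group` is hypothetical, the frame is made up, and the denominator includes isolated patients, as in the hand calculation.

```python
import pandas as pd

def mortality_by_group(df, group_col, state_col="state"):
    """Return per-group patient counts and the percentage with state 'deceased'."""
    counts = pd.crosstab(df[group_col], df[state_col])
    # Make sure the 'deceased' column exists even if nobody in the sample died.
    if "deceased" not in counts.columns:
        counts["deceased"] = 0
    result = pd.DataFrame({
        "n_patients": counts.sum(axis=1),
        "n_deceased": counts["deceased"],
    })
    result["mortality_pct"] = (100 * result["n_deceased"] / result["n_patients"]).round(2)
    return result

# Made-up frame for illustration; not the real PatientInfo.csv.
demo_age = pd.DataFrame({
    "age": ["30s", "30s", "30s", "30s", "80s", "80s", "80s", "80s"],
    "state": ["released", "released", "isolated", "released",
              "deceased", "released", "isolated", "deceased"],
})
rates = mortality_by_group(demo_age, "age")
print(rates)
```

The same helper could be pointed at other columns such as `sex` or `province`, and keeping `n_patients` next to the rate makes small groups easy to spot.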

# Sex

In [None]:
# Assign the result back instead of using inplace=True on a column selection
p_info['sex'] = p_info['sex'].fillna('NaN')

In [None]:
graph = sns.countplot(x='sex', hue='state', data=p_info)
# Set the title on this plot's own axes rather than on ax[1] from the earlier figure
graph.set_title('Sex: State')
plt.show()

In [None]:
pd.crosstab(p_info['sex'], p_info['state'], margins = True).style.background_gradient(cmap='summer_r')

In [None]:
print('Mortality rate by sex')
print('Mortality rate of male: {:.2f}%'.format(46*100/(1469)))
print('Mortality rate of female: {:.2f}%'.format(25*100/(1865)))

Observations:

1) The male mortality rate is more than twice the female rate (3.13% vs 1.34%).

2) There were no deaths among patients with unknown sex.

# Presence of underlying disease

In [None]:
# Assign the result back instead of using inplace=True on a column selection
p_info['disease'] = p_info['disease'].fillna('False')

In [None]:
p_info.disease.describe()

In [None]:
f, ax = plt.subplots(1,2, figsize=(18,8))
p_info['disease'].value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=False)
ax[0].set_title('Pie plot -disease')
ax[0].set_ylabel('')
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
graph = sns.countplot(x='disease', data=p_info, ax=ax[1])
ax[1].set_title("Count plot- disease")
for p in graph.patches:
    height = p.get_height()
    graph.text(p.get_x()+p.get_width()/2., height + 0.1,height ,ha="center")
plt.show()

In [None]:
pd.crosstab(p_info['disease'], p_info['state'], margins = True).style.background_gradient(cmap='summer_r')

In [None]:
print('Mortality rate by presence of underlying disease')
print('Mortality rate of patients with underlying disease: {:.2f}%'.format(18*100/(18)))
print('Mortality rate of patients without underlying disease: {:.2f}%'.format(53*100/(3501)))

Observations:
1) All patients with a recorded underlying disease died from COVID-19, a much higher mortality rate than among those without one.

Limitations:
However, we need to acknowledge that only 0.5% of the patients in the dataset had a recorded underlying disease. The two groups are therefore highly unbalanced, and the 100% figure rests on just 18 patients, which makes it unreliable. One possible explanation is that the underlying disease was only identified during the post-mortem of the deceased patients.
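
One way to make the small-sample caveat concrete is to put an approximate confidence interval around each rate. The sketch below uses the Wilson score interval, which is my choice for illustration and not part of the original analysis. The counts 18 of 18 and 53 of 3501 come from the cell above, and `wilson_interval` is a hypothetical helper.

```python
import math

def wilson_interval(deaths, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion deaths / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = deaths / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)

with_low, with_high = wilson_interval(18, 18)
without_low, without_high = wilson_interval(53, 3501)
print(f"With underlying disease:    {with_low:.1%} to {with_high:.1%}")
print(f"Without underlying disease: {without_low:.1%} to {without_high:.1%}")
```

The interval for the 18 patients is far wider than the one for the 3501 patients, which is the imbalance described above. It does not address the possible bias from diseases being recorded only after death.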

# Checking Province

In [None]:
figure(figsize=(20,20))
graph = sns.countplot(x='province', hue='state', data=p_info)
# Set the title on this plot's own axes rather than on ax[1] from an earlier figure
graph.set_title('Province: State')
graph.set_xticklabels(graph.get_xticklabels(), rotation=45)
plt.show()

In [None]:
pd.crosstab(p_info['province'], p_info['state'], margins = True).style.background_gradient(cmap='summer_r')

In [None]:
print('Mortality rate of patients from Daegu: {:.2f}%'.format(20*100/(63)))
print('Mortality rate of patients from Busan: {:.2f}%'.format(3*100/(141)))

print('Mortality rate of patients from Gyeongsangbuk-do: {:.2f}%'.format(40*100/(1230)))
print('Mortality rate of patients from Seoul: {:.2f}%'.format(4*100/(714)))


Observations:

1) In this dataset, Daegu had the highest mortality rate at 31.75%, although only 63 Daegu patients are recorded. Gyeongsangbuk-do followed with 3.25%, and the remaining provinces had low or zero mortality rates.

# Conclusion

1) As many of the other kernels point out, the chance of infection and mortality increases with the patient's age group.

2) The mortality rate was higher for males than for females.

3) In this dataset, the presence of an underlying disease appears to be fatal, given the high mortality rate. However, this observation is possibly inaccurate due to the limitations of the dataset.

4) In South Korea, the highest infection and mortality rates came from Daegu and Gyeongsangbuk-do, where the virus spread uncontrolled in connection with Shincheonji (a religious cult).