The whole world is currently dealing with the same problem. Although a lot of misleading information circulates on social media, a structured dataset like this one is a great opportunity to work with reliable information.

This dataset contains patients infected with COVID-19 in South Korea. In this analysis, I try to answer questions such as **"what are the reasons behind the infections?"** and **"is there any correlation between treatment duration and age?"**.

In [None]:
import numpy as np
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm

In [None]:
patient=pd.read_csv('../input/coronavirusdataset/patient.csv')

**Update:** South Korea has 8162 coronavirus cases.


The counts below show that **96.4%** of the total cases are included in our dataset.

In [None]:
patient.count()
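
The coverage figure quoted above is a simple ratio of dataset rows to reported cases. A minimal sketch, assuming a hypothetical helper named `coverage_pct`; the numbers in the usage line are illustrative only and are not the real row count:

```python
def coverage_pct(n_rows: int, total_cases: int) -> float:
    """Return the share of reported cases present in the dataset, in percent."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    return 100 * n_rows / total_cases


# Illustrative values only; with the real data, pass len(patient) and 8162.
example_share = coverage_pct(50, 200)
print(f"{example_share:.1f}%")
```

With the notebook's data, the call would presumably be `coverage_pct(len(patient), 8162)`.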

The date-related columns are read as object. We will convert them to datetime later.

In [None]:
patient.dtypes
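
As a rough sketch of the conversion mentioned above, the date columns can be parsed with `pd.to_datetime`. The toy frame below stands in for `patient.csv`, and its values are made up for illustration:

```python
import pandas as pd

# Toy frame standing in for patient.csv; the values are hypothetical.
toy = pd.DataFrame({
    "confirmed_date": ["2020-01-20", "2020-01-24", None],
    "released_date": ["2020-02-06", None, None],
})

date_cols = ["confirmed_date", "released_date"]
for col in date_cols:
    # errors="coerce" turns unparseable entries into NaT instead of raising.
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

print(toy.dtypes)
```

Missing dates become `NaT`, so later date arithmetic simply yields missing durations for those rows.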

There is a column named "infection_reason". Let's look at the distribution of the reasons for the infection.

In [None]:
patient.head()

In [None]:
# Share of each infection reason, in percent of patients with a known reason
func = lambda x: 100*x.count()/patient['infection_reason'].dropna().shape[0]
reason_by_age=patient.pivot_table("patient_id",index="infection_reason",aggfunc=func)\
.sort_values(by="patient_id",ascending=False)
reason_by_age
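
The same percentage table can probably be obtained more directly with `value_counts(normalize=True)`. This is a sketch on a made-up toy frame, not the author's method; the category "other" is a placeholder:

```python
import pandas as pd

# Hypothetical rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "infection_reason": ["contact with patient", "visit to Wuhan",
                         "contact with patient", None, "other"],
})

# normalize=True divides by the number of non-missing values,
# and the result is already sorted in descending order.
reason_share = toy["infection_reason"].value_counts(normalize=True) * 100
print(reason_share.round(1))
```

Missing reasons are excluded from the denominator, which matches the `dropna()` used in the lambda above.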

Almost half of the patients were infected through contact with other infected people. Visits to other cities are a less common infection reason. Let's see whether the infection reasons changed over time.

In [None]:
reasonsbydate=patient.pivot_table(index='confirmed_date', columns='infection_reason',values='patient_id',aggfunc="count")
colors = sns.color_palette("husl", n_colors=len(patient['infection_reason'].unique()))
reasonsbydate.plot.bar(stacked=True, figsize=(16,8), rot=60, color=colors)
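
To make the shape of the table behind the stacked bar chart concrete, here is a small sketch using `pd.crosstab` on made-up rows. It is an alternative to the `pivot_table` call above, with the difference that empty date and reason combinations become 0 rather than NaN:

```python
import pandas as pd

# Hypothetical rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "confirmed_date": ["2020-01-20", "2020-01-20", "2020-01-30", "2020-01-30"],
    "infection_reason": ["visit to Wuhan", "visit to Wuhan",
                         "contact with patient", "visit to Wuhan"],
})

# One row per date, one column per reason, each cell is a case count.
daily_counts = pd.crosstab(toy["confirmed_date"], toy["infection_reason"])
print(daily_counts)
```

Calling `daily_counts.plot.bar(stacked=True)` on such a table would draw one stacked bar per date.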

During the first 4 days, "visit to Wuhan" is the only infection reason. I guess that after Wuhan was quarantined, it gave way to other reasons. Since the virus is highly contagious, "contact with patient" shows up in the cases confirmed every day and is one of the main reasons why **COVID-19** spreads so quickly. It is great that we can confirm what doctors tell us with data.

In [None]:
patient['age'] = 2020 - patient['birth_year']
patient['age'].describe()


There are a lot of missing values in the age column. We remove them before moving forward.

In [None]:
age=patient['age'].dropna()

After removing missing values, only 666 patients are left. What an interesting number...

In [None]:
age.count() 
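
Before dropping rows, it can help to quantify how much of the age column is missing. A minimal sketch on hypothetical birth years, using the same `2020 - birth_year` derivation as above:

```python
import pandas as pd

# Hypothetical birth years; None marks a missing value.
toy = pd.DataFrame({"birth_year": [1985, None, 1964, None, 1992]})
toy["age"] = 2020 - toy["birth_year"]

n_missing = toy["age"].isna().sum()
missing_share = 100 * toy["age"].isna().mean()
print(f"{n_missing} of {len(toy)} ages missing ({missing_share:.0f}%)")
```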

The histogram below looks bimodal: it has two value ranges that appear most often in the data.

In [None]:
sns.distplot(age, fit=norm);
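
One hedged way to check the two-peak impression numerically is to count patients per ten-year age band with `pd.cut`. The ages below are invented for illustration:

```python
import pandas as pd

# Hypothetical ages, chosen so that two bands stand out.
ages = pd.Series([22, 25, 27, 31, 48, 52, 55, 58, 63, 74])

bins = list(range(0, 101, 10))  # edges 0, 10, ..., 100
# right=False makes each band include its lower edge: [20, 30), [30, 40), ...
age_groups = pd.cut(ages, bins=bins, right=False)
counts = age_groups.value_counts().sort_index()
print(counts)
```

Two bands with clearly higher counts than their neighbours would support the bimodal reading.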

Let's see the age distribution broken down by gender.

In [None]:
age_gender = pd.concat([patient['age'], patient['sex']], axis=1).dropna()
# Overlay one density curve per gender; the keyword must be lowercase `label`.
sns.kdeplot(age_gender.loc[age_gender['sex']=='female', 'age'],
            color='g', shade=True, label='female')
sns.kdeplot(age_gender.loc[age_gender['sex']=='male', 'age'],
            color='r', shade=True, label='male')
plt.legend();

Although the distribution for females looks roughly normal, there are 2 peaks, around age 20 and age 50. Also, the mean age of female patients looks higher than that of male patients.
Besides, females account for a higher share of the infected patients, at 57%.

In [None]:
func = lambda x: 100*x.count()/patient['sex'].dropna().shape[0]
gender_by_age=patient.pivot_table("patient_id", index="sex", aggfunc=func)\
.sort_values(by="patient_id", ascending=False); gender_by_age
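
The impression that the mean age differs by gender can be checked numerically with a `groupby`. This is a sketch on made-up rows, not a result from the dataset:

```python
import pandas as pd

# Hypothetical rows; the last one has no recorded sex.
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male", None],
    "age": [50.0, 30.0, 60.0, 40.0, 45.0],
})

# Rows with a missing sex are dropped from the groups by default.
age_by_sex = toy.groupby("sex")["age"].agg(["count", "mean", "median"])
print(age_by_sex)
```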

Let's draw a box plot and a stacked bar plot to gain more information on the variability and how the values in the data are spread out.

In [None]:
genderbydate=patient.pivot_table(index='confirmed_date', columns='sex',values='patient_id',aggfunc="count")
genderbydate.plot.bar(stacked=True, figsize=(16,8), rot=60)

The number of cases is decreasing for both genders, but the decrease among females is much more drastic.

In [None]:
data = pd.concat([patient['age'], patient['sex'], patient['confirmed_date']], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x='confirmed_date', y='age', hue='sex', data=data)
plt.xticks(rotation=60);

Although the difference between genders is not strong, male cases are spread across all age groups.

We have both the confirmed and the released date of each patient. Let's look at the distribution of the time from confirmation to discharge.

In [None]:
patient['confirmed_date']=pd.to_datetime(patient['confirmed_date'])
patient['released_date']=pd.to_datetime(patient['released_date'])
# Days between confirmation and release; stays NaN for patients not yet released.
patient['discharged']=(patient['released_date']-patient['confirmed_date']).dt.days
sns.kdeplot(patient['discharged'], color='r', shade=True, label="duration of discharged")

In [None]:
sns.distplot(patient['discharged'].dropna(), fit=norm);

Since the distribution is right-skewed, the mean is a little higher than the mode. The average hospital stay is about 12 days.

In [None]:
patient['discharged'].describe()
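
Right skew can also be checked with numbers rather than by eye: the mean sits above the median and the sample skewness is positive. A minimal sketch on invented stay lengths:

```python
import pandas as pd

# Hypothetical stay lengths in days, with a long right tail.
stay_days = pd.Series([5, 7, 8, 9, 10, 11, 12, 14, 20, 34], dtype="float64")

summary = {
    "mean": stay_days.mean(),
    "median": stay_days.median(),
    "skew": stay_days.skew(),  # positive for a right-skewed sample
}
print(summary)
```

On the real column, the same three calls on `patient['discharged']` would presumably tell the same story as the plot.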

The dataset has a column named "state", which shows the status of each patient. As we can see, 98.8% of the patients are isolated.

In [None]:
func = lambda x: 100*x.count()/patient['state'].dropna().shape[0]
state=patient.pivot_table("patient_id", index='state', aggfunc=func)\
.sort_values(by="patient_id", ascending=False); state

* I draw a box plot to see whether there is any relationship between infection reason and length of hospital stay.
* Patients who visited Wuhan needed more days to recover than patients with other visit-related infection reasons.
* For "contact with patient", the length of stay varies more.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
fig = sns.boxplot(x='infection_reason', y='discharged', data=patient)
plt.xticks(rotation=60);
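
The box plot comparison can be summarised as a table of median stay and interquartile range per infection reason. A sketch on made-up rows, with `iqr` as a hypothetical helper:

```python
import pandas as pd

# Hypothetical stays in days; only the column names follow the notebook.
toy = pd.DataFrame({
    "infection_reason": ["visit to Wuhan", "visit to Wuhan",
                         "contact with patient", "contact with patient",
                         "contact with patient"],
    "discharged": [18.0, 22.0, 6.0, 12.0, 24.0],
})


def iqr(values: pd.Series) -> float:
    """Interquartile range: the height of the box in a box plot."""
    return values.quantile(0.75) - values.quantile(0.25)


stay_by_reason = toy.groupby("infection_reason")["discharged"].agg(
    median="median", iqr=iqr, n="count"
)
print(stay_by_reason)
```

A higher median corresponds to a longer typical recovery, and a larger IQR to more variation in the length of stay.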

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
fig = sns.boxplot(x='state', y='age', data=patient)
plt.xticks(rotation=60);

The average age of deceased patients is around 70. Released patients have the lowest average age, around 40.

**Final Thoughts**

* After the precautions taken by the government, the number of cases decreases drastically.
* The state of a case may vary with the patient's age. Gender doesn't show any meaningful difference.
* On the days when the infection spread aggressively, there were more cases among women than among men. However, the number of cases has also decreased more for women than for men.
* On average, recovery time is around 10-12 days.
* In the first days, the coronavirus was spread by patients who had visited Wuhan.
* Contact with a patient is the most important reason for the spread of the coronavirus.


Thank you!
