# This notebook analyzes the coronavirus outbreak in South Korea.

A. Virus from overseas to Korea

B. Case infections

C. Patients' routes

D. Time from confirmation to release or death

E. The number of tests

F. The relationship between weather and the number of confirmed cases

G. A simple prediction of the future number of confirmed cases, and the outlook for South Korea

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import plotly.express as px
from datetime import date, timedelta
from sklearn.cluster import KMeans
from fbprophet import Prophet
from fbprophet.plot import plot_plotly, add_changepoints_to_plot
import plotly.offline as py
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.api as sm
from keras.models import Sequential
from keras.layers import LSTM,Dense
from keras.layers import Dropout
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

In [None]:
path = '/kaggle/input/coronavirusdataset/'

case = pd.read_csv(path+'Case.csv')
p_info = pd.read_csv(path+'PatientInfo.csv')
p_route = pd.read_csv(path+'PatientRoute.csv')
time = pd.read_csv(path+'Time.csv')
t_age = pd.read_csv(path+'TimeAge.csv')
t_gender = pd.read_csv(path+'TimeGender.csv')
t_provin = pd.read_csv(path+'TimeProvince.csv')
region = pd.read_csv(path+'Region.csv')
weather = pd.read_csv(path+'Weather.csv')
search = pd.read_csv(path+'SearchTrend.csv')

Let's check the data.

In [None]:
case.head()

In [None]:
caseList = case['infection_case'].unique()
columns = ['total_confirmed']
caseTotal = pd.DataFrame(index = caseList, columns = columns)
for i in range(len(caseList)):
    caseTotal.loc[caseList[i]] = case[case['infection_case'] == caseList[i]]['confirmed'].sum()
caseTotal = caseTotal.sort_values(by=['total_confirmed'], ascending=False)
caseTotal
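
The loop above can also be written as a single `groupby`. The sketch below is only an illustration on made-up rows: `toy_case` and its values are hypothetical stand-ins for `Case.csv`, not real data.

```python
import pandas as pd

# Hypothetical toy rows standing in for Case.csv (values are made up).
toy_case = pd.DataFrame({
    'infection_case': ['church', 'etc', 'church', 'overseas inflow'],
    'confirmed': [10, 3, 5, 2],
})

# Sum the confirmed counts per infection case, largest first.
toy_case_total = (
    toy_case.groupby('infection_case')['confirmed']
    .sum()
    .sort_values(ascending=False)
    .to_frame('total_confirmed')
)
print(toy_case_total)
```

The same pattern should apply to the per-province and per-city totals later in the notebook by changing the grouping column.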

In [None]:
p_info.head()

We start with the overseas inflow, which may have been where the outbreak began.

In [None]:
case[case['infection_case'] == 'overseas inflow']['confirmed'].sum()

In [None]:
p_info['infection_case'].value_counts()

Let's see whether banning the entry of foreign nationals would be an effective measure in South Korea.

# A. Virus from overseas to Korea

In [None]:
sns.set(rc={'figure.figsize':(5,5)})
sns.countplot(x=p_info['state'].loc[
    (p_info['infection_case']=='overseas inflow')
])

In [None]:
p_info['country'].value_counts()

There are only 18 foreign patients in this dataset, so we can assume that there are not many foreign patients overall.

In [None]:
inflow_p_info = p_info[p_info['infection_case'] == 'overseas inflow']

In [None]:
inflow_p_info['country'].value_counts()

There are even fewer foreign overseas inflow patients.

# 1. Foreign overseas inflow patients

# Overseas inflow - Chinese

In [None]:
china_inflow = p_info[p_info['country'] == 'China']
china_inflow = china_inflow[china_inflow['infection_case'] == 'overseas inflow']
china_inflow = china_inflow.reset_index(drop=True)
china_inflow

In [None]:
infectedby = p_info[p_info['infected_by'].notna()]
infectedby = infectedby.reset_index(drop=True)
infectedby.shape

Patients infected by Chinese overseas inflow

In [None]:
infectedbyList = infectedby['infected_by'].isin(china_inflow['patient_id']).replace({False:np.nan}).dropna().index

In [None]:
infectedbyChinese = pd.DataFrame(infectedby, index=infectedbyList)
infectedbyChinese

5 -> 2

In [None]:
infectedbyList2 = infectedby['infected_by'].isin(infectedbyChinese['patient_id']).replace({False:np.nan}).dropna().index

In [None]:
infectedbyChinese2 = pd.DataFrame(infectedby, index=infectedbyList2)
infectedbyChinese2

Most Chinese overseas inflow patients have been released, and they had no major effect on further transmission.
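
The step-by-step lookups above (patients infected by inflow patients, then patients infected by those patients) can be generalized into a loop over generations. This is a minimal sketch on made-up data: `toy_info`, `trace_generations`, and `seed_ids` are hypothetical names, and the sketch assumes `infected_by` holds the `patient_id` of the infector.

```python
import pandas as pd

# Hypothetical toy patients; IDs and links are made up for illustration.
toy_info = pd.DataFrame({
    'patient_id':  [1, 2, 3, 4, 5, 6],
    'infected_by': [None, 1, 1, 2, None, 4],
})

def trace_generations(info, seed_ids):
    """Return one sorted list of patient IDs per generation of infection."""
    generations = []
    current = set(seed_ids)
    seen = set(seed_ids)
    while current:
        # Patients whose infector is in the current generation and who
        # have not been visited yet (guards against cycles in the data).
        mask = info['infected_by'].isin(current) & ~info['patient_id'].isin(seen)
        current = {int(pid) for pid in info.loc[mask, 'patient_id']}
        if current:
            generations.append(sorted(current))
            seen |= current
    return generations

print(trace_generations(toy_info, seed_ids=[1]))
```

Each inner list corresponds to one of the hand-written steps above, so the lengths of the lists give the number of patients per generation.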

# Overseas inflow - US

In [None]:
US_inflow = p_info[p_info['country'] == 'United States']
US_inflow

In [None]:
infectedbyList = infectedby['infected_by'].isin(US_inflow['patient_id']).replace({False:np.nan}).dropna().index
infectedbyUS = pd.DataFrame(infectedby, index=infectedbyList)
infectedbyUS

No infections in this dataset are traced to US inflow patients.

# Overseas inflow - France

In [None]:
france_inflow = p_info[p_info['country'] == 'France']
france_inflow

In [None]:
infectedbyList = infectedby['infected_by'].isin(france_inflow['patient_id']).replace({False:np.nan}).dropna().index
infectedbyFrance = pd.DataFrame(infectedby, index=infectedbyList)
infectedbyFrance

# Overseas inflow - Thailand

In [None]:
thailand_inflow = p_info[p_info['country'] == 'Thailand']
thailand_inflow

In [None]:
infectedbyList = infectedby['infected_by'].isin(thailand_inflow['patient_id']).replace({False:np.nan}).dropna().index
infectedbyThailand = pd.DataFrame(infectedby, index=infectedbyList)
infectedbyThailand

# Overseas inflow - Switzerland

In [None]:
switz_inflow = p_info[p_info['country'] == 'Switzerland']
switz_inflow

In [None]:
infectedbyList = infectedby['infected_by'].isin(switz_inflow['patient_id']).replace({False:np.nan}).dropna().index
infectedbySwitz = pd.DataFrame(infectedby, index=infectedbyList)
infectedbySwitz

# Overseas inflow - Mongolia

In [None]:
mongolia_inflow = p_info[p_info['country'] == 'Mongolia']
mongolia_inflow

In [None]:
infectedbyList = infectedby['infected_by'].isin(mongolia_inflow['patient_id']).replace({False:np.nan}).dropna().index
infectedbyMongolia = pd.DataFrame(infectedby, index=infectedbyList)
infectedbyMongolia

Foreign overseas inflow had no critical effect in South Korea. Most overseas inflow patients are Korean, so the effect of Korean overseas inflow might be bigger.


Let's check it out.

# 2. Korean overseas inflow patients

In [None]:
koreanPatient = p_info[p_info['country'] == 'Korea']

In [None]:
koreanPatient['infection_case'].value_counts()

In [None]:
foreignVisit = ['overseas inflow', 'Pilgrimage to Israel']

In [None]:
korea_inflow = koreanPatient.loc[koreanPatient['infection_case'].isin(foreignVisit)]

In [None]:
korea_inflow.shape[0]

In [None]:
korea_inflow.head()

In [None]:
sns.set(rc={'figure.figsize':(5,5)})
sns.countplot(x=korea_inflow['state'])

In [None]:
plt.figure(figsize=(13, 8))
plt.title('Korean overseas inflow patients province')
korea_inflow.province.value_counts(ascending=True).plot.barh()
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

In [None]:
plt.figure(figsize=(13, 8))
plt.title('Korean overseas inflow patients age')
korea_inflow.age.value_counts(ascending=True).plot.barh()
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

In [None]:
sns.set(rc={'figure.figsize':(5,5)})
sns.countplot(x=korea_inflow['sex'])

Patients infected by Korean overseas inflow 

In [None]:
infectedbyList = infectedby['infected_by'].isin(korea_inflow['patient_id']).replace({False:np.nan}).dropna().index
infectedbyKorean = pd.DataFrame(infectedby, index=infectedbyList)
infectedbyKorean.shape[0]

284 -> 52

In [None]:
infectedbyKorean.head()

In [None]:
infectedbyList2 = infectedby['infected_by'].isin(infectedbyKorean['patient_id']).replace({False:np.nan}).dropna().index
infectedbyKorean2 = pd.DataFrame(infectedby, index=infectedbyList2)
infectedbyKorean2

In [None]:
infectedbyList3 = infectedby['infected_by'].isin(infectedbyKorean2['patient_id']).replace({False:np.nan}).dropna().index
infectedbyKorean3 = pd.DataFrame(infectedby, index=infectedbyList3)
infectedbyKorean3

In total, 54 patients were infected by Korean overseas inflow patients, and the real number might be higher.
Korean overseas inflow patients far outnumber foreign patients, and they have infected more people than foreign inflow patients have.

This may be reading too much into a small sample, but the data suggest the following.

1. Most foreign overseas inflow patients have had no critical effect on other patients so far. (No large group infection and few contacts.)
2. Korean overseas inflow patients have had a larger effect. (No large group infection either, but they are far more numerous than foreign overseas inflow patients.)

This is why the Korean government has not banned the entry of foreign nationals. To be effective, a ban would have to be a border closure for everyone, Koreans included, which seems unrealistic.

# B. Case infections

In [None]:
case.head()

Let's check the total number of confirmed cases in each province and city.

In [None]:
provinceList = case['province'].unique()
columns = ['total_confirmed']
provinceTotal = pd.DataFrame(index = provinceList, columns = columns)
for i in range(len(provinceList)):
    provinceTotal.loc[provinceList[i]] = case[case['province'] == provinceList[i]]['confirmed'].sum()
provinceTotal = provinceTotal.sort_values(by=['total_confirmed'], ascending=True)

In [None]:
provinceTotal.tail()

In [None]:
dataFrame = pd.DataFrame(data=provinceTotal, index=provinceTotal.index);
dataFrame.plot.barh(figsize=(20,10));
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)

In [None]:
cityList = case['city'].unique()
columns = ['total_confirmed']
cityTotal = pd.DataFrame(index = cityList, columns = columns)
for i in range(len(cityList)):
    cityTotal.loc[cityList[i]] = case[case['city'] == cityList[i]]['confirmed'].sum()
cityTotal = cityTotal.sort_values(by=['total_confirmed'], ascending=True)

In [None]:
dataFrame = pd.DataFrame(data=cityTotal, index=cityTotal.index);
dataFrame.plot.barh(figsize=(20,10));
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)

Let's see the total number of confirmed cases for each infection case.

In [None]:
caseList = case['infection_case'].unique()
columns = ['total_confirmed']
caseTotal = pd.DataFrame(index = caseList, columns = columns)
for i in range(len(caseList)):
    caseTotal.loc[caseList[i]] = case[case['infection_case'] == caseList[i]]['confirmed'].sum()
caseTotal = caseTotal.sort_values(by=['total_confirmed'], ascending=True)

In [None]:
caseTotal.tail()

In [None]:
dataFrame = pd.DataFrame(data=caseTotal, index=caseTotal.index);
dataFrame.plot.barh(figsize=(15,18));
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)

Shincheonji Church


Over 60% of patients were infected through the Shincheonji (SCJ) Church group infection.
Over 80% of patients are from Daegu and Gyeongsangbuk-do.
Daegu is the heart of the SCJ Church, and Gyeongsangbuk-do is very close to Daegu, which explains why more than 80% of patients come from these two regions.

In [None]:
sns.set(rc={'figure.figsize':(5,5)})
sns.countplot(x=p_info['state'].loc[
    (p_info['infection_case']=='Shincheonji Church')
])

In [None]:
sns.set(rc={'figure.figsize':(5,5)})
sns.countplot(x=p_info['state'].loc[
    (p_info['infection_case']=='etc')
])

Let's see how many cases are group infections.

In [None]:
columns = ['group']
caseGroup = pd.DataFrame(index = caseList, columns = columns)
for i in range(len(caseList)):
    caseGroup.loc[caseList[i]] = case[case['infection_case'] == caseList[i]]['group'].values[0]

In [None]:
caseAnalysis = pd.concat([caseTotal, caseGroup], axis=1, sort=False)
caseAnalysis.sort_values(by=['total_confirmed'], ascending=False)

In [None]:
caseGroup = caseAnalysis[caseAnalysis['group'] == True]['total_confirmed'].sum()

In [None]:
caseNotGroup = caseAnalysis[caseAnalysis['group'] == False]['total_confirmed'].sum()

In [None]:
index = ['group', 'not group']
column = ['total_confirmed']
df = pd.DataFrame(index=index, columns=column)
# Use a single .loc call; chained indexing may assign to a temporary copy
df.loc['group', 'total_confirmed'] = caseGroup
df.loc['not group', 'total_confirmed'] = caseNotGroup
df

In [None]:
plot = df.plot.pie(y='total_confirmed', figsize=(5, 5))
plt.title("Group vs. Non-group")

Over 70% of patients are group infections.
Most of the remaining patients were probably infected by family or friends, which is not counted as group infection.
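
The percentages can be read off directly rather than estimated from the pie chart. The following is a small sketch on made-up numbers: `toy_case` is a hypothetical stand-in for `Case.csv`, and its counts are chosen only so that the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical toy rows; the counts are made up, not taken from Case.csv.
toy_case = pd.DataFrame({
    'infection_case': ['church', 'call center', 'contact with patient', 'etc'],
    'group': [True, True, False, False],
    'confirmed': [60, 10, 20, 10],
})

# Total confirmed per group flag, then each total as a share of the whole.
by_group = toy_case.groupby('group')['confirmed'].sum()
share = (by_group / by_group.sum() * 100).round(1)
share.index = share.index.map({True: 'group', False: 'not group'})
print(share)
```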

In [None]:
cityTotal['city'] = cityTotal.index
cityTotal = cityTotal.reset_index(drop=True)
cityTotal = cityTotal.sort_values(by=['total_confirmed'], ascending=False)

In [None]:
caseTemp = case.loc[:, ['city', 'latitude', 'longitude']]
clus = cityTotal.merge(caseTemp, on='city')
clus = clus.sort_values(by=['total_confirmed'], ascending=False)
clus = clus.drop(clus[clus['latitude'] == '-'].index)
clus = clus.drop_duplicates('city')
clus = clus.reset_index(drop=True)

In [None]:
clus

In [None]:
clus['longitude'] = pd.to_numeric(clus['longitude'])
clus['latitude'] = pd.to_numeric(clus['latitude'])

In [None]:
from shapely.geometry import Point

geometry = [Point(xy) for xy in zip(clus['longitude'], clus['latitude'])]
geometry[1:3]

In [None]:
import geopandas as gpd
# The {'init': ...} dict form is deprecated; pass the EPSG code as a string
crs = 'EPSG:4326'
geo_df = gpd.GeoDataFrame(clus, crs=crs, geometry=geometry)
geo_df

The larger the circle radius, the larger the total number of confirmed cases.
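
Note that a radius of `int(total/100)` becomes 0 for any city with fewer than 100 cases, so those cities are effectively invisible on the map. One possible alternative, sketched here as an assumption rather than the method used in this notebook, is a square-root scale with a minimum radius. The function `marker_radius` and its parameters `scale` and `min_radius` are hypothetical.

```python
import math

def marker_radius(total, scale=2.0, min_radius=2.0):
    """Map a confirmed count to a circle radius in pixels.

    The square root keeps the circle area roughly proportional to the
    count, and min_radius keeps small cities visible.
    """
    return max(min_radius, math.sqrt(total) / scale)

for total in (4, 50, 6400):
    print(total, marker_radius(total))
```

The returned value could be passed as the `radius` argument of `folium.CircleMarker` in place of `int(total/100)`.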

In [None]:
import folium
southkorea_map = folium.Map(location=[36.55,126.983333 ], zoom_start=7,tiles='Stamen Toner')

for lat, lon, city, total in zip(geo_df['latitude'], geo_df['longitude'], geo_df['city'], geo_df['total_confirmed']):
    folium.CircleMarker([lat, lon],
                        radius=int(total/100),
                        color='red',
                      popup =('City: ' + str(city) + '<br>'),
                        fill_color='red',
                        fill_opacity=0.7 ).add_to(southkorea_map)
southkorea_map

Cases from Daegu and Gyeongsangbuk-do account for over 80 percent of all infections. This is closely tied to a religious group, the Shincheonji Church of Jesus, which is based in Daegu; a mass infection occurred there and spread throughout the country. About 60 percent of all infections in South Korea were related to Shincheonji. Over 70 percent of cases are group infections, and about 15 percent come from contact with infected family members or friends.

# C. Patients' routes

Let's check the routes of the overseas inflow patients.

In [None]:
p_route.head()

In [None]:
chinaInflowList = p_route['patient_id'].isin(china_inflow['patient_id']).replace({False:np.nan}).dropna().index

In [None]:
koreaInflowList = p_route['patient_id'].isin(korea_inflow['patient_id']).replace({False:np.nan}).dropna().index

In [None]:
china_inflow_route = pd.DataFrame(p_route, index=chinaInflowList)
china_inflow_route

In [None]:
korea_inflow_route = pd.DataFrame(p_route, index=koreaInflowList)
korea_inflow_route.head()

In [None]:
p_info['infection_case'].unique()

In [None]:
SCJChurchList = p_route['patient_id'].isin(p_info[p_info['infection_case'] == 'Shincheonji Church']['patient_id']).replace({False:np.nan}).dropna().index
SCJChurch_route = pd.DataFrame(p_route, index=SCJChurchList)
SCJChurch_route.shape

In [None]:
contactPList = p_route['patient_id'].isin(p_info[p_info['infection_case'] == 'contact with patient']['patient_id']).replace({False:np.nan}).dropna().index
contactP_route = pd.DataFrame(p_route, index=contactPList)
contactP_route.shape

# 1. All patients' routes

In [None]:
clus=p_route.loc[:,['patient_id','latitude','longitude']]
clus.head(10)

In [None]:
import folium
southkorea_map = folium.Map(location=[36.55,126.983333 ], zoom_start=7,tiles='Stamen Toner')

for lat, lon,city in zip(p_route['latitude'], p_route['longitude'], p_route['city']):
    folium.CircleMarker([lat, lon],
                        radius=5,
                        color='red',
                      popup =('City: ' + str(city) + '<br>'),
                        fill_color='red',
                        fill_opacity=0.7 ).add_to(southkorea_map)
southkorea_map

# 2. Overseas inflow patients' routes

# Chinese overseas inflow

In [None]:
clus=china_inflow_route.loc[:,['patient_id','latitude','longitude']]

In [None]:
import folium
southkorea_map = folium.Map(location=[36.55,126.983333 ], zoom_start=7,tiles='Stamen Toner')

for lat, lon,city in zip(china_inflow_route['latitude'], china_inflow_route['longitude'], china_inflow_route['city']):
    folium.CircleMarker([lat, lon],
                        radius=5,
                        color='red',
                      popup =('City: ' + str(city) + '<br>'),
                        fill_color='red',
                        fill_opacity=0.7 ).add_to(southkorea_map)
southkorea_map

# Korean overseas inflow routes

In [None]:
clus=korea_inflow_route.loc[:,['patient_id','latitude','longitude']]

In [None]:
import folium
southkorea_map = folium.Map(location=[36.55,126.983333 ], zoom_start=7,tiles='Stamen Toner')

for lat, lon,city in zip(korea_inflow_route['latitude'], korea_inflow_route['longitude'], korea_inflow_route['city']):
    folium.CircleMarker([lat, lon],
                        radius=5,
                        color='red',
                      popup =('City: ' + str(city) + '<br>'),
                        fill_color='red',
                        fill_opacity=0.7 ).add_to(southkorea_map)
southkorea_map

# Routes of patients infected at the Shincheonji Church

In [None]:
clus=SCJChurch_route.loc[:,['patient_id','latitude','longitude']]

In [None]:
SCJChurch_route['patient_id'].unique()

The SCJ Church did not cooperate with the investigation center, so route data exist for only three of its patients.
They appear to have moved within Daegu and Gyeongsangbuk-do (GSBD), where most patients come from.

In [None]:
import folium
southkorea_map = folium.Map(location=[36.55,126.983333 ], zoom_start=7,tiles='Stamen Toner')

for lat, lon,city in zip(SCJChurch_route['latitude'], SCJChurch_route['longitude'], SCJChurch_route['city']):
    folium.CircleMarker([lat, lon],
                        radius=5,
                        color='red',
                      popup =('City: ' + str(city) + '<br>'),
                        fill_color='red',
                        fill_opacity=0.7 ).add_to(southkorea_map)
southkorea_map

# Routes of patients infected by contact

In [None]:
clus=contactP_route.loc[:,['patient_id','latitude','longitude']]

In [None]:
import folium
southkorea_map = folium.Map(location=[36.55,126.983333 ], zoom_start=7,tiles='Stamen Toner')

for lat, lon,city in zip(contactP_route['latitude'], contactP_route['longitude'], contactP_route['city']):
    folium.CircleMarker([lat, lon],
                        radius=5,
                        color='red',
                      popup =('City: ' + str(city) + '<br>'),
                        fill_color='red',
                        fill_opacity=0.7 ).add_to(southkorea_map)
southkorea_map

Apart from patients infected at the SCJ Church, most patients moved around Seoul and other big cities.

# D. Time from confirmation to release or death

In [None]:
p_info.head()

In [None]:
date_cols = ["confirmed_date", "released_date", "deceased_date"]
for col in date_cols:
    p_info[col] = pd.to_datetime(p_info[col])

In [None]:
p_info["time_to_release_since_confirmed"] = p_info["released_date"] - p_info["confirmed_date"]

p_info["time_to_death_since_confirmed"] = p_info["deceased_date"] - p_info["confirmed_date"]
p_info["duration_since_confirmed"] = p_info[["time_to_release_since_confirmed", "time_to_death_since_confirmed"]].min(axis=1)
p_info["duration_days"] = p_info["duration_since_confirmed"].dt.days
p_info["state_by_gender"] = p_info["state"] + "_" + p_info["sex"]

In [None]:
p_info.head()

In [None]:
plt.figure(figsize=(12, 8))
sns.boxplot(x="state",
            y="duration_days",
            order=["released", "deceased"],
            data=p_info)
plt.title("Time from confirmation to release or death", fontsize=16)
plt.xlabel("State", fontsize=16)
plt.ylabel("Days", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

In [None]:
order_duration_sex = ["female", "male"]
plt.figure(figsize=(12, 8))
sns.boxplot(x="sex",
            y="duration_days",
            order=order_duration_sex,
            hue="state",            
            hue_order=["released", "deceased"],
            data=p_info)
plt.title("Time from confirmation to release or death by gender",
          fontsize=16)
plt.xlabel("Gender", fontsize=16)
plt.ylabel("Days", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

In [None]:
ageList = ['0s', '10s', '20s', '30s', '40s', '50s', '60s', '70s', '80s', '90s', '100s']
plt.figure(figsize=(12, 8))
sns.boxplot(x="age",
            y="duration_days",
            order=ageList,
            hue="state",
            hue_order=["released", "deceased"],
            data=p_info)
plt.title("Time from confirmation to release or death", fontsize=16)
plt.xlabel("Age Range", fontsize=16)
plt.ylabel("Days", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

Released patients were typically released within about 25 days of confirmation.
Deceased patients typically died within about 10 days of confirmation.
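
To put numbers next to the box plots, the median duration per state can be computed directly. This is a minimal sketch on made-up rows: `toy` is a hypothetical stand-in for `PatientInfo.csv`, and the dates were chosen only so that the medians are easy to verify by hand.

```python
import pandas as pd

# Hypothetical toy patients; all dates are made up for illustration.
toy = pd.DataFrame({
    'state': ['released', 'released', 'released',
              'deceased', 'deceased', 'isolated'],
    'confirmed_date': ['2020-02-01', '2020-02-03', '2020-02-05',
                       '2020-02-10', '2020-02-12', '2020-03-01'],
    'released_date': ['2020-02-21', '2020-02-28', '2020-03-06',
                      None, None, None],
    'deceased_date': [None, None, None,
                      '2020-02-18', '2020-02-24', None],
})
for col in ['confirmed_date', 'released_date', 'deceased_date']:
    toy[col] = pd.to_datetime(toy[col])

# Use the release date when present, otherwise the date of death.
end_date = toy['released_date'].fillna(toy['deceased_date'])
toy['duration_days'] = (end_date - toy['confirmed_date']).dt.days

# Patients still isolated have no end date, so their median is NaN.
median_days = toy.groupby('state')['duration_days'].median()
print(median_days)
```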

# E. The number of tests

In [None]:
timeGraph = time.set_index('date')

In [None]:
timeTemp = timeGraph[['confirmed', 'released', 'deceased']]

In [None]:
timeTemp.plot(figsize=(10,8))

In [None]:
# The counts in Time.csv are cumulative, so use the latest row (assuming rows are in date order) instead of summing over dates
latest = time.iloc[-1]
confirm_perc = latest['confirmed'] / latest['test'] * 100
released_perc = latest['released'] / latest['test'] * 100
deceased_perc = latest['deceased'] / latest['test'] * 100

In [None]:
print("The percentage of confirmed is " + str(confirm_perc))
print("The percentage of released is " + str(released_perc))
print("The percentage of deceased is " + str(deceased_perc))

In [None]:
plt.figure(figsize=(100,50))
plt.bar(time.date, time.test,label="Test")
plt.bar(time.date, time.confirmed, label = "Confirmed")
plt.xlabel('Date')
plt.ylabel("Count")
plt.title('Test vs Confirmed',fontsize=100)
plt.legend(frameon=True, fontsize=12)
plt.show()

# 1. Time by age

In [None]:
t_ageGraph = t_age.set_index('date')

In [None]:
t_ageGraph = t_ageGraph[['age', 'confirmed', 'deceased']]

In [None]:
t_ageGraph.head()

In [None]:
t_0s = t_ageGraph[t_ageGraph['age'] == '0s'][['confirmed', 'deceased']]
t_10s = t_ageGraph[t_ageGraph['age'] == '10s'][['confirmed', 'deceased']]
t_20s = t_ageGraph[t_ageGraph['age'] == '20s'][['confirmed', 'deceased']]
t_30s = t_ageGraph[t_ageGraph['age'] == '30s'][['confirmed', 'deceased']]
t_40s = t_ageGraph[t_ageGraph['age'] == '40s'][['confirmed', 'deceased']]
t_50s = t_ageGraph[t_ageGraph['age'] == '50s'][['confirmed', 'deceased']]
t_60s = t_ageGraph[t_ageGraph['age'] == '60s'][['confirmed', 'deceased']]
t_70s = t_ageGraph[t_ageGraph['age'] == '70s'][['confirmed', 'deceased']]
t_80s = t_ageGraph[t_ageGraph['age'] == '80s'][['confirmed', 'deceased']]

In [None]:
t_20s.plot(figsize=(8,8))

In [None]:
t_50s.plot(figsize=(8,8))

In [None]:
t_age = t_age[t_age['date'] == '2020-03-22']

In [None]:
t_age['confirmed'].sum()

In [None]:
ageList = ['0s', '10s', '20s', '30s', '40s', '50s', '60s', '70s', '80s']
ageConfirmed = pd.DataFrame(index=ageList, columns=['total_confirmed'])
ageDeceased = pd.DataFrame(index=ageList, columns=['total_deceased'])

for age in ageList:
    # A single .loc call avoids chained assignment on a temporary copy
    ageConfirmed.loc[age, 'total_confirmed'] = t_age[t_age['age'] == age]['confirmed'].sum()
    ageDeceased.loc[age, 'total_deceased'] = t_age[t_age['age'] == age]['deceased'].sum()
    
ageConfirmed = ageConfirmed.sort_values(by='total_confirmed', ascending=True)
ageDeceased = ageDeceased.sort_values(by='total_deceased', ascending=True)

In [None]:
ax = ageConfirmed.plot.barh(figsize=(13,8))

In [None]:
plt.figure(figsize=(13, 8))
plt.title('Patients age')
p_info.age.value_counts(ascending=True).plot.barh()
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

Patients in their 20s, 50s, and 40s are the most numerous.

In [None]:
ax = ageDeceased.plot.barh(figsize=(13,8))

As expected, the older the age group, the more deceased patients.
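
Raw death counts depend on how many patients each age group has, so a fatality rate (deceased divided by confirmed) is easier to compare across groups. The sketch below uses made-up numbers: `toy_age` is hypothetical and does not reproduce the values in `TimeAge.csv`.

```python
import pandas as pd

# Hypothetical cumulative counts per age group (made-up values).
toy_age = pd.DataFrame({
    'age': ['20s', '50s', '80s'],
    'confirmed': [2000, 1500, 400],
    'deceased': [0, 9, 40],
}).set_index('age')

# Fatality rate in percent: deceased / confirmed * 100.
toy_age['fatality_rate_pct'] = (
    toy_age['deceased'] / toy_age['confirmed'] * 100
).round(2)
print(toy_age)
```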

# 2. Time by gender

In [None]:
t_genderGraph = t_gender.set_index('date')

In [None]:
t_genderGraph.head()

In [None]:
t_male = t_genderGraph[t_genderGraph['sex'] == 'male'][['confirmed', 'deceased']]
t_female = t_genderGraph[t_genderGraph['sex'] == 'female'][['confirmed', 'deceased']]

In [None]:
t_male.plot(figsize=(8,8))

In [None]:
t_female.plot(figsize=(8,8))

In [None]:
t_gender = t_gender[t_gender['date'] == '2020-03-22']

In [None]:
index = t_gender['sex'].unique()
sexConfirmed = pd.DataFrame(index=index, columns=['total_confirmed'])
sexDeceased = pd.DataFrame(index=index, columns=['total_deceased'])

for sex in index:
    # A single .loc call avoids chained assignment on a temporary copy
    sexConfirmed.loc[sex, 'total_confirmed'] = t_gender[t_gender['sex'] == sex]['confirmed'].sum()
    sexDeceased.loc[sex, 'total_deceased'] = t_gender[t_gender['sex'] == sex]['deceased'].sum()

In [None]:
sexConfirmed

In [None]:
sexDeceased

In [None]:
sexConfirmed.plot.pie(y='total_confirmed', figsize=(5, 5))
plt.title("Male vs. Female")

In [None]:
sexDeceased.plot.pie(y='total_deceased', figsize=(5, 5))
plt.title("Male vs. Female")

# 3. Time by province

In [None]:
t_provin = t_provin[t_provin['date'] == '2020-03-22']
t_provin = t_provin.reset_index(drop=True)

In [None]:
t_provin

In [None]:
provinceList = t_provin['province'].unique()

In [None]:
totalCProvince = pd.DataFrame(index=t_provin['province'].unique(), columns=['total_confirmed'])
totalDProvince = pd.DataFrame(index=t_provin['province'].unique(), columns=['total_deceased'])
totalRProvince = pd.DataFrame(index=t_provin['province'].unique(), columns=['total_released'])

In [None]:
for province in provinceList:
    # A single .loc call avoids chained assignment on a temporary copy
    totalCProvince.loc[province, 'total_confirmed'] = t_provin[t_provin['province'] == province]['confirmed'].sum()
    totalDProvince.loc[province, 'total_deceased'] = t_provin[t_provin['province'] == province]['deceased'].sum()
    totalRProvince.loc[province, 'total_released'] = t_provin[t_provin['province'] == province]['released'].sum()

In [None]:
totalCProvince

In [None]:
ax = totalCProvince.plot.barh(figsize=(12,8))

In [None]:
ax = totalDProvince.plot.barh(figsize=(12,8))

In [None]:
ax = totalRProvince.plot.barh(figsize=(12,8))

Daegu and GSBD have the most patients, so their numbers of confirmed, released, and deceased cases are the largest.

# F. Relationship between weather and the number of confirmed cases

In [None]:
weather.head()

In [None]:
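# numeric_only=True skips text columns such as province, which recent pandas versions refuse to average
weatherTemp = weather.groupby(['date']).mean(numeric_only=True)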

In [None]:
weatherTemp = weatherTemp[['avg_temp', 'precipitation', 'max_wind_speed', 'avg_relative_humidity']]

In [None]:
weatherTemp.head()

In [None]:
timeTemp = time[['date', 'confirmed']]
timeTemp = timeTemp.set_index('date')

In [None]:
weather_confirmed = pd.merge(timeTemp, weatherTemp, on='date')
weather_confirmed = weather_confirmed.reindex(columns = ['avg_temp', 'precipitation', 'max_wind_speed', 'avg_relative_humidity', 'confirmed'])
weather_confirmed.tail()

In [None]:
weather_confirmed.plot(figsize=(8,8))

The plot shows no obvious relationship between weather and the number of confirmed cases.
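
A line plot makes it hard to judge a relationship by eye, particularly because the confirmed count is cumulative and always rises. One way to check more directly, sketched here on made-up data, is to correlate the weather variables with the daily new cases. `toy` and its values are hypothetical, and a correlation over so few points would not be meaningful in practice.

```python
import pandas as pd

# Hypothetical daily data; temperatures and counts are made up.
dates = pd.date_range('2020-03-01', periods=6, freq='D')
toy = pd.DataFrame({
    'avg_temp': [3.0, 5.0, 4.0, 8.0, 7.0, 9.0],
    'confirmed': [100, 150, 230, 280, 300, 310],  # cumulative count
}, index=dates)

# Convert the cumulative count to daily new cases before correlating.
toy['new_cases'] = toy['confirmed'].diff()

# Pearson correlation; rows with NaN (the first day) are ignored pairwise.
corr = toy[['avg_temp', 'new_cases']].corr(method='pearson')
r = corr.loc['avg_temp', 'new_cases']
print(corr)
```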

# G. Prediction 

# 1. Facebook prophet model

In [None]:
df_korea = time[['date', 'confirmed']]

In [None]:
df_korea.tail()

In [None]:
# Make dataframe for Facebook Prophet prediction model
df_prophet = df_korea.rename(columns={
    'date': 'ds',
    'confirmed': 'y'
})

df_prophet.tail()

In [None]:
m = Prophet(
    changepoint_prior_scale=0.2, # increasing it will make the trend more flexible
    changepoint_range=0.98, # place potential changepoints in the first 98% of the time series
    yearly_seasonality=False,
    weekly_seasonality=False,
    daily_seasonality=True,
    seasonality_mode='additive'
)
m.fit(df_prophet)

In [None]:
future = m.make_future_dataframe(periods=7)
forecast = m.predict(future)
forecast.tail(7)

In [None]:
fig = m.plot(forecast)

# 2. Regression model

In [None]:
df_korea = time[['date', 'confirmed']]
df_korea = df_korea[20:]
df_korea = df_korea.reset_index(drop=True)

In [None]:
df_korea_reg = df_korea.copy()

In [None]:
df_korea_reg = df_korea_reg.set_index('date')
df_korea_reg = df_korea_reg[20:]

In [None]:
df_korea_reg.index = pd.to_datetime(df_korea_reg.index)

In [None]:
x = np.arange(len(df_korea_reg)).reshape(-1, 1)
# Use a 1-D target; a column vector triggers a scikit-learn conversion warning
y = df_korea_reg['confirmed'].values

In [None]:
from sklearn.neural_network import MLPRegressor
model = MLPRegressor(hidden_layer_sizes=[32, 32, 10], max_iter=50000, alpha=0.0005, random_state=26)
_=model.fit(x, y)

In [None]:
test = np.arange(len(df_korea_reg)+7).reshape(-1, 1)
pred = model.predict(test)
prediction = pred.round().astype(int)
week = [df_korea_reg.index[0] + timedelta(days=i) for i in range(len(prediction))]
dt_idx = pd.DatetimeIndex(week)
predicted_count = pd.Series(prediction, dt_idx)

In [None]:
predicted_count.tail()

In [None]:
pd.plotting.register_matplotlib_converters()

In [None]:
df_korea_reg.plot()
predicted_count.plot()
plt.title('Prediction of Accumulated Confirmed Count')
plt.legend(['current confirmed count', 'predicted confirmed count'])
plt.show()

# 3. ARIMA (Auto Regressive Integrated Moving Average)

In [None]:
df_korea.tail()

In [None]:
model = ARIMA(df_korea['confirmed'].values, order=(1, 2, 1))
fit_model = model.fit(trend='c', full_output=True, disp=True)
fit_model.summary()

In [None]:
fit_model.plot_predict()
plt.title('Forecast vs Actual')
pd.DataFrame(fit_model.resid).plot()

In [None]:
forecast_result = fit_model.forecast(steps=7)
pred_y = forecast_result[0].tolist()
pd.DataFrame(pred_y)

# To sum up!
# Let's look at the data excluding Daegu and Gyeongsangbuk-do, the provinces at the center of the Shincheonji Church outbreak.
P.S. Shincheonji Church of Jesus is based in Daegu, and Gyeongsangbuk-do (GSBD) is the region closest to Daegu.

In [None]:
t_provin = pd.read_csv(path+'TimeProvince.csv')

In [None]:
t_provin.head()

In [None]:
t_provinG = t_provin.groupby('date')['confirmed'].sum()
t_provinG = pd.DataFrame(t_provinG)

In [None]:
t_provinG['date'] = t_provinG.index
t_provinG.reset_index(drop=True, inplace=True)
t_provinG = t_provinG.reindex(columns=['date', 'confirmed'])

In [None]:
t_provinG['confirmed'] = t_provinG['confirmed'].diff()

In [None]:
t_provinG.head()

In [None]:
t_provinG.nlargest(3, 'confirmed')

In [None]:
t_provinG['confirmed'].sum()

In [None]:
t_provinG.plot(figsize=(10,8))

# The total number of newly confirmed cases per day appears to be calming down.
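
Daily counts are noisy, so a rolling mean can make the trend easier to see. The following is a minimal sketch on a made-up cumulative series: `cumulative` is hypothetical, and the 3-day window is an arbitrary choice for such a short example.

```python
import pandas as pd

# Hypothetical cumulative confirmed counts (made-up values).
cumulative = pd.Series(
    [10, 30, 70, 120, 160, 190, 210, 220],
    index=pd.date_range('2020-03-01', periods=8, freq='D'),
)

# diff() turns cumulative totals into daily new cases; the first value
# has no previous day and is NaN, so it is dropped.
daily_new = cumulative.diff().dropna()

# The 3-day rolling mean smooths day-to-day noise.
smoothed = daily_new.rolling(window=3).mean()
print(pd.DataFrame({'daily_new': daily_new, 'smoothed': smoothed}))
```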

In [None]:
i = t_provin[((t_provin.province == 'Daegu') | (t_provin.province == 'Gyeongsangbuk-do'))].index
t_provin_except = t_provin.drop(i)

In [None]:
t_provin_exceptG = t_provin_except.groupby('date')['confirmed'].sum()
t_provin_exceptG = pd.DataFrame(t_provin_exceptG)

In [None]:
t_provin_exceptG['date'] = t_provin_exceptG.index
t_provin_exceptG.reset_index(drop=True, inplace=True)
t_provin_exceptG = t_provin_exceptG.reindex(columns=['date', 'confirmed'])

In [None]:
t_provin_exceptG['confirmed'] = t_provin_exceptG['confirmed'].diff()

In [None]:
t_provin_exceptG.head()

In [None]:
t_provin_exceptG['confirmed'].sum()

In [None]:
t_provin_exceptG.plot(figsize=(10,8))

# Excluding Daegu and GSBD, the number of newly confirmed cases per day is trending upward, because the overseas inflow is increasing.
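
The direction of the trend can be quantified with the slope of a straight-line fit to the daily new cases. This sketch runs on made-up data: `toy` is a hypothetical stand-in for `TimeProvince.csv`, and a linear fit is only a rough summary of the trend, not a forecast.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative counts per province and date (made-up values).
toy = pd.DataFrame({
    'date': ['2020-03-01'] * 3 + ['2020-03-02'] * 3
            + ['2020-03-03'] * 3 + ['2020-03-04'] * 3,
    'province': ['Daegu', 'Seoul', 'Busan'] * 4,
    'confirmed': [1000, 10, 5, 1500, 14, 7, 1800, 20, 10, 1900, 30, 14],
})

# Drop the excluded provinces, then sum the rest per date.
excluded = ['Daegu', 'Gyeongsangbuk-do']
rest = toy[~toy['province'].isin(excluded)]
daily_new = rest.groupby('date')['confirmed'].sum().diff().dropna()

# Slope of a least-squares line: positive means daily cases are rising.
days = np.arange(len(daily_new))
slope = np.polyfit(days, daily_new.to_numpy(dtype=float), deg=1)[0]
print(daily_new.tolist(), slope)
```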

These prediction models are not trained on other factors related to the virus. They can only give an approximate result by referring to past data.
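
One way to judge how approximate such a forecast is would be to hold out the last few days, fit on the rest, and measure the error on the held-out days. The sketch below does this with two simple baselines on a made-up series. `cumulative`, `horizon`, and both baselines are hypothetical and are not the Prophet, MLP, or ARIMA models used above.

```python
import numpy as np

# Hypothetical cumulative confirmed counts (made-up values).
cumulative = np.array(
    [100, 180, 270, 350, 420, 480, 530, 570, 600, 625], dtype=float
)

# Hold out the last `horizon` days for evaluation.
horizon = 3
train, test = cumulative[:-horizon], cumulative[-horizon:]

# Baseline 1: repeat the last observed value.
naive = np.repeat(train[-1], horizon)

# Baseline 2: extrapolate a straight line fitted to the training days.
x_train = np.arange(len(train))
slope, intercept = np.polyfit(x_train, train, deg=1)
x_test = np.arange(len(train), len(train) + horizon)
linear = slope * x_test + intercept

# Mean absolute error of each baseline on the held-out days.
mae_naive = np.mean(np.abs(test - naive))
mae_linear = np.mean(np.abs(test - linear))
print(mae_naive, mae_linear)
```

The same split could in principle be applied to any of the models above by fitting on `train` only and comparing the forecast with `test`.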

This simple analysis shows how large the impact of the Shincheonji Church was. Daegu and Gyeongsangbuk-do were affected the most by it, and many Korean experts say that the SCJ Church cluster distorts the curve of confirmed cases. When the regions hit by that cluster are excluded, the curve trends upward because the overseas inflow is increasing. Most of the overseas inflow consists of Koreans living abroad, who may be returning to Korea because of its medical insurance and well-developed medical resources.

The Korean government covers all costs of testing and treatment, for foreign patients as well as Koreans, in order to prevent transmission. For this reason, many Koreans and experts are worried about a sudden inflow of people from other countries. As the graph above excluding Daegu and Gyeongsangbuk-do shows, experts are warning that a major crisis could soon develop around Seoul and other big cities. Measures to address the increasing overseas inflow may be needed to prevent a future crisis.