This notebook supports my coursework for the INM433 Visual Analytics module, taken as an MSc Data Science student at City, University of London (2020-21 academic year). The accompanying reports for this assessment have also been submitted via Turnitin.

***Summary note*** This study analyses how expressway traffic volume changed at the South Korean county level in relation to the number of confirmed COVID-19 cases from January to June 2020. To visualise the COVID-19 trajectories, this notebook uses the South Korea Patient, Policy, and Provincial dataset (DS4C-PPP) and joins it to the Korea Traffic Mobility data, which is measured at every expressway gate operated by the Korea Expressway Corporation (KEC).

As reference work on tracking human mobility patterns along an epidemic trajectory, Cartenì et al. (2020) reported trends in Italians' daily mobility habits based on surveys and mobile interviews. In addition, Scala et al. (2020) calculated an inter-regional mobility matrix from Facebook user locations, together with an inter-age social contact matrix, for Italy.

[1] Cartenì, A., Di Francesco, L. and Martino, M., 2020. How mobility habits influenced the spread of the COVID-19 pandemic: Results from the Italian case study. Science of the Total Environment, 741, p.140489. <br>[2] Scala, A., Flori, A., Spelta, A., Brugnoli, E., Cinelli, M., Quattrociocchi, W. and Pammolli, F., 2020. Time, space and social interactions: exit mechanisms for the COVID-19 epidemics. arXiv preprint arXiv:2004.04608.

***Keywords: Coronavirus, COVID-19, South Korea, traffic mobility***

**Please reach out to me on Unibuddy if you have questions about City's MSc Data Science programme!**
https://www.city.ac.uk/study/ask-a-student?unibuddy=buddies/students/5e21c2fcd16678055b9fac23

In [None]:
# basics
import pandas as pd
from datetime import timedelta

# statistics
import statsmodels.api as sm
import scipy.stats as scs

# modelling
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# visualisation
import plotly
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from plotly import tools

## Load the First dataset (DS4C-PPP)

The South Korean government decided to publish open data on each day's newly confirmed patients and to keep their status (released or deceased) up to date through the Korea Centers for Disease Control & Prevention (KCDC). Because the data were so detailed, however, concerns were raised about privacy and about the stigma attached to the restaurants and shops on patients' routes. Hence the Data Science for COVID-19 (DS4C) project team anonymised the patient routes and published the South Korea Patient, Policy, and Provincial dataset (DS4C-PPP) on Kaggle, as described in their NeurIPS 2020 paper.

In [None]:
# modified province to county level in TimeProvince
timeprovince = pd.read_csv('../input/edit-timeprovince/edit_TimeProvince.csv')
timeprovince.head(3)

In [None]:
timeprovince['date'] = pd.to_datetime(timeprovince['date'], format='%m/%d/%Y')
timeprovince = timeprovince[timeprovince['sum_confirmed'].notna()]
timeprovince.drop(columns = ['confirmed', 'released', 'deceased'], axis = 1, inplace = True)
timeprovince = timeprovince.rename(columns={'province': 'region'})
timeprovince.head(3)

In [None]:
# daily new cases = day-on-day difference of the cumulative count within each region;
# the first row of each region has no previous day, so it keeps its cumulative value
timeprovince['new_daily_cases'] = (
    timeprovince.groupby('region')['sum_confirmed'].diff()
    .fillna(timeprovince['sum_confirmed'])
)

In [None]:
policy = pd.read_csv('../input/coronavirusdataset/Policy.csv')
policy['date'] = pd.to_datetime(policy['start_date'], format='%Y/%m/%d')
policy['date_end'] = pd.to_datetime(policy['end_date'], format='%Y/%m/%d')
policy.head(3)

## Load the Second dataset (TCS)

In addition, an aggregated dataset was downloaded from the Public Data Portal to capture human movement. The six monthly Traffic Control System (TCS) CSV files were easy to concatenate because they share the same structure, but the Korean headers and a few data labels had to be re-encoded in English (or read with `unicode_escape`). Since the matrix splits arrivals into eight features, new fields summing the values by departure and by arrival were calculated in the final columns. ***trafficmobilitysouthkorea2020 is now up on Kaggle***

In [None]:
import glob, os
tcs = pd.concat(map(pd.read_csv, glob.glob(os.path.join('../input/trafficmobilitysouthkorea2020/', 'edit_TCS_*.csv'))))
tcs.head(8)

In [None]:
tcs.date = pd.to_datetime(tcs['date'], format='%Y%m%d')
tcs.drop(columns = ['arrival_Capital','arrival_Gangwon','arrival_DaejeonChungnam','arrival_GwangjuJeonnam','arrival_DaeguGyeongbuk','arrival_BusanGyeongnam','arrival_Jeonbuk','arrival_Chungbuk'], axis = 1, inplace = True)
tcs['region_mobility'] = tcs['sum_bydeparture'].values + tcs['sum_byarrival'].values
tcs.head(8)

## Design a new dataset

In [None]:
newdf = pd.merge(timeprovince[['date','region','new_daily_cases']], tcs[['date','region','region_mobility']], how = 'left', on = ['date','region']).sort_values(by=['date'])
newdf.head(8)

In [None]:
# make a copy for modelling before removing NA rows 
modelling_data = newdf.copy(deep = True)
modelling_data.reset_index(drop = True, inplace = True)

In [None]:
# remove NA rows
newdf = newdf[~pd.isna(newdf.new_daily_cases)]
newdf = newdf[~pd.isna(newdf.region_mobility)] #Jeju-do
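
The left merge above keeps every case row and fills `region_mobility` with NaN wherever the traffic data has no matching date and region, which appears to be the case for Jeju-do here. The following is a minimal sketch of that behaviour on toy frames; the dates and values are made up for illustration and are not taken from either dataset.

```python
import pandas as pd

# hypothetical case counts for two regions on one day
cases = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01', '2020-03-01']),
    'region': ['Capital', 'Jeju-do'],
    'new_daily_cases': [5.0, 1.0],
})

# hypothetical expressway traffic; there is no row for Jeju-do
traffic = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01']),
    'region': ['Capital'],
    'region_mobility': [1000.0],
})

# left merge: all rows of `cases` survive, unmatched rows get NaN mobility
merged = pd.merge(cases, traffic, how='left', on=['date', 'region'])

# keep only rows where both the response and the feature are present
complete = merged.dropna(subset=['new_daily_cases', 'region_mobility'])
print(complete['region'].tolist())
```

Copying the merged frame before dropping rows, as the notebook does with `modelling_data`, keeps the unmatched rows available for later steps.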

## 1) Computational analysis part: Modelling new daily cases as a function of mobility

Cartenì et al. (2020) model the number of daily new COVID-19 infections recorded using a multiple linear regression model with features including socio-economic variables (e.g. population), territorial variables (e.g. kilometers of coastline), environmental variables such as average temperature, health care variables and mobility habits variables (e.g. number of citizens making a trip per day). The exact model specification is given as:

$$
\begin{aligned}
y_{t, i} =& \beta_{1} \cdot \text { POPdensity }_{t}+\beta_{2} \cdot P M_{t}+\beta_{3} \cdot \text { NTESTS }_{t, i}+\beta_{4} \cdot T T D_{t, i} \\
&+\beta_{5} \cdot \operatorname{MOB}_{t, i-x}+\beta_{6} \cdot \text { TEMP }_{t, i-x}+\text { Constant }
\end{aligned}
$$

where $t$ corresponds to a region in Italy and $i$ is an integer day index in 2020. $\text{POPdensity}_t$ gives the population density in region $t$, $PM_t$ gives an environmental pollutant measure, and $\text{NTESTS}_{t, i}$ gives the number of COVID-19 tests in region $t$ on day $i$. $\text{TTD}_{t,i}$ gives a (weighted) measure of travel time from Codogno, where the outbreak in Italy began. $\text{MOB}_{t, i - x}$ gives a measure of mobility in region $t$, $x$ days before the recorded cases on day $i$; likewise, $\text{TEMP}_{t, i-x}$ is the average temperature in region $t$, $x$ days earlier. Mobility and temperature data from $x$ days prior are used to account for the delay between contagion and detection.


In this study, a similar, but simpler methodology is used to measure the significance of mobility on the spread of COVID-19 in South Korea, as defined by the number of new daily cases. An adaptation of the model in Cartenì et al. (2020) is given as follows:

$$
y_{t,i} = \beta_{1, t} \cdot \text{MOB}_{t,i - x} + \beta_{2, t} \cdot \text{ MONTH }_{i} + \beta_{3} \cdot \text{ DAYS }_{i} + \text{ Constant }
$$

where $\beta_{1,t}$ is the coefficient for mobility in region $t$, $\beta_{2,t}$ is a region-specific coefficient for the calendar month (to account for COVID-19 seasonality), and $\beta_3$ is a single coefficient, shared across regions, for the number of days since the first case (20 January 2020).
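
The region-specific coefficients can be obtained by multiplying region indicator columns by the feature, so that each region gets its own column. The following is a minimal sketch on toy data; the region labels `A` and `B` and all values are illustrative assumptions, and `pd.get_dummies` is used here instead of the `OneHotEncoder` in the notebook's modelling cell.

```python
import pandas as pd

# toy data: two hypothetical regions, values are illustrative only
toy = pd.DataFrame({
    'region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'mobility': [0.1, 0.5, 0.9, 0.2, 0.4, 0.8],
    'days_since': [0, 1, 2, 0, 1, 2],
})

# one indicator column per region
dummies = pd.get_dummies(toy['region'], prefix='mobility').astype(float)

# region x mobility: each column is non-zero only for rows of its own region,
# so a linear model assigns each region its own mobility coefficient
interactions = dummies.multiply(toy['mobility'], axis=0)

# design matrix: constant, shared days_since term, per-region mobility terms
X = pd.concat([toy[['days_since']], interactions], axis=1)
X.insert(0, 'const', 1.0)
print(X.columns.tolist())
```

The month interaction follows the same pattern, with the month column in place of `mobility`.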

In [None]:
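# modelling data for predicting the number of cases on the current day
# given the mobility n days prior

# function to generate modelling data
def prior_temp_mobility(df, n):
    # work on a copy of one region's rows with a clean 0..N-1 index
    _df = df.reset_index(drop = True)

    # rows from position 50 onward are overwritten; earlier rows keep their own values
    min_not_na = 50
    max_not_na = _df.shape[0]

    # replace the columns from position 3 onward (region_mobility) with the values n rows earlier
    _df.iloc[min_not_na:max_not_na, 3:] = _df.iloc[(max(0, min_not_na - n)):(max_not_na - n), 3:].values

    return _df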
    

# days 1 to 30
frames = []

for n in range(1, 31):
    temp_frame = modelling_data.groupby(['region']).apply(lambda x : prior_temp_mobility(x, n)).reset_index(drop = True)
    
    # remove NA cases
    temp_frame = temp_frame[~pd.isna(temp_frame.new_daily_cases)]
    temp_frame = temp_frame[~pd.isna(temp_frame.region_mobility)]
    
    # create days since first case column
    temp_frame['days_since'] = (temp_frame.date - pd.Timestamp('2020-01-20')).dt.days
    # extract month from date     
    temp_frame['month'] = temp_frame.date.dt.month
    
    # for modelling, only include data from 20/01/2020 to 01/05/2020 (i.e. before the May data)
    temp_frame = temp_frame[temp_frame.date <= pd.Timestamp('2020-05-01')]

    # append to container
    frames.append(temp_frame)
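
The lag can also be expressed with `groupby(...).shift(n)`. The sketch below is an alternative technique on toy data, not the `prior_temp_mobility` function itself: it shifts every row within a region and leaves the first `n` rows as NaN, whereas the function above only overwrites rows from position 50 onward. The column name `mobility_lagged` and all values are illustrative assumptions.

```python
import pandas as pd

# toy frame: one hypothetical region, five consecutive days
toy = pd.DataFrame({
    'date': pd.date_range('2020-03-01', periods=5),
    'region': ['A'] * 5,
    'new_daily_cases': [1, 2, 3, 4, 5],
    'region_mobility': [10.0, 20.0, 30.0, 40.0, 50.0],
})

n = 2  # lag in days (illustrative)

# rows must be in date order within each region before shifting
toy = toy.sort_values(['region', 'date'])

# mobility recorded n rows (days) earlier within the same region
toy['mobility_lagged'] = toy.groupby('region')['region_mobility'].shift(n)
print(toy[['date', 'new_daily_cases', 'mobility_lagged']])
```

Shifting by rows equals shifting by days only when each region has exactly one row per day with no gaps.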

In [None]:
temp_frame.head()

In [None]:
models = []

for modelling_frame in frames:
    # copy modelling_frame     
    _frame = modelling_frame.copy(deep = True).reset_index(drop = True)
    
    # get response columns     
    response = _frame.new_daily_cases.values
    
    # one hot encode regions
    region = _frame['region'].values.reshape(-1, 1)
    _frame.drop(['new_daily_cases', 'region', 'date'], axis = 1, inplace = True)
    enc = OneHotEncoder(handle_unknown = 'ignore')
    enc.fit(region)
    
    # min max scale quantitative mobility features
    minmaxscaler = MinMaxScaler()
    cols = ['region_mobility']
    minmaxscaler.fit(_frame.loc[:, cols])
    _frame.loc[:, cols] = minmaxscaler.transform(_frame.loc[:, cols])
    
    # create region x _ features
    
    # region one hot encoded (converted to a dense array, which avoids the version-dependent `sparse` argument)
    region_ohe = enc.transform(region).toarray()
    region_names = enc.categories_[0]

    # region x mobility
    mobility = pd.DataFrame(region_ohe, columns = [f'mobility_{r}' for r in region_names]).multiply(_frame.region_mobility, axis = 0)
    
    # region x month
    monthly = pd.DataFrame(region_ohe, columns = [f'monthly_{r}' for r in region_names]).multiply(_frame.month, axis = 0)
    
    # concatenate feature frames
    _frame = pd.concat([_frame, mobility, monthly], axis = 1)

    # add a constant column for the intercept 
    _frame = sm.add_constant(_frame)
    
    _frame = _frame.loc[:, [col for col in _frame.columns if col not in ['region_mobility', 'month']]]
  
    # train an OLS model
    reg = sm.OLS(response, _frame.values)
    reg = reg.fit()
    
    #_ = {'model': reg, 'r2': reg.score(_frame, response), 'coefficients':reg.coef_, 'intercept': reg.intercept_}
    
    models.append(reg)

In [None]:
reg.predict(_frame.values).max()

In [None]:
_frame.columns

# var 1(global mobility), 7(mobility DaeguGyeongbuk), 15(monthly DaeguGyeongbuk)

# 5 and 13 

# DaeguGyeongbuk, with the most data, has mobility correlated 
# shifted 28 days ahead -> DaeguGyeongbuk is significant

In [None]:
for i, model in enumerate(models):
    print(i+1)
    print(model.summary())
    print('\n')

In [None]:
# using the mobility data 2 days prior fits the training data the best in terms 
# of R2 -> there is no validation data; the aim is only to see whether the
# mobility feature carries any signal with respect to the number of new
# COVID-19 cases

# model is y = b_i*Mob + c_i*month + d*days_since_outbreak + const (b_i is the parameter for the mobility in region i, c_i for the month information in region i)
# b_i for DaeguGyeongbuk = x5 (statistically significant)
# models[k] was fitted with a lag of k + 1 days
models[29].summary().tables[0]

In [None]:
# x5 corresponds to mobility_DaeguGyeongbuk -> P>|t| < 0.05, which means that mobility is statistically significant 
# in the linear model; all other features are not significant, e.g. mobility_capital etc.
models[29].summary().tables[1]
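
Comparing lags by in-sample R2 can be sketched on synthetic data with a known lag. This is an illustrative check using NumPy only, not the notebook's statsmodels pipeline; the series, the `true_lag` value and the `r_squared` helper are all assumptions made for the example. For the fitted statsmodels results in `models`, the corresponding quantity is the `rsquared` attribute.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, max_lag, true_lag = 120, 10, 5

# synthetic mobility series and a response that follows it true_lag steps later
mobility = rng.normal(size=n_obs)
cases = np.zeros(n_obs)
cases[true_lag:] = 2.0 * mobility[:-true_lag]

def r_squared(x, y):
    """R2 of a least-squares line y ~ a + b * x (hypothetical helper)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()

# evaluate every candidate lag on the same response window
scores = {}
for lag in range(1, max_lag + 1):
    x = mobility[max_lag - lag:n_obs - lag]
    y = cases[max_lag:]
    scores[lag] = r_squared(x, y)

best_lag = max(scores, key=scores.get)
print(best_lag)
```

Because every lag is scored on the data it was fitted to, a high R2 here indicates signal rather than predictive performance, which matches the caveat in the comments above about the absence of validation data.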

## 2) Visual analysis part:

Cartenì et al. (2020) identify strong (large and statistically significant) coefficients relating the number of COVID-19 cases to the modelling features. In particular, their best model is obtained by setting the lag between contagion and detection at $x$ = 21. In this study, however, shifting the mobility data 29 days forward is more in line with the recording of new COVID-19 cases in the DaeguGyeongbuk region, as illustrated in the figure of DaeguGyeongbuk mobility and COVID-19 cases.

In [None]:
subset_data = newdf[newdf.region == 'DaeguGyeongbuk']
subset_data = subset_data[~pd.isna(subset_data.new_daily_cases)]
subset_data = pd.merge(subset_data, policy[['type', 'gov_policy', 'detail', 'date', 'date_end']],how = 'left', on  = ['date'])
subset_data[subset_data.detail == 'Level 4 (Red)'].date
subset_data.head(3)

In [None]:
trace1 = go.Scatter(x = subset_data['date'], y = subset_data['new_daily_cases'], text = subset_data['gov_policy'], name = 'Number of new cases')
trace2 = go.Scatter(x = subset_data['date'],y = subset_data['region_mobility'], name = 'Number of arrivals and departures')
trace3 = go.Scatter(x = subset_data['date'] + timedelta(days = 29),y = subset_data['region_mobility'],name = 'Number of arrivals and departures shifted by +29 days')

fig = make_subplots(specs = [[{'secondary_y': True}]])
fig.add_trace(trace1)
fig.add_trace(trace2, secondary_y = True)
fig.add_trace(trace3, secondary_y = True)

fig.update_layout(title = 'Number of daily COVID-19 cases and total number of arrivals and departures in DaeguGyeongbuk', xaxis_title = 'Date', yaxis_title = 'Number of new COVID-19 cases')
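# vertical line at the start date of the 'Level 4 (Red)' policy (a line shape needs scalar x0 and x1)
red_alert_dates = subset_data.loc[subset_data.detail == 'Level 4 (Red)', 'date']
if not red_alert_dates.empty:
    red_alert_date = red_alert_dates.iloc[0]
    fig.update_layout(shapes=[dict(type= 'line', yref= 'paper', y0= 0, y1= 1, xref= 'x', x0=red_alert_date, x1=red_alert_date)])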
fig.update_yaxes(title_text = "No. arrivals and departures", secondary_y = True)