# Understanding the spread of COVID-19 based on European traffic data during summer

## 1. Introduction

### 1.1. Data

- Our World in Data COVID-19 dataset
    - Data on deaths in different countries.


- Google's traffic data 
    - Google provides anonymized insights from products such as Google Maps to help researchers carry out analyses that are critical for combating COVID-19.
    - Google has divided their traffic data into six traffic components: 
        1. retail \& recreation
            - places like restaurants, cafes, shopping centers, theme parks, museums, libraries and movie theaters
        2. grocery \& pharmacy
            - places like grocery markets, food warehouses, farmers markets, specialty food shops, drug stores and pharmacies
        3. parks
            - places like national parks, public beaches, marinas, dog parks, plazas and public gardens
        4. transit stations
            - places like public transport hubs such as subway, bus and train stations
        5. workplaces 
            - places of work
        6. residential 
            -  places of residence
     
  - These components do not tell how much time people spend in each category on average, but they still give a lot of information about how people's traffic behavior changed during the pandemic.
 
 - The baseline of the traffic data is computed as a median value over multiple days. Day-to-day changes should not be overemphasized because they are affected by many different factors, e.g. the weather and public events.
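
As a rough illustration of how a baseline-relative percentage of this kind could be computed, the sketch below takes the median per weekday over a baseline period and expresses each day as a percent change from it. The function name, the per-weekday grouping and the `visits` series are assumptions made for this example; they are not taken from this notebook's preprocessing.

```python
import pandas as pd


def percent_change_from_baseline(visits: pd.Series, baseline_start, baseline_end) -> pd.Series:
    """Percent change of each day relative to the median of the same weekday
    in the baseline period (hypothetical helper, for illustration only)."""
    # Restrict to the baseline period (label-based slicing is inclusive)
    baseline = visits.loc[baseline_start:baseline_end]
    # Median visit count for each weekday (0 = Monday, ..., 6 = Sunday)
    weekday_median = baseline.groupby(baseline.index.dayofweek).median()
    # Reference value for every day: the baseline median of its weekday
    reference = pd.Series(visits.index.dayofweek, index=visits.index).map(weekday_median)
    return 100.0 * (visits - reference) / reference


# Made-up data: two baseline weeks at 100 visits, then one week at 50 visits
idx = pd.date_range("2020-01-06", periods=21)
visits = pd.Series([100.0] * 14 + [50.0] * 7, index=idx)
change = percent_change_from_baseline(visits, "2020-01-06", "2020-01-19")
print(change.tail(3))
```

Using a median rather than a mean keeps a single unusual day in the baseline period from shifting the reference level.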




### 1.2. Rationale behind the Bayesian model


- Reducing people's movement, as measured by the traffic data, slows down the spread of the virus.
    - Several articles (Ferretti et al. 2020, ECDC report, LSHTM report) have pointed out that pre-symptomatic and asymptomatic infections play a significant role in the spread of COVID-19. This observation suggests that getting the symptomatic cases to stay at home may not be enough. Governmental restrictions should also be implemented to reduce people's movement and thereby bring the pandemic under control.
    - Essentially, the reason to implement non-pharmaceutical interventions is to get people's traffic data down!


- There are multiple reasons why it makes sense to analyse European countries in this research
    - COVID-19 hit European countries badly during autumn
    - European governments have broadly similar capabilities to restrict their citizens' movement, in contrast to many other countries, e.g. China
    - European countries adopted a suppression strategy rather than a mitigation strategy
    
    
- There are many major differences between European countries which affect the spread of COVID-19
    - Examples: different age distributions, different population densities in cities, cultural differences
    - Therefore, direct comparisons between European countries should be avoided

    
- COVID-19 case data is not reliable, at least not as the only measure of the development of the epidemic. COVID-19 death data has many benefits compared to case data!
    - The amount of testing varies a lot between countries
    - In addition, deaths measure a country's success against the epidemic much better than infections do
    





### 1.3. Motivation, research goals


1. To create a model which predicts the spread of the epidemic well based on the traffic data and the data on deaths.

2. Based on the created model, to understand which of Google's traffic components predicted the spread of COVID-19 in different European countries.


### 1.4.  Structure



1. Introduction
    - Describes the essentials of this notebook
2. Getting an overview of how COVID-19 cases and people's traffic behavior developed during the pandemic
    - This section gives the motivation for sections 4-8.
3. Dividing regions inside European countries into groups based on the development of the epidemic
    - COVID-19 hit European countries very differently. Therefore they are divided into three different groups.
4. Based on countries which did well against the epidemic during summer, understanding the impact of different traffic components on the spread of the epidemic
    - This section uses the assumption: The spread of COVID-19 can be predicted with people's traffic behavior.
    - This section gives an understanding of which Google traffic components had the biggest impact on the spread of COVID-19.
    - Important section, contains results!
5. Based on countries where the epidemic escalated during the summer, understanding the impact of different traffic components on the spread of the epidemic
    - This section has the same aim as section 4. However, the methods used in these two sections differ strongly from each other.
    - Important section, contains results!
6. Getting an overview of whether the traffic components with the most impact predicted the spread of COVID-19 well in different European countries
    - NOT YET IMPLEMENTED AT ALL
7. Summary
    - NOT YET IMPLEMENTED AT ALL
    - Summarizes the whole notebook and also opens discussion about the findings

### 1.5. Libraries used in this notebook

In [9]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn import linear_model 

import datetime

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import stan

### 1.6. Parameters which need to be defined manually

- All the parameters that need to be defined manually are collected here


- Detailed explanations of these parameters are given in the text

In [10]:
#############################################################################

# There is COVID-19-data from February until now
observations_start_date = datetime.datetime(2020, 2, 1, 0, 0)
observations_end_date = datetime.datetime(2020, 11, 15, 0, 0)

# However, the data analysis of this notebook concentrates on autumn months, i.e. on so called tail
tail_start_date = datetime.datetime(2020, 9, 1, 0, 0)


#############################################################################

# European countries ordered by population
european_countries = [
    'Germany', 'United Kingdom', 'France', 'Italy', 'Spain', 
    'Ukraine', 'Poland', 'Romania','Netherlands', 'Belgium', 
    'Greece', 'Sweden', 'Portugal', 'Hungary', 'Belarus', 
    'Austria', 'Switzerland', 'Bulgaria', 'Serbia', 'Denmark', 
    'Finland', 'Norway', 'Ireland', 'Croatia', 'Moldova', 
    'Bosnia and Herzegovina', 'Lithuania', 'Slovenia', 'Estonia' ]


#############################################################################

# Window size for convolution
w = 7

# Death limit
d = 1
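
The window size `w` above is described as a convolution window. A minimal sketch of such a moving-average smoothing is shown below, assuming a uniform kernel and NumPy's `mode="same"` edge handling; the helper name `smooth` and the choice of kernel are assumptions, since the preprocessing that produced the smoothed columns is not part of this notebook.

```python
import numpy as np


def smooth(values, w=7):
    """Moving average of width w computed by convolution (illustrative sketch)."""
    kernel = np.ones(w) / w  # uniform weights that sum to 1
    # mode="same" keeps the output as long as the input; values near the
    # edges are averaged against implicit zeros and are therefore damped
    return np.convolve(np.asarray(values, dtype=float), kernel, mode="same")


daily_deaths = [7] * 21  # made-up constant series
smoothed = smooth(daily_deaths, w=7)
print(smoothed[:5], smoothed[10])
```

With `w = 7` every full window covers exactly one week, which removes the weekly reporting cycle from the smoothed series.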

In [11]:
# Follows directly from manual definitions 
num_countries = len(european_countries)
whole_interval_len = (observations_end_date - observations_start_date).days 
tail_interval_len = (observations_end_date - tail_start_date).days 
date_list = [observations_start_date + datetime.timedelta(days=x) for x in range(whole_interval_len)]

print("There are " + str(num_countries) + " European countries analysed in this notebook.")
print("The length of the whole interval: " + str(whole_interval_len))
print("The length of the tail interval: " + str(tail_interval_len))

There are 29 European countries analysed in this notebook.
The length of the whole interval: 288
The length of the tail interval: 75


#### Definition: Autumn interval $\tau_{\text{summer}}$

Let's denote the tail interval by $\tau_{\text{summer}} = \{ 0,1,2, \dots, 75 \}$. Here $t = 0$ corresponds to the date 1.9.2020, $t = 1$ corresponds to the date 2.9.2020 and so on.
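
The correspondence between a day index $t$ and a calendar date can be sketched as follows. The helper names are hypothetical; the start date is the `tail_start_date` defined in section 1.6.

```python
import datetime

tail_start = datetime.datetime(2020, 9, 1)  # same value as tail_start_date


def day_index_to_date(t):
    """Calendar date of day index t (t = 0 is the first day of the tail)."""
    return tail_start + datetime.timedelta(days=t)


def date_to_day_index(date):
    """Day index of a calendar date, counted from the start of the tail."""
    return (date - tail_start).days


print(day_index_to_date(1))
print(date_to_day_index(datetime.datetime(2020, 11, 15)))
```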

### 1.7. Adding previously cleaned data to dataframes 

#### Dataframe df_countries

- The length of this dataframe is 'num_countries', i.e. there is one row for each country.

In [12]:
# A dataframe sorted by countries

dtypes_countries = np.dtype([
          ('country', str),
          ('group', float),
          ('population', int),
          ('population_in_millions', int),
          ('significance_level', float),
          ('control_point', str),
          ('escalation_point_deaths', str),
          ('deaths_escalated_rapidly', str),
          ('escalation_point_infections', str),
          ('infections_escalated_rapidly', str),
          ])

df_countries = pd.DataFrame(pd.read_csv('Data/Preprocessed_data/df_countries.csv', dtype=dtypes_countries))

# Change date-columns from string-type to datetype
df_countries['control_point'] = pd.to_datetime(df_countries['control_point'], format='%Y-%m-%d')
df_countries['escalation_point_deaths'] = pd.to_datetime(df_countries['escalation_point_deaths'], format='%Y-%m-%d')
df_countries['escalation_point_infections'] = pd.to_datetime(df_countries['escalation_point_infections'], format='%Y-%m-%d')

# Show the dataframe
df_countries

FileNotFoundError: [Errno 2] No such file or directory: 'Data/Preprocessed_data/df_countries.csv'

#### Dataframe df_days_by_countries

- Contains, for each country, all the days for which traffic and infected data exist


- The length of the dataframe equals 'num_countries' * 'whole_interval_len'



In [None]:
# A countrywise sorted dataframe s.t. for each day on the time interval of each country there is a row 

dtypes_days_by_countries = np.dtype([
          ('country', str),  # country name
          ('date', str), # current date, read as a string and converted to datetime below with pd.to_datetime
          ('new_infections', int), # new infections on that date
          ('new_infections_smooth', int), # smoothened new infections on that date
          ('new_deaths', int), # new deaths on that date
          ('new_deaths_smooth', int), # smoothened new deaths on that date
          ('total_deaths_per_million', float), # how many deaths per million have occurred until that date
          ('traffic_retail', float), # retail and recreation traffic on that date
          ('traffic_supermarket', float), # supermarket and pharmacy traffic on that date
          ('traffic_parks', float),  # park traffic on that date
          ('traffic_transit_stations', float), # transit station traffic on that date
          ('traffic_workplaces', float), # workplace traffic on that date
          ('traffic_residential', float), # residential traffic on that date
          ])

df_days_by_countries = pd.DataFrame(pd.read_csv('Data/Preprocessed_data/df_days_by_countries.csv', dtype=dtypes_days_by_countries))   

# Change the date-column from string-type to datetype
df_days_by_countries['date'] = pd.to_datetime(df_days_by_countries['date'], format='%Y-%m-%d')


#pd.set_option('display.max_rows', None)
# Show the dataframe
df_days_by_countries

#### Dataframe df_regions


- Each country consists of smaller regions.


- For each region there is one row.

In [None]:
# A countrywise sorted dataframe s.t. there is one row for each region

dtypes_regions = np.dtype([
          ('country', str),  # country name
          ('region', str),   # region name
          ('group', float),    # the group of the region. Remark: This should be str!
          ])

df_regions = pd.DataFrame(pd.read_csv('Data/Preprocessed_data/df_regions.csv', dtype=dtypes_regions)) 

# Show the dataframe
df_regions

#### Dataframe df_days_by_regions


- Similar to the dataframe 'df_days_by_countries', but each row corresponds to a specific day of a region instead of a country

In [None]:
# A countrywise sorted dataframe s.t. for each day on the time interval of each region there is a row 

dtypes_days_by_regions = np.dtype([
          ('country', str),  # country name
          ('region', str),  # region name
          ('date', str), # current date
          ('new_infections', float), # new infections. This information is available only for some regions!
          ('traffic_retail', float), # retail and recreation traffic on that date
          ('traffic_supermarket', float), # supermarket and pharmacy traffic on that date
          ('traffic_parks', float),  # park traffic on that date
          ('traffic_transit_stations', float), # transit station traffic on that date
          ('traffic_workplaces', float), # workplace traffic on that date
          ('traffic_residential', float), # residential traffic on that date
          ('traffic_retail_smooth', float), # smoothened retail and recreation traffic on that date
          ('traffic_supermarket_smooth', float), # smoothened supermarket and pharmacy traffic on that date
          ('traffic_parks_smooth', float),  # smoothened park traffic on that date
          ('traffic_transit_stations_smooth', float), # smoothened transit station traffic on that date
          ('traffic_workplaces_smooth', float), # smoothened workplace traffic on that date
          ('traffic_residential_smooth', float), # smoothened residential traffic on that date
          ])

df_days_by_regions = pd.DataFrame(pd.read_csv('Data/Preprocessed_data/df_days_by_regions.csv', dtype=dtypes_days_by_regions)) 

# Change the date-column from string-type to datetype
df_days_by_regions['date'] = pd.to_datetime(df_days_by_regions['date'], format='%Y-%m-%d')

# Show the dataframe
df_days_by_regions

#### Dataframe df_group1_regions_and_traffic


- Group 1 countries are explained in more detail later in the notebook. In practice, these are countries where the epidemic was under control during the whole summer.


- There is one row for each traffic component and for each region of these countries
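
The column `highest_traffic_average` in the cell below refers to a traffic average over a two-week period. One possible way to compute such a value is sketched here, assuming it means the maximum of a 14-day rolling mean of one traffic component in one region; the helper name and this interpretation are assumptions.

```python
import pandas as pd


def highest_two_week_average(traffic: pd.Series, window: int = 14) -> float:
    """Largest mean over any `window` consecutive days (illustrative sketch)."""
    # The first window - 1 positions are NaN and are ignored by max()
    return float(traffic.rolling(window=window).mean().max())


# Made-up percent changes: six days at 0 followed by fourteen days at +10
traffic = pd.Series([0.0] * 6 + [10.0] * 14)
print(highest_two_week_average(traffic))
```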

In [None]:
dtypes_group1_regions_and_traffic = np.dtype([
          ('country', str),  # country name
          ('region', str),   # region name
          ('traffic_component', str),   # current traffic component
          ('highest_traffic_average', float), # the highest traffic average of a 2 week period
          ])
data_group1_regions_and_traffic = np.empty(0, dtype=dtypes_group1_regions_and_traffic)
df_group1_regions_and_traffic = pd.DataFrame(data_group1_regions_and_traffic)

# Show the dataframe
df_group1_regions_and_traffic

#### Dataframe df_group2_regions_and_traffic


- Similar to the dataframe 'df_group1_regions_and_traffic', but this dataframe deals with the regions of group 2 countries instead of group 1 countries. Group 2 countries are countries where the epidemic was under control at first but later escalated again during summer.

In [None]:
dtypes_group2_regions_and_traffic = np.dtype([
          ('country', str),  # country name
          ('region', str),   # region name
          ('traffic_component', str),   # current traffic component
          ('traffic_average_before_escalation', float), # the traffic component's value when the epidemic started to escalate 
          ('infected_coefficient', float), # Linear regression is used to fit a coefficient 
          ('infected_intercept', float),  # and an intercept to the escalating infection data
          ('traffic_coefficient', float),  # The same is done for the current traffic component
          ('traffic_intercept', float),
          ])
data_group2_regions_and_traffic = np.empty(0, dtype=dtypes_group2_regions_and_traffic)
df_group2_regions_and_traffic = pd.DataFrame(data_group2_regions_and_traffic)

# Show the dataframe
df_group2_regions_and_traffic
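
The coefficient and intercept columns above come from fitting a straight line to a time series. A minimal sketch of such a fit is given below, using `numpy.polyfit` for brevity; the notebook imports `sklearn.linear_model`, which could be used for the same purpose. The helper name and the example numbers are made up for illustration.

```python
import numpy as np


def fit_line(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...
    Returns (coefficient, intercept)."""
    t = np.arange(len(values))  # day index used as the explanatory variable
    coefficient, intercept = np.polyfit(t, np.asarray(values, dtype=float), deg=1)
    return float(coefficient), float(intercept)


# Hypothetical daily new infections during an escalation
infected_coefficient, infected_intercept = fit_line([1, 3, 5, 7])
print(infected_coefficient, infected_intercept)
```

The coefficient is the average daily change of the fitted series, so the sign and size of the traffic and infection coefficients can be compared region by region.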

## 2. Getting an overview of how COVID-19 cases, deaths and people's traffic behavior developed during the pandemic

### 2.1. Plot all of Google's traffic components together with death and infected data countrywise

- The vertical black line marks the start of the tail of the pandemic 

In [13]:
# Define categories which are plotted
traffic_components = ['traffic_retail', 'traffic_supermarket', 'traffic_parks', 
                      'traffic_transit_stations', 'traffic_workplaces', 'traffic_residential'
                     ]

description = ['traffic in retail and recreation', 'traffic in supermarkets and pharmacies', 'traffic in parks',
               'traffic in transit stations', 'traffic in workplaces', 'traffic in residential areas'
              ]


# Loop over each country
for i in range(num_countries):
    
    # Define the current country, a temporary dataframe of the country and x-axis (dates)
    current_country = european_countries[i]
    df_current = df_days_by_countries[(df_days_by_countries['country'] == current_country)]
    x = df_current['date'].tolist() 
    
    print('\033[1m' + current_country)
    
    # Loop over each traffic component
    for j in range(len(traffic_components)):

        # Define y-components which are going to be plotted in one figure
        y_traffic = df_current[traffic_components[j]].tolist()
        y_infected = df_current['new_infections'].tolist()
        y_deaths = df_current['new_deaths'].tolist()

        # Define the figure and different y-axis (there are 3 in total: traffic, infected, deaths)
        fig, host = plt.subplots(figsize=(26, 6))
        fig.subplots_adjust(right=0.75)
        par1 = host.twinx()
        par2 = host.twinx()

        # Offset the second right-hand y-axis so that it does not overlap the first one
        par2.spines["right"].set_position(("axes", 1.1))

        # Plot the traffic, infected and death data
        p1, = host.plot(x, y_traffic, "b-", label='The change in ' + description[j] +  ' (%)' )
        p2, = par1.plot(x, y_infected, marker = 'o', linestyle='', color = "red", label="New infections ()") 
        p3, = par2.plot(x, y_deaths, marker = 'o', linestyle='', color = "black", label="New deaths ()")

        # Define the texts
        host.set_ylabel('The change in ' + description[j] +  ' (%)')
        par1.set_ylabel("New infections ()")
        par2.set_ylabel("New deaths ()")

        # Text on the axis with the correct color
        host.yaxis.label.set_color(p1.get_color())
        par1.yaxis.label.set_color(p2.get_color())
        par2.yaxis.label.set_color(p3.get_color())

        # Make little spikes for different y-axis
        tkw = dict(size=30, width=1.6)
        host.tick_params(axis='y', colors=p1.get_color(), **tkw)
        par1.tick_params(axis='y', colors=p2.get_color(), **tkw)
        par2.tick_params(axis='y', colors=p3.get_color(), **tkw)
        host.tick_params(axis='x', **tkw)

        plt.axvline(tail_start_date, color='black')
        
        print('\033[0m' + description[j])
        plt.show()

NameError: name 'df_days_by_countries' is not defined

### 2.2. Conclusions

- Based on the previous plots, residential traffic clearly does not accelerate the spread of COVID-19. The effect of the other traffic components will be analysed in more detail, and residential traffic data is not analysed any further in this project.
    
    
- Workplace data varies a lot. Therefore the smoothened workplace data ignores the values during weekends!


- It is difficult to analyse the impact of different traffic components on the spread of COVID-19 based on the first local maximum of the epidemic (in most countries it happened in March or April)
    
    - In almost every European country all the traffic data components except residential traffic went strongly down at the same time. Therefore, it is difficult to say which traffic component truly mattered based on the beginning of the pandemic.
    
    
- Therefore, let's concentrate the analysis on what happened after the first local maximum, i.e. on the tail of the pandemic, which is defined to start on 1.6.2020
    - Concentrating on the tail has the benefit that people have been aware of COVID-19 since then. In many countries COVID-19 started to spread fast because people did not take the pandemic seriously at first.
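
The weekend handling of the workplace data mentioned above could look roughly like the following sketch. It assumes a date-indexed series and a rolling mean that simply skips weekend values; the helper name and the window handling are assumptions, not the notebook's actual preprocessing.

```python
import pandas as pd


def smooth_ignoring_weekends(series: pd.Series, w: int = 7) -> pd.Series:
    """Rolling mean over w calendar days that uses weekday values only."""
    # Replace Saturday (5) and Sunday (6) values with NaN
    weekdays_only = series.where(series.index.dayofweek < 5)
    # rolling().mean() skips NaN; min_periods=1 keeps early and weekend rows defined
    return weekdays_only.rolling(window=w, min_periods=1).mean()


# Made-up workplace traffic: -20 % on weekdays, -60 % on weekends
idx = pd.date_range("2020-06-01", periods=14)  # 2020-06-01 is a Monday
workplaces = pd.Series([-20.0] * 5 + [-60.0] * 2 + [-20.0] * 5 + [-60.0] * 2, index=idx)
workplaces_smooth = smooth_ignoring_weekends(workplaces)
print(workplaces_smooth)
```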

## 3. Dividing regions inside European countries in groups based on the development of the epidemic

### 3.1. Rationale


- Based on the previous conclusions, it is reasonable to concentrate the analysis on the tail of the pandemic. In practice, the tail interval means all the days in June-August.


- European countries differ greatly in how well they were able to keep the epidemic under control. Some countries were very successful but some had difficulties. Therefore, European countries are divided into three groups based primarily on their smoothened death data $\bar{D}_{t, c}$, but also on their smoothened infected data $\bar{C}_{t, c}$, on the tail interval
    - Group 1: countries where the epidemic was under control during the whole summer
    - Group 2: countries where the epidemic suddenly escalated during summer 
    - Group 3: countries with too much noise
        - Indeed, countries in group 3 will not be analysed later! 


- Regions inside European countries provide much more fine-grained data than the countries themselves. Indeed, the success against the epidemic may have varied strongly inside a country! 
    - It is natural to assume all the regions inside group 1 -countries had the epidemic under control
        - If the epidemic escalated in one region inside the country, this information would have appeared in countrywise infected and death data as well!
    - However, inside group 2 -countries there are regions where the epidemic escalated and there are regions where it did not!


- Analysing European regions this way has two main benefits
    1. Both infected and death data are taken into account, but the emphasis is on the death data.
    2. Regional data is not easily available. This approach is a shortcut for finding the regions for which infected data has to be searched manually!

### 3.2. An exact description how countries are divided in groups


- European countries are divided into groups based on two statements, one additional condition and the definition of significance level.


##### Definition: Significance level

- A country $c$'s significance level: $\alpha_c = \max(c_1, c_2 \cdot p_c)$ $\quad  || \: c_1 = 5, c_2 = 1$
    - The population of country $c$ in millions, $p_c \in \mathbb{N}$, is a constant as well
    - Intuitively, if a country's smoothened deaths are under its significance level, then there are very few deaths in the country
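
As a small worked example of the definition, the significance level can be written as a one-line function. The function name and the two population values used below are illustrative assumptions.

```python
def significance_level(population_in_millions, c_1=5, c_2=1):
    """alpha_c = max(c_1, c_2 * p_c) with the constants c_1 = 5 and c_2 = 1."""
    return max(c_1, c_2 * population_in_millions)


# A small country is bounded from below by c_1, a large one scales with p_c
print(significance_level(2))   # hypothetical country with 2 million inhabitants
print(significance_level(83))  # hypothetical country with 83 million inhabitants
```

The lower bound $c_1$ prevents very small countries from getting a threshold so low that a single death would exceed it.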


##### Statement 1: The epidemic was under control at some point during summer 


- A country $c$ had the epidemic under control at some point during summer if there is a day $t$ for which     
    - $\exists t' \geq t$ s.t. 
        - $\bar{D}_{t', c} < \text{max}(\alpha_c, c_3 \cdot \bar{D}_{t, c})$  $\quad  \text{|| Smoothened deaths must be halved or below the significance level,  } c_3= \frac{1}{2}$
        - AND $\bar{D}_{\text{max}(t,t'' - w), c} \geq \bar{D}_{t'', c}, \forall t'' \in \{ t, t+1, \dots, t'\}$ $\quad  \text{|| There must be constantly a decreasing trend}$


- If a country had the epidemic under control, let's denote the first day when the epidemic was under control with $t_{\text{con}}$.
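
Statement 1 can be sketched as a small search over the smoothened death series of one country. This is a compact illustration under the definitions above, not a replacement for the implementation in section 3.3; the function name and the example series are assumptions.

```python
def find_control_day(D, alpha, w=7, c_3=0.5):
    """Return the first day t that satisfies statement 1, or None.

    D is the list of smoothened deaths on the tail interval and alpha is
    the significance level of the country."""
    for t in range(len(D)):
        for t_prime in range(t, len(D)):
            # Trend condition: the value one window earlier must not be smaller
            if D[max(t, t_prime - w)] < D[t_prime]:
                break  # the decreasing trend is broken, so t is rejected
            # Threshold condition: halved or below the significance level
            if D[t_prime] < max(alpha, c_3 * D[t]):
                return t
    return None


print(find_control_day([20, 18, 15, 12, 9, 8], alpha=5))
print(find_control_day([10, 12, 14], alpha=5))
```

In the first series the deaths fall from 20 to 9, which is below half of the starting value, so day 0 qualifies. The second series only increases and therefore has no control day.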


##### Statement 2: After the epidemic was under control, the smoothened deaths increased



- Firstly, the country must fulfil statement 1 for statement 2 to be true.


- Secondly, the smoothened deaths start to escalate in the country on a day $t > t_{\text{con}}$ if
    - $\exists t' > t$ s.t. 
        - $\bar{D}_{t', c} > \text{max}(\alpha_c, c_4 \cdot \bar{D}_{t, c})$   $\quad  \text{|| Smoothened deaths must be doubled and above the significance level,  } c_4= 2$
        - AND $\bar{D}_{\text{max}(t,t'' - w), c} \leq \bar{D}_{t'', c}, \forall t'' \in \{ t, t+1, \dots, t'\}$ $\quad  \text{|| There must be constantly an increasing trend}$
        - AND $\bar{D}_{t'', c} \geq c_5 , \forall t'' \in \{ t, t+1, \dots, t'\}$ $\quad  \text{|| If there aren't enough deaths, the data is too noisy to analyse trends, } c_5= 5$


- Let's denote the first day when the smoothened deaths started to escalate with $t_{D, \text{esc}}$.


##### Additional condition to the statement 2: After the epidemic was under control, the smoothened deaths did not only increase but they increased fast enough


- Furthermore, if $(t' - t) \leq c_6 = 14$, the epidemic is said to escalate rapidly in country $c$.
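
Statement 2 and its additional condition can be sketched in the same style. The function name, the return format and the example series are assumptions made for this illustration.

```python
def find_escalation_day(D, t_con, alpha, w=7, c_4=2, c_5=5, c_6=14):
    """Return (t_esc, escalated_rapidly) for the first day t > t_con that
    satisfies statement 2, or None if there is no such day."""
    for t in range(t_con + 1, len(D)):
        for t_prime in range(t, len(D)):
            # Reject t if the increasing trend breaks or deaths drop below c_5
            if D[max(t, t_prime - w)] > D[t_prime] or D[t_prime] < c_5:
                break
            # Doubled and above the significance level
            if D[t_prime] > max(alpha, c_4 * D[t]):
                return t, (t_prime - t) <= c_6
    return None


print(find_escalation_day([3, 4, 6, 8, 10, 13, 20], t_con=0, alpha=5))
```

In this made-up series day 1 is rejected because its value is below $c_5$, while day 2 qualifies: the value 6 has more than doubled three days later, which is within $c_6 = 14$ days.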



#### Dividing countries and regions in groups

- A country is in 
    - Group 1, if it fulfils statement 1 but not statement 2
    - Group 2, if it fulfils statements 1 and 2 and the additional condition to statement 2
    - Group 3, otherwise
    

- Furthermore, a region is in 
    - Group 1, if it belongs to a Group 1 -country
    - Group 2, if it belongs to a Group 2 -country
    - Group 3, if it belongs to a Group 3 -country
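
The grouping rule for countries can be summarized as a small decision function. The function and argument names are hypothetical; the three flags stand for statement 1, statement 2 and the additional condition to statement 2.

```python
def assign_group(has_control_day, escalated, escalated_rapidly):
    """Map the outcome of the statements to group 1, 2 or 3."""
    if has_control_day and not escalated:
        return 1  # under control during the whole interval
    if has_control_day and escalated and escalated_rapidly:
        return 2  # under control at first, then a rapid escalation
    return 3      # everything else is treated as too noisy


print(assign_group(True, False, False))
print(assign_group(True, True, True))
print(assign_group(True, True, False))
```

Note that a country whose deaths escalated, but not rapidly, ends up in group 3 under this rule.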
    
    
#### For group 2 countries, the date when the epidemic started to escalate is determined manually based on their infected data

### 3.3. A practical implementation for dividing the countries in groups

#### Constants

In [None]:
c_1 = 5
c_2 = 1
c_3 = 0.5
c_4 = 2
c_5 = 5
c_6 = 14

#### Define the significance level for each country

In [None]:
# Loop over each country
for c in european_countries:
    
    # Get the p_c from the dataframe
    current_population_in_millions = df_countries.loc[df_countries['country'] == c, 'population_in_millions'].tolist()[0]
    
    # Add the significance level to the dataframe
    df_countries.loc[df_countries['country'] == c, 'significance_level'] = max(c_1, c_2 * current_population_in_millions)

#### Check the statement 1

In [None]:
countries_with_control_point = []

# Loop over each country 
for c in european_countries: 

    # The significance level alpha_c 
    alpha_c = df_countries.loc[df_countries['country'] == c, 'significance_level'].tolist()[0] 

    # Smoothened deaths of the summer interval of the country c
    D_c = df_days_by_countries[(df_days_by_countries['country'] == c)
                & (df_days_by_countries['date'] >= tail_start_date)
                & (df_days_by_countries['date'] <= observations_end_date)]['new_deaths_smooth'].tolist()

    # The whole tail will be scanned and found out if there exists a control day at some point 
    t_con_found = False 
    t_con = 0 
    t = 0 


    # Loop over each day in the tail interval as long as t_con is not found
    while (t < tail_interval_len) and (t_con_found == False):

        # Let's loop as long as t can be the control day t_con
        t_possibly_t_con = True

        # The value for smoothened deaths
        D_t_c = D_c[t]

        # Set t' to be t at first and then start increasing it one by one
        t_prime = t


        # Loop as long as t_prime did not exceed the summer interval and t can be possibly t_con and
        # a suitable t_con is not yet found
        while (t_prime < tail_interval_len) and t_possibly_t_con and (t_con_found == False):

            # The value for smoothened deaths
            D_t_prime_c = D_c[t_prime]

            # The compared value 
            D_value_compared = D_c[max(t, t_prime - w)]

            if D_value_compared < D_t_prime_c:
                t_possibly_t_con = False

            if t_possibly_t_con and (D_t_prime_c < max(alpha_c, c_3 * D_t_c)):

                t_con_found = True
                t_con = t

                # Add value to the dataframe
                df_countries.loc[df_countries['country'] == c, 'control_point'] = tail_start_date + datetime.timedelta(t)

                # Add the current country to countries which have a control point
                countries_with_control_point.append(c)
                
            t_prime += 1

        t += 1

#### Check the statement 2 and the additional condition for it

In [None]:
countries_epidemic_escalated = []
countries_deaths_escalated_rapidly = []

# Loop over each country 
for c in countries_with_control_point: 

    # The significance level alpha_c 
    alpha_c = df_countries.loc[df_countries['country'] == c, 'significance_level'].tolist()[0] 

    # The control point of a country 
    control_point = df_countries.loc[(df_countries['country'] == c), 'control_point'].tolist()[0] 

    # Smoothened deaths of the summer interval of the country c
    D_c = df_days_by_countries[(df_days_by_countries['country'] == c)
                & (df_days_by_countries['date'] >= tail_start_date)
                & (df_days_by_countries['date'] <= observations_end_date)]['new_deaths_smooth'].tolist()

    # The whole tail will be scanned and found out if there exists an escalation day at some point 
    t_esc_found = False 
    t_esc = 0 # Initialize just some value

    t = (control_point - tail_start_date).days + 1


    # Loop over each day in the tail interval as long as t_esc is not found
    while (t < tail_interval_len) and (t_esc_found == False):

        # Let's loop as long as t can be the escalation day t_esc
        t_possibly_t_esc = True

        # The value for smoothened deaths 
        D_t_c = D_c[t]

        # Set t' to be t at first and then start increasing it one by one
        t_prime = t


        # Loop as long as t_prime did not exceed the tail interval and t can be possibly t_esc and 
        # a suitable t_esc is not yet found
        while (t_prime < tail_interval_len) and t_possibly_t_esc and (t_esc_found == False): 

            # The value for smoothened deaths 
            D_t_prime_c = D_c[t_prime]

            # The compared value 
            D_value_compared = D_c[max(t, t_prime - w)]

            if D_value_compared > D_t_prime_c or D_t_prime_c < c_5:
                t_possibly_t_esc = False


            if t_possibly_t_esc and (D_t_prime_c > max(alpha_c, c_4 * D_t_c)):
            #if t_possibly_t_esc and (D_t_prime_c > max(10, 2 * D_t_c)):
                t_esc_found = True
                t_esc = t

                # Add value to the dataframe
                df_countries.loc[df_countries['country'] == c, 'escalation_point_deaths'] = tail_start_date + datetime.timedelta(t)

                countries_epidemic_escalated.append(c)
                
                # Check if the epidemic escalated fast enough, i.e. (t' - t) <= c_6
                if t_prime - t <= c_6:
                    df_countries.loc[df_countries['country'] == c, 'deaths_escalated_rapidly'] = "yes"
                    countries_deaths_escalated_rapidly.append(c)
                else:
                    df_countries.loc[df_countries['country'] == c, 'deaths_escalated_rapidly'] = "no"

            t_prime += 1

        t += 1

#### Based on the statements, determine which group each country, and hence each region, belongs to

In [None]:
# Add group 1 countries to the dataframe
group_1_countries = list(set(countries_with_control_point) - set(countries_epidemic_escalated))
df_countries.loc[df_countries['country'].isin(group_1_countries), ['group']] = 1

# Add group 2 countries to the dataframe
group_2_countries = countries_deaths_escalated_rapidly
df_countries.loc[df_countries['country'].isin(group_2_countries), ['group']] = 2

# Add group 3 countries to the dataframe
group_3_countries = list(set(european_countries) - set(group_1_countries) - set(group_2_countries))
df_countries.loc[df_countries['country'].isin(group_3_countries), ['group']] = 3

#### For group 2 countries, the date when infections started to escalate is searched manually

In [None]:
df_countries.loc[(df_countries['country'] == 'Serbia'), 'escalation_point_infections'] = datetime.datetime(2020, 6, 17, 0, 0)

#### A TEMPORARY CELL which moves Netherlands and Spain into group 2

- Netherlands is not yet in group 2 because its death data is not out yet!

In [None]:
df_countries.loc[(df_countries['country'] == 'Netherlands'), 'group'] = 2

df_countries.loc[(df_countries['country'] == 'Netherlands'), 'escalation_point_deaths'] = observations_end_date
df_countries.loc[(df_countries['country'] == 'Netherlands'), 'deaths_escalated_rapidly'] = "yes"
df_countries.loc[(df_countries['country'] == 'Netherlands'), 'escalation_point_infections'] = datetime.datetime(2020, 7, 16, 0, 0)

In [None]:
df_countries.loc[(df_countries['country'] == 'Spain'), 'group'] = 2

df_countries.loc[(df_countries['country'] == 'Spain'), 'escalation_point_deaths'] = observations_end_date
df_countries.loc[(df_countries['country'] == 'Spain'), 'deaths_escalated_rapidly'] = "yes"
df_countries.loc[(df_countries['country'] == 'Spain'), 'escalation_point_infections'] = datetime.datetime(2020, 7, 16, 0, 0)

In [None]:
df_countries

#### Plot each country's infections and deaths and possibly its control and escalation days with explanations

In [None]:
# Loop over each country
for i in range(num_countries):
    
    # Define the current country, a temporary dataframe of the country and x-axis (dates)
    current_country = european_countries[i]
    current_group = df_countries.loc[(df_countries['country'] == current_country), 'group'].tolist()[0]
    print('\033[1m' + current_country + ': GROUP ' + str(current_group) + '\033[0m')
    
    df_current = df_days_by_countries[(df_days_by_countries['country'] == current_country)
                                    & (df_days_by_countries['date'] >= tail_start_date)]
    x = df_current['date'].tolist() 
    
    # Define y-components which are going to be plotted in one figure
    y_infected = df_current['new_infections'].tolist()
    y_deaths = df_current['new_deaths'].tolist()

    # Define the figure and the two y-axes (infections on the left, deaths on the right)
    fig, host = plt.subplots(figsize=(26, 6))
    fig.subplots_adjust(right=0.75)
    par1 = host.twinx()

    # Plot the infection and death data
    p1, = host.plot(x, y_infected, marker = 'o', linestyle='', color = "red", label="New infections ()" )
    p2, = par1.plot(x, y_deaths, marker = 'o', linestyle='', color = "black", label="New deaths ()") 

    # Define the texts
    host.set_ylabel("New infections ()")
    par1.set_ylabel("New deaths ()")

    # Text on the axis with the correct color
    host.yaxis.label.set_color(p1.get_color())
    par1.yaxis.label.set_color(p2.get_color())

    # Make little spikes for different y-axis
    tkw = dict(size=30, width=1.6)
    host.tick_params(axis='y', colors=p1.get_color(), **tkw)
    par1.tick_params(axis='y', colors=p2.get_color(), **tkw)
    host.tick_params(axis='x', **tkw)

    # Find out the potential control and the escalation days
    potential_control_point = df_countries.loc[(df_countries['country'] == current_country), 'control_point'].tolist()[0]
    potential_escalation_point_deaths = df_countries.loc[(df_countries['country'] == current_country), 'escalation_point_deaths'].tolist()[0]
    potential_deaths_escalated_rapidly = df_countries.loc[(df_countries['country'] == current_country), 'deaths_escalated_rapidly'].tolist()[0]
    potential_escalation_point_infections = df_countries.loc[(df_countries['country'] == current_country), 'escalation_point_infections'].tolist()[0]
    
    # Plot them only if they exist
    if str(potential_control_point) != 'NaT':
        plt.axvline(potential_control_point, color='green')
        print('The country got the epidemic under control (green line).')
        
    if str(potential_escalation_point_deaths) != 'NaT' and potential_deaths_escalated_rapidly == 'no':
        plt.axvline(potential_escalation_point_deaths, color='black')
        print('Later, the epidemic escalated, but slowly, which is why the country is in Group 3 (black line).')
        
    if str(potential_escalation_point_deaths) != 'NaT' and potential_deaths_escalated_rapidly == 'yes':
        plt.axvline(potential_escalation_point_deaths, color='black')
        print('Later the epidemic escalated rapidly in the country (black line).')
        
    if str(potential_escalation_point_infections) != 'NaT':
        plt.axvline(potential_escalation_point_infections, color='red')
        print('The date when the epidemic escalated was determined manually from the infected data (red line).')
        
        
        
    #print('\033[0m' + description[j])
    plt.show()

## 4. Based on countries which did well against the epidemic during summer, understanding the impact of different traffic components on the spread of the epidemic

### 4.1. Definition

- Remaining traffic components: basic notation and $T_{t,r}^{R}$ and $\hat{T}$



### 4.2. Moral

- The goal of this section is to find out which traffic components (retail, grocery, park, transit stations, workplace) had to be reduced for a country to keep the epidemic under control.


- The analysis of this section is done with Google's regional data, i.e. regions inside European countries


- Traffic components of the regions of group 1 countries are plotted in this section
    - If many of these regions managed to bring the traffic of some component back to the baseline level, it indicates that this traffic component did not play a very significant role in the spread of the epidemic.
    

- EXPLAINED WITH T-hat
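
The baseline comparison described in this section can be sketched as follows. This is a minimal sketch under stated assumptions: the values are taken to be percent changes from the baseline (0 means baseline level), the regions and numbers are hypothetical toy data, and both `share_back_at_baseline` and its `tolerance` threshold are introduced here only for illustration.

```python
from statistics import mean

# Hypothetical toy data: region -> component -> daily percent change from baseline
regional_traffic = {
    "Region A": {"grocery": [-2, 1, 3], "retail": [-30, -25, -20]},
    "Region B": {"grocery": [0, 2, -1], "retail": [-10, -5, -3]},
    "Region C": {"grocery": [-15, -12, -18], "retail": [-40, -35, -30]},
}


def share_back_at_baseline(traffic, component, tolerance=5.0):
    """Fraction of regions whose mean change for `component` is at least -tolerance.

    `tolerance` is an assumed threshold (in percentage points) for
    "close to the baseline level".
    """
    region_means = [mean(series[component]) for series in traffic.values()]
    recovered = [m >= -tolerance for m in region_means]
    return sum(recovered) / len(recovered)


for component in ("grocery", "retail"):
    share = share_back_at_baseline(regional_traffic, component)
    print(f"{component}: {share:.0%} of regions back at baseline")
```

A high share for a component among group 1 regions would support the reading that this component was not decisive for keeping the epidemic under control.
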

### 4.3. Get an overview of traffic component data of different regions where the epidemic did not escalate


- These regions are heterogeneous in many ways, but the data may still give an overview of the impact of different traffic components on the epidemic

### 4.4 Conclusions

- Many regions of group 1 countries had, on average, their grocery traffic close to the baseline level and their park traffic much higher than the baseline level. This observation indicates that these traffic components did not play a significant role in the pandemic.


- It is difficult to say anything reasonable about the other traffic components based on the previous plot.

## 5. Based on countries where the epidemic escalated during the summer, understanding the impact of different traffic components on the spread of the epidemic

### 5.1. Moral


- This section is similar to section 4
    - The aim is to understand which traffic components had an impact on the spread of the epidemic
    - The analysis is also made using Google's regional data


- However, the regional analysis is now done with regions of group 2 countries. Indeed, in some of these regions the epidemic escalated and in some it did not! To find this out, it is essential to get regional infected data.
    - Getting regional infected data requires extra work because it is not as easily available as country-level data
    - In practice, regional infected data is used instead of regional death data because regional death data is not available for many countries


- Let's assume that the reason why the epidemic escalated in a region can be found from the traffic data components. There are two hypotheses for this.
    1. Hypothesis: In the regions where the epidemic started to escalate, the level of some traffic component before the escalation was higher than in the regions where there was no escalation.
        - In practice, this hypothesis explains the escalation of the epidemic poorly.
    2. Hypothesis: In the regions where the epidemic started to escalate, some traffic component increased before the escalation.
        - In practice, this approach leads to much better findings.
        
        
- AGAIN, Show with the new notations what next!!

### 5.2 Get an overview of Spanish regions' traffic and infected data

- Later other regions will be added. Now Spain is the only example!

### 5.3. Plot the traffic level before the escalation and the infected coefficients of each region

#### Hypothesis 

- In the regions where the epidemic started to escalate, the level of some traffic component was higher than in the regions where there was no escalation.
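
One way to check this hypothesis is sketched below. It is only an illustration under assumptions: the traffic values are percent changes from the baseline, the regions and numbers are hypothetical toy data, and the `window` length and the `reference_index` used for regions without an escalation are choices made here, not taken from the notebook.

```python
from statistics import mean

# Hypothetical toy data for one traffic component.
# escalation_index is the position of the escalation day, or None if no escalation.
regions = {
    "Region A": {"traffic": [-30, -28, -20, -15, -10, -5], "escalation_index": 5},
    "Region B": {"traffic": [-40, -35, -30, -25, -22, -20], "escalation_index": 4},
    "Region C": {"traffic": [-30, -30, -29, -31, -30, -30], "escalation_index": None},
}


def level_before(series, end_index, window=3):
    """Mean traffic level over the `window` days before `end_index` (exclusive)."""
    start = max(0, end_index - window)
    return mean(series[start:end_index])


def compare_levels(regions, reference_index, window=3):
    """Return (mean level in escalated regions, mean level in the other regions)."""
    escalated, not_escalated = [], []
    for data in regions.values():
        idx = data["escalation_index"]
        if idx is None:
            # No escalation: use the assumed common reference day instead
            not_escalated.append(level_before(data["traffic"], reference_index, window))
        else:
            escalated.append(level_before(data["traffic"], idx, window))
    return mean(escalated), mean(not_escalated)


escalated_mean, control_mean = compare_levels(regions, reference_index=5)
print(escalated_mean, control_mean)
```
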

#### Plots

#### Conclusions


- Indeed, this hypothesis does not explain much, at least for the Spanish regions. This observation indicates that epidemiological differences between regions are large even inside a country.

### 5.4. Plot the increase of infected and the change of traffic components

#### Hypothesis 

-  In the regions where the epidemic started to escalate, some traffic component increased before the escalation.
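
The change of a traffic component before the escalation could be measured, for example, as the difference between two consecutive time windows. The sketch below assumes percent-change-from-baseline values; the series, the `window` length and the function `change_before` are hypothetical and only illustrate the idea.

```python
from statistics import mean


def change_before(series, escalation_index, window=3):
    """Mean of the last `window` days before the escalation minus the mean
    of the `window` days preceding those. Positive means traffic increased."""
    if escalation_index < 2 * window:
        raise ValueError("not enough days before the escalation for two windows")
    recent = series[escalation_index - window:escalation_index]
    earlier = series[escalation_index - 2 * window:escalation_index - window]
    return mean(recent) - mean(earlier)


# Hypothetical toy series for one region and one traffic component
rising_traffic = [-30, -28, -26, -20, -15, -10, -5]
flat_traffic = [-20, -20, -20, -20, -20, -20, -20]

rising_change = change_before(rising_traffic, escalation_index=6)
flat_change = change_before(flat_traffic, escalation_index=6)
print(rising_change, flat_change)
```

Under the hypothesis, regions where the epidemic escalated would tend to show a positive change for the triggering component.
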

#### Plots

#### Conclusions


- There were differences between the increments of traffic components
    - Retail and public transport traffic data always increased before the epidemic escalated
    - Supermarket and workplace traffic increased before the escalation in some cases but not in others


- Indeed, retail and public transport traffic data seem to be a trigger for the escalation of the epidemic


- However, it is difficult to say anything about the supermarket and workplace data based on this subsection because there may be many traffic components which trigger the escalation of the epidemic

### 5.5. Conclusions of sections 4 and 5


- Combined assessment of how big an impact the different traffic components had on the epidemic
    - Retail: most likely had an effect
        
    - Supermarket: most likely did not have too much effect

    - Park: did not have an effect
         
    - Transit station: most likely had an effect
         
    - Workplace: may have had an effect

## 6. Getting an overview of which countries were the most successful against the epidemic

### 6.1. Plot different countries' death and traffic data

### 6.2. Conclusions


- Different European countries are very difficult to compare with one another


- Different countries should adopt very different strategies

## 7. Summary

### 7.1. The findings of this research

### 7.2. An analysis of potential mistakes

- Different traffic components correlate with one another, which adds noise
    - Example: A person uses public transport, goes to the supermarket, goes to a park and travels back home
    
    
- Clustering traffic data in components always adds noise
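
How strongly the traffic components move together can be checked with pairwise correlations. The sketch below uses a hand-written Pearson correlation on hypothetical toy series; the numbers are placeholders and not results of this research.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long, non-constant series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical toy data: daily percent change from baseline per component
components = {
    "transit": [-40, -30, -20, -10],
    "grocery": [-20, -15, -10, -5],
    "parks": [10, 50, 20, 60],
}

# Correlation for every pair of components
correlations = {
    (a, b): pearson(components[a], components[b])
    for a, b in combinations(components, 2)
}
for pair, value in correlations.items():
    print(pair, round(value, 2))
```

Pairs with a correlation close to 1 are hard to separate, so an effect attributed to one component may partly belong to the other.
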

### 7.3. Discussion 

- The results on which traffic components have the biggest impact on the spread of COVID-19 are intuitive: 
    - In places like supermarkets and pharmacies people can keep their distance from other people relatively well.
    - On the contrary, in public transport people easily come very close to other people.


- If the grocery traffic is okay but retail is problematic, it indicates that there are only some high-risk places related to retail where people should not go


- Better research would be possible with more fine-grained traffic data from Google. 
    - Of course some clustering needs to be done


- Coincidence plays a huge role when it comes to the spread of the epidemic.


- During summer, people act differently than during autumn


- Discussion about masks related to this research


- Analysis: Based on the observation of different traffic data components, these NPIs seem  to be important



- This research indicates that a suppression strategy can work in European countries for the whole epidemic until a vaccine is developed. A new normal does not have to be anything like the massive lockdowns that were seen in Europe.


- Section 8 is a reason why testing is important


### 7.4. Good articles

- https://www.medrxiv.org/content/10.1101/2020.05.28.20116129v3.full.pdf
    - NPI comparisons made between countries