# Introduction

To aid public health officials in the fight against coronavirus, Google and Apple started publishing anonymized mobility reports, showing how **movement trends** across different categories of places (residential, parks, retail, etc.) and modes of transport (driving, walking, public transport) have been varying since the beginning of the outbreak. 

In this notebook I take a closer look at those two datasets, with a particular focus on European Union countries. I look for recent **changes in trends** and **whether the number of cases had a significant effect on the public's mobility trends**. Given the inherently different nature of the two datasets, I investigate whether the correlation between the two is strong enough for them to be used interchangeably. Finally, I combine part of the dataset to create an **aggregate EU28 mobility index** that **serves as a single point of reference for how mobility trends have been evolving in the European Union as a whole**.

**Before we start, a disclaimer:** Always **make sure you abide by the terms and conditions of the data you are using**.

Apple's [Terms of Use](https://www.apple.com/covid19/mobility) state that "You may use Mobility Trends Reports provided on the Site, including any updates thereto (collectively, the “Apple Data”), only for so long as reasonably necessary to coordinate a response to COVID-19 public health concerns (including the creation of public policy) while COVID-19 is defined as a pandemic by the World Health Organisation."

The goal of this notebook is to offer an alternative perspective and observations from looking at the data. It also provides ways to get the data programmatically from the respective websites, without having to download any CSVs and place them in your working directory.

Python is very widely used for data science, and this kernel could give people a starting point for generating valuable insights from the data.



# Installs and Imports


In [None]:
!pip install eurostat

In [None]:
import requests
import pandas as pd
import numpy as np
import json
from pandas.io.json import json_normalize
import io
import seaborn as sns
import matplotlib.pyplot as plt
import eurostat

# Apple Mobility Data

The data shows the relative volume of directions requests placed on Apple Maps compared to a baseline volume on 13 January 2020. 

Apple's data are updated more frequently than Google's (which we investigate in the next section), but they are less detailed. There is no split by the type of place users request directions to (train stations, parks, retail shops, etc.); instead, the data are split by mode of transport (driving, walking, public transit).

More details on the data can be found on Apple's [Mobility Trends Reports](https://www.apple.com/covid19/mobility) website.

In [None]:
quote_page = 'https://covid19-static.cdn-apple.com/covid19-mobility-data/current/v3/index.json'

# The index file tells us where the latest CSV lives; parse it once and build the URL
page = requests.get(quote_page).content
index = json.loads(page)
csv_path = index['regions']['en-us']['csvPath']
base_path = index['basePath']
apple_url = 'https://covid19-static.cdn-apple.com%s%s' %(base_path, csv_path)

response = requests.get(apple_url).content
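
The URL-building step above can be tried offline. The sketch below is only an illustration: `build_csv_url` and the contents of `sample_index` are made up for this example, shaped like the two fields (`basePath` and `regions` / `en-us` / `csvPath`) that the cell above reads from the index file.

```python
CDN_ROOT = "https://covid19-static.cdn-apple.com"

def build_csv_url(index: dict, region: str = "en-us") -> str:
    """Join the CDN root, the base path and the CSV path found in the index document."""
    return CDN_ROOT + index["basePath"] + index["regions"][region]["csvPath"]

# Hypothetical index content (values are placeholders, not real paths).
sample_index = {
    "basePath": "/example-base",
    "regions": {"en-us": {"csvPath": "/example/mobility.csv"}},
}

sample_url = build_csv_url(sample_index)
print(sample_url)
```

Keeping the URL logic in a small function makes it easy to check against a saved copy of the index file before hitting the network.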

In [None]:
pd.set_option('display.max_columns', None)
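data = pd.read_csv(apple_url)
# Reshape so that dates form the index and (region, transportation_type) pairs form the columns
data = data.set_index(['region','transportation_type']).transpose()
# The first four rows after transposing hold metadata rather than dates, so drop them
data.drop(data.index[:4], inplace=True)
data.index = pd.to_datetime(data.index)
# Make the values explicitly numeric, then fill gaps with the next available observation
data = data.astype(float)
data.bfill(inplace=True)
apple_data = data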

In [None]:
apple_data
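
To make the reshaping step easier to follow, here is a minimal sketch on a made-up table with the same layout idea (one row per region and transport type, one column per day). The region name, values and the single metadata column are invented for illustration; the real file has more metadata columns.

```python
import pandas as pd

# Made-up wide table: one row per (region, transportation_type), one column per day.
toy = pd.DataFrame({
    "geo_type": ["country/region", "country/region"],
    "region": ["Exampleland", "Exampleland"],
    "transportation_type": ["driving", "walking"],
    "2020-01-13": [100.0, 100.0],
    "2020-01-14": [103.5, None],   # a gap in the walking series
    "2020-01-15": [98.2, 97.0],
})

toy_long = toy.set_index(["region", "transportation_type"]).transpose()
toy_long = toy_long.drop(toy_long.index[:1])   # drop the single non-date metadata row
toy_long.index = pd.to_datetime(toy_long.index)
toy_long = toy_long.astype(float).bfill()      # numeric values, gaps filled from the next day
print(toy_long)
```

The result has a `DatetimeIndex` and two-level columns, so a whole country can be selected with `toy_long[['Exampleland']]`, which is the pattern used for plotting in this notebook.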

In [None]:
# Let's plot one of the countries to get an idea of the data. 
sns.set()
pd.set_option('display.max_columns', None)
apple_data[['United Kingdom']].plot()
plt.show()

Notice that there is a lot of weekly seasonality in our dataset, especially in the walking directions category. 

We can filter this weekly seasonality out by using a 7-day moving average.

In [None]:
apple_data[['United Kingdom']].rolling(7).mean().plot()
plt.show()
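
Why a 7-day window in particular? A small sketch on synthetic data (a flat level with a made-up weekend dip) shows the idea: every 7-day window contains each weekday exactly once, so a repeating weekly pattern averages out.

```python
import numpy as np
import pandas as pd

# Synthetic daily index: flat level of 100 with a made-up weekend dip.
dates = pd.date_range("2020-01-13", periods=28, freq="D")   # starts on a Monday
weekly_pattern = np.array([0, 0, 0, 0, 0, -20, -30])
raw = pd.Series(100 + np.tile(weekly_pattern, 4), index=dates)

# Each 7-day window holds every weekday once, so the weekly pattern cancels out.
smoothed = raw.rolling(7).mean()

print(raw.std())                 # clearly non-zero: the raw series dips every weekend
print(smoothed.dropna().std())   # close to zero: the smoothed series is flat
```

Note that the first six values of the smoothed series are missing, because a full 7-day window is not yet available.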

# Google Mobility Data

As discussed before, the Google dataset is less frequently updated but more detailed. 

It shows mobility trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential. More details on the data can be found on the [Google Covid-19 Mobility Reports](https://www.google.com/covid19/mobility/) website.

In [None]:
google_url = 'https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv'
response = requests.get(google_url).content

In [None]:
data = pd.read_csv(google_url, low_memory=False)
# Renaming the columns
rename_dict = {'date':'Date',
               'country_region_code':'Country Code',
               'country_region':'Country_Region',
               'sub_region_1':'Subregion 1',
               'sub_region_2':'Subregion 2',
               'retail_and_recreation_percent_change_from_baseline': 'Retail and Recreation',
               'grocery_and_pharmacy_percent_change_from_baseline': 'Grocery and Pharmacy',
               'parks_percent_change_from_baseline': 'Parks',
               'transit_stations_percent_change_from_baseline': 'Transit Stations',
               'workplaces_percent_change_from_baseline': 'Workplaces',
               'residential_percent_change_from_baseline': 'Residential'}
data.rename(columns = rename_dict, inplace=True)
data = data.set_index('Date')
data.index = pd.to_datetime(data.index)
data

In [None]:
# Let's plot a country to again quickly inspect the data

country = 'United Kingdom'
country_data = data.query('Country_Region == @country')
country_data[country_data['Subregion 1'].isnull()].drop(['Subregion 1', 'Subregion 2'], axis=1).rolling(7).mean().plot()
plt.legend(loc="best")
plt.axhline(y=0, color='black', linestyle='-')
plt.title(country+': Percentage change from baseline')
plt.show()

On the back of lockdown measures in most countries around the world, visits to retail stores, transit stations (train stations, airports, etc.) and workplaces have fallen sharply, while activity in residences has picked up as people were ordered to stay indoors. For many countries, activity in outdoor public spaces like parks has also increased, as people started exercising outside and governments allowed citizens to meet people from other households outdoors, as long as they followed social distancing guidelines.

## Looking at trends across the European Union

In [None]:
# Note: While the UK is officially no longer part of the European Union during the transition period, I nonetheless 
# included it due to the large amount of coronavirus cases it has, giving us more scope for comparison. Thus I use the old EU28 country standard.

eu28_countries = ['Austria','Belgium','Bulgaria',
             'Croatia','Cyprus','Czech Republic',
             'Denmark','Estonia','Finland',
             'France','Germany','Greece',
             'Hungary','Ireland','Italy',
             'Latvia','Lithuania','Luxembourg',
             'Malta','Netherlands','Poland',
             'Portugal','Romania','Slovakia',
             'Slovenia','Spain','Sweden',
             'United Kingdom']

eu28 = data.query('Country_Region in @eu28_countries')
categories = ['Retail and Recreation','Workplaces','Transit Stations','Parks','Residential']
# Country-level rows only, smoothed with a 7-day moving average; keep the latest value per country
latest_by_country = (eu28[eu28['Subregion 1'].isnull()][['Country_Region'] + categories]
                     .groupby('Country_Region').rolling(7).mean()
                     .groupby('Country_Region').last()
                     .sort_values(by='Retail and Recreation'))
latest_by_country.plot.bar(figsize=(5,9), subplots=True, legend = False)
plt.suptitle('Google Mobility: Percentage change from baseline')
plt.xlabel('Country')
plt.show()

## Google Aggregate Mobility Index

We combine the mobility data for Workplaces, Retail and Recreation, and Transit Stations to create an aggregate mobility index per EU country. 

We do not include Residential: time spent at home rose while the other categories fell, so averaging it in would push our index to a higher level than the drop in out-of-home activity warrants.

We also do not include parks as trends there are heavily dependent on each government's guidance and our focus is on the economic impact of the mobility trends (hence workplaces, retail and transit are more important). 

In [None]:
agg_mobility_index = data[data['Subregion 1'].isnull()].query('Country_Region == @eu28_countries').drop(['Subregion 1', 'Subregion 2'], axis=1) \
                    .groupby('Country_Region').rolling(7).mean()[['Retail and Recreation','Transit Stations','Workplaces']].mean(axis=1).unstack(level=0)
agg_mobility_index.rename_axis("Country", axis="columns", inplace=True)

# Plot for the Big 4 (Four biggest EU economies by GDP: Germany, France, Italy, Spain), 
# the United Kingdom (which implemented social distancing measures later than most other EU countries) 
# and Greece (which was one of the first countries to impose significant social distancing measures).


agg_mobility_index[['Germany','France','Italy','Spain','Greece','United Kingdom']].plot()
plt.axhline(y=0, color='black', linestyle='-')
plt.title('Aggregate Mobility Index*')
plt.annotate('Note: *Average of Retail and Recreation, Transit Stations, Workplaces', (0,0), (0, -50), xycoords='axes fraction', textcoords='offset points', va='top',fontsize=8)
plt.show()
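
The index itself is just a row-wise average of three categories. A minimal sketch on made-up numbers for a single country (the values are invented, and the 7-day smoothing is left out to keep it short):

```python
import pandas as pd

# Made-up percentage changes from baseline for one country over three days.
toy_google = pd.DataFrame(
    {
        "Retail and Recreation": [-60.0, -50.0, -40.0],
        "Transit Stations": [-70.0, -60.0, -50.0],
        "Workplaces": [-50.0, -40.0, -30.0],
        "Residential": [20.0, 18.0, 15.0],   # deliberately left out of the index
    },
    index=pd.to_datetime(["2020-04-01", "2020-04-02", "2020-04-03"]),
)

index_columns = ["Retail and Recreation", "Transit Stations", "Workplaces"]
toy_index = toy_google[index_columns].mean(axis=1)   # simple average across categories
print(toy_index)
```

Each category gets equal weight here, as in the cell above; a different weighting would be an equally defensible choice.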

# Back to Apple Mobility Data

Upon looking at the data, a natural question arises: do the two datasets tell us the same thing about mobility trends in each country, despite using different measures to track activity? Both datasets compare activity levels relative to a baseline level, but that baseline is defined differently in each dataset.

Apple uses - as described above - 13 January as its baseline, while Google uses "the median value for that day of the week from the 5‑week period Jan 3 – Feb 6, 2020". 

Let's try and transform the Apple data using Google's thinking. Given we are using a 7-day moving average in our calculations, calculating the median value for each day of the week for the above 5-week period is unnecessary as we are already filtering out the weekly seasonality. 
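
For comparison only, this is roughly what a day-of-week median baseline would look like if we did want to compute it. It is a sketch on synthetic data with a made-up weekday pattern, not a reproduction of Google's actual pipeline, and it is not used in the rest of this notebook.

```python
import pandas as pd

# Synthetic daily activity with a made-up weekday pattern (not real data).
dates = pd.date_range("2020-01-03", "2020-02-20", freq="D")
by_weekday = {0: 100, 1: 102, 2: 104, 3: 106, 4: 110, 5: 80, 6: 70}   # Monday = 0
activity = pd.Series([by_weekday[d.dayofweek] for d in dates], index=dates, dtype=float)
activity.loc["2020-02-07":] *= 0.5   # pretend activity halves after the baseline window

# Median per weekday over the baseline window (3 January to 6 February 2020).
window = activity.loc["2020-01-03":"2020-02-06"]
weekday_baseline = window.groupby(window.index.dayofweek).median()

# Percentage change of each day relative to the baseline for the same weekday.
baseline_for_day = activity.index.dayofweek.map(weekday_baseline).to_numpy()
pct_change = (activity / baseline_for_day - 1) * 100
print(pct_change.tail())
```

Because each day is compared with the same weekday's baseline, the weekly pattern disappears without any smoothing, which is the alternative to the moving-average shortcut taken here.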

For that reason, we instead calculate the median index value up to 6 February (note that the Apple data start on 13 January).

In [None]:
pre_covid_median = apple_data.loc[:'2020-02-06',(eu28_countries,['driving'])].median(axis=0).droplevel('transportation_type',axis=0) 
pre_covid_median

Now that we have the median values for each EU28 country, let's apply the transformation by dividing each data point in a country's series by that country's median value, then expressing the result as a percentage change (multiplying by 100 and subtracting 100).

Not all of the EU28 countries have data for all the different methods of transportation, so we only use driving.

In [None]:
apple_agg_mobility_index = apple_data.loc[:,(eu28_countries,['driving'])].rolling(7).mean().apply(lambda x: x/pre_covid_median[x.name[0]]*100-100,axis=0).droplevel('transportation_type',axis=1)
apple_agg_mobility_index.rename_axis("Country", axis="columns", inplace=True)

apple_agg_mobility_index[['Germany','France','Italy','Spain','Greece','United Kingdom']].plot()
plt.title('Apple Aggregate Mobility Index')
plt.axhline(y=0, color='black', linestyle='-')
plt.show()
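
The rebasing arithmetic is easier to see on a handful of made-up numbers. In this sketch the series, the three-day "pre-Covid" window and the resulting median are all invented for illustration, and the 7-day smoothing is skipped.

```python
import pandas as pd

# Made-up Apple-style driving index for one country (100 = volume on the first day).
dates = pd.date_range("2020-01-13", periods=6, freq="D")
toy_driving = pd.Series([100.0, 110.0, 120.0, 60.0, 55.0, 66.0], index=dates)

# Median over a hypothetical pre-Covid window (the first three days here).
toy_median = toy_driving.loc[:"2020-01-15"].median()

# Express each value as a percentage change from that median.
toy_rebased = toy_driving / toy_median * 100 - 100
print(toy_rebased.round(1))
```

A value equal to the median maps to 0, and a value at half the median maps to -50, which puts the Apple series on the same "percentage change from baseline" scale as the Google data.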

I decided to create a plot with the Big 4 (the biggest four European economies when measured by their GDP - Germany, France, Italy, Spain), a country with a large number of coronavirus cases (United Kingdom) and a country with a small number of cases (Greece). 

Notice how activity levels in the UK never reached the lows seen in Spain and have recovered by more than in Spain. Also note how our mobility index shows activity (to be precise, driving activity) has already recovered close to its pre-Covid-19 level.

# Merging the two datasets

How closely related are our two datasets? Can they be used interchangeably, given that one (Apple) is more frequently updated while the other (Google) is more detailed?

In [None]:
d = {'Google' : agg_mobility_index, 'Apple' : apple_agg_mobility_index}
mobility_index_all = pd.concat(d.values(), axis=1, keys=d.keys())

specific_country = 'Greece'
mobility_index_all.loc[:,(slice(None),specific_country)].plot()
plt.legend(title=specific_country,labels=['Google','Apple'])
plt.title('Aggregate Mobility Indices')
plt.axhline(y=0, color='black', linestyle='-')
plt.show()

Plotting a couple of countries suggests there is comovement between the two series.

As a first step, let's label the axes of the combined dataframe so that we can keep track of all our series in one place.

In [None]:
mobility_index_all.rename_axis(['Indicator','Country'], axis="columns", inplace=True)
mobility_index_all.rename_axis('Date', axis="index", inplace=True)

pd.set_option('display.max_columns', 20)
mobility_index_all.tail(10)

Now that we have all series combined into one dataframe, let's compare across all EU28 countries.

In [None]:
latest = mobility_index_all.stack().groupby('Country').last()
axes = latest.sort_values('Google').plot.bar(subplots=True,width=1,figsize=(8,5),legend=False)
plt.suptitle('Aggregate Mobility Indices (Latest)')
for ax in axes.flatten():
    ax.axhline(0.0001, color='k', linestyle='-')
plt.show()

In [None]:
month_ago = mobility_index_all.stack(level=1).reset_index().set_index('Date').loc['2020-04-30',:].reset_index().set_index('Country').drop('Date',axis=1)
latest.subtract(month_ago).sort_values('Google').plot.bar(subplots=True,width=1,figsize=(8,5),legend=False)
plt.suptitle('Aggregate Mobility Indices (Change since a month ago)')
plt.show()

Is there a more formal way to check whether there is a linear relationship between the two indices? Let's start with a scatter plot.

In [None]:
# Suppress warnings
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')

latest = mobility_index_all.stack(level=1).reset_index().set_index('Date').groupby('Country').last()
latest.plot(kind='scatter',x='Google',y='Apple',figsize=(10,10))

for i, txt in enumerate(latest.index.values):
    # Use .iloc for positional access, since the index holds country names
    plt.annotate(txt, (latest['Google'].iloc[i], latest['Apple'].iloc[i]),xytext=(5,-3), 
                textcoords='offset points',fontsize=12)
plt.title('Aggregate Mobility Indices by Country (Latest)')
plt.show()

We can see there appears to be some sort of linear relationship between our two series, as common sense dictates: a lower number of direction requests (which can be seen in the Apple dataset) should mean that people in general spend less time visiting different places (which should be reflected in the Google mobility index). 

Can we make these observations more mathematically formal? Yes, by estimating a linear regression model for each country.


# Regression Analysis

Note that our data are not cross-sectional (observed at a single point in time) but rather a time series. As such, they exhibit trending behaviour, which can make our regression results misleading.

Without getting too much into the details, to counteract this we can run our regression with day-on-day changes rather than absolute levels as inputs (which we do using the diff() function below). 
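
To illustrate the concern, here is a small sketch on synthetic data: two series that share nothing except an upward trend look almost perfectly correlated in levels, but not in day-on-day changes. The series, slopes and random seed are made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(200)

# Two synthetic series with independent noise and a common upward trend.
levels = pd.DataFrame({
    "a": 1.0 * t + rng.normal(size=t.size),
    "b": 2.0 * t + rng.normal(size=t.size),
})

corr_levels = levels["a"].corr(levels["b"])      # dominated by the shared trend
changes = levels.diff(1).dropna()                # day-on-day changes
corr_changes = changes["a"].corr(changes["b"])   # trend removed, only noise remains
print(corr_levels, corr_changes)
```

The correlation in levels is close to 1 even though the two series are unrelated, while the correlation in changes is close to 0. That is why the regression in this section is run on differences.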

We use the United Kingdom as an example for running our regression, but results are similar for the rest of the EU28 countries.

In [None]:
# Suppress warnings
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')

country = 'United Kingdom'
mobility_index_all.loc[:,(slice(None),[country])].diff(1).dropna().droplevel(1,axis=1).plot.scatter(x='Google',y='Apple')
plt.title('Aggregate Mobility Indices (DoD change)')
plt.axhline(y=0, color='black', linestyle='-')
plt.axvline(x=0, color='black', linestyle='-')
plt.show()

In [None]:
import statsmodels.api as sm

# Day-on-day changes for the chosen country, computed once and reused for both series
dod_changes = mobility_index_all.loc[:,(slice(None),[country])].diff(1).dropna().droplevel(1,axis=1)
index_Google = dod_changes['Google'].values
index_Apple = dod_changes['Apple'].values

# We estimate a simple linear regression model with an intercept term and fit the model

index_Google = sm.add_constant(index_Google)
model = sm.OLS(index_Apple,index_Google)
results = model.fit()
print(results.summary())


The intercept term is statistically insignificant (p-value = 0.95, far above the widely used 0.05 significance level), but that does not matter as the value of the intercept is very close to zero anyway.

More importantly, the beta coefficient x1 is strongly statistically significant, suggesting there is comovement between the two series (at least in the case of the United Kingdom).
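
For readers curious about what sits behind the `coef`, `std err` and `t` columns of the summary table, the sketch below recomputes them by hand with NumPy on synthetic data. The data, the slope of 0.5 and the seed are made up; this is not the UK regression itself.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic day-on-day changes: y depends on x with a made-up slope of 0.5.
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.2, size=100)

X = np.column_stack([np.ones_like(x), x])                 # intercept column + regressor
beta, *_ = np.linalg.lstsq(X, y, rcond=None)              # [intercept, slope]
residuals = y - X @ beta
sigma2 = residuals @ residuals / (len(y) - X.shape[1])    # residual variance
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / std_err                                  # coefficient / standard error
print(beta, t_stats)
```

A slope whose t-statistic is large in absolute value (roughly above 2 for samples of this size) corresponds to a small p-value, which is what "statistically significant" refers to above.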



# Comparing with the number of cases

Do countries with a larger number of coronavirus cases show lower levels of activity? Let's investigate.

In [None]:
quote_page = 'https://api.covid19api.com/summary'
page = requests.get(quote_page).content
coronavirus_data = json.loads(page)


# next((item for item in coronavirus_data['Countries'] if item["Country"] == "United Kingdom"))

pd.set_option('display.max_rows', 10)
cases = pd.DataFrame([[item['Country'],item['TotalConfirmed']] for item in list(filter(lambda item: item['Country'] in eu28_countries, coronavirus_data['Countries']))], columns=['Country','Confirmed Cases'])
cases.set_index('Country',inplace=True)

In [None]:
google_vs_cases = pd.DataFrame([mobility_index_all.stack().groupby('Country').last()['Google'],cases['Confirmed Cases']]).transpose()
google_vs_cases.plot.scatter(x='Google',y='Confirmed Cases',figsize=(10,10))

for i, txt in enumerate(google_vs_cases.index.values):
    # Use .iloc for positional access, since the index holds country names
    plt.annotate(txt, (google_vs_cases['Google'].iloc[i], google_vs_cases['Confirmed Cases'].iloc[i]),xytext=(5,-3), 
                textcoords='offset points',fontsize=12)
plt.title('Google Mobility Indices vs Confirmed Cases')
plt.show()

# Creating an EU aggregate measure

We have data for all the individual EU28 countries. Can we combine them to create an **aggregate EU28 mobility index** that serves as a single point of reference for how mobility trends have been evolving in the European Union as a whole? 

An important issue arises. How are we going to weight each of the EU28 country series? There are many ways to do this, including using a simple average (assigning equal weight to each of the country series). I decided to go with GDP weights instead, to create an index that puts more emphasis on activity levels of countries with a larger amount of economic activity. 

The weights can be found in the Eurostat website [here](https://appsso.eurostat.ec.europa.eu/nui/show.do?query=BOOKMARK_DS-406763_QID_2E34A8C6_UID_-3F171EB0&layout=UNIT,L,X,0;TIME,C,X,1;GEO,L,Y,0;NA_ITEM,L,Z,0;INDICATORS,C,Z,1;&zSelection=DS-406763INDICATORS,OBS_FLAG;DS-406763NA_ITEM,B1GQ;&rankName1=INDICATORS_1_2_-1_2&rankName2=NA-ITEM_1_2_-1_2&rankName3=UNIT_1_2_0_0&rankName4=TIME_1_0_1_0&rankName5=GEO_1_2_0_1&rStp=&cStp=&rDCh=&cDCh=&rDM=true&cDM=true&footnes=false&empty=false&wai=false&time_mode=NONE&time_most_recent=false&lang=EN&cfo=%23%23%23%2C%23%23%23.%23%23%23).



In [None]:
gdp_data = eurostat.get_data_df('nama_10_gdp', flags=False)
gdp_data = gdp_data[(gdp_data['na_item']=='B1GQ') & (gdp_data['unit']=='PC_EU28_MEUR_CP')]

gdp_proportions = gdp_data.set_index('geo\\time')[2019].dropna().drop(['EA','NO','CH','EA12','EA19','EU15','EU27_2020','EU28','AL','IS','RS'])
gdp_proportions = gdp_proportions.rename(eurostat.get_dic('geo')).rename({'Germany (until 1990 former territory of the FRG)':'Germany'})/100
gdp_proportions
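
The weighting idea can be sketched on made-up numbers. Here two hypothetical countries, "A" and "B", get invented GDP shares of 75% and 25%. Dividing by the sum of the weights is a choice made for this sketch, so the result stays on the same scale if the weights do not add up to exactly one.

```python
import pandas as pd

# Made-up mobility indices (rows: dates, columns: countries) and made-up GDP shares.
toy_mobility = pd.DataFrame(
    {"A": [-60.0, -40.0], "B": [-20.0, -10.0]},
    index=pd.to_datetime(["2020-04-01", "2020-05-01"]),
)
toy_weights = pd.Series({"A": 0.75, "B": 0.25})

# Align the weights to the columns, then take the weighted sum across countries.
aligned = toy_weights.reindex(toy_mobility.columns)
toy_weighted = toy_mobility.mul(aligned, axis=1).sum(axis=1) / aligned.sum()

toy_simple = toy_mobility.mean(axis=1)   # equal-weight alternative for comparison
print(toy_weighted, toy_simple)
```

The weighted index sits closer to country "A", the larger economy in this toy setup, than the simple average does.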

In [None]:
eu_mobility = mobility_index_all.loc[:,('Google',slice(None))].droplevel(0, axis=1)
country_names = eu_mobility.columns.values

eu_mobility_aggregate = eu_mobility.groupby(eu_mobility.index).apply(lambda x: pd.Series(sum([(x[v] * gdp_proportions[v]) for v in country_names])))


In [None]:
pd.set_option('display.max_rows', 1000)
sum([eu_mobility[v]* gdp_proportions[v] for v in country_names]).plot()
plt.suptitle('EU28 Aggregate Mobility Index')
plt.axhline(y=0, color='black', linestyle='-')
plt.show()

This index allows us to track the path of recovery across the entirety of the European Union, putting more emphasis on the recovery of activity in countries that form a larger proportion of the EU's total GDP.