# Introduction to `pandas` using Google's Covid-19 Community Mobility Data
***

Data wrangling, exploration, and visualisation with `pandas`, SC207 Computational Social Science, Sociology, University of Essex, November 2020

## 1. This Jupyter Notebook
* Hands-on tutorial on data wrangling, exploratory data analysis and visualisation with `pandas` and `seaborn`.
* Analysis of the [Google COVID-19 Community Mobility Reports](https://www.google.com/covid19/mobility/), a large, anonymised, and open data set of aggregate mobility trends tracing how global communities respond to Covid-19.
* Real-world examples and understanding of local mobility trends in the United Kingdom and Essex in comparison to other countries and counties.
* Open and reproducible research workflow.

[Getting started with `pandas`](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)
* A fast, powerful, and flexible open source tool for doing real world data analysis in Python.
* Offers a diverse range of high-performance tools for data loading, cleaning, wrangling, merging, reshaping,  and summarising.
* The go-to data science library in Python.

<img src="https://upload.wikimedia.org/wikipedia/commons/e/ed/Pandas_logo.svg" title='Pandas Logo' width="400" height="200"/>

### Dataset: Google Covid-19 Community Mobility Reports (GCMR)
* Aggregated, anonymized sets of data that protect individual privacy.
* Shows trends of human mobility over time by country and region, across different categories of places, including retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential. 
* For each place in a region, the data display the percentage change in visits for the reported date compared to a baseline day. Mobility changes are reported as a positive or negative percentage. An overview of the data from the Community Mobility Reports is provided [here](https://support.google.com/covid19-mobility/answer/9824897?hl=en&ref_topic=9822927).
* Provides an opportunity to explore how mobility trends have changed in response to non-pharmaceutical public health interventions (e.g., lockdowns, school closures) designed to reduce the spread of Covid-19.

<img src="https://www.google.com/covid19/static/reports-icon-grid.png" title='Google Covid-19 Community Mobility Data' width="400" height="200"/>

## 2. Importing `pandas` and other libraries

In [1]:
# To use pandas, we first import the pandas library via Python's import statement
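A minimal sketch of the import this cell describes. The alias `pd` is the convention used in the pandas documentation; it is assumed in the sketches that follow.

```python
# Import pandas under its conventional alias
import pandas as pd

# Print the installed version to confirm the import worked
print(pd.__version__)
```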


In [2]:
# Import other libraries we will use to analyse and visualise data
import numpy as np
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

## 3. Loading your data

Pandas supports many data file formats, including CSV, Excel, SQL, and JSON.
For details, see [How do I read and write tabular data?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html#min-tut-02-read-write)

<img src="https://pandas.pydata.org/docs/_images/02_io_readwrite.svg" width="800" height="400" >

### Loading data from the Web

In [3]:
# Covid-19 Google Community Mobility Reports (GCMR) is provided as a .csv file
# To load a .csv data file into Python/pandas, we use the read_csv pandas function
# The code below loads the most recent online version of the data
# Data file address: https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv

# Pandas represents tabular data as a DataFrame
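One way the loading step might look. So that the sketch runs offline, it reads a three-row excerpt from an in-memory string. The rows and their values are made up for illustration, and the mobility column name is assumed to match the header of the real file. In the notebook you would pass the URL instead.

```python
import io
import pandas as pd

# Address of the full data set, as given in the comments above
GCMR_URL = "https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv"

# Hypothetical excerpt standing in for the real file (values are made up)
sample_csv = io.StringIO(
    "country_region_code,country_region,sub_region_1,date,"
    "retail_and_recreation_percent_change_from_baseline\n"
    "GB,United Kingdom,,2020-03-24,-50\n"
    "GB,United Kingdom,Essex,2020-03-24,-55\n"
    "IT,Italy,,2020-03-24,-90\n"
)

# read_csv accepts a URL, a local path, or a file-like object
# and returns a DataFrame
gcmr_df = pd.read_csv(sample_csv)
# With network access: gcmr_df = pd.read_csv(GCMR_URL)

print(type(gcmr_df))
print(gcmr_df.shape)
```

Empty fields, such as the missing `sub_region_1` in country-level rows, are read as missing values (`NaN`).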


### Loading data from your local computer

In [4]:
# The same read_csv function can be used to load the file Global_Mobility_Report.csv from your computer 
# Prerequisite: the file needs to be pre-downloaded from https://www.google.com/covid19/mobility/
# Replace 'Downloads' with the actual folder in which the file is stored in your computer
# Data file on local computer '~/Downloads/Global_Mobility_Report.csv'


## 4. Pandas DataFrame

['A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns.'](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html#min-tut-01-tableoriented)

<img src="https://pandas.pydata.org/docs/_images/01_table_dataframe.svg" title='Pandas DataFrame' width="400" height="200"/>

## 5. Viewing, Describing, and Accessing your Data

### 5.1 Viewing data

In [5]:
# Show the first five rows using the method DataFrame.head()


In [6]:
# Show the last five rows using the method DataFrame.tail()   


In [7]:
# Specify the number of rows to return
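The three viewing steps above can be sketched as follows. The small DataFrame is a made-up stand-in for `gcmr_df`, used only so the example runs on its own.

```python
import pandas as pd

# Toy stand-in for gcmr_df: ten consecutive days with made-up values
gcmr_df = pd.DataFrame({
    "date": pd.date_range("2020-03-20", periods=10).strftime("%Y-%m-%d"),
    "retail_and_recreation_percent_change_from_baseline": range(-10, -60, -5),
})

first_five = gcmr_df.head()    # first five rows (the default is n=5)
last_five = gcmr_df.tail()     # last five rows
first_three = gcmr_df.head(3)  # pass n to choose the number of rows

print(first_three)
```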


### 5.2 Describing your DataFrame

In [8]:
# Accessing columns using the DataFrame.columns attribute


In [9]:
# Accessing the index using the DataFrame.index attribute


In [10]:
# Accessing the values using the DataFrame.values attribute 


In [11]:
# Type of data structure


In [12]:
# Dimensionality of a DataFrame  



In [13]:
# Use the print function to display the number of rows and columns in a DataFrame 
#print("\nThe Google COVID-19 Community Mobility Reports contain", 
#      gcmr_df.shape[0], "rows and", gcmr_df.shape[1],"columns.")

In [14]:
# Information about a DataFrame
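A short sketch of the attributes and methods listed in this subsection, run on a made-up three-row stand-in for `gcmr_df`.

```python
import pandas as pd

# Toy stand-in for gcmr_df (values are made up)
gcmr_df = pd.DataFrame({
    "country_region": ["United Kingdom", "United Kingdom", "Italy"],
    "date": ["2020-03-23", "2020-03-24", "2020-03-24"],
    "retail_and_recreation_percent_change_from_baseline": [-30.0, -50.0, -90.0],
})

print(gcmr_df.columns)  # column labels
print(gcmr_df.index)    # row labels (a RangeIndex by default)
print(gcmr_df.values)   # underlying data as a NumPy array
print(type(gcmr_df))    # the pandas DataFrame class
print(gcmr_df.shape)    # (number of rows, number of columns)
gcmr_df.info()          # column dtypes, non-null counts, memory usage
```

Note that `columns`, `index`, `values`, and `shape` are attributes (no parentheses), while `info()` is a method.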


### 5.3 Accessing columns and rows in your data

#### 5.3.1 Accessing columns
We can access columns via column name and column position.

*Accessing columns via column name*

In [15]:
# Get the country column and save it to its own variable
# The double square bracket option `[[]]` gives DataFrame


In [16]:
# Display first five rows


In [17]:
# Display the type of data structure 


In [18]:
# The single square bracket `[]` option gives Series


In [19]:
# Display the type of data structure 



In [20]:
# Accessing more than one column by using Python list syntax


In [21]:
# Display first five rows
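The difference between double and single square brackets can be sketched on a toy stand-in for `gcmr_df`. The variable names `country_df`, `country_series`, and `country_date_df` are hypothetical.

```python
import pandas as pd

# Toy stand-in for gcmr_df (values are made up)
gcmr_df = pd.DataFrame({
    "country_region_code": ["GB", "GB", "IT"],
    "country_region": ["United Kingdom", "United Kingdom", "Italy"],
    "date": ["2020-03-23", "2020-03-24", "2020-03-24"],
})

# Double square brackets: a list of column names, returns a DataFrame
country_df = gcmr_df[["country_region"]]
print(type(country_df))

# Single square brackets: one column name, returns a Series
country_series = gcmr_df["country_region"]
print(type(country_series))

# Several columns at once: pass a Python list of names
country_date_df = gcmr_df[["country_region", "date"]]
print(country_date_df.head())
```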


---

> # Try on your own—Exercise 1
Access the column `country_region_code` from the DataFrame `gcmr_df`

In [22]:
# Please write the code related to Exercise 1 in this cell   






---

*Accessing columns via column position*

In [23]:
# Accessing columns via column position



In [24]:
# Accessing a subset of rows and columns
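Position-based access is done with `DataFrame.iloc`, which takes integer positions for rows and columns. A minimal sketch on made-up data, with hypothetical variable names:

```python
import pandas as pd

# Toy stand-in for gcmr_df (values are made up)
gcmr_df = pd.DataFrame({
    "country_region": ["United Kingdom", "United Kingdom", "Italy"],
    "date": ["2020-03-23", "2020-03-24", "2020-03-24"],
    "retail_and_recreation_percent_change_from_baseline": [-30.0, -50.0, -90.0],
})

# All rows (:) of the first column (position 0), returned as a Series
first_column = gcmr_df.iloc[:, 0]

# First two rows (0:2 excludes position 2) of the first and third columns
rows_and_columns = gcmr_df.iloc[0:2, [0, 2]]
print(rows_and_columns)
```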



#### 5.3.2 Accessing rows

Rows can be accessed by label with `df.loc` and by integer position with `df.iloc`.

In [25]:
# Before accessing particular rows, let's see the names of all countries in the dataset 
# by listing all unique values in the df['country_region'] column



In [26]:
# Accessing specific rows from a DataFrame
# We are interested in the data about the United Kingdom 



In [27]:
# Display the last five rows


In [28]:
# Accessing data about multiple countries 



In [29]:
# Filter by two conditions (country and county) simultaneously
# First let's see the list of counties in the dataset



In [30]:
# Access data about UK and Essex



In [31]:
# Display the top five rows
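The row-selection steps in this subsection can be sketched with boolean conditions. The data are made up, and `gcmr_df_UK` is a hypothetical name; `gcmr_df_countries` and `gcmr_df_country_UK_county_Essex` are the names reused later in this notebook.

```python
import pandas as pd

# Toy stand-in for gcmr_df (values are made up)
gcmr_df = pd.DataFrame({
    "country_region": ["United Kingdom", "United Kingdom", "Italy", "Germany"],
    "sub_region_1": [None, "Essex", None, None],
    "retail_and_recreation_percent_change_from_baseline": [-50, -55, -90, -40],
})

# List the distinct countries in the data
countries = gcmr_df["country_region"].unique()
print(countries)

# Rows for a single country: a boolean condition inside the brackets
gcmr_df_UK = gcmr_df[gcmr_df["country_region"] == "United Kingdom"]

# Rows for several countries: isin() tests membership in a list
gcmr_df_countries = gcmr_df[
    gcmr_df["country_region"].isin(["United Kingdom", "Italy"])
]

# Two conditions at once: wrap each in parentheses and combine with &
gcmr_df_country_UK_county_Essex = gcmr_df[
    (gcmr_df["country_region"] == "United Kingdom")
    & (gcmr_df["sub_region_1"] == "Essex")
]
print(gcmr_df_country_UK_county_Essex.head())
```

The parentheses around each condition are required because `&` binds more tightly than `==` in Python.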


---

> # Try on your own—Exercise 2
Access all rows about `Greater London`

In [32]:
# Please write the code related to Exercise 2 in this cell






---

#### 5.3.3 Accessing multiple rows and columns and conditioning

In [33]:
# Let's see which UK counties had the lowest retail and recreation mobility the day after Italy went into lockdown

# Sort in decreasing order


In [34]:
# UK counties with the lowest retail and recreation mobility the day after the UK went into lockdown in March 2020


# Sort in decreasing order
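A sketch of how conditioning, column selection, and sorting might be combined with `DataFrame.loc`. The data are made up, and the date in `day_of_interest` is an assumption chosen for illustration; substitute the date you want to study.

```python
import pandas as pd

retail = "retail_and_recreation_percent_change_from_baseline"

# Toy stand-in for gcmr_df (values are made up)
gcmr_df = pd.DataFrame({
    "country_region": ["United Kingdom"] * 5,
    "sub_region_1": [None, "Essex", "Kent", "Greater London", "Essex"],
    "date": ["2020-03-24", "2020-03-24", "2020-03-24", "2020-03-24", "2020-03-23"],
    retail: [-52, -55, -50, -70, -30],
})

# Assumed date of interest (hypothetical choice for this sketch)
day_of_interest = "2020-03-24"

# Rows: UK, county-level only (sub_region_1 present), on the chosen day
# Columns: county name and the retail and recreation variable
uk_counties_on_day = gcmr_df.loc[
    (gcmr_df["country_region"] == "United Kingdom")
    & (gcmr_df["sub_region_1"].notna())
    & (gcmr_df["date"] == day_of_interest),
    ["sub_region_1", retail],
]

# Sort in decreasing order of the mobility value
uk_counties_sorted = uk_counties_on_day.sort_values(by=retail, ascending=False)
print(uk_counties_sorted)
```

With decreasing order, the counties with the largest drop in mobility appear at the bottom; use `ascending=True` to list them first.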



## 6. Summarising your data

In [35]:
# For each country, find the maximum value of visits to retail and recreation compared to baseline



In [36]:
# For each country, find the mean value of visits to retail and recreation compared to baseline


In [37]:
# Type of data structure


In [38]:
# Find mean value of mobility in retail and recreation in Italy 


In [39]:
# Group over two variables


---

> # Try on your own—Exercise 3
Find mean value of mobility in `workplace` in the `United Kingdom`

In [40]:
# Please write the code related to Exercise 3 in this cell





---

## 7. Visualising your data using `Seaborn`

In [41]:
# Differences in mobility trends across countries (for selected countries) 
# We use the DataFrame gcmr_df_countries we already created



### 7.1 Wide and Long Data Format

[Reshaping data](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html)
<img src="https://pandas.pydata.org/pandas-docs/stable/_images/reshaping_melt.png" title='Pandas DataFrame' width="600" height="300"/>

### 7.2 Visualise your data across mobility categories



In [42]:
# Visualise across all six mobility variables

# Create an object (mobility_variables) containing the names of the six mobility variables
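Reshaping from wide to long format can be sketched with `pd.melt`. The six column names below are assumed to match the header of the GCMR file (check them against `gcmr_df.columns`). The values are made up, and `gcmr_df_long`, `mobility_category`, and `percent_change_from_baseline` are hypothetical names.

```python
import pandas as pd

# Names of the six mobility variables (assumed to match the GCMR header)
mobility_variables = [
    "retail_and_recreation_percent_change_from_baseline",
    "grocery_and_pharmacy_percent_change_from_baseline",
    "parks_percent_change_from_baseline",
    "transit_stations_percent_change_from_baseline",
    "workplaces_percent_change_from_baseline",
    "residential_percent_change_from_baseline",
]

# Toy stand-in for gcmr_df_countries in wide format
gcmr_df_countries = pd.DataFrame({
    "country_region": ["United Kingdom", "Italy"],
    "date": ["2020-03-24", "2020-03-24"],
})
# Fill each mobility column with made-up values
for i, variable in enumerate(mobility_variables):
    gcmr_df_countries[variable] = [-10 * (i + 1), -12 * (i + 1)]

# Wide to long: one row per country, date, and mobility category
gcmr_df_long = pd.melt(
    gcmr_df_countries,
    id_vars=["country_region", "date"],
    value_vars=mobility_variables,
    var_name="mobility_category",
    value_name="percent_change_from_baseline",
)
print(gcmr_df_long.head())
```

The long format has one value column and one category column, which is the layout `seaborn` expects for its `hue` and `x` arguments.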




In [43]:
# Display the first five rows



In [44]:
# Differences in mobility trends across mobility variables and countries (for selected countries)
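One possible plot for this comparison is a grouped bar chart drawn with `sns.barplot` on long-format data. The data and the short category labels are made up, and the choice of a bar chart is an assumption, not necessarily the plot used in the original session. The `matplotlib.use("Agg")` line only makes the sketch run without a display and can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy long-format data (values and category labels are made up)
long_df = pd.DataFrame({
    "country_region": ["United Kingdom", "United Kingdom", "Italy", "Italy"],
    "mobility_category": ["workplaces", "parks", "workplaces", "parks"],
    "percent_change_from_baseline": [-40.0, 10.0, -60.0, -50.0],
})

fig, ax = plt.subplots(figsize=(8, 4))

# One group of bars per mobility category, one colour per country
sns.barplot(
    data=long_df,
    x="mobility_category",
    y="percent_change_from_baseline",
    hue="country_region",
    ax=ax,
)
ax.set_xlabel("Mobility category")
ax.set_ylabel("Percent change from baseline")
```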



### 7.3 Visualise your data over time
Mobility trends in the United Kingdom across mobility categories.

In [45]:
# Extract month from year-month-date
# Format datetime as Month-Year


# The output is a datetime object
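A sketch of working with dates, assuming the `date` column is read as text in year-month-day form. The column names `month` and `month_year` are hypothetical and the values are made up.

```python
import pandas as pd

# Toy stand-in for gcmr_df with dates stored as strings
gcmr_df = pd.DataFrame({
    "date": ["2020-03-24", "2020-04-01"],
    "workplaces_percent_change_from_baseline": [-40.0, -65.0],
})

# Convert the text column to datetime values
gcmr_df["date"] = pd.to_datetime(gcmr_df["date"])
print(gcmr_df["date"].dtype)

# The .dt accessor exposes date components such as the month number
gcmr_df["month"] = gcmr_df["date"].dt.month

# strftime formats each date as Month-Year; the result is text, not datetime
gcmr_df["month_year"] = gcmr_df["date"].dt.strftime("%b-%Y")
print(gcmr_df)
```

Keeping the original `date` column as datetime preserves chronological ordering, while the formatted `month_year` column is convenient for axis labels and grouping.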



In [46]:
# Select only data about the UK 


# Plot mobility trends in the UK over months and across the six mobility categories 




In [47]:
# Visualise daily mobility trends across mobility categories and countries of interest 




In [48]:
# Alternatively, the 'wide' format of the data and 'for' loop can be used to create a plot similar to the one above
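A sketch of the wide-format alternative: loop over the mobility columns and draw one line per column with `matplotlib`. The DataFrame `uk_df` is a hypothetical stand-in with made-up values and only two of the six variables. The `matplotlib.use("Agg")` line only makes the sketch run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy wide-format data for one country (values are made up)
uk_df = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-22", "2020-03-23", "2020-03-24"]),
    "workplaces_percent_change_from_baseline": [-10.0, -30.0, -55.0],
    "residential_percent_change_from_baseline": [5.0, 12.0, 20.0],
})
mobility_variables = [
    "workplaces_percent_change_from_baseline",
    "residential_percent_change_from_baseline",
]

fig, ax = plt.subplots(figsize=(8, 4))

# In wide format each variable is its own column, so draw one line per column
for variable in mobility_variables:
    ax.plot(uk_df["date"], uk_df[variable], label=variable)

ax.set_xlabel("Date")
ax.set_ylabel("Percent change from baseline")
ax.legend()
```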





### 7.4 Mobility trends in Essex, United Kingdom

In [49]:
# Reuse the 'gcmr_df_country_UK_county_Essex' DataFrame we already created
# gcmr_df_country_UK_county_Essex = gcmr_df[(gcmr_df['country_region'] == 'United Kingdom') & 
#                                 (gcmr_df['sub_region_1']=='Essex')]




## Acknowledgements
* Wes McKinney. 2017. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython.
* Daniel Chen. 2017. Pandas for Everyone: Python Data Analysis.
* Manuel Amunategui. 2020. COVID-19 Community Mobility Reports From Google and Apple - Available to All - Explore with Python. 