# DSAA-Kulimi Rwanda Data Camp Capstone Project <br> Lockdowns Impact on Air Quality 🌍

GitHub repo: https://github.com/stoufa/Lockdowns-Impact-on-Air-Quality

In [1]:
# Enabling Line Wrapping
from IPython.display import HTML, display
def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

# Getting Data

## Air Quality Data

We'll be getting this data from the [Worldwide COVID-19 dataset](https://aqicn.org/data-platform/covid19/verify/44b4316d-6a53-46ee-8238-4e23f8cce63a) provided by the *Air Quality Open Data Platform*. Since the COVID-19 pandemic started on 31 December 2019 (according to [the WHO website](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/interactive-timeline?gclid=CjwKCAiAp8iMBhAqEiwAJb94z_u3Y189qK5wdEkJ7dxauGgErxchzHCOD4Ul4xme5SwCLbFT5KQPFxoCAG4QAvD_BwE#event-0)), we'll be focusing on the following periods:

<pre align=center>
2019Q4, 2020Q1, 2020Q2, 2020Q3, 2020Q4
</pre>
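
The quarter labels follow a regular `<year>Q<n>` pattern, so they can be generated instead of typed by hand. A minimal sketch using only the standard library; the helper name `quarter_labels` is hypothetical and not part of the original notebook:

```python
def quarter_labels(start_year, start_quarter, end_year, end_quarter):
    """Return labels such as '2019Q4' for every quarter in the inclusive range."""
    labels = []
    year, quarter = start_year, start_quarter
    while (year, quarter) <= (end_year, end_quarter):
        labels.append(f'{year}Q{quarter}')
        quarter += 1
        if quarter > 4:  # roll over into the first quarter of the next year
            year, quarter = year + 1, 1
    return labels


example_periods = quarter_labels(2019, 4, 2020, 4)
print(example_periods)
```

Widening the range later (for example to all of 2019) then only requires changing the arguments.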

In [2]:
# creates the project's folder structure if it doesn't already exist
!mkdir -p data/{air_quality,lockdowns}
!mkdir -p data/air_quality/{raw,clean}
!mkdir -p data/lockdowns/{raw,clean}

In [3]:
from pathlib import Path
import pandas as pd
from pprint import pprint as pp

In [4]:
# better way to display pandas dataframes
%load_ext google.colab.data_table

In [5]:
DATA_FOLDER = Path('data')

AIR_QUALITY_DATA_FOLDER = DATA_FOLDER / 'air_quality'
AIR_QUALITY_RAW_DATA_FOLDER = AIR_QUALITY_DATA_FOLDER / 'raw'
AIR_QUALITY_CLEAN_DATA_FOLDER = AIR_QUALITY_DATA_FOLDER / 'clean'

LOCKDOWNS_DATA_FOLDER = DATA_FOLDER / 'lockdowns'
LOCKDOWNS_RAW_DATA_FOLDER = LOCKDOWNS_DATA_FOLDER / 'raw'
LOCKDOWNS_CLEAN_DATA_FOLDER = LOCKDOWNS_DATA_FOLDER / 'clean'
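
The same folder layout can also be created from Python, which avoids relying on shell brace expansion. This is a sketch under the assumption that the layout is `<base>/{air_quality,lockdowns}/{raw,clean}`; the helper `ensure_project_folders` is hypothetical, and the demonstration runs in a temporary directory rather than in `data/`:

```python
import tempfile
from pathlib import Path


def ensure_project_folders(base):
    """Create <base>/{air_quality,lockdowns}/{raw,clean}; existing folders are kept."""
    created = []
    for dataset in ('air_quality', 'lockdowns'):
        for stage in ('raw', 'clean'):
            folder = Path(base) / dataset / stage
            # parents=True mirrors `mkdir -p`; exist_ok=True makes reruns harmless
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created


# demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    folders = ensure_project_folders(tmp)
    all_exist = all(folder.is_dir() for folder in folders)
print(len(folders), all_exist)
```

In the notebook itself, the call would presumably be `ensure_project_folders(DATA_FOLDER)`.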

In [6]:
periods = '2019Q4, 2020Q1, 2020Q2, 2020Q3, 2020Q4'.split(', ')
periods

['2019Q4', '2020Q1', '2020Q2', '2020Q3', '2020Q4']

In [7]:
# we decided to include the remaining quarters of 2019 as well
periods = '2020Q1, 2020Q2, 2020Q3, 2020Q4, 2019Q1, 2019Q2, 2019Q3, 2019Q4'.split(', ')
periods

['2020Q1',
 '2020Q2',
 '2020Q3',
 '2020Q4',
 '2019Q1',
 '2019Q2',
 '2019Q3',
 '2019Q4']

In [None]:
# move the data to its appropriate location
# !mv data/air_quality/*.csv data/air_quality/raw/

In [None]:
# remove all CSV files in data/
# !rm data/*.csv

In [8]:
for period in periods:
  # !echo {period}
  CSV_FILE_URL = f'https://aqicn.org/data-platform/covid19/report/13220-61539cf0/{period}'
  FILE_PATH = AIR_QUALITY_RAW_DATA_FOLDER / f'{period}.csv'
  if not FILE_PATH.exists():
    # if the file doesn't exist, download it
    print(f'downloading {period}.csv ...')
    !curl --compressed -o {FILE_PATH} {CSV_FILE_URL} &>/dev/null
  else:
    print(f'{period}.csv already downloaded!')

downloading 2020Q1.csv ...
downloading 2020Q2.csv ...
downloading 2020Q3.csv ...
downloading 2020Q4.csv ...
downloading 2019Q1.csv ...
downloading 2019Q2.csv ...
downloading 2019Q3.csv ...
downloading 2019Q4.csv ...
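
The skip-if-present logic of the loop above can be sketched in pure Python as well. The helper `download_if_missing` and its `fetch` parameter are hypothetical; `urllib.request.urlretrieve` is used as the default downloader, and it does not negotiate compression the way `curl --compressed` does, so whether the server responds identically is an untested assumption. The demonstration uses a fake downloader and makes no network request:

```python
import tempfile
import urllib.request
from pathlib import Path


def download_if_missing(url, file_path, fetch=urllib.request.urlretrieve):
    """Download `url` to `file_path` unless the file already exists.

    Returns True when a download happened and False when it was skipped.
    `fetch` is injectable so the logic can be exercised without network access.
    """
    file_path = Path(file_path)
    if file_path.exists():
        return False
    file_path.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, file_path)
    return True


def fake_fetch(url, file_path):
    # stand-in for a real download: writes a one-line placeholder file
    Path(file_path).write_text('Date,Country,City\n')


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'raw' / '2019Q4.csv'
    first = download_if_missing('https://example.invalid/2019Q4', target, fetch=fake_fetch)
    second = download_if_missing('https://example.invalid/2019Q4', target, fetch=fake_fetch)
print(first, second)
```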


In [9]:
# put all raw data in one zip file (to make downloading easier)
!zip raw_data.zip data/air_quality/raw/*.csv

  adding: data/air_quality/raw/2019Q1.csv (deflated 78%)
  adding: data/air_quality/raw/2019Q2.csv (deflated 78%)
  adding: data/air_quality/raw/2019Q3.csv (deflated 78%)
  adding: data/air_quality/raw/2019Q4.csv (deflated 77%)
  adding: data/air_quality/raw/2020Q1.csv (deflated 77%)
  adding: data/air_quality/raw/2020Q2.csv (deflated 78%)
  adding: data/air_quality/raw/2020Q3.csv (deflated 78%)
  adding: data/air_quality/raw/2020Q4.csv (deflated 78%)


Let's now take a look at the data.

In [10]:
!ls -alh *.zip

-rw-r--r-- 1 root root 52M Nov 19 15:22 raw_data.zip


In [None]:
TEST_AQ_DATA_FILE = AIR_QUALITY_RAW_DATA_FOLDER / '2019Q4.csv'

In [None]:
# display the first 10 lines of the data file (with line numbers)
!head {TEST_AQ_DATA_FILE} | cat -n

     1	# Daily air quality and meteorological measurementa for majors world-wide cities in 2020
     2	# By using this data you agree with the terms of service: https://aqicn.org/data-platform/tos/
     3	# For more information check: https://aqicn.org/data-platform/covid19/
     4	# Data-Set Generated on 2020-04-13T01:05:30+01:00
     5	Date,Country,City,Specie,count,min,max,median,variance
     6	2019-11-02,HU,Debrecen,o3,72,1.9,12.2,7.0,59.60
     7	2019-11-11,HU,Debrecen,o3,66,0.6,15.2,7.1,151.12
     8	2019-11-12,HU,Debrecen,o3,65,2.3,18.9,12.0,193.29
     9	2019-12-22,HU,Debrecen,o3,45,11.3,24.1,18.6,110.54
    10	2020-01-05,HU,Debrecen,o3,34,1.6,26.4,16.7,586.81


We can see that the file is a comma-separated dataset: a few comment lines starting with `#` come first, followed by a header row and the data rows.
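
To make the file layout concrete, here is a minimal sketch that reads the same structure with the standard library only. The sample text reuses a few lines from the listing above; the helper `read_rows` is hypothetical. It drops only lines that *start* with `#`, whereas the `comment='#'` option of `pd.read_csv` also cuts off the remainder of any line after a `#`:

```python
import csv
import io

SAMPLE = """\
# For more information check: https://aqicn.org/data-platform/covid19/
# Data-Set Generated on 2020-04-13T01:05:30+01:00
Date,Country,City,Specie,count,min,max,median,variance
2019-11-02,HU,Debrecen,o3,72,1.9,12.2,7.0,59.60
2019-11-11,HU,Debrecen,o3,66,0.6,15.2,7.1,151.12
"""


def read_rows(text):
    """Parse CSV text, ignoring lines that begin with '#'."""
    data_lines = (line for line in io.StringIO(text) if not line.startswith('#'))
    return list(csv.DictReader(data_lines))


rows = read_rows(SAMPLE)
print(len(rows), rows[0]['City'], rows[0]['median'])
```

Note that `csv.DictReader` keeps every value as a string; pandas infers numeric types on its own.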


In [None]:
df = pd.read_csv(AIR_QUALITY_RAW_DATA_FOLDER / '2019Q4.csv', comment='#')

In [None]:
# let's lowercase column names to make using them easier later
df.columns = df.columns.str.lower()
df.columns

Index(['date', 'country', 'city', 'specie', 'count', 'min', 'max', 'median',
       'variance'],
      dtype='object')

In [None]:
# displays the first 5 rows of the data
df.head()

Unnamed: 0,date,country,city,specie,count,min,max,median,variance
0,2019-11-02,HU,Debrecen,o3,72,1.9,12.2,7.0,59.6
1,2019-11-11,HU,Debrecen,o3,66,0.6,15.2,7.1,151.12
2,2019-11-12,HU,Debrecen,o3,65,2.3,18.9,12.0,193.29
3,2019-12-22,HU,Debrecen,o3,45,11.3,24.1,18.6,110.54
4,2020-01-05,HU,Debrecen,o3,34,1.6,26.4,16.7,586.81


In [None]:
# getting more details about the column names and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 578226 entries, 0 to 578225
Data columns (total 9 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   date      578226 non-null  object 
 1   country   578226 non-null  object 
 2   city      578226 non-null  object 
 3   specie    578226 non-null  object 
 4   count     578226 non-null  int64  
 5   min       578226 non-null  float64
 6   max       578226 non-null  float64
 7   median    578226 non-null  float64
 8   variance  578226 non-null  float64
dtypes: float64(4), int64(1), object(4)
memory usage: 39.7+ MB


In [None]:
# getting a summary of the main descriptive statistics of this data file
df.describe()

Unnamed: 0,count,min,max,median,variance
count,578226.0,578226.0,578226.0,578226.0,578226.0
mean,130.891657,101.549097,134.929109,115.861437,10128.53
std,185.690897,295.10425,296.394926,291.429274,292257.6
min,2.0,-3276.6,-3065.6,-3065.6,0.0
25%,45.0,0.9,9.7,4.0,21.61
50%,72.0,4.2,25.2,12.5,116.88
75%,154.0,22.5,85.0,50.0,721.9325
max,2736.0,1170.6,3272.6,1330.8,41277300.0


In [None]:
print(sorted(df.date.unique()))

['2019-09-30', '2019-10-01', '2019-10-02', '2019-10-03', '2019-10-04', '2019-10-05', '2019-10-06', '2019-10-07', '2019-10-08', '2019-10-09', '2019-10-10', '2019-10-11', '2019-10-12', '2019-10-13', '2019-10-14', '2019-10-15', '2019-10-16', '2019-10-17', '2019-10-18', '2019-10-19', '2019-10-20', '2019-10-21', '2019-10-22', '2019-10-23', '2019-10-24', '2019-10-25', '2019-10-26', '2019-10-27', '2019-10-28', '2019-10-29', '2019-10-30', '2019-10-31', '2019-11-01', '2019-11-02', '2019-11-03', '2019-11-04', '2019-11-05', '2019-11-06', '2019-11-07', '2019-11-08', '2019-11-09', '2019-11-10', '2019-11-11', '2019-11-12', '2019-11-13', '2019-11-14', '2019-11-15', '2019-11-16', '2019-11-17', '2019-11-18', '2019-11-19', '2019-11-20', '2019-11-21', '2019-11-22', '2019-11-23', '2019-11-24', '2019-11-25', '2019-11-26', '2019-11-27', '2019-11-28', '2019-11-29', '2019-11-30', '2019-12-01', '2019-12-02', '2019-12-03', '2019-12-04', '2019-12-05', '2019-12-06', '2019-12-07', '2019-12-08', '2019-12-09', '2019

In [None]:
# date range
df.date.min(), df.date.max()

('2019-09-30', '2020-01-05')
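
The `date` column is still stored as strings here, yet `min()` and `max()` give the right answer because ISO 8601 dates (`YYYY-MM-DD`) sort lexicographically in chronological order. A small sketch that checks this on a few dates taken from the output above:

```python
from datetime import date

date_strings = ['2019-11-02', '2020-01-05', '2019-09-30', '2019-12-22']

# range computed on the raw strings
string_range = (min(date_strings), max(date_strings))

# range computed on real date objects, converted back to strings for comparison
parsed = [date.fromisoformat(s) for s in date_strings]
parsed_range = (min(parsed).isoformat(), max(parsed).isoformat())

print(string_range == parsed_range, string_range)
```

The range also shows that the file labelled `2019Q4` contains a few days just outside that calendar quarter.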

In [None]:
# countries found in the dataset
print(sorted(df.country.unique()))

['AE', 'AF', 'AR', 'AT', 'AU', 'BA', 'BD', 'BE', 'BG', 'BH', 'BO', 'BR', 'CA', 'CH', 'CL', 'CN', 'CO', 'CR', 'CY', 'CZ', 'DE', 'DK', 'DZ', 'EC', 'EE', 'ES', 'ET', 'FI', 'FR', 'GB', 'GE', 'GR', 'GT', 'HK', 'HR', 'HU', 'ID', 'IE', 'IL', 'IN', 'IQ', 'IR', 'IS', 'IT', 'JO', 'JP', 'KG', 'KR', 'KW', 'KZ', 'LA', 'LK', 'LT', 'MK', 'ML', 'MM', 'MN', 'MO', 'MX', 'MY', 'NL', 'NO', 'NP', 'NZ', 'PE', 'PH', 'PK', 'PL', 'PR', 'PT', 'RE', 'RO', 'RS', 'RU', 'SA', 'SE', 'SG', 'SK', 'SV', 'TH', 'TJ', 'TM', 'TR', 'TW', 'UA', 'UG', 'US', 'UZ', 'VN', 'XK', 'ZA']


In [None]:
# cities found in the dataset
print(sorted(df.city.unique()))

['Abha', 'Abu Dhabi', 'Adana', 'Adapazarı', 'Addis Ababa', 'Adelaide', 'Aguascalientes', 'Akita', 'Albuquerque', 'Algiers', 'Alor Setar', 'Amiens', 'Amman', 'Amsterdam', 'Andong', 'Ankara', 'Antakya', 'Antwerpen', 'Anyang', 'Arad', 'Arāk', 'Ashdod', 'Ashgabat', 'Ashkelon', 'Athens', 'Atlanta', 'Auckland', 'Augsburg', 'Austin', 'Bacău', 'Baghdad', 'Baguio', 'Baia Mare', 'Baltimore', 'Balıkesir', 'Bamako', 'Bandar Abbas', 'Bangkok', 'Barcelona', 'Beijing', 'Belfast', 'Belgrade', 'Bengaluru', 'Bergen', 'Berlin', 'Besançon', 'Bhopal', 'Bilbao', 'Biratnagar', 'Birmingham', 'Bishkek', 'Bloemfontein', 'Bogotá', 'Boise', 'Bologna', 'Bordeaux', 'Boston', 'Bratislava', 'Braşov', 'Breda', 'Brescia', 'Brisbane', 'Bristol', 'Brno', 'Brooklyn', 'Brussels', 'Brăila', 'Bucharest', 'Budapest', 'Buenos Aires', 'Buraydah', 'Burgas', 'Burgos', 'Bursa', 'Busan', 'Butuan', 'Bydgoszcz', 'Caen', 'Calama', 'Calgary', 'Canberra', 'Cape Town', 'Cardiff', 'Castelló de la Plana', 'Chandigarh', 'Changchun', 'Changs

In [None]:
# species found in the dataset
print(sorted(df.specie.unique()))

['aqi', 'co', 'dew', 'humidity', 'mepaqi', 'neph', 'no2', 'o3', 'pm1', 'pm10', 'pm25', 'pol', 'precipitation', 'pressure', 'so2', 'temperature', 'uvi', 'wd', 'wind gust', 'wind speed', 'wind-gust', 'wind-speed']


In [None]:
# we might have duplicated specie labels ('wind gust' and 'wind-gust',
# 'wind speed' and 'wind-speed'), so let's check that
df1 = df[(df.country == 'TR') & (df.specie == 'wind gust')]
df1

Unnamed: 0,date,country,city,specie,count,min,max,median,variance
549446,2019-12-25,TR,İzmit,wind gust,144,7.0,12.6,9.1,15.06
549447,2019-12-26,TR,İzmit,wind gust,144,9.6,16.8,14.0,34.93
549448,2019-12-27,TR,İzmit,wind gust,144,1.0,14.2,4.6,148.74
549449,2019-12-28,TR,İzmit,wind gust,138,0.8,15.2,4.0,170.54
549450,2020-01-03,TR,İzmit,wind gust,36,0.3,3.0,1.3,8.21
549451,2020-01-04,TR,İzmit,wind gust,144,0.5,7.8,3.6,43.67
549452,2020-01-05,TR,İzmit,wind gust,144,0.3,7.0,2.5,28.76
550748,2019-12-27,TR,Ankara,wind gust,120,0.6,11.6,6.6,84.48
550749,2019-12-28,TR,Ankara,wind gust,100,0.1,2.5,0.4,3.08
550750,2020-01-03,TR,Ankara,wind gust,78,0.4,12.6,5.2,124.04


In [None]:
df2 = df[(df.country == 'TR') & (df.specie == 'wind-gust')]
df2

Unnamed: 0,date,country,city,specie,count,min,max,median,variance
548141,2019-12-23,TR,Eskişehir,wind-gust,2,9.2,9.7,9.2,1.25
548974,2019-11-02,TR,İzmit,wind-gust,144,0.7,11.8,2.8,90.98
548975,2019-11-12,TR,İzmit,wind-gust,144,0.3,3.2,1.1,5.99
548976,2019-11-15,TR,İzmit,wind-gust,144,0.5,3.0,1.2,5.92
548977,2019-12-11,TR,İzmit,wind-gust,138,0.7,9.0,2.1,63.33
...,...,...,...,...,...,...,...,...,...
564363,2019-11-03,TR,Erzurum,wind-gust,4,6.6,6.6,6.6,0.00
564364,2019-11-27,TR,Erzurum,wind-gust,4,9.7,9.7,9.7,0.00
564365,2019-11-29,TR,Erzurum,wind-gust,16,9.2,12.3,10.8,13.39
564366,2019-10-08,TR,Erzurum,wind-gust,12,11.3,12.3,11.3,2.42


In [None]:
# merge both subsets and sort them by city
df3 = pd.concat([df1, df2]).sort_values(by='city', ascending=True)
# for brevity, select a single date on which both labels appear
df3[df3.date == '2020-01-05']
# we can clearly see that the rows are duplicated, so we can safely drop
# either of them

Unnamed: 0,date,country,city,specie,count,min,max,median,variance
552642,2020-01-05,TR,Adana,wind-gust,72,0.2,17.3,6.6,231.13
552002,2020-01-05,TR,Adana,wind gust,72,0.2,17.3,6.6,231.13
555981,2020-01-05,TR,Adapazarı,wind-gust,72,0.3,7.0,2.5,28.97
556454,2020-01-05,TR,Adapazarı,wind gust,72,0.3,7.0,2.5,28.97
550752,2020-01-05,TR,Ankara,wind gust,96,0.2,2.7,0.8,4.81
551098,2020-01-05,TR,Ankara,wind-gust,96,0.2,2.7,0.8,4.81
558489,2020-01-05,TR,Antakya,wind gust,18,0.2,17.3,6.6,241.33
557876,2020-01-05,TR,Antakya,wind-gust,18,0.2,17.3,6.6,241.33
555514,2020-01-05,TR,Balıkesir,wind gust,12,14.9,16.9,15.9,4.92
554969,2020-01-05,TR,Balıkesir,wind-gust,12,14.9,16.9,15.9,4.92


In [None]:
# let's check if this is the same thing for 'wind speed' and 'wind-speed'
df1 = df[(df.country == 'TR') & (df.specie == 'wind speed')]
df2 = df[(df.country == 'TR') & (df.specie == 'wind-speed')]
df3 = pd.concat([df1, df2]).sort_values(by=['date', 'city'], ascending=True)

In [None]:
df3[(df3.date.str.startswith('2020-01')) & (df3.city == 'İzmir')]
# we can see here as well that the rows are duplicated, so we can safely drop
# either one or the other

Unnamed: 0,date,country,city,specie,count,min,max,median,variance
560349,2020-01-03,TR,İzmir,wind speed,128,6.1,11.5,7.9,23.08
560239,2020-01-03,TR,İzmir,wind-speed,128,6.1,11.5,7.9,23.08
560343,2020-01-04,TR,İzmir,wind speed,192,0.5,7.9,4.3,39.33
560310,2020-01-04,TR,İzmir,wind-speed,192,0.5,7.9,4.3,39.33
560344,2020-01-05,TR,İzmir,wind speed,168,0.2,4.6,1.5,16.25
560254,2020-01-05,TR,İzmir,wind-speed,168,0.2,4.6,1.5,16.25
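
Instead of comparing the rows by eye, the pairing can be checked programmatically. A minimal sketch using the six İzmir rows shown above, with plain tuples standing in for the DataFrame: rows are grouped by everything except the label, and each group should contain exactly the two spellings.

```python
from collections import defaultdict

# (date, city, specie, count, min, max, median, variance), copied from the output above
rows = [
    ('2020-01-03', 'İzmir', 'wind speed', 128, 6.1, 11.5, 7.9, 23.08),
    ('2020-01-03', 'İzmir', 'wind-speed', 128, 6.1, 11.5, 7.9, 23.08),
    ('2020-01-04', 'İzmir', 'wind speed', 192, 0.5, 7.9, 4.3, 39.33),
    ('2020-01-04', 'İzmir', 'wind-speed', 192, 0.5, 7.9, 4.3, 39.33),
    ('2020-01-05', 'İzmir', 'wind speed', 168, 0.2, 4.6, 1.5, 16.25),
    ('2020-01-05', 'İzmir', 'wind-speed', 168, 0.2, 4.6, 1.5, 16.25),
]

# group the labels seen for each distinct measurement
labels_by_measurement = defaultdict(set)
for day, city, specie, *stats in rows:
    labels_by_measurement[(day, city, tuple(stats))].add(specie)

every_row_is_paired = all(
    labels == {'wind speed', 'wind-speed'}
    for labels in labels_by_measurement.values()
)
print(len(labels_by_measurement), every_row_is_paired)
```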


In [11]:
def update_specie_values(old_value):
  """Map the space-separated wind labels onto their hyphenated spellings."""
  if old_value == 'wind gust':
    return 'wind-gust'
  if old_value == 'wind speed':
    return 'wind-speed'
  return old_value
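
An equivalent way to express this mapping is a lookup table, which scales better if more aliases turn up in other quarters. This is a sketch only; the names `SPECIE_ALIASES` and `normalize_specie` are hypothetical and not used elsewhere in the notebook:

```python
# alias -> canonical label; labels not listed here are returned unchanged
SPECIE_ALIASES = {
    'wind gust': 'wind-gust',
    'wind speed': 'wind-speed',
}


def normalize_specie(value):
    """Return the canonical spelling of a specie label."""
    return SPECIE_ALIASES.get(value, value)


labels = ['o3', 'wind gust', 'wind-gust', 'wind speed', 'pm25']
normalized = [normalize_specie(label) for label in labels]
print(sorted(set(normalized)))
```

With pandas, passing the same dictionary to `df.specie.replace(SPECIE_ALIASES)` should give the same result without a Python-level function call per row.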

In [None]:
df.specie.apply(update_specie_values).unique()

array(['o3', 'so2', 'pressure', 'temperature', 'humidity', 'no2',
       'wind-speed', 'wind-gust', 'precipitation', 'dew', 'co', 'pm10',
       'pm25', 'wd', 'pm1', 'neph', 'aqi', 'pol', 'uvi', 'mepaqi'],
      dtype=object)

In [None]:
df.specie = df.specie.apply(update_specie_values)

In [None]:
df.specie.unique()

array(['o3', 'so2', 'pressure', 'temperature', 'humidity', 'no2',
       'wind-speed', 'wind-gust', 'precipitation', 'dew', 'co', 'pm10',
       'pm25', 'wd', 'pm1', 'neph', 'aqi', 'pol', 'uvi', 'mepaqi'],
      dtype=object)

In [None]:
# getting the number of rows in the data before and after removing duplicates
n_rows_before = len(df.index)
n_rows_before

578226

In [None]:
df = df.drop_duplicates()

In [None]:
n_rows_after = len(df.index)
n_rows_after

570834

In [None]:
# 7392 duplicated rows found and removed
n_rows_before - n_rows_after

7392

In [None]:
df.info()
# we can see that there are no missing values, so we can consider this
# version the clean, processed version of our dataset

<class 'pandas.core.frame.DataFrame'>
Int64Index: 570834 entries, 0 to 578225
Data columns (total 9 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   date      570834 non-null  object 
 1   country   570834 non-null  object 
 2   city      570834 non-null  object 
 3   specie    570834 non-null  object 
 4   count     570834 non-null  int64  
 5   min       570834 non-null  float64
 6   max       570834 non-null  float64
 7   median    570834 non-null  float64
 8   variance  570834 non-null  float64
dtypes: float64(4), int64(1), object(4)
memory usage: 43.6+ MB


In [None]:
# save a clean version of the data
TEST_AQ_DATA_FILE_DST = AIR_QUALITY_CLEAN_DATA_FOLDER / '2019Q4.csv'
df.to_csv(TEST_AQ_DATA_FILE_DST, index=False, header=True)

In [None]:
# pd.set_option('display.max_rows', None)
# pd.reset_option('display.max_rows')

In [None]:
# the same processing should now be applied to all the remaining CSV files
# df = pd.read_csv(AIR_QUALITY_RAW_DATA_FOLDER / '2020Q1.csv', comment='#')
# df

In [12]:
# processing steps

# - load file
# - lowercase column names
# - update_specie_values
# - drop_duplicates
# - save clean version

def process_file(src_file_path, dst_file_path):
  # skip this file if its clean version already exists
  if dst_file_path.exists():
    print(f'{dst_file_path} found, skipping...')
    return
  df = pd.read_csv(src_file_path, comment='#')  # load file
  df.columns = df.columns.str.lower()  # lowercase column names
  df.specie = df.specie.apply(update_specie_values)  # normalize specie labels
  df = df.drop_duplicates()  # drop duplicated rows
  df.to_csv(dst_file_path, index=False, header=True)  # save clean version
  print(f'{dst_file_path} saved.')
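
The processing steps can be tried on a tiny in-memory sample before running them on the full files. A minimal sketch, assuming pandas is installed; `clean_frame` is a hypothetical helper that mirrors the steps of `process_file` without touching the disk, and the sample rows are taken from the outputs shown earlier:

```python
import io

import pandas as pd

SAMPLE = """\
# Data-Set Generated on 2020-04-13T01:05:30+01:00
Date,Country,City,Specie,count,min,max,median,variance
2020-01-05,TR,Ankara,wind gust,96,0.2,2.7,0.8,4.81
2020-01-05,TR,Ankara,wind-gust,96,0.2,2.7,0.8,4.81
2020-01-05,HU,Debrecen,o3,34,1.6,26.4,16.7,586.81
"""


def clean_frame(raw_csv_text):
    """Load, normalize and de-duplicate one CSV given as text."""
    df = pd.read_csv(io.StringIO(raw_csv_text), comment='#')  # load, skipping comments
    df.columns = df.columns.str.lower()  # lowercase column names
    df['specie'] = df['specie'].replace(
        {'wind gust': 'wind-gust', 'wind speed': 'wind-speed'}
    )  # unify the two spellings
    return df.drop_duplicates().reset_index(drop=True)  # drop duplicated rows


clean = clean_frame(SAMPLE)
print(clean)
```

The two Ankara rows differ only in the spelling of the label, so they collapse into one row once the labels are unified.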

In [13]:
for period in periods:
  SRC_FILE_PATH = AIR_QUALITY_RAW_DATA_FOLDER / f'{period}.csv'
  DST_FILE_PATH = AIR_QUALITY_CLEAN_DATA_FOLDER / f'{period}.csv'
  process_file(SRC_FILE_PATH, DST_FILE_PATH)

data/air_quality/clean/2020Q1.csv saved.
data/air_quality/clean/2020Q2.csv saved.
data/air_quality/clean/2020Q3.csv saved.
data/air_quality/clean/2020Q4.csv saved.
data/air_quality/clean/2019Q1.csv saved.
data/air_quality/clean/2019Q2.csv saved.
data/air_quality/clean/2019Q3.csv saved.
data/air_quality/clean/2019Q4.csv saved.


In [14]:
# put all clean data in one zip file (to make downloading easier)
!zip clean_data.zip data/air_quality/clean/*.csv

  adding: data/air_quality/clean/2019Q1.csv (deflated 78%)
  adding: data/air_quality/clean/2019Q2.csv (deflated 78%)
  adding: data/air_quality/clean/2019Q3.csv (deflated 78%)
  adding: data/air_quality/clean/2019Q4.csv (deflated 77%)
  adding: data/air_quality/clean/2020Q1.csv (deflated 77%)
  adding: data/air_quality/clean/2020Q2.csv (deflated 78%)
  adding: data/air_quality/clean/2020Q3.csv (deflated 78%)
  adding: data/air_quality/clean/2020Q4.csv (deflated 78%)


In [15]:
!ls -alh *.zip

-rw-r--r-- 1 root root 52M Nov 19 15:24 clean_data.zip
-rw-r--r-- 1 root root 52M Nov 19 15:22 raw_data.zip
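
The archives can also be built without the `zip` command, using the standard `zipfile` module. This is a sketch, not the notebook's method: the helper `zip_csv_files` is hypothetical, and unlike `zip data/air_quality/raw/*.csv` it stores only the file names, not the relative folder path. The demonstration runs on placeholder files in a temporary directory:

```python
import tempfile
import zipfile
from pathlib import Path


def zip_csv_files(folder, archive_path):
    """Write every *.csv file found in `folder` into a deflate-compressed zip."""
    csv_files = sorted(Path(folder).glob('*.csv'))
    with zipfile.ZipFile(archive_path, 'w', compression=zipfile.ZIP_DEFLATED) as archive:
        for csv_file in csv_files:
            # arcname keeps only the file name inside the archive
            archive.write(csv_file, arcname=csv_file.name)
    return [csv_file.name for csv_file in csv_files]


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / 'raw'
    raw.mkdir()
    for period in ('2019Q4', '2020Q1'):
        (raw / f'{period}.csv').write_text('Date,Country,City\n')
    (raw / 'notes.txt').write_text('not a CSV file, so it is left out')
    archive_path = Path(tmp) / 'raw_data.zip'
    zip_csv_files(raw, archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        archived = archive.namelist()
print(archived)
```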


In [16]:
from google.colab import files

In [17]:
files.download('raw_data.zip')


In [18]:
files.download('clean_data.zip')


# References
* [python - Running for loop terminal commands in Jupyter - Stack Overflow](https://stackoverflow.com/questions/46920538/running-for-loop-terminal-commands-in-jupyter/50730032)
* [shell - How to mkdir only if a directory does not already exist? - Stack Overflow](https://stackoverflow.com/questions/793858/how-to-mkdir-only-if-a-directory-does-not-already-exist)
* [How do I append a string to a Path in Python? - Stack Overflow](https://stackoverflow.com/questions/48190959/how-do-i-append-a-string-to-a-path-in-python)
* [Line Wrapping in Collaboratory Google results - Stack Overflow](https://stackoverflow.com/questions/58890109/line-wrapping-in-collaboratory-google-results)
* [python - How to see complete rows in Google Colab - Stack Overflow](https://stackoverflow.com/questions/60013721/how-to-see-complete-rows-in-google-colab)
* [python - Pandas: Setting no. of max rows - Stack Overflow](https://stackoverflow.com/questions/16424493/pandas-setting-no-of-max-rows)
* [command line - how to pass multiple files to zip for zipping - Ask Ubuntu](https://askubuntu.com/questions/1001258/how-to-pass-multiple-files-to-zip-for-zipping)

---