# Scraping Pollution Data from a Website

Environmental agencies in developed countries often make pollution data publicly available; however, this is not the case for many developing countries. Thailand's environmental protection agency (EPA) makes only one month of history available through its website without a special request, see ref [3](http://www.aqmthai.com/public_report.php). Luckily, some historical data can be found on the Berkeley Earth website.

This notebook demonstrates how to automatically download Thailand's pollution data from Berkeley Earth's database, ref [1](http://berkeleyearth.lbl.gov/air-quality/maps/cities/Thailand/), using the requests, beautifulsoup, and wget libraries. I also have another [notebook](https://github.com/worasom/aqi_thailand/blob/master/webscraping-AQI.ipynb) demonstrating how to scrape data from the Thai EPA directly.


References
1. http://berkeleyearth.lbl.gov/air-quality/maps/cities/Thailand/
2. https://automatetheboringstuff.com/chapter11/ 
3. http://www.aqmthai.com/public_report.php


## Download PM2.5 Data

The Berkeley Earth database looks like this.

<img src="data/berekeley_earth1.png" width="400">

Each folder contains air pollution data for one province in text and JSON formats. There may be more than one file in a folder; for example, the Nonthaburi province folder has three data files, as shown here.

<img src="data/berekeley_earth3.png" width="500">

I am interested in obtaining data for Bangkok province and its neighbors. The website provides a list of Bangkok's neighbors in the Bangkok.neighbors.json file, so it is more productive to use Python to download these files automatically. In addition, I can reuse this work when I work on Chiang Mai and its neighbors.


In [1]:
# import libraries
from pathlib import Path
import re
import requests
import wget
from bs4 import BeautifulSoup

In [2]:
url = 'http://berkeleyearth.lbl.gov/air-quality/maps/cities/Thailand/'
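res = requests.get(url)
# stop early if the request failed (4xx/5xx response)
res.raise_for_status()
# create a soup object of the Berkeley Earth website; name the parser explicitly
soup = BeautifulSoup(res.text, 'html.parser')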

In [3]:
# find all provinces in this database
provinces = soup.find_all(href=re.compile('/'))[1:]

In [6]:
# get the link tag for Bangkok (the first entry in the list)
bkk = provinces[0]

In [5]:
# build a list of province folders to download
to_grab = []
to_grab.append(bkk['href'])
# locate BKK neighbors json file
nei_url = url+to_grab[0].replace('/','')+'.neighbors.json'
bkk_j = requests.get(nei_url).json()
nei = [prov[1]+'/' for prov in bkk_j if prov[0]=='Thailand']

# some provinces in the list are not actual neighbors of the province of interest, so keep only the first 7
to_grab = to_grab+nei[:7]
to_grab

['Bangkok/',
 'Nonthaburi/',
 'Samut_Prakan/',
 'Samut_Sakhon/',
 'Pathum_Thani/',
 'Nakhon_Pathom/',
 'Phra_Nakhon_Si_Ayutthaya/',
 'Ratchaburi/']
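
To reuse this step for another province, the filtering can be wrapped in a small function. This is a minimal sketch, not part of the original notebook: `neighbor_folders` is a hypothetical helper, the sample entries are made up, and the folder name `Chiang_Mai/` is an assumption. It only relies on what the cell above already assumes, namely that each entry holds the country at index 0 and the province name at index 1.

```python
def neighbor_folders(entries, country='Thailand', limit=None):
    """Return folder names ('Name/') for the neighbors located in `country`.

    `entries` is assumed to be a list of [country, province, ...] items,
    indexed the same way as the neighbors JSON in the cell above.
    """
    folders = [entry[1] + '/' for entry in entries if entry[0] == country]
    return folders if limit is None else folders[:limit]


# Made-up sample data for illustration; not the real contents of any file
sample = [
    ['Thailand', 'Lamphun'],
    ['Myanmar', 'Example_City'],
    ['Thailand', 'Lampang'],
]

# 'Chiang_Mai/' is an assumed folder name
chiang_mai_list = ['Chiang_Mai/'] + neighbor_folders(sample, limit=2)
print(chiang_mai_list)
```

In the notebook, `entries` would be the parsed JSON returned by `requests.get(nei_url).json()`, and `limit` plays the role of the `[:7]` slice.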

In [12]:
for folder in to_grab: 
    # extract the href for data folders to download
    link = soup.find_all(href = re.compile(folder))
    print(link)

[<a href="Bangkok/">Bangkok/</a>]
[<a href="Nonthaburi/">Nonthaburi/</a>]
[<a href="Samut_Prakan/">Samut_Prakan/</a>]
[<a href="Samut_Sakhon/">Samut_Sakhon/</a>]
[<a href="Pathum_Thani/">Pathum_Thani/</a>]
[<a href="Nakhon_Pathom/">Nakhon_Pathom/</a>]
[<a href="Phra_Nakhon_Si_Ayutthaya/">Phra_Nakhon_Si_Ayutthaya/</a>]
[<a href="Ratchaburi/">Ratchaburi/</a>]


In [24]:
Path('data').mkdir(parents=True, exist_ok=True)

In [5]:
def download_txt(folder):
    '''Download every .txt data file in a province folder into data/.

    Input: folder name (href string) of the province, e.g. 'Bangkok/'
    '''
    grab_url = url+folder
    prov_r = requests.get(grab_url)
    prov_s = BeautifulSoup(prov_r.text, 'html.parser')
    # escape the dot so that only links ending in .txt match
    for tag in prov_s.find_all(href=re.compile(r'\.txt$')):
        data_url = grab_url+tag['href']
        name = 'data/'+ tag['href']
        wget.download(data_url,name)
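
When filtering links by file extension, keep in mind that an unescaped `.` in a regular expression matches any character. The sketch below uses made-up file names (they are not taken from the website) to compare the loose pattern `.txt` with the stricter `\.txt$`.

```python
import re

# Hypothetical file names, for illustration only
hrefs = ['Bangkok.txt', 'Bangkok.json', 'notes_txt_backup.json', 'readme.txt.bak']

loose = re.compile('.txt')       # '.' matches any single character
strict = re.compile(r'\.txt$')   # literal dot, and the name must end in .txt

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(loose_hits)   # also picks up names that merely contain 'txt'
print(strict_hits)  # only the real .txt file
```

The stricter pattern avoids downloading files that only happen to contain `txt` somewhere in their name.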

In [36]:
# start downloading !
for folder in to_grab:
    download_txt(folder)

100% [............................................................................] 576938 / 576938

## Scrape Pollution Data for All Provinces

In [6]:
for province in provinces:
    download_txt(province['href'])

100% [............................................................................] 613902 / 613902
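
Looping over every province sends many requests, and a single failed download would stop the whole loop. One possible way to make it more forgiving is sketched below. `download_all`, its `pause` parameter, and the stand-in `fake_download` are hypothetical names introduced for illustration; in the notebook, `download_txt` would be passed in instead.

```python
import time


def download_all(folders, download, pause=1.0):
    """Call `download(folder)` for each folder and return the folders that failed."""
    failed = []
    for folder in folders:
        try:
            download(folder)
        except Exception as err:  # broad on purpose: keep going on any failure
            print(f'skipping {folder}: {err}')
            failed.append(folder)
        time.sleep(pause)  # short pause between requests
    return failed


# Stand-in downloader that simulates one failure, so the sketch runs offline
def fake_download(folder):
    if folder == 'Broken/':
        raise IOError('simulated network error')


failed = download_all(['Bangkok/', 'Broken/', 'Nonthaburi/'], fake_download, pause=0)
print(failed)
```

With the real function, the call would look like `download_all([p['href'] for p in provinces], download_txt)`, and the returned list can be retried later.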