<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#A-Dataset-Object" data-toc-modified-id="A-Dataset-Object-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>A Dataset Object</a></span></li><li><span><a href="#Compiling-pollution-data" data-toc-modified-id="Compiling-pollution-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Compiling pollution data</a></span></li></ul></div>

In [1]:
#  Load the "autoreload" extension so that code can change
%load_ext autoreload
%reload_ext autoreload
#  always reload modules so that as you change code in src, it gets loaded
%autoreload 2
%matplotlib inline

import sys
sys.path.append('../')
import src
from src.imports import *
from src.gen_functions import *
# import the Dataset object class
from src.features.dataset import Dataset
from src.visualization.mapper import *

plt.rcParams.update({'font.size': 16})

# A Dataset Object

It is more convenient to have a `Dataset` object that keeps track of all relevant data for a city, along with necessary meta information such as the city location. This object is defined in `src.features.dataset.py`.

The `Dataset` object is also in charge of compiling the raw pollution, weather, and fire data from the data folder into a ready-to-use format using `dataset.build_all_data()`. The processed data are saved under `../data/city_name/`. The code below illustrates how the `Dataset` object compiles the data with the `build_all_data()` command. This object also keeps track of the feature engineering parameters during the model optimization step (see this [notebook](https://github.com/worasom/aqi_thailand2/blob/master/notebooks/5.0-ML_ChiangMai.ipynb)). For the `Dataset` object's documentation, please refer to https://github.com/worasom/aqi_thailand2/blob/master/docs/_build/html/src.features.html.

In [None]:
# init a dataset object and build the data from scratch 
# only perform this when new data files are added 
dataset = Dataset('Chiang Mai')

# build pollution, weather, and (optionally) fire and holiday data
dataset.build_all_data(build_fire=True, build_holiday=True)

th_stations ['35t', '36t']
Averaging data from 3 stations
Loading all hotspots data. This might take sometimes


`dataset.build_all_data()` calls four functions:
- `dataset.build_pollution()`: compiles pollution data from all available sources, averages them, and stores the result as the `dataset.poll_df` attribute.
- `dataset.build_weather()`: loads the weather data, fills in the missing values, and stores the result as `dataset.wea`.
- `dataset.build_fire()`: compiles the satellite data files into a `dataset.fire` dataframe.
- `dataset.build_holiday()`: scrapes holiday information from the website and saves it as a csv file.

These functions can be called separately when any of the data needs updating.

The building process might take some time because of the size of the fire data. Building the fire data is optional and can be turned off with `build_fire=False`. Afterwards, the compiled data can be loaded using the `load_()` command.

In [2]:
# load the saved processed data
dataset = Dataset('Chiang Mai')
dataset.load_()

The hourly pollution data, weather data, and fire data are under the `dataset.poll_df`, `dataset.wea`, and `dataset.fire` attributes, respectively. Each is a pandas dataframe with a datetime index. For example, the pollution data for Chiang Mai looks like

In [48]:
print(dataset.poll_df.tail(2).to_markdown())

| datetime            |   PM2.5 |   PM10 |   O3 |   CO |   NO2 |   SO2 |
|:--------------------|--------:|-------:|-----:|-----:|------:|------:|
| 2020-06-17 15:00:00 |     8.5 |   19.5 |   15 | 0.4  |     5 |     1 |
| 2020-06-17 16:00:00 |     7.5 |   16.5 |   11 | 0.43 |     5 |     1 |
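
Because the frames are indexed by datetime, the usual pandas time-series tools apply directly. The sketch below rebuilds the two rows from the table above by hand (it does not load the real dataset) and shows label-based slicing and a daily average. The variable names `late` and `daily` are only illustrative.

```python
import pandas as pd

# the two hourly rows printed above, rebuilt by hand for illustration
poll_df = pd.DataFrame(
    {
        'PM2.5': [8.5, 7.5],
        'PM10': [19.5, 16.5],
        'O3': [15, 11],
        'CO': [0.4, 0.43],
        'NO2': [5, 5],
        'SO2': [1, 1],
    },
    index=pd.DatetimeIndex(
        ['2020-06-17 15:00:00', '2020-06-17 16:00:00'], name='datetime'),
)

# select everything from 16:00 onwards using the datetime index
late = poll_df.loc['2020-06-17 16:00':]

# collapse the hourly rows into one mean value per calendar day
daily = poll_df.resample('D').mean()
print(daily)
```

The same calls should work on `dataset.poll_df`, `dataset.wea`, and `dataset.fire`, assuming their indexes are sorted `DatetimeIndex` objects as described above.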


Additionally, the dataset has city information under the `city_info` attribute.

In [27]:
dataset.city_info

{'Country': 'Thailand',
 'City': 'Chiang Mai',
 'City (ASCII)': 'Chiang Mai',
 'Region': 'Chiang Mai',
 'Region (ASCII)': 'Chiang Mai',
 'Population': '200952',
 'Latitude': '18.7904',
 'Longitude': '98.9847',
 'Time Zone': 'Asia/Bangkok',
 'lat_km': 2117.0,
 'long_km': 11019.0}

# Compiling pollution data

`dataset.build_pollution()` compiles data from many data sources and averages them into a single dataframe under the `dataset.poll_df` attribute. Internally, this function calls `dataset.collect_stations_data()` to obtain a list of pollution dataframes from the different sources, which in turn calls many functions in `src.data.read_data.py`. Here, I will explain the `dataset.collect_stations_data()` function.

Below is the definition of this function:

In [None]:
    def collect_stations_data(self):
        """Collect all pollution data from the different sources.

        Since each city has different data sources, each city has to be treated differently.
        The station choices are specified by the config.json.

        Returns: a list of dataframes, one for each station or data source.

        """
        # data_list contains the dataframes of all pollution data before merging
        # each of these dataframes has 'datetime' as a column
        data_list = []

        # load data from the Berkeley Earth project. This is the same for all cities
        b_data, _ = read_b_data(self.main_folder + 'pm25/' + self.city_name.replace(' ', '_') + '.txt')
        data_list.append(b_data)

        try:
            # load config_dict for the city 
            config_dict = self.config_dict[self.city_name]
        except:
            config_dict = {}
        
        # load thailand stations if applicable 
        if 'th_stations' in config_dict.keys():
            station_ids = config_dict['th_stations']
            print('th_stations', station_ids)
            self.merge_new_old_pollution(station_ids)
            # load the file
            for station_id in station_ids:
                filename = self.data_folder + station_id + '.csv'
                data = pd.read_csv(filename)
                data['datetime'] = pd.to_datetime(data['datetime'])
                data_list.append(data)
        # load the Thailand stations maintained by cmucdc project 
        if 'cmu_stations' in config_dict.keys():
            station_ids = config_dict['cmu_stations']
            print('cmu_stations', station_ids)
            for station_id in station_ids:
                filename = self.main_folder + 'cdc_data/' + str(station_id) + '.csv' 
                data_list.append(read_cmucdc(filename))
        
        if 'b_stations' in config_dict.keys():
            # add Berkeley stations in nearby provinces
            station_ids = config_dict['b_stations']
            print('add Berkeley stations', station_ids)
            for station_id in station_ids:
                b_data, _ = read_b_data(self.main_folder + 'pm25/' + f'{station_id}.txt')
                data_list.append(b_data)

        if 'us_emb' in config_dict.keys():
            # add the data from US embassy 
            print('add US embassy data')
            data_list += build_us_em_data(city_name=self.city_name,
                                                    data_folder=f'{self.main_folder}us_emb/')

        return data_list 
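
The list returned by `collect_stations_data()` still has to be reduced to a single dataframe before it becomes `dataset.poll_df`. How `build_pollution()` does this is not shown here, so the following is only one plausible approach: stack the frames and take the mean per timestamp. The helper name `average_stations` and the station readings are made up for illustration.

```python
import pandas as pd

def average_stations(data_list):
    """Average pollutant readings across stations for each timestamp.

    Hypothetical helper, not the repository's implementation.
    """
    # stack all station frames; each must carry a 'datetime' column
    combined = pd.concat(data_list, ignore_index=True)
    # mean() skips NaN, so a pollutant missing at one station
    # is averaged over the stations that do report it
    return combined.groupby('datetime').mean(numeric_only=True).sort_index()

# two toy stations with made-up readings
station_a = pd.DataFrame({
    'datetime': pd.to_datetime(['2020-06-17 15:00', '2020-06-17 16:00']),
    'PM2.5': [8.0, 7.0],
    'PM10': [19.0, 16.0],
})
station_b = pd.DataFrame({
    'datetime': pd.to_datetime(['2020-06-17 15:00', '2020-06-17 16:00']),
    'PM2.5': [9.0, 8.0],  # this station does not measure PM10
})

poll_df = average_stations([station_a, station_b])
print(poll_df)
```

Grouping on the `datetime` column, rather than aligning on position, means stations with different time coverage can be combined without extra bookkeeping.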

`src.data.read_data.read_b_data()` loads the [Berkeley project](http://berkeleyearth.org/) data using `self.city_name`. This data source is common to all the cities. The availability of the other data sources varies by city, so they must be specified in `src.features.config.py`. The dictionary defined there is added as the `dataset.config_dict` attribute.

For example, for **Chiang Mai**, it says

In [66]:
print('Chiang Mai stations ', dataset.config_dict )

Chiang Mai stations  {'th_stations': ['35t', '36t']}


This means that for Chiang Mai, the data from two Thailand DPC stations, '35t' and '36t', are included in addition to the Berkeley project data. The function `dataset.merge_new_old_pollution()` compiles the historical data and the scraped data for each station and saves the result under the `../data/city_name/` folder. The merged data is then read and appended to `data_list`.

For **Nakhon Si Thammarat**, the config_dict says

In [69]:
print('Nakhon Si Thammarat', dataset.config_dict )

Nakhon Si Thammarat {'th_stations': ['42t', 'm3'], 'cmu_stations': [118]}


For Nakhon Si Thammarat, I borrow the data from the nearby province (station '42t'), in addition to the mobile station 'm3'. I also include the data from the [Chiang Mai University Monitoring Stations](https://www.cmuccdc.org/), station number 118. This is done by calling `src.data.read_data.read_cmucdc()`.

To look up the Thailand DPC station numbers, use `src.data.read_data.get_th_stations()`.

In [75]:
src.data.read_data.get_th_stations('Chiang Mai')

(['35t', '36t'],
 [{'stationID': '35t',
   'nameTH': 'ศูนย์ราชการจังหวัดเชียงใหม่ ',
   'nameEN': 'City Hall, Chiangmai',
   'areaTH': 'ต.ช้างเผือก อ.เมือง, เชียงใหม่',
   'areaEN': 'Chang Phueak, Meuang, Chiang Mai',
   'stationType': 'GROUND',
   'lat': '18.840633',
   'long': '98.969661',
   'LastUpdate': {'date': '2020-09-17',
    'time': '23:00',
    'PM25': {'value': '15', 'unit': 'µg/m³'},
    'PM10': {'value': '27', 'unit': 'µg/m³'},
    'O3': {'value': '18', 'unit': 'ppb'},
    'CO': {'value': '-', 'unit': 'ppm'},
    'NO2': {'value': '0', 'unit': 'ppb'},
    'SO2': {'value': '0', 'unit': 'ppb'},
    'AQI': {'Level': '1', 'aqi': '15'}}},
  {'stationID': '36t',
   'nameTH': 'โรงเรียนยุพราชวิทยาลัย ',
   'nameEN': 'Yupparaj Wittayalai School',
   'areaTH': 'ต.ศรีภูมิ อ.เมือง, เชียงใหม่',
   'areaEN': 'Si Phum, Meuang, Chiang Mai',
   'stationType': 'GROUND',
   'lat': '18.7909205',
   'long': '98.9881062',
   'LastUpdate': {'date': '2020-09-17',
    'time': '23:00',
    'PM25': 

For the [Chiang Mai University Monitoring Stations](https://www.cmuccdc.org/), one can look up the station ids in their JSON file. For example,

In [87]:
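import requests

# download the metadata of all CMU monitoring stations
station_info_list = requests.get('https://www.cmuccdc.org/api/ccdc/stations').json()

# keep the ids of the stations whose English name mentions the city
station_ids = []
for station in station_info_list:
    if 'Nakhon Si Thammarat' in station['dustboy_name_en']:
        station_ids.append(station['dustboy_id'])

print(station_ids)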

['118']


For **Hanoi**, the configuration file says

In [68]:
print('Hanoi stations ', dataset.config_dict)

Hanoi stations  {'b_stations': ['Ha_Dong'], 'us_emb': True}
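
To summarize how the three configurations shown above drive `collect_stations_data()`, here is a minimal sketch that lists the sources that would be read for each city, in the order the function checks its keys. The `CONFIG` dictionary is a hand-made copy of the printed entries, its nesting by city name is an assumption based on the `self.config_dict[self.city_name]` lookup, and `list_sources` and its labels are hypothetical names that are not part of the repository.

```python
# Illustrative copy of the per-city entries printed above; the nesting by
# city name is an assumption based on collect_stations_data().
CONFIG = {
    'Chiang Mai': {'th_stations': ['35t', '36t']},
    'Nakhon Si Thammarat': {'th_stations': ['42t', 'm3'], 'cmu_stations': [118]},
    'Hanoi': {'b_stations': ['Ha_Dong'], 'us_emb': True},
}

def list_sources(city_name, config=CONFIG):
    """Return labels for the sources that would be read for a city."""
    # a city without an entry falls back to an empty configuration
    city_config = config.get(city_name, {})
    # the city's own Berkeley Earth file is always loaded first
    sources = ['berkeley:' + city_name.replace(' ', '_')]
    sources += [f'th:{s}' for s in city_config.get('th_stations', [])]
    sources += [f'cmu:{s}' for s in city_config.get('cmu_stations', [])]
    sources += [f'berkeley:{s}' for s in city_config.get('b_stations', [])]
    # like the original function, only the presence of the key is checked
    if 'us_emb' in city_config:
        sources.append('us_embassy')
    return sources

for city in CONFIG:
    print(city, list_sources(city))
```

For Hanoi this lists the city's own Berkeley Earth file, the additional Berkeley station 'Ha_Dong', and the US embassy data, which matches the `b_stations` and `us_emb` branches of the function.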
