# Bangkok PM2.5 Forecast - Cleaned Data Updated


## Dataset
The data used for time series analysis and forecasting with the VAR and AR models were collected from Dec 1, 2021 to Jan 5, 2022. Measurements are hourly, giving 864 rows in total; after first-order differencing (which drops the first row), 815 rows were used for training and 48 rows for testing. The dataset was later extended to Feb 23, 2022 09:00:00 (2,026 rows in total) in order to train an RNN model.
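
The row counts above can be checked with a little arithmetic. The sketch below is only an illustration: it rebuilds the hourly timestamps from the stated date range (assuming the first period ends with the 23:00 reading on Jan 5, 2022) rather than reading the actual CSV files, and the variable names are hypothetical.

```python
import pandas as pd

one_hour = pd.Timedelta(hours=1)

# Hourly timestamps for the first collection period (assumed to end at 23:00 on Jan 5)
first_period = pd.date_range("2021-12-01 00:00", "2022-01-05 23:00", freq=one_hour)
n_rows = len(first_period)      # 36 days * 24 hours

n_diffed = n_rows - 1           # first-order differencing drops the first row
n_test = 48                     # last 48 hours held out for testing
n_train = n_diffed - n_test

# Hourly timestamps for the extended dataset used with the RNN
full_period = pd.date_range("2021-12-01 00:00", "2022-02-23 09:00", freq=one_hour)

print(n_rows, n_diffed, n_train, n_test, len(full_period))
```

Under these assumptions the counts agree with the figures quoted in this section: 864 rows, 815 for training, 48 for testing, and 2,026 rows after the update.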


## Import libraries

In [1]:
import pandas as pd
import numpy as np

## Import Data and Data Preparation

In [2]:
df1 = pd.read_csv('report_54t_dindaeng.csv')
df2 = pd.read_csv('report_54t_dindaeng2.csv')

In [3]:
print(len(df1))
print(len(df2))

864
1162


In [4]:
raw_data = pd.concat([df1, df2])
raw_data.head()

,No,date,time,PM2.5,PM10,O3,CO,NO2
0,1,2021-12-01,00:00 - 01:00,32,79,18.0,1.28,32.0
1,2,2021-12-01,01:00 - 02:00,27,78,19.0,1.23,29.0
2,3,2021-12-01,02:00 - 03:00,25,74,21.0,1.22,25.0
3,4,2021-12-01,03:00 - 04:00,33,72,30.0,1.15,18.0
4,5,2021-12-01,04:00 - 05:00,40,66,19.0,1.23,26.0


In [5]:
# drop 'No' column 
raw_data = raw_data.drop(['No'], axis=1)

In [6]:
# keep only the start of each hourly interval, e.g. '00:00 - 01:00' -> '00:00'
raw_data['time_'] = raw_data['time'].str.split().str.get(0)
raw_data.head()

,date,time,PM2.5,PM10,O3,CO,NO2,time_
0,2021-12-01,00:00 - 01:00,32,79,18.0,1.28,32.0,00:00
1,2021-12-01,01:00 - 02:00,27,78,19.0,1.23,29.0,01:00
2,2021-12-01,02:00 - 03:00,25,74,21.0,1.22,25.0,02:00
3,2021-12-01,03:00 - 04:00,33,72,30.0,1.15,18.0,03:00
4,2021-12-01,04:00 - 05:00,40,66,19.0,1.23,26.0,04:00


In [7]:
# build a full timestamp string: date + interval start + ':00' seconds
raw_data['datetime'] = raw_data['date'] + ' ' + raw_data['time_'] + ':00'
raw_data.head()

,date,time,PM2.5,PM10,O3,CO,NO2,time_,datetime
0,2021-12-01,00:00 - 01:00,32,79,18.0,1.28,32.0,00:00,2021-12-01 00:00:00
1,2021-12-01,01:00 - 02:00,27,78,19.0,1.23,29.0,01:00,2021-12-01 01:00:00
2,2021-12-01,02:00 - 03:00,25,74,21.0,1.22,25.0,02:00,2021-12-01 02:00:00
3,2021-12-01,03:00 - 04:00,33,72,30.0,1.15,18.0,03:00,2021-12-01 03:00:00
4,2021-12-01,04:00 - 05:00,40,66,19.0,1.23,26.0,04:00,2021-12-01 04:00:00
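
The two string-manipulation steps above (taking the interval start, then joining it with the date) could also be done in one pass while parsing. The following is a minimal sketch on a made-up two-row frame, not the notebook's actual data; the `sample` and `start` names are hypothetical, and the explicit `format` string assumes every row follows the `HH:MM - HH:MM` pattern shown in the output.

```python
import pandas as pd

# Toy frame mimicking the 'date' and 'time' columns of the raw report
sample = pd.DataFrame({
    "date": ["2021-12-01", "2021-12-01"],
    "time": ["00:00 - 01:00", "01:00 - 02:00"],
})

# Take the part before ' - ' as the start of the measurement interval
start = sample["time"].str.split(" - ").str[0]

# Parse date + start time directly into datetime64 values
sample["datetime"] = pd.to_datetime(
    sample["date"] + " " + start, format="%Y-%m-%d %H:%M"
)

print(sample["datetime"])
```

Passing an explicit `format` makes the parse fail loudly on a malformed row instead of silently guessing, which may be preferable for sensor exports.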


In [8]:
pollutants = raw_data[['datetime', 'PM2.5', 'PM10', 'O3', 'CO', 'NO2']]
pollutants.head()

,datetime,PM2.5,PM10,O3,CO,NO2
0,2021-12-01 00:00:00,32,79,18.0,1.28,32.0
1,2021-12-01 01:00:00,27,78,19.0,1.23,29.0
2,2021-12-01 02:00:00,25,74,21.0,1.22,25.0
3,2021-12-01 03:00:00,33,72,30.0,1.15,18.0
4,2021-12-01 04:00:00,40,66,19.0,1.23,26.0


In [9]:
# `pollutants` was selected from `raw_data`, so build a new frame with assign()
# rather than writing into the slice (this avoids SettingWithCopyWarning)
pollutants = pollutants.assign(datetime=pd.to_datetime(pollutants['datetime']))
pollutants = pollutants.set_index('datetime')
# declare the index as hourly
pollutants.index.freq = 'H'


In [10]:
pollutants.head()

datetime,PM2.5,PM10,O3,CO,NO2
2021-12-01 00:00:00,32,79,18.0,1.28,32.0
2021-12-01 01:00:00,27,78,19.0,1.23,29.0
2021-12-01 02:00:00,25,74,21.0,1.22,25.0
2021-12-01 03:00:00,33,72,30.0,1.15,18.0
2021-12-01 04:00:00,40,66,19.0,1.23,26.0


In [11]:
pollutants.index

DatetimeIndex(['2021-12-01 00:00:00', '2021-12-01 01:00:00',
               '2021-12-01 02:00:00', '2021-12-01 03:00:00',
               '2021-12-01 04:00:00', '2021-12-01 05:00:00',
               '2021-12-01 06:00:00', '2021-12-01 07:00:00',
               '2021-12-01 08:00:00', '2021-12-01 09:00:00',
               ...
               '2022-02-23 00:00:00', '2022-02-23 01:00:00',
               '2022-02-23 02:00:00', '2022-02-23 03:00:00',
               '2022-02-23 04:00:00', '2022-02-23 05:00:00',
               '2022-02-23 06:00:00', '2022-02-23 07:00:00',
               '2022-02-23 08:00:00', '2022-02-23 09:00:00'],
              dtype='datetime64[ns]', name='datetime', length=2026, freq='H')
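
Because the index comes from two concatenated files, it can be worth confirming that no hour is missing or repeated. The sketch below shows one way to do that on a deliberately broken toy index (one missing hour, one duplicate); it is an illustration with hypothetical names, not a step from the original notebook.

```python
import pandas as pd

one_hour = pd.Timedelta(hours=1)

# Toy index: 02:00 is missing and 03:00 appears twice
idx = pd.DatetimeIndex([
    "2021-12-01 00:00", "2021-12-01 01:00",
    "2021-12-01 03:00", "2021-12-01 03:00",
])

# The complete hourly range between the first and last timestamp
expected = pd.date_range(idx.min(), idx.max(), freq=one_hour)

missing = expected.difference(idx)      # hours with no row at all
duplicated = idx[idx.duplicated()]      # timestamps that occur more than once

print("missing:", list(missing))
print("duplicated:", list(duplicated))
```

For the real data, the index length of 2026 matches the number of hours between the first and last timestamps, so a check like this would be expected to come back empty.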

In [12]:
pollutants.isna().sum()

PM2.5      0
PM10       0
O3       199
CO       145
NO2      145
dtype: int64

In [13]:
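# forward-fill: replace each missing reading with the last observed value
# (DataFrame.ffill() replaces the deprecated fillna(method='ffill'))
cleaned_pollutants = pollutants.ffill()
cleaned_pollutants.isna().sum()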

PM2.5    0
PM10     0
O3       0
CO       0
NO2      0
dtype: int64
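
Forward filling copies the last observed value into every following gap, however long that gap is. The toy example below, with made-up O3 values and hypothetical variable names, sketches what the fill does, how a `limit` would cap it, and how the longest run of consecutive missing hours could be measured before deciding whether a plain forward fill is acceptable.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-12-01", periods=5, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"O3": [18.0, np.nan, np.nan, 30.0, np.nan]}, index=idx)

# Plain forward fill: every gap takes the previous valid reading
filled = toy.ffill()

# Capped forward fill: at most one consecutive missing hour is filled
capped = toy.ffill(limit=1)

# Longest run of consecutive missing values in the column
is_na = toy["O3"].isna()
longest_gap = int(is_na.groupby((~is_na).cumsum()).sum().max())

print(filled["O3"].tolist())
print(capped["O3"].tolist())
print(longest_gap)
```

If the longest gap in the real series spans many hours, a capped fill or time-based interpolation may distort the signal less than an unlimited forward fill; which is appropriate depends on the sensor outages, which this section does not describe.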

In [14]:
cleaned_pollutants.to_csv('cleaned_dataset_updated.csv')
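
A CSV file stores the timestamps as plain text, so the datetime index has to be re-parsed when the cleaned file is loaded for modelling. The round trip below is a minimal sketch using a small stand-in frame and a temporary file (both hypothetical) instead of `cleaned_dataset_updated.csv`.

```python
import os
import tempfile

import pandas as pd

idx = pd.date_range("2021-12-01", periods=3,
                    freq=pd.Timedelta(hours=1), name="datetime")
toy = pd.DataFrame({"PM2.5": [32, 27, 25]}, index=idx)

# Write to a throwaway location, then read it back
path = os.path.join(tempfile.mkdtemp(), "toy_cleaned.csv")
toy.to_csv(path)

# index_col + parse_dates restores a DatetimeIndex from the text column
restored = pd.read_csv(path, index_col="datetime", parse_dates=True)

print(type(restored.index).__name__)
print(restored.index.freq)              # the frequency is not stored in the CSV
print(pd.infer_freq(restored.index))    # but it can be inferred from the spacing
```

The hourly frequency set earlier in the notebook does not survive the export, so code that depends on `index.freq` would need to set it again after loading.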