# Preprocessing

In [2]:
import pandas as pd

> The `NO2` column in this dataset has 36 missing values.

In [3]:
data_no2_covid = pd.read_csv("no2_bojonegoro_post/timeseries.csv")
data_no2_covid.isnull().sum()

date              0
feature_index     0
NO2              36
dtype: int64

### Handling Missing Values with Linear Interpolation

#### Definition
**Interpolation** is a method for estimating missing data values (**missing values**) from the values around them.  
The method assumes that the change between data points is **continuous** and follows a **pattern** that can be estimated.

**Linear interpolation** connects two data points with a straight line. It suits data that changes smoothly at a roughly constant rate.  
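
As a minimal sketch of the idea, the toy series below (made-up values, not the NO2 data) has gaps in the middle and at both ends. Interior gaps are filled along the straight line between their neighbours, and `ffill`/`bfill` cover whatever remains at the edges.

```python
import pandas as pd

nan = float("nan")

# Toy series (made-up values) with gaps in the middle and at both ends.
s = pd.Series([nan, 10.0, nan, nan, 16.0, nan])

# Linear interpolation fills interior gaps on the straight line between neighbours.
filled = s.interpolate(method="linear")

# ffill/bfill then cover any gaps still left at the edges.
filled = filled.ffill().bfill()
print(filled.tolist())
```

The two interior gaps become 12.0 and 14.0, the evenly spaced steps between 10.0 and 16.0, while the edge gaps take the nearest known value.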


In [4]:
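# Parse the date column and use it as the index.
data_no2_covid["date"] = pd.to_datetime(data_no2_covid["date"])
data_no2_covid.set_index("date", inplace=True)
# Fill interior gaps linearly; method='linear' ignores the index and treats rows as equally spaced, in row order.
data_no2_covid["NO2"] = data_no2_covid["NO2"].interpolate(method='linear')
# Fill any gaps remaining at the start or end of the series.
data_no2_covid["NO2"] = data_no2_covid["NO2"].ffill().bfill()
data_no2_covid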

| date | feature_index | NO2 |
|------|---------------|-----|
| 2025-07-09 00:00:00+00:00 | 0 | 0.000020 |
| 2025-07-07 00:00:00+00:00 | 0 | 0.000012 |
| 2025-07-05 00:00:00+00:00 | 0 | 0.000014 |
| 2025-07-06 00:00:00+00:00 | 0 | 0.000017 |
| 2025-07-08 00:00:00+00:00 | 0 | 0.000019 |
| ... | ... | ... |
| 2025-06-08 00:00:00+00:00 | 0 | 0.000022 |
| 2025-06-04 00:00:00+00:00 | 0 | 0.000021 |
| 2025-06-03 00:00:00+00:00 | 0 | 0.000016 |
| 2025-06-01 00:00:00+00:00 | 0 | 0.000016 |
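
The rows shown above are not in chronological order (2025-07-09 appears before 2025-07-07), and `interpolate(method='linear')` works in row order. If the intent is to connect chronological neighbours, one option is to sort the index first. The sketch below uses made-up values and pandas' `method="time"`, which weights by the actual time gap. It is an assumption about intent, not what this notebook does.

```python
import pandas as pd

# Made-up readings, deliberately out of order, with one missing value.
df = pd.DataFrame(
    {
        "date": ["2025-07-04", "2025-07-01", "2025-07-02", "2025-07-05"],
        "NO2": [4.0, 1.0, float("nan"), 5.0],
    }
)
df["date"] = pd.to_datetime(df["date"])

# Put the rows in chronological order before filling gaps.
df = df.set_index("date").sort_index()

# method="time" interpolates in proportion to the time elapsed between rows.
df["NO2"] = df["NO2"].interpolate(method="time")
print(df["NO2"].tolist())
```

Here the missing value on 2025-07-02 is filled from its chronological neighbours (1.0 on 07-01 and 4.0 on 07-04) rather than from the rows that happened to sit next to it in the file.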


In [5]:
data_no2_covid.isnull().sum()

feature_index    0
NO2              0
dtype: int64

In [6]:
data_no2_covid.to_csv("no2_bojonegoro_post/clean_data.csv")

## Building the Supervised Dataset

Building a supervised dataset means converting time-series data into a form that *supervised learning* algorithms such as *K-Nearest Neighbors* or *Linear Regression* can use.

### Basic Concept
In time-series data, a future value (the target) is usually predicted from several preceding values (*lags*).  
Each *lag* represents the value at an earlier time step.

Example:

| lag_3 | lag_2 | lag_1 | target |
|-------|-------|-------|--------|
| 10    | 12    | 14    | 16     |
| 12    | 14    | 16    | 18     |

This means that to predict `target = 16`, the 3 preceding values (`10, 12, 14`) are used.

### General Steps
1. Choose the number of *lags* (for example 3 or 5).
2. Shift the data back by up to *lag* steps to form the columns *lag_1, lag_2, …*.
3. The last column is the *target*, which holds the actual value that follows those *lags*.
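
The steps above can be sketched with pandas' `shift`. The helper name `make_lagged` is hypothetical and is not the function used later in this notebook. The input is the toy sequence from the example table.

```python
import pandas as pd

def make_lagged(values, lag=3):
    """Hypothetical helper: build lag_<lag> ... lag_1 and target columns with shift()."""
    s = pd.Series(values)
    # lag_k holds the value k steps before the target.
    cols = {f"lag_{k}": s.shift(k) for k in range(lag, 0, -1)}
    cols["target"] = s
    # The first `lag` rows lack a full history, so drop them.
    return pd.DataFrame(cols).dropna().reset_index(drop=True)

toy = make_lagged([10, 12, 14, 16, 18], lag=3)
print(toy)
```

With `lag=3` this yields the two rows of the example table: `10, 12, 14 -> 16` and `12, 14, 16 -> 18`.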

### Notes
- More *lags* give the model more information, but also raise the risk of fitting *noise*.
- Choosing the right number of *lags* is important to keep the model simple and accurate.
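
One possible way to get a first impression of which lags carry signal is to look at the autocorrelation at each lag. The sketch below is an illustration on a synthetic series with a period of 4, not a step this notebook performs. It uses `Series.autocorr`, which correlates a series with a shifted copy of itself.

```python
import pandas as pd

# Made-up series with a repeating pattern of period 4 (illustration only).
s = pd.Series([1.0, 3.0, 2.0, 5.0] * 6)

# Correlation between the series and itself shifted by k steps.
acf = {k: s.autocorr(lag=k) for k in range(1, 6)}
for k, value in acf.items():
    print(k, round(value, 3))
```

On this synthetic series the correlation peaks at lag 4, where the pattern repeats. Real data will rarely be this clear-cut, so treat the numbers as a hint rather than a rule.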


In [7]:
import pandas as pd

In [8]:
data_clean_post = pd.read_csv("no2_bojonegoro_post/clean_data.csv")
data_clean_post

|     | date | feature_index | NO2 |
|-----|------|---------------|-----|
| 0 | 2025-07-09 00:00:00+00:00 | 0 | 0.000020 |
| 1 | 2025-07-07 00:00:00+00:00 | 0 | 0.000012 |
| 2 | 2025-07-05 00:00:00+00:00 | 0 | 0.000014 |
| 3 | 2025-07-06 00:00:00+00:00 | 0 | 0.000017 |
| 4 | 2025-07-08 00:00:00+00:00 | 0 | 0.000019 |
| ... | ... | ... | ... |
| 136 | 2025-06-08 00:00:00+00:00 | 0 | 0.000022 |
| 137 | 2025-06-04 00:00:00+00:00 | 0 | 0.000021 |
| 138 | 2025-06-03 00:00:00+00:00 | 0 | 0.000016 |
| 139 | 2025-06-01 00:00:00+00:00 | 0 | 0.000016 |


In [9]:
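def create_columns(lag=5):
    """Return column names lag_<lag>, ..., lag_1, followed by 'target'."""
    columns = []

    # Oldest lag first, so the columns read left to right in time order.
    for i in range(lag, 0, -1):
        columns.append("lag_" + str(i))

    columns.append("target")

    return columns

def create_supervised_data(data, lag=5):
    """Slide a window of lag + 1 values over data; the last value is the target."""
    data_supervised = []
    for i in range(lag, len(data)):
        # Values i-lag .. i-1 are the lags, value i is the target.
        row = data[i-lag:i+1]
        data_supervised.append(row)

    columns = create_columns(lag)

    supervised_df = pd.DataFrame(data_supervised, columns=columns)

    return supervised_df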

In [10]:
no2 = data_clean_post["NO2"].to_list()

> Here I build supervised datasets with lags 1 to 5 and 10, to compare them.
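
The per-lag cells that follow repeat the same steps with a different `lag`. As a sketch, the same tables could be built in one loop. The helper `windows` is a hypothetical stand-in for `create_supervised_data`, and `values` is a placeholder list of 140 numbers, the same length as the cleaned data.

```python
import pandas as pd

def windows(values, lag):
    """Hypothetical stand-in for create_supervised_data: sliding windows of lag + 1 values."""
    names = [f"lag_{k}" for k in range(lag, 0, -1)] + ["target"]
    rows = [values[i - lag:i + 1] for i in range(lag, len(values))]
    return pd.DataFrame(rows, columns=names)

# Placeholder series of 140 values, the same length as the cleaned data.
values = [float(i) for i in range(140)]

# One table per lag; each extra lag costs one row.
tables = {lag: windows(values, lag) for lag in (1, 2, 3, 4, 5, 10)}
for lag, table in tables.items():
    print(lag, table.shape)
```

Each table has `len(values) - lag` rows and `lag + 1` columns, so lag 1 gives 139 rows and lag 10 gives 130.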

## Supervised Data, lag 1

In [11]:
data = data_clean_post["NO2"].to_list()
supervised_df = create_supervised_data(data, lag=1)
supervised_df

|     | lag_1 | target |
|-----|-------|--------|
| 0 | 0.000020 | 0.000012 |
| 1 | 0.000012 | 0.000014 |
| 2 | 0.000014 | 0.000017 |
| 3 | 0.000017 | 0.000019 |
| 4 | 0.000019 | 0.000020 |
| ... | ... | ... |
| 135 | 0.000023 | 0.000022 |
| 136 | 0.000022 | 0.000021 |
| 137 | 0.000021 | 0.000016 |
| 138 | 0.000016 | 0.000016 |


In [12]:
supervised_df.to_csv("no2_bojonegoro_post/supervised_data_lag_1.csv", index=False)

## Supervised Data, lag 2

In [13]:
data = data_clean_post["NO2"].to_list()
supervised_df = create_supervised_data(data, lag=2)
supervised_df

|     | lag_2 | lag_1 | target |
|-----|-------|-------|--------|
| 0 | 0.000020 | 0.000012 | 0.000014 |
| 1 | 0.000012 | 0.000014 | 0.000017 |
| 2 | 0.000014 | 0.000017 | 0.000019 |
| 3 | 0.000017 | 0.000019 | 0.000020 |
| 4 | 0.000019 | 0.000020 | 0.000020 |
| ... | ... | ... | ... |
| 134 | 0.000027 | 0.000023 | 0.000022 |
| 135 | 0.000023 | 0.000022 | 0.000021 |
| 136 | 0.000022 | 0.000021 | 0.000016 |
| 137 | 0.000021 | 0.000016 | 0.000016 |


In [14]:
supervised_df.to_csv("no2_bojonegoro_post/supervised_data_lag_2.csv", index=False)

## Supervised Data, lag 3

In [15]:
data = data_clean_post["NO2"].to_list()
supervised_df = create_supervised_data(data, lag=3)
supervised_df

|     | lag_3 | lag_2 | lag_1 | target |
|-----|-------|-------|-------|--------|
| 0 | 0.000020 | 0.000012 | 0.000014 | 0.000017 |
| 1 | 0.000012 | 0.000014 | 0.000017 | 0.000019 |
| 2 | 0.000014 | 0.000017 | 0.000019 | 0.000020 |
| 3 | 0.000017 | 0.000019 | 0.000020 | 0.000020 |
| 4 | 0.000019 | 0.000020 | 0.000020 | 0.000021 |
| ... | ... | ... | ... | ... |
| 133 | 0.000019 | 0.000027 | 0.000023 | 0.000022 |
| 134 | 0.000027 | 0.000023 | 0.000022 | 0.000021 |
| 135 | 0.000023 | 0.000022 | 0.000021 | 0.000016 |
| 136 | 0.000022 | 0.000021 | 0.000016 | 0.000016 |


In [16]:
supervised_df.to_csv("no2_bojonegoro_post/supervised_data_lag_3.csv", index=False)

## Supervised Data, lag 4

In [17]:
data = data_clean_post["NO2"].to_list()
supervised_df = create_supervised_data(data, lag=4)
supervised_df

|     | lag_4 | lag_3 | lag_2 | lag_1 | target |
|-----|-------|-------|-------|-------|--------|
| 0 | 0.000020 | 0.000012 | 0.000014 | 0.000017 | 0.000019 |
| 1 | 0.000012 | 0.000014 | 0.000017 | 0.000019 | 0.000020 |
| 2 | 0.000014 | 0.000017 | 0.000019 | 0.000020 | 0.000020 |
| 3 | 0.000017 | 0.000019 | 0.000020 | 0.000020 | 0.000021 |
| 4 | 0.000019 | 0.000020 | 0.000020 | 0.000021 | 0.000020 |
| ... | ... | ... | ... | ... | ... |
| 132 | 0.000021 | 0.000019 | 0.000027 | 0.000023 | 0.000022 |
| 133 | 0.000019 | 0.000027 | 0.000023 | 0.000022 | 0.000021 |
| 134 | 0.000027 | 0.000023 | 0.000022 | 0.000021 | 0.000016 |
| 135 | 0.000023 | 0.000022 | 0.000021 | 0.000016 | 0.000016 |


In [18]:
supervised_df.to_csv("no2_bojonegoro_post/supervised_data_lag_4.csv", index=False)

## Supervised Data, lag 5

In [19]:
data = data_clean_post["NO2"].to_list()
supervised_df = create_supervised_data(data, lag=5)
supervised_df

|     | lag_5 | lag_4 | lag_3 | lag_2 | lag_1 | target |
|-----|-------|-------|-------|-------|-------|--------|
| 0 | 0.000020 | 0.000012 | 0.000014 | 0.000017 | 0.000019 | 0.000020 |
| 1 | 0.000012 | 0.000014 | 0.000017 | 0.000019 | 0.000020 | 0.000020 |
| 2 | 0.000014 | 0.000017 | 0.000019 | 0.000020 | 0.000020 | 0.000021 |
| 3 | 0.000017 | 0.000019 | 0.000020 | 0.000020 | 0.000021 | 0.000020 |
| 4 | 0.000019 | 0.000020 | 0.000020 | 0.000021 | 0.000020 | 0.000018 |
| ... | ... | ... | ... | ... | ... | ... |
| 131 | 0.000028 | 0.000021 | 0.000019 | 0.000027 | 0.000023 | 0.000022 |
| 132 | 0.000021 | 0.000019 | 0.000027 | 0.000023 | 0.000022 | 0.000021 |
| 133 | 0.000019 | 0.000027 | 0.000023 | 0.000022 | 0.000021 | 0.000016 |
| 134 | 0.000027 | 0.000023 | 0.000022 | 0.000021 | 0.000016 | 0.000016 |


In [20]:
supervised_df.to_csv("no2_bojonegoro_post/supervised_data_lag_5.csv", index=False)

## Supervised Data, lag 10

In [24]:
data = data_clean_post["NO2"].to_list()
supervised_df = create_supervised_data(data, lag=10)
supervised_df

|     | lag_10 | lag_9 | lag_8 | lag_7 | lag_6 | lag_5 | lag_4 | lag_3 | lag_2 | lag_1 | target |
|-----|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--------|
| 0 | 0.000020 | 0.000012 | 0.000014 | 0.000017 | 0.000019 | 0.000020 | 0.000020 | 0.000021 | 0.000020 | 0.000018 | 0.000018 |
| 1 | 0.000012 | 0.000014 | 0.000017 | 0.000019 | 0.000020 | 0.000020 | 0.000021 | 0.000020 | 0.000018 | 0.000018 | 0.000024 |
| 2 | 0.000014 | 0.000017 | 0.000019 | 0.000020 | 0.000020 | 0.000021 | 0.000020 | 0.000018 | 0.000018 | 0.000024 | 0.000029 |
| 3 | 0.000017 | 0.000019 | 0.000020 | 0.000020 | 0.000021 | 0.000020 | 0.000018 | 0.000018 | 0.000024 | 0.000029 | 0.000029 |
| 4 | 0.000019 | 0.000020 | 0.000020 | 0.000021 | 0.000020 | 0.000018 | 0.000018 | 0.000024 | 0.000029 | 0.000029 | 0.000028 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 126 | 0.000021 | 0.000023 | 0.000027 | 0.000031 | 0.000035 | 0.000028 | 0.000021 | 0.000019 | 0.000027 | 0.000023 | 0.000022 |
| 127 | 0.000023 | 0.000027 | 0.000031 | 0.000035 | 0.000028 | 0.000021 | 0.000019 | 0.000027 | 0.000023 | 0.000022 | 0.000021 |
| 128 | 0.000027 | 0.000031 | 0.000035 | 0.000028 | 0.000021 | 0.000019 | 0.000027 | 0.000023 | 0.000022 | 0.000021 | 0.000016 |
| 129 | 0.000031 | 0.000035 | 0.000028 | 0.000021 | 0.000019 | 0.000027 | 0.000023 | 0.000022 | 0.000021 | 0.000016 | 0.000016 |


In [25]:
supervised_df.to_csv("no2_bojonegoro_post/supervised_data_lag_10.csv", index=False)

## Downloadable Data

lag_1: [supervised_dataset_lag1](./no2_bojonegoro_post/supervised_data_lag_1.csv)  
lag_2: [supervised_dataset_lag2](./no2_bojonegoro_post/supervised_data_lag_2.csv)  
lag_3: [supervised_dataset_lag3](./no2_bojonegoro_post/supervised_data_lag_3.csv)  
lag_4: [supervised_dataset_lag4](./no2_bojonegoro_post/supervised_data_lag_4.csv)  
lag_5: [supervised_dataset_lag5](./no2_bojonegoro_post/supervised_data_lag_5.csv)  
lag_10: [supervised_dataset_lag10](./no2_bojonegoro_post/supervised_data_lag_10.csv)  