# Predictive Analytics Project: PM2.5 Level Prediction in Beijing
---
* **Name:** Wildan Mufid Ramadhan
* **Email:** wildan.20nov@gmail.com
* **Dicoding ID:** wildan.20nov@gmail.com

### Background

Air pollution, particularly PM2.5 particulate matter, has become a serious problem in many large cities around the world, including Beijing. High PM2.5 levels harm human health and the environment. The ability to predict PM2.5 concentrations is therefore important, because it lets authorities and the public take appropriate preventive action.

### Research Questions

This project addresses three main questions:

1.  **Which factors have the strongest influence on PM2.5 pollution levels in Beijing?**
    *   This analysis helps identify the key variables (for example, weather conditions and other pollutants) that correlate strongly with PM2.5 concentration.

2.  **How can a machine learning model be developed to predict PM2.5 levels from weather data and other pollutants?**
    *   The focus is on building an accurate predictive model using regression techniques.

3.  **Which prediction model gives the best PM2.5 estimates: Linear Regression or Random Forest?**
    *   The performance of these models is compared to determine the most effective approach.

### Problem Statements

*   High PM2.5 levels in Beijing call for an accurate prediction system to mitigate health and environmental impacts.
*   Limited understanding of the main drivers of PM2.5 pollution hinders the development of effective control strategies.

### Goals

*   Identify the environmental and meteorological factors that most strongly affect PM2.5 concentration.
*   Develop a machine learning model that predicts PM2.5 levels with high accuracy.
*   Compare the performance of Linear Regression and Random Forest to determine the best model for PM2.5 prediction.

### Solution Statement

To reach these goals, we apply a machine learning approach that compares several regression algorithms. Specifically, we will:

1.  Use **Linear Regression** as a baseline to capture the linear relationship between the features and the target.
2.  Use a **Random Forest Regressor**, which is known to handle non-linear relationships and complex feature interactions.

Model performance is measured with Root Mean Squared Error (RMSE) and R-squared (R2 Score), so that the solution is measurable and reliable.

## Download and Load the Dataset

In [3]:
!curl -L -o ./beijing-multi-site-air-quality-data.zip https://www.kaggle.com/api/v1/datasets/download/aravindpcoder/beijing-multi-site-air-quality-data
!mkdir -p ./data
!unzip beijing-multi-site-air-quality-data.zip -d ./data
!rm beijing-multi-site-air-quality-data.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 7563k  100 7563k    0     0  1978k      0  0:00:03  0:00:03 --:--:-- 2569k
Archive:  beijing-multi-site-air-quality-data.zip
  inflating: ./data/Beijing Multisite air Quality data.csv  


In [4]:
import pandas as pd

df_raw = pd.read_csv("./data/Beijing Multisite air Quality data.csv")
df_raw.head(3)

Unnamed: 0,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
0,2013,3,1,0,4.0,4.0,4.0,7.0,300.0,77.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,Aotizhongxin
1,2013,3,1,1,8.0,8.0,4.0,7.0,300.0,77.0,-1.1,1023.2,-18.2,0.0,N,4.7,Aotizhongxin
2,2013,3,1,2,7.0,7.0,5.0,10.0,300.0,73.0,-1.1,1023.5,-18.2,0.0,NNW,5.6,Aotizhongxin


## Data Understanding

This project uses the "Beijing Multisite Air Quality Data" dataset, which contains air quality and meteorological measurements from 12 monitoring stations in Beijing. It covers hourly observations from 2013 to 2017.

### Dataset Information
*   **Dataset:** Beijing Multi-Site Air-Quality Data
*   **URL:** https://www.kaggle.com/api/v1/datasets/download/aravindpcoder/beijing-multi-site-air-quality-data
*   **Number of samples:** 420,768 entries.
*   **Columns:** 17 columns, including time information (year, month, day, hour), pollutant concentrations (PM2.5, PM10, SO2, NO2, CO, O3), meteorological data (TEMP, PRES, DEWP, RAIN, wd, WSPM), and the station name.
*   **Data types:** A mix of `int64`, `float64`, and `object` (for `wd` and `station`).

### Key Column Descriptions:

*   **PM2.5:** PM2.5 particulate concentration (target variable).
*   **PM10:** PM10 particulate concentration.
*   **SO2, NO2, CO, O3:** Gaseous pollutant concentrations.
*   **TEMP:** Temperature (Celsius).
*   **PRES:** Atmospheric pressure (hPa).
*   **DEWP:** Dew point (Celsius).
*   **RAIN:** Precipitation (mm).
*   **wd:** Wind direction.
*   **WSPM:** Wind speed (m/s).
*   **station:** Name of the monitoring station.

In [9]:
# Display basic information about the dataset
print("Dataset Info:")
df_raw.info()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 420768 entries, 0 to 420767
Data columns (total 17 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   year     420768 non-null  int64  
 1   month    420768 non-null  int64  
 2   day      420768 non-null  int64  
 3   hour     420768 non-null  int64  
 4   PM2.5    412029 non-null  float64
 5   PM10     414319 non-null  float64
 6   SO2      411747 non-null  float64
 7   NO2      408652 non-null  float64
 8   CO       400067 non-null  float64
 9   O3       407491 non-null  float64
 10  TEMP     420370 non-null  float64
 11  PRES     420375 non-null  float64
 12  DEWP     420365 non-null  float64
 13  RAIN     420378 non-null  float64
 14  wd       418946 non-null  object 
 15  WSPM     420450 non-null  float64
 16  station  420768 non-null  object 
dtypes: float64(11), int64(4), object(2)
memory usage: 54.6+ MB


In [21]:
# Display the first few rows of the dataset
print("First 5 Rows:")
df_raw.head()

First 5 Rows:


Unnamed: 0,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
0,2013,3,1,0,4.0,4.0,4.0,7.0,300.0,77.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,Aotizhongxin
1,2013,3,1,1,8.0,8.0,4.0,7.0,300.0,77.0,-1.1,1023.2,-18.2,0.0,N,4.7,Aotizhongxin
2,2013,3,1,2,7.0,7.0,5.0,10.0,300.0,73.0,-1.1,1023.5,-18.2,0.0,NNW,5.6,Aotizhongxin
3,2013,3,1,3,6.0,6.0,11.0,11.0,300.0,72.0,-1.4,1024.5,-19.4,0.0,NW,3.1,Aotizhongxin
4,2013,3,1,4,3.0,3.0,12.0,12.0,300.0,72.0,-2.0,1025.2,-19.5,0.0,N,2.0,Aotizhongxin


In [22]:
# Display descriptive statistics
print("Descriptive Statistics:")
df_raw.describe()

Descriptive Statistics:


Unnamed: 0,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,WSPM
count,420768.0,420768.0,420768.0,420768.0,412029.0,414319.0,411747.0,408652.0,400067.0,407491.0,420370.0,420375.0,420365.0,420378.0,420450.0
mean,2014.66256,6.52293,15.729637,11.5,79.793428,104.602618,15.830835,50.638586,1230.766454,57.372271,13.538976,1010.746982,2.490822,0.064476,1.729711
std,1.177198,3.448707,8.800102,6.922195,80.822391,91.772426,21.650603,35.127912,1160.182716,56.661607,11.436139,10.474055,13.793847,0.821004,1.246386
min,2013.0,1.0,1.0,0.0,2.0,2.0,0.2856,1.0265,100.0,0.2142,-19.9,982.4,-43.4,0.0,0.0
25%,2014.0,4.0,8.0,5.75,20.0,36.0,3.0,23.0,500.0,11.0,3.1,1002.3,-8.9,0.0,0.9
50%,2015.0,7.0,16.0,11.5,55.0,82.0,7.0,43.0,900.0,45.0,14.5,1010.4,3.1,0.0,1.4
75%,2016.0,10.0,23.0,17.25,111.0,145.0,20.0,71.0,1500.0,82.0,23.3,1019.0,15.1,0.0,2.2
max,2017.0,12.0,31.0,23.0,999.0,999.0,500.0,290.0,10000.0,1071.0,41.6,1042.8,29.1,72.5,13.2


In [23]:
# Check for missing values
print("Missing Values:")
df_raw.isnull().sum()

Missing Values:


year           0
month          0
day            0
hour           0
PM2.5       8739
PM10        6449
SO2         9021
NO2        12116
CO         20701
O3         13277
TEMP         398
PRES         393
DEWP         403
RAIN         390
wd          1822
WSPM         318
station        0
dtype: int64
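
The raw counts are easier to compare as shares of the 420,768 rows; CO, for example, is missing in roughly 4.9% of them (20,701 rows). Below is a minimal sketch of that conversion. The helper name `missing_share` and the toy frame are illustrative assumptions, not part of the original notebook, where the function would be called on `df_raw`.

```python
import numpy as np
import pandas as pd

def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)

# Tiny stand-in frame with made-up values; in the notebook you would pass df_raw.
demo = pd.DataFrame({
    "CO": [300.0, np.nan, 500.0, np.nan],
    "TEMP": [1.0, 2.0, np.nan, 4.0],
    "station": ["A", "A", "B", "B"],
})
shares = missing_share(demo)
print(shares.round(2))
```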

In [24]:
# Check unique values in 'station' column
print("Unique Stations:")
print(df_raw['station'].unique())

Unique Stations:
['Aotizhongxin' 'Changping' 'Dingling' 'Dongsi' 'Guanyuan' 'Gucheng'
 'Huairou' 'Nongzhanguan' 'Shunyi' 'Tiantan' 'Wanliu' 'Wanshouxigong']


In [26]:
# Check unique values in 'wd' (wind direction) column
print("Unique Wind Directions:")
print(df_raw['wd'].unique())

Unique Wind Directions:
['NNW' 'N' 'NW' 'NNE' 'ENE' 'E' 'NE' 'W' 'SSW' 'WSW' 'SE' 'WNW' 'SSE'
 'ESE' 'S' 'SW' nan]
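
The first project question asks which variables relate most strongly to PM2.5. One simple way to get a first impression is to rank the numeric columns by their Pearson correlation with the target. The sketch below assumes a hypothetical helper `correlation_with_target` and runs on synthetic data, so its numbers say nothing about the Beijing measurements. Pearson correlation also captures linear association only.

```python
import numpy as np
import pandas as pd

def correlation_with_target(df: pd.DataFrame, target: str = "PM2.5") -> pd.Series:
    """Pearson correlation of every numeric column with `target`, strongest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]

# Synthetic stand-in data (NOT the Beijing measurements); in the notebook, pass df_raw.
rng = np.random.default_rng(0)
pm25 = rng.gamma(shape=2.0, scale=40.0, size=500)
demo = pd.DataFrame({
    "PM2.5": pm25,
    "PM10": 1.3 * pm25 + rng.normal(0, 5, size=500),          # positive by construction
    "WSPM": 3.0 - 0.01 * pm25 + rng.normal(0, 0.5, size=500),  # negative by construction
    "station": ["demo"] * 500,                                 # non-numeric, ignored
})
ranking = correlation_with_target(demo)
print(ranking)
```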


## Data Preparation
Data preparation is crucial for ensuring the quality of the data fed into the machine learning models. The following steps were carried out:

In [27]:
df_prep = df_raw.copy()

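### Handling Missing Values

In [None]:
for col in ["PM2.5", "PM10", "SO2", "NO2", "CO", "O3", "TEMP", "PRES", "DEWP", "RAIN", "WSPM"]:
    # Convert to numeric; values that cannot be parsed become NaN
    df_prep[col] = pd.to_numeric(df_prep[col], errors="coerce")
    # Fill gaps with the column median; assign the result back instead of
    # calling fillna(inplace=True) on a column selection
    df_prep[col] = df_prep[col].fillna(df_prep[col].median())

In [None]:
# Fill missing wind directions with the most frequent direction
df_prep["wd"] = df_prep["wd"].fillna(df_prep["wd"].mode()[0])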

In [31]:
df_prep.isnull().sum()

year       0
month      0
day        0
hour       0
PM2.5      0
PM10       0
SO2        0
NO2        0
CO         0
O3         0
TEMP       0
PRES       0
DEWP       0
RAIN       0
wd         0
WSPM       0
station    0
dtype: int64
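
The loop above also fills the 8,739 missing PM2.5 values, so those rows enter training with an imputed target. One alternative worth considering, which is not what this notebook does, is to drop rows without a target and impute only the features. The sketch below uses a hypothetical helper `impute_features_only` on made-up values.

```python
import numpy as np
import pandas as pd

def impute_features_only(df: pd.DataFrame, target: str = "PM2.5") -> pd.DataFrame:
    """Drop rows with a missing target, then fill numeric features with the
    median and non-numeric features with the most frequent value."""
    out = df.dropna(subset=[target]).copy()
    for col in out.columns:
        if col == target:
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Made-up values for illustration; in the notebook you would pass df_raw.copy().
demo = pd.DataFrame({
    "PM2.5": [10.0, np.nan, 30.0, 40.0],
    "TEMP": [1.0, 2.0, np.nan, 5.0],
    "wd": ["N", "S", None, "N"],
})
clean = impute_features_only(demo)
print(clean)
```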

### Creating Time-Based Features

In [33]:
df_prep["date"] = pd.to_datetime(df_prep[["year", "month", "day", "hour"]])
df_prep["day_of_week"] = df_prep["date"].dt.dayofweek
df_prep["day_of_year"] = df_prep["date"].dt.dayofyear
df_prep["week_of_year"] = df_prep["date"].dt.isocalendar().week.astype(int)
df_prep["quarter"] = df_prep["date"].dt.quarter

In [34]:
df_prep.head()

Unnamed: 0,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,...,DEWP,RAIN,wd,WSPM,station,date,day_of_week,day_of_year,week_of_year,quarter
0,2013,3,1,0,4.0,4.0,4.0,7.0,300.0,77.0,...,-18.8,0.0,NNW,4.4,Aotizhongxin,2013-03-01 00:00:00,4,60,9,1
1,2013,3,1,1,8.0,8.0,4.0,7.0,300.0,77.0,...,-18.2,0.0,N,4.7,Aotizhongxin,2013-03-01 01:00:00,4,60,9,1
2,2013,3,1,2,7.0,7.0,5.0,10.0,300.0,73.0,...,-18.2,0.0,NNW,5.6,Aotizhongxin,2013-03-01 02:00:00,4,60,9,1
3,2013,3,1,3,6.0,6.0,11.0,11.0,300.0,72.0,...,-19.4,0.0,NW,3.1,Aotizhongxin,2013-03-01 03:00:00,4,60,9,1
4,2013,3,1,4,3.0,3.0,12.0,12.0,300.0,72.0,...,-19.5,0.0,N,2.0,Aotizhongxin,2013-03-01 04:00:00,4,60,9,1
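
The derived values in the first rows (day_of_week 4, day_of_year 60, week_of_year 9, quarter 1) can be checked on a single timestamp. This small sketch repeats the same pandas calls on one hand-built row for 1 March 2013, 00:00. The frame `parts` is an illustrative stand-in for the four date columns.

```python
import pandas as pd

# One row with the same date parts as the first record in the dataset.
parts = pd.DataFrame({"year": [2013], "month": [3], "day": [1], "hour": [0]})

# pd.to_datetime assembles a timestamp from columns named year/month/day/hour.
ts = pd.to_datetime(parts)

features = pd.DataFrame({
    "date": ts,
    "day_of_week": ts.dt.dayofweek,                        # Monday=0 ... Sunday=6
    "day_of_year": ts.dt.dayofyear,
    "week_of_year": ts.dt.isocalendar().week.astype(int),  # ISO 8601 week number
    "quarter": ts.dt.quarter,
})
first = features.iloc[0]
print(first.to_dict())
```

Because `dayofweek` counts from Monday as 0, the value 4 means the series starts on a Friday.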


### Encoding Categorical Features

In [35]:
from sklearn.preprocessing import LabelEncoder

In [36]:
le = LabelEncoder()
df_prep["wd_encoded"] = le.fit_transform(df_prep["wd"])

In [38]:
df_prep = pd.get_dummies(df_prep, columns=["station"], prefix="station")

In [39]:
df_prep.head()

Unnamed: 0,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,...,station_Dingling,station_Dongsi,station_Guanyuan,station_Gucheng,station_Huairou,station_Nongzhanguan,station_Shunyi,station_Tiantan,station_Wanliu,station_Wanshouxigong
0,2013,3,1,0,4.0,4.0,4.0,7.0,300.0,77.0,...,False,False,False,False,False,False,False,False,False,False
1,2013,3,1,1,8.0,8.0,4.0,7.0,300.0,77.0,...,False,False,False,False,False,False,False,False,False,False
2,2013,3,1,2,7.0,7.0,5.0,10.0,300.0,73.0,...,False,False,False,False,False,False,False,False,False,False
3,2013,3,1,3,6.0,6.0,11.0,11.0,300.0,72.0,...,False,False,False,False,False,False,False,False,False,False
4,2013,3,1,4,3.0,3.0,12.0,12.0,300.0,72.0,...,False,False,False,False,False,False,False,False,False,False
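
`LabelEncoder` numbers the wind directions in alphabetical order of their labels, not in compass order, so the integer codes carry no directional meaning. A possible alternative, shown here only as a sketch and not used in this notebook, is to map each label to its bearing and keep the sine and cosine of that angle. The names `COMPASS`, `DEGREES` and `encode_wind_direction` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# The 16 compass points, clockwise from north, 22.5 degrees apart.
COMPASS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
DEGREES = {name: i * 22.5 for i, name in enumerate(COMPASS)}

def encode_wind_direction(wd: pd.Series) -> pd.DataFrame:
    """Map compass labels to (sin, cos) of their bearing so that neighbouring
    directions stay close to each other, including NNW and N."""
    radians = np.deg2rad(wd.map(DEGREES).astype(float))
    return pd.DataFrame(
        {"wd_sin": np.sin(radians), "wd_cos": np.cos(radians)},
        index=wd.index,
    )

demo = pd.Series(["N", "E", "S", "NNW"])
encoded = encode_wind_direction(demo)
print(encoded.round(3))
```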


### Feature Scaling

In [43]:
from sklearn.preprocessing import StandardScaler

In [None]:
df_prep.drop(columns=["year", "month", "day", "hour", "wd", "date"], inplace=True)

In [41]:
X = df_prep.drop("PM2.5", axis=1)
y = df_prep["PM2.5"]

In [44]:
numerical_cols = ["PM10", "SO2", "NO2", "CO", "O3", "TEMP", "PRES", "DEWP", "RAIN", "WSPM", "day_of_year", "week_of_year"]

scaler = StandardScaler()

X[numerical_cols] = scaler.fit_transform(X[numerical_cols])

In [49]:
X.describe()

Unnamed: 0,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,WSPM,day_of_week,day_of_year,week_of_year,quarter,wd_encoded
count,420768.0,420768.0,420768.0,420768.0,420768.0,420768.0,420768.0,420768.0,420768.0,420768.0,420768.0,420768.0,420768.0,420768.0,420768.0
mean,1.9453600000000003e-17,3.377361e-18,-2.5397760000000003e-17,1.999398e-17,1.9453600000000003e-17,-3.4584180000000005e-17,-2.3776620000000003e-17,-7.565289e-18,-9.456611e-19,2.9990970000000006e-17,3.000684,-2.2695870000000002e-17,-4.7553240000000006e-17,2.508556,6.78144
std,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,2.0012,1.000001,1.000001,1.117084,4.562535
min,-1.122355,-0.7157132,-1.425788,-0.9831958,-1.017299,-2.925431,-2.707636,-3.328543,-0.0784963,-1.388071,0.0,-1.72731,-1.701007,1.0,0.0
25%,-0.7491737,-0.5891995,-0.762619,-0.6303195,-0.8060929,-0.9133155,-0.8068137,-0.8262284,-0.0784963,-0.6657282,1.0,-0.8642491,-0.8389293,2.0,3.0
50%,-0.2442819,-0.4027662,-0.2141515,-0.2774431,-0.2147199,0.08399403,-0.03311231,0.04414188,-0.0784963,-0.2644267,3.0,-0.001187958,0.02314851,3.0,6.0
75%,0.4362244,0.1565337,0.5652496,0.2518713,0.4124939,0.8450987,0.7883484,0.9145121,-0.0784963,0.3776557,5.0,0.8618732,0.8852263,4.0,11.0
max,9.820626,22.57514,6.915926,7.750493,18.1716,2.454791,3.061693,1.929944,88.26881,9.206289,6.0,1.734419,1.747304,4.0,15.0
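
Here the scaler is fitted on every row before the data is split, so the means and standard deviations it learns also reflect rows that later land in the validation and test sets. A common way to avoid this is to split first and fit the scaler on the training rows only. The sketch below illustrates that order on synthetic data. `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`, and the approach is a suggestion, not the procedure used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features (made-up values); in the notebook these would be X and y.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "TEMP": rng.normal(13.5, 11.4, size=1000),
    "PRES": rng.normal(1010.7, 10.5, size=1000),
})
y_demo = pd.Series(rng.gamma(2.0, 40.0, size=1000), name="PM2.5")

# 1) Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# 2) ... then learn mean and standard deviation from the training rows only ...
scaler = StandardScaler().fit(X_tr)

# 3) ... and apply that same transformation to both parts.
X_tr_scaled = pd.DataFrame(scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.mean().round(6))  # close to 0 by construction
print(X_te_scaled.mean().round(3))  # near 0, but generally not exactly 0
```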


### Data Splitting

In [52]:
from sklearn.model_selection import train_test_split

In [53]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print("\nShape of X_train:", X_train.shape)
print("Shape of X_val:", X_val.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_val:", y_val.shape)
print("Shape of y_test:", y_test.shape)


Shape of X_train: (252460, 27)
Shape of X_val: (84154, 27)
Shape of X_test: (84154, 27)
Shape of y_train: (252460,)
Shape of y_val: (84154,)
Shape of y_test: (84154,)
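
The two calls first hold out 20% of the rows for testing and then take 25% of the remaining 80% for validation, which gives a 60/20/20 split. The row counts printed above can be reproduced without the real data by splitting a plain index array of the same length. The variable names below are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 420_768            # number of rows in the dataset
idx = np.arange(n_rows)     # stand-in for the rows of X

# 20% test, then 25% of the remaining 80% (= 20% overall) for validation.
rest, test = train_test_split(idx, test_size=0.2, random_state=42)
train, val = train_test_split(rest, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))
print(round(len(train) / n_rows, 3), round(len(val) / n_rows, 3), round(len(test) / n_rows, 3))
```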


## Modeling

At this stage, two machine learning models are implemented to predict PM2.5 levels: Linear Regression and a Random Forest Regressor.

In [60]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

### Linear Regression

In [54]:
from sklearn.linear_model import LinearRegression

In [56]:
print("Training Linear Regression Model...")
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

Training Linear Regression Model...


In [61]:
y_pred_linear = linear_model.predict(X_val)
mse_linear = mean_squared_error(y_val, y_pred_linear)
rmse_linear = np.sqrt(mse_linear)
r2_linear = r2_score(y_val, y_pred_linear)

print(f"Linear Regression - RMSE on Validation Set: {rmse_linear:.4f}")
print(f"Linear Regression - R2 Score on Validation Set: {r2_linear:.4f}")

Linear Regression - RMSE on Validation Set: 31.6726
Linear Regression - R2 Score on Validation Set: 0.8453


### Random Forest Regressor

In [62]:
from sklearn.ensemble import RandomForestRegressor

In [63]:
print("Training Random Forest Regressor Model...")

rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1) 
rf_model.fit(X_train, y_train)

Training Random Forest Regressor Model...


In [65]:
y_pred_rf = rf_model.predict(X_val)
mse_rf = mean_squared_error(y_val, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(y_val, y_pred_rf)

print(f"Random Forest - RMSE on Validation Set: {rmse_rf:.4f}")
print(f"Random Forest - R2 Score on Validation Set: {r2_rf:.4f}")

Random Forest - RMSE on Validation Set: 18.0848
Random Forest - R2 Score on Validation Set: 0.9495
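
A fitted random forest also offers one way to approach the first project question, because scikit-learn exposes impurity-based importances through the `feature_importances_` attribute. The sketch below runs on synthetic data with made-up column effects, so its ranking is not a result for the Beijing data. In the notebook, the equivalent call would be `pd.Series(rf_model.feature_importances_, index=X_train.columns)`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (NOT the Beijing measurements): the target depends
# strongly on "PM10", weakly on "TEMP", and not at all on "noise".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "PM10": rng.uniform(0, 300, size=600),
    "TEMP": rng.uniform(-10, 35, size=600),
    "noise": rng.normal(size=600),
})
y_demo = 0.7 * X_demo["PM10"] + 0.3 * X_demo["TEMP"] + rng.normal(0, 5, size=600)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_demo, y_demo)

# feature_importances_ holds impurity-based importances that sum to 1.
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances tend to favour features with many distinct values, so they are best read as a rough ranking.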


## Evaluation

Models are evaluated with Root Mean Squared Error (RMSE) and R-squared (R2 Score) on the validation data. RMSE is the square root of the mean squared prediction error. It is expressed in the same units as PM2.5 and penalizes large errors more heavily than small ones; lower values indicate better performance. R2 Score is the proportion of variance in the dependent variable that the model explains; higher values indicate a better fit.
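
To make the two metrics concrete, the sketch below computes them by hand on four toy values and compares the results with scikit-learn's `mean_squared_error` and `r2_score`. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # toy observations
y_pred = np.array([2.0, 5.0, 8.0, 9.0])   # toy predictions

residuals = y_true - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))           # square root of the mean squared error
ss_res = np.sum(residuals ** 2)                   # variation left unexplained by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print(round(float(rmse), 4), round(float(r2), 4))  # 0.7071 0.9

# Same values as the library functions used in the notebook.
print(np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred))),
      np.isclose(r2, r2_score(y_true, y_pred)))
```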

### Evaluation Results on the Validation Set:

**Linear Regression:**
*   RMSE: 31.6726
*   R2 Score: 0.8453

**Random Forest:**
*   RMSE: 18.0848
*   R2 Score: 0.9495

### Model Comparison and Selection:

The results above show that the **Random Forest** model clearly outperforms Linear Regression on the validation set. Its RMSE (18.0848) is about 43% lower than that of Linear Regression (31.6726), so its predictions deviate less from the observed values. Its R2 Score (0.9495) is also higher than that of Linear Regression (0.8453): Random Forest explains almost 95% of the variance in PM2.5 levels, compared with about 84.5% for Linear Regression.

Based on this comparison, **Random Forest is selected as the best model** for predicting PM2.5 levels in Beijing among the models that were trained.
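
The notebook sets aside a test split but reports validation metrics only. A common final step is to score the selected model once on the held-out test rows. The sketch below assumes a hypothetical helper `evaluate_regressor` and uses synthetic data, so it shows no test result for the actual models. In the notebook it would be called as `evaluate_regressor(rf_model, X_test, y_test)`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate_regressor(model, X, y) -> dict:
    """Return RMSE and R2 of an already fitted regressor on the given rows."""
    pred = model.predict(X)
    return {
        "rmse": float(np.sqrt(mean_squared_error(y, pred))),
        "r2": float(r2_score(y, pred)),
    }

# Synthetic, noise-free linear data (made up), so a linear model fits it almost exactly.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
fitted = LinearRegression().fit(X_tr, y_tr)

test_metrics = evaluate_regressor(fitted, X_te, y_te)
print(test_metrics)
```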