## Notebook 4 – Summary and Visual Insights  

**LuftDataQC: PM2.5 Air Quality in Oslo (2023)**  
Source: [https://api.nilu.no](https://api.nilu.no)  

---

### EN: Project overview  
**Goal:** Present a structured and visual summary of the full LuftDataQC pipeline, combining results from data collection, exploratory analysis, and anomaly detection.  
**Method:** Summarize outputs from Notebooks 1-3, integrate visual storytelling, and highlight critical insights and limitations.  
**Tools (full pipeline):** Python (`requests`, `json`, `pathlib`, `sqlite3`, `pandas`, `numpy`,  
`matplotlib`, `seaborn`, `scikit-learn`, `plotly`, `pyarrow`), Jupyter Notebook  

---

### Reproducibility - quick reference  

**Inputs (from previous notebooks):**  
- `data/processed/pm25_2023.sqlite`  
- Figures generated in Notebooks 1-3  

**Outputs (this notebook):**  
- `04_report.ipynb` *(portfolio summary with figures)*  

**Parameters (this notebook):**  
- **Year:** 2023
- **Stations:** Skøyen, Furulund  
- **Scope:** Final report and visual storytelling.

## Introduction

This report presents the results of the **LuftDataQC** project, which analyzed hourly PM2.5 concentrations measured by the Norwegian Institute for Air Research (NILU) in Oslo during 2023. The project demonstrates a complete data science workflow from raw data acquisition to anomaly detection and showcases practical skills in **API integration, SQL-based storage, exploratory data analysis (EDA), statistical testing, and machine learning (ML)**.

Two NILU stations were selected for detailed study due to their contrasting environments:

* **Skøyen** — urban traffic influence.
* **Furulund** — residential background.

PM2.5 is a key pollutant associated with cardiopulmonary morbidity and mortality, making it an important target for transparent analysis and decision support.

---

## Data Collection and Storage (Notebook 1)

Hourly PM2.5 measurements for all NILU Oslo stations in 2023 were collected via the NILU Open API and stored in a **SQLite** database for reproducibility and efficient querying.
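
The storage step can be sketched as follows. This is a minimal illustration, not the project's actual code: the table name `pm25`, its columns, and the sample records are assumptions, and an in-memory database stands in for `data/processed/pm25_2023.sqlite`. A composite primary key with `INSERT OR IGNORE` is one way to keep repeated API pulls from creating duplicate rows.

```python
import sqlite3

# Hypothetical records shaped like (station, ISO timestamp, PM2.5 in µg/m³).
records = [
    ("Skøyen", "2023-01-01T00:00:00", 12.4),
    ("Skøyen", "2023-01-01T01:00:00", 15.1),
    ("Furulund", "2023-01-01T00:00:00", 6.3),
    ("Skøyen", "2023-01-01T00:00:00", 12.4),  # duplicate on purpose
]


def store_measurements(conn, rows):
    """Create the (assumed) table if needed and insert rows, skipping duplicates."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pm25 (
            station   TEXT NOT NULL,
            timestamp TEXT NOT NULL,
            value     REAL,
            PRIMARY KEY (station, timestamp)
        )
        """
    )
    conn.executemany("INSERT OR IGNORE INTO pm25 VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the file on disk
store_measurements(conn, records)

# Row count per station; the duplicate record is ignored.
counts = dict(conn.execute("SELECT station, COUNT(*) FROM pm25 GROUP BY station"))
print(counts)
```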

![PM2.5 Station Coverage – 2023](../results/pm25_station_coverage_2023.png)
*Figure: PM2.5 Station Coverage (2023).* **[View interactive chart (HTML)](../results/pm25_station_coverage_2023.html)**

---

## Exploratory Data Analysis (Notebook 2)

### Data Quality and Distribution

Quality control included checks for missing values, duplicates, and plausible value ranges. Distributional differences between stations were assessed to inform statistical choices.
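
The three checks can be expressed compactly with pandas. The sketch below is illustrative only: the column names (`station`, `timestamp`, `value`) and the plausibility bounds of 0 to 500 µg/m³ are assumptions, not values taken from Notebook 2.

```python
import pandas as pd


def qc_summary(df, value_col="value", lower=0.0, upper=500.0):
    """Count missing values, duplicate station/time rows, and out-of-range values.

    The bounds are illustrative assumptions, not project settings.
    """
    values = df[value_col]
    return {
        "missing": int(values.isna().sum()),
        "duplicates": int(df.duplicated(subset=["station", "timestamp"]).sum()),
        "out_of_range": int(((values < lower) | (values > upper)).sum()),
    }


# Small synthetic sample with one problem of each kind.
sample = pd.DataFrame({
    "station": ["Skøyen", "Skøyen", "Skøyen", "Furulund", "Furulund"],
    "timestamp": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 00:00", "2023-01-01 01:00",
        "2023-01-01 00:00", "2023-01-01 01:00",
    ]),
    "value": [12.4, 12.4, None, -3.0, 7.1],
})

report = qc_summary(sample)
print(report)
```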

![PM2.5 Distribution – Boxplot + Histogram (Skøyen vs. Furulund)](../results/pm25_distribution_box_hist_skøyen_furulund.png)
*Figure: PM2.5 Distribution – Boxplot + Histogram (Skøyen vs. Furulund)* 

**Rationale for non-parametric testing:** PM2.5 distributions were skewed and non-normal; therefore, the **Mann–Whitney U** test was preferred over a t-test due to its robustness to outliers and lack of normality assumptions.
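
As a hedged illustration of the test itself, the snippet below applies `scipy.stats.mannwhitneyu` to two synthetic, right-skewed samples. The lognormal parameters are arbitrary stand-ins chosen to mimic skewed concentration data; they are not estimates from the NILU measurements.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)

# Synthetic, right-skewed stand-ins for the two stations (not project data).
skoyen = rng.lognormal(mean=2.3, sigma=0.6, size=500)
furulund = rng.lognormal(mean=1.9, sigma=0.5, size=500)

# Two-sided test: do the two samples come from the same distribution?
stat, p_value = mannwhitneyu(skoyen, furulund, alternative="two-sided")
print(f"U = {stat:.0f}, p = {p_value:.3g}")
```

Because the test works on ranks, a handful of extreme hourly values shifts the statistic far less than it would shift a difference of means.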

### Hourly Patterns

Diurnal behavior reflects traffic and activity cycles.

![PM2.5 Hourly Trends – Skøyen vs. Furulund](../results/pm25_hourly_trends_skøyen_furulund.png)
*Figure: PM2.5 Hourly Trends (Skøyen vs. Furulund)* 

### Monthly Patterns

Seasonality shows winter elevations and summer lows, consistent with meteorological drivers and emissions patterns.

![PM2.5 Monthly Trends – Skøyen vs. Furulund](../results/pm25_monthly_trends_skøyen_furulund.png)
*Figure: PM2.5 Monthly Trends (Skøyen vs. Furulund)* 

### Weekday vs Weekend

Lower weekend concentrations align with reduced traffic and human activity.

![PM2.5 – Weekday vs. Weekend (2023)](../results/pm25_weekday_vs_weekend.png)
*Figure: Weekday vs. Weekend (2023)*  
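
The hourly, monthly, and weekday/weekend views above all reduce to grouped means over parts of the timestamp. A minimal sketch, assuming a DataFrame with `timestamp` and `value` columns (hypothetical names) and a synthetic toy signal rather than project data:

```python
import numpy as np
import pandas as pd

# Two weeks of synthetic hourly data starting on a Monday; not project data.
idx = pd.date_range("2023-01-02", periods=24 * 14, freq="h")
hour = idx.hour.to_numpy()
is_weekend = idx.dayofweek.to_numpy() >= 5

# Toy signal: higher at two rush hours, lower on weekends.
values = 8.0 + 4.0 * np.isin(hour, [8, 16]) - 2.0 * is_weekend
df = pd.DataFrame({"timestamp": idx, "value": values})

# Mean concentration by hour of day, by month, and by day type.
hourly_mean = df.groupby(df["timestamp"].dt.hour)["value"].mean()
monthly_mean = df.groupby(df["timestamp"].dt.month)["value"].mean()
day_type = np.where(df["timestamp"].dt.dayofweek >= 5, "weekend", "weekday")
day_type_mean = df.groupby(day_type)["value"].mean()

print(hourly_mean.loc[[3, 8]])
print(day_type_mean)
```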

### Trend and Smoothing

Rolling means highlight underlying trends beyond short-term variability.

![PM2.5 Trend – Rolling Mean (Skøyen vs. Furulund)](../results/pm25_trend_rolling_mean_skøyen_furulund.png)
*Figure: Trend - Rolling Mean (Skøyen vs. Furulund)*  
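
A rolling mean of this kind can be computed with a time-based window in pandas. The 7-day window and the `min_periods` value below are assumptions made for illustration; the window used in Notebook 2 is not stated here, and the series is synthetic.

```python
import numpy as np
import pandas as pd

# Thirty days of synthetic hourly values around a constant level; not project data.
idx = pd.date_range("2023-01-01", periods=24 * 30, freq="h")
rng = np.random.default_rng(0)
series = pd.Series(10 + rng.normal(0, 3, len(idx)), index=idx, name="pm25")

# Time-based 7-day window; min_periods avoids means built on very few points.
rolling_7d = series.rolling("7D", min_periods=24).mean()

print(series.std(), rolling_7d.dropna().std())
```

A time-based window (`"7D"`) rather than a fixed row count keeps the averaging period constant even when some hours are missing.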

**Interpretation:** Skøyen consistently exhibits higher levels and greater variability than Furulund, aligning with its urban setting. The Mann–Whitney U test indicated a statistically significant difference between station distributions.

---

## Feature Engineering & Anomaly Detection (Notebook 3)
 
This proof-of-concept illustrates how machine learning can support **real-time air-quality monitoring** by flagging unusual hourly PM2.5 values for expert inspection. Flagged anomalies may reflect either genuine pollution episodes or **sensor irregularities**.

### Step 1 - Contamination Validation  
IsolationForest was evaluated with contamination values of **0.5%, 1%, 2%, and 5%** to calibrate model sensitivity.  
Visual assessment indicated that a contamination rate of **1%** provided the most appropriate balance between sensitivity to rare pollution spikes and control of false positives.
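
The comparison can be sketched by fitting one model per contamination value and counting the flagged points. The data below are synthetic and the single-feature setup is an assumption; the notebook's actual inputs and plotting code may differ.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic right-skewed values as a stand-in for hourly PM2.5 (not project data).
rng = np.random.default_rng(1)
values = rng.lognormal(mean=2.0, sigma=0.5, size=2000).reshape(-1, 1)

flag_counts = {}
for contamination in (0.005, 0.01, 0.02, 0.05):
    model = IsolationForest(contamination=contamination, random_state=42)
    labels = model.fit_predict(values)  # -1 = anomaly, 1 = normal
    flag_counts[contamination] = int((labels == -1).sum())

print(flag_counts)
```

The `contamination` parameter only moves the score threshold, so the number of flagged points grows roughly in proportion to it. That is why the choice was made by visual assessment rather than by a metric.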

![PM2.5 Contamination Comparison – Skøyen](../results/pm25_skøyen_contamination_comparison.png)  
*Figure: Contamination Comparison – Skøyen*  

![PM2.5 Contamination Comparison – Furulund](../results/pm25_furulund_contamination_comparison.png)
*Figure: Contamination Comparison – Furulund*  

### Step 2 - Temporal Feature Engineering  
Time-based features were derived from the timestamp, including hour, weekday, and weekend flags (with human-readable names for visualization).  
These features provide minimal yet meaningful temporal context, helping to distinguish **expected variations** (e.g., rush hours, seasonal patterns) from **atypical events**.
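
A minimal sketch of this step, assuming a `timestamp` column of datetime type. The function name and the output column names are hypothetical, not those used in Notebook 3.

```python
import pandas as pd


def add_time_features(df, ts_col="timestamp"):
    """Return a copy of df with hour, weekday, weekday name, and weekend flag."""
    out = df.copy()
    ts = out[ts_col]
    out["hour"] = ts.dt.hour
    out["weekday"] = ts.dt.dayofweek          # 0 = Monday, 6 = Sunday
    out["weekday_name"] = ts.dt.day_name()    # human-readable label for plots
    out["is_weekend"] = (out["weekday"] >= 5).astype(int)
    return out


demo = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2023-01-06 08:00", "2023-01-07 17:00"])}
)
features = add_time_features(demo)
print(features)
```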

### Results (1% contamination)  
With the selected parameter, IsolationForest identified candidate anomalies at both stations. Visual inspection shows that some anomalies correspond to plausible pollution spikes, while others may be linked to sensor artifacts.  
**Domain expertise remains essential** to interpret flagged events in context.
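
To show how flagged rows might be handed to a reviewer, the sketch below fits IsolationForest at 1% contamination on the value plus two temporal features, then ranks the candidates by anomaly score. The data, the injected spikes, and the column names are all synthetic assumptions; nothing here reproduces the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Sixty days of synthetic hourly values; not project data.
idx = pd.date_range("2023-01-01", periods=24 * 60, freq="h")
rng = np.random.default_rng(7)
df = pd.DataFrame({"timestamp": idx, "value": rng.lognormal(2.0, 0.4, len(idx))})
df.loc[[100, 700, 1200], "value"] = [150.0, 180.0, 200.0]  # injected spikes

df["hour"] = df["timestamp"].dt.hour
df["is_weekend"] = (df["timestamp"].dt.dayofweek >= 5).astype(int)

features = df[["value", "hour", "is_weekend"]]
model = IsolationForest(contamination=0.01, random_state=42)
df["anomaly"] = model.fit_predict(features) == -1
df["score"] = model.decision_function(features)  # lower = more unusual

# Candidates for expert review, most unusual first.
candidates = df[df["anomaly"]].sort_values("score")
print(candidates[["timestamp", "value", "score"]].head())
```

Keeping the score next to the boolean flag lets a reviewer start with the most unusual hours instead of treating every flagged point as equally suspect.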

![PM2.5 Anomaly Detection - Skøyen](../results/pm25_skøyen_anomalies.png) 
*Figure: PM2.5 Anomaly Detection – Skøyen*  

![PM2.5 Anomaly Detection - Furulund](../results/pm25_furulund_anomalies.png)
*Figure: PM2.5 Anomaly Detection – Furulund*  

---

## Insights

* **Urban–residential contrast:** Skøyen records higher and more variable PM2.5 than Furulund.  
* **Temporal structure:** Clear diurnal and seasonal cycles align with traffic intensity and winter conditions.  
* **ML-assisted QA/QC (Quality Assurance / Quality Control):** IsolationForest effectively highlights candidate anomalies for expert review.  
* **Statistical evidence:** Non-parametric tests confirm significant differences between the two stations.

---

## Scope and Future Perspective

* **Baseline:** Analysis focused on NILU-validated 2023 data. Data from 2024 were excluded because QA/QC validation was not complete at the time of analysis.  
* **Coverage:** The study is limited to two Oslo stations as a case study; the pipeline is designed to scale to multiple stations and longer timeframes.  
* **Potential extensions:**  
  - Incorporate meteorological variables (temperature, wind, humidity) to refine anomaly interpretation.  
  - Extend to multi-year, multi-station analysis for broader applicability.  
  - Benchmark alternative anomaly detection methods (e.g., One-Class SVM, autoencoders).

---

## Conclusion

This pilot analysis demonstrates the feasibility of applying anomaly detection techniques to environmental monitoring.  
The established pipeline (**data ingestion → EDA → feature engineering → anomaly detection → visualization**) is reproducible and scalable.  
Flagged observations are **candidates for expert review** rather than final labels, underscoring the need for contextual interpretation before operational use.


**End-to-end outcomes.**  

- **Acquisition and storage:** Pulled 2023 hourly PM2.5 from NILU Open API and persisted to **SQLite** for auditable, portable analysis.  
- **Exploration and testing:** Ran QC and EDA (distribution, hourly, monthly, weekday vs weekend, rolling trends); chose **Mann–Whitney U** for non-normal distributions.  
- **Modeling:** Engineered time-based features and applied **IsolationForest** to surface candidate anomalies for expert review.  
- **Visual storytelling:** Consolidated figures highlight contrasts between **Skøyen** (urban) and **Furulund** (residential), supporting transparent interpretation.

**Analytical insights.**  
- **Contrast:** Skøyen shows generally higher and more variable PM2.5 than Furulund across temporal views.  
- **Temporal structure:** Diurnal and seasonal patterns align with human activity and winter conditions.  
- **Statistical evidence:** Non-parametric testing supports significant station differences.  
- **ML-assisted QA/QC:** IsolationForest is effective for screening unusual observations prior to operational use.

**Reproducibility and artifacts.**  
- Database: `data/processed/pm25_2023.sqlite`  
- Results (figures): see **Figure Index** below  
- Notebooks (pipeline): see **Navigation Links** below  
- Parameters: Year - 2023, Stations - Skøyen, Furulund (no hard-coded coverage assumptions)


## References

- McDonald, J.H. (2014). *Handbook of Biological Statistics* (3rd ed.). (Mann–Whitney U)  
- Ghasemi, A., and Zahediasl, S. (2012). Normality tests for statistical analysis: A guide for non-statisticians. *International Journal of Endocrinology and Metabolism, 10*(2), 486–489.  
- Wilcox, R.R. (2012). *Introduction to Robust Estimation and Hypothesis Testing* (3rd ed.).  
- Liu, F.T., Ting, K.M., and Zhou, Z.H. (2008). Isolation Forest. *KDD*.  
- SciPy Documentation - `scipy.stats.mannwhitneyu`  
- scikit-learn Documentation - `sklearn.ensemble.IsolationForest`

---

## Navigation Links

- [Notebook 1 – Data Collection, Inspection and Storage](./01_data_sqlite.ipynb)
- [Notebook 2 – Exploratory Analysis and Quality Checks](./02_exploratory_qc.ipynb)
- [Notebook 3 – Feature Engineering and Anomaly Detection](./03_features_anomalies.ipynb)
- [Notebook 5 – Summary Report (Norwegian)](./05_report_norsk.ipynb)

---

## Figure Index (file paths)

**Notebook 1** 

`../results/pm25_station_coverage_2023.png`

**Notebook 2**

`../results/pm25_distribution_box_hist_skøyen_furulund.png`  
`../results/pm25_hourly_trends_skøyen_furulund.png`  
`../results/pm25_monthly_trends_skøyen_furulund.png`  
`../results/pm25_weekday_vs_weekend.png`  
`../results/pm25_trend_rolling_mean_skøyen_furulund.png`

**Notebook 3**

`../results/pm25_skøyen_contamination_comparison.png`  
`../results/pm25_furulund_contamination_comparison.png`  
`../results/pm25_skøyen_anomalies.png`  
`../results/pm25_furulund_anomalies.png`  
`../results/pm25_skøyen_with_anomalies.csv`  
`../results/pm25_furulund_with_anomalies.csv`
