# Predicting precipitation in Vancouver

by Dan Zhang, Doris (Yun Yi) Cai, Hayley (Yi) Han & Sivakorn (Oak) Chong 2023/11/30

```python
import pandas as pd
from myst_nb import glue
import pickle
#from sklearn import set_config
```


# Summary

Our project investigates the prediction of daily precipitation in Vancouver using machine learning methods. Using a dataset spanning 1990 to 2023, we explored the predictive power of key environmental and climate features such as temperature, wind speed, and evapotranspiration. Our results suggest that the best classification model is a Support Vector Machine with a Radial Basis Function kernel (SVM RBF) and the hyperparameter C=10.0. The model achieved an F1 score of 0.87 on the positive class (precipitation is present) when generalized to unseen data, suggesting high accuracy in precipitation prediction. We also explored feature importance, which showed ET₀ reference evapotranspiration and the cosine transformation of month to be robust predictors. Hyperparameter optimization did not improve our current model, indicating a potential need for feature engineering or for incorporating more features. Our project presents a reliable model for predicting precipitation, with potential practical applications in various fields.

# 1. Introduction 

Prediction of daily precipitation is a fundamental aspect of meteorological studies {cite:p}`new2001precipitation`. Accurate precipitation prediction is crucial for agriculture, water resources management, as well as daily activities of people. Specifically, in a geographically and climatically diverse region like Vancouver, predicting precipitation is vital for people to prepare for extreme weather events, reducing hazards and minimizing property damage.

In this project, we aim to predict the occurrence of daily precipitation in Vancouver using machine learning (ML) classification methods {cite:p}`ortiz2014accurate`. Specifically, our analysis utilizes a dataset containing daily precipitation information in Vancouver from 1990 to the present (i.e., 6 Nov, 2023). This dataset, sourced from Open-Meteo’s Historical Weather API {cite:p}`Zippenfenig2023open`, includes a number of parameters relevant to precipitation prediction. 
Key parameters include month, daily temperature measures, wind speeds, wind direction, shortwave radiation, and ET₀ reference evapotranspiration. Specifically, shortwave radiation represents the sum of solar energy received in a day; ET₀ reference evapotranspiration indicates the atmospheric demand for moisture (i.e., higher relative humidity reduces ET₀); and month is included as a variable because it accounts for seasonal variation in precipitation {cite:p}`pal2000simulation`. This project may contribute insights into accurate forecasting of precipitation in Vancouver.

# 2. Methods & Results

## 2.1 Data

The dataset used in this project was sourced from Open-Meteo’s Historical Weather API {cite:p}`Zippenfenig2023open`, which can be found [here](https://open-meteo.com/en/docs/historical-weather-api#latitude=49.2497&longitude=-123.1193&hourly=weather_code&daily=weather_code,temperature_2m_max,temperature_2m_min,temperature_2m_mean,apparent_temperature_max,apparent_temperature_min,apparent_temperature_mean,sunrise,sunset,precipitation_sum,rain_sum,snowfall_sum,precipitation_hours,wind_speed_10m_max,wind_gusts_10m_max,wind_direction_10m_dominant,shortwave_radiation_sum,et0_fao_evapotranspiration&timezone=auto). Each row in the dataset represents one day of weather records for Vancouver, with various parameters relevant to precipitation. The parameters used in the following analysis are listed below with a short description.

#### Column description  
- `temperature_2m_max`: Maximum daily air temperature at 2 meters above ground (°C)
- `temperature_2m_min`: Minimum daily air temperature at 2 meters above ground (°C)
- `temperature_2m_mean`: Mean daily air temperature at 2 meters above ground (°C)
- `apparent_temperature_max`: Maximum daily apparent temperature (°C)
- `apparent_temperature_min`: Minimum daily apparent temperature (°C)
- `apparent_temperature_mean`: Mean daily apparent temperature (°C)
- `precipitation_sum`: Sum of daily precipitation (including rain, showers and snowfall) (mm)
- `wind_speed_10m_max`: Maximum wind speed on a day (km/h)
- `wind_gusts_10m_max`: Maximum wind gusts on a day (km/h)
- `wind_direction_10m_dominant`: Dominant wind direction (°)
- `shortwave_radiation_sum`: Sum of solar radiation on a given day (MJ/m²)
- `et0_fao_evapotranspiration`: Daily sum of ET₀ Reference Evapotranspiration of a well watered grass field (mm)
- `month`: Month of the record
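The columns above can be turned into a modelling table in a few lines. The following is a minimal sketch, not the project's actual preprocessing: the `time` column name, the hand-made values, the `is_precipitation` target name, and the 0 mm cut-off are all assumptions made for illustration.

```python
import pandas as pd

# Tiny hand-made frame standing in for the Open-Meteo daily export.
# The `time` column name and all values here are assumptions for illustration.
raw = pd.DataFrame(
    {
        "time": ["2023-01-01", "2023-07-15", "2023-11-06"],
        "precipitation_sum": [12.4, 0.0, 3.1],
    }
)

# Derive the calendar month (1-12) from the date of each record.
raw["month"] = pd.to_datetime(raw["time"]).dt.month

# Hypothetical binary target: 1 if any precipitation was recorded, else 0.
# The 0 mm cut-off is an assumption; the report does not state the threshold.
raw["is_precipitation"] = (raw["precipitation_sum"] > 0).astype(int)

print(raw)
```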

## 2.2 Exploratory Data Analysis

All the features in our data are numeric. To see how these features are distributed and to decide on an appropriate data transformation method, we plotted the distribution of each numeric feature ({numref}`Figure {number} <histogram_numeric_features>`).

The histograms for the temperature variables (maximum, minimum, and mean temperatures at 2 meters above ground, and maximum, minimum, and mean apparent temperatures) generally show a bell-shaped distribution. The histogram for precipitation sums is highly skewed to the right, indicating that there are many days with low precipitation and fewer days with high precipitation. The wind speed and wind gusts also show right-skewed distributions, which is typical for wind speed data, where calm days are more common than extremely windy ones. The dominant wind direction histogram appears to be multimodal. Lastly, the shortwave radiation and the ET₀ reference evapotranspiration distributions also appear right-skewed. In summary, the distributions of these climate and environmental features are reasonable, and there are no obvious anomalies or outliers that call for further data cleaning.

```{figure} ../../results/figures/histogram_numeric_features.png
---
width: 800px
name: histogram_numeric_features
---
Distributions for the climate and environmental features.
```

We used a correlation heatmap to further examine the potential correlations between the features ({numref}`Figure {number} <correlation_heatmap>`). The correlation heatmap indicates strong correlations among the temperature variables (maximum, minimum, and mean temperatures at 2 meters above ground and apparent temperatures), with coefficients close to 1. Thus, we used only `temperature_2m_mean` for analysis to avoid multicollinearity. Similarly, wind-related features (wind speed and wind gusts) also show a high degree of correlation. Therefore, we chose to keep the wind speed feature to reduce redundancy in the model.


```{figure} ../../results/figures/correlation_heatmap.png
---
width: 600px
name: correlation_heatmap
---
Correlation heatmap between the climate and environmental features.
```
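The redundancy check described above can be approximated programmatically. The sketch below uses synthetic data (not the Vancouver dataset) and an assumed cut-off of 0.9 on the absolute Pearson correlation; both are illustrative choices rather than the values used in this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: two nearly collinear temperature columns
# and one independent wind column. Values are made up for illustration.
rng = np.random.default_rng(0)
n = 200
temp_mean = rng.normal(10, 6, n)
features = pd.DataFrame(
    {
        "temperature_2m_mean": temp_mean,
        "temperature_2m_max": temp_mean + 4 + rng.normal(0, 0.5, n),
        "wind_speed_10m_max": rng.gamma(2.0, 5.0, n),
    }
)

# Absolute pairwise Pearson correlations.
corr = features.corr().abs()

# Keep only the strict upper triangle so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Assumed cut-off for "highly correlated"; not taken from the report.
threshold = 0.9
pairs = [
    (row, col, round(float(upper.loc[row, col]), 2))
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(pairs)
```

Each reported pair is a candidate for keeping only one of its two columns, which mirrors the choice of retaining `temperature_2m_mean` and the wind speed feature.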

## 2.3 Classification Model Development

We compared four candidate classifiers: Decision Tree, Support Vector Machine, Logistic Regression, and K-Nearest Neighbours. The features were preprocessed through a pipeline with a `StandardScaler` transformer before being passed to the models. The transformer standardizes each numeric feature to zero mean and unit variance, which mitigates potential adverse effects on performance arising from disparate feature scales. Considering the observed seasonal pattern in Vancouver precipitation data, we transformed the feature `month` into the circular features `month_sin` and `month_cos` using trigonometric functions (i.e., sine and cosine), aiming to capture the cyclical nature of the data. For each model, we ran 5-fold cross-validation and extracted the F1 score to identify the best-performing class of model. The performance of each model is plotted below ({numref}`Figure {number} <model_comparison>`).
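To illustrate the circular encoding of `month`, here is a minimal sketch. It assumes the common formulation in which the month number is mapped to an angle of 2π·month/12; the exact formula used in the project is not stated in this report.

```python
import numpy as np
import pandas as pd

# A few example months; in practice this would be the full `month` column.
months = pd.DataFrame({"month": [1, 4, 7, 10, 12]})

# Map each month onto a circle with a 12-month period (assumed formulation).
angle = 2 * np.pi * months["month"] / 12
months["month_sin"] = np.sin(angle)
months["month_cos"] = np.cos(angle)

print(months.round(3))
```

With this encoding, December and January sit next to each other on the circle, whereas the raw month numbers 12 and 1 are as far apart as possible.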

```{figure} ../../results/figures/model_comparison.png
---
width: 800px
name: model_comparison
---
Comparison of model performance (F1) for predicting the existence of rainfall in Vancouver.
```
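A comparison of this kind can be sketched with scikit-learn as follows. The data come from `make_classification` as a synthetic stand-in, the model settings are library defaults, and the dictionary names are hypothetical, so the scores printed here do not reproduce the figure above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the weather features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=123
)

# The four model families named in the text, with default settings (assumed).
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=123),
    "SVM RBF": SVC(kernel="rbf"),
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
}

# Scale inside the pipeline so each fold is standardized on its own
# training split, then average the F1 score over 5 folds.
mean_f1 = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    mean_f1[name] = cross_val_score(pipe, X, y, cv=5, scoring="f1").mean()

for name, score in mean_f1.items():
    print(f"{name}: {score:.3f}")
```

Placing `StandardScaler` inside the pipeline, rather than scaling the full dataset up front, keeps information from the validation folds out of the scaling step.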

In this comparison, the Support Vector Machine with Radial Basis Function (SVM RBF) exhibited the most favorable performance, attaining an F1 score of 0.87. The next stage in our analytical pipeline is hyperparameter optimization for this model, with the objective of further improving performance.

Before proceeding with this optimization, we assessed the importance of individual features. Because SVM RBF does not inherently provide direct measures of feature importance, we used the Logistic Regression model fitted earlier to derive the relative importance of features from their fitted coefficients. While this interpretation of feature importance may not align precisely with that of the Support Vector Machine, it nonetheless provides valuable insights. The comparison of feature importance is displayed below ({numref}`Figure {number} <Feature_importance>`). `evapotranspiration` exhibits the strongest association with precipitation. This association aligns with the intuition that increased water evaporation corresponds to an increased likelihood of rainfall. We also observe that `month_cos` shows a strong association with the existence of rain, whereas no such association is evident for `month_sin`. The cosine value of month tends to coincide well with the seasons in Vancouver.

```{figure} ../../results/figures/Feature_importance.png
---
width: 800px
name: Feature_importance
---
Comparison of feature importance for predicting the existence of rainfall in Vancouver.
```
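The coefficient-based ranking can be sketched as below. The data are synthetic and the feature names `feature_a`, `feature_b`, and `feature_c` are hypothetical placeholders, so the signs and magnitudes say nothing about the real weather features. Because the inputs are standardized, the coefficient magnitudes are comparable across features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features with hypothetical names; values are made up.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame(
    {
        "feature_a": rng.normal(size=n),
        "feature_b": rng.normal(size=n),
        "feature_c": rng.normal(size=n),
    }
)

# Synthetic label: driven strongly by feature_a, weakly by feature_b,
# and not at all by feature_c.
logit = 3.0 * X["feature_a"] + 1.0 * X["feature_b"]
y = (logit + rng.logistic(size=n) > 0).astype(int)

# Standardize, then fit logistic regression.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

# Rank features by the absolute value of their fitted coefficients.
coefs = pipe.named_steps["logisticregression"].coef_[0]
importance = pd.Series(coefs, index=X.columns).sort_values(
    key=abs, ascending=False
)
print(importance)
```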

We performed a cross-validated grid search to optimize the SVM RBF model by tuning its hyperparameter C. The best-performing SVM RBF model had C=10.0 and yielded the scoring metrics below ({numref}`Figure {number} <classification_report>`). The F1 score is consistent across the two classes (rain and no rain), which indicates that the model does not favor one class over the other.
The weighted-average F1 score of 0.87 is close to that of the pre-tuned SVM RBF model, so hyperparameter optimization did not improve the model beyond its default settings. Nonetheless, our current model performed well at predicting rain, with 87% accuracy using 8 features.
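A grid search over C of the kind described above could look like the following sketch. The candidate values in `param_grid` and the synthetic data are assumptions; the report states only that the selected value was C=10.0.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 8 features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=123
)

# Scaling and the RBF-kernel SVM in one pipeline.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Hypothetical grid of C values; `svc__C` targets the SVC step of the pipeline.
param_grid = {"svc__C": [0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validated search, scored on F1 for the positive class.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X, y)

print("Best C:", search.best_params_["svc__C"])
print("Best mean CV F1:", round(search.best_score_, 3))
```

If the best cross-validated score is close to the score at the default C=1.0, as reported above, tuning C alone offers little headroom.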

```{figure} ../../results/figures/classification_report.png
---
width: 800px
name: classification_report
---
The performance metrics of our final chosen model.
```

# 3. Discussion

The results of this project demonstrate the utility of machine learning techniques, specifically Support Vector Machine with Radial Basis Function (SVM RBF), in predicting precipitation in Vancouver. The high F1 score of 0.87 achieved by the optimized SVM model demonstrates the model's good performance in distinguishing between rainy and non-rainy days. 

Our results show the prominent predictive power of evapotranspiration for precipitation prediction. Furthermore, we showed that including month as a circular feature effectively captures seasonal variation, which enhances the model's predictive power.

The lack of improvement from hyperparameter tuning suggests that the default settings of the SVM model may already be near optimal. Future improvements may therefore come less from hyperparameter tuning than from additional feature engineering. As a next step, we could explore regression models for quantitative precipitation prediction. In addition, examining potential non-linear relationships between features and precipitation may facilitate further improvement.

In summary, our project provides a model that can accurately predict Vancouver's precipitation with a limited number of environmental and climate features. This is a promising tool that could lead to more powerful precipitation forecasting, with potential applications in agriculture, urban planning, and disaster preparedness.

# References

```{bibliography}
```