# Predicting precipitation in Vancouver

by Dan Zhang, Doris (Yun Yi) Cai, Hayley (Yi) Han & Sivakorn (Oak) Chong 2023/11/30

import pandas as pd
from myst_nb import glue
import pickle
#from sklearn import set_config

# Summary

This project builds a classification model to predict whether precipitation occurs on a given day (True or False) and a regression model to predict the amount of precipitation, using temperature, wind speed, wind direction, shortwave radiation and evapotranspiration as features. The best classification model in our training and testing process is an SVC with an RBF kernel and hyperparameter C=10.0. On unseen test data it achieves a test score of 0.8625 and an F1 score of 0.87 on the positive class (precipitation occurs), which is reasonably high for predicting whether it rains on a particular day. The best regression model for predicting the amount of precipitation, trained on the same features, is an SVR with gamma=0.1 and C=1000. It achieves a score of 0.6993 on the unseen test data. This performance is adequate, and further work could improve the regression model.

# 1. Introduction 

Prediction of daily precipitation is a fundamental aspect of meteorological studies {cite:p}`new2001precipitation`. Accurate precipitation prediction is crucial for agriculture, water resources management, as well as daily activities of people. Specifically, in a geographically and climatically diverse region like Vancouver, predicting precipitation is vital for people to prepare for extreme weather events, reducing hazards and minimizing property damage.

In this project, we aim to predict the occurrence and the amount of daily precipitation in Vancouver using machine learning (ML) classification methods {cite:p}`ortiz2014accurate`. Specifically, our analysis utilizes a dataset containing daily precipitation information in Vancouver from 1990 to the present (i.e., 6 Nov, 2023). This dataset, sourced from Open-Meteo’s Historical Weather API {cite:p}`Zippenfenig2023open`, includes a number of parameters relevant to precipitation prediction. 
Key parameters include month, daily temperature measures, wind speeds, wind direction, shortwave radiation, and ET₀ reference evapotranspiration. Specifically, shortwave radiation represents the sum of solar energy received in a day; ET₀ reference evapotranspiration provides an indication of the atmospheric demand for moisture (i.e., higher relative humidity reduces ET₀); and month is included as a variable because it accounts for the seasonal variations in precipitation {cite:p}`pal2000simulation`. This project may contribute insights into accurately forecasting precipitation in Vancouver.

# 2. Methods & Results

## 2.1 Data

The dataset used in this project was sourced from Open-Meteo’s Historical Weather API {cite:p}`Zippenfenig2023open`, which can be found [here](https://open-meteo.com/en/docs/historical-weather-api#latitude=49.2497&longitude=-123.1193&hourly=weather_code&daily=weather_code,temperature_2m_max,temperature_2m_min,temperature_2m_mean,apparent_temperature_max,apparent_temperature_min,apparent_temperature_mean,sunrise,sunset,precipitation_sum,rain_sum,snowfall_sum,precipitation_hours,wind_speed_10m_max,wind_gusts_10m_max,wind_direction_10m_dominant,shortwave_radiation_sum,et0_fao_evapotranspiration&timezone=auto). Each row in the dataset represents one day in Vancouver, with the daily precipitation and various parameters relevant to precipitation. The parameters included in the following analysis are listed below with a short description.  

#### Column description  
- `temperature_2m_max`: Maximum daily air temperature at 2 meters above ground (°C)
- `temperature_2m_min`: Minimum daily air temperature at 2 meters above ground (°C)
- `temperature_2m_mean`: Mean daily air temperature at 2 meters above ground (°C)
- `apparent_temperature_max`: Maximum daily apparent temperature (°C)
- `apparent_temperature_min`: Minimum daily apparent temperature (°C)
- `apparent_temperature_mean`: Mean daily apparent temperature (°C)
- `precipitation_sum`: Sum of daily precipitation (including rain, showers and snowfall) (mm)
- `wind_speed_10m_max`: Maximum wind speed on a day (km/h)
- `wind_gusts_10m_max`: Maximum wind gusts on a day (km/h)
- `wind_direction_10m_dominant`: Dominant wind direction (°)
- `shortwave_radiation_sum`: Sum of solar radiation on a given day (MJ/m²)
- `et0_fao_evapotranspiration`: Daily sum of ET₀ reference evapotranspiration of a well-watered grass field (mm)
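
As a rough illustration of how these columns could be requested, the sketch below builds a query URL for the daily variables listed above. The archive endpoint, the helper name `build_archive_url`, and the exact start date are assumptions made for illustration; the coordinates are taken from the documentation link above, and the report does not state how the data was actually downloaded.

```python
from urllib.parse import urlencode

# Assumption: the Historical Weather API is served from this archive endpoint.
ARCHIVE_ENDPOINT = "https://archive-api.open-meteo.com/v1/archive"

# Daily variables used in the analysis (names from the column description).
DAILY_VARIABLES = [
    "temperature_2m_max",
    "temperature_2m_min",
    "temperature_2m_mean",
    "apparent_temperature_max",
    "apparent_temperature_min",
    "apparent_temperature_mean",
    "precipitation_sum",
    "wind_speed_10m_max",
    "wind_gusts_10m_max",
    "wind_direction_10m_dominant",
    "shortwave_radiation_sum",
    "et0_fao_evapotranspiration",
]


def build_archive_url(start_date, end_date, latitude=49.2497, longitude=-123.1193):
    """Return a request URL for daily weather between two ISO dates (hypothetical helper)."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,
        "end_date": end_date,
        "daily": ",".join(DAILY_VARIABLES),
        "timezone": "auto",
    }
    # safe="," keeps the comma-separated variable list readable in the URL.
    return f"{ARCHIVE_ENDPOINT}?{urlencode(params, safe=',')}"


# Assumption: the series starts on 1 January 1990; the report only says "from 1990".
url = build_archive_url("1990-01-01", "2023-11-06")
print(url)
```

The URL is only constructed here, not fetched, so the sketch runs without network access.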

## 2.2 Exploratory Data Analysis

All the features in our data are numeric. To see how these features are distributed and to decide on an appropriate data transformation method, we plotted the distribution of each numeric feature ({numref}`Figure {number} <histogram_numeric_features>`).

The histograms for the temperature variables (maximum, minimum, and mean temperatures at 2 meters above ground, and maximum, minimum, and mean apparent temperatures) generally show a bell-shaped distribution. The histogram for precipitation sums is highly skewed to the right, indicating that there are many days with low precipitation and fewer days with high precipitation. The wind speed and wind gusts also show right-skewed distributions, which is typical for wind speed data, where calm days are more common than extremely windy ones. The dominant wind direction histogram appears to be multimodal. Lastly, the shortwave radiation and ET₀ reference evapotranspiration distributions also appear right-skewed. In summary, the distributions of these climate and environmental features are reasonable, and there are no obvious anomalies or outliers that call for further data cleaning.

```{figure} ../../results/figures/histogram_numeric_features.png
---
width: 800px
name: histogram_numeric_features
---
Distributions for the climate and environmental features.
```
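
The visual impression of skew can also be checked numerically. The following is a minimal sketch on synthetic stand-in data (not the Vancouver dataset); the distributions, the sample size, and the 0.5 cut-off are arbitrary assumptions chosen only to illustrate the check.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_days = 2000

# Synthetic stand-in columns: one roughly symmetric, two right-skewed.
weather = pd.DataFrame(
    {
        "temperature_2m_mean": rng.normal(10.0, 6.0, n_days),
        "precipitation_sum": rng.exponential(4.0, n_days),
        "wind_speed_10m_max": rng.gamma(2.0, 6.0, n_days),
    }
)

# Sample skewness per column: values near 0 suggest a symmetric shape,
# clearly positive values suggest a long right tail.
skewness = weather.skew()

# Arbitrary illustrative cut-off for calling a feature "right-skewed".
right_skewed = skewness[skewness > 0.5].index.tolist()
print(skewness.round(2))
print(right_skewed)
```

With the real data, the same `skew()` call could be applied to the loaded DataFrame, and `weather.hist()` would draw the histograms if matplotlib is installed.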

We used a correlation heatmap to further examine the potential correlations between the features ({numref}`Figure {number} <correlation_heatmap>`). The correlation heatmap indicates strong correlations among the temperature variables (maximum, minimum, and mean temperatures at 2 meters above ground and apparent temperatures), with coefficients close to 1. Thus, we used only `temperature_2m_mean` for analysis to avoid multicollinearity. Similarly, the wind-related features (wind speed and wind gusts) show a high degree of correlation, so we kept only the wind speed feature to reduce redundancy in the model.


```{figure} ../../results/figures/correlation_heatmap.png
---
width: 600px
name: correlation_heatmap
---
Correlation heatmap between the climate and environmental features.
```
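
A heatmap like this can be complemented by listing the feature pairs whose absolute correlation exceeds a threshold. The sketch below uses synthetic data; the helper `highly_correlated_pairs` and the 0.9 threshold are hypothetical choices for illustration, not the procedure used in the report.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """Return (feature_a, feature_b, |r|) for pairs at or above the threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if r >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs


rng = np.random.default_rng(1)
mean_temp = rng.normal(10.0, 6.0, 1000)

# Synthetic stand-in: the maximum temperature tracks the mean closely,
# while wind direction is unrelated to both.
demo = pd.DataFrame(
    {
        "temperature_2m_mean": mean_temp,
        "temperature_2m_max": mean_temp + 4.0 + rng.normal(0.0, 1.0, 1000),
        "wind_direction_10m_dominant": rng.uniform(0.0, 360.0, 1000),
    }
)

pairs = highly_correlated_pairs(demo)
print(pairs)
```

For each reported pair, one feature can then be dropped, as was done here by keeping `temperature_2m_mean` and the wind speed feature.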

## 2.3 Classification Model Development

We compared four candidate models (Decision Tree, Support Vector Machine, Logistic Regression and K-Nearest Neighbours). For each model, we ran 5-fold cross-validation and extracted the F1 score to identify the best-performing model class, as plotted below ({numref}`Figure {number} <model_comparison>`).

```{figure} ../../results/figures/model_comparison.png
---
width: 800px
name: model_comparison
---
Comparison of model performance (F1) for predicting the existence of rainfall in Vancouver.
```
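
The comparison described above can be sketched roughly as follows. This is an illustration on synthetic data from `make_classification`, assuming default hyperparameters and a `StandardScaler` preprocessing step; the report's actual preprocessing and settings are not shown here and may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 8 weather features and the rain / no-rain target.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM RBF": SVC(kernel="rbf"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
}

mean_f1 = {}
for name, model in models.items():
    # Scaling sits inside the pipeline so it is refit on each training fold.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_validate(pipe, X, y, cv=5, scoring="f1")
    mean_f1[name] = scores["test_score"].mean()

for name, score in sorted(mean_f1.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Keeping the scaler inside the pipeline avoids leaking validation-fold statistics into training, which matters for distance-based models such as SVM and KNN.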

The comparison shows that the Support Vector Machine with a Radial Basis Function kernel (SVM RBF) performs best, with an F1 score of almost 0.88. The next stage of our pipeline is to optimize the hyperparameters of this model in the hope of improving performance further.

Before tuning, we can look at which features matter most. One of the candidate models was Logistic Regression, and its fitted coefficients can be used as an indication of feature importance. We use the logistic regression coefficients because feature importance is not as simple to extract from the chosen SVM model. The importances may not carry over exactly to the Support Vector Machine, but they give some intuition ({numref}`Figure {number} <Feature_importance>`).

```{figure} ../../results/figures/Feature_importance.png
---
width: 800px
name: Feature_importance
---
Comparison of feature importance for predicting the existence of rainfall in Vancouver.
```
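
One way to obtain such coefficients is sketched below on synthetic data. The feature names and the data-generating process are made up for illustration, and standardizing the features first is an assumption: it makes coefficient magnitudes comparable across features, but the report does not state which preprocessing was used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 1000

# Hypothetical features with a known strong, weak, and zero effect.
X = pd.DataFrame(
    {
        "strong_feature": rng.normal(size=n),
        "weak_feature": rng.normal(size=n),
        "noise_feature": rng.normal(size=n),
    }
)
logit = 3.0 * X["strong_feature"] + 0.5 * X["weak_feature"]
y = (logit + rng.logistic(size=n) > 0).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

# One coefficient per (standardized) feature for the positive class.
coefs = pd.Series(
    pipe.named_steps["logisticregression"].coef_[0], index=X.columns
)

# Rank features by coefficient magnitude; the sign gives the direction.
ranking = coefs.abs().sort_values(ascending=False)
print(coefs.round(2))
print(ranking.index.tolist())
```

The sign of each coefficient indicates whether larger values of the feature raise or lower the predicted probability of the positive class, so it is worth reading alongside the magnitude.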

Several interpretations follow from the figure. Evapotranspiration is the feature most strongly associated with rain. One plausible explanation is that the more water evaporates, the more likely it is to rain. Next is the month. Recall that we transformed month, which is a circular feature, into two features (`month_cos` and `month_sin`). `month_cos` has a very strong association with the existence of rain, but `month_sin` does not, which suggests that the cosine of the month coincides well with the seasons in Vancouver.
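
For readers unfamiliar with circular features, a common way to encode the month is shown below. The formula (angle = 2π · month / 12) and the helper name `encode_month` are assumptions, since the report does not show its exact transformation.

```python
import numpy as np
import pandas as pd


def encode_month(month):
    """Map month numbers 1..12 onto the unit circle (hypothetical helper)."""
    angle = 2.0 * np.pi * month / 12.0
    return pd.DataFrame(
        {"month_sin": np.sin(angle), "month_cos": np.cos(angle)}
    )


# Index by month number so rows can be looked up directly.
months = pd.Series(range(1, 13), index=range(1, 13), name="month")
encoded = encode_month(months)
print(encoded.round(2))

# December and January end up as close together as January and February.
dec_to_jan = np.linalg.norm(encoded.loc[12] - encoded.loc[1])
jan_to_feb = np.linalg.norm(encoded.loc[1] - encoded.loc[2])
```

Under this particular encoding, `month_cos` is largest around December and January and smallest around June and July, so it separates winter from summer, while `month_sin` separates spring from autumn.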

Note also that all features have non-negligible coefficient values, so none appears to be useless. This is an encouraging sign for the model's ability to generalize.

Next, we tuned the chosen SVM model with hyperparameter optimization to obtain the best model. {numref}`Figure {number} <classification_report>` shows the final performance metrics of the optimized RBF SVM.
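
A hyperparameter search of this kind could look roughly like the sketch below. It uses synthetic data, and the grid values, the train/test split, and the choice of `GridSearchCV` are assumptions for illustration; the report only states that C=10.0 was selected.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the weather features and the rain / no-rain target.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Hypothetical grid; "svc__" targets the SVC step of the pipeline.
param_grid = {
    "svc__C": [0.1, 1.0, 10.0, 100.0],
    "svc__gamma": ["scale", 0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")

# Per-class precision, recall and F1 on the held-out test set.
report = classification_report(y_test, search.predict(X_test))
print(report)
```

The test set is touched only once, after the search has finished, so the reported metrics are not biased by the tuning itself.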

```{figure} ../../results/figures/classification_report.png
---
width: 800px
name: classification_report
---
The performance metrics of our final chosen model.
```

The F1 score is consistent across the two classes (rain and no rain), meaning that the model does not give biased predictions. The weighted average F1 score is 0.86, so hyperparameter optimization did little to improve the model beyond its default settings.

Nonetheless, we have a tool that predicts rain with 86% accuracy using only 8 features.

Going forward, we can look at ways to improve this number further. Feature engineering could help if there are other non-linear relationships between the features and rainfall. We can also explore regression models to predict the amount of rainfall, rather than only whether it rains.

# References

```{bibliography}
```