# Predicting precipitation in Vancouver

by Dan Zhang, Doris (Yun Yi) Cai, Hayley (Yi) Han & Sivakorn (Oak) Chong 2023/11/30

import pandas as pd
from myst_nb import glue
import pickle
#from sklearn import set_config

# Summary

This project builds a classification model to predict whether precipitation occurs on a given day (True or False) and a regression model to predict the amount of precipitation, using temperature, wind speed, wind direction, shortwave radiation and evapotranspiration as features. The best classification model in our training and testing process is SVC-RBF with hyperparameter C=10.0. When generalized to unseen data, it yields a test score of 0.8625 and an f1-score of 0.87 on the positive class (precipitation occurs). This is a reasonably high accuracy for predicting whether it rains on a particular day. The best regression model, trained with the same features to predict the amount of precipitation, is SVR with gamma=0.1 and C=1000. It achieves a score of 0.6993 on the unseen test data, which is adequate. Further study could improve the regression model.

# 1. Introduction 

Prediction of daily precipitation is a fundamental aspect of meteorological studies {cite:p}`new2001precipitation`. Accurate precipitation prediction is crucial for agriculture, water resources management, as well as daily activities of people. Specifically, in a geographically and climatically diverse region like Vancouver, predicting precipitation is vital for people to prepare for extreme weather events, reducing hazards and minimizing property damage.

In this project, we aim to predict the occurrence and the amount of daily precipitation in Vancouver using machine learning (ML) classification and regression methods {cite:p}`ortiz2014accurate`. Our analysis uses a dataset containing daily precipitation information in Vancouver from 1990 to the present (6 November 2023). This dataset, sourced from Open-Meteo’s Historical Weather API {cite:p}`Zippenfenig2023open`, includes a number of parameters relevant to precipitation prediction.
Key parameters include month, daily temperature measures, wind speeds, wind direction, shortwave radiation, and ET₀ reference evapotranspiration. Shortwave radiation represents the sum of solar energy received in a day; ET₀ reference evapotranspiration indicates the atmospheric demand for moisture (e.g., higher relative humidity reduces ET₀); and month is included because it accounts for seasonal variation in precipitation {cite:p}`pal2000simulation`. This project may contribute insights into accurate forecasting of precipitation in Vancouver.

# 2. Methods & Results

## 2.1 Data

The dataset used in this project was sourced from Open-Meteo’s Historical Weather API {cite:p}`Zippenfenig2023open`, which can be found [here](https://open-meteo.com/en/docs/historical-weather-api#latitude=49.2497&longitude=-123.1193&hourly=weather_code&daily=weather_code,temperature_2m_max,temperature_2m_min,temperature_2m_mean,apparent_temperature_max,apparent_temperature_min,apparent_temperature_mean,sunrise,sunset,precipitation_sum,rain_sum,snowfall_sum,precipitation_hours,wind_speed_10m_max,wind_gusts_10m_max,wind_direction_10m_dominant,shortwave_radiation_sum,et0_fao_evapotranspiration&timezone=auto). Each row in the dataset represents one day of weather in Vancouver, with various parameters relevant to precipitation. The parameters used in the following analysis are listed with a short description in the column description below.

We load the data and create a classification target column `is_precipitation` from the daily precipitation sum `precipitation_sum`: if `precipitation_sum` is greater than 0.01, `is_precipitation` is True, otherwise False. We use 0.01 as the threshold because an amount that small is negligible, and it avoids rounding issues. `precipitation_sum` is the regression target column.
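The thresholding rule above can be sketched as follows. This is a minimal illustration, not the project's actual loading code: the four rows are made-up values standing in for the Open-Meteo data, and the constant name `PRECIPITATION_THRESHOLD` is hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the Open-Meteo daily data (assumption).
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2023-01-01", "2023-01-02", "2023-07-01", "2023-07-02"]
        ),
        "precipitation_sum": [12.4, 0.0, 0.01, 0.3],  # mm
    }
)

# Threshold stated in the text; the constant name is hypothetical.
PRECIPITATION_THRESHOLD = 0.01

# Strictly greater than the threshold counts as a precipitation day,
# so a value of exactly 0.01 is labelled False.
df["is_precipitation"] = df["precipitation_sum"] > PRECIPITATION_THRESHOLD
```

Because the comparison is strict, a day with exactly 0.01 mm falls in the False class.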

#### Column description
- `date`: date of the record
- `temperature_2m_max`: Maximum daily air temperature at 2 meters above ground (°C)
- `temperature_2m_min`: Minimum daily air temperature at 2 meters above ground (°C)
- `temperature_2m_mean`: Mean daily air temperature at 2 meters above ground (°C)
- `apparent_temperature_max`: Maximum daily apparent temperature (°C)
- `apparent_temperature_min`: Minimum daily apparent temperature (°C)
- `apparent_temperature_mean`: Mean daily apparent temperature (°C)
- `precipitation_sum`: Sum of daily precipitation (including rain, showers and snowfall) (mm)
- `wind_speed_10m_max`: Maximum wind speed on a day (km/h)
- `wind_gusts_10m_max`: Maximum wind gusts on a day (km/h)
- `wind_direction_10m_dominant`: Dominant wind direction (°)
- `shortwave_radiation_sum`: Sum of solar radiation on a given day (MJ/m²)
- `et0_fao_evapotranspiration`: Daily sum of ET₀ Reference Evapotranspiration of a well watered grass field (mm)

### EDA

We plot the distribution of each numeric column. The temperature features are approximately normally distributed. Wind speed, radiation and evapotranspiration are slightly right-skewed. The wind direction feature appears to be bimodal.

Figure 1. Distribution of all numeric features in the dataset.

In the correlation matrix below, we notice that the temperature features are highly correlated with each other, so one temperature parameter is enough for our analysis; we choose `temperature_2m_mean`. Similarly, the wind speed features are highly correlated, so we keep `wind_speed_10m_max` and drop `wind_gusts_10m_max` to avoid collinearity issues.

Table 1. Correlation matrix between all numeric features in the dataset.
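A correlation-based screen of this kind might look like the sketch below. The data are synthetic, and the 0.9 cut-off is an assumption chosen for illustration; the original analysis does not state a numeric threshold.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): max/min temperature track the mean closely,
# while wind direction is unrelated to temperature.
rng = np.random.default_rng(0)
n = 200
temp_mean = rng.normal(10, 6, n)
demo = pd.DataFrame(
    {
        "temperature_2m_mean": temp_mean,
        "temperature_2m_max": temp_mean + 4 + rng.normal(0, 0.5, n),
        "temperature_2m_min": temp_mean - 4 + rng.normal(0, 0.5, n),
        "wind_direction_10m_dominant": rng.uniform(0, 360, n),
    }
)

# Pearson correlation matrix between all numeric columns.
corr = demo.corr()

# Keep one representative and flag columns strongly correlated with it.
keep = "temperature_2m_mean"
CORR_CUTOFF = 0.9  # hypothetical threshold, not from the original analysis
redundant = [
    col
    for col in corr.columns
    if col != keep and abs(corr.loc[keep, col]) > CORR_CUTOFF
]

reduced = demo.drop(columns=redundant)
```

Here `redundant` holds the maximum and minimum temperature columns, and `reduced` retains the mean temperature and wind direction.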

### 2.2.4 Select features for classification and further explore the relationship between the features of interest

We drop the temperature features that are highly correlated with `temperature_2m_mean` and the wind feature that is highly correlated with `wind_speed_10m_max`. After cleaning up the data, we predict `precipitation_sum` and `is_precipitation` using the features `temperature_2m_mean`, `wind_speed_10m_max`, `wind_direction_10m_dominant`, `shortwave_radiation_sum`, `et0_fao_evapotranspiration` and `month`.
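The feature selection described above could be expressed as in this sketch. It assumes that `month` is derived from the `date` column, which the text does not state explicitly, and the two rows of values are invented for illustration.

```python
import pandas as pd

# Two invented rows using the dataset's column names (values are assumptions).
raw = pd.DataFrame(
    {
        "date": ["2023-01-15", "2023-07-04"],
        "temperature_2m_mean": [3.2, 19.8],
        "temperature_2m_max": [5.0, 25.1],
        "wind_speed_10m_max": [18.0, 11.5],
        "wind_gusts_10m_max": [40.3, 25.9],
        "wind_direction_10m_dominant": [120, 280],
        "shortwave_radiation_sum": [2.1, 27.6],
        "et0_fao_evapotranspiration": [0.3, 5.1],
        "precipitation_sum": [14.2, 0.0],
    }
)

# Assumption: month is extracted from the date of each record.
raw["month"] = pd.to_datetime(raw["date"]).dt.month

# The six features named in the text.
FEATURES = [
    "temperature_2m_mean",
    "wind_speed_10m_max",
    "wind_direction_10m_dominant",
    "shortwave_radiation_sum",
    "et0_fao_evapotranspiration",
    "month",
]

X = raw[FEATURES]
y_regression = raw["precipitation_sum"]
```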

We plot scatter charts of each feature against the regression target `precipitation_sum` to investigate whether any pattern is present. In this preliminary investigation, no pattern stands out for temperature, wind speed or wind direction. However, shortwave radiation and evapotranspiration show a strong negative correlation with the precipitation amount `precipitation_sum`. We also notice higher precipitation in January, February, October, November and December. This is expected because there is more rain in winter in Vancouver.

We plot a box plot of each feature against the classification target `is_precipitation`. We notice that `shortwave_radiation_sum` and `et0_fao_evapotranspiration` have different means for the False and True classes: both tend to be higher on days without precipitation than on days with precipitation.

## 2.3 Classification analysis

### 2.3.1 Splitting dataset
We split the cleaned data into an 80% training set and a 20% test set.
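An 80/20 split can be sketched with scikit-learn's `train_test_split`. The placeholder arrays and the `random_state` value are assumptions made only to keep the example reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and binary target (assumption: 100 samples).
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# test_size=0.2 gives the 80% / 20% split described in the text.
# random_state is an arbitrary value chosen for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
```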

### 2.3.3 Model selection

Based on accuracy, test_recall, and test_precision, **RBF SVM** is the best-performing model. We carry it forward to hyperparameter optimization.
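A comparison on these metrics might be set up as in the following sketch, which uses scikit-learn's `cross_validate` (its result keys are `test_accuracy`, `test_precision` and `test_recall`). The synthetic data, the two candidate models, the scaling step and the 5-fold setting are all assumptions; the original list of candidates is not given in this section.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the weather features.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Hypothetical candidate models; each is scaled before fitting.
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "RBF SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

metrics = ("test_accuracy", "test_precision", "test_recall")
rows = {}
for name, model in models.items():
    cv_results = cross_validate(
        model, X, y, cv=5, scoring=["accuracy", "precision", "recall"]
    )
    # Average each metric over the cross-validation folds.
    rows[name] = {metric: cv_results[metric].mean() for metric in metrics}

# One row per model, one column per metric.
summary = pd.DataFrame(rows).T
```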

### 2.3.4 Feature importance

As a side note, we examine which features are important by looking at feature importance via logistic regression, because the feature importance of an SVC fitted with an RBF kernel is difficult to interpret.

Figure 4. Feature importance obtained from the logistic regression model.

`month` and `et0_fao_evapotranspiration` are the most important features.
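One common way to read feature importance from a logistic regression is to compare coefficient magnitudes on standardized features, sketched below. This is an assumed approach, not necessarily the one used to produce Figure 4, and the data are synthetic with a target deliberately driven by evapotranspiration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features (assumption); names borrowed from the dataset.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame(
    {
        "temperature_2m_mean": rng.normal(10, 6, n),
        "wind_speed_10m_max": rng.normal(15, 5, n),
        "et0_fao_evapotranspiration": rng.normal(2, 1, n),
    }
)

# Synthetic target: low evapotranspiration (plus noise) means precipitation.
noise = rng.normal(0, 0.3, n)
y = (X["et0_fao_evapotranspiration"] + noise < 2).astype(int)

# Standardizing first makes coefficient magnitudes comparable across features.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

coefs = pd.Series(
    pipe.named_steps["logisticregression"].coef_[0], index=X.columns
)

# Rank features by absolute coefficient size.
importance = coefs.abs().sort_values(ascending=False)
```

In this synthetic setup the evapotranspiration coefficient is negative, which matches the direction seen in the box plots: higher evapotranspiration goes with no precipitation.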

### 2.3.5 Hyperparameter optimization for the best model
As shown below, our best model is the one with C=10.0, since it gives the highest test_score.
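A search over `C` for an RBF-kernel SVC might be sketched with `GridSearchCV` as follows. The candidate grid, the scaling step, the 5-fold cross-validation and the synthetic data are assumptions; only the fact that `C` was tuned comes from the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the training set (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Hypothetical grid of C values; the original grid is not given in the text.
C_VALUES = [0.1, 1.0, 10.0, 100.0]
param_grid = {"svc__C": C_VALUES}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# The candidate with the highest mean cross-validated test score wins.
best_C = search.best_params_["svc__C"]
best_score = search.best_score_
```

The per-candidate scores are available in `search.cv_results_["mean_test_score"]`, which is one way to produce a comparison table like the one referred to above.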

# 3. Discussion

The best classification model in our training and testing process is SVC-RBF with hyperparameter C=10.0. When generalized to unseen data, it yields a test score of 0.8625 and an f1-score of 0.87 on the positive class (precipitation occurs on the day). This is a reasonably high accuracy for predicting whether it rains on a particular day.

The best regression model, trained with the same features to predict the amount of precipitation, is SVR with gamma=0.1 and C=1000. Its score on the unseen test data is 0.6993, which is adequate. This result suggests that further study (e.g., adding new features) could improve the regression model.
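For reference, an SVR with these hyperparameters can be fitted and scored as in the sketch below. It assumes the reported score is the coefficient of determination R², which is what scikit-learn's `score` method returns for regressors; the data are synthetic and the scaling step is also an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem (assumption), not the precipitation data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# gamma and C are the values reported in the text; the RBF kernel is
# scikit-learn's default for SVR.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", gamma=0.1, C=1000))
svr.fit(X_train, y_train)

# For regressors, score() returns R^2 on the given data.
r2 = svr.score(X_test, y_test)
```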

# References

```{bibliography}
```