# Predicting precipitation in Vancouver

by Dan Zhang, Doris (Yun Yi) Cai, Hayley (Yi) Han & Sivakorn (Oak) Chong 2023/11/30

In [1]:
import pandas as pd
from myst_nb import glue
import pickle
#from sklearn import set_config



# Summary

This project builds a classification model to predict whether there is precipitation on a given day (True or False) and a regression model to predict the amount of precipitation, using temperature, wind speed, wind direction, shortwave radiation and evapotranspiration as features. The best classification model in our training and testing process is SVC-RBF with hyperparameter C=10.0. When generalized to unseen data, it yields a test score of 0.8625 and an f1-score of 0.87 on the positive class (there is precipitation), so it predicts whether there is precipitation on a particular day with fairly high accuracy. The best regression model, trained on the same features to predict the amount of precipitation, is SVR with gamma=0.1 and C=1000. It achieves a score of 0.6993 on the unseen test data, which is adequate; further work could improve the regression model.

# 1. Introduction 

Prediction of daily precipitation is a fundamental aspect of meteorological studies {cite:p}`new2001precipitation`. Accurate precipitation prediction is crucial for agriculture, water resources management, and people's daily activities. In a geographically and climatically diverse region like Vancouver, predicting precipitation is vital for helping people prepare for extreme weather events, which reduces hazards and minimizes property damage.

In this project, we aim to predict the occurrence and the amount of daily precipitation in Vancouver using machine learning (ML) classification and regression methods {cite:p}`ortiz2014accurate`. Our analysis uses a dataset containing daily precipitation information for Vancouver from 1990 to the present (i.e., 6 Nov 2023). This dataset, sourced from Open-Meteo’s Historical Weather API {cite:p}`Zippenfenig2023open`, includes a number of parameters relevant to precipitation prediction.
Key parameters include month, daily temperature measures, wind speeds, wind direction, shortwave radiation, and ET₀ reference evapotranspiration. Shortwave radiation represents the sum of solar energy received in a day; ET₀ reference evapotranspiration indicates the atmospheric demand for moisture (e.g., higher relative humidity reduces ET₀); and month is included as a variable because it accounts for seasonal variations in precipitation {cite:p}`pal2000simulation`. This project may contribute insights into accurate forecasting of precipitation in Vancouver.

# 2. Methods & Results

## 2.1 Data

The dataset used in this project was sourced from Open-Meteo’s Historical Weather API {cite:p}`Zippenfenig2023open`, which can be found [here](https://open-meteo.com/en/docs/historical-weather-api#latitude=49.2497&longitude=-123.1193&hourly=weather_code&daily=weather_code,temperature_2m_max,temperature_2m_min,temperature_2m_mean,apparent_temperature_max,apparent_temperature_min,apparent_temperature_mean,sunrise,sunset,precipitation_sum,rain_sum,snowfall_sum,precipitation_hours,wind_speed_10m_max,wind_gusts_10m_max,wind_direction_10m_dominant,shortwave_radiation_sum,et0_fao_evapotranspiration&timezone=auto). Each row in the dataset represents one day in Vancouver, with various parameters relevant to precipitation. The parameters included in the following analysis are listed below with a short description.

#### Column description  
- `temperature_2m_max`: Maximum daily air temperature at 2 meters above ground (°C)
- `temperature_2m_min`: Minimum daily air temperature at 2 meters above ground (°C)
- `temperature_2m_mean`: Mean daily air temperature at 2 meters above ground (°C)
- `apparent_temperature_max`: Maximum daily apparent temperature (°C)
- `apparent_temperature_min`: Minimum daily apparent temperature (°C)
- `apparent_temperature_mean`: Mean daily apparent temperature (°C)
- `precipitation_sum`: Sum of daily precipitation (including rain, showers and snowfall) (mm)
- `wind_speed_10m_max`: Maximum wind speed on a day (km/h)
- `wind_gusts_10m_max`: Maximum wind gusts on a day (km/h)
- `wind_direction_10m_dominant`: Dominant wind direction (°)
- `shortwave_radiation_sum`: Sum of solar radiation on a given day (MJ/m²)
- `et0_fao_evapotranspiration`: Daily sum of ET₀ reference evapotranspiration of a well-watered grass field (mm)
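
As a rough illustration of how these columns could be requested, the sketch below only builds a query URL and does not download anything. The coordinates, the `daily` variable names and `timezone=auto` come from the link above; the endpoint host in `BASE_URL`, the helper name `build_archive_url`, and the exact start date are assumptions made for illustration and should be checked against the Open-Meteo documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint for Open-Meteo's historical weather (archive) API.
BASE_URL = "https://archive-api.open-meteo.com/v1/archive"

# Daily variables used in the analysis (names taken from the column list above).
DAILY_COLUMNS = [
    "temperature_2m_max",
    "temperature_2m_min",
    "temperature_2m_mean",
    "apparent_temperature_max",
    "apparent_temperature_min",
    "apparent_temperature_mean",
    "precipitation_sum",
    "wind_speed_10m_max",
    "wind_gusts_10m_max",
    "wind_direction_10m_dominant",
    "shortwave_radiation_sum",
    "et0_fao_evapotranspiration",
]


def build_archive_url(start_date, end_date, latitude=49.2497, longitude=-123.1193):
    """Hypothetical helper: assemble a query URL for daily Vancouver data."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,
        "end_date": end_date,
        "daily": ",".join(DAILY_COLUMNS),
        "timezone": "auto",
    }
    return f"{BASE_URL}?{urlencode(params)}"


# The start date is an assumption; the report only says "from 1990".
url = build_archive_url("1990-01-01", "2023-11-06")
print(url)
```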

## 2.2 Exploratory Data Analysis

All the features in our data are numeric. To see how these features are distributed and decide on an appropriate data transformation method, we plotted the distribution of each numeric feature ({numref}`Figure {number} <histogram_numeric_features>`).

The histograms for the temperature variables (maximum, minimum, and mean temperatures at 2 meters above ground, and maximum, minimum, and mean apparent temperatures) generally show a bell-shaped distribution. The histogram for precipitation sums is highly skewed to the right, indicating that there are many days with low precipitation and fewer days with high precipitation. The wind speed and wind gusts also show right-skewed distributions, which is typical for wind speed data, where calm days are more common than extremely windy ones. The dominant wind direction histogram appears to be multimodal. Lastly, the shortwave radiation and ET₀ reference evapotranspiration distributions also appear right-skewed. In summary, the distributions of these climate and environmental features are reasonable, and there are no obvious anomalies or outliers that call for further data cleaning.

```{figure} ../../results/figures/histogram_numeric_features.png
---
width: 800px
name: histogram_numeric_features
---
Distributions for the climate and environmental features.
```
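
The visual impression of skewness described above can also be checked numerically. The following is a minimal sketch using synthetic stand-in data, not the project's dataset: the column names match the column list, but the generated values and the distribution parameters are purely hypothetical. A sample skewness near 0 suggests a roughly symmetric distribution, while a clearly positive value suggests a right-skewed one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000

# Synthetic stand-in data (hypothetical values, not the Open-Meteo dataset).
df = pd.DataFrame(
    {
        # Roughly bell-shaped, like the temperature histograms.
        "temperature_2m_mean": rng.normal(loc=10.0, scale=6.0, size=n),
        # Many small values and a long right tail, like precipitation.
        "precipitation_sum": rng.gamma(shape=0.5, scale=6.0, size=n),
        # Moderately right-skewed, like maximum wind speed.
        "wind_speed_10m_max": rng.gamma(shape=4.0, scale=4.0, size=n),
    }
)

# Sample skewness per column, sorted from most to least right-skewed.
skewness = df.skew().sort_values(ascending=False)
print(skewness)
```

On the real data, the same `df.skew()` call could help decide which features might benefit from a transformation before modelling.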

We used a correlation heatmap to further examine the potential correlations between the features ({numref}`Figure {number} <correlation_heatmap>`). The correlation heatmap indicates strong correlations among the temperature variables (maximum, minimum, and mean temperatures at 2 meters above ground and apparent temperatures), with coefficients close to 1. Thus, we used only `temperature_2m_mean` for analysis to avoid multicollinearity. Similarly, the wind-related features (wind speed and wind gusts) show a high degree of correlation. Therefore, we chose to keep only the wind speed feature to reduce redundancy in the model.


```{figure} ../../results/figures/correlation_heatmap.png
---
width: 600px
name: correlation_heatmap
---
Correlation heatmap between the climate and environmental features.
```
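
The feature selection described above can be sketched as follows. This is an illustration on synthetic stand-in data, not the analysis code used in this project: the generated values are hypothetical, and the 0.9 cut-off is an assumed threshold chosen for the example (the text only states that the coefficients are close to 1).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in data (hypothetical values, not the Open-Meteo dataset).
temp_mean = rng.normal(10.0, 6.0, n)
wind_speed = rng.gamma(4.0, 4.0, n)
df = pd.DataFrame(
    {
        "temperature_2m_mean": temp_mean,
        "temperature_2m_max": temp_mean + 4.0 + rng.normal(0.0, 1.0, n),
        "wind_speed_10m_max": wind_speed,
        "wind_gusts_10m_max": 1.8 * wind_speed + rng.normal(0.0, 2.0, n),
        "shortwave_radiation_sum": rng.gamma(2.0, 6.0, n),
    }
)

# Absolute pairwise correlations; keep only the upper triangle so that
# each pair of features appears once and the diagonal is excluded.
corr = df.corr().abs()
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Assumed threshold for "highly correlated" (not taken from the report).
high_pairs = pairs[pairs > 0.9]
print(high_pairs)

# Keep one representative per correlated group, mirroring the choice above.
keep = ["temperature_2m_mean", "wind_speed_10m_max", "shortwave_radiation_sum"]
reduced = df[keep]
```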

# References

```{bibliography}
```