# Dependent Sample Generation
In this notebook, we calibrate copula models for the river basins of Thailand, grouped at the L3 basin level, and use them to generate dependent samples.

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import t, chi2

from prisk.analysis_functions import (
    combine_glofas, 
    extract_discharge_timeseries, 
    fit_gumbel_distribution, 
    calculate_uniform_marginals)

import warnings
warnings.filterwarnings("ignore")

In [2]:
glofas_dir = "/Users/rubenkerkhofs/Desktop/glofas/" 
basin_outlet_file = "https://kuleuven-prisk.s3.eu-central-1.amazonaws.com/lev06_outlets_final_clipped_Thailand_no_duplicates.csv"
basin_match_file = "https://kuleuven-prisk.s3.eu-central-1.amazonaws.com/basin_outlets_match.csv"

As a first step, we obtain discharge data for the basins:

In [3]:
# Step 1: Load GloFAS river discharge data and upstream accumulating area data
# Discharge data for producing GIRI maps is from 1979-2016
start_year = 1979
end_year = 2016
area_filter = 500 # not considering rivers with upstream areas below 500 km^2
glofas_data = combine_glofas(start_year, end_year, glofas_dir, area_filter)

# Step 2: Load the basin outlet file, perform some data checks (to ensure we have valid discharge timeseries at each basin outlet point), and then extract discharge timeseries for each basin
basin_outlets = pd.read_csv(basin_outlet_file)
# Note: to align the two datasets, we need to make the following adjustment to the lat/lons (based on previous trial and error)
basin_outlets['Latitude'] = basin_outlets['Latitude'] + 0.05/2
basin_outlets['Longitude'] = basin_outlets['Longitude'] - 0.05/2
# Extract discharge timeseries
basin_timeseries = extract_discharge_timeseries(basin_outlets, glofas_data)

Once the time series are obtained, we fit a Gumbel distribution to each individual basin:

In [4]:
gumbel_params, fit_quality = fit_gumbel_distribution(basin_timeseries)
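
The internals of `fit_gumbel_distribution` are not shown in this notebook. As a rough illustration of what a per-basin Gumbel fit can look like, the sketch below fits `scipy.stats.gumbel_r` by maximum likelihood to synthetic data. The use of annual maxima, the synthetic values, and the Kolmogorov-Smirnov check as a stand-in for `fit_quality` are all assumptions, not the `prisk` implementation.

```python
import numpy as np
from scipy.stats import gumbel_r, kstest

# Synthetic annual-maximum discharges for one hypothetical basin
# (38 values, matching the 1979-2016 period used above).
rng = np.random.default_rng(0)
annual_maxima = gumbel_r.rvs(loc=500.0, scale=120.0, size=38, random_state=rng)

# Maximum-likelihood estimates of the Gumbel location and scale.
loc_hat, scale_hat = gumbel_r.fit(annual_maxima)

# One possible goodness-of-fit measure (an assumption, not prisk's metric).
ks_result = kstest(annual_maxima, "gumbel_r", args=(loc_hat, scale_hat))
print(loc_hat, scale_hat, ks_result.pvalue)
```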

Once the Gumbel distributions are fitted, we compute the uniform marginals:

In [5]:
uniform_marginals = calculate_uniform_marginals(basin_timeseries, gumbel_params)
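
The transformation to uniform marginals is typically a probability integral transform: each observation is passed through the fitted cumulative distribution function. The sketch below shows this for a single hypothetical basin with assumed Gumbel parameters; it is not necessarily how `calculate_uniform_marginals` is implemented.

```python
import numpy as np
from scipy.stats import gumbel_r

# Hypothetical fitted parameters and synthetic discharges for one basin.
loc, scale = 500.0, 120.0
rng = np.random.default_rng(1)
discharge = gumbel_r.rvs(loc=loc, scale=scale, size=1000, random_state=rng)

# Probability integral transform: u = F(x) lies in [0, 1] and is
# approximately uniform when the fitted distribution describes the data.
u = gumbel_r.cdf(discharge, loc=loc, scale=scale)
print(u.min(), u.mean(), u.max())
```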

These uniform marginals are used to estimate the dependency structure between basins. 

Next, we group the L6 basins by the L3 basin they belong to:

In [6]:
marginals = pd.DataFrame(uniform_marginals)
basin_match = pd.read_csv(basin_match_file)
l3_basins = basin_match.HYBAS_ID_L3.unique()
l3_data = {}

for basin in l3_basins:
    associated_l6_basins = list(basin_match[basin_match.HYBAS_ID_L3 == basin].HYBAS_ID_L6.unique())
    data = marginals[associated_l6_basins]
    l3_data[basin] = data


### Gaussian Copula
The Gaussian copula requires only the correlation matrix as an input parameter. We use the GaussianMultivariate object of the copulas package to estimate and sample from this copula. Note that GaussianMultivariate also estimates the univariate distributions by default; however, we have already transformed the data to uniform marginals. For that reason, we fix the univariate distribution to the standard uniform distribution.

A separate Gaussian copula is fitted for each of the L3 basins:

In [7]:
from copulas.multivariate import GaussianMultivariate
from copulas.univariate import UniformUnivariate

class UniformUnivariateFixed(UniformUnivariate):
    def _fit_constant(self, X):
        self._params = {
            'loc': 0,
            'scale': 1
        }

    def _fit(self, X):
        self._params = {
            'loc': 0,
            'scale': 1
        }

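copulas = {}
for basin, data in l3_data.items():
    # Pass the fixed uniform distribution so that the marginals are not re-estimated
    copula = GaussianMultivariate(distribution=UniformUnivariateFixed)
    copula.fit(data)
    copulas[basin] = copula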
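
For intuition, a Gaussian copula can also be sampled directly with NumPy and SciPy. The sketch below is a minimal version under stated assumptions: the correlation matrix is estimated on normal scores of the uniform data, and the function name `sample_gaussian_copula` and the two-basin toy data are hypothetical. It is not a description of what GaussianMultivariate does internally.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

def sample_gaussian_copula(uniform_data, n_samples, seed=None):
    """Draw dependent uniforms from a Gaussian copula fitted to uniform data."""
    rng = np.random.default_rng(seed)
    # Clip to avoid infinite normal scores at exactly 0 or 1.
    clipped = uniform_data.clip(1e-6, 1 - 1e-6).to_numpy()
    normal_scores = pd.DataFrame(norm.ppf(clipped), columns=uniform_data.columns)
    corr = normal_scores.corr().to_numpy()
    # Correlated standard normals, mapped back to (0, 1) with the normal CDF.
    z = rng.multivariate_normal(np.zeros(len(corr)), corr, size=n_samples)
    return pd.DataFrame(norm.cdf(z), columns=uniform_data.columns)

# Toy data: two hypothetical basins with strongly dependent marginals.
toy_rng = np.random.default_rng(0)
latent = toy_rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)
toy = pd.DataFrame(norm.cdf(latent), columns=["basin_a", "basin_b"])

gaussian_sketch = sample_gaussian_copula(toy, n_samples=2000, seed=1)
print(gaussian_sketch.corr())
```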

In [8]:
# Total number of correlation-matrix entries across all L3 basins
sum(data.corr().shape[0]*data.corr().shape[1] for data in l3_data.values())

4448

Then, the sample function can be used to obtain samples:

In [8]:
samples = {
    basin: copula.sample(100) for basin, copula in copulas.items()
}

# Place the samples of all L3 basins side by side (one column per L6 basin)
generated_samples = pd.concat(list(samples.values()), axis=1)



In [9]:
#generated_samples.to_parquet("gaussian_random_numbers_L3.parquet.gzip", compression='gzip', index=False)

### T-Copula
The T-copula requires two model inputs: (1) the correlation matrix, and (2) the degrees of freedom. In this case, we set the degrees of freedom equal to 3. Samples of the T-copula are obtained as follows:


In [14]:
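n_samples = 10000
dof = 3  # degrees of freedom of the T-copula

t_samples = {}

for basin, data in l3_data.items():
    corr_matrix = data.corr().values
    mu = np.zeros(len(corr_matrix))
    # Chi-square mixing variable, one draw per sample
    s = chi2.rvs(df=dof, size=n_samples)[:, np.newaxis]
    # Correlated standard normal draws
    Z = np.random.multivariate_normal(mu, corr_matrix, n_samples)
    # Scale the normal draws to obtain multivariate Student-t draws
    X = np.sqrt(dof/s)*Z
    # Map each margin to (0, 1) with the univariate Student-t CDF
    U = t.cdf(X, df=dof)
    t_samples[basin] = pd.DataFrame(U, columns=data.columns)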

# Combine the T-copula samples (t_samples, not the Gaussian samples) across basins
generated_samples = pd.concat(list(t_samples.values()), axis=1)

generated_samples.to_parquet("t_random_numbers_L3.parquet.gzip", compression='gzip', index=False)
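
The sampling steps in the cell above can be written out as follows, where $R$ is the correlation matrix of a given L3 basin, $\nu = 3$ is the number of degrees of freedom, $S$ is drawn independently of $Z$, $d$ is the number of L6 basins in the group, and $t_{\nu}$ denotes the CDF of the univariate Student-t distribution:

```latex
\begin{aligned}
Z &\sim \mathcal{N}(0, R), \qquad S \sim \chi^2_{\nu}, \qquad \nu = 3, \\
X &= \sqrt{\nu / S}\; Z, \\
U_i &= t_{\nu}(X_i), \qquad i = 1, \dots, d.
\end{aligned}
```

The vector $X$ follows a multivariate Student-t distribution with correlation matrix $R$, so each $U_i$ is uniform on $(0, 1)$ while the dependence between the components is that of the T-copula.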

### Vine Copula
The vine copula is estimated using the vinecopulas package. The estimated parameters are pickled.

In [None]:
from vinecopulas.vinecopula import fit_vinecop

copula_params = {}

for basin, data in l3_data.items():
    copula_params[basin] = fit_vinecop(data.values, copsi=list(range(1, 15)))
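
The pickling step itself is not shown in this notebook. A minimal sketch of how a dictionary such as `copula_params` could be stored and restored with the standard-library `pickle` module is given below. The helper names, the file name, and the placeholder dictionary are hypothetical; the placeholder stands in for the tuples returned by `fit_vinecop`.

```python
import pickle
import tempfile
from pathlib import Path

def save_params(params, path):
    """Write the parameter dictionary to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(params, f)

def load_params(path):
    """Read a parameter dictionary written by save_params."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Placeholder standing in for the fitted vine-copula parameters per basin.
example_params = {"basin_a": ("placeholder",), "basin_b": ("placeholder",)}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "vine_copula_params_L3.pkl"  # hypothetical file name
    save_params(example_params, path)
    restored = load_params(path)

print(restored == example_params)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.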

In [12]:
from vinecopulas.vinecopula import sample_vinecop

samples = {
    basin: pd.DataFrame(
        sample_vinecop(*params, 10000),
        columns=l3_data[basin].columns,
    )
    for basin, params in copula_params.items()
}

# Place the samples of all L3 basins side by side (one column per L6 basin)
generated_samples = pd.concat(list(samples.values()), axis=1)

#generated_samples.to_parquet("vine_random_numbers_L3.parquet.gzip", compression='gzip', index=False)
