# Generating Stochastic Volatility Surface Datasets

Author: Sebastien Gurrieri, sebgur@gmail.com

This notebook generates datasets for training models that fit implied volatility surfaces. See the [training workbook](https://colab.research.google.com/drive/1y-Tb4JxiBmcJAw943GWtZn_GiGhcUUzS#scrollTo=RGQNx52-lrZ5) for more details about the models and references to the literature. For every supported model, each sample is defined by:

* expiry, strike, forward rate
* lognormal vol level: an order of magnitude expressed in lognormal terms for easy intuition. It is converted into a model parameter in a model-dependent way, for instance into the parameter $\alpha$ for SABR.
* other model parameters (say $\beta, \nu, \rho$ for SABR)
* option price (put): calculated with the closed-form Hagan formula for SABR, or by Monte-Carlo simulation for the other models, i.e. No-Arbitrage SABR, Free-Boundary SABR, ZABR and Heston.
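
To make the "lognormal vol level" item more concrete, here is a minimal sketch of one possible conversion for shifted SABR. It assumes the leading-order at-the-money relation $\sigma_{LN} \approx \alpha / (F + s)^{1 - \beta}$, where $s$ is the shift. The function name `lnvol_to_alpha` is hypothetical and the conversion actually used by SDevPy may differ.

```python
def lnvol_to_alpha(ln_vol: float, fwd: float, beta: float, shift: float = 0.0) -> float:
    """Hypothetical helper: map a lognormal vol level to a SABR alpha.

    Assumption: leading-order ATM match, sigma_LN ~ alpha / (F + shift)^(1 - beta).
    This is an illustration, not necessarily what SDevPy does internally.
    """
    shifted_fwd = fwd + shift
    if shifted_fwd <= 0.0:
        raise ValueError("The shifted forward must be strictly positive")
    return ln_vol * shifted_fwd ** (1.0 - beta)


# Example: 20% lognormal vol, 2% forward, 3% shift, beta = 0.5
alpha = lnvol_to_alpha(ln_vol=0.20, fwd=0.02, beta=0.5, shift=0.03)
print(f"alpha = {alpha:.6f}")
```

With $\beta = 1$ the conversion is the identity, which matches the intuition that $\alpha$ is then itself a lognormal volatility.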

We start by randomly generating expiry, strike, forward rate and the other parameters over a user-specified range. For each sample we then calculate the option price. Finally we go through a data cleansing step which, at the moment, simply consists of calculating the normal volatilities and rejecting values judged too high or too low. We then output the cleansed data to a tsv file.
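
The random draw described above can be sketched as follows. This is only an illustration under stated assumptions: each parameter is drawn independently and uniformly over its range, and the function `sample_parameters` is hypothetical. The sampling scheme inside SDevPy may be different.

```python
import random

# Ranges in the same {name: [low, high]} layout as the RANGES dictionary of this notebook
RANGES = {'Ttm': [1.0 / 12.0, 35.0], 'F': [-0.009, 0.041], 'LnVol': [0.05, 0.5],
          'Beta': [0.1, 0.9], 'Nu': [0.1, 1.0], 'Rho': [-0.6, 0.6]}


def sample_parameters(ranges, num_samples, seed=42):
    """Hypothetical helper: draw each parameter independently and uniformly in [low, high]."""
    rng = random.Random(seed)  # Fixed seed so that a dataset can be reproduced
    return [{name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
            for _ in range(num_samples)]


samples = sample_parameters(RANGES, num_samples=5)
for sample in samples:
    print({name: round(value, 4) for name, value in sample.items()})
```

Changing the seed produces a different set of samples, which is how several independent datasets can be generated from the same ranges.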

Since the quality of the data is crucial for the subsequent training, it would be worthwhile to consider more sophisticated approaches for filtering out bad data, as well as more efficient valuation methods such as PDE solvers. The current version of this code takes about 10 hours on a 2018 i5 CPU to generate 100k samples for MC-based models.

In [1]:
# Install SDevPy
!pip install sdevpy --upgrade

Collecting sdevpy
  Downloading sdevpy-0.1.5-py3-none-any.whl (69 kB)
Collecting pyperclip (from sdevpy)
  Downloading pyperclip-1.8.2.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Collecting py-vollib (from sdevpy)
  Downloading py_vollib-1.0.1.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting py_lets_be_rational (from py-vollib->sdevpy)
  Downloading py_lets_be_rational-1.0.1.tar.gz (18 kB)
  Preparing metadata (setup.py) ... done
Collecting simplejson (from py-vollib->sdevpy)
  Downloading simplejson-3.19.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (137 kB)

In [6]:
# Import relevant modules
import os
from datetime import datetime
import numpy as np
from platform import python_version

import sdevpy as sd
from sdevpy.tools.timer import Stopwatch
from sdevpy.tools import filemanager
from sdevpy.volsurfacegen import stovolfactory

print("Python version: " + python_version())
print("NumPy version: " + np.__version__)
print("SDevPy version: " + sd.__version__)

Python version: 3.10.12
NumPy version: 1.22.4
SDevPy version: 0.1.5


## 1) Set runtime configuration


In [18]:
# Global settings
# MODEL_TYPE = "SABR"
MODEL_TYPE = "McSABR"
# MODEL_TYPE = "FbSABR"
# MODEL_TYPE = "McZABR"
# MODEL_TYPE = "McHeston"
SHIFT = 0.03
NUM_SAMPLES = 1 * 1000
# The 4 parameters below are only relevant for models whose reference is calculated by MC
NUM_EXPIRIES = 10
NUM_STRIKES = 5
NUM_MC = 10 * 1000  # Alternative: 100 * 1000
POINTS_PER_YEAR = 25
# Change seed to generate different sets
SEED = 42  # Alternatives: [2468, 8642, 2112, 4444, 88, 6666, 1122, 12]

print(">> Set up runtime configuration")
project_folder = "/content/sdevpy/stovol"
filemanager.check_directory(project_folder)
print("> Project folder: " + project_folder)
dataset_folder = os.path.join(project_folder, "datasets")
print("> Data folder: " + dataset_folder)
filemanager.check_directory(dataset_folder)
print("> Chosen model: " + MODEL_TYPE)
data_file = os.path.join(dataset_folder, MODEL_TYPE + "_data.tsv")

>> Set up runtime configuration
> Project folder: /content/sdevpy/stovol
> Data folder: /content/sdevpy/stovol/datasets
> Chosen model: McSABR


## 2) Generate samples

Here we generate the samples using the SDevPy framework. First, prices are calculated with the chosen model. These prices are then converted into normal volatilities and the data is cleansed. Finally, the dataset is written to a tsv file.
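
As an illustration of the conversion and cleansing step, the sketch below inverts the undiscounted Bachelier put formula by bisection and drops rows whose normal volatility falls outside a band. All function names, the row layout with absolute strikes, and the thresholds `min_nvol` and `max_nvol` are assumptions made for this example. They are not taken from SDevPy, whose `to_nvol` method may work differently.

```python
import math


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def bachelier_put(fwd, strike, ttm, nvol):
    """Undiscounted put price in the normal (Bachelier) model."""
    stdev = nvol * math.sqrt(ttm)
    d = (fwd - strike) / stdev
    return (strike - fwd) * norm_cdf(-d) + stdev * norm_pdf(d)


def implied_nvol(price, fwd, strike, ttm, low=1e-8, high=1.0, tol=1e-12):
    """Invert the put price by bisection; return None if no vol exists in [low, high]."""
    intrinsic = max(strike - fwd, 0.0)
    if price <= intrinsic or price >= bachelier_put(fwd, strike, ttm, high):
        return None
    for _ in range(200):
        mid = 0.5 * (low + high)
        # The put price increases with the vol, so the root is bracketed by [low, high]
        if bachelier_put(fwd, strike, ttm, mid) < price:
            low = mid
        else:
            high = mid
        if high - low < tol:
            break
    return 0.5 * (low + high)


def cleanse(rows, min_nvol=0.0001, max_nvol=0.05):
    """Keep rows whose normal vol exists and lies in [min_nvol, max_nvol] (hypothetical bounds)."""
    clean = []
    for row in rows:
        nvol = implied_nvol(row['price'], row['fwd'], row['strike'], row['ttm'])
        if nvol is not None and min_nvol <= nvol <= max_nvol:
            clean.append({**row, 'nvol': nvol})
    return clean


rows = [
    {'fwd': 0.02, 'strike': 0.025, 'ttm': 2.0, 'price': bachelier_put(0.02, 0.025, 2.0, 0.008)},
    {'fwd': 0.02, 'strike': 0.020, 'ttm': 1.0, 'price': bachelier_put(0.02, 0.020, 1.0, 0.2)},  # vol too high
    {'fwd': 0.02, 'strike': 0.030, 'ttm': 1.0, 'price': 0.005},  # below intrinsic value
]
print(f"Kept {len(cleanse(rows))} of {len(rows)} rows")
```

Bisection is slower than a dedicated inversion routine but it is robust, and a failed inversion doubles as a natural rejection criterion for bad prices.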

In [19]:
# Select the model
generator = stovolfactory.set_generator(MODEL_TYPE, SHIFT, NUM_EXPIRIES, NUM_STRIKES, NUM_MC,
                                        POINTS_PER_YEAR, SEED)
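
For the MC-based models, the run log in this section reports a surface size of 50 and 20 surfaces for 1,000 samples. The short sketch below reproduces that arithmetic, assuming each parameter sample is priced on a full grid of `NUM_EXPIRIES` by `NUM_STRIKES` points. The rounding up is an assumption for sample counts that are not a multiple of the surface size.

```python
import math

NUM_SAMPLES = 1 * 1000
NUM_EXPIRIES = 10
NUM_STRIKES = 5

# One surface = one parameter sample priced at every (expiry, strike) pair
surface_size = NUM_EXPIRIES * NUM_STRIKES
# Assumption: round up so that at least NUM_SAMPLES points are produced
num_surfaces = math.ceil(NUM_SAMPLES / surface_size)

print(f"Surface size: {surface_size}")
print(f"Number of surfaces/parameter samples: {num_surfaces}")
```

Each MC simulation is therefore shared by all the points of one surface, which is why larger grids reduce the number of simulations needed for a given dataset size.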

In [20]:
# Select training range
# SABR
RANGES = {'Ttm': [1.0 / 12.0, 35.0], 'K': [0.01, 0.99], 'F': [-0.009, 0.041], 'LnVol': [0.05, 0.5],
          'Beta': [0.1, 0.9], 'Nu': [0.1, 1.0], 'Rho': [-0.6, 0.6]}
# # FBSABR
# RANGES = {'Ttm': [1.0 / 12.0, 10.0], 'K': [0.01, 0.99], 'F': [-0.01, 0.05], 'LnVol': [0.05, 0.5],
#           'Beta': [0.25, 0.75], 'Nu': [0.1, 1.0], 'Rho': [-0.6, 0.6]}
# # ZABR
# RANGES = {'Ttm': [1.0 / 12.0, 35.0], 'K': [0.01, 0.99], 'F': [-0.009, 0.041], 'LnVol': [0.05, 0.5],
#           'Beta': [0.1, 0.9], 'Nu': [0.10, 1.0], 'Rho': [-0.6, 0.6],
#           'Gamma': [0.1, 0.9]}
# # Heston
# RANGES = {'Ttm': [1.0 / 12.0, 35.0], 'K': [0.01, 0.99], 'F': [-0.009, 0.041], 'LnVol': [0.05, 0.25],
#           'Kappa': [0.25, 4.00], 'Theta': [0.05**2, 0.25**2], 'Xi': [0.10, 0.50],
#           'Rho': [-0.40, 0.40]}
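
Before launching a long run, it can be useful to sanity-check the chosen ranges. The sketch below is a hypothetical helper, not part of SDevPy. It assumes that every range is a `[low, high]` pair, that the lowest forward must remain positive once shifted (as required by a shifted model such as shifted SABR), and that correlations lie strictly inside (-1, 1).

```python
def check_ranges(ranges, shift):
    """Hypothetical helper: return a list of problems, empty if the ranges look sane."""
    problems = []
    for name, bounds in ranges.items():
        if len(bounds) != 2:
            problems.append(f"{name}: expected [low, high], got {bounds}")
            continue
        low, high = bounds
        if not low < high:
            problems.append(f"{name}: low {low} is not below high {high}")
    # Assumption: the shifted forward F + shift must be strictly positive
    if 'F' in ranges and ranges['F'][0] + shift <= 0.0:
        problems.append(f"F: lowest forward {ranges['F'][0]} is not positive after shift {shift}")
    # Assumption: correlation must lie strictly inside (-1, 1)
    if 'Rho' in ranges and not (-1.0 < ranges['Rho'][0] and ranges['Rho'][1] < 1.0):
        problems.append("Rho: correlation bounds must lie strictly inside (-1, 1)")
    return problems


sabr_ranges = {'Ttm': [1.0 / 12.0, 35.0], 'K': [0.01, 0.99], 'F': [-0.009, 0.041],
               'LnVol': [0.05, 0.5], 'Beta': [0.1, 0.9], 'Nu': [0.1, 1.0], 'Rho': [-0.6, 0.6]}

print(check_ranges(sabr_ranges, shift=0.03))   # No problem expected with a 3% shift
print(check_ranges(sabr_ranges, shift=0.005))  # The lowest forward is no longer covered
```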

In [21]:
# Generate samples (prices)
print(">> Generate dataset")
print(f"> Generate {NUM_SAMPLES:,} price samples")
timer_gen = Stopwatch("Generating Samples")
timer_gen.trigger()
data_df = generator.generate_samples(NUM_SAMPLES, RANGES)
timer_gen.stop()

# Convert to normal vols and cleanse
print("> Convert to normal vol and cleanse data")
timer_conv = Stopwatch("Converting Prices")
timer_conv.trigger()
data_df = generator.to_nvol(data_df, cleanse=True)
num_clean = len(data_df.index)
print(f"> Dataset size after cleansing: {num_clean:,}")
timer_conv.stop()

# Output to file
timer_out = Stopwatch("File Output")
timer_out.trigger()
now = datetime.now()
dt_string = now.strftime("%Y%m%d-%H_%M_%S")
data_file = os.path.join(dataset_folder, MODEL_TYPE + "_data_" + dt_string + ".tsv")
print("> Output to file: " + data_file)
generator.to_file(data_df, data_file)
timer_out.stop()

# View timers
timer_gen.print()
timer_conv.print()
timer_out.print()

>> Generate dataset
> Generate 1,000 price samples
Number of strikes: 5
Number of expiries: 10
Surface size: 50
Number of samples: 1,000
Number of surfaces/parameter samples: 20
Surface generation number 1/20
Surface generation number 2/20
Surface generation number 3/20
Surface generation number 4/20
Surface generation number 5/20
Surface generation number 6/20
Surface generation number 7/20
Surface generation number 8/20
Surface generation number 9/20
Surface generation number 10/20
Surface generation number 11/20
Surface generation number 12/20
Surface generation number 13/20
Surface generation number 14/20
Surface generation number 15/20
Surface generation number 16/20
Surface generation number 17/20
Surface generation number 18/20
Surface generation number 19/20
Surface generation number 20/20
> Convert to normal vol and cleanse data
Converting to normal vol, batch 1 out of 1
> Dataset size after cleansing: 999
> Output to file: /content/sdevpy/stovol/datasets/McSABR_data_20230701-