README

Contact and citation

Authors and affiliations:

Johannes Ruf j.ruf@lse.ac.uk, http://www.maths.lse.ac.uk/Personal/jruf/, London School of Economics and Political Science

Weiguan Wang weiguanwang@shu.edu.cn, https://weiguanwang.github.io/, Shanghai University

24 June 2022

Suggested citation:

W. Wang and J. Ruf (2022), Information Leakage in Backtesting. SSRN 3836631. Download at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3836631

A shorter version of this paper is available under the title "A Note on Spurious Model Selection", forthcoming in Quantitative Finance.

@article{
  RW.2022.spurious,
  title={A Note on Spurious Model Selection},
  author={Wang, Weiguan and Ruf, Johannes},
  journal={Quantitative Finance},
  year={2022},
  note={Forthcoming}
}

Supplementary reading:

J. Ruf and W. Wang (2021), Hedging with linear regressions and neural networks, SSRN 3580132, 2021. Forthcoming in the Journal of Business & Economic Statistics. Available at https://www.tandfonline.com/doi/pdf/10.1080/07350015.2021.1931241

@article{RW.2022.hedgenet,
  title={Hedging with Linear Regressions and Neural Networks},
	author={Ruf, Johannes and Wang, Weiguan},
	journal={Journal of Business \& Economic Statistics},
	volume={SSRN 3580132},
	note={Forthcoming},
	year={2022}}

J. Ruf and W. Wang (2020), Neural networks for option pricing and hedging: A literature review, Journal of Computational Finance, volume 24, number 1, pages 1-45. Available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3486363

@article{
  RW.2020.LR,
  Author = {Ruf, Johannes and Wang, Weiguan},
  Journal = {Journal of Computational Finance},
  volume = {24},
  number = {1},
  pages = {1--46},
  Title = {Neural networks for option pricing and hedging: {a} literature review},
  Year = {2020}
}

Introduction

This code reproduces the results in Ruf and Wang (2022) "Information leakage in backtesting". It covers two datasets: simulated data under the Black-Scholes model and real-world S&P 500 data obtained from OptionMetrics.

This code repository comes with two folders: code and data. The data folder contains the trained artificial neural networks (ANNs), which can be used to reproduce the paper's results. The code also allows the ANNs to be trained again. To do so, change the TRAIN_BY_YOURSELF parameter in setup.py to True.
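The following is a minimal sketch of how a flag such as TRAIN_BY_YOURSELF can switch between retraining and reusing stored networks. Only the flag name is taken from this repository; the functions train_network, load_network, and get_network are hypothetical placeholders and do not reflect the actual code in setup.py or the notebooks.

```python
# Hypothetical illustration; only the name TRAIN_BY_YOURSELF comes from setup.py.
TRAIN_BY_YOURSELF = False  # set to True to train the ANNs again


def train_network():
    """Placeholder for the (slow) training step."""
    return "freshly trained network"


def load_network():
    """Placeholder for loading a stored network from the data folder."""
    return "stored network"


def get_network(train_by_yourself=TRAIN_BY_YOURSELF):
    """Retrain when the flag is set, otherwise reuse the stored network."""
    return train_network() if train_by_yourself else load_network()


print(get_network())
```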

To use the code for the simulated data, one can run the run.sh script in a Linux-like shell from the code directory. The resulting notebooks in HTML format, Figure 2 of the paper, and all intermediate CSV files will appear in the results/BlackScholes folder. Users can also execute the notebooks one by one without using the shell script.

The real-world data experiment requires raw data from OptionMetrics. We are not able to provide these raw data due to their commercial nature. They can be obtained from Wharton Research Data Services. Within the OptionMetrics subscription, one can then download the relevant data. To query option prices, one needs to access "Ivy DB US - Option Prices", choose the Date Range "2010 to 2019", Company Code "SECID 108105", Option Type "Both", Exercise Style "European", Security Type "Index", Query Variables "All", Output Format "comma-delimited text (*.csv)", Date Format "YYMMDDs10. (e.g. 1984/07/25)", and then click Submit Query. To obtain the files for the S&P 500 price, one instead goes to "Ivy DB US - Security - Security Prices" and chooses the same Date Range, Company Code, Query Variables, and Query Output. To obtain interest rates, one goes to "Ivy DB US - Market - Zero Coupon Yield Curve" and chooses the same Date Range, Query Variables, and Query Output. These data then need to be put into the data/OptionMetrics/RawData folder before running the shell script; see the section "Data folder structure for simulated and OptionMetrics data" for the required file names.

After obtaining the OptionMetrics data, use the file run_OptionMetrics.sh for the S&P 500 data. The resulting notebooks in HTML format, Figure 3 in the paper, and all intermediate CSV files will appear in the results/OptionMetrics folder.

In the following, we explain in more detail the organisation of the code and data folders for the Black-Scholes simulated data and OptionMetrics data only.

Code structure for simulated and OptionMetrics data

The code consists of three subfolders: library, Simulation, and OptionMetrics. The library folder contains auxiliary tools in the following files:

bs.py: tools to simulate the Black-Scholes dataset.
cleaner_aux.py: tools to clean raw data.
common.py: tools to calculate and inspect the hedging error.
loader_aux.py: tools to load clean data (before training the ANNs or the linear regressions).
network.py: tools to implement HedgeNet and auxiliary tools.
plot.py: tools to plot diagnostic figures.
regression_aux.py: tools to implement the linear regressions.
simulation.py: tools to implement the CBOE rules for option generation and to organise the simulated data.
vix.py: tools to simulate an Ornstein-Uhlenbeck process, used as the fake VIX feature.
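As an illustration of the kind of process vix.py is described as simulating, here is a minimal sketch of an Ornstein-Uhlenbeck path generator. The function name simulate_ou, all parameter names and values, and the choice of the exact transition density (rather than, say, an Euler scheme) are assumptions made for this example; they are not taken from vix.py.

```python
import numpy as np


def simulate_ou(x0, kappa, theta, sigma, dt, n_steps, seed=None):
    """Simulate dX = kappa * (theta - X) dt + sigma dW on an equidistant grid.

    Uses the exact Gaussian transition of the OU process, so there is no
    discretisation error. All names here are illustrative.
    """
    rng = np.random.default_rng(seed)
    decay = np.exp(-kappa * dt)
    # Standard deviation of X_{t+dt} given X_t.
    step_std = sigma * np.sqrt((1.0 - decay**2) / (2.0 * kappa))
    path = np.empty(n_steps + 1)
    path[0] = x0
    for i in range(n_steps):
        noise = rng.standard_normal()
        path[i + 1] = theta + (path[i] - theta) * decay + step_std * noise
    return path


# Hypothetical parameters: daily steps over one year of 253 days.
fake_feature = simulate_ou(x0=0.2, kappa=2.0, theta=0.2, sigma=0.3,
                           dt=1.0 / 253, n_steps=253, seed=0)
print(fake_feature[:5])
```

Because such a path is generated independently of the option data, it carries no information about hedging errors by construction.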

In each of the other two folders, there are two Python files that are used by other notebooks:

setup.py contains flags to configure the experiments.
Load_Clean_aux.py loads the clean data and implements some extra cleaning, before running the linear regressions and ANNs.

The notebooks in both folders have a similar structure:

In the Simulation folder, the first notebook implements the data simulation. In the OptionMetrics folder, the first notebook implements the cleaning of the raw dataset.
2_Regression_Generate.ipynb implements the linear regressions on sensitivities and stores the PNL (MSHE) files.
3_Network.ipynb implements the training of the ANNs and stores the PNL files (MSHE of ANNs).
4_Permute_VIX_Analysis.ipynb implements the analysis of permutation and fake VIX experiments. The implementation of the experiment is done in notebooks 2 and 4, by giving the corresponding setup flags.

Data folder structure for simulated and OptionMetrics data

Before running the code, one needs to specify the directory that stores the simulated data (or historical data) and the results. This is done by overwriting the DATA_DIR variable in each of the setup.py files.

The data folders have two common subfolders:

CleanData stores either the simulated data or, when real-world data are used, the cleaned data generated by 1_Clean.ipynb.
Result stores the PNL (MSHE) files and other auxiliary files, either from the linear regressions or from the ANNs. It also includes tables created by 5_Diagnostic.ipynb. For the ANNs, it additionally contains loss plots, checkpoints, etc. For the linear regressions, it additionally contains regression coefficients, standard errors, etc.
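The layout described above can be sketched as follows. The subfolder names CleanData and Result and the variable name DATA_DIR come from this README; the helper ensure_data_folders and the example path are hypothetical, and the actual setup.py files may define DATA_DIR differently (for example, as a plain string).

```python
from pathlib import Path


def ensure_data_folders(data_dir):
    """Create the two common subfolders below data_dir and return their paths."""
    data_dir = Path(data_dir)
    folders = {name: data_dir / name for name in ("CleanData", "Result")}
    for folder in folders.values():
        folder.mkdir(parents=True, exist_ok=True)
    return folders


# Hypothetical location; point this to your own directory.
DATA_DIR = Path("my_data") / "BlackScholes"
# folders = ensure_data_folders(DATA_DIR)
```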

For the historical dataset, there is an extra folder RawData to store the historical real-world data. We provide a set of sample files in this folder, but the numbers in these files are modified and serve illustration purposes only. Users need to arrange and rename their files as follows for the code to run.

option_price.csv contains option quotes, downloaded from OptionMetrics.
spx500.csv contains the close-of-day price of S&P 500, downloaded from OptionMetrics.
onr.csv contains the overnight LIBOR rates downloaded from Bloomberg.
interest_rate.csv contains the interest rates derived from zero-coupon bonds for maturities longer than 7 days, downloaded from OptionMetrics.
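Before running the shell script, it can help to confirm that all four files are in place. The sketch below uses the file names listed above; the helper missing_raw_files is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names as listed in this README.
REQUIRED_RAW_FILES = (
    "option_price.csv",
    "spx500.csv",
    "onr.csv",
    "interest_rate.csv",
)


def missing_raw_files(raw_dir):
    """Return the required raw files that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in REQUIRED_RAW_FILES if not (raw_dir / name).is_file()]


# Path taken from the README; adjust it to your own DATA_DIR.
print(missing_raw_files("data/OptionMetrics/RawData"))
```

An empty list means that every expected file was found.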

Comment: The results are not sensitive to the interest rates used. If Bloomberg is not available, other data sources can also be used. The onr.csv file used by us has three columns named date, rate, and days. It has one row for each of the days between 04/01/2010 and 31/12/2019. Rates are given in percentage terms (e.g., the second column has the entry 1.54263 on 31/12/2019). The third column always has the entry 1.
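For users who build onr.csv from another data source, the following sketch checks a table against the layout described in the comment above. The column names, the percentage convention, and the example entry come from this README; the helper check_onr, the extra column rate_decimal, and the way the date is written in the example row are assumptions.

```python
import pandas as pd


def check_onr(onr):
    """Validate an overnight-rate table and add the rate as a decimal."""
    expected = ["date", "rate", "days"]
    if list(onr.columns) != expected:
        raise ValueError(f"expected columns {expected}, got {list(onr.columns)}")
    if not (onr["days"] == 1).all():
        raise ValueError("the days column should contain 1 in every row")
    out = onr.copy()
    # Rates are quoted in percent, e.g. 1.54263 means 1.54263%.
    out["rate_decimal"] = out["rate"] / 100.0
    return out


# One illustrative row; the date string format in the real file is an assumption.
example = pd.DataFrame({"date": ["31/12/2019"], "rate": [1.54263], "days": [1]})
print(check_onr(example))
```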

Known issues

We use a business day convention when counting and offsetting days, where business days consist of all weekdays. However, the stock/option trading days are a subset of the business days because of certain public holidays. For instance, Martin Luther King Day is not a trading day on the Chicago Board Options Exchange, where the S&P 500 options are traded. The current code does not take this difference into account, and hence unnecessarily removes samples when it cannot obtain the stock/option price at the end of a hedging period. This problem has no significant impact on the results and conclusions presented here, as it only reduces the sample size, and only by a minuscule amount.
We use continuous compounding in this code for computing the single-period return on the risk-free asset. However, Equation (1) in our paper uses simple compounding. We acknowledge this inconsistency, but it changes the results only by a minuscule amount.
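Both issues can be illustrated with a short sketch. It is not the repository's code: the use of numpy.busday_offset, the one-day hedging period, and the 1/360 year fraction are assumptions chosen for this example, while the rate 1.54263% is the example entry quoted in the data section of this README.

```python
import math

import numpy as np

# Issue 1: weekdays versus trading days.
friday = np.datetime64("2019-01-18")
mlk_day = np.datetime64("2019-01-21")  # Martin Luther King Day 2019, a Monday

# Weekday-only convention: the next business day is the holiday itself,
# for which no stock/option price exists.
weekday_next = np.busday_offset(friday, 1)
# Holiday-aware convention: the holiday is skipped.
trading_next = np.busday_offset(friday, 1, holidays=[mlk_day])
print(weekday_next, trading_next)


# Issue 2: simple versus continuous compounding over one period.
def simple_return(rate, dt):
    """Single-period return on the risk-free asset with simple compounding."""
    return rate * dt


def continuous_return(rate, dt):
    """Single-period return on the risk-free asset with continuous compounding."""
    return math.expm1(rate * dt)


rate = 0.0154263   # 1.54263% expressed as a decimal
dt = 1.0 / 360     # assumed year fraction of one day
gap = continuous_return(rate, dt) - simple_return(rate, dt)
print(gap)
```

The gap between the two returns is of second order in rate * dt, which is why the choice of compounding convention hardly affects the results.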

We thank Yiren Wu and Max Yang for reporting these two issues to us.

Package information

Package	Version
Python	3.7
Numpy	1.19
Pandas	1.2
Scikit-learn	0.24
Scipy	1.6
Seaborn	0.11
Tensorflow	2.4
