# Seeing financial transactions in lower dimensions with neural networks
- File Descriptions
- Supporting Packages
- How To Use This Repository
- Project Motivation
- About The Dataset
- Results
- Acknowledgements
- Licence & copyright
## File Descriptions

File | Description |
---|---|
data/city_payments_fy2017.csv | raw payments data: department, transaction amount, date, item purchased, etc. |
ASWA.ipynb | Jupyter notebook used to develop the analysis |
preprocessing.py | ETL module that prepares the data for the neural networks |
models.py | module containing the autoencoder classes and analysis methods |
## Supporting Packages

In addition to the standard Python library, this analysis relies on several third-party packages. Please see requirements.txt for a complete list of packages and dependencies used in the making of this project.
## How To Use This Repository

1. Download and unzip this repository to your local machine.

2. Navigate to this directory and open the command line. For the purposes of running the scripts, this will be the root directory.

3. Create a virtual environment to store the supporting packages:

   ```
   python -m venv ./venv
   ```

4. Activate the virtual environment (on Windows):

   ```
   venv\Scripts\activate
   ```

   On macOS/Linux, use `source venv/bin/activate` instead.

5. Install the supporting packages from the requirements.txt file:

   ```
   pip install -r requirements.txt
   ```

6. To run the ETL pipeline that cleans the data and pickles it, type the following in the command line:

   ```
   python preprocessing.py data/city_payments_fy2017.csv
   ```

7. To train a traditional autoencoder and save the model locally, type the following in the command line:

   ```
   python models.py data/philly_payments_clean ae 5
   ```

   Note: This is provided as an example; you can also choose "vae" and a number of epochs other than 5. When training is complete, an HTML file is generated providing a visualization of the embedded transaction data.
## Project Motivation

Auditing standards require the assessment of the underlying transactions that comprise the financial statements to detect errors or fraud that would result in material misstatement. The accounting profession has developed a framework for addressing this requirement, known as the Audit Risk Model.
The Audit Risk Model defines audit risk as the combination of inherent risk, control risk, and detection risk:
Detection risk is composed of sampling risk and non-sampling risk:
Sampling risk is defined as the risk that the auditor's conclusion based on the sample would be different had the entire population been tested. In other words, it is the risk that the sample is not representative of the population and does not provide sufficient appropriate audit evidence to detect material misstatements.
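A small simulation makes sampling risk concrete. The figures below (a 2% error rate, samples of 25 items) are illustrative assumptions, not drawn from this project's data:

```python
import random

random.seed(7)

# Hypothetical population: 10,000 transactions, of which 2% contain an error.
population = [1] * 200 + [0] * 9800
random.shuffle(population)

# Draw many random samples of 25 items and count how often the sample
# contains no errors at all, i.e. cases where the auditor would wrongly
# conclude the population is clean.
trials = 1000
misses = sum(
    1
    for _ in range(trials)
    if sum(random.sample(population, 25)) == 0
)
miss_rate = misses / trials  # expected near (1 - 0.02) ** 25, roughly 0.6
```

Even though errors exist in the population, a majority of small samples contain none of them, which is exactly the risk the standards describe.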
There are a variety of sampling methods used by auditors. Random sampling is based on each member of the population having an equal chance of being selected. Stratified sampling subdivides the population into homogeneous groups from which to make selections. Monetary unit sampling treats each dollar amount from the population as the sampling unit and selects items when a cumulative total meets or exceeds a predefined sampling interval when cycling through the population.
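The monetary unit sampling procedure described above can be sketched in a few lines of Python (the function name and amounts are illustrative, not taken from this repository's code):

```python
def monetary_unit_sample(amounts, sampling_interval):
    """Monetary unit sampling: treat each dollar as a sampling unit.

    Walk the cumulative total of transaction amounts and select an item
    each time the running total meets or exceeds the next multiple of
    the sampling interval. Items large enough to span several intervals
    are still selected only once.
    """
    selected = []
    cumulative = 0
    next_threshold = sampling_interval
    for index, amount in enumerate(amounts):
        cumulative += amount
        while cumulative >= next_threshold:
            if index not in selected:
                selected.append(index)
            next_threshold += sampling_interval
    return selected
```

Because every dollar is a sampling unit, larger transactions are proportionally more likely to be selected, which is why auditors favor this method for detecting overstatement.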
Autoencoders offer an alternative method for addressing sampling risk. An autoencoder is a neural network that learns to encode data into lower dimensions and decode it back into higher dimensions. The resulting model provides a low-dimensional representation of the data, disentangling it in a way that reveals something about its fundamental structure. Auditors can model transactions in this way and select from low-dimensional clusters. They can also identify anomalous transactions based on how much they deviate from other transactions in this latent space.
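As a minimal, self-contained illustration of the encode/decode idea, here is a toy NumPy sketch with linear layers and synthetic data. It is not the code in models.py; the shapes, learning rate, and training loop are assumptions for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for preprocessed transaction features (200 rows, 8 columns);
# the real inputs would come from preprocessing.py.
X = rng.normal(size=(200, 8))

# A deliberately tiny autoencoder: encode 8 features down to a 2-D latent
# space and decode back, using linear layers for brevity.
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))

mse_before = ((X @ W_enc @ W_dec - X) ** 2).mean()

lr = 0.01
for _ in range(500):
    Z = X @ W_enc            # encode: low-dimensional representation
    X_hat = Z @ W_dec        # decode: reconstruct the original features
    err = X_hat - X
    # Gradient descent on the mean squared reconstruction error.
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse_after = ((X @ W_enc @ W_dec - X) ** 2).mean()

# Per-transaction reconstruction error can serve as an anomaly score:
# transactions the model reconstructs poorly deviate from the rest.
scores = ((X @ W_enc @ W_dec - X) ** 2).mean(axis=1)
```

The 2-D matrix `Z` is the latent representation auditors could cluster and sample from, and `scores` ranks transactions by how anomalous they appear to the model.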
In this demonstration, we consider the traditional autoencoder:
as well as a variational autoencoder:
There is an important difference in the architecture of the variational autoencoder when compared to the traditional autoencoder. The output layer of the encoder is typically bifurcated into two sets of nodes: one representing the means, the other representing the variances, of each dimension in the latent space (e.g., two dimensions in Figure 2). Further, the latent matrix Z is determined by sampling from some distribution (e.g., a normal distribution) parameterized by these means and variances.
It is not actually feasible to backpropagate errors through the configuration in Figure 2, because the sampling operation is not differentiable. In practice, we instead sample a random matrix epsilon from a standard normal distribution, multiply it element-wise by the derived standard deviations, and element-wise add the result to the means to obtain the latent matrix Z. This is a scale-location transformation of some distribution parameterized by the means and variances. Here the distribution being scale-location transformed is Gaussian with mean 0 and variance 1, but a different distribution could be used.
Finally, because the output layer for the variances spans a range that includes negative values, we interpret these values as the natural logarithm of the variances and exponentiate them to get the variances.
Structuring the network in this way allows for both stochastic sampling and differentiation with respect to the means and variances of the latent space.
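The reparameterization described above can be sketched in a few lines of NumPy. The variable names (`mu`, `log_var`, `eps`) and the batch shape are assumptions for illustration, not the repository's code:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative encoder outputs for a batch of 4 transactions with a
# 2-D latent space: one set of nodes for the means, one for the
# log-variances (which may be negative).
mu = rng.normal(size=(4, 2))
log_var = rng.normal(size=(4, 2))

# Sample epsilon from N(0, 1), then apply the scale-location transform:
# exponentiating half the log-variance recovers the standard deviation,
# and gradients can flow through mu and log_var.
eps = rng.standard_normal(size=mu.shape)
sigma = np.exp(0.5 * log_var)   # always positive
z = mu + sigma * eps            # element-wise scale and shift
```

Because the randomness is isolated in `eps`, the latent matrix `z` is a differentiable function of the means and log-variances, which is what makes training by backpropagation possible.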
## About The Dataset

To demonstrate how autoencoders work, we analyze the City of Philadelphia payments data. It is one of two datasets used in Schreyer et al. (2020) and consists of nearly a quarter-million payments from 58 city offices, departments, boards, and commissions. It covers the City's fiscal year 2017 (July 2016 through June 2017) and represents nearly $4.2 billion in payments during that period.
## Results

To read more about this project, check out this Medium post.
## Acknowledgements

This project is largely inspired by a paper published in 2020 by Marco Schreyer, Timur Sattarov, Anita Gierbl, Bernd Reimer, and Damian Borth, entitled Learning Sampling in Financial Statement Audits using Vector Quantised Autoencoder Neural Networks. Additional thanks go to Leena Shekhar for a clear and concise explanation on Deriving KL Divergence for Gaussians, and to Alfredo Canziani for creating intuitive videos on autoencoders (among many other very interesting topics) for the deep learning course at NYU.
## Licence & copyright

© Zachary Wolinsky 2022 - 2023

Licensed under the MIT License