
Audit Sampling With Autoencoders

Seeing financial transactions in lower dimensions with neural networks

Table of Contents

  1. File Descriptions
  2. Supporting Packages
  3. How To Use This Repository
  4. Project Motivation
  5. About The Dataset
  6. Results
  7. Acknowledgements
  8. License & copyright

File Descriptions

| File | Description |
| --- | --- |
| data/city_payments_fy2017.csv | features: dept, trans amount, date, purchased item, etc. |
| ASWA.ipynb | Jupyter notebook used to develop the analysis |
| preprocessing.py | module for ETL; prepares data for the neural network |
| models.py | module with autoencoder classes and methods for analysis |

Supporting Packages

In addition to the standard Python library, this analysis relies on several third-party packages. Please see requirements.txt for a complete list of packages and dependencies used in the making of this project.

How To Use This Repository

  1. Download and unzip this repository to your local machine.

  2. Navigate to this directory and open the command line. For the purposes of running the scripts, this will be the root directory.

  3. Create a virtual environment to store the supporting packages:

     python -m venv ./venv
    
  4. Activate the virtual environment. On Windows:

     venv\Scripts\activate

     On macOS or Linux:

     source venv/bin/activate
    
  5. Install the supporting packages from the requirements.txt file:

     pip install -r requirements.txt
    
  6. To run the ETL pipeline that cleans data and pickles it, type the following in the command line:

     python preprocessing.py data/city_payments_fy2017.csv
    
  7. To train a traditional autoencoder and save the model locally, type the following in the command line:

     python models.py data/philly_payments_clean ae 5
    

Note: This is provided as an example; you can also choose "vae" and a number of epochs other than 5. When training is complete, an HTML file is generated providing a visualization of the embedded transaction data.

Project Motivation

Auditing standards require the assessment of the underlying transactions that comprise the financial statements to detect errors or fraud that would result in material misstatement. The accounting profession has developed a framework for addressing this requirement, known as the Audit Risk Model.

The Audit Risk Model defines audit risk as the combination of inherent risk, control risk, and detection risk:

    Audit Risk = Inherent Risk × Control Risk × Detection Risk

Detection risk, in turn, is composed of sampling risk and non-sampling risk.

Sampling risk is defined as the risk that the auditor's conclusion based on the sample would be different had the entire population been tested. In other words, it is the risk that the sample is not representative of the population and does not provide sufficient appropriate audit evidence to detect material misstatements.

There are a variety of sampling methods used by auditors. Random sampling gives each member of the population an equal chance of being selected. Stratified sampling subdivides the population into homogeneous groups from which to make selections. Monetary unit sampling treats each dollar in the population as the sampling unit, selecting an item whenever the cumulative total meets or exceeds a predefined sampling interval as the auditor cycles through the population, as sketched below.
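
As a simple illustration of the monetary unit sampling logic described above, consider the following sketch (the function name and example values are illustrative, not part of this repository):

    def monetary_unit_sample(amounts, interval):
        """Select items whose cumulative dollar total meets or exceeds
        each successive multiple of the sampling interval."""
        selected = []
        cumulative = 0.0
        next_threshold = interval
        for i, amount in enumerate(amounts):
            cumulative += amount
            if cumulative >= next_threshold:
                selected.append(i)
                # Advance the threshold past the current cumulative total,
                # so an item spanning several intervals is selected once.
                while next_threshold <= cumulative:
                    next_threshold += interval
        return selected

    # Example: with a $1,000 interval, the cumulative total crosses a
    # threshold at items 1 and 3.
    print(monetary_unit_sample([400.0, 700.0, 300.0, 900.0], 1000.0))  # [1, 3]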

Autoencoders offer an alternative method for addressing sampling risk. An autoencoder is a neural network that learns to encode data into lower dimensions and decode it back into higher dimensions. The resulting model provides a low-dimensional representation of the data, disentangling it in a way that reveals something about its fundamental structure. Auditors can model transactions in this way and select from low-dimensional clusters. They can also identify anomalous transactions based on how much they deviate from other transactions in this latent space.
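
For instance, once transactions are embedded, deviation can be scored directly in the latent space. Here is a minimal sketch, assuming the embeddings are available as a NumPy array (the function flag_anomalies is illustrative, not taken from models.py):

    import numpy as np

    def flag_anomalies(z, k=10):
        """Return the indices of the k transactions farthest from the
        centroid of the latent space, as a simple anomaly score."""
        centroid = z.mean(axis=0)                         # center of the embeddings
        distances = np.linalg.norm(z - centroid, axis=1)  # Euclidean distance per row
        return np.argsort(distances)[-k:]                 # k largest distances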

In this demonstration, we consider the traditional autoencoder (Figure 1) as well as a variational autoencoder (Figure 2).

There is an important difference in the architecture of the variational autoencoder when compared to the traditional autoencoder. The output layer of the encoder is typically bifurcated into two sets of nodes: one representing the means, the other representing the variances, of each dimension in the latent space (e.g., two dimensions in Figure 2). Further, the latent matrix Z is determined by sampling from some distribution (e.g., a normal distribution) parameterized by these means and variances.

Now, it is not actually feasible to backpropagate errors through the configuration in Figure 2, because the sampling operation is not differentiable. In practice, we instead sample a random matrix epsilon from a standard normal distribution, scale it element-wise by the derived standard deviations, and add the result element-wise to the means to obtain the latent matrix Z. This is known as the reparameterization trick: a scale-location transformation of a distribution parameterized by the means and variances. Here the base distribution is a Gaussian with mean 0 and variance 1, but a different distribution could be used.

Finally, because the output layer for the variances spans a range that includes negative values, we interpret these values as the natural logarithm of the variances and exponentiate them to get the variances.

Structuring the network in this way allows for both stochastic sampling and differentiation with respect to the means and variances of the latent space.
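
Putting these pieces together, the reparameterization step can be sketched in a few lines of PyTorch (a generic illustration of the technique, not necessarily the exact code in models.py):

    import torch

    def reparameterize(mu, log_var):
        """Sample Z = mu + sigma * epsilon, which remains differentiable
        with respect to the means and log-variances."""
        std = torch.exp(0.5 * log_var)  # exponentiate half the log-variance to get sigma
        eps = torch.randn_like(std)     # epsilon ~ N(0, I), sampled outside the graph
        return mu + eps * std           # element-wise scale-location transformation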

About The Dataset

To demonstrate how autoencoders work, we analyze the City of Philadelphia payments data. It is one of two datasets used in Schreyer et al. (2020) and consists of nearly a quarter-million payments from 58 city offices, departments, boards, and commissions. It covers the City's fiscal year 2017 (July 2016 through June 2017) and represents nearly $4.2 billion in payments during that period.

Results

To read more about this project, check out this Medium post.

Acknowledgements

This project is largely inspired by a paper published in 2020 by Marco Schreyer, Timur Sattarov, Anita Gierbl, Bernd Reimer, and Damian Borth, entitled Learning Sampling in Financial Statement Audits using Vector Quantised Autoencoder Neural Networks. Additional thanks go to Leena Shekhar for a clear and concise explanation of Deriving KL Divergence for Gaussians, and to Alfredo Canziani for creating intuitive videos on autoencoders (among many other very interesting topics) for the deep learning course at NYU.

License & copyright

© Zachary Wolinsky 2022 - 2023

Licensed under the MIT License
