Identify, review, and remove sensitive files.
Bulk Reviewer is intended to help librarians, archivists, and others identify, review, and remove sensitive files in directories and disk images. It is built using Django, Django REST Framework, Celery, Django Channels, and Vue.js. Bulk Reviewer scans directories and disk images for personally identifiable information (PII) and other sensitive information using bulk_extractor, a best-in-class digital forensics tool, and can optionally extract named entities (personal names as well as nationalities, religions, and political affiliations) using spaCy and Apache Tika. Results are presented in a review dashboard that makes it easier to detect and dismiss false positives. The dashboard also provides reporting and lets you export files from directories and disk images, separating problematic files from those that are free of sensitive information.
Initial development occurred while the author, Tim Walsh, was a 2018 Summer Fellow at the Library Innovation Lab at Harvard University. The application is currently under active development, and is still in the exploratory/prototype phase.
Interested in getting involved? Get in touch!
Minimum resources to allocate to Docker for development (higher specs are recommended; bulk_extractor performance will increase linearly with additional CPUs, and spaCy NLP performance will increase with additional RAM):
- 2 CPUs
- 8 GB RAM
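One way to confirm that Docker has been given at least these resources is to read the `NCPU` and `MemTotal` fields that `docker info` reports. The sketch below is not part of Bulk Reviewer; the function names are illustrative. It reads "8 GB" as decimal gigabytes, which is an assumption, because Docker usually reports slightly less memory than the configured figure.

```python
import json
import subprocess
from typing import Tuple

MIN_CPUS = 2
# "8 GB" read as decimal gigabytes (an assumption); Docker usually reports
# a little less memory than the amount configured in its settings.
MIN_RAM_BYTES = 8 * 1000**3


def meets_minimum(ncpu: int, mem_total_bytes: int) -> bool:
    """Return True if the given resources satisfy the documented minimums."""
    return ncpu >= MIN_CPUS and mem_total_bytes >= MIN_RAM_BYTES


def docker_resources() -> Tuple[int, int]:
    """Ask the Docker daemon for its CPU count and total memory in bytes."""
    out = subprocess.run(
        ["docker", "info", "--format", "{{json .}}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    info = json.loads(out)
    return int(info["NCPU"]), int(info["MemTotal"])
```

With Docker running, `meets_minimum(*docker_resources())` indicates whether the current allocation clears the bar.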
Clone repository to local machine
git clone https://github.com/timothyryanwalsh/bulk-reviewer.git
cd bulk-reviewer
Set environment variables
This script copies br.env, which is used to set environment variables for Django and Postgres. To modify the values for the Postgres database name, username, and password, change the appropriate values in the resulting br.env file prior to starting the Docker containers.
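If you would rather script the change than edit br.env by hand, a minimal sketch follows. It assumes br.env uses plain `KEY=VALUE` lines. The key names in the usage note (`POSTGRES_DB`, `POSTGRES_USER`, `POSTGRES_PASSWORD`) are the ones read by the official Postgres Docker image and may not match the names this project actually uses, so check the file first.

```python
from typing import Dict


def set_env_values(env_text: str, updates: Dict[str, str]) -> str:
    """Return env-file text with the given keys set to new values.

    Comments and unrelated lines are kept as-is. Keys that are not
    already present are appended at the end.
    """
    remaining = dict(updates)
    out_lines = []
    for line in env_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in remaining:
                # Replace the whole line with the updated assignment.
                out_lines.append(f"{key}={remaining.pop(key)}")
                continue
        out_lines.append(line)
    # Append any keys that were not found in the original text.
    for key, value in remaining.items():
        out_lines.append(f"{key}={value}")
    return "\n".join(out_lines) + "\n"
```

For example, reading br.env into a string, passing it through `set_env_values(text, {"POSTGRES_DB": "bulk_reviewer_dev"})`, and writing the result back would change only that one assignment (the database name here is hypothetical).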
Configure docker-compose volumes
The current configuration for sharing data between the host development machine and the worker Docker containers assumes a macOS development environment. Development in Linux or Windows may require changing the volume configuration to be appropriate for the local machine:
- /Users:/Users
- /Volumes:/Volumes
These two volumes are used only for giving the server containers access to data to scan with bulk_extractor. For development on Linux or Windows, change these two volumes for the server containers in docker-compose.yml to appropriate values such as:
- /home:/home (Linux)
- C:\Users:/Users (Windows)
This determines which directories and files on the host machine are accessible to Bulk Reviewer.
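The per-platform choices described above can be summarized in a small lookup. This is only an illustrative sketch, not part of the project: the function names are hypothetical, and the mappings are the examples given in this README rather than a complete list.

```python
import platform
from typing import List

# Example mappings taken from this README, keyed by platform.system() values.
VOLUMES_BY_SYSTEM = {
    "Darwin": ["/Users:/Users", "/Volumes:/Volumes"],  # macOS (current default)
    "Linux": ["/home:/home"],
    "Windows": ["C:\\Users:/Users"],
}


def suggested_volumes(system: str = "") -> List[str]:
    """Return the example volume mappings for a platform.

    If `system` is empty, the current platform is detected.
    """
    system = system or platform.system()
    try:
        return list(VOLUMES_BY_SYSTEM[system])
    except KeyError:
        raise ValueError(f"no suggested volumes for {system!r}") from None


def as_compose_lines(volumes: List[str], indent: int = 6) -> str:
    """Format mappings as YAML list items to paste under a `volumes:` key."""
    return "\n".join(" " * indent + "- " + v for v in volumes)
```

Printing `as_compose_lines(suggested_volumes())` gives lines that can be pasted into docker-compose.yml. Adjust the indent to match the file.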
docker-compose up -d
The first time you do this, it will take a while (on my laptop, around 10 minutes). Much of this time is spent installing dependencies for and then building bulk_extractor.
Start frontend development server
Install node_modules (required first time only):
Start webpack dev server:
npm run serve
Open application in browser
Open 127.0.0.1:8080 in your browser.
To run Tox/flake8 locally, create a /usr/share/bulk_extractor (Linux) or /usr/local/share/bulk_extractor (macOS) directory and move the helper scripts in the server/bulk_extractor directory into it. This will place required Python DFXML libraries into their expected paths; otherwise you may encounter module import errors.
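The directory setup described above could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, it moves only the regular files found at the top level of the source directory, and writing under /usr typically requires elevated privileges.

```python
import platform
import shutil
from pathlib import Path
from typing import List


def default_target(system: str = "") -> Path:
    """Return the expected bulk_extractor share directory for a platform."""
    system = system or platform.system()
    if system == "Darwin":  # macOS
        return Path("/usr/local/share/bulk_extractor")
    return Path("/usr/share/bulk_extractor")  # Linux


def install_helper_scripts(source_dir: Path, target_dir: Path) -> List[Path]:
    """Create target_dir if needed and move top-level files into it."""
    target_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materializes the listing before any file is moved.
    for item in sorted(source_dir.iterdir()):
        if item.is_file():
            dest = target_dir / item.name
            shutil.move(str(item), str(dest))
            moved.append(dest)
    return moved
```

From the repository root, `install_helper_scripts(Path("server/bulk_extractor"), default_target())` would perform the move. Because the files leave the working tree, you may prefer to copy them instead.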