Identify, review, and remove sensitive information from directories and disk images.

Bulk Reviewer

Identify, review, and remove sensitive files.

Bulk Reviewer is intended to help librarians, archivists, and others identify, review, and remove sensitive files in directories and disk images. It is built using Django, Django Rest Framework, Celery, Django Channels, and Vue.js. Bulk Reviewer scans directories and disk images for personally identifiable information (PII) and other sensitive information using bulk_extractor, a best-in-class digital forensics tool, and can optionally extract named entities (personal names as well as nationalities, religions, and political affiliations) using spaCy and Apache Tika. Results are presented in a review dashboard that makes false positives easier to detect and dismiss. The dashboard also provides reporting and can export files from directories and disk images, separating problematic files from those that are free of sensitive information.

Initial development occurred while the author, Tim Walsh, was a 2018 Summer Fellow at the Library Innovation Lab at Harvard University. The application is currently under active development, and is still in the exploratory/prototype phase.

Interested in getting involved? Get in touch!

Development installation

Requires Docker CE, Docker Compose, and Vue CLI 3.

Minimum resources required to be allocated to Docker for development (higher specs are recommended; bulk_extractor performance will increase linearly with additional CPUs, spaCy NLP performance will increase with additional RAM):

  • 2 CPUs
  • 8 GB RAM
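
The minimums above can be checked before starting the containers. The following is a minimal sketch, not part of Bulk Reviewer: the function name `meets_minimum` is hypothetical, and the 8 GB threshold is expressed as 8 × 10^9 bytes on the assumption that Docker may report slightly less memory than the amount allocated.

```shell
# Minimums stated in this README: 2 CPUs and 8 GB RAM allocated to Docker.
MIN_CPUS=2
# Assumption: 8 GB is treated as 8 * 10^9 bytes, a deliberately lenient threshold.
MIN_MEM_BYTES=$((8 * 1000 * 1000 * 1000))

# meets_minimum CPUS MEM_BYTES
# Succeeds (exit status 0) only if both values reach the minimums above.
meets_minimum() {
  [ "$1" -ge "$MIN_CPUS" ] && [ "$2" -ge "$MIN_MEM_BYTES" ]
}
```

With the Docker daemon running, the function could be fed from `docker info`, for example `meets_minimum "$(docker info --format '{{.NCPU}}')" "$(docker info --format '{{.MemTotal}}')" && echo "ok"`.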

Clone repository to local machine

git clone https://github.com/timothyryanwalsh/bulk-reviewer.git
cd bulk-reviewer

Set environment variables

bash init_default.sh

This script copies server/br_sample.env to br.env, which is used to set environment variables for Django and Postgres. To modify the values for the Postgres database name, username, and password, change the appropriate values in the resulting br.env file prior to starting the Docker containers.
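
Because br.env may hold edited database credentials, it can be useful to avoid overwriting it by accident. The sketch below is not the contents of the project's init script; it is a hypothetical helper, `copy_env`, showing the same copy step done by hand with a guard that leaves an existing target file alone. Both paths are passed as arguments, since the exact location of br.env is an assumption here.

```shell
# copy_env SAMPLE TARGET
# Copies SAMPLE to TARGET unless TARGET already exists, so that values
# edited in TARGET are not lost when the step is repeated.
copy_env() {
  if [ -e "$2" ]; then
    echo "$2 already exists; leaving it untouched" >&2
    return 0
  fi
  cp "$1" "$2"
}
```

For example, `copy_env server/br_sample.env br.env` run from the repository root, followed by editing the Postgres values in the resulting file before starting the containers.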

Configure docker-compose volumes

The current configuration for sharing data between the host development machine and the server and worker Docker containers assumes a macOS development environment. Development in Linux or Windows may require changing the volume configuration to be appropriate for the local machine:

- /Users:/Users
- /Volumes:/Volumes

These two volumes are used only to give the worker and server containers access to data to scan with bulk_extractor. For development on Linux or Windows, change these two volumes for the worker and server containers in docker-compose.yml to appropriate values, such as - /home:/home (Linux) or - C:\Users:/Users (Windows). The volumes you configure determine which directories and files on the host machine are accessible to Bulk Reviewer.
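
As a rough aid, the platform-specific volume entries mentioned above can be printed from the host's `uname -s` value. This is a sketch only: `suggest_volumes` is a hypothetical helper, the mappings are just the example values given in this README, and the `MINGW*`, `MSYS*`, and `CYGWIN*` patterns assume a Windows shell such as Git Bash or Cygwin.

```shell
# suggest_volumes OS_NAME
# Prints the example volume entries from this README for the given OS name.
suggest_volumes() {
  case "$1" in
    Darwin)
      printf '%s\n' '- /Users:/Users' '- /Volumes:/Volumes' ;;
    Linux)
      printf '%s\n' '- /home:/home' ;;
    MINGW*|MSYS*|CYGWIN*)
      # %s prints the backslash literally, without escape processing.
      printf '%s\n' '- C:\Users:/Users' ;;
    *)
      echo "no suggestion for OS: $1" >&2
      return 1 ;;
  esac
}

# Print the suggestion for the current host.
suggest_volumes "$(uname -s)"
```

The printed lines are meant to be pasted under the volumes key of both the worker and server services in docker-compose.yml, then adjusted to wherever the data to scan actually lives.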

Start containers

docker-compose up -d

The first time you do this, it will take a while (on my laptop, around 10 minutes). Much of this time is spent installing dependencies for and then building bulk_extractor.

Start frontend development server

cd client

Install node_modules (required first time only):

npm install

Start webpack dev server:

npm run serve

Open application in browser

Open 127.0.0.1:8080 in your browser.

Tox

To run Tox/flake8 locally, create a /usr/share/bulk_extractor (Linux) or /usr/local/share/bulk_extractor (macOS) directory and move the helper scripts in the server/bulk_extractor directory into it. This will place required Python DFXML libraries into their expected paths; otherwise you may encounter module import errors.
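
The setup described above could be scripted roughly as follows. This is a sketch under stated assumptions: `helper_dir` and `install_helpers` are hypothetical names, and the helper scripts are copied rather than moved so that the repository checkout stays intact.

```shell
# helper_dir OS_NAME
# Prints the directory this README names for the given platform.
helper_dir() {
  case "$1" in
    Linux)  echo /usr/share/bulk_extractor ;;
    Darwin) echo /usr/local/share/bulk_extractor ;;
    *)      return 1 ;;
  esac
}

# install_helpers SRC DEST
# Creates DEST if needed and copies the contents of SRC into it.
# Note: this copies instead of moving, unlike the wording above.
install_helpers() {
  mkdir -p "$2" && cp -R "$1"/. "$2"/
}
```

From the repository root, this might be used as `install_helpers server/bulk_extractor "$(helper_dir "$(uname -s)")"`, which will probably need elevated privileges because the target directories sit under /usr, followed by running `tox`.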

Logo design

Bailey McGinn