A project to collect reports from the offices of Inspectors General across the US federal government.
For more information about the project, read:
- Opening up government reports through teamwork and open data
- Why we've collected a hojillion inspector general reports
What's an inspector general?
From one of the above pieces:
Just about every agency in the federal government has an independent unit, usually called the Office of the Inspector General, dedicated to independent oversight. This includes regular audits of the agency's spending, monitoring of active government contractors and investigations into wasteful or corrupt agency practices. They ask tough questions, carry guns, and sue people.
How you can help
The initial round of writing scrapers for all 65 federal IGs has come to a close. However, there are two important areas we need help in:
- Keeping the scrapers working. They're scrapers: they break. Check the Oversight.garden dashboard for scrapers in need of attention, or check the issues list for other tasks.
- Just as importantly, sending in reports we can't scrape.
There are 9 IGs who do not publish reports online, many from the US government's intelligence community.
- Architect of the Capitol
- Capitol Police
- Central Intelligence Agency
- Defense Intelligence Agency
- National Geospatial-Intelligence Agency
- National Reconnaissance Office
- National Security Agency
- Intelligence Community (ODNI)
Generally, getting their reports means filing Freedom of Information Act requests, or finding the results of FOIA requests others have already made.
We also need unpublished reports from the other 65 IGs! We're scraping what they publish online, but most IGs do not proactively publish all of their reports.
Submitting IG reports
Scraping IG reports
Python 3: This project uses Python 3, and is tested on Python 3.4.0. If you don't have Python 3 installed, check out pyenv and pyenv-virtualenvwrapper for easily installing and switching between multiple versions of Python.
- To extract PDFs (the most common type of report), you'll need qpdf. On Ubuntu, apt-get install poppler-utils qpdf. On OS X, brew install poppler qpdf.
- To extract DOCs, you'll need abiword, which you can install via
- Install all the pip dependencies by running pip install -r requirements.txt
To run an individual IG scraper, just execute its file directly. For example:
This will fetch the current year's reports from the Inspector General for the US Postal Service and write them to disk, along with JSON metadata.
If you want to go back further, use --year to specify a year or range:
If you want to run multiple IG scrapers in a row, use the igs script. By default, the igs script runs all scrapers. It takes the following arguments:
- --safe: Limit scrapers to those declared in safe.yml. The idea is for "safe" scrapers to be appropriate for clients who wish to fully automate their report pipeline in a stable way, without human intervention when new IGs are added.
- --only: Limit scrapers to a comma-separated list of names. For example,
- --data-directory: The directory path to store the output files. Defaults to data in the current working directory.
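The two filtering arguments can be pictured as successive restrictions on the list of scrapers. The sketch below is illustrative only and is not the igs script's actual code: `select_scrapers` and its parameters are hypothetical names, it assumes the contents of safe.yml have already been loaded into a plain list of scraper names, and it assumes the two filters combine when both are given.

```python
def select_scrapers(all_scrapers, safe_list=None, only=None):
    """Return the scraper names to run, keeping their original order.

    all_scrapers: every available scraper name (hypothetical input).
    safe_list:    names declared safe, or None when --safe is not given.
    only:         the raw comma-separated --only value, or None.
    """
    selected = list(all_scrapers)
    if safe_list is not None:
        allowed = set(safe_list)
        selected = [name for name in selected if name in allowed]
    if only:
        # Split the comma-separated value and ignore stray whitespace.
        wanted = {name.strip() for name in only.split(",") if name.strip()}
        selected = [name for name in selected if name in wanted]
    return selected


# "example" is a placeholder name, not a real scraper.
scrapers = ["usps", "treasury", "example"]
print(select_scrapers(scrapers))
print(select_scrapers(scrapers, only="usps, example"))
print(select_scrapers(scrapers, safe_list=["treasury"], only="usps,treasury"))
```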
Using the data
Reports are broken up by IG and by year. So a USPS IG report from 2013 with a scraper-determined ID of no-ar-13-010 will create the following files:
/data/usps/2013/no-ar-13-010/report.json
/data/usps/2013/no-ar-13-010/report.pdf
/data/usps/2013/no-ar-13-010/report.txt
Metadata for a report is at report.json. The original report will be saved at report.pdf (the extension will match the original, so it may not be .pdf).
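For anyone consuming the data, the layout above can be turned into a small path helper. This is a sketch under stated assumptions rather than code from the project: `report_paths` is a hypothetical name, the `extension` parameter stands in for whatever file type the original report had, and treating report.txt as the extracted text is an inference from the file listing.

```python
import os


def report_paths(data_dir, inspector, year, report_id, extension="pdf"):
    """Build the expected file paths for one report (hypothetical helper)."""
    base = os.path.join(data_dir, inspector, str(year), report_id)
    return {
        "metadata": os.path.join(base, "report.json"),
        # The original keeps its own extension, so this is not always .pdf.
        "original": os.path.join(base, "report." + extension),
        # Assumed to hold the text extracted from the original report.
        "text": os.path.join(base, "report.txt"),
    }


paths = report_paths("data", "usps", 2013, "no-ar-13-010")
for kind, path in paths.items():
    print(kind, path)
```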
Every scraper will accept the following options:
- --year: A YYYY year; only fetch reports from this year.
- --since: A YYYY year; only fetch reports from this year onwards.
- --debug: Print extra output to STDOUT. (Can be quite verbose when downloading.)
- --dry_run: Scrape sites and write JSON metadata to disk, but don't download full reports or extract text.
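The two year options amount to choosing either a single year or every year from a starting point through the present. The following is a minimal sketch of that logic, not the scrapers' real implementation: `years_to_fetch` and its parameter names are chosen for illustration, the fallback to the current year follows the earlier note that a scraper fetches the current year's reports by default, and letting the "onwards" option win when both are supplied is an assumption. Year ranges passed to --year are not handled here.

```python
import datetime


def years_to_fetch(year=None, since=None, current_year=None):
    """Return the list of years a scraper run would cover (illustrative)."""
    if current_year is None:
        current_year = datetime.date.today().year
    if since is not None:
        # "From this year onwards": every year up to and including now.
        return list(range(int(since), current_year + 1))
    if year is not None:
        return [int(year)]
    # No option given: only the current year.
    return [current_year]


print(years_to_fetch(year="2013", current_year=2015))
print(years_to_fetch(since="2012", current_year=2015))
print(years_to_fetch(current_year=2015))
```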
Every report has an accompanying JSON file with metadata. That JSON file is an object with the following required fields:
- inspector: The handle you chose for the IG, e.g. "usps".
- inspector_url: The IG's primary website URL.
- agency: The handle of the agency the report relates to. This can be the same value as inspector, but it may differ -- some IGs monitor multiple agencies.
- agency_name: The full text name of an agency, e.g. "United States Postal Service".
- report_id: A string usable as an ID for the report.
- title: Title of report.
- published_on: Date of publication, in
Additionally, some information about report URLs is required. However, not all report contents are released: some are sensitive or classified, or require a FOIA request to obtain. Use these fields to handle report URLs:
- url: URL to the report itself. Required unless unreleased is True.
- landing_url: URL to some kind of landing page for the report.
- unreleased: Set to True if the report's contents are not fully released. If unreleased is True, url is optional and landing_url is required.
The JSON file may have arbitrary additional fields the scraper author thought worth keeping.
report_id must be unique within that IG, and should be stable and idempotent.
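A scraper author could check a metadata object against these rules before writing it to disk. The sketch below is one reading of the field requirements, not a validator shipped with the project: `metadata_problems` is a hypothetical name, the URL rule is interpreted as "url is needed for released reports, landing_url for unreleased ones", and the sample title, URLs and date are placeholders (the date format in particular is an assumption).

```python
REQUIRED_FIELDS = (
    "inspector", "inspector_url", "agency", "agency_name",
    "report_id", "title", "published_on",
)


def metadata_problems(report):
    """Return a list of human-readable problems; empty means it looks valid."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not report.get(field):
            problems.append("missing required field: " + field)
    if report.get("unreleased"):
        # Contents are withheld, so a landing page must stand in for them.
        if not report.get("landing_url"):
            problems.append("unreleased report needs landing_url")
    elif not report.get("url"):
        problems.append("released report needs url")
    return problems


sample = {
    "inspector": "usps",
    "inspector_url": "https://example.gov/",         # placeholder URL
    "agency": "usps",
    "agency_name": "United States Postal Service",
    "report_id": "no-ar-13-010",
    "title": "Example report title",                  # placeholder title
    "published_on": "2013-01-01",                     # placeholder date and format
    "url": "https://example.gov/no-ar-13-010.pdf",    # placeholder URL
}
print(metadata_problems(sample))
print(metadata_problems({**sample, "url": None, "unreleased": True}))
```

Uniqueness of report_id within an IG is not covered here, since it needs to be checked across reports rather than on a single object.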
Bulk data and backup
This project's chief maintainer, Eric Mill, runs a copy of this project on a server that automatically backs up the downloaded bulk data.
Data is backed up to the Internet Archive.
To back up individual reports as items in the collection, run the backup script.
This goes through all reports in data/ for which a report has been released (in other words, where unreleased is not true), and uploads their metadata and report data to the Internet Archive.
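The selection step described above can be sketched as a walk over the data directory that skips anything marked unreleased. This is an illustration only, not the backup script's code: `released_reports` is a hypothetical name, and the upload to the Internet Archive is deliberately left out.

```python
import json
import os
import tempfile


def released_reports(data_dir):
    """Yield (directory, metadata) for every report not marked unreleased."""
    for root, dirs, files in os.walk(data_dir):
        dirs.sort()  # walk in a predictable order
        if "report.json" not in files:
            continue
        with open(os.path.join(root, "report.json"), encoding="utf-8") as f:
            metadata = json.load(f)
        if metadata.get("unreleased") is True:
            continue  # contents not released, so there is nothing to upload
        yield root, metadata


# Demonstration on a throwaway directory holding one released and one
# unreleased report; the report IDs are made up for the example.
with tempfile.TemporaryDirectory() as data_dir:
    for report_id, unreleased in (("released-1", False), ("withheld-1", True)):
        report_dir = os.path.join(data_dir, "usps", "2013", report_id)
        os.makedirs(report_dir)
        with open(os.path.join(report_dir, "report.json"), "w", encoding="utf-8") as f:
            json.dump({"report_id": report_id, "unreleased": unreleased}, f)
    found = [metadata["report_id"] for _, metadata in released_reports(data_dir)]

print(found)
```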
For example, the treasury IG's 2014 report OIG-14-023 can be found at:
To generate bulk data, the following command is run from the project's output data/ directory:
zip -r ../us-inspectors-general.bulk.zip * -x "*.done"
cd ..
./backup --bulk=us-inspectors-general.bulk.zip
Both zipping and uploading take a long time -- this is a several-hour process at minimum.
The process zips up the contents of the data/ directory, while excluding any .done files that track the status of individual file backups. The zip file is placed up one directory, so that it doesn't interfere with the automatic directory examination of data/ that many scripts employ.
Then the file is uploaded to the Internet Archive as part of the collection, to be a convenient bulk mirror of the entire thing.
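For readers without the zip command available, the archiving step can be approximated in Python. This is a swapped-in technique, not what the project runs: `zip_data_dir` is a hypothetical helper built on the standard zipfile module, it is not a byte-for-byte equivalent of the zip invocation, and like the original process it expects the archive to be written outside the directory being zipped.

```python
import os
import tempfile
import zipfile


def zip_data_dir(data_dir, zip_path):
    """Zip everything under data_dir except *.done files; return file count."""
    count = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, dirs, files in os.walk(data_dir):
            dirs.sort()
            for name in sorted(files):
                if name.endswith(".done"):
                    continue  # backup status markers are not part of the data
                full_path = os.path.join(root, name)
                # Store paths relative to data_dir, e.g. usps/2013/...
                archive.write(full_path, os.path.relpath(full_path, data_dir))
                count += 1
    return count


# Demonstration on a throwaway directory; the archive is written one level
# above data/, mirroring where the process above places it.
with tempfile.TemporaryDirectory() as tmp:
    report_dir = os.path.join(tmp, "data", "usps", "2013", "no-ar-13-010")
    os.makedirs(report_dir)
    for name in ("report.json", "report.json.done"):
        with open(os.path.join(report_dir, name), "w", encoding="utf-8") as f:
            f.write("{}")
    zip_path = os.path.join(tmp, "bulk.zip")
    written = zip_data_dir(os.path.join(tmp, "data"), zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()

print(written, names)
```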
[TBD: Proper collection landing page, and bulk data link.]
- Matt Rumsey kindly compiled a spreadsheet of IG offices. We used this to track activity during the initial scraping phase.
The project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.
All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.