Welcome to the Digital Forensics XML (DFXML) git repository.

DFXML is a file format designed to capture metadata and provenance information about the operation of software tools in a systematic fashion. The original motivation was to represent the output of digital forensics tools, specifically the SleuthKit tools. DFXML was then expanded to work with the bulk_extractor digital forensics tool, and later to cover the output of the tcpflow tool. With the lessons we learned from handling all of those programs, we were able to separate two uses of DFXML: documenting the runtime provenance of any program, and representing specific digital forensics artifacts such as files and hash sets.

This repository contains the original DFXML implementations in C and Python for writing DFXML files, as well as an assortment of tools (mostly in Python) for reading and processing DFXML files. The folder layout is as follows:

python/            - Python source files
python/dfxml/      - The Python DFXML module
python/dfxml/tests - Unit tests for the DFXML modules.
python/tools       - Tools written in Python for processing DFXML files.
python/tools/tests - Unit tests for the DFXML tools.
schema/            - The DFXML schema.  Not directly tracked; run `make schema-init` to retrieve.
src/               - The C language DFXML implementation for both writing and reading DFXML files. Includes a few tools, mostly demos.

Using this as a git submodule

Typically this DFXML module will be a submodule inside another git repository.

We've noticed that people typically start development in these submodules and then want to push the changes back to master. This causes a problem with git, because the development was not done at the head of a branch. If this happens to you, you will need to create a new branch at your current location, check out the master branch, and merge your branch into master. The git commands for this sequence are listed at the end of this section.


C++ projects typically include the files in src/ in their program and write a DFXML file manually, using the primitive XML writing tools that are included. These tools are not guaranteed to produce clean XML, but they can handle XML of any size.

Sometimes when working with DFXML as a submodule, you may move off the master branch and end up with a detached HEAD. If so, use these commands to get back onto master:

$ git checkout -b newbranch
$ git checkout master
$ git merge newbranch
$ git branch -d newbranch

or, more succinctly:

$ git checkout -b tmp  ; git checkout master ; git merge tmp ; git branch -d tmp
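Before running the sequence above, it can help to confirm that the checkout really is on a detached HEAD. The following is a minimal sketch and not part of this repository: `is_detached` is a hypothetical helper name, and the throwaway repository, the demo identity and the commit messages exist only for illustration. It relies on `git symbolic-ref -q HEAD`, which exits with a non-zero status when HEAD does not point at a branch.

```shell
#!/bin/sh
set -e

# Hypothetical helper: succeeds when the current checkout has a detached HEAD.
# git symbolic-ref exits non-zero when HEAD is not a symbolic ref to a branch.
is_detached() {
    ! git symbolic-ref -q HEAD >/dev/null
}

# Build a throwaway repository so the demonstration touches nothing real.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # name the initial branch "master"
git config user.name "Demo User"          # illustrative identity only
git config user.email "demo@example.com"
git commit -q --allow-empty -m "initial commit"

# Simulate the submodule situation: detach HEAD, then commit new work.
git checkout -q --detach
git commit -q --allow-empty -m "work done on a detached HEAD"

if is_detached; then
    # Same sequence as above: park the work on a temporary branch, then merge it.
    git checkout -q -b tmp
    git checkout -q master
    git merge -q tmp
    git branch -q -d tmp
fi

git symbolic-ref --short HEAD   # prints the current branch name
```

In a real submodule only the body of the `if` block is needed. The guard simply avoids creating and deleting a temporary branch when the checkout is already on a branch.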

Usage with the DFXML Schema

The DFXML schema is tracked here similarly to a Git submodule, but it does not use the Git submodule mechanism, which avoids some operational deployment issues. If you would like to check out the tracked schema version, run `make schema-init`. Checking it out is only necessary if you are testing validation of DFXML content against the schema.
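For testing validation, one common approach is to run `xmllint` (from libxml2) against the retrieved schema. The sketch below assumes `xmllint` is installed. `validate_dfxml` is a hypothetical helper name, and the location of the XSD file inside schema/ is deliberately left as an argument because it is not stated here.

```shell
# Hypothetical helper: validate one DFXML file against an XSD schema.
# Usage: validate_dfxml <schema.xsd> <file.xml>
validate_dfxml() {
    if [ ! -f "$1" ]; then
        echo "schema not found: $1 (run 'make schema-init' first)" >&2
        return 1
    fi
    # --noout suppresses the re-serialized document; --schema selects XSD validation.
    xmllint --noout --schema "$1" "$2"
}
```

After `make schema-init`, call the helper with the path to the XSD under schema/ and the DFXML file to check. A non-zero exit status indicates that validation failed or that the schema is missing.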

Release Notes

  • 2018-07-22 @simsong Significant redesign of the Python library.
    • Configure Python module with a module directory and moved most of to
    • Renamed to be since Python3 naming conventions use only lower case filenames.
    • Moved tests to a test/ subdirectory and redesigned most of them to work with py.test. The tests that require arguments on the python command line were not updated.
    • Removed calls to logging within files and modules that are not tests, so that using DFXML doesn't inherently start emitting logging messages.
    • Removed calls to logging in Objects tests where the only thing that the test program was logging was the fact that it had run. py.test will provide similar logging now.


I continue to port bulk_extractor, tcpflow, be13_api and dfxml to modern C++. After surveying the standards I’ve decided to go with C++17 and not C++14, as support for C++17 is now widespread. (I probably don’t need C++20.) I am sticking with autotools, although there seems to be a strong reason to move to CMake. I am keeping be13_api and dfxml as modules that are included, Python-style, rather than making them stand-alone libraries that are linked against. I’m not 100% sure that’s the correct decision, though.

The project is taking longer than anticipated because I am also doing a general code refactoring. The main thing that is taking time is figuring out how to detangle all of the C++ objects having to do with parser options and configuration.

Given that tcpflow and bulk_extractor both use be13_api, my attention has shifted to using tcpflow to get be13_api operational, as it is a simpler program. I’m about three quarters of the way through now. I anticipate having something finished before the end of 2020.

--- Simson Garfinkel, October 18, 2020