distributed-dataset

A distributed data processing framework in pure Haskell. Inspired by Apache Spark.


distributed-dataset
This package provides a Dataset type which lets you express and execute transformations on a distributed multiset. Its API is highly inspired by Apache Spark.
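To give a feel for the style of API described above, here is a rough sketch of a pipeline. This is a hedged illustration, not the package's verified interface: the names dFilter and dToList, the DD monad, and the use of StaticPointers closures are assumptions inferred from this description.

```haskell
{-# LANGUAGE StaticPointers #-}

import Control.Distributed.Dataset
import Data.Function ((&))

-- Keep the even numbers of a distributed dataset and fetch them
-- back to the driver. 'static' builds serialisable closures that
-- the backend can ship to remote executors. (Names here are
-- illustrative assumptions, not verified exports.)
evens :: Dataset Int -> DD [Int]
evens ds =
  ds
    & dFilter (static (\i -> i `mod` 2 == 0))
    & dToList
```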

It uses pluggable Backends for spawning executors and ShuffleStores for exchanging information. See 'distributed-dataset-aws' for an implementation using AWS Lambda and S3.

It also exposes a more primitive Control.Distributed.Fork module which lets you run IO actions remotely. It is especially useful when your task is embarrassingly parallel.
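As a hedged sketch of that lower-level interface (the names initDistributedFork, fork, await, Handle, and the Closure/Dict arguments are assumptions, not verified signatures), remotely running an IO action might look like:

```haskell
{-# LANGUAGE StaticPointers #-}

import Control.Distributed.Fork

main :: IO ()
main = do
  -- Presumably must run first, so that remote re-invocations of
  -- this same binary act as executors instead of re-running 'main'.
  initDistributedFork
  -- 'fork' ships the closure to an executor via the given Backend;
  -- 'await' blocks until the result comes back.
  handle <- fork someBackend (static Dict) (static (pure (21 * 2 :: Int)))
  await handle >>= print
  where
    someBackend = undefined -- placeholder: e.g. a local or AWS Lambda Backend
```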


distributed-dataset-aws
This package provides a backend for 'distributed-dataset' using AWS services. Currently it supports running functions on AWS Lambda and using an S3 bucket as a shuffle store.
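For instance, wiring the Lambda backend in might look like the following sketch. The module path, withLambdaBackend, and lambdaBackendOptions are hypothetical names chosen for illustration; consult the package's documentation for the real ones.

```haskell
import Control.Distributed.Fork
import Control.Distributed.Fork.Lambda -- hypothetical module name

main :: IO ()
main = do
  initDistributedFork
  -- Presumably packages the current binary, uploads it to the given
  -- S3 bucket and creates a Lambda function backed by it.
  withLambdaBackend (lambdaBackendOptions "my-s3-bucket") $ \backend ->
    pure () -- placeholder: use 'backend' with fork or a Dataset pipeline
```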

distributed-dataset-opendatasets
Provides Datasets that read from public open datasets. Currently it can fetch GitHub event data from GH Archive.
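A hedged sketch of how that might be consumed; the module path, the ghArchive function, its date-range argument, and the GHEvent type are all assumptions for illustration:

```haskell
import Control.Distributed.Dataset
import Control.Distributed.Dataset.OpenDatasets.GHArchive -- hypothetical module path
import Data.Time.Calendar (fromGregorian)

-- A Dataset of every GitHub event GH Archive recorded in the
-- first week of 2019. (All names here are illustrative.)
events :: Dataset GHEvent
events = ghArchive (fromGregorian 2019 1 1, fromGregorian 2019 1 7)
```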

Running the example

  • Clone the repository.

    $ git clone
    $ cd distributed-dataset
  • Make sure that you have AWS credentials set up. The easiest way is to install the AWS command line interface and run:

    $ aws configure
  • Create an S3 bucket to put the deployment artifact in. You can use the console or the CLI:

    $ aws s3api create-bucket --bucket my-s3-bucket
  • Build and run the example:

    • If you use Nix on Linux:

      • (Recommended) Use my binary cache on Cachix to reduce compilation times:

        $ $(nix-build -A cachix)/bin/cachix use utdemir
      • Then:

        $ $(nix-build -A example-gh)/bin/example-gh my-s3-bucket
    • If you use stack (requires Docker, works on Linux and MacOS):

      $ stack run --docker-mount $HOME/.aws/ --docker-env HOME=$HOME example-gh my-s3-bucket


Status
Experimental. Expect lots of missing features, bugs, instability and API changes. You will probably need to modify the source if you want to do anything serious. See issues.


Contributing
I am open to contributions; any issue, PR or opinion is more than welcome.

  • In order to develop distributed-dataset, you can use:
    • On Linux: Nix, cabal-install or stack.
    • On MacOS: stack with docker.
  • Use ormolu to format source code.


Nix
  • You can use my binary cache on Cachix so that you don't have to recompile half of Hackage.
  • nix-shell will drop you into a shell with ormolu, cabal-install and ghcid, alongside all required Haskell and system dependencies. You can use cabal new-* commands there.
  • There is a ./ script at the root of the repository with some utilities, such as formatting the source code or running ghcid; run ./ --help to see the usage.


Stack
  • Make sure that you have Docker installed.
  • Use stack as usual; it will automatically use a Docker image.
  • Run ./ stack-build before you send a PR to test different resolvers.

Related Work


