Build system revamp
utdemir committed Jul 13, 2019
1 parent a0c05a8 commit 30d1843
Showing 37 changed files with 2,070 additions and 1,633 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,4 +1,5 @@
.envrc
*.lock

# nix
result
69 changes: 40 additions & 29 deletions README.md
@@ -11,15 +11,24 @@ A distributed data processing framework in pure Haskell. Inspired by [Apache Spa

### distributed-dataset

This package provides a `Dataset` type which lets you express and execute transformations on a distributed multiset. Its API is heavily inspired by Apache Spark.
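
For a feel of the API, here is a minimal sketch of a `Dataset` pipeline. It is illustrative only: the combinator names `dFilter` and `dMap` and their closure-taking signatures are assumptions (this commit only touches the build system), so check the haddocks for the real API.

```haskell
{-# LANGUAGE StaticPointers #-}

import Control.Distributed.Dataset -- assumed to export 'Dataset', 'dFilter', 'dMap'
import Data.Function ((&))
import qualified Data.Text as T

-- Keep the non-empty lines of a distributed multiset and measure their
-- lengths. Transformations take static closures so that they can be
-- shipped to remote executors.
lineLengths :: Dataset T.Text -> Dataset Int
lineLengths ds =
  ds
    & dFilter (static (not . T.null))
    & dMap (static T.length)
```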

It uses pluggable `Backend`s for spawning executors and `ShuffleStore`s for exchanging intermediate data between them. See 'distributed-dataset-aws' for an implementation using AWS Lambda and S3.
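
Concretely, a `ShuffleStore` boils down to two static closures for writing and reading numbered chunks of bytes; the `s3ShuffleStore` further down in this commit fills in exactly these two fields. Here is a paraphrased sketch of the shape, where only the `ssGet`/`ssPut` field names and the `Range` constructors are taken from the diff below and the concrete types are guesses:

```haskell
import Conduit (ConduitT, ResourceT)
import Control.Distributed.Closure (Closure)
import Data.ByteString (ByteString)
import Data.Void (Void)

-- Which part of a stored chunk to read back.
data Range = RangeAll | RangeOnly Integer Integer

-- Paraphrased shape of the record defined in
-- Control.Distributed.Dataset.ShuffleStore; types are approximate.
data ShuffleStore = ShuffleStore
  { ssGet :: Closure (Int -> Range -> ConduitT () ByteString (ResourceT IO) ())
  , ssPut :: Closure (Int -> ConduitT ByteString Void (ResourceT IO) ())
  }
```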

It also exposes a more primitive `Control.Distributed.Fork`
module which lets you run `IO` actions remotely. It
is especially useful when your task is [embarrassingly
parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel).
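
A minimal sketch of that primitive, assuming an already-configured `Backend`; the names `initDistributedFork`, `fork` and `await` and their signatures are assumptions for illustration, not something this commit defines:

```haskell
{-# LANGUAGE StaticPointers #-}

import Control.Distributed.Closure (Dict (Dict))
import Control.Distributed.Fork -- assumed to export 'Backend', 'fork', 'await'

-- Run an IO action on a remote executor and block until its serialised
-- result comes back; 'static Dict' captures the serialisation dictionary.
remoteAnswer :: Backend -> IO Int
remoteAnswer backend = do
  initDistributedFork -- assumed one-time initialisation at program start
  handle <- fork backend (static Dict) (static (pure 42 :: IO Int))
  await handle
```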

### distributed-dataset-aws

This package provides a backend for 'distributed-dataset' using AWS
services. Currently it supports running functions on AWS Lambda and
using an S3 bucket as a shuffle store.
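
The shuffle-store half is just the `s3ShuffleStore` function reformatted later in this commit; a small usage sketch, where the bucket name and key prefix are placeholders and the `ShuffleStore` type is assumed to be exported from `Control.Distributed.Dataset.ShuffleStore`:

```haskell
import Control.Distributed.Dataset.AWS (s3ShuffleStore)
import Control.Distributed.Dataset.ShuffleStore (ShuffleStore)
import qualified Data.Text as T

-- Shuffle chunks end up at s3://my-s3-bucket/dd-shuffle/<chunk-number>.
-- The bucket is a placeholder and must already exist; the Lambda side of
-- the backend lives behind the re-exported Control.Distributed.Fork.AWS.
myShuffleStore :: ShuffleStore
myShuffleStore = s3ShuffleStore (T.pack "my-s3-bucket") (T.pack "dd-shuffle/")
```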

### distributed-dataset-opendatasets

@@ -34,13 +43,16 @@ Provides `Dataset`'s reading from public open datasets. Currently it can fetch G
$ cd distributed-dataset
```

* Make sure that you have AWS credentials set up. The easiest way is to install the [AWS command line interface](https://aws.amazon.com/cli/) and run:

```sh
$ aws configure
```

* Create an S3 bucket to put the deployment artifact in. You can use
the console or the CLI:

```sh
$ aws s3api create-bucket --bucket my-s3-bucket
@@ -70,37 +82,36 @@

## Stability

Experimental. Expect lots of missing features, bugs,
instability and API changes. You will probably need to
modify the source if you want to do anything serious. See
[issues](https://github.com/utdemir/distributed-dataset/issues).

## Contributing

I am open to contributions; any issue, PR or opinion is more than welcome.

## Hacking
-* You can use `Nix`, `cabal-install` or `stack`.
+* In order to develop `distributed-dataset`, you can use:
+  * On Linux: `Nix`, `cabal-install` or `stack`.
+  * On MacOS: `stack` with `docker`.
+* Use [ormolu](https://github.com/tweag/ormolu) to format source code.

-If you use Nix:
-
-* You can use [my binary cache on cachix](https://utdemir.cachix.org/) so that you don't recompile half of the Hackage.
-* 'nix-shell' gives you a development shell with required Haskell dependencies alongside with `cabal-install`, `ghcid` and `stylish-haskell`. Example:
-```
-$ nix-shell --pure --run 'ghcid -c "cabal new-repl distributed-dataset-opendatasets"'
-```
-* Use stylish-haskell and hlint:
-```
-$ nix-shell --run 'find -name "*.hs" -exec stylish-haskell -i {} \;'
-$ nix-shell --run 'hlint .'
-```
-* You can generate the Haddocks using
-```
-$ nix-build -A docs
-```
+### Nix
+
+* You can use [my binary cache on cachix](https://utdemir.cachix.org/)
+  so that you don't recompile half of the Hackage.
+* `nix-shell` will drop you into a shell with `ormolu`, `cabal-install` and
+  `ghcid`, alongside all required Haskell and system dependencies.
+  You can use `cabal new-*` commands there.
+* There is a `./make.sh` at the root folder with some utilities like
+  formatting the source code or running `ghcid`; run `./make.sh --help`
+  to see the usage.
+
+### Stack
+
+* Make sure that you have `Docker` installed.
+* Use `stack` as usual; it will automatically use a Docker image.
+* Run `./make.sh stack-build` before you send a PR to test different resolvers.
## Related Work
@@ -109,7 +120,7 @@
* [Towards Haskell in Cloud](https://www.microsoft.com/en-us/research/publication/towards-haskell-cloud/) by Jeff Epstein, Andrew P. Black, Simon L. Peyton Jones
* [Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf) by Matei Zaharia, et al.
### Projects
* [Apache Spark](https://spark.apache.org/).
* [Sparkle](https://github.com/tweag/sparkle): Run Haskell on top of Apache Spark.
37 changes: 0 additions & 37 deletions ci.sh

This file was deleted.

27 changes: 5 additions & 22 deletions default.nix
@@ -1,5 +1,4 @@
{ compiler ? "ghc865"
, pkgs ? import ./pkgs.nix
{ pkgs ? import ./pkgs.nix
}:

let
@@ -45,22 +44,7 @@ overlays = se: su: {
});

# Use newer version
# Haddocks does not work with ghc 8.4
stratosphere = pkgs.haskell.lib.dontHaddock se.stratosphere_0_40_0;

# Pulls in a broken dependency on 1.8.1, fixed in master but no new release yet.
# https://github.com/yesodweb/Shelly.hs/commit/8288d27b93b57574135014d0888cf33f325f7c80
shelly =
se.callCabal2nix
"shelly"
(builtins.fetchGit {
url = "https://github.com/yesodweb/Shelly.hs";
rev = "8288d27b93b57574135014d0888cf33f325f7c80";
})
{};

# Always use the new Cabal
Cabal = se.Cabal_2_4_1_0;
stratosphere = se.stratosphere_0_40_0;

# not on Hackage yet
ormolu =
@@ -73,7 +57,7 @@ overlays = se: su: {
{};
};

haskellPackages = pkgs.haskell.packages.${compiler}.override {
haskellPackages = pkgs.haskell.packages.ghc865.override {
overrides = overlays;
};

@@ -99,7 +83,7 @@ in rec
${distributed-dataset-aws.src} \
${distributed-dataset-opendatasets.src}
'';
} // (if compiler > "ghc86" then {

shell = haskellPackages.shellFor {
packages = p: with p; [
distributed-dataset
@@ -110,9 +94,8 @@ in rec
buildInputs = with haskellPackages; [
cabal-install
ghcid
stylish-haskell
ormolu
];
withHoogle = true;
};
} else {})
}
90 changes: 49 additions & 41 deletions distributed-dataset-aws/src/Control/Distributed/Dataset/AWS.hs
@@ -1,27 +1,29 @@
{-# LANGUAGE StaticPointers #-}
{-# LANGUAGE StaticPointers #-}
{-# LANGUAGE TypeApplications #-}

module Control.Distributed.Dataset.AWS
( s3ShuffleStore
-- Re-exports
, module Control.Distributed.Fork.AWS
) where
, -- Re-exports
module Control.Distributed.Fork.AWS
)
where

--------------------------------------------------------------------------------
import Conduit
import Control.Distributed.Closure
import Control.Lens
import Control.Monad
import Control.Monad.Trans.AWS (AWST)
import qualified Data.Text as T
import Network.AWS
import Network.AWS.Data.Body (RsBody (_streamBody))
import qualified Network.AWS.S3 as S3
import qualified Network.AWS.S3.StreamingUpload as S3
import System.IO.Unsafe
import Conduit
import Control.Distributed.Closure
--------------------------------------------------------------------------------
import Control.Distributed.Dataset.ShuffleStore
import Control.Distributed.Fork.AWS
import Control.Distributed.Dataset.ShuffleStore
import Control.Distributed.Fork.AWS
import Control.Lens
import Control.Monad
import Control.Monad.Trans.AWS (AWST)
import qualified Data.Text as T
import Network.AWS
import Network.AWS.Data.Body (RsBody (_streamBody))
import qualified Network.AWS.S3 as S3
import qualified Network.AWS.S3.StreamingUpload as S3
import System.IO.Unsafe

--------------------------------------------------------------------------------

-- |
@@ -30,29 +32,36 @@ import Control.Distributed.Fork.AWS
-- TODO: Cleanup
-- TODO: Use a temporary bucket created by CloudFormation
s3ShuffleStore :: T.Text -> T.Text -> ShuffleStore
s3ShuffleStore bucket' prefix'
= ShuffleStore
{ ssGet = static (\bucket prefix num range -> do
ret <- runAWS globalAWSEnv $
send $ S3.getObject
(S3.BucketName bucket)
(S3.ObjectKey $ prefix <> T.pack (show num))
& S3.goRange
.~ (case range of
RangeAll -> Nothing
RangeOnly lo hi ->
Just . T.pack $ "bytes=" <> show lo <> "-" <> show hi
)
_streamBody $ ret ^. S3.gorsBody
) `cap` cpure (static Dict) bucket' `cap` cpure (static Dict) prefix'
, ssPut = static (\bucket prefix num ->
void . transPipe @(AWST (ResourceT IO)) (runAWS globalAWSEnv)
$ S3.streamUpload Nothing $ S3.createMultipartUpload
(S3.BucketName bucket)
(S3.ObjectKey $ prefix <> T.pack (show num))
) `cap` cpure (static Dict) bucket' `cap` cpure (static Dict) prefix'


s3ShuffleStore bucket' prefix' =
ShuffleStore
{ ssGet = static
( \bucket prefix num range -> do
ret <-
runAWS globalAWSEnv $
send $
S3.getObject
(S3.BucketName bucket)
(S3.ObjectKey $ prefix <> T.pack (show num)) &
S3.goRange .~
( case range of
RangeAll -> Nothing
RangeOnly lo hi ->
Just . T.pack $ "bytes=" <> show lo <> "-" <> show hi
)
_streamBody $ ret ^. S3.gorsBody
) `cap`
cpure (static Dict) bucket' `cap`
cpure (static Dict) prefix'
, ssPut = static
( \bucket prefix num ->
void . transPipe @(AWST (ResourceT IO)) (runAWS globalAWSEnv) $
S3.streamUpload Nothing $
S3.createMultipartUpload
(S3.BucketName bucket)
(S3.ObjectKey $ prefix <> T.pack (show num))
) `cap`
cpure (static Dict) bucket' `cap`
cpure (static Dict) prefix'
}

-- FIXME
@@ -64,4 +73,3 @@
globalAWSEnv :: Env
globalAWSEnv = unsafePerformIO $ newEnv Discover
{-# NOINLINE globalAWSEnv #-}

8 changes: 4 additions & 4 deletions distributed-dataset-aws/src/Control/Distributed/Fork/AWS.hs
@@ -1,9 +1,9 @@
module Control.Distributed.Fork.AWS
( module Control.Distributed.Fork.AWS.Lambda
) where
)
where

--------------------------------------------------------------------------------
import Control.Distributed.Fork.AWS.Lambda
--------------------------------------------------------------------------------

import Control.Distributed.Fork.AWS.Lambda

--------------------------------------------------------------------------------