Pachyderm is built on Kubernetes, so it can run on any platform that Kubernetes supports. This guide covers three commonly used setups: a local deployment, Google Cloud Platform, and Amazon Web Services (AWS).
Each section starts with deploying Kubernetes on that platform, and then moves on to deploying Pachyderm on Kubernetes. If you have already set up Kubernetes on your platform, you can skip directly to the second part.
- Go >= 1.6
- FUSE (optional) >= 2.8.2
- Kubectl (kubernetes CLI) >= 1.2.2
- Pachyderm Repository
- pachctl and pach-deploy
Find Go 1.6 here.
Having FUSE installed allows you to mount PFS locally, which can be nice if you want to play around with PFS.
FUSE comes pre-installed on most Linux distributions. For OS X, install OS X FUSE.
Make sure you have version 1.2.2 or higher.
### Darwin (OS X)
$ wget https://storage.googleapis.com/kubernetes-release/release/v1.2.2/bin/darwin/amd64/kubectl
### Linux
$ wget https://storage.googleapis.com/kubernetes-release/release/v1.2.2/bin/linux/amd64/kubectl
### Copy kubectl to your path
chmod +x kubectl
mv kubectl /usr/local/bin/
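To check the minimum versions listed above (for example kubectl >= 1.2.2), a small comparison helper can be handy. The following is only a sketch: the `version_ge` name is hypothetical, and it assumes your `sort` supports the `-V` (version sort) option, which is true for GNU coreutils and recent macOS releases.

```shell
# version_ge A B: succeeds when dotted version A is greater than or equal to B.
# Assumption: sort(1) supports -V (version sort).
version_ge() {
  # After a version sort, the smaller of the two comes first; A >= B exactly
  # when that first line is B.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example with a version string you read off your own installation:
if version_ge "1.2.4" "1.2.2"; then
  echo "kubectl is new enough"
fi
```

The helper compares plain version strings only; extracting the string from a tool's output is left to you, since that output format differs between releases.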
Clone this repo under your GOPATH:
# this will put the repo under $GOPATH/src/github.com/pachyderm/pachyderm
$ go get github.com/pachyderm/pachyderm
pachctl and pach-deploy are command-line utilities that Pachyderm provides. You can install them directly from source:
$ cd $GOPATH/src/github.com/pachyderm/pachyderm
$ make install
- Docker >= 1.10
Both kubectl and pachctl need a port forwarded so they can talk with their servers. If your Docker daemon is running locally you can skip this step. Otherwise (e.g. you are running Docker through Docker Machine), do the following:
$ ssh <HOST> -fTNL 8080:localhost:8080 -L 30650:localhost:30650
From the root of this repo you can deploy Kubernetes with:
$ make launch-kube
This step can take a while the first time you run it, since some Docker images need to be pulled.
From the root of this repo you can deploy Pachyderm on Kubernetes with:
$ make launch
This step can take a while the first time you run it, since a lot of Docker images need to be pulled.
Google Cloud Platform has excellent support for Kubernetes through the Google Container Engine.
- Google Cloud SDK >= 106.0.0
If this is your first time using the SDK, work through the quick start guide first.
After the SDK is installed, run:
$ gcloud components install kubectl
Pachyderm needs a container cluster, a GCS bucket, and a persistent disk to function correctly. We've made this easy by providing the make google-cluster helper, which creates all of these resources for you.
First of all, set the required environment variables. Choose a name for both the bucket and disk, as well as a capacity for the disk (in GB):
$ export BUCKET_NAME=some-unique-bucket-name
$ export STORAGE_NAME=pach-disk
$ export STORAGE_SIZE=200
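Because the make target depends on these variables, it can help to confirm they are set before you run it. The `require_vars` helper below is a hypothetical sketch, not part of the Pachyderm repository; it only checks that each named variable is non-empty.

```shell
# require_vars NAME...: succeed only if every named variable is set and non-empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup that works in POSIX sh (no bash-only ${!name}).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Intended use, with the variable names from this guide:
# require_vars BUCKET_NAME STORAGE_NAME STORAGE_SIZE && make google-cluster
```

The same helper could be reused for the AWS variables later in this guide.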
You may need to visit the Google Cloud Console to fully initialize Container Engine in a new project. Then run the following command:
$ make google-cluster
This creates a Kubernetes cluster named "pachyderm", a bucket, and a persistent disk. To check that everything has been set up correctly, try:
$ gcloud compute instances list
# should see a number of instances
$ gsutil ls
# should see a bucket
$ gcloud compute disks list
# should see a number of disks, including the one you specified
Unfortunately, your persistent disk is not immediately available for use upon creation. You will need to manually format it. Follow these instructions, attaching the disk to an instance and formatting the disk, then clear all files on the disk by running:
rm -rf [path-to-disk]/*
First of all, record the external IP address of one of the nodes in your Kubernetes cluster:
$ gcloud compute instances list
Then export it with port 30650:
$ export ADDRESS=[the external address]:30650
# for example:
# export ADDRESS=104.197.179.185:30650
This is so we can use pachctl to talk to our cluster later.
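If you set ADDRESS from a script, a small helper can catch an empty or mistyped value early. This is a sketch under stated assumptions: `pach_address` is a hypothetical name, and the check only looks at which characters appear in the address, not whether it is a well-formed or reachable IPv4 address.

```shell
# pach_address IP [PORT]: print "IP:PORT" for use as ADDRESS.
# PORT defaults to 30650, the port used throughout this guide.
pach_address() {
  ip="$1"
  port="${2:-30650}"
  case "$ip" in
    # Reject empty input and anything containing characters other than digits and dots.
    ""|*[!0-9.]*) echo "not an IPv4 address: $ip" >&2; return 1 ;;
  esac
  printf '%s:%s\n' "$ip" "$port"
}

# Intended use, with the example address from above:
# export ADDRESS="$(pach_address 104.197.179.185)"
```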
Now you can deploy Pachyderm with:
$ make google-cluster-manifest > manifest
$ make MANIFEST=manifest launch
This may take a while the first time, since a lot of Docker images need to be pulled.
Deploying Kubernetes on AWS is still a relatively lengthy and manual process compared to doing it on GCE. However, here are a few good tutorials that walk through the process:
- http://kubernetes.io/docs/getting-started-guides/aws/
- https://coreos.com/kubernetes/docs/latest/kubernetes-on-aws.html
First of all, set these environment variables:
$ export KUBECTLFLAGS="-s [the IP address of the node where Kubernetes runs]"
$ export BUCKET_NAME=[the name of the bucket where your data will be stored; S3 bucket names must be globally unique, not just unique within one region]
$ export STORAGE_SIZE=[the size of the EBS volume that you are going to create, in GBs]
$ export AWS_REGION=[the AWS region where you want the bucket and EBS volume to reside]
$ export AWS_AVAILABILITY_ZONE=[the AWS availability zone where you want your EBS volume to reside]
Then, simply run:
$ make amazon-cluster
Record the "volume-id" in the output, then export it:
$ export STORAGE_NAME=[volume id]
Now you should be able to see the bucket and the EBS volume that were just created:
aws s3api list-buckets --query 'Buckets[].Name'
aws ec2 describe-volumes --query 'Volumes[].VolumeId'
Unfortunately, your EBS volume is not immediately available for use upon creation. You will need to manually format it. Follow these instructions, then clear all files on the volume by running:
rm -rf [path-to-disk]/*
First of all, get a set of temporary AWS credentials:
$ aws sts get-session-token
Then run the following commands with the credentials you get:
$ AWS_ID=[access key ID] AWS_KEY=[secret access key] AWS_TOKEN=[session token] make amazon-cluster-manifest > manifest
$ make MANIFEST=manifest launch
This may take a while the first time, since a lot of Docker images need to be pulled.
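Copying the three credential values by hand is error-prone, so the step could be scripted. The sketch below assumes you ask the AWS CLI for tab-separated text output via its `--query` and `--output text` options; `parse_session_creds` is a hypothetical helper, and the actual `aws` call is left commented out because it needs configured AWS credentials.

```shell
# parse_session_creds LINE: split "ID<TAB>KEY<TAB>TOKEN" into AWS_ID, AWS_KEY
# and AWS_TOKEN (the variable names used by the make target in this guide).
parse_session_creds() {
  IFS="$(printf '\t')" read -r AWS_ID AWS_KEY AWS_TOKEN <<EOF
$1
EOF
  # Fail if any of the three fields came back empty.
  [ -n "$AWS_ID" ] && [ -n "$AWS_KEY" ] && [ -n "$AWS_TOKEN" ]
}

# Intended use:
# creds="$(aws sts get-session-token \
#   --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' --output text)"
# parse_session_creds "$creds" &&
#   AWS_ID="$AWS_ID" AWS_KEY="$AWS_KEY" AWS_TOKEN="$AWS_TOKEN" \
#     make amazon-cluster-manifest > manifest
```

Keep in mind that session tokens expire, so a manifest generated this way embeds short-lived credentials.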
pachctl is a command-line utility used for interacting with a Pachyderm cluster.
$ brew tap pachyderm/tap && brew install pachctl
To install pachctl from source, we assume you are compiling from within $GOPATH:
$ go get github.com/pachyderm/pachyderm
$ cd $GOPATH/src/github.com/pachyderm/pachyderm
$ make install
Make sure you add $GOPATH/bin to your PATH environment variable:
$ export PATH=$PATH:$GOPATH/bin
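If you put that export in a shell startup file, the directory gets appended again every time the file is sourced. A minimal sketch of an idempotent alternative follows; `path_add` is a hypothetical helper name, not something Pachyderm ships.

```shell
# path_add DIR: append DIR to PATH unless it is already listed.
path_add() {
  case ":$PATH:" in
    *":$1:"*) ;;            # already present, nothing to do
    *) PATH="$PATH:$1" ;;   # not present yet, append it
  esac
}

# Intended use:
# path_add "$GOPATH/bin" && export PATH
```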
If Pachyderm is running locally, you are good to go. Otherwise, you need to make sure that pachctl can find the node on which you deployed Pachyderm:
$ export ADDRESS=[the IP address of the node where Pachyderm runs]:30650
# for example:
# export ADDRESS=104.197.179.185:30650
Now, create an empty repo to make sure that everything has been set up correctly:
pachctl create-repo test
pachctl list-repo
# should see "test"
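The check above can be turned into a scripted smoke test. This sketch makes one assumption you should verify against your pachctl version: that `pachctl list-repo` prints one repo per line with the repo name as the first whitespace-separated field. `has_repo` is a hypothetical helper.

```shell
# has_repo NAME: read repo listing on stdin and succeed if some line's
# first field equals NAME exactly.
has_repo() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Intended use:
# pachctl create-repo test
# pachctl list-repo | has_repo test && echo "cluster is reachable"
```

Matching on the whole first field avoids false positives from repos whose names merely contain "test".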
Ready to jump into data analytics with Pachyderm? Head to our quick start guide.
This error normally occurs because Kubernetes services cannot function when the kernel does not support iptables. Generally you can solve this with:
modprobe netfilter_xt_match_statistic netfilter_xt_match_recent
However in other cases it may require recompiling the kernel. Please head to this issue if you're having trouble with this so we can collect solutions to the problem in one place.
We'll update this section of the guide as we learn more about this issue.