Setup

Intro

Pachyderm is built on Kubernetes, so it can in principle run on any platform that Kubernetes supports. This guide covers the following commonly used platforms: local deployment, Google Cloud Platform, and Amazon Web Services (AWS).

Each section starts with deploying Kubernetes on that platform, and then moves on to deploying Pachyderm on Kubernetes. If you have already set up Kubernetes on your platform, you can skip directly to the second part.

Common Prerequisites

Go

Find Go 1.6 here.

FUSE (optional)

Having FUSE installed allows you to mount PFS locally, which is useful if you want to explore PFS.

FUSE comes pre-installed on most Linux distributions. For OS X, install OS X FUSE.

Kubectl

Make sure you have version 1.2.2 or higher.

### Darwin (OS X)
$ wget https://storage.googleapis.com/kubernetes-release/release/v1.2.2/bin/darwin/amd64/kubectl

### Linux
$ wget https://storage.googleapis.com/kubernetes-release/release/v1.2.2/bin/linux/amd64/kubectl

### Copy kubectl to your path
chmod +x kubectl
mv kubectl /usr/local/bin/
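
To confirm that an installed kubectl meets the 1.2.2 minimum, the comparison can be scripted. The following is a minimal sketch: version_ge is a hypothetical helper name, not part of kubectl or Pachyderm, and it assumes plain three-part dotted versions with no leading "v" or suffix.

```shell
# version_ge A B: succeeds when dotted version A is greater than or equal to B.
# Hypothetical helper; assumes versions of the form MAJOR.MINOR.PATCH.
version_ge() {
  # Sort the two versions numerically, field by field. If B sorts first
  # (or both are equal), then A >= B.
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)
  [ "$lowest" = "$2" ]
}
```

For example, version_ge 1.10.0 1.2.2 succeeds while version_ge 1.1.9 1.2.2 fails. Pass in the client version that kubectl version reports, with any leading "v" removed.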

Pachyderm

Clone this repo under your GOPATH:

# this will put the repo under $GOPATH/src/github.com/pachyderm/pachyderm
$ go get github.com/pachyderm/pachyderm

pachctl and pach-deploy

pachctl and pach-deploy are command-line utilities that Pachyderm provides. You can install them directly from source:

$ cd $GOPATH/src/github.com/pachyderm/pachyderm
$ make install

Local Deployment

Prerequisites

Port Forwarding

Both kubectl and pachctl need a port forwarded so they can talk with their servers. If your Docker daemon is running locally you can skip this step. Otherwise (e.g. you are running Docker through Docker Machine), do the following:

$ ssh <HOST> -fTNL 8080:localhost:8080 -L 30650:localhost:30650
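
The ssh command above forwards two local ports in one invocation: 30650, which pachctl uses, and 8080, which is presumably the port kubectl uses in this setup. As a sketch of how the flags fit together, the hypothetical helper below (forward_cmd is not part of Pachyderm) only prints the equivalent command so that it can be reviewed before running it.

```shell
# forward_cmd HOST: prints the ssh invocation that forwards both ports.
# Hypothetical helper; it prints the command and does not run it.
forward_cmd() {
  # -f: go to the background, -T: allocate no terminal,
  # -N: run no remote command, -L: forward a local port to the remote host.
  printf 'ssh %s -fTN' "$1"
  for port in 8080 30650; do
    printf ' -L %s:localhost:%s' "$port" "$port"
  done
  printf '\n'
}
```

Running forward_cmd with your Docker host name prints a command equivalent to the one above, with each forwarded port written as its own -L option.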

Deploy Kubernetes

From the root of this repo you can deploy Kubernetes with:

$ make launch-kube

This step can take a while the first time you run it, since some Docker images need to be pulled.

Deploy Pachyderm

From the root of this repo you can deploy Pachyderm on Kubernetes with:

$ make launch

This step can take a while the first time you run it, since a lot of Docker images need to be pulled.
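
Because the first launch spends time pulling images, it can help to poll the cluster instead of guessing when it is ready. The following is a minimal sketch of such a polling loop; retry is a hypothetical helper, and the command to poll is left to the caller.

```shell
# retry ATTEMPTS DELAY CMD...: runs CMD until it succeeds, at most ATTEMPTS
# times, sleeping DELAY seconds between tries. Hypothetical helper.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    n=$((n + 1))
    # Only sleep when another attempt will follow.
    if [ "$n" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

For instance, retry 30 10 kubectl get pods keeps trying for up to about five minutes. Note that this only shows that the Kubernetes API server answers; you still need to read the output to see whether the pods are running.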

Google Cloud Platform

Google Cloud Platform has excellent support for Kubernetes through the Google Container Engine.

Prerequisites

If this is your first time using the Google Cloud SDK, work through its quick start guide first.

After the SDK is installed, run:

$ gcloud components install kubectl

Set up the infrastructure

Pachyderm needs a container cluster, a GCS bucket, and a persistent disk to function correctly. The make google-cluster helper creates all of these resources for you.

First of all, set the required environment variables. Choose a name for both the bucket and disk, as well as a capacity for the disk (in GB):

$ export BUCKET_NAME=some-unique-bucket-name
$ export STORAGE_NAME=pach-disk
$ export STORAGE_SIZE=200
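
Since make google-cluster depends on these variables, a quick pre-flight check can catch a missing one before any cloud resources are created. This is a sketch only; require_vars is a hypothetical helper and not something the Makefile provides.

```shell
# require_vars NAME...: succeeds only if every named variable is set and
# non-empty; prints each missing name to stderr. Hypothetical helper.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}
```

One way to use it is require_vars BUCKET_NAME STORAGE_NAME STORAGE_SIZE && make google-cluster, so that the cluster is only created when all three values are present.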

You may need to visit the Console to fully initialize Container Engine in a new project. Then, simply run the following command:

$ make google-cluster

This creates a Kubernetes cluster named "pachyderm", a bucket, and a persistent disk. To check that everything has been set up correctly, try:

$ gcloud compute instances list
# should see a number of instances

$ gsutil ls
# should see a bucket

$ gcloud compute disks list
# should see a number of disks, including the one you specified

Format Volume

Unfortunately, your persistent disk is not immediately available for use upon creation. You will need to manually format it. Follow these instructions, attaching the disk to an instance and formatting the disk, then clear all files on the disk by running:

rm -rf [path-to-disk]/*

Deploy Pachyderm

First of all, record the external IP address of one of the nodes in your Kubernetes cluster:

$ gcloud compute instances list

Then export it with port 30650:

$ export ADDRESS=[the external address]:30650
# for example:
# export ADDRESS=104.197.179.185:30650
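
A common slip here is exporting the IP address without the port. The sketch below checks only the shape of the value (host, colon, numeric port); valid_address is a hypothetical helper, and it does not test whether the node is reachable.

```shell
# valid_address ADDR: succeeds when ADDR looks like HOST:PORT with a numeric
# port. Hypothetical helper; it performs no network access.
valid_address() {
  host=${1%:*}
  port=${1##*:}
  # Reject input without a colon (host equals the whole string) or with an
  # empty host or port.
  [ "$host" != "$1" ] && [ -n "$host" ] && [ -n "$port" ] || return 1
  # Reject ports that contain anything other than digits.
  case $port in
    *[!0-9]*) return 1 ;;
  esac
  return 0
}
```

For example, valid_address "$ADDRESS" succeeds for 104.197.179.185:30650 and fails for 104.197.179.185 alone.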

This is so we can use pachctl to talk to our cluster later.

Now you can deploy Pachyderm with:

$ make google-cluster-manifest > manifest
$ make MANIFEST=manifest launch

This step can take a while the first time you run it, since a lot of Docker images need to be pulled.

Amazon Web Services (AWS)

Prerequisites

Deploy Kubernetes

Deploying Kubernetes on AWS is still a relatively lengthy and manual process compared to doing it on GCE. However, here are a few good tutorials that walk through the process:

Set up the infrastructure

First of all, set these environment variables:

$ export KUBECTLFLAGS="-s [the IP address of the node where Kubernetes runs]"
$ export BUCKET_NAME=[the name of the bucket where your data will be stored; bucket names must be globally unique across all of AWS]
$ export STORAGE_SIZE=[the size of the EBS volume that you are going to create, in GBs]
$ export AWS_REGION=[the AWS region where you want the bucket and EBS volume to reside]
$ export AWS_AVAILABILITY_ZONE=[the AWS availability zone where you want your EBS volume to reside]

Then, simply run:

$ make amazon-cluster

Record the "volume-id" in the output, then export it:

$ export STORAGE_NAME=[volume id]
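
Note that on AWS, STORAGE_NAME holds a volume ID rather than a name you choose, which is easy to get wrong when copying from the GCP instructions. As a rough sanity check, EBS volume IDs start with "vol-" followed by lowercase hexadecimal characters. The helper below is a hypothetical sketch; it does not check the length of the ID or whether the volume exists.

```shell
# looks_like_volume_id ID: succeeds when ID has the form vol-<hex digits>.
# Hypothetical helper; checks the format only.
looks_like_volume_id() {
  case $1 in
    vol-*) suffix=${1#vol-} ;;
    *) return 1 ;;
  esac
  # The part after "vol-" must be non-empty and purely hexadecimal.
  [ -n "$suffix" ] || return 1
  case $suffix in
    *[!0-9a-f]*) return 1 ;;
  esac
  return 0
}
```

For example, looks_like_volume_id "$STORAGE_NAME" fails if the variable still holds a disk name such as pach-disk.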

Now you should be able to see the bucket and the EBS volume that were just created:

$ aws s3api list-buckets --query 'Buckets[].Name'
$ aws ec2 describe-volumes --query 'Volumes[].VolumeId'

Format Volume

Unfortunately, your EBS volume is not immediately available for use upon creation. You will need to manually format it. Follow these instructions, then clear all files on the volume by running:

rm -rf [path-to-disk]/*

Deploy Pachyderm

First of all, get a set of temporary AWS credentials:

$ aws sts get-session-token

Then run the following commands with the credentials you get:

$ AWS_ID=[access key ID] AWS_KEY=[secret access key] AWS_TOKEN=[session token] make amazon-cluster-manifest > manifest
$ make MANIFEST=manifest launch
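
Copying three credential values by hand is error-prone. As a sketch, the hypothetical helper below turns one tab-separated line (access key ID, secret access key, session token) into the AWS_ID, AWS_KEY, and AWS_TOKEN assignments used above. It assumes you ask the AWS CLI for that layout with --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' --output text, and that none of the values contain whitespace.

```shell
# creds_to_assignments: reads "ID<TAB>KEY<TAB>TOKEN" from stdin and prints
# the three variable assignments on one line. Hypothetical helper.
creds_to_assignments() {
  # Split the input line on tab characters only.
  IFS="$(printf '\t')" read -r id key token
  printf 'AWS_ID=%s AWS_KEY=%s AWS_TOKEN=%s\n' "$id" "$key" "$token"
}
```

Under those assumptions, the printed line could be passed to env, for example env $(aws sts get-session-token --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' --output text | creds_to_assignments) make amazon-cluster-manifest > manifest. Treat this as an illustration and check the printed values before relying on it.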

This step can take a while the first time you run it, since a lot of Docker images need to be pulled.

pachctl

pachctl is a command-line utility used for interacting with a Pachyderm cluster.

Installation

Homebrew

$ brew tap pachyderm/tap && brew install pachctl

From Source

To install pachctl from source, compile it from within your $GOPATH:

$ go get github.com/pachyderm/pachyderm
$ cd $GOPATH/src/github.com/pachyderm/pachyderm
$ make install

Make sure you add $GOPATH/bin to your PATH environment variable:

$ export PATH=$PATH:$GOPATH/bin
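
Running the export repeatedly appends the same directory to PATH each time. A small guard avoids that; path_contains below is a hypothetical helper that tests whether a directory already appears as a whole entry in a colon-separated path string.

```shell
# path_contains DIR PATHSTRING: succeeds when DIR is one of the
# colon-separated entries of PATHSTRING. Hypothetical helper.
path_contains() {
  # Wrapping both sides in colons makes the first and last entries match
  # the same pattern as entries in the middle.
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A possible use is path_contains "$GOPATH/bin" "$PATH" || export PATH=$PATH:$GOPATH/bin, which only extends PATH when the directory is missing.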

Usage

If Pachyderm is running locally, you are good to go. Otherwise, you need to make sure that pachctl can find the node on which you deployed Pachyderm:

$ export ADDRESS=[the IP address of the node where Pachyderm runs]:30650
# for example:
# export ADDRESS=104.197.179.185:30650

Now, create an empty repo to make sure that everything has been set up correctly:

$ pachctl create-repo test
$ pachctl list-repo
# should see "test"
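
The check above can also be scripted instead of read by eye. The sketch below assumes that pachctl list-repo prints one repo per line with the repo name as the first whitespace-separated field; that output layout is an assumption, and has_repo is a hypothetical helper.

```shell
# has_repo NAME: reads a repo listing on stdin and succeeds when some line
# has NAME as its first whitespace-separated field. Hypothetical helper.
has_repo() {
  awk -v repo="$1" '$1 == repo { found = 1 } END { exit found ? 0 : 1 }'
}
```

With that assumption, pachctl list-repo | has_repo test succeeds once the repo has been created.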

Next Step

Ready to jump into data analytics with Pachyderm? Head to our quick start guide.

Troubleshooting

pachd or pachd-init crash loop with "error connecting to etcd"

This error normally occurs when Kubernetes services do not function because the kernel does not support iptables. You can generally solve this with:

modprobe netfilter_xt_match_statistic netfilter_xt_match_recent

However, in other cases it may require recompiling the kernel. Please head to this issue if you're having trouble with this so we can collect solutions to the problem in one place.

We'll update this section of the guide as we learn more about this issue.