
Welcome to the spark-tk wiki!

Getting Started

Run "from source code"

To get spark-tk off the ground, we first need to set up its dependencies.

1. Spark - you need it.

Get it from github https://github.com/apache/spark or use a CDH install. spark-tk likes version 1.6 (actually, right now it likes Spark 1.5, but it will like 1.6 very soon).

2. The .jars - build them.

From the root spark-tk folder, build the project (without running the tests):

mvn clean install -DskipTests

You should see sparktk-core/target/core-1.0-SNAPSHOT.jar, as well as a collection of jars in sparktk-core/target/dependencies.

Now point an environment variable at the location of the jars:

export SPARKTK_HOME=$PWD/sparktk-core/target
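A quick sanity check from any python interpreter (stdlib only; just a sketch assuming the default build layout from the mvn command above):

import glob
import os

# SPARKTK_HOME should contain the core jar and the dependencies folder
target = os.environ.get("SPARKTK_HOME", "")
print(glob.glob(os.path.join(target, "core-*.jar")))        # expect core-1.0-SNAPSHOT.jar
print(os.path.isdir(os.path.join(target, "dependencies")))  # expect True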

3. The python stuff

(If you're only interested in the Scala API, you can skip this one)

The python sparktk library is in spark-tk/python/sparktk. It has a few dependencies that you may not have; look in spark-tk/python/requirements.txt to see what it needs.

Do pyspark first. Usually pyspark is sitting in your spark installation. There are a couple of options: add the path to pyspark to $PYTHONPATH, or create a symlink to pyspark in your site-packages folder. Something like:

sudo ln -s /opt/cloudera/parcels/CDH/lib/spark/python/pyspark /usr/lib/python2.7/site-packages/pyspark

For the other dependencies, use pip2.7 to install them:

pip2.7 install decorator

or pip2.7 install -r /path/to/spark-tk/python/requirements.txt

(Note: ideally you should use the same py4j version that pyspark is using.)
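If you want to double-check, something like this works (a sketch; the Spark path is just the CDH example from above, so adjust it for your install):

import glob
import py4j

# py4j version installed by pip
print(py4j.__version__)

# py4j bundled with the Spark installation (path follows the CDH example above)
print(glob.glob("/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-*"))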

If you start up your python interpreter from the spark-tk/python folder, you'll be fine. Otherwise, sparktk needs to be on the $PYTHONPATH or symlinked as shown above. Here is the symlink, created from the spark-tk root folder:

sudo ln -s $PWD/python/sparktk /usr/lib/python2.7/site-packages/sparktk
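To confirm the path setup, a fresh python interpreter started from any folder should be able to import both packages (just a sanity check):

import pyspark
import sparktk

# print where each package was picked up from
print(pyspark.__file__)
print(sparktk.__file__)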

Run Tests

A quick way to see if things are happy is to build the code and run the tests:

mvn install

To manually kick off the regression tests, cd to integration-tests and run runtests.sh (See the spark-tk/integration-tests/README.md for more info)

To manually run the python unit tests, cd to python/sparktk/tests and run runtests.sh

Generate API docs

  1. Scala docs are built with mvn scala:doc (output found in spark-tk/sparktk-core/target/site/scaladocs)

  2. For the Python docs, see the spark-tk/python/sparktk/doc/README.md

Python Usage

The sparktk library requires a SparkContext at runtime to interact with Spark. To that end, there is a class called TkContext which provides the basic entry point to the sparktk library and holds the SparkContext. So we need to create a TkContext and either give it a SparkContext or tell it how to create one.

>>> import sparktk

>>> tc = sparktk.TkContext()  # passing no parameters, this creates a SparkContext based on default config

Note: only one SparkContext can exist per session ("Cannot run multiple SparkContexts at once" is Spark's rule, enforced by pyspark).

The tc object exposes the library functionality for frames and models, etc. See the Example in the spark-tk README.md for a basic look at using the TkContext.
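For a rough feel of what that looks like, here is a small sketch along the lines of the README example (the frame calls follow that example; check the README or the API docs for the exact signatures):

>>> import sparktk

>>> tc = sparktk.TkContext()  # or hand it an existing SparkContext instead of letting it create one

>>> # build a small frame and take a look at it
>>> frame = tc.frame.create([[1, "one"], [2, "two"], [3, "three"]], schema=[("a", int), ("b", str)])
>>> frame.inspect()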