A simple wrapper over PySpark that lets you use the capabilities of Spark (+ Cassandra) with straightforward, simple code :)
Simple Spark Lib

  1. Lets you use the capabilities of Spark without actually writing Spark code.
  2. Includes many workflows that help you write code and get your results in just a few lines.
  3. For power users, it allows you to tweak every step in the flow.

Prerequisite:

This assumes that you have access to an Apache Spark cluster (and a Cassandra cluster if you are working with the Cassandra workflow).

Installation:

Clone the repo and build with the command:

python setup.py install

Uninstallation:

sudo pip uninstall simple_spark_lib

Usage:

Cassandra Workflow example:

# First, import your libraries
from simple_spark_lib import SimpleSparkCassandraWorkflow

# Define connection configuration for cassandra
cassandra_connection_config = {
  'host':     '192.168.56.101',
  'username': 'cassandra',
  'password': 'cassandra'
}

# Define Cassandra Schema information
cassandra_config = {
  'cluster': 'rootCSSCluster',
  'tables': {
    # <table alias (Spark's temporary table name)>: '<keyspace>.<table_name>' in Cassandra
    'api_events': 'simpl_events_production.api_events',
  }
}
# Initialize your workflow
workflow = SimpleSparkCassandraWorkflow(appName="Simple Example Worker")

# Setup the workflow with configurations
workflow.setup(cassandra_connection_config, cassandra_config)

# Run your favourite query
df = workflow.process(query="SELECT * FROM api_events")

print(df.take(10))

Run this example with the command:

simple-runner filename.py -d cassandra
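Under the hood, a workflow like this typically translates the connection and schema dicts into spark-cassandra-connector options before registering each table alias as a Spark temporary table. The sketch below illustrates that mapping only; the `to_connector_options` helper is hypothetical (not part of this library), and the option keys follow the spark-cassandra-connector's `spark.cassandra.*` naming convention:

```python
def to_connector_options(connection_config, keyspace_table):
    """Map a simple connection dict plus a '<keyspace>.<table>' string
    onto spark-cassandra-connector style option keys.

    This is an illustrative sketch of the translation the workflow
    performs, not the library's actual internals.
    """
    keyspace, table = keyspace_table.split('.', 1)
    return {
        'spark.cassandra.connection.host': connection_config['host'],
        'spark.cassandra.auth.username': connection_config.get('username', ''),
        'spark.cassandra.auth.password': connection_config.get('password', ''),
        'keyspace': keyspace,
        'table': table,
    }

opts = to_connector_options(
    {'host': '192.168.56.101', 'username': 'cassandra', 'password': 'cassandra'},
    'simpl_events_production.api_events',
)
print(opts['keyspace'])  # simpl_events_production
print(opts['table'])     # api_events
```

In real code these options would be passed to Spark's DataFrame reader (e.g. `spark.read.format("org.apache.spark.sql.cassandra").options(**opts)`) before the registered table can be queried with SQL, which is the step the workflow's `process(query=...)` call wraps for you.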