# Reading input from Ceph Object Store with Apache Spark on OpenShift

In this demonstration we load textual data from [Ceph](http://ceph.com/) using the [S3 API](http://docs.ceph.com/docs/master/radosgw/s3/). There are two key steps to take away from it:

1. Loading the S3 client libraries (hadoop-aws).
2. Configuring the client with Ceph/S3 credentials.


### Important - load the S3 client libraries

This uses some Jupyter line magic to put **--packages** on the pyspark command line for the kernel.

In [2]:
%set_env PYSPARK_SUBMIT_ARGS=--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell

env: PYSPARK_SUBMIT_ARGS=--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell
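
Outside a notebook, the same setting can be applied from plain Python, provided it happens before the first SparkSession is created. The following is a minimal sketch; the helper name `build_submit_args` is hypothetical and not part of PySpark.

```python
import os


def build_submit_args(packages):
    """Return a PYSPARK_SUBMIT_ARGS value that pulls in the given Maven coordinates."""
    # The value has to end with "pyspark-shell" so that PySpark starts its gateway.
    return "--packages {} pyspark-shell".format(",".join(packages))


# Set the variable before the JVM starts, i.e. before creating a SparkSession.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    ["org.apache.hadoop:hadoop-aws:2.7.3"]
)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

The hadoop-aws version should generally match the Hadoop version your Spark distribution was built against; 2.7.3 is simply the version used in this demonstration.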


## Set up a bucket

Before creating a job in our Spark cluster, let's create a bucket using the S3 API:

In [3]:
import boto
import boto.s3.connection

access_key = '4LO0VJO8MEJ3OU98OR7L'
secret_key = 'ulB5kdTTfGXG7AjDaLxcL5VHpSOXFBBLfNRSqduA'
bucket_name = "radanalytics"

ceph_host = '192.168.121.211' # Add your rgw0 Ceph host here
ceph_port = 8080

conn = boto.connect_s3(
        aws_access_key_id = access_key,
        aws_secret_access_key = secret_key,
        host = ceph_host,
        port = ceph_port,
        is_secure=False,
        calling_format = boto.s3.connection.OrdinaryCallingFormat()
        )

bucket = conn.create_bucket(bucket_name)

print("Bucket {} created!".format(bucket_name))


Bucket radanalytics created!


## Put a file in the bucket

In [7]:
from boto.s3.key import Key

object_key = "spark-test"
object_value = "/usr/local/spark/README.md"

bucket = conn.get_bucket(bucket_name)

k = Key(bucket)
k.key = object_key
k.set_contents_from_filename(object_value)

print("Object {} added in bucket {}!".format(object_key, bucket_name))


Object spark-test added in bucket radanalytics!


## List contents in the bucket

In [9]:
for key in bucket.list():
    print("{name}\t{size}\t{modified}".format(
        name = key.name,
        size = key.size,
        modified = key.last_modified,
    ))

spark-test	3818	2017-07-03T18:49:15.466Z


Set up your SparkSession as you normally would.

In [None]:
import pyspark

from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.master("spark://sparky:7077").getOrCreate()

### Important - configure the S3 client with your credentials
Don't store your credentials in code; use [Secrets](https://kubernetes.io/docs/user-guide/secrets/) instead. Use [AWS IAM](https://aws.amazon.com/iam/) and credentials that have only the capabilities your application needs. The keys are written inline below only to keep the demonstration short.

In [21]:
hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "4LO0VJO8MEJ3OU98OR7L")
hadoopConf.set("fs.s3a.secret.key", "ulB5kdTTfGXG7AjDaLxcL5VHpSOXFBBLfNRSqduA")
hadoopConf.set("fs.s3a.endpoint", "192.168.121.211:8080")
hadoopConf.set("fs.s3a.connection.ssl.enabled", "false")
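
One way to keep the keys out of the notebook is to read them from environment variables, which an OpenShift Secret can populate, and build the `fs.s3a.*` settings from those. This is a sketch under assumptions: the variable names `S3_ACCESS_KEY`, `S3_SECRET_KEY` and `S3_ENDPOINT`, and the helper `s3a_settings`, are hypothetical and not defined by Spark, Hadoop or OpenShift.

```python
import os

# Hypothetical variable names; use whatever names your Secret exposes.
REQUIRED_VARS = ("S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_ENDPOINT")


def s3a_settings(env):
    """Build the fs.s3a.* properties from a mapping such as os.environ."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "fs.s3a.access.key": env["S3_ACCESS_KEY"],
        "fs.s3a.secret.key": env["S3_SECRET_KEY"],
        "fs.s3a.endpoint": env["S3_ENDPOINT"],
        # The Ceph gateway in this demonstration is reached over plain HTTP.
        "fs.s3a.connection.ssl.enabled": "false",
    }


# Placeholder values for illustration only; in a pod, pass os.environ instead.
example_env = {
    "S3_ACCESS_KEY": "example-access-key",
    "S3_SECRET_KEY": "example-secret-key",
    "S3_ENDPOINT": "ceph.example:8080",
}
settings = s3a_settings(example_env)
for name in sorted(settings):
    print(name)
```

In the notebook, the resulting dictionary could then be applied with a loop such as `for name, value in s3a_settings(os.environ).items(): hadoopConf.set(name, value)`.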

This is a simple test to see what workers are available in your cluster.

In [22]:
import socket
spark.range(100, numPartitions=100).rdd.map(lambda x: socket.gethostname()).distinct().collect()

['sparky-w-1-79n6p']

## Read a simple text file from S3

In [23]:
df0 = spark.read.text("s3a://radanalytics/spark-test")

In [24]:
df0.schema.jsonValue()

{'fields': [{'metadata': {},
   'name': 'value',
   'nullable': True,
   'type': 'string'}],
 'type': 'struct'}

In [26]:
df0.show(10)

+--------------------+
|               value|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
|MLlib for machine...|
|and Spark Streami...|
|                    |
|<http://spark.apa...|
+--------------------+
only showing top 10 rows



In [34]:
from operator import add
df0.rdd.flatMap(lambda x: list(x[0])).map(lambda x: (x, 1)).reduceByKey(add).sortBy(lambda x: x[1], ascending=False).take(5)

[(u' ', 464), (u'e', 270), (u'a', 253), (u't', 219), (u'o', 212)]

## Word count example

In [55]:
text_file = spark.read.text("s3a://radanalytics/spark-test")
counts = text_file.rdd.flatMap(lambda line: line[0].split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(add) \
             .take(10)
counts

[(u'', 72),
 (u'project.', 1),
 (u'help', 1),
 (u'when', 1),
 (u'Hadoop', 3),
 (u'not', 1),
 (u'./dev/run-tests', 1),
 (u'including', 4),
 (u'graph', 1),
 (u'computation', 1)]
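
To see what the word count pipeline computes without a cluster, the same steps can be mirrored in plain Python. This is an illustrative sketch, not Spark code; `word_counts` and the sample lines are made up for the example. Splitting on a single space turns blank lines and consecutive spaces into empty strings, which is why `u''` shows up in the result above.

```python
from collections import Counter


def word_counts(lines):
    """Mirror flatMap(split) -> map((word, 1)) -> reduceByKey(add) locally."""
    counts = Counter()
    for line in lines:
        # str.split(" ") keeps empty strings for blank lines and double spaces.
        counts.update(line.split(" "))
    return counts


# Made-up sample lines: one blank line and one double space.
sample = ["# Apache Spark", "", "Spark is a fast  engine"]
counts = word_counts(sample)
print(counts.most_common(3))
```

Note that `take(10)` in the Spark version returns ten pairs in no particular order of frequency. To get the most frequent words, sort by count first, as the character count cell does with `sortBy`.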