# Spark Test
*Danny Luo*

The following tutorial tests the basic capabilities of Spark and S3 I/O. This notebook is written for Spark 2.0.2; it will not work on Spark 1.x, because the S3 I/O steps rely on the CSV reader (formerly the separate spark-csv package) that is integrated into Spark 2.0+. Parts of this notebook are adapted from this [tutorial](http://blog.insightdatalabs.com/jupyter-on-apache-spark-step-by-step/).

In [1]:
#Checking if Spark Context is running
sc

<pyspark.context.SparkContext at 0x7f66d665ab10>

In [2]:
#Checking if SQL Context is running
sqlCtx

<pyspark.sql.context.SQLContext at 0x7f66b44ac550>

In [3]:
#Parallelizing a simple range of 1000 integers into 20 partitions across your workers
rdd = sc.parallelize(range(1000), 20)
rdd.getNumPartitions()

20

In [6]:
#Caching an RDD lets it persist in the workers' memory; only do this with data you expect to reuse often
#Once the action below has run, the RDD "my_rdd" should appear under the Storage tab of the Spark UI on port 4040
rdd.setName("my_rdd").cache()
#Performing a test action: cache() is lazily evaluated, so nothing is actually cached until an action runs
rdd.count()

1000

## S3

Now we will try to import our practice dataset `iris_data.csv` from our S3 bucket into Spark as an RDD. Modify the S3 path to point to your file as necessary. The syntax is `s3n://yourbucketname/path/to/file`.
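As a small convenience, the path can be assembled from its two parts. The sketch below is only an illustration: the helper name `s3n_uri` is hypothetical (it is not part of Spark or of the original notebook), and it simply trims stray slashes so that bucket and key are joined by exactly one separator.

```python
def s3n_uri(bucket, key):
    """Build an s3n:// URI from a bucket name and an object key (hypothetical helper)."""
    # Strip stray slashes so the bucket and key are joined by a single "/"
    return "s3n://{}/{}".format(bucket.strip("/"), key.lstrip("/"))


# "BucketName" is the placeholder bucket used throughout this notebook
iris_path = s3n_uri("BucketName", "iris_data.csv")
print(iris_path)
```

The resulting string can be passed wherever the notebook uses a literal S3 path.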

In [7]:
#First we will load it in as a text file
iris_raw_RDD = sc.textFile('s3n://BucketName/iris_data.csv')
iris_raw_RDD.take(5)

[u'sepal_length,sepal_width,petal_length,petal_width,species',
 u'5.1,3.5,1.4,0.2,setosa',
 u'4.9,3,1.4,0.2,setosa',
 u'4.7,3.2,1.3,0.2,setosa',
 u'4.6,3.1,1.5,0.2,setosa']

That was pretty cool, but let's see if we can read it in as a CSV. As an exercise in Spark transformations and actions, you can try to turn the raw text file above into a dataset yourself, but here we will simply use a handy existing library.
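For readers who want to attempt the exercise, here is a minimal sketch of one possible approach, assuming every data line has four numeric fields followed by the species name. The function names `is_data_line` and `parse_line` are made up for illustration. The parsing logic is plain Python and is run here on a local copy of the five lines shown above; the commented lines at the end indicate how the same functions might be applied to `iris_raw_RDD` with `filter` and `map`.

```python
# Local copy of the lines returned by iris_raw_RDD.take(5) above
sample_lines = [
    'sepal_length,sepal_width,petal_length,petal_width,species',
    '5.1,3.5,1.4,0.2,setosa',
    '4.9,3,1.4,0.2,setosa',
    '4.7,3.2,1.3,0.2,setosa',
    '4.6,3.1,1.5,0.2,setosa',
]

header = sample_lines[0]


def is_data_line(line):
    """Return True for every line except the header row."""
    return line != header


def parse_line(line):
    """Split a CSV line into four floats and the species string."""
    fields = line.split(',')
    return [float(x) for x in fields[:4]] + [fields[4]]


# Plain-Python equivalent of a filter followed by a map
rows = [parse_line(line) for line in sample_lines if is_data_line(line)]
print(rows[0])

# On the cluster the same functions could be used like this (not run here):
# header = iris_raw_RDD.first()
# iris_rows_RDD = iris_raw_RDD.filter(lambda l: l != header).map(parse_line)
```

Note that `filter` and `map` are transformations, so nothing would be computed on the cluster until an action such as `take` or `count` is called on the result.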

In [1]:
#Note: this is a Spark 2.0+ command; the CSV reader (formerly spark-csv) is built in.
iris_df = spark.read.csv("s3n://BucketName/iris_data.csv", header=True)

iris_df.show()

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|          3|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|           5|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|           5|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
|         5.4|        3.7|         1.5|        0.2| setosa|
|         4.8|        3.4|         1.6|        0.2| setosa|
|         4.8|          3|         1.4|        0.1| setosa|
|         4.3|          3|         1.1| 

In [31]:
#Generating some statistics for sepal length.
#describe() returns a DataFrame of (summary, value) rows; collectAsMap() turns it into a dict
iris_stats_df = iris_df.describe('sepal_length').rdd.collectAsMap()
iris_stats_df

{u'count': u'150',
 u'max': u'7.9',
 u'mean': u'5.843333333333335',
 u'min': u'4.3',
 u'stddev': u'0.8280661279778637'}
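Notice that every value in the dictionary is a string, so the statistics need to be converted before doing arithmetic with them. A minimal sketch of that conversion, using a local copy of the output above (the names `iris_stats`, `numeric_stats` and `value_range` are only illustrative):

```python
# Local copy of the dictionary shown above; all values are strings
iris_stats = {
    'count': '150',
    'max': '7.9',
    'mean': '5.843333333333335',
    'min': '4.3',
    'stddev': '0.8280661279778637',
}

# Convert each statistic to a float, then store the count as an integer
numeric_stats = {name: float(value) for name, value in iris_stats.items()}
numeric_stats['count'] = int(iris_stats['count'])

# With numeric values we can, for example, compute the range of sepal lengths
value_range = numeric_stats['max'] - numeric_stats['min']
print(numeric_stats['count'], value_range)
```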

Now we will try writing the iris data back to S3.

In [4]:
#Saving locally does not seem to work in the Jupyter environment, but it does work in the PySpark shell.
#In Jupyter, saving locally creates an empty directory (with a _SUCCESS indicator).

#However, this will work if you save to S3
iris_df.write.csv('s3n://BucketName/iris_data_2')

Check your S3 console, or use the AWS CLI, to confirm that the file has been uploaded properly. The output is saved as a directory of partition files; in this case, only one partition is created.
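Two defaults of the CSV writer are worth keeping in mind when checking the result: it does not write a header row unless asked to, and it raises an error if the target path already exists. The sketch below wraps the write call with the `header` and `mode` parameters of `DataFrameWriter.csv`; the wrapper name `save_csv_with_header` is hypothetical, and whether overwriting is appropriate depends on your bucket.

```python
def save_csv_with_header(df, path):
    """Write a DataFrame as CSV with a header row, replacing existing output (hypothetical helper)."""
    # header=True writes the column names as the first line of each part file;
    # mode='overwrite' replaces the output directory if it already exists
    df.write.csv(path, header=True, mode='overwrite')


# Usage in this notebook (not executed here):
# save_csv_with_header(iris_df, 's3n://BucketName/iris_data_2')
```

Use this with care, since `mode='overwrite'` removes whatever was previously stored under the target path.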