# Learning PySpark, Jupyter & Plotly 

## tl;dr

From the menu, click `Cell -> Run All`. Scroll to the bottom. Wait patiently for the plot to appear. 

### Things to Know

1. If you are new to notebooks and want to learn more about them, see [this article](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook).
2. The `Help` section in the menu links to handy references.
3. This notebook is themed with a dark style so the programmer's eyes don't hurt. The `jupyterthemes` styling removes the top navigation bar. Don't be alarmed.
4. If you want to learn more about Spark, see [A Gentle Introduction to Spark](http://go.databricks.com/gentle-intro-spark).

### If You Have An Issue Running A Cell

1. The package may not be installed. Navigate to the root path on the host (i.e. the URL in your browser without any `/`'s after it), open the menu `New -> Terminal`, and run `pip install --user <package>` to install the missing package.

In [None]:
%%bash

cat /etc/*-release
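
If a cell fails on an import, it can help to confirm which packages are actually missing before installing anything. The following is a minimal sketch using only the standard library; the `missing_packages` helper and the list of package names are illustrative assumptions, not part of the original notebook. Keep in mind that the import name can differ from the name `pip` expects.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names in `names` that cannot be found by this kernel."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list: the third-party packages imported by the cells in this notebook.
missing = missing_packages(["requests", "pyspark", "plotly"])
if missing:
    print("Install from a terminal with: pip install --user " + " ".join(missing))
else:
    print("All required packages are importable.")
```

After installing a package, restart the kernel if the import still fails.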

### Fit the Notebook to Your Screen

The cell below updates the CSS so the notebook fills the width of your browser window. Running it is recommended.

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))



### Download the Bitcoin & Ethereum Pricing Data

Try not to run this cell often, as the API is rate limited. Once is enough: notebooks are stateful, and the files remain on the file system after the first execution.

In [None]:
import requests

btc_url = "https://www.quandl.com/api/v3/datasets/BCHARTS/BITSTAMPUSD.csv?api_key=T_g99ExMU_4X_Bgss6Zx"
eth_url = "https://etherscan.io/chart/etherprice?output=csv"

# Download each CSV and write it to the notebook's working directory.
for url, filename in [(btc_url, 'bitcoin.csv'), (eth_url, 'ethereum.csv')]:
    response = requests.get(url)
    response.raise_for_status()  # fail loudly instead of saving an error page as CSV
    with open(filename, 'w') as f:
        f.write(response.text)
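
Since the API is rate limited, one way to make the download cell safe to re-run is to skip the request when the file already exists. This is a minimal sketch, not the notebook's own method: the `download_once` and `fetch_text` helpers are hypothetical names, and the 30-second timeout is an arbitrary assumption.

```python
import os


def fetch_text(url):
    """Default fetcher: download `url` with requests and return the body as text."""
    import requests  # imported lazily so the helper can be exercised without network access

    response = requests.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()
    return response.text


def download_once(url, path, fetch=fetch_text):
    """Write the body of `url` to `path` unless the file already exists.

    Returns True if a download happened, False if the existing file was reused.
    """
    if os.path.exists(path):
        return False
    text = fetch(url)
    with open(path, "w") as handle:
        handle.write(text)
    return True
```

With this in place, a call such as `download_once(btc_url, 'bitcoin.csv')` only contacts the API the first time. Delete the file to force a fresh download.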

### Create the Spark Context with PySpark

In case you are new to Spark (like me), you should know the core engine is written in Scala. The Python language binding is known as PySpark. The Spark Context holds the shared session state for a program's interaction with the Spark cluster. Cluster resources can be used by one or more programs, so Spark supports more than one session. This Spark deployment is local, but in the future it will be updated with radanalytics.io drivers to spin up a cluster in OpenShift.

In [None]:
from pyspark.sql import SparkSession

# "local[*]" runs Spark locally with one worker thread per logical core.
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

### Create the Spark Dataframe for Bitcoin Prices

You can think of a Dataframe like a table in a SQL database, composed of rows and columns. Here, we'll load the `csv` created earlier as a Dataframe, and then create a view of the data so we can query it with Spark SQL.

In [None]:
btc_df = spark.read.format("csv").option("header", "true").load("bitcoin.csv")
btc_df.createOrReplaceTempView("bitcoin_prices")

dates = spark.sql("SELECT Date as d FROM bitcoin_prices")
# Note: despite the variable name, this selects the daily Close price.
opening = spark.sql("SELECT Close as o FROM bitcoin_prices")


### Create the Spark Dataframe for Ethereum Prices

This is much like the Bitcoin prices, but here we need to massage the data into the same shape as the Bitcoin data in order to plot it correctly. To do this, we need a few transformations.

In [None]:
from pyspark.sql import Row

df_eth = spark.read.format("csv").option("header", "true").load("ethereum.csv")

def reformat_date(string):
    # Convert "M/D/YYYY" (assumed input format) to zero-padded "YYYY-MM-DD".
    month, day, year = string.split("/")
    return "{}-{}-{}".format(year, month.zfill(2), day.zfill(2))
    

df_eth2 = (df_eth.withColumnRenamed("Date(UTC)", "date").
    withColumnRenamed("UnixTimeStamp", "timestamp").
    withColumnRenamed("Value", "value"))

df_eth3 = df_eth2.rdd.map( lambda r : 
                       Row( date = reformat_date(r[0]), 
                           timestamp = r[1],
                           value = r[2])
                      ).toDF()

df_eth3.createOrReplaceTempView("ethereum_prices")


dates_eth = spark.sql("SELECT date as d FROM ethereum_prices")
values = spark.sql("SELECT value as v FROM ethereum_prices")
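
The manual split in `reformat_date` can also be expressed with the standard library's `datetime` parsing, which has the side benefit of rejecting malformed input. This is a minimal sketch, assuming the Ethereum dates arrive as `M/D/YYYY`; the name `reformat_date_strict` is hypothetical.

```python
from datetime import datetime


def reformat_date_strict(string):
    """Convert 'M/D/YYYY' to 'YYYY-MM-DD', raising ValueError on any other format."""
    return datetime.strptime(string, "%m/%d/%Y").strftime("%Y-%m-%d")


print(reformat_date_strict("7/30/2015"))  # prints 2015-07-30
```

A strict parser surfaces format changes in the upstream CSV as an error in the transformation step, rather than as oddly placed points on the plot.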


### Create The Plot with Plotly

Spark does not yet have native plotting tools, so we do the transformations we need in Spark and export the results to pandas. Despite the cute name, pandas DataFrames are just Python-native representations of the `Dataframe` concept, and they integrate with Plotly, our graphing library.

Once the graph is created, zoom in on interesting areas by clicking and dragging a region. Then double-click to zoom back out.

In [None]:
import plotly.offline as plotly
import plotly.graph_objs as go

plotly.init_notebook_mode(connected=True)  # `plotly` is already the plotly.offline module here

trace_btc = go.Scatter(
    x = dates.toPandas()['d'],
    y = opening.toPandas()['o'],
    name = "bitcoin"
)
trace_eth = go.Scatter(
    x = dates_eth.toPandas()['d'],
    y = values.toPandas()['v'],
    name = "ethereum"
)

data = [trace_eth, trace_btc]
plotly.iplot(data)
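
Because the CSV files were read without schema inference, every column reaches pandas as a string. Plotly usually copes with that, but converting the columns explicitly makes the axis types unambiguous. The sketch below uses pandas only; the `prepare_series` helper is a hypothetical name, and the sample rows are made-up placeholder values, not real prices.

```python
import pandas as pd


def prepare_series(frame, date_col, value_col):
    """Return a new frame with parsed dates and numeric values, sorted by date.

    Rows whose value cannot be parsed as a number are dropped.
    """
    out = pd.DataFrame({
        "date": pd.to_datetime(frame[date_col]),
        "value": pd.to_numeric(frame[value_col], errors="coerce"),
    })
    return out.dropna(subset=["value"]).sort_values("date").reset_index(drop=True)


# Placeholder rows shaped like the output of toPandas() on string columns.
sample = pd.DataFrame({"d": ["2018-01-02", "2018-01-01"], "o": ["101.25", "100.5"]})
prepared = prepare_series(sample, "d", "o")
print(prepared)
```

In the notebook, the same helper could be applied to the result of a single `toPandas()` call per currency, with `prepared["date"]` and `prepared["value"]` passed as `x` and `y` to `go.Scatter`.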