# Connecting to Spark

This is an IPython notebook.  You can execute a cell by clicking on it and pressing Shift-Enter.

We can execute Spark commands in here directly and get immediate results.

We're going to be using Python with DataFrames, which are only available in Spark 1.3 or later.  We're going to be using a recent version of open source Spark.  To use DataFrames, you'll have to import the `SQLContext`.

In [None]:
from pyspark.sql import SQLContext
sql = SQLContext(sc)

# Reading a Cassandra Table

In [None]:
user = sql.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace="training", table="user")

# Displaying results

DataFrames are evaluated lazily: if we never perform an action, our DataFrame is never read in.  We can force it to be read by calling `collect()`, which returns every row to the driver, or `show()`, which prints the first 20 rows.

In [None]:
user.collect()

In [None]:
user.show()
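
The laziness described above can be modeled in plain Python with a generator.  This is only an analogy, not Spark code: `read_rows` and `log` are hypothetical names, and the two hard-coded rows stand in for a Cassandra table.

```python
# Plain-Python analogy for lazy evaluation; no Spark required.
log = []  # records when the hypothetical "table" is actually read

def read_rows():
    """Hypothetical stand-in for a table scan."""
    log.append("table read")
    yield {"name": "Larry", "age": 101}
    yield {"name": "Jon", "age": 34}

# Building the pipeline does no work, much like defining a DataFrame.
pipeline = (row for row in read_rows() if row["age"] > 50)
reads_before = len(log)

# Consuming the pipeline forces evaluation, much like collect() or show().
result = list(pipeline)
reads_after = len(log)

print(reads_before, reads_after, result)
```

Nothing is read until `list()` consumes the pipeline, which is roughly the role `collect()` and `show()` play for a DataFrame.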

# Basic Filtering

If we're going to do anything with our data, we need to be able to do one simple task: filtering.

Here's the syntax for filtering:

In [None]:
user.filter(user.age > 2)

There's an alternative syntax for filtering:

In [None]:
user[user.age > 2]

There's also a third syntax, a SQL expression string, which is handy for more complex filters:

In [None]:
user.filter("age > 100 or name = 'Larry'").collect()

Try filtering for users named "Jon".

# Selecting specific columns

When you only want to see specific fields in a DataFrame, you will use the `select()` method.  For example:

In [None]:
user.select(user.age)

Sometimes you'll want to use a different name for a field than is in the original DataFrame.  For that, you'll want to know about `.alias()`.  For instance:

In [None]:
user.select(user.name, user.age.alias("years"))

When you have a pipeline of DataFrame queries and need to filter on an aliased column, you'll need to either temporarily assign the intermediate DataFrame to a variable or use the SQL string syntax.  For instance:

In [None]:
user.select(user.name, user.age.alias("years")).filter("years > 10").collect()

In [None]:
tmp = user.select(user.name, user.age.alias("years"))
tmp[tmp.years > 10].collect()

# Select Expressions

Select expressions allow you to perform various SQL-like operations on your data.  The expressions are evaluated in the JVM rather than in Python.

In [None]:
user.selectExpr("age * 10 as old_age").collect()

# Convenience functions

When working with DataFrames you'll frequently need access to some convenience functions.  For instance, `explode()` is used when you're working with sets and lists.  It creates one row per item in the collection.

In [None]:
from pyspark.sql.functions import explode
user.select(explode(user.favorite_foods)).collect()

For queries like the above, it's useful to use our alias command:

In [None]:
user.select(explode(user.favorite_foods).alias("food")).collect()
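
To make the row-per-item behavior concrete, here is a minimal plain-Python model of it.  This is not Spark code: `explode_rows` is a hypothetical helper and the sample users are made up.  Unlike the cells above, it also carries the `name` field along with each exploded item.

```python
# Plain-Python model of what explode() does to a collection column.
users = [
    {"name": "Jon", "favorite_foods": ["Bacon", "Pizza"]},
    {"name": "Larry", "favorite_foods": ["Sushi"]},
]

def explode_rows(rows, column, alias):
    """Yield one output row per element of each row's collection column."""
    for row in rows:
        for item in row[column]:
            yield {"name": row["name"], alias: item}

exploded = list(explode_rows(users, "favorite_foods", "food"))
for row in exploded:
    print(row)
```

Two input rows become three output rows, one per food.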

Tip: When you refer to `user.age`, you're looking at a `Column`.  The API for `Column` is here: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column

**Advanced Query:** Try selecting the users who have the favorite food "Bacon".  You'll need to use `explode()`, `alias()` and a filter.

# A nicer reader

Personally I find needing to code `org.apache.spark.sql.cassandra` everywhere a little annoying.  Here are a couple of convenience functions that each return a function (slightly tricky) that can be used to reference tables in a keyspace.  Execute the block below.  You can then refer to tables like so:

`user = reader("user")`

In [None]:
def create_reader(sql, keyspace):
    def reader(table):
        df = sql.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace=keyspace, table=table)
        return df
    return reader

def create_writer(sql, keyspace, mode="append"):
    def writer(df, table):
        # Pass the caller's save mode through instead of hardcoding "append"
        df.write.format("org.apache.spark.sql.cassandra").\
                 options(table=table, keyspace=keyspace).save(mode=mode)
    return writer

writer = create_writer(sql, "training")
reader = create_reader(sql, "training")
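
If the function-returning-a-function pattern is unfamiliar, the sketch below shows the same idea without Spark.  It assumes a hypothetical in-memory dictionary, `FAKE_TABLES`, in place of Cassandra; the point is only that the inner function remembers the `keyspace` it was created with.

```python
# Closure-factory pattern with a hypothetical in-memory backend; not Spark code.
FAKE_TABLES = {("training", "user"): [{"name": "Jon"}]}

def create_reader(tables, keyspace):
    def reader(table):
        # keyspace is captured from the enclosing create_reader() call
        return tables[(keyspace, table)]
    return reader

def create_writer(tables, keyspace, mode="append"):
    def writer(rows, table):
        key = (keyspace, table)
        if mode == "overwrite":
            tables[key] = list(rows)
        else:
            tables.setdefault(key, []).extend(rows)
    return writer

reader = create_reader(FAKE_TABLES, "training")
writer = create_writer(FAKE_TABLES, "training")

writer([{"name": "Larry"}], "user")
names = [row["name"] for row in reader("user")]
print(names)
```

Callers only pass a table name; the keyspace is fixed once, when the reader or writer is created.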

# Data Migrations

One thing Spark is useful for is performing data migrations.  We frequently need to take a table and write it out with a new structure.  Here's an example where we take the user table and construct a new table from it.  The `writer()` function takes a DataFrame and a table name.  Currently the fields need to be in the correct order in the DataFrame.  Let's build an index of age -> user, adults only.  After you execute the cell below, verify that the correct data is in the table `adults`.

In [None]:
adults = user[user.age > 18].select('age', 'user_id', 'name')
writer(adults, "adults")

Now it's your turn.  This migration may be a little tricky.  What we want to do is map foods people like to users.  We're going to want to save this in the table `favorite_foods_index`.  Look at its structure using cqlsh or DevCenter.  Hint: take a look at the documentation for `explode()`:

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode

# Loading External Data

In the Spark world, the traditional means of working with data was the RDD.  It's more flexible than DataFrames, but slower to work with in Python.  Unfortunately we don't have time to dig into RDDs today - I've provided the code to load movies and ratings, and convert them to DataFrames.

In [None]:
from load_data import load_movies, load_ratings
movies = load_movies(sc, writer)
ratings = load_ratings(sc, writer)

movies.show()
ratings.show()

You may notice below that the ratings DataFrame we're creating is called ratings-subset.  This is to minimize the memory used by the virtual machine.  If you set up Spark and Cassandra on your local machine, or provide more memory to the VM, you could load and work with the entire dataset.

# SparkSQL

The programmatic interface above is pretty convenient, and in my opinion, fun.  There's another interface that's very convenient if you come from a SQL background: SparkSQL.  SparkSQL supports quite a bit of Hive's SQL dialect.

You can register a table to query with SQL like so:

In [None]:
user.registerTempTable("user")

Try registering your ratings DataFrame as `ratings`, and your movies DataFrame as `movies`.

How's your SQL?  You can execute queries against the temp tables you've registered.  You can perform JOINs, aggregations, sorting, etc.  For instance:

In [None]:
sql.sql("SELECT * from movies where movie_id=1")

Try your hand at a few queries.  Find the IDs of 3 movies you love.  For a more advanced challenge, get a list of all the movies made in the year you were born.  (Hint: LIKE)

# JOINS and Aggregations

Since we've put our movies and our ratings in tables, we can join them.  Pretty convenient.  We can do various JOINs.  By default, like an RDBMS, the inner join is used, but we can also do LEFT, RIGHT, and FULL joins.  We also have unions and subqueries.  We can perform aggregations on our results as well.  We can take the results of any query (a DataFrame) and use it as a table for future queries.  This is incredibly powerful.

Full docs: https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
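
To see how an inner join differs from a LEFT join once an aggregation is involved, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for SparkSQL.  The two tables and their rows are made up for illustration, and SQLite's dialect is not identical to SparkSQL's, so treat it as a model of the idea rather than of the workshop data.

```python
import sqlite3

# In-memory SQLite database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (movie_id INTEGER, name TEXT);
    CREATE TABLE ratings (movie_id INTEGER, user_id INTEGER, rating INTEGER);
    INSERT INTO movies VALUES (1, 'Toy Story (1995)'), (2, 'Heat (1995)');
    INSERT INTO ratings VALUES (1, 10, 5), (1, 11, 3);
""")

# Inner join: movies with no ratings drop out of the result.
inner = conn.execute("""
    SELECT m.name, COUNT(r.rating) AS num_ratings
    FROM movies m JOIN ratings r ON m.movie_id = r.movie_id
    GROUP BY m.movie_id, m.name
    ORDER BY m.movie_id
""").fetchall()

# Left join: every movie is kept, with a count of 0 when nothing matches.
left = conn.execute("""
    SELECT m.name, COUNT(r.rating) AS num_ratings
    FROM movies m LEFT JOIN ratings r ON m.movie_id = r.movie_id
    GROUP BY m.movie_id, m.name
    ORDER BY m.movie_id
""").fetchall()

print(inner)
print(left)
conn.close()
```

The unrated movie appears only in the LEFT join result, which matters whenever you aggregate over a table that may have no matching rows.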

Try writing the following queries:

- Calculate the average rating per movie.  Save this result set into the `average_rating` table
- For the 5 lowest rated movies on average, who rated them?
- For the users who rated the bottom 5 movies, what were their average ratings?

We're going to want to be able to take a given tag and find all the movies for it.  This is going to be a frequently run query, so we want a dedicated table for it.  Create and save a new DataFrame to the table `movies_by_tag` that will let me query a table for a given tag and get a list of movies back.  Register the DataFrame with the SQL context as a table with the same name.

# The challenge

We're going to need to be able to show movies in a given tag ordered by their rank in our system.  For example, I should be able to ask Cassandra for the top 20 Adventure movies.

Create a new table in Cassandra and the Spark code to fill it.  You may use any DataFrame already created as well as any table.

# Pandas and Plotting

One of the benefits of working with Python is that you have access to another excellent data manipulation library, Pandas, and a plotting library, matplotlib.  To tell our notebook we want to be able to display plots inline, we do the following:

In [None]:
%matplotlib inline

I want to see which tags are used the most, globally, as a bar graph.  Fortunately, Pandas and Matplotlib make this straightforward.  Any Spark DataFrame can be exported as a Pandas DataFrame using `toPandas()`.  Pandas has a convenient call, `plot()`, to display charts via matplotlib.

For example:
```
# Named pdf rather than pd, since pd is the conventional alias for the pandas module
pdf = my_dataframe.toPandas()
pdf.set_index('x_axis_name').plot(kind='bar')
```
http://pandas.pydata.org/pandas-docs/stable/visualization.html

Note the `set_index()` call - you'll want it to make sure you have the correct label.  See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html

Use what you've learned today to count the number of instances of each tag, then visualize it.