# Connecting to Spark

This is an IPython notebook.  You can execute a cell by clicking on it and pressing Shift-Enter.

We can execute Spark commands here directly and get immediate results.

When you open up a notebook with `pys`, you automatically have a variable, `sc`, available.  This is a Spark Context.  It's our starting point for all Spark operations.

In [None]:
print(sc)

We're going to be using Python with DataFrames, which are only available in Spark 1.3 or later.  We're using a recent version of open source Spark.  To use DataFrames, you'll have to import `SQLContext`.

In [None]:
from pyspark.sql import SQLContext
sql = SQLContext(sc)
print(sql)

# Reading a Cassandra Table

In [None]:
users = sql.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace="movielens_small", table="users")
print(users)

# Displaying results

DataFrames are evaluated lazily: until we perform an action, nothing is read from Cassandra.  We can force the read and see the results by calling `collect()`, which returns the rows to the notebook as a list, or `show()`, which prints them as a table.

In [None]:
users.limit(1).collect()

In [None]:
users.limit(10).show()
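
The lazy behaviour described above can be illustrated without Spark at all.  The following is a plain-Python analogy, not Spark code: a generator stands in for the pending query, and `list()` loosely plays the role of `collect()`.  The names `fake_table`, `read_rows` and `rows_read`, and the rows themselves, are made up for illustration.

```python
# Plain-Python analogy for lazy evaluation; this is NOT Spark code.
# All names and rows here are hypothetical.
fake_table = [("alice", 31), ("bob", 17), ("carol", 25)]
rows_read = []  # records which rows were actually touched


def read_rows(table):
    """Yield rows one at a time, recording each read."""
    for row in table:
        rows_read.append(row)
        yield row


# Building the pipeline reads nothing, much like defining a DataFrame.
pipeline = (row for row in read_rows(fake_table) if row[1] > 20)
reads_before = len(rows_read)
print(reads_before)

# Consuming the pipeline forces the read, loosely like collect().
result = list(pipeline)
print(result)
print(len(rows_read))
```

The first `print` shows that no rows were touched while the pipeline was only being described; the reads happen when the result is actually requested.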

# Basic Filtering

If we're going to do anything with our data, we need to be able to do a simple task: Filtering.

Here's the syntax for filtering:

In [None]:
users.filter(users.age > 20).limit(1).show()

There's an alternative syntax for filtering:

In [None]:
users[users.age > 20]

And of course, there's a third syntax, a SQL expression string, which is handy for filters that have a degree of complexity:

In [None]:
users.filter("name LIKE 'Dani%'").show()

Try filtering for users named "Jon".

# Selecting specific columns

When you only want to see specific fields in a DataFrame, you will use the `select()` method.  For example:

In [None]:
users.select(users.age)

Sometimes you'll want to use a different name for a field than is in the original DataFrame.  For that, you'll want to know about `.alias()`.  For instance:

In [None]:
users.select(users.name, users.age.alias("years"))

When you have a pipeline of DataFrame queries and need to filter on an aliased column, you can't reference it through the original DataFrame (there is no `users.years`).  You'll need to either use the SQL string syntax or temporarily assign the intermediate DataFrame to a variable.  For instance:

In [None]:
users.select(users.name, users.age.alias("years")).filter("years > 10").show()

In [None]:
tmp = users.select(users.name, users.age.alias("years"))
tmp[tmp.years > 10].show()

# Select Expressions

Select expressions let you run SQL-like expressions against your data.  They are evaluated inside the JVM rather than in Python.

In [None]:
users.selectExpr("age * 10 as old_age").show()

# Convenience functions
When working with DataFrames you'll frequently need access to some convenience functions.  For instance, `explode()` is used when you're working with sets and lists.  It creates one row per item in the collection.

In [None]:
movies = sql.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace="movielens_small", table="movies")

In [None]:
from pyspark.sql.functions import explode

movies.select(explode(movies.genres), movies.name).show()

For queries like the above, it's useful to use our alias command:

In [None]:
movies.select(explode(movies.genres).alias("genre"), movies.name).show()
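
To make the effect of `explode()` concrete, here is a plain-Python sketch of the same one-row-per-item expansion.  It does not use Spark; the `explode_rows` helper and the sample rows are hypothetical and exist only for illustration.

```python
# Plain-Python sketch of what explode() does to a collection column.
# NOT Spark code; the helper name and the sample data are made up.
sample_movies = [
    {"name": "Movie A", "genres": ["Drama", "Comedy"]},
    {"name": "Movie B", "genres": ["Action"]},
]


def explode_rows(rows, column, alias):
    """Return one output row per item found in each row's `column`."""
    exploded_rows = []
    for row in rows:
        for item in row[column]:
            exploded_rows.append({alias: item, "name": row["name"]})
    return exploded_rows


exploded = explode_rows(sample_movies, "genres", "genre")
for row in exploded:
    print(row)
```

Two input rows become three output rows: one per genre, with the movie name repeated alongside each.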

Tip: When you refer to `movies.genres`, you're looking at a `Column`.  The API for `Column` is here: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column

**Advanced Query:** Try selecting the movies that have the genre "Drama".  You'll need to use `explode()`, `alias()` and a filter.

# A nicer reader

Personally, I find needing to type `org.apache.spark.sql.cassandra` everywhere a little annoying.  Here are a couple of convenience functions, each of which returns a function (slightly tricky) that can be used to read from or write to tables in a keyspace.  Execute the block below.  You can then refer to tables like so:

`users = reader("users")`

In [None]:
def create_reader(sql):
    def reader(table):
        df = sql.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace="movielens_small", table=table)
        return df
    return reader

def create_writer(sql, mode="append"):
    def writer(df, table):
        df.write.format("org.apache.spark.sql.cassandra").\
                 options(table=table, keyspace="movielens_small").save(mode=mode)
    return writer

writer = create_writer(sql)
reader = create_reader(sql)
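
The "function that returns a function" pattern used above is a closure: the inner function remembers the arguments that were passed to the outer one.  Here is a minimal plain-Python sketch with no Spark involved; `create_namer` and `namer` are hypothetical names chosen for illustration.

```python
# Minimal closure sketch; NOT Spark code. All names are hypothetical.
def create_namer(keyspace):
    """Return a function that remembers `keyspace`."""
    def namer(table):
        # `keyspace` is captured from the enclosing call.
        return "{}.{}".format(keyspace, table)
    return namer


namer = create_namer("movielens_small")
print(namer("users"))
print(namer("movies"))
```

`create_reader(sql)` works the same way: the returned `reader` keeps hold of `sql`, so callers only have to supply the table name.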

# Data Migrations

One thing Spark is useful for is performing data migrations.  We frequently need to take a table and write it out in a new structure.  Here's an example where we take the movies table and construct a new table that maps genres to movies.  The `writer()` function takes a DataFrame and a table name.

Create this table in `cqlsh`:

```
CREATE TABLE movies_by_genre (
  genre text,
  id uuid,
  name text,
  avg_rating float,
  primary key(genre, id)
);
```

In [None]:
movies_by_genre = movies.select("id", "name", "avg_rating", explode(movies.genres).alias("genre"))

In [None]:
writer(movies_by_genre, "movies_by_genre")

Now it's your turn.  This migration may be a little tricky.  What we want is a leaderboard where we can quickly view the top movies in a given genre.  

```
CREATE TABLE movie_leaderboard (
  genre text,
  avg_rating float,
  id uuid,
  name text,
  primary key (genre, avg_rating, id)
) with clustering order by (avg_rating desc);
```

Reference for `explode()`: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode

# SparkSQL

The programmatic interface above is pretty convenient, and in my opinion, fun.  There's another interface that's very convenient if you come from a SQL background: SparkSQL.  SparkSQL supports quite a bit of Hive's SQL dialect.

You can register a table to query with SQL like so:

In [None]:
users.registerTempTable("users")

Try registering your movies DataFrame as `movies`.

How's your SQL?  You can execute queries against the temp tables you've registered.  You can perform JOINs, aggregations, sorting, etc.  For instance:

In [None]:
sql.sql("SELECT * from movies where name LIKE 'Rumble in the Bronx%'")

Try your hand at a few queries.  Find the IDs of 3 movies you love.  For a more advanced challenge, get a list of all the movies made in the year you were born.  (Hint: LIKE)

In [None]:
# let's load our ratings up
ratings = reader("ratings_by_user")
ratings.registerTempTable("ratings")

# JOINS and Aggregations

Since we've put our movies and our ratings in tables, we can join them.  Pretty convenient.  We can do various JOINs.  By default, as in an RDBMS, an inner join is used, but we can also do LEFT, RIGHT and FULL joins.  We also have unions and subqueries.  We can perform aggregations on our results as well.  We can take the result of any query (a DataFrame) and use it as a table for future queries.  This is incredibly powerful.

Full docs: https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

In [None]:
# Use a separate name so the `ratings` DataFrame loaded earlier is not overwritten.
query = " ".join(["SELECT movies.name, ratings.rating from users",
                  "JOIN ratings on users.id = ratings.user_id",
                  "JOIN movies on ratings.movie_id = movies.id",
                  "WHERE users.name = 'Dani Traphagen'",
                  "ORDER BY rating DESC LIMIT 10"])
sql.sql(query).show()
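
If you want to experiment with the shape of this query without a cluster, the same JOIN, WHERE, ORDER BY and LIMIT clauses can be tried against an in-memory SQLite database from Python's standard library.  This is only a sketch: SQLite is not SparkSQL and the dialects differ.  The table and column names mirror the query above, but the integer ids and every row are invented for illustration.

```python
import sqlite3

# In-memory SQLite stand-in for the temp tables; NOT SparkSQL.
# All rows (and the integer ids) are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, name TEXT);
CREATE TABLE movies (id INTEGER, name TEXT);
CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating REAL);
INSERT INTO users VALUES (1, 'Example User'), (2, 'Other User');
INSERT INTO movies VALUES (10, 'Movie A'), (11, 'Movie B');
INSERT INTO ratings VALUES (1, 10, 3.0), (1, 11, 5.0), (2, 10, 4.0);
""")

# Same query shape as above: two inner joins, a filter, a sort, a limit.
query = " ".join(["SELECT movies.name, ratings.rating FROM users",
                  "JOIN ratings ON users.id = ratings.user_id",
                  "JOIN movies ON ratings.movie_id = movies.id",
                  "WHERE users.name = 'Example User'",
                  "ORDER BY rating DESC LIMIT 10"])
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

Only the ratings belonging to the named user survive the joins and the filter, and they come back highest rating first.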