![DataStax Academy](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/datastax-academy.svg "DataStax Academy")

# Exercise 07.03 - Spark-Cassandra Connector Optimizations: Joining Tables

## Background

In this exercise, you will optimize working Spark code by using the `joinWithCassandraTable` method.

You'll be working with the `videos_by_actor` table and the `actor` table.

***

## Directions

Begin by creating a local list of two actors and parallelizing it, which turns it into an RDD.

In [14]:
case class ActorYear(actor_name: String, release_year: Int)

val actors2014 = sc.parallelize(List(ActorYear("Johnny Depp",2014), 
                                    ActorYear("Bruce Willis",2014)))

Next, join this RDD to the `videos_by_actor` table using the `joinWithCassandraTable` method. By default, this method joins on the partition key only, so the `release_year` field of `ActorYear` is ignored and the results include videos from every year.

In [15]:
actors2014.joinWithCassandraTable("killr_video","videos_by_actor").takeSample(false,10).foreach(println)

(ActorYear(Bruce Willis,2014),CassandraRow{actor_name: Bruce Willis, character_name: Russell Duritz, video_id: ece7d611-a5e2-11e5-8504-a45e60eb67c5, release_year: 2000, title: The Kid})
(ActorYear(Bruce Willis,2014),CassandraRow{actor_name: Bruce Willis, character_name: Harrison Hill, video_id: ece864e3-a5e2-11e5-9f55-a45e60eb67c5, release_year: 2007, title: Perfect Stranger})
(ActorYear(Bruce Willis,2014),CassandraRow{actor_name: Bruce Willis, character_name: Frank Moses, video_id: ecfe78f3-a5e2-11e5-89af-a45e60eb67c5, release_year: 2013, title: RED 2})
(ActorYear(Johnny Depp,2014),CassandraRow{actor_name: Johnny Depp, character_name: Axel Blackmar, video_id: eceaddee-a5e2-11e5-ab2f-a45e60eb67c5, release_year: 1992, title: Arizona Dream})
(ActorYear(Johnny Depp,2014),CassandraRow{actor_name: Johnny Depp, character_name: Rochester, video_id: ece87a0f-a5e2-11e5-b1be-a45e60eb67c5, release_year: 2004, title: The Libertine})
(ActorYear(Johnny Depp,2014),CassandraRow{actor_name: Johnny Depp
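
To see why the output above contains rows from years other than 2014, it can help to model the default join in plain Scala, with no Spark or Cassandra involved. The sketch below is an illustration under assumptions, not the connector's implementation: `Video`, `videosByActor`, and `joinOnPartitionKey` are hypothetical names, and the in-memory `Map` stands in for Cassandra partitions keyed by `actor_name`. The sample rows reuse titles from the output in this exercise.

```scala
// Plain-Scala model of a partition-key join; no Spark or Cassandra needed.
// Video and the in-memory Map are illustrative stand-ins, not the real schema.
case class ActorYear(actor_name: String, release_year: Int)
case class Video(actor_name: String, release_year: Int, title: String)

object PartitionKeyJoinModel {
  // Rows grouped by partition key, standing in for Cassandra partitions.
  val videosByActor: Map[String, Seq[Video]] = Seq(
    Video("Bruce Willis", 2000, "The Kid"),
    Video("Bruce Willis", 2014, "The Prince"),
    Video("Johnny Depp", 2004, "The Libertine"),
    Video("Johnny Depp", 2014, "Tusk"),
    Video("Ian McKellen", 2001, "The Lord of the Rings: The Fellowship of the Ring")
  ).groupBy(_.actor_name)

  // One keyed lookup per left-hand element; only actor_name is used as the key,
  // so release_year on the left side plays no part in the match.
  def joinOnPartitionKey(left: Seq[ActorYear]): Seq[(ActorYear, Video)] =
    for {
      a <- left
      v <- videosByActor.getOrElse(a.actor_name, Seq.empty)
    } yield (a, v)

  def main(args: Array[String]): Unit = {
    val actors2014 = Seq(ActorYear("Johnny Depp", 2014), ActorYear("Bruce Willis", 2014))
    joinOnPartitionKey(actors2014).foreach(println)
  }
}
```

In this model the Ian McKellen partition is never touched, which mirrors the benefit of the join over scanning the whole table and filtering afterwards.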

Now let's change the join condition. The `on` method lets us name the join columns explicitly, provided that any column we add beyond the partition key is a clustering column.

In [18]:
actors2014.joinWithCassandraTable("killr_video","videos_by_actor").on(SomeColumns("actor_name","release_year")).takeSample(false,10).foreach(println)

(ActorYear(Bruce Willis,2014),CassandraRow{actor_name: Bruce Willis, release_year: 2014, character_name: Omar, title: The Prince, video_id: ed01818c-a5e2-11e5-8efd-a45e60eb67c5})
(ActorYear(Johnny Depp,2014),CassandraRow{actor_name: Johnny Depp, release_year: 2014, character_name: Guy Lapointe, title: Tusk, video_id: ed01abe6-a5e2-11e5-89d1-a45e60eb67c5})
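
For contrast, the same rows could be obtained by running the default join and filtering in Spark afterwards. The sketch below assumes the `sc` and `actors2014` values from the earlier cells and a reachable `killr_video` keyspace; `joined` and `filteredInSpark` are names introduced here for illustration. This version fetches every row of both actors' partitions before discarding most of them, whereas `on` includes `release_year` in the join condition itself.

```scala
import com.datastax.spark.connector._

// Assumes `sc` and `actors2014` are already defined, as in the cells above.
val joined = actors2014.joinWithCassandraTable("killr_video", "videos_by_actor")

// Client-side alternative to `on`: all rows for each actor are read,
// then rows from other years are dropped inside Spark.
val filteredInSpark = joined.filter { case (actor, row) =>
  row.getInt("release_year") == actor.release_year
}

filteredInSpark.collect().foreach(println)
```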


Join two Cassandra tables using the `joinWithCassandraTable` method. You will want to start with the table that has fewer rows. In this case there are more videos than there are actors, so we will start with the `actor` table.

In [2]:
sc.cassandraTable("killr_video", "actor").joinWithCassandraTable("killr_video","videos_by_actor").takeSample(false, 10).foreach(println)

(CassandraRow{actor_name: Jaime King},CassandraRow{actor_name: Jaime King, release_year: 2005, character_name: Kathy Joyce, title: Pretty Persuasion, video_id: ececc4cc-a5e2-11e5-8f15-a45e60eb67c5})
(CassandraRow{actor_name: Jim Doughan},CassandraRow{actor_name: Jim Doughan, release_year: 2003, character_name: Mr. Coleman, title: The Haunted Mansion, video_id: ecea9a1e-a5e2-11e5-b56c-a45e60eb67c5})
(CassandraRow{actor_name: Chris Ellis},CassandraRow{actor_name: Chris Ellis, release_year: 1997, character_name: Det. Butler, title: Bean, video_id: ece6cad7-a5e2-11e5-a01b-a45e60eb67c5})
(CassandraRow{actor_name: Harry Morgan},CassandraRow{actor_name: Harry Morgan, release_year: 1945, character_name: Barker (as Henry Morgan), title: State Fair, video_id: ecf0f007-a5e2-11e5-aec7-a45e60eb67c5})
(CassandraRow{actor_name: Ian McKellen},CassandraRow{actor_name: Ian McKellen, release_year: 2001, character_name: Gandalf the Grey, title: The Lord of the Rings: The Fellowship of the Ring, video_id: 
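
The advice to start from the smaller table can be illustrated with a small plain-Scala model. It assumes that each left-hand row costs one keyed lookup on the right-hand table, which is a simplification rather than a description of the connector's internals. `JoinDirectionModel`, `joinCounting`, and the sample data are hypothetical and exist only for this sketch; the titles are taken from the output earlier in the exercise.

```scala
// Plain-Scala model of join direction; no Spark or Cassandra needed.
// Assumption: each left-hand element costs one keyed lookup on the right side.
case class Video(actor_name: String, title: String)

object JoinDirectionModel {
  val actors: Seq[String] = Seq("Bruce Willis", "Johnny Depp")

  val videos: Seq[Video] = Seq(
    Video("Bruce Willis", "The Kid"),
    Video("Bruce Willis", "Perfect Stranger"),
    Video("Bruce Willis", "RED 2"),
    Video("Bruce Willis", "The Prince"),
    Video("Johnny Depp", "Arizona Dream"),
    Video("Johnny Depp", "The Libertine"),
    Video("Johnny Depp", "Tusk")
  )

  // Keyed views standing in for the two Cassandra tables.
  val videosByActor: Map[String, Seq[Video]] = videos.groupBy(_.actor_name)
  val actorByName: Map[String, String] = actors.map(a => a -> a).toMap

  // Joins `left` against a keyed lookup and reports how many lookups were issued.
  def joinCounting[L, R](left: Seq[L])(lookup: L => Seq[R]): (Seq[(L, R)], Int) = {
    val pairs = left.flatMap(l => lookup(l).map(r => (l, r)))
    (pairs, left.size) // one lookup per left-hand element
  }

  // Starting from the small table issues one lookup per actor.
  def fromActors: (Seq[(String, Video)], Int) =
    joinCounting[String, Video](actors)(a => videosByActor.getOrElse(a, Seq.empty))

  // Starting from the large table issues one lookup per video row.
  def fromVideos: (Seq[(Video, String)], Int) =
    joinCounting[Video, String](videos)(v => actorByName.get(v.actor_name).toSeq)

  def main(args: Array[String]): Unit = {
    println(s"actors first: ${fromActors._2} lookups, ${fromActors._1.size} pairs")
    println(s"videos first: ${fromVideos._2} lookups, ${fromVideos._1.size} pairs")
  }
}
```

Both directions produce the same number of joined pairs in this model, but starting from `actors` needs far fewer lookups than starting from `videos`.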