# Writing Standalone Spark Streaming Applications
<!-- So far, we've only written small snippets of Spark code.  In this section, we'll delve into building standalone apps. -->

So far, we've only built little snippets of the critical code.  Building standalone Spark binaries will require a reasonable amount of overhead.  We'll demonstrate this using two examples:
1. A [Meetup](meetup) project, which live-streams meetup RSVP events globally.
1. A [Twitter](twitter) project, which live-streams tweets as they happen.



# Two Environments for Running Spark
<!-- We explain the two environments in which we can run Spark -- shell and standalone JAR --, the relative strengths of both, and the API differences between the two contexts. -->

There are two contexts for running Spark:

1. We will primarily be demonstrating the "REPL" method.  You can access the REPL by running any of the code cells in these notebooks or by running `spark-shell` from the bash command line.  Notebooks are great for didactic, exploratory, and presentation purposes.  The shell is also great for exploration.

2. You can also write Spark jobs as a program to be packaged up and run as a standalone jar application.  This requires using build tools (e.g. SBT) and a reasonable amount of overhead.  We'll explore how to do this in the final notebook.  This is great for production code.

For (2), you'll need to create your own `SparkContext` and `SparkSession`.  In (1), these are provided as global variables named `sc` and `spark`.  You'll access Spark functionality through these two objects.
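
As a rough sketch of what creating these two objects might look like in a standalone application: the object name `StandaloneContexts`, the `create` helper, and its optional `master` parameter are hypothetical and not taken from the project's source.  The master is left unset by default on the assumption that `spark-submit --master ...` supplies it.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object StandaloneContexts {
  // Hypothetical helper: builds (or reuses) the session for this application.
  // `master` stays unset by default so that `spark-submit --master ...` decides it.
  def create(appName: String, master: Option[String] = None): SparkSession = {
    val builder = SparkSession.builder().appName(appName)
    master.foreach(m => builder.master(m))
    builder.getOrCreate()
  }

  def main(args: Array[String]): Unit = {
    // These play the role of the `spark` and `sc` globals provided by the shell.
    val spark: SparkSession = create("StandaloneContexts")
    val sc: SparkContext = spark.sparkContext

    // A trivial job, just to show that the context is usable.
    println(sc.parallelize(1 to 10).count())

    spark.stop()
  }
}
```

The `SparkContext` is obtained from the session rather than constructed separately, so the two objects always share one configuration.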

# Spark Streaming Standalone Code: Meetup Events Example
<!-- In this section, we will do a deep dive into building a receiver, using that to parse live json for Meetup registrations, and package that up in a main app. -->

Our source code is spread into three files:

- [MeetupReceiver.scala](/edit/meetup/src/main/scala/com/thedataincubator/MeetupStreaming/MeetupReceiver.scala): This is our custom `MeetupReceiver`.  It is much like the previous custom receiver (it extends the abstract `Receiver` class and implements the `receive` method) but with a few extra tricks:
    1. It opens up a socket to `stream.meetup.com:80` and requests new Meetup events.
    1. We need to manually specify the headers in this request (Meetup is a little picky about this).
    1. We wrap the socket's input stream in an `InputStreamReader` and then a `BufferedReader`.
    1. We store each line we read in the receiver.

- [MeetupDStream.scala](/edit/meetup/src/main/scala/com/thedataincubator/MeetupStreaming/MeetupDStream.scala):
    1. The `MeetupDStream` initializes a Spark DStream with our `MeetupReceiver` and parses each line as JSON.
    1. We parse the JSON with Lift's json package.
    1. Notice that we can define our JSON schema in a typesafe way using (nested) case classes whose attribute names are the JSON key names.  Lines are then easily parsed with
    ```scala
    parse(line).extract[RSVP]
    ```
    The schema is given by their [API documentation](https://www.meetup.com/meetup_api/docs/stream/2/rsvps/).
    1. JSON can be messier than the type system expects, missing fields being just one example.  Lift throws an exception if it expects a field and cannot fill it.  There are two ways to handle this.  The first is with optional (`Option`) fields, which will simply be `None` if no value is provided.  The second is to wrap the parse in a `try`/`catch`, which returns nothing for a bad line rather than allowing the program to crash.

- [Main.scala](/edit/meetup/src/main/scala/com/thedataincubator/MeetupStreaming/Main.scala):
    1. Finally, our `def main` entry point is defined in `Main.scala`.
    1. We allow either a single argument (the output directory) or no arguments (printing to screen).
    1. When printing to screen, we disable logging to make the results clearer.
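
The receiver and parsing steps described in this list can be sketched roughly as follows.  This is an illustration under assumptions, not the project's actual source: the class name `LineSocketReceiver`, the `RsvpParser` object, the request headers, and the trimmed-down `RSVP` and `Venue` schemas are all hypothetical.  It combines both error-handling approaches mentioned above: `Option` fields for keys that may be missing, and a `Try` (the functional counterpart of `try`/`catch`) around the parse.

```scala
import java.io.{BufferedReader, IOException, InputStreamReader, PrintWriter}
import java.net.Socket
import java.nio.charset.StandardCharsets

import scala.util.Try

import net.liftweb.json._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical, trimmed-down schema: attribute names must match the JSON keys.
// Optional fields become None when the key is absent.
case class Venue(venue_name: Option[String], lat: Option[Double], lon: Option[Double])
case class RSVP(rsvp_id: Long, response: String, guests: Int, venue: Option[Venue])

object RsvpParser {
  implicit val formats: Formats = DefaultFormats

  // Returns None instead of throwing when a line is not valid JSON
  // or lacks a required field.
  def parseRsvp(line: String): Option[RSVP] =
    Try(parse(line).extract[RSVP]).toOption
}

// Hypothetical receiver: host, port, path and headers are illustrative only.
class LineSocketReceiver(host: String, port: Int, path: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // The HTTP request is written by hand so that the headers are explicit.
  def requestText: String =
    s"GET $path HTTP/1.1\r\nHost: $host\r\nAccept: */*\r\n\r\n"

  override def onStart(): Unit = {
    // Receive on a separate thread so that onStart returns immediately.
    new Thread("Line Socket Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  // Nothing to do here: the receive loop checks isStopped() and exits by itself.
  override def onStop(): Unit = ()

  private def receive(): Unit = {
    var socket: Socket = null
    try {
      socket = new Socket(host, port)
      val out = new PrintWriter(socket.getOutputStream)
      out.print(requestText)
      out.flush()

      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line) // hand the line to Spark
        line = reader.readLine()
      }
      restart("Connection closed; trying to reconnect")
    } catch {
      case e: IOException => restart("Error receiving data", e)
    } finally {
      if (socket != null) socket.close()
    }
  }
}
```

Note that this sketch also stores the HTTP status line and response headers, since they arrive as ordinary lines.  `parseRsvp` returns `None` for them, so they can be dropped downstream along with any malformed JSON.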


# Scala Build Tool (SBT) and Spark
<!-- In this section, we explain how to build code using SBT, how to package it up using the Assembly plugin and `plugins.sbt`, and the conventions for how to lay out project files. -->

The Scala Build Tool (commonly known as SBT) is configured through a Scala DSL (domain-specific language) for specifying builds.  For Spark purposes, the build configuration is split across three files:
- The [build.sbt](/edit/meetup/build.sbt) tells SBT how to compile the program, chiefly by adding the dependencies to the `libraryDependencies` setting.  Notice that we need a `net.liftweb` dependency for JSON parsing and `org.apache.spark` dependencies for Spark.
- The [assembly.sbt](/edit/meetup/assembly.sbt) gives the assembly plugin extra instructions on how to build a "fat jar", a jar that contains both the byte-code for the package you wrote and the byte-code for the packages it depends on.  This is what enables Spark to deploy easily across multiple computers.
- The [project/plugins.sbt](/edit/meetup/project/plugins.sbt) adds plugins to SBT.  This is where we tell SBT to use assembly by calling the function `addSbtPlugin`.
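
For orientation, the three files might look something like the sketch below.  This is not a copy of the project's files.  The `name` and `version` are inferred from the jar name `MeetupStreaming-assembly-1.0.jar` and the Scala version from the `target/scala-2.11/` directory; every other version number, the `"provided"` scope, and the merge strategy are assumptions that you should adapt to your own setup.

```scala
// --- build.sbt ---
name := "MeetupStreaming"
version := "1.0"
scalaVersion := "2.11.12"        // assumed patch version; output lands in target/scala-2.11

val sparkVersion = "2.4.8"       // assumed Spark version

libraryDependencies ++= Seq(
  // "provided" (an assumption here): spark-submit supplies Spark at run time,
  // so these jars are left out of the fat jar.
  "org.apache.spark" %% "spark-core"      % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"       % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  "net.liftweb"      %% "lift-json"       % "3.3.0"      // assumed version
)

// --- assembly.sbt ---
// Decide what to do when several dependency jars contain the same file path.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}

// --- project/plugins.sbt ---
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")  // assumed version
```

The `%%` operator appends the Scala binary version (here `_2.11`) to the artifact name, which is why the Scala version and the library versions have to be chosen together.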



The layout of our actual code is quite involved.
```
meetup
├── assembly.sbt
├── build.sbt
├── project
│   ├── plugins.sbt
│   └── project
└── src
    └── main
        ├── java
        ├── resources
        └── scala
            └── com
                └── thedataincubator
                    └── MeetupStreaming
                        ├── Main.scala
                        ├── MeetupReceiver.scala
                        └── MeetupStream.scala
```


As we can see, the Scala code lives under the [deep directory](meetup/src/main/scala/com/thedataincubator/MeetupStreaming/)

```
src/main/scala/com/thedataincubator/MeetupStreaming/
```

This is a legacy Maven convention:
- Source code is in `src`.
- Production code is in `main` (we can also have a `test` directory).
- Scala code is in `scala`.
- Packages are put into a namespace derived from your domain name, reversed.  Since our website is `thedataincubator.com`, our packages go into `com/thedataincubator`.
- Finally, this package is called `MeetupStreaming` and all our Scala files go in there.

It's a good idea to follow this convention, as it's implicitly assumed throughout the JVM world, e.g. by SBT.
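
To make the mapping concrete, here is a minimal sketch of how the directory path lines up with the `package` declaration in a source file.  The body of `Main` is a hypothetical stand-in rather than the project's actual code: the `outputDir` helper and the choice to reject more than one argument are assumptions for illustration.

```scala
// File: src/main/scala/com/thedataincubator/MeetupStreaming/Main.scala
// The package name mirrors the directory path below src/main/scala.
package com.thedataincubator.MeetupStreaming

object Main {
  // Hypothetical helper: no arguments means "print to screen",
  // one argument is taken to be the output directory.
  def outputDir(args: Array[String]): Option[String] = args match {
    case Array()    => None
    case Array(dir) => Some(dir)
    case _          => throw new IllegalArgumentException("usage: Main [outputDir]")
  }

  def main(args: Array[String]): Unit =
    outputDir(args) match {
      case Some(dir) => println(s"writing to $dir")
      case None      => println("printing to screen")
    }
}
```

The fully qualified name of this object, `com.thedataincubator.MeetupStreaming.Main`, is simply the package name followed by the object name.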

# Compiling and Building a Standalone Spark Application
<!-- In this segment, we will use SBT tools to compile, assemble, and run a Scala application. -->

To compile the program, simply run the bash command
```bash
sbt compile
```

For continuous compilation (compile on any source file save) run
```bash
sbt ~compile
```

To build and assemble the fat jar, run
```bash
sbt assembly
```
You'll notice that the jar was placed under `target/scala-2.11/` (another Maven / SBT convention).

Finally, to submit a job locally run
```bash
spark-submit --master local[2] \
        --class com.thedataincubator.MeetupStreaming.Main \
        target/scala-2.11/MeetupStreaming-assembly-1.0.jar
```
Let's break this down further:
- `--master local[2]` tells Spark to run the job locally, on this machine, with two worker threads.  A receiver-based streaming job needs at least two: one to run the receiver and one to process the data.
    - To run on [Mesos](https://spark.apache.org/docs/latest/running-on-mesos.html) provide `--master mesos://host:port`.
    - To run on [YARN](http://spark.apache.org/docs/latest/running-on-yarn.html) provide `--master yarn`.
- `--class com.thedataincubator.MeetupStreaming.Main` tells Spark which class contains the `main` function.
- `target/scala-2.11/MeetupStreaming-assembly-1.0.jar` is the jar file (JVM binary).

To help with these commands, we've provided a simple [Makefile](/edit/meetup/Makefile).

Running this, we should see a stream of case class outputs, demonstrating that we are successfully streaming and parsing data from Meetup.

# Spark Twitter Streaming Example
<!-- In this example, we'll show how to create a Twitter streaming application using a custom receiver and Twitter4j.  We also give best practices for maintaining Twitter account secrets. -->

We have another example that uses [twitter4j](http://twitter4j.org/en/) to live-stream tweets.  A few important differences:
- [build.sbt](/edit/twitter/build.sbt) has an extra `"org.twitter4j"` dependency which is used for setting up a parsed stream to Twitter.
- [TwitterReceiver.scala](/edit/twitter/src/main/scala/com/thedataincubator/TwitterStreaming/TwitterReceiver.scala) calls `TwitterStreamFactory` to listen to the open Twitter stream.
- You'll need a `twitter4j.properties` file which contains your Twitter credentials.  By Maven convention, the file needs to be in `src/main/resources` and that's where `twitter4j` expects to read it.  Two important notes:
    - You can get those credentials by signing into Twitter and [creating a developer app](https://apps.twitter.com/).
    - The file `twitter4j.properties` is explicitly in [.gitignore](/edit/twitter/.gitignore) because it's poor practice to commit secrets to source control.  However, a stub of the credentials is provided in [twitter4j.properties.sample](/edit/twitter/src/main/resources/twitter4j.properties.sample), which you can use to construct `twitter4j.properties`.
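
As a rough illustration of how a receiver might use `TwitterStreamFactory`, consider the sketch below.  It assumes the twitter4j 4.0.x API, and the class name `TweetTextReceiver` and the choice to store only the tweet text are hypothetical; the project's `TwitterReceiver` may well differ.  The default factory picks up its credentials from a `twitter4j.properties` file on the classpath, which is why the file location described above matters.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import twitter4j.{Status, StatusAdapter, TwitterStream, TwitterStreamFactory}

// Hypothetical receiver that stores the text of each sampled tweet.
class TweetTextReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // Created in onStart (on the worker), so it is still null when the
  // receiver object itself is serialized and shipped.
  @volatile private var stream: TwitterStream = _

  override def onStart(): Unit = {
    // Credentials come from twitter4j.properties on the classpath.
    val s = new TwitterStreamFactory().getInstance()
    s.addListener(new StatusAdapter {
      // Called by twitter4j on its own thread for every incoming tweet.
      override def onStatus(status: Status): Unit = store(status.getText)

      // Ask Spark to restart the receiver if the stream fails.
      override def onException(e: Exception): Unit =
        restart("Twitter stream error", e)
    })
    s.sample() // start listening to the public sample stream
    stream = s
  }

  override def onStop(): Unit = {
    val s = stream
    if (s != null) {
      s.shutdown()
      stream = null
    }
  }
}
```

Unlike the socket-based Meetup receiver, no explicit thread is started here: twitter4j delivers tweets to the listener on its own threads, so `onStart` can return right after calling `sample()`.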

You can use the same command line tools to build and submit your Spark jobs.
