# Standalone Spark Streaming Applications

So far, we've only built small snippets of the critical code.  Building a standalone Spark application requires a fair amount of overhead.  We'll do it in the [meetup](meetup) project, which uses Spark to stream live Meetup RSVP events from around the world.


# Spark Standalone Code: Meetup Example

Our source code is spread into three files:

- [MeetupReceiver.scala](/edit/meetup/src/main/scala/com/thedataincubator/MeetupStreaming/MeetupReceiver.scala): This is our custom `MeetupReceiver`.  It is much like the previous custom receiver (it extends the abstract `Receiver` class and implements the `receive` method) but with a few extra tricks:
    1. It opens up a socket to `stream.meetup.com:80` and requests new meetup events.
    1. We need to manually specify the headers in this request (Meetup is a little picky about this).
    1. We wrap the socket's input stream inside an `InputStreamReader` and then a `BufferedReader`.
    1. We write the result to the receiver.

- [MeetupDStream.scala](/edit/meetup/src/main/scala/com/thedataincubator/MeetupStreaming/MeetupDStream.scala):
    1. The `MeetupDStream` initializes a Spark DStream with our `MeetupReceiver` and parses its output as JSON.
    1. We are parsing the JSON with Liftweb's json package.
    1. Notice that we can define our JSON schema in a typesafe way using (nested) case classes where the attribute names are the key names in the JSON.  They are then easily parsed with the line
    ```
    parse(line).extract[RSVP]
    ```
    The schema is given by their [API documentation](https://www.meetup.com/meetup_api/docs/stream/2/rsvps/).
    1. JSON can be messier than the type system expects, missing fields being just one example.  Lift will throw an exception if it expects a field and cannot fill it.  There are two ways to handle this.  The first is with optional (`Option`) fields, which will simply be `None` if no value is provided.  The second is by wrapping the parse in a `try` / `catch` pair, which simply returns nothing rather than allowing the program to crash.

- [Main.scala](/edit/meetup/src/main/scala/com/thedataincubator/MeetupStreaming/Main.scala):
    1. Finally, our `def main` entry point is defined in `Main.scala`.
    1. We allow either a single argument (the output directory) or no arguments (printing to screen).
    1. When printing to screen, we disable logging to make the results more clear.
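
The two ways of handling messy JSON described for `MeetupDStream.scala` above can be sketched roughly as follows.  This is a minimal illustration, not the project's actual code: the case classes cover only a small subset of the RSVP schema chosen for the example, and `RsvpParser` and `parseRsvp` are hypothetical names.

```scala
import net.liftweb.json._
import scala.util.Try

// Illustrative subset of the RSVP schema (an assumption for this sketch).
// Attribute names must match the JSON key names exactly.
case class Venue(venue_name: String, lat: Double, lon: Double)
case class Member(member_id: Long, member_name: String)
case class RSVP(rsvp_id: Long,
                response: String,
                guests: Int,
                member: Member,
                venue: Option[Venue]) // optional: None when the key is absent

// Hypothetical helper object, not part of the original project.
object RsvpParser {
  // Lift needs an implicit Formats in scope for `extract`.
  implicit val formats: Formats = DefaultFormats

  // Returns None instead of throwing when the line is not valid JSON
  // or a required field is missing.
  def parseRsvp(line: String): Option[RSVP] =
    Try(parse(line).extract[RSVP]).toOption
}
```

Here `Try(...).toOption` plays the role of the `try` / `catch` pair.  A stream of lines could then be passed through something like `flatMap(line => RsvpParser.parseRsvp(line))` so that records that fail to parse are dropped rather than crashing the job.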


# Scala Build Tool (SBT)

The Scala Build Tool (commonly known as SBT) is a build tool whose build definitions are written in a Scala DSL (domain-specific language).  For our Spark project, the build is split across three files:
- The [build.sbt](/edit/meetup/build.sbt) tells SBT how to compile the program.  Dependencies are declared by adding them to the `libraryDependencies` setting.  Notice that we need a `net.liftweb` dependency for JSON parsing and `org.apache.spark` dependencies for Spark.
- The [assembly.sbt](/edit/meetup/assembly.sbt) gives assembly extra instructions on how to build a "fat jar", a jar that contains both the byte-code for the package you wrote and the byte-code for the packages you depend on.  This is what enables Spark to deploy easily across multiple computers.
- The [project/plugins.sbt](/edit/meetup/project/plugins.sbt) adds plugins to SBT.  This is where we tell SBT to use assembly by calling the function `addSbtPlugin`.
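
As a rough sketch of what these files might contain (not the project's actual contents): the name, version, and Scala series below are inferred from the jar path `target/scala-2.11/MeetupStreaming-assembly-1.0.jar`, while the exact library and plugin version numbers and the `provided` scope are assumptions.  The block shows two separate files, marked by comments.

```scala
// ---- build.sbt (sketch) ----
name := "MeetupStreaming"
version := "1.0"
scalaVersion := "2.11.12" // assumed patch version of the 2.11 series

libraryDependencies ++= Seq(
  // Assumed Spark version; "provided" keeps Spark itself out of the fat jar,
  // since spark-submit supplies it at runtime.
  "org.apache.spark" %% "spark-core"      % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.2.0" % "provided",
  // Assumed lift-json version, bundled into the fat jar.
  "net.liftweb"      %% "lift-json"       % "3.1.0"
)

// ---- project/plugins.sbt (sketch) ----
// Assumed plugin version.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
```

The `%%` operator appends the Scala binary version (here `_2.11`) to the artifact name, so the dependencies match the compiler version set in `scalaVersion`.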



The layout of our actual code is quite involved.
```
meetup
├── assembly.sbt
├── build.sbt
├── project
│   ├── plugins.sbt
│   └── project
└── src
    └── main
        ├── java
        ├── resources
        └── scala
            └── com
                └── thedataincubator
                    └── MeetupStreaming
                        ├── Main.scala
                        ├── MeetupReceiver.scala
                        └── MeetupStream.scala
```


As we can see, Scala code is always under the [deep directory](meetup/src/main/scala/com/thedataincubator/MeetupStreaming/)

```
src/main/scala/com/thedataincubator/MeetupStreaming/
```

This is a legacy Maven convention:
- Source code is in `src`.
- Production code is in `main` (we can also have a `test` directory).
- Scala code is in `scala`.
- Packages are put into a namespace derived from your domain name, reversed.  Since our website is `thedataincubator.com`, our packages go into `com/thedataincubator`.
- Finally, this package is called `MeetupStreaming`, and all our Scala files go in there.

It's a good idea to follow this convention, as it's implicitly assumed throughout the JVM world, e.g. by SBT.
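
To illustrate how the directory layout lines up with the fully qualified class name used later when submitting a job, here is a minimal sketch.  The file and the `Example` object are hypothetical and not part of the project.

```scala
// Hypothetical file:
//   src/main/scala/com/thedataincubator/MeetupStreaming/Example.scala
// The package declaration mirrors the directory path below src/main/scala.
package com.thedataincubator.MeetupStreaming

object Example {
  // The JVM class backing a Scala object carries a trailing "$", which we strip
  // to recover the name one would pass to `--class`.
  def fullyQualifiedName: String = getClass.getName.stripSuffix("$")

  def main(args: Array[String]): Unit =
    println(fullyQualifiedName)
}
```

In the same way, the project's `Main.scala` lives in this directory and is addressed as `com.thedataincubator.MeetupStreaming.Main`.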

# Compiling and Building the Example

To compile the program, simply run the bash command
```bash
sbt compile
```

For continuous compilation (compile on any source file save) run
```bash
sbt ~compile
```

To build and assemble the fat jar, run
```bash
sbt assembly
```
You'll notice that the jar was placed under `target/scala-2.11/` (another Maven / SBT convention).

Finally, to submit a job locally run
```bash
spark-submit --master local[2] \
        --class com.thedataincubator.MeetupStreaming.Main \
        target/scala-2.11/MeetupStreaming-assembly-1.0.jar
```
Let's break this down further:
- `--master local[2]` tells Spark to run the job locally using two worker threads.  A streaming job needs at least two, since the receiver occupies one thread and the rest are needed to process the data.
- `--class com.thedataincubator.MeetupStreaming.Main` tells Spark which class contains the `main` function.
- `target/scala-2.11/MeetupStreaming-assembly-1.0.jar` is the jar file (JVM binary) to run.

To help with these commands, we've provided a simple [Makefile](/edit/meetup/Makefile).

Running this, we should see a stream of case class outputs, demonstrating that we are successfully streaming and parsing data from Meetup.

# Twitter Example