# Spark Integration

![title](img/connect.png)

#### WHAT IS LIVY?

Livy is an open source REST interface for interacting with Apache Spark from anywhere. It supports executing snippets of code or programs in a Spark context that runs locally or in Apache Hadoop YARN. Livy provides the following features:

- Interactive Scala, Python, and R shells
- Batch submissions in Scala, Java, and Python
- Multiple users can share the same server (impersonation support)
- Jobs can be submitted from anywhere over REST
- No code changes are required in your programs
- Supports Spark 1 and Spark 2, and Scala 2.10 and 2.11, within one build

#### CORE FUNCTIONALITIES

Livy offers three modes to run Spark jobs:

- Using the programmatic API
- **Running interactive statements through the REST API**
- Submitting batch applications through the REST API
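
The two REST-based modes can be sketched as plain HTTP requests. The example below only builds the requests and does not send them. It assumes a Livy server on `localhost` at port 8998 (Livy's default), and the batch file path is a hypothetical placeholder. Treat it as an illustration of the request shapes rather than a complete client.

```python
import json
from urllib.request import Request

# Assumption: a Livy server on the local machine, on Livy's default port.
LIVY_URL = "http://localhost:8998"


def _post(path, payload):
    """Build (but do not send) a JSON POST request against the Livy server."""
    return Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(kind="pyspark"):
    # Interactive mode, step 1: ask Livy to start a session of the given kind.
    return _post("/sessions", {"kind": kind})


def run_statement_request(session_id, code):
    # Interactive mode, step 2: run a code snippet inside an existing session.
    return _post(f"/sessions/{session_id}/statements", {"code": code})


def submit_batch_request(file_path, args=None):
    # Batch mode: submit a whole application that the cluster can read.
    return _post("/batches", {"file": file_path, "args": args or []})


if __name__ == "__main__":
    requests_to_show = [
        create_session_request(),
        run_statement_request(0, "1 + 1"),
        # Hypothetical application path, used only for illustration.
        submit_batch_request("hdfs:///jobs/app.py"),
    ]
    for req in requests_to_show:
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Passing any of these objects to `urllib.request.urlopen` would send it to a running server. Interactive sessions keep state between statements, while a batch runs one application from start to finish.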

With the Spark API, the entry point (`SparkContext`) is created by the user who writes the code. With the Livy API, the **SparkContext is provided by the framework**, so the user does not need to create it.
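
To make the difference concrete, here is a small sketch that holds the two styles side by side as source strings. The word list and application name are made up for illustration. The first string is a standalone PySpark script that creates its own context. The second is a snippet as it might be sent to a Livy `pyspark` session, where `sc` is assumed to be already defined by the session.

```python
import json

# Plain Spark API: the script itself must create (and stop) the SparkContext.
STANDALONE_JOB = '''
from pyspark import SparkContext

sc = SparkContext(appName="word-lengths")
print(sc.parallelize(["livy", "spark"]).map(len).collect())
sc.stop()
'''

# Livy interactive session: `sc` already exists when the snippet runs,
# so the snippet contains only the actual work.
LIVY_SNIPPET = 'print(sc.parallelize(["livy", "spark"]).map(len).collect())'

# Body of the POST request that would carry the snippet to a Livy session.
statement_body = json.dumps({"code": LIVY_SNIPPET})

if __name__ == "__main__":
    print(statement_body)
```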

![title](img/low_level.png)

### Configure Spark Cluster


Let’s say we’re working in an IPython notebook and want to use Spark to analyze some data. We load sparkmagic so that the Python notebook can talk to Spark.

In [1]:
%load_ext sparkmagic.magics

##### Managing Spark Session

Sparkmagic allows us to specify the Livy endpoint along with a username and password to authenticate to it. If the Livy endpoint is on your local machine or has no password, simply leave the text fields for username and password blank.
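
As a rough illustration of what the endpoint, username, and password amount to on the wire, the sketch below builds HTTP headers for a Livy request and adds a Basic authentication header only when a username is given. It assumes Basic authentication and is not sparkmagic's own code. The function name `livy_headers` and the sample credentials are hypothetical.

```python
import base64


def livy_headers(username="", password=""):
    """Return HTTP headers for a Livy request.

    An Authorization header is added only when a username is set, which
    mirrors leaving the username and password fields blank.
    """
    headers = {"Content-Type": "application/json"}
    if username:
        raw = f"{username}:{password}".encode("utf-8")
        token = base64.b64encode(raw).decode("ascii")
        headers["Authorization"] = f"Basic {token}"
    return headers


if __name__ == "__main__":
    # Local endpoint without a password: no Authorization header.
    print(livy_headers())
    # Made-up credentials for a protected endpoint.
    print(sorted(livy_headers("alice", "secret")))
```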

In [2]:
%manage_spark


### Interactive Session

Data scientists can use Spark from a Jupyter notebook running on their own machine. Spark jobs execute on the cluster, but from the notebook they behave as if they were running locally.

In [None]:
%%sql

SHOW TABLES

In [None]:
%%sql

SELECT * FROM metars ORDER BY _c1 DESC

In [None]:
%%spark

df = spark.read.csv('hdfs:///data/sample_data.csv')


print(df.count())
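
Behind a cell like this, the code is sent to Livy as a statement and the client waits for its result. The sketch below shows one way such a wait loop could look, assuming the statement is fetched as a dictionary with a `state` field that moves from `waiting` or `running` to a final state such as `available`. The helper `wait_for_statement` is hypothetical, and the canned responses stand in for a real server.

```python
import time


def wait_for_statement(fetch, interval=1.0, max_polls=60):
    """Poll `fetch()` until the statement leaves the waiting/running states.

    `fetch` is any zero-argument callable returning the statement as a dict,
    for example the parsed JSON of
    GET /sessions/{session_id}/statements/{statement_id}.
    """
    for _ in range(max_polls):
        statement = fetch()
        if statement["state"] not in ("waiting", "running"):
            return statement
        time.sleep(interval)
    raise TimeoutError("statement did not finish in time")


if __name__ == "__main__":
    # Canned responses standing in for a real Livy server (made-up values).
    replies = iter([
        {"state": "running", "output": None},
        {"state": "available",
         "output": {"status": "ok", "data": {"text/plain": "1000"}}},
    ])
    result = wait_for_statement(lambda: next(replies), interval=0)
    print(result["output"]["data"]["text/plain"])
```

Taking `fetch` as a parameter keeps the loop independent of any HTTP library, so it can be exercised without a cluster.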