# Spark: Introduction to the Framework

[Apache Spark™](https://spark.apache.org/) is a unified analytics engine for large-scale data processing. This JupyterHub installation provides a Spark kernel with the [PySpark API](https://spark.apache.org/docs/latest/api/python/index.html).

__NOTE:__ Start your server with the `Spark environment` to take advantage of `Spark`.

![Jupyter dashboard showing files tab](images/jupyterlab_spark_env.jpg)

## Import libraries and get access to Spark UI

In [None]:
import os
import json
import sklearn
import socket
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [None]:
print('user:', os.environ['JUPYTERHUB_SERVICE_PREFIX'])

def uiWebUrl(self):
    from urllib.parse import urlparse
    web_url = self._jsc.sc().uiWebUrl().get()
    port = urlparse(web_url).port
    return '{}proxy/{}/jobs/'.format(os.environ['JUPYTERHUB_SERVICE_PREFIX'], port)

SparkContext.uiWebUrl = property(uiWebUrl)

conf = SparkConf().set('spark.master', 'local[*]').set('spark.driver.memory', '4g')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
spark

In [None]:
sc
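The `uiWebUrl` override above rewrites the driver's Spark UI address into a path served through the JupyterHub proxy. A minimal standalone sketch of that rewriting is shown here, assuming the hub exposes local ports under `proxy/<port>/` as the code above implies. The function name `proxied_spark_ui_url` and the example address and prefix are hypothetical.

```python
from urllib.parse import urlparse


def proxied_spark_ui_url(web_url, service_prefix):
    """Map a driver-local Spark UI address to a JupyterHub proxy path."""
    # Only the port of the original address is kept
    port = urlparse(web_url).port
    return '{}proxy/{}/jobs/'.format(service_prefix, port)


# Hypothetical values, for illustration only
example_url = proxied_spark_ui_url('http://10.0.0.5:4040', '/user/alice/')
print(example_url)
```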

## Data load from local disk

Let's read a locally stored CSV file into a `Spark dataframe`:

In [None]:
sdf = spark.read.csv(
    './data/telecom_churn.csv', 
    sep=',', 
    header=True
)
sdf.printSchema()

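Spark evaluates lazily. It does not keep the dataframe in memory by default: transformations such as `select`, `filter`, or `limit` only build an execution plan, and the data is processed when an action such as `count`, `show`, or `toPandas` is called. Below we select the first 5 rows of the `Spark dataframe` and convert them to a `Pandas dataframe`: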

In [None]:
sdf.limit(5).toPandas().head()

In [None]:
print('total rows in spark dataframe:', sdf.count())

## Data processing examples

In [None]:
from pyspark.sql.functions import udf, col, desc, rank, row_number

In [None]:
# Columns were read as strings, so cast to double for a numeric sort
sdf.limit(5).orderBy(col('Total day minutes').cast('double')).toPandas()

In [None]:
sdf.select('Churn').distinct().show()

In [None]:
sdf.groupby('Voice mail plan').count().show()

In [None]:
sdf.groupby('State').count().sort(col('count').desc()).show()

In [None]:
sdf.select(
    'State',
    'Voice mail plan',
    'Number vmail messages'
).filter(
    col('State') == 'WY'
).limit(
    10
).toPandas()

## Apply an arbitrary function to a Spark dataframe

Apply a function to one column:

In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

In [None]:
def one_hot(text):
    """Encode the string values of 'Churn' as integers (2 = anything else)."""
    if text == 'False':
        return 0
    elif text == 'True':
        return 1
    else:
        return 2

In [None]:
udf_one_hot = udf(one_hot, IntegerType())
sdf = sdf.withColumn('Churn_OH', udf_one_hot('Churn'))

In [None]:
sdf.select('Churn_OH').distinct().show()

Use two columns as input to the function:

In [None]:
def sum_of_cols(x, y):
    # Columns were read as strings, so convert them before adding
    return int(x) + int(y)

sum_cols = udf(sum_of_cols, IntegerType())
sdf = sdf.withColumn('Day and Night', sum_cols('Total day calls', 'Total night calls'))

In [None]:
sdf.select('Day and Night').limit(5).show()