## Testing the connection between Jupyter and the Spark cluster

This notebook tests whether the Jupyter service can connect to the Spark cluster.
It can also serve as a starting snippet for other Jupyter-Spark development.

### Importing libraries

In [None]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install pandas

In [1]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
import time
import pandas as pd

### Connecting to the Spark cluster and starting a Spark session

Once the session has started, the app name should appear in the Spark web UI at http://localhost:8080/.

In [None]:
master = "spark://spark-master:7077"
app_name = "Testing if jupyter can communicate with spark"
spark = (
    SparkSession.builder
    .appName(app_name)
    .master(master)
    .config("spark.driver.memory", "512m")
    .config("spark.driver.cores", "1")
    .config("spark.executor.memory", "512m")
    .config("spark.executor.cores", "1")
    .config("spark.sql.shuffle.partitions", "2")
    .getOrCreate()
)
sc = spark.sparkContext
sc.setLogLevel("WARN")

print("Spark version: " + str(sc.version))
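If the session hangs while starting, or the app name never appears in the web UI, a quick first check is whether the master's host and port are reachable from the Jupyter container. The sketch below is a minimal reachability test using only the standard library. The helper names `parse_master_url` and `master_is_reachable` are hypothetical and are not part of PySpark.

```python
import socket
from urllib.parse import urlparse


def parse_master_url(master_url):
    """Split a spark://host:port URL into (host, port)."""
    parsed = urlparse(master_url)
    if parsed.scheme != "spark" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"Expected spark://host:port, got {master_url!r}")
    return parsed.hostname, parsed.port


def master_is_reachable(master_url, timeout=3.0):
    """Return True if a TCP connection to the master's host:port succeeds."""
    host, port = parse_master_url(master_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

In this notebook it could be called as `master_is_reachable(master)`. A successful TCP connection only shows that the port is open; it does not prove that the cluster accepted the application.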

### Reading a TSV file

We are going to read a TSV file with both pandas and Spark and compare the read times.

In [6]:
datasets_path = "/opt/workspace/facial_database/datasets/imdb_datasets/"
tsv_file = "name.basics.tsv.gz"
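The cells below repeat the same start/stop timing pattern for each library. As an optional alternative, the pattern can be wrapped in a small context manager. This is a sketch only: the name `timed` and the optional `results` dictionary are assumptions introduced here, and it uses `time.perf_counter()`, which is better suited to measuring intervals than `time.time()`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the wall-clock time of the enclosed block.

    If a dict is passed as `results`, the elapsed seconds are also
    stored under `label` so the runs can be compared afterwards.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f} s")
```

Wrapping each read and each count in `with timed("pandas read", results):` (and likewise for Spark) would collect all four measurements in one dictionary.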

#### Pandas

In [None]:
print("Reading TSV with pandas...")
starttime = time.time()
# header=0 takes column names from the first line (header=1 would use the first data row instead)
df = pd.read_csv(
    datasets_path + tsv_file,
    header=0,
    sep="\t",
    engine="python",
    quotechar='"',
    on_bad_lines="skip",
)
endtime = time.time()
exec_time = str(endtime - starttime)

print(f"File {tsv_file} successfully read and loaded as a pandas DataFrame in {exec_time} seconds.")
print(f"Counting the number of records in {tsv_file}")

starttime = time.time()
# len(df) is the row count; df.count() would return per-column non-null counts
total_records = len(df)
endtime = time.time()
exec_time = str(endtime - starttime)

print(f"Counting finished in {exec_time} seconds. Total number of records is {total_records}.")
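Two pandas details affect how comparable this count is to the Spark one: `header=1` treats the second line of the file as the header and drops the line before it, and `DataFrame.count()` returns non-null counts per column, not a single row total. The sketch below illustrates both on a tiny in-memory TSV. The sample rows are made up for illustration and are not taken from the IMDb file.

```python
import io

import pandas as pd

# Made-up sample: a header line plus three rows, one with an empty birthYear.
sample_tsv = (
    "nconst\tprimaryName\tbirthYear\n"
    "nm1\tAlice\t1990\n"
    "nm2\tBob\t\n"
    "nm3\tCarol\t1985\n"
)

# header=0: the first line supplies the column names.
df_ok = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=0)

# header=1: the first data row becomes the header, so one record is lost.
df_shifted = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=1)

row_count = len(df_ok)            # number of rows in the frame
per_column = df_ok.count()        # Series of non-null counts per column

print(row_count)
print(per_column)
print(len(df_shifted), list(df_shifted.columns))
```

Spark's `DataFrame.count()` returns the number of rows, so `len(df)` is the pandas value to compare it against.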

#### Spark

In [None]:
print("Reading TSV with Spark...")
starttime = time.time()
# Spark reads lazily: this call mostly resolves the header; the full file is scanned by count() below
df = spark.read.csv(datasets_path + tsv_file, header=True, sep="\t")
endtime = time.time()
exec_time = str(endtime - starttime)

print(f"File {tsv_file} successfully read and loaded as a Spark DataFrame in {exec_time} seconds.")
print(f"Counting the number of records in {tsv_file}")

starttime = time.time()
total_records = df.count()
endtime = time.time()
exec_time = str(endtime - starttime)

print(f"Counting finished in {exec_time} seconds. Total number of records is {total_records}.")

### Terminating Spark session

Stop the session when you are done; otherwise the application keeps running on the cluster and holds its resources.

In [None]:
spark.stop()