**Introduction to Python for Data Science and Data Engineering**

# Spark Components

  ![](https://jmp.sh/s/YzccOZReG7jdljk57jox)

  Spark Core (RDD API) -> DataFrame API -> Spark SQL / SparkR API / MLlib / Structured Streaming API

# Spark Runtime Architecture
Driver / Cluster Manager / Workers / Executors
### Driver
- Plans and coordinates execution.
- Creates the SparkSession, the entry point to every Spark application.
- Analyzes the Spark application and constructs the DAG.
- Schedules and distributes tasks to executors for execution.
- Monitors the progress of tasks and handles failures.
- Returns results to the client.
### Cluster Manager/Master
- Manages cluster resources and allocates them to the driver.
### Workers
- Nodes in the cluster that host executors.
### Executors
- Processes that run on worker nodes and execute the tasks assigned by the driver.
- Store intermediate and final results in memory or on disk.
- Interact with the driver for task coordination and data transfer.


# The Spark DAG
 - Spark jobs are broken down into stages, i.e. groups of tasks that can run in parallel.
 - Computations flow in one direction through the stages.
 - Stages never loop back (the graph is acyclic), which ensures the job terminates.
 - Stages are organized into a dependency graph that determines the execution flow.
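
The ordering idea behind the points above can be sketched in plain Python, without Spark. This is only an illustration of scheduling over an acyclic dependency graph using the standard-library `graphlib` module; the stage names and their dependencies are hypothetical, and Spark's own scheduler is considerably more involved.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each key maps to the set of stages it depends on.
stage_dependencies = {
    "stage_0_read_orders": set(),
    "stage_1_read_customers": set(),
    "stage_2_join": {"stage_0_read_orders", "stage_1_read_customers"},
    "stage_3_aggregate": {"stage_2_join"},
}

sorter = TopologicalSorter(stage_dependencies)
sorter.prepare()  # raises graphlib.CycleError if the graph loops back

# Group stages into "waves": every stage in a wave has all of its
# dependencies finished, so the stages of one wave could run in parallel.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

for number, wave in enumerate(waves):
    print(f"wave {number}: {wave}")
```

The two read stages have no dependencies and land in the first wave; the join and the aggregation follow in order. Because the graph has no cycles, the loop always finishes.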

# The Spark UI
 Visualising Spark applications

 Spark provides web user interfaces for monitoring and management, including:

###  Application UI
  - One per application (SparkSession)
  - Tracks application progress and task execution
  - DAG visualization and stage details
  - Resource usage and performance metrics.

### Master UI
  - Per cluster
  - Worker node status and health and cluster-wide resource allocation
  - Shows all running applications and available resources.

# Spark Clusters in Databricks

- **All-purpose clusters** - Interactive clusters that support notebooks, jobs, and dashboards, with auto-termination.
- **Job clusters** - Clusters that start when a job runs and terminate automatically upon completion; optimized for non-interactive workloads.
- **SQL Warehouses** - Optimized clusters for SQL query performance with instant startup and auto-scaling to balance cost and performance.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("MySparkSession") \
    .getOrCreate()

In [2]:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import col

import math

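# Plain Python prime test; it is wrapped as a Spark UDF in the next cell
def primeNumbercheck(myNum):
    if myNum==2:
        return True   # 2 is the only even prime
    elif myNum%2==0 or myNum==1:
        return False  # 0, 1 and all other even numbers are not prime
    # Only divisors up to the square root need to be tested
    x=int(math.sqrt(myNum))
    # Start from an odd candidate at (or just above) the square root
    if x%2==0:
        strtno=x+1
    else:
        strtno=x
    # Try odd divisors from strtno down to 3
    for i in range(strtno,1,-2):
        if myNum%i==0:
            return False
    return True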
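
Before wrapping a function like this in a UDF, it can help to sanity-check the logic in plain Python. The sketch below is a self-contained variation on the prime test above, not the original function: it counts upward and uses `math.isqrt`, and both helper names (`is_prime_sqrt`, `is_prime_naive`) are hypothetical. It compares the square-root version against a naive reference for 0 to 99.

```python
import math


def is_prime_sqrt(n):
    """Trial division by odd numbers up to floor(sqrt(n))."""
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for divisor in range(3, math.isqrt(n) + 1, 2):
        if n % divisor == 0:
            return False
    return True


def is_prime_naive(n):
    """Reference check: try every candidate divisor below n."""
    return n >= 2 and all(n % d for d in range(2, n))


primes = [n for n in range(100) if is_prime_sqrt(n)]

# Both implementations must agree on every input in the range.
assert primes == [n for n in range(100) if is_prime_naive(n)]
print(len(primes))  # 25
```

The count of 25 matches the number of rows the Spark job returns for the same range later in this notebook.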


In [3]:
#register the function primecheck as a
#User Defined Function (UDF) for use with Spark DataFrames
from pyspark.sql.functions import udf

@udf(returnType=BooleanType())
def primecheck(myNum):
    return primeNumbercheck(myNum)

In [4]:
# Create a DataFrame with 100 sample records (ids 0 to 99)
df = spark.range(0,100)

print(type(df))

#Add a new column isPrime with boolean value as output
df = df.withColumn("isPrime", primecheck(df["id"]))

<class 'pyspark.sql.dataframe.DataFrame'>


In [6]:
#capture the result of only prime records
result = df.filter(df.isPrime==True)

In [7]:
# Mark the result for caching; it is materialized by the first action
# and is then visible in the Spark UI under the Storage tab
result.cache()

DataFrame[id: bigint, isPrime: boolean]

In [8]:
result.collect()

[Row(id=2, isPrime=True),
 Row(id=3, isPrime=True),
 Row(id=5, isPrime=True),
 Row(id=7, isPrime=True),
 Row(id=11, isPrime=True),
 Row(id=13, isPrime=True),
 Row(id=17, isPrime=True),
 Row(id=19, isPrime=True),
 Row(id=23, isPrime=True),
 Row(id=29, isPrime=True),
 Row(id=31, isPrime=True),
 Row(id=37, isPrime=True),
 Row(id=41, isPrime=True),
 Row(id=43, isPrime=True),
 Row(id=47, isPrime=True),
 Row(id=53, isPrime=True),
 Row(id=59, isPrime=True),
 Row(id=61, isPrime=True),
 Row(id=67, isPrime=True),
 Row(id=71, isPrime=True),
 Row(id=73, isPrime=True),
 Row(id=79, isPrime=True),
 Row(id=83, isPrime=True),
 Row(id=89, isPrime=True),
 Row(id=97, isPrime=True)]

In [9]:
result.count()

25

In [10]:
# Free up executor memory by unpersisting cached objects
result.unpersist()

DataFrame[id: bigint, isPrime: boolean]

In [11]:
# install ngrok reverse proxy python wrapper required in google colab to check the Spark UI
!pip install pyngrok
!pip install findspark

Collecting pyngrok
  Downloading pyngrok-7.2.11-py3-none-any.whl.metadata (9.4 kB)
Downloading pyngrok-7.2.11-py3-none-any.whl (25 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.2.11
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl.metadata (352 bytes)
Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [12]:
#required in google colab to check the Spark UI
import findspark
findspark.init()
from pyngrok import ngrok
import os

In [13]:
#required in google colab to check the Spark UI
ngrok.set_auth_token("Get your token from https://dashboard.ngrok.com/get-started/your-authtoken")



In [15]:
#required in google colab to check the Spark UI
spark_ui_url = spark.sparkContext.uiWebUrl
print(spark_ui_url)
if spark_ui_url:
  # Extract the port from the URL
  spark_ui_port = int(spark_ui_url.split(':')[-1])
  ngrok_tunnel = ngrok.connect(spark_ui_port)
  print(f"Spark UI URL: {ngrok_tunnel.public_url}")
else:
  print("Spark UI is not available.")



http://e615004a3e81:4040
Spark UI URL: https://d799-34-125-255-91.ngrok-free.app
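
Splitting the URL on `:` works for the address printed above, but it would fail if the URL ever carried a trailing slash or path. As a possible alternative, the port can be read with the standard-library `urllib.parse`. This is a sketch only: `extract_port` is a hypothetical helper, and the fallback of 4040 is an assumption based on the Spark UI's default port.

```python
from urllib.parse import urlparse


def extract_port(url, default=4040):
    """Return the port of a URL, or `default` when the URL has none."""
    port = urlparse(url).port
    return port if port is not None else default


print(extract_port("http://e615004a3e81:4040"))  # 4040
print(extract_port("http://localhost:4041/"))    # 4041
```

The returned integer can be passed to `ngrok.connect` in the same way as `spark_ui_port` in the cell above.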


# DataFrames

* DataFrames are distributed collections of records that all share the same predefined structure.
* Built on Spark's core concepts, but with structure, optimization, and familiar SQL-like operations for data manipulation.
* DataFrames track their schema and provide native support for many common SQL functions and relational operators such as JOINs.
* DataFrames are evaluated as DAGs using lazy evaluation: the DAG is prepared first and executed only when data is requested.

* Can be created from JSON, CSV, Parquet, ORC, text, or binary files.
* Can also be created from Delta Lake or other table storage format directories.

## DataFrame API Optimization
- Adaptive Query Execution
- In-memory Columnar Storage
- Built-in Statistics
- Catalyst Optimizer and Photon (Databricks)

## DataFrame/ Query Planning
- When a DataFrame is evaluated, the driver creates an optimized execution plan through a series of transformations:
 Unresolved Logical Plan -> Analyzed Logical Plan -> Optimized Logical Plan -> Physical Plan



# Columnar Storage


*   Organizes data by column enabling efficient scanning and analysis
*   Efficient for analytical workloads
*   Implemented in dataframe internal storage and in physical file encoding formats such as Parquet and ORC.
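
The difference between row and column layouts can be illustrated with plain Python containers. This is a toy sketch with hypothetical records; it says nothing about how Spark, Parquet, or ORC actually encode data, only why reading one column is cheaper when values are grouped by column.

```python
# Three hypothetical records stored row by row.
rows = [
    {"id": 1, "city": "Oslo", "price": 10.0},
    {"id": 2, "city": "Lima", "price": 12.5},
    {"id": 3, "city": "Oslo", "price": 7.5},
]

# The same data laid out column by column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row layout: every full record is visited just to read one field.
total_from_rows = sum(row["price"] for row in rows)

# Column layout: only the "price" values are visited.
total_from_columns = sum(columns["price"])

print(columns["price"])                     # [10.0, 12.5, 7.5]
print(total_from_rows, total_from_columns)  # 30.0 30.0
```

An aggregate over `price` touches a single contiguous list in the column layout, which is the access pattern analytical queries tend to have.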



# DataFrameReader and DataFrameWriter

df = spark.read.format("format").option("key", "value").load("filelocation")

df = spark.read.csv("filelocation")
df = spark.read.parquet("filelocation")

-------------

df.write.format("format").mode("mode").save("filelocation")

df.write.csv("filelocation")


In [23]:
# Specify the schema of the DataFrame explicitly using a DDL string (no inference)

housing_ddl_schema = '''
longitude DOUBLE,
latitude DOUBLE,
housing_median_age DOUBLE,
total_rooms DOUBLE,
total_bedrooms DOUBLE,
population DOUBLE,
households DOUBLE,
median_income DOUBLE,
median_house_value DOUBLE
'''

housing_ddl_df = spark.read.format("csv") \
.option("header","true")\
.option("inferSchema","false")\
.schema(housing_ddl_schema)\
.load("/content/sample_data/california_housing_test.csv")

In [24]:
#Record count

housing_ddl_df.count()

3000

In [21]:
# Explicitly define the schema with StructType and StructField
from pyspark.sql.types import StructType, StructField, DoubleType

housing_schema = StructType([
    StructField("longitude",DoubleType()),
    StructField("latitude",DoubleType()),
    StructField("housing_median_age",DoubleType()),
    StructField("total_rooms",DoubleType()),
    StructField("total_bedrooms",DoubleType()),
    StructField("population",DoubleType()),
    StructField("households",DoubleType()),
    StructField("median_income",DoubleType()),
    StructField("median_house_value",DoubleType())
    ])

housing_data_df = spark.read.format("csv") \
.option("header","true")\
.option("inferSchema","false")\
.schema(housing_schema)\
.load("/content/sample_data/california_housing_test.csv")

In [22]:
housing_data_df.count()

3000

In [18]:
# Read data from a CSV file
# With inferSchema enabled, Spark runs an extra job that scans the data to infer the column types

housing_df = spark.read.format("csv") \
.option("header","true")\
.option("inferSchema","true")\
.load("/content/sample_data/california_housing_test.csv")



In [20]:
#display the fields available
housing_df.printSchema()

#display sample data
housing_df.show()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -122.05|   37.37|              27.0|     3885.0|         661.0|    1537.0|     606.0|       6.6085|          344700.0|
|   -118.3|   34.26|              43.0|     1510.0|         310.0|     809.0|     277.0|        3.599|     

In [17]:
# Write the result as CSV files into the "primes" directory
result.write.format("csv").mode("overwrite").save("primes")


# Dataframe Schema

- Every DataFrame has a defined schema, i.e. the structure and data types of all its columns.
- The schema can be inferred from the data or specified explicitly.
- Self-describing formats like Parquet include schema information.
- df.printSchema() prints the DataFrame schema.
- DDL schema
    - ddl_schema = "name STRING NOT NULL, age INT, city STRING"
    - df = spark.read.csv("Filelocation",schema = ddl_schema)
    - df.printSchema()
- DataFrame Data Types (Primitive and Complex datatypes)
    - TINYINT/SMALLINT/INT/BIGINT
    - FLOAT/DOUBLE
    - STRING
    - BINARY
    - TIMESTAMP/DATE
    - ARRAY
    - MAP
    - STRUCT


In [None]:
df.printSchema()

# Transformations and Actions

DataFrames are immutable: once created, their data cannot be modified.


*   **Transformations** create new DataFrames from existing ones.
    - select/filter/withColumn/groupBy/agg
*   **Actions**, such as showing or saving output, trigger the actual computation and produce final results.
    - count/show/take/first/write
*   **Lazy evaluation**: multiple transformations can be chained, but a job is created only when an action is requested.
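
The lazy evaluation pattern can be mimicked in a few lines of plain Python. The class below is a toy stand-in, not Spark's implementation: `LazyDataset` and its methods are hypothetical names, and the "plan" is just a tuple of recorded steps. It shows that transformations return new objects without computing anything, and that work happens only when an action is called.

```python
class LazyDataset:
    """Toy stand-in for a DataFrame: records a plan, runs it only on an action."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan  # tuple of (kind, function) steps

    # Transformations: return a new object and compute nothing.
    def filter(self, predicate):
        return LazyDataset(self._source, self._plan + (("filter", predicate),))

    def map(self, func):
        return LazyDataset(self._source, self._plan + (("map", func),))

    def _run(self):
        data = iter(self._source)
        for kind, func in self._plan:
            data = filter(func, data) if kind == "filter" else map(func, data)
        return data

    # Actions: execute the recorded plan and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def is_even(n):
    calls.append(n)  # record every evaluation of the predicate
    return n % 2 == 0


base = LazyDataset(range(10))
evens_squared = base.filter(is_even).map(lambda n: n * n)

print(len(calls))               # 0: nothing has run yet
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
print(len(calls))               # 10: the predicate ran during the action
print(base.collect())           # the original dataset is unchanged
```

Each transformation builds a new object instead of modifying `base`, which mirrors the immutability noted above.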



# Spark SQL
SQL interface for Spark DataFrames

**DataFrame Registration**
- Temporary views: createOrReplaceTempView()
- Global temporary views: createGlobalTempView()

**SQL Query Execution**
- spark.sql() for SQL statements
