# **Databricks Spark Broadcast Variable**

A broadcast variable is a programming mechanism in PySpark that keeps a single read-only copy of data on each node of the cluster, instead of sending that data to the node every time a task needs it.

In Spark RDD and DataFrame, broadcast variables are read-only shared variables that are cached and available on all nodes in a cluster so that tasks can access them. Instead of sending this data along with every task, Spark distributes broadcast variables to the machines using efficient broadcast algorithms to reduce communication costs.

A typical use case is translating a short code into its full name (for example, CA to California, NY to New York, etc.) by doing a lookup against a reference mapping. In some instances this data could be large, and you may have many such lookups (like zip codes).

**Lookup table** - A lookup table in PySpark is a table or mapping that is used to **search for a value or a range of values** by key.

Instead of distributing this information along with each task over the network (which adds overhead and is time-consuming), we can use a broadcast variable to cache the lookup data on each machine, and tasks use this cached data while executing the transformations. Example lookup entries: CA = California, NY = New York.
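
The lookup idea can be sketched as follows. This is a minimal illustration, not a definitive implementation: `STATES`, `expand_state`, and `state_lookup_demo` are hypothetical names, the sample people are made up, and `spark` is assumed to be a SparkSession that already exists.

```python
# Lookup data: the two entries come from the text above; a real table would be larger.
STATES = {"CA": "California", "NY": "New York"}


def expand_state(record, states_bc):
    """Replace the state code in a (name, code) pair using the broadcast dict."""
    name, code = record
    # .value returns the copy cached on the executor; unknown codes pass through.
    return (name, states_bc.value.get(code, code))


def state_lookup_demo(spark):
    """Hypothetical helper: `spark` is an already-created SparkSession."""
    sc = spark.sparkContext
    states_bc = sc.broadcast(STATES)  # shipped to each executor once, not per task
    people = sc.parallelize([("James", "CA"), ("Maria", "NY"), ("Robert", "FL")])
    return people.map(lambda rec: expand_state(rec, states_bc)).collect()
```

Once a session is available, calling `state_lookup_demo(spark)` should return the three pairs with CA and NY expanded and FL left unchanged, because FL is not in the lookup.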

**Mainly used in join operations**
*  If one table or DataFrame has many records and the other has few, it is not ideal to send the fact table (the big table) to all the worker nodes, because that takes a huge amount of memory and increases execution time. Instead, the tiny table is sent to every worker node as a whole, while the fact table stays distributed among the worker nodes.

Summary -
* A broadcast variable is materialised and cached on each worker node.
* It avoids a data shuffle and thus improves performance. Because a copy is present on each node, a worker that needs the data does not have to fetch it from another node or shuffle data, as the entire dataset is already present on the worker.
* It reduces network I/O and thus improves performance.
* Ideal for join situations where one side of the join is a tiny table.
* The tiny table is sent to all worker nodes through the driver, so the tiny table must be smaller than the driver's memory.
* Only suitable for small tables, because the data is cached on each node. If it is created for large tables, it consumes memory on each worker node and hurts performance.
* A broadcast variable is sent to the worker nodes through the driver, so its size should fit into driver memory; otherwise it leads to OOM issues.
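
For joins, Spark SQL also decides on its own whether a table is small enough to broadcast, based on the `spark.sql.autoBroadcastJoinThreshold` setting (documented default: 10 MB). The sketch below shows one way to adjust it; `set_auto_broadcast_threshold` is a hypothetical helper name and `spark` is assumed to be an existing SparkSession.

```python
def set_auto_broadcast_threshold(spark, megabytes):
    """Hypothetical helper: set the size limit for automatic broadcast joins.

    `spark` is an existing SparkSession. A negative value disables automatic
    broadcasting, so only explicit broadcast() hints trigger a broadcast join.
    """
    size_in_bytes = -1 if megabytes < 0 else int(megabytes * 1024 * 1024)
    # The value is passed as a string, which spark.conf.set accepts.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(size_in_bytes))
    return size_in_bytes
```

Raising the limit only makes sense when the driver and every executor have enough memory to hold the table, in line with the memory points in the summary.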

In [None]:
# for local system
import findspark
findspark.init()



In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.host", "localhost").getOrCreate()

### **Create Sample Dataframe**

In [None]:

# Sample data
data = [(1, 'Alice', 25),
        (2, 'Bob', 30),
        (3, 'Charlie', 22),
        (4, 'David', 28),
        (5, 'Eva', 35)]

# Column names for the DataFrame (column types are inferred from the data)
schema = ["store_id", "name", "amount"]

In [None]:
df = spark.createDataFrame(data=data, schema=schema)
df.show()

### **Create sample dimension table**

In [None]:
store = [(1, "store_london"),
         (2, "store_paris"),
         (3, "store_frankfurt"),
         (4, "store_stockholm"),
         (5, "store_oslo")]

storeDf = spark.createDataFrame(data=store, schema=["store_id", "store_name"])
storeDf.show()

In [None]:
from pyspark.sql.functions import broadcast

# Hint Spark to broadcast the small dimension table (storeDf) to every executor
joinDF = df.join(broadcast(storeDf), storeDf.store_id == df.store_id)
joinDF.show()

  *  Here PySpark performs a join on two DataFrames, where storeDf is broadcast and the join key is the store_id column. This is a common pattern for optimizing joins, especially when one of the DataFrames is relatively small and can be sent efficiently to all the nodes in the cluster.
  *  Calling broadcast() hints to Spark that storeDf is small enough to be sent to all nodes, which reduces the need to shuffle data across the network during the join.
  *  Broadcasting is beneficial when one DataFrame is significantly smaller than the other. If both DataFrames are large, broadcasting is unlikely to help and can exhaust memory; in that case, leave out the hint and let Spark choose its default join strategy.
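
One way to confirm that the hint took effect is to look for a broadcast join operator in the plan text. The sketch below is only an illustration: `plan_text` and `uses_broadcast_join` are hypothetical helpers, and it assumes that `DataFrame.explain()` prints the plan to Python's standard output, so the text can be captured with `contextlib.redirect_stdout`.

```python
import contextlib
import io

# Operator names that Spark shows in a physical plan for broadcast-based joins.
BROADCAST_JOIN_OPERATORS = ("BroadcastHashJoin", "BroadcastNestedLoopJoin")


def plan_text(df):
    """Capture what df.explain() prints and return it as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def uses_broadcast_join(df):
    """Return True if the physical plan mentions a broadcast join operator."""
    text = plan_text(df)
    return any(op in text for op in BROADCAST_JOIN_OPERATORS)
```

For the join above, `uses_broadcast_join(joinDF)` would be expected to return `True`, since storeDf carries an explicit broadcast hint.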

In [None]:
joinDF.explain(True)

### **Broadcast Variable Example**

1. Creating the DataFrame and storing its rows in a broadcast variable.
    DF.collect() returns the contents of the DataFrame as a list of Row objects.
    * First, we create the DataFrame.
    * We use the collect() function to get the DataFrame's values as rows.
    * Then the result is stored in the broadcast variable.
    * broadcast_df
      * It contains the reference to the broadcast variable.
        * e.g. <pyspark.broadcast.Broadcast object at 0x000000>
      * Using .value, we can get the value of the broadcast variable.
      * for row in broadcast_value:
        * Retrieves all the values from the broadcast_value variable.



  
You call sparkContext.broadcast() in the driver program to make data available to all the worker nodes. One common approach is to collect the data you want to broadcast on the driver node using df.collect() (or other means), and then pass it to sparkContext.broadcast().

Here's a summary of the typical workflow:

* Collect Data on the Driver: You collect the data you want to broadcast on the driver node. This typically involves collecting the data from a DataFrame or any other source into a suitable data structure.

* Broadcast Data: Once you have the data on the driver node, you use sparkContext.broadcast() to broadcast it. This makes the data available to all the worker nodes in the Spark cluster.

* Accessing Broadcast Data: On the worker nodes, you can access the broadcast data using the .value attribute of the broadcast variable. This allows the tasks running on the worker nodes to access the data efficiently.

* Performing Operations: With the broadcast data available on all worker nodes, you can perform distributed operations using Spark's transformations and actions.

This approach allows you to efficiently distribute read-only data to all the worker nodes in the Spark cluster without incurring significant overhead from network transfer.
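
The four workflow steps can be sketched with a DataFrame and a UDF. This is a hedged illustration rather than the only way to do it: `rows_to_lookup` and `add_store_name` are hypothetical helpers, and the column names `store_id` and `store_name` are assumed to match the sample store table created earlier.

```python
def rows_to_lookup(rows, key, value):
    """Turn collected rows into a plain dict that is cheap to probe in a task."""
    return {row[key]: row[value] for row in rows}


def add_store_name(spark, sales_df, store_df):
    """Hypothetical helper that follows the four workflow steps above."""
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    # 1. Collect the small table on the driver.
    lookup = rows_to_lookup(store_df.collect(), "store_id", "store_name")
    # 2. Broadcast the collected data.
    lookup_bc = spark.sparkContext.broadcast(lookup)
    # 3. Read .value inside task code; the UDF body runs on the executors.
    store_name_of = udf(lambda store_id: lookup_bc.value.get(store_id), StringType())
    # 4. Use it in an ordinary distributed transformation.
    return sales_df.withColumn("store_name", store_name_of(col("store_id")))
```

With the sample DataFrames from earlier cells, `add_store_name(spark, df, storeDf).show()` would add a store_name column without a join. For plain equi-joins, the broadcast() join hint is usually simpler than a UDF.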
  
  
Error -
  * **RuntimeError:** It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063
  * This error is raised if we pass df directly, as in spark.sparkContext.broadcast(df), instead of passing the collected rows, as in spark.sparkContext.broadcast(df.collect()).

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

# Create a SparkSession
spark = SparkSession.builder.appName('first').getOrCreate()

# Create a schema for the DataFrame
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value", IntegerType(), True)
])

# Create a 2D DataFrame
data = [(1, 10), (2, 20), (3, 30)]
df = spark.createDataFrame(data, schema)

# Show the DataFrame
df.show()
"""
+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  2|   20|
|  3|   30|
+---+-----+
"""

# Collect the rows on the driver and broadcast the resulting list of Row objects
broadcast_df = spark.sparkContext.broadcast(df.collect())

# Retrieve the value from the broadcast variable
broadcast_value = broadcast_df.value
print(broadcast_value)

# Print each row held in the broadcast variable
for row in broadcast_value:
    print(row)
"""
[Row(id=1, value=10), Row(id=2, value=20), Row(id=3, value=30)]
Row(id=1, value=10)
Row(id=2, value=20)
Row(id=3, value=30)
"""

# Stop the SparkSession
spark.stop()

2. Storing a constant value in a broadcast variable.
    * broadcast_age_threshold = spark.sparkContext.broadcast(30)
    * Now broadcast_age_threshold contains the reference to the broadcast in which the value 30 is stored.
    * Using broadcast_age_threshold.value returns the value stored in the variable.

In [None]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.config('spark.driver.host','localhost').appName("Broadcast").getOrCreate()

# Create a DataFrame
data = [("John", 30), ("Jane", 25), ("Alice", 35)]
df = spark.createDataFrame(data, ["name", "age"])

# Broadcast a Python object
broadcast_age_threshold = spark.sparkContext.broadcast(30)

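# Use the broadcast value in a transformation.
# Note: .value is read on the driver here, so 30 is embedded in the query plan
# as a literal; executors read the broadcast copy only when .value is accessed
# inside task code such as a UDF or an RDD function.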
filtered_df = df.filter(df.age > broadcast_age_threshold.value)

# Show the filtered DataFrame
filtered_df.show()

# Stop the SparkSession
spark.stop()

3.  In PySpark, there's a built-in broadcast() function in the pyspark.sql.functions module. It lets you explicitly mark a DataFrame to be broadcast during join operations. Calling broadcast(df) on a DataFrame df hints to Spark that df should be broadcast when it is joined with other DataFrames.

In [None]:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Create a SparkSession
spark = SparkSession.builder \
    .appName("shop") \
    .getOrCreate()

# Sample DataFrames
df1 = spark.createDataFrame([(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')], ["id", "name"])
df2 = spark.createDataFrame([(1, 'Engineer'), (2, 'Manager'), (3, 'Analyst')], ["id", "role"])

# Broadcast df2 during the join operation
joined_df = df1.join(broadcast(df2), on='id', how='left')

# Show the result
joined_df.show()

# Don't forget to stop the SparkSession
spark.stop()


* We have two DataFrames df1 and df2.
* We use broadcast(df2) to explicitly mark df2 for broadcasting during the join operation with df1.
* The join operation is performed between df1 and the broadcasted df2.
* By broadcasting df2, Spark optimizes the join operation by distributing df2 to all worker nodes, reducing the data shuffle during the join.


Using broadcast() from pyspark.sql.functions is convenient when you want to explicitly control broadcasting for join operations without manually managing broadcast variables. However, broadcast() is specifically designed for broadcast joins and is not suitable for general broadcasting of read-only data to all tasks in the cluster. For that purpose, you would still use sparkContext.broadcast().




### **Checking the difference in performance with and without broadcasting**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
import time

# Create a SparkSession
spark = SparkSession.builder \
    .appName("performance") \
    .getOrCreate()

# Sample large DataFrame
large_df = spark.range(1000000)

# Sample small DataFrame
small_df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')], ["id", "name"])

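# Without broadcasting small DataFrame
# Note: join() is a lazy transformation and no action (such as count() or show())
# is called on result1 or result2, so both timers capture only the time taken to
# build the query plan, not the time taken to execute the join.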
start_time = time.time()
result1 = large_df.join(small_df, on='id', how='left')
elapsed_time_without_broadcast = time.time() - start_time

# With broadcasting small DataFrame
start_time = time.time()
result2 = large_df.join(broadcast(small_df), on='id', how='left')
elapsed_time_with_broadcast = time.time() - start_time

# Print the elapsed time for both scenarios
print("Elapsed time without broadcast:", elapsed_time_without_broadcast)
print("Elapsed time with broadcast:", elapsed_time_with_broadcast)

# Don't forget to stop the SparkSession
spark.stop()

Elapsed time without broadcast: 0.12572813034057617
Elapsed time with broadcast: 0.06737208366394043
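
Because joins are lazy, a comparison that reflects the join itself needs an action inside the timed region. Spark may also broadcast a table this small automatically, so the sketch below temporarily disables automatic broadcasting for the baseline. It is only one possible setup under those assumptions: `timed` and `compare_joins` are hypothetical helpers, `spark` is assumed to be an active SparkSession, and timings vary by machine and cluster.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def compare_joins(spark, rows=1_000_000):
    """Hypothetical helper: time a non-broadcast join against a broadcast join."""
    from pyspark.sql.functions import broadcast

    large_df = spark.range(rows)
    small_df = spark.createDataFrame(
        [(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"]
    )
    key = "spark.sql.autoBroadcastJoinThreshold"
    previous = spark.conf.get(key)
    # Disable automatic broadcasting so the baseline is not silently broadcast.
    spark.conf.set(key, "-1")
    try:
        # count() is an action, so each join is actually executed while timed.
        _, plain_s = timed(lambda: large_df.join(small_df, "id", "left").count())
        _, broadcast_s = timed(
            lambda: large_df.join(broadcast(small_df), "id", "left").count()
        )
    finally:
        spark.conf.set(key, previous)  # restore the original setting
    return plain_s, broadcast_s
```

The first timed run also pays one-off warm-up costs, so repeating the measurement a few times and comparing the later runs gives a fairer picture.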
