# **Databricks Spark Broadcast Variable**

A broadcast variable is a programming mechanism in Spark that keeps a read-only copy of data on each node of the cluster, instead of sending that data to the node every time a task needs it.

In Spark RDDs and DataFrames, broadcast variables are read-only shared variables that are cached and available on all nodes in a cluster so that tasks can access them. Instead of sending this data along with every task, Spark distributes broadcast variables to the machines using efficient broadcast algorithms to reduce communication costs.

A typical use case is converting a short code into its full name (for example, CA to California, NY to New York, etc.) by looking it up in a reference mapping. In some instances this reference data can be large, and you may have many such lookups (zip codes, for instance).

**Lookup table** - In PySpark, a lookup table is a reference table or mapping that is used to **search for a value or a range of values**, such as the full name that corresponds to a code.

Distributing this lookup information along with each task over the network adds overhead and takes time. Instead, we can use a broadcast variable to cache the lookup data (for example, CA = California, NY = New York) on each machine, and the tasks use this cached copy while executing the transformations.

**Mainly used in join operations**
*  If one table or DataFrame has many records and the other has few, it is not ideal to send the fact table (the big table) to all the worker nodes, because that takes a huge amount of memory and increases execution time. Instead, the tiny table is sent to each worker node as a whole, while the fact table stays distributed among the worker nodes.

Summary -
* A broadcast variable is materialised and cached on each worker node.
* It avoids shuffling data and thus improves performance. Because a copy is present on each node, a worker that needs the data does not have to fetch it from another node; the entire dataset is already available locally.
* It reduces network I/O and thus improves performance.
* It is ideal for join situations where one side of the join is a tiny table.
* The tiny table is sent to all worker nodes through the driver, so the tiny table must be small enough to fit in driver memory.
* It is only suitable for small tables, because the data gets cached on each node. If it is created for a large table, it consumes memory on every worker node and hurts performance.
* A broadcast variable is sent to the worker nodes through the driver, so it must fit into driver memory; otherwise it leads to OOM (out-of-memory) issues.

In [None]:
# for local system
import findspark
findspark.init()



In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.host", "localhost").getOrCreate()

### **Create Sample Dataframe**

In [None]:

# Sample data
data = [(1, 'Alice', 25),
        (2, 'Bob', 30),
        (3, 'Charlie', 22),
        (4, 'David', 28),
        (5, 'Eva', 35)]

# Define the schema for the DataFrame
schema = ["store_id", "name", "amount"]

In [None]:
df = spark.createDataFrame(data=data, schema=schema)
df.show()

### **Create sample dimension table**

In [None]:
# Small dimension table: one row per store
store = [(1, 'store_london'),
         (2, 'store_paris'),
         (3, 'store_frankfurt'),
         (4, 'store_stockholm'),
         (5, 'store_oslo')]

storeDf = spark.createDataFrame(data=store, schema=['store_id', 'store_name'])
storeDf.show()

In [None]:
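from pyspark.sql.functions import broadcast

# Hint Spark to broadcast the small storeDf; the larger df stays partitioned.
# Joining on a column expression keeps store_id from both sides in the result.
joinDF = df.join(broadcast(storeDf), storeDf.store_id == df.store_id)
joinDF.show()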

  *  The cell above uses PySpark to join two DataFrames on the store_id column, with storeDf broadcast. This is a common pattern for optimizing joins, especially when one of the DataFrames is relatively small and can be efficiently broadcast to all the nodes in the cluster.
  *  The `broadcast` function hints to Spark that storeDf is small enough to be efficiently broadcast to all nodes, which reduces the need to shuffle data across the network during the join.
  *  Broadcasting is beneficial when one DataFrame is significantly smaller than the other. If both DataFrames are large, broadcasting may not provide a significant performance improvement, and without the hint Spark applies its default join behavior.

In [None]:
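# Print the logical and physical plans; with the broadcast hint,
# the physical plan should show a BroadcastHashJoin.
joinDF.explain(True)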