# A `broadcast variable` allows the programmer to keep a `read-only variable cached on each machine` rather than shipping a copy of it with tasks.

Usage: Broadcast a copy of a large input dataset to every node in an efficient manner.
<br>Broadcast variables are kept on all worker nodes for use in one or more Spark operations.

__Problems without broadcast variables:__
<br>Say a large lookup table needs to be referenced by the RDD.
<br>Without broadcasting, the Spark driver sends a copy of this lookup table with each and every task, which is a memory and performance overhead. Instead, a worker node needs only one copy of the lookup table, and the multiple tasks it handles can all reference this read-only broadcast variable.
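
To get a feel for the overhead described above, here is a rough back-of-the-envelope sketch in plain Python (no Spark required). The lookup table contents, the worker count, and the tasks-per-worker figure are all made-up assumptions for illustration. `pickle` is used only as a stand-in to estimate the serialized size of one copy, so the numbers will not match what Spark actually transfers.

```python
import pickle

# Hypothetical lookup table: 100,000 integer keys mapped to short strings.
lookup = {i: f"item-{i}" for i in range(100_000)}

# Approximate serialized size of one copy of the table.
copy_bytes = len(pickle.dumps(lookup))

# Assumed cluster shape (illustrative numbers only).
num_workers = 4
tasks_per_worker = 50

# Without broadcasting: one copy travels with every task.
per_task_total = copy_bytes * num_workers * tasks_per_worker

# With broadcasting: one copy is cached per worker and shared by its tasks.
per_worker_total = copy_bytes * num_workers

print(f"one copy:              {copy_bytes / 1e6:.2f} MB")
print(f"shipped per task:      {per_task_total / 1e6:.2f} MB")
print(f"broadcast per worker:  {per_worker_total / 1e6:.2f} MB")
print(f"reduction factor:      {per_task_total // per_worker_total}x")
```

Under these assumptions the saving equals the number of tasks each worker runs, which is why broadcasting pays off most for large tables and many small tasks.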

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark import SparkContext
sc = SparkContext("local[1]", "BROADCAST")
sc.setLogLevel("WARN")

In [3]:
# MASTER DATA

# Create Broadcast variable in the Driver program
# .broadcast() broadcasts it to all nodes

category = sc.broadcast(["Mobile", "Tablet", "PC"])
states = sc.broadcast(["CA", "WA", "NY", "WI", "WY", "TX", "DC"])

type(category), type(states)

(pyspark.broadcast.Broadcast, pyspark.broadcast.Broadcast)

In [4]:
# Access broadcast value
category.value

['Mobile', 'Tablet', 'PC']

___

In [None]:
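"""
This global variable (x) is initialized in the Driver only.
Spark captures it in the lambda's closure and ships a copy of it with every task,
which is fine for a small value like this but wasteful for large data.
How do you share it efficiently across a cluster? Broadcast [SEE NEXT CODE CELL]
"""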

x = 10

rdd = sc.parallelize(range(1, 10))
rdd.map(lambda n: n*x).collect()

In [10]:
"""
x is now broadcast to all worker nodes.
"""
x = sc.broadcast(10)

rdd = sc.parallelize(range(1, 10))
rdd.map(lambda n: n*x.value).collect()

[10, 20, 30, 40, 50, 60, 70, 80, 90]