# Broadcast Variables

## What are Broadcast Variables?
Broadcast variables let us send a read-only copy of non-RDD data to every executor.  Each executor can then read the value locally.  This is much more efficient than relying on the driver to transmit the data each time a task is run.

## Using a Broadcast Variable

In [4]:
# Create a broadcast variable, transmitting its value to all the executors.
broadcastVar=sc.broadcast([1,2,3])

# I can read its value
print broadcastVar.value

# When I no longer want the variable to remain on the executors, I should free up the memory.
broadcastVar.unpersist()

In [5]:
# The value is available on the driver
print "Driver:", broadcastVar.value

# And on the executors
mapFunction=lambda n: "Task " + str(n) + ": " + str(broadcastVar.value)
results=sc.parallelize(range(10), numSlices=10).map(mapFunction).collect()
print "\n".join(results)


## How Broadcast Variables can improve performance (demo)
Here we have a medium-sized dataset: small enough to fit in RAM, but large enough that sending it to the executors still involves quite a bit of network communication.

In [7]:
# Create a medium-sized dataSet of two million values.
size=2*1000*1000
dataSet=list(xrange(size))

# Check the size of the list in RAM (the list object only; sys.getsizeof does not count the elements it refers to).
import sys
print sys.getsizeof(dataSet) / 1000 / 1000, "Megabytes"
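
What travels over the network is a serialized form of the data rather than the in-memory list, so it can be useful to measure that as well. The sketch below is only an estimate: it assumes the serialized form is comparable to a standard `pickle` of the list, and the names `data_set`, `shallow_bytes` and `pickled_bytes` are illustrative rather than part of the notebook.

```python
import pickle
import sys

# Same construction as the demo above (range works in place of xrange).
size = 2 * 1000 * 1000
data_set = list(range(size))

# Shallow size: the list object and its array of references only.
shallow_bytes = sys.getsizeof(data_set)

# Serialized size: a pickle of the whole list, elements included.
# Assumption: this approximates what is sent to an executor.
pickled_bytes = len(pickle.dumps(data_set, protocol=pickle.HIGHEST_PROTOCOL))

print("shallow: %.1f MB" % (shallow_bytes / 1e6))
print("pickled: %.1f MB" % (pickled_bytes / 1e6))
```

The two numbers measure different things, so neither should be read as the exact amount of traffic; they only give a sense of scale.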

Now let's demonstrate the overhead of network communication when not using broadcast variables.

In [9]:
# Create an RDD with 5 partitions so that we can run an operation as 5 separate tasks, in parallel, on up to 5 different executors.
rdd=sc.parallelize([1,2,3,4,5], numSlices=5)
print rdd.getNumPartitions(), "partitions"

In [10]:
# In a loop, do a job 5 times without using broadcast variables...
for i in range(5):
  rdd.map(lambda x: len(dataSet) * x).collect()

# Look how slow it is...
# This is because our local "dataSet" variable is used by the lambda and thus must be sent to the executors every time a task is run.
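
To put a number on "slow", the loop can be wrapped in a small timing helper. The following is a minimal sketch using only the standard library; `time_runs` is a hypothetical helper name, and the workload shown is a stand-in so that the snippet runs without a cluster.

```python
from timeit import default_timer


def time_runs(job, runs=5):
    """Call job() `runs` times and return the wall-clock seconds of each call."""
    durations = []
    for _ in range(runs):
        start = default_timer()
        job()
        durations.append(default_timer() - start)
    return durations


# Stand-in workload (not a Spark job); replace it with a function that
# runs the job being measured.
durations = time_runs(lambda: sum(range(100000)), runs=5)
print("per run (s): " + ", ".join("%.4f" % d for d in durations))
```

In the notebook, the same helper could be called as `time_runs(lambda: rdd.map(lambda x: len(dataSet) * x).collect())`, and again with `broadcastVar.value` in place of `dataSet`, to compare the two approaches.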

Let's do that again, but this time we'll first send a copy of the dataset to the executors once, so that the data is available locally every time a task is run.

In [12]:
# Create a broadcast variable.  This will transmit the dataset to the executors.
broadcastVar=sc.broadcast(dataSet)

Now we'll run the job 5 times.  Notice how much faster it is, since we don't have to retransmit the dataset each time.

In [14]:
for i in range(5):
  rdd.map(lambda x: len(broadcastVar.value) * x).collect()
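
The difference between the two loops can be illustrated with some back-of-the-envelope arithmetic. The sketch below is a deliberately simple model, not a description of Spark's internals: it assumes the data is shipped once per task without a broadcast variable and once per executor with one, and the 16 MB payload is a hypothetical figure. Real clusters may compress data, share it between executors, or ship closures differently.

```python
def bytes_shipped(payload_bytes, jobs, tasks_per_job, executors, use_broadcast):
    """Rough model of how many bytes of the payload are transmitted in total."""
    if use_broadcast:
        # Assumed: sent once per executor, then reused by every later task.
        return payload_bytes * executors
    # Assumed: sent along with every task of every job.
    return payload_bytes * jobs * tasks_per_job


payload = 16 * 1000 * 1000  # hypothetical payload size in bytes

without = bytes_shipped(payload, jobs=5, tasks_per_job=5, executors=5,
                        use_broadcast=False)
with_bc = bytes_shipped(payload, jobs=5, tasks_per_job=5, executors=5,
                        use_broadcast=True)

print("without broadcast: %d MB" % (without // 10**6))  # 400 MB
print("with broadcast:    %d MB" % (with_bc // 10**6))  # 80 MB
```

Under these assumptions the saving grows with the number of jobs, because the broadcast cost is paid once while the per-task cost is paid on every run.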

Finally, let's remove the broadcast variable from the executor JVMs.

In [16]:
# Free up the memory on the executors.
broadcastVar.unpersist()

## Frequently Asked Questions about Broadcast Variables
**Q:** How is this different from using an RDD to keep data on an executor?  
**A:** With an RDD, the data is divided into partitions, and each executor holds only some of them.  A broadcast variable is sent in full to every executor.

**Q:** When should I use an RDD and when should I use a broadcast variable?  
**A:** Broadcast variables must fit into RAM (they're generally under 20 MB), and a full copy lives on every executor.  They're good for small datasets that you can afford to leave in memory on the executors.  RDDs are better for very large datasets that you want to partition and divide up between executors.
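
The contrast drawn in the answers above can be pictured with a toy model in plain Python. This is only an illustration under simplifying assumptions: `partition` and `replicate` are hypothetical helpers, each inner list stands for what one executor holds, and real Spark does not necessarily place exactly one partition on each executor.

```python
def partition(data, num_executors):
    """Split data into contiguous chunks, one per executor (RDD-like)."""
    chunk = (len(data) + num_executors - 1) // num_executors  # ceiling division
    return [data[i * chunk:(i + 1) * chunk] for i in range(num_executors)]


def replicate(data, num_executors):
    """Give every executor a full copy of the data (broadcast-like)."""
    return [list(data) for _ in range(num_executors)]


data = list(range(10))
partitioned = partition(data, 3)
replicated = replicate(data, 3)

# Number of elements held by each of the 3 executors in each scheme.
print([len(p) for p in partitioned])  # [4, 4, 2]
print([len(r) for r in replicated])   # [10, 10, 10]
```

Partitioning keeps the per-executor footprint small as the data grows, while replication multiplies it by the number of executors, which is why broadcast variables suit small datasets.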