# PySpark Tutorial part 3

## Broadcast & Accumulator

### Terminology

Apache Spark supports parallel processing through shared variables.
There are two types of shared variables:
* Broadcast: broadcast variables keep a read-only copy of data cached on every node.
* Accumulator: accumulator variables aggregate information through associative and commutative operations.
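
The "associative and commutative" requirement matters because partial results computed on different partitions may be merged in any order. The following is a plain-Python sketch with no Spark involved; the `partitions` data and the `merge_partials` helper are hypothetical names used only for illustration. It shows that addition gives the same total for every merge order, while subtraction does not.

```python
from functools import reduce
from itertools import permutations
import operator

# Hypothetical data split into three "partitions", as a cluster might do.
partitions = [[20, 30], [40], [50]]

def merge_partials(op, parts):
    """Reduce each partition locally, then merge the partial results in the given order."""
    partials = [reduce(op, part) for part in parts]
    return reduce(op, partials)

# Addition is associative and commutative: every merge order gives one total.
add_results = {merge_partials(operator.add, p) for p in permutations(partitions)}

# Subtraction is neither, so the result depends on the merge order.
sub_results = {merge_partials(operator.sub, p) for p in permutations(partitions)}

print(add_results)       # {140}
print(len(sub_results))  # 3
```

An operation like subtraction would therefore make an accumulator's final value depend on task scheduling, which is why only order-independent operations are suitable.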

### Broadcast

A broadcast variable is cached once on each machine instead of being shipped with every task. The following code block shows the signature of the `Broadcast` class in PySpark.

```python
class pyspark.Broadcast (
   sc = None, 
   value = None, # this stores data, used to return broadcasted value
   pickle_registry = None, 
   path = None
)
```


#### Example

```python
# import the needed module
from pyspark import SparkContext

sc = SparkContext('local', 'broadcast app')
words = ["scala", "java", "hadoop", "spark", "akka"]
bc = sc.broadcast(words)
data = bc.value
print(f'stored data: \n{data}')
print(f'type of .data: {type(data)}')
# you can also reference each element
print(f'an element: {bc.value[2]}')
sc.stop()
```

Output:

```
stored data: 
['scala', 'java', 'hadoop', 'spark', 'akka']
type of .data: <class 'list'>
an element: hadoop
```


### Accumulator

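Accumulators are typically used for sums or counters, as in MapReduce. Tasks running on workers can only add to an accumulator; only the driver program can read its value. The following code block shows the signature of the `Accumulator` class in PySpark.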

```python
class pyspark.Accumulator(aid, value, accum_param)
```
#### Example

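```python
from pyspark import SparkContext

sc = SparkContext('local', 'accumulator app')
num = sc.accumulator(10)  # initial value is 10

def f(x):
    # `num += x` rebinds the name, so it must be declared global;
    # otherwise Python treats `num` as an unassigned local variable
    global num
    num += x

data = [20, 30, 40, 50]
rdd = sc.parallelize(data)
rdd.foreach(f)  # foreach is an action and returns None
fin = num.value  # read on the driver: 10 + 20 + 30 + 40 + 50
print(f'accumulated value: {fin}')
print(f'type: {type(num)}')
sc.stop()
```

Output:

```
accumulated value: 150
type: <class 'pyspark.accumulators.Accumulator'>
```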
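
The `global num` declaration in `f` is needed because `num += x` is an assignment. Without the declaration, Python treats `num` as a local variable of the function and fails before the accumulator is ever touched. The following plain-Python sketch shows that scoping rule, with an ordinary integer `counter` standing in for the accumulator; the function names are illustrative only.

```python
counter = 10

def add_without_global(x):
    # Augmented assignment makes `counter` local to this function,
    # so reading it before assignment raises UnboundLocalError.
    counter += x

def add_with_global(x):
    global counter  # rebind the module-level name instead
    counter += x

try:
    add_without_global(5)
    failed = False
except UnboundLocalError:
    failed = True

add_with_global(5)
print(failed, counter)  # True 15
```

With a real accumulator, calling `num.add(x)` instead of `num += x` avoids the rebinding altogether, so no `global` declaration is needed in that form.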
