## Changing the number of workers

#### The effect of changing the number of workers
* When you initialize a `SparkContext`, you can specify the number of workers. In local mode, `master="local[4]"` runs Spark with 4 worker threads.
* The usual recommendation is one worker per core.
* The number of workers can, however, be smaller or larger than the number of cores.
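
The one-worker-per-core rule of thumb can be sketched as follows. This is a minimal illustration, assuming local mode; `local_master` is a hypothetical helper name, not part of PySpark. It only builds the master string that would be passed to `SparkContext(master=...)`.

```python
import os

def local_master(workers=None):
    """Build a Spark master string for local mode (hypothetical helper)."""
    if workers is None:
        # os.cpu_count() reports logical cores and may return None,
        # so fall back to a single worker in that case.
        workers = os.cpu_count() or 1
    if workers < 1:
        raise ValueError("need at least one worker")
    return "local[%d]" % workers

print(local_master())    # one worker per logical core on this machine
print(local_master(2))   # local[2]
```

Spark also accepts `local[*]`, which asks for as many worker threads as there are logical cores, so a helper like this is mainly useful when you want to vary the count explicitly.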

In [1]:
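from time import time
from pyspark import SparkContext

# Time the same job with 1 to 9 local worker threads.
for j in range(1, 10):
    sc = SparkContext(master="local[%d]" % j)
    t0 = time()
    # Repeat the reduce 10 times to smooth out run-to-run noise.
    for i in range(10):
        sc.parallelize([1, 2] * 1000000).reduce(lambda x, y: x + y)
    print("%2d executors, time=%4.3f" % (j, time() - t0))
    sc.stop()  # only one SparkContext can be active at a time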

 1 executors, time=23.134
 2 executors, time=11.996
 3 executors, time=14.768
 4 executors, time=16.016
 5 executors, time=16.815
 6 executors, time=14.804
 7 executors, time=15.229
 8 executors, time=18.107
 9 executors, time=16.711
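
To make the timings easier to compare, they can be converted to speedups relative to the single-worker run. The sketch below is plain Python and only reuses the numbers printed above; the variable names are illustrative.

```python
# Wall-clock times (seconds) copied from the run above, keyed by worker count.
times = {1: 23.134, 2: 11.996, 3: 14.768, 4: 16.016, 5: 16.815,
         6: 14.804, 7: 15.229, 8: 18.107, 9: 16.711}

baseline = times[1]
# Speedup = time with one worker / time with n workers.
speedup = {n: baseline / t for n, t in times.items()}
# Worker count with the smallest wall-clock time.
best = min(times, key=times.get)

for n in sorted(times):
    print("%2d workers: speedup %.2fx" % (n, speedup[n]))
print("fastest run: %d workers" % best)
```

In this particular run the fastest configuration is 2 workers. A single run is noisy, so repeating the measurement would be needed before drawing firm conclusions.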


In [1]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.getOrCreate()

In [5]:
sc = spark.sparkContext

In [23]:
a = sc.parallelize([[1,2,2,1],[3,4,4,3],[4,5,5,4],[6,7,7,6]])

In [27]:
a = sc.parallelize([1,3,4,6])

In [28]:
a.collect()

[1, 3, 4, 6]

In [29]:
a.reduce(lambda x,y: 1+1)  # the function ignores x and y, so any RDD with two or more elements reduces to 2

2

In [31]:
a.reduceByKey?

Signature: a.reduceByKey(func, numPartitions=None, partitionFunc=<function portable_hash at 0x104176a60>)
Docstring:
Merge the values for each key using an associative and commutative reduce function.

This will also perform the merging locally on each mapper before
sending results to a reducer, similarly to a "combiner" in MapReduce.

Output will be partitioned with C{numPartitions} partitions, or
the default parallelism level if C{numPartitions} is not specified.
Default partitioner is hash-partition.

>>> from operator import add
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.reduceByKey(add).collect())
[('a', 2), ('b', 1)]
File:      ~/anaconda/lib/python3.6/site-packages/pyspark/rdd.py
Type:      method

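The docstring describes merging values locally on each mapper before the final merge. The following is a pure-Python model of that idea, not PySpark's implementation: `combine_locally` and `reduce_by_key` are hypothetical names, and the split of the data into two partitions is an assumption made for illustration.

```python
from operator import add

def combine_locally(partition, func):
    """Merge values per key within one partition (the 'combiner' step)."""
    merged = {}
    for key, value in partition:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

def reduce_by_key(partitions, func):
    """Model reduceByKey: combine inside each partition, then across them."""
    result = {}
    for partition in partitions:
        for key, value in combine_locally(partition, func).items():
            result[key] = func(result[key], value) if key in result else value
    return sorted(result.items())

# The docstring's data, split into two assumed partitions.
partitions = [[("a", 1), ("b", 1)], [("a", 1)]]
print(reduce_by_key(partitions, add))  # [('a', 2), ('b', 1)]
```

Because values are merged in whatever order the partitions happen to be processed, the function must be associative and commutative, as the docstring requires.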

In [8]:
sc.parallelize?

Signature: sc.parallelize(c, numSlices=None)
Docstring:
Distribute a local Python collection to form an RDD. Using xrange
is recommended if the input represents a range for performance.

>>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
[[0], [2], [3], [4], [6]]
>>> sc.parallelize(xrange(0, 6, 2), 5).glom().collect()
[[], [0], [], [2], [4]]
File:      ~/anaconda/lib/python3.6/site-packages/pyspark/context.py
Type:      method

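The two `glom()` outputs in the docstring are consistent with cutting the input into `numSlices` contiguous chunks at evenly spaced index boundaries. The sketch below reproduces both outputs in plain Python. It is an illustrative model, not PySpark's code: `slice_like_parallelize` is a hypothetical name, and `range` stands in for Python 2's `xrange`.

```python
def slice_like_parallelize(data, num_slices):
    """Split a sequence into num_slices contiguous chunks (illustrative only)."""
    items = list(data)
    n = len(items)
    # Chunk i covers indices [i*n // num_slices, (i+1)*n // num_slices).
    bounds = [i * n // num_slices for i in range(num_slices + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_slices)]

print(slice_like_parallelize([0, 2, 3, 4, 6], 5))   # [[0], [2], [3], [4], [6]]
print(slice_like_parallelize(range(0, 6, 2), 5))    # [[], [0], [], [2], [4]]
```

When there are fewer elements than slices, some chunks come out empty, which is why the second example contains empty partitions.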

## Summary
* This machine has 4 cores.
* Going from 1 to 2 executors roughly halves the run time (23.1 s to 12.0 s).
* With 3 or more executors the time fluctuates between about 14.8 s and 18.1 s and never beats the 2-executor run.
* More than one worker per core is usually unhelpful.