### Q-1: Spark Transformations

CSE and DSE

Which of the following are properties of Spark transformations?
1. They are not computed right away
2. They are computed right away
3. They are vulnerable to machine failures
4. They define an execution plan, rather than a data structure in memory.

Correct: 1,4


### Q-2: Resilient Distributed Datasets

CSE

Which of the following are **not** properties of RDDs?
1. They can be changed after they are constructed 
2. They can be created by transformations applied to existing RDDs
3. They enable parallel operations on collections of distributed data
4. They track lineage information to enable efficient recomputation of lost data.
5. They always reside in memory.

Correct: 1,5


RDDs cannot be changed once they are created: they are immutable. You can create RDDs by applying transformations to existing RDDs, and Spark automatically tracks how you create and manipulate RDDs (their lineage) so that it can reconstruct any data that is lost due to slow or failed machines. RDDs also need not reside in memory: an RDD is only materialized when an action runs, and it can be persisted to disk.
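The immutability and lineage ideas above can be illustrated without Spark. The following is a toy sketch in plain Python, not Spark's implementation; the class `ToyRDD` and its methods are hypothetical names invented for this illustration.

```python
class ToyRDD(object):
    """Toy stand-in for an RDD: it stores how to compute data, not the result."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # immutable source data
        self._lineage = tuple(lineage)  # ordered functions applied so far

    def map(self, f):
        # A transformation returns a NEW object; self is left unchanged.
        return ToyRDD(self._source, self._lineage + (f,))

    def collect(self):
        # Recompute from the source by replaying the lineage.
        # Lost results could be rebuilt the same way.
        result = list(self._source)
        for f in self._lineage:
            result = [f(x) for x in result]
        return result


base = ToyRDD([1, 2, 3])
doubled = base.map(lambda x: 2 * x)
shifted = doubled.map(lambda x: x + 1)

print(base.collect())     # the original is untouched
print(shifted.collect())  # rebuilt by replaying both map steps
```

Because each object only records its source and the functions applied to it, any derived result can be recomputed on demand, which is the role lineage plays in recovering lost data.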

### Q-3: Spark Actions

DSE

Which of the following is not a property of Spark Actions?
1. They cause Spark to execute the recipe to transform the source data
2. They are the primary mechanism for getting results out of Spark
3. They are lazily evaluated
4. The results are returned to the driver

Correct: 3


### Q-4: Broadcasting variables

Which of the following is true?
1. In iterative or repeated computations, broadcasted variables avoid the problem of repeatedly sending the same data to workers.
2. Workers can modify broadcasted variables and can communicate the change to all other nodes.
3. Broadcasting caches the data on workers in deserialized form.
4. The value of a broadcasted variable `broadcastVar` can be accessed on workers using `broadcastVar.value`

Correct: 1,3,4


### Q-5: Hadoop Map Reduce and Spark Differences

Spark is often faster than a traditional MapReduce implementation because:
1. It sends less data over the network
2. Results do not need to be written to disk
3. It detects machine failures more quickly
4. It replicates the output of each task to recover from failures quickly
5. Results do not need to be serialized

Correct: 2,5

Spark keeps results in memory so they do not need to be serialized (converted into a format that can be stored on disk) and they do not need to be written to disk.


### Q-6: Gradient-Boosted Trees vs. Random Forests

Which of the following is true?
1. Gradient-Boosted Trees are faster as they train trees in parallel while Random Forests train trees sequentially.
2. Random Forests can be less prone to overfitting. Training more trees in a Random Forest reduces the likelihood of overfitting, but training more trees with GBTs increases the likelihood of overfitting.
3. Random Forests can be easier to tune since performance improves monotonically with the number of trees (whereas performance can start to decrease for GBTs if the number of trees grows too large).
4. If we compare the time required to construct a single tree on a single node, then Random Forests will be faster than Gradient-Boosted Trees.

Correct: 2,3,4


### Q-7: PCA distance metric

CSE and DSE

When working with two-dimensional data, projecting the data points onto the top principal component
(which is a line in 2D space) minimizes which distance between the original points and their
projections?

1. vertical distance
2. Euclidean distance
3. Manhattan distance
4. horizontal distance

Correct: 2

PCA minimizes the sum of squared Euclidean (perpendicular) distances between the points and their projections onto the principal component.
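A small numerical check can make the difference between perpendicular and vertical distance concrete. The sketch below uses plain Python on a made-up set of five 2D points (the data and all function names are assumptions for illustration). It compares the first principal component with the ordinary least-squares line, which minimizes vertical distance instead.

```python
import math

# Made-up 2D data, for illustration only.
points = [(0.0, 0.5), (1.0, 0.8), (2.0, 2.9), (3.0, 2.7), (4.0, 4.6)]
n = float(len(points))
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
centered = [(x - mx, y - my) for x, y in points]

sxx = sum(x * x for x, _ in centered)
syy = sum(y * y for _, y in centered)
sxy = sum(x * y for x, y in centered)

def perpendicular_sq(c, s):
    """Sum of squared Euclidean distances to the line through the mean
    with unit direction (c, s)."""
    return sum((x * s - y * c) ** 2 for x, y in centered)

def vertical_sq(slope):
    """Sum of squared vertical distances to the line through the mean."""
    return sum((y - slope * x) ** 2 for x, y in centered)

# First principal component of a 2x2 covariance matrix (closed form).
theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
pca_dir = (math.cos(theta), math.sin(theta))

# Least-squares regression of y on x, which minimizes vertical distance.
reg_slope = sxy / sxx
norm = math.hypot(1.0, reg_slope)
reg_dir = (1.0 / norm, reg_slope / norm)

pca_perp = perpendicular_sq(*pca_dir)
reg_perp = perpendicular_sq(*reg_dir)
pca_vert = vertical_sq(math.tan(theta))
reg_vert = vertical_sq(reg_slope)

print("PCA line:        perpendicular=%.4f vertical=%.4f" % (pca_perp, pca_vert))
print("Regression line: perpendicular=%.4f vertical=%.4f" % (reg_perp, reg_vert))
```

The principal component has the smaller perpendicular total, while the regression line has the smaller vertical total, so the two lines generally differ.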

## Short answer questions

For CSE and DSE

Consider the following methods for computing the variance of the elements of an RDD X:

```python
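import numpy as np

# Method 1: accumulate count, sum, and sum of squares in a single reduce
N,S,S2=X.map(lambda x: np.array([1.0,x,x*x])).reduce(lambda a,b:a+b)
print 'variance=',(S2/N)-(S/N)**2

# Method 2: compute the mean first, then average the squared deviations
N,S = X.map(lambda x: np.array([1.0,x])).reduce(lambda a,b:a+b)
mean=S/N
S2 = X.map(lambda x: (x-mean)*(x-mean)).reduce(lambda a,b:a+b)
print 'variance=',S2/N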
```

Which of these methods is faster? Explain why.
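One way to reason about this question is to count how many times each method has to scan the data, since every `reduce` is an action that triggers its own pass over `X`. The sketch below mimics both methods in plain Python on a small made-up list; the helper `scan` and the `passes` counter are hypothetical names used only to count scans, and nothing here measures real Spark timings.

```python
from functools import reduce  # built in on Python 2, needed on Python 3

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
passes = {"count": 0}

def scan(xs):
    """Yield the elements once and record that a full pass was started."""
    passes["count"] += 1
    for x in xs:
        yield x

def add(a, b):
    """Element-wise sum of two equal-length tuples."""
    return tuple(u + v for u, v in zip(a, b))

# First method: count, sum, and sum of squares in one reduce.
passes["count"] = 0
N, S, S2 = reduce(add, ((1.0, x, x * x) for x in scan(data)))
var_one_pass = S2 / N - (S / N) ** 2
passes_method_1 = passes["count"]

# Second method: one reduce for the mean, a second for squared deviations.
passes["count"] = 0
N, S = reduce(add, ((1.0, x) for x in scan(data)))
mean = S / N
S2 = reduce(lambda a, b: a + b, ((x - mean) ** 2 for x in scan(data)))
var_two_pass = S2 / N
passes_method_2 = passes["count"]

print("first method:  variance=%.1f passes=%d" % (var_one_pass, passes_method_1))
print("second method: variance=%.1f passes=%d" % (var_two_pass, passes_method_2))
```

Both formulas give the same variance here, but the second one scans the data twice. In floating point, the single-pass formula can lose precision when the mean is large relative to the spread, which is a separate consideration from speed.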

### Q-1 What is lazy evaluation and why is it more efficient?

Lazy evaluation means that transformations are not computed when they are defined; they are computed only when an action requires a result to be returned to the driver program. This design lets Spark run more efficiently: for example, it can recognize that a dataset created through `map` will be used in a `reduce` and return only the result of the `reduce` to the driver, rather than the larger mapped dataset.
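Python generators behave in a loosely similar way and can serve as a rough analogy. In this sketch (plain Python, not Spark), the generator expression plays the role of a transformation and `sum` plays the role of an action; the `log` list is an assumption added only to show when work actually happens.

```python
log = []

def square(x):
    log.append(x)  # record each time real work is done
    return x * x

numbers = range(1, 6)

# Like a transformation: this builds a recipe and performs no work yet.
squared = (square(x) for x in numbers)
work_before_action = len(log)

# Like an action: this forces evaluation and returns one small result.
total = sum(squared)
work_after_action = len(log)

print("calls before action: %d" % work_before_action)
print("calls after action:  %d" % work_after_action)
print("total: %d" % total)
```

No call to `square` happens until `sum` consumes the generator, and only the single total is kept rather than the full list of squares.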

### Q-2 Consider the following methods of estimating pi. Which one is more efficient and why?

`sc` refers to the SparkContext object.

#### Method 1


```python
from random import random
def sampleAndSum(p):
    points = [(random(),random()) for i in xrange(p)]
    return sum([1 for (a,b) in points if a*a + b*b < 1])

def calculate_pi(sc, NUM_SAMPLES):
    tasks = sc.defaultParallelism
    count = sc.parallelize([NUM_SAMPLES/tasks]*tasks) \
            .map(sampleAndSum) \
            .reduce(lambda a, b: a + b)
    return count


NUM_SAMPLES=10000000
count = calculate_pi(sc, NUM_SAMPLES)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
```

#### Method 2
```python
from random import random

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample) \
             .reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
```



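**Answer:** Method 1 is more efficient. In both methods the random numbers are generated in parallel on the workers, inside `map`. The difference is that Method 1 parallelizes an RDD with only one element per task, and each task draws and counts its share of the samples in a local loop. Method 2 builds an RDD with `NUM_SAMPLES` elements, one per sample, so Spark has to create and process 10,000,000 elements and call `sample` and the reduce function once per element, which adds substantial per-element overhead.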
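The structural difference between the two methods can be seen in a plain-Python sketch with no Spark involved. Here `tasks = 4` is a hypothetical stand-in for `sc.defaultParallelism`, the built-in `map` and `sum` stand in for the RDD `map` and `reduce`, and the sample count is reduced so the sketch runs quickly.

```python
from random import random, seed

def sample_and_sum(p):
    """Method 1 style: one call draws p points in a tight local loop."""
    hits = 0
    for _ in range(p):
        x, y = random(), random()
        if x * x + y * y < 1:
            hits += 1
    return hits

def sample(_):
    """Method 2 style: one call draws a single point."""
    x, y = random(), random()
    return 1 if x * x + y * y < 1 else 0

seed(0)
NUM_SAMPLES = 100000
tasks = 4  # hypothetical stand-in for sc.defaultParallelism

# Method 1: the collection holds one element per task.
batched_elements = [NUM_SAMPLES // tasks] * tasks
count_1 = sum(map(sample_and_sum, batched_elements))

# Method 2: the collection holds one element per sample.
per_sample_elements = range(NUM_SAMPLES)
count_2 = sum(map(sample, per_sample_elements))

pi_1 = 4.0 * count_1 / NUM_SAMPLES
pi_2 = 4.0 * count_2 / NUM_SAMPLES
print("Method 1: %d elements, pi is roughly %f" % (len(batched_elements), pi_1))
print("Method 2: %d elements, pi is roughly %f" % (len(per_sample_elements), pi_2))
```

Both versions draw the same number of random points and give similar estimates, but the second pushes one collection element per sample through `map` and the final sum.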

### Q-3 In assignment 5, you might have got different results by running Gradient Boosted Trees multiple times (with exactly the same training/test sets and same set of hyper-parameters). What might be introducing randomness in the algorithm?

It is because Stochastic Gradient Boosting is implemented in Spark. In Stochastic Gradient Boosting, at each iteration a subsample of training data is drawn at random (without replacement) and the loss gradient of this subsample is used to build the next tree.
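The subsampling step described above can be sketched in plain Python. This is only an illustration of drawing a random subsample without replacement, not Spark's implementation; `draw_subsample`, the 0.5 rate, and the seeds are all hypothetical choices.

```python
import random

def draw_subsample(training_rows, rate, rng):
    """Draw a random subsample without replacement, as one boosting
    iteration might before fitting its next tree."""
    k = int(len(training_rows) * rate)
    return sorted(rng.sample(training_rows, k))

rows = list(range(100))  # stand-in for training row indices

run_a = draw_subsample(rows, 0.5, random.Random(1))
run_b = draw_subsample(rows, 0.5, random.Random(2))
run_a_again = draw_subsample(rows, 0.5, random.Random(1))

print("run A and run B identical:       %s" % (run_a == run_b))
print("run A repeated with same seed:   %s" % (run_a == run_a_again))
```

Different random states select different rows, so the trees fitted to them differ from run to run; in this sketch, reusing the same seed reproduces the same subsample.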