# IST 718: Big Data Analytics

- Professor: Daniel Acuna <deacuna@syr.edu>

## General instructions:

- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers either from your classmates or from the internet__
- You can put the homework files anywhere you want in your http://notebook.acuna.io workspace but _do not change_ the file names. The TAs and the professor use these names to grade your homework.
- Remove or comment out code that contains `raise NotImplementedError`. This is mainly to make the `assert` statement fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` and `np.testing.` statements) are used to grade your answers. **However, the professor and TAs will use __additional__ tests for your answers. Think about cases your code should still handle correctly even if it passes all the tests you see.**
- Before downloading and submitting your work through Blackboard, remember to save and press `Validate` (or go to `Kernel`$\rightarrow$`Restart and Run All`).
- Good luck!

In [1]:
# this code creates the spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
import numpy as np

# Part 1: Map-Reduce: Gradient descent

Throughout this assignment, you should use vanilla Python and not Numpy.

Some statistical models $f(x)$ are learned by optimizing a loss function $L(\Theta)$ that depends on a set of parameters $\Theta$. There are several ways of finding the optimal $\Theta$ for the loss function, one of which is to iteratively update following the gradient:

$$
\nabla L = 
\begin{pmatrix} 
    \frac{\partial L}{\partial \theta_0}\\ 
    \frac{\partial L}{\partial \theta_1} \\ 
    \vdots\\ 
    \frac{\partial L}{\partial \theta_p}
\end{pmatrix}
$$

The parameters are then updated with a learning rate $\eta$:
$$\Theta^{t+1} = \Theta^t - \eta \nabla L$$

Because we assume independence between data points, the gradient becomes a summation:

$$\nabla L = \sum_{i=1}^{n} \nabla L_i$$
where $L_i$ is the loss function for the $i$-th data point.

Take as an example the statistical model $f(x) = b_0 + b_1 x$ with loss function $L(\Theta) = (f(x) - y)^2$. Suppose we have a set of three data points $D=\{ (x=1,y=2), (x=-2, y=-1), (x=4, y = 3)\}$.

Then the loss function for each of them is
$L_1 = \left(b_{0} + b_{1} - 2\right)^{2}$, 
$L_2 = \left(b_{0} - 2 b_{1} + 1\right)^{2}$, and
$L_3 = \left(b_{0} + 4 b_{1} - 3\right)^{2}$

with 
$$\nabla L_i = \left[\begin{matrix}2 b_{0} + 2 b_{1} x_i - 2 y_i\\2 x_i \left(b_{0} + b_{1} x_i - y_i\right)\end{matrix}\right]$$

If we start with the solution $b_0 = 0, b_1 = 1$, then the gradients are:

$$\nabla L_1 = \left[\begin{matrix}-2\\-2\end{matrix}\right]$$
$$\nabla L_2 = \left[\begin{matrix}-2\\4\end{matrix}\right]$$
$$\nabla L_3 = \left[\begin{matrix}2\\8\end{matrix}\right]$$

which after accumulation would yield
$$\nabla L = \left[\begin{matrix}-2\\10\end{matrix}\right]$$
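
The worked example can be checked in a few lines of plain Python. This is a minimal sketch: the helper names `grad_point` and `add_vectors` and the step size `eta = 0.1` are assumptions chosen for illustration and are not part of the assignment.

```python
# Data set D from the example above, as (x, y) pairs
D = [(1, 2), (-2, -1), (4, 3)]
b0, b1 = 0, 1  # starting solution

def grad_point(b0, b1, x, y):
    """Gradient of (b0 + b1*x - y)^2 with respect to (b0, b1)."""
    residual = b0 + b1 * x - y
    return [2 * residual, 2 * x * residual]

def add_vectors(u, v):
    """Element-wise sum of two equal-length lists."""
    return [ui + vi for ui, vi in zip(u, v)]

# Per-point gradients, then their accumulation
per_point = [grad_point(b0, b1, x, y) for x, y in D]
total = [0, 0]
for g in per_point:
    total = add_vectors(total, g)

print(per_point)  # [[-2, -2], [-2, 4], [2, 8]]
print(total)      # [-2, 10]

# One update step: Theta^{t+1} = Theta^t - eta * grad L (eta is a hypothetical value)
eta = 0.1
new_b = [b - eta * g for b, g in zip([b0, b1], total)]
```

Because the accumulation is a plain element-wise sum, the per-point gradients can be computed independently and combined in any order, which is what makes the computation a natural fit for map-reduce.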

## Question 1 (5 pts)

Create a function `f_linear(b, x)` that receives the parameters `b` and one data point `x` as lists and returns the prediction for that data point. Assume that the first element of `b` is the intercept.

In [2]:
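# define below the function `f_linear`, which performs a linear prediction based on the parameters and a data point
def f_linear(b, x):
    if len(b) == 0:
        # without at least an intercept there is nothing to predict with
        raise ValueError("No parameters: `b` must contain at least the intercept")
    # intercept plus the dot product of the remaining coefficients with x
    return b[0] + sum(b[i + 1] * x[i] for i in range(len(b) - 1))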

In [3]:
# for the example above, if we assume b = [0, 1], and the first data point x = [1], y = 2
f_linear([0, 1], [1])

1

In [4]:
# test (5 pts)
assert f_linear([1, 1, 2, 3], [2, 1, 3]) == 14
assert f_linear([1], []) == 1
assert f_linear([0, 1, 0, 1, 0, 1], [0, 10, 10, 10 , 10]) == 20

## Question 2 (5 pts)
Define the function `L(y_pred, y)` that receives a prediction `y_pred` and the actual value `y` and returns the squared error between them.

In [5]:
def L(y_pred, y):
    return (pow((y_pred-y),2))

In [6]:
# there should be no error here
L(1, 1)

0

In [7]:
# 2^2 error
L(0, 2)

4

In [8]:
# (5 pts)
assert L(1, 1) == 0
assert L(0, 4) == 16

## Question 3 (10 pts)
Create a function `gf_linear(f, b, x, y)` which returns the gradient of the squared loss of the linear function `f` with respect to its parameters `b`, evaluated at the data point `x` with actual outcome `y`. This function should return a vector in which element $j$ is the partial derivative with respect to $b_j$, for $j \in \{0, 1, \dots, p\}$.

In [9]:
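# define `gf_linear`
def gf_linear(f, b, x, y):
    # use the function passed in as `f` rather than hard-coding `f_linear`
    y_pred = f(b, x)
    # partial derivative with respect to the intercept b_0
    gradient = [2 * (y_pred - y)]
    # partial derivatives with respect to b_1, ..., b_p
    for i in range(len(x)):
        gradient.append(2 * x[i] * (y_pred - y))
    return gradient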

In [10]:
# for the example above and first data point
x = [1]
y = 2
b = [0, 1]
gf_linear(f_linear, b, x, y)

[-2, -2]

In [11]:
# for the example above and second data point
x = [-2]
y = -1
b = [0, 1]
gf_linear(f_linear, b, x, y)

[-2, 4]

In [12]:
## (10 pts)
np.testing.assert_array_equal(gf_linear(f_linear, [0, 1], [1], 2), [-2, -2])
np.testing.assert_array_equal(gf_linear(f_linear, [0, 1], [-2], -1), [-2, 4])
np.testing.assert_array_equal(gf_linear(f_linear, [1], [], 0), [2])
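
One way to gain extra confidence in an analytic gradient, beyond the tests shown, is to compare it against a finite-difference approximation. The sketch below restates `f_linear`, `L`, and `gf_linear` so that it runs on its own; `numerical_gradient`, the step `h`, and the test point are hypothetical choices made for illustration.

```python
def f_linear(b, x):
    """Linear prediction: intercept b[0] plus dot product of b[1:] and x."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))

def L(y_pred, y):
    """Squared error."""
    return (y_pred - y) ** 2

def gf_linear(f, b, x, y):
    """Analytic gradient of the squared error with respect to b."""
    residual = f(b, x) - y
    return [2 * residual] + [2 * xj * residual for xj in x]

def numerical_gradient(f, loss, b, x, y, h=1e-6):
    """Central-difference approximation of the gradient (hypothetical helper)."""
    grad = []
    for j in range(len(b)):
        b_plus = list(b)
        b_plus[j] += h
        b_minus = list(b)
        b_minus[j] -= h
        grad.append((loss(f(b_plus, x), y) - loss(f(b_minus, x), y)) / (2 * h))
    return grad

# Arbitrary test point, chosen only for this illustration
b, x, y = [0.5, -1.0, 2.0], [3.0, 1.5], 4.0
analytic = gf_linear(f_linear, b, x, y)
numeric = numerical_gradient(f_linear, L, b, x, y)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic)  # [-7.0, -21.0, -10.5]
```

If `max_gap` is not close to zero, the analytic gradient and the loss function probably disagree somewhere.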

## Question 4 (15 pts)

Develop a map-reduce job that produces a value so that the first element of the value is the mean loss function across all the data. You might use other pieces of information as part of the value to create your computation.

You will implement your map function as `map_mse(f, b, L, xy)`, where `f` is the function, `b` are the parameters of the function, `L` is the loss function, and `xy` is the data. Assume that the data will come as an RDD where each element is of the format:

`[x, y]` where `x` is a list and `y` is a scalar.

Since the key does not matter for this map reduce job, just put a constant of your choice.

In [13]:
# data
rdd_data = sc.parallelize([
    [[1, 2], 3],
    [[3, 1], 4],
    [[-1, 1.5], 0],
    [[-9, 3], 0]
])

In [14]:
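# create function `map_mse` below
def map_mse(f, b, L, xy):
    y_pred = f(b, xy[0])
    squared_error = L(y_pred, xy[1])
    # constant key 1; the value is [mean loss so far, number of points it covers]
    return [1, [squared_error, 1]]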

You should apply the map function in the following way:

```python
# for an example set of `b = [0, 0, 0]`
rdd_data.map(lambda x: map_mse(f_linear, [0, 0, 0], L, x))
```


In [15]:
# try it here
rdd_data.map(lambda x: map_mse(f_linear, [0, 0, 0], L, x)).collect()

[[1, [9, 1]], [1, [16, 1]], [1, [0.0, 1]], [1, [0, 1]]]

In [16]:
# (10 pts)
assert rdd_data.map(lambda x: map_mse(f_linear, [0, 0, 0], L, x)).count() == 4
assert rdd_data.map(lambda x: map_mse(f_linear, [0, 0, 0], L, x)).map(lambda x: len(x)).\
    distinct().\
    first() == 2

assert rdd_data.map(lambda x: map_mse(f_linear, [0, 0, 0], L, x)).count() == 4
# the first element should be a number
assert isinstance((rdd_data.map(lambda x: map_mse(f_linear, [0, 0, 0], L, x)).first()[1][0]), 
                  (int, float, complex))
# try with other initializations
assert isinstance((rdd_data.map(lambda x: map_mse(f_linear, [1, 2, 3], L, x)).first()[1][0]), 
                  (int, float, complex))

You will now create a reduce job that receives two values from a previous reduce (or map) and merges them appropriately. Remember that at the end of the reduce job, the first element of the value should be the mean squared error. Create the function `reduce_mse(v1, v2)` below.

In [17]:
# create function `reduce_mse` below
def reduce_mse(v1, v2):
    # each value is [mean, count]; merge them into the count-weighted mean
    count = v1[1] + v2[1]
    return [(v1[0] * v1[1] + v2[0] * v2[1]) / count, count]

In [18]:
# the following function call should return the mean squared error
rdd_data.\
    map(lambda x: map_mse(f_linear, [0, 0, 0], L, x)).\
    reduceByKey(reduce_mse).first()[1][0]

6.25
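
`reduceByKey` expects a merge function that is associative and commutative, since Spark may combine values in any grouping. Carrying a count next to the mean is what makes the weighted merge satisfy this. The following sketch simulates the reduce locally with `functools.reduce`; it is not Spark, and the two-partition split is an assumption made only to illustrate the point.

```python
from functools import reduce

def reduce_mse(v1, v2):
    """Merge two [mean, count] pairs into the pair for their union."""
    count = v1[1] + v2[1]
    return [(v1[0] * v1[1] + v2[0] * v2[1]) / count, count]

# Mapped values for b = [0, 0, 0], taken from the map output above
values = [[9, 1], [16, 1], [0.0, 1], [0, 1]]

# Strict left-to-right reduction
left_to_right = reduce(reduce_mse, values)

# Two hypothetical partitions reduced separately, then merged
pairwise = reduce_mse(reduce_mse(values[0], values[1]),
                      reduce_mse(values[2], values[3]))

# Both groupings agree on the mean (up to floating-point rounding) and the count
print(left_to_right[1], pairwise[1])
```

A plain average of two means, `(v1[0] + v2[0]) / 2`, would only be correct when both sides cover the same number of points, which is not guaranteed once partial results are merged.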

In [19]:
# the following function call should return the mean squared error
rdd_data.\
    map(lambda x: map_mse(f_linear, [2, 2, 3], L, x)).\
    reduceByKey(reduce_mse).first()[1][0]

41.8125

In [20]:
# (5 pts)
assert rdd_data.\
    map(lambda x: map_mse(f_linear, [0, 0, 0], L, x)).\
    reduceByKey(reduce_mse).first()[1][0] == 6.25

assert rdd_data.\
    map(lambda x: map_mse(f_linear, [2, 0, 0], L, x)).\
    reduceByKey(reduce_mse).first()[1][0] == 3.25

assert rdd_data.\
    map(lambda x: map_mse(f_linear, [2, 2, 3], L, x)).\
    reduceByKey(reduce_mse).first()[1][0] == 41.8125

## Question 5 (10 pts)

In this question, you will compute the cumulative gradient of a model on the data. You will define a map function `map_gradient(f, gf, b, xy)` that receives a function `f`, its gradient `gf`, its parameters `b`, and a data point `xy = [x, y]`. You will also define a function `reduce_gradient(v1, v2)` that combines two values appropriately. In the map function, you probably do not need to keep extra values beyond the actual gradient.

In [21]:
# define the function `map_gradient` below
def map_gradient(f, gf, b, xy):
    # use the gradient function passed in as `gf` rather than hard-coding `gf_linear`
    return (1, gf(f, b, xy[0], xy[1]))

In [22]:
# (5 pts)
assert len(rdd_data.map(lambda xy: map_gradient(f_linear, gf_linear, [0, 0, 0], xy)[1]).first()) == 3

In [23]:
# define the function `reduce_gradient` below
def reduce_gradient(v1, v2):
    result=[]
    for i in range(len(v1)):
        result.append(v1[i]+v2[i])
    return result

In [24]:
rdd_data.map(lambda xy: map_gradient(f_linear, gf_linear, [0, 0, 0], xy)).\
    reduceByKey(reduce_gradient).first()[1]

[-14.0, -30.0, -20.0]
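
For debugging without a Spark session, the same job can be simulated on a plain list. This is a sketch only: `functools.reduce` stands in for `reduceByKey` on a single constant key, and the functions are restated so the block is self-contained.

```python
from functools import reduce

def f_linear(b, x):
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))

def gf_linear(f, b, x, y):
    residual = f(b, x) - y
    return [2 * residual] + [2 * xj * residual for xj in x]

def map_gradient(f, gf, b, xy):
    # constant key 1, value is the gradient at this data point
    return (1, gf(f, b, xy[0], xy[1]))

def reduce_gradient(v1, v2):
    # element-wise sum of two gradients
    return [g1 + g2 for g1, g2 in zip(v1, v2)]

# Same records as `rdd_data`, held in a local list instead of an RDD
data = [[[1, 2], 3], [[3, 1], 4], [[-1, 1.5], 0], [[-9, 3], 0]]

mapped = [map_gradient(f_linear, gf_linear, [0, 0, 0], xy) for xy in data]
gradient = reduce(reduce_gradient, (value for _, value in mapped))
print(gradient)  # [-14.0, -30.0, -20.0]
```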

In [25]:
# (5 pts)
np.testing.assert_array_equal(
    rdd_data.map(lambda xy: map_gradient(f_linear, gf_linear, [0, 0, 0], xy)).\
    reduceByKey(reduce_gradient).first()[1],
    [-14.0, -30.0, -20.0])

np.testing.assert_array_equal(
    rdd_data.map(lambda xy: map_gradient(f_linear, gf_linear, [0, 0, 0], xy)).\
    reduceByKey(reduce_gradient).first()[1],
    [-14.0, -30.0, -20.0])

If all your answers are correct, then we can run the optimization below, and the MSE should decrease with each iteration.

In [26]:
b = [0, 0, 0]
learning_rate = 0.01
print("Initial solution: \t", b)
for _ in range(10):
    print("New iteration")
    print("=============")
    gradient = rdd_data.map(lambda xy: map_gradient(f_linear, gf_linear, b, xy)).\
        reduceByKey(reduce_gradient).first()[1]
    b = [b0 - learning_rate*g0 for b0, g0 in zip(b, gradient)]
    print("Current solution: \t", b)
    mse = rdd_data.\
        map(lambda x: map_mse(f_linear, b, L, x)).\
        reduceByKey(reduce_mse).first()[1][0]
    print("Current MSE: \t\t", mse)
    

Initial solution: 	 [0, 0, 0]
New iteration
Current solution: 	 [0.14, 0.3, 0.2]
Current MSE: 		 4.036099999999999
New iteration
Current solution: 	 [0.27480000000000004, 0.15880000000000008, 0.455]
Current MSE: 		 2.807733502499999
New iteration
Current solution: 	 [0.34362200000000004, 0.41343399999999997, 0.540541]
Current MSE: 		 2.1247451622975624
New iteration
Current solution: 	 [0.42466317000000003, 0.24800435, 0.707635855]
Current MSE: 		 1.7435970621414332
New iteration
Current solution: 	 [0.45430526015, 0.47522477825, 0.730516771125]
Current MSE: 		 1.5295387322303051
New iteration
Current solution: 	 [0.50541029705925, 0.29867069991675, 0.8483086772643751]
Current MSE: 		 1.4080112827808984
New iteration
Current solution: 	 [0.5135716556948637, 0.5084709260312963, 0.8371720415554382]
Current MSE: 		 1.337759564870712
New iteration
Current solution: 	 [0.5479266281297145, 0.3279838803481508, 0.9270367149304005]
Current MSE: 		 1.2959552690942717
New iteration
Current soluti

**(5 pts)** In the code above, try values of `learning_rate` below 1.0 until the optimizer diverges (the loss function goes down and then goes *up*). What is this learning rate?

In [27]:
b = [0, 0, 0]
learning_rate = 0.001
print("Initial solution: \t", b)
while learning_rate <= 1:
    for _ in range(10): 
        print("New iteration")
        print("=============")
        print("Learning Rate:", learning_rate)
        gradient = rdd_data.map(lambda xy: map_gradient(f_linear, gf_linear, b, xy)).\
            reduceByKey(reduce_gradient).first()[1]
        b = [b0 - learning_rate*g0 for b0, g0 in zip(b, gradient)]
        print("Current solution: \t", b)
        mse = rdd_data.\
            map(lambda x: map_mse(f_linear, b, L, x)).\
            reduceByKey(reduce_mse).first()[1][0]
        print("Current MSE: \t\t", mse)
    learning_rate=learning_rate + 0.001


Initial solution: 	 [0, 0, 0]
New iteration
Learning Rate: 0.001
Current solution: 	 [0.014, 0.03, 0.02]
Current MSE: 		 5.891261
New iteration
Learning Rate: 0.001
Current solution: 	 [0.027948, 0.055588, 0.04055]
Current MSE: 		 5.58415718675025
New iteration
Learning Rate: 0.001
Current solution: 	 [0.041783222, 0.077601034, 0.061425541]
Current MSE: 		 5.314090249091918
New iteration
Learning Rate: 0.001
Current solution: 	 [0.055458785517, 0.096710842835, 0.0824497111855]
Current MSE: 		 5.071490223926492
New iteration
Learning Rate: 0.001
Current solution: 	 [0.0689388996791015, 0.11345668960528252, 0.10348362340246126]
Current MSE: 		 4.850025802529529
New iteration
Learning Rate: 0.001
Current solution: 	 [0.08219661440589515, 0.12827165581397543, 0.12441878655814302]
Current MSE: 		 4.64545465790137
New iteration
Learning Rate: 0.001
Current solution: 	 [0.09521201956204355, 0.1415037134853074, 0.1451709946021718]
Current MSE: 		 4.454884757939677
New iteration
Learning Rate: 

Current solution: 	 [0.5716084613819287, 0.45224668686011077, 0.9895080325774727]
Current MSE: 		 1.005911233754649
New iteration
Learning Rate: 0.006
Current solution: 	 [0.5716772937575515, 0.4531634189728946, 0.9926427703950431]
Current MSE: 		 1.0054926869498257
New iteration
Learning Rate: 0.006
Current solution: 	 [0.5715267004876835, 0.4539570308287649, 0.9954185578801863]
Current MSE: 		 1.0051631750339605
New iteration
Learning Rate: 0.006
Current solution: 	 [0.571190654874729, 0.4546464245511342, 0.9978904187433701]
Current MSE: 		 1.004897503053821
New iteration
Learning Rate: 0.006
Current solution: 	 [0.5706979083215203, 0.4552475970832928, 1.0001049198731071]
Current MSE: 		 1.0046776255038936
New iteration
Learning Rate: 0.006
Current solution: 	 [0.5700727929235048, 0.4557740867067033, 1.002101471126403]
Current MSE: 		 1.0044906175156267
New iteration
Learning Rate: 0.006
Current solution: 	 [0.5693359007046829, 0.4562373509306409, 1.0039134253449293]
Current MSE: 		 

Current MSE: 		 0.9985261903086873
New iteration
Learning Rate: 0.010000000000000002
Current solution: 	 [0.4943924778504847, 0.464822631940678, 1.0555877855433213]
Current MSE: 		 0.9983749939961158
New iteration
Learning Rate: 0.011000000000000003
Current solution: 	 [0.4920705426011635, 0.4650203170949069, 1.0569536940795845]
Current MSE: 		 0.9982093526042282
New iteration
Learning Rate: 0.011000000000000003
Current solution: 	 [0.4897536571856574, 0.4652175667573141, 1.0583166128550079]
Current MSE: 		 0.9980444321490226
New iteration
Learning Rate: 0.011000000000000003
Current solution: 	 [0.48744181304420875, 0.46541438323505613, 1.0596765523372407]
Current MSE: 		 0.9978802294926805
New iteration
Learning Rate: 0.011000000000000003
Current solution: 	 [0.48513500094770107, 0.46561076844749155, 1.0610335218569067]
Current MSE: 		 0.9977167415110496
New iteration
Learning Rate: 0.011000000000000003
Current solution: 	 [0.48283321119298267, 0.46580672403488593, 1.062387529924045]


Current MSE: 		 0.9906144300842954
New iteration
Learning Rate: 0.015000000000000006
Current solution: 	 [0.37670931678933606, 0.4748410220607639, 1.1248132047355923]
Current MSE: 		 0.9904346863539282
New iteration
Learning Rate: 0.015000000000000006
Current solution: 	 [0.37389261168004495, 0.47508078753372857, 1.126470091702229]
Current MSE: 		 0.990256008994519
New iteration
Learning Rate: 0.015000000000000006
Current solution: 	 [0.3710842694015092, 0.4753198986931172, 1.1281220395806608]
Current MSE: 		 0.990078391679585
New iteration
Learning Rate: 0.015000000000000006
Current solution: 	 [0.36828427993244045, 0.47555818469675126, 1.1297691132483967]
Current MSE: 		 0.9899018281201976
New iteration
Learning Rate: 0.015000000000000006
Current solution: 	 [0.3654925891050736, 0.47579599016167656, 1.1314112277662138]
Current MSE: 		 0.9897263120648235
New iteration
Learning Rate: 0.016000000000000007
Current solution: 	 [0.36252367314677475, 0.47604841551406774, 1.1331577525441456]

Current MSE: 		 5.720225871586498e+19
New iteration
Learning Rate: 0.02000000000000001
Current solution: 	 [-370528384.57637715, 4319158523.331398, -1254971609.8425398]
Current MSE: 		 5.058383617006062e+20
New iteration
Learning Rate: 0.02000000000000001
Current solution: 	 [1101845685.7881415, -12843944967.478472, 3731927464.2595415]
Current MSE: 		 4.473117913733488e+21
New iteration
Learning Rate: 0.02000000000000001
Current solution: 	 [-3276574655.130659, 38194227294.43546, -11097687362.275375]
Current MSE: 		 3.9555686925156875e+22
New iteration
Learning Rate: 0.02000000000000001
Current solution: 	 [9743598049.317377, -113578733186.25735, 33001355476.912178]
Current MSE: 		 3.497901012886757e+23
New iteration
Learning Rate: 0.02000000000000001
Current solution: 	 [-28974680246.06884, 337750742619.9037, -98136614192.55792]
Current MSE: 		 3.093186453595048e+24
New iteration
Learning Rate: 0.02000000000000001
Current solution: 	 [86162431080.12653, -1004374330820.8036, 2918302871

Current MSE: 		 4.725010107256311e+65
New iteration
Learning Rate: 0.024000000000000014
Current solution: 	 [-4.26756951212261e+31, 4.974601133290383e+32, -1.445416547124287e+32]
Current MSE: 		 6.710115618067539e+66
New iteration
Learning Rate: 0.024000000000000014
Current solution: 	 [1.6082154667728681e+32, -1.8746573338425292e+33, 5.4469909404206285e+32]
Current MSE: 		 9.52921805155231e+67
New iteration
Learning Rate: 0.024000000000000014
Current solution: 	 [-6.0604917628654395e+32, 7.064566635928591e+33, -2.0526754286889443e+33]
Current MSE: 		 1.353270224279411e+69
New iteration
Learning Rate: 0.024000000000000014
Current solution: 	 [2.2838706110359284e+33, -2.6622519674663743e+34, 7.735420274479045e+33]
Current MSE: 		 1.9218159244691887e+70
New iteration
Learning Rate: 0.024000000000000014
Current solution: 	 [-8.606669511398591e+33, 1.0032583601424213e+35, -2.915060315260827e+34]
Current MSE: 		 2.7292231671690015e+71
New iteration
Learning Rate: 0.025000000000000015
Curren

Current MSE: 		 6.731607661390951e+117
New iteration
Learning Rate: 0.028000000000000018
Current solution: 	 [-6.168002660214599e+57, 7.189889452645452e+58, -2.0890891366750076e+58]
Current MSE: 		 1.4017091242838723e+119
New iteration
Learning Rate: 0.028000000000000018
Current solution: 	 [2.814583287059725e+58, -3.280890720063147e+59, 9.532932609109684e+58]
Current MSE: 		 2.9187507174098685e+120
New iteration
Learning Rate: 0.028000000000000018
Current solution: 	 [-1.2843507884479906e+59, 1.497136219950624e+60, -4.350068292177626e+59]
Current MSE: 		 6.077655915047993e+121
New iteration
Learning Rate: 0.02900000000000002
Current solution: 	 [6.115932346997386e+59, -7.129192365367711e+60, 2.0714530344101688e+60]
Current MSE: 		 1.3781425478785985e+123
New iteration
Learning Rate: 0.02900000000000002
Current solution: 	 [-2.91233740886699e+60, 3.394840302781099e+61, -9.864023701611999e+60]
Current MSE: 		 3.1250154810027886e+124
New iteration
Learning Rate: 0.02900000000000002
Curre

Current solution: 	 [-1.8098593609851191e+87, 2.10970867672515e+88, -6.129954441057321e+87]
Current MSE: 		 1.2068643325804345e+178
New iteration
Learning Rate: 0.03300000000000002
Current solution: 	 [1.0056717318469926e+88, -1.1722868772852974e+89, 3.4061883656671396e+88]
Current MSE: 		 3.726329728623059e+179
New iteration
Learning Rate: 0.03300000000000002
Current solution: 	 [-5.588144880415628e+88, 6.513963457687149e+89, -1.8926925630470103e+89]
Current MSE: 		 1.1505463266721045e+181
New iteration
Learning Rate: 0.03300000000000002
Current solution: 	 [3.1051248847537935e+89, -3.619567935993963e+90, 1.0516990705274263e+90]
Current MSE: 		 3.552441534227361e+182
New iteration
Learning Rate: 0.03300000000000002
Current solution: 	 [-1.7254027510468072e+90, 2.0112596775185684e+91, -5.843901733135192e+90]
Current MSE: 		 1.0968563856620953e+184
New iteration
Learning Rate: 0.03300000000000002
Current solution: 	 [9.587423256105008e+90, -1.117582419212491e+92, 3.247239483573349e+91]


Current solution: 	 [-3.555670742836342e+119, 4.144758194723212e+120, -1.2042979764529012e+120]
Current MSE: 		 4.658132867870888e+242
New iteration
Learning Rate: 0.037000000000000026
Current solution: 	 [2.2583387950927892e+120, -2.6324901556985965e+121, 7.6489445670828e+120]
Current MSE: 		 1.8790874238879834e+244
New iteration
Learning Rate: 0.037000000000000026
Current solution: 	 [-1.4343550014287413e+121, 1.671992452701528e+122, -4.858129311370938e+121]
Current MSE: 		 7.580225053193671e+245
New iteration
Learning Rate: 0.037000000000000026
Current solution: 	 [9.11012233679983e+121, -1.0619446214601326e+123, 3.085578696382516e+122]
Current MSE: 		 3.0578572942697886e+247
New iteration
Learning Rate: 0.037000000000000026
Current solution: 	 [-5.7861776832645825e+122, 6.744805439916767e+123, -1.959765844290157e+123]
Current MSE: 		 1.2335374169635563e+249
New iteration
Learning Rate: 0.037000000000000026
Current solution: 	 [3.675016750002258e+123, -4.283876908739414e+124, 1.2447

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 3332.0 failed 1 times, most recent failure: Lost task 1.0 in stage 3332.0 (TID 3337, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/pyspark/rdd.py", line 2499, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/local/spark/python/pyspark/rdd.py", line 2499, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/local/spark/python/pyspark/rdd.py", line 352, in func
    return f(iterator)
  File "/usr/local/spark/python/pyspark/rdd.py", line 1861, in combineLocally
    merger.mergeValues(iterator)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 238, in mergeValues
    for k, v in iterator:
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-27-337b66369781>", line 14, in <lambda>
  File "<ipython-input-14-725e106158f4>", line 4, in map_mse
  File "<ipython-input-5-45cc549bb075>", line 2, in L
OverflowError: (34, 'Numerical result out of range')

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
	at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:153)
	at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
	at sun.reflect.GeneratedMethodAccessor68.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)


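In the sweep above, the optimizer was found to diverge at a learning rate of about 0.017 (printed as 0.017000000000000008 because the rate is built up by repeated floating-point additions of 0.001). From that point on, the MSE started to increase again.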
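
Note that the sweep above does not reset `b` when the learning rate changes, so each rate starts from the solution left by the previous one, and the rate at which divergence becomes visible may depend on that starting point. A minimal sketch of an alternative, assuming plain Python instead of Spark and hypothetical helper names (`predict`, `mse`, `gradient`, `mse_history`, `diverges`), restarts from `b = [0, 0, 0]` for every rate:

```python
# Same records as `rdd_data`, as (x, y) pairs in a local list
DATA = [([1, 2], 3), ([3, 1], 4), ([-1, 1.5], 0), ([-9, 3], 0)]

def predict(b, x):
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))

def mse(b):
    return sum((predict(b, x) - y) ** 2 for x, y in DATA) / len(DATA)

def gradient(b):
    """Summed gradient of the squared error over all data points."""
    grad = [0.0] * len(b)
    for x, y in DATA:
        residual = predict(b, x) - y
        for j, xj in enumerate([1] + list(x)):  # 1 multiplies the intercept
            grad[j] += 2 * xj * residual
    return grad

def mse_history(learning_rate, n_iter=10):
    """MSE before and after each of n_iter steps, starting from b = 0."""
    b = [0.0, 0.0, 0.0]  # reset for every learning rate
    history = [mse(b)]
    for _ in range(n_iter):
        b = [bj - learning_rate * gj for bj, gj in zip(b, gradient(b))]
        history.append(mse(b))
    return history

def diverges(learning_rate, n_iter=10):
    """True if the MSE ever goes up from one iteration to the next."""
    history = mse_history(learning_rate, n_iter)
    return any(later > earlier for earlier, later in zip(history, history[1:]))

for rate in (0.005, 0.05):
    print(rate, diverges(rate))
```

Stopping at the first increase in MSE, as `diverges` does, also avoids running the optimizer long enough to reach the `OverflowError` seen in the output above.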