### Coding Challenge #2:

**Question 1:** This question is meant to give you exposure to Spark MLlib data types (specifically LabeledPoint and dense vectors).

**Dataset**: https://www.dropbox.com/s/cv8kpsqsgxzw5ar/Spiders.csv?raw=1

In 2006, Japanese researchers conducted a study to uncover the presence/absence of an endangered burrowing spider based on grain size. The dataset is representative of some of the research they undertook. If you are interested in reviewing the paper, it can be accessed via this link: 
https://www.jstage.jst.go.jp/article/asjaa/55/2/55_2_79/_pdf

**ASK:**

**Step 1:** Import the requisite packages

from pyspark.mllib.regression import LabeledPoint

from pyspark import SparkContext, SparkConf

from pyspark.mllib.linalg import Vectors

**Step 2:** Read in the "Spiders.csv" file

**Step 3:** Ignore the header row

**Step 4:** Create an RDD of LabeledPoints, with the presence or absence of spiders as the label and a dense vector of the grain size as the features

**Step 5:** Convert the RDD into a list/collection and output the list of LabeledPoints





In [1]:
# Step 1
from pyspark import SparkContext, SparkConf, SparkFiles
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.sql import SQLContext

import os

In [2]:
conf = SparkConf().setAppName("CC2").setMaster("local[2]")
sc = SparkContext(conf=conf)

In [3]:
sqlContext = SQLContext(sc)

In [4]:
# Step 2
sc.addFile('https://uc08fd544110d6fbed318ef426e7.dl.dropboxusercontent.com/cd/0/inline/AJR66J7T_NONvcHsz_wksCUeVZMWiOFJy61-hl8ZEZjxBYpDw6RiDLEVBw0UWHDJ2ROYoCSDO5Pjlnd3q45hOUIfOmRP2Vk8iL2fuOcir-oweADIEUhi-x_nVa_Qkdqrk2Qg2NXphK0Qsb-RlqtmxvWUoSuFdAUj6e9Ptfynekt4IxKm5UwS7TU7r0QmCbpTIaU/file')

In [5]:
df = sqlContext.read.format("csv").option("header", "true")\
                                  .option("inferSchema", "true")\
                                  .load(SparkFiles.get('file'))

# (Step 3)

In [6]:
df.show(5)

+---------------+-------+
|Grain Size (mm)| Spider|
+---------------+-------+
|          0.245| Absent|
|          0.247| Absent|
|          0.285|Present|
|          0.299|Present|
|          0.327|Present|
+---------------+-------+
only showing top 5 rows



In [7]:
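# Step 4
# Each Row is (grain size, spider); keys() takes column 0 and values() takes column 1
features = df.rdd.keys()
# Encode the label: 0 for 'Absent', 1 for anything else (i.e. 'Present')
labels = df.rdd.values().map(lambda x: 0 if x == 'Absent' else 1)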

In [8]:
data = labels.zip(features).map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1])))

In [9]:
# Step 5
data.collect()[:10]

[LabeledPoint(0.0, [0.245]),
 LabeledPoint(0.0, [0.247]),
 LabeledPoint(1.0, [0.285]),
 LabeledPoint(1.0, [0.299]),
 LabeledPoint(1.0, [0.327]),
 LabeledPoint(1.0, [0.347]),
 LabeledPoint(0.0, [0.356]),
 LabeledPoint(1.0, [0.36]),
 LabeledPoint(0.0, [0.363]),
 LabeledPoint(1.0, [0.364])]
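
The `keys()`/`values()`/`zip` route above can also be written as a single function applied to each row. The following is a minimal sketch in plain Python, so it runs without a Spark session. `encode_row` and `sample_rows` are hypothetical names introduced here; in the notebook, the returned pair would presumably be wrapped as `LabeledPoint(label, Vectors.dense(features))` inside `df.rdd.map(...)`.

```python
def encode_row(row):
    """Turn a (grain_size, spider) pair into a (label, features) pair."""
    grain_size, spider = row
    # 0.0 marks 'Absent'; anything else is treated as 'Present' (1.0),
    # mirroring the lambda used in the notebook cell.
    label = 0.0 if spider.strip() == "Absent" else 1.0
    return label, [float(grain_size)]


# Two rows copied from the df.show(5) output, as plain tuples.
sample_rows = [(0.245, "Absent"), (0.285, "Present")]
encoded = [encode_row(r) for r in sample_rows]
print(encoded)
```

Doing the encoding in one pass avoids building two RDDs and zipping them, which only works when both sides have the same partitioning and element counts.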

In [10]:
os.remove(SparkFiles.get('file'))

**Question 2**:

In this question, you are given the size of houses and associated prices and the **ask** is to predict the price of a house for a given square footage.

Here is the snapshot of the dataset that contains the size of houses and the associated prices in the city of Los Gatos (where Netflix is headquartered):

![alt text](https://www.dropbox.com/s/2woxl7v5t6i3g5f/HomePrices.JPG?raw=1)

**ASK**:

**Step 1**: Import the requisite packages

from pyspark.mllib.regression import LabeledPoint

from pyspark.mllib.regression import LinearRegressionWithSGD

from pyspark import SparkContext, SparkConf

from pyspark.mllib.linalg import Vectors


**Step 2:** Create a LabeledPoint data type which includes the price of the house as the label and a dense vector of home sizes

***Reference:*** https://spark.apache.org/docs/1.2.1/mllib-data-types.html

**Step 3:** Create an RDD of the LabeledPoint constructed in step 2 (*Hint*: Utilize the parallelize method of the *SparkContext* object since it ensures that the elements of the RDD can be operated on in parallel)

**Step 4:** Train a LinearRegressionWithSGD model with 100 iterations and a stepSize of 0.0000006

**Reference:** https://spark.apache.org/docs/2.3.0/mllib-linear-methods.html

**Step 5:** Predict the price for a house with **2,600** sq ft



In [11]:
# Step 1
from pyspark.mllib.regression import LinearRegressionWithSGD
from numpy import array

I cannot find a raw data source, so I will use the housing data from __Question 3__ here as well.

In [12]:
sc.addFile('https://ucf1a7524b617cec1a3482991aba.dl.dropboxusercontent.com/cd/0/inline/AJThx2ikvEC68ehcl0FTjQDaPzIG-PIB7TX2pw42UCQV7oYnhvxEsQKbuTt-tBbd_9XlXOCkuC7U9EX-28phZ7lHtYC2qmFaUEy1y4FSVTh4UQ1i8EHHOqrlBseqfLAIs86sd10nzj5qQoZecxtUbz3rEnUnAs_gtMZimS5Q021HtENLx1KhE9PdDvnNcScr6ZQ/file')

In [13]:
house = sc.textFile(SparkFiles.get('file')).map(lambda line: line.split(','))

In [14]:
house.collect()[:5]

[['12839', '2405'],
 ['10000', '2200'],
 ['8040', '1400'],
 ['13104', '1800'],
 ['10000', '2351']]

In [15]:
# Step 2 (and Step 3 since it produces an RDD)
# Label = price (column 1); feature = lot size (column 0), converted from string to float
house_data = house.map(lambda x: LabeledPoint(float(x[1]), Vectors.dense(float(x[0]))))

In [16]:
type(house_data)

pyspark.rdd.PipelinedRDD

In [17]:
# Step 4
model = LinearRegressionWithSGD.train(house_data, iterations=100, step=0.0000006)

In [18]:
# Step 5
model.predict(array([2600]))

-5.8779505413723687e+174

The prediction is astronomically large in magnitude, which indicates that training diverged (exploding gradients). I will try dramatically reducing the step size.

In [19]:
model = LinearRegressionWithSGD.train(house_data, iterations=200, step=0.00000001)
model.predict(array([2600]))

190.80536403423054
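
Why the smaller step helps can be illustrated with a simplified model. The sketch below is plain Python, not MLlib: it runs full-batch gradient descent with a constant step on only the first five rows shown earlier, fitting a single weight with no intercept. `fit_weight`, `rows`, and `stable_limit` are names introduced here for illustration, and MLlib's own optimizer may differ in details (for example, how it scales the step over iterations), so the numbers are not expected to match the notebook's output.

```python
# First five (lot size, price) rows shown earlier in the notebook.
rows = [(12839.0, 2405.0), (10000.0, 2200.0), (8040.0, 1400.0),
        (13104.0, 1800.0), (10000.0, 2351.0)]


def fit_weight(rows, step, iterations):
    """Full-batch gradient descent for price ~ w * lot_size (no intercept)."""
    n = len(rows)
    w = 0.0
    for _ in range(iterations):
        # Gradient of the mean of 0.5 * (w*x - y)^2 with respect to w.
        grad = sum((w * x - y) * x for x, y in rows) / n
        w -= step * grad
    return w


# With one weight, each update scales the error by (1 - step * mean(x^2)),
# so a constant step above 2 / mean(x^2) makes the error grow every iteration.
mean_x2 = sum(x * x for x, _ in rows) / len(rows)
stable_limit = 2.0 / mean_x2

w_large = fit_weight(rows, step=6e-7, iterations=100)
w_small = fit_weight(rows, step=1e-8, iterations=200)
print("stable limit:", stable_limit)
print("weight with step 6e-7:", w_large)
print("weight with step 1e-8:", w_small)
```

Because lot sizes are on the order of 10^4, their squares are on the order of 10^8, which pushes the stable step size down to roughly 10^-8 in this simplified setting. Rescaling the feature (for example, expressing lot size in thousands of square feet) would be another way to allow a larger step.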

In **Question 3**, you are given the lot size of houses and the associated prices in the city of Saratoga (close to the Netflix headquarters), and the ask is to uncover 4 clusters (**k = 4**) based on the lot size and the price.

Here is the snapshot of a subset of the dataset that contains the size of houses and the associated prices in the city of Saratoga:

![alt text](https://www.dropbox.com/s/h8yyl0creyi11wg/HomePrices_COS.JPG?raw=1)

**Source:** https://www.neighborhoodscout.com/ca/saratoga/real-estate



**Question 3: Ask**

**Step 1:** Import the relevant packages

from pyspark.mllib.clustering import KMeans

from pyspark import SparkContext, SparkConf

from pyspark.mllib.linalg import Vectors

**Step 2:** Initialize the Spark Context; the starting point/root of every Spark Application 

**Step 3:** Load the data into a RDD

***Dataset***: https://www.dropbox.com/s/njtjw2272kwk0au/Home_Prices1_COS.csv?raw=1

**Step 4:** Train the KMeans clustering model for 4 clusters and 5 iterations

**Step 5:** Load the RDD of dense vectors into a collection

**Step 6:** Predict the cluster for a select few data points, i.e. elements 0, 18, 35, 6 and 15 of the collection
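
Step 6 asks for specific elements rather than the whole collection. The sketch below shows one way to select by index, using a stand-in predictor that assigns each point to its nearest center by Euclidean distance, which is conceptually what a k-means model's `predict` does. The `centers` values, the shortened `collection`, and the helper `nearest_center` are all hypothetical and chosen only for illustration; with the trained model, the call would instead be `clusters.predict(collection[i])` for each requested index.

```python
import math


def nearest_center(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))


# Hypothetical cluster centers as (lot size, price); not the trained model's.
centers = [(13000.0, 2000.0), (9000.0, 1700.0),
           (1500.0, 900.0), (50000.0, 2700.0)]

# A shortened collection; the full dataset would be indexed with
# [0, 18, 35, 6, 15] as the exercise requests.
collection = [(12839.0, 2405.0), (10000.0, 2200.0), (8040.0, 1400.0),
              (13104.0, 1800.0), (10000.0, 2351.0), (3049.0, 795.0),
              (38768.0, 2725.0)]
wanted = [0, 6, 5]

assignments = {i: nearest_center(collection[i], centers) for i in wanted}
print(assignments)
```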

In [20]:
# Step 1 (Step 2 done in question 1)
from pyspark.mllib.clustering import KMeans

In [21]:
house = sc.textFile(SparkFiles.get('file')).map(lambda line: line.split(','))

In [22]:
house_arrays = house.map(lambda line: array([float(x) for x in line]))

In [23]:
# Step 4
clusters = KMeans.train(house_arrays, 4, maxIterations=5, initializationMode="random")

In [24]:
# Step 5
collection = house_arrays.collect()

In [28]:
# Step 6
for point in collection:
    print('Point:', point)
    print('Cluster:', clusters.predict(point))

Point: [ 12839.   2405.]
Cluster: 0
Point: [ 10000.   2200.]
Cluster: 1
Point: [ 8040.  1400.]
Cluster: 1
Point: [ 13104.   1800.]
Cluster: 0
Point: [ 10000.   2351.]
Cluster: 1
Point: [ 3049.   795.]
Cluster: 2
Point: [ 38768.   2725.]
Cluster: 3
Point: [ 16250.   2150.]
Cluster: 0
Point: [ 43026.   2724.]
Cluster: 3
Point: [ 44431.   2675.]
Cluster: 3
Point: [ 40000.   2930.]
Cluster: 3
Point: [ 1260.   870.]
Cluster: 2
Point: [ 15000.   2210.]
Cluster: 0
Point: [ 10032.   1145.]
Cluster: 1
Point: [ 12420.   2419.]
Cluster: 0
Point: [ 69696.   2750.]
Cluster: 3
Point: [ 12600.   2035.]
Cluster: 0
Point: [ 10240.   1150.]
Cluster: 1
Point: [ 876.  665.]
Cluster: 2
Point: [ 8125.  1430.]
Cluster: 1
Point: [ 11792.   1920.]
Cluster: 0
Point: [ 1512.  1230.]
Cluster: 2
Point: [ 1276.   975.]
Cluster: 2
Point: [ 67518.   2400.]
Cluster: 3
Point: [ 9810.  1725.]
Cluster: 1
Point: [ 6324.  2300.]
Cluster: 1
Point: [ 12510.   1700.]
Cluster: 0
Point: [ 15616.   1915.]
Cluster: 0
Point: [ 154