### Coding Challenge #2:

**Question 1:** This question gives you exposure to Spark MLlib data types, specifically LabeledPoint and dense vectors.

**Dataset**: https://www.dropbox.com/s/cv8kpsqsgxzw5ar/Spiders.csv?dl=0

In 2006, Japanese researchers conducted a study to determine the presence/absence of an endangered burrowing spider based on grain size. The dataset is representative of some of the research they undertook. If you are interested in reviewing the paper, it can be accessed via this link:
https://www.jstage.jst.go.jp/article/asjaa/55/2/55_2_79/_pdf

**ASK:**

**Step 1:** Import the requisite packages

from pyspark.mllib.regression import LabeledPoint

from pyspark import SparkContext, SparkConf

from pyspark.mllib.linalg import Vectors

**Step 2:** Read in the "Spiders.csv" file

**Step 3:** Ignore the header row

**Step 4:** Create an RDD of LabeledPoints in which the label is the presence or absence of spiders and the features are a dense vector of the grain size

**Step 5:** Convert the RDD into a list/collection and output the list of LabeledPoints





In [0]:
# https://mikestaszel.com/2018/03/07/apache-spark-on-google-colaboratory/
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
!tar xf spark-2.3.1-bin-hadoop2.7.tgz
!pip install -q findspark

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-hadoop2.7"

In [0]:
import findspark
findspark.init()
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = SparkContext.getOrCreate()

In [0]:
from pyspark.mllib.regression import LabeledPoint

from pyspark import SparkContext, SparkConf

from pyspark.mllib.linalg import Vectors

In [9]:
!wget https://www.dropbox.com/s/cv8kpsqsgxzw5ar/Spiders.csv?dl=0 -O Spiders.csv

--2018-06-19 21:38:40--  https://www.dropbox.com/s/cv8kpsqsgxzw5ar/Spiders.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.7.1, 2620:100:6016:1::a27d:101
Connecting to www.dropbox.com (www.dropbox.com)|162.125.7.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://uc4c80814bf69a54888aee6b150a.dl.dropboxusercontent.com/cd/0/get/AJRqU5Lhi5wOuBpOUf901y_Wnb6tpyRPoYnTrmVp2z7ytKTNwMuvGbo3E_OjLhvqoBmOkRRllF5MvUjsB3ztL0sSlsHtz1xNITQoWhODI33_JfIJKcCKbHJSgqCWcBNhvF6BhYkKDIx2I1cejEbZYA5B-Zlk42Hoge49LXehTFcdEbN5GciYf9F7_1yiBhCDHMw/file [following]
--2018-06-19 21:38:40--  https://uc4c80814bf69a54888aee6b150a.dl.dropboxusercontent.com/cd/0/get/AJRqU5Lhi5wOuBpOUf901y_Wnb6tpyRPoYnTrmVp2z7ytKTNwMuvGbo3E_OjLhvqoBmOkRRllF5MvUjsB3ztL0sSlsHtz1xNITQoWhODI33_JfIJKcCKbHJSgqCWcBNhvF6BhYkKDIx2I1cejEbZYA5B-Zlk42Hoge49LXehTFcdEbN5GciYf9F7_1yiBhCDHMw/file
Resolving uc4c80814bf69a54888aee6b150a.dl.dropboxusercontent.com (uc4c80814bf69a54888aee6b150a.dl.dropboxu

In [14]:
spiders = sc.textFile('Spiders.csv')
header = spiders.first()
spiders = spiders.filter(lambda row: row != header)
split_data = spiders.map(lambda row:row.split(','))
split_data.take(5)

[['0.245', 'Absent'],
 ['0.247', 'Absent'],
 ['0.285', 'Present'],
 ['0.299', 'Present'],
 ['0.327', 'Present']]

In [16]:
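# Helper function to encode a [grain_size, presence] row as a LabeledPoint
def construct_labeled_points(row):
  # Fields arrive as strings from split(','), so convert the feature explicitly
  grain_size = float(row[0])
  # Encode the label numerically: 1 for 'Present', 0 for 'Absent'
  presence = 1 if row[1] == 'Present' else 0
  return LabeledPoint(presence, Vectors.dense([grain_size]))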

labeled_points_rdd = split_data.map(construct_labeled_points)
print(labeled_points_rdd.take(5))

[LabeledPoint(0.0, [0.245]), LabeledPoint(0.0, [0.247]), LabeledPoint(1.0, [0.285]), LabeledPoint(1.0, [0.299]), LabeledPoint(1.0, [0.327])]
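
The header filter and the label encoding above can be checked without a Spark session. The following is a minimal plain-Python sketch of the same logic, not the notebook's Spark code: the header text is a placeholder (the real header row is not shown in the output), `encode_row` is a hypothetical helper, and tuples stand in for `LabeledPoint`.

```python
# Plain-Python stand-in for the Spark pipeline above; no cluster needed.
raw_lines = [
    "GrainSize,Spiders",  # hypothetical header row (actual header text is an assumption)
    "0.245,Absent",
    "0.247,Absent",
    "0.285,Present",
]

def encode_row(line):
    """Turn 'size,status' into (label, [features]), mirroring LabeledPoint."""
    size, status = line.split(",")
    label = 1.0 if status.strip() == "Present" else 0.0
    return (label, [float(size)])

# Same idea as spiders.first() followed by filter(row != header)
header = raw_lines[0]
encoded = [encode_row(line) for line in raw_lines if line != header]
print(encoded)
```

Each tuple pairs a numeric label with a one-element feature list, which is the same shape the `LabeledPoint` output above displays.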


**Question 2**:

In this question, you are given the size of houses and associated prices and the **ask** is to predict the price of a house for a given square footage.

Here is the snapshot of the dataset that contains the size of houses and the associated prices in the city of Los Gatos (where Netflix is headquartered):

![alt text](https://www.dropbox.com/s/2woxl7v5t6i3g5f/HomePrices.JPG?raw=1)

**ASK**:

**Step 1**: Import the requisite packages

from pyspark.mllib.regression import LabeledPoint

from pyspark.mllib.regression import LinearRegressionWithSGD

from pyspark import SparkContext, SparkConf

from pyspark.mllib.linalg import Vectors


**Step 2:** Create a LabeledPoint data type which includes the price of the house as the label and a dense vector of home sizes

***Reference:*** https://spark.apache.org/docs/1.2.1/mllib-data-types.html

**Step 3:** Create an RDD of the LabeledPoints constructed in step 2 (*Hint*: Utilize the parallelize method of the *SparkContext* object, since it ensures that the elements of the RDD can be operated on in parallel)

**Step 4:** Train a LinearRegressionWithSGD model with 100 iterations and a stepSize of 0.0000006

**Reference:** https://spark.apache.org/docs/2.3.0/mllib-linear-methods.html

**Step 5:** Predict the price for a house with **2,600** sq ft



In [0]:
from pyspark.mllib.regression import LabeledPoint

from pyspark.mllib.regression import LinearRegressionWithSGD

from pyspark import SparkContext, SparkConf

from pyspark.mllib.linalg import Vectors

In [25]:
points = [(2200, 1720000), 
          (2400, 1890000),
          (2146, 1500000),
          (4415, 2200000),
          (1344, 1120000),
          (4608, 3870000),
          (2193, 1270000), 
          (2850, 2460000), 
          (4090, 3480000), 
          (2059, 1530000)]
points_rdd = sc.parallelize(points)
# Helper function to encode to LabeledPoints
def construct_labeled_points(row):
  square_feet = row[0]
  price = row[1]
  return LabeledPoint(price, Vectors.dense(square_feet))

points_rdd_labeled = points_rdd.map(construct_labeled_points)
points_rdd_labeled.cache()

# Train a linear regression model with 100 iterations
# and stepsize 0.0000006
model = LinearRegressionWithSGD.train(points_rdd_labeled, iterations=100, step=0.0000006)

# Predict the price of a house that has a square footage of 2600
prediction = model.predict(Vectors.dense(2600))
print(prediction)
print(model.intercept)
print(model.weights)
print(model.weights[0] * 2600)
                                      

1922871.7203570446
0.0
[739.566046291171]
1922871.7203570446
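
`LinearRegressionWithSGD.train` leaves `intercept` at its default of `False`, so the fitted model is simply price = w × square feet. That is why `model.intercept` prints `0.0` and the prediction matches `model.weights[0] * 2600`. As a rough sanity check (this is the closed-form least-squares slope for a line through the origin, w = Σxy / Σx², not the SGD procedure Spark runs), the slope can be recomputed in plain Python and should land close to the SGD weight:

```python
# (square feet, price) pairs copied from the cell above
points = [(2200, 1720000), (2400, 1890000), (2146, 1500000),
          (4415, 2200000), (1344, 1120000), (4608, 3870000),
          (2193, 1270000), (2850, 2460000), (4090, 3480000),
          (2059, 1530000)]

# Closed-form least-squares slope for a no-intercept model: w = sum(x*y) / sum(x*x)
sum_xy = sum(x * y for x, y in points)
sum_xx = sum(x * x for x, _ in points)
weight = sum_xy / sum_xx

# With no intercept, the prediction is just weight * square footage
predicted_price = weight * 2600
print(weight, predicted_price)
```

If the two slopes differed substantially, that would suggest the step size or iteration count was preventing SGD from converging.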


In **Question 3**, you are given the lot size of houses and the associated prices in the city of Saratoga (close to the Netflix headquarters), and the ask is to uncover 4 clusters (**k = 4**) based on the lot size and the price.

Here is a snapshot of a subset of the dataset that contains the lot size of houses and the associated prices in the city of Saratoga:

![alt text](https://www.dropbox.com/s/h8yyl0creyi11wg/HomePrices_COS.JPG?raw=1)

**Source:** https://www.neighborhoodscout.com/ca/saratoga/real-estate



**Question 3: Ask**

**Step 1:** Import the relevant packages

from pyspark.mllib.clustering import KMeans

from pyspark import SparkContext, SparkConf

from pyspark.mllib.linalg import Vectors

**Step 2:** Initialize the SparkContext, the starting point/root of every Spark application

**Step 3:** Load the data into an RDD

***Dataset***: https://www.dropbox.com/s/njtjw2272kwk0au/Home_Prices1_COS.csv?raw=1

**Step 4:** Train the KMeans clustering model for 4 clusters and 5 iterations

**Step 5:** Load the RDD of dense vectors into a collection

**Step 6:** Predict the cluster for a select few data points, i.e. elements 0, 18, 35, 6, and 15 of the collection

In [0]:
from pyspark.mllib.clustering import KMeans
from pyspark import SparkContext, SparkConf
from pyspark.mllib.linalg import Vectors

In [27]:
!wget https://www.dropbox.com/s/njtjw2272kwk0au/Home_Prices1_COS.csv?raw=1 -O home_prices.csv

--2018-06-19 22:15:12--  https://www.dropbox.com/s/njtjw2272kwk0au/Home_Prices1_COS.csv?raw=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.1, 2620:100:601a:1::a27d:701
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://uc5cdd10a1d53a56d341fcdee703.dl.dropboxusercontent.com/cd/0/inline/AJRW2XkXwEKACnLZCVA-NoZBVOKpkVjo-uOo90w5GpR7_K80VmDR0kCH6eeY4nK02dbpau56zn0eLkEWdqlB0xOgPSzCb86jw9DL11J_f5nh5A9xvHTefW-fkOB82OYs00CHSgasCN7OU4QcNXu_fDgges6Filv5iAFwMgWsIjLPafss-B0TA6cMzaoo5wAiEUc/file [following]
--2018-06-19 22:15:12--  https://uc5cdd10a1d53a56d341fcdee703.dl.dropboxusercontent.com/cd/0/inline/AJRW2XkXwEKACnLZCVA-NoZBVOKpkVjo-uOo90w5GpR7_K80VmDR0kCH6eeY4nK02dbpau56zn0eLkEWdqlB0xOgPSzCb86jw9DL11J_f5nh5A9xvHTefW-fkOB82OYs00CHSgasCN7OU4QcNXu_fDgges6Filv5iAFwMgWsIjLPafss-B0TA6cMzaoo5wAiEUc/file
Resolving uc5cdd10a1d53a56d341fcdee703.dl.dropboxusercontent.com (uc5cdd10a1d53a56d341fcde

In [28]:
home_prices = sc.textFile('home_prices.csv')
home_prices.take(5)

['12839,2405', '10000,2200', '8040,1400', '13104,1800', '10000,2351']

In [33]:
from pyspark.mllib.clustering import KMeans
from pyspark import SparkContext, SparkConf
from pyspark.mllib.linalg import Vectors

# !wget https://www.dropbox.com/s/njtjw2272kwk0au/Home_Prices1_COS.csv?raw=1 -O home_prices.csv
home_prices = sc.textFile('home_prices.csv')

# Split on comma and convert the numeric strings to a dense vector of floats
home_prices_split = home_prices.map(lambda row: Vectors.dense([float(x) for x in row.split(',')]))

# Train the actual KMeans model
kmeans_model = KMeans.train(home_prices_split, 4, maxIterations=5)

homes = home_prices_split.collect()
homes[0]


DenseVector([12839.0, 2405.0])

In [36]:
for point in homes:
    print('Point:', point)
    print('Cluster:', kmeans_model.predict(point))

Point: [12839.0,2405.0]
Cluster: 3
Point: [10000.0,2200.0]
Cluster: 3
Point: [8040.0,1400.0]
Cluster: 3
Point: [13104.0,1800.0]
Cluster: 3
Point: [10000.0,2351.0]
Cluster: 3
Point: [3049.0,795.0]
Cluster: 0
Point: [38768.0,2725.0]
Cluster: 1
Point: [16250.0,2150.0]
Cluster: 3
Point: [43026.0,2724.0]
Cluster: 1
Point: [44431.0,2675.0]
Cluster: 1
Point: [40000.0,2930.0]
Cluster: 1
Point: [1260.0,870.0]
Cluster: 0
Point: [15000.0,2210.0]
Cluster: 3
Point: [10032.0,1145.0]
Cluster: 3
Point: [12420.0,2419.0]
Cluster: 3
Point: [69696.0,2750.0]
Cluster: 2
Point: [12600.0,2035.0]
Cluster: 3
Point: [10240.0,1150.0]
Cluster: 3
Point: [876.0,665.0]
Cluster: 0
Point: [8125.0,1430.0]
Cluster: 3
Point: [11792.0,1920.0]
Cluster: 3
Point: [1512.0,1230.0]
Cluster: 0
Point: [1276.0,975.0]
Cluster: 0
Point: [67518.0,2400.0]
Cluster: 2
Point: [9810.0,1725.0]
Cluster: 3
Point: [6324.0,2300.0]
Cluster: 0
Point: [12510.0,1700.0]
Cluster: 3
Point: [15616.0,1915.0]
Cluster: 3
Point: [15476.0,2278.0]
Cluster: 3
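
Calling `predict` on a trained k-means model assigns a point to its nearest cluster center. The sketch below illustrates that assignment step in plain Python. The centroids are hypothetical values chosen for illustration, not the centers the model above learned, and `nearest_centroid` is a helper introduced here, not part of the MLlib API.

```python
def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean distance)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))

# Hypothetical cluster centers (illustration only; not taken from kmeans_model)
centroids = [
    (2000.0, 1000.0),   # index 0: small values
    (41000.0, 2750.0),  # index 1: large values
    (68000.0, 2600.0),  # index 2: very large values
    (12000.0, 1900.0),  # index 3: mid-sized values
]

# A few points copied from the output above
sample_points = [(12839.0, 2405.0), (3049.0, 795.0),
                 (38768.0, 2725.0), (69696.0, 2750.0)]

assignments = [nearest_centroid(p, centroids) for p in sample_points]
print(assignments)
```

Cluster indices are arbitrary labels: retraining with a different random initialization can produce the same grouping under different numbers. Because the first column is roughly an order of magnitude larger than the second, it dominates the distance, which is visible in how the output above groups points.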