## Data: houses (number of bedrooms, neighborhood, price)

Find the total dollar amount of properties for sale in each neighborhood.

First, create a local file:

In [1]:
%%file houses.txt
3 Downtown 400000
2 Downtown 240000
3 Hilltop 650000

Overwriting houses.txt


In [2]:
!cat houses.txt

3 Downtown 400000
2 Downtown 240000
3 Hilltop 650000

Copy the file to HDFS, the Hadoop Distributed File System:

In [3]:
!hdfs dfs -mkdir /data

16/08/03 09:40:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
!hdfs dfs -copyFromLocal houses.txt /data/

16/08/03 09:41:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [5]:
!hdfs fsck /data/houses.txt -files -blocks -locations -racks

16/08/03 09:42:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://comet-10-33.ibnet:50070
FSCK started by zonca (auth:SIMPLE) from /10.22.253.158 for path /data/houses.txt at Wed Aug 03 09:42:24 PDT 2016
/data/houses.txt 52 bytes, 1 block(s):  Under replicated BP-316019376-198.202.116.133-1470239884934:blk_1073741825_1001. Target Replicas is 3 but found 2 replica(s).
0. BP-316019376-198.202.116.133-1470239884934:blk_1073741825_1001 len=52 repl=2 [/default-rack/10.22.253.157:50010, /default-rack/10.22.253.158:50010]

Status: HEALTHY
 Total size:	52 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 52 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	1 (100.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	2.0
 Corrupt blo

In [1]:
text_RDD = sc.textFile("/data/houses.txt")

Alternatively, read from the shared filesystem:

In [None]:
# text_RDD = sc.textFile("file:///home/zonca/workshop/spark/houses.txt")

In [7]:
text_RDD.collect()

[u'3 Downtown 400000', u'2 Downtown 240000', u'3 Hilltop 650000']

## Mapper: parse each line into a (key, value) pair

In [2]:
def mapper_parse_lines(line):
    """Parse line into (neighborhood, price) pair"""
    words = line.split()
    return (words[1], float(words[2]))

In [3]:
house_prices_RDD = text_RDD.map(mapper_parse_lines)

In [10]:
house_prices_RDD.collect()

[(u'Downtown', 400000.0), (u'Downtown', 240000.0), (u'Hilltop', 650000.0)]

In [13]:
house_prices_RDD.take(2)

[(u'Downtown', 400000.0), (u'Downtown', 240000.0)]
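
The mapper above assumes every line has exactly three whitespace-separated fields. As a minimal sketch of a more defensive variant (the name `parse_line_safe` and the malformed sample lines are hypothetical, not part of the workshop data), the function below returns a list that is empty for bad lines, so it can be tried in plain Python without a Spark context:

```python
def parse_line_safe(line):
    """Return [(neighborhood, price)] for a valid line, [] otherwise."""
    words = line.split()
    if len(words) != 3:
        return []  # wrong number of fields: skip the line
    try:
        return [(words[1], float(words[2]))]
    except ValueError:
        return []  # price is not a number: skip the line


# Hypothetical input: two valid lines, one empty line, one bad price
lines = ["3 Downtown 400000", "", "2 Downtown n/a", "3 Hilltop 650000"]

# Flatten the per-line lists, keeping only the successfully parsed pairs
pairs = [pair for line in lines for pair in parse_line_safe(line)]
print(pairs)  # [('Downtown', 400000.0), ('Hilltop', 650000.0)]
```

Because the function returns a list, it would be used in Spark with `text_RDD.flatMap(parse_line_safe)` rather than `map`, so that lines producing an empty list simply disappear from the result.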

## Reducer: sum values across all pairs with the same key

In [4]:
def reducer_sum(a, b):
    """Combine two prices that share a key into their sum"""
    return a + b

In [7]:
house_prices_RDD.reduceByKey(reducer_sum).collect()

[(u'Downtown', 640000.0), (u'Hilltop', 650000.0)]
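
To make the semantics of `reduceByKey` concrete, here is a rough plain-Python stand-in that runs without Spark. The helper `reduce_by_key` is a hypothetical illustration, not how Spark implements the operation (Spark combines values per partition and then across partitions); it only reproduces the end result on a small list.

```python
def reduce_by_key(pairs, func):
    """Plain-Python illustration of what reduceByKey computes."""
    result = {}
    for key, value in pairs:
        if key in result:
            # Key seen before: fold the new value into the running result
            result[key] = func(result[key], value)
        else:
            # First value for this key is kept as-is
            result[key] = value
    return sorted(result.items())


def reducer_sum(a, b):
    return a + b


# Same (neighborhood, price) pairs as in house_prices_RDD
house_prices = [("Downtown", 400000.0), ("Downtown", 240000.0), ("Hilltop", 650000.0)]

totals = reduce_by_key(house_prices, reducer_sum)
print(totals)  # [('Downtown', 640000.0), ('Hilltop', 650000.0)]
```

Note that the reducer is never called for `Hilltop`, which has a single value: a key with only one pair passes through unchanged.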

## Exercise
By copying and modifying the code above, find the **maximum** price for each **number of bedrooms**.

In [None]:
max_price_per_bedroom_RDD = None

In [14]:
def mapper_parse_lines(line):
    """Parse line into (number of bedrooms, price) pair"""
    words = line.split()
    return (int(words[0]), float(words[2]))

In [15]:
mapper_parse_lines("3 Downtown 400000")

(3, 400000.0)

In [19]:
text_RDD.map(mapper_parse_lines).reduceByKey(max).collect()

[(2, 240000.0), (3, 650000.0)]

In [20]:
max_price_per_bedroom_RDD = text_RDD.map(mapper_parse_lines).reduceByKey(max)
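
The built-in `max` works as a reducer here because it is associative and commutative, which `reduceByKey` requires: Spark does not guarantee the order in which values are combined. The sketch below checks this in plain Python by reducing the same values in every possible order. The third price is a hypothetical extra value added only to make the check more interesting.

```python
from functools import reduce
from itertools import permutations

# Prices for 3-bedroom houses; the last value is hypothetical
prices_3_bedrooms = [400000.0, 650000.0, 520000.0]

# max gives the same answer whatever order the values arrive in
max_results = {reduce(max, order) for order in permutations(prices_3_bedrooms)}
print(max_results)  # {650000.0}

# Subtraction is neither associative nor commutative,
# so the result depends on the order of the values
sub_results = {
    reduce(lambda a, b: a - b, order) for order in permutations(prices_3_bedrooms)
}
print(len(sub_results) > 1)  # True
```

A reducer like subtraction would therefore give results that can change from run to run in Spark, depending on how the data is partitioned.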

## Save results back to HDFS

In [21]:
house_prices_RDD.saveAsTextFile("/data/house_prices_RDD")

In [22]:
max_price_per_bedroom_RDD.saveAsTextFile("/data/max_price_per_bedroom_RDD")

In [23]:
!hdfs dfs -copyToLocal /data/house_prices_RDD

16/08/03 10:58:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [24]:
!hdfs dfs -copyToLocal /data/max_price_per_bedroom_RDD

16/08/03 10:58:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [25]:
!cat max_price_per_bedroom_RDD/part-00000

(2, 240000.0)


In [26]:
!cat max_price_per_bedroom_RDD/part-00001

(3, 650000.0)


In [27]:
!cat house_prices_RDD/part-00000

(u'Downtown', 400000.0)
(u'Downtown', 240000.0)


In [28]:
!cat house_prices_RDD/part-00001

(u'Hilltop', 650000.0)
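
Each `part-*` file holds the string representation of one tuple per line, so the copied results can be read back into Python objects. The following is a minimal sketch, assuming the output format shown above; `read_saved_pairs` is a hypothetical helper, and the temporary directory is a stand-in for the copied `house_prices_RDD` folder so that the example runs on its own.

```python
import ast
import glob
import os
import tempfile


def read_saved_pairs(directory):
    """Parse every line of every part-* file in directory into a tuple."""
    pairs = []
    for path in sorted(glob.glob(os.path.join(directory, "part-*"))):
        with open(path) as part_file:
            for line in part_file:
                line = line.strip()
                if line:
                    # literal_eval safely evaluates tuples, strings and numbers
                    pairs.append(ast.literal_eval(line))
    return pairs


# Stand-in for the copied directory, with the contents shown above
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "part-00000"), "w") as f:
    f.write("(u'Downtown', 400000.0)\n(u'Downtown', 240000.0)\n")
with open(os.path.join(demo_dir, "part-00001"), "w") as f:
    f.write("(u'Hilltop', 650000.0)\n")

saved_pairs = read_saved_pairs(demo_dir)
print(saved_pairs)
```

On the real output, the call would be `read_saved_pairs("house_prices_RDD")`. The `part-*` pattern skips other files in the output directory, such as the `_SUCCESS` marker.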
