## Data: houses with number of bedrooms, neighborhood, price

Find the total dollar amount of properties on sale in each neighborhood.
First, create a local file on the Driver only; it is not accessible by the Spark cluster.

In [39]:
%%file houses.txt
3 Downtown 400000
2 Downtown 240000
3 Hilltop 650000

Overwriting houses.txt


In [40]:
!cat houses.txt

3 Downtown 400000
2 Downtown 240000
3 Hilltop 650000

Copy the file to HDFS, the distributed file system

In [41]:
!hdfs dfs -mkdir /data

15/12/03 14:51:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
mkdir: `/data': File exists


In [42]:
!hdfs dfs -copyFromLocal houses.txt /data/

15/12/03 14:51:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
copyFromLocal: `/data/houses.txt': File exists


In [5]:
!hdfs fsck /data/houses.txt -files -blocks -locations -racks

15/12/03 14:14:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://comet-06-55.ibnet:50070
FSCK started by zonca (auth:SIMPLE) from /10.22.254.96 for path /data/houses.txt at Thu Dec 03 14:14:21 PST 2015
/data/houses.txt 52 bytes, 1 block(s):  Under replicated BP-1823862164-198.202.119.184-1449177989860:blk_1073741825_1001. Target Replicas is 3 but found 1 replica(s).
0. BP-1823862164-198.202.119.184-1449177989860:blk_1073741825_1001 len=52 repl=1 [/default-rack/10.22.254.96:50010]

Status: HEALTHY
 Total size:	52 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 52 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	1 (100.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	1.0
 Corrupt blocks:		0
 Missing replicas:		2 (66.6

In [6]:
text_RDD = sc.textFile("/data/houses.txt")

In [11]:
# Alternative: read from the local file system with a file:// URL (this reassigns text_RDD)
text_RDD = sc.textFile("file:///home/zonca/workshop/spark/houses.txt")

In [12]:
pwd

u'/home/zonca/workshop/spark'

In [13]:
text_RDD.collect()

[u'3 Downtown 400000', u'2 Downtown 240000', u'3 Hilltop 650000']

## Mapper: parse each line into a (key, value) pair

In [14]:
def mapper_parse_lines(line):
    """Parse line into (neighborhood, price) pair"""
    words = line.split()
    return (words[1], float(words[2]))

In [15]:
house_prices_RDD = text_RDD.map(mapper_parse_lines)

In [16]:
house_prices_RDD.collect()

[(u'Downtown', 400000.0), (u'Downtown', 240000.0), (u'Hilltop', 650000.0)]
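
Because the mapper is a plain Python function, it can be checked on the Driver without a cluster. The following is a minimal sketch, not part of the original notebook: `sample_lines` is a hypothetical stand-in for `text_RDD`, and the built-in `map` plays the role of `RDD.map`. Under Python 3 the strings print without the `u` prefix.

```python
def mapper_parse_lines(line):
    """Parse line into (neighborhood, price) pair"""
    words = line.split()
    return (words[1], float(words[2]))

# Hypothetical stand-in for text_RDD: the three lines of houses.txt
sample_lines = ["3 Downtown 400000", "2 Downtown 240000", "3 Hilltop 650000"]

# The built-in map applies the function to each element, like RDD.map
parsed = list(map(mapper_parse_lines, sample_lines))
print(parsed)
```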

## Reducer: sum values across all pairs with the same key

In [17]:
def reducer_sum(a, b):
    """Sum two values that share the same key"""
    return a + b

In [18]:
house_prices_RDD.reduceByKey(reducer_sum).collect()

[(u'Downtown', 640000.0), (u'Hilltop', 650000.0)]
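
To make the result of `reduceByKey` concrete, here is a pure-Python sketch of the computation. It illustrates the semantics only and is not how Spark is implemented: the helper `reduce_by_key_local` and the two-partition split are assumptions made for this example. Spark may apply the reducer in any order within and across partitions, so the function should be associative and commutative, as addition is.

```python
def reducer_sum(a, b):
    return a + b

def reduce_by_key_local(pairs, reducer):
    """Hypothetical helper: combine the values of equal keys with reducer."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = reducer(result[key], value)
        else:
            result[key] = value
    return result

# Assumed split of the (neighborhood, price) pairs into two partitions
partition_0 = [("Downtown", 400000.0), ("Downtown", 240000.0)]
partition_1 = [("Hilltop", 650000.0)]

# Step 1: reduce inside each partition independently
partial_0 = reduce_by_key_local(partition_0, reducer_sum)
partial_1 = reduce_by_key_local(partition_1, reducer_sum)

# Step 2: merge the per-partition results with the same reducer
merged = list(partial_0.items()) + list(partial_1.items())
totals = reduce_by_key_local(merged, reducer_sum)
print(sorted(totals.items()))
```

Passing `max` instead of `reducer_sum` to the same helper gives the per-key maximum, since `max` is also associative and commutative.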

## Exercise
By copying and modifying the code above, find the **maximum** price for each **number of bedrooms**.

In [25]:
def mapper_parse_lines(line):
    """Parse line into (number of bedrooms, price) pair"""
    words = line.split()
    return (int(words[0]), float(words[2]))

In [26]:
house_prices_RDD = text_RDD.map(mapper_parse_lines)

In [27]:
house_prices_RDD.collect()

[(3, 400000.0), (2, 240000.0), (3, 650000.0)]

In [28]:
def reducer_max(a, b):
    """Keep the larger of two values that share the same key"""
    return max(a, b)

In [29]:
house_prices_RDD.reduceByKey(reducer_max).collect()

[(2, 240000.0), (3, 650000.0)]

In [34]:
max_price_per_bedroom_RDD = house_prices_RDD.reduceByKey(reducer_max)

## Save results back to HDFS

In [30]:
house_prices_RDD.saveAsTextFile("/data/house_prices_RDD")

In [35]:
max_price_per_bedroom_RDD.saveAsTextFile("/data/max_price_per_bedroom_RDD")

In [31]:
!hdfs dfs -copyToLocal /data/house_prices_RDD

15/12/03 14:25:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [36]:
!hdfs dfs -copyToLocal /data/max_price_per_bedroom_RDD

15/12/03 14:28:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [37]:
!cat max_price_per_bedroom_RDD/part-00000

(2, 240000.0)


In [38]:
!cat max_price_per_bedroom_RDD/part-00001

(3, 650000.0)


In [32]:
!cat house_prices_RDD/part-00000

(3, 400000.0)
(2, 240000.0)


In [33]:
!cat house_prices_RDD/part-00001

(3, 650000.0)
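
`saveAsTextFile` writes the string representation of each element, one per line, so the tuples come back as plain strings when the part files are read again. Below is a minimal sketch of one way to recover them, assuming every line looks exactly like the output above. `part_lines` is a hypothetical stand-in for the file contents, and `ast.literal_eval` from the standard library is a technique swapped in for this example, not something the notebook itself uses.

```python
import ast

# Hypothetical stand-in for the lines of part-00000 and part-00001
part_lines = ["(3, 400000.0)", "(2, 240000.0)", "(3, 650000.0)"]

# literal_eval safely parses Python literals such as tuples of numbers
records = [ast.literal_eval(line) for line in part_lines]

for bedrooms, price in records:
    print(bedrooms, price)
```

Unlike `eval`, `ast.literal_eval` accepts only literals, so it will raise an error on a malformed line instead of executing it.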
