Create a local file on the driver only; it is not accessible to the Spark cluster

In [1]:
%%file inflation.txt
Downtown 2.1
Hilltop 4.5

Writing inflation.txt


In [2]:
!cat inflation.txt

Downtown 2.1
Hilltop 4.5

Copy the file to HDFS, the distributed file system

In [3]:
!hdfs dfs -copyFromLocal inflation.txt /data/

15/12/03 14:53:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Check the file's replication: the target is 3 replicas, but the report below finds only 1, so the block is flagged as under-replicated

In [4]:
!hdfs fsck /data/inflation.txt -files -blocks -locations -racks  

15/12/03 14:53:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://comet-06-55.ibnet:50070
FSCK started by zonca (auth:SIMPLE) from /10.22.254.96 for path /data/inflation.txt at Thu Dec 03 14:53:39 PST 2015
/data/inflation.txt 24 bytes, 1 block(s):  Under replicated BP-1823862164-198.202.119.184-1449177989860:blk_1073741830_1006. Target Replicas is 3 but found 1 replica(s).
0. BP-1823862164-198.202.119.184-1449177989860:blk_1073741830_1006 len=24 repl=1 [/default-rack/10.22.254.96:50010]

Status: HEALTHY
 Total size:	24 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 24 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	1 (100.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	1.0
 Corrupt blocks:		0
 Missing replicas:		2

## Load house prices

In [5]:
text_RDD = sc.textFile("/data/houses.txt")

In [6]:
def mapper_parse_lines(line):
    """Parse line into (neighborhood, price) pair"""
    words = line.split()
    return (words[1], float(words[2]))

In [7]:
house_prices_RDD = text_RDD.map(mapper_parse_lines)

In [8]:
house_prices_RDD.collect()

[(u'Downtown', 400000.0), (u'Downtown', 240000.0), (u'Hilltop', 650000.0)]

## Load inflation

In [9]:
inflation_text_RDD = sc.textFile("/data/inflation.txt")

In [10]:
def mapper_parse__inflation_lines(line):
    """Parse line into (neighborhood, inflation) pair"""
    words = line.split()
    return (words[0], float(words[1]))

In [11]:
inflation_RDD = inflation_text_RDD.map(mapper_parse__inflation_lines)

In [12]:
inflation_RDD.collect()

[(u'Downtown', 2.1), (u'Hilltop', 4.5)]

## join

In [13]:
house_prices_RDD.join(inflation_RDD).collect()

[(u'Downtown', (400000.0, 2.1)),
 (u'Downtown', (240000.0, 2.1)),
 (u'Hilltop', (650000.0, 4.5))]
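
To make the semantics of `join` concrete, here is a minimal plain-Python sketch of an inner join on keys, run on the same pairs collected above. The helper `local_join` is hypothetical and only illustrates the shape of the result; Spark computes the join in a distributed way, and the order of its output is not guaranteed.

```python
from collections import defaultdict

def local_join(left, right):
    """Inner join of two lists of (key, value) pairs (illustration only, not Spark's implementation)."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # Emit one (key, (left_value, right_value)) pair per matching combination
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

house_prices = [("Downtown", 400000.0), ("Downtown", 240000.0), ("Hilltop", 650000.0)]
inflation = [("Downtown", 2.1), ("Hilltop", 4.5)]
joined = local_join(house_prices, inflation)
print(joined)
```

Keys that appear in only one of the two inputs are dropped, which is also how an inner join behaves in Spark.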

In [17]:
def mapper_multiply_price_inflation(pair):
    inflation_ratio = 1 + pair[1][1]/100.
    return (pair[0], pair[1][0]*inflation_ratio)

In [18]:
house_prices_nextyear_RDD = house_prices_RDD.join(inflation_RDD).map(mapper_multiply_price_inflation)

In [19]:
house_prices_nextyear_RDD.collect()

[(u'Downtown', 408399.99999999994),
 (u'Downtown', 245039.99999999997),
 (u'Hilltop', 679250.0)]
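
The values ending in `...99999999994` come from ordinary binary floating-point rounding, not from Spark. A minimal sketch in plain Python, assuming that rounding to cents is acceptable for display (the function name `apply_inflation` is hypothetical):

```python
def apply_inflation(price, inflation_percent):
    """Return the price after applying an inflation rate given in percent."""
    return price * (1 + inflation_percent / 100.)

raw = apply_inflation(400000.0, 2.1)
rounded = round(raw, 2)  # round to cents, for display only
print(raw, rounded)
```

Rounding is best left to the final presentation step, so that intermediate computations keep full precision.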

## reduce

In [20]:
def reducer_sum(a,b):
    return a+b

In [21]:
total_nextyear = house_prices_nextyear_RDD.reduceByKey(reducer_sum)

In [22]:
total_nextyear.collect()

[(u'Downtown', 653439.9999999999), (u'Hilltop', 679250.0)]
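
As a rough illustration of what `reduceByKey` computes, the sketch below groups values by key and folds each group with the reducer, in plain Python. The helper `local_reduce_by_key` is hypothetical, and the input prices are rounded to whole numbers for readability, which is an assumption and not the exact values above.

```python
from functools import reduce

def local_reduce_by_key(pairs, func):
    """Group values by key and fold each group with func (illustration only)."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

def reducer_sum(a, b):
    return a + b

# Rounded next-year prices (assumed values for this example)
prices_nextyear = [("Downtown", 408400.0), ("Downtown", 245040.0), ("Hilltop", 679250.0)]
totals = local_reduce_by_key(prices_nextyear, reducer_sum)
print(totals)
```

In Spark the reducer is applied in an arbitrary order within and across partitions, so it should be associative and commutative and return the same type as its inputs.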

## Exercise

List the neighborhood and house price only for the neighborhoods where inflation is low (less than 4%)

(Advanced: for each of those neighborhoods, find the most expensive house)

In [24]:
prices_inflation_RDD = house_prices_RDD.join(inflation_RDD)

In [25]:
prices_inflation_RDD.collect()

[(u'Downtown', (400000.0, 2.1)),
 (u'Downtown', (240000.0, 2.1)),
 (u'Hilltop', (650000.0, 4.5))]

In [35]:
def has_low_inflation(pair):
    return pair[1][1] < 4

In [36]:
has_low_inflation((u'Downtown', (400000.0, 2.1)))

True

In [37]:
has_low_inflation((u'Hilltop', (650000.0, 4.5)))

False

In [39]:
prices_inflation_RDD.filter(has_low_inflation).collect()

[(u'Downtown', (400000.0, 2.1)), (u'Downtown', (240000.0, 2.1))]

In [41]:
# %load solution_house_price_join.py
def is_inflation_low(pair):
    return pair[1][1] < 4
def get_price(price_inflation):
    # keep only the price, so that the reducer receives and returns floats
    return price_inflation[0]
house_prices_RDD.join(inflation_RDD).filter(is_inflation_low). \
       mapValues(get_price).reduceByKey(max).collect()


[(u'Downtown', 400000.0)]
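
As a cross-check without Spark, the same answer can be reproduced in plain Python on the joined pairs shown earlier. This is only a sketch on the three example records, with the pairs typed in by hand.

```python
joined = [("Downtown", (400000.0, 2.1)),
          ("Downtown", (240000.0, 2.1)),
          ("Hilltop", (650000.0, 4.5))]

max_price = {}
for neighborhood, (price, inflation) in joined:
    if inflation < 4:  # keep only low-inflation neighborhoods
        # keep the highest price seen so far for this neighborhood
        max_price[neighborhood] = max(price, max_price.get(neighborhood, price))
print(max_price)
```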

## Print DAG

In [23]:
print(total_nextyear.toDebugString())

(4) PythonRDD[34] at collect at <ipython-input-22-4646dc69b0ef>:1 []
 |  MapPartitionsRDD[33] at mapPartitions at PythonRDD.scala:374 []
 |  ShuffledRDD[32] at partitionBy at NativeMethodAccessorImpl.java:-2 []
 +-(4) PairwiseRDD[31] at reduceByKey at <ipython-input-21-6168dd502f27>:1 []
    |  PythonRDD[30] at reduceByKey at <ipython-input-21-6168dd502f27>:1 []
    |  MapPartitionsRDD[28] at mapPartitions at PythonRDD.scala:374 []
    |  ShuffledRDD[27] at partitionBy at NativeMethodAccessorImpl.java:-2 []
    +-(4) PairwiseRDD[26] at join at <ipython-input-18-983588daccd8>:1 []
       |  PythonRDD[25] at join at <ipython-input-18-983588daccd8>:1 []
       |  UnionRDD[24] at union at NativeMethodAccessorImpl.java:-2 []
       |  PythonRDD[22] at RDD at PythonRDD.scala:43 []
       |  MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []
       |  /data/houses.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []
       |  PythonRDD[23] at RDD at PythonRD

## Cache

In [42]:
%time house_prices_nextyear_RDD.reduceByKey(max).collect()

CPU times: user 12 ms, sys: 1 ms, total: 13 ms
Wall time: 165 ms


[(u'Downtown', 408399.99999999994), (u'Hilltop', 679250.0)]

In [43]:
%time house_prices_nextyear_RDD.reduceByKey(min).collect()

CPU times: user 7 ms, sys: 3 ms, total: 10 ms
Wall time: 172 ms


[(u'Downtown', 245039.99999999997), (u'Hilltop', 679250.0)]

In [44]:
house_prices_nextyear_RDD.cache()

PythonRDD[29] at collect at <ipython-input-19-0f19c536a678>:1

In [45]:
%time house_prices_nextyear_RDD.reduceByKey(max).collect()

CPU times: user 8 ms, sys: 2 ms, total: 10 ms
Wall time: 278 ms


[(u'Downtown', 408399.99999999994), (u'Hilltop', 679250.0)]

In [46]:
%time house_prices_nextyear_RDD.reduceByKey(min).collect()

CPU times: user 5 ms, sys: 4 ms, total: 9 ms
Wall time: 178 ms


[(u'Downtown', 245039.99999999997), (u'Hilltop', 679250.0)]
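
`cache()` only marks the RDD to be kept in memory; the data is stored the first time an action computes it, and later actions can then reuse it instead of re-running the join and map. This is consistent with the first action after `cache()` not being faster, since it still has to compute the data. With a dataset this small, the timing differences are likely within noise. The toy class below is a plain-Python analogy of that behavior; `LazyDataset` is hypothetical and unrelated to Spark's API.

```python
class LazyDataset(object):
    """Toy stand-in for an RDD: recomputes on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute
        self._use_cache = False
        self._stored = None
        self.compute_calls = 0

    def cache(self):
        # Only marks the dataset; nothing is computed yet
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._stored is not None:
            return self._stored  # reuse the stored result
        self.compute_calls += 1
        result = self._compute()
        if self._use_cache:
            self._stored = result
        return result

dataset = LazyDataset(lambda: [price * 1.021 for price in (400000.0, 240000.0)])

# Two actions without caching: the computation runs twice
dataset.collect()
dataset.collect()
calls_without_cache = dataset.compute_calls

# Two actions after cache(): the first fills the cache, the second reuses it
dataset.cache()
dataset.collect()
dataset.collect()
calls_with_cache = dataset.compute_calls - calls_without_cache
print(calls_without_cache, calls_with_cache)
```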