# Python Basics

* Everything is an object.
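* Arguments are passed by object reference (call by sharing): a function can mutate the object it receives, but rebinding the parameter name does not affect the caller.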
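
A minimal sketch of what passing a reference means in practice. The function names `append_item` and `rebind` are hypothetical and chosen only for illustration: mutating the received object is visible to the caller, while rebinding the parameter name is not.

```python
def append_item(items):
    # Mutates the object the caller passed in; the caller sees the change.
    items.append(4)


def rebind(items):
    # Rebinds the local name only; the caller's list is untouched.
    items = [0]
    return items


nums = [1, 2, 3]
append_item(nums)
after_append = list(nums)   # snapshot of nums after the mutation

rebind(nums)
after_rebind = list(nums)   # nums is unchanged by the rebinding

# Integers and functions are objects too.
everything_is_object = isinstance(1, object) and isinstance(append_item, object)
```

Here `after_append` and `after_rebind` both hold `[1, 2, 3, 4]`, since only the `append` call touched the caller's list.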

**`==` and `is`**
* The `is` operator tests object identity: it checks whether two names refer to the same instance (the same address in memory).
* The `==` operator tests equality (same value).
* A very common use of `is` and `is not` is to check whether a variable is `None`, since there is only one instance of `None`.

In [2]:
a = [1, 2, 3]
b = a
c = list(a)
print "a is b: ", a is b
print "a is c: ", a is c
print "a == c: ", a == c

a is b:  True
a is c:  False
a == c:  True
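
The `None` check mentioned above can be sketched as follows. `describe` and `AlwaysEqual` are hypothetical names used only for this illustration; the class shows why `is None` is safer than `== None` when an object customizes equality.

```python
def describe(value=None):
    # `is None` tests identity against the single None object.
    if value is None:
        return "no value"
    return "value: {0!r}".format(value)


class AlwaysEqual(object):
    # Hypothetical class whose __eq__ claims equality with anything.
    def __eq__(self, other):
        return True


obj = AlwaysEqual()
equal_to_none = (obj == None)   # True: decided by the custom __eq__
is_none = (obj is None)         # False: identity cannot be overridden
```

Note that `describe(0)` returns `"value: 0"`: a falsy value such as `0` is not `None`, which a plain `if not value:` test would miss.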


**Comparing Sequences and Other Types**

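* Sequences of the same type are compared using lexicographical ordering: the first items are compared, and if they are equal the next items are compared, and so on until a difference is found or one sequence is exhausted.
* Lexicographical ordering for strings uses the ASCII ordering of the individual characters.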

`'ABC' < 'C' < 'Pascal' < 'Python'`

See [docs](https://docs.python.org/2/tutorial/datastructures.html#comparing-sequences-and-other-types).
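
A few comparisons that illustrate the ordering rules; this is a small sketch and the variable names are only for illustration. Every expression below evaluates to `True`.

```python
# Strings: compared character by character.
strings_ordered = 'ABC' < 'C' < 'Pascal' < 'Python'

# Lists and tuples: the first differing pair of items decides.
lists_ordered = [1, 2, 3] < [1, 2, 4]
tuples_ordered = (1, 2, 3, 4) < (1, 2, 4)

# If one sequence is a prefix of the other, the shorter one is smaller.
prefix_smaller = (1, 2) < (1, 2, -1)

# Uppercase letters sort before lowercase ones (ord('Z') < ord('a')).
case_matters = 'Zebra' < 'apple'
```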




## Test programs

Pair each key with every item in its value list

In [2]:
RDD = sc.parallelize([(1, [1, 2, 3]), (2, [4, 5, 76])])
RDD.flatMapValues(lambda x: iter(x)).collect()

[(1, 1), (1, 2), (1, 3), (2, 4), (2, 5), (2, 76)]
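
To make the semantics concrete without a Spark context, here is a plain-Python sketch of the same transformation. `flat_map_values` is a hypothetical helper, not part of PySpark; it only mimics the result on an ordinary list of pairs.

```python
def flat_map_values(pairs, f):
    # Hypothetical helper: for each (key, value), emit (key, item)
    # for every item produced by f(value). The key is left unchanged.
    return [(key, item) for key, value in pairs for item in f(value)]


pairs = [(1, [1, 2, 3]), (2, [4, 5, 76])]
result = flat_map_values(pairs, lambda values: values)
```

`result` holds the same pairs as the `collect()` output above.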

Union

In [3]:
a = sc.parallelize([-1, 0]) # argument is a list
b = sc.parallelize([1, 2, 3])
print "a", a.collect()
print "b", b.collect()
print "union", b.union(a).collect()

a [-1, 0]
b [1, 2, 3]
union [1, 2, 3, -1, 0]


Distinct

In [1]:
a = sc.parallelize([-1, 0, 0, 2, -1])
a.distinct().collect()

[0, 2, -1]
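
A plain-Python analogy, assuming ordinary lists in place of RDDs: `union` behaves like concatenation and keeps duplicates, while `distinct` removes them without promising any particular order. The variable names are illustrative only.

```python
a = [-1, 0]
b = [1, 2, 3, 0]

# Like RDD.union: simple concatenation, duplicates are kept.
union_like = b + a

# Like RDD.distinct: duplicates removed; sorted here only to get a stable order.
distinct_like = sorted(set(union_like))
```

On an RDD, chaining `distinct()` after `union()` removes the duplicates, and sorting the collected list gives a reproducible order for comparisons.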

Key-value map

In [5]:
a = sc.parallelize([(1, 2), (3, 4)])
print a.map(lambda x: (x[0], x[1]+1)).collect()

[(1, 3), (3, 5)]


In [15]:
import ujson

raw_data = sc.parallelize(['["foo", {"bar":["baz", null, 1.0, 2]}]', "{a}"])

def safe_parse(raw_json):
    # Return the parsed object, or None if raw_json is not valid JSON.
    try:
        json_object = ujson.loads(raw_json)
    except ValueError:
        return None  # invalid json
    else:
        return json_object

# Parse each record; invalid JSON becomes None (no filtering is applied here).
json_data = raw_data.map(safe_parse)
json_data.collect()

[[u'foo', {u'bar': [u'baz', None, 1.0, 2]}], None]
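
The output above still contains `None` for the invalid record. A minimal sketch of the parse-then-filter step, using the standard-library `json` module and a plain list instead of `ujson` and an RDD (both substitutions are assumptions made so the example runs anywhere):

```python
import json


def safe_parse(raw_json):
    # Return the parsed object, or None when the input is not valid JSON.
    try:
        return json.loads(raw_json)
    except ValueError:
        return None


raw_records = ['["foo", {"bar": ["baz", null, 1.0, 2]}]', "{a}"]

# Parse every record, then keep only the ones that parsed successfully.
parsed = [safe_parse(r) for r in raw_records]
valid = [p for p in parsed if p is not None]
```

With an RDD, the equivalent filter would be `raw_data.map(safe_parse).filter(lambda x: x is not None)`. One caveat of this design: a record that is the JSON literal `null` also parses to `None` and would be dropped along with the invalid ones.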