PySpark for Data Analysis
============



## Basic Operations

In [1]:
from pyspark import SparkContext 
sc = SparkContext('local', 'pyspark')

In [2]:
int_RDD = sc.parallelize(range(10), 3)

int_RDD

PythonRDD[1] at RDD at PythonRDD.scala:43

In [3]:
int_RDD.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [4]:
int_RDD.glom().collect()

[[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
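
The second argument to `parallelize` is the number of partitions, and `glom()` shows how the elements were distributed across them. The plain-Python sketch below uses one slicing rule (partition boundaries at `i * n // k`) that happens to reproduce the output above. The helper name `split_into_partitions` is made up for illustration, and this is not a claim about how Spark slices data in every version.

```python
def split_into_partitions(items, num_partitions):
    """Cut a sequence into contiguous chunks, one chunk per partition."""
    items = list(items)
    size = len(items)
    # Boundary i sits at floor(i * size / num_partitions)
    bounds = [i * size // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


partitions = split_into_partitions(range(10), 3)
print(partitions)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

With 10 elements and 3 partitions the boundaries fall at 0, 3, 6 and 10, so the last partition receives the extra element.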

### Reading data from a text file

To read a file from the local filesystem, prefix its absolute path with **file://**

```
   textFile("file:///home/vahid/examplefile.txt")
```

If the file is on HDFS (and HDFS is the default filesystem in your Hadoop configuration), the path can be given without a scheme:
```
    textFile("/user/wordcount/input/examplefile.txt")
```
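
The three slashes in a local `file:///` address are easy to mistype. As a small sketch, the standard library can build the URI from an absolute path; the helper name `local_uri` is hypothetical, and the path is the example path used above.

```python
from pathlib import PurePosixPath


def local_uri(path):
    """Return a file:// URI for an absolute POSIX path."""
    # as_uri() raises ValueError for relative paths
    return PurePosixPath(path).as_uri()


uri = local_uri("/home/vahid/examplefile.txt")
print(uri)  # file:///home/vahid/examplefile.txt
```

The resulting string can then be passed to `textFile`.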

In [11]:
text = sc.textFile('file:///home/vahid/Github/DataScience/bigdata-platforms/data/31987-0.txt')

text

MapPartitionsRDD[20] at textFile at NativeMethodAccessorImpl.java:-2

#### Take the first k elements (lines)

```
text.take(k)
```

In [12]:
text.take(3)

['The Project Gutenberg EBook of Territory in Bird Life, by H. Eliot Howard',
 '',
 'This eBook is for the use of anyone anywhere at no cost and with']

## Narrow Transformation


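A narrow transformation computes each output partition from a single input partition, so no data has to move between partitions.

  * **map:** applies a function to each element of the RDD, producing exactly one output per input.

  * **flatMap:** like map, except that the function may return zero or more outputs for each element; the results are flattened into a single RDD.

  * **filter:** applies a boolean function to each element and keeps only the elements for which it returns true.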
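
To make the difference between the three operations concrete, here is a plain-Python sketch of their semantics on an ordinary list. It does not use Spark at all, and the sample lines are made up for illustration.

```python
from itertools import chain

lines = ["Love looks not", "", "with the eyes"]

# map: exactly one output per input element
mapped = [line.lower() for line in lines]

# flatMap: each input yields zero or more outputs, flattened into one list
# (the empty line contributes nothing)
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

# filter: keep only the elements for which the predicate is true
filtered = [word for word in flat_mapped if len(word) > 3]

print(mapped)       # ['love looks not', '', 'with the eyes']
print(flat_mapped)  # ['Love', 'looks', 'not', 'with', 'the', 'eyes']
print(filtered)     # ['Love', 'looks', 'with', 'eyes']
```

Note that `map` preserves the number of elements, while `flatMap` and `filter` can change it.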

In [14]:
example = sc.textFile('data/example.txt')

# print the first line to make sure it's working
print(example.take(1))

['Love looks not with the eyes, but with the mind; and therefore is winged Cupid painted blind.']


In [15]:
def lower(line):
    return line.lower()

# apply lower() to each element:
example.map(lower).take(1)

['love looks not with the eyes, but with the mind; and therefore is winged cupid painted blind.']

In [20]:
def split(line):
    return line.split()

# split each line into words; a line can yield zero or more outputs --> flatMap
example.flatMap(split).take(5)

['Love', 'looks', 'not', 'with', 'the']

In [23]:
def create_keyval(word):
    return (word, 1)

# Create key-value pairs for each split element --> map
example.flatMap(split).map(create_keyval).take(5)

[('Love', 1), ('looks', 1), ('not', 1), ('with', 1), ('the', 1)]

In [22]:
def filterlen(word):
    return len(word) > 5

# filter split elements based on their character lengths
example.flatMap(split).filter(filterlen).collect()

['therefore',
 'winged',
 'painted',
 'blind.',
 'dreams',
 'little',
 'rounded',
 'sleep.']

## Wide Transformation
 
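A wide transformation needs data from several input partitions to build one output partition, so it shuffles data between partitions.

  * **groupByKey:** collects all values that share a key into one iterable per key.
  * **reduceByKey:** merges the values of each key with an associative, commutative function of two arguments.
  * **repartition:** redistributes the elements of an RDD across a given number of partitions.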
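
The two key-based operations can be sketched in plain Python with dictionaries. This only illustrates what they return for a list of key-value pairs, not how Spark executes them; the helper names `group_by_key` and `reduce_by_key` and the sample pairs are made up for the example.

```python
def group_by_key(pairs):
    """Collect every value seen for a key into a list."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped


def reduce_by_key(pairs, func):
    """Fold the values of each key together with a two-argument function."""
    reduced = {}
    for key, value in pairs:
        reduced[key] = func(reduced[key], value) if key in reduced else value
    return reduced


pairs = [("with", 1), ("a", 1), ("with", 1), ("a", 1), ("with", 1)]

print(group_by_key(pairs))                        # {'with': [1, 1, 1], 'a': [1, 1]}
print(reduce_by_key(pairs, lambda a, b: a + b))   # {'with': 3, 'a': 2}
```

Grouping keeps every individual value, whereas reducing keeps a single running result per key.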

In [29]:
pairs_RDD = example.flatMap(split).map(create_keyval)

for key,vals in pairs_RDD.groupByKey().take(5):
    print(key, list(vals))

on, [1]
stuff [1]
few, [1]
none. [1]
a [1, 1]


In [32]:
def sumvals(a, b):
    return a + b

pairs_RDD.reduceByKey(sumvals).take(10)

[('on,', 1),
 ('stuff', 1),
 ('few,', 1),
 ('none.', 1),
 ('a', 2),
 ('all,', 1),
 ('to', 1),
 ('with', 3),
 ('but', 1),
 ('Love', 2)]
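
Because the function given to `reduceByKey` is associative and commutative, the values can be partially combined inside each partition before the per-partition results are merged. The plain-Python sketch below imitates that two-stage idea on two hand-made partitions. The helper names and the sample data are assumptions for illustration, not Spark internals.

```python
def combine(pairs, func):
    """Fold the values of each key together with a two-argument function."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


def reduce_by_key_partitioned(partitions, func):
    # Stage 1: combine within each partition independently
    partials = [combine(part, func) for part in partitions]
    # Stage 2: merge the much smaller per-partition results
    merged_input = [item for partial in partials for item in partial.items()]
    return combine(merged_input, func)


partitions = [
    [("with", 1), ("a", 1), ("with", 1)],
    [("a", 1), ("with", 1), ("Love", 1)],
]

counts = reduce_by_key_partitioned(partitions, lambda a, b: a + b)
print(counts)  # {'with': 3, 'a': 2, 'Love': 1}
```

Only one pair per key and partition reaches the second stage, which is why summing counts this way moves less data than grouping all the values first.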

# Appendix: Installing Spark on Ubuntu


Download the Spark package (this walkthrough uses `spark-1.5.2-bin-hadoop2.6.tgz`) and extract it:


```
tar xvfz spark-1.5.2-bin-hadoop2.6.tgz
# create the parent folder first; sudo is not needed inside $HOME
mkdir -p $HOME/apps
mv spark-1.5.2-bin-hadoop2.6 $HOME/apps/spark
cd $HOME/apps/spark/
```

Next, set the `SPARK_HOME` environment variable and add its `bin` directory to `PATH`:

```
export SPARK_HOME=$HOME/apps/spark 
export PATH=$SPARK_HOME/bin:$PATH 
```

Now you can launch the PySpark shell by running `pyspark`:

<img src='pyspark-launch.png' >

### Reduce the verbosity level

By default, pyspark prints a large number of INFO-level log messages for every command you run, which buries the actual output. To reduce the verbosity, copy the template file in the conf folder

```cp $SPARK_HOME/conf/log4j.properties.template $SPARK_HOME/conf/log4j.properties```

and edit the copy, replacing `INFO` with `WARN` in the `log4j.rootCategory` line.
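
If you prefer to script this edit, the sketch below rewrites the text of the properties file. It assumes the template contains a line of the form `log4j.rootCategory=INFO, console` and changes only that line; the helper name `quiet_log4j` and the sample text are hypothetical.

```python
import re


def quiet_log4j(properties_text, level="WARN"):
    """Set the root logger level, leaving every other line unchanged."""
    # Assumption: the root logger is configured on a line starting with
    # "log4j.rootCategory=" followed by the level name.
    pattern = r"^(log4j\.rootCategory=)\w+"
    return re.sub(pattern, r"\g<1>" + level, properties_text, flags=re.MULTILINE)


sample = (
    "# Set everything to be logged to the console\n"
    "log4j.rootCategory=INFO, console\n"
)
print(quiet_log4j(sample))
```

To apply it, read `$SPARK_HOME/conf/log4j.properties`, pass its contents through the function, and write the result back to the same file.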



### Using pyspark in IPython

To use pyspark in an IPython notebook, add a new file to the startup directory of your IPython profile:

```
vim $HOME/.ipython/profile_default/startup/00-pyspark-setup.py
```
and add the following contents to the file:



In [None]:
import os
import sys

# Configure the environment
if 'SPARK_HOME' not in os.environ:
    home_folder = os.environ['HOME']
    os.environ['SPARK_HOME'] = os.path.join(home_folder, 'apps/spark')

# Create a variable for our root path
SPARK_HOME = os.environ['SPARK_HOME']

# Add the PySpark/py4j to the Python Path
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "build"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))

Now, you should be able to run ipython and use pyspark. Try running the following commands:

In [1]:
print(SPARK_HOME)

/home/vahid/apps/spark


In [2]:
from pyspark import SparkContext

sc = SparkContext('local', 'pyspark')

sc.parallelize(range(10), 3)

PythonRDD[1] at RDD at PythonRDD.scala:43

If you get an import error for py4j, also run these two commands in your bash shell (adjust the py4j version to match the zip file in `$SPARK_HOME/python/lib`):

```
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
```
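
The py4j version in the file name changes between Spark releases, so a hardcoded path can go stale. As a hedged alternative, the startup script could look the zip file up instead. The sketch below assumes the `python/lib/py4j-<version>-src.zip` layout shown above; the helper names `find_py4j_zip` and `pyspark_paths` are hypothetical.

```python
import glob
import os
import sys


def find_py4j_zip(spark_home):
    """Return the path of the bundled py4j source zip, or None if absent."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    return matches[-1] if matches else None


def pyspark_paths(spark_home):
    """List the entries that need to be on the Python path."""
    paths = [os.path.join(spark_home, "python")]
    py4j_zip = find_py4j_zip(spark_home)
    if py4j_zip is not None:
        paths.append(py4j_zip)
    return paths


# Only touch sys.path when SPARK_HOME is actually set
spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    for path in pyspark_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)
```

This has the same effect as the two `export PYTHONPATH` lines, but only for the Python process that runs it.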