In [1]:
import numpy as np
import urllib.request

# PySpark Tuto 1 - First Steps, Creating an RDD

The first notebook was just a teaser to get you interested in PySpark.
In this second notebook we will look in a little more detail at the structure of Spark and at how to create Resilient Distributed Datasets (RDDs).

__Important Note__ :
I didn't mention it in the first notebook, but if you are familiar with Python you probably wondered where `sc` comes from when we ran `sc.parallelize()`.
We are actually running the notebook in a Spark shell, in which a `SparkContext` object has already been created and named `sc`. Look :

In [2]:
sc

Outside the shell, the `SparkContext` is the first object a Spark program has to create, usually from a `SparkConf` object that holds the application settings. You don't have to worry about it here, as it is created automatically in our Spark shell.
Keep this in mind if you decide to venture outside of the shell.
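
If you do venture outside of the shell, a minimal sketch of creating the context yourself could look like the cell below. The helper name `create_spark_context`, the application name `rdd-tutorial` and the master URL `local[*]` are assumptions chosen for illustration; adapt them to your own setup.

```python
def create_spark_context(app_name="rdd-tutorial", master="local[*]"):
    """Build a SparkContext for a standalone script (hypothetical helper)."""
    # Imported inside the function so this cell still loads where pyspark is absent.
    from pyspark import SparkConf, SparkContext

    # SparkConf carries the application settings; "local[*]" is an assumed
    # master URL meaning "run locally, using all available cores".
    conf = SparkConf().setAppName(app_name).setMaster(master)
    # getOrCreate reuses an already running context instead of creating a second one.
    return SparkContext.getOrCreate(conf)
```

In a standalone script you would call `sc = create_spark_context()` once at the start, and `sc.stop()` when you are done.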

## Creating RDD

RDDs are the central part of PySpark.
Once created, an RDD supports two groups of operations : _transformations_ and _actions_ (we'll see the differences later).  
But first we need to create RDDs...

There are two ways of creating an RDD : using `parallelize`, or loading it from external files.

### Using `sc.parallelize()`

Exactly like we did in the previous notebook :

In [3]:
n_samples = 1000000
rdd = sc.parallelize(np.arange(0,n_samples))

In [4]:
rdd

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:480

Note that when we call `sc.parallelize`, Spark automatically chooses the number of partitions, i.e. the number of pieces into which the dataset is split. According to the [programming guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html) : __"Typically you want 2-4 partitions for each CPU in your cluster"__  
We can check how many partitions our dataset has been divided into :

In [5]:
rdd.getNumPartitions()

4

It is also possible to set the number of partitions explicitly when creating the RDD :

In [6]:
rdd = sc.parallelize(np.arange(0,n_samples), 8)

In [7]:
rdd.getNumPartitions()

8
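
To get a feeling for what "dividing a dataset into partitions" means, here is a plain-Python sketch that cuts a list into contiguous slices of nearly equal size. It is only an illustration of the idea; `split_into_partitions` is a hypothetical helper, not Spark's actual implementation.

```python
def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous, near-equal slices."""
    n = len(items)
    # Integer boundaries i * n // k keep the slice sizes within one element of each other.
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


parts = split_into_partitions(list(range(10)), 4)
print([len(p) for p in parts])  # size of each slice
```

On a real RDD, `rdd.glom().map(len).collect()` returns the actual number of elements held by each partition.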

### Using External Files

An RDD can also be created from external files, or from any storage source supported by Hadoop.
The main function to create an RDD from a text file is `sc.textFile`.

Below we simply download a dataset and create an RDD from it.

In [8]:
# downloading the dataset (can take some time)
urllib.request.urlretrieve('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 'adult.data')
urllib.request.urlretrieve('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names', 'adult.names')

('adult.data', <http.client.HTTPMessage at 0x106579320>)

In [9]:
# creating the rdd from the adult dataset
data_file = 'adult.data'
adult_rdd = sc.textFile(data_file)

Let's now look at the first 5 entries of the dataset :

In [10]:
adult_rdd.take(5)

['39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 '50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K',
 '38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K',
 '53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K',
 '28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K']

At this stage we have only loaded the data into an RDD, and we can see that each element is one long string, corresponding to one line of the original file.
I can also already tell you that the dataset contains missing values.  
This gives us the opportunity, in the next notebook, to look at the different _operations_ we can perform on an RDD and at the difference between _transformations_ and _actions_.
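
As a small preview, the plain-Python sketch below shows how one of these long strings could be split into fields. The helper `parse_line` is hypothetical, and the `?` marker for missing values is an assumption about this dataset that you should check against `adult.names`.

```python
MISSING = "?"  # assumption: marker used for unknown values in adult.data


def parse_line(line):
    """Split one raw line into stripped fields, mapping the missing marker to None."""
    fields = [field.strip() for field in line.split(",")]
    return [None if field == MISSING else field for field in fields]


# First entry returned by adult_rdd.take(5) above
sample = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
          "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
record = parse_line(sample)
print(len(record), record[:4])
```

Every field is still a string at this point; converting the numeric columns is left for later.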

### Notes on Partitions and Repartitioning

When using `sc.textFile`, PySpark chooses by itself how many partitions to create.
If we want a different number of partitions, there are several ways to proceed :

In [11]:
# Force the minimum number of partitions when calling sc.textFile
adult_rdd = sc.textFile(data_file, minPartitions=4)
adult_rdd.getNumPartitions()

4

In [12]:
# Repartition the RDD (this triggers a full shuffle of the data)
adult_rdd = sc.textFile(data_file).repartition(6)
adult_rdd.getNumPartitions()

6

__Note__ : If you want to reduce the number of partitions, it is recommended to use a different function named `coalesce`, which avoids a full shuffle (by default it cannot increase the number of partitions) :

In [13]:
adult_rdd = sc.textFile(data_file).coalesce(1)
adult_rdd.getNumPartitions()

1
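
Why can't `coalesce` increase the number of partitions? The plain-Python sketch below gives the intuition, under the simplifying assumption that coalescing just merges neighbouring partitions as whole blocks. `coalesce_sketch` is a hypothetical helper; the real Spark logic is more involved.

```python
def coalesce_sketch(partitions, num_partitions):
    """Merge neighbouring partitions into at most num_partitions groups."""
    current = len(partitions)
    if num_partitions >= current:
        # Merging can only lower the count, so asking for more changes nothing.
        return partitions
    merged = []
    for i in range(num_partitions):
        start = i * current // num_partitions
        end = (i + 1) * current // num_partitions
        # Whole partitions are concatenated; no element is moved individually.
        merged.append([x for part in partitions[start:end] for x in part])
    return merged


six = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
print(len(coalesce_sketch(six, 2)))   # fewer partitions than before
print(len(coalesce_sketch(six, 10)))  # request ignored, count unchanged
```

In PySpark itself, `coalesce` also accepts `shuffle=True`, which does allow a higher partition count at the cost of a shuffle; this is what `repartition` relies on.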