# First steps with PySpark

## Learning objectives

- Get familiar with PySpark RDDs
- Understand the concept of laziness (lazy evaluation)

In [0]:
import pyspark

spark = (pyspark.sql.SparkSession.builder
         .master('local')
         .appName('Introduction to PySpark')
         .config("spark.some.config.option", "some-value")
         .getOrCreate())

sc = spark.sparkContext

In [0]:
# We need an S3 filepath
S3_RESOURCE = 's3'
SCHEME = 's3a'
# TODO: assign a BUCKET_NAME and PREFIX
BUCKET_NAME = ''
PREFIX = ''
FILENAME = 'tears_in_rain.txt'

In [0]:
# This is just a utility function
def get_s3_path(key, bucket_name=BUCKET_NAME, scheme=SCHEME):
    return f"{scheme}://{bucket_name}/{key}"
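
As a quick illustration of how the pieces fit together, the sketch below builds a full path from a prefix and a file name. The bucket and prefix values are hypothetical placeholders, and the function is redefined without its default bucket so the example is self-contained.

```python
# Hypothetical values, for illustration only; replace them with your own.
EXAMPLE_BUCKET = "my-example-bucket"
EXAMPLE_PREFIX = "datasets/intro"
FILENAME = "tears_in_rain.txt"


def get_s3_path(key, bucket_name, scheme="s3a"):
    """Build a URI of the form scheme://bucket/key."""
    return f"{scheme}://{bucket_name}/{key}"


# The object key is the prefix and the file name joined with "/".
key = f"{EXAMPLE_PREFIX}/{FILENAME}"
filepath = get_s3_path(key, bucket_name=EXAMPLE_BUCKET)
print(filepath)  # s3a://my-example-bucket/datasets/intro/tears_in_rain.txt
```

Passing `bucket_name` explicitly avoids relying on a default that was captured when the function was defined.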

In [0]:
### BEGIN STRIP ###
# This is required for local work

### END STRIP ###

In [0]:
# TODO: Load the file from `filepath` into a PySpark RDD
### BEGIN STRIP ###
from pathlib import Path

path = Path("FileStore", "shared_uploads", "thibaudchevrier@gmail.com", "tears_in_rain.txt")
tears_in_rain_rdd = sc.textFile(str(path))

### END STRIP ###

In [0]:
# TODO: print out text_file
### BEGIN STRIP ###

tears_in_rain_rdd.collect()

### END STRIP ###

That doesn't tell us much. How would you view just the first 3 elements of this RDD?

What's the type of `text_file`?

In [0]:
# TODO: check the type of `text_file`
### BEGIN STRIP ###

type(tears_in_rain_rdd)

### END STRIP ###

It's a PySpark `RDD`, which means we can call **actions** on it, and each action returns a result.
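
Actions such as `.take(...)` and `.collect()` are what make Spark actually compute something, whereas transformations such as `.map(...)` are lazy and only describe the work to be done. As a rough plain-Python analogy (not Spark itself), the built-in `map` is lazy too. The sketch below uses made-up sample lines and a hypothetical `tracked_len` helper to record which lines really get processed.

```python
from itertools import islice

# Made-up sample lines standing in for the contents of an RDD.
lines = ["alpha", "be", "gamma delta", "epsilon"]

processed = []  # every line the function has actually been applied to


def tracked_len(line):
    """Hypothetical helper: compute a length and remember that we did."""
    processed.append(line)
    return len(line)


# Like an RDD transformation, building the lazy map runs nothing yet.
lazy_lengths = map(tracked_len, lines)
processed_before = list(processed)  # still empty

# Like the action take(2), asking for results triggers the computation,
# and only for as many elements as are needed.
first_two = list(islice(lazy_lengths, 2))
processed_after = list(processed)  # only the first two lines

print(processed_before, first_two, processed_after)
# [] [5, 2] ['alpha', 'be']
```

The analogy is loose: Spark also splits the data into partitions and, unless an RDD is cached, recomputes it each time an action is called.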

Let's try a first action, `.take(...)`.

In [0]:
# TODO: take the first 3 elements of the RDD `text_file`
### BEGIN STRIP ###

tears_in_rain_rdd.take(3)

### END STRIP ###

Now for another action: this time we want the result to contain every element of the RDD.

In [0]:
# TODO: collect all elements of `text_file`
### BEGIN STRIP ###

tears_in_rain_rdd.collect()

### END STRIP ###

How many lines are there in `text_file`?

In [0]:
# TODO: count the number of lines in `text_file`
### BEGIN STRIP ###

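# `.count()` is an action that counts the elements without
# first collecting the whole RDD onto the driver
tears_in_rain_rdd.count()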

### END STRIP ###

What's the length of each line?

In [0]:
# TODO: call `.map(...)` on your RDD and give it a function that computes the length of a string: `lineLengths`
#
# NOTE: the end of the previous line, ": `lineLengths`", is how you should name your result variable
#
### BEGIN STRIP ### 

line_length = tears_in_rain_rdd.map(lambda line: len(line))

### END STRIP ###

In [0]:
# TODO: take the first 3 elements of lineLengths
### BEGIN STRIP ###

line_length.take(3)

### END STRIP ###

In [0]:
# TODO: collect all elements of lineLengths
### BEGIN STRIP ###

line_length.collect()

### END STRIP ###

What's the average length?

In [0]:
# TODO: compute the average value of `lineLengths`: `avgLength`
### BEGIN STRIP ###

avg_length = line_length.mean()

### END STRIP ###

In [0]:
# TODO: what's the type of `avgLength`? Print it out.
### BEGIN STRIP ###

type(avg_length)

### END STRIP ###

In [0]:
# TODO: print out `avgLength`
### BEGIN STRIP ###
print(avg_length)
### END STRIP ###

Now we want to compute the total length of the document.

In [0]:
# TODO: compute the sum of all `lineLengths`: `totalLength`
### BEGIN STRIP ###
total_length = line_length.sum()
### END STRIP ###

In [0]:
# TODO: what's the type of `totalLength`
### BEGIN STRIP ###
type(total_length)
### END STRIP ###

In [0]:
# TODO: print out `totalLength`
### BEGIN STRIP ###
print(total_length)
### END STRIP ###

## Bonus: another way to compute the sum would be to use a `reducer`
This is a stepping-stone exercise to prepare you for the next (optional) assignment.

Your goal is to compute the sum of `lineLengths`, just as we did above, but this time using `.reduce(...)`.  
Here is the link to the [documentation](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.reduce).
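
To build intuition first, here is a plain-Python sketch of what a reduction does, using `functools.reduce` on a small made-up list rather than an RDD. The hypothetical `traced_add` helper logs each pairwise combination. Spark's `.reduce(...)` follows the same idea but combines partial results across partitions, which is why the function you give it should be commutative and associative.

```python
from functools import reduce

# Made-up line lengths standing in for the contents of `lineLengths`.
lengths = [5, 2, 11, 7]

steps = []  # each (accumulated, next_value) pair the reducer sees


def traced_add(accumulated, value):
    """Hypothetical helper: add two numbers and log the pair."""
    steps.append((accumulated, value))
    return accumulated + value


# reduce folds the list pairwise: ((5 + 2) + 11) + 7
total = reduce(traced_add, lengths)

print(steps)  # [(5, 2), (7, 11), (18, 7)]
print(total)  # 25
```

On a local list the order of combination is fixed, left to right. On an RDD the grouping depends on how the data is partitioned, so only the final total is guaranteed to match.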

In [0]:
# Try to compute the total sum, but this time using `.reduce(...)`
### BEGIN STRIP ###
from operator import add
line_length.reduce(add)
### END STRIP ###