Why Spark? Because ... Pyspark
==========

![Related
image](https://miro.medium.com/proxy/1*sQGVLk43kXJTEw1mtJRoDw.png)

Hadoop was the first open source system that introduced us to the
MapReduce paradigm of programming, and Spark is the system that made it
faster, much much faster (up to 100x on some in-memory workloads).

There used to be a lot of data movement in Hadoop as it used to write
intermediate results to the file system.

This affected the speed at which you could do analysis.

Spark provided us with an in-memory model, so Spark doesn’t write too
much to the disk while working.

Simply, Spark is faster than Hadoop and a lot of people use Spark now.

***So without further ado let us get started.***

Load Some Data
==============

The next step is to upload some data we will use to learn Spark. We will end up using multiple datasets by the end of this, but let us start with something very simple.

Let us add the file `shakespeare.txt`.

You can see that the file is stored at `shakespeare/shakespeare.txt`.

In [2]:
# To download the data you would use the following commands:
#!wget -P shakespeare https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
#!mv shakespeare/t8.shakespeare.txt shakespeare/shakespeare.txt
!ls -l shakespeare


total 5332
-rw-r--r-- 1 root root 5458199 Apr 23  2020 shakespeare.txt



Our First Spark Program
=======================

I like to learn by examples so let’s get done with the “Hello World” of
Distributed computing: ***The WordCount Program.***

First, we need to create a `SparkSession`:

In [5]:
from pyspark.sql import SparkSession

spark = SparkSession\
        .builder\
        .appName("shakespeare")\
        .master("spark://spark-master:7077")\
        .config("hive.metastore.uris", "thrift://hive-metastore:9083")\
        .config("spark.sql.warehouse.dir", "hdfs://namenode:8020/user/hive/warehouse")\
        .config("spark.executor.memory", "1g")\
        .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.2.0")\
        .enableHiveSupport()\
        .getOrCreate()

sc = spark.sparkContext

sc.setLogLevel("ERROR")


In [6]:
from pyspark.sql.functions import to_json,col
from pyspark.sql.types import *
from os.path import abspath

In [7]:
# Distribute the data - Create a RDD 
lines = sc.textFile("shakespeare/shakespeare.txt")

In [12]:
x = 'This is the 100th Etext file presented by Project Gutenberg, and'
x.split(' ')

['This',
 'is',
 'the',
 '100th',
 'Etext',
 'file',
 'presented',
 'by',
 'Project',
 'Gutenberg,',
 'and']

Now we can write our program:

In [13]:
# Distribute the data - Create a RDD 
lines = sc.textFile("shakespeare/shakespeare.txt")

# Create a list with all words, Create tuple (word,1), reduce by key i.e. the word
counts = (lines.flatMap(lambda x: x.split(' '))          
                  .map(lambda x: (x, 1))                 
                  .reduceByKey(lambda x,y : x + y))

# get the output on local
output = counts.take(10)                    

# print output
for (word, count) in output:    
    if word.strip() != "":
        print(f"'{word}' occurs {count} times")



'to' occurs 15623 times
'thine' occurs 315 times
'friend.' occurs 73 times
'Tell' occurs 179 times
'more' occurs 1608 times
'thou' occurs 4247 times
'Exeunt.' occurs 122 times
'III.' occurs 141 times
'French' occurs 127 times
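
Only nine words are printed although we took ten pairs, because the loop skips keys that are empty or whitespace. Such keys most likely come from `split(' ')`, which produces an empty string wherever spaces are repeated or a line starts with a space. Here is a minimal plain-Python sketch of that behaviour; the sample line is made up for illustration.

```python
# Hypothetical sample line with leading and repeated spaces.
line = "  To be,  or not to be"

# Splitting on a single space keeps an empty string for every extra space.
split_on_space = line.split(' ')
print(split_on_space)       # ['', '', 'To', 'be,', '', 'or', 'not', 'to', 'be']

# split() with no argument splits on runs of whitespace and drops the empties.
split_on_whitespace = line.split()
print(split_on_whitespace)  # ['To', 'be,', 'or', 'not', 'to', 'be']
```

Using `x.split()` inside `flatMap` would be one way to avoid the empty keys. That is a suggestion, not what the program above does.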


                                                                                

In [None]:
# print(counts.toDebugString().decode('utf-8'))

So that is a small example that counts how often each word occurs in the
document and prints up to 10 of those counts.

Most of the work gets done in the second command.

Don’t worry if you are not able to follow this yet as I still need to
tell you about the things that make Spark work.

But before we get into Spark basics, let us refresh some of our Python
basics. Understanding Spark becomes a lot easier if you have used
[functional programming with Python](https://amzn.to/2SuAtzL).

For those of you who haven’t used it, below is a brief intro.

A functional approach to programming in Python 
==============================================

![Related
image](https://miro.medium.com/proxy/1*nCX6bsSNUF_v2hFKgnaQIA.png)

Map 
------

`map` applies a function to every element of a list (or any other
iterable). Say you want to apply some function to every element in a list.

You can do this with a for loop, but `map` together with a lambda function
lets you do it in a single line.

In [14]:
my_list = [1,2,3,4,5,6,7,8,9,10]

# Lets say I want to square each term in my_list.
squared_list = map(lambda x:x**2, my_list)

print(list(squared_list))

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]


In the above example, you could think of `map`
as a function which takes two arguments — A function and a list.

It then applies the function to every element of the list.

What lambda allows you to do is write an inline function. Here the
part `lambda x:x**2` defines a function that takes x as input and returns x².

You could have also provided a proper function in place of lambda. For
example:

In [15]:
def squared(x):
    return x**2

my_list = [1,2,3,4,5,6,7,8,9,10]

# Lets say I want to square each term in my_list.
squared_list = map(squared, my_list)

print(list(squared_list))

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]


The same result, but the lambda expressions make the code compact and a
lot more readable.
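
One detail explains the `list(...)` call in the cells above: in Python 3, `map` returns a lazy iterator rather than a list. The sketch below, in plain Python, shows this and also the equivalent list comprehension.

```python
my_list = [1, 2, 3, 4, 5]

# In Python 3, map returns a lazy iterator, not a list.
squared_iter = map(lambda x: x**2, my_list)
print(type(squared_iter).__name__)   # map

# list() consumes the iterator and materializes the results.
squared_list = list(squared_iter)
print(squared_list)                  # [1, 4, 9, 16, 25]

# The iterator is now exhausted, so a second pass yields nothing.
second_pass = list(squared_iter)
print(second_pass)                   # []

# A list comprehension computes the same values eagerly.
squared_comprehension = [x**2 for x in my_list]
print(squared_comprehension == squared_list)   # True
```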

Filter
---------

The other function that is used extensively is the `filter` function. It takes two arguments: a condition (a function that returns true or false) and the list to filter.

If you want to keep only the elements of your list that satisfy some condition, you use
`filter`.


In [16]:
my_list = [1,2,3,4,5,6,7,8,9,10]

# Lets say I want only the even numbers in my list.
filtered_list = filter(lambda x:x%2==0,my_list)
print(list(filtered_list))

[2, 4, 6, 8, 10]


Reduce
---------

The next function I want to talk about is the reduce function. This
function will be the workhorse in Spark.

This function takes two arguments — a function to reduce that takes two
arguments, and a list over which the reduce function is to be applied.


In [17]:
import functools
my_list = [1,2,3,4,5]

# Lets say I want to sum all elements in my list.
sum_list = functools.reduce(lambda x,y:x+y,my_list)
print(sum_list)

15


In [18]:
import functools
my_list = [1,2,3,4]

# Lets say I want to multiply all elements in my list.
product_list = functools.reduce(lambda x,y:x*y,my_list)
print(product_list)

24


In Python 2, `reduce` was a built-in function; in Python 3 we have to
import it from `functools`.

In the first example the lambda function takes in two values x, y and returns their sum.
Intuitively you can think that the reduce function works as:

```
Reduce function first sends 1,2    ; the lambda function returns 3
Reduce function then sends 3,3     ; the lambda function returns 6
Reduce function then sends 6,4     ; the lambda function returns 10
Reduce function finally sends 10,5 ; the lambda function returns 15
```
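
The same trace can be produced by the interpreter. The sketch below wraps the addition in a hypothetical helper, `traced_add`, that records every call made by `functools.reduce`. It also shows that `itertools.accumulate` returns the running totals directly.

```python
import functools
import itertools

my_list = [1, 2, 3, 4, 5]
calls = []  # records every (x, y, result) triple that reduce produces

def traced_add(x, y):
    """Add two numbers and remember the call for inspection."""
    result = x + y
    calls.append((x, y, result))
    return result

total = functools.reduce(traced_add, my_list)
for x, y, result in calls:
    print(f"reduce sends {x},{y}; the function returns {result}")

# itertools.accumulate exposes the same running totals directly.
running_totals = list(itertools.accumulate(my_list))
print(running_totals)   # [1, 3, 6, 10, 15]
```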

A condition on the function we pass to reduce in Spark, where the work is
split across machines, is that it must be:

-   commutative, that is a + b == b + a, and
-   associative, that is (a + b) + c == a + (b + c).

In the above case, we used sum which is **commutative as well as
associative**. Other functions that we could have used: `max`, `min`, `*` etc.
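
To see why these two properties matter once data is split up, here is a plain-Python sketch. The helper `reduce_in_partitions` and the two ways of splitting the list are hypothetical; they only mimic the shape of a distributed reduce and are not how Spark is implemented.

```python
import functools

def reduce_in_partitions(func, partitions):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [functools.reduce(func, part) for part in partitions]
    return functools.reduce(func, partials)

data = [1, 2, 3, 4, 5]
two_parts = [[1, 2], [3, 4, 5]]      # one hypothetical split of the data
three_parts = [[1], [2, 3], [4, 5]]  # another hypothetical split

add = lambda x, y: x + y
subtract = lambda x, y: x - y

# Addition gives 15 however the data is split.
add_results = [reduce_in_partitions(add, p) for p in (two_parts, three_parts)]
print(add_results)              # [15, 15]

# Subtraction is neither commutative nor associative, so the answer
# depends on the split and differs from the sequential result.
sequential = functools.reduce(subtract, data)
sub_results = [reduce_in_partitions(subtract, p) for p in (two_parts, three_parts)]
print(sequential, sub_results)  # -13 [5, 3]
```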

Moving Again to Spark
=====================

As we have now got the fundamentals of Python functional programming out
of the way, let's head back to Spark.

But first, let us delve a little bit into how Spark works. Spark
actually consists of two things: a driver and workers.

Workers normally do all the work and the driver makes them do that work.

RDD
---

An RDD (Resilient Distributed Dataset) is a parallelized data structure
that gets distributed across the worker nodes. RDDs are the basic units
of Spark programming.

In our wordcount example, in the first line

```py
lines = sc.textFile("shakespeare/shakespeare.txt")
```

we took a text file and distributed it across worker nodes so that they
can work on it in parallel. We could also parallelize lists using the
function `sc.parallelize`.

For example:


In [19]:
data = [1,2,3,4,5,6,7,8,9,10]
new_rdd = sc.parallelize(data,4)
new_rdd.collect()

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


In Spark, we can do two different types of operations on RDD:
Transformations and Actions.

1.  **Transformations:** Create new datasets from existing RDDs
2.  **Actions:** Mechanism to get results out of Spark
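
One practical difference between the two is that Spark evaluates transformations lazily: nothing is computed until an action asks for a result. A Python generator behaves in a loosely similar way, so the sketch below uses one as an analogy. It is plain Python, not Spark code, and the `log` list exists only to show when the function runs.

```python
log = []  # records when the squaring function actually runs

def square(x):
    log.append(x)
    return x ** 2

data = [1, 2, 3, 4]

# "Transformation": building the generator runs nothing yet.
squared = (square(x) for x in data)
calls_before = len(log)
print(calls_before)          # 0

# "Action": consuming the generator triggers the computation.
result = list(squared)
calls_after = len(log)
print(result, calls_after)   # [1, 4, 9, 16] 4
```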

Transformation Basics
=====================

![Image for
post](https://miro.medium.com/max/2560/1*LP9yglc4UeUxDFBoTlfS9w.png)

So let us say you have got your data in the form of an RDD.

To reiterate, your data is now accessible to the worker machines. You want
to do some transformations on the data now.

You may want to filter, apply some function, etc.

In Spark, this is done using Transformation functions.

Spark provides many transformation functions. You can see a
comprehensive list
[**here**](http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations).
Some of the main ones that I use frequently are:

Map:
-------

Applies a given function to every element of an RDD.

Note that the syntax is a little bit different from Python, but it
essentially does the same thing. Don’t worry about `collect` yet. For now, just think of it as a function that collects the data in `squared_rdd` back to a list.


In [20]:
data = [1,2,3,4,5,6,7,8,9,10]
rdd = sc.parallelize(data,4)
squared_rdd = rdd.map(lambda x:x**2)
result_list = squared_rdd.collect()
print(result_list)

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]



Filter:
----------

Again no surprises here. Takes as input a condition and keeps only those
elements that fulfill that condition.


In [22]:
data = [1,2,3,4,5,6,7,8,9,10]
rdd = sc.parallelize(data,4)
filtered_rdd = rdd.filter(lambda x:x%2!=0)
filtered_rdd.collect()

[1, 3, 5, 7, 9]


distinct:
------------

Returns only distinct elements in an RDD.

In [26]:
data = [1,2,2,2,2,3,3,3,3,4,5,6,7,7,7,8,8,8,9,10]
rdd = sc.parallelize(data,4)
distinct_rdd = rdd.distinct()
sorted(distinct_rdd.collect())

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

flatMap:
-----------

Similar to `map`, but each input item can be
mapped to 0 or more output items.

In [28]:
data = [1,2,3,4]
rdd = sc.parallelize(data,4)
flat_rdd = rdd.flatMap(lambda x:[x,x**3])
flat_rdd.collect()

[1, 1, 2, 8, 3, 27, 4, 64]
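
The difference from `map` is easiest to see side by side. The sketch below mimics both in plain Python, using `itertools.chain.from_iterable` as a stand-in for the flattening step. It is a local analogy rather than Spark code.

```python
from itertools import chain

data = [1, 2, 3, 4]
func = lambda x: [x, x**3]

# map keeps one output per input, so we get a list of lists.
mapped = list(map(func, data))
print(mapped)        # [[1, 1], [2, 8], [3, 27], [4, 64]]

# flatMap-style: apply the function, then flatten one level.
flat_mapped = list(chain.from_iterable(map(func, data)))
print(flat_mapped)   # [1, 1, 2, 8, 3, 27, 4, 64]

# Returning an empty list drops an element, which is the "0 or more" case.
evens_only = list(chain.from_iterable([x] if x % 2 == 0 else [] for x in data))
print(evens_only)    # [2, 4]
```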

Reduce By Key:
-----------------

The parallel to the reduce step in Hadoop MapReduce.

Spark could not offer this if it only worked with plain lists of values.

In Spark, there is a concept of pair RDDs that makes it a lot more
flexible. Let's assume we have data in which we have a product, its
category, and its selling price. We can still parallelize the data.

In [30]:
data = [('Apple','Fruit',200),('Banana','Fruit',24),('Tomato','Fruit',56),('Potato','Vegetable',103),('Carrot','Vegetable',34)]
rdd = sc.parallelize(data,4)
rdd.collect()

[('Apple', 'Fruit', 200),
 ('Banana', 'Fruit', 24),
 ('Tomato', 'Fruit', 56),
 ('Potato', 'Vegetable', 103),
 ('Carrot', 'Vegetable', 34)]

Right now our RDD `rdd` holds 3-tuples.

Now we want to find out the total revenue that we got from each
category.

To do that we have to transform our `rdd` to a
pair RDD so that it only contains key-value pairs/tuples.


In [31]:
category_price_rdd = rdd.map(lambda x: (x[1],x[2]))
category_price_rdd.collect()

[('Fruit', 200),
 ('Fruit', 24),
 ('Fruit', 56),
 ('Vegetable', 103),
 ('Vegetable', 34)]

Here we used the map function to get it in the format we wanted. When
working with a text file, the RDD that gets formed contains plain
strings, and we use `map` to convert them into the format that we want.

So now our `category_price_rdd` contains the
product category and the price at which the product sold.

Now we want to reduce on the key category and sum the prices. We can do
this by:

In [32]:
category_total_price_rdd = category_price_rdd.reduceByKey(lambda x,y:x+y)
category_total_price_rdd.collect()

[('Fruit', 280), ('Vegetable', 137)]
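
What `reduceByKey` does per key can be mimicked locally. The helper `reduce_by_key` below is a hypothetical plain-Python stand-in written for this illustration. It runs on a single machine and says nothing about how Spark shuffles data.

```python
def reduce_by_key(func, pairs):
    """Plain-Python stand-in for reduceByKey: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

category_price = [('Fruit', 200), ('Fruit', 24), ('Fruit', 56),
                  ('Vegetable', 103), ('Vegetable', 34)]

totals = reduce_by_key(lambda x, y: x + y, category_price)
print(totals)    # [('Fruit', 280), ('Vegetable', 137)]

# Any other commutative, associative function works the same way.
highest = reduce_by_key(max, category_price)
print(highest)   # [('Fruit', 200), ('Vegetable', 103)]
```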

Group By Key:
----------------

Similar to `reduceByKey`, but it does not reduce the values; it
just puts all the elements for a key in an iterator. For example, if we wanted to
keep the category as the key and all its products as the value, we would use
this function.

Let us again use `map` to get data in the
required form.

In [33]:
data = [('Apple','Fruit',200),('Banana','Fruit',24),('Tomato','Fruit',56),('Potato','Vegetable',103),('Carrot','Vegetable',34)]
rdd = sc.parallelize(data,4)
category_product_rdd = rdd.map(lambda x: (x[1],x[0]))
category_product_rdd.collect()

[('Fruit', 'Apple'),
 ('Fruit', 'Banana'),
 ('Fruit', 'Tomato'),
 ('Vegetable', 'Potato'),
 ('Vegetable', 'Carrot')]

We then use `groupByKey` as:

In [34]:
grouped_products_by_category_rdd = category_product_rdd.groupByKey()
findata = grouped_products_by_category_rdd.collect()
for data in findata:
    print(data[0],list(data[1]))

Fruit ['Apple', 'Tomato', 'Banana']
Vegetable ['Potato', 'Carrot']


Here the `groupByKey` function worked and it
returned the category and the list of products in that category.
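
For comparison, here is a plain-Python sketch of the same grouping. The helper `group_by_key` is hypothetical. The local version keeps the input order inside each group, whereas the Spark output above shows that the order within a group is not guaranteed.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Plain-Python stand-in for groupByKey: collect all values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

category_product = [('Fruit', 'Apple'), ('Fruit', 'Banana'), ('Fruit', 'Tomato'),
                    ('Vegetable', 'Potato'), ('Vegetable', 'Carrot')]

grouped = group_by_key(category_product)
for category, products in grouped.items():
    print(category, products)

# Grouping and then summing gives the same totals as reducing by key,
# but it has to hold every value of a group before aggregating.
category_price = [('Fruit', 200), ('Fruit', 24), ('Fruit', 56),
                  ('Vegetable', 103), ('Vegetable', 34)]
totals = {k: sum(v) for k, v in group_by_key(category_price).items()}
print(totals)   # {'Fruit': 280, 'Vegetable': 137}
```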

Action Basics
=============

![Image for
post](https://miro.medium.com/max/2560/1*-T8LTnsXH2AhzhbmajacXw.png)

You have filtered your data and mapped some functions on it. Your
computation is done.

Now you want to get the data on your local machine, save it to a file,
or show the results as graphs in Excel or any other
visualization tool.

You will need actions for that. A comprehensive list of actions is
provided
[**here**](http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions).

Some of the most common actions that I tend to use are:

collect:
-----------

We have already used this action many times. It takes the whole RDD and
brings it back to the driver program.

reduce:
----------

Aggregate the elements of the dataset using a function func (which takes
two arguments and returns one). The function should be commutative and
associative so that it can be computed correctly in parallel.

In [35]:
rdd = sc.parallelize([1,2,3,4,5])
rdd.reduce(lambda x,y : x+y)

15

take:
--------

Sometimes you will need to see what your RDD contains without getting
all the elements in memory itself. `take`
returns a list with the first n elements of the RDD.


In [36]:
rdd = sc.parallelize([1,2,3,4,5])
rdd.take(3)

[1, 2, 3]


takeOrdered:
---------------

`takeOrdered` returns the first n elements of
the RDD, sorted either in their natural (ascending) order or by an optional key function.


In [37]:
rdd = sc.parallelize([5,3,12,23])

# descending order
rdd.takeOrdered(3,lambda s:-1*s)

[23, 12, 5]

In [38]:
rdd = sc.parallelize([(5,23),(3,34),(12,344),(23,29)])

# descending order
rdd.takeOrdered(3,lambda s:-1*s[1])

[(12, 344), (3, 34), (23, 29)]
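
The trick of negating the key to get a descending order is not specific to Spark. The sketch below reproduces both results locally with `heapq.nsmallest`; the wrapper `take_ordered` is a hypothetical name used only for this illustration.

```python
import heapq

def take_ordered(items, n, key=None):
    """Plain-Python stand-in for takeOrdered: the n smallest items by key."""
    return heapq.nsmallest(n, items, key=key)

numbers = [5, 3, 12, 23]
ascending = take_ordered(numbers, 3)
descending = take_ordered(numbers, 3, key=lambda s: -s)
print(ascending, descending)   # [3, 5, 12] [23, 12, 5]

pairs = [(5, 23), (3, 34), (12, 344), (23, 29)]
by_second_desc = take_ordered(pairs, 3, key=lambda s: -s[1])
print(by_second_desc)          # [(12, 344), (3, 34), (23, 29)]
```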

We finally have our basics covered. Let us get back to our wordcount
example.

Understanding The WordCount Example
===================================

![Image for post](https://miro.medium.com/max/10368/0*wcxaKwNMIiEsGmW2)

Now we sort of understand the transformations and the actions provided
to us by Spark.

It should not be difficult to understand the wordcount program now. Let
us go through the program line by line.

The first line creates an RDD and distributes it to the workers.


In [None]:
lines = sc.textFile("shakespeare/shakespeare.txt")

This RDD `lines` contains the lines of
the file, one string per line. You can see the RDD content using `take`:

In [None]:
lines.take(5)

This RDD is of the form:

```py
['word1 word2 word3','word4 word3 word2']
```

This next line is actually the workhorse function in the whole script.


In [None]:
counts = (lines.flatMap(lambda x: x.split(' '))          
                  .map(lambda x: (x, 1))                 
                  .reduceByKey(lambda x,y : x + y))

It contains a series of transformations that we do to the lines RDD.
First of all, we do a `flatMap` transformation.

The `flatMap` transformation takes as input the
lines and gives words as output. So after the `flatMap` transformation, the RDD is of the form:

```py
['word1','word2','word3','word4','word3','word2']
```

Next, we do a `map` transformation on the
`flatMap` output which converts the RDD to:

```py
[('word1',1),('word2',1),('word3',1),('word4',1),('word3',1),('word2',1)]
```

Finally, we do a `reduceByKey` transformation
which counts the number of times each word appeared.

After this the RDD reaches its final, desired form:

```py
[('word1',1),('word2',2),('word3',2),('word4',1)]
```
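
The three stages can be replayed on the two sample lines in plain Python. This is a local sketch of the data flow, with no Spark involved, using a list comprehension and a dictionary in place of the RDD operations.

```python
from itertools import chain

lines = ['word1 word2 word3', 'word4 word3 word2']

# flatMap step: split every line and flatten into one list of words.
words = list(chain.from_iterable(line.split(' ') for line in lines))

# map step: pair every word with the count 1.
pairs = [(word, 1) for word in words]

# reduceByKey step: add up the counts that share a word.
counts = {}
for word, one in pairs:
    counts[word] = counts.get(word, 0) + one

print(words)
print(pairs)
print(list(counts.items()))
# [('word1', 1), ('word2', 2), ('word3', 2), ('word4', 1)]
```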

This next line is an action that takes the first 10 elements of the
resulting RDD locally.


In [None]:
output = counts.take(10)

This line just prints the output

In [None]:
for (word, count) in output:                 
    print("%s: %i" % (word, count))

And that is it for the wordcount program. Hope you understand it now.

So till now, we talked about the Wordcount example and the basic
transformations and actions that you could use in Spark. But we don’t do
wordcount in real life.

We have to work on bigger problems which are much more complex. Worry
not! Whatever we have learned till now will let us do that and more.

Spark in Action with Example
============================

![Image for post](https://miro.medium.com/max/10796/0*e94sY_GitJyJz02J)

Let us work with a concrete example which takes care of some usual
transformations.

We will work on the Movielens
[ml-100k.zip](https://github.com/rudrasingh21/Data-ML-100k-/raw/master/ml-100k.zip)
dataset, which is a stable benchmark dataset: 100,000 ratings from 943
users on 1,682 movies, released 4/1998.

Let us start by downloading the data.


In [None]:
# To download the data you would use the following commands:
!rm -rf ml-100k
!wget -P /tmp https://github.com/rudrasingh21/Data-ML-100k-/raw/master/ml-100k.zip
!unzip /tmp/ml-100k.zip -d .
!ls -l ml-100k

The Movielens dataset contains a lot of files but we are going to be
working with 3 files only:

1) **Users**: This file name is kept as `u.user`. The columns in this
file are:

```py
['user_id', 'age', 'sex', 'occupation', 'zip_code']
```

2) **Ratings**: This file name is kept as `u.data`. The columns in this
file are:

```py
['user_id', 'movie_id', 'rating', 'unix_timestamp']
```

3) **Movies**: This file name is kept as `u.item`. The columns in this
file are:

```py
['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url', ...]  # plus 19 genre flag columns
```

Our business partner now comes to us and asks us to find out the ***25
most rated movie titles*** from this data, that is, the 25 titles that
have been rated the highest number of times.

Let us load the data in different RDDs and see what the data contains.



In [39]:
userRDD = sc.textFile("ml-100k/u.user") 
ratingRDD = sc.textFile("ml-100k/u.data") 
movieRDD = sc.textFile("ml-100k/u.item") 
print("userRDD:",userRDD.take(1))
print("ratingRDD:",ratingRDD.take(1))
print("movieRDD:",movieRDD.take(1))

userRDD: ['1|24|M|technician|85711']
ratingRDD: ['196\t242\t3\t881250949']
movieRDD: ['1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0']
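
Each element is still one raw line of text. Before going further, it helps to see how such a line turns into named fields. The sketch below does this in plain Python for the user and rating lines printed above, using the column names listed earlier.

```python
# Sample lines copied from the take(1) output above.
user_line = '1|24|M|technician|85711'
rating_line = '196\t242\t3\t881250949'

user_columns = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
rating_columns = ['user_id', 'movie_id', 'rating', 'unix_timestamp']

# u.user is pipe-delimited, u.data is tab-delimited.
user = dict(zip(user_columns, user_line.split('|')))
rating = dict(zip(rating_columns, rating_line.split('\t')))

print(user)
print(rating)

# Every field is still a string; convert before doing arithmetic.
rating_value = int(rating['rating'])
print(rating_value)   # 3
```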


We note that to answer this question we will need to use the
`ratingRDD`. But the `ratingRDD` does not have the movie name.

So we would have to merge `movieRDD` and `ratingRDD` using `movie_id`.

**How would we do that in Spark?**

Below is the code. We also use a new transformation, `leftOuterJoin`. Do read the docs and the comments in the code below.

In [40]:
# Create a RDD from RatingRDD that only contains the two columns of interest i.e. movie_id,rating.
RDD_movid_rating = ratingRDD.map(lambda x : (x.split("\t")[1],x.split("\t")[2]))
print("RDD_movid_rating:",RDD_movid_rating.take(4))

# Create a RDD from MovieRDD that only contains the two columns of interest i.e. movie_id,title.
RDD_movid_title = movieRDD.map(lambda x : (x.split("|")[0],x.split("|")[1]))
print("RDD_movid_title:",RDD_movid_title.take(2))

# merge these two pair RDDs based on movie_id. For this we will use the transformation leftOuterJoin(). See the transformation document.
rdd_movid_title_rating = RDD_movid_rating.leftOuterJoin(RDD_movid_title)
print("rdd_movid_title_rating:",rdd_movid_title_rating.take(1))

# use the RDD in previous step to create (movie,1) tuple pair RDD
rdd_title_rating = rdd_movid_title_rating.map(lambda x: (x[1][1],1 ))
print("rdd_title_rating:",rdd_title_rating.take(2))

# Use the reduceByKey transformation to reduce on the basis of movie_title
rdd_title_ratingcnt = rdd_title_rating.reduceByKey(lambda x,y: x+y)
print("rdd_title_ratingcnt:",rdd_title_ratingcnt.take(2))

# Get the final answer by using the takeOrdered action
print("#####################################")
print("25 most rated movies:",rdd_title_ratingcnt.takeOrdered(25,lambda x:-x[1]))
print("#####################################")

RDD_movid_rating: [('242', '3'), ('302', '3'), ('377', '1'), ('51', '2')]
RDD_movid_title: [('1', 'Toy Story (1995)'), ('2', 'GoldenEye (1995)')]


                                                                                

rdd_movid_title_rating: [('741', ('4', 'Last Supper, The (1995)'))]
rdd_title_rating: [('Last Supper, The (1995)', 1), ('Last Supper, The (1995)', 1)]
rdd_title_ratingcnt: [('L.A. Confidential (1997)', 297), ('Broken Arrow (1996)', 254)]
#####################################
25 most rated movies: [('Star Wars (1977)', 583), ('Contact (1997)', 509), ('Fargo (1996)', 508), ('Return of the Jedi (1983)', 507), ('Liar Liar (1997)', 485), ('English Patient, The (1996)', 481), ('Scream (1996)', 478), ('Toy Story (1995)', 452), ('Air Force One (1997)', 431), ('Independence Day (ID4) (1996)', 429), ('Raiders of the Lost Ark (1981)', 420), ('Godfather, The (1972)', 413), ('Pulp Fiction (1994)', 394), ('Twelve Monkeys (1995)', 392), ('Silence of the Lambs, The (1991)', 390), ('Jerry Maguire (1996)', 384), ('Chasing Amy (1997)', 379), ('Rock, The (1996)', 378), ('Empire Strikes Back, The (1980)', 367), ('Star Trek: First Contact (1996)', 365), ('Titanic (1997)', 350), ('Back to the Future (1985)',

Star Wars is the most rated movie in the Movielens Dataset.
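
The shape of the joined rows, `(key, (left_value, right_value))`, is what makes `x[1][1]` pick out the title. The plain-Python sketch below mimics that shape. The helper `left_outer_join` is hypothetical and assumes each key appears at most once on the right side, which holds for movie_id to title.

```python
def left_outer_join(left_pairs, right_pairs):
    """Plain-Python stand-in for leftOuterJoin on (key, value) pairs."""
    right = dict(right_pairs)
    # Keys missing on the right side get None, as in a left outer join.
    return [(key, (value, right.get(key))) for key, value in left_pairs]

# (movie_id, rating) pairs taken from the output above.
movid_rating = [('242', '3'), ('302', '3'), ('377', '1'), ('51', '2')]
# Only one title is included here, so the other keys have no match.
movid_title = [('242', 'Kolya (1996)')]

joined = left_outer_join(movid_rating, movid_title)
for row in joined:
    print(row)
# ('242', ('3', 'Kolya (1996)'))
# ('302', ('3', None))
# ('377', ('1', None))
# ('51', ('2', None))
```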

Now we could have done all this in a single command, as shown below,
but the code gets a little messy.

In [None]:
print(((ratingRDD.map(lambda x : (x.split("\t")[1],x.split("\t")[2]))).
     leftOuterJoin(movieRDD.map(lambda x : (x.split("|")[0],x.split("|")[1])))).
     map(lambda x: (x[1][1],1)).
     reduceByKey(lambda x,y: x+y).
     takeOrdered(25,lambda x:-x[1]))

I did this to show that you can chain functions with Spark and
bypass the creation of intermediate variables.


Let us do one more. For practice:

Now we want to find the 25 most highly rated movies using the same
dataset. We actually want only those movies which have been rated at
least 100 times.

In [None]:
# We already have the RDD rdd_movid_title_rating: [(u'429', (u'5', u'Day the Earth Stood Still, The (1951)'))]
# We create an RDD that contains sum of all the ratings for a particular movie
rdd_title_ratingsum = (rdd_movid_title_rating.
                        map(lambda x: (x[1][1],int(x[1][0]))).
                        reduceByKey(lambda x,y:x+y))
                        
print("rdd_title_ratingsum:",rdd_title_ratingsum.take(2))
# Merge this data with the RDD rdd_title_ratingcnt we created in the last step
# And use Map function to divide ratingsum by rating count.
rdd_title_ratingmean_rating_count = (rdd_title_ratingsum.
                                    leftOuterJoin(rdd_title_ratingcnt).
                                    map(lambda x:(x[0],(float(x[1][0])/x[1][1],x[1][1]))))
                                    
print("rdd_title_ratingmean_rating_count:",rdd_title_ratingmean_rating_count.take(1))
# We could use take ordered here only but we want to only get the movies which have count
# of ratings more than or equal to 100 so lets filter the data RDD.
rdd_title_rating_rating_count_gt_100 = (rdd_title_ratingmean_rating_count.
                                        filter(lambda x: x[1][1]>=100))
                                        
print("rdd_title_rating_rating_count_gt_100:",rdd_title_rating_rating_count_gt_100.take(1))
# Get the final answer by using the takeOrdered action
print("#####################################")
print ("25 highly rated movies:")
print(rdd_title_rating_rating_count_gt_100.takeOrdered(25,lambda x:-x[1][0]))
print("#####################################")
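
An alternative worth knowing is to carry a `(sum, count)` pair for every rating, so that a single reduce yields both numbers and no join is needed. The sketch below shows the idea in plain Python with made-up ratings and a hypothetical `reduce_by_key` helper. The same lambda should work with `reduceByKey` on an RDD of `(title, (rating, 1))` pairs, but that variant is not run here.

```python
def reduce_by_key(func, pairs):
    """Plain-Python stand-in for reduceByKey, returning a dict."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Made-up (title, rating) pairs, for illustration only.
title_rating = [('Movie A', 5), ('Movie A', 4), ('Movie B', 3),
                ('Movie A', 3), ('Movie B', 5)]

# Pair every rating with a count of 1, then add sums and counts together.
title_sum_count = reduce_by_key(
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
    [(title, (rating, 1)) for title, rating in title_rating],
)

# Divide sum by count to get (mean rating, rating count) per title.
title_mean_count = {title: (total / count, count)
                    for title, (total, count) in title_sum_count.items()}
print(title_mean_count)   # {'Movie A': (4.0, 3), 'Movie B': (4.0, 2)}
```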

We have talked about RDDs till now as they are very powerful.

You can use RDDs to work with non-relational databases too.

They let you do a lot of things that you couldn’t do with Spark SQL.

***Spark SQL? Yes, you can use SQL with Spark too, which I am going to
talk about now.***

Spark DataFrames
================

![Image for
post](https://miro.medium.com/max/1234/0*_Xne4_sz6lroaINt.png)

Spark provides the DataFrame API to work with relational data. Here is the
[documentation](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html#) for the adventurous folks.

Remember that in the background it still is all RDDs and that is why the
starting part of this post focussed on RDDs.

I will start with some common functionalities you will need to work with
Spark DataFrames. It will look a lot like Pandas with some syntax changes.

Reading the File
-------------------

In [41]:
ratings = spark.read.load("ml-100k/u.data",format="csv", sep="\t", inferSchema="true", header="false")

                                                                                

Show File
------------

Here is how we can show the first rows of a Spark DataFrame.

In [42]:
ratings.show(5)

+---+---+---+---------+
|_c0|_c1|_c2|      _c3|
+---+---+---+---------+
|196|242|  3|881250949|
|186|302|  3|891717742|
| 22|377|  1|878887116|
|244| 51|  2|880606923|
|166|346|  1|886397596|
+---+---+---+---------+
only showing top 5 rows



Change Column names
----------------------

Renaming columns is something you will need almost every time. `toDF` takes the new names as separate arguments, so don’t forget the `*` in front of the list.


In [44]:
ratings = ratings.toDF(*['user_id', 'movie_id', 'rating', 'unix_timestamp'])
ratings.show(5)

+-------+--------+------+--------------+
|user_id|movie_id|rating|unix_timestamp|
+-------+--------+------+--------------+
|    196|     242|     3|     881250949|
|    186|     302|     3|     891717742|
|     22|     377|     1|     878887116|
|    244|      51|     2|     880606923|
|    166|     346|     1|     886397596|
+-------+--------+------+--------------+
only showing top 5 rows



In [45]:
ratings.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- movie_id: integer (nullable = true)
 |-- rating: integer (nullable = true)
 |-- unix_timestamp: integer (nullable = true)
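
The `*` passed to `toDF` is ordinary Python argument unpacking. The sketch below uses a hypothetical stand-in function, `to_df_like`, that accepts names as separate arguments the way `toDF` does, to show what changes when the `*` is left out.

```python
def to_df_like(*cols):
    """Hypothetical stand-in that, like toDF, takes names as separate arguments."""
    return list(cols)

names = ['user_id', 'movie_id', 'rating', 'unix_timestamp']

# With *, each name is passed as its own argument.
unpacked = to_df_like(*names)
print(len(unpacked))       # 4

# Without *, the whole list arrives as a single argument.
not_unpacked = to_df_like(names)
print(len(not_unpacked))   # 1
```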



Some Basic Stats
-------------------


In [46]:
print(ratings.count()) #Row Count
print(len(ratings.columns)) #Column Count

100000
4


We can also see the dataframe statistics using:

In [47]:
ratings.describe().show()

+-------+------------------+------------------+------------------+-----------------+
|summary|           user_id|          movie_id|            rating|   unix_timestamp|
+-------+------------------+------------------+------------------+-----------------+
|  count|            100000|            100000|            100000|           100000|
|   mean|         462.48475|         425.53013|           3.52986|8.8352885148862E8|
| stddev|266.61442012750905|330.79835632558473|1.1256735991443214|5343856.189502848|
|    min|                 1|                 1|                 1|        874724710|
|    max|               943|              1682|                 5|        893286638|
+-------+------------------+------------------+------------------+-----------------+



                                                                                


Select a few columns
-----------------------

In [48]:
ratings.select('user_id','movie_id').show(5)

+-------+--------+
|user_id|movie_id|
+-------+--------+
|    196|     242|
|    186|     302|
|     22|     377|
|    244|      51|
|    166|     346|
+-------+--------+
only showing top 5 rows



Filter
---------

Filter a dataframe using multiple conditions:


In [49]:
ratings.filter((ratings.rating==5) & (ratings.user_id==253)).show(5)


+-------+--------+------+--------------+
|user_id|movie_id|rating|unix_timestamp|
+-------+--------+------+--------------+
|    253|     465|     5|     891628467|
|    253|     510|     5|     891628416|
|    253|     183|     5|     891628341|
|    253|     483|     5|     891628122|
|    253|     198|     5|     891628392|
+-------+--------+------+--------------+
only showing top 5 rows
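
The parentheses around each condition matter. In Python, `&` binds more tightly than `==`, so leaving them out changes what the expression means. The plain-Python sketch below shows the effect with integers. With DataFrame columns the unparenthesized form typically raises an error instead of filtering.

```python
rating, user_id = 5, 253

# With parentheses, each comparison is evaluated first, then combined.
with_parens = (rating == 5) & (user_id == 253)
print(with_parens)      # True

# Without them, & binds tighter than ==, so Python reads this as the
# chained comparison rating == (5 & user_id) == 253.
without_parens = rating == 5 & user_id == 253
print(without_parens)   # False
```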



Groupby
----------

We can use the groupBy function with a Spark DataFrame too. It is pretty much the same
as a pandas groupby, with the exception that you will need to import
`pyspark.sql.functions`.

In [50]:
from pyspark.sql import functions as F
ratings.groupBy("user_id").agg(F.count("user_id"),F.mean("rating")).show(5)

+-------+--------------+------------------+
|user_id|count(user_id)|       avg(rating)|
+-------+--------------+------------------+
|    148|            65|               4.0|
|    463|           133|2.8646616541353382|
|    471|            31|3.3870967741935485|
|    496|           129|3.0310077519379846|
|    833|           267| 3.056179775280899|
+-------+--------------+------------------+
only showing top 5 rows



                                                                                

Here we have found the count of ratings and the average rating for each
`user_id`.

Sort
-------


In [51]:
ratings.sort("user_id").show(5)


+-------+--------+------+--------------+
|user_id|movie_id|rating|unix_timestamp|
+-------+--------+------+--------------+
|      1|      33|     4|     878542699|
|      1|     202|     5|     875072442|
|      1|     160|     4|     875072547|
|      1|      61|     4|     878542420|
|      1|     189|     3|     888732928|
+-------+--------+------+--------------+
only showing top 5 rows



                                                                                

We can also do a descending sort using `F.desc`
function as below.

In [52]:
# descending Sort
from pyspark.sql import functions as F
ratings.sort(F.desc("user_id")).show(5)

                                                                                

+-------+--------+------+--------------+
|user_id|movie_id|rating|unix_timestamp|
+-------+--------+------+--------------+
|    943|     570|     1|     888640125|
|    943|     186|     5|     888639478|
|    943|     232|     4|     888639867|
|    943|      58|     4|     888639118|
|    943|    1067|     2|     875501756|
+-------+--------+------+--------------+
only showing top 5 rows



Joins/Merging with Spark Dataframes
===================================

We can use SQL with dataframes and thus we can merge dataframes using SQL.

Let us try to run some SQL on Ratings.

We first register the ratings DataFrame as a temporary table, `ratings_table`, on
which we can run SQL operations.

As you can see the result of the SQL select statement is again a Spark
Dataframe.

In [54]:
ratings.createOrReplaceTempView('ratings_table')
# add a where clause (for example: where rating > 4) to filter the rows
newDF = spark.sql('select * from ratings_table')
newDF.show(5)

+-------+--------+------+--------------+
|user_id|movie_id|rating|unix_timestamp|
+-------+--------+------+--------------+
|    196|     242|     3|     881250949|
|    186|     302|     3|     891717742|
|     22|     377|     1|     878887116|
|    244|      51|     2|     880606923|
|    166|     346|     1|     886397596|
+-------+--------+------+--------------+
only showing top 5 rows
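
To try a `where` filter without a cluster, the same kind of query can be run against an in-memory SQLite table. This is only a stand-in for Spark SQL: it uses Python's built-in `sqlite3` module, and the five rows are copied from the outputs shown earlier in this post.

```python
import sqlite3

# Rows copied from the ratings output shown earlier in this post.
rows = [
    (196, 242, 3, 881250949),
    (186, 302, 3, 891717742),
    (22, 377, 1, 878887116),
    (253, 465, 5, 891628467),
    (253, 510, 5, 891628416),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table ratings_table "
    "(user_id integer, movie_id integer, rating integer, unix_timestamp integer)"
)
conn.executemany("insert into ratings_table values (?, ?, ?, ?)", rows)

# Only the rows with a rating above 4 survive the where clause.
high_ratings = conn.execute(
    "select * from ratings_table where rating > 4"
).fetchall()
print(high_ratings)
conn.close()
```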



Let us now add one more Spark Dataframe to the mix to see if we can use
join using the SQL queries:

In [55]:
# get one more dataframe to join
movies = spark.read.load("ml-100k/u.item", format="csv", sep="|", inferSchema="true", header="false")

# change column names
movies = movies.toDF(
    "movie_id", "movie_title", "release_date", "video_release_date", "IMDb_URL",
    "unknown", "Action", "Adventure", "Animation ", "Children", "Comedy", "Crime",
    "Documentary", "Drama", "Fantasy", "Film_Noir", "Horror", "Musical", "Mystery",
    "Romance", "Sci_Fi", "Thriller", "War", "Western",
)

# display
movies.show(5)

                                                                                

+--------+-----------------+------------+------------------+--------------------+-------+------+---------+----------+--------+------+-----+-----------+-----+-------+---------+------+-------+-------+-------+------+--------+---+-------+
|movie_id|      movie_title|release_date|video_release_date|            IMDb_URL|unknown|Action|Adventure|Animation |Children|Comedy|Crime|Documentary|Drama|Fantasy|Film_Noir|Horror|Musical|Mystery|Romance|Sci_Fi|Thriller|War|Western|
+--------+-----------------+------------+------------------+--------------------+-------+------+---------+----------+--------+------+-----+-----------+-----+-------+---------+------+-------+-------+-------+------+--------+---+-------+
|       1| Toy Story (1995)| 01-Jan-1995|              null|http://us.imdb.co...|      0|     0|        0|         1|       1|     1|    0|          0|    0|      0|        0|     0|      0|      0|      0|     0|       0|  0|      0|
|       2| GoldenEye (1995)| 01-Jan-1995|              null|

Now let us join the two tables on `movie_id` to bring the movie title
into the ratings table.


In [56]:
# createOrReplaceTempView replaces the deprecated registerTempTable
movies.createOrReplaceTempView('movies_table')

spark.sql("""
select 
    ratings_table.*,
    movies_table.movie_title 
from ratings_table 
left join movies_table 
    on movies_table.movie_id = ratings_table.movie_id
""").show(5)


+-------+--------+------+--------------+--------------------+
|user_id|movie_id|rating|unix_timestamp|         movie_title|
+-------+--------+------+--------------+--------------------+
|    196|     242|     3|     881250949|        Kolya (1996)|
|    186|     302|     3|     891717742|L.A. Confidential...|
|     22|     377|     1|     878887116| Heavyweights (1994)|
|    244|      51|     2|     880606923|Legends of the Fa...|
|    166|     346|     1|     886397596| Jackie Brown (1997)|
+-------+--------+------+--------------+--------------------+
only showing top 5 rows
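
A left join keeps every row of the left table, whether or not the right table has a match. The plain-Python sketch below spells that out with a dictionary lookup. The ratings rows and titles are copied from the output above, but movie 302 is deliberately left out of the lookup (an assumption for illustration) to show what happens to an unmatched row. The `left_join` helper is hypothetical and not part of Spark.

```python
# Ratings rows copied from the join output above:
# (user_id, movie_id, rating, unix_timestamp)
ratings = [
    (196, 242, 3, 881250949),
    (186, 302, 3, 891717742),
    (22, 377, 1, 878887116),
    (166, 346, 1, 886397596),
]

# Title lookup keyed by movie_id. Movie 302 is omitted on purpose
# (assumption for illustration) so that one rating has no match.
titles = {
    242: "Kolya (1996)",
    377: "Heavyweights (1994)",
    346: "Jackie Brown (1997)",
}


def left_join(ratings, titles):
    """Keep every rating; attach its title, or None when there is no match."""
    return [row + (titles.get(row[1]),) for row in ratings]


joined = left_join(ratings, titles)
for row in joined:
    print(row)
```

The rating for movie 302 stays in the result with `None` as its title, which is what Spark shows as `null`. An inner join would have dropped that row instead.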



Let us redo what we did earlier with the RDDs and find the most rated
movies. The query ranks every movie by its number of ratings, and we
display the top 5:


In [57]:
mostrateddf = spark.sql("""
select 
    movie_id,
    movie_title, 
    count(user_id) as num_ratings 
from (
    select 
        ratings_table.*,
        movies_table.movie_title 
    from ratings_table 
    left join movies_table 
        on movies_table.movie_id = ratings_table.movie_id
    )A 
group by movie_id, movie_title 
order by num_ratings desc 
""")

mostrateddf.show(5)



+--------+--------------------+-----------+
|movie_id|         movie_title|num_ratings|
+--------+--------------------+-----------+
|      50|    Star Wars (1977)|        583|
|     258|      Contact (1997)|        509|
|     100|        Fargo (1996)|        508|
|     181|Return of the Jed...|        507|
|     294|    Liar Liar (1997)|        485|
+--------+--------------------+-----------+
only showing top 5 rows



                                                                                

And here are the highest rated movies among those with more than 100 votes (again showing the top 5):

In [58]:
highrateddf = spark.sql("""
select 
    movie_id,
    movie_title, 
    avg(rating) as avg_rating,
    count(movie_id) as num_ratings 
from (
    select 
        ratings_table.*,
        movies_table.movie_title 
    from ratings_table 
    left join movies_table 
        on movies_table.movie_id = ratings_table.movie_id
    )A 
group by movie_id, movie_title 
having num_ratings>100 
order by avg_rating desc 
""")

highrateddf.show(5, False)



+--------+--------------------------------+-----------------+-----------+
|movie_id|movie_title                     |avg_rating       |num_ratings|
+--------+--------------------------------+-----------------+-----------+
|408     |Close Shave, A (1995)           |4.491071428571429|112        |
|318     |Schindler's List (1993)         |4.466442953020135|298        |
|169     |Wrong Trousers, The (1993)      |4.466101694915254|118        |
|483     |Casablanca (1942)               |4.45679012345679 |243        |
|64      |Shawshank Redemption, The (1994)|4.445229681978798|283        |
+--------+--------------------------------+-----------------+-----------+
only showing top 5 rows
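
The `group by` / `having` / `order by` pipeline in the query can be sketched in plain Python as follows. The data is made up for illustration, and the vote threshold is lowered from 100 to 1 so the tiny sample still produces output. The `top_rated` helper is hypothetical and not part of Spark.

```python
from collections import defaultdict

# Made-up (movie_title, rating) pairs, purely for illustration.
rated = [
    ("Movie A", 5), ("Movie A", 4), ("Movie A", 5),
    ("Movie B", 5),
    ("Movie C", 3), ("Movie C", 4),
]


def top_rated(rated, min_votes):
    # group by movie_title
    groups = defaultdict(list)
    for title, rating in rated:
        groups[title].append(rating)

    # avg(rating) and count(...), keeping only groups that satisfy
    # "having num_ratings > min_votes"
    stats = [
        (title, sum(rs) / len(rs), len(rs))
        for title, rs in groups.items()
        if len(rs) > min_votes
    ]

    # order by avg_rating desc
    return sorted(stats, key=lambda s: s[1], reverse=True)


result = top_rated(rated, min_votes=1)
print(result)
```

"Movie B" has a perfect average but only one vote, so the `having` step drops it. This is why the query filters on the number of ratings before ranking by average.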



                                                                                

Converting from Spark Dataframe to RDD and vice versa
=====================================================

Sometimes you may want to convert a Spark Dataframe to an RDD, or vice
versa, so that you can have the best of both worlds.

To convert from a Dataframe to an RDD, use the `.rdd` attribute:

In [59]:
highratedrdd = highrateddf.rdd
highratedrdd.take(2)

                                                                                

[Row(movie_id=408, movie_title='Close Shave, A (1995)', avg_rating=4.491071428571429, num_ratings=112),
 Row(movie_id=318, movie_title="Schindler's List (1993)", avg_rating=4.466442953020135, num_ratings=298)]

To go from an RDD to a dataframe:


In [63]:
from pyspark.sql import Row
# creating a RDD first
data = [('A',1),('B',2),('C',3),('D',4)]
rdd = sc.parallelize(data)

# map the schema using Row.
rdd_rows = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))

# Convert the rdd to Dataframe
df = spark.createDataFrame(rdd_rows)

df.show(5)

# createOrReplaceTempView replaces the deprecated registerTempTable
df.createOrReplaceTempView('people')
spark.sql('select * from people where age > 3').show()


+----+---+
|name|age|
+----+---+
|   A|  1|
|   B|  2|
|   C|  3|
|   D|  4|
+----+---+

+----+---+
|name|age|
+----+---+
|   D|  4|
+----+---+
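
If `Row` is new to you, it behaves much like a named tuple: fields can be read by name or by position. The plain-Python sketch below uses `collections.namedtuple` as a stand-in to mirror the map-then-filter steps above. The `Person` type is an assumption for illustration and is not the PySpark class.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, used here only for illustration.
Person = namedtuple("Person", ["name", "age"])

data = [('A', 1), ('B', 2), ('C', 3), ('D', 4)]

# Mirrors: rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
people = [Person(name=x[0], age=int(x[1])) for x in data]

# Mirrors: select * from people where age > 3
older = [p for p in people if p.age > 3]

print(older)
# Fields are reachable by name and by position.
print(older[0].name, older[0][1])
```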



RDDs give you ***more control*** at the cost of time and coding
effort, while Dataframes give you a ***familiar coding***
platform. And now you can move back and forth between the two.

Conclusion
==========

![Image for post](https://miro.medium.com/max/10336/0*TK-uI698Vdxjh5kL)

This was a long post, so congratulations if you reached the end.

[Spark](https://spark.apache.org/) provides an interface for applying
transformations and actions to our data. Spark also has the
Dataframe API, which eases the transition to Big Data.

Hopefully, I’ve covered the basics well enough to pique your interest
and help you get started with Spark.