# KEY NOTE

This notebook is a guide to the RDD, the core PySpark data type. Since most technologies and industries now work in the cloud, it is good to know how to apply ML and data science to "Big Data". I will be adapting code and content that is already available so that it works with a simple dataset like "Titanic", which is the hello world of ML.

These are just my personal notes. I am sharing them in the hope that they help others who are trying to learn similar concepts. Any feedback is appreciated.

If you haven't checked out the Spark DataFrame notebook, please feel free to do so [here](https://www.kaggle.com/amritvirsinghx/scalable-data-science-pyspark-nb1).

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home">Notebook Navigation</h3>

[1. What is Spark?](#1)     
[2. Using Spark in Python](#2)    
[3. Spark Core RDD](#3)    
[4. Creating RDDs](#4)    
&nbsp;&nbsp;&nbsp;&nbsp;[a. Creating an Empty RDD](#4a)   
&nbsp;&nbsp;&nbsp;&nbsp;[b. Creating RDD with Partitions](#4b)       
&nbsp;&nbsp;&nbsp;&nbsp;[c. Checking Number of Partitions](#4c)   
&nbsp;&nbsp;&nbsp;&nbsp;[d. Collecting the Data with Partitions](#4d)   
&nbsp;&nbsp;&nbsp;&nbsp;[e. Setting Name of RDD](#4e)       
&nbsp;&nbsp;&nbsp;&nbsp;[f. Using Range to Create RDD](#4f)  
[5. Ordering and Repartitioning RDD](#5)  
&nbsp;&nbsp;&nbsp;&nbsp;[a. Fetch Ordered Elements](#5a)   
&nbsp;&nbsp;&nbsp;&nbsp;[b. Repartitions and Coalesce](#5b)       
[6. Saving and Debugging](#6)     
&nbsp;&nbsp;&nbsp;&nbsp;[a. Saving RDD to a Text File](#6a)   
&nbsp;&nbsp;&nbsp;&nbsp;[b. Checking Lineage of RDD](#6b)  
[7. Performing Operations on RDD](#7)         
&nbsp;&nbsp;&nbsp;&nbsp;[a. Performing Reduce](#7a)   
&nbsp;&nbsp;&nbsp;&nbsp;[b. Creating User Defined Functions (UDF)](#7b)    
&nbsp;&nbsp;&nbsp;&nbsp;[c. Applying Function as Filter](#7c)      
&nbsp;&nbsp;&nbsp;&nbsp;[d. Creating Flat Maps](#7d)     
&nbsp;&nbsp;&nbsp;&nbsp;[e. Performing Joins on RDD](#7e)       
&nbsp;&nbsp;&nbsp;&nbsp;[f. Listing and Replication](#7f)      
&nbsp;&nbsp;&nbsp;&nbsp;[g. Grouping on RDD](#7g)     
&nbsp;&nbsp;&nbsp;&nbsp;[h. Reducing by Key and Functions](#7h)       
[8. Epilogue](#8)   

<a id="1"></a>
## 1. What is Spark?

Spark is a platform for cluster computing. Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.

As each node works on its own subset of the total data, it also carries out a part of the total calculations required, so that both data processing and computation are performed in parallel over the nodes in the cluster. Parallel computation can make certain types of programming tasks much faster, especially tasks that split into independent pieces.
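
To make the split, compute and combine idea concrete, here is a minimal plain-Python sketch. It does not use Spark at all: the helper `split_into_chunks` and the thread pool standing in for cluster nodes are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Cut data into num_chunks contiguous, roughly equal slices (hypothetical helper)."""
    n = len(data)
    return [data[n * i // num_chunks: n * (i + 1) // num_chunks]
            for i in range(num_chunks)]


data = list(range(1, 101))           # the numbers 1..100
chunks = split_into_chunks(data, 4)  # one chunk per pretend "node"

# Each worker sums only its own chunk; the partial sums are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

total = sum(partial_sums)
print(partial_sums, total)  # [325, 950, 1575, 2200] 5050
```

No worker ever sees the whole dataset, yet the combined answer matches a single-machine sum. This works because addition can be applied to the pieces in any grouping.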

However, with greater computing power comes greater complexity.

If you are deciding whether or not Spark is the best solution for your problem you can consider questions like:
* Is my data too big to work with on a single machine?
* Can my calculations be easily parallelized?

![Spark Logo](https://www.lwindia.com/images/Cloudera-Landing-Page-Banner.jpg)

<a id="2"></a>
## 2. Using Spark in Python

The first step in using Spark is connecting to a cluster.

In practice, the cluster is hosted on remote machines. One computer, called the master, manages splitting up the data and the computations. The master is connected to the rest of the computers in the cluster, which are called workers. The master sends the workers data and calculations to run, and they send their results back to the master.

When just getting started with Spark, it's simpler to run a cluster locally. So for now, instead of connecting to another computer, all computations will run on Kaggle's servers in a simulated cluster.

Creating the connection is as simple as creating an instance of the SparkContext class. The class constructor takes a few optional arguments that allow us to specify the attributes of the cluster we're connecting to.

In [None]:
!pip install pyspark

<a id="3"></a>
## 3. Spark Core RDD

RDD stands for Resilient Distributed Dataset. RDDs are collections of elements that are spread across multiple nodes so that they can be processed in parallel on a cluster. RDDs are immutable, which means that once you create an RDD you cannot change it. RDDs are also fault tolerant: if a failure occurs, lost partitions are recomputed automatically from the RDD's lineage. You can apply multiple operations to RDDs to achieve a certain task.

![RDD](https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F1991843%2Fece9429ac833006eda20f0aabede3860%2FCapture.JPG?generation=1601101924589580&alt=media)


There are two kinds of operations you can apply to RDDs:

- Transformation
- Action

**Transformation**: an operation applied to an RDD to create a new RDD. filter, groupBy and map are examples of transformations.

**Action**: an operation applied to an RDD that instructs Spark to perform the computation and send the result back to the driver. collect, count and reduce are examples of actions.

We can initialize the Spark context, the entry point for working with RDDs, as follows:

In [None]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
print(sc)

<a id="4"></a>
## 4. Creating RDDs
We can create RDDs in many different ways. Let's go through a few of them.

<a id="4a"></a>
### a. Creating an Empty RDD

In [None]:
rdd=sc.emptyRDD() 
rdd.isEmpty()

In [None]:
a = sc.parallelize([])
a.isEmpty()

<a id="4b"></a>
### b. Creating RDD with Partitions

In [None]:
mixRDD =sc.parallelize([True, [11,22,33,44,55], (10,20,30,40,50)], 3)
mixRDD.collect()

<a id="4c"></a>
### c. Checking Number of Partitions

In [None]:
mixRDD.getNumPartitions()

<a id="4d"></a>
### d. Collecting the Data with Partitions

In [None]:
mixRDD.glom().collect()

<a id="4e"></a>
### e. Setting Name of RDD

In [None]:
mixRDD.setName('MyRDD')

<a id="4f"></a>
### f. Using Range to Create RDD

In [None]:
a= sc.parallelize(range(1,1000), 4)
print(a.getNumPartitions())
print("\n")
print(a.glom().take(1))
print("\n")
print(a.glom().max())  # partitions are compared as lists, element by element, so max() returns the last partition (index 3)
print("\n")
print(a.glom().min())  # likewise, min() returns the first partition (index 0)
print("\n")

Expand the output to see the result
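
To see why `glom().max()` and `glom().min()` pick the last and first partitions, here is a plain-Python sketch of how a range can be cut into contiguous partitions. The helper `slice_evenly` is hypothetical and only models even slicing. It is not PySpark's own code, and the exact layout Spark produces may differ for other inputs.

```python
def slice_evenly(items, num_partitions):
    """Cut items into contiguous partitions of near-equal size (illustrative model)."""
    items = list(items)
    n = len(items)
    return [items[n * i // num_partitions: n * (i + 1) // num_partitions]
            for i in range(num_partitions)]


partitions = slice_evenly(range(1, 1000), 4)  # 999 elements, like range(1, 1000) above
sizes = [len(p) for p in partitions]
print(sizes)  # [249, 250, 250, 250]

# Python compares lists element by element, so the partition whose first
# element is largest is the "max" and the one starting at 1 is the "min".
largest = max(partitions)
smallest = min(partitions)
print(largest[0], largest[-1])    # 750 999
print(smallest[0], smallest[-1])  # 1 249
```

So `max()` and `min()` here are not measuring partition size. They simply compare the partition lists lexicographically.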

<a id="5"></a>
## 5. Ordering and Repartitioning RDD

<a id="5a"></a>
### a. Fetch Ordered Elements

In [None]:
b =sc.parallelize([12,21,23,43,1,22,11,45,56])
b.takeOrdered(4, key=lambda x: -x)
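
As a rough mental model of what `takeOrdered` with a key does, the same selection can be sketched in plain Python with `heapq.nsmallest`. This is only an analogy on a local list, not how Spark implements it across partitions.

```python
import heapq

values = [12, 21, 23, 43, 1, 22, 11, 45, 56]

# Negating the key turns "the 4 smallest by key" into "the 4 largest values".
largest_four = heapq.nsmallest(4, values, key=lambda x: -x)
print(largest_four)   # [56, 45, 43, 23]

# Without a key we get the 4 smallest values in ascending order.
smallest_four = heapq.nsmallest(4, values)
print(smallest_four)  # [1, 11, 12, 21]
```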

<a id="5b"></a>
### b. Repartitions and Coalesce

In [None]:
c=sc.parallelize(range(1,1000), 5)
c.getNumPartitions()

We can increase or decrease the number of partitions with the repartition method, which performs a full shuffle of the data.

In [None]:
d= c.repartition(7)
d.getNumPartitions()

In [None]:
e= d.repartition(4)
e.getNumPartitions()

With coalesce we cannot increase the number of partitions (unless shuffle=True is passed); by default it can only reduce them.

In [None]:
# coalesce returns a new RDD; asking for more partitions than exist leaves the count at 4
e.coalesce(6).getNumPartitions()

In [None]:
# reducing works: the returned RDD has 2 partitions, while e itself is unchanged
e.coalesce(2).getNumPartitions()
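
The difference between the two methods can be sketched in plain Python. The functions below are hypothetical stand-ins written for this note. They assume that coalescing merges neighbouring partitions and that repartitioning redistributes every element, and they simplify what Spark actually does.

```python
def coalesce(partitions, n):
    """Merge neighbouring partitions down to n groups; never splits a partition."""
    current = len(partitions)
    if n >= current:
        return partitions  # nothing to merge, so the count cannot grow
    merged = []
    for i in range(n):
        group = partitions[current * i // n: current * (i + 1) // n]
        merged.append([x for part in group for x in part])
    return merged


def repartition(partitions, n):
    """Full shuffle model: deal every element round-robin into n new partitions."""
    flat = [x for part in partitions for x in part]
    return [flat[i::n] for i in range(n)]


four = [[1, 2], [3, 4], [5, 6], [7, 8]]

print(len(coalesce(four, 6)))     # 4 (cannot increase)
print(coalesce(four, 2))          # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(len(repartition(four, 6)))  # 6 (a shuffle can increase)
```

Merging whole partitions is cheap because no element has to be examined individually, which is why coalesce is usually preferred when only reducing the partition count.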

<a id="6"></a>
## 6. Saving and Debugging

<a id="6a"></a>
### a. Saving RDD to a Text File

In [None]:
e.saveAsTextFile("./sampletext")

The output is saved in Kaggle's output directory. Notice that sampletext is a directory rather than a single file: it contains one part file per partition of the RDD.

<a id="6b"></a>
### b. Checking Lineage of RDD

In [None]:
e.toDebugString()

The output DAG can be viewed in the Spark web UI on the configured port (4040 by default). A sample DAG is as follows:

![DAG](https://1.bp.blogspot.com/-OYuEUWP8UZo/XZQ4ZOF8ApI/AAAAAAAADGc/_nwzjUP8BHIEFr0FNHy3Vt55xeJYBsfdwCLcBGAsYHQ/s640/sp_1.png)

To know more about DAGs follow this [link](https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html)

<a id="7"></a>
## 7. Performing Operations on RDDs
We can perform two kinds of operations on an RDD:

### i. Transformation
Transformations are operations that transform RDD data from one form to another. When you apply a transformation to an RDD, you get a new RDD with the transformed data (RDDs in Spark are immutable). Operations like map, filter and flatMap are transformations.

Note that applying a transformation to an RDD does not run the operation immediately. Instead, Spark records it in a DAG (Directed Acyclic Graph) built from the source RDD, the operation and the function used. It keeps extending this graph until you call an action on the last RDD in the chain. That is why **transformations in Spark are lazy**.
An operation is simply a method applied to an RDD to accomplish a certain task, and it can be as simple as sorting, filtering or summarizing data.
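
Laziness can be illustrated without Spark. The class below is a hypothetical toy (`LazyDataset` is not a PySpark class): each transformation only records a step and returns a new object, and nothing runs until the `collect` action is called.

```python
class LazyDataset:
    """Toy model of a lazy, immutable dataset (illustration only)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # the recorded lineage

    def map(self, fn):
        # Transformation: return a NEW dataset with one more recorded step.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def collect(self):
        # Action: only now are the recorded steps executed.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []  # records every time square() actually runs


def square(x):
    calls.append(x)
    return x * x


base = LazyDataset([1, 2, 3, 4])
even_squares = base.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = list(calls)  # still empty: nothing has been computed
result = even_squares.collect()

print(calls_before_action)  # []
print(result)               # [4, 16]
print(calls)                # [1, 2, 3, 4]
```

The original `base` object is never modified, which mirrors the immutability of RDDs: each transformation hands back a new object carrying a longer list of steps.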

Further transformations are of two types:

- Narrow transformation

In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD, so only a limited subset of partitions is needed to calculate the result. map() and filter() produce narrow transformations.

- Wide transformation

In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD, so data has to be shuffled between partitions. groupByKey() and reduceByKey() produce wide transformations.
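
The distinction can be sketched in plain Python. The partition layout and the helper `shuffle_by_key` below are made up for illustration, and the byte-sum used to pick a target partition is only a deterministic stand-in for a hash partitioner.

```python
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: each output partition is computed from exactly one input partition.
narrow = [[(k, v * 10) for k, v in part] for part in partitions]
print(narrow)  # [[('a', 10), ('b', 20)], [('a', 30), ('c', 40)]]


def shuffle_by_key(parts, num_out):
    """Wide: records with the same key are moved to the same output partition."""
    out = [dict() for _ in range(num_out)]
    for part in parts:
        for k, v in part:
            target = sum(k.encode()) % num_out  # stand-in for a hash partitioner
            out[target].setdefault(k, []).append(v)
    return out


grouped = shuffle_by_key(partitions, 2)
print(grouped)  # [{'b': [2]}, {'a': [1, 3], 'c': [4]}]
```

The two values for key `a` started in different input partitions and end up together, which is exactly the data movement a narrow transformation never needs.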

### ii. Actions
Transformations create RDDs from each other, but when we want to work with the actual dataset, we perform an action. Unlike a transformation, an action does not produce a new RDD: actions are RDD operations that return non-RDD values. The result of an action is returned to the driver or written to an external storage system. Actions are what set the lazy evaluation of RDDs in motion.

An action is one of the ways of sending data from the executors to the driver. Executors are agents responsible for executing tasks, while the driver is a JVM process that coordinates the workers and the execution of tasks.

<a id="7a"></a>
### a. Performing Reduce

In [None]:
i =sc.parallelize([1,9,5,6,7,8])
i.reduce(lambda a,b:a+b)

In [None]:
i.reduce(lambda a,b:a*b)
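
Because a reduce is applied within each partition first and the partial results are combined afterwards, the function should be associative and commutative. The plain-Python sketch below models this with a hypothetical two-partition split. `distributed_reduce` is an illustrative helper, not a Spark function.

```python
from functools import reduce
from operator import add, mul, sub

values = [1, 9, 5, 6, 7, 8]
partitions = [values[:3], values[3:]]  # assumed split into two partitions


def distributed_reduce(fn, parts):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(fn, part) for part in parts if part]
    return reduce(fn, partials)


total = distributed_reduce(add, partitions)
product = distributed_reduce(mul, partitions)
print(total, product)  # 36 15120

# Subtraction is neither associative nor commutative, so the partitioned
# result disagrees with a simple left-to-right pass over the data.
partitioned_diff = distributed_reduce(sub, partitions)
sequential_diff = reduce(sub, values)
print(partitioned_diff, sequential_diff)  # -4 -34
```

With addition or multiplication the partition layout does not matter. With subtraction the answer depends on how the data happened to be split.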

<a id="7b"></a>
### b. Creating User Defined Functions (UDF)

In [None]:
word =['how', 'are', 'you', 'hey', 'hi']
wordRDD=sc.parallelize(word)
wordRDD.collect()

In [None]:
# predicate: True when the word starts with 'h' (case-insensitive)
def start_h(word):
    return word.lower().startswith('h')

<a id="7c"></a>
### c. Applying Function as Filter

In [None]:
wordRDD.filter(start_h).collect()  # returns all words that start with 'h'

In [None]:
def toUpper(s):
    return s.upper()

In [None]:
x=['navin', 'kumar', 'pal']
x1=sc.parallelize(x)
k3=x1.map(toUpper)
k3.collect()

In [None]:
k3=x1.map(lambda k:k.upper())
k3.collect()

Creating RDD from File

The textFile method reads CSV files as well as plain text files, returning each line as a string.

In [None]:
demo = sc.textFile("../input/titanic/test.csv")
demo.collect()

<a id="7d"></a>
### d. Creating Flat Maps

In [None]:
demo.flatMap(lambda x:x.split(",")).map(lambda x: (x,1)).reduceByKey(lambda a,b:a+ b).collect()

Expand the output to see the results
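
The flatMap, map and reduceByKey chain above is the classic word-count pattern. Here is a plain-Python sketch of the same three steps on two made-up lines (the sample data is an assumption, not taken from the Titanic file):

```python
from collections import defaultdict

lines = ["a,b,a", "b,c"]  # made-up stand-in for lines of a CSV file

# flatMap: split every line and flatten the pieces into ONE list of tokens
tokens = [tok for line in lines for tok in line.split(",")]

# map: pair each token with a count of 1
pairs = [(tok, 1) for tok in tokens]

# reduceByKey: add up the counts that share a key
totals = defaultdict(int)
for key, value in pairs:
    totals[key] += value
counts = dict(totals)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}

# For contrast, a plain map keeps one list per line instead of flattening
nested = [line.split(",") for line in lines]
print(nested)  # [['a', 'b', 'a'], ['b', 'c']]
```

The contrast at the end is the key point: map yields one output element per input element, while flatMap lets each input contribute zero or more elements to a single flat result.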

In [None]:
t1=(['navin', 121223], ['kumar',3000])
t1rdd=sc.parallelize(t1,2)
print(t1rdd.collect())
print(t1rdd.getNumPartitions())
t1rdd.collectAsMap()

<a id="7e"></a>
### e. Performing Joins on RDD

In [None]:
C=((201, "pune"), (301, "kol"),(201, "Mum"), (402, "Jaipur"),(505,"RTM"))
D=sc.parallelize(C)
A=((201, "navin"), (301, "kumar" ), (402, "pal"),(603,'kk'))
B=sc.parallelize(A)
tupleJoin=B.join(D)
tupleJoin.collect()

In [None]:
tupleLeftJoin=B.leftOuterJoin(D)
tupleLeftJoin.collect()
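
The semantics of the two joins can be modelled in plain Python using the same pairs as above. The helpers `inner_join` and `left_outer_join` are hypothetical and written only for this note. Spark does not guarantee the order of the joined records, so only the contents should be compared.

```python
def build_index(pairs):
    """Group the values of (key, value) pairs by key."""
    index = {}
    for key, value in pairs:
        index.setdefault(key, []).append(value)
    return index


def inner_join(left, right):
    """Keep only keys present on both sides, one record per matching pair."""
    index = build_index(right)
    return [(k, (lv, rv)) for k, lv in left for rv in index.get(k, [])]


def left_outer_join(left, right):
    """Keep every left record; unmatched keys are paired with None."""
    index = build_index(right)
    return [(k, (lv, rv)) for k, lv in left for rv in index.get(k, [None])]


names = [(201, "navin"), (301, "kumar"), (402, "pal"), (603, "kk")]
cities = [(201, "pune"), (301, "kol"), (201, "Mum"), (402, "Jaipur"), (505, "RTM")]

inner = inner_join(names, cities)
left = left_outer_join(names, cities)
print(inner)
print(left)
```

Key 201 appears twice on the right, so it produces two joined records. Key 603 has no match, so it is dropped by the inner join and paired with `None` by the left outer join. Key 505 exists only on the right and appears in neither result.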

<a id="7f"></a>
### f. Listing and Replication

In [None]:
sc.parallelize([5,6,7]).map(lambda x:[x,x,x]).collect()

In [None]:
sc.parallelize([5,6,7]).map(lambda x:[[x,x,x],[x*x*x],[x+5]]).collect()

In [None]:
input1 = sc.parallelize(["apple", "banana", "pineapple"])
print(input1.collect())
input1.count()

In [None]:
a=sc.parallelize([5,5,6,7,8,8,8,9]).countByValue()
a

<a id="7g"></a>
### g. Grouping on RDD

In [None]:
i =sc.textFile("../input/titanic/train.csv",4)
i.collect()

In [None]:
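import csv
header = i.first()
# the quoted Name column contains commas, so parse with csv instead of split(",") and skip the header
j = i.filter(lambda line: line != header).map(lambda line: next(csv.reader([line])))
k = j.map(lambda field: (field[4], field[1]))  # (Sex, Survived)
k.collect()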

In [None]:
L=k.groupByKey()
for key, values in L.collect():
    print([key, list(values)])

Finally you can check the lineage

In [None]:
print(L.toDebugString())
L.getNumPartitions()

<a id="7h"></a>
### h. Reducing by Key and Functions

In [None]:
# creating a list of (Designation, Salary) tuples; the first tuple is a header row
x=[('Designation', 'Salary'), ('AM', '50000'), ('AM', '50000'), ('SSE', '30000'), ('SSE', '30000'), ('Lead', '40000'), ('Lead', '35000'), ('ASE', '15000'), ('ASE', '15000'), ('SE', '22000'), ('SE', '25000'), ('SE', '25000'), ('ASE', '20000'), ('ASE', '18000'), ('ASE', '15000'), ('ASE', '18000')]
k=sc.parallelize(x)
k.collect()

In [None]:
k.reduceByKey(lambda x, y: int(x) + int(y)).collect()

In [None]:
k.reduceByKey(lambda x, y: int(x) + int(y)).sortByKey().collect()

In [None]:
for i in k.collect():
    print (i)

In [None]:
k.countByKey()

In [None]:
k.countByValue()

In [None]:
k.keys().distinct().count()  # number of distinct keys

In [None]:
k.sortByKey().collect()
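
One subtlety in the reduceByKey cells above: the reducing function is only called when a key has at least two values, so a key with a single record keeps its original value and type. The plain-Python model below (`reduce_by_key` is a hypothetical helper, and the rows are a subset of the list above) shows the effect.

```python
def reduce_by_key(pairs, fn):
    """Fold values that share a key; fn is never called for single-value keys."""
    out = {}
    for key, value in pairs:
        out[key] = fn(out[key], value) if key in out else value
    return out


rows = [("Designation", "Salary"), ("AM", "50000"), ("AM", "50000"),
        ("Lead", "40000"), ("Lead", "35000")]

# Converting inside the reducer leaves the single header value untouched.
mixed = reduce_by_key(rows, lambda x, y: int(x) + int(y))
print(mixed)  # {'Designation': 'Salary', 'AM': 100000, 'Lead': 75000}

# Dropping the header and converting up front gives uniformly typed results.
numeric = [(key, int(value)) for key, value in rows if key != "Designation"]
clean = reduce_by_key(numeric, lambda x, y: x + y)
print(clean)  # {'AM': 100000, 'Lead': 75000}
```

In Spark, the same clean-up would most likely be a `filter` on the header key followed by `mapValues(int)` before calling `reduceByKey`.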

<a id="8"></a>
## 8. Epilogue

These are some of the core operations that we perform on RDDs. For further operations like querying and manipulating data, Spark offers a higher-level abstraction built on RDDs called DataFrames. These are similar to pandas DataFrames and can be converted back and forth to them. Most data science operations are performed using Spark DataFrames.

To know more about them, check out this [notebook](https://www.kaggle.com/amritvirsinghx/scalable-data-science-pyspark-nb1).

One last thing: if we want to ingest real-time data and perform actions on streaming data, Spark provides the Spark Streaming context. We'll cover it in a later notebook, if the implementation is possible in Kaggle notebooks.

Thank you!