## Table Of Contents
* [Shared Variables](#Shared-Variables)
* [google ad-campaign analysis](#google-ad-campaign-analysis)
* [Removing boring words](#Removing-boring-words)
* [Spark Accumulator Practical](#Spark-Accumulator-Practical)
* [YARN](#YARN(Yet-Another-Resource-Negotiator))
* [Spark On YARN Architecture](#Spark-On-YARN-Architecture)
* [Find the number of warnings and errors in a log **practical**](#Find-the-number-of-Warinings-and-errors-in-a-log)
* [Narrow and wide transformation](#Narrow-and-Wide-tranformation)
* [Stages in Spark](#Stages-in-Spark)
* [reduceByKey vs reduce](#reduceByKey-vs-reduce)
* [groupByKey vs reduceByKey](#groupByKey-vs-reduceByKey)
* [Pair RDD](#Pair-RDD)
* [What is the difference between repartition and coalesce?](#What-is-the-difference-between-repartition-and-coalesce?)

### Shared Variables

There are two kinds of shared variable:
* Broadcast Variable
* Accumulator

We'll see what they are and how they work with the help of practicals.

### google ad-campaign analysis

We're interested in the words in the "search term" column (the first field) and in the "total cost" column (the field at index 10), so that we can find the total cost spent on each word.

```bash
[itv002768@g02 ~]$ hadoop fs -mkdir week10_practical_search_words
[itv002768@g02 ~]$ hadoop fs -put bigdatacampaigndata-201014-183159.csv  week10_practical_search_words
[itv002768@g02 ~]$ hadoop fs -head week10_practical_search_words/bigdatacampaigndata-201014-183159.csv
big data contents,Broad match,None,TrendyTech Search India,Broad Match #3,1,1,100%,INR,24.06,24.06,0,0,0%,Search
spark training with lab access,Broad match,None,TrendyTech Search India,Broad Match #3,1,2,200%,INR,29.97,59.94,0,0,0%,Search
online hadoop training institutes in hyderabad,Broad match,None,TrendyTech Search India,Broad Match #3,1,1,100%,INR,28.45,28.45,0,0,0%,Search
coursera data analytics,Broad match,None,TrendyTech Search India,Broad Match #3,1,1,100%,INR,24.64,24.64,0,0,0%,Search
ameerpet big data training cost,Broad match,None,TrendyTech Search India,Broad Match #3,2,1,50%,INR,34.86,34.86,0,0,0%,Search
good comment on big data trainer,Broad match,None,TrendyTech Search India,Broad Match #3,1,2,200%,INR,30.47,60.94,0,0,0%,Search
spark classes,Broad match,None,TrendyTech Search India,Broad Match #3,3,1,33.33%,INR,29.21,29.21,0,0,0%,Search
data analytics course near me,Broad match,None,TrendyTech Search India,Broad Match #3,0,1,--,INR,25.42,25.42,0,0,0%,Search
```
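Before running this on the cluster, the per-word cost logic can be tried out locally. The following is a minimal plain-Python sketch (no Spark involved) that uses three of the sample rows above; the helper name `cost_per_word` is made up for illustration.

```python
from collections import defaultdict

# Three rows copied from the sample output above.
rows = [
    "big data contents,Broad match,None,TrendyTech Search India,Broad Match #3,1,1,100%,INR,24.06,24.06,0,0,0%,Search",
    "spark training with lab access,Broad match,None,TrendyTech Search India,Broad Match #3,1,2,200%,INR,29.97,59.94,0,0,0%,Search",
    "coursera data analytics,Broad match,None,TrendyTech Search India,Broad Match #3,1,1,100%,INR,24.64,24.64,0,0,0%,Search",
]

def cost_per_word(lines):
    """Charge the full cost of a search term to every word in it, then sum per word."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.split(",")
        term, cost = fields[0], float(fields[10])
        for word in term.lower().split(" "):
            totals[word] += cost
    # Highest total first, mirroring a descending sort on the cost
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

ranked = cost_per_word(rows)
print(ranked[:3])
```

Note that "data" appears in two search terms, so its total is the sum of both costs, which is exactly what the per-key reduction does on the cluster.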

In [13]:
val inputFile = sc.textFile("/user/itv002768/week10_practical_search_words/bigdatacampaigndata-201014-183159.csv")
val requiredInfo = inputFile.map(x=> (x.split(",")(10).toFloat, x.split(",")(0)))
val allWords = requiredInfo.flatMapValues(x => x.split(" ")).map(x => (x._2.toLowerCase(), x._1))
val finalInfo = allWords.reduceByKey((x, y) => x + y).sortBy(x => x._2, false)
val finalAnswer = finalInfo.collect()
finalAnswer.take(10).foreach(println)

(data,16394.64)
(big,12889.278)
(in,5774.84)
(hadoop,4818.34)
(course,4191.5903)
(training,4099.3696)
(online,3484.4202)
(courses,2565.78)
(intellipaat,2081.22)
(analytics,1458.51)


inputFile = /user/itv002768/week10_practical_search_words/bigdatacampaigndata-201014-183159.csv MapPartitionsRDD[58] at textFile at <console>:34
requiredInfo = MapPartitionsRDD[59] at map at <console>:35
allWords = MapPartitionsRDD[61] at map at <console>:36
finalInfo = MapPartitionsRDD[67] at sortBy at <console>:37
finalAnswer = Array((data,16394.64), (big,12889.278), (in,5774.84), (hadoop,4818.34), (course,4191.5903), (training,4099.3696), (online,3484.4202), (courses,2565.78), (intellipaat,2081.22), (analytics,1458.51), (tutorial,1383.3701), (hyderabad,1118.16), (spark,1078.72), (best,1047.7), (banga...



### Removing boring words

In the practical above we still have to remove the boring words that carry no significance, such as 'of' and 'in'.

For this we've created a new file and put all the boring words in it.

```bash
[itv002768@g02 ~]$ head boringwords-201014-183159.txt
shouldnt
worrying
simplify
tidy
shouldnt
yep
the
lively
borrow
whichever
```

For this we'll use a **broadcast join** in Spark, which is the same idea as a **map-side join** in Hive. We can achieve this using a broadcast variable.

In Spark we have a driver and worker nodes that run the executors. The driver broadcasts a variable, and a complete read-only copy of it is sent to every node.

In our practical we'll broadcast all the boring words as a broadcast variable. This way the boring words are available on every machine, while the campaign data stays distributed across the machines.

Since we do not want duplicates, we'll put all the words in a set.
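The idea behind the broadcast lookup can be sketched in plain Python without a cluster. In this sketch the word list and the two "partitions" are made-up samples, and `filter_partition` is a hypothetical helper; the point is only that each partition needs nothing beyond its own data and the shared set.

```python
# Stand-in for the broadcast value: one read-only set shared by every partition.
boring_words = frozenset(["in", "the", "big", "data"])  # hypothetical sample

# Stand-in for the distributed campaign data: (word, cost) pairs in two partitions.
partitions = [
    [("big", 24.06), ("data", 24.06), ("hadoop", 28.45)],
    [("in", 28.45), ("spark", 59.94), ("hadoop", 10.0)],
]

def filter_partition(pairs, lookup):
    """Runs independently per partition; uses only local pairs plus the shared set."""
    return [(word, cost) for (word, cost) in pairs if word not in lookup]

kept = [filter_partition(part, boring_words) for part in partitions]
print(kept)
```

Because the small set travels to the data rather than the other way round, no shuffle of the large dataset is needed for the filter.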





In [16]:
import scala.io.Source

def loadBoringWords(): Set[String] = {
    // Read the file and collect its lines into a Set, which drops duplicates automatically
    val source = Source.fromFile("/home/itv002768/boringwords-201014-183159.txt")
    try source.getLines().toSet finally source.close()
}

// Broadcast the set so that every node gets a complete read-only copy
val nameSet = sc.broadcast(loadBoringWords())
val inputFile = sc.textFile("/user/itv002768/week10_practical_search_words/bigdatacampaigndata-201014-183159.csv")
val requiredInfo = inputFile.map(x=> (x.split(",")(10).toFloat, x.split(",")(0)))
val allWords = requiredInfo.flatMapValues(x => x.split(" ")).map(x => (x._2.toLowerCase(), x._1))
/*
 Check if the word is present in the nameSet
 If it is there then it'll return true else false
 Since, we've to ignore the words that are there in the nameSet
 We've to use this function with a negation(!)
 */
val filteredWords = allWords.filter(x => !nameSet.value(x._1))
val finalInfo = filteredWords.reduceByKey((x, y) => x + y).sortBy(x => x._2, false)
val finalAnswer = finalInfo.collect()
finalAnswer.take(20).foreach(println)


(hadoop,4818.34)
(intellipaat,2081.22)
(analytics,1458.51)
(hyderabad,1118.16)
(spark,1078.72)
(bangalore,1039.27)
(cloudxlab,707.52)
(bigdata,694.48)
(dataflair,643.9)
(chennai,604.04)
(edureka,351.44)
(iit,308.73)
(coursera,293.25)
(pune,284.71)
(curso,277.53)
(cloudera,258.06)
(simplilearn,252.45001)
(scala,250.73)
(ameerpet,184.94)
(flair,154.13)


nameSet = Broadcast(0)
inputFile = /user/itv002768/week10_practical_search_words/bigdatacampaigndata-201014-183159.csv MapPartitionsRDD[1] at textFile at <console>:48
requiredInfo = MapPartitionsRDD[2] at map at <console>:49
allWords = MapPartitionsRDD[4] at map at <console>:50
filteredWords = MapPartitionsRDD[5] at filter at <console>:57
finalInfo = MapPartitionsRDD[11] at sortBy at <console>:58
finalAnswer = Array((hadoop,4818.34), (intellipaat,2081.22), (...


loadBoringWords: ()Set[String]



### Spark Accumulator Practical
There is one driver and multiple executors. Let's say you have a 500MB file; with the default block size it is split into four blocks, giving four partitions residing on four machines. You want to find the number of blank lines in the file.

In that case you can create an accumulator variable and increment it whenever you find a blank line. This is very similar to a counter in MapReduce. The executors do not get a readable copy of this variable: they can only update its value, and only the driver can read it.

In the practical below we calculate the number of blank lines in a file.

```bash
[itv002768@g02 ~]$ vim somefile.txt
[itv002768@g02 ~]$ hadoop fs -mkdir accumulator_practical
[itv002768@g02 ~]$ hadoop fs -put somefile.txt accumulator_practical
[itv002768@g02 ~]$ hadoop fs -cat  accumulator_practical/somefile.txt
This is a line.

yes it is.

this one too.


yes.

No.

why?

why not?
```

In [20]:
val inputFile = sc.textFile("/user/itv002768/accumulator_practical/somefile.txt")
// Initialize a long accumulator named "blank lines Accumulator" and assign it to the variable myAccumulator
val myAccumulator = sc.longAccumulator("blank lines Accumulator")
inputFile.foreach(x => if (x=="") myAccumulator.add(1))
// Below line will tell you the number of empty lines
myAccumulator.value

inputFile = /user/itv002768/accumulator_practical/somefile.txt MapPartitionsRDD[7] at textFile at <console>:33
myAccumulator = LongAccumulator(id: 677, name: Some(blank lines Accumulator), value: 7)


7
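The same counting pattern can be mimicked in plain Python. This is only a sketch of the programming model, not of Spark's internals: `LongAccumulatorSketch` is a made-up class, and the split into two partitions is an assumption for illustration.

```python
class LongAccumulatorSketch:
    """Hypothetical stand-in for an accumulator: tasks add to it, the driver reads it."""

    def __init__(self):
        self._value = 0

    def add(self, n):
        self._value += n

    @property
    def value(self):
        return self._value

# Contents of somefile.txt from the practical above, one entry per line.
lines = ["This is a line.", "", "yes it is.", "", "this one too.", "", "",
         "yes.", "", "No.", "", "why?", "", "why not?"]

# Pretend the file is split into two partitions handled by two executors.
partitions = [lines[:7], lines[7:]]

blank_lines = LongAccumulatorSketch()
for part in partitions:            # each iteration plays the role of one task
    for line in part:
        if line == "":
            blank_lines.add(1)     # tasks only ever add

print(blank_lines.value)           # read once, on the "driver"
```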

## YARN(Yet Another Resource Negotiator)

Before talking about YARN, let us recap a bit.
* Storage Perspective
 - HDFS
   - Name Node (Master), which holds the metadata in the form of tables.
   - Data Node (Slave), which holds the actual data in the form of blocks.
* Processing Perspective (MapReduce - Hadoop 1.0) - Job execution was controlled by two processes.
  - Master (Job Tracker) - Runs on a master node and used to do a lot of work:
    - Scheduling - deciding which algorithm to use and which job to prioritize.
    - Monitoring - tracking the progress of a job, rerunning a task if it fails and, if a task is slow, starting a speculative copy of it on another machine.
  - Slave (Task Tracker) - Runs on many slave nodes (the data nodes).
    - Tracks the tasks on its data node and reports on them to the Job Tracker.

With a large number of data nodes this becomes very difficult to manage, because the Task Trackers only send information while all the processing and decision making is done by the single Job Tracker.

**Drawbacks of MapReduce 1.0**
* **Scalability** - It was observed that when the cluster grows beyond about 4,000 data nodes, the Job Tracker becomes a bottleneck.
* **Resource Utilization** - There used to be a fixed number of map and reduce slots, e.g. 150 slots split into 100 map slots and 50 reduce slots. If a MapReduce job needs 150 mappers, you cannot run them all at once: only 100 mappers run at a time and the remaining 50 run later, while the 50 reduce slots sit unused because there is no reduce work yet. As a result, cluster utilization is poor.
* Only MapReduce jobs are supported.
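The resource utilization point is simple arithmetic, sketched below in Python. The slot counts come from the example in the list above; the helper `waves` and the idea of a "pooled" cluster where any slot can run any task are illustrative assumptions.

```python
import math

def waves(tasks, slots):
    """Number of rounds needed to run `tasks` when at most `slots` run at once."""
    return math.ceil(tasks / slots)

MAP_SLOTS, REDUCE_SLOTS = 100, 50   # fixed split from the example above
map_tasks = 150

# Fixed slots: mappers may only use map slots, so the reduce slots sit idle.
fixed_waves = waves(map_tasks, MAP_SLOTS)

# If all 150 slots could run any kind of task, one round would be enough.
pooled_waves = waves(map_tasks, MAP_SLOTS + REDUCE_SLOTS)

print(fixed_waves, pooled_waves)
```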

To overcome these drawbacks, **YARN** was introduced.

Three components of YARN:
* Resource Manager (Master)
  - The monitoring aspect was taken away from the Job Tracker. What remained was renamed the Resource Manager, which is only responsible for scheduling.


* Node Manager (Slave)
  - This is almost the same as the Task Tracker.
  - Manages the resources of the containers on its node.
  
* Application Master
  - When a request comes from a client, the Resource Manager creates a container on one of the Node Managers and launches an application master inside it. This application master takes care of end-to-end monitoring for that application. In this way the Resource Manager has delegated its monitoring work to the application master.
  - The application master negotiates resources with the Resource Manager, i.e. it requests the resources it needs in the form of containers. The Resource Manager allocates them and sends the container IDs and hostnames (Node Managers) back to the application master. Finally, the application master uses these containers to run the tasks and coordinates them.
  
  
How the limitations were handled in MapReduce 2.0 with the introduction of YARN:
* Scheduling is done by the Resource Manager and monitoring is done by the application master. This solves the scalability problem.
* It is no longer limited to MapReduce; it can also run other kinds of jobs such as Spark, Giraph, Tez etc.
* There is no fixed number of map and reduce slots. Instead, the concept of containers comes in, which is very effective at allocating resources dynamically.

**Uberization** - When a job is very small and the application master decides it can run the job within its own container, it won't ask for more resources.

### Spark On YARN Architecture

***How do we execute a Spark program on a Spark cluster?***
* Interactive mode - spark-shell/pyspark/notebook
* Submitting a job - spark-submit utility

***How does Spark execute our programs on the cluster?</BR>***
Spark uses a master/slave architecture where each application has a driver, which is the master process, and a bunch of executors, which are the slaves.

The driver is responsible for analysing the work, dividing it into many tasks, distributing and scheduling those tasks, and finally monitoring them.

An executor is responsible for executing the code it is given locally, in its own JVM.

Two different applications, i.e. two different spark-submit jobs, will each have their own driver and their own set of executors.

***Who executes where?</BR>***
Executors always reside on the cluster, but the driver can be launched either on the client machine or on a cluster machine. When the driver runs on the client machine it is known as **Client Mode**; when it runs on the cluster it is known as **Cluster Mode**. Spark offers these two deployment modes.

Client mode is preferred only for exploratory work, whereas cluster mode is preferred for production. In client mode, if the client stops then the driver is gone as well.

***Who controls the cluster, and how does Spark get its driver and executors?</BR>***
The cluster manager manages the cluster. Several cluster managers are supported, such as YARN, Kubernetes, Mesos and Spark Standalone.

***What is a Spark session?</BR>***
It is a data structure in which the driver keeps all of its information, including executor locations and status. It is the entry point for any Spark application.

**Client Mode**
* A Spark session is created automatically whenever you open spark-shell.
* As soon as the session is created, a request goes to YARN's Resource Manager.
* The Resource Manager creates a container on one of the machines and runs an application master there.
* This application master requests resources in the form of containers.
* The application master starts executors inside those containers.
* The executors can now communicate directly with the driver, which is on the client machine.

**Cluster Mode**
* The code is submitted using the spark-submit utility.
* The only difference is that the Spark driver runs inside the application master.

### Find the number of Warinings and errors in a log

We're given the following lines from a logger, and we have to find the number of WARN and ERROR entries.

```
"WARN: Tuesday 8 AUG XXXX"
"ERROR: Tuesday 8 AUG XXXXE"
"ERROR: Tuesday 8 AUG XXXXF"
"ERROR: Tuesday 8 AUG XXXXE"
"ERROR: Tuesday 8 AUG XXXXT"
```

In [6]:
val logLines = List(
"WARN: Tuesday 8 AUG XXXX",
"ERROR: Tuesday 8 AUG XXXXE",
"ERROR: Tuesday 8 AUG XXXXF",
"ERROR: Tuesday 8 AUG XXXXE",
"ERROR: Tuesday 8 AUG XXXXT"
)

val logLines_ = sc.parallelize(logLines)
val info = logLines_.map(
x => {
    val splitted = x.split(":")
    (splitted(0),1)
  }
)
val final_ = info.reduceByKey((x,y) => x+y)
final_.collect


logLines = List(WARN: Tuesday 8 AUG XXXX, ERROR: Tuesday 8 AUG XXXXE, ERROR: Tuesday 8 AUG XXXXF, ERROR: Tuesday 8 AUG XXXXE, ERROR: Tuesday 8 AUG XXXXT)
logLines_ = ParallelCollectionRDD[2] at parallelize at <console>:39
info = MapPartitionsRDD[3] at map at <console>:40
final_ = ShuffledRDD[4] at reduceByKey at <console>:46


Array((ERROR,4), (WARN,1))

### Narrow and Wide tranformation

There are two kinds of transformations:
1. Narrow

These are the transformations where no shuffling is involved. They rely on data locality, so no movement of data is required.
`map`, `flatMap` and `filter` are narrow transformations, and there are many more.

2. Wide

These are the transformations where shuffling is involved, e.g. `reduceByKey`. Data has to move between machines, so these are expensive operations and we should avoid them as much as possible.

```
500MB file in hdfs

val rdd1 = sc.textFile("path_to_file")
It will have 4 partitions because of 4 blocks in hdfs.

There is a 1:1 mapping between your file block and rdd partitions.

rdd1.map(x => x.length) // Gives the length of each line; no shuffling is required because there is no movement of data

partitions p1 p2 p3 p4
output     o1 o2 o3 o4


When we talk about reduceByKey, shuffling is involved

  p1         p2        p3         p4
(hello, 1)  (Hi, 1)   (now, 1)   (hello, 1)
(how, 1)    (yes, 1)  (world, 1) (how, 1)

Shuffling is required because grouped data is required
```
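The difference can be modelled in plain Python using the four partitions drawn above. This is a toy sketch: `target_partition` is a made-up deterministic partitioner, not Spark's own hash partitioner.

```python
partitions = [
    [("hello", 1), ("how", 1)],
    [("Hi", 1), ("yes", 1)],
    [("now", 1), ("world", 1)],
    [("hello", 1), ("how", 1)],
]

# Narrow: each output partition depends on exactly one input partition.
upper = [[(key.upper(), value) for (key, value) in part] for part in partitions]

def target_partition(key, n):
    """Toy deterministic partitioner (an assumption for this sketch)."""
    return sum(ord(ch) for ch in key) % n

# Wide: records are re-routed by key, so an output partition may read from many inputs.
shuffled = [[] for _ in partitions]
for part in partitions:
    for key, value in part:
        shuffled[target_partition(key, len(partitions))].append((key, value))

# After the shuffle all copies of a key sit together and can be summed.
totals = {}
for part in shuffled:
    for key, value in part:
        totals[key] = totals.get(key, 0) + value

print(totals)
```

Both copies of `hello` start in different partitions (p1 and p4) and only meet after the re-routing step, which is the data movement that makes the operation wide.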

### Stages in Spark
Stages are marked by shuffle boundaries, meaning that a new stage is created whenever a shuffle is encountered. If we use n wide transformations, n+1 stages are created.

The output of stage 1 is written to disk and stage 2 reads it back from disk.

In the practical above that counts the WARN and ERROR lines in a log, two stages are created: one before `reduceByKey` and one after it.
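The n+1 rule can be written down as a tiny Python helper. This is only a sketch of the counting rule stated above: `count_stages` is a made-up function, and the `WIDE` set is a deliberately partial list for illustration.

```python
# Partial list of wide transformations, for illustration only.
WIDE = {"reduceByKey", "groupByKey", "repartition"}

def count_stages(transformations):
    """n wide transformations split a job into n + 1 stages."""
    return 1 + sum(1 for name in transformations if name in WIDE)

# The log-level practical: map followed by reduceByKey.
print(count_stages(["map", "reduceByKey"]))
```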


### reduceByKey vs reduce

`reduceByKey` is a transformation and `reduce` is an action.
```
logLines = List(WARN: Tuesday 8 AUG XXXX, ERROR: Tuesday 8 AUG XXXXE, ERROR: Tuesday 8 AUG XXXXF, ERROR: Tuesday 8 AUG XXXXE, ERROR: Tuesday 8 AUG XXXXT)
logLines_ = ParallelCollectionRDD[2] at parallelize at <console>:39
info = MapPartitionsRDD[3] at map at <console>:40
final_ = ShuffledRDD[4] at reduceByKey at <console>:46
Array((ERROR,4), (WARN,1))
```

Whenever you call a transformation on an RDD, you always get another RDD back.

Whenever you call an action on an RDD, you get a local variable.

`reduceByKey` only works on pair RDDs, i.e. RDDs of two-element tuples.

`reduce` is an action, as you can see in the code below.

Why did the developers make `reduceByKey` a transformation and `reduce` an action?</BR>
`reduce` gives you a single output that is very small, so there is nothing left to distribute across the cluster.

`reduceByKey` can still produce a huge amount of data, and we may want to do further operations on it in parallel. That is why its output is an RDD.

In [8]:
val input = 1 to 100
val inputRdd = sc.parallelize(input)
inputRdd.reduce((x, y) => x+y)


input = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
inputRdd = ParallelCollectionRDD[0] at parallelize at <console>:28


5050
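The contrast between the two can also be modelled in plain Python. In this sketch `reduce_by_key` is a hypothetical helper that imitates the per-key behaviour on a local list; it is not how Spark implements it.

```python
from functools import reduce

# Action-like: collapses everything into one small value.
total = reduce(lambda x, y: x + y, range(1, 101))

# Transformation-like: one result per key, which can still be a large collection.
pairs = [("WARN", 1), ("ERROR", 1), ("ERROR", 1), ("ERROR", 1), ("ERROR", 1)]

def reduce_by_key(pairs, func):
    """Combine the values of each key with `func`, keeping one entry per key."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out

per_key = reduce_by_key(pairs, lambda x, y: x + y)
print(total, per_key)
```

With two log levels the per-key result is tiny, but with millions of distinct keys it would be millions of entries, which is why keeping it distributed makes sense.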

### groupByKey vs reduceByKey

<font color='red'>**IMPORTANT**</font>

**Similarities**
* Both of them are wide transformations.

Let's say you want to find the frequency of each word and your file is spread across two nodes.

`reduceByKey` first does a local aggregation on each node before sending anything to the reducer. Because more aggregation happens before the shuffle, less data has to be shuffled. This gives two advantages: **less shuffling** and **more parallelism**. It is the same idea as a combiner acting at the mapper in MapReduce.

With `groupByKey` no local aggregation happens, so more shuffling is required: all the key/value pairs are sent across the network.

NOTE: Always prefer `reduceByKey` and avoid `groupByKey` wherever possible.

Consider you have 1TB of data in HDFS and a 1000-node cluster.

Number of RDD partitions = 1TB/128MB = 8192, i.e. roughly 8000

On each node we might end up with about 8 partitions.
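The partition arithmetic can be checked with a few lines of Python. This sketch assumes a 128MB block size and one partition per block, as in these notes; `num_partitions` is a made-up helper. The exact figure for 1TB is 8192, which the notes round to 8000.

```python
import math

BLOCK_MB = 128  # HDFS block size assumed in these notes

def num_partitions(file_mb, block_mb=BLOCK_MB):
    """One RDD partition per HDFS block, so round the block count up."""
    return math.ceil(file_mb / block_mb)

small = num_partitions(500)           # the 500MB file used earlier
large = num_partitions(1024 * 1024)   # 1TB expressed in MB
per_node = large / 1000               # spread over a 1000-node cluster

print(small, large, per_node)
```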

example data:
```
WARN: xyz
ERROR: YYY
ERROR: ttt
INFO: bbb
.
.
INFO: bbb
```

You have to calculate the frequency of the different log levels. If we apply `groupByKey`, all records with the same log level go to the same machine. With only the three log levels shown above, at most three machines will hold all the data.

Before applying `groupByKey` the data was well distributed across 1000 nodes, but afterwards it sits on only three machines. If three machines have to hold 1TB of data in memory, there is a high chance of an out-of-memory error. Even if we don't run out of memory, `groupByKey` is not recommended here because parallelism is restricted.
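How much data crosses the shuffle in each case can be illustrated with a small Python model. The two partitions and their contents are invented sample data, and counting "records shuffled" this way is a simplification of what Spark actually does.

```python
from collections import Counter

# Two partitions of (log_level, 1) pairs, standing in for data on two nodes.
partitions = [
    [("ERROR", 1)] * 4 + [("WARN", 1)] * 2,
    [("ERROR", 1)] * 1 + [("WARN", 1)] * 3,
]

# groupByKey-style: every record crosses the shuffle unchanged.
group_shuffled = sum(len(part) for part in partitions)

# reduceByKey-style: combine inside each partition first,
# then shuffle only one record per key per partition.
combined = [Counter(key for key, _ in part) for part in partitions]
reduce_shuffled = sum(len(counts) for counts in combined)

final = sum(combined, Counter())
print(group_shuffled, reduce_shuffled, dict(final))
```

The final counts are identical either way; only the amount of data moved differs, and the gap grows with the number of records per key.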

```bash
[itv002768@g02 ~]$ hadoop fs -mkdir week10_practical_bigLog
[itv002768@g02 ~]$ hadoop fs -put bigLog.txt week10_practical_bigLog
[itv002768@g02 ~]$ hadoop fs -ls week10_practical_bigLog
Found 1 items
-rw-r--r--   3 itv002768 supergroup  365001114 2022-08-17 09:51 week10_practical_bigLog/bigLog.txt
[itv002768@g02 ~]$ hadoop fs -head week10_practical_bigLog/bigLog.txt
ERROR: Thu Jun 04 10:37:51 BST 2015
WARN: Sun Nov 06 10:37:51 GMT 2016
WARN: Mon Aug 29 10:37:51 BST 2016
ERROR: Thu Dec 10 10:37:51 GMT 2015
ERROR: Fri Dec 26 10:37:51 GMT 2014
ERROR: Thu Feb 02 10:37:51 GMT 2017
WARN: Fri Oct 17 10:37:51 BST 2014
ERROR: Wed Jul 01 10:37:51 BST 2015
WARN: Thu Jul 27 10:37:51 BST 2017
WARN: Thu Oct 19 10:37:51 BST 2017
WARN: Wed Jul 30 10:37:51 BST 2014
ERROR: Fri Jan 12 10:37:51 GMT 2018
WARN: Fri May 15 10:37:51 BST 2015
ERROR: Tue Jan 16 10:37:51 GMT 2018
WARN: Wed Nov 12 10:37:51 GMT 2014
ERROR: Fri Jul 25 10:37:51 BST 2014
```
**Please go through the spark UI for more info for the below example**

In [8]:
val logLines_ = sc.textFile("/user/itv002768/week10_practical_bigLog/bigLog.txt")
val info = logLines_.map(
x => {
    val splitted = x.split(":")
    (splitted(0), 1)
  }
)
info.groupByKey.collect().foreach(x => println(x._1, x._2.size))



(WARN,4998886)
(ERROR,5001114)


logLines_ = /user/itv002768/week10_practical_bigLog/bigLog.txt MapPartitionsRDD[25] at textFile at <console>:30
info = MapPartitionsRDD[26] at map at <console>:31


MapPartitionsRDD[26] at map at <console>:31

In [6]:
val logLines_ = sc.textFile("/user/itv002768/week10_practical_bigLog/bigLog.txt")
val info = logLines_.map(
x => {
    val splitted = x.split(":")
    (splitted(0), 1)
  }
)
info.reduceByKey(_ + _).collect().foreach(println)

(WARN,4998886)
(ERROR,5001114)


logLines_ = /user/itv002768/week10_practical_bigLog/bigLog.txt MapPartitionsRDD[17] at textFile at <console>:27
info = MapPartitionsRDD[18] at map at <console>:28




MapPartitionsRDD[18] at map at <console>:28

### Pair RDD
An RDD which holds tuples of two elements. Transformations like `groupByKey`, `reduceByKey` etc. only work on pair RDDs.

*Is a tuple of two elements the same as a map?*</BR>
No, because a map can only have unique keys, whereas in a pair RDD the keys can repeat.
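A quick plain-Python illustration of that difference, using a list of tuples as a stand-in for a pair RDD and a `dict` as the map (the sample pairs are made up):

```python
# A list of two-element tuples: the same key may appear many times.
pairs = [("ERROR", 1), ("WARN", 1), ("ERROR", 1)]

# A map keeps a single entry per key; later values overwrite earlier ones.
as_map = dict(pairs)

print(len(pairs), len(as_map))
```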

This is the same example from week 9, where we have to find the top customers.

```bash
[itv002768@g02 ~]$ hadoop fs -head customerorders_practical/customerorders-201008-180523.csv
44,8602,37.19
35,5368,65.89
2,3391,40.64
47,6694,14.98
29,680,13.08
91,8900,24.59
70,3959,68.68
85,1733,28.53
53,9900,83.55
```
In the example below we store the output in a file rather than in a variable.

In [5]:
val rawCustomersInfo = sc.textFile("/user/itv002768/customerorders_practical/customerorders-201008-180523.csv")
// Split on "," and take only the first and third elements of the array, converting the third to float
// Put these in a tuple
val splitCust = rawCustomersInfo.map(x => (x.split(",")(0), x.split(",")(2).toFloat ) )

// Calculate the sum of amount and sort by amount in descending order
val totalPurchase = splitCust.reduceByKey((x, y) => x+y).sortBy(x => x._2, false)
//val finalInfo = totalPurchase.collect()
totalPurchase.saveAsTextFile("/user/itv002768/top_customers_output")

rawCustomersInfo = /user/itv002768/customerorders_practical/customerorders-201008-180523.csv MapPartitionsRDD[21] at textFile at <console>:31
splitCust = MapPartitionsRDD[22] at map at <console>:34
totalPurchase = MapPartitionsRDD[28] at sortBy at <console>:37


MapPartitionsRDD[28] at sortBy at <console>:37

```bash
[itv002768@g02 ~]$ hadoop fs -ls /user/itv002768/top_customers_output
Found 3 items
-rw-r--r--   3 itv002768 supergroup          0 2022-08-18 06:17 /user/itv002768/top_customers_output/_SUCCESS
-rw-r--r--   3 itv002768 supergroup        704 2022-08-18 06:17 /user/itv002768/top_customers_output/part-00000
-rw-r--r--   3 itv002768 supergroup        691 2022-08-18 06:17 /user/itv002768/top_customers_output/part-00001
[itv002768@g02 ~]$ hadoop fs -cat /user/itv002768/top_customers_output/*
(68,6375.45)
(73,6206.199)
(39,6193.1104)
(54,6065.39)
(71,5995.66)
(2,5994.591)
(97,5977.1895)
(46,5963.111)
(42,5696.8403)
(59,5642.8906)
(41,5637.619)
(0,5524.9497)
```

The same example again, this time extended with a `filter` and a `map`:

In [8]:
val rawCustomersInfo = sc.textFile("/user/itv002768/customerorders_practical/customerorders-201008-180523.csv")
// Split on "," and take only the first and third elements of the array, converting the third to float
// Put these in a tuple
val splitCust = rawCustomersInfo.map(x => (x.split(",")(0), x.split(",")(2).toFloat ) )

// Calculate the total amount per customer (no sorting in this version)
val totalPurchase = splitCust.reduceByKey((x, y) => x+y)
val finalInfo = totalPurchase.filter(x => x._2>5000) //customers spending more than 5K
val doubledAmount = finalInfo.map(x => (x._1 , x._2*2))
val final_ = doubledAmount.collect()
for(info <- final_){
    println(info)
}

(19,10118.861)
(42,11393.681)
(62,10506.643)
(6,10795.759)
(46,11926.222)
(2,11989.182)
(93,10531.5)
(28,10001.421)
(59,11285.781)
(24,10519.84)
(39,12386.221)
(11,10304.58)
(64,10577.38)
(8,11034.48)
(60,10081.419)
(15,10827.0205)
(35,10310.84)
(97,11954.379)
(0,11049.899)
(55,10596.18)
(40,10372.859)
(71,11991.32)
(22,10038.898)
(26,10500.801)
(68,12750.9)
(33,10509.318)
(17,10065.359)
(73,12412.398)
(69,10246.02)
(41,11275.238)
(92,10758.562)
(9,10645.299)
(34,10661.599)
(61,10994.96)
(81,10225.42)
(25,10115.221)
(63,10830.3)
(65,10280.699)
(29,10065.061)
(90,10580.82)
(32,10992.101)
(85,11006.861)
(54,12130.78)
(72,10674.879)
(52,10490.121)
(58,10875.461)
(87,10412.799)
(70,10736.501)
(43,10737.66)


rawCustomersInfo = /user/itv002768/customerorders_practical/customerorders-201008-180523.csv MapPartitionsRDD[41] at textFile at <console>:33
splitCust = MapPartitionsRDD[42] at map at <console>:36
totalPurchase = ShuffledRDD[43] at reduceByKey at <console>:39
finalInfo = MapPartitionsRDD[44] at filter at <console>:40
doubledAmount = MapPartitionsRDD[45] at map at <console>:41
final_ = Array((19,10118.861), (42,11393.681), (62,10506.643), (6,10795.759), (46,11926.222), (2,11989.182), (93,10531.5), (28,10001.421), (59,11285.781), (24,10519.84), (39,12386.221), (...



Whenever we call an action, all the transformations from the very beginning start executing. What if you then call another action, like `count`? In principle all the transformations are executed again from the very beginning.

For the second action, however, Spark can do some optimization: it can skip the stages whose shuffle output is still on disk, execute only the last stage, and read that stage's input from disk.
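The "recompute on every action" behaviour can be imitated with a small Python class. This is a deliberately naive model: `LazySketch` is a made-up class, and it does not include the reuse of shuffle output described above, so every action replays the whole recipe.

```python
calls = {"map": 0}

def double(x):
    calls["map"] += 1          # count how often the transformation really runs
    return x * 2

class LazySketch:
    """Hypothetical mini-RDD: remembers the recipe, runs it only on an action."""

    def __init__(self, data, funcs=()):
        self.data, self.funcs = data, funcs

    def map(self, f):          # transformation: only extends the recipe
        return LazySketch(self.data, self.funcs + (f,))

    def collect(self):         # action: replays the whole recipe
        out = list(self.data)
        for f in self.funcs:
            out = [f(x) for x in out]
        return out

    def count(self):           # a second action replays it again
        return len(self.collect())

rdd = LazySketch([1, 2, 3]).map(double)
after_transformation = calls["map"]   # nothing has run yet
result = rdd.collect()
n = rdd.count()
print(after_transformation, calls["map"])
```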

In the code example below, how many partitions will there be, given that we're not loading a file from the local filesystem or from HDFS?

We can check the `sc.defaultParallelism` property, which can give different results on different systems.

In [9]:
sc.defaultParallelism

2

In [12]:
val logLines = List(
"WARN: Tuesday 8 AUG XXXX",
"ERROR: Tuesday 8 AUG XXXXE",
"ERROR: Tuesday 8 AUG XXXXF",
"ERROR: Tuesday 8 AUG XXXXE",
"ERROR: Tuesday 8 AUG XXXXT"
)

val logLines_ = sc.parallelize(logLines)
// to check the number of  partitions
logLines_.getNumPartitions

/*
val info = logLines_.map(
x => {
    val splitted = x.split(":")
    (splitted(0),1)
  }
)
val final_ = info.reduceByKey((x,y) => x+y)
final_.collect
*/

logLines = List(WARN: Tuesday 8 AUG XXXX, ERROR: Tuesday 8 AUG XXXXE, ERROR: Tuesday 8 AUG XXXXF, ERROR: Tuesday 8 AUG XXXXE, ERROR: Tuesday 8 AUG XXXXT)
logLines_ = ParallelCollectionRDD[49] at parallelize at <console>:38


2

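For a collection passed to `parallelize`, the partition count comes from `sc.defaultParallelism`, which is 2 here.

When loading a file of only a few KB with `sc.textFile`, we might expect a single partition. We still end up with 2 partitions, because the `sc.defaultMinPartitions` property sets a minimum number of partitions.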

In [14]:
sc.defaultMinPartitions

2

In [13]:
val rawCustomersInfo = sc.textFile("/user/itv002768/customerorders_practical/customerorders-201008-180523.csv")
rawCustomersInfo.getNumPartitions

rawCustomersInfo = /user/itv002768/customerorders_practical/customerorders-201008-180523.csv MapPartitionsRDD[51] at textFile at <console>:29


2

### What is the difference between repartition and coalesce?
<font color='red'>**IMPORTANT**</font>

### Repartition
Consider you have a 500MB file in HDFS and a Spark cluster of 20 worker nodes. By default your RDD will have 4 partitions, so at most 4 of the 20 machines will be used and the remaining 16 will sit idle.

In this case you can change the number of partitions using `rdd.repartition(10)`. `repartition` can both increase and decrease the number of partitions.

*When to decrease the partitions?*</BR>
Consider you have a 1TB file in HDFS, so the number of blocks will be about 8000. With a 1000-node cluster, each node holds about 8 partitions.</BR>
Then you apply a bunch of transformations: map -> filter -> filter -> reduce

When we started, each partition had 128MB of data, but after transformations like `filter` most of the data gets eliminated and the partitions shrink. Do you still want to maintain partitions holding only a few MB each? No, in this case we'll reduce the number of partitions.

`repartition` is a wide transformation because shuffling is involved.

In [18]:
val rawCustomersInfo = sc.textFile("/user/itv002768/customerorders_practical/customerorders-201008-180523.csv")
val rawCustomersInfoNewPartition = rawCustomersInfo.repartition(10)
rawCustomersInfoNewPartition.getNumPartitions


rawCustomersInfo = /user/itv002768/customerorders_practical/customerorders-201008-180523.csv MapPartitionsRDD[7] at textFile at <console>:29
rawCustomersInfoNewPartition = MapPartitionsRDD[11] at repartition at <console>:30


10

### Coalesce
`coalesce` can only decrease the number of partitions. The example below shows how it works: it won't give you an error when you try to increase the number of partitions, but the number won't increase.

To decrease the number of partitions you can use either `coalesce` or `repartition`. To increase it you have to use `repartition`.

For decreasing the number of partitions `coalesce` is preferred, because it tries to minimize shuffling and so performs better.

Consider you have the following layout:

```
node1  - p1 p2 p3 p4
node2  - p5 p6 p7 p8
node3  - p9 p10 p11 p12
node4  - p13 p14 p15 p16
```
Here `rdd1` has 16 partitions.

If you call `rdd1.repartition(8)`:

`repartition` aims for final partitions of equal size, and to achieve that it goes through a complete shuffle.

If you call `rdd1.coalesce(8)`:

Whenever feasible, it tries to combine existing partitions on the same machine to reach the target, which minimizes shuffling.
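The difference in data movement can be sketched with a simplified Python model of the 16-partition layout above. Both helpers are made up for illustration and do not reproduce Spark's real placement logic: `coalesce_sketch` merges neighbouring partitions that already share a node, while `repartition_sketch` assumes every source partition sends a slice to every target partition.

```python
# 16 partitions spread over 4 nodes, as in the layout above.
layout = {f"p{i}": f"node{(i - 1) // 4 + 1}" for i in range(1, 17)}
nodes = ["node1", "node2", "node3", "node4"]

def coalesce_sketch(layout, per_group=2):
    """Merge partitions that already live on the same node (no cross-node moves)."""
    by_node = {}
    for part, node in layout.items():
        by_node.setdefault(node, []).append(part)
    merged = []
    for node, parts in by_node.items():
        for i in range(0, len(parts), per_group):
            merged.append((node, parts[i:i + per_group]))
    return merged

def repartition_sketch(layout, target, nodes):
    """Full shuffle: count source-to-target transfers that cross node boundaries."""
    targets = {t: nodes[t % len(nodes)] for t in range(target)}
    cross_node = sum(1 for src in layout.values()
                     for dst in targets.values() if src != dst)
    return targets, cross_node

merged = coalesce_sketch(layout)
targets, cross_node = repartition_sketch(layout, 8, nodes)
print(len(merged), len(targets), cross_node)
```

Both approaches end with 8 partitions, but in this model the coalesce path moves nothing between nodes while the repartition path needs many cross-node transfers.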

In [21]:
val rawCustomersInfo = sc.textFile("/user/itv002768/customerorders_practical/customerorders-201008-180523.csv")
val rawCustomersInfoNewPartition = rawCustomersInfo.repartition(10)
val decreasePartition = rawCustomersInfoNewPartition.coalesce(3)
val increasePartition = decreasePartition.coalesce(4)
increasePartition.getNumPartitions


rawCustomersInfo = /user/itv002768/customerorders_practical/customerorders-201008-180523.csv MapPartitionsRDD[20] at textFile at <console>:31
rawCustomersInfoNewPartition = MapPartitionsRDD[24] at repartition at <console>:32
decreasePartition = CoalescedRDD[25] at coalesce at <console>:33
increasePartition = CoalescedRDD[26] at coalesce at <console>:34


3

### Practicals(pyspark)

```
[itv002768@g02 ~]$ hadoop fs -head /user/itv002768/week10_practical_search_words/bigdatacampaigndata-201014-183159.csv
big data contents,Broad match,None,TrendyTech Search India,Broad Match #3,1,1,100%,INR,24.06,24.06,0,0,0%,Search
spark training with lab access,Broad match,None,TrendyTech Search India,Broad Match #3,1,2,200%,INR,29.97,59.94,0,0,0%,Search
online hadoop training institutes in hyderabad,Broad match,None,TrendyTech Search India,Broad Match #3,1,1,100%,INR,28.45,28.45,0,0,0%,Search
coursera data analytics,Broad match,None,TrendyTech Search India,Broad Match #3,1,1,100%,INR,24.64,24.64,0,0,0%,Search
ameerpet big data training cost,Broad match,None,TrendyTech Search India,Broad Match #3,2,1,50%,INR,34.86,34.86,0,0,0%,Search
good comment on big data trainer,Broad match,None,TrendyTech Search India,Broad Match #3,1,2,200%,INR,30.47,60.94,0,0,0%,Search
spark classes,Broad match,None,TrendyTech Search India,Broad Match #3,3,1,33.33%,INR,29.21,29.21,0,0,0%,Search
data analytics course near me,Broad match,None,TrendyTech Search India,Broad Match #3,0,1,--,INR,25.42,25.42,0,0,0%,Search
```

In [12]:
"""
val inputFile = sc.textFile("/user/itv002768/week10_practical_search_words/bigdatacampaigndata-201014-183159.csv")
val requiredInfo = inputFile.map(x=> (x.split(",")(10).toFloat, x.split(",")(0)))
val allWords = requiredInfo.flatMapValues(x => x.split(" ")).map(x => (x._2.toLowerCase(), x._1))
val finalInfo = allWords.reduceByKey((x, y) => x + y).sortBy(x => x._2, false)
val finalAnswer = finalInfo.collect()
finalAnswer.take(10).foreach(println)
"""
from pyspark import SparkConf
from pyspark.sql import SparkSession

my_conf = SparkConf()
my_conf.set("spark.app.name", "My pyspark Application")
my_conf.set("spark.master", "local[*]")

spark = SparkSession.builder.config(conf=my_conf).getOrCreate()

input_file = spark.sparkContext.textFile("/user/itv002768/week10_practical_search_words/bigdatacampaigndata-201014-183159.csv")
required_info = input_file.map(lambda x: (float(x.split(",")[10]), x.split(",")[0]))
all_words = required_info.flatMapValues(lambda x: x.split(" ")).map(lambda x: (x[1].lower(), x[0]))
reduced_info = all_words.reduceByKey(lambda x, y: x + y).sortBy(lambda x: x[1], False)
final = reduced_info.collect()
final[:20]




[('data', 16394.64),
 ('big', 12889.279999999999),
 ('in', 5774.84),
 ('hadoop', 4818.34),
 ('course', 4191.59),
 ('training', 4099.37),
 ('online', 3484.42),
 ('courses', 2565.7800000000007),
 ('intellipaat', 2081.22),
 ('analytics', 1458.5099999999998),
 ('tutorial', 1383.37),
 ('hyderabad', 1118.1600000000003),
 ('spark', 1078.72),
 ('best', 1047.7),
 ('bangalore', 1039.2699999999998),
 ('and', 985.8),
 ('certification', 967.44),
 ('for', 967.05),
 ('of', 871.4199999999998),
 ('to', 848.3299999999999)]
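
To see what the `map`, `flatMapValues` and `reduceByKey` steps compute, here is the same logic in plain Python (no Spark) on three of the sample rows shown earlier. It is a sketch for checking the logic by hand, not a replacement for the RDD code.

```python
from collections import defaultdict

# Three rows copied from the sample output of `hadoop fs -head`.
rows = [
    "big data contents,Broad match,None,TrendyTech Search India,Broad Match #3,1,1,100%,INR,24.06,24.06,0,0,0%,Search",
    "spark training with lab access,Broad match,None,TrendyTech Search India,Broad Match #3,1,2,200%,INR,29.97,59.94,0,0,0%,Search",
    "ameerpet big data training cost,Broad match,None,TrendyTech Search India,Broad Match #3,2,1,50%,INR,34.86,34.86,0,0,0%,Search",
]

totals = defaultdict(float)
for row in rows:
    fields = row.split(",")
    cost = float(fields[10])            # column 10: total cost
    for word in fields[0].split(" "):   # column 0: search term
        # Every word of the search term is charged the full row cost.
        totals[word.lower()] += cost

# Sort by total cost, highest first (like sortBy(..., False)).
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
for word, cost in ranked[:3]:
    print(word, round(cost, 2))
```

Note that a row's cost is added once for each word in its search term, so the per-word totals add up to more than the overall spend.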

In [20]:
"""
def loadBoringWords():Set[String] = {
    // Read the file and load words in a variable
    var boringWords: Set[String] = Set()
    val lines = Source.fromFile("/home/itv002768/boringwords-201014-183159.txt").getLines()
    for (line <- lines){
        boringWords += line 
    }
    boringWords
}

var nameSet = sc.broadcast(loadBoringWords)
val inputFile = sc.textFile("/user/itv002768/week10_practical_search_words/bigdatacampaigndata-201014-183159.csv")
val requiredInfo = inputFile.map(x=> (x.split(",")(10).toFloat, x.split(",")(0)))
val allWords = requiredInfo.flatMapValues(x => x.split(" ")).map(x => (x._2.toLowerCase(), x._1))
/*
 nameSet.value(word) returns true if the word is in the broadcast set, false otherwise.
 Since we want to drop the words that are in nameSet,
 we negate the check with !
 */
val filteredWords = allWords.filter(x => !nameSet.value(x._1))
val finalInfo = filteredWords.reduceByKey((x, y) => x + y).sortBy(x => x._2, false)
val finalAnswer = finalInfo.collect()
finalAnswer.take(20).foreach(println)
"""


def load_boring_words():
    # Read the file on the driver and return its lines as a set for fast lookups
    with open("/home/itv002768/boringwords-201014-183159.txt") as f:
        return {line.strip() for line in f}

named_set = spark.sparkContext.broadcast(load_boring_words())
input_file = spark.sparkContext.textFile("/user/itv002768/week10_practical_search_words/bigdatacampaigndata-201014-183159.csv")
required_info = input_file.map(lambda x: (float(x.split(",")[10]), x.split(",")[0]))
all_words = required_info.flatMapValues(lambda x: x.split(" ")).map(lambda x: (x[1].lower(), x[0]))
filtered_words = all_words.filter(lambda x: x[0] not in named_set.value)
reduced_info = filtered_words.reduceByKey(lambda x, y: x + y).sortBy(lambda x: x[1], False)
final = reduced_info.collect()
final[:20]

[('hadoop', 4818.34),
 ('intellipaat', 2081.22),
 ('analytics', 1458.5099999999998),
 ('hyderabad', 1118.1600000000003),
 ('spark', 1078.72),
 ('bangalore', 1039.2699999999998),
 ('cloudxlab', 707.52),
 ('bigdata', 694.48),
 ('dataflair', 643.9000000000001),
 ('chennai', 604.0400000000001),
 ('edureka', 351.44),
 ('iit', 308.73),
 ('coursera', 293.25),
 ('pune', 284.71),
 ('curso', 277.53000000000003),
 ('cloudera', 258.06),
 ('simplilearn', 252.45),
 ('scala', 250.73000000000002),
 ('ameerpet', 184.94),
 ('flair', 154.13)]
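
The filter step on its own is just a set-membership test. The sketch below shows it in plain Python; the contents of `boring_words` are a hypothetical subset chosen for illustration, not the actual file.

```python
# Hypothetical subset of the boring-words file, for illustration only.
boring_words = {"big", "data", "in", "training"}

# (word, cost) pairs like the ones produced before the filter.
word_costs = [
    ("big", 24.06),
    ("data", 24.06),
    ("hadoop", 28.45),
    ("in", 28.45),
    ("spark", 59.94),
]

# Same predicate as the RDD filter: keep a pair only if its word
# is not in the set.
kept = [(word, cost) for word, cost in word_costs if word not in boring_words]
print(kept)
```

A set is used rather than a list because membership checks on a set take constant time on average, and the check runs once per word.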

In [1]:
"""
val rawCustomersInfo = sc.textFile("/user/itv002768/customerorders_practical/customerorders-201008-180523.csv")
// Split on "," and keep only the first and third elements of the array, converting the third to Float
// Put these in a tuple
val splitCust = rawCustomersInfo.map(x => (x.split(",")(0), x.split(",")(2).toFloat ) )

// Calculate the sum of amount and sort by amount in descending order
val totalPurchase = splitCust.reduceByKey((x, y) => x+y).sortBy(x => x._2, false)
//val finalInfo = totalPurchase.collect()
totalPurchase.saveAsTextFile("/user/itv002768/top_customers_output")
"""
from pyspark import SparkConf
from pyspark.sql import SparkSession

my_conf = SparkConf()
my_conf.set("spark.app.name", "My pyspark Application")
my_conf.set("spark.master", "local[*]")

spark = SparkSession.builder.config(conf=my_conf).getOrCreate()

cust_info = spark.sparkContext.textFile("/user/itv002768/customerorders_practical/customerorders-201008-180523.csv")
split_cust = cust_info.map(lambda x: (x.split(",")[0], float(x.split(",")[2])))
total_purchase = split_cust.reduceByKey(lambda x, y: x+y).sortBy(lambda x: x[1], False)
total_purchase.collect()


[('68', 6375.450000000001),
 ('73', 6206.200000000001),
 ('39', 6193.110000000001),
 ('54', 6065.390000000001),
 ('71', 5995.660000000002),
 ('2', 5994.59),
 ('97', 5977.1900000000005),
 ('46', 5963.109999999999),
 ('42', 5696.840000000002),
 ('59', 5642.889999999999),
 ('41', 5637.620000000001),
 ('0', 5524.949999999999),
 ('8', 5517.24),
 ('85', 5503.43),
 ('61', 5497.4800000000005),
 ('32', 5496.05),
 ('58', 5437.73),
 ('63', 5415.1500000000015),
 ('15', 5413.51),
 ('6', 5397.879999999999),
 ('92', 5379.279999999999),
 ('43', 5368.83),
 ('70', 5368.249999999999),
 ('72', 5337.439999999999),
 ('34', 5330.8),
 ('9', 5322.65),
 ('55', 5298.089999999999),
 ('90', 5290.41),
 ('64', 5288.689999999999),
 ('93', 5265.75),
 ('24', 5259.92),
 ('33', 5254.660000000002),
 ('62', 5253.3200000000015),
 ('26', 5250.4),
 ('52', 5245.0599999999995),
 ('87', 5206.4),
 ('40', 5186.429999999999),
 ('35', 5155.420000000001),
 ('11', 5152.289999999999),
 ('65', 5140.35),
 ('69', 5123.01),
 ('81', 5112.70


In [11]:
x = spark.sparkContext.parallelize([("a", "aaa bbb ccc"), ("b", "ddd eee fff")])
y = x.flatMapValues(lambda x: x.split(" "))
y.reduceByKey(lambda x, y: x+y).collect()

[('b', 'dddeeefff'), ('a', 'aaabbbccc')]
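
The intermediate result of `flatMapValues` in the cell above can be reproduced in plain Python. This sketch assumes the values are combined in their original order; in Spark the order in which `reduceByKey` combines values is not guaranteed in general.

```python
pairs = [("a", "aaa bbb ccc"), ("b", "ddd eee fff")]

# flatMapValues: split each value and pair every piece with the original key.
flat = [(key, word) for key, value in pairs for word in value.split(" ")]
print(flat)

# reduceByKey with string concatenation: join all pieces that share a key.
reduced = {}
for key, word in flat:
    reduced[key] = reduced[key] + word if key in reduced else word
print(reduced)
```

The key is repeated for every piece produced from its value, which is what lets `reduceByKey` group the pieces back together afterwards.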