Table Of Contents
=================
* What is Spark?
  - [Compute Engine](#Compute-Engine)
  - [In-memory](#In-memory)
  - [General Purpose](#General-Purpose)  
* [RDD](#RDD) (Resilient Distributed Dataset) </BR>
  Properties of RDD
  - [Resilient](#Resilient)
  - [Immutable](#Immutable)
  - [Lazy Transformations](#Lazy-Transformations)
* [Practicals - Spark](#Practicals)
  - [Find Word Count in a file](#Find-Word-Count-in-a-file)
  - [Find Top Customers](#Find-Top-Customers)
  - [Find Movie Ratings](#Find-Movie-Ratings)
  - [Find Average Linkedin Connections](#Find-Average-Linkedin-Connections)
* [Practicals - pyspark](#Practicals---pyspark)
  - [Find Word Count in a file](#Find-Word-Count-in-a-file-pyspark)
  - [Find Top Customers](#Find-Top-Customers-pyspark)
  - [Find Movie Ratings](#Find-Movie-Ratings-pyspark)
  - [Find Average Linkedin Connections](#Find-Average-Linkedin-Connections-pyspark)


## What is Apache Spark?
Apache Spark is a general-purpose, in-memory compute engine.

### Compute Engine

Hadoop provides three things:
* HDFS - storage
* MapReduce - computation
* YARN - resource manager

In this picture Spark takes the place of MapReduce: it is an alternative compute engine, not a replacement for the whole Hadoop ecosystem.

Spark cannot do anything on its own. It needs two things to work:
1. Storage
2. Resource Manager

In this way it is a plug-and-play compute engine, though it is not bundled with the Hadoop ecosystem by default.
Spark is flexible enough to use local storage, S3, Google Cloud Storage, HDFS, etc.

For the resource manager it can use YARN, Mesos or Kubernetes.

### In-memory
In production there is rarely a single MapReduce job; there is usually a chain of them, e.g. mr1, mr2, ..., mr5.

Each of these jobs takes its input from HDFS, which means reading from disk, and writes its output back to HDFS, which means writing to disk. So every MapReduce job needs two disk I/Os, one for the read and one for the write. This is the bottleneck with MapReduce: a chain of jobs needs a lot of disk I/O.

In Spark, we load the data into a variable V1 that lives in memory, process it, and assign the output to a variable V2 that is again in memory. The same goes for the rest of the chain. The data is read from disk only at the very start and written to disk only at the end, so only two disk I/Os are required for the whole chain. That's why Spark is said to be 10 to 100 times faster than MapReduce.
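The counting argument above can be written out as a small calculation. This is a plain Python sketch of the simplified model in the text (two disk I/Os per MapReduce job, two in total for an in-memory chain); the function names are made up for illustration, and real jobs also touch disk for shuffles and spills.

```python
def disk_ios_mapreduce(num_jobs):
    # Each MapReduce job reads its input from disk and writes its output to disk.
    return 2 * num_jobs


def disk_ios_in_memory(num_jobs):
    # One read at the start and one write at the end; intermediates stay in memory.
    return 2 if num_jobs > 0 else 0


jobs = 5  # mr1 ... mr5 from the example above
print(disk_ios_mapreduce(jobs))   # 10
print(disk_ios_in_memory(jobs))   # 2
```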

### General Purpose
In Hadoop, we use Pig to clean data, Hive for querying and Mahout for ML. For each task we have to learn a different tool, and all of them are bound to Map and Reduce: whether or not the problem fits the MapReduce model, these tools will use MapReduce.

In Spark, we have to learn just one style of writing code. Cleaning, data ingestion, querying, etc. can all be done with a single engine.




## RDD

The basic unit that holds data in Spark is called an **RDD (Resilient Distributed Dataset)**.

Let's say we create a list with some elements in Scala; it is kept in memory on a single system. But if there are too many elements to fit on one system, we can distribute the list across n systems. Roughly, an RDD is an in-memory distributed collection.

Basic program flow in Spark (pseudo-code)

```
rdd1 = load file1 from hdfs
rdd2 = rdd1.map // not literally map, just some transformation of the data
rdd3 = rdd2.filter // every time we do some operation we create a new rdd
rdd3.collect() // if rdd3 is the final answer, collect it

No calculation happens on the first three lines; everything happens on the last line.
All the steps are added to a flow, and when we call collect() they start
executing.

We can call this flow a DAG (Directed Acyclic Graph). A DAG is nothing but an execution plan.
```

There are two kinds of operations in Spark:
1. Transformations
In the above pseudo-code, the first three lines only describe RDDs, and all transformations (such as map and filter) are lazy. Every transformation is added to the DAG, and execution starts only once you call collect.

2. Actions
In the above pseudo-code, the last line (collect()) is an action.
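To make the transformation/action split concrete, here is a toy sketch in plain Python, not the Spark API. The hypothetical `LazyDataset` class only records `map` and `filter` steps and runs them when `collect()` is called, which is roughly how the flow described above behaves.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on collect()."""

    def __init__(self, source, steps=()):
        self._source = source        # callable that produces the input records
        self._steps = tuple(steps)   # recorded plan (a simple linear "DAG")

    def map(self, fn):
        # Transformation: returns a new dataset with one more recorded step.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded, nothing is computed here.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the source read and the plan executed.
        records = iter(self._source())
        for kind, fn in self._steps:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return list(records)


reads = []


def load():
    reads.append("read")   # track when the source is actually read
    return [1, 2, 3, 4, 5, 6]


rdd1 = LazyDataset(load)
rdd2 = rdd1.map(lambda x: x * 10)
rdd3 = rdd2.filter(lambda x: x > 30)
print(len(reads))       # 0: nothing has run yet
print(rdd3.collect())   # [40, 50, 60]
print(len(reads))       # 1: the action triggered the read
```

Each call returns a new object instead of modifying the old one, which mirrors the "every operation creates a new rdd" comment in the pseudo-code.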


Let's say we have a four-node Hadoop cluster and a 500 MB dataset. With the default block size (128 MB), four blocks will be made; assume 1 block lands on each node.

In Spark, when we load the data from the file into a variable rdd1, each block is loaded into the memory (RAM) of its node. Each in-memory piece is called a partition, so we have four partitions for four blocks. All the partitions together are known as an RDD: in-memory and distributed across nodes. If we lose an RDD we can recover it; that's why it is **Resilient**.
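The block count above follows from simple arithmetic. A minimal sketch, assuming the 128 MB default HDFS block size of Hadoop 2 and later (the helper name `block_sizes` is made up for illustration):

```python
def block_sizes(file_size_mb, block_size_mb=128):
    # Split a file into full blocks plus one smaller trailing block, if any.
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


blocks = block_sizes(500)
print(blocks)        # [128, 128, 128, 116]
print(len(blocks))   # 4 blocks, hence 4 partitions in the example above
```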

### Resilient
**How are RDDs resilient?**</BR>
Resilience is the ability to recover quickly.

Suppose we have rdd1, apply some operation to generate rdd2, and in the same way generate rdd3. If rdd3 is lost, how can we recover it?

We already know how rdd3 was created. Spark finds its parent RDD using the lineage graph and reapplies the transformation to regenerate rdd3. So an RDD gets its fault tolerance from the lineage graph, which keeps track of the transformations to be executed once an action is called.

In HDFS we get fault tolerance through replication, but for RDDs that is not practical because memory is not cheap.
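The recovery idea can be sketched in plain Python. The `ToyRDD` class below is hypothetical and far simpler than Spark's real lineage tracking: each dataset remembers its parent and the function that derived it, so lost data can be recomputed on demand.

```python
class ToyRDD:
    def __init__(self, parent=None, transform=None, data=None):
        self.parent = parent        # lineage: which dataset this one came from
        self.transform = transform  # lineage: how it was derived from the parent
        self._data = data           # in-memory contents; None means "not in memory"

    def derive(self, transform):
        # Create a child that knows its parent and its transformation.
        return ToyRDD(parent=self, transform=transform)

    def lose(self):
        # Simulate losing the in-memory copy (e.g. a node failure).
        self._data = None

    def data(self):
        if self._data is None:
            if self.parent is None:
                raise RuntimeError("source data is gone and has no lineage")
            # Walk up the lineage and reapply the transformation.
            self._data = self.transform(self.parent.data())
        return self._data


rdd1 = ToyRDD(data=[1, 2, 3, 4])
rdd2 = rdd1.derive(lambda rows: [r * 2 for r in rows])
rdd3 = rdd2.derive(lambda rows: [r for r in rows if r > 4])

print(rdd3.data())   # [6, 8]
rdd3.lose()
rdd2.lose()
print(rdd3.data())   # [6, 8], rebuilt from rdd1 through the lineage
```

In this toy the root dataset has no lineage, so losing it is fatal; in the scenario described above the root would be re-read from HDFS, where replication protects it.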

### Immutable
**RDDs are Immutable</BR>**
Once we load data into an RDD, it cannot be changed. Why? To recover a child RDD we have to use the parent RDD; if the data were overwritten in place, the parent would no longer exist to recompute from. Immutability and lineage together give us the power to regenerate an RDD after a failure.

### Lazy Transformations
**Why are transformations lazy?</BR>**
Assume that transformations are not lazy, and consider a 1 GB file in HDFS.

```
rdd1 = load file1 from hdfs
print first line of rdd1
```

In this case, to print the first line we would load the whole 1 GB file into memory.

Now consider that Spark uses lazy transformations (which it does). On the first line nothing is loaded; only an entry is added to the DAG. By the time it executes, Spark knows that the user only wants the first line, so it does not need to load the whole file into memory.

Another example:

File1 is in HDFS with 1M lines

```
rdd1 = load file1 from hdfs
rdd2 = rdd1.map
rdd3 = rdd2.filter
rdd3.collect
```

If transformations were not lazy, the first line would fill rdd1 with 1M lines (the RDD is materialized). The second line would process 1M lines with map and produce an output of 1M lines. On line 3 we apply the filter, and we might get only 5 records back. We are interested in only those five records, yet we would have held all 1M lines in memory twice to get them. Because transformations are lazy, Spark sees the whole plan before running anything and can pass each record through map and filter in one go, without materializing the intermediate results.
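Python generators give a rough feel for this. The sketch below is an analogy in plain Python, not Spark code: `map` and `filter` are lazy, so asking for the first matching record reads only as many lines as needed. The `read_lines` helper and the counter are made up for illustration.

```python
def read_lines(n, counter):
    # Stand-in for reading a file line by line; counts how many lines were read.
    for i in range(n):
        counter["read"] += 1
        yield f"line {i}"


counter = {"read": 0}
lines = read_lines(1_000_000, counter)
upper = map(str.upper, lines)                        # lazy: nothing read yet
matches = filter(lambda s: s.endswith("7"), upper)   # lazy: nothing read yet

print(counter["read"])   # 0
first = next(matches)    # pull a single result through the whole pipeline
print(first)             # LINE 7
print(counter["read"])   # 8: only lines 0..7 were read, not 1M
```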

## Practicals

### Find Word Count in a file
We have to find the frequency of the words in a file that resides in HDFS.

Create a file file1 and put some text in it.

Move the file to HDFS:

```bash
[itv002768@g02 ~]$ hadoop fs -mkdir spark_wordcount_test
[itv002768@g02 ~]$ hadoop fs -put file1 spark_wordcount_test
[itv002768@g02 ~]$ hadoop fs -ls spark_wordcount_test
Found 1 items
-rw-r--r--   3 itv002768 supergroup       1673 2022-08-07 06:03 spark_wordcount_test/file1

[itv002768@g02 ~]$ hadoop fs -head spark_wordcount_test/file1
this is very interesting
this is very interesting
this is very interesting
this is very interesting
this is very interesting
this is very interesting
this is very interesting
this is very interesting
this is very interesting

[itv002768@g02 ~]$ spark-shell
Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set
Spark2 will be picked by default
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/07 06:12:57 WARN Utils: spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
22/08/07 06:13:02 WARN Utils: spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
Spark context Web UI available at http://g02.itversity.com:4040
Spark context available as 'sc' (master = yarn, app id = application_1658918988971_4431).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```
In the above output we can see the following line:</BR>
Spark context available as 'sc' (master = yarn, app id = application_1658918988971_4431).

sc is the entry point to the cluster. It gives you the capability to run code on the cluster; code that does not go through sc runs only locally.

The basic unit that holds data in Spark is the RDD. Our file is in HDFS, so the first step is to load the data into an RDD.

Here I'm using a Spark kernel to run the commands; you can also use spark-shell.

```scala
val rdd1 = sc.textFile("/user/itv002768/spark_wordcount_test/file1")
```

```
rdd1 = /user/itv002768/spark_wordcount_test/file1 MapPartitionsRDD[1] at textFile at <console>:27
```

flatMap takes each line as input and applies the function we give it. Below we split each line into words on spaces. With a plain map, that would give one array of words per line:

Array(Array(words of line1), Array(words of line2), Array(words of line3))

flatMap flattens this 2D array, so we get a single collection containing all the words in the input file.
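The difference between map and flatMap can be seen with plain Python lists. This is an analogy, not Spark code; the two sample lines are made up.

```python
from itertools import chain

lines = ["this is very interesting", "this is not"]

# map-style: one list of words per input line (a nested, 2D result)
mapped = [line.split(" ") for line in lines]

# flatMap-style: the per-line lists are flattened into one list of words
flat = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)  # [['this', 'is', 'very', 'interesting'], ['this', 'is', 'not']]
print(flat)    # ['this', 'is', 'very', 'interesting', 'this', 'is', 'not']
```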

```scala
val rdd2 = rdd1.flatMap(x => x.split(" "))
```

```
rdd2 = MapPartitionsRDD[2] at flatMap at <console>:27
```

```scala
val rdd3 = rdd2.map(x => (x,1)) // pair each word with 1, e.g. (this, 1), (is, 1) .....
```

```
rdd3 = MapPartitionsRDD[3] at map at <console>:27
```

How reduceByKey works:

First it brings together all the key/value pairs that share the same key. It then always works on two values at a time.

Here, x is the first value and y is the second value for a key. The right-hand side adds them (x + y), and the result is combined with the next value until one value per key remains.
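A minimal plain-Python sketch of the same idea, assuming all pairs fit in one process (Spark additionally shuffles the pairs across nodes, which is not modelled here). The `reduce_by_key` helper is hypothetical, written only for illustration.

```python
def reduce_by_key(pairs, fn):
    # Combine values that share a key, two at a time, using fn.
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc


pairs = [("this", 1), ("is", 1), ("this", 1), ("very", 1), ("is", 1), ("this", 1)]
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)  # {'this': 3, 'is': 2, 'very': 1}
```

A key that appears only once never goes through `fn`; its single value is returned unchanged.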

```scala
val rdd4 = rdd3.reduceByKey((x, y) => x+y)
```

```
rdd4 = ShuffledRDD[4] at reduceByKey at <console>:27
```

```scala
rdd4.collect()
```

```
Array((his,1), (this,25), (is,26), (am,34), (interestingt,1), ("",3), (there,34), (going,34), (very,26), (interesting,25), (not,34), (it,34), (i,34), (isort,34))
```

**Finding the word count using pyspark**

```python
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/

Using Python version 2.7.5 (default, Nov 16 2020 22:23:17)
SparkSession available as 'spark'.
>>> rdd1 = spark.textFile("/user/itv002768/spark_wordcount_test/file1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'SparkSession' object has no attribute 'textFile'
>>> rdd1 = sc.textFile("/user/itv002768/spark_wordcount_test/file1")
>>> rdd2 = rdd1.flatMap(lambda x: x.split(" "))
>>> rdd3 = rdd2.map(lambda x: (x,1))
>>> rdd4 = rdd3.reduceByKey(lambda x, y:  x+y)
>>> rdd4.collect()
[(u'', 3), (u'very', 26), (u'i', 34), (u'is', 26), (u'am', 34), (u'this', 25), (u'isort', 34), (u'not', 34), (u'interestingt', 1), (u'going', 34), (u'his', 1), (u'there', 34), (u'it', 34), (u'interesting', 25)]

"""
Instead of writing collect, we use saveAsTextFile to save the output in a file.

rdd4.saveAsTextFile('/user/itv002768/spark_wordcount_test_result')
"""

>>> rdd4.saveAsTextFile('/user/itv002768/spark_wordcount_test_result')
>>>
[itv002768@g02 ~]$ hadoop fs -ls spark_wordcount_test_result
Found 3 items
-rw-r--r--   3 itv002768 supergroup          0 2022-08-07 07:05 spark_wordcount_test_result/_SUCCESS
-rw-r--r--   3 itv002768 supergroup        121 2022-08-07 07:05 spark_wordcount_test_result/part-00000
-rw-r--r--   3 itv002768 supergroup         75 2022-08-07 07:05 spark_wordcount_test_result/part-00001
[itv002768@g02 ~]$ hadoop fs -cat spark_wordcount_test_result/*
(u'', 3)
(u'i', 34)
(u'is', 26)
(u'am', 34)
(u'this', 25)
(u'not', 34)
(u'isort', 34)
(u'very', 26)
(u'interestingt', 1)
(u'going', 34)
(u'his', 1)
(u'there', 34)
(u'it', 34)
(u'interesting', 25)
```


The code below runs all the steps in a single cell.

```scala
val rdd1 = sc.textFile("/user/itv002768/spark_wordcount_test/file1")
val rdd2 = rdd1.flatMap(x => x.split(" "))
val rdd3 = rdd2.map(x => (x,1))
val rdd4 = rdd3.reduceByKey((x, y) => x+y)
rdd4.collect()
```

```
rdd1 = /user/itv002768/spark_wordcount_test/file1 MapPartitionsRDD[1] at textFile at <console>:27
rdd2 = MapPartitionsRDD[2] at flatMap at <console>:28
rdd3 = MapPartitionsRDD[3] at map at <console>:29
rdd4 = ShuffledRDD[4] at reduceByKey at <console>:30

Array((his,1), (this,25), (is,26), (am,34), (interestingt,1), ("",3), (there,34), (going,34), (very,26), (interesting,25), (not,34), (it,34), (i,34), (isort,34))
```

We'll run the same example again, this time using the Scala IDE.

Please refer to the video **Spark Fundamental Practical - 2** for details, since it also covers the IDE settings related to versions and jars.



```scala
import org.apache.spark.SparkContext
import org.apache.log4j.Level
import org.apache.log4j.Logger

object wordcount {
  def main(args: Array[String]){
    System.setProperty("hadoop.home.dir", "C:/Hadoop")
    Logger.getLogger("org").setLevel(Level.ERROR)
    // local[*] - cluster is on local and use all the cpu cores
    // wordcount - name of application, we can give any name
    val sc = new SparkContext("local[*]", "wordcount")
    val input = sc.textFile("C:/Users/tushar.sharma/bigdata/spark_practice/search_data.txt")
    val allWords = input.flatMap(x => x.split(" "))
    val wordsValue = allWords.map(x => (x,1))
    val wordsFrequency = wordsValue.reduceByKey((x, y) => x+y)
    wordsFrequency.collect.foreach(println)
  }
}
```

**The program below converts all the words to lowercase and then counts them**

```scala
import org.apache.spark.SparkContext
import org.apache.log4j.Level
import org.apache.log4j.Logger

object wordcount {
  def main(args: Array[String]){
    System.setProperty("hadoop.home.dir", "C:/Hadoop")
    Logger.getLogger("org").setLevel(Level.ERROR)
    // local[*] - cluster is on local and use all the cpu cores
    // wordcount - name of application, we can give any name
    val sc = new SparkContext("local[*]", "wordcount")
    val input = sc.textFile("C:\\Users\\tushar.sharma\\bigdata\\spark_practice\\search_data.txt.txt")
    val allWords = input.flatMap(_.split(" "))
    val lowerCaseWords = allWords.map(_.toLowerCase())
    val wordsValue = lowerCaseWords.map(x => (x,1))
    val wordsFrequency = wordsValue.reduceByKey((x, y) => x+y)
    
    wordsFrequency.collect.foreach(println)
  }
}
```

**If we want the top ten words, we have to sort the output by value.</BR>
sortByKey can sort only on keys, not on values.</BR>
So we swap key and value, sort by key, and then swap back before showing the results.**

```scala
import org.apache.spark.SparkContext
import org.apache.log4j.Level
import org.apache.log4j.Logger

object wordcount {
  def main(args: Array[String]){
    System.setProperty("hadoop.home.dir", "C:/Hadoop")
    Logger.getLogger("org").setLevel(Level.ERROR)
    // local[*] - cluster is on local and use all the cpu cores
    // wordcount - name of application, we can give any name
    val sc = new SparkContext("local[*]", "wordcount")
    val input = sc.textFile("C:\\Users\\tushar.sharma\\bigdata\\spark_practice\\search_data.txt.txt")
    val allWords = input.flatMap(_.split(" "))
    val lowerCaseWords = allWords.map(_.toLowerCase())
    val wordsValue = lowerCaseWords.map(x => (x,1))
    val wordsFrequency = wordsValue.reduceByKey((x, y) => x+y)
    val replaceKeyWithVal = wordsFrequency.map(x => (x._2, x._1))
    val finalCount_ = replaceKeyWithVal.sortByKey(false) // false gives descending order
    val finalCount = finalCount_.map(x => (x._2, x._1))
    finalCount.collect.foreach(println)
  }
}
```

**If you want to sort by the second element of the tuple, use the sortBy function**
```scala
import org.apache.spark.SparkContext
import org.apache.log4j.Level
import org.apache.log4j.Logger

object wordcount {
  def main(args: Array[String]){
    System.setProperty("hadoop.home.dir", "C:/Hadoop")
    Logger.getLogger("org").setLevel(Level.ERROR)
    // local[*] - cluster is on local and use all the cpu cores
    // wordcount - name of application, we can give any name
    val sc = new SparkContext("local[*]", "wordcount")
    val input = sc.textFile("C:\\Users\\tushar.sharma\\bigdata\\spark_practice\\search_data.txt.txt")
    val allWords = input.flatMap(_.split(" "))
    val lowerCaseWords = allWords.map(_.toLowerCase())
    val wordsValue = lowerCaseWords.map(x => (x,1))
    val wordsFrequency = wordsValue.reduceByKey((x, y) => x+y).sortBy(x => x._2)
    wordsFrequency.collect.foreach(println)
  }
}
```
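Note that the `sortBy` call above sorts in ascending order unless told otherwise, so the largest counts come last. When only the top few entries are needed, a full sort is not strictly required. As a plain-Python sketch on already-collected `(word, count)` pairs, `heapq.nlargest` picks the largest counts directly; the sample pairs are taken from the pyspark word count output later in this document.

```python
import heapq

# (word, count) pairs as they might come back from collect()
word_counts = [("spark", 42), ("data", 361), ("hadoop", 100), ("big", 285), ("in", 171)]

# Pick the 3 pairs with the largest count without sorting the whole list
top3 = heapq.nlargest(3, word_counts, key=lambda kv: kv[1])
print(top3)  # [('data', 361), ('big', 285), ('in', 171)]
```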


### Find Top Customers

Below are a few lines from the customerorders file, where the columns are customer_id, product_id, amount_spent.

The same customer can appear in multiple rows, one for each product they bought.

Problem statement - Find the top ten customers who spent the most on shopping.

**Note:** I've written and executed the code here; you can run the same code in an IDE, as long as the input files are in place.

```bash
[itv002768@g02 ~]$ hadoop fs -mkdir customerorders_practical
[itv002768@g02 ~]$ hadoop fs -put customerorders-201008-180523.csv customerorders_practical
[itv002768@g02 ~]$ hadoop fs -ls customerorders_practical
Found 1 items
-rw-r--r--   3 itv002768 supergroup     146855 2022-08-08 13:25 customerorders_practical/customerorders-201008-180523.csv
[itv002768@g02 ~]$ hadoop fs -head customerorders_practical/customerorders-201008-180523.csv
44,8602,37.19
35,5368,65.89
2,3391,40.64
47,6694,14.98
29,680,13.08
91,8900,24.59
70,3959,68.68
85,1733,28.53
53,9900,83.55
14,1505,4.32
51,3378,19.80
42,6926,57.77
2,4424,55.77
79,9291,33.17
50,3901,23.57
20,6633,6.49
15,6148,65.53
44,8331,99.19
5,3505,64.18
48,5539,32.42
```

```scala
val rawCustomersInfo = sc.textFile("/user/itv002768/customerorders_practical/customerorders-201008-180523.csv")
// Split on "," and keep only the first and third fields, converting the third to Float
// Put these in a tuple
val splitCust = rawCustomersInfo.map(x => (x.split(",")(0), x.split(",")(2).toFloat ) )

// Calculate the sum of amount and sort by amount in descending order
val totalPurchase = splitCust.reduceByKey((x, y) => x+y).sortBy(x => x._2, false)
val finalInfo = totalPurchase.collect
val topTen = finalInfo.take(10)
println("Top 10 Customers::")
for(info <- topTen){
    println(info)
}
```

```
Top 10 Customers::
(68,6375.45)
(73,6206.199)
(39,6193.1104)
(54,6065.39)
(71,5995.66)
(2,5994.591)
(97,5977.1895)
(46,5963.111)
(42,5696.8403)
(59,5642.8906)

rawCustomersInfo = /user/itv002768/customerorders_practical/customerorders-201008-180523.csv MapPartitionsRDD[112] at textFile at <console>:33
splitCust = MapPartitionsRDD[113] at map at <console>:36
totalPurchase = MapPartitionsRDD[119] at sortBy at <console>:39
finalInfo = Array((68,6375.45), (73,6206.199), (39,6193.1104), (54,6065.39), (71,5995.66), (2,5994.591), (97,5977.1895), (46,5963.111), (42,5696.8403), (59,5642.8906), (41,5637.619), (0,5524.9497), (8,5517.24), (85,5503.4307), (61,5497.48), (32,5496.0503), (58,5437.7305), (63,5415.15), (15,5413.5103), (6,5397.8794), (92,5379.281), (43,5368.83), (70,5368.2505), (72,5337.4395), (34,5330.7...
```

### Find Movie Ratings
Below are a few lines from the moviedata-201008-180523.data file, where columns are separated by tabs.

Column details: user_id    movie_id    rating_given    timestamp

Problem statement - How many times were movies rated 1 star, 2 stars, ..., 5 stars?

```bash
[itv002768@g02 ~]$ hadoop fs -mkdir movierating_practical
[itv002768@g02 ~]$ hadoop fs -put moviedata-201008-180523.data movierating_practical
[itv002768@g02 ~]$ hadoop fs -head movierating_practical/moviedata-201008-180523.data
196     242     3       881250949
186     302     3       891717742
22      377     1       878887116
244     51      2       880606923
166     346     1       886397596
298     474     4       884182806
115     265     2       881171488
253     465     5       891628467
```

```scala
val rawMovieRatingInfo = sc.textFile("/user/itv002768/movierating_practical/moviedata-201008-180523.data")
// Split on "\t" and take only the third column
// Put this column value in a tuple along with 1 as the second element
val requiredCol = rawMovieRatingInfo.map(x => (x.split("\t")(2), 1) )
// Sum it up
val final_ = requiredCol.reduceByKey((x, y) => x+y)

final_.collect
```

```
rawMovieRatingInfo = /user/itv002768/movierating_practical/moviedata-201008-180523.data MapPartitionsRDD[133] at textFile at <console>:31
requiredCol = MapPartitionsRDD[134] at map at <console>:34
final_ = ShuffledRDD[135] at reduceByKey at <console>:36

Array((4,34174), (2,11370), (5,21201), (3,27145), (1,6110))
```

```scala
val rawMovieRatingInfo = sc.textFile("/user/itv002768/movierating_practical/moviedata-201008-180523.data")
// Split on "\t" and take only the third column
val requiredCol = rawMovieRatingInfo.map(x => x.split("\t")(2))
// The result is a Map with the rating as key and its count as value
val final_ = requiredCol.countByValue
```

```
rawMovieRatingInfo = /user/itv002768/movierating_practical/moviedata-201008-180523.data MapPartitionsRDD[143] at textFile at <console>:30
requiredCol = MapPartitionsRDD[144] at map at <console>:33
final_ = Map(4 -> 34174, 5 -> 21201, 1 -> 6110, 2 -> 11370, 3 -> 27145)
```

**In the above example we used countByValue instead of map + reduceByKey.</BR>
The main difference is that map + reduceByKey is a transformation and countByValue is an action.</BR>
map + reduceByKey -> rdd</BR>
countByValue -> local variable</BR>
We don't need to call collect when using countByValue.</BR>
Only use countByValue if it gives the final result, because anything you do with its output afterwards runs on the local machine (the driver), not on the cluster.**
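`collections.Counter` is a close plain-Python analogue of countByValue: it takes values and hands back a local dictionary of counts rather than a distributed dataset. The sample ratings below are the third column of the eight data rows shown earlier in this section; this is an illustration, not Spark code.

```python
from collections import Counter

# rating column from the eight sample rows of the movie data file
ratings = ["3", "3", "1", "2", "1", "4", "2", "5"]

counts = Counter(ratings)   # local result, like the Map returned by countByValue
print(dict(counts))         # {'3': 2, '1': 2, '2': 2, '4': 1, '5': 1}
```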

### Find Average Linkedin Connections

The file below contains information about LinkedIn users.

Columns: row_id, username, age, connections

Problem statement - Find the average number of connections for each age.

```bash
[itv002768@g02 ~]$ hadoop fs -mkdir week9_friendsavg_practical
[itv002768@g02 ~]$ hadoop fs -put friendsdata-201008-180523.csv week9_friendsavg_practical

[itv002768@g02 ~]$ hadoop fs -head week9_friendsavg_practical/friendsdata-201008-180523.csv
0::Will::33::385
1::Jean-Luc::26::2
2::Hugh::55::221
3::Deanna::40::465
4::Quark::68::21
5::Weyoun::59::318
6::Gowron::37::220
7::Will::54::307
8::Jadzia::38::380
```


```scala
// Parse one line and return (age, connections)
def parseLines(info: String) = {
    val splitLine = info.split("::")
    val age = splitLine(2).toInt
    val connections = splitLine(3).toInt
    (age, connections)
}

val rawInfo = sc.textFile("/user/itv002768/week9_friendsavg_practical/friendsdata-201008-180523.csv")
val requiredCol = rawInfo.map(parseLines)
//val requiredInfo = requiredCol.map(x => (x._1, (x._2, 1)))

// mapValues touches only the values, so it can replace the map in the commented line above
val requiredInfo = requiredCol.mapValues(x => (x, 1))
// Per age: (sum of connections, number of users)
val final_ = requiredInfo.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
// Int / Int is integer division, so the averages below are truncated
val final__ = final_.map(x => (x._1, x._2._1/x._2._2))

final__.collect()
```

```
rawInfo = /user/itv002768/week9_friendsavg_practical/friendsdata-201008-180523.csv MapPartitionsRDD[47] at textFile at <console>:41
requiredCol = MapPartitionsRDD[48] at map at <console>:42
requiredInfo = MapPartitionsRDD[49] at mapValues at <console>:46
final_ = ShuffledRDD[50] at reduceByKey at <console>:47
final__ = MapPartitionsRDD[51] at map at <console>:48

parseLines: (info: String)(Int, Int)

Array((34,245), (52,340), (56,306), (66,276), (22,206), (28,209), (54,278), (46,223), (48,281), (30,235), (50,254), (32,207), (36,246), (24,233), (62,220), (64,281), (...
```
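The reason for carrying a (sum, count) pair per age can be sketched in plain Python. The rows below are hypothetical and only follow the same `::` format as the file; the loop plays the role of mapValues + reduceByKey, and the final division happens once per age.

```python
def parse(line):
    # row_id::username::age::connections -> (age, connections)
    _, _, age, connections = line.split("::")
    return int(age), int(connections)


# hypothetical rows in the same format as the friends data file
rows = ["0::A::33::385", "1::B::33::100", "2::C::26::2", "3::D::33::115"]

totals = {}
for age, conn in map(parse, rows):
    running_sum, running_count = totals.get(age, (0, 0))
    totals[age] = (running_sum + conn, running_count + 1)

averages = {age: s / c for age, (s, c) in totals.items()}
print(totals)     # {33: (600, 3), 26: (2, 1)}
print(averages)   # {33: 200.0, 26: 2.0}
```

Keeping the sum and the count separate until the end means the pairs can be combined in any order, which is what reduceByKey needs.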

## Practicals - pyspark
For the practicals below, if you want to run the code blocks, switch the kernel to **Pyspark 3**

### Find Word Count in a file pyspark

```python
# Stop the context created by the kernel before creating our own
sc.stop()
```

```python
from pyspark import SparkContext

sc = SparkContext()

input_ = sc.textFile("/user/itv002768/week9_practical_search_words/search_data.txt")
all_words = input_.flatMap(lambda x: x.split(" "))
small_letters = all_words.map(lambda x: x.lower())
words_value = small_letters.map(lambda x: (x, 1))
words_frequency = words_value.reduceByKey(lambda x, y: x + y)
final = words_frequency.collect()
# top ten words, sorted locally by count in descending order
sorted(final, key=lambda kv: kv[1], reverse=True)[0:10]
```

```
[('data', 361),
 ('big', 285),
 ('in', 171),
 ('training', 114),
 ('course', 105),
 ('hadoop', 100),
 ('online', 58),
 ('courses', 53),
 ('spark', 42),
 ('bangalore', 40)]
```

### Find Top Customers pyspark

```python
raw_customers_info = sc.textFile("/user/itv002768/customerorders_practical/customerorders-201008-180523.csv")
# keep (customer_id, amount) and convert the amount to float
split_cust = raw_customers_info.map(lambda x: (x.split(",")[0], float(x.split(",")[2])))
total_purchase = split_cust.reduceByKey(lambda x, y: x + y)
final_info = total_purchase.collect()
# top ten customers, sorted locally by total amount in descending order
sorted(final_info, key=lambda kv: kv[1], reverse=True)[0:10]
```

```
[('68', 6375.450000000001),
 ('73', 6206.200000000001),
 ('39', 6193.110000000001),
 ('54', 6065.390000000001),
 ('71', 5995.660000000002),
 ('2', 5994.59),
 ('97', 5977.1900000000005),
 ('46', 5963.109999999999),
 ('42', 5696.840000000002),
 ('59', 5642.889999999999)]
```

### Find Movie Ratings pyspark

```python
raw_movie_rating_info = sc.textFile("/user/itv002768/movierating_practical/moviedata-201008-180523.data")
# keep (rating, 1) for every row
required_col = raw_movie_rating_info.map(lambda x: (x.split("\t")[2], 1))
final_ = required_col.reduceByKey(lambda x, y: x + y)

final_.collect()
```

```
[('1', 6110), ('4', 34174), ('5', 21201), ('3', 27145), ('2', 11370)]
```

### Find Average Linkedin Connections pyspark

```python
def parse_lines(info):
    # row_id::username::age::connections -> (age, connections)
    split_line = info.split("::")
    age = int(split_line[2])
    connections = int(split_line[3])
    return (age, connections)

raw_info = sc.textFile("/user/itv002768/week9_friendsavg_practical/friendsdata-201008-180523.csv")
required_col = raw_info.map(parse_lines)
required_info = required_col.mapValues(lambda x: (x, 1))
# per age: (sum of connections, number of users)
final_ = required_info.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
final__ = final_.map(lambda x: (x[0], x[1][0]/x[1][1]))

final__.collect()
```

[(60, 202.71428571428572),
 (32, 207.9090909090909),
 (68, 269.6),
 (38, 193.53333333333333),
 (40, 250.8235294117647),
 (66, 276.44444444444446),
 (26, 242.05882352941177),
 (44, 282.1666666666667),
 (64, 281.3333333333333),
 (54, 278.0769230769231),
 (46, 223.69230769230768),
 (30, 235.8181818181818),
 (56, 306.6666666666667),
 (62, 220.76923076923077),
 (28, 209.1),
 (36, 246.6),
 (58, 116.54545454545455),
 (20, 165.0),
 (18, 343.375),
 (52, 340.6363636363636),
 (48, 281.4),
 (22, 206.42857142857142),
 (50, 254.6),
 (24, 233.8),
 (42, 303.5),
 (34, 245.5),
 (69, 235.2),
 (67, 214.625),
 (51, 302.14285714285717),
 (29, 215.91666666666666),
 (27, 228.125),
 (49, 184.66666666666666),
 (55, 295.53846153846155),
 (25, 197.45454545454547),
 (47, 233.22222222222223),
 (23, 246.3),
 (21, 350.875),
 (65, 298.2),
 (43, 230.57142857142858),
 (63, 384.0),
 (45, 309.53846153846155),
 (57, 258.8333333333333),
 (37, 249.33333333333334),
 (19, 213.27272727272728),
 (41, 268.55555555555554),
 (61, 2