In [None]:
# 102 Spark basics

The goal of this lab is to get familiar with Spark programming.

- [Spark programming guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
- [RDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html)
- [PairRDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/PairRDDFunctions.html)

In [1]:
import org.apache.spark

Intitializing Scala interpreter ...

Spark Web UI available at http://LAPTOP-T2P39KLE:4040
SparkContext available as 'sc' (version = 3.5.1, master = local[*], app id = local-1729587000353)
SparkSession available as 'spark'


import org.apache.spark


In [None]:
// DO NOT EXECUTE - this is needed just to avoid showing errors in the following cells
val sc = spark.SparkContext.getOrCreate()

In [None]:
## 102-1 Spark warm-up

Load the ```capra``` and ```divinacommedia``` datasets and try the following operations:
- Show their content (```collect```)
- Count their rows (```count```)
- Split phrases into words (```map``` or ```flatMap```; what’s the difference?)
- Check the results (remember: evaluation is lazy)
- Try the ```toDebugString``` function to check the execution plan
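
As a quick illustration of the `map` versus `flatMap` question above, here is a minimal sketch on plain Scala collections. Using a local `List` instead of an RDD is an assumption made only to keep the example self-contained; the two RDD methods follow the same shape rules.

```scala
// A local List stands in for an RDD here (illustration only).
val lines = List("sopra la panca la capra campa", "sotto la panca la capra crepa")

// map: exactly one output element per input element, so one Array per line.
val nested: List[Array[String]] = lines.map(_.split(" "))

// flatMap: each input element yields zero or more outputs, concatenated into one flat list.
val flat: List[String] = lines.flatMap(_.split(" "))

println(nested.length) // number of lines
println(flat.length)   // number of words
```

The nested result keeps the line boundaries, while the flat one loses them; word-level jobs such as word count usually want the flat form.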

In [3]:
val rddCapra = sc.textFile("../../../../datasets/capra.txt")
val rddDC = sc.textFile("../../../../datasets/divinacommedia.txt")
//COLLECT
val capraData = rddCapra.collect()
val DCData = rddDC.collect()
//COUNT
val capraCount = rddCapra.count()
val DCCount = rddDC.count()
print(capraCount + "\n")
print(DCCount + "\n")

2
14753


rddCapra: org.apache.spark.rdd.RDD[String] = ../../../../datasets/capra.txt MapPartitionsRDD[5] at textFile at <console>:29
rddDC: org.apache.spark.rdd.RDD[String] = ../../../../datasets/divinacommedia.txt MapPartitionsRDD[7] at textFile at <console>:30
capraData: Array[String] = Array(sopra la panca la capra campa, sotto la panca la capra crepa)
DCData: Array[String] = Array(LA DIVINA COMMEDIA, di Dante Alighieri, INFERNO, "", "", "", Inferno: Canto I, "", "  Nel mezzo del cammin di nostra vita", mi ritrovai per una selva oscura, ché la diritta via era smarrita., "  Ahi quanto a dir qual era è cosa dura", esta selva selvaggia e aspra e forte, che nel pensier rinova la paura!, "  Tant'è amara che poco è più morte;", ma per trattar del ben ch'i' vi trovai,, dirò de l'altre cose ch'i' ...


In [27]:
//MAP AND FLATMAP
val splitCapra = rddCapra.map(x => x.split(" ")).collect()
val splitDC = rddDC.map(x => x.split(" ")).collect()

splitCapra: Array[Array[String]] = Array(Array(sopra, la, panca, la, capra, campa), Array(sotto, la, panca, la, capra, crepa))
splitDC: Array[Array[String]] = Array(Array(LA, DIVINA, COMMEDIA), Array(di, Dante, Alighieri), Array(INFERNO), Array(""), Array(""), Array(""), Array(Inferno:, Canto, I), Array(""), Array("", "", Nel, mezzo, del, cammin, di, nostra, vita), Array(mi, ritrovai, per, una, selva, oscura), Array(ché, la, diritta, via, era, smarrita.), Array("", "", Ahi, quanto, a, dir, qual, era, è, cosa, dura), Array(esta, selva, selvaggia, e, aspra, e, forte), Array(che, nel, pensier, rinova, la, paura!), Array("", "", Tant'è, amara, che, poco, è, più, morte;), Array(ma, per, trattar, del, ben, ch'i', vi, trovai,), Array(dirò, de, l'altre, cose, ch'i', v'ho, scorte.), Array("", "...


In [28]:
val splitFlatCapra = rddCapra.flatMap(x => x.split(" ")).collect()
val splitFlatDC = rddDC.flatMap(x => x.split(" ")).collect()

splitFlatCapra: Array[String] = Array(sopra, la, panca, la, capra, campa, sotto, la, panca, la, capra, crepa)
splitFlatDC: Array[String] = Array(LA, DIVINA, COMMEDIA, di, Dante, Alighieri, INFERNO, "", "", "", Inferno:, Canto, I, "", "", "", Nel, mezzo, del, cammin, di, nostra, vita, mi, ritrovai, per, una, selva, oscura, ché, la, diritta, via, era, smarrita., "", "", Ahi, quanto, a, dir, qual, era, è, cosa, dura, esta, selva, selvaggia, e, aspra, e, forte, che, nel, pensier, rinova, la, paura!, "", "", Tant'è, amara, che, poco, è, più, morte;, ma, per, trattar, del, ben, ch'i', vi, trovai,, dirò, de, l'altre, cose, ch'i', v'ho, scorte., "", "", Io, non, so, ben, ridir, com'i', v'intrai,, tant'era, pien, di, sonno, a, quel, punto, che, la, verace, via, abbandonai., "", "", Ma, poi, ch'...


In [29]:
val debugFlatDC = rddDC.flatMap(x => x.split(" ")).toDebugString

debugFlatDC: String =
(2) MapPartitionsRDD[94] at flatMap at <console>:25 []
 |  ../../../../datasets/divinacommedia.txt MapPartitionsRDD[83] at textFile at <console>:30 []
 |  ../../../../datasets/divinacommedia.txt HadoopRDD[82] at textFile at <console>:30 []
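
The debug string above describes a plan, not a result: nothing is read or split until an action runs. A minimal sketch of how that laziness could be observed, assuming `sc` is the SparkContext provided by the notebook kernel; the accumulator name `linesSeen` and the two inline lines are illustrative, not part of the lab.

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val linesSeen = sc.longAccumulator("linesSeen") // hypothetical name, for illustration only

val words = sc.parallelize(Seq("sopra la panca", "sotto la panca"))
  .flatMap { line =>
    linesSeen.add(1) // side effect that runs only when a task actually processes the line
    line.split(" ")
  }

val before = linesSeen.value // read before any action has been called
val total = words.count()    // action: triggers the job
val after = linesSeen.value  // read after the job has finished

println(s"before=$before, words=$total, after=$after")
```

If evaluation is lazy, `before` should still be zero even though `flatMap` has already been called, and `after` should equal the number of input lines.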


In [None]:
## 102-2 Basic Spark jobs

Implement the following jobs in Spark and test them on both the capra and divinacommedia datasets.

- **Word count**: count the number of occurrences of each word
  - Result: (sopra, 1), (la, 4), …
- **Word length count**: count the number of occurrences of words of given lengths
  - Result: (2, 4), (5, 8)
- Compute the average length of words given their first letter (e.g., words that begin with "s" have an average length of 5)
  - Result: (s, 5), (l, 2), …
- Return the inverted index of words (i.e., for each word, list the numbers of the lines in which it appears)
  - Result: (sopra, (0)), (la, (0, 1)), ...

Also, check how sorting works and try to sort key-value RDDs by descending values.
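
One possible way to explore sorting, as a sketch rather than the reference solution: `sortByKey` orders a key-value RDD by key, while `sortBy` takes a function that selects what to sort on, so it can order by value. The example assumes `sc` is the notebook's SparkContext and uses the two capra lines inline instead of the file.

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val counts = sc.parallelize(Seq("sopra la panca la capra campa", "sotto la panca la capra crepa"))
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// sortByKey orders by key; ascending = false reverses the order.
val byWordDesc = counts.sortByKey(ascending = false).collect()

// sortBy takes a key-extraction function, so it can order by the count instead.
val byCountDesc = counts.sortBy(_._2, ascending = false).collect()

// Same idea with sortByKey only: swap so the count becomes the key, sort, swap back.
val bySwapDesc = counts.map(_.swap).sortByKey(ascending = false).map(_.swap).collect()

byCountDesc.foreach(println)
```

The relative order of words with the same count is not guaranteed by either approach.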

In [34]:
//WORD COUNT OF EACH WORD
val getCapraOccurrences = rddCapra.flatMap(_.split(" ")).
          map(word => (word, 1)).
          reduceByKey((x, y) => x+y).sortByKey().collect
val getDCOccurrences = rddDC.flatMap(_.split(" ")).
        map(word => (word, 1)).
        reduceByKey((x, y) => x+y).collect

getCapraOccurrences: Array[(String, Int)] = Array((campa,1), (capra,2), (crepa,1), (la,4), (panca,2), (sopra,1), (sotto,1))
getDCOccurrences: Array[(String, Int)] = Array((grand'avello,,1), (diseta,1), (vane.,1), (tonda,3), (blandimenti;,1), (sapore,1), (dando,3), (Verrucchio,,1), (Mantua,1), (m'apparvero,1), (disiderate,1), (dole,1), (moventi,1), (rincalzi,1), (freni,,1), (Voglia,1), (focina,1), (tormento,5), (s�:,2), (marino,,1), (scalz�,1), (pensassi,1), (esser,,2), (rade,2), (prava".,1), (Forese,,1), (forti,4), (rossi:,1), (richiuso".,1), ("Segnor,1), (rota.,1), ("ver',1), (pronti,1), (tr'ambo,2), (ch'ode,1), (chiari,,1), (lontana?".,1), (rinovelle.,1), (perdonasse,1), (Pluto,,1), (falsai,2), (nova,,3), (sparito,,1), (stampa,,1), (doglia,7), (regina,3), (pianto;,2), (Alto,2), (giov...


In [43]:
//WORD LENGTH COUNT: NUMBER OF WORDS FOR EACH LENGTH
val getSameLengthCapraOccurrences = rddCapra.flatMap(_.split(" ")).
          map(word => (word.length, 1)).
          reduceByKey(_+_).sortByKey().collect
val getSameLengthDCOccurrences = rddDC.flatMap(_.split(" ")).
        map(word => (word.length, 1)).
        reduceByKey(_+_).sortByKey().collect

getSameLengthCapraOccurrences: Array[(Int, Int)] = Array((2,4), (5,8))
getSameLengthDCOccurrences: Array[(Int, Int)] = Array((0,10684), (1,6992), (2,19258), (3,16887), (4,9111), (5,13504), (6,11775), (7,7379), (8,5363), (9,3231), (10,1741), (11,933), (12,370), (13,154), (14,50), (15,18), (16,3), (17,1))
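
The (0,10684) entry above counts empty strings: splitting on a single space yields an empty token for every leading or repeated space, and an empty line yields one empty token. A minimal sketch of a stricter tokenizer on plain Scala strings; `tokenize` is a hypothetical helper name, not part of Spark.

```scala
// A line with two leading spaces, as seen in the divinacommedia output.
val line = "  Nel mezzo del cammin di nostra vita"

// Splitting on a single space keeps one empty token per extra space.
val naive = line.split(" ")

// Hypothetical helper: split on runs of whitespace, then drop empty tokens.
def tokenize(s: String): Array[String] = s.split("\\s+").filter(_.nonEmpty)

println(naive.count(_.isEmpty)) // empty tokens produced by the naive split
println(tokenize(line).length)  // real words only
println(tokenize("").length)    // an empty line contributes no words
```

A function like this could replace `_.split(" ")` inside `flatMap` in the jobs above, which would remove the length-0 bucket from the result.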


In [5]:
// AVERAGE WORD LENGTH BY FIRST LETTER
val getAverageLengthInCapra = rddCapra.flatMap(_.split(" ")).
          filter( _.length > 0 ).
          map( x => (x.substring(0,1).toLowerCase, (1.0,x.length.toDouble))).
          reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).
          mapValues(v => v._2/v._1).
          collect()

getAverageLengthInCapra: Array[(String, Double)] = Array((p,5.0), (l,2.0), (s,5.0), (c,5.0))


In [9]:
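//INVERTED INDEX OF WORDS
// First attempt: zipWithIndex runs after flatMap, so each index is the word's
// position in the whole text, not the number of the line that contains it.
val invertedIndexInCapra = rddCapra.flatMap(_.split(" ")).zipWithIndex().groupByKey().collect
// Second attempt: number the lines first, then split, so each word keeps its line number.
// Steps: (line, n) => (n, line) => (n, word) => (word, n); distinct() drops
// repeated (word, n) pairs, so a word occurring twice in a line is listed once.
val invertedIndexInCapra2 = rddCapra.zipWithIndex().
        map({case (k,v)=>(v,k)}).
        flatMapValues(_.split(" ")).
        map({case (k,v)=>(v,k)}).
        distinct().
        groupByKey().
        collect()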

invertedIndexInCapra: Array[(String, Iterable[Long])] = Array((campa,CompactBuffer(5)), (la,CompactBuffer(1, 3, 7, 9)), (panca,CompactBuffer(2, 8)), (sotto,CompactBuffer(6)), (crepa,CompactBuffer(11)), (sopra,CompactBuffer(0)), (capra,CompactBuffer(4, 10)))
invertedIndexInCapra2: Array[(String, Iterable[Long])] = Array((campa,CompactBuffer(0)), (la,CompactBuffer(0, 1)), (panca,CompactBuffer(1, 0)), (sotto,CompactBuffer(1)), (crepa,CompactBuffer(1)), (sopra,CompactBuffer(0)), (capra,CompactBuffer(0, 1)))
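
The line numbers inside each buffer above come back in no particular order (for example, panca lists 1 before 0). A sketch of one way to make the output deterministic, assuming `sc` is the notebook's SparkContext; the two capra lines are inlined and the name `invertedIndexSorted` is illustrative.

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val lines = sc.parallelize(Seq("sopra la panca la capra campa", "sotto la panca la capra crepa"))

val invertedIndexSorted = lines.zipWithIndex() // (line, lineNumber)
  .flatMap { case (line, n) => line.split(" ").map(word => (word, n)) }
  .distinct()                 // a word repeated within a line is listed once
  .groupByKey()
  .mapValues(_.toList.sorted) // line numbers in ascending order
  .sortByKey()                // words in alphabetical order
  .collect()

invertedIndexSorted.foreach(println)
```

Sorting inside `mapValues` is cheap here because each list is small; on a large corpus with very frequent words, the grouped values themselves can become the bottleneck.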
