# Prerequisites

Installing Spark and its supporting libraries in the VM


---



In [2]:
# install Java 8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download Spark 3.0.1
!wget -q https://apache.osuosl.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz

# extract the archive
!tar xf spark-3.0.1-bin-hadoop3.2.tgz
# install findspark and py4j so Python can locate and talk to Spark
!pip install -q findspark
!pip install py4j

# For maps
!pip install folium
!pip install plotly



Define the environment (Java & Spark homes)

---

In [3]:
!ls /content

sample_data		   spark-3.0.1-bin-hadoop3.2.tgz
spark-3.0.1-bin-hadoop3.2  spark-3.0.1-bin-hadoop3.2.tgz.1


In [5]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop3.2"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"

Start the Spark session and print the version


---


In [6]:
import findspark
findspark.init("/content/spark-3.0.1-bin-hadoop3.2")  # same path as SPARK_HOME

from pyspark.sql import SparkSession

# create the session
spark = SparkSession \
        .builder \
        .master("local[*]") \
        .config("spark.ui.port", "4500") \
        .getOrCreate()

spark.version

'3.0.1'

In [7]:
spark

In [8]:
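# Enable Arrow to speed up conversion to and from Pandas.
# Spark 3.0 renamed this setting; the older "spark.sql.execution.arrow.enabled" key is deprecated.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")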

Create an ngrok tunnel to expose the Spark UI (optional)


In [10]:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('./ngrok http 4500 &')
!curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

--2020-11-21 10:57:44--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 34.233.2.239, 52.73.16.193, 52.206.15.164, ...
Connecting to bin.equinox.io (bin.equinox.io)|34.233.2.239|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13773305 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip.1’


2020-11-21 10:57:44 (73.5 MB/s) - ‘ngrok-stable-linux-amd64.zip.1’ saved [13773305/13773305]

Archive:  ngrok-stable-linux-amd64.zip
replace ngrok? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ngrok                   
http://bb002e36da73.ngrok.io


# Download Datasets

In [11]:
!mkdir -p /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/frankenstein.txt -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/el_quijote.txt -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/characters.csv -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/planets.csv -P /dataset
!ls /dataset

characters.csv	el_quijote.txt	frankenstein.txt  planets.csv


# RDD

---



## Example 1

In [12]:
textFile = spark.sparkContext.textFile("/dataset/frankenstein.txt")
textFile.first()

'FRANKENSTEIN'

Open the Spark UI and investigate what happened.


---




## Creating parallelized collections
A very quick way to create an RDD from the shell while learning is to build a parallelized collection from an ordinary Python list:

## Example 2

In [13]:
distData = spark.sparkContext.parallelize([25, 20, 15, 10, 5])
distData.reduce(lambda x ,y: x + y)

75

What is the type of the variable distData?


*Spark RDD (Resilient Distributed Dataset)*
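
As a rough mental model of what `parallelize` followed by `reduce` does, the plain-Python sketch below splits the list into chunks, reduces each chunk, and then combines the partial results. The helper `split_into_partitions` and the choice of 2 partitions are assumptions made for illustration; they are not Spark's actual partitioning logic.

```python
from functools import reduce

def split_into_partitions(data, num_partitions):
    """Hypothetical helper: cut a list into roughly equal contiguous chunks."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def add(x, y):
    return x + y

data = [25, 20, 15, 10, 5]
partitions = split_into_partitions(data, 2)

# Reduce each partition on its own, as each worker would do with its slice.
partials = [reduce(add, part) for part in partitions]

# Combine the per-partition results into the final value.
total = reduce(add, partials)
print(partitions, partials, total)
```

Because partial results are merged in this way, the function passed to `reduce` should be associative and commutative; addition satisfies both, which is why the result matches the 75 shown above.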

## Exercise 1
Count the number of lines in the file "el_quijote.txt"

---



In [34]:
# It is good practice to put each chained call on its own line
# A trailing \ tells Python that the statement continues on the next line
quijote = spark \
    .sparkContext \
    .textFile("/dataset/el_quijote.txt")

quijote

/dataset/el_quijote.txt MapPartitionsRDD[40] at textFile at NativeMethodAccessorImpl.java:0

In [27]:
quijote.count()

2186

## Exercise 2
Print the first line of the file "el_quijote.txt"

---



In [28]:
quijote.first()

'DON QUIJOTE DE LA MANCHA'

## Transformations and Actions on RDDs

### Actions

### Example 3

In [35]:
print(textFile.count()) # Number of elements in the RDD
print(textFile.first()) # First element of the RDD

7237
FRANKENSTEIN


### Transformations

### Example 4

In [38]:
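# ReduceByKey: count how many times each distinct line appears
lines = spark.sparkContext.textFile("/dataset/frankenstein.txt")
pairs = lines.map(lambda s: (s, 1))                      # (line, 1) for every line
counts = pairs.reduceByKey(lambda a, b: a + b).cache()   # sum the 1s per distinct line
counts.count()     # first action: computes the RDD and fills the cache
counts.collect()   # show the RDD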

[('FRANKENSTEIN', 1),
 ('', 811),
 ('Letter 1', 1),
 ('commencement of an enterprise which you have regarded with such evil', 1),
 ('forebodings.  I arrived here yesterday, and my first task is to assure', 1),
 ('my dear sister of my welfare and increasing confidence in the success', 1),
 ('of my undertaking.', 1),
 ('feeling?  This breeze, which has travelled from the regions towards', 1),
 ('frost and desolation; it ever presents itself to my imagination as the', 1),
 ('region of beauty and delight.  There, Margaret, the sun is forever', 1),
 ('perpetual splendour.  There—for with your leave, my sister, I will put', 1),
 ('some trust in preceding navigators—there snow and frost are banished;', 1),
 ('solitudes.  What may not be expected in a country of eternal light?  I', 1),
 ('shall satiate my ardent curiosity with the sight of a part of the world',
  1),
 ('never before visited, and may tread a land never before imprinted by', 1),
 ('the foot of man. These are my enticements, and 
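
To make the `reduceByKey` step above more concrete, here is a minimal plain-Python sketch of the idea: values that share a key are merged pairwise with the supplied function. The function `reduce_by_key` and the sample lines are invented for illustration; this is not how Spark implements the operation internally.

```python
def reduce_by_key(pairs, func):
    """Plain-Python stand-in for the idea behind reduceByKey (not Spark's implementation)."""
    merged = {}
    for key, value in pairs:
        # First time we see a key, keep its value; afterwards, merge with func.
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

lines = ["FRANKENSTEIN", "", "Letter 1", "", ""]  # made-up sample, not the real file
pairs = [(line, 1) for line in lines]
counts_local = reduce_by_key(pairs, lambda a, b: a + b)
print(counts_local)
```

This also suggests why the real output contains an entry such as `('', 811)`: every blank line maps to the same empty-string key, so their 1s are summed together.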

In [40]:
# SortByKey
# Named sorted_counts to avoid shadowing Python's built-in sorted()
sorted_counts = counts.sortByKey()
sorted_counts.collect()

[('', 811),
 ('                                                "Elizabeth Lavenza', 1),
 ('                                                August 26th, 17—', 1),
 ('                              "Alphonse Frankenstein.', 1),
 ('                    [Wordsworth\'s "Tintern Abbey".]', 1),
 ('               "Your affectionate and afflicted father,', 1),
 ('               Walton, in continuation.', 1),
 ('     An appetite; a feeling, and a love,', 1),
 ('     By thought supplied, or any interest', 1),
 ('     Haunted him like a passion:  the tall rock,', 1),
 ('     That had no need of a remoter charm,', 1),
 ('     The mountain, and the deep and gloomy wood,', 1),
 ('     Their colours and their forms, were then to him', 1),
 ("     Unborrow'd from the eye.", 1),
 ('     ——The sounding cataract', 1),
 ('    "Geneva, May 18th, 17—"', 1),
 ('   Embrace fond woe, or cast our cares away;', 1),
 ('   Nought may endure but mutability!', 1),
 ('   The path of its departure still is free.', 1),
 (
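
The order of this output can look surprising at first. A small plain-Python sketch, assuming the keys are compared the way Python compares strings (character by character, by code point), shows why the empty string and keys with leading spaces come before ordinary text. The sample keys are shortened variants chosen for illustration.

```python
keys = ["Letter 1", "", "     An appetite", "my dear sister", '"Geneva"']

# The empty string sorts first, then the space character, then the double quote,
# then uppercase letters, then lowercase letters.
ordered = sorted(keys)
print(ordered)
```

If an alphabetical, case-insensitive order were wanted instead, the keys would need to be normalized (for example stripped and lower-cased) before sorting.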

Open the Spark UI and investigate what has been happening.

### Example 5

In [41]:
# Filter: keep only the lines that contain the substring "the"

linesWithThe = textFile.filter(lambda line: "the" in line)
linesWithThe.count()

3712
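
Note that `"the" in line` is a case-sensitive substring test, so it also matches lines where "the" only appears inside a longer word. The sketch below contrasts it with a whole-word match on a few invented sample lines; the regular expression is one possible alternative, not something the notebook itself uses.

```python
import re

sample = ["the monster", "my father", "another day", "There, Margaret"]

# Substring test, as in the filter above: matches "father" and "another" too.
substring_hits = [s for s in sample if "the" in s]

# Whole-word test: \b marks a word boundary on each side of "the".
word_hits = [s for s in sample if re.search(r"\bthe\b", s)]

print(substring_hits)
print(word_hits)
```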

### Exercise 3
Get the number of occurrences of each word in the file "frankenstein.txt"

In [46]:
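# ReduceByKey
# Note: this solution runs on the quijote RDD (el_quijote.txt) loaded earlier;
# use textFile instead to count the words of frankenstein.txt.
lines = quijote
words = lines.flatMap(lambda line: line.split(' '))     # one element per word
pairs = words.map(lambda s: (s, 1))                     # (word, 1)
counts = pairs.reduceByKey(lambda a, b: a + b).cache()  # occurrences per word
counts.collect() # show the RDD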

[('DE', 17),
 ('LA', 13),
 ('Miguel', 3),
 ('', 1),
 ('PRIMERA', 1),
 ('CAPÍTULO', 1),
 ('1:', 1),
 ('condición', 23),
 ('y', 8042),
 ('del', 1113),
 ('hidalgo', 16),
 ('Mancha', 11),
 ('En', 143),
 ('cuyo', 26),
 ('nombre', 67),
 ('no', 2786),
 ('acordarme,', 1),
 ('mucho', 141),
 ('que', 10351),
 ('vivía', 10),
 ('los', 2122),
 ('en', 3883),
 ('rocín', 5),
 ('corredor.', 1),
 ('Una', 2),
 ('vaca', 1),
 ('carnero,', 1),
 ('noches,', 3),
 ('lentejas', 1),
 ('viernes,', 3),
 ('palomino', 1),
 ('añadidura', 4),
 ('domingos,', 1),
 ('consumían', 1),
 ('tres', 99),
 ('su', 1856),
 ('hacienda.', 3),
 ('resto', 3),
 ('concluían', 1),
 ('sayo', 3),
 ('velarte,', 1),
 ('calzas', 2),
 ('velludo', 1),
 ('para', 693),
 ('fiestas', 4),
 ('con', 2024),
 ('sus', 487),
 ('lo', 1778),
 ('mismo,', 15),
 ('días', 90),
 ('honraba', 3),
 ('vellori', 1),
 ('fino.', 1),
 ('Tenía', 5),
 ('pasaba', 23),
 ('sobrina', 9),
 ('mozo', 36),
 ('campo', 20),
 ('plaza,', 4),
 ('ensillaba', 1),
 ('el', 3726),
 ('como',
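
The output above includes an entry for the empty string. That is a side effect of splitting on a single space, as this small plain-Python sketch shows. The sample line with a double space is invented for illustration.

```python
line = "DON  QUIJOTE DE LA MANCHA"  # note the double space (made-up variant)

by_single_space = line.split(' ')  # consecutive spaces produce an empty token
by_whitespace = line.split()       # runs of whitespace count as one separator

empty_single = "".split(' ')       # an empty line still yields one empty token
empty_whitespace = "".split()      # an empty line yields no tokens at all

print(by_single_space)
print(by_whitespace)
print(empty_single, empty_whitespace)
```

Using `line.split()` in the `flatMap` would therefore drop the empty tokens, at the cost of changing the counts shown above. Punctuation attached to words (such as `'mismo,'`) is a separate issue that neither variant addresses.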

### Exercise 4
Get the top 10 words with more than 4 characters

---



In [52]:
# Swap key and value: (word, count) -> (count, word)
changed = counts.map(lambda s: (s[1], s[0]))
# Sort by count, highest first
sorted_by_count = changed.sortByKey(ascending=False)
sorted_by_count.collect()

[(10351, 'que'),
 (8947, 'de'),
 (8042, 'y'),
 (4941, 'la'),
 (4725, 'a'),
 (3883, 'en'),
 (3726, 'el'),
 (2786, 'no'),
 (2382, 'se'),
 (2122, 'los'),
 (2024, 'con'),
 (1881, 'por'),
 (1856, 'su'),
 (1801, 'le'),
 (1778, 'lo'),
 (1473, 'las'),
 (1154, 'me'),
 (1130, 'como'),
 (1113, 'del'),
 (960, 'es'),
 (899, 'si'),
 (894, 'un'),
 (882, 'más'),
 (851, 'mi'),
 (814, 'yo'),
 (800, 'al'),
 (750, 'tan'),
 (713, 'don'),
 (693, 'para'),
 (689, 'porque'),
 (653, 'había'),
 (627, 'él'),
 (617, 'ni'),
 (616, 'sin'),
 (594, 'una'),
 (512, 'o'),
 (509, 'todo'),
 (487, 'sus'),
 (466, 'ser'),
 (460, 'ha'),
 (452, 'era'),
 (451, 'bien'),
 (445, 'vuestra'),
 (407, 'Y'),
 (376, 'ya'),
 (372, 'todos'),
 (354, 'cuando'),
 (348, 'dijo'),
 (345, 'Don'),
 (343, 'fue'),
 (342, 'donde'),
 (340, 'te'),
 (326, 'este'),
 (326, 'cual'),
 (321, 'así'),
 (313, 'sino'),
 (312, 'esto'),
 (312, 'Sancho'),
 (310, 'Quijote'),
 (310, 'que,'),
 (305, 'quien'),
 (300, 'muy'),
 (294, 'pero'),
 (293, 'aquel'),
 (292, 'est
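
The listing above sorts every word by frequency but does not yet apply the length condition or the limit of 10 that the exercise asks for. The plain-Python sketch below shows one way to express those two missing steps on a handful of word/count values taken from the output above. The helper `top_long_words` is hypothetical, and the RDD one-liner in the trailing comment is an untested suggestion rather than the notebook's own solution.

```python
import heapq

def top_long_words(word_counts, n=10, min_len=5):
    """Return the n most frequent (word, count) pairs whose word has at least min_len characters."""
    long_words = [(w, c) for w, c in word_counts if len(w) >= min_len]
    return heapq.nlargest(n, long_words, key=lambda wc: wc[1])

# A few word/count values taken from the output above, in (word, count) order.
sample = [("que", 10351), ("porque", 689), ("había", 653), ("todo", 509),
          ("vuestra", 445), ("cuando", 354), ("donde", 342), ("Sancho", 312)]

top3 = top_long_words(sample, n=3)
print(top3)

# On the RDD itself, the same idea could be written (untested here) as:
#   counts.filter(lambda wc: len(wc[0]) > 4).takeOrdered(10, key=lambda wc: -wc[1])
```

Filtering before ranking keeps short but very frequent words such as "que" out of the result, and taking only the top entries avoids collecting the whole sorted RDD on the driver.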

## Key/Value Pair RDD

---



### Example 6


---



In [None]:
charac_sw = spark.sparkContext.textFile("/dataset/characters.csv")
planets_sw = spark.sparkContext.textFile("/dataset/planets.csv")
charac_sw.take(10)

In [None]:
planets_sw.take(10)

In [None]:
from itertools import islice

charac_sw_noheader = charac_sw.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it)

planets_sw_noheader = planets_sw.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it)
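
To see why the lambda above checks `idx == 0`, the sketch below simulates two partitions in plain Python and applies the same logic to each one. The partition contents are invented placeholders, and the sketch assumes, as the code above does, that the header is the first element of partition 0.

```python
from itertools import islice

def drop_header(idx, it):
    # Same logic as the lambda above: skip one element, but only in partition 0.
    return islice(it, 1, None) if idx == 0 else it

# Hypothetical partitions of a small CSV; the rows are invented for illustration.
partitions = [["header", "row1", "row2"], ["row3", "row4"]]

result = [row
          for idx, part in enumerate(partitions)
          for row in drop_header(idx, iter(part))]
print(result)
```

Without the index check, the first row of every partition would be dropped, silently losing data rows such as `row3`.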

### Exercise 5
Get a listing with the population of the planet that each Star Wars character belongs to
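
One possible approach is to turn both datasets into key/value pairs keyed by planet name and then join them. The plain-Python sketch below only illustrates the shape of such a join, `(key, (left_value, right_value))`, which is what a pair-RDD `join` returns. The function `inner_join`, the rows, and the assumed column layout are all invented; the real files may use different columns and separators.

```python
def inner_join(left, right):
    """Plain-Python stand-in for the result shape of a pair-RDD join."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    # Keep only the keys present on both sides, pairing up their values.
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]

# Invented rows: (planet, character) and (planet, population).
characters = [("PlanetA", "Character1"), ("PlanetB", "Character2"), ("PlanetZ", "Character3")]
planets = [("PlanetA", "200000"), ("PlanetB", "4500000")]

joined = inner_join(characters, planets)
for planet, (name, population) in joined:
    print(name, planet, population)
```

In this sketch `Character3` disappears because `PlanetZ` has no match; with the real data, a left outer join might be preferable if characters with an unknown home planet should still be listed.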


---
