<a href="https://colab.research.google.com/github/tomasborrella/TheValley/blob/main/notebooks/spark01/Ejemplo_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark Example

Notebook by [Tomás Borrella Martín](https://www.linkedin.com/in/tomasborrella/).

### Useful links
*   [Presentation slides](https://docs.google.com/presentation/d/10HZGQnFNzRO63I9XRt-uQa6K9K2yAM71Wu-SYB0TL7c/edit?usp=sharing)

# 1. Data
We download a file that contains one song per line (simplified for this example).

NOTE: In a notebook, the "!" prefix runs a system command from inside the notebook.
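
As a rough plain-Python analogue of the "!" prefix, the sketch below runs a shell command with the standard `subprocess` module and captures its output. The `echo` command is only an illustration and assumes a Unix-like shell, as on Colab; the notebook's own handling of "!" differs in its details.

```python
import subprocess

# Run a command through the system shell, similar in spirit to `!echo hello`.
# `echo hello` is an illustrative command, not one used elsewhere in this notebook.
completed = subprocess.run(
    "echo hello",
    shell=True,           # hand the command string to the shell
    capture_output=True,  # collect stdout/stderr instead of printing them
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)
print(completed.stdout.strip())
```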

In [None]:
!wget -P /content/data 'https://raw.githubusercontent.com/tomasborrella/TheValley/main/data/spark01/simple_songs_log.txt' 

--2021-06-05 10:09:26--  https://raw.githubusercontent.com/tomasborrella/TheValley/main/data/spark01/simple_songs_log.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32939 (32K) [text/plain]
Saving to: ‘/content/data/simple_songs_log.txt’


2021-06-05 10:09:27 (4.75 MB/s) - ‘/content/data/simple_songs_log.txt’ saved [32939/32939]



# 2. Installing Spark

In [None]:
# Install JAVA
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [None]:
# Install Spark
# Older releases are moved off downloads.apache.org, so fetch 3.1.1 from the Apache archive.
!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
!tar xf spark-3.1.1-bin-hadoop2.7.tgz

In [None]:
# Install findspark
!pip install -q findspark

In [None]:
# Environment variables
import os 
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"
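
Before calling `findspark.init()`, it can help to confirm that both variables point to directories that actually exist. A minimal sketch, assuming a hypothetical helper named `missing_dirs` that is not part of the original notebook:

```python
import os

def missing_dirs(var_names):
    """Return the environment variables that are unset or do not point to a directory."""
    return [
        name for name in var_names
        if not os.path.isdir(os.environ.get(name, ""))
    ]

# An empty list means every variable points to an existing directory.
print(missing_dirs(["JAVA_HOME", "SPARK_HOME"]))
```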

In [None]:
# Find spark
import findspark
findspark.init()

In [None]:
# PySpark 
!pip install pyspark==3.1.1

Collecting pyspark==3.1.1
  Downloading https://files.pythonhosted.org/packages/45/b0/9d6860891ab14a39d4bddf80ba26ce51c2f9dc4805e5c6978ac0472c120a/pyspark-3.1.1.tar.gz (212.3MB)
     |████████████████████████████████| 212.3MB 71kB/s 
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
     |████████████████████████████████| 204kB 19.3MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.1-py2.py3-none-any.whl size=212767604 sha256=4c08c6f1bd76febba30e772bcf000d16e9b429707ffc5cacbbf1b49b12c33768
  Stored in directory: /root/.cache/pip/wheels/0b/90/c0/01de724414ef122bd05f056541fb6a0ecf47c7ca655f8b3c0f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1

# 3. Spark Session

In [None]:
# Imports
from pyspark.sql import SparkSession

In [None]:
# Create Spark Session
spark = (SparkSession
         .builder
         .master("local[*]")
         .appName("Primer ejemplo con Spark")
         .getOrCreate()
)

In [None]:
# Show config
spark.sparkContext.getConf().getAll()

[('spark.driver.port', '33979'),
 ('spark.app.startTime', '1622887977083'),
 ('spark.app.id', 'local-1622887978550'),
 ('spark.rdd.compress', 'True'),
 ('spark.app.name', 'Primer ejemplo con Spark'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.submit.pyFiles', ''),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.sql.warehouse.dir', 'file:/content/spark-warehouse'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.driver.host', 'c2657a14be39')]
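
The configuration is returned as a list of `(key, value)` pairs, which is awkward for looking up a single setting. The sketch below converts such a list into a dictionary. The sample pairs are copied from the output above so that the block runs without a Spark session; in the notebook the same idea would be `dict(spark.sparkContext.getConf().getAll())`.

```python
# A few (key, value) pairs copied from the output above.
conf_pairs = [
    ("spark.app.name", "Primer ejemplo con Spark"),
    ("spark.master", "local[*]"),
    ("spark.submit.deployMode", "client"),
]

# dict() turns the list of pairs into a mapping for direct lookup by key.
conf = dict(conf_pairs)
print(conf["spark.master"])  # local[*]
```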

## First Spark RDD

In [None]:
first_rdd = spark.sparkContext.parallelize([1,2,3,4,5,6,7,8,9,10])

In [None]:
type(first_rdd)

pyspark.rdd.RDD

In [None]:
first_rdd.collect()

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [None]:
first_rdd.getNumPartitions()

2

In [None]:
first_rdd.glom().collect()

[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
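
`glom()` groups the elements of each partition into a list, so the output above shows how the ten numbers were spread over the two partitions. The plain-Python sketch below mimics that contiguous, near-even split. `split_evenly` is a hypothetical helper for illustration only and is not how Spark implements `parallelize` internally.

```python
def split_evenly(items, num_partitions):
    """Split a list into contiguous chunks of near-equal size."""
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]

print(split_evenly(list(range(1, 11)), 2))
# [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```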

## RDD with the song log data
(simplified version)

In [None]:
my_rdd = spark.sparkContext.textFile('/content/data/simple_songs_log.txt')

In [None]:
my_rdd.getNumPartitions()

2

In [None]:
type(my_rdd)

pyspark.rdd.RDD

In [None]:
pairs = my_rdd.map(lambda x: (x, 1))

In [None]:
type(pairs)

pyspark.rdd.PipelinedRDD

In [None]:
pairs.collect()

In [None]:
result = pairs.reduceByKey(lambda x, y: x + y)
print(result.collect())

[('Blinding Lights', 1131), ('The Box', 510), ('Dance Monkey', 828)]


In [None]:
result.glom().collect()

[[('Blinding Lights', 1131), ('The Box', 510)], [('Dance Monkey', 828)]]
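
To make the `map` and `reduceByKey` steps more concrete, the following plain-Python sketch reproduces the same counting logic without Spark. The four sample lines are made up for illustration; the grouping is done here with `sorted` and `itertools.groupby`, which only stands in for the shuffle that Spark performs across partitions.

```python
from functools import reduce
from itertools import groupby

# A tiny made-up sample; the real file has one song title per line.
lines = ["Blinding Lights", "The Box", "Blinding Lights", "Dance Monkey"]

# Map phase: turn every line into a (key, 1) pair.
pairs = [(line, 1) for line in lines]

# Shuffle phase (simulated): bring pairs with the same key together.
grouped = groupby(sorted(pairs), key=lambda pair: pair[0])

# Reduce phase: combine the values of each key two at a time.
counts = {
    key: reduce(lambda x, y: x + y, (value for _, value in group))
    for key, group in grouped
}
print(counts)
# {'Blinding Lights': 2, 'Dance Monkey': 1, 'The Box': 1}
```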

# Proposed exercise

Create an RDD from the complete log (not the simplified version) and count the songs using the RDD's "map" and "reduceByKey" methods.

Steps:
1. Download the complete file from:
https://raw.githubusercontent.com/tomasborrella/TheValley/main/data/spark01/complete_songs_log.txt
2. Create a Spark RDD from the contents of the file. (An active Spark session is required.)
3. Use the RDD's methods to count how many times each song was played.

Hint: You only need to add a small preprocessing step in the map phase, before returning the tuple.
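
The shape of such a preprocessing step can be sketched in plain Python. The layout of the complete log is not described here, so the sketch assumes, purely for illustration, tab-separated lines with the song title in the last field. Both the layout and the helper name `to_pair` are hypothetical and should be adapted after inspecting the real file.

```python
def to_pair(line):
    """Map one raw log line to a (song_title, 1) pair.

    Assumed, hypothetical layout: tab-separated fields with the title last.
    """
    title = line.rstrip("\n").split("\t")[-1].strip()
    return (title, 1)

# Made-up sample line in the assumed layout.
print(to_pair("2021-06-05\tuser42\tDance Monkey"))
# ('Dance Monkey', 1)
```

With an RDD of raw lines, a function like this would take the place of the lambda passed to "map", and the "reduceByKey" step would stay unchanged.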

# Spark Stop

In [None]:
spark.stop()