<a href="https://colab.research.google.com/github/ralsouza/apache_spark_real_time_analytics/blob/master/notebooks/01_pyspark_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Install

In [38]:
# install the dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

In [39]:
# set the environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
 
# make pyspark importable
import findspark
findspark.init('spark-2.4.4-bin-hadoop2.7')

In [None]:
# start a local session and import the Airbnb data
# from pyspark.sql import SparkSession
# sc = SparkSession.builder.master('local[*]').getOrCreate()

# download from HTTP to a local file
# !wget --quiet --show-progress http://data.insideairbnb.com/brazil/rj/rio-de-janeiro/2019-07-15/visualisations/listings.csv

# load the Airbnb data
# df_spark = sc.read.csv("./listings.csv", inferSchema=True, header=True)

# show the data type of each column
# df_spark.printSchema()

# 1. PySpark Introduction

In [46]:
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

# configure Spark: serve the Spark UI on port 4050
conf = SparkConf().set("spark.ui.port", "4050")

# create the context, then get a session on top of it
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()
# sc = SparkContext("local[*]", "My First App")

In [45]:
# Stopping the context
# sc.stop()

In [47]:
import sys
print(sys.version)

3.6.9 (default, Apr 18 2020, 01:56:04) 
[GCC 8.4.0]


In [48]:
# Print session context (Spark Context)
print(sc)

<SparkContext master=local[*] appName=pyspark-shell>


In [49]:
# Spark version of the context
print(sc.version)

2.4.4


In [50]:
# Testing Spark and creating an RDD
# Spark cannot distribute a plain Python list across the cluster; the list
# must first be converted to an RDD
lst = [25,90,81,37,776,3320]
test_data = sc.parallelize(lst,10)

In [None]:
# What does sc.parallelize do?
?sc.parallelize

# Signature: sc.parallelize(c, numSlices=None)
# Docstring:
# Distribute a local Python collection to form an RDD (Resilient Distributed
# Dataset). Using xrange is recommended if the input represents a range,
# for performance.

# >>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
# [[0], [2], [3], [4], [6]]
# >>> sc.parallelize(xrange(0, 6, 2), 5).glom().collect()
# [[], [0], [], [2], [4]]
# File:      /content/spark-2.4.4-bin-hadoop2.7/python/pyspark/context.py
# Type:      method

In [None]:
# Check data type
type(test_data)

pyspark.rdd.RDD

In [51]:
# Counting data
test_data.count()

6

In [52]:
# List values
test_data.collect()

[25, 90, 81, 37, 776, 3320]

# 2. Executing a Spark Application
RDDs are distributed collections of items. They can be created from Hadoop (HDFS) files, through transformations of other RDDs, from relational or non-relational databases, or from local files. RDDs are immutable.

In [53]:
# Making an RDD from a CSV file
sentiment_rdd = sc.textFile('/content/drive/My Drive/Colab Notebooks/08-apache-spark/data/sentimentos.csv')

In [54]:
# Check type
type(sentiment_rdd)

pyspark.rdd.RDD

In [34]:
# Action: Counting the number of rows
sentiment_rdd.count()

100

In [9]:
# Listing the first 5 rows
sentiment_rdd.take(5)

['positivo,Esse livro é incrível.',
 'positivo,Um dos melhores livros que eu já li.',
 'positivo,um dos melhores livros que eu já li',
 'positivo,Acho que ele tem um conteúdo que vai além do que está em sua descrição.',
 'positivo,O Sol é para todos é profundo e emocionante']

In [10]:
# Transforming data: convert every row to upper case
transf_rdd = sentiment_rdd.map(lambda x: x.upper())

In [11]:
transf_rdd.take(5)

['POSITIVO,ESSE LIVRO É INCRÍVEL.',
 'POSITIVO,UM DOS MELHORES LIVROS QUE EU JÁ LI.',
 'POSITIVO,UM DOS MELHORES LIVROS QUE EU JÁ LI',
 'POSITIVO,ACHO QUE ELE TEM UM CONTEÚDO QUE VAI ALÉM DO QUE ESTÁ EM SUA DESCRIÇÃO.',
 'POSITIVO,O SOL É PARA TODOS É PROFUNDO E EMOCIONANTE']

In [None]:
# Return only the first row
transf_rdd.first()

'POSITIVO,ESSE LIVRO É INCRÍVEL.'

In [None]:
# Apply a filter 
rows_with_sol = sentiment_rdd.filter(lambda line: 'Sol' in line)

In [None]:
type(rows_with_sol)

pyspark.rdd.PipelinedRDD

In [None]:
rows_with_sol.count()

3
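
The predicate `'Sol' in line` used in the filter is a case-sensitive substring test. Here is a minimal plain-Python sketch of the same predicate, without Spark. The first row comes from the sample output earlier in this notebook; the other two rows are hypothetical, invented only to show which rows match.

```python
rows = [
    'positivo,O Sol é para todos é profundo e emocionante',  # from the dataset sample
    'positivo,um dia de sol',   # hypothetical row: lowercase "sol"
    'positivo,Solidão',         # hypothetical row: "Sol" inside a longer word
]

def contains_sol(line):
    # Same test as the lambda passed to filter(): case-sensitive substring match
    return 'Sol' in line

matching_rows = [line for line in rows if contains_sol(line)]
print(matching_rows)
print(len(matching_rows))
```

The lowercase row is skipped, while the row where "Sol" is only part of a longer word is kept, so the count can include rows that do not contain the standalone word.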

First, the `map()` function computes the number of words in each row (each row is split on whitespace), producing a new RDD. The `reduce()` function then finds the largest of those word counts. The arguments to `map()` and `reduce()` are anonymous functions created with Python's `lambda`.

In [55]:
sentiment_rdd.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)

27

In [None]:
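# This command can be rewritten with a named function
# (called max_of_two so that it does not shadow Python's built-in max)
def max_of_two(a, b):
  if a > b:
    return a
  else:
    return b

sentiment_rdd.map(lambda line: len(line.split())).reduce(max_of_two)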

27
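
The same map-then-reduce pattern can be tried without Spark. Here is a minimal sketch using Python's built-in `map()` and `functools.reduce()` on the five rows returned by `take(5)` earlier. It only illustrates the pattern: the result differs from the 27 above because that value comes from all 100 rows.

```python
from functools import reduce

# The five rows shown by sentiment_rdd.take(5)
sample_rows = [
    'positivo,Esse livro é incrível.',
    'positivo,Um dos melhores livros que eu já li.',
    'positivo,um dos melhores livros que eu já li',
    'positivo,Acho que ele tem um conteúdo que vai além do que está em sua descrição.',
    'positivo,O Sol é para todos é profundo e emocionante',
]

# map step: number of whitespace-separated words in each row
word_counts = list(map(lambda line: len(line.split()), sample_rows))

# reduce step: keep the larger of two values until one remains
longest = reduce(lambda a, b: a if a > b else b, word_counts)

print(word_counts)  # [4, 8, 8, 15, 9]
print(longest)      # 15
```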

# 3. MapReduce Operation

In [56]:
# Count words in the dataset
count_words = sentiment_rdd.flatMap(lambda line: line.split()).map(lambda word: (word,1)).reduceByKey(lambda a, b: a + b)

count_words.collect()

[('livro', 5),
 ('que', 13),
 ('li.', 4),
 ('positivo,um', 3),
 ('li', 1),
 ('positivo,Acho', 1),
 ('tem', 1),
 ('um', 3),
 ('vai', 1),
 ('do', 2),
 ('em', 1),
 ('descrição.', 1),
 ('positivo,O', 2),
 ('para', 5),
 ('todos', 4),
 ('positivo,Me', 1),
 ('este', 1),
 ('livro,', 1),
 ('antigo', 1),
 ('uma', 4),
 ('história', 1),
 ('antiga', 1),
 ('positivo,The', 6),
 ('Da', 38),
 ('Vinci', 45),
 ('Code', 24),
 ('is', 17),
 ('good', 3),
 ('movie...', 1),
 ('thought', 2),
 ('was', 4),
 ('pretty', 1),
 ('book.', 4),
 ('realmente', 1),
 ('deveria', 1),
 ('todas', 1),
 ('as', 1),
 ('pessoas.', 1),
 ('an', 6),
 ('*', 2),
 ('book', 2),
 ('turn', 1),
 ('positivo,Harper', 1),
 ('aborda', 1),
 ('muito', 3),
 ('polêmicos,', 1),
 ('como', 1),
 ('Bullying,', 1),
 ('olhos', 1),
 ('inocentes', 1),
 ('positivo,i', 4),
 ('love', 6),
 ('da', 13),
 ('code....', 1),
 ('loved', 5),
 ('code..', 2),
 ('VINCI', 4),
 ('BEAUTIFUL', 1),
 ('positivo,THE', 1),
 ('slash', 1),
 ('race.', 1),
 ('positivo,Hey', 1),
 ('The
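
The three stages of the word count (`flatMap`, `map`, `reduceByKey`) can be mirrored in plain Python. This is a minimal sketch without Spark, run on the first three rows of the dataset sample shown earlier; the variable names are illustrative only. It also shows why keys such as `'positivo,um'` appear in the output: rows are split on whitespace only, so the sentiment label stays attached to the first word.

```python
from collections import defaultdict
from itertools import chain

sample_rows = [
    'positivo,Esse livro é incrível.',
    'positivo,Um dos melhores livros que eu já li.',
    'positivo,um dos melhores livros que eu já li',
]

# flatMap: split every row and flatten the results into one sequence of words
words = list(chain.from_iterable(line.split() for line in sample_rows))

# map: pair every word with the count 1
pairs = [(word, 1) for word in words]

# reduceByKey: add up the counts of identical words
totals = defaultdict(int)
for word, n in pairs:
    totals[word] += n
counts = dict(totals)

print(counts['dos'])            # 2
print(counts['li.'], counts['li'])  # 1 1 (punctuation makes them distinct keys)
print('positivo,um' in counts)  # True (label attached to the first word)
```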

# 4. Monitor Jobs - Spark UI
In the Spark UI, you can monitor the progress of your jobs and debug performance bottlenecks. The UI is directly reachable only if your Colab is running with a local runtime.

In [57]:
spark

In [58]:
# If you are running this Colab on the Google hosted runtime, this cell
# creates an ngrok tunnel to the Spark UI (port 4050) so you can still open it.
# The last command asks ngrok's local API (port 4040) for the public URL.
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('./ngrok http 4050 &')
!curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

--2020-07-25 22:48:08--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 3.233.171.45, 54.84.116.182, 52.201.131.65, ...
Connecting to bin.equinox.io (bin.equinox.io)|3.233.171.45|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13773305 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip.3’


2020-07-25 22:48:09 (55.6 MB/s) - ‘ngrok-stable-linux-amd64.zip.3’ saved [13773305/13773305]

Archive:  ngrok-stable-linux-amd64.zip
replace ngrok? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ngrok                   
https://14a306d6fd15.ngrok.io
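
The `python3 -c` one-liner in the cell above reads the JSON returned by ngrok's local API and prints the `public_url` of the first tunnel. Here is a minimal sketch of that extraction on a hand-written payload. The payload is an assumption that contains only the fields the one-liner accesses (a real response carries more), with the URL copied from the output above; `first_public_url` is a hypothetical helper name.

```python
import json

# Assumed, reduced payload: only the fields read by the one-liner
sample_response = '{"tunnels": [{"public_url": "https://14a306d6fd15.ngrok.io"}]}'

def first_public_url(raw_json):
    """Return the public URL of the first tunnel listed in the JSON text."""
    tunnels = json.loads(raw_json)['tunnels']
    if not tunnels:
        # The list is empty when no tunnel has been established yet
        raise RuntimeError('no ngrok tunnel is running yet')
    return tunnels[0]['public_url']

print(first_public_url(sample_response))
```

Checking for an empty tunnel list gives a clearer error than the `IndexError` the one-liner would raise if it ran before the tunnel was ready.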
