
# 1. Install

In [8]:
# Install the dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

In [11]:
# Set the environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
 
# Make pyspark importable
import findspark
findspark.init('spark-2.4.4-bin-hadoop2.7')
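
Before calling `findspark.init`, it can help to confirm that both directories actually exist, since a wrong path tends to surface later as a less obvious import or startup error. The following is a minimal sketch; `missing_paths` is a hypothetical helper introduced here for illustration and is not part of findspark or PySpark.

```python
import os

def missing_paths(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point to an existing directory."""
    missing = []
    for name in names:
        path = env.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# In the notebook this would be checked against the real environment.
problems = missing_paths(os.environ)
if problems:
    print("Check these variables:", ", ".join(problems))
```

An empty list means both variables point to real directories; anything else names the variable to fix before starting Spark.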

In [None]:
# Start a local session and import the Airbnb data
# from pyspark.sql import SparkSession
# sc = SparkSession.builder.master('local[*]').getOrCreate()
 
# Download from HTTP to a local file
# !wget --quiet --show-progress http://data.insideairbnb.com/brazil/rj/rio-de-janeiro/2019-07-15/visualisations/listings.csv
 
# Load the Airbnb data
# df_spark = sc.read.csv("./listings.csv", inferSchema=True, header=True)
 
# Show the data type of each column
# df_spark.printSchema()

# 2. PySpark Introduction

In [14]:
from pyspark import SparkContext
sc = SparkContext("local[*]", "My First App")

In [13]:
# Stop the context
# sc.stop()

In [None]:
import sys
print(sys.version)

3.6.9 (default, Apr 18 2020, 01:56:04) 
[GCC 8.4.0]


In [None]:
# Print session context (Spark Context)
print(sc)

<SparkContext master=local[*] appName=My First App>


In [None]:
# Spark version of the context
print(sc.version)

2.4.4


In [15]:
# Test Spark by creating an RDD.
# A plain Python list cannot be processed by the cluster directly;
# it must first be converted to an RDD.
lst = [25,90,81,37,776,3320]
test_data = sc.parallelize(lst,10)

In [None]:
# What does sc.parallelize do?
?sc.parallelize

# Signature: sc.parallelize(c, numSlices=None)
# Docstring:
# Distribute a local Python collection to form an RDD (Resilient Distributed
# Dataset). Using xrange is recommended if the input represents a range,
# for performance.

# >>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
# [[0], [2], [3], [4], [6]]
# >>> sc.parallelize(xrange(0, 6, 2), 5).glom().collect()
# [[], [0], [], [2], [4]]
# File:      /content/spark-2.4.4-bin-hadoop2.7/python/pyspark/context.py
# Type:      method
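
The docstring examples above suggest that elements are assigned to partitions by position, so asking for more slices than there are elements leaves some partitions empty. The sketch below reproduces that pattern in plain Python, without Spark. `slice_positions` is a hypothetical helper, and the positional rule it uses is an assumption inferred from the docstring output, not a statement about Spark's internals.

```python
def slice_positions(items, num_slices):
    """Split items into num_slices contiguous chunks by integer position."""
    items = list(items)
    n = len(items)
    return [
        items[(i * n) // num_slices : ((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]

# The two docstring examples (range replaces xrange in Python 3).
print(slice_positions([0, 2, 3, 4, 6], 5))   # [[0], [2], [3], [4], [6]]
print(slice_positions(range(0, 6, 2), 5))    # [[], [0], [], [2], [4]]

# The list used in this notebook, spread over 10 slices.
parts = slice_positions([25, 90, 81, 37, 776, 3320], 10)
print(sum(1 for p in parts if not p), "of", len(parts), "chunks are empty")
```

With six elements and `numSlices=10`, at least four partitions must be empty whatever the exact rule is, so a smaller partition count is probably a better fit for an input this small.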

In [None]:
# Check data type
type(test_data)

pyspark.rdd.RDD

In [None]:
# Counting data
test_data.count()

6

In [None]:
# List values
test_data.collect()

[25, 90, 81, 37, 776, 3320]

# 3. Executing a Spark Application
RDDs are distributed collections of items. They can be created from Hadoop (HDFS) files, from local files, from relational or non-relational databases, or through transformations of other RDDs. RDDs are immutable: a transformation never changes an RDD in place; it returns a new one.

In [16]:
# Create an RDD from a CSV file
sentiment_rdd = sc.textFile('/content/drive/My Drive/Colab Notebooks/08-apache-spark/data/sentimentos.csv')

In [17]:
# Check type
type(sentiment_rdd)

pyspark.rdd.RDD

In [18]:
# Action: Counting the number of rows
sentiment_rdd.count()

100

In [19]:
# List the first 5 rows
sentiment_rdd.take(5)

['positivo,Esse livro é incrível.',
 'positivo,Um dos melhores livros que eu já li.',
 'positivo,um dos melhores livros que eu já li',
 'positivo,Acho que ele tem um conteúdo que vai além do que está em sua descrição.',
 'positivo,O Sol é para todos é profundo e emocionante']
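
Each element of the RDD is still a raw CSV line, with the label and the sentence joined by a comma. As a sketch of a possible next step, a small function can separate the two fields. `parse_line` is a hypothetical helper, and it assumes that the label itself never contains a comma.

```python
def parse_line(line):
    """Split 'label,text' on the first comma only, so commas inside the text survive."""
    label, _, text = line.partition(",")
    return label.strip(), text.strip()

# Two of the rows shown above, used as local sample data.
sample = [
    "positivo,Esse livro é incrível.",
    "positivo,O Sol é para todos é profundo e emocionante",
]
parsed = [parse_line(line) for line in sample]
print(parsed[0])  # ('positivo', 'Esse livro é incrível.')
```

In the notebook the same function could be passed to a transformation, for example `sentiment_rdd.map(parse_line)`, which would yield an RDD of `(label, text)` tuples.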

In [20]:
# Transformation: convert every row to upper case
transf_rdd = sentiment_rdd.map(lambda x: x.upper())

In [21]:
transf_rdd.take(5)

['POSITIVO,ESSE LIVRO É INCRÍVEL.',
 'POSITIVO,UM DOS MELHORES LIVROS QUE EU JÁ LI.',
 'POSITIVO,UM DOS MELHORES LIVROS QUE EU JÁ LI',
 'POSITIVO,ACHO QUE ELE TEM UM CONTEÚDO QUE VAI ALÉM DO QUE ESTÁ EM SUA DESCRIÇÃO.',
 'POSITIVO,O SOL É PARA TODOS É PROFUNDO E EMOCIONANTE']

In [23]:
# Return only the first row
transf_rdd.first()

'POSITIVO,ESSE LIVRO É INCRÍVEL.'

In [24]:
# Transformation: keep only the rows that contain 'Sol'
rows_with_sol = sentiment_rdd.filter(lambda line: 'Sol' in line)

In [25]:
type(rows_with_sol)

pyspark.rdd.PipelinedRDD

In [26]:
rows_with_sol.count()

3
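
The lambda above performs a case-sensitive substring test, so it would miss `sol` in lower case and would also match longer words that merely contain `Sol`. A stricter predicate is sketched below in plain Python. `contains_word` is a hypothetical helper, not a PySpark function.

```python
import re

def contains_word(word):
    """Build a predicate that matches `word` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return lambda line: pattern.search(line) is not None

has_sol = contains_word("Sol")
lines = [
    "positivo,O Sol é para todos é profundo e emocionante",
    "positivo,Esse livro é incrível.",
]
print([has_sol(line) for line in lines])  # [True, False]
```

The returned predicate takes one line and returns a boolean, so it should be usable in place of the lambda, as in `sentiment_rdd.filter(contains_word('Sol'))`.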

First, `map()` computes the number of words in each row with `len(line.split())`, producing a new RDD of integers. Then `reduce()` compares those values pairwise and keeps the larger one, so the final result is the word count of the longest row. The arguments to `map()` and `reduce()` are anonymous functions created with Python's `lambda` keyword.

In [28]:
sentiment_rdd.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)

27

In [29]:
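# The same command can be rewritten with a named function.
# It is called larger() so that it does not shadow Python's built-in max().
def larger(a, b):
  if a > b:
    return a
  else:
    return b

sentiment_rdd.map(lambda line: len(line.split())).reduce(larger)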

27
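
The map-then-reduce pattern used above can also be tried without a cluster. The sketch below applies the same two lambdas to the five rows returned earlier by `take(5)`, using Python's built-in `map` and `functools.reduce` as local stand-ins for the RDD methods. It only illustrates the logic and is not how Spark executes the job.

```python
from functools import reduce

# The five sample rows shown earlier in this notebook.
lines = [
    "positivo,Esse livro é incrível.",
    "positivo,Um dos melhores livros que eu já li.",
    "positivo,um dos melhores livros que eu já li",
    "positivo,Acho que ele tem um conteúdo que vai além do que está em sua descrição.",
    "positivo,O Sol é para todos é profundo e emocionante",
]

# map step: one word count per row (split() breaks on whitespace only).
word_counts = list(map(lambda line: len(line.split()), lines))

# reduce step: keep the larger of each pair until one value remains.
longest = reduce(lambda a, b: a if a > b else b, word_counts)

print(word_counts)  # [4, 8, 8, 15, 9]
print(longest)      # 15
```

Because `split()` breaks only on whitespace, the label and the first word of the sentence (for example `positivo,Esse`) are counted as a single token. On a local list the reduce step gives the same result as `max(word_counts)`.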