# Big Data: How to Install PySpark on Google Colab

How to install PySpark on Google Colab is a common question among people who are migrating their Data Science projects to cloud environments.

The term Big Data is increasingly common, and even personal projects can grow to a large scale because of the amount of data available.

For analyzing large volumes of data (Big Data) quickly, Apache Spark is a widely used tool, thanks to its data-processing capacity and parallel computing.

Spark was designed to be accessible, offering APIs and frameworks in Python, Scala, SQL, and several other languages.

This tutorial is based on the official documentation, which you can find [at this link](https://spark.apache.org/docs/latest/api/python/getting_started/index.html).

## PySpark on Google Colab

[PySpark](https://spark.apache.org/docs/latest/api/python/) is the high-level interface that lets you access and use Spark from Python. With PySpark, you can write all of your code in ordinary Python style.

## Installing PySpark on Google Colab

Installing PySpark in this setup is not as direct as installing most Python packages: a single `pip install` is not enough. First you need to install the dependencies, namely Java 8 and Apache Spark 2.4.4 built for Hadoop 2.7.

In [1]:
# install the dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark
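
The same version string shows up in the download URL, the archive name, and the extracted directory, and later cells repeat the directory name. As a minimal sketch, the helper below derives all three from two constants so they cannot drift apart. The names `SPARK_VERSION`, `HADOOP_VERSION`, and `spark_paths` are hypothetical and not part of the original notebook; the URL pattern is copied from the `wget` command above.

```python
# Hypothetical constants: change them to the release you actually want.
SPARK_VERSION = "2.4.4"
HADOOP_VERSION = "2.7"

def spark_paths(spark_version, hadoop_version, base_dir="/content"):
    """Return (archive URL, archive file name, extraction directory)."""
    name = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    url = f"https://archive.apache.org/dist/spark/spark-{spark_version}/{name}.tgz"
    return url, f"{name}.tgz", f"{base_dir}/{name}"

url, archive, spark_home = spark_paths(SPARK_VERSION, HADOOP_VERSION)
print(url)         # value to pass to wget
print(archive)     # value to pass to tar
print(spark_home)  # value to use for SPARK_HOME
```

In a Colab cell, the resulting strings could be interpolated into shell commands, for example `!wget -q {url}`.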

The next step is to set the environment variables, so that the Colab environment can correctly locate where the dependencies were installed.

To manipulate the process environment from Python, much as you would in a terminal, you can use the `os` library.

In [2]:
# set the environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

# make pyspark importable, using the same absolute path as SPARK_HOME
import findspark
findspark.init(os.environ["SPARK_HOME"])
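
A wrong path in either variable usually only surfaces later as a confusing error when the session starts. As a small sketch, the check below reports which variables are unset or do not point to an existing directory. The helper name `missing_dirs` is hypothetical and is not part of `findspark` or Spark.

```python
import os

def missing_dirs(var_names, environ=os.environ):
    """Return the names of variables that are unset or do not point to a directory."""
    missing = []
    for name in var_names:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

problems = missing_dirs(["JAVA_HOME", "SPARK_HOME"])
if problems:
    print("Check these variables:", ", ".join(problems))
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories.")
```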

With the dependencies installed, let's install `pyspark` and configure a session:

In [3]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
     |████████████████████████████████| 281.4 MB 33 kB/s 
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
     |████████████████████████████████| 198 kB 5.5 MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=e0740ff57bfd2653b0d89f70bbae6e24470833443fde7ab91e9377dc88b3b8cc
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


With everything ready, let's run a local session to test whether the installation worked.

[SparkSession Explanation](https://towardsdatascience.com/sparksession-vs-sparkcontext-vs-sqlcontext-vs-hivecontext-741d50c9486a)

In [4]:
# start a local session and import Airbnb data
from pyspark.sql import SparkSession
sc = SparkSession.builder.master('local[*]').getOrCreate()

# download over HTTP to a local file
!wget --quiet --show-progress http://data.insideairbnb.com/brazil/rj/rio-de-janeiro/2021-12-24/visualisations/listings.csv

# load the Airbnb data
df_spark = sc.read.csv("./listings.csv", inferSchema=True, header=True)

# show the data type of each column
df_spark.printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- host_id: string (nullable = true)
 |-- host_name: string (nullable = true)
 |-- neighbourhood_group: string (nullable = true)
 |-- neighbourhood: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- room_type: string (nullable = true)
 |-- price: string (nullable = true)
 |-- minimum_nights: string (nullable = true)
 |-- number_of_reviews: string (nullable = true)
 |-- last_review: string (nullable = true)
 |-- reviews_per_month: string (nullable = true)
 |-- calculated_host_listings_count: string (nullable = true)
 |-- availability_365: string (nullable = true)
 |-- number_of_reviews_ltm: double (nullable = true)
 |-- license: integer (nullable = true)



In [5]:
import sys
# The Spark DataFrame behaves as if it were just a pointer to the distributed data
sys.getsizeof(df_spark) 

64

In [6]:
import pandas as pd
df_pandas = pd.read_csv("listings.csv")
sys.getsizeof(df_pandas) 

11523735
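
One caveat when reading these two numbers: `sys.getsizeof` reports only the shallow size of the Python object it is given. The sketch below uses a made-up `Handle` class (not a Spark or pandas type) to show that an object holding only a reference to large data reports a tiny size, which is consistent with the 64 bytes measured for `df_spark`.

```python
import sys

class Handle:
    """A tiny object that only keeps a reference to data stored elsewhere."""
    def __init__(self, data):
        self.data = data

data = list(range(1_000_000))
handle = Handle(data)

shallow_handle = sys.getsizeof(handle)  # size of the wrapper object only
shallow_list = sys.getsizeof(data)      # size of the list object itself

print(shallow_handle, shallow_list)
```

The wrapper's reported size does not grow with the data it refers to, so the comparison says more about where the data lives than about how large it is.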

The `df_spark` variable is what is called a PySpark DataFrame:

In [7]:
df_spark

DataFrame[id: string, name: string, host_id: string, host_name: string, neighbourhood_group: string, neighbourhood: string, latitude: string, longitude: string, room_type: string, price: string, minimum_nights: string, number_of_reviews: string, last_review: string, reviews_per_month: string, calculated_host_listings_count: string, availability_365: string, number_of_reviews_ltm: double, license: int]

To view the contents of the DataFrame, use the `.show()` method:

In [8]:
df_spark.show()

+-----+--------------------+---------+--------------------+-------------------+---------------+---------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+-------+
|   id|                name|  host_id|           host_name|neighbourhood_group|  neighbourhood| latitude|longitude|      room_type|price|minimum_nights|number_of_reviews|last_review|reviews_per_month|calculated_host_listings_count|availability_365|number_of_reviews_ltm|license|
+-----+--------------------+---------+--------------------+-------------------+---------------+---------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+-------+
|17878|Very Nice 2Br in ...|    68997|            Matthias|               null|     Copacabana|-22.96599| -43.1794|Entire home/apt|  350|             5|           

Alternatively, you can enable the `spark.sql.repl.eagerEval.enabled` setting for eager evaluation of PySpark DataFrames in notebooks such as Jupyter. The number of rows displayed is controlled by the `spark.sql.repl.eagerEval.maxNumRows` setting.

In [9]:
sc.conf.set('spark.sql.repl.eagerEval.maxNumRows', 40)

In [10]:
sc.conf.set('spark.sql.repl.eagerEval.enabled', True)

In [11]:
df_spark

id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
17878,Very Nice 2Br in ...,68997,Matthias,,Copacabana,-22.96599,-43.1794,Entire home/apt,350,5,267,2021-12-03,1.92,1,257,10.0,
24480,Nice and cozy nea...,99249,Goya,,Ipanema,-22.98405,-43.20189,Entire home/apt,296,3,85,2018-02-14,0.62,1,107,0.0,
25026,Beautiful Modern ...,102840,Viviane,,Copacabana,-22.97735,-43.19105,Entire home/apt,387,3,238,2020-02-15,1.69,1,206,0.0,
35636,Cosy flat close t...,153232,Patricia,,Ipanema,-22.98839,-43.19232,Entire home/apt,172,2,181,2020-03-15,1.82,1,207,0.0,
35764,COPACABANA SEA BR...,153691,Patricia Miranda ...,,Copacabana,-22.98107,-43.19136,Entire home/apt,260,3,378,2021-12-05,2.76,1,58,32.0,
48305,Bright 6bed Penth...,70933,Goitaca,,Ipanema,-22.98591,-43.20302,Entire home/apt,4217,2,91,2021-12-06,0.69,9,325,17.0,
48726,Rio de Janeiro Co...,221941,Vana,,Copacabana,-22.98528,-43.19264,Private room,114,3,42,2019-08-08,0.83,2,26,0.0,
48901,Confortable 4BD 3...,222884,Marcio,,Copacabana,-22.96574,-43.17514,Entire home/apt,2015,2,8,2021-12-10,0.1,2,5,4.0,
49179,Djalma Ocean View...,224192,David,,Copacabana,-22.9791,-43.19008,Entire home/apt,380,3,106,2021-12-19,0.96,36,161,18.0,
50294,Ipanema Beach Blo...,70933,Goitaca,,Ipanema,-22.98584,-43.20305,Entire home/apt,2310,2,74,2021-07-07,0.58,9,330,1.0,


You can inspect the DataFrame's column names and schema as follows:

In [12]:
df_spark.columns

['id',
 'name',
 'host_id',
 'host_name',
 'neighbourhood_group',
 'neighbourhood',
 'latitude',
 'longitude',
 'room_type',
 'price',
 'minimum_nights',
 'number_of_reviews',
 'last_review',
 'reviews_per_month',
 'calculated_host_listings_count',
 'availability_365',
 'number_of_reviews_ltm',
 'license']

In [13]:
df_spark.printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- host_id: string (nullable = true)
 |-- host_name: string (nullable = true)
 |-- neighbourhood_group: string (nullable = true)
 |-- neighbourhood: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- room_type: string (nullable = true)
 |-- price: string (nullable = true)
 |-- minimum_nights: string (nullable = true)
 |-- number_of_reviews: string (nullable = true)
 |-- last_review: string (nullable = true)
 |-- reviews_per_month: string (nullable = true)
 |-- calculated_host_listings_count: string (nullable = true)
 |-- availability_365: string (nullable = true)
 |-- number_of_reviews_ltm: double (nullable = true)
 |-- license: integer (nullable = true)



A statistical summary of the DataFrame can be displayed as follows:

In [15]:
df_spark.describe().show()

+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+-------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------------------+------------------+---------------------+------------------+
|summary|                  id|                name|             host_id|           host_name| neighbourhood_group|       neighbourhood|           latitude|          longitude|         room_type|             price|    minimum_nights| number_of_reviews|      last_review| reviews_per_month|calculated_host_listings_count|  availability_365|number_of_reviews_ltm|           license|
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+-------------------+------------------+------------------+------------------+--------

`DataFrame.collect()` brings the distributed data to the driver side as local Python data. Note that this can raise an out-of-memory error when the dataset is too large to fit on the driver, because it gathers all the data from the executors onto the driver.

In [16]:
df_spark.collect()

[Row(id='17878', name='Very Nice 2Br in Copacabana w. balcony, fast WiFi', host_id='68997', host_name='Matthias', neighbourhood_group=None, neighbourhood='Copacabana', latitude='-22.96599', longitude='-43.1794', room_type='Entire home/apt', price='350', minimum_nights='5', number_of_reviews='267', last_review='2021-12-03', reviews_per_month='1.92', calculated_host_listings_count='1', availability_365='257', number_of_reviews_ltm=10.0, license=None),
 Row(id='24480', name='Nice and cozy near Ipanema Beach, w/ home office', host_id='99249', host_name='Goya', neighbourhood_group=None, neighbourhood='Ipanema', latitude='-22.98405', longitude='-43.20189', room_type='Entire home/apt', price='296', minimum_nights='3', number_of_reviews='85', last_review='2018-02-14', reviews_per_month='0.62', calculated_host_listings_count='1', availability_365='107', number_of_reviews_ltm=0.0, license=None),
 Row(id='25026', name='Beautiful Modern Decorated Studio in Copa', host_id='102840', host_name='Vivia

To avoid an out-of-memory exception, use `DataFrame.take()`, which returns only the number of rows you ask for.

In [17]:
df_spark.take(5)

[Row(id='17878', name='Very Nice 2Br in Copacabana w. balcony, fast WiFi', host_id='68997', host_name='Matthias', neighbourhood_group=None, neighbourhood='Copacabana', latitude='-22.96599', longitude='-43.1794', room_type='Entire home/apt', price='350', minimum_nights='5', number_of_reviews='267', last_review='2021-12-03', reviews_per_month='1.92', calculated_host_listings_count='1', availability_365='257', number_of_reviews_ltm=10.0, license=None),
 Row(id='24480', name='Nice and cozy near Ipanema Beach, w/ home office', host_id='99249', host_name='Goya', neighbourhood_group=None, neighbourhood='Ipanema', latitude='-22.98405', longitude='-43.20189', room_type='Entire home/apt', price='296', minimum_nights='3', number_of_reviews='85', last_review='2018-02-14', reviews_per_month='0.62', calculated_host_listings_count='1', availability_365='107', number_of_reviews_ltm=0.0, license=None),
 Row(id='25026', name='Beautiful Modern Decorated Studio in Copa', host_id='102840', host_name='Vivia

A PySpark DataFrame can also be converted back to a pandas DataFrame to take advantage of the pandas API. Note that `toPandas()` also collects all the data onto the driver, which can easily cause an out-of-memory error when the data is too large to fit there.

In [18]:
df_spark.toPandas()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,17878,"Very Nice 2Br in Copacabana w. balcony, fast WiFi",68997,Matthias,,Copacabana,-22.96599,-43.1794,Entire home/apt,350,5,267,2021-12-03,1.92,1,257,10.0,
1,24480,"Nice and cozy near Ipanema Beach, w/ home office",99249,Goya,,Ipanema,-22.98405,-43.20189,Entire home/apt,296,3,85,2018-02-14,0.62,1,107,0.0,
2,25026,Beautiful Modern Decorated Studio in Copa,102840,Viviane,,Copacabana,-22.97735,-43.19105,Entire home/apt,387,3,238,2020-02-15,1.69,1,206,0.0,
3,35636,Cosy flat close to Ipanema beach,153232,Patricia,,Ipanema,-22.98839,-43.19232,Entire home/apt,172,2,181,2020-03-15,1.82,1,207,0.0,
4,35764,COPACABANA SEA BREEZE - RIO - 20 X Superhost,153691,Patricia Miranda & Paulo,,Copacabana,-22.98107,-43.19136,Entire home/apt,260,3,378,2021-12-05,2.76,1,58,32.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24708,53957052,COPACABANA FAMILY HOME,190156483,Fernando,,Copacabana,-22.977974,-43.190044,Entire home/apt,1786,4,0,,,1,365,0.0,
24709,53957980,Stunning & Luxurious Apartment in Flamengo (#2),20561832,Pedro,,Flamengo,-22.932098,-43.17709,Entire home/apt,460,2,1,2021-12-24,1,4,3,1.0,
24710,53958210,Apartamento aconchegante no Leme,71566676,Dulce/Raquel,,Leme,-22.962902258070564,-43.170973085960014,Entire home/apt,891,2,0,,,1,267,0.0,
24711,53958814,Apto 520 na Quadra da Praia de Copacabana Posto 3,33865879,Miguel,,Copacabana,-22.96871398426808,-43.18329311240373,Entire home/apt,177,4,0,,,4,81,0.0,


We can also use grouping for some calculations:

In [None]:
# spark_df.groupby(group_column).action(column_to_transform).show()
df_spark.groupby('neighbourhood').avg('number_of_reviews_ltm').show()

+-------------------+--------------------------+
|      neighbourhood|avg(number_of_reviews_ltm)|
+-------------------+--------------------------+
|             Cocotá|                       0.0|
|              Gávea|         1.053191489361702|
|           -22.9851|                      null|
|       Tomás Coelho|        0.3333333333333333|
|  -23.0633674927355|                      null|
|            Ipanema|         4.502656313853699|
|          -22.98487|                      null|
|           -22.8927|                      null|
|          -22.97122|                      null|
|           Realengo|        0.4444444444444444|
|          -22.99748|                      null|
|      Gardênia Azul|        0.4230769230769231|
|          -22.98135|                      null|
|              Rocha|                     0.125|
|      Bento Ribeiro|                     3.625|
|Vicente de Carvalho|                       3.0|
| -22.97010221488871|                      null|
|          -22.99694
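
To make the aggregation concrete, here is a plain-Python sketch of what a group-by followed by an average computes. It is an illustration only, not how Spark implements it, and the rows are made up. Like Spark's `avg`, it skips missing values, so a group whose values are all missing ends up with no average at all, which matches the `null` entries in the output above.

```python
def group_average(rows, key, value):
    """Average `value` per `key`, skipping None; groups with no values yield None."""
    stats = {}  # group -> [running sum, count of non-missing values]
    for row in rows:
        acc = stats.setdefault(row[key], [0.0, 0])
        if row[value] is not None:
            acc[0] += row[value]
            acc[1] += 1
    return {group: (total / n if n else None) for group, (total, n) in stats.items()}

# Made-up rows for illustration; the last one mimics a misparsed record.
rows = [
    {"neighbourhood": "Ipanema", "number_of_reviews_ltm": 4.0},
    {"neighbourhood": "Ipanema", "number_of_reviews_ltm": 6.0},
    {"neighbourhood": "Rocha", "number_of_reviews_ltm": 0.0},
    {"neighbourhood": "-22.9851", "number_of_reviews_ltm": None},
]

averages = group_average(rows, "neighbourhood", "number_of_reviews_ltm")
print(averages)
```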

In [23]:
df_spark.dtypes

[('id', 'string'),
 ('name', 'string'),
 ('host_id', 'string'),
 ('host_name', 'string'),
 ('neighbourhood_group', 'string'),
 ('neighbourhood', 'string'),
 ('latitude', 'string'),
 ('longitude', 'string'),
 ('room_type', 'string'),
 ('price', 'string'),
 ('minimum_nights', 'string'),
 ('number_of_reviews', 'string'),
 ('last_review', 'string'),
 ('reviews_per_month', 'string'),
 ('calculated_host_listings_count', 'string'),
 ('availability_365', 'string'),
 ('number_of_reviews_ltm', 'double'),
 ('license', 'int')]

Notice that some numeric variables appear as strings. We can change the type of these columns with the following syntax:

[withColumn Documentation](https://sparkbyexamples.com/spark/spark-dataframe-withcolumn/#:~:text=Spark%20withColumn()%20is%20a,column%20operations%20with%20Scala%20examples.)

In [24]:
df_spark = df_spark.withColumn("price", df_spark["price"].cast("double"))

In [25]:
df_spark.dtypes

[('id', 'string'),
 ('name', 'string'),
 ('host_id', 'string'),
 ('host_name', 'string'),
 ('neighbourhood_group', 'string'),
 ('neighbourhood', 'string'),
 ('latitude', 'string'),
 ('longitude', 'string'),
 ('room_type', 'string'),
 ('price', 'double'),
 ('minimum_nights', 'string'),
 ('number_of_reviews', 'string'),
 ('last_review', 'string'),
 ('reviews_per_month', 'string'),
 ('calculated_host_listings_count', 'string'),
 ('availability_365', 'string'),
 ('number_of_reviews_ltm', 'double'),
 ('license', 'int')]
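
When several columns need casting, repeating `withColumn` for each one gets verbose. One possible alternative, sketched below, is to build a list of SQL `CAST` expressions and pass it to `DataFrame.selectExpr`. The helper `cast_expressions` and the chosen target types are assumptions made for illustration; the block only builds the strings, so it runs without a Spark session.

```python
def cast_expressions(columns, casts):
    """Build SQL expressions that cast the selected columns and keep the others."""
    exprs = []
    for name in columns:
        if name in casts:
            exprs.append(f"CAST(`{name}` AS {casts[name]}) AS `{name}`")
        else:
            exprs.append(f"`{name}`")
    return exprs

# Assumed target types, chosen only for this example.
columns = ["id", "price", "minimum_nights", "latitude"]
casts = {"price": "DOUBLE", "minimum_nights": "INT", "latitude": "DOUBLE"}

exprs = cast_expressions(columns, casts)
for expr in exprs:
    print(expr)

# With a live session, the list could be applied as:
# df_spark = df_spark.selectExpr(*cast_expressions(df_spark.columns, casts))
```

Values that cannot be parsed as the target type become null after the cast, so it is worth checking the null counts afterwards.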

In [26]:
df_spark.groupby('neighbourhood').avg('price').show()

+-------------------+------------------+
|      neighbourhood|        avg(price)|
+-------------------+------------------+
|             Cocotá|             150.0|
|              Gávea|1140.2659574468084|
|           -22.9851|               3.0|
|       Tomás Coelho|             105.0|
|  -23.0633674927355|               1.0|
|            Ipanema|1335.4372701266857|
|          -22.98487|               3.0|
|           -22.8927|               2.0|
|          -22.97122|               1.0|
|           Realengo| 633.2222222222222|
|          -22.99748|               3.0|
|      Gardênia Azul| 526.4230769230769|
|          -22.98135|               1.0|
|              Rocha|           211.375|
|      Bento Ribeiro|           316.625|
|Vicente de Carvalho|              80.0|
| -22.97010221488871|               1.0|
|          -22.99694|               4.0|
|          -23.00357|               2.0|
|          -22.98148|               3.0|
+-------------------+------------------+
only showing top
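
The latitude-like values in the `neighbourhood` column above suggest that some records were split across rows while the file was being read. A likely cause, stated here as an assumption rather than a verified diagnosis, is that some text fields contain line breaks inside quotes. The sketch below uses Python's standard `csv` module on a made-up sample to show how splitting on line breaks shifts fields, while a quote-aware parser keeps each record intact.

```python
import csv
import io

# Made-up sample; the second record has a line break inside a quoted name.
raw = 'id,name,neighbourhood\n1,"Nice flat",Ipanema\n2,"Sea view\nnear the beach",Copacabana\n'

# Naive approach: every physical line is treated as a record.
naive_rows = [line.split(",") for line in raw.splitlines()]

# Quote-aware approach: the quoted line break stays inside the field.
parsed_rows = list(csv.reader(io.StringIO(raw)))

print(len(naive_rows))   # counts physical lines, including the broken fragment
print(len(parsed_rows))  # counts logical records, including the header
print(parsed_rows[2])
```

Spark's CSV reader accepts `multiLine`, `quote`, and `escape` options in `read.csv`. Whether setting them fixes this particular file is untested here, but re-reading with `multiLine=True` would be a reasonable first experiment.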