# Pyspark SQL

Este kernel possui alguns exemplos de código do módulo pyspark.sql.
Apresenta exemplos de funções que implementam instruções SQL, como: SELECT, WHERE, GROUP BY, ORDER BY, PARTITION BY, entre outras.


São usados seis datasets para os exemplos:
1.   Conjunto de dados contendo características e preços de imóveis no estado da Califórnia.
2.   Dataset com informações sobre os jogadores que disputaram a Copa do Mundo 2018
3.   Quatro conjuntos contendo informações sobre uma rede lojas com clientes e produtos comprados.

In [1]:
%pip install ipython-autotime
%pip install pyspark



Trecho de código opcional que resolve problema de compatibilidade entre a linguagem Python e o PySpark.

In [2]:
%%script echo 'ignore cell'
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

ignore cell


# Imports básicos.

In [3]:
from google.colab          import drive, files
from pyspark.sql           import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types     import *
from pyspark.sql.window    import Window

%load_ext autotime

time: 369 µs (started: 2023-09-27 17:42:00 +00:00)


# Início da sessão.

In [4]:
drive.mount('/content/drive', force_remount=True)
spark = SparkSession.builder.master('local').appName('pyspark_app').getOrCreate()
spark

Mounted at /content/drive


time: 14.2 s (started: 2023-09-27 17:42:00 +00:00)


## Leitura dos arquivos.

In [5]:
houses = spark.read.csv("/content/drive/MyDrive/datasets/housing/housing.csv", header=True, inferSchema=True, encoding='utf-8')
houses = houses.drop('housing_median_age', 'population', 'median_income', 'median_house_value') # remoção de atributos desnecessários
print(type(houses))
print(f'rows: {houses.count()}')
print(f'cols: {len(houses.columns)}')
houses.show(5)

<class 'pyspark.sql.dataframe.DataFrame'>
rows: 20640
cols: 6
+---------+--------+-----------+--------------+----------+---------------+
|longitude|latitude|total_rooms|total_bedrooms|households|ocean_proximity|
+---------+--------+-----------+--------------+----------+---------------+
|  -122.23|   37.88|      880.0|         129.0|     126.0|       NEAR BAY|
|  -122.22|   37.86|     7099.0|        1106.0|    1138.0|       NEAR BAY|
|  -122.24|   37.85|     1467.0|         190.0|     177.0|       NEAR BAY|
|  -122.25|   37.85|     1274.0|         235.0|     219.0|       NEAR BAY|
|  -122.25|   37.85|     1627.0|         280.0|     259.0|       NEAR BAY|
+---------+--------+-----------+--------------+----------+---------------+
only showing top 5 rows

time: 11.3 s (started: 2023-09-27 17:42:14 +00:00)


Abaixo, faço a leitura de um arquivo .csv com as opções definidas através da função option(). Além disso, defino manualmente os tipos de dados de cada coluna.

Note que defino "header=True" em uma das opções. Caso o conjunto de dados não tivesse cabeçalho e eu tivesse que criá-lo manualmente bastaria definir "header=False". Os nomes das colunas seriam os definidos no parâmetro "name=" de cada instância de StructField.

In [6]:
customSchema = StructType([StructField(name="Team", dataType=StringType(), nullable=True),
                           StructField(name="#", dataType=StringType(), nullable=True),
                           StructField(name="Pos.", dataType=StringType(), nullable=True),
                           StructField(name="FIFA Popular Name", dataType=StringType(), nullable=True),
                           StructField(name="Birth Date", dataType=StringType(), nullable=True),
                           StructField(name="Shirt Name", dataType=StringType(), nullable=True),
                           StructField(name="Club", dataType=StringType(), nullable=True),
                           StructField(name="Height", dataType=IntegerType(), nullable=True),
                           StructField(name="Weight", dataType=IntegerType(), nullable=True),])

players = spark.read.option("inferSchema", "False")\
                    .option("header", "True")\
                    .option("encoding", "utf-8")\
                    .schema(customSchema)\
                    .csv("/content/drive/MyDrive/datasets/wc2018-players.csv")

players = players.drop('#', 'club') # remoção de atributos desnecessários
print(type(players))
print(f'rows: {players.count()}')
print(f'cols: {len(players.columns)}')
players.show(5)

<class 'pyspark.sql.dataframe.DataFrame'>
rows: 736
cols: 7
+---------+----+------------------+----------+----------+------+------+
|     Team|Pos.| FIFA Popular Name|Birth Date|Shirt Name|Height|Weight|
+---------+----+------------------+----------+----------+------+------+
|Argentina|  DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|   169|    65|
|Argentina|  MF|    PAVON Cristian|21.01.1996|     PAVÓN|   169|    65|
|Argentina|  MF|    LANZINI Manuel|15.02.1993|   LANZINI|   167|    66|
|Argentina|  DF|    SALVIO Eduardo|13.07.1990|    SALVIO|   167|    69|
|Argentina|  FW|      MESSI Lionel|24.06.1987|     MESSI|   170|    72|
+---------+----+------------------+----------+----------+------+------+
only showing top 5 rows

time: 1.39 s (started: 2023-09-27 17:42:25 +00:00)


In [7]:
clients  = spark.read.csv("/content/drive/MyDrive/datasets/bix-tecnologia/clients.csv", header=True, inferSchema=True, encoding='utf-8')
products = spark.read.csv("/content/drive/MyDrive/datasets/bix-tecnologia/products.csv", header=True, inferSchema=True, encoding='utf-8')
sales    = spark.read.csv("/content/drive/MyDrive/datasets/bix-tecnologia/sales.csv", header=True, inferSchema=True, encoding='utf-8')
stores   = spark.read.csv("/content/drive/MyDrive/datasets/bix-tecnologia/stores.csv", header=True, inferSchema=True, encoding='utf-8')

clients.show(3)
products.show(3)
stores.show(3)
sales.show(6)

+-----+--------------+-----+-----------+-----+
|   ID|          City|State|DateOfBirth|  Sex|
+-----+--------------+-----+-----------+-----+
|14001|      Curitiba|   PR|  6/28/1985|Homem|
|14002| Florianópolis|   SC|  1/10/1987|Homem|
|14003|Rio de Janeiro|   RJ|  11/5/1979|Homem|
+-----+--------------+-----+-----------+-----+
only showing top 3 rows

+--------------------+---------------+----+
|                  ID|           Name|Size|
+--------------------+---------------+----+
|00066f42aeeb9f300...|Capitão América|   P|
|00066f42aeeb9f300...|Capitão América|   M|
|00066f42aeeb9f300...|Capitão América|   G|
+--------------------+---------------+----+
only showing top 3 rows

+---+--------------+-----+
| ID|          Name|State|
+---+--------------+-----+
|  1| Florianópolis|   SC|
|  2|Rio de Janeiro|   RJ|
|  3|  Porto Alegre|   RS|
+---+--------------+-----+
only showing top 3 rows

+----+----+--------------------+--------------------+--------+--------+---------+--------+-------+-

O dataframe "sales" está meio bagunçado, então vou precisar fazer alguns ajustes antes de poder usá-lo.
*   remover as 4 primeiras linhas
*   remover as 2 primeiras colunas e também 'row_number'

In [8]:
sales = sales.withColumn('row_number', monotonically_increasing_id())
sales = sales.where("row_number > 3")
sales = sales.drop('_c0', '_c1', 'row_number')

time: 260 ms (started: 2023-09-27 17:42:31 +00:00)


# Funções descritivas básicas.
Vou usar 'houses' como referência no uso das funções abaixo.

In [9]:
houses.show(5)

+---------+--------+-----------+--------------+----------+---------------+
|longitude|latitude|total_rooms|total_bedrooms|households|ocean_proximity|
+---------+--------+-----------+--------------+----------+---------------+
|  -122.23|   37.88|      880.0|         129.0|     126.0|       NEAR BAY|
|  -122.22|   37.86|     7099.0|        1106.0|    1138.0|       NEAR BAY|
|  -122.24|   37.85|     1467.0|         190.0|     177.0|       NEAR BAY|
|  -122.25|   37.85|     1274.0|         235.0|     219.0|       NEAR BAY|
|  -122.25|   37.85|     1627.0|         280.0|     259.0|       NEAR BAY|
+---------+--------+-----------+--------------+----------+---------------+
only showing top 5 rows

time: 319 ms (started: 2023-09-27 17:42:31 +00:00)


In [10]:
houses.take(5)

[Row(longitude=-122.23, latitude=37.88, total_rooms=880.0, total_bedrooms=129.0, households=126.0, ocean_proximity='NEAR BAY'),
 Row(longitude=-122.22, latitude=37.86, total_rooms=7099.0, total_bedrooms=1106.0, households=1138.0, ocean_proximity='NEAR BAY'),
 Row(longitude=-122.24, latitude=37.85, total_rooms=1467.0, total_bedrooms=190.0, households=177.0, ocean_proximity='NEAR BAY'),
 Row(longitude=-122.25, latitude=37.85, total_rooms=1274.0, total_bedrooms=235.0, households=219.0, ocean_proximity='NEAR BAY'),
 Row(longitude=-122.25, latitude=37.85, total_rooms=1627.0, total_bedrooms=280.0, households=259.0, ocean_proximity='NEAR BAY')]

time: 354 ms (started: 2023-09-27 17:42:32 +00:00)


In [11]:
print(houses.columns)

['longitude', 'latitude', 'total_rooms', 'total_bedrooms', 'households', 'ocean_proximity']
time: 426 µs (started: 2023-09-27 17:42:32 +00:00)


In [12]:
houses.printSchema()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- households: double (nullable = true)
 |-- ocean_proximity: string (nullable = true)

time: 7.37 ms (started: 2023-09-27 17:42:32 +00:00)


In [13]:
stores.describe().show()

+-------+-----------------+--------------+-----+
|summary|               ID|          Name|State|
+-------+-----------------+--------------+-----+
|  count|                7|             7|    7|
|   mean|6.857142857142857|          NULL| NULL|
| stddev|7.244045173533258|          NULL| NULL|
|    min|                1|Belo Horizonte|   MG|
|    max|               22|     São Paulo|   na|
+-------+-----------------+--------------+-----+

time: 2.01 s (started: 2023-09-27 17:42:32 +00:00)


In [14]:
rows = houses.count()
cols = len(houses.columns)
print(f'shape: {(rows, cols)}')

shape: (20640, 6)
time: 361 ms (started: 2023-09-27 17:42:34 +00:00)


# Funções úteis para feature engineering.

Criando backup do dataframe

In [15]:
houses_backup = houses

time: 423 µs (started: 2023-09-27 17:42:34 +00:00)


## Atributos

### Renomeação
Transformando todos os nomes de colunas para letras maiúsculas upper(). Poderíamos usar a função lower() caso quisessemos letras minúsculas.

In [16]:
upper = [column.upper() for column in houses.columns]
for column, up in zip(houses.columns, upper):
  houses = houses.withColumnRenamed(column, up)
print(houses.columns)

['LONGITUDE', 'LATITUDE', 'TOTAL_ROOMS', 'TOTAL_BEDROOMS', 'HOUSEHOLDS', 'OCEAN_PROXIMITY']
time: 63.4 ms (started: 2023-09-27 17:42:34 +00:00)


Uma função de renomeação para se usar nos datasets houses e players.

In [17]:
def to_lower(dataset):
  lower = [name.lower() for name in dataset.columns]
  for name, low_name in zip(dataset.columns, lower):
    dataset = dataset.withColumnRenamed(name, low_name)
  print(lower)
  return dataset



houses  = to_lower(houses)
players = to_lower(players)
print(houses.columns)
print(players.columns)

['longitude', 'latitude', 'total_rooms', 'total_bedrooms', 'households', 'ocean_proximity']
['team', 'pos.', 'fifa popular name', 'birth date', 'shirt name', 'height', 'weight']
['longitude', 'latitude', 'total_rooms', 'total_bedrooms', 'households', 'ocean_proximity']
['team', 'pos.', 'fifa popular name', 'birth date', 'shirt name', 'height', 'weight']
time: 178 ms (started: 2023-09-27 17:42:34 +00:00)


Outra função de renomeação para ser usada nos datasets, clients, products, sales e stores. Vai ficar mais fácil escrever código assim. Não vou precisar me preocupar com letras maiúsculas.

In [18]:
def rename_cols(df, names):
  for column, name in zip(df.columns, names):
    df = df.withColumnRenamed(column, name)
  return df



cols_clients  = ['client_id','client_city', 'client_state', 'client_birth', 'client_gender']
cols_products = ['product_id', 'product_name', 'product_size']
cols_sales    = ['id', 'product_id', 'client_id', 'discount', 'unit_price', 'quantity', 'store_id', 'date']
cols_stores   = ['store_id', 'store_city', 'store_state']

clients  = rename_cols(clients, cols_clients)
products = rename_cols(products, cols_products)
sales    = rename_cols(sales, cols_sales)
stores   = rename_cols(stores, cols_stores)

print(clients.columns)
print(products.columns)
print(sales.columns)
print(stores.columns)

['client_id', 'client_city', 'client_state', 'client_birth', 'client_gender']
['product_id', 'product_name', 'product_size']
['id', 'product_id', 'client_id', 'discount', 'unit_price', 'quantity', 'store_id', 'date']
['store_id', 'store_city', 'store_state']
time: 272 ms (started: 2023-09-27 17:42:35 +00:00)


É possível atribuir um 'alias' a cada atributo selecionado. Mas, isso só pode ser feito através da função col() que retorna um objeto Column.


In [19]:
lat = col('latitude').alias('lat')
lon = col('longitude').alias('lon')

print(lat)
houses.select([lat, lon]).show(5)

Column<'latitude AS lat'>
+-----+-------+
|  lat|    lon|
+-----+-------+
|37.88|-122.23|
|37.86|-122.22|
|37.85|-122.24|
|37.85|-122.25|
|37.85|-122.25|
+-----+-------+
only showing top 5 rows

time: 198 ms (started: 2023-09-27 17:42:35 +00:00)


### Criação
Atribuo um valor literal(True) à nova coluna chamada 'new_col'. A função lit() retorna um objeto Column. Note que este atributo é do tipo constante, ou seja, todos os seus valores são True. Normalmente este tipo de atributo não tem utilidade em modelos preditivos.

In [20]:
houses.withColumn('new_col', lit(True)).show(5)

+---------+--------+-----------+--------------+----------+---------------+-------+
|longitude|latitude|total_rooms|total_bedrooms|households|ocean_proximity|new_col|
+---------+--------+-----------+--------------+----------+---------------+-------+
|  -122.23|   37.88|      880.0|         129.0|     126.0|       NEAR BAY|   true|
|  -122.22|   37.86|     7099.0|        1106.0|    1138.0|       NEAR BAY|   true|
|  -122.24|   37.85|     1467.0|         190.0|     177.0|       NEAR BAY|   true|
|  -122.25|   37.85|     1274.0|         235.0|     219.0|       NEAR BAY|   true|
|  -122.25|   37.85|     1627.0|         280.0|     259.0|       NEAR BAY|   true|
+---------+--------+-----------+--------------+----------+---------------+-------+
only showing top 5 rows

time: 247 ms (started: 2023-09-27 17:42:35 +00:00)


Nova coluna criada a partir de uma operação matemática entre outras duas. Neste caso é preciso que os atributos sejam numéricos.

In [21]:
result = houses['total_bedrooms'] / houses['total_rooms']
print(type(result))
houses.withColumn('new_col', result).show(5)

<class 'pyspark.sql.column.Column'>
+---------+--------+-----------+--------------+----------+---------------+-------------------+
|longitude|latitude|total_rooms|total_bedrooms|households|ocean_proximity|            new_col|
+---------+--------+-----------+--------------+----------+---------------+-------------------+
|  -122.23|   37.88|      880.0|         129.0|     126.0|       NEAR BAY|0.14659090909090908|
|  -122.22|   37.86|     7099.0|        1106.0|    1138.0|       NEAR BAY|0.15579659106916466|
|  -122.24|   37.85|     1467.0|         190.0|     177.0|       NEAR BAY|0.12951601908657123|
|  -122.25|   37.85|     1274.0|         235.0|     219.0|       NEAR BAY|0.18445839874411302|
|  -122.25|   37.85|     1627.0|         280.0|     259.0|       NEAR BAY| 0.1720958819913952|
+---------+--------+-----------+--------------+----------+---------------+-------------------+
only showing top 5 rows

time: 274 ms (started: 2023-09-27 17:42:35 +00:00)


Nova coluna usando a função substring().

In [22]:
houses.withColumn('new_col', substring('ocean_proximity', 1, 4)).show(5)

+---------+--------+-----------+--------------+----------+---------------+-------+
|longitude|latitude|total_rooms|total_bedrooms|households|ocean_proximity|new_col|
+---------+--------+-----------+--------------+----------+---------------+-------+
|  -122.23|   37.88|      880.0|         129.0|     126.0|       NEAR BAY|   NEAR|
|  -122.22|   37.86|     7099.0|        1106.0|    1138.0|       NEAR BAY|   NEAR|
|  -122.24|   37.85|     1467.0|         190.0|     177.0|       NEAR BAY|   NEAR|
|  -122.25|   37.85|     1274.0|         235.0|     219.0|       NEAR BAY|   NEAR|
|  -122.25|   37.85|     1627.0|         280.0|     259.0|       NEAR BAY|   NEAR|
+---------+--------+-----------+--------------+----------+---------------+-------+
only showing top 5 rows

time: 150 ms (started: 2023-09-27 17:42:36 +00:00)


Concatenando dois atributos para formar um novo.

In [23]:
houses.withColumn('new_col', concat(houses['latitude'], houses['longitude'])).show(10)

+---------+--------+-----------+--------------+----------+---------------+------------+
|longitude|latitude|total_rooms|total_bedrooms|households|ocean_proximity|     new_col|
+---------+--------+-----------+--------------+----------+---------------+------------+
|  -122.23|   37.88|      880.0|         129.0|     126.0|       NEAR BAY|37.88-122.23|
|  -122.22|   37.86|     7099.0|        1106.0|    1138.0|       NEAR BAY|37.86-122.22|
|  -122.24|   37.85|     1467.0|         190.0|     177.0|       NEAR BAY|37.85-122.24|
|  -122.25|   37.85|     1274.0|         235.0|     219.0|       NEAR BAY|37.85-122.25|
|  -122.25|   37.85|     1627.0|         280.0|     259.0|       NEAR BAY|37.85-122.25|
|  -122.25|   37.85|      919.0|         213.0|     193.0|       NEAR BAY|37.85-122.25|
|  -122.25|   37.84|     2535.0|         489.0|     514.0|       NEAR BAY|37.84-122.25|
|  -122.25|   37.84|     3104.0|         687.0|     647.0|       NEAR BAY|37.84-122.25|
|  -122.26|   37.84|     2555.0|

In [24]:
houses.withColumn('new_col', concat_ws(' # ', houses['latitude'], houses['longitude'])).show(10)

+---------+--------+-----------+--------------+----------+---------------+---------------+
|longitude|latitude|total_rooms|total_bedrooms|households|ocean_proximity|        new_col|
+---------+--------+-----------+--------------+----------+---------------+---------------+
|  -122.23|   37.88|      880.0|         129.0|     126.0|       NEAR BAY|37.88 # -122.23|
|  -122.22|   37.86|     7099.0|        1106.0|    1138.0|       NEAR BAY|37.86 # -122.22|
|  -122.24|   37.85|     1467.0|         190.0|     177.0|       NEAR BAY|37.85 # -122.24|
|  -122.25|   37.85|     1274.0|         235.0|     219.0|       NEAR BAY|37.85 # -122.25|
|  -122.25|   37.85|     1627.0|         280.0|     259.0|       NEAR BAY|37.85 # -122.25|
|  -122.25|   37.85|      919.0|         213.0|     193.0|       NEAR BAY|37.85 # -122.25|
|  -122.25|   37.84|     2535.0|         489.0|     514.0|       NEAR BAY|37.84 # -122.25|
|  -122.25|   37.84|     3104.0|         687.0|     647.0|       NEAR BAY|37.84 # -122.25|

### Remoção

A função usada para se romover colunas já foi utilizada no início do kernel, na sessão de leitura de arquivos, mas vou apresentar aqui. A função usada para isso é drop(). Abaixo, crio duas colunas para depois removê-las.

In [25]:
houses = houses.withColumn('new_col_1', lit(True))
houses = houses.withColumn('new_col_2', lit(False))
houses = houses.drop('new_col_1', 'new_col_2')

time: 41.5 ms (started: 2023-09-27 17:42:36 +00:00)


### Conversão de tipo

In [26]:
houses.withColumn('latitude', col('latitude').cast(FloatType())).show(5)

+---------+--------+-----------+--------------+----------+---------------+
|longitude|latitude|total_rooms|total_bedrooms|households|ocean_proximity|
+---------+--------+-----------+--------------+----------+---------------+
|  -122.23|   37.88|      880.0|         129.0|     126.0|       NEAR BAY|
|  -122.22|   37.86|     7099.0|        1106.0|    1138.0|       NEAR BAY|
|  -122.24|   37.85|     1467.0|         190.0|     177.0|       NEAR BAY|
|  -122.25|   37.85|     1274.0|         235.0|     219.0|       NEAR BAY|
|  -122.25|   37.85|     1627.0|         280.0|     259.0|       NEAR BAY|
+---------+--------+-----------+--------------+----------+---------------+
only showing top 5 rows

time: 291 ms (started: 2023-09-27 17:42:36 +00:00)


Os exemplos a seguir ilustram duas maneiras de se converter strings que representam datas. Primeiro, apresento um forma de extrair os componentes, dia, mês e ano. No bloco de código abaixo, em "birth date", note que o separador desses componentes é um ponto ".".

In [27]:
players.show(3)

+---------+----+------------------+----------+----------+------+------+
|     team|pos.| fifa popular name|birth date|shirt name|height|weight|
+---------+----+------------------+----------+----------+------+------+
|Argentina|  DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|   169|    65|
|Argentina|  MF|    PAVON Cristian|21.01.1996|     PAVÓN|   169|    65|
|Argentina|  MF|    LANZINI Manuel|15.02.1993|   LANZINI|   167|    66|
+---------+----+------------------+----------+----------+------+------+
only showing top 3 rows

time: 206 ms (started: 2023-09-27 17:42:37 +00:00)


In [28]:
dia = udf(lambda date:date.split('.')[0])
mes = udf(lambda date:date.split('.')[1])
ano = udf(lambda date:date.split('.')[2])

players = players.withColumn('dia', dia('birth date'))
players = players.withColumn('mes', mes('birth date'))
players = players.withColumn('ano', ano('birth date'))

# convertendo o tipo string para int
players = players.withColumn('dia', col('dia').cast(IntegerType()))
players = players.withColumn('mes', col('mes').cast(IntegerType()))
players = players.withColumn('ano', col('ano').cast(IntegerType()))

players.show(5)
players.printSchema()

+---------+----+------------------+----------+----------+------+------+---+---+----+
|     team|pos.| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|
+---------+----+------------------+----------+----------+------+------+---+---+----+
|Argentina|  DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|   169|    65| 31|  8|1992|
|Argentina|  MF|    PAVON Cristian|21.01.1996|     PAVÓN|   169|    65| 21|  1|1996|
|Argentina|  MF|    LANZINI Manuel|15.02.1993|   LANZINI|   167|    66| 15|  2|1993|
|Argentina|  DF|    SALVIO Eduardo|13.07.1990|    SALVIO|   167|    69| 13|  7|1990|
|Argentina|  FW|      MESSI Lionel|24.06.1987|     MESSI|   170|    72| 24|  6|1987|
+---------+----+------------------+----------+----------+------+------+---+---+----+
only showing top 5 rows

root
 |-- team: string (nullable = true)
 |-- pos.: string (nullable = true)
 |-- fifa popular name: string (nullable = true)
 |-- birth date: string (nullable = true)
 |-- shirt name: string (nullable = true)


Outra forma de conversão é utilizando a função to_date(). Neste caso não há separação explícita dos componentes e precisamos informar o formato das datas.

In [29]:
players = players.withColumn("birth date", to_date(col("birth date"), "dd.MM.yyyy"))
# ou
players = players.withColumn("birth date", to_date(col("birth date"), "dd.MM.yyyy"))
#players = players.withColumn("birth date", to_date(col("birth date"), "dd.MM.yyyy").cast(DateType()))
players.printSchema()

root
 |-- team: string (nullable = true)
 |-- pos.: string (nullable = true)
 |-- fifa popular name: string (nullable = true)
 |-- birth date: date (nullable = true)
 |-- shirt name: string (nullable = true)
 |-- height: integer (nullable = true)
 |-- weight: integer (nullable = true)
 |-- dia: integer (nullable = true)
 |-- mes: integer (nullable = true)
 |-- ano: integer (nullable = true)

time: 43.2 ms (started: 2023-09-27 17:42:38 +00:00)


## Valores NaN

### Identificação

In [30]:
for column in houses.columns:
  mask = houses['longitude'].isNull()
  nan_amount = houses.filter(mask).count()
  print(f'{column}: {nan_amount}')

longitude: 0
latitude: 0
total_rooms: 0
total_bedrooms: 0
households: 0
ocean_proximity: 0
time: 1.79 s (started: 2023-09-27 17:42:38 +00:00)


# Instruções SQL
Esta sessão contém exemplos de código de funções do PySpark que implementam instruções da linguagem de consulta SQL.

## SELECT

In [31]:
print(type(houses.select(['longitude', 'latitude', 'households'])))
houses.select(['longitude', 'latitude', 'households']).show(5)

<class 'pyspark.sql.dataframe.DataFrame'>
+---------+--------+----------+
|longitude|latitude|households|
+---------+--------+----------+
|  -122.23|   37.88|     126.0|
|  -122.22|   37.86|    1138.0|
|  -122.24|   37.85|     177.0|
|  -122.25|   37.85|     219.0|
|  -122.25|   37.85|     259.0|
+---------+--------+----------+
only showing top 5 rows

time: 289 ms (started: 2023-09-27 17:42:40 +00:00)


Forma alternativa utilizando a função col() que retorna um objeto da classe Column.

In [32]:
print(type(houses.select([col('latitude'), col('longitude'), col('households')])))
houses.select([col('latitude'), col('longitude'), col('households')]).show(5)

<class 'pyspark.sql.dataframe.DataFrame'>
+--------+---------+----------+
|latitude|longitude|households|
+--------+---------+----------+
|   37.88|  -122.23|     126.0|
|   37.86|  -122.22|    1138.0|
|   37.85|  -122.24|     177.0|
|   37.85|  -122.25|     219.0|
|   37.85|  -122.25|     259.0|
+--------+---------+----------+
only showing top 5 rows

time: 172 ms (started: 2023-09-27 17:42:40 +00:00)


As principais funções de estatística descritiva disponíveis no PySpark são:
*   min()
*   max()
*   count()
*   std() ou stddev()
*   mode()

Abaixo, um exemplo de como se obter o menor valor de uma coluna.

In [33]:
houses.select(min('households')).show(1)

+---------------+
|min(households)|
+---------------+
|            1.0|
+---------------+

time: 320 ms (started: 2023-09-27 17:42:41 +00:00)


## WHERE
A string passada como argumento da função where() deve seguir o padrão SQL. Note que nessa string o nome da coluna não fica entre àspas, mas o valor, sim. Além disso, o sinal de igualdade é apenas um "=". Isso contece porque em SQL igualdades são verificadas dessa maneira.

Uma alternativa é a função filter() que pode ser usada da mesma forma.

In [34]:
players.where("team = 'Brazil'").show(5)

+------+----+-----------------+----------+-----------+------+------+---+---+----+
|  team|pos.|fifa popular name|birth date| shirt name|height|weight|dia|mes| ano|
+------+----+-----------------+----------+-----------+------+------+---+---+----+
|Brazil|  MF|             FRED|1993-03-05|       FRED|   169|    64|  5|  3|1993|
|Brazil|  FW|           TAISON|1988-01-13|     TAISON|   172|    64| 13|  1|1988|
|Brazil|  MF|      FERNANDINHO|1985-05-04|FERNANDINHO|   179|    67|  4|  5|1985|
|Brazil|  DF|           FAGNER|1989-06-11|     FAGNER|   168|    67| 11|  6|1989|
|Brazil|  FW|           NEYMAR|1992-02-05|  NEYMAR JR|   175|    68|  5|  2|1992|
+------+----+-----------------+----------+-----------+------+------+---+---+----+
only showing top 5 rows

time: 549 ms (started: 2023-09-27 17:42:41 +00:00)


In [35]:
mask = players['team'] == 'Argentina'
print(mask)
players.where(mask).show(5)

Column<'(team = Argentina)'>
+---------+----+------------------+----------+----------+------+------+---+---+----+
|     team|pos.| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|
+---------+----+------------------+----------+----------+------+------+---+---+----+
|Argentina|  DF|TAGLIAFICO Nicolas|1992-08-31|TAGLIAFICO|   169|    65| 31|  8|1992|
|Argentina|  MF|    PAVON Cristian|1996-01-21|     PAVÓN|   169|    65| 21|  1|1996|
|Argentina|  MF|    LANZINI Manuel|1993-02-15|   LANZINI|   167|    66| 15|  2|1993|
|Argentina|  DF|    SALVIO Eduardo|1990-07-13|    SALVIO|   167|    69| 13|  7|1990|
|Argentina|  FW|      MESSI Lionel|1987-06-24|     MESSI|   170|    72| 24|  6|1987|
+---------+----+------------------+----------+----------+------+------+---+---+----+
only showing top 5 rows

time: 327 ms (started: 2023-09-27 17:42:41 +00:00)


In [36]:
mask = (col('shirt name') == 'MESSI')
print(mask)
players.filter(mask).show(5)

Column<'(shirt name = MESSI)'>
+---------+----+-----------------+----------+----------+------+------+---+---+----+
|     team|pos.|fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|
+---------+----+-----------------+----------+----------+------+------+---+---+----+
|Argentina|  FW|     MESSI Lionel|1987-06-24|     MESSI|   170|    72| 24|  6|1987|
+---------+----+-----------------+----------+----------+------+------+---+---+----+

time: 466 ms (started: 2023-09-27 17:42:42 +00:00)


### Filtros compostos

In [37]:
mask = ("team = 'Brazil' AND height < 170")
print(mask)
players.where(mask).show(5)

team = 'Brazil' AND height < 170
+------+----+-----------------+----------+----------+------+------+---+---+----+
|  team|pos.|fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|
+------+----+-----------------+----------+----------+------+------+---+---+----+
|Brazil|  MF|             FRED|1993-03-05|      FRED|   169|    64|  5|  3|1993|
|Brazil|  DF|           FAGNER|1989-06-11|    FAGNER|   168|    67| 11|  6|1989|
+------+----+-----------------+----------+----------+------+------+---+---+----+

time: 414 ms (started: 2023-09-27 17:42:42 +00:00)


In [38]:
mask = (col('team') == 'Brazil') & (col('height') < 170)
print(mask)
players.where(mask).show(5)

Column<'((team = Brazil) AND (height < 170))'>
+------+----+-----------------+----------+----------+------+------+---+---+----+
|  team|pos.|fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|
+------+----+-----------------+----------+----------+------+------+---+---+----+
|Brazil|  MF|             FRED|1993-03-05|      FRED|   169|    64|  5|  3|1993|
|Brazil|  DF|           FAGNER|1989-06-11|    FAGNER|   168|    67| 11|  6|1989|
+------+----+-----------------+----------+----------+------+------+---+---+----+

time: 451 ms (started: 2023-09-27 17:42:43 +00:00)


## GROUP BY

Tendo uma coluna como referência, todas as linhas onde os valores dessa coluna são iguais são "colapsadas" em apenas uma. É preciso especificar o que deve ser feito com as outras colunas caso contrário elas serão ignoradas. Normalmente aplicamos funções de estatística descritiva.

In [39]:
players.groupBy('team').mean('weight').orderBy('avg(weight)', ascending=True).show(10)

+--------------+-----------------+
|          team|      avg(weight)|
+--------------+-----------------+
|         Japan|71.52173913043478|
|  Saudi Arabia|73.04347826086956|
|      Portugal| 73.6086956521739|
|        Mexico|74.08695652173913|
|    Costa Rica| 74.1304347826087|
|Korea Republic|74.43478260869566|
|       Uruguay| 74.6086956521739|
|       Morocco|74.65217391304348|
|         Spain|74.73913043478261|
|       Tunisia|             75.0|
+--------------+-----------------+
only showing top 10 rows

time: 748 ms (started: 2023-09-27 17:42:43 +00:00)


Para especificar qual função de agregação deve ser usada em cada coluna podemos usar a função agg().

In [40]:
players.groupBy('team').agg({'weight':'avg', 'dia':'min', 'height':'max'}).orderBy('max(height)', ascending=False).show(10)

+--------------+-----------------+--------+-----------+
|          team|      avg(weight)|min(dia)|max(height)|
+--------------+-----------------+--------+-----------+
|       Croatia|79.30434782608695|       2|        201|
|       Denmark| 82.6086956521739|       1|        200|
|     Argentina|75.56521739130434|       2|        199|
|       Belgium|79.56521739130434|       2|        199|
|        Sweden|78.82608695652173|       2|        198|
|       Iceland|80.73913043478261|       1|        198|
|Korea Republic|74.43478260869566|       3|        197|
|       Nigeria|80.47826086956522|       1|        197|
|        Panama|             80.0|       1|        197|
|        France|             80.0|       3|        197|
+--------------+-----------------+--------+-----------+
only showing top 10 rows

time: 767 ms (started: 2023-09-27 17:42:44 +00:00)


In [41]:
players.groupBy('team').agg(avg('height'), min('height'), max('height')).orderBy('avg(height)', ascending=False).show(20)

+--------------+------------------+-----------+-----------+
|          team|       avg(height)|min(height)|max(height)|
+--------------+------------------+-----------+-----------+
|        Serbia|186.69565217391303|        169|        195|
|       Denmark| 186.6086956521739|        171|        200|
|       Germany| 185.7826086956522|        176|        195|
|        Sweden| 185.7391304347826|        177|        198|
|       Iceland|185.52173913043478|        170|        198|
|       Belgium|185.34782608695653|        169|        199|
|       Croatia| 185.2608695652174|        172|        201|
|       Nigeria|184.52173913043478|        172|        197|
|       IR Iran|184.47826086956522|        177|        194|
|        Russia| 184.3913043478261|        173|        196|
|       Senegal|183.65217391304347|        173|        196|
|        France|183.30434782608697|        168|        197|
|        Poland|183.17391304347825|        172|        195|
|       Tunisia|183.08695652173913|     

In [42]:
players.groupBy('team').agg(avg('weight')).orderBy('avg(weight)', ascending=True).show(10)

+--------------+-----------------+
|          team|      avg(weight)|
+--------------+-----------------+
|         Japan|71.52173913043478|
|  Saudi Arabia|73.04347826086956|
|      Portugal| 73.6086956521739|
|        Mexico|74.08695652173913|
|    Costa Rica| 74.1304347826087|
|Korea Republic|74.43478260869566|
|       Uruguay| 74.6086956521739|
|       Morocco|74.65217391304348|
|         Spain|74.73913043478261|
|       Tunisia|             75.0|
+--------------+-----------------+
only showing top 10 rows

time: 657 ms (started: 2023-09-27 17:42:45 +00:00)


## PARTITION BY
Tem o conceito muito parecido com groupby, mas em partitionBy(), as linhas iguais são agrupadas formando agrupamentos de instâncias de dados baseados em uma ou mais colunas.

*   row_number()
*   rank()
*   dense_rank()
*   persent_rank()
*   ntile()

**Obs.** A função orderBy() usada com Window.partitionBy() não é a mesma usada com as funções de agregação de groupBy(). Enquanto essa retorna um DataFrame o outro cria uma WindowSpec.

row_number(): Cria  uma coluna com contagem de linhas começando por 1.

In [43]:
prt = Window.partitionBy('team').orderBy(desc('height'))
print(type(prt))
print(type(row_number()))
players.withColumn('row', row_number().over(prt)).show(10)

<class 'pyspark.sql.window.WindowSpec'>
<class 'pyspark.sql.column.Column'>
+---------+----+------------------+----------+----------+------+------+---+---+----+---+
|     team|pos.| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|row|
+---------+----+------------------+----------+----------+------+------+---+---+----+---+
|Argentina|  DF|    FAZIO Federico|1987-03-17|     FAZIO|   199|    85| 17|  3|1987|  1|
|Argentina|  GK|     GUZMAN Nahuel|1986-02-10|    GUZMÁN|   192|    90| 10|  2|1986|  2|
|Argentina|  DF|       ROJO Marcos|1990-03-20|      ROJO|   189|    82| 20|  3|1990|  3|
|Argentina|  GK|     ARMANI Franco|1986-10-16|    ARMANI|   189|    85| 16| 10|1986|  4|
|Argentina|  GK|CABALLERO Wilfredo|1981-09-28| CABALLERO|   186|    80| 28|  9|1981|  5|
|Argentina|  FW|   HIGUAIN Gonzalo|1987-12-10|   HIGUAÍN|   184|    75| 10| 12|1987|  6|
|Argentina|  DF|  ANSALDI Cristian|1986-09-20|   ANSALDI|   181|    73| 20|  9|1986|  7|
|Argentina|  DF|   MERCADO Gabriel

In [44]:
# Selecionar os atletas mais altos de cada time.
prt = Window.partitionBy('team').orderBy(desc('height'))
players.withColumn('top', row_number().over(prt)).where("top = 1").show(10)

+----------+----+------------------+----------+-----------+------+------+---+---+----+---+
|      team|pos.| fifa popular name|birth date| shirt name|height|weight|dia|mes| ano|top|
+----------+----+------------------+----------+-----------+------+------+---+---+----+---+
| Argentina|  DF|    FAZIO Federico|1987-03-17|      FAZIO|   199|    85| 17|  3|1987|  1|
| Australia|  GK|        JONES Brad|1982-03-19|      JONES|   193|    87| 19|  3|1982|  1|
|   Belgium|  GK|  COURTOIS Thibaut|1992-05-11|   COURTOIS|   199|    91| 11|  5|1992|  1|
|    Brazil|  GK|            CASSIO|1987-06-06|     CASSIO|   195|    92|  6|  6|1987|  1|
|  Colombia|  DF|        MINA Yerry|1994-09-23|    Y. MINA|   194|    95| 23|  9|1994|  1|
|Costa Rica|  DF|    WASTON Kendall|1988-01-01|  K. WASTON|   196|    87|  1|  1|1988|  1|
|   Croatia|  GK|     KALINIC Lovre|1990-04-03| L. KALINIĆ|   201|    96|  3|  4|1990|  1|
|   Denmark|  DF|VESTERGAARD Jannik|1992-08-03|VESTERGAARD|   200|    98|  3|  8|1992|  1|

rank(): Note como rank=3 se repete duas vezes e depois há um salto para rank=5. Esta é uma peculiaridade dessa função.

In [45]:
prt = Window.partitionBy('team').orderBy(desc('height'))
players.withColumn('rank', rank().over(prt)).show(10)

+---------+----+------------------+----------+----------+------+------+---+---+----+----+
|     team|pos.| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|rank|
+---------+----+------------------+----------+----------+------+------+---+---+----+----+
|Argentina|  DF|    FAZIO Federico|1987-03-17|     FAZIO|   199|    85| 17|  3|1987|   1|
|Argentina|  GK|     GUZMAN Nahuel|1986-02-10|    GUZMÁN|   192|    90| 10|  2|1986|   2|
|Argentina|  DF|       ROJO Marcos|1990-03-20|      ROJO|   189|    82| 20|  3|1990|   3|
|Argentina|  GK|     ARMANI Franco|1986-10-16|    ARMANI|   189|    85| 16| 10|1986|   3|
|Argentina|  GK|CABALLERO Wilfredo|1981-09-28| CABALLERO|   186|    80| 28|  9|1981|   5|
|Argentina|  FW|   HIGUAIN Gonzalo|1987-12-10|   HIGUAÍN|   184|    75| 10| 12|1987|   6|
|Argentina|  DF|  ANSALDI Cristian|1986-09-20|   ANSALDI|   181|    73| 20|  9|1986|   7|
|Argentina|  DF|   MERCADO Gabriel|1987-03-18|   MERCADO|   181|    81| 18|  3|1987|   7|
|Argentina

dense_rank(): Aqui, mesmo que rank=3 se repita o próximo valor de rank é 4 e assim por diante. Não há saltos de valores.

In [46]:
prt = Window.partitionBy('team').orderBy(desc('height'))
players.withColumn('dense_rank', dense_rank().over(prt)).show(10)

+---------+----+------------------+----------+----------+------+------+---+---+----+----------+
|     team|pos.| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|dense_rank|
+---------+----+------------------+----------+----------+------+------+---+---+----+----------+
|Argentina|  DF|    FAZIO Federico|1987-03-17|     FAZIO|   199|    85| 17|  3|1987|         1|
|Argentina|  GK|     GUZMAN Nahuel|1986-02-10|    GUZMÁN|   192|    90| 10|  2|1986|         2|
|Argentina|  DF|       ROJO Marcos|1990-03-20|      ROJO|   189|    82| 20|  3|1990|         3|
|Argentina|  GK|     ARMANI Franco|1986-10-16|    ARMANI|   189|    85| 16| 10|1986|         3|
|Argentina|  GK|CABALLERO Wilfredo|1981-09-28| CABALLERO|   186|    80| 28|  9|1981|         4|
|Argentina|  FW|   HIGUAIN Gonzalo|1987-12-10|   HIGUAÍN|   184|    75| 10| 12|1987|         5|
|Argentina|  DF|  ANSALDI Cristian|1986-09-20|   ANSALDI|   181|    73| 20|  9|1986|         6|
|Argentina|  DF|   MERCADO Gabriel|1987-

persent_rank(): ranking relativo(percentual)

In [47]:
prt = Window.partitionBy('team').orderBy(desc('height'))
players.withColumn('persent_rank', percent_rank().over(prt)).show(10)

+---------+----+------------------+----------+----------+------+------+---+---+----+--------------------+
|     team|pos.| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|        persent_rank|
+---------+----+------------------+----------+----------+------+------+---+---+----+--------------------+
|Argentina|  DF|    FAZIO Federico|1987-03-17|     FAZIO|   199|    85| 17|  3|1987|                 0.0|
|Argentina|  GK|     GUZMAN Nahuel|1986-02-10|    GUZMÁN|   192|    90| 10|  2|1986|0.045454545454545456|
|Argentina|  DF|       ROJO Marcos|1990-03-20|      ROJO|   189|    82| 20|  3|1990| 0.09090909090909091|
|Argentina|  GK|     ARMANI Franco|1986-10-16|    ARMANI|   189|    85| 16| 10|1986| 0.09090909090909091|
|Argentina|  GK|CABALLERO Wilfredo|1981-09-28| CABALLERO|   186|    80| 28|  9|1981| 0.18181818181818182|
|Argentina|  FW|   HIGUAIN Gonzalo|1987-12-10|   HIGUAÍN|   184|    75| 10| 12|1987| 0.22727272727272727|
|Argentina|  DF|  ANSALDI Cristian|1986-09-20|

ntile(): Divide cada partição em uma quantidade n de quartiles. Cada quartile recebe um valor único. Caso uma partição não seja divisível por n o aloritmo ajustará a quantidade de instâncias de dados pertencentes aos últimos quartiles de modo que a partição tenha n quartiles. Por exemplo, na seleção da Argentina, quando n=5, o penúltimo quartil tem 4 instâncias para poder formar mais um, o último quartil que terá apenas uma instância.

In [48]:
prt = Window.partitionBy('team').orderBy(desc('height'))
players.withColumn('ntile', ntile(5).over(prt)).show(20)

+---------+----+------------------+----------+----------+------+------+---+---+----+-----+
|     team|pos.| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|ntile|
+---------+----+------------------+----------+----------+------+------+---+---+----+-----+
|Argentina|  DF|    FAZIO Federico|1987-03-17|     FAZIO|   199|    85| 17|  3|1987|    1|
|Argentina|  GK|     GUZMAN Nahuel|1986-02-10|    GUZMÁN|   192|    90| 10|  2|1986|    1|
|Argentina|  DF|       ROJO Marcos|1990-03-20|      ROJO|   189|    82| 20|  3|1990|    1|
|Argentina|  GK|     ARMANI Franco|1986-10-16|    ARMANI|   189|    85| 16| 10|1986|    1|
|Argentina|  GK|CABALLERO Wilfredo|1981-09-28| CABALLERO|   186|    80| 28|  9|1981|    1|
|Argentina|  FW|   HIGUAIN Gonzalo|1987-12-10|   HIGUAÍN|   184|    75| 10| 12|1987|    2|
|Argentina|  DF|  ANSALDI Cristian|1986-09-20|   ANSALDI|   181|    73| 20|  9|1986|    2|
|Argentina|  DF|   MERCADO Gabriel|1987-03-18|   MERCADO|   181|    81| 18|  3|1987|    2|

Lag function: O mesmo tipo de lag usado em séries temporais.

In [49]:
prt = Window.partitionBy('team').orderBy(desc('height'))
players.withColumn('lag', lag('weight', offset=2).over(prt)).show(10)

+---------+----+------------------+----------+----------+------+------+---+---+----+----+
|     team|pos.| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano| lag|
+---------+----+------------------+----------+----------+------+------+---+---+----+----+
|Argentina|  DF|    FAZIO Federico|1987-03-17|     FAZIO|   199|    85| 17|  3|1987|NULL|
|Argentina|  GK|     GUZMAN Nahuel|1986-02-10|    GUZMÁN|   192|    90| 10|  2|1986|NULL|
|Argentina|  DF|       ROJO Marcos|1990-03-20|      ROJO|   189|    82| 20|  3|1990|  85|
|Argentina|  GK|     ARMANI Franco|1986-10-16|    ARMANI|   189|    85| 16| 10|1986|  90|
|Argentina|  GK|CABALLERO Wilfredo|1981-09-28| CABALLERO|   186|    80| 28|  9|1981|  82|
|Argentina|  FW|   HIGUAIN Gonzalo|1987-12-10|   HIGUAÍN|   184|    75| 10| 12|1987|  85|
|Argentina|  DF|  ANSALDI Cristian|1986-09-20|   ANSALDI|   181|    73| 20|  9|1986|  80|
|Argentina|  DF|   MERCADO Gabriel|1987-03-18|   MERCADO|   181|    81| 18|  3|1987|  75|
|Argentina

Forward function: O mesmo tipo de forward usado em séries temporais.

In [50]:
prt = Window.partitionBy('team').orderBy(desc('height'))
players.withColumn('lead', lead('weight', offset=1).over(prt)).show(10)

+---------+----+------------------+----------+----------+------+------+---+---+----+----+
|     team|pos.| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|lead|
+---------+----+------------------+----------+----------+------+------+---+---+----+----+
|Argentina|  DF|    FAZIO Federico|1987-03-17|     FAZIO|   199|    85| 17|  3|1987|  90|
|Argentina|  GK|     GUZMAN Nahuel|1986-02-10|    GUZMÁN|   192|    90| 10|  2|1986|  82|
|Argentina|  DF|       ROJO Marcos|1990-03-20|      ROJO|   189|    82| 20|  3|1990|  85|
|Argentina|  GK|     ARMANI Franco|1986-10-16|    ARMANI|   189|    85| 16| 10|1986|  80|
|Argentina|  GK|CABALLERO Wilfredo|1981-09-28| CABALLERO|   186|    80| 28|  9|1981|  75|
|Argentina|  FW|   HIGUAIN Gonzalo|1987-12-10|   HIGUAÍN|   184|    75| 10| 12|1987|  73|
|Argentina|  DF|  ANSALDI Cristian|1986-09-20|   ANSALDI|   181|    73| 20|  9|1986|  81|
|Argentina|  DF|   MERCADO Gabriel|1987-03-18|   MERCADO|   181|    81| 18|  3|1987|  81|
|Argentina

## DISTINCT

In [51]:
players.select('team').distinct().show(5)

+-------+
|   team|
+-------+
| Russia|
|Senegal|
| Sweden|
|IR Iran|
|Germany|
+-------+
only showing top 5 rows

time: 451 ms (started: 2023-09-27 17:42:52 +00:00)


Número de valores únicos em um atributo.

In [52]:
nunique = players.select('team').distinct().count()
print(f'unique values: {nunique}')

unique values: 32
time: 277 ms (started: 2023-09-27 17:42:52 +00:00)


## COLLECT
Salva o resultado de uma consulta em uma lista.

In [53]:
players.select('team').distinct().collect()

[Row(team='Russia'),
 Row(team='Senegal'),
 Row(team='Sweden'),
 Row(team='IR Iran'),
 Row(team='Germany'),
 Row(team='France'),
 Row(team='Argentina'),
 Row(team='Belgium'),
 Row(team='Peru'),
 Row(team='Croatia'),
 Row(team='Nigeria'),
 Row(team='Korea Republic'),
 Row(team='Spain'),
 Row(team='Denmark'),
 Row(team='Morocco'),
 Row(team='Panama'),
 Row(team='Iceland'),
 Row(team='Uruguay'),
 Row(team='Mexico'),
 Row(team='Tunisia'),
 Row(team='Saudi Arabia'),
 Row(team='Switzerland'),
 Row(team='Brazil'),
 Row(team='Japan'),
 Row(team='England'),
 Row(team='Poland'),
 Row(team='Portugal'),
 Row(team='Australia'),
 Row(team='Costa Rica'),
 Row(team='Egypt'),
 Row(team='Serbia'),
 Row(team='Colombia')]

time: 283 ms (started: 2023-09-27 17:42:53 +00:00)


O resultado anterior é uma lista de objetos Row. Caso seja necessário apenas o nome do país podemos usar o código abaixo.

In [54]:
result = players.select('team').distinct().collect()
countries = [row[0] for row in result]
print(countries)

['Russia', 'Senegal', 'Sweden', 'IR Iran', 'Germany', 'France', 'Argentina', 'Belgium', 'Peru', 'Croatia', 'Nigeria', 'Korea Republic', 'Spain', 'Denmark', 'Morocco', 'Panama', 'Iceland', 'Uruguay', 'Mexico', 'Tunisia', 'Saudi Arabia', 'Switzerland', 'Brazil', 'Japan', 'England', 'Poland', 'Portugal', 'Australia', 'Costa Rica', 'Egypt', 'Serbia', 'Colombia']
time: 226 ms (started: 2023-09-27 17:42:53 +00:00)


## CASE/WHEN/THEN
Caso queiramos usar essas intruções com PySpark podemos usar as funções when() e otherwise(). CASE e THEN ficam implícitas. Também podemos dizer que são o if/else deste framework.

In [55]:
val = when(condition=(col('team') == 'Argentina'), value='Argentinos').otherwise(value='Normais')
players.withColumn('new_col', val).show()

+---------+----+------------------+----------+----------+------+------+---+---+----+----------+
|     team|pos.| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|   new_col|
+---------+----+------------------+----------+----------+------+------+---+---+----+----------+
|Argentina|  DF|TAGLIAFICO Nicolas|1992-08-31|TAGLIAFICO|   169|    65| 31|  8|1992|Argentinos|
|Argentina|  MF|    PAVON Cristian|1996-01-21|     PAVÓN|   169|    65| 21|  1|1996|Argentinos|
|Argentina|  MF|    LANZINI Manuel|1993-02-15|   LANZINI|   167|    66| 15|  2|1993|Argentinos|
|Argentina|  DF|    SALVIO Eduardo|1990-07-13|    SALVIO|   167|    69| 13|  7|1990|Argentinos|
|Argentina|  FW|      MESSI Lionel|1987-06-24|     MESSI|   170|    72| 24|  6|1987|Argentinos|
|Argentina|  DF|  ANSALDI Cristian|1986-09-20|   ANSALDI|   181|    73| 20|  9|1986|Argentinos|
|Argentina|  MF|      BIGLIA Lucas|1986-01-30|    BIGLIA|   175|    73| 30|  1|1986|Argentinos|
|Argentina|  MF|       BANEGA Ever|1988-

In [56]:
africa = ['Senegal', 'Morocco', 'Tunisia', 'Egypt']
america_norte = ['Panama', 'Mexico', 'Costa Rica']
america_sul = ['Argentina', 'Peru', 'Uruguay', 'Brazil', 'Colombia']
asia = ['Russia', 'IR Iran', 'Nigeria', 'Korea Republic', 'Saudi Arabia', 'Japan', ]
europa = ['Sweden', 'Germany', 'France', 'Belgium', 'Croatia', 'Spain', 'Denmark', 'Iceland', 'Switzerland', 'England', 'Poland', 'Portugal', 'Serbia']
oceania = ['Australia']

val = when(condition=(col('team').isin(africa)), value=('europeu'))\
      .when(condition=(col('team').isin(america_norte)), value=('n_americano'))\
      .when(condition=(col('team').isin(america_sul)), value=('s_americano'))\
      .when(condition=(col('team').isin(asia)), value=('asiatico'))\
      .when(condition=(col('team').isin(europa)), value=('europeu'))\
      .when(condition=(col('team').isin(oceania)), value=('oceanicos'))\
      .otherwise('desconhecidos')

players.withColumn('new_col', val).sample(fraction=0.01).show(5)

+--------------+----+-----------------+----------+----------+------+------+---+---+----+-----------+
|          team|pos.|fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|    new_col|
+--------------+----+-----------------+----------+----------+------+------+---+---+----+-----------+
|      Colombia|  DF| SANCHEZ Davinson|1996-06-12|D. SANCHEZ|   187|    81| 12|  6|1996|s_americano|
|    Costa Rica|  DF|    ACOSTA Johnny|1983-07-21| J. ACOSTA|   176|    75| 21|  7|1983|n_americano|
|Korea Republic|  DF|        OH Bansuk|1988-05-20|    B S OH|   189|    79| 20|  5|1988|   asiatico|
|          Peru|  DF|       CORZO Aldo|1989-05-20|     CORZO|   172|    75| 20|  5|1989|s_americano|
|  Saudi Arabia|  DF| YASIR ALSHAHRANI|1992-05-25|     YASIR|   170|    63| 25|  5|1992|   asiatico|
+--------------+----+-----------------+----------+----------+------+------+---+---+----+-----------+
only showing top 5 rows

time: 659 ms (started: 2023-09-27 17:42:54 +00:00)


## UNION

A função union() verifica unicamente a quantidade colunas que os dataframes envolvidos possuem. Caso eles tenham a mesma quantidade a função concatenará um embaixo do outro. Ou seja, considerando um dataframe df_x e outro df_y, ela concatenará a primeira coluna de df_x com a primeira coluna de df_y, a segunda de df_x com a segunda de df_y, e assim por diante. Portanto, union() não verifica os tipos de dados nem os nomes das colunas. Para que o resultado faça sentido o programador deve fazer essas verificações.

Vou criar dois dataframes com países americanos e concatená-los formando apenas um. Mas primeiro, preciso criar um novo atributo com os continentes os quais cada país pertence.

In [57]:
players = players.withColumn('continent', val)
players.sample(fraction=0.01).show(5)

+--------------+----+-----------------+----------+----------+------+------+---+---+----+---------+
|          team|pos.|fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|continent|
+--------------+----+-----------------+----------+----------+------+------+---+---+----+---------+
|       Denmark|  FW|   DOLBERG Kasper|1997-10-06|   DOLBERG|   187|    85|  6| 10|1997|  europeu|
|       Germany|  DF|   BOATENG Jerome|1988-09-03|   BOATENG|   192|    90|  3|  9|1988|  europeu|
|Korea Republic|  MF|      LEE Jaesung|1992-08-10|   J S LEE|   180|    70| 10|  8|1992| asiatico|
|  Saudi Arabia|  MF|  SALEM ALDAWSARI|1991-08-19|     SALEM|   174|    68| 19|  8|1991| asiatico|
|   Switzerland|  MF|  DZEMAILI Blerim|1986-04-12|  DZEMAILI|   180|    77| 12|  4|1986|  europeu|
+--------------+----+-----------------+----------+----------+------+------+---+---+----+---------+

time: 352 ms (started: 2023-09-27 17:42:54 +00:00)


Agora, vou criar um dataframe com os países da América do Sul e outro com os da América do Norte.

In [58]:
s_america = players.where("continent = 's_americano'")
n_america = players.where("continent = 'n_americano'")

df_america = s_america.union(n_america)

time: 71.1 ms (started: 2023-09-27 17:42:55 +00:00)


In [59]:
print(s_america.count())
print(n_america.count())
print(df_america.count())
df_america.sample(fraction=0.04).show(15)

115
69
184
+----------+----+-----------------+----------+------------+------+------+---+---+----+-----------+
|      team|pos.|fifa popular name|birth date|  shirt name|height|weight|dia|mes| ano|  continent|
+----------+----+-----------------+----------+------------+------+------+---+---+----+-----------+
|  Colombia|  DF|   ARIAS Santiago|1992-01-13|       ARIAS|   177|    71| 13|  1|1992|s_americano|
|      Peru|  MF|     AQUINO Pedro|1995-04-13|      AQUINO|   174|    71| 13|  4|1995|s_americano|
|   Uruguay|  GK|   CAMPANA Martin|1989-05-29|     CAMPAÑA|   184|    79| 29|  5|1989|s_americano|
|Costa Rica|  MF|   WALLACE Rodney|1988-06-17|  R. WALLACE|   180|    70| 17|  6|1988|n_americano|
|Costa Rica|  GK|PEMBERTON Patrick|1982-04-24|P. PEMBERTON|   178|    72| 24|  4|1982|n_americano|
|Costa Rica|  MF| COLINDRES Daniel|1985-01-10|D. COLINDRES|   180|    75| 10|  1|1985|n_americano|
|    Mexico|  FW|      VELA Carlos|1989-03-01|    CARLOS V|   178|    78|  1|  3|1989|n_americano|

## JOIN

In [60]:
clients.show(3)
products.show(3)
sales.show(3)
stores.show(3)

+---------+--------------+------------+------------+-------------+
|client_id|   client_city|client_state|client_birth|client_gender|
+---------+--------------+------------+------------+-------------+
|    14001|      Curitiba|          PR|   6/28/1985|        Homem|
|    14002| Florianópolis|          SC|   1/10/1987|        Homem|
|    14003|Rio de Janeiro|          RJ|   11/5/1979|        Homem|
+---------+--------------+------------+------------+-------------+
only showing top 3 rows

+--------------------+---------------+------------+
|          product_id|   product_name|product_size|
+--------------------+---------------+------------+
|00066f42aeeb9f300...|Capitão América|           P|
|00066f42aeeb9f300...|Capitão América|           M|
|00066f42aeeb9f300...|Capitão América|           G|
+--------------------+---------------+------------+
only showing top 3 rows

+--------------------+--------------------+---------+--------+----------+--------+--------+----------+
|             

A condição de igualdade entre chave primária e extrangeira fica implicita utilizando apenas uma string ou lista de strings no parâmetro 'on'. Portanto, o nome da coluna deve ser o mesmo em ambos dataframes.

In [61]:
# removo algumas colunas com drop() para a visualização do resultado ficar mais legível.
sales.join(stores, on='store_id').drop('product_id', 'id', 'date', 'quantity').show(5)

+--------+---------+--------+----------+-------------+-----------+
|store_id|client_id|discount|unit_price|   store_city|store_state|
+--------+---------+--------+----------+-------------+-----------+
|       4|    14001|    0,08|     249,2|     Curitiba|         PR|
|       4|    14001|     0,1|     162,4|     Curitiba|         PR|
|       4|    14001|     0,1|     194,6|     Curitiba|         PR|
|       1|    14002|     0,1|     201,6|Florianópolis|         SC|
|       1|    14002|     0,1|       406|Florianópolis|         SC|
+--------+---------+--------+----------+-------------+-----------+
only showing top 5 rows

time: 558 ms (started: 2023-09-27 17:42:57 +00:00)


Caso for necessário mais de uma chave primária, podemos usar uma lista de chaves.

In [62]:
# removo algumas colunas com drop() para a visualização do resultado ficar mais legível.
sales.join(stores, on=['store_id']).drop('product_id', 'id', 'date', 'quantity').show(5)

+--------+---------+--------+----------+-------------+-----------+
|store_id|client_id|discount|unit_price|   store_city|store_state|
+--------+---------+--------+----------+-------------+-----------+
|       4|    14001|    0,08|     249,2|     Curitiba|         PR|
|       4|    14001|     0,1|     162,4|     Curitiba|         PR|
|       4|    14001|     0,1|     194,6|     Curitiba|         PR|
|       1|    14002|     0,1|     201,6|Florianópolis|         SC|
|       1|    14002|     0,1|       406|Florianópolis|         SC|
+--------+---------+--------+----------+-------------+-----------+
only showing top 5 rows

time: 318 ms (started: 2023-09-27 17:42:58 +00:00)


Podemos definir explicitamente a condição de igualdade com o código abaixo.

In [63]:
# removo algumas colunas com drop() para a visualização do resultado ficar mais legível.
mask = sales['store_id'] == stores['store_id']
sales.join(stores, on=mask).drop('product_id', 'id', 'date', 'quantity').show(5)

+---------+--------+----------+--------+--------+-------------+-----------+
|client_id|discount|unit_price|store_id|store_id|   store_city|store_state|
+---------+--------+----------+--------+--------+-------------+-----------+
|    14001|    0,08|     249,2|       4|       4|     Curitiba|         PR|
|    14001|     0,1|     162,4|       4|       4|     Curitiba|         PR|
|    14001|     0,1|     194,6|       4|       4|     Curitiba|         PR|
|    14002|     0,1|     201,6|       1|       1|Florianópolis|         SC|
|    14002|     0,1|       406|       1|       1|Florianópolis|         SC|
+---------+--------+----------+--------+--------+-------------+-----------+
only showing top 5 rows

time: 442 ms (started: 2023-09-27 17:42:58 +00:00)


### Tipos de JOIN
A função join() do PySpark suporta os seguintes tipos de join: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti e left_anti.

Para escolher qual tipo vamos usar devemos passar o seu nome como argumento no parâmetro "how" da função join().

In [64]:
sales.join(products, on=['product_id'], how="inner").drop('product_id', 'id', 'date', 'quantity').show(5)

+---------+--------+----------+--------+---------------+------------+
|client_id|discount|unit_price|store_id|   product_name|product_size|
+---------+--------+----------+--------+---------------+------------+
|    14001|    0,08|     249,2|       4|     Tempestade|           G|
|    14001|     0,1|     162,4|       4|         Thanos|           G|
|    14001|     0,1|     194,6|       4|Capitão América|           P|
|    14002|     0,1|     201,6|       1|      Wolverine|           G|
|    14002|     0,1|       406|       1|         Naruto|           G|
+---------+--------+----------+--------+---------------+------------+
only showing top 5 rows

time: 844 ms (started: 2023-09-27 17:42:58 +00:00)


In [65]:
sales.join(products, on=['product_id'], how="left").drop('product_id', 'id', 'date', 'quantity').show(5)

+---------+--------+----------+--------+---------------+------------+
|client_id|discount|unit_price|store_id|   product_name|product_size|
+---------+--------+----------+--------+---------------+------------+
|    14001|    0,08|     249,2|       4|     Tempestade|           G|
|    14001|     0,1|     162,4|       4|         Thanos|           G|
|    14001|     0,1|     194,6|       4|Capitão América|           P|
|    14002|     0,1|     201,6|       1|      Wolverine|           G|
|    14002|     0,1|       406|       1|         Naruto|           G|
+---------+--------+----------+--------+---------------+------------+
only showing top 5 rows

time: 786 ms (started: 2023-09-27 17:42:59 +00:00)


In [66]:
sales.join(clients, on=['client_id'], how="right").drop('product_id', 'id', 'date', 'quantity').show(5)

+---------+--------+----------+--------+-------------+------------+------------+-------------+
|client_id|discount|unit_price|store_id|  client_city|client_state|client_birth|client_gender|
+---------+--------+----------+--------+-------------+------------+------------+-------------+
|    14001|     0,1|     194,6|       4|     Curitiba|          PR|   6/28/1985|        Homem|
|    14001|     0,1|     162,4|       4|     Curitiba|          PR|   6/28/1985|        Homem|
|    14001|    0,08|     249,2|       4|     Curitiba|          PR|   6/28/1985|        Homem|
|    14002|    0,08|     114,8|       1|Florianópolis|          SC|   1/10/1987|        Homem|
|    14002|    0,08|     261,8|       1|Florianópolis|          SC|   1/10/1987|        Homem|
+---------+--------+----------+--------+-------------+------------+------------+-------------+
only showing top 5 rows

time: 1.26 s (started: 2023-09-27 17:43:00 +00:00)
