# PySpark SQL

This notebook contains code examples for the pyspark.sql module.
It shows functions that implement SQL statements and clauses such as SELECT, WHERE, GROUP BY, ORDER BY, and PARTITION BY, among others.


Two datasets are used in the examples:
*   housing.csv: information about housing in the state of California;
*   wc2018-players.csv: information about the football players of the 2018 World Cup.

In [None]:
%pip install ipython-autotime
%pip install pyspark

time: 11.1 s (started: 2023-09-19 21:36:51 +00:00)


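Optional snippet that works around a compatibility problem between Python and PySpark by pointing both the driver and the workers at the current interpreter. The `%%script echo` line makes the notebook skip the cell; remove it to run the code.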

In [None]:
%%script echo 'ignore cell'
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

ignore cell
time: 6.16 ms (started: 2023-09-19 21:37:03 +00:00)


# Basic imports.

In [None]:
from google.colab          import drive, files
from pyspark.sql           import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types     import *
from pyspark.sql.window    import Window

%load_ext autotime

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 899 µs (started: 2023-09-19 21:37:03 +00:00)


# Starting the session.

In [None]:
drive.mount('/content/drive', force_remount=True)
spark = SparkSession.builder.master('local').appName('pyspark_app').getOrCreate()
spark

Mounted at /content/drive


time: 4.37 s (started: 2023-09-19 21:37:03 +00:00)


In [None]:
df = spark.read.csv("/content/drive/MyDrive/datasets/housing/housing.csv", header=True, inferSchema=True, encoding='utf-8')
df = df.drop('housing_median_age', 'population', 'median_income', 'median_house_value') # drop attributes that are not needed
print(type(df))
print(f'rows: {df.count()}')
print(f'cols: {len(df.columns)}')

<class 'pyspark.sql.dataframe.DataFrame'>
rows: 20640
cols: 6
time: 1.38 s (started: 2023-09-19 21:37:07 +00:00)


# Basic descriptive functions.

In [None]:
df.show(5)

+---------+--------+-----------+--------------+----------+---------------+
|longitude|latitude|total_rooms|total_bedrooms|households|ocean_proximity|
+---------+--------+-----------+--------------+----------+---------------+
|  -122.23|   37.88|      880.0|         129.0|     126.0|       NEAR BAY|
|  -122.22|   37.86|     7099.0|        1106.0|    1138.0|       NEAR BAY|
|  -122.24|   37.85|     1467.0|         190.0|     177.0|       NEAR BAY|
|  -122.25|   37.85|     1274.0|         235.0|     219.0|       NEAR BAY|
|  -122.25|   37.85|     1627.0|         280.0|     259.0|       NEAR BAY|
+---------+--------+-----------+--------------+----------+---------------+
only showing top 5 rows

time: 428 ms (started: 2023-09-19 21:37:08 +00:00)


In [None]:
print(df.columns)

['longitude', 'latitude', 'total_rooms', 'total_bedrooms', 'households', 'ocean_proximity']
time: 573 µs (started: 2023-09-19 21:37:09 +00:00)


In [None]:
df.printSchema()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- households: double (nullable = true)
 |-- ocean_proximity: string (nullable = true)

time: 5.96 ms (started: 2023-09-19 21:37:09 +00:00)


In [None]:
df.describe().show()

+-------+-------------------+-----------------+------------------+------------------+-----------------+---------------+
|summary|          longitude|         latitude|       total_rooms|    total_bedrooms|       households|ocean_proximity|
+-------+-------------------+-----------------+------------------+------------------+-----------------+---------------+
|  count|              20640|            20640|             20640|             20433|            20640|          20640|
|   mean|-119.56970445736148| 35.6318614341087|2635.7630813953488| 537.8705525375618|499.5396802325581|           null|
| stddev|  2.003531723502584|2.135952397457101|2181.6152515827944|421.38507007403115|382.3297528316098|           null|
|    min|            -124.35|            32.54|               2.0|               1.0|              1.0|      <1H OCEAN|
|    max|            -114.31|            41.95|           39320.0|            6445.0|           6082.0|     NEAR OCEAN|
+-------+-------------------+-----------------+------------------+------------------+-----------------+---------------+

# Useful functions for feature engineering.

## Attributes

### Renaming
Converting all column names to uppercase.

In [None]:
upper = [column.upper() for column in df.columns]
for column, up in zip(df.columns, upper):
  df = df.withColumnRenamed(column, up)
print(df.columns)

['LONGITUDE', 'LATITUDE', 'TOTAL_ROOMS', 'TOTAL_BEDROOMS', 'HOUSEHOLDS', 'OCEAN_PROXIMITY']
time: 26.9 ms (started: 2023-09-19 21:37:12 +00:00)


Converting all column names to lowercase.

In [None]:
lower = [column.lower() for column in df.columns]
for column, low in zip(df.columns, lower):
  df = df.withColumnRenamed(column, low)
print(df.columns)

['longitude', 'latitude', 'total_rooms', 'total_bedrooms', 'households', 'ocean_proximity']
time: 25.6 ms (started: 2023-09-19 21:37:12 +00:00)


Assigning an alias to each selected attribute. alias() is a method of Column, so a Column object is needed; here it comes from col().


In [None]:
lat = col('latitude').alias('lat')
lon = col('longitude').alias('lon')

print(lat)
df.select([lat, lon]).show(5)

Column<'latitude AS lat'>
+-----+-------+
|  lat|    lon|
+-----+-------+
|37.88|-122.23|
|37.86|-122.22|
|37.85|-122.24|
|37.85|-122.25|
|37.85|-122.25|
+-----+-------+
only showing top 5 rows

time: 166 ms (started: 2023-09-19 21:37:12 +00:00)


### Statement: SELECT

In [None]:
print(type(df.select(['longitude', 'latitude', 'households'])))
df.select(['longitude', 'latitude', 'households']).show(5)

<class 'pyspark.sql.dataframe.DataFrame'>
+---------+--------+----------+
|longitude|latitude|households|
+---------+--------+----------+
|  -122.23|   37.88|     126.0|
|  -122.22|   37.86|    1138.0|
|  -122.24|   37.85|     177.0|
|  -122.25|   37.85|     219.0|
|  -122.25|   37.85|     259.0|
+---------+--------+----------+
only showing top 5 rows

time: 348 ms (started: 2023-09-19 21:37:12 +00:00)


Alternative form using col(), which returns an object of the Column class.

In [None]:
print(type(df.select([col('latitude'), col('longitude'), col('households')])))
df.select([col('latitude'), col('longitude'), col('households')]).show(5)

<class 'pyspark.sql.dataframe.DataFrame'>
+--------+---------+----------+
|latitude|longitude|households|
+--------+---------+----------+
|   37.88|  -122.23|     126.0|
|   37.86|  -122.22|    1138.0|
|   37.85|  -122.24|     177.0|
|   37.85|  -122.25|     219.0|
|   37.85|  -122.25|     259.0|
+--------+---------+----------+
only showing top 5 rows

time: 206 ms (started: 2023-09-19 21:37:12 +00:00)


### Creation
Assigning a literal value (True) to a new column called 'new_col'. lit() returns a Column object.

In [None]:
df.withColumn('new_col', lit(True)).show(5)

+---------+--------+-----------+--------------+----------+---------------+-------+
|longitude|latitude|total_rooms|total_bedrooms|households|ocean_proximity|new_col|
+---------+--------+-----------+--------------+----------+---------------+-------+
|  -122.23|   37.88|      880.0|         129.0|     126.0|       NEAR BAY|   true|
|  -122.22|   37.86|     7099.0|        1106.0|    1138.0|       NEAR BAY|   true|
|  -122.24|   37.85|     1467.0|         190.0|     177.0|       NEAR BAY|   true|
|  -122.25|   37.85|     1274.0|         235.0|     219.0|       NEAR BAY|   true|
|  -122.25|   37.85|     1627.0|         280.0|     259.0|       NEAR BAY|   true|
+---------+--------+-----------+--------------+----------+---------------+-------+
only showing top 5 rows

time: 248 ms (started: 2023-09-19 21:37:12 +00:00)


New column created from an arithmetic operation between two others. Both attributes must be numeric.

In [None]:
result = df['total_bedrooms'] / df['total_rooms']
print(type(result))
df.withColumn('new_col', result).show(5)

<class 'pyspark.sql.column.Column'>
+---------+--------+-----------+--------------+----------+---------------+-------------------+
|longitude|latitude|total_rooms|total_bedrooms|households|ocean_proximity|            new_col|
+---------+--------+-----------+--------------+----------+---------------+-------------------+
|  -122.23|   37.88|      880.0|         129.0|     126.0|       NEAR BAY|0.14659090909090908|
|  -122.22|   37.86|     7099.0|        1106.0|    1138.0|       NEAR BAY|0.15579659106916466|
|  -122.24|   37.85|     1467.0|         190.0|     177.0|       NEAR BAY|0.12951601908657123|
|  -122.25|   37.85|     1274.0|         235.0|     219.0|       NEAR BAY|0.18445839874411302|
|  -122.25|   37.85|     1627.0|         280.0|     259.0|       NEAR BAY| 0.1720958819913952|
+---------+--------+-----------+--------------+----------+---------------+-------------------+
only showing top 5 rows

time: 473 ms (started: 2023-09-19 21:37:13 +00:00)


New column using substring(), here taking 4 characters starting at position 1.

In [None]:
df.withColumn('new_col', substring('ocean_proximity', 1, 4)).show(5)

+---------+--------+-----------+--------------+----------+---------------+-------+
|longitude|latitude|total_rooms|total_bedrooms|households|ocean_proximity|new_col|
+---------+--------+-----------+--------------+----------+---------------+-------+
|  -122.23|   37.88|      880.0|         129.0|     126.0|       NEAR BAY|   NEAR|
|  -122.22|   37.86|     7099.0|        1106.0|    1138.0|       NEAR BAY|   NEAR|
|  -122.24|   37.85|     1467.0|         190.0|     177.0|       NEAR BAY|   NEAR|
|  -122.25|   37.85|     1274.0|         235.0|     219.0|       NEAR BAY|   NEAR|
|  -122.25|   37.85|     1627.0|         280.0|     259.0|       NEAR BAY|   NEAR|
+---------+--------+-----------+--------------+----------+---------------+-------+
only showing top 5 rows

time: 212 ms (started: 2023-09-19 21:37:13 +00:00)
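substring() counts positions from 1, unlike Python slicing, which counts from 0. A minimal plain-Python sketch of the same rule follows; the helper `sql_substring` is a hypothetical name used only for illustration, and it models positive start positions only.

```python
def sql_substring(text, pos, length):
    """Model a SQL-style substring for pos >= 1: 1-based start, fixed length."""
    start = pos - 1  # convert the 1-based position to a 0-based index
    return text[start:start + length]

print(sql_substring('NEAR BAY', 1, 4))  # NEAR
```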


Concatenating two attributes to form a new one.

In [None]:
df.withColumn('new_col', concat(df['latitude'], df['longitude'])).show(10)

+---------+--------+-----------+--------------+----------+---------------+------------+
|longitude|latitude|total_rooms|total_bedrooms|households|ocean_proximity|     new_col|
+---------+--------+-----------+--------------+----------+---------------+------------+
|  -122.23|   37.88|      880.0|         129.0|     126.0|       NEAR BAY|37.88-122.23|
|  -122.22|   37.86|     7099.0|        1106.0|    1138.0|       NEAR BAY|37.86-122.22|
|  -122.24|   37.85|     1467.0|         190.0|     177.0|       NEAR BAY|37.85-122.24|
|  -122.25|   37.85|     1274.0|         235.0|     219.0|       NEAR BAY|37.85-122.25|
|  -122.25|   37.85|     1627.0|         280.0|     259.0|       NEAR BAY|37.85-122.25|
|  -122.25|   37.85|      919.0|         213.0|     193.0|       NEAR BAY|37.85-122.25|
|  -122.25|   37.84|     2535.0|         489.0|     514.0|       NEAR BAY|37.84-122.25|
|  -122.25|   37.84|     3104.0|         687.0|     647.0|       NEAR BAY|37.84-122.25|
|  -122.26|   37.84|     2555.0|

In [None]:
df.withColumn('new_col', concat_ws(' # ', df['latitude'], df['longitude'])).show(10)

+---------+--------+-----------+--------------+----------+---------------+---------------+
|longitude|latitude|total_rooms|total_bedrooms|households|ocean_proximity|        new_col|
+---------+--------+-----------+--------------+----------+---------------+---------------+
|  -122.23|   37.88|      880.0|         129.0|     126.0|       NEAR BAY|37.88 # -122.23|
|  -122.22|   37.86|     7099.0|        1106.0|    1138.0|       NEAR BAY|37.86 # -122.22|
|  -122.24|   37.85|     1467.0|         190.0|     177.0|       NEAR BAY|37.85 # -122.24|
|  -122.25|   37.85|     1274.0|         235.0|     219.0|       NEAR BAY|37.85 # -122.25|
|  -122.25|   37.85|     1627.0|         280.0|     259.0|       NEAR BAY|37.85 # -122.25|
|  -122.25|   37.85|      919.0|         213.0|     193.0|       NEAR BAY|37.85 # -122.25|
|  -122.25|   37.84|     2535.0|         489.0|     514.0|       NEAR BAY|37.84 # -122.25|
|  -122.25|   37.84|     3104.0|         687.0|     647.0|       NEAR BAY|37.84 # -122.25|

## Null values

### Identification

In [None]:
for column in df.columns:
  mask = df[column].isNull()  # test the current column, not a fixed one
  null_amount = df.filter(mask).count()
  print(f'{column}: {null_amount}')

longitude: 0
latitude: 0
total_rooms: 0
total_bedrooms: 207
households: 0
ocean_proximity: 0
time: 1.74 s (started: 2023-09-19 21:37:14 +00:00)


--------------

# Another dataset
A second dataset, with dates stored as strings, is used next to show the conversion from string to DateType.

In [None]:
df = spark.read.csv("/content/drive/MyDrive/datasets/wc2018-players.csv", header=True, inferSchema=True, encoding='utf-8')
df = df.drop('#', 'dia', 'mes', 'ano', 'club') # drop attributes that are not needed
print(type(df))
print(f'rows: {df.count()}')
print(f'cols: {len(df.columns)}')
df.show(5)

<class 'pyspark.sql.dataframe.DataFrame'>
rows: 736
cols: 7
+---------+----+------------------+----------+----------+------+------+
|     Team|Pos.| FIFA Popular Name|Birth Date|Shirt Name|Height|Weight|
+---------+----+------------------+----------+----------+------+------+
|Argentina|  DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|   169|    65|
|Argentina|  MF|    PAVON Cristian|21.01.1996|     PAVÓN|   169|    65|
|Argentina|  MF|    LANZINI Manuel|15.02.1993|   LANZINI|   167|    66|
|Argentina|  DF|    SALVIO Eduardo|13.07.1990|    SALVIO|   167|    69|
|Argentina|  FW|      MESSI Lionel|24.06.1987|     MESSI|   170|    72|
+---------+----+------------------+----------+----------+------+------+
only showing top 5 rows

time: 879 ms (started: 2023-09-19 21:37:16 +00:00)


In [None]:
df.printSchema()

root
 |-- Team: string (nullable = true)
 |-- Pos.: string (nullable = true)
 |-- FIFA Popular Name: string (nullable = true)
 |-- Birth Date: string (nullable = true)
 |-- Shirt Name: string (nullable = true)
 |-- Height: integer (nullable = true)
 |-- Weight: integer (nullable = true)

time: 4.99 ms (started: 2023-09-19 21:37:17 +00:00)


Renaming the columns by converting them to lowercase.

In [None]:
lower = [column.lower() for column in df.columns]
for column, low in zip(df.columns, lower):
  df = df.withColumnRenamed(column, low)

df = df.withColumnRenamed('pos.', 'pos')
print(df.columns)

['team', 'pos', 'fifa popular name', 'birth date', 'shirt name', 'height', 'weight']
time: 61.5 ms (started: 2023-09-19 21:37:17 +00:00)


Note that 'birth date' is a string in which the components DD, MM, and YYYY are separated by dots. Below is one way of extracting the components.

In [None]:
dia = udf(lambda date:date.split('.')[0])
mes = udf(lambda date:date.split('.')[1])
ano = udf(lambda date:date.split('.')[2])

df = df.withColumn('dia', dia('birth date'))
df = df.withColumn('mes', mes('birth date'))
df = df.withColumn('ano', ano('birth date'))

df.show(5)

+---------+---+------------------+----------+----------+------+------+---+---+----+
|     team|pos| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|
+---------+---+------------------+----------+----------+------+------+---+---+----+
|Argentina| DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|   169|    65| 31| 08|1992|
|Argentina| MF|    PAVON Cristian|21.01.1996|     PAVÓN|   169|    65| 21| 01|1996|
|Argentina| MF|    LANZINI Manuel|15.02.1993|   LANZINI|   167|    66| 15| 02|1993|
|Argentina| DF|    SALVIO Eduardo|13.07.1990|    SALVIO|   167|    69| 13| 07|1990|
|Argentina| FW|      MESSI Lionel|24.06.1987|     MESSI|   170|    72| 24| 06|1987|
+---------+---+------------------+----------+----------+------+------+---+---+----+
only showing top 5 rows

time: 967 ms (started: 2023-09-19 21:37:17 +00:00)


The conversion of "birth date" to DateType can be done as follows.

In [None]:
df.withColumn('data nascimento', to_date(col("birth date"), "dd.MM.yyyy")).printSchema()
# or (the extra cast is redundant, since to_date already returns a date column)
#df.withColumn('Data', to_date(col("birth date"), "dd.MM.yyyy").cast(DateType())).printSchema()

root
 |-- team: string (nullable = true)
 |-- pos: string (nullable = true)
 |-- fifa popular name: string (nullable = true)
 |-- birth date: string (nullable = true)
 |-- shirt name: string (nullable = true)
 |-- height: integer (nullable = true)
 |-- weight: integer (nullable = true)
 |-- dia: string (nullable = true)
 |-- mes: string (nullable = true)
 |-- ano: string (nullable = true)
 |-- data nascimento: date (nullable = true)

time: 31 ms (started: 2023-09-19 21:37:18 +00:00)
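For comparison, the same parse can be sketched with Python's standard library, where the format string equivalent to "dd.MM.yyyy" is "%d.%m.%Y". The helper name `parse_birth_date` is hypothetical and exists only for this illustration.

```python
from datetime import date, datetime

def parse_birth_date(text):
    """Parse a 'DD.MM.YYYY' string into a date object."""
    return datetime.strptime(text, '%d.%m.%Y').date()

birth = parse_birth_date('31.08.1992')
print(birth)  # 1992-08-31
print(birth.day, birth.month, birth.year)
```

Once the value is a real date, day, month, and year are available as integers, whereas splitting the string yields text such as '08'.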


Another type conversion, using cast().

In [None]:
df.withColumn('height', col('height').cast(FloatType())).show(5)

+---------+---+------------------+----------+----------+------+------+---+---+----+
|     team|pos| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|
+---------+---+------------------+----------+----------+------+------+---+---+----+
|Argentina| DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO| 169.0|    65| 31| 08|1992|
|Argentina| MF|    PAVON Cristian|21.01.1996|     PAVÓN| 169.0|    65| 21| 01|1996|
|Argentina| MF|    LANZINI Manuel|15.02.1993|   LANZINI| 167.0|    66| 15| 02|1993|
|Argentina| DF|    SALVIO Eduardo|13.07.1990|    SALVIO| 167.0|    69| 13| 07|1990|
|Argentina| FW|      MESSI Lionel|24.06.1987|     MESSI| 170.0|    72| 24| 06|1987|
+---------+---+------------------+----------+----------+------+------+---+---+----+
only showing top 5 rows

time: 574 ms (started: 2023-09-19 21:37:18 +00:00)


Removing a column.

In [None]:
df = df.withColumn('new_col', lit(True))
df = df.drop('new_col')
df.show(5)

+---------+---+------------------+----------+----------+------+------+---+---+----+
|     team|pos| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|
+---------+---+------------------+----------+----------+------+------+---+---+----+
|Argentina| DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|   169|    65| 31| 08|1992|
|Argentina| MF|    PAVON Cristian|21.01.1996|     PAVÓN|   169|    65| 21| 01|1996|
|Argentina| MF|    LANZINI Manuel|15.02.1993|   LANZINI|   167|    66| 15| 02|1993|
|Argentina| DF|    SALVIO Eduardo|13.07.1990|    SALVIO|   167|    69| 13| 07|1990|
|Argentina| FW|      MESSI Lionel|24.06.1987|     MESSI|   170|    72| 24| 06|1987|
+---------+---+------------------+----------+----------+------+------+---+---+----+
only showing top 5 rows

time: 692 ms (started: 2023-09-19 21:37:18 +00:00)


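Keeping a reference to the current DataFrame as a backup. DataFrames are immutable, so later transformations assigned to df do not change df_backup.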

In [None]:
df_backup = df

time: 439 µs (started: 2023-09-19 21:37:19 +00:00)


# Statement: WHERE
where() implements the SQL WHERE clause. A string passed to where() must be a valid SQL expression.

filter() is an alternative that can be used in the same way.

In [None]:
df.where("team = 'Brazil'").show(5)

+------+---+-----------------+----------+-----------+------+------+---+---+----+
|  team|pos|fifa popular name|birth date| shirt name|height|weight|dia|mes| ano|
+------+---+-----------------+----------+-----------+------+------+---+---+----+
|Brazil| MF|             FRED|05.03.1993|       FRED|   169|    64| 05| 03|1993|
|Brazil| FW|           TAISON|13.01.1988|     TAISON|   172|    64| 13| 01|1988|
|Brazil| MF|      FERNANDINHO|04.05.1985|FERNANDINHO|   179|    67| 04| 05|1985|
|Brazil| DF|           FAGNER|11.06.1989|     FAGNER|   168|    67| 11| 06|1989|
|Brazil| FW|           NEYMAR|05.02.1992|  NEYMAR JR|   175|    68| 05| 02|1992|
+------+---+-----------------+----------+-----------+------+------+---+---+----+
only showing top 5 rows

time: 691 ms (started: 2023-09-19 21:37:19 +00:00)


In [None]:
mask = df['team'] == 'Argentina'
print(mask)
df.where(mask).show(5)

Column<'(team = Argentina)'>
+---------+---+------------------+----------+----------+------+------+---+---+----+
|     team|pos| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|
+---------+---+------------------+----------+----------+------+------+---+---+----+
|Argentina| DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|   169|    65| 31| 08|1992|
|Argentina| MF|    PAVON Cristian|21.01.1996|     PAVÓN|   169|    65| 21| 01|1996|
|Argentina| MF|    LANZINI Manuel|15.02.1993|   LANZINI|   167|    66| 15| 02|1993|
|Argentina| DF|    SALVIO Eduardo|13.07.1990|    SALVIO|   167|    69| 13| 07|1990|
|Argentina| FW|      MESSI Lionel|24.06.1987|     MESSI|   170|    72| 24| 06|1987|
+---------+---+------------------+----------+----------+------+------+---+---+----+
only showing top 5 rows

time: 699 ms (started: 2023-09-19 21:37:20 +00:00)


In [None]:
mask = (col('shirt name') == 'MESSI')
print(mask)
df.filter(mask).show(5)

Column<'(shirt name = MESSI)'>
+---------+---+-----------------+----------+----------+------+------+---+---+----+
|     team|pos|fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|
+---------+---+-----------------+----------+----------+------+------+---+---+----+
|Argentina| FW|     MESSI Lionel|24.06.1987|     MESSI|   170|    72| 24| 06|1987|
+---------+---+-----------------+----------+----------+------+------+---+---+----+

time: 1.08 s (started: 2023-09-19 21:37:20 +00:00)


## Compound filters

In [None]:
mask = ("team = 'Brazil' AND height < 170")
print(mask)
df.where(mask).show(5)

team = 'Brazil' AND height < 170
+------+---+-----------------+----------+----------+------+------+---+---+----+
|  team|pos|fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|
+------+---+-----------------+----------+----------+------+------+---+---+----+
|Brazil| MF|             FRED|05.03.1993|      FRED|   169|    64| 05| 03|1993|
|Brazil| DF|           FAGNER|11.06.1989|    FAGNER|   168|    67| 11| 06|1989|
+------+---+-----------------+----------+----------+------+------+---+---+----+

time: 689 ms (started: 2023-09-19 21:37:21 +00:00)


In [None]:
mask = (col('team') == 'Brazil') & (col('height') < 170)
print(mask)
df.where(mask).show(5)

Column<'((team = Brazil) AND (height < 170))'>
+------+---+-----------------+----------+----------+------+------+---+---+----+
|  team|pos|fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|
+------+---+-----------------+----------+----------+------+------+---+---+----+
|Brazil| MF|             FRED|05.03.1993|      FRED|   169|    64| 05| 03|1993|
|Brazil| DF|           FAGNER|11.06.1989|    FAGNER|   168|    67| 11| 06|1989|
+------+---+-----------------+----------+----------+------+------+---+---+----+

time: 658 ms (started: 2023-09-19 21:37:22 +00:00)


# Statement: GROUP BY

Taking a column as reference, all rows in which that column has the same value are "collapsed" into a single row. An aggregation must be specified for the other columns; otherwise they are left out of the result. Usually descriptive-statistics functions are applied.

In [None]:
df.groupBy('team').mean('weight').orderBy('avg(weight)', ascending=True).show(10)

+--------------+-----------------+
|          team|      avg(weight)|
+--------------+-----------------+
|         Japan|71.52173913043478|
|  Saudi Arabia|73.04347826086956|
|      Portugal| 73.6086956521739|
|        Mexico|74.08695652173913|
|    Costa Rica| 74.1304347826087|
|Korea Republic|74.43478260869566|
|       Uruguay| 74.6086956521739|
|       Morocco|74.65217391304348|
|         Spain|74.73913043478261|
|       Tunisia|             75.0|
+--------------+-----------------+
only showing top 10 rows

time: 930 ms (started: 2023-09-19 21:37:23 +00:00)
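What groupBy('team').mean('weight') computes can be modeled in plain Python as below. This is only a sketch of the semantics on a few (team, weight) rows copied from the tables in this notebook, not a description of how Spark executes the query; `mean_by_key` is a hypothetical helper name.

```python
from collections import defaultdict

# (team, weight) pairs copied from sample rows shown in this notebook.
rows = [
    ('Argentina', 65), ('Argentina', 65), ('Argentina', 66),
    ('Brazil', 64), ('Brazil', 64), ('Brazil', 67),
]

def mean_by_key(pairs):
    """Collapse all rows sharing a key into one row holding the mean value."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}

result = mean_by_key(rows)
# Sort ascending by the aggregated value, like orderBy('avg(weight)').
for team, avg_weight in sorted(result.items(), key=lambda item: item[1]):
    print(team, avg_weight)
```

Six input rows become two output rows, one per team, which is the "collapsing" described above.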


To specify which aggregation function is applied to each column, use agg().

In [None]:
df.groupBy('team').agg({'weight':'avg', 'dia':'min', 'height':'max'}).orderBy('max(height)', ascending=False).show(10)

+--------------+-----------------+--------+-----------+
|          team|      avg(weight)|min(dia)|max(height)|
+--------------+-----------------+--------+-----------+
|       Croatia|79.30434782608695|      02|        201|
|       Denmark| 82.6086956521739|      01|        200|
|     Argentina|75.56521739130434|      02|        199|
|       Belgium|79.56521739130434|      02|        199|
|       Iceland|80.73913043478261|      01|        198|
|        Sweden|78.82608695652173|      02|        198|
|       Nigeria|80.47826086956522|      01|        197|
|Korea Republic|74.43478260869566|      03|        197|
|        France|             80.0|      03|        197|
|        Panama|             80.0|      01|        197|
+--------------+-----------------+--------+-----------+
only showing top 10 rows

time: 2.18 s (started: 2023-09-19 21:37:24 +00:00)


In [None]:
df.groupBy('team').agg(avg('height'), min('height'), max('height')).orderBy('avg(height)', ascending=False).show(20)

+--------------+------------------+-----------+-----------+
|          team|       avg(height)|min(height)|max(height)|
+--------------+------------------+-----------+-----------+
|        Serbia|186.69565217391303|        169|        195|
|       Denmark| 186.6086956521739|        171|        200|
|       Germany| 185.7826086956522|        176|        195|
|        Sweden| 185.7391304347826|        177|        198|
|       Iceland|185.52173913043478|        170|        198|
|       Belgium|185.34782608695653|        169|        199|
|       Croatia| 185.2608695652174|        172|        201|
|       Nigeria|184.52173913043478|        172|        197|
|       IR Iran|184.47826086956522|        177|        194|
|        Russia| 184.3913043478261|        173|        196|
|       Senegal|183.65217391304347|        173|        196|
|        France|183.30434782608697|        168|        197|
|        Poland|183.17391304347825|        172|        195|
|       Tunisia|183.08695652173913|     

In [None]:
df.groupBy('team').agg(avg('weight')).orderBy('avg(weight)', ascending=True).show(10)

+--------------+-----------------+
|          team|      avg(weight)|
+--------------+-----------------+
|         Japan|71.52173913043478|
|  Saudi Arabia|73.04347826086956|
|      Portugal| 73.6086956521739|
|        Mexico|74.08695652173913|
|    Costa Rica| 74.1304347826087|
|Korea Republic|74.43478260869566|
|       Uruguay| 74.6086956521739|
|       Morocco|74.65217391304348|
|         Spain|74.73913043478261|
|       Tunisia|             75.0|
+--------------+-----------------+
only showing top 10 rows

time: 1.27 s (started: 2023-09-19 21:37:27 +00:00)


# Statement: PARTITION BY
The concept is very similar to groupBy(), but while groupBy() collapses the rows that share a key into a single row, Window.partitionBy() keeps every row and only splits the data into partitions based on one or more columns. Window functions are then evaluated inside each partition, for example:

*   row_number()
*   rank()
*   dense_rank()
*   percent_rank()
*   ntile()

**Note.** The orderBy() used with Window.partitionBy() is not the same one used after the aggregation functions of groupBy(). The latter is a DataFrame method and returns a DataFrame, whereas the former creates a WindowSpec.
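The difference from groupBy() can be sketched in plain Python: a window keeps every row and only adds a value computed inside its partition. The sketch below models row_number() over a window partitioned by team and ordered by descending height, on a few rows copied from the tables in this notebook. `add_row_number` is a hypothetical helper and not part of PySpark.

```python
from itertools import groupby

# (team, shirt name, height) rows copied from outputs in this notebook.
players = [
    ('Argentina', 'MESSI', 170), ('Argentina', 'FAZIO', 199),
    ('Argentina', 'ROJO', 189), ('Brazil', 'FRED', 169),
    ('Brazil', 'CASSIO', 195), ('Brazil', 'NEYMAR JR', 175),
]

def add_row_number(rows):
    """Number rows 1..k inside each team, tallest first; no row is dropped."""
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    numbered = []
    for _, partition in groupby(ordered, key=lambda r: r[0]):
        for number, row in enumerate(partition, start=1):
            numbered.append(row + (number,))
    return numbered

numbered = add_row_number(players)
tallest = [row for row in numbered if row[3] == 1]  # keep row 1 of each team
print(len(numbered))  # 6, the same number of rows as the input
print(tallest)
```

Filtering on the added number afterwards is what turns the window into a "top 1 per team" query.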

In [None]:
df.show(5)

+---------+---+------------------+----------+----------+------+------+---+---+----+
|     team|pos| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|
+---------+---+------------------+----------+----------+------+------+---+---+----+
|Argentina| DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|   169|    65| 31| 08|1992|
|Argentina| MF|    PAVON Cristian|21.01.1996|     PAVÓN|   169|    65| 21| 01|1996|
|Argentina| MF|    LANZINI Manuel|15.02.1993|   LANZINI|   167|    66| 15| 02|1993|
|Argentina| DF|    SALVIO Eduardo|13.07.1990|    SALVIO|   167|    69| 13| 07|1990|
|Argentina| FW|      MESSI Lionel|24.06.1987|     MESSI|   170|    72| 24| 06|1987|
+---------+---+------------------+----------+----------+------+------+---+---+----+
only showing top 5 rows

time: 551 ms (started: 2023-09-19 21:37:29 +00:00)


row_number()

In [None]:
prt = Window.partitionBy('team').orderBy(desc('height'))
print(type(prt))
print(type(row_number()))
df.withColumn('row', row_number().over(prt)).show(10)

<class 'pyspark.sql.window.WindowSpec'>
<class 'pyspark.sql.column.Column'>
+---------+---+------------------+----------+----------+------+------+---+---+----+---+
|     team|pos| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|row|
+---------+---+------------------+----------+----------+------+------+---+---+----+---+
|Argentina| DF|    FAZIO Federico|17.03.1987|     FAZIO|   199|    85| 17| 03|1987|  1|
|Argentina| GK|     GUZMAN Nahuel|10.02.1986|    GUZMÁN|   192|    90| 10| 02|1986|  2|
|Argentina| DF|       ROJO Marcos|20.03.1990|      ROJO|   189|    82| 20| 03|1990|  3|
|Argentina| GK|     ARMANI Franco|16.10.1986|    ARMANI|   189|    85| 16| 10|1986|  4|
|Argentina| GK|CABALLERO Wilfredo|28.09.1981| CABALLERO|   186|    80| 28| 09|1981|  5|
|Argentina| FW|   HIGUAIN Gonzalo|10.12.1987|   HIGUAÍN|   184|    75| 10| 12|1987|  6|
|Argentina| DF|  ANSALDI Cristian|20.09.1986|   ANSALDI|   181|    73| 20| 09|1986|  7|
|Argentina| DF|   MERCADO Gabriel|18.03.1987

In [None]:
# Select the tallest player of each team.
prt = Window.partitionBy('team').orderBy(desc('height'))
df.withColumn('top', row_number().over(prt)).where("top = 1").show(10)

+----------+---+------------------+----------+-----------+------+------+---+---+----+---+
|      team|pos| fifa popular name|birth date| shirt name|height|weight|dia|mes| ano|top|
+----------+---+------------------+----------+-----------+------+------+---+---+----+---+
| Argentina| DF|    FAZIO Federico|17.03.1987|      FAZIO|   199|    85| 17| 03|1987|  1|
| Australia| GK|        JONES Brad|19.03.1982|      JONES|   193|    87| 19| 03|1982|  1|
|   Belgium| GK|  COURTOIS Thibaut|11.05.1992|   COURTOIS|   199|    91| 11| 05|1992|  1|
|    Brazil| GK|            CASSIO|06.06.1987|     CASSIO|   195|    92| 06| 06|1987|  1|
|  Colombia| DF|        MINA Yerry|23.09.1994|    Y. MINA|   194|    95| 23| 09|1994|  1|
|Costa Rica| DF|    WASTON Kendall|01.01.1988|  K. WASTON|   196|    87| 01| 01|1988|  1|
|   Croatia| GK|     KALINIC Lovre|03.04.1990| L. KALINIĆ|   201|    96| 03| 04|1990|  1|
|   Denmark| DF|VESTERGAARD Jannik|03.08.1992|VESTERGAARD|   200|    98| 03| 08|1992|  1|
|     Egyp

rank(): Note how rank=3 appears twice and is followed by a jump to rank=5. Tied rows share a rank, and the ranks that the tie would have occupied are skipped.
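The tie rules of the ranking functions can be modeled in plain Python. The sketch below uses the first eight Argentina heights listed in this section's output, already sorted in descending order. The helper names are hypothetical, and percent_rank is computed as (rank - 1) / (rows - 1), so on this eight-row sample its values differ from those Spark reports for the full squad.

```python
heights = [199, 192, 189, 189, 186, 184, 181, 181]  # sorted descending

def rank(values):
    """Ties share a rank; the next rank skips ahead by the size of the tie."""
    return [values.index(v) + 1 for v in values]

def dense_rank(values):
    """Ties share a rank; the next rank is always the previous one plus 1."""
    distinct = sorted(set(values), reverse=True)
    return [distinct.index(v) + 1 for v in values]

def percent_rank(values):
    """Rank rescaled to the interval [0, 1]; needs at least two rows here."""
    n = len(values)
    return [(r - 1) / (n - 1) for r in rank(values)]

print(rank(heights))        # [1, 2, 3, 3, 5, 6, 7, 7]
print(dense_rank(heights))  # [1, 2, 3, 3, 4, 5, 6, 6]
```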

In [None]:
prt = Window.partitionBy('team').orderBy(desc('height'))
df.withColumn('rank', rank().over(prt)).show(10)

+---------+---+------------------+----------+----------+------+------+---+---+----+----+
|     team|pos| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|rank|
+---------+---+------------------+----------+----------+------+------+---+---+----+----+
|Argentina| DF|    FAZIO Federico|17.03.1987|     FAZIO|   199|    85| 17| 03|1987|   1|
|Argentina| GK|     GUZMAN Nahuel|10.02.1986|    GUZMÁN|   192|    90| 10| 02|1986|   2|
|Argentina| DF|       ROJO Marcos|20.03.1990|      ROJO|   189|    82| 20| 03|1990|   3|
|Argentina| GK|     ARMANI Franco|16.10.1986|    ARMANI|   189|    85| 16| 10|1986|   3|
|Argentina| GK|CABALLERO Wilfredo|28.09.1981| CABALLERO|   186|    80| 28| 09|1981|   5|
|Argentina| FW|   HIGUAIN Gonzalo|10.12.1987|   HIGUAÍN|   184|    75| 10| 12|1987|   6|
|Argentina| DF|  ANSALDI Cristian|20.09.1986|   ANSALDI|   181|    73| 20| 09|1986|   7|
|Argentina| DF|   MERCADO Gabriel|18.03.1987|   MERCADO|   181|    81| 18| 03|1987|   7|
|Argentina| DF|  OTAM

dense_rank(): Here, even though rank=3 is repeated, the next rank is 4, and so on. There are no gaps in the values.

In [None]:
prt = Window.partitionBy('team').orderBy(desc('height'))
df.withColumn('dense_rank', dense_rank().over(prt)).show(10)

+---------+---+------------------+----------+----------+------+------+---+---+----+----------+
|     team|pos| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|dense_rank|
+---------+---+------------------+----------+----------+------+------+---+---+----+----------+
|Argentina| DF|    FAZIO Federico|17.03.1987|     FAZIO|   199|    85| 17| 03|1987|         1|
|Argentina| GK|     GUZMAN Nahuel|10.02.1986|    GUZMÁN|   192|    90| 10| 02|1986|         2|
|Argentina| DF|       ROJO Marcos|20.03.1990|      ROJO|   189|    82| 20| 03|1990|         3|
|Argentina| GK|     ARMANI Franco|16.10.1986|    ARMANI|   189|    85| 16| 10|1986|         3|
|Argentina| GK|CABALLERO Wilfredo|28.09.1981| CABALLERO|   186|    80| 28| 09|1981|         4|
|Argentina| FW|   HIGUAIN Gonzalo|10.12.1987|   HIGUAÍN|   184|    75| 10| 12|1987|         5|
|Argentina| DF|  ANSALDI Cristian|20.09.1986|   ANSALDI|   181|    73| 20| 09|1986|         6|
|Argentina| DF|   MERCADO Gabriel|18.03.1987|   ME
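
For comparison, the dense variant only increments the rank when the value changes. Again a plain-Python sketch with a hypothetical helper name (`dense_rank_sketch`), applied to the same eight heights.

```python
def dense_rank_sketch(values):
    """Dense rank for values already sorted in window order (no gaps after ties)."""
    ranks = []
    for i, value in enumerate(values):
        if i == 0:
            ranks.append(1)
        elif value == values[i - 1]:
            ranks.append(ranks[-1])       # tie: same rank as the previous row
        else:
            ranks.append(ranks[-1] + 1)   # new value: next consecutive rank
    return ranks


heights = [199, 192, 189, 189, 186, 184, 181, 181]
print(dense_rank_sketch(heights))  # [1, 2, 3, 3, 4, 5, 6, 6]
```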

percent_rank(): relative (percentile) ranking, computed as (rank - 1) / (rows in partition - 1).

In [None]:
prt = Window.partitionBy('team').orderBy(desc('height'))
df.withColumn('percent_rank', percent_rank().over(prt)).show(10)

+---------+---+------------------+----------+----------+------+------+---+---+----+--------------------+
|     team|pos| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|        percent_rank|
+---------+---+------------------+----------+----------+------+------+---+---+----+--------------------+
|Argentina| DF|    FAZIO Federico|17.03.1987|     FAZIO|   199|    85| 17| 03|1987|                 0.0|
|Argentina| GK|     GUZMAN Nahuel|10.02.1986|    GUZMÁN|   192|    90| 10| 02|1986|0.045454545454545456|
|Argentina| DF|       ROJO Marcos|20.03.1990|      ROJO|   189|    82| 20| 03|1990| 0.09090909090909091|
|Argentina| GK|     ARMANI Franco|16.10.1986|    ARMANI|   189|    85| 16| 10|1986| 0.09090909090909091|
|Argentina| GK|CABALLERO Wilfredo|28.09.1981| CABALLERO|   186|    80| 28| 09|1981| 0.18181818181818182|
|Argentina| FW|   HIGUAIN Gonzalo|10.12.1987|   HIGUAÍN|   184|    75| 10| 12|1987| 0.22727272727272727|
|Argentina| DF|  ANSALDI Cristian|20.09.1986|   ANSALDI
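
The values in this column follow from the rank() values through the formula (rank - 1) / (n - 1), where n is the partition size. The sketch below assumes n = 23 players for Argentina, which is inferred from the second value in the output (1/22); the helper name `percent_rank_sketch` is hypothetical.

```python
def percent_rank_sketch(rank, partition_size):
    """Relative rank in [0, 1]: (rank - 1) / (partition_size - 1)."""
    if partition_size <= 1:
        return 0.0  # a single-row partition has nothing to compare against
    return (rank - 1) / (partition_size - 1)


# rank() values of the first six Argentina rows; 23 is the assumed squad size.
ranks = [1, 2, 3, 3, 5, 6]
values = [percent_rank_sketch(r, 23) for r in ranks]
print([round(v, 4) for v in values])  # [0.0, 0.0455, 0.0909, 0.0909, 0.1818, 0.2273]
```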

ntile(): Splits each ordered partition into n buckets (tiles) and labels them from 1 to n. When the number of rows in a partition is not divisible by n, the remainder is spread over the first buckets, one extra row each, so bucket sizes differ by at most one. For example, Argentina's squad has 23 players, so with n=5 the first three buckets receive 5 rows each and the last two receive 4.

In [None]:
prt = Window.partitionBy('team').orderBy(desc('height'))
df.withColumn('ntile', ntile(5).over(prt)).show(20)

+---------+---+------------------+----------+----------+------+------+---+---+----+-----+
|     team|pos| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|ntile|
+---------+---+------------------+----------+----------+------+------+---+---+----+-----+
|Argentina| DF|    FAZIO Federico|17.03.1987|     FAZIO|   199|    85| 17| 03|1987|    1|
|Argentina| GK|     GUZMAN Nahuel|10.02.1986|    GUZMÁN|   192|    90| 10| 02|1986|    1|
|Argentina| DF|       ROJO Marcos|20.03.1990|      ROJO|   189|    82| 20| 03|1990|    1|
|Argentina| GK|     ARMANI Franco|16.10.1986|    ARMANI|   189|    85| 16| 10|1986|    1|
|Argentina| GK|CABALLERO Wilfredo|28.09.1981| CABALLERO|   186|    80| 28| 09|1981|    1|
|Argentina| FW|   HIGUAIN Gonzalo|10.12.1987|   HIGUAÍN|   184|    75| 10| 12|1987|    2|
|Argentina| DF|  ANSALDI Cristian|20.09.1986|   ANSALDI|   181|    73| 20| 09|1986|    2|
|Argentina| DF|   MERCADO Gabriel|18.03.1987|   MERCADO|   181|    81| 18| 03|1987|    2|
|Argentina
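
The bucket sizes can be worked out with integer division. This is a plain-Python sketch of the distribution rule, with hypothetical helper names, using 23 rows and n=5 as in the Argentina partition.

```python
def ntile_bucket_sizes(row_count, n):
    """Sizes of the n buckets: the first (row_count % n) buckets get one extra row."""
    base, extra = divmod(row_count, n)
    return [base + 1 if i < extra else base for i in range(n)]


def ntile_labels(row_count, n):
    """Bucket label (1..n) for every row, in window order."""
    labels = []
    for bucket, size in enumerate(ntile_bucket_sizes(row_count, n), start=1):
        labels.extend([bucket] * size)
    return labels


print(ntile_bucket_sizes(23, 5))  # [5, 5, 5, 4, 4]
print(ntile_labels(23, 5)[:8])    # [1, 1, 1, 1, 1, 2, 2, 2]
```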

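lag(): The same kind of lag used in time series. With offset=2, each row receives the weight from two rows earlier in its partition; the first two rows have no such row and get null.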

In [None]:
prt = Window.partitionBy('team').orderBy(desc('height'))
df.withColumn('lag', lag('weight', offset=2).over(prt)).show(10)

+---------+---+------------------+----------+----------+------+------+---+---+----+----+
|     team|pos| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano| lag|
+---------+---+------------------+----------+----------+------+------+---+---+----+----+
|Argentina| DF|    FAZIO Federico|17.03.1987|     FAZIO|   199|    85| 17| 03|1987|null|
|Argentina| GK|     GUZMAN Nahuel|10.02.1986|    GUZMÁN|   192|    90| 10| 02|1986|null|
|Argentina| DF|       ROJO Marcos|20.03.1990|      ROJO|   189|    82| 20| 03|1990|  85|
|Argentina| GK|     ARMANI Franco|16.10.1986|    ARMANI|   189|    85| 16| 10|1986|  90|
|Argentina| GK|CABALLERO Wilfredo|28.09.1981| CABALLERO|   186|    80| 28| 09|1981|  82|
|Argentina| FW|   HIGUAIN Gonzalo|10.12.1987|   HIGUAÍN|   184|    75| 10| 12|1987|  85|
|Argentina| DF|  ANSALDI Cristian|20.09.1986|   ANSALDI|   181|    73| 20| 09|1986|  80|
|Argentina| DF|   MERCADO Gabriel|18.03.1987|   MERCADO|   181|    81| 18| 03|1987|  75|
|Argentina| DF|  OTAM
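
As a plain-Python sketch of the shift (the helper `lag_sketch` is hypothetical), applied to the weights of the first eight Argentina rows:

```python
def lag_sketch(values, offset=1, default=None):
    """Value found `offset` rows earlier, or `default` when no such row exists."""
    return [values[i - offset] if i >= offset else default
            for i in range(len(values))]


weights = [85, 90, 82, 85, 80, 75, 73, 81]
print(lag_sketch(weights, offset=2))  # [None, None, 85, 90, 82, 85, 80, 75]
```

PySpark's lag() also accepts a `default` argument, which replaces the nulls produced at the start of each partition.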

lead(): The counterpart of lag(), the same kind of forward shift used in time series. With offset=1, each row receives the weight of the next row in its partition.

In [None]:
prt = Window.partitionBy('team').orderBy(desc('height'))
df.withColumn('lead', lead('weight', offset=1).over(prt)).show(10)

+---------+---+------------------+----------+----------+------+------+---+---+----+----+
|     team|pos| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|lead|
+---------+---+------------------+----------+----------+------+------+---+---+----+----+
|Argentina| DF|    FAZIO Federico|17.03.1987|     FAZIO|   199|    85| 17| 03|1987|  90|
|Argentina| GK|     GUZMAN Nahuel|10.02.1986|    GUZMÁN|   192|    90| 10| 02|1986|  82|
|Argentina| DF|       ROJO Marcos|20.03.1990|      ROJO|   189|    82| 20| 03|1990|  85|
|Argentina| GK|     ARMANI Franco|16.10.1986|    ARMANI|   189|    85| 16| 10|1986|  80|
|Argentina| GK|CABALLERO Wilfredo|28.09.1981| CABALLERO|   186|    80| 28| 09|1981|  75|
|Argentina| FW|   HIGUAIN Gonzalo|10.12.1987|   HIGUAÍN|   184|    75| 10| 12|1987|  73|
|Argentina| DF|  ANSALDI Cristian|20.09.1986|   ANSALDI|   181|    73| 20| 09|1986|  81|
|Argentina| DF|   MERCADO Gabriel|18.03.1987|   MERCADO|   181|    81| 18| 03|1987|  81|
|Argentina| DF|  OTAM
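
A common use of lead() is comparing each row with its successor. The plain-Python sketch below (hypothetical helper `lead_sketch`) shifts the first eight weights forward and computes the difference to the next player. Because only eight rows are used here, the last value is None, whereas in the output above the partition continues and the eighth row still has a successor.

```python
def lead_sketch(values, offset=1, default=None):
    """Value found `offset` rows later, or `default` when no such row exists."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default for i in range(n)]


weights = [85, 90, 82, 85, 80, 75, 73, 81]
next_weights = lead_sketch(weights)
print(next_weights)  # [90, 82, 85, 80, 75, 73, 81, None]

# Difference between the next player's weight and the current one.
diffs = [None if nxt is None else nxt - cur
         for cur, nxt in zip(weights, next_weights)]
print(diffs)  # [5, -8, 3, -5, -5, -2, 8, None]
```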

# Statement: DISTINCT

In [None]:
df.select('team').distinct().show(5)

+-------+
|   team|
+-------+
| Russia|
|Senegal|
| Sweden|
|IR Iran|
|Germany|
+-------+
only showing top 5 rows

time: 250 ms (started: 2023-09-19 21:37:33 +00:00)


Number of unique values in a column.

In [None]:
nunique = df.select('team').distinct().count()
print(f'unique values: {nunique}')

unique values: 32
time: 486 ms (started: 2023-09-19 21:37:34 +00:00)


# Statement: COLLECT
Returns the result of a query to the driver as a Python list.

In [None]:
df.select('team').distinct().collect()

[Row(team='Russia'),
 Row(team='Senegal'),
 Row(team='Sweden'),
 Row(team='IR Iran'),
 Row(team='Germany'),
 Row(team='France'),
 Row(team='Argentina'),
 Row(team='Belgium'),
 Row(team='Peru'),
 Row(team='Croatia'),
 Row(team='Nigeria'),
 Row(team='Korea Republic'),
 Row(team='Spain'),
 Row(team='Denmark'),
 Row(team='Morocco'),
 Row(team='Panama'),
 Row(team='Iceland'),
 Row(team='Uruguay'),
 Row(team='Mexico'),
 Row(team='Tunisia'),
 Row(team='Saudi Arabia'),
 Row(team='Switzerland'),
 Row(team='Brazil'),
 Row(team='Japan'),
 Row(team='England'),
 Row(team='Poland'),
 Row(team='Portugal'),
 Row(team='Australia'),
 Row(team='Costa Rica'),
 Row(team='Egypt'),
 Row(team='Serbia'),
 Row(team='Colombia')]

time: 263 ms (started: 2023-09-19 21:37:34 +00:00)


The previous result is a list of Row objects. If only the country names are needed, the code below extracts them.

In [None]:
result = df.select('team').distinct().collect()
countries = [row[0] for row in result]
print(countries)

['Russia', 'Senegal', 'Sweden', 'IR Iran', 'Germany', 'France', 'Argentina', 'Belgium', 'Peru', 'Croatia', 'Nigeria', 'Korea Republic', 'Spain', 'Denmark', 'Morocco', 'Panama', 'Iceland', 'Uruguay', 'Mexico', 'Tunisia', 'Saudi Arabia', 'Switzerland', 'Brazil', 'Japan', 'England', 'Poland', 'Portugal', 'Australia', 'Costa Rica', 'Egypt', 'Serbia', 'Colombia']
time: 315 ms (started: 2023-09-19 21:37:34 +00:00)


# When/Otherwise
PySpark's equivalent of if/else.

In [None]:
val = when(condition=(col('team') == 'Argentina'), value='Argentinos').otherwise(value='Normais')
df.withColumn('new_col', val).show()

+---------+---+------------------+----------+----------+------+------+---+---+----+----------+
|     team|pos| fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|   new_col|
+---------+---+------------------+----------+----------+------+------+---+---+----+----------+
|Argentina| DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|   169|    65| 31| 08|1992|Argentinos|
|Argentina| MF|    PAVON Cristian|21.01.1996|     PAVÓN|   169|    65| 21| 01|1996|Argentinos|
|Argentina| MF|    LANZINI Manuel|15.02.1993|   LANZINI|   167|    66| 15| 02|1993|Argentinos|
|Argentina| DF|    SALVIO Eduardo|13.07.1990|    SALVIO|   167|    69| 13| 07|1990|Argentinos|
|Argentina| FW|      MESSI Lionel|24.06.1987|     MESSI|   170|    72| 24| 06|1987|Argentinos|
|Argentina| DF|  ANSALDI Cristian|20.09.1986|   ANSALDI|   181|    73| 20| 09|1986|Argentinos|
|Argentina| MF|      BIGLIA Lucas|30.01.1986|    BIGLIA|   175|    73| 30| 01|1986|Argentinos|
|Argentina| MF|       BANEGA Ever|29.06.1988|    B

In [None]:
# Teams grouped by continent. Nigeria belongs to Africa; Russia is listed
# under Europe here because it competes in UEFA.
africa = ['Senegal', 'Morocco', 'Tunisia', 'Egypt', 'Nigeria']
america_norte = ['Panama', 'Mexico', 'Costa Rica']
america_sul = ['Argentina', 'Peru', 'Uruguay', 'Brazil', 'Colombia']
asia = ['IR Iran', 'Korea Republic', 'Saudi Arabia', 'Japan']
europa = ['Russia', 'Sweden', 'Germany', 'France', 'Belgium', 'Croatia', 'Spain', 'Denmark', 'Iceland', 'Switzerland', 'England', 'Poland', 'Portugal', 'Serbia']
oceania = ['Australia']

# Conditions are evaluated in order; the first one that matches wins.
val = (when(col('team').isin(africa), 'africano')
       .when(col('team').isin(america_norte), 'n_americano')
       .when(col('team').isin(america_sul), 's_americano')
       .when(col('team').isin(asia), 'asiatico')
       .when(col('team').isin(europa), 'europeu')
       .when(col('team').isin(oceania), 'oceanicos')
       .otherwise('desconhecidos'))

df.withColumn('new_col', val).sample(fraction=0.01).show(5)

+---------+---+-----------------+----------+----------+------+------+---+---+----+-----------+
|     team|pos|fifa popular name|birth date|shirt name|height|weight|dia|mes| ano|    new_col|
+---------+---+-----------------+----------+----------+------+------+---+---+----+-----------+
|Argentina| MF|      BANEGA Ever|29.06.1988|    BANEGA|   175|    73| 29| 06|1988|s_americano|
|  Belgium| GK| COURTOIS Thibaut|11.05.1992|  COURTOIS|   199|    91| 11| 05|1992|    europeu|
|  Denmark| FW|      SISTO Pione|04.02.1995|     SISTO|   173|    69| 04| 02|1995|    europeu|
|    Japan| MF|    KAGAWA Shinji|17.03.1989|    KAGAWA|   175|    68| 17| 03|1989|   asiatico|
|  Senegal| FW|       MANE Sadio|10.04.1992|      MANE|   175|    69| 10| 04|1992|   africano|
+---------+---+-----------------+----------+----------+------+------+---+---+----+-----------+
only showing top 5 rows

time: 459 ms (started: 2023-09-19 21:37:35 +00:00)
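
A chained when() behaves like an if/elif/else ladder: branches are tested in order, the first match decides the value, and rows that match nothing get the otherwise() value (or null if otherwise() is omitted). The plain-Python sketch below mirrors that rule for two of the lists above; `first_match` is a hypothetical helper, not a PySpark function.

```python
def first_match(value, branches, default=None):
    """Return the result of the first branch whose predicate accepts `value`."""
    for predicate, result in branches:
        if predicate(value):
            return result
    return default  # plays the role of otherwise(); None stands in for null


america_sul = ['Argentina', 'Peru', 'Uruguay', 'Brazil', 'Colombia']
oceania = ['Australia']

branches = [
    (lambda team: team in america_sul, 's_americano'),
    (lambda team: team in oceania, 'oceanicos'),
]

print(first_match('Peru', branches, default='desconhecidos'))   # s_americano
print(first_match('Japan', branches, default='desconhecidos'))  # desconhecidos
print(first_match('Japan', branches))                           # None
```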


# Statement: UNION

The union() function matches columns by position, not by name, and requires both dataframes to have the same number of columns. Given a dataframe df_x and another df_y, it stacks the rows of df_y below those of df_x, pairing the first column of df_x with the first column of df_y, the second with the second, and so on. It does not check the column names, so it is up to the programmer to make sure that columns in the same position hold the same kind of data; otherwise the result will not make sense. When the columns come in a different order, unionByName() matches them by name instead. Like UNION ALL in SQL, union() keeps duplicate rows.
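
The difference between matching by position and matching by name can be illustrated without Spark. The sketch below uses lists of tuples as stand-ins for dataframes; the helper names and the two one-row tables are hypothetical and only mimic the alignment rule.

```python
def union_by_position(cols_x, rows_x, cols_y, rows_y):
    """Stack rows_y under rows_x, ignoring column names (only the count is checked)."""
    if len(cols_x) != len(cols_y):
        raise ValueError('both tables must have the same number of columns')
    return cols_x, rows_x + rows_y


def union_by_name(cols_x, rows_x, cols_y, rows_y):
    """Stack rows_y under rows_x after reordering its columns to match cols_x."""
    if set(cols_x) != set(cols_y):
        raise ValueError('both tables must have the same column names')
    order = [cols_y.index(name) for name in cols_x]
    reordered = [tuple(row[i] for i in order) for row in rows_y]
    return cols_x, rows_x + reordered


cols_x, rows_x = ['team', 'height'], [('Peru', 169)]
cols_y, rows_y = ['height', 'team'], [(186, 'Brazil')]

# By position: the second row ends up with its values in the wrong columns.
print(union_by_position(cols_x, rows_x, cols_y, rows_y)[1])  # [('Peru', 169), (186, 'Brazil')]

# By name: the columns of the second table are realigned first.
print(union_by_name(cols_x, rows_x, cols_y, rows_y)[1])      # [('Peru', 169), ('Brazil', 186)]
```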

I will create two dataframes with countries from the Americas and concatenate them into a single one. First, though, I need to create a new column with the continent each country belongs to.

In [None]:
df = df.withColumn('continent', val)
df.sample(fraction=0.01).show(5)

+----------+---+------------------+----------+------------+------+------+---+---+----+-----------+
|      team|pos| fifa popular name|birth date|  shirt name|height|weight|dia|mes| ano|  continent|
+----------+---+------------------+----------+------------+------+------+---+---+----+-----------+
| Australia| FW|PETRATOS Dimitrios|10.11.1992|    PETRATOS|   176|    72| 10| 11|1992|  oceanicos|
| Australia| FW|    NABBOUT Andrew|17.12.1992|     NABBOUT|   178|    85| 17| 12|1992|  oceanicos|
|Costa Rica| DF|  MATARRITA Ronald|09.07.1994|R. MATARRITA|   175|    70| 09| 07|1994|n_americano|
|   Croatia| FW|    KALINIC Nikola|05.01.1988|  N. KALINIĆ|   187|    81| 05| 01|1988|    europeu|
|    France| FW|   DEMBELE Ousmane|15.05.1997|     DEMBELE|   178|    70| 15| 05|1997|    europeu|
+----------+---+------------------+----------+------------+------+------+---+---+----+-----------+
only showing top 5 rows

time: 320 ms (started: 2023-09-19 21:37:35 +00:00)


Now I will create one dataframe with the South American countries and another with the North American ones.

In [None]:
s_america = df.where("continent = 's_americano'")
n_america = df.where("continent = 'n_americano'")

df_america = s_america.union(n_america)

time: 37.9 ms (started: 2023-09-19 21:37:36 +00:00)


In [None]:
print(s_america.count())
print(n_america.count())
print(df_america.count())
df_america.sample(fraction=0.04).show(15)

115
69
184
+----------+---+------------------+----------+-------------+------+------+---+---+----+-----------+
|      team|pos| fifa popular name|birth date|   shirt name|height|weight|dia|mes| ano|  continent|
+----------+---+------------------+----------+-------------+------+------+---+---+----+-----------+
| Argentina| GK|CABALLERO Wilfredo|28.09.1981|    CABALLERO|   186|    80| 28| 09|1981|s_americano|
|    Brazil| DF|           MIRANDA|07.09.1984|      MIRANDA|   186|    78| 07| 09|1984|s_americano|
|  Colombia| FW|      BACCA Carlos|08.09.1986|        BACCA|   181|    77| 08| 09|1986|s_americano|
|  Colombia| GK|     VARGAS Camilo|09.03.1989|    C. VARGAS|   185|    80| 09| 03|1989|s_americano|
|      Peru| DF|     TRAUCO Miguel|25.08.1992|       TRAUCO|   169|    74| 25| 08|1992|s_americano|
|      Peru| FW|    CARRILLO Andre|14.06.1991|     CARRILLO|   182|    77| 14| 06|1991|s_americano|
|      Peru| FW|  FARFAN Jefferson|26.10.1984|       FARFAN|   177|    85| 26| 10|1984|s_