# Lesson 1 - Getting to Know Spark

Apache Spark is a distributed data processing framework that is highly efficient at handling large volumes of data. It stands out for its speed, mainly thanks to in-memory processing, and supports a range of workloads, including data analysis, machine learning, and stream processing. Widely adopted across many industries, Spark makes large-scale data manipulation easier, with a flexible API and support for several programming languages.

In [46]:
from pyspark.sql import SparkSession

from pyspark.sql import Row, DataFrame
from pyspark.sql.types import StringType, StructType, StructField, IntegerType
from pyspark.sql.functions import col, expr, lit, substring, concat, concat_ws, when, coalesce
from pyspark.sql import functions as F
from functools import reduce

Initializing Apache Spark

In [1]:
spark = SparkSession.builder \
      .master("local[*]") \
      .appName("postech") \
      .getOrCreate()

Let's test the connection to Spark

In [7]:
df = spark.sql("SELECT 'Sucesso total, estamos online!' AS hello")
df.show()

+--------------------+
|               hello|
+--------------------+
|Sucesso total, es...|
+--------------------+



## Data Manipulation

In [8]:
df = spark.read.csv('data/banklist.csv', sep = ',', inferSchema = True, header = True)

print(f"df.count {df.count()}")
print(f"df.col ct {len(df.columns)}")
print(f"df.columns {df.columns}")

df.count 553
df.col ct 7
df.columns ['Bank Name', 'City', 'ST', 'CERT', 'Acquiring Institution', 'Closing Date', 'Updated Date']
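
With `inferSchema = True`, Spark makes an extra pass over the file to guess the column types. As a hedged alternative, the sketch below builds a DDL schema string and passes it to the reader. The names `BANKLIST_FIELDS`, `build_ddl_schema` and `read_banklist` are hypothetical helpers, not part of PySpark. The sketch assumes that `CERT` is an integer, that every other column is a string, and that the reader's `schema` argument accepts a DDL-formatted string with backtick-quoted names.

```python
# Assumed column types for banklist.csv: CERT as INT, everything else as STRING.
BANKLIST_FIELDS = [
    ("Bank Name", "STRING"),
    ("City", "STRING"),
    ("ST", "STRING"),
    ("CERT", "INT"),
    ("Acquiring Institution", "STRING"),
    ("Closing Date", "STRING"),
    ("Updated Date", "STRING"),
]


def build_ddl_schema(fields):
    """Join (name, type) pairs into a DDL string, quoting names with backticks."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in fields)


def read_banklist(spark, path="data/banklist.csv"):
    """Hypothetical helper: read the CSV with an explicit schema instead of inferSchema."""
    return spark.read.csv(
        path,
        schema=build_ddl_schema(BANKLIST_FIELDS),
        sep=",",
        header=True,
    )


print(build_ddl_schema(BANKLIST_FIELDS))
```

Calling `read_banklist(spark)` with the session created earlier should return a DataFrame with the same columns, without the inference pass.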


We can query the data using SQL:

In [12]:
df.createOrReplaceTempView('banklist')

df_check = spark.sql("SELECT `Bank Name`, City, `Closing Date` from banklist")
df_check.show(5, truncate=False)

+------------------------------------------------------+------------------+------------+
|Bank Name                                             |City              |Closing Date|
+------------------------------------------------------+------------------+------------+
|Fayette County Bank                                   |Saint Elmo        |26-May-17   |
|Guaranty Bank, (d/b/a BestBank in Georgia & Michigan) |Milwaukee         |5-May-17    |
|First NBC Bank                                        |New Orleans       |28-Apr-17   |
|Proficio Bank                                         |Cottonwood Heights|3-Mar-17    |
|Seaway Bank and Trust Company                         |Chicago           |27-Jan-17   |
+------------------------------------------------------+------------------+------------+
only showing top 5 rows



### Basic Functions

Some commands are similar to those in Pandas and Polars, such as `describe`

In [6]:
df.describe().show()

+-------+--------------------+-------+----+-----------------+---------------------+------------+------------+
|summary|           Bank Name|   City|  ST|             CERT|Acquiring Institution|Closing Date|Updated Date|
+-------+--------------------+-------+----+-----------------+---------------------+------------+------------+
|  count|                 553|    553| 553|              553|                  553|         553|         553|
|   mean|                NULL|   NULL|NULL|31729.65280289331|                 NULL|        NULL|        NULL|
| stddev|                NULL|   NULL|NULL|16420.59489355429|                 NULL|        NULL|        NULL|
|    min|1st American Stat...|Acworth|  AL|               91|      1st United Bank|    1-Aug-08|    1-Aug-13|
|    max|               ebank|Wyoming|  WY|            58701|  Your Community Bank|    9-Sep-11|    9-Sep-12|
+-------+--------------------+-------+----+-----------------+---------------------+------------+------------+
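
Note that the `min` and `max` shown for `Closing Date` and `Updated Date` are lexicographic, not chronological, because the values are compared as strings: `1-Aug-08` sorts first simply because it starts with `1`. A small plain-Python sketch illustrates the difference, using the five closing dates from the earlier query output and assuming an English locale for the month abbreviations:

```python
from datetime import datetime

closing_dates = ["26-May-17", "5-May-17", "28-Apr-17", "3-Mar-17", "27-Jan-17"]

# Comparing the raw strings orders them character by character.
print(min(closing_dates), max(closing_dates))

# Parsing the strings first gives the chronological answer.
parsed = [datetime.strptime(d, "%d-%b-%y").date() for d in closing_dates]
print(min(parsed), max(parsed))
```

In Spark, the usual remedy is to convert such columns to a date type (for example with `to_date` from `pyspark.sql.functions`) before aggregating.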



We can run the command on only some of the columns

In [13]:
df.describe('City', 'ST').show()

+-------+-------+----+
|summary|   City|  ST|
+-------+-------+----+
|  count|    553| 553|
|   mean|   NULL|NULL|
| stddev|   NULL|NULL|
|    min|Acworth|  AL|
|    max|Wyoming|  WY|
+-------+-------+----+



Other attributes and methods let us extract metadata from our DataFrame

In [15]:
print(f"Count: {df.count()}")
print(f"Columns: {df.columns}")
print(f"DTypes: {df.dtypes}")
print(f"Schema: {df.schema}")

Count: 553
Columns: ['Bank Name', 'City', 'ST', 'CERT', 'Acquiring Institution', 'Closing Date', 'Updated Date']
DTypes: [('Bank Name', 'string'), ('City', 'string'), ('ST', 'string'), ('CERT', 'int'), ('Acquiring Institution', 'string'), ('Closing Date', 'string'), ('Updated Date', 'string')]
Schema: StructType([StructField('Bank Name', StringType(), True), StructField('City', StringType(), True), StructField('ST', StringType(), True), StructField('CERT', IntegerType(), True), StructField('Acquiring Institution', StringType(), True), StructField('Closing Date', StringType(), True), StructField('Updated Date', StringType(), True)])


We can view the DataFrame's schema in a more readable form:

In [16]:
df.printSchema()

root
 |-- Bank Name: string (nullable = true)
 |-- City: string (nullable = true)
 |-- ST: string (nullable = true)
 |-- CERT: integer (nullable = true)
 |-- Acquiring Institution: string (nullable = true)
 |-- Closing Date: string (nullable = true)
 |-- Updated Date: string (nullable = true)



We can also remove duplicate records

In [17]:
df = df.dropDuplicates()
print(f"Count: {df.count()}")

Count: 553


### Selecting Specific Columns

We can pass a list of columns or just the names separated by commas

In [20]:
df2 = df.select(['Bank Name', 'City'])
df2.show(5, truncate=False)

+------------------------------------------------------------------------------------------+----------+
|Bank Name                                                                                 |City      |
+------------------------------------------------------------------------------------------+----------+
|InBank                                                                                    |Oak Forest|
|Bank of Alamo                                                                             |Alamo     |
|First Community Bank of Southwest Florida (also operating as Community Bank of Cape Coral)|Fort Myers|
|The National Republic Bank of Chicago                                                     |Chicago   |
|NOVA Bank                                                                                 |Berwyn    |
+------------------------------------------------------------------------------------------+----------+
only showing top 5 rows



In [21]:
df2 = df.select('Bank Name', 'City')
df2.show(5, truncate=False)

+------------------------------------------------------------------------------------------+----------+
|Bank Name                                                                                 |City      |
+------------------------------------------------------------------------------------------+----------+
|InBank                                                                                    |Oak Forest|
|Bank of Alamo                                                                             |Alamo     |
|First Community Bank of Southwest Florida (also operating as Community Bank of Cape Coral)|Fort Myers|
|The National Republic Bank of Chicago                                                     |Chicago   |
|NOVA Bank                                                                                 |Berwyn    |
+------------------------------------------------------------------------------------------+----------+
only showing top 5 rows



We can select all columns except a few using the following trick:

In [35]:
# Named `cols` so it does not shadow the `col` function imported from pyspark.sql.functions
cols = list(set(df.columns) - set(['CERT', 'ST']))
df2 = df.select(cols)
df2.show(5, truncate=True)

+----------+---------------------+--------------------+------------+------------+
|      City|Acquiring Institution|           Bank Name|Updated Date|Closing Date|
+----------+---------------------+--------------------+------------+------------+
|Oak Forest| MB Financial Bank...|              InBank|   17-Oct-15|    4-Sep-09|
|     Alamo|          No Acquirer|       Bank of Alamo|   18-Mar-05|    8-Nov-02|
|Fort Myers|              C1 Bank|First Community B...|    9-Feb-17|    2-Aug-13|
|   Chicago|  State Bank of Texas|The National Repu...|    6-Jan-16|   24-Oct-14|
|    Berwyn|          No Acquirer|           NOVA Bank|   24-Jan-13|   26-Oct-12|
+----------+---------------------+--------------------+------------+------------+
only showing top 5 rows
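
A Python `set` does not keep insertion order, which is why the columns above come back shuffled. If the original order matters, a list comprehension is a simple alternative. This sketch uses the column names printed earlier and a hypothetical `excluded` set:

```python
columns = ['Bank Name', 'City', 'ST', 'CERT', 'Acquiring Institution', 'Closing Date', 'Updated Date']
excluded = {'CERT', 'ST'}

# Keep every column that is not excluded, preserving the original order.
kept = [c for c in columns if c not in excluded]
print(kept)
```

Passing `kept` to `df.select` would then return the remaining columns in their original order.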



### Renaming, Adding, and Dropping Columns

We use the `withColumnRenamed` function to rename columns

In [39]:
df2 = df \
    .withColumnRenamed('Bank Name',                 'bank_name') \
    .withColumnRenamed('City',                      'city') \
    .withColumnRenamed('ST',                        'state') \
    .withColumnRenamed('CERT',                      'cert') \
    .withColumnRenamed('Acquiring Institution',     'acquiring_institution') \
    .withColumnRenamed('Closing Date',              'closing_date') \
    .withColumnRenamed('Updated Date',              'update_date')

df2.show(5, truncate=True)

+--------------------+----------+-----+-----+---------------------+------------+-----------+
|           bank_name|      city|state| cert|acquiring_institution|closing_date|update_date|
+--------------------+----------+-----+-----+---------------------+------------+-----------+
|              InBank|Oak Forest|   IL|20203| MB Financial Bank...|    4-Sep-09|  17-Oct-15|
|       Bank of Alamo|     Alamo|   TN| 9961|          No Acquirer|    8-Nov-02|  18-Mar-05|
|First Community B...|Fort Myers|   FL|34943|              C1 Bank|    2-Aug-13|   9-Feb-17|
|The National Repu...|   Chicago|   IL|  916|  State Bank of Texas|   24-Oct-14|   6-Jan-16|
|           NOVA Bank|    Berwyn|   PA|27148|          No Acquirer|   26-Oct-12|  24-Jan-13|
+--------------------+----------+-----+-----+---------------------+------------+-----------+
only showing top 5 rows
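
When many columns need renaming, the chained calls can be driven by a dictionary instead. The sketch below is one possible approach, not the notebook's method: `rename_columns` is a hypothetical helper that relies only on `withColumnRenamed`, and `RENAMES` mirrors the mapping used above.

```python
RENAMES = {
    'Bank Name': 'bank_name',
    'City': 'city',
    'ST': 'state',
    'CERT': 'cert',
    'Acquiring Institution': 'acquiring_institution',
    'Closing Date': 'closing_date',
    'Updated Date': 'update_date',
}


def rename_columns(df, mapping):
    """Apply withColumnRenamed once per (old, new) pair and return the result."""
    for old_name, new_name in mapping.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df
```

With the DataFrame from this lesson, `df2 = rename_columns(df, RENAMES)` should produce the same column names as the chained version.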



We use the `withColumn` function to create a new column

In [48]:
df.withColumn('state', col('ST')).show(5)

+--------------------+----------+---+-----+---------------------+------------+------------+-----+
|           Bank Name|      City| ST| CERT|Acquiring Institution|Closing Date|Updated Date|state|
+--------------------+----------+---+-----+---------------------+------------+------------+-----+
|              InBank|Oak Forest| IL|20203| MB Financial Bank...|    4-Sep-09|   17-Oct-15|   IL|
|       Bank of Alamo|     Alamo| TN| 9961|          No Acquirer|    8-Nov-02|   18-Mar-05|   TN|
|First Community B...|Fort Myers| FL|34943|              C1 Bank|    2-Aug-13|    9-Feb-17|   FL|
|The National Repu...|   Chicago| IL|  916|  State Bank of Texas|   24-Oct-14|    6-Jan-16|   IL|
|           NOVA Bank|    Berwyn| PA|27148|          No Acquirer|   26-Oct-12|   24-Jan-13|   PA|
+--------------------+----------+---+-----+---------------------+------------+------------+-----+
only showing top 5 rows



We can add a column with a constant value

In [49]:
df.withColumn('country', lit('EUA')).show(5)

+--------------------+----------+---+-----+---------------------+------------+------------+-------+
|           Bank Name|      City| ST| CERT|Acquiring Institution|Closing Date|Updated Date|country|
+--------------------+----------+---+-----+---------------------+------------+------------+-------+
|              InBank|Oak Forest| IL|20203| MB Financial Bank...|    4-Sep-09|   17-Oct-15|    EUA|
|       Bank of Alamo|     Alamo| TN| 9961|          No Acquirer|    8-Nov-02|   18-Mar-05|    EUA|
|First Community B...|Fort Myers| FL|34943|              C1 Bank|    2-Aug-13|    9-Feb-17|    EUA|
|The National Repu...|   Chicago| IL|  916|  State Bank of Texas|   24-Oct-14|    6-Jan-16|    EUA|
|           NOVA Bank|    Berwyn| PA|27148|          No Acquirer|   26-Oct-12|   24-Jan-13|    EUA|
+--------------------+----------+---+-----+---------------------+------------+------------+-------+
only showing top 5 rows



To drop columns we use `drop`, or `reduce` combined with `drop`

In [51]:
df.drop('ST', 'CERT').show(5)

+--------------------+----------+---------------------+------------+------------+
|           Bank Name|      City|Acquiring Institution|Closing Date|Updated Date|
+--------------------+----------+---------------------+------------+------------+
|              InBank|Oak Forest| MB Financial Bank...|    4-Sep-09|   17-Oct-15|
|       Bank of Alamo|     Alamo|          No Acquirer|    8-Nov-02|   18-Mar-05|
|First Community B...|Fort Myers|              C1 Bank|    2-Aug-13|    9-Feb-17|
|The National Repu...|   Chicago|  State Bank of Texas|   24-Oct-14|    6-Jan-16|
|           NOVA Bank|    Berwyn|          No Acquirer|   26-Oct-12|   24-Jan-13|
+--------------------+----------+---------------------+------------+------------+
only showing top 5 rows



In [52]:
reduce(DataFrame.drop, ['ST', 'CERT'], df).show(5)

+--------------------+----------+---------------------+------------+------------+
|           Bank Name|      City|Acquiring Institution|Closing Date|Updated Date|
+--------------------+----------+---------------------+------------+------------+
|              InBank|Oak Forest| MB Financial Bank...|    4-Sep-09|   17-Oct-15|
|       Bank of Alamo|     Alamo|          No Acquirer|    8-Nov-02|   18-Mar-05|
|First Community B...|Fort Myers|              C1 Bank|    2-Aug-13|    9-Feb-17|
|The National Repu...|   Chicago|  State Bank of Texas|   24-Oct-14|    6-Jan-16|
|           NOVA Bank|    Berwyn|          No Acquirer|   26-Oct-12|   24-Jan-13|
+--------------------+----------+---------------------+------------+------------+
only showing top 5 rows
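
`reduce` folds the list of names into the DataFrame: it calls `drop` with the first name, then calls `drop` on that result with the second name, and so on. The sketch below spells out the same fold as an explicit loop. `drop_each` and `drop_each_reduce` are hypothetical helper names, and the code assumes only that the object passed in has a `drop` method that takes a column name.

```python
from functools import reduce


def drop_each(df, names):
    """Loop equivalent of reduce(DataFrame.drop, names, df)."""
    result = df
    for name in names:
        result = result.drop(name)
    return result


def drop_each_reduce(df, names):
    """The same fold written with reduce and a lambda instead of DataFrame.drop."""
    return reduce(lambda acc, name: acc.drop(name), names, df)
```

Because `drop` already accepts several column names at once, `df.drop(*names)` is usually the simpler choice when the names live in a list.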



### Filtering Data

To filter the DataFrame we use the `where` function

In [53]:
df.where(col('ST') == 'IL').show(5)

+--------------------+----------+---+-----+---------------------+------------+------------+
|           Bank Name|      City| ST| CERT|Acquiring Institution|Closing Date|Updated Date|
+--------------------+----------+---+-----+---------------------+------------+------------+
|              InBank|Oak Forest| IL|20203| MB Financial Bank...|    4-Sep-09|   17-Oct-15|
|The National Repu...|   Chicago| IL|  916|  State Bank of Texas|   24-Oct-14|    6-Jan-16|
|First National Ba...|  Danville| IL| 3644| First Financial B...|    2-Jul-09|   20-Aug-12|
|    Bank of Illinois|    Normal| IL| 9268| Heartland Bank an...|    5-Mar-10|   23-Aug-12|
|       Meridian Bank|    Eldred| IL|13789|        National Bank|   10-Oct-08|   31-May-12|
+--------------------+----------+---+-----+---------------------+------------+------------+
only showing top 5 rows



In [54]:
df.where(col('ST').isin('IL', 'CA')).show(5)

+--------------------+----------+---+-----+---------------------+------------+------------+
|           Bank Name|      City| ST| CERT|Acquiring Institution|Closing Date|Updated Date|
+--------------------+----------+---+-----+---------------------+------------+------------+
|              InBank|Oak Forest| IL|20203| MB Financial Bank...|    4-Sep-09|   17-Oct-15|
|The National Repu...|   Chicago| IL|  916|  State Bank of Texas|   24-Oct-14|    6-Jan-16|
|First National Ba...|  Danville| IL| 3644| First Financial B...|    2-Jul-09|   20-Aug-12|
|    Bank of Illinois|    Normal| IL| 9268| Heartland Bank an...|    5-Mar-10|   23-Aug-12|
|       Meridian Bank|    Eldred| IL|13789|        National Bank|   10-Oct-08|   31-May-12|
+--------------------+----------+---+-----+---------------------+------------+------------+
only showing top 5 rows



In [57]:
df.where(col('CERT').between(1000, 2000)).show(10)

+--------------------+-------------+---+----+---------------------+------------+------------+
|           Bank Name|         City| ST|CERT|Acquiring Institution|Closing Date|Updated Date|
+--------------------+-------------+---+----+---------------------+------------+------------+
|Barnes Banking Co...|    Kaysville| UT|1252|          No Acquirer|   15-Jan-10|   23-Aug-12|
|     Mainstreet Bank|  Forest Lake| MN|1909|         Central Bank|   28-Aug-09|   21-Aug-12|
|     Bank of Ephraim|      Ephraim| UT|1249|        Far West Bank|   25-Jun-04|    9-Apr-08|
| Citizens State Bank|New Baltimore| MI|1006|          No Acquirer|   18-Dec-09|   21-Mar-14|
|      Heartland Bank|      Leawood| KS|1361|         Metcalf Bank|   20-Jul-12|   30-Jul-13|
|Glasgow Savings Bank|      Glasgow| MO|1056| Regional Missouri...|   13-Jul-12|   19-Aug-14|
|           Hume Bank|         Hume| MO|1971|        Security Bank|    7-Mar-08|   28-Aug-12|
| Fayette County Bank|   Saint Elmo| IL|1802| United Fidelit

We can use logical operators inside `where`

In [67]:
df.where((col('ST').like('A%')) & (col('CERT') > 50_000)).show(10)

+--------------------+-----------+---+-----+---------------------+------------+------------+
|           Bank Name|       City| ST| CERT|Acquiring Institution|Closing Date|Updated Date|
+--------------------+-----------+---+-----+---------------------+------------+------------+
|Towne Bank of Ari...|       Mesa| AZ|57697| Commerce Bank of ...|    7-May-10|   23-Aug-12|
|Western National ...|    Phoenix| AZ|57917|   Washington Federal|   16-Dec-11|    5-Feb-15|
| First Southern Bank| Batesville| AR|58052|        Southern Bank|   17-Dec-10|   20-Aug-12|
|Community Bank of...|    Phoenix| AZ|57645|        MidFirst Bank|   14-Aug-09|   21-Aug-12|
|Valley Capital Ba...|       Mesa| AZ|58399| Enterprise Bank &...|   11-Dec-09|   20-Oct-16|
|   Desert Hills Bank|    Phoenix| AZ|57060| New York Communit...|   26-Mar-10|   23-Aug-12|
|    Gold Canyon Bank|Gold Canyon| AZ|58066| First Scottsdale ...|    5-Apr-13|    7-Oct-15|
|         Legacy Bank| Scottsdale| AZ|57820| Enterprise Bank &...|    
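
Each comparison in the condition above is wrapped in parentheses because Python's `&` binds more tightly than `>` or `==`. A plain-integer sketch (no Spark involved; the value of `x` is arbitrary) shows how the grouping changes the result:

```python
x = 6

# Intended meaning: both comparisons are evaluated first, then combined.
with_parens = (x > 1) & (x == 6)

# Without parentheses Python reads this as x > (1 & x) == 6,
# a chained comparison around the bitwise AND.
without_parens = x > 1 & x == 6

print(with_parens, without_parens)
```

With Spark columns the unparenthesized form typically raises an error or filters on the wrong expression, so parenthesizing every comparison is the safe habit.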

### Replacing Values

In [74]:
df.show(2)
print('Substituindo IL por SP')
df.na.replace('IL', 'SP').show(2)

+-------------+----------+---+-----+---------------------+------------+------------+
|    Bank Name|      City| ST| CERT|Acquiring Institution|Closing Date|Updated Date|
+-------------+----------+---+-----+---------------------+------------+------------+
|       InBank|Oak Forest| IL|20203| MB Financial Bank...|    4-Sep-09|   17-Oct-15|
|Bank of Alamo|     Alamo| TN| 9961|          No Acquirer|    8-Nov-02|   18-Mar-05|
+-------------+----------+---+-----+---------------------+------------+------------+
only showing top 2 rows

Substituindo IL por SP
+-------------+----------+---+-----+---------------------+------------+------------+
|    Bank Name|      City| ST| CERT|Acquiring Institution|Closing Date|Updated Date|
+-------------+----------+---+-----+---------------------+------------+------------+
|       InBank|Oak Forest| SP|20203| MB Financial Bank...|    4-Sep-09|   17-Oct-15|
|Bank of Alamo|     Alamo| TN| 9961|          No Acquirer|    8-Nov-02|   18-Mar-05|
+-------------+--