# ETL

## Fact Constellation Definition

**Dimension tables**

- dim_data (data_pk, data_completa, data_dia, data_mes, data_semestre, data_ano, data_decada)
- dim_localidade (localidade_pk, latitude, longitude, cidade, estado, regiao, pais)
- dim_tipo_cancer (tipo_cancer_pk, tipo_cancer, mortalidade, taxa_incidencia_total)
- dim_faixa_etaria (faixa_pk, faixa_idade, idade_min, idade_max, id_idade)
- dim_metrica (metrica_pk, tipo_metrica)
- dim_sexo (sexo_pk, sexo)

**Fact tables**

- fato_cancer (ano_pk, estado_pk, tipo_cancer_pk, sexo_pk, faixa_pk, metrica_pk, obitos_cancer, incidencia_cancer, prevalencia_cancer)
- fato_clima (data_pk, localidade_pk, temperatura_media, temperatura_max, temperatura_min, radiacao_uv, radiacao_uva, radiacao_uvb, precipitacao)

**Views**

- vw_cidade (localidade_pk, latitude, longitude, clima_cidade, clima_estado, clima_regiao, clima_pais)
- vw_estado (estado_pk, cancer_regiao, cancer_pais)
- vw_dia (data_pk, clima_data_completa, clima_dia, clima_mes, clima_semestre, clima_ano)
- vw_ano (ano_pk, cancer_ano, cancer_decada)
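To make the constellation easier to check, the sketch below restates the keys listed above as plain Python data and resolves each foreign key of a fact table to the dimension tables or views that expose it. The names `PRIMARY_KEY`, `FACT_KEYS` and `resolve_keys` are illustrative, and treating the first column of each table or view as its key is an assumption, not something the notebook states.

```python
# Key columns copied from the listing above; the first column of each
# dimension or view is assumed to be its primary key.
PRIMARY_KEY = {
    "dim_data": "data_pk",
    "dim_localidade": "localidade_pk",
    "dim_tipo_cancer": "tipo_cancer_pk",
    "dim_faixa_etaria": "faixa_pk",
    "dim_metrica": "metrica_pk",
    "dim_sexo": "sexo_pk",
    "vw_cidade": "localidade_pk",
    "vw_estado": "estado_pk",
    "vw_dia": "data_pk",
    "vw_ano": "ano_pk",
}

# Foreign-key columns of each fact table (measures omitted).
FACT_KEYS = {
    "fato_cancer": ["ano_pk", "estado_pk", "tipo_cancer_pk",
                    "sexo_pk", "faixa_pk", "metrica_pk"],
    "fato_clima": ["data_pk", "localidade_pk"],
}


def resolve_keys(fact: str) -> dict[str, list[str]]:
    """Map each foreign key of a fact table to the tables/views keyed by it."""
    return {
        key: [table for table, pk in PRIMARY_KEY.items() if pk == key]
        for key in FACT_KEYS[fact]
    }


for fact in FACT_KEYS:
    print(fact, resolve_keys(fact))
```

In this listing, `fato_cancer` reaches time and place only through `vw_ano` and `vw_estado` (year and state grain), while `fato_clima` is keyed at day and city grain.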

In [1]:
import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from dotenv import load_dotenv
import psycopg2
import os

# Python 3.11
# Java 11
# PySpark == 3.4

In [2]:
# Local Spark session wired to a MinIO endpoint through the S3A connector.
spark = (
    SparkSession.builder
    .appName("OLAP - P2")
    # hadoop-aws 3.3.4 is built against aws-java-sdk-bundle 1.12.262.
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.3.4,"
            "com.amazonaws:aws-java-sdk-bundle:1.12.262")
    # Machine-specific path to the PostgreSQL JDBC driver; adjust it per environment.
    .config("spark.jars", "/home/rodrigo/jars/postgresql-42.7.3.jar")
    .config("spark.jars.ivyLogLevel", "ERROR")
    # MinIO running locally over plain HTTP with its default credentials.
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

print(spark.version)
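
The session above hard-codes the MinIO endpoint and credentials. Since the notebook already imports `os` and `load_dotenv`, one option is to read them from the environment instead. The sketch below uses only the standard library; the variable names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` and the helper `s3a_options` are hypothetical, and the defaults simply mirror the values used above.

```python
import os


def s3a_options(env=os.environ) -> dict[str, str]:
    """Build the S3A settings from environment variables (hypothetical names)."""
    endpoint = env.get("MINIO_ENDPOINT", "http://localhost:9000")
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": env.get("MINIO_ACCESS_KEY", "minioadmin"),
        "spark.hadoop.fs.s3a.secret.key": env.get("MINIO_SECRET_KEY", "minioadmin"),
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # SSL follows the endpoint scheme instead of being fixed to "false".
        "spark.hadoop.fs.s3a.connection.ssl.enabled":
            str(endpoint.startswith("https://")).lower(),
    }


# Print the resulting settings, masking the secret.
for key, value in s3a_options().items():
    print(key, "=", "***" if "secret" in key else value)
```

Each pair can then be passed to `SparkSession.builder.config(key, value)` in a loop before calling `getOrCreate()`.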

25/06/25 10:15:32 WARN DependencyUtils: Local jar /home/rodrigo/jars/postgresql-42.7.3.jar does not exist, skipping.
25/06/25 10:15:33 INFO SparkContext: Running Spark version 3.3.2

## Extract

In [4]:
schema_cancer = StructType([
    StructField("measure_id", IntegerType(), True),
    StructField("measure_name", StringType(), True),
    StructField("location_id", IntegerType(), True),
    StructField("location_name", StringType(), True),
    StructField("sex_id", IntegerType(), True),
    StructField("sex_name", StringType(), True),
    StructField("age_id", IntegerType(), True),
    StructField("age_name", StringType(), True),
    StructField("cause_id", IntegerType(), True),
    StructField("cause_name", StringType(), True),
    StructField("metric_id", IntegerType(), True),
    StructField("metric_name", StringType(), True),
    StructField("year", IntegerType(), True),
    StructField("val", DoubleType(), True),
    StructField("upper", DoubleType(), True),
    StructField("lower", DoubleType(), True),
])

# An explicit schema is supplied, so inferSchema is dropped (it has no effect when a schema is given).
df_cancer1 = spark.read.csv("data/cancer-1.csv", schema=schema_cancer, header=True)
df_cancer2 = spark.read.csv("data/cancer-2.csv", schema=schema_cancer, header=True)
# Both files use the same schema, so a positional union is safe.
df_cancer = df_cancer1.union(df_cancer2)

print("Number of rows:", df_cancer.count())
df_cancer.show()
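
When an explicit schema is passed together with `header=True`, Spark's CSV reader applies the schema by position (the `enforceSchema` option defaults to `true`), so a file with reordered columns would still load rather than fail. A quick standard-library check of the header line can catch that before the read. The helper `header_mismatches` and the in-memory sample are illustrative; the expected names are the fields of `schema_cancer`.

```python
import csv
import io

# Field names of schema_cancer, in order.
EXPECTED = [
    "measure_id", "measure_name", "location_id", "location_name",
    "sex_id", "sex_name", "age_id", "age_name",
    "cause_id", "cause_name", "metric_id", "metric_name",
    "year", "val", "upper", "lower",
]


def header_mismatches(lines, expected):
    """Return (position, expected, found) for every header column that differs."""
    header = next(csv.reader(lines))
    width = max(len(header), len(expected))
    # Pad the shorter list so missing or extra columns are reported too.
    found = header + [None] * (width - len(header))
    wanted = list(expected) + [None] * (width - len(expected))
    return [(i, w, f) for i, (w, f) in enumerate(zip(wanted, found)) if w != f]


# Illustrative header with "upper" and "lower" swapped.
sample = io.StringIO(",".join(EXPECTED[:-2] + ["lower", "upper"]) + "\n")
print(header_mismatches(sample, EXPECTED))
```

For a real file, pass `open("data/cancer-1.csv", newline="")` in place of the in-memory sample; an empty list means the header matches the schema order.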


In [6]:
schema_clima = StructType([
    StructField("cidade", StringType(), True),
    StructField("latitude", DoubleType(), True),
    StructField("longitude", DoubleType(), True),
    StructField("YEAR", IntegerType(), True),
    StructField("DOY", IntegerType(), True),
    StructField("T2M", DoubleType(), True),
    StructField("T2M_MAX", DoubleType(), True),
    StructField("T2M_MIN", DoubleType(), True),
    StructField("ALLSKY_SFC_UV_INDEX", DoubleType(), True),
    StructField("ALLSKY_SFC_UVA", DoubleType(), True),
    StructField("ALLSKY_SFC_UVB", DoubleType(), True),
    StructField("PRECTOTCORR", DoubleType(), True),
    StructField("codigo_ibge", IntegerType(), True),
    StructField("capital", StringType(), True),
    StructField("estado", StringType(), True),
])

# An explicit schema is supplied, so inferSchema is dropped (it has no effect when a schema is given).
df_clima = spark.read.csv("data/climate.csv", schema=schema_clima, header=True)

print("Number of rows:", df_clima.count())
df_clima.show()


## Transform

In [8]:
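# Build the date dimension from the distinct (YEAR, DOY) pairs in the climate data.
df_data = df_clima.withColumn(
    "data_completa",
    # January 1st of YEAR plus (DOY - 1) days gives the calendar date.
    F.expr("date_add(to_date(concat(YEAR, '-01-01')), DOY - 1)")
).select(
    "data_completa",
    F.dayofmonth("data_completa").alias("data_dia"),
    F.month("data_completa").alias("data_mes"),
    # Semester 1 covers January-June, semester 2 covers July-December.
    (((F.month("data_completa") - 1) / 6).cast("int") + 1).alias("data_semestre"),
    F.year("data_completa").alias("data_ano")
).distinct()

# Surrogate key in chronological order. A window without partitionBy moves all rows
# to a single partition, which is acceptable for a dimension this small.
windowSpec = Window.orderBy("data_completa")

df_data = df_data.withColumn(
    "data_pk",
    F.row_number().over(windowSpec)
)

# Decade of each date, e.g. 2017 -> 2010.
df_data = df_data.withColumn(
    "data_decada",
    (F.floor(F.col("data_ano") / 10) * 10).cast("int")
)

df_data = df_data.select("data_pk", "data_completa", "data_dia", "data_mes", "data_semestre", "data_ano", "data_decada")

print("Number of rows:", df_data.count())
df_data.show()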
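
The date arithmetic in this cell can be checked without Spark. The sketch below mirrors the same expressions for a single (YEAR, DOY) pair using `datetime`; the helper name `derive_date_attributes` is illustrative, and numbering semesters 1 and 2 is an assumption (drop the `+ 1` for a 0-based variant).

```python
from datetime import date, timedelta


def derive_date_attributes(year: int, doy: int) -> dict:
    """Mirror the dim_data expressions for one (YEAR, DOY) pair."""
    # January 1st plus (DOY - 1) days, as in the date_add expression.
    d = date(year, 1, 1) + timedelta(days=doy - 1)
    return {
        "data_completa": d,
        "data_dia": d.day,
        "data_mes": d.month,
        # Assumption: semesters are numbered 1 (Jan-Jun) and 2 (Jul-Dec).
        "data_semestre": (d.month - 1) // 6 + 1,
        "data_ano": d.year,
        # Floor the year to the start of its decade.
        "data_decada": d.year // 10 * 10,
    }


print(derive_date_attributes(2020, 60))   # day 60 of a leap year
print(derive_date_attributes(2021, 365))  # last day of a common year
```

Leap years are the case worth testing: day 60 falls on 29 February in 2020 but on 1 March in 2021.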


In [9]:
estado_para_regiao = {
    "Acre": "Norte",
    "Alagoas": "Nordeste",
    "Amapá": "Norte",
    "Amazonas": "Norte",
    "Bahia": "Nordeste",
    "Ceará": "Nordeste",
    "Distrito Federal": "Centro-Oeste",
    "Espírito Santo": "Sudeste",
    "Goiás": "Centro-Oeste",
    "Maranhão": "Nordeste",
    "Mato Grosso": "Centro-Oeste",
    "Mato Grosso do Sul": "Centro-Oeste",
    "Minas Gerais": "Sudeste",
    "Pará": "Norte",
    "Paraíba": "Nordeste",
    "Paraná": "Sul",
    "Pernambuco": "Nordeste",
    "Piauí": "Nordeste",
    "Rio de Janeiro": "Sudeste",
    "Rio Grande do Norte": "Nordeste",
    "Rio Grande do Sul": "Sul",
    "Rondônia": "Norte",
    "Roraima": "Norte",
    "Santa Catarina": "Sul",
    "São Paulo": "Sudeste",
    "Sergipe": "Nordeste",
    "Tocantins": "Norte"
}

# Build one state -> region map expression instead of chaining a withColumn per state.
regiao_map = F.create_map(
    *[F.lit(valor) for par in estado_para_regiao.items() for valor in par]
)

df_localidade = (
    df_clima.select("cidade", "estado", "latitude", "longitude")
            .distinct()
            .withColumn("pais", F.lit("Brasil"))
            # States missing from the map yield NULL, as in the original loop.
            .withColumn("regiao", regiao_map[F.col("estado")])
)

# Surrogate key assigned in a deterministic order.
windowSpec = Window.orderBy("cidade", "estado", "latitude", "longitude")
df_localidade = (
    df_localidade.withColumn("localidade_pk", F.row_number().over(windowSpec))
                .select("localidade_pk", "latitude", "longitude", "cidade", "estado", "regiao", "pais")
)

print("Number of rows:", df_localidade.count())
df_localidade.show(truncate=False)
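
The same steps (deduplicate, look up the region, number the rows in a fixed order) can be traced on a handful of rows in plain Python. The rows and coordinates below are hypothetical, `REGION_BY_STATE` is only an excerpt of the full mapping, and `build_localidade` is an illustrative helper rather than part of the pipeline.

```python
# Excerpt of the state -> region mapping used above.
REGION_BY_STATE = {"Acre": "Norte", "Bahia": "Nordeste", "São Paulo": "Sudeste"}

# Hypothetical (cidade, estado, latitude, longitude) rows, including a duplicate
# and a state that is absent from the mapping.
rows = [
    ("Salvador", "Bahia", -12.97, -38.50),
    ("Rio Branco", "Acre", -9.97, -67.81),
    ("Salvador", "Bahia", -12.97, -38.50),
    ("Campinas", "São Paulo", -22.91, -47.06),
    ("Springfield", "Unknown", 0.0, 0.0),
]


def build_localidade(rows, region_by_state):
    """Deduplicate, attach the region, and assign keys in sorted order."""
    # sorted(set(...)) plays the role of distinct() plus the ordered window.
    ordered = sorted(set(rows))
    return [
        {
            "localidade_pk": pk,
            "latitude": lat,
            "longitude": lon,
            "cidade": cidade,
            "estado": estado,
            # Unmapped states get None, matching the NULL produced in Spark.
            "regiao": region_by_state.get(estado),
            "pais": "Brasil",
        }
        for pk, (cidade, estado, lat, lon) in enumerate(ordered, start=1)
    ]


for record in build_localidade(rows, REGION_BY_STATE):
    print(record)
```

Rows whose `regiao` comes back empty point to state names that do not match the mapping keys exactly, for example because of missing accents.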




25/06/25 10:41:59 INFO TaskSetManager: Finished task 24.0 in stage 28.0 (TID 235) in 2777 ms on 10.255.255.254 (executor driver) (25/33)
25/06/25 10:41:59 INFO Executor: Finished task 22.0 in stage 28.0 (TID 233). 2772 bytes result sent to driver
25/06/25 10:41:59 INFO TaskSetManager: Finished task 22.0 in stage 28.0 (TID 233) in 3055 ms on 10.255.255.254 (executor driver) (26/33)




25/06/25 10:41:59 INFO Executor: Finished task 25.0 in stage 28.0 (TID 236). 2772 bytes result sent to driver
25/06/25 10:41:59 INFO TaskSetManager: Finished task 25.0 in stage 28.0 (TID 236) in 2618 ms on 10.255.255.254 (executor driver) (27/33)
25/06/25 10:42:00 INFO Executor: Finished task 26.0 in stage 28.0 (TID 237). 2772 bytes result sent to driver
25/06/25 10:42:00 INFO TaskSetManager: Finished task 26.0 in stage 28.0 (TID 237) in 2618 ms on 10.255.255.254 (executor driver) (28/33)
25/06/25 10:42:00 INFO Executor: Finished task 31.0 in stage 28.0 (TID 242). 2772 bytes result sent to driver
25/06/25 10:42:00 INFO TaskSetManager: Finished task 31.0 in stage 28.0 (TID 242) in 2405 ms on 10.255.255.254 (executor driver) (29/33)
25/06/25 10:42:00 INFO Executor: Finished task 30.0 in stage 28.0 (TID 241). 2772 bytes result sent to driver
25/06/25 10:42:00 INFO TaskSetManager: Finished task 30.0 in stage 28.0 (TID 241) in 2498 ms on 10.255.255.254 (executor driver) (30/33)
25/06/25 10:

                                                                                

25/06/25 10:42:00 INFO Executor: Finished task 0.0 in stage 30.0 (TID 244). 4314 bytes result sent to driver
25/06/25 10:42:00 INFO TaskSetManager: Finished task 0.0 in stage 30.0 (TID 244) in 18 ms on 10.255.255.254 (executor driver) (1/1)
25/06/25 10:42:00 INFO TaskSchedulerImpl: Removed TaskSet 30.0, whose tasks have all completed, from pool 
25/06/25 10:42:00 INFO DAGScheduler: ShuffleMapStage 30 (count at NativeMethodAccessorImpl.java:0) finished in 0.023 s
25/06/25 10:42:00 INFO DAGScheduler: looking for newly runnable stages
25/06/25 10:42:00 INFO DAGScheduler: running: Set()
25/06/25 10:42:00 INFO DAGScheduler: waiting: Set()
25/06/25 10:42:00 INFO DAGScheduler: failed: Set()
25/06/25 10:42:00 INFO SparkContext: Starting job: count at NativeMethodAccessorImpl.java:0
25/06/25 10:42:00 INFO DAGScheduler: Got job 18 (count at NativeMethodAccessorImpl.java:0) with 1 output partitions
25/06/25 10:42:00 INFO DAGScheduler: Final stage: ResultStage 33 (count at NativeMethodAccessorImpl

[Stage 34:>                                                       (0 + 22) / 33]

25/06/25 10:42:03 INFO Executor: Finished task 3.0 in stage 34.0 (TID 249). 2772 bytes result sent to driver
25/06/25 10:42:03 INFO TaskSetManager: Starting task 22.0 in stage 34.0 (TID 268) (10.255.255.254, executor driver, partition 22, PROCESS_LOCAL, 4949 bytes) taskResourceAssignments Map()
25/06/25 10:42:03 INFO Executor: Running task 22.0 in stage 34.0 (TID 268)
25/06/25 10:42:03 INFO TaskSetManager: Finished task 3.0 in stage 34.0 (TID 249) in 3215 ms on 10.255.255.254 (executor driver) (1/33)
25/06/25 10:42:03 INFO FileScanRDD: Reading File path: file:///home/theoriffel/codes/codes_25_01/ProcAnaDados/cancer-climate-dw/data/climate.csv, range: 2952790016-3087007744, partition values: [empty row]
25/06/25 10:42:03 INFO Executor: Finished task 5.0 in stage 34.0 (TID 251). 2772 bytes result sent to driver
25/06/25 10:42:03 INFO TaskSetManager: Starting task 23.0 in stage 34.0 (TID 269) (10.255.255.254, executor driver, partition 23, PROCESS_LOCAL, 4949 bytes) taskResourceAssignment



25/06/25 10:42:03 INFO TaskSetManager: Starting task 26.0 in stage 34.0 (TID 272) (10.255.255.254, executor driver, partition 26, PROCESS_LOCAL, 4949 bytes) taskResourceAssignments Map()
25/06/25 10:42:03 INFO Executor: Running task 26.0 in stage 34.0 (TID 272)
25/06/25 10:42:03 INFO TaskSetManager: Finished task 13.0 in stage 34.0 (TID 259) in 3414 ms on 10.255.255.254 (executor driver) (5/33)
25/06/25 10:42:03 INFO FileScanRDD: Reading File path: file:///home/theoriffel/codes/codes_25_01/ProcAnaDados/cancer-climate-dw/data/climate.csv, range: 3489660928-3623878656, partition values: [empty row]
25/06/25 10:42:03 INFO Executor: Finished task 20.0 in stage 34.0 (TID 266). 2772 bytes result sent to driver
25/06/25 10:42:03 INFO TaskSetManager: Starting task 27.0 in stage 34.0 (TID 273) (10.255.255.254, executor driver, partition 27, PROCESS_LOCAL, 4949 bytes) taskResourceAssignments Map()
25/06/25 10:42:03 INFO Executor: Running task 27.0 in stage 34.0 (TID 273)
25/06/25 10:42:03 INFO T



25/06/25 10:42:03 INFO Executor: Finished task 14.0 in stage 34.0 (TID 260). 2772 bytes result sent to driver
25/06/25 10:42:03 INFO TaskSetManager: Starting task 31.0 in stage 34.0 (TID 277) (10.255.255.254, executor driver, partition 31, PROCESS_LOCAL, 4949 bytes) taskResourceAssignments Map()
25/06/25 10:42:03 INFO Executor: Running task 31.0 in stage 34.0 (TID 277)
25/06/25 10:42:03 INFO TaskSetManager: Finished task 14.0 in stage 34.0 (TID 260) in 3674 ms on 10.255.255.254 (executor driver) (10/33)
25/06/25 10:42:03 INFO FileScanRDD: Reading File path: file:///home/theoriffel/codes/codes_25_01/ProcAnaDados/cancer-climate-dw/data/climate.csv, range: 4160749568-4294967296, partition values: [empty row]
25/06/25 10:42:04 INFO Executor: Finished task 8.0 in stage 34.0 (TID 254). 2772 bytes result sent to driver
25/06/25 10:42:04 INFO TaskSetManager: Starting task 32.0 in stage 34.0 (TID 278) (10.255.255.254, executor driver, partition 32, PROCESS_LOCAL, 4949 bytes) taskResourceAssignm



25/06/25 10:42:05 INFO Executor: Finished task 32.0 in stage 34.0 (TID 278). 2772 bytes result sent to driver
25/06/25 10:42:05 INFO TaskSetManager: Finished task 32.0 in stage 34.0 (TID 278) in 1143 ms on 10.255.255.254 (executor driver) (23/33)




25/06/25 10:42:06 INFO Executor: Finished task 25.0 in stage 34.0 (TID 271). 2772 bytes result sent to driver
25/06/25 10:42:06 INFO TaskSetManager: Finished task 25.0 in stage 34.0 (TID 271) in 3015 ms on 10.255.255.254 (executor driver) (24/33)
25/06/25 10:42:06 INFO Executor: Finished task 23.0 in stage 34.0 (TID 269). 2772 bytes result sent to driver
25/06/25 10:42:06 INFO TaskSetManager: Finished task 23.0 in stage 34.0 (TID 269) in 3193 ms on 10.255.255.254 (executor driver) (25/33)
25/06/25 10:42:06 INFO Executor: Finished task 26.0 in stage 34.0 (TID 272). 2772 bytes result sent to driver
25/06/25 10:42:06 INFO TaskSetManager: Finished task 26.0 in stage 34.0 (TID 272) in 3047 ms on 10.255.255.254 (executor driver) (26/33)
25/06/25 10:42:06 INFO Executor: Finished task 22.0 in stage 34.0 (TID 268). 2772 bytes result sent to driver
25/06/25 10:42:06 INFO TaskSetManager: Finished task 22.0 in stage 34.0 (TID 268) in 3303 ms on 10.255.255.254 (executor driver) (27/33)
25/06/25 10:



25/06/25 10:42:06 INFO Executor: Finished task 28.0 in stage 34.0 (TID 274). 2772 bytes result sent to driver
25/06/25 10:42:06 INFO TaskSetManager: Finished task 28.0 in stage 34.0 (TID 274) in 3054 ms on 10.255.255.254 (executor driver) (29/33)
25/06/25 10:42:06 INFO Executor: Finished task 27.0 in stage 34.0 (TID 273). 2772 bytes result sent to driver
25/06/25 10:42:06 INFO TaskSetManager: Finished task 27.0 in stage 34.0 (TID 273) in 3160 ms on 10.255.255.254 (executor driver) (30/33)
25/06/25 10:42:06 INFO Executor: Finished task 30.0 in stage 34.0 (TID 276). 2772 bytes result sent to driver
25/06/25 10:42:06 INFO TaskSetManager: Finished task 30.0 in stage 34.0 (TID 276) in 3112 ms on 10.255.255.254 (executor driver) (31/33)
25/06/25 10:42:06 INFO Executor: Finished task 31.0 in stage 34.0 (TID 277). 2772 bytes result sent to driver
25/06/25 10:42:06 INFO TaskSetManager: Finished task 31.0 in stage 34.0 (TID 277) in 3028 ms on 10.255.255.254 (executor driver) (32/33)
25/06/25 10:

                                                                                

In [10]:
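# Sex dimension: one row per distinct sex_name in the cancer dataset.
df_sexo = df_cancer.select(
    F.col("sex_name").alias("sexo")
).distinct()

# Surrogate key: monotonically_increasing_id() yields unique, increasing
# 64-bit ids, but they are not guaranteed to be consecutive.
df_sexo = df_sexo.withColumn("sexo_pk", F.monotonically_increasing_id())

print("Number of rows: ", df_sexo.count())
df_sexo.show()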
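
The surrogate key above comes from `F.monotonically_increasing_id()`, whose values are unique and increasing but not consecutive. As a rough illustration in plain Python (no Spark needed), the sketch below mimics the bit layout that the Spark documentation describes for the current implementation: the partition index in the upper 31 bits and the row's position within its partition in the lower 33 bits. The helper name `monotonic_id` and the two-partition layout are hypothetical.

```python
def monotonic_id(partition_index: int, row_index: int) -> int:
    """Mimic the documented bit layout of monotonically_increasing_id()."""
    # Partition index goes in the upper bits, row position in the lower 33 bits.
    return (partition_index << 33) + row_index


# First three rows of two hypothetical partitions (0 and 1).
ids = [monotonic_id(p, r) for p in (0, 1) for r in (0, 1, 2)]
print(ids)
```

The ids jump between partitions, which is harmless for a join key but means `sexo_pk` should not be read as a dense 1..N sequence.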


In [11]:
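# Grand total of `val` over every row with metric "Número" (all measures,
# deaths included); used below as the denominator of taxa_incidencia_total.
total_casos = df_cancer.filter(F.col("metric_name") == "Número") \
    .agg(F.sum("val").alias("total_casos_geral")) \
    .collect()[0]["total_casos_geral"]

# Cancer-type dimension: one row per cause_name, with two derived attributes.
df_tipo_cancer = (
    df_cancer.filter(F.col("metric_name") == "Número")
            .groupBy("cause_name")
            .agg(
                # Deaths per cancer type.
                F.sum(F.when(F.col("measure_name") == "Óbitos", F.col("val")).otherwise(0)).alias("total_obitos"),
                # Every other measure is counted as a case.
                F.sum(F.when(F.col("measure_name") != "Óbitos", F.col("val")).otherwise(0)).alias("total_casos")
            )
            # mortalidade = deaths / cases * 100, or 0 when there are no cases.
            .withColumn(
                "mortalidade",
                F.round(F.when(F.col("total_casos") > 0, (F.col("total_obitos") / F.col("total_casos")) * 100).otherwise(0), 2)
            )
            # taxa_incidencia_total = cases of this type / grand total * 100.
            .withColumn(
                "taxa_incidencia_total",
                F.round((F.col("total_casos") / F.lit(total_casos)) * 100, 2)
            )
            # Surrogate key (unique, not necessarily consecutive).
            .withColumn(
                "tipo_cancer_pk",
                F.monotonically_increasing_id()
            )
            .select(
                "tipo_cancer_pk",
                F.col("cause_name").alias("tipo_cancer"),
                "mortalidade",
                "taxa_incidencia_total"
            )
)

print("Number of rows: ", df_tipo_cancer.count())
df_tipo_cancer.show(truncate=False)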
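
To make the two derived columns concrete, here is a small worked example in plain Python. It reproduces the arithmetic of `mortalidade` and `taxa_incidencia_total` for a single cancer type. The input numbers are invented for illustration and do not come from the dataset. Spark's `F.round` rounds half up while Python's `round` does not, so the values are chosen to avoid ties.

```python
def mortalidade(total_obitos: float, total_casos: float) -> float:
    """Deaths as a percentage of cases; 0 when there are no cases."""
    if total_casos <= 0:
        return 0.0
    return round(total_obitos / total_casos * 100, 2)


def taxa_incidencia_total(total_casos: float, total_casos_geral: float) -> float:
    """Share of this cancer type in the grand total, as a percentage."""
    return round(total_casos / total_casos_geral * 100, 2)


# Hypothetical totals for one cause_name.
print(mortalidade(250, 1000))             # 250 / 1000 * 100
print(taxa_incidencia_total(1000, 8000))  # 1000 / 8000 * 100
```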


In [12]:
# Age-range dimension: parse labels such as "5-9 anos" or "80+ anos".
df_faixa_etaria = (
    df_cancer.select(
        F.col("age_id").alias("id_idade"),
        F.col("age_name").alias("faixa_etaria")
    )
    .distinct()
    # Strip the "anos" suffix and all spaces, leaving e.g. "5-9" or "80+".
    .withColumn(
        "faixa_etaria",
        F.regexp_replace(F.regexp_replace(F.col("faixa_etaria"), "anos", ""), " ", "")
    )
    # Lower bound: the digits of an open-ended label, else the part before "-".
    .withColumn(
        "idade_min",
        F.when(
            F.col("faixa_etaria").contains("+"),
            F.regexp_replace(F.col("faixa_etaria"), "[^0-9]", "").cast("int")
        ).otherwise(
            F.split(F.col("faixa_etaria"), "-")[0].cast("int")
        )
    )
    # Upper bound: open-ended ranges are capped at 130, else the part after "-".
    .withColumn(
        "idade_max",
        F.when(
            F.col("faixa_etaria").contains("+"),
            F.lit(130)
        ).otherwise(
            F.split(F.col("faixa_etaria"), "-")[1].cast("int")
        )
    )
    .withColumn("faixa_pk", F.monotonically_increasing_id())
    .select("faixa_pk", "faixa_etaria", "idade_min", "idade_max", "id_idade")
)

print("Number of rows: ", df_faixa_etaria.count())
df_faixa_etaria.show()
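
The parsing rules in the cell above can be checked without Spark. The following is a minimal plain-Python sketch of the same logic; the function name `parse_faixa` and the sample labels are hypothetical, since the actual `age_name` values are not shown here. Unlike the Spark expressions, which yield null for a label that matches neither pattern, this sketch raises an exception.

```python
import re


def parse_faixa(age_name: str) -> tuple[str, int, int]:
    """Return (faixa_etaria, idade_min, idade_max) for one age label."""
    # Same clean-up as the Spark cell: drop "anos", then drop spaces.
    faixa = age_name.replace("anos", "").replace(" ", "")
    if "+" in faixa:
        # Open-ended range: keep only the digits, cap the upper bound at 130.
        return faixa, int(re.sub(r"[^0-9]", "", faixa)), 130
    low, high = faixa.split("-")
    return faixa, int(low), int(high)


for label in ("5-9 anos", "80+ anos"):  # hypothetical labels
    print(parse_faixa(label))
```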


In [13]:
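# Metric dimension: one row per distinct metric_name.
df_metrica = (
    df_cancer
    .select(F.col("metric_name").alias("tipo_metrica"))
    .distinct()
)

# A window without partitionBy moves every row to a single partition, which
# is tolerable only for a small dimension like this one.
windowSpec = Window.orderBy("tipo_metrica")

# row_number() assigns consecutive keys 1..N following the tipo_metrica order.
df_metrica = (
    df_metrica
    .withColumn("metrica_pk", F.row_number().over(windowSpec))
    .select("metrica_pk", "tipo_metrica")
)

df_metrica.show()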
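
Unlike the other dimensions, this one gets dense keys from `row_number()` over an ordered window. A rough plain-Python equivalent is to deduplicate, sort, and enumerate from 1, as sketched below. The metric names in the list are hypothetical placeholders for whatever `metric_name` contains.

```python
# Hypothetical metric_name values, with a duplicate to show the distinct step.
metricas = ["Taxa", "Número", "Porcentagem", "Número"]

# distinct -> order by tipo_metrica -> row_number() starting at 1
dim_metrica = [
    (metrica_pk, tipo_metrica)
    for metrica_pk, tipo_metrica in enumerate(sorted(set(metricas)), start=1)
]
print(dim_metrica)
```

Because the key depends only on the sort order, rerunning the cell on the same data should assign the same `metrica_pk` to each metric, which `monotonically_increasing_id()` does not promise.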

25/06/25 10:42:09 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

In [14]:
# Climate fact table: resolve the date and location surrogate keys, then keep
# the measures rounded to two decimals.
df_fato_clima = (
    df_clima.withColumn(
            # Build a calendar date from YEAR and DOY (day of year, 1-based).
            "data_completa",
            F.expr("date_add(to_date(concat(YEAR, '-01-01')), DOY - 1)")
        )
        .join(df_data, ["data_completa"], "left")
        .join(df_localidade, ["cidade", "latitude", "longitude"], "left")
        .select(
            F.col("data_pk"),
            F.col("localidade_pk"),
            # F.round keeps the measures numeric; format_string("%.2f", ...)
            # would turn them into strings.
            F.round(F.col("T2M"), 2).alias("temperatura_media"),
            F.round(F.col("T2M_MAX"), 2).alias("temperatura_max"),
            F.round(F.col("T2M_MIN"), 2).alias("temperatura_min"),
            F.round(F.col("ALLSKY_SFC_UV_INDEX"), 2).alias("radiacao_uv"),
            F.round(F.col("ALLSKY_SFC_UVA"), 2).alias("radiacao_uva"),
            F.round(F.col("ALLSKY_SFC_UVB"), 2).alias("radiacao_uvb"),
            F.round(F.col("PRECTOTCORR"), 2).alias("precipitacao")
        )
)
# Counting scans the whole climate file, so it is left disabled here.
#print("Number of rows: ", df_fato_clima.count())
df_fato_clima.show()
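
The `data_completa` expression turns a year plus a 1-based day of year into a calendar date by adding `DOY - 1` days to January 1st. The plain-Python sketch below applies the same rule and shows the leap-year effect. The function name `doy_to_date` is hypothetical.

```python
from datetime import date, timedelta


def doy_to_date(year: int, doy: int) -> date:
    """Convert a year and a 1-based day of year into a calendar date."""
    return date(year, 1, 1) + timedelta(days=doy - 1)


# Day 60 falls on different dates in leap and non-leap years.
print(doy_to_date(2024, 60))
print(doy_to_date(2023, 60))
```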

25/06/25 10:42:22 INFO Executor: Finished task 30.0 in stage 72.0 (TID 563). 2772 bytes result sent to driver
25/06/25 10:42:22 INFO TaskSetManager: Finished task 30.0 in stage 72.0 (TID 563) in 3168 ms on 10.255.255.254 (executor driver) (31/33)




25/06/25 10:42:22 INFO Executor: Finished task 29.0 in stage 72.0 (TID 562). 2772 bytes result sent to driver
25/06/25 10:42:22 INFO TaskSetManager: Finished task 29.0 in stage 72.0 (TID 562) in 3249 ms on 10.255.255.254 (executor driver) (32/33)
25/06/25 10:42:22 INFO Executor: Finished task 31.0 in stage 72.0 (TID 564). 2772 bytes result sent to driver
25/06/25 10:42:22 INFO TaskSetManager: Finished task 31.0 in stage 72.0 (TID 564) in 3103 ms on 10.255.255.254 (executor driver) (33/33)
25/06/25 10:42:22 INFO TaskSchedulerImpl: Removed TaskSet 72.0, whose tasks have all completed, from pool 
25/06/25 10:42:22 INFO DAGScheduler: ShuffleMapStage 72 (showString at NativeMethodAccessorImpl.java:0) finished in 12.241 s
25/06/25 10:42:22 INFO DAGScheduler: looking for newly runnable stages
25/06/25 10:42:22 INFO DAGScheduler: running: Set()
25/06/25 10:42:22 INFO DAGScheduler: waiting: Set()
25/06/25 10:42:22 INFO DAGScheduler: failed: Set()
25/06/25 10:42:22 WARN WindowExec: No Partition 

                                                                                

25/06/25 10:42:22 INFO BlockManagerInfo: Removed broadcast_73_piece0 on 10.255.255.254:38971 in memory (size: 18.6 KiB, free: 433.7 MiB)
25/06/25 10:42:22 INFO BlockManagerInfo: Removed broadcast_75_piece0 on 10.255.255.254:38971 in memory (size: 16.0 KiB, free: 433.7 MiB)


In [15]:
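# Distinct, non-null states taken from the locality dimension.
df_estado = (
    df_localidade
    .filter(F.col("estado").isNotNull())
    .select(F.col("estado"))
    .distinct()
)

# Surrogate key: 1..N in alphabetical order. The window has no partitionBy,
# so Spark moves all rows to a single partition and logs a WindowExec
# warning; that is acceptable for a small lookup table.
window_estado = Window.orderBy("estado")
df_estado = (
    df_estado
    .withColumn("estado_pk", F.row_number().over(window_estado))
    .select("estado_pk", "estado")
)

# Distinct years taken from the date dimension.
df_ano = (
    df_data
    .select(F.col("data_ano").alias("ano"))
    .distinct()
)

# Same surrogate-key scheme for years, in ascending order.
window_ano = Window.orderBy("ano")
df_ano = (
    df_ano
    .withColumn("ano_pk", F.row_number().over(window_ano))
    .select("ano_pk", "ano")
)

df_estado.show()
df_ano.show()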
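
The key assignment in the cell above (distinct non-null values, sorted, numbered from 1) can be illustrated without Spark. This is a minimal sketch, not the notebook's implementation: `assign_surrogate_keys` is a hypothetical helper and the sample state names are made up.

```python
from typing import Any, Iterable


def assign_surrogate_keys(values: Iterable[Any]) -> list[tuple[int, Any]]:
    """Return (pk, value) pairs: distinct non-null values, sorted, numbered from 1."""
    distinct = sorted({v for v in values if v is not None})
    return [(pk, value) for pk, value in enumerate(distinct, start=1)]


# Hypothetical sample values, not taken from the dataset.
estados = ["Texas", None, "Ohio", "Texas", "Alaska"]
print(assign_surrogate_keys(estados))
```

For a dimension small enough to collect to the driver, pairs like these could be turned back into a DataFrame with `spark.createDataFrame(pairs, ["estado_pk", "estado"])`, which would avoid the single-partition window. Whether that trade-off is worthwhile here is an assumption, not something the notebook measures.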

In [16]:

# Resolve each natural key in the cancer data to its surrogate key.
# Left joins keep rows with no match in a dimension; those rows end up
# with a null *_pk and are grouped together below.
df_fato_cancer = (
    df_cancer
    .join(df_estado, df_cancer.location_name == df_estado.estado, "left")
    .join(df_sexo, df_cancer.sex_name == df_sexo.sexo, "left")
    .join(df_faixa_etaria, df_cancer.age_id == df_faixa_etaria.id_idade, "left")
    .join(df_tipo_cancer, df_cancer.cause_name == df_tipo_cancer.tipo_cancer, "left")
    .join(df_ano, df_cancer.year == df_ano.ano, "left")
    .join(df_metrica, df_cancer.metric_name == df_metrica.tipo_metrica, "left")
    # One row per key combination; each distinct measure_name becomes a column.
    .groupBy("ano_pk", "estado_pk", "sexo_pk", "faixa_pk", "tipo_cancer_pk", "metrica_pk")
    .pivot("measure_name")
    .agg(F.sum("val"))
)

group_cols = ["ano_pk", "estado_pk", "sexo_pk", "faixa_pk", "tipo_cancer_pk", "metrica_pk"]
pivot_cols = [c for c in df_fato_cancer.columns if c not in group_cols]

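# Replace missing measures with 0 and round to three decimals. F.round keeps
# the columns numeric; F.format_string would turn them into strings.
for col_name in pivot_cols:
    df_fato_cancer = df_fato_cancer.withColumn(
        col_name,
        F.round(F.coalesce(F.col(col_name), F.lit(0.0)), 3)
    )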

df_fato_cancer = df_fato_cancer \
    .withColumnRenamed("Incidência", "incidencia_cancer") \
    .withColumnRenamed("Prevalência", "prevalencia_cancer") \
    .withColumnRenamed("Óbitos", "obitos_cancer")

df_fato_cancer.show()
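
To make the shape of the result concrete, the group, pivot and sum step can be mimicked in plain Python. This is an illustrative sketch only: `pivot_measures` is a hypothetical helper, the key is shortened to `(ano_pk, estado_pk)`, and the sample values are invented. The measure names and target column names are the ones the cell renames.

```python
from collections import defaultdict

# measure_name values mapped to the fato_cancer column names used in the cell above.
MEASURE_COLUMNS = {
    "Incidência": "incidencia_cancer",
    "Prevalência": "prevalencia_cancer",
    "Óbitos": "obitos_cancer",
}


def pivot_measures(rows):
    """rows: iterable of (key_tuple, measure_name, val).

    Returns {key_tuple: {column_name: summed_value}}, one entry per key.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for key, measure_name, val in rows:
        totals[key][MEASURE_COLUMNS[measure_name]] += val
    # Measures that never appeared for a key become 0.0; keep three decimals.
    return {
        key: {col: round(sums.get(col, 0.0), 3) for col in MEASURE_COLUMNS.values()}
        for key, sums in totals.items()
    }


# Hypothetical long-format rows; the numbers are made up.
rows = [
    ((1, 5), "Óbitos", 10.0),
    ((1, 5), "Óbitos", 2.5),
    ((1, 5), "Incidência", 40.125),
    ((2, 5), "Prevalência", 7.0),
]
for key, measures in pivot_measures(rows).items():
    print(key, measures)
```

Each long-format row contributes to exactly one measure column of its key, which is why a key with no prevalence rows still gets a `prevalencia_cancer` value of zero.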

## Load

In [17]:
# Read the DB_* connection settings from a local .env file into the environment.
load_dotenv()

DB_HOST = os.getenv("DB_HOST")
DB_PORT = os.getenv("DB_PORT")
DB_NAME = os.getenv("DB_NAME")
DB_USER = os.getenv("DB_USER")
DB_PASSWORD = os.getenv("DB_PASSWORD")
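
`os.getenv` returns `None` for a variable that is not set, so a missing `.env` entry only shows up later, for example as a JDBC URL like `jdbc:postgresql://None:None/None`. A minimal sketch of an early check follows; the helper name `missing_env_vars` and the `REQUIRED_VARS` tuple are assumptions for illustration and are not part of the notebook.

```python
import os

# Names taken from the cell above.
REQUIRED_VARS = ("DB_HOST", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASSWORD")

def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example with an explicit mapping instead of the real environment.
example_env = {"DB_HOST": "localhost", "DB_PORT": "5432", "DB_NAME": "dw"}
print(missing_env_vars(example_env))  # ['DB_USER', 'DB_PASSWORD']
```

Calling `missing_env_vars()` with no arguments right after `load_dotenv()` would check the real environment before any connection is attempted.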

In [23]:
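def salvar(df, tabela):
    """Write the Spark DataFrame `df` to the PostgreSQL table `tabela` over JDBC."""
    # The PostgreSQL JDBC jar must be on Spark's classpath (see spark.jars in the
    # session config); otherwise save() fails with ClassNotFoundException.
    url = f"jdbc:postgresql://{DB_HOST}:{DB_PORT}/{DB_NAME}"
    # mode("overwrite") replaces any existing table on every run.
    df.write \
      .format("jdbc") \
      .option("url", url) \
      .option("dbtable", tabela) \
      .option("user", DB_USER) \
      .option("password", DB_PASSWORD) \
      .option("driver", "org.postgresql.Driver") \
      .mode("overwrite") \
      .save()

# Dimension tables first, then the two fact tables.
salvar(df_data, "dim_data")
salvar(df_localidade, "dim_localidade")
salvar(df_sexo, "dim_sexo")
salvar(df_faixa_etaria, "dim_faixa_etaria")
salvar(df_tipo_cancer, "dim_tipo_cancer")
salvar(df_metrica, "dim_metrica")
salvar(df_fato_clima, "fato_clima")
salvar(df_fato_cancer, "fato_cancer")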

Py4JJavaError: An error occurred while calling o728.save.
: java.lang.ClassNotFoundException: org.postgresql.Driver

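The `ClassNotFoundException: org.postgresql.Driver` above means the PostgreSQL JDBC driver class was not on the JVM classpath when `save()` ran. One possible cause, assumed here rather than confirmed, is that the jar path given to `spark.jars` does not exist on the machine running the notebook. A minimal sketch of a check that could run before the session is built; the helper name `missing_jars` and the example path are hypothetical.

```python
import os

def missing_jars(jar_paths):
    """Return the paths in `jar_paths` that do not point to an existing file."""
    return [path for path in jar_paths if not os.path.isfile(path)]

# Hypothetical path, for illustration only; use the value passed to spark.jars.
jars = ["/path/to/postgresql-42.7.3.jar"]
not_found = missing_jars(jars)
if not_found:
    print("JDBC driver jar(s) not found:", not_found)
```

An alternative that avoids machine-specific paths is to add the Maven coordinate `org.postgresql:postgresql:42.7.3` to `spark.jars.packages`, so Spark resolves the driver itself when the session starts.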

In [19]:
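def executa_sql_postgres(sql_texto):
    """Run one SQL statement against PostgreSQL and commit it."""
    conn = psycopg2.connect(
        dbname=DB_NAME,
        user=DB_USER,
        password=DB_PASSWORD,
        host=DB_HOST,
        port=DB_PORT
    )
    try:
        # `with conn` commits on success and rolls back on an exception;
        # it does not close the connection, hence the `finally` below.
        with conn:
            with conn.cursor() as cur:
                cur.execute(sql_texto)
    finally:
        conn.close()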

# Year-level view for fato_cancer: one surrogate key per distinct year.
sql_dim_ano = """
CREATE OR REPLACE VIEW vw_ano AS
SELECT
    ROW_NUMBER() OVER (ORDER BY data_ano) AS ano_pk,
    data_ano AS cancer_ano,
    data_decada AS cancer_decada
FROM (
    SELECT DISTINCT
        data_ano,
        data_decada
    FROM
        dim_data
) AS sub;
"""

# Day-level view for fato_clima: dim_data columns renamed with a clima_ prefix.
sql_dim_dia = """
CREATE OR REPLACE VIEW vw_dia AS
SELECT
    data_pk,
    data_completa AS clima_data_completa,
    data_dia AS clima_dia,
    data_mes AS clima_mes,
    data_ano AS clima_ano,
    data_decada AS clima_decada
FROM dim_data;
"""

# State-level view for fato_cancer: one surrogate key per distinct state.
sql_dim_estado = """
CREATE OR REPLACE VIEW vw_estado AS
SELECT
    ROW_NUMBER() OVER (ORDER BY estado) AS estado_pk,
    estado AS cancer_estado,
    regiao AS cancer_regiao,
    pais AS cancer_pais
FROM (
    SELECT DISTINCT
        estado,
        regiao,
        pais
    FROM
        dim_localidade
) AS sub;
"""

# City-level view for fato_clima: dim_localidade columns renamed with a clima_ prefix.
sql_dim_cidade = """
CREATE OR REPLACE VIEW vw_cidade AS
SELECT
    localidade_pk,
    latitude,
    longitude,
    cidade AS clima_cidade,
    estado AS clima_estado,
    regiao AS clima_regiao,
    pais AS clima_pais
FROM dim_localidade;
"""

executa_sql_postgres(sql_dim_ano)
executa_sql_postgres(sql_dim_dia)
executa_sql_postgres(sql_dim_cidade)
executa_sql_postgres(sql_dim_estado)

OperationalError: connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?
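
The `OperationalError` above says that nothing accepted the TCP connection on `localhost:5432`. A minimal sketch of a reachability check that could run before `psycopg2.connect`, using only the standard library; the helper name `can_connect` is an assumption for illustration. It only tests that the port accepts connections, not that the credentials or database name are valid.

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False

# Host and port taken from the error message above.
print(can_connect("localhost", 5432))
```

If this prints `False`, the likely next step is to start the PostgreSQL server (or its container) or to correct `DB_HOST` and `DB_PORT` in the `.env` file.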

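The `vw_ano` and `vw_estado` definitions above derive a coarser dimension from a finer one: `SELECT DISTINCT` collapses the rows to one per year (or state), and `ROW_NUMBER()` assigns a surrogate key in sorted order. The sketch below reproduces the `vw_ano` pattern on an in-memory SQLite database so it can run without a PostgreSQL server. The toy `dim_data` rows are invented for illustration, the table carries only the columns the view's SQL references, and SQLite (3.25 or newer, for window functions) stands in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_data (
    data_pk INTEGER,
    data_completa TEXT,
    data_ano INTEGER,
    data_decada INTEGER
);
-- Toy rows: four days spread over three years.
INSERT INTO dim_data VALUES
    (1, '2019-12-31', 2019, 2010),
    (2, '2020-01-01', 2020, 2020),
    (3, '2020-01-02', 2020, 2020),
    (4, '2021-06-15', 2021, 2020);

-- SQLite has no CREATE OR REPLACE VIEW, so a plain CREATE VIEW is used here.
CREATE VIEW vw_ano AS
SELECT
    ROW_NUMBER() OVER (ORDER BY data_ano) AS ano_pk,
    data_ano AS cancer_ano,
    data_decada AS cancer_decada
FROM (
    SELECT DISTINCT data_ano, data_decada FROM dim_data
) AS sub;
""")

rows = conn.execute(
    "SELECT ano_pk, cancer_ano, cancer_decada FROM vw_ano ORDER BY ano_pk"
).fetchall()
conn.close()
print(rows)
```

The four daily rows collapse to three year rows, keyed 1 to 3 in ascending year order. Because the key comes from `ROW_NUMBER()`, it can shift whenever a new year is added to `dim_data`, so fact rows that store `ano_pk` would need to be rebuilt together with the dimension.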