# 1 Fact Constellation Definition

**Dimension tables**

- data (dataPK, dataCompleta, dataDia, dataMes, dataSemestre, dataAno)
- localidade (localidadePK, latitude, longitude, cidade, estado, regiao, pais)
- tipo-cancer (tipoCancerPK, tipoCancer, mortalidade, taxaIncidenciaTotal)
- faixa-etaria (faixaPK, faixaEtaria, idadeMin, idadeMax, idIdade)
- sexo (sexoPK, sexo)
- metrica (metricaPK, tipoMetrica)

**Fact tables**

- cancer (dataPK, localidadePK, tipoCancerPK, sexoPK, faixaPK, metricaPK, obitosCancer, incidenciaCancer, prevalenciaCancer)
- clima (dataPK, localidadePK, temperaturaMedia, temperaturaMax, temperaturaMin, radiacaoUV, radiacaoUVA, radiacaoUVB, precipitacao)
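
What makes this a constellation rather than two unrelated star schemas is that both fact tables reference the same (conformed) dimensions. The sketch below is only an illustration of that property, assuming the key lists exactly as declared above; `DIMENSION_KEYS`, `FACT_FOREIGN_KEYS`, and `shared_dimensions` are hypothetical names that do not appear in the notebook.

```python
# Primary key of each dimension table, as declared in the schema above.
DIMENSION_KEYS = {
    "data": "dataPK",
    "localidade": "localidadePK",
    "tipo-cancer": "tipoCancerPK",
    "faixa-etaria": "faixaPK",
    "sexo": "sexoPK",
    "metrica": "metricaPK",
}

# Foreign keys carried by each fact table, as declared in the schema above.
FACT_FOREIGN_KEYS = {
    "cancer": {"dataPK", "localidadePK", "tipoCancerPK", "sexoPK", "faixaPK", "metricaPK"},
    "clima": {"dataPK", "localidadePK"},
}


def shared_dimensions(fact_a, fact_b):
    """Return the dimension tables referenced by both fact tables."""
    common = FACT_FOREIGN_KEYS[fact_a] & FACT_FOREIGN_KEYS[fact_b]
    return sorted(name for name, pk in DIMENSION_KEYS.items() if pk in common)


print(shared_dimensions("cancer", "clima"))  # ['data', 'localidade']
```

The shared `data` and `localidade` dimensions are what allow cancer and climate measures to be compared along the same time and place axes.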

In [1]:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Python 3.11
# Java 11
# PySpark == 3.4

In [2]:
# Local MinIO instance accessed through the S3A connector, using MinIO's default credentials.
# Ivy resolves aws-java-sdk-bundle to 1.12.262, the version hadoop-aws 3.3.4 depends on.
spark = SparkSession.builder \
    .appName("OLAP - P2") \
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.3.4,"
            "com.amazonaws:aws-java-sdk-bundle:1.11.1026") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
    .getOrCreate()

print(spark.version)

:: loading settings :: url = jar:file:/home/rodrigo/.local/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/rodrigo/.ivy2/cache
The jars for the packages stored in: /home/rodrigo/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
com.amazonaws#aws-java-sdk-bundle added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-85e1fffe-2a64-4c99-851e-202348b096fe;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.4 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
:: resolution report :: resolve 769ms :: artifacts dl 8ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.12.262 from central in [default]
	org.apache.hadoop#hadoop-aws;3.3.4 from central in [default]
	org.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]
	:: evicted modules:
	com.amazonaws#aws-java-sdk-bundle;1.11.1026 by [com.amazonaws#aws-java-sdk-bundle;1.12.262] in [default]
	-------------------------------------------------------------

3.4.0


## Extract

In [3]:
df_cancer1 = spark.read.csv("s3a://datalake/cancer-1.csv", header=True, inferSchema=True)
df_cancer2 = spark.read.csv("s3a://datalake/cancer-2.csv", header=True, inferSchema=True)
df_cancer = df_cancer1.union(df_cancer2)

print("Numero de Tuplas: ", df_cancer.count())
df_cancer.show()

Numero de Tuplas:  789264
+----------+------------+-----------+-------------+------+---------+------+-----------+--------+--------------------+---------+-----------+----+--------------------+--------------------+--------------------+
|measure_id|measure_name|location_id|location_name|sex_id| sex_name|age_id|   age_name|cause_id|          cause_name|metric_id|metric_name|year|                 val|               upper|               lower|
+----------+------------+-----------+-------------+------+---------+------+-----------+--------+--------------------+---------+-----------+----+--------------------+--------------------+--------------------+
|         1|      Óbitos|       4750|         Acre|     1|Masculino|     8|15 -19 anos|     459|Melanoma maligno ...|        1|     Número|2001|0.008899466563112322|0.010069127561452958|0.007914220987606399|
|         1|      Óbitos|       4750|         Acre|     2| Feminino|     8|15 -19 anos|     459|Melanoma maligno ...|        1|     Número|200

In [4]:
df_clima = spark.read.csv("s3a://datalake/climate.csv", header=True, inferSchema=True)

print("Numero de Tuplas: ", df_clima.count())
df_clima.show()

                                                                                

Numero de Tuplas:  42721900
+---------------+--------+---------+----+---+-----+-------+-------+-------------------+--------------+--------------+-----------+-----------+-------+------+
|         cidade|latitude|longitude|YEAR|DOY|  T2M|T2M_MAX|T2M_MIN|ALLSKY_SFC_UV_INDEX|ALLSKY_SFC_UVA|ALLSKY_SFC_UVB|PRECTOTCORR|codigo_ibge|capital|estado|
+---------------+--------+---------+----+---+-----+-------+-------+-------------------+--------------+--------------+-----------+-----------+-------+------+
|Abadia de Goiás|-16.7573| -49.4412|2001|  1|23.12|  26.25|  20.85|               2.29|           1.2|          0.04|      10.58|  5200050.0|  false| Goiás|
|Abadia de Goiás|-16.7573| -49.4412|2001|  2|21.86|  24.46|  19.76|               1.95|          1.05|          0.03|        4.7|  5200050.0|  false| Goiás|
|Abadia de Goiás|-16.7573| -49.4412|2001|  3|21.72|  25.39|  18.91|               2.27|           1.2|          0.04|        4.3|  5200050.0|  false| Goiás|
|Abadia de Goiás|-16.7573| -49

## Transform

In [14]:
# Build the date dimension from the climate data: YEAR + DOY (day of year) -> calendar date
df_data = df_clima.withColumn(
    "dataCompleta",
    F.expr("date_add(to_date(concat(YEAR, '-01-01')), DOY - 1)")
).select(
    "dataCompleta",
    F.dayofmonth("dataCompleta").alias("dataDia"),
    F.month("dataCompleta").alias("dataMes"),
    # Zero-based semester: 0 = January to June, 1 = July to December
    ((F.month("dataCompleta")-1)/6).cast("int").alias("dataSemestre"),
    F.year("dataCompleta").alias("dataAno")
).distinct()

# Create a sequential surrogate key with row_number
windowSpec = Window.orderBy("dataCompleta")

df_data = df_data.withColumn(
    "dataPK",
    F.row_number().over(windowSpec)
)

# Reorder the columns
df_data = df_data.select("dataPK", "dataCompleta", "dataDia", "dataMes", "dataSemestre", "dataAno")

print("Numero de Tuplas: ", df_data.count())
df_data.show(500)

                                                                                

Numero de Tuplas:  7670


25/06/23 10:04:31 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

+------+------------+-------+-------+------------+-------+
|dataPK|dataCompleta|dataDia|dataMes|dataSemestre|dataAno|
+------+------------+-------+-------+------------+-------+
|     1|  2001-01-01|      1|      1|           0|   2001|
|     2|  2001-01-02|      2|      1|           0|   2001|
|     3|  2001-01-03|      3|      1|           0|   2001|
|     4|  2001-01-04|      4|      1|           0|   2001|
|     5|  2001-01-05|      5|      1|           0|   2001|
|     6|  2001-01-06|      6|      1|           0|   2001|
|     7|  2001-01-07|      7|      1|           0|   2001|
|     8|  2001-01-08|      8|      1|           0|   2001|
|     9|  2001-01-09|      9|      1|           0|   2001|
|    10|  2001-01-10|     10|      1|           0|   2001|
|    11|  2001-01-11|     11|      1|           0|   2001|
|    12|  2001-01-12|     12|      1|           0|   2001|
|    13|  2001-01-13|     13|      1|           0|   2001|
|    14|  2001-01-14|     14|      1|           0|   200
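
The date dimension rests on two small rules: a calendar date is the first day of the year plus `DOY - 1` days, and the semester is a zero-based index derived from the month. The following plain-Python sketch mirrors those rules outside Spark so they can be checked by hand; `date_from_doy` and `semester_index` are hypothetical helper names. The last line relies on an inference, not on anything stated in the notebook: a row count of 7670 is what a continuous daily range from 2001-01-01 through 2021-12-31 would produce.

```python
from datetime import date, timedelta


def date_from_doy(year, doy):
    """Calendar date for a 1-based day of year (same rule as date_add(..., DOY - 1))."""
    return date(year, 1, 1) + timedelta(days=doy - 1)


def semester_index(month):
    """Zero-based semester, mirroring ((month - 1) / 6) cast to int."""
    return (month - 1) // 6


d = date_from_doy(2001, 14)
print(d, d.day, d.month, semester_index(d.month), d.year)  # 2001-01-14 14 1 0 2001

# Leap years shift every date after February by one day.
print(date_from_doy(2004, 60))  # 2004-02-29
print(date_from_doy(2001, 60))  # 2001-03-01

# Number of calendar days from 2001-01-01 through 2021-12-31.
print((date(2021, 12, 31) - date(2001, 1, 1)).days + 1)  # 7670
```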

In [6]:
estado_para_regiao = {
    "Acre": "Norte",
    "Alagoas": "Nordeste",
    "Amapá": "Norte",
    "Amazonas": "Norte",
    "Bahia": "Nordeste",
    "Ceará": "Nordeste",
    "Distrito Federal": "Centro-Oeste",
    "Espírito Santo": "Sudeste",
    "Goiás": "Centro-Oeste",
    "Maranhão": "Nordeste",
    "Mato Grosso": "Centro-Oeste",
    "Mato Grosso do Sul": "Centro-Oeste",
    "Minas Gerais": "Sudeste",
    "Pará": "Norte",
    "Paraíba": "Nordeste",
    "Paraná": "Sul",
    "Pernambuco": "Nordeste",
    "Piauí": "Nordeste",
    "Rio de Janeiro": "Sudeste",
    "Rio Grande do Norte": "Nordeste",
    "Rio Grande do Sul": "Sul",
    "Rondônia": "Norte",
    "Roraima": "Norte",
    "Santa Catarina": "Sul",
    "São Paulo": "Sudeste",
    "Sergipe": "Nordeste",
    "Tocantins": "Norte"
}

# Look up each state's region through a single map expression;
# states missing from the dictionary get a null region.
regiao_por_estado = F.create_map(
    *[F.lit(x) for par in estado_para_regiao.items() for x in par]
)

df_localidade = (
    df_clima.select("cidade", "estado", "latitude", "longitude")
            .distinct()
            .withColumn("pais", F.lit("Brasil"))
            .withColumn("regiao", regiao_por_estado[F.col("estado")])
)

# Sequential surrogate key, ordered by the natural key
windowSpec = Window.orderBy("cidade", "estado", "latitude", "longitude")
df_localidade = (
    df_localidade.withColumn("localidadePK", F.row_number().over(windowSpec))
                 .select("localidadePK", "latitude", "longitude", "cidade", "estado", "regiao", "pais")
)

print("Numero de Tuplas: ", df_localidade.count())
df_localidade.show(truncate=False)




Numero de Tuplas:  5570


+------------+--------+---------+-------------------+-------------------+------------+------+
|localidadePK|latitude|longitude|cidade             |estado             |regiao      |pais  |
+------------+--------+---------+-------------------+-------------------+------------+------+
|1           |-16.7573|-49.4412 |Abadia de Goiás    |Goiás              |Centro-Oeste|Brasil|
|2           |-18.4831|-47.3916 |Abadia dos Dourados|Minas Gerais       |Sudeste     |Brasil|
|3           |-16.197 |-48.7057 |Abadiânia          |Goiás              |Centro-Oeste|Brasil|
|4           |-1.7218 |-48.8788 |Abaetetuba         |Pará               |Norte       |Brasil|
|5           |-19.1551|-45.4444 |Abaeté             |Minas Gerais       |Sudeste     |Brasil|
|6           |-7.3459 |-39.0416 |Abaiara            |Ceará              |Nordeste    |Brasil|
|7           |-8.7207 |-39.1162 |Abaré              |Bahia              |Nordeste    |Brasil|
|8           |-23.3049|-50.3133 |Abatiá             |Paraná 

                                                                                

In [7]:
df_sexo = df_cancer.select(
    F.col("sex_name").alias("sexo")
).distinct()

df_sexo = df_sexo.withColumn("sexoPK", F.monotonically_increasing_id())

print("Numero de Tuplas: ", df_sexo.count())
df_sexo.show()

                                                                                

Numero de Tuplas:  2




+---------+------+
|     sexo|sexoPK|
+---------+------+
| Feminino|     0|
|Masculino|     1|
+---------+------+
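
`monotonically_increasing_id` guarantees unique, increasing IDs, but not consecutive ones, and the values depend on how the data is partitioned; the keys 0 and 1 above are what this particular run produced. The date and location dimensions instead number rows after sorting, which gives the same keys on every run. A plain-Python sketch of that sorted-numbering rule follows; `assign_surrogate_keys` is a hypothetical helper, not something the notebook defines.

```python
def assign_surrogate_keys(values, start=1):
    """Map each distinct value to a consecutive integer key, in sorted order."""
    return {value: key for key, value in enumerate(sorted(set(values)), start=start)}


sexo_keys = assign_surrogate_keys(["Masculino", "Feminino", "Masculino"])
print(sexo_keys)  # {'Feminino': 1, 'Masculino': 2}
```

Because the keys depend only on the set of values, rebuilding the dimension from the same data reproduces them, which matters when a fact table is joined against the dimension in a later cell.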



                                                                                

In [8]:
# Denominator for taxaIncidenciaTotal: the sum of every "Número" value,
# across all measures (deaths included)
total_casos = df_cancer.filter(F.col("metric_name") == "Número") \
    .agg(F.sum("val").alias("total_casos_geral")) \
    .collect()[0]["total_casos_geral"]

df_tipo_cancer = (
    df_cancer.filter(F.col("metric_name") == "Número")
            .groupBy("cause_name")
            .agg(
                # Deaths per cancer type
                F.sum(F.when(F.col("measure_name") == "Óbitos", F.col("val")).otherwise(0)).alias("total_obitos"),
                # Every non-death measure per cancer type (incidence and prevalence together)
                F.sum(F.when(F.col("measure_name") != "Óbitos", F.col("val")).otherwise(0)).alias("total_casos")
            )
            # mortalidade: deaths as a percentage of this type's cases (0 when there are none)
            .withColumn(
                "mortalidade",
                F.round(F.when(F.col("total_casos") > 0, (F.col("total_obitos") / F.col("total_casos")) * 100).otherwise(0),2)
            )
            # taxaIncidenciaTotal: this type's cases as a percentage of the overall total
            .withColumn(
                "taxaIncidenciaTotal",
                F.round((F.col("total_casos") / F.lit(total_casos)) * 100, 2)
            )
            .withColumn(
                "tipoCancerPK",
                F.monotonically_increasing_id()
            )
            .select(
                "tipoCancerPK",
                F.col("cause_name").alias("tipoCancer"),
                "mortalidade",
                "taxaIncidenciaTotal"
            )
)

print("Numero de Tuplas: ", df_tipo_cancer.count())
df_tipo_cancer.show(truncate=False)

                                                                                

Numero de Tuplas:  3




+------------+------------------------------------------------------------+-----------+-------------------+
|tipoCancerPK|tipoCancer                                                  |mortalidade|taxaIncidenciaTotal|
+------------+------------------------------------------------------------+-----------+-------------------+
|0           |Câncer de pele não melanoma (carcinoma basocelular)         |0.0        |89.55              |
|1           |Melanoma maligno da pele                                    |35.94      |3.52               |
|2           |Câncer de pele não melanoma (carcinoma de células escamosas)|23.84      |4.57               |
+------------+------------------------------------------------------------+-----------+-------------------+
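
The two derived attributes are simple ratios, and working them through on a few rows makes the numerators and denominators explicit. The sketch below mirrors the aggregation in plain Python; the rows, the cause names `"A"` and `"B"`, and the helper `derived_attributes` are hypothetical and are not taken from the dataset.

```python
# Hypothetical "Número" rows: (cause_name, measure_name, val). Not real data.
rows = [
    ("A", "Óbitos", 10.0),
    ("A", "Incidência", 30.0),
    ("A", "Prevalência", 70.0),
    ("B", "Óbitos", 0.0),
    ("B", "Incidência", 40.0),
    ("B", "Prevalência", 50.0),
]

# Global denominator: every value, deaths included (as in the notebook).
total_geral = sum(val for _, _, val in rows)


def derived_attributes(cause):
    """Return (mortalidade, taxaIncidenciaTotal) for one cause, both in percent."""
    obitos = sum(v for c, m, v in rows if c == cause and m == "Óbitos")
    casos = sum(v for c, m, v in rows if c == cause and m != "Óbitos")
    mortalidade = round(obitos / casos * 100, 2) if casos > 0 else 0.0
    taxa = round(casos / total_geral * 100, 2)
    return mortalidade, taxa


print(derived_attributes("A"))  # (10.0, 50.0)
print(derived_attributes("B"))  # (0.0, 45.0)
```

Because deaths are part of the denominator but not of the numerator, the shares do not add up to 100; in the table above they sum to 89.55 + 3.52 + 4.57 = 97.64.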



                                                                                

In [24]:
df_faixa_etaria = (
    df_cancer.select(
        F.col("age_id").alias("idIdade"),
        F.col("age_name").alias("faixaEtaria")
    )
    .distinct()
    .withColumn(
        "faixaEtaria",
        F.regexp_replace(F.regexp_replace(F.col("faixaEtaria"), "anos", ""), " ", "")
    )
    .withColumn(
        "idadeMin",
        F.when(
            F.col("faixaEtaria").contains("+"),
            F.regexp_replace(F.col("faixaEtaria"), "[^0-9]", "").cast("int")
        ).otherwise(
            F.split(F.col("faixaEtaria"), "-")[0].cast("int")
        )
    )
    .withColumn(
        "idadeMax",
        F.when(
            F.col("faixaEtaria").contains("+"),
            F.lit(130)
        ).otherwise(
            F.split(F.col("faixaEtaria"), "-")[1].cast("int")
        )
    )
    .withColumn("faixaPK", F.monotonically_increasing_id())
    .select("faixaPK", "faixaEtaria", "idadeMin", "idadeMax", "idIdade")
)

print("Numero de Tuplas: ", df_faixa_etaria.count())
df_faixa_etaria.show()

                                                                                

Numero de Tuplas:  20




+-------+-----------+--------+--------+-------+
|faixaPK|faixaEtaria|idadeMin|idadeMax|idIdade|
+-------+-----------+--------+--------+-------+
|      0|      35-39|      35|      39|     12|
|      1|      40-44|      40|      44|     13|
|      2|      25-29|      25|      29|     10|
|      3|      70-74|      70|      74|     19|
|      4|      20-24|      20|      24|      9|
|      5|      65-69|      65|      69|     18|
|      6|      10-14|      10|      14|      7|
|      7|      50-54|      50|      54|     15|
|      8|      80-84|      80|      84|     30|
|      9|      90-94|      90|      94|     32|
|     10|      15-19|      15|      19|      8|
|     11|        95+|      95|     130|    235|
|     12|      75-79|      75|      79|     20|
|     13|      30-34|      30|      34|     11|
|     14|      60-64|      60|      64|     17|
|     15|        5-9|       5|       9|      6|
|     16|      85-89|      85|      89|     31|
|     17|      55-59|      55|      59| 
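
The age-band parsing handles two label shapes: a closed range such as `15 -19 anos` and an open-ended band such as `95+`. A plain-Python sketch of the same rules is given below for checking them outside Spark. `parse_age_band` and `OPEN_ENDED_MAX` are hypothetical names, and the raw form `"95+ anos"` is an assumption, since only the cleaned label `95+` is visible in the output.

```python
import re

OPEN_ENDED_MAX = 130  # same upper bound the notebook assigns to the open-ended band


def parse_age_band(age_name):
    """Return (label, idade_min, idade_max) for '15 -19 anos' or '95+ anos' style labels."""
    label = age_name.replace("anos", "").replace(" ", "")
    if "+" in label:
        # Open-ended band: keep only the digits as the lower bound.
        return label, int(re.sub(r"[^0-9]", "", label)), OPEN_ENDED_MAX
    low, high = label.split("-")
    return label, int(low), int(high)


print(parse_age_band("15 -19 anos"))  # ('15-19', 15, 19)
print(parse_age_band("95+ anos"))     # ('95+', 95, 130)
```

A label in any other shape raises `ValueError` in this sketch, whereas the Spark version would silently produce null bounds through the failed cast, so the sketch is a stricter check than the notebook code.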

                                                                                

In [17]:
df_metrica = (
    df_cancer
    .select(F.col("metric_name").alias("tipoMetrica"))
    .distinct()
)

windowSpec = Window.orderBy("tipoMetrica")

df_metrica = (
    df_metrica
    .withColumn("metricaPK", F.row_number().over(windowSpec))
    .select("metricaPK", "tipoMetrica")
)

df_metrica.show()

+---------+-----------+
|metricaPK|tipoMetrica|
+---------+-----------+
|        1|     Número|
|        2| Percentual|
|        3|       Taxa|
+---------+-----------+




In [None]:
df_fato_clima = (
    df_clima.withColumn(
            "dataCompleta",
            F.expr("date_add(to_date(concat(YEAR, '-01-01')), DOY - 1)")
        ) 
        .join(df_data, ["dataCompleta"], "left") 
        .join(df_localidade, ["cidade", "latitude", "longitude"], "left") 
        .select(
            F.col("dataPK"),
            F.col("localidadePK"),
            F.col("T2M").alias("temperaturaMedia"),
            F.col("T2M_MAX").alias("temperaturaMax"),
            F.col("T2M_MIN").alias("temperaturaMin"),
            F.col("ALLSKY_SFC_UV_INDEX").alias("radiacaoUV"),
            F.col("ALLSKY_SFC_UVA").alias("radiacaoUVA"),
            F.col("ALLSKY_SFC_UVB").alias("radiacaoUVB"),
            F.col("PRECTOTCORR").alias("precipitacao")
        )
)
print("Numero de Tuplas: ", df_fato_clima.count())
df_fato_clima.show()

                                                                                

Numero de Tuplas:  42721900


+------+------------+----------------+--------------+--------------+----------+-----------+-----------+------------+
|dataPK|localidadePK|temperaturaMedia|temperaturaMax|temperaturaMin|radiacaoUV|radiacaoUVA|radiacaoUVB|precipitacao|
+------+------------+----------------+--------------+--------------+----------+-----------+-----------+------------+
|     5|           1|           23.18|         29.38|         17.65|      3.03|       1.65|       0.05|        1.83|
|    11|           1|           21.46|         25.64|         17.72|      1.93|       1.07|       0.03|        4.05|
|    12|           1|           22.58|         27.68|         17.29|      2.12|       1.09|       0.04|         6.4|
|     4|           1|           23.14|         27.84|          18.3|      1.81|       0.95|       0.03|        1.72|
|    20|           1|           24.27|         28.82|          19.7|      3.11|       1.65|       0.05|        2.01|
|    13|           1|           23.26|         28.17|          1
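
The fact table has the same 42721900 rows as the source, which is what a left join against unique dimension keys should produce. A matching row count does not show that every foreign key was resolved, though, because a left join keeps unmatched rows and fills their keys with nulls. The sketch below illustrates that behaviour in plain Python; the lookup dictionaries and the helper `to_fact_rows` are hypothetical miniatures, not the notebook's data structures.

```python
# Hypothetical miniature dimension lookups: natural key -> surrogate key.
data_pk = {"2001-01-01": 1, "2001-01-02": 2}
localidade_pk = {("Abadia de Goiás", -16.7573, -49.4412): 1}

# Hypothetical source rows: (date, cidade, latitude, longitude, T2M).
source = [
    ("2001-01-01", "Abadia de Goiás", -16.7573, -49.4412, 23.12),
    ("2001-01-02", "Abadia de Goiás", -16.7573, -49.4412, 21.86),
    ("2001-01-03", "Abadia de Goiás", -16.7573, -49.4412, 21.72),  # date missing from the lookup
]


def to_fact_rows(rows):
    """Left-join semantics: an unmatched key becomes None instead of dropping the row."""
    return [
        (data_pk.get(d), localidade_pk.get((cidade, lat, lon)), t2m)
        for d, cidade, lat, lon, t2m in rows
    ]


fact = to_fact_rows(source)
orphans = [row for row in fact if row[0] is None or row[1] is None]
print(len(fact), len(orphans))  # 3 1
```

Counting rows with a null `dataPK` or `localidadePK` in the real fact table would be the corresponding check in Spark.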

In [22]:
df_estado = (
    df_localidade
    .select(F.col("estado"))
    .distinct()
)

window_estado = Window.orderBy("estado")
df_estado = (
    df_estado
    .withColumn("estadoPK", F.row_number().over(window_estado))
    .select("estadoPK", "estado")
)

df_ano = (
    df_data
    .select(F.col("dataAno").alias("ano"))
    .distinct()
)

window_ano = Window.orderBy("ano")
df_ano = (
    df_ano
    .withColumn("anoPK", F.row_number().over(window_ano))
    .select("anoPK", "ano")
)

df_estado.show()
df_ano.show()

+--------+-------------------+
|estadoPK|             estado|
+--------+-------------------+
|       1|               null|
|       2|               Acre|
|       3|            Alagoas|
|       4|              Amapá|
|       5|           Amazonas|
|       6|              Bahia|
|       7|              Ceará|
|       8|   Distrito Federal|
|       9|     Espírito Santo|
|      10|              Goiás|
|      11|           Maranhão|
|      12|        Mato Grosso|
|      13| Mato Grosso do Sul|
|      14|       Minas Gerais|
|      15|             Paraná|
|      16|            Paraíba|
|      17|               Pará|
|      18|         Pernambuco|
|      19|              Piauí|
|      20|Rio Grande do Norte|
+--------+-------------------+
only showing top 20 rows



+-----+----+
|anoPK| ano|
+-----+----+
|    1|2001|
|    2|2002|
|    3|2003|
|    4|2004|
|    5|2005|
|    6|2006|
|    7|2007|
|    8|2008|
|    9|2009|
|   10|2010|
|   11|2011|
|   12|2012|
|   13|2013|
|   14|2014|
|   15|2015|
|   16|2016|
|   17|2017|
|   18|2018|
|   19|2019|
|   20|2020|
+-----+----+
only showing top 20 rows




In [30]:

df_fato_cancer = (
    df_cancer
    .join(df_estado, df_cancer.location_name == df_estado.estado, "left")
    .join(df_sexo, df_cancer.sex_name == df_sexo.sexo, "left")
    .join(df_faixa_etaria, df_cancer.age_id == df_faixa_etaria.idIdade, "left")
    .join(df_tipo_cancer, df_cancer.cause_name == df_tipo_cancer.tipoCancer, "left")
    .join(df_ano, df_cancer.year == df_ano.ano, "left")
    .join(df_metrica, df_cancer.metric_name == df_metrica.tipoMetrica, "left")
    .groupBy("anoPK", "estadoPK", "sexoPK", "faixaPK", "tipoCancerPK", "metricaPK")
    .pivot("measure_name")
    .agg(F.sum("val"))
)

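group_cols = ["anoPK", "estadoPK", "sexoPK", "faixaPK", "tipoCancerPK", "metricaPK"]
pivot_cols = [c for c in df_fato_cancer.columns if c not in group_cols]

# Replace missing measures with 0 and keep three decimal places.
# Note: format_string returns strings, so cast these columns back to
# double before using them in numeric aggregations.
for col_name in pivot_cols:
    df_fato_cancer = df_fato_cancer.withColumn(
        col_name,
        F.format_string("%.3f", F.coalesce(F.col(col_name), F.lit(0.0)))
    )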

df_fato_cancer = df_fato_cancer \
    .withColumnRenamed("Incidência", "incidenciaCancer") \
    .withColumnRenamed("Prevalência", "prevalenciaCancer") \
    .withColumnRenamed("Óbitos", "obitosCancer")

df_fato_cancer.show()

+-----+--------+------+-------+------------+---------+----------------+-----------------+------------+
|anoPK|estadoPK|sexoPK|faixaPK|tipoCancerPK|metricaPK|incidenciaCancer|prevalenciaCancer|obitosCancer|
+-----+--------+------+-------+------------+---------+----------------+-----------------+------------+
|   18|       8|     1|     14|           0|        2|           0.001|            0.000|       0.000|
|   18|      14|     1|      5|           1|        3|           6.256|           16.494|       3.779|
|   15|      10|     1|     13|           0|        3|          21.212|            2.249|       0.000|
|   14|      27|     0|     13|           1|        3|           1.497|           12.376|       0.314|
|    4|       8|     1|      3|           2|        1|           1.153|            1.500|       0.763|
|   12|       4|     0|      9|           2|        2|           0.000|            0.000|       0.003|
|    9|      11|     0|      8|           1|        2|           0.000|  
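
The cancer fact is built by pivoting a long table (one row per measure) into a wide one (one column per measure), summing values that share a key and filling missing measures with zero. The plain-Python sketch below shows that reshaping on a handful of rows; the rows and the helper `pivot_measures` are hypothetical, and the key is shortened to a (year, state) pair for readability.

```python
from collections import defaultdict

MEASURES = ("Incidência", "Prevalência", "Óbitos")

# Hypothetical long-format rows: (key, measure_name, val). Not real data.
long_rows = [
    (("2001", "Acre"), "Óbitos", 0.25),
    (("2001", "Acre"), "Incidência", 1.5),
    (("2002", "Acre"), "Incidência", 2.0),
    (("2002", "Acre"), "Incidência", 0.5),
]


def pivot_measures(rows):
    """One output row per key, one summed column per measure, missing measures as 0.0."""
    sums = defaultdict(lambda: defaultdict(float))
    for key, measure, val in rows:
        sums[key][measure] += val
    return {
        key: tuple(round(by_measure.get(m, 0.0), 3) for m in MEASURES)
        for key, by_measure in sums.items()
    }


wide = pivot_measures(long_rows)
print(wide[("2001", "Acre")])  # (1.5, 0.0, 0.25)
print(wide[("2002", "Acre")])  # (2.5, 0.0, 0.0)
```

Unlike the notebook's `format_string` step, this sketch rounds and keeps the measures numeric, which is one way to retain three decimals without turning the columns into text.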

                                                                                