# Spark
An open-source platform for processing large volumes of data quickly and efficiently. It supports several programming languages (Scala, Java, Python, and R, plus SQL) and provides a rich, extensible API for developers.

How Spark interacts with a cluster is central to its architecture and scalability.
* Cluster: a set of interconnected computers used to run tasks in parallel.
* Spark Driver: the main Spark process, which coordinates operations and communicates with the cluster manager.
* Executor: a process running on a cluster node that executes the tasks assigned by the driver.
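
The division of labor between the driver and the executors can be pictured with a small plain-Python analogy. This is only a sketch and not how Spark is implemented: a thread pool stands in for the executors, and the names run_task and driver are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(partition):
    # "executor" side: process one chunk of the data independently
    return sum(x * x for x in partition)

def driver(data, num_partitions=4):
    # "driver" side: split the data into partitions, hand each one out as a task,
    # then combine the partial results
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(run_task, partitions))
    return sum(partial_results)

total = driver(list(range(10)))
print(total)
```

The driver never touches individual values; it only decides how the work is split and how the partial results are merged.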


# PySpark
PySpark is the Python interface to Spark. It gives access to Spark's features and lets developers use Spark's libraries and tools for distributed data processing.


To start a session, import the SparkSession class from the pyspark.sql module. SparkSession establishes the connection to the Spark cluster and is the entry point for working with DataFrames and running SQL queries.


In [1]:
# creating the session
from pyspark.sql import SparkSession
import pyspark as ps
spark = (SparkSession
        .builder
        .appName('Python Spark SQL basic example')
        .config('spark.some.config.option', 'some-value')
        .getOrCreate())

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/05 10:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
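# note: this SparkConf is only built here; it is never passed to the session created above, so it does not change it
# "yarn-client" is a legacy master string, deprecated since Spark 2.0 in favor of "yarn" with client deploy mode
conf = ps.SparkConf().setMaster("yarn-client").setAppName("spark-mer")
conf.set("spark.executor.heartbeatInterval", '3600s')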

<pyspark.conf.SparkConf at 0x7f083427d450>

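spark.read.csv reads a CSV file. The optional header=True parameter tells Spark that the first line of the file holds the column names. The next cell uses the equivalent spark.read.format('csv').option("header", "true").load(...) form. Because no schema is given and schema inference is not enabled, every column is read as a string.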
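
To see why a CSV read without a schema yields only string columns, here is a plain-Python sketch using the standard csv module (not Spark itself). The sample rows are made up; the point is that a CSV reader only sees text, so the header row supplies column names while numeric types need an explicit conversion.

```python
import csv
import io

# tiny made-up sample in the same shape as the dataset (header + rows)
raw = "id,diagnosis,radius_mean\n1,M,17.99\n2,B,9.904\n"

# the first line is used as the header, like header=True
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0])

# every value is text until it is converted explicitly
radius = [float(r["radius_mean"]) for r in rows]
print(radius)
```

In Spark, the comparable choices are passing an explicit schema or setting option("inferSchema", "true") on the reader, at the cost of an extra pass over the data for inference.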

In [3]:
# reading the CSV and checking the column types
df_spark = (spark.read.format('csv')
            .option("header", "true")
            .load("Cancer_Data.csv"))
df_spark

                                                                                

DataFrame[id: string, diagnosis: string, radius_mean: string, texture_mean: string, perimeter_mean: string, area_mean: string, smoothness_mean: string, compactness_mean: string, concavity_mean: string, concave points_mean: string, symmetry_mean: string, fractal_dimension_mean: string, radius_se: string, texture_se: string, perimeter_se: string, area_se: string, smoothness_se: string, compactness_se: string, concavity_se: string, concave points_se: string, symmetry_se: string, fractal_dimension_se: string, radius_worst: string, texture_worst: string, perimeter_worst: string, area_worst: string, smoothness_worst: string, compactness_worst: string, concavity_worst: string, concave points_worst: string, symmetry_worst: string, fractal_dimension_worst: string, _c32: string]

In [4]:
# shows the data as a table
df_spark.show(5)

23/11/05 10:50:37 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
23/11/05 10:50:38 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: id, diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave points_worst, symmetry_worst, fractal_dimension_worst, 
 Schema: id, diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perim

+--------+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+----+
|      id|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|_c32|
+--------+---------+-----------+------

                                                                                

In [5]:
# converting to a pandas DataFrame
df_pd = df_spark.toPandas()
df_pd.head(2)

23/11/05 10:50:56 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: id, diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave points_worst, symmetry_worst, fractal_dimension_worst, 
 Schema: id, diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_wor

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,_c32
0,842302,M,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956,0.1238,0.1866,0.2416,0.186,0.275,0.08902,


In [6]:
# dropping the all-null column
df_pd.drop('_c32', axis=1, inplace=True) 
df_pd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   id                       569 non-null    object
 1   diagnosis                569 non-null    object
 2   radius_mean              569 non-null    object
 3   texture_mean             569 non-null    object
 4   perimeter_mean           569 non-null    object
 5   area_mean                569 non-null    object
 6   smoothness_mean          569 non-null    object
 7   compactness_mean         569 non-null    object
 8   concavity_mean           569 non-null    object
 9   concave points_mean      569 non-null    object
 10  symmetry_mean            569 non-null    object
 11  fractal_dimension_mean   569 non-null    object
 12  radius_se                569 non-null    object
 13  texture_se               569 non-null    object
 14  perimeter_se             569 non-null    o

In [7]:
# converting the pandas DataFrame back to Spark
df_spark2 = spark.createDataFrame(df_pd)
df_spark2.show(2)

  if should_localize and is_datetime64tz_dtype(s.dtype) and s.dt.tz is not None:
[Stage 3:>                                                          (0 + 1) / 1]

+------+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+
|    id|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|
+------+---------+-----------+------------+---------

                                                                                

In [8]:
# dropping columns
df_spark2_mod = df_spark2.drop('id')
df_spark2_mod.show(2)

+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+
|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|
+---------+-----------+------------+--------------+---------+-----

In [9]:
# selecting a few columns
df_spark_select = df_spark2_mod.select('diagnosis', 'radius_mean', 'perimeter_mean', 'area_mean')
df_spark_select.show(2)

+---------+-----------+--------------+---------+
|diagnosis|radius_mean|perimeter_mean|area_mean|
+---------+-----------+--------------+---------+
|        M|      17.99|         122.8|     1001|
|        M|      20.57|         132.9|     1326|
+---------+-----------+--------------+---------+
only showing top 2 rows



In [10]:
# renaming columns
df_spark_rename = df_spark2_mod.withColumnRenamed('perimeter_mean', 'perimeter_avg')
df_spark_rename.show(2)

+---------+-----------+------------+-------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+
|diagnosis|radius_mean|texture_mean|perimeter_avg|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|
+---------+-----------+------------+-------------+---------+--------

The withColumn() method adds a column to a DataFrame, or replaces an existing column that has the same name.
* DataFrame.withColumn(colName, col)
  
  The lit() function creates a column holding a literal (constant) value.

In [11]:
import pyspark.sql.functions as f
from pyspark.sql.functions import col, lit, when
# set a constant value for the new column 'base'
df_spark2_mod = df_spark2_mod.withColumn('base', lit('cancer_data'))
df_spark2_mod.show(2)

+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+-----------+
|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|       base|
+---------+-----------+------------+------

In [12]:
# referring to a column by name with col() to compute a new column
df_spark2_col = df_spark2_mod.withColumn('area_aprox', 3.1415*col('radius_mean')**2)
df_spark2_col.show(2)

+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+-----------+------------------+
|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|       base|        area_aprox|
+---

In [13]:
# rounding the value
df_spark2_col_round = df_spark2_col.withColumn('area_aprox', f.round(col('area_aprox')))
df_spark2_col_round.show(2)

+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+-----------+----------+
|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|       base|area_aprox|
+---------+---------

In [14]:
# defining conditional expressions
df_spark2_diagnosis = df_spark2_mod.withColumn('diagnosis_desc',
                                            # when the diagnosis column is 'M', write 'Maligno' (malignant) to the description column
                                            when(col('diagnosis') == 'M', 'Maligno')
                                            # when the diagnosis column is 'B', write 'Benigno' (benign) to the description column
                                            .when(col('diagnosis')== 'B', 'Benigno')
                                            .otherwise('outros'))

df_spark2_diagnosis.show(2)

+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+-----------+--------------+
|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|       base|diagnosis_desc|
+---------+-

In [15]:
# filtering the rows of a DataFrame based on a condition
df_benigno = df_spark2_diagnosis.filter(col('diagnosis_desc')== 'Benigno')
df_benigno.show(3)

+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+-----------+--------------+
|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|       base|diagnosis_desc|
+---------+-

In [16]:
# the same filter written as a SQL expression string
df_benigno2 = df_spark2_diagnosis.filter("diagnosis_desc == 'Benigno'")
df_benigno2.show(2)

+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+-----------+--------------+
|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|       base|diagnosis_desc|
+---------+-

In [17]:
# combining conditions with & (each condition needs its own parentheses)
df_benigno3 = df_spark2_diagnosis.filter((col('diagnosis_desc')== 'Benigno') & (col('radius_mean') <= 10))
df_benigno3.show(3)

+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+-----------+--------------+
|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|       base|diagnosis_desc|
+---------+-

In [18]:
# counting the number of rows in a DataFrame
print(f"base full: {df_spark2_mod.count()}")
print(f"base benigno: {df_benigno2.count()}")
print(f"base benigno2: {df_benigno3.count()}")

                                                                                

base full: 569
base benigno: 357
base benigno2: 84


                                                                                

## Transformations and actions
In PySpark, DataFrame operations fall into two groups: transformations, which are lazy, and actions, which are eager.
* Transformations: are not executed immediately. Spark only records them and runs them once an action is triggered. Examples: select(), filter(), withColumn(), groupBy(), join(), sort().

* Actions: trigger the execution of all the transformations recorded for a DataFrame and return or write a result. Examples: show(), count(), collect(), and writes made through df.write, such as write.parquet() or write.save().
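
The difference between building a pipeline and running it can be illustrated with Python generators. This is an analogy in plain Python, not Spark code: the generator expressions play the role of transformations, list() plays the role of an action, and the rows and the log list are made up for the example.

```python
log = []

def read_rows():
    # source: yields rows one at a time and records each read
    for row in [{"diagnosis": "M", "radius": 17.99},
                {"diagnosis": "B", "radius": 9.9},
                {"diagnosis": "B", "radius": 12.4}]:
        log.append("read")
        yield row

# "transformations": building the pipeline does no work yet
benign = (r for r in read_rows() if r["diagnosis"] == "B")
radii = (r["radius"] for r in benign)
steps_before = len(log)   # still 0, nothing has been read

# "action": consuming the pipeline triggers all the steps
result = list(radii)
steps_after = len(log)    # 3 rows were read
print(steps_before, steps_after, result)
```

Deferring the work this way is what lets Spark look at the whole chain of transformations before deciding how to execute it.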



# Parquet
Parquet is a columnar file format optimized for big data processing. It is designed for efficient compression and fast data access, using metadata to skip blocks of data that a query does not need.


- The **write.parquet** operation writes a DataFrame in Parquet format to a given location. The mode parameter sets the write mode (for example, append or overwrite).
- The **read.parquet** operation reads a file or directory in Parquet format.
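
Skipping blocks through metadata can be sketched in plain Python. The idea is that each block of rows carries min/max statistics for a column, so a reader can discard a whole block without opening it. The row_groups structure and the scan_greater_than function below are simplified, made-up stand-ins and do not reflect the actual Parquet file layout.

```python
# made-up "row groups": each stores min/max statistics for a column next to its values
row_groups = [
    {"min": 6.9, "max": 11.0, "values": [6.9, 8.2, 11.0]},
    {"min": 11.2, "max": 15.8, "values": [11.2, 13.5, 15.8]},
    {"min": 16.1, "max": 28.1, "values": [16.1, 20.6, 28.1]},
]

def scan_greater_than(groups, threshold):
    """Return values > threshold, reading only groups whose stats allow a match."""
    matches, groups_read = [], 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # metadata proves no value can match: skip the whole block
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read

matches, groups_read = scan_greater_than(row_groups, 16.0)
print(matches, groups_read)
```

Here only one of the three blocks is read, which is the kind of saving the format aims for on filtered queries.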


In [19]:
# write
df_spark2_mod.write.parquet('cancer_parquet.parquet', mode='overwrite')

23/11/05 10:53:32 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 95,00% for 8 writers
                                                                                

In [21]:
df = spark.read.parquet('cancer_parquet.parquet')
df

DataFrame[diagnosis: string, radius_mean: string, texture_mean: string, perimeter_mean: string, area_mean: string, smoothness_mean: string, compactness_mean: string, concavity_mean: string, concave points_mean: string, symmetry_mean: string, fractal_dimension_mean: string, radius_se: string, texture_se: string, perimeter_se: string, area_se: string, smoothness_se: string, compactness_se: string, concavity_se: string, concave points_se: string, symmetry_se: string, fractal_dimension_se: string, radius_worst: string, texture_worst: string, perimeter_worst: string, area_worst: string, smoothness_worst: string, compactness_worst: string, concavity_worst: string, concave points_worst: string, symmetry_worst: string, fractal_dimension_worst: string, base: string]

### Agrupamento
As in pandas, groupBy groups the rows of a DataFrame by one or more columns.

In [22]:
df_agg = (df_spark2_mod
        .groupBy('diagnosis')
        .count())

df_agg.show()



+---------+-----+
|diagnosis|count|
+---------+-----+
|        B|  357|
|        M|  212|
+---------+-----+



                                                                                

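The **agg** function performs aggregations. It accepts one or more aggregate expressions, such as sum, mean, count, minimum, and maximum, and it can also run your own custom aggregate functions.
alias() sets the name of the resulting column.

**Do not confuse f.min and f.max with Python's built-in min and max functions.**

Note that radius_mean is still a string column at this point, so f.min and f.max compare the values as text. That is why the output of the next cell reports a minimum (10.03) larger than the maximum (9.904) for group B. Casting the column first, for example col('radius_mean').cast('double'), gives numeric results.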


In [24]:
df_spark_agg = (df_spark2_mod
                .groupBy('diagnosis')
                .agg(f.count('radius_mean').alias('n'),
                    f.min('radius_mean').alias('min_radius_mean'),
                    f.mean('radius_mean').alias('avg_radius_mean'),
                    f.max('radius_mean').alias('max_radius_mean')))
df_spark_agg.show()



+---------+---+---------------+------------------+---------------+
|diagnosis|  n|min_radius_mean|   avg_radius_mean|max_radius_mean|
+---------+---+---------------+------------------+---------------+
|        B|357|          10.03|12.146523809523808|          9.904|
|        M|212|          10.95| 17.46283018867925|          28.11|
+---------+---+---------------+------------------+---------------+
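
The minimum and maximum for group B in the table above look inverted (10.03 and 9.904). Since the schema shows radius_mean as a string, the likely cause is text comparison, which the plain-Python sketch below reproduces with the same two values. In PySpark, one possible fix (not applied in the original notebook) is to aggregate over col('radius_mean').cast('double').

```python
# two values taken from the B group in the output above, stored as text
as_text = ["10.03", "9.904"]

# text comparison goes character by character: "1" sorts before "9"
print(min(as_text), max(as_text))        # → 10.03 9.904

# after converting to numbers the order is the expected one
as_numbers = [float(v) for v in as_text]
print(min(as_numbers), max(as_numbers))  # → 9.904 10.03
```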



                                                                                

The pivot function turns the distinct values of a column into individual columns. A pivot needs the grouping columns that define the rows, the column whose values become the new columns, and an aggregate function that combines the values falling into each cell. Combinations with no rows show up as NULL.

In [25]:
df_spark2_pivot = (df_spark2_mod
                .withColumn('bigger', col('area_mean')>1000)
                .groupBy('bigger')
                .pivot('diagnosis')
                .count())

df_spark2_pivot.show()



+------+----+---+
|bigger|   B|  M|
+------+----+---+
|  true|NULL| 92|
| false| 357|120|
+------+----+---+
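
What groupBy, pivot, and count do together can be sketched in plain Python with a Counter. The rows below are made up and the variable names are illustrative only; the sketch shows how distinct values of the pivot column become output columns and how a missing combination ends up empty, like the NULL in the table above.

```python
from collections import Counter

# made-up rows: (bigger, diagnosis) pairs
rows = [(True, "M"), (False, "M"), (False, "B"), (False, "B"), (True, "M")]

# count each (row key, pivot value) combination
counts = Counter(rows)

# distinct pivot values become the output columns
columns = sorted({diagnosis for _, diagnosis in rows})

# one output row per group key; missing combinations stay None (NULL in Spark)
pivoted = {
    bigger: {c: counts.get((bigger, c)) for c in columns}
    for bigger in sorted({b for b, _ in rows}, reverse=True)
}
print(columns)
print(pivoted)
```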



                                                                                

### Ordenamento
orderBy sorts the rows of a DataFrame by one or more columns. Because radius_mean is a string column here, the values are ordered as text rather than numerically.

In [26]:
df_spark_sort = (df_spark2_mod.orderBy('radius_mean'))
df_spark_sort.show()



+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+-----------+
|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|       base|
+---------+-----------+------------+------

                                                                                

More than one column can be passed to orderBy.
To sort in descending order, use the desc function.

In [27]:
df_spark_sort2 = (df_spark2_mod.orderBy('diagnosis', f.desc('radius_mean')))
df_spark_sort2.show(5)



+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+-----------+
|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|       base|
+---------+-----------+------------+------

                                                                                

### Join
Similar to merge in pandas. The join method combines two DataFrames based on one or more common columns and returns a new DataFrame with the records that satisfy the join condition.

In [28]:
df1 = df_spark2.select('id', 'diagnosis')
df2 = df_spark2.select('id', 'radius_mean')
df1.show(2)
df2.show(2)

+------+---------+
|    id|diagnosis|
+------+---------+
|842302|        M|
|842517|        M|
+------+---------+
only showing top 2 rows

+------+-----------+
|    id|radius_mean|
+------+-----------+
|842302|      17.99|
|842517|      20.57|
+------+-----------+
only showing top 2 rows



In [29]:
df_join = df1.join(df2, 'id', 'inner')
print(df_join.count())
df_join.show(3)

                                                                                

569


                                                                                

+------+---------+-----------+
|    id|diagnosis|radius_mean|
+------+---------+-----------+
|852631|        M|      17.14|
|859487|        B|      12.78|
|857156|        B|      13.49|
+------+---------+-----------+
only showing top 3 rows
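
The logic of an inner join on a key can be sketched in plain Python. The rows are made up to mirror df1 (id, diagnosis) and df2 (id, radius_mean), and inner_join is a hypothetical helper, not a Spark function. Only keys present on both sides survive, which is what the 'inner' argument requests; other common values for that argument are 'left', 'right', and 'outer'.

```python
# made-up rows keyed by id, mirroring df1 (id, diagnosis) and df2 (id, radius_mean)
left = [(842302, "M"), (842517, "M"), (999999, "B")]
right = [(842302, "17.99"), (842517, "20.57"), (111111, "12.00")]

def inner_join(left_rows, right_rows):
    # index the right side by key, then keep only left rows whose key is found
    right_by_id = {}
    for key, value in right_rows:
        right_by_id.setdefault(key, []).append(value)
    return [(key, l_val, r_val)
            for key, l_val in left_rows
            for r_val in right_by_id.get(key, [])]

joined = inner_join(left, right)
print(joined)
```

Unlike this sketch, Spark does not guarantee the row order of a join result, which is why the first rows shown above differ from the first rows of df1 and df2.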



### Cache
Caching stores the data of a DataFrame temporarily in memory to speed up later access to it (df.cache()). The call itself is lazy: the data is stored the first time an action runs on the DataFrame.

Caching is particularly useful when you run several actions on the same DataFrame, since otherwise each action recomputes it from the source.

To release the cached data, call the unpersist() method.
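
Why caching helps when several actions hit the same data can be shown with a plain-Python analogy. Nothing here is Spark code: load_and_transform is a made-up stand-in for an expensive chain of transformations, and the call counter only exists to make the recomputation visible.

```python
calls = {"n": 0}

def load_and_transform():
    # stands in for an expensive chain of transformations
    calls["n"] += 1
    return [x * 2 for x in range(5)]

# without caching: each "action" recomputes the data
total = sum(load_and_transform())
size = len(load_and_transform())
runs_without_cache = calls["n"]      # 2

# with caching: compute once, keep the result in memory, reuse it
calls["n"] = 0
cached = load_and_transform()
total_cached = sum(cached)
size_cached = len(cached)
runs_with_cache = calls["n"]         # 1

del cached  # release the memory, comparable in spirit to unpersist()
print(runs_without_cache, runs_with_cache)
```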


### Partições
A DataFrame is split into partitions so that the data can be processed in parallel across a distributed cluster.

The right number of partitions depends on the size of the data, the resources available in the cluster, and the kind of operations that will run on the DataFrame.

It is also possible to partition by column and to save Parquet files in partitioned form.


In [31]:
df_part = df_spark2.repartition(10)
df_part.rdd.getNumPartitions()



10
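
The two kinds of partitioning mentioned in this section can be sketched in plain Python. The rows and both helper functions are made up for illustration. The first helper mimics the column=value folder layout that Spark produces when writing with df.write.partitionBy('diagnosis'); the second spreads rows evenly over a fixed number of partitions, which is roughly what repartition(n) without columns does.

```python
from collections import defaultdict

# made-up rows: (diagnosis, radius_mean)
rows = [("M", 17.99), ("B", 9.9), ("M", 20.57), ("B", 12.4), ("B", 11.1)]

def partition_by_column(data):
    # group rows under a "column=value" folder name, the layout Spark uses
    # when writing with partitionBy
    folders = defaultdict(list)
    for diagnosis, radius in data:
        folders[f"diagnosis={diagnosis}"].append(radius)
    return dict(folders)

def repartition(data, num_partitions):
    # spread rows round-robin over a fixed number of partitions
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(data):
        parts[i % num_partitions].append(row)
    return parts

print(partition_by_column(rows))
print([len(p) for p in repartition(rows, 2)])
```

Partitioning by column pays off when later reads filter on that column, because whole folders can be ignored.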