# Data Preparation for Machine Learning with PySpark

This notebook covers the full data preprocessing cycle for Machine Learning using PySpark.
The topics covered are:

- Handling missing values (`Imputer`)
    - Imputation strategy
- Assembling the feature vector (`VectorAssembler`)
- Encoders: `OneHotEncoder`, `StringIndexer`
- Scaling and normalization (`StandardScaler`, `MinMaxScaler`)
- Complete ML pipeline

In [None]:
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder, StandardScaler, MinMaxScaler
from pyspark.ml import Pipeline

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

## Loading the Dataset

In [None]:
path = "data/processed/final"

df = spark.read.parquet(path)
df.show(5)

+--------------------+--------------------+------------+--------------------+------------+-------------------+-------------------+------------------+----------------+---------------+---------------+---------------+---------------+-----------------------+-----------------------+-----------------------+-----------------------+-------------+-----------------+-----------------+-----------------+-----------------+------------------------+---------------------+----------------------+---------------------+------+
|            order_id|           review_id|review_score|         customer_id|order_status|  order_purchase_ts| order_delivered_ts|order_days_between|order_count_item|order_sum_price|order_min_price|order_max_price|order_avg_price|order_sum_freight_value|order_min_freight_value|order_max_freight_value|order_avg_freight_value|payment_count|payment_min_value|payment_max_value|payment_avg_value|payment_sum_value|payment_max_installments|payment_primary_value|payment_primary_method|paymen

## Handling Missing Values

Let's check the null rate of every column, computed as the mean of a 0/1 "is null" flag.

In [22]:
null_logic = [(F.mean(F.col(c).isNull().cast("int"))).alias(c) for c in df.columns]

null_rates = df.agg(*null_logic)

(
    null_rates
    .transpose()
    .withColumnRenamed("0.0", "null_rate")
    .where(F.col("null_rate") > 0)
).show()

+------------------+--------------------+
|               key|           null_rate|
+------------------+--------------------+
|order_delivered_ts|6.261152678208058E-5|
|order_days_between|6.261152678208058E-5|
+------------------+--------------------+
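
To make the "mean of an is-null flag" idea concrete, here is a minimal plain-Python sketch. It does not use Spark, and the rows and values are hypothetical toy data, not taken from the dataset.

```python
# Plain-Python illustration (not Spark): null rate = mean of a 0/1 "is null" flag.
# The rows below are hypothetical toy data.
rows = [
    {"order_days_between": 8.0, "order_status": "delivered"},
    {"order_days_between": None, "order_status": "delivered"},
    {"order_days_between": 13.0, "order_status": "delivered"},
    {"order_days_between": 2.0, "order_status": "canceled"},
]


def null_rates(rows):
    """Return {column: fraction of rows where the column is None}."""
    columns = rows[0].keys()
    return {
        c: sum(1 if r[c] is None else 0 for r in rows) / len(rows)
        for c in columns
    }


rates = null_rates(rows)
# Keep only the columns that contain nulls, like the .where(...) filter does.
with_nulls = {c: rate for c, rate in rates.items() if rate > 0}
print(with_nulls)  # {'order_days_between': 0.25}
```
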



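Now let's handle the columns that contain null values. Only `order_days_between` is imputed: `Imputer` works on numeric columns, and `order_delivered_ts` is a timestamp.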

In [None]:
from pyspark.ml.feature import Imputer

imputer = (
    Imputer(
        inputCols=['order_days_between'], 
        outputCols=['order_days_between'],
        strategy='median'  # Other options: 'mean' and 'mode'
    )
)

df_clean = imputer.fit(df).transform(df)

(
    df_clean
    .agg(*null_logic)
    .transpose()
    .withColumnRenamed('0.0', 'null_count')
    .where(F.col('null_count') > 0)
).show()

+------------------+--------------------+
|               key|          null_count|
+------------------+--------------------+
|order_delivered_ts|6.261152678208058E-5|
+------------------+--------------------+
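
The three imputation strategies can be sketched in plain Python as follows. This is only an illustration of the idea under toy, hypothetical values; it is not how Spark implements `Imputer` (Spark, for instance, may compute the median approximately on large data).

```python
from statistics import mean, median, mode


def impute(values, strategy="median"):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


days = [2.0, 8.0, None, 13.0, 8.0, 40.0]  # hypothetical toy values

print(impute(days, "median"))  # [2.0, 8.0, 8.0, 13.0, 8.0, 40.0]
print(impute(days, "mode"))    # [2.0, 8.0, 8.0, 13.0, 8.0, 40.0]
```

The median and mode are less sensitive than the mean to outliers such as the `40.0` above, which is one reason to prefer them for skewed columns.
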



## Feature Selection

In [None]:
# Compute the number of columns in the final dataset

Rule: a column is selected as a feature when it
- is not a timestamp/date
- does not end with `_id`
- is not the target (`review_score`, `target`)

In [25]:
all_cols = df_clean.columns
ignore_suffixes = ('_id',)  # trailing comma makes this a tuple; ('_id') is just a string
ignore_names = ['review_score', 'target']
dtypes = dict(df_clean.dtypes)

feature_cols = []
for c in all_cols:
    if c.endswith(ignore_suffixes) or c in ignore_names:
        continue
    if dtypes[c] in ['timestamp', 'date']:
        continue
    
    feature_cols.append(c)

print("Selected features:")
feature_cols

Selected features:


['order_status',
 'order_days_between',
 'order_count_item',
 'order_sum_price',
 'order_min_price',
 'order_max_price',
 'order_avg_price',
 'order_sum_freight_value',
 'order_min_freight_value',
 'order_max_freight_value',
 'order_avg_freight_value',
 'payment_count',
 'payment_min_value',
 'payment_max_value',
 'payment_avg_value',
 'payment_sum_value',
 'payment_max_installments',
 'payment_primary_value',
 'payment_primary_method',
 'payment_primary_share']

### Separating Numeric and Categorical Columns

In [33]:
categorical_cols = ['order_status', 'payment_primary_method']
numeric_features = [c for c in feature_cols if c not in categorical_cols]

## Encoders: StringIndexer and OneHotEncoder

For the categorical columns (`order_status`, `payment_primary_method`), we need to convert strings to numeric indices and then to binary (one-hot) vectors.

1.  **StringIndexer**: Converts strings to numeric indices (0 to n-1), ordered by frequency, so the most frequent category gets index 0.
2.  **OneHotEncoder (OHE)**: Turns the indices into **sparse vectors** so that the model does not read an ordering into the categories (e.g. assuming that 2 > 0).
3.  **Sparse vector, e.g. `(3, [1], [1.0])`**: A compact format that stores only the vector size and the position of the active value (1.0): `(size, [category index], [1.0])`. With the default `dropLast=True`, the size is the number of categories minus one and the last category is encoded as an all-zero vector.
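
The two steps above can be sketched in plain Python. This is an illustration of the idea, not Spark's implementation: the function names, the toy category values, and the alphabetical tie-break are assumptions made for the example. The `drop_last` flag mimics the default `dropLast=True` of `OneHotEncoder`, where the vector has one slot fewer than the number of categories.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically here (an assumption for this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot_sparse(index, n_categories, drop_last=True):
    """Return (size, [positions], [values]), like Spark's printed sparse vectors."""
    size = n_categories - 1 if drop_last else n_categories
    i = int(index)
    if i < size:
        return (size, [i], [1.0])
    return (size, [], [])  # the dropped last category is all zeros


# Hypothetical toy column with four distinct categories.
methods = ["credit_card", "boleto", "credit_card", "voucher",
           "debit_card", "credit_card", "boleto"]

mapping = fit_string_index(methods)
print(mapping["credit_card"], mapping["boleto"])        # 0.0 1.0
print(one_hot_sparse(mapping["boleto"], len(mapping)))  # (3, [1], [1.0])
print(one_hot_sparse(mapping["voucher"], len(mapping))) # (3, [], [])
```
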


In [None]:
indexer = StringIndexer(inputCols=categorical_cols, outputCols=[c + "_index" for c in categorical_cols])
model_indexer = indexer.fit(df_clean)
df_indexed = model_indexer.transform(df_clean)

encoder = OneHotEncoder(inputCols=[c + "_index" for c in categorical_cols], outputCols=[c + "_vec" for c in categorical_cols])
model_encoder = encoder.fit(df_indexed)
df_encoded = model_encoder.transform(df_indexed)

In [44]:
df_encoded.select(
    categorical_cols
    + [c + "_index" for c in categorical_cols]
    + [c + "_vec" for c in categorical_cols] 
).show(5)

+------------+----------------------+------------------+----------------------------+----------------+--------------------------+
|order_status|payment_primary_method|order_status_index|payment_primary_method_index|order_status_vec|payment_primary_method_vec|
+------------+----------------------+------------------+----------------------------+----------------+--------------------------+
|   delivered|                boleto|               0.0|                         1.0|   (1,[0],[1.0])|             (3,[1],[1.0])|
|   delivered|           credit_card|               0.0|                         0.0|   (1,[0],[1.0])|             (3,[0],[1.0])|
|   delivered|           credit_card|               0.0|                         0.0|   (1,[0],[1.0])|             (3,[0],[1.0])|
|   delivered|           credit_card|               0.0|                         0.0|   (1,[0],[1.0])|             (3,[0],[1.0])|
|   delivered|           credit_card|               0.0|                         0.0|   (1

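## Scaling and Normalization

Many ML algorithms, particularly distance-based and gradient-based ones, perform better when the numeric features are on the same scale. `StandardScaler` standardizes each feature to zero mean and unit standard deviation, while `MinMaxScaler` rescales each feature to a fixed range (by default [0, 1]).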
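
As a rough illustration of the two kinds of scaling, here is a plain-Python sketch applied to a single feature. The function names and the toy prices are assumptions made for the example; the Spark scalers apply the same idea to each dimension of the feature vector.

```python
from statistics import mean, stdev


def standardize(values):
    """z = (x - mean) / std, using the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(x - mu) / sigma for x in values]


def min_max(values, lo=0.0, hi=1.0):
    """Rescale linearly so that min(values) -> lo and max(values) -> hi."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Constant feature: map everything to the middle of the target range.
        return [(lo + hi) / 2 for _ in values]
    return [lo + (x - x_min) / (x_max - x_min) * (hi - lo) for x in values]


prices = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical toy values

print([round(z, 3) for z in standardize(prices)])
print(min_max(prices))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Standardization keeps the shape of the distribution but is affected by outliers through the mean and standard deviation; min-max scaling bounds the output range but is determined entirely by the two extreme values.
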

In [45]:
assembler_num = VectorAssembler(inputCols=numeric_features, outputCol="numeric_features_vec")
df_num_vec = assembler_num.transform(df_clean)

scaler = StandardScaler(inputCol="numeric_features_vec", outputCol="scaled_features", withStd=True, withMean=True)
scaler_model = scaler.fit(df_num_vec)
df_scaled = scaler_model.transform(df_num_vec)

df_scaled.select("numeric_features_vec", "scaled_features").show(5, truncate=False)

+-----------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|numeric_features_vec                                                                                                   |scaled_features                                                                                                                                                                                                                                                                                                                                                                 |
+-----------------

## Complete Pipeline

Now let's put everything together in a Pipeline to automate the process.
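
Conceptually, fitting a pipeline means fitting each stage in order and passing the transformed data on to the next stage; the fitted pipeline can then replay the same transformations on new data. The plain-Python sketch below illustrates that idea only. The class names (`MedianImputer`, `MinMax`, `SimplePipeline`) and the toy rows are hypothetical and are not part of the PySpark API.

```python
from statistics import median


class MedianImputer:
    """Learns the median of one column, then fills None with it."""

    def __init__(self, col):
        self.col = col

    def fit(self, rows):
        self.fill_ = median(r[self.col] for r in rows if r[self.col] is not None)
        return self

    def transform(self, rows):
        return [
            {**r, self.col: self.fill_ if r[self.col] is None else r[self.col]}
            for r in rows
        ]


class MinMax:
    """Learns min and max of one column, then rescales it to [0, 1]."""

    def __init__(self, col):
        self.col = col

    def fit(self, rows):
        vals = [r[self.col] for r in rows]
        self.lo_, self.hi_ = min(vals), max(vals)
        return self

    def transform(self, rows):
        span = (self.hi_ - self.lo_) or 1.0  # avoid dividing by zero
        return [{**r, self.col: (r[self.col] - self.lo_) / span} for r in rows]


class SimplePipeline:
    """Fits stages in order, feeding each stage the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"days": 2.0}, {"days": None}, {"days": 6.0}, {"days": 10.0}]
model = SimplePipeline([MedianImputer("days"), MinMax("days")]).fit(train)

print(model.transform(train))
# New data reuses the statistics learned during fit (median 6.0, range 2.0 to 10.0).
print(model.transform([{"days": None}, {"days": 4.0}]))
```

The point of the design is that statistics such as the median or the min/max are learned once, on the training data, and then applied unchanged to any later data.
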

In [50]:
stages = []

# 0. Imputer
imputer = Imputer(inputCols=['order_days_between'], outputCols=['order_days_between']).setStrategy('median')
stages.append(imputer)

# 1. StringIndexer for the categorical columns
indexer = StringIndexer(inputCols=categorical_cols, outputCols=[c + "_index" for c in categorical_cols])
stages.append(indexer)

# 2. OneHotEncoder for the categorical columns
encoder = OneHotEncoder(inputCols=[c + "_index" for c in categorical_cols], outputCols=[c + "_vec" for c in categorical_cols])
stages.append(encoder)

# 3. VectorAssembler (joins the numeric columns with the one-hot vectors)
assembler_inputs = numeric_features + [c + "_vec" for c in categorical_cols]
assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features_raw")
stages.append(assembler)

# 4. Scaler (optional, but recommended)
# Note: scaling sparse vectors (from the one-hot encoding) can be costly with withMean=True,
# because centering makes them dense.
# We usually scale the numeric columns separately or use MinMaxScaler.
scaler = MinMaxScaler(inputCol="features_raw", outputCol="features")
stages.append(scaler)

# Creating the Pipeline
pipeline = Pipeline(stages=stages)

# Fit and transform
model = pipeline.fit(df_clean)
final_df = model.transform(df_clean)

final_df.select("features").show(5, truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features                                                                                                                                                                                                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [51]:
final_df.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- review_score: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- order_status: string (nullable = true)
 |-- order_purchase_ts: timestamp (nullable = true)
 |-- order_delivered_ts: timestamp (nullable = true)
 |-- order_days_between: double (nullable = true)
 |-- order_count_item: long (nullable = true)
 |-- order_sum_price: double (nullable = true)
 |-- order_min_price: double (nullable = true)
 |-- order_max_price: double (nullable = true)
 |-- order_avg_price: double (nullable = true)
 |-- order_sum_freight_value: double (nullable = true)
 |-- order_min_freight_value: double (nullable = true)
 |-- order_max_freight_value: double (nullable = true)
 |-- order_avg_freight_value: double (nullable = true)
 |-- payment_count: long (nullable = true)
 |-- payment_min_value: double (nullable = true)
 |-- payment_max_value: double (nullable = true)
 |-- payment_avg_value: double (n