# 4.1 - Payment features (part 2)

In this notebook, we explore how to perform groupings and aggregations on PySpark DataFrames. <br>
We will use the already-processed payments dataset.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()



In [2]:
path = "data/processed/feature_payment"

df = spark.read.parquet(path)
df.show(5)
df.printSchema()

+--------------------+------------------+------------+--------------------+-------------+-----------------------+-----------------------------+
|            order_id|payment_sequential|payment_type|payment_installments|payment_value|payment_log_total_value|payment_log_installment_value|
+--------------------+------------------+------------+--------------------+-------------+-----------------------+-----------------------------+
|b81ef226f3fe1789b...|                 1| credit_card|                   8|        99.33|     1.9970804354717304|            1.093990448479787|
|a9810da82917af2d9...|                 1| credit_card|                   1|        24.39|     1.3872118003137304|           1.3872118003137304|
|25e8ea4e93396b6fa...|                 1| credit_card|                   1|        65.71|     1.8176314671905152|           1.8176314671905152|
|ba78997921bbcdc13...|                 1| credit_card|                   8|       107.78|     2.0325381792600066|            1.129448192

## Checking for duplicate ids

In [3]:
# Our model will be built to predict at the order_id level
# That is, we need exactly one row per order_id

(
    df
    .groupBy("order_id")
    .count()
    .orderBy(F.col("count").desc())
).show()

+--------------------+-----+
|            order_id|count|
+--------------------+-----+
|fa65dad1b0e818e3c...|   29|
|ccf804e764ed5650c...|   26|
|285c2e15bebd4ac83...|   22|
|895ab968e7bb0d565...|   21|
|fedcd9f7ccdc8cba3...|   19|
|ee9ca989fc93ba09a...|   19|
|21577126c19bf11a0...|   15|
|4bfcba9e084f46c8e...|   15|
|3c58bffb70dcf45f1...|   14|
|4689b1816de42507a...|   14|
|cf101c3abd3c061ca...|   13|
|4fb76fa13b108a0d0...|   13|
|73df5d6adbeea12c8...|   13|
|6d58638e32674bebe...|   12|
|465c2e1bee4561cb3...|   12|
|c6492b842ac190db8...|   12|
|67d83bd36ec2c7fb5...|   12|
|1a611328643ae1114...|   12|
|1d9a9731b9c10fc9c...|   12|
|d744783ed2ace06ca...|   12|
+--------------------+-----+
only showing top 20 rows


## Creating features by grouping by order id

Here we compute summary statistics (count, minimum, maximum, mean, sum, and standard deviation) of `payment_value` for each `order_id`.

In [4]:
col_name = 'payment_value'

df_grouped = df.groupBy("order_id").agg(
    F.count(col_name).alias(f"count_{col_name}"),
    F.min(col_name).alias(f"min_{col_name}"),
    F.max(col_name).alias(f"max_{col_name}"),
    F.avg(col_name).alias(f"avg_{col_name}"),
    F.sum(col_name).alias(f"sum_{col_name}"),
    F.stddev(col_name).alias(f"stddev_{col_name}"),
)

df_grouped.show(3)

+--------------------+-------------------+-----------------+-----------------+------------------+------------------+--------------------+
|            order_id|count_payment_value|min_payment_value|max_payment_value| avg_payment_value| sum_payment_value|stddev_payment_value|
+--------------------+-------------------+-----------------+-----------------+------------------+------------------+--------------------+
|bb2d7e3141540afc2...|                  1|            37.15|            37.15|             37.15|             37.15|                NULL|
|85be7c94bcd3f908f...|                  1|            72.75|            72.75|             72.75|             72.75|                NULL|
|8ca5bdac5ebe8f2d6...|                  9|             15.0|            59.08|21.008888888888887|189.07999999999998|  14.654716343590929|
+--------------------+-------------------+-----------------+-----------------+------------------+------------------+--------------------+
only showing top 3 rows


## Analyzing the number of orders with only one payment

In [5]:
# When a group has only one row, stddev is null

(
    df_grouped
    .agg(
        F.mean(F.col("stddev_payment_value").isNull().cast("int")).alias("pct_w_one_payment")
    )
).show()

+------------------+
| pct_w_one_payment|
+------------------+
|0.9702232502011263|
+------------------+
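For reference, the null comes from the definition of the statistic: `F.stddev` is an alias for the sample standard deviation (`stddev_samp`). With $n$ payment rows $x_1, \dots, x_n$ in an order and mean $\bar{x}$:

$$
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
$$

For $n = 1$ the denominator $n - 1$ is zero, so the value is undefined and shows up as NULL in `stddev_payment_value`. Reading the share of nulls as the share of single-payment orders assumes that `payment_value` itself contains no nulls.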



## Impact of null values on agg


**How does `agg` handle nulls?**

Spark's aggregate functions (`min`, `max`, `mean`, `stddev`) **ignore** null values by default. `count("value")` also skips nulls, whereas `count("*")` counts every row.

In [6]:
# Use a separate name so the payments DataFrame `df` is not overwritten
df_nulls = spark.createDataFrame([(1,), (2,), (None,)], ["value"])

df_nulls.show()
df_nulls.agg(
    F.count("value"),
    F.count("*"),
    F.sum("value"),
    F.avg("value")
).show()

                                                                                

+-----+
|value|
+-----+
|    1|
|    2|
| NULL|
+-----+

+------------+--------+----------+----------+
|count(value)|count(1)|sum(value)|avg(value)|
+------------+--------+----------+----------+
|           2|       3|         3|       1.5|
+------------+--------+----------+----------+



## Saving payment features

In [8]:
path = "data/processed/feature_payment_part2"
df_grouped.write.mode("overwrite").parquet(path)

In [9]:
!tree data/processed/feature_payment_part2

data/processed/feature_payment_part2
├── _SUCCESS
├── part-00000-2b2fb310-e7d4-4ab6-9923-049f36001f29-c000.snappy.parquet
├── part-00001-2b2fb310-e7d4-4ab6-9923-049f36001f29-c000.snappy.parquet
├── part-00002-2b2fb310-e7d4-4ab6-9923-049f36001f29-c000.snappy.parquet
├── part-00003-2b2fb310-e7d4-4ab6-9923-049f36001f29-c000.snappy.parquet
├── part-00004-2b2fb310-e7d4-4ab6-9923-049f36001f29-c000.snappy.parquet
└── part-00005-2b2fb310-e7d4-4ab6-9923-049f36001f29-c000.snappy.parquet

1 directory, 7 files

