# 5.2 - Features from the latest reviews

In this notebook, we explore more Window functions in PySpark. <br>
Along the way, we build features based on the reviews written by customers.

## Imports

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()



## Join datasets

Since we will look at the reviews per customer_id, we need to join the following datasets:
- reviews
- orders
- customers

Since we already know how to join in PySpark, we will use the following code.

In [2]:
review_path = "data/raw/olist_order_reviews_dataset.csv"
order_path = "data/raw/olist_orders_dataset.csv"
customer_path = "data/raw/olist_customers_dataset.csv"

review_df = spark.read.csv(review_path, header=True, inferSchema=True)
order_df = spark.read.csv(order_path, header=True, inferSchema=True)
customer_df = spark.read.csv(customer_path, header=True, inferSchema=True)

review_df.printSchema()
order_df.printSchema()
customer_df.printSchema()

root
 |-- review_id: string (nullable = true)
 |-- order_id: string (nullable = true)
 |-- review_score: string (nullable = true)
 |-- review_comment_title: string (nullable = true)
 |-- review_comment_message: string (nullable = true)
 |-- review_creation_date: string (nullable = true)
 |-- review_answer_timestamp: string (nullable = true)

root
 |-- order_id: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- order_status: string (nullable = true)
 |-- order_purchase_timestamp: timestamp (nullable = true)
 |-- order_approved_at: timestamp (nullable = true)
 |-- order_delivered_carrier_date: timestamp (nullable = true)
 |-- order_delivered_customer_date: timestamp (nullable = true)
 |-- order_estimated_delivery_date: timestamp (nullable = true)

root
 |-- customer_id: string (nullable = true)
 |-- customer_unique_id: string (nullable = true)
 |-- customer_zip_code_prefix: integer (nullable = true)
 |-- customer_city: string (nullable = true)
 |-- customer_state: 

In [3]:
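# Keep a single review per order: within each order_id, take the first row when sorted by review_score (ascending)
w = Window.partitionBy("order_id").orderBy("review_score")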

df = (
    review_df
    .join(order_df, "order_id", "inner")
    .join(customer_df, "customer_id", "inner")
    .withColumn("rn", F.row_number().over(w))
    .where("rn = 1")
    .select(
        "customer_id",
        "order_id",
        "review_creation_date",
        "review_score",
    )
)

df.orderBy(F.rand()).show(5)

+--------------------+--------------------+--------------------+------------+
|         customer_id|            order_id|review_creation_date|review_score|
+--------------------+--------------------+--------------------+------------+
|5fb72531a5f56bbbf...|3eb7b3f39b5c39ce0...| 2018-01-20 00:00:00|           5|
|f8e4b4531b76c0efa...|9fe314dcda4135956...| 2017-05-30 00:00:00|           5|
|fe4a5493fd7197b7b...|16fac761b4906243e...| 2018-03-15 00:00:00|           4|
|af71c6566cc7ebbfe...|adc3c2a7a5283d29f...| 2018-08-16 00:00:00|           5|
|d55606cbf9e683ce2...|23b051786ba773bb6...| 2017-02-22 00:00:00|           5|
+--------------------+--------------------+--------------------+------------+
only showing top 5 rows
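
The `row_number` filter in the cell above keeps one review per `order_id`. A minimal plain-Python sketch of the same idea follows, assuming in-memory dicts instead of a Spark DataFrame. The helper name `keep_first_per_key` and the sample rows are hypothetical and only serve to illustrate the logic.

```python
from itertools import groupby
from operator import itemgetter


def keep_first_per_key(rows, key, order_by):
    """Plain-Python analogue of row_number() over a window, filtered to rn = 1."""
    # Sort by the partition key first, then by the ordering column.
    ordered = sorted(rows, key=itemgetter(key, order_by))
    # groupby walks each partition in sorted order; its first row plays the role of rn = 1.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


# Hypothetical rows: order "a" has two reviews, order "b" has one.
reviews = [
    {"order_id": "a", "review_score": "5"},
    {"order_id": "a", "review_score": "3"},
    {"order_id": "b", "review_score": "4"},
]

deduped = keep_first_per_key(reviews, "order_id", "review_score")
print(deduped)
```

Only one row per order survives, so the later joins and window features do not count the same order twice.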


## Using an example DataFrame

In [4]:
data = [(1, 1, 1, 1), (1, 2, 2, 2), (2, 1, 1, 5), (2, 2, 2, 1)]
cols = ["customer_id", "order_id", "review_creation_date", "review_score"]
df_exemplo = spark.createDataFrame(data, cols)

df_exemplo.show()

+-----------+--------+--------------------+------------+
|customer_id|order_id|review_creation_date|review_score|
+-----------+--------+--------------------+------------+
|          1|       1|                   1|           1|
|          1|       2|                   2|           2|
|          2|       1|                   1|           5|
|          2|       2|                   2|           1|
+-----------+--------+--------------------+------------+




In [5]:
# Get each customer's previous review score with lag; wrapped in a function so it can be reused on the main dataset

def create_lag_review(df):
    w = (
        Window
        .partitionBy("customer_id")
        .orderBy(F.col("review_creation_date").asc())
    )

    return (
        df
        .withColumn("review_score_lag", F.lag("review_score", 1).over(w))
    )

df_exemplo.transform(create_lag_review).show()

+-----------+--------+--------------------+------------+----------------+
|customer_id|order_id|review_creation_date|review_score|review_score_lag|
+-----------+--------+--------------------+------------+----------------+
|          1|       1|                   1|           1|            NULL|
|          1|       2|                   2|           2|               1|
|          2|       1|                   1|           5|            NULL|
|          2|       2|                   2|           1|               5|
+-----------+--------+--------------------+------------+----------------+
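
To make the semantics of `lag` concrete, here is a small plain-Python sketch of what a lag over a partitioned, ordered window computes. It assumes in-memory dicts rather than Spark, and the helper name `lag_by_partition` is hypothetical. The data mirrors the example DataFrame above.

```python
from collections import defaultdict


def lag_by_partition(rows, partition_key, order_key, value_key, offset=1):
    """For each row, fetch value_key from the row `offset` positions earlier in its partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    result = []
    for key in sorted(partitions):
        # Order the rows inside the partition, like orderBy on the window.
        ordered = sorted(partitions[key], key=lambda r: r[order_key])
        for i, row in enumerate(ordered):
            # The first `offset` rows have no predecessor, so they get None (NULL in Spark).
            previous = ordered[i - offset][value_key] if i >= offset else None
            result.append({**row, "review_score_lag": previous})
    return result


cols = ["customer_id", "order_id", "review_creation_date", "review_score"]
data = [(1, 1, 1, 1), (1, 2, 2, 2), (2, 1, 1, 5), (2, 2, 2, 1)]
rows = [dict(zip(cols, values)) for values in data]

lagged = lag_by_partition(rows, "customer_id", "review_creation_date", "review_score")
for row in lagged:
    print(row)
```

The first review of each customer has no earlier row in its partition, which is why the Spark output shows `NULL` there.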



In [6]:
# Compute a moving average over the current row and the previous one

def create_ma_review(df):
    w = (
        Window
        .partitionBy("customer_id")
        # A row-based frame needs an explicit ordering to be deterministic
        .orderBy(F.col("review_creation_date").asc())
        .rowsBetween(-1, 0)
    )

    return (
        df
        # Use a separate column name so the lag feature is not overwritten
        .withColumn("review_score_ma", F.avg("review_score").over(w))
    )

df_exemplo.transform(create_ma_review).show()

+-----------+--------+--------------------+------------+---------------+
|customer_id|order_id|review_creation_date|review_score|review_score_ma|
+-----------+--------+--------------------+------------+---------------+
|          1|       1|                   1|           1|            1.0|
|          1|       2|                   2|           2|            1.5|
|          2|       1|                   1|           5|            5.0|
|          2|       2|                   2|           1|            3.0|
+-----------+--------+--------------------+------------+---------------+
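
The frame `rowsBetween(-1, 0)` covers the previous row and the current row of each partition. The plain-Python sketch below illustrates that computation, assuming the rows of each customer are ordered by `review_creation_date`. The helper name `moving_average_by_partition` and the output key `moving_avg` are hypothetical.

```python
from collections import defaultdict


def moving_average_by_partition(rows, partition_key, order_key, value_key, preceding=1):
    """Average value_key over the current row and up to `preceding` earlier rows."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    result = []
    for key in sorted(partitions):
        ordered = sorted(partitions[key], key=lambda r: r[order_key])
        for i, row in enumerate(ordered):
            # The frame is clipped at the start of the partition, like rowsBetween(-preceding, 0).
            frame = ordered[max(0, i - preceding): i + 1]
            average = sum(r[value_key] for r in frame) / len(frame)
            result.append({**row, "moving_avg": average})
    return result


cols = ["customer_id", "order_id", "review_creation_date", "review_score"]
data = [(1, 1, 1, 1), (1, 2, 2, 2), (2, 1, 1, 5), (2, 2, 2, 1)]
rows = [dict(zip(cols, values)) for values in data]

averaged = moving_average_by_partition(
    rows, "customer_id", "review_creation_date", "review_score"
)
for row in averaged:
    print(row)
```

Because the frame includes the current row, the first review of each customer averages only itself, so this feature is never null, unlike the lag.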



## Applying to the features dataset and saving

In [7]:
path = "data/processed/features_last_review"

(
    df
    .transform(create_lag_review)
    .transform(create_ma_review)
    .write
    .mode("overwrite")
    .parquet(path)
)
