# PART 3 - Model training, alternative B

This model uses the dataset created in part 2, named **df_offers_and_trainHistory_with_count.csv.gz**, which results from joining the trainHistory.csv.gz and offers.csv.gz tables for each offer.

In [103]:

# Basic imports

from pyspark.sql import SparkSession
from dotenv import load_dotenv
load_dotenv('.env')
import os
from pyspark.ml.feature import VectorAssembler, OneHotEncoder
from pyspark.ml.classification import LinearSVC, RandomForestClassifier, GBTClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
import pyspark.sql.functions as F


In [104]:
# Build SparkSession
spark = SparkSession.builder.appName("DataPreparation").getOrCreate()
base_path = os.getenv('BASE_PATH')

In [105]:

df_dataset = spark.read.csv(
    f"{base_path}-ml/df_offers_and_trainHistory.csv",
    header=True,
    inferSchema=True
)

df_dataset = df_dataset.withColumn("repeater", F.when(F.col("repeater") == "t", 1).otherwise(0))

In [106]:
df_dataset.printSchema()
df_dataset.show(5)

root
 |-- offer: integer (nullable = true)
 |-- id: long (nullable = true)
 |-- chain: integer (nullable = true)
 |-- market: integer (nullable = true)
 |-- repeattrips: integer (nullable = true)
 |-- repeater: integer (nullable = false)
 |-- offerdate: date (nullable = true)
 |-- category: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- company: integer (nullable = true)
 |-- offervalue: double (nullable = true)
 |-- brand: integer (nullable = true)

+-------+--------+-----+------+-----------+--------+----------+--------+--------+---------+----------+------+
|  offer|      id|chain|market|repeattrips|repeater| offerdate|category|quantity|  company|offervalue| brand|
+-------+--------+-----+------+-----------+--------+----------+--------+--------+---------+----------+------+
|1208251|   86246|  205|    34|          5|       1|2013-04-24|    2202|       1|104460040|       2.0|  3718|
|1197502|   86252|  205|    34|         16|       1|2013-03-27|    3203|       1

Given the dataset schema above, the chosen features were:
- **offervalue** -> numeric value of the offer (a double in the schema).
- **category** -> category of the offer.
- **quantity** -> quantity of the offer.
- **brand** -> brand of the offer.
- **company** -> company that the offer originates from.

Below we list the columns chosen as features; the remaining columns are left out of the feature vector.

In [107]:
cols_feature = ['offervalue', 'category', 'quantity', 'brand', 'company']

# # As all the columns are numerical we won't need the StringIndexer
# index_output_cols = [x + ' Index' for x in df_dataset.columns if x not in cols_not_feature]
# one_output_cols = [x + ' OHE' for x in df_dataset.columns if x not in cols_not_feature]

# ohe_encoder = OneHotEncoder(inputCols=one_output_cols, outputCols=one_output_cols)
vec_assembler = VectorAssembler(
    inputCols=cols_feature,
    outputCol='features'
)

**Model training**

In [108]:
df_train, df_validation = df_dataset.randomSplit([0.8, 0.2], seed=42)

df_train.write.mode('overwrite').option('header', 'true').option('compression', 'gzip').csv(f"{base_path}-ml/model-B/df_train.csv.gz")

print(f'There are {df_train.count()} rows in the training set and {df_validation.count()} rows in the validation set.')

There are 127878 rows in the training set and 32179 rows in the validation set.


In [109]:
df_train.write.mode('overwrite').parquet(f"{base_path}-ml/model-B/parquet/df_train.parquet")
df_validation.write.mode('overwrite').parquet(f"{base_path}-ml/model-B/parquet/df_validation.parquet")

In [110]:
# Linear SVC algorithm
lsvc = LinearSVC(maxIter=100, regParam=0.05, labelCol='repeater')
# Random Forest algorithm
rf = RandomForestClassifier(    
    labelCol='repeater', 
    featuresCol='features', 
    numTrees=100,           # Good number of trees
    maxDepth=10,            # Add max depth to prevent overfitting
    minInstancesPerNode=5,  # Minimum instances per leaf node
    maxBins=32,             # Number of bins for discretizing continuous features
    subsamplingRate=0.8,    # Bootstrap sampling rate
    featureSubsetStrategy='sqrt',  # Number of features to consider at each split
    seed=42
)

# Gradient Boosted Trees algorithm
gbt = GBTClassifier(
    labelCol='repeater',
    featuresCol='features',
    maxIter=100,
    maxDepth=10,
    minInstancesPerNode=5,
    maxBins=32,
    subsamplingRate=0.8,
    featureSubsetStrategy='sqrt',
    seed=42
)

In [111]:
# write().overwrite() lets this cell be re-run without failing on an existing path
pipeline_lsvc = Pipeline(stages=[vec_assembler, lsvc])
pipeline_lsvc.write().overwrite().save('data-ml/model-B/lsvc/pipeline_model_lsvc')

pipeline_rf = Pipeline(stages=[vec_assembler, rf])
pipeline_rf.write().overwrite().save('data-ml/model-B/rf/pipeline_model_rf')

pipeline_gbt = Pipeline(stages=[vec_assembler, gbt])
pipeline_gbt.write().overwrite().save('data-ml/model-B/gbt/pipeline_model_gbt')

In [112]:
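# Fit each pipeline on the training split and persist the fitted model;
# write().overwrite() lets this cell be re-run without failing on an existing path
model_lsvc = pipeline_lsvc.fit(df_train)
model_lsvc.write().overwrite().save('data-ml/model-B/lsvc/model_lsvc')

model_rf = pipeline_rf.fit(df_train)
model_rf.write().overwrite().save('data-ml/model-B/rf/model_rf')

model_gbt = pipeline_gbt.fit(df_train)
model_gbt.write().overwrite().save('data-ml/model-B/gbt/model_gbt')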

**Model evaluation**

In [113]:
df_predictions_lsvc = model_lsvc.transform(df_validation)
df_predictions_lsvc.printSchema()
df_predictions_lsvc.show(3)

root
 |-- offer: integer (nullable = true)
 |-- id: long (nullable = true)
 |-- chain: integer (nullable = true)
 |-- market: integer (nullable = true)
 |-- repeattrips: integer (nullable = true)
 |-- repeater: integer (nullable = false)
 |-- offerdate: date (nullable = true)
 |-- category: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- company: integer (nullable = true)
 |-- offervalue: double (nullable = true)
 |-- brand: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- prediction: double (nullable = false)

+-------+---------+-----+------+-----------+--------+----------+--------+--------+---------+----------+-----+--------------------+--------------------+----------+
|  offer|       id|chain|market|repeattrips|repeater| offerdate|category|quantity|  company|offervalue|brand|            features|       rawPrediction|prediction|
+-------+---------+-----+------+-----------+--------+----------+--

In [114]:
df_predictions_rf = model_rf.transform(df_validation)
df_predictions_rf.printSchema()
df_predictions_rf.show(3)

root
 |-- offer: integer (nullable = true)
 |-- id: long (nullable = true)
 |-- chain: integer (nullable = true)
 |-- market: integer (nullable = true)
 |-- repeattrips: integer (nullable = true)
 |-- repeater: integer (nullable = false)
 |-- offerdate: date (nullable = true)
 |-- category: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- company: integer (nullable = true)
 |-- offervalue: double (nullable = true)
 |-- brand: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

+-------+---------+-----+------+-----------+--------+----------+--------+--------+---------+----------+-----+--------------------+--------------------+--------------------+----------+
|  offer|       id|chain|market|repeattrips|repeater| offerdate|category|quantity|  company|offervalue|brand|            features|       rawPrediction|         proba

In [115]:
df_predictions_gbt = model_gbt.transform(df_validation)
df_predictions_gbt.printSchema()
df_predictions_gbt.show(3)

root
 |-- offer: integer (nullable = true)
 |-- id: long (nullable = true)
 |-- chain: integer (nullable = true)
 |-- market: integer (nullable = true)
 |-- repeattrips: integer (nullable = true)
 |-- repeater: integer (nullable = false)
 |-- offerdate: date (nullable = true)
 |-- category: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- company: integer (nullable = true)
 |-- offervalue: double (nullable = true)
 |-- brand: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

+-------+---------+-----+------+-----------+--------+----------+--------+--------+---------+----------+-----+--------------------+--------------------+--------------------+----------+
|  offer|       id|chain|market|repeattrips|repeater| offerdate|category|quantity|  company|offervalue|brand|            features|       rawPrediction|         proba

In [116]:
df_predictions_eval_lsvc = df_predictions_lsvc.select('features', 'rawPrediction', 'prediction', 'repeater')
df_predictions_eval_rf = df_predictions_rf.select('features', 'rawPrediction', 'prediction', 'repeater')
df_predictions_eval_gbt = df_predictions_gbt.select('features', 'rawPrediction', 'prediction', 'repeater')

binary_evaluator = BinaryClassificationEvaluator(
    labelCol='repeater',
    rawPredictionCol='rawPrediction',
    metricName='areaUnderROC'
)

area_under_roc_lsvc = binary_evaluator.evaluate(df_predictions_eval_lsvc)
area_under_roc_rf = binary_evaluator.evaluate(df_predictions_eval_rf)
area_under_roc_gbt = binary_evaluator.evaluate(df_predictions_eval_gbt)

print(f"Area Under ROC (LSVC): {area_under_roc_lsvc} with {df_predictions_eval_lsvc.count()} rows")
print(f"Area Under ROC (RF): {area_under_roc_rf} with {df_predictions_eval_rf.count()} rows")
print(f"Area Under ROC (GBT): {area_under_roc_gbt} with {df_predictions_eval_gbt.count()} rows")

Area Under ROC (LSVC): 0.566224110451648 with 32179 rows
Area Under ROC (RF): 0.6472421197329993 with 32179 rows
Area Under ROC (GBT): 0.6759872730209203 with 32179 rows
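
As a reading aid for the numbers above: the area under the ROC curve equals the probability that a randomly chosen repeater receives a higher score than a randomly chosen non-repeater, with ties counting one half. The sketch below computes it by brute force in plain Python. The helper `pairwise_auc` and the toy labels and scores are illustrative assumptions and are not part of the notebook or of the dataset.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores for illustration only (1 = repeater, 0 = non-repeater)
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.40, 0.35, 0.20, 0.10, 0.30]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"Toy AUC: {toy_auc:.3f}")  # 4 of the 6 pairs are ordered correctly
```

Every toy score is below 0.5, yet the ranking is still better than chance. AUC measures ranking quality only; it says nothing about how many rows cross the default 0.5 decision threshold.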


In [117]:
print("Confusion Matrix for LSVC:")
df_confusion_matrix_lsvc = df_predictions_eval_lsvc.groupBy('repeater', 'prediction').count()
df_confusion_matrix_lsvc.show()

print("Confusion Matrix for RF:")
df_confusion_matrix_rf = df_predictions_eval_rf.groupBy('repeater', 'prediction').count()
df_confusion_matrix_rf.show()

print("Confusion Matrix for GBT:")
df_confusion_matrix_gbt = df_predictions_eval_gbt.groupBy('repeater', 'prediction').count()
df_confusion_matrix_gbt.show()

Confusion Matrix for LSVC:
+--------+----------+-----+
|repeater|prediction|count|
+--------+----------+-----+
|       1|       0.0| 8725|
|       0|       0.0|23454|
+--------+----------+-----+

Confusion Matrix for RF:
+--------+----------+-----+
|repeater|prediction|count|
+--------+----------+-----+
|       1|       0.0| 8725|
|       0|       0.0|23454|
+--------+----------+-----+

Confusion Matrix for GBT:
+--------+----------+-----+
|repeater|prediction|count|
+--------+----------+-----+
|       1|       0.0| 8059|
|       0|       0.0|22820|
|       1|       1.0|  666|
|       0|       1.0|  634|
+--------+----------+-----+
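
LSVC and RF put every validation row in the `prediction = 0.0` cells, which is what happens when no score crosses the default decision threshold. The sketch below shows, on made-up probabilities, how the confusion counts change as the threshold is lowered. The helper `confusion_at` and the toy data are hypothetical and only illustrate the mechanism; they are not taken from the models above.

```python
def confusion_at(labels, probabilities, threshold):
    """Count TP/TN/FP/FN when predicting 1 for probability >= threshold."""
    counts = {'TP': 0, 'TN': 0, 'FP': 0, 'FN': 0}
    for y, p in zip(labels, probabilities):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            counts['TP'] += 1
        elif y == 0 and predicted == 0:
            counts['TN'] += 1
        elif y == 0 and predicted == 1:
            counts['FP'] += 1
        else:
            counts['FN'] += 1
    return counts


# Made-up positive-class probabilities, all below 0.5 (illustration only)
toy_labels = [1, 0, 1, 0, 0, 1]
toy_probs = [0.45, 0.30, 0.38, 0.20, 0.41, 0.25]

for threshold in (0.5, 0.4, 0.3):
    print(threshold, confusion_at(toy_labels, toy_probs, threshold))
```

Lowering the threshold trades false negatives for false positives, so the cut-off would need to be chosen on a validation set according to the relative cost of each kind of error.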



In [118]:
def confusion_counts(df_confusion_matrix):
    """Turn a (repeater, prediction, count) DataFrame into TP/TN/FP/FN counts.

    Cells that never occur (e.g. no positive predictions) stay at 0.
    """
    names = {(1, 1): 'TP', (0, 0): 'TN', (0, 1): 'FP', (1, 0): 'FN'}
    counts = {'TP': 0, 'TN': 0, 'FP': 0, 'FN': 0}
    for row in df_confusion_matrix.collect():
        key = names[(int(row['repeater']), int(row['prediction']))]
        counts[key] = row['count']
    return counts

# Compute the confusion matrix for LSVC
confmat = confusion_counts(df_confusion_matrix_lsvc)

# Compute the confusion matrix for RF
confmat_rf = confusion_counts(df_confusion_matrix_rf)

# Compute the confusion matrix for GBT
confmat_gbt = confusion_counts(df_confusion_matrix_gbt)

print(f"Confusion Matrix (LSVC): {confmat}")
print(f"Confusion Matrix (RF): {confmat_rf}")
print(f"Confusion Matrix (GBT): {confmat_gbt}")

Confusion Matrix (LSVC): {'TP': 0, 'TN': 23454, 'FP': 0, 'FN': 8725}
Confusion Matrix (RF): {'TP': 0, 'TN': 23454, 'FP': 0, 'FN': 8725}
Confusion Matrix (GBT): {'TP': 666, 'TN': 22820, 'FP': 634, 'FN': 8059}


In [120]:

accuracy = (confmat['TP'] + confmat['TN']) / (confmat['TP'] + confmat['TN'] + confmat['FP'] + confmat['FN'])
precision = (confmat['TP']) / (confmat['TP'] + confmat['FP']) if (confmat['TP'] + confmat['FP']) > 0 else 0
recall = confmat['TP'] / (confmat['TP'] + confmat['FN']) if (confmat['TP'] + confmat['FN']) > 0 else 0
specificity = confmat['TN'] / (confmat['TN'] + confmat['FP']) if (confmat['TN'] + confmat['FP']) > 0 else 0
fiscore = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print("Evaluation Metrics for LSVC:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"Specificity: {specificity}")
print(f"F1 Score: {fiscore}")

accuracy_rf = (confmat_rf['TP'] + confmat_rf['TN']) / (confmat_rf['TP'] + confmat_rf['TN'] + confmat_rf['FP'] + confmat_rf['FN'])
precision_rf = (confmat_rf['TP']) / (confmat_rf['TP'] + confmat_rf['FP']) if (confmat_rf['TP'] + confmat_rf['FP']) > 0 else 0
recall_rf = confmat_rf['TP'] / (confmat_rf['TP'] + confmat_rf['FN']) if (confmat_rf['TP'] + confmat_rf['FN']) > 0 else 0
specificity_rf = confmat_rf['TN'] / (confmat_rf['TN'] + confmat_rf['FP']) if (confmat_rf['TN'] + confmat_rf['FP']) > 0 else 0
fiscore_rf = 2 * (precision_rf * recall_rf) / (precision_rf + recall_rf) if (precision_rf + recall_rf) > 0 else 0
print("\n\nEvaluation Metrics for RF:")
print(f"Accuracy: {accuracy_rf}")
print(f"Precision: {precision_rf}")
print(f"Recall: {recall_rf}")
print(f"Specificity: {specificity_rf}")
print(f"F1 Score: {fiscore_rf}")


accuracy_gbt = (confmat_gbt['TP'] + confmat_gbt['TN']) / (confmat_gbt['TP'] + confmat_gbt['TN'] + confmat_gbt['FP'] + confmat_gbt['FN'])
precision_gbt = (confmat_gbt['TP']) / (confmat_gbt['TP'] + confmat_gbt['FP']) if (confmat_gbt['TP'] + confmat_gbt['FP']) > 0 else 0
recall_gbt = confmat_gbt['TP'] / (confmat_gbt['TP'] + confmat_gbt['FN']) if (confmat_gbt['TP'] + confmat_gbt['FN']) > 0 else 0
specificity_gbt = confmat_gbt['TN'] / (confmat_gbt['TN'] + confmat_gbt['FP']) if (confmat_gbt['TN'] + confmat_gbt['FP']) > 0 else 0
fiscore_gbt = 2 * (precision_gbt * recall_gbt) / (precision_gbt + recall_gbt) if (precision_gbt + recall_gbt) > 0 else 0
print("\n\nEvaluation Metrics for GBT:")
print(f"Accuracy: {accuracy_gbt}")
print(f"Precision: {precision_gbt}")
print(f"Recall: {recall_gbt}")
print(f"Specificity: {specificity_gbt}")
print(f"F1 Score: {fiscore_gbt}")
  

Evaluation Metrics for LSVC:
Accuracy: 0.7288604369309177
Precision: 0
Recall: 0.0
Specificity: 1.0
F1 Score: 0


Evaluation Metrics for RF:
Accuracy: 0.7288604369309177
Precision: 0
Recall: 0.0
Specificity: 1.0
F1 Score: 0


Evaluation Metrics for GBT:
Accuracy: 0.7298548742969017
Precision: 0.5123076923076924
Recall: 0.0763323782234957
Specificity: 0.9729683636053552
F1 Score: 0.13286783042394015


**What these results mean**

The values in this first block do not appear in the output printed above for any of the three models.

- **Precision = 1.0 and specificity = 1.0**: the model is extremely conservative. When it predicts a repeater it is always right, and it correctly identifies 100% of non-repeaters, but this suggests it rarely predicts the positive class.
- **Recall = 0.013 (1.3%)**: the model misses 98.7% of actual repeaters and catches only about 1 in 77 real repeat customers.
- **F1 score = 0.026 (2.6%)**: this very low score confirms the model is practically useless for finding repeaters.

**Root cause analysis**

This pattern typically indicates:

- **Class imbalance**: repeaters are the minority class, 8725 of the 32179 validation rows (about 27%).
- **Conservative model**: the model learned to almost always predict "not a repeater" to maximize accuracy.
- **Feature issues**: the features may not be discriminative enough.

**LSVC and Random Forest results**

- **Precision = 0, recall = 0, F1 = 0**: these models predict zero repeaters; they classify everyone as a non-repeater.
- **Specificity = 1.0**: perfect at identifying non-repeaters, because they never predict a repeater.
- **Accuracy ≈ 0.729**: this only reflects the class distribution, since about 73% of the validation rows are non-repeaters.

**GBT (Gradient Boosted Trees) results**

- **Precision = 0.512**: when it predicts a repeater, it is right about 51% of the time.
- **Recall = 0.076**: it catches only 7.6% of actual repeaters.
- **F1 = 0.133**: still poor overall, but at least it predicts some repeaters.
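
One common response to models that collapse onto the majority class is to weight each class inversely to its frequency. The sketch below derives such weights from the validation counts printed above. This is an assumption made for illustration: in practice the weights would be computed from the training split, and whether they help here is untested. The helper `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(n_negative, n_positive):
    """Weights that give both classes the same total weight: n / (2 * n_class)."""
    total = n_negative + n_positive
    return {0: total / (2 * n_negative), 1: total / (2 * n_positive)}


# Class counts taken from the validation confusion matrices (illustration only)
weights = balanced_class_weights(n_negative=23454, n_positive=8725)
print(weights)

# Possible use in the notebook (not run here; assumes Spark 3.0 or later,
# where LinearSVC, RandomForestClassifier and GBTClassifier accept weightCol):
#
#   df_train_w = df_train.withColumn(
#       'weight',
#       F.when(F.col('repeater') == 1, weights[1]).otherwise(weights[0])
#   )
#   rf_weighted = RandomForestClassifier(labelCol='repeater', featuresCol='features',
#                                        weightCol='weight', seed=42)
```

With these weights each repeater counts for roughly 2.7 non-repeaters during training, which removes the incentive to predict the majority class for every row.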

I think the problem is that we are trying to predict repeat buyers based only on the offer history, the offers that led to a repeat purchase, and their characteristics, which may not be enough.