---
# Algoritmos para Big Data

**Handout 4 - Machine learning problem - binary classification**

**2024/25**

This lab class is about binary classification in a discrete space. We will set up an ML processing pipeline to achieve the goals, using data from the banking industry. Specifically, the task is fraud detection in credit card transactions.

This notebook should contain only the implementation of Task B as presented in the handout.

The handout and the notebook must therefore be read together.

---
# Task B - ML classifier model

**Datasets**

If clean data is needed after Task A, two parquet files are available on the data server.

The archive files can be downloaded from:

https://bigdata.iscte-iul.eu/datasets/cards-transactions.zip

https://bigdata.iscte-iul.eu/datasets/cards-transactions-small.zip



---
# 1.

In [None]:
# Imports
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

from pyspark.ml import Pipeline
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import BinaryClassificationEvaluator

import plotly.express as px


In [76]:
# Build SparkSession
spark = SparkSession.builder.appName("BinaryClassificationB").getOrCreate()

**Reading and checking data**

In [None]:
# Reading data
data_dir =  
file_transactions = data_dir + 'cards-transactions-small'

In [78]:
df_clean = spark.read.parquet(file_transactions)

In [None]:
# Checking data
print(f'df_clean - number of rows: {df_clean.count()}')
df_clean.printSchema()
df_clean.show()

df_clean - number of rows: 7315741
root
 |-- User: integer (nullable = true)
 |-- Card: integer (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- Day: integer (nullable = true)
 |-- Time: timestamp (nullable = true)
 |-- Use Chip: string (nullable = true)
 |-- Merchant Name: long (nullable = true)
 |-- Merchant City: string (nullable = true)
 |-- MCC: integer (nullable = true)
 |-- Is Fraud?: string (nullable = true)
 |-- NumericAmount: float (nullable = true)
 |-- Hour: integer (nullable = true)
 |-- Min: integer (nullable = true)

+----+----+----+-----+---+-------------------+------------------+--------------------+-------------+----+---------+-------------+----+---+
|User|Card|Year|Month|Day|               Time|          Use Chip|       Merchant Name|Merchant City| MCC|Is Fraud?|NumericAmount|Hour|Min|
+----+----+----+-----+---+-------------------+------------------+--------------------+-------------+----+---------+-------------+----+---+

**Is it really clean?**

In [None]:
print(f'df_clean - number of rows is {df_clean.count() }; after dropDuplicates() applied would be {df_clean.dropDuplicates().count()}.')

In [None]:
print('Checking nulls at each column of df_clean')
dict_nulls_clean = {col: df_clean.filter(df_clean[col].isNull()).count() for col in df_clean.columns}
dict_nulls_clean

In [None]:
print(f'''df_clean - number of rows after dropna(how='any') would be {df_clean.dropna(how='any').count()}.''')

---
# 2.

In [None]:
# Correlations

# Checking correlations among some columns - numeric types but no nulls

# The columns at stake
cols_non_numeric = [field.name for field in df_clean.schema.fields if isinstance(
    field.dataType, T.TimestampType) or isinstance(field.dataType, T.StringType)]
cols_numeric = [col for col in df_clean.columns if col not in cols_non_numeric]

cols_corr = cols_numeric
# Correlation expects a single vector column, so first assemble
# the numeric columns into one vector per row
vector_col = 'corr_features'
assembler = VectorAssembler(inputCols=cols_corr, outputCol=vector_col)
df_vector = assembler.transform(df_clean).select(vector_col)
# Get correlation matrix - it can be Pearson’s (default) or Spearman’s correlation
corr = Correlation.corr(df_vector, vector_col)
corr_matrix = corr.collect()[0][0].toArray().tolist()

corr.show(truncate=False)
corr_matrix

In [None]:
# Plot computed correlations
# See colour scales in https://plotly.com/python/builtin-colorscales/
print(f'Computed correlations among {cols_corr}:')
fig = px.imshow(corr_matrix, title='Correlations',
                x = cols_corr, y = cols_corr,
                color_continuous_scale='Sunsetdark',  # Sunsetdark, RdBu_r
                text_auto=False)
fig.show()
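As a reminder of what each cell of the matrix holds, here is a minimal plain-Python sketch of the Pearson coefficient for two columns. The function name `pearson` and the sample values are made up for illustration; Spark computes the same quantity for every pair of columns in a distributed way.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values standing in for two numeric columns
hours = [1, 2, 3]
amounts = [2.0, 4.0, 6.0]
print(pearson(hours, amounts))
```

A value near 1 or -1 between two features suggests that one of them adds little information, which is worth keeping in mind when choosing features later.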

---
# 3.

**Feature engineering**

- Creating a new column Fraud to be used as label/target column by the algorithm
- Defining the features to be used in the creation of the model
- Assembling a vector with the features to be used by the algorithm, with the help of:

    StringIndexer(), OneHotEncoder() and VectorAssembler()

    See Chapter 10 of the book "Learning Spark - Lightning-Fast Data Analytics" for details
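To make the two encoding steps concrete, the plain-Python sketch below mimics the default behaviour of `StringIndexer` (the most frequent label gets index 0) and of `OneHotEncoder` (with `dropLast=True`, the last index becomes the all-zero vector). The helper names and the sample values are hypothetical, and Spark stores the result as sparse vectors rather than lists.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot vector; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# Made-up values standing in for a categorical column
use_chip = ['Swipe', 'Chip', 'Swipe', 'Online', 'Swipe', 'Chip']
mapping = fit_string_index(use_chip)
encoded = [one_hot(mapping[v], len(mapping)) for v in use_chip]
print(mapping)
print(encoded)
```

Dropping the last category avoids a redundant column: if all the other positions are zero, the row must belong to the remaining category.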



In [None]:
# Set new column Fraud, to be used as label/target:
#   1 if value in column Is Fraud? is Yes, 0 otherwise
df_clean = ( df_clean
            .withColumn("Fraud", F.when(F.col("Is Fraud?") == "Yes", 1).otherwise(0))
)

In [None]:
# Recall columns at stake
print(f'Non-numeric columns: {cols_non_numeric}')
print(f'Numeric columns: {cols_numeric}')

Non-numeric columns: ['Time', 'Use Chip', 'Merchant City', 'Is Fraud?']
Numeric columns: ['User', 'Card', 'Year', 'Month', 'Day', 'Merchant Name', 'MCC', 'NumericAmount', 'Hour', 'Min']


In [None]:
# Defining features to be used in the creation of the model

# First, set which columns are not to be used as features.

# As a starting point, we exclude only those that clearly
# make no sense as features, without any correlation analysis.
# That analysis must be done in a later round of model tuning.
# For the time being we exclude:
#   (i) Time, because it is already covered by Hour and Min, and
#   (ii) the target, in both its original and its numeric form

cols_not_features = ['Time', 'Is Fraud?', 'Fraud']

# Then, set columns to be used by StringIndexer() and OneHotEncoder()

categorical_cols = [i for i in cols_non_numeric if i not in cols_not_features]
non_categorical_cols = [i for i in cols_numeric if i not in cols_not_features]
index_output_cols = [x + ' Index' for x in categorical_cols]
ohe_output_cols = [x + ' OHE' for x in categorical_cols]

In [None]:
# Assembling a vector with the features to be used by the algorithm,
# with the help of StringIndexer(), OneHotEncoder() and VectorAssembler()
string_indexer = StringIndexer(inputCols=categorical_cols, outputCols=index_output_cols, handleInvalid="skip")
ohe_encoder = OneHotEncoder(inputCols=index_output_cols, outputCols=ohe_output_cols)

# Put all input features into a single vector, by using a transformer
assembler_inputs = ohe_output_cols + non_categorical_cols
vec_assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")

print(f'Input features to be used (OHE were categorical):\n {assembler_inputs}')

Input features to be used (OHE were categorical):
 ['Use Chip OHE', 'Merchant City OHE', 'User', 'Card', 'Year', 'Month', 'Day', 'Merchant Name', 'MCC', 'NumericAmount', 'Hour', 'Min']


---
# 4.

Select and train the model
- Train/validation split: creation of two dataframes for training and validation respectively, with a split size of 70/30 (%)
- Free memory space of the no longer needed initial dataframe
- Set the Linear SVC algorithm as the classifier estimator
- Set up an ML pipeline configuration, holding the sequence of the four stages previously set:
    1. String indexer
    2. OHE encoder
    3. Vector assembler
    4. ML estimator (SVM)
- Create the model by fitting the pipeline to the training data

In [None]:
# Train/validation split
# Two dataframes for training and validation respectively, with a split size of 70/30 (%)

df_train, df_validation = df_clean.randomSplit([0.7, 0.3], 42)
# Caching data ... just the training part as it is accessed many times by the algorithm
# But, it might not be a good idea if we are using a local computer and large dataset!
# df_train.cache()
print(f'There are {df_train.count()} rows in the training set and {df_validation.count()} rows in the validation set.')
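The split sizes are only approximately 70/30 because the split is decided row by row with a random draw. The plain-Python sketch below illustrates that idea under simplifying assumptions; `random_split` is a hypothetical helper and not the Spark implementation, which samples per partition.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split according to a seeded random draw."""
    total = sum(weights)
    # Cumulative upper bounds of each split, e.g. [0.7, 1.0]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also absorbs any floating-point rounding gap
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, validation = random_split(list(range(1000)), [0.7, 0.3], seed=42)
print(len(train), len(validation))
```

With the same seed and the same input order the result is repeatable, but the counts land near 700/300 rather than exactly on them.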

In [None]:
# Save the train/validation sets as parquet files 
# Recall that, because the split is a random sampling, there is no guarantee of
# getting the same data split when running the code on a different computer or at a different time.
# And we may want to reproduce or share the experiments.

df_train.write.mode('overwrite').parquet(
df_validation.write.mode('overwrite').parquet(

In [87]:
# As we already got the data split, delete df_clean to free memory space
del df_clean

In [88]:
# Linear SVC algorithm
# default: featuresCol='features', labelCol='label', predictionCol='prediction'
lsvc = LinearSVC(maxIter=10, regParam=0.1, labelCol='Fraud')

In [None]:
# Set up an ML pipeline configuration, holding the sequence of the four stages previously set:
# 1. string_indexer
# 2. ohe_encoder
# 3. vec_assembler (related to assembling features into vector)
# 4. lsvc (related to ML estimator)

pipeline = Pipeline(stages=[string_indexer, ohe_encoder, vec_assembler, lsvc])


In [None]:
# Save the pipeline for further use, should it be required
pipeline.save('pipeline-LinearSVM')

In [None]:
# Create the model by fitting the pipeline to the training data
# Notice that the model will be a transformer
#
# Note: if your computer struggles to run the training, use
# a lower number of rows for model training (option B)

# A
# model = pipeline.fit(df_train)
# B
limit_rows = 100000
model = pipeline.fit(df_train.limit(limit_rows))

In [None]:
# Save the model for further use, should it be required.
model.save('model-LinearSVM')

---
# 5.
Evaluate the model 

- Make predictions by applying the validation data to the model transformer
- With the predictions made:
    - Print out the schema of the resulting DataFrame and show the columns:
      features, rawPrediction, prediction, Fraud
    - Compute the evaluation metric *areaUnderROC* using *BinaryClassificationEvaluator*
    - Compute the confusion matrix
    - Based on the confusion matrix, compute the evaluation metrics:
      *accuracy*, *precision*, *recall*, *specificity* and *F1 score*
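
For reference, these are the standard definitions of the metrics in terms of the confusion matrix counts, taking fraud (Fraud = 1) as the positive class. Each ratio is undefined when its denominator is zero, which has to be handled in the code.

```latex
\begin{aligned}
\text{accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{precision}   &= \frac{TP}{TP + FP} \\
\text{recall}      &= \frac{TP}{TP + FN} \\
\text{specificity} &= \frac{TN}{TN + FP} \\
F_1                &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{aligned}
```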


In [None]:
# Make predictions by applying the validation data to the transformer
df_predictions = model.transform(df_validation)

# Check its schema
df_predictions.printSchema()

root
 |-- User: integer (nullable = true)
 |-- Card: integer (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- Day: integer (nullable = true)
 |-- Time: timestamp (nullable = true)
 |-- Use Chip: string (nullable = true)
 |-- Merchant Name: long (nullable = true)
 |-- Merchant City: string (nullable = true)
 |-- MCC: integer (nullable = true)
 |-- Is Fraud?: string (nullable = true)
 |-- NumericAmount: float (nullable = true)
 |-- Hour: integer (nullable = true)
 |-- Min: integer (nullable = true)
 |-- Fraud: integer (nullable = false)
 |-- Use Chip Index: double (nullable = false)
 |-- Merchant City Index: double (nullable = false)
 |-- Use Chip OHE: vector (nullable = true)
 |-- Merchant City OHE: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [None]:
# Compute the evaluation metrics
# - areaUnderROC using BinaryClassificationEvaluator
# - accuracy, precision, recall, specificity and F1 score are computed
#   further below, from the confusion matrix

# Using BinaryClassificationEvaluator
# Regardless of using default values or not, it is good practice to
# explicitly specify them, at least the important ones

# areaUnderROC summarises the trade-off between sensitivity (TP rate)
# and the FP rate (1 - specificity) across all decision thresholds

# Columns of interest: features, rawPrediction, prediction, Fraud
df_predictions_eval = df_predictions.select('features', 
                    'rawPrediction', 'prediction', 'Fraud')

binary_evaluator = BinaryClassificationEvaluator(
    rawPredictionCol='rawPrediction', labelCol='Fraud', metricName='areaUnderROC')
area_under_ROC = binary_evaluator.evaluate(df_predictions_eval)

# Print out result
print(f'Metric areaUnderROC = {area_under_ROC}')


Metric areaUnderROC = 0.9095048127423266
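
One way to read this number: the area under the ROC curve equals the probability that a randomly chosen fraudulent transaction gets a higher score than a randomly chosen legitimate one, with ties counting as one half. The plain-Python sketch below computes it that way on made-up scores; `auc_by_pairs` is a hypothetical helper and not how Spark computes the metric internally.

```python
def auc_by_pairs(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

# Made-up classifier scores and true labels (1 = fraud)
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(auc_by_pairs(scores, labels))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

The pairwise loop is quadratic, so it only suits small illustrations; on the full prediction set the evaluator should be used.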



In [None]:
# Count how many rows fall into each (prediction, Fraud) combination
df_confusion_matrix = df_predictions_eval.groupBy('prediction', 'Fraud').count()
df_confusion_matrix.show()

In [None]:
# Compute the confusion matrix
tp = df_confusion_matrix.filter((F.col('prediction')==1.0) & (F.col('Fraud')==1)).first()
tn = df_confusion_matrix.filter((F.col('prediction')==0.0) & (F.col('Fraud')==0)).first()
fp = df_confusion_matrix.filter((F.col('prediction')==1.0) & (F.col('Fraud')==0)).first()
fn = df_confusion_matrix.filter((F.col('prediction')==0.0) & (F.col('Fraud')==1)).first()

confmat = {'TP': 0.0, 'TN': 0.0, 'FP': 0.0, 'FN': 0.0}
if (tp):
    confmat['TP'] = tp['count'] * 1.0
if (tn):
    confmat['TN'] = tn['count'] * 1.0
if (fp):
    confmat['FP'] = fp['count'] * 1.0
if (fn):
    confmat['FN'] = fn['count'] * 1.0

confmat

In [None]:
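# Based on the confusion matrix, compute the evaluation metrics:
#   accuracy, precision, recall, specificity and F1 score

# PS: every ratio is guarded against division by 0.0
tp_, tn_, fp_, fn_ = confmat['TP'], confmat['TN'], confmat['FP'], confmat['FN']
total = tp_ + tn_ + fp_ + fn_
accuracy = (tp_ + tn_) / total if total else 0.0
precision = tp_ / (tp_ + fp_) if (tp_ + fp_) else 0.0
recall = tp_ / (tp_ + fn_) if (tp_ + fn_) else 0.0
specificity = tn_ / (tn_ + fp_) if (tn_ + fp_) else 0.0
f1score = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print('Evaluation metrics based on the confusion matrix:')
print(f' Accuracy = {accuracy}')
print(f' Precision = {precision}')
print(f' Recall = {recall}')
print(f' Specificity = {specificity}')
print(f' F1 score = {f1score}')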
