# INTRO
The goal of this exercise is to build a machine learning model capable of distinguishing fraudulent customers from others.

# DATASET
The dataset consists of both clear-text columns and anonymized columns and represents a monthly snapshot of customers.
Below is the explanation of the columns:

- *data_rif*: end-of-month reference date
- *userid*: customer ID
- *age*: age
- *profession*: profession
- *region*: region of residence
- *account_balance*: end-of-month account balance
- *num_trx_cd*: number of debit card transactions executed during the month
- *num_trx_cc*: number of credit card transactions executed during the month
- *num_trx_cp*: number of prepaid card transactions executed during the month
- *num_mov_conto*: number of current account movements during the month
- *sum_mov_conto_pos*: sum of incoming current account transaction amounts during the month
- *sum_mov_conto_neg*: sum of outgoing current account transaction amounts during the month
- *num_prodotti*: number of products owned by the customer
- *f2*, *f3*, *f4*, *f5*, *f6*, *f7*: anonymized behavioral features
- *TARGET*: target variable indicating whether the customer committed fraud in the following months


# Import

In [None]:
import os 
proxies = {
    'http': 'http://inet1.gtm.corp.sanpaoloimi.com:9090/',
    'https': 'http://inet1.gtm.corp.sanpaoloimi.com:9090/'
}


os.environ["http_proxy"] = proxies["http"]
os.environ["https_proxy"] = proxies["https"]


In [None]:
# ! python -m venv env
# ! pip install -r requirements.txt

In [2]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")


import matplotlib.pyplot as plt
import ppscore as pps
import seaborn as sns


from sklearn.preprocessing import LabelEncoder
from scipy.stats import chi2_contingency

import shap
from shap import maskers
from shap import TreeExplainer

from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer #needed for iterative imputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline


from category_encoders.cat_boost import CatBoostEncoder
from sklearn.model_selection import GroupKFold
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

from sklearn.metrics import roc_auc_score, recall_score, matthews_corrcoef
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, precision_recall_curve, ConfusionMatrixDisplay
import kds

# Data loading


In [5]:
DTYPES = {
    'data_rif': str,
    'userid': np.int64,
    'age':  np.int64,
    'profession': str,
    'region': str,
    'account_balance': np.float64,
    'num_trx_cd': np.float64,
    'num_trx_cc': np.float64,
    'num_trx_cp': np.float64,
    'num_mov_conto': np.int32,
    'sum_mov_conto_pos': np.int64,
    'sum_mov_conto_neg': np.int64,
    'num_prodotti': np.int64,
    'f2': np.int64,
    'f3': np.int64,
    'f4': np.int64,
    'f5': np.int64,
    'f6': np.float64,
    'f7': np.float64,
    'TARGET': np.int64
}

In [6]:
df = pd.read_csv("data/frauds_dataset.csv", sep="~", dtype=DTYPES)

In [7]:
df.head()

Unnamed: 0,data_rif,userid,age,profession,region,account_balance,num_trx_cd,num_trx_cc,num_trx_cp,num_mov_conto,sum_mov_conto_pos,sum_mov_conto_neg,num_prodotti,f2,f3,f4,f5,f6,f7,TARGET
0,2022-07-31,1000510,23,Lavoratore autonomo,TOSCANA,65627.799269,0.0,0.0,0.0,10,3590,-370,2,88,60,8,20,21.141686,0.268369,0
1,2022-07-31,1001511,55,Lavoratore dipendente,BASILICATA,39335.109963,7.0,0.0,0.0,0,0,0,5,97,63,11,82,38.169452,0.672864,1
2,2022-07-31,1001726,23,Lavoratore autonomo,PUGLIA,-37466.828926,148.0,0.0,0.0,2,636,-294,10,90,49,31,71,38.60238,0.126743,0
3,2022-07-31,1002418,43,Studente,VALLE AOSTA,13864.880197,215.0,0.0,0.0,8,1064,-1640,3,99,66,52,57,31.505413,2.081956,1
4,2022-07-31,1002646,26,Studente,LOMBARDIA,-32625.910843,38.0,56.0,6.0,0,0,0,1,115,56,44,28,36.882651,0.210746,0


In [8]:
df.shape

(24987, 20)

In [None]:
df.columns = [col.lower() for col in df.columns]

# EDA

In [None]:
df.target.value_counts(normalize=True)

In [None]:
# percentage of target 1/0 per month
df_grouped = df.groupby(['data_rif','target']).agg(count_tr=('userid','count')).reset_index()
tot = df.groupby(['data_rif']).agg(tot=('userid','count')).reset_index()
df_grouped['perc_per_month'] = df_grouped.merge(tot, how='inner', on='data_rif').apply(lambda x: x['count_tr']/x['tot']*100,axis=1)

The imbalance between the two classes is consistent across all months, which is useful for the GroupKFold strategy in cross-validation. The folds can be split by months (see below).


In [None]:
plt.figure(figsize=(10, 6))
ax = sns.barplot(data=df_grouped, x='data_rif', y='perc_per_month', hue='target')

for p in ax.patches:
    ax.annotate(f'{p.get_height():.2f}%', (p.get_x() + p.get_width() / 2., p.get_height()-0.5),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')


plt.title('% Target 1 vs 0 per reference month')
plt.xlabel('Date')
plt.ylabel('%')
plt.xticks(rotation=30)


ax.legend(loc='upper left', bbox_to_anchor=(1, 1))


plt.show()


In [None]:
# look at a few examples of target 1
df.groupby(['userid']).agg(distinct = ('target','nunique')).reset_index().sort_values(by='distinct', ascending=False)

There is a peculiarity in the data: for the same userid, the age recorded at an earlier date can be greater than the age at later dates (e.g., 23 vs. 22), and in some cases the values differ implausibly (e.g., 23 vs. 44 years old). I assume this is an artifact of the "synthetic" dataset, built to establish a relationship between fraud and age that is noticeable in the graphs below (i.e., customers with target 1 tend to be older).


In [None]:
df.loc[df['userid']==1623043]

In [None]:

df.loc[df['userid']==1160419]

From the analysis of the numerical features, we observe that there will be missing values to handle in these three transaction-related features: *num_trx_cd*, *num_trx_cc*, *num_trx_cp*. Apart from this, no anomalies are noticed in the data (e.g., negative values in features that must necessarily be positive, such as age, *num_prodotti*, etc.). However, we can see that *sum_mov_conto_pos* and *sum_mov_conto_neg* are almost mirror images in their numerical characteristics (mean, std, max, and min), which indeed suggests a rather high correlation between these two variables.


In [None]:
df.drop(['userid','target'], axis=1).describe()

From the analysis of the categorical features, we observe that there will be missing values to handle in both *profession* and *region*. For further analysis, such as the distribution of the target across the various categories of these two features, please refer to the graphs below.


In [None]:
df.drop(['data_rif','userid','target'], axis=1).describe(include=['object'])

The count of missing values per column confirms what was noted above.

In [None]:
df.isna().sum()

# Plots

I remove *data_rif* and *userid*, since these features will not be used in training.

The analysis using KDE (normalized with respect to the number of values in each class) and countplot (percentage per values in the group) provides particularly interesting insights:

* The most significant variables seem to be:
    * *age*: the target variable = 1 appears to be associated with higher customer age values.
    * *num_prodotti*: the same type of relationship as above.
    * *fX*: all anonymized "fX" variables have different distributions between the two classes. For example, *f2* has a peak around 100 for target 1, which does not appear for target 0, and *f6* has a peak around 30. The one with the most overlap between classes is *f7*, which will be further investigated later with correlation analysis.

* The distributions of *account_balance* are practically overlapping, so by itself, it does not provide much information for classification.
* The number of transactions (*cc/cd/cp*) also has a similar distribution between the two classes; however, it's interesting to note that they are skewed to the right. This suggests that using the median instead of the mean might be preferable for replacing missing values. The distribution does not seem to indicate the presence of outliers that significantly affect the results; it looks more like a natural spread of values (for example, looking at the scatter plot for *num_trx_cd*, there are a couple of customers with more than 400 transactions, but this is not too far from the most densely populated values).
* The two variables, *sum_mov_conto_pos* and *sum_mov_conto_neg*, indeed have mirrored distributions (as could be expected a priori, and also as seen in the previous `describe()`), so it might be possible to remove one before training the model, as one implicitly contains the information of the other.

The numerical range of values is very different, but below, tree-based models will be used, so no preprocessing of the numerical values will be done since these models are robust in this regard.

* Regarding the categorical variables, we see that there do not seem to be significant variations in the 1/0 distribution with respect to *profession* (slightly higher percentage of 1 in the "Unemployed" class, but not very relevant), with slightly larger imbalances between regions. We decide to keep these types of variables after applying encoding, allowing the model to determine whether they are useful.
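The point above about right-skewed transaction counts can be illustrated with a small sketch. The values below are made up for illustration (they are not taken from the dataset); they only show how a few large counts pull the mean away from the typical customer while the median stays put.

```python
import statistics

# Hypothetical right-skewed transaction counts: most customers make few
# transactions, a handful make very many.
num_trx = [0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 8, 150, 400]

mean_value = statistics.mean(num_trx)
median_value = statistics.median(num_trx)

# The mean is dragged up by the two large values, the median is not.
print(f"mean   = {mean_value:.1f}")
print(f"median = {median_value}")
```

With data shaped like this, filling missing values with the mean would assign an untypically high count to most customers, which is the reason the median is the safer default.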


In [None]:
for col in df.drop(['userid','data_rif'],axis=1).select_dtypes(exclude=['object']):
    if col == 'target':
        continue
    plt.figure(figsize=(8,6))
    sns.kdeplot(x=df[col], hue=df.target, common_norm=False)
    plt.title(f'{col} kde')
    
for col in ['region','profession']:
    plt.figure(figsize=(10,8))
    count_percentage = df.groupby(col)['target'].value_counts(normalize=True).mul(100).rename('percent').reset_index()

    # Build the count plot in percentages
    ax = sns.barplot(x=col, y='percent', hue='target', data=count_percentage)
    
    # Add the percentage labels on top of the bars
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x() + p.get_width() / 2., height + 0.5, f'{height:.1f}%', ha='center')
    
    plt.xticks(rotation=90)
    plt.title(f'{col} countplot in %')
    plt.ylabel('Percentage')
    plt.show()
    
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df.data_rif, y=df['num_trx_cd'])
plt.title('Scatter Plot of the Number of Debit Card Transactions')
plt.xlabel('data rif')
plt.ylabel('Number of Transactions')
plt.show()

In addition to univariate analysis, we will try to find relationships between variables that may assist the model: pairplot and correlation.

* From the pairplot, the variables *num_prodotti* and *f4* stand out, confirming that higher values are associated with target 1. Additionally, there are relationships between *account_balance* and *f7*, which outline a certain range where frauds can be found (whereas previously *account_balance* alone seemed very overlapping between target 1 and 0).


In [None]:
sns.pairplot(df.drop(['userid','data_rif','sum_mov_conto_neg'],axis=1), hue='target')
plt.show()

Below is an analysis of correlation and PPS (Predictive Power Score) between features and the target:

* From the PPS analysis, the most useful features for prediction are *num_prodotti*, *age*, *f6*, *f2*, *f4*, *f5*, and, with less predictive power, *f7* (as already noted from the KDE).

* From the linear correlation analysis, the most useful features for prediction are *num_prodotti*, *age*, *f5*, *f6*, *f7*, *f4*, with less importance given to *f2* and *f3*.

The results are generally consistent, but the different "degrees" of correlation/PPS might be due to the type of relationship between the feature and the target. Correlation identifies linear relationships, while PPS is based on decision trees.

There are also high correlations (as expected) between *sum_mov_conto_pos* and *sum_mov_conto_neg* (negative correlation), and between these and *num_mov_conto*. In this case, it might be worth considering removing one of *sum_mov_conto_pos*/*sum_mov_conto_neg* and adding a feature that outlines the ratio between *num_mov_conto* and *sum_mov_conto_pos*, for example, `value_mov_medio = sum_mov_conto_pos / num_mov_conto`. I will try this later in the feature engineering section.
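The remark above that correlation only captures linear relationships can be checked with a toy example. This is a minimal sketch on synthetic data (the arrays `x` and `y` are assumptions made for illustration, unrelated to the dataset): `y` is fully determined by `x`, yet the Pearson coefficient is essentially zero.

```python
import numpy as np

# Synthetic data: x is symmetric around zero, y depends on x deterministically
# but not linearly.
x = np.linspace(-3, 3, 601)
y = x ** 2

# Pearson correlation between x and y: close to zero despite full dependence.
pearson_xy = np.corrcoef(x, y)[0, 1]

# After a monotonic-friendly transformation (|x|), the linear signal reappears.
pearson_abs = np.corrcoef(np.abs(x), y)[0, 1]

print(f"corr(x, y)   = {pearson_xy:.3f}")
print(f"corr(|x|, y) = {pearson_abs:.3f}")
```

A tree-based score such as PPS can pick up this kind of dependence, which is one plausible reason why the two rankings differ in degree.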


In [None]:

# the target must be categorical for ppscore
dfcat = df.drop(['userid','data_rif'],axis=1).copy()
dfcat['target'] =dfcat['target'].astype('category')
plt.figure(figsize=(8,6))
predictors_df = pps.predictors(dfcat, y="target")
sns.barplot(data=predictors_df, x='x', y="ppscore")
plt.xticks(rotation=90);


In [None]:
N=10
top_correlated_features = predictors_df.sort_values(by='ppscore',ascending=False).head(N)['x']
print("Top Correlated Features with Target using ppscore:")
print(top_correlated_features)

In [None]:
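# correlation analysis (numeric columns only; the target is cast back to int
# because dfcat stores it as a category and the string columns cannot be correlated)
num_df = dfcat.select_dtypes(include='number').assign(target=dfcat['target'].astype(int))
correlation_matrix = num_df.corrwith(num_df['target'])
correlation_df = pd.DataFrame({'Correlation': correlation_matrix})

sorteddf = correlation_df.abs().sort_values(by='Correlation', ascending=False)
# print the N most correlated features
N=10
top_correlated_features = sorteddf.head(N)
print("Top Correlated Features with Target:")
print(top_correlated_features)


plt.figure(figsize=(10, 8))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap of Top Features with Target')
plt.show()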


For the two categorical variables, we use Cramér's V. The statistic ranges from 0 to 1: a value close to 1 indicates a strong association between *profession*/*region* and the target, while a value close to 0 indicates little or no association. For both columns, we do not find a significant relationship.
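For reference, Cramér's V is derived from the chi-squared statistic of the contingency table as V = sqrt(chi2 / (n * (min(rows, cols) - 1))), where n is the number of observations in the table. The sketch below computes it by hand with NumPy on made-up tables; `cramers_v` is a hypothetical helper, and unlike `chi2_contingency` with its default settings it applies no Yates continuity correction on 2x2 tables.

```python
import numpy as np

def cramers_v(table):
    """Cramér's V for a contingency table of counts (no continuity correction)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence: outer product of the margins over n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Made-up tables: rows are categories, columns are target 0 / target 1.
perfect_association = [[10, 0], [0, 10]]
no_association = [[10, 10], [20, 20]]

v_perfect = cramers_v(perfect_association)
v_none = cramers_v(no_association)
print(v_perfect, v_none)
```

The two extreme tables bracket the range of the statistic, which helps when judging how small the values found for *profession* and *region* really are.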

In [None]:
dfcopy = dfcat[['profession','region','target']].copy()
label_encoder = LabelEncoder()

dfcopy['profession_encoded'] = label_encoder.fit_transform(dfcopy['profession'])
dfcopy['region_encoded'] = label_encoder.fit_transform(dfcopy['region'])

for col in ['profession','region']:
    contingency_table = pd.crosstab(dfcat[col], dfcat['target'])
    chi2_stat, _, _, _ = chi2_contingency(contingency_table)
    # crosstab drops rows with a missing category, so n is the table total,
    # not the number of rows in dfcat
    n_obs = contingency_table.to_numpy().sum()
    num_rows, num_cols = contingency_table.shape
    cramer_v = np.sqrt(chi2_stat / (n_obs * (min(num_rows, num_cols) - 1)))
    print(f"Cramer's V {col}:", cramer_v)


# Feature Engineering

### Missing

The simplest method here would be to replace missing numerical values with the median, which is what I would use on larger datasets. However, since there are only about 25k records, I will try the IterativeImputer, which models each column with missing values as a function of the other columns passed to it.

For categorical variables, you would need to encode the values first and then use IterativeImputer. For simplicity, I will use a SimpleImputer with the most frequent value.


In [None]:
df.isna().sum()

In [None]:



X = df.drop(['data_rif','userid'],axis=1).copy()  

numeric_cols = ['num_trx_cd', 'num_trx_cc', 'num_trx_cp']
categorical_cols = ['profession', 'region']


numeric_imputer = IterativeImputer(estimator=RandomForestRegressor(), max_iter=10, random_state=0)
categorical_imputer = SimpleImputer(strategy='most_frequent')  


preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_imputer, numeric_cols),
        ('categorical', categorical_imputer, categorical_cols)
    ])

model = Pipeline(steps=[('preprocessor', preprocessor)])

X_imputed = model.fit_transform(X)


In [None]:
print(f'shape pre imputation: {df.shape}')

In [None]:
df = pd.concat([df.drop(numeric_cols+categorical_cols, axis=1), pd.DataFrame(X_imputed, columns=numeric_cols+categorical_cols)],axis=1)

In [None]:
print(f'shape after imputation: {df.shape}')

In [None]:
df.isna().sum()

In [None]:
df[['num_trx_cd','num_trx_cc','num_trx_cp']]=df[['num_trx_cd','num_trx_cc','num_trx_cp']].astype('int')

In [None]:
df[['num_trx_cd','num_trx_cc','num_trx_cp']].describe() # compared with the describe() above, the statistics have not changed much

### Creating New Features

I will create a variable that shows the difference between positive and negative account movements, and another that gives the average value of the transactions made.


In [None]:
# net account movement: sum_mov_conto_neg is already negative, so the sum is the difference
df['diff_conto'] = df['sum_mov_conto_pos'] + df['sum_mov_conto_neg']
# average net amount per movement; 0 when the customer has no movements
df['var_per_tr'] = (df['diff_conto'] / df['num_mov_conto'].where(df['num_mov_conto'] > 0)).fillna(0)

In [None]:
df.iloc[:,9:].head()

In [None]:
df = df.drop(['sum_mov_conto_pos','sum_mov_conto_neg','num_mov_conto'], axis=1)

In [None]:
df.shape

# Model Training

I will try two tree-based models: a Random Forest to establish a baseline and an XGBoost to see how much improvement can be achieved with a more complex model. 

Target 1 is the class of primary interest, and I assume that catching frauds matters more than correctly identifying non-frauds. Therefore, we will emphasize recall, the metric that penalizes false negatives, over metrics driven by false positives. We will also look at the AUC score and the Matthews correlation coefficient (MCC), which evaluates predictions comprehensively (it uses true positives, true negatives, false positives, and false negatives). The ROC AUC can be optimistic for highly imbalanced problems because it is built on the false positive rate, FP / (FP + TN): if TNs are very high in number, the rate stays close to zero even when FP is high compared to TP.
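To make the argument about the false positive rate concrete, here is a minimal sketch that derives the metrics from the four cells of a confusion matrix. The counts are invented for illustration and `confusion_metrics` is a hypothetical helper, not part of the notebook.

```python
import math

def confusion_metrics(tp, fp, fn, tn):
    """Recall, precision, false positive rate and MCC from confusion counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {"recall": recall, "precision": precision, "fpr": fpr, "mcc": mcc}

# Invented, highly imbalanced example: 60 frauds among 10,000 customers.
metrics = confusion_metrics(tp=50, fp=200, fn=10, tn=9740)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.3f}")
```

In this example the false positive rate is about 2%, which looks excellent, while only one alert in five is an actual fraud. MCC reflects that imbalance, which is why it is tracked alongside AUC and recall.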

To create the training set and test set, I will split based on the reference date. This mimics a real-world scenario where predictions are made for the most recent month. I will remove *userid* which will not be given to the model, but keep *data_rif* to use in GroupKFold. I choose this type of cross-validation because it allows me to use one month at a time as the validation set, ensuring the model does not overfit and maintains consistent performance for each prediction month.

Note: If the dataset had more months, say a year, I would prefer a TimeSeries cross-validation, training the model on the first 6 months with the 7th month as the validation set, then from months 1 to 7 with the 8th month as validation, and so on up to month 11, using the 12th month as the test set.
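The expanding-window scheme described in the note can be sketched in a few lines. This is only an illustration under assumptions: `expanding_window_splits` is a hypothetical helper, and the twelve month labels are invented, since the actual dataset covers fewer months.

```python
def expanding_window_splits(months, min_train=6):
    """Return (train_months, validation_month) pairs plus the held-out test month.

    The last month is reserved for testing; each split trains on all months
    before the validation month, starting with `min_train` months.
    """
    months = sorted(months)
    dev, test_month = months[:-1], months[-1]
    splits = [(dev[:i], dev[i]) for i in range(min_train, len(dev))]
    return splits, test_month

# Invented month labels for a one-year dataset.
months = [f"2022-{m:02d}" for m in range(1, 13)]
splits, test_month = expanding_window_splits(months)

for train_months, val_month in splits:
    print(f"train {train_months[0]}..{train_months[-1]} -> validate {val_month}")
print("test:", test_month)
```

Unlike GroupKFold, every validation month here lies strictly after its training months, so the model is never evaluated on data older than what it was trained on.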


In [None]:
train = df.loc[df.data_rif<df.data_rif.max()].drop('userid',axis=1).copy()
test = df.loc[df.data_rif==df.data_rif.max()].drop('userid',axis=1).copy()

In [None]:
train.shape

In [None]:
test.shape

In [None]:
train.data_rif.value_counts()

In [None]:
test.data_rif.value_counts()

In [None]:
X_train , y_train = train.drop('target', axis=1).copy(), train['target'].copy()
X_test , y_test = test.drop('target',axis=1).copy(), test['target'].copy()

# build the list of columns used for prediction
cols = list(X_train.columns)
cols.remove('data_rif')
cols

Compute *scale_pos_weight* (ratio of negative to positive examples), which may be useful for handling the class imbalance.

In [None]:
num_positive_examples = np.sum(y_train == 1)
num_negative_examples = np.sum(y_train == 0)

scale_pos_weight = num_negative_examples / num_positive_examples
scale_pos_weight

## Cross validation - RandomForest

Categorical features are handled with the CatBoost encoder inside the cross-validation loop to avoid target leakage.
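The reason for fitting the encoder inside each fold can be shown with a simplified stand-in. The sketch below uses a plain smoothed mean-target encoding in pandas, not CatBoost's ordered encoding; the helper names, the `smoothing` parameter and the toy data are all assumptions made for illustration. The key point is that the statistics come from training rows only and are then applied to the validation rows.

```python
import pandas as pd

def fit_target_means(categories, target, smoothing=10.0):
    """Learn a smoothed mean of the target per category from training rows only."""
    prior = target.mean()
    stats = target.groupby(categories).agg(["sum", "count"])
    encoding = (stats["sum"] + prior * smoothing) / (stats["count"] + smoothing)
    return encoding, prior

def apply_target_means(categories, encoding, prior):
    """Encode categories; unseen categories fall back to the global prior."""
    return categories.map(encoding).fillna(prior)

# Toy training and validation folds.
train_fold = pd.DataFrame({"region": ["A", "A", "A", "B", "B", "C"],
                           "target": [1, 1, 0, 0, 0, 1]})
valid_fold = pd.DataFrame({"region": ["A", "B", "D"]})

encoding, prior = fit_target_means(train_fold["region"], train_fold["target"], smoothing=1.0)
valid_encoded = apply_target_means(valid_fold["region"], encoding, prior)
print(valid_encoded.tolist())
```

If the encoder were fitted on all rows before splitting, the validation targets would leak into the encoded values and the cross-validation scores would be too optimistic.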

In [None]:
ce = CatBoostEncoder(drop_invariant=False,
    return_df=True,
    handle_unknown='UNKNOWN',
    handle_missing='MISSING',
    random_state=42)


In [None]:
proba = np.zeros(len(X_train))
preds = np.zeros(len(X_train))

skf = GroupKFold(n_splits=5)

for i, (idxT, idxV) in enumerate(skf.split(X_train, y_train, groups=X_train['data_rif'])):
    
    month = X_train.iloc[idxV]['data_rif'].iloc[0]
    print('Fold', i, 'validation date', month)
    print('Rows of train =', len(idxT), 'Rows of holdout =', len(idxV))
    
    # handle categorical features
    ce.fit(X_train[cols].iloc[idxT], y_train.iloc[idxT])
    cvtrain =  ce.transform(X_train[cols].iloc[idxT])
    cvval = ce.transform(X_train[cols].iloc[idxV])
    


    rf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42)
    rf.fit(cvtrain, y_train.iloc[idxT])
    
    
    val_preds_proba = rf.predict_proba(cvval)[:, 1]
    val_preds = rf.predict(cvval)
    
    
    proba[idxV] += val_preds_proba
    preds[idxV] += val_preds
    
    # per-fold AUC on the validation set
    aucscore = roc_auc_score(y_train.iloc[idxV], val_preds_proba)   
    print('validation-auc:', aucscore)

    print('-' * 30)

# overall metrics
print('#' * 20)
print('RF AUC=', roc_auc_score(y_train, proba))
print('RF RECALL CV=', recall_score(y_train, preds))
print('RF MCC CV=', matthews_corrcoef(y_train, preds))


In [None]:
print("Classification Report CV:")
print(classification_report(y_train, preds))


cm = confusion_matrix(y_train, preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=['0','1'])

disp.plot()
plt.tight_layout()
plt.show()
plt.close()


## Cross validation - XGBoost

In [None]:
proba = np.zeros(len(X_train))
preds = np.zeros(len(X_train))

skf = GroupKFold(n_splits=5)

for i, (idxT, idxV) in enumerate(skf.split(X_train, y_train, groups=X_train['data_rif']) ):
    month = X_train.iloc[idxV]['data_rif'].iloc[0]
    print('Fold',i,'validation date',month)
    print(' rows of train =',len(idxT),'rows of holdout =',len(idxV))
    
    # handle categorical features
    ce.fit(X_train[cols].iloc[idxT], y_train.iloc[idxT])
    cvtrain =  ce.transform(X_train[cols].iloc[idxT])
    cvval = ce.transform(X_train[cols].iloc[idxV])
    
    modelxgboost = xgb.XGBClassifier(n_estimators=100, max_depth=6,
                            early_stopping_rounds=10, eval_metric='auc', random_state=42)

    modelxgboost.fit(cvtrain, y_train.iloc[idxT], 
            eval_set=[(cvval, y_train.iloc[idxV])],
            verbose=1000)
    
    proba[idxV] += modelxgboost.predict_proba(cvval)[:,1]
    preds[idxV] += modelxgboost.predict(cvval)

print('#'*20)
print ('XGB NO SCALE POS WEIGHT AUC=',roc_auc_score(y_train, proba))
print ('XGB NO SCALE POS WEIGHT RECALL CV=',recall_score(y_train, preds))
print ('XGB NO SCALE POS WEIGHT MCC CV=',matthews_corrcoef(y_train, preds))

In [None]:
print("Classification Report CV:")
print(classification_report(y_train, preds))


cm = confusion_matrix(y_train, preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=['0','1'])

disp.plot()
plt.tight_layout()
plt.show()
plt.close()



In [None]:
proba = np.zeros(len(X_train))
preds = np.zeros(len(X_train))

skf = GroupKFold(n_splits=5)

for i, (idxT, idxV) in enumerate(skf.split(X_train, y_train, groups=X_train['data_rif']) ):
    month = X_train.iloc[idxV]['data_rif'].iloc[0]
    print('Fold',i,'validation date',month)
    print(' rows of train =',len(idxT),'rows of holdout =',len(idxV))
    
    
    # handle categorical features
    ce.fit(X_train[cols].iloc[idxT], y_train.iloc[idxT])
    cvtrain =  ce.transform(X_train[cols].iloc[idxT])
    cvval = ce.transform(X_train[cols].iloc[idxV])
    
    modelxgboost_spw = xgb.XGBClassifier(n_estimators=100, max_depth=6, early_stopping_rounds=10, eval_metric='auc',
                            random_state=42, scale_pos_weight=scale_pos_weight)
    
 
    modelxgboost_spw.fit(cvtrain, y_train.iloc[idxT], 
            eval_set=[(cvval,y_train.iloc[idxV])],
            verbose=1000)
    
    proba[idxV] += modelxgboost_spw.predict_proba(cvval)[:,1]
    preds[idxV] += modelxgboost_spw.predict(cvval)

print('#'*20)
print ('XGB SCALE POS WEIGHT AUC=',roc_auc_score(y_train, proba))
print ('XGB SCALE POS WEIGHT RECALL CV=',recall_score(y_train, preds))
print ('XGB SCALE POS WEIGHT MCC CV=',matthews_corrcoef(y_train, preds))

In [None]:
print("Classification Report CV:")
print(classification_report(y_train, preds))


cm = confusion_matrix(y_train, preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=['0','1'])

disp.plot()
plt.tight_layout()
plt.show()
plt.close()



From the cross-validation, it is clear that XGBoost performs better than Random Forest, so it is worth using. Additionally, the *scale_pos_weight* parameter improves recall performance without significantly worsening the overall performance (MCC from 90 to 88): this model is chosen for the final evaluation.

Note: The values set for *n_estimators* and *max_depth* were manually configured (the same for all three models) and were left as is after seeing that the performance was satisfactory. An additional hyperparameter tuning step could be added to further refine the metrics if necessary.


# Training on the full training set and evaluation on the test set

In [None]:
xtrain = X_train[cols].copy()
xtest = X_test[cols].copy()
ytrain = y_train.copy()
ytest =y_test.copy()



print(len(xtrain),len(ytrain))
print(len(xtest), len(ytest))

# fit the encoder on the whole training set now
ce.fit(xtrain, ytrain)
xtrain =  ce.transform(xtrain)
xtest = ce.transform(xtest)


In [None]:
clf = xgb.XGBClassifier(n_estimators=100, max_depth=6, eval_metric='auc',
                        scale_pos_weight=scale_pos_weight,  random_state=42)
clf.fit(xtrain, ytrain)



# Performance evaluation

In [None]:

print(f'TEST SET EVAL')


y_pred = clf.predict(xtest)
y_proba = clf.predict_proba(xtest)[:,1]


print("Classification Report:")
print(classification_report(ytest, y_pred))


cm = confusion_matrix(ytest, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=['0','1'])

disp.plot()
plt.tight_layout()
plt.show()
plt.close()

roc_auc = roc_auc_score(ytest, y_proba)
print(f"ROC AUC: {roc_auc}")

mcc = matthews_corrcoef(ytest, y_pred)
print(f"MCC: {mcc}")


fpr, tpr, thresholds = roc_curve(ytest, y_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()


precision, recall, thresholds_pr = precision_recall_curve(ytest, y_proba)

pr_auc = auc(recall, precision)
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, label=f'PR AUC = {pr_auc:.2f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='upper right')
plt.show()


plt.close()

display(kds.metrics.decile_table(ytest, y_proba))

kds.metrics.plot_cumulative_gain(ytest, y_proba)
plt.show()
plt.close()
kds.metrics.plot_lift(ytest, y_proba)
plt.show()
plt.close()
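The cumulative gain and lift shown by the plots above can be approximated by hand. The sketch below is an assumption-laden illustration: `cumulative_gain` is a hypothetical helper that splits the score-sorted rows into equal-sized bins, which may not match the exact binning used by `kds`, and the labels and scores are synthetic.

```python
import numpy as np

def cumulative_gain(y_true, y_score, n_bins=10):
    """Share of all positives captured within the top 1..n_bins score bins."""
    y_true = np.asarray(y_true)
    # Sort rows from highest to lowest predicted score.
    order = np.argsort(-np.asarray(y_score), kind="stable")
    bins = np.array_split(y_true[order], n_bins)
    captured = np.cumsum([b.sum() for b in bins])
    return captured / y_true.sum()

# Synthetic example: 20 rows, the 4 positives receive the 4 highest scores.
y_true_demo = np.array([1, 1, 1, 1] + [0] * 16)
y_score_demo = np.linspace(1.0, 0.0, 20)

gain = cumulative_gain(y_true_demo, y_score_demo)
# Lift at bin k compares the gain with the share of rows contacted so far.
lift = gain / (np.arange(1, 11) / 10)
print(gain[:3], lift[:3])
```

A gain close to 1 after the first couple of bins means that reviewing only the highest-scored customers would already surface nearly all frauds.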

**Final Considerations**

1. The performance seems to remain strong on the test set, comparable to what was found during cross-validation.
2. The recall on the test set is very high, and at the same time, precision is not significantly affected. This appears to be an acceptable trade-off (the PR curve seems to confirm this). If 38 false negatives are deemed too many, one could increase the *scale_pos_weight* or conduct a study on the cutoff threshold.
3. The lift is very high, as is the cumulative gain. In practice, the first two deciles of predictions capture almost 100% of the frauds.
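The cutoff study mentioned in point 2 could start from a simple threshold sweep. This is a minimal sketch on invented labels and scores (not the notebook's `ytest` and `y_proba`), with `recall_precision_at` as a hypothetical helper; it shows how lowering the threshold below the default 0.5 trades precision for recall.

```python
import numpy as np

def recall_precision_at(y_true, y_score, threshold):
    """Recall and precision when predicting 1 for scores >= threshold."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_score) >= threshold
    tp = int(np.sum(y_pred & (y_true == 1)))
    fp = int(np.sum(y_pred & (y_true == 0)))
    fn = int(np.sum(~y_pred & (y_true == 1)))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Invented labels and predicted probabilities.
y_true_demo = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_score_demo = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

for threshold in (0.5, 0.3):
    recall, precision = recall_precision_at(y_true_demo, y_score_demo, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

On real data the same sweep would be run on a validation set rather than on the test set, so that the chosen threshold is not tuned on the data used for the final evaluation.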


In [None]:
from xgboost import plot_importance

In [None]:
plot_importance(clf, importance_type='gain')

As expected, we find the most important variables that were identified during the data exploration phase. The newly created variables do not seem to have made a significant contribution, so it might be worth trying to train a model using only the original variables.

The SHAP plot indicates which features are most useful for the model's predictions and how these predictions are influenced by the input values of these variables. In this case, higher values of *num_prodotti* lead to higher predicted probabilities of fraud, similarly for *age*, *f7*, and *f6* (though the results for some of these variables are a bit more uncertain).


In [None]:
background = maskers.Independent(xtrain, 1000) 
exp = TreeExplainer(clf, background)
sv = exp.shap_values(xtest)
shap.summary_plot(sv, xtest)
