# Introduction
In this notebook we analyze a dataset and use it for classification: we train several models and select the best one.

The first step is to understand the main features of the dataset: whether it is unbalanced, and whether there are null or missing values to replace. We then adopt a strategy to transform the data into a format the classifier can use. Based on the characteristics of the data we can pick different classifiers and, at the end, evaluate their performance.

In the last part we'll show the results with some final considerations.

My goal is to obtain a model that returns a **high recall** with a good precision. In this case recall is more important because I want to minimize the number of False Negatives. A good metric to evaluate this is $F_2$, which weights recall twice as much as precision.

I want a model that can be used in a real-time fraud detection system, so prediction should be fast. Note that in this scenario, for me, a false negative has a higher cost than a false positive: in the first case a fraud is not detected, which costs the company money, reputation and probably customers, whereas in the second case the account is temporarily suspended and restored within a few hours or one or two days. Of course this last case is not optimal either, but it is certainly less costly.
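To make the effect of $F_2$ concrete, here is a minimal sketch of the general $F_\beta$ formula, $F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$, in plain Python. The function name `f_beta` and the two example models are hypothetical and only serve as an illustration; the notebook itself relies on scikit-learn's `fbeta_score`.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score from precision and recall (0.0 when both are zero)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Two hypothetical models with swapped precision and recall
high_recall = f_beta(precision=0.5, recall=1.0)
high_precision = f_beta(precision=1.0, recall=0.5)

# With beta=2 the high-recall model gets the better score
print(round(high_recall, 3), round(high_precision, 3))
```

With $\beta = 2$ the model that favours recall scores higher even though both models would have the same $F_1$.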

---

## Borderline SMOTE
Source: *H. Han, W.-Y. Wang, B.-H. Mao, "Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning", 2005.*

Borderline SMOTE has been shown (Source) to perform better than SMOTE and random over-sampling, which is why we use it in this comparison. Unlike other over-sampling methods, this variant does not over-sample all the minority examples or a random subset of them; instead, **only the borderline elements are over-sampled** (the ones that truly affect the classification).
The explanation of the method will be given in the report. In this case we keep the default settings, using 'borderline-1' rather than 'borderline-2', because the paper doesn't show big differences in TP rate and F-value between the two.

## Undersampling (Neighborhood cleaning rule)
Source: *J. Laurikkala, "Improving Identification of Difficult Small Classes by Balancing Class Distribution", 2001.*

*What is the problem of undersampling?* Since the samples from the majority classes are removed, this method can potentially **ignore useful information from those removed samples.** Therefore, several under-sampling approaches are proposed to selectively remove samples from the majority class so that the information could be largely retained in the training data set. 

*Why do we pick a certain technique for undersampling?* The choices were a neighbours-based undersampler or a Tomek-links-based technique; since we already studied neighbour-based techniques for classification in class, I decided to pick the former. Among the neighbour-based techniques, the neighbourhood cleaning rule (NCR, **NeighbourhoodCleaningRule** in imbalanced-learn) seems one of the best solutions. In fact, the NCR paper (Source) shows that it outperforms both random under-sampling and one-sided selection; moreover, it uses *EditedNearestNeighbours* to clean noisy data. According to the paper, it is well suited to difficult small classes, as in our case.

Since this undersampling method, by design, doesn't remove a lot of samples, we can classify the cleaned data with a cost-sensitive approach such as **Weighted Random Forest**, which works pretty well on unbalanced data. We can then compare the result with the same technique applied to the original dataset, to see whether the undersampling really removes noisy samples.

*Why is NCR designed to remove only a few samples?* We can say that this method is more oriented to *data cleaning* than to data reduction:
- Quality of classification doesn't depend solely on the size of the class, other characteristics as noise should be considered
- It is difficult to maintain the original classification accuracy while the data is being reduced


SMOTEENN (combined over- and under-sampling) is not used because it is very computationally demanding and time consuming.

In [1]:
#print("Distribution after oversampling: {}".format(Counter(smote_enn_y)))

## Classification methods

https://scikit-learn.org/stable/_static/ml_map.png

For balanced data (ROS, RUS, SMOTE, borderlineSMOTE, SMOTEENN) I avoided K-NN because it is used extensively in the sampling methods and this could maybe lead to a sort of overfitting (?). Instead I picked **SVM** and **Logistic Regression**, because they are very efficient and robust even with large datasets, and **Decision Trees**, because of their high sensitivity: I am curious to see how they will perform. I chose these three methods because they are well suited to numerical attributes.

#### Unbalanced: Weighted Random Forest 
Source:
- *M. Shahhosseini, G. Hu, "Improved Weighted Random Forest for Classification Problems", 2021.*

.....

Following the first reference we'll choose the right **weight** for the minority class optimizing the *area under the ROC curve (AUC).*
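One possible way to pick the minority-class weight is sketched below. It assumes that the "weight" corresponds to scikit-learn's `class_weight` parameter of `RandomForestClassifier` and that the AUC is estimated by cross-validation. The candidate grid, the synthetic data and all variable names are hypothetical, and this is not necessarily the procedure of the cited reference.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic imbalanced data (about 5% minority), for illustration only
X_rf, y_rf = make_classification(n_samples=1500, n_features=10,
                                 weights=[0.95, 0.05], random_state=0)

candidate_weights = [1, 10, 100]  # hypothetical grid for the minority class
cv = KFold(n_splits=3)            # no shuffling, as in the rest of the notebook

mean_auc = {}
for w in candidate_weights:
    forest = RandomForestClassifier(n_estimators=50,
                                    class_weight={0: 1, 1: w},
                                    random_state=0)
    # scoring="roc_auc" uses predicted probabilities, not hard labels
    mean_auc[w] = cross_val_score(forest, X_rf, y_rf,
                                  cv=cv, scoring="roc_auc").mean()

best_weight = max(mean_auc, key=mean_auc.get)
print(mean_auc, "-> best weight:", best_weight)
```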

In [None]:
# SGDClassifier needs scaled features, and StandardScaler works best when the data
# are roughly normally distributed. The Shapiro-Wilk test below checks each column:
# if the data don't follow a gaussian curve (?) I will use the RobustScaler, which
# is also more robust to outliers.
# Note: for more than 5000 samples SciPy warns that the p-value may not be accurate.

from scipy.stats import shapiro

alpha = 0.05  # significance level

for col in df.columns:
    stat, p = shapiro(df[col])

    # p > alpha: we cannot reject the hypothesis that the column is gaussian
    if p > alpha:
        print(col, ': gaussian')
    else:
        print(col, ': not gaussian')

## SGD Classifier
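SGDClassifier is a linear classifier (SVM, logistic regression, among others) trained with stochastic gradient descent (SGD). I used it for efficiency (computation and memory): SGD minimizes the loss function of the chosen model, the hinge loss for a linear SVM or the logistic loss for LR, by updating the weights one sample at a time.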
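As a minimal sketch of this setup, the snippet below chains a `RobustScaler` with an `SGDClassifier` using the hinge loss (a linear SVM). The synthetic data, the 800/200 split and the variable names are assumptions for illustration and not the notebook's final configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic data, for illustration only
X_sgd, y_sgd = make_classification(n_samples=1000, n_features=10, random_state=0)

# loss="hinge" gives a linear SVM; loss="log_loss" (named "log" before
# scikit-learn 1.1) would give logistic regression instead
linear_svm = make_pipeline(RobustScaler(),
                           SGDClassifier(loss="hinge", random_state=0))

# The scaler is fitted on the training part only
linear_svm.fit(X_sgd[:800], y_sgd[:800])
sgd_accuracy = linear_svm.score(X_sgd[800:], y_sgd[800:])
print("Hold-out accuracy:", sgd_accuracy)
```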

statistical test to evaluate models (t-test?)

Note: in the train/test split we set **shuffle** to *False* and use **no stratification**, because we want to pick the first 70% of the dataset for training and the rest for testing. This keeps the analysis coherent with reality: when a fraud detection system is deployed, the data to classify (test set) arrive later than the data used to train the model. Afterwards the test set's "Time" feature is reset so that it starts at 0, and all the offsets are computed from that value.
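A chronological split of this kind could look like the sketch below. The helper name `chronological_split`, the target column name `"Class"` and the toy data frame are assumptions made for the example; the notebook's own splitting helper may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def chronological_split(df, target="Class", train_size=0.7):
    """Hypothetical helper: earliest rows for training, latest for testing."""
    df = df.sort_values("Time", kind="mergesort")  # stable sort by time
    X, y = df.drop(columns=target), df[target]
    # shuffle=False keeps the time order, so the test set comes after the train set
    return train_test_split(X, y, train_size=train_size, shuffle=False)


# Toy data: 10 transactions with increasing time stamps
toy = pd.DataFrame({"Time": range(100, 110),
                    "Amount": [1.0] * 10,
                    "Class": [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]})

X_tr, X_te, y_tr, y_te = chronological_split(toy)

# Reset the test set's "Time" so that it starts at 0
X_te = X_te.assign(Time=X_te["Time"] - X_te["Time"].min())
print(X_tr["Time"].tolist(), X_te["Time"].tolist())
```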

The previous step is **ESSENTIAL** because it avoids one of the most common pitfalls in resampling: *resampling the entire dataset before splitting it into train and test partitions.* Note that this is equivalent to resampling both the train and the test partition. Such processing leads to two issues:

- the model will not be tested on a dataset with a class distribution similar to the real use case. Indeed, by resampling the entire dataset, both the training and the testing set will potentially be balanced, while the model should be tested on the naturally imbalanced dataset to evaluate its potential bias;
- the resampling procedure might use information about samples in the dataset to either generate or select some of the samples. Therefore, we might use information from samples that will later be used as testing samples, which is the typical data leakage issue.

# Modifying time feature

In [None]:
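def bring_time_to_zero(df):
    # Shift the "Time" column so that the earliest sample is at time 0:
    # if the smallest time is already zero nothing changes, otherwise
    # that value is subtracted from every sample.
    # A stable sort keeps rows with equal times in their original order, so
    # data that are already time-ordered stay aligned with their labels.

    df = df.sort_values(by="Time", kind="mergesort")  # ascending order
    val = df["Time"].iloc[0]  # smallest time value
    if val != 0:
        df["Time"] = df["Time"].sub(val)

    return df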
        
from sklearn.model_selection import KFold
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             roc_auc_score, fbeta_score)

# KFold without stratification or shuffling, to avoid mixing the samples in time.
# X_train is expected to be a DataFrame and y_train a Series.
def score_model(model, sampler, X_train, y_train, scaler=None):
    cv = KFold(n_splits=5)
    scores = {'test_accuracy':[],
              'test_recall':[],
              'test_precision':[],
              'test_roc_auc':[],
              'test_ftwo':[]}

    for train_fold_index, val_fold_index in cv.split(X_train, y_train):
        # train_fold_index, val_fold_index are arrays of positional indexes,
        # so both X and y are indexed with .iloc
        X_train_fold, y_train_fold = X_train.iloc[train_fold_index], y_train.iloc[train_fold_index]
        X_val_fold, y_val_fold = X_train.iloc[val_fold_index], y_train.iloc[val_fold_index]

        # Reset "Time" to zero
        X_train_fold = bring_time_to_zero(X_train_fold)
        X_val_fold = bring_time_to_zero(X_val_fold)

        # Scaling if present: fitted on the training fold only
        if scaler is not None:
            scaler.fit(X_train_fold)
            X_train_fold = scaler.transform(X_train_fold)
            X_val_fold = scaler.transform(X_val_fold)

        # Resample the training fold only; the validation fold stays untouched
        X_train_fold_resample, y_train_fold_resample = sampler.fit_resample(X_train_fold,
                                                                            y_train_fold)

        model.fit(X_train_fold_resample, y_train_fold_resample)
        y_pred = model.predict(X_val_fold)

        # Accuracy, Recall, Precision, ROC AUC and F2.
        # Note: ROC AUC is computed here from the hard predictions, not from scores.
        scores['test_accuracy'].append(accuracy_score(y_val_fold, y_pred))
        scores['test_recall'].append(recall_score(y_val_fold, y_pred, zero_division=1))
        scores['test_precision'].append(precision_score(y_val_fold, y_pred, zero_division=1))
        scores['test_roc_auc'].append(roc_auc_score(y_val_fold, y_pred))
        scores['test_ftwo'].append(fbeta_score(y_val_fold, y_pred, beta=2, zero_division=1))

    return scores

In [None]:
X_train, X_test, y_train, y_test = t_test_split(df)
X_test = bring_time_to_zero(X_test)
X_test.head()