# Exercise 04

# Fraud Detection

## Introduction

- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)

Fraud detection is one of the earliest industrial applications of data mining and machine learning. Fraud detection is typically handled as a binary classification problem, but the class population is unbalanced because instances of fraud are usually very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically takes measures to block the accounts from transacting to prevent further losses. 

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [2]:
import zipfile
with zipfile.ZipFile('../datasets/fraud_detection.csv.zip', 'r') as z:
    f = z.open('15_fraud_detection.csv')
    data = pd.read_csv(f, index_col=0)

In [3]:
data.head()

Unnamed: 0,accountAge,digitalItemCount,sumPurchaseCount1Day,sumPurchaseAmount1Day,sumPurchaseAmount30Day,paymentBillingPostalCode - LogOddsForClass_0,accountPostalCode - LogOddsForClass_0,paymentBillingState - LogOddsForClass_0,accountState - LogOddsForClass_0,paymentInstrumentAgeInAccount,ipState - LogOddsForClass_0,transactionAmount,transactionAmountUSD,ipPostalCode - LogOddsForClass_0,localHour - LogOddsForClass_0,Label
0,2000,0,0,0.0,720.25,5.064533,0.421214,1.312186,0.566395,3279.574306,1.218157,599.0,626.16465,1.259543,4.745402,0
1,62,1,1,1185.44,2530.37,0.538996,0.481838,4.40137,4.500157,61.970139,4.035601,1185.44,1185.44,3.981118,4.921349,0
2,2000,0,0,0.0,0.0,5.064533,5.096396,3.056357,3.155226,0.0,3.314186,32.09,32.09,5.00849,4.742303,0
3,1,1,0,0.0,0.0,5.064533,5.096396,3.331154,3.331239,0.0,3.529398,133.28,132.729554,1.324925,4.745402,0
4,1,1,0,0.0,132.73,5.412885,0.342945,5.563677,4.086965,0.001389,3.529398,543.66,543.66,2.693451,4.876771,0


In [4]:
data.shape, data.Label.sum(), data.Label.mean()

((138721, 16), 797, 0.0057453449730033666)
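
The counts above already hint at why accuracy alone is a weak yardstick for this dataset. As a quick sketch that uses only the numbers printed by the previous cell (138721 rows, 797 labelled as fraud), a hypothetical classifier that always predicts 0 would score as follows:

```python
# Counts taken from the output of the previous cell
n_rows = 138721
n_fraud = 797

fraud_rate = n_fraud / n_rows
# Accuracy of a trivial model that labels every transaction as legitimate
baseline_accuracy = 1 - fraud_rate

print(round(fraud_rate, 6), round(baseline_accuracy, 6))
```

Any model evaluated on this data has to be read against that baseline of roughly 99.4% accuracy.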

In [5]:
X = data.drop(['Label'], axis=1)
y = data['Label']

# Exercise 04.1

Estimate a Logistic Regression

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment about the results

In [6]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1818)

logreg = LogisticRegression()
Log1=logreg.fit(x_train, y_train)
Log1

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [7]:
# Predict on the test sample
y_pred=Log1.predict(x_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [8]:
from sklearn.metrics import accuracy_score,precision_score, recall_score, f1_score
print('Accuracy',accuracy_score(y_test, y_pred))
print('Precision',precision_score(y_test, y_pred))
print('Recall',recall_score(y_test, y_pred))
print('F1 score', f1_score(y_test, y_pred))

Accuracy 0.994463827456
Precision 0.0
Recall 0.0
F1 score 0.0




###### Because of the class imbalance, the model predicts 0 for every test transaction, so precision, recall and F1 are all zero even though accuracy is about 99.4%.

In [9]:
Log1.predict_proba(x_test)


array([[  9.99790460e-01,   2.09540436e-04],
       [  9.86170699e-01,   1.38293008e-02],
       [  9.99999692e-01,   3.07549543e-07],
       ..., 
       [  9.99932134e-01,   6.78656660e-05],
       [  9.89614411e-01,   1.03855887e-02],
       [  9.99955910e-01,   4.40904567e-05]])
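
The probabilities above are all far below 0.5, which is effectively the cut-off that `predict` applies for a binary logistic regression, so no row is ever labelled as fraud. A minimal sketch of what a lower cut-off would do, using only the six rows visible in the output; the helper `classify` and the threshold 0.01 are illustrative assumptions, not a recommended setting:

```python
# Positive-class probabilities copied from the visible rows of the output above
p_fraud = [2.09540436e-04, 1.38293008e-02, 3.07549543e-07,
           6.78656660e-05, 1.03855887e-02, 4.40904567e-05]

def classify(probabilities, threshold):
    """Label a row as fraud (1) when its probability reaches the threshold."""
    return [int(p >= threshold) for p in probabilities]

default_labels = classify(p_fraud, 0.5)    # behaves like the default cut-off
lowered_labels = classify(p_fraud, 0.01)   # hypothetical lower cut-off

print(default_labels)
print(lowered_labels)
```

Lowering the threshold trades precision for recall, which is the same trade-off the resampling exercises below explore from the data side.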

In [10]:
print('The fraud rate is',round(y_train.mean(),6))

The fraud rate is 0.005815


# Exercise 04.2

Under-sample the negative class using random-under-sampling

Which value of target_percentage did you choose?
How do the results change?

**Only apply under-sampling to the training set, evaluate using the whole test set**

In [11]:
# Under-sampling function
def UnderSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    n_samples_0_new =  n_samples_1 / target_percentage - n_samples_1
    n_samples_0_new_per = n_samples_0_new / n_samples_0

    filter_ = y == 0

    np.random.seed(seed)
    rand_1 = np.random.binomial(n=1, p=n_samples_0_new_per, size=n_samples)
    
    # Keep a negative row only when its binomial draw is 1
    filter_ = filter_ & rand_1.astype(bool)
    filter_ = filter_ | (y == 1)
    filter_ = filter_.astype(bool)
    
    return X[filter_], y[filter_]

In [12]:
# Rebalance the training set to a 30% fraud share
X_u, y_u = UnderSampling(x_train, y_train, target_percentage= 0.3)
print('The rebalanced fraud rate is: ',round(y_u.mean(),6))

The rebalanced fraud rate is:  0.304786
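
The realized share (0.304786) is not exactly 0.3 because `UnderSampling` keeps each negative row independently through a binomial draw. As a hedged alternative, the sketch below draws a fixed number of negatives instead. The name `undersample_exact` and the toy labels are assumptions made for illustration; the function works on a plain list of labels and returns row positions.

```python
import random

def undersample_exact(labels, target_percentage=0.5, seed=None):
    """Return sorted row positions: all positives plus a fixed-size random
    subset of negatives, so positives make up about target_percentage."""
    pos = [i for i, v in enumerate(labels) if v == 1]
    neg = [i for i, v in enumerate(labels) if v == 0]
    # Same relation as in UnderSampling: n0_new = n1 / t - n1
    n_neg_keep = int(round(len(pos) / target_percentage - len(pos)))
    n_neg_keep = min(n_neg_keep, len(neg))
    rng = random.Random(seed)
    keep = pos + rng.sample(neg, n_neg_keep)
    return sorted(keep)

# Toy labels (made up for illustration): 30 positives, 1000 negatives
toy_labels = [1] * 30 + [0] * 1000
kept = undersample_exact(toy_labels, target_percentage=0.3, seed=0)
share = sum(toy_labels[i] for i in kept) / len(kept)
print(len(kept), share)
```

With pandas objects, the returned positions could be passed to `.iloc` on both the feature frame and the label series.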


In [13]:
Log2=logreg.fit(X_u, y_u)
y_pred2=Log2.predict(x_test)
print('Accuracy', round(accuracy_score(y_test, y_pred2),6))
print('Precision',round(precision_score(y_test, y_pred2),6))
print('Recall',   round(recall_score(y_test, y_pred2),6))
print('F1 score', round(f1_score(y_test, y_pred2),6))

Accuracy 0.958998
Precision 0.029096
Recall 0.197917
F1 score 0.050734
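
The exercise statement also asks for an F-beta score with beta = 10, which the cells above do not compute. A minimal sketch of the formula, applied to the rounded precision and recall just printed; the helper `f_beta` is an illustrative name, and in the notebook itself `sklearn.metrics.fbeta_score(y_test, y_pred2, beta=10)` should give the unrounded equivalent:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta from precision and recall; returns 0.0 when both are zero."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom

# Precision and recall printed above for the under-sampled model
precision, recall = 0.029096, 0.197917

f1 = f_beta(precision, recall, beta=1)
f10 = f_beta(precision, recall, beta=10)
print(round(f1, 6), round(f10, 6))
```

With beta = 10 the score weights recall far more heavily than precision, so it lands close to the recall value rather than near the F1 value.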


# Exercise 04.3

Now use random over-sampling

In [14]:
# Over-sampling function
def OverSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    n_samples_1_new =  -target_percentage * n_samples_0 / (target_percentage- 1)

    np.random.seed(seed)
    filter_ = np.random.choice(X[y == 1].shape[0], int(n_samples_1_new))
    # filter_ indexes the positives only; map it to positions in the full data
    filter_ = np.nonzero((y == 1).values)[0][filter_]
    
    filter_ = np.concatenate((filter_, np.nonzero((y == 0).values)[0]), axis=0)
    
    return X.iloc[filter_], y.iloc[filter_]

In [15]:
# Rebalance the training set to a 30% fraud share
X_o, y_o = OverSampling(x_train, y_train, target_percentage = 0.3)
print('The rebalanced fraud rate is: ',round(y_o.mean(),6))

The rebalanced fraud rate is:  0.299999
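
The sample-size expressions in `UnderSampling` and `OverSampling` both follow from requiring the positive share of the resampled set to equal `target_percentage`. The notation here is introduced only for this derivation: $t$ is `target_percentage`, $n_0$ and $n_1$ are the original negative and positive counts, and primes mark the resampled counts.

```latex
\begin{align}
\text{Under-sampling:}\quad
t = \frac{n_1}{n_1 + n_0'}
&\;\Longrightarrow\;
n_0' = \frac{n_1}{t} - n_1 \\
\text{Over-sampling:}\quad
t = \frac{n_1'}{n_1' + n_0}
&\;\Longrightarrow\;
n_1' = \frac{t\, n_0}{1 - t} = \frac{-t\, n_0}{t - 1}
\end{align}
```

The first line matches `n_samples_0_new` and the second matches `n_samples_1_new` in the functions above. Because `OverSampling` truncates $n_1'$ to an integer, the realized share can fall marginally below the target, as in the 0.299999 printed above.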


In [16]:
Log3=logreg.fit(X_o, y_o)
y_pred3=Log3.predict(x_test)
print('Accuracy', round(accuracy_score(y_test, y_pred3),6))
print('Precision',round(precision_score(y_test, y_pred3),6))
print('Recall',   round(recall_score(y_test, y_pred3),6))
print('F1 score', round(f1_score(y_test, y_pred3),6))

Accuracy 0.964938
Precision 0.031993
Recall 0.182292
F1 score 0.054432


# Exercise 04.4*
Evaluate the results using SMOTE

Which parameters did you choose?

In [17]:
# SMOTE function
def SMOTE_MANUAL(X, y, target_percentage=0.5, k=5, seed=None):
    
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()
    
    #New samples
    n_samples_1_new =  int(-target_percentage * n_samples_0 / (target_percentage- 1) - n_samples_1)
    
    # A matrix to store the synthetic samples
    new = np.zeros((n_samples_1_new, X.shape[1]))
    
    # Create seeds
    np.random.seed(seed)
    seeds = np.random.randint(1, 1000000, 3)
    
    # Select examples to use as base
    np.random.seed(seeds[0])
    sel_ = np.random.choice(y[y==1].shape[0], n_samples_1_new)
    
    # Define random seeds (2 per example)
    np.random.seed(seeds[1])
    nn__ = np.random.choice(k, n_samples_1_new)
    np.random.seed(seeds[2])
    steps = np.random.uniform(size=n_samples_1_new)  

    # For each selected examples create one synthetic case
    for i, sel in enumerate(sel_):
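        # Select the "neighbor": note that nn_ is a position among the first k
        # positive rows, not one of the k nearest neighbors of the base example
        nn_ = nn__[i]
        step = steps[i]
        # Create the new sample by interpolating between base and neighbor
        new[i, :] = X[y==1].iloc[sel] - step * (X[y==1].iloc[sel] - X[y==1].iloc[nn_])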
    
    X = np.vstack((X, new))
    y = np.append(y, np.ones(n_samples_1_new))
    
    return X, y

In [18]:
# Rebalance to a 30% fraud share with k=10
X_sm, y_sm = SMOTE_MANUAL(x_train, y_train, target_percentage=0.3, k=10)

In [19]:
Log4=logreg.fit(X_sm, y_sm)
y_pred4=Log4.predict(x_test)
print('Accuracy', round(accuracy_score(y_test, y_pred4),6))
print('Precision',round(precision_score(y_test, y_pred4),6))
print('Recall',   round(recall_score(y_test, y_pred4),6))
print('F1 score', round(f1_score(y_test, y_pred4),6))

Accuracy 0.944062
Precision 0.014444
Recall 0.135417
F1 score 0.026104
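
In `SMOTE_MANUAL` above, the second point of each interpolation is drawn from the first `k` positive rows rather than from the nearest neighbors of the base example. For comparison, here is a minimal pure-Python sketch of neighbor-based interpolation. The name `smote_sketch` and the toy points are assumptions made for illustration; this is neither the notebook's method nor any library's implementation, and its brute-force neighbor search would be slow on large data.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=None):
    """Create n_new synthetic rows by interpolating between a random
    minority row and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    # For every minority row, positions of its k nearest rows (itself excluded)
    neighbors = []
    for i, p in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(p, minority[j]))
        neighbors.append(others[:k])
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))   # base example
        j = rng.choice(neighbors[i])       # one of its true nearest neighbors
        step = rng.random()                # position along the segment
        base, nb = minority[i], minority[j]
        synthetic.append([b + step * (n - b) for b, n in zip(base, nb)])
    return synthetic

# Toy minority points (made up): two well-separated clusters
toy_minority = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
new_rows = smote_sketch(toy_minority, n_new=20, k=2, seed=0)
print(len(new_rows))
```

Because each base point only interpolates toward neighbors in its own cluster, every synthetic row in this toy example stays inside one of the two clusters instead of landing in the empty space between them.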


# Exercise 04.5

Estimate a Logistic Regression, GaussianNB, K-nearest neighbors and a Decision Tree **Classifiers**

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment about the results

Combine the classifiers and comment

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

models = {'lr': LogisticRegression(),
          'dt': DecisionTreeClassifier(),
          'nb': GaussianNB(),
          'nn': KNeighborsClassifier()}

for model in models.keys():
    models[model].fit(x_train, y_train)
    
# Predictions from each model
y_pred = pd.DataFrame(index=x_test.index, columns=models.keys())
for model in models.keys():
    y_pred[model] = models[model].predict(x_test)

In [21]:
for model in models.keys():
    print(model,'accuracy',round(accuracy_score(y_test,y_pred[model]),6))
    print(model,'precision',round(precision_score(y_test,y_pred[model]),6))
    print(model,'recall',round(recall_score(y_test,y_pred[model]),6))
    print(model,'f1 score', round(f1_score(y_test,y_pred[model]),6))

lr accuracy 0.994464
lr precision 0.0
lr recall 0.0
lr f1 score 0.0
dt accuracy 0.98861
dt precision 0.111111
dt recall 0.151042
dt f1 score 0.128035
nb accuracy 0.989706
nb precision 0.005988
nb recall 0.005208
nb f1 score 0.005571
nn accuracy 0.994435
nn precision 0.461538
nn recall 0.03125
nn f1 score 0.058537
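
The exercise also asks to combine the classifiers, which the cells above do not do. One simple option is a vote over the four 0/1 predictions. The sketch below uses made-up toy predictions keyed by the notebook's model names; `majority_vote` and `min_votes` are illustrative names. On the notebook's `y_pred` DataFrame, the per-row vote count could be obtained with `y_pred.sum(axis=1)`.

```python
def majority_vote(predictions, min_votes):
    """Combine 0/1 predictions from several models, row by row:
    a row is flagged when at least min_votes models flag it."""
    columns = list(predictions.values())
    return [int(sum(votes) >= min_votes) for votes in zip(*columns)]

# Toy predictions (made up) keyed by the notebook's model names
toy_pred = {
    'lr': [0, 0, 0, 0],
    'dt': [1, 0, 1, 0],
    'nb': [0, 0, 1, 0],
    'nn': [1, 0, 1, 1],
}

strict = majority_vote(toy_pred, min_votes=3)   # most models must agree
lenient = majority_vote(toy_pred, min_votes=1)  # any single model suffices

print(strict)
print(lenient)
```

A low `min_votes` favors recall and a high one favors precision, so the choice should follow whichever metric matters most, for example the F-beta score with beta = 10 requested in the exercise.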




# Exercise 04.6

Using the under-sampled dataset

Evaluate a RandomForestClassifier and compare the results

Change n_estimators to 100. What happens?

In [22]:
from sklearn.ensemble import RandomForestClassifier
Random_forest_model = RandomForestClassifier()
Forest=Random_forest_model.fit(X_u, y_u)
# Prediction on the test set
y_pred_rf=Forest.predict(x_test)
print('Accuracy', round(accuracy_score(y_test, y_pred_rf),6))
print('Precision',round(precision_score(y_test, y_pred_rf),6))
print('Recall',   round(recall_score(y_test, y_pred_rf),6))
print('F1 score', round(f1_score(y_test, y_pred_rf),6))

Accuracy 0.922349
Precision 0.033917
Recall 0.473958
F1 score 0.063304


In [23]:
Random_forest_model_100 = RandomForestClassifier(n_estimators=100)
Forest_100=Random_forest_model_100.fit(X_u, y_u)

# Prediction on the test set
y_pred_rf100=Forest_100.predict(x_test)
print('Accuracy', round(accuracy_score(y_test, y_pred_rf100),6))
print('Precision',round(precision_score(y_test, y_pred_rf100),6))
print('Recall',   round(recall_score(y_test, y_pred_rf100),6))
print('F1 score', round(f1_score(y_test, y_pred_rf100),6))

Accuracy 0.919062
Precision 0.035524
Recall 0.520833
F1 score 0.066511


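###### With n_estimators=100, precision, recall and F1 improve slightly, while accuracy drops slightly (0.922349 to 0.919062). Neither forest sets random_state, so part of the difference may be run-to-run variation.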