## Logistic Regression

In [3]:
import matplotlib.pyplot as plt
import pandas as pd
import torch
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings("ignore")

We started with the Logistic Regression algorithm for classifying the data. We used the optimizers class provided in A4 file. For the custom built classifier class we modified the code we have used in A4 to perform binary classification. We have modified error and gradient functions to give logarithmic ascent. We have replaced the softmax function with sigmoid activation function for the output layer to fit the model better for the binary classification. The output of the layer i is now being fed to the sigmoid activation function. The sigmoid function is an S-shaped curve that maps any real-valued number to a value between 0 and 1. It is defined as:

sigmoid(x) = 1 / (1 + exp(-x)).

After applying the sigmoid function we get the predicted probabilities of legit or fraudulent transactions. 


We then use argmax function to decide which class the particular sample belongs to. Since the dataset is imbalanced we thought it might be helpful to use K-cross validation to get a better assessment by evaluating the model on multiple train-test splits, considering different combinations of minority and majority class samples. We used code form A3 for function generate_k_fold_cross_validation_sets. We have modified the function run_k_fold_cross_validation from the same assignment to include percentage of correct classification values in the result data frame. We have used different network architectures and compared the results given by each of them. The dataset is quite imbalanced with only 492 fradulent transactions in the total of 284807 samples. So, we decided to use the SMOTE(Synthetic Minority Over-sampling Technique) to introduce synthetic samples of the minority class in hope that it will help train the model better for discriminating against the legit and fraudulent transaction. SMOTE determines K-nearest neighbours of each minority instance and generates synthetic samples for them.

In [14]:
%%writefile optimizers.py
import numpy as np

######################################################################
## class Optimizers()
######################################################################

class Optimizers():

    def __init__(self, all_weights):
        '''all_weights is a vector of all of a neural networks weights concatenated into a one-dimensional vector'''
        
        self.all_weights = all_weights

        # The following initializations are only used by adam.
        # Only initializing m, v, beta1t and beta2t here allows multiple calls to adam to handle training
        # with multiple subsets (batches) of training data.
        self.mt = np.zeros_like(all_weights)
        self.vt = np.zeros_like(all_weights)
        self.beta1 = 0.9
        self.beta2 = 0.999
        self.beta1t = 1
        self.beta2t = 1

        
    def sgd(self, error_f, gradient_f, fargs=[], n_epochs=100, learning_rate=0.001, verbose=True, error_convert_f=None):
        '''
error_f: function that requires X and T as arguments (given in fargs) and returns mean squared error.
gradient_f: function that requires X and T as arguments (in fargs) and returns gradient of mean squared error
            with respect to each weight.
error_convert_f: function that converts the standardized error from error_f to original T units.
        '''

        error_trace = []
        epochs_per_print = n_epochs // 10

        for epoch in range(n_epochs):

            error = error_f(*fargs)
            
            grad = gradient_f(*fargs)
            
            # Update all weights using -= to modify their values in-place.
            self.all_weights -= learning_rate * grad

            if error_convert_f:
                error = error_convert_f(error)
            error_trace.append(error)

            if verbose and ((epoch + 1) % max(1, epochs_per_print) == 0):
                print(f'sgd: Epoch {epoch+1:d} Error={error:.5f}')

        return error_trace

    def adam(self, error_f, gradient_f, fargs=[], n_epochs=100, learning_rate=0.001, verbose=True, error_convert_f=None):
        '''
error_f: function that requires X and T as arguments (given in fargs) and returns mean squared error.
gradient_f: function that requires X and T as arguments (in fargs) and returns gradient of mean squared error
            with respect to each weight.
error_convert_f: function that converts the standardized error from error_f to original T units.
        '''

        alpha = learning_rate  # learning rate called alpha in original paper on adam
        epsilon = 1e-8
        error_trace = []
        epochs_per_print = n_epochs // 10

        for epoch in range(n_epochs):

            error = error_f(*fargs)
            
            grad = gradient_f(*fargs)
            


            self.mt[:] = self.beta1 * self.mt + (1 - self.beta1) * grad
            self.vt[:] = self.beta2 * self.vt + (1 - self.beta2) * grad * grad
            self.beta1t *= self.beta1
            self.beta2t *= self.beta2

            m_hat = self.mt / (1 - self.beta1t)
            v_hat = self.vt / (1 - self.beta2t)

            # Update all weights using -= to modify their values in-place.
            self.all_weights -= alpha * m_hat / (np.sqrt(v_hat) + epsilon)
    
            if error_convert_f:
                error = error_convert_f(error)
            error_trace.append(error)

            if verbose and ((epoch + 1) % max(1, epochs_per_print) == 0):
                print(f'Adam: Epoch {epoch+1:d} Error={error:.5f}')

        return error_trace

if __name__ == '__main__':

    import matplotlib.pyplot as plt
    plt.ion()

    def parabola(wmin):
        return ((w - wmin) ** 2)[0]

    def parabola_gradient(wmin):
        return 2 * (w - wmin)

    w = np.array([0.0])
    optimizer = Optimizers(w)

    wmin = 5
    optimizer.sgd(parabola, parabola_gradient, [wmin],
                  n_epochs=500, learning_rate=0.1)

    print(f'sgd: Minimum of parabola is at {wmin}. Value found is {w}')

    w = np.array([0.0])
    optimizer = Optimizers(w)
    optimizer.adam(parabola, parabola_gradient, [wmin],
                   n_epochs=500, learning_rate=0.1)
    
    print(f'adam: Minimum of parabola is at {wmin}. Value found is {w}')

Overwriting optimizers.py


In [3]:
import numpy as np
import optimizers
import sys  # for sys.float_info.epsilon

######################################################################
## class NeuralNetwork()
######################################################################

class NeuralNetwork():


    def __init__(self, n_inputs, n_hiddens_per_layer, n_outputs, activation_function='tanh'):
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        self.activation_function = activation_function

        # Set self.n_hiddens_per_layer to [] if argument is 0, [], or [0]
        if n_hiddens_per_layer == 0 or n_hiddens_per_layer == [] or n_hiddens_per_layer == [0]:
            self.n_hiddens_per_layer = []
        else:
            self.n_hiddens_per_layer = n_hiddens_per_layer

        # Initialize weights, by first building list of all weight matrix shapes.
        n_in = n_inputs
        shapes = []
        for nh in self.n_hiddens_per_layer:
            shapes.append((n_in + 1, nh))
            n_in = nh
        shapes.append((n_in + 1, n_outputs))

        # self.all_weights:  vector of all weights
        # self.Ws: list of weight matrices by layer
        self.all_weights, self.Ws = self.make_weights_and_views(shapes)

        # Define arrays to hold gradient values.
        # One array for each W array with same shape.
        self.all_gradients, self.dE_dWs = self.make_weights_and_views(shapes)

        self.trained = False
        self.total_epochs = 0
        self.error_trace = []
        self.Xmeans = None
        self.Xstds = None
        self.Tmeans = None
        self.Tstds = None


    def make_weights_and_views(self, shapes):
        # vector of all weights built by horizontally stacking flatenned matrices
        # for each layer initialized with uniformly-distributed values.
        all_weights = np.hstack([np.random.uniform(size=shape).flat / np.sqrt(shape[0])
                                 for shape in shapes])
        # Build list of views by reshaping corresponding elements from vector of all weights
        # into correct shape for each layer.
        views = []
        start = 0
        for shape in shapes:
            size =shape[0] * shape[1]
            views.append(all_weights[start:start + size].reshape(shape))
            start += size
        return all_weights, views


    # Return string that shows how the constructor was called
    def __repr__(self):
        return f'{type(self).__name__}({self.n_inputs}, {self.n_hiddens_per_layer}, {self.n_outputs}, \'{self.activation_function}\')'


    # Return string that is more informative to the user about the state of this neural network.
    def __str__(self):
        result = self.__repr__()
        if len(self.error_trace) > 0:
            return self.__repr__() + f' trained for {len(self.error_trace)} epochs, final training error {self.error_trace[-1]:.4f}'


    def train(self, X, T, n_epochs, learning_rate, method='sgd', verbose=True):
       
        # Setup standardization parameters
        if self.Xmeans is None:
            self.Xmeans = X.mean(axis=0)
            self.Xstds = X.std(axis=0)
            self.Xstds[self.Xstds == 0] = 1  # So we don't divide by zero when standardizing
            self.Tmeans = T.mean(axis=0)
            self.Tstds = T.std(axis=0)
            
        # Standardize X and T
        X = (X - self.Xmeans) / self.Xstds
        T = (T - self.Tmeans) / self.Tstds

        # Instantiate Optimizers object by giving it vector of all weights
        optimizer = optimizers.Optimizers(self.all_weights)

        # Define function to convert value from error_f into error in original T units, 
        # but only if the network has a single output. Multiplying by self.Tstds for 
        # multiple outputs does not correctly unstandardize the error.
        if len(self.Tstds) == 1:
            error_convert_f = lambda err: (np.sqrt(err) * self.Tstds)[0] # to scalar
        else:
            error_convert_f = lambda err: np.sqrt(err)[0] # to scalar
            

        if method == 'sgd':

            error_trace = optimizer.sgd(self.error_f, self.gradient_f,
                                        fargs=[X, T], n_epochs=n_epochs,
                                        learning_rate=learning_rate,
                                        verbose=True,
                                        error_convert_f=error_convert_f)

        elif method == 'adam':

            error_trace = optimizer.adam(self.error_f, self.gradient_f,
                                         fargs=[X, T], n_epochs=n_epochs,
                                         learning_rate=learning_rate,
                                         verbose=True,
                                         error_convert_f=error_convert_f)

        else:
            raise Exception("method must be 'sgd' or 'adam'")
        
        self.error_trace = error_trace

        # Return neural network object to allow applying other methods after training.
        #  Example:    Y = nnet.train(X, T, 100, 0.01).use(X)
        return self

    def relu(self, s):
        s[s < 0] = 0
        return s

    def grad_relu(self, s):
        return (s > 0).astype(int)
    
    def forward_pass(self, X):
        '''X assumed already standardized. Output returned as standardized.'''
        self.Ys = [X]
        for W in self.Ws[:-1]:
            if self.activation_function == 'relu':
                self.Ys.append(self.relu(self.Ys[-1] @ W[1:, :] + W[0:1, :]))
            else:
                self.Ys.append(np.tanh(self.Ys[-1] @ W[1:, :] + W[0:1, :]))
        last_W = self.Ws[-1]
        self.Ys.append(self.Ys[-1] @ last_W[1:, :] + last_W[0:1, :])
        return self.Ys

    # Function to be minimized by optimizer method, mean squared error
    def error_f(self, X, T):
        Ys = self.forward_pass(X)
        mean_sq_error = np.mean((T - Ys[-1]) ** 2)
        return mean_sq_error

    # Gradient of function to be minimized for use by optimizer method
    def gradient_f(self, X, T):
        '''Assumes forward_pass just called with layer outputs in self.Ys.'''
        error = T - self.Ys[-1]
        n_samples = X.shape[0]
        n_outputs = T.shape[1]
        delta = - error / (n_samples * n_outputs)
        n_layers = len(self.n_hiddens_per_layer) + 1
        # Step backwards through the layers to back-propagate the error (delta)
        for layeri in range(n_layers - 1, -1, -1):
            # gradient of all but bias weights
            self.dE_dWs[layeri][1:, :] = self.Ys[layeri].T @ delta
            # gradient of just the bias weights
            self.dE_dWs[layeri][0:1, :] = np.sum(delta, 0)
            # Back-propagate this layer's delta to previous layer
            if self.activation_function == 'relu':
                delta = delta @ self.Ws[layeri][1:, :].T * self.grad_relu(self.Ys[layeri])
            else:
                delta = delta @ self.Ws[layeri][1:, :].T * (1 - self.Ys[layeri] ** 2)
        return self.all_gradients

    def use(self, X):
        '''X assumed to not be standardized'''
        # Standardize X
        X = (X - self.Xmeans) / self.Xstds
        Ys = self.forward_pass(X)
        Y = Ys[-1]
        # Unstandardize output Y before returning it
        return Y * self.Tstds + self.Tmeans

In [4]:
class NeuralNetworkClassifier(NeuralNetwork):


    def __init__(self, n_inputs, n_hiddens_per_layer, n_outputs, activation_function='tanh'):
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        self.activation_function = activation_function
        
        # Set self.n_hiddens_per_layer to [] if argument is 0, [], or [0]
        if n_hiddens_per_layer == 0 or n_hiddens_per_layer == [] or n_hiddens_per_layer == [0]:
            self.n_hiddens_per_layer = []
        else:
            self.n_hiddens_per_layer = n_hiddens_per_layer

        # Initialize weights, by first building list of all weight matrix shapes.
        n_in = n_inputs
        shapes = []
        for nh in self.n_hiddens_per_layer:
            shapes.append((n_in + 1, nh))
            n_in = nh
        shapes.append((n_in + 1, n_outputs))

        # self.all_weights:  vector of all weights
        # self.Ws: list of weight matrices by layer
        self.all_weights, self.Ws = self.make_weights_and_views(shapes)

        # Define arrays to hold gradient values.
        # One array for each W array with same shape.
        self.all_gradients, self.dE_dWs = self.make_weights_and_views(shapes)

        self.trained = False
        self.total_epochs = 0
        self.error_trace = []
        self.Xmeans = None
        self.Xstds = None
        #self.Tmeans = None
        #self.Tstds = None


    def make_weights_and_views(self, shapes):
        # vector of all weights built by horizontally stacking flatenned matrices
        # for each layer initialized with uniformly-distributed values.
        all_weights = np.hstack([np.random.uniform(size=shape).flat / np.sqrt(shape[0])
                                 for shape in shapes])
        # Build list of views by reshaping corresponding elements from vector of all weights
        # into correct shape for each layer.
        views = []
        start = 0
        for shape in shapes:
            size = shape[0] * shape[1]
            views.append(all_weights[start:start + size].reshape(shape))
            start += size
        return all_weights, views


    # Return string that shows how the constructor was called
    def __repr__(self):
        return f'{type(self).__name__}({self.n_inputs}, {self.n_hiddens_per_layer}, {self.n_outputs}, \'{self.activation_function}\')'


    # Return string that is more informative to the user about the state of this neural network.
    def __str__(self):
        result = self.__repr__()
        if len(self.error_trace) > 0:
            return self.__repr__() + f' trained for {len(self.error_trace)} epochs, final training error {self.error_trace[-1]:.4f}'


    def train(self, X, T, n_epochs, learning_rate, method='sgd', verbose=True):
        
        #n_classes = len(np.unique(T))
        # Convert y to one-hot encoding
        #T_onehot = np.eye(n_classes)
        
        # Setup standardization parameters
        if self.Xmeans is None:
            self.Xmeans = X.mean(axis=0)
            self.Xstds = X.std(axis=0)
            self.Xstds[self.Xstds == 0] = 1  # So we don't divide by zero when standardizing
#             self.Tmeans = T.mean(axis=0)
#             self.Tstds = T.std(axis=0)
            self.T = T
            self.classes = np.unique(T).reshape(-1,1)
            #X = X.reshape(-1,1)
        Xin = X
        # Standardize X
        X = (X - self.Xmeans) / self.Xstds
        #T = (T - self.Tmeans) / self.Tstds
        T_onehot = self.makeIndicatorVars(T)
        # Instantiate Optimizers object by giving it vector of all weights
        optimizer = optimizers.Optimizers(self.all_weights)

        # Define function to convert value from error_f into error in original T units, 
        # but only if the network has a single output. Multiplying by self.Tstds for 
        # multiple outputs does not correctly unstandardize the error.
        #if len(self.Tstds) == 1:
            #error_convert_f = lambda err: (np.sqrt(err) * self.Tstds)[0] # to scalar
        #else:
#         error_convert_f = lambda err: np.mean(((self.use(Xin)[0]) != (T)))
        error_convert_f = lambda err: np.exp(-err)
            

        if method == 'sgd':
            error_trace = optimizer.sgd(self.error_f, self.gradient_f, fargs=[X, T_onehot], 
                                        n_epochs=n_epochs, learning_rate=learning_rate, 
                                        verbose=verbose,
                                        error_convert_f=error_convert_f)
        elif method == 'adam':
            error_trace = optimizer.adam(self.error_f, self.gradient_f, fargs=[X, T_onehot], 
                                         n_epochs=n_epochs, learning_rate=learning_rate, 
                                         verbose=verbose, 
                                         error_convert_f=error_convert_f)

        else:
            raise Exception("method must be 'sgd' or 'adam'")
        
        self.error_trace = error_trace

        # Return neural network object to allow applying other methods after training.
        #  Example:    Y = nnet.train(X, T, 100, 0.01).use(X)
        return self

    def relu(self, s):
        s[s < 0] = 0
        return s

    def grad_relu(self, s):
        return (s > 0).astype(int)
    

        def error_f(self, X, T):
        Ys = self.forward_pass(X)
        log_probs = np.log(self.sigmoid(Ys[-1]))
        cross_entropy = -np.mean(T * log_probs)#/T.shape[1]
        #likelihood = np.exp((cross_entropy)/X.shape[0])
        return cross_entropy
    
    # Gradient of function to be minimized for use by optimizer method
    def gradient_f(self, X, T):
        '''Assumes forward_pass just called with layer outputs in self.Ys.'''
        error = T - self.sigmoid(self.Ys[-1])
        n_samples = X.shape[0]
        n_outputs = T.shape[1]
        delta = - error / (n_samples * n_outputs)
        n_layers = len(self.n_hiddens_per_layer) + 1
        # Step backwards through the layers to back-propagate the error (delta)
        for layeri in range(n_layers - 1, -1, -1):

            self.dE_dWs[layeri][1:, :] = self.Ys[layeri].T @ delta 
            # gradient of just the bias weights
            self.dE_dWs[layeri][0:1, :] = np.sum(delta, 0) 
            # Back-propagate this layer's delta to previous layer
            if self.activation_function == 'relu':
                delta = delta @ self.Ws[layeri][1:, :].T * self.grad_relu(self.Ys[layeri])
            else:
                delta = delta @ self.Ws[layeri][1:, :].T * (1 - self.Ys[layeri] ** 2)
  
        return self.all_gradients

    def use(self, X):
        #X assumed to not be standardized
        
        # Standardize X
        X1 = (X - self.Xmeans) / self.Xstds
        Ys = self.forward_pass(X1)
        Y_probs = self.sigmoid(Ys[-1])#, axis = 1)
        Y_classes = self.classes[np.argmax(Y_probs, axis=1)].reshape(-1, 1)
        return Y_classes, Y_probs#, T2
    

    def sigmoid(self, s) :
        return 1/(1 + np.exp(-s))
                   

    def grad_sigmoid(self, s): 
        return self.sigmoid(s) * (1 - self.sigmoid(s))
    
    def makeIndicatorVars(self, T):
    # Make sure T is two-dimensional. Should be nSamples x 1.
        if T.ndim == 1:
            T = T.reshape((-1, 1))    
        return (T == np.unique(T)).astype(int)
    


### Data Import and Data Preparation
We used the Kaggle Credit Card Fraud Detection Dataset : <a href="https://www.kaggle.com/mlg-ulb/creditcardfraud">Link</a>


In [5]:
# Read Data into a Dataframe
df = pd.read_csv('creditcard.csv')
df_refine = df.copy()

In [6]:
df

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [7]:
X = df.iloc[:, :-1].values
T = df.iloc[:, -1:].values

In [8]:
X, T

(array([[ 0.00000000e+00, -1.35980713e+00, -7.27811733e-02, ...,
          1.33558377e-01, -2.10530535e-02,  1.49620000e+02],
        [ 0.00000000e+00,  1.19185711e+00,  2.66150712e-01, ...,
         -8.98309914e-03,  1.47241692e-02,  2.69000000e+00],
        [ 1.00000000e+00, -1.35835406e+00, -1.34016307e+00, ...,
         -5.53527940e-02, -5.97518406e-02,  3.78660000e+02],
        ...,
        [ 1.72788000e+05,  1.91956501e+00, -3.01253846e-01, ...,
          4.45477214e-03, -2.65608286e-02,  6.78800000e+01],
        [ 1.72788000e+05, -2.40440050e-01,  5.30482513e-01, ...,
          1.08820735e-01,  1.04532821e-01,  1.00000000e+01],
        [ 1.72792000e+05, -5.33412522e-01, -1.89733337e-01, ...,
         -2.41530880e-03,  1.36489143e-02,  2.17000000e+02]]),
 array([[0],
        [0],
        [0],
        ...,
        [0],
        [0],
        [0]]))

In [9]:
classes = np.unique(T).reshape(-1,1)
classes

array([[0],
       [1]])

In [22]:
def generate_k_fold_cross_validation_sets(X, T, n_folds, shuffle=True):

    if shuffle:
        # Randomly order X and T
        randorder = np.arange(X.shape[0])
        np.random.shuffle(randorder)
        X = X[randorder, :]
        T = T[randorder, :]

    # Partition X and T into folds
    n_samples = X.shape[0]
    n_per_fold = round(n_samples / n_folds)
    n_last_fold = n_samples - n_per_fold * (n_folds - 1)

    folds = []
    start = 0
    for foldi in range(n_folds-1):
        folds.append( (X[start:start + n_per_fold, :], T[start:start + n_per_fold, :]) )
        start += n_per_fold
    folds.append( (X[start:, :], T[start:, :]) )

    # Yield k(k-1) assignments of Xtrain, Train, Xvalidate, Tvalidate, Xtest, Ttest

    for validation_i in range(n_folds):
        for test_i in range(n_folds):
            if test_i == validation_i:
                continue

            train_i = np.setdiff1d(range(n_folds), [validation_i, test_i])

            Xvalidate, Tvalidate = folds[validation_i]
            Xtest, Ttest = folds[test_i]
            if len(train_i) > 1:
                Xtrain1 = np.vstack([folds[i][0] for i in train_i])
                Ttrain1 = np.vstack([folds[i][1] for i in train_i])
                sm = SMOTE(random_state=42)
                Xtrain, Ttrain2 = sm.fit_resample(Xtrain1, Ttrain1)
            else:
                Xtrain1, Ttrain1 = folds[train_i[0]]
                sm = SMOTE(random_state=42)
                Xtrain, Ttrain2 = sm.fit_resample(Xtrain1, Ttrain1)
            

            yield Xtrain, Ttrain, Xvalidate, Tvalidate, Xtest, Ttest

In [12]:
def run_k_fold_cross_validation(X, T, n_folds, list_of_n_hiddens, 
                                n_epochs, learning_rate, act_func, gpu = False):
   
    nn_arch = []
    n_samples = X.shape[0]
    n_features = X.shape[1]
    n_outputs = T.shape[1]
    Train = []
    Validate = []
    Test = []
    results = []
    if gpu:
        print("Moving data and model to GPU. \nCurrent CUDA device: %s" %(torch.cuda.get_device_name(torch.cuda.current_device)))
    else :
        print("Running on CPU")
    for nh in list_of_n_hiddens: # Layer sizes
        
        
        for Xtrain, Ttrain, Xvalidate, Tvalidate, Xtest, Ttest in generate_k_fold_cross_validation_sets(X, T, n_folds):
                nnet =  NeuralNetworkClassifier(Xtrain.shape[1], nh, len(classes))
#                 if gpu:
#                     nnet = nnet.cuda()
                print('Hidden Layers: ', nh)
                start = time.time()
                #nnet =  NeuralNetworkClassifier(Xtrain.shape[1], nh, len(classes))
                nnet.train(Xtrain, Ttrain, n_epochs, learning_rate, method='adam', verbose=True)
                Time = (time.time() - start) / 60/ 60
       # append the  results of each experiment 
        #nnet.train(Xtrain, Ttrain, n_epochs, learning_rate, method='adam', verbose=True)\
                Y_trclasses, Y_trprobs = nnet.use(Xtrain)

                Train.append(100 * np.mean(Y_trclasses == Ttrain))
        
        
        #nnet.train(Xtrain, Ttrain, n_epochs, learning_rate, method='adam', verbose=True)
                Y_vclasses, Y_vprobs = nnet.use(Xvalidate)

                Validate.append(100 * np.mean(Y_vclasses == Tvalidate))
        
        #nnet.train(Xtrain, Ttrain, n_epochs, learning_rate, method='adam', verbose=True)
                Y_tclasses, Y_tprobs = nnet.use(Xtest)

                Test.append(100 * np.mean(Y_tclasses == Ttest))
               
        
        #Averaging values over all the folds.
        Train1 = sum(Train)/len(Train)
        Validate1 = sum(Validate)/len(Validate)
        Test1 = sum(Test)/len(Test)
        results.append([nh, Train1, Validate1, Test1, Time])
        # load these into a dataframe and give it some column titles
    final_df = pd.DataFrame(results, columns=['Hidden Layers', 'Train', 'Validate', 'Test', 'Time'])
    print(final_df)
    return final_df
    

We have run experiments for three different architectures. One with a single hidden layer with 10 neurons. Another with 2 hidden layers with 50 neurons each. The final one has 3 hidden layers with 30 neurons each. We have run the experiments for 200 epochs for a learning rate of 0.01.

In [23]:
torch.manual_seed(42)
np.random.seed(42)

import time
start = time.time()

results = run_k_fold_cross_validation(X, T, 5,
                                      [[10], [50, 50], [30, 30, 30]],
                                      200, 0.01, 'tanh', True)

elapsed = (time.time() - start) / 60/ 60
print(f'Took {elapsed:.2f} hours')
results

Moving data and model to GPU. 
Current CUDA device: NVIDIA TITAN V
Hidden Layers:  [10]
Adam: Epoch 20 Error=0.89968
Adam: Epoch 40 Error=0.98661
Adam: Epoch 60 Error=0.99489
Adam: Epoch 80 Error=0.99615
Adam: Epoch 100 Error=0.99659
Adam: Epoch 120 Error=0.99683
Adam: Epoch 140 Error=0.99698
Adam: Epoch 160 Error=0.99708
Adam: Epoch 180 Error=0.99716
Adam: Epoch 200 Error=0.99721
Hidden Layers:  [10]
Adam: Epoch 20 Error=0.88899
Adam: Epoch 40 Error=0.98201
Adam: Epoch 60 Error=0.99292
Adam: Epoch 80 Error=0.99504
Adam: Epoch 100 Error=0.99582
Adam: Epoch 120 Error=0.99623
Adam: Epoch 140 Error=0.99650
Adam: Epoch 160 Error=0.99669
Adam: Epoch 180 Error=0.99682
Adam: Epoch 200 Error=0.99692
Hidden Layers:  [10]
Adam: Epoch 20 Error=0.90962
Adam: Epoch 40 Error=0.98517
Adam: Epoch 60 Error=0.99382
Adam: Epoch 80 Error=0.99560
Adam: Epoch 100 Error=0.99647
Adam: Epoch 120 Error=0.99699
Adam: Epoch 140 Error=0.99727
Adam: Epoch 160 Error=0.99745
Adam: Epoch 180 Error=0.99758
Adam: Epoch 

Unnamed: 0,Hidden Layers,Train,Validate,Test,Time
0,[10],99.937209,99.937063,99.937063,0.013166
1,"[50, 50]",99.943558,99.93952,99.93952,0.064144
2,"[30, 30, 30]",99.951751,99.942154,99.942359,0.071395


For the first configuration with a single hidden layer of 10 neurons, the model achieved high accuracy across all three datasets. It obtained an accuracy of 99.94% on the training set, as well as on the validation and test sets. The training time for this model was relatively low at 0.013 seconds.

Moving to the second configuration with two hidden layers, each consisting of 50 neurons, the model demonstrated slightly improved performance. It achieved an accuracy of 99.94% on the training set and 99.94% on the validation and test sets. However, the training time increased to 0.064 seconds, indicating a longer computational effort.

The third configuration, comprising three hidden layers with 30 neurons in each layer, showed the best performance among the analyzed models. This model achieved an accuracy of 99.95% on the training set, 99.94% on the validation set, and 99.94% on the test set. The training time increased further to 0.071 seconds.

If we consider the best model to be the one with the highest perecntage correct for the validation set. The third model is the best fit model.

Initially, we were skeptical of the high percent correct values considering the imbalanced dataset, but the values for train, validation and test models appear to be around the same range showing a good non-overfitting model.

In summary, the results suggest that increasing the number of hidden layers and neurons generally improves the model's accuracy in classifying credit card fraud. However, it is important to consider the trade-off between accuracy and training time. While more complex configurations achieve higher accuracy, they also require longer training times. Therefore, it is crucial to strike a balance based on the specific requirements of the application.



In [25]:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report

In [26]:
nnet =  NeuralNetworkClassifier(X_train.shape[1], [30,30,30], len(classes))
nnet.train(X_train, T_train, 200, 0.01, method='adam', verbose=True)
Y_tclasses, Y_tprobs = nnet.use(X_test)

Adam: Epoch 20 Error=0.99720
Adam: Epoch 40 Error=0.99815
Adam: Epoch 60 Error=0.99838
Adam: Epoch 80 Error=0.99848
Adam: Epoch 100 Error=0.99855
Adam: Epoch 120 Error=0.99861
Adam: Epoch 140 Error=0.99866
Adam: Epoch 160 Error=0.99872
Adam: Epoch 180 Error=0.99877
Adam: Epoch 200 Error=0.99883


In [27]:
print('AUC-ROC')
print(roc_auc_score(Y_tclasses, T_test))
      
print('F1-Score')
print(f1_score(Y_tclasses, T_test))
    
print('Accuracy')
print(accuracy_score(Y_tclasses, T_test))

AUC-ROC
0.9467938697024044
F1-Score
0.8428571428571429
Accuracy
0.9994850368081645


AUC-ROC (Area Under the Receiver Operating Characteristic Curve) is a metric that measures the ability of the model to distinguish between positive and negative classes. The AUC-ROC ranges from 0 to 1, with a higher value indicating a better performance of the model. From the above table, the AUC-ROC value is 0.9468 which indicates that the model has good predictive power and is able to effectively distinguish between positive and negative classes.

F1-Score is a metric that combines precision and recall, two key metrics in classification. It ranges from 0 to 1, with a higher value indicating a better performance of the model. Precision measures the proportion of true positives among all positive predictions, while recall measures the proportion of true positives among all actual positive cases. A high precision score indicates that the model is making fewer false-positive errors (i.e., the model is more selective in identifying fraudulent transactions), while a high recall score indicates that the model is making fewer false-negative errors (i.e., the model is more sensitive in detecting fraudulent transactions). In the case of credit card fraud detection, we want to maximize both precision and recall to minimize the number of false positives and false negatives. The F1-score, which is the harmonic mean of precision and recall, provides a single score that balances the trade-off between precision and recall. From above results, the F1-Score value of 0.8429 indicates that the model has a good balance between precision and recall. 

Accuracy is a metric that measures the overall correctness of the model predictions. It ranges from 0 to 1, with a higher value indicating a better performance of the model. The Accuracy value of 0.9995 that is obtained above indicates that the model has a high degree of accuracy in predicting both positive and negative classes.

Overall, the combination of these metrics suggests that the model has a high degree of predictive power and is able to effectively distinguish between positive and negative classes. However, it is important to note that these metrics should be used in conjunction with other evaluation techniques and domain knowledge to assess the overall performance of the model.

Therefore, it is important to evaluate the model's performance using multiple metrics and not just rely on the accuracy score alone. Let's analyze confusion matrix to get a better idea of how much of a good fit the model is.

In [19]:
def confusion_matrix(Y_classes, T):
    class_names = np.unique(T)
    table = []
    for true_class in class_names:
        row = []
        for Y_class in class_names:
            row.append(100 * np.mean(Y_classes[T == true_class] == Y_class))
        table.append(row)
    conf_matrix = pd.DataFrame(table, index=class_names, columns=class_names)
    return conf_matrix

In [31]:
perc_correct = 100 * np.mean(Y_tclasses == T_test)
print(f'Test accuracy in percent correct: {perc_correct:.2f}')
cm = confusion_matrix(Y_tclasses, T_test)
cm.style.background_gradient(cmap='Blues').format("{:.1f}%")

Test accuracy in percent correct: 99.95


Unnamed: 0,0,1
0,100.0%,0.0%
1,20.3%,79.7%


The above confusion matrix shows that out of all non-fraudulent transactions, the model correctly predicted 100% of them as non-fraudulent (true negatives). However, out of all fraudulent transactions, the model only correctly predicted 79.7% of them as fraudulent (true positives). This means that the model incorrectly predicted 20.3% of fraudulent transactions as non-fraudulent (false negatives). Overall, it indicates that the model is not as efficient for fraudulent transactions as it is for non-fraudulent transactions.

## Sklearn Model

In [4]:
df1 = pd.read_csv('creditcard.csv')

In [10]:
X = df1.drop(labels='Class', axis=1) # Features
T = df1.loc[:,'Class']               # Target Variable

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

Xtrain, Xtest, Ttrain, Ttest = train_test_split(X, T, test_size=0.2, random_state=42)
classifier = LogisticRegression()
classifier.fit(Xtrain, Ttrain)
Y_pred = classifier.predict(Xtest)
print("Classification Report:")
print(classification_report(Ttest, Y_pred))
auc_roc = roc_auc_score(Ttest, Y_pred)
print("AUC-ROC: ", auc_roc)

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.62      0.56      0.59        98

    accuracy                           1.00     56962
   macro avg       0.81      0.78      0.79     56962
weighted avg       1.00      1.00      1.00     56962

AUC-ROC:  0.7803132859784319


In [12]:
Y_pred1 = Y_pred.reshape(-1,1)
Y_pred1

array([[1],
       [0],
       [0],
       ...,
       [0],
       [0],
       [0]])

In [23]:
import numpy as np
perc_correct = 100 * np.mean(Y_pred == Ttest)
print(f'Test accuracy in percent correct: {perc_correct:.2f}')
cm = confusion_matrix(Y_pred1, Ttest)
cm.style.background_gradient(cmap = 'Blues').format("{:.1f} %")

Test accuracy in percent correct: 99.86


Unnamed: 0,0,1
0,99.9 %,0.1 %
1,43.9 %,56.1 %


From the confusion matrix we can say that the model accurately identified 99.9% of the non-fraudulent transactions (True Negatives, TN). This high percentage indicates that the model has a strong ability to correctly classify non-fraudulent transactions.

There were a small number of non-fraudulent transactions (0.1%) that the model incorrectly predicted as fraudulent (False Positives, FP). Although this percentage is low, it still represents a small fraction of misclassified transactions.

On the other hand, the model struggled to accurately identify fraudulent transactions (class 1). It incorrectly classified 43.9% of the fraudulent transactions as non-fraudulent (False Negatives, FN). This high percentage suggests that a significant number of fraudulent transactions were either missed or misclassified.

The model did manage to correctly predict 56.1% of the fraudulent transactions (True Positives, TP), indicating that it captured a portion of the fraudulent activity.

In summary, the logistic regression model demonstrates good performance in accurately predicting non-fraudulent transactions. However, it has limitations in accurately identifying fraudulent transactions, with a relatively high rate of false negatives. This suggests that further improvements or adjustments may be necessary to enhance the model's ability to detect fraudulent activity effectively.