<h1 align="center">Semi-Supervised Classification Using Neural Network Autoencoders</h1>

## Introduction

This semi-supervised approach uses unlabelled data to train a neural network on a single class, so that the network learns the best possible representation of that class. Because the other class is distributed differently, its examples tend to separate from the first class in the learned representation.

 
## 1. Dataset Preparation

First, we load the required libraries and read the dataset into a pandas DataFrame.
 



In [2]:
from keras.layers import Input, Dense
from keras.models import Model, Sequential
from keras import regularizers
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,accuracy_score
from sklearn import preprocessing
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
np.random.seed(203)



Using TensorFlow backend.


In [38]:
data = pd.read_csv('../input/creditcardfraud/creditcard.csv')
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [39]:
data['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64
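
The counts above show how imbalanced the two classes are. As a quick arithmetic check, the sketch below computes the fraud share from those two numbers; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
n_normal, n_fraud = 284315, 492

# Fraction of all rows that are fraud, and how many normal rows exist per fraud row
fraud_share = n_fraud / (n_normal + n_fraud)
normal_per_fraud = n_normal / n_fraud

print(f"fraud share: {fraud_share:.3%}")
print(f"non-fraud rows per fraud row: {normal_per_fraud:.0f}")
```

Fraud makes up roughly 0.17% of the rows, so a model that always predicts "non-fraud" would already score above 99.8% accuracy on the full dataset. This is why per-class precision and recall matter more than raw accuracy here.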

Autoencoders are a special type of neural network architecture in which the target output is the same as the input.
Autoencoders are trained in an unsupervised manner to learn a compressed, low-level representation of the input data.
These low-level features are then decoded back to reconstruct the actual data.
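
To make the encode/decode idea concrete, here is a minimal NumPy sketch of a *linear* stand-in for an autoencoder. It is not the Keras network used in this notebook: it assumes toy random data and uses PCA via SVD, which spans the same subspace that a linear autoencoder trained with MSE loss would learn. All names (`X_toy`, `encode`, `decode`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (not the credit card data): 200 samples in 5 dimensions that
# really lie close to a 2-D subspace, plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_toy = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# Centre the data and use the top-2 right singular vectors as the "weights".
mean = X_toy.mean(axis=0)
_, _, vt = np.linalg.svd(X_toy - mean, full_matrices=False)
W = vt[:2].T  # shape (5, 2)

def encode(x):
    """Compress 5 input features into a 2-D code."""
    return (x - mean) @ W

def decode(code):
    """Expand a 2-D code back into 5 features."""
    return code @ W.T + mean

codes = encode(X_toy)
X_rec = decode(codes)

# The training objective of an autoencoder: input and reconstruction should match.
reconstruction_mse = float(np.mean((X_toy - X_rec) ** 2))
print(codes.shape, X_rec.shape, reconstruction_mse)
```

The network in this notebook follows the same pattern, but with non-linear `Dense` layers in place of the single matrix `W`, and with the 50-unit hidden layer playing the role of `codes`.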


In [40]:
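# Sample 1,000 non-fraud rows and keep every fraud row
non_fraud = data[data['Class'] == 0].sample(1000)
fraud = data[data['Class'] == 1]
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
df = pd.concat([non_fraud, fraud]).sample(frac=1).reset_index(drop=True)
X = df.drop(['Class'],axis=1).values
y = df['Class'].values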

In [41]:
# input layer
input_layer = Input(shape = (X.shape[1],))

# encoding part
encoded = Dense(100, activation='tanh', activity_regularizer = regularizers.l1(10e-5))(input_layer)
encoded = Dense(50, activation='relu')(encoded)

# decoding part
decoded = Dense(50, activation = 'tanh')(encoded)
decoded = Dense(100, activation = 'tanh')(decoded)

# output layer
output_layer = Dense(X.shape[1], activation='relu')(decoded)

In [42]:
autoencoder = Model(input_layer,output_layer)
autoencoder.compile(optimizer='adadelta',loss='mse')

In [43]:
x = data.drop(['Class'],axis=1).values
y = data['Class'].values

x_scale = preprocessing.MinMaxScaler().fit_transform(x)
x_norm, x_fraud = x_scale[y==0], x_scale[y==1] 
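
For readers unfamiliar with min-max scaling, the sketch below reproduces in plain NumPy what `MinMaxScaler` does with its default `feature_range=(0, 1)`: each column is shifted and divided so that its minimum maps to 0 and its maximum to 1. The helper `min_max_scale` and the toy array are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def min_max_scale(a):
    """Rescale each column of `a` to [0, 1]; constant columns map to 0."""
    a = np.asarray(a, dtype=float)
    col_min = a.min(axis=0)
    col_range = a.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero
    return (a - col_min) / col_range

# Toy two-column array with very different scales (hypothetical values)
toy = np.array([[149.62, -1.36],
                [2.69, 1.19],
                [378.66, -1.36]])

scaled = min_max_scale(toy)
print(scaled.min(axis=0), scaled.max(axis=0))
```

Scaling matters here because columns such as `Amount` and `Time` have far larger ranges than the `V1` to `V28` features, and an unscaled MSE loss would be dominated by them.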

The beauty of this approach is that we do not need many samples to learn a good representation. We will use only 2,000 non-fraud rows to train the autoencoder. Additionally, we do not need to train the model for a large number of epochs.

In [44]:
autoencoder.fit(x_norm[0:2000],x_norm[0:2000],
               batch_size = 256, 
               epochs = 10,
               shuffle = True,
               validation_split = 0.20)

Train on 1600 samples, validate on 400 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7c8894f6dd30>

## Obtain the latent representations

In [46]:
hidden_representation = Sequential()
hidden_representation.add(autoencoder.layers[0])
hidden_representation.add(autoencoder.layers[1])
hidden_representation.add(autoencoder.layers[2])

In [47]:
norm_hid_rep = hidden_representation.predict(x_norm[:3000])
fraud_hid_rep = hidden_representation.predict(x_fraud)

In [48]:
norm_hid_rep.shape

(3000, 50)

In [51]:
rep_x = np.append(norm_hid_rep,fraud_hid_rep,axis=0)
y_n = np.zeros(norm_hid_rep.shape[0])
y_f = np.ones(fraud_hid_rep.shape[0])
rep_y = np.append(y_n,y_f)

## Simple Linear Classifier

In [54]:
train_x, val_x, train_y, val_y = train_test_split(rep_x,rep_y,test_size=0.25)
lr = LogisticRegression(solver='lbfgs').fit(train_x,train_y)
pred_y = lr.predict(val_x)
print(classification_report(val_y,pred_y))
print(accuracy_score(val_y,pred_y))

              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99       755
         1.0       1.00      0.90      0.95       118

   micro avg       0.99      0.99      0.99       873
   macro avg       0.99      0.95      0.97       873
weighted avg       0.99      0.99      0.99       873

0.9862542955326461
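
To read the report above, it may help to see how precision, recall, and F1 are derived from confusion-matrix counts. The sketch below uses small hypothetical counts (not the actual counts behind the report) chosen so that the fraud class shows the same pattern: no false alarms, but some missed frauds.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for the fraud class: 9 frauds caught,
# 0 normal rows flagged as fraud, 1 fraud missed.
precision, recall, f1 = precision_recall_f1(tp=9, fp=0, fn=1)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A precision of 1.00 with a recall of 0.90 for class `1.0` therefore means that every transaction flagged as fraud really was fraud, while about one in ten frauds in the validation split went undetected.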
