# Autoencoders
Autoencoders are an unsupervised learning technique in which we leverage neural networks for the task of representation learning. Specifically, we design a neural network architecture that imposes a bottleneck in the network, which forces a compressed knowledge representation of the original input. The main idea is to pass in the input and train the network to reproduce that same input, so the input and the target output are identical. This helps with feature extraction, with learning a compact data representation, and with reducing the number of inputs needed to train a downstream model. We will go into more detail later in this kernel.

## Load required libraries

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)

from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD

import torch
from torch import nn, optim
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, SubsetRandomSampler
import random

from fastprogress import master_bar, progress_bar
from IPython.display import display

import warnings
warnings.filterwarnings('ignore')

**Add seed for code reproducibility**

In [2]:
SEED = 7

torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True

np.random.seed(SEED)
random.seed(SEED)

## Load the data

In [3]:
df = pd.read_csv('../input/creditcardfraud/creditcard.csv')
print(df.shape)
df.head()

(284807, 31)


,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
# Minor preprocessing: convert elapsed seconds to hour of day (0 <= Time < 24)
df['Time'] = df['Time'] / 3600 % 24
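
The preprocessing step relies on operator precedence: `/` and `%` have the same precedence and are evaluated left to right, so the expression reads as `(Time / 3600) % 24`. Assuming `Time` holds seconds elapsed since the first transaction, the result is a fractional hour of day. A small sketch on made-up values (the `seconds` series is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical elapsed-seconds values, not rows from creditcard.csv
seconds = pd.Series([0.0, 3600.0, 90000.0, 172792.0])

# Same expression as the preprocessing cell: seconds -> hours -> wrap at 24
hour_of_day = seconds / 3600 % 24

print(hour_of_day.round(3).tolist())
```

Values from the second day (such as 90000 seconds, i.e. hour 25) wrap around to the same hour as the first day, which is what lets the model treat `Time` as a daily cycle.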

**First let's see how much data we have for each class**

In [5]:
df['Class'].value_counts(normalize=True)

0    0.998273
1    0.001727
Name: Class, dtype: float64

The dataset is highly imbalanced: only 0.1727% of the transactions are fraud and 99.8273% are non-fraud. This is typical for fraud data, because fraud is rare. It still costs banks a lot, so we need a way to predict fraud cases even with very little data for that class.
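
One consequence of this imbalance is that plain accuracy says very little. The sketch below uses synthetic labels with roughly the class ratio reported above (the counts `n_total` and `n_fraud` are assumptions for illustration, not the real data) and scores a "classifier" that never flags fraud:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with roughly the fraud ratio reported above (assumed values)
n_total, n_fraud = 100_000, 173
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# A trivial "model" that predicts non-fraud for every transaction
y_pred = np.zeros(n_total, dtype=int)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

print(f'accuracy: {acc:.5f}')
print(f'recall:   {rec:.5f}')
```

The accuracy is above 99.8% even though no fraud case is caught, which is why recall, precision, F1 and ROC AUC are reported alongside accuracy later in this kernel.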

## Visualize the data with TSNE and PCA

### TSNE
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a (prize-winning) technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. If you want to learn how t-SNE works behind the scenes, check out this [video](https://www.youtube.com/watch?v=NEaUSP4YerM).

**NOTE:** Although extremely useful for visualizing high-dimensional data, t-SNE plots can sometimes be mysterious or misleading. You can read more about it [here](https://distill.pub/2016/misread-tsne/).

### PCA
Principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize. PCA is sensitive to outliers, so they should be removed before applying it. If you want to learn how PCA works behind the scenes, check out this [video](https://www.youtube.com/watch?v=FgakZw6K1QQ).

### TruncatedSVD
Truncated Singular Value Decomposition (SVD) is a matrix factorization technique that factors a matrix M into the three matrices U, Σ, and V. This is very similar to PCA, except that the factorization for SVD is done on the data matrix itself, whereas for PCA the factorization is done on the covariance matrix (that is, on the centered data). Typically, SVD is used under the hood to find the principal components of a matrix.
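
The relationship between the two can be checked numerically. The following is a minimal sketch on random data (`X_demo` is a made-up matrix with a non-zero mean, not the credit card data): the PCA directions match an SVD of the centered matrix up to sign, while `TruncatedSVD`, which does not center its input, generally finds different directions.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

# Made-up data with a non-zero mean, so centering matters
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5)) + 3.0

# PCA centers the data internally before factorizing
pca = PCA(n_components=2, svd_solver='full').fit(X_demo)

# Plain SVD of the explicitly centered data matrix
X_centered = X_demo - X_demo.mean(axis=0)
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)

# The leading right singular vectors equal the principal axes up to sign
same_as_pca = np.allclose(np.abs(pca.components_), np.abs(vt[:2]), atol=1e-6)

# TruncatedSVD works on the uncentered matrix, so its directions differ here
tsvd = TruncatedSVD(n_components=2, random_state=0).fit(X_demo)
differs = not np.allclose(np.abs(tsvd.components_), np.abs(pca.components_), atol=1e-3)

print(same_as_pca, differs)
```

On data that is already centered the two methods would agree much more closely, which is one reason the PCA and TruncatedSVD plots can look similar after scaling.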

## Plotly for data visualization

In [6]:
def plot_scatter(X, y, mode='TSNE', fname='file.png'):
    if mode == 'TSNE':
        X_r = TSNE(n_components=2, random_state=SEED).fit_transform(X)
    elif mode == 'PCA':
        X_r = PCA(n_components=2, random_state=SEED).fit_transform(X)
    elif mode == 'TSVD':
        X_r = TruncatedSVD(n_components=2, random_state=SEED).fit_transform(X)
    else:
        print('[ERROR]: Please select a valid mode')
        return
        
    traces = []
    traces.append(go.Scatter(x=X_r[y == 0, 0], y=X_r[y == 0, 1], mode='markers', showlegend=True, name='Non Fraud'))
    traces.append(go.Scatter(x=X_r[y == 1, 0], y=X_r[y == 1, 1], mode='markers', showlegend=True, name='Fraud'))

    layout = dict(title=f'{mode} plot')
    fig = go.Figure(data=traces, layout=layout)
    py.iplot(fig, filename=fname)

For the sake of visualization, let's take a sample of the non-fraud cases and all of the fraud cases.

In [7]:
fraud = df.loc[df['Class'] == 1]
non_fraud = df.loc[df['Class'] == 0].sample(3000)

new_df = pd.concat([fraud, non_fraud]).sample(frac=1.).reset_index(drop=True)
y = new_df.pop('Class')

In [8]:
plot_scatter(new_df, y, mode='TSNE', fname='tsne1.png')

In [9]:
plot_scatter(new_df, y, mode='PCA', fname='pca1.png')

In [10]:
plot_scatter(new_df, y, mode='TSVD', fname='tsvd1.png')

**Create DataLoaders**

In [11]:
def get_dls(data, batch_sz, n_workers, valid_split=0.2):
    d_size = len(data)
    ixs = np.random.permutation(range(d_size))

    split = int(d_size * valid_split)
    train_ixs, valid_ixs = ixs[split:], ixs[:split]

    train_sampler = SubsetRandomSampler(train_ixs)
    valid_sampler = SubsetRandomSampler(valid_ixs)

    # Input and output data should be same
    ds = TensorDataset(torch.from_numpy(data).float(), torch.from_numpy(data).float())

    train_dl = DataLoader(ds, batch_sz, sampler=train_sampler, num_workers=n_workers)
    valid_dl = DataLoader(ds, batch_sz, sampler=valid_sampler, num_workers=n_workers)

    return train_dl, valid_dl
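
To see what this split produces, here is a minimal sketch that repeats the same steps on a tiny made-up array (`toy`, 10 rows by 3 features, is hypothetical). It assumes the same 20% validation split and checks that every batch carries identical inputs and targets, as an autoencoder requires:

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader, SubsetRandomSampler

# Hypothetical toy data: 10 rows, 3 features
toy = np.arange(30, dtype=np.float32).reshape(10, 3)

# Shuffle the row indices and hold out 20% for validation
ixs = np.random.permutation(len(toy))
split = int(len(toy) * 0.2)
train_ixs, valid_ixs = ixs[split:].tolist(), ixs[:split].tolist()

# Inputs and targets are the same tensor
tensor = torch.from_numpy(toy)
ds = TensorDataset(tensor, tensor)

train_dl = DataLoader(ds, batch_size=4, sampler=SubsetRandomSampler(train_ixs))
valid_dl = DataLoader(ds, batch_size=4, sampler=SubsetRandomSampler(valid_ixs))

# Count the rows each loader yields and inspect one batch
n_train = sum(len(xb) for xb, _ in train_dl)
n_valid = sum(len(xb) for xb, _ in valid_dl)
xb, yb = next(iter(train_dl))

print(n_train, n_valid, torch.equal(xb, yb))
```

Both loaders share one dataset and differ only in their sampler, so no row is copied and the train and validation indices never overlap.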

In [12]:
def train(epochs, model, train_dl, valid_dl, optimizer, criterion, device):
    model = model.to(device)

    mb = master_bar(range(epochs))
    mb.write(['epoch', 'train loss', 'valid loss'], table=True)

    for ep in mb:
        model.train()
        train_loss = 0.
        for train_X, train_y in progress_bar(train_dl, parent=mb):
            train_X, train_y = train_X.to(device), train_y.to(device)
            train_out = model(train_X)
            loss = criterion(train_out, train_y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            mb.child.comment = f'{loss.item():.4f}'

        with torch.no_grad():
            model.eval()
            valid_loss = 0.
            for valid_X, valid_y in progress_bar(valid_dl, parent=mb):
                valid_X, valid_y = valid_X.to(device), valid_y.to(device)
                valid_out = model(valid_X)
                loss = criterion(valid_out, valid_y)
                valid_loss += loss.item()
                mb.child.comment = f'{loss.item():.4f}'

        mb.write([f'{ep+1}', f'{train_loss/len(train_dl):.6f}', f'{valid_loss/len(valid_dl):.6f}'], table=True)

## AutoEncoder Architecture

An autoencoder consists of two sub-models: an encoder and a decoder. The encoder learns to extract features from the input in order to represent the same data in a different-dimensional space. The decoder learns to decode these features to reconstruct the input data. Usually an autoencoder is used to extract features of lower dimension than the input (note that the encoder below outputs 40 features for 30 inputs), but the main idea here is to test whether an autoencoder can extract features that help distinguish one class of data from the other.

<img src="https://cdn-images-1.medium.com/max/1600/1*44eDEuZBEsmG_TCAKRI3Kw@2x.png" width="500"></img>

In [13]:
class AutoEncoder(nn.Module):
    def __init__(self, f_in):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(f_in, 100),
            nn.Tanh(),
            nn.Dropout(0.2),
            nn.Linear(100, 70),
            nn.Tanh(),
            nn.Dropout(0.2),
            nn.Linear(70, 40)
        )
        self.decoder = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Linear(40, 40),
            nn.Tanh(),
            nn.Dropout(0.2),
            nn.Linear(40, 70),
            nn.Tanh(),
            nn.Dropout(0.2),
            nn.Linear(70, f_in)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

In [14]:
EPOCHS = 10
BATCH_SIZE = 512
N_WORKERS = 0

model = AutoEncoder(30)
criterion = F.mse_loss
optimizer = optim.Adam(model.parameters(), lr=1e-3)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

We will train the autoencoder only on a sample of non-fraud cases so that it learns to reconstruct them. This way the encoder learns the structure of one class, which should help distinguish it from the other class.

In [15]:
X = df.drop('Class', axis=1).values
y = df['Class'].values

X = MinMaxScaler().fit_transform(X)
X_nonfraud = X[y == 0]
X_fraud = X[y == 1]
train_dl, valid_dl = get_dls(X_nonfraud[:5000], BATCH_SIZE, N_WORKERS)

Let's train our model. It will take only a few seconds to train on a GPU.

In [16]:
train(EPOCHS, model, train_dl, valid_dl, optimizer, criterion, device)

epoch,train loss,valid loss
1,0.286649,0.19775
2,0.132643,0.04212
3,0.04726,0.008865
4,0.030612,0.006376
5,0.024506,0.003613
6,0.0201,0.002872
7,0.017651,0.002157
8,0.016134,0.002071
9,0.014857,0.002022
10,0.013886,0.001923


We need to encode the data using only the encoder, because that's the part of the model responsible for extracting features.

In [17]:
with torch.no_grad():
    model.eval()
    non_fraud_encoded = model.encoder(torch.from_numpy(X_nonfraud).float().to(device)).cpu().numpy()
    fraud_encoded = model.encoder(torch.from_numpy(X_fraud).float().to(device)).cpu().numpy()

nrows = 3000
sample_encoded_X = np.append(non_fraud_encoded[:nrows], fraud_encoded, axis=0)
sample_encoded_y = np.append(np.zeros(nrows), np.ones(len(fraud_encoded)))

In [18]:
plot_scatter(sample_encoded_X, sample_encoded_y, mode='TSNE', fname='tsne2.png')

t-SNE was able to cluster most of the data separately.

In [19]:
plot_scatter(sample_encoded_X, sample_encoded_y, mode='PCA', fname='pca2.png')

The PCA plot is not as good, probably because PCA is linear, so it cannot capture complex polynomial relations among the independent variables.
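
As a rough illustration of that limitation, the sketch below uses scikit-learn's `make_circles` toy data (not the credit card data). A linear classifier on PCA output cannot separate two concentric rings, while a single hand-made polynomial feature, the squared radius, separates them almost perfectly. The sample size, `factor` and `noise` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Toy data: two concentric rings, one per class
X_c, y_c = make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=0)

# PCA with 2 components only rotates the 2-D data, so the rings stay nested
X_pca = PCA(n_components=2).fit_transform(X_c)
acc_pca = LogisticRegression().fit(X_pca, y_c).score(X_pca, y_c)

# A polynomial feature (squared distance from the origin) makes the classes
# linearly separable
r2 = (X_c ** 2).sum(axis=1, keepdims=True)
acc_r2 = LogisticRegression().fit(r2, y_c).score(r2, y_c)

print(f'linear model on PCA features:   {acc_pca:.3f}')
print(f'linear model on squared radius: {acc_r2:.3f}')
```

t-SNE is non-linear, which is one plausible reason its plot of the encoded features looks cleaner than the PCA plot.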

In [20]:
plot_scatter(sample_encoded_X, sample_encoded_y, mode='TSVD', fname='tsvd2.png')

In [21]:
def print_metric(model, df, y, scaler=None):
    X_train, X_val, y_train, y_val = train_test_split(df, y, test_size=0.2, shuffle=True, random_state=SEED, stratify=y)
    mets = [accuracy_score, precision_score, recall_score, f1_score]

    if scaler is not None:
        X_train = scaler.fit_transform(X_train)
        X_val = scaler.transform(X_val)

    model.fit(X_train, y_train)
    train_preds = model.predict(X_train)
    train_probs = model.predict_proba(X_train)[:, 1]
    val_preds = model.predict(X_val)
    val_probs = model.predict_proba(X_val)[:, 1]

    train_met = pd.Series({m.__name__: m(y_train, train_preds) for m in mets})
    train_met['roc_auc'] = roc_auc_score(y_train, train_probs)
    val_met = pd.Series({m.__name__: m(y_val, val_preds) for m in mets})
    val_met['roc_auc'] = roc_auc_score(y_val, val_probs)
    met_df = pd.DataFrame()
    met_df['train'] = train_met
    met_df['valid'] = val_met

    display(met_df)

**Let's try with a simple model**

In [22]:
encoded_X = np.append(non_fraud_encoded, fraud_encoded, axis=0)
encoded_y = np.append(np.zeros(len(non_fraud_encoded)), np.ones(len(fraud_encoded)))

In [23]:
clf = LogisticRegression(random_state=SEED)
print('Metric scores for original data:')
print_metric(clf, X, y)
print('Metric score for encoded data:')
print_metric(clf, encoded_X, encoded_y)

Metric scores for original data:


metric,train,valid
accuracy_score,0.999052,0.999034
precision_score,0.864754,0.890909
recall_score,0.535533,0.5
f1_score,0.661442,0.640523
roc_auc,0.976933,0.979828


Metric score for encoded data:


metric,train,valid
accuracy_score,0.998424,0.998402
precision_score,1.0,1.0
recall_score,0.088832,0.071429
f1_score,0.16317,0.133333
roc_auc,0.922028,0.916397


Looks like the model is not optimal. Maybe the final classifier is not robust enough, or our autoencoder needs a different architecture and better hyperparameters. On the validation set the ROC AUC on the encoded features (0.916) is still in the same range as the baseline (0.980), although recall is much lower (0.071 vs. 0.5). t-SNE, PCA and TruncatedSVD on the encoded features also created plots that distinguish between the classes, even though our data was highly imbalanced. Because this dataset is so imbalanced, the technique is also worth trying on datasets that are less imbalanced.