# Variational AutoEncoder to create Embedding of Merchants

In this notebook, I use a Variational AutoEncoder (VAE) to create a Merchant Embedding. The embedding can then be fed to ML algorithms as a compact representation that carries more semantic information and captures similarity between Merchants.

* **Introduction**
    * What is an Embedding?
    * How to use the Merchant Embedding?
    * What is a Variational Autoencoder (VAE)?
* **Data Preparation**
    * Load Dataset
    * Data Engineering
* **Training VAE**
* **Visualization of latent space**


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import missingno as msno
import matplotlib.pyplot as plt
import os

%matplotlib inline

## Introduction

### What is an Embedding?

An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.

Word embeddings in natural language modelling are a familiar example: they map words or phrases from a vocabulary to vectors of real numbers. As well as being amenable to processing by learning algorithms, this vector representation has two important and advantageous properties:

* **Dimensionality Reduction** — it is a more efficient representation
* **Contextual Similarity** — it is a more expressive representation
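
To make "contextual similarity" concrete, here is a minimal sketch with made-up 3-dimensional vectors. The merchant names and values are purely illustrative and are not taken from the dataset; cosine similarity is just one common way to compare embeddings.

```python
import numpy as np

# Hypothetical embeddings; the values are invented for illustration.
embeddings = {
    "coffee_shop": np.array([0.9, 0.1, 0.0]),
    "bakery":      np.array([0.8, 0.2, 0.1]),
    "gas_station": np.array([0.0, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_close = cosine_similarity(embeddings["coffee_shop"], embeddings["bakery"])
sim_far = cosine_similarity(embeddings["coffee_shop"], embeddings["gas_station"])

# Similar inputs should score close to 1, unrelated ones close to 0.
print(sim_close, sim_far)
```

In a good embedding, items that behave alike end up with a high similarity score even though no one told the model they were related.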

### How to use the Merchant Embedding?

![](https://www.fast.ai/images/instacart.png)
Source: https://www.fast.ai/2018/04/29/categorical-embeddings/

We can use the embedding as model input: it has reduced dimensionality but retains much of the semantic information. The example above shows product, store and customer embeddings used in a consumer product recommendation model.

This notebook only creates the embeddings of all Merchants, for later use by ML models.
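
As a rough sketch of how a downstream model could consume the result, the embedding table can be joined to any table that references merchants by id. Both frames below are hypothetical: the `transactions` table, the `emb_0`/`emb_1` column names and all values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical embedding table: one row per merchant, two embedding dimensions.
merchant_emb = pd.DataFrame({
    "merchant_id": ["M_1", "M_2"],
    "emb_0": [0.12, -0.40],
    "emb_1": [0.85, 0.33],
})

# Hypothetical transactions that reference merchants by id.
transactions = pd.DataFrame({
    "card_id": ["C_1", "C_1", "C_2"],
    "merchant_id": ["M_1", "M_2", "M_1"],
    "purchase_amount": [10.0, 25.5, 7.2],
})

# A left join keeps every transaction and attaches the merchant's embedding.
features = transactions.merge(merchant_emb, on="merchant_id", how="left")
print(features)
```

Each transaction row now carries the dense merchant vector, which a model can use in place of a high-cardinality merchant id.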

### What is a Variational Autoencoder (VAE)?

* https://www.jeremyjordan.me/variational-autoencoders/
* https://blog.keras.io/building-autoencoders-in-keras.html

A variational autoencoder (VAE) provides a probabilistic manner for describing an observation in latent space. Thus, rather than building an encoder which outputs a single value to describe each latent state attribute, we'll formulate our encoder to describe a probability distribution for each latent attribute.

![](https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-18-at-12.24.19-AM.png)

The latent layer ("Sample from distributions" in the figure) is our Embedding Layer. I will encode all Merchants and take the values of this layer as their embedding.

**How does a variational autoencoder work?**

First, an encoder network turns the input samples x into two parameters in a latent space, which we will denote *z_mean* and *z_log_sigma*. Then, we randomly sample similar points z from the latent normal distribution that is assumed to generate the data, via *z = z_mean + exp(z_log_sigma) * epsilon*, where epsilon is a random normal tensor. Finally, a decoder network maps these latent space points back to the original input data.
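
The sampling step can be checked numerically with plain NumPy. This is only a sketch: the values of `z_mean` and `z_log_sigma` are invented stand-ins for what an encoder might output for a single merchant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one input: a 2-dimensional latent space.
z_mean = np.array([1.0, -2.0])
z_log_sigma = np.array([-1.0, 0.0])   # log of the standard deviation

# Draw many epsilon ~ N(0, 1) and apply z = z_mean + exp(z_log_sigma) * epsilon.
epsilon = rng.standard_normal(size=(100_000, 2))
z = z_mean + np.exp(z_log_sigma) * epsilon

# The samples should be centred on z_mean with spread exp(z_log_sigma).
print(z.mean(axis=0))
print(z.std(axis=0))
```

Because the randomness lives entirely in `epsilon`, gradients can flow through `z_mean` and `z_log_sigma` during training. The model code further down parameterises the encoder with the log variance instead, so the standard deviation there is `exp(0.5 * z_log_var)`.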

The parameters of the model are trained via two loss functions: a reconstruction loss forcing the decoded samples to match the initial inputs, and the KL divergence between the learned latent distribution and the prior distribution, acting as a regularization term. You could actually get rid of this latter term entirely, although it does help in learning well-formed latent spaces and reducing overfitting to the training data.
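
For a diagonal Gaussian posterior and a standard normal prior, the KL term has a closed form. The NumPy sketch below mirrors the expression used in the `KLDivergenceLayer` later in this notebook; the function name `kl_to_standard_normal` and the test values are my own.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - np.square(mu) - np.exp(log_var), axis=-1)

# When the encoder outputs exactly the prior, the penalty vanishes.
kl_prior = kl_to_standard_normal(np.zeros(4), np.zeros(4))

# Moving the mean away from zero is penalised by 0.5 * mu^2 per dimension.
kl_shifted = kl_to_standard_normal(np.array([2.0, 0.0, 0.0, 0.0]), np.zeros(4))

print(kl_prior, kl_shifted)
```

This is why the KL term acts as a regularizer: it pulls every merchant's latent distribution towards the same prior, which keeps the latent space compact and continuous.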

## Data Preparation

This section transforms the variables of the original dataset, fixes missing values and normalizes the data for training.

In [None]:
df = pd.read_csv('../input/merchants.csv')

print("Size of the dataframe: ", df.shape)
display(df.head(5))

In [None]:
df.info()

#### Data Engineering

* Fix missing values
* One-hot encode categorical columns
* Min-max normalization

#### Fix missing values

In [None]:
# Keep only the columns that contain missing values
null_columns=df.columns[df.isnull().any()]
msno.bar(df[null_columns])

The only features with missing values are **avg_sales_lag3**, **avg_sales_lag6**, **avg_sales_lag12** and **category_2**.

For the float columns, I fill the missing values with the column mean.

In [None]:
for c in ['avg_sales_lag3', 'avg_sales_lag6', 'avg_sales_lag12']:
    df[c] = df[c].fillna(df[c].mean())

For the categorical column, I assign the missing values to a new, separate category.

In [None]:
# Assign missing values to a new category code (max + 1)
df['category_2'] = df.category_2.fillna(df.category_2.max()+1)
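
Both imputation steps can be seen on a tiny frame. The column names mirror the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented.
toy = pd.DataFrame({
    "avg_sales_lag3": [1.0, np.nan, 3.0],
    "category_2": [1.0, 2.0, np.nan],
})

# Numeric column: the mean ignores NaN, so the gap becomes (1 + 3) / 2 = 2.
toy["avg_sales_lag3"] = toy["avg_sales_lag3"].fillna(toy["avg_sales_lag3"].mean())

# Categorical column: missing values get their own code, max + 1 = 3.
toy["category_2"] = toy["category_2"].fillna(toy["category_2"].max() + 1)

print(toy)
```

Giving missing categories their own code lets the one-hot encoding treat "unknown" as a regular category instead of silently dropping it.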

In [None]:
# Replace infinite values (and any remaining NaN) with zero
df = df.replace([np.inf, -np.inf], np.nan).fillna(0)

#### One-hot encoding of categorical columns

In [None]:
# merchant_group_id is left out of the one-hot encoding
categorical_columns = ['merchant_category_id','subsector_id',
                       'category_1', 'most_recent_sales_range', 'most_recent_purchases_range',
                       'category_4', 'city_id', 'state_id', 'category_2']

df_enc = pd.get_dummies(df, columns=categorical_columns)
print(df_enc.shape)
df_enc.head()
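
A small sketch of what `pd.get_dummies` does with the `columns` argument, on an invented three-row frame (the values are assumptions, only the call itself matches the cell above):

```python
import pandas as pd

# Toy frame; the values are invented.
toy = pd.DataFrame({
    "merchant_id": ["M_1", "M_2", "M_3"],
    "category_1": ["N", "Y", "N"],
    "numerical_1": [0.1, 0.5, 0.9],
})

# Only the listed column is expanded; the others are passed through unchanged.
toy_enc = pd.get_dummies(toy, columns=["category_1"])

print(toy_enc.columns.tolist())
```

Each distinct value becomes its own indicator column, which is why the encoded frame grows so wide for high-cardinality columns such as `city_id`.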

#### Min-max normalization

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler    = MinMaxScaler()
df_values = df_enc.drop('merchant_id', axis=1)
df_norm   = scaler.fit_transform(df_values)
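
With its default `feature_range=(0, 1)`, `MinMaxScaler` rescales each column independently with `(x - min) / (max - min)`. The NumPy sketch below reproduces that formula on an invented matrix; unlike scikit-learn, it does not guard against constant columns.

```python
import numpy as np

# Toy matrix: rows are merchants, columns are features (values invented).
X = np.array([[10.0, 0.0],
              [20.0, 1.0],
              [40.0, 0.0]])

# Column-wise min-max scaling: (x - min) / (max - min).
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)

print(X_scaled)
```

Scaling to [0, 1] matters here because the decoder ends in a sigmoid layer and the reconstruction loss is a binary cross-entropy, both of which assume targets in that range.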

## Training Variational autoencoder (VAE)


In [None]:
from keras import backend as K
from keras.callbacks import ModelCheckpoint
from keras.layers import Input, Dense, Lambda, Layer, Add, Multiply
from keras.models import Model, Sequential

In [None]:
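# network parameters
original_dim     = df_enc.shape[1] - 1   # all columns except merchant_id
input_shape      = (original_dim, )
intermediate_dim = original_dim // 2     # hidden layer size
batch_size       = 128
latent_dim       = 64                    # size of the embedding
epochs           = 80
epsilon_std      = 1.0                   # std of the noise used for sampling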

#### Build Model

https://tiao.io/post/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/

Keras is awesome. It is a very well-designed library that clearly abides by its guiding principles of modularity and extensibility, enabling us to easily assemble powerful, complex models from primitive building blocks. This has been demonstrated in numerous blog posts and tutorials, in particular, the excellent tutorial on Building Autoencoders in Keras. As the name suggests, that tutorial provides examples of how to implement various kinds of autoencoders in Keras, including the variational autoencoder (VAE).

In [None]:
class KLDivergenceLayer(Layer):

    """ Identity transform layer that adds KL divergence
    to the final model loss.
    """

    def __init__(self, *args, **kwargs):
        self.is_placeholder = True
        super(KLDivergenceLayer, self).__init__(*args, **kwargs)

    def call(self, inputs):

        mu, log_var = inputs

        kl_batch = - .5 * K.sum(1 + log_var -
                                K.square(mu) -
                                K.exp(log_var), axis=-1)

        self.add_loss(K.mean(kl_batch), inputs=inputs)

        return inputs

In [None]:
# VAE Architecture
# * original_dim - Original Input Dimension
# * intermediate_dim - Hidden Layer Dimension
# * latent_dim - Latent/Embedding Dimension
def vae_arc(original_dim, intermediate_dim, latent_dim):
    # Decode
    decoder = Sequential([
        Dense(intermediate_dim, input_dim=latent_dim, activation='relu'),
        Dense(original_dim, activation='sigmoid')
    ])

    # Encode
    x = Input(shape=(original_dim,))
    h = Dense(intermediate_dim, activation='relu')(x)

    z_mu = Dense(latent_dim)(h)
    z_log_var = Dense(latent_dim)(h)

    z_mu, z_log_var = KLDivergenceLayer()([z_mu, z_log_var])
    z_sigma = Lambda(lambda t: K.exp(.5*t))(z_log_var)

    eps = Input(tensor=K.random_normal(stddev=epsilon_std,
                                       shape=(K.shape(x)[0], latent_dim)))
    z_eps = Multiply()([z_sigma, eps])
    z = Add()([z_mu, z_eps])

    x_pred = decoder(z)
    
    return x, eps, z_mu, x_pred

The reconstruction loss `nll` defined below is a valid Keras loss, which is required to compile and optimize a model. It is a symbolic function that returns a scalar for each data point in y_true and y_pred. In our case, y_pred is the output of the decoder network (values in [0, 1] produced by its sigmoid layer) and y_true is the min-max normalized input.

In [None]:
def nll(y_true, y_pred):
    """ Negative log likelihood (Bernoulli). """

    # keras.losses.binary_crossentropy gives the mean
    # over the last axis. we require the sum
    return K.sum(K.binary_crossentropy(y_true, y_pred), axis=-1)
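
To see what this loss computes, here is a NumPy sketch of the same quantity: element-wise binary cross-entropy summed over the feature axis. The function name `bernoulli_nll`, the clipping constant and the example arrays are assumptions of this sketch, not part of the Keras API.

```python
import numpy as np

def bernoulli_nll(y_true, y_pred, eps=1e-7):
    """Per-sample negative log likelihood, summed over features."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    bce = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return bce.sum(axis=-1)

y_true = np.array([[1.0, 0.0, 1.0]])
good = np.array([[0.9, 0.1, 0.9]])   # reconstruction close to the input
bad = np.array([[0.1, 0.9, 0.1]])    # reconstruction far from the input

print(bernoulli_nll(y_true, good), bernoulli_nll(y_true, bad))
```

Summing rather than averaging over features keeps the reconstruction term on the same scale as the KL term, which is also summed over the latent dimensions.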

In [None]:
x, eps, z_mu, x_pred = vae_arc(original_dim, intermediate_dim, latent_dim)
vae            = Model(inputs=[x, eps], outputs=x_pred)
vae.compile(optimizer='adam', loss=nll)

In [None]:
vae.summary()

![](https://tiao.io/post/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/vae_full.svg)

#### Training VAE

Split the dataset into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split

# The autoencoder reconstructs its input, so the features are also the targets
X_train, X_test, y_train, y_test = train_test_split(df_norm, df_norm, 
                                                    test_size=0.33, random_state=42)

In [None]:
# Keep only the weights with the lowest validation loss
filepath   = "weights.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [None]:
# train
hist = vae.fit(X_train, X_train,
        epochs=epochs,
        batch_size=batch_size,
        callbacks=callbacks_list,
        validation_data=(X_test, X_test))

In [None]:
def plt_hist(hist):
    # summarize history for loss
    plt.plot(hist.history['loss'])
    plt.plot(hist.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')

In [None]:
plt_hist(hist)

## Visualization of latent space

Since our latent space is not two-dimensional, we use PCA to reduce it to two dimensions for plotting. One interesting visualization is to look at the neighborhoods of different classes in the resulting 2D plane:

In [None]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

def plt_reduce(x, color='merchant_category_id'):
    '''
    Plot Scatter with color
    '''
    plt.figure(figsize=(6, 6))
    plt.scatter(x[:, 0], x[:, 1], c=df[color],
            alpha=.4, s=3**2, cmap='viridis')
    #plt.colorbar()
    plt.show()

In [None]:
# Predict Embedding values
encoder = Model(x, z_mu)
z_df    = encoder.predict(df_norm, batch_size=batch_size)

#### PCA - Principal Component Analysis

In [None]:
# Reduce dimension
pca      = PCA(n_components=2)
x_reduce = pca.fit_transform(z_df)
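
Before reading too much into a 2D plot, it can help to check how much of the variance the two components retain. The sketch below uses random data as a stand-in for `z_df` (the names `z_demo` and `pca_demo` are hypothetical); on the real embedding the same `explained_variance_ratio_` attribute applies.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for z_df: 500 random points in a 64-dimensional latent space.
z_demo = rng.standard_normal(size=(500, 64))

pca_demo = PCA(n_components=2)
z_2d = pca_demo.fit_transform(z_demo)

# Fraction of the total variance kept by the two plotted components.
retained = pca_demo.explained_variance_ratio_.sum()
print(z_2d.shape, retained)
```

If the retained fraction is small, clusters that overlap in the plot may still be well separated in the full latent space.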

In [None]:
# Plot with merchant_category_id color
plt_reduce(x_reduce, 'merchant_category_id')

In [None]:
# Plot with subsector_id color
plt_reduce(x_reduce, 'subsector_id')

In [None]:
# Plot with city_id color
plt_reduce(x_reduce, 'city_id')

### Save Embedding

Join the embedding with merchant_id and save it as a CSV file.

In [None]:
df_embedding = pd.DataFrame(z_df)
df_embedding['merchant_id'] = df.merchant_id
df_embedding.head(5)

In [None]:
df_embedding.to_csv('merchant_id_embedding.csv')
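
One simple use of the saved table is to look up the merchants closest to a given one. This is a minimal sketch on an invented three-row table laid out like `df_embedding`; the helper `nearest_merchants` and the choice of Euclidean distance are assumptions, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical embedding table in the same layout as df_embedding.
emb = pd.DataFrame({
    0: [1.0, 0.9, -1.0],
    1: [0.0, 0.1, 0.5],
    "merchant_id": ["M_1", "M_2", "M_3"],
})

def nearest_merchants(emb, merchant_id, k=1):
    """Return the k merchants closest to `merchant_id` by Euclidean distance."""
    vectors = emb.drop(columns="merchant_id").to_numpy()
    ids = emb["merchant_id"].to_numpy()
    query = vectors[ids == merchant_id][0]
    dist = np.linalg.norm(vectors - query, axis=1)
    order = np.argsort(dist)
    # Skip the query merchant itself.
    return [m for m in ids[order] if m != merchant_id][:k]

print(nearest_merchants(emb, "M_1"))
```

For the full merchant table a brute-force scan like this may be slow, and an approximate nearest-neighbour index would be a natural next step.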

##### continue....