## SKU TO VEC

### Reference
https://arxiv.org/pdf/1103.0398.pdf

http://sebastianruder.com/word-embeddings-1/index.html#continuousbagofwordscbow

## Word embedding models

Every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors in a lower-dimensional space, which it then fine-tunes through back-propagation, yields word embeddings as the weights of its first layer. This layer is usually called the embedding layer.

The main difference between such a network, which produces word embeddings as a by-product, and a method such as word2vec, whose explicit goal is to generate word embeddings, is computational complexity. Generating word embeddings with a very deep architecture is simply too computationally expensive for a large vocabulary. This is the main reason why it took until 2013 for word embeddings to explode onto the NLP stage; computational complexity is a key trade-off for word embedding models.

Models of this kind typically share three components:

#### 1 Embedding Layer:
A layer that generates word embeddings by multiplying an index vector with a word embedding matrix.
#### 2 Intermediate Layer(s):
One or more layers that produce an intermediate representation of the input, e.g. a fully-connected layer that applies a non-linearity to the concatenation of the word embeddings of the $n$ previous words.
#### 3 Cost Layer:
The final layer, which produces a probability distribution over the words in the vocabulary $V$.
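
The three components can be illustrated with a small NumPy forward pass. This is only a sketch under assumed toy sizes: the names `V`, `d`, `n`, `h`, the `tanh` non-linearity, and the random weights are illustrative choices, not taken from the model built later in this notebook.

```python
import numpy as np

rng = np.random.RandomState(0)
V, d, n, h = 6, 3, 2, 4   # vocab size, embedding dim, context length, hidden units (all assumed)

# 1. Embedding layer: a one-hot index vector times the embedding matrix
#    selects the same rows as a direct lookup.
E = rng.normal(size=(V, d))
context = np.array([1, 4])            # indices of the n previous words
one_hot = np.eye(V)[context]          # shape (n, V)
emb = one_hot @ E                     # shape (n, d)
assert np.allclose(emb, E[context])

# 2. Intermediate layer: a non-linearity applied to the concatenated embeddings.
W1, b1 = rng.normal(size=(n * d, h)), np.zeros(h)
hidden = np.tanh(emb.reshape(-1) @ W1 + b1)   # shape (h,)

# 3. Cost layer: softmax, giving a probability distribution over the V words.
W2, b2 = rng.normal(size=(h, V)), np.zeros(V)
logits = hidden @ W2 + b2
probs = np.exp(logits - logits.max())         # subtract the max for numerical stability
probs /= probs.sum()
```

The softmax in step 3 touches every word in the vocabulary, which is where the cost grows with vocabulary size.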

In [12]:
import tensorflow as tf
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

## CBOW model

<img src="img/cbow.png">

## Pairwise ranking criterion

The pairwise ranking criterion looks like this:

   <img src="img/margin_1_loss.png">
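
As a rough illustration, the criterion can be sketched in NumPy. This assumes the margin-1 hinge form used by Collobert et al. (the first reference), $\sum \max(0,\ 1 - f(x) + f(x^{(w)}))$, where $f(x)$ scores a true window and $f(x^{(w)})$ scores a corrupted one. The function name and argument layout are hypothetical.

```python
import numpy as np

def pairwise_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Sum of max(0, margin - s_pos + s_neg) over every (positive, negative) pair.

    pos_scores: shape (batch, n_positive)
    neg_scores: shape (batch, n_negative)
    """
    pos = np.asarray(pos_scores, dtype=float)[:, :, None]   # (batch, n_positive, 1)
    neg = np.asarray(neg_scores, dtype=float)[:, None, :]   # (batch, 1, n_negative)
    return np.maximum(0.0, margin - pos + neg).sum()

# One example with one positive score and three negative scores.
loss = pairwise_ranking_loss([[2.0]], [[0.5, 1.5, 3.0]])
# Terms: max(0, -0.5) + max(0, 0.5) + max(0, 2.0) = 2.5
```

A negative only contributes to the loss when its score comes within the margin of the positive score, so well-separated pairs produce no gradient.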
    

## Imports, config variables, and data generators

In [26]:
# Global config variables
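batch_size = 10      # examples per batch (alternative: 128)
num_classes = 10     # vocabulary size: number of SKUs and zuids (~700,000)
state_size = 4       # embedding dimension (alternatives: 32, 64, 128)
learning_rate = 0.1
d_win = 1            # context window: SKUs taken on each side of the target
n_negative = 7       # negative samples per example
n_positive = 1       # positive samples per example

# Size of layer 1
n_layer_1 = 8

# Size of layer 2
n_layer_2 = 4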

In [5]:
def gen_data(size=128):
    pass

def gen_batch(raw_data, batch_size, num_steps):
    pass

def gen_epochs(n, num_steps):
    for i in range(n):
        yield gen_batch(gen_data(), batch_size, num_steps)
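
The generators above are left as stubs. While they are unimplemented, a placeholder batch generator along the following lines could be used to smoke-test the graph. This is a hypothetical helper, not the intended data pipeline: it draws uniform random ids instead of reading real sessions, and its array shapes mirror the `[batch_size, d_win * 2]`, `[batch_size, n_positive]`, and `[batch_size, n_negative]` placeholders defined further down.

```python
import numpy as np

def sample_batch(rng, num_classes=10, batch_size=10, d_win=1,
                 n_negative=7, n_positive=1):
    """Return random (context, positive, negative) id arrays (hypothetical helper).

    Ids are drawn uniformly, so a "negative" may occasionally equal the
    positive id; a real sampler would exclude such collisions.
    """
    context = rng.randint(0, num_classes, size=(batch_size, 2 * d_win))
    positive = rng.randint(0, num_classes, size=(batch_size, n_positive))
    negative = rng.randint(0, num_classes, size=(batch_size, n_negative))
    return context, positive, negative

rng = np.random.RandomState(0)
context_ids, positive_ids, negative_ids = sample_batch(rng)
```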

### Model

If we treat the continuous sequence of a user's actions in a transaction as a sentence, each SKU plays the role of a word.
<img src="img/model_v1.png">
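
To make the sentence analogy concrete, here is a minimal sketch of how one session of SKU ids could be cut into (context, target) pairs with a window of `d_win` on each side. The function name and the toy session are assumptions for illustration. Positions too close to either end of the session are simply skipped here; padding would be another option.

```python
def context_windows(session, d_win=1):
    """Yield (context, target) pairs from one session of SKU ids."""
    for i in range(d_win, len(session) - d_win):
        # d_win SKUs before the target plus d_win SKUs after it.
        context = session[i - d_win:i] + session[i + 1:i + d_win + 1]
        yield context, session[i]

session = [3, 7, 2, 9]    # toy session: four SKU ids in click order
pairs = list(context_windows(session, d_win=1))
# pairs == [([3, 2], 7), ([7, 9], 2)]
```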

In [24]:
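# Placeholders
# NOTE: the first line rebinds `d_win` from the integer window size to the
# context placeholder, so re-running this cell will likely fail unless the
# config cell is re-run first.
d_win = tf.placeholder(tf.int32, [batch_size, d_win * 2], name='context')
negative_sample = tf.placeholder(tf.int32, [batch_size, n_negative], name='negative_sample')
positive_sample = tf.placeholder(tf.int32, [batch_size, n_positive], name='positive_sample')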

### Helper-functions for creating new variables

In [30]:
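def new_weights(shape):
    # Small random initial weights drawn from a truncated normal distribution.
    return tf.Variable(tf.truncated_normal(shape, stddev=0.05))

def new_biases(length):
    # Small positive constant as the initial bias value.
    return tf.Variable(tf.constant(0.05, shape=[length]))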

def new_fc_layer(input,          # The previous layer.
                 num_inputs,     # Num. inputs from prev. layer.
                 num_outputs,    # Num. outputs.
                 use_relu=True): # Use Rectified Linear Unit (ReLU)?

    # Create new weights and biases.
    weights = new_weights(shape=[num_inputs, num_outputs])
    biases = new_biases(length=num_outputs)

    # Calculate the layer as the matrix multiplication of
    # the input and weights, and then add the bias-values.
    layer = tf.matmul(input, weights) + biases

    # Use ReLU?
    if use_relu:
        layer = tf.nn.relu(layer)

    return layer

In [31]:
# Variables

#lookup table 
embeddings = new_weights([num_classes, state_size])

# input vectors
d_win_vec = tf.nn.embedding_lookup(embeddings, d_win)
negative_sample_vec = tf.nn.embedding_lookup(embeddings, negative_sample)
positive_sample_vec = tf.nn.embedding_lookup(embeddings, positive_sample)
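
A lookup appends the embedding dimension to the shape of the id tensor, so the context lookup above has shape `[batch_size, d_win * 2, state_size]`. The NumPy sketch below reproduces that shape with fancy indexing, using the toy config values, and shows the flattening step that a 2-D `tf.matmul` in `new_fc_layer` would presumably need (`tf.reshape` in the graph). The variable names here are illustrative only.

```python
import numpy as np

batch_size, num_classes, state_size, window = 10, 10, 4, 1   # toy config values
rng = np.random.RandomState(0)

embeddings_np = rng.normal(scale=0.05, size=(num_classes, state_size))
context_ids = rng.randint(0, num_classes, size=(batch_size, 2 * window))

# Fancy indexing mirrors the lookup: result shape is ids.shape + (state_size,).
context_vec = embeddings_np[context_ids]          # (10, 2, 4)

# Concatenate the context embeddings of each example into one row.
flat = context_vec.reshape(batch_size, -1)        # (10, 8)
```

With these values the flattened width is `2 * window * state_size = 8`.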

## Model 