# Entity Embedding

Entity embedding is most useful for categorical features with high cardinality. For low-cardinality features, one hot encoding or label encoding is usually sufficient, depending on the feature.
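For comparison, the two simpler encodings can be sketched as follows. The `Size` column and its values are hypothetical and used only for illustration; they are not part of the tutorial data.

```python
import pandas as pd

# Hypothetical low-cardinality feature (illustration only)
small_df = pd.DataFrame({'Size': ['S', 'M', 'L', 'M', 'S']})

# One hot encoding: one indicator column per category
one_hot = pd.get_dummies(small_df['Size'], prefix='Size')

# Label encoding: one integer code per category
# (pandas orders string categories alphabetically: L, M, S)
label_codes = small_df['Size'].astype('category').cat.codes

print(one_hot.columns.tolist())
print(label_codes.tolist())
```

With only three categories, one hot encoding adds just three columns. With hundreds of categories it would add hundreds, which is where a compact embedding becomes attractive.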

Let's create a pandas dataframe with a color feature with high cardinality.


In [2]:

import pandas as pd
import numpy as np
import random

# Set random seeds for reproducibility
# (random.choices uses the random module, not NumPy, so both are seeded)
random.seed(42)
np.random.seed(42)

# Define the number of rows and unique categories
num_rows = 50
num_categories = 15

# List of color names
color_names = ['Red', 'Orange', 'Yellow', 'Green', 'Blue', 'Indigo', 'Violet', 'White', 'Black', 'Gray', 'Pink', 'Brown', 'Cyan', 'Magenta', 'Purple']

# Generate random categorical data
categories = random.choices(color_names, k=num_rows)

# Generate random flag values (0 and 1)
flags = np.random.randint(2, size=num_rows)

# Create the DataFrame
df = pd.DataFrame({'Color': categories, 'Target': flags})
df['Color'] = df['Color'].astype('category')

# Display the DataFrame
print(df)


      Color  Target
0      Pink       0
1     Green       1
2     White       0
3   Magenta       0
4    Yellow       0
5       Red       1
6       Red       0
7      Pink       0
8      Gray       0
9       Red       1
10     Blue       0
11     Gray       0
12   Violet       0
13    Green       0
14   Violet       1
15     Pink       0
16    White       1
17     Blue       1
18   Orange       1
19    White       0
20   Purple       1
21   Yellow       0
22     Cyan       1
23    White       1
24  Magenta       1
25    Green       1
26     Gray       1
27   Purple       1
28   Yellow       1
29   Violet       1
30   Violet       0
31    White       0
32   Orange       1
33   Violet       1
34     Gray       1
35    White       0
36   Purple       1
37   Orange       0
38     Cyan       0
39   Purple       0
40    Black       0
41   Violet       0
42    White       1
43   Indigo       1
44     Gray       1
45     Cyan       1
46    Green       1
47     Cyan       0
48    White       1


In [3]:
import pandas as pd
import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, Dense, Reshape, Concatenate
from keras import backend as K


2023-06-14 22:38:34.038211: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Steps to perform categorical label encoding for the `Color` variable.

* First, we create an empty list. (If your data is already split, create two empty lists: one for the training data and one for the test data.)
* Next, create an empty dictionary called `cat_encoder`. This dictionary will be used to save the label encodings for the categorical variable `Color`.
* Then, all the unique values for `Color` are extracted and saved in a variable called `unique_cat`. There are 14 unique Colors in the dataset.
* Then, we loop through each color and assign an integer to the color.
* Finally, we print the `cat_encoder` and we can see that each color is a key and each key has an integer as a value. 

In [6]:
# Input list for the data
input_list = []

# Categorical encoder is in dictionary format
cat_encoder = {}

# Unique values for the categorical variable
unique_cat = np.unique(df['Color'])

# Print out the number of unique values in the categorical variable
print(f'There are {len(unique_cat)} unique colors in the dataset.\n')

# Encode the categorical variable
for i, cat in enumerate(unique_cat):
  cat_encoder[cat] = i

# Take a look at the encoder
cat_encoder

There are 14 unique colors in the dataset.



{'Black': 0,
 'Blue': 1,
 'Cyan': 2,
 'Gray': 3,
 'Green': 4,
 'Indigo': 5,
 'Magenta': 6,
 'Orange': 7,
 'Pink': 8,
 'Purple': 9,
 'Red': 10,
 'Violet': 11,
 'White': 12,
 'Yellow': 13}

In [7]:
# Append the values to the input list
input_list.append(df['Color'].map(cat_encoder).values)

In [41]:
input_list

[[2, 1, 3, 7, 7, ..., 6, 3, 4, 9, 11]
 Length: 50
 Categories (14, int64): [0, 1, 2, 3, ..., 10, 11, 12, 13]]

Now, we will build a model with categorical entity embedding.

First, let's create the embedding layer using the `Embedding` function in keras.

* `input_dim` is the number of unique values for the categorical column. In this example, it is the number of unique colors.
* `output_dim` is the dimension of the embedding output. <br>
How should this number be chosen? The authors of the [entity embedding paper](https://arxiv.org/pdf/1604.06737.pdf) treat it as a hyperparameter to tune, ranging from 1 to the number of categories minus 1. They propose two general guidelines:
 * If the number of aspects needed to describe the entities can be estimated, use that number as the `output_dim`. More complex entities usually need more output dimensions. 
For example, in the German Credit Risk dataset, `Number of dependants` can be described by Age, Income, Applicant Income, and Marital Status, so we can set the number of output dimensions to 4. For `Color`, we set it to 6, purely to demonstrate the code. Since this is a hyperparameter, you can try multiple values and choose the best one based on model performance.
 * If the number of aspects needed to describe the entities cannot be estimated, start the hyperparameter tuning with the highest possible number of dimensions, which is the number of categories minus 1. 
 * You can also set the number of dimensions equal to the square root of the number of unique values in the category.
* `name` gives a name for the layer.
* The input dimension of the categorical variable is defined by the `Input` function. `Input()` is used to instantiate a Keras tensor. `shape=(1,)` indicates that each sample is a vector with a single element, the encoded category index.
* `Reshape` changes the embedding output from 3-dimensional (batch, 1, embedding dimension) to 2-dimensional (batch, embedding dimension).
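The guidelines for `output_dim` can be turned into a short list of candidate values to try. This is only a sketch: rounding the square root up to a whole number is an assumption made here, and the name `candidate_dims` is illustrative.

```python
import math

# Number of unique categories (14 colors in this example)
n_categories = 14

# Upper bound for tuning: number of categories minus 1
max_dim = n_categories - 1

# Square-root rule of thumb, rounded up (the rounding is an assumption)
sqrt_dim = math.ceil(math.sqrt(n_categories))

# Candidate values for hyperparameter tuning, including the 6 used in this tutorial
candidate_dims = sorted({1, sqrt_dim, 6, max_dim})
print(candidate_dims)
```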

In [8]:
# Number of unique values in the categorical col
n_unique_cat = len(unique_cat)

# Input dimension of the categorical variable
input_cat = Input(shape=(1,))

# Output dimension of the categorical entity embedding
cat_emb_dim = 6

# Embedding layer
emb_cat = Embedding(input_dim=n_unique_cat, output_dim=cat_emb_dim, name="embedding_cat")(input_cat)

# Check the output shape
print(emb_cat)

# Reshape
emb_cat = Reshape(target_shape=(cat_emb_dim, ))(emb_cat)

# Check the output shape
print(emb_cat)

KerasTensor(type_spec=TensorSpec(shape=(None, 1, 6), dtype=tf.float32, name=None), name='embedding_cat/embedding_lookup/Identity_1:0', description="created by layer 'embedding_cat'")
KerasTensor(type_spec=TensorSpec(shape=(None, 6), dtype=tf.float32, name=None), name='reshape/Reshape:0', description="created by layer 'reshape'")




In [56]:
type(emb_cat)

keras.engine.keras_tensor.KerasTensor

This is a keras symbolic tensor. It helps you build a model framework so that it is ready to accept the input anytime later. To view the embeddings, you need to convert the symbolic tensor to a tensor by first feeding the network with the data.
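Conceptually, once data is fed in, the embedding layer behaves like a lookup table: each encoded category selects one row of a weight matrix. The NumPy sketch below imitates that step. The weights are random placeholders, not the values Keras would learn, and the batch of indices is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

n_unique_cat = 14   # number of unique colors
cat_emb_dim = 6     # embedding dimension

# Placeholder weight matrix with one row per category (random, not trained)
weights = rng.normal(size=(n_unique_cat, cat_emb_dim))

# A batch of encoded colors with shape (batch, 1), like the model input
batch = np.array([[2], [1], [12]])

# The lookup yields shape (batch, 1, 6); reshaping drops the middle axis
looked_up = weights[batch]
flat = looked_up.reshape(-1, cat_emb_dim)

# The lookup is equivalent to multiplying a one hot matrix by the weights
one_hot = np.eye(n_unique_cat)[batch[:, 0]]
same_as_matmul = np.allclose(flat, one_hot @ weights)

print(looked_up.shape, flat.shape, same_as_matmul)
```

The equivalence with the one hot product is why an embedding can be read as a dense, trainable replacement for one hot encoding.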

In [65]:
emb_cat.shape

TensorShape([None, 6])

The categorical variable has a dimension of 6 because we specified the embedding dimension to be 6 in the embedding layer.

We don't need a complicated model here since there is not much data.

The embedding output is passed to the `Dense` layers of the model. 
* The first `Dense` layer has 3 neurons. `relu` is the activation function.
* The output layer has one neuron, and the activation function is `sigmoid` because the target variable `Target` is a binary variable.

Then `input_cat` and `outputs` are grouped into an object using the `Model` function.
* `inputs` takes in the inputs of the model. It can be a `Keras` `Input` object or a list of `Keras` `Input` objects. As the [Keras documentation](https://keras.io/api/models/model/) points out, only dictionaries, lists, and tuples of input tensors are supported. Nested inputs such as a list of lists or a dictionary of dictionaries are not supported.
* `outputs` takes in the outputs of the model.
* `name` is the name of the model. We gave our model the name of `Entity_embedding_model_keras`.

We can print out the model details using the `summary` function. 

In [10]:
# Dense layer with 3 neurons and relu activation function
model = Dense(3, activation = 'relu')(emb_cat)

# Output layer with one neuron and sigmoid activation for the binary target
outputs = Dense(1, activation = 'sigmoid')(model)

# Use Model to group layers into an object with training and inference features
nn = Model(inputs=input_cat, outputs=outputs, name ='Entity_embedding_model_keras')

# Print out the model summary
nn.summary()

Model: "Entity_embedding_model_keras"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 embedding_cat (Embedding)   (None, 1, 6)              84        
                                                                 
 reshape (Reshape)           (None, 6)                 0         
                                                                 
 dense_3 (Dense)             (None, 3)                 21        
                                                                 
 dense_4 (Dense)             (None, 1)                 4         
                                                                 
Total params: 109
Trainable params: 109
Non-trainable params: 0
_________________________________________________________________
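The parameter counts in the summary can be checked by hand. The arithmetic below is a sketch that assumes the default Keras behavior of one bias per `Dense` unit and no bias in the `Embedding` layer.

```python
n_unique_cat = 14   # rows in the embedding table
cat_emb_dim = 6     # embedding dimension
hidden_units = 3    # neurons in the first Dense layer

# Embedding: one vector of length 6 per color, no bias
embedding_params = n_unique_cat * cat_emb_dim

# Dense(3): 6 inputs x 3 units, plus 3 biases
hidden_params = cat_emb_dim * hidden_units + hidden_units

# Dense(1): 3 inputs x 1 unit, plus 1 bias
output_params = hidden_units * 1 + 1

total_params = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
```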


In [11]:
#!pip install pydot

In [12]:
# Visualize neural network model structure
from keras.utils import plot_model
from IPython.display import Image
import graphviz
import pydot
# Deep learning model
from tensorflow.keras.layers import Input, Dense, Reshape, Concatenate, Embedding
from tensorflow.keras.models import Model, load_model
from keras.callbacks import EarlyStopping

In [13]:
# Print model structure
plot_model(nn, show_shapes=True, show_layer_names=True, to_file='Entity_embedding_model_keras.png')
#Image(retina=True, filename='Entity_embedding_model_keras.png')

You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.


In [89]:
# check input type
type(input_list)

numpy.ndarray

In [80]:
# convert to array
input_list = np.array(input_list)

In [85]:
input_list

array([[ 2,  1,  3,  7,  7, 12,  3,  1,  6, 13,  7,  6,  3, 13,  6, 10,
        10,  4,  0,  6,  5,  8, 11,  2,  1,  5,  5,  9,  3,  5,  4, 12,
         7, 12, 11,  5,  5, 11,  0,  9, 10,  0,  9, 11,  2,  6,  3,  4,
         9, 11]])

In [82]:
# verify
type(input_list)

numpy.ndarray

In [84]:
# check dimension
input_list.shape

(1, 50)

In [20]:
# reshape
input_list = np.reshape(input_list, (50, 1))

In [22]:
# check dimension. This is required for the model
input_list.shape

(50, 1)
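Hard-coding 50 ties the reshape to this particular dataset. A small sketch of an alternative is to pass `-1` and let NumPy infer the number of rows; the array below is a shortened stand-in for the encoded colors.

```python
import numpy as np

# Shortened stand-in for the (1, 50) array of encoded colors
encoded = np.array([[2, 1, 3, 7]])

# -1 tells NumPy to infer the number of rows from the data
column = encoded.reshape(-1, 1)

print(encoded.shape, column.shape)
```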

In [23]:
# compile model (binary target with a single sigmoid output, so use binary cross-entropy)
nn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model
history  =  nn.fit(input_list,
                   df.Target, 
                   epochs=2,  
                   verbose=1)

Epoch 1/2
Epoch 2/2
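For a single sigmoid output and a 0/1 target, the loss that matches the model is binary cross-entropy. The sketch below computes it by hand in NumPy. The labels and predicted probabilities are made-up numbers, not outputs of the model above.

```python
import numpy as np

# Made-up labels and predicted probabilities (illustration only)
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0.1, 0.8, 0.6, 0.3])

# Clip predictions to avoid log(0)
eps = 1e-7
y_pred = np.clip(y_pred, eps, 1 - eps)

# Average negative log-likelihood of the true labels
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(float(bce))
```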


The weights are extracted from the embedding layer and saved in a dataframe. We can see that the dataframe has 14 rows, each row representing one unique color. There are 7 columns in the dataframe. The first column is the encoded categorical index, the other 6 columns are the embeddings.

In [24]:
# Get weights from the embedding layer
cat_emb_df = pd.DataFrame(nn.get_layer('embedding_cat').get_weights()[0]).reset_index()

# Add prefix to the embedding names
cat_emb_df = cat_emb_df.add_prefix('cat_')

# Take a look at the data
cat_emb_df

|   | cat_index | cat_0 | cat_1 | cat_2 | cat_3 | cat_4 | cat_5 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.027591 | -0.00713 | -0.002562 | 0.027505 | -0.031295 | 0.047634 |
| 1 | 1 | -0.039307 | 0.001249 | 0.047699 | -0.010845 | -0.010499 | 0.030539 |
| 2 | 2 | 0.047044 | -0.043243 | 0.026318 | 0.029448 | -0.000237 | -0.022817 |
| 3 | 3 | 0.016779 | 0.013475 | -0.016772 | -0.011804 | -0.020719 | 0.003271 |
| 4 | 4 | 0.004641 | -0.003226 | -0.025481 | 0.032866 | 0.004565 | -0.02395 |
| 5 | 5 | -0.048152 | -0.018079 | 0.041361 | 0.045231 | 0.028738 | 0.022594 |
| 6 | 6 | -0.025324 | -0.044873 | -0.042644 | -0.03486 | 0.002117 | 0.036049 |
| 7 | 7 | -0.029444 | -0.018251 | -0.046729 | 0.01596 | 0.046416 | 0.001486 |
| 8 | 8 | -0.007883 | 0.035574 | 0.024529 | 0.007265 | 0.007155 | 0.0286 |
| 9 | 9 | -0.019481 | -0.037562 | -0.007163 | -0.006064 | 0.031657 | -0.03762 |


In [26]:
# Put the categorical encoder dictionary into a dataframe
cat_encoder_df = pd.DataFrame(cat_encoder.items(), columns=['cat', 'cat_index'])

# Merge data to append the category name
cat_emb_df = pd.merge(cat_encoder_df, cat_emb_df, how = 'inner', on='cat_index')

# Take a look at the data
cat_emb_df

|   | cat | cat_index | cat_0 | cat_1 | cat_2 | cat_3 | cat_4 | cat_5 |
|---|---|---|---|---|---|---|---|---|
| 0 | Black | 0 | 0.027591 | -0.00713 | -0.002562 | 0.027505 | -0.031295 | 0.047634 |
| 1 | Blue | 1 | -0.039307 | 0.001249 | 0.047699 | -0.010845 | -0.010499 | 0.030539 |
| 2 | Cyan | 2 | 0.047044 | -0.043243 | 0.026318 | 0.029448 | -0.000237 | -0.022817 |
| 3 | Gray | 3 | 0.016779 | 0.013475 | -0.016772 | -0.011804 | -0.020719 | 0.003271 |
| 4 | Green | 4 | 0.004641 | -0.003226 | -0.025481 | 0.032866 | 0.004565 | -0.02395 |
| 5 | Indigo | 5 | -0.048152 | -0.018079 | 0.041361 | 0.045231 | 0.028738 | 0.022594 |
| 6 | Magenta | 6 | -0.025324 | -0.044873 | -0.042644 | -0.03486 | 0.002117 | 0.036049 |
| 7 | Orange | 7 | -0.029444 | -0.018251 | -0.046729 | 0.01596 | 0.046416 | 0.001486 |
| 8 | Pink | 8 | -0.007883 | 0.035574 | 0.024529 | 0.007265 | 0.007155 | 0.0286 |
| 9 | Purple | 9 | -0.019481 | -0.037562 | -0.007163 | -0.006064 | 0.031657 | -0.03762 |


You can save these embeddings and use them in other models as input.
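One way to reuse the embeddings is to join them back onto the data by category name and save the lookup table to disk. The sketch below uses small stand-in DataFrames with made-up values. The names `emb`, `data`, and `features` and the file name `color_embeddings.csv` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Stand-in for cat_emb_df (made-up values, two embedding dimensions only)
emb = pd.DataFrame({
    'cat': ['Black', 'Blue', 'Cyan'],
    'cat_0': [0.03, -0.04, 0.05],
    'cat_1': [-0.01, 0.00, -0.04],
})

# Stand-in for the original data
data = pd.DataFrame({'Color': ['Blue', 'Cyan', 'Blue', 'Black']})

# Replace each color name with its embedding columns
features = (
    data.merge(emb, how='left', left_on='Color', right_on='cat')
        .drop(columns=['Color', 'cat'])
)

# Save the lookup table for reuse in other models
emb.to_csv('color_embeddings.csv', index=False)

print(features)
```

The resulting `features` frame holds only numeric columns, so it can be passed to models that cannot handle raw categorical values.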

References:

https://colab.research.google.com/drive/13U4YRIdEu7SWS1ttiJPSrSayQHNbIaT_?usp=sharing#scrollTo=T9USnggEb1wk
