# Categorical Data and Encoding

Categorical data requires special care. Categories such as the characters ‘a’, ‘b’, ‘c’ are often stored as integers 0, 1, 2, and so on. Do not feed those integers directly into a model as if they were numeric values. Doing so causes two problems.

1) You bias the model to see relations where there are none. In the character example above, the model would treat ‘a’ as closer to ‘b’ than to ‘o’, although ‘a’ and ‘o’ are both vowels and the adjacency of ‘a’ and ‘b’ in the alphabet says nothing about how they are used.

2) If you have many categories, say more than 50, all of them are squeezed onto a single input axis with values from 0 up to 50 and beyond. The model will have a hard time separating that many categories along one dimension without blending some of them together, so it loses information unnecessarily.

A much better option for categorical data is to use one-hot vectors or embeddings.
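The first problem can be made concrete with a small sketch. It assumes a plain alphabetical integer code for the letters (the `code` dictionary is illustrative, not part of the original notebook) and compares the distances implied by integer codes with those implied by one-hot vectors.

```python
import numpy as np

letters = "abcdefghijklmnopqrstuvwxyz"
code = {ch: i for i, ch in enumerate(letters)}  # illustrative integer coding

# Integer codes: distance depends on alphabet position, not on meaning.
dist_ab_int = abs(code["a"] - code["b"])  # 1
dist_ao_int = abs(code["a"] - code["o"])  # 14

# One-hot codes: every pair of distinct letters is equally far apart.
one_hot = np.eye(len(letters))
dist_ab_hot = np.linalg.norm(one_hot[code["a"]] - one_hot[code["b"]])
dist_ao_hot = np.linalg.norm(one_hot[code["a"]] - one_hot[code["o"]])

print(dist_ab_int, dist_ao_int)
print(dist_ab_hot, dist_ao_hot)
```

With integer codes, ‘a’ looks fourteen times closer to ‘b’ than to ‘o’. With one-hot vectors, both distances are the same (the square root of 2), so no artificial ordering is imposed.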

In [7]:
import pandas as pd

blood_type_categories = pd.DataFrame(
    {"blood_type": ["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]}
)
blood_type_categories

Unnamed: 0,blood_type
0,A+
1,A-
2,B+
3,B-
4,AB+
5,AB-
6,O+
7,O-


## One-Hot Encoding

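A one-hot vector represents each category as a vector of 0s with a single 1 at the index assigned to that category. The vector length equals the number of categories, so the 8 blood types above become vectors of length 8.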
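Before turning to library helpers, the definition can be sketched by hand with NumPy. The category subset and the `index_of` mapping below are illustrative assumptions. Indexing an identity matrix with the category indices yields one one-hot row per sample.

```python
import numpy as np

categories = ["A+", "A-", "B+", "B-"]  # illustrative subset of the blood types
index_of = {cat: i for i, cat in enumerate(categories)}

samples = ["B+", "A+", "B-"]
indices = np.array([index_of[s] for s in samples])

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories), dtype=int)[indices]
print(one_hot)
```

Each row contains exactly one 1, located at the index of the sample's category.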

### Sklearn OneHotEncoder

In [10]:
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder(sparse_output=False)
onehot_encoding = onehot.fit_transform(blood_type_categories)
onehot_encoding

array([[1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1.]])

In [11]:
onehot_encoding.shape

(8, 8)

### PyTorch one_hot

In [63]:
from sklearn.preprocessing import LabelEncoder
import torch
import torch.nn.functional as F

# F.one_hot expects integer class indices, so map each string category to an index first.
encoder = LabelEncoder()
labels_blood_type = encoder.fit_transform(blood_type_categories["blood_type"])
labels_blood_type.tolist()

[0, 1, 4, 5, 2, 3, 6, 7]

In [66]:
tensor = torch.tensor(labels_blood_type.tolist())
tensor

tensor([0, 1, 4, 5, 2, 3, 6, 7])

In [68]:
onehot_encoding = F.one_hot(tensor)
onehot_encoding

tensor([[1, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 1]])

In [69]:
onehot_encoding.shape

torch.Size([8, 8])
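One caveat when calling `F.one_hot` without arguments: the vector width is inferred from the largest index in the tensor. The sketch below uses a hypothetical batch that contains only some of the 8 classes to show why passing `num_classes` explicitly is usually safer, and how to convert the integer result to floats for layers that expect floating-point input.

```python
import torch
import torch.nn.functional as F

# A hypothetical batch that happens to contain only classes 0..2 out of 8.
batch = torch.tensor([0, 2, 1])

inferred = F.one_hot(batch)              # width inferred as max index + 1
fixed = F.one_hot(batch, num_classes=8)  # width fixed to the full category count

print(inferred.shape)  # torch.Size([3, 3])
print(fixed.shape)     # torch.Size([3, 8])

# F.one_hot returns int64; convert before feeding floating-point layers.
fixed_float = fixed.float()
print(fixed_float.dtype)  # torch.float32
```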

## Embedding Encoding

An embedding converts categorical data, such as words or items, into vectors of continuous numbers. The beauty of embeddings lies in their ability to capture the underlying semantics and relationships between different categories once they have been trained.

Properties:

1) **Dense Representation:** While methods like one-hot encoding lead to sparse vectors (mostly zeros with a single one), embeddings result in dense vectors where every dimension can contain any real number. Advantages: Dense vectors are more memory-efficient and can capture more information in fewer dimensions compared to sparse representations.

2) **Semantic Meaning:** One of the primary goals of embeddings is to represent data in such a way that the spatial distances between vectors correlate with semantic similarities. Example: In a well-trained word embedding space, synonyms or related words will be closer to each other. For instance, "king" and "monarch" would have vectors that are near each other.

3) **Dimensionality Reduction:** Embeddings help in reducing the dimensionality of data. Instead of having a dimension for every possible category, the data is represented in a much smaller, fixed-size space. Advantages: This leads to more efficient storage and computation, especially when dealing with a large number of categories.
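The link between one-hot vectors and embeddings can be sketched directly: an embedding lookup gives the same result as multiplying a one-hot vector by the embedding weight matrix, only without materialising the sparse vector. The sizes below (8 categories, 3 dimensions) and the seed are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_categories, embedding_dim = 8, 3  # illustrative sizes

embedding = nn.Embedding(num_categories, embedding_dim)
indices = torch.tensor([0, 4, 7])

# Direct table lookup: one dense row per index.
looked_up = embedding(indices)

# Same result via an explicit one-hot matrix product.
one_hot = F.one_hot(indices, num_classes=num_categories).float()
via_matmul = one_hot @ embedding.weight

print(torch.allclose(looked_up, via_matmul))
print(one_hot.shape, looked_up.shape)
```

Each sample shrinks from 8 sparse values to 3 dense ones. The weights are randomly initialised here, so the vectors only acquire semantic meaning after training.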

In [95]:
import torch
import torch.nn as nn

torch.manual_seed(42)

embedding = nn.Embedding(num_embeddings=5, embedding_dim=32)

input_tensor = torch.LongTensor([[0, 4], [2, 3], [0, 1]])

embed_vectors = embedding(input_tensor)

print("Input shape:", input_tensor.shape)
print("Output shape:", embed_vectors.shape)

Input shape: torch.Size([3, 2])
Output shape: torch.Size([3, 2, 32])


The first sample contains two indices, 0 and 4. Each index is mapped to its own 32-dimensional embedding vector, so the sample becomes a 2 × 32 tensor.

In [91]:
input_tensor[0, :]

tensor([0, 4])

In [96]:
embed_vectors[0, :, :]

tensor([[ 1.9269,  1.4873,  0.9007, -2.1055,  0.6784, -1.2345, -0.0431, -1.6047,
         -0.7521,  1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688,  0.7624,
          1.6423, -0.1596, -0.4974,  0.4396, -0.7581,  1.0783,  0.8008,  1.6806,
          1.2791,  1.2964,  0.6105,  1.3347, -0.2316,  0.0418, -0.2516,  0.8599],
        [ 1.9312,  1.0119, -1.4364, -1.1299, -0.1360,  1.6354,  0.6547,  0.5760,
          1.1415,  0.0186, -1.8058,  0.9254, -0.3753,  1.0331, -0.6867,  0.6368,
         -0.9727,  0.9585,  1.6192,  1.4506,  0.2695, -0.2104, -0.7328,  0.1043,
          0.3488,  0.9676, -0.4657,  1.6048, -2.4801, -0.4175, -1.1955,  0.8123]],
       grad_fn=<SliceBackward0>)
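To make the lookup behaviour explicit, the following sketch repeats the setup above and checks two things: each output vector is a row of `embedding.weight`, and an index that appears in several samples (here index 0, in samples 0 and 2) always yields the identical vector.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
embedding = nn.Embedding(num_embeddings=5, embedding_dim=32)
input_tensor = torch.tensor([[0, 4], [2, 3], [0, 1]])
embed_vectors = embedding(input_tensor)

# Each index is replaced by the matching row of the weight table.
same_as_row_0 = torch.equal(embed_vectors[0, 0], embedding.weight[0])
same_as_row_4 = torch.equal(embed_vectors[0, 1], embedding.weight[4])

# Index 0 appears in samples 0 and 2 and gives the same vector both times.
shared = torch.equal(embed_vectors[0, 0], embed_vectors[2, 0])

print(same_as_row_0, same_as_row_4, shared)
```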

In [129]:
import torch
import torch.nn as nn

torch.manual_seed(42)

# One integer label per sample, shaped (num_samples, 1).
input_tensor = torch.tensor(labels_blood_type.tolist()).reshape(
    (len(labels_blood_type), -1)
)
# num_embeddings must be the number of distinct categories, not the number of samples.
embedding = nn.Embedding(num_embeddings=len(encoder.classes_), embedding_dim=32)
embed_vectors = embedding(input_tensor)

print("Input shape:", input_tensor.shape)
print("Output shape:", embed_vectors.shape)

Input shape: torch.Size([8, 1])
Output shape: torch.Size([8, 1, 32])


In [131]:
input_tensor[0, :]

tensor([0])

In [133]:
embed_vectors[0, :, :]

tensor([[ 1.9269,  1.4873,  0.9007, -2.1055,  0.6784, -1.2345, -0.0431, -1.6047,
         -0.7521,  1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688,  0.7624,
          1.6423, -0.1596, -0.4974,  0.4396, -0.7581,  1.0783,  0.8008,  1.6806,
          1.2791,  1.2964,  0.6105,  1.3347, -0.2316,  0.0418, -0.2516,  0.8599]],
       grad_fn=<SliceBackward0>)
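A point worth checking when sizing the table: `num_embeddings` has to cover every index that can occur, which means the number of distinct categories, not the number of samples. The sketch below assumes labels numbered 0 to K-1, as produced by a label encoder. It also passes a 1-D label tensor, which avoids the extra singleton axis, and shows that a table that is too small fails with an `IndexError` on CPU.

```python
import torch
import torch.nn as nn

labels = torch.tensor([0, 1, 4, 5, 2, 3, 6, 7])
num_categories = int(labels.max()) + 1  # 8; assumes labels are 0..K-1

embedding = nn.Embedding(num_embeddings=num_categories, embedding_dim=32)
vectors = embedding(labels)  # 1-D input gives shape [8, 32], no extra axis
print(vectors.shape)

# A table with fewer rows than categories cannot serve indices 4..7.
too_small = nn.Embedding(num_embeddings=4, embedding_dim=32)
try:
    too_small(labels)
    raised = False
except IndexError:
    raised = True
print(raised)
```

As in the cells above, the vectors are random at initialisation. They are trainable parameters of the model and only become informative once the model has been fitted.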