
# One-Hot Encoding

In one-hot encoding, each word $w$ in the corpus vocabulary is given a unique integer ID $w_{id}$ between 1 and $|V|$, where $V$ is the corpus vocabulary. Each word is then represented by a $|V|$-dimensional binary vector that is all 0s except at index $w_{id}$, where we put a 1. The representations of the individual words are then combined to form a sentence representation.
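
Written out as a formula, with $\mathbf{v}_w$ as an assumed name for the vector of word $w$ (this notation is illustrative and not part of the original text), the definition reads:

$$
\mathbf{v}_w[i] =
\begin{cases}
1 & \text{if } i = w_{id} \\
0 & \text{otherwise}
\end{cases}
\qquad i = 1, \dots, |V|
$$

For instance, with $|V| = 6$ and $w_{id} = 2$, this gives $[0\ 1\ 0\ 0\ 0\ 0]$.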

**Our toy corpus**

| ID | Document |
| --- | --- |
| D1 | Dog bites man. |
| D2 | Man bites dog. |
| D3 | Dog eats meat. |
| D4 | Man eats food. |

Let’s understand this via our toy corpus. We first map each of the six words to a unique ID: `dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6`.

Let’s consider the document D1: “dog bites man”. As per the scheme, each word is represented by a six-dimensional vector.

Dog is represented as `[1 0 0 0 0 0]`, as the word “dog” is mapped to ID 1. Bites is represented as `[0 1 0 0 0 0]`, man as `[0 0 1 0 0 0]`, and so on.

Other documents in the corpus can be represented similarly. Finally, we get these **one-hot encoding** matrices:

```
 V = {dog : 1, bites : 2, man : 3, meat : 4 , food : 5, eats : 6}
 D1 = [ 
   [1 0 0 0 0 0],
   [0 1 0 0 0 0],
   [0 0 1 0 0 0]
 ]
 D2 = [ 
   [0 0 1 0 0 0],
   [0 1 0 0 0 0],
   [1 0 0 0 0 0]
 ]
 D3 = [ 
   [1 0 0 0 0 0],
   [0 0 0 0 0 1],
   [0 0 0 1 0 0]
 ]
 D4 = [ 
   [0 0 1 0 0 0],
   [0 0 0 0 0 1],
   [0 0 0 0 1 0]
 ]
```





In [1]:
documents = [
  "Dog bites man.",
  "Man bites dog.", 
  "Dog eats meat.", 
  "Man eats food."
]

processed_docs = [doc.lower().replace('.', '') for doc in documents]
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

In [2]:
# Build the vocabulary
vocab = {}
count = 0
for doc in processed_docs:
  for word in doc.split():
    if word not in vocab:
      count = count + 1
      vocab[word] = count
print(vocab)

{'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}


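Let's get the one-hot representation for any string based on this vocabulary. Note that the IDs here are assigned in order of first appearance, so `eats`, `meat`, and `food` get IDs 4, 5, and 6 rather than the 6, 4, and 5 used in the hand-written mapping above.

* If the word exists in the vocabulary, its representation is returned.
* If not, a list of zeroes is returned for that word.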

In [3]:
def get_onehot_vector(somestring):
  onehot_encoded = []
  for word in somestring.split():
    temp = [0] * len(vocab)
    if word in vocab:
      temp[vocab[word] - 1] = 1  # -1 is to take care of the fact indexing in array starts from 0 and not 1
    onehot_encoded.append(temp)
  return onehot_encoded 

In [4]:
print(processed_docs[1])
# one hot representation for a text from our corpus.
get_onehot_vector(processed_docs[1])

man bites dog


[[0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]

In [5]:
# one hot representation for a random text, using the above vocabulary
get_onehot_vector('man and dog are good')

[[0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0]]

In [6]:
get_onehot_vector('man and man are good')

[[0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0]]

In [11]:
get_onehot_vector('dog bites man')

[[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]

In [13]:
one_hot_vectors = get_onehot_vector('dog bites man.man eats food')
one_hot_vectors

[[1, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 0, 0],
 [0, 0, 0, 0, 0, 1]]

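Let's show the one-hot vectors in a DataFrame. Note that the third token, `man.man`, is not in the vocabulary because there is no space after the period, so its row is all zeros.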

In [14]:
import pandas as pd

In [17]:
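# Label each row with the token it actually encodes; 'man.man' is a single out-of-vocabulary token
pd.DataFrame(one_hot_vectors, columns=list(vocab), index=['dog', 'bites', 'man.man', 'eats', 'food'])

|  | dog | bites | man | eats | meat | food |
| --- | --- | --- | --- | --- | --- | --- |
| dog | 1 | 0 | 0 | 0 | 0 | 0 |
| bites | 0 | 1 | 0 | 0 | 0 | 0 |
| man.man | 0 | 0 | 0 | 0 | 0 | 0 |
| eats | 0 | 0 | 0 | 1 | 0 | 0 |
| food | 0 | 0 | 0 | 0 | 0 | 1 |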
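
The all-zero row for the string `'dog bites man.man eats food'` comes from whitespace splitting: `man.man` stays a single token that is not in the vocabulary. A minimal sketch of one way to avoid this is to tokenize with a regular expression instead of `split()`. The helper names `tokenize` and `one_hot` are illustrative and not part of the original notebook, and the pattern assumes lowercase ASCII words only.

```python
import re

docs = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

def tokenize(text):
    # Lowercase, then keep only runs of letters, so "man.man" -> ["man", "man"]
    return re.findall(r"[a-z]+", text.lower())

# Build the vocabulary in order of first appearance, with IDs starting at 1
vocab = {}
for doc in docs:
    for word in tokenize(doc):
        vocab.setdefault(word, len(vocab) + 1)

def one_hot(text):
    vectors = []
    for word in tokenize(text):
        vec = [0] * len(vocab)
        if word in vocab:
            vec[vocab[word] - 1] = 1  # IDs start at 1, list indices at 0
        vectors.append(vec)
    return vectors

vectors = one_hot('dog bites man.man eats food')
for token, vec in zip(tokenize('dog bites man.man eats food'), vectors):
    print(token, vec)
```

With this tokenizer the string yields six rows instead of five, and both occurrences of `man` receive the same non-zero vector.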


## One-hot encoding using scikit-learn

In real-world projects, one mostly uses scikit-learn’s implementation of one-hot encoding.

**We encode our corpus as a one-hot numeric array using scikit-learn's OneHotEncoder.**

We will demonstrate:
* **One-Hot Encoding**: each word $w$ in the corpus vocabulary is given a unique integer ID $w_{id}$ between 1 and $|V|$, where $V$ is the corpus vocabulary. Each word is then represented by a $|V|$-dimensional binary vector of 0s and 1s.
* **Label Encoding**: each word $w$ in our corpus is converted into a numeric value between 0 and $n-1$, where $n$ is the number of unique words in our corpus.

The official documentation is available for [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) and [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

In [7]:
S1 = 'dog bites man'
S2 = 'man bites dog'
S3 = 'dog eats meat'
S4 = 'man eats food'

In [8]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

data = [S1.split(), S2.split(), S3.split(), S4.split()]
values = data[0] + data[1] + data[2] + data[3]
print('The data:', values)

The data: ['dog', 'bites', 'man', 'man', 'bites', 'dog', 'dog', 'eats', 'meat', 'man', 'eats', 'food']


In [9]:
# Label Encoding
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print('Label Encoded:', integer_encoded)

Label Encoded: [1 0 4 4 0 1 1 2 5 4 2 3]


In [10]:
# One-Hot Encoding
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(data).toarray()
print('Onehot Encoded Matrix:\n', onehot_encoded)

Onehot Encoded Matrix:
 [[1. 0. 1. 0. 0. 0. 1. 0.]
 [0. 1. 1. 0. 1. 0. 0. 0.]
 [1. 0. 0. 1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 1. 0. 0.]]


In [18]:
pd.DataFrame(onehot_encoded)

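|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |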
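
Note that the matrix produced by `OneHotEncoder` above has eight columns, not six. `OneHotEncoder` treats each column of its input as a separate categorical feature, so passing `data` (one row per sentence, one column per word position) encodes the three positions independently, with 2 + 2 + 4 categories. A minimal sketch of one way to get a per-word encoding over a single shared vocabulary is to pass every token as its own one-column row. The names `sentences`, `tokens`, and `word_encoder` are illustrative and not part of the original notebook.

```python
from sklearn.preprocessing import OneHotEncoder

sentences = ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

# One row per token and a single column: one categorical feature, "word"
tokens = [[word] for sentence in sentences for word in sentence.split()]

word_encoder = OneHotEncoder()
word_encoder.fit(tokens)

# Categories are stored in sorted order, which fixes the column order
print(word_encoder.categories_[0])

# Encode one sentence: one 6-dimensional row per word
d1 = word_encoder.transform([[w] for w in 'dog bites man'.split()]).toarray()
print(d1)
```

The column order here follows the sorted categories (`bites`, `dog`, `eats`, `food`, `man`, `meat`), which matches the integer IDs produced by `LabelEncoder`. With the default settings, transforming a word that was not seen during `fit` raises an error; constructing the encoder with `handle_unknown='ignore'` yields an all-zero row instead, similar to `get_onehot_vector`.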


On the positive side, one-hot encoding is intuitive to understand and straightforward to implement. However, it suffers from a few shortcomings:

* The size of a one-hot vector is directly proportional to the size of the vocabulary, and most real-world corpora have large vocabularies. This results in a sparse representation where most of the entries in the vectors are zeroes, making it computationally inefficient to store, compute with, and learn from (sparsity can also contribute to overfitting).

* This representation does not give a fixed-length representation for text, i.e., if a text has 10 words, you get a longer representation for it as compared to a text with 5 words. For most learning algorithms, we need the feature vectors to be of the same length.

* It treats words as atomic units and has no notion of (dis)similarity between words. For example, consider three words: run, ran, and apple. Run and ran have similar meanings, whereas run and apple do not. But if we take their respective one-hot vectors and compute the Euclidean distance between them, every pair is equally far apart. Thus, one-hot vectors are very poor at capturing the meaning of a word in relation to other words.

* Say we train a model using our toy corpus. At runtime, we get a sentence: “man eats fruits.” The training data didn’t include “fruits”, and there’s no way to represent it in our model. This is known as the out-of-vocabulary (OOV) problem. A one-hot encoding scheme cannot handle this. The only way is to retrain the model: start by expanding the vocabulary, give an ID to the new word, and so on.
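
The equal-distance claim in the list above can be checked directly. The sketch below uses words from the toy corpus rather than run, ran, and apple (which are not in our vocabulary); the helper names `one_hot` and `euclidean` are illustrative and not part of the original notebook.

```python
import math

# Vocabulary as built earlier from the toy corpus (IDs start at 1)
vocab = {'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab[word] - 1] = 1
    return vec

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# "meat" and "food" are related in meaning; "dog" and "food" much less so
d_meat_food = euclidean(one_hot('meat'), one_hot('food'))
d_dog_food = euclidean(one_hot('dog'), one_hot('food'))
d_dog_man = euclidean(one_hot('dog'), one_hot('man'))

print(d_meat_food, d_dog_food, d_dog_man)
```

Any two distinct one-hot vectors differ in exactly two positions, so the distance is always $\sqrt{2}$ regardless of how related the words are.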

> These days, the one-hot encoding scheme is seldom used.

Some of these shortcomings can be addressed by the bag-of-words approach.