Word-level one-hot encoding:

In [0]:
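import numpy as np

samples = ['The cat sat on the mat and ate chips, fish, chocolates.', 'The dog ate my homework.']

# Build a word -> integer index. Tokens come from a plain whitespace split,
# so case and attached punctuation are preserved ('The' vs 'the', 'chips,').
token_index = {}
for sample in samples:
  for word in sample.split():
    if word not in token_index:
      # Indices start at 1; index 0 is never assigned to a word.
      token_index[word] = len(token_index) + 1
token_index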

{'The': 1,
 'and': 7,
 'ate': 8,
 'cat': 2,
 'chips,': 9,
 'chocolates.': 11,
 'dog': 12,
 'fish,': 10,
 'homework.': 14,
 'mat': 6,
 'my': 13,
 'on': 4,
 'sat': 3,
 'the': 5}
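
The index above keeps 'The' and 'the' as separate tokens and leaves punctuation attached ('chips,', 'homework.'). A minimal sketch of one possible normalization step follows, assuming lowercasing and stripping ASCII punctuation is acceptable for the task. `tokenize` and `normalized_index` are illustrative names, not part of the original notebook.

```python
import string

samples = ['The cat sat on the mat and ate chips, fish, chocolates.',
           'The dog ate my homework.']

def tokenize(text):
  """Lowercase the text, drop ASCII punctuation, and split on whitespace."""
  table = str.maketrans('', '', string.punctuation)
  return text.lower().translate(table).split()

# Same indexing scheme as above (indices start at 1), applied to normalized tokens.
normalized_index = {}
for sample in samples:
  for word in tokenize(sample):
    if word not in normalized_index:
      normalized_index[word] = len(normalized_index) + 1

print(normalized_index)
```

With this normalization the vocabulary is smaller, because case variants and punctuation variants of a word collapse into one entry.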

In [0]:
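# Only the first max_length words of each sample are encoded.
max_length = 10
# Shape: (samples, max_length, vocabulary size + 1 for the unused index 0)
results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))
results.shape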

(2, 10, 15)

In [0]:
for i, sample in enumerate(samples):
  # Truncate to max_length words; shorter samples leave trailing all-zero rows.
  for j, word in enumerate(sample.split()[:max_length]):
    index = token_index[word]
    results[i, j, index] = 1
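
One way to sanity-check an encoding like this is to map each one-hot row back to its word. The sketch below is self-contained and uses plain lists rather than NumPy arrays so it runs without extra dependencies. `encoded`, `reverse_index`, and `decoded` are illustrative names.

```python
sample = 'The dog ate my homework.'
max_length = 10

# Build the word -> index mapping exactly as in the cells above.
token_index = {}
for word in sample.split():
  if word not in token_index:
    token_index[word] = len(token_index) + 1

vocab_size = max(token_index.values()) + 1  # index 0 is never assigned

# One row per word position; positions past the end of the sample stay all-zero.
encoded = [[0] * vocab_size for _ in range(max_length)]
for j, word in enumerate(sample.split()[:max_length]):
  encoded[j][token_index[word]] = 1

# Invert the mapping and read the position of the single 1 in each non-empty row.
reverse_index = {index: word for word, index in token_index.items()}
decoded = [reverse_index[row.index(1)] for row in encoded if 1 in row]
print(decoded)
```

Rows that contain no 1 are padding, which is why they are skipped when decoding.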

Character-level one-hot encoding:

In [0]:
import string

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable  # all printable ASCII characters
# Map each character to an integer index starting at 1.
token_index = dict(zip(characters, range(1, len(characters) + 1)))
max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
  # Truncate to max_length characters so j never exceeds the array bounds.
  for j, character in enumerate(sample[:max_length]):
    index = token_index[character]
    results[i, j, index] = 1
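
`string.printable` covers only ASCII, so a character-level encoder needs some policy for anything else. The sketch below shows one option, under the assumption that the otherwise unused index 0 can serve as an "unknown character" slot. `char_index`, `encode_chars`, and `unknown_index` are hypothetical names introduced for illustration.

```python
import string

characters = string.printable
# Character -> index, starting at 1 so that 0 stays free.
char_index = dict(zip(characters, range(1, len(characters) + 1)))

def encode_chars(text, max_length=50, unknown_index=0):
  """Return one one-hot row (a list) per character, truncated to max_length."""
  width = len(characters) + 1
  rows = []
  for character in text[:max_length]:
    row = [0] * width
    # Characters outside string.printable fall back to unknown_index (an assumption).
    row[char_index.get(character, unknown_index)] = 1
    rows.append(row)
  return rows

rows = encode_chars('caf\u00e9')  # the last character is not in string.printable
print([row.index(1) for row in rows])
```

Whether unknown characters should share a slot, be skipped, or raise an error depends on the data; this sketch only illustrates the first choice.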

Keras for word-level one-hot encoding:


In [0]:
from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# We create a tokenizer, configured to only take
# into account the top-1000 most common words
tokenizer = Tokenizer(num_words=1000)
# This builds the word index
tokenizer.fit_on_texts(samples)
tokenizer

Using TensorFlow backend.


<keras_preprocessing.text.Tokenizer at 0x7ff7895e1be0>

In [0]:
# This turns strings into lists of integer indices.
sequences = tokenizer.texts_to_sequences(samples)
sequences

[[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]

In [0]:
# You can also get the one-hot binary representation directly.
# Modes other than 'binary' are supported as well: 'count', 'freq', and 'tfidf'.
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

one_hot_results

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])
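
To make the two outputs above less opaque, here is a rough pure-Python approximation of what the tokenizer does for these samples: lowercase, strip punctuation, rank words by frequency, then set one column per word index. This is a sketch, not the Keras implementation; in particular, stripping `string.punctuation` is an assumption that only approximates the tokenizer's default filter, and `tokenize`, `counts`, and `matrix` are illustrative names.

```python
from collections import Counter
import string

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
num_words = 1000

def tokenize(text):
  """Lowercase the text, drop ASCII punctuation, and split on whitespace."""
  table = str.maketrans('', '', string.punctuation)
  return text.lower().translate(table).split()

# Rank words by frequency; the most frequent word gets index 1.
# Ties keep first-seen order, which Counter.most_common preserves.
counts = Counter(word for sample in samples for word in tokenize(sample))
word_index = {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Integer sequences, one list per sample.
sequences = [[word_index[word] for word in tokenize(sample)] for sample in samples]

# Binary matrix: one row per sample, with a 1.0 in every column whose word occurs.
matrix = [[0.0] * num_words for _ in samples]
for row, sequence in zip(matrix, sequences):
  for index in sequence:
    if index < num_words:
      row[index] = 1.0

print(sequences)
print(matrix[0][:10])
```

Unlike the per-position encodings earlier in the notebook, this matrix has one row per sample, so word order within a sample is lost.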

In [0]:
# This is how you can recover the word index that was computed
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 9 unique tokens.


Word-level one-hot encoding with the hashing trick:

In [0]:
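samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# Words are hashed into a fixed number of buckets instead of being indexed explicitly.
# If the vocabulary approaches this size, distinct words will collide often.
dimensionality = 1000
max_length = 10
results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
  for j, word in enumerate(sample.split()[:max_length]):
    # Python salts str hashes per process, so these indices change between runs
    # unless PYTHONHASHSEED is fixed.
    index = abs(hash(word)) % dimensionality
    results[i, j, index] = 1
    print(index)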

610
958
812
643
622
718
610
238
371
65
879
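
The indices printed above come from Python's built-in `hash()`, which is salted per process for strings, so a later run will generally print different numbers. If the encoding has to be reproducible (for example, between training and inference), a deterministic hash can be substituted. The sketch below uses SHA-256 from `hashlib` as one such choice; `stable_hash_index` is a hypothetical helper name, and SHA-256 is an assumption rather than anything the original notebook uses.

```python
import hashlib

def stable_hash_index(word, dimensionality=1000):
  """Map a word to a bucket in [0, dimensionality) using an unsalted hash."""
  digest = hashlib.sha256(word.encode('utf-8')).hexdigest()
  return int(digest, 16) % dimensionality

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
max_length = 10

# One list of bucket indices per sample, truncated to max_length words.
indices = [[stable_hash_index(word) for word in sample.split()[:max_length]]
           for sample in samples]
print(indices)
```

The same word always lands in the same bucket across runs and machines. Collisions between different words are still possible, exactly as with the built-in hash.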
