### One-Hot Encoding

One-hot encoding is a simple technique used in text analysis to convert words into numerical form so that deep learning models can process them. Each unique word in a vocabulary is represented as a vector of zeros with a single one at the index assigned to that word. This gives every word a distinct, machine-readable representation, but it has important limitations. Each vector is as long as the vocabulary and almost entirely zeros, which increases memory and computation costs. More importantly, one-hot encoding does not capture meaning or relationships between words, so similar words appear completely unrelated. Because of these limitations, it is often replaced by word embeddings in modern deep learning systems.
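To make the last limitation concrete, here is a minimal sketch that compares one-hot vectors with cosine similarity. The three-word vocabulary and the helper name `cosine_similarity` are assumptions chosen for illustration and do not come from the examples in this section.

```python
import numpy as np

# Hypothetical toy vocabulary, used only for this illustration.
vocabulary = ["bad", "good", "great"]

# Row i of the identity matrix is the one-hot vector for word i.
identity = np.eye(len(vocabulary), dtype=int)
vectors = {word: identity[i] for i, word in enumerate(vocabulary)}

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_good_great = cosine_similarity(vectors["good"], vectors["great"])
sim_good_bad = cosine_similarity(vectors["good"], vectors["bad"])
sim_good_good = cosine_similarity(vectors["good"], vectors["good"])

print("good vs great:", sim_good_great)
print("good vs bad:  ", sim_good_bad)
print("good vs good: ", sim_good_good)
```

Any two different words get a similarity of exactly zero, so under this representation "good" is no closer to "great" than it is to "bad".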

In [1]:
import numpy as np

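def one_hot_encording(sentence):
    # Lowercase and split on whitespace; punctuation stays attached to words.
    words = sentence.lower().split()
    # Sorted unique words give each word a stable column index.
    vocabulary = sorted(set(words))
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    # One row per word in the sentence, one column per vocabulary entry.
    one_hot_matrix = np.zeros((len(words), len(vocabulary)), dtype=int)
    for i, word in enumerate(words):
        one_hot_matrix[i, word_to_index[word]] = 1
    return one_hot_matrix, vocabulary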

This function lowercases a sentence and splits it on whitespace, then builds a sorted vocabulary of the unique words and assigns each word an index. It creates a matrix in which each row corresponds to a word in the sentence, in order, and each column corresponds to a vocabulary entry. For each word, the function places a 1 in the column for that word’s index and leaves 0s elsewhere, producing a one-hot encoded representation. It returns both the matrix and the vocabulary, making the text usable for basic deep learning or text analysis tasks. Because the split is on whitespace only, punctuation stays attached to the neighboring word, so a token such as 'restaurant?' becomes its own vocabulary entry.
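Because every row contains exactly one 1, the encoding can be reversed by taking the column index of that 1 in each row. The sketch below illustrates this with a small hand-written matrix and vocabulary shaped like the function's two return values; the three-word sentence "should we go" is an assumption made for the example.

```python
import numpy as np

# Hypothetical inputs in the same form as the function's return values:
# a sorted vocabulary and one row per word of the sentence "should we go".
vocabulary = ["go", "should", "we"]
one_hot_matrix = np.array([
    [0, 1, 0],  # "should"
    [0, 0, 1],  # "we"
    [1, 0, 0],  # "go"
])

# Sanity check: a valid one-hot matrix has exactly one 1 per row.
row_sums = one_hot_matrix.sum(axis=1)

# argmax along each row returns the column holding the 1.
indices = one_hot_matrix.argmax(axis=1)
decoded = [vocabulary[i] for i in indices]

print(decoded)
```

The round trip recovers the word order of the sentence, which shows that the matrix stores word identity and position but nothing more.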

In [2]:
sentence = "Should we go to a pizzeria or do you prefer a restaurant?"
one_hot_matrix, vocabulary = one_hot_encording(sentence)
print("Vocabulary:", vocabulary)
print("One_hot_encording_matrix:\n", one_hot_matrix)

Vocabulary: ['a', 'do', 'go', 'or', 'pizzeria', 'prefer', 'restaurant?', 'should', 'to', 'we', 'you']
One_hot_encording_matrix:
 [[0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0]
 [0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0]
 [1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 1 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0]]
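The vocabulary above contains 'restaurant?' with the question mark attached, because the sentence is split on whitespace only. One possible way to avoid this is to tokenize with a regular expression before building the vocabulary. The sketch below is not the function used in this section. The `tokenize` helper and its pattern (runs of lowercase letters, digits, or apostrophes) are assumptions, and other tokenization rules are equally reasonable.

```python
import re

def tokenize(sentence):
    """Lowercase the text and keep only runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

sentence = "Should we go to a pizzeria or do you prefer a restaurant?"
tokens = tokenize(sentence)
vocabulary = sorted(set(tokens))

print("Tokens:", tokens)
print("Vocabulary:", vocabulary)

# Differently punctuated forms now map to the same token.
print(tokenize("restaurant?") == tokenize("restaurant,"))
```

With this change the vocabulary entry is 'restaurant', so the same word followed by a comma or a period would not get a separate column.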


In [3]:
sentence = "Should we go to a pizzeria or do you prefer a restaurant?, what about a hotel, or just a normal dining?"
one_hot_matrix, vocabulary = one_hot_encording(sentence)
print("Vocabulary:", vocabulary)
print("One_hot_encording_matrix:\n", one_hot_matrix)

Vocabulary: ['a', 'about', 'dining?', 'do', 'go', 'hotel,', 'just', 'normal', 'or', 'pizzeria', 'prefer', 'restaurant?,', 'should', 'to', 'we', 'what', 'you']
One_hot_encording_matrix:
 [[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
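The matrix above has 21 rows and 17 columns, and only one entry per row is non-zero. The following sketch works through that arithmetic and then repeats it for a larger, hypothetical case. The helper name `sparsity_report`, the figure of 8 bytes per entry (a 64-bit integer, which is what `dtype=int` gives on many platforms), and the 1,000-token, 50,000-word scenario are all assumptions for illustration.

```python
def sparsity_report(num_tokens, vocab_size, bytes_per_entry=8):
    """Summarize the size of a dense one-hot matrix (one row per token)."""
    total = num_tokens * vocab_size
    nonzero = num_tokens  # exactly one 1 per row
    return {
        "total_entries": total,
        "nonzero_entries": nonzero,
        "zero_fraction": 1 - nonzero / total,
        "dense_bytes": total * bytes_per_entry,
    }

# The 21 x 17 matrix printed above.
small = sparsity_report(num_tokens=21, vocab_size=17)

# Hypothetical larger case: 1,000 tokens over a 50,000-word vocabulary.
large = sparsity_report(num_tokens=1_000, vocab_size=50_000)

for name, report in [("small", small), ("large", large)]:
    print(name, report)
```

The fraction of zeros is always 1 - 1/V for a vocabulary of size V, so the matrix becomes sparser as the vocabulary grows, while the dense storage cost grows in proportion to V.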


#### Advantages
1. Simplicity: it is easy to implement and understand.
2. Deterministic representation: each word has a unique, unambiguous vector.
3. No assumptions: it does not rely on prior knowledge about word meanings.
4. Good for small vocabularies: it works efficiently when the vocabulary is small.
5. Compatibility: it can be used as a straightforward input for classical machine learning models or basic neural networks.

In short, one-hot encoding is simple, clear, and deterministic, making it a good starting point for learning text representation despite its scalability limitations.

#### Disadvantages

The main downside of one-hot encoding is that its vectors become longer and sparser as the vocabulary grows, which increases memory usage and computation cost. It also fails to capture any semantic meaning or relationships between words, so similar words (like "good" and "great") are treated as completely unrelated. Additionally, a word that was not in the vocabulary when it was built has no column of its own, so unseen words cannot be represented without extra handling. Together, these issues mean one-hot encoding does not scale efficiently to large text datasets, making it impractical for most modern deep learning applications compared to word embeddings.
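One common workaround for unseen words is to reserve a single extra column for anything outside the vocabulary. The sketch below shows the idea. It is not part of the function defined earlier, and the `<unk>` placeholder, the helper names `build_index` and `encode`, and the two short sentences are all assumptions for illustration.

```python
import numpy as np

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary words

def build_index(words):
    """Map each known word to a column, with column 0 reserved for UNK."""
    vocabulary = [UNK] + sorted(set(words) - {UNK})
    return {word: i for i, word in enumerate(vocabulary)}

def encode(words, word_to_index):
    """One-hot encode words, sending unknown words to the UNK column."""
    matrix = np.zeros((len(words), len(word_to_index)), dtype=int)
    for row, word in enumerate(words):
        column = word_to_index.get(word, word_to_index[UNK])
        matrix[row, column] = 1
    return matrix

# Build the vocabulary from one sentence, then encode another one.
word_to_index = build_index("we go to a pizzeria".split())
encoded = encode("we go to a hotel".split(), word_to_index)

print(word_to_index)
print(encoded)
```

Here "hotel" was never seen, so its row has a 1 in the `<unk>` column. This keeps encoding from failing, but every unseen word collapses to the same vector, so the underlying limitation remains.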