# One Hot Encoding

In [1]:
corpus = ["i love nlp", " i teach gen ai ", "i am working with euron"]

In [2]:
unique_words = list(set(" ".join(corpus).split()))

In [3]:
unique_words

['nlp', 'euron', 'gen', 'working', 'teach', 'i', 'ai', 'love', 'with', 'am']

In [4]:
for i in enumerate(unique_words):
    print(i)

(0, 'nlp')
(1, 'euron')
(2, 'gen')
(3, 'working')
(4, 'teach')
(5, 'i')
(6, 'ai')
(7, 'love')
(8, 'with')
(9, 'am')


In [5]:
word_to_index = {word:i for i, word in enumerate(unique_words)}

In [6]:
unique_words

['nlp', 'euron', 'gen', 'working', 'teach', 'i', 'ai', 'love', 'with', 'am']

In [7]:
word_to_index

{'nlp': 0,
 'euron': 1,
 'gen': 2,
 'working': 3,
 'teach': 4,
 'i': 5,
 'ai': 6,
 'love': 7,
 'with': 8,
 'am': 9}
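A side note on reproducibility: the order of a Python `set` of strings is not guaranteed to be the same from one interpreter run to the next, so the indices shown above may differ when the notebook is re-run. A minimal sketch of one way to pin the order, assuming alphabetical sorting is acceptable:

```python
corpus = ["i love nlp", " i teach gen ai ", "i am working with euron"]

# sorted() fixes the order, so the same corpus always yields the same indices
unique_words = sorted(set(" ".join(corpus).split()))
word_to_index = {word: i for i, word in enumerate(unique_words)}

print(unique_words)
print(word_to_index["nlp"])
```

With this variant the indices differ from the ones printed above, but they stay stable across runs.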

In [8]:
one_hot_vector = []
for sentence in corpus:
    print(sentence)
    sentence_vector = []
    for word in sentence.split():
        vector = [0] * len(unique_words)    # all-zero vector, one slot per vocabulary word
        print(vector)                       # printed before the 1 is set, so it always shows all zeros
        vector[word_to_index[word]] = 1
        #print(word_to_index[word])
        #print(vector[word_to_index[word]])
        sentence_vector.append(vector)
    one_hot_vector.append(sentence_vector)

i love nlp
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 i teach gen ai 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
i am working with euron
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [9]:
one_hot_vector

[[[0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 [[0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]],
 [[0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]]]

Disadvantages of One Hot Encoding - 
1. Very sparse data: every vector is all zeros except for a single 1
2. Even for short sentences the matrix can be huge, because the number of columns equals the number of unique words in the vocabulary
3. No contextual or semantic relationship between the generated vectors

One Hot Encoding - 
One Hot Encoding is called so because it represents categorical variables as binary vectors. Each category is converted into a vector where one element is "hot" (1) and all others are "cold" (0). This way, each category is uniquely identified without implying any ordinal relationship.
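The definition above can be sketched as a small helper. The function name `one_hot` and the three-word vocabulary are illustrative assumptions, not part of the notebook. The sketch also shows how the share of zeros grows with vocabulary size.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1 if i == index else 0 for i in range(size)]

vocab = ["ai", "love", "nlp"]  # illustrative vocabulary (assumption)
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Every vector holds exactly one 1, so the fraction of zeros is 1 - 1/len(vocab)
zero_fraction = 1 - 1 / len(vocab)

print(vectors)
print(zero_fraction)
```

For a realistic vocabulary of tens of thousands of words, `zero_fraction` is very close to 1, which is the sparsity problem listed above.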

# Bag of Words (BoW) 
A frequency-based encoding technique: each document is represented by how many times each vocabulary word occurs in it.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
vectorizer = CountVectorizer(stop_words=None, lowercase=False, vocabulary=unique_words)

In [12]:
corpus = ["i love nlp nlp", "love i nlp nlp", " i teach gen ai gen ai gen ai", "i am working with euron"] 
# Note: BoW ignores word order, so "i love nlp nlp" and "love i nlp nlp" get identical vectors

In [13]:
X = vectorizer.fit_transform(corpus)

In [14]:
corpus

['i love nlp nlp',
 'love i nlp nlp',
 ' i teach gen ai gen ai gen ai',
 'i am working with euron']

In [15]:
X.toarray()

array([[2, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [2, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 3, 0, 1, 0, 3, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0, 0, 1, 1]])
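Rows 0 and 1 of this output are identical even though the two sentences order their words differently. A minimal hand-rolled sketch of the same counting idea, using only the standard library, makes this visible. The helper name `bag_of_words` and the three-word vocabulary are assumptions for illustration. The sketch splits on whitespace, so unlike the `CountVectorizer` output it also counts the one-letter token `i`.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

vocabulary = ["i", "love", "nlp"]  # illustrative vocabulary (assumption)
a = bag_of_words("i love nlp nlp", vocabulary)
b = bag_of_words("love i nlp nlp", vocabulary)

print(a, b, a == b)
```

Both sentences map to the same count vector because only frequencies are kept; the positions of the words are discarded.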

In [16]:
# print all unique features it has considered
vectorizer.get_feature_names_out()

array(['nlp', 'euron', 'gen', 'working', 'teach', 'i', 'ai', 'love',
       'with', 'am'], dtype=object)
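The `'i'` column (index 5) is zero in every row of `X.toarray()`, although each sentence contains the word. The likely cause is the default `token_pattern` of `CountVectorizer`, `r"(?u)\b\w\w+\b"`, which only keeps tokens of two or more word characters. The sketch below applies that pattern with the standard `re` module to show the effect. The alternative pattern is an assumption about what one might pass as `token_pattern` to keep single-letter words.

```python
import re

default_pattern = r"(?u)\b\w\w+\b"   # scikit-learn's default: two or more word characters
relaxed_pattern = r"(?u)\b\w+\b"     # assumption: accept one-character tokens as well

sentence = "i love nlp nlp"
default_tokens = re.findall(default_pattern, sentence)
relaxed_tokens = re.findall(relaxed_pattern, sentence)

print(default_tokens)
print(relaxed_tokens)
```

Passing `token_pattern=r"(?u)\b\w+\b"` when constructing the vectorizer should make the `'i'` column non-zero.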

# TF-IDF
Term Frequency - Inverse Document Frequency

In [17]:
# TF  = number of times the word appears in a document / total number of words in that document
# IDF = log(total number of documents / number of documents containing the word)

# TF-IDF = TF * IDF -> a numeric weight for a word within a document

# Words that appear in many documents get a lower weight than words found in only a few documents.
# TF-IDF does not reduce sparsity: there is still one column per vocabulary word, as in Bag of Words.
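The formulas above can be worked through by hand. The sketch below is a direct, unsmoothed translation using the natural logarithm. The log base, the function names, and the small corpus are assumptions for illustration.

```python
import math

corpus = ["pappu nach nahi sakta", "pappu loves dancing", "pappu stops dancing"]
docs = [sentence.split() for sentence in corpus]

def tf(word, doc):
    """Occurrences of the word divided by the document length."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """log(number of documents / number of documents containing the word)."""
    containing = sum(word in doc for doc in docs)
    if containing == 0:
        return 0.0  # word never seen: treated as zero weight in this sketch
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf_idf("pappu", docs[0], docs))  # word present in every document
print(tf_idf("nach", docs[0], docs))   # word present in a single document
```

A word that occurs in every document gets an IDF of `log(1) = 0`, so its weight vanishes, while a word unique to one document keeps a positive weight.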

In [18]:
#corpus = ["i love nlp nlp", " i teach gen ai gen ai gen ai", "i am working with euron"]
#corpus = ["i love nlp", "i love love nlp", " i teach gen ai gen ai gen ai", "i am working with euron", "unique set"]

corpus = ["pappu nach nahi sakta", "pappu loves dancing", "pappu stops dancing"]
corpus

['pappu nach nahi sakta', 'pappu loves dancing', 'pappu stops dancing']

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
vect_tf_idf = TfidfVectorizer()

In [21]:
X = vect_tf_idf.fit_transform(corpus)

In [22]:
X.toarray()

array([[0.        , 0.        , 0.54645401, 0.54645401, 0.32274454,
        0.54645401, 0.        ],
       [0.54783215, 0.72033345, 0.        , 0.        , 0.42544054,
        0.        , 0.        ],
       [0.54783215, 0.        , 0.        , 0.        , 0.42544054,
        0.        , 0.72033345]])
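These numbers differ from what the plain TF * IDF formula would give. By default, `TfidfVectorizer` uses the raw term count as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces the array in pure Python under those defaults. It assumes whitespace tokenization, which matches here because every token is lowercase and at least two characters long.

```python
import math

corpus = ["pappu nach nahi sakta", "pappu loves dancing", "pappu stops dancing"]
docs = [sentence.split() for sentence in corpus]

vocab = sorted({word for doc in docs for word in doc})  # columns in alphabetical order
n = len(docs)

# Document frequency and smoothed IDF per word
df = {word: sum(word in doc for doc in docs) for word in vocab}
idf = {word: math.log((1 + n) / (1 + df[word])) + 1 for word in vocab}

def tfidf_row(doc):
    """Raw count times smoothed IDF, then L2-normalized."""
    raw = [doc.count(word) * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

rows = [tfidf_row(doc) for doc in docs]

print(vocab)
for row in rows:
    print([round(value, 8) for value in row])
```

The column order is `dancing, loves, nach, nahi, pappu, sakta, stops`, so `pappu`, which appears in every document, is the fifth column and carries the smallest non-zero weight in each row.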

### Advantages and Disadvantages of TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) has several advantages and disadvantages:

Advantages:

- Simplicity: Easy to understand and implement.
- Effectiveness: Works well for text classification and information retrieval.
- Relevance: Highlights important words in documents by reducing the weight of common terms.

Disadvantages:

- Sparsity: Produces high-dimensional sparse vectors (one dimension per vocabulary word), which cost memory and computation.
- Context Ignorance: Does not consider word order or context, potentially losing semantic meaning.
- Static: The vocabulary and weights are fixed at fit time, so they do not adapt to changes in language or context until the vectorizer is refitted.

So far, the following encoding techniques have been discussed - 
1. One Hot Encoding [Worst Technique]
2. Bag of Words
3. TF-IDF

The problems with all of the above techniques are -
- They cannot capture the order of the words
- They cannot capture the semantics or syntax of a sentence, so they cannot comprehend its meaning
- They cannot establish relationships between words or reflect grammar
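The missing relationship between words can be made concrete with one-hot vectors: the dot product of any two different words is 0, so "love" is exactly as far from "like" as it is from "euron". A minimal sketch, with an illustrative three-word vocabulary chosen as an assumption:

```python
vocab = ["love", "like", "euron"]  # illustrative vocabulary (assumption)

# One-hot vector for each word: 1 at the word's own index, 0 elsewhere
vectors = {
    word: [int(i == j) for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

same = dot(vectors["love"], vectors["love"])
related = dot(vectors["love"], vectors["like"])
unrelated = dot(vectors["love"], vectors["euron"])

print(same, related, unrelated)
```

The representation cannot tell a near-synonym from an unrelated word, which is the gap that dense word embeddings aim to close.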

So, the next technique to discuss is Word2Vec.