## General Word Embeddings

**1. Bag of Words (BoW) and TF-IDF**

**2. Latent Semantic Analysis (LSA)**

**3. Word2Vec developed by Google**

**4. GloVe developed by Stanford University**

**5. FastText developed by Facebook**

**6. Contextual Embeddings**

- ELMo (Embeddings from Language Models): Developed by the Allen Institute for AI

- BERT (Bidirectional Encoder Representations from Transformers): Developed by Google

- GPT (Generative Pre-trained Transformer): Developed by OpenAI

**7. Current Trending Models**

- GPT-2, GPT-3, and later models in the GPT series by OpenAI

- BERT Variants: Models like RoBERTa, ALBERT, and DistilBERT have improved upon BERT

- T5 (Text-To-Text Transfer Transformer): Developed by Google, T5 treats all NLP tasks as text-to-text problems, unifying multiple tasks under a single framework.

- CLIP (Contrastive Language-Image Pretraining): Developed by OpenAI, CLIP learns representations that align images and text, enabling models to perform tasks like zero-shot image classification.
    
- Mistral AI: generative AI (GenAI) models

- Google Gemini: generative AI (GenAI) models

- Amazon Bedrock: a managed AWS service for accessing generative AI foundation models

**1. A Quick Example**

Let’s look at a simple example to make the Bag-of-Words idea concrete. Suppose we want to analyze reviews of Game of Thrones:

- **Review 1:** Game of Thrones is an amazing tv series!

- **Review 2:** Game of Thrones is the best tv series!

- **Review 3:** Game of Thrones is so great

The table below shows the Bag-of-Words counts for these reviews.

**Vocabulary**

- The vocabulary is the set of unique words in the data, like a dictionary.

- Every dataset has its own vocabulary.

- The three reviews above have their own vocabulary too.

![Bag-of-Words with Python example](https://cdn-images-1.medium.com/max/1000/1*cHKkqYIhaYuYwuuhBiSlHw.png)

Each row corresponds to a different review, while the columns are the unique words contained in the three documents.
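
A minimal pure-Python sketch of how such a table could be built from the three reviews, without removing stop words. The tokenizer (lowercase, letters only) and the names `reviews`, `tokenize`, `vocabulary`, and `rows` are illustrative assumptions, not part of the original example.

```python
import re
from collections import Counter

reviews = [
    "Game of Thrones is an amazing tv series!",
    "Game of Thrones is the best tv series!",
    "Game of Thrones is so great",
]

def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep runs of letters only
    return re.findall(r"[a-z]+", text.lower())

# The vocabulary is the sorted set of unique words across all reviews
vocabulary = sorted({word for review in reviews for word in tokenize(review)})

# One row per review, one column per vocabulary word
rows = []
for review in reviews:
    counts = Counter(tokenize(review))
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

Every review becomes a vector of the same length as the vocabulary, which is what lets the rows be stacked into a single table.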

**Sentence-1**

- sky is nice

**Sentence-2**

- clouds are nice

**Sentence-3**

- Sky is nice and Clouds are nice


In [9]:
!pip install nltk



In [10]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [7]:
from nltk.corpus import stopwords
stopwords.words('hinglish')

['a',
 'aadi',
 'aaj',
 'aap',
 'aapne',
 'aata',
 'aati',
 'aaya',
 'aaye',
 'ab',
 'abbe',
 'abbey',
 'abe',
 'abhi',
 'able',
 'about',
 'above',
 'accha',
 'according',
 'accordingly',
 'acha',
 'achcha',
 'across',
 'actually',
 'after',
 'afterwards',
 'again',
 'against',
 'agar',
 'ain',
 'aint',
 "ain't",
 'aisa',
 'aise',
 'aisi',
 'alag',
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'andar',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'ap',
 'apan',
 'apart',
 'apna',
 'apnaa',
 'apne',
 'apni',
 'appear',
 'are',
 'aren',
 'arent',
 "aren't",
 'around',
 'arre',
 'as',
 'aside',
 'ask',
 'asking',
 'at',
 'aur',
 'avum',
 'aya',
 'aye',
 'baad',
 'baar',
 'bad',
 'bahut',
 'bana',
 'banae',
 'banai',
 'banao',
 'banaya',
 'banaye',
 'banayi',
 'banda',
 'bande',
 'bandi',
 'bane',
 'bani',
 'bas',
 'bata',
 'bat

In [8]:
from nltk.corpus import stopwords

sentences = ['sky is nice', 'clouds are nice', 'Sky is nice and Clouds are nice']

# Build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))

cleaned_sentence = []

for sentence in sentences:
    # Lowercase so that 'Sky' and 'sky' are treated as the same word,
    # then split the sentence into words
    words = sentence.lower().split()

    # Remove stop words
    words = [w for w in words if w not in stop_words]

    # Join the remaining words back into a sentence and store it
    cleaned_sentence.append(" ".join(words))

# Print the preprocessed sentences
print(cleaned_sentence)

['sky nice', 'clouds nice', 'sky nice clouds nice']


After preprocessing and removing stop words:

**Sentence-1**

- sky nice

**Sentence-2**

- clouds nice

**Sentence-3**

- sky nice clouds nice

**Vocabulary**

   - clouds, nice, sky
   

|Vocabulary|Frequency|
|----------------|-----|
|clouds|2|
|nice|4|
|sky|2|


The number of vocabulary words becomes the number of features:


|Sentence|feature1 (clouds)|feature2 (nice)|feature3 (sky)|
|--------|-----|---|----|
|sky nice|0|1|1|
|clouds nice|1|1|0|
|sky nice clouds nice|1|2|1|

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=3)  # keep at most the 3 most frequent terms
Bagofwords = cv.fit_transform(cleaned_sentence)

Bagofwords.toarray()

array([[0, 1, 1],
       [1, 1, 0],
       [1, 2, 1]], dtype=int64)

In [9]:
# Same matrix as above, printed without the array(...) wrapper
print(Bagofwords.toarray())

[[0 1 1]
 [1 1 0]
 [1 2 1]]


In [6]:
import pandas as pd
pd.DataFrame(Bagofwords.toarray(), columns=['clouds', 'nice', 'sky'])

| |clouds|nice|sky|
|---|---|---|---|
|0|0|1|1|
|1|1|1|0|
|2|1|2|1|


In [7]:
cv.vocabulary_
# Unique words with their column index
# clouds is the first feature (index 0)
# nice is the second feature (index 1)
# sky is the third feature (index 2)

{'sky': 2, 'nice': 1, 'clouds': 0}
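
The dictionary above is not printed in column order, so it can help to sort it by index. A small sketch in plain Python, assuming the mapping is copied by hand from the output above; the names `feature_names` and `labelled` are illustrative.

```python
# Mapping copied from the cv.vocabulary_ output above
vocabulary = {'sky': 2, 'nice': 1, 'clouds': 0}

# Sort the words by their column index to recover the column order
feature_names = sorted(vocabulary, key=vocabulary.get)
print(feature_names)

# Label one row of the count matrix with its column names
row = [1, 2, 1]  # third sentence: 'sky nice clouds nice'
labelled = dict(zip(feature_names, row))
print(labelled)
```

Sorting by index rather than typing the column names by hand avoids mislabelled columns when the vocabulary changes.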

In [6]:
Bagofwords.view()

array([[0, 1, 1],
       [1, 1, 0],
       [1, 2, 1]], dtype=int64)

In [9]:
### All together
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['Game of Thrones is an amazing tv series!', 
             'Game of Thrones is the best tv series!', 
             'Game of Thrones is so great']

# Build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))

cleaned_sentence = []

for sentence in sentences:
    # Lowercase so that uppercase and lowercase forms count as the same word,
    # then split the sentence into words
    words = sentence.lower().split()

    # Remove stop words, then join the remaining words back into a sentence
    words = [w for w in words if w not in stop_words]
    cleaned_sentence.append(" ".join(words))

# Print the preprocessed sentences
print(cleaned_sentence)

cv = CountVectorizer()  # no max_features here, so every term is kept
Bagofwords = cv.fit_transform(cleaned_sentence).toarray()
print(cv.vocabulary_)   # the default tokenizer drops punctuation such as '!'
print(Bagofwords)

['game thrones amazing tv series!', 'game thrones best tv series!', 'game thrones great']
{'game': 2, 'thrones': 5, 'amazing': 0, 'tv': 6, 'series': 4, 'best': 1, 'great': 3}
[[1 0 1 0 1 1 1]
 [0 1 1 0 1 1 1]
 [0 0 1 1 0 1 0]]


In [11]:
cv.vocabulary_
# Index mapping

{'game': 2,
 'thrones': 5,
 'amazing': 0,
 'tv': 6,
 'series': 4,
 'best': 1,
 'great': 3}

In [10]:
### All together
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['sky is nice', 'clouds are nice', 'Sky is nice and Clouds are nice']

# Build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))

cleaned_sentence = []

for sentence in sentences:
    # Lowercase so that uppercase and lowercase forms count as the same word,
    # then split the sentence into words
    words = sentence.lower().split()

    # Remove stop words, then join the remaining words back into a sentence
    words = [w for w in words if w not in stop_words]
    cleaned_sentence.append(" ".join(words))

# Print the preprocessed sentences
print(cleaned_sentence)

cv = CountVectorizer(ngram_range=(1, 2))  # count single words and two-word sequences
Bagofwords = cv.fit_transform(cleaned_sentence).toarray()

print(cv.vocabulary_)
print(Bagofwords)

# Exercise: work out how each row of the output is produced

['sky nice', 'clouds nice', 'sky nice clouds nice']
{'sky': 4, 'nice': 2, 'sky nice': 5, 'clouds': 0, 'clouds nice': 1, 'nice clouds': 3}
[[0 0 1 0 1 1]
 [1 1 1 0 0 0]
 [1 1 2 1 1 1]]


In [13]:
sorted(list(cv.vocabulary_.keys()))

['clouds', 'clouds nice', 'nice', 'nice clouds', 'sky', 'sky nice']

In [13]:
cv.vocabulary_

# Each position in a vector corresponds to one vocabulary index


{'sky': 4,
 'nice': 2,
 'sky nice': 5,
 'clouds': 0,
 'clouds nice': 1,
 'nice clouds': 3}
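
The unigram-plus-bigram output above can be reproduced by hand. The following is a sketch in plain Python, not how `CountVectorizer` is implemented internally; the helper `ngrams` and the names `terms_per_sentence` and `matrix` are assumptions made for illustration.

```python
def ngrams(tokens, n):
    # Every run of n consecutive tokens, joined with a space
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

cleaned_sentence = ['sky nice', 'clouds nice', 'sky nice clouds nice']

# Collect unigrams and bigrams per sentence, mirroring ngram_range=(1, 2)
terms_per_sentence = []
for sentence in cleaned_sentence:
    tokens = sentence.split()
    terms_per_sentence.append(ngrams(tokens, 1) + ngrams(tokens, 2))

# The sorted unique terms give the column order
vocabulary = sorted({term for terms in terms_per_sentence for term in terms})

# Count how often each term occurs in each sentence
matrix = [[terms.count(term) for term in vocabulary] for terms in terms_per_sentence]

print(vocabulary)
for row in matrix:
    print(row)
```

The third sentence contains the bigram `nice clouds` only because stop-word removal placed the two words side by side, which is why that column is zero for the first two sentences.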

In [14]:
################################ With logic ##################################
# Manual bag-of-words: count each vocabulary word in each cleaned sentence
vocabulary = sorted(set(' '.join(cleaned_sentence).split()))
print("vocabulary:", vocabulary)

rows = []
for sentence in cleaned_sentence:
    tokens = sentence.split()
    # Count whole tokens; a substring test (`word in sentence`) would only
    # record presence and could match inside longer words
    rows.append({word: tokens.count(word) for word in vocabulary})

print(rows)
vectors = [list(row.values()) for row in rows]
vectors

vocabulary: ['clouds', 'nice', 'sky']
[{'clouds': 0, 'nice': 1, 'sky': 1}, {'clouds': 1, 'nice': 1, 'sky': 0}, {'clouds': 1, 'nice': 2, 'sky': 1}]


[[0, 1, 1], [1, 1, 0], [1, 2, 1]]

Some disadvantages of BoW:
> It provides no semantic information about the words. It records only how many times a word occurs in a sentence, not its position or its relationship to the other words in the sentence.

> It gives equal importance to every word in the sentence, so it is most useful for simple tasks.

> Other methods such as TF-IDF and Word2Vec are more complex and often more useful than BoW.
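
The loss of word order can be seen in a short sketch. The two sentences below are hypothetical and not from the examples above, and `bag_of_words` is an illustrative helper.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    # Count each vocabulary word; word positions are discarded
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical sentences: same words, opposite meaning
a = "dog bites man"
b = "man bites dog"

vocabulary = sorted(set(a.split()) | set(b.split()))
vec_a = bag_of_words(a, vocabulary)
vec_b = bag_of_words(b, vocabulary)

print(vocabulary)
print(vec_a, vec_b, vec_a == vec_b)
```

Both sentences map to the same vector, so a model that sees only BoW features cannot tell them apart.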

- BoW

- TF-IDF

- Word2Vec

- GloVe
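
To show how TF-IDF departs from the equal weighting of BoW, here is a minimal pure-Python sketch on the cleaned sky/clouds sentences. It assumes one common variant, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1` followed by L2 normalisation of each row; other formulations exist, and scikit-learn's `TfidfVectorizer` offers a library version. The names `docs`, `df`, `idf`, and `tfidf` are illustrative.

```python
import math
from collections import Counter

cleaned_sentence = ['sky nice', 'clouds nice', 'sky nice clouds nice']
docs = [s.split() for s in cleaned_sentence]
vocabulary = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Document frequency: number of sentences that contain each word
df = {w: sum(w in d for d in docs) for w in vocabulary}

# Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}

tfidf = []
for d in docs:
    counts = Counter(d)
    weights = [counts[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(x * x for x in weights))
    tfidf.append([x / norm for x in weights])  # L2-normalise each row

for w in vocabulary:
    print(w, round(idf[w], 3))
for row in tfidf:
    print([round(x, 3) for x in row])
```

Because `nice` appears in every sentence it receives the lowest IDF, so it contributes less per occurrence than `sky` or `clouds`.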

In [6]:
# Creating a word histogram
import nltk
word2count = {}
for data in sentences:
    words = nltk.word_tokenize(data)   # split the sentence into word tokens
    for word in words:                 # visit each token in turn
        if word not in word2count:     # first occurrence: start the count at 1
            word2count[word] = 1
        else:                          # seen before: increment the count
            word2count[word] += 1
print(word2count)

{'sky': 1, 'is': 2, 'nice': 4, 'clouds': 1, 'are': 2, 'Sky': 1, 'and': 1, 'Clouds': 1}
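
In the histogram above, `Sky` and `sky` are counted as different words because the raw sentences are not lowercased. A sketch of the same count using `collections.Counter`, assuming a plain whitespace split in place of `nltk.word_tokenize` and lowercasing first:

```python
from collections import Counter

sentences = ['sky is nice', 'clouds are nice', 'Sky is nice and Clouds are nice']

# Lowercase before counting so 'Sky' and 'sky' share one entry
word2count = Counter()
for data in sentences:
    word2count.update(data.lower().split())

print(dict(word2count))
print(word2count.most_common(1))
```

`Counter` handles the "first occurrence or increment" branch internally, and `most_common` gives the words ranked by frequency.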
