## Word Embedding
- Word embedding is a technique used in natural language processing (NLP) to **represent words as dense vectors in a continuous space**. For example, we can easily understand the text "I saw a cat", but our models cannot; they need vectors of features. Such vectors, or word embeddings, are representations of words that can be fed into a model.
- It helps capture the semantic and syntactic context of a word or term and shows how similar or dissimilar it is to other words.

Advantages of word embeddings
- Semantic similarity
- Contextual understanding
- Dimensionality reduction
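
As a minimal sketch of what "semantic similarity" means in practice, the snippet below compares toy vectors with cosine similarity. The three-dimensional vectors are made up by hand for illustration (real embeddings are learned from data and usually have many more dimensions), and `cosine_similarity` is a helper defined here, not a library call.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, chosen by hand for illustration only
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

With these toy values, "cat" ends up much closer to "dog" than to "car", which is the kind of relationship a trained embedding is expected to encode.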

## Word Embedding Techniques
- BoW Embedding
- TF-IDF Embedding
- Word2Vec Embedding


### Bag of Words (BoW)
- It converts a text document into a numerical vector representation.
- BoW model **ignores the order and structure of the words in the document** and focuses solely on the **occurrence and frequency of words**.
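
To make the order-insensitivity concrete, here is a small sketch that builds count vectors by hand with `collections.Counter`. The two example sentences and the `bow_vector` helper are assumptions for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Two illustrative sentences with the same words in a different order
doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

vocabulary = sorted(set(doc_a.split()) | set(doc_b.split()))
vec_a = bow_vector(doc_a, vocabulary)
vec_b = bow_vector(doc_b, vocabulary)

print(vocabulary)  # ['cat', 'chased', 'dog', 'the']
print(vec_a)       # [1, 1, 1, 2]
print(vec_b)       # [1, 1, 1, 2]
```

Both sentences map to the same vector even though their meanings differ, which is exactly the information BoW discards.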

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
texts = ['It convert text document into numerical vector representation',
        'BoW model ignores the order and structure of the words in the document and focuses solely on the occurrence and frequency of words']
texts = [i.lower() for i in texts]
cv = CountVectorizer()  # initialize the CountVectorizer model
cv_vector = cv.fit_transform(texts)  # learn the vocabulary and build the count matrix in one step
features = cv.get_feature_names_out()
print(features)  # vocabulary: one column per word
pd.DataFrame(cv_vector.toarray(), columns=features)

['and' 'bow' 'convert' 'document' 'focuses' 'frequency' 'ignores' 'in'
 'into' 'it' 'model' 'numerical' 'occurrence' 'of' 'on' 'order'
 'representation' 'solely' 'structure' 'text' 'the' 'vector' 'words']


Unnamed: 0,and,bow,convert,document,focuses,frequency,ignores,in,into,it,...,of,on,order,representation,solely,structure,text,the,vector,words
0,0,0,1,1,0,0,0,0,1,1,...,0,0,0,1,0,0,1,0,1,0
1,3,1,0,1,1,1,1,1,0,0,...,2,1,1,0,1,1,0,4,0,2


### TF-IDF (Term Frequency-Inverse Document Frequency)
- It is used to evaluate the importance of a term in a document within a collection (corpus) of documents.
- **Term Frequency** - It measures how frequently a term appears in a document. TF helps identify important terms within a document by giving higher weight to terms that occur more often.<br>
TF($w_i, r_j$) = (Number of times $w_i$ occurs in $r_j$) / (Total number of words in $r_j$)

- **Inverse Document Frequency** - It measures the importance of a word across the entire corpus. It penalizes terms that appear in many documents and gives more weight to terms that are relatively rare.<br>
IDF($w_i$) = log((Total number of documents) / (Number of documents containing $w_i$))<br>
TF-IDF($w_i, r_j$) = TF($w_i, r_j$) * IDF($w_i$)
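
The formulas above can be sketched directly in plain Python. The toy corpus and the function names are made up for illustration; the natural logarithm is an assumption (the formula does not fix a base), and the sketch assumes the term occurs in at least one document.

```python
import math

def term_frequency(term, document):
    """TF: occurrences of the term divided by the number of words in the document."""
    return document.count(term) / len(document)

def inverse_document_frequency(term, corpus):
    """IDF: log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

# Illustrative toy corpus, already tokenized
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"],
]

first_doc = corpus[0]
print(tf_idf("the", first_doc, corpus))  # 0.0, because "the" appears in every document
print(tf_idf("cat", first_doc, corpus))  # (1/6) * log(3/2)
print(tf_idf("mat", first_doc, corpus))  # (1/6) * log(3)
```

"the" is the most frequent word in the first document but gets zero weight, while "mat", which occurs in only one document, gets the highest weight of the three.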

#### Limitations
- Semantic understanding: TF-IDF does not capture the semantic meaning of words.
- Vocabulary size and sparsity: each vector has one dimension per vocabulary word, so vectors are large and mostly zero.
- Word importance assumption: TF-IDF assumes that a word's importance within a document is directly proportional to its frequency, but sometimes a less frequent word is the more important one.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
texts = ['It convert text document into numerical vector representation',
        'BoW model ignores the order and structure of the words in the document and focuses solely on the occurrence and frequency of words']
texts = [i.lower() for i in texts]
tfidf = TfidfVectorizer()  # initialize the TfidfVectorizer model
tfidf_vector = tfidf.fit_transform(texts)  # learn vocabulary and IDF weights, then build the TF-IDF matrix
features = tfidf.get_feature_names_out()
print(features)  # vocabulary: one column per word
pd.DataFrame(tfidf_vector.toarray(), columns=features)

['and' 'bow' 'convert' 'document' 'focuses' 'frequency' 'ignores' 'in'
 'into' 'it' 'model' 'numerical' 'occurrence' 'of' 'on' 'order'
 'representation' 'solely' 'structure' 'text' 'the' 'vector' 'words']


Unnamed: 0,and,bow,convert,document,focuses,frequency,ignores,in,into,it,...,of,on,order,representation,solely,structure,text,the,vector,words
0,0.0,0.0,0.364996,0.259698,0.0,0.0,0.0,0.0,0.364996,0.364996,...,0.0,0.0,0.0,0.364996,0.0,0.0,0.364996,0.0,0.364996,0.0
1,0.449687,0.149896,0.0,0.106652,0.149896,0.149896,0.149896,0.149896,0.0,0.0,...,0.299792,0.149896,0.149896,0.0,0.149896,0.149896,0.0,0.599583,0.0,0.299792
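
The numbers in the table above do not match the plain TF * IDF formula, because `TfidfVectorizer` with default settings uses raw counts as TF, a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and then L2-normalizes each row. The sketch below reproduces two values of the first row by hand; it assumes the default parameters (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), and the document frequencies are read off the two texts above.

```python
import math

n_documents = 2

def smoothed_idf(document_frequency):
    """IDF as computed by TfidfVectorizer with smooth_idf=True (the default)."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

# First text: eight distinct words, each occurring once.
# Only "document" also appears in the second text (document frequency 2).
document_frequency = {
    "it": 1, "convert": 1, "text": 1, "document": 2,
    "into": 1, "numerical": 1, "vector": 1, "representation": 1,
}
raw = {word: 1 * smoothed_idf(df) for word, df in document_frequency.items()}

# L2-normalize the row so that its Euclidean length is 1
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
weights = {word: value / norm for word, value in raw.items()}

print(round(weights["convert"], 6))   # 0.364996
print(round(weights["document"], 6))  # 0.259698
```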


### Word2Vec
- As the name suggests, it converts words to vectors. It aims to capture semantic and syntactic relationships between words by mapping them into a continuous vector space.
- There are two main architectures: Continuous Bag-of-Words (CBOW) and Skip-gram. Both involve training a neural network on a large corpus of text so that the network learns how to represent each word. This is **unsupervised (self-supervised) machine learning: no hand-annotated labels are needed, because the training labels are generated from the text itself** (each word's neighbors serve as the prediction targets).
<img height = 300 width = 500  src = 'https://miro.medium.com/v2/resize:fit:1100/0*dyZ7Syt3DMbN7nF9'>

**To be added**

Skip-gram: works well with a small amount of training data and represents even rare words or phrases well.

CBOW: several times faster to train than Skip-gram, with slightly better accuracy for frequent words.
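
To illustrate how the two architectures differ, the sketch below generates the (input, target) training examples each one would see for a short sentence. The helper names are hypothetical, and the sketch leaves out everything else that real Word2Vec training involves (the neural network itself and the sampling tricks used in practice).

```python
def context_window(tokens, position, window):
    """Words within `window` positions of tokens[position], excluding the word itself."""
    start = max(0, position - window)
    end = min(len(tokens), position + window + 1)
    return [tokens[i] for i in range(start, end) if i != position]

def skipgram_pairs(tokens, window):
    """Skip-gram: predict each context word from the center word."""
    return [(center, context)
            for position, center in enumerate(tokens)
            for context in context_window(tokens, position, window)]

def cbow_pairs(tokens, window):
    """CBOW: predict the center word from all of its context words."""
    return [(context_window(tokens, position, window), center)
            for position, center in enumerate(tokens)]

tokens = "i saw a cat".split()
print(skipgram_pairs(tokens, window=1))
# [('i', 'saw'), ('saw', 'i'), ('saw', 'a'), ('a', 'saw'), ('a', 'cat'), ('cat', 'a')]
print(cbow_pairs(tokens, window=1))
# [(['saw'], 'i'), (['i', 'a'], 'saw'), (['saw', 'cat'], 'a'), (['a'], 'cat')]
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position, which is one way to see why CBOW tends to train faster.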

Reference

- https://aegis4048.github.io/demystifying_neural_network_in_skip_gram_language_modeling#eq-18

In [None]:
import numpy as np
from gensim.models import Word2Vec
texts = ['It convert text document into numerical vector representation',
        'BoW model ignores the order and structure of the words in the document and focuses solely on the occurrence and frequency of words',
        'It is used for evulate improtance of a term in a document within a collection or corpus of documents']

# Preprocess the text data: lowercase and split on whitespace
sentence = [i.lower().split() for i in texts]
print(sentence)
word2vec_model = Word2Vec(sentence, vector_size=100, window=7, min_count=1, workers=3)
# vector_size: the dimensionality of the word vectors
# window: the maximum distance between the target word and its context words within a sentence
# min_count: words that occur fewer times than this are ignored
print('vector shape ', word2vec_model.wv['text'].shape)
# top 10 words most similar to 'text'
print(word2vec_model.wv.most_similar('text'))

# sentence to vector -> take the average of the Word2Vec vectors
sentence1 = 'It convert text to document'
vec = np.zeros(100,)
for word in sentence1.lower().split():
    try:
        vec += word2vec_model.wv[word]
    except KeyError:
        # word not in the trained vocabulary (here: 'to'); it contributes a zero vector
        pass
vec /= len(sentence1.split())
print("shape of sentence vector- ", vec.shape)
vec[:10]

[['it', 'convert', 'text', 'document', 'into', 'numerical', 'vector', 'representation'], ['bow', 'model', 'ignores', 'the', 'order', 'and', 'structure', 'of', 'the', 'words', 'in', 'the', 'document', 'and', 'focuses', 'solely', 'on', 'the', 'occurrence', 'and', 'frequency', 'of', 'words'], ['it', 'is', 'used', 'for', 'evulate', 'improtance', 'of', 'a', 'term', 'in', 'a', 'document', 'within', 'a', 'collection', 'or', 'corpus', 'of', 'documents']]
vector shape  (100,)
[('occurrence', 0.1822252720594406), ('document', 0.17278888821601868), ('in', 0.16700223088264465), ('is', 0.15654872357845306), ('model', 0.1330449879169464), ('collection', 0.11322769522666931), ('it', 0.11128198355436325), ('convert', 0.10946954041719437), ('ignores', 0.09724971652030945), ('within', 0.0907229632139206)]
shape of sentence vector-  (100,)


array([ 0.00502373,  0.00104713,  0.00228018,  0.00050057, -0.00067506,
       -0.00108926, -0.00096536,  0.00203213, -0.00446166, -0.00252412])

## Pre-trained Word Embedding

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

#Tokenize the sentences
tokenizer = Tokenizer()

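#preparing vocabulary
# x_tr and x_val are assumed to be the training and validation texts (lists of strings) defined earlier
tokenizer.fit_on_texts(list(x_tr))

#converting text into integer sequences
x_tr_seq  = tokenizer.texts_to_sequences(x_tr)
x_val_seq = tokenizer.texts_to_sequences(x_val)

#padding to prepare sequences of same length
x_tr_seq  = pad_sequences(x_tr_seq, maxlen=100)
x_val_seq = pad_sequences(x_val_seq, maxlen=100)

# vocabulary size for the embedding layer (assumed definition; +1 because index 0 is reserved for padding)
size_of_vocabulary = len(tokenizer.word_index) + 1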
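
For intuition, here is a pure-Python sketch of what the tokenizing and padding steps do. The helper names and the two example texts are hypothetical; the sketch mimics the Keras defaults (word indices start at 1 and are ordered by frequency, unknown words are dropped, padding and truncation happen at the front), but skips the punctuation filtering that the real `Tokenizer` applies.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: index for index, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_texts(texts, word_index):
    """Replace each known word by its integer; unknown words are dropped."""
    return [[word_index[word] for word in text.lower().split() if word in word_index]
            for text in texts]

def pad_left(sequences, maxlen):
    """Left-pad with zeros, or keep only the last `maxlen` items of longer sequences."""
    padded = []
    for sequence in sequences:
        kept = sequence[-maxlen:]
        padded.append([0] * (maxlen - len(kept)) + kept)
    return padded

train_texts = ["the cat sat", "the dog sat on the mat"]  # illustrative data
word_index = build_word_index(train_texts)
sequences = encode_texts(train_texts, word_index)
padded = pad_left(sequences, maxlen=4)
print(word_index)  # {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4, 'on': 5, 'mat': 6}
print(padded)      # [[0, 1, 3, 2], [2, 5, 1, 6]]
```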


In [None]:
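#deep learning library (explicit imports instead of wildcard imports)
from keras.models import Sequential
from keras.layers import Embedding, LSTM, GlobalMaxPooling1D, Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint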

model=Sequential()

#embedding layer
model.add(Embedding(size_of_vocabulary,300,input_length=100,trainable=True))

#lstm layer
model.add(LSTM(128,return_sequences=True,dropout=0.2))

#Global Maxpooling
model.add(GlobalMaxPooling1D())

#Dense Layer
model.add(Dense(64,activation='relu'))
model.add(Dense(1,activation='sigmoid'))

#Add loss function, metrics, optimizer
model.compile(optimizer='adam', loss='binary_crossentropy',metrics=["acc"])

#Adding callbacks
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1,patience=3)
mc=ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', save_best_only=True,verbose=1)

#Print summary of model (summary() prints directly and returns None)
model.summary()
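
The model above trains its embedding layer from scratch. To start from pre-trained vectors instead, a common approach is to build an embedding matrix whose row `i` holds the pre-trained vector of the word with index `i`. The sketch below shows only that step, in plain Python; the tiny vectors, the word index, and the `build_embedding_matrix` helper are all hypothetical. How the matrix is then passed to the embedding layer depends on the Keras version, so that part is left out.

```python
# Hypothetical pre-trained vectors; in practice these would be loaded from a
# published file (for example GloVe or Word2Vec vectors), not typed by hand.
pretrained_vectors = {
    "cat": [0.1, 0.2, 0.3],
    "dog": [0.2, 0.1, 0.4],
}
embedding_dim = 3

# Hypothetical word-to-index mapping, as a tokenizer would produce (0 = padding)
word_index = {"cat": 1, "dog": 2, "axolotl": 3}

def build_embedding_matrix(word_index, pretrained_vectors, embedding_dim):
    """Row i holds the pre-trained vector of the word with index i, zeros if unknown."""
    matrix = [[0.0] * embedding_dim for _ in range(len(word_index) + 1)]
    for word, index in word_index.items():
        vector = pretrained_vectors.get(word)
        if vector is not None:
            matrix[index] = list(vector)
    return matrix

embedding_matrix = build_embedding_matrix(word_index, pretrained_vectors, embedding_dim)
for row in embedding_matrix:
    print(row)
```

Words without a pre-trained vector ("axolotl" here) keep an all-zero row; other initializations for such words are possible.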


Commonly used pre-trained embeddings and language models:
Word2Vec: Developed by Google, Word2Vec provides word embeddings that capture semantic relationships and contextual information. Two popular architectures of Word2Vec are Continuous Bag-of-Words (CBOW) and Skip-gram. The pre-trained Word2Vec embeddings are available for various languages.

GloVe (Global Vectors for Word Representation): GloVe is a word embedding model that combines global statistics of word co-occurrences with local context windows. GloVe embeddings capture semantic and syntactic relationships between words. Pre-trained GloVe embeddings are available in different dimensions, trained on various corpora.

FastText: FastText, developed by Facebook AI Research, extends Word2Vec by considering word substructures (character n-grams) in addition to whole words. FastText embeddings are effective for handling out-of-vocabulary words and capturing morphological information. Pre-trained FastText embeddings are available in different languages and dimensions.

ELMo (Embeddings from Language Models): ELMo embeddings are based on deep contextualized word representations. These embeddings capture word meanings in context, allowing models to leverage contextual information effectively. ELMo embeddings are trained on large-scale language modeling tasks.

BERT (Bidirectional Encoder Representations from Transformers): BERT is a state-of-the-art transformer-based model that captures deep contextualized word representations. BERT embeddings consider the bidirectional context of words, leading to rich contextual information. Pre-trained BERT models are available in different sizes and variations, such as BERT-base and BERT-large.

ULMFiT (Universal Language Model Fine-tuning): ULMFiT is a transfer learning approach that uses pre-training on a large corpus and fine-tuning on task-specific data. ULMFiT embeddings capture syntactic and semantic information and have been successful in various NLP tasks.

Transformer-based Models (e.g., GPT, GPT-2, T5): Transformer-based models, such as OpenAI's GPT (Generative Pre-trained Transformer) and GPT-2, as well as Google's T5 (Text-to-Text Transfer Transformer), have also been used to obtain high-quality pre-trained word embeddings. These models capture deep contextualized representations and have achieved state-of-the-art performance in many NLP tasks.
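
The character n-grams that FastText works with can be sketched as follows. This is only the n-gram extraction, not the FastText model; the `character_ngrams` helper is hypothetical, and the boundary markers `<` and `>` and the default range of 3 to 6 characters are assumptions based on the usual FastText description.

```python
def character_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of the word wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    return [wrapped[start:start + n]
            for n in range(n_min, n_max + 1)
            for start in range(len(wrapped) - n + 1)]

print(character_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen word still shares many n-grams with known words, a vector for it can be composed from the n-gram vectors, which is how out-of-vocabulary words are handled.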

