In [1]:
import os

folder_path = "./emb_data"
filenames = os.listdir(folder_path)
documents = []

for file in filenames:
    try:
        with open(os.path.join(folder_path, file), 'r', encoding='utf-8') as f:
            documents.append(f.read())
    except (OSError, UnicodeDecodeError):
        # Report files that cannot be opened or decoded as UTF-8
        print(file)

article_100.txt


In [2]:
len(documents)
doc_str = ' '.join(documents)

In [3]:
!pip install transformers






In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import gensim.downloader as api
from gensim.models import Word2Vec, KeyedVectors
from transformers import BertTokenizer, BertModel, GPT2Tokenizer, GPT2Model, pipeline
import torch

### Bag of Words (BoW)
Bag of Words (BoW) is a simple and widely used method in natural language processing to convert text data into numerical representations. The core idea is to represent a text (such as a sentence or a document) as a collection of its words, disregarding grammar and word order but keeping multiplicity. Each unique word in the corpus vocabulary is mapped to a feature index, and the text is represented as a vector where each element indicates the count of a word in the text.
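
As a rough illustration of the idea, here is a minimal pure-Python sketch of a count-based representation. The function name `bag_of_words` and the toy corpus are made up for this example, and the lowercase whitespace tokenization is a simplifying assumption; `CountVectorizer` applies its own tokenization rules.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, count_vectors) for a list of strings."""
    # Assumption: lowercase + whitespace split is enough for this toy example.
    tokenized = [text.lower().split() for text in texts]
    # Each unique word gets one feature index (its position in the sorted list).
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Word order is discarded; only the per-word counts are kept.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


toy_corpus = ["the cat sat on the mat", "the dog sat"]
vocab, bow_vectors = bag_of_words(toy_corpus)
print(vocab)        # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(bow_vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each row has one entry per vocabulary word, so the vector length grows with the vocabulary rather than with the length of the text.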

### Usage Scenarios:

- Text Classification: BoW can be used to convert documents into a numerical format that machine learning algorithms can process, such as for spam detection or sentiment analysis.
- Information Retrieval: Useful for retrieving documents similar to a query by comparing the frequency of words.

[documentation link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

### Term Frequency-Inverse Document Frequency (TF-IDF)
Term Frequency-Inverse Document Frequency (TF-IDF) is an enhancement of the BoW model. It not only considers the occurrence of a word in a document but also how important a word is by considering its frequency across multiple documents. The TF-IDF score is the product of two statistics:

- Term Frequency (TF): Measures how frequently a term occurs in a document.

- Inverse Document Frequency (IDF): Measures how informative a term is across the corpus, typically as the logarithm of the total number of documents divided by the number of documents containing the term.

The intuition behind TF-IDF is to reduce the weight of common words and increase the weight of rare, informative words.
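
The product of the two statistics can be sketched in a few lines of pure Python. This is the plain textbook variant, TF × log(N / DF), written only for illustration; the function name `tf_idf` and the toy corpus are hypothetical, and scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(texts):
    """Return one {term: score} dict per input text (textbook TF-IDF)."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


toy_scores = tf_idf(["the cat sat", "the dog barked"])
print(toy_scores[0]["the"])            # 0.0, "the" occurs in every document
print(round(toy_scores[0]["cat"], 3))  # 0.231, "cat" occurs in only one
```

A term that appears in every document gets an IDF of log(1) = 0, which is the down-weighting of common words described above.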

### Usage Scenarios:

- Text Mining: TF-IDF is widely used in mining textual information to identify the most significant words in a collection of documents.

- Document Similarity: Helps in measuring the similarity between documents by capturing the importance of words more effectively than simple frequency counts.

[documentation link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
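
Document similarity over such vectors is commonly measured with cosine similarity. The sketch below is a small pure-Python version for illustration; the function name `cosine_similarity` and the two example vectors are assumptions, not part of the notebook's data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen here: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical term-weight vectors for two documents over the same vocabulary.
doc_a = [1, 0, 1, 1, 1, 2]
doc_b = [0, 1, 0, 0, 1, 1]
print(cosine_similarity(doc_a, doc_b))
```

Because the score depends only on direction, a long document and a short one with the same word proportions are treated as identical.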

In [5]:
### Traditional Methods

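# Bag of Words (BoW)
# Note: doc_str is one concatenated string, so the result is a single row of corpus-wide counts.
vectorizer_bow = CountVectorizer()
X_bow = vectorizer_bow.fit_transform([doc_str])
print("Bag of Words Embedding:\n", X_bow.toarray())

# Term Frequency-Inverse Document Frequency (TF-IDF)
# Note: with a single input document every term has the same IDF, so these values are
# just L2-normalized counts. Pass the `documents` list instead to get per-document weights.
vectorizer_tfidf = TfidfVectorizer()
X_tfidf = vectorizer_tfidf.fit_transform([doc_str])
print("TF-IDF Embedding:\n", X_tfidf.toarray())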


Bag of Words Embedding:
 [[35  2  1 ...  1  1  1]]
TF-IDF Embedding:
 [[2.92116614e-03 1.66923780e-04 8.34618898e-05 ... 8.34618898e-05
  8.34618898e-05 8.34618898e-05]]


### Word2Vec

Word2Vec is a popular word embedding technique that represents words in a continuous vector space. Unlike Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), which are based on word frequency counts, Word2Vec captures semantic meanings and relationships between words by training a shallow neural network. It learns vector representations in which words that appear in similar contexts have similar vectors.

Word2Vec uses two main architectures:

- Continuous Bag of Words (CBOW): Predicts a target word based on its context (neighboring words).

- Skip-gram: Predicts the context (neighboring words) based on a target word.
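
The difference between the two architectures is easiest to see in the training examples each one is built from. The following is a minimal sketch that only generates those examples, not a trained model; the function names `cbow_examples` and `skipgram_pairs` and the `window` parameter are illustrative assumptions.

```python
def cbow_examples(tokens, window=2):
    """CBOW: (context words, target word) examples."""
    examples = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples


def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (target word, context word) pair per neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["a", "b", "c"]
print(cbow_examples(sentence, window=1))
# [(['b'], 'a'), (['a', 'c'], 'b'), (['b'], 'c')]
print(skipgram_pairs(sentence, window=1))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```

CBOW produces one example per position with the whole context as input, while skip-gram produces one example per (target, neighbor) pair.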

The resulting word vectors can capture semantic similarity and relationships, such as "king" being close to "queen" and "Paris" being close to "France" in the vector space.

### Usage Scenarios:

- Natural Language Processing (NLP): Used in various NLP tasks such as text classification, sentiment analysis, named entity recognition, and machine translation.

- Similarity Measurement: To find similar words or phrases in a corpus.

- Feature Representation: As input features for downstream machine learning models.

[documentation link](https://radimrehurek.com/gensim/models/word2vec.html)

In [6]:
# Word2Vec
w2v_model = api.load('word2vec-google-news-300')
words = doc_str.split()
# Keep only tokens that exist in the pretrained vocabulary
word_vectors = [w2v_model[word] for word in words if word in w2v_model]
# Print a summary instead of every vector, which would exceed the notebook's output rate limit
print("Word2Vec Embedding:", len(word_vectors), "vectors of dimension", w2v_model.vector_size)



### GloVe (Global Vectors for Word Representation)

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Unlike Word2Vec, which uses local context window methods, GloVe builds a co-occurrence matrix from the corpus and factorizes it to obtain word vectors. The resulting word vectors capture semantic similarities and relationships by considering the global word-word co-occurrence statistics from a corpus.
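
The first step, building the co-occurrence counts, can be sketched in pure Python as below. This only covers the counting stage, not the factorization; the function name `cooccurrence_counts`, the `window` parameter, and the 1/distance weighting are assumptions made for this illustration.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Symmetric word-word co-occurrence counts within a context window."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at most `window` positions; adding both directions
            # below keeps the counts symmetric.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)  # assumption: closer pairs count more
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)


toy_counts = cooccurrence_counts([["a", "b", "c"]], window=2)
print(toy_counts[("a", "b")])  # 1.0, adjacent words
print(toy_counts[("a", "c")])  # 0.5, two positions apart
```

Unlike the per-window training examples used by Word2Vec, these counts are accumulated over the whole corpus before any vectors are learned.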

### Usage Scenarios:

- Natural Language Processing (NLP): Used in tasks such as text classification, sentiment analysis, named entity recognition, and machine translation.
- Similarity Measurement: To find similar words or phrases in a corpus.
- Feature Representation: As input features for downstream machine learning models.

[documentation link](https://github.com/stanfordnlp/GloVe)

In [7]:
word_vectors

[array([ 1.88476562e-01,  3.89099121e-03,  4.19921875e-02,  1.35742188e-01,
         2.21679688e-01, -7.95898438e-02, -3.06396484e-02, -5.95703125e-02,
        -1.11328125e-01,  2.94921875e-01, -4.76074219e-02,  2.39372253e-04,
         2.22656250e-01,  7.76367188e-02, -2.09960938e-01, -1.08398438e-01,
        -3.16406250e-01,  6.07910156e-02,  3.41796875e-02, -3.61328125e-01,
        -1.25000000e-01,  1.36718750e-01, -3.02734375e-01,  1.54296875e-01,
        -2.00195312e-01, -8.54492188e-02, -2.79296875e-01,  7.32421875e-02,
        -2.09960938e-01,  1.85546875e-02,  1.12792969e-01,  1.01074219e-01,
        -3.37890625e-01, -3.39843750e-01, -2.75878906e-02, -6.22558594e-02,
        -5.10253906e-02,  4.46777344e-02,  4.29687500e-02, -1.21582031e-01,
        -1.26953125e-01,  1.07421875e-01, -2.08007812e-01,  1.00097656e-01,
        -1.15234375e-01, -3.37890625e-01,  2.14843750e-02,  9.52148438e-02,
        -2.02148438e-01,  3.00781250e-01, -3.30078125e-01,  2.02148438e-01,
        -7.6

In [8]:
# GloVe
glove_model = api.load('glove-wiki-gigaword-100')
# Keep only tokens that exist in the pretrained vocabulary
word_vectors_glove = [glove_model[word] for word in words if word in glove_model]
# Print a summary instead of every vector, which would exceed the notebook's output rate limit
print("GloVe Embedding:", len(word_vectors_glove), "vectors of dimension", glove_model.vector_size)



In [9]:
word_vectors_glove

[array([-0.039595,  0.22305 ,  0.74367 ,  0.06807 ,  0.099108,  0.084727,
        -0.43643 ,  0.19426 ,  0.78261 , -0.45995 ,  0.26504 ,  0.5667  ,
         0.14682 ,  0.19253 ,  0.47709 ,  0.52925 ,  0.13503 ,  0.011202,
        -0.027028,  0.51998 , -0.36943 , -0.88036 , -0.4235  , -0.2928  ,
         0.065225, -0.10532 ,  0.011529, -0.41076 , -0.17006 , -0.1343  ,
        -0.093686,  0.53304 , -0.22162 , -0.11646 , -0.21371 , -0.18847 ,
        -0.203   , -0.36454 ,  0.075353, -0.080618, -0.47445 ,  0.11029 ,
         0.20691 ,  0.1286  , -0.65957 ,  0.17112 , -0.26303 ,  0.39033 ,
         0.4433  ,  0.19446 ,  0.51082 , -0.031648,  0.41244 ,  0.9763  ,
         0.13114 , -1.7315  , -0.26666 , -0.22859 , -0.12017 ,  0.3759  ,
         0.35398 ,  0.7666  ,  0.19895 , -0.21865 ,  0.39753 , -0.05602 ,
         0.44564 ,  0.059255,  0.033003,  0.38314 ,  0.25702 ,  0.33001 ,
        -0.20888 ,  0.18914 ,  0.41963 ,  0.15973 ,  0.14376 , -0.60505 ,
        -0.72911 ,  0.49253 , -0.22462

### FastText

FastText is an extension of Word2Vec introduced by Facebook's AI Research (FAIR) lab. It represents words as bags of character n-grams, which allows it to generate vectors for out-of-vocabulary words by summing the vectors of their character n-grams. This makes FastText particularly effective for morphologically rich languages.
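
To make the subword idea concrete, here is a small sketch that extracts character n-grams from a word after wrapping it in `<` and `>` boundary markers. The function name `char_ngrams` is hypothetical, and the sketch only lists the n-grams; it does not look up or sum any vectors.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # markers distinguish prefixes and suffixes
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word such as a new inflection still shares most of its n-grams with known words, which is why a vector can be assembled for it.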

### Usage Scenarios:

- Natural Language Processing (NLP): Used in various NLP tasks such as text classification, sentiment analysis, and named entity recognition.
- Handling Out-of-Vocabulary Words: Generates vectors for words not seen during training.

[documentation link](https://fasttext.cc/docs/en/supervised-tutorial.html)

In [10]:
# FastText
fasttext_model = api.load('fasttext-wiki-news-subwords-300')
# Keep only tokens that exist in the pretrained vocabulary
word_vectors_fasttext = [fasttext_model[word] for word in words if word in fasttext_model]
# Print a summary instead of every vector, which would exceed the notebook's output rate limit
print("FastText Embedding:", len(word_vectors_fasttext), "vectors of dimension", fasttext_model.vector_size)



### BERT (Bidirectional Encoder Representations from Transformers)

BERT is a transformer-based model developed by Google that pre-trains deep bidirectional representations by jointly conditioning on both left and right context in all layers. This enables BERT to understand the context of a word based on its surrounding words. BERT is pre-trained on a large corpus and fine-tuned for specific tasks such as question answering and text classification.

### Usage Scenarios:

- Natural Language Understanding (NLU): Used in tasks like question answering, text classification, and named entity recognition.
- Contextual Word Embeddings: Provides context-sensitive embeddings for words.

[Hugging Face link](https://huggingface.co/docs/transformers/en/model_doc/bert)

In [11]:
# BERT
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
model_bert = BertModel.from_pretrained('bert-base-uncased')
# bert-base-uncased accepts at most 512 tokens, so the input is truncated:
# only the first 512 tokens of doc_str are embedded here.
inputs_bert = tokenizer_bert(doc_str, return_tensors='pt', truncation=True, max_length=512)
with torch.no_grad():  # inference only, no gradients needed
    outputs_bert = model_bert(**inputs_bert)
print("BERT Embedding shape:", outputs_bert.last_hidden_state.shape)
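
Because `bert-base-uncased` accepts at most 512 token positions per input, a corpus-sized string cannot be embedded in a single pass. One possible workaround is to split the token ids into overlapping windows and run the model on each window separately. The helper below is a sketch of the splitting step only; `chunk_token_ids`, `max_len`, and `stride` are hypothetical names, and since the `[CLS]` and `[SEP]` special tokens also count toward the limit, a content window of 510 is assumed in the usage line.

```python
def chunk_token_ids(token_ids, max_len=512, stride=384):
    """Split a list of token ids into windows of at most max_len items.

    Consecutive windows start `stride` positions apart, so they overlap
    by max_len - stride tokens.
    """
    if not 0 < stride <= max_len:
        raise ValueError("stride must be between 1 and max_len")
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
        start += stride
    return chunks


# Stand-in ids; a real run would take them from the tokenizer.
fake_ids = list(range(1200))
windows = chunk_token_ids(fake_ids, max_len=510, stride=400)
print([len(w) for w in windows])  # [510, 510, 400]
```

The overlap gives tokens near a window boundary some surrounding context in at least one window; the per-window outputs then still need to be combined, for example by averaging.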

### GPT (Generative Pre-trained Transformer)

GPT is a transformer-based model developed by OpenAI that uses a unidirectional (left-to-right) context to generate text. GPT is pre-trained on a large corpus of text and can be fine-tuned for various natural language generation tasks such as text completion, summarization, and dialogue generation.

### Usage Scenarios:

- Natural Language Generation (NLG): Used in tasks like text completion, summarization, and chatbot development.
- Contextual Text Generation: Generates coherent and contextually relevant text.

[hugging face link](https://huggingface.co/docs/transformers/en/model_doc/openai-gpt)

In [14]:
# GPT
tokenizer_gpt = GPT2Tokenizer.from_pretrained('gpt2')
model_gpt = GPT2Model.from_pretrained('gpt2')
# gpt2 accepts at most 1024 tokens, so the input is truncated:
# only the first 1024 tokens of doc_str are embedded here.
inputs_gpt = tokenizer_gpt(doc_str, return_tensors='pt', truncation=True, max_length=1024)
with torch.no_grad():  # inference only, no gradients needed
    outputs_gpt = model_gpt(**inputs_gpt)
print("GPT Embedding shape:", outputs_gpt.last_hidden_state.shape)
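
A transformer's `last_hidden_state` holds one vector per token, so an extra step is needed to obtain a single vector for a whole text. Mean pooling over the non-padding tokens is one common choice; the sketch below shows it on plain Python lists so the arithmetic is easy to follow. The function name `mean_pool` and the example values are assumptions, and a real pipeline would do the same computation on tensors.

```python
def mean_pool(token_vectors, mask):
    """Average the token vectors whose mask entry is truthy."""
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    total = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, mask):
        if keep:
            kept += 1
            for k, value in enumerate(vector):
                total[k] += value
    if kept == 0:
        return total  # nothing to average; return the zero vector
    return [value / kept for value in total]


# Three token vectors; the last one is padding and is masked out.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
print(mean_pool(hidden, attention_mask))  # [2.0, 3.0]
```

Masking matters because padding positions would otherwise pull the average toward values that carry no information about the text.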

### Domain-Specific Embeddings:
- Medical Domain
  - BioWordVec: [GitHub repository](https://github.com/ncbi-nlp/BioWordVec)
  - BioBERT: [GitHub repository](https://github.com/dmis-lab/biobert)
  - SciBERT: [GitHub repository](https://github.com/allenai/scibert)
  - ClinicalBERT: [GitHub repository](https://github.com/kexinhuang12345/clinicalBERT)
- Mathematical Domain
  - MathBERT: [arXiv paper](https://arxiv.org/abs/2106.07340)
- Legal Domain
  - LegalBERT: [Hugging Face model](https://huggingface.co/nlpaueb/legal-bert-base-uncased)
  - VoyageLaw: [blog post](https://blog.voyageai.com/2024/04/15/domain-specific-embeddings-and-retrieval-legal-edition-voyage-law-2/)
  - FastLaw: [GitHub repository](https://github.com/jbesomi/fastlaw)

In [16]:
!pip install einops

Collecting einops
  Using cached einops-0.8.0-py3-none-any.whl.metadata (12 kB)
Downloading einops-0.8.0-py3-none-any.whl (43 kB)
Installing collected packages: einops
Successfully installed einops-0.8.0





In [18]:
!pip install flash-attn --no-build-isolation

Collecting flash-attn
  Downloading flash_attn-2.6.3.tar.gz (2.6 MB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: flash-attn
  Building wheel 




In [17]:
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

embeddings = model.encode(doc_str, task="text-matching")

block.py:   0%|          | 0.00/17.8k [00:00<?, ?B/s]

mlp.py:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/xlm-roberta-flash-implementation:
- mlp.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


mha.py:   0%|          | 0.00/34.4k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/xlm-roberta-flash-implementation:
- mha.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


stochastic_depth.py:   0%|          | 0.00/3.76k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/xlm-roberta-flash-implementation:
- stochastic_depth.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/jinaai/xlm-roberta-flash-implementation:
- block.py
- mlp.py
- mha.py
- stochastic_depth.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


xlm_padding.py:   0%|          | 0.00/10.0k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/xlm-roberta-flash-implementation:
- xlm_padding.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/jinaai/xlm-roberta-flash-implementation:
- rotary.py
- block.py
- xlm_padding.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors:   0%|          | 0.00/1.14G [00:00<?, ?B/s]

flash_attn is not installed. Using PyTorch native attention implementation.

tokenizer_config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/964 [00:00<?, ?B/s]

KeyboardInterrupt: 