In [3]:
# prompt: Import text file from a ZIP and mount google drive

from google.colab import drive
drive.mount('/content/drive')

folder_path = "/content/drive/MyDrive/Natural Language Processing/HW3/emb_data"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [16]:
import os

folder_path = "/content/drive/MyDrive/Natural Language Processing/HW3/emb_data"
filenames = os.listdir(folder_path)
documents = []

for file in filenames:
    try:
        with open(os.path.join(folder_path, file), 'r', encoding='utf-8') as f:
            documents.append(f.read())
    except (OSError, UnicodeDecodeError):
        # Report files that cannot be opened or decoded as UTF-8
        print(file)

article_100.txt


In [17]:
len(documents)
doc_str = ' '.join(documents)

In [19]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import gensim.downloader as api
from gensim.models import Word2Vec, KeyedVectors
from transformers import BertTokenizer, BertModel, GPT2Tokenizer, GPT2Model, pipeline
import torch

### Bag of Words (BoW)
Bag of Words (BoW) is a simple and widely used method in natural language processing to convert text data into numerical representations. The core idea is to represent a text (such as a sentence or a document) as a collection of its words, disregarding grammar and word order but keeping multiplicity. Each unique word in the corpus vocabulary is mapped to a feature index, and the text is represented as a vector where each element indicates the count of a word in the text.
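
The mapping from words to count vectors can be sketched with the standard library alone. This is a minimal illustration on an invented two-sentence corpus, not how `CountVectorizer` is implemented; it splits on whitespace only, whereas the scikit-learn class also lowercases and applies its own token pattern.

```python
from collections import Counter

# Toy corpus, invented purely for illustration.
corpus = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every unique word, sorted so each word gets a stable feature index.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def bow_vector(doc):
    """Return the count of each vocabulary word in `doc` (word order is discarded)."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(doc) for doc in corpus]
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Note that "the" keeps its multiplicity (2) in the first vector, while the information that "cat" came before "sat" is lost.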

### Usage Scenarios:

- Text Classification: BoW converts documents into a numerical format that machine learning algorithms can process, for example for spam detection or sentiment analysis.
- Information Retrieval: Useful for retrieving documents similar to a query by comparing word frequencies.

[documentation link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

### Term Frequency-Inverse Document Frequency (TF-IDF)
Term Frequency-Inverse Document Frequency (TF-IDF) is an enhancement of the BoW model. It not only considers the occurrence of a word in a document but also how important a word is by considering its frequency across multiple documents. The TF-IDF score is the product of two statistics:

- Term Frequency (TF): Measures how frequently a term occurs in a document.

- Inverse Document Frequency (IDF): Measures how informative a term is, based on the inverse of its document frequency (the total number of documents divided by the number of documents containing the term, usually taken on a logarithmic scale).

The intuition behind TF-IDF is to reduce the weight of common words and increase the weight of rare, informative words.
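
As a rough sketch of that intuition, the snippet below computes IDF by hand on an invented three-sentence corpus. It assumes the smoothed variant `ln((1 + n) / (1 + df)) + 1`, which scikit-learn documents as its default; other formulations exist, and `TfidfVectorizer` additionally L2-normalizes each document vector, which is omitted here.

```python
import math
from collections import Counter

# Toy corpus, invented purely for illustration.
corpus = ["the cat sat on the mat", "the dog sat", "the bird flew"]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency (assumed variant, see lead-in)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    """Raw term count multiplied by IDF, without normalization."""
    tf = Counter(tokens)
    return {term: count * idf(term) for term, count in tf.items()}

for term in ("the", "sat", "cat"):
    print(term, df[term], round(idf(term), 3))
print(tfidf(tokenized[0]))
```

"the" appears in every document and receives the lowest possible IDF, "sat" appears in two and "cat" in only one, so each occurrence of "cat" is weighted most heavily.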

### Usage Scenarios:

- Text Mining: TF-IDF is widely used in mining textual information to identify the most significant words in a collection of documents.

- Document Similarity: Helps in measuring the similarity between documents by capturing the importance of words more effectively than simple frequency counts.

[documentation link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [20]:
### Traditional Methods

# Bag of Words (BoW)
vectorizer_bow = CountVectorizer()
X_bow = vectorizer_bow.fit_transform([doc_str])
print("Bag of Words Embedding:\n", X_bow.toarray())

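# Term Frequency-Inverse Document Frequency (TF-IDF)
# Note: fitting on a single concatenated string means every term occurs in the
# only document, so the IDF factor is the same for all terms and the result is
# just the L2-normalized counts. Pass `documents` instead of `[doc_str]` to get
# per-document weights.
vectorizer_tfidf = TfidfVectorizer()
X_tfidf = vectorizer_tfidf.fit_transform([doc_str])
print("TF-IDF Embedding:\n", X_tfidf.toarray())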


Bag of Words Embedding:
 [[35  2  1 ...  1  1  1]]
TF-IDF Embedding:
 [[2.92116614e-03 1.66923780e-04 8.34618898e-05 ... 8.34618898e-05
  8.34618898e-05 8.34618898e-05]]


### Word2Vec

Word2Vec is a popular word embedding technique that represents words in continuous vector space. Unlike Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), which are based on word frequency counts, Word2Vec captures semantic meanings and relationships between words by training a neural network model. It learns vector representations of words where similar words have similar vectors.

Word2Vec uses two main architectures:

- Continuous Bag of Words (CBOW): Predicts a target word based on its context (neighboring words).

- Skip-gram: Predicts the context (neighboring words) based on a target word.
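
The difference between the two architectures comes down to how training examples are built from a sliding window. The sketch below only generates those examples for a toy sentence; the window size of 2 and the helper name `context_windows` are illustrative assumptions, and no neural network is trained here.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in `tokens`."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = "the quick brown fox jumps".split()

# CBOW: the whole context is the input, the target word is the label.
cbow_examples = [(context, target) for target, context in context_windows(tokens)]

# Skip-gram: the target word is the input, each context word is a separate label.
skipgram_examples = [
    (target, context_word)
    for target, context in context_windows(tokens)
    for context_word in context
]

print(cbow_examples[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_examples[:2])  # [('the', 'quick'), ('the', 'brown')]
```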

The resulting word vectors can capture semantic similarity and relationships, such as "king" being close to "queen" and "Paris" being close to "France" in the vector space.
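
"Close" in the vector space is usually measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors chosen by hand to make the point; real Word2Vec vectors have hundreds of dimensions and are learned from data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, NOT taken from any trained model.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_banana = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])
print(round(sim_king_queen, 3), round(sim_king_banana, 3))
```

With a loaded gensim `KeyedVectors` model, the same comparison is available through its `similarity` method.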

### Usage Scenarios:

- Natural Language Processing (NLP): Used in various NLP tasks such as text classification, sentiment analysis, named entity recognition, and machine translation.

- Similarity Measurement: To find similar words or phrases in a corpus.

- Feature Representation: As input features for downstream machine learning models.

[documentation link](https://radimrehurek.com/gensim/models/word2vec.html)

In [None]:
# Word2Vec
w2v_model = api.load('word2vec-google-news-300')
words = doc_str.split()
word_vectors = [w2v_model[word] for word in words if word in w2v_model]
print("Word2Vec Embedding:\n", word_vectors)



### GloVe (Global Vectors for Word Representation)

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Unlike Word2Vec, which uses local context window methods, GloVe builds a co-occurrence matrix from the corpus and factorizes it to obtain word vectors. The resulting word vectors capture semantic similarities and relationships by considering the global word-word co-occurrence statistics from a corpus.
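
The co-occurrence statistics that GloVe starts from can be sketched as follows. This is a simplified, unweighted count over an invented sentence with an assumed window of 2; the factorization step that turns these counts into vectors is not shown.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Symmetric word-word co-occurrence counts within `window` positions."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        # Look back at the previous `window` tokens and count each pair both ways.
        for j in range(max(0, i - window), i):
            counts[(word, tokens[j])] += 1.0
            counts[(tokens[j], word)] += 1.0
    return counts

# Toy sentence, invented purely for illustration.
tokens = "ice is cold and steam is hot".split()
counts = cooccurrence_counts(tokens)

print(counts[("is", "and")])             # 2.0
print(counts[("ice", "cold")])           # 1.0
print(counts.get(("ice", "hot"), 0.0))   # 0.0
```

Because the counts are aggregated over the entire corpus before any training happens, each word pair contributes one global statistic rather than many separate local-window updates.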

### Usage Scenarios:

- Natural Language Processing (NLP): Used in tasks such as text classification, sentiment analysis, named entity recognition, and machine translation.
- Similarity Measurement: To find similar words or phrases in a corpus.
- Feature Representation: As input features for downstream machine learning models.

[documentation link](https://github.com/stanfordnlp/GloVe)

In [None]:
# GloVe
glove_model = api.load('glove-wiki-gigaword-100')
word_vectors_glove = [glove_model[word] for word in words if word in glove_model]
print("GloVe Embedding:\n", word_vectors_glove)


### FastText

FastText is an extension of Word2Vec introduced by Facebook's AI Research (FAIR) lab. It represents words as bags of character n-grams, which allows it to generate vectors for out-of-vocabulary words by summing the vectors of their character n-grams. This makes FastText particularly effective for morphologically rich languages.
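
A minimal sketch of the character n-gram idea, assuming the `<` and `>` word-boundary markers and the 3 to 6 character range used in the FastText paper. The helper name `char_ngrams` is illustrative and not part of any library.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of `word`, wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# A word missing from the vocabulary still shares n-grams with known words,
# which is what lets a vector be assembled for it.
shared = set(char_ngrams("where", 3, 3)) & set(char_ngrams("whereby", 3, 3))
print(sorted(shared))  # ['<wh', 'ere', 'her', 'whe']
```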

### Usage Scenarios:

- Natural Language Processing (NLP): Used in various NLP tasks such as text classification, sentiment analysis, and named entity recognition.
- Handling Out-of-Vocabulary Words: Generates vectors for words not seen during training.

[documentation link](https://fasttext.cc/docs/en/supervised-tutorial.html)

In [None]:
# FastText
fasttext_model = api.load('fasttext-wiki-news-subwords-300')
word_vectors_fasttext = [fasttext_model[word] for word in words if word in fasttext_model]
print("FastText Embedding:\n", word_vectors_fasttext)


### BERT (Bidirectional Encoder Representations from Transformers)

BERT is a transformer-based model developed by Google that pre-trains deep bidirectional representations by jointly conditioning on both left and right context in all layers. This enables BERT to understand the context of a word based on its surrounding words. BERT is pre-trained on a large corpus and fine-tuned for specific tasks such as question answering and text classification.

### Usage Scenarios:

- Natural Language Understanding (NLU): Used in tasks like question answering, text classification, and named entity recognition.
- Contextual Word Embeddings: Provides context-sensitive embeddings for words.

[hugging face link](https://huggingface.co/docs/transformers/en/model_doc/bert)

In [None]:
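# BERT
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
model_bert = BertModel.from_pretrained('bert-base-uncased')
# `example_text` was not defined earlier; using the first document is an assumption
example_text = documents[0]
# bert-base-uncased accepts at most 512 tokens, so longer inputs are truncated
inputs_bert = tokenizer_bert(example_text, return_tensors='pt', truncation=True, max_length=512)
with torch.no_grad():  # inference only, no gradients needed
    outputs_bert = model_bert(**inputs_bert)
# One contextual vector per token: shape (1, num_tokens, 768)
print("BERT Embedding:\n", outputs_bert.last_hidden_state)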

### GPT (Generative Pre-trained Transformer)

GPT is a transformer-based model developed by OpenAI that uses a unidirectional (left-to-right) context to generate text. GPT is pre-trained on a large corpus of text and can be fine-tuned for various natural language generation tasks such as text completion, summarization, and dialogue generation.
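
The contrast between GPT's left-to-right context and BERT's bidirectional context can be pictured as attention masks. The sketch below is a conceptual illustration in plain Python, not code from either model: a 1 at row `i`, column `j` means token `i` is allowed to attend to token `j`.

```python
def bidirectional_mask(n):
    """BERT-style: every token can attend to every other token."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """GPT-style: token i can attend only to positions 0..i (itself and the past)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

The lower-triangular shape is what makes next-token generation possible: the representation of each position never depends on tokens that come after it.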

### Usage Scenarios:

- Natural Language Generation (NLG): Used in tasks like text completion, summarization, and chatbot development.
- Contextual Text Generation: Generates coherent and contextually relevant text.

[hugging face link](https://huggingface.co/docs/transformers/en/model_doc/openai-gpt)

In [None]:
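# GPT
tokenizer_gpt = GPT2Tokenizer.from_pretrained('gpt2')
model_gpt = GPT2Model.from_pretrained('gpt2')
# `example_text` was not defined earlier; using the first document is an assumption
example_text = documents[0]
# GPT-2 accepts at most 1024 tokens, so longer inputs are truncated
inputs_gpt = tokenizer_gpt(example_text, return_tensors='pt', truncation=True, max_length=1024)
with torch.no_grad():  # inference only, no gradients needed
    outputs_gpt = model_gpt(**inputs_gpt)
# One contextual vector per token: shape (1, num_tokens, 768)
print("GPT Embedding:\n", outputs_gpt.last_hidden_state)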

### Domain-specific embeddings:
- Biomedical and Scientific Domain
  - BioWordVec: [GitHub](https://github.com/ncbi-nlp/BioWordVec)
  - BioBERT: [GitHub](https://github.com/dmis-lab/biobert)
  - SciBERT: [GitHub](https://github.com/allenai/scibert)
  - ClinicalBERT: [GitHub](https://github.com/kexinhuang12345/clinicalBERT)
- Mathematical Domain
  - MathBERT: [paper](https://arxiv.org/abs/2106.07340)
- Legal Domain
  - LegalBERT: [Hugging Face](https://huggingface.co/nlpaueb/legal-bert-base-uncased)
  - VoyageLaw: [blog post](https://blog.voyageai.com/2024/04/15/domain-specific-embeddings-and-retrieval-legal-edition-voyage-law-2/)
  - FastLaw: [GitHub](https://github.com/jbesomi/fastlaw)