# Representing Text for NLP

Machine learning models take numbers or vectors as input. Text therefore has to be converted into a numerical representation before a model can use it. These numerical representations of text are called encodings or embeddings.

There are multiple ways of creating embeddings from any given text. In this notebook, we will explore some of the most common methods: 

1. Bag of Words (BOW)
2. Word2Vec CBOW
3. Word2Vec Skipgram
4. GloVe
5. FastText

## Bag of Words

The bag of words method converts a text into an embedding based on a measure of occurrence, such as how often each vocabulary word appears in it. The word "bag" indicates that the resulting vectors carry no information about the order or position of tokens. For example, "John loves Mary" and "Mary loves John" have the same embedding.
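The order-invariance described above can be checked on the two example sentences. This is a minimal sketch: the helper `bow_vector` and the three-word vocabulary are illustrative assumptions, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


def bow_vector(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]


# hypothetical toy vocabulary, fixed so both vectors share the same columns
vocab = ["john", "loves", "mary"]
v1 = bow_vector("John loves Mary", vocab)
v2 = bow_vector("Mary loves John", vocab)

print(v1, v2)    # [1, 1, 1] [1, 1, 1]
print(v1 == v2)  # True: word order is lost
```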

In [34]:
import string

import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# requires the NLTK "punkt" and "stopwords" resources (fetch once with nltk.download)
stopwords_english = stopwords.words('english')

In [40]:
# Bag of Words Model
# load the raw text from disk
with open ("pg84.txt", "r", encoding="utf-8", errors="ignore") as f:
    text=f.read()
print("Length of text string:", len(text))

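# preprocess text: tokenize, drop stopwords and punctuation, lowercase
# NOTE: the stopword filter runs before lowercasing, so capitalized forms
# such as "The" or "I" are kept (they show up in the output below);
# lowercase first if they should be removed as well
def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if (token not in stopwords_english and token not in string.punctuation)]
    tokens = [token.lower() for token in tokens]
    return tokens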
tokens=preprocess_text(text)
print("Num tokens:", len(tokens))

# get vocabulary
vocab=list(set(tokens))
print("Vocab size:", len(vocab))

# split the token stream into fixed-size chunks that stand in for documents
doc_size = 500
docs = [tokens[i:i + doc_size] for i in range(0, len(tokens), doc_size)]
print("Num docs:", len(docs))

# create the BOW matrix
# we use frequency counts, i.e. how many times a word occurs in a doc
X = np.zeros((len(docs), len(vocab)), dtype=np.int32)  # zero-initialized matrix with one row per doc
vocab_index = {word: i for i, word in enumerate(vocab)}
for i,doc in enumerate(docs): 
    for word in doc:
        j=vocab_index[word]
        X[i,j]+=1 #each row in X is now an embedding for a doc
print("Document-term matrix shape:", X.shape)  # (num_docs, vocab_size)   

# explore top words in doc 1
row=X[0]
top_idx=row.argsort()[::-1][:10]
print("The embedding for doc 1: ", row)
for item in top_idx:
    print(vocab[item], row[item])

Length of text string: 438807
Num tokens: 42316
Vocab size: 7635
Num docs: 85
Document-term matrix shape: (85, 7635)
The embedding for doc 1:  [0 0 0 ... 0 0 0]
chapter 24
i 16
may 9
letter 6
ebook 6
frankenstein 4
modern 4
prometheus 4
the 4
these 4


## Word2Vec

The Word2Vec method was first proposed in [Mikolov et al. (2013)](https://openreview.net/forum?id=idpCdOWtqXd60) as a way for models to learn word representations from a large corpus. The most common software implementation is available via `models.word2vec` from [Gensim](https://radimrehurek.com/gensim/models/word2vec.html). The method comes in two variants:

1. Continuous Bag of Words (CBOW): Given the context terms within a preselected context window, the model learns to predict the target term.
2. Skipgram: Given the target word, the model learns to predict the context words within the context window, assigning them a higher probability than words outside the window.

### Word2Vec CBOW
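To make the CBOW setup concrete, the sketch below builds the (context, target) training examples that the model would learn from. It only illustrates how the examples are formed, not the training itself; the function name `cbow_pairs`, the window size, and the sample sentence are assumptions for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) examples for CBOW."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after the target
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


# hypothetical example sentence, already tokenized
sentence = ["the", "monster", "fled", "across", "the", "ice"]
for context, target in cbow_pairs(sentence, window=2):
    print(context, "->", target)
```

In Gensim's `Word2Vec` class, CBOW is selected with `sg=0`, which is also the default.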

### Word2Vec Skipgram
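Skipgram reverses the direction of prediction: each target word is paired with every word in its context window, one pair at a time. The sketch below generates these (target, context) pairs; the function name `skipgram_pairs`, the window size, and the sample sentence are assumptions for illustration, and no training is performed.

```python
def skipgram_pairs(tokens, window=2):
    """Build (target_word, context_word) examples for skipgram."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# hypothetical example sentence, already tokenized
sentence = ["the", "monster", "fled", "across", "the", "ice"]
for target, context in skipgram_pairs(sentence, window=1):
    print(target, "->", context)
```

In Gensim's `Word2Vec` class, skipgram is selected with `sg=1`.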

## GloVe

The GloVe, or Global Vectors, method was proposed in [Pennington et al. (2014)](https://aclanthology.org/D14-1162/). It aims to improve on word representations by also using global context, namely word co-occurrence statistics gathered over the whole corpus, while learning the representations.
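The global statistic that GloVe is fitted to is a word-word co-occurrence matrix aggregated over the entire corpus. The sketch below shows only that counting step. The function name `cooccurrence_counts`, the symmetric window, and the plain unweighted counts are simplifying assumptions; fitting vectors to these counts is not shown.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        # look back only, then record both directions to keep the counts symmetric
        for j in range(max(0, i - window), i):
            counts[(word, tokens[j])] += 1
            counts[(tokens[j], word)] += 1
    return dict(counts)


# hypothetical toy corpus
tokens = "the cat sat on the mat".split()
X = cooccurrence_counts(tokens, window=1)

print(X[("the", "cat")])    # 1
print(("cat", "mat") in X)  # False: never within one position of each other
```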

## FastText

The FastText method was first proposed in [Bojanowski et al. (2017)](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00051/43387/Enriching-Word-Vectors-with-Subword-Information). Instead of treating whole words as the unit of representation, it decomposes each word into character n-grams and learns representations for these subword units.
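The decomposition into character n-grams can be sketched as follows. The boundary markers `<` and `>` and the range of 3 to 6 characters follow the description in the paper; the function name `char_ngrams` is an assumption, and this is not the fastText library's own code.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of the word, including boundary markers."""
    wrapped = f"<{word}>"  # mark the start and end of the word
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

The boundary markers let the model tell a prefix or suffix apart from the same characters in the middle of a word.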

One of the major advantages of this method is its ability to represent out-of-vocabulary (OOV) tokens: the vector for an unseen token is computed as some mathematical function of the vectors of its subword n-grams.
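A minimal sketch of this idea is shown below. It assumes the combining function is a simple average over the known n-gram vectors, and it uses random vectors in place of trained ones. The names `char_ngrams`, `oov_vector`, and `ngram_vectors` are hypothetical, as are the toy words.

```python
import random


def char_ngrams(word, n=3):
    """Character n-grams of the word with boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's n-grams that were seen in training."""
    known = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim  # nothing to build on
    total = [0.0] * dim
    for gram in known:
        for k, value in enumerate(ngram_vectors[gram]):
            total[k] += value
    return [t / len(known) for t in total]


random.seed(0)
dim = 4
# stand-in for trained n-gram vectors: random values for n-grams of "seen" words
seen_words = ["loves", "loved", "glove"]
ngram_vectors = {
    gram: [random.uniform(-1, 1) for _ in range(dim)]
    for word in seen_words
    for gram in char_ngrams(word)
}

# "lovely" was never seen, but shares "<lo", "lov" and "ove" with seen words
vec = oov_vector("lovely", ngram_vectors, dim)
print(vec)
```

A word that shares no n-grams with the training vocabulary falls back to a zero vector in this sketch.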