# Vector Space Models
Representing text units (characters, phonemes, words, phrases, sentences, paragraphs, and documents) as vectors of numbers.

## Basic Vectorization Approaches

One-Hot Encoding, cons:
1. The size of a one-hot vector is directly proportional to the size of the vocabulary, and most real-world corpora have large vocabularies. This results in a sparse representation.
2. It does not give a fixed-length representation for a text: a text with 10 words gets a longer representation than a text with 5 words.
3. It treats words as atomic units and has no notion of (dis)similarity between words, so it is very poor at capturing the meaning of a word in relation to other words.
4. It cannot handle out-of-vocabulary (OOV) words.

In [17]:
# One-Hot Encoding
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
processed_docs = [doc.lower().replace(".","") for doc in documents]
print(f'Processed docs: {processed_docs}')

# build vocabulary
vocab = {}
count = 0
for doc in processed_docs:
    for word in doc.split():
        if word not in vocab:
            count += 1
            vocab[word] = count
print(f'Vocabulary: {vocab}')

# onehot vector
def get_onehot_vector(somestring):
    onehot_encoded = []
    for word in somestring.split():
        temp = [0]*len(vocab)
        if word in vocab:
            temp[vocab[word]-1] = 1  # use -1 because index array starts from 0 not 1
        onehot_encoded.append(temp)
    return onehot_encoded

print(f'Docs 1 preprocessed: {processed_docs[0]}')
print(f'Docs 1 one hot: {get_onehot_vector(processed_docs[0])}')

print(f'One hot random text: {get_onehot_vector("man and dog are good")}')

Processed docs: ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']
Vocabulary: {'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}
Docs 1 preprocessed: dog bites man
Docs 1 one hot: [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
One hot random text: [[0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
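
The third drawback listed for one-hot encoding can be checked directly. The sketch below rebuilds the same toy vocabulary (with 0-based indices, unlike the 1-based `vocab` above) and uses two illustrative helpers, `one_hot` and `dot`, that are not part of the original cell. Any two different words get a dot product of 0, so the vectors say nothing about how related the words are.

```python
# Rebuild the toy vocabulary so this block runs on its own (0-based indices).
docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
vocab_index = {}
for doc in docs:
    for word in doc.split():
        vocab_index.setdefault(word, len(vocab_index))

def one_hot(word):
    """Return the one-hot vector of a word; all zeros if the word is OOV."""
    vec = [0] * len(vocab_index)
    if word in vocab_index:
        vec[vocab_index[word]] = 1
    return vec

def dot(u, v):
    """Dot product of two equal-length vectors (illustrative helper)."""
    return sum(a * b for a, b in zip(u, v))

same = dot(one_hot("dog"), one_hot("dog"))        # a word with itself
related = dot(one_hot("dog"), one_hot("man"))     # two different words
unrelated = dot(one_hot("dog"), one_hot("food"))  # two other different words
oov = one_hot("cat")                              # word not in the vocabulary

print(same, related, unrelated)
print(oov)
```

"dog" is exactly as far from "man" as it is from "food", and the OOV word "cat" collapses to an all-zero vector.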


Bag of Words

Advantages:
1. Simple to understand and implement.
2. Captures the semantic similarity of documents: documents that share the same words have vector representations closer to each other in Euclidean space than documents with completely different words.
3. Gives a fixed-length encoding for any sentence of arbitrary length.

Disadvantages:
1. The size of the vector increases with the size of the vocabulary, so sparsity remains a problem. One way to control it is to limit the vocabulary size.
2. It does not capture the similarity between different words that mean the same thing.
3. It has no way to handle out-of-vocabulary words.
4. Word order information is lost.

In [18]:
# bag of words
from sklearn.feature_extraction.text import CountVectorizer

#look at the documents list
print("Our corpus: ", processed_docs)

count_vect = CountVectorizer()
#Build a BOW representation for the corpus
bow_rep = count_vect.fit_transform(processed_docs)

#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)

#see the BOW rep for first 2 documents
print("BoW representation for 'dog bites man': ", bow_rep[0].toarray())
print("BoW representation for 'man bites dog: ",bow_rep[1].toarray())

#Get the representation using this vocabulary, for a new text
temp = count_vect.transform(["dog and dog are friends"])
print("Bow representation for 'dog and dog are friends':", temp.toarray())

# Researchers have shown that such a representation without considering frequency is useful for sentiment analysis
# BoW with binary vectors
count_vect = CountVectorizer(binary=True)
bow_rep_bin = count_vect.fit_transform(processed_docs)
temp = count_vect.transform(["dog and dog are friends"])
print("\nBoW with binary vectors:")
print("Bow representation for 'dog and dog are friends':", temp.toarray())

Our corpus:  ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']
Our vocabulary:  {'dog': 1, 'bites': 0, 'man': 4, 'eats': 2, 'meat': 5, 'food': 3}
BoW representation for 'dog bites man':  [[1 1 0 0 1 0]]
BoW representation for 'man bites dog:  [[1 1 0 0 1 0]]
Bow representation for 'dog and dog are friends': [[0 2 0 0 0 0]]

BoW with binary vectors:
Bow representation for 'dog and dog are friends': [[0 1 0 0 0 0]]
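
Two of the points above (documents with shared words end up close in Euclidean space, and word order is lost) can be seen in a few lines. This is a minimal from-scratch sketch using only the standard library, not scikit-learn's implementation; the helper names `bow_vector` and `euclidean` are illustrative.

```python
import math
from collections import Counter

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
# Alphabetical vocabulary, mirroring the column order CountVectorizer printed above.
bow_vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in bow_vocab]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vectors = [bow_vector(doc) for doc in docs]
dist_same_words = euclidean(vectors[0], vectors[1])  # "dog bites man" vs "man bites dog"
dist_diff_words = euclidean(vectors[0], vectors[3])  # "dog bites man" vs "man eats food"

print(vectors[0], vectors[1])
print(dist_same_words, dist_diff_words)
```

The first two documents share all their words, so their distance is 0 even though they mean opposite things. That is the loss of word order in action.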


Bag of N-Grams

Pros and cons:
1. It captures some context and word-order information in the form of n-grams.
2. The resulting vector space is therefore able to capture some semantic similarity.
3. As n increases, dimensionality (and therefore sparsity) increases rapidly.
4. It still provides no way to address the OOV problem.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

#Ngram vectorization example with count vectorizer and uni, bi, trigrams
count_vect = CountVectorizer(ngram_range=(1,3))

#Build a BOW representation for the corpus
bow_rep = count_vect.fit_transform(processed_docs)

#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)

#see the BOW rep for first 2 documents
print("BoW representation for 'dog bites man': ", bow_rep[0].toarray())
print("BoW representation for 'man bites dog: ",bow_rep[1].toarray())

#Get the representation using this vocabulary, for a new text
temp = count_vect.transform(["dog and dog are friends"])

print("Bow representation for 'dog and dog are friends':", temp.toarray())

Our vocabulary:  {'dog': 3, 'bites': 0, 'man': 12, 'dog bites': 4, 'bites man': 2, 'dog bites man': 5, 'man bites': 13, 'bites dog': 1, 'man bites dog': 14, 'eats': 8, 'meat': 17, 'dog eats': 6, 'eats meat': 10, 'dog eats meat': 7, 'food': 11, 'man eats': 15, 'eats food': 9, 'man eats food': 16}
BoW representation for 'dog bites man':  [[1 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0]]
BoW representation for 'man bites dog:  [[1 1 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0]]
Bow representation for 'dog and dog are friends': [[0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
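
To make the dimensionality point concrete, the sketch below counts the distinct n-grams in the toy corpus as the maximum n grows. It is a plain-Python illustration (the helpers `ngrams` and `vocabulary_up_to` are made up for this example), not how `CountVectorizer` works internally.

```python
docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocabulary_up_to(max_n):
    """Set of all 1-grams up to max_n-grams found in the corpus."""
    ngram_vocab = set()
    for doc in docs:
        tokens = doc.split()
        for n in range(1, max_n + 1):
            ngram_vocab.update(ngrams(tokens, n))
    return ngram_vocab

# Vocabulary size for unigrams only, up to bigrams, and up to trigrams.
sizes = [len(vocabulary_up_to(n)) for n in (1, 2, 3)]
print(sizes)
```

Even on four three-word documents the vocabulary triples, from 6 unigrams to the 18 entries listed above for `ngram_range=(1,3)`.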


TF-IDF

It aims to quantify the importance of a given word relative to other words in the document and in the corpus.

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
bow_rep_tfidf = tfidf.fit_transform(processed_docs)

#IDF for all words in the vocabulary
print("IDF for all words in the vocabulary",tfidf.idf_)
print("-"*10)
#All words in the vocabulary.
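# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
print("All words in the vocabulary",tfidf.get_feature_names_out().tolist())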
print("-"*10)

#TFIDF representation for all documents in our corpus 
print("TFIDF representation for all documents in our corpus\n",bow_rep_tfidf.toarray()) 
print("-"*10)

temp = tfidf.transform(["dog and man are friends"])
print("Tfidf representation for 'dog and man are friends':\n", temp.toarray())

IDF for all words in the vocabulary [1.51082562 1.22314355 1.51082562 1.91629073 1.22314355 1.91629073]
----------
All words in the vocabulary ['bites', 'dog', 'eats', 'food', 'man', 'meat']
----------
TFIDF representation for all documents in our corpus
 [[0.65782931 0.53256952 0.         0.         0.53256952 0.        ]
 [0.65782931 0.53256952 0.         0.         0.53256952 0.        ]
 [0.         0.44809973 0.55349232 0.         0.         0.70203482]
 [0.         0.         0.55349232 0.70203482 0.44809973 0.        ]]
----------
Tfidf representation for 'dog and man are friends':
 [[0.         0.70710678 0.         0.         0.70710678 0.        ]]
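
The numbers above can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF and L2 normalization): IDF is computed as `ln((1 + n) / (1 + df)) + 1`, multiplied by the raw term count, and each document vector is then scaled to unit length. It is a from-scratch illustration, not the library's code.

```python
import math

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
terms = sorted({word for doc in docs for word in doc.split()})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = {t: sum(t in doc.split() for doc in docs) for t in terms}
# Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in terms}

def tfidf_vector(text):
    """Raw term count times IDF, followed by L2 normalization."""
    tokens = text.split()
    raw = [tokens.count(t) * idf[t] for t in terms]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

first_doc = tfidf_vector(docs[0])
print([round(idf[t], 8) for t in terms])
print([round(x, 8) for x in first_doc])
```

Words that appear in fewer documents ("food", "meat") get a higher IDF than words that appear in most of them ("dog", "man").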


Three fundamental drawbacks of the basic vectorization approaches:
1. They are discrete representations, which hampers their ability to capture relationships between words.
2. The feature vectors are sparse and high-dimensional, which makes them computationally inefficient.
3. They cannot handle OOV words.

## Distributed Representations

Some key terms:
- Distributional similarity: the idea that the meaning of a word can be understood from the context in which it appears (connotation).
- Distributional hypothesis: the hypothesis that words that occur in similar contexts have similar meanings.
- Distributional representation: representation schemes obtained from the distribution of words across the contexts in which they appear. The resulting vectors are high-dimensional and sparse (one-hot, bag of words, bag of n-grams, TF-IDF).
- Distributed representation: also based on the distributional hypothesis, but the vectors are compact (low-dimensional) and dense.
- Embedding: a mapping from the vector space of a distributional representation to the vector space of a distributed representation.
- Vector semantics: NLP methods that aim to learn word representations based on the distributional properties of words in a large corpus.
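
The distributional hypothesis can be illustrated on the toy corpus with a small co-occurrence count. This is only a sketch: the context window of one word on each side and the helper `cosine` are choices made for this example.

```python
import math
from collections import Counter, defaultdict

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

# For each word, count the words that appear directly next to it.
cooc = defaultdict(Counter)
for doc in docs:
    tokens = doc.split()
    for i, word in enumerate(tokens):
        for j in (i - 1, i + 1):  # context window of one word on each side
            if 0 <= j < len(tokens):
                cooc[word][tokens[j]] += 1

def cosine(word_a, word_b):
    """Cosine similarity between the context-count vectors of two words."""
    a, b = cooc[word_a], cooc[word_b]
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

sim_dog_man = cosine("dog", "man")
sim_dog_meat = cosine("dog", "meat")
print(dict(cooc["dog"]), dict(cooc["man"]))
print(round(sim_dog_man, 3), round(sim_dog_meat, 3))
```

"dog" and "man" occur next to the same words in this corpus, so their context vectors are identical, while "dog" and "meat" share less context and score lower.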

In [21]:
# word embedding
# RUNNING IN GOOGLE COLAB

# Word Embedding: https://colab.research.google.com/drive/1YPvwkUNPk3N7VXMsTi1VNoyEl13HXyMa?usp=sharing
# Training embedding gensim: https://colab.research.google.com/drive/11SL71Xf72CnFLNShbuuMiPY-LgIg4c3j?usp=sharing

# import warnings #This module ignores the various types of warnings generated
# warnings.filterwarnings("ignore") 

# import os #This module provides a way of using operating system dependent functionality

# import psutil #This module helps in retrieving information on running processes and system resource utilization
# process = psutil.Process(os.getpid())
# from psutil import virtual_memory
# mem = virtual_memory()

# import time #This module is used to calculate the time

In [22]:
# from gensim.models import Word2Vec, KeyedVectors
# pretrainedpath = 'temp/GoogleNews-vectors-negative300.bin.gz'

# #Load W2V model. This will take some time, but it is a one time effort! 
# pre = process.memory_info().rss
# print("Memory used in GB before Loading the Model: %0.2f"%float(pre/(10**9))) #Check memory usage before loading the model
# print('-'*10)

# start_time = time.time() #Start the timer
# ttl = mem.total #Total memory available

# w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True) #load the model
# print("%0.2f seconds taken to load"%float(time.time() - start_time)) #Calculate the total time elapsed since starting the timer
# print('-'*10)

# print('Finished loading Word2Vec')
# print('-'*10)

# post = process.memory_info().rss
# print("Memory used in GB after Loading the Model: {:.2f}".format(float(post/(10**9)))) #Calculate the memory used after loading the model
# print('-'*10)

# print("Percentage increase in memory usage: {:.2f}% ".format(float(((post - pre)/pre)*100))) #Percentage increase in memory after loading the model
# print('-'*10)

# print("Number of words in vocabulary: ",len(w2v_model.vocab)) #Number of words in the vocabulary. With gensim >= 4.0, use len(w2v_model.key_to_index) instead.

In [23]:
# spacy
import spacy
nlp = spacy.load("en_core_web_sm")

In [24]:
print("Document After Pre-Processing:",processed_docs)

# Iterate over each document and initiate an nlp instance.
for doc in processed_docs:
    doc_nlp = nlp(doc) #creating a spacy "Doc" object which is a container for accessing linguistic annotations. 
    
    print("-"*30)
    print("Average Vector of '{}'\n".format(doc),doc_nlp.vector)#this gives the average vector of each document
    for token in doc_nlp:
        print()
        print(token.text,token.vector)#this gives the text of each word in the doc and their respective vectors.

Document After Pre-Processing: ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']
------------------------------
Average Vector of 'dog bites man'
 [ 0.587522    1.1498089  -1.7489859  -0.2872682  -1.0648674   0.3264183
  0.15925772  0.86956954 -0.03941305 -0.05719852 -1.1595355  -0.03754699
  0.09788382 -0.50910264 -0.09573716 -1.393181   -0.16476774 -0.9535491
  0.89954644  2.1720686   1.0505711  -0.6770111   0.2926553  -1.2229663
 -1.3725324   2.2056735  -0.81114006 -0.8166096   1.5868374  -1.1001147
 -0.35184097  0.12588209  1.1244168  -0.84187156  0.6205936   1.1501113
  3.149638   -0.54442066  0.3360592   0.34230545  0.2527882   2.4464004
 -0.9520517  -1.8514029  -0.41820276 -0.8024625  -0.2905644   0.24335873
  0.67568254 -0.22497876  2.7399645  -0.29414567  0.1477855  -2.1211226
 -0.19914706  1.8373746   0.88662976 -0.61229116 -1.031511   -1.0240673
  1.3489146  -2.9600992  -0.8370202  -2.589103    1.5814676   1.0113854
  0.06165723 -1.9719986  -1.7209841   0.

Ways to handle the OOV problem for word embeddings:
1. Create vectors that are initialized randomly, with each component between -0.25 and +0.25.
2. Use subword information, such as morphological properties (prefixes, suffixes, word endings, etc.), or character representations.
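
Both ideas can be sketched in a few lines. Everything here is illustrative: the `embeddings` table holds made-up values, `get_vector` and `char_ngrams` are hypothetical helpers, and the `<` and `>` boundary markers follow the style used by fastText.

```python
import random

DIM = 5  # toy dimensionality; real embeddings typically use a few hundred

# Hypothetical pre-trained lookup table with made-up values.
embeddings = {"dog": [0.1, -0.2, 0.05, 0.3, -0.1]}

rng = random.Random(0)  # seeded so repeated runs give the same vectors

def get_vector(word):
    """Return the known vector, or create and cache a random one for an OOV word."""
    if word not in embeddings:
        embeddings[word] = [rng.uniform(-0.25, 0.25) for _ in range(DIM)]
    return embeddings[word]

def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

oov_vector = get_vector("covid")
print(oov_vector)
print(char_ngrams("dog"))
```

Caching the random vector means the same OOV word always maps to the same vector. With the subword approach, an unseen word could instead be represented by combining the vectors of its character n-grams.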

## Distributed Representations Beyond Words and Characters

Word2vec learned representations for words, and we aggregated them to form text representations. fastText learned representations for character n-grams, which were aggregated to form word representations and then text representations.

Neither approach takes the context of words into account. For example, the sentences "dog bites man" and "man bites dog" both receive the same representation.
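
A minimal sketch of why the two sentences collapse to one representation when word vectors are averaged. The three-dimensional vectors are made-up values for illustration, not the output of any trained model.

```python
# Hypothetical 3-dimensional word vectors with made-up values.
word_vectors = {
    "dog":   [1.0, 0.5, -0.5],
    "bites": [0.25, -1.0, 0.75],
    "man":   [-0.5, 1.5, 0.25],
}

def average_vector(text):
    """Average the vectors of the words in the text (word order plays no role)."""
    vectors = [word_vectors[word] for word in text.split()]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

vec_a = average_vector("dog bites man")
vec_b = average_vector("man bites dog")
print(vec_a)
print(vec_b)
print(vec_a == vec_b)
```

Averaging is order-independent, so any permutation of the same words produces the same text vector.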

Another approach, Doc2vec, allows us to directly learn representations for texts of arbitrary length (phrases, sentences, paragraphs, and documents) by taking the context of words in the text into account.

Doc2vec learns a "paragraph vector" that serves as the representation of the full text.

In [25]:
import warnings
warnings.filterwarnings('ignore')
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from pprint import pprint
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Yasir Abdur
[nltk_data]     Rohman\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [26]:
data = ["dog bites man",
        "man bites dog",
        "dog eats meat",
        "man eats food"]

tagged_data = [TaggedDocument(words=word_tokenize(word.lower()), tags=[str(i)]) for i, word in enumerate(data)]
tagged_data

[TaggedDocument(words=['dog', 'bites', 'man'], tags=['0']),
 TaggedDocument(words=['man', 'bites', 'dog'], tags=['1']),
 TaggedDocument(words=['dog', 'eats', 'meat'], tags=['2']),
 TaggedDocument(words=['man', 'eats', 'food'], tags=['3'])]

In [27]:
#dbow
model_dbow = Doc2Vec(tagged_data,vector_size=20, min_count=1, epochs=2,dm=0)
print(model_dbow.infer_vector(['man','eats','food']))#feature vector of man eats food

[ 0.00381694 -0.01083953  0.02366928 -0.01721474 -0.01125923 -0.02391598
  0.00957225 -0.01906247  0.01022941 -0.02121767 -0.02251468 -0.01903932
 -0.0064447   0.02034412  0.02025429 -0.02116948  0.02440305  0.01085662
 -0.01993765 -0.01649801]


In [28]:
model_dbow.wv.most_similar("man",topn=5)#top 5 most similar words.

[('meat', 0.32583820819854736),
 ('eats', 0.28601741790771484),
 ('food', 0.14761802554130554),
 ('bites', -0.20909522473812103),
 ('dog', -0.24991711974143982)]

In [29]:
model_dbow.wv.n_similarity(["eats"],["man"])

0.2860174

In [30]:
#dm
model_dm = Doc2Vec(tagged_data, min_count=1, vector_size=20, epochs=2,dm=1)

print("Inference Vector of man eats food\n ",model_dm.infer_vector(['man','eats','food']))

print("Most similar words to man in our corpus\n",model_dm.wv.most_similar("man",topn=5))
print("Similarity between man and dog: ",model_dm.wv.n_similarity(["dog"],["man"]))

Inference Vector of man eats food
  [ 0.00381702 -0.01083956  0.02366922 -0.01721474 -0.01125922 -0.02391591
  0.00957238 -0.01906254  0.01022946 -0.02121754 -0.02251468 -0.01903929
 -0.00644476  0.02034397  0.02025425 -0.02116957  0.0244031   0.0108567
 -0.01993772 -0.01649793]
Most similar words to man in our corpus
 [('meat', 0.32583820819854736), ('eats', 0.28601741790771484), ('food', 0.14761802554130554), ('bites', -0.20909522473812103), ('dog', -0.24991711974143982)]
Similarity between man and dog:  -0.24991713


In [31]:
# OOV
# model_dm.wv.n_similarity(['covid'],['man'])

## Universal Text Representations

Words can mean different things in different contexts. For example:
- “I went to a **bank** to withdraw money” and
- “I sat by the river **bank** and pondered about text representations”

Contextual word representations address this issue. They use "language modeling", which is the task of predicting the next likely word in a sequence of words.
- Transformers, BERT, ELMo, etc.
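
To get a feel for the language modeling task itself, here is a deliberately tiny count-based bigram model. It is far simpler than the neural models listed above and only shows what "predicting the next likely word" means. The corpus is made up and `predict_next` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Made-up toy corpus.
corpus = [
    "i went to a bank to withdraw money",
    "i went to a river",
    "i sat by the river bank",
]

# Count which word follows which.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for current, following in zip(tokens, tokens[1:]):
        next_word_counts[current][following] += 1

def predict_next(word):
    """Most frequent follower of the word in the corpus, or None if the word is unseen."""
    followers = next_word_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("i"))
print(predict_next("went"))
print(predict_next("covid"))
```

A bigram model only looks one word back, so it cannot tell the two senses of "bank" apart. Contextual models condition on much more of the surrounding sentence.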

Important aspects to keep in mind while using them in our project:
1. All text representations are inherently biased based on what they saw in the training data.
    - Example: An embedding model trained heavily on technology data is likely to identify Apple as being closer to Microsoft/Facebook than to an orange.
2. Pre-trained embeddings are generally large-sized files (several GBs), which may pose problems in certain deployment scenarios.
3. Modeling language for a real-world application is more than capturing the information via word and sentence embeddings.
    - Example: the task of sarcasm detection requires nuances that are not yet captured well by embedding techniques.
4. A practitioner needs to exercise caution and consider practical issues such as return on investment from the effort, business needs, and infrastructural constraints before trying to use them in production-grade applications.

# Visualizing Embeddings

In [32]:
# Google Colab
# Visualizing Embeddings using TSNE:
# https://colab.research.google.com/drive/1HJ60cZe2DZdorHVHcMOHDWSqLZQuRURD?usp=sharing

# Visualizing Embeddings using Tensorboard
# https://colab.research.google.com/drive/1s2GsIztRNuSMoBHAaN1vf9ChjGCwvdZK?usp=sharing

# Handcrafted Feature Representation

In many cases, we do have some domain-specific knowledge about the given NLP problem, which we would like to incorporate into the model we're building.

Clearly, custom feature engineering is much more difficult to formulate compared to other feature engineering schemes we’ve seen so far. It’s for this reason that vectorization approaches are more accessible to get started with, especially when we don’t have enough understanding of the domain.