# Getting Started: Text Representation
<img src="./assets/banner_notebook_1.jpg">


The NLP domain wasn't always buzzing with the __attention__ and hype that we see today.
The recent progress in this field is built on top of years of amazing work and research. Before we leap into the current state of things, let us take a quick walk through how we arrived here. Today's NLP systems stand tall on the shoulders of very solid work from past decades.


## Import Required Libraries

<a target="_blank" href="https://colab.research.google.com/github/raghavbali/llm_workshop/blob/main/module_01/02_getting_started.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
import torch
import torchtext
import os
import collections
import pandas as pd
import numpy as np
import re

# silence the torchtext deprecation notice
torchtext.disable_torchtext_deprecation_warning()

In [None]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')
# newer NLTK releases look for 'punkt_tab' when calling word_tokenize
nltk.download('punkt_tab')

### Get Text
__Project Gutenberg__ is an amazing project aimed at providing free access to some of the world's most celebrated classical works. This makes it a wonderful source of textual data for NLP practitioners to use and improve their understanding of textual data. Of course, you can improve your literary skills too.

For this module and the workshop in general, we will make use of materials available from the project. We begin by downloading the book __"The Adventures of Sherlock Holmes" by Arthur Conan Doyle__.


<img src="./assets/img_2_notebook_1.jpg">

In [None]:
!wget -O sherlock_homes.txt http://www.gutenberg.org/files/1661/1661-0.txt

### Load Data

In [None]:
filename = "sherlock_homes.txt"
with open(filename, 'r', encoding='utf-8') as f:
    file_text = f.read()

# lower case text to reduce dimensionality
file_text = file_text.lower()

# We remove the first 1450 characters to drop
# the Project Gutenberg header details
raw_text = file_text[1450:]

### Text Representation

Feature engineering is often called the secret sauce for creating better performing machine learning models. Just one excellent feature could be your ticket to winning a Kaggle challenge! Feature engineering matters even more for unstructured, textual data because we need to convert free-flowing text into numeric representations that machine learning algorithms can work with.

Text is mostly available in unstructured form, yet it is very high in dimensionality: in the simplest representations, every distinct token in the vocabulary becomes its own dimension. The ability to represent text in the most appropriate way is therefore one of the key ingredients for working in this domain.


Let us understand the current dataset at hand by checking the obvious aspects of a textual dataset.

In [None]:
# unique list of characters and total characters in the file
char_vocab = sorted(set(raw_text))


# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(char_vocab)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

### Tokenize and Vectorize
To leverage different algorithms, we convert text into numbers that can be represented as tensors.

The first step is to convert text to tokens, a process called tokenization. If we use a word-level representation, each word is represented by its own token. We will use the built-in tokenizer from the torchtext module.
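To make the idea of word-level tokenization concrete, here is a rough sketch using only the standard library. The function name `rough_word_tokenize` and its regular expression are assumptions made for illustration; they approximate the behaviour of a simple English tokenizer and are not torchtext's actual rules.

```python
import re

def rough_word_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # a token is either a run of letters/digits/apostrophes
    # or a single non-space symbol such as '.' or ','
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())

print(rough_word_tokenize("To Sherlock Holmes she is always THE woman."))
```

Each word and each punctuation mark becomes a separate token, which is the unit we later map to a number.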

In [None]:
import torchtext; torchtext.disable_torchtext_deprecation_warning()

In [None]:
# Deprecation notice!
from torchtext.data import get_tokenizer
from torchtext.vocab import Vocab

In [None]:
tokenizer = get_tokenizer('basic_english')

In [None]:
tokens = tokenizer(raw_text[:50])
print(f'Token list:\n{tokens}')

Now, to convert text to numbers, we will need to build a vocabulary of all tokens.

In [None]:
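# word level vocab
word_counter = collections.Counter()
for line in raw_text.split('\n'):
    word_counter.update(tokenizer(line))
# torchtext >= 0.10: Vocab only wraps a built vocab; the vocab() factory builds one from counts
from torchtext.vocab import vocab as build_vocab
word_vocab = build_vocab(word_counter, min_freq=1, specials=['<unk>'])
# map out-of-vocabulary tokens to <unk> instead of raising an error
word_vocab.set_default_index(word_vocab['<unk>'])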

In [None]:
# sample lookup at word-level
#TODO: Print a few tokens with their indices

In [None]:
# character level vocab
char2idx = {u:i for i, u in enumerate(char_vocab)}
idx2char = np.array(char_vocab)

text_as_int = np.array([char2idx[c] for c in raw_text])

In [None]:
# char level mapping
print('{')
for char,_ in zip(char2idx, range(10)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

### Text as Vector

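A ``torchtext`` vocabulary lets us convert from a string representation into numbers through its ``stoi`` mapping (``stoi`` -> "string to integer"). Older torchtext releases expose it as the ``vocab.stoi`` attribute; recent releases expose it through ``vocab.get_stoi()``.

To convert a numeric representation back into text, we can use the reverse ``itos`` mapping (``vocab.itos`` in older releases, ``vocab.get_itos()`` in recent ones).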
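The forward and reverse lookups can be illustrated without torchtext. The sketch below uses a small made-up sentence and plain Python containers; the names `stoi`, `itos`, `toy_encode` and `toy_decode` are assumptions that mirror the torchtext naming convention and are not the library's own objects.

```python
# Toy vocabulary: illustrative only, not the torchtext implementation
toy_tokens = "the quick brown fox jumps over the lazy dog".split()

# itos: index -> token (a list built from the sorted unique tokens)
itos = sorted(set(toy_tokens))
# stoi: token -> index (a dict), the inverse mapping
stoi = {token: index for index, token in enumerate(itos)}

def toy_encode(text):
    """Map a whitespace-tokenized string to a list of integer ids."""
    return [stoi[token] for token in text.split()]

def toy_decode(ids):
    """Map a list of integer ids back to a string."""
    return " ".join(itos[i] for i in ids)

ids = toy_encode("the lazy dog")
print(ids)              # [7, 4, 1]
print(toy_decode(ids))  # the lazy dog
```

Encoding followed by decoding returns the original tokens, which is all that the two mappings need to guarantee.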

In [None]:
word_vocab_size = len(word_vocab)
print(f"Word Vocab size= {word_vocab_size}")


def encode(x):
    return [word_vocab[s] for s in tokenizer(x)]

vec = encode(raw_text[:100])
print(vec)

### Bag Of Words Representation

Bag of Words (BoW) representation is a traditional vector representation of text for NLP tasks. Each word/character is linked to a vector index, and the vector element at that index contains the number of occurrences of the word/character in a given document.
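As a minimal sketch of the idea, the snippet below counts words against a small hand-picked vocabulary. The function `toy_bow`, the vocabulary and the sentence are all made up for illustration; whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def toy_bow(text, vocabulary):
    """Return a count vector aligned with `vocabulary` (a list of tokens)."""
    counts = Counter(text.lower().split())
    # tokens outside the vocabulary are simply ignored in this sketch
    return [counts[token] for token in vocabulary]

toy_vocabulary = ["blue", "is", "sea", "sky", "the"]
bow_vector = toy_bow("the sky is blue and the sea is blue", toy_vocabulary)
print(bow_vector)  # [2, 2, 1, 1, 2]
```

Word order is lost: only the position in the vocabulary and the count survive, which is why the representation is called a "bag".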


In [None]:
def to_bow(text,bow_vocab_size=word_vocab_size):
    res = torch.zeros(bow_vocab_size,dtype=torch.float32)
    for i in encode(text):
        if i<bow_vocab_size:
            res[i] += 1
    return res

In [None]:
sample_text = "this is a sample text to showcase text representation"
print(f"sample text:\n{sample_text}")
#TODO: Print a BoW vector of the sample text segment chosen above

### TF-IDF

TF-IDF stands for term frequency–inverse document frequency. It is a form of bag of words representation, where instead of a **binary value** indicating the appearance of a word in a document, a floating-point value is used, which is related to the frequency of word occurrence in the corpus.

The formula to calculate TF-IDF is:

$w_{ij} = tf_{ij} \times \log\left(\frac{N}{df_i}\right)$

Where:

- $i$ is the word
- $j$ is the document
- $w_{ij}$ is the weight or the importance of the word in the document
- $tf_{ij}$ is the number of occurrences of the word $i$ in the document $j$, i.e. the BoW value we have seen before
- $N$ is the number of documents in the collection
- $df_i$ is the number of documents containing the word $i$ in the whole collection


The TF-IDF value $w_{ij}$ increases proportionally to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently than others. For example, if a word appears in every document in the collection, then $df_i=N$, so $\log(N/df_i)=0$ and $w_{ij}=0$; such terms are disregarded completely.
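The formula can be checked by hand on a tiny example. The sketch below uses three made-up, pre-tokenized documents and the natural logarithm; the helper names `document_frequency` and `tf_idf` are assumptions for illustration, and the code assumes the queried word occurs in at least one document.

```python
import math

# Three tiny hypothetical documents, already tokenized
toy_docs = [
    ["the", "sky", "is", "blue"],
    ["the", "fox", "is", "quick"],
    ["the", "blue", "fox"],
]
N = len(toy_docs)

def document_frequency(word):
    """df_i: number of documents that contain the word at least once."""
    return sum(1 for doc in toy_docs if word in doc)

def tf_idf(word, doc):
    """w_ij = tf_ij * log(N / df_i), with tf_ij the raw count in the document."""
    tf = doc.count(word)
    return tf * math.log(N / document_frequency(word))

print(tf_idf("the", toy_docs[0]))   # 0.0, the word appears in every document
print(tf_idf("sky", toy_docs[0]))   # log(3), the word appears in one document
print(tf_idf("blue", toy_docs[0]))  # log(3/2), the word appears in two documents
```

The rarer a word is across documents, the larger its weight in the documents where it does occur.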

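Let's compute TF-IDF on our text. We split the book into lines, drop the empty ones, and treat each remaining line as a document.
Instead of maintaining the document-frequency counters by hand, we use scikit-learn's `TfidfVectorizer`, fitted on the lines from the first 2000 characters of the text to keep processing fast. Note that `TfidfVectorizer` by default applies a smoothed variant of the IDF term and L2-normalizes each row, so its values differ from the plain formula above.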

In [None]:
raw_text_lines = raw_text.split('\n')
raw_text_lines = [line for line in raw_text_lines if line not in [' ','']]

In [None]:
raw_text_lines[3]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(raw_text[:2000].split('\n'))

In [None]:
vectorizer.transform([raw_text_lines[3]]).todense()

### Word Embeddings
A word embedding is a learned dense representation of text. In this approach we represent words and documents as dense vectors that have distinct lexical properties. This can be considered one of the key breakthroughs in the field of NLP.

Let us briefly look at the two Word2Vec models, Skip-gram and CBOW.

### Word2Vec

This model was created by [Mikolov et al. at Google in 2013](https://arxiv.org/abs/1301.3781). It is a predictive, shallow neural network model designed to compute and generate high quality, distributed and continuous dense vector representations of words, which capture contextual and semantic similarity. Essentially these are unsupervised models which can be trained on massive textual corpora, create a vocabulary of possible words and generate dense word embeddings for each word in the vector space representing that vocabulary.

There are two different model architectures which can be leveraged by Word2Vec to create these word embedding representations:

- The Continuous Bag of Words (CBOW) Model
- The Skip-gram Model

### Continuous Bag of Words (CBOW) Model
The CBOW model architecture tries to predict the current __`target word`__ (the center word) based on the __`source context words`__ (surrounding words).

Consider a simple sentence, “the quick brown fox jumps over the lazy dog”. We can turn it into pairs of (context_window, target_word). With a context window of size 2 (one word on each side of the target), we get examples like __([quick, fox], brown)__, __([the, brown], quick)__, __([the, dog], lazy)__ and so on.

Thus the model tries to predict the target_word based on the context_window words.

<img src="./assets/cbow_arch_notebook_1.png">
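The (context_window, target_word) pairs described above can be generated with a few lines of Python. This is a sketch of the pair construction only, not of the model; the function name `cbow_pairs` and the `half_window` parameter are assumptions for illustration, and words at the sentence edges are skipped so that every context has the same length.

```python
def cbow_pairs(tokens, half_window=1):
    """Return a list of (context_words, target_word) pairs.

    half_window=1 takes one word on each side of the target,
    i.e. the context window of size 2 used in the text.
    """
    pairs = []
    for i in range(half_window, len(tokens) - half_window):
        context = tokens[i - half_window:i] + tokens[i + 1:i + 1 + half_window]
        pairs.append((context, tokens[i]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for context, target in cbow_pairs(sentence)[:3]:
    print(context, "->", target)
```

The first pairs printed are `['the', 'brown'] -> quick` and `['quick', 'fox'] -> brown`, matching the examples in the text.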

### Skip-gram Model
The Skip-gram model architecture usually tries to achieve the reverse of what the CBOW model does. It tries to predict the __`source context words`__ (surrounding words) given a __`target word`__ (the center word).

Consider our simple sentence from earlier, “the quick brown fox jumps over the lazy dog”. With the CBOW model, we get pairs of (context_window, target_word); for a context window of size 2, these are examples like __([quick, fox], brown)__, __([the, brown], quick)__, __([the, dog], lazy)__ and so on.

Now considering that the skip-gram model’s aim is to predict the context from the target word, the model typically inverts the contexts and targets, and tries to predict each context word from its target word. Hence the task becomes to predict the context [quick, fox] given target word ‘brown’ or [the, brown] given target word ‘quick’ and so on.

Thus the model tries to predict the context_window words based on the target_word.

<img src="./assets/skipgram_arch_notebook_1.png">
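The inversion of contexts and targets can be sketched in the same way. The function name `skipgram_pairs` and the `half_window` parameter are assumptions for illustration; the snippet only builds the training pairs, with one (target_word, context_word) pair per context word, and does not train a model.

```python
def skipgram_pairs(tokens, half_window=1):
    """Return (target_word, context_word) pairs, one pair per context word."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - half_window)
        end = min(len(tokens), i + half_window + 1)
        for j in range(start, end):
            if j != i:  # skip the target word itself
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(sentence)[:5])
```

For the target word 'brown' this yields ('brown', 'quick') and ('brown', 'fox'), i.e. the CBOW example ([quick, fox], brown) split into one pair per context word.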

In [None]:
corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
          'I love green eggs, ham, sausages and bacon!',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'
]

In [None]:
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # remove special characters, then lower case and trim whitespace
    # (flags must be passed by keyword; the 4th positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(corpus)
norm_corpus

## Gensim Framework

The ``gensim`` framework, created by Radim Řehůřek, provides a robust, efficient and scalable implementation of the __Word2Vec__ model. We will leverage it on our sample toy corpus. In our workflow, we will tokenize our normalized corpus and then focus on the following five parameters of the Word2Vec model to build it.

- vector_size: The word embedding dimensionality
- window: The context window size
- min_count: The minimum word count
- sample: The downsample setting for frequent words
- sg: The training algorithm, 1 for skip-gram, otherwise CBOW

We will build a simple Word2Vec model on the corpus and visualize the embeddings.

In [None]:
from gensim.models import word2vec

In [None]:
tokenized_corpus = [tokenizer(line) for line in norm_corpus]

# Set values for various parameters
feature_size = 15    # Word vector dimensionality
window_context = 5   # Context window size
min_word_count = 1   # Minimum word count
sample = 1e-3        # Downsample setting for frequent words
sg = 1               # skip-gram model

w2v_model = word2vec.Word2Vec(tokenized_corpus,
                              vector_size=feature_size,
                              window=window_context,
                              min_count = min_word_count,
                              sg=sg,
                              sample=sample,
                              epochs=5000)
w2v_model

In [None]:
w2v_model.wv['sky']

In [None]:
#TODO: Print the vector for the word India

In [None]:
import scienceplots
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
plt.style.use(['science','ieee','no-latex'])

%matplotlib inline

In [None]:
# visualize embeddings
words = w2v_model.wv.index_to_key
wvs = w2v_model.wv[words]

# note: scikit-learn >= 1.5 renames n_iter to max_iter
tsne = TSNE(n_components=2, random_state=42, n_iter=5000, perplexity=5)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(wvs)
labels = words

plt.figure(figsize=(12, 6))
plt.scatter(T[:, 0], T[:, 1],)
for label, x, y in zip(labels, T[:, 0], T[:, 1]):
    plt.annotate(label, xy=(x+1, y+1), xytext=(0, 0), textcoords='offset points')

In [None]:
w2v_model.wv.most_similar('dog', topn=10)

## Similar and Improved Works 
- [GloVe](https://nlp.stanford.edu/pubs/glove.pdf)
- [FastText](https://arxiv.org/pdf/1607.04606.pdf)
- [Doc2Vec / Paragraph Vector](https://arxiv.org/abs/1405.4053)
- X2Vec

In [None]:
with open("norm_corpus.txt","w") as f:
    for line in norm_corpus:
        f.write(line+'\n')

In [None]:
import fasttext
fasttext_model = fasttext.train_unsupervised('norm_corpus.txt', model='skipgram',epoch=500000,minCount=1,loss='ns')

In [None]:
fasttext_model.get_word_vector('sky')

In [None]:
# TODO: Get Vector for India

In [None]:
# TODO: Identify nearest neighbors for the word breakfast

### Limitations
One key limitation of traditional pretrained embedding representations such as Word2Vec is word sense disambiguation. While pretrained embeddings can capture some of the meaning of words in context, every possible meaning of a word is encoded into the same embedding. This can cause problems in downstream models, since many words, such as the word 'play', have different meanings depending on the context they are used in.

For example, the word 'play' has quite different meanings in these two sentences:

- I went to a **play** at the theatre.
- John wants to **play** with his friends.

The pretrained embeddings above represent both meanings of the word 'play' in the same embedding. To overcome this limitation, we need to build embeddings based on a language model, which is trained on a large corpus of text and knows how words can be put together in different contexts.
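The limitation is easy to see once a static embedding is viewed as a lookup table. The sketch below uses a made-up table with 3-dimensional vectors (the values and the helper `embed` are assumptions for illustration, not trained embeddings): the surrounding sentence never influences the result.

```python
# Hypothetical 3-dimensional static embeddings (values are made up)
static_embeddings = {
    "play": [0.12, -0.48, 0.33],
    "theatre": [0.90, 0.05, -0.21],
    "friends": [-0.30, 0.77, 0.10],
}

def embed(sentence, word):
    """Look up `word` in the static table; the sentence is ignored entirely."""
    return static_embeddings[word]

vec_1 = embed("I went to a play at the theatre", "play")
vec_2 = embed("John wants to play with his friends", "play")
print(vec_1 == vec_2)  # True: context has no effect on the lookup
```

A contextual model would instead compute the vector from the whole sentence, so the two occurrences of 'play' could receive different representations.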

---

## Thought Exercise

- We discussed representing text as tokens and why it is important
- We discussed different tokenization methods (character-wise, word-wise and more...)
- But does the meaning of those tokens make any difference to the tokenizer (or even the model)?

Probably not. Check this experiment tweeted by [Andrej Karpathy](https://x.com/karpathy/status/1816637781659254908/photo/1)
<img src="./assets/karpathy_emoji_tokenizer.jpeg">
