# Notebook 1: Text representations and exploration

## Load Data

In [None]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
print(newsgroups["DESCR"][:394])

docs, labels, target_names = newsgroups["data"], newsgroups["target"], newsgroups["target_names"]

In [None]:
target_names

In [None]:
docs[0]

In [None]:
target_names[labels[0]]

## 1. The bag of words model (BOW)

The bag-of-words model converts text into numerical feature vectors that machine learning algorithms can be trained on. Fitting a bag-of-words model involves two key steps:

- Build a vocabulary that maps every distinct word (token) in the entire set of documents to an index. The indices can, for example, be assigned in alphabetical order.
- Construct a numerical feature vector for each document that records how often each vocabulary word appears in that document. These vectors are sparse, because any single document contains only a small subset of the words in the whole vocabulary.
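
The two steps can be sketched by hand in plain Python. This is only an illustration under simplifying assumptions: the `toy_docs` corpus and the helper `to_count_vector` are made up for this example, and tokens are split on whitespace, whereas `CountVectorizer` applies its own tokenization (by default it lowercases the text and ignores single-character tokens).

```python
from collections import Counter

# Hypothetical toy corpus, used only for this illustration
toy_docs = [
    "the cat sat on the mat",
    "the dog sat",
]

# Step 1: build the vocabulary, assigning indices in alphabetical order
all_tokens = {token for doc in toy_docs for token in doc.split()}
vocabulary = {token: index for index, token in enumerate(sorted(all_tokens))}


# Step 2: count how often each vocabulary token occurs in a document
def to_count_vector(doc, vocabulary):
    counts = Counter(doc.split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        vector[vocabulary[token]] = count
    return vector


bow_vectors = [to_count_vector(doc, vocabulary) for doc in toy_docs]
print(vocabulary)
print(bow_vectors)
```

Each vector has one entry per vocabulary word, so most entries are zero once the vocabulary grows. This is why scikit-learn returns a sparse matrix instead of a dense array.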

Further reading:
- https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

In [None]:
import numpy as np

example_docs = np.array(['Mirabai has won a silver medal in weight lifting in Tokyo olympics 2021',
                         'Sindhu has won a bronze medal in badminton in Tokyo olympics',
                         'Indian hockey team is in top four team in Tokyo olympics'])

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=10000)

encoded_example_docs = vectorizer.fit_transform(example_docs)

In [None]:
encoded_example_docs[0].toarray()

In [None]:
encoded_example_docs[0].toarray().sum()

In [None]:
vectorizer.vocabulary_

### 1.1 Tfidf
In a large text corpus, some words are very frequent (e.g. “the”, “a”, “is” in English) and therefore carry very little meaningful information about the actual contents of a document. If we fed the raw counts directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms. The tf–idf (term frequency, inverse document frequency) weighting addresses this. It was originally developed as a term weighting scheme for information retrieval (as a ranking function for search engine results) and has also found good use in document classification and clustering.

#### Definition:
In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.

$$\text{tf-idf(t,d)}=\text{tf(t,d)} \times \text{idf(t)}$$

$$\text{idf}(t) = \log{\frac{1 + n}{1+\text{df}(t)}} + 1$$

- $n$: the total number of documents in the document set
- term frequency $\text{tf}(t,d)$: the number of times term $t$ occurs in document $d$
- document frequency $\text{df}(t)$: the number of documents in the document set that contain term $t$
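
To make the formula concrete, here is a small hand computation in plain Python. It is a sketch under stated assumptions: the `toy_docs` corpus and the helpers `idf` and `tf_idf` are hypothetical, tokens are split on whitespace, and the natural logarithm is used. As far as the scikit-learn documentation describes, `TfidfVectorizer` with default settings applies this smoothed idf and then also L2-normalizes each document vector, so its output values will differ from the unnormalized products printed here.

```python
import math

# Hypothetical toy corpus, used only for this illustration
toy_docs = [
    "the cat sat on the mat",
    "the dog sat",
    "the bird flew",
]

tokenized = [doc.split() for doc in toy_docs]
n = len(tokenized)  # total number of documents


def idf(term):
    # df(t): number of documents that contain the term at least once
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n) / (1 + df)) + 1


def tf_idf(term, tokens):
    # tf(t, d): raw count of the term in the document
    return tokens.count(term) * idf(term)


print(idf("the"))  # "the" occurs in every document, so its idf is the minimum
print(idf("cat"))  # "cat" occurs in one document, so its idf is larger
print(tf_idf("the", tokenized[0]))
print(tf_idf("cat", tokenized[0]))
```

A term that appears in every document gets the smallest possible idf of 1, while rarer terms are scaled up relative to it.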

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=10000)

encoded_example_docs = tfidf_vectorizer.fit_transform(example_docs)

In [None]:
encoded_example_docs[0].toarray()

### 1.2 n-grams
The simple BOW model discards all word-order information. To preserve some of the local ordering, we can extract 2-grams (or higher-order n-grams) of words in addition to the 1-grams (individual words).
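
As a rough illustration of what gets extracted, the following sketch lists the 1-grams and 2-grams of a short phrase. The helper `word_ngrams` is hypothetical and splits on whitespace only. It is not how `CountVectorizer` is implemented, but its parameters `n_min` and `n_max` play the same role as the two values in `ngram_range`.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of the text for n_min <= n <= n_max."""
    tokens = text.split()
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n consecutive tokens over the text
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


print(word_ngrams("top four team"))
```

Each distinct n-gram becomes its own vocabulary entry, so the vocabulary grows quickly as the maximum n increases.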

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_bigram = CountVectorizer(max_features=10000, ngram_range=(1,2))

encoded_example_docs = vectorizer_bigram.fit_transform(example_docs)

In [None]:
encoded_example_docs[0].toarray()

In [None]:
vectorizer_bigram.vocabulary_

## 2. Language model embeddings
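Instead of counting words, we can encode each document as a dense vector with a pretrained language model. Here we use the `sentence-transformers` library (Sentence-BERT), which loads pretrained models from the Hugging Face Hub.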

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

#Sentences are encoded by calling model.encode()
embeddings = model.encode(example_docs.tolist())

#Print the embeddings
for sentence, embedding in zip(example_docs, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

In [None]:
# NOTE: encoding the full corpus is slow on CPU, so the call is left commented out.

# Sentences are encoded by calling model.encode()
# lm_embeddings = model.encode(docs)

## 3. Compare word/document vectors and find similars

We can compare word or document vectors with a similarity measure. An established choice is the cosine similarity of the embedding vectors, which equals their scalar product when the vectors are normalized to unit length.
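
For two vectors $u$ and $v$, the cosine similarity is $\frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. The sketch below computes it by hand for small dense vectors. The helper `cosine` and the example vectors are made up for illustration and assume that neither vector is all zeros.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1, 0, 2]
b = [2, 0, 4]  # same direction as a, twice the length
c = [0, 3, 0]  # shares no non-zero dimension with a

print(cosine(a, b))  # close to 1: the direction matters, not the length
print(cosine(a, c))  # 0: no overlap between the two vectors
```

Because only the direction matters, a long document and a short document that use the same words in similar proportions still score as highly similar.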

In [None]:
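# Example: cosine similarity between the first two example documents.
# Note: if the cells were run in order, encoded_example_docs was last assigned
# in section 1.2, so these are unigram + bigram count vectors.
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(encoded_example_docs[0], encoded_example_docs[1])
similarity[0][0]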

### 3.1 Let's search in the original documents

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


tfidf_vectorizer = TfidfVectorizer(max_features=10000)
encoded_docs = tfidf_vectorizer.fit_transform(docs)

In [None]:
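query_text = "Why is my Windows PC so slow?"
# Encode the query with the vocabulary and idf weights fitted on the corpus
query_embedding = tfidf_vectorizer.transform([query_text])

# Compare the query against all documents in a single call;
# the result has shape (1, n_docs), so take the first row
scores = cosine_similarity(query_embedding, encoded_docs)[0]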

In [None]:
# get the highest scoring document
docs[np.argmax(scores)]

## 4. Visual exploration

Further reading:
- UMAP: https://umap-learn.readthedocs.io/en/latest/index.html

In [None]:
import pandas as pd
from sklearn.decomposition import PCA, TruncatedSVD

import umap
import umap.plot

# Some plotting libraries
import matplotlib.pyplot as plt
%matplotlib notebook
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE
output_notebook(resources=INLINE)

In [None]:
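# TruncatedSVD reduces the sparse tf-idf matrix to 50 dense dimensions;
# unlike classic PCA it does not center the data first.
pca = TruncatedSVD(n_components=50)
pca_embedding = pca.fit_transform(encoded_docs)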

In [None]:
embedding = umap.UMAP(n_components=2, metric='cosine').fit(pca_embedding)

In [None]:
document_df = pd.DataFrame({
    "id": list(range(len(docs))),
    "label": [target_names[label] for label in labels]
})

In [None]:
f = umap.plot.interactive(embedding, labels=document_df.label, hover_data=document_df, point_size=7)

In [None]:
show(f)