# Document representation using NLP

![alt text](library-of-babel.jpg)


In [None]:
import re
import string

import pandas as pd
import seaborn as sns
import spacy
import plotly.express as px
from IPython.display import IFrame
from scipy.spatial import distance
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances, pairwise_distances

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

sns.set()

# spaCy model with word vectors (install with: python -m spacy download en_core_web_md)
nlp = spacy.load("en_core_web_md")

# NLTK resources: tokenizer models, stopword list, and WordNet (needed by the lemmatizer)
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stops = set(stopwords.words('english'))
punct = list(string.punctuation)

# 1. Distance metrics in document representation

If we can represent documents as vectors, this means we can measure the similarity (or distance) between them using geometrical tools. This is because each vector allows us to think of a document as a point in space, where each number in the vector is a coordinate along one dimension. We have encountered this idea already in our exploration of VAD, where the valence-arousal-dominance score of a word or phrase allows us to plot that word or phrase in 3D space. In NLP we usually work with many more than three dimensions, so we can't visualise the space directly. Nevertheless, distance metrics from 2D and 3D space easily generalise. We typically use two distance metrics in NLP:

### 1. Euclidean distance

The Euclidean distance between two vectors is the length of the shortest line that joins their endpoints. For vectors $A$ and $B$ in 2D space, it looks like this:
 
<img src="euclidean.png" alt="alt text" width="300" height="200"/>

For vectors $A(x_1,x_2,…,x_n)$ and $B(y_1,y_2,…,y_n)$ in $n$-dimensional space, the formula for Euclidean distance, $d$, is:

$$d = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2}$$

Note that Euclidean distance is never negative: it is zero for identical vectors and positive otherwise. As there <i>is</i> such a thing as opposite meaning, this makes it a less than ideal metric for comparing document meaning.
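
As a quick check of the formula, the sketch below computes the Euclidean distance by hand with NumPy and compares it with `scipy.spatial.distance.euclidean`. The two vectors are made-up values chosen purely for illustration.

```python
import numpy as np
from scipy.spatial import distance

# Two illustrative 3-dimensional vectors (hypothetical values)
A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, 6.0, 3.0])

# Apply the formula directly: square the differences, sum them, take the root
d_manual = np.sqrt(np.sum((B - A) ** 2))

# SciPy's implementation should give the same number
d_scipy = distance.euclidean(A, B)

print(d_manual, d_scipy)
```

The differences here are $(3, 4, 0)$, so both calculations reduce to $\sqrt{9 + 16 + 0}$.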

### 2. Cosine similarity

Cosine similarity is different from Euclidean distance in that it works with the <i>angles</i> between vectors. The idea is that two vectors that 'point' in the same direction are likely to be similar, and thus that any two documents that are represented by these vectors are also likely to be similar.

<img src="cosine_2.png" alt="alt text" width="500" height="400"/>

For vectors $A(x_1,x_2,…,x_n)$ and $B(y_1,y_2,…,y_n)$ in $n$-dimensional space, the formula for cosine similarity is given by:

$$
\text{sim}(A,B) = \frac{\sum_{i=1}^n x_i y_i}{\sqrt{\sum_{i=1}^n x_i^2} \times \sqrt{\sum_{i=1}^n y_i^2}}
$$

Where:
- $\sum_{i=1}^n x_i y_i$ is the dot product of $A$ and $B$,
- $\sqrt{\sum_{i=1}^n x_i^2}$ is the magnitude of vector $A$,
- $\sqrt{\sum_{i=1}^n y_i^2}$ is the magnitude of vector $B$.

Unlike Euclidean distance, cosine similarity can take any value between $-1$ and $1$. Specifically:

* If two documents have a score of $1$ they have the same meaning
* If two documents have a score of $0$ they have unrelated meanings
* If two documents have a score of $-1$ they have opposite meanings
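
The three cases above can be reproduced with a few lines of NumPy. This is a minimal sketch: `cosine_similarity` is a helper written here for illustration (the dot product divided by the product of the magnitudes), and the vectors are made-up 2D values.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# An illustrative 2D vector (hypothetical values)
a = np.array([1.0, 2.0])

same_direction = cosine_similarity(a, 3 * a)                # parallel, different length
orthogonal = cosine_similarity(a, np.array([-2.0, 1.0]))    # at right angles
opposite = cosine_similarity(a, -a)                         # pointing the other way

print(same_direction, orthogonal, opposite)
```

Note that scaling a vector (here by 3) does not change the score: only direction matters, not length.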


### Two distance examples

In [None]:
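d1 = "punky hairstyle"
d2 = "tangent to a point"

# spaCy's document vector is the average of the token vectors
v1 = nlp(d1).vector
v2 = nlp(d2).vector

euc = distance.euclidean(v1, v2)
# SciPy returns cosine *distance*, so subtract it from 1 to get similarity
cos = 1 - distance.cosine(v1, v2)

print(f"The Euclidean distance between the documents is {euc:.3f}.")
print(f"The cosine similarity between the documents is {cos:.3f}.")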

In [None]:
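# cos_df holds the pairwise cosine distances for the toy corpus built in section 2;
# run the cells that define it before plotting
sns.heatmap(cos_df)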

# 2. Topic modelling with Term-frequency, Inverse-document-frequency (TF-IDF)

While it's relatively easy to group words together, matters are less simple for documents. Nevertheless, we are usually far more interested in grouping documents than in grouping words, especially when they all belong to a corpus. So an early problem that emerged in NLP centred on how we might do this.

One solution proposed by Karen Spärck Jones in 1972 was TF-IDF scoring. This method is based on two ideas:

* We should 'punish' words that occur frequently in all the documents in our corpus
* We should 'reward' words that occur frequently in small numbers of documents in our corpus

The intuition behind these ideas is fairly simple. For a topic to be coherent, it will generally consist of a small number of concepts. As these concepts will be expressed as words, we should expect these topic-related words to concentrate in the documents that belong to this topic. However, some words (like prepositions and common verbs) will appear in all documents. So if we can create a word-document matrix that assigns a score to each word in each document, we can gain a numerical representation (a vector) of how each document in a corpus differs from every other one.

Once we have these vectors, we can then perform various operations (like clustering) upon them to discover how they might be grouped together. But how do we calculate them?

The TF-IDF score for a word is the product of two quantities: the term frequency, and the inverse document frequency.

$$TF\text{-}IDF (t, d, D) = tf(t,d)\times idf(t, D)$$

Where $t$ is a term, $d$ is a document, and $D$ is the corpus. Let's create a toy corpus to illustrate this.

$D$ = {{$d_1$: Atomic Burger makes a tasty burger}, {$d_2$: An atomic clock is accurate}, {$d_3$: Atomic weapons are destructive}}

First, let's look at how we might calculate $tf(t,d)$. This is the relative frequency of each term in each document. That is, it's the number of times a term $t$ occurs in a document $d$, written $f_{t,d}$, divided by the total number of terms in the document:

$$tf(t,d) = \frac{f_{t,d}}{\sum\limits_{t'\in d}f_{t',d}}$$

For example, in document $d_1$ 'burger' has a $tf(t,d)$ score of 0.333 because there are six words and it occurs twice. Every other word has a score of 0.167 because it only occurs once. Words that occur often are therefore 'rewarded' in this part of the calculation.
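
The term frequencies for $d_1$ can be checked with a short sketch. It assumes a deliberately naive tokenizer (lowercase, split on whitespace), which is enough for this toy sentence but not what we would use on real text.

```python
from collections import Counter

d1 = "Atomic Burger makes a tasty burger"

# Naive tokenization: lowercase and split on whitespace
tokens = d1.lower().split()
counts = Counter(tokens)

# Relative frequency: count of each term divided by the number of tokens
tf = {term: count / len(tokens) for term, count in counts.items()}

for term, score in tf.items():
    print(f"{term}: {score:.3f}")
```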

Now, let's look at $idf(t,D)$. This adjusts the $tf(t,d)$ score by capturing how often a word occurs across the entire corpus of documents. It is the logarithm of the number of documents in a corpus divided by the number of documents in the corpus that contain the term.

$$idf(t,D) = \log{\frac{N}{|\{d\in D: t \in d\}|}}$$

What's happening here? If a term occurs in all documents, the formula outputs a value of 0, because the total number of documents divided by the number of documents containing the term is 1, and $\log(1)= 0$. If the term occurs in a smaller number of documents, the formula gives a larger number, with the largest number being given when the term only occurs in one document. For example, 'atomic' occurs in all documents, so the $idf(t,D)$ value for 'atomic' is 0. However, 'burger' occurs in only one document, so the value (using the natural logarithm) is $\log{\frac{3}{1}} \approx 1.10$. In this way, the $idf(t,D)$ 'punishes' words that are common and therefore not topic specific.
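
The same arithmetic can be written as a small function. This is a sketch using the natural logarithm and a naive whitespace tokenizer; the helper name `idf` and the variable `docs` are chosen here for illustration.

```python
import math

D = [
    "Atomic Burger makes a tasty burger",
    "An atomic clock is accurate",
    "Atomic weapons are destructive",
]

# Represent each document as a set of lowercase tokens (naive whitespace split)
docs = [set(doc.lower().split()) for doc in D]

def idf(term, docs):
    """Natural log of (number of documents / number of documents containing term)."""
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

print(idf("atomic", docs))
print(idf("burger", docs))
```

As written, the function divides by zero for a term that appears in no document, so it should only be called with terms drawn from the corpus itself.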

The result is that by multiplying $tf(t,d)$ and $idf(t,D)$, we are able to capture the role played by a word in determining the topic of a document relative to a corpus. The $TF\text{-}IDF (t, d, D)$ representation of a document is a vector of the values taken by all the words in that document relative to all the words in the corpus (with a value of 0 being taken when a word does not appear in the document).
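
Putting the two parts together gives one vector per document. The sketch below builds the full TF-IDF table for the toy corpus by hand, again assuming a naive whitespace tokenizer and the natural logarithm; the helper names `tf` and `idf` are illustrative.

```python
import math

import pandas as pd

D = {
    "d1": "Atomic Burger makes a tasty burger",
    "d2": "An atomic clock is accurate",
    "d3": "Atomic weapons are destructive",
}

# Naive tokenization: lowercase and split on whitespace
docs = {name: text.lower().split() for name, text in D.items()}

# The vocabulary is every distinct term in the corpus
vocab = sorted({term for doc in docs.values() for term in doc})

def tf(term, doc):
    """Relative frequency of term within one tokenized document."""
    return doc.count(term) / len(doc)

def idf(term, all_docs):
    """Natural log of (corpus size / number of documents containing term)."""
    n_containing = sum(term in doc for doc in all_docs)
    return math.log(len(all_docs) / n_containing)

# One row per document, one column per vocabulary term
rows = [[tf(t, doc) * idf(t, docs.values()) for t in vocab] for doc in docs.values()]
tfidf = pd.DataFrame(rows, index=list(docs.keys()), columns=vocab)

print(tfidf.round(3))
```

Each row is the vector for one document. The 'atomic' column is zero throughout, because a term that appears in every document carries no information about which topic a document belongs to.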

In [None]:
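# Build a TF-IDF vectorizer that strips accents and removes English stopwords
vectorizer = TfidfVectorizer(input = 'content', strip_accents = 'ascii', stop_words = 'english')

D = ['Atomic Burger makes a tasty burger', 'An atomic clock is accurate', 'Atomic weapons are destructive']

# Fit on the toy corpus and convert the sparse result to a dense array
v = vectorizer.fit_transform(D)
v = v.toarray()

# One row per document, one column per term
d = pd.DataFrame(
    v,columns=vectorizer.get_feature_names_out())
d.index = ['d1', 'd2', 'd3']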
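
The scores produced by `TfidfVectorizer` will not match a hand calculation with the formulas above, because by default it departs from the textbook definition in two ways: with `smooth_idf=True` it computes $idf(t) = \ln\frac{1+N}{1+df(t)} + 1$, and with `norm='l2'` it rescales each document vector to unit length. The sketch below checks both points against a freshly fitted vectorizer; the variable names (`vec`, `X`, `df_t`, `idf_manual`) are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

D = ['Atomic Burger makes a tasty burger', 'An atomic clock is accurate', 'Atomic weapons are destructive']

vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(D).toarray()

# Document frequency: in how many documents does each term have a non-zero score?
df_t = (X > 0).sum(axis=0)

# scikit-learn's default smoothed idf
N = len(D)
idf_manual = np.log((1 + N) / (1 + df_t)) + 1

# Compare with the idf values stored on the fitted vectorizer
print(np.allclose(idf_manual, vec.idf_))

# Each row has unit Euclidean length because of the default L2 normalisation
print(np.linalg.norm(X, axis=1))
```

One consequence of the smoothing is that a term occurring in every document, such as 'atomic', keeps a small non-zero weight rather than dropping to 0.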

In [None]:
# Pairwise cosine distances between the documents
# (SciPy's distance.cosine returns 1 - cosine similarity, so 0 means identical direction)
cos = [[] for i in range(len(d))]

for i in range(len(d)):
    for j in range(len(d)):
        cos[i].append(distance.cosine(d.iloc[i], d.iloc[j]))

cos_df = pd.DataFrame(cos, columns = d.index, index = d.index)

# Pairwise Euclidean distances between the documents
euc = [[] for i in range(len(d))]

for i in range(len(d)):
    for j in range(len(d)):
        euc[i].append(distance.euclidean(d.iloc[i], d.iloc[j]))

euc_df = pd.DataFrame(euc, columns = d.index, index = d.index)
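
The nested loops above can also be written with the pairwise helpers in `sklearn.metrics.pairwise`, which compute a whole distance matrix in one call. A minimal sketch, rebuilding the toy corpus so that it runs on its own (the names `X`, `labels`, `cos_dist_df` and `euc_dist_df` are illustrative):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

D = ['Atomic Burger makes a tasty burger', 'An atomic clock is accurate', 'Atomic weapons are destructive']
labels = ['d1', 'd2', 'd3']

# Both helpers accept the sparse matrix returned by the vectorizer
X = TfidfVectorizer(stop_words='english').fit_transform(D)

cos_dist_df = pd.DataFrame(cosine_distances(X), index=labels, columns=labels)
euc_dist_df = pd.DataFrame(euclidean_distances(X), index=labels, columns=labels)

print(cos_dist_df.round(3))
print(euc_dist_df.round(3))
```

Because `TfidfVectorizer` scales every row to unit length by default, the two metrics are tied together here: the squared Euclidean distance equals twice the cosine distance, so both rank document pairs in the same order.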

# 3. A real-world example of TF-IDF

We're going to look at a Twitter dataset that contains tweets about eating disorders. The idea is to create a TF-IDF representation of each tweet and see if we can find any patterns in the data.

In [None]:
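# Load the eating-disorder tweets, drop duplicate tweets and label their source
data_ed = pd.read_excel('ED_twitter_data.xlsx', index_col = 0)
data_ed = data_ed.drop_duplicates(subset = ['tweet']).reset_index(drop = True)
data_ed.rename(columns={'tweet':'text'}, inplace=True)
data_ed['source'] = 'Eating disorder'

# Load a general Twitter dataset to act as a comparison group
data_random = pd.read_csv('twitter_gender_processed.csv')
data_random['source'] = 'Random'

# Keep only the columns both datasets share
data_ed = data_ed[['text', 'source']]
data_random = data_random[['text', 'source']]

# Sample as many random tweets as there are eating-disorder tweets
# (random_state fixes the sample so that the results are reproducible)
data_random = data_random.sample(n = len(data_ed), random_state = 0)

# Stack the two sets into a single corpus
data = pd.concat([data_ed, data_random]).reset_index(drop = True)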

In [None]:
data

### Creating a tokenizer 

First, let's construct a tokenizer that turns our data into lemmas and passes them to the TF-IDF vectorizer we want to use. We're doing this for two reasons:

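* The default tokenizer in `scikit-learn`'s `TfidfVectorizer` is very basic: it lowercases the text and splits it into runs of two or more word characters, with no lemmatization and, unless asked, no stopword removal. We've already seen that stopwords and redundant variation introduce a lot of noise into the analysis.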
* We'd like <i>in general</i> to be able to tweak our code to reflect what interests us, so it's important to know how to do that.

In [None]:
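# A good tokenizer:

def good_tokens(text):
    # Split the text into word and punctuation tokens
    words = word_tokenize(text)
    # Lowercase before lemmatizing so that capitalised forms are matched too
    lemmas = [lemmatizer.lemmatize(i.lower()) for i in words]
    # Drop stopwords and punctuation marks
    lemmas = [i for i in lemmas if i not in stops and i not in punct]
    return lemmas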

### Break the text into a list of documents (our corpus) and create the TF-IDF vectorizer

In [None]:
texts = [str(i) for i in data['text']]

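# token_pattern=None avoids the warning that the default pattern is unused
# when a custom tokenizer is supplied
vectorizer = TfidfVectorizer(tokenizer=good_tokens, token_pattern=None)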

### Now create a TF-IDF representation of our corpus

In [None]:
vectors = vectorizer.fit_transform(texts)

# Convert the sparse matrix to a dense DataFrame: one row per tweet, one column per term
df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())

In [None]:
len(df)

### How can we visualise this high dimensional data? 

There are a few ways in which relations between data points in high dimensions can be made visible. A very common one is known as <i>principal components analysis</i> (PCA). This works by finding the directions along which the data varies most and projecting every point onto the first few of them. The problem with this is that squashing a high dimensional space into a low dimensional space means lots of detail is lost. However, it can be a useful guide for identifying patterns. All calculations should be performed on the original TF-IDF vectors, not the PCA reduction.
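
One way to gauge how much detail a projection discards is the `explained_variance_ratio_` attribute of a fitted `PCA` object, which gives the share of the total variance captured by each component. A minimal sketch on randomly generated data (a stand-in for a TF-IDF matrix, not the tweet data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 'documents' described by 50 'terms'
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Project onto the three directions of greatest variance
pca = PCA(n_components=3)
comps = pca.fit_transform(X)

# Fraction of the original variance that survives the projection
retained = pca.explained_variance_ratio_.sum()

print(comps.shape)
print(f"Share of variance kept by 3 components: {retained:.1%}")
```

A low share means the 3D plot shows only a small slice of the structure in the data, which is why the plot is best treated as a guide and not as the basis for calculations.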

In [None]:
pca_1 = PCA(n_components = 3)
comps_1 = pca_1.fit_transform(df)
pc_df_1 = pd.DataFrame(data = comps_1, columns = ['PC'+str(i) for i in range(1, comps_1.shape[1]+1)])
pc_df_1['tweet text'] = data['text']


fig = px.scatter_3d(pc_df_1, x='PC1', y='PC2', z='PC3', hover_data = ['tweet text'])

fig.update_traces(marker=dict(size = 5, line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))

fig.show()

### How can we cluster our data using a distance metric?

There are several options when it comes to dividing our data into clusters. Usually, we need to know how many clusters we want in advance. K-means clustering works by taking $n$ clusters and finding the grouping of the data that minimises the variance within each cluster (measured as the squared Euclidean distance from each point to its cluster centre), where $n$ is set by the user.
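
To see what is being minimised, the sketch below fits k-means with different numbers of clusters to made-up 2D data and reads off `inertia_`, the within-cluster sum of squared distances to the nearest centre. The data and the variable names (`blob_a`, `blob_b`, `inertias`) are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two well-separated blobs of 2D points
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Fit k-means for several cluster counts and record the inertia of each fit
inertias = {}
for k in (1, 2, 3):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias[k] = km.inertia_

print(inertias)
```

The inertia always falls as clusters are added, so it cannot pick the number of clusters on its own; what is informative is where the drop flattens out, which for this data is after two clusters.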

In [None]:
kmeans = KMeans(n_clusters=3, random_state=0, n_init="auto").fit(df)

In [None]:
pc_df_1['clusters'] = [str(i) for i in kmeans.labels_]
pc_df_1['source'] = data['source']


In [None]:
fig = px.scatter_3d(pc_df_1, x='PC1', y='PC2', z='PC3', color = 'clusters', hover_data = ['source', 'tweet text'])

fig.update_traces(marker=dict(size = 5, line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))

fig.show()