In [36]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import itertools
import spacy
nlp = spacy.load('en_core_web_sm')
from keybert import KeyBERT
keywords_model = KeyBERT(model=nlp)

import warnings
warnings.filterwarnings('ignore')

with open('abstract.txt', 'r') as f:
    abstract = f.read()
with open('introduction.txt','r') as f:
    introduction = f.read()


## Similarity of docs
### In this notebook I will try to compute the similarity of two docs:

- "Attention is all you need" paper from https://arxiv.org/pdf/1706.03762.pdf (Abstract)
- https://medium.com/@adityathiruvengadam/transformer-architecture-attention-is-all-you-need-aeccd9f50d09 (Introduction)

 Intuitively, both docs should be quite similar, as they describe the same topic, but<br> let's see what our similarity measures return given the different writing styles, document lengths, etc.
 <br>
 <br>
 As a first step, I will do a simple normalization of the docs using spaCy.

In [8]:
def normalize(text: str) -> str:
    doc = nlp(text)
    norm_text = []
    for token in doc:
        # drop punctuation, stop words, and whitespace tokens
        if not token.is_punct and not token.is_stop and not token.is_space:
            norm_text.append(token.lemma_.lower())
    return ' '.join(norm_text)

In [31]:
normalized_abstract = normalize(abstract)
normalized_introduction = normalize(introduction)

### First method - computing cosine similarity with tf-idf vectors
#### Pros:
- quite easy to understand
- fast
- statistical method: no need for a larger corpus of docs to compare
- on small datasets the vectors are quite short, so they are fast to compute

#### Cons:
- word similarities are overlooked
- vectors grow with the vocabulary as the dataset gets bigger, which may slow down computation considerably
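
To make the first con concrete, here is a minimal pure-Python sketch of tf-idf cosine similarity. It is not the notebook's implementation: the helper names `tfidf_vectors` and `cosine` and the three toy sentences are made up for illustration, and tokenization is a plain whitespace split. The weighting follows the smoothed idf formula that scikit-learn uses by default.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build L2-normalized tf-idf vectors (as dicts) for non-empty docs."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many docs each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Toy sentences, invented for this example
docs = [
    "car drives road",
    "automobile travels highway",
    "car drives highway",
]
vecs = tfidf_vectors(docs)
sim_synonyms = cosine(vecs[0], vecs[1])  # 0.0: no shared tokens
sim_overlap = cosine(vecs[0], vecs[2])   # positive: two shared tokens
print(sim_synonyms, round(sim_overlap, 2))
```

The first two sentences mean roughly the same thing, yet they score 0.0 because they share no tokens. Only literal overlap counts for tf-idf.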

In [21]:
def tfidf_cosine_similarity(*args: str) -> None:
    """Print the tf-idf cosine similarity for each pair of texts."""
    # Create the tf-idf matrix; scikit-learn expects stop words as a list, not a set
    tfidf_vectorizer = TfidfVectorizer(stop_words=list(nlp.Defaults.stop_words))
    tfidf_texts = tfidf_vectorizer.fit_transform(args)
    # Compute cosine similarity for each pair of texts
    for i, j in itertools.combinations(range(len(args)), 2):
        score = cosine_similarity(tfidf_texts[i], tfidf_texts[j])[0][0]
        print(f'text {i+1} and text {j+1} has cosine similarity {np.round(score, 2)}')


In [32]:
tfidf_cosine_similarity(normalized_abstract, normalized_introduction)

text 1 and text 2 has cosine similarity 0.11


### Second method - computing cosine similarity with spaCy embedding vectors
#### Pros:
- embeddings can capture relationships between words
- short docs can be compared with quite good results

#### Cons:
- rather hard to understand: the vectors are created by a neural network, usually with a Transformer architecture (BERT, RoBERTa, etc.)
- the output is only as good as the data used to train the embeddings. When working on very specific data we may need to train our own transformer to create embeddings, which is nearly impossible to do on a personal computer

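That said, many open-source Transformers are available on Hugging Face, GitHub, etc.
Here I will use "big English spaCy", but there is always room to improve the method by testing other embeddings. Note that the import cell loads `en_core_web_sm`, which ships without static word vectors; a larger pipeline such as `en_core_web_lg` provides them.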
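
The idea behind comparing docs through embeddings can be sketched with toy numbers. The 3-dimensional vectors below are invented purely for illustration (real embeddings are learned and have hundreds of dimensions), and `doc_vector` and `cosine` are hypothetical helpers. Averaging token vectors mirrors what spaCy's `Doc.vector` does by default for pipelines with static word vectors.

```python
import math

# Hypothetical 3-dimensional "embeddings", invented for illustration only
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "bike": [0.1, 0.2, 0.9],
}


def doc_vector(tokens):
    """Average the vectors of all tokens found in the toy vocabulary."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_royal = cosine(doc_vector(["king"]), doc_vector(["queen"]))
sim_mixed = cosine(doc_vector(["king"]), doc_vector(["bike"]))
print(sim_royal > sim_mixed)
```

Unlike tf-idf, two docs with no tokens in common can still score high here, because the similarity is computed in the embedding space rather than on literal word overlap.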

In [23]:
def spacy_cosine_similarity(*args: str) -> None:
    """Print the spaCy cosine similarity for each pair of texts."""
    # Process each text with the spaCy pipeline
    spacy_texts = [nlp(text) for text in args]
    # Compute cosine similarity for each pair of docs
    for i, j in itertools.combinations(range(len(args)), 2):
        score = spacy_texts[i].similarity(spacy_texts[j])
        print(f'text {i+1} and text {j+1} has cosine similarity {np.round(score, 2)}')

In [33]:
spacy_cosine_similarity(normalized_abstract, normalized_introduction)

text 1 and text 2 has cosine similarity 0.91


### Third method - comparing keywords
#### Pros:
- easy (especially when using vectorizers like tf-idf)

#### Cons:
- can only serve as an aid for further human validation
- if we want to use tf-idf, a good corpus is needed to produce good keywords for a document

Because of the cons above, I use KeyBERT here to extract the keywords.

In [14]:
def create_keywords(doc: str) -> list:
    """Return the top 20 keywords, sorted by relevance"""
    keywords = keywords_model.extract_keywords(
        doc,
        keyphrase_ngram_range=(1, 1),
        # scikit-learn (used internally) expects stop words as a list, not a set
        stop_words=list(nlp.Defaults.stop_words),
        top_n=20,
    )
    return [keyword for keyword, _score in keywords]

def compare_keywords(*args: str) -> None:
    """Print the keywords shared by each pair of texts."""
    # Create keywords
    keywords = [create_keywords(text) for text in args]
    # Find the shared keywords for each pair of docs
    for i, j in itertools.combinations(range(len(args)), 2):
        common = set(keywords[i]) & set(keywords[j])
        print(f'text {i+1} and text {j+1} has common keywords: {common}')

In [34]:
compare_keywords(normalized_abstract, normalized_introduction)

text 1 and text 2 has common keywords: {'base', 'architecture', 'state'}
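
If a single number is preferred over a set of shared keywords, one possible option (not used in this notebook) is the Jaccard index: the size of the intersection divided by the size of the union. The sketch below uses a hypothetical `jaccard` helper and made-up keyword lists, not the actual KeyBERT output.

```python
def jaccard(a, b):
    """Jaccard index of two keyword collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical keyword lists, not the notebook's actual KeyBERT output
keywords_1 = ["attention", "architecture", "transformer", "translation"]
keywords_2 = ["architecture", "encoder", "transformer", "decoder"]
score = jaccard(keywords_1, keywords_2)
print(round(score, 2))
```

Such a score still treats keywords as exact strings, so it is best read as a rough signal to support human validation.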


### Summary
 <br><br>
I wanted to show how big an improvement transformers are for NLP as a whole. Comparing cosine similarity computed on different vectors (tf-idf vs. embeddings) illustrates this quite well:
<br><br>
With the simple statistical method we get 0.11, so the documents appear dissimilar. This may be caused by the different lengths of the texts and the different writing styles of papers and articles.
<br>
The cosine similarity computed on embedding vectors gives a completely different result: a score of 0.91. The reason is that embedding vectors encode semantic relatedness as distance. For example, words like "queen" and "king" will be closer to each other than words like "bike" and "grenade".
<br><br><br>
This notebook is a rather simple presentation of methods we can use for NLP; there is plenty of room to improve all of the methods above, mainly through parameter tuning. One quite interesting option is using n-grams (phrases) rather than only single words.<br><br>
In my opinion, though, the biggest challenge for unsupervised approaches to NLP is finding a good validation method and tuning our models/scripts against it. There is never a single best way to compare the methods above; the choice should rather come from the application's requirements and the way it will be used.
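
The n-gram idea mentioned in the summary can be sketched in a few lines. The `ngrams` helper below is a hypothetical illustration, not part of the notebook. In practice the libraries used here expose this directly: `TfidfVectorizer` has an `ngram_range` parameter, and KeyBERT's `extract_keywords` takes the `keyphrase_ngram_range` argument that this notebook sets to `(1, 1)`.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token phrases from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "attention is all you need".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
# ['attention is', 'is all', 'all you', 'you need']
```

Using phrases keeps multi-word terms together as single features, at the cost of a larger vocabulary.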