<a href="https://colab.research.google.com/github/ziababar/demos/blob/master/derivative_security/document_similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background
---

The objective of this notebook is to measure how similar two documents are. This can be used to identify documents that were derived from a source document.







# References
---
The code in this notebook has been adapted from [Compare documents similarity using Python | NLP](https://dev.to/coderasha/compare-documents-similarity-using-python-nlp-4odp)



# Libraries
---

The following libraries are used in this program.

*   **Natural Language Toolkit (NLTK):** provides text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning.
*   **Gensim:** provides packages for processing text and working with word vector models.
*   **NumPy:** provides packages for numeric processing.
*   **Requests:** provides a simple interface for making HTTP requests.


In [0]:
# import the required libraries
import gensim
import nltk
import numpy
import requests


In [0]:
from nltk.tokenize import word_tokenize # Word Tokenizer
from nltk.tokenize import sent_tokenize # Sentence Tokenizer

nltk.download('punkt')

# Data Sources
---

Data sources were generated using https://talktotransformer.com/

In [0]:
target_url = 'https://raw.githubusercontent.com/ziababar/demos/master/derivative_security/data/doc1.txt'
response = requests.get(target_url)
source_doc = response.text
print(source_doc)

In [0]:
target_url = 'https://raw.githubusercontent.com/ziababar/demos/master/derivative_security/data/doc2.txt'
response = requests.get(target_url)
duplicate_doc = response.text
print(duplicate_doc)

In [0]:
target_url = 'https://raw.githubusercontent.com/ziababar/demos/master/derivative_security/data/doc3.txt'
response = requests.get(target_url)
partial_doc = response.text
print(partial_doc)

In [0]:
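# len() on a string returns its length in characters, not a document count
print("Characters in source document: ", len(source_doc))
print("Characters in duplicate document: ", len(duplicate_doc))
print("Characters in partial document: ", len(partial_doc))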


# Document Parsing
---
We need to parse the document and extract all of its words. This is done through a two-step process.
1. Split the document text into sentences using the sent_tokenize() function.
2. For each sentence, get all the words in that sentence using the word_tokenize() function.
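
The shape of the result of these two steps can be sketched without NLTK. The regular expressions below are a rough stand-in chosen for illustration only (the helper names `split_sentences` and `split_words` are hypothetical); `sent_tokenize()` and `word_tokenize()` handle abbreviations, contractions and punctuation far more carefully.

```python
import re

def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    # Naive rule: a token is a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

sample = "Cats sleep a lot. Dogs bark!"
sentences = split_sentences(sample)
words = [split_words(s) for s in sentences]
print(sentences)
print(words)
```

The output is a list of sentences and, for each sentence, a list of lowercase tokens, which is the same nested structure that the cells below build with NLTK.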

In [0]:
# List of all the sentences in the source document
sent_array = sent_tokenize(source_doc)

print("Number of sentences: ", len(sent_array))
print(sent_array)


In [0]:
word_array = [[w.lower() for w in word_tokenize(text)] 
            for text in sent_array]
print(word_array)

Gensim requires the words (aka tokens) to be converted to unique ids before it can process them.

To do this, create a Dictionary object that maps each word to a unique integer id. Pass the tokenized sentences (a list of word lists) to gensim.corpora.Dictionary().
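
As a minimal sketch of the idea, the mapping can be built by hand in plain Python. The function `build_token_ids` and the toy sentences are assumptions made for illustration; Gensim's Dictionary may assign different id numbers to the same words and also tracks extra statistics.

```python
def build_token_ids(tokenized_sentences):
    # Give each token the next free integer id the first time it is seen.
    token2id = {}
    for sentence in tokenized_sentences:
        for token in sentence:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

toy_sentences = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
toy_token2id = build_token_ids(toy_sentences)
print(toy_token2id)
```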

In [0]:
dictionary = gensim.corpora.Dictionary(word_array)
print(dictionary.token2id)

### Step 1 - Bag of Words

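Create a corpus. A ‘corpus’ is typically a collection of documents, each represented as a bag of words. In this notebook, each sentence of the source document is treated as one document.

The corpus contains, for each document, the id of every word in it and how often that word occurs.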
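
The following sketch shows what a bag-of-words conversion does, using only the standard library. The mapping `token2id` and the helper `to_bow` are hypothetical names for this illustration; like doc2bow() with its default settings, the helper skips words that are not in the mapping.

```python
from collections import Counter

# Toy word-to-id mapping, assumed for this example.
token2id = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

def to_bow(tokens, token2id):
    # Count each known token by id and return (id, count) pairs sorted by id.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bow = to_bow(["the", "cat", "sat", "on", "the", "mat", "quietly"], token2id)
print(bow)
```

Here "the" occurs twice, so its id is paired with a count of 2, and "quietly" is dropped because it has no id.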


In [0]:
# Convert each tokenized sentence to a bag-of-words vector with Dictionary.doc2bow()
# Here bow stands for bag-of-words
corpus_source = [dictionary.doc2bow(words) for words in word_array]

Each sentence is represented as a list of (word id, count) pairs. For example, the pair (12, 2) means that the word with id 12 appears two times in that sentence.

In [0]:
print(corpus_source)

### Step 2 - TF-IDF

Term Frequency – Inverse Document Frequency (TF-IDF) is also a bag-of-words model, but unlike the regular corpus, TF-IDF down-weights tokens (words) that appear frequently across documents.

TF-IDF is calculated by multiplying a local component (TF) with a global component (IDF) and optionally normalizing the result to unit length.

Term frequency is how often the word shows up in the document and inverse document frequency scales the value by how rare the word is in the corpus. In simple terms, words that occur more frequently across the documents get smaller weights.
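
The calculation can be sketched by hand on a toy corpus. This is an illustration under stated assumptions, not Gensim's implementation: it uses the raw count as the local component, log2(N / document frequency) as the global component, and unit-length normalization, which to the best of our knowledge matches TfidfModel's defaults. The function name `tfidf` and the toy corpus are hypothetical.

```python
import math

def tfidf(bow_corpus):
    n_docs = len(bow_corpus)
    # Document frequency: in how many documents each token id occurs.
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    weighted = []
    for doc in bow_corpus:
        # Local component (count) times global component (log2 of N / df).
        vec = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in doc]
        # Tokens found in every document get weight 0 and are dropped.
        vec = [(tid, w) for tid, w in vec if w > 0.0]
        # Normalize the remaining weights to unit length.
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

# Three toy documents as (token id, count) pairs; token 0 occurs in all of them.
toy_corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 2)], [(0, 1), (1, 1), (3, 1)]]
toy_tfidf = tfidf(toy_corpus)
for doc in toy_tfidf:
    print([(tid, round(w, 2)) for tid, w in doc])
```

Token 0 disappears from every vector because it occurs in all three documents, and in the third document the rarer token 3 ends up with a larger weight than token 1.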

In [0]:
tfidf_source = gensim.models.TfidfModel(corpus_source)


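For example, a word that occurs in several sentences is weighted down according to how many sentences contain it, while a word that occurs in every sentence gets a weight of zero under Gensim's default settings and is removed altogether. Note that each sentence of the source document is treated as a separate document here.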

In [0]:
# Print each sentence as [word, TF-IDF weight] pairs, rounded to two decimals
for doc in tfidf_source[corpus_source]:
    print([[dictionary[token_id], round(float(weight), 2)] for token_id, weight in doc])

### Step 3 - Parse other documents too
Perform the same processing for the other two documents. Reuse the dictionary built from the source document so that the word ids match the ones used by the TF-IDF model and the similarity index; words that are not in the source dictionary are ignored by doc2bow().

In [0]:
# Tokenize the duplicate document into sentences, then into lowercase words
sent_array = sent_tokenize(duplicate_doc)
word_array = [[w.lower() for w in word_tokenize(text)]
            for text in sent_array]

# Reuse the source dictionary so that word ids stay consistent
corpus_duplicate = [dictionary.doc2bow(words) for words in word_array]


In [0]:
# Tokenize the partial document into sentences, then into lowercase words
sent_array = sent_tokenize(partial_doc)
word_array = [[w.lower() for w in word_tokenize(text)]
            for text in sent_array]

# Reuse the source dictionary so that word ids stay consistent
corpus_partial = [dictionary.doc2bow(words) for words in word_array]

In [0]:
print(corpus_source)
print(corpus_duplicate)
print(corpus_partial)

# Determining Document Similarity
---
Now we create a similarity index based on cosine similarity. Cosine similarity is a standard measure in vector space modeling of how similar two vectors are.

Gensim's main class for this is Similarity, which builds an index for a given set of documents. This notebook uses MatrixSimilarity, which keeps the whole index in memory.
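
For intuition, cosine similarity between two sparse vectors can be computed by hand. This is a minimal sketch, not how Gensim computes it internally; the helper `cosine` is a hypothetical name, and the vectors use the same (id, weight) format as the corpora above.

```python
import math

def cosine(vec_a, vec_b):
    # Vectors are lists of (id, weight) pairs; missing ids count as zero.
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty vector is treated as similar to nothing.
    return dot / (norm_a * norm_b)

same_direction = cosine([(0, 1.0), (1, 2.0)], [(0, 2.0), (1, 4.0)])
no_overlap = cosine([(0, 1.0)], [(1, 1.0)])
print(round(same_direction, 2), round(no_overlap, 2))
```

Vectors that point in the same direction score 1 regardless of their length, and vectors with no words in common score 0.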

In [0]:
# Build the index
sims = gensim.similarities.MatrixSimilarity(tfidf_source[corpus_source])

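To determine the similarity between two documents, we perform two steps. First, we build a query by applying the source TF-IDF model to the corpus of the document being compared. Second, we pass the query to the index to get the similarity scores: one row per sentence of the query document, with one value per sentence of the source document.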

In [0]:
# obtain a similarity query against the source corpus
query_duplicate = tfidf_source[corpus_duplicate]
query_partial = tfidf_source[corpus_partial]

In [0]:
# Print the similarity scores of each document's sentences against the source sentences
print(numpy.around(sims[query_duplicate], decimals=2))


In [0]:
print(numpy.around(sims[query_partial], decimals=2))
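
The printed matrices hold sentence-level scores, and the notebook does not define a single document-level score. One possible way to condense a matrix into one number, shown here purely as an assumption, is to take the best match for each query sentence and average those values. The helper `summarize_similarity` and the toy matrix are hypothetical.

```python
def summarize_similarity(matrix):
    # For each query sentence (row), keep its best score against any
    # source sentence, then average those best scores.
    best_per_row = [max(row) for row in matrix]
    return sum(best_per_row) / len(best_per_row)

# Toy matrix: 2 query sentences compared against 3 source sentences.
toy_matrix = [[1.0, 0.0, 0.1],
              [0.2, 0.5, 0.0]]
score = summarize_similarity(toy_matrix)
print(score)
```

A score near 1 would suggest that most sentences of the query document have a close counterpart in the source, while a score near 0 would suggest little overlap. Other aggregations, such as the minimum or a thresholded count, are equally plausible.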