<a href="https://colab.research.google.com/github/ziababar/demos/blob/master/derivative_security/document_similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background

The objective of this notebook is to measure how similar two documents are. This can be used to identify documents that were derived from a source document.







# Libraries

The following libraries are used in this program.

**Scikit-learn:** provides the TfidfVectorizer used later in this notebook to compute TF-IDF vectors.

**Natural language toolkit:** NLTK contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning.

**Gensim:** provides packages for processing text and working with word vector models.

In [0]:
# import the required libraries
import gensim
import nltk
import sklearn

import requests


In [0]:
from nltk.tokenize import word_tokenize # Word Tokenizer
from nltk.tokenize import sent_tokenize # Sentence Tokenizer

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
#print(dir(gensim))

# Data Sources

In [0]:
target_url = 'https://raw.githubusercontent.com/ziababar/demos/master/derivative_security/data/doc1.txt'
response = requests.get(target_url)
doc1 = response.text
print(doc1)


The Pentagon must stop pretending the state of cyberwarfare is something new, and that the damage it wreaks is different. If we have not absorbed the lessons of the past, then they will be tested on us again.

Anonymous "Anonymous" is the hacker group famous for its attacks on the Church of Scientology, international terrorist groups, and other governments and corporations. As the name implies, the collective can appear anywhere at any time, and offers its services as a collective attack on Internet services.

In lieu of completely replacing or replacing the process of wargaming the point I want to bring up is that wargames will continue to exist and wargames can be designed and tested. In the case of Dark Age of Camelot I had some limitations, one of which was the lack of people online to play with. I was a solo design, purely with a notebook. And as you know most games I make end up being played by others. And this means that when my friends started playing I was there, and now I can

In [0]:
target_url = 'https://raw.githubusercontent.com/ziababar/demos/master/derivative_security/data/doc2.txt'
response = requests.get(target_url)
doc2 = response.text

In [0]:
target_url = 'https://raw.githubusercontent.com/ziababar/demos/master/derivative_security/data/doc3.txt'
response = requests.get(target_url)
doc3 = response.text

In [0]:
raw_documents = ["Someone I know recently combined Maple Syrup & buttered Popcorn thinking it would taste like caramel popcorn. It didn’t and they don’t recommend anyone else do it either.",
                 "Sometimes it is better to just walk away from things and go back to them later when you’re in a better frame of mind.",
                 "Italy is my favorite country; in fact, I plan to spend two weeks there next year.",
                 "He turned in the research paper on Friday; otherwise, he would have not passed the class.",
                 "Keep up with evolving privacy and security regulations across all industries with monitoring and detailed reporting from encryption status to failed authentication and everything in between. "]


In [0]:
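# len() of a list counts its items; len() of a string counts its characters
print("Number of documents in raw_documents:", len(raw_documents))
print("Number of characters in doc1:", len(doc1))
print("Number of characters in doc2:", len(doc2))
print("Number of characters in doc3:", len(doc3))


Number of documents in raw_documents: 5
Number of characters in doc1: 1098
Number of characters in doc2: 1098
Number of characters in doc3: 514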


# Document Parsing

**Open document 1**

**Open document 2**

**Open document 3**


need to review them too

We use the method word_tokenize() to split a sentence into words.

To compute the average number of words per sentence, we need both sentence tokenization and word tokenization: the ratio is the total word count divided by the sentence count.

The program splits the text of doc1 into sentences with sent_tokenize() and appends each sentence to a list, ready for word tokenization.
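The words-per-sentence ratio described above can be sketched as follows. To keep the example runnable without the NLTK data download, it swaps in naive regular-expression splitting for sent_tokenize() and word_tokenize(), so results on real text may differ from NLTK's. The function name average_words_per_sentence and the sample text are made up for illustration.

```python
import re

def average_words_per_sentence(text):
    """Return the mean number of words per sentence in text."""
    # naive stand-in for sent_tokenize(): split after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    if not sentences:
        return 0.0
    # naive stand-in for word_tokenize(): runs of letters, digits and apostrophes
    word_counts = [len(re.findall(r"[A-Za-z0-9']+", s)) for s in sentences]
    return sum(word_counts) / len(sentences)

sample = "Italy is my favorite country. I plan to spend two weeks there."
print(average_words_per_sentence(sample))
```

In the notebook itself, the same ratio could be computed from the NLTK tokenizers instead of the regular expressions used here.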

In [0]:
file_docs = []

tokens = sent_tokenize(doc1)
for line in tokens:
    file_docs.append(line)

print("Number of documents:",len(file_docs))

Number of documents: 10


Once the tokenized sentences are in the list, the next step is to tokenize the words in each sentence.

In [0]:
gen_docs = [[w.lower() for w in word_tokenize(text)] for text in file_docs]

print(gen_docs)

In order to work on text documents, Gensim requires the words (aka tokens) to be converted to unique ids. Gensim provides a Dictionary object that maps each word to a unique id. Here we pass our sentences, each converted to a list of words, to the corpora.Dictionary() object.

In [0]:
dictionary = gensim.corpora.Dictionary(gen_docs)
print(dictionary.token2id)

A dictionary maps every word to a number. Gensim lets you read the text and update the dictionary, one line at a time, without loading the entire text file into system memory.
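The idea of mapping every word to a number can be illustrated with a minimal sketch in plain Python. This is not Gensim's implementation: the helper build_token2id is hypothetical, and Gensim may assign ids in a different order than the first-appearance order assumed here.

```python
def build_token2id(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                # the next free id is simply the current size of the mapping
                token2id[token] = len(token2id)
    return token2id

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
token2id = build_token2id(docs)
print(token2id)
```

Because the mapping is updated one document at a time, it never needs the whole text in memory at once, which is the property the paragraph above attributes to Gensim's Dictionary.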

### Method 1 - Bag of Words

The next important object to become familiar with in Gensim is the Corpus (a Bag of Words). It is an object that contains each word id and its frequency in each document, that is, the number of times each word occurs in the sentence.

Note that a ‘token’ typically means a ‘word’, a ‘document’ typically refers to a ‘sentence’ or ‘paragraph’, and a ‘corpus’ is typically a ‘collection of documents as a bag of words’.


Now, create a bag of words corpus by passing each tokenized list of words to Dictionary.doc2bow().

In [0]:
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]

In [0]:
print(corpus)

Each entry in the printed corpus is a (word id, frequency) pair. For example, "the" is used two times in the second sentence, so if you look at the word with id=12 (the), you will see that its frequency is 2.
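How a token list turns into (word id, frequency) pairs can be sketched with collections.Counter. The function doc_to_bow and the toy token2id mapping are hypothetical stand-ins for Dictionary.doc2bow() and the notebook's dictionary; like doc2bow() with its default settings, the sketch skips words that are not in the mapping.

```python
from collections import Counter

def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in token2id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# toy mapping, not the ids produced by the notebook's dictionary
token2id = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
bow = doc_to_bow(["the", "cat", "sat", "on", "the", "mat"], token2id)
print(bow)
```

Here the word "the" occurs twice, so its id is paired with a count of 2, while every other word has a count of 1.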

### Method 2 - TF-IDF

Term Frequency - Inverse Document Frequency (TF-IDF) is also a bag-of-words model, but unlike the regular corpus, TF-IDF down-weights tokens (words) that appear frequently across documents.

TF-IDF is calculated by multiplying a local component (TF) with a global component (IDF) and optionally normalizing the result to unit length. Term frequency is how often the word shows up in the document, and inverse document frequency scales the value by how rare the word is in the corpus. In simple terms, words that occur in more of the documents get smaller weights.
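The calculation can be made concrete with a small sketch in plain Python. It assumes the weighting count * log2(N / df) followed by normalization to unit length, which is intended to mirror Gensim's default settings but should be treated as an assumption; the function tfidf_weights and the toy corpus are made up for illustration.

```python
import math

def tfidf_weights(bow_corpus):
    """Weight each (token_id, count) pair by count * log2(N / df), then L2-normalize."""
    n_docs = len(bow_corpus)
    # document frequency: number of documents containing each token id
    df = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            df[token_id] = df.get(token_id, 0) + 1
    weighted_corpus = []
    for doc in bow_corpus:
        weights = [(tid, count * math.log2(n_docs / df[tid])) for tid, count in doc]
        # drop zero weights (tokens that occur in every document)
        weights = [(tid, w) for tid, w in weights if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        weighted_corpus.append([(tid, w / norm) for tid, w in weights] if norm else [])
    return weighted_corpus

# token 0 occurs in all three documents, token 1 in two, tokens 2 and 3 in one each
toy_corpus = [[(0, 1), (1, 1)], [(0, 1), (1, 1), (2, 1)], [(0, 2), (3, 1)]]
result = tfidf_weights(toy_corpus)
for doc in result:
    print([(tid, round(w, 2)) for tid, w in doc])
```

In this toy corpus, token 0 appears in every document and so disappears from the output, while in the second document the rarer token 2 receives a larger weight than token 1.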

In [0]:
import numpy as np

# fit the TF-IDF model on the bag-of-words corpus
tf_idf = gensim.models.TfidfModel(corpus)
for doc in tf_idf[corpus]:
    print([[dictionary[token_id], np.around(weight, decimals=2)] for token_id, weight in doc])

A word such as ‘the’ that occurs in several documents is weighted down. Words that appear in every document receive a weight of zero and are removed altogether.

# Determining Document Similarity

Next, we create a similarity object. The main class is Similarity, which builds an index for a given set of documents. The Similarity class splits the index into several smaller, disk-based sub-indexes. Once the similarity object is built, it can be queried to compare documents.

In [0]:
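# building the index
import os
os.makedirs('workdir', exist_ok=True)  # make sure the output directory for the index exists
sims = gensim.similarities.Similarity('workdir/', tf_idf[corpus],
                                      num_features=len(dictionary))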

**Create Query Document**

Once the index is built, we can calculate how similar a query document is to each document in the index. Create a second .txt file containing the query documents or sentences, and tokenize them as before.

In [0]:
file2_docs = []

with open ('demofile2.txt') as f:
    tokens = sent_tokenize(f.read())
    for line in tokens:
        file2_docs.append(line)

print("Number of documents:",len(file2_docs))
for line in file2_docs:
    query_doc = [w.lower() for w in word_tokenize(line)]
    # convert the query to a bag of words using the existing dictionary
    query_doc_bow = dictionary.doc2bow(query_doc)
# note: after the loop, query_doc_bow holds only the last sentence of the file




**Document similarities to query**

At this stage, you will see similarities between the query and all index documents. To obtain similarities of our query document against the indexed documents:

In [0]:
# perform a similarity query against the corpus
query_doc_tf_idf = tf_idf[query_doc_bow]
# print(document_number, document_similarity)
print('Comparing Result:', sims[query_doc_tf_idf]) 

Cosine similarity ranges from -1 to 1 (the greater, the more similar). Because TF-IDF weights are non-negative, the values returned here fall between 0 and 1.
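The cosine measure itself can be written out in a few lines of plain Python. This is only a sketch of the formula, not Gensim's indexed implementation; the function cosine_similarity and the sample vectors (given as {token_id: weight} dictionaries) are hypothetical.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as {token_id: weight}."""
    dot = sum(w * vec_b.get(tid, 0.0) for tid, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

query = {0: 1.0, 2: 2.0}
print(cosine_similarity(query, {0: 2.0, 2: 4.0}))  # same direction as the query
print(cosine_similarity(query, {1: 3.0}))          # no shared tokens
```

Two vectors pointing in the same direction score close to 1 regardless of their lengths, and vectors with no tokens in common score 0.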

**Computing Cosine Similarity using scikit-learn**

The cosine similarity between two document vectors is a commonly used measure of how similar the documents are.

https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

# use the three documents downloaded earlier in this notebook
documents = [doc1, doc2, doc3]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
pairwise_similarity = tfidf * tfidf.T

**DOT PRODUCT** method
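The dot-product approach can be illustrated with a minimal sketch in plain Python: once every row vector is scaled to unit length, the dot product of two rows equals their cosine similarity, so multiplying the matrix by its transpose yields all pairwise similarities. The helpers l2_normalize and pairwise_similarity and the sample vectors are hypothetical and stand in for the sparse matrix product used with scikit-learn.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def pairwise_similarity(rows):
    """Dot product of every pair of L2-normalized rows (a dense analogue of X times X.T)."""
    unit_rows = [l2_normalize(r) for r in rows]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit_rows] for u in unit_rows]

vectors = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
matrix = pairwise_similarity(vectors)
for row in matrix:
    print([round(x, 2) for x in row])
```

The diagonal of the resulting matrix compares each document with itself, so for non-zero vectors it is 1 up to floating-point error; the off-diagonal entries are the similarities between different documents.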