# Background
---

The objective of this notebook is to measure how similar two documents are. This can be used to identify documents that were derived from a source document.







# Libraries
---

The following libraries are used in this program.

*   **Natural Language Toolkit (NLTK):** provides text-processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning.
*   **Gensim:** provides packages for processing text and working with word vector models.
*   **NumPy:** provides packages for numeric processing.
*   **Requests:** provides packages for making HTTP requests.
*   **OpenAI:** provides packages for making calls to the OpenAI API.

Some of these libraries are not installed by default, so install them manually.

In [None]:
!pip install gensim
!pip install nltk
# openai.Completion (used below) is only available in openai releases before 1.0
!pip install "openai<1.0"

In [None]:
# import the required libraries
import gensim
import nltk
import numpy
import requests
import openai

In [None]:
from nltk.tokenize import word_tokenize # Word Tokenizer
from nltk.tokenize import sent_tokenize # Sentence Tokenizer

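nltk.download('punkt')      # tokenizer models used by sent_tokenize / word_tokenize
nltk.download('punkt_tab')  # assumption: newer NLTK releases look for this resource instead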

# Data Sources
---

We'll generate the text documents from within the Jupyter notebook using OpenAI's text generation API.

This requires access to the OpenAI API. First, log in to the OpenAI portal and create an API key for accessing the provided services.


In [None]:
# Set the API key that the openai module uses for all subsequent requests.
openai.api_key = "api_key"
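Hard-coding a key in a notebook makes it easy to leak. A minimal sketch of one alternative, assuming the key is stored in an environment variable (the variable name `OPENAI_API_KEY` and the helper `load_api_key` are assumptions for illustration, not part of the original notebook):

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

# Usage in the notebook (requires the variable to be set beforehand):
# openai.api_key = load_api_key()
```

Failing early with a clear message avoids a confusing authentication error later, when the first API call is made.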


To generate a text document, first define a prompt that will be used to 'seed' the response.

In [None]:
prompt = "What are the benefits of Artificial Intelligence"

Using this prompt, call the OpenAI Completion.create function to generate the text, and save the output to a file.

In this example, we use the Davinci engine, one of OpenAI's GPT-3 base models. The max_tokens parameter sets an upper limit on the number of tokens in the generated text. Note that openai.Completion.create belongs to versions of the openai package earlier than 1.0.

In [None]:
response = openai.Completion.create(
    engine="davinci", prompt=prompt, max_tokens=1000
)

original_text = response.choices[0].text.strip()

Save the output to a file by running the following code:

In [None]:
import os
os.makedirs("data/text", exist_ok=True)  # make sure the output folder exists
with open("data/text/original_text.txt", "w") as f:
    f.write(original_text)

To generate variations of the text document, we use a process called "text rewriting", in which the model is asked to rewrite the original text.

In this example, we use the Davinci engine again and embed the original text in a rewrite prompt. The generated text is saved in the variation_text1 variable.

In [None]:
prompt = f"Rewrite this sentence: {original_text}"

response = openai.Completion.create(
    engine="davinci", prompt=prompt, max_tokens=900
)

variation_text1 = response.choices[0].text.strip()

with open("data/text/variation_text1.txt", "w") as f:
    f.write(variation_text1)

Repeat the step above once more to obtain a second variation, giving three documents in total: the original text and two variations.

In [None]:
prompt = f"Rewrite this sentence: {original_text}"

response = openai.Completion.create(
    engine="davinci", prompt=prompt, max_tokens=900
)

variation_text2 = response.choices[0].text.strip()

with open("data/text/variation_text2.txt", "w") as f:
    f.write(variation_text2)

To confirm that OpenAI did generate text, print the length of each generated string.

In [None]:
print("Size of original text: ", len(original_text))
print("Size of variation1 text: ", len(variation_text1))
print("Size of variation2 text: ", len(variation_text2))

# Document Parsing
---
We need to parse each document and extract all of its words. This is a two-step process.
1. Split the document text into sentences with the sent_tokenize() function.
2. Split each sentence into words with the word_tokenize() function.
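
The two steps can be illustrated without NLTK. The sketch below is a rough regex-based stand-in: the helpers `split_sentences` and `split_words` are hypothetical and much cruder than sent_tokenize() and word_tokenize() (for instance, they discard punctuation, which word_tokenize() keeps as separate tokens). It is meant only to show the shape of the data that the following cells build.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_words(sentence):
    """Naive word splitter: lowercase runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

sample = "AI saves time. It also reduces errors!"
sentences = split_sentences(sample)            # list of sentence strings
words = [split_words(s) for s in sentences]    # one list of tokens per sentence
print(sentences)
print(words)
```

The result is a list of token lists, one per sentence, which is the structure that word_array holds in the notebook.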

In [None]:
# List of all the sentences in the original text
sent_array = sent_tokenize(original_text)

print("Number of sentences: ", len(sent_array))
print(sent_array)


In [None]:
word_array = [[w.lower() for w in word_tokenize(text)] 
            for text in sent_array]
print(word_array)

Gensim requires the words (also called tokens) to be converted to unique ids before it can process them.

Create a Dictionary object that maps each word to a unique integer id by passing the tokenized sentences (a list of word lists) to corpora.Dictionary().

In [None]:
dictionary = gensim.corpora.Dictionary(word_array)
print(dictionary.token2id)
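
Conceptually, the dictionary is a mapping from each distinct token to an integer. A pure-Python sketch of that idea follows; the `build_token2id` helper is hypothetical, and the ids it assigns need not match the ones Gensim chooses.

```python
def build_token2id(tokenized_docs):
    """Assign a new integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

docs = [["ai", "saves", "time"], ["ai", "reduces", "errors"]]
print(build_token2id(docs))
```

The repeated token "ai" receives a single id, so the mapping has five entries for the six tokens in the toy input.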

### Step 1 - Bag of Words

Create a corpus. A ‘corpus’ is typically a collection of documents, each represented as a bag of words. Here, each sentence of the text is treated as one document.

The corpus stores, for each document, the id of every word it contains together with that word's frequency.


In [None]:
# Create a corpus by passing each tokenized sentence to Dictionary.doc2bow()
# Here bow stands for bag-of-words
corpus_source = [dictionary.doc2bow(tokens) for tokens in word_array]

In [None]:
print(corpus_source)
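
To make the bag-of-words format concrete, here is a pure-Python sketch of what a doc2bow-style conversion produces: a sorted list of `(token_id, count)` pairs per document. The `to_bow` helper and the toy `token2id` mapping are assumptions for illustration. Tokens missing from the mapping are skipped, which mirrors how doc2bow() treats unknown words by default.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {"ai": 0, "saves": 1, "time": 2}
# "ai" occurs twice; "money" is not in the mapping and is ignored.
print(to_bow(["ai", "saves", "time", "ai", "money"], token2id))
```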

### Step 2 - TF-IDF

Term Frequency – Inverse Document Frequency (TF-IDF) is also a bag-of-words model, but unlike the regular corpus, TF-IDF down-weights tokens (words) that appear frequently across documents.

TF-IDF is calculated by multiplying a local component (TF) with a global component (IDF) and optionally normalizing the result to unit length.

Term frequency is how often the word shows up in the document and inverse document frequency scales the value by how rare the word is in the corpus. In simple terms, words that occur more frequently across the documents get smaller weights.

In [None]:
tfidf_source = gensim.models.TfidfModel(corpus_source)

For example, a word such as ‘the’ that occurs in many of the documents is weighted down. A word that appears in every document gets an inverse document frequency of zero and is removed altogether.

In [None]:
for doc in tfidf_source[corpus_source]:
    print([[dictionary[token_id], numpy.around(weight, decimals=2)] for token_id, weight in doc])
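
The weighting described in this step can be worked through by hand. The sketch below assumes raw term counts for TF, a base-2 logarithm of N/df for IDF, and unit-length normalization. This is believed to match TfidfModel's defaults, but treat that as an assumption. The `tfidf_weights` helper is hypothetical.

```python
import math

def tfidf_weights(bow_docs):
    """Weight each (token_id, count) pair by count * log2(N / df), then L2-normalize."""
    n_docs = len(bow_docs)
    doc_freq = {}
    for bow in bow_docs:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_docs = []
    for bow in bow_docs:
        weighted = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in bow]
        weighted = [(tid, w) for tid, w in weighted if w > 0]  # drop zero-weight terms
        norm = math.sqrt(sum(w * w for _, w in weighted))
        weighted_docs.append([(tid, w / norm) for tid, w in weighted] if norm else [])
    return weighted_docs

# Token 0 appears in both documents, so its IDF is log2(2/2) = 0 and it is dropped.
bows = [[(0, 1), (1, 1)], [(0, 1), (2, 2)]]
print(tfidf_weights(bows))
```

In the toy input, only the tokens unique to each document survive, which is the "removed altogether" effect mentioned above.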

### Step 3 - Parse other documents too
Perform the same processing for the two variation documents as well. The dictionary built from the original text is reused so that the word ids match those the TF-IDF model was trained on.

In [None]:
sent_array = sent_tokenize(variation_text1)
word_array = [[w.lower() for w in word_tokenize(text)]
            for text in sent_array]

# Reuse the source dictionary; words it does not contain are ignored by doc2bow()
corpus_variation1 = [dictionary.doc2bow(tokens) for tokens in word_array]

In [None]:
sent_array = sent_tokenize(variation_text2)
word_array = [[w.lower() for w in word_tokenize(text)]
            for text in sent_array]

# Reuse the source dictionary; words it does not contain are ignored by doc2bow()
corpus_variation2 = [dictionary.doc2bow(tokens) for tokens in word_array]

In [None]:
print(corpus_source)
print(corpus_variation1)
print(corpus_variation2)

# Determining Document Similarity
---
Now we create a similarity object based on cosine similarity, a standard measure in vector space modeling of how similar two vectors are.

The class used here is MatrixSimilarity, which builds an in-memory index for a given set of documents.

In [None]:
# Build the index
sims = gensim.similarities.MatrixSimilarity(tfidf_source[corpus_source])
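
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch over sparse `(token_id, weight)` pairs is shown below. The `cosine_similarity` helper is hypothetical and is not how Gensim implements its index internally.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (token_id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b[tid] for tid, weight in a.items() if tid in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as having no similarity
    return dot / (norm_a * norm_b)

# Identical vectors score close to 1; vectors with no shared tokens score 0.
print(cosine_similarity([(0, 1.0), (1, 1.0)], [(0, 1.0), (1, 1.0)]))
print(cosine_similarity([(0, 1.0)], [(1, 1.0)]))
```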

To determine the similarity between two documents, we perform two steps. First, we transform the document to be compared into a query using the TF-IDF model. This query is then run against the index to get the similarity scores.

In [None]:
# Transform each variation into a TF-IDF query to run against the source index
query_variation1 = tfidf_source[corpus_variation1]
query_variation2 = tfidf_source[corpus_variation2]

Print the similarity scores for each variation. Each row corresponds to a sentence of the variation and each column to a sentence of the original text.

In [None]:
print(numpy.around(sims[query_variation1], decimals=2))

In [None]:
print(numpy.around(sims[query_variation2], decimals=2))

Based on the similarity scores, it can be determined which variation is closest to the original text.
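
Because each query returns a matrix of sentence-level scores rather than a single number, some aggregation is needed to rank the variations. One possible approach is sketched here with plain lists: take the best match for each variation sentence and average those values. The `document_score` helper and the max-then-average rule are assumptions, not something the notebook prescribes.

```python
def document_score(similarity_rows):
    """Average, over query sentences, of the best match against any source sentence."""
    best_matches = [max(row) for row in similarity_rows if len(row) > 0]
    if not best_matches:
        return 0.0
    return sum(best_matches) / len(best_matches)

# Toy matrices: one row per variation sentence, one column per source sentence.
scores_variation1 = [[0.9, 0.1], [0.2, 0.8]]
scores_variation2 = [[0.3, 0.1], [0.2, 0.4]]
print(document_score(scores_variation1))
print(document_score(scores_variation2))
```

In the notebook, the rows would come from sims[query_variation1] and sims[query_variation2]; the variation with the higher score would be considered closer to the original under this rule.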