<a href="https://colab.research.google.com/github/vanessaaleung/text-mining-tools/blob/main/doc-similarity/Document_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Similarity

Finds the WordNet path similarity between two documents.

### Functions:
* **`convert_tag`:** converts the tag given by `nltk.pos_tag` to the tag used by `wordnet.synsets`.
* **`document_path_similarity`:** computes the symmetric path similarity between two documents by finding the synsets in each document with `doc_to_synsets`, then averaging the two directional scores computed by `similarity_score`.
* **`doc_to_synsets`:** returns a list of the synsets in a document. It first tokenizes and part-of-speech tags the document using `nltk.word_tokenize` and `nltk.pos_tag`, then looks up each token's synsets using `wn.synsets(token, wordnet_tag)`. The first matching synset is used; a token with no match is skipped.
* **`similarity_score`:** returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, it finds the synset in s2 with the largest similarity value, sums these largest values, and normalizes the sum by dividing it by the number of largest values found. Missing values are ignored.
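To make "path similarity" concrete: NLTK's `path_similarity` scores two synsets as 1 / (shortest path length + 1) in the is-a taxonomy. The sketch below reproduces that formula on a tiny hand-made taxonomy. The `PARENT` table and the helper names `neighbors` and `toy_path_similarity` are hypothetical and exist only for illustration; they are not WordNet data or part of the notebook's functions.

```python
from collections import deque

# Hypothetical toy taxonomy (child -> parent); this is NOT real WordNet data.
PARENT = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}


def neighbors(node):
    """Return the nodes one is-a edge away from node, in either direction."""
    linked = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        linked.append(PARENT[node])
    return linked


def toy_path_similarity(a, b):
    """Return 1 / (shortest path length + 1), or None if no path exists."""
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1 / (dist + 1)
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


print(toy_path_similarity("dog", "dog"))     # 1.0 (path length 0)
print(toy_path_similarity("dog", "canine"))  # 0.5 (path length 1)
print(toy_path_similarity("dog", "cat"))     # 0.2 (path length 4)
```

Identical concepts score 1.0, and the score shrinks as the concepts sit further apart in the taxonomy. A pair with no connecting path has no score, which corresponds to the "missing values" that `similarity_score` ignores.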

In [None]:
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

In [8]:
def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None
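As a quick check of the mapping, the sketch below runs a few Penn Treebank tags (the tag set `nltk.pos_tag` returns by default) through an equivalent variant written with `dict.get`. The name `convert_tag_get` is hypothetical and used only so the example is self-contained; it is not part of the notebook.

```python
def convert_tag_get(tag):
    """Variant of convert_tag that uses dict.get instead of try/except."""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    # tag[:1] is '' for an empty tag, so the lookup falls through to None
    return tag_dict.get(tag[:1])


# Nouns, adjectives, adverbs, and verbs map to WordNet tags; everything else maps to None
for tag in ['NN', 'NNS', 'JJ', 'RB', 'VBD', 'DT', 'IN']:
    print(tag, '->', convert_tag_get(tag))
```

Only the first letter of the tag matters, so `NN` and `NNS` both map to `'n'`. Tags such as `DT` (determiner) and `IN` (preposition) map to `None`, which places no part-of-speech restriction on the later `wn.synsets` lookup.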

In [9]:
def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
    """

    tokens = nltk.word_tokenize(doc)
    tagged = nltk.pos_tag(tokens)

    synsets = []
    for token, tag in tagged:
        # Query WordNet once per token; keep the first match, skip tokens with none
        matches = wn.synsets(token, convert_tag(tag))
        if matches:
            synsets.append(matches[0])

    return synsets

In [10]:
def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sums all of the largest similarity values and normalizes this value by dividing it by the
    number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """
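    largest_scores = []
    for i in s1:
        # Compute each pairwise similarity once; path_similarity may return None
        similarity_scores = [i.path_similarity(j) for j in s2]
        similarity_scores = [score for score in similarity_scores if score is not None]
        if similarity_scores:
            largest_scores.append(max(similarity_scores))

    # Assumption: return NaN instead of raising ZeroDivisionError when no pair is comparable
    if not largest_scores:
        return np.float64(np.nan)

    return np.float64(sum(largest_scores) / len(largest_scores))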

In [11]:
def document_path_similarity(doc1, doc2):
    """Finds the symmetrical similarity between doc1 and doc2"""

    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)

    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2
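The directional score is not symmetric, which is why the two directions are averaged. The following sketch works through the arithmetic with a hand-written similarity table in place of WordNet. The `SIM` values and the names `lookup` and `toy_similarity_score` are hypothetical and chosen only to keep the example self-contained.

```python
# Hypothetical pairwise similarities; a missing pair means "no similarity available".
SIM = {
    ("a1", "b1"): 1.0,
    ("a1", "b2"): 0.25,
    ("a2", "b1"): 0.5,
    ("a2", "b2"): 0.2,
    # "a3" has no entry at all, so it is skipped when scoring
}


def lookup(x, y):
    """Return the similarity of x and y in either order, or None if missing."""
    return SIM.get((x, y), SIM.get((y, x)))


def toy_similarity_score(s1, s2):
    """Average, over items in s1, of the best available match in s2."""
    best = []
    for x in s1:
        scores = [lookup(x, y) for y in s2]
        scores = [s for s in scores if s is not None]
        if scores:
            best.append(max(scores))
    return sum(best) / len(best)


s1 = ["a1", "a2", "a3"]
s2 = ["b1", "b2"]

forward = toy_similarity_score(s1, s2)   # (1.0 + 0.5) / 2 = 0.75
backward = toy_similarity_score(s2, s1)  # (1.0 + 0.25) / 2 = 0.625
symmetric = (forward + backward) / 2     # 0.6875
print(forward, backward, symmetric)
```

Here `a3` contributes nothing to either direction because it has no defined similarity, and the two directional scores differ because each side picks its own best match.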

`Quality` is an indicator variable that shows whether the two documents `D1` and `D2` are paraphrases of one another (1 for paraphrase, 0 for not paraphrase).

In [3]:
paraphrases = pd.read_csv('https://raw.githubusercontent.com/vanessaaleung/text-mining-tools/main/doc-similarity/paraphrases.csv')
paraphrases.head()

| Unnamed: 0 | Quality | D1 | D2 |
|---|---|---|---|
| 0 | 1 | Ms Stewart, the chief executive, was not expec... | Ms Stewart, 61, its chief executive officer an... |
| 1 | 1 | After more than two years' detention under the... | After more than two years in detention by the ... |
| 2 | 1 | "It still remains to be seen whether the reven... | "It remains to be seen whether the revenue rec... |
| 3 | 0 | And it's going to be a wild ride," said Allan ... | Now the rest is just mechanical," said Allan H... |
| 4 | 1 | The cards are issued by Mexico's consulates to... | The card is issued by Mexico's consulates to i... |


### Model Accuracy

A pair is labeled a paraphrase (1) if its similarity score is greater than 0.75; otherwise it is labeled not a paraphrase (0).

In [6]:
def accuracy():
    from sklearn.metrics import accuracy_score

    paraphrases['similarity_score'] = paraphrases.apply(lambda row: document_path_similarity(row.D1, row.D2), axis=1)
    paraphrases['label'] = np.where(paraphrases['similarity_score'] > 0.75, 1, 0)
    
    return accuracy_score(paraphrases['Quality'], paraphrases['label'])

In [7]:
accuracy()

0.8
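The thresholding and accuracy calculation can also be sketched in plain Python, without pandas or scikit-learn. The scores and gold labels below are hypothetical values made up for illustration; they are not taken from `paraphrases.csv`.

```python
# Hypothetical similarity scores and gold labels, for illustration only
scores = [0.91, 0.80, 0.62, 0.77, 0.40]
gold = [1, 1, 0, 0, 1]
THRESHOLD = 0.75

# Label a pair as a paraphrase only when its score is strictly above the threshold
predicted = [1 if s > THRESHOLD else 0 for s in scores]

# Accuracy is the fraction of predictions that match the gold labels
correct = sum(p == g for p, g in zip(predicted, gold))
accuracy_value = correct / len(gold)

print(predicted)       # [1, 1, 0, 1, 0]
print(accuracy_value)  # 0.6
```

Because the comparison is strict, a score of exactly 0.75 would be labeled 0, matching the `> 0.75` test used with `np.where`.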