# Document Similarity & Topic Modelling

This script is based on an assignment given as part of a Coursera course on using Python to analyse unstructured language.

## Part 1 - Document Similarity

For the first part, we create the functions `doc_to_synsets` and `similarity_score` which will be used by `document_path_similarity` to find the path similarity between two documents.

The following functions are provided:
* **`convert_tag`:** converts the tag given by `nltk.pos_tag` to a tag used by `wordnet.synsets`. Used in `doc_to_synsets`.
* **`document_path_similarity`:** computes the symmetrical path similarity between two documents by finding the synsets in each document using `doc_to_synsets`, then computing similarities using `similarity_score`.

Created functions:
* **`doc_to_synsets`:** returns a list of the synsets in a document. The function first tokenizes and part-of-speech tags the document using `nltk.word_tokenize` and `nltk.pos_tag`. It then looks up each token's synsets using `wn.synsets(token, wordnet_tag)` and keeps the first match. Tokens with no match are skipped.
* **`similarity_score`:** returns the normalized similarity score of one list of synsets (`s1`) onto a second list (`s2`). For each synset in `s1`, it finds the synset in `s2` with the largest path similarity. It then sums these largest values and divides the sum by the number of largest values found. (N.B. the scores should be floats, and missing values should be ignored.)
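
The normalization described for `similarity_score` can be sketched without NLTK. In this minimal sketch, the `toy_similarity` lookup table and the `directional_score` helper are hypothetical stand-ins: the table plays the role of `Synset.path_similarity`, which can return `None` when two synsets have no connecting path.

```python
# Hypothetical pairwise similarities standing in for Synset.path_similarity.
# A None value (or an absent pair) models "no path between the two synsets".
toy_similarity = {
    ("cat", "dog"): 0.2,
    ("cat", "like"): None,
    ("like", "like"): 1.0,
    ("like", "dog"): None,
}


def directional_score(items1, items2, sim):
    """Average, over items1, of each item's best similarity to any item in items2."""
    best_scores = []
    for a in items1:
        # Keep only the pairs that have a usable (non-missing) similarity
        scores = [sim.get((a, b)) for b in items2]
        scores = [s for s in scores if s is not None]
        if scores:
            best_scores.append(max(scores))
    # Divide by the number of best scores found, not by len(items1)
    return sum(best_scores) / len(best_scores) if best_scores else 0.0


score = directional_score(["cat", "like", "xyz"], ["like", "dog"], toy_similarity)
print(score)
```

Here `"xyz"` has no usable similarity to anything and is ignored, so the score is the mean of the two best values found (0.2 and 1.0), not a sum divided by three.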

*Do not modify the functions `convert_tag` and `document_path_similarity`.*

In [25]:
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

# Uncomment the next 3 lines if these NLTK resources are not already downloaded:
# nltk.download('wordnet')
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')


def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None


def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
    """
    
    tokens = nltk.word_tokenize(doc)
    tagged = nltk.pos_tag(tokens)

    output = []
    for token, tag in tagged:
        # Look up the synsets once per token and keep only the first match
        synsets = wn.synsets(token, convert_tag(tag))
        if synsets:
            output.append(synsets[0])

    return output


def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sum of all of the largest similarity values and normalize this value by dividing it by the
    number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """
    
    
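    max_scores = []
    for syn in s1:
        # path_similarity can return None when no path exists; skip those values
        sim_scores = [score for score in (syn.path_similarity(syn_cmp) for syn_cmp in s2) if score]

        if sim_scores:
            max_scores.append(max(sim_scores))

    # Guard against division by zero when nothing in s1 matches anything in s2
    # (returning 0.0 here is a choice made in this write-up, not part of the assignment)
    if not max_scores:
        return 0.0

    return sum(max_scores) / len(max_scores)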


def document_path_similarity(doc1, doc2):
    """Finds the symmetrical similarity between doc1 and doc2"""

    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)

    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2

<br>
_____

`paraphrases` is a DataFrame which contains the following columns: `Quality`, `D1`, and `D2`.

`Quality` is an indicator variable which indicates if the two documents `D1` and `D2` are paraphrases of one another (1 for paraphrase, 0 for not paraphrase).

In [26]:
# Use this dataframe for investigation into most_similar_docs and label_accuracy
paraphrases = pd.read_csv('paraphrases.csv')
paraphrases.head()

| Unnamed: 0 | Quality | D1 | D2 |
|---|---|---|---|
| 0 | 1 | Ms Stewart, the chief executive, was not expec... | Ms Stewart, 61, its chief executive officer an... |
| 1 | 1 | After more than two years' detention under the... | After more than two years in detention by the ... |
| 2 | 1 | "It still remains to be seen whether the reven... | "It remains to be seen whether the revenue rec... |
| 3 | 0 | And it's going to be a wild ride," said Allan ... | Now the rest is just mechanical," said Allan H... |
| 4 | 1 | The cards are issued by Mexico's consulates to... | The card is issued by Mexico's consulates to i... |


___

### most_similar_docs

`most_similar_docs` uses `document_path_similarity` to find the pair of documents in `paraphrases` with the maximum similarity score.

*This function returns a tuple `(D1, D2, similarity_score)`. In this implementation the score is formatted as a percentage string.*
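
The search itself is a plain arg-max over rows. A minimal sketch, assuming a hypothetical `token_overlap` scorer in place of `document_path_similarity` and a hand-written list of pairs in place of the `D1` and `D2` columns:

```python
def token_overlap(doc1, doc2):
    """Hypothetical stand-in scorer: Jaccard overlap of lowercase word sets."""
    words1, words2 = set(doc1.lower().split()), set(doc2.lower().split())
    union = words1 | words2
    return len(words1 & words2) / len(union) if union else 0.0


# Hand-written pairs standing in for the D1 and D2 columns
pairs = [
    ("the cat sat on the mat", "a dog barked at the moon"),
    ("the cat sat on the mat", "the cat sat on a mat"),
    ("stocks fell sharply today", "the weather was mild"),
]

# max() with a key function returns the pair with the highest score
best_d1, best_d2 = max(pairs, key=lambda pair: token_overlap(*pair))
best_score = token_overlap(best_d1, best_d2)
print(best_d1, "|", best_d2, "|", round(best_score, 3))
```

Swapping the scorer for `document_path_similarity` and the list for the DataFrame rows gives the same shape of computation.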

In [29]:
def most_similar_docs():

    # initialise the running maximum and the tuple to return
    max_score = 0
    output = ('', '', 0)

    for _, row in paraphrases.iterrows():

        # find the similarity of these two phrases
        sim_score = document_path_similarity(row.D1, row.D2)

        if sim_score > max_score:

            # update the returned tuple if the current similarity is greater than the previous max
            max_score = sim_score
            output = (row.D1, row.D2, 'Similarity score: {0:.1f}%'.format(100 * sim_score))

    return output

most_similar_docs()

('"Indeed, Iran should be put on notice that efforts to try to remake Iraq in their image will be aggressively put down," he said.',
 '"Iran should be on notice that attempts to remake Iraq in Iran\'s image will be aggressively put down," he said.\n',
 'Similarity score: 97.5%')

### label_accuracy

`label_accuracy` labels the twenty pairs of documents by computing the similarity of each pair with `document_path_similarity`. The classifier rule is: if the score is greater than 0.75, the pair is labelled a paraphrase (1); otherwise it is labelled not a paraphrase (0). The function then reports the accuracy of the classifier using scikit-learn's `accuracy_score`.

*This function returns a float.*

In [31]:
def label_accuracy():
    from sklearn.metrics import accuracy_score

    # create dataframe to populate with scores and labels
    lb_df = pd.DataFrame(index=paraphrases.index, columns=['Score', 'Label'])

    for idx, row in paraphrases.iterrows():

        # get the similarity score of this row
        sim_score = document_path_similarity(row.D1, row.D2)

        # write with .at rather than chained indexing (lb_df.Score.iloc[i] = ...),
        # which may assign to a temporary copy instead of lb_df
        lb_df.at[idx, 'Score'] = sim_score

        # set paraphrase=1 if score is greater than 0.75, else 0
        lb_df.at[idx, 'Label'] = 1 if sim_score > 0.75 else 0

    lb_df.Label = lb_df.Label.astype('int64')

    print(lb_df)

    # return the accuracy of document_path_similarity (i.e. did it correctly classify paraphrases?)
    return accuracy_score(paraphrases.Quality, lb_df.Label)


print('\nAccuracy score = {0:.1f}%'.format(100*label_accuracy()))

       Score  Label
0   0.671682      0
1   0.900198      1
2   0.856672      1
3   0.786176      1
4   0.929902      1
5    0.84318      1
6   0.802245      1
7   0.670758      0
8   0.633661      0
9   0.761542      1
10  0.560643      0
11  0.679575      0
12   0.61527      0
13  0.975309      1
14  0.737422      0
15   0.84372      1
16  0.501099      0
17  0.558911      0
18  0.604793      0
19  0.804329      1

Accuracy score = 80.0%
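
As a hand check of the rule, the first five rows can be scored without scikit-learn. This sketch copies the first five `Score` values printed above and the `Quality` values from the `paraphrases.head()` output. Accuracy is computed directly as the fraction of matching labels, which is what `accuracy_score` reports by default.

```python
# First five similarity scores from the printed table
scores = [0.671682, 0.900198, 0.856672, 0.786176, 0.929902]
# Quality labels for the same rows, from paraphrases.head()
quality = [1, 1, 1, 0, 1]

THRESHOLD = 0.75
# Label 1 (paraphrase) only when the score is strictly greater than the threshold
predicted = [1 if s > THRESHOLD else 0 for s in scores]

matches = sum(p == q for p, q in zip(predicted, quality))
accuracy = matches / len(quality)

print(predicted)
print(accuracy)
```

On these five rows, rows 0 and 3 are misclassified, so the accuracy is 0.6. The 80.0% figure above is computed over all twenty rows.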


## Part 2 - Topic Modelling

For the second part, we use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in `newsgroup_data`. We use the `gensim.models.ldamodel.LdaModel` constructor to estimate the LDA model parameters on `corpus`, with `id2word=id_map`, `num_topics=10`, `passes=25` and `random_state=34`, and save the fitted model to the variable `ldamodel`.

In [32]:
import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer

# Load the list of documents
with open('newsgroups', 'rb') as f:
    newsgroup_data = pickle.load(f)

# Use CountVectorizer to find tokens of three or more characters, remove stop words,
# remove tokens that don't appear in at least 20 documents,
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')
# Fit and transform
X = vect.fit_transform(newsgroup_data)

# Convert sparse matrix to gensim corpus
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
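
To see what the `token_pattern` and the `id_map` inversion do, here is a small sketch using only the standard library. The sample sentence and the `toy_vocabulary` dict are made up for illustration; `toy_vocabulary` mimics the word-to-column-index shape of `vect.vocabulary_`. `CountVectorizer` lowercases text by default, which the sketch imitates with `.lower()`.

```python
import re

# Same pattern as passed to CountVectorizer: runs of 3 or more word characters
TOKEN_PATTERN = r'(?u)\b\w\w\w+\b'

sample = "It is my understanding that Pluto can shadow Charon"
tokens = re.findall(TOKEN_PATTERN, sample.lower())
print(tokens)  # one- and two-letter words such as "it", "is", "my" are dropped

# Made-up vocabulary in the same word -> column index shape as vect.vocabulary_
toy_vocabulary = {"pluto": 0, "shadow": 1, "understanding": 2}

# Invert it to index -> word, the direction used for LdaModel's id2word
toy_id_map = {index: word for word, index in toy_vocabulary.items()}
print(toy_id_map)
```

This sketch does not apply stop-word removal or the `min_df`/`max_df` filters. Those are handled by `CountVectorizer` itself.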


In [33]:
# Use the gensim.models.ldamodel.LdaModel constructor to estimate 
# LDA model parameters on the corpus, and save to the variable `ldamodel`

ldamodel = gensim.models.ldamodel.LdaModel(corpus,num_topics=10,id2word=id_map,passes=25,random_state=34)

### lda_topics

Using `ldamodel`, we find a list of the 10 topics and the 10 most significant words in each topic. The result is a list of 10 tuples, each of the following form (shown here for topic 9):

`(9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.014*"sci"')`
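
Each topic string can be split back into `(word, weight)` pairs with a regular expression. This is a sketch based only on the format shown in the example above; `parse_topic` is a hypothetical helper, not part of gensim.

```python
import re

# The example tuple from the text, split across lines for readability
topic = (9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + '
            '0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + '
            '0.014*"center" + 0.014*"sci"')


def parse_topic(topic_string):
    """Return [(word, weight), ...] from a 'w*"word" + w*"word"' string."""
    matches = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in matches]


topic_id, topic_string = topic
words = parse_topic(topic_string)
print(topic_id, words[:3])
```

The weights are the rounded values from the printed string, so they are only as precise as the three decimal places shown.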

In [22]:
def lda_topics():
    return ldamodel.print_topics(num_topics=10,num_words=10)

lda_topics()

[(0,
  '0.056*"edu" + 0.043*"com" + 0.033*"thanks" + 0.022*"mail" + 0.021*"know" + 0.020*"does" + 0.014*"info" + 0.012*"monitor" + 0.010*"looking" + 0.010*"don"'),
 (1,
  '0.024*"ground" + 0.018*"current" + 0.018*"just" + 0.013*"want" + 0.013*"use" + 0.011*"using" + 0.011*"used" + 0.010*"power" + 0.010*"speed" + 0.010*"output"'),
 (2,
  '0.061*"drive" + 0.042*"disk" + 0.033*"scsi" + 0.030*"drives" + 0.028*"hard" + 0.028*"controller" + 0.027*"card" + 0.020*"rom" + 0.018*"floppy" + 0.017*"bus"'),
 (3,
  '0.023*"time" + 0.015*"atheism" + 0.014*"list" + 0.013*"left" + 0.012*"alt" + 0.012*"faq" + 0.012*"probably" + 0.011*"know" + 0.011*"send" + 0.010*"months"'),
 (4,
  '0.025*"car" + 0.016*"just" + 0.014*"don" + 0.014*"bike" + 0.012*"good" + 0.011*"new" + 0.011*"think" + 0.010*"year" + 0.010*"cars" + 0.010*"time"'),
 (5,
  '0.030*"game" + 0.027*"team" + 0.023*"year" + 0.017*"games" + 0.016*"play" + 0.012*"season" + 0.012*"players" + 0.012*"win" + 0.011*"hockey" + 0.011*"good"'),
 (6,
  '0.0

### topic_distribution

For the new document `new_doc`, we find the topic distribution. We use `vect.transform` on the new document, and `Sparse2Corpus` to convert the sparse matrix to a gensim corpus.

*This function returns a list of tuples, where each tuple is `(#topic, probability)`, with topic number corresponding to the topic number found in the previous code section.*

In [35]:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]

In [36]:
def topic_distribution():
    # Transform only: vect was already fitted on newsgroup_data
    X = vect.transform(new_doc)

    # Convert sparse matrix to gensim corpus.
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    
    return list(ldamodel[corpus])[0]

topic_distribution()

[(0, 0.020003108),
 (1, 0.020003324),
 (2, 0.020001281),
 (3, 0.49674782),
 (4, 0.020004038),
 (5, 0.020004129),
 (6, 0.020002972),
 (7, 0.020002645),
 (8, 0.020003129),
 (9, 0.3432276)]
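
To read the result, a short sketch that ranks the topics in the output above by probability. The values are copied from the printed list, and the variable names are illustrative.

```python
# Topic distribution copied from the output above
distribution = [
    (0, 0.020003108), (1, 0.020003324), (2, 0.020001281), (3, 0.49674782),
    (4, 0.020004038), (5, 0.020004129), (6, 0.020002972), (7, 0.020002645),
    (8, 0.020003129), (9, 0.3432276),
]

# Sort topics from most to least probable
ranked = sorted(distribution, key=lambda pair: pair[1], reverse=True)
dominant_topic, dominant_prob = ranked[0]
runner_up_topic, runner_up_prob = ranked[1]

# The probabilities should sum to roughly 1
total = sum(prob for _, prob in distribution)
print(dominant_topic, runner_up_topic, round(total, 3))
```

Topics 3 and 9 together account for about 84% of the probability mass. Topic 9 is the space-related topic shown in the `lda_topics` example, which fits a document about Pluto and Charon.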