# Assignment 4 - Document Similarity & Topic Modelling

## Part 1 - Document Similarity

For the first part of this assignment, you will complete the functions `doc_to_synsets` and `similarity_score` which will be used by `document_path_similarity` to find the path similarity between two documents.

The following functions are provided:
* **`convert_tag:`** converts the tag given by `nltk.pos_tag` to a tag used by `wordnet.synsets`. You will need to use this function in `doc_to_synsets`.
* **`document_path_similarity:`** computes the symmetrical path similarity between two documents by finding the synsets in each document using `doc_to_synsets`, then computing similarities using `similarity_score`.
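
As a quick illustration of the mapping that `convert_tag` performs, the sketch below repeats the provided function and applies it to a few Penn Treebank tags. The sample tags are chosen for illustration only and are not taken from the assignment data; any tag whose first letter is not N, J, R, or V maps to `None`.

```python
def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

# Sample Penn Treebank tags (illustrative only)
sample_tags = ['NN', 'NNS', 'JJ', 'RB', 'VBD', 'PRP$', 'DT']

# Only the first letter of each tag decides the WordNet tag
converted = {tag: convert_tag(tag) for tag in sample_tags}
print(converted)
```

Because only the first letter is inspected, `NN` and `NNS` both become `'n'`, while pronouns and determiners fall through to `None`.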

You will need to finish writing the following functions:
* **`doc_to_synsets:`** returns a list of synsets in the document. This function should first tokenize and part-of-speech tag the document using `nltk.word_tokenize` and `nltk.pos_tag`. Then it should find each token's corresponding synset using `wn.synsets(token, wordnet_tag)`. The first synset match should be used. If there is no match, that token is skipped.
* **`similarity_score:`** returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, find the synset in s2 with the largest similarity value. Sum all of the largest similarity values together and normalize this value by dividing it by the number of largest similarity values found. Be careful with data types, which should be floats. Missing values should be ignored.
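
The normalization described for `similarity_score` can be sketched without WordNet. In the minimal example below, plain strings stand in for synsets, and `toy_similarity` and `normalized_score` are hypothetical names invented for this illustration; the lookup table is made-up data. It follows the wording above literally: missing (`None`) values are ignored, and the sum is divided by the number of maxima actually found.

```python
def toy_similarity(a, b):
    """Hypothetical stand-in for Synset.path_similarity; None means 'no value'."""
    table = {
        ('like', 'like'): 1.0,
        ('cat', 'dog'): 0.2,
    }
    return table.get((a, b))

def normalized_score(s1, s2, sim=toy_similarity):
    """Average, over items of s1, of the best defined similarity to any item of s2."""
    best = []
    for a in s1:
        # Keep only defined similarity values for this item
        values = [v for v in (sim(a, b) for b in s2) if v is not None]
        if values:  # items with no defined value at all are skipped
            best.append(max(values))
    return float(sum(best)) / len(best) if best else 0.0

# 'like' -> 1.0, 'cat' -> 0.2, 'xyz' has no defined value and is skipped
score = normalized_score(['like', 'cat', 'xyz'], ['like', 'dog'])
print(score)
```

Here two maxima are found (1.0 and 0.2), so the score is their sum divided by 2, not by the three items in the first list.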

Once `doc_to_synsets` and `similarity_score` have been completed, submit to the autograder, which will run a test to check that these functions work correctly.

*Do not modify the functions `convert_tag` and `document_path_similarity`.*

In [1]:
%%capture
import numpy as np
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import wordnet as wn
import pandas as pd
nltk.data.path.append("assets/")

def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

In [2]:
def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
    """

    # YOUR CODE HERE
    
    # Tokenize and POS tag the document
    tokens = nltk.word_tokenize(doc)
    pos_tags = nltk.pos_tag(tokens)
    
    # Empty list for storing valid synsets
    synsets = []
    
    # For each token, get the corresponding WordNet synset
    for word, pos in pos_tags:
        wordnet_tag = convert_tag(pos)  # Convert POS tag to WordNet format
        
        # Print for debugging
        #print(f"Word: {word}, POS: {pos}, WordNet POS: {wordnet_tag}")
        
        # If there's a valid WordNet tag, look for synsets
        if wordnet_tag:  
            synset = wn.synsets(word, pos = wordnet_tag)  # Get synsets for the word
            if synset:  # If there's a matching synset, take the first one
                synsets.append(synset[0])
                
        else:
            # Try without any POS tag (None) if convert_tag returned None
            synset = wn.synsets(word)  # Get synsets without a specific POS tag
            if synset:  # If there's a matching synset, take the first one
                synsets.append(synset[0])

    return synsets
    
def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sum of all of the largest similarity values and normalize this value by dividing it by the
    number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.7333333333333333
    """
    max_sim = []
    
    # Loop through each synset in s1
    for synsets_1 in s1:
        sim = []  # List for storing similarities, including None as 0
        
        # Print synset for debugging purposes
        #print(f"Comparing synset: {synsets_1.name()}")
        
        # Compute the path similarity for each synset in s2
        for synsets_2 in s2:
            similarity = synsets_1.path_similarity(synsets_2)
            
            if similarity is None:  # If similarity is None, treat as 0.0
                similarity = 0.0
            
            sim.append(similarity)
            
            # Print similarity for debugging
            #print(f"  Similarity with {synsets_2.name()}: {similarity}")
        
        # Record the best match for this synset; skip it if s2 is empty so max() never sees an empty list
        if sim:
            max_sim.append(max(sim))
            #print(f"  Max similarity for {synsets_1.name()}: {max(sim)}")
    
    # Return the mean of the max similarities (None counted as 0.0), or 0.0 if nothing was compared
    return np.mean(max_sim) if max_sim else 0.0

In [3]:
# Example Usage for doc_to_synsets:
doc_to_synsets('Ms Stewart, 61, its chief executive officer and chairwoman, did not attend.\n') #same as the output given in discussion forum course mentor comments

[Synset('multiple_sclerosis.n.01'),
 Synset('stewart.n.01'),
 Synset('sixty-one.s.01'),
 Synset('information_technology.n.01'),
 Synset('chief.s.01'),
 Synset('executive.a.01'),
 Synset('military_officer.n.01'),
 Synset('president.n.04'),
 Synset('make.v.01'),
 Synset('not.r.01'),
 Synset('attend.v.01')]

In [4]:
# Example Usage for similarity:
synsets1 = doc_to_synsets('Ms Stewart, the chief executive, was not expected to attend.')
synsets2 = doc_to_synsets('Ms Stewart, 61, its chief executive officer and chairwoman, did not attend.\n')

print("Synsets for doc1:", [synset.name() for synset in synsets1])
print("Synsets for doc2:", [synset.name() for synset in synsets2])

similarity = similarity_score(synsets1, synsets2)
print(f"Similarity score: {similarity}")

Synsets for doc1: ['multiple_sclerosis.n.01', 'stewart.n.01', 'chief.s.01', 'executive.n.01', 'be.v.01', 'not.r.01', 'expect.v.01', 'attend.v.01']
Synsets for doc2: ['multiple_sclerosis.n.01', 'stewart.n.01', 'sixty-one.s.01', 'information_technology.n.01', 'chief.s.01', 'executive.a.01', 'military_officer.n.01', 'president.n.04', 'make.v.01', 'not.r.01', 'attend.v.01']
Similarity score: 0.7125


In [5]:
def document_path_similarity(doc1, doc2):
    """Finds the symmetrical similarity between doc1 and doc2"""

    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)

    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2

In [6]:
document_path_similarity('I like cat','I like dog') # Matches, up to floating-point rounding, the 0.7333... value in the similarity_score docstring example

0.7333333333333334

`paraphrases` is a DataFrame which contains the following columns: `Quality`, `D1`, and `D2`.

`Quality` is an indicator variable which indicates if the two documents `D1` and `D2` are paraphrases of one another (1 for paraphrase, 0 for not paraphrase).

In [7]:
# Use this dataframe for questions most_similar_docs and label_accuracy
paraphrases = pd.read_csv('assets/paraphrases.csv')
paraphrases.head()

| Unnamed: 0 | Quality | D1 | D2 |
|---|---|---|---|
| 0 | 1 | Ms Stewart, the chief executive, was not expec... | Ms Stewart, 61, its chief executive officer an... |
| 1 | 1 | After more than two years' detention under the... | After more than two years in detention by the ... |
| 2 | 1 | "It still remains to be seen whether the reven... | "It remains to be seen whether the revenue rec... |
| 3 | 0 | And it's going to be a wild ride," said Allan ... | Now the rest is just mechanical," said Allan H... |
| 4 | 1 | The cards are issued by Mexico's consulates to... | The card is issued by Mexico's consulates to i... |


___

### most_similar_docs

Using `document_path_similarity`, find the pair of documents in paraphrases which has the maximum similarity score.

*This function should return a tuple `(D1, D2, similarity_score)`*

In [8]:
def most_similar_docs():
    
    # YOUR CODE HERE
    
    # Calculate similarity scores for each pair of documents
    # (named `scores` so the list does not shadow the similarity_score function)
    scores = [document_path_similarity(x, y) for x, y in zip(paraphrases['D1'], paraphrases['D2'])]
    
    # Find the index of the maximum similarity score
    max_index = np.argmax(scores)
    
    # Retrieve the corresponding pair of documents and the maximum similarity score
    D1 = paraphrases.loc[max_index, 'D1']
    D2 = paraphrases.loc[max_index, 'D2']
    max_score = scores[max_index]
    
    return (D1, D2, max_score)

most_similar_docs()

('"Indeed, Iran should be put on notice that efforts to try to remake Iraq in their image will be aggressively put down," he said.',
 '"Iran should be on notice that attempts to remake Iraq in Iran\'s image will be aggressively put down," he said.\n',
 0.9590643274853801)

### label_accuracy

Provide labels for the twenty pairs of documents by computing the similarity for each pair using `document_path_similarity`. Let the classifier rule be that if the score is greater than 0.75, the label is paraphrase (1); otherwise the label is not paraphrase (0). Report the accuracy of the classifier using scikit-learn's `accuracy_score`.

*This function should return a float.*
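
The classifier rule and the accuracy computation can be illustrated on a few made-up values. The scores and gold labels below are hypothetical and are not drawn from `paraphrases.csv`; accuracy is computed by hand here as the fraction of matching labels, which is the quantity `accuracy_score` reports for this kind of input.

```python
# Hypothetical similarity scores and gold labels (not from paraphrases.csv)
scores = [0.91, 0.62, 0.80, 0.75, 0.40]
gold = [1, 0, 0, 1, 0]

# Rule from the text: strictly greater than 0.75 -> paraphrase (1), otherwise 0
predicted = [1 if s > 0.75 else 0 for s in scores]

# Accuracy = fraction of pairs whose predicted label equals the gold label
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
print(predicted, accuracy)
```

Note that a score of exactly 0.75 is labelled 0, because the rule uses a strict inequality.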

In [9]:
def label_accuracy():
    from sklearn.metrics import accuracy_score
    
    # YOUR CODE HERE

    # Calculate the similarity scores for each pair of documents
    paraphrases['similarity_score'] = [document_path_similarity(x, y) for x, y in zip(paraphrases['D1'], paraphrases['D2'])]
    
    # Apply classifier rule: If similarity score > 0.75, label as 1 (paraphrase), else 0 (not paraphrase)
    paraphrases['predicted_label'] = np.where(paraphrases['similarity_score'] > 0.75, 1, 0)
    
    # Compute and return accuracy using sklearn's accuracy_score
    return accuracy_score(paraphrases['Quality'], paraphrases['predicted_label'])

label_accuracy()

0.7

## Part 2 - Topic Modelling

For the second part of this assignment, you will use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in `newsgroup_data`. You will first need to finish the code in the cell below by using the `gensim.models.ldamodel.LdaModel` constructor to estimate LDA model parameters on the corpus, and save the result to the variable `ldamodel`. Extract 10 topics using `corpus` and `id_map`, with `passes=25` and `random_state=34`.

--> A pickled file refers to a file that contains data serialized using Python's pickle module. Serialization, or "pickling," is the process of converting a Python object (e.g., a list, dictionary, or custom object) into a byte stream, so that it can be saved to a file, transmitted over a network, or stored for later use. Pickling allows you to save complex data structures (like machine learning models, data frames, lists, etc.) to a file so you can load them later without needing to recompute or reload the data.

--> The 'rb' in the context of opening a file in Python stands for "read binary" mode.
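
A small round trip shows the pickling idea and the binary file modes together. This is only a sketch: the list `documents` and the file name `docs.pkl` are hypothetical stand-ins, and a temporary directory is used so nothing is left on disk.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a list of documents
documents = ["first document", "second document"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "docs.pkl")

    # 'wb' = write binary: serialize the object to a byte stream on disk
    with open(path, "wb") as f:
        pickle.dump(documents, f)

    # 'rb' = read binary: load the byte stream back into a Python object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == documents)
```

The restored object is an equal copy of the original list, which is why the newsgroup documents can be loaded directly without re-parsing any raw text.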

--> `stop_words='english'`: this `CountVectorizer` parameter removes common English stop words from the text. Stop words are common words such as "the", "is", "in", "at", "and", "for", etc., that typically don’t carry meaningful information for text analysis or modeling.

--> `token_pattern`: this regular expression defines the pattern for what constitutes a "token" (word) in the text. The pattern `(?u)\b\w\w\w+\b` (written with doubled backslashes in the Python string literal) specifies that a token must consist of three or more word characters. The leading `(?u)` flag tells Python to use Unicode character classes for matching.
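
The effect of the token pattern can be checked directly with Python's `re` module. The sample sentence below is made up for illustration, and the sketch only mimics the matching step; `CountVectorizer` additionally lowercases text by default before tokenizing.

```python
import re

# Same pattern as in the vectorizer, written as a raw string
token_pattern = r"(?u)\b\w\w\w+\b"

# Made-up sample text: short words such as "It", "is", "an", "of", "no" are dropped
text = "It is an orbit of Pluto, no?"
tokens = re.findall(token_pattern, text)
print(tokens)
```

In Python 3, Unicode matching is already the default for string patterns, so the `(?u)` flag is harmless but not strictly required.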

--> vect.vocabulary_ is a dictionary mapping words to indices. id_map reverses this mapping, creating a dictionary that maps indices to words. This reverse mapping is useful when you want to interpret the indices from a document-term matrix and convert them back to the original words.
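
The reversal can be seen on a tiny example. The vocabulary and the document row below are hypothetical; they only mirror the shape of `vect.vocabulary_` and of a bag-of-words document made of `(word id, count)` pairs.

```python
# Hypothetical vocabulary in the same shape as vect.vocabulary_ (word -> index)
vocabulary = {"space": 2, "nasa": 1, "drive": 0}

# Reverse it into index -> word, as done for id_map
id_map = dict((v, k) for k, v in vocabulary.items())

# A made-up bag-of-words document as (word id, count) pairs
row = [(0, 3), (2, 1)]

# Translate the ids back into readable words
decoded = [(id_map[word_id], count) for word_id, count in row]
print(id_map)
print(decoded)
```

The reversal is safe because every word in the vocabulary has a unique index, so no entries collide.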

In [10]:
import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer

# Load the list of documents
with open('assets/newsgroups', 'rb') as f:
    newsgroup_data = pickle.load(f)

# Use CountVectorizer to find tokens of three or more characters, remove stop_words, 
# remove tokens that don't appear in at least 20 documents,
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')
# Fit and transform
X = vect.fit_transform(newsgroup_data)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())


In [12]:
# Use the gensim.models.ldamodel.LdaModel constructor to estimate 
# LDA model parameters on the corpus, and save to the variable `ldamodel`


# YOUR CODE HERE
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, passes=25, random_state=34, id2word=id_map)


### lda_topics

Using `ldamodel`, find a list of the 10 topics and the most significant 10 words in each topic. This should be structured as a list of 10 tuples where each tuple takes on the form:

`(9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.013*"information"')`

for example.

*This function should return a list of tuples.*

In [13]:
def lda_topics():
    
    # YOUR CODE HERE
    return list(ldamodel.print_topics(num_topics=10, num_words=10))
    
lda_topics()

[(0,
  '0.056*"edu" + 0.043*"com" + 0.033*"thanks" + 0.022*"mail" + 0.021*"know" + 0.020*"does" + 0.014*"info" + 0.012*"monitor" + 0.010*"looking" + 0.010*"don"'),
 (1,
  '0.024*"ground" + 0.018*"current" + 0.018*"just" + 0.013*"want" + 0.013*"use" + 0.011*"using" + 0.011*"used" + 0.010*"power" + 0.010*"speed" + 0.010*"output"'),
 (2,
  '0.061*"drive" + 0.042*"disk" + 0.033*"scsi" + 0.030*"drives" + 0.028*"hard" + 0.028*"controller" + 0.027*"card" + 0.020*"rom" + 0.018*"floppy" + 0.017*"bus"'),
 (3,
  '0.023*"time" + 0.015*"atheism" + 0.014*"list" + 0.013*"left" + 0.012*"alt" + 0.012*"faq" + 0.012*"probably" + 0.011*"know" + 0.011*"send" + 0.010*"months"'),
 (4,
  '0.025*"car" + 0.016*"just" + 0.014*"don" + 0.014*"bike" + 0.012*"good" + 0.011*"new" + 0.011*"think" + 0.010*"year" + 0.010*"cars" + 0.010*"time"'),
 (5,
  '0.030*"game" + 0.027*"team" + 0.023*"year" + 0.017*"games" + 0.016*"play" + 0.012*"season" + 0.012*"players" + 0.012*"win" + 0.011*"hockey" + 0.011*"good"'),
 (6,
  '0.0

In [14]:
# Example usage, extract topics and print top words
topics_list = []
for topic in lda_topics():
    topic_index, words_str = topic
    words = words_str.split(' + ')  # Split the words from their weights
    words = [word.split('*')[1].replace('"', '') for word in words]  # Extract just the words
    topics_list.append((topic_index, words))
# topics_list is in this format e.g. [(0, ['edu','com','thanks','mail','know','does','info','monitor','looking','don']),(1,['ground','current','just','want','use','using','used','power','speed','output']),
    
# Display the topics
for topic in topics_list:
    print(f"Topic {topic[0]}: {', '.join(topic[1])}") 

Topic 0: edu, com, thanks, mail, know, does, info, monitor, looking, don
Topic 1: ground, current, just, want, use, using, used, power, speed, output
Topic 2: drive, disk, scsi, drives, hard, controller, card, rom, floppy, bus
Topic 3: time, atheism, list, left, alt, faq, probably, know, send, months
Topic 4: car, just, don, bike, good, new, think, year, cars, time
Topic 5: game, team, year, games, play, season, players, win, hockey, good
Topic 6: information, help, medical, new, use, 000, research, university, number, program
Topic 7: don, people, think, just, say, know, does, good, god, way
Topic 8: use, apple, power, time, data, software, pin, memory, simms, port
Topic 9: space, nasa, science, edu, data, shuttle, launch, available, center, sci


### topic_distribution

For the new document `new_doc`, find the topic distribution. Remember to use `vect.transform` on the new doc, and `Sparse2Corpus` to convert the sparse matrix to a gensim corpus.

*This function should return a list of tuples, where each tuple is `(#topic, probability)`*
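
A list in this `(#topic, probability)` form is easy to inspect with plain Python. The sketch below uses hypothetical values rather than real model output, and shows one way to pick the most probable topic and to rank all topics.

```python
# Hypothetical topic distribution in the (topic id, probability) format
distribution = [(0, 0.05), (1, 0.60), (2, 0.35)]

# Most probable topic: compare tuples by their probability
top_topic, top_prob = max(distribution, key=lambda pair: pair[1])

# All topics ranked from most to least probable
ranked = sorted(distribution, key=lambda pair: pair[1], reverse=True)

print(top_topic, top_prob)
print(ranked)
```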

In [15]:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]

In [16]:
def topic_distribution():
    
    # YOUR CODE HERE
    sparse_doc = vect.transform(new_doc)
    gen_corpus = gensim.matutils.Sparse2Corpus(sparse_doc, documents_columns=False)
    
    # Get the topic distributions for the documents in the corpus
    # (named `doc_topics` so it does not shadow this function's own name)
    doc_topics = ldamodel.get_document_topics(gen_corpus)

    # new_doc holds a single document, so take the first (and only) result
    topic_prob = doc_topics[0]

    return topic_prob
    
topic_distribution()

[(0, 0.020003108),
 (1, 0.020003324),
 (2, 0.020001281),
 (3, 0.49674824),
 (4, 0.020004038),
 (5, 0.020004129),
 (6, 0.020002972),
 (7, 0.020002645),
 (8, 0.020003129),
 (9, 0.34322715)]

In [17]:
# Example of iterating through the topic distribution
for topic_id, prob in topic_distribution():
    print(f"Topic {topic_id}: {prob:.4f}")

Topic 0: 0.0200
Topic 1: 0.0200
Topic 2: 0.0200
Topic 3: 0.4968
Topic 4: 0.0200
Topic 5: 0.0200
Topic 6: 0.0200
Topic 7: 0.0200
Topic 8: 0.0200
Topic 9: 0.3432


### topic_names

From the list of the following given topics, assign topic names to the topics you found. If none of these names best matches the topics you found, create a new 1-3 word "title" for the topic.

Topics: Health, Science, Automobiles, Politics, Government, Travel, Computers & IT, Sports, Business, Society & Lifestyle, Religion, Education.

*This function should return a list of 10 strings.*

In [None]:
def topic_names():
    
    # YOUR CODE HERE
    return ['Education','Science','Computers & IT','Religion','Automobiles','Sports','Science','Religion','Computers & IT','Science']

topic_names()
