
# Tagging - Dictionary learning

Our first attempt, creating tags by naive clustering, was not successful.

We need to use a strategy that:

- can cluster an arbitrary number of word vectors
- allows *multiple* tags per sentence fragment

We look into using a decomposition method, which reduces all vectors to a set of common factors from which they can be rebuilt (as linear combinations of these factors) while losing as little information as possible.
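
As a minimal sketch of what "rebuilt as linear combinations" means, the toy example below fixes a small set of hypothetical factors (`atoms`) and recovers the per-vector weights (`codes`) by least squares. All names and numbers here are illustrative assumptions. Dictionary learning proper learns the atoms and sparse codes jointly; this only shows the reconstruction side.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 factor vectors ("atoms") in a 5-dimensional space.
atoms = rng.normal(size=(3, 5))

# Six toy "word vectors", each built as a mix of at most two atoms.
true_codes = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.5],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
])
vectors = true_codes @ atoms

# Given the atoms, solve vectors ≈ codes @ atoms for the codes (least squares).
codes, *_ = np.linalg.lstsq(atoms.T, vectors.T, rcond=None)
codes = codes.T

# Rebuild the vectors from the factors and measure what was lost.
reconstruction = codes @ atoms
error = np.linalg.norm(vectors - reconstruction)
print(f"Reconstruction error: {error:.2e}")
```

Each row of `codes` says how much of each factor goes into one vector, which is what later lets a single fragment belong to several factors at once.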

In particular, we can use dictionary learning, which is designed for a similar use case. Each extracted common factor is a vector that helps 'make up' our words, so we can round it to the nearest word vector in the vocabulary. That word can then be considered a key component in reconstituting the original vectors (and, as such, a key contextual component).
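
The "rounding" step can be sketched as a cosine nearest-neighbour lookup. The vocabulary, vectors, and the helper `nearest_word` below are hypothetical toy values; the notebook itself uses spaCy's vector table for this step.

```python
import numpy as np

def nearest_word(factor, vocab_words, vocab_matrix):
    """Return the word whose vector has the highest cosine similarity to `factor`."""
    factor = factor / np.linalg.norm(factor)
    normed = vocab_matrix / np.linalg.norm(vocab_matrix, axis=1, keepdims=True)
    scores = normed @ factor
    return vocab_words[int(np.argmax(scores))], scores

# Toy 3-dimensional vocabulary (illustrative values only).
vocab_words = ["sword", "dragon", "night"]
vocab_matrix = np.array([
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.0, 1.0],
])

# A learned factor rarely matches a word exactly, so we snap it to the closest one.
factor = np.array([0.9, 0.3, 0.05])
word, scores = nearest_word(factor, vocab_words, vocab_matrix)
print(word)
```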


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy import cluster
from sklearn.decomposition import DictionaryLearning

import spacy # if this gives an error, downgrade to python 3.12.3
nlp = spacy.load("en_core_web_lg")

from tqdm import tqdm
tqdm.pandas()

In [None]:
# Extract word vectors from a list of noun chunks
# Returns a dictionary mapping words to their index in the matrix, and the matrix itself
def extract_vectors(noun_chunks):
    word_vectors = {}
    # set as upper bounds (assuming no duplicates)
    word_vector_matrix = np.zeros(shape=(len(noun_chunks), 300))

    #TODO: rewrite to be pythonic
    i = 0
    for word in noun_chunks:
        if word.text in word_vectors:
            continue
        word_vector_matrix[i,:] = word.vector
        word_vectors[word.text] = i
        i = i + 1

    word_vector_matrix = word_vector_matrix[:i,:]
    
    return word_vectors, word_vector_matrix


# Load sample words
sample_words = pd.read_json("res/example_word_list.json")[0].to_list()
sample_words = [nlp(x).noun_chunks for x in sample_words]
sample_words = [x for xs in sample_words for x in xs]

# Extract word vectors from words
wv_mapping, wv_matrix = extract_vectors(sample_words)
reverse_wv_mapping = {v: k for k, v in wv_mapping.items()}

In [158]:
# Attempt dictionary learning
num_components = 20
# num_components = None

model = DictionaryLearning(n_components=num_components, alpha=10, random_state=42, transform_algorithm='threshold')
model = model.fit(wv_matrix)

key_words_rounded = []
key_words_rounded_vectors = np.zeros(shape=(num_components, 300))
for i in range(num_components):
    # Look up the 5 vocabulary entries closest to this dictionary atom
    atom = model.components_[i]
    nearest_keys = nlp.vocab.vectors.most_similar(atom.reshape(1,-1), n=5)[0][0]
    key_words = [nlp.vocab.strings[x] for x in nearest_keys]

    # Key word is rounded to the nearest word in the vocabulary
    key_words_rounded.append(key_words[0])
    # Store the vector of the nearest word (not its hash key)
    key_words_rounded_vectors[i,:] = nlp.vocab.vectors[int(nearest_keys[0])]

    # Print out the key words for the topics, as well as close alternatives
    print(f"Topic {i}: {key_words[0]} (alternates: {key_words[1:]})")





Topic 0: Red (alternates: ['Boac', 'Blue', '-Blue', 'BlazBlue'])
Topic 1: Halberd (alternates: ['halberd', 'Halbe', 'Halberstram', 'Halberstam'])
Topic 2: night (alternates: ['nightime', "G'night", '/night', 'night-'])
Topic 3: Storm (alternates: ['giant', 'PhpStorm', 'WildStorm', 'Giant'])
Topic 4: bolt (alternates: ['bolts', 'Skybolt', 'Trebolt', 'unbolt'])
Topic 5: Fire (alternates: ['-Fire', 'FireRed', 'FireFly', 'Firey'])
Topic 6: Gold (alternates: ['Silver', 'Atragon', 'SoulSilver', 'Moondragon'])
Topic 7: EnOcean (alternates: ['Ocean', 'Oceanus', 'Oceans', 'Piscean'])
Topic 8: sword (alternates: ['swords', 'swordtail', 'Elsword', 'dagger'])
Topic 9: axe (alternates: ['ax', 'scythe', 'Giant', 'sledgehammer'])
Topic 10: arch (alternates: ['arches', 'arching', 'arched', 'archways'])
Topic 11: Cloud (alternates: ['giant', 'pCloud', 'cloud', 'ownCloud'])
Topic 12: Sword (alternates: ['Swords', 'Swordplay', 'Swordsman', 'Longsword'])
Topic 13: Moon (alternates: ['McMoon', 'giant', 'Mo

In [159]:
match_threshold = 100

# With our model, assign each word to a linear combination of topics
word_topics = model.transform(wv_matrix)
y = np.argmax(word_topics, axis=1)

threshold = match_threshold/num_components
all_words = []

# Iterate through each topic and get the words that are most associated with it, above a certain threshold
# Rows of word_topics line up with wv_matrix, so words are looked up via reverse_wv_mapping
for k in range(num_components):
    words = []
    for i in range(word_topics.shape[0]):
        if word_topics[i,k] > threshold:
            words.append(reverse_wv_mapping[i])
            all_words.append(reverse_wv_mapping[i])
    
    print(f"Topic '{key_words_rounded[k]}': {words}")

remaining_words = [x for x in sample_words if x.text not in all_words]
print(f"\nFinished. Remaining words were: {remaining_words}")

Topic 'Red': ['Red dragon', 'Blue dragon', 'Gold dragon', 'Silver dragon', 'Shadow dragon', 'Dark dragon', 'Black ogre', 'Red ogre', 'Blue ogre', 'Moon giant', 'Sword']
Topic 'Halberd': ['Sword']
Topic 'night': ['night']
Topic 'Storm': ['Red dragon', 'Blue dragon', 'Shadow dragon', 'Dark dragon', 'Red ogre', 'Storm giant', 'Cloud giant', 'Moon giant', 'Sword', 'Giant axe']
Topic 'bolt': ['lightning', 'Lightning bolt']
Topic 'Fire': ['Red dragon', 'Red ogre', 'Water yai', 'Fire yai', 'Sword']
Topic 'Gold': ['Red dragon', 'Blue dragon', 'Gold dragon', 'Silver dragon', 'Magma dragon', 'Shadow dragon', 'Dark dragon', 'Red ogre', 'Blue ogre', 'Moon giant', 'Flaming sword', 'Sword', 'Giant axe']
Topic 'EnOcean': ['Blue dragon', 'Dark dragon', 'Moon giant', 'Ocean devil']
Topic 'sword': ['Red dragon', 'Blue dragon', 'Gold dragon', 'Silver dragon', 'Magma dragon', 'Shadow dragon', 'Dark dragon', 'Flaming sword', 'Sword', 'night', 'Rusting sword', 'Molten sword', 'lightning', 'Giant axe', 'Flam

In [184]:
threshold = 0.45
all_words = []  # reset so the remaining-words report reflects this cell only

# Parse each key word and sample word so spaCy can compare them by vector similarity
nlp_key_words = [nlp(x) for x in key_words_rounded]
nlp_sample_words = [nlp(x.text) for x in sample_words]

# Iterate through each topic and get the words that are most associated with it, above a certain threshold
for k in range(num_components):
    words = []
    for i in range(len(sample_words)):
        nlp_word = nlp_sample_words[i]
        nlp_key_word = nlp_key_words[k]
        similarity = nlp_word.similarity(nlp_key_word)
        if similarity > threshold:
            words.append(sample_words[i].text)
            all_words.append(sample_words[i].text)
    
    print(f"Topic '{key_words_rounded[k]}': {words}")

remaining_words = [x for x in sample_words if x.text not in all_words]
print(f"\nFinished. Remaining words were: {remaining_words}")

Topic 'Red': ['Red dragon', 'Blue dragon', 'Black ogre', 'Red ogre', 'Blue ogre']
Topic 'Halberd': ['Sword', 'Molten sword', 'Halberd']
Topic 'night': ['night']
Topic 'Storm': ['Storm giant', 'Fireball']
Topic 'bolt': ['Lightning bolt']
Topic 'Fire': ['Water yai', 'Fire yai', 'Fireball']
Topic 'Gold': ['Gold dragon', 'Silver dragon']
Topic 'EnOcean': ['Ocean devil']
Topic 'sword': ['Silver dragon', 'Magma dragon', 'Shadow dragon', 'Dark dragon', 'Flaming sword', 'Sword', 'Rusting sword', 'Molten sword', 'Halberd', 'Giant axe', 'Infernal devil']
Topic 'axe': ['Flaming sword', 'Rusting sword', 'Molten sword', 'Giant axe']
Topic 'arch': ['Flaming arch']
Topic 'Cloud': ['Cloud giant']
Topic 'Sword': ['Blue dragon', 'Gold dragon', 'Silver dragon', 'Magma dragon', 'Shadow dragon', 'Dark dragon', 'Flaming sword', 'Sword', 'Rusting sword', 'Molten sword', 'Halberd', 'Giant axe']
Topic 'Moon': ['Blue dragon', 'Shadow dragon', 'Dark dragon', 'Blue ogre', 'Moon giant']
Topic 'Dark': ['Red dragon'



This is a vastly improved result (now we can assign a snippet to multiple groups).
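
The multi-group assignment can be sketched independently of spaCy as thresholding a score matrix. The scores below are made-up illustrative values and `assign_tags` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def assign_tags(scores, tags, threshold):
    """For each row, list every tag whose score exceeds `threshold`."""
    mask = scores > threshold
    return [[tags[k] for k in np.flatnonzero(row)] for row in mask]

tags = ["Red", "sword", "Storm"]
fragments = ["Red dragon", "Flaming sword", "night"]

# Rows are fragments, columns are tags (illustrative similarity scores).
scores = np.array([
    [0.70, 0.50, 0.10],
    [0.20, 0.90, 0.30],
    [0.10, 0.20, 0.30],
])

assigned = assign_tags(scores, tags, threshold=0.45)
untagged = [f for f, t in zip(fragments, assigned) if not t]

for fragment, fragment_tags in zip(fragments, assigned):
    print(f"{fragment}: {fragment_tags}")
```

A fragment can land in several groups or in none, which matches the "remaining words" report printed by the cells above.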

Issues:

- Unrelated words are included in some iterations.

- Antonyms are included: 'blue ogre' might be included under 'Black'. This is because of the color commonality.

- Unexpected commonalities are found. For example, for the pair *'Ocean devil'* and *'Infernal devil'* we get `EnOcean` instead of `devil`. We also get unrelated words such as *teavee* (from Mike Teavee of Charlie and the Chocolate Factory) or *Parectopa*.
    - This is likely because decomposition methods optimize for rebuilding the whole dataset. While *teavee* seems useless as a tag, its vector probably helps reconstitute minor parts of many other words, rather than acting as a useful 'subcluster' of an existing cluster.
    - Not only does this result in odd commonalities, but we can also be sure it is *missing* a lot of useful commonalities.
- We find some unexpected sortings:
    - For example: *lightning bolt* is not in `Storm`, but *fireball* is.

Continuation:

- What if we used our fantastical context for similarity? For example, `sword` and `dragon` are rather different in our context, but because they are both used in fantasy, their vectors are similar. We can try this in several ways:
    - We can try *removing* the word vector for 'fantasy' as a part of preprocessing for all words.
    - We can restructure 'similarity' to be 'similarity as a factor of its similarity to fantasy'. So `similarity(sword and dragon) / similarity(fantasy)` (or perhaps `log(exp(x) + exp(y))` of same).
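
One possible reading of the first idea is to project the 'fantasy' direction out of every vector before comparing. The sketch below uses hypothetical 3-dimensional toy vectors; with spaCy, the direction would presumably come from something like the vector for the word "fantasy", which is an assumption and not something tested here.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def remove_direction(vectors, direction):
    """Subtract from each row its projection onto `direction`."""
    unit = direction / np.linalg.norm(direction)
    return vectors - np.outer(vectors @ unit, unit)

# Toy vectors: both words share a large 'fantasy' component (first axis).
fantasy = np.array([1.0, 0.0, 0.0])
sword = np.array([0.8, 0.6, 0.0])
dragon = np.array([0.8, 0.0, 0.6])

before = cosine(sword, dragon)
sword_p, dragon_p = remove_direction(np.vstack([sword, dragon]), fantasy)
after = cosine(sword_p, dragon_p)

print(f"similarity before: {before:.2f}, after: {after:.2f}")
```

In this toy case the similarity comes entirely from the shared component, so it drops to zero once that component is removed. Real word vectors would only be partly affected.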