# Tag Similarity

The goal of this code:
1. Take the output from our model (a text file containing image file names and their tokenized captions)
2. Take a user phrase
3. Convert the user phrase to tokens (using the same tokenization algorithm as our model)
4. Use a syntactic similarity method to decide whether two sets of tags are similar (we don't need something semantic like WordNet because our input is only tags)
5. Compile a (shortened) file with the image file names and tokenized captions that are sufficiently similar to the user phrase

### Building pseudo model output

Since our model is not yet ready, I'm just going to create a sample set of model output by applying our tokenization algorithm to the original 8k captions. This will yield a set of 'tokenized' captions.

In [17]:
# This part just imports and cleans the captions
import os, sys, string

captions = "captions.txt"
with open(os.path.join(sys.path[0], captions), "r") as f:
    image_data = f.readlines()

# separate image names and captions
# split on the first comma only, so commas inside a caption are not lost
im_names = [line.split(",", 1)[0].strip() for line in image_data]
im_captions = [line.split(",", 1)[1].strip() for line in image_data]

# remove all punctuation from the strings
im_captions = [caption.translate(str.maketrans('', '', string.punctuation)) for caption in im_captions]

In [20]:
# function that creates tokens
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def extract_tokens(captions):
    # Extract specific tokens from each caption
    extracted_tokens = []
    for caption in captions:
        # Tokenize the caption
        tokens = word_tokenize(caption)

        # Tag each token with its part of speech
        tagged_tokens = pos_tag(tokens)

        # Collect verbs as single tokens; nouns and adjectives are grouped
        # into phrases in the loops further down
        verbs = [token[0] for token in tagged_tokens if token[1].startswith("V")]

        # Combine extracted tokens into meaningful phrases
        noun_phrases = []
        current_phrase = []
        for token in tagged_tokens:
            if token[1].startswith("N"):
                current_phrase.append(token[0])
            elif current_phrase:
                # Combine consecutive nouns into noun phrases
                noun_phrases.append(" ".join(current_phrase))
                current_phrase = []
        if current_phrase:
            noun_phrases.append(" ".join(current_phrase))

        # Verbs are kept as single tokens (no phrase grouping)

        adjective_phrases = []
        current_phrase = []
        for token in tagged_tokens:
            if token[1].startswith("J"):
                current_phrase.append(token[0])
            elif current_phrase:
                # Combine consecutive adjectives into adjective phrases
                adjective_phrases.append(" ".join(current_phrase))
                current_phrase = []
        if current_phrase:
            adjective_phrases.append(" ".join(current_phrase))

        # Combine all extracted phrases into a single list of tokens;
        # set() removes duplicates, so the order of the tokens is not preserved
        extracted_tokens.append(list(set(noun_phrases + verbs + adjective_phrases)))
        
    return extracted_tokens

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Rubin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Rubin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [29]:
# this part extracts and saves the tokens
# extract tokens
extracted_tokens = extract_tokens(im_captions)

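# Save extracted tokens to a new file, one caption per line
# *** WE ARE NOT SAVING WITH COMMA DELIMITERS: tokens are space-separated,
# so multi-word phrases cannot be told apart once the file is read back ***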
with open('extracted_tokens.txt', 'w') as f:
    for tokens in extracted_tokens:
        f.write(' '.join(tokens) + '\n')

### User input

We now consume input from the user and tokenize it using the same methodology as for our captions. 

For now, this will ONLY be a phrase. In the future, we will want to be able to consume an image from the user and generate tags.

In [28]:
user_phrase = "A man with a helmet climbing a rock"
user_tokens = extract_tokens([user_phrase])
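
To make the phrase-grouping step inside `extract_tokens` concrete, here is a minimal sketch that runs the same idea on hand-tagged tokens, so it does not need NLTK. The `(word, tag)` pairs and the helper name `group_runs` are made up for illustration; real tags come from `pos_tag` and may differ. The result is sorted only to make the output deterministic, whereas `extract_tokens` returns the tokens in arbitrary order.

```python
from itertools import groupby

# Hand-written (word, tag) pairs for illustration only; real tags come from
# nltk.pos_tag and may differ from these.
tagged = [("A", "DT"), ("man", "NN"), ("with", "IN"), ("a", "DT"),
          ("rock", "NN"), ("climbing", "NN"), ("helmet", "NN"),
          ("climbing", "VBG"), ("a", "DT"), ("steep", "JJ"), ("rock", "NN")]

def group_runs(tagged_tokens, prefix):
    """Join each run of consecutive tokens whose tag starts with `prefix`."""
    phrases = []
    for matches, run in groupby(tagged_tokens, key=lambda t: t[1].startswith(prefix)):
        if matches:
            phrases.append(" ".join(word for word, _ in run))
    return phrases

noun_phrases = group_runs(tagged, "N")       # consecutive nouns become one phrase
verbs = [word for word, tag in tagged if tag.startswith("V")]
adjective_phrases = group_runs(tagged, "J")  # consecutive adjectives become one phrase

# Deduplicate, then sort so the result is reproducible
tokens = sorted(set(noun_phrases + verbs + adjective_phrases))
print(tokens)
```

Consecutive nouns such as "rock climbing helmet" end up as a single multi-word token, which is why the choice of delimiter matters when the tokens are written to a file.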

### Phrase Similarity Models

Now, we want to test different text similarity models:
1. Sorted Fuzzy ratio
2. Common words ratio
3. Cosine similarity with SBERT embeddings 
4. Jaccard distance with SBERT embeddings
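
As a rough illustration of the first two methods, the sketch below uses only the standard library. It is an assumption about what these methods look like, not necessarily the implementation used later in this notebook: `sorted_ratio` stands in for a sorted fuzzy ratio by sorting the words and comparing the strings with `difflib.SequenceMatcher`, and `common_words_ratio` is taken here to mean shared words divided by all distinct words. Both function names are hypothetical. The SBERT-based methods need an embedding model and are not covered here.

```python
from difflib import SequenceMatcher

def sorted_ratio(a, b):
    """Sort the words of each string, then compare the results (0.0 to 1.0).

    Stand-in for a sorted fuzzy ratio; a dedicated fuzzy-matching
    library may score differently.
    """
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, sorted_a, sorted_b).ratio()

def common_words_ratio(a, b):
    """Shared words divided by all distinct words in the two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0  # two empty strings: define the score as 0
    return len(words_a & words_b) / len(words_a | words_b)

# Word order does not matter for either measure
print(sorted_ratio("man rock helmet", "helmet man rock"))
print(common_words_ratio("man rock helmet", "man rock dog"))
```

Both measures work on the raw token strings, so they are cheap to compute over the whole caption set before trying the embedding-based methods.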


In [33]:
# import tokens already extracted from the dataset
with open(os.path.join(sys.path[0], 'extracted_tokens.txt'), "r") as f:
    ext_tokens = f.readlines()

ext_tokens = [line.strip() for line in ext_tokens]
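
Once a scoring function is chosen, step 5 (keeping only the sufficiently similar captions) could look like the sketch below. The sample file names, token lines, the `word_overlap` scorer, and the default threshold of 0.3 are all placeholders for illustration; any function that takes two strings and returns a score between 0 and 1 can be passed as `score`.

```python
def word_overlap(a, b):
    """Placeholder scorer: shared words divided by all distinct words."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def filter_similar(names, token_lines, user_line, score=word_overlap, threshold=0.3):
    """Return (name, tokens, score) for captions scoring at least `threshold`,
    best match first."""
    kept = []
    for name, line in zip(names, token_lines):
        s = score(user_line, line)
        if s >= threshold:
            kept.append((name, line, s))
    return sorted(kept, key=lambda item: item[2], reverse=True)

# Made-up sample data standing in for im_names and ext_tokens
sample_names = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]
sample_tokens = ["man climbing rock", "dog running grass", "man helmet bike riding"]

matches = filter_similar(sample_names, sample_tokens, "man helmet climbing rock")
for name, line, s in matches:
    print(name, line, round(s, 2))
```

The threshold is a tuning knob: a higher value yields a shorter, more precise file, while a lower value keeps more loosely related captions.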

##### SBERT Embedding