# Tag Similarity

This notebook does the following:
1. Takes the output from our model (a text file containing image file names and their tokenized captions)
2. Takes a phrase from the user
3. Converts the user phrase to tokens (using the same tokenization algorithm as our model)
4. Uses a syntactic similarity method to decide whether two sets of tags are similar (we don't need something semantic like WordNet because our input is only tags)
5. Compiles a shortened file of the image file names and tokenized captions that are sufficiently similar to the user phrase

### Building pseudo model output

Since our model is not yet ready, I'm just going to create a sample set of model output by applying our tokenization algorithm to the original 8k captions. This will yield a set of 'tokenized' captions.

In [17]:
# This part just imports and cleans the captions
import os, sys, string

captions = "captions.txt"
with open(os.path.join(sys.path[0], captions), "r") as f:
    image_data = f.readlines()

# separate image names and captions; split on the first comma only,
# so that commas inside a caption do not truncate it
im_names = [line.split(",", 1)[0].strip() for line in image_data]
im_captions = [line.split(",", 1)[1].strip() for line in image_data]

# remove all punctuation from the strings
im_captions = [caption.translate(str.maketrans('', '', string.punctuation)) for caption in im_captions]

In [20]:
# function that creates tokens
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def extract_tokens(captions):
    # Extract specific tokens from each caption
    extracted_tokens = []
    for caption in captions:
        # Tokenize the caption
        tokens = word_tokenize(caption)

        # Tag each token with its part of speech
        tagged_tokens = pos_tag(tokens)

        # Collect the verbs; nouns and adjectives are grouped into phrases below
        verbs = [token[0] for token in tagged_tokens if token[1].startswith("V")]

        # Combine extracted tokens into meaningful phrases
        noun_phrases = []
        current_phrase = []
        for token in tagged_tokens:
            if token[1].startswith("N"):
                current_phrase.append(token[0])
            elif current_phrase:
                # Combine consecutive nouns into noun phrases
                noun_phrases.append(" ".join(current_phrase))
                current_phrase = []
        if current_phrase:
            noun_phrases.append(" ".join(current_phrase))

        # Verbs are kept as single-word tokens (no phrase grouping)

        adjective_phrases = []
        current_phrase = []
        for token in tagged_tokens:
            if token[1].startswith("J"):
                current_phrase.append(token[0])
            elif current_phrase:
                # Combine consecutive adjectives into adjective phrases
                adjective_phrases.append(" ".join(current_phrase))
                current_phrase = []
        if current_phrase:
            adjective_phrases.append(" ".join(current_phrase))

        # Combine all extracted phrases into a single list of tokens
        extracted_tokens.append(list(set(noun_phrases + verbs + adjective_phrases)))
        
    return extracted_tokens

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Rubin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Rubin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [29]:
# Extract tokens from every caption
extracted_tokens = extract_tokens(im_captions)

# Save extracted tokens to a new file, one caption per line
# NOTE: tokens are joined with spaces, NOT comma delimiters
with open('extracted_tokens.txt', 'w') as f:
    for tokens in extracted_tokens:
        f.write(' '.join(tokens) + '\n')

### User input

We now take input from the user and tokenize it the same way as our captions.

For now, the input can ONLY be a phrase. In the future, we want to accept an image from the user and generate tags from it.

In [93]:
user_phrase = "A girl in a building going to a wooden horse"
user_tokens = extract_tokens([user_phrase])
user_tokens = ' '.join(user_tokens[0])
user_tokens

'going horse wooden girl building'

### Phrase Similarity Models

Now, we want to test different text similarity measures:
1. Sorted fuzzy ratio
2. Cosine similarity with USE (Universal Sentence Encoder) embeddings
3. Euclidean distance with USE embeddings, converted to a similarity score
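
To make the two embedding-based scores concrete, here is a minimal pure-Python sketch of the conversions used later in this notebook: cosine similarity (1 minus the cosine distance) and a Euclidean score of 1 / (1 + distance). The function names and the toy 3-dimensional vectors are assumptions made for illustration; the real USE embeddings have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, i.e. 1 - cosine distance."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_similarity(u, v):
    """Map a Euclidean distance d in [0, inf) to a score 1 / (1 + d) in (0, 1]."""
    return 1.0 / (1.0 + math.dist(u, v))

# Toy vectors standing in for embeddings (illustrative values only)
user_vec = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]
orthogonal = [0.0, 1.0, 0.0]

print(cosine_similarity(user_vec, same_direction))     # 1.0
print(cosine_similarity(user_vec, orthogonal))         # 0.0
print(euclidean_similarity(user_vec, same_direction))  # 0.5
```

Note that the two scores can disagree: `same_direction` is a perfect match by cosine similarity but only scores 0.5 on the Euclidean measure, because cosine ignores vector length while Euclidean distance does not.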


In [120]:
import pandas as pd
# import tokens already extracted from the dataset
with open(os.path.join(sys.path[0], 'extracted_tokens.txt'), "r") as f:
    ext_tokens = f.readlines()

ext_tokens = [line.strip() for line in ext_tokens]

# start up a df to hold all the similarity measures
sim_DF = pd.DataFrame(list(zip(im_names, ext_tokens)), columns = ["Image file name", "Tokenized caption"])

##### Sorted fuzzy ratio

In [121]:
from fuzzywuzzy import fuzz

fuzzy_ratios = [fuzz.token_sort_ratio(user_tokens, phrase)/100 for phrase in ext_tokens]

sim_DF['Fuzzy ratio'] = fuzzy_ratios
sim_DF.sort_values("Fuzzy ratio")

| | Image file name | Tokenized caption | Fuzzy ratio |
|---|---|---|---|
| 9306 | 2428275562_4bde2bc5ea.jpg | | 0.00 |
| 33361 | 3640443200_b8066f37f6.jpg | | 0.00 |
| 18354 | 300577375_26cc2773a1.jpg | front SUV stands policeman | 0.03 |
| 6202 | 2216695423_1362cb25f3.jpg | play stick Dogs | 0.04 |
| 24783 | 3298175192_bbef524ddc.jpg | plays camera puppy | 0.04 |
| ... | ... | ... | ... |
| 31924 | 3583903436_028b06c489.jpg | old wooden building young boy swings | 0.65 |
| 24717 | 3295391572_cbfde03a10.jpg | building blocks playing wooden table child | 0.65 |
| 3 | 1000268201_693b08cb0e.jpg | wooden girl climbing playhouse little | 0.67 |
| 36461 | 439569646_c917f1bc78.jpg | hiding wooden structure girl is building painted | 0.68 |
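
To show why the sorted ratio suits unordered tags, here is a rough standard-library approximation of the idea: sort the tokens of each string, rejoin them, and compare the results with `difflib.SequenceMatcher`. This is only a sketch of the technique, not fuzzywuzzy's actual implementation (which also preprocesses the strings and may use a different matching backend), and `token_sort_similarity` is a name made up for this example.

```python
from difflib import SequenceMatcher

def token_sort_similarity(a, b):
    """Sort the whitespace-separated tokens of each string, then compare them."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, sorted_a, sorted_b).ratio()

# Token order does not matter once both sides are sorted
print(token_sort_similarity("wooden girl", "girl wooden"))  # 1.0

# User tokens vs. one tokenized caption from the dataset
score = token_sort_similarity("going horse wooden girl building",
                              "building girl going wooden")
print(round(score, 2))  # 0.9
```

Because both strings are sorted before comparison, the arbitrary order that `set()` produces in `extract_tokens` has no effect on the score.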


##### USE Embedding

In [99]:
import tensorflow_hub as hub
model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4") 

Define the embed function that creates the embeddings, then calculate the cosine and Euclidean distances

In [111]:
from scipy.spatial.distance import cosine, euclidean
def embed(text_in_lst):
    return model(text_in_lst)

In [115]:
# get the tokenized captions
token_captions = sim_DF["Tokenized caption"].values.tolist()

# embedding for the user phrase
user_embed = embed([user_tokens])

USE_cos = []
USE_euc = []

# go through all the tokenized captions in the DataFrame
for caption in token_captions:
    # get the embedding for the caption
    caption_embed = embed([caption])
    
    # calculate the cosine and Euclidean distances to the user phrase
    cos = cosine(user_embed[0], caption_embed[0])
    euc = euclidean(user_embed[0], caption_embed[0])

    # convert distances to similarity scores (higher = more similar)
    USE_cos.append(1.0 - cos)
    USE_euc.append(1.0 / (1.0 + euc))
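
The loop above calls the model once per caption, which can be slow for tens of thousands of captions. Since `embed` takes a list of strings, one possible speed-up is to pass captions in batches. The sketch below shows only the batching logic; `batched`, `embed_in_batches`, the batch size, and the stand-in `fake_embed` are all hypothetical names and values chosen for illustration, and I have not timed this against the real model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=256):
    """Embed `texts` batch by batch and return one vector per text, in order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

# Hypothetical stand-in for the notebook's `embed`: one 2-d vector per string
def fake_embed(texts):
    return [[float(len(t)), float(t.count(" "))] for t in texts]

sample_captions = ["play stick Dogs", "plays camera puppy", "shirt red rock climber"]
vectors = embed_in_batches(sample_captions, fake_embed, batch_size=2)
print(len(vectors))  # 3
```

With the real model, `embed_fn` would be the notebook's `embed`, and each returned row would then be compared against `user_embed[0]` exactly as in the loop.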

Now, we add the different metrics to the DataFrame


In [123]:
sim_DF['USE Cosine'] = USE_cos
sim_DF['USE Euclidean'] = USE_euc

| | Image file name | Tokenized caption | Fuzzy ratio | USE Cosine | USE Euclidean |
|---|---|---|---|---|---|
| 0 | image | caption | 0.10 | -0.077059 | 0.405237 |
| 1 | 1000268201_693b08cb0e.jpg | entry way set is stairs child climbing pink dress | 0.30 | 0.242695 | 0.448291 |
| 2 | 1000268201_693b08cb0e.jpg | building girl going wooden | 0.90 | 0.727266 | 0.575189 |
| 3 | 1000268201_693b08cb0e.jpg | wooden girl climbing playhouse little | 0.67 | 0.580199 | 0.521840 |
| 4 | 1000268201_693b08cb0e.jpg | girl climbing stairs playhouse little | 0.49 | 0.394112 | 0.476008 |
| ... | ... | ... | ... | ... | ... |
| 40451 | 997722733_0cb5439472.jpg | pink shirt man climbs rock face | 0.19 | 0.184837 | 0.439205 |
| 40452 | 997722733_0cb5439472.jpg | rock high air man is climbing | 0.30 | 0.085720 | 0.425126 |
| 40453 | 997722733_0cb5439472.jpg | person red assist handles rock face shirt cove... | 0.16 | 0.069666 | 0.423001 |
| 40454 | 997722733_0cb5439472.jpg | shirt red rock climber | 0.19 | 0.097913 | 0.426768 |


### Filter images based on similarity metrics

In [144]:
threshold = 0.60
fuzzy_threshold = sim_DF[sim_DF['Fuzzy ratio'] >= threshold].sort_values('Fuzzy ratio', ascending=False)
fuzzy_threshold

| Image file name | Tokenized caption | Fuzzy ratio | USE Cosine | USE Euclidean |
|---|---|---|---|---|
| 1000268201_693b08cb0e.jpg | building girl going wooden | 0.9 | 0.727266 | 0.575189 |
| 439569646_c917f1bc78.jpg | hiding wooden structure girl is building painted | 0.68 | 0.486473 | 0.496663 |
| 1000268201_693b08cb0e.jpg | wooden girl climbing playhouse little | 0.67 | 0.580199 | 0.52184 |
| 3583903436_028b06c489.jpg | old wooden building young boy swings | 0.65 | 0.513661 | 0.503463 |
| 3427301653_4ff0d6fd93.jpg | sits wearing side holding iPod girl hat building | 0.65 | 0.399874 | 0.477199 |
| 3295391572_cbfde03a10.jpg | building blocks playing wooden table child | 0.65 | 0.496183 | 0.499049 |
| 3241726740_6d256d61ec.jpg | white women steps building smoking | 0.64 | 0.211571 | 0.443315 |
| 1000268201_693b08cb0e.jpg | going wooden girl little cabin pink dress | 0.63 | 0.540415 | 0.510534 |
| 477254932_56b48d775d.jpg | pool indoor girl is diving | 0.62 | 0.238015 | 0.447529 |
| 566446626_9793890f95.jpg | background hanging girl upsidedown house | 0.61 | 0.308498 | 0.459556 |
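
Each image has several captions, so the same file name can pass the threshold more than once (the first image appears three times above). If one row per image is wanted, a possible follow-up step is to keep only the best-scoring caption for each image. The sketch below does this in plain Python on a few rows copied from the filtered output; `best_caption_per_image` is a hypothetical helper, not part of the notebook.

```python
# (image file name, tokenized caption, fuzzy ratio) rows copied from the filtered output
rows = [
    ("1000268201_693b08cb0e.jpg", "building girl going wooden", 0.90),
    ("439569646_c917f1bc78.jpg", "hiding wooden structure girl is building painted", 0.68),
    ("1000268201_693b08cb0e.jpg", "wooden girl climbing playhouse little", 0.67),
    ("3583903436_028b06c489.jpg", "old wooden building young boy swings", 0.65),
    ("1000268201_693b08cb0e.jpg", "going wooden girl little cabin pink dress", 0.63),
]

def best_caption_per_image(rows):
    """Keep only the highest-scoring (caption, score) pair for each image."""
    best = {}
    for name, caption, score in rows:
        if name not in best or score > best[name][1]:
            best[name] = (caption, score)
    return best

best = best_caption_per_image(rows)
# Print one line per image, highest score first
for name, (caption, score) in sorted(best.items(), key=lambda kv: kv[1][1], reverse=True):
    print(name, caption, score)
```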


### Save the filtered images

In [151]:
fuzzy_threshold.to_csv(path_or_buf = 'filteredDF.txt', sep = '\t', columns = ["Tokenized caption"])