# 3. Improve Hashtags

The HARRISON dataset provides images along with their hashtags, but the provided hashtags alone are too sparse to train a model well. For example, a picture of a beach has only the single hashtag `sea`, so the model learns that the image is associated with `sea` but not with `ocean`, `beach`, or other similar tags.

In this section, we improve the ground truth values (i.e. the true hashtags) by adding related hashtags. To find similar hashtags, we use [Gensim](https://radimrehurek.com/gensim/index.html) and a [pre-trained word2vec model](https://code.google.com/archive/p/word2vec/).

## Load a Model

In [25]:
from gensim.models import KeyedVectors

# Load a model
model = KeyedVectors.load_word2vec_format('../model/GoogleNews-vectors-negative300.bin', binary=True)

In [5]:
# Get the words most similar to "sea"
model.most_similar("sea")

[('ocean', 0.7643541097640991),
 ('seas', 0.6712585687637329),
 ('oceans', 0.6193016767501831),
 ('waters', 0.5993286371231079),
 ('seawaters', 0.5960040092468262),
 ('Creamsicle_orange', 0.58356773853302),
 ('coastal_waters', 0.5737729072570801),
 ('wooden_crate_dumped', 0.5657002329826355),
 ('oceanic', 0.5628669261932373),
 ('prairies_deserts', 0.5556836724281311)]
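
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of what that number means, the snippet below computes it by hand on made-up 3-dimensional vectors (the real GoogleNews vectors have 300 dimensions). The `toy_vectors` values and the `cosine_similarity` helper are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "embeddings", invented for illustration only
toy_vectors = {
    "sea":   np.array([1.0, 0.0, 0.0]),
    "ocean": np.array([0.9, 0.1, 0.0]),
    "desk":  np.array([0.0, 1.0, 0.0]),
}

for word in ("ocean", "desk"):
    score = cosine_similarity(toy_vectors["sea"], toy_vectors[word])
    print(word, round(score, 3))
```

Vectors pointing in nearly the same direction score close to 1, while orthogonal vectors score 0.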

## Hashtag Similarities

First, we analyze the similarities between the existing hashtags.

In [26]:
import numpy as np

# Read hashtags
with open("../model/hashtags.txt") as f:
    hashtags = np.array(f.read().split('\n'))
num_hashtags = len(hashtags)


In [27]:
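# Create a matrix of pairwise similarities.
# Note: only the lower triangle (j < i) is filled, so row i holds the
# similarities to hashtags that appear earlier in the list.
sim_mat = np.zeros((num_hashtags, num_hashtags))

for i in range(num_hashtags):
    for j in range(i):
        try:
            sim_mat[i, j] = model.similarity(hashtags[i], hashtags[j])
        except KeyError:
            # One of the hashtags is not in the model's vocabulary; leave 0
            pass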
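
The loop above calls `model.similarity` once per pair and fills only the entries with `j < i`. As a hedged alternative, the sketch below computes a full symmetric cosine-similarity matrix in one step from a matrix of row vectors. It runs on made-up 2-dimensional data; `pairwise_cosine` and `toy` are hypothetical names. Applying it here would presumably mean stacking `model[tag]` for the hashtags that are in the vocabulary, which is not shown.

```python
import numpy as np

def pairwise_cosine(vectors: np.ndarray) -> np.ndarray:
    """Return the symmetric matrix of cosine similarities between rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms  # scale every row to length 1
    return unit @ unit.T    # dot products of unit vectors are cosines

# Toy row vectors, invented for illustration only
toy = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
])
toy_sim = pairwise_cosine(toy)
print(np.round(toy_sim, 3))
```

The result has ones on the diagonal and equal values at `[i, j]` and `[j, i]`, so every row lists the similarities to all other items rather than only to earlier ones.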

In [8]:
# Create a dictionary whose:
#   key:    hashtag
#   value:  similar hashtags
sim_tags = {}
for i in range(num_hashtags):
    # Get similar tags whose cosine similarity is > 0.3
    sim_ls = list(hashtags[np.where(sim_mat[i,:]>0.3)[0]])
    sim_tags[hashtags[i]] = sim_ls

print(f"Tags that are similar to sea: {sim_tags['sea']}")

Tags that are similar to sea: ['air', 'beach', 'boat', 'earth', 'fish', 'ice', 'island', 'lake', 'moon', 'ocean', 'river', 'sand']
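
Because `sim_mat` is filled only below the diagonal, each row sees only the tags that come earlier in `hashtags`. The following is a minimal sketch of the same thresholding step under the assumption that a full symmetric matrix is available. The helper `similar_tags`, the toy tags, and the similarity values are all made up for illustration.

```python
import numpy as np

def similar_tags(tags, sim, threshold=0.3):
    """Map each tag to the other tags whose similarity exceeds `threshold`."""
    tags = np.asarray(tags)
    result = {}
    for i, tag in enumerate(tags):
        mask = sim[i] > threshold  # fresh boolean array, `sim` is untouched
        mask[i] = False            # a tag is not its own neighbour
        result[str(tag)] = tags[mask].tolist()
    return result

# Toy symmetric similarity matrix, invented for illustration only
toy_tags = ["beach", "ocean", "sea", "pizza"]
toy_matrix = np.array([
    [1.0, 0.5, 0.6, 0.1],
    [0.5, 1.0, 0.8, 0.0],
    [0.6, 0.8, 1.0, 0.1],
    [0.1, 0.0, 0.1, 1.0],
])
toy_neighbours = similar_tags(toy_tags, toy_matrix)
print(toy_neighbours)
```

With a symmetric matrix the relation is mutual: if `ocean` is listed for `sea`, then `sea` is listed for `ocean`.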


## Improve `tag_list.txt` Using the Dictionary

In [9]:
temp = 0
for key in sim_tags:
    temp += len(sim_tags[key])
avg_increase = temp/num_hashtags
print(f" Average # of hashtags added: {avg_increase}")

 Average # of hashtags added: 12.569138276553106


In [19]:
import pickle
import numpy as np

# If you restarted the kernel, uncomment the lines below
# with open("../model/1113114209_sim_tags.pickle", "rb") as f:
#     sim_tags = pickle.load(f)

with open("../HARRISON/tag_list.txt") as f:
    # Read tag_list
    tag_list = f.read().split('\n')

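# Split each line into tags, dropping the final line and the last token of each line.
# Rows differ in length, so store them in an object array (recent NumPy versions require this).
tag_list = np.array([x.split(' ')[:-1] for x in tag_list[:-1]], dtype=object)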
num_images = len(tag_list)
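
The two `[:-1]` slices above suggest that each line of `tag_list.txt` ends with a trailing space and that the file ends with a newline. Under that assumption (not verified against the dataset), a sketch that parses the same text without index arithmetic could look like this. `parse_tag_lines` and `sample_text` are hypothetical.

```python
def parse_tag_lines(text: str) -> list:
    """Split text into one list of tags per non-empty line."""
    # str.split() with no argument ignores leading and trailing whitespace
    return [line.split() for line in text.splitlines() if line.strip()]

# Made-up sample in the assumed format: space-terminated lines
sample_text = "sea beach \nfamily love friends \n"
parsed = parse_tag_lines(sample_text)
print(parsed)
```

This keeps the rows as plain Python lists, which is all that `expand_tags` needs.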

In [20]:
def expand_tags(old: list) -> list:
    ret = old.copy()
    for tag in old:
        ret += sim_tags[tag]
    return ret

In [21]:
new_tag_list = [' '.join(expand_tags(x)) for x in tag_list]
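
`expand_tags` appends every related tag, so a tag can appear more than once in the result, and a tag missing from `sim_tags` raises a `KeyError`. Whether either matters depends on how the expanded list is consumed. As a hedged variant, the sketch below skips tags that are already present and ignores unknown tags. `expand_tags_unique` and `toy_sim_tags` are illustrative names, not part of the notebook.

```python
def expand_tags_unique(old, sim_tags):
    """Append related tags that are not already present, in first-seen order."""
    seen = set(old)
    expanded = list(old)
    for tag in old:
        for candidate in sim_tags.get(tag, []):  # unknown tags add nothing
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded

# Made-up dictionary in the same shape as sim_tags
toy_sim_tags = {"sea": ["beach", "ocean"], "beach": ["sea", "sand"]}
expanded_example = expand_tags_unique(["sea", "beach", "selfie"], toy_sim_tags)
print(expanded_example)
```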

## Save Objects

In [23]:
import pickle
import sys
sys.path.append("../src")
from utils import destination

# Save this dictionary to disk
with open(destination("../model", "sim_tags.pickle"), "wb") as f:
    pickle.dump(sim_tags, f)

# Save new tag_list.txt
with open(destination("../HARRISON", "tag_list.txt"), "w") as f:
    f.writelines("%s\n" % tags for tags in new_tag_list)