# Seeds exploration

The idea is to expand the set of words of interest. Not all of the suggested words are present in the corpus, and some of them have very low frequencies. This is a problem when trying to guide the topic modelling around the concepts of interest, because these words appear in few documents. Expanding the set should help us discover better topics and, more generally, produce a better analysis.
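To see which seed words are missing or rare, one option is to count the number of documents each seed appears in. The snippet below is a minimal sketch of that check, not the pre-processing used in this notebook: `document_frequency`, `toy_docs`, `toy_seeds` and the rarity threshold of 2 are all hypothetical names and values chosen for illustration.

```python
from collections import Counter


def document_frequency(docs, seeds):
    """Count in how many tokenised documents each seed word appears."""
    seed_set = set(seeds)
    counts = Counter()
    for tokens in docs:
        # Use a set so a word repeated within one document counts once.
        counts.update(set(tokens) & seed_set)
    return {seed: counts[seed] for seed in seeds}


# Toy tokenised corpus (hypothetical data, for illustration only).
toy_docs = [
    ["police", "seized", "cocaine", "and", "heroin"],
    ["the", "suspect", "carried", "a", "knife"],
    ["cocaine", "trafficking", "arrest"],
]
toy_seeds = ["cocaine", "knife", "musket"]

df = document_frequency(toy_docs, toy_seeds)
missing = [seed for seed, n in df.items() if n == 0]
rare = [seed for seed, n in df.items() if 0 < n < 2]  # threshold is an assumption

print(df)
print("missing:", missing, "- rare:", rare)
```

Seeds that end up in `missing` or `rare` are the ones that would benefit most from being complemented with related words.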

In [20]:
narcotics = ['cannabis', 'cocaine', 'methamphetamine', 'drugs', 'drug', 'marijuana', 
             'ecstasy', 'lsd', 'ketamine', 'heroin', 'fentanyl', 'overdose']

weapons = ['gun', 'knife', 'weapon', 'firearm', 'rifle', 'carabine', 'shotgun', 'handgun', 
           'revolver', 'musket', 'pistol', 'derringer', 'assault', 'rifle', 'sword', 'blunt']

investigation = ['gang', 'mafia', 'serial',  'killer', 'rape', 'theft', 'recidivism', 
                 'arrest', 'robbery', 'cybercrime', 'cyber', 'crime']

# 'carabine' is not found in the word2vec vocabulary.
# 'serial' is too generic; its nearest neighbours are unrelated to crime
# (########_##XX, 0.67 - Serial, 0.65 - 9mm_Norinco_pistol, 0.47 - 
# oscilloscopes_protocol_analyzers, 0.46 - Gbit_s_optical, 0.46 - printf, 0.43 ...)
not_in_word2vec = {'carabine', 'serial'}

interesting_set = set(narcotics + weapons + investigation) - not_in_word2vec
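The exclusion set above was compiled by hand. As a rough sketch, the out-of-vocabulary part of that check could be automated with a simple membership test. The helper `split_by_vocabulary` and the `toy_vocabulary` set are hypothetical; with the real model, the loaded gensim `KeyedVectors` object could be passed as `vocabulary`, assuming its support for the `in` operator.

```python
def split_by_vocabulary(words, vocabulary):
    """Split words into those present in the vocabulary and those missing."""
    found = sorted(word for word in words if word in vocabulary)
    missing = sorted(word for word in words if word not in vocabulary)
    return found, missing


# Toy vocabulary standing in for the embedding model (illustration only).
toy_vocabulary = {"gun", "knife", "rifle"}

found, missing = split_by_vocabulary({"gun", "carabine", "rifle"}, toy_vocabulary)
print("found:", found)
print("missing:", missing)
```

This only catches words that are absent from the vocabulary. Words that are present but too generic, such as 'serial', still need manual inspection of their neighbours.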

We keep both the singular and plural forms of a word (for example, 'drug' and 'drugs') because the pre-processing phase, which uses spaCy, can produce either form depending on the context.

## GoogleNews word embeddings

Word embeddings make it possible to find words that are used in the same contexts as our words of interest, so we can expand the seed set with these related words.
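Closeness between two word vectors is commonly measured with cosine similarity, which is also the score that gensim's `most_similar` reports. The snippet below is a toy sketch of that ranking. The three-dimensional `toy_vectors` are invented for illustration (the real GoogleNews vectors have 300 dimensions), and `cosine` and `most_similar_toy` are hypothetical helpers, not part of gensim.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_toy(query, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented 3-dimensional vectors (illustration only).
toy_vectors = {
    "cocaine": [0.9, 0.1, 0.0],
    "heroin": [0.8, 0.2, 0.0],
    "sword": [0.0, 0.1, 0.9],
    "knife": [0.1, 0.2, 0.8],
}

ranked = most_similar_toy("cocaine", toy_vectors)
print(ranked)
```

In this toy space 'heroin' ranks first for 'cocaine' because the two vectors point in nearly the same direction, which is the behaviour we rely on when expanding the seed set.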

In [1]:
from gensim import models

w = models.KeyedVectors.load_word2vec_format(
    '../data/models/GoogleNews-vectors-negative300.bin', binary=True)

In [6]:
w.most_similar(positive=['cocaine'])

[('heroin', 0.8294118046760559),
 ('crack_cocaine', 0.8008098006248474),
 ('methamphetamine', 0.7232441306114197),
 ('narcotics', 0.707099974155426),
 ('methamphetamines', 0.7007291316986084),
 ('Cocaine', 0.6972401142120361),
 ('crystal_methamphetamine', 0.6937119960784912),
 ('crystal_methamphetamines', 0.6857212781906128),
 ('illicit_drugs', 0.6745167970657349),
 ('marijuana', 0.6655946373939514)]

In [26]:
for word in interesting_set:
    similar_words = w.most_similar(positive=[word], topn=60)
    # Format each neighbour as "word, similarity" and join them on one line.
    formatted = ' - '.join(f'{term}, {round(score, 2)}' for term, score in similar_words)
    print(f"*** {word} ***:\n {formatted}")

*** sword ***:
 swords, 0.77 - broadsword, 0.66 - sandal_flick, 0.63 - katana, 0.62 - scimitar, 0.6 - broadswords, 0.6 - samurai_sword, 0.59 - rapier, 0.58 - sorcery_fantasy, 0.57 - Samurai_sword, 0.56 - sandals_epics, 0.56 - knife, 0.56 - rapiers, 0.56 - plowshare, 0.55 - knives, 0.55 - brandishing_sword, 0.55 - light_saber, 0.55 - scabbards, 0.55 - swordplay, 0.54 - lightsaber, 0.54 - sword_wielding, 0.54 - Sword, 0.54 - kukri, 0.54 - Damocles_hangs, 0.53 - sorcery_epic, 0.53 - swords_spears, 0.53 - bladed_weapon, 0.53 - katana_sword, 0.53 - nunchaku, 0.53 - scabbard, 0.53 - ceremonial_swords, 0.53 - spear, 0.53 - spears, 0.53 - daggers_swords, 0.52 - flaming_sword, 0.52 - spears_swords, 0.52 - lances, 0.52 - daggers, 0.52 - buckler, 0.52 - pen_mightier, 0.52 - unsheathing, 0.52 - cutlass, 0.52 - Gravity_Hammer, 0.52 - dagger, 0.52 - Mjolnir, 0.52 - scepter, 0.52 - Katana_sword, 0.52 - Caddoc, 0.51 - maces, 0.51 - ornamental_sword, 0.51 - unsheathe, 0.51 - sabers, 0.51 - swords_dagge