# Seeds exploration

The goal is to expand the set of words of interest. Not all of the suggested words are present in the corpus, and some of them have very low frequencies. This is a problem when trying to guide topic modelling around the concepts of interest, because these words appear in only a few documents. Expanding the set should help us discover better topics (and, more generally, produce a better analysis).

In [18]:
import warnings
warnings.filterwarnings('ignore')

import sys
sys.path.append("..")

from pprint import pprint

from src.dataset import Dataset

In [25]:
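from collections import defaultdict

def check_presence(narcotics, weapons, investigation, filtered_tokens):
    """Print the document frequency of each seed word and the seeds never seen."""
    seen = set()
    not_found = set(weapons + investigation + narcotics)
    freq = defaultdict(int)  # seed word -> number of documents containing it
    for doc in filtered_tokens:
        for w in set(doc):  # set(): count each word at most once per document
            if w in not_found:
                seen.add(w)
                freq[w] += 1

    not_found -= seen  # keep only the seeds that never occurred

    pprint(sorted([(v, k) for k, v in freq.items()]))
    print("Not found")
    pprint(not_found)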

In [None]:
dataset = Dataset()
tokens = dataset.load_dataset(year=None, 
                              tokens=True, 
                              courts={"Illinois Supreme Court"})

We keep both the singular and plural forms of words because the pre-processing phase (we use spaCy) can produce either form, depending on the context.

In [28]:
narcotics = ['cannabis', 'cocaine', 'methamphetamine', 'drugs', 'drug', 'marijuana', 
             'ecstasy', 'lsd', 'ketamine', 'heroin', 'fentanyl', 'overdose']

weapons = ['gun', 'knife', 'weapon', 'firearm', 'rifle', 'carbine', 'shotgun', 'handgun', 
           'revolver', 'musket', 'pistol', 'derringer', 'assault', 'rifle', 'sword', 'blunt']

investigation = ['gang', 'mafia', 'serial',  'killer', 'rape', 'theft', 'recidivism', 
                 'arrest', 'robbery', 'cybercrime', 'cyber', 'crime']

As the output below shows, many seed words appear in only a few documents of the collection, and some do not appear at all. If we want to drive the topic modelling process towards the topics of interest, we need to expand these sets with more frequent words.

In [29]:
check_presence(narcotics, weapons, investigation, tokens)

[(1, 'cyber'),
 (1, 'ketamine'),
 (2, 'musket'),
 (5, 'mafia'),
 (8, 'derringer'),
 (14, 'drugs'),
 (18, 'methamphetamine'),
 (20, 'carbine'),
 (20, 'overdose'),
 (29, 'lsd'),
 (32, 'recidivism'),
 (100, 'sword'),
 (138, 'killer'),
 (142, 'cannabis'),
 (158, 'serial'),
 (203, 'blunt'),
 (204, 'handgun'),
 (205, 'rifle'),
 (265, 'shotgun'),
 (288, 'marijuana'),
 (290, 'heroin'),
 (320, 'cocaine'),
 (359, 'firearm'),
 (392, 'gang'),
 (721, 'pistol'),
 (773, 'knife'),
 (923, 'revolver'),
 (975, 'theft'),
 (1043, 'rape'),
 (1475, 'weapon'),
 (1611, 'drug'),
 (1941, 'gun'),
 (2053, 'assault'),
 (2279, 'robbery'),
 (5971, 'arrest'),
 (6662, 'crime')]
Not found
{'ecstasy', 'cybercrime', 'fentanyl'}
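
A minimal sketch of how the low-frequency seeds could be flagged automatically, assuming the corpus is an iterable of token lists like `tokens` above. The helper name `low_frequency_seeds` and the `min_docs` threshold are hypothetical and not part of the original notebook; the toy corpus is for illustration only.

```python
from collections import Counter

def low_frequency_seeds(seeds, docs, min_docs=50):
    """Return {seed: document frequency} for seeds seen in fewer than `min_docs` documents.

    `min_docs` is an illustrative threshold, not a value taken from the notebook.
    """
    seeds = set(seeds)
    doc_freq = Counter()
    for doc in docs:
        # Intersect with the seed set so each seed counts once per document
        doc_freq.update(seeds & set(doc))
    return {s: doc_freq[s] for s in seeds if doc_freq[s] < min_docs}

# Toy corpus, for illustration only
toy_docs = [["gun", "arrest"], ["gun", "crime"], ["ketamine", "gun"]]
print(low_frequency_seeds(["gun", "ketamine", "fentanyl"], toy_docs, min_docs=2))
```

Seeds that never occur are reported with a frequency of 0, so the same call covers both the "rare" and the "not found" cases.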


In [40]:
not_found = {'ecstasy', 'cybercrime', 'fentanyl'}

interesting_set = set(narcotics + weapons + investigation)

## GoogleNews word embeddings

Word embeddings should let us find words that are used in the same contexts as our words of interest, so that we can expand the seed set with them.
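
To make the idea concrete, here is a toy sketch of the nearest-neighbour lookup that underlies this kind of expansion: words are ranked by the cosine similarity of their vectors. The three-dimensional vectors and the `nearest` helper are made up for illustration; the GoogleNews model uses 300-dimensional vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word` (hypothetical helper)."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors: drug-related words point one way, weapon-related words another
toy_vectors = {
    "cocaine": [0.9, 0.1, 0.0],
    "heroin": [0.8, 0.2, 0.1],
    "pistol": [0.1, 0.9, 0.2],
    "revolver": [0.0, 0.8, 0.3],
}
print(nearest("cocaine", toy_vectors, topn=1))
```

With these toy vectors, the closest neighbour of `cocaine` is `heroin` and the closest neighbour of `pistol` is `revolver`.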

In [33]:
from gensim import models

w = models.KeyedVectors.load_word2vec_format(
    '../data/models/GoogleNews-vectors-negative300.bin', binary=True)

In [36]:
w.most_similar(positive=['cocaine'], topn=5)

[('heroin', 0.8294118046760559),
 ('crack_cocaine', 0.8008098006248474),
 ('methamphetamine', 0.7232441306114197),
 ('narcotics', 0.707099974155426),
 ('methamphetamines', 0.7007291316986084)]

In [42]:
# serial is too generic (########_##XX, 0.67 - Serial, 0.65 - 9mm_Norinco_pistol, 0.47 - 
# oscilloscopes_protocol_analyzers, 0.46 - Gbit_s_optical, 0.46 - printf, 0.43 ...)
not_in_word2vec = {'serial'}

interesting_set -= not_in_word2vec
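
`most_similar` raises a `KeyError` for words missing from the model vocabulary, so it can help to split the seed set before the loop. A small sketch, assuming the vocabulary object supports the `in` operator (gensim's `KeyedVectors` does); the helper `split_by_vocabulary` is hypothetical, and a plain set stands in for the model here.

```python
def split_by_vocabulary(words, vocabulary):
    """Split `words` into (known, missing) according to membership in `vocabulary`."""
    known = {word for word in words if word in vocabulary}
    return known, set(words) - known

# A plain set stands in for the embedding model's vocabulary (assumption)
toy_vocabulary = {"cocaine", "heroin", "pistol"}
known, missing = split_by_vocabulary({"cocaine", "pistol", "fentanyl"}, toy_vocabulary)
print(sorted(known), sorted(missing))
```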

In [43]:
for word in interesting_set:
    similar_words = w.most_similar(positive=[word], topn=100)
    print(f"*** {word} ***:\n {' - '.join(map(lambda x: f'{x[0]}, {round(x[1], 2)}', similar_words))}")

*** drugs ***:
 drug, 0.85 - prescription_drugs, 0.69 - medications, 0.67 - illicit_drugs, 0.67 - Drugs, 0.67 - medicines, 0.66 - narcotics, 0.65 - pills, 0.65 - heroin, 0.64 - painkillers, 0.63 - Drug, 0.63 - narcotic, 0.63 - OxyContin_prescription_painkiller, 0.62 - cocaine, 0.62 - anti_depressant_Seroxat, 0.61 - prescription_meds, 0.61 - medication, 0.61 - meth_amphetamine, 0.61 - illicit_substances, 0.61 - painkiller_Oxycontin, 0.6 - prescription_medications, 0.6 - painkiller_Oxycodone, 0.6 - methodone, 0.6 - Suboxone, 0.6 - methadone, 0.6 - oxycotin, 0.6 - psychostimulant_drugs, 0.6 - Oxycontin, 0.59 - prescription_medication, 0.59 - polydrug, 0.58 - painkiller, 0.58 - suboxone, 0.58 - heroin_crack_cocaine, 0.58 - crystal_methamphetamines, 0.58 - Oxycotin, 0.58 - psychotropic_medicines, 0.58 - prescription_medicines, 0.58 - amphetamines, 0.58 - stimulants, 0.58 - controlled_substances, 0.58 - antidepressants_antipsychotics, 0.57 - horse_tranquiliser, 0.57 - prescription_painkiller

In [53]:
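new_narcotics, new_weapons, new_investigation = set(), set(), set()
new_words = set()  # all candidate expansion words, across the three categories
for word in interesting_set:
    similar_words = w.most_similar(positive=[word], topn=20)
    for similar_word in similar_words:
        # Split multi-word entries such as 'crack_cocaine' and lowercase each part
        new_words.update([part.lower() for part in similar_word[0].split("_")])
print(new_words)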

{'.##-##', 'burglary', 'sureños', 'derringer', 'fiend', 'incarceration', 'juana', 'cybercriminal', 'arrested', 'violence', 'patch', 'sleeping', '.##', 'forcible', 'magnum', 'reconviction', 'wesson', 'mafiosi', 'medication', 'assault', 'katana', 'breech', 'narcotics', 'colt', 'honest', 'pfizer', 'possessing', 'vandalism', 'stinging', 'folding', 'jailed', 'mmj', 'jailing', 'reducing', 'shotgun', 'offenders', 'hangun', 'amphetamines', 'musket', 'remington', 'kidnapper', 'arrrest', 'syabu', 'mdma', 'automatic', 'subutex', 'ecstacy', 'medications', 'gun', 'overdosing', 'broadsword', 'knife', 'frank', 'mp5', 'sureno', 'firing', 'hydrochloride', 'saber', 'generic', 'assualt', 'xr', 'molestation', 'rifles', 'cyberthreats', 'stabbing', 'acid', 'methampetamine', 'theft', 'break', 'purse', 'meds', 'skunk', 'animal', 'ndrangheta', 'ketamine', 'boxcutter', 'raping', 'reputed', 'attempted', 'assassin', 'gangsters', 'guns', 'ripper', 'larceny', 'acerbic', 'overdosed', 'identity', 'plowshare', 'kitche