## [Chapter 13] Natural Language Autocomplete

In this lab, we're going to use transformers to get autocomplete suggestions for a given term.

In [1]:
import sys
sys.path.append('..')
from aips import *

In [2]:
import pandas
import pickle
import json
import tqdm
pandas.set_option('display.max_rows', 1000)

## Load and clean the Outdoors dataset

In [3]:
from ltr.download import download, extract_tgz
import tarfile

dataset = ['https://github.com/ai-powered-search/outdoors/raw/master/outdoors.tgz']
download(dataset, dest='data/')
extract_tgz('data/outdoors.tgz') # -> Holds 'outdoors.csv', a big CSV file of the stackexchange outdoors dataset

data/outdoors.tgz already exists


In [4]:
from densevectors.outdoors import *
#Transform the outdoors.csv file into Solr documents
outdoors_dataframe = cleanDataset('data/posts.csv')
print(len(outdoors_dataframe))

19585


## Make a vocabulary of all the concepts in a corpus

In [5]:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_lg')

### Getting noun phrases and verb phrases for our nearest-neighbor vocabulary

We need to get the important language for things that people usually search by!  We need to put ourselves in the 

This step is VERY important. Experimentation and iteration led to the following strategy for getting a good-quality baseline of candidates for our vocabulary.

In [6]:
def normalize(span):
    #normalizes a noun or verb phrase
    return ' '.join([tok.lemma_.lower() for tok in span])

def yieldTuple(df,column,total=100):
    #yields (text, row index) tuples for nlp.pipe(as_tuples=True); the row index is passed through as the context
    for idx,row in df.iterrows():
        if idx<total:
            yield (row[column],idx)

def getConcepts(df,total=None):
    #Get all the matching phrases in the content
    phrases = []
    sources = []
    matcher = Matcher(nlp.vocab)
    nountags = ['NN','NNP','NNS','NOUN'] #Nouns
    verbtags = ['VB','VBD','VBG','VBN','VBP','VBZ','VERB'] #Verbs
    matcher.add("noun_phrases", [[{"TAG":{"IN": nountags}, "IS_ALPHA": True,"OP":"+"}]])
    matcher.add("verb_phrases", [[{"TAG":{"IN": verbtags}, "IS_ALPHA": True,"OP":"+", "LEMMA":{"NOT_IN":["be"]}}]])
    if not total:
        total = len(df)
    for doc,idx in tqdm.tqdm(nlp.pipe(yieldTuple(df,"body",total=total), batch_size=40, as_tuples=True),total=total):
        text = doc.text
        matches = matcher(doc)
        for matchid,start,end in matches:
            span = doc[start:end]
            phrases.append(normalize(span))
            sources.append(span.text)
            
    concepts = {}
    for np in phrases:
        if np not in concepts:
            concepts[np] = 0
        concepts[np] += 1
        
    labels = {}
    for i in range(len(phrases)):
        #keep the first surface form seen for each normalized phrase
        if phrases[i] not in labels:
            labels[phrases[i]] = sources[i]
    
    sorted_concepts = {k: v for k, v in sorted(concepts.items(), key=lambda item: item[1], reverse=True)}
    
    return sorted_concepts,labels
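
The bookkeeping at the end of `getConcepts` can be tried on its own, without loading spaCy. The following is a minimal sketch, assuming a small hand-made list of `(normalized phrase, surface text)` pairs in place of real matcher output. The helper `count_concepts` and the sample pairs are hypothetical and exist only for illustration.

```python
from collections import Counter

def count_concepts(pairs):
    """Count normalized phrases and keep the first surface form seen for each.

    `pairs` is an iterable of (normalized_phrase, surface_text) tuples,
    standing in for the matcher output (illustrative data, not from the corpus).
    """
    counts = Counter()
    first_labels = {}
    for phrase, surface in pairs:
        counts[phrase] += 1
        # setdefault only stores the surface text the first time a phrase appears
        first_labels.setdefault(phrase, surface)
    # most_common() returns phrases sorted by descending frequency
    return dict(counts.most_common()), first_labels

# Invented sample data for illustration
sample_pairs = [
    ("hike", "hiking"),
    ("tent", "Tent"),
    ("hike", "hiked"),
    ("sleep bag", "sleeping bags"),
    ("hike", "hike"),
]
concepts_demo, labels_demo = count_concepts(sample_pairs)
print(concepts_demo)
print(labels_demo)
```

The frequency dictionary drives the vocabulary cut-off later on, while the label dictionary maps each normalized phrase back to a readable surface form.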

In [7]:
SAVE = False

if SAVE:
    #total = len(outdoors_dataframe)
    concepts,labels = getConcepts(outdoors_dataframe)
    with open('data/outdoors_concepts.pickle','wb') as fd:
        pickle.dump(concepts,fd)
    with open('data/outdoors_labels.pickle','wb') as fd:
        pickle.dump(labels,fd)

else:
    with open('data/outdoors_concepts.pickle','rb') as fd:
        concepts = pickle.load(fd)
    with open('data/outdoors_labels.pickle','rb') as fd:
        labels = pickle.load(fd)

print(list(concepts.items())[0:10])
print(list(labels.items())[0:10])

[('have', 32782), ('do', 26869), ('use', 16793), ('get', 13412), ('go', 9899), ('water', 9537), ('make', 9476), ('need', 7814), ('time', 7187), ('take', 6550)]
[('time', 'Time'), ('walk', 'walked'), ('backpacking', 'Backpacking'), ('have', 'had'), ('moleskin', 'Moleskin'), ('boot', 'Boot'), ('start', 'starts'), ('start give', 'start giving'), ('give', 'Given'), ('blister', 'blisters')]


### Examining the vocabulary

What are the concepts with the highest frequency?

In [8]:
topcons = {k:v for (k,v) in concepts.items() if v>5 }
print(len(concepts.keys()))
print(len(topcons.keys()))
print(json.dumps(topcons,indent=2))

124366
12375
{
  "have": 32782,
  "do": 26869,
  "use": 16793,
  "get": 13412,
  "go": 9899,
  "water": 9537,
  "make": 9476,
  "need": 7814,
  "time": 7187,
  "take": 6550,
  "find": 6359,
  "see": 5591,
  "rope": 5540,
  "know": 5522,
  "day": 5318,
  "way": 5239,
  "want": 5087,
  "people": 5083,
  "keep": 4789,
  "look": 4784,
  "area": 4548,
  "work": 4491,
  "thing": 4451,
  "try": 4179,
  "tent": 4095,
  "bag": 4054,
  "lot": 3934,
  "think": 3728,
  "trail": 3725,
  "say": 3669,
  "foot": 3606,
  "climb": 3559,
  "point": 3550,
  "place": 3539,
  "question": 3424,
  "help": 3206,
  "come": 3186,
  "put": 3115,
  "hike": 3042,
  "weight": 3034,
  "fall": 3027,
  "start": 3009,
  "leave": 2977,
  "answer": 2949,
  "give": 2945,
  "something": 2926,
  "food": 2907,
  "year": 2864,
  "carry": 2823,
  "pack": 2755,
  "end": 2752,
  "one": 2741,
  "bear": 2738,
  "fire": 2730,
  "body": 2722,
  "case": 2716,
  "hand": 2648,
  "walk": 2600,
  "mean": 2573,
  "snow": 2555,
  "climbing"

## Transformer time!

In [9]:
from sentence_transformers import SentenceTransformer, util as STutil
stsb = SentenceTransformer('roberta-base-nli-stsb-mean-tokens')

### Getting embeddings.

We're going to effectively perform a complex normalization that will provide the most precisely related concepts. But instead of algorithmic normalization (like stemming and stop word removal), we are normalizing to a dense vector space of 768 feature dimensions. Also remember, we're only normalizing noun phrases and verb phrases. This is somewhat like stop word removal, but that's OK, because we want to suggest similar concepts as concisely as possible. We also have a much better representation of the meaning of each term and its contexts, so in many ways the surrounding "stop word" terms are implied.

Also note, this is similar in purpose to what Solr's suggester would do, but much more effective and less noisy.  Solr uses term statistics to reduce the possible number of items that can be suggested.  For example, if an inverted index contains 100,000 distinct terms, we may only effectively present a subset of 10,000 items over time.  Each suggestion is sorted by similarity and applicability to the user's concept.
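
To make "sorted by similarity" concrete: the cells that follow compare embeddings by cosine similarity. Here is a minimal sketch in plain Python, assuming made-up three-dimensional vectors (the real embeddings have 768 dimensions, and the values below are invented, not produced by the model). The helper name `cosine_similarity` is ours.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only
toy_embeddings = {
    "camp fire": [0.9, 0.1, 0.0],
    "campfire": [0.8, 0.2, 0.1],
    "snow": [0.0, 0.1, 0.9],
}

query_vector = toy_embeddings["campfire"]
# Score every other phrase against the query and sort best-first
ranked = sorted(
    (
        (name, cosine_similarity(query_vector, vec))
        for name, vec in toy_embeddings.items()
        if name != "campfire"
    ),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(name, round(score, 3))
```

Vectors that point in nearly the same direction score close to 1.0, and unrelated ones score close to 0.0, which is the ordering a suggestion list needs.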

In [11]:
SAVE = False

#Note!  This is a hyperparameter.
#We are ignoring terms that occur fewer than this number of times in the entire corpus.
#Lowering this number may lower precision
#Raising this number may lower recall
minimum_frequency = 6
phrases = [k for (k,v) in concepts.items() if v>=minimum_frequency]

if SAVE:
    embeddings = stsb.encode(phrases, convert_to_tensor=True)
    with open('data/outdoors_embeddings.pickle','wb') as fd:
        pickle.dump(embeddings,fd)
else:
    with open('data/outdoors_embeddings.pickle','rb') as fd:
        embeddings = pickle.load(fd)

print(len(phrases))
print(len(embeddings))
print(len(embeddings[0]))

12375
12375
768
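
The effect of the `minimum_frequency` threshold on vocabulary size can be checked before computing any embeddings. This is a minimal sketch, assuming a tiny invented frequency dictionary in place of the real `concepts`. The helper `vocabulary_size` is hypothetical.

```python
def vocabulary_size(concept_counts, minimum_frequency):
    """Number of phrases that occur at least `minimum_frequency` times."""
    return sum(1 for count in concept_counts.values() if count >= minimum_frequency)

# Invented counts for illustration; the real dictionary is far larger
toy_counts = {"have": 120, "tent": 40, "bear canister": 6, "wag bag": 5, "fiberglas": 1}

# A higher threshold keeps fewer, more frequent phrases
for threshold in (1, 6, 50):
    print(threshold, vocabulary_size(toy_counts, threshold))
```

Running the same loop over the real `concepts` dictionary gives a quick view of how many phrases each candidate threshold would send to the encoder.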


In [12]:
similarities = STutil.pytorch_cos_sim(embeddings[0:505], embeddings[0:505])

In [13]:
#Find the pairs with the highest cosine similarity scores
import pandas

a_phrases = []
b_phrases = []
scores = []

for a in range(len(similarities)-1):
    for b in range(a+1, len(similarities)):
        a_phrases.append(phrases[a])
        b_phrases.append(phrases[b])
        scores.append(float(similarities[a][b]))

comparisons = pandas.DataFrame({"phrase a":a_phrases,"phrase b":b_phrases,"score":scores})
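
The nested loops above visit each unordered pair exactly once (the upper triangle of the similarity matrix, excluding the diagonal). For 505 phrases that is 505 × 504 / 2 = 127,260 pairs. The same enumeration can be sketched with `itertools.combinations`. The three phrases and the matrix values below are invented for illustration.

```python
from itertools import combinations

toy_phrases = ["hike", "hiking", "snow"]
# Symmetric toy similarity matrix; values are made up
toy_similarities = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

# combinations(range(n), 2) yields (a, b) with a < b, so each pair appears once
pair_scores = [
    (toy_phrases[a], toy_phrases[b], toy_similarities[a][b])
    for a, b in combinations(range(len(toy_phrases)), 2)
]
# Highest-scoring pairs first
pair_scores.sort(key=lambda row: row[2], reverse=True)
print(pair_scores)
```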

In [14]:
with pandas.option_context('display.max_rows',None,'display.max_columns',None):
    print(comparisons.sort_values(by=["score"], ascending=False)[0:1000])

           phrase a     phrase b     score
29635         sleep     sleeping  0.934763
92551       protect   protection  0.928151
28541      climbing      climber  0.923570
109287   everything     everyone  0.895128
40536          camp      camping  0.878894
18796          hike        hiker  0.865984
20266         start      morning  0.835614
15187         climb     climbing  0.833662
21787     something      someone  0.821081
18503          hike       hiking  0.815187
42886        hiking        hiker  0.814694
8517         people       person  0.784663
15259         climb      climber  0.782961
2047             go        leave  0.770643
9012           keep         stay  0.768612
64675      backpack  backpacking  0.754588
5319           find       search  0.748263
7414            day      morning  0.744142
72013          life         live  0.739865
20128         start       create  0.739785
20263         start         open  0.735628
47654     situation          sit  0.731386
32127      

### Quickly matching vectors at query time

Now that we can get and compare concept embeddings, we need to be able to search these embeddings efficiently.
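
As a baseline for what a nearest-neighbour index has to reproduce, an exact search simply scores the query against every stored vector. The following is a minimal sketch in plain Python, assuming invented two-dimensional vectors. `brute_force_knn` is a hypothetical helper and not part of nmslib. It uses cosine distance (1 minus cosine similarity), the same convention the later cells rely on when they compute `1.0-distances[i]`.

```python
import heapq
import math

def cosine_distance(a, b):
    """1 - cosine similarity, so 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def brute_force_knn(query, vectors, k):
    """Return (ids, distances) of the k vectors closest to `query`."""
    scored = ((cosine_distance(query, vec), idx) for idx, vec in enumerate(vectors))
    # nsmallest keeps only the k best candidates, closest first
    nearest = heapq.nsmallest(k, scored)
    ids = [idx for _, idx in nearest]
    distances = [dist for dist, _ in nearest]
    return ids, distances

# Invented vocabulary and vectors for illustration
demo_vocab = ["hike", "hiking trail", "snow", "tent"]
demo_vectors = [[1.0, 0.1], [0.9, 0.3], [0.0, 1.0], [0.5, 0.5]]

knn_ids, knn_distances = brute_force_knn([1.0, 0.0], demo_vectors, k=2)
print([demo_vocab[i] for i in knn_ids])
```

This costs one comparison per stored vector for every query. An approximate index such as HNSW trades a small amount of accuracy to avoid that full scan.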

In [15]:
import nmslib

In [16]:
# initialize a new index, using a HNSW index on Cosine Similarity
index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(embeddings)
index.createIndex(print_progress=True)

In [22]:
# query for the nearest neighbours of the datapoint at index 25 (the phrase 'bag')
ids, distances = index.knnQuery(embeddings[25], k=10)
matches = [labels[phrases[idx]].lower() for idx in ids]
# to get the nearest neighbours for all datapoints at once,
# using a pool of 4 threads to compute:
#neighbours = index.knnQueryBatch(embeddings, k=10, num_threads=4)

In [23]:
print(matches)

['bag', 'bag ratings', 'bag covers', 'bag liner', 'garbage bags', 'wag bags', 'bag cooking', 'airbags', 'paper bag', 'tea bags']


In [40]:
from IPython.display import display,HTML
def print_labels(prefix,matches,labels):
    display(HTML('<h4>Results for: <em>'+prefix+'</em></h4>'))
    #skip the first (closest) match and print the rest as label<TAB>score
    for l,d in matches[1:]:
        print(l + '\t' + str(d))

In [41]:
def paraphrase(query,model,index,vocab,k=20):
    #encode the query and return up to k (phrase, cosine similarity) pairs from the index
    matches = []
    embeddings = model.encode([query], convert_to_tensor=True)
    ids, distances = index.knnQuery(embeddings[0], k=k)
    for i in range(len(ids)):
        text = vocab[ids[i]]
        #the cosinesimil space returns a distance, so similarity is 1 - distance
        dist = 1.0-distances[i]
        matches.append((text,dist))
    return matches

In [42]:
def autocomplete(prefix):
    matches = paraphrase(prefix,stsb,index,phrases)
    print_labels(prefix,matches,labels)

In [43]:
autocomplete('forest hike')

mountain hiking	0.7706536054611206
desert hiking	0.7556552290916443
distance hiking trail	0.7531936764717102
mountain hike	0.7473267912864685
distance hiking	0.7289175391197205
go hike	0.7274163961410522
winter hike	0.7262912392616272
trail hiking	0.7201042175292969
winter hiking	0.7133822441101074
hiking trail	0.7077224254608154
day hiking	0.702231764793396
distance hike	0.6980176568031311
have hike	0.6854838132858276
country trail	0.6848740577697754
training hike	0.6723088026046753
day hike	0.6639741659164429
us forest service	0.6488363742828369
camp trek	0.6486558318138123
hiking path	0.6394617557525635


In [44]:
autocomplete('campfire')

camp fire	0.9566243886947632
campfire impact	0.9282976388931274
camping fuel	0.8655523061752319
camping stove	0.8239544630050659
camp stove	0.7969684600830078
cooking fire	0.7753304839134216
campground	0.7744450569152832
fireplace	0.7649710774421692
camping area	0.7596511244773865
have camp	0.7553194761276245
fire	0.7407616376876831
camping	0.7358313798904419
camp site	0.7296763062477112
camping place	0.7286021709442139
camping site	0.7240110635757446
camping gear	0.7222967743873596
camping spot	0.721274733543396
camping supply	0.7174286842346191
camping tent	0.716825544834137


In [45]:
autocomplete('abandoned cabin')

evacuation	0.6357452869415283
disintegrate	0.5935385227203369
get rid	0.5880102515220642
castaway	0.5780271887779236
disclose	0.5759261846542358
displace	0.571616530418396
disappear	0.568480372428894
removal	0.5595774054527283
withdrawal	0.5588412880897522
have leave	0.5517101883888245
forgo	0.5505939722061157
evacuate	0.5493436455726624
have disappear	0.5319399237632751
withdraw	0.5309345722198486
disregard	0.528568685054779
disperse	0.5247459411621094
conserve	0.5163378119468689
go camp	0.5152520537376404
have remove	0.5133577585220337


In [49]:
autocomplete('seek')

searching	0.8038869500160217
have search	0.7866581082344055
search	0.7846546769142151
find	0.7812220454216003
try search	0.7604191303253174
help find	0.7396721839904785
finding	0.6903938055038452
hunt	0.6544297933578491
have find	0.63739413022995
go hunt	0.6148389577865601
search term	0.5935027599334717
route finding	0.5739497542381287
discover	0.5723604559898376
do find	0.5667706727981567
treasure hunt	0.5396451950073242
search engine	0.5318857431411743
have discover	0.5303488373756409
search party	0.5229660272598267
go look	0.5206515789031982


In [34]:
terms = [k for (k,v) in topcons.items()]
originals = []
candidates = []
scores = []
for term in tqdm.tqdm(terms[0:1000]):
    #use a separate name so the global labels dictionary is not overwritten
    results = paraphrase(term,stsb,index,phrases,k=25)
    originals += [term]*len(results)
    candidates += [r[0] for r in results]
    scores += [r[1] for r in results]

100%|██████████| 1000/1000 [00:53<00:00, 18.57it/s]


In [77]:
pairs = pandas.DataFrame({'term':originals,'candidate':candidates,'score':scores})

In [78]:
pairs

       term  candidate     score
0      have       have  1.000000
1      have  have have  0.932389
2      have   use have  0.770543
3      have   say have  0.756002
4      have  find have  0.750577
...     ...        ...       ...
22002  goal     negate  0.517945
22003  goal  expansion  0.514603
22004  goal  fiberglas  0.509877
22005  goal         gf  0.501710


In [81]:
pairs.to_csv('pairs.csv')

In [250]:
outdoors_posts = getData('../../../../temp/outdoors/posts.csv')

In [274]:
import re
import html

for text,row in outdoors:
    tags = []
    if row["post_type_id"]==1:
        if row["tags"] and isinstance(row["tags"],str):
            #unescape the tag string, then split on angle brackets and drop empty pieces
            tags = [t for t in re.split(r"[<>]", html.unescape(row["tags"])) if len(t)]
        print(tags)
        print(row)

['health', 'first-aid', 'blisters']
id                                                                          1
post_type_id                                                                1
accepted_answer_id                                                         12
parent_id                                                                 NaN
creation_date                                         2012-01-24T19:55:57.057
deletion_date                                                             NaN
score                                                                      31
view_count                                                               7383
body                        &lt;p&gt;A few times I've been out walking or ...
owner_user_id                                                               9
owner_display_name                                                        NaN
last_editor_user_id                                                     12892
last_editor_display_name    

Name: 1274, dtype: object
['fire']
id                                                                       1353
post_type_id                                                                1
accepted_answer_id                                                       1357
parent_id                                                                 NaN
creation_date                                         2012-04-17T01:51:55.920
deletion_date                                                             NaN
score                                                                      11
view_count                                                               2391
body                        &lt;p&gt;If I build a campfire, but need to mo...
owner_user_id                                                             432
owner_display_name                                                        NaN
last_editor_user_id                                                       NaN
last_editor_display_name     

Name: 2216, dtype: object
['skiing']
id                                                                       3341
post_type_id                                                                1
accepted_answer_id                                                       3351
parent_id                                                                 NaN
creation_date                                         2012-12-11T04:26:18.400
deletion_date                                                             NaN
score                                                                      17
view_count                                                              23126
body                        &lt;p&gt;Occasionally I've seen skiers free-he...
owner_user_id                                                            1926
owner_display_name                                                        NaN
last_editor_user_id                                                       NaN
last_editor_display_name   

Name: 3288, dtype: object
['hiking', 'shoes', 'footwear', 'walking']
id                                                                       4488
post_type_id                                                                1
accepted_answer_id                                                       4490
parent_id                                                                 NaN
creation_date                                         2013-08-19T09:47:06.310
deletion_date                                                             NaN
score                                                                       8
view_count                                                               1225
body                        &lt;p&gt;The title may itself sound a little w...
owner_user_id                                                            2303
owner_display_name                                                        NaN
last_editor_user_id                                                      

Name: 4291, dtype: object
['boats', 'gps', 'theft']
id                                                                       5628
post_type_id                                                                1
accepted_answer_id                                                        NaN
parent_id                                                                 NaN
creation_date                                         2014-04-28T19:27:50.887
deletion_date                                                             NaN
score                                                                       6
view_count                                                                190
body                        &lt;p&gt;I'm planning to buy this &lt;a href=&...
owner_user_id                                                            1863
owner_display_name                                                        NaN
last_editor_user_id                                                      8794
last_editor_

['rock-climbing', 'climbing', 'trad-climbing']
id                                                                      20162
post_type_id                                                                1
accepted_answer_id                                                      20163
parent_id                                                                 NaN
creation_date                                         2018-08-09T08:15:09.300
deletion_date                                                             NaN
score                                                                      12
view_count                                                                933
body                        &lt;p&gt;I occasionally place slings on rock s...
owner_user_id                                                             NaN
owner_display_name                                                   user2766
last_editor_user_id                                                     11623
last_editor_disp

['safety', 'animals', 'coyotes', 'pack-animals']
id                                                                      21199
post_type_id                                                                1
accepted_answer_id                                                        NaN
parent_id                                                                 NaN
creation_date                                         2018-12-05T06:26:42.650
deletion_date                                                             NaN
score                                                                       5
view_count                                                                372
body                        &lt;p&gt;I live in a city in Ontario, Canada. ...
owner_user_id                                                           17003
owner_display_name                                                        NaN
last_editor_user_id                                                      2157
last_editor_dis

Name: 17879, dtype: object
['united-states', 'animals']
id                                                                      22129
post_type_id                                                                1
accepted_answer_id                                                        NaN
parent_id                                                                 NaN
creation_date                                         2019-05-15T14:29:50.537
deletion_date                                                             NaN
score                                                                       2
view_count                                                                 73
body                        &lt;p&gt;Each year pronghorn shed the outer co...
owner_user_id                                                            8794
owner_display_name                                                        NaN
last_editor_user_id                                                      8794
last_edi

Name: 18774, dtype: object
['rescue']
id                                                                      24156
post_type_id                                                                1
accepted_answer_id                                                      24157
parent_id                                                                 NaN
creation_date                                         2019-09-19T20:33:11.910
deletion_date                                                             NaN
score                                                                       1
view_count                                                                 83
body                        &lt;p&gt;Garmin inReach devices and its subscr...
owner_user_id                                                            3857
owner_display_name                                                        NaN
last_editor_user_id                                                       NaN
last_editor_display_name  