## [ Chapter 13 - Semantic Search with Dense Vectors ] 
## Setting up the outdoors dataset

We're going to use the Outdoors dataset for this chapter, and for a very important reason: the vocabulary and contexts of the outdoors question-and-answer __domain__ already have good coverage in the Transformer models we'll be using.

This is because the datasets used to train the models include sources that are likely to cover similar subject matter. Wikipedia was used to train bert-base-uncased (https://huggingface.co/bert-base-uncased#training-data) and, surprise, Wikipedia has a section specifically on outdoors content: https://en.wikipedia.org/wiki/Outdoor

This is important because a model will be less accurate on words and contexts it hasn't seen during training.
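As a rough illustration of what "coverage" means here, the sketch below measures the fraction of word tokens in a piece of text that appear in a vocabulary. The function name `vocabulary_coverage` and the toy vocabulary are hypothetical and exist only for this example. A real model such as bert-base-uncased uses subword tokenization over a vocabulary of tens of thousands of entries, so treat this as a crude proxy, not as the way the model itself handles unknown words.

```python
import re

def vocabulary_coverage(text, vocabulary):
    """Return the fraction of word tokens in `text` that appear in `vocabulary`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)

# Hypothetical toy vocabulary, made up for this example only.
toy_vocabulary = {
    "how", "do", "i", "treat", "hot", "spots", "and",
    "blisters", "when", "have", "no",
}

# Title of the first post in the Outdoors dataset.
question = "How do I treat hot spots and blisters when I have no moleskin?"
coverage = vocabulary_coverage(question, toy_vocabulary)

# Only "moleskin" is missing from the toy vocabulary: 12 of 13 tokens are known.
print(f"coverage: {coverage:.2f}")
```

The lower this number is for a domain, the more often the model meets terms whose contexts it has not learned well.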

Also, who doesn't enjoy playing around with a new dataset?! Data is search nerd candy.

In [4]:
import sys
import pickle
import json
sys.path.append('..')
from aips import *
from IPython.display import display,HTML
engine = get_engine()

## Listing 13.1

### Creating our collection and indexing the documents

In [5]:
#outdoors
![ ! -d 'outdoors' ] && git clone --depth=1 https://github.com/ai-powered-search/outdoors.git
! cd outdoors && git pull
! cd outdoors && cat outdoors.tgz.part* > outdoors.tgz
! cd outdoors && mkdir -p '../../data/outdoors/' && tar -xvf outdoors.tgz -C '../../data/outdoors/'

Cloning into 'outdoors'...
remote: Enumerating objects: 25, done.
remote: Counting objects: 100% (25/25), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 25 (delta 0), reused 22 (delta 0), pack-reused 0
Receiving objects: 100% (25/25), 491.39 MiB | 11.20 MiB/s, done.
Already up to date.
README.md
concepts.pickle
._guesses.csv
guesses.csv
._guesses_all.json
guesses_all.json
outdoors_concepts.pickle
outdoors_embeddings.pickle
._outdoors_golden_answers.csv
outdoors_golden_answers.csv
._outdoors_golden_answers.xlsx
outdoors_golden_answers.xlsx
._outdoors_golden_answers_20210130.csv
outdoors_golden_answers_20210130.csv
outdoors_labels.pickle
outdoors_question_answering_contexts.json
outdoors_questionanswering_test_set.json
outdoors_questionanswering_train_set.json
._posts.csv
posts.csv
predicates.pickle
pull_aips_dependency.py
._question-answer-seed-contexts.csv
question-answer-seed-contexts.csv
question-answer-squad2-guesses.csv
._roberta-base-squad2-outdoo
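The shell commands in this cell reassemble a split archive with `cat` and then unpack it with `tar`. On a system without those tools, the same two steps can be sketched with the Python standard library. The helper name `join_and_extract` is hypothetical and is not part of the `aips` package.

```python
import glob
import os
import shutil
import tarfile

def join_and_extract(parts_pattern, archive_path, target_dir):
    """Concatenate split archive parts in sorted order, then extract the result."""
    # Sorting mirrors the lexicographic order in which the shell expands "part*".
    part_paths = sorted(glob.glob(parts_pattern))
    if not part_paths:
        raise FileNotFoundError(f"No files match {parts_pattern}")

    # Equivalent of: cat outdoors.tgz.part* > outdoors.tgz
    with open(archive_path, "wb") as archive:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                shutil.copyfileobj(part, archive)

    # Equivalent of: mkdir -p <target_dir> && tar -xvf <archive> -C <target_dir>
    os.makedirs(target_dir, exist_ok=True)
    # "r:*" lets tarfile detect the compression (gzip for a .tgz) on its own.
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(target_dir)
    return part_paths

# Example call, using the same paths as the shell commands:
# join_and_extract("outdoors/outdoors.tgz.part*",
#                  "outdoors/outdoors.tgz",
#                  "../data/outdoors/")
```

Only extract archives from sources you trust, since `extractall` writes whatever paths the archive contains.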

In [8]:
from densevectors.outdoors import *
#Transform the outdoors posts.csv file into Solr documents
outdoors_collection = engine.create_collection("outdoors")
dataframe = clean_dataset("../data/outdoors/posts.csv")
data = transform_dataframe(dataframe)
print(data[0])
outdoors_collection.add_documents(data)

  soup = BeautifulSoup(h, 'lxml')


{'id': '1', 'url': 'https://outdoors.stackexchange.com/questions/1', 'post_type_id': 1, 'accepted_answer_id': 12, 'parent_id': None, 'score': 31, 'view_count': 7383, 'body': "A few times I've been out walking or backpacking and have no moleskin with me. Sure enough, my boots start giving me blisters. How can I prevent or treat hotspots and blisters without using moleskin?", 'title': 'How do I treat hot spots and blisters when I have no moleskin?', 'tags': ['health', 'first aid', 'blisters'], 'answer_count': 4}
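The printed document gives a hint of what the cleaning and transformation steps produce: plain-text fields and a list of tags. The sketch below shows one way a raw row might be turned into such a document. It is not the implementation of `clean_dataset` or `transform_dataframe`. The column names (`Id`, `PostTypeId`, `Title`, `Body`, `Tags`), the `<tag-name>` tag format, and all three helper functions are assumptions made for illustration.

```python
import html
import re

def parse_tags(raw_tags):
    """Turn '<health><first-aid><blisters>' into ['health', 'first aid', 'blisters']."""
    if not raw_tags:
        return []
    return [tag.replace("-", " ") for tag in re.findall(r"<([^<>]+)>", raw_tags)]

def strip_markup(raw_html):
    """Remove tags and decode entities; a crude stand-in for a real HTML parser."""
    text = re.sub(r"<[^>]+>", " ", raw_html or "")
    return re.sub(r"\s+", " ", html.unescape(text)).strip()

def row_to_document(row):
    """Map one raw row (hypothetical column names) to a search document."""
    return {
        "id": str(row["Id"]),
        "post_type_id": int(row["PostTypeId"]),
        "title": row.get("Title"),
        "body": strip_markup(row.get("Body")),
        "tags": parse_tags(row.get("Tags")),
    }

# Hypothetical raw row, invented for this example.
example_row = {
    "Id": 1,
    "PostTypeId": 1,
    "Title": "How do I treat hot spots and blisters when I have no moleskin?",
    "Body": "<p>Boots &amp; blisters</p>",
    "Tags": "<health><first-aid><blisters>",
}
document = row_to_document(example_row)
print(document)
```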


In [10]:
import html  # needed for html.unescape below

def query_collection(query, verbose=False):
    # Keyword (edismax) search over titles and bodies, with highlighting enabled
    request = {
        "query": query,
        "fields": ["id", "title", "url"],
        "params": {
          "fq":"post_type_id:1",  # only question posts
          "qf": ["title", "body"],
          "defType": "edismax",
          "indent": "true",
          "hl": True
        }
    }
    response = outdoors_collection.search(request)
    # Pair each returned document with a highlighting entry, by position
    highlights = list(response["highlighting"].values())
    for doc, highlight in zip(response["response"]["docs"], highlights):
        print((doc, highlight))
        display(HTML('<a href="' + doc["url"] + '">qid ' + doc["id"] + '</a>&nbsp;'))
        if "title" in doc:
            display(HTML("<strong>Question Title:</strong>" + doc["title"]))
        # Skip the body snippet when the match was found only in the title
        if verbose and "body" in highlight:
            display(HTML("<strong>Body:</strong>" + html.unescape(str(highlight["body"][0]))))
            display(HTML("<hr>"))
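`query_collection` pairs documents with highlights by position, which relies on both parts of the response being in the same order. In a Solr response the `highlighting` section is keyed by each document's unique key, so a lookup by `id` is a sturdier alternative. The sketch below assumes the engine returns that Solr-style shape. The helper `pair_docs_with_highlights` and the mock response are hypothetical.

```python
def pair_docs_with_highlights(response):
    """Return (doc, highlight) tuples, looking up each highlight by document id."""
    highlighting = response.get("highlighting", {})
    return [
        (doc, highlighting.get(doc["id"], {}))
        for doc in response["response"]["docs"]
    ]

# Hypothetical response whose highlighting keys are in a different order than the docs.
mock_response = {
    "response": {
        "docs": [
            {"id": "2", "title": "Second question"},
            {"id": "1", "title": "First question"},
        ]
    },
    "highlighting": {
        "1": {"title": ["<em>First</em> question"]},
        "2": {"body": ["Text of the <em>second</em> question"]},
    },
}

pairs = pair_docs_with_highlights(mock_response)
for doc, highlight in pairs:
    print(doc["id"], highlight)
```

A document with no highlighting entry gets an empty dict, so the caller can test for `"body"` or `"title"` before reading a snippet.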

## Listing 13.2

### Querying our collection with a noun phrase

In [11]:
query_collection("climbing knots",verbose=True)

({'id': '21855', 'url': 'https://outdoors.stackexchange.com/questions/21855', 'title': 'What are the four climbing knots used by Jim Bridwell?'}, {'title': ['What are the four <em>climbing</em> <em>knots</em> used by Jim Bridwell?'], 'body': ['Long & Gaines <em>Climbing</em> Anchors says: Jim Bridwell, one of the most experienced climbers in the history of the sport, uses only four <em>knots</em> for any and all <em>climbing</em> situations. ']})


({'id': '18825', 'url': 'https://outdoors.stackexchange.com/questions/18825', 'title': "What's a good resource for learning to tie knots for climbing?"}, {'title': ["What's a good resource for learning to tie <em>knots</em> for <em>climbing</em>?"], 'body': ["If I wanted to learn how to tie certain <em>knots</em>, or learn about new <em>knots</em> and what they're used for, what are some good resources to look up?"]})


({'id': '18814', 'url': 'https://outdoors.stackexchange.com/questions/18814', 'title': 'How to tie a figure eight on a bight?'}, {'body': ['One of the more common <em>knots</em> in <em>climbing</em> is the figure eight on a bight (being tied on a bight means that it is tied without using the ends. ']})


({'id': '9183', 'url': 'https://outdoors.stackexchange.com/questions/9183', 'title': 'Can rock climbers easily transition to canyoning?'}, {'body': ['Rock <em>climbing</em> and canyoning/canyoneering are very different. ']})


({'id': '20223', 'url': 'https://outdoors.stackexchange.com/questions/20223', 'title': 'How to tie a Purcell prusik?'}, {'body': ['One of the more useful <em>knots</em> for <em>climbing</em> is the Purcell prusik (related to but not a regular prussik), which can be used as an adjustable tether or for ascending a rope among other things. ']})


({'id': '22477', 'url': 'https://outdoors.stackexchange.com/questions/22477', 'title': 'Tradeoffs between different stopper knots'}, {'title': ['Tradeoffs between different stopper <em>knots</em>'], 'body': ['In particular, are some stopper <em>knots</em> dangerous to use? I would imagine that a single overhand could be a risky choice (particularly with a short tail).']})


({'id': '22369', 'url': 'https://outdoors.stackexchange.com/questions/22369', 'title': '"Large" eye of tie-in knot: dangerous?'}, {'body': ["While I don't see a problem with a loop that is about 10 cm across, others argue that the loop should be as tight as possible. We're <em>climbing</em> indoors and use figure-of-eight or double bowline as tie-in <em>knots</em>. "]})


({'id': '338', 'url': 'https://outdoors.stackexchange.com/questions/338', 'title': 'How do I tie a Prusik knot?'}, {'body': ["Prusik <em>knots</em> are useful in emergency situations as they can be used for both <em>climbing</em> up a rope (out of a canyon) or down a rope (rappelling) without someone else providing a fireman's belay. "]})


({'id': '11198', 'url': 'https://outdoors.stackexchange.com/questions/11198', 'title': 'Is a barrel knot the same as a double/triple overhand knot?'}, {'body': ['Confused, I looked for other uses of the term barrel knot and found this snippet on the Animated <em>Knots</em> page for a barrel hitch (emphasis mine): "Barrel" Confusion: Unfortunately, the name Barrel has been applied to various other <em>knots</em>. ']})


({'id': '16336', 'url': 'https://outdoors.stackexchange.com/questions/16336', 'title': 'Is a clove hitch to be distrusted in climbing?'}, {'title': ['Is a clove hitch to be distrusted in <em>climbing</em>?'], 'body': ['Reading the description for a clove hitch on animated <em>knots</em> by Grog (the go to bible of <em>knots</em> IMO) I was surprised to read: Caution: The Clove Hitch was, originally, included here with the intention of condemning it. ']})


In [12]:
query_collection("snow camping safety")

[{'title': ['Best approach to <em>camping</em> in <em>snow</em>?'], 'body': ['What is the best way for a trekking group of two to sleep in deep <em>snow</em> (meter or more). What']}, {'title': ['How do people carry <em>camping</em> propane in a car when traveling?'], 'body': ["I'm very new to <em>camping</em>. In fact..I'm trying to prepare for my first <em>camping</em>. One thing that I"]}, {'title': ['Resources for <em>camping</em> in Georgia (country)'], 'body': [' <em>camping</em> in Georgia? I would like to know about the rules and <em>safety</em> aspects. Any resource on these would']}, {'title': ['Any tips to prevent theft while tent <em>camping</em> alone in a caravan park?'], 'body': ['This summer. I will be <em>camping</em> on my own at a coastal caravan park located between 2 towns. In']}, {'title': ['What are some simple tasks to teach knife <em>safety</em>?'], 'body': [' (hiking, <em>camping</em>, backpacking, etc). As part of this interest, he has asked me to teach him h

({'id': '20725', 'url': 'https://outdoors.stackexchange.com/questions/20725', 'title': 'How do people carry camping propane in a car when traveling?'}, {'title': ['How do people carry <em>camping</em> propane in a car when traveling?'], 'body': ["I'm very new to <em>camping</em>. In fact..I'm trying to prepare for my first <em>camping</em>. One thing that I"]})


({'id': '18942', 'url': 'https://outdoors.stackexchange.com/questions/18942', 'title': 'Resources for camping in Georgia (country)'}, {'title': ['Resources for <em>camping</em> in Georgia (country)'], 'body': [' <em>camping</em> in Georgia? I would like to know about the rules and <em>safety</em> aspects. Any resource on these would']})


({'id': '13209', 'url': 'https://outdoors.stackexchange.com/questions/13209', 'title': 'Any tips to prevent theft while tent camping alone in a caravan park?'}, {'title': ['Any tips to prevent theft while tent <em>camping</em> alone in a caravan park?'], 'body': ['This summer. I will be <em>camping</em> on my own at a coastal caravan park located between 2 towns. In']})


({'id': '8700', 'url': 'https://outdoors.stackexchange.com/questions/8700', 'title': 'What are some simple tasks to teach knife safety?'}, {'title': ['What are some simple tasks to teach knife <em>safety</em>?'], 'body': [' (hiking, <em>camping</em>, backpacking, etc). As part of this interest, he has asked me to teach him how to']})


({'id': '3822', 'url': 'https://outdoors.stackexchange.com/questions/3822', 'title': 'End of Winter/Spring Camping'}, {'title': ['End of Winter/Spring <em>Camping</em>'], 'body': ['Planning a <em>camping</em> trip for next weekend. It is looking like it may be the end of winter. We have']})


({'id': '14565', 'url': 'https://outdoors.stackexchange.com/questions/14565', 'title': 'What is the best breathing holes strategy for a quinzee?'}, {'body': ["Since it's the season coming up for many (and <em>snow</em> seems to be in the game this year in eastern NA"]})


({'id': '9986', 'url': 'https://outdoors.stackexchange.com/questions/9986', 'title': 'How to prevent frost / ice damage camping equipment?'}, {'title': ['How to prevent frost / ice damage <em>camping</em> equipment?'], 'body': ["When <em>camping</em> in summer rain overnight or dew in the morning is fine as it'll likely all evaporate"]})


({'id': '20208', 'url': 'https://outdoors.stackexchange.com/questions/20208', 'title': 'What are the primary considerations for a solo traverse of the Cairngorms in winter snow?'}, {'title': ['What are the primary considerations for a solo traverse of the Cairngorms in winter <em>snow</em>?'], 'body': [' Dec 2018 to February 2019 I am planning a solo traverse of the Cairngorms in Scotland, <em>camping</em> wild or']})


({'id': '20746', 'url': 'https://outdoors.stackexchange.com/questions/20746', 'title': 'What are some common mistakes to avoid when trying winter camping for the first time?'}, {'title': ['What are some common mistakes to avoid when trying winter <em>camping</em> for the first time?'], 'body': ['I would like to try winter <em>camping</em>. What are some common mistakes first-timers make when trying to']})


## Listing 13.3

### Querying our collection with a question

In [13]:
query_collection("What is DEET?",verbose=True)

[{'title': ['<em>What</em> <em>is</em> Geocaching?'], 'body': [' <em>is</em> it played?']}, {'title': ['<em>What</em> <em>is</em> bushcrafting?'], 'body': [' <em>is</em> not survival. <em>What</em> actually <em>is</em> bushcraft?']}, {'title': ['<em>What</em> <em>is</em> "catskiing"?'], 'body': [" winter vacation from school. Obviously, that doesn't mean skiing with or by means of a cat. <em>What</em> <em>is</em>"]}, {'title': ['<em>What</em> <em>is</em> a tent skirt and <em>what</em> <em>is</em> its purpose?'], 'body': [' the tent, which makes outer tent reach the ground, so wind <em>is</em> not beating into the inner tent ; it has']}, {'title': ['<em>What</em> <em>is</em> Fast Packing?'], 'body': [' like? <em>What</em> type of milage/hour are people doing, there chafing issues, and <em>is</em> this the type of activity']}, {'title': ['<em>What</em> exactly <em>is</em> a rocker?'], 'body': ['. <em>What</em> exactly <em>is</em> meant by a "Rocker" snowboard and <em>what</em> are the ad

({'id': '20403', 'url': 'https://outdoors.stackexchange.com/questions/20403', 'title': 'What is bushcrafting?'}, {'title': ['<em>What</em> <em>is</em> bushcrafting?'], 'body': [' <em>is</em> not survival. <em>What</em> actually <em>is</em> bushcraft?']})


({'id': '20977', 'url': 'https://outdoors.stackexchange.com/questions/20977', 'title': 'What is "catskiing"?'}, {'title': ['<em>What</em> <em>is</em> "catskiing"?'], 'body': [" winter vacation from school. Obviously, that doesn't mean skiing with or by means of a cat. <em>What</em> <em>is</em>"]})


({'id': '17374', 'url': 'https://outdoors.stackexchange.com/questions/17374', 'title': 'What is a tent skirt and what is its purpose?'}, {'title': ['<em>What</em> <em>is</em> a tent skirt and <em>what</em> <em>is</em> its purpose?'], 'body': [' the tent, which makes outer tent reach the ground, so wind <em>is</em> not beating into the inner tent ; it has']})


({'id': '13006', 'url': 'https://outdoors.stackexchange.com/questions/13006', 'title': 'What is Fast Packing?'}, {'title': ['<em>What</em> <em>is</em> Fast Packing?'], 'body': [' like? <em>What</em> type of milage/hour are people doing, there chafing issues, and <em>is</em> this the type of activity']})


({'id': '15679', 'url': 'https://outdoors.stackexchange.com/questions/15679', 'title': 'What exactly is a rocker?'}, {'title': ['<em>What</em> exactly <em>is</em> a rocker?'], 'body': ['. <em>What</em> exactly <em>is</em> meant by a "Rocker" snowboard and <em>what</em> are the advantages or disadvantages?']})


({'id': '10157', 'url': 'https://outdoors.stackexchange.com/questions/10157', 'title': 'What is a bow shelf?'}, {'title': ['<em>What</em> <em>is</em> a bow shelf?'], 'body': [' can\'t really imagine <em>what</em> this might be. So, <em>what</em> <em>is</em> a "bow shelf"?']})


({'id': '24876', 'url': 'https://outdoors.stackexchange.com/questions/24876', 'title': 'What kind of mushroom is this?'}, {'title': ['<em>What</em> kind of mushroom <em>is</em> this?'], 'body': ["<em>What</em> kind of mushroom <em>is</em> this? There's a lot of similarity to a morel, but the body shape <em>is</em> not"]})


({'id': '913', 'url': 'https://outdoors.stackexchange.com/questions/913', 'title': 'What is a buff?'}, {'title': ['<em>What</em> <em>is</em> a buff?'], 'body': ['<em>What</em> <em>is</em> a "buff", how <em>is</em> it used, and why should I have one? I\'ve seen them advocated in numerous']})


({'id': '6397', 'url': 'https://outdoors.stackexchange.com/questions/6397', 'title': 'What is a bloquers?'}, {'title': ['<em>What</em> <em>is</em> a bloquers?'], 'body': ["Simple question, which has been prompted from this answer . <em>What</em> <em>is</em> a bloquers? I'm guessing it's"]})
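In the results shown above, the highlighted terms are "What" and "is" and the word DEET does not appear. A toy sketch of plain term-overlap scoring shows how purely lexical matching can reward such common words. It ignores the term weighting (for example, inverse document frequency) that a real engine applies, so it is a simplification and not a model of edismax scoring. The last title in the list is hypothetical; the first two come from the results above.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query, text):
    """Count how many distinct query terms also occur in the text."""
    return len(tokenize(query) & tokenize(text))

query = "What is DEET?"
titles = [
    "What is bushcrafting?",
    "What is a buff?",
    "Does DEET repel ticks?",  # hypothetical title, not from the dataset
]

# Python's sort is stable, so titles with equal scores keep their original order.
ranked = sorted(titles, key=lambda title: overlap_score(query, title), reverse=True)
for title in ranked:
    print(overlap_score(query, title), title)
```

Under this scoring the two titles that share only "what" and "is" with the query outrank the one that actually mentions DEET.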


Up next: [Introduction to Transformers](2.introduction-to-transformers.ipynb)