## [Chapter 13] Setting up the outdoors dataset

We're going to use the Outdoors dataset for this chapter, and for a very important reason: the vocabulary and contexts in the outdoors question-and-answer __domain__ already have good coverage in the Transformer models we'll be using.

This is because the datasets used to train the model include sources that are likely to cover similar subject matter. Wikipedia was used to train bert-base-uncased (https://huggingface.co/bert-base-uncased#training-data) and, surprise! Wikipedia has a section specifically on outdoors content: https://en.wikipedia.org/wiki/Outdoor

This is important, because a model is less accurate on words and contexts that it hasn't seen during training.
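
To make the point concrete, here is a minimal sketch of greedy longest-match-first subword tokenization in the style of WordPiece, the scheme BERT-family models use. The vocabulary below is a tiny hypothetical one made up for illustration (the real bert-base-uncased vocabulary has roughly 30,000 entries), and `wordpiece_tokenize` is an illustrative helper, not the library's implementation. Words the vocabulary covers stay whole; words it doesn't cover are split into pieces or mapped to `[UNK]`, which leaves the model with less to work with.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, WordPiece style."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"climbing", "knot", "##s", "camp", "##ing", "snow", "de", "##et"}

for word in ["climbing", "knots", "camping", "deet", "bivouac"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

With this toy vocabulary, "climbing" survives as a single token, "knots" and "deet" are split into two pieces each, and "bivouac" falls back to `[UNK]`.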

Also, who doesn't enjoy playing around with a new dataset?! Data is search nerd candy.

In [1]:
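import sys
import pickle
import json
import html      # html.unescape is used when rendering highlighted bodies
import requests  # used directly for the Solr update and select calls
sys.path.append('..')
from aips import *
from IPython.display import display,HTML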

## Listing 13.1

### Creating our Collection and Indexing the documents

In [2]:
outdoors_collection="outdoors"
create_collection(outdoors_collection)

upsert_string_field(outdoors_collection,"url")
upsert_integer_field(outdoors_collection,"post_type_id")
upsert_integer_field(outdoors_collection,"accepted_answer_id")
upsert_integer_field(outdoors_collection,"parent_id")
upsert_integer_field(outdoors_collection,"score")
upsert_integer_field(outdoors_collection,"view_count")
upsert_text_field(outdoors_collection,"body")
upsert_text_field(outdoors_collection,"title")
upsert_keyword_field(outdoors_collection,"tags")
upsert_integer_field(outdoors_collection,"answer_count")
upsert_integer_field(outdoors_collection,"owner_user_id")

Wiping 'outdoors' collection
[('action', 'CREATE'), ('name', 'outdoors'), ('numShards', 1), ('replicationFactor', 1)]
Creating 'outdoors' collection
Status: Success
Adding 'url' field to collection
Status: Success
Adding 'body' field to collection
Status: Success
Adding 'title' field to collection
Status: Success
Adding 'tags' field to collection
Status: Success


In [3]:
#outdoors
![ ! -d 'outdoors' ] && git clone https://github.com/ai-powered-search/outdoors.git
! cd outdoors && git pull
! cd outdoors && cat outdoors.tgz.part* > outdoors.tgz
! cd outdoors && mkdir -p '../../data/outdoors/' && tar -xvf outdoors.tgz -C '../../data/outdoors/'

Cloning into 'outdoors'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 71 (delta 0), reused 0 (delta 0), pack-reused 70
Unpacking objects: 100% (71/71), 1005.42 MiB | 2.68 MiB/s, done.
Already up to date.
README.md
concepts.pickle
._guesses.csv
guesses.csv
._guesses_all.json
guesses_all.json
outdoors_concepts.pickle
outdoors_embeddings.pickle
._outdoors_golden_answers.csv
outdoors_golden_answers.csv
._outdoors_golden_answers.xlsx
outdoors_golden_answers.xlsx
._outdoors_golden_answers_20210130.csv
outdoors_golden_answers_20210130.csv
outdoors_labels.pickle
outdoors_question_answering_contexts.json
outdoors_questionanswering_test_set.json
outdoors_questionanswering_train_set.json
._posts.csv
posts.csv
predicates.pickle
pull_aips_dependency.py
._question-answer-seed-contexts.csv
question-answer-seed-contexts.csv
question-answer-squad2-guesses.csv
._roberta-base-squad2-outdoors
roberta-base-squad2-outdoors/
roberta-base-squad2-o
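
The shell commands in this cell concatenate the `outdoors.tgz.part*` pieces into one archive and unpack it. For environments without a POSIX shell, the same two steps can be sketched with the Python standard library. The helper name `reassemble_and_extract` is hypothetical, and the sketch assumes that sorting the part filenames lexicographically gives the right order, as the shell glob does.

```python
import glob
import os
import shutil
import tarfile

def reassemble_and_extract(parts_glob, archive_path, dest_dir):
    """Concatenate split archive parts in sorted order, then extract the result."""
    parts = sorted(glob.glob(parts_glob))
    if not parts:
        raise FileNotFoundError(f"no files match {parts_glob}")
    # Equivalent of: cat outdoors.tgz.part* > outdoors.tgz
    with open(archive_path, "wb") as archive:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, archive)
    # Equivalent of: mkdir -p dest && tar -xf archive -C dest
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(dest_dir)
    return parts

# Possible usage from the notebook's directory (paths mirror the shell cell):
# reassemble_and_extract("outdoors/outdoors.tgz.part*",
#                        "outdoors/outdoors.tgz",
#                        "../data/outdoors/")
```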

In [4]:
from densevectors.outdoors import *
# Transform the outdoors posts.csv file into Solr documents
df = cleanDataset('../data/outdoors/posts.csv')
data = transformDataFrame(df)
print(len(data))

19585


In [6]:
print("\nAdding Documents to '" + outdoors_collection + "' collection")
response = requests.post(f"{SOLR_URL}/{outdoors_collection}/update?commit=true", json=data).json()
print(str(response))
# Parenthesize the conditional so the "Status: " prefix is printed in both cases
print("Status: " + ("Success" if response["responseHeader"]["status"] == 0 else "Failure"))


Adding Documents to 'outdoors' collection
{'responseHeader': {'rf': 1, 'status': 0, 'QTime': 4856}}
Status: Success
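
Posting all 19,585 documents in a single request works here, but for larger datasets it can help to send them in smaller batches. A minimal sketch, assuming a hypothetical `chunked` helper and an arbitrary batch size of 500 (neither comes from the notebook's code):

```python
def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in documents; in the notebook this would be `data`.
docs = [{"id": str(n)} for n in range(1200)]
batches = list(chunked(docs, 500))
print([len(b) for b in batches])  # [500, 500, 200]
```

Each batch could then be sent with the same `requests.post` call used in the cell above, checking `responseHeader.status` after every request.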


In [7]:
def query_collection(query,verbose=False):
    # Search question posts (post_type_id:1) over title and body using edismax
    request = {
        "query": query,
        "fields": ["id", "title", "url"],
        "params": {
          "fq":"post_type_id:1",
          "qf": ["title","body"],
          "defType": "edismax",
          "indent": "true",
          "hl": True
        }
    }
    display(HTML("<br/><strong>Query: </strong><i>" + query + "</i><br/><br/><strong>Ranked Docs:</strong>"))
    response = requests.post(f"{SOLR_URL}/{outdoors_collection}/select", json=request).json()
    # Solr keys the highlighting section by document id, so look snippets up
    # by id instead of relying on the two sections sharing the same order
    highlights = response.get("highlighting", {})
    for doc in response["response"]["docs"]:
        display(HTML('<a href="'+doc["url"]+'">qid '+doc["id"]+'</a>&nbsp;'))
        if "title" in doc:
            display(HTML('<strong>Question Title:</strong>' + doc["title"]))
        if verbose:
            # Fall back to an empty snippet when no body highlight came back
            body = highlights.get(doc["id"], {}).get("body", [""])[0]
            display(HTML('<strong>Body:</strong>'+html.unescape(str(body))))
            display(HTML('<hr>'))

## Listing 13.2

### Querying our collection with a noun phrase

In [8]:
query_collection("climbing knots",verbose=True)

In [9]:
query_collection("snow camping safety")

## Listing 13.3

### Querying our collection with a question

In [10]:
query_collection("What is DEET?",verbose=True)