## [Chapter 13] Setting up the outdoors dataset

We're going to use the Outdoors dataset for this chapter, and for a very important reason: the vocabulary and contexts of the outdoors question-and-answer __domain__ already have good coverage in the Transformer models we'll be using.

This is because the datasets that were used to train the models include sources that are likely to cover similar subject matter. Wikipedia was used to train bert-base-uncased (https://huggingface.co/bert-base-uncased#training-data) and, surprise! Wikipedia has content specifically about the outdoors: https://en.wikipedia.org/wiki/Outdoor

This is important because a model tends to be less accurate on words and contexts it hasn't seen during training.
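
One way to picture vocabulary coverage is to look at how a subword tokenizer splits words. The sketch below is a toy version of the greedy longest-match-first scheme that WordPiece-style tokenizers use. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are made up for illustration; this is not the real bert-base-uncased vocabulary or tokenizer.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first subword split, in the style of WordPiece."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is treated as unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for illustration (NOT the real bert-base-uncased vocabulary)
toy_vocab = {"tent", "hiking", "camp", "##ing", "##s", "car", "##abi", "##ner"}

for word in ["hiking", "camping", "carabiner", "xylophone"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

With this toy vocabulary, "hiking" stays whole, "camping" splits into `camp` and `##ing`, and "xylophone" falls through to `[UNK]`. The fewer whole words and familiar pieces a domain has in the vocabulary, the less the model has to work with.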

Also, who doesn't enjoy playing around with a new dataset?! Data is search nerd candy.

### Creating our Collection

In [1]:
import sys
import pickle
import json
sys.path.append('..')
from aips import *

In [2]:
outdoors_collection="outdoors"
create_collection(outdoors_collection)

upsert_integer_field(outdoors_collection,"post_type_id")
upsert_integer_field(outdoors_collection,"accepted_answer_id")
upsert_integer_field(outdoors_collection,"parent_id")
upsert_integer_field(outdoors_collection,"score")
upsert_integer_field(outdoors_collection,"view_count")
upsert_text_field(outdoors_collection,"body")
upsert_text_field(outdoors_collection,"title")
upsert_keyword_field(outdoors_collection,"tags")
upsert_integer_field(outdoors_collection,"answer_count")

Wiping 'outdoors' collection
[('action', 'CREATE'), ('name', 'outdoors'), ('numShards', 1), ('replicationFactor', 1)]
Creating outdoors' collection
Status: Success
Adding 'post_type_id' field to collection
Status: Success
Adding 'accepted_answer_id' field to collection
Status: Success
Adding 'parent_id' field to collection
Status: Success
Adding 'score' field to collection
Status: Success
Adding 'view_count' field to collection
Status: Success
Adding 'body' field to collection
Status: Success
Adding 'title' field to collection
Status: Success
Adding 'tags' field to collection
Status: Success
Adding 'answer_count' field to collection
Status: Success


In [3]:
from ltr.download import download, extract_tgz
import tarfile

dataset = ['https://github.com/ai-powered-search/outdoors/raw/master/outdoors.tgz']
download(dataset, dest='data/')
extract_tgz('data/outdoors.tgz') # -> Holds 'outdoors.csv', a big CSV file of the stackexchange outdoors dataset

data/outdoors.tgz already exists


In [4]:
from densevectors.outdoors import *
# Transform the extracted CSV file into Solr documents
df = cleanDataset('data/posts.csv')
data = transformDataFrame(df)
print(len(data))

19585
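
The internals of `cleanDataset` and `transformDataFrame` aren't shown here, so the following is only a hypothetical sketch of what turning CSV rows into Solr-ready documents could look like. It uses the field names from the schema defined above; the helper `row_to_doc` and the two sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Field names taken from the collection schema defined earlier
INTEGER_FIELDS = ["post_type_id", "accepted_answer_id", "parent_id",
                  "score", "view_count", "answer_count"]
TEXT_FIELDS = ["id", "title", "body", "tags"]

def row_to_doc(row):
    """Build one document dict, skipping empty values and casting integers."""
    doc = {}
    for field in TEXT_FIELDS:
        value = (row.get(field) or "").strip()
        if value:
            doc[field] = value
    for field in INTEGER_FIELDS:
        value = (row.get(field) or "").strip()
        if value:
            # Assumption: integer columns may have been written as floats such as "1.0"
            doc[field] = int(float(value))
    return doc

# Two made-up rows standing in for the real CSV file
sample_csv = io.StringIO(
    "id,post_type_id,parent_id,score,title,body\n"
    "1,1,,12,How do I pitch a tent?,Looking for tips.\n"
    "2,2,1.0,5,,Stake the corners first.\n"
)
docs = [row_to_doc(row) for row in csv.DictReader(sample_csv)]
print(docs)
```

Leaving empty values out of the document, rather than sending empty strings, keeps integer fields such as `parent_id` from receiving values they can't parse.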


In [5]:
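import requests  # explicit import; it may also be provided by the aips wildcard import

print("\nAdding Documents to '" + outdoors_collection + "' collection")
# Post all documents in one request and commit so they are searchable right away
response = requests.post(solr_url + outdoors_collection + "/update?commit=true", json=data).json()
print(str(response))
# Parenthesize the conditional so the "Status: " prefix is also printed on failure
print("Status: " + ("Success" if response["responseHeader"]["status"] == 0 else "Failure"))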


Adding Documents to 'outdoors' collection
{'responseHeader': {'rf': 1, 'status': 0, 'QTime': 2882}}
Status: Success
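
When a status line is built from a conditional expression, the parentheses matter: Python joins adjacent string literals before it evaluates the `if`/`else`, so `"Status: " "Success" if ok else "Failure"` yields a bare `Failure` on the error path. A small helper makes the intent explicit. The name `solr_status_message` is hypothetical, and the failing response below is hand-written for illustration.

```python
def solr_status_message(response):
    """Return a status line for a Solr JSON response (status 0 means success)."""
    header = response.get("responseHeader", {})
    return "Status: " + ("Success" if header.get("status") == 0 else "Failure")

# The response printed above, followed by a hand-written failing one
print(solr_status_message({"responseHeader": {"rf": 1, "status": 0, "QTime": 2882}}))
print(solr_status_message({"responseHeader": {"status": 400}}))
```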


In [8]:
import html
import requests
from IPython.display import display, HTML

def query_collection(query):
    # Search title and body with edismax and ask Solr for highlighted snippets
    request = {
        "query": query,
        "fields": ["id", "title"],
        "params": {
          "qf": ["title","body"],
          "defType": "edismax",
          "indent": "true",
          "hl": True
        }
    }
    display(HTML("<br/><strong>Query: </strong><i>" + html.escape(query) + "</i><br/><br/><strong>Ranked Docs:</strong>"))
    response = requests.post(solr_url + outdoors_collection + "/select", json=request).json()
    results = response["response"]["docs"]
    # This pairing assumes the highlights come back in the same order as the docs
    highlights = list(response["highlighting"].values())
    for doc, highlight in zip(results, highlights):
        if "title" in doc:
            display(HTML('<strong>' + doc["title"] + '</strong>'))
        # A document that matched only on its title may have no body snippet
        body_snippets = highlight.get("body", [])
        if body_snippets:
            display(HTML(html.unescape(str(body_snippets[0]))))
        display(HTML('<hr>'))
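
Solr keys the `highlighting` section of a response by document id, so snippets can also be looked up per document instead of relying on the two lists lining up. Here is a minimal sketch of that pairing, run against a hand-written mock response (not real Solr output). The helper name `pair_docs_with_highlights` is hypothetical.

```python
def pair_docs_with_highlights(response, field="body"):
    """Return (title, first snippet) pairs, looking up highlights by document id."""
    highlighting = response.get("highlighting", {})
    pairs = []
    for doc in response["response"]["docs"]:
        snippets = highlighting.get(doc["id"], {}).get(field, [])
        pairs.append((doc.get("title", ""), snippets[0] if snippets else ""))
    return pairs

# Hand-written mock of a Solr response, for illustration only
mock_response = {
    "response": {"docs": [
        {"id": "10", "title": "Uses for fallen branches"},
        {"id": "11"},  # an answer post with no title
        {"id": "12", "title": "Knife on a lanyard"},
    ]},
    "highlighting": {
        "10": {"body": ["You can use <em>branches</em> as tent stakes."]},
        "11": {"body": ["A <em>tree</em> limb makes a fine walking stick."]},
        "12": {},  # matched on title only, so no body snippet
    },
}

for title, snippet in pair_docs_with_highlights(mock_response):
    print(repr(title), "|", snippet)
```

Looking up by id also handles documents that have no snippet for the requested field, which come back here as an empty string rather than raising a `KeyError`.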

In [9]:
query_collection("uses for tree branches")

In [10]:
query_collection("pocket knife tied to a string")