## [Chapter 13] Setting up the outdoors dataset

We're going to use the Outdoors dataset for this chapter, and for a very important reason: the vocabulary and contexts of the outdoors question-and-answer __domain__ already have good coverage in the Transformer models we'll be using.

This is because the datasets used to train those models include sources that are likely to cover similar subject matter. Wikipedia was used to train bert-base-uncased (https://huggingface.co/bert-base-uncased#training-data) and, surprise! Wikipedia has a section specifically on outdoors content: https://en.wikipedia.org/wiki/Outdoor

This is important because if the words and their contexts weren't seen during training, the model will be less accurate.
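
To make the coverage point concrete, here is a minimal sketch of greedy longest-match subword tokenization, the general approach behind BERT-style WordPiece tokenizers. The vocabulary is a tiny made-up stand-in (not the real bert-base-uncased vocabulary), and `wordpiece_tokenize` is a hypothetical helper written only for illustration. Words the vocabulary covers stay whole, while words it doesn't cover are split into pieces or fall back to `[UNK]`, which leaves the model with less to work with.

```python
# Toy vocabulary: an assumption for illustration, NOT the real bert-base-uncased vocab.
TOY_VOCAB = {"climbing", "knots", "snow", "camping", "bi", "##vo", "##uac", "[UNK]"}

def wordpiece_tokenize(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedy longest-match-first subword split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

for word in ["climbing", "bivouac", "crampon"]:
    print(word, "->", wordpiece_tokenize(word))
```

With this toy vocabulary, "climbing" survives as a single token, "bivouac" is split into three pieces, and "crampon" is lost entirely.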

Also, who doesn't enjoy playing around with a new dataset?! Data is search nerd candy.

In [None]:
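import sys
import html      # html.unescape is used when rendering highlighted bodies
import json
import pickle
import requests  # used for all calls to the search engine
sys.path.append('..')
from aips import *
from IPython.display import display, HTML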

In [None]:
def add_outdoors_fields_to_schema(outdoors_collection):
    upsert_string_field(outdoors_collection,"url")
    upsert_integer_field(outdoors_collection,"post_type_id")
    upsert_integer_field(outdoors_collection,"accepted_answer_id")
    upsert_integer_field(outdoors_collection,"parent_id")
    upsert_integer_field(outdoors_collection,"score")
    upsert_integer_field(outdoors_collection,"view_count")
    upsert_text_field(outdoors_collection,"body")
    upsert_text_field(outdoors_collection,"title")
    upsert_keyword_field(outdoors_collection,"tags")
    upsert_integer_field(outdoors_collection,"answer_count")
    upsert_integer_field(outdoors_collection,"owner_user_id")

## Downloading and Processing the Outdoors Dataset

In [None]:
from densevectors.outdoors import *
    
def download_outdoors_dataset():
    from ltr.download import download, extract_tgz
    import tarfile

    dataset = ['https://github.com/ai-powered-search/outdoors/raw/master/outdoors.tgz']
    download(dataset, dest='data/')
    extract_tgz('data/outdoors.tgz') # -> Extracts the StackExchange Outdoors posts as a large CSV file (read below as 'data/posts.csv')
    
def process_outdoors_dataset():
    # Transform the posts.csv file into Solr documents
    df = cleanDataset('data/posts.csv')
    data = transformDataFrame(df)
    return data
    
def index_dataset_to_search_engine(outdoors_collection, data):
    print("\nAdding Documents to '" + outdoors_collection + "' collection")
    response = requests.post(solr_url + outdoors_collection + "/update?commit=true", json=data).json()
    print(str(response))
    print("Status: " + ("Success" if response["responseHeader"]["status"] == 0 else "Failure"))
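
The status line depends on the `responseHeader.status` field of the update response, where 0 indicates success. As a small sketch, the helper below derives the label from made-up response dictionaries so it can be checked without a running search engine. `status_label` is a hypothetical name, not part of the notebook's helper library.

```python
def status_label(response):
    """Return a 'Status: ...' label for a Solr update response dictionary."""
    ok = response.get("responseHeader", {}).get("status") == 0
    # Parentheses keep the "Status: " prefix in both branches of the conditional.
    return "Status: " + ("Success" if ok else "Failure")

# Made-up responses for illustration; real ones come from the /update endpoint.
print(status_label({"responseHeader": {"status": 0, "QTime": 12}}))
print(status_label({"responseHeader": {"status": 400, "QTime": 3}}))
```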

In [None]:
#Download the dataset one time up-front
download_outdoors_dataset()

## Listing 13.1

### Creating our Collection and Indexing the documents

In [None]:
outdoors_collection="outdoors"
create_collection(outdoors_collection)
add_outdoors_fields_to_schema(outdoors_collection)
index_dataset_to_search_engine(outdoors_collection, process_outdoors_dataset())

## Listing 13.2

### Sample Question and Answer Documents

In [None]:
def show_data_sample(post_id):
    request = {
        "query": "id:" + str(post_id) + " OR parent_id:" + str(post_id),
        "fields": ["id", "post_type_id", "title", "parent_id", "accepted_answer_id", "body"],
        "params": {
          "fq":"post_type_id:*",
          "indent": "true",
          "sort":  "post_type_id asc"
        }
    }
    
    response = requests.post(solr_url + outdoors_collection + "/select", json=request).json()
    results = json.dumps(list(response["response"]["docs"]), indent=4)
    print(results)

show_data_sample(7)
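
The query in `show_data_sample` matches a post by `id` together with every post whose `parent_id` points at it, then sorts by `post_type_id`. In StackExchange data dumps, a `post_type_id` of 1 marks a question and 2 marks an answer, so the question comes first and its answers follow. The sketch below mimics that selection on a few invented documents (the values are not taken from the dataset), and `thread_for` is a hypothetical helper.

```python
# Invented documents shaped like the indexed posts; values are illustrative only.
SAMPLE_DOCS = [
    {"id": "11", "post_type_id": 2, "parent_id": 7, "body": "Answer A"},
    {"id": "7", "post_type_id": 1, "accepted_answer_id": 12, "title": "A question"},
    {"id": "12", "post_type_id": 2, "parent_id": 7, "body": "Answer B"},
    {"id": "20", "post_type_id": 1, "title": "Another question"},
]

def thread_for(post_id, docs):
    """Return the post with this id plus its answers, question first."""
    matches = [d for d in docs
               if d["id"] == str(post_id) or d.get("parent_id") == post_id]
    return sorted(matches, key=lambda d: d["post_type_id"])

thread = thread_for(7, SAMPLE_DOCS)
question, answers = thread[0], thread[1:]
# The question's accepted_answer_id points at one of its answers.
accepted = [a for a in answers if a["id"] == str(question["accepted_answer_id"])]
print(question["title"], "| answers:", [a["id"] for a in answers],
      "| accepted:", accepted[0]["id"])
```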

In [None]:
def query_collection(query,verbose=False):
    request = {
        "query": query,
        "fields": ["id", "title", "url"],
        "params": {
          "fq":"post_type_id:1",
          "qf": ["title","body"],
          "defType": "edismax",
          "indent": "true",
          "hl": True
        }
    }
    display(HTML("<br/><strong>Query: </strong><i>" + query + "</i><br/><br/><strong>Ranked Docs:</strong>"))
    response = requests.post(solr_url + outdoors_collection + "/select", json=request).json()
    results = list(response["response"]["docs"])
    # Solr keys the highlighting section by document id, so look up each snippet by id
    highlights = response.get("highlighting", {})
    for doc in results:
        display(HTML('<a href="' + doc["url"] + '">qid ' + doc["id"] + '</a>&nbsp;'))
        if "title" in doc:
            display(HTML('<strong>Question Title:</strong>' + doc["title"]))
        if verbose:
            # A document may have no highlighted body snippet, so guard the lookup
            snippets = highlights.get(doc["id"], {}).get("body", [])
            if snippets:
                display(HTML('<strong>Body:</strong>' + html.unescape(str(snippets[0]))))
            display(HTML('<hr>'))
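
Solr returns highlighting as a separate section of the response, keyed by each document's unique id rather than as a list in ranked order. The sketch below pairs each ranked document with its snippet by id, using a made-up response dictionary in place of a live request. `pair_with_snippets` is a hypothetical helper and the document content is invented.

```python
# Made-up response shaped like Solr's JSON output with hl=true; content is invented.
FAKE_RESPONSE = {
    "response": {"docs": [
        {"id": "42", "title": "First question"},
        {"id": "17", "title": "Second question"},
    ]},
    "highlighting": {
        "17": {"body": ["a <em>second</em> snippet"]},
        "42": {},  # no highlighted field matched for this document
    },
}

def pair_with_snippets(response, field="body"):
    """Pair each ranked doc id with its first highlight snippet for field, or None."""
    highlighting = response.get("highlighting", {})
    pairs = []
    for doc in response["response"]["docs"]:
        snippets = highlighting.get(doc["id"], {}).get(field, [])
        pairs.append((doc["id"], snippets[0] if snippets else None))
    return pairs

print(pair_with_snippets(FAKE_RESPONSE))
```

Looking snippets up by id keeps each one attached to the right document even when the highlighting section is ordered differently from the ranked results or has no entry for a field.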

## Listing 13.3

### Querying our collection with a noun phrase

In [None]:
query_collection("climbing knots",verbose=True)

In [None]:
query_collection("snow camping safety")

## Listing 13.4

### Querying our collection with a question

In [None]:
query_collection("What is DEET?",verbose=True)