## [ Chapter 13 - Semantic Search with Dense Vectors ] 
## Setting up the Outdoors dataset

We're going to use the Outdoors dataset for this chapter, and for a very important reason: the vocabulary and contexts in the outdoors question-and-answer domain already have good coverage in the Transformer models we'll be using.

This is because the datasets used to train the models include sources that are likely to cover similar subject matter. Wikipedia was used to train bert-base-uncased (https://huggingface.co/bert-base-uncased#training-data) and, surprise, Wikipedia has a section specifically on outdoors content: https://en.wikipedia.org/wiki/Outdoor

This is important because a model tends to be less accurate on words and contexts it hasn't seen during training.
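
To make the vocabulary side of this concrete, here is a minimal sketch of greedy longest-match subword splitting in the style of WordPiece, the tokenization scheme used by BERT models. The `wordpiece_tokenize` function and the `TOY_VOCAB` set are hypothetical illustrations, not the real bert-base-uncased tokenizer or vocabulary. It only shows how a word missing from the vocabulary is broken into fragments or falls back to `[UNK]`; it says nothing about how well the model has learned the contexts those words appear in.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first subword split, in the style of WordPiece."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real bert-base-uncased vocabulary.
TOY_VOCAB = {"climbing", "knot", "##s", "camp", "##ing", "de", "##et"}

for word in ["climbing", "knots", "deet", "zyx"]:
    print(word, "->", wordpiece_tokenize(word, TOY_VOCAB))
```

With this toy vocabulary, `climbing` stays whole, `knots` and `deet` are split into two pieces each, and `zyx` cannot be represented at all.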

Also, who doesn't enjoy playing around with a new dataset?! Data is search nerd candy.

In [1]:
import sys
sys.path.append("../..")
import json
import html
from aips import get_engine, set_engine
from IPython.display import display, HTML

set_engine("opensearch")
engine = get_engine()

## Listing 13.1

### Creating our collection and indexing the documents

In [2]:
#outdoors
![ ! -d 'outdoors' ] && git clone --depth=1 https://github.com/ai-powered-search/outdoors.git
! cd outdoors && git pull
! cd outdoors && cat outdoors.tgz.part* > outdoors.tgz
! cd outdoors && mkdir -p '../data/outdoors/' && tar -xvf outdoors.tgz -C '../data/outdoors/'

Cloning into 'outdoors'...
remote: Enumerating objects: 25, done.
remote: Counting objects: 100% (25/25), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 25 (delta 0), reused 22 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (25/25), 491.39 MiB | 18.47 MiB/s, done.
Already up to date.
README.md
concepts.pickle
._guesses.csv
guesses.csv
._guesses_all.json
guesses_all.json
outdoors_concepts.pickle
outdoors_embeddings.pickle
._outdoors_golden_answers.csv
outdoors_golden_answers.csv
._outdoors_golden_answers.xlsx
outdoors_golden_answers.xlsx
._outdoors_golden_answers_20210130.csv
outdoors_golden_answers_20210130.csv
outdoors_labels.pickle
outdoors_question_answering_contexts.json
outdoors_questionanswering_test_set.json
outdoors_questionanswering_train_set.json
._posts.csv
posts.csv
predicates.pickle
pull_aips_dependency.py
._question-answer-seed-contexts.csv
question-answer-seed-contexts.csv
question-answer-squad2-guesses.csv
._roberta-base-squa

In [4]:
from aips.data_loaders.outdoors import load_dataframe

outdoors_collection = engine.create_collection("outdoors")
outdoors_dataframe = load_dataframe("data/outdoors/posts.csv")
outdoors_collection.write(outdoors_dataframe)
outdoors_dataframe.printSchema()

Wiping "outdoors" collection


ConnectionError: HTTPConnectionPool(host='aips-opensearch', port=9200): Max retries exceeded with url: /outdoors (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f17c8853a90>: Failed to establish a new connection: [Errno 111] Connection refused'))

## Listing 13.2

### Exploring data for a question regarding `climbing knots`

In [5]:
def get_posts_for_question(query):
    # Step 1: find the best-matching question that has an accepted answer.
    request = {"query": query,
               "limit": 1,
               "query_fields": ["title", "body"],
               "return_fields": ["id"],
               "filters": [("post_type", "question"), ("accepted_answer_id", "*")],
               "log": True}
    question_id = outdoors_collection.search(**request)["docs"][0]["id"]
    # Step 2: fetch that question plus the posts whose parent_id points to it.
    request = {"query": question_id,
               "query_fields": ["id", "parent_id"],
               "limit": 3,
               "return_fields": ["id", "post_type", "title", "body", "parent_id", "accepted_answer_id"],
               "log": True}
    return outdoors_collection.search(**request)["docs"]
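
The two-step lookup above can be sketched without a search engine. The example below is a simplified in-memory illustration under stated assumptions: `POSTS` is a hypothetical four-row sample (not rows from the Outdoors dataset), and `find_question` returns the first question containing a query term rather than ranking by relevance as the engine would.

```python
# Hypothetical sample posts; field names follow the request above.
POSTS = [
    {"id": 1, "post_type": "question", "title": "Which climbing knots should I learn?",
     "body": "I am new to climbing.", "parent_id": None, "accepted_answer_id": 2},
    {"id": 2, "post_type": "answer", "title": None,
     "body": "Start with the figure-eight.", "parent_id": 1, "accepted_answer_id": None},
    {"id": 3, "post_type": "answer", "title": None,
     "body": "The bowline is also useful.", "parent_id": 1, "accepted_answer_id": None},
    {"id": 4, "post_type": "question", "title": "Snow camping tips?",
     "body": "First winter trip.", "parent_id": None, "accepted_answer_id": None},
]


def find_question(query, posts):
    """Step 1: first question with an accepted answer that mentions a query term."""
    terms = query.lower().split()
    for post in posts:
        text = f'{post["title"]} {post["body"]}'.lower()
        if (post["post_type"] == "question"
                and post["accepted_answer_id"] is not None
                and any(term in text for term in terms)):
            return post
    return None


def get_thread(question_id, posts, limit=3):
    """Step 2: the question itself plus every post whose parent_id points to it."""
    related = [p for p in posts
               if p["id"] == question_id or p["parent_id"] == question_id]
    return related[:limit]


question = find_question("climbing knots", POSTS)
thread = get_thread(question["id"], POSTS)
print([post["id"] for post in thread])
```

The second step matches on both `id` and `parent_id` so that a single lookup returns the question together with its answers.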

In [6]:
get_posts_for_question("climbing knots")

Search Request:
{
  "query": {
    "query_string": {
      "query": "climbing knots",
      "boost": 0.454545454,
      "fields": [
        "title",
        "body"
      ]
    }
  },
  "size": 1,
  "fields": [
    "id"
  ]
}
Search Request:
{
  "query": {
    "query_string": {
      "query": "eklbRZIBIj6H5vdTILg1",
      "boost": 0.454545454,
      "fields": [
        "id",
        "parent_id"
      ]
    }
  },
  "size": 3,
  "fields": [
    "id",
    "post_type",
    "title",
    "body",
    "parent_id",
    "accepted_answer_id"
  ]
}


ValueError: {'error': {'root_cause': [{'type': 'query_shard_exception', 'reason': 'failed to create query: For input string: "eklbRZIBIj6H5vdTILg1"', 'index': 'outdoors', 'index_uuid': 'hf_aPScSTWiH08WZCIrt_Q'}], 'type': 'search_phase_execution_exception', 'reason': 'all shards failed', 'phase': 'query', 'grouped': True, 'failed_shards': [{'shard': 0, 'index': 'outdoors', 'node': 'WJD6SELYSaeWbVewbV0XTQ', 'reason': {'type': 'query_shard_exception', 'reason': 'failed to create query: For input string: "eklbRZIBIj6H5vdTILg1"', 'index': 'outdoors', 'index_uuid': 'hf_aPScSTWiH08WZCIrt_Q', 'caused_by': {'type': 'number_format_exception', 'reason': 'For input string: "eklbRZIBIj6H5vdTILg1"'}}}]}, 'status': 400}

----------------------------------------
Exception occurred during processing of request from ('127.0.0.1', 52282)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/opt/conda/lib/python3.10/socketserver.py", line 347, in process_request
    self.finish_request(request, client_address)
  File "/opt/conda/lib/python3.10/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/opt/conda/lib/python3.10/socketserver.py", line 747, in __init__
    self.handle()
  File "/usr/local/spark/python/pyspark/accumulators.py", line 281, in handle
    poll(accum_updates)
  File "/usr/local/spark/python/pyspark/accumulators.py", line 253, in poll
    if func():
  File "/usr/local/spark/python/pyspark/accumulators.py", line 257, in accum_updates
    num_updates = read_int(self.rfile)
  File "/usr/local/spark/python/

## Listing 13.3

### Querying our collection with a noun phrase

In [6]:
def display_questions(query, response, verbose=False):
    print(f"Query {query}:\n")
    if verbose:
        # Highlight snippets, assumed to be in the same order as response["docs"].
        highlights = list(response["highlighting"].values())
    print("Ranked Questions:")
    for i, q in enumerate(response["docs"]):
        if verbose:
            print(json.dumps(q, indent="  "))
        if "title" in q:
            # Link the question id to its source URL.
            link = f'<a href="{q["url"]}">{q["id"]}</a>'
            display(HTML(f'<strong>Question {link}: </strong>{q["title"]}'))
        if verbose:
            display(HTML("<strong>Body:</strong>" + html.unescape(str(highlights[i]["body"][0]))))
            display(HTML("<hr>"))

def search_questions(query, verbose=False):
    request = {"query": query,
               "query_fields": ["title", "body"],
               "limit": 5,
               "return_fields": ["id", "url", "post_type", "title",
                                 "body", "accepted_answer_id", "score"],
               "filters": [("post_type", "question")],
               "order_by": [("score", "desc"), ("title", "asc")]}
    response = outdoors_collection.search(**request)
    display_questions(query, response, verbose)
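
The `order_by` list in the request can be read as a prioritized list of sort keys. As a rough sketch of that behavior in plain Python, assuming each tuple is `(field, direction)` and that earlier tuples take priority, the hypothetical `order_docs` helper below sorts a small made-up list of documents. It is an illustration of multi-key ordering, not the engine's implementation.

```python
def order_docs(docs, order_by):
    """Sort docs by several (field, direction) keys, first key most significant."""
    ordered = list(docs)
    # Python's sort is stable, so applying the keys from least to most
    # significant leaves ties on a later pass in the order set by earlier passes.
    for field, direction in reversed(order_by):
        ordered.sort(key=lambda doc: doc[field], reverse=(direction == "desc"))
    return ordered


# Hypothetical documents for illustration only.
docs = [
    {"title": "B", "score": 5},
    {"title": "A", "score": 5},
    {"title": "C", "score": 9},
]

ranked = order_docs(docs, [("score", "desc"), ("title", "asc")])
print([doc["title"] for doc in ranked])
```

Here the document with the highest `score` comes first, and the two documents tied on `score` are ordered alphabetically by `title`.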

In [7]:
search_questions("climbing knots")

Query climbing knots:

Ranked Questions:


In [8]:
search_questions("snow camping safety")

Query snow camping safety:

Ranked Questions:


## Listing 13.4

### Querying our collection with a question

In [9]:
search_questions("What is DEET?")

Query What is DEET?:

Ranked Questions:


Up next: [Introduction to Transformers](2.introduction-to-transformers.ipynb)