## Load Solr with keyword facets

With the keywords generated and deduplicated, we now add them to the documents in our index. This lets us do the following:

* Show facets alongside search results so the user can drill down into the subset of results matching a facet.
* Improve query parsing by using the keywords as a dictionary, so multi-word entities can be recognized rather than just individual words.
* Support a rudimentary notion of "similar documents".
* On the content page, show "similar queries" that might be interesting given the facets assigned to the document.
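
As a rough illustration of the first point, the sketch below builds the request parameters for a faceted search against the `keywords` field. The helper name `build_facet_query` and the sample query are hypothetical; `facet`, `facet.field`, `facet.limit`, `facet.mincount` and `fq` are standard Solr request parameters. No request is sent, so it can be run before the index exists.

```python
from urllib.parse import urlencode


def build_facet_query(query, selected_keywords=(), facet_limit=10):
    """Build Solr select parameters that request keyword facet counts.

    selected_keywords holds the facet values the user has already clicked;
    each one becomes a filter query (fq) that narrows the result set.
    (Hypothetical helper, not part of the original notebook.)
    """
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.field", "keywords"),
        ("facet.limit", str(facet_limit)),
        ("facet.mincount", "1"),
    ]
    for keyword in selected_keywords:
        # quote the value so a multi-word keyword matches as a single term
        params.append(("fq", 'keywords:"{:s}"'.format(keyword)))
    return params


params = build_facet_query("text:kernel", ["neural network"])
print(urlencode(params))
```

The resulting parameter list could be passed as `params=` to `requests.get(SOLR_URL + "select", ...)` once the index has been loaded.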

In [1]:
import json
import requests
import os
import sqlite3

In [2]:
DATA_DIR = "../data"

WORDCOUNTS_DB = os.path.join(DATA_DIR, "wordcounts.db")

CURATED_KEYWORDS = os.path.join(DATA_DIR, "raw_keywords.txt")
NEARDUP_MAPPINGS = os.path.join(DATA_DIR, "keyword_neardup_mappings.tsv")
DEDUPE_MAPPINGS = os.path.join(DATA_DIR, "keyword_dedupe_mappings.tsv")

TEXTFILES_DIR = os.path.join(DATA_DIR, "textfiles")
METADATA_FILE = os.path.join(DATA_DIR, "papers_metadata.tsv")

SOLR_URL = "http://localhost:8983/solr/nips1index/"
# SOLR_URL = "http://localhost:8983/solr/nips2index/"

In [3]:
conn = sqlite3.connect(WORDCOUNTS_DB)

### Index with keyword facets

First we create a new core to hold our new index.

    cd <solr_home>
    bin/solr create -c nips1index
    
Then add the new schema. It has one additional multiValued string field, `keywords`, to hold our facets. We have also changed the type of the `authors` field to "string" so that we can facet on authors as well.

    cd ../scripts
    ./create_schema1.sh
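
The contents of `create_schema1.sh` are not shown here. As a sketch only, the two schema changes described above might be expressed as Solr Schema API commands like the ones below. The `indexed` and `stored` settings, the `multiValued` setting on `authors`, and the helper name `build_schema_commands` are assumptions, not taken from the script.

```python
import json


def build_schema_commands():
    """Return Schema API commands for the two fields described in the text.

    Hypothetical helper: the field attributes other than name and type
    are assumptions about what the real schema script does.
    """
    return {
        "add-field": [
            {
                # multiValued string field that holds the keyword facets
                "name": "keywords",
                "type": "string",
                "multiValued": True,
                "indexed": True,
                "stored": True,
            },
            {
                # authors as untokenized strings so they can be faceted on
                "name": "authors",
                "type": "string",
                "multiValued": True,
                "indexed": True,
                "stored": True,
            },
        ]
    }


commands = build_schema_commands()
print(json.dumps(commands, indent=2))
```

A payload like this would be sent as JSON in a POST to the core's `schema` endpoint, for example `http://localhost:8983/solr/nips1index/schema`.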


### Lookup tables for keywords

The full list of keywords is available in the `bigrams`, `rake` and `maui` tables, but it needs to be filtered against the manually curated list of keywords. The surviving keywords are then replaced with their canonical forms so they look nicer when displayed as facets.

We will also do a reverse lookup on these canonical form lookup tables when working on query expansion later.
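
The reverse lookup mentioned above could look something like the following sketch, which inverts a raw-to-canonical dictionary into a canonical-to-raw one. The helper name `build_canonical2raw` is hypothetical, and the toy mapping is made up for illustration; it is not taken from the mapping files.

```python
from collections import defaultdict


def build_canonical2raw(raw2canonical):
    """Invert a raw -> canonical mapping into canonical -> sorted raw forms."""
    canonical2raw = defaultdict(set)
    for raw, canonical in raw2canonical.items():
        canonical2raw[canonical].add(raw)
    # sort each group so the result is deterministic
    return {canonical: sorted(raws) for canonical, raws in canonical2raw.items()}


# toy mapping for illustration only (assumed values)
toy_raw2canonical = {
    "neural nets": "neural network",
    "neural networks": "neural network",
    "svm": "support vector machine",
}
canonical2raw = build_canonical2raw(toy_raw2canonical)
print(canonical2raw["neural network"])
```

For query expansion, a query containing a canonical keyword could then be expanded with the raw variants listed under it.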

In [4]:
valid_keywords = set()
with open(CURATED_KEYWORDS, "r") as fcurated:
    for line in fcurated:
        valid_keywords.add(line.strip())

print("{:d} valid keywords".format(len(valid_keywords)))

2282 valid keywords


In [5]:
raw2canonical = {}
with open(NEARDUP_MAPPINGS, "r") as fneardup:
    for line in fneardup:
        key, value = line.strip().split("\t")
        raw2canonical[key] = value

# the dedupe mappings file has a third column that we ignore
with open(DEDUPE_MAPPINGS, "r") as fdedupe:
    for line in fdedupe:
        key, value, _ = line.strip().split("\t")
        raw2canonical[key] = value

print("{:d} raw to canonical mappings".format(len(raw2canonical)))

455 raw to canonical mappings


In [6]:
def get_keywords_for_doc(conn, doc_id, valid_keywords, raw2canonical):
    keywords = set()
    # collect from bigrams
    cur_bigrams = conn.cursor()
    cur_bigrams.execute("select word_1, word_2 from bigrams where doc_id = ?", [doc_id])
    rows = cur_bigrams.fetchall()
    for row in rows:
        keyword = " ".join([row[0], row[1]]).lower()
        keywords.add(keyword)
    cur_bigrams.close()
    # collect from rake
    cur_rake = conn.cursor()
    cur_rake.execute("select keyword from rake where doc_id = ?", [doc_id])
    rows = cur_rake.fetchall()
    for row in rows:
        keywords.add(row[0].lower())
    cur_rake.close()
    # collect from maui
    cur_maui = conn.cursor()
    cur_maui.execute("select keyword from maui where doc_id = ?", [doc_id])
    rows = cur_maui.fetchall()
    for row in rows:
        keywords.add(row[0].lower())
    cur_maui.close()
    # keep only the manually curated (valid) keywords
    filtered_keywords = [keyword for keyword in keywords
                         if keyword in valid_keywords]
    # map them to their canonical forms if applicable
    canonical_keywords = set()
    for keyword in filtered_keywords:
        # fall back to the keyword itself when it has no canonical form
        canonical_keywords.add(raw2canonical.get(keyword, keyword))
    # return as list
    return list(canonical_keywords)


doc_id = 1
print(get_keywords_for_doc(conn, doc_id, valid_keywords, raw2canonical))

['science foundation', 'total number', 'random variables', 'optimal values', 'asymptotically optimal', 'national science foundation', 'neural net', 'optimal performance', 'neural network', 'information theory', 'upper bound', 'lower bound', 'maximum number', 'associative memory', 'hamming distance', 'ieee transactions', 'computer science', 'internal representation', 'matrix multiplication', 'national science']


### Document metadata lookup

We already saved the non-text values for each document in the `papers_metadata.tsv` file in the `01-preprocess` notebook, so we will use that here.

In [7]:
id2metadata = {}
with open(METADATA_FILE, "r") as fmeta:
    for line in fmeta:
        line = line.strip()
        # skip comment lines
        if line.startswith("#"):
            continue
        # paper_id avoids shadowing the built-in id()
        paper_id, year, title, abstract, author_names = line.split("\t")
        authors = author_names.split(":")
        id2metadata[int(paper_id)] = (year, title, abstract, authors)

print("{:d} metadata mappings".format(len(id2metadata)))

7238 metadata mappings


### Load papers into Solr

In [8]:
def count_index_rows(solr_url):
    resp = requests.get(solr_url + "select?q=*:*")
    resp_json = json.loads(resp.text)
    return resp_json["response"]["numFound"]


def add_row(solr_url, doc_id, conn, valid_keywords, raw2canonical, 
            id2metadata, should_commit):
    headers = {
        "content-type": "application/json",
        "accept": "application/json"
    }
    if doc_id is None:
        # a doc_id of None is a sentinel: add nothing, just issue a commit
        requests.post(solr_url + "update", params={"commit":"true"}, headers=headers)
    else:
        # read the full text of the paper, stripping whitespace from each line
        with open(os.path.join(TEXTFILES_DIR, "{:d}.txt".format(doc_id)), "r") as ftext:
            textfile_lines = [line.strip() for line in ftext]
        text = "\n".join(textfile_lines)
        year, title, abstract, authors = id2metadata[doc_id]
        keywords = get_keywords_for_doc(conn, doc_id, valid_keywords, raw2canonical)
        req_body = json.dumps({
            "add": {
                "doc": {
                    "id": doc_id,
                    "year": year,
                    "title": title,
                    "abstract": abstract,
                    "text": text,
                    "authors": authors,
                    "keywords": keywords
                }
            }
        })
        params = { "commit": "true" if should_commit else "false" }
        requests.post(solr_url + "update", data=req_body, params=params, headers=headers)

In [9]:
num_rows_in_index = count_index_rows(SOLR_URL)
num_added = 0
should_commit = False
if num_rows_in_index == 0:
    for textfile in os.listdir(TEXTFILES_DIR):
        doc_id = int(textfile.split(".")[0])
        if num_added % 1000 == 0:
            print("{:d} records added".format(num_added))
            should_commit = True
        add_row(SOLR_URL, doc_id, conn, valid_keywords, raw2canonical, 
                id2metadata, should_commit)        
        should_commit = False
        num_added += 1
    
    print("{:d} records added, COMPLETE".format(num_added))
    add_row(SOLR_URL, None, conn, valid_keywords, raw2canonical, id2metadata, True)
    num_rows_in_index = count_index_rows(SOLR_URL)

print("{:d} records in index".format(num_rows_in_index))

0 records added
1000 records added
2000 records added
3000 records added
4000 records added
5000 records added
6000 records added
7000 records added
7238 records added, COMPLETE
7238 records in index
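
The loop above posts one document per request. If loading time becomes a concern, one possible variation is to group documents into batches and send each batch as a JSON array, which Solr's JSON update handler accepts. The sketch below shows only the batching and serialization; `chunked`, `build_batch_body` and the toy documents are hypothetical names and data, and no request is sent.

```python
import json


def chunked(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # emit the final, possibly smaller, batch
    if batch:
        yield batch


def build_batch_body(docs):
    """Serialize a list of document dicts as a JSON array of documents."""
    return json.dumps(docs)


# toy documents for illustration only (assumed values)
toy_docs = [{"id": i, "title": "paper {:d}".format(i)} for i in range(5)]
for batch in chunked(toy_docs, 2):
    body = build_batch_body(batch)
    # a real loader would post the body here, for example:
    # requests.post(solr_url + "update", data=body, headers=headers)
    print(len(batch), "documents in batch")
```

A single commit after the last batch, as the final `add_row` call does above, would still be needed to make the documents searchable.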
