# Preprocess Data

We will work with the text of papers from the Neural Information Processing Systems (NIPS) conference. The data was collected by Ben Hamner in 2015 and shared as a SQLite3 database file at [https://www.kaggle.com/benhamner/nips-papers](https://www.kaggle.com/benhamner/nips-papers). The database contains three tables: papers, authors, and paper_authors. There are 7237 academic papers in this dataset, spanning 1987 to 2015.

This data has been manually exported from SQLite3 into three CSV files (papers.csv, authors.csv, and paper_authors.csv) using the [commands described here](http://www.sqlitetutorial.net/sqlite-export-csv/).
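
The export was done by hand with the SQLite command-line tool, but the same step could also be scripted. The following is a minimal sketch using only the Python standard library; the helper name `export_table_to_csv` and the database filename in the usage comment are assumptions for illustration, not part of the original workflow.

```python
import csv
import sqlite3


def export_table_to_csv(conn, table, out_path):
    """Write every row of `table` to `out_path` with a header row.

    Returns the number of data rows written.
    """
    # Table names cannot be bound as SQL parameters, so pass trusted names only.
    cursor = conn.execute('SELECT * FROM "{}"'.format(table))
    header = [col[0] for col in cursor.description]
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in cursor:
            writer.writerow(row)
            rows_written += 1
    return rows_written


# Usage (hypothetical database filename):
# conn = sqlite3.connect("database.sqlite")
# for table in ("papers", "authors", "paper_authors"):
#     export_table_to_csv(conn, table, table + ".csv")
```

Note that a CSV written this way has no extra index column, so reading it back with pandas may give slightly different leading columns than the tables shown in this notebook.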

In this notebook, we will load these three CSV files into our first search index, and also write the paper text out as text files that downstream tools can process more easily.

In [1]:
import json
import os
import pandas as pd
import re
import requests

In [2]:
DATA_DIR = "../data"

## Papers

In [3]:
papers_df = pd.read_csv(os.path.join(DATA_DIR, "papers.csv"))
papers_df.head()

Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,2,1987,The Capacity of the Kanerva Associative Memory...,,2-the-capacity-of-the-kanerva-associative-memo...,Abstract Missing,184\n\nTHE CAPACITY OF THE KANERVA ASSOCIATIVE...
2,3,1987,Supervised Learning of Probability Distributio...,,3-supervised-learning-of-probability-distribut...,Abstract Missing,52\n\nSupervised Learning of Probability Distr...
3,4,1987,Constrained Differential Optimization,,4-constrained-differential-optimization.pdf,Abstract Missing,612\n\nConstrained Differential Optimization\n...
4,5,1987,Towards an Organizing Principle for a Layered ...,,5-towards-an-organizing-principle-for-a-layere...,Abstract Missing,485\n\nTOWARDS AN ORGANIZING PRINCIPLE FOR\nA ...


## Authors

We will create an in-memory dictionary of author ID to author name, which we will use in the paper_authors dataframe below.

In [4]:
authors_df = pd.read_csv(os.path.join(DATA_DIR, "authors.csv"))
authors_df.head()

Unnamed: 0,id,name
0,1,Hisashi Suzuki
1,2,Suguru Arimoto
2,3,Philip A. Chou
3,4,John C. Platt
4,5,Alan H. Barr


In [5]:
author_ids = authors_df["id"].values
author_names = authors_df["name"].values
author_id2name = dict(zip(author_ids, author_names))

## Paper Authors

We want to denormalize the authors and attach them to the papers dataframe. To do this, we use our author-ID-to-name lookup to wrap each author name in a single-element list, then group by paper_id and sum the lists, which concatenates them into one list of author names per paper.

In [6]:
paper_authors_df = pd.read_csv(os.path.join(DATA_DIR, "paper_authors.csv"))
paper_authors_df.head()

Unnamed: 0,id,paper_id,author_id
0,1,63,94
1,2,80,124
2,3,80,125
3,4,80,126
4,5,80,127


In [7]:
paper_authors_df["author_names"] = paper_authors_df["author_id"].apply(
    lambda x: [author_id2name[x]])
pa_grouped_df = paper_authors_df.groupby(["paper_id"]).agg({"author_names": "sum"})
pa_grouped_df.head()

paper_id,author_names
1,"[Hisashi Suzuki, Suguru Arimoto]"
2,[Philip A. Chou]
3,"[Eric B. Baum, Frank Wilczek]"
4,"[John C. Platt, Alan H. Barr]"
5,[Ralph Linsker]
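
To make the grouping logic explicit, here is a plain-Python sketch of the same idea on toy data. The `(paper_id, author_id)` pairs and the helper name `group_author_names` are illustrative assumptions; only the author names are taken from the tables above.

```python
from collections import defaultdict

# Toy lookup and pairs; the pairs are assumed for illustration.
author_id2name = {1: "Hisashi Suzuki", 2: "Suguru Arimoto", 3: "Philip A. Chou"}
paper_author_pairs = [(1, 1), (1, 2), (2, 3)]  # (paper_id, author_id)


def group_author_names(pairs, id2name):
    """Collect author names per paper, preserving the order of the pairs."""
    grouped = defaultdict(list)
    for paper_id, author_id in pairs:
        grouped[paper_id].append(id2name[author_id])
    return dict(grouped)


paper_id2authors = group_author_names(paper_author_pairs, author_id2name)
print(paper_id2authors)
```

Appending to one list per paper avoids building many intermediate lists, which summing single-element lists does.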


## Denormalize papers dataframe

We join our papers dataframe with the grouped paper_authors dataframe on paper ID to produce our denormalized papers dataframe. Because this is an inner join, only papers that have at least one author row are kept.

In [8]:
# pa_grouped_df is already indexed by paper_id, so it can be joined directly
papers_denorm_df = (papers_df.set_index("id")
                    .join(pa_grouped_df, how="inner")
                    .reset_index())
papers_denorm_df.head()

Unnamed: 0,index,year,title,event_type,pdf_name,abstract,paper_text,author_names
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...,"[Hisashi Suzuki, Suguru Arimoto]"
1,2,1987,The Capacity of the Kanerva Associative Memory...,,2-the-capacity-of-the-kanerva-associative-memo...,Abstract Missing,184\n\nTHE CAPACITY OF THE KANERVA ASSOCIATIVE...,[Philip A. Chou]
2,3,1987,Supervised Learning of Probability Distributio...,,3-supervised-learning-of-probability-distribut...,Abstract Missing,52\n\nSupervised Learning of Probability Distr...,"[Eric B. Baum, Frank Wilczek]"
3,4,1987,Constrained Differential Optimization,,4-constrained-differential-optimization.pdf,Abstract Missing,612\n\nConstrained Differential Optimization\n...,"[John C. Platt, Alan H. Barr]"
4,5,1987,Towards an Organizing Principle for a Layered ...,,5-towards-an-organizing-principle-for-a-layere...,Abstract Missing,485\n\nTOWARDS AN ORGANIZING PRINCIPLE FOR\nA ...,[Ralph Linsker]
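
The effect of the inner join can be illustrated without pandas. The sketch below is a simplified stand-in, not the notebook's implementation; the helper name `inner_join_papers` and the paper with id 999 (which has no author rows) are hypothetical.

```python
def inner_join_papers(papers, paper_id2authors):
    """Attach author names to each paper, dropping papers with no authors."""
    joined = []
    for paper in papers:
        authors = paper_id2authors.get(paper["id"])
        if authors is None:
            continue  # inner join: unmatched papers are dropped
        joined.append(dict(paper, author_names=authors))
    return joined


papers = [
    {"id": 1, "year": 1987},
    {"id": 2, "year": 1987},
    {"id": 999, "year": 1987},  # hypothetical paper without author rows
]
paper_id2authors = {
    1: ["Hisashi Suzuki", "Suguru Arimoto"],
    2: ["Philip A. Chou"],
}
joined = inner_join_papers(papers, paper_id2authors)
print([p["id"] for p in joined])
```

Comparing `len(papers_df)` with `len(papers_denorm_df)` in the notebook is a quick way to see whether the join dropped any papers.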


## Preparing Solr to receive data

In your Solr directory, execute the following command to create the first index (nips0index), which will hold the papers.

    bin/solr create -c nips0index
    
Then run the following script to set up the schema for the index ([code](https://github.com/sujitpal/content-engineering-tutorial/blob/master/scripts/create_schema_0.sh)).

    scripts/create_schema_0.sh
    
More information:

* [Solr Field Types and Properties](https://lucene.apache.org/solr/guide/7_2/field-type-definitions-and-properties.html)
* [Solr Analyzers, Tokenizers and Tokenfilters](https://lucene.apache.org/solr/guide/7_2/understanding-analyzers-tokenizers-and-filters.html).
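
The contents of create_schema_0.sh are not reproduced here. As a rough sketch of what a schema setup step can look like, the code below builds an `add-field` request body for Solr's Schema API. The field names mirror the document fields posted later in this notebook, but the field types and properties are assumptions and may differ from the actual script.

```python
import json

# Hypothetical field definitions; the real ones live in create_schema_0.sh.
# The "id" field is omitted because Solr's default configset already defines it.
FIELDS = [
    {"name": "year", "type": "pint", "stored": True, "indexed": True},
    {"name": "title", "type": "text_general", "stored": True, "indexed": True},
    {"name": "abstract", "type": "text_general", "stored": True, "indexed": True},
    {"name": "text", "type": "text_general", "stored": True, "indexed": True},
    {"name": "author_names", "type": "string", "stored": True, "indexed": True,
     "multiValued": True},
]


def build_schema_request(fields):
    """Return the JSON body for a Schema API request that adds `fields`."""
    return json.dumps({"add-field": fields})


payload = build_schema_request(FIELDS)

# Usage, assuming Solr is running locally:
# requests.post("http://localhost:8983/solr/nips0index/schema",
#               data=payload, headers={"content-type": "application/json"})
```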

## Upload data to Solr

In [9]:
# solr url
solr_url = "http://localhost:8983/solr"
headers = {
    "content-type": "application/json",
    "accept": "application/json"
}
# create subdirectory for text files
textfiles_dir = os.path.join(DATA_DIR, "textfiles")
os.makedirs(textfiles_dir, exist_ok=True)
# create metadata file
fmeta = open(os.path.join(DATA_DIR, "papers_metadata.tsv"), "w")
fmeta.write("#id\tyear\ttitle\tabstract\tauthor_names\n")
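# note: after reset_index(), paper_id below is the 0-based row position
# of papers_denorm_df, not the value of the original paper id column
for paper_id, row in papers_denorm_df.iterrows():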
    year = row["year"]
    title = row["title"]
    abstract = row["abstract"]
    # collapse newlines and other runs of whitespace in the abstract into single spaces
    abstract = re.sub(r"\s+", " ", abstract)
    paper_text = row["paper_text"]
    author_names = row["author_names"]
    # write out metadata
    fmeta.write("{:d}\t{:d}\t{:s}\t{:s}\t{:s}\n"
                .format(paper_id, year, title, abstract, ":".join(author_names)))
    # write out text file
    with open(os.path.join(textfiles_dir, "{:d}.txt".format(paper_id)), "w") as ftext:
        ftext.write(paper_text)
    # post to index
    data = json.dumps({
        "add": {
            "doc": {
                "id": paper_id,
                "year": year,
                "title": title,
                "abstract": abstract,
                "text": paper_text,
                # author_names is already a list; pass it as a flat list of strings
                "author_names": author_names
            }
        }
    })
    if paper_id % 1000 == 0:
        print("Uploaded {:d} records to Solr".format(paper_id))
        params = {"commit": "true"}
    else:
        params = {"commit": "false"}
    requests.post(solr_url + "/nips0index/update", data=data, params=params, headers=headers)
fmeta.close()
requests.post(solr_url + "/nips0index/update", headers=headers, params={"commit": "true"})
print("Uploaded {:d} records to Solr".format(paper_id))

Uploaded 0 records to Solr
Uploaded 1000 records to Solr
Uploaded 2000 records to Solr
Uploaded 3000 records to Solr
Uploaded 4000 records to Solr
Uploaded 5000 records to Solr
Uploaded 6000 records to Solr
Uploaded 7000 records to Solr
Uploaded 7237 records to Solr
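
The loop above sends one HTTP request per paper and does not check the responses. Solr's JSON update handler also accepts an array of documents in a single request, so a batched variant is one possible refinement. The sketch below is not the notebook's code: the helper names `chunked` and `upload_in_batches`, the batch size, and the injected `post` callable (expected to behave like `requests.post`) are all assumptions.

```python
import json


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def upload_in_batches(docs, post, update_url, batch_size=500):
    """Post docs to Solr as JSON arrays, then commit once at the end.

    `post` is any callable with the same signature as requests.post.
    Returns the number of documents sent.
    """
    headers = {"content-type": "application/json"}
    total = 0
    for batch in chunked(docs, batch_size):
        resp = post(update_url, data=json.dumps(batch), headers=headers)
        resp.raise_for_status()  # fail loudly instead of silently losing docs
        total += len(batch)
    resp = post(update_url, headers=headers, params={"commit": "true"})
    resp.raise_for_status()
    return total


# Usage, assuming `docs` is a list of dicts shaped like the "doc" payload above:
# upload_in_batches(docs, requests.post, solr_url + "/nips0index/update")
```

Passing `post` in as a parameter keeps the helper testable without a running Solr instance.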
