# Elasticsearch Document Conversion

Converting our Search API results into Elasticsearch documents appears to be a pain point.
This notebook serves as a root cause analysis of the issue.

## Settings Variables

In [1]:
# Maximum number of query items to request from API
QUERY_LIMIT = 20

# Maximum number of requests deferred
MAX_RUN = 20

# Seconds to wait between consecutive requests
RATE_LIMIT = 1

# Path to the original Excel file, converted to CSV
filepath = 'examples/finance_demo.csv'

# Path to converted file to be used for API requests
outpath = 'examples/ipython_demo.json'

# Path to the results returned from the API via the networking engine
resultspath = 'results.json'

# Specify a document template for Elasticsearch
esformat = {
    "_index": "ipythonsearch",
    "_type": "website",
    "_id": None,
    "_source": "",
}
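
As a minimal sketch of how a template like `esformat` might be filled in for a single document: the helper name `make_action` and the sample `_source` payload are hypothetical, chosen only for illustration. The template is repeated so the example is self-contained, and it is deep-copied so that one shared template can be reused safely across documents.

```python
import copy

# Same shape as `esformat` above, repeated so this sketch runs on its own.
template = {
    "_index": "ipythonsearch",
    "_type": "website",
    "_id": None,
    "_source": "",
}


def make_action(template, doc_id, source):
    """Return a new action dict; the shared template is left untouched.

    `make_action` is a hypothetical helper, not part of greencall.
    """
    action = copy.deepcopy(template)
    action["_id"] = doc_id
    action["_source"] = source
    return action


# Illustrative values only.
action = make_action(template, "11111437", {"searchTerms": "Akeem Dent"})
```

A list of such dicts is the kind of input that `helpers.bulk` from the `elasticsearch` package consumes.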

## Reading JSON from Search API Results

In [2]:
import itertools
import string
import json
from elasticsearch import Elasticsearch, helpers

from greencall.utils.loadelastic import read_json, prepare_all_documents
from greencall.csvclean.inputCsv import read_csv

# Connect to Elasticsearch
es = Elasticsearch()

# Read results from the Search API
results = read_json(resultspath)

# Take the length of the results as a sanity check
len(results)

20

## Reviewing Results from the Google Search API

In [14]:
# Take a look at the first document.
# list(results)[0] behaves like results.keys()[0] but also works on Python 3.
first_key = list(results)[0]

print("key:\n {}\n".format(first_key))
print("result:\n {}".format(results[first_key]))

key:
 11111437

result:
 {
 "kind": "customsearch#search",
 "url": {
  "type": "application/json",
  "template": "https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json"
 },
 "queries": {
  "request": [
   {
    "title": "Google Custom Search - Akeem Dent",
    "totalResults": "0",
    "searchTerms": "Akeem Dent",
    "count": 10,
    "language": "lang_en"

**Output from the first result**

Here we can see that the first query returned 0 total results. The response still
carries some generic fields that seem to be search metadata. This initially
looks like it would be 3 Elasticsearch documents, but it may make more sense
to condense the metadata into one ES document.

Let's take a look at a result that includes search hits next.
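
To make the idea of condensing the metadata concrete, here is a rough sketch that flattens the fields visible in the output above (`kind` and the first entry of `queries.request`) into one dictionary. The function name `condense_metadata` and the choice of which fields to keep are assumptions for illustration, not the parser's actual behaviour. The sample string mirrors the truncated output shown above.

```python
import json


def condense_metadata(raw):
    """Collapse the search metadata of one API result into a single dict.

    `raw` may be a JSON string (as stored in the results file) or an
    already-decoded dict. Hypothetical helper, for illustration only.
    """
    result = raw if isinstance(raw, dict) else json.loads(raw)
    # `queries.request` is a list; only its first entry is used here.
    request = (result.get("queries", {}).get("request") or [{}])[0]
    return {
        "kind": result.get("kind"),
        "searchTerms": request.get("searchTerms"),
        # The API reports totalResults as a string, so convert it.
        "totalResults": int(request.get("totalResults", 0)),
        "count": request.get("count"),
        "language": request.get("language"),
    }


sample = json.dumps({
    "kind": "customsearch#search",
    "queries": {
        "request": [{
            "title": "Google Custom Search - Akeem Dent",
            "totalResults": "0",
            "searchTerms": "Akeem Dent",
            "count": 10,
            "language": "lang_en",
        }]
    },
})

meta = condense_metadata(sample)
```

Converting `totalResults` to an integer up front makes it easy to separate empty results from results with hits later on.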

In [23]:
# Result that contains search hits
import codecs

second_key = list(results)[1]

print("key:\n {}\n".format(second_key))
print("value:\n {}".format(codecs.encode(results[second_key], 'ascii', 'ignore')))

key:
 11111253

value:
 {
 "kind": "customsearch#search",
 "url": {
  "type": "application/json",
  "template": "https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json"
 },
 "queries": {
  "nextPage": [
   {
    "title": "Google Custom Search - Jeremy Clark",
    "totalResults": "6020",
    "searchTerms": "Jeremy Clark",
    "count": 10,
    "startIndex": 

tangent: [Natalie Portman tells Harvard seniors to use inexperience as an asset - The Boston Globe](https://www.bostonglobe.com/lifestyle/names/2015/05/27/natalie-portman-tells-harvard-seniors-use-inexperience-asset/GS5LQ3rgc15FEJgLuisYFJ/story.htm)

**Reviewing the result with search hits**

Here we can see some similarities with the first result as well. We seem to get into trouble when trying to
assess the fields within the 'pagemap'. The 'snippet' field may be sufficient
for our needs.

This information seems to be sufficient for creating the next version of the parser. We have
been able to identify fields that occur in each result. We have also found fields that occur
in each search hit.

At this stage, parsing the 'pagemap' seems to be unnecessary.

## Building a better document converter
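
Based on the review above, one possible shape for the next converter is sketched below: one Elasticsearch action per search hit, keeping the 'snippet' and ignoring the 'pagemap'. This is a sketch under stated assumptions, not the greencall implementation. It assumes the hits live under an `items` list with `title`, `link`, and `snippet` fields (as in the Custom Search JSON response), that the result key is the account number, and that an `_id` of the form `key-position` is acceptable. The name `hits_to_actions` is hypothetical.

```python
import json


def hits_to_actions(key, raw, index="ipythonsearch", doc_type="website"):
    """Build one bulk action per search hit for a single API result.

    Hypothetical helper. `raw` may be a JSON string or a decoded dict.
    """
    result = raw if isinstance(raw, dict) else json.loads(raw)
    actions = []
    # A result with no hits has no "items" key, so default to an empty list.
    for position, hit in enumerate(result.get("items", [])):
        actions.append({
            "_index": index,
            "_type": doc_type,
            "_id": "{}-{}".format(key, position),
            "_source": {
                # Assumption: the result key is the account number.
                "account_number": key,
                "title": hit.get("title"),
                "link": hit.get("link"),
                "snippet": hit.get("snippet"),
                # 'pagemap' is deliberately not copied.
            },
        })
    return actions


# Made-up hit for illustration; not taken from the real results file.
sample = {
    "kind": "customsearch#search",
    "items": [{
        "title": "Example title",
        "link": "https://example.com/",
        "snippet": "Example snippet text.",
        "pagemap": {"metatags": [{"viewport": "width=device-width"}]},
    }],
}

with_hits = hits_to_actions("11111253", sample)
without_hits = hits_to_actions("11111437", {"kind": "customsearch#search"})
```

Results without hits simply produce no per-hit actions, so the search metadata can be handled separately.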

## Results from Elasticsearch document conversion

In [4]:
# Prepare all documents for bulk upload into Elasticsearch
actions = prepare_all_documents(results, esformat, read_csv(filepath, QUERY_LIMIT))

In [5]:
# Stepping through the documents makes it pretty clear that
# there are some issues with the parser. The first version
# definitely worked better than nothing, but we are going to
# need to make some changes.

actions[47]['_source']

{'account_holder': 'Walter Broughton',
 'account_number': u'11111417',
 u'searchInformation': [u'formattedSearchTime',
  u'formattedTotalResults',
  u'totalResults',
  u'searchTime']}
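
In the document above, `searchInformation` holds a list of field names rather than their values. One plausible cause, offered only as a guess since the parser source is not shown here, is that a nested dict was iterated or passed to `list()`, which yields its keys. The sketch below is a hypothetical check (`flattened_fields` is not part of greencall) that flags fields whose value is a non-empty list of strings, which could help find every affected document.

```python
try:
    string_types = (basestring,)  # Python 2
except NameError:
    string_types = (str,)  # Python 3


def flattened_fields(source):
    """Return the names of fields whose value is a non-empty list of strings.

    Hypothetical diagnostic: such a value may be a dict that lost its values.
    """
    return sorted(
        name for name, value in source.items()
        if isinstance(value, list)
        and value
        and all(isinstance(item, string_types) for item in value)
    )


# Iterating a dict yields only its keys, which reproduces the symptom.
# The value here is a placeholder, not real data.
keys_only = list({"searchTime": None})

# Same shape as the `_source` printed above.
source = {
    "account_holder": "Walter Broughton",
    "account_number": "11111417",
    "searchInformation": [
        "formattedSearchTime",
        "formattedTotalResults",
        "totalResults",
        "searchTime",
    ],
}

suspect = flattened_fields(source)
```

A legitimate list of strings would also be flagged, so the output is a list of candidates to inspect rather than a definitive list of bugs.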