# Elasticsearch Document Conversion

Converting our Search API results into Elasticsearch documents appears to be a pain point.
This notebook is meant to serve as a root cause analysis of the issue.

## Settings Variables

In [94]:
# Maximum number of query items to request from API
QUERY_LIMIT = 20

# Maximum number of requests deferred
MAX_RUN = 20

# Number of seconds to wait between sent requests
RATE_LIMIT = 1

# Path to original excel file, converted to CSV
filepath = 'examples/finance_demo.csv'

# Path to converted file to be used for API requests
outpath = 'examples/ipython_demo.json'

# results returned from the API via the networking engine
resultspath = 'results.json'

# Specify a document template for Elasticsearch
esformat = {
            "_index": "ipythonsearch",
            "_type": "website",
            "_id": None,
            "_source": ""
        }
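
As a minimal sketch of how a template like this might be used, the helper below copies the template and fills in `_id` and `_source` for one document. The name `make_action` and the id value are assumptions made for illustration; they are not part of greencall.

```python
import copy

# Same shape as the esformat template above
ES_TEMPLATE = {
    "_index": "ipythonsearch",
    "_type": "website",
    "_id": None,
    "_source": "",
}


def make_action(template, doc_id, source):
    """Return a new action dict; the shared template is left untouched."""
    action = copy.deepcopy(template)
    action["_id"] = doc_id
    action["_source"] = source
    return action


# Hypothetical id and source, for illustration only
action = make_action(ES_TEMPLATE, "11111143-0", {"title": "example"})
print(action)
```

Copying the template first avoids a subtle problem: if every document reused the same dict object, later assignments would overwrite earlier ones.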

## Reading JSON from Search API Results

In [95]:
import itertools
import string
import json
from elasticsearch import Elasticsearch, helpers

from greencall.utils.loadelastic import read_json, prepare_all_documents
from greencall.csvclean.inputCsv import read_csv

# Connect to Elasticsearch
es = Elasticsearch()

# Read results from the Search API
results = read_json(resultspath)

# Take the length of the results as a sanity check
len(results)

5

## Reviewing Results from the Google Search API

In [96]:
# take a look at the first document
import codecs

print("key:\n {}\n".format(results.keys()[0]))
print("result:\n {}".format(codecs.encode(results[results.keys()[0]],'ascii','ignore')))

key:
 11111143

result:
 {
 "kind": "customsearch#search",
 "url": {
  "type": "application/json",
  "template": "https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json"
 },
 "queries": {
  "nextPage": [
   {
    "title": "Google Custom Search - Kelley Mote",
    "totalResults": "426",
    "searchTerms": "Kelley Mote",
    "count": 10,
    "startIndex": 11

**Output from the First Result**

Here we can see that the first result includes 0 total results. It looks like
we have some generic fields that seem to be search metadata. This initially
looks like it would be 3 Elasticsearch documents, but it may make more sense
to condense the metadata into one ES document.

Let's take a look at a result that includes search hits next.

In [97]:
# result that contains search hits
import codecs

print("key:\n {}\n".format(results.keys()[1]))
print("value:\n {}".format(codecs.encode(results[results.keys()[1]],'ascii','ignore')))

key:
 11111157

value:
 {
 "kind": "customsearch#search",
 "url": {
  "type": "application/json",
  "template": "https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json"
 },
 "queries": {
  "nextPage": [
   {
    "title": "Google Custom Search - Al Krueger",
    "totalResults": "2530",
    "searchTerms": "Al Krueger",
    "count": 10,
    "startIndex": 11,


tangent: [Natalie Portman tells Harvard seniors to use inexperience as an asset - The Boston Globe](https://www.bostonglobe.com/lifestyle/names/2015/05/27/natalie-portman-tells-harvard-seniors-use-inexperience-asset/GS5LQ3rgc15FEJgLuisYFJ/story.htm)

**Reviewing the result with search hits**

Here we can see some similarities as well. The trouble starts when we try to
assess the fields within the 'pagemap'. The 'snippet' field may be sufficient
for our needs.

This information seems to be sufficient for creating the next version of the parser. We have
been able to identify fields that occur in each result. We have also found fields that occur
in each search hit.

At this stage, parsing the 'pagemap' seems to be unnecessary.

## Building a better document converter

In [98]:
# meta information
import json
pydict = read_json(resultspath)

valuedict = json.loads(pydict[pydict.keys()[0]])

In [99]:
print("search results: {}".format(valuedict['items'][0]))

search results: {u'kind': u'customsearch#result', u'title': u'Honda Village - Auto Repair - Newton Corner, MA - Reviews ...', u'displayLink': u'www.yelp.com', u'htmlTitle': u'Honda Village - Auto Repair - Newton Corner, MA - Reviews <b>...</b>', u'formattedUrl': u'www.yelp.com/biz/honda-village-newton-corner', u'htmlFormattedUrl': u'www.yelp.com/biz/honda-village-newton-corner', u'pagemap': {u'rating': [{u'ratingvalue': u'5.0'}, {u'ratingvalue': u'5.0'}, {u'ratingvalue': u'5.0'}, {u'ratingvalue': u'5.0'}, {u'ratingvalue': u'1.0'}, {u'ratingvalue': u'5.0'}, {u'ratingvalue': u'5.0'}, {u'ratingvalue': u'4.0'}, {u'ratingvalue': u'5.0'}, {u'ratingvalue': u'1.0'}, {u'ratingvalue': u'2.0'}, {u'ratingvalue': u'5.0'}, {u'ratingvalue': u'4.0'}, {u'ratingvalue': u'1.0'}, {u'ratingvalue': u'5.0'}, {u'ratingvalue': u'4.0'}, {u'ratingvalue': u'1.0'}, {u'ratingvalue': u'4.0'}, {u'ratingvalue': u'5.0'}, {u'ratingvalue': u'1.0'}, {u'ratingvalue': u'2.0'}, {u'ratingvalue': u'5.0'}, {u'ratingvalue': u'5.

In [100]:
print("some metadata: {}".format(valuedict['items'][0].keys()))

some metadata: [u'kind', u'title', u'displayLink', u'htmlTitle', u'formattedUrl', u'htmlFormattedUrl', u'pagemap', u'snippet', u'htmlSnippet', u'link', u'cacheId']


### Metadata for ES document

In [101]:
print('kind: {}'.format(valuedict['kind']))
print('template: {}'.format(valuedict['url']['template'])) #make raw
print('title: {}'.format(valuedict['queries']['request'][0]['title']))
print('totalResults: {}'.format(valuedict['queries']['request'][0]['totalResults']))
print('searchTerms: {}'.format(valuedict['queries']['request'][0]['searchTerms']))
print('count: {}'.format(valuedict['queries']['request'][0]['count']))
print('language: {}'.format(valuedict['queries']['request'][0]['language']))
print('inputEncoding: {}'.format(valuedict['queries']['request'][0]['inputEncoding']))
print('outputEncoding: {}'.format(valuedict['queries']['request'][0]['outputEncoding']))
print('safe: {}'.format(valuedict['queries']['request'][0]['safe']))
print('cx: {}'.format(valuedict['queries']['request'][0]['cx']))
print('filter: {}'.format(valuedict['queries']['request'][0]['filter']))
print('exactTerms: {}'.format(valuedict['queries']['request'][0]['exactTerms']))
print('dateRestrict: {}'.format(valuedict['queries']['request'][0]['dateRestrict']))
print('searchTime: {}'.format(valuedict['searchInformation']['searchTime']))
print('formattedSearchTime: {}'.format(valuedict['searchInformation']['formattedSearchTime']))
print('totalResults: {}'.format(valuedict['searchInformation']['totalResults']))

kind: customsearch#search
template: https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json
title: Google Custom Search - Kelley Mote
totalResults: 426
searchTerms: Kelley Mote
count: 10
language: lang_en
inputEncoding: utf8
outputEncoding: utf8
safe: off
cx: 003891126258438650518:fcb7zxrqavu
filter: 1
exactTerms: asset
dateRestrict: '2012'
searchTime: 0.71
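
The fields printed above could be condensed into a single metadata document. The sketch below is one possible way to do that, assuming the response has already been parsed with `json.loads`. The function name `extract_metadata`, the choice of fields, and the sample response are assumptions for illustration; missing fields simply come back as `None`.

```python
def extract_metadata(response):
    """Flatten the search-level fields of one response into a single dict."""
    # queries.request is a list; fall back to an empty dict if it is missing
    request = (response.get("queries", {}).get("request") or [{}])[0]
    info = response.get("searchInformation", {})
    meta = {
        "kind": response.get("kind"),
        "template": response.get("url", {}).get("template"),
        "searchTime": info.get("searchTime"),
        "totalResults": info.get("totalResults"),
    }
    for field in ("title", "searchTerms", "count", "language", "cx",
                  "exactTerms", "dateRestrict"):
        meta[field] = request.get(field)
    return meta


# Tiny hand-made response for illustration only (not real API output)
sample = {
    "kind": "customsearch#search",
    "url": {"template": "https://example.invalid/{searchTerms}"},
    "queries": {"request": [{"searchTerms": "Kelley Mote", "count": 10}]},
    "searchInformation": {"searchTime": 0.71, "totalResults": "426"},
}
meta = extract_metadata(sample)
print(meta["searchTerms"], meta["totalResults"])
```

Using `.get` throughout means a response with an unexpected shape yields a sparse document rather than a `KeyError`, which may or may not be the behavior we want in the final parser.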

### Resultdata for ES document

In [102]:
print('kind: {}'.format(valuedict['kind']))
print('cx: {}'.format(valuedict['queries']['request'][0]['cx']))
print('title: {}'.format(valuedict['items'][0]['title']))
print('link: {}'.format(valuedict['items'][0]['link'])) #make raw
print('snippet: {}'.format(codecs.encode(valuedict['items'][0]['snippet'],'ascii','ignore')))

kind: customsearch#search
cx: 003891126258438650518:fcb7zxrqavu
title: Honda Village - Auto Repair - Newton Corner, MA - Reviews ...
link: http://www.yelp.com/biz/honda-village-newton-corner
snippet: Find Out More About Our Trade Up Special! .... If there were more Jims in the car 
sales world, I think the public would have a much higher opinion of car...
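
Since 'title', 'link' and 'snippet' look sufficient, a per-hit extractor can skip the 'pagemap' entirely. The following is a sketch under that assumption; `extract_hits` and `HIT_FIELDS` are hypothetical names, and the sample response is hand-made rather than real API output.

```python
HIT_FIELDS = ("title", "link", "snippet")


def extract_hits(response, fields=HIT_FIELDS):
    """Return one small dict per search hit, ignoring 'pagemap' entirely."""
    hits = []
    # Assumption: 'items' is absent when a query returns no hits
    for item in response.get("items", []):
        hits.append({field: item.get(field) for field in fields})
    return hits


# Hand-made sample for illustration only
sample = {
    "items": [
        {
            "title": "Honda Village - Auto Repair",
            "link": "http://www.yelp.com/biz/honda-village-newton-corner",
            "snippet": "Find Out More About Our Trade Up Special!",
            "pagemap": {"rating": [{"ratingvalue": "5.0"}]},
        }
    ]
}
hits = extract_hits(sample)
print(hits)
```

Whitelisting the fields we want, instead of deleting the ones we do not, keeps the output stable if the API adds new keys to a hit.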


## Results from Elasticsearch document conversion

In [103]:
# prepare all documents for bulk upload into elasticsearch
#actions = prepare_all_documents(results, esformat, read_csv(filepath, QUERY_LIMIT))

In [104]:
# It's pretty clear from stepping through the documents that
# there are some issues with the parser. The first version
# definitely worked better than nothing, but we are going to
# need to make some changes.

#actions[47]['_source']
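
One possible direction for the next version of the parser is sketched below. It assumes, as in the cells above, that `results` maps a query key to a raw JSON string, and it emits one action per search hit. The name `iter_actions` and the `key-position` id scheme are assumptions for illustration; this is not the existing `prepare_all_documents` implementation.

```python
import json


def iter_actions(results, index="ipythonsearch", doc_type="website"):
    """Yield one bulk action per search hit found in `results`.

    `results` is assumed to map a query key to a raw JSON string.
    """
    for key, raw in results.items():
        try:
            response = json.loads(raw)
        except ValueError:
            # Skip payloads that are not valid JSON instead of failing the batch
            continue
        for position, item in enumerate(response.get("items", [])):
            yield {
                "_index": index,
                "_type": doc_type,
                # Hypothetical id scheme: query key plus hit position
                "_id": "{}-{}".format(key, position),
                "_source": {
                    "title": item.get("title"),
                    "link": item.get("link"),
                    "snippet": item.get("snippet"),
                },
            }


# Hand-made sample for illustration only
sample_results = {
    "11111157": json.dumps({"items": [{"title": "a", "link": "b", "snippet": "c"}]}),
    "11111143": "not json",
}
actions = list(iter_actions(sample_results))
print(len(actions))
```

Action dicts of this shape are what `helpers.bulk(es, actions)` from the `elasticsearch` package consumes, so a generator like this could in principle replace the commented-out call above once it has been checked against real results.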