# Using the "greenCall" Python package

The greenCall Python package currently requires a series of function calls
to work through the data pipeline. The pipeline consists of the following steps:

1. Read the CSV file, formatted as (unique id, query term)
2. Request information from the Search API
3. Write results to disk in JSON format
4. Bulk upload results to Elasticsearch

This notebook provides a concise example of how to work through the data pipeline.

## Setting Variables

In [1]:
# Maximum number of query items to request from API
QUERY_LIMIT = 20

# Maximum number of requests deferred
MAX_RUN = 20

# Number of seconds to wait between requests
RATE_LIMIT = 1

# Path to the original Excel file, converted to CSV
filepath = 'examples/finance_demo.csv'

# Path to converted file to be used for API requests
outpath = 'examples/ipython_demo.json'

# Path to the results returned from the API via the networking engine
resultspath = 'results.json'

# Specify a document template for Elasticsearch
esformat = {
            "_index": "ipythonsearch",
            "_type": "website",
            "_id": None,
            "_source": ""
        }
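
The template above leaves `_id` and `_source` unset so they can be filled in per document. The sketch below shows one way that could work. `make_action` is a hypothetical helper written for illustration; how `prepare_all_documents` actually uses the template is not shown in this notebook.

```python
import copy

# Same shape as the `esformat` template above.
template = {"_index": "ipythonsearch", "_type": "website", "_id": None, "_source": ""}


def make_action(template, doc_id, source):
    """Return a new bulk action; the shared template is left untouched."""
    action = copy.deepcopy(template)
    action["_id"] = doc_id
    action["_source"] = source
    return action


# Hypothetical document, for illustration only.
action = make_action(template, 1, {"query": "example term"})
print(action["_id"])
```

Copying the template for every document avoids the situation where all actions share, and overwrite, the same dictionary.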


## Start Logging

In [2]:
from greencall.utils.utilityBelt import enable_log

# Log everything, always.
enable_log('crawlah')


## Step 1 (Reading the CSV file)

In [3]:
from greencall.csvclean.inputCsv import tojson

# Convert the input file from CSV to JSON
tojson(filepath, outpath, QUERY_LIMIT)
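
To make the conversion step more concrete, here is a minimal sketch of reading (unique id, query term) rows and capping them at a limit. It is written for Python 3 and is not the implementation of `tojson`. The function `csv_to_records` and the sample rows are assumptions made for illustration.

```python
import csv
import io
import json


def csv_to_records(handle, limit):
    """Read (unique id, query term) rows and keep at most `limit` of them."""
    records = {}
    for row in csv.reader(handle):
        if len(records) >= limit:
            break
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        unique_id, term = row[0].strip(), row[1].strip()
        records[unique_id] = term
    return records


# Made-up sample rows standing in for the real CSV file.
sample = io.StringIO("1,acme bank\n2,example credit union\n3,sample trust\n")
records = csv_to_records(sample, limit=2)
print(json.dumps(records, sort_keys=True))
```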

## Step 2 (Request information from the Search API)

In [4]:
from greencall.csvclean.clientConversion import runConversion
from examples.secret import secret_key

# Use the API client to convert the query terms into the correct format
# for API requests. Currently hard-coded for the Google Search API.
adict = runConversion(jsonpath=outpath,
                      secretKey=secret_key)

## Step 3 (Write results to disk in JSON format)

In [5]:
from twisted.internet import reactor
from greencall.crawlah import getPages

# Load the network engine which handles API requests (gas & brakes)
gp = getPages(adict, MAX_RUN, RATE_LIMIT)

# Start the networking engine; reactor.run() blocks until the reactor is stopped
gp.start()
reactor.run()
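
The role of `RATE_LIMIT` can be pictured with a much simpler, blocking stand-in. This sketch does not use Twisted and is not how `getPages` is implemented. It only illustrates spacing requests a fixed number of seconds apart, and `throttled` is a hypothetical name.

```python
import time


def throttled(items, rate_limit, sleep=time.sleep):
    """Yield items one at a time, pausing `rate_limit` seconds between them."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(rate_limit)
        yield item


# Record the pauses instead of actually sleeping, so the example runs instantly.
pauses = []
sent = list(throttled(["q1", "q2", "q3"], rate_limit=1, sleep=pauses.append))
print(sent, pauses)
```

Passing the sleep function in as a parameter keeps the pacing logic easy to check without waiting in real time.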


## Step 4 (Bulk upload into Elasticsearch)

In [2]:
import itertools
import string
import json
from elasticsearch import Elasticsearch, helpers

from greencall.utils.loadelastic import read_json, prepare_all_documents
from greencall.csvclean.inputCsv import read_csv

# Connect to Elasticsearch
es = Elasticsearch()

# Read results from the Search API
results = read_json(resultspath)

# Take the length of the results as a sanity check
len(results)


20

In [4]:
# convert the keys to integers; rebinding the loop variable alone
# would not change the dict, so build a new dict instead
results = {int(key): value for key, value in results.items()}

In [3]:
# the number of items returned per query is initially limited to 10;
# for the 20 queries, here is the total number of search results available

for key in results:
    print(json.loads(results[key])['queries']['request'][0]['totalResults'])

0
6020
28800
1280
1790
3
4
3
0
4680
62800
0
4780
1950
0
433
2550
0
517
4730
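
The lookup above can be wrapped in a small helper so that the totals are collected rather than only printed. This is a sketch under assumptions: the payloads are made up and mimic only the fields read above, and `total_results` is a hypothetical name.

```python
import json

# Made-up payloads that mimic only the fields read in the cell above.
sample_results = {
    "1": json.dumps({"queries": {"request": [{"totalResults": "0"}]}}),
    "2": json.dumps({"queries": {"request": [{"totalResults": "6020"}]}}),
}


def total_results(raw):
    """Return the reported hit count of one JSON response as an int."""
    return int(json.loads(raw)["queries"]["request"][0]["totalResults"])


totals = {key: total_results(raw) for key, raw in sample_results.items()}
# Queries that returned nothing are the ones that will have no 'items' later.
empty = sorted(key for key, total in totals.items() if total == 0)
print(totals, empty)
```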


In [4]:
# prepare all documents for bulk upload into elasticsearch
actions = prepare_all_documents(results, esformat, read_csv(filepath, QUERY_LIMIT))


## Issues with the Search API to Elasticsearch document converter

1. Function returns list value even though the last contains a dictionary
2. Too many generic fields from the Google Custom Search API are returned as separate documents
3. Parent-child relationships are not mapped
4. A generic document converter is used; it may need some Google-specific logic to handle the conversion

In [5]:
# it looks like the Search API to Elasticsearch document converter needs the most attention
len(actions)

277

In [56]:
# it's pretty clear from stepping through the documents that
# there are some issues with the parser. The first version
# definitely worked better than nothing, but we are going to
# need to make some changes.

actions[47]['_source']

{'account_holder': 'Walter Broughton',
 'account_number': u'11111417',
 u'searchInformation': [u'formattedSearchTime',
  u'formattedTotalResults',
  u'totalResults',
  u'searchTime']}
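
In the output above, `searchInformation` holds a list of key names rather than the values. One plausible cause, and only an assumption since the converter source is not shown here, is that a dictionary was iterated or passed to `list()`, which keeps the keys and drops the values. The sketch below reproduces that effect with made-up values.

```python
# Made-up values; only the key names come from the output above.
search_information = {
    "formattedSearchTime": "0.25",
    "formattedTotalResults": "6,020",
    "totalResults": "6020",
    "searchTime": 0.25,
}

# Iterating a dict yields its keys only, so the values are lost.
flattened = sorted(list(search_information))

# A shallow copy keeps both keys and values.
preserved = dict(search_information)

print(flattened)
print(preserved["totalResults"])
```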

In [6]:
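# trying to get a better idea of how the documents were parsed

nums = {}
count = 0
for action in actions:
    if 'items' in action['_source']:
        # record each distinct number of items seen
        nums[len(action['_source']['items'])] = True
    else:
        # note: 'null' is assigned before the increment, so it ends up
        # one less than the number of documents without items
        nums['null'] = count
        count += 1

print(nums)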

{3: True, 10: True, 'null': 261, 4: True}
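
A tally of how many documents fall into each bucket may be easier to read than the flags above. The sketch below uses `collections.Counter` on made-up actions that mimic only the `_source` shape inspected here. `items_histogram` is a hypothetical helper.

```python
from collections import Counter

# Made-up actions that mimic only the '_source' shape inspected above.
sample_actions = [
    {"_source": {"items": [1, 2, 3]}},
    {"_source": {"items": [1, 2, 3]}},
    {"_source": {"searchInformation": []}},
]


def items_histogram(actions):
    """Count documents by number of items, with one bucket for 'no items'."""
    counts = Counter()
    for action in actions:
        source = action["_source"]
        key = len(source["items"]) if "items" in source else "no items"
        counts[key] += 1
    return counts


histogram = items_histogram(sample_actions)
print(dict(histogram))
```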


In [9]:
# convert the list to an iterator (low-level detail)
actions = iter(actions)


In [10]:
# bulk load the documents into Elasticsearch

helpers.bulk(es,actions)


BulkIndexError: ('1 document(s) failed to index.', [{u'index': {u'status': 400, u'_type': u'website', u'_id': u'24', u'error': u'MapperParsingException[failed to parse [items.pagemap.metatags.og:updated_time]]; nested: MapperParsingException[failed to parse date field [7/8/2015 10:07:34 AM], tried both date format [dateOptionalTime], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: "7/8/2015 10:07:34 AM" is malformed at "/8/2015 10:07:34 AM"]; ', u'_index': u'ipythonsearch'}}])
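
The traceback above shows that the value `7/8/2015 10:07:34 AM` was rejected as a date. One possible remedy is to normalize such values to ISO 8601 before indexing. This is a sketch, not part of greenCall: `to_iso8601` is a hypothetical helper, and it assumes the value is in US month/day order, which the traceback alone does not confirm.

```python
from datetime import datetime


def to_iso8601(value, source_format="%m/%d/%Y %I:%M:%S %p"):
    """Parse a 12-hour, slash-separated timestamp and return ISO 8601 text."""
    return datetime.strptime(value, source_format).isoformat()


# Assumes month/day order; swap %m and %d in the format if it is day/month.
normalized = to_iso8601("7/8/2015 10:07:34 AM")
print(normalized)
```

A value that does not match the format raises `ValueError`, so callers can decide whether to skip the field or leave it as plain text.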

In [None]:
# one document failed to index because its og:updated_time value
# ("7/8/2015 10:07:34 AM") is not in a date format Elasticsearch accepts for that field