# Using the "greenCall" Python package

The greenCall Python package currently requires a series of function calls
to work through the data pipeline. The pipeline consists of the following steps:

1. Read the CSV file, formatted as (unique id, query term)
2. Request information from the Search API
3. Write results to disk in JSON format
4. Bulk upload results to Elasticsearch

This notebook provides a concise example of how to work through the data pipeline.

## Setting Variables

In [1]:
# Maximum number of query items to request from API
QUERY_LIMIT = 20

# Maximum number of requests deferred
MAX_RUN = 20

# Number of seconds to wait between requests
RATE_LIMIT = 1

# Path to the original Excel file, converted to CSV
filepath = 'examples/finance_demo.csv'

# Path to converted file to be used for API requests
outpath = 'examples/ipython_demo.json'

# Path to the results returned from the API via the networking engine
resultspath = 'results.json'

# Specify a document template for Elasticsearch
esformat = {
            "_index": "ipythonsearch",
            "_type": "website",
            "_id": None,
            "_source": ""
        }
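
The template above leaves `_id` and `_source` empty, presumably so they can be filled in once per search result. How `prepare_all_documents` does this is not shown in this notebook, so the following is only a minimal sketch of the idea, written for Python 3. The helper names `make_action` and `make_actions` and the sample results are hypothetical, not part of greenCall.

```python
import copy

# Same shape as the esformat template defined above.
template = {
    "_index": "ipythonsearch",
    "_type": "website",
    "_id": None,
    "_source": "",
}


def make_action(template, doc_id, source):
    """Return a filled-in copy of the template for one document (hypothetical helper)."""
    action = copy.deepcopy(template)  # never mutate the shared template
    action["_id"] = doc_id
    action["_source"] = source
    return action


def make_actions(template, results):
    """Build one bulk action per (id, result) pair, ordered by id."""
    return [make_action(template, doc_id, source)
            for doc_id, source in sorted(results.items())]


# Made-up results, keyed by unique id.
sample_results = {2: {"items": []}, 1: {"items": ["example"]}}
actions = make_actions(template, sample_results)
print([action["_id"] for action in actions])  # prints [1, 2]
```

Copying the template for each document keeps the original dict reusable; filling it in place would make every action point at the same object.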


## Start Logging

In [2]:
from greencall.utils.utilityBelt import enable_log

# Log everything, always.
enable_log('crawlah')


## Step 1 (Reading the CSV file)

In [3]:
from greencall.csvclean.inputCsv import tojson

# Convert the input file from CSV to JSON
tojson(filepath, outpath, QUERY_LIMIT)
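
The internals of `tojson` are not shown in this notebook. As a rough illustration of the step, assuming each CSV row holds a unique id followed by a query term, a conversion capped at a row limit might look like the sketch below (Python 3). The function name `csv_to_json_sketch` and the sample rows are made up for this example.

```python
import csv
import io
import json


def csv_to_json_sketch(csv_text, limit):
    """Map unique id -> query term for at most `limit` rows of CSV text."""
    queries = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(queries) >= limit:
            break
        if len(row) < 2:
            continue  # skip blank or malformed rows
        unique_id, term = row[0].strip(), row[1].strip()
        queries[unique_id] = term
    return json.dumps(queries, sort_keys=True)


# Three made-up rows; only the first two are kept because limit=2.
sample = "1,acme corp\n2,globex\n3,initech\n"
print(csv_to_json_sketch(sample, limit=2))
# prints {"1": "acme corp", "2": "globex"}
```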

## Step 2 (Request information from the Search API)

In [4]:
from greencall.csvclean.clientConversion import runConversion
from examples.secret import secret_key

# Use the API client to convert query terms into the correct format
# for API requests. Currently hard-coded for the Google Search API.
adict = runConversion(jsonpath=outpath,
                      secretKey=secret_key)

## Step 3 (Write results to disk in JSON format)

In [5]:
from twisted.internet import reactor
from greencall.crawlah import getPages

# Load the network engine which handles API requests (gas & brakes)
gp = getPages(adict, MAX_RUN, RATE_LIMIT)

# Start the networking engine
gp.start()
reactor.run()


ERROR:root:[Failure instance: Traceback (failure with no frames): <class 'twisted.web.error.Error'>: 503 Service Unavailable
]
ERROR:root:[Failure instance: Traceback (failure with no frames): <class 'twisted.web.error.Error'>: 500 Internal Server Error
]
ERROR:root:[Failure instance: Traceback (failure with no frames): <class 'twisted.web.error.Error'>: 500 Internal Server Error
]
ERROR:root:[Failure instance: Traceback (failure with no frames): <class 'twisted.web.error.Error'>: 500 Internal Server Error
]
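
`RATE_LIMIT` sets how many seconds pass between requests. The internals of `getPages` are not shown in this notebook, and the real engine runs on the Twisted reactor, so the following is only a blocking, standard-library illustration of the spacing idea (Python 3). The `throttled` helper and its `sleep` parameter are hypothetical names.

```python
import time


def throttled(items, seconds_between, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(seconds_between)  # no pause before the first item
        yield item


# Record the pauses instead of actually sleeping, to show the spacing.
pauses = []
sent = list(throttled(["q1", "q2", "q3"], 1, sleep=pauses.append))
print(sent, pauses)  # prints ['q1', 'q2', 'q3'] [1, 1]
```

Passing the sleep function as a parameter makes the pacing easy to inspect without waiting; dropping the `sleep=` argument gives real one-second gaps.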


## Step 4 (Bulk upload into Elasticsearch)

In [6]:
import itertools
import string
import json
from elasticsearch import Elasticsearch, helpers

from greencall.utils.loadelastic import read_json, prepare_all_documents
from greencall.csvclean.inputCsv import read_csv

# Connect to Elasticsearch
es = Elasticsearch()

# Read results from the Search API
results = read_json(resultspath)

# Take the length of the results as a sanity check
len(results)


20

In [7]:
# Convert the result keys from strings to integers (rebuilds the dict)
results = {int(key): value for key, value in results.items()}

In [8]:
# Prepare all documents for bulk upload into Elasticsearch
actions = prepare_all_documents(results, esformat, read_csv(filepath, QUERY_LIMIT))


11
26
41
56
71
85
100
114
125
140
140
151
151
166
166
180
180
194
209
224


In [9]:
# Convert the list to an iterator (low-level detail)
actions = iter(actions)


In [10]:
# Bulk-load the documents into Elasticsearch
helpers.bulk(es, actions)


BulkIndexError: ('2 document(s) failed to index.', [{u'index': {u'status': 400, u'_type': u'website', u'_id': u'18', u'error': u'MapperParsingException[failed to parse [items.pagemap.metatags.og:updated_time]]; nested: MapperParsingException[failed to parse date field [6/26/2015 10:09:00 AM], tried both date format [dateOptionalTime], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: "6/26/2015 10:09:00 AM" is malformed at "/26/2015 10:09:00 AM"]; ', u'_index': u'ipythonsearch'}}, {u'index': {u'status': 400, u'_type': u'website', u'_id': u'26', u'error': u'MapperParsingException[failed to parse [items.pagemap.review.datepublished]]; nested: MapperParsingException[failed to parse date field [08/08/2013], tried both date format [dateOptionalTime], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: "08/08/2013" is malformed at "/08/2013"]; ', u'_index': u'ipythonsearch'}}])

In [None]:
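# Two documents (_id 18 and _id 26) failed to index because Elasticsearch could
# not parse their date fields ("6/26/2015 10:09:00 AM" and "08/08/2013") with
# its dateOptionalTime format.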
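
One possible way to deal with the `BulkIndexError` above is to rewrite the offending date strings as ISO 8601 before indexing. The sketch below (Python 3) covers only the two formats that appear in the error message. The helper name `to_iso8601` and the format list are assumptions for illustration, not part of greenCall, and the sketch does not show how to locate the date fields inside each document.

```python
from datetime import datetime

# (strptime format, whether the value carries a time component).
# Only the two formats seen in the error message are covered.
KNOWN_FORMATS = [
    ("%m/%d/%Y %I:%M:%S %p", True),   # e.g. 6/26/2015 10:09:00 AM
    ("%m/%d/%Y", False),              # e.g. 08/08/2013
]


def to_iso8601(value):
    """Return an ISO 8601 string for known US-style dates, else the input unchanged."""
    for fmt, has_time in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next known format
        return parsed.isoformat() if has_time else parsed.date().isoformat()
    return value


print(to_iso8601("6/26/2015 10:09:00 AM"))  # prints 2015-06-26T10:09:00
print(to_iso8601("08/08/2013"))             # prints 2013-08-08
```

Values that match neither format are returned untouched, so strings that are already valid are not altered.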