# Using the "greenCall" Python package

Currently, the greenCall Python package requires a series of function calls
to move data through the pipeline. The pipeline consists of the following steps:

1. Read the csv file formatted as (unique id, query term)
2. Request information from the Search API 
3. Write results to disk in JSON format
4. Bulk upload results to Elasticsearch

This notebook provides a concise example of how to work through the data pipeline.

## Step 1 (Reading the CSV file)

In [17]:
from greencall.csvclean.inputCsv import tojson

# Path to original excel file, converted to CSV
filepath = 'examples/finance_demo.csv'

# Path to converted file to be used for API requests
outpath = 'examples/ipython_demo.json'

# Convert the input file from CSV to JSON
tojson(filepath, outpath)
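
To make the conversion step concrete, here is a minimal sketch of what turning a (unique id, query term) CSV into JSON could look like using only the standard library. It is not the implementation of `tojson`; the function name `csv_to_json_text`, the output layout (one JSON object keyed by id), and the sample rows are all assumptions made for illustration, and the sketch assumes Python 3.

```python
import csv
import io
import json


def csv_to_json_text(csv_text):
    """Map each (unique id, query term) row to an entry in one JSON object.

    Hypothetical helper: the layout greenCall's tojson writes may differ.
    """
    rows = csv.reader(io.StringIO(csv_text))
    # Skip rows with fewer than two columns; ids stay strings because
    # JSON object keys are always strings.
    mapping = {row[0].strip(): row[1].strip() for row in rows if len(row) >= 2}
    return json.dumps(mapping, sort_keys=True)


# Made-up sample rows, not taken from finance_demo.csv
sample = "1001,first query\n1002,second query\n"
converted = csv_to_json_text(sample)
print(converted)
```

Keeping the id as the key means each API response can later be matched back to the row that produced it.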

## Step 2 (Request information from the Search API)

In [18]:
from greencall.csvclean.clientConversion import runConversion
from examples.secret import secret_key

# Use the API client to convert query terms into the correct format
# for API requests. Currently hard-coded for the Google Search API
adict = runConversion(jsonpath=outpath,
                      secretKey=secret_key)

## Step 3 (Write results to disk in JSON format)

In [19]:
from twisted.internet import reactor
from greencall.crawlah import getPages
from greencall.utils.utilityBelt import enable_log

# Log the interactions with the API
enable_log('crawlah')

# Load the network engine which handles API requests (gas & brakes)
gp = getPages(adict)

# Start the networking engine
gp.start()
reactor.run()


## Step 4 (Bulk upload into Elasticsearch)

In [20]:
# results returned from the API via the networking engine
resultspath = 'results.json'

# reference the original data for updating elasticsearch documents
accountdict = 'examples/finance_demo.csv'


In [21]:
import itertools
import string
import json
from elasticsearch import Elasticsearch, helpers

from greencall.utils.loadelastic import read_json, prepare_all_documents
from greencall.csvclean.inputCsv import read_csv

# Connect to Elasticsearch
es = Elasticsearch()

# Specify a document template for Elasticsearch
esformat = {
            "_index": "ipythonsearch",
            "_type": "website",
            "_id": None,
            "_source": ""
        }

# Read results from the Search API
results = read_json(resultspath)

# Take the length of the results as a sanity check
len(results)



20

In [22]:
# Check that every key can be read as an integer.
# Note: this only rebinds the loop variable; `results` itself is not modified.
for key in results.keys():
    key = int(key)
    print("key: {}, type: {}".format(key, type(key)))


key: 11111437, type: <type 'int'>
key: 11111253, type: <type 'int'>
key: 11111346, type: <type 'int'>
key: 11111417, type: <type 'int'>
key: 11111190, type: <type 'int'>
key: 11111169, type: <type 'int'>
key: 11111294, type: <type 'int'>
key: 11111495, type: <type 'int'>
key: 11111757, type: <type 'int'>
key: 11111709, type: <type 'int'>
key: 11111337, type: <type 'int'>
key: 11111336, type: <type 'int'>
key: 11111135, type: <type 'int'>
key: 11111445, type: <type 'int'>
key: 11111585, type: <type 'int'>
key: 11111143, type: <type 'int'>
key: 11111157, type: <type 'int'>
key: 11111589, type: <type 'int'>
key: 11111146, type: <type 'int'>
key: 11111626, type: <type 'int'>
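
The loop above only rebinds the local name `key`, so `results` keeps its original keys. If integer keys are actually needed downstream (the source does not say whether `prepare_all_documents` requires them), one option is to build a new dictionary. The sketch below uses a small made-up stand-in for `results`; the name `results_int` is hypothetical.

```python
# Hypothetical stand-in for the dictionary returned by read_json
results = {"1001": {"items": []}, "1002": {"items": []}}

# Build a new dictionary whose keys are integers; the values are reused as-is
results_int = {int(key): value for key, value in results.items()}

for key in sorted(results_int):
    print("key: {}, type: {}".format(key, type(key).__name__))
```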


In [23]:
# prepare all documents for bulk upload into elasticsearch
actions = prepare_all_documents(results, esformat, read_csv(accountdict))


11
26
41
56
71
85
99
113
124
139
154
165
180
195
206
220
235
249
264
279


In [24]:
# convert the list to an iterator (low-level detail)
actions = iter(actions)
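
Strictly speaking, `iter()` on a list returns an iterator rather than a generator. When the number of documents grows, a generator function that yields one action at a time avoids building the whole list first. The following is a rough sketch under stated assumptions: `generate_actions` is a hypothetical helper, not part of greenCall, and taking `_id` from the dictionary key is purely illustrative, since `prepare_all_documents` may assign ids differently.

```python
import copy

# Same document template as in the cell above
esformat = {
    "_index": "ipythonsearch",
    "_type": "website",
    "_id": None,
    "_source": "",
}


def generate_actions(results, template):
    """Yield one bulk action per result without building a list (hypothetical)."""
    for doc_id, source in results.items():
        # Copy the template so that actions never share state
        action = copy.deepcopy(template)
        action["_id"] = doc_id
        action["_source"] = source
        yield action


# Made-up results used only to exercise the generator
demo_results = {"1": {"title": "a"}, "2": {"title": "b"}}
demo_actions = list(generate_actions(demo_results, esformat))
print(len(demo_actions))
```

A generator like this could be passed wherever an iterable of actions is expected, in place of the list-backed iterator.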


In [25]:
# bulk load the documents into Elasticsearch

helpers.bulk(es, actions)


BulkIndexError: ('2 document(s) failed to index.', [{u'index': {u'status': 400, u'_type': u'website', u'_id': u'24', u'error': u'MapperParsingException[failed to parse [items.pagemap.metatags.og:updated_time]]; nested: MapperParsingException[failed to parse date field [6/26/2015 10:09:00 AM], tried both date format [dateOptionalTime], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: "6/26/2015 10:09:00 AM" is malformed at "/26/2015 10:09:00 AM"]; ', u'_index': u'ipythonsearch'}}, {u'index': {u'status': 400, u'_type': u'website', u'_id': u'26', u'error': u'MapperParsingException[failed to parse [items.pagemap.review.datepublished]]; nested: MapperParsingException[failed to parse date field [08/08/2013], tried both date format [dateOptionalTime], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: "08/08/2013" is malformed at "/08/2013"]; ', u'_index': u'ipythonsearch'}}])

In [1]:
# Two documents failed to index because their date fields do not match
# the date format Elasticsearch expects (dateOptionalTime).
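
One possible way to deal with these failures is to normalise date strings to ISO 8601 before the upload, a form that the `dateOptionalTime` format named in the error should accept. The sketch below handles only the two layouts that appear in the `BulkIndexError` above and assumes month/day/year order; `KNOWN_FORMATS` and `to_iso` are hypothetical names, and locating the date fields inside each document is left out.

```python
from datetime import datetime

# The two layouts seen in the BulkIndexError; extend as new ones appear.
# Assumption: month comes before day, as "6/26/2015" suggests.
KNOWN_FORMATS = ("%m/%d/%Y %I:%M:%S %p", "%m/%d/%Y")


def to_iso(value):
    """Return an ISO 8601 string, or None if no known layout matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).isoformat()
        except ValueError:
            continue
    return None


print(to_iso("6/26/2015 10:09:00 AM"))
print(to_iso("08/08/2013"))
```

Returning `None` for unrecognised values leaves the choice open between dropping the field and skipping the document.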