## Bulk Upload Test Data

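Sample data containing what look like civil-rights-related tweets. Note that the `source` field in the rows previewed below reads `Google`, so the records may be search-result snippets rather than tweets.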

Sources:  
[Python Elastic Search](https://elasticsearch-py.readthedocs.io/en/master/api.html)

In [16]:
import pandas as pd
import os
import json

from elasticsearch import (
    Elasticsearch,
    helpers
)
from datetime import datetime
from io import StringIO
from configparser import ConfigParser

data_dir = os.path.join(os.pardir,'data')
config_file = os.path.join(os.pardir,'config','config.ini')

def get_ini_vals(ini_file, section):
    config = ConfigParser()
    config.read(ini_file)
    return config[section]

es_creds = get_ini_vals(config_file, 'elasticsearch')

In [4]:
# connect to Elasticsearch using the host and port from config.ini
es = Elasticsearch(
    [es_creds['host']],
    http_auth=('', ''),   # empty username and password
    port=es_creds['port'],
    use_ssl=False
)
print(es.info())

{'cluster_name': 'elasticsearch', 'tagline': 'You Know, for Search', 'version': {'build_snapshot': False, 'lucene_version': '6.4.1', 'number': '5.2.1', 'build_hash': 'db0d481', 'build_date': '2017-02-09T22:05:32.386Z'}, 'name': 'pvnM2mO', 'cluster_uuid': 'lf55i5JCSBq8nk_ZrTyTKQ'}


In [15]:
# read sample twitter data
data = os.path.join(data_dir, 'static_data', 'data4.txt')

content = []
with open(data, 'r') as f:
    for line in f:
        content.append(json.loads(line))

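# load the parsed records into a DataFrame and preview the first rows
# (a bare `content[:5]` mid-cell would not be displayed, so it is omitted)
df = pd.DataFrame(content)
df.head(3)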

Unnamed: 0,keywords,search_date,source,text,url
0,"[donation, civil rights]",02/19/2017,Google,We are the individual donor's first source for...,https://www.charitynavigator.org/index.cfm?bay...
1,"[donation, civil rights]",02/19/2017,Google,Find ratings and read reviews of Civil Rights ...,http://greatnonprofits.org/categories/view/civ...
2,"[donation, civil rights]",02/19/2017,Google,"Nov 15, 2016 ... Traffic to the site was so he...",http://time.com/money/4566160/trump-election-c...


## Helper Functions for ES

For large files it is safer to feed bulk uploads from a generator function, so that documents are produced one at a time instead of being held in memory all at once (although here the data has already been read in full). Inside the generator, each item or line of the dataset can be parsed into the document ID and document body we want. For now, the data is uploaded almost as-is for testing.

In this example, the documents are uploaded to the shared ES cluster under the index 'twitter'.
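As a rough illustration of the generator idea, the sketch below reads JSON lines lazily and reshapes each record before yielding a bulk action. The function name `iter_actions`, the date conversion, and the choice to omit `_id` (so that Elasticsearch assigns one) are assumptions made for this example and are not part of the notebook's own helpers. It also assumes `search_date` is formatted as MM/DD/YYYY, as in the sample rows above.

```python
import json
from datetime import datetime
from io import StringIO


def iter_actions(lines, index="twitter", doc_type="tweet"):
    """Lazily turn JSON lines into bulk actions, one document at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        # Assumption: search_date is MM/DD/YYYY, as in the sample rows.
        if "search_date" in doc:
            parsed = datetime.strptime(doc["search_date"], "%m/%d/%Y")
            doc["search_date"] = parsed.date().isoformat()
        # No "_id" is set here, so Elasticsearch would assign one on indexing.
        yield {"_index": index, "_type": doc_type, "_source": doc}


# Stand-in for an open file handle; a real run would pass the file object.
sample = StringIO(
    '{"text": "example one", "search_date": "02/19/2017"}\n'
    '\n'
    '{"text": "example two", "search_date": "02/20/2017"}\n'
)
actions = list(iter_actions(sample))
print(actions[0]["_source"]["search_date"])
```

With a live connection, the generator could be passed straight to the bulk helper, for example `helpers.bulk(es, iter_actions(f))` inside a `with open(data) as f:` block, so the file is never loaded into memory as a whole.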

In [26]:
def tweet_to_es(tweet_collection):
    """Generator that yields an (id, document) pair
    for each item in the collection.
    """
    for tweet_dict in tweet_collection:
        # use the current timestamp as the Elasticsearch document id
        idx = datetime.now().isoformat()

        # Keep dictionary as-is
        yield idx, tweet_dict   
        
def es_bulk_add(es, collection):
    """Bulk-index every document in `collection`.
    Any iterable of documents works (for example, records parsed
    from a raw file stream), since it is only iterated over once.
    """
    bulk = ({
            "_index" : "twitter",
            "_type"  : "tweet",
            "_id"    : idx,
            "_source": tweet_d,
        } for idx, tweet_d in tweet_to_es(collection)
    )
    
    # errors from the bulk helper propagate to the caller unchanged
    helpers.bulk(es, bulk)
        
es_bulk_add(es, content)   

## Sample Query

Inspired by: https://qbox.io/blog/python-scripts-interact-elasticsearch-examples

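Note the quoting: the outer single quotes delimit the Python string, while the inner double quotes make `black lives` a phrase query. The search also works without the `index` parameter, in which case it runs against every index on the cluster.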
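For comparison, the `q` parameter takes Lucene query string syntax, so a roughly equivalent request body can be built with a `query_string` query. The helper name `query_string_body` is hypothetical and exists only for this sketch. The block builds the request bodies without contacting the cluster.

```python
def query_string_body(query, default_field=None):
    """Build a request body roughly equivalent to passing `q=query`."""
    query_string = {"query": query}
    if default_field is not None:
        # Optional: field used for terms that carry no field prefix.
        query_string["default_field"] = default_field
    return {"query": {"query_string": query_string}}


# Same two queries as in the cell below, expressed as request bodies.
phrase_only = query_string_body('text: "black lives"')
phrase_plus_term = query_string_body('text: "black lives" donation')
print(phrase_only)
```

Against a live connection, a body like this could be sent with `es.search(index="twitter", body=phrase_only, size=3)`.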

In [25]:
queries = [
    'text: "black lives"',
    'text: "black lives" donation'   # `donation` has no field prefix, so it is matched against the default field
]

for query in queries:
    results = es.search(index="twitter", 
                        q=query, 
                        size=3)     # result set limit
    print("query %s results" % query)
    print(results)
    print("\n")

query text: "black lives" results
{'took': 5, 'hits': {'max_score': 11.104426, 'hits': [{'_id': '2017-02-20T21:47:20.955157', '_index': 'twitter', '_source': {'url': 'https://blacklivesmatter.ca/', 'search_date': '02/19/2017', 'source': 'Google', 'keywords': ['donation', 'black lives matter'], 'text': 'BLACK LIVES MATTER – TORONTO. donaterequest a speaker ... Speakers: \nspeakers@blacklivesmatter.ca. Copyright Black Lives Matter – Toronto | 2016\xa0...'}, '_type': 'tweet', '_score': 11.104426}, {'_id': '2017-02-20T21:47:20.956439', '_index': 'twitter', '_source': {'url': 'http://blacklivesmatter.com/', 'search_date': '02/19/2017', 'source': 'Google', 'keywords': ['donation', 'black rights'], 'text': 'Recent Statements. Black Lives Matter Stands In Solidarity with Water Protectors \nat Standing Rock. bgrnd-with-white-down\xa0...'}, '_type': 'tweet', '_score': 7.840791}, {'_id': '2017-02-20T21:47:20.955428', '_index': 'twitter', '_source': {'url': 'https://www.facebook.com/blacklivesmatt