## Initializing Elasticsearch

We begin by initializing Elasticsearch with the names of our index and type. You should already have Elasticsearch installed and have run the "bin/elasticsearch" command. For more information, visit: https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html.

Elasticsearch uses slightly different terminology from traditional databases.

The terms map as follows:
Database = Index
Table = Type
Row = Document

Here we initialize an object where `corpus` is the name of the index, with a single type `articles` and a field named `sentence`.

In [1]:
from snorkel import SnorkelSession
from snorkel.models import Document, Sentence
from elastics import elasticSession, printResults, deleteIndex

# import os

# # TO USE A DATABASE OTHER THAN SQLITE, USE THIS LINE
# # Note that this is necessary for parallel execution amongst other things...
# os.environ['SNORKELDB'] = 'postgres:///snorkel-intro'

session = SnorkelSession()
eSearch = elasticSession("corpus", "articles", "sentence")

### Check Connection

Before we begin indexing, we can check our connection with Elasticsearch and list our existing indices.

In [2]:
eSearch.getIndices()

# To delete an index, pass its name; pass "_all" to delete all indices
# deleteIndex("indexName")

Index Information: 
 
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size



## Indexing the Corpus

Now that we have an established connection, we can begin indexing. By default, each document contains a field for the sentence ID number, a field for the sentence itself, and an empty vector of 'o's (this will be useful for generating candidate tags below). Once everything has been indexed, we can check the index status, including the number of documents it contains and its size.
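
To make the default document layout more concrete, here is a minimal sketch that builds one document from a sentence. The helper name `build_document`, the whitespace tokenization, and the choice of `tagged` as the field that holds the 'o' placeholders are assumptions made for illustration; the `elastics` helper may construct its documents differently.

```python
def build_document(line_num, sentence):
    """Build a dict shaped like one indexed document (hypothetical helper)."""
    # One 'o' placeholder per whitespace-separated token (assumed tokenization).
    placeholders = " ".join("o" for _ in sentence.split())
    return {
        "lineNum": line_num,      # sentence ID number
        "sentence": sentence,     # raw sentence text
        "tagged": placeholders,   # assumed field name for the empty tag vector
    }


doc = build_document(1, "He is married with three children.")
```

Each placeholder lines up with one token, so a later tagging step can overwrite individual positions without changing the length of the vector.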

In [3]:
eSearch.elasticIndex(Document)
eSearch.getIndices()

Beginning indexing
2591 items indexed
Index Information: 
 
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   corpus HCOENVAeRoKTrHduXdVChQ   5   1      67820            0       24mb           24mb



### Visualize Index

To better visualize our data, we get the mapping of the index. The output shows that each document in `articles` has four fields: `lineNum`, `tagged`, `fillCand` and `sentence`.

In [4]:
eSearch.getIndexMap()

Index Mapping
{
  "corpus": {
    "mappings": {
      "articles": {
        "properties": {
          "lineNum": {
            "type": "integer"
          }, 
          "tagged": {
            "type": "text", 
            "analyzer": "my_stop"
          }, 
          "fillCand": {
            "type": "text", 
            "analyzer": "my_stop"
          }, 
          "sentence": {
            "type": "text", 
            "analyzer": "my_analyzer"
          }
        }
      }
    }
  }
}
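
If the mapping is available as a Python dict, its fields can be listed programmatically. The sketch below assumes the JSON above has already been parsed into a dict; whether `getIndexMap()` returns such a dict or only prints it is not shown here, and `list_fields` is a hypothetical helper.

```python
def list_fields(mapping, index, doc_type):
    """Return (field, type, analyzer) tuples for one type, sorted by field name."""
    properties = mapping[index]["mappings"][doc_type]["properties"]
    return [
        (name, spec.get("type"), spec.get("analyzer"))
        for name, spec in sorted(properties.items())
    ]


# The mapping printed above, written out as a Python dict.
mapping = {
    "corpus": {
        "mappings": {
            "articles": {
                "properties": {
                    "lineNum": {"type": "integer"},
                    "tagged": {"type": "text", "analyzer": "my_stop"},
                    "fillCand": {"type": "text", "analyzer": "my_stop"},
                    "sentence": {"type": "text", "analyzer": "my_analyzer"},
                }
            }
        }
    }
}

fields = list_fields(mapping, "corpus", "articles")
```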


### Visualizing a Document

We can get any document by its line number. Here we get the sentence with line number 42743 and print the values stored in its `sentence` and `lineNum` fields.

In [5]:
result = eSearch.getDoc(42743)
print "Sentence"
print result['_source']['sentence']
print "Sentence number"
print result['_source']['lineNum']


Sentence
First time round: Dwain Wallace and Kimberley Bailey were married in August 1984.
Sentence number
42743


## Querying the Corpus

The type of query that we perform is defined by the first argument. Each query also accepts optional `size` and `distance` keyword parameters. `size` specifies the number of results to return, and `distance` is the number of positions by which the terms in the query may be separated from each other. `size` and `distance` default to 5 and 0 respectively. Our distance parameter is known as the slop factor in Elasticsearch; you can read more about it here:
https://www.elastic.co/guide/en/elasticsearch/guide/current/slop.html
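
As a rough sketch of how `size` and `distance` could translate into a request, the function below builds an Elasticsearch `match_phrase` query body with a `slop` value. How `searchIndex` actually builds its requests is not shown in this notebook, so `phrase_query_body` and its parameter names are assumptions.

```python
def phrase_query_body(field, text, distance=0, size=5):
    """Build a search body for a phrase query with a slop factor (hypothetical helper)."""
    return {
        "size": size,  # number of hits to return
        "query": {
            "match_phrase": {
                # slop = number of position moves allowed between the terms
                field: {"query": text, "slop": distance}
            }
        },
    }


body = phrase_query_body("sentence", "white trousers", distance=0, size=3)
```

A body like this could be passed to the search endpoint of an Elasticsearch client; only the dict construction is shown so that the example runs without a server.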

To display the results we can use the `printResults` function, which takes the search result as its first parameter, followed by the names of the fields you wish to print out.
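
The sketch below shows one way such a display function could walk a search response. It is not the `elastics` implementation: `format_results` is a hypothetical name, and it assumes a response dict in which `hits.total` is a plain integer and each hit stores its fields under `_source`.

```python
def format_results(response, *fields):
    """Render the hit count and the chosen fields of each hit as text."""
    hits = response["hits"]
    lines = ["Number of hits", str(hits["total"])]
    for rank, hit in enumerate(hits["hits"], start=1):
        lines.append("Result {}".format(rank))
        lines.append("-" * 19)
        for field in fields:
            lines.append(field)                       # field name
            lines.append(str(hit["_source"][field]))  # field value
        lines.append("")                              # blank line between results
    return "\n".join(lines)


# Hand-written response used only to exercise the function.
response = {
    "hits": {
        "total": 1,
        "hits": [{"_source": {"sentence": "He is married.", "lineNum": 7}}],
    }
}
text = format_results(response, "sentence", "lineNum")
```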

*Queries are case sensitive


### Arbitrary Search

Once we have all our documents indexed, we can perform a simple query. Using the `entire` keyword, we query the sentence field of every document for the words married OR children. Matches that contain both words in their entirety are scored higher. After performing the query we print the results, which are sorted by score in descending order.
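
For reference, an OR search over the terms of a query corresponds to an Elasticsearch `match` query with the `or` operator, whose hits come back ordered by relevance score by default. The sketch below builds such a body; that the `entire` keyword maps to this exact query is an assumption, and `match_any_query_body` is a hypothetical helper.

```python
def match_any_query_body(field, text, size=5):
    """Build a search body matching documents that contain any of the terms."""
    return {
        "size": size,
        "query": {
            "match": {
                # "or" means a document matches if it contains at least one term
                field: {"query": text, "operator": "or"}
            }
        },
    }


body = match_any_query_body("sentence", "married children")
```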


In [6]:
query = "married children"
searchResult = eSearch.searchIndex("entire", query)
printResults(searchResult, "sentence")

Number of hits 
2354
Result 1
-------------------
sentence
He is married with        three children.     

Result 2
-------------------
sentence
He was married twice and had five children.      

Result 3
-------------------
sentence
He was later married and had two children.   

Result 4
-------------------
sentence
She got married to my father, Joseph Ewherido, with whom she had eight children, all males.

Result 5
-------------------
sentence
All of the women were married (or divorced!), and some had children.   



### Search Containing All Values in Query

Specifying a distance parameter forces the query to return only results that contain every word in the query. Explicitly setting distance=0 returns results where the query terms appear side by side.
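
To illustrate what "side by side" means, the sketch below checks in plain Python whether the words of a query appear as consecutive tokens in a sentence. It only mimics the idea: the whitespace tokenization and punctuation stripping are assumptions and do not reproduce the analyzers configured in the index.

```python
import string


def contains_adjacent_phrase(sentence, phrase):
    """Return True if the words of `phrase` appear consecutively, in order."""
    # Split on whitespace and drop surrounding punctuation (assumed tokenization).
    tokens = [t.strip(string.punctuation) for t in sentence.split()]
    words = phrase.split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


sentence = "Clad in a breton top, white trousers and a matching cap."
found = contains_adjacent_phrase(sentence, "white trousers")
```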

In [7]:
query = "white trousers"
searchResult = eSearch.searchIndex("entire", query, distance=0, size=3)
printResults(searchResult, "sentence")

Number of hits 
8
Result 1
-------------------
sentence
Dressed in a smart black shirt and white trousers, the lawyer looked in good spirits despite a difficult fight ahead.     

Result 2
-------------------
sentence
Amal, who was dressed in a smart black shirt and white trousers, looked in good spirits on Monday despite a difficult fight ahead.   

Result 3
-------------------
sentence
Clad in a breton top, white trousers and a matching cap, Bob, 63, looked like he couldn't wait to get underway with the festivities.       



### Positional Search

Using a distance value in the queries above returns results that no longer enforce the ordering of the query terms. To run a query where the ordering matters, we use the `position` keyword. Here, the distance defines the number of position shifts allowed to bring the query terms side by side, while respecting their order.

The query has to be passed in as its individual terms, each as a separate parameter. This type of query accepts any number of terms.
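
One Elasticsearch query that enforces term order within a slop is `span_near` with `in_order` set to true. The sketch below builds such a body from a list of terms; whether the `position` keyword uses this query internally is an assumption, and `ordered_terms_query_body` is a hypothetical helper.

```python
def ordered_terms_query_body(field, terms, distance=0, size=5):
    """Build a search body requiring the terms in order, within `distance` positions."""
    # span_term clauses are not analyzed, so each term must match an indexed token exactly.
    clauses = [{"span_term": {field: term}} for term in terms]
    return {
        "size": size,
        "query": {
            "span_near": {
                "clauses": clauses,
                "slop": distance,   # allowed gap between consecutive terms
                "in_order": True,   # terms must appear in the given order
            }
        },
    }


body = ordered_terms_query_body("sentence", ["Pennsylvania", "Colorado"], distance=20)
```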

In [8]:
searchResult = eSearch.searchIndex("position", "Pennsylvania", "Colorado", distance=20)
printResults(searchResult,"sentence")

Number of hits 
2
Result 1
-------------------
sentence
It makes stops in North Carolina, Washington, D.C., Pennsylvania, Massachusetts, Minnesota, Illinois, Missouri, New York, Ohio, Florida, Georgia, Texas, California and Colorado.   

Result 2
-------------------
sentence
Dozens of confused pedestrians trying to find detours around the barricades early Friday afternoon stopped on a corner of Arch Street in front of the Pennsylvania Convention Center to ask directions from Peter and Toby Christensen, an LDS couple from Colorado working as volunteers this week for the Catholic Archdiocese of Philadelphia.   



## Creating the Candidate Field

We repeat our definition of the Spouse Candidate subclass from the tutorial.

In [9]:
from snorkel.models import candidate_subclass
Spouse = candidate_subclass('Spouse', ['person1', 'person2'])

### Generate Tags

Using our Spouse candidates, we now generate a new field that tags the corresponding positions in each sentence. The call returns a 2D array of the sentence numbers and candidates that could not be tagged.

In [10]:
tagCands = eSearch.generateTags(Spouse)
for a in tagCands:
    print a
    break

27644 candidates tagged
[65997, u's']
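
The idea behind the tag field can be sketched in a few lines: start from a vector of 'o's and overwrite the positions covered by a candidate with OBJECT. This is not the `generateTags` implementation; the function name, the whitespace tokenization, and the representation of candidates as (start, end) word-index pairs are assumptions.

```python
def tag_tokens(sentence, candidate_spans):
    """Return a tag string with OBJECT at candidate positions and 'o' elsewhere."""
    tokens = sentence.split()
    tags = ["o"] * len(tokens)
    for start, end in candidate_spans:
        # Spans are half-open word-index ranges: [start, end).
        for i in range(start, min(end, len(tokens))):
            tags[i] = "OBJECT"
    return " ".join(tags)


sentence = "Parenthood star Erika Christensen has married cyclist Cole Maness."
tagged = tag_tokens(sentence, [(2, 4), (7, 9)])
```

With the two person names as spans, `tagged` is `o o OBJECT OBJECT o o o OBJECT OBJECT`, one tag per token of the sentence.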


### Search between Candidates

We can also search between two candidates, which were defined as PERSON in the spouse tutorial. Specifically, we are querying for PERSON married PERSON, in that order. To do this we use the `inCand` keyword, followed by the word we want to search for.

*All candidate searches only allow for a single term
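
The pattern being searched for can be illustrated in plain Python on a sentence and its tag string. The sketch below checks whether a term has an OBJECT tag within `distance` positions on both sides; it only illustrates the pattern and says nothing about how `inCand` is implemented. The function name, tokenization, and window arithmetic are assumptions.

```python
def term_between_candidates(sentence, tagged, term, distance=100):
    """Return True if `term` has an OBJECT tag before and after it within `distance`."""
    # Whitespace tokens with surrounding punctuation removed (assumed tokenization).
    tokens = [t.strip(".,;:!?'\"()") for t in sentence.split()]
    tags = tagged.split()
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        if token != term or tag == "OBJECT":
            continue
        # distance=0 means the neighbouring token itself must be a candidate.
        before = "OBJECT" in tags[max(0, i - 1 - distance):i]
        after = "OBJECT" in tags[i + 1:i + 2 + distance]
        if before and after:
            return True
    return False


sentence = "Parenthood star Erika Christensen has married cyclist Cole Maness."
tagged = "o o OBJECT OBJECT o o o OBJECT OBJECT"
match = term_between_candidates(sentence, tagged, "married")
```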

In [11]:
result = eSearch.searchIndex("inCand","married",distance=100)
printResults(result,"sentence","tagged","lineNum")


Number of hits 
226
Result 1
-------------------
sentence
Gifford and Ewart's daughter Victoria married Robert Kennedy's son Michael in 1981.   
tagged
OBJECT o OBJECT o OBJECT o OBJECT OBJECT o OBJECT o o
lineNum
24130

Result 2
-------------------
sentence
Lady Bird Johnson was born as Claudia Alta Taylor and married Lyndon B. Johnson in 1934.
tagged
OBJECT OBJECT OBJECT o o o OBJECT OBJECT OBJECT o o OBJECT OBJECT OBJECT o o
lineNum
64482

Result 3
-------------------
sentence
Parenthood star Erika Christensen has married cyclist Cole Maness.   
tagged
o o OBJECT OBJECT o o o OBJECT OBJECT
lineNum
63311

Result 4
-------------------
sentence
Williams married Van Veen on Saturday afternoon in a private ceremony at a ranch out West, her rep confirmed to the Los Angeles Times.
tagged
OBJECT o OBJECT OBJECT o o o o o o o o o o o o o o o o o o o o
lineNum
36263

Result 5
-------------------
sentence
Singer Wayne Newton married his wife, Kathleen, at Casa de Shenandoah in Las Vegas on Apr

### Search before candidate 

Using the `bCand` keyword we can search for the occurrence of a term that appears before any PERSON.

In [12]:
result = eSearch.searchIndex("bCand","married",distance=100,size=3)
printResults(result,"sentence")

Number of hits 
292
Result 1
-------------------
sentence
Chris Pratt   Nevermind that Jen FINALLY married Justin Theroux, after being engaged since 2012 –

Result 2
-------------------
sentence
Meanwhile, Amy Fisher got out of prison and married Louis Bellera in 2003, separating in 2007.   

Result 3
-------------------
sentence
He gained adoptive rights       'Violent': Kim Davis accused Thomas McIntyre of being violent and calling her a 'whore', married Joe Davis - but then divorced him and married McIntyre in September 2007.



### Search after candidate 

Using the `aCand` keyword we can search for the occurrence of a term that appears after any PERSON.

In [13]:
result = eSearch.searchIndex("aCand","married",distance=100,size=3)
printResults(result,"sentence")

Number of hits 
319
Result 1
-------------------
sentence
Erin and her hubby, British rapper, Example (real name Elliot Gleave) married in Australia in May 2013.

Result 2
-------------------
sentence
All that glitters: Erin and husband Example married in 2013.

Result 3
-------------------
sentence
Singer Wayne Newton married his wife, Kathleen, at Casa de Shenandoah in Las Vegas on April 9, 1994.
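
The difference between the before and after searches can be illustrated on a tagged sentence in plain Python. The sketch below is only an illustration of the two patterns, not the `bCand` or `aCand` implementation; the function name, the `side` parameter, and the tokenization are assumptions.

```python
def term_near_candidate(sentence, tagged, term, side, distance=100):
    """Check whether `term` occurs before or after an OBJECT tag within `distance`.

    side="before": the term precedes a candidate (a candidate follows the term).
    side="after":  the term follows a candidate (a candidate precedes the term).
    """
    if side not in ("before", "after"):
        raise ValueError("side must be 'before' or 'after'")
    # Whitespace tokens with surrounding punctuation removed (assumed tokenization).
    tokens = [t.strip(".,;:!?'\"()") for t in sentence.split()]
    tags = tagged.split()
    for i, token in enumerate(tokens[:len(tags)]):
        if token != term:
            continue
        if side == "before":
            window = tags[i + 1:i + 2 + distance]
        else:
            window = tags[max(0, i - 1 - distance):i]
        if "OBJECT" in window:
            return True
    return False


sentence = "Parenthood star Erika Christensen has married cyclist Cole Maness."
tagged = "o o OBJECT OBJECT o o o OBJECT OBJECT"
before_hit = term_near_candidate(sentence, tagged, "married", "before")
after_hit = term_near_candidate(sentence, tagged, "married", "after")
```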

