# Elasticsearch With Haystack

In [120]:
import json

with open('data/squad/train.json', 'r') as f:
    squad = json.load(f)

Now we initialize a connection between Haystack and our local Elasticsearch instance.

In [121]:
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host='localhost', username='', password='', index='squad_docs')

In [122]:
document_store

<haystack.document_stores.elasticsearch.ElasticsearchDocumentStore at 0x7f8ba85dc910>

With the connection established, we can also query our Elasticsearch instance directly through the requests library.

In [123]:
import requests

We check our cluster *health* (i.e., the general status of our Elasticsearch instance).

In [124]:
res = requests.get('http://localhost:9200/_cluster/health')

res.json()

{'cluster_name': 'elasticsearch',
 'status': 'yellow',
 'timed_out': False,
 'number_of_nodes': 1,
 'number_of_data_nodes': 1,
 'active_primary_shards': 4,
 'active_shards': 4,
 'relocating_shards': 0,
 'initializing_shards': 0,
 'unassigned_shards': 3,
 'delayed_unassigned_shards': 0,
 'number_of_pending_tasks': 0,
 'number_of_in_flight_fetch': 0,
 'task_max_waiting_in_queue_millis': 0,
 'active_shards_percent_as_number': 57.14285714285714}
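
A `yellow` status means that all primary shards are assigned but one or more replica shards are not. This is expected on a single-node cluster, because a replica cannot be allocated to the same node as its primary. As a rough sketch, the response can be condensed into a few fields; `summarize_health` is a hypothetical helper (not part of Elasticsearch or Haystack), and the percentage formula is an assumption that happens to reproduce the value reported above (4 active shards out of 4 + 3).

```python
def summarize_health(health: dict) -> dict:
    """Condense a `_cluster/health` response (hypothetical helper)."""
    active = health['active_shards']
    # Assumption: shards that are not yet active are either initializing or unassigned.
    pending = health['initializing_shards'] + health['unassigned_shards']
    total = active + pending
    percent_active = 100.0 * active / total if total else 100.0
    return {
        'status': health['status'],
        'percent_active': percent_active,
        # 'green' and 'yellow' both mean every primary shard is available.
        'usable': health['status'] in ('green', 'yellow'),
    }


# The relevant fields copied from the response above.
example_health = {
    'status': 'yellow',
    'active_shards': 4,
    'initializing_shards': 0,
    'unassigned_shards': 3,
}

summary = summarize_health(example_health)
print(summary)
```

For a local experiment like this one, `yellow` is usually acceptable; only `red` indicates that some primary shards (and therefore some data) are unavailable.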

## Adding Data
We populate our Elasticsearch instance (which contains the empty index squad_docs) with our SQuAD data.

In [125]:
squad_docs = []

for sample in squad:
    squad_docs.append({
        'content': sample['context']
    })

We inspect the documents and then add them to the index:

In [126]:
squad_docs

[{'content': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'},
 {'content': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in

In [127]:
document_store.write_documents(squad_docs)

## Retrieve data using TF-IDF

In [128]:
from haystack.nodes import TfidfRetriever

retriever = TfidfRetriever(document_store)

In [129]:
query = "In what country is Normandy located"

retriever.retrieve(query)

[<Document: {'content': "In late 1203, John attempted to relieve Château Gaillard, which although besieged by Philip was guarding the eastern flank of Normandy. John attempted a synchronised operation involving land-based and water-borne forces, considered by most historians today to have been imaginative in conception, but overly complex for forces of the period to have carried out successfully. John's relief operation was blocked by Philip's forces, and John turned back to Brittany in an attempt to draw Philip away from eastern Normandy. John successfully devastated much of Brittany, but did not deflect Philip's main thrust into the east of Normandy. Opinions vary amongst historians as to the military skill shown by John during this campaign, with most recent historians arguing that his performance was passable, although not impressive.[nb 8] John's situation began to deteriorate rapidly. The eastern border region of Normandy had been extensively cultivated by Philip and his predeces

The query may return a large number of duplicates, because each SQuAD context can be paired with several different questions. We therefore clear the index before re-indexing deduplicated contexts.
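
To gauge how much duplication there is before clearing anything, one option is to count how often each context occurs. The following is a minimal sketch on toy data; `count_duplicates` is a hypothetical helper, and the `question` key in the toy samples is an assumption used only for illustration (the notebook itself only relies on `context`).

```python
from collections import Counter


def count_duplicates(samples):
    """Return (total, unique, redundant) context counts (hypothetical helper)."""
    counts = Counter(sample['context'] for sample in samples)
    n_total = sum(counts.values())
    n_unique = len(counts)
    return n_total, n_unique, n_total - n_unique


# Toy stand-in for the SQuAD samples: two questions share context 'A'.
toy_squad = [
    {'context': 'A', 'question': 'q1'},
    {'context': 'A', 'question': 'q2'},
    {'context': 'B', 'question': 'q3'},
]

total, unique, redundant = count_duplicates(toy_squad)
print(total, unique, redundant)
```

Running the same function on `squad` would show how many of the indexed documents are redundant copies.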

In [130]:
res = requests.post('http://localhost:9200/squad_docs/_delete_by_query',
                    json={
                        'query': {
                            'match_all': {}
                        }
                    })

res.json()

{'took': 2367,
 'timed_out': False,
 'total': 19029,
 'deleted': 19029,
 'batches': 20,
 'version_conflicts': 0,
 'noops': 0,
 'retries': {'bulk': 0, 'search': 0},
 'throttled_millis': 0,
 'requests_per_second': -1.0,
 'throttled_until_millis': 0,
 'failures': []}

Our response shows 19029 documents have been deleted from our squad_docs index.
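
Rather than reading the response by eye, the same check can be expressed in code. This is a sketch under the assumption that a "clean" delete means every matched document was deleted, with no failures, version conflicts, or timeout; `check_delete_response` is a hypothetical helper, not an Elasticsearch API.

```python
def check_delete_response(body: dict) -> bool:
    """Return True if a `_delete_by_query` response looks clean (hypothetical helper)."""
    complete = body['deleted'] == body['total']
    clean = (
        not body['failures']
        and body['version_conflicts'] == 0
        and not body['timed_out']
    )
    return complete and clean


# The relevant fields copied from the response above.
response = {
    'timed_out': False,
    'total': 19029,
    'deleted': 19029,
    'version_conflicts': 0,
    'failures': [],
}

ok = check_delete_response(response)
print(ok)
```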

In [131]:
res = requests.get('http://localhost:9200/squad_docs/_count')

res.json()

{'count': 1029,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}

After clearing the index, we remove duplicates from our SQuAD contexts and re-index them.

In [132]:
# create list of contexts (we cannot do this using current dictionary format)
contexts = [sample['context'] for sample in squad]

# convert to set to remove duplicates, then back to list
contexts = list(set(contexts))

# convert back to dictionary format we need
squad_docs = [{'content': sample} for sample in contexts]
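
Note that a `set` does not preserve order, so the contexts end up in an arbitrary order that can differ between runs. If a reproducible order matters, one alternative is `dict.fromkeys`, which keeps the first occurrence of each item in its original position. A minimal sketch on toy data (`dedupe_keep_order` is a hypothetical helper):

```python
def dedupe_keep_order(items):
    """Remove duplicates while keeping first-seen order (hypothetical helper)."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


toy_contexts = ['B', 'A', 'B', 'C', 'A']
unique_contexts = dedupe_keep_order(toy_contexts)

# Same dictionary format that write_documents receives in this notebook.
toy_docs = [{'content': context} for context in unique_contexts]
print(toy_docs)
```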

Finally, we write the deduplicated documents to our Elasticsearch index as we did before.

In [133]:
document_store.write_documents(squad_docs)

Because we have changed the contents of our index, we initialize our retriever once more.

In [134]:
retriever = TfidfRetriever(document_store)

In [135]:
retriever.retrieve(query)

[<Document: {'content': "In late 1203, John attempted to relieve Château Gaillard, which although besieged by Philip was guarding the eastern flank of Normandy. John attempted a synchronised operation involving land-based and water-borne forces, considered by most historians today to have been imaginative in conception, but overly complex for forces of the period to have carried out successfully. John's relief operation was blocked by Philip's forces, and John turned back to Brittany in an attempt to draw Philip away from eastern Normandy. John successfully devastated much of Brittany, but did not deflect Philip's main thrust into the east of Normandy. Opinions vary amongst historians as to the military skill shown by John during this campaign, with most recent historians arguing that his performance was passable, although not impressive.[nb 8] John's situation began to deteriorate rapidly. The eastern border region of Normandy had been extensively cultivated by Philip and his predeces

Now we get a set of relevant documents without duplicates.

## Retrieve data using BM25
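
BM25 differs from plain TF-IDF mainly in two ways: the contribution of a repeated term saturates instead of growing linearly, and scores are normalized by document length. The sketch below is a textbook form of the scoring function on whitespace-tokenized toy documents, using the commonly cited defaults `k1=1.2` and `b=0.75`. It is an illustration only, not how Elasticsearch or Haystack implement scoring internally, and `bm25_scores` is a hypothetical helper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with textbook BM25 (illustrative sketch)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    'normandy is a region in france',
    'normandy normandy normandy normandy',
    'the cat sat on the mat',
]

scores = bm25_scores('normandy france', toy_docs)
print(scores)
```

In this toy example, the second document repeats "normandy" four times but still scores below the first, because term frequency saturates and the first document also matches "france".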

In [136]:
# import BM25 retriever
from haystack.nodes import BM25Retriever

# initialize
retriever = BM25Retriever(document_store)

# and query
retriever.retrieve(query)

[<Document: {'content': 'But Amnesty International found no evidence that UNFPA had supported the coercion. A 2001 study conducted by the pro-life Population Research Institute (PRI) falsely claimed that the UNFPA shared an office with the Chinese family planning officials who were carrying out forced abortions. "We located the family planning offices, and in that family planning office, we located the UNFPA office, and we confirmed from family planning officials there that there is no distinction between what the UNFPA does and what the Chinese Family Planning Office does," said Scott Weinberg, a spokesman for PRI. However, United Nations Members disagreed and approved UNFPA’s new country program me in January 2006. The more than 130 members of the “Group of 77” developing countries in the United Nations expressed support for the UNFPA programmes. In addition, speaking for European democracies -- Norway, Denmark, Sweden, Finland, the Netherlands, France, Belgium, Switzerland and Germa