# Setup

Before we start developing with Elasticsearch, let's make sure we can:

- connect to the cluster
- index some data
- run a query against the index

In [3]:
from elasticsearch import Elasticsearch

# Test Connection 
client = Elasticsearch("http://localhost:9200")

assert client.cluster.health()['status'] == 'green', 'Cluster not healthy'

AssertionError: Cluster not healthy

The connection should be established and the cluster health green, which means we're good to go! If the assertion fails as in the output above, the cluster is reporting yellow or red instead.
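A single-node development cluster commonly reports yellow once an index with replica shards exists, because the replicas cannot be assigned to a second node. The sketch below shows one way to make the check tolerant of that. The helper name `cluster_is_usable` and the decision to accept yellow are assumptions for illustration, not part of the original notebook. It relies only on `client.cluster.health()` returning a mapping with a `status` key, as in the cell above.

```python
def cluster_is_usable(client, acceptable=("green", "yellow")):
    """Return True if the cluster reports one of the acceptable health states.

    `client` is expected to behave like the Elasticsearch client created above.
    Accepting "yellow" is an assumption that suits single-node setups.
    """
    status = client.cluster.health()["status"]
    return status in acceptable

# Usage with the client from the cell above:
# assert cluster_is_usable(client), 'Cluster not healthy'
```

Passing `acceptable=("green",)` restores the strict behaviour of the original assertion.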

In [2]:
# Create an index
response = client.indices.create(index='my-first-index')

assert response['acknowledged'], 'Index could not be created'
assert client.indices.exists(index='my-first-index')
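Running the cell above a second time fails, because the index already exists. A minimal sketch of an idempotent variant follows. The helper name `ensure_index` is hypothetical; it uses only the two calls already shown, `client.indices.exists` and `client.indices.create`.

```python
def ensure_index(client, name):
    """Create the index `name` only if it does not exist yet.

    Returns True if this call created the index, False if it was
    already there.
    """
    if client.indices.exists(index=name):
        return False
    response = client.indices.create(index=name)
    assert response['acknowledged'], 'Index could not be created'
    return True

# Usage with the client from the Setup cell:
# ensure_index(client, 'my-first-index')
```

This check-then-create approach is fine for a notebook, but it is not safe against two processes creating the same index at the same moment.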

In [3]:
# Add a sample document
client.index(index='my-first-index', id=23, document={
    'message': 'Michael Jeffrey Jordan (born February 17, 1963), also known by his initials MJ, is an American businessman and former professional basketball player. He played fifteen seasons in the National Basketball Association (NBA), winning six NBA championships with the Chicago Bulls. Jordan is the principal owner and chairman of the Charlotte Hornets of the NBA and of 23XI Racing in the NASCAR Cup Series. His biography on the official NBA website states: "By acclamation, Michael Jordan is the greatest basketball player of all time." He was integral in popularizing the NBA around the world in the 1980s and 1990s, becoming a global cultural icon in the process.'
})
assert client.exists(index='my-first-index', id=23)

Now we've created an index and indexed one document - not bad, but we're just getting started! Let's try to search our data.

In [4]:
# Full-text search on the `message` field, restricted to our index
client.search(index='my-first-index', query={
    'match': {
        'message': 'Michael'
    }
})

ObjectApiResponse({'took': 6, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 1, 'relation': 'eq'}, 'max_score': 0.39767313, 'hits': [{'_index': 'my-first-index', '_id': '23', '_score': 0.39767313, '_ignored': ['message.keyword'], '_source': {'message': 'Michael Jeffrey Jordan (born February 17, 1963), also known by his initials MJ, is an American businessman and former professional basketball player. He played fifteen seasons in the National Basketball Association (NBA), winning six NBA championships with the Chicago Bulls. Jordan is the principal owner and chairman of the Charlotte Hornets of the NBA and of 23XI Racing in the NASCAR Cup Series. His biography on the official NBA website states: "By acclamation, Michael Jordan is the greatest basketball player of all time." He was integral in popularizing the NBA around the world in the 1980s and 1990s, becoming a global cultural icon in the process.'}}]}})

In [5]:
# A term that does not occur in the document yields no hits
client.search(index='my-first-index', query={
    'match': {
        'message': 'GOAT'
    }
})

ObjectApiResponse({'took': 3, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 0, 'relation': 'eq'}, 'max_score': None, 'hits': []}})
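The two responses above share the same shape: the number of matches is under `hits.total.value`, and each match in `hits.hits` carries `_index`, `_id`, `_score` and `_source`. The second query returns nothing because a `match` query compares analyzed terms, and the term "GOAT" does not appear in the stored text. The sketch below pulls the interesting parts out of such a response. The helper name `summarize_hits` is hypothetical, and it assumes the response can be indexed like a dictionary, as the printed output suggests.

```python
def summarize_hits(response):
    """Return (total, [(index, id, score), ...]) from a search response."""
    hits = response['hits']
    total = hits['total']['value']
    rows = [(h['_index'], h['_id'], h['_score']) for h in hits['hits']]
    return total, rows

# Trimmed copy of the second response above (no match for 'GOAT')
empty = {'hits': {'total': {'value': 0, 'relation': 'eq'},
                  'max_score': None, 'hits': []}}
print(summarize_hits(empty))  # prints (0, [])

# Usage with a live client:
# summarize_hits(client.search(query={'match': {'message': 'Michael'}}))
```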

# Growing the set

We can already explore the search API with this minimal example, but every query would end up with one of two result sets: one hit or none.

To make things more interesting, we should probably accumulate a bit more data. Let's use a small Wikipedia API client to fetch some random articles from Wikipedia and ingest them into Elasticsearch.

In [2]:
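import wikipedia

# Titles of ten random articles (requires the `client` from the Setup section)
titles = wikipedia.random(pages=10)

# Fetch each article and index it; Elasticsearch assigns the document IDs
for title in titles:
    page = wikipedia.page(title)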
    client.index(
        index='my-second-index',
        document={
            'url': page.url,
            'title': page.title,
            'summary': page.summary,
            'content': page.content,
            'links': page.links,
        }
    )

NameError: name 'client' is not defined
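The loop above sends one indexing request per article. One possible alternative, sketched here, is the client's bulk helper `elasticsearch.helpers.bulk`, which accepts an iterable of action dictionaries. The generator name `build_actions` and the variable `fetched_pages` are hypothetical; the page objects only need the attributes already used in the loop.

```python
def build_actions(pages, index):
    """Yield one bulk action per page-like object.

    Each page needs the attributes used in the loop above:
    url, title, summary, content and links.
    """
    for page in pages:
        yield {
            '_index': index,
            '_source': {
                'url': page.url,
                'title': page.title,
                'summary': page.summary,
                'content': page.content,
                'links': page.links,
            },
        }

# Usage sketch (needs a running cluster and the `client` from Setup;
# `fetched_pages` stands for a list of wikipedia page objects):
# from elasticsearch import helpers
# helpers.bulk(client, build_actions(fetched_pages, 'my-second-index'))
```

Building the actions lazily with a generator avoids holding every document in memory at once, which matters more as the set grows.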

In [7]:
client.count(index='my-second-index')

ObjectApiResponse({'count': 10, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}})

In [12]:
import os
len(os.listdir('./_articles/'))

1000