## Introduction to Databases

### Using Elasticsearch

Based on three posts: [Elasticsearch tutorial for beginners using Python](https://medium.com/naukri-engineering/elasticsearch-tutorial-for-beginners-using-python-b9cb48edcedc), [Getting started with Elasticsearch in Python](https://towardsdatascience.com/getting-started-with-elasticsearch-in-python-c3598e718380), and the [official Elasticsearch introduction](https://www.elastic.co/guide/en/elasticsearch/reference/current/elasticsearch-intro.html).

[Installing Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html)  

To install the Python wrapper, run:

!sudo pip install -U elasticsearch

In [1]:
!curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_expression","resource.id":"_all"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_expression","resource.id":"_all"},"status":404}

In [22]:
import requests
res = requests.get('http://localhost:9200')
print(res.content)

b'{\n  "name" : "AMYiWlv",\n  "cluster_name" : "elasticsearch",\n  "cluster_uuid" : "ZV9z9yl4SVa861af9CnG7A",\n  "version" : {\n    "number" : "6.8.13",\n    "build_flavor" : "default",\n    "build_type" : "deb",\n    "build_hash" : "be13c69",\n    "build_date" : "2020-10-16T09:09:46.555371Z",\n    "build_snapshot" : false,\n    "lucene_version" : "7.7.3",\n    "minimum_wire_compatibility_version" : "5.6.0",\n    "minimum_index_compatibility_version" : "5.0.0"\n  },\n  "tagline" : "You Know, for Search"\n}\n'


### What is Elasticsearch?

Elasticsearch is the distributed search and analytics engine at the heart of the [Elastic Stack](https://www.elastic.co/pt/elk-stack). 

Elasticsearch provides near real-time search and analytics for all types of data. Whether you have structured or unstructured text, numerical data, or geospatial data, Elasticsearch can efficiently store and index it in a way that supports fast searches. You can go far beyond simple data retrieval and aggregate information to discover trends and patterns in your data. And as your data and query volume grows, the distributed nature of Elasticsearch enables your deployment to grow seamlessly right along with it.

While not every problem is a search problem, Elasticsearch offers speed and flexibility to handle data in a wide variety of use cases:

+ Add a search box to an app or website
+ Store and analyze logs, metrics, and security event data
+ Use machine learning to automatically model the behavior of your data in real time
+ Automate business workflows using Elasticsearch as a storage engine
+ Manage, integrate, and analyze spatial information using Elasticsearch as a geographic information system (GIS)
+ Store and process genetic data using Elasticsearch as a bioinformatics research tool

We’re continually amazed by the novel ways people use search. But whether your use case is similar to one of these, or you’re using Elasticsearch to tackle a new problem, the way you work with your data, documents, and indices in Elasticsearch is the same.

In [2]:
# Import Elasticsearch package 
import json
from elasticsearch import Elasticsearch 

In [3]:
# Connect to the elastic cluster
es = Elasticsearch([{'host':'localhost','port':9200}])
es

<Elasticsearch([{'host': 'localhost', 'port': 9200}])>

Elasticsearch is document-oriented, meaning that it stores entire objects, or documents. It not only stores them but also indexes the content of each document to make it searchable. In Elasticsearch you index, search, sort, and filter documents.

Elasticsearch uses JSON as the serialisation format for the documents.

Now let’s start by indexing the employee documents.

The act of storing data in Elasticsearch is called indexing. An Elasticsearch cluster can contain multiple indices, which in turn contain multiple types. These types hold multiple documents, and each document has multiple fields.

In [4]:
e1 = {"first_name":"Renato",
      "last_name":"Souza",
      "age": 30,
      "about": "I love to climb",
      "interests": ['sports','music','literature'],
     }

print(json.dumps(e1, indent=2, sort_keys=True))

{
  "about": "I love to climb",
  "age": 30,
  "first_name": "Renato",
  "interests": [
    "sports",
    "music",
    "literature"
  ],
  "last_name": "Souza"
}


### Inserting a document:

In [5]:
# Now let's store this document in Elasticsearch

res = es.index(index='emap',
               doc_type='employee',
               id=1,
               document=e1)



In [6]:
# Let's insert some more documents
e2 = {"first_name" :  "Leonardo",
      "last_name" :   "Smith",
      "age" :         32,
      "about" :       "I like to collect rock albums",
      "interests":  ["music"]
     }

e3 = {"first_name" :  "Pedro",
      "last_name" :   "Clark",
      "age" :         35,
      "about":        "I like to build cabinets",
      "interests":  ["forestry"]}

res = es.index(index='emap',
               doc_type='employee',
               id=2,
               document=e2)

print(json.dumps(res, indent=2, sort_keys=True))

res = es.index(index='emap',
               doc_type='employee',
               id=3,
               document=e3)

print(json.dumps(res, indent=2, sort_keys=True))

{
  "_id": "2",
  "_index": "emap",
  "_primary_term": 1,
  "_seq_no": 0,
  "_shards": {
    "failed": 0,
    "successful": 1,
    "total": 2
  },
  "_type": "employee",
  "_version": 1,
  "result": "created"
}
{
  "_id": "3",
  "_index": "emap",
  "_primary_term": 1,
  "_seq_no": 0,
  "_shards": {
    "failed": 0,
    "successful": 1,
    "total": 2
  },
  "_type": "employee",
  "_version": 1,
  "result": "created"
}


In [7]:
print(json.dumps(res, indent=2, sort_keys=True))

{
  "_id": "3",
  "_index": "emap",
  "_primary_term": 1,
  "_seq_no": 0,
  "_shards": {
    "failed": 0,
    "successful": 1,
    "total": 2
  },
  "_type": "employee",
  "_version": 1,
  "result": "created"
}


### Retrieving a Document:

This is easy in Elasticsearch. We simply execute an HTTP GET request and specify the address of the document — the index, type, and ID. Using those three pieces of information, we can return the original JSON document.

In [8]:
res = es.get(index='emap',
             doc_type='employee',
             id=3)

print(json.dumps(res, indent=2, sort_keys=True))

{
  "_id": "3",
  "_index": "emap",
  "_primary_term": 1,
  "_seq_no": 0,
  "_source": {
    "about": "I like to build cabinets",
    "age": 35,
    "first_name": "Pedro",
    "interests": [
      "forestry"
    ],
    "last_name": "Clark"
  },
  "_type": "employee",
  "_version": 1,
  "found": true
}


The actual document is returned in the `_source` field:

In [9]:
res['_source']

{'first_name': 'Pedro',
 'last_name': 'Clark',
 'age': 35,
 'about': 'I like to build cabinets',
 'interests': ['forestry']}
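
The address described above (index, type, and ID) can also be written out as a plain REST path. The following is a minimal sketch, assuming the `/{index}/{type}/{id}` layout used by the 6.x server shown earlier; `document_url` is a hypothetical helper name, not part of the client library, and newer server versions may expect a different path.

```python
from urllib.parse import quote


def document_url(index, doc_type, doc_id, base_url="http://localhost:9200"):
    """Build the address of one document (hypothetical helper, 6.x-style path)."""
    # Percent-encode each path segment so unusual IDs cannot break the URL.
    parts = [quote(str(part), safe="") for part in (index, doc_type, doc_id)]
    return base_url.rstrip("/") + "/" + "/".join(parts)


url = document_url("emap", "employee", 3)
print(url)  # http://localhost:9200/emap/employee/3
```

Sending an HTTP GET to that address (for example with `requests.get(url)`) is roughly what `es.get` does on your behalf.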

### Deleting a document:

In [10]:
res = es.delete(index='emap',
                doc_type='employee',
                id=1)

print(res['result'])

deleted


Now let's verify that the document is no longer in Elasticsearch:

In [11]:
res = es.search(index='emap',
                query={'match_all':{}})

print('Got %d hits:' %res['hits']['total'])

Got 0 hits:


### Search Lite:

A GET is fairly simple — you get back the document that you ask for. Let’s try something a little more advanced, like a simple search! 

[Ref](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html)

```
res = es.search(index='emap',
                query={<your query comes here>})

print(json.dumps(res['hits']['hits'], indent=2, sort_keys=True))
```

Now let's search for the employees whose first name is Pedro.

### match operator:

In [12]:
res = es.search(index='emap',
                query={'match':{'first_name':'Pedro'}})

print(json.dumps(res['hits']['hits'], indent=2, sort_keys=True))

[]


### bool operator:

bool takes a dictionary containing at least one of must, should, and must_not, each of which takes a list of matches or other further search operators.

In [13]:
res = es.search(index='emap',
                query={'bool':{'must':[{'match':{'first_name':'Leonardo'}}]}}
               )
print(json.dumps(res['hits']['hits'], indent=2, sort_keys=True))

[]
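
Since a `bool` query is just a nested dictionary, it can help to assemble it with a small function. This is a sketch only: `bool_query` is a hypothetical helper, and the `must_not` clause in the usage example is invented for illustration.

```python
import json


def bool_query(must=None, should=None, must_not=None):
    """Assemble a bool query from lists of clauses, skipping empty sections."""
    sections = {"must": must, "should": should, "must_not": must_not}
    body = {name: list(clauses) for name, clauses in sections.items() if clauses}
    # The text above describes bool as taking at least one section,
    # so this helper rejects a call that supplies none.
    if not body:
        raise ValueError("bool needs at least one of must, should, must_not")
    return {"bool": body}


query = bool_query(
    must=[{"match": {"first_name": "Leonardo"}}],
    must_not=[{"match": {"interests": "forestry"}}],  # illustrative clause
)
print(json.dumps(query, indent=2))
```

The resulting dictionary can be passed as the `query` argument of `es.search`, exactly like the hand-written one in the cell above.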


### Filter operator:

Let's make the search a little more complicated. We still want to find all employees with a first name of Pedro, but we want only employees who are older than 25. Our query changes slightly to accommodate a filter, which allows us to execute structured searches efficiently:

In [14]:
res= es.search(index='emap',
               query={'bool':{'must':{'match':{'first_name':'Pedro'}},
                              'filter':{"range":{"age":{"gt":25}}}}
                     })

print(json.dumps(res['hits']['hits'], indent=2, sort_keys=True))

[]
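
To make the logic of that query concrete, here is a rough local imitation in plain Python, with no cluster involved. It is a simplification: the name comparison only lowercases and splits on whitespace, and the second "Pedro" record is hypothetical, added so the age filter has something to exclude.

```python
employees = [
    {"first_name": "Leonardo", "age": 32},
    {"first_name": "Pedro", "age": 35},
    {"first_name": "Pedro", "age": 22},  # hypothetical record, for illustration
]


def matches_first_name(doc, text):
    # Simplified stand-in for the match clause: case-insensitive token overlap.
    doc_tokens = set(doc["first_name"].lower().split())
    return bool(doc_tokens & set(text.lower().split()))


def older_than(doc, years):
    # Stand-in for the range filter {"age": {"gt": years}}: a plain yes/no test.
    return doc["age"] > years


selected = [d for d in employees
            if matches_first_name(d, "Pedro") and older_than(d, 25)]
print(selected)  # [{'first_name': 'Pedro', 'age': 35}]
```

The `must` part decides which documents are relevant, while the filter only keeps or discards documents.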


### Full text search

The searches so far have been simple.  
Let’s try more advanced full text search. Before starting this next type of search let me insert one more document.

In [15]:
e4 = {"first_name":"Marcelo",
      "last_name":"Jones",
      "age": 27,
      "about": "I love to play football",
      "interests": ['sports','music'],}

res = es.index(index='emap',
               doc_type='employee',
               id=4,
               document=e4)

print(json.dumps(res, indent=2, sort_keys=True))

{
  "_id": "4",
  "_index": "emap",
  "_primary_term": 1,
  "_seq_no": 1,
  "_shards": {
    "failed": 0,
    "successful": 1,
    "total": 2
  },
  "_type": "employee",
  "_version": 1,
  "result": "created"
}


In [16]:
res = es.search(index='emap',
                doc_type='employee',
                query={'match':{"about":"play cricket"}})

for hit in res['hits']['hits']:
    print(hit['_source']['about']) 
    print(hit['_score'])
    print('**********************')

In [17]:
e5 = {"first_name":"Marcos",
      "last_name":"Jones",
      "age": 25,
      "about": "I love to play volleyball",
      "interests": ['sports','music'],}

res = es.index(index='emap',
               doc_type='employee',
               id=5,
               document=e5)

print(json.dumps(res, indent=2, sort_keys=True))

{
  "_id": "5",
  "_index": "emap",
  "_primary_term": 1,
  "_seq_no": 0,
  "_shards": {
    "failed": 0,
    "successful": 1,
    "total": 2
  },
  "_type": "employee",
  "_version": 1,
  "result": "created"
}


In [18]:
res = es.search(index='emap',
                doc_type='employee',
                query={'match':{"about":"play cricket"}})

for hit in res['hits']['hits']:
    print(hit['_source']['about']) 
    print(hit['_score'])
    print('**********************')

### Phrase Search

Finding individual words in a field is all well and good, but sometimes you want to match an exact sequence of words, that is, a phrase.

In [19]:
res = es.search(index='emap',
                doc_type='employee',
                query={'match_phrase':{"about":"play cricket"}})

for hit in res['hits']['hits']:
    print(hit['_source']['about']) 
    print(hit['_score'])
    print('**********************')
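
The difference between `match` and `match_phrase` can be imitated locally. The sketch below is only an approximation: it tokenizes by lowercasing and splitting on whitespace, ignores scoring, and uses hypothetical helper names.

```python
def tokenize(text):
    # Simplification: lowercase and split on whitespace.
    return text.lower().split()


def match_any(field_text, query_text):
    """Rough analogue of match: true if any query term occurs in the field."""
    field_tokens = set(tokenize(field_text))
    return any(term in field_tokens for term in tokenize(query_text))


def match_phrase(field_text, query_text):
    """Rough analogue of match_phrase: terms must be adjacent and in order."""
    field_tokens = tokenize(field_text)
    query_tokens = tokenize(query_text)
    n = len(query_tokens)
    # Slide a window of n tokens over the field and compare it to the query.
    return n > 0 and any(field_tokens[i:i + n] == query_tokens
                         for i in range(len(field_tokens) - n + 1))


about = "I love to play football"
print(match_any(about, "play cricket"))      # True
print(match_phrase(about, "play cricket"))   # False
print(match_phrase(about, "play football"))  # True
```

So a document about football can still be a hit for the `match` query on "play cricket", but not for the phrase query.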

### Aggregations

Elasticsearch has a feature called aggregations, which allows you to generate sophisticated analytics over your data. It is similar to `GROUP BY` in SQL, but much more powerful.  

[Ref1](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html)
[Ref2](https://techoverflow.net/2019/03/17/how-to-fix-elasticsearch-fielddata-is-disabled-on-text-fields-by-default-for-keyword-field/)


In [20]:
res= es.search(index='emap',
               doc_type='employee',
               aggs={"all_interests": {"terms": {"field": "interests.keyword"}}}
              )

print(json.dumps(res, indent=2, sort_keys=True))

{
  "_shards": {
    "failed": 0,
    "skipped": 0,
    "successful": 5,
    "total": 5
  },
  "aggregations": {
    "all_interests": {
      "buckets": [],
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0
    }
  },
  "hits": {
    "hits": [],
    "max_score": null,
    "total": 0
  },
  "timed_out": false,
  "took": 4
}
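
As a point of comparison, the grouping that a `terms` aggregation performs can be reproduced locally with `collections.Counter`. This sketch assumes the four employees indexed above (Leonardo, Pedro, Marcelo, Marcos) are all present, and it counts each interest once per document.

```python
from collections import Counter

interests_per_employee = [
    ["music"],            # Leonardo
    ["forestry"],         # Pedro
    ["sports", "music"],  # Marcelo
    ["sports", "music"],  # Marcos
]

# Count in how many documents each interest appears.
counts = Counter(interest
                 for interests in interests_per_employee
                 for interest in set(interests))

# Shape the result like aggregation buckets, most frequent first.
buckets = [{"key": key, "doc_count": n} for key, n in counts.most_common()]
print(buckets)
```

Each entry plays the role of one bucket: a distinct value of the field together with the number of documents that contain it.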


### Deleting an Index:


In [21]:
es.indices.delete(index='emap', ignore=[400, 404])

{'acknowledged': True}

### A [full example](https://towardsdatascience.com/getting-started-with-elasticsearch-in-python-c3598e718380) on scraping and storing in Elasticsearch

In [23]:
import json
import logging
from pprint import pprint
from time import sleep

import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

In [64]:
def search(es_object, index_name, search):
    res = es_object.search(index=index_name, body=search)
    pprint(res)

In [65]:
def create_index(es_object, index_name):
    created = False
    # index settings
    settings = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        },
        "mappings": {
            "salads": {
                "dynamic": "strict",
                "properties": {
                    "title": {
                        "type": "text"
                    },
                    "submitter": {
                        "type": "text"
                    },
                    "description": {
                        "type": "text"
                    },
                    "calories": {
                        "type": "integer"
                    },
                    "ingredients": {
                        "type": "nested",
                        "properties": {
                            "step": {"type": "text"}
                        }
                    },
                }
            }
        }
    }
    try:
        if not es_object.indices.exists(index=index_name):
            # ignore=400 suppresses the "index already exists" error.
            # The settings and mappings are passed as the request body.
            es_object.indices.create(index=index_name, ignore=400, body=settings)
            print('Created Index')
        created = True
    except Exception as ex:
        print(str(ex))
    finally:
        return created

In [66]:
def store_record(elastic_object, index_name, record):
    is_stored = True
    try:
        outcome = elastic_object.index(index=index_name, doc_type='salads', body=record)
        print(outcome)
    except Exception as ex:
        print('Error in indexing data')
        print(str(ex))
        is_stored = False
    finally:
        return is_stored

In [67]:
def connect_elasticsearch():
    _es = None
    _es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
    if _es.ping():
        print('Connected')
    else:
        print('Could not connect!')
    return _es

In [68]:
def parse(u):
    title = '-'
    submit_by = '-'
    description = '-'
    calories = 0
    ingredients = []
    rec = {}

    try:
        r = requests.get(u, headers=headers)

        if r.status_code == 200:
            html = r.text
            soup = BeautifulSoup(html, 'lxml')
            title_section = soup.select('.recipe-summary__h1')
            submitter_section = soup.select('.submitter__name')
            description_section = soup.select('.submitter__description')
            ingredients_section = soup.select('.recipe-ingred_txt')
            calories_section = soup.select('.calorie-count')
            if calories_section:
                calories = calories_section[0].text.replace('cals', '').strip()

            if ingredients_section:
                for ingredient in ingredients_section:
                    ingredient_text = ingredient.text.strip()
                    if 'Add all ingredients to list' not in ingredient_text and ingredient_text != '':
                        ingredients.append({'step': ingredient.text.strip()})

            if description_section:
                description = description_section[0].text.strip().replace('"', '')

            if submitter_section:
                submit_by = submitter_section[0].text.strip()

            if title_section:
                title = title_section[0].text

            rec = {'title': title, 
                   'submitter': submit_by, 
                   'description': description, 
                   'calories': calories,
                   'ingredients': ingredients}
            
    except Exception as ex:
        print('Exception while parsing')
        print(str(ex))
    finally:
        return json.dumps(rec)

In [69]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
           'Pragma': 'no-cache'}

In [70]:
logging.basicConfig(level=logging.ERROR)

In [71]:
url = 'https://www.allrecipes.com/recipes/96/salad/'

r = requests.get(url, headers=headers)
if r.status_code == 200:
    html = r.text
    soup = BeautifulSoup(html, 'lxml')
    links = soup.select('.fixed-recipe-card__h3 a')
    print(links)
    if len(links) > 0:
        es = connect_elasticsearch()

    for link in links:
        sleep(2)
        result = parse(link['href'])
        if es is not None:
            if create_index(es, 'recipes'):
                out = store_record(es, 'recipes', result)
                # Report success only when the record was actually stored.
                if out:
                    print('Data indexed successfully')

[]


In [72]:
es = connect_elasticsearch()
if es is not None:
    # search_object = {'query': {'match': {'calories': '102'}}}
    # search_object = {'_source': ['title'], 'query': {'match': {'calories': '102'}}}
    search_object = {'_source': ['title'], 'query': {'range': {'calories': {'gte': 20}}}}
    search(es, 'recipes', json.dumps(search_object))

Connected


  res = es_object.search(index=index_name, body=search)


NotFoundError: NotFoundError(404, 'index_not_found_exception', 'no such index', recipes, index_or_alias)
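
The `NotFoundError` above is raised because the `recipes` index was never created (the scraper found no links, so nothing was indexed). One way to avoid the traceback is to check for the index before searching. The following is a minimal sketch: `search_if_exists` is a hypothetical helper, and the two `Fake...` classes are stand-ins that exist only so the example runs without a live cluster.

```python
def search_if_exists(es_object, index_name, body):
    """Run a search only when the index exists; otherwise return None."""
    if not es_object.indices.exists(index=index_name):
        print(f"Index '{index_name}' does not exist yet; nothing to search.")
        return None
    return es_object.search(index=index_name, body=body)


# Stand-in objects (hypothetical, for illustration only); in the notebook
# the client returned by connect_elasticsearch() would be passed instead.
class FakeIndices:
    def __init__(self, names):
        self.names = set(names)

    def exists(self, index):
        return index in self.names


class FakeClient:
    def __init__(self, names):
        self.indices = FakeIndices(names)

    def search(self, index, body):
        return {"hits": {"hits": [], "total": 0}}


result = search_if_exists(FakeClient([]), "recipes",
                          {"query": {"match_all": {}}})
print(result)  # None
```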