**SEARCH ENGINE IMPLEMENTATION USING ELASTIC SEARCH**

In [None]:
## Importing the required Elasticsearch library
import elasticsearch
print(elasticsearch.__version__)

(5, 2, 0)


In [None]:
## Install the elasticsearch client library into our environment
pip install elasticsearch==5.2.0

Collecting elasticsearch==5.2.0
  Using cached elasticsearch-5.2.0-py2.py3-none-any.whl (57 kB)
Installing collected packages: elasticsearch
Successfully installed elasticsearch-5.2.0
Note: you may need to restart the kernel to use updated packages.


In [None]:
## Import the libraries and configure the Elasticsearch client
## The server is running on port 9200; the cell below instantiates an Elasticsearch client object and connects to the server
from elasticsearch import Elasticsearch
import numpy as np
import pandas as pd

config = {'host':'localhost', 'port':9200}
es = Elasticsearch([config])

# test connection to ES server
es.ping()

True

In [None]:
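# Remove any existing index so the notebook can be re-run from scratch;
# ignore=[400, 404] suppresses the error when the index does not exist yet
es.indices.delete(index="youtube_index", ignore=[400, 404])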

**Creating our Index , Stemming & Tokenization**

The first step is to create a new index and add our YouTube metadata. A set of mappings specifies the index schema: the fields, their datatypes, and the analyzer applied to each text field.


As specified in our architecture, language processing is done using tokenization, stop word removal, and stemming.


**Stemming** : the process of reducing a word to its stem by stripping affixes (prefixes and suffixes), so that related forms such as "charts" and "chart" map to the same term. In our index configuration this is handled by the snowball token filter.

**Tokenization** : the standard tokenizer is used, which provides grammar-based tokenization that works well for most languages.

**Stop Words** : stop words are commonly used words (such as "the" or "and") that carry little meaning on their own. The stop token filter removes them so that the search focuses on the important words.
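
To make the analysis chain concrete, here is a minimal pure-Python sketch of the same sequence of steps (tokenize, lowercase, remove stop words, stem). The stop word list, the suffix list, and the function names are illustrative assumptions; Elasticsearch's standard tokenizer and snowball filter are considerably more sophisticated than this toy version.

```python
import re

# Toy stand-ins for the "standard" tokenizer and the "lowercase", "stop",
# and "snowball" token filters. These are assumptions for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "with"}
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def tokenize(text):
    """Split text into word tokens on non-alphanumeric characters."""
    return re.findall(r"[A-Za-z0-9]+", text)


def strip_suffix(token):
    """Crude stemmer: drop the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Apply tokenize -> lowercase -> stop word removal -> stemming."""
    tokens = [t.lower() for t in tokenize(text)]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [strip_suffix(t) for t in tokens]


print(analyze("Sorting the Dynamic Charts in Excel"))
# ['sort', 'dynamic', 'chart', 'excel']
```

Because the same analyzer runs at index time and at query time, a query for "charts" and a title containing "chart" end up as the same term.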




In [None]:
index_config = {
   "settings": {
        "analysis": {
            "analyzer": {
                "stop_stem_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter":[
                        "lowercase",
                        "stop",
                        "snowball"
                    ]
                    
                }
            }
        }
    },
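    "mappings": {
        "properties": {
            # field names must match the lowercase column names of the CSV,
            # otherwise the custom analyzer is never applied to the indexed data
            "videodescription": {"type": "text", "analyzer": "stop_stem_analyzer"},
            "videotitle": {"type": "text", "analyzer": "stop_stem_analyzer"},
            "videocategorylabel": {"type": "text", "analyzer": "stop_stem_analyzer"}
        }
    }
}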

In [None]:
## Create the index; ignore=400 suppresses the error if the index already exists
index_name = 'youtube_index'
es.indices.create(index=index_name, body=index_config, ignore=400)

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'youtube_index'}

**Reading our csv dataset**

In [None]:
df = pd.read_csv('/Users/shalinijain/Downloads/Clean_dataSet.csv')

In [None]:
df

Unnamed: 0,videotitle,videodescription,videocategorylabel
0,Add 500 to All Numbers in Text String? LET or ...,Download Excel File: http://excelisfun.net/fil...,Education
1,Dynamic Excel Multiplication Table with Condit...,Download Excel File: http://excelisfun.net/fil...,Education
2,Dueling Excel #191: Net Working Hours Spanning...,Download Excel File: http://excelisfun.net/fil...,Education
3,Append Two Number Columns and Sort. Excel Magi...,Download Excel File: http://excelisfun.net/fil...,Education
4,LET Function Advanced Array Formula for Dynami...,Download Excel File: http://excelisfun.net/fil...,Education
...,...,...,...
43647,KDD2016 paper 1202,Title: Sampling of Attributed Networks from Hi...,People & Blogs
43648,KDD2016 paper 1227,Title: Identifying Earmarks in Congressional B...,People & Blogs
43649,KDD2016 paper 1236,Title: Online Feature Selection: A Limited-Mem...,People & Blogs
43650,KDD2016 paper 958,Title: Scalable Betweenness Centrality Maximiz...,People & Blogs


In [None]:
df_iter = df.iterrows()
index, document = next(df_iter)

We iterate over the rows of the DataFrame and, for each row, keep only the keys (columns) that should be indexed.

In [None]:

use_these_keys = ['videocategorylabel','videotitle','videodescription']
def filterKeys(document):
    return {key: document[key] for key in use_these_keys }

A generator iterates over the rows and yields one indexing action per document for our previously created Elasticsearch index. Since we have a large number of records, we use bulk indexing to optimize the process.

In [None]:
from elasticsearch import Elasticsearch
from elasticsearch import helpers
import uuid
es_client = Elasticsearch(http_compress=True)
def doc_generator(df):
    df_iter = df.iterrows()
    for index, document in df_iter:
        yield {
                "_index": 'youtube_index',
                "_type": "_doc",
                "_id" : str(uuid.uuid4()),  # random unique document ID as a plain string
                "_source": filterKeys(document),
            }
helpers.bulk(es_client, doc_generator(df))

(43652, [])

**Searching the Index**


Now that all the documents are loaded into our index, we can search it using the Elasticsearch query language, which supports several query types.

In our search engine, we use the **best fields** query type. The best_fields type is most useful when you are **searching for multiple words** that are best found in the same field.
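
As a rough illustration of how best_fields combines per-field scores with a tie_breaker, the sketch below takes the highest field score and adds the tie_breaker times the scores of the other matching fields. The function name and the per-field scores are made-up values for illustration; the real scores come from Elasticsearch's relevance scoring.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores: best score plus tie_breaker times the rest."""
    matching = [score for score in field_scores.values() if score > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# Hypothetical per-field scores for a single document
scores = {"videotitle": 6.0, "videodescription": 2.0, "videocategorylabel": 0.0}

print(best_fields_score(scores, tie_breaker=0.3))  # 6.0 + 0.3 * 2.0
print(best_fields_score(scores, tie_breaker=0.0))  # only the best field counts
```

With tie_breaker set to 0 only the best-matching field counts, while a value of 1 would simply sum all matching fields. The value 0.3 used in this notebook sits between the two.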

In [None]:
# collapse-hide
def search_es(es_obj, index_name, question_text, n_results):
    
    # construct query
    query = {
              "query": {
                    "multi_match" : {
                      "query":      question_text,
                      "type":       "best_fields",
                      "fields":     [ "videocategorylabel","videodescription","videotitle"],
                      "tie_breaker": 0.3
                    }
                  }
            }
    
    res = es_obj.search(index=index_name, body=query, size=n_results)
    
    return res

Now that our best_fields query is defined, we execute it against our search engine. Here question_text is the user's query.

In [None]:
question_text = 'Dynamic'

# execute query
res = search_es(es_obj=es, index_name='youtube_index', question_text=question_text, n_results=20)

After the query executes, the matching videos are returned along with the **query execution duration and relevance score**.

In [None]:
print(f'Question: {question_text}')
print(f'Query Duration: {res["took"]} milliseconds')
print('VideoCategoryLabel, Title, ID, Relevance Score:')
[( hit['_source']['videocategorylabel'], hit['_source']['videotitle'],hit['_id'],hit['_score']) for hit in res['hits']['hits']]

Question: Dynamic
Query Duration: 864 milliseconds
VideoCategoryLabel, Title, ID, Relevance Score:


[('Science & Technology',
  'Excel Dynamic Chart #10: OFFSET Function Dynamic Range',
  'e92c378d-eca3-4434-8d31-91a4a2eb7966',
  9.332223),
 ('Education',
  'Dynamic Programming',
  'ef3f34be-a598-483a-9573-ae1a81bea94a',
  9.277651),
 ('Education',
  'Excel 2013 Statistical Analysis #36: Dynamic Binomial Probability Charts (3 Examples)',
  'c97a812e-d4b2-4ef2-9d52-a3eaa7988e0c',
  9.233617),
 ('Science & Technology',
  'Excel Dynamic Chart #9: 4 Week Chart Dynamic Formula & Dynamic Data Validation Formula',
  'f01e4f59-3ec0-4774-a730-fbd47116fc21',
  9.216183),
 ('Howto & Style',
  'Dueling Excel - Dynamic OFFSET or INDEX?: #1384',
  'f8de5ff2-cecc-4c45-8e97-8baa2144ce99',
  9.126232),
 ('Education',
  'Excel Dynamic Arrays: Fully Dynamic Cross Tabulated Reports? Unbelievable! EMT 1520',
  'e2d2e99c-28a1-4964-8fb2-3af3a69fc0a2',
  9.067312),
 ('Science & Technology',
  'Excel Magic Trick 636: Dynamic Frequency Table & Histogram Chart',
  '0b8715f3-2313-4780-b951-06a7226e64a1',
  9.06

For our search engine, we consider three fields when retrieving videos for a user query:


1.   videocategorylabel
2.   videodescription
3.   videotitle

Of these, videocategorylabel has the highest priority. For the query "Dynamic" above, we created ground-truth labels and assigned each retrieved document a rating based on the prioritization of the videocategorylabel field.
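
The ratings list can also be generated from a search response instead of being typed by hand. The sketch below is one possible way to do that; `ratings_from_hits` and `rating_for_rank` are hypothetical names, and the response here is a hand-built stand-in containing only the keys the function reads.

```python
def ratings_from_hits(res, index_name, rating_for_rank):
    """Turn the hits of a search response into rank-eval style rating entries."""
    ratings = []
    for rank, hit in enumerate(res["hits"]["hits"], start=1):
        ratings.append({
            "_index": index_name,
            "_id": hit["_id"],
            "rating": rating_for_rank(rank),
        })
    return ratings


# Minimal stand-in for a search response (only the keys used above)
fake_res = {"hits": {"hits": [
    {"_id": "e92c378d-eca3-4434-8d31-91a4a2eb7966"},
    {"_id": "ef3f34be-a598-483a-9573-ae1a81bea94a"},
]}}

# Assumption: the rating equals the rank, mirroring the hand-written list in this notebook
example_ratings = ratings_from_hits(fake_res, "youtube_index", lambda rank: rank)
print(example_ratings)
```

Passing a different `rating_for_rank` function makes it easy to experiment with other rating schemes without editing the list manually.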



In [None]:
request_body=  [
        {
            "id": "Dynamic",
            "request": {
                "query": {
                    "multi_match": {
                        "query": "Dynamic",
                        "type": "best_fields",
                        "fields": [
                            "videocategorylabel",
                            "videodescription",
                            "videotitle"
                        ],
                        "tie_breaker": 0.3
                    }
                }
            },
            "ratings": [
                {
                    "_id": "e92c378d-eca3-4434-8d31-91a4a2eb7966",
                    "rating": 1,
                    "_index": "youtube_index"
                },
                {
                    "_id": "ef3f34be-a598-483a-9573-ae1a81bea94a",
                    "rating": 2,
                    "_index": "youtube_index"
                },
                {
                    "_id": "c97a812e-d4b2-4ef2-9d52-a3eaa7988e0c",
                    "rating": 3,
                    "_index": "youtube_index"
                },
                {
                    "_id": "f01e4f59-3ec0-4774-a730-fbd47116fc21",
                    "rating": 4,
                    "_index": "youtube_index"
                },
                {
                    "_id": "f8de5ff2-cecc-4c45-8e97-8baa2144ce99",
                    "rating": 5,
                    "_index": "youtube_index"
                },
                {
                    "_id": "e2d2e99c-28a1-4964-8fb2-3af3a69fc0a2",
                    "rating": 6,
                    "_index": "youtube_index"
                },
                {
                    "_id": "0b8715f3-2313-4780-b951-06a7226e64a1",
                    "rating": 7,
                    "_index": "youtube_index"
                },
                {
                    "_id": "3c03a007-99c8-4031-8e7b-a4d13ab447c9",
                    "rating": 8,
                    "_index": "youtube_index"
                },
                {
                    "_id": "00b66e2a-36ad-4dc8-ade2-ab891d75fd6b",
                    "rating": 9,
                    "_index": "youtube_index"
                },
                {
                    "_id": "a1faaec5-979f-4082-8d6a-6a0d72fe2cf6",
                    "rating": 10,
                    "_index": "youtube_index"
                },
                {
                    "_id": "37dfa56f-9b23-41db-95a1-c4d98c71567b",
                    "rating": 11,
                    "_index": "youtube_index"
                },
                {
                    "_id": "8a0b2247-2a94-4578-a4d1-d9d3234642ad",
                    "rating": 12,
                    "_index": "youtube_index"
                },
                {
                    "_id": "4dcbcdfe-342d-4698-b63b-e7eca9c39b75",
                    "rating": 13,
                    "_index": "youtube_index"
                },
                {
                    "_id": "ab7aecb1-8dd4-47e1-9ceb-cb8a41e7e0fa",
                    "rating": 14,
                    "_index": "youtube_index"
                },
                {
                    "_id": "12255ff5-4cdb-48f4-9ff6-834d1efab963",
                    "rating": 15,
                    "_index": "youtube_index"
                }

            ]
        }
]

**Evaluation of our search engine**

The Ranking Evaluation API evaluates the quality of ranked search results over a set of typical search queries. The evaluation starts from the user queries and what users are searching for.
To evaluate search quality, we need the following three mandatory inputs:

1.   A collection of documents against which query performance is evaluated.
2.   A collection of typical search requests.
3.   A set of document ratings that represent each document's relevance, which we defined in the previous cell.

**Evaluation Metrics**

The metric section determines which evaluation metric is used. Our search engine is evaluated with the following metrics:

**1) Precision :** This metric measures the proportion of relevant results in the top k search results, i.e. the fraction of the first k results that are relevant. The parameter k sets the maximum number of documents retrieved per query.

**2) Recall:** This metric measures the fraction of all relevant documents that appear in the top k search results, i.e. the number of relevant documents among the first k results divided by the total number of relevant documents. As with precision, k sets the maximum number of documents retrieved per query.

**3) Mean Reciprocal Rank:** For every query in the test suite, this metric calculates the reciprocal of the rank of the first relevant document. The reciprocal rank for each query is averaged across all queries in the test suite to give the mean reciprocal rank.

**4) Discounted Cumulative Gain:** DCG takes both the rank and the rating of the search results into account. The assumption is that highly relevant documents are more useful to the user when they appear at the top of the result list.

**5) Expected Reciprocal Rank:** Expected Reciprocal Rank (ERR) is an extension of the classical reciprocal rank for the graded relevance case. The metric models the expectation of the reciprocal of the position at which a user stops reading through the result list. This means that a relevant document in a top ranking position will have a large contribution to the overall score.

In [None]:
import requests
import json

url = "http://localhost:9200/youtube_index/_rank_eval"
headers = {
  'Content-Type': 'application/json'
}
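
The evaluation cells that follow differ only in the metric name and its parameters, so the payload construction could be factored into a small helper. This is a sketch under that assumption; `rank_eval_payload` is a hypothetical name, and the request list below is a trimmed stand-in for `request_body`.

```python
import json


def rank_eval_payload(rated_requests, metric_name, **metric_params):
    """Serialize a rank-eval request body for one metric and its parameters."""
    return json.dumps({
        "requests": rated_requests,
        "metric": {metric_name: metric_params},
    })


# Trimmed stand-in for the request_body defined earlier
demo_requests = [{"id": "Dynamic", "request": {"query": {"match_all": {}}}, "ratings": []}]

demo_payload = rank_eval_payload(
    demo_requests, "precision", k=5, relevant_rating_threshold=1, ignore_unlabeled=False
)
print(demo_payload)
```

The resulting string can be sent with `requests.request("POST", url, headers=headers, data=demo_payload)` exactly as in the cells below.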

**1) Precision Evaluation Number 1 :** K = 5

In [None]:
payload = json.dumps({
  "requests": request_body,
  "metric": {
    "precision": {
      "k": 5,
      "relevant_rating_threshold": 1,
      "ignore_unlabeled": False
    }
  }
})

response = requests.request("POST", url, headers=headers, data=payload)

json_data = json.loads(response.text)
print("Score is",json_data['details']['Dynamic']['metric_score'])
print(json_data['details']['Dynamic']['metric_details'])


Score is 1.0
{'precision': {'relevant_docs_retrieved': 5, 'docs_retrieved': 5}}


**Precision Evaluation Number 2 :** K = 10

In [None]:
payload = json.dumps({
  "requests": request_body,
  "metric": {
    "precision": {
      "k": 10,
      "relevant_rating_threshold": 1,
      "ignore_unlabeled": False
    }
  }
})

response = requests.request("POST", url, headers=headers, data=payload)

json_data = json.loads(response.text)
print("Score is",json_data['details']['Dynamic']['metric_score'])
print(json_data['details']['Dynamic']['metric_details'])

Score is 0.7
{'precision': {'relevant_docs_retrieved': 7, 'docs_retrieved': 10}}


**Precision Evaluation Number 3 :** K = 15




In [None]:
payload = json.dumps({
  "requests": request_body,
  "metric": {
    "precision": {
      "k": 15,
      "relevant_rating_threshold": 1,
      "ignore_unlabeled": False
    }
  }
})

response = requests.request("POST", url, headers=headers, data=payload)

json_data = json.loads(response.text)
print("Score is",json_data['details']['Dynamic']['metric_score'])
print(json_data['details']['Dynamic']['metric_details'])

Score is 0.4666666666666667
{'precision': {'relevant_docs_retrieved': 7, 'docs_retrieved': 15}}
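
The precision figures above can be reproduced offline with a few lines of Python. This is a minimal sketch of precision at k in which unrated documents count as not relevant; the document IDs are placeholders, arranged so that 7 of the first 10 results are rated, as in the K = 10 run.

```python
def precision_at_k(ranked_ids, ratings, k, threshold=1):
    """Fraction of the top-k results whose rating meets the relevance threshold."""
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    relevant = sum(1 for doc_id in top_k if ratings.get(doc_id, 0) >= threshold)
    return relevant / len(top_k)


# Placeholder data: 7 rated documents followed by 3 unrated ones
precision_ranked = [f"doc{i}" for i in range(1, 8)] + ["other1", "other2", "other3"]
precision_ratings = {f"doc{i}": 1 for i in range(1, 8)}

print(precision_at_k(precision_ranked, precision_ratings, k=10))  # 0.7
print(precision_at_k(precision_ranked, precision_ratings, k=5))   # 1.0
```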


**2) Recall Evaluation Number 1 :** K = 5

In [None]:
payload = json.dumps({
  "requests": request_body,
   "metric": {
    "recall": {
      "k": 5,
      "relevant_rating_threshold": 1
    }
  }
})

response = requests.request("POST", url, headers=headers, data=payload)

json_data = json.loads(response.text)
print("Score is",json_data['details']['Dynamic']['metric_score'])
print(json_data['details']['Dynamic']['metric_details'])

Score is 0.3333333333333333
{'recall': {'relevant_docs_retrieved': 5, 'relevant_docs': 15}}


**Recall Evaluation Number 2 :** K = 10

In [None]:
payload = json.dumps({
  "requests": request_body,
   "metric": {
    "recall": {
      "k": 10,
      "relevant_rating_threshold": 1
    }
  }
})

response = requests.request("POST", url, headers=headers, data=payload)

json_data = json.loads(response.text)
print("Score is",json_data['details']['Dynamic']['metric_score'])
print(json_data['details']['Dynamic']['metric_details'])

Score is 0.4666666666666667
{'recall': {'relevant_docs_retrieved': 7, 'relevant_docs': 15}}


**Recall Evaluation Number 3 :** K = 15

In [None]:
payload = json.dumps({
  "requests": request_body,
   "metric": {
    "recall": {
      "k": 15,
      "relevant_rating_threshold": 1
    }
  }
})

response = requests.request("POST", url, headers=headers, data=payload)

json_data = json.loads(response.text)
print("Score is",json_data['details']['Dynamic']['metric_score'])
print(json_data['details']['Dynamic']['metric_details'])

Score is 0.4666666666666667
{'recall': {'relevant_docs_retrieved': 7, 'relevant_docs': 15}}
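
Recall can be sketched the same way: the number of relevant documents in the top k divided by the number of relevant documents overall. The IDs below are placeholders chosen to mirror the run above, where 15 documents are rated relevant and 7 of them appear in the results.

```python
def recall_at_k(ranked_ids, ratings, k, threshold=1):
    """Share of all relevant documents that appear in the top-k results."""
    relevant_ids = {doc_id for doc_id, rating in ratings.items() if rating >= threshold}
    if not relevant_ids:
        return 0.0
    retrieved = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return retrieved / len(relevant_ids)


# Placeholder data: 15 rated documents, of which only 7 are retrieved
recall_ratings = {f"doc{i}": 1 for i in range(1, 16)}
recall_ranked = [f"doc{i}" for i in range(1, 8)] + ["other1", "other2", "other3"]

print(recall_at_k(recall_ranked, recall_ratings, k=10))  # 7 / 15
print(recall_at_k(recall_ranked, recall_ratings, k=5))   # 5 / 15
```

Recall stops growing once every retrievable rated document is inside the top k, which is why K = 10 and K = 15 give the same score above.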


**3) Mean Reciprocal Rank:** k=1

In [None]:
payload = json.dumps({
  "requests": request_body,
   "metric": {
        "mean_reciprocal_rank": {
            "k": 1,
            "relevant_rating_threshold": 1
        }
    }
})
response = requests.request("POST", url, headers=headers, data=payload)

json_data = json.loads(response.text)
print("Score is",json_data['details']['Dynamic']['metric_score'])
print(json_data['details']['Dynamic']['metric_details'])

Score is 1.0
{'mean_reciprocal_rank': {'first_relevant': 1}}
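
With a single query whose first result is relevant, the mean reciprocal rank is 1/1 = 1.0, as reported above. The following sketch shows the computation for more than one query; the query data is made up for illustration.

```python
def reciprocal_rank(ranked_ids, ratings, k, threshold=1):
    """Return 1/rank of the first relevant document in the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if ratings.get(doc_id, 0) >= threshold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries, k, threshold=1):
    """Average the reciprocal rank over (ranked_ids, ratings) pairs."""
    values = [reciprocal_rank(ranked, ratings, k, threshold) for ranked, ratings in queries]
    return sum(values) / len(values) if values else 0.0


# Hypothetical queries: first relevant hit at rank 1 and at rank 2
mrr_queries = [
    (["a", "b", "c"], {"a": 1}),
    (["x", "y", "z"], {"y": 2}),
]

print(mean_reciprocal_rank(mrr_queries, k=5))  # (1/1 + 1/2) / 2 = 0.75
```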


**4) DCG Evaluation:** k=5

In [None]:
payload = json.dumps({
  "requests": request_body,
   "metric": {
        "dcg": {
            "k": 5,
            "normalize": False
        }
    }
})
response = requests.request("POST", url, headers=headers, data=payload)

json_data = json.loads(response.text)
print("Score is",json_data['details']['Dynamic']['metric_score'])
print(json_data['details']['Dynamic']['metric_details'])

Score is 13031.984986658073
{'dcg': {'dcg': 13031.984986658073, 'unrated_docs': 0}}


**DCG Evaluation :** k=10

In [None]:
payload = json.dumps({
  "requests": request_body,
   "metric": {
        "dcg": {
            "k": 10,
            "normalize": False
        }
    }
})
response = requests.request("POST", url, headers=headers, data=payload)

json_data = json.loads(response.text)
print("Score is",json_data['details']['Dynamic']['metric_score'])
print(json_data['details']['Dynamic']['metric_details'])

Score is 15588.55451385429
{'dcg': {'dcg': 15588.55451385429, 'unrated_docs': 3}}


**DCG Evaluation :** k=15

In [None]:
payload = json.dumps({
  "requests": request_body,
   "metric": {
        "dcg": {
            "k": 15,
            "normalize": False
        }
    }
})
response = requests.request("POST", url, headers=headers, data=payload)

json_data = json.loads(response.text)
print("Score is",json_data['details']['Dynamic']['metric_score'])
print(json_data['details']['Dynamic']['metric_details'])

Score is 15588.55451385429
{'dcg': {'dcg': 15588.55451385429, 'unrated_docs': 8}}
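
The size of these DCG values follows from the exponential gain in the commonly used DCG formula, where a document with rating r at rank i contributes (2^r - 1) / log2(i + 1). The sketch below implements that formula for a list of ratings given in rank order. Treating unrated documents as rating 0 is an assumption of this sketch and may differ from how Elasticsearch handles them.

```python
import math


def dcg_at_k(ratings_in_rank_order, k):
    """Discounted cumulative gain with exponential gain 2**rating - 1."""
    total = 0.0
    for rank, rating in enumerate(ratings_in_rank_order[:k], start=1):
        gain = 2 ** rating - 1
        total += gain / math.log2(rank + 1)
    return total


print(dcg_at_k([3, 2, 0], k=3))  # 7/log2(2) + 3/log2(3) + 0
print(dcg_at_k([13], k=5))       # a single rating of 13 at rank 1 contributes 8191.0
```

Because the gain doubles with every rating step, ratings as high as 15 dominate the sum. A smaller rating scale would keep unnormalized DCG values easier to interpret.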


**5) Expected Reciprocal Rank Evaluation :** k=5




In [None]:
payload = json.dumps({
  "requests": request_body,
   "metric": {
        "expected_reciprocal_rank": {
            "k": 5,
            "maximum_relevance": 3
        }
    }
})
response = requests.request("POST", url, headers=headers, data=payload)

json_data = json.loads(response.text)
print("Score is",json_data['details']['Dynamic']['metric_score'])
print(json_data['details']['Dynamic']['metric_details'])

Score is -275258.26440633135
{'expected_reciprocal_rank': {'unrated_docs': 0}}


**Expected Reciprocal Rank Evaluation :** k=10


In [None]:
payload = json.dumps({
  "requests": request_body,
   "metric": {
        "expected_reciprocal_rank": {
            "k": 10,
            "maximum_relevance": 3
        }
    }
})
response = requests.request("POST", url, headers=headers, data=payload)

json_data = json.loads(response.text)
print("Score is",json_data['details']['Dynamic']['metric_score'])
print(json_data['details']['Dynamic']['metric_details'])

Score is -4827277755.6741705
{'expected_reciprocal_rank': {'unrated_docs': 3}}


**Expected Reciprocal Rank Evaluation :** k=15


In [None]:
payload = json.dumps({
  "requests": request_body,
   "metric": {
        "expected_reciprocal_rank": {
            "k": 15,
            "maximum_relevance": 3
        }
    }
})
response = requests.request("POST", url, headers=headers, data=payload)

json_data = json.loads(response.text)
print("Score is",json_data['details']['Dynamic']['metric_score'])
print(json_data['details']['Dynamic']['metric_details'])

Score is -4827277755.6741705
{'expected_reciprocal_rank': {'unrated_docs': 8}}
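
In the standard ERR definition, a document with rating r stops the user with probability (2^r - 1) / 2^max, where max is the highest rating in use. If maximum_relevance is lower than the ratings actually assigned (here 3, while the ratings go up to 15), these "probabilities" exceed 1 and the score can become negative, which is a plausible explanation for the values above. The sketch below illustrates this, assuming Elasticsearch follows the standard definition; the rating lists are made up.

```python
def expected_reciprocal_rank(ratings_in_rank_order, maximum_relevance):
    """ERR: expected 1/rank of the position where the user stops reading."""
    err = 0.0
    p_continue = 1.0  # probability that the user reaches the current rank
    for rank, rating in enumerate(ratings_in_rank_order, start=1):
        p_stop = (2 ** rating - 1) / 2 ** maximum_relevance
        err += p_continue * p_stop / rank
        p_continue *= 1.0 - p_stop
    return err


# Ratings within the declared maximum give a score between 0 and 1
err_valid = expected_reciprocal_rank([3, 2, 1], maximum_relevance=3)

# Ratings far above the declared maximum break the probability model
err_broken = expected_reciprocal_rank([15, 14], maximum_relevance=3)

# Declaring the true maximum restores a score between 0 and 1
err_fixed = expected_reciprocal_rank([15, 14], maximum_relevance=15)

print(err_valid, err_broken, err_fixed)
```

Under this reading, setting maximum_relevance to the highest rating in the ratings list (15 in this notebook) would be needed for ERR to be interpretable.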


**Conclusion:**

We have implemented Elasticsearch on our YouTube dataset and measured the performance of our search engine using several evaluation metrics. The right query settings depend on the use case of the application and can be tuned further by varying parameters such as the query type, the searched fields, and the tie_breaker.