# Assignment 3: Indexing

## Team-005


### Indexing DBpedia Data


In [19]:
import pandas as pd
from pandas import DataFrame, Series
import json
from elasticsearch import Elasticsearch
import re

## Subcollections and Pre-processing

We indexed these subcollections in full: `labels_en`, `long_abstracts_en`, `article_categories_en`, and `page_links_en`.

Because the DBpedia data is distributed across subcollections, it needed pre-processing, such as looking up (resolving) predicate values. Given the large size of the collections, we performed this pre-processing with third-party tools, as described in the report.
The pre-processed data was then dumped into a single text file, `entities_cleaned.txt`, which can be accessed [here](https://drive.google.com/open?id=17hMZECtkvKCypqHB9N1FStwD6Oq9N5D_).

### Settings

We used 1 shard and 0 replicas because the index runs locally on a single machine.

In [3]:
SETTINGS = {
    "index": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}

We used the built-in `english` analyzer and built a positional index by setting `term_vector` to `with_positions`.

In [4]:
TEXT_PROPERTIES = {
    "type": "text",
    "term_vector": "with_positions",
    "analyzer": "english"
}

TEXT_INDEX_SETTINGS = {
    "settings": SETTINGS,
    "mappings": {
        "properties": {
            "names": TEXT_PROPERTIES,
            "categories": TEXT_PROPERTIES,
            "similar_entities": TEXT_PROPERTIES,
            "abstract": TEXT_PROPERTIES,
            "catch_all": TEXT_PROPERTIES
        }
    }
}
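To make "positional index" concrete, the toy sketch below maps each term to the documents it occurs in and the positions where it occurs. It is only an illustration: it uses naive lowercase whitespace tokenization (no stemming or stop-word removal, unlike the `english` analyzer), it does not reflect how Elasticsearch stores term vectors internally, and the names `build_positional_index` and `toy_docs` are made up for this example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} using naive whitespace tokens."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


# Hypothetical toy documents, not taken from the DBpedia data
toy_docs = {
    "e1": "new york city",
    "e2": "york is a city in england",
}
positional_index = build_positional_index(toy_docs)
print(positional_index["york"])  # {'e1': [1], 'e2': [0]}
```

Keeping positions alongside the postings is what makes phrase and proximity matching possible, since the engine can check that two terms occur next to each other in the same document.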

The data directory containing the `entities_cleaned.txt` file, which can be downloaded from the link above.

In [9]:
DATA_DIR = "data/"

Check whether an index with the same name already exists and, if so, delete it; then create a new index.

In [6]:
INDEX_NAME = "dbpedia_text"

es = Elasticsearch()

if es.indices.exists(index=INDEX_NAME):
    # Delete the index if it exists
    resp = es.indices.delete(index=INDEX_NAME)
    print(f"Deleting the existing index: {resp}")

resp = es.indices.create(index=INDEX_NAME, body=TEXT_INDEX_SETTINGS)
print(f"Creating the index: {resp}")

Creating the index: {'acknowledged': True, 'shards_acknowledged': True, 'index': 'dbpedia_tex'}


Since the `entities_cleaned.txt` file is large (4GB), we use pandas to read it in chunks of 10,000 rows and process one chunk at a time.

For each chunk, we build a bulk request body whose documents have the fields `names`, `categories`, `similar_entities`, `abstract`, and `catch_all`.
We then index the chunk in bulk with the `es.bulk()` method.
Printing the cumulative number of entities indexed after each chunk shows the indexing progress.

In [11]:
# ENTITIES FILE
file_count = 0
chunks = pd.read_csv(DATA_DIR+"entities_cleaned.txt",
                     delimiter=';p;', skiprows=1,
                     names=['subject', 'label', 'page_link_entities',
                            'category_entities',
                            'category_names', 'abstract', 'page_link_names'],
                     engine='python', chunksize=10000)

for df in chunks:
    # Collapse whitespace, underscores and semicolons into single spaces
    df['page_link_names'] = df['page_link_names'].replace(
        to_replace=r'[\s_;]+', value=' ', regex=True)
    bulk_data = []
    entity_list = json.loads(df.to_json(orient='records'))

    for entity in entity_list:
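        # Missing values arrive as None after the JSON round trip,
        # so fall back to a blank instead of indexing the string "None"
        catch_all_fields = [
            str(entity.get('label') or ' '),
            str(entity.get('page_link_names') or ' '),
            str(entity.get('category_names') or ' '),
            str(entity.get('abstract') or ' ')
        ]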
        bulk_data.append(
            {
                "index": {
                    "_index": INDEX_NAME,
                    "_id": entity['subject']
                }
            }
        )
        bulk_data.append(
            {
                'names': entity.get('label', ' '),
                'categories': entity.get('category_names', ' '),
                'similar_entities': entity.get('page_link_names', ' '),
                'abstract': entity.get('abstract', ' '),
                # Join with a space so adjacent fields do not fuse into one token
                'catch_all': " ".join(catch_all_fields)
            }
        )

    es.bulk(index=INDEX_NAME, body=bulk_data, refresh=True)

    file_count = file_count+df.shape[0]
    print(file_count)
    
    break  # Stops after the first chunk; remove this line to index the entire file


print("-"*100)
print("Finished indexing all the entities into index:  {}".format(INDEX_NAME))


10000
----------------------------------------------------------------------------------------------------
Finished indexing all the entities into index:  dbpedia_tex
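The bulk body built in the loop above alternates an action line with the document it applies to. The sketch below isolates that step in a small function that can be checked without a running cluster. It is an illustration under assumptions, not the exact code used for the run: the helper name `build_bulk_body`, the sample entity, and its `<dbpedia:Example>` identifier are hypothetical, and missing values are mapped to empty strings.

```python
def build_bulk_body(entities, index_name):
    """Return the alternating action/document list expected by a bulk request."""
    # Target index field -> source column (same mapping as the cell above)
    fields = {
        "names": "label",
        "similar_entities": "page_link_names",
        "categories": "category_names",
        "abstract": "abstract",
    }
    body = []
    for entity in entities:
        # Treat missing (None) values as empty strings
        doc = {target: str(entity.get(source) or "")
               for target, source in fields.items()}
        # Concatenate the non-empty fields, separated by spaces
        doc["catch_all"] = " ".join(value for value in doc.values() if value)
        body.append({"index": {"_index": index_name, "_id": entity["subject"]}})
        body.append(doc)
    return body


# Hypothetical sample record for illustration only
sample = [{
    "subject": "<dbpedia:Example>",
    "label": "Example",
    "category_names": None,
    "page_link_names": "Sample Entity",
    "abstract": "An example abstract.",
}]
body = build_bulk_body(sample, "dbpedia_text")
print(body[1]["catch_all"])  # Example Sample Entity An example abstract.
```

Each entity contributes two list items, so a chunk of 10,000 rows yields a body of 20,000 items.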


## Generating First Pass Files

Initializing the input and output filenames and the index name

In [14]:
QUERIES_1 = "data/queries.txt"
QUERIES_2 = "data/queries2.txt"
QUERIES_1_FIRST_PASS = "data/first_pass_bm25_one.csv"
QUERIES_2_FIRST_PASS = "data/first_pass_bm25_two.csv"

INDEX_NAME = "dbpedia_text"

es = Elasticsearch()

Creating a function that loads the queries as a list of dictionaries in the format `{'QueryId': val, 'Query': val}`

In [15]:
# Loading the actual queries
def load_queries(queries_file):
    queries_df = pd.read_csv(queries_file, header=None, delimiter=';')
    # Query text: strip the leading query ID (first token ending in a digit)
    df = queries_df.replace(to_replace=[r'^\S+\d+\s'], value='', regex=True)
    # Query ID: keep only that leading token
    queries_df = queries_df.applymap(
        lambda x: re.findall(r'^\S+\d+\s', x)[0].strip())
    # Put the ID and the text side by side, one row per query
    queries_df = queries_df.merge(df, right_index=True, left_index=True)
    queries_df.columns = ['QueryId', 'Query']

    return queries_df.to_dict(orient='records')
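The ID/text split performed by `load_queries` can also be sketched for a single line with one regular expression. This assumes each line has the form `<QueryId> <query text>`, with the ID being the first whitespace-delimited token and ending in a digit; the helper name `parse_query_line` and the sample query text are made up for illustration.

```python
import re

# ID: non-space characters ending in a digit; the rest of the line is the query
QUERY_LINE = re.compile(r"^(\S+\d+)\s+(.*)$")


def parse_query_line(line):
    """Split one line into its query ID and query text."""
    match = QUERY_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognized query line: {line!r}")
    return {"QueryId": match.group(1), "Query": match.group(2)}


# Hypothetical query text; only the ID format is taken from the output below
parsed = parse_query_line("INEX_LD-2009022 example query text")
print(parsed)  # {'QueryId': 'INEX_LD-2009022', 'Query': 'example query text'}
```

Raising on an unrecognized line makes a malformed queries file fail loudly instead of silently producing an empty ID.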

Retrieving the top 100 entities for each query from the index

In [16]:
def search(query, query_id):
    res = es.search(index=INDEX_NAME, q=query, size=100,
                    _source=False, analyzer='english', request_timeout=30)
    hits = res['hits']['hits']
    matched_entities = [(query_id, entity['_id']) for entity in hits]

    return DataFrame.from_records(matched_entities)
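The `q=` parameter is parsed with the query string syntax, in which characters such as `:`, `/`, and `"` have special meaning. One possible alternative, sketched below and not what was used for the results in this notebook, is to build an explicit `match` query against a single field. The helper name `build_match_query` and the choice of `catch_all` as the default field are assumptions made for illustration.

```python
def build_match_query(query, field="catch_all", size=100):
    """Build a request body that runs a full-text match on a single field."""
    return {
        "size": size,        # number of hits to return
        "_source": False,    # only the document IDs are needed
        "query": {"match": {field: {"query": query}}},
    }


request_body = build_match_query("example query text")
print(request_body["query"])
# {'match': {'catch_all': {'query': 'example query text'}}}
```

With a 7.x Python client, a body like this would typically be passed as `es.search(index=INDEX_NAME, body=request_body)`; the text is then analyzed with the analyzer configured for the field rather than parsed as query syntax.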

Generating the first pass file

In [17]:
def generate_first_pass_file(queries_file, output_file):
    query_top_100_dfs = []

    queries = load_queries(queries_file)
    for q in queries:
        print(f"Searching entities for query: {q['QueryId']}")
        top_100 = search(q['Query'], q['QueryId'])
        query_top_100_dfs.append(top_100)

    query_entity_df = pd.concat(query_top_100_dfs, ignore_index=True)
    query_entity_df.columns = ['QueryId', 'EntityId']

    with open(output_file, 'w', encoding="utf-8", errors='ignore') as f:
        f.write("QueryId,EntityId\n")
        for rec in query_entity_df.to_dict(orient='records'):
            f.write("{},{}\n".format(rec['QueryId'], '"'+rec['EntityId']+'"'))

    print("-"*100)
    print(
        f"Finished generating first pass file for {len(queries)} queries")
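Wrapping the entity ID in double quotes by hand produces an invalid CSV row if the ID itself contains a double quote. The sketch below shows one way to guard against that by doubling embedded quotes, which is the standard CSV escaping convention, and then checks the result with the `csv` module. The helper name `format_row` and the sample entity ID are hypothetical.

```python
import csv
import io


def format_row(query_id, entity_id):
    """Format one output line, doubling any quotes inside the entity ID."""
    escaped = entity_id.replace('"', '""')
    return f'{query_id},"{escaped}"\n'


# Hypothetical entity ID containing double quotes
line = format_row("SemSearch_ES-63", '<dbpedia:"Example"_Entity>')
round_trip = next(csv.reader(io.StringIO(line)))
print(round_trip)  # ['SemSearch_ES-63', '<dbpedia:"Example"_Entity>']
```

The output keeps the same layout as the file written above (unquoted query ID, quoted entity ID), so downstream readers do not need to change.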

Reading from files and generating first pass files

In [20]:
# Generating first pass file for `queries.txt`
generate_first_pass_file(QUERIES_1, QUERIES_1_FIRST_PASS)

Searching entities for query: INEX_LD-2009022
Searching entities for query: INEX_LD-2009053
Searching entities for query: INEX_LD-2009062
Searching entities for query: INEX_LD-2009074
Searching entities for query: INEX_LD-2009111
Searching entities for query: INEX_LD-2010004
Searching entities for query: INEX_LD-2010019
Searching entities for query: INEX_LD-2010037
Searching entities for query: INEX_LD-2010057
Searching entities for query: INEX_LD-2010100
Searching entities for query: INEX_LD-20120111
Searching entities for query: INEX_LD-20120121
Searching entities for query: INEX_LD-20120131
Searching entities for query: INEX_LD-20120211
Searching entities for query: INEX_LD-20120221
Searching entities for query: INEX_LD-20120231
Searching entities for query: INEX_LD-20120311
Searching entities for query: INEX_LD-20120321
Searching entities for query: INEX_LD-20120331
Searching entities for query: INEX_LD-20120411
Searching entities for query: INEX_LD-20120421
Searching entities for 

Searching entities for query: SemSearch_ES-63
Searching entities for query: SemSearch_ES-66
Searching entities for query: SemSearch_ES-68
Searching entities for query: SemSearch_ES-70
Searching entities for query: SemSearch_ES-72
Searching entities for query: SemSearch_ES-74
Searching entities for query: SemSearch_ES-76
Searching entities for query: SemSearch_ES-78
Searching entities for query: SemSearch_ES-80
Searching entities for query: SemSearch_ES-82
Searching entities for query: SemSearch_ES-84
Searching entities for query: SemSearch_ES-86
Searching entities for query: SemSearch_ES-89
Searching entities for query: SemSearch_ES-90
Searching entities for query: SemSearch_ES-93
Searching entities for query: SemSearch_ES-95
Searching entities for query: SemSearch_ES-97
Searching entities for query: SemSearch_ES-99
Searching entities for query: SemSearch_LS-10
Searching entities for query: SemSearch_LS-12
Searching entities for query: SemSearch_LS-14
Searching entities for query: SemS

In [None]:
# Generating first pass file for `queries2.txt`
generate_first_pass_file(QUERIES_2, QUERIES_2_FIRST_PASS)