# Exercise 2
Use Elasticsearch to index the News dataset and write a Python function that implements pseudo-relevance feedback in Elasticsearch

In [1]:
pip install jsonify

Note: you may need to restart the kernel to use updated packages.


In [1]:
import pandas as pd
import re
import json, jsonify
import time
import requests
from requests.auth import HTTPBasicAuth
from elasticsearch import Elasticsearch
import elasticsearch
from elasticsearch.helpers import bulk
import math

print(elasticsearch.__version__)

(8, 12, 0)


In [2]:
USER = 'elastic'
PWD = 'GHafZMbVQ*cgYBz7n7pT'
index_name = 'news'
ES_ENDPOINT = 'https://localhost:9200'

path_to_ca_certificates = 'C:/Users/Usuario/OneDrive/Escritorio/Data in Production/elastic_search/elasticsearch-8.12.0/config/certs/http_ca.crt'

### Load the data

In [3]:
df = pd.read_csv('../data/news/news.csv')
df.head()

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


### Superficial EDA

Dimension of our data

In [4]:
df.shape

(209527, 6)

In [5]:
# Duplicates?
print(f'There are {df[df.duplicated()].shape[0]} duplicated news articles')
# we filter the duplicates out
df = df[~df.duplicated()]
df.head()

There are 13 duplicated news articles


Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


In [7]:
# Nan values?
nan_values = df.isna().sum()
nan_values

link                     0
headline                 6
category                 0
short_description    19712
authors              37405
date                     0
dtype: int64

In [8]:
# Fill missing values: 'Not provided' for short_description, 'Unknown' for authors, 'None' for headline.
# Assigning the result avoids chained in-place fills on a filtered copy of the DataFrame.
df = df.fillna({'short_description': 'Not provided', 'authors': 'Unknown', 'headline': 'None'})

### Data Prep for creating index

In [9]:
# convert the DataFrame into a list of dicts (one JSON-like record per document)
docs = df.to_dict(orient='records')
doc_ids = df.index
print(doc_ids)
print(docs[1703])

Index([     0,      1,      2,      3,      4,      5,      6,      7,      8,
            9,
       ...
       209517, 209518, 209519, 209520, 209521, 209522, 209523, 209524, 209525,
       209526],
      dtype='int64', length=209514)
{'link': 'https://www.huffpost.com/entry/kieran-mac-culkin-snl-hoist-monologue_n_61875f22e4b06de3eb763290', 'headline': "Kieran Culkin Finally Gets The 'SNL' Lift He Hankered For 30 Years Ago", 'category': 'COMEDY', 'short_description': '“My brother’s up there. He’s got his arms up, all victorious. And I’m down there on the ground,\xa0like ... I want uppies," Culkin recalled of being on "SNL" as a boy.', 'authors': 'Mary Papenfuss', 'date': '2021-11-07'}


### Define index settings and mappings

Build the index definition: store term vectors for the 'short_description' field and map both 'headline' and 'short_description' as text so that they support full-text queries. The refresh interval is set to -1 (disabled) so that all the data can be bulk-loaded first.

In [10]:
# create an index
create_index_json={
  "mappings" : {
      "properties" : {
        "headline" :  {
            "type": "text",  #for full-text searches
            "fields": {
              "keyword": {    
                  "type": "keyword" #for exact search
          }}},
        "short_description" : {
            "type": "text",  #for full-text searches
            "term_vector": "with_positions_offsets_payloads",
            "store" : True,
            "fields": {
              "keyword": {    
                  "type": "keyword" #for exact search
          }}},
        "link" : {
          "type" : "text"
        },
        "category" : {
          "type" : "text"
        },
        "authors" : {
          "type" : "text"
        },
        "date" : {
          "type" : "date"
        }
      }
  },
  "settings": {
        "number_of_replicas": 1, # only one replica needed
        "refresh_interval": -1, # the dataset is static and rarely updated, so refreshing is disabled
        "index" : {
        "similarity" : {
          "default" : {
            "type" : "BM25", "b": 0.75, "k1": 1.1 # b is the Elasticsearch default; k1 is slightly below the default of 1.2
          }
        }        
    },
        "analysis": {
            # defined here but not referenced by any field in the mappings above, so the fields use the default standard analyzer
            "analyzer": {"std_english": {"type": "standard", "stopwords": "_english_" }}
        }
  }}


### Load Elasticsearch wrapper
Using the wrapper provided in the exercise sessions

In [11]:
class Elastic:
    """
    A convenience object to send HTTP requests to Elasticsearch
    """
    def __init__(self, endpoint, username, password, path_to_ca_certificates):
        """
        @param endpoint: the URL of the Elasticsearch instance
        @param username: the Elasticsearch username 
        @param password: the Elasticsearch password
        @param path_to_ca_certificates: path to the CA certificate used to verify the TLS connection
        """
        self.header = {'Content-Type': 'application/json', 'charset':'UTF-8'}
        #self.header={'Content-Type': '--data-binary application/x-ndjson'}
        self.endpoint = endpoint
        self.username = username
        self.password = password
        self.path_to_ca_certificates = path_to_ca_certificates
        self.methods_mapping = {'get': requests.get, 
                                'put':requests.put, 
                                'post':requests.post, 
                                'delete':requests.delete}
        
    def curl(self, method, handle, json=None):
        """
        Sends an HTTP request to the Elasticsearch instance
        @param method: can be 'get', 'put', 'post', 'delete'
        @param handle: the API handle to be appended to the Elasticsearch url
        @param json: the json payload of the HTTP request
        """
        http_method = self.methods_mapping[method.lower()]
        r = http_method(f'{self.endpoint}/{handle}', auth=HTTPBasicAuth(self.username, self.password), 
                        headers=self.header, json=json,
                        verify = self.path_to_ca_certificates)
        return r

In [12]:
e = Elastic(ES_ENDPOINT, USER, PWD, path_to_ca_certificates)

In [13]:
# make sure no index with this name already exists
r = e.curl('delete', index_name)
r.json()

{'acknowledged': True}

In [14]:
# create an index
r = e.curl('put', index_name, json=create_index_json)
r.json()

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'news'}

In [15]:
# get the index details and settings
r = e.curl('get', index_name)
r.json()

{'news': {'aliases': {},
  'mappings': {'properties': {'authors': {'type': 'text'},
    'category': {'type': 'text'},
    'date': {'type': 'date'},
    'headline': {'type': 'text', 'fields': {'keyword': {'type': 'keyword'}}},
    'link': {'type': 'text'},
    'short_description': {'type': 'text',
     'store': True,
     'fields': {'keyword': {'type': 'keyword'}},
     'term_vector': 'with_positions_offsets_payloads'}}},
  'settings': {'index': {'routing': {'allocation': {'include': {'_tier_preference': 'data_content'}}},
    'refresh_interval': '-1',
    'number_of_shards': '1',
    'provided_name': 'news',
    'similarity': {'default': {'type': 'BM25', 'b': '0.75', 'k1': '1.1'}},
    'creation_date': '1716711889792',
    'analysis': {'analyzer': {'std_english': {'type': 'standard',
       'stopwords': '_english_'}}},
    'number_of_replicas': '1',
    'uuid': 'W6-vhEzlQU2qr2ag35jGxA',
    'version': {'created': '8500008'}}}}}

Load the data efficiently by doing bulk indexing:

In [16]:
# bulk indexing (via official API)

#connect to the local elasticsearch node and authenticate
es = Elasticsearch([ES_ENDPOINT], ca_certs=path_to_ca_certificates, basic_auth=(USER, PWD))

actions = [
  {
    "_index": index_name,
    "_id": doc_id,
    "_source": doc
  }
  for doc_id, doc in list(zip(doc_ids, docs))
]

# send actions in bulk (the API takes care of chunking them optimally)
bulk(es, actions)

(209514, [])
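
As a possible variation on the bulk load above, the actions could be produced lazily with a generator instead of being built as a full list first. The sketch below is only illustrative: `generate_actions` and the toy `sample_ids` / `sample_docs` are hypothetical names that do not appear in the exercise, and the toy data stands in for `doc_ids` and `docs`.

```python
from typing import Any, Dict, Iterable, Iterator


def generate_actions(index: str,
                     ids: Iterable[Any],
                     records: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per (id, record) pair without building a list."""
    for doc_id, record in zip(ids, records):
        yield {"_index": index, "_id": doc_id, "_source": record}


# Toy data standing in for doc_ids and docs (hypothetical values)
sample_ids = [0, 1]
sample_docs = [{"headline": "a"}, {"headline": "b"}]

# Materialised here only so the actions can be inspected
sample_actions = list(generate_actions("news", sample_ids, sample_docs))
print(sample_actions)
```

Since `bulk` accepts any iterable of actions, a call such as `bulk(es, generate_actions(index_name, doc_ids, docs))` should behave like the list-based version while keeping only one chunk of actions in memory at a time.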

Re-enable periodic refreshing (every 2 seconds) so that the loaded documents become searchable

In [17]:
# reset the refresh interval to 2 seconds
r = e.curl('put', f'{index_name}/_settings', {'index' : {'refresh_interval' : '2s'}})
r.json()

{'acknowledged': True}

Fetch a single document to verify that the bulk load worked properly:

In [18]:
r = e.curl('get', f'{index_name}/_doc/{doc_ids[88]}')
r.json()

{'_index': 'news',
 '_id': '88',
 '_version': 1,
 '_seq_no': 88,
 '_primary_term': 1,
 'found': True,
 '_source': {'link': 'https://www.huffpost.com/entry/ap-as-vietnam-karaoke-fire_n_6319c41be4b0ed021def2c70',
  'headline': 'At Least 32 Dead In Fire At Karaoke Parlor In South Vietnam',
  'category': 'WORLD NEWS',
  'short_description': 'The fire in Thuan An city began late Tuesday and trapped both workers and customers inside the multi-story venue.',
  'authors': 'Unknown',
  'date': '2022-09-08'}}

### Pseudo Relevance Feedback Function
A pseudo relevance feedback function using the description of Rocchio's algorithm provided in lecture 2 of IR
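
The term-weighting step of this approach can be illustrated without a running Elasticsearch instance. The sketch below is a minimal illustration under stated assumptions: each term in a feedback document is weighted by (1 + ln tf) * ln(N / df) and the weights are summed across documents. The helper names `score_terms` and `top_terms` and the term statistics in `toy_vectors` are hypothetical and made up for the example; they are not taken from the index.

```python
import math
from collections import defaultdict


def score_terms(term_vectors, doc_count):
    """Sum (1 + ln tf) * ln(N / df) per term over the feedback documents."""
    scores = defaultdict(float)
    for terms in term_vectors:
        for term, info in terms.items():
            tf = 1 + math.log(info["term_freq"])          # log-scaled term frequency
            idf = math.log(doc_count / info["doc_freq"])  # inverse document frequency
            scores[term] += tf * idf
    return dict(scores)


def top_terms(scores, m):
    """Return the m highest-scoring terms, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:m]


# Hypothetical statistics for two feedback documents in a 1000-document index
toy_vectors = [
    {"plane": {"term_freq": 1, "doc_freq": 100},
     "germanwings": {"term_freq": 1, "doc_freq": 1}},
    {"plane": {"term_freq": 2, "doc_freq": 100},
     "killed": {"term_freq": 1, "doc_freq": 50}},
]

toy_scores = score_terms(toy_vectors, doc_count=1000)
print(top_terms(toy_scores, 2))
```

A rare term that occurs in a single feedback document can outrank a common term that occurs in several of them, which may explain why expansion terms are sometimes very specific to one article.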

In [19]:
def pseudo_relevance_feedback(index, query, field_to_match, top_k=10, top_m=10):
    words = query.lower().split()

    # Step 1: Run the initial full-text query and keep the top-k hits as pseudo-relevant documents
    queryft = {
        "size": top_k,
        "query": {
            "match": {
                field_to_match: {
                    "query": query
                }
            }
        }
    }
    relevant_doc = e.curl('get', f'{index}/_search', queryft)
    relevant_docs = relevant_doc.json()["hits"]["hits"]
    doc_ids = [doc["_id"] for doc in relevant_docs]

    # Step 2: Get the TF-IDF weights of all terms in the top-k documents and sum them per term
    term_scores = {}
    for doc_id in doc_ids:
        term_vectors = e.curl('get', f'{index}/_termvectors/{doc_id}',
                              json={"fields": [field_to_match], "term_statistics": True})
        term_vector = term_vectors.json()
        # Skip documents that have no term vector for this field
        if field_to_match not in term_vector.get("term_vectors", {}):
            continue
        field_vector = term_vector["term_vectors"][field_to_match]
        doc_count = field_vector["field_statistics"]["doc_count"]
        for term, term_info in field_vector["terms"].items():
            tf = 1 + math.log(term_info["term_freq"])  # log-scaled term frequency (exhaustivity)
            idf = math.log(doc_count / term_info["doc_freq"])  # inverse document frequency
            term_scores[term] = term_scores.get(term, 0) + tf * idf  # TF-IDF formula

    # Step 3: Get the top-M terms in the resulting vector
    top_m_terms = sorted(term_scores.keys(), key=lambda x: term_scores[x], reverse=True)[:top_m]

    # Step 4: Submit a new query that contains the initial terms plus the new top-M terms
    expanded_query = query
    for term in top_m_terms:
        if term not in words:
            expanded_query += f" {term}"

    expand_query = {
        "query": {
            "match": {
                field_to_match: {
                    "query": expanded_query
                }
            }
        }
    }
    print('New query used is: ', expanded_query)
    expanded_result = e.curl('get', f'{index}/_search', expand_query)
    expanded_results = expanded_result.json()["hits"]["hits"]

    return expanded_results

In [20]:
# query on 'plane crash'
index_name = "news"
initial_query = "plane crash"
field_to_match = "short_description"
expanded_results = pseudo_relevance_feedback(index_name, initial_query, field_to_match)
expanded_results

New query used is:  plane crash 176 barnes killed resonances along germanwings small mountainous


[{'_index': 'news',
  '_id': '3987',
  '_score': 35.312504,
  '_source': {'link': 'https://www.huffpost.com/entry/steve-barnes-death_n_5f788083c5b64b480aae740e',
   'headline': 'Attorney Steve Barnes Of Cellino & Barnes Dies In Plane Crash',
   'category': 'U.S. NEWS',
   'short_description': 'Barnes, whose firm was known for its ads and catchy jingle, died in a small plane crash along with his niece, Elizabeth Barnes.',
   'authors': 'Jim Mustian, AP',
   'date': '2020-10-03'}},
 {'_index': 'news',
  '_id': '5462',
  '_score': 33.91592,
  '_source': {'link': 'https://www.huffpost.com/entry/iran-protests-plane-shot-down_n_5e1b1a07c5b6640ec3d5df2f',
   'headline': 'Iranians Defy Police, Protest Over Ukranian Plane Shootdown',
   'category': 'WORLD NEWS',
   'short_description': 'The plane crash killed all 176 people on board, mostly Iranians and Iranian-Canadians.',
   'authors': 'Joseph Krauss and Jon Gambrell, AP',
   'date': '2020-01-12'}},
 {'_index': 'news',
  '_id': '5412',
  '_sc

In [21]:
# query on 'London'
index_name = "news"
initial_query = "London"
field_to_match = "short_description"
expanded_results = pseudo_relevance_feedback(index_name, initial_query, field_to_match)
expanded_results

New query used is:  London city offer.therefore tourist’s taxi's underground's paris cabs selfridges owing


[{'_index': 'news',
  '_id': '57481',
  '_score': 42.054108,
  '_source': {'link': 'https://www.huffingtonpost.com/entry/5-things-you-must-do-in-london-while-traveling-england_us_57cdc099e4b07addc413ce14',
   'headline': '5 Things You Must Do In London While Traveling England',
   'category': 'TRAVEL',
   'short_description': 'London is a tourist’s paradise owing to the several iconic attractions it has to offer.Therefore,it makes sense why London',
   'authors': 'Rachel M. Moore, ContributorI’m a 20 something runner-girl from the Alabama area, and this...',
   'date': '2016-09-05'}},
 {'_index': 'news',
  '_id': '198223',
  '_score': 31.207134,
  '_source': {'link': 'https://www.huffingtonpost.comhttp://online.wsj.com/article/SB10001424052702304192704577404080468282556.html?mod=priority_pass',
   'headline': 'London Taxi Company Exports Famous Cabs To Azerbaijan',
   'category': 'TRAVEL',
   'short_description': "The former Soviet republic is London Taxi's biggest single customer. It 

Now let's compare the results with normal queries:

In [22]:
def full_text_query(index, query):
    queryft ={
          "query": {
            "match": {
                "short_description": {
                    "query": query
                }
            }
          }
        } 
    
    answer = e.curl('get', f'{index}/_search', queryft)
    answers = answer.json()["hits"]["hits"]
    return answers

In [23]:
full_text_query(index_name, 'plane crash')

[{'_index': 'news',
  '_id': '133956',
  '_score': 15.93325,
  '_source': {'link': 'https://www.huffingtonpost.com/entry/master-traveler-tips_us_5b9dfa8ee4b03a1dcc8fcb0e',
   'headline': '16 Things Master Travelers Do Differently',
   'category': 'TRAVEL',
   'short_description': 'They know what to do during a plane crash. "A number of crash studies focusing on both survivors and staged experiments have',
   'authors': 'Suzy Strutner',
   'date': '2014-04-15'}},
 {'_index': 'news',
  '_id': '5462',
  '_score': 15.778189,
  '_source': {'link': 'https://www.huffpost.com/entry/iran-protests-plane-shot-down_n_5e1b1a07c5b6640ec3d5df2f',
   'headline': 'Iranians Defy Police, Protest Over Ukranian Plane Shootdown',
   'category': 'WORLD NEWS',
   'short_description': 'The plane crash killed all 176 people on board, mostly Iranians and Iranian-Canadians.',
   'authors': 'Joseph Krauss and Jon Gambrell, AP',
   'date': '2020-01-12'}},
 {'_index': 'news',
  '_id': '5412',
  '_score': 14.197142,


In [24]:
full_text_query(index_name, 'London')

[{'_index': 'news',
  '_id': '15674',
  '_score': 8.752179,
  '_source': {'link': 'https://www.huffingtonpost.com/entry/kim-kardashian-wants-more-kids_us_5a68d590e4b0dc592a0eec21',
   'headline': 'Kim Kardashian Reportedly Wants More Kids After Baby Chicago',
   'category': 'ENTERTAINMENT',
   'short_description': 'New York? London? Kansas City?',
   'authors': 'Cole Delbyck',
   'date': '2018-01-24'}},
 {'_index': 'news',
  '_id': '162654',
  '_score': 8.514979,
  '_source': {'link': 'https://www.huffingtonpost.com/entry/great-makeout-spots-in-lo_us_5b9d3ee1e4b03a1dcc85e4cb',
   'headline': 'Great Make-Out Spots in London (VIDEO)',
   'category': 'TRAVEL',
   'short_description': 'London is a very romantic city.',
   'authors': 'Kate Thomas, Contributor\nOn-camera Host and Video Producer, TravelwithKate.com',
   'date': '2013-06-13'}},
 {'_index': 'news',
  '_id': '194344',
  '_score': 8.476414,
  '_source': {'link': 'https://www.huffingtonpost.com/entry/taylor-tomasi-hill-fashion-wee

##### Results comparison for news with 'plane crash'

With the expanded query for 'plane crash', the top results focus on actual accidents. The simple full-text query also returns articles about surviving a crash, as well as articles that contain only 'plane' or only 'crash' rather than both, so its results cover a broader, less well-defined range. This illustrates how expanding the query through pseudo-relevance feedback refines and narrows the topic.

##### Results comparison for news with 'London'

Here too, the first results of the simple search are broader and sometimes off-topic: London is mentioned only in passing, or appears as a person's name rather than as the main subject. With the expanded query, the top results are more clearly focused on London itself.
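
To make such comparisons less dependent on reading the hits by eye, the two result lists could also be compared by document id. The following is a rough sketch: `compare_hits` is a hypothetical helper, and the ids below are only the first three hits visible in the truncated 'plane crash' outputs above, not the full result lists.

```python
def compare_hits(baseline_hits, expanded_hits):
    """Split the document ids into shared, baseline-only and expanded-only lists."""
    baseline_ids = [hit["_id"] for hit in baseline_hits]
    expanded_ids = [hit["_id"] for hit in expanded_hits]
    return {
        "shared": [i for i in expanded_ids if i in baseline_ids],
        "only_baseline": [i for i in baseline_ids if i not in expanded_ids],
        "only_expanded": [i for i in expanded_ids if i not in baseline_ids],
    }


# First three ids shown above for the simple and the expanded 'plane crash' query
baseline = [{"_id": "133956"}, {"_id": "5462"}, {"_id": "5412"}]
expanded = [{"_id": "3987"}, {"_id": "5462"}, {"_id": "5412"}]

comparison = compare_hits(baseline, expanded)
print(comparison)
```

In the notebook, the same helper could be applied to the lists returned by `full_text_query` and `pseudo_relevance_feedback` for the same query.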