# Similarity Modulation

Here we implement a similarity function other than BM25, which is the default in Elastic. We want you to implement a tf-idf similarity and test it with the same queries as in phase 2, so that you can get a sense of how well your Elastic tf-idf works. Follow the instructions and fill in the code wherever it says # TODO.  <br>
You can contact me in case of any problems via Telegram: @mahvash_sp

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install elasticsearch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting elasticsearch
  Downloading elasticsearch-8.3.1-py3-none-any.whl (382 kB)
Collecting elastic-transport<9,>=8
  Downloading elastic_transport-8.1.2-py3-none-any.whl (59 kB)
Collecting urllib3<2,>=1.26.2
  Downloading urllib3-1.26.9-py2.py3-none-any.whl (138 kB)
Installing collected packages: urllib3, elastic-transport, elasticsearch
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling urllib3-1.24.3:
      Successfully uninstalled urllib3-1.24.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
requests 2.23.0 requires

In [3]:
from elasticsearch import Elasticsearch, helpers
import json
import warnings



In [4]:
# import data in json format
file_name = '/content/drive/MyDrive/IR_data_news_12k.json'

with open(file_name) as f:
    data = json.load(f)

In [5]:
# Filter warnings
warnings.filterwarnings('ignore')

In [6]:
# data keys
data['0'].keys()

dict_keys(['title', 'content', 'tags', 'date', 'url', 'category'])

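After starting Elasticsearch (a local instance listens on localhost:9200 by default), we connect to it with the following piece of code. The cell below targets an Elastic Cloud deployment through `cloud_id` and `basic_auth`; for a local instance, pass the host URL to `Elasticsearch` instead.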


In [7]:
# Here we try to connect to Elastic
es = Elasticsearch(
    cloud_id="IR:dXMtY2VudHJhbDEuZ2NwLmNsb3VkLmVzLmlvOjQ0MyQ2NTdiMmI1MzE3YTc0OWNjYWQ0YTU1ZDM4ZGUyOGI0NiQzYTU2YWFlYzQ1NzI0YjkxOTZlN2YwOWExNDU3ODk2MQ==",
    basic_auth=("elastic", "TOseKiFNqilMmEBgf1i0Dv92")
    )

## Create tf-idf Index

### Create Index

In [8]:
# Name of index 
sm_index_name = 'tfidf_index'

In [9]:
# Delete the index if it already exists
if es.indices.exists(index=sm_index_name):
    es.indices.delete(index=sm_index_name)

# Create index    
es.indices.create(index=sm_index_name)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'tfidf_index'})

### Add documents

Here we use the bulk doc formatter, which was introduced in the first subsection of phase 3. <br>
You can find out more [here](https://stackoverflow.com/questions/61580963/insert-multiple-documents-in-elasticsearch-bulk-doc-formatter).

In [10]:

from elasticsearch.helpers import bulk

def bulk_sync():
    actions = [
        {
            '_index': sm_index_name,
            '_id':doc_id,
            '_source': doc
        } for doc_id,doc in data.items()
    ]
    bulk(es, actions)
    
    


In [11]:
# run the function to add documents
bulk_sync()

In [12]:
# Check index
es.count(index = sm_index_name)

ObjectApiResponse({'count': 12202, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}})
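
The list comprehension in `bulk_sync` holds every action in memory at once. For larger collections, a generator may be a lighter alternative, since `bulk` accepts any iterable of actions. The following is a minimal sketch: `generate_actions` is a hypothetical helper name, and the sample documents are made up for illustration.

```python
def generate_actions(index_name, docs):
    """Yield one bulk action per document (hypothetical helper)."""
    for doc_id, doc in docs.items():
        yield {
            "_index": index_name,
            "_id": doc_id,
            "_source": doc,
        }


# Made-up sample in the same shape as `data` (doc id -> document dict)
sample_docs = {
    "0": {"title": "first", "content": "some text"},
    "1": {"title": "second", "content": "more text"},
}

sample_actions = list(generate_actions("tfidf_index", sample_docs))
print(len(sample_actions), sample_actions[0]["_id"])
```

With a live connection, `bulk(es, generate_actions(sm_index_name, data))` would index the same documents as `bulk_sync()`.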

### Configuring a similarity

To configure a new similarity function, you change the similarity through the settings API of the index, which is exposed in Python as `put_settings`. We replace the `default` similarity in Elastic so that the index uses our similarity instead. The type of this similarity is set to `scripted` because tf-idf is no longer among the predefined similarity functions in Elastic. Because the similarity is scripted, its source code must be written **by you** and passed to it.<br>
> For the changes to be applied, we first close the index, change the settings, and then reopen it.<br>

Write the tf-idf code as a string and pass it as the value of the "source" key. <br>
You can find the variables available to your code [here](https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-similarity-context.html).

In [25]:
# TODO : uncomment the code below, write the tf-idf code in here
# Painless script: log-scaled term frequency times smoothed inverse document frequency
source_code = "double tf = Math.log(1+doc.freq);double idf=Math.log((field.docCount+1.0)/(term.docFreq+1.0));return tf*idf;"
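
To sanity-check the script before applying it, the same formula can be mirrored in plain Python. This is only a sketch: `tfidf_score` is a hypothetical helper, its parameters stand in for the Painless variables `doc.freq`, `field.docCount`, and `term.docFreq`, and the numbers in the example are invented.

```python
import math


def tfidf_score(freq, doc_count, doc_freq):
    """Python mirror of the Painless script above (hypothetical helper).

    freq      -> doc.freq        (occurrences of the term in the document)
    doc_count -> field.docCount  (documents that have a value for the field)
    doc_freq  -> term.docFreq    (documents that contain the term)
    """
    tf = math.log(1 + freq)
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0))
    return tf * idf


# Invented numbers: a term seen 3 times in a document, present in 50 of 12202 documents
rare = tfidf_score(freq=3, doc_count=12202, doc_freq=50)
# The same term frequency for a term that appears in every document
common = tfidf_score(freq=3, doc_count=12202, doc_freq=12202)
print(rare, common)
```

A term that occurs in every document gets an idf of zero and so contributes nothing to the score. Because `term.docFreq` cannot exceed `field.docCount`, this formula never produces a negative value.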

In [26]:
# closing the index
es.indices.close(index=sm_index_name)

# applying the settings
es.indices.put_settings(index=sm_index_name, 
                            settings={
                                "similarity": {
                                      "default": {
                                        "type": "scripted",
                                        "script": {
                                          # TODO : uncomment the code below and pass the suitable parameter
                                          "source": source_code
                                        }
                                      }
                                }
                            }
                       )

# reopening the index
es.indices.open(index=sm_index_name)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True})
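
If you want to try several scripts, it may help to build the settings body in one place. The sketch below assumes a hypothetical helper named `scripted_similarity_settings`; it only constructs the dictionary passed as `settings` above and does not contact the cluster.

```python
def scripted_similarity_settings(source, name="default"):
    """Return index settings that register `source` as a scripted similarity.

    Hypothetical helper: `name` is the similarity name; "default" replaces the
    similarity used by fields that do not specify one.
    """
    return {
        "similarity": {
            name: {
                "type": "scripted",
                "script": {"source": source},
            }
        }
    }


# Same script as in the cell above
tfidf_source = (
    "double tf = Math.log(1+doc.freq);"
    "double idf=Math.log((field.docCount+1.0)/(term.docFreq+1.0));"
    "return tf*idf;"
)
tfidf_settings = scripted_similarity_settings(tfidf_source)
print(tfidf_settings["similarity"]["default"]["type"])
```

The result could then be passed as `es.indices.put_settings(index=sm_index_name, settings=tfidf_settings)` while the index is closed.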

### Query

In this section you test your index with the same queries you tested in phase 2. The goal is to observe how differently or similarly your Elastic tf-idf implementation behaves.

In [27]:
# Build the request body for a match query on the "content" field
def get_query(text):
    body = {
        "query": {
            "match": {
                "content": text
            }
        }
    }

    return body

In [31]:
queries = [
    #TODO : add your queries in string format to this list
    "توپ طلای رونالدو"
]

In [32]:
all_res_tfidf = []

# explain=True adds a per-hit scoring breakdown under '_explanation'
for q in queries:
    res_tfidf = es.search(index=sm_index_name, body=get_query(q), explain=True)
    all_res_tfidf.append(dict(res_tfidf))
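
Because the search is run with `explain=True`, each hit carries an `_explanation` tree with `value`, `description`, and `details` entries. A small sketch for flattening that tree into indented lines follows; `flatten_explanation` is a hypothetical helper and the sample hit is hand-written, so the descriptions your cluster returns will differ.

```python
def flatten_explanation(explanation, depth=0):
    """Return the explanation tree as a list of indented text lines."""
    lines = [
        "{}{:.4f}  {}".format("  " * depth, explanation["value"], explanation["description"])
    ]
    for child in explanation.get("details", []):
        lines.extend(flatten_explanation(child, depth + 1))
    return lines


# Hand-written stand-in for one hit; real descriptions come from Elasticsearch
sample_hit = {
    "_id": "42",
    "_score": 3.5,
    "_explanation": {
        "value": 3.5,
        "description": "sum of:",
        "details": [
            {"value": 2.0, "description": "score for term A", "details": []},
            {"value": 1.5, "description": "score for term B", "details": []},
        ],
    },
}

explanation_lines = flatten_explanation(sample_hit["_explanation"])
print("\n".join(explanation_lines))
```

Applied to the results above, the call would be `flatten_explanation(all_res_tfidf[0]['hits']['hits'][0]['_explanation'])`.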

In [33]:
for res, q in zip(all_res_tfidf, queries):
    print(q)
    for doc in res['hits']['hits']:
        print(doc['_source']['url'])
    print("----------------------------")

توپ طلای رونالدو
https://www.farsnews.ir/news/14001128000368/رویای-همبازی-شدن-رونالدو-با-مسی-امکان-پذیر-نیست
https://www.farsnews.ir/news/14001005000044/زنگ-خطر-برای-مسی-و-رونالدو-به-صدا-درآمد-عکس
https://www.farsnews.ir/news/14001001000174/امباپه-مترجم-مکالمه-مسی-و-لواندوفسکی-شد
https://www.farsnews.ir/news/14001010000262/واکنش-مندز-به-شایعه-جدایی-رونالدو-از-منچستر
https://www.farsnews.ir/news/14001016000102/مسی-در-انتخابات-آمریکا-رونالدو-را-کنار-زد-عکس
https://www.farsnews.ir/news/14001130000713/مجسمه-رونالدو-در-برنابئو
https://www.farsnews.ir/news/14001118000663/انتخاب-گزینه‌ای-جوان-و-درعین-حال-باتجربه-مومنی‌مقدم-چگونه-سرمربی-تیم
https://www.farsnews.ir/news/14001217000846/افشاگری-رسانه-ایتالیایی-از-رونالدو-کریستیانو-در-یوونتوس-هم-بی‌مسئولیت
https://www.farsnews.ir/news/14001005000045/مسی-بالاتر-از-رونالدو-ایستاد-عکس
https://www.farsnews.ir/news/14001116000335/آمار-خیره‌کننده-رونالدو-به-بهانه-تولد-3-سالگی--مرور-زیباترین-گل‌های
----------------------------
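
To put a number on how similar the tf-idf ranking is to another ranking (for example your phase 2 results), one option is the overlap of the top-k URLs. The sketch below is one such measure, not a prescribed evaluation; `overlap_at_k` is a hypothetical helper and the URL lists are placeholders.

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Fraction of the top-k items of ranking_a that also appear in the top-k of ranking_b."""
    top_a = ranking_a[:k]
    top_b = set(ranking_b[:k])
    if not top_a:
        return 0.0
    return sum(1 for item in top_a if item in top_b) / len(top_a)


# Placeholder rankings; in the notebook these would be lists of doc['_source']['url']
tfidf_urls = ["url-1", "url-2", "url-3", "url-4"]
other_urls = ["url-2", "url-1", "url-5", "url-6"]

overlap = overlap_at_k(tfidf_urls, other_urls, k=4)
print(overlap)
```

A value of 1.0 means both systems return the same top-k set and 0.0 means they share nothing; the measure ignores the order within the top k.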
