This notebook has almost all the backend code you need to run an interface for evaluating whether the documents returned by More Like This are relevant to your solution.

The Azure Services we are using in this case:
* Azure Search
* Azure Storage Account
    * Containers
    * Tables

The code below can be implemented directly in a WebApp or in Azure Functions; it's up to you!

Basic functionality:

The code runs a search in Azure Search and retrieves 20% of all the documents, which are used as the baseline to test the More Like This API.

In the code below I explain it.

First thing: the imports:

In [None]:
import json
import requests
import os
import random
import datetime
import pandas as pd

In the next section we create the string that allows us to connect to the Azure Search service and define the values used throughout the code.

The variables defined in the following section are self-explanatory, but some are worth mentioning:

* searchSvc: the search service you created in Azure;
* headers { api-key }: the key for the search service;
* searchFields: the field you want More Like This to use when it searches the other documents;
* totalDocuments: the total number of documents in your Storage Account;
* userName: used as PartitionKey in Azure Table;
* rowKey: the RowKey in Azure Table.

In [None]:
searchSvc = os.environ['SEARCH_SVC']
endpoint = 'https://'+ searchSvc + '.search.windows.net/'
apiVersion = '?api-version=2019-05-06-Preview'
headers = {'Content-Type': 'application/json',
           'api-key': os.environ['SEARCH_SUBSCRIPTION_KEY']}
searchFields = 'merged_content'
numMoreLikeThisDocs = '3'

totalDocuments = 50
numDocsToSelect = int(totalDocuments*0.2)
userName = 'Squassina'

rowKey = str(datetime.date.today())

This section retrieves the available indexes. By default, I assume there is only one index in the service.

In [None]:
url = endpoint + "indexes" + apiVersion + "&$select=name"
response  = requests.get(url, headers=headers)
indexList = response.json()
indexName = indexList['value'][0]['name']
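
If your service has more than one index, taking the first entry may pick the wrong one. The sketch below shows one way to choose an index by name from the response body. The helper `pick_index_name` and its `preferred` argument are hypothetical names, not part of the original notebook, and the sample payload only mimics the `value` / `name` shape used above.

```python
def pick_index_name(index_list, preferred=None):
    """Return an index name from a parsed 'list indexes' response body."""
    # Collect every index name found under the 'value' key.
    names = [item['name'] for item in index_list.get('value', [])]
    if not names:
        raise ValueError('The search service returned no indexes.')
    if preferred is None:
        # Same behaviour as the notebook: fall back to the first index.
        return names[0]
    if preferred not in names:
        raise ValueError('Index not found: ' + preferred)
    return preferred


# Made-up payload for illustration only.
sample = {'value': [{'name': 'docs-index'}, {'name': 'other-index'}]}
print(pick_index_name(sample))
print(pick_index_name(sample, preferred='other-index'))
```

In the notebook you would call it as `pick_index_name(response.json())`.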

This function selects the documents I'm using to evaluate the service.

I pick a random integer (a) to select the document I will use to test whether More Like This is returning relevant documents.

I then use Azure Search to retrieve the first document after skipping (a) documents, selecting only a few fields to speed up the process.

In [None]:
def select_source_documents():
    random.seed()
    randDoc = random.randint(0, totalDocuments - 1)  # randint is inclusive; skipping totalDocuments would return no document
    url = endpoint + 'indexes/' + indexName + '/docs' + apiVersion + '&$select=Id,metadata_storage_path&$top=1&$skip=' + str(randDoc)
    response  = requests.get(url, headers=headers)
    return(response.json())
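
Because each call draws its own random number, the same document can be selected more than once. A minimal sketch of one alternative is to draw all the skip offsets up front without replacement. The helper `pick_skip_offsets` and its `seed` argument are hypothetical additions, not part of the original notebook.

```python
import random


def pick_skip_offsets(total_documents, fraction=0.2, seed=None):
    """Return distinct $skip offsets covering `fraction` of the documents."""
    rng = random.Random(seed)  # a local generator keeps the draw reproducible when seeded
    count = int(total_documents * fraction)
    # sample() never repeats a value, so every offset points at a different document.
    return rng.sample(range(total_documents), count)


offsets = pick_skip_offsets(50, seed=1)
print(len(offsets), len(set(offsets)))
```

Each offset could then be passed as the `$skip` value instead of calling `random.randint` inside the function.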

This function retrieves the documents returned by More Like This so that I can evaluate them. Two key parameters are worth noticing:
* moreLikeThis; and
* searchFields

According to the [documentation](https://docs.microsoft.com/en-us/azure/search/search-more-like-this): 
> ` moreLikeThis=[key] ` is a query parameter in the Search Documents API that finds documents similar to the document specified by the document key.

In [None]:
def more_like_this_docs(odocId: str):
    # moreLikeThis and searchFields take no '$' prefix; only OData options such as $top do
    url = endpoint + 'indexes/' + indexName + '/docs' + apiVersion + '&moreLikeThis='+ odocId + '&searchFields=' + searchFields + '&$top=' + numMoreLikeThisDocs
    response  = requests.get(url, headers=headers)
    return(response.json())
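
Document keys can contain characters that are not safe in a URL, so plain string concatenation may produce a broken request. The sketch below builds the same query with `urllib.parse` so that the key is percent-encoded. The function name `build_mlt_url` and the example host are hypothetical; the parameter names and API version are the ones used in this notebook.

```python
from urllib.parse import quote, urlencode


def build_mlt_url(endpoint, index_name, doc_key, search_fields, top,
                  api_version='2019-05-06-Preview'):
    """Build a More Like This query URL with an encoded document key."""
    query = {
        'api-version': api_version,
        'moreLikeThis': doc_key,
        'searchFields': search_fields,
        '$top': top,
    }
    # safe='$,' keeps the OData '$' prefix and comma-separated field lists readable.
    return (endpoint + 'indexes/' + quote(index_name, safe='') + '/docs?'
            + urlencode(query, safe='$,'))


# Placeholder host and key, for illustration only.
print(build_mlt_url('https://example.search.windows.net/', 'my-index',
                    'a/b', 'merged_content', 3))
```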

The next two cells select the source documents and collect the More Like This results to be tested.

In [None]:
sourceDocList = {}

for i in range(numDocsToSelect):
    sourceDocList[i] = (select_source_documents())

In [None]:
mlt = {}

for i in range(len(sourceDocList)):
    mlt[i] = more_like_this_docs(sourceDocList[i]['value'][0]['Id'])

I send everything to a Pandas DataFrame before storing it, for better visualization, since I am working with a very small set of documents. You may skip this section and move to Store data in Azure Storage Account / Table.

In [None]:
docsDf = pd.DataFrame(columns = ['RowKey','PartitionKey',
                                  'source_id','source_stg_path',
                                  'MLT_1_id','MLT_1_stg_path','MLT_1_score',
                                  'MLT_2_id','MLT_2_stg_path','MLT_2_score',
                                  'MLT_3_id','MLT_3_stg_path','MLT_3_score',
                                  'EVAL_1', 'EVAL_2', 'EVAL_3','DCG'] )

Notice that there are 4 additional fields I created above that I'm not using below: `EVAL_1, EVAL_2, EVAL_3, DCG`. These fields are filled in during the evaluation in the UI: one evaluation for each document returned by More Like This, plus the final relevance calculation.

In [None]:
docsDf.RowKey = [rowKey + ' - ' + str(x) for x in range(len(sourceDocList))]
docsDf.PartitionKey = [userName for x in range(len(sourceDocList))]
docsDf.source_id = [sourceDocList[i]['value'][0]['Id'] for i in range(len(sourceDocList))]
docsDf.source_stg_path = [sourceDocList[i]['value'][0]['metadata_storage_path'] for i in range(len(sourceDocList))]
docsDf.MLT_1_id = [mlt[i]['value'][0]['Id'] for i in range(len(mlt))]
docsDf.MLT_1_stg_path = [mlt[i]['value'][0]['metadata_storage_path'] for i in range(len(mlt))]
docsDf.MLT_1_score = [mlt[i]['value'][0]['@search.score'] for i in range(len(mlt))]
docsDf.MLT_2_id = [mlt[i]['value'][1]['Id'] for i in range(len(mlt))]
docsDf.MLT_2_stg_path = [mlt[i]['value'][1]['metadata_storage_path'] for i in range(len(mlt))]
docsDf.MLT_2_score = [mlt[i]['value'][1]['@search.score'] for i in range(len(mlt))]
docsDf.MLT_3_id = [mlt[i]['value'][2]['Id'] for i in range(len(mlt))]
docsDf.MLT_3_stg_path = [mlt[i]['value'][2]['metadata_storage_path'] for i in range(len(mlt))]
docsDf.MLT_3_score = [mlt[i]['value'][2]['@search.score'] for i in range(len(mlt))]
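
The column assignments above assume that More Like This always returns at least three documents; with fewer results, indexing `['value'][2]` raises an `IndexError`. A minimal sketch of a more defensive approach is to flatten one source document and its results into a single row, leaving missing ranks as `None`. The helper `flatten_result` is a hypothetical name, and the sample payloads are made up to mirror the fields used in this notebook.

```python
def flatten_result(source, mlt_result, row_key, partition_key, n=3):
    """Flatten one source document and up to n More Like This hits into a row."""
    doc = source['value'][0]
    row = {
        'RowKey': row_key,
        'PartitionKey': partition_key,
        'source_id': doc['Id'],
        'source_stg_path': doc['metadata_storage_path'],
    }
    hits = mlt_result.get('value', [])
    for rank in range(1, n + 1):
        # Use an empty dict when fewer than n documents came back.
        hit = hits[rank - 1] if rank <= len(hits) else {}
        row['MLT_%d_id' % rank] = hit.get('Id')
        row['MLT_%d_stg_path' % rank] = hit.get('metadata_storage_path')
        row['MLT_%d_score' % rank] = hit.get('@search.score')
    return row


# Made-up payloads with a single hit, for illustration only.
source = {'value': [{'Id': 's1', 'metadata_storage_path': 'p/s1'}]}
result = {'value': [{'Id': 'm1', 'metadata_storage_path': 'p/m1', '@search.score': 0.9}]}
print(flatten_result(source, result, '2020-01-01 - 0', 'Squassina'))
```

A list of such rows can be passed straight to `pd.DataFrame(rows)`.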


In [None]:
docsDf.head()

In [None]:
docsDf.to_csv('./docs_to_be_evaluated.csv',index=False)

# Store data in Azure Storage Account / Table

I don't want to work only in memory and lose everything I have done so far, so I store the results of the code above in Azure Tables.

The imports for this section of the code are for Azure Storage. If you need to install it, the command is: 

`pip install azure-storage`

In [None]:
from azure.storage import CloudStorageAccount
from azure.storage.table import TableService, Entity

In [None]:
docsDict = pd.read_csv('./docs_to_be_evaluated.csv').to_dict('records')

Here I'm setting the account by reading the Storage Account Name and Key.

In [None]:
accountName = os.environ['STORAGE_ACCOUNT_NAME']

accountKey = os.environ['STORAGE_ACCOUNT_KEY']

account = CloudStorageAccount(accountName, accountKey)


This is the table service. I am using a table called `MLTEval` to store the data from the first part of this solution. 

In [None]:
tableName = 'MLTEval'
tableService = account.create_table_service()

For educational purposes only, I'm leaving the code to delete and create the table here. It is not needed after the first run, because we keep the stored data so that, once users start using the system, we can check whether the retrieved documents are still relevant or whether something needs adjusting.

In [None]:
#tableService.delete_table(tableName)

`create_table` will return `True` if the table was created, or `False` if the table already exists.

In [None]:
tableService.create_table(tableName)

Here I'm inserting the data I collected in the first section.

In [None]:
for i in range(len(docsDict)):
    tableService.insert_or_replace_entity(tableName, docsDict[i])


# Calculate Discounted Cumulative Gain (DCG)

This section calculates DCG. I didn't write it; the code is available at the URL indicated in the comments.

In [None]:
import numpy as np  # imported at module level so the string below is the function's first statement


def dcg_at_k(r, k, method=0):
    # https://gist.github.com/bwhite/3726239
    """Score is discounted cumulative gain (dcg)
    Relevance is positive real values.  Can use binary
    as the previous methods.
    Example from
    http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
    >>> r = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
    >>> dcg_at_k(r, 1)
    3.0
    >>> dcg_at_k(r, 1, method=1)
    3.0
    >>> dcg_at_k(r, 2)
    5.0
    >>> dcg_at_k(r, 2, method=1)
    4.2618595071429155
    >>> dcg_at_k(r, 10)
    9.6051177391888114
    >>> dcg_at_k(r, 11)
    9.6051177391888114
    Args:
        r: Relevance scores (list or numpy) in rank order
            (first element is the first item)
        k: Number of results to consider
        method: If 0 then weights are [1.0, 1.0, 0.6309, 0.5, 0.4307, ...]
                If 1 then weights are [1.0, 0.6309, 0.5, 0.4307, ...]
    Returns:
        Discounted cumulative gain
    """
    r = np.asarray(r, dtype=float)[:k]  # np.asfarray was removed in NumPy 2.0
    if r.size:
        if method == 0:
            return r[0] + np.sum(r[1:] / np.log2(np.arange(2, r.size + 1)))
        elif method == 1:
            return np.sum(r / np.log2(np.arange(2, r.size + 2)))
        else:
            raise ValueError('method must be 0 or 1.')
    return 0.
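
To make the `method=1` weighting concrete: each relevance score at rank i (starting at 1) is divided by log2(i + 1), and the results are summed. The sketch below reimplements only that variant in plain Python so the arithmetic is easy to follow. The name `dcg_method1` is hypothetical; it is an illustration, not a replacement for the function above.

```python
import math


def dcg_method1(relevances, k):
    """DCG at k where rank i (1-based) is weighted by 1 / log2(i + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


# Three evaluations, as in this notebook: 3/log2(2) + 2/log2(3) + 3/log2(4)
print(dcg_method1([3, 2, 3], 3))
```

For the first two scores this gives roughly 4.26, which matches the `dcg_at_k(r, 2, method=1)` doctest above.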

The function below will take partitionKey and rowKey as arguments.

In [None]:
def retrieve_data_azure_table(partitionKey : str, rowKey : str):
    # Only the imports this function actually uses
    import os
    from azure.storage import CloudStorageAccount

    accountName = os.environ['STORAGE_ACCOUNT_NAME']
    
    accountKey = os.environ['STORAGE_ACCOUNT_KEY']
    
    account = CloudStorageAccount(accountName, accountKey)

    filter = "PartitionKey eq '" + partitionKey + "' and RowKey gt '" + rowKey + "'"

    tableService = account.create_table_service()
    
    return(tableService.query_entities(table_name = tableName, filter = filter))
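
The filter string is built by concatenation, so a partition key containing a single quote (for example a user name such as O'Brien) would break the query. In OData string literals a single quote is escaped by doubling it. The sketch below shows one way to build the same filter safely; `build_table_filter` is a hypothetical helper name.

```python
def build_table_filter(partition_key, row_key):
    """Build the PartitionKey / RowKey filter with quotes escaped."""
    def escape(value):
        # OData string literals escape a single quote by doubling it.
        return value.replace("'", "''")

    return "PartitionKey eq '{}' and RowKey gt '{}'".format(
        escape(partition_key), escape(row_key))


print(build_table_filter('Squassina', '2020-01-01'))
print(build_table_filter("O'Brien", '2020-01-01'))
```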
    

I'm using the userName as partitionKey in this example.

In [None]:
docsDf = retrieve_data_azure_table(partitionKey = userName, rowKey = rowKey)

Running DCG on the data as is will return NaN, as we don't have any evaluations yet.

In [None]:
k = 3
DCG = []
for row in docsDf:            
    dcg = dcg_at_k( [row['EVAL_1'], row['EVAL_2'], row['EVAL_3']], k, 1 )
    DCG.append(dcg)
    
dcgDf = pd.DataFrame( data = {'DCG':DCG} )
print(dcgDf)
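
If you prefer to skip rows that have not been evaluated rather than carry NaN values through the calculation, a small check like the one below may help. It is a sketch that assumes each row behaves like a dictionary; the helper name `is_evaluated` is hypothetical.

```python
import math


def is_evaluated(row, fields=('EVAL_1', 'EVAL_2', 'EVAL_3')):
    """Return True only if every evaluation field holds a real number."""
    for name in fields:
        value = row.get(name)
        # Treat both a missing field and a NaN placeholder as "not evaluated".
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return False
    return True


print(is_evaluated({'EVAL_1': 1, 'EVAL_2': 0, 'EVAL_3': 4}))
print(is_evaluated({'EVAL_1': float('nan'), 'EVAL_2': 0, 'EVAL_3': 4}))
```

In the loop above, a guard such as `if is_evaluated(row):` before calling `dcg_at_k` would leave unevaluated rows out of the DCG list.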

Let's add some data to EVAL_* fields

In [None]:
for row in docsDf:            
    row['EVAL_1'] = random.randint(0,4)
    row['EVAL_2'] = random.randint(0,4)
    row['EVAL_3'] = random.randint(0,4)

Now we will have data to calculate DCG

In [None]:
k = 3
DCG = []
for row in docsDf:            
    dcg = dcg_at_k( [row['EVAL_1'], row['EVAL_2'], row['EVAL_3']], k, 1 )
    row['DCG'] = str(dcg)
    DCG.append(dcg)
    
dcgDf = pd.DataFrame( data = {'DCG':DCG} )
print(dcgDf)

Let's save everything to Azure Tables:

In [None]:
for item in docsDf.items:
    tableService.insert_or_replace_entity(tableName, item)