# Implementation of a search engine based on sBERT

This notebook contains a basic implementation of sBERT for searching a database of sentences with queries.

The goal is to increase the amount of labeled data available so that we can later fine-tune a model for sentence classification. First, we need a pool of queries that represent the six labels of the six policy instruments. With these queries we can retrieve a set of sentences that can be automatically labeled with the label of the query that retrieved them. In this way we increase the diversity of labeled sentences in each label category. This approach will be complemented with a manual curation step to produce a high-quality training data set.

The policy instruments that we want to find and that correspond to the different labels are:
* Direct payment (PES)
* Tax deduction
* Credit/guarantee
* Technical assistance
* Supplies
* Fines

This notebook is intended for the following purposes:
* Try different query strategies to find the optimal retrieval of sentences in each policy instrument category
* Try different transformers
* Be the starting point for further enhancements

## Import modules

This notebook is self-contained: it does not depend on any other class in the sBERT folder.

You just have to create an environment and install the external dependencies. The dependencies that you usually have to install are:

**For the basic sentence similarity calculation**
*  pandas
*  boto3
*  pytorch
*  sentence_transformers

**If you want to use ngrams to generate queries**
*  nltk
*  plotly
*  wordcloud

**If you want to do evaluation and plotting with pyplot**
*  matplotlib

In [None]:
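# If your environment is called nlp, execute this cell; otherwise change the name of the environment.
# Note that "!" runs the command in a temporary subshell, so the environment should also be active when the notebook kernel is started.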
!conda activate nlp

In [9]:
# General purpose libraries
import numpy as np
import pandas as pd
import boto3
import json
import csv
import time
import copy
from pathlib import Path
import re

In [None]:
# Model libraries
from sentence_transformers import SentenceTransformer
from scipy.spatial import distance

# Libraries for model evaluation
import matplotlib.pyplot as plt  # needed below for the histogram of words per sentence
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import accuracy_score
# from sklearn.metrics import confusion_matrix

# Libraries to be used in the process of defining queries
import nltk # imports the natural language toolkit
import plotly
from wordcloud import WordCloud
from collections import Counter
from nltk.util import ngrams


from json import JSONEncoder

class NumpyArrayEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return JSONEncoder.default(self, obj)

## Accessing documents in S3

All documents from El Salvador have been preprocessed and their contents saved in a JSON file. In the JSON file there are the sentences of interest.

Use the json file with the key and password to access the S3 bucket if necessary. 
If not, skip this section and use files in a local folder. 

In [2]:
# If you want to keep the credentials in a local folder out of GitHub, you can change the path to adapt it to your needs.
# Please, comment out other users lines and set your own
# path = "C:/Users/jordi/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/credentials/" # Jordi's local path in desktop
path = "C:/Users/user/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/credentials/" # Jordi's local path in laptop
# path = ""
#If you put the credentials file in the same "notebooks" folder then you can use the following path
# path = ""
filename = "Omdena_key_S3.json"
file = path + filename
with open(file, 'r') as f:  # "f" instead of "dict", which would shadow the built-in type
    key_dict = json.load(f)

In [3]:
for key in key_dict:
    KEY = key
    SECRET = key_dict[key]

In [4]:
s3 = boto3.resource(
    service_name = 's3',
    region_name = 'us-east-2',
    aws_access_key_id = KEY,
    aws_secret_access_key = SECRET
)

### Loading the sentence database

In [5]:
filename = 'JSON/Chile.json'

obj = s3.Object('wri-latin-talent',filename)
serializedObject = obj.get()['Body'].read()
policy_list = json.loads(serializedObject)

### Building a list of potentially relevant sentences

Before going through the dictionary to retrieve sentences, we define a function to reduce the number of sentences in the final "sentences" dictionary. This is just for testing purposes: running the sentence embedding function takes time, so for initial tests we can work with a smaller dataset.

The variable "slim_by" is the reduction factor. If it is set to 1, there is no reduction and we work with the full dataset. If it is set to 2, we take one out of every two sentences, and so on.
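The effect of the reduction factor can be illustrated with a minimal sketch. The helper `keep_every_nth` and the toy sentences are hypothetical and only mirror the modulo rule used by "slim_dict"; they are not part of the notebook's pipeline.

```python
def keep_every_nth(items, slim_factor):
    """Keep one item out of every `slim_factor` items (hypothetical helper)."""
    kept = []
    for count, item in enumerate(items, start=1):
        # Same rule as slim_dict: keep the item when the running count
        # is a multiple of the reduction factor
        if count % slim_factor == 0:
            kept.append(item)
    return kept


# Toy data, invented for illustration
toy_sentences = [f"sentence {i}" for i in range(1, 11)]

full = keep_every_nth(toy_sentences, 1)   # no reduction
half = keep_every_nth(toy_sentences, 2)   # one out of every two
third = keep_every_nth(toy_sentences, 3)  # one out of every three

print(len(full), len(half), len(third))
```

With a factor of 1 all ten toy sentences survive, while factors of 2 and 3 keep five and three of them respectively.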

<span style="color:red"><strong>REMEMBER</strong></span> that you have to re-run the function "get_sentences_dict" with the "slim_by" variable set to 1 when you want to do the final run.

In [6]:
def slim_dict(counter, slim_factor):
    # Shrink the sentences dict by a user-set factor: keep only one sentence every "slim_factor"
    return counter % slim_factor == 0

def sentence_length_filter(sentence_text, minLength, maxLength):
    # Only the lower bound is applied; the upper bound (len(sentence_text) < maxLength) is currently disabled
    return len(sentence_text) > minLength

def get_sentences_dict(docs_dict, is_not_incentive_dict, slim_factor, minLength, maxLength):
    count = 0
    result = {}
    for key, value in docs_dict.items():
        for item in value:
            # Skip the sections that are known not to contain incentives
            if item in is_not_incentive_dict:
                continue
            for sentence_id, sentence in value[item]['sentences'].items():
                if sentence_length_filter(sentence["text"], minLength, maxLength):
                    count += 1
                    # Use the "slim_factor" argument instead of the global variable "slim_by"
                    if slim_dict(count, slim_factor):
                        result[sentence_id] = sentence
    return result

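Here you will run the function to get your sentences in a dictionary of this form:

{"\<sentence id\>" : {"text" : "\<text of the sentence\>"}}.

Each value is the sentence entry as stored in the JSON file, so it may carry other fields besides "text".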
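As a rough sketch of what the flattening does, the snippet below walks a toy document dictionary with the same nesting that "get_sentences_dict" expects (document, section, "sentences", sentence id). The document ids, section names and texts are invented for illustration, and the "labels" field is assumed from the sentence format described later in this notebook.

```python
# Toy input with the nesting document -> section -> "sentences" -> sentence id.
# All ids and texts are invented for illustration.
toy_policies = {
    "doc_a": {
        "HEADING": {
            "sentences": {"doc_a_0": {"text": "Ley forestal", "labels": []}}
        },
        "ARTICULO 1": {
            "sentences": {
                "doc_a_1": {
                    "text": "Se otorgará un bono a los pequeños propietarios forestales.",
                    "labels": [],
                },
                "doc_a_2": {"text": "Texto corto", "labels": []},
            }
        },
    },
}

skip_sections = {"HEADING": 0}  # sections that never contain incentives
min_length = 20                 # minimum number of characters to keep a sentence

flat = {}
for doc_id, sections in toy_policies.items():
    for section_name, section in sections.items():
        if section_name in skip_sections:
            continue
        for sentence_id, sentence in section["sentences"].items():
            # Keep only sentences that are long enough to be informative
            if len(sentence["text"]) > min_length:
                flat[sentence_id] = sentence

print(flat)
```

Only "doc_a_1" survives: the heading section is skipped and "doc_a_2" is shorter than the minimum length.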

In [7]:
# is_not_incentive = {"CONSIDERANDO:" : 0,
#                     "POR TANTO" : 0,
#                     "DISPOSICIONES GENERALES" : 0,
#                     "OBJETO" : 0,
#                     "COMPETENCIA, PROCEDIMIENTOS Y RECURSOS." : 0}
is_not_incentive = {"CONSIDERANDO:" : 0,
                    "POR TANTO" : 0,
                    "DISPOSICIONES GENERALES" : 0,
                    "OBJETO" : 0,
                    "COMPETENCIA, PROCEDIMIENTOS Y RECURSOS." : 0,
                   "VISTO" : 0,
                   "HEADING" : 0}

slim_by = 1 # REMEMBER to set this variable to the desired value.
min_length = 50 # Just to avoid short sentences which might be fragments or headings without a lot of value
max_length = 250 # Just to avoid long sentences which might be artifacts or long legal jargon separated by semicolons

sentences = get_sentences_dict(policy_list, is_not_incentive, slim_by, min_length, max_length)


In [None]:
# Just to check if the results look ok
print("In this data set there are {} policies and {} sentences".format(len(policy_list),len(sentences)))
# for sentence in sentences:
#     print(sentences[sentence]['text'])


## Defining Queries

### N-grams approach

In the following lines, we use the Excel file with the selected phrases of each country, process them and extract n-grams to define basic queries for the sBERT model.

In [None]:
data = pd.read_excel(r'WRI_Policy_Tags (1).xlsx', sheet_name = None)

# With sheet_name = None, read_excel returns a dictionary {sheet name: DataFrame}.
# DataFrame.append was removed in pandas 2.0, so the sheets are stacked with pd.concat.
if isinstance(data, dict):
    df = pd.concat(list(data.values()))
else:
    df = data
df.head()

In [None]:
tagged_sentences = df["relevant sentences"].apply(lambda x: x.split(";") if isinstance(x,str) else x)
tagged_sentence = []

for elem in tagged_sentences:
    if isinstance(elem,float) or len(elem) == 0:
        continue
    elif isinstance(elem,list):
        for i in elem:
            if len(i.strip()) == 0:
                continue
            else:
                tagged_sentence.append(i.strip())
    else:
        if len(elem.strip()) == 0:
            continue
        else:
            tagged_sentence.append(elem.strip())

tagged_sentence
words_per_sentence = [len(x.split(" ")) for x in tagged_sentence]
plt.hist(words_per_sentence, bins = 50)
plt.title("Histogram of number of words per sentence")

In [None]:
def top_k_ngrams(word_tokens,n,k):
    
    ## Getting them as n-grams
    n_gram_list = list(ngrams(word_tokens, n))

    ### Getting each n-gram as a separate string
    n_gram_strings = [' '.join(each) for each in n_gram_list]
    
    n_gram_counter = Counter(n_gram_strings)
    most_common_k = n_gram_counter.most_common(k)
    print(most_common_k)

noise_words = []
stopwords_corpus = nltk.corpus.stopwords
sp_stop_words = stopwords_corpus.words('spanish')
noise_words.extend(sp_stop_words)
print(len(noise_words))

if "no" in noise_words:
    noise_words.remove("no")

# Join with a space so that words from adjacent sentences are not glued together
tokenized_words = nltk.word_tokenize(' '.join(tagged_sentence))
word_freq = Counter(tokenized_words)
# word_freq.most_common(20)
# list(ngrams(tokenized_words, 3))

word_tokens_clean = [re.findall(r"[a-zA-Z]+",each) for each in tokenized_words if each.lower() not in noise_words and len(each.lower()) > 1]
word_tokens_clean = [each[0].lower() for each in word_tokens_clean if len(each)>0]

We define the size of the n-gram that we want to find. The larger it is, the less frequent it will be, unless we substantially increase the number of phrases.
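The trade-off between n-gram size and frequency can be seen in a small sketch. It uses a plain-Python stand-in for `nltk.util.ngrams` (the helper `count_ngrams` is hypothetical) and a toy token list invented for illustration.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count contiguous n-grams in a list of tokens (hypothetical helper)."""
    # zip over n shifted copies of the token list yields the n-gram tuples
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Toy tokens, invented for illustration (already lowercased and accent-free)
toy_tokens = (
    "asistencia tecnica forestal asistencia tecnica agricola "
    "asistencia tecnica forestal"
).split()

bigrams = count_ngrams(toy_tokens, 2)
trigrams = count_ngrams(toy_tokens, 3)

print(bigrams.most_common(1))
print(trigrams.most_common(1))
```

In this toy list the most frequent bigram appears three times, while the most frequent trigram appears only twice.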

In [None]:
n_grams = 2

top_k_ngrams(word_tokens_clean, n_grams, 20)

### Building queries with Parts-Of-Speech

The following functions take a specific word and find the next or previous words according to the POS tags.
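The idea can be sketched without spaCy by working on tokens that already carry a POS tag. This is a simplified variant, not the notebook's implementation: the helpers `next_content_words` and `prev_content_words` are hypothetical, and the tags in the toy phrase were assigned by hand, so a real tagger may disagree with them.

```python
CONTENT_TAGS = {"ADJ", "ADV", "NOUN", "NUM", "VERB", "AUX"}


def next_content_words(tagged_tokens, word, num_words, match=CONTENT_TAGS):
    """Return up to `num_words` content words that follow `word`."""
    words = [token for token, _ in tagged_tokens]
    if word not in words:
        return []
    start = words.index(word) + 1
    following = [token for token, tag in tagged_tokens[start:] if tag in match]
    return following[:num_words]


def prev_content_words(tagged_tokens, word, num_words, match=CONTENT_TAGS):
    """Return up to `num_words` content words that precede `word`, in text order."""
    words = [token for token, _ in tagged_tokens]
    if word not in words or num_words <= 0:
        return []
    end = words.index(word)
    preceding = [token for token, tag in tagged_tokens[:end] if tag in match]
    return preceding[-num_words:]


# Hand-tagged toy phrase (tags are an assumption made for illustration)
tagged = [
    ("otorgar", "VERB"), ("créditos", "NOUN"), ("blandos", "ADJ"),
    ("a", "ADP"), ("pequeños", "ADJ"), ("productores", "NOUN"),
]

print(next_content_words(tagged, "créditos", 2))
print(prev_content_words(tagged, "productores", 3))
```

The preposition "a" is skipped in both directions because its tag is not in the list of content tags.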

An example is shown below with the text: <br>

text = "Generar empleo y garantizara la población campesina el bienestar y su participación e incorporación en el desarrollo nacional, y fomentará la actividad agropecuaria y forestal para el óptimo uso de la tierra, con obras de infraestructura, insumos, créditos, servicios de capacitación y asistencia técnica" <br>

next_words(text, "empleo", 3) <br>
prev_words(text, "garantizara", 6) <br>

Will return: <br>

>['garantizara', 'población', 'campesina'] <br>
>['Generar', 'empleo']

In [None]:
import es_core_news_md  # spaCy Spanish model package; it must be installed in the environment

nlp = es_core_news_md.load()

def ExtractInteresting(sentence, match = ["ADJ","ADV", "NOUN", "NUM", "VERB", "AUX"]):
    words = nltk.word_tokenize(sentence)
#     interesting = [k for k,v in nltk.pos_tag(words) if v in match]
    doc = nlp(sentence)
    interesting = [k.text for k in doc if k.pos_ in match]
    return(interesting)

def next_words(sentence, word, num_words, match = ["ADJ","ADV", "NOUN", "NUM", "VERB", "AUX"]):

    items = list()
    doc = nlp(sentence)
    text = [i.text for i in doc]

    if word not in text: return ""
    
    idx = text.index(word)
    for num in range(num_words):
        
        pos_words = [k.text for k in doc[idx:] if k.pos_ in match]
        if len(pos_words) > 1: 
            items.append(pos_words[1])
            idx = text.index(pos_words[1])
    
    return items
    
def prev_words(sentence, word, num_words, match = ["ADJ","ADV", "NOUN", "NUM", "VERB", "AUX"]):
    
    items = list()
    doc = nlp(sentence)
    text = [i.text for i in doc]

    if word not in text: return ""
    
    idx = text.index(word)
    for num in range(num_words):
        pos_words = [k.text for k in doc[:idx] if k.pos_ in match]
        if len(pos_words) >= 1: 
            items.insert(0, pos_words[-1]) #Add element in order and take the last element since it is the one before the word
            idx = text.index(pos_words[-1])
    
    return items

### Keyword approach

In [10]:
# Regular expression to find incentive policy instruments
keywords = re.compile(r'asistencia tecnica|ayudas?\b|\bbonos?\b|creditos?\b|incentivos?\b|insumos?\b|multas?\b')
# deducciones?\b|devoluciones?\b|
# Function to replace accented characters with their non-accented counterparts. It depends on the dictionary "accents_dict"
accents_out = re.compile(r'[áéíóúÁÉÍÓÚ]')
accents_dict = {"á":"a","é":"e","í":"i","ó":"o","ú":"u","Á":"A","É":"E","Í":"I","Ó":"O","Ú":"U"}
def remove_accents(string):
    for accent in accents_out.findall(string):
        string = string.replace(accent, accents_dict[accent])
    return string
# Dictionary to merge variants of a word
families = {
    "asistencia tecnica" : "asistencia técnica",
    "ayuda" : "ayuda",
    "ayudas" : "ayuda",
    "bono" : "bono",
    "bonos" : "bono",
    "credito":  "crédito",
    "creditos":  "crédito",
#     "deduccion" : "deducción",
#     "deducciones" : "deducción",
#     "devolucion" : "devolución",
#     "devoluciones" : "devolución",
    "incentivo" : "incentivo",
    "incentivos" : "incentivo",
    "insumo" : "insumo",
    "insumos" : "insumo",
    "multa" : "multa",
    "multas" : "multa"
}

In [11]:
keyword_in_sentences = []
            
for sentence in sentences:
    line = remove_accents(sentences[sentence]['text'])
    hit = keywords.search(line)
    if hit:
        keyword = hit.group(0).rstrip().lstrip()
        keyword_in_sentences.append([families[keyword], sentence, sentences[sentence]['text']])             

In [15]:
### print(len(keyword_in_sentences))
# keyword_in_sentences = sorted(keyword_in_sentences, key = lambda x : x[0])
# df_keyword_in_sentences = pd.DataFrame(keyword_in_sentences)

# path = "../output/"
# filename = "keywords_match_labeling.csv"
# file = path + filename

# df_keyword_in_sentences.to_csv(file)

# print(keyword_in_sentences[0:20])
filtered = [row for row in keyword_in_sentences if row[0] == "asistencia técnica"]
filtered

[['asistencia técnica',
  '02076c3_2',
  'Además, debe permitir la recepción y traslado de representantes de países y organismos internacionales, en materia de asistencia técnica y cooperación financiera internacional silvoagropecuaria, como así también a la Ministra y el Subsecretario, que significan desplazamientos fuera de la jornada de trabajo, lo que hace necesario que el vehículo de esta Seremi pueda circular sin restricción de horarios'],
 ['asistencia técnica',
  '0982088_23',
  'El valor de la asesoría profesional está desagregada en $17,945/ha por concepto de asistencia técnica en terreno y $10,539/ha por concepto de elaboración de estudios técnicos'],
 ['asistencia técnica',
  '0982088_39',
  'El valor de la asesoría profesional está desagregada en $ 14,356/ha por concepto de asistencia técnica en terreno y $ 8,432/ha por concepto de elaboración de estudios técnicos'],
 ['asistencia técnica',
  '0982088_153',
  'El valor de la asesoría profesional está desagregada en $17,945

In [None]:
# Count the matching sentences for each keyword family
for value in sorted(set(families.values())):
    print(value, "--", len([row for row in keyword_in_sentences if row[0] == value]))
    

In [None]:
incentives = {}

for incentive in families:
    incentives[families[incentive]] = 0
    
incentives

## Initializing the model

First, we import the sBERT model. Several transformers are available and documentation is here: https://github.com/UKPLab/sentence-transformers <br>

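Then we build a simple function, "highlight", that takes six inputs:
1. The name of the transformer, used to select the matching set of embeddings
2. The model as we have set it in the previous line of code
3. The precomputed sentence embeddings, in the form {"\<transformer name\>" : {"\<sentence_ID\>" : \<sentence embedding\>}}
4. A dictionary that contains the sentences {"\<sentence_ID\>" : {"text" : "The actual sentence", "labels" : []}}
5. A query in the form of a string
6. A similarity threshold. It is a float that we can use to limit the results list to the most relevant sentences.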

The output of the function is a list with three columns with the following content:
1. Column 1 contains the id of the sentence
2. Column 2 contains the similarity score
3. Column 3 contains the text of the sentence that has been compared with the query
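The ranking step can be sketched in plain Python with toy two-dimensional vectors instead of real sBERT embeddings. The helpers `cosine_similarity` and `rank_sentences` are hypothetical and only illustrate how the three output columns are produced; the vectors and texts are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_sentences(sentence_vectors, sentences, query_vector, threshold):
    """Return [sentence id, score, text] rows above the threshold, best first."""
    rows = []
    for sentence_id, vector in sentence_vectors.items():
        score = cosine_similarity(vector, query_vector)
        if score > threshold:
            rows.append([sentence_id, score, sentences[sentence_id]["text"]])
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy data: the vectors stand in for sentence embeddings
toy_sentences = {
    "s1": {"text": "Se concede un crédito"},
    "s2": {"text": "Se impone una multa"},
    "s3": {"text": "Disposiciones generales"},
}
toy_vectors = {"s1": [1.0, 0.0], "s2": [0.6, 0.8], "s3": [0.0, 1.0]}
toy_query_vector = [1.0, 0.0]

ranked = rank_sentences(toy_vectors, toy_sentences, toy_query_vector, 0.5)
for row in ranked:
    print(row)
```

Sentence "s3" is orthogonal to the query vector, so its score of 0 falls below the threshold and it is dropped from the results.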

### Modeling functions

There are currently two multi language models available for sentence similarity

* xlm-r-bert-base-nli-stsb-mean-tokens: Produces similar embeddings as the bert-base-nli-stsb-mean-token model. Trained on parallel data for 50+ languages.
<span style="color:red"><strong>Attention!</strong></span> Model "xlm-r-100langs-bert-base-nli-mean-tokens" which was the name used in the original Omdena-challenge script has changed to this "xlm-r-bert-base-nli-stsb-mean-tokens"

* distiluse-base-multilingual-cased-v2: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. While the original mUSE model only supports 16 languages, this multilingual knowledge distilled version supports 50+ languages

In [None]:
# This function creates the sentence embeddings for each transformer and saves them in a json file.
# INPUT PARAMETERS
# transformers: a list with transformer names
# sentences_dict: a dictionary with the sentences of the database with the form {"<sentence id>" : {"text" : "<sentence text>"}}
# file: the filepath and filename of the output json
# OUTPUT
# the embeddings of the sentences in a json with the following structure:
# {"<transformer name>" : {"<sentence id>" : <sentence embedding>}}

def create_sentence_embeddings(transformers, sentences_dict, file):
    embeddings = {}
    for transformer_name in transformers:
        model = SentenceTransformer(transformer_name)
        embeddings[transformer_name] = {}
        for sentence in sentences_dict:
            embeddings[transformer_name][sentence] = [model.encode(sentences_dict[sentence]['text'].lower())]
    with open(file, 'w') as fp:
        json.dump(embeddings, fp, cls=NumpyArrayEncoder)
     
   
def highlight(transformer_name, model, sentence_emb, sentences_dict, query, similarity_treshold):
    query_embedding = model.encode(query.lower())
    highlights = []
    for sentence in sentences_dict:
        sentence_embedding = np.asarray(sentence_emb[transformer_name][sentence])[0]
        score = 1 - distance.cosine(sentence_embedding, query_embedding)
        if score > similarity_treshold:
            highlights.append([sentence, score, sentences_dict[sentence]['text']])
    highlights = sorted(highlights, key = lambda x : x[1], reverse = True)
    return highlights


### Create embeddings for sentences in the database

This piece of code is to be executed only once each time the database changes or when we want to get the embeddings of a new database. For example, we run it once for the El Salvador policies and do not need to run it again until we add new policies to this database. Instead, whenever we want to run experiments on this database, we load the json files with the embeddings, which are in the "input" folder.

So the next cell should be kept commented out for safety. Uncomment it and execute it whenever you need it.
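One way to make the "compute once, load afterwards" rule harder to break is a small caching helper. The sketch below is an assumption, not part of the notebook: `load_or_create_embeddings` and `fake_encode` are hypothetical names, and `fake_encode` only stands in for a real model's encode call so that the example runs without downloading a transformer.

```python
import json
import tempfile
from pathlib import Path


def load_or_create_embeddings(file, sentences, encode):
    """Load embeddings from `file` if it exists; otherwise compute and save them."""
    file = Path(file)
    if file.exists():
        with open(file, "r") as fp:
            return json.load(fp)
    embeddings = {
        sentence_id: encode(sentence["text"].lower())
        for sentence_id, sentence in sentences.items()
    }
    with open(file, "w") as fp:
        json.dump(embeddings, fp)
    return embeddings


calls = []  # records how many times the (expensive) encoder is actually used


def fake_encode(text):
    """Stand-in for a real encoder: returns a tiny made-up vector."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


toy_sentences = {"s1": {"text": "Asistencia técnica"}}

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "toy_embeddings.json"
    first = load_or_create_embeddings(cache_file, toy_sentences, fake_encode)
    second = load_or_create_embeddings(cache_file, toy_sentences, fake_encode)

print(first == second, len(calls))
```

The second call reads the json file instead of encoding again, so the encoder runs only once.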

In [None]:
Ti = time.perf_counter()

transformer_names =['xlm-r-bert-base-nli-stsb-mean-tokens', 'distiluse-base-multilingual-cased-v2']

path = "../input/"
filename = "Embeddings_ElSalvador_201223.json"
file = path + filename

create_sentence_embeddings(transformer_names, sentences, file)

Tf = time.perf_counter()

print(f"The building of a sentence embedding database for El Salvador in the two current models has taken {Tf - Ti:0.4f} seconds")

### Loading the embeddings for database sentences

Loading of the embeddings for a single country

In [None]:
path = "../input/"
filename = "Embeddings_Chile_201223.json"
file = path + filename

with open(file, "r") as f:
    sentence_embeddings_chile = json.load(f)

Loading and merging the embeddings of all the countries into a single dictionary.

In [None]:
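paths = Path("../input/").glob('**/*.json')

sentence_embeddings = {}
for file_obj in paths:
    # file_obj is a Path object, not a string; the file names start with "Embeddings", hence the lower()
    if "embedding" not in file_obj.name.lower():
        continue
    with open(file_obj, "r") as f:
        country_embeddings = json.load(f)
    # Merge per transformer so that the sentences of one country do not overwrite those of another
    for transformer_name, embeddings in country_embeddings.items():
        sentence_embeddings.setdefault(transformer_name, {}).update(embeddings)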

In [None]:
len(sentence_embeddings)
for key in sentence_embeddings:
    print(len(sentence_embeddings[key]))

## Basic search with single test query

In [None]:
# First load transformers into the model by choosing one model from index
transformer_names =['xlm-r-bert-base-nli-stsb-mean-tokens', 'distiluse-base-multilingual-cased-v2']
model_index = 0
model = SentenceTransformer(transformer_names[model_index])

In [None]:
# Now, perform single query searches by manually writing a query in the corresponding field
Ti = time.perf_counter()

highlighter_query = "La Policia al tener conocimiento de cualquier infraccion"
similarity_limit = 0.00

label_1 = highlight(transformer_names[model_index], model, sentence_embeddings, sentences, highlighter_query, similarity_limit)

Tf = time.perf_counter()

print(f"similarity search for El Salvador sentences done in {Tf - Ti:0.4f} seconds")

In [None]:
print(len(label_1))
label_1[0:10]

### Inspecting the results

In [None]:
print(highlighter_query)
label_1[0:40]

### Further filtering of the results by using the similarity score

In [None]:
similarity_treshold = 0.5
filtered = [row for row in label_1 if row[1] > similarity_treshold]
filtered

### Exporting results

In [None]:
# Create a dataframe
export_query = pd.DataFrame(label_1, columns = ["sentence_id", "similarity_score", "text"])
# Export file
# export_query.to_csv(file)  # "file" must point to the desired output path
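A minimal sketch of the export step, using only the standard library, is given below. The helper `export_results`, the column names and the file name are assumptions made for illustration; the notebook itself does not fix an output name for this step.

```python
import csv
import tempfile
from pathlib import Path


def export_results(rows, file):
    """Write [sentence id, similarity, text] rows to a CSV file with a header."""
    with open(file, "w", newline="", encoding="utf-8") as fp:
        writer = csv.writer(fp)
        writer.writerow(["sentence_id", "similarity_score", "text"])
        writer.writerows(rows)


# Toy search results, invented for illustration
toy_results = [
    ["s1", 0.91, "Se concede un crédito"],
    ["s2", 0.55, "Se impone una multa"],
]

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "toy_query_results.csv"  # hypothetical file name
    export_results(toy_results, out_file)
    # Read the file back to check what was written
    with open(out_file, newline="", encoding="utf-8") as fp:
        exported = list(csv.reader(fp))

print(exported[0])
```

In the notebook the same result can be obtained from the dataframe by pointing its CSV export at a file in the "output" folder.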

## Multiparameter search design

In [None]:
# This piece of code is just to limit the number of items in the incentives dictionary for testing purposes.
# The "incentives" dictionary contains the keywords that represent policy instruments. It is to be used in
# the following cell, where we run a search based on (1) the keywords themselves and (2) the first sentence found in policy documents
# with each of the keywords.

# dicti = {}
# i= 0
# for key in incentives:
#     if i < 2:
#         dicti[key] = 0
#     i += 1
# incentives = dicti
    

In [18]:
# The function below uses a set of queries to search a database for similar sentences with different transformers.
# The input parameters are:

# Transformer_names: A list with the names of the transformers to be used. For multilingual similarity search we have two transformers
# Queries: a list of the queries, as strings, that we want to use for searching the database

# Similarity_limit: The results are in the form of a similarity coefficient, where 1 is a perfect match between the query embedding
# and the embedding of the sentence in the database (the two vectors point in the same direction). If the similarity coefficient is 0,
# the two vectors are orthogonal: they do not share anything in common. Thus, in order to restrict the number of results that are kept
# from the experiment, we can set a similarity threshold. When we have a huge database, a good threshold would be 0.3 to 0.5 or even higher.

# Results_limit: instead of, or in addition to, Similarity_limit, we can limit our list of search results to the first sentences
# in the similarity ranking. We can set the limit to high numbers in an exploration phase and then reduce this number in a
# "production" phase

# Filename: The results will be exported to the "output/" folder in json format; we need to give the file a name without extension.

def multiparameter_sentence_similarity_search(transformer_names, queries, similarity_limit, results_limit, filename):
    results = {}
    for transformer in transformer_names:
        model = SentenceTransformer(transformer)
        results[transformer] = {}
        for query in queries:
            Ti = time.perf_counter()
            similarities = highlight(transformer, model, sentence_embeddings, sentences, query, similarity_limit)
            results[transformer][query] = similarities[0:results_limit]
            Tf = time.perf_counter()
            print(f"similarity search for model {transformer} and query {query} has been done in {Tf - Ti:0.4f} seconds")

    path = "../output/"
    filename = filename + ".json"
    file = path + filename
    with open(file, 'w') as fp:
        json.dump(results, fp, indent=4)
    return results

# For experiments 2 and 3 this function helps to debug misspellings in the values of the dictionary
def check_dictionary_values(dictionary):
    check_country = {}
    check_incentive = {}
    for key, value in dictionary.items():
        incentive, country = value.split("-")
        check_incentive[incentive] = 0
        check_country[country] = 0
    print(check_incentive)
    print(check_country)
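The interplay between the similarity limit and the results limit can be sketched on a toy list of results. The helper `limit_results` is hypothetical and the rows are invented; it only mirrors the two-step filtering (threshold first, then top of the ranking) performed by the search function.

```python
def limit_results(rows, similarity_limit, results_limit):
    """Keep rows above the similarity limit, then only the best `results_limit`."""
    kept = [row for row in rows if row[1] > similarity_limit]
    kept.sort(key=lambda row: row[1], reverse=True)
    return kept[:results_limit]


# Toy [sentence id, similarity, text] rows, invented for illustration
toy_rows = [
    ["s3", 0.31, "texto 3"],
    ["s1", 0.82, "texto 1"],
    ["s4", 0.12, "texto 4"],
    ["s2", 0.47, "texto 2"],
]

# Exploration phase: permissive threshold and a generous results limit
exploration = limit_results(toy_rows, 0.0, 10)
# "Production" phase: stricter threshold and a short results list
production = limit_results(toy_rows, 0.3, 2)

print([row[0] for row in exploration])
print([row[0] for row in production])
```

With the stricter settings "s4" is removed by the threshold and "s3" by the results limit, even though "s3" is above the threshold.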

In [None]:
for key1 in results_Exp2:
    print(key1)
    for key2 in results_Exp2[key1]:
        print(queries_dict_exp2[key2])

### Query building

The code to compute sentence similarity takes two inputs:

* The queries, which are passed as a list of strings.
* The embeddings of the sentences in the database.

At this point everything we need to run the experiment is ready except the list of queries. One can write the list manually or build it from other data flows. The next cells are meant to do this.

### <strong><span style="color:red">Experiment 1</span></strong> Queries extracted from the database itself

What we do in this experiment is check the capacity of two models:

* xlm-r-bert-base-nli-stsb-mean-tokens
* distiluse-base-multilingual-cased-v2

to find policy instruments for incentives, based on 9 categories:

asistencia técnica; ayuda; bono; crédito; deducción; devolución; incentivo; insumo and multa

We will compare two approaches:
1. to perform a search with the keyword itself
2. to perform a search with one of the sentences found in El Salvador policies which contain the keyword.

User-set parameters:
<strong>Transformer names:</strong> this is a list with the different models to test. There are currently two.

<strong>Similarity limit:</strong> just to filter out the search matches with low similarity.

<strong>Number of search results:</strong> the search is run against all 40,000 sentences in the database, but we don't want to keep them all, just the most relevant ones. We take 1500 because the keyword with the most direct matches is "multa", with some 1352 matches.

In [None]:
S_queries = {"asistencia técnica" : "00a55af_79", 
             "ayuda" : "00a55af_61", 
             "bono" : "00a55af_80", 
             "crédito" : "1cd36a0_11", 
             "incentivo" : "51a0d9e_30",
             "insumo" : "731dbf0_11",
             "multa" : "029d411_88"
}
incentive = "asistencia técnica"
[row for row in keyword_in_sentences if row[1] == S_queries[incentive]][0][2]

In [None]:
transformers =['xlm-r-bert-base-nli-stsb-mean-tokens']#, 'distiluse-base-multilingual-cased-v2']
queries = ["Conceder créditos a los productores o propietarios"]
similarity_threshold = 0.2
search_results_limit = 100
name = "test201224"

results_dict = multiparameter_sentence_similarity_search(transformers, queries, similarity_threshold, search_results_limit, name)

In [None]:
i = 0
similarity_list = []
for key1 in results_dict:
    for key2 in results_dict[key1]:
        for item in results_dict[key1][key2]:
            # Store the rank of the result together with its similarity score
            similarity_list.append([i, item[1]])
            i += 1

In [None]:
similarity_list[0:5]

### <strong><span style="color:red">Experiment 2</span></strong> Queries from the tagged database

Here we use the database of tagged sentences to define queries. The database is structured by country. Sentences were extracted from a list of model documents and tagged with a policy instrument label. The labels that were used are:

* Credit
* Direct payment
* Fine
* Guarantee
* Supplies
* Tax deduction
* Technical assistance

Not all countries have tagged sentences for each category so we ended up with 26 queries
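Since each query is stored with a tag of the form "label-country", the label can be recovered from the tag and passed on to the sentences that the query retrieves. The sketch below shows this on toy data: the helper `split_tag`, the queries and the retrieved sentence ids are all invented for illustration and are not results from the database.

```python
from collections import defaultdict


def split_tag(tag):
    """Split a "label-country" tag at the first hyphen."""
    label, country = tag.split("-", 1)
    return label, country


# Toy queries and toy search hits, invented for illustration
toy_queries = {
    "Conceder créditos a los productores": "Credit-Chile",
    "Se impondrá una multa": "Fine-México",
    "Otorgar líneas de crédito": "Credit-El Salvador",
}
toy_hits = {
    "Conceder créditos a los productores": ["a_1", "a_2"],
    "Se impondrá una multa": ["b_7"],
    "Otorgar líneas de crédito": ["a_2", "c_3"],
}

queries_by_label = defaultdict(list)
auto_labels = {}
for query, tag in toy_queries.items():
    label, country = split_tag(tag)
    queries_by_label[label].append(query)
    for sentence_id in toy_hits.get(query, []):
        # A sentence keeps the label of the first query that retrieved it
        auto_labels.setdefault(sentence_id, label)

print(dict(queries_by_label))
print(auto_labels)
```

Keeping the first label is only one possible rule; sentences retrieved by queries with different labels are natural candidates for the manual curation step.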


In [None]:
queries_dict_exp2 = {
    "Para efectos del otorgamiento de estímulos fiscales, crediticios o financieros por parte del Estado, se considerarán prioritarias las actividades relacionadas con la conservación y restauración de los hábitats, la protección del ambiente y el aprovechamiento sustentable de los recursos naturales." : "Credit-México",
"Obtener créditos blandos para mejorar la sostenibillidad y rentabilidad de las actividades de uso de la Diversidad Biológica. Estos créditos podrían beneficiar a sistemas productivos asociados a la pequeña y mediana producción, actividades de experimentación, investigación, producción y comercialización de la Diversidad Biológica, implementación de tecnologías de producción limpia, programas de reforestación u otros que se estipulen." : "Credit-Perú",
"Se asocia con créditos de enlace INDAP y Banco Estado" : "Credit-Chile", 
"El INAB establecerá un programa de garantía crediticia para la actividad forestal, mediante el cual se respaldarán los créditos que otorgue el sistema bancario para el fomento del sector forestal a los pequeños propietarios referidos en el articulo 83 de la presente ley, usando recursos del Fondo Forestal Privativo u otras fuentes, el reglamento debe regular los procedimientos del programa de garantía crediticia a la actividad forestal del pequeño propietario." : "Credit-Guatemala",
"El Banco Multisectorial de Inversiones establecerá líneas de crédito para que el Sistema Financiero apoye a la pequeña, mediana y microempresa, a fin de que puedan oportunamente adaptarse a las Disposiciones de la presente Ley." : "Credit-El Salvador",
"Dentro de los incentivos económicos se podrá crear un bono que incentive la conservación del recurso forestal por el Fondo Forestal Mexicano de acuerdo a la disponibilidad de recursos, a fin de retribuir a los propietarios o poseedores de terrenos forestales por los bienes y servicios ambientales generados." : "Direct_payment-México",
"Los fondos forestales serviran para el pago por arbol prendido a los dos anos de su instalación en terreno definitivo, siempre que provengan de viveros certificados." : "Direct_payment-Perú",
"El porcentaje de bonificación para pequeños propietarios forestales será del 90% de los costos de la forestación que efectúen en suelos de aptitud preferente ente forestal o en suelos degradados de cualquier clase, incluidas aquellas plantaciones con baja densidad para fines de uso silvopastoral, respecto de las primeras 15 hectáreas y de un 75% respecto de las restantes." : "Direct_payment-Chile",
"El Estado, en un período de 20 años contados a partir de la vigencia de la presente ley, dará incentivos al establecimiento de plantaciones, su mantenimiento y el manejo de bosques naturales, este incentivo se otorgará a los propietarios de tierras con vocación forestal, una sola vez, de acuerdo al plan de manejo y/o reforestación aprobado por el INAB." : "Direct_payment-Guatemala",
"Incentivos en dinero: para cubrir los costos directos e indirectos del establecimiento y manejo de areas con sistema agroforestal de cafe" : "Direct_payment-El Salvador",
"Toda persona física o moral que ocasione directa o indirectamente un daño a los recursos forestales, los ecosistemas y sus componentes, estará obligada a repararlo o compensarlo, de conformidad con lo dispuesto en la Ley Federal de Responsabilidad Ambiental." : "Fine-México",
"Disminuir los riesgos para el inversionista implementando mecanismos de aseguramiento." : "Guarantee-México",
"Fianza: Podrá garantizarse el cumplimiento de repoblación forestal mediante fi anza otorgada a favor del INAB por cualquiera de las afi anzadoras legalmente autorizadas para funcionar en el país, en base al cuadro siguiente" : "Guarantee-Guatemala",
"La/el sujeto de derecho podrá recibir en especie materiales, insumos, equipos, herramientas, para la instalación y operación de viveros comunitarios." : "Supplies-México",
"Ello, a través de la utilización de guías, manuales, protocolos, paquetes tecnológicos, procedimientos, entre otros." : "Supplies-Perú",
"Incentivos en especie: insumos agrícolas, herramientas, asistencia tecnica, estudios de factibilidad y pre factibilidad, elaboracion de planes de manejo, mejoramiento de vías de acceso a las plantaciones, comercializacion y capacitaciones." : "Supplies-El Salvador",
"Otorgar incentivos fiscales a las plantaciones forestales comerciales, incluyendo incentivos dirigidos a promover la industria ligada a las plantaciones comerciales forestales." : "Tax_deduction-México",
"25% de descuento en el pago del derecho de aprovechamiento, si el titular de la concesión reporte anualmente sus resultados de inventario forestal, de acuerdo a los lineamientos aprobados por el SERFOR." : "Tax_deduction-Perú",
"Las bonificaciones percibidas o devengadas se considerarán como ingresos diferidos en el pasivo circulante y no se incluirán para el cálculo de la tasa adicional del artículo 21 de la Ley de la Renta ni constituirán renta para ningún efecto legal hasta el momento en que se efectúe la explotación o venta del bosque que originó la bonificación, oportunidad en la que se amortizará abonándola al costo de explotación a medida y en la proporción en que ésta o la venta del bosque se realicen, aplicándose a las utilidades resultantes el artículo 14°, inciso primero, del presente decreto ley." : "Tax_deduction-Chile",
"Los contratistas que suscriban contratos de exploración y/o explotación y de sistemas estacionarios de transporte de hidrocarburos, quedan exentos de cualquier impuesto sobre los dividendos, participaciones y utilidades que el contratista remese al exterior como pago a sus accionistas, asociados, partícipes o socios, así como las remesas en efectivo y/o en especie y los créditos contables que efectúen a sus casas matríces." : "Tax_deduction-Guatemala",
"Exención de los derechos e impuestos, incluyendo el Impuesto a la Transferencia de Bienes Muebles y a la Prestación de Servicios, en la importación de sus bienes, equipos y accesorios, maquinaria, vehículos, aeronaves o embarcaciones para cabotaje y los materiales de construcción para las edificaciones del proyecto." : "Tax_deduction-El Salvador",
"Formación Permanente Además del acompañamiento técnico, los sujetos de derecho participarán en un proceso permanente de formación a lo largo de todo el año, que les permita enriquecer sus habilidades y capacidades en el ámbito social y productivo." : "Technical_assistance-México",
"Contribuir en la promoción para la gestión de las plantaciones forestales y agroforestales, a través de la capacitación, asesoramiento, asistencia técnica y educación de los usuarios, en coordinación con la ARFFS." : "Technical_assistance-Perú",
"Asesoría prestada al usuario por un operador acreditado, conducente a elaborar, acompañar y apoyar la adecuada ejecución técnica en terreno de aquellas prácticas comprometidas en el Plan de Manejo, sólo podrán postular, a esta asistencia, los pequeños productores agrícolas." : "Technical_assistance-Chile",
"Programas de Capacitación Para la ejecución de programas de capacitación, adiestramiento y otorgamiento de becas para la preparación de personal guatemalteco, así como para el desarrollo de tecnología en actividades directamente relacionadas con las operaciones petroleras objeto del contrato, todo contratista contribuirá con las cantidades de dólares de los Estados Unidos de América que se estipulen en el contrato." : "Technical_assistance-Guatemala",
"Apoyo técnico y en formulación de proyectos y conexión con mercados" : "Technical_assistance-El Salvador"}

queries = list(queries_dict_exp2)  # the dictionary keys are the query sentences

# print(queries)

The next cell checks the values of the queries dictionary for misspellings.
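
As an illustration of what such a check can look like, here is a minimal sketch. It is not the notebook's `check_dictionary_values` (which is defined elsewhere); the function name `find_misspelled_values` and the two allowed-value sets are assumptions made for this example, built from the labels and countries used in this notebook.

```python
# Hypothetical allowed values, taken from the labels and countries used in this notebook.
ALLOWED_LABELS = {"Credit", "Direct_payment", "Fine", "Guarantee",
                  "Supplies", "Tax_deduction", "Technical_assistance"}
ALLOWED_COUNTRIES = {"México", "Perú", "Chile", "Guatemala", "El Salvador"}


def find_misspelled_values(queries_dictionary):
    """Return the values that do not follow the '<label>-<country>' pattern."""
    problems = []
    for value in queries_dictionary.values():
        # Split on the first hyphen only, so "El Salvador" stays in one piece.
        label, separator, country = value.partition("-")
        if not separator or label not in ALLOWED_LABELS or country not in ALLOWED_COUNTRIES:
            problems.append(value)
    return problems


example = {
    "query one": "Credit-México",
    "query two": "Tax_deducton-Perú",  # misspelled label on purpose
    "query three": "Fine-El Salvador",
}
misspelled = find_misspelled_values(example)
print(misspelled)
```

An empty list means that every value uses a known label and a known country.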

In [None]:
check_dictionary_values(queries_dict_exp2)


In [None]:
transformers =['xlm-r-bert-base-nli-stsb-mean-tokens']#, 'distiluse-base-multilingual-cased-v2']
similarity_threshold = 0.2
search_results_limit = 100
name = "Exp2_tagged_201228"

results_dict = multiparameter_sentence_similarity_search(transformers, queries, similarity_threshold, search_results_limit, name)
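
The two numeric parameters above, `similarity_threshold` and `search_results_limit`, can be read as a score cut-off and a cap on the number of hits per query. How `multiparameter_sentence_similarity_search` applies them is defined earlier in the notebook, so the following is only a sketch under that assumption. The helper names `cosine` and `top_matches` and the toy vectors are hypothetical, and plain Python lists stand in for real sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_matches(query_vec, sentence_vecs, similarity_threshold=0.2, search_results_limit=100):
    """Return (index, score) pairs above the threshold, best first, capped at the limit."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(sentence_vecs)]
    kept = [pair for pair in scored if pair[1] >= similarity_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:search_results_limit]


# Toy 2-dimensional "embeddings", for illustration only.
sentence_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query_vec = [1.0, 0.0]
matches = top_matches(query_vec, sentence_vecs, similarity_threshold=0.2, search_results_limit=2)
print(matches)
```

With a low threshold such as 0.2, the limit is usually what bounds the size of the result set.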

### <strong><span style="color:red">Experiment 3</span></strong> Queries from the tagged database with modification

Here we use the database of tagged sentences to define queries. The database is structured by country. The sentences were extracted from a list of model documents and each one was tagged with a policy instrument label. The labels used are:

* Credit
* Direct payment
* Fine
* Guarantee
* Supplies
* Tax deduction
* Technical assistance

Not all countries have tagged sentences for every category, so we ended up with 26 queries.

The difference between this experiment and experiment 2 is that here we have reformulated the query sentences by extracting the core incentive meaning from the original sentences, eliminating all the vocabulary that does not strictly concern incentives.

In [None]:
queries_dict_exp3 = {
    "Otorgamiento de estímulos crediticios por parte de el estado" : "Credit-México",
"Estos créditos podrían beneficiar a sistemas productivos asociados a la pequeña y mediana producción" : "Credit-Perú",
"Se asocia con créditos de enlace del Banco del Estado" : "Credit-Chile", 
"Acceso al programa de garantía crediticia para la actividad económica" : "Credit-Guatemala",
"El banco establecerá líneas de crédito para que el sistema financiero apoye la pequeña, mediana y microempresa" : "Credit-El Salvador",
"Dentro de los incentivos económicos se podrá crear un bono para retribuir a los propietarios por los bienes y servicios generados." : "Direct_payment-México",
"Acceso a los fondos forestales para el pago de actividad" : "Direct_payment-Perú",
"Se bonificará el 90% de los costos de repoblación para las primeras 15 hectáreas y de un 75% respecto las restantes" : "Direct_payment-Chile",
"El estado dará un incentivo que se pagará una sola vez a los propietarios forestales" : "Direct_payment-Guatemala",
"Incentivos en dinero para cubrir los costos directos e indirectos del establecimiento y manejo de areas de producción" : "Direct_payment-El Salvador",
"Toda persona física o moral que cause daños estará obligada a repararlo o compensarlo" : "Fine-México",
"Disminuir los riesgos para el inversionista implementando mecanismos de aseguramiento" : "Guarantee-México",
"Podrá garantizarse el cumplimiento de la actividad mediante fianza otorgada a favor del estado por cualquiera de las afianzadoras legalmente autorizadas." : "Guarantee-Guatemala",
"El sujeto de derecho podrá recibir insumos para la instalación y operación de infraestructuras para la actividad económica." : "Supplies-México",
"Se facilitará el soporte técnico a  través de la utilización de guías, manuales, protocolos, paquetes tecnológicos, procedimientos, entre otros." : "Supplies-Perú",
"Se concederán incentivos en especie para fomentar la actividad en forma de insumos" : "Supplies-El Salvador",
"Se otorgarán incentivos fiscales para la actividad primaria y también la actividad de transformación" : "Tax_deduction-México",
"De acuerdo con los lineamientos aprobados se concederá un 25% de descuento en el pago del derecho de aprovechamiento" : "Tax_deduction-Perú",
"Las bonificaciones percibidas o devengadas se considerarán como ingresos diferidos en el pasivo circulante y no se incluirán para el cálculo de la tasa adicional ni constituirán renta para ningún efecto legal hasta el momento en que se efectúe la explotación o venta" : "Tax_deduction-Chile",
"Los contratistas que suscriban contratos de exploración y/o explotación, quedan exentos de cualquier impuesto sobre los dividendos, participaciones y utilidades" : "Tax_deduction-Guatemala",
"Exención de los derechos e impuestos, incluyendo el Impuesto a la Transferencia de Bienes Muebles y a la Prestación de Servicios, en la importación de sus bienes, equipos y accesorios, maquinaria, vehículos, aeronaves o embarcaciones" : "Tax_deduction-El Salvador",
"Se facilitará formación Permanente Además del acompañamiento técnico, los sujetos de derecho participarán en un proceso permanente de formación a lo largo de todo el año, que les permita enriquecer sus habilidades y capacidades " : "Technical_assistance-México",
"Contribuir en la promoción para la gestión, a través de la capacitación, asesoramiento, asistencia técnica y educación de los usuarios" : "Technical_assistance-Perú",
"Asesoría prestada al usuario por un operador acreditado, conducente a elaborar, acompañar y apoyar la adecuada ejecución técnica en terreno de aquellas prácticas comprometidas en el Plan de Manejo" : "Technical_assistance-Chile",
"Para la ejecución de programas de capacitación, adiestramiento y otorgamiento de becas para la preparación de personal , así como para el desarrollo de tecnología en actividades directamente relacionadas con las operaciones objeto del contrato" : "Technical_assistance-Guatemala",
"Apoyo técnico y en formulación de proyectos y conexión con mercados" : "Technical_assistance-El Salvador"}

queries = list(queries_dict_exp3)  # the dictionary keys are the query sentences

# print(queries)

The next cell checks the values of the queries dictionary for misspellings.

In [None]:
check_dictionary_values(queries_dict_exp3)

In [None]:
transformers =['xlm-r-bert-base-nli-stsb-mean-tokens', 'distiluse-base-multilingual-cased-v2']
similarity_threshold = 0.2
search_results_limit = 1000
name = "Exp3_tagged_201231"

results_dict = multiparameter_sentence_similarity_search(transformers, queries, similarity_threshold, search_results_limit, name)

### <strong><span style="color:red">Experiment 4</span></strong> Queries from the tagged database with modification

This is the last version before extensive tagging. The resulting data will be used for fine-tuning the model. It is very similar to experiment 3, but we are going to change some parameters:

* We are going to work with both the Chilean database and the El Salvador database
* We are going to retrieve the first 200 results for each query
* We are going to balance the number of queries to have 5 queries for each policy instrument.

There are some policy instruments that are underrepresented in some countries. For example, in the tagged sentences data set there are only two sentences tagged as fine and they are both from Mexico. What we are going to do is to manually find more sentences in official datasets in order to have at least 5 queries in each category.
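
To see which policy instruments still need manually added queries, the queries can be counted per label. This is a minimal sketch: the helper names `queries_per_instrument` and `underrepresented` are hypothetical, and it assumes that every dictionary value has the form `<label>-<country>`, as in the query dictionaries of this notebook.

```python
from collections import Counter


def queries_per_instrument(queries_dictionary):
    """Count queries per policy instrument, reading the label before the first hyphen."""
    return Counter(value.split("-", 1)[0] for value in queries_dictionary.values())


def underrepresented(queries_dictionary, minimum=5):
    """Return the labels that have fewer than `minimum` queries, with their counts."""
    counts = queries_per_instrument(queries_dictionary)
    return {label: n for label, n in counts.items() if n < minimum}


# Toy dictionary, for illustration only.
toy = {"q1": "Fine-México", "q2": "Fine-Perú", "q3": "Credit-Chile"}
counts = queries_per_instrument(toy)
missing = underrepresented(toy, minimum=2)
print(counts, missing)
```

Calling `underrepresented(queries_dict_exp4)` in the notebook would list the labels that are still below the target of 5 queries.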

In [16]:
queries_dict_exp4 = {
    "Otorgamiento de estímulos crediticios por parte de el estado" : "Credit-México",
"Estos créditos podrían beneficiar a sistemas productivos asociados a la pequeña y mediana producción" : "Credit-Perú",
"Se asocia con créditos de enlace del Banco del Estado" : "Credit-Chile", 
"Acceso al programa de garantía crediticia para la actividad económica" : "Credit-Guatemala",
"El banco establecerá líneas de crédito para que el sistema financiero apoye la pequeña, mediana y microempresa" : "Credit-El Salvador",
"Dentro de los incentivos económicos se podrá crear un bono para retribuir a los propietarios por los bienes y servicios generados." : "Direct_payment-México",
"Acceso a los fondos forestales para el pago de actividad" : "Direct_payment-Perú",
"Se bonificará el 90% de los costos de repoblación para las primeras 15 hectáreas y de un 75% respecto las restantes" : "Direct_payment-Chile",
"El estado dará un incentivo que se pagará una sola vez a los propietarios forestales" : "Direct_payment-Guatemala",
"Incentivos en dinero para cubrir los costos directos e indirectos del establecimiento y manejo de areas de producción" : "Direct_payment-El Salvador",
"Toda persona física o moral que cause daños estará obligada a repararlo o compensarlo" : "Fine-México",
"El incumplimiento de cualquiera de los requisitos establecidos en la presente se sanciona con la aplicación de la multa y la ejecución de la Medida Complementaria según lo establecido" : "Fine-Perú",
"Incumplimiento grave de las obligaciones del concesionario, tales como el incumplimiento grave del Plan de Manejo, o el incumplimiento grave de las demás normas y regulaciones dictadas para la respectiva área por la Autoridad" : "Fine-Chile",    
"Quedan prohibidas las actividades que pongan en peligro o dañen las Áreas de Bosques, Áreas Naturales y Zonas de Amortiguamiento principalmente la tala ilegal y quema" : "Fine-Guatemala",    
"Tala indiscriminada de árboles para uso habitacional, industrial, comercial y servicios, en cualquier zona urbana por la omisión de la autorización otorgada $ 2,000.00 por cada árbol"  : "Fine-El Salvador",
"El sujeto de derecho podrá recibir insumos para la instalación y operación de infraestructuras para la actividad económica." : "Supplies-México",
"Se facilitará el soporte técnico a  través de la utilización de guías, manuales, protocolos, paquetes tecnológicos, procedimientos, entre otros." : "Supplies-Perú",
"Se concederán incentivos en especie para fomentar la actividad en forma de insumos" : "Supplies-El Salvador",
"Se otorgarán incentivos fiscales para la actividad primaria y también la actividad de transformación" : "Tax_deduction-México",
"De acuerdo con los lineamientos aprobados se concederá un 25% de descuento en el pago del derecho de aprovechamiento" : "Tax_deduction-Perú",
"Las bonificaciones percibidas o devengadas se considerarán como ingresos diferidos en el pasivo circulante y no se incluirán para el cálculo de la tasa adicional ni constituirán renta para ningún efecto legal hasta el momento en que se efectúe la explotación o venta" : "Tax_deduction-Chile",
"Los contratistas que suscriban contratos de exploración y/o explotación, quedan exentos de cualquier impuesto sobre los dividendos, participaciones y utilidades" : "Tax_deduction-Guatemala",
"Exención de los derechos e impuestos, incluyendo el Impuesto a la Transferencia de Bienes Muebles y a la Prestación de Servicios, en la importación de sus bienes, equipos y accesorios, maquinaria, vehículos, aeronaves o embarcaciones" : "Tax_deduction-El Salvador",
"Se facilitará formación Permanente Además del acompañamiento técnico, los sujetos de derecho participarán en un proceso permanente de formación a lo largo de todo el año, que les permita enriquecer sus habilidades y capacidades " : "Technical_assistance-México",
"Contribuir en la promoción para la gestión, a través de la capacitación, asesoramiento, asistencia técnica y educación de los usuarios" : "Technical_assistance-Perú",
"el beneficiario deberá presentar factura o boleta de honorarios que certifique el pago al operador por los servicios prestados en la confección y presentación del plan de manejo, según lo declarado como costo de asistencia técnica" : "Technical_assistance-Chile",
"Asesoría prestada al usuario por un operador acreditado, conducente a elaborar, acompañar y apoyar la adecuada ejecución técnica en terreno de aquellas prácticas comprometidas en el Plan de Manejo" : "Technical_assistance-Chile",
"Para la ejecución de programas de capacitación, adiestramiento y otorgamiento de becas para la preparación de personal , así como para el desarrollo de tecnología en actividades directamente relacionadas con las operaciones objeto del contrato" : "Technical_assistance-Guatemala",
"Apoyo técnico y en formulación de proyectos y conexión con mercados" : "Technical_assistance-El Salvador"}

queries = list(queries_dict_exp4)  # the dictionary keys are the query sentences

The next cell checks the values of the queries dictionary for misspellings.

In [19]:
check_dictionary_values(queries_dict_exp4)

{'Credit': 0, 'Direct_payment': 0, 'Fine': 0, 'Supplies': 0, 'Tax_deduction': 0, 'Technical_assistance': 0}
{'México': 0, 'Perú': 0, 'Chile': 0, 'Guatemala': 0, 'El Salvador': 0}


In [None]:
transformers =['xlm-r-bert-base-nli-stsb-mean-tokens']#, 'distiluse-base-multilingual-cased-v2']
similarity_threshold = 0.2
search_results_limit = 200
name = "Exp4_tagged_200105"

results_dict = multiparameter_sentence_similarity_search(transformers, queries, similarity_threshold, search_results_limit, name)

## Results analysis

This is a temporary section to explore how to analyze the results. It is organized with the same structure as the section <strong>Defining queries</strong> as we are exploring the best search strategies based on different types of queries.

### N-grams approach

### Parts-of-speech approach

### Keyword approach

In [None]:
# Loading the results

# path = "../output/"
# filename = "Experiment_201215_jordi_1500.json"
# file = path + filename

# with open(file, "r") as f:
#     experiment_results = json.load(f)

#### Experiment 1

First we load the results and refactor data structures to better process them.

In [None]:
experiment_results = results

# Building a final dictionary of the results with an extra layer that uses sentence IDs as keys of the last layer
experiment_results_full_dict = {}
for model in experiment_results:
    experiment_results_full_dict[model] = {}
    i = 0
    for keyword in experiment_results[model]:
        if i % len(experiment_results[model]) == 0:
            key = keyword + "_K"
            experiment_results_full_dict[model][key] = {}
            for result in experiment_results[model][keyword]:
                experiment_results_full_dict[model][key][result[0]] = result[1:len(result)]
        else:
            key = key[0:-2] + "_S"
            experiment_results_full_dict[model][key] = {}
            for result in experiment_results[model][keyword]:
                experiment_results_full_dict[model][key][result[0]] = result[1:len(result)]
        i += 1
            
# Building a dictionary with all sentences found by exact keyword matching. The dictionary is of the form:
# {"<incentive>" : {"<sentence_id_1>" : [], ... "<sentence_id_n>" : []}}

keyword_hits = {}
for item in keyword_in_sentences:
    if item[0] in keyword_hits:
        keyword_hits[item[0]][item[1]] = []
    else:
        keyword_hits[item[0]] = {}
        keyword_hits[item[0]][item[1]] = []     

In [None]:
transformer_names =['xlm-r-bert-base-nli-stsb-mean-tokens', 'distiluse-base-multilingual-cased-v2']


for incentive, sentence_list in keyword_hits.items():
#     print("\t", incentive.center(25)
    for sentence_id in sentence_list:
        for model_name in transformer_names:
            for key in experiment_results_full_dict[model_name]:
                if incentive in key:
                    if sentence_id in experiment_results_full_dict[model_name][key]:
                        keyword_hits[incentive][sentence_id].append(experiment_results_full_dict[model_name][key][sentence_id][2])
                        keyword_hits[incentive][sentence_id].append(round(experiment_results_full_dict[model_name][key][sentence_id][0], 2))
                    else:
                        keyword_hits[incentive][sentence_id].append(15000)
                        keyword_hits[incentive][sentence_id].append(0.0)
        i += 1
#         for keyword in 
#             print(experiment_results_full_dict[model_name].keys())

In [None]:
keyword_hits

In [None]:
results_csv = []
for key, value in keyword_hits.items():
    for sentence, res in value.items():
        results_csv.append([key, sentence, res[0], res[1], res[2], res[3], res[4], res[5], res[6], res[7]])

In [None]:
column_names = ["keyword", "sentence_ID", "xlm_K-rank", "xlm_K-sim", "xlm_S-rank", "xlm_S-sim", "dist_K-rank", "dist_K-sim", "dist_S-rank", "dist_S-sim"]
df = pd.DataFrame(results_csv, columns=column_names)

In [None]:
path = "../output/"
filename = "Experiment_201217_jordi_1500.csv"
file = path + filename

df.to_csv(file)
df.head()

### Tagged sentence approach

Below, we define the functions that are going to be used in the post-processing and in the analysis of the experiments.

In [None]:
# To show the contents of the results dict, particularly, the length of the first element and its contents
def show_results(results_dictionary):
    i = 0
    for key1 in results_dictionary:
        for key2 in results_dictionary[key1]:
            if i == 0:
                print(len(results_dictionary[key1][key2]))
                print(results_dictionary[key1][key2])
            i += 1

# Adding the rank to each result
def add_rank(results_dictionary):
    for model in results_dictionary:
        for keyword in results_dictionary[model]:
            i = 1
            for result in results_dictionary[model][keyword]:
                result.insert(1, i)
                i += 1
    return results_dictionary

# For experiments 2 and 3: save the results of each query in a separate tab-separated (.tsv) file
def save_results_as_separate_csv(results_dictionary, queries_dictionary, experiment_number, date):
    name = "Exp" + experiment_number
    path = "../output/" + name + "/" + date + "/"
    Path(path).mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
    for model, value in results_dictionary.items():
        # File prefix: experiment name plus the first token of the model name, e.g. "Exp3_Mxlm_"
        name1 = name + "_" + "M" + model.split("-")[0] + "_"
        for exp_title, result in value.items():
            # The query text is mapped to its "<label>-<country>" value to build the file name
            filename = name1 + queries_dictionary[exp_title]
            file = path + filename + ".tsv"
            with open(file, 'w', newline='', encoding='utf-8') as f:
                write = csv.writer(f, delimiter='\t')
                write.writerows(result)
#             print(filename)
    

The results of the analysis are saved as a JSON file. To process the information further, we can load the file contents into a dictionary.

After loading the results, a rank value is added to each result, from the highest similarity score to the lowest.
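
The ranking step can be sketched as follows. This is an illustrative variant, not the notebook's `add_rank`: it assumes that each result row has the form `[sentence_id, similarity, text]`, it sorts explicitly instead of relying on the stored order, and it returns new rows instead of modifying the input. The name `ranked_copy` is hypothetical.

```python
def ranked_copy(results):
    """Return new rows [sentence_id, rank, similarity, text], best score first."""
    ordered = sorted(results, key=lambda row: row[1], reverse=True)
    ranked = []
    for position, row in enumerate(ordered, start=1):
        # Insert the rank right after the sentence ID, as add_rank does.
        ranked.append([row[0], position, *row[1:]])
    return ranked


# Toy rows, for illustration only.
rows = [["s2", 0.41, "text b"], ["s1", 0.87, "text a"], ["s3", 0.25, "text c"]]
ranked = ranked_copy(rows)
print(ranked)
```

Because the input rows are left untouched, running the cell twice does not insert the rank twice.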

In [None]:
# Load the JSON file with the results that you want to analyze. CHANGE the file name accordingly.
path = "../output/"
filename = "Exp3_tagged_201231.json"
file = path + filename
with open(file, "r") as f:
    results_ = json.load(f)

In [None]:
# Adding the rank to the results loaded from the JSON file (working on a copy keeps the loaded data unchanged)
results = add_rank(copy.deepcopy(results_))
# show_results(results)

Now, to simplify the analysis process and to make it available for a broader spectrum of analysts, the results are split into small "tsv" documents that can be easily imported in spreadsheets.

Each new file contains the results of a single query, that is, the 100 (or however many were retrieved) sentences from the database with the highest similarity score to the query. The columns are:

* Sentence ID
* Rank of the sentence in the similarity results
* Similarity score
* Text of the sentence
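
For readers who prefer to inspect these files in Python rather than in a spreadsheet, here is a minimal sketch of reading one back with the four columns listed above. The column names in `COLUMNS` and the sample rows are assumptions made for this example, since the files themselves are written without a header row.

```python
import csv
import io

# Hypothetical column names for the four columns described above.
COLUMNS = ["sentence_id", "rank", "similarity", "text"]

# In-memory stand-in for one of the ".tsv" files, for illustration only.
sample_tsv = "abc123\t1\t0.87\tPrimera oración\nabc456\t2\t0.41\tSegunda oración\n"

reader = csv.DictReader(io.StringIO(sample_tsv), fieldnames=COLUMNS, delimiter="\t")
records = list(reader)

# Values are read as strings, so convert before comparing numerically.
top = [r["sentence_id"] for r in records if float(r["similarity"]) >= 0.5]
print(top)
```

To read a real file, replace `io.StringIO(sample_tsv)` with a handle opened with `encoding="utf-8"` and `newline=""`.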

In [None]:
# Save the results as separate tsv files
queries_dict = queries_dict_exp3 # CHANGE the queries dict accordingly!
Experiment_number = "3" # CHANGE the experiment number accordingly!
Date = "201231" # CHANGE the date accordingly!
save_results_as_separate_csv(results, queries_dict, Experiment_number, Date)

The next cell retrieves the results that were saved in the previous cell for further analysis.

In [None]:
subfolder = "Exp3/201228/" # CHANGE the subfolder name accordingly!
# subfolder = "Exp3/201231/" # CHANGE the subfolder name accordingly!
paths = Path("../output/" + subfolder).glob('**/*.tsv')
transformers = ["Mxlm", "Mdistiluse"]
policy_instruments = ["Credit", "Direct_payment", "Fine", "Guarantee", "Supplies", "Tax_deduction", "Technical_assistance"]
countries = ["Chile", "El Salvador", "Guatemala", "México", "Perú"]
transformer = "Mxlm"
policy_instrument = "Technical_assistance"

In [None]:
sentences_dict = {}
unique_ids = {}

for path in paths:
    # path is a Path object, not a string
    path_in_str = str(path)
    if transformer in path_in_str:
        if policy_instrument in path_in_str:
            for country in countries:
                if country in path_in_str:
                    sentences_dict[country] = {}
                    print(path_in_str)
                    with open(path_in_str, "r", encoding = "utf-8") as f:
                        file = csv.reader(f, delimiter='\t')
                        for row in file:
                            sentences_dict[country][row[0]] = row
                            unique_ids[row[0]] = row

In [None]:
name = "Unique_sentence_IDs_" + policy_instrument + ".tsv"
path = "../output/Exp3/201228/"
file = path + name
with open(file, 'w', newline = '', encoding = 'utf-8') as f:
    write = csv.writer(f, delimiter='\t')
    for key, value in unique_ids.items():
        write.writerow(value)

In [None]:
print(len(sentences_dict))
print(len(sentences_dict[country]))
print(len(unique_ids))

In [None]:
sentences = {}
counts = 0
i = 0
for country in countries:
    i += 1
    j = 0
    for ref_country in countries:
        j += 1
#         if j > i:
        print(ref_country, "---", country)
        for sentence in sentences_dict[country]:
            if sentence in sentences_dict[ref_country]:
                if sentence in sentences:
                    sentences[sentence] = sentences[sentence] + 1
                else:
                    sentences[sentence] = 1
#                     print("hit")

In [None]:
print(counts)
print(len(sentences))
sentences

In [None]:
policy_instrument = "Direct_payment"
path = Path("../output/")
subfolder = Path("Exp3/201228/")  # CHANGE the subfolder name accordingly!
filename = "Unique_Ids_Tagged_" + policy_instrument + ".xlsx"
file = path / subfolder / filename
df = pd.read_excel(file)
tagged = df.values.tolist()
tagged_dict = {}
for item in tagged:
    tagged_dict[item[0]] = [item[4], item[5], item[6]]

for country in countries:
    updated_file = []
    filename = "Exp3_Mxlm_" + policy_instrument + "-" + country + ".tsv"
    file = path / subfolder / filename
    with open(file, "r", encoding = "utf-8") as f:
        file = csv.reader(f, delimiter='\t')
        for row in file:
            updated_file.append([row[0], row[1], row[2], row[3], tagged_dict[row[0]][0], tagged_dict[row[0]][1], tagged_dict[row[0]][2]])
    filename = "Exp3_Mxlm_" + policy_instrument + "-" + country + "_tagged.tsv"
    file = path / subfolder / filename
    with open(file, 'w', newline = '', encoding = 'utf-8') as f:
        write = csv.writer(f, delimiter='\t')
        write.writerows(updated_file)

In [None]:
updated_file

#### Exp2

In [None]:
path = "../output/"
filename = "Exp2_tagged_201228.json"
file = path + filename
with open(file, "r") as f:
    results_Exp2 = json.load(f)

In [None]:
for key1 in results_Exp2:
    print(key1)
    for key2 in results_Exp2[key1]:
        print(queries_dict_exp2[key2])

#### Exp3

In [None]:
path = "../output/"
filename = "Exp3_tagged_201228.json"
file = path + filename
with open(file, "r") as f:
    results_Exp3 = json.load(f)

### Retrieving the documents of selected sentences