# Implementation of a search engine based on sBERT

In this notebook there is a basic implementation of sBERT for searching a database of sentences with queries.

The goal is to increase the amount of labeled data that we have in order to later fine tune a model to be used for sentence classification. First of all we have to find a pool of queries that represent the six labels of the six policy instruments. With these queries we can pull a set of sentences that can be automaticaly labeled with the same label of the query. In this way we can increase the diversity of labeled sentences in each label category. This approach will be complemented with a manual curation step to produce a high quality training data set.

The policy instruments that we want to find and that correspond to the different labels are:
* Direct payment (PES)
* Tax deduction
* Credit/guarantee
* Technical assistance
* Supplies
* Fines

This notebook is intended for the following purposes:
* Try different query strategies to find the optimal retrieval of sentences in each policy instrument category
* Try different transformers
* Be the starting point for further enhancements

## Import modules

This notebook is self contained, it does not depend on any other class of the sBERT folder.

You just have to create an environment where you install the external dependencies. Usually the dependencies that you have to install are:

**For the basic sentence similarity calculation**
*  pandas
*  boto3
*  pytorch
*  sentence_transformers

**If you want to use ngrams to generate queries**
*  nltk
*  plotly
*  wordcloud

**If you want to do evaluation and ploting with pyplot**
*  matplotlib

In [None]:
# If your environment is called nlp then you execute this cell otherwise you change the name of the environment
!conda activate nlp

In [1]:
# General purpose libraries
import numpy as np
import pandas as pd
import boto3
import json
import time

# Model libraries
from sentence_transformers import SentenceTransformer
from scipy.spatial import distance

# Libraries for model evaluation
# import matplotlib.pyplot as plt
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import accuracy_score
# from sklearn.metrics import confusion_matrix

# Libraries to be used in the process of definig queries
import nltk # imports the natural language toolkit
import plotly
from wordcloud import WordCloud
from collections import Counter
from nltk.util import ngrams
import re

from json import JSONEncoder

class NumpyArrayEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return JSONEncoder.default(self, obj)

## Accesing documents in S3

All documents from El Salvador have been preprocessed and their contents saved in a JSON file. In the JSON file there are the sentences of interest.

Use the json file with the key and password to access the S3 bucket if necessary. 
If not, skip this section and use files in a local folder. 

In [2]:
# If you want to keep the credentials in a local folder out of GitHub, you can change the path to adapt it to your needs.
# Please, comment out other users lines and set your own
path = "C:/Users/jordi/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/credentials/" # Jordi's local path in desktop
# path = "C:/Users/user/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/credentials/" # Jordi's local path in laptop
# path = ""
#If you put the credentials file in the same "notebooks" folder then you can use the following path
# path = ""
filename = "Omdena_key_S3.json"
file = path + filename
with open(file, 'r') as dict:
    key_dict = json.load(dict)

In [3]:
for key in key_dict:
    KEY = key
    SECRET = key_dict[key]

In [4]:
s3 = boto3.resource(
    service_name = 's3',
    region_name = 'us-east-2',
    aws_access_key_id = KEY,
    aws_secret_access_key = SECRET
)

### Loading the sentence database from El Salvador

In [5]:
filename = 'JSON/ElSalvador.json'

obj = s3.Object('wri-latin-talent',filename)
serializedObject = obj.get()['Body'].read()
policy_list = json.loads(serializedObject)

### Building a list of potentially relevant sentences

Before going through the dictionary to retrieve sentences, we define a function to reduce de number of sentences in the final "sentences" dictionary. This is just for testing purposes. The reason being that running the sentence embedding function takes time. So for initial testing purposes we can reduce the number of sentences in the testing dataset.

The variable "slim_by" is the reduction factor. If it is set to 1, there will be no reduction and we will be working with the full dataset. It it is set to two, we will take one every two sentences and so one.

<span style="color:red"><strong>REMEMBER</strong></span> that you have to re-run the function "get_sentences_dict" with the "slim_by" variable set to 1 when you want to go for the final shoot.

In [6]:
def slim_dict(counter, slim_factor): # This is to shrink the sentences dict by a user set factor. It will pick only one sentence every "slim_factor"
    if counter % slim_factor == 0:
        return True
    else:
        return False
def sentence_length_filter(sentence_text, minLength, maxLength):
    if len(sentence_text) > minLength:#len(sentence_text) < maxLength and
        return True
    else:
        return False
    
def get_sentences_dict(docs_dict, is_not_incentive_dict, slim_factor, minLength, maxLength):
    count = 0
    result = {}
    for key, value in docs_dict.items():
        for item in value: 
            if item in is_not_incentive_dict:
                continue
            else:
                for sentence in docs_dict[key][item]['sentences']:
                    if sentence_length_filter(docs_dict[key][item]['sentences'][sentence]["text"], minLength, maxLength):
                        count += 1
                        if slim_dict(count, slim_by):
                            result[sentence] = docs_dict[key][item]['sentences'][sentence]
                        else:
                            continue
                    else:
                        continue
    return result

Here you will run the function to get your sentences list in a dictionary of this form:

{"\<sentence id\>" : "\<text of the sentence\>"}.

In [7]:
is_not_incentive = {"CONSIDERANDO:" : 0,
                    "POR TANTO" : 0,
                    "DISPOSICIONES GENERALES" : 0,
                    "OBJETO" : 0,
                    "COMPETENCIA, PROCEDIMIENTOS Y RECURSOS." : 0}

slim_by = 1 # REMEMBER to set this variable to the desired value.
min_length = 50 # Just to avoid short sentences which might be fragments or headings without a lot of value
max_length = 250 # Just to avoid long sentences which might be artifacts or long legal jargon separated by semicolons

sentences = get_sentences_dict(policy_list, is_not_incentive, slim_by, min_length, max_length)


In [8]:
print("In this data set there are {} policies and {} sentences".format(len(policy_list),len(sentences)))
# for sentence in sentences:
#     print(sentences[sentence]['text'])


In this data set there are 349 policies and 33291 sentences


## Defining Queries

### N-grams approach

In the following lines, we use the excel file with the selected phrases of each country, process them and get N-grams to define basic queries for the SBERT model.

In [None]:
data = pd.read_excel(r'WRI_Policy_Tags (1).xlsx', sheet_name = None)
df = None

if isinstance(data, dict):
    for key, value in data.items():
        if not isinstance(df,pd.DataFrame):
            df = value
        else:
            df = df.append(value)
else:
    df = data
df.head()

In [None]:
tagged_sentences = df["relevant sentences"].apply(lambda x: x.split(";") if isinstance(x,str) else x)
tagged_sentence = []

for elem in tagged_sentences:
    if isinstance(elem,float) or len(elem) == 0:
        continue
    elif isinstance(elem,list):
        for i in elem:
            if len(i.strip()) == 0:
                continue
            else:
                tagged_sentence.append(i.strip())
    else:
        if len(elem.strip()) == 0:
            continue
        else:
            tagged_sentence.append(elem.strip())

tagged_sentence
words_per_sentence = [len(x.split(" ")) for x in tagged_sentence]
plt.hist(words_per_sentence, bins = 50)
plt.title("Histogram of number of words per sentence")

In [None]:
def top_k_ngrams(word_tokens,n,k):
    
    ## Getting them as n-grams
    n_gram_list = list(ngrams(word_tokens, n))

    ### Getting each n-gram as a separate string
    n_gram_strings = [' '.join(each) for each in n_gram_list]
    
    n_gram_counter = Counter(n_gram_strings)
    most_common_k = n_gram_counter.most_common(k)
    print(most_common_k)

noise_words = []
stopwords_corpus = nltk.corpus.stopword
sp_stop_words = stopwords_corpus.words('spanish')
noise_words.extend(sp_stop_words)
print(len(noise_words))

if "no" in noise_words:
    noise_words.remove("no")

tokenized_words = nltk.word_tokenize(''.join(tagged_sentence))
word_freq = Counter(tokenized_words)
# word_freq.most_common(20)
# list(ngrams(tokenized_words, 3))

word_tokens_clean = [re.findall(r"[a-zA-Z]+",each) for each in tokenized_words if each.lower() not in noise_words and len(each.lower()) > 1]
word_tokens_clean = [each[0].lower() for each in word_tokens_clean if len(each)>0]

We define the size of the n-gram that we want to find. The larger it is, the less frequent it will be, unless we substantially increase the number of phrases.

In [None]:
n_grams = 2

top_k_ngrams(word_tokens_clean, n_grams, 20)

### Building queries with Parts-Of-Speech

The following functions take a specific word and find the next or previous words according to the POS tags.

An example is shown below with the text: <br>

text = "Generar empleo y garantizara la población campesina el bienestar y su participación e incorporación en el desarrollo nacional, y fomentará la actividad agropecuaria y forestal para el óptimo uso de la tierra, con obras de infraestructura, insumos, créditos, servicios de capacitación y asistencia técnica" <br>

next_words(text, "empleo", 3) <br>
prev_words(text, "garantizara", 6) <br>

Will return: <br>

>['garantizara', 'población', 'campesina'] <br>
>['Generar', 'empleo']

In [None]:
nlp = es_core_news_md.load()

def ExtractInteresting(sentence, match = ["ADJ","ADV", "NOUN", "NUM", "VERB", "AUX"]):
    words = nltk.word_tokenize(sentence)
#     interesting = [k for k,v in nltk.pos_tag(words) if v in match]
    doc = nlp(sentence)
    interesting = [k.text for k in doc if k.pos_ in match]
    return(interesting)

def next_words(sentence, word, num_words, match = ["ADJ","ADV", "NOUN", "NUM", "VERB", "AUX"]):

    items = list()
    doc = nlp(sentence)
    text = [i.text for i in doc]

    if word not in text: return ""
    
    idx = text.index(word)
    for num in range(num_words):
        
        pos_words = [k.text for k in doc[idx:] if k.pos_ in match]
        if len(pos_words) > 1: 
            items.append(pos_words[1])
            idx = text.index(pos_words[1])
    
    return items
    
def prev_words(sentence, word, num_words, match = ["ADJ","ADV", "NOUN", "NUM", "VERB", "AUX"]):
    
    items = list()
    doc = nlp(sentence)
    text = [i.text for i in doc]

    if word not in text: return ""
    
    idx = text.index(word)
    for num in range(num_words):
        pos_words = [k.text for k in doc[:idx] if k.pos_ in match]
        if len(pos_words) >= 1: 
            items.insert(0, pos_words[-1]) #Add element in order and take the last element since it is the one before the word
            idx = text.index(pos_words[-1])
    
    return items

### Keyword approach

In [9]:
# Regular expression to find incentive policy instruments
keywords = re.compile(r'(asistencia tecnica)|ayuda\s*s*\s*\b|\bbono\s*s*\b\s*|credito\s*s*\b\s*|deduccion\s*(es)*\b\s*|devolucion\s*(es)*\b\s*|incentivo\s*s*\b\s*|insumo\s*s*\b\s*|multa\s*s*\b\s*')

# Function to change accented words by non-accented counterparts. It depends on the dictionary "accent_marks_bugs" 
accents_out = re.compile(r'[áéíóúÁÉÍÓÚ]')
accents_dict = {"á":"a","é":"e","í":"i","ó":"o","ú":"u","Á":"A","É":"E","Í":"I","Ó":"O","Ú":"U"}
def remove_accents(string):
    for accent in accents_out.findall(string):
        string = string.replace(accent, accents_dict[accent])
    return string
# Dictionary to merge variants of a word
families = {
    "asistencia tecnica" : "asistencia técnica",
    "ayuda" : "ayuda",
    "ayudas" : "ayuda",
    "bono" : "bono",
    "bonos" : "bono",
    "credito":  "crédito",
    "creditos":  "crédito",
    "deduccion" : "deducción",
    "deducciones" : "deducción",
    "devolucion" : "devolución",
    "devoluciones" : "devolución",
    "incentivo" : "incentivo",
    "incentivos" : "incentivo",
    "insumo" : "insumo",
    "insumos" : "insumo",
    "multa" : "multa",
    "multas" : "multa"
}

In [27]:
keyword_in_sentences = []
            
for sentence in sentences:
    line = remove_accents(sentences[sentence]['text'])
    hit = keywords.search(line)
    if hit:
        keyword = hit.group(0).rstrip().lstrip()
        keyword_in_sentences.append([families[keyword], sentence, sentences[sentence]['text']])             

In [None]:
print(len(keyword_in_sentences))
keyword_in_sentences = sorted(keyword_in_sentences, key = lambda x : x[0])
# print(keyword_in_sentences[0:20])
filtered = [row for row in keyword_in_sentences if row[0] == "asistencia técnica"]
filtered

In [None]:
i = 0
for key, value in families.items():
    if i % 2 == 0:
        print(value, "--", len([row for row in keyword_in_sentences if row[0] == value]))
    i += 1
    

In [11]:
incentives = {}

for incentive in families:
    incentives[families[incentive]] = 0
    
incentives

{'asistencia técnica': 0,
 'ayuda': 0,
 'bono': 0,
 'crédito': 0,
 'deducción': 0,
 'devolución': 0,
 'incentivo': 0,
 'insumo': 0,
 'multa': 0}

## Using the model

### Initializing the model

First, we import the sBERT model. Several transformers are available and documentation is here: https://github.com/UKPLab/sentence-transformers <br>

Then we build a simple function that takes four inputs:
1. The model as we have set it in the previous line of code
2. A dictionary that contains the sentences {"\<sentence_ID\>" : {"text" : "The actual sentence", labels : []}
3. A query in the form of a string
4. A similarity treshold. It is a float that we can use to limit the results list to the most relevant.

The output of the function is a list with three columns with the following content:
1. Column 1 contains the id of the sentence
2. Column 2 contains the similarity score
3. Column 3 contains the text of the sentence that has been compared with the query

#### Modeling functions

There are currently two multi language models available for sentence similarity

* xlm-r-bert-base-nli-stsb-mean-tokens: Produces similar embeddings as the bert-base-nli-stsb-mean-token model. Trained on parallel data for 50+ languages.
<span style="color:red"><strong>Attention!</strong></span> Model "xlm-r-100langs-bert-base-nli-mean-tokens" which was the name used in the original Omdena-challenge script has changed to this "xlm-r-bert-base-nli-stsb-mean-tokens"

* distiluse-base-multilingual-cased-v2: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. While the original mUSE model only supports 16 languages, this multilingual knowledge distilled version supports 50+ languages

In [12]:
# This function is to create the embeddings for each transformer the embeddings in a json with the following structure:
# INPUT PARAMETERS
# transformers: a list with transformer names
# sentences_dict: a dictionary with the sentences of the database with the form {"<sentence id>" : "<sentence text>"}}
# file: the filepath and filename of the output json
# OUTPUT
# the embeddings of the sentences in a json with the following structure:
# {"<transformer name>" : {"<sentence id>" : <sentence embedding>}}

def create_sentence_embeddings(transformers, sentences_dict, file):
    embeddings = {}
    for transformer_name in transformers:
        model = SentenceTransformer(transformer_name)
        embeddings[transformer_name] = {}
        for sentence in sentences_dict:
            embeddings[transformer_name][sentence] = [model.encode(sentences_dict[sentence]['text'].lower())]
    with open(file, 'w') as fp:
        json.dump(embeddings, fp, cls=NumpyArrayEncoder)
     
   
def highlight(transformer_name, model, sentence_emb, sentences_dict, query, similarity_treshold):
    query_embedding = model.encode(query.lower())
    highlights = []
    for sentence in sentence_emb[transformer_name]:
        sentence_embedding = np.asarray(sentence_emb[transformer_name][sentence])[0]
        score = 1 - distance.cosine(sentence_embedding, query_embedding)
        if score > similarity_treshold:
            highlights.append([sentence, score, sentences_dict[sentence]['text']])
    highlights = sorted(highlights, key = lambda x : x[1], reverse = True)
    return highlights


#### Create embeddings for sentences in the database

This piece of code it's to be executed only once every time the database is chaged or we want to get the embeddings of a new database. For example, we are going to use it once for El Salvador policies and we don't need to use it again until we add new policies to this database. Instead, whenever we want to run experiments on this database, we will load the json files with the embeddings which are in the "input" folder.

So, the next cell will be kept commented for safety reasons. Un comment it and execute it whenvere you need it.

In [None]:
# Ti = time.perf_counter()

# transformer_names =['xlm-r-bert-base-nli-stsb-mean-tokens', 'distiluse-base-multilingual-cased-v2']

# path = "../input/"
# filename = "Embeddings_ElSalvador_201217.json"
# file = path + filename

# create_sentence_embeddings(transformer_names, sentences, file)

# Tf = time.perf_counter()

# print(f"The building of a sentence embedding database for El Salvador in the two current models has taken {Tf - Ti:0.4f} seconds")

#### Loading the embeddings for database sentences

In [14]:
path = "../input/"
filename = "Embeddings_ElSalvador_201217.json"
file = path + filename

with open(file, "r") as f:
    sentence_embeddings = json.load(f)

In [15]:
len(sentence_embeddings)

2

### Running the search

#### Basic search with single test query

In [None]:
# First load transformers into the model by choosing one model from index
transformer_names =['xlm-r-bert-base-nli-stsb-mean-tokens', 'distiluse-base-multilingual-cased-v2']
model_index = 0
model = SentenceTransformer(transformer_names[model_index])

In [None]:
# Now, perform single query searches by manually writing a query in the corresponding field
Ti = time.perf_counter()

highlighter_query = "La Policia al tener conocimiento de cualquier infraccion"
similarity_limit = 0.00

label_1 = highlight(transformer_names[model_index], model, sentence_embeddings, sentences, highlighter_query, similarity_limit)

Tf = time.perf_counter()

print(f"similarity search for El Salvador sentences done in {Tf - Ti:0.4f} seconds")

In [None]:
print(len(label_1))
label_1[0:10]

##### Inspecting the results

In [None]:
print(highlighter_query)
label_1[0:40]

##### Further filtering of the results by using the similarity score

In [None]:
similarity_treshold = 0.5
filtered = [row for row in label_1 if row[1] > similarity_treshold]
filtered

##### Exporting results

In [None]:
# Create a dataframe
export_query = pd.DataFrame(label_1)
#export file 
export_query = pd.DataFrame(label_1)

#### Multiparameter search design

In [None]:
# This piece of code is just to limit the amount of items in the incentives dictionary for testing purposes
# The "incentives" dictionari contains the keywords that represent policy instruments. This is to be used in
# the following cell where we make a search based on (1) the keywords themselves (2) the first sentence found in policy documents
# with each of the keywords.

# dicti = {}
# i= 0
# for key in incentives:
#     if i < 2:
#         dicti[key] = 0
#     i += 1
# incentives = dicti
    

##### Experiment 1

What we do in this experiment is check the capacity of two models:

* xlm-r-bert-base-nli-stsb-mean-tokens
* distiluse-base-multilingual-cased-v2

to find policy instruments for incentives, based in 9 categories:

asistencia técnica; ayuda; bono; crédito; deducción; devolución; incentivo; insumo and multa

We will compare two approaches:
1. to perform a search with the keyword itself
2. to perform a search with one of the sentences found in El Salvador policies which contain the keyword.

User set parameters:
<strong>Transformer names:</strong> this is a list with the different models to test. There are currently two.

<strong>Similarity limit:</strong> just to filter out the search matches whith low similarity.

<strong>Number of search results:</strong> the search is against all 40.000 sentences in the database, but we don't want to keep all, just the most relevant so we take 1500 as the keyword with most direct matches is "multa" with some 1352 matches.


In [16]:
transformer_names =['xlm-r-bert-base-nli-stsb-mean-tokens', 'distiluse-base-multilingual-cased-v2']
similarity_limit = 0.2
number_of_search_results = 1500
results = {}

for transformer in transformer_names:
    model = SentenceTransformer(transformer)
    results[transformer] = {}
    for incentive in incentives:
        queries = [incentive, [row for row in keyword_in_sentences if row[0] == incentive][0][2]]
        Ti = time.perf_counter()
        for query in queries:
            similarities = highlight(transformer, model, sentence_embeddings, sentences, query, similarity_limit)
            results[transformer][query] = similarities[0:number_of_search_results]
            Tf = time.perf_counter()
            print(f"similarity search for model {transformer} and query {query} it's been done in {Tf - Ti:0.4f} seconds")

path = "../output/"
filename = "Experiment_201215_jordi_1500.json"
file = path + filename
with open(file, 'w') as fp:
    json.dump(results, fp, indent=4)

similarity search for model xlm-r-bert-base-nli-stsb-mean-tokens and query asistencia técnica it's been done in 1.9174 seconds
similarity search for model xlm-r-bert-base-nli-stsb-mean-tokens and query - Facilitar asistencia tecnica a los pequeños propietarios de bosques y areas naturales privadas, para que elaboren planes de manejo sostenibles, para ser incorporados en los programas de incentivos nacionales y locales existentes it's been done in 4.1508 seconds
similarity search for model xlm-r-bert-base-nli-stsb-mean-tokens and query ayuda it's been done in 1.8700 seconds
similarity search for model xlm-r-bert-base-nli-stsb-mean-tokens and query Para esta clase de ayuda sera prioritario planificar proyectos que sean gestionados ante los organismos respectivos it's been done in 3.7846 seconds
similarity search for model xlm-r-bert-base-nli-stsb-mean-tokens and query bono it's been done in 1.8773 seconds
similarity search for model xlm-r-bert-base-nli-stsb-mean-tokens and query La Munic

similarity search for model distiluse-base-multilingual-cased-v2 and query Al quedar firme la ejecucion, el infractor hara efectiva la multa en la Tesoreria Municipal, en el plazo de tres dias posteriores a su requerimiento La multa podra conmutarse por arresto del infractor, el cual no podra exceder de cinco dias it's been done in 3.4283 seconds


## Results analysis

This is a temporary section to explore how to analyze the results. It is organized with the same structure as the section <strong>Defining queries</strong> as we are exploring the best search strategies based on different types of queries.

### N-grams approach

### Parts-of-speach approach

### Keyword approach

#### Experiment 1

In [38]:
# Loading the results

path = "../output/"
filename = "Experiment_201215_jordi_1500.json"
file = path + filename

with open(file, "r") as f:
    experiment_results = json.load(f)

# Adding the rank to each result
for model in experiment_results:
    for keyword in experiment_results[model]:
        i = 1
        for result in experiment_results[model][keyword]:
            result.append(i)
            i += 1

# Building a dictionary with all sentences found by exact keyword matching. The dictionary is of the form:
# {"<incetive>" : ["<sentence_id_1>", ... "<sentence_id_n>"]}

keyword_hits = {}
for item in keyword_in_sentences:
    if item[0] in keyword_hits:
        keyword_hits[item[0]].append(item[1])
    else:
        keyword_hits[item[0]] = []
        keyword_hits[item[0]].append(item[1])            


In [30]:
# Building a dictionary with all sentences found by exact keyword matching. The dictionary is of the form:
# {"<incetive>" : ["<sentence_id_1>", ... "<sentence_id_n>"]}

keyword_hits = {}
for item in keyword_in_sentences:
    if item[0] in keyword_hits:
        keyword_hits[item[0]].append(item[1])
    else:
        keyword_hits[item[0]] = []
        keyword_hits[item[0]].append(item[1])
    

In [41]:
keyword_hits

{'incentivo': ['00a55af_32',
  '00a55af_75',
  '00a55af_82',
  '0d090c5_34',
  '15169f7_31',
  '15169f7_71',
  '15169f7_77',
  '2369a4b_74',
  '48bdadf_142',
  '48bdadf_204',
  '51a0d9e_30',
  '5bd487f_10',
  '5bd487f_18',
  '5bd487f_21',
  '5c54e9a_56',
  '66d8c54_33',
  '66d8c54_77',
  '66d8c54_83',
  '70be962_10',
  '70be962_11',
  '70be962_22',
  '70be962_52',
  '7289291_64',
  '753300f_106',
  '753300f_287',
  '76b6e7f_107',
  '7bd959f_213',
  '7c3fe0d_171',
  '7c3fe0d_172',
  '7c3fe0d_175',
  '7c3fe0d_176',
  '7c3fe0d_177',
  '85d7464_28',
  '88ad2ff_52',
  '8fb61f8_178',
  '901a837_73',
  '930897c_256',
  'a9bb3c2_9',
  'cefe040_65',
  'cfe0ee3_92',
  'd491a15_173',
  'd491a15_174',
  'd491a15_177',
  'd491a15_178',
  'd491a15_179',
  'd6c44cc_184',
  'd6c44cc_185',
  'd6c44cc_187',
  'd6c44cc_188',
  'd6c44cc_189',
  'e2bb383_21',
  'e2bb383_110',
  'e2bb383_179',
  'e2bb383_182',
  'e3b7f39_231',
  'e5e0011_84',
  'e97f743_12',
  'eb99685_144',
  'f13fd1f_116',
  'f5d3dd3_62']

In [39]:
for model in experiment_results:
    for keyword in experiment_results[model]:
        print("\n**************************\n",model,"*****", keyword, "\n**************************\n" ) 
        for result in experiment_results[model][keyword]:
            print(result)
            i += 1


**************************
 xlm-r-bert-base-nli-stsb-mean-tokens ***** asistencia técnica 
**************************

['5f9ee58_461', 0.7669254702624007, 'El Dictamen Tecnico de Proyecto comprende, segun sea el caso:']
['b189d15_458', 0.7669254702624007, 'El Dictamen Tecnico de Proyecto comprende, segun sea el caso:']
['0c5f5ad_470', 0.7588855334407649, ') El Dictamen Tecnico de Proyecto comprende, segun sea el caso:']
['f6b2045_465', 0.7588855334407649, ') El Dictamen Tecnico de Proyecto comprende, segun sea el caso:']
['9eee974_36', 0.757205641055339, 'Nivel de apoyo tecnico y administrativo: sera el responsable de facilitar las condiciones que viabilicen la labor de la institucion; y']
['902bedc_40', 0.749563348790238, 'Colaboracion y Asistencia Tecnica Interinstitucional']
['7eab6c1_422', 0.7424573906129105, 'Memoria descriptiva del proyecto y descripcion de las especificaciones tecnicas para ejecucion de las obras']
['9de08a6_34', 0.7417946657549259, 'El equipamiento necesario p

['3416cbf_20', 0.6562968099737365, 'Gestionar fondos para la ejecucion de proyectos que mejoren el medio ambiente y los recursos naturales del municipio']
['88ad2ff_260', 0.6562844561664334, 'Debe difundirse la utilizacion de practicas agricolas sostenibles, compatibles con la proteccion del medio ambiente, a traves de las medidas que se proponen en el Programa de Intervencion del POA, y en coordinacion con los Ministerios de Agricultura y Ganaderia, y el de Medio Ambiente y Recursos Naturales']
['060720b_4', 0.6561709565600657, 'Que los bosques son recursos naturales que constituyen a la conservacion, incremento, y equilibrio de los demas recursos naturales']
['6c5f083_4', 0.6561709565600657, 'Que los bosques son recursos naturales que constituyen a la conservacion, incremento, y equilibrio de los demas recursos naturales']
['71e0ae7_48', 0.6560471535218025, 'e) La ejecucion de obras de forestacion destinadas a la proteccion y conservacion de los caminos vecinales y principales y de a

['2418c2b_302', 0.37237912500930725, '+ — Para la limpieza de los senderos se podra permitir la remocion de piedras, podas de arboles o inclusive la tala de algun arbusto, unicamente con el fin de mejorar el acceso, la comodidad o seguridad del espacio, o de brindar una mejor oportunidad interpretativa o estetica del atractivo']
['8436b4f_59', 0.37235996438545005, '- Se consideran de especial proteccion los recursos hidricos que nutren las fuentes y manantiales existentes en el termino municipal de Santa Cruz Michapa']
['326cdc5_34', 0.37213577874257564, 'j) Autorizar la existencia y funcionamiento de los Consejos Regionales']
['b821b11_232', 0.37212271786496454, 'l) Alterar, ceder o hacer uso indebido de las autorizaciones extendidas por la autoridad competente;']
['50f28b4_26', 0.37210828223123926, 'b) El agua de los manantiales es un patrimonio de la comunidad']
['5295ca3_19', 0.37210828223123926, 'b) El agua de los manantiales es un patrimonio de la comunidad']
['0c5f5ad_65', 0.372

['ae8b0c2_35', 0.5123450690755533, 'Asi se llega al vertice Nor Poniente, que es donde se inicio la descripcion']
['ae8b0c2_45', 0.5123450690755533, 'Asi se llega al vertice Nor Poniente, que es donde se inicio la descripcion']
['ae8b0c2_55', 0.5123450690755533, 'Asi se llega al vertice Nor Poniente, que es donde se inicio la descripcion']
['ae8b0c2_66', 0.5123450690755533, 'Asi se llega al vertice Nor Poniente, que es donde se inicio la descripcion']
['ae8b0c2_75', 0.5123450690755533, 'Asi se llega al vertice Nor Poniente, que es donde se inicio la descripcion']
['af94df5_32', 0.5123450690755533, 'Asi se llega al vertice Nor Poniente, que es donde se inicio la descripcion']
['b39f36e_24', 0.5123450690755533, 'Asi se llega al vertice Nor Poniente, que es donde se inicio la descripcion']
['d763534_21', 0.5123450690755533, 'Asi se llega al vertice Nor Poniente, que es donde se inicio la descripcion']
['ff4e801_29', 0.5123450690755533, 'Asi se llega al vertice Nor Poniente, que es donde s

['50f28b4_49', 0.6689822881504817, '- Los bosques y zonas arboladas son de suma importancia para los mantos acuiferos y el clima local, ello proporciona bienestar de toda la poblacion']
['6dd2466_68', 0.6689822881504817, '- Los bosques y zonas arboladas son de suma importancia para los mantos acuiferos y el clima local, ello proporciona bienestar de toda la poblacion']
['e4e4d6c_3', 0.6689586055944158, 'Que es competencia de los Municipio y obligacion de los Concejos Municipales, incrementar y proteger los recursos naturales, tanto renovables como no renovables; asi como contribuir a la preservacion de dichos recursos y de la salud de sus habitantes; de conformidad con los Art 4, numeral 10 y 31, numeral 6 del Codigo Municipal']
['55fdbb4_51', 0.6687184058636698, '- La Municipalidad promovera y organizara grupos civiles y forestales para la proteccion de estos recursos y velara siempre porque se cultiven solo especies nativas, aprovechando las que generen alimento a la poblacion (fruta

['a0dcb6f_237', 0.2965426937380259, 'Planes Parciales de Rehabilitacion de Centros Historicos']
['a765ce8_230', 0.2965426937380259, 'Planes Parciales de Rehabilitacion de Centros Historicos']
['af82608_224', 0.2965426937380259, 'Planes Parciales de Rehabilitacion de Centros Historicos']
['b016406_230', 0.2965426937380259, 'Planes Parciales de Rehabilitacion de Centros Historicos']
['b189d15_59', 0.2965426937380259, 'Planes Parciales de Rehabilitacion de Centros Historicos']
['db5877b_228', 0.2965426937380259, 'Planes Parciales de Rehabilitacion de Centros Historicos']
['e2a3198_223', 0.2965426937380259, 'Planes Parciales de Rehabilitacion de Centros Historicos']
['f6b2045_67', 0.2965426937380259, 'Planes Parciales de Rehabilitacion de Centros Historicos']
['27d89cb_96', 0.2964926467759037, '- De la zona de proteccion de los recursos hidricos']
['34f1f1d_160', 0.2964926467759037, '- De la zona de proteccion de los recursos hidricos']
['498128e_161', 0.2964926467759037, '- De la zona de 

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



['8e81fe2_468', 0.3854961129743013, 'La tasa que deberan pagar las empresas o personas juridicas o naturales por este servicio, sera la establecida en la ordenanza de tasas municipales del municipio']
['5a4270a_54', 0.38541007595565113, '- Por talar arboles sin permiso de la Unidad Ambiental, y si es en lo rural, sin el permiso del area forestal del CENTA, se sancionara con una multa economica, a la persona que sea sorprendida por primera vez, y si reincidiera, se duplicara la multa y se aplicara cinco dias de trabajo forzoso en beneficio de la comunidad']
['b107cc9_81', 0.38540046534202754, 'Ademas de las caracteristicas señaladas en el articulo anterior para las especies permitidas de sembrar o plantar en el espacio publico del municipio, se encuentra la de su atractiva floracion o abundante follaje, cualidades que contribuyen al ornato de la ciudad y al esparcimiento ciudadano, sin perjudicar la via publica y sin poner en peligro la seguridad de los transeuntes o los bienes de sus h

['af82608_87', 0.33596376766669445, '- Para los efectos de la presente ordenanza, la identificacion de las diferentes clases de usos de suelos, estara establecida en el Plano Normativo de Clasificacion de Usos del Suelos del Municipio']
['db5877b_90', 0.33596376766669445, '- Para los efectos de la presente ordenanza, la identificacion de las diferentes clases de usos de suelos, estara establecida en el Plano Normativo de Clasificacion de Usos del Suelos del Municipio']
['e2a3198_85', 0.33596376766669445, '- Para los efectos de la presente ordenanza, la identificacion de las diferentes clases de usos de suelos, estara establecida en el Plano Normativo de Clasificacion de Usos del Suelos del Municipio']
['2ce32ee_34', 0.3358340333109475, 'La actualizacion y modificacion al mapa de Zonificacion Municipal/urbano y Usos Globales del suelo, se sujetara a lo señalado en el art 9 de esta Ordenanza']
['55b3aab_36', 0.3358340333109475, 'La actualizacion y modificacion al mapa de Zonificacion Mun

['0f03aa3_12', 0.3805860829050789, 'Ordenamiento en sus diferentes tipologias o jerarquizacion; posterior a la aprobacion por el Concejo Municipal, se emitira la ordenanza respectiva, la cual entrara en vigencia ocho dias despues de su publicacion en el Diario Oficial']
['f4e4e13_31', 0.38051635367193704, 'De estar incompletos, lo hara saber al DAJ para que efectue prevencion al interesado por medio de auto para que subsane la observacion señalada otorgando un plazo de quince dias habiles para que la asociacion subsane los defectos encontrados en su solicitud o documentacion anexa']
['074b95f_84', 0.38021489365491634, '- La Unidad Ambiental Municipal y otros funcionarios y empleados de la Municipalidad, Agentes de la Policia Nacional Civil, y guardabosques tendran facultades para detener a los transgresores in fraganti, decomisar lo que porte y que pueda constituir prueba en la comision de la infraccion y recibir las denuncias de los hechos de que se trate y poner al detenido a la orde