# Sentence Transformers 

### all-MiniLM-L6-v2 

This is a sentence-transformers model: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks such as clustering or semantic search.


#### Background
The project behind this model aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective. The model starts from the pretrained nreimers/MiniLM-L6-H384-uncased model and was fine-tuned on a dataset of 1B sentence pairs.

It uses a contrastive learning objective: given one sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in the training dataset.
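The objective described above can be sketched as a softmax over in-batch candidates, where every other positive in the batch acts as a randomly sampled negative. This is a minimal NumPy illustration, not the training code used for the model; the function name `in_batch_contrastive_loss`, the `scale` value of 20, and the random toy vectors are assumptions made for the example.

```python
import numpy as np

def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Cross-entropy over cosine similarities; row i's correct candidate is column i."""
    # Normalize rows so that dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                             # shape: (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The true partner of anchor i sits on the diagonal
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))                         # toy "sentence" vectors
positives = anchors + 0.01 * rng.normal(size=(8, 16))      # near-duplicates as true pairs

loss_matched = in_batch_contrastive_loss(anchors, positives)
# Shifting the positives by one row pairs every anchor with the wrong sentence
loss_mismatched = in_batch_contrastive_loss(anchors, np.roll(positives, 1, axis=0))
print(loss_matched, loss_mismatched)
```

The loss is small when each anchor is most similar to its own partner and grows when the pairing is broken, which is the signal that pulls paired sentences together during fine-tuning.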

#### Intended uses
The model is intended to be used as a sentence and short-paragraph encoder. Given an input text, it outputs a vector that captures the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.

By default, input text longer than 256 word pieces is truncated.



#### References  
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2


In [1]:
import openpyxl  # engine pandas uses to read .xlsx files
import pandas as pd
df_report = pd.read_csv('../500-reports-dataset/010c3e42-f753-47e8-bf0a-e9ed383a215b_sent.csv')
df_kpi = pd.read_excel('../KPI sample dataset.xlsx')



In [2]:
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer, util

import torch
import torch.nn.functional as F

# Mean pooling: average the token embeddings, using the attention mask so padding tokens are ignored
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
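
To see why the attention mask matters in `mean_pooling`, the same computation can be reproduced on a tiny hand-made batch. This is a sketch in NumPy rather than PyTorch to keep it dependency-light; the helper name `masked_mean` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token vectors per sequence, counting only positions where the mask is 1."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# Two sequences of three tokens with 2-D embeddings; the last token of the
# first sequence is padding and carries a deliberately large value.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean(token_embeddings, attention_mask)
naive = token_embeddings.mean(axis=1)  # ignores the mask, so padding leaks into the average
print(pooled)
print(naive)
```

In the first sequence the masked average uses only the two real tokens, while the naive average is dominated by the padding vector.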




In [3]:
df_kpi.columns

Index(['Sentences', 'KPI', 'Quantity', 'Indication', 'Time'], dtype='object')

In [4]:
df_kpi = df_kpi[['Sentences','KPI']]

In [5]:
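# Reassign instead of using inplace=True: df_kpi is a column subset, and in-place
# changes on it can trigger pandas' SettingWithCopyWarning
df_kpi = df_kpi.dropna()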

In [6]:
df_report.columns

Index(['company', 'paragraphs', 'sentences'], dtype='object')

In [7]:
df_report.shape

(967, 3)

In [8]:
sentences = df_report.sentences.to_list()[0:250]
sentences[0]

'as taken back in 20 * the photo | sustainability report 2020 foreword strategy & governance promotion report foreword by the managing board green and social bonds eco-balance human resources corporate citizenship dear readers, sustainable urban and neighbourhood development, resource-efficient circular economy, environmentally friendly mobility, climate-friendly energy supply and equal opportunities in education and digitalisation in our ca- pacity as the promotional bank for the state of north rhine-westphalia, we support and accompany enterprises and municipalities in improving the living conditions in nrw.'

In [9]:
kpi_sentences = df_kpi.KPI.to_list()
kpi_sentences

['use of electricity from renewable sources\ntotal energy consumption from fossil fuel sources',
 'the energy expenditure',
 'a greenhouse gas emissions reduction target',
 'the emission',
 'with the help of energy management\nenergy consumption',
 'its co2 emissions',
 'ghg emissions\ncarbon neutrality',
 'progress on climate action acciona met\nits emissions reduction targets',
 'the scope 3 emissions figure\nwhile the scope 3 emissions overall were down',
 '5,000 renewable mw',
 'the energy demand of conventional flat panel radiators',
 'overall the efficient products of the hvac division will save up annually\nsince heat pumps',
 'energy consumption',
 'energy consumption amounted',
 'new guidelines for the proper use of energy\nwith the appropriate training of employees\na reduction in electricity consumption',
 'a reduction in energy consumption',
 'the new cooling vestibules\nto the increase in energy consumption',
 '_emissions reduce\nthe total scope_1 _emissions of the group a

In [10]:
# Embed the report sentences and the KPI phrases with the same model

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Tokenize sentences
encoded_input_sentences = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Tokenize KPIs
encoded_input_kpis = tokenizer(kpi_sentences, padding=True, truncation=True, return_tensors='pt')


In [11]:
# Compute token embeddings
with torch.no_grad():
    model_output_sentences = model(**encoded_input_sentences)

# Perform pooling
sentence_embeddings = mean_pooling(model_output_sentences, encoded_input_sentences['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings[0])

Sentence embeddings:
tensor([ 2.9378e-02,  6.3231e-02,  7.3311e-03,  3.3939e-02,  1.0283e-02,
         2.5590e-02, -2.1547e-02, -5.2118e-02, -1.2180e-02, -2.2989e-02,
         1.7675e-03, -8.9223e-02,  6.1453e-03,  4.3522e-02,  3.3389e-02,
        -3.2338e-02, -4.2420e-02, -2.8161e-02, -4.7802e-04, -3.8429e-02,
        -4.7781e-02, -5.7008e-02,  3.4144e-03, -7.5171e-03, -1.9518e-02,
        -7.9651e-03,  2.4505e-02, -1.7000e-02, -2.9347e-02, -1.7121e-02,
         6.5101e-02,  2.8509e-02, -2.5904e-02, -4.4641e-02,  8.4384e-02,
         9.3945e-02,  3.8810e-02, -1.6127e-02, -5.8611e-02, -1.6423e-02,
        -4.0294e-02, -1.1505e-01, -4.9678e-02, -2.6736e-02, -1.2339e-02,
         3.2496e-02,  5.6462e-02,  4.0059e-03, -7.9637e-02, -5.6360e-02,
         6.2368e-02, -1.1834e-01,  4.1614e-02,  1.6330e-02,  1.5667e-02,
         1.3641e-02, -1.2777e-02,  1.7720e-02, -1.3524e-02, -7.4504e-02,
         7.5517e-02, -2.8611e-02, -6.0353e-02,  5.2160e-02,  3.8249e-02,
         2.8641e-03, -3.8119e-

In [12]:
# Compute token embeddings
with torch.no_grad():
    model_output_kpis = model(**encoded_input_kpis)

# Perform pooling
sentence_embeddings_kpis = mean_pooling(model_output_kpis, encoded_input_kpis['attention_mask'])

# Normalize embeddings
sentence_embeddings_kpis = F.normalize(sentence_embeddings_kpis, p=2, dim=1)

print("Sentence embeddings KPIs:")
print(sentence_embeddings_kpis)

Sentence embeddings KPIs:
tensor([[ 0.0045,  0.1391, -0.0004,  ..., -0.0064,  0.0684,  0.0197],
        [ 0.0204,  0.0894,  0.0036,  ...,  0.0024, -0.0033,  0.0207],
        [ 0.0493,  0.1274,  0.0173,  ..., -0.0529, -0.0207,  0.0179],
        ...,
        [-0.0315,  0.1166, -0.0232,  ...,  0.0052, -0.0051,  0.0163],
        [-0.0555,  0.0511, -0.0658,  ..., -0.0954,  0.0428, -0.0435],
        [-0.0728,  0.1049,  0.0672,  ..., -0.0865,  0.0394, -0.0360]])


In [13]:
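benchmark_similarity = 0.5  # minimum cosine similarity for a sentence/KPI pair to be kept
similarity_score = []
for j in range(len(sentence_embeddings)):
    # NOTE: the inner loop starts at j, so sentence j is only compared with KPIs
    # at index >= j; use range(len(sentence_embeddings_kpis)) to compare every pair.
    for i in range(j, len(sentence_embeddings_kpis)):
        score = util.pytorch_cos_sim(sentence_embeddings[j], sentence_embeddings_kpis[i]).item()
        if score > benchmark_similarity:
            similarity_score.append([j, sentences[j], i, kpi_sentences[i], score])

len(similarity_score)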

41
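
The pairwise loop above can also be written as a single matrix product: once the embeddings are L2-normalized, the dot product of two rows equals their cosine similarity. The following is a minimal sketch on toy vectors using NumPy; the function name `pairs_above_threshold` and the toy data are illustrative assumptions. Unlike the loop above, it scores every sentence against every KPI, so its counts on the real data may differ from the output shown.

```python
import numpy as np

def pairs_above_threshold(sentence_vecs, kpi_vecs, threshold=0.5):
    """Return (sentence_index, kpi_index, score) for every pair scoring above threshold."""
    # L2-normalize rows so that the dot product equals cosine similarity
    s = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    k = kpi_vecs / np.linalg.norm(kpi_vecs, axis=1, keepdims=True)
    scores = s @ k.T  # shape: (n_sentences, n_kpis)
    return [(int(j), int(i), float(scores[j, i]))
            for j, i in np.argwhere(scores > threshold)]

# Toy 2-D vectors standing in for the 384-dimensional embeddings
toy_sentences = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_kpis = np.array([[1.0, 0.1], [-1.0, 0.0], [0.1, 1.0]])

matches = pairs_above_threshold(toy_sentences, toy_kpis)
print(matches)
```

For a few hundred sentences and KPI phrases the full score matrix is small, so computing it in one step is usually simpler and faster than calling the similarity function once per pair.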

In [19]:
df_temp = pd.DataFrame(similarity_score)
df_temp.shape

(41, 5)

In [21]:
df_temp.columns = ['sentence_index','sentence','kpi_index','kpi','similarity score']
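
With named columns, the matches can be ranked so that only the best-scoring KPI is kept for each sentence. This is a sketch on a small hand-made frame that reuses the column names from the cell above; the helper `best_kpi_per_sentence` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def best_kpi_per_sentence(df, score_col='similarity score'):
    """Keep the highest-scoring KPI match for each sentence."""
    ranked = df.sort_values(score_col, ascending=False)
    # After sorting, the first row seen for each sentence is its best match
    return ranked.drop_duplicates('sentence_index').reset_index(drop=True)

# Toy stand-in for df_temp: sentence 0 matches two KPIs, sentence 1 matches one
toy = pd.DataFrame({
    'sentence_index': [0, 0, 1],
    'kpi_index': [3, 5, 2],
    'similarity score': [0.62, 0.71, 0.55],
})

best = best_kpi_per_sentence(toy)
print(best)
```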

In [27]:
pd.set_option("display.max_columns", None)
pd.options.display.max_seq_items = 2000


In [29]:
df_temp[['sentence','kpi']].values

array([['as taken back in 20 * the photo | sustainability report 2020 foreword strategy & governance promotion report foreword by the managing board green and social bonds eco-balance human resources corporate citizenship dear readers, sustainable urban and neighbourhood development, resource-efficient circular economy, environmentally friendly mobility, climate-friendly energy supply and equal opportunities in education and digitalisation in our ca- pacity as the promotional bank for the state of north rhine-westphalia, we support and accompany enterprises and municipalities in improving the living conditions in nrw.',
        'the for the better sustainability framework including the objective\nour business circular\nabout our climate targets'],
       ['after all, sustainable action is not only our statuto- ry mission, but also a central guiding principle and essential criterion in our business policy decisions.',
        'the for the better sustainability framework including the ob

In [23]:
df_temp.to_excel('sentence-transformer-based-similarity score with KPIs.xlsx')