## Introduction to Transformers

In this notebook, we're going to install a transformer model, analyze the embedding output, and compare some vectors.

In [16]:
import sys
sys.path.append("..")
from aips import *
import pandas

## Listing 13.4

In [17]:
from sentence_transformers import SentenceTransformer
stsb = SentenceTransformer("roberta-base-nli-stsb-mean-tokens")
print(stsb)

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': True}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
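
The printed model summary shows a `Pooling` module with `pooling_mode_mean_tokens` set to `True`, which means the per-token vectors are averaged into one sentence vector. The following is a minimal sketch of that idea in plain Python, not the library's actual implementation. The `mean_pool` function, the toy 3-dimensional vectors, and the mask values are all assumptions made for illustration; the real model works on 768-dimensional tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    dims = len(token_embeddings[0])
    totals = [0.0] * dims
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / count for total in totals]

# Hypothetical toy data: three token vectors, the last one is padding.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

Masking matters because shorter phrases in a batch are padded to a common length, and the padding positions should not pull the average.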


## Listing 13.5

In [19]:
phrases = ["it's raining hard", "it is wet outside", "cars drive fast", "motorcycles are loud"]
embeddings = stsb.encode(phrases, convert_to_tensor=True)
print("Number of embeddings:", len(embeddings))
print("Dimensions per embedding:", len(embeddings[0]))
print("The embedding feature values of \"it's raining hard\":")
print(embeddings[0])

Number of embeddings: 4
Dimensions per embedding: 768
The embedding feature values of "it's raining hard":
tensor([ 1.1609e-01, -1.8422e-01,  4.1023e-01,  2.8474e-01,  5.8746e-01,
         7.4418e-02, -5.6910e-01, -1.5300e+00, -1.4629e-01,  7.9517e-01,
         5.0953e-01,  3.5076e-01, -6.7288e-01, -2.9603e-01, -2.3220e-01,
         4.8475e-01,  9.9531e-01,  6.1437e-01,  7.8995e-02, -7.7781e-01,
         1.0021e+00,  3.5468e-01, -1.7309e-01,  3.9410e-01,  1.6540e-01,
        -1.2335e-01,  6.1811e-01,  3.6482e-01,  3.2900e-01,  1.0812e+00,
        -5.4269e-01, -2.2409e-01, -1.4409e+00,  8.2625e-01, -1.1814e+00,
        -4.4101e-01,  5.8892e-01, -1.5328e+00, -2.9688e-01, -6.7424e-03,
         4.8688e-01,  1.0616e-01, -2.7084e-01,  2.4105e-02,  2.8154e-01,
        -2.0851e-01, -2.3183e-01, -1.0750e+00, -6.3038e-01, -7.4111e-02,
         3.0137e-01, -1.2426e+00, -2.5925e-01, -3.2736e-02,  9.0476e-01,
         1.0643e-02,  1.2886e-01,  8.2835e-01, -1.5106e+00, -4.7442e-01,
         1.5537e+

## Listing 13.6

In [20]:
from sentence_transformers import util as STutil
similarities = STutil.pytorch_cos_sim(embeddings, embeddings)
print("The shape of the resulting similarities:", similarities.shape)

The shape of the resulting similarities: torch.Size([4, 4])
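
The result is 4 x 4 because every embedding is compared against every embedding, including itself. As a rough sketch of what a pairwise cosine similarity computes, here is a plain-Python version on toy 2-dimensional vectors. The helper names `cosine_similarity` and `similarity_matrix` and the `toy` data are assumptions for illustration and are not part of `sentence_transformers`.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Compare every vector with every vector, giving an n x n matrix."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Hypothetical toy vectors standing in for the 768-dimensional embeddings.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
matrix = similarity_matrix(toy)
print(len(matrix), len(matrix[0]))
```

The diagonal holds each vector compared with itself, and the matrix is symmetric, so only the entries above the diagonal carry distinct pair scores.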


## Listing 13.7

In [21]:
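# Collect each unique pair of phrases once (the upper triangle of the matrix),
# skipping self-comparisons and mirrored duplicates.
a_phrases = []
b_phrases = []
scores = []
for a in range(len(similarities) - 1):
    for b in range(a + 1, len(similarities)):
        a_phrases.append(phrases[a])
        b_phrases.append(phrases[b])
        scores.append(float(similarities[a][b]))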

df = pandas.DataFrame({"phrase a": a_phrases, "phrase b": b_phrases, "score": scores})
df.sort_values(by=["score"], ascending=False, ignore_index=True)

| | phrase a | phrase b | score |
|---|---|---|---|
| 0 | it's raining hard | it is wet outside | 0.66906 |
| 1 | cars drive fast | motorcycles are loud | 0.590783 |
| 2 | it's raining hard | cars drive fast | 0.281166 |
| 3 | it's raining hard | motorcycles are loud | 0.2808 |
| 4 | it is wet outside | motorcycles are loud | 0.204867 |
| 5 | it is wet outside | cars drive fast | 0.138172 |
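
Four phrases give six rows because the nested loops in Listing 13.7 visit each unordered pair exactly once. One possible alternative, shown here only as a sketch, is to generate the same index pairs with `itertools.combinations` from the standard library. The `pairs` variable is an assumption for illustration; the phrases are the ones from Listing 13.5.

```python
from itertools import combinations

phrases = ["it's raining hard", "it is wet outside",
           "cars drive fast", "motorcycles are loud"]

# Every unordered pair of indices (a, b) with a < b, in the same order
# the nested loops would produce them.
pairs = list(combinations(range(len(phrases)), 2))

for a, b in pairs:
    print(phrases[a], "|", phrases[b])
print("Number of pairs:", len(pairs))
```

For n phrases this yields n * (n - 1) / 2 pairs, which is 6 when n is 4.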


Up next: [Natural Language Autocomplete](3.natural-language-autocomplete.ipynb)