<a href="https://colab.research.google.com/github/unpackAI/unpackai/blob/main/examples/feature_encoder_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Encoder and Neural Search
> Many good libraries wrap all of this up nicely for production, but building it by hand shows how simple the underlying idea is.

## Installations

In [2]:
!pip install -q transformers
!pip install -Uqq fastai
!pip install -q unpackai==0.1.8.9
!pip install -q sentence-transformers
!pip install -q spacy

## Imports

In [3]:
from transformers import AutoModel, AutoTokenizer
from unpackai.nlp import *
from unpackai.nlp import Textual
from forgebox.imports import *
from forgebox.df import PandasDisplay
from sentence_transformers import SentenceTransformer
from tqdm.notebook import tqdm
from spacy import load

Load spaCy to split the text into sentences

In [4]:
spacy = load("en_core_web_sm")

## Load Model

In [5]:
PRETRAINED = "distilbert-base-uncased"
model = SentenceTransformer(PRETRAINED)

Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Get the data

In [6]:
textual = Textual.from_url("https://www.gutenberg.org/files/34206/34206-0.txt")
textual

Text (1495417 chars), textual(),
    train_path, val_path = textual.create_train_val()

In [7]:
textual()

In [8]:
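# spaCy rejects texts longer than max_length (1,000,000 characters by default); the book has 1,495,417
spacy.max_length = 1_500_000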

Cut the entire book into sentences

In [9]:
%%time
sentences = [sentence.text for sentence in tqdm(spacy(textual.text).sents)]

CPU times: user 43 s, sys: 3.11 s, total: 46.1 s
Wall time: 45.9 s
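To illustrate what "cutting into sentences" produces, here is a minimal sketch of a regex-based splitter. It is not what spaCy does internally: `naive_sentences` is a hypothetical helper that simply splits after `.`, `!` or `?`, so it will trip over abbreviations and quotations that spaCy's pipeline handles far better.

```python
import re
from typing import List

def naive_sentences(text: str) -> List[str]:
    """Hypothetical helper: split text after '.', '!' or '?' followed by whitespace."""
    # Collapse newlines and repeated spaces so hard-wrapped lines join into one sentence
    flat = re.sub(r"\s+", " ", text).strip()
    # Split at whitespace that directly follows sentence-ending punctuation
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [p for p in parts if p]

sample = "The ship left at dawn.\nA storm rose! Did the crew\nsurvive? Nobody knows."
for s in naive_sentences(sample):
    print(s)
```

The output is a plain list of strings, the same kind of `sentences` list that the rest of the notebook encodes and searches.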


Encode sentences into vectors

In [10]:
%%time
vectors = model.encode(sentences, device="cuda:0", batch_size=32, show_progress_bar=True)

CPU times: user 1min 37s, sys: 18.8 s, total: 1min 56s
Wall time: 2min 9s


In [11]:
vectors.shape

(10879, 768)
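Each of the 10879 sentences ends up as a single 768-dimensional vector, even though the transformer produces one vector per token. The notebook does not state how the token vectors are reduced; the sketch below assumes mean pooling over the non-padding tokens, a common choice for sentence encoders. `mean_pool` and the random inputs are hypothetical, used only to show the shapes involved.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: (batch, tokens, hidden)
    attention_mask:   (batch, tokens), 1 for real tokens and 0 for padding
    """
    # Add a trailing axis so the mask broadcasts over the hidden dimension
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Stand-in for transformer output: 2 sentences, 5 token slots, 768 hidden units
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 5, 768)).astype(np.float32)
mask = np.array([[1, 1, 1, 0, 0],   # first sentence has 3 real tokens
                 [1, 1, 1, 1, 1]])  # second sentence fills all 5 slots
pooled = mean_pool(tokens, mask)
print(pooled.shape)
```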

## Search similar sentences

In [12]:
from unpackai.cosine import CosineSearch
cosine = CosineSearch(vectors)
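The idea behind a cosine search fits in a few lines of NumPy. The sketch below is not unpackai's implementation: `NaiveCosineSearch` is a hypothetical stand-in, and it assumes, as the usage in this notebook suggests, that `search` returns sentence indices ordered from most to least similar.

```python
import numpy as np

class NaiveCosineSearch:
    """Hypothetical stand-in illustrating cosine search; not unpackai's code."""

    def __init__(self, base: np.ndarray):
        # Normalize every row once, so each search is a single matrix-vector product
        norms = np.linalg.norm(base, axis=1, keepdims=True)
        self.base = base / np.clip(norms, 1e-12, None)

    def search(self, query: np.ndarray) -> np.ndarray:
        # Normalize the query; the dot product of unit vectors is their cosine similarity
        q = query / max(np.linalg.norm(query), 1e-12)
        scores = self.base @ q
        # Indices of the base vectors, most similar first
        return np.argsort(-scores)

base = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
order = NaiveCosineSearch(base).search(np.array([2.0, 0.1]))
print(order)
```

With the `(10879, 768)` matrix from this notebook, the same product scores every sentence against the query in one step, which is why a brute-force search is fast enough at this scale.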

In [18]:
from IPython.display import HTML, display

In [31]:
def search(sentence: str, topk: int = 10):
    # Encode the query and keep the indices of the top-k most similar sentences
    return cosine.search(model.encode(sentence))[:topk]

def display_similar(text: str, topk: int = 10, expand: int = 2):
    similars = search(text, topk)
    for i in similars:
        mid = sentences[i]
        # Clamp the start at 0: a negative slice start would count from the end of the list
        before = " ".join(sentences[max(0, i - expand):i])
        after = " ".join(sentences[i + 1:i + 1 + expand])
        display(HTML(f"""
        <h3>Sentence {i} of the text</h3>
        <p>{before}<br> <strong style='color:red'>{mid}</strong> <br>{after}</p>
        <hr>
        """))
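Showing a hit together with its neighbours is a slicing exercise with one pitfall: Python treats a negative slice start as counting from the end of the list, so the window around an early sentence needs its lower bound clamped. A minimal sketch, where `context_window` and the toy list are hypothetical:

```python
from typing import List, Tuple

def context_window(items: List[str], i: int, expand: int) -> Tuple[List[str], str, List[str]]:
    """Return (before, hit, after) with up to `expand` neighbours on each side."""
    # Clamp the lower bound so early indices do not wrap around to the end of the list
    start = max(0, i - expand)
    before = items[start:i]
    # Slicing past the end of a list is safe: it simply stops at the last item
    after = items[i + 1:i + 1 + expand]
    return before, items[i], after

toy = ["s0", "s1", "s2", "s3", "s4"]
print(context_window(toy, 1, 2))  # the hit is near the start of the list
print(toy[1 - 2:1])               # the unclamped slice silently comes back empty
```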

## Visualize the search

In [32]:
display_similar("When John sailed on the sea, he fished out a bottle from the stormy ocean", expand=1)

## Search interactively

In [28]:
from ipywidgets import Textarea, Layout, interact_manual

In [33]:
@interact_manual
def search_similar(
    text=Textarea(layout=Layout(width="90%", height="60px")),
    expand=(1, 5),
):
    display_similar(text, expand=expand)
