<a href="https://colab.research.google.com/github/unpackAI/unpackai/blob/main/examples/feature_encoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Encoder and Neural Search
> Many good libraries wrap this up nicely for production, but to understand the basics it is worth seeing how simple the underlying idea is.

## Installations

In [2]:
!pip install -q transformers
!pip install -Uqq fastai
!pip install -q unpackai==0.1.8.9
!pip install -q sentence-transformers
!pip install -q spacy


## Imports

In [28]:
from transformers import AutoModel, AutoTokenizer
from unpackai.nlp import *
from unpackai.nlp import Textual
from forgebox.imports import *
from forgebox.df import PandasDisplay
from sentence_transformers import SentenceTransformer
from tqdm.notebook import tqdm
from spacy import load

Load spaCy, which is used to split the text into sentences

In [5]:
spacy = load("en_core_web_sm")

## Load Model

In [6]:
PRETRAINED = "distilbert-base-uncased"
model = SentenceTransformer(PRETRAINED)

Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Get the data

In [8]:
textual = Textual.from_url("https://www.gutenberg.org/files/34206/34206-0.txt")
textual

Text (1495417 chars), textual(),
    train_path, val_path = textual.create_train_val()

In [9]:
textual()

In [10]:
# The book has 1,495,417 characters, above spaCy's default limit of 1,000,000
spacy.max_length = 1500000
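
Raising `max_length` works here, but it also raises memory use. An alternative, sketched below under stated assumptions, is to feed the text in smaller pieces made of whole paragraphs. The helper `chunk_paragraphs` and its `max_chars` parameter are hypothetical names for illustration and are not part of spaCy or unpackai; the sketch also assumes paragraphs are separated by blank lines.

```python
from typing import Iterator, List


def chunk_paragraphs(text: str, max_chars: int) -> Iterator[str]:
    """Yield groups of whole paragraphs, each at most max_chars long where possible.

    A single paragraph longer than max_chars is yielded on its own, unsplit.
    """
    buffer: List[str] = []
    size = 0
    for paragraph in text.split("\n\n"):
        # +2 accounts for the blank line that rejoins paragraphs
        added = len(paragraph) + (2 if buffer else 0)
        if buffer and size + added > max_chars:
            yield "\n\n".join(buffer)
            buffer, size = [], 0
            added = len(paragraph)
        buffer.append(paragraph)
        size += added
    if buffer:
        yield "\n\n".join(buffer)


# Toy input (hypothetical): three short paragraphs
sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunks = list(chunk_paragraphs(sample, max_chars=40))
print(len(chunks))
```

Each chunk could then be passed to the loaded pipeline separately, for example through its `pipe` method, and the resulting sentences concatenated.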

Split the entire book into sentences

In [11]:
%%time
sentences = [sentence.text for sentence in tqdm(spacy(textual.text).sents)]

CPU times: user 41.8 s, sys: 3.2 s, total: 45 s
Wall time: 44.8 s


Encode sentences into vectors

In [12]:
%%time
vectors = model.encode(sentences, device="cuda:0", batch_size=32, show_progress_bar=True)

In [14]:
vectors.shape

(10879, 768)
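
The `(10879, 768)` shape means each sentence is represented by one 768-dimensional vector. When `SentenceTransformer` is given a plain checkpoint such as `distilbert-base-uncased`, it typically falls back to averaging the token vectors of each sentence (mean pooling). The sketch below illustrates that idea in NumPy on toy data; `mean_pool` and the toy arrays are hypothetical names for illustration, not part of either library.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padding positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts


# Toy input (hypothetical): 2 sentences, 3 token slots, 4-dimensional token vectors
token_embeddings = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
attention_mask = np.array([[1, 1, 0],   # last slot of the first sentence is padding
                           [1, 1, 1]])
sentence_vectors = mean_pool(token_embeddings, attention_mask)
print(sentence_vectors.shape)
```

The attention mask matters because sentences in a batch are padded to the same length, and padding tokens should not pull the average toward meaningless values.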

## Search similar sentences

In [15]:
from unpackai.cosine import CosineSearch
cosine = CosineSearch(vectors)

In [26]:
def search(sentence: str):
    # Encode the query, then look up the indices of the most similar sentences
    similars = cosine.search(model.encode(sentence))
    return pd.DataFrame(dict(sentences=np.array(sentences)[similars]))

def display_similar(text: str):
    # Show up to 20 of the matches
    with PandasDisplay(max_colwidth=0, max_rows=100):
        display(search(text).head(20))
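
To show how little is needed for this kind of search, here is a minimal NumPy sketch of cosine-similarity ranking. It is an assumption about the general technique, not the actual implementation of `CosineSearch`; `cosine_top_k` and the toy arrays are hypothetical names used only for illustration.

```python
import numpy as np


def cosine_top_k(query: np.ndarray, corpus: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k corpus rows most similar to query, best first."""
    # Normalize to unit length so the dot product equals the cosine similarity
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = corpus_unit @ query_unit
    # Sort by descending score and keep the first k indices
    return np.argsort(-scores)[:k]


# Toy corpus (hypothetical): four 2-dimensional "sentence vectors"
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.1])
top = cosine_top_k(query, corpus, k=3)
print(top)
```

Because cosine similarity ignores vector length, the query `[2.0, 0.1]` ranks closest to `[1.0, 0.0]` even though their magnitudes differ. The returned indices can be used to look up the matching sentences, just as `search` does with `np.array(sentences)[similars]`.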

## Visualize the search

In [34]:
display_similar("When John sailed on the sea, he fished out a bottle from the stormy ocean")

| | sentences |
|---|---|
| 0 | The fisherman then took the bottle to the brink of the sea. |
| 1 | The 'Efreet then kicked the bottle into the sea. |
| 2 | The Fisherman shewing the Fish to the Sulṭán THOMPSON 89 |
| 3 | " He had brought it, wrapped up, on the back of a camel. |
| 4 | I was once told that the master of an English merchant-vessel, having fallen asleep in a state of intoxication on the shore of the harbour of Alexandria, at night, was devoured by dogs. |
| 5 | It eats the flesh of men whom the sea casts on the shore from wrecks. |
| 6 | Being, one night, unable to sleep, he called for a person to tell him a story for his amusement. " |
| 7 | His fame he describes as having increased until he was induced to try an unlucky experiment. |
| 8 | The fisherman did so, not believing in his escape, until they had quitted the neighbourhood of the city, and ascended a mountain, and descended into a wide desert tract, in the midst of which was a lake of water. |
| 9 | Continuing his wanderings in the desert, he found, upon a pebbly plain, an old man with a long white beard, who accosted him, asking of what he was in search. |


## Search interactively

In [35]:
from ipywidgets import Textarea, Layout

In [37]:
@interact_manual
def search_similar(text = Textarea(layout=Layout(width="90%", height="60px"))):
    display_similar(text)
