# Neural search demo - initial indexing

The code in this notebook shows how to prepare data for indexing in a vector search engine.

It contains the following steps:

* Downloading the text data we want to search
* Initializing a pre-trained text vectorization model (with SentenceTransformer)
* Converting the text data into vectors and saving them

In [None]:
# We will use startup descriptions in this neural search demo
# Data source: https://startups-list.com/
# It contains the name, short description, logo and location of each startup.
!wget https://storage.googleapis.com/generall-shared-data/startups_demo.json

In [None]:
# We use SentenceTransformer pre-trained models to convert our text into vectors.
!pip install sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import json
import pandas as pd
from tqdm.notebook import tqdm

In [None]:
# This code downloads and instantiates a pre-trained sentence encoder.

# all-MiniLM-L6-v2 is a lightweight model based on MiniLM, a distilled transformer.
# It is optimized for fast inference.
# The full list of available models can be found at https://www.sbert.net/docs/pretrained_models.html
# device="cuda" requires a GPU runtime; pass device="cpu" if none is available.
model = SentenceTransformer('all-MiniLM-L6-v2', device="cuda")

In [None]:
df = pd.read_json('./startups_demo.json', lines=True)
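
With `lines=True`, pandas reads the file as JSON Lines, that is, one JSON object per line. The sketch below illustrates that layout with the standard library only, and shows how the `alt` and `description` fields are later joined into the text that gets encoded. The two records are invented for illustration and are not taken from the dataset.

```python
import json

# Two made-up records in JSON Lines form (one JSON object per line).
# The field names mirror the ones used in this notebook; the values are hypothetical.
sample_jsonl = "\n".join([
    json.dumps({"name": "Acme", "alt": "Acme - rocket delivery",
                "description": "Rockets delivered to your door."}),
    json.dumps({"name": "Globex", "alt": "Globex - smart homes",
                "description": "Home automation for everyone."}),
])

# Parse each non-empty line as an independent JSON document.
records = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]

# Build the text to encode the same way the encoding cell does: alt + ". " + description.
texts = [record["alt"] + ". " + record["description"] for record in records]

for text in texts:
    print(text)
```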

In [None]:
# Here we encode all startup descriptions.
# Encoding in batches reduces overhead and significantly speeds up the process.
vectors = model.encode([
    row.alt + ". " + row.description
    for row in df.itertuples()
], show_progress_bar=True)

Batches:   0%|          | 0/1265 [00:00<?, ?it/s]
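
`model.encode` splits its input into batches internally. Assuming the library's default `batch_size=32` (it is not set explicitly in the call above), the 40474 texts in this dataset fall into 1265 batches, which matches the progress bar. The helper `iter_batches` below is a hypothetical illustration of that chunking, not the library's own implementation.

```python
import math


def iter_batches(items, batch_size=32):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Number of texts in the dataset and the assumed default batch size.
n_texts = 40474
batch_size = 32

# The last batch may be smaller, so round up.
n_batches = math.ceil(n_texts / batch_size)
print(n_batches)

# The same count obtained by actually chunking a list of placeholder items.
placeholder_texts = ["text"] * n_texts
batch_sizes = [len(batch) for batch in iter_batches(placeholder_texts, batch_size)]
print(len(batch_sizes), batch_sizes[-1])
```

Passing a larger `batch_size` to `model.encode` may speed things up on a GPU with enough memory, at the cost of a higher peak memory use.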

In [None]:
# Now all our descriptions are converted into vectors.
# We have 40474 vectors of 384 dimensions, which is the size of the model's output layer.
vectors.shape

(40474, 384)
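
As a rough sanity check on the size of the result, the memory footprint follows directly from the shape. The sketch assumes the vectors are stored as 32-bit floats, which is what `model.encode` returns by default; check `vectors.dtype` to confirm it in your run.

```python
import numpy as np

# Shape reported above.
n_vectors, dim = 40474, 384

# Assumption: float32 values, 4 bytes each.
bytes_per_value = np.dtype(np.float32).itemsize

n_bytes = n_vectors * dim * bytes_per_value
size_mib = n_bytes / 1024 ** 2
print(f"{n_bytes} bytes = {size_mib:.1f} MiB")
```

At roughly 59 MiB, the whole matrix fits comfortably in memory, which is why the brute-force scoring in the optional section is feasible.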

In [None]:
# You can download the saved vectors and continue with the rest of the tutorial.
np.save('vectors.npy', vectors, allow_pickle=False)
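
To check that a saved file can be read back unchanged, a save/load round trip is enough. The sketch below uses a small random array as a stand-in for the real vectors and a temporary directory, so it does not touch `vectors.npy`; the names `demo_vectors` and `restored` are illustrative.

```python
import os
import tempfile

import numpy as np

# Small random stand-in for the real embedding matrix (10 vectors of 384 dimensions).
rng = np.random.default_rng(0)
demo_vectors = rng.random((10, 384), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_vectors.npy")

    # allow_pickle=False keeps the file a plain numeric array with no pickled objects.
    np.save(path, demo_vectors, allow_pickle=False)
    restored = np.load(path, allow_pickle=False)

print(restored.shape, restored.dtype)
print(np.array_equal(demo_vectors, restored))
```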

In [None]:
from google.colab import files
files.download('vectors.npy')


## Optional part - make a test query

Let's make sure that our vectors were converted correctly and make sense.

To do this, we manually search for the vectors closest to a sample description.
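
What "closest" means here is highest cosine similarity: the dot product of two vectors after scaling each to unit length. A minimal NumPy sketch of that brute-force search on toy 2-D vectors follows; `top_k_cosine` is a hypothetical helper written for illustration, and it assumes no vector has zero length.

```python
import numpy as np


def top_k_cosine(query, matrix, k=5):
    """Return indices and scores of the k rows of `matrix` most similar to `query`.

    Assumes neither `query` nor any row of `matrix` is the zero vector.
    """
    # Scale everything to unit length so the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

    scores = matrix @ query

    # argsort is ascending, so take the last k indices and reverse them.
    top_ids = np.argsort(scores)[-k:][::-1]
    return top_ids, scores[top_ids]


# Toy data: four 2-D vectors and a query that points mostly along the first axis.
toy_vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])
toy_query = np.array([1.0, 0.1])

top_ids, top_scores = top_k_cosine(toy_query, toy_vectors, k=2)
print(top_ids, top_scores)
```

Scoring every vector like this is linear in the size of the collection, which is fine for a one-off check but is exactly the cost a vector search engine is meant to avoid.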

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Take a random description as a query
sample_query = df.iloc[12345].description
print(sample_query)

Dental transparency provider
Deenty is a dentistry marketplace. Find your next dentist in an easy, quick and transparent way. Find reviews, sales prices and education about your next dental treatment.


In [None]:
query_vector = model.encode(sample_query)  # Convert query description into a vector.

In [None]:
scores = cosine_similarity([query_vector], vectors)[0]  # Manually score every vector against the query
top_scores_ids = np.argsort(scores)[-5:][::-1]  # Select the top-5 vectors with the largest scores

In [None]:
# Check whether the results are similar to the query
for top_id in top_scores_ids:
    print(df.iloc[top_id].description)
    print("-----")

Dental transparency provider
Deenty is a dentistry marketplace. Find your next dentist in an easy, quick and transparent way. Find reviews, sales prices and education about your next dental treatment.
-----
Smiley marketplace. We connect patients with dentists
Deenty let dentists create a profile who let them be viewed by new patients. On the other hand patients get to know their dentists, book online, learn about treatments and compare prices.
We charge dentists everytime they treat one of the patients we send them.
-----
Our mission is to make quality dental care affordable for everyone.
-----
Dental management made easy
Dentalink is a web based monthly fee software, for the management of dental clinics and practices. With it, you can manage all the resources within, from appointment, notifications, email marketing to performance reports, like cash flow, quotes uptake rate, and ...
-----
Cure with confidence
Dentists make over half their income from resin-based fillings. However, the