# Text Search Engine

This notebook demonstrates how to build a text search engine. It uses a BERT-based model to convert text into embedding vectors, which are stored in Milvus (milvus.io), a vector database. Postgres stores the text of each article and is used to map the IDs returned by Milvus back to human-readable results.

## Data

This example uses a Kaggle dataset of news articles obtained externally (see the references below). The data is in JSON format and contains many other fields that could be used for search. For this example, only the headline of each article is used for searching.

The dataset is quite large, so it has been divided into three separate files that can be used for embeddings (Small, Medium, and the entire dataset). Depending on available compute resources, you may decide to use one of the smaller files.
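
As a rough illustration of the record layout, the sketch below parses two made-up JSON-lines records with the standard library and keeps only the headline. The field names `headline` and `short_description` match those used later in this notebook; the sample records and the `extract_headlines` helper are invented for illustration.

```python
import json

# Two invented records in JSON-lines form (one JSON object per line).
SAMPLE_LINES = """\
{"headline": "Example headline one", "short_description": "First summary.", "category": "NEWS"}
{"headline": "  Example headline two ", "short_description": "Second summary.", "category": "TECH"}
"""

def extract_headlines(json_lines_text):
    """Return the stripped headline of every non-empty JSON-lines record."""
    headlines = []
    for line in json_lines_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        headlines.append(record["headline"].strip())
    return headlines

headlines = extract_headlines(SAMPLE_LINES)
print(headlines)
```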
    
## Requirements
    
A requirements.txt file has been provided (002-requirements.txt). Here are the packages used for this example:
    
| Package               |
|-----------------------|
| pymilvus              |
| flask-cors            |
| numpy                 |
| flask                 |
| flask_restful         |
| psycopg2-binary       |
| sentence_transformers |


## Installing necessary software

All required Python packages are listed in the 002-requirements.txt file.

In [21]:
%pip install -r 002-requirements.txt

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Note: you may need to restart the kernel to use updated packages.


## Starting the Milvus server and Postgres

This demo uses Milvus. Please refer to the Install Milvus guide to learn how to use its Docker containers. The standard Milvus docker-compose.yml has been modified for this example to use the appropriate packages.

Part of this demo uses Postgres to store text information. It is also started in this step, using the `postgres:latest` Docker image.

In [22]:
!docker-compose -f docker-images/docker-compose-milvus.yml up -d
!docker run --rm --name postgres0 -d  -p 5438:5432 -e POSTGRES_HOST_AUTH_METHOD=trust postgres

[+] Running 1/0
 ✔ Network milvus          Created    0.0s
 ⠋ Container milvus-etcd   Creating   0.0s
 ⠋ Container milvus-minio  Creating   0.0s
[+] Running 3/3
 ✔ Network milvus          Created    0.0s
 ⠿ Container milvus-etcd

## Check to see that Postgres is started
This step checks that Postgres started without issues. The commented-out `sleep 30` can be enabled if the Milvus server is slow to start.

In [23]:
!docker logs postgres0 --tail 6
#!sleep 30

2023-06-05 17:57:07.492 UTC [1] LOG:  starting PostgreSQL 15.3 (Debian 15.3-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-06-05 17:57:07.492 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2023-06-05 17:57:07.492 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2023-06-05 17:57:07.493 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-06-05 17:57:07.496 UTC [63] LOG:  database system was shut down at 2023-06-05 17:57:07 UTC
2023-06-05 17:57:07.500 UTC [1] LOG:  database system is ready to accept connections


## Connections to Milvus and PostgreSQL
In this code block, we connect to the Milvus server and to PostgreSQL. A cursor is created for Postgres so that we can perform operations against the database.

In [24]:
from pymilvus import connections
import psycopg2
connections.connect(host='localhost', port='19530')
conn = psycopg2.connect(host='localhost', port='5438', user='postgres', password='postgres')
cursor = conn.cursor()

[__internal_register] retry:4, cost: 0.27s, reason: <_InactiveRpcError: StatusCode.UNAVAILABLE, internal: Milvus Proxy is not ready yet. please wait>
[__internal_register] retry:5, cost: 0.81s, reason: <_InactiveRpcError: StatusCode.UNAVAILABLE, internal: Milvus Proxy is not ready yet. please wait>
[__internal_register] retry:6, cost: 2.43s, reason: <_InactiveRpcError: StatusCode.UNAVAILABLE, internal: Milvus Proxy is not ready yet. please wait>
[__internal_register] retry:7, cost: 7.29s, reason: <_InactiveRpcError: StatusCode.UNAVAILABLE, internal: Milvus Proxy is not ready yet. please wait>
[__internal_register] retry:8, cost: 21.87s, reason: <_InactiveRpcError: StatusCode.UNAVAILABLE, internal: Milvus Proxy is not ready yet. please wait>

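The retry messages in the output above show the client waiting for the Milvus proxy to become ready, with the delay growing on each attempt. A minimal sketch of the same wait-and-retry pattern follows. The `wait_until_ready` helper, its parameters, and the fake probe are hypothetical and are not part of pymilvus; in practice the probe could wrap a call such as `connections.connect(...)`.

```python
import time

def wait_until_ready(check, attempts=5, base_delay=0.5, factor=3.0, sleep=time.sleep):
    """Call check() until it succeeds, sleeping longer after each failure.

    Returns the number of attempts used; re-raises the last error if all fail.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            check()
            return attempt
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= factor

# Hypothetical stand-in for a real readiness probe: fails twice, then succeeds.
calls = {"n": 0}

def fake_probe():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service not ready yet")

waits = []
used = wait_until_ready(fake_probe, sleep=waits.append)  # record delays instead of sleeping
print(used, waits)
```

Passing `sleep` as a parameter keeps the helper easy to exercise without actually waiting.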

## Creating the Collection

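The next step is to create a collection, which requires declaring the name of the collection and the dimension of the vector. Here the dimension is 768, the size of the embeddings produced by the model used later in this notebook.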

In [25]:
TABLE_NAME = "text_collection"
FIELD_NAME = "example_field"

from pymilvus import Collection, CollectionSchema, FieldSchema, DataType

pk = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True)

field = FieldSchema(name=FIELD_NAME, dtype=DataType.FLOAT_VECTOR, dim=768)
schema = CollectionSchema(fields=[pk,field], description="example collection")

collection = Collection(name=TABLE_NAME, schema=schema)

## Setting an index

After creating the collection we assign it an index type. This can be done before or after inserting the data. When done before, indexes are built as data comes in and fills the data segments. In this example we use IVF_SQ8, which quantizes each vector component to 8 bits and so uses less memory at a small cost in recall. If memory is not a constraint, you may want to consider IVF_FLAT, which stores the vectors uncompressed. See the **[Milvus index documentation](https://milvus.io/docs/index.md)** for details.

In [28]:
index_type = 'IVF_SQ8'  # IVF_FLAT is an alternative when memory is not a constraint

index_param = {
        "metric_type": "L2",
        "index_type": index_type,
        "params": {"nlist": 1024}
    }

collection.create_index(field_name=FIELD_NAME, index_params=index_param)

Status(code=0, message=)
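
To make the idea behind 8-bit scalar quantization (the "SQ8" in IVF_SQ8) more concrete, here is a small pure-Python sketch. It is only an illustration of the general technique, not Milvus's implementation; the fixed value range and the `quantize` and `dequantize` helpers are assumptions made for this example.

```python
def quantize(vector, lo, hi):
    """Map each float in [lo, hi] to an integer code in 0..255."""
    scale = (hi - lo) / 255.0
    # Clamp to the range first so every code fits in one byte.
    return [round((min(max(x, lo), hi) - lo) / scale) for x in vector]

def dequantize(codes, lo, hi):
    """Map integer codes back to approximate float values."""
    scale = (hi - lo) / 255.0
    return [lo + c * scale for c in codes]

vec = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes = quantize(vec, -1.0, 1.0)
restored = dequantize(codes, -1.0, 1.0)

# The reconstruction error per component is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(codes, max_error)
```

Each component shrinks from a 4-byte float to a 1-byte code, which is where the memory saving comes from; the price is the small rounding error measured by `max_error`.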

## Creating a table in Postgres

PostgreSQL stores each Milvus ID together with its corresponding headline and short description.

In [29]:
# Delete any previously stored table for a clean run
drop_table = f"DROP TABLE IF EXISTS {TABLE_NAME}"
cursor.execute(drop_table)
conn.commit()

try:
    sql = f"CREATE TABLE if not exists {TABLE_NAME} (ids bigint, title text, text text);"
    cursor.execute(sql)
    conn.commit()
    print("create postgres table successfully!")
except Exception as e:
    print("can't create a postgres table: ", e)

create postgres table successfully!
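
The sketch below shows the role this table plays: a lookup from a vector ID to its title. It uses the standard-library `sqlite3` module with an in-memory database as a stand-in for Postgres, purely so that it runs without a server. The sample rows and the `title_for_id` helper are invented for illustration, and the placeholder syntax differs between drivers (`?` in sqlite3, `%s` in psycopg2).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres connection
cur = conn.cursor()
cur.execute("CREATE TABLE text_collection (ids INTEGER, title TEXT, text TEXT)")
cur.executemany(
    "INSERT INTO text_collection VALUES (?, ?, ?)",
    [(1, "First headline", "First description"),
     (2, "Second headline", "Second description")],
)
conn.commit()

def title_for_id(cursor, vector_id):
    """Return the title stored for a vector ID, or None if it is unknown."""
    # The ID is passed as a bound parameter rather than pasted into the SQL string.
    cursor.execute("SELECT title FROM text_collection WHERE ids = ?", (vector_id,))
    row = cursor.fetchone()
    return row[0] if row else None

print(title_for_id(cur, 2))
print(title_for_id(cur, 99))
```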


## Generating Embeddings

In this example we use the sentence_transformers library to encode each headline into a vector. This library uses modified BERT-style models to generate the embeddings; here we use all-mpnet-base-v2, a pretrained model based on Microsoft's MPNet. More information can be found **[here](https://www.sbert.net/docs/pretrained_models.html)**.

Note: two smaller raw JSON files (Small and Medium) are also included with this demo. If you have limited compute resources, consider using one of them instead.

In [31]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

#model = SentenceTransformer('paraphrase-mpnet-base-v2')
model = SentenceTransformer('all-mpnet-base-v2')

articles_json_file = 'news-data/News_Category_Dataset_v3.json'

# Load the articles (one JSON object per line)
data = pd.read_json(articles_json_file, lines=True)
# Remove newlines and pipes, which would break the pipe-delimited file loaded into Postgres later
data = data.replace('\n', ' ', regex=True)
data = data.replace('\\|', ' ', regex=True)

title_data = data['headline'].str.strip().tolist()
text_data = data['short_description'].str.strip().tolist()

# Embed the headlines, then scale each vector to unit length
sentence_embeddings = model.encode(title_data, batch_size=32)
sentence_embeddings = normalize(sentence_embeddings)
print(type(sentence_embeddings))

Downloading (…)a8e1d/.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)b20bca8e1d/README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading (…)0bca8e1d/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading (…)e1d/data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading (…)a8e1d/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

Downloading (…)8e1d/train_script.py:   0%|          | 0.00/13.1k [00:00<?, ?B/s]

Downloading (…)b20bca8e1d/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)bca8e1d/modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

Batches:   0%|          | 0/6548 [00:00<?, ?it/s]

<class 'numpy.ndarray'>
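
Because the embeddings are normalized to unit length, ranking by L2 distance gives the same order as ranking by cosine similarity: for unit vectors, the squared L2 distance equals 2 − 2·cos. The sketch below checks this identity on two small made-up vectors using only the standard library; the helper names are chosen for this example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

d2 = squared_l2(a, b)
cos = cosine(a, b)
print(d2, 2 - 2 * cos)  # the two values agree up to floating-point rounding
```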


## Inserting embedding vectors into Milvus
In this step we insert the embeddings generated in the previous step into Milvus. The array of embeddings is split into 10 batches so that no single insert request is too large for the server to accept.

In [32]:
import numpy as np

# Split the embeddings into 10 roughly equal batches
emb_splits = np.array_split(sentence_embeddings, 10)

ids = []

for emb_split in emb_splits:
    # Each insert returns the auto-generated primary keys for that batch
    mr_tmp = collection.insert([emb_split.tolist()])
    ids.extend(mr_tmp.primary_keys)
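
For reference, the batching step can be sketched without NumPy. The hypothetical `split_into_batches` helper below produces contiguous batches with the same sizes that `np.array_split` yields: the first `len(items) % n_batches` batches get one extra item.

```python
def split_into_batches(items, n_batches):
    """Split a list into n_batches contiguous, nearly equal parts."""
    base, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first batches
        batches.append(items[start:start + size])
        start += size
    return batches

rows = list(range(23))
batches = split_into_batches(rows, 10)
sizes = [len(b) for b in batches]
print(sizes)
```

Keeping the batches in order matters here, because the returned primary keys are later matched to headlines by position.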

## Inserting text information into Postgres
In this step the ID, headline, and short description of each article are loaded into Postgres so that articles found in Milvus can be joined to their text.

In [33]:
import os

def record_temp_csv(fname, ids, title, text):
    # Write one pipe-delimited line per article: id|title|text
    with open(fname, 'w') as f:
        for i in range(len(ids)):
            line = str(ids[i]) + "|" + title[i] + "|" + text[i] + "\n"
            f.write(line)

def copy_data_to_pg(table_name, fname, conn, cur):
    fname = os.path.join(os.getcwd(), fname)
    try:
        # QUOTE is set to the BEL character so quotes in headlines are treated as plain text
        sql = f"COPY {table_name} FROM STDIN (QUOTE E'\u0007', FORMAT 'csv', DELIMITER '|')"
        with open(fname, "r") as f:
            cur.copy_expert(sql, f)
        conn.commit()
        print("Inserted into Postgres successfully!")
    except Exception as e:
        conn.rollback()  # leave the connection usable after a failed COPY
        print("Copying data into Postgres failed: ", e)

DATA_WITH_IDS = 'news-data/temp-to-load.csv'

record_temp_csv(DATA_WITH_IDS, ids, title_data, text_data)
copy_data_to_pg(TABLE_NAME, DATA_WITH_IDS, conn, cursor)

Inserted into Postgres successfully!
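
One possible variation, shown here only as a sketch and not as the notebook's method, is to let Python's `csv` module build the pipe-delimited data in memory. The module quotes fields that contain the delimiter or a quote character, so pipes would not need to be stripped from the text first. The `build_copy_buffer` helper and the sample rows are invented for illustration. A buffer like this could in principle be passed to `copy_expert` with a COPY statement that uses the default CSV quote character instead of the BEL character used above.

```python
import csv
import io

def build_copy_buffer(ids, titles, texts):
    """Write rows as pipe-delimited CSV into an in-memory buffer."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
    for row in zip(ids, titles, texts):
        writer.writerow(row)
    buffer.seek(0)  # rewind so a consumer reads from the start
    return buffer

buffer = build_copy_buffer(
    [101, 102],
    ["Plain title", "Title with | pipe"],
    ["Some text", 'Text with "quotes"'],
)
print(buffer.getvalue())

# Reading the data back recovers the original fields, including the pipe.
parsed = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter="|"))
print(parsed)
```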


## Processing Query
When searching for headlines, we first pass the query text through the same model to generate an embedding. With that embedding vector we can then search for similar embeddings in Milvus.

In [34]:
search_params = {"metric_type": "L2", "params": {"nprobe": 10}}

title = 'War in Ukraine'

# Encode the query with the same model and normalize it like the stored vectors
embed = model.encode(title)
embed = embed.reshape(1, -1)
embed = normalize(embed)
query_embeddings = embed.tolist()

# The collection must be loaded into memory before it can be searched
collection.load()
results = collection.search(query_embeddings, FIELD_NAME, param=search_params, limit=20, expr=None)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]
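
Conceptually, the search returns the stored vectors with the smallest L2 distance to the query. The sketch below does this exhaustively over a handful of made-up 2-D vectors. An IVF index approximates the same result by scanning only the `nprobe` nearest clusters instead of every vector. The `top_k_l2` helper is hypothetical and unrelated to the pymilvus API.

```python
import heapq

def top_k_l2(query, vectors, k):
    """Return the k (index, squared L2 distance) pairs closest to query."""
    scored = (
        (idx, sum((q - v) ** 2 for q, v in zip(query, vec)))
        for idx, vec in enumerate(vectors)
    )
    # nsmallest keeps only the k best candidates, ordered by distance.
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = top_k_l2([0.9, 0.1], vectors, k=2)
print(hits)
```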

## Getting similar headlines

There may be no titles similar to the given one. A distance threshold can be applied to filter out irrelevant matches. We then use the result IDs to pull the matching titles from Postgres and print them with their corresponding L2 distance (a lower distance means a closer match).

In [36]:
similar_titles = []

print("Here are the closest article matches: ")

# Look up each hit's title in Postgres by its Milvus ID
sql = f"select title from {TABLE_NAME} where ids = %s;"
for result in results[0]:
    cursor.execute(sql, (result.id,))
    rows = cursor.fetchall()
    if rows and rows[0][0]:
        similar_titles.append((rows[0][0], result.distance))
        print((rows[0][0], result.distance))



Here are the closest article matches: 
('A Time of Vigilance in the Ukraine', 0.5156738758087158)
('Ukraine War: Shelling and Hunger Killing Civilians', 0.5218479633331299)
('Russo-Ukrainian War Now a Reality', 0.533121645450592)
('Why Peace In Ukraine Must Be Seen To Be Believed', 0.5344517827033997)
('Ukraine: 9,000 Of Its Troops Killed Since Russia Began War', 0.5466265678405762)
('Ukraine: Why There Is Hope', 0.5467580556869507)
('The Many Paradoxes of the Ukraine Crisis', 0.5543750524520874)
('Ukraine Crisis Hitting Russia Where It Hurts', 0.5629241466522217)
('Ukraine: Strategic Considerations', 0.5637472867965698)
('Stop the Violence in Kiev', 0.5673375129699707)
("Ukraine's Uncertain Future", 0.5690397024154663)
('Syria v Ukraine', 0.580638587474823)
("WATCH LIVE: The Ukraine Crisis -- What's Next For Europe?", 0.5845084190368652)
('Some 150,000 Ukrainians Flee Largest European Ground War Since WWII', 0.6055887341499329)
('Zelenskyy: Russian Offensive In Eastern Ukraine Has Beg
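
The distance threshold mentioned above can be applied with a simple filter. In the sketch below, the `filter_hits` helper and the cutoff of 0.55 are assumptions chosen for illustration; a suitable threshold depends on the model and the data.

```python
def filter_hits(hits, max_distance):
    """Keep (title, distance) pairs whose distance is at most max_distance."""
    return [(title, dist) for title, dist in hits if dist <= max_distance]

# A few (title, distance) pairs taken from the output above.
hits = [
    ("A Time of Vigilance in the Ukraine", 0.5156738758087158),
    ("Ukraine: Why There Is Hope", 0.5467580556869507),
    ("Syria v Ukraine", 0.580638587474823),
]

relevant = filter_hits(hits, max_distance=0.55)  # 0.55 is an arbitrary example cutoff
print(relevant)
```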

## Cleanup of Docker containers

This optional step cleans up the Docker resources started by this notebook when the L_END_DOCKER flag is set to 1.


In [20]:
%%bash
L_END_DOCKER=1

if [[ ${L_END_DOCKER} -eq 1 ]]; then
    docker-compose -f docker-images/docker-compose-milvus.yml down --remove-orphans
    docker stop postgres0
fi



 Container milvus-standalone  Stopping
 Container milvus-standalone  Stopped
 Container milvus-standalone  Removing
 Container milvus-standalone  Removed
 Container milvus-minio  Stopping
 Container milvus-etcd  Stopping
 Container milvus-etcd  Stopped
 Container milvus-etcd  Removing
 Container milvus-etcd  Removed
 Container milvus-minio  Stopped
 Container milvus-minio  Removing
 Container milvus-minio  Removed
 Network milvus  Removing
 Network milvus  Removed


postgres0


### Articles cited (For data purposes)

1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).
2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).