## 0. Overview
We will use Milvus and Towhee to help searches. Towhee is used to extract the semantics of the text and return the text embedding. The Milvus vector database can store and search vectors, and return related dataset's metadata. So we first need to install [Milvus](https://github.com/milvus-io/milvus) and [Towhee](https://github.com/towhee-io/towhee).

Before getting started, please make sure that you have started a [Milvus service](https://milvus.io/docs/install_standalone-docker.md). This notebook uses [milvus 2.2.10](https://milvus.io/docs/v2.2.x/install_standalone-docker.md) and [pymilvus 2.2.11](https://milvus.io/docs/release_notes.md#2210).

### Package installations

In [None]:
#! pip install --upgrade pip
#! pip3 install -q towhee pymilvus==2.2.11
#! pip3 uninstall pymilvus -y

! pip3 install -q towhee pymilvus==2.1.1
! pip3 show pymilvus | grep -Ei 'Name:|Version:'
! pip3 show towhee | grep -Ei 'Name:|Version:'

## 1.1 Adding embeddings for columns

The dataset is from the [Kartverket dataset metadata](https://cdn.discordapp.com/attachments/1204433663035449384/1206537816654356480/metadata_no_format.csv?ex=65dc5ee7&is=65c9e9e7&hm=3b9a88db41103ef5393294c5eaeebb60ee2229f43724cc014d4cffc92de1f384&), which contains metadata about each dataset.

The strings in the columns need to be converted to vector representations (embedding) using Towhee [text_embedding.dpr operator](https://towhee.io/text-embedding/dpr). Columns containing these new embedings should contain the original column name with `_vector` at the end.

### NB In case pandas cannot read the csv, due to a delimiter parsing error

Use the code below to reformat the delimiters to "|", `NB! Also replace the excess ones inside sentences that replaced the regular commas.`

In [None]:
# Cell for reformatting the delimiters to "|"
import re
import csv

def replace_delimiter(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as file:
        content = file.read()

    # Regular expression to match commas not inside double quotes
    pattern = r',(?=(?:[^"]*"[^"]*")*[^"]*$)'

    # Replace the matched commas with '|'
    new_content = re.sub(pattern, '|', content)

    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(new_content)

# Replace this with your actual file paths
input_file = 'metadata.csv'
output_file = 'output_metadata_modified.csv'

replace_delimiter(input_file, output_file)


## Load dataset and vectorise chosen column

In [8]:
import pandas as pd
from towhee import pipe, ops, DataCollection
from tqdm import tqdm


# Function to compute embeddings for a single text
def compute_embeddings(text):
    MAX_TOKENS = 512 # Temporary limit on number characters to convert
    truncated_text = text[:MAX_TOKENS]
    return DataCollection(embeddings_pipe(truncated_text)).to_list()[0]['vec']


# Loads dataset into dataframe and recasts columns into correct datatypes
df_kartverket = pd.read_excel('Metadata_excel.xlsx')
recast_to_string = ['datasetcreationdate', 'metadatacreationdate']
df_kartverket[recast_to_string] = df_kartverket[recast_to_string].astype('object')

# Fill NaN values with an empty string
df_kartverket.fillna('', inplace=True)

# Pipe converting text to embeddings (vectors)
embeddings_pipe = (
    pipe.input('text')
        .map('text', 'vec', ops.text_embedding.dpr(model_name='facebook/dpr-ctx_encoder-single-nq-base'))
        .output('vec')
)

# Process each column and create new columns for embeddings
column_to_vectorise = 'title'
df_kartverket[column_to_vectorise + '_vector'] = df_kartverket[column_to_vectorise].apply(compute_embeddings)

Processing column: 100%|██████████| 1/1 [00:08<00:00,  8.72s/column]

Column Types and Example Row:
Column: schema, Type: object, Example: iso19139
Column: uuid, Type: object, Example: 7a62f16f-9aeb-4c39-bf5f-e710232fa366
Column: id, Type: int64, Example: 37228
Column: hierarchyLevel, Type: object, Example: software
Column: title, Type: object, Example: Artsfunn
Column: datasetcreationdate, Type: object, Example: 
Column: abstract, Type: object, Example: Datasettet inneholder stedfestet informasjon av arter samlet inn av NINA. Funndataene følger datastandarden Darwin Core Archive (se http://rs.tdwg.org/dwc/ for detaljer) og distribueres også via Artsdatabankens Artskart (https://artskart.artsdatabanken.no/default.aspx), Global Biodiversity Information Facility (GBIF, http://www.gbif.org/) og GBIF Norge (http://www.gbif.no/).
Column: keyword, Type: object, Example: Natur###Norge###Svalbard###lav###karplanter###botanikkdata###innsekter###vannlevende innsekter###fisk###vannlevende planter###biologisk mangfold###arter###artsfunn###rødlistearter###fremmede ar




In [6]:
from towhee import pipe, ops

def insert_data_to_milvus_with_towhee(df, server_host, server_port, collection_name):
    try:
        # Define the pipeline
        insert_pipe = (pipe.input('data_frame')
                       .flat_map('data_frame', 'data', lambda df: df.values.tolist())
                       .map('data', 'res', ops.ann_insert.milvus_client(host=server_host, 
                                                                        port=server_port,
                                                                        collection_name=collection_name))
                       .output('res'))

        # Execute the pipeline
        results = insert_pipe(df)
        return results
    except Exception as e:
        print(f"Error during insertion: {e}")

# Usage
server_host = 'ebjerk.no'
server_port = '19530'
collection_name = 'kartverket_metadata'

inserted_ids = insert_data_to_milvus_with_towhee(df_kartverket, server_host, server_port, collection_name)

Error during insertion: Node-ann-insert/milvus-client-1 runs failed, error msg: <DataTypeNotSupportException: (code=0, message=Field dtype must be of DataType)>, Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/towhee/runtime/nodes/node.py", line 158, in _call
    return True, self._op(*inputs), None
  File "/Users/williamaredal/.towhee/operators/ann-insert/milvus-client/versions/main/milvus_client.py", line 52, in __call__
    mr = self._collection.insert(row)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pymilvus/orm/collection.py", line 544, in insert
    if not self._check_insert_data_schema(data):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pymilvus/orm/collection.py", line 174, in _check_insert_data_schema
    infer_fields = parse_fields_from_data(data)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3

## Creation of Milvus collection for metadata

In [36]:
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

server_host = 'ebjerk.no'
server_port = '19530'

connections.connect(host=server_host, port=server_port)

def kartverket_create_milvus_collection(collection_name, dim):
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)

    fields = [
            #FieldSchema(name='schema', dtype=DataType.STRING, max_length=100), # REQUIRES STRING TYPE 
            FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=False),
            #FieldSchema(name='uuid', dtype=DataType.VARCHAR, max_length=100), # REQUIRES STRING TYPE
            #FieldSchema(name='hierarchyLevel', dtype=DataType.VARCHAR, max_length=100), # REQUIRES STRING TYPE   
            #FieldSchema(name='hierarchyLevel_vector', dtype=DataType.FLOAT_VECTOR, dim=dim), #REQUIRES STRING TYPE   
            FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=100),   
            FieldSchema(name="title_vector", dtype=DataType.FLOAT_VECTOR, dim=dim),

            #FieldSchema(name='datasetcreationdate', dtype=DataType.VARCHAR, max_length=500), # REQUIRES STRING TYPE   
            FieldSchema(name='abstract', dtype=DataType.VARCHAR, max_length=2000),   
            #FieldSchema(name='abstract_vector', dtype=DataType.FLOAT_VECTOR, dim=dim),   
            FieldSchema(name='keyword', dtype=DataType.VARCHAR, max_length=2000),   
            #FieldSchema(name='keyword_vector', dtype=DataType.FLOAT_VECTOR, dim=dim),   
            #FieldSchema(name='geoBox', dtype=DataType.VARCHAR, max_length=100), # REQUIRES STRING TYPE   
            #FieldSchema(name='geoBox_vector', dtype=DataType.FLOAT_VECTOR, dim=dim), # REQUIRES STRING TYPE   
            FieldSchema(name='Constraints', dtype=DataType.VARCHAR, max_length=1000),   
            #FieldSchema(name='Constraints_vector', dtype=DataType.FLOAT_VECTOR, dim=dim),   

            FieldSchema(name='SecurityConstraints', dtype=DataType.VARCHAR, max_length=500),   
            #FieldSchema(name='SecurityConstraints_vector', dtype=DataType.FLOAT_VECTOR, dim=dim),   
            FieldSchema(name='LegalConstraints', dtype=DataType.VARCHAR, max_length=2000),   
            #FieldSchema(name='LegalConstraints_vector', dtype=DataType.FLOAT_VECTOR, dim=dim),   
            #FieldSchema(name='temporalExtent', dtype=DataType.VARCHAR, max_length=100), # REQUIRES STRING TYPE   
            ##FieldSchema(name='temporalExtent_vector', dtype=DataType.FLOAT_VECTOR, dim=dim), # REQUIRES STRING TYPE   
            #FieldSchema(name='image', dtype=DataType.VARCHAR, max_length=1000), # REQUIRES STRING TYPE   
            FieldSchema(name='responsibleParty', dtype=DataType.VARCHAR, max_length=500),   
            #FieldSchema(name='responsibleParty_vector', dtype=DataType.FLOAT_VECTOR, dim=dim),   

            #FieldSchema(name='link', dtype=DataType.VARCHAR, max_length=500), # REQUIRES STRING TYPE   
            #FieldSchema(name='metadatacreationdate', dtype=DataType.VARCHAR, max_length=500), # REQUIRES STRING TYPE   
            ##FieldSchema(name='metadatacreationdate_vector', dtype=DataType.FLOAT_VECTOR, dim=dim), # REQUIRES STRING TYPE   
            FieldSchema(name='productInformation', dtype=DataType.VARCHAR, max_length=1000),   
            #FieldSchema(name='productInformation_vector', dtype=DataType.FLOAT_VECTOR, dim=dim),   
            FieldSchema(name='parentId', dtype=DataType.VARCHAR, max_length=100),   
    ]
    schema = CollectionSchema(fields=fields, description='search text')
    collection = Collection(name=collection_name, schema=schema)
    
    index_params = {
        'metric_type': "L2",
        'index_type': "IVF_FLAT",
        'params': {"nlist": 2048}
    }
    collection.create_index(field_name='title_vector', index_params=index_params)
    return collection

kartverket_collection = kartverket_create_milvus_collection('kartverket_metadata', 768)

### Creation of dataframe subset to exclude columns with complex data types

In [37]:
df_kartverket_slice = df_kartverket[['id', 'title', 'title_vector', 'abstract', 'keyword', 'Constraints', 'SecurityConstraints', 'LegalConstraints', 'responsibleParty', 'productInformation', 'parentId']]
df_kartverket_slice

Unnamed: 0,id,title,title_vector,abstract,keyword,Constraints,SecurityConstraints,LegalConstraints,responsibleParty,productInformation,parentId
0,37228,Artsfunn,"[0.56139773, 0.34755126, 0.14640424, 0.1977914...",Datasettet inneholder stedfestet informasjon a...,Natur###Norge###Svalbard###lav###karplanter###...,###,#########,Tilgangsrestriksjoner Andre restriksjoner: Lim...,Frank HansenNorsk institutt for naturforskning...,Produktspesifikasjon###Produktark###Produktsid...,
1,21400,Hav og is - Iskart (shapefil),"[0.03226918, 0.60322803, 0.5526991, -0.4393136...",Istjenesten ved Meteorologisk institutt utarbe...,Oceanographic geographical features###Inspire#...,Bruksbegrensninger Ingen begrensninger på bruk...,Sikkerhetsnivå Ugradert: Available for general...,Tilgangsrestriksjoner Andre restriksjoner: Lim...,Meteorologisk instituttistjenesten@met.no###Me...,Produktspesifikasjon###Produktark###Produktsid...,
2,240,Losbordingsfelt,"[-0.5165257, 0.32858214, 0.57894915, -0.333858...",Bordingsfelt er angitt som et geografisk punkt...,Åpne data###Norge digitalt###modellbaserteVegp...,###,Sikkerhetsnivå Ugradert: Available for general...,Tilgangsrestriksjoner Andre restriksjoner: Lim...,Stian AamotKystverket37019700Kystveien 30Arend...,https://register.geonorge.no/register/versjone...,
3,21273,Radnett - doseratemålestasjoner,"[0.087247156, 0.4677642, 0.1239377, -0.3505301...",Datasettet inneholder strålevernets radnettsta...,Norge digitalt###Åpne data###modellbaserteVegp...,Bruksbegrensninger Ingen begrensninger på bruk...,Sikkerhetsnivå Ugradert: Available for general...,Tilgangsrestriksjoner Andre restriksjoner: Lim...,Statens strålevernnrpa@nrpa.no###Direktoratet ...,Produktspesifikasjon###Produktark###Produktsid...,
4,37251,Predikert utbredelse og tetthetsfordeling av s...,"[-0.14394166, 0.41232172, -0.1769793, -0.33108...",Basert på gamle og nye data for forekomst av s...,Species distribution###Norge digitalt###modell...,Bruksbegrensninger Ingen ###IngenNo conditions...,Sikkerhetsnivå Ugradert: Available for general...,Tilgangsrestriksjoner Andre restriksjoner: Lim...,Norsk institutt for naturforskningfrank.hansse...,Produktspesifikasjon###Produktark###Produktsid...,
...,...,...,...,...,...,...,...,...,...,...,...
184,55086,Sentrumssoner 2016,"[-0.5731087, 0.2514926, 0.21160778, -0.3906372...",Inneholder avgrensning av og statistikk knytte...,Land use###Norge digitalt###geodataloven###fel...,###,Sikkerhetsnivå Ugradert: Available for general...,Tilgangsrestriksjoner Andre restriksjoner: Lim...,Erik EngelienStatistisk sentralbyråerik.engeli...,http://www.ssb.no/natur-og-miljo/_attachment/1...,55b401f3-2ea1-4045-9f87-22ac9d6ecf66
185,152,Statistiske enheter grunnkretser - historiske ...,"[-0.11083123, -0.0798243, 0.12608653, 0.422164...",Datasettet viser grunnkretsinndelingen i Norge...,Statistical units###geodataloven###Norge digit...,Bruksbegrensninger Ingen begrensninger på bruk...,Sikkerhetsnivå Ugradert: Available for general...,Tilgangsrestriksjoner Andre restriksjoner: Lim...,Carina Tolpinrud JøntvedtKartverket+47 08700po...,https://register.geonorge.no/register/versjone...,02b6c97b-63da-4d46-9a70-6e9ef3442d54
186,55061,Boligstatistikk på rutenett 250m 2019,"[-0.5389366, 0.28145486, 0.6831738, -0.2194134...",Inneholder rutenettstatistikk over antall boli...,Buildings###Norge digitalt###geodataloven###fe...,Bruksbegrensninger Ingen begrensninger på bruk...,Sikkerhetsnivå Ugradert: Available for general...,Tilgangsrestriksjoner Andre restriksjoner: Lim...,Statistisk Sentralbyråper.morten.holt@ssb.no##...,Produktspesifikasjon###https://register.geonor...,f3cdcd1f-5ee7-40fe-ac19-33a9101e00a4
187,50171,Arealplanområder Svalbard,"[-0.28596509, 0.1529622, 1.1230906, -0.3778528...",Arealplaner på Svalbard følger et forenklet sy...,Annet###fellesDatakatalog###Plan###Svalbard###...,Bruksbegrensninger Oppgje alltid Sysselmannen ...,Sikkerhetsnivå Ugradert: Available for general...,Tilgangsrestriksjoner Andre restriksjoner: Lim...,Sysselmannen på Svalbardsbr@sysselmannen.no###...,Produktspesifikasjon###Produktark###Produktsid...,


## Insert the subset dataframe data into Milvus collection

In [38]:
from towhee import ops, pipe, DataCollection

insert_pipe_kartverket = (pipe.input('df_kartverket_slice')
                   .flat_map('df_kartverket_slice', 'data', lambda df: df.values.tolist())
                   .map('data', 'res', ops.ann_insert.milvus_client(host=server_host, 
                                                                    port=server_port,
                                                                    collection_name='kartverket_metadata'))
                   .output('res')
)

%time _ = insert_pipe_kartverket(df_kartverket_slice)


kartverket_collection.load()
kartverket_collection.num_entities

CPU times: user 569 ms, sys: 72.6 ms, total: 642 ms
Wall time: 6.42 s


189

## Query against metadata collection

In [45]:
import numpy as np
# Variables specifying what column and collection to perform ANN comparrison against
vector_columns = ['title_vector']
collection_name = 'kartverket_metadata'

print(df_kartverket.columns)
# What columns to return for view
response_output = [
       'id', 'title',
       'abstract', 'keyword', 'Constraints',
       'SecurityConstraints', 'LegalConstraints',
       'responsibleParty', 'productInformation', 'parentId'
]


demo_pipe = (pipe.input('query')
                    .map('query', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
                    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
                    .flat_map('vec', ('id', 'score', 'title',
       'abstract', 'keyword', 'Constraints',
       'SecurityConstraints', 'LegalConstraints',
       'responsibleParty', 'productInformation', 'parentId'), 
                                       ops.ann_search.milvus_client(host=server_host, 
                                                                    port=server_port,
                                                                    collection_name=collection_name,
                                                                    vector_field=vector_columns,
                                                                    output_fields=response_output, 
                                                                    limit=5))  
                    .output(*['query', 'score'], *response_output)
               )

kartverket_question1 = 'Just do it'
print(f'\n"{kartverket_question1}" search:')
res_kartverket1 = demo_pipe(kartverket_question1)
DataCollection(res_kartverket1).show()

Index(['schema', 'uuid', 'id', 'hierarchyLevel', 'title',
       'datasetcreationdate', 'abstract', 'keyword', 'geoBox', 'Constraints',
       'SecurityConstraints', 'LegalConstraints', 'temporalExtent', 'image',
       'responsibleParty', 'link', 'metadatacreationdate',
       'productInformation', 'parentId', 'title_vector'],
      dtype='object')

"Just do it" search:


query,score,id,title,abstract,keyword,Constraints,SecurityConstraints,LegalConstraints,responsibleParty,productInformation,parentId
Just do it,120.80125427246094,22788,22788,Magasin,"Database over regulerte innsjøer. Egenskapsdata er vassdragsnr., magasinnr., navn, laveste og høyeste regulerte vannstand (m.o.h...",Annet###Åpne data###Norge digitalt###modellbaserteVegprosjekter###fellesDatakatalog###Energi###Norge fastland###magasin###vannkr...,Bruksbegrensninger Ingen ###Ingen,Sikkerhetsnivå Ugradert: Available for general disclosure#########,Tilgangsrestriksjoner Andre restriksjoner: Limitation not listed######Andre restriksjonerÅpne data###Åpne data###Brukerrestriksj...,Seming Haakon SkauNorges vassdrags- og energidirektoratgisstotte@nve.no###NVE - Seksjon for geomatikk og dataanalyse/IGDNorges v...,Produktspesifikasjon###Produktark###Produktside###Tegnforklaring###dekningsoversikt###hjelp###dekningsoversikt rutenett###deknin...
Just do it,126.17570495605467,21492,21492,"Bunnsedimenter (kornstørrelse), detaljert",Datasettet viser kornstørrelsessammensetning i sjøbunnssedimentenes øvre del (øverste 0-10 cm av sjøbunnen). I egenskapstabellen...,Sea regions###Inspire###Norge digitalt###geodataloven###Mareano###ØkologiskGrunnkart###MarineGrunnkart###fellesDatakatalog###Geo...,Bruksbegrensninger Detaljnivået på datasettet tilsier bruk innenfor kartmålestokken: 1:20.000 - 1:150.000. ###Detaljnivået på d...,Sikkerhetsnivå Ugradert: Available for general disclosure#########,Tilgangsrestriksjoner Andre restriksjoner: Limitation not listed######Andre restriksjonerÅpne data###Åpne data###Brukerrestriksj...,Aave LeplandNorges geologiske undersøkelseDataadministrator4773904000Leiv Eirikssons vei 39TrondheimAave.Lepland@ngu.nohttp://ww...,https://register.geonorge.no/produktspesifikasjoner/bunnsedimenter-kornstørrelseProduktspesifikasjon###https://register.geonorge...
Just do it,127.85933685302734,68030,68030,"Bunnsedimenter (kornstørrelse), oversikt",Datasettet viser kornstørrelsessammensetning i sjøbunnssedimentenes øvre del. Kornstørrelsesdata er basert på analyser av sjøbun...,Sea regions###Inspire###Norge digitalt###geodataloven###Mareano###fellesDatakatalog###Geologi###Norge###Nordsjøen###Norskehavet#...,Bruksbegrensninger Detaljnivået på datasettet tilsier bruk innenfor kartmålestokken: 1:2000.000 - 1:10.000.000 ###Detaljnivået ...,Sikkerhetsnivå Ugradert: Available for general disclosure#########,Tilgangsrestriksjoner Andre restriksjoner: Limitation not listed######Andre restriksjonerÅpne data###Åpne data###Brukerrestriksj...,Aave LeplandNorges geologiske undersøkelseDataadministrator4773904000Leiv Eirikssons vei 39TrondheimAave.Lepland@ngu.nohttp://ww...,https://register.geonorge.no/produktspesifikasjoner/bunnsedimenter-kornstørrelseProduktspesifikasjon###https://register.geonorge...
Just do it,127.91626739501952,69607,69607,"Relativ bunnhardhet, åpne data",Relativ bunnhardhet er rasterdata som viser reflektivitetstyrke fra sjøbunnen. Reflektivitetsstyrke sier noe om sjøbunnens akust...,Geology###Åpne data###Norge digitalt###MarineGrunnkart###modellbaserteVegprosjekter###fellesDatakatalog###Geologi###Norge###Bare...,Bruksbegrensninger Ingen begrensninger på bruk er oppgitt. ###Ingen begrensninger på bruk er oppgitt.No conditions apply,Sikkerhetsnivå Ugradert: Available for general disclosure#########,Tilgangsrestriksjoner Andre restriksjoner: Limitation not listed######Andre restriksjonerÅpne data###Åpne data###Brukerrestriksj...,"Aave LeplandNorges geologiske undersøkelseDataadministrator, Maringeologi, NGU4773904000Leiv Eirikssons vei 39TrondheimAave.Lepl...",Produktspesifikasjon###https://register.geonorge.no/register/versjoner/produktark/norges-geologiske-undersokelse/relativ-bunnhar...
Just do it,128.5309600830078,37228,37228,Artsfunn,Datasettet inneholder stedfestet informasjon av arter samlet inn av NINA. Funndataene følger datastandarden Darwin Core Archive ...,Natur###Norge###Svalbard###lav###karplanter###botanikkdata###innsekter###vannlevende innsekter###fisk###vannlevende planter###bi...,###,#########,Tilgangsrestriksjoner Andre restriksjoner: Limitation not listed######Andre restriksjonerÅpne data###Åpne data###Brukerrestriksj...,Frank HansenNorsk institutt for naturforskningfrank.hanssen@nina.no###Frank HansenNorsk institutt for naturforskningfrank.hansse...,Produktspesifikasjon###Produktark###Produktside###Tegnforklaring###dekningsoversikt###hjelp


# Searh article in Medium

## 0. Overview

We'll search for text in the Medium dataset, and it will find the most similar results to the search text across all titles. Searching for articles is different from traditional keyword searches, which search for semantically relevant content. If you search for "**funny python demo**" it will return "**Python Coding for Kids - Setting Up For the Adventure**", not "**No key words about funny python demo**".

We will use Milvus and Towhee to help searches. Towhee is used to extract the semantics of the text and return the text embedding. The Milvus vector database can store and search vectors, and return related articles. So we first need to install [Milvus](https://github.com/milvus-io/milvus) and [Towhee](https://github.com/towhee-io/towhee).

Before getting started, please make sure that you have started a [Milvus service](https://milvus.io/docs/install_standalone-docker.md). This notebook uses [milvus 2.2.10](https://milvus.io/docs/v2.2.x/install_standalone-docker.md) and [pymilvus 2.2.11](https://milvus.io/docs/release_notes.md#2210).

In [None]:
#! pip install --upgrade pip
#! pip3 install -q towhee pymilvus==2.2.11
#! pip3 uninstall pymilvus -y

! pip3 install -q towhee pymilvus==2.1.1
! pip3 show pymilvus | grep -Ei 'Name:|Version:'
! pip3 show towhee | grep -Ei 'Name:|Version:'

## 1. Data preprocessing

The data is from the [Cleaned Medium Articles Dataset](https://www.kaggle.com/datasets/shiyu22chen/cleaned-medium-articles-dataset)(you can download it from Kaggle), which cleared the empty article titles in the data and conver the string title to the embeeding with Towhee [text_embedding.dpr operator](https://towhee.io/text-embedding/dpr), as you can see the `title_vector` is the embedding vectors of the title.

In [None]:
# Download data
! wget -q https://github.com/towhee-io/examples/releases/download/data/New_Medium_Data.csv

In [1]:
import pandas as pd

df = pd.read_csv('New_Medium_Data.csv', converters={'title_vector': lambda x: eval(x)})
df.head()

Unnamed: 0,id,title,title_vector,link,reading_time,publication,claps,responses
0,0,The Reported Mortality Rate of Coronavirus Is ...,"[0.041732933, 0.013779674, -0.027564144, -0.01...",https://medium.com/swlh/the-reported-mortality...,13,The Startup,1100,18
1,1,Dashboards in Python: 3 Advanced Examples for ...,"[0.0039737443, 0.003020432, -0.0006188639, 0.0...",https://medium.com/swlh/dashboards-in-python-3...,14,The Startup,726,3
2,2,How Can We Best Switch in Python?,"[0.031961977, 0.00047043373, -0.018263113, 0.0...",https://medium.com/swlh/how-can-we-best-switch...,6,The Startup,500,7
3,3,Maternity leave shouldn’t set women back,"[0.032572296, -0.011148319, -0.01688577, -0.00...",https://medium.com/swlh/maternity-leave-should...,9,The Startup,460,1
4,4,Python NLP Tutorial: Information Extraction an...,"[-0.011735886, -0.016938083, -0.027233299, 0.0...",https://medium.com/swlh/python-nlp-tutorial-in...,7,The Startup,163,0


## 2. Load Data

The next step is to get the text embedding, and then insert all the extracted embedding vectors into Milvus.

### Create Milvus Collection

We need to create a collection in Milvus first, which contains multiple fields of `id`, `title`, `title_vector`, `link`, `reading_time`, `publication`, `claps` and `responses`.

In [2]:
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

server_host = 'ebjerk.no'
server_port = '19530'

connections.connect(host=server_host, port=server_port)

def create_milvus_collection(collection_name, dim):
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)
    
    fields = [
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
            FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=500),   
            FieldSchema(name="title_vector", dtype=DataType.FLOAT_VECTOR, dim=dim),
            FieldSchema(name="link", dtype=DataType.VARCHAR, max_length=500),
            FieldSchema(name="reading_time", dtype=DataType.INT64),
            FieldSchema(name="publication", dtype=DataType.VARCHAR, max_length=500),
            FieldSchema(name="claps", dtype=DataType.INT64),
            FieldSchema(name="responses", dtype=DataType.INT64)
    ]
    schema = CollectionSchema(fields=fields, description='search text')
    collection = Collection(name=collection_name, schema=schema)
    
    index_params = {
        'metric_type': "L2",
        'index_type': "IVF_FLAT",
        'params': {"nlist": 2048}
    }
    collection.create_index(field_name='title_vector', index_params=index_params)
    return collection

collection = create_milvus_collection('search_article_in_medium', 768)

### Data to Milvus


Towhee supports reading df data through the `from_df` interface, and then we need to convert the `title_vector` column in the data to a two-dimensional list in float format, and then insert all the fields into Milvus, each field inserted into Milvus corresponds to one Collection fields created earlier.

In [3]:
from towhee import ops, pipe, DataCollection

insert_pipe = (pipe.input('df')
                   .flat_map('df', 'data', lambda df: df.values.tolist())
                   .map('data', 'res', ops.ann_insert.milvus_client(host=server_host, 
                                                                    port=server_port,
                                                                    collection_name='search_article_in_medium'))
                   .output('res')
)




In [4]:
%time _ = insert_pipe(df)

CPU times: user 18.6 s, sys: 2.65 s, total: 21.3 s
Wall time: 3min 6s


We need to call `collection.load()` to load the data after inserting the data, then run `collection.num_entities` to get the number of vectors in the collection. We will see the number of vectors is 5979, and we have successfully load the data to Milvus.

In [5]:
collection.load()
collection.num_entities

5979

## 3. Search embedding title

### Search one text in Milvus


The retrieval process also to generate the text embedding of the query text, then search for similar vectors in Milvus, and finally return the result, which contains `id`(primary_key) and `score`. For example, we can search for "funny python demo":

In [6]:
import numpy as np

search_pipe = (pipe.input('query')
                    .map('query', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
                    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
                    .flat_map('vec', ('id', 'score'), ops.ann_search.milvus_client(host=server_host, 
                                                                                   port=server_port,
                                                                                   collection_name='search_article_in_medium'))  
                    .output('query', 'id', 'score')
               )

res = search_pipe('funny python demo')
DataCollection(res).show()

query,id,score
funny python demo,3897,0.3737611174583435
funny python demo,1342,0.4368064999580383
funny python demo,1832,0.4572384059429168
funny python demo,5671,0.4593276083469391
funny python demo,1752,0.4645397365093231


### Search multi text in Milvus

We can also retrieve multiple pieces of data, for example we can specify the array(['funny python demo', 'AI in data analysis']) to search in batch, which will be retrieved in Milvus:

In [7]:
res = search_pipe.batch(['funny python demo', 'AI in data analysis'])
for re in res:
    DataCollection(re).show()

query,id,score
funny python demo,3897,0.3737611174583435
funny python demo,1342,0.4368064999580383
funny python demo,1832,0.4572384059429168
funny python demo,5671,0.4593276083469391
funny python demo,1752,0.4645397365093231


query,id,score
AI in data analysis,3493,0.2443668991327285
AI in data analysis,4542,0.2485119104385376
AI in data analysis,2649,0.284042477607727
AI in data analysis,4539,0.3186832070350647
AI in data analysis,3812,0.3224286139011383


### Search text and return multi fields

If we want to return more information when retrieving, we can set the `output_fields` parameter in [ann_search.milvus operator](https://towhee.io/ann-search/milvus). For example, in addition to `id` and `score`, we can also return `title`, `link`, `claps`, `reading_time`, `and response`:

In [8]:
search_pipe1 = (pipe.input('query')
                    .map('query', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
                    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
                    .flat_map('vec', ('id', 'score', 'title'), ops.ann_search.milvus_client(host=server_host, 
                                                                                   port=server_port,
                                                                                   collection_name='search_article_in_medium',
                                                                                   output_fields=['title']))  
                    .output('query', 'id', 'score', 'title')
               )

res = search_pipe1('funny python demo')
DataCollection(res).show()

query,id,score,title
funny python demo,3897,0.3737611174583435,Python Coding for Kids — Setting Up For the Adventure
funny python demo,1342,0.4368064999580383,How to Design Professional Venn Diagrams in Python
funny python demo,1832,0.4572384059429168,How to mock AWS services for rapid local development.
funny python demo,5671,0.4593276083469391,Adventure into Machine Learning using Python
funny python demo,1752,0.4645397365093231,Custom neural networks in Keras: a street fighter’s guide to build a graphCNN


In [9]:
# milvus search with multi output fields
search_pipe2 = (pipe.input('query')
                    .map('query', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
                    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
                    .flat_map('vec', ('id', 'score', 'title', 'link', 'reading_time', 'publication', 'claps', 'responses'), 
                                       ops.ann_search.milvus_client(host=server_host, 
                                                                    port=server_port,
                                                                    collection_name='search_article_in_medium',
                                                                    output_fields=['title', 'link', 'reading_time', 'publication', 'claps', 'responses'], 
                                                                    limit=5))  
                    .output('query', 'id', 'score', 'title', 'link', 'reading_time', 'publication', 'claps', 'responses')
               )

res = search_pipe2('funny python demo')
DataCollection(res).show()

query,id,score,title,link,reading_time,publication,claps,responses
funny python demo,3897,0.3737611174583435,Python Coding for Kids — Setting Up For the Adventure,https://medium.com/swlh/python-coding-for-kids-setting-up-for-the-adventure-9be4bef6b24e,14,The Startup,119,2
funny python demo,1342,0.4368064999580383,How to Design Professional Venn Diagrams in Python,https://towardsdatascience.com/how-to-design-professional-venn-diagrams-in-python-693c9ed2c288,6,Towards Data Science,97,1
funny python demo,1832,0.4572384059429168,How to mock AWS services for rapid local development.,https://medium.com/swlh/how-to-mock-aws-services-for-rapid-local-development-3d07581ffc3a,3,The Startup,84,0
funny python demo,5671,0.4593276083469391,Adventure into Machine Learning using Python,https://towardsdatascience.com/adventure-into-machine-learning-using-python-7a85fce81b7d,14,Towards Data Science,25,0
funny python demo,1752,0.4645397365093231,Custom neural networks in Keras: a street fighter’s guide to build a graphCNN,https://towardsdatascience.com/custom-neural-networks-in-keras-a-street-fighters-guide-to-build-a-graphcnn-e91f6b05f12e,7,Towards Data Science,55,0


### Search text with some expr


In addition, we can also set some expressions for retrieval. For example, we can specify that the beginning of the article is an article in Python by setting expr='title like "Python%"':

In [10]:
search_pipe3 = (pipe.input('query')
                    .map('query', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
                    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
                    .flat_map('vec', ('id', 'score', 'title', 'link', 'reading_time', 'publication', 'claps', 'responses'), 
                                       ops.ann_search.milvus_client(host=server_host, 
                                                                    port=server_port,
                                                                    collection_name='search_article_in_medium',
                                                                    expr='title like "Python%"',
                                                                    output_fields=['title', 'link', 'reading_time', 'publication', 'claps', 'responses'], 
                                                                    limit=5))  
                    .output('query', 'id', 'score', 'title', 'link', 'reading_time', 'publication', 'claps', 'responses')
               )

res = search_pipe3('funny python demo')
DataCollection(res).show()

query,id,score,title,link,reading_time,publication,claps,responses
funny python demo,3897,0.3737611174583435,Python Coding for Kids — Setting Up For the Adventure,https://medium.com/swlh/python-coding-for-kids-setting-up-for-the-adventure-9be4bef6b24e,14,The Startup,119,2
funny python demo,4644,0.4937489628791809,Python for Finance — The Complete Beginner’s Guide,https://towardsdatascience.com/python-for-finance-the-complete-beginners-guide-764276d74cef,8,Towards Data Science,292,5
funny python demo,2736,0.4956967830657959,Python for Beginners — Basics,https://towardsdatascience.com/python-for-beginners-basics-7ac6247bb4f4,7,Towards Data Science,11,0
funny python demo,1667,0.5019431114196777,Python — How to measure thread execution time in multithreaded application?,https://medium.com/swlh/python-how-to-measure-thread-execution-time-in-multithreaded-application-f4b2e2112091,6,The Startup,55,0
funny python demo,1298,0.5166990756988525,Python Testing with a mock database (SQL),https://medium.com/swlh/python-testing-with-a-mock-database-sql-68f676562461,4,The Startup,51,0


## 4. Query data in Milvus

We have done the text retrieval process before, and we can get articles such as "Python coding for kids - getting ready for an adventure" by retrieving "fun python demos".

We can also do a simple query on the data, we need to set `expr` and `output_fields` with the `collection.query` interface, for example, we can filter out articles with faults greater than 3000 and reading time less than 15 minutes, and submitted to TDS :

In [11]:
collection.query(
  expr = 'claps > 3000 && reading_time < 15 && publication like "Towards Data Science%"', 
  output_fields = ['id', 'title', 'link', 'reading_time', 'publication', 'claps', 'responses'],
  consistency_level='Strong'
)

[{'title': 'Top 3 Python Functions You Don’t Know About (Probably)',
  'link': 'https://towardsdatascience.com/top-3-python-functions-you-dont-know-about-probably-978f4be1e6d',
  'reading_time': 4,
  'publication': 'Towards Data Science',
  'claps': 4400,
  'responses': 20,
  'id': 2572},
 {'title': 'Do You Know Python Has A Built-In Database?',
  'link': 'https://towardsdatascience.com/do-you-know-python-has-a-built-in-database-d553989c87bd',
  'reading_time': 6,
  'publication': 'Towards Data Science',
  'claps': 3500,
  'responses': 8,
  'id': 4639},
 {'title': 'Machine Learning Engineers Will Not Exist In 10 Years.',
  'link': 'https://towardsdatascience.com/machine-learning-engineers-will-not-exist-in-10-years-c9cbbf4472f3',
  'reading_time': 6,
  'publication': 'Towards Data Science',
  'claps': 4600,
  'responses': 73,
  'id': 5766},
 {'title': 'I Thought I Was Mastering Python Until I Discovered These Tricks',
  'link': 'https://towardsdatascience.com/i-thought-i-was-mastering-

## Demo of semantic search

In [12]:
# Variables specifying what column and collection to perform ANN comparrison against
vector_columns = ['title_vector']
collection_name = 'search_article_in_medium'

# What columns to return for view
response_output = ['title', 'link', 'reading_time', 'publication', 'claps', 'responses']


demo_pipe = (pipe.input('query')
                    .map('query', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
                    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
                    .flat_map('vec', ('id', 'score', 'title', 'link', 'reading_time', 'publication', 'claps', 'responses'), 
                                       ops.ann_search.milvus_client(host=server_host, 
                                                                    port=server_port,
                                                                    collection_name=collection_name,
                                                                    vector_field=vector_columns,
                                                                    output_fields=response_output, 
                                                                    limit=5))  
                    .output(*['query', 'score'], *response_output)
               )

print('\n"Just do it" search:')
res_semantic1 = demo_pipe('Just do it')
DataCollection(res_semantic1).show()

print('\n"Assemble" search:')
res_semantic2 = demo_pipe('Assemble')
DataCollection(res_semantic2).show()

print('\n"Show me how i can become a data analyst" search:')
res_semantic3 = demo_pipe('Show me how i can become a data analyst')
DataCollection(res_semantic3).show()


"Just do it" search:


query,score,title,link,reading_time,publication,claps,responses
Just do it,0.5165345072746277,Tune Into Your Body’s Rhythms to Create Your Best Writing Routine,https://medium.com/swlh/tune-into-your-bodys-rhythms-to-create-your-best-writing-routine-4421d97b897c,7,The Startup,96,0
Just do it,0.521692156791687,Get your idea out there!,https://medium.com/swlh/get-your-idea-out-there-d396b9443d2f,11,The Startup,134,0
Just do it,0.5453013181686401,Fundraising – Getting Your Mind Right,https://medium.com/swlh/fundraising-getting-your-mind-right-76e6864670b,4,The Startup,51,0
Just do it,0.5468644499778748,Think Like a Boss and You Will Become One,https://medium.com/swlh/think-like-a-boss-and-you-will-become-one-9236fc5c4b79,8,The Startup,292,1
Just do it,0.5481520295143127,You Can Learn How to Be Creative.,https://medium.com/swlh/you-can-learn-how-to-be-creative-f1894da4bac5,4,The Startup,89,2



"Assemble" search:


query,score,title,link,reading_time,publication,claps,responses
Assemble,0.5605491399765015,"Create A Synthetic Image Dataset — The “What”, The “Why” and The “How”",https://towardsdatascience.com/create-a-synthetic-image-dataset-the-what-the-why-and-the-how-f820e6b6f718,7,Towards Data Science,50,0
Assemble,0.5605491399765015,"Create A Synthetic Image Dataset — The “What”, The “Why” and The “How”",https://towardsdatascience.com/create-a-synthetic-image-dataset-the-what-the-why-and-the-how-f820e6b6f718,7,Towards Data Science,50,0
Assemble,0.5768184661865234,The Planning Process for Your Organization,https://medium.com/swlh/the-planning-process-for-your-organization-acb61c785dfd,4,The Startup,76,0
Assemble,0.5851483345031738,Preparing the data for Transformer pre-training — a write-up,https://towardsdatascience.com/preparing-the-data-for-transformer-pre-training-a-write-up-67a9dc0cae5a,3,Towards Data Science,36,0
Assemble,0.5889052152633667,Creating Async Vue Components,https://medium.com/swlh/creating-async-vue-components-f1c60050270f,3,The Startup,295,0



"Show me how i can become a data analyst" search:


query,score,title,link,reading_time,publication,claps,responses
Show me how i can become a data analyst,0.2777085900306701,How I see a lesson from Flash holds a future of prototyping,https://uxdesign.cc/how-i-see-a-lesson-from-flash-holds-a-future-of-prototyping-9ed1e939232d,11,UX Collective,63,0
Show me how i can become a data analyst,0.2872746884822845,Find your first job as a Data Scientist,https://towardsdatascience.com/find-your-first-job-as-a-data-scientist-81e4401fe5bf,5,Towards Data Science,248,0
Show me how i can become a data analyst,0.2886537313461303,What You’ll Learn in 1 Year as a Data Scientist,https://towardsdatascience.com/what-youll-learn-in-1-year-as-a-data-scientist-b69061639653,9,Towards Data Science,161,2
Show me how i can become a data analyst,0.2963539361953735,Why I love being a data scientist,https://towardsdatascience.com/why-i-love-being-a-data-scientist-b4e2de7292e7,6,Towards Data Science,183,1
Show me how i can become a data analyst,0.2991792261600494,Make Your Data Models Into Websites,https://towardsdatascience.com/make-your-data-models-into-websites-d7260956c6d7,6,Towards Data Science,95,0


In [13]:
# Search by questions

#question_0 = "How can modern software enhance the efficiency of complex computational tasks?"
question_1 = "What are the latest breakthroughs in machines understanding human speech?"
#question_2 = "In what ways can an individual improve their creative expression?"
#question_3 = "What are the key principles in creating a user-friendly digital interface?"
#question_4 = "What factors should entrepreneurs consider for successful business growth in a digital age?"
#question_5 = "What foundational skills are essential for analyzing large datasets effectively?"
#question_6 = "What should newcomers understand before investing in cryptocurrency?"
question_7 = "How does predictive modeling transform decision-making in industries?"
#question_8 = "What strategies are crucial for a brand to stand out in a competitive market?"
question_9 = "How can a company cultivate a culture of trust and innovation among its employees?"

print(f'\n"{question_1}" search:')
res_question1 = demo_pipe(question_1)
DataCollection(res_question1).show()

print(f'\n"{question_7}" search:')
res_question2 = demo_pipe(question_7)
DataCollection(res_question2).show()

print(f'\n"{question_9}" search:')
res_question3 = demo_pipe(question_9)
DataCollection(res_question3).show()


"What are the latest breakthroughs in machines understanding human speech?" search:


query,score,title,link,reading_time,publication,claps,responses
What are the latest breakthroughs in machines understanding human speech?,0.1925591677427292,What do various countries’ healthcare capacities look like?,https://towardsdatascience.com/what-do-various-countries-healthcare-capacities-look-like-1581896a0601,8,Towards Data Science,1400,15
What are the latest breakthroughs in machines understanding human speech?,0.2154440581798553,What can we do to humanise our user’s experience?,https://uxdesign.cc/what-can-we-do-to-humanise-our-users-experience-98f6fda33609,5,UX Collective,31,0
What are the latest breakthroughs in machines understanding human speech?,0.2324463427066803,Where is the AI Strategy for Peace?,https://towardsdatascience.com/where-is-the-ai-strategy-for-peace-a89c1c681fe9,5,Towards Data Science,123,1
What are the latest breakthroughs in machines understanding human speech?,0.2330337762832641,How Quantum is the Uncertainty Principle?,https://medium.com/swlh/how-quantum-is-the-uncertainty-principle-4569eb7a9eb1,8,The Startup,104,6
What are the latest breakthroughs in machines understanding human speech?,0.2340763509273529,What does the Network see?,https://towardsdatascience.com/what-does-the-network-see-4fec5aa4d2eb,5,Towards Data Science,10,0



"How does predictive modeling transform decision-making in industries?" search:


query,score,title,link,reading_time,publication,claps,responses
How does predictive modeling transform decision-making in industries?,0.1813471913337707,How does data science create value for firms?,https://towardsdatascience.com/how-does-data-science-create-value-for-firms-a3e3e5ca86e3,19,Towards Data Science,24,0
How does predictive modeling transform decision-making in industries?,0.19828462600708,How can Machine Learning algorithms include better Causality?,https://medium.com/swlh/how-can-machine-learning-algorithms-include-better-causality-e869ca60e54d,9,The Startup,437,2
How does predictive modeling transform decision-making in industries?,0.19828462600708,How can Machine Learning algorithms include better Causality?,https://medium.com/swlh/how-can-machine-learning-algorithms-include-better-causality-e869ca60e54d,9,The Startup,437,2
How does predictive modeling transform decision-making in industries?,0.202661782503128,How to perform Data Analysis using the CRISP-DM approach?,https://towardsdatascience.com/how-to-perform-data-analysis-using-the-crisp-dm-approach-201708f220b2,6,Towards Data Science,26,0
How does predictive modeling transform decision-making in industries?,0.2252297401428222,Can machine learning help build a better stock portfolio?,https://towardsdatascience.com/can-machine-learning-help-build-a-better-stock-portfolio-8e4b3334a49,8,Towards Data Science,180,1



"How can a company cultivate a culture of trust and innovation among its employees?" search:


query,score,title,link,reading_time,publication,claps,responses
How can a company cultivate a culture of trust and innovation among its employees?,0.1816233992576599,How to Build an Outstanding Company Culture?,https://medium.com/swlh/a-human-oriented-framework-to-build-a-great-company-culture-d97ff49e6766,4,The Startup,45,0
How can a company cultivate a culture of trust and innovation among its employees?,0.1916501969099044,How Can Organizations Learn Effectively?,https://medium.com/swlh/the-keys-to-organizational-learning-9ba46bbcd7bc,7,The Startup,208,1
How can a company cultivate a culture of trust and innovation among its employees?,0.2152410745620727,Why is a clear value proposition essential for any startup or growing business?,https://uxdesign.cc/why-is-a-clear-value-proposition-essential-for-any-startup-or-growing-business-f0fce3446a3f,3,UX Collective,83,0
How can a company cultivate a culture of trust and innovation among its employees?,0.2346673756837844,How does data science create value for firms?,https://towardsdatascience.com/how-does-data-science-create-value-for-firms-a3e3e5ca86e3,19,Towards Data Science,24,0
How can a company cultivate a culture of trust and innovation among its employees?,0.2560636699199676,What Makes a Social Media Campaign Innovative?,https://medium.com/swlh/what-makes-a-social-media-campaign-innovative-2f65d8c51ab,4,The Startup,51,0
