<a href="https://colab.research.google.com/github/romellfudi/medium/blob/main/Milvus_Data's_Hidden_Relationships.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#@title Installing python dependencies
%%capture
!pip install -U sentence-transformers "milvus[client]" -q

In [2]:
# Restart the kernel/runtime here so the freshly installed packages are picked up

In [3]:
#@title Download the free embedding model from the Hugging Face Hub
from transformers import AutoTokenizer, AutoModel
our_hf_tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")
our_hf_model = AutoModel.from_pretrained("thenlper/gte-base")

In [4]:
#@title Build the embedding function with the HF model
import torch

def embedding_text(input_text, hf_tokenizer=our_hf_tokenizer, hf_model=our_hf_model):
    batch_dict = hf_tokenizer([input_text], max_length=512, padding=True, truncation=True, return_tensors='pt')
    # Inference only: skip gradient tracking to save memory and time
    with torch.no_grad():
        outputs = hf_model(**batch_dict)
    # Masked mean pooling: zero out padded positions, then average over the token dimension
    last_hidden = outputs.last_hidden_state.masked_fill(~batch_dict['attention_mask'][..., None].bool(), 0.0)
    torch_embeddings_list = last_hidden.sum(dim=1) / batch_dict['attention_mask'].sum(dim=1)[..., None]
    # Return only the first element of the batch (a single sentence was passed to the model)
    # and convert the PyTorch tensor into a plain list of floats
    return torch_embeddings_list[0].tolist()  # 768 dimensions
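
The masked mean pooling used above can be easier to follow on a toy example. The sketch below is only an illustration in plain Python lists, not the notebook's PyTorch code: `masked_mean` is a hypothetical helper, and it assumes at least one position has a mask value of 1.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors, ignoring positions whose mask value is 0."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # Assumes kept > 0, i.e. the sentence has at least one real token
    return [total / kept for total in totals]


# Three token positions; the last one is padding (mask = 0) and must not
# influence the sentence embedding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Dividing by the number of real tokens rather than the padded length is what keeps the padding vector from dragging the average.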

In [5]:
# @title Testing the embedding function
from pprint import pprint, pformat
whatever = 'Milvus Vector Database is an open-source, \
highly optimized database designed for managing and searching large-scale vector data. \
It is specifically designed to handle high-dimensional vectors, \
which are commonly used in machine learning and similarity search applications.'

embeddings = embedding_text(whatever)
print(f"The vector of embeddings has a dimension of {len(embeddings)}")
pformat(embeddings)

The vector of embeddings has a dimension of 768


'[0.0763445571064949,\n 0.10579398274421692,\n 0.26722753047943115,\n 0.36061158776283264,\n 1.117963433265686,\n 0.03725322335958481,\n 0.6832746863365173,\n 0.4466577470302582,\n -0.631479799747467,\n -1.0175580978393555,\n 0.06265023350715637,\n -0.08705537766218185,\n -1.378287672996521,\n 0.43031516671180725,\n -0.5112660527229309,\n 1.0260306596755981,\n 0.32637229561805725,\n 0.4085122048854828,\n 0.24843551218509674,\n 0.20211558043956757,\n -0.23423005640506744,\n -0.35746651887893677,\n -0.3442435562610626,\n 0.5704398155212402,\n 0.2914879024028778,\n 0.5023062229156494,\n 0.3192705512046814,\n 0.3030664324760437,\n -0.994010329246521,\n 0.1679639220237732,\n 0.1808314025402069,\n -0.1928321123123169,\n -0.1262018382549286,\n -0.31704196333885193,\n -0.3159347474575043,\n 0.009828869253396988,\n -0.43610915541648865,\n -0.042592741549015045,\n -0.09506665915250778,\n -0.6257979869842529,\n -0.17251381278038025,\n -0.21260744333267212,\n -0.5072826147079468,\n 0.1448932886123

In [6]:
#@title You can find the dataset on [Kaggle](https://www.kaggle.com/datasets/akash14/news-category-dataset)
!gdown --id 1i0mJxvSwa29Ucp7es0imC4JaPAzyMw-v
!unzip news_dataset.zip

Downloading...
From: https://drive.google.com/uc?id=1i0mJxvSwa29Ucp7es0imC4JaPAzyMw-v
To: /content/news_dataset.zip
100% 2.78M/2.78M [00:00<00:00, 225MB/s]
Archive:  news_dataset.zip
  inflating: Data_Test.csv           
  inflating: Data_Train.csv          


In [7]:
#@title Read the articles into a list
import csv

csv_file_path = 'Data_Train.csv'  # Replace with the actual path to your CSV file

with open(csv_file_path, 'r', encoding='latin-1') as csvfile:
    csvreader = csv.reader(csvfile)
    list_of_text = [row[0] for row in csvreader][1:1000:3]

print(f"We read {len(list_of_text)} news")
longest_length = max(len(item) for item in list_of_text)
print("The length of the longest string is:", longest_length)

list_of_text[0]  # inspect the first article

We read 333 news
The length of the longest string is: 3754


'But the most painful was the huge reversal in fee income, unheard of among private sector lenders. Essentially, it means that Yes Bank took it for granted that fees on structured loan deals will be paid and accounted for upfront on its books. As borrowers turned defaulters, the fees tied to these loan deals fell off the cracks. Gill has now vowed to shift to a safer accounting practice of amortizing fee income rather than booking these upfront.\n\n\nGill\x92s move to mend past ways means that there will be no nasty surprises in the future. This is good news considering that investors love a clean image and loathe uncertainties.\n\n\nBut there is no gain without pain and the promise of a strong and stable balance sheet comes with some sacrifices as well. Investors will have to give up the hopes of phenomenal growth, a promise made by Kapoor.'
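
The slice `[1:1000:3]` in the cell above skips the header row and then keeps every third row up to index 999, which is where the count of 333 comes from. A minimal sketch with a stand-in list (the integers here are placeholders for CSV rows, not real data):

```python
# Stand-in for the parsed CSV rows; index 0 plays the role of the header row.
rows = list(range(1000))

# Same slice as in the notebook: start at 1, stop before 1000, step 3.
sampled = rows[1:1000:3]

print(len(sampled))  # 333
print(sampled[:4])   # [1, 4, 7, 10]
print(sampled[-1])   # 997
```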

In [8]:
#@title Initialize the Milvus server
from milvus import default_server
from pymilvus import connections, utility

# Optional: uncomment to clean up data from previous runs
# default_server.cleanup()

# start the Milvus server
default_server.start()

# create a connection
connections.connect(host='127.0.0.1', port=default_server.listen_port)
# check that the connection works as expected by displaying the server version
display(utility.get_server_version())

'v2.3.1-lite'

In [9]:
#@title Create the schema of the collection
from pymilvus import FieldSchema, CollectionSchema, DataType, Collection
COLLECTION_NAME = "News"
DIMENSION = 768
# Drop the collection if it already exists so we start from a clean state
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)

fields = [
    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name='full_text', dtype=DataType.VARCHAR, max_length=4000),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)

In [10]:
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()  # load the collection into memory so it can be searched

In [11]:
#@title Let's ingest the data
def ingest_plain_text(list_of_text, milvus_collection):
    # Column-oriented entities: one list per field that is not auto-generated
    list_of_entities = [
        [],  # full_text values
        [],  # embedding vectors
    ]
    for extract_text in list_of_text:
        embeddings = embedding_text(extract_text)  # a 768-dimensional vector
        list_of_entities[0].append(extract_text)
        list_of_entities[1].append(embeddings)

    # Use the collection passed as an argument rather than the global one
    mutation_result = milvus_collection.insert(list_of_entities)
    if mutation_result.succ_count:
        milvus_collection.flush()
        print(f"{mutation_result.succ_count} entities were inserted")
    else:
        print("No entity was inserted.")

ingest_plain_text(list_of_text, collection)

333 entities were inserted
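
The ingestion cell sends all 333 entities in a single `insert` call, which is fine at this size. For larger datasets, one option is to split the texts into fixed-size batches and insert each batch separately. The sketch below shows only the batching step; `chunked` is a hypothetical helper and the batch size of 100 is an arbitrary choice, not a Milvus recommendation.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder texts standing in for the 333 articles read earlier.
texts = [f"article {i}" for i in range(333)]
sizes = [len(batch) for batch in chunked(texts, 100)]
print(sizes)  # [100, 100, 100, 33]
```

Each batch could then be passed to a function such as `ingest_plain_text`, so that no single insert request grows with the size of the dataset.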


In [12]:
#@title Searching for Text Related to the Input Query
embeds = embedding_text("The artificial intelligence")  # embed the query text
TOP_K = 3
result = collection.search(
    [embeds],  # search expects a list of query vectors
    anns_field='embedding',
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=TOP_K,  # number of nearest neighbours to return
    output_fields=['full_text'],  # scalar fields returned alongside each hit
)
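
Conceptually, the search ranks the stored vectors by their distance to the query and returns the `limit` closest ones. The sketch below is an exhaustive baseline in plain Python, not how Milvus is implemented: an `IVF_FLAT` index scans only `nprobe` of the `nlist` clusters, so its results can be approximate. The helper names are hypothetical, and the use of squared Euclidean distance (no square root) is an assumption about how the `L2` metric is reported.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query, vectors, top_k):
    """Return the top_k (distance, index) pairs, closest first."""
    scored = [(squared_l2(query, vector), idx) for idx, vector in enumerate(vectors)]
    return heapq.nsmallest(top_k, scored)


# Four toy 2-dimensional "embeddings"; index 2 is far away from the query.
stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
hits = brute_force_search([0.0, 0.0], stored, top_k=3)
print(hits)  # [(0.0, 0), (0.25, 3), (2.0, 1)]
```

With an L2-style metric, smaller distances mean more similar vectors, which is why the hits in the notebook output are listed in increasing order of `distance`.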

In [14]:
#@title Displaying the results
for hits in result:  # one list of hits per query vector
    for hit in hits:
        pprint(str(hit.entity), width=120)

("id: 444665525477049575, distance: 85.1231918334961, entity: {'full_text': '(From left) Anushka Shetty, co-founder, "
 'Plop; Jo Aggarwal, co-founder, Wysa; Shreya Mishra, co-founder, Flyrobe\\n\\n\\nDigital Dossier has identified five '
 'startups that are using advanced technologies including (artificial intelligence) AI and machine learning to provide '
 'a gamut of solutions in diverse areas\\n\\n\\nmint-india-wire tech startupstech startups by womenAnushka '
 'ShettyPlopWysaJo AggarwalFlyrobeInclovEspresso LabsKalyani KhonaPallavi Gupta\\n\\n\\nNew Delhi: Digital Dossier has '
 'identified five startups that are using advanced technologies including (artificial intelligence) AI and machine '
 'learning to provide a gamut of solutions in diverse areas. All of these startups are co-founded by women who are '
 "passionate about the cause they pursue.'}")
('id: 444665525477049558, distance: 86.152587890625, entity: {\'full_text\': \'"The programme will enhance '
 'capabilities of the