DUPLICATE DETECTION

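The goal of this project is to calculate the similarity between texts and to find duplicate texts using a thresholding method: two texts are treated as duplicates when their similarity exceeds a chosen threshold. The project consists of two stages: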

Text Embeddings
Similarity Search Using Vector Database (Milvus)

Text Embeddings

First, each sentence is converted into an embedding using a sentence transformer, a BERT-based model that takes bi-directional context into account. The embeddings of all sentences in a text are then averaged to give a single vector for that text. In this way the semantic meaning of the unstructured text is captured. Texts with similar content lie closer together in the multi-dimensional embedding space. The cosine similarity metric is used to compare texts.
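
The averaging step described above can be sketched as follows. This is a minimal illustration, not the repository's own code: `mean_pool`, `embed_text`, and `load_encoder` are hypothetical helper names, and the model name `all-MiniLM-L6-v2` is an assumed choice, since this README does not say which sentence transformer model is used.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average equally sized vectors element-wise."""
    if not vectors:
        raise ValueError("need at least one sentence embedding")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all sentence embeddings must have the same size")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embed_text(
    sentences: Sequence[str],
    encode: Callable[[Sequence[str]], Sequence[Sequence[float]]],
) -> Vector:
    """Embed every sentence with `encode`, then average into one text vector."""
    return mean_pool(encode(sentences))


def load_encoder(model_name: str = "all-MiniLM-L6-v2"):
    """Return an `encode` callable backed by sentence-transformers.

    The model name is an assumption made for this sketch.
    """
    # Imported lazily so the helpers above work without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda sentences: model.encode(list(sentences)).tolist()
```

With the `sentence-transformers` package installed, `embed_text(["First sentence.", "Second sentence."], load_encoder())` would return one vector for the whole text. Passing the encoder in as a parameter keeps the averaging logic independent of the model.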

The cosine similarity between two embedding vectors A and B is calculated as follows.

cos(θ) = (A • B) / (‖A‖ * ‖B‖)
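
The formula and the thresholding idea can be illustrated with a short, dependency-free sketch. The function names and the threshold value `0.9` are assumptions made for this example; this README does not state which threshold the project uses.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(θ) = (A • B) / (‖A‖ * ‖B‖)"""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def find_duplicates(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.9,  # assumed value, tune for the data at hand
) -> List[Tuple[int, int]]:
    """Return index pairs whose cosine similarity reaches the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(embeddings), 2)
        if cosine_similarity(a, b) >= threshold
    ]
```

This brute-force comparison examines every pair, so its cost grows quadratically with the number of texts. That cost is the reason a vector database is used for the search stage.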

Milvus

The Milvus vector database is designed to store, manage, and index high-dimensional vector embeddings. It is used here to accelerate the similarity search. Milvus is installed using Docker Compose. Details can be found at the following link:

https://milvus.io/docs/v2.0.x/install_standalone-docker.md
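
Once Milvus is running, the search stage might look roughly like the sketch below, written against the `pymilvus` client. It is an illustration under stated assumptions, not the contents of `vectordatabase.py`: the collection name `texts`, the field names, the dimension `384`, the `IVF_FLAT` index with `nlist=128`, `nprobe=10`, and the threshold `0.9` are all hypothetical choices. Milvus 2.0.x offers L2 and inner product (IP) metrics for float vectors, so the sketch normalizes each vector to unit length, which makes the inner product equal to the cosine similarity.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so that inner product equals cosine."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def build_collection(name="texts", dim=384, host="localhost", port="19530"):
    """Connect to a standalone Milvus and create a collection (names assumed)."""
    # Imported lazily so normalize() can be used without pymilvus installed.
    from pymilvus import (
        Collection, CollectionSchema, DataType, FieldSchema, connections,
    )

    connections.connect(alias="default", host=host, port=port)
    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim),
    ]
    schema = CollectionSchema(fields, description="text embeddings")
    return Collection(name=name, schema=schema)


def index_texts(collection, ids: Sequence[int], embeddings) -> None:
    """Insert unit-length embeddings, build an index, and load the collection."""
    collection.insert([list(ids), [normalize(v) for v in embeddings]])
    collection.create_index(
        field_name="embedding",
        index_params={
            "index_type": "IVF_FLAT",  # assumed index type
            "metric_type": "IP",
            "params": {"nlist": 128},
        },
    )
    collection.load()


def search_duplicates(
    collection, query: Sequence[float], threshold: float = 0.9, limit: int = 10
) -> List[Tuple[int, float]]:
    """Return (id, similarity) for stored texts at or above the threshold."""
    results = collection.search(
        data=[normalize(query)],
        anns_field="embedding",
        param={"metric_type": "IP", "params": {"nprobe": 10}},
        limit=limit,
    )
    return [(hit.id, hit.distance) for hit in results[0] if hit.distance >= threshold]
```

A typical flow would be `collection = build_collection()`, then `index_texts(collection, ids, embeddings)`, then `search_duplicates(collection, new_embedding)` for each incoming text. The `dim` value must match the output size of the embedding model in use.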

serhatkurtt/Vector-Database-Milvus
