This repository has been archived by the owner on Mar 25, 2023. It is now read-only.

tos-kamiya/dvg




⚠️ dvg is incompatible with CPython 3.11, because some of its dependencies are. Reference: Python 3.11 Readiness https://pyreadiness.org/3.11/

⚠️ In version 1.0.0b9, the model files have been revamped and now have a larger vocabulary and (vector) dimension. Due to PyPI space limitations, model files are not included in the distribution package; they are downloaded from a website (https://toshihirokamiya.com/) the first time you run dvg.

ℹ️ An alpha version of stng (PyPI, GitHub) has been released. It is a CLI tool similar to dvg, but it uses a Sentence-Transformer model. It is heavy for ordinary PCs, although this depends on GPU performance.

dvg

dvg is an off-the-shelf grep-like tool that performs semantic similarity search, for Windows, macOS, and Ubuntu.

Using SCDV models, dvg searches for document files that contain parts similar to the query. It supports searching within text files (.txt), PDF files (.pdf), and MS Word files (.docx).

Installation

In most cases dvg can be installed with pip install dvg, but if you want to search PDF files or Japanese documents in addition to English ones, you need to install optional extras.

Installation on Ubuntu / macOS
Installation on Windows

TL;DR (typical usage)

Search for the document files similar to the query phrase.

dvg -v -m en <query_phrase> <document_files>...

Example of search:

Each line of output is, from left to right, similarity (the closer the number is to 1, the higher the similarity), length (characters) of the paragraph, file name, and range of line numbers.
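To make the similarity column easier to interpret, here is a minimal sketch of cosine similarity between two vectors. It assumes the score is a cosine similarity between the query vector and a paragraph vector, which is common for SCDV-style embeddings but is not confirmed by this README; the function name and the sample vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Hypothetical vectors, not produced by a real dvg model.
query_vec = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # points the same way as the query
orthogonal = [0.0, 0.0, 3.0]      # shares no component with the query

print(cosine_similarity(query_vec, same_direction))
print(cosine_similarity(query_vec, orthogonal))
```

Vectors that point in the same direction score close to 1, and unrelated vectors score close to 0, which matches the reading of the first output column described above.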

Command-line options

dvg has several options. Here are some options that may be used frequently.

-v, --verbose
Verbose option. If specified, dvg shows the documents with the highest similarity found so far while the search is in progress.

-m MODEL, --model=MODEL
The available models are en (for English documents) and ja (for Japanese documents).

-k NUM, --top-k=NUM
Show top NUM documents as results. The default value is 20. Specify 0 to show all the documents searched, sorted by the degree of match to the query.
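The behavior of --top-k can be sketched as follows. This is only an illustration under assumptions, not dvg's actual code: the function name top_k and the (similarity, file name) pair layout are hypothetical.

```python
import heapq


def top_k(scored_docs, k=20):
    """Return (similarity, name) pairs, best first. k == 0 means "all"."""
    if k == 0:
        return sorted(scored_docs, key=lambda pair: pair[0], reverse=True)
    # nlargest returns its result already sorted from largest to smallest.
    return heapq.nlargest(k, scored_docs, key=lambda pair: pair[0])


# Hypothetical scores for three document files.
results = [(0.41, "a.txt"), (0.87, "b.pdf"), (0.63, "c.docx")]

print(top_k(results, 2))
print(top_k(results, 0))
```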

-p, --paragraph
If this option is specified, each paragraph in a document file is treated as a separate document, so several paragraphs from the same file can appear in the search results. If it is not specified, each document file is treated as one document and appears in the search results at most once.

-w NUM, --window=NUM
The number of lines that make up one paragraph: each chunk of NUM lines is treated as a paragraph. The default value is 20.
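As a rough sketch of what a line window means, the function below cuts a list of lines into consecutive chunks and reports each chunk's 1-based line range. It assumes non-overlapping chunks; whether dvg overlaps or slides its windows is not stated here, and split_into_windows is a hypothetical name.

```python
def split_into_windows(lines, window=20):
    """Yield (first_line, last_line, text) for consecutive chunks of lines."""
    if window < 1:
        raise ValueError("window must be at least 1")
    for start in range(0, len(lines), window):
        chunk = lines[start:start + window]
        # Line numbers are 1-based and inclusive; the last chunk may be shorter.
        yield start + 1, start + len(chunk), "\n".join(chunk)


# A hypothetical 45-line document.
lines = [f"line {i}" for i in range(1, 46)]
for first, last, _text in split_into_windows(lines, window=20):
    print(f"{first}-{last}")
```

A larger window gives each paragraph more context at the cost of coarser line ranges in the results.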

-f QUERYFILE, --query-file=QUERYFILE
Read the query text from a file. Like the document files, the query file can be a PDF as well as a text file.

(In my experience, when the query is given as a file, increasing the paragraph size with the --window option, e.g. -w 80, tends to give better results.)

-i TEXT, --include=TEXT
Only paragraphs that contain the specified string will be included in the search results.

-e TEXT, --exclude=TEXT
Only paragraphs that do not contain the specified string will be included in the search results.
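The combined effect of --include and --exclude can be pictured with a small filter. This is a sketch under assumptions: it uses plain case-sensitive substring matching and accepts several strings for each option, neither of which is confirmed by the text above, and passes_filters is a hypothetical name.

```python
def passes_filters(paragraph, include=(), exclude=()):
    """True if the paragraph has every include string and no exclude string."""
    has_all_required = all(text in paragraph for text in include)
    has_forbidden = any(text in paragraph for text in exclude)
    return has_all_required and not has_forbidden


# Hypothetical paragraphs.
paragraphs = [
    "semantic search with document vectors",
    "keyword search with regular expressions",
    "installing the package on Ubuntu",
]
kept = [p for p in paragraphs if passes_filters(p, include=["search"], exclude=["regular"])]
print(kept)
```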

-l CHARS, --min-length=CHARS
Paragraphs shorter than this value get their similarity values lowered. You can use this to exclude short paragraphs from the search results. The default value is 80.
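How much the similarity of a short paragraph is lowered is not specified here. The sketch below shows one possible scheme, a linear scale-down in proportion to the paragraph's length, purely as an assumption for illustration; penalize_short is a hypothetical name and dvg may use a different formula.

```python
def penalize_short(similarity, paragraph_length, min_length=80):
    """Scale similarity down linearly for paragraphs shorter than min_length."""
    if min_length <= 0 or paragraph_length >= min_length:
        return similarity  # long enough: score is left unchanged
    return similarity * paragraph_length / min_length


print(penalize_short(0.8, 100))  # at or above the threshold
print(penalize_short(0.8, 40))   # half the threshold length
```

Under this scheme a very short paragraph cannot outrank a full-length one with the same raw similarity, which is the effect the option describes.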

-t CHARS, --excerpt-length=CHARS
The length of the excerpt displayed in the rightmost column of the search results. The default value is 80.

-q, --quote
Show the entire paragraph (without excerpts) of search results.

-H, --header
Add a heading line to the output.

-j NUM, --worker=NUM
Number of worker processes. Use this option to run the search in parallel.

Hints

Search individual lines of a text file

Improve search speed

Troubleshooting

An error message like "ModuleNotFoundError: No module named 'docopt'"

An error message like "dvg: command not found"

A warning message "None of PyTorch, TensorFlow >= 2.0, or Flax have been found..."

Aborted by segmentation fault (SIGSEGV)

Acknowledgements

Thanks to Wikipedia for releasing a huge corpus of languages:
https://dumps.wikimedia.org/

License

dvg is distributed under the BSD 2-Clause License.

Links

Todo

  • Change the PDF text extraction tool to Ghostscript for easier installation on Windows