Search for the most relevant documents containing words from a query. Pure Python implementation without dependencies.

rtmigo/gifts_py

Searching for elements that have features in common with the query.

query = ['A', 'B']

elements = [
    ['N', 'A', 'M'],  # common features: 'A'
    ['C', 'B', 'A'],  # common features: 'A', 'B'  
    ['X', 'Y']  # no common features
]

In this case, the search will return ['C', 'B', 'A'] and ['N', 'A', 'M'], in that order.
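To make the idea concrete, here is a naive sketch that ranks elements purely by how many features they share with the query. It is an illustration only, not the library's algorithm, which also weighs features by rarity and frequency. The function name `rank_by_overlap` is hypothetical.

```python
def rank_by_overlap(query, elements):
    """Return elements sharing at least one feature with the query,
    those with the most shared features first (hypothetical helper)."""
    query_set = set(query)
    scored = []
    for element in elements:
        common = query_set & set(element)
        if common:  # drop elements with no common features
            scored.append((len(common), element))
    # list.sort() is stable, so ties keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [element for _, element in scored]


query = ['A', 'B']
elements = [
    ['N', 'A', 'M'],
    ['C', 'B', 'A'],
    ['X', 'Y'],
]

print(rank_by_overlap(query, elements))
# [['C', 'B', 'A'], ['N', 'A', 'M']]
```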

Use for full-text search

Finding documents that contain words from the query.

from gifts import SmoothFts

fts = SmoothFts()

fts.add(["wait", "mister", "postman"],
        doc_id="doc1")

fts.add(["please", "mister", "postman", "look", "and", "see"],
        doc_id="doc2")

fts.add(["oh", "yes", "wait", "a", "minute", "mister", "postman"],
        doc_id="doc3")

# print IDs of documents in which at least one word of the query occurs, 
# starting with the most relevant matches
for doc_id in fts.search(['postman', 'wait']):
    print(doc_id)

Use for abstract data mining

In the example above, the words were literal strings. But they can be any objects suitable as dict keys, that is, any hashable objects.

from gifts import SmoothFts

fts = SmoothFts()

fts.add([3, 1, 4, 1, 5, 9, 2], doc_id="doc1")
fts.add([6, 5, 3, 5], doc_id="doc2")
fts.add([8, 9, 7, 9, 3, 2], doc_id="doc3")

for doc_id in fts.search([5, 3, 7]):
    print(doc_id)
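Because only hashability matters, features can also be composite values such as tuples. The sketch below builds word pairs from a token list. The `bigrams` helper is hypothetical and not part of the library; passing its output to `fts.add` is an assumption based on the "dict keys" rule above.

```python
def bigrams(words):
    """Pair each word with its successor (hypothetical helper).
    Tuples of strings are hashable, so they can serve as dict keys."""
    return list(zip(words, words[1:]))


features = bigrams(["wait", "mister", "postman"])
print(features)
# [('wait', 'mister'), ('mister', 'postman')]

# Each feature works as a dict key, which is the stated requirement
lookup = {feature: index for index, feature in enumerate(features)}
```

A list, by contrast, is mutable and unhashable, so a feature such as `['wait', 'mister']` would need to be converted to a tuple first.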

Implementation details

When ranking the results, the algorithm takes into account:

  • the number of matching words
  • the rarity of such words in the database
  • the frequency of occurrence of words in the document

SmoothFts

from gifts import SmoothFts

It uses logarithmic tf-idf for weighting the words and cosine similarity for scoring the matches.
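The sketch below shows one common variant of logarithmic tf-idf combined with cosine similarity. It is not the library's code. The exact formulas used by SmoothFts may differ, and the weighting `(1 + log tf) * log(1 + N / df)` as well as the names `tfidf_vector`, `cosine` and `rank` are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vector(words, doc_freq, num_docs):
    """Weigh each known word as (1 + log tf) * log(1 + N / df).
    This particular smoothing is an assumption, not the library's formula."""
    vector = {}
    for word, tf in Counter(words).items():
        df = doc_freq.get(word, 0)
        if df:  # ignore words that never occur in the corpus
            vector[word] = (1 + math.log(tf)) * math.log(1 + num_docs / df)
    return vector


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return IDs of documents with a positive score, best match first."""
    num_docs = len(docs)
    # Document frequency: in how many documents each word occurs
    doc_freq = Counter(w for words in docs.values() for w in set(words))
    query_vec = tfidf_vector(query, doc_freq, num_docs)
    scores = {
        doc_id: cosine(query_vec, tfidf_vector(words, doc_freq, num_docs))
        for doc_id, words in docs.items()
    }
    matched = [doc_id for doc_id, score in scores.items() if score > 0]
    return sorted(matched, key=scores.get, reverse=True)


docs = {
    "doc1": ["wait", "mister", "postman"],
    "doc2": ["please", "mister", "postman", "look", "and", "see"],
    "doc3": ["oh", "yes", "wait", "a", "minute", "mister", "postman"],
}

for doc_id in rank(["postman", "wait"], docs):
    print(doc_id)
```

In this sketch the idf term rewards rare words, the logarithm of tf dampens repeated occurrences, and the cosine normalization keeps long documents from winning merely by containing more words.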

SimpleFts

from gifts import SimpleFts

Minimalistic approach: weigh, multiply, compare. This object is noticeably faster than SmoothFts.

Install

pip

pip3 install git+https://github.com/rtmigo/gifts_py#egg=gifts

setup.py

install_requires = [
    "gifts@ git+https://github.com/rtmigo/gifts_py"
]

See also

The skifts package does the same search, but uses scikit-learn and numpy for better performance. It is literally hundreds of times faster.
