Searching for elements that have features in common with the query.
```python
query = ['A', 'B']

elements = [
    ['N', 'A', 'M'],  # common features: 'A'
    ['C', 'B', 'A'],  # common features: 'A', 'B'
    ['X', 'Y'],       # no common features
]
```
In this case, the search will return `['C', 'B', 'A']` and then `['N', 'A', 'M']`, in that particular order.
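To make this concrete, the same example can be expressed with the `SmoothFts` API shown in the next example; the doc IDs here are arbitrary labels chosen for illustration:

```python
from gifts import SmoothFts

fts = SmoothFts()

# each element becomes a "document"; the IDs are arbitrary labels
fts.add(['N', 'A', 'M'], doc_id="nam")
fts.add(['C', 'B', 'A'], doc_id="cba")
fts.add(['X', 'Y'], doc_id="xy")

# expected to print "cba" first (two common features), then "nam";
# "xy" has nothing in common with the query and is not returned
for doc_id in fts.search(['A', 'B']):
    print(doc_id)
```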
Finding documents that contain words from the query.
```python
from gifts import SmoothFts

fts = SmoothFts()

fts.add(["wait", "mister", "postman"], doc_id="doc1")
fts.add(["please", "mister", "postman", "look", "and", "see"], doc_id="doc2")
fts.add(["oh", "yes", "wait", "a", "minute", "mister", "postman"], doc_id="doc3")

# print IDs of documents in which at least one word of the query occurs,
# starting with the most relevant matches
for doc_id in fts.search(['postman', 'wait']):
    print(doc_id)
```
In the examples above, the words were literally words represented as strings. But they can be any hashable objects, that is, anything suitable as a dict key.
```python
from gifts import SmoothFts

fts = SmoothFts()

fts.add([3, 1, 4, 1, 5, 9, 2], doc_id="doc1")
fts.add([6, 5, 3, 5], doc_id="doc2")
fts.add([8, 9, 7, 9, 3, 2], doc_id="doc3")

for doc_id in fts.search([5, 3, 7]):
    print(doc_id)
```
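Since the only requirement is dict-key suitability, composite objects work as well. As a usage sketch (an assumption about how the API can be applied, not a documented feature), word bigrams represented as tuples:

```python
from gifts import SmoothFts

fts = SmoothFts()

# tuples are hashable, so word bigrams can serve as "words"
fts.add([("mister", "postman"), ("postman", "wait")], doc_id="doc1")
fts.add([("please", "mister"), ("mister", "postman")], doc_id="doc2")

for doc_id in fts.search([("mister", "postman")]):
    print(doc_id)
```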
When ranking the results, the algorithm takes into account:
- the number of matching words
- the rarity of such words in the database
- the frequency of occurrence of words in the document
```python
from gifts import SmoothFts
```

`SmoothFts` uses logarithmic tf-idf for weighting the words and cosine similarity for scoring the matches.
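For intuition, here is a minimal sketch of this kind of scoring in plain Python. It is not the library's actual code; the exact logarithm and smoothing variants (`1 + log(count)` for tf, `log(1 + N/df)` for idf) are assumptions:

```python
import math
from collections import Counter

docs = {
    "doc1": ["wait", "mister", "postman"],
    "doc2": ["please", "mister", "postman", "look", "and", "see"],
    "doc3": ["oh", "yes", "wait", "a", "minute", "mister", "postman"],
}

n_docs = len(docs)
# document frequency: in how many documents each word occurs (rarity)
df = Counter(w for words in docs.values() for w in set(words))

def tf_idf(words):
    # logarithmic tf (1 + log count) times smoothed idf (log(1 + N/df))
    counts = Counter(w for w in words if w in df)
    return {w: (1 + math.log(c)) * math.log(1 + n_docs / df[w])
            for w, c in counts.items()}

def cosine(a, b):
    # cosine similarity of two sparse vectors stored as dicts
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = tf_idf(["postman", "wait"])
for doc_id, words in docs.items():
    print(doc_id, round(cosine(query, tf_idf(words)), 3))
```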
```python
from gifts import SimpleFts
```

Minimalistic approach: weigh, multiply, compare. This object is noticeably faster than `SmoothFts`.
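Read purely as a sketch (the actual implementation may differ), "weigh, multiply, compare" could mean skipping the cosine normalization and comparing documents by the raw sum of products of query and document word weights:

```python
def dot_score(query_weights: dict, doc_weights: dict) -> float:
    # multiply the weights of words shared by the query and the document,
    # then sum the products; documents are ranked by this raw score
    return sum(query_weights[w] * doc_weights[w]
               for w in query_weights.keys() & doc_weights.keys())
```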
Install directly from the Git repository:

```bash
pip3 install git+https://github.com/rtmigo/gifts_py#egg=gifts
```

Or add the package as a dependency in `setup.py`:

```python
install_requires = [
    "gifts@ git+https://github.com/rtmigo/gifts_py"
]
```
The `skifts` package does the same search, but uses `scikit-learn` and `numpy` for better performance. It is literally hundreds of times faster.