Commit
Yeray Diaz Diaz
committed
Apr 6, 2018
1 parent
7507e33
commit 046d1d2
Showing
6 changed files
with
168 additions
and
0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
# Lunr.py 🌖 | ||
|
||
A Python implementation of [Lunr.js](https://lunrjs.com) by [Oliver Nightingale](https://github.com/olivernn). | ||
|
||
> A bit like Solr, but much smaller and not as bright. | ||
This Python version of Lunr.js aims to bring its simple and powerful full text search capabilities to Python, with results as close to the original implementation as possible. | ||
|
||
## What does this even do? | ||
|
||
Lunr is a simple full text search solution for situations where deploying a full scale solution like Elasticsearch isn't possible or viable, or where you're simply prototyping. | ||
|
||
Lunr parses a set of documents and creates an inverted index for quick full text searches. | ||
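For readers unfamiliar with the term, the following is a minimal sketch of an inverted index in plain Python. The `build_inverted_index` helper and the two sample documents are hypothetical and only illustrate the data structure; Lunr.py's real index also stems tokens, filters stop words and stores scoring data.

```python
from collections import defaultdict


def build_inverted_index(documents, ref, fields):
    """Map each lowercased token to the set of document refs containing it."""
    index = defaultdict(set)
    for doc in documents:
        for field in fields:
            for token in doc[field].lower().split():
                # Strip trailing punctuation so 'study.' and 'study' match.
                index[token.strip('.,')].add(doc[ref])
    return index


# Hypothetical sample data, not the corpus used elsewhere in these docs.
docs = [
    {'id': 'a', 'title': 'Green kills Mustard', 'body': 'In the study.'},
    {'id': 'b', 'title': 'Plumb waters plant', 'body': 'A green plant.'},
]
toy_index = build_inverted_index(docs, ref='id', fields=('title', 'body'))
print(sorted(toy_index['green']))  # ['a', 'b']
print(sorted(toy_index['study']))  # ['a']
```

Looking up a term is then a single dictionary access, which is why searches stay fast once the index has been built.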
|
||
The typical use case is to integrate Lunr in a web application; one example is the [MkDocs documentation library](http://www.mkdocs.org/). To do this, you'd integrate [Lunr.js](https://lunrjs.com) in the JavaScript code of your application, which needs to fetch and parse a JSON file of your documents and create the index when the application starts. Depending on the size of your document set, this can take some time and potentially block the browser's main thread. | ||
|
||
Lunr.py provides a backend solution, allowing you to parse the documents ahead of time and create a Lunr.js compatible index that the browser version can read, minimizing the startup time of your application. | ||
|
||
Of course, you could also use Lunr.py to power full text search in desktop applications or backend services, searching your documents in a way that mimics Elasticsearch. | ||
|
||
## Current state | ||
|
||
Each version of lunr.py [targets a specific version of lunr.js](https://github.com/yeraydiazdiaz/lunr.py/blob/master/lunr/__init__.py#L12) and produces the same results as that version in both Python 2.7 and 3 for a [non-trivial corpus of documents](https://github.com/yeraydiazdiaz/lunr.py/blob/master/tests/acceptance_tests/fixtures/search_index.json). | ||
|
||
Lunr.py also serializes `Index` instances following the [`lunr-schema`](https://github.com/olivernn/lunr-schema), so they are consumable by Lunr.js and vice versa. | ||
|
||
The API is in alpha stage and likely to change. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,129 @@ | ||
# Quick start | ||
|
||
First, you'll need a list of dicts representing the documents you want to search on. These documents must have a unique field which will serve as a reference and a series of fields you'd like to search on. | ||
|
||
```python | ||
>>> from lunr import lunr | ||
>>> | ||
>>> documents = [{ | ||
...     'id': 'a', | ||
...     'title': 'Mr. Green kills Colonel Mustard', | ||
...     'body': """Mr. Green killed Colonel Mustard in the study with the | ||
... candlestick. Mr. Green is not a very nice fellow.""" | ||
... }, { | ||
...     'id': 'b', | ||
...     'title': 'Plumb waters plant', | ||
...     'body': 'Professor Plumb has a green and a yellow plant in his study', | ||
... }, { | ||
...     'id': 'c', | ||
...     'title': 'Scarlett helps Professor', | ||
...     'body': """Miss Scarlett watered Professor Plumbs green plant | ||
... while he was away on his murdering holiday.""", | ||
... }] | ||
``` | ||
|
||
Lunr provides a convenience `lunr` function to quickly index this set of documents: | ||
|
||
```python | ||
>>> idx = lunr( | ||
... ref='id', fields=('title', 'body'), documents=documents | ||
... ) | ||
``` | ||
|
||
For basic no-fuss searches, just use the `search` method on the index: | ||
|
||
```python | ||
>>> idx.search('kill') | ||
[{'ref': 'a', 'score': 0.6931722372559913, 'match_data': <MatchData "kill">}] | ||
>>> idx.search('study') | ||
[{'ref': 'b', 'score': 0.23576799568081389, 'match_data': <MatchData "studi">}, | ||
{'ref': 'a', 'score': 0.2236629211724517, 'match_data': <MatchData "studi">}] | ||
``` | ||
|
||
## Using query strings | ||
|
||
The query string passed to `search` accepts multiple terms: | ||
|
||
```python | ||
>>> idx.search('green plant') | ||
[{'ref': 'b', 'score': 0.5023294192217546, 'match_data': <MatchData "green, plant">}, | ||
{'ref': 'a', 'score': 0.12544083739725947, 'match_data': <MatchData "green">}, | ||
{'ref': 'c', 'score': 0.07306110905506158, 'match_data': <MatchData "green, plant">}] | ||
``` | ||
|
||
The index will search for `green` OR `plant`. A few things to note about the results: | ||
|
||
- document `b` scores highest because `plant` appears in both fields and `green` appears in the body | ||
- document `a` is second: it includes only `green`, but once in the title and twice in the body | ||
- document `c` includes both terms but in only one of the fields | ||
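The observations above can be checked by counting term occurrences per field. This is only a rough sketch: the `term_counts` helper is hypothetical, it matches exact lowercase words without stemming, and it does not reproduce Lunr's scoring formula.

```python
import re
from collections import Counter

# Same corpus as in the quick start, repeated so the sketch is self-contained.
documents = [{
    'id': 'a',
    'title': 'Mr. Green kills Colonel Mustard',
    'body': """Mr. Green killed Colonel Mustard in the study with the
candlestick. Mr. Green is not a very nice fellow."""
}, {
    'id': 'b',
    'title': 'Plumb waters plant',
    'body': 'Professor Plumb has a green and a yellow plant in his study',
}, {
    'id': 'c',
    'title': 'Scarlett helps Professor',
    'body': """Miss Scarlett watered Professor Plumbs green plant
while he was away on his murdering holiday.""",
}]


def term_counts(doc, terms, fields=('title', 'body')):
    """Count exact (unstemmed) occurrences of each term in each field."""
    counts = {}
    for field in fields:
        words = Counter(re.findall(r'[a-z]+', doc[field].lower()))
        counts[field] = {term: words[term] for term in terms}
    return counts


for doc in documents:
    print(doc['id'], term_counts(doc, ('green', 'plant')))
```

Document `b` has `plant` in both fields, document `a` has `green` once in the title and twice in the body, and document `c` has both terms only in the body.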
|
||
Query strings support a variety of modifiers: | ||
|
||
### Wildcards | ||
|
||
You can use `*` as a wildcard anywhere in your query string: | ||
|
||
```python | ||
>>> idx.search('pl*') | ||
[{'ref': 'b', 'score': 0.725901569004226, 'match_data': <MatchData "plumb, plant">}, | ||
{'ref': 'c', 'score': 0.0816178155209697, 'match_data': <MatchData "plumb, plant">}] | ||
>>> idx.search('*llow') | ||
[{'ref': 'b', 'score': 0.6210112024848421, 'match_data': <MatchData "yellow">}, | ||
{'ref': 'a', 'score': 0.30426104537491444, 'match_data': <MatchData "fellow">}] | ||
``` | ||
|
||
Note that, when using wildcards, no stemming is performed on the search terms. | ||
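As a rough mental model, a wildcard term is expanded against the terms stored in the index. The sketch below imitates this with the standard library's `fnmatch` module over a hand-picked vocabulary; both the `expand_wildcard` helper and the vocabulary are assumptions for illustration, and `fnmatch` also treats `?` and `[...]` as special, which Lunr query strings are not documented to do.

```python
from fnmatch import fnmatchcase

# Hand-picked subset of terms, assumed for this example only.
vocabulary = ['fellow', 'green', 'plant', 'plumb', 'study', 'yellow']


def expand_wildcard(pattern, terms):
    """Return the terms matching a pattern in which * stands for any characters."""
    return [term for term in terms if fnmatchcase(term, pattern)]


print(expand_wildcard('pl*', vocabulary))    # ['plant', 'plumb']
print(expand_wildcard('*llow', vocabulary))  # ['fellow', 'yellow']
```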
|
||
### Fields | ||
|
||
Prefixing any search term with `<FIELD_NAME>:` allows you to specify which field a particular term should be searched in: | ||
|
||
```python | ||
>>> idx.search('title:green title:plant') | ||
[{'ref': 'b', 'score': 0.18604713274256787, 'match_data': <MatchData "plant">}, | ||
{'ref': 'a', 'score': 0.07902963505882092, 'match_data': <MatchData "green">}] | ||
``` | ||
|
||
Note the difference from the `green plant` search shown earlier: document `c` is no longer in the results, because neither term appears in its title. | ||
|
||
Specifying an unindexed field will raise an exception: | ||
|
||
```python | ||
>>> idx.search('foo:green') | ||
Traceback (most recent call last): | ||
... | ||
lunr.exceptions.QueryParseError: Unrecognized field "foo", possible fields title, body | ||
``` | ||
|
||
You can combine this with wildcards: | ||
|
||
```python | ||
>>> idx.search('body:mu*') | ||
[{'ref': 'c', 'score': 0.3072276611029057, 'match_data': <MatchData "murder">}, | ||
{'ref': 'a', 'score': 0.14581429988419872, 'match_data': <MatchData "mustard">}] | ||
``` | ||
|
||
### Boosts | ||
|
||
When searching for several terms, you can use boosting to give more importance to a particular term: | ||
|
||
```python | ||
>>> idx.search('green plant^10') | ||
[{'ref': 'b', 'score': 0.831629678987025, 'match_data': <MatchData "green, plant">}, | ||
{'ref': 'c', 'score': 0.06360184858161157, 'match_data': <MatchData "green, plant">}, | ||
{'ref': 'a', 'score': 0.01756105367777591, 'match_data': <MatchData "green">}] | ||
``` | ||
|
||
Note how document `c` now ranks above document `a` because of the boost on the term `plant`. The `10` is a multiplier on the relative score for the term and must be a positive integer. | ||
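To make the field and boost syntax concrete, here is a minimal sketch of how a single clause could be split into its parts. The `parse_clause` function, its default boost of 1 and its use of `ValueError` are assumptions of this sketch; Lunr.py's own query parser is more involved and raises `QueryParseError`.

```python
def parse_clause(clause, known_fields=('title', 'body')):
    """Split one clause such as 'title:plant^10' into field, term and boost."""
    field, boost = None, 1
    if ':' in clause:
        field, clause = clause.split(':', 1)
        if field not in known_fields:
            raise ValueError('Unrecognized field "{}"'.format(field))
    if '^' in clause:
        clause, raw_boost = clause.rsplit('^', 1)
        boost = int(raw_boost)  # raises ValueError for non-integer boosts
        if boost <= 0:
            raise ValueError('Boost must be a positive integer')
    return {'field': field, 'term': clause, 'boost': boost}


parsed = [parse_clause(clause) for clause in 'green plant^10'.split()]
print(parsed[0]['term'], parsed[0]['boost'])  # green 1
print(parsed[1]['term'], parsed[1]['boost'])  # plant 10
```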
|
||
### Fuzzy matches | ||
|
||
You can also use fuzzy matching for terms that are likely to be misspelled: | ||
|
||
```python | ||
>>> idx.search('yellow~1') | ||
[{'ref': 'b', 'score': 0.621155860224936, 'match_data': <MatchData "yellow">}, | ||
{'ref': 'a', 'score': 0.3040972809936496, 'match_data': <MatchData "fellow">}] | ||
``` | ||
|
||
The positive integer after `~` represents the edit distance, in this case 1 character, whether by addition, removal, substitution or transposition. |
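To make the notion of edit distance concrete, here is a minimal sketch of the optimal string alignment distance, which counts insertions, deletions, substitutions and transpositions of adjacent characters. It is a textbook algorithm shown for illustration and is not necessarily how Lunr.py matches fuzzy terms internally; the `edit_distance` name is an assumption of this sketch.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of a
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # transposition of two adjacent characters
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance('yellow', 'fellow'))  # 1
print(edit_distance('yellow', 'yellwo'))  # 1
```

With a distance of 1, `fellow` is within reach of `yellow` through a single substitution, which is consistent with document `a` appearing in the results above.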
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
site_name: Lunr.py | ||
|
||
pages: | ||
- Home: index.md | ||
- Searching: usage.md |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
-e . | ||
-r test.txt | ||
twine==1.5.0 | ||
mkdocs==0.17.3 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
twine==1.5.0 |