### **Lunr**

Lunr is a lightweight full-text search engine, originally written in JavaScript as Lunr.js for use in the browser. This notebook uses the `lunr` Python package (Lunr.py), a Python implementation of Lunr.js that provides a simple interface for indexing and searching text.

In [3]:
!pip install lunr

Collecting lunr
  Downloading lunr-0.7.0.post1-py3-none-any.whl (35 kB)
Installing collected packages: lunr
Successfully installed lunr-0.7.0.post1


In [4]:
import pandas as pd
import json
from lunr import lunr
from lunr.index import Index

Importing Data

In [6]:
df = pd.read_csv("/clinc.csv").assign(idx=lambda d: d.index)
df.sample(3)

| | text | label | idx |
|---|---|---|---|
| 12101 | where is the w-2 form located | w2 | 12101 |
| 22817 | i need tips on how to overcome insomnia | oos | 22817 |
| 10683 | tell me how much pto i've used | pto_used | 10683 |


In [7]:
documents = df.to_dict(orient="records")

In [8]:
# Build the Lunr index: `ref` names the field that identifies each document,
# `fields` lists the fields to index. The index is all we need to run queries.
index = lunr(ref='idx', fields=('text',), documents=documents)
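For a rough intuition of what an index buys us, here is a toy inverted index in plain Python. It is only a sketch: `build_inverted_index`, `lookup`, and `toy_docs` are made-up names for illustration, and this is not how Lunr is implemented. A real Lunr index also handles stemming, stop words, and relevance scoring.

```python
from collections import defaultdict

def build_inverted_index(documents, ref="idx", field="text"):
    """Map each lowercase token to the set of document refs that contain it."""
    inverted = defaultdict(set)
    for doc in documents:
        for token in doc[field].lower().split():
            inverted[token].add(doc[ref])
    return inverted

def lookup(inverted, term):
    """Return the sorted refs of documents containing `term` (exact token match)."""
    return sorted(inverted.get(term.lower(), set()))

# Hypothetical mini-corpus in the same shape as `documents`.
toy_docs = [
    {"idx": 0, "text": "how do you say dog in spanish"},
    {"idx": 1, "text": "tell me how much pto i've used"},
    {"idx": 2, "text": "what is spanish for hello"},
]
toy_index = build_inverted_index(toy_docs)
print(lookup(toy_index, "spanish"))  # [0, 2]
```

The work of tokenizing every document is done once at build time, so each query becomes a dictionary lookup rather than a pass over all the text.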

In [9]:
index.search('spanish')

[{'ref': '4501', 'score': 7.801, 'match_data': <MatchData "spanish">},
 {'ref': '3', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '26', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '27', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '28', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4526', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4529', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4556', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4573', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4575', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4576', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4585', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '5638', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '19505', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '19507', 'score': 

In [10]:
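# Map each hit's ref (a string) back to its document.
# This works because idx equals the document's position in the list.
[documents[int(i['ref'])] for i in index.search('spanish')]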

[{'text': "can you tell me how to say 'i do not speak much spanish', in spanish",
  'label': 'translate',
  'idx': 4501},
 {'text': 'how do you say fast in spanish', 'label': 'translate', 'idx': 3},
 {'text': 'what is dog in spanish', 'label': 'translate', 'idx': 26},
 {'text': 'how do you say dog in spanish', 'label': 'translate', 'idx': 27},
 {'text': 'dog in spanish', 'label': 'translate', 'idx': 28},
 {'text': 'how can i say not now in spanish',
  'label': 'translate',
  'idx': 4526},
 {'text': 'how do you say goodbye in spanish',
  'label': 'translate',
  'idx': 4529},
 {'text': 'what is spanish for hello', 'label': 'translate', 'idx': 4556},
 {'text': 'how do you say thank you in spanish',
  'label': 'translate',
  'idx': 4573},
 {'text': 'how can i say thank you in spanish',
  'label': 'translate',
  'idx': 4575},
 {'text': 'what is thank you in spanish', 'label': 'translate', 'idx': 4576},
 {'text': 'how do you say cat in spanish', 'label': 'translate', 'idx': 4585},
 {'text': 

### **Serialize**

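Serializing the index lets us save it to disk and reconstruct it later without re-indexing the documents. This is useful when dealing with large indices, where we want to store the intermediate result rather than rebuild it.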

In [13]:
import json
from lunr.index import Index

serialized = index.serialize()

# Save the index
with open('idx.json', 'w') as fd:
    json.dump(serialized, fd)

# Load it again
with open("idx.json") as fd:
    reloaded = json.load(fd)

idx = Index.load(reloaded)
idx.search("plant")

[{'ref': '11998', 'score': 9.056, 'match_data': <MatchData "plant">},
 {'ref': '9435', 'score': 8.144, 'match_data': <MatchData "plant">},
 {'ref': '2097', 'score': 7.399, 'match_data': <MatchData "plant">},
 {'ref': '9433', 'score': 7.399, 'match_data': <MatchData "plant">},
 {'ref': '23246', 'score': 7.399, 'match_data': <MatchData "plant">},
 {'ref': '9439', 'score': 6.778, 'match_data': <MatchData "plant">},
 {'ref': '19441', 'score': 6.254, 'match_data': <MatchData "plant">}]
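One common way to use serialization is to build the index once and reuse the saved JSON on later runs. The helper below is a generic sketch of that pattern using only the standard library. `load_or_build`, `build_fn`, and `fake_build` are hypothetical names; with Lunr, one would presumably pass a function returning `index.serialize()` and hand the result to `Index.load`.

```python
import json
import os
import tempfile

def load_or_build(path, build_fn):
    """Return JSON-serializable data from `path`, building and saving it if missing."""
    if os.path.exists(path):
        with open(path) as fd:
            return json.load(fd)
    data = build_fn()
    with open(path, "w") as fd:
        json.dump(data, fd)
    return data

# Demo with a stand-in payload instead of a real serialized index.
calls = []

def fake_build():
    calls.append(1)  # record how many times the expensive build ran
    return {"version": "demo", "fields": ["text"]}

cache_path = os.path.join(tempfile.mkdtemp(), "idx.json")
first = load_or_build(cache_path, fake_build)   # builds and writes the file
second = load_or_build(cache_path, fake_build)  # reads the cached file
print(first == second, len(calls))  # True 1
```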

The `%timeit` magic measures the execution time of a statement over repeated runs. Here it compares a pandas substring filter, a plain Python scan, and a Lunr index search.

In [14]:
# pandas: select the rows of df whose 'text' column contains "spanish"
%timeit df.loc[lambda d: d['text'].str.contains("spanish")]

# Plain Python: scan every document and keep those whose 'text' contains "spanish"
%timeit [d for d in documents if 'spanish' in d['text']]

# Lunr: search the index built above for the term "spanish"
%timeit index.search('spanish')

# Lunr: search, then map each hit's ref back to the full document
%timeit [documents[int(i['ref'])] for i in index.search('spanish')]



12 ms ± 748 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.13 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
650 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
859 µs ± 232 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
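Taking the reported means at face value, the Lunr search with document lookup (859 µs) is roughly 14 times faster than the pandas filter (12 ms) and about 3.6 times faster than the list comprehension (3.13 ms). Outside IPython, a similar measurement can be taken with the standard library's `timeit` module. The sketch below times a substring scan over a made-up list (`toy_docs`); `scan` and `best_time` are hypothetical helpers, and the numbers will differ from machine to machine.

```python
import timeit

# Made-up corpus: 1000 filler sentences plus one that mentions "spanish".
toy_docs = [{"idx": i, "text": f"sentence number {i}"} for i in range(1000)]
toy_docs.append({"idx": 1000, "text": "how do you say dog in spanish"})

def scan(term):
    """Linear scan: keep every document whose text contains the substring."""
    return [d for d in toy_docs if term in d["text"]]

def best_time(func, repeat=5, number=100):
    """Best per-call time in seconds over `repeat` runs of `number` calls each."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number

scan_s = best_time(lambda: scan("spanish"))
print(f"scan: {scan_s * 1e6:.1f} microseconds per call")
```

Note that `%timeit` reports a mean and standard deviation, while this helper reports the best run. The comparison above is also not quite like-for-like: the pandas and list versions match substrings, whereas Lunr matches indexed tokens, so the result sets can differ.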
