## Search Engines Notebook Contents
- [How can I create a Search Engine?](#How-can-I-create-a-Search-Engine?)
- [How can I query the Search Engine?](#How-can-I-query-the-Search-Engine?)

**Use the script `query.py` to query the search engines and `create_se_indexes.py` to create the Search Engine indexes for Donkeybot.**  
See [scripts](https://github.com/rucio/donkeybot/tree/master/scripts) for the source code, and run a script with the `-h` option for information on the arguments it takes.  
For example:  

`(virt)$ python scripts/query.py -h`

### How can I create a Search Engine?

There are 3 types of Search Engines in Donkeybot at the moment:  
- `SearchEngine` which is used to query general documentation (in our case, the Rucio documentation)  
- `QuestionSearchEngine` which is used to query Question objects saved in Data Storage  
- `FAQSearchEngine` which is used to query FAQs saved in Data Storage  

Let's create a `QuestionSearchEngine`:

In [1]:
from bot.searcher.question import QuestionSearchEngine

In [2]:
qse = QuestionSearchEngine()
qse

<bot.searcher.question.QuestionSearchEngine at 0x2a2cf58a348>

**The QuestionSearchEngine is not yet usable!**    

We need 3 things:   

**Step 1.** Have a pandas **DataFrame** with a column named **question** that holds the text we will index. The document ID for the `QuestionSearchEngine` is taken from a column named **question_id** in the same corpus DataFrame.   

*Side note*: A nice addition to Donkeybot would be the ability to change the names of these columns and make this more general.  
However, this is only needed for the SQLite implementation; if we move to Elasticsearch in the future, it will not be needed.

**Step 2.** Have an open connection to the Data Storage.

**Step 3.** Call `create_index()` or `load_index()` to create or load the index, which is the document-term matrix of the questions.

In [3]:
# Step 1
import pandas as pd

# Example DataFrame: "question_id" is the document ID, "question" is the text that gets indexed
corpus_df = pd.DataFrame({"question_id": [0, 1, 2, 3],
                          "question": ["What happened in GSoC 2020 ?",
                                       "How can I create an index ?",
                                       "How can I load an index ?",
                                       "Why are there so many questions in this example?"],
                          "answer": ["Donkeybot was created!",
                                     "With the .create_index() method!",
                                     "With the .load_index() method!",
                                     "Because BM25 need enough data to create good tf-idf vectors :D"]})
corpus_df

| | question_id | question | answer |
|---|---|---|---|
| 0 | 0 | What happened in GSoC 2020 ? | Donkeybot was created! |
| 1 | 1 | How can I create an index ? | With the .create_index() method! |
| 2 | 2 | How can I load an index ? | With the .load_index() method! |
| 3 | 3 | Why are there so many questions in this example? | Because BM25 need enough data to create good t... |
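
Since Step 1 depends on two specific column names, a quick check before indexing can catch a misnamed column early. The helper below is a minimal sketch and not part of Donkeybot: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the function only looks at column names, so it works with `corpus_df.columns` as well as a plain list.

```python
# Column names taken from Step 1; the helper itself is hypothetical.
REQUIRED_COLUMNS = ("question_id", "question")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# With a DataFrame you would pass `corpus_df.columns`; a plain list behaves the same way.
print(missing_columns(["question_id", "question", "answer"]))  # nothing is missing
print(missing_columns(["id", "question"]))  # "question_id" is missing
```

An empty list means the DataFrame has the columns the `QuestionSearchEngine` expects.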


In [4]:
# Step 2
from bot.database.sqlite import Database
data_storage = Database('your_data_storage.db')

In [5]:
# Step 3 create the index!
qse.create_index(
    corpus=corpus_df, db=data_storage, table_name="corpus_doc_term_matrix"
)
qse.index

| question_id | terms |
|---|---|
| 0 | gsoc, happen |
| 1 | creat, index |
| 2 | load, index |
| 3 | exampl, mani, question |
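
The terms in the index are clearly a preprocessed form of the questions: lowercased, with common words and numbers dropped, and stemmed (`create` becomes `creat`). The sketch below only illustrates the first part of that idea and is not Donkeybot's actual preprocessing. The stop-word list and the `simple_terms` helper are assumptions made for this example, and stemming is deliberately left out.

```python
import re

# Hypothetical stop-word list, just large enough for the example questions.
STOP_WORDS = {"a", "an", "are", "can", "how", "i", "in", "so",
              "the", "there", "this", "what", "why"}


def simple_terms(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words (no stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(simple_terms("How can I load an index ?"))  # → ['load', 'index']
```

For this question the result matches the terms stored for `question_id` 2. Questions whose words change under stemming, such as "create", would need a stemmer on top of this.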


In [6]:
data_storage.close_connection()

Now the QuestionSearchEngine is ready!
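
If indexing raises an error, the explicit `close_connection()` call is never reached. One way to guard against that is a small context manager, sketched below. `open_storage` is a hypothetical helper, not something Donkeybot provides, and `FakeDatabase` is only a stand-in so the example runs on its own; the only thing assumed about the real `Database` is the `close_connection()` method used above.

```python
from contextlib import contextmanager


@contextmanager
def open_storage(factory, *args, **kwargs):
    """Yield a storage object and always call its close_connection() afterwards."""
    storage = factory(*args, **kwargs)
    try:
        yield storage
    finally:
        storage.close_connection()


class FakeDatabase:
    """Stand-in for Donkeybot's Database so this sketch is self-contained."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close_connection(self):
        self.closed = True


with open_storage(FakeDatabase, "your_data_storage.db") as data_storage:
    pass  # the qse.create_index(...) call from Step 3 would go here

print(data_storage.closed)  # → True
```

With the real class, `open_storage(Database, 'your_data_storage.db')` would wrap Steps 2 and 3 in the same way.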

### How can I query the Search Engine?

Let's query the `QuestionSearchEngine` we just created.

In [7]:
query = "Anything cool that happened in this year's GSoC?" # whatever you want to ask
top_n = 1 # number of retrieved documents 

And just run the `.search()` method.

In [8]:
qse.search(query, top_n)

| | question_id | question | answer | bm25_score | query |
|---|---|---|---|---|---|
| 0 | 0 | What happened in GSoC 2020 ? | Donkeybot was created! | 1.783785 | Anything cool that happened in this year's GSoC? |
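
To give a feel for where a `bm25_score` comes from, here is a minimal sketch of Okapi BM25 ranking over the terms from the index created earlier. It is not Donkeybot's implementation: the `k1` and `b` defaults, the non-negative IDF variant (the one Lucene uses), and the preprocessed form of the query are all assumptions, so the scores it produces will not match the value shown above.

```python
import math
from collections import Counter

# Terms per question_id, copied from the index created in Step 3.
INDEX = {
    0: ["gsoc", "happen"],
    1: ["creat", "index"],
    2: ["load", "index"],
    3: ["exampl", "mani", "question"],
}


def bm25_scores(query_terms, index, k1=1.5, b=0.75):
    """Score every document in `index` against `query_terms` with Okapi BM25."""
    n_docs = len(index)
    avg_len = sum(len(terms) for terms in index.values()) / n_docs
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for terms in index.values() for term in set(terms))
    scores = {}
    for doc_id, terms in index.items():
        term_freq = Counter(terms)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            tf = term_freq[term]
            # Length normalisation: longer documents need more matches to score as high.
            norm = tf + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def best_matches(scores, n):
    """Return the n (doc_id, score) pairs with the highest scores."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


# Assumed preprocessed form of the query; the real preprocessing may differ.
scores = bm25_scores(["cool", "happen", "year", "gsoc"], INDEX)
print(best_matches(scores, 1))
```

Only `question_id` 0 shares terms with the query, so it is the single document returned for `n = 1`, mirroring the `top_n = 1` search above.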
