# Overview
This notebook demonstrates a purpose-built search engine for the **CORD Research Papers** dataset.

The library is an upgrade of the simple search engine I built in another kernel for this dataset: https://www.kaggle.com/dgunning/browsing-research-papers-with-a-bm25-search-engine

# 1. Installing dgunning/cord

The library is on GitHub at https://github.com/dgunning/cord19.git. To install it, Internet access must be turned on for the Kaggle notebook.

In [None]:
!pip install -U git+https://github.com/dgunning/cord19.git 

# 2. Using the cord library

Using the library just means importing **cord.ResearchPapers** and calling **load()**.

In [None]:
from cord import ResearchPapers

To create a **ResearchPapers** instance, call `ResearchPapers.load`. This loads the metadata from the **metadata.csv** file and performs all the necessary preprocessing. It does not load the JSON files initially, but if the files are available, they provide the ability to retrieve and display the paper contents.

While loading, the code creates the BM25 index from either the abstracts or the full texts of the papers. Indexing from abstracts takes about **30** seconds on my laptop and around **100** seconds on a Kaggle instance. If you want a more powerful, and potentially more accurate, index, use `ResearchPapers.load(index="text")`. This takes much longer: about 10 minutes on my laptop and roughly 30 minutes on Kaggle.

In [None]:
research_papers = ResearchPapers.load()
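
To give a feel for what a BM25 index does, here is a minimal, self-contained sketch of Okapi BM25 scoring in plain Python. It is not the code used by the cord library: the `TinyBM25` class, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the three sample abstracts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real indexes clean text more carefully)
    return text.lower().split()


class TinyBM25:
    """Illustrative Okapi BM25 index over a small, non-empty in-memory corpus."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.doc_tokens = [tokenize(doc) for doc in documents]
        self.doc_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        self.doc_lens = [len(tokens) for tokens in self.doc_tokens]
        self.avg_len = sum(self.doc_lens) / len(self.doc_lens)
        n_docs = len(documents)
        # Document frequency: number of documents containing each term
        df = Counter(term for tokens in self.doc_tokens for term in set(tokens))
        # Smoothed IDF variant that never goes negative
        self.idf = {
            term: math.log(1 + (n_docs - freq + 0.5) / (freq + 0.5))
            for term, freq in df.items()
        }

    def score(self, query, doc_id):
        freqs = self.doc_freqs[doc_id]
        length_ratio = self.doc_lens[doc_id] / self.avg_len
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            # Term frequency saturates with k1; b controls length normalisation
            denom = tf + self.k1 * (1 - self.b + self.b * length_ratio)
            total += self.idf[term] * tf * (self.k1 + 1) / denom
        return total

    def search(self, query, top_n=3):
        scored = [(doc_id, self.score(query, doc_id)) for doc_id in range(len(self.doc_tokens))]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        # Drop documents that share no term with the query
        return [(doc_id, s) for doc_id, s in scored[:top_n] if s > 0]


abstracts = [
    "mother to child transmission of the virus",
    "vaccine effectiveness in elderly patients",
    "modelling the spread of pandemics",
]
index = TinyBM25(abstracts)
results = index.search("vaccine effectiveness")
print(results)
```

Indexing full texts instead of abstracts means the same bookkeeping runs over far more tokens per document, which is consistent with the longer load times described above.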

# 3. Search Bar
The search bar provides interactive search. Enter a search term and the results update in real time as the query changes.
The results are shown as an HTML view, similar to Google or Semantic Scholar search results.

In [None]:
research_papers.searchbar("mother to child transmission")

### Display as Table
If you prefer to see the results as a table or dataframe, create the searchbar with the parameter `view="table"`, `view="df"`, or `view="dataframe"`.

In [None]:
research_papers.searchbar("mother to child transmission", view='table')

# 4. Searching Research Papers

The `search` function returns a list of papers matching the search terms.
You can add **start_date**, **end_date**, and **covid_related** as optional arguments:
```
research_papers.search('treatment sars-cov-2', start_date='2020-01-30', end_date='2020-02-15', covid_related=True)
```

In [None]:
research_papers.search('vaccine effectiveness')
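
As a rough illustration of how a date window such as **start_date** and **end_date** can narrow a set of results, the sketch below filters a small pandas DataFrame. The helper `filter_by_date`, the sample rows, and the choice of inclusive bounds are assumptions; how the library itself treats the boundaries is not stated here. The `published` column name follows the query example used elsewhere in this notebook.

```python
import pandas as pd


def filter_by_date(results, start_date=None, end_date=None, date_column="published"):
    """Keep rows whose date lies in [start_date, end_date]; inclusive bounds are an assumption."""
    dates = pd.to_datetime(results[date_column])
    mask = pd.Series(True, index=results.index)
    if start_date is not None:
        mask &= dates >= pd.Timestamp(start_date)
    if end_date is not None:
        mask &= dates <= pd.Timestamp(end_date)
    return results[mask]


# Hypothetical search hits, for illustration only
hits = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "published": ["2020-01-15", "2020-02-05", "2020-03-01"],
})
window = filter_by_date(hits, start_date="2020-01-30", end_date="2020-02-15")
print(window)
```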

# 5. Selecting subsets of research papers
There are a few ways to select from the research papers. First, you can use the query function.

## Query
The query function follows the same pattern as the DataFrame query function. In fact, it delegates to the metadata dataframe's query function, makes a copy of the metadata, and creates a new ResearchPapers instance. This means you can use the exact syntax that is available for dataframes. In the example below, we select research papers published in **New Scientist** after February 29, 2020.

In [None]:
research_papers.query('journal == "New Scientist" & published > "2020-02-29"')
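
The delegation pattern described in this section can be sketched with a small wrapper class. `PaperCollection` below is a hypothetical stand-in, not the library's `ResearchPapers` class, and the sample metadata is made up for illustration.

```python
import pandas as pd


class PaperCollection:
    """Hypothetical wrapper around a metadata DataFrame (not the cord class)."""

    def __init__(self, metadata):
        self.metadata = metadata

    def query(self, expr):
        # Delegate to DataFrame.query, then copy so the new collection
        # does not share data with the original
        return PaperCollection(self.metadata.query(expr).copy())

    def __len__(self):
        return len(self.metadata)


# Made-up metadata rows, for illustration only
metadata = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "journal": ["New Scientist", "The Lancet", "New Scientist"],
    "published": pd.to_datetime(["2020-02-10", "2020-03-05", "2020-03-12"]),
})
papers = PaperCollection(metadata)
recent = papers.query('journal == "New Scientist" & published > "2020-02-29"')
print(recent.metadata)
```

Because each call returns a new wrapper, selections can be chained without modifying the original collection.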

## Head and Tail
Like the query function, the head() and tail() functions let you select only the first or last n rows of the research papers.

In [None]:
research_papers.head(4)

In [None]:
research_papers.tail(4)

## Dedicated Query Functions
These functions build on the query function, and each is dedicated to a subset of the data.

- **since_sars** Select papers since SARS
- **since_sarscov2** Select papers since SARS-CoV-2
- **before_sars** Select papers before SARS
- **before_sarscov2** Select papers before SARS-CoV-2
- **before** Select research papers before a date
- **after** Select papers after a date
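
One way helpers like those listed above could be layered on a date query is sketched below, using plain functions over a pandas DataFrame. The cutoff dates `SARS_START` and `SARS_COV_2_START`, the strict comparisons, and the sample rows are assumptions chosen for illustration; they are not taken from the library.

```python
import pandas as pd

# Assumed cutoff dates, for illustration only
SARS_START = "2002-11-01"
SARS_COV_2_START = "2019-12-01"


def after(metadata, date):
    cutoff = pd.Timestamp(date)
    # "@cutoff" refers to the local variable inside the query expression
    return metadata.query("published > @cutoff")


def before(metadata, date):
    cutoff = pd.Timestamp(date)
    return metadata.query("published < @cutoff")


def since_sars(metadata):
    return after(metadata, SARS_START)


def before_sars(metadata):
    return before(metadata, SARS_START)


def since_sarscov2(metadata):
    return after(metadata, SARS_COV_2_START)


def before_sarscov2(metadata):
    return before(metadata, SARS_COV_2_START)


# Made-up metadata rows, for illustration only
metadata = pd.DataFrame({
    "title": ["Old paper", "Post-SARS paper", "COVID-era paper"],
    "published": pd.to_datetime(["1998-06-01", "2010-04-15", "2020-03-20"]),
})
print(since_sars(metadata)["title"].tolist())
print(before_sarscov2(metadata)["title"].tolist())
```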

### Papers Since SARS
Show all research papers published since SARS

In [None]:
research_papers.since_sars()

In [None]:
research_papers = ResearchPapers.load(index='texts')

In [None]:
research_papers.searchbar('disease modelling pandemics')

In [None]:
research_papers['8m06zdho']

In [None]:
from cord.vectors import similar_papers

In [None]:
similar_papers('8m06zdho')

In [None]:
query = """
Since its emergence and detection in Wuhan, China in late 2019, the novel coronavirus SARS-CoV-2 has spread to nearly
every country around the world, resulting in hundreds of thousands of infections to date. To uncover the sources of SARS-CoV-2 
introductions and patterns of spread within the U.S., we sequenced nine viral genomes from early 
reported COVID-19 patients in Connecticut. By coupling our genomic data with domestic and international travel patterns, 
we show that early SARS-CoV-2 transmission in Connecticut was likely driven by domestic introductions. Moreover, the risk of domestic 
importation to Connecticut exceeded that of international importation by mid-March regardless of our estimated impacts of federal travel restrictions. This study provides evidence for widespread, sustained transmission of SARS-CoV-2 within the U.S. and highlights the critical need for local surveillance.
"""
research_papers.search_2d(query)