# Demo: Semantic Scholar Loader in llama-index

### Some preliminaries

- `query_space`: the broad area of research to search for
- `query_string`: a specific question to ask of the documents in the query space
- `total_papers`: the number of papers to load for the query space


To download the open-access PDFs and extract their text, set the `full_text` flag to `True`:


```python
s2reader = SemanticScholarReader()
documents = s2reader.load_data(query_space, total_papers, full_text=True)
```

In [1]:

```python
from llama_hub.semanticscholar.base import SemanticScholarReader
import os
import openai
from llama_index.llms import OpenAI
from llama_index.query_engine import CitationQueryEngine
from llama_index import (
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
    ServiceContext,
)
from llama_index.response.notebook_utils import display_response

# initialize the SemanticScholarReader
s2reader = SemanticScholarReader()

# initialize the service context
openai.api_key = os.environ["OPENAI_API_KEY"]
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0)
)
```
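Reading `os.environ["OPENAI_API_KEY"]` directly raises a bare `KeyError` when the variable is not set. A minimal sketch of a friendlier check is shown here; `require_env` is a hypothetical helper written for this demo, not part of llama-index or the `openai` package.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Hypothetical error text; adjust to taste.
        raise RuntimeError(
            f"Environment variable {name!r} is not set; export it before running the notebook."
        )
    return value
```

With this in place, the assignment in the cell could read `openai.api_key = require_env("OPENAI_API_KEY")`.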


In [5]:

```python
query_space = "large language models"
query_string = "limitations of using large language models"
full_text = True
# be careful with the total_papers when full_text = True
# it can take a long time to download
total_papers = 50

persist_dir = (
    "./citation_" + query_space + "_" + str(total_papers) + "_" + str(full_text)
)

if not os.path.exists(persist_dir):
    # Load data from Semantic Scholar
    documents = s2reader.load_data(query_space, total_papers, full_text=full_text)
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    index.storage_context.persist(persist_dir=persist_dir)
else:
    index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir=persist_dir),
        service_context=service_context,
    )

# initialize the citation query engine
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,
    citation_chunk_size=512,
)

# query the citation query engine
response = query_engine.query(query_string)
display_response(response, show_source=True, source_length=3)
```

**`Final Response:`** Large language models have limitations in terms of their training cost and computational resources [1]. While they can be efficient once trained, generating content from a trained model can still consume significant resources [1]. Techniques like model distillation can help reduce the cost of these models [1]. Additionally, increasing the size of language models may not necessarily improve their performance on long-tail knowledge or rare instances [3]. Scaling up models alone may not be sufficient to achieve high accuracy on specific types of questions [3]. There is also a need to modify the training objective or increase the number of training epochs to encourage memorization and focus on salient facts [4]. It is important to be cautious in how we talk about large language models, avoiding anthropomorphism and recognizing their limitations [5].

---

**`Source Node 1/6`**

**Node ID:** ce86c15c-97b2-462b-97a8-01d4f9b5cdca<br>**Similarity:** 0.8679221353278955<br>**Text:** ...<br>

---

**`Source Node 2/6`**

**Node ID:** 928d36ca-bf21-47f1-820b-b57a7fa30354<br>**Similarity:** 0.8679221353278955<br>**Text:** ...<br>

---

**`Source Node 3/6`**

**Node ID:** 6d0eba26-64a5-4b84-a71a-1bc412323761<br>**Similarity:** 0.864251829100195<br>**Text:** ...<br>

---

**`Source Node 4/6`**

**Node ID:** 509f3675-6048-4a14-8ba2-874316078ebf<br>**Similarity:** 0.864251829100195<br>**Text:** ...<br>

---

**`Source Node 5/6`**

**Node ID:** 8de0d9da-9729-486f-8490-434caf207934<br>**Similarity:** 0.8627260872607259<br>**Text:** ...<br>

---

**`Source Node 6/6`**

**Node ID:** 45b8d27e-c739-4493-acec-9554ea1ed24c<br>**Similarity:** 0.8627260872607259<br>**Text:** ...<br>
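
The `persist_dir` used in these cells embeds the raw `query_space`, so the directory name contains spaces (for example `./citation_large language models_50_True`). That works, but it is awkward to handle in a shell. A minimal sketch of a name builder that keeps the same pieces follows; `build_persist_dir` and its `root` parameter are hypothetical and not part of llama-index.

```python
import re
from pathlib import Path


def build_persist_dir(
    query_space: str, total_papers: int, full_text: bool, root: str = "."
) -> Path:
    """Build a cache directory path from the query settings, without spaces."""
    # Collapse every run of non-alphanumeric characters into a single underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", query_space.strip().lower()).strip("_")
    return Path(root) / f"citation_{slug}_{total_papers}_{full_text}"


print(build_persist_dir("large language models", 50, True))
```

Switching to a scheme like this changes the directory name, so an index already persisted under the original name would not be found and would be rebuilt.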

In [9]:

```python
query_space = "covid 19 vaccine"
query_string = "List the efficacy numbers of the covid 19 vaccines"
full_text = True
# be careful with the total_papers when full_text = True
# it can take a long time to download
total_papers = 50

persist_dir = (
    "./citation_" + query_space + "_" + str(total_papers) + "_" + str(full_text)
)

if not os.path.exists(persist_dir):
    # Load data from Semantic Scholar
    documents = s2reader.load_data(query_space, total_papers, full_text=full_text)
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    index.storage_context.persist(persist_dir=persist_dir)
else:
    index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir=persist_dir),
        service_context=service_context,
    )

# initialize the citation query engine
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,
    citation_chunk_size=512,
)

# query the citation query engine
response = query_engine.query(query_string)
display_response(response, show_source=True, source_length=100)
```

**`Final Response:`** The efficacy numbers of the COVID-19 vaccines are as follows:

- NVX-CoV2373: 49% efficacy against the B.1.351 variant, increasing to 60% when excluding HIV-positive individuals [1].
- Ad26.COV2-S: 72% efficacy against PCR-confirmed infection in the USA, reduced to 66% efficacy in Latin America and 57% efficacy in South Africa [1].
- AZD1222: Did not demonstrate protection against mild to moderate B.1.351-induced COVID-19 [1].
- BNT162b2: Elicited antibodies with neutralizing activity against B.1.1.7 and P.1 variants [1].
- CoronaVac: 50% efficacy against symptomatic infection [1].
- Sinopharm (BBIBP-CorV): 78% efficacy against COVID-19 and 79% efficacy against hospitalization [5].
- Novavax (NVX-CoV2373): 89% efficacy against symptomatic COVID-19 and positive RT-PCR test result [5].
- VECTOR (EpiVacCorona): No data available [5].

Note: These efficacy numbers are based on the provided sources and may not represent the most up-to-date information.

---

**`Source Node 1/6`**

**Node ID:** b6663a8b-5679-4723-9cc9-925ce5b84c34<br>**Similarity:** 0.8624234672546093<br>**Text:** Source 1:
NVX-CoV2373 
showed an efficacy of 49% against the 
B.1.351 variant in the prevention o...<br>

---

**`Source Node 2/6`**

**Node ID:** 67d4659d-b164-4841-9176-63e49f837b9c<br>**Similarity:** 0.8624234672546093<br>**Text:** Source 2:
617.2) variant.A significant 
decrease in neutralizing antibody titre 
has been seen fo...<br>

---

**`Source Node 3/6`**

**Node ID:** 98ff1d6f-1992-4674-b396-12f58acb5cd7<br>**Similarity:** 0.8616244247348551<br>**Text:** Source 3:
The only valid way to compare vaccines directly is 
in head-to-head efficacy trials, wh...<br>

---

**`Source Node 4/6`**

**Node ID:** 6752826d-3c31-48de-ab75-c625c06718c1<br>**Similarity:** 0.8616244247348551<br>**Text:** Source 4:
population studied and prevalence 
of SARS-CoV-2 variants at the time of the 
trial, it...<br>

---

**`Source Node 5/6`**

**Node ID:** 0283e2f1-1eb7-4d73-ba92-1724487b5aee<br>**Similarity:** 0.8593642969779912<br>**Text:** Source 5:
Although differences in how the 
clinical trials were set up make comparison 
between v...<br>

---

**`Source Node 6/6`**

**Node ID:** 57356c41-b2bf-429b-a842-bc9962a5f007<br>**Similarity:** 0.8593642969779912<br>**Text:** Source 6:
laboratory 
confirmed COVID-19 
within 
6 months after first dose≥18 years old 9 months...<br>
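
The rendered output shows six source nodes whose similarity scores come in identical pairs. This is likely because each of the three retrieved nodes (`similarity_top_k=3`) is split into smaller citation chunks (`citation_chunk_size=512`) that inherit the parent's score. To inspect the same information programmatically instead of through `display_response`, a sketch along these lines may help. It assumes each item in `response.source_nodes` exposes `.score` and `.node.get_content()`, which should be checked against the installed llama-index version; `summarize_sources` is a hypothetical helper.

```python
from types import SimpleNamespace


def summarize_sources(source_nodes, preview_chars=80):
    """Return (citation_number, score, preview) tuples, one per source node."""
    rows = []
    for number, item in enumerate(source_nodes, start=1):
        # Collapse the line breaks left over from PDF text extraction.
        text = " ".join(item.node.get_content().split())
        rows.append((number, round(item.score, 4), text[:preview_chars]))
    return rows


# Stand-in objects so the sketch runs without an index or an API key.
# In the notebook, pass `response.source_nodes` instead (assumed attribute).
fake_nodes = [
    SimpleNamespace(
        score=0.8624234672546093,
        node=SimpleNamespace(
            get_content=lambda: "Source 1:\nNVX-CoV2373 \nshowed an efficacy of 49%"
        ),
    ),
]
for row in summarize_sources(fake_nodes):
    print(row)
```

The citation number in each tuple corresponds to the bracketed markers such as `[1]` in the final response, assuming the engine numbers its sources in list order.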