# Natural Language Query → Solr Query Translation

This notebook implements two pipelines for translating natural language queries (NLQ) into Solr queries (SQ). The first is a custom LangChain chain that explicitly wires together every component the task needs. The second is a more convenient, streamlined approach built on `LLMChain`.

## Setup

In [70]:
import download_ads
import os

# Custom Chain

## Chain (No Chat)

#### Output Parser

This is not the ideal implementation: the output parser would ideally be a separate module in the chain. Here, its format instructions are simply injected into the prompt.

In [71]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

# generate format instructions for the prompt
query_schema = ResponseSchema(
    name="q",
    description="The structured query based on Human input to be sent to ADS",
)
response_schemas = [query_schema]
output_parser = StructuredOutputParser(response_schemas=response_schemas)
format_instructions = output_parser.get_format_instructions()


In [72]:
print(format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"q": string  // The structured query based on Human input to be sent to ADS
}
```
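To make the format instructions concrete, the parsing step that later consumes the model's reply can be approximated with the standard library. This is a simplified stand-in for what the structured output parser does, not LangChain's actual implementation; `parse_fenced_json` and `FENCE_RE` are hypothetical names introduced for illustration.

```python
import json
import re

# Matches a fenced json block and captures its body.
FENCE_RE = re.compile(r"```json\s*(.*?)\s*```", re.DOTALL)


def parse_fenced_json(text: str, required_keys: list[str]) -> dict:
    """Extract the first fenced json block from `text` and check required keys."""
    match = FENCE_RE.search(text)
    if match is None:
        raise ValueError("no fenced json block found in model output")
    parsed = json.loads(match.group(1))
    missing = [key for key in required_keys if key not in parsed]
    if missing:
        raise ValueError(f"missing keys in model output: {missing}")
    return parsed


# A reply shaped like the examples used elsewhere in this notebook.
reply = '```json\n{\n"q": "body:\\"neural networks\\""\n}\n```'
parsed_reply = parse_fenced_json(reply, required_keys=["q"])
print(parsed_reply["q"])
```

A reply that omits the fence or the `q` key raises a `ValueError`, which is roughly the failure mode to expect when the model ignores the format instructions.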


#### Parameterize Prompt

In [73]:
# Load prompt template
from langchain.prompts import load_prompt
from langchain.prompts import PromptTemplate

template_path = '../../data/forward_templates/forward_prompt_simple_history.yaml'

# Load the prompt template from a file; load_prompt returns a PromptTemplate
prompt_template = load_prompt(template_path)
print(prompt_template.template)

INSTRUCTIONS: 
The following is a conversation between a human and an AI. The AI should answer the question based on the context, examples, and current conversation provided. If the AI does not know the answer to a question, it truthfully says it does not know. 

CONTEXT: 
The AI is an expert database search engineer. Specifically, the AI is trained to create structured queries that are submitted to NASA Astrophysics Data System (ADS), a digital library portal for researchers in astronomy and physics. The ADS system accepts queries using the Apache Solr search syntax. 
Here are all available fields and operators in the ADS database, where each field is separated by a space in this list: {fields} {operators}
{multi_query_paragraph}

AVAILABLE FIELDS: 
Here is an example for each of the available fields in the ADS database. The formatting is a Python list of lists. The inner list corresponds to an available field, is five elements long, and each element starts and ends with a single quot

In [74]:
partial_input_vars = {
    "fields": download_ads.get_fields_names(),
    "operators": download_ads.get_operator_names(),
    "multi_query_paragraph": download_ads.get_multi_query_paragraph(), 
    "fields_examples": str(download_ads.get_examples()),
    "operators_examples": str(download_ads.get_operators_info()["name_example_explanation"]),
    "explanation": "The AI should only output the answer and no additional information.",
    "format_instructions": format_instructions   
}

partial_template = prompt_template.partial(**partial_input_vars)
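The idea behind `partial` is to fill the static variables once and leave the per-query ones for later. A minimal sketch of that behaviour with plain `str.format_map` follows; it only illustrates the concept and is not how LangChain implements it. `partial_format`, `_KeepMissing`, and the toy template are hypothetical.

```python
class _KeepMissing(dict):
    """Leaves unknown placeholders untouched during formatting."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


def partial_format(template: str, **known: str) -> str:
    """Fill only the placeholders named in `known`; keep the rest for later."""
    return template.format_map(_KeepMissing(known))


# Toy template, much shorter than the real prompt file.
toy_template = "Fields: {fields}\nExamples: {specific_examples}\nHuman: {input}"
partially_filled = partial_format(toy_template, fields="abs author year")
print(partially_filled)

# The remaining variables are supplied at query time.
final_prompt = partially_filled.format(
    specific_examples="(retrieved examples)",
    input="Everything from 2002 to 2008",
)
```

One caveat of this toy version: a filled-in value that itself contains braces would break the second `format` call, which is one reason to rely on the library's own partial mechanism in real code.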

#### Create the model

In [75]:
from langchain.chat_models import ChatOpenAI

def create_model(temperature: float = 0.0, model_name: str = "gpt-3.5-turbo") -> ChatOpenAI:
    return ChatOpenAI(temperature=temperature, model=model_name)

llm = create_model(temperature=1.0)

#### Choose Embeddings

In [41]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings import OpenAIEmbeddings

embedding_choices = {
    "HF": {"model": HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"), "dim": 384},
    "OpenAI": {
        "model": OpenAIEmbeddings(),  # inconvenient here: rate-limited to 3/minute
        "dim": 1536,
    },
}

embedding = embedding_choices["HF"]

#### Set up Pinecone

In [42]:
from langchain.vectorstores import Pinecone
import pinecone


def get_pinecone_langchain_client(index_name: str, embedding, embedding_dim) -> Pinecone:
    # Initialize pinecone module
    pinecone.init(
        api_key=os.environ["PINECONE_API_KEY"],
        environment=os.environ["PINECONE_ENV"],
    )

    # Get or create pinecone index
    if index_name not in pinecone.list_indexes():
        pinecone.create_index(
            name=index_name,
            metric="cosine",
            dimension=embedding_dim,
        )
    index = pinecone.Index(index_name)

    # Create LangChain pinecone client
    # note: embedding.embed_query is the function that's called to do the embedding
    pinecone_vectorstore = Pinecone(index=index, embedding=embedding, text_key="text")
    return pinecone_vectorstore

In [43]:
pinecone_vectorstore = get_pinecone_langchain_client("demo", embedding=embedding["model"], embedding_dim=embedding["dim"])

#### Embed Examples

In [44]:
import yaml
from langchain.docstore.document import Document

In [45]:
def documents_from_file(file_path: str) -> list[Document]:
    examples = []
    try:
        with open(file_path, 'r') as stream:
            examples = yaml.safe_load(stream)

        print(examples)
    except yaml.YAMLError as exc:
        print(f"Error parsing YAML file: {exc}")
    except FileNotFoundError:
        print(f"Error: file not found: {file_path}")

    documents = []
    for example in examples:
        documents.append(
            Document(page_content=example['text'], metadata={'solr': example['solr']})
        )
    return documents

In [46]:
documents = documents_from_file('../../data/examples.yaml')

[{'text': 'finds articles published between 1980 and 1990 by John Huchra', 'solr': '```json\n{\n"q": "author:\\"Huchra, John\\" year:1980-1990"\n}\n```'}, {'text': 'What are papers that mention neural networks in the abstract?', 'solr': '```json\n{\n"q": "abs:\\"neural networks\\""\n}\n```'}, {'text': 'Give me papers that mention neural networks in the title or keywords or abstract', 'solr': '```json\n{\n"q": "abs:\\"neural networks\\""\n}\n```'}, {'text': 'Papers with that contain neural networks in the full text', 'solr': '```json\n{\n"q": "body:\\"neural networks\\""\n}\n```'}, {'text': 'Everything from 2002 to 2008', 'solr': '```json\n{\n"q": "year:2002-2008"\n}\n```'}, {'text': 'What papers by Kurtz, et al discuss weak lensing?', 'solr': '```json\n{\n"q": "author:\\"Kurtz\\" abs:\\"weak lensing\\""\n}\n```'}, {'text': 'What papers by Alberto, et al discuss astronomy?', 'solr': '```json\n{\n"q": "author:\\"Alberto\\" abs:\\"astronomy\\""\n}\n```'}, {'text': 'Show me papers about ex

In [47]:
# add documents to the vector database
pinecone_vectorstore.add_documents(documents)

['c072db95-a7d6-4320-a28f-d20d3cbee589',
 '28ecbd8a-e057-4528-977a-4857becfef13',
 '48804b9e-0482-433d-946a-ce1a628b47eb',
 'd35db382-ba40-401e-8163-375816ba5dc2',
 '3aacbd38-14cd-4328-bfb2-7967ec1ab333',
 '20c00b40-1b82-4fe7-97e2-f2059ce72d2c',
 '3ae10e25-23e3-477b-b8b0-fb1a855fa269',
 '212f9d8d-a2b6-4796-96ff-1ca9f901c96e',
 '730bb67a-c169-4292-8ffd-fb5b8b5d444a',
 'b0a50722-d5fb-4925-b811-946f2c6c5765',
 '79fe5132-44bc-4ff6-b2e2-08c775a49bb3',
 'eaa67af6-9c6e-4f35-985f-c907eb02c1c3',
 'fa2158f8-0285-478b-ad20-2ce7dfdb94f7']
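The IDs above are random UUIDs, so re-running this cell would insert the same examples again under new IDs (the manual lookup later in this notebook returns one example twice, which is consistent with that). One possible remedy is to derive a stable ID from each example's text, sketched below. `stable_id` is a hypothetical helper, and passing `ids=` to `add_documents` is an assumption about the installed LangChain version that should be checked before use.

```python
import hashlib


def stable_id(text: str) -> str:
    """Derive a deterministic ID from the example text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]


example_texts = [
    "Everything from 2002 to 2008",
    "What papers by Kurtz, et al discuss weak lensing?",
]
ids = [stable_id(text) for text in example_texts]
print(ids)

# Assumption: the installed LangChain version forwards `ids` to the index,
# so that re-adding the same text overwrites rather than duplicates.
# pinecone_vectorstore.add_documents(documents, ids=ids)
```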

#### Run the chain

In [48]:
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough, RunnableLambda
from langchain.prompts import SemanticSimilarityExampleSelector
from langchain.chat_models import ChatOpenAI
from operator import itemgetter


retriever = pinecone_vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})
prompt = partial_template
model = llm
question = "Papers that contain neural networks in the full text"

chain = {
    "specific_examples": itemgetter("question") | retriever,
    "input": itemgetter("question")
} | prompt | model | output_parser

result = chain.invoke({"question": question})
result


KeyError: 'history'
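The `KeyError` is consistent with the prompt template declaring a `history` variable (the file name suggests a history-aware prompt) that the input map above never supplies. A quick way to spot such gaps before invoking a chain is to compare the template's placeholders with the keys being provided. The sketch below does this for a plain `str.format`-style template using only the standard library; the helper names and the toy template are hypothetical, and for a real `PromptTemplate` the `input_variables` attribute would likely serve the same purpose.

```python
from string import Formatter


def template_variables(template: str) -> set[str]:
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_variables(template: str, provided: set[str]) -> set[str]:
    """Return placeholders in `template` that `provided` does not cover."""
    return template_variables(template) - provided


# Toy template standing in for the real prompt file.
toy_template = "{history}\nExamples: {specific_examples}\nHuman: {input}"
missing = missing_variables(toy_template, {"specific_examples", "input"})
print(missing)
```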

In [71]:
result_str = str(result)
result_str

'{\'q\': \'body:"neural networks"\'}'

In [72]:
result_repr = repr(result)
result_repr

'{\'q\': \'body:"neural networks"\'}'

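In [None]:
import ast

# Recover the dict from its string form; literal_eval only evaluates Python literals
ast.literal_eval(result_str)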

#### Write Result to File to See JSON Encoding

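The string written to the file should be JSON-encodable and, once decoded, should yield a string that represents the result object. Note that `str(result)` produces a Python `repr` (single-quoted keys), not JSON, so converting it back into a Python object requires `ast.literal_eval(result_str)`. `dict(json_str)` does not work: calling `dict()` on a string raises a `ValueError`. Serializing with `json.dumps(result)` instead would allow a plain `json.loads` round trip.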

In [66]:
import json
out_file = "../../data/result.json"

def result_to_ls_format(file_path, result):
    """Write the string form of `result` to `file_path` as a JSON-encoded string."""
    result_str = str(result)

    with open(file_path, 'w') as f:
        json.dump(result_str, f)


result_to_ls_format(out_file, result)

ValueError: dictionary update sequence element #0 has length 1; 2 is required
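The `ValueError` above is what `dict()` raises when given a string, since it iterates over single characters instead of key/value pairs. The sketch below contrasts the round trips that do work; the `result` literal is a stand-in for the parsed chain output shown earlier.

```python
import ast
import json

# Stand-in for the parsed chain output.
result = {"q": 'body:"neural networks"'}

# str() produces a Python repr (single-quoted keys), which is not valid JSON,
# so it has to be read back with ast.literal_eval.
result_str = str(result)
from_repr = ast.literal_eval(result_str)

# json.dumps() produces real JSON, which json.loads() reverses.
result_json = json.dumps(result)
from_json = json.loads(result_json)

# dict() on a string iterates over its characters and fails.
try:
    dict(result_str)
except ValueError as exc:
    print(f"ValueError: {exc}")
```

If the file is meant to be read by another tool, writing `result_json` is probably the safer choice, because any JSON parser can decode it.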

## Chat Chain

#### Memory

In [49]:
from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

model = ChatOpenAI()
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful chatbot"),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{input}"),
    ]
)

In [51]:
# print out the prompt template
# print(str(prompt.template))

In [52]:
memory = ConversationBufferMemory(return_messages=True)

In [53]:
memory.load_memory_variables({})

{'history': []}

In [54]:
chain = (
    RunnablePassthrough.assign(
        history=RunnableLambda(memory.load_memory_variables) | itemgetter("history")
    )
    | prompt
    | model
)

In [20]:
inputs = {"input": "hi im bob"}
response = chain.invoke(inputs)
response

AIMessage(content='Hi Bob! How can I assist you today?')

In [21]:
memory.save_context(inputs, {"output": response.content})

In [22]:
memory.load_memory_variables({})

{'history': [HumanMessage(content='hi im bob'),
  AIMessage(content='Hi Bob! How can I assist you today?')]}

In [23]:
inputs = {"input": "whats my name"}
response = chain.invoke(inputs)
response

AIMessage(content='Your name is Bob.')
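The model only "remembers" the name because the saved exchange is replayed into the `history` placeholder on the next call. A toy buffer that mimics the two methods used above makes the mechanism explicit. `ToyBufferMemory` is a hypothetical stand-in written for illustration, not LangChain's class, and it stores plain tuples instead of message objects.

```python
class ToyBufferMemory:
    """Toy stand-in for a conversation buffer; not LangChain's class."""

    def __init__(self) -> None:
        self._turns: list[tuple[str, str]] = []

    def load_memory_variables(self, _inputs: dict) -> dict:
        # Return a copy so callers cannot mutate the stored history.
        return {"history": list(self._turns)}

    def save_context(self, inputs: dict, outputs: dict) -> None:
        # Each exchange is stored as a human turn followed by an AI turn.
        self._turns.append(("human", inputs["input"]))
        self._turns.append(("ai", outputs["output"]))


toy_memory = ToyBufferMemory()
toy_memory.save_context({"input": "hi im bob"}, {"output": "Hi Bob!"})
print(toy_memory.load_memory_variables({}))
```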

#### Experiments

In [63]:
chain = (
    RunnablePassthrough.assign(
        history=RunnableLambda(memory.load_memory_variables) | itemgetter("history")
    )
)

# Note: this second definition replaces the one above
chain = (
    RunnablePassthrough()
    | {"history": RunnableLambda(memory.load_memory_variables) | itemgetter("history")}
)

In [79]:
inputs = {"input": "hi im bob", "specific_examples": "none"}
response = chain.invoke(inputs)
response

{'history': []}
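The output contains only `history` because the second definition of `chain` is the one that runs, and piping into a plain dict yields just that dict's keys, whereas `RunnablePassthrough.assign` keeps the input keys and adds to them. The difference can be sketched with ordinary functions; `assign`, `map_only`, and `load_history` are hypothetical helpers that only imitate the two behaviours.

```python
from typing import Any, Callable

Step = Callable[[dict], Any]


def assign(inputs: dict, **steps: Step) -> dict:
    """Keep every input key and add the computed ones."""
    return {**inputs, **{key: step(inputs) for key, step in steps.items()}}


def map_only(inputs: dict, **steps: Step) -> dict:
    """Return only the computed keys; the input keys are dropped."""
    return {key: step(inputs) for key, step in steps.items()}


def load_history(_inputs: dict) -> list:
    # Stands in for the memory lookup; the memory is empty here.
    return []


inputs = {"input": "hi im bob", "specific_examples": "none"}
with_assign = assign(inputs, history=load_history)
with_map = map_only(inputs, history=load_history)
print(with_assign)
print(with_map)
```

For a prompt that needs `input`, `specific_examples`, and `history` together, the `assign` form is the one that preserves all three.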

#### My Chain with Memory

In [90]:
def update_memory(x):
    # Debugging pass-through: print the intermediate value and return it unchanged.
    # TODO: save the exchange with memory.save_context(inputs, {"output": response.content})
    print(x)
    return x


In [95]:
fill_history_chain = (
    RunnablePassthrough.assign(
        history=RunnableLambda(memory.load_memory_variables) | itemgetter("history")
    )
    | RunnablePassthrough.assign(
        specific_examples=itemgetter("specific_examples") | retriever
    )
    | partial_template
    # | model
    | RunnableLambda(update_memory)
    # | output_parser
)

In [96]:
fill_history_chain.invoke(inputs)

text='INSTRUCTIONS: \nThe following is a conversation between a human and an AI. The AI should answer the question based on the context, examples, and current conversation provided. If the AI does not know the answer to a question, it truthfully says it does not know. \n\nCONTEXT: \nThe AI is an expert database search engineer. Specifically, the AI is trained to create structured queries that are submitted to NASA Astrophysics Data System (ADS), a digital library portal for researchers in astronomy and physics. The ADS system accepts queries using the Apache Solr search syntax. \nHere are all available fields and operators in the ADS database, where each field is separated by a space in this list: abs abstract ack aff aff_id alternate_bibcode alternate_title arXiv arxiv_class author author author_count author_count bibcode bibgroup bibstem body citation_count citation_count copyright data database pubdate doctype doi ^ full grant identifier inst issue keyword lang object orcid orcid_pu

StringPromptValue(text='INSTRUCTIONS: \nThe following is a conversation between a human and an AI. The AI should answer the question based on the context, examples, and current conversation provided. If the AI does not know the answer to a question, it truthfully says it does not know. \n\nCONTEXT: \nThe AI is an expert database search engineer. Specifically, the AI is trained to create structured queries that are submitted to NASA Astrophysics Data System (ADS), a digital library portal for researchers in astronomy and physics. The ADS system accepts queries using the Apache Solr search syntax. \nHere are all available fields and operators in the ADS database, where each field is separated by a space in this list: abs abstract ack aff aff_id alternate_bibcode alternate_title arXiv arxiv_class author author author_count author_count bibcode bibgroup bibstem body citation_count citation_count copyright data database pubdate doctype doi ^ full grant identifier inst issue keyword lang obj
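Once the model step is re-enabled, the exchange still has to be written back to memory after each call. One way to do that outside the chain is a small wrapper, sketched below under stated assumptions: `chain.invoke` returns an object with a `.content` attribute (as a chat model does), and only the `input` key is saved, since a buffer memory may reject inputs with more than one key. `invoke_with_memory` and the two fake classes are hypothetical and exist only to make the sketch runnable.

```python
from types import SimpleNamespace


def invoke_with_memory(chain, memory, inputs: dict):
    """Run `chain` once, then record the exchange in `memory`."""
    response = chain.invoke(inputs)
    # Save only the user's question, not the retrieved examples.
    memory.save_context({"input": inputs["input"]}, {"output": response.content})
    return response


class _FakeChain:
    """Echoes the input; stands in for prompt | model."""

    def invoke(self, inputs: dict):
        return SimpleNamespace(content=f"echo: {inputs['input']}")


class _FakeMemory:
    """Records save_context calls; stands in for a conversation memory."""

    def __init__(self) -> None:
        self.saved: list[tuple[dict, dict]] = []

    def save_context(self, inputs: dict, outputs: dict) -> None:
        self.saved.append((inputs, outputs))


fake_memory = _FakeMemory()
reply = invoke_with_memory(
    _FakeChain(), fake_memory, {"input": "hi", "specific_examples": "none"}
)
print(reply.content, fake_memory.saved)
```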

# Additional Topics

#### Aside: Manual Vector lookup

In [42]:
def get_top_k_similar(vectorstore, query: str, top_k: int = 3) -> list[tuple[Document, float]]:
    # similarity_search_with_score returns (document, score) pairs
    return vectorstore.similarity_search_with_score(query, k=top_k)

In [43]:
get_top_k_similar(pinecone_vectorstore, question)

[(Document(page_content='Papers with that contain neural networks in the full text', metadata={'solr': '```json\n{\n"q": "body:\\"neural networks\\""\n}\n```'}),
  0.983315647),
 (Document(page_content='Papers with that contain neural networks in the full text', metadata={'solr': '```json\n{\n"q": "body:\\"neural networks\\""\n}\n```'}),
  0.983315527),
 (Document(page_content='What are papers that mention neural networks in the abstract?', metadata={'solr': '```json\n{\n"q": "abs:\\"neural networks\\""\n}\n```'}),
  0.844233811)]
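Because the index was created with `metric="cosine"`, the scores above are cosine similarities between the query embedding and each stored embedding. The ranking can be reproduced conceptually in a few lines of plain Python. The two-dimensional vectors below are made up for illustration (real embeddings here have 384 dimensions), and `cosine_similarity` and `top_k_by_cosine` are hypothetical helpers, not part of the vector store API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    query_vec: list[float], indexed: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors keyed by a short label for each example.
toy_index = {
    "full text": [1.0, 0.0],
    "abstract": [0.8, 0.6],
    "year range": [0.0, 1.0],
}
hits = top_k_by_cosine([1.0, 0.0], toy_index, k=2)
print(hits)
```

The two identical top hits in the real output suggest the same example was indexed more than once, which a brute-force ranking like this would reproduce as well.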

#### Aside: Example Parsing for Label Studio

In [10]:
import yaml

def yaml_to_label_studio(file_path: str):
    examples = []
    try:
        with open(file_path, 'r') as stream:
            examples = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        print(f"Error parsing YAML file: {exc}")
    except FileNotFoundError:
        print(f"Error: file not found: {file_path}")
    return examples

In [11]:
examples = yaml_to_label_studio('../../data/examples.yaml')

In [12]:
examples[0]

{'text': 'finds articles published between 1980 and 1990 by John Huchra',
 'solr': '```json\n{\n"q": "author:\\"Huchra, John\\" year:1980-1990"\n}\n```'}

In [14]:
print(examples[0]['solr'])

```json
{
"q": "author:\"Huchra, John\" year:1980-1990"
}
```
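To get from these examples to something Label Studio can import, each example can be wrapped in a task whose fields live under a `data` key. The sketch below unwraps the fenced JSON to recover the bare query first. The field names `text` and `solr_query` are assumptions and would need to match the labeling configuration; `example_to_task` and `FENCE_RE` are hypothetical names.

```python
import json
import re

# Matches a fenced json block and captures its body.
FENCE_RE = re.compile(r"```json\s*(.*?)\s*```", re.DOTALL)


def example_to_task(example: dict) -> dict:
    """Convert one {'text', 'solr'} example into a Label Studio task dict."""
    match = FENCE_RE.search(example["solr"])
    if match is None:
        raise ValueError("example has no fenced json block")
    query = json.loads(match.group(1))["q"]
    # `text` and `solr_query` are assumed field names; they must match
    # whatever the labeling configuration references.
    return {"data": {"text": example["text"], "solr_query": query}}


sample = {
    "text": "finds articles published between 1980 and 1990 by John Huchra",
    "solr": '```json\n{\n"q": "author:\\"Huchra, John\\" year:1980-1990"\n}\n```',
}
task = example_to_task(sample)
print(json.dumps([task], indent=2))
```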
