# Chatbot development
### Initial demo, 19 April 2023

## Requirements
We import `config` in order to access:
- where the API key is stored
- where the document directory is located
- the prompt template

### API key
- is not stored in plain text in `config.py`
- is accessed as an environment variable (using an *.env* file that remains hidden)
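
As an illustration of the point above, here is a minimal sketch of how `config.py` might read the key from the environment. The variable name `OPENAI_API_KEY` as an environment variable, the helper `read_api_key`, and the optional use of the third-party `python-dotenv` package are assumptions, not details confirmed by the project.

```python
import os
from typing import Mapping, Optional


def read_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OpenAI API key from the environment, or fail loudly."""
    env = os.environ if env is None else env
    # The environment variable name is an assumption.
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; check the .env file.")
    return key


# Optionally load a local .env file first, if python-dotenv happens to be installed.
try:
    from dotenv import load_dotenv  # third-party package, assumed here

    load_dotenv()
except ImportError:
    pass
```

Failing early with a clear error avoids sending requests with an empty key, which would otherwise surface later as an authentication error.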

### Prompt template
The prompt tells the model to answer only from the supplied documents, and to say so when the answer is not there, in order to limit "hallucinations".

```python
prompt_template = """You are a personal Bot assistant for answering any questions about the documents.
You are given a question and a set of documents.
If the user's question requires you to provide specific information from the documents, give your answer based only on the examples provided below. DON'T generate an answer that is NOT written in the provided examples.
If you don't find the answer to the user's question with the examples provided to you below, answer that you didn't find the answer in the documentation and propose him to rephrase his query with more details.
Use bullet points if you have to make a list, only if necessary.

QUESTION: {question}

DOCUMENTS:
=========
{context}
=========
Finish by proposing your help for anything else.
"""
```
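
To make the two placeholders concrete, the sketch below fills a shortened stand-in for the template with a question and a couple of chunks, using plain `str.format`. The helper `fill_prompt` and the choice to join chunks with a blank line are assumptions made for illustration; in the notebook the substitution is done by langchain's `PromptTemplate`.

```python
from typing import List

# Shortened stand-in for config.prompt_template; the placeholders are the same.
template = (
    "QUESTION: {question}\n\n"
    "DOCUMENTS:\n=========\n{context}\n=========\n"
)


def fill_prompt(template: str, question: str, chunks: List[str]) -> str:
    """Insert the question and the retrieved chunks into the template."""
    context = "\n\n".join(chunks)  # separator is an assumption
    return template.format(question=question, context=context)


filled = fill_prompt(
    template,
    "What are illicit financial flows?",
    ["chunk one", "chunk two"],
)
print(filled)
```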

In [1]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain import OpenAI, VectorDBQA
from langchain.document_loaders import DirectoryLoader
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
import config
import logging

In [2]:
# Load documents from the specified directory using a DirectoryLoader object
loader = DirectoryLoader(config.FILE_DIR, glob='*.pdf')
documents = loader.load()

# Split the documents into chunks of about 1000 characters, with no overlap between chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Create a vector store from the chunks using an OpenAIEmbeddings object and a Chroma object
embeddings = OpenAIEmbeddings(openai_api_key=config.OPENAI_API_KEY)
docsearch = Chroma.from_documents(texts, embeddings)

detectron2 is not installed. Cannot use the hi_res partitioning strategy. Falling back to partitioning with the fast strategy.

**Warning message: detectron2 is not installed. Cannot use the hi_res partitioning strategy. Falling back to partitioning with the fast strategy.**

This warning is emitted while `DirectoryLoader` parses the PDFs. The loader relies on the `unstructured` package, which offers more than one strategy for partitioning a PDF into text elements. The `hi_res` strategy uses a layout detection model built on detectron2 to identify elements such as titles, tables and paragraphs on each page. The `fast` strategy extracts the text embedded in the PDF directly, without running a model.

Because detectron2 is not installed, `unstructured` falls back to the `fast` strategy, and the message is repeated as the files are processed. For text-based PDFs this is usually adequate, but scanned pages or complex layouts (multi-column text, tables) may be extracted less accurately.

To use the `hi_res` strategy, install detectron2 by following its official installation instructions. A plain `pip install detectron2` may not work, because the project is typically installed from source. The demo itself does not depend on it.
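
Before deciding whether the warning matters, it can help to confirm what is actually installed in the notebook's environment. The following is a small sketch using only the standard library; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("detectron2 available:", is_installed("detectron2"))
```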

**Using embedded DuckDB without persistence: data will be transient**

This message comes from Chroma, the vector store created in cell 2. In the version used here, Chroma keeps its data in an embedded DuckDB instance, that is, a database engine running inside the notebook's own process. Because no persistence directory was given, the data live only in memory: the embeddings are lost when the kernel stops and have to be recomputed on the next run.

The `docsearch` object exposes a `persist` method and a `_persist_directory` attribute (see the `dir(docsearch)` output in cell 6), which suggests that persistence can be enabled by giving Chroma a directory to write to.

DuckDB itself distinguishes the same two modes. In in-memory mode, data are discarded when the application exits. In persistent mode, the database is saved to a file and can be reopened across sessions. When DuckDB is used directly, persistent mode is selected by passing a file path to `duckdb.connect`, for example:

```python
import duckdb
con = duckdb.connect('my_database.db')
```

This will create a new database file named my_database.db in the current working directory and connect to it. Any data loaded into the database will be persisted in the file and can be accessed across multiple sessions.
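
In this notebook DuckDB is not opened directly; it sits behind the Chroma vector store created in cell 2. A minimal sketch of persisting that store instead is given here, assuming the langchain version in use accepts a `persist_directory` argument in `Chroma.from_documents` (the `persist` method does appear in the `dir(docsearch)` output in cell 6). The helper `build_persistent_store` and its `store_cls` parameter are hypothetical and exist only to keep the example self-contained.

```python
def build_persistent_store(texts, embeddings, persist_directory, store_cls=None):
    """Build a vector store that writes its data to persist_directory."""
    if store_cls is None:
        # Same import as in cell 1; only needed when no class is injected.
        from langchain.vectorstores import Chroma

        store_cls = Chroma
    store = store_cls.from_documents(
        texts, embeddings, persist_directory=persist_directory
    )
    store.persist()  # flush the collection to disk
    return store


# Possible usage in the notebook (not run here):
# docsearch = build_persistent_store(texts, embeddings, config.PERSIST_DIR)
```

With a persisted store, the embeddings would not need to be recomputed each time the notebook is restarted.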

### Exploring object properties

In [3]:
len(documents)

10

In [4]:
len(texts)

840
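
The 10 PDFs therefore yield 840 chunks. To give an intuition for how a character-based splitter arrives at such chunks, here is a simplified sketch that greedily merges paragraphs until the size limit is reached. It is not langchain's implementation; the function `split_text` and the paragraph separator are assumptions.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it is a single oversized piece kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_text("aaaa\n\nbbbb\n\ncc", chunk_size=10))
```

A piece longer than `chunk_size` is kept whole in this sketch, so the limit is a target rather than a hard guarantee.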

In [5]:
print(dir(embeddings))

['Config', '__abstractmethods__', '__annotations__', '__class__', '__class_vars__', '__config__', '__custom_root_type__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__exclude_fields__', '__fields__', '__fields_set__', '__format__', '__ge__', '__get_validators__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__include_fields__', '__init__', '__init_subclass__', '__iter__', '__json_encoder__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__post_root_validators__', '__pre_root_validators__', '__pretty__', '__private_attributes__', '__reduce__', '__reduce_ex__', '__repr__', '__repr_args__', '__repr_name__', '__repr_str__', '__rich_repr__', '__schema_cache__', '__setattr__', '__setstate__', '__signature__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__try_update_forward_refs__', '__validators__', '__weakref__', '_abc_impl', '_calculate_keys', '_copy_and_set_values', '_decompose_class', '_embedding_func', '_enforce_dict_if_root', '_ge

In [6]:
print(dir(docsearch))

['_LANGCHAIN_DEFAULT_COLLECTION_NAME', '__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_client', '_client_settings', '_collection', '_embedding_function', '_persist_directory', 'add_documents', 'add_texts', 'as_retriever', 'delete_collection', 'from_documents', 'from_texts', 'max_marginal_relevance_search', 'max_marginal_relevance_search_by_vector', 'persist', 'similarity_search', 'similarity_search_by_vector', 'similarity_search_with_score']
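
Most of the names listed by `dir()` are internal. A small sketch for showing only the public attributes, which is usually what is wanted when exploring an object (the helper name `public_attributes` is hypothetical):

```python
from typing import List


def public_attributes(obj) -> List[str]:
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# Possible usage in the notebook (not run here):
# print(public_attributes(docsearch))
```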


## Core of `chat.py` module

In [7]:
# Define a function named 'answer' that takes a string prompt and an optional directory path
# for persisting data. The function returns a string that represents the answer to the prompt.
def answer(prompt: str, persist_directory: str = config.PERSIST_DIR) -> str:
    
    # Log a message indicating that the function has started
    #LOGGER.info(f"Start answering based on prompt: {prompt}.")
    
    # Create a prompt template using a template from the config module and input variables
    # representing the context and question.
    prompt_template = PromptTemplate(template=config.prompt_template, input_variables=["context", "question"])
    
    # Load a QA chain using an OpenAI object, a chain type, and a prompt template.
    doc_chain = load_qa_chain(
        llm=OpenAI(
            openai_api_key = config.OPENAI_API_KEY,
            model_name="text-davinci-003",
            temperature=0,
            max_tokens=300,
        ),
        chain_type="stuff",
        prompt=prompt_template,
    )
    
    # Log a message indicating the number of chunks to be considered when answering the user's query.
    #LOGGER.info(f"The top {config.k} chunks are considered to answer the user's query.")
    
    # Create a VectorDBQA object using a vector store, a QA chain, and a number of chunks to consider.
    qa = VectorDBQA(vectorstore=docsearch, combine_documents_chain=doc_chain, k=config.k)
    
    # Call the VectorDBQA object to generate an answer to the prompt.
    result = qa({"query": prompt})
    answer = result["result"]
    
    # Log a message indicating the answer that was generated
    #LOGGER.info(f"The returned answer is: {answer}")
    
    # Log a message indicating that the function has finished and return the answer.
    #LOGGER.info(f"Answering module over.")
    return answer
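
Conceptually, the function above retrieves the `k` chunks most similar to the query and "stuffs" them into the prompt as `{context}`. The sketch below illustrates that retrieval step with toy two-dimensional vectors. It is an illustration only: the helper names are hypothetical, cosine similarity is chosen for simplicity, and the metric Chroma actually applies may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    indexed: Sequence[Tuple[str, Sequence[float]]],
    k: int,
) -> List[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy index of (chunk text, embedding) pairs; real embeddings have many more dimensions.
index = [
    ("carbon budget", [1.0, 0.0]),
    ("financial flows", [0.0, 1.0]),
    ("emissions", [0.9, 0.1]),
]
context = "\n\n".join(top_k([1.0, 0.0], index, k=2))
print(context)
```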

### Some results

In [8]:
answer("Who is Abonia?")

Retrying langchain.embeddings.openai.embed_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised APIError: The server had an error while processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID f3e2adf14022a5d5ed59ff27d9995b1b in your message.)

'\nAnswer: Abonia Sojasingarayar is a Machine Learning Scientist, Data Scientist, NLP Engineer, Computer Vision Engineer, AI Analyst, and Technical Writer. She has a M2 in Artificial Intelligence from IA School in Boulogne-Billancourt, a M1 in Digital Project Management from Institut F2I in Paris, and a Licence in Computer Technology and Engineering from Université Pondicherry in Karikal, India. She is also certified in Google Professional Data Engineer, IBM Quantum Conversation Badge, Machine Learning Specialization by Deeplearning.IA, Andrew NG, Deep Learning & Natural Language Specialization by Deeplearning.IA, Andrew NG, Spark by Learning Academy, Advance Python Bootcamp, GCP Professional Data Engineer Badges, Watson Assistant Hands on, and Watson Foundation Methodology. She is also experienced in resolving data security and backup issues, disk cloning and configuration, and maintenance of PC post.\n\nI hope this answers your question. Is there anything else I can help you with?'
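
The retry message in the output of cell 8 shows that the embeddings call failed once with a server error and was repeated automatically. langchain handles this internally; the sketch below only illustrates the general pattern of retrying with exponential backoff. The function `call_with_retry` and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with a delay that doubles each time."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Passing the `sleep` function as a parameter makes the helper easy to exercise without actually waiting.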

In [14]:
answer("What are illicit financial flows?")

'\nAnswer: Illicit financial flows (IFFs) are international transfers of funds that were or are illegally obtained, transferred or used. They can also be any international transfers of wealth that are harmful to development. Examples of IFFs include trade misinvoicing, tariff evasion, corruption, and money laundering. I hope this answers your question. If you need any further help, please let me know.'

In [16]:
answer("What is the global carbon budget to stay below 2 degrees?")

'\nAnswer: The global carbon budget to stay below 2 degrees is 953 GtC, or 3,494 GtCO2. This budget is based on a 50% chance of limiting warming to 2°C. If mitigation begins immediately, emissions peak in 2019 at 40 billion metric tonnes of CO2, requiring emissions reduction rates of 6.2% a year from then on.\n\nI hope this answers your question. Is there anything else I can help you with?'

In [11]:
answer("What would be the total costs for Ethiopia and Bangladesh as a percentage of their GDP in 2030 and 2050 if they were to participate in the SkyShares reference scenario?")

'\nAnswer:\nFor Ethiopia and Bangladesh, the total costs as a percentage of their GDP in 2030 and 2050 in the SkyShares reference scenario would be 27.23% and 10.96% respectively in 2030, and 0.56% and 2.97% respectively in 2050.\n\nIf you need help with anything else, please let me know.'

In [12]:
answer("What is the reference scenario in SkyShares?")

'\nAnswer: The reference scenario in SkyShares is a framework designed to limit global average warming to 2°C, with mitigation activity commencing immediately after agreement of the framework. All countries participate from the outset, and allocations of the global emissions budget converge from being in proportion to current emissions to start with, to equal per capita entitlements in the year 2030.\n\nI hope this answers your question. Is there anything else I can help you with?'

## To discuss
1. Validation
    - What approach should be used to check that the chatbot is giving correct answers?
        - Human experts
        - Roll out a mini experiment at AgreedEarth
2. Fine-tuning
3. Use cases to focus on
4. Features to add
    - See [Google Doc](https://docs.google.com/document/d/1pP2_LqOo7bLCFvV33FpfLxM_UlcMhrO2lmsaUyfqTcA/edit#)
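
As one possible starting point for the validation item, here is a rough sketch of an automated check that compares each answer against terms an expert expects to see. Keyword matching is crude and would complement human review rather than replace it. The function `evaluate` and the test-case format are hypothetical, and `answer_fn` stands in for the notebook's `answer` function.

```python
from typing import Callable, Dict, List


def evaluate(answer_fn: Callable[[str], str], cases: Dict[str, List[str]]) -> Dict[str, bool]:
    """For each question, report whether the answer contains all expected terms."""
    results: Dict[str, bool] = {}
    for question, expected_terms in cases.items():
        reply = answer_fn(question).lower()
        results[question] = all(term.lower() in reply for term in expected_terms)
    return results


# Possible usage in the notebook (not run here), with an expert-supplied expectation:
# cases = {"What is the global carbon budget to stay below 2 degrees?": ["953 GtC"]}
# print(evaluate(answer, cases))
```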