## 0. Installation and Setup

In [None]:
%%capture output
# hide output

! pip install pdfplumber
! pip install chromadb
! pip install grpcio==1.58.0
! pip install milvus
! pip install pymilvus
! pip install sentence-transformers
! pip install langchain
! pip install pypdf
! pip install faiss-gpu

## 1. Load Data
In LangChain, data is loaded with document loaders. Import the loader that matches the data source from `langchain.document_loaders`:
1. Folder: DirectoryLoader
2. Azure: AzureBlobStorageContainerLoader
3. CSV file: CSVLoader
4. Google Drive: GoogleDriveLoader
5. Website: UnstructuredHTMLLoader
6. PDF: PyPDFLoader
7. YouTube: YoutubeLoader

For more data loaders, refer to the following link:
https://python.langchain.com/docs/modules/data_connection/document_loaders.html
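
As a rough illustration of matching a loader to a data source, the sketch below maps file extensions to some of the loader class names listed above. The mapping and the `pick_loader_name` helper are hypothetical and only return names as strings; they are not part of LangChain.

```python
import os

# Hypothetical mapping from file extension to loader class names from the list above.
LOADER_BY_EXTENSION = {
    ".pdf": "PyPDFLoader",
    ".csv": "CSVLoader",
    ".html": "UnstructuredHTMLLoader",
}


def pick_loader_name(path):
    """Return a loader class name for a path (illustrative helper, not a LangChain API)."""
    if os.path.isdir(path):
        return "DirectoryLoader"
    ext = os.path.splitext(path)[1].lower()
    # Raise instead of guessing when the extension is not covered.
    if ext not in LOADER_BY_EXTENSION:
        raise ValueError("no loader configured for " + repr(ext))
    return LOADER_BY_EXTENSION[ext]


print(pick_loader_name("ExxonMobil_2018.pdf"))
```

The returned name would then be imported from `langchain.document_loaders` and instantiated with the path.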

In [None]:
import os
from google.colab import drive
# Access drive
drive.mount('/content/drive')
path = '/content/drive/MyDrive/Capstone/'


# companies
companies = os.listdir(os.path.join(path, 'Company Reports'))
for i, comp in enumerate(companies):
    print(i, ": ", comp)


# get reports
def get_reports(comp, year: int, rep_type: int = 1):
    """
    comp:       company name (string) or index into `companies`
    year:       a specific year, or the number of most recent years (1-10); 0 for all years
    rep_type:   report type: 1 for annual report, 2 for sustainability report, 0 for both
    ret:        list of report paths
    """
    if type(comp) == str:
        if comp not in companies:
            print("Error: ", comp, " does not exist")
            return
    elif type(comp) == int:
        if comp not in range(len(companies)):
            print("Error: invalid index")
            return
        comp = companies[comp]
    else:
        print("Error: invalid company")
        return

    file_path = os.path.join(path, 'Company Reports', comp)
    files = os.listdir(file_path)
    files.sort(reverse=True)

    years = range(2013,2023)
    if year in range(11):
        if year:
            years = years[-year:]
    else:
        years = [year]

    if rep_type == 0:
        reps = ["", "_sus"]
    elif rep_type == 1:
        reps = [""]
    elif rep_type == 2:
        reps = ["_sus"]
    else:
        print("Error: invalid report type")
        return

    ret = []
    for year in years:
        for rep in reps:
            file = comp + '_' + str(year) + rep + '.pdf'
            if file in files:
                ret.append(file)
    return [os.path.join(file_path, file) for file in ret]

Mounted at /content/drive
0 :  ExxonMobil
1 :  Shell plc
2 :  BP PLC
3 :  Saudi Aramco
4 :  Chevron
5 :  TotalEnergies
6 :  Valero Energy
7 :  Marathon Petroleum Corporation
8 :  Sinopec
9 :  PetroChina


In [None]:
files = get_reports(0, 2018)
file = files[0]
file

'/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf'
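
The `year` argument of `get_reports` does double duty, which is easy to misread. The sketch below isolates that logic in two small helpers, `select_years` and `report_name`. Both names are hypothetical and exist only for illustration; the 2013 to 2022 range and the `<company>_<year>[_sus].pdf` naming are taken from the function above.

```python
def select_years(year, first=2013, last=2022):
    """Mirror how get_reports interprets `year` (illustrative helper)."""
    years = list(range(first, last + 1))
    if 0 <= year <= 10:
        # 0 means all years; 1-10 means that many most recent years.
        return years[-year:] if year else years
    # Anything else is treated as one specific calendar year.
    return [year]


def report_name(comp, year, sustainability=False):
    """Build the expected file name for one report."""
    suffix = "_sus" if sustainability else ""
    return comp + "_" + str(year) + suffix + ".pdf"


print(select_years(3))
print(select_years(2018))
print(report_name("ExxonMobil", 2018))
```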

In [None]:
# Take a PDF as an example. This is helpful if we download the documents directly from the company website.
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader(file)
data = loader.load_and_split()


# We can also use github (Website type) to store our original data.

# from langchain.document_loaders import WebBaseLoader

# loader = WebBaseLoader("https://drive.google.com/file/d/1EA8Iifu4kSIfziXAYz33P7Zon_u_beWb/view?usp=drive_link")
# data = loader.load()

## 2. Split the data
Once the documents are loaded, we need to transform them to better suit our application. The simplest example is splitting a long document into smaller chunks that fit into the model's context window. The most common splitters in LangChain are:

1. RecursiveCharacterTextSplitter()
2. CharacterTextSplitter()

The parameters of the above splitters:
 - length_function: how the length of chunks is calculated. Defaults to just counting number of characters, but it's pretty common to pass a token counter here.
 - chunk_size: the maximum size of your chunks (as measured by the length function).
 - chunk_overlap: the maximum overlap between chunks. It can be nice to have some overlap to maintain some continuity between chunks (e.g. do a sliding window).
 - add_start_index: whether to include the starting position of each chunk within the original document in the metadata.
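
To make `chunk_size`, `chunk_overlap`, and `add_start_index` concrete, here is a minimal sliding-window splitter in plain Python. It is a sketch only: it measures length in characters and cuts at fixed positions, whereas `RecursiveCharacterTextSplitter` tries to split on separators first. The `split_text` function and its dictionary output are assumptions made for illustration.

```python
def split_text(text, chunk_size=400, chunk_overlap=0):
    """Fixed-size sliding-window splitter (illustrative; not LangChain's algorithm)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        # Keep the start position, similar in spirit to add_start_index=True.
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break
    return chunks


for chunk in split_text("abcdefghij", chunk_size=4, chunk_overlap=1):
    print(chunk)
```

With an overlap of 1, each chunk repeats the last character of the previous one, which is the "sliding window" effect described above.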

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 400, chunk_overlap = 0)
all_splits = text_splitter.split_documents(data)


## 3. Vectorstores
Since models operate on vectors rather than characters, we need to map the text data into a vector space (embedding). Several vector databases are available, such as ChromaDB, Milvus, and pgvector.

Before loading the data into a vector database, we need an embedding model. The `Embeddings` class is designed for interfacing with text embedding models, and there are many embedding model providers (OpenAI, Cohere, Hugging Face, etc.).

https://python.langchain.com/en/latest/modules/indexes/vectorstores.html
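
Conceptually, a vector store keeps one vector per chunk and returns the chunks whose vectors are closest to the query vector. The toy example below shows this with cosine similarity in plain Python. The 3-dimensional vectors and the texts are made up for illustration, and real stores may use a different distance metric and an approximate index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """indexed: list of (text, vector) pairs; return the k texts most similar to the query."""
    ranked = sorted(indexed, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy 3-dimensional "embeddings"; real embedding models produce hundreds of dimensions.
indexed = [
    ("upstream earnings", [0.9, 0.1, 0.0]),
    ("downstream margins", [0.1, 0.9, 0.0]),
    ("dividend history", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.2, 0.0], indexed, k=2))
```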

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()

In [None]:
def get_vs_path(file_path, vs):
    # Strip the file extension (e.g. ".pdf") and append the vector store name
    return os.path.join(os.path.splitext(file_path)[0], vs)

### 3.1 Chroma

In [None]:
from langchain.vectorstores import Chroma

vs_path_chroma = get_vs_path(file, 'chroma')


# load from document
vs_chroma = Chroma.from_documents(all_splits, embeddings)


# load from disk
#vs_chroma = Chroma(persist_directory=vs_path_chroma, embedding_function=embeddings)

### 3.2 Milvus

In [None]:
from milvus import default_server
from pymilvus import connections, utility
from langchain.vectorstores import Milvus

default_server.start()

connections.connect(host='127.0.0.1', port=default_server.listen_port)

print(utility.get_server_version())

vs_milvus = Milvus.from_documents(all_splits, embedding=embeddings)

#default_server.stop()

v2.3.1-lite


### 3.3 FAISS

In [None]:
from langchain.vectorstores import FAISS

vs_path_faiss = get_vs_path(file, 'faiss')

# load from document
vs_faiss = FAISS.from_documents(all_splits, embeddings)
#vs_faiss.save_local(vs_path_faiss)


# load from disk
#vs_faiss = FAISS.load_local(vs_path_faiss, embeddings)

## 4. Retrieve
Retrieve the relevant splits for a question using similarity search. There are several ways to do this:

*   Vectorstore + similarity search
*   Vectorstore + transformed to retriever
*   Just a retriever (bypass vectorstores)

Vectorstore + similarity search is the most commonly used.
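
The first two options usually run the same search underneath. The helper below is a small sketch for checking this on any vector store built in section 3. `retrieve_both_ways` is a hypothetical name, and the code assumes the store exposes `similarity_search(query, k=...)` and `as_retriever(search_kwargs=...)`, as the LangChain stores used in this notebook do.

```python
def retrieve_both_ways(vectorstore, question, k=4):
    """Run one query through similarity_search and through a retriever.

    `vectorstore` is assumed to be a LangChain vector store such as the
    Chroma or FAISS objects built earlier.
    """
    direct = vectorstore.similarity_search(question, k=k)
    retriever = vectorstore.as_retriever(search_kwargs={"k": k})
    via_retriever = retriever.get_relevant_documents(question)
    # Compare the retrieved text, chunk by chunk and in order.
    same = [d.page_content for d in direct] == [d.page_content for d in via_retriever]
    return direct, via_retriever, same
```

For example, `retrieve_both_ways(vs_faiss, question, k=2)[2]` would be `True` when both paths return the same chunks in the same order.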

In [None]:
from langchain.retrievers import SVMRetriever
svm_retriever = SVMRetriever.from_documents(all_splits, embeddings)

In [None]:
question = "What's the upstream earnings after income tax in 2017?"


# Vectorstores + similarity search
docs_chroma_ss = vs_chroma.similarity_search(question)
docs_milvus_ss = vs_milvus.similarity_search(question)
docs_faiss_ss  = vs_faiss.similarity_search(question)


# Vectorstores + transformed to retriever
docs_chroma_r = vs_chroma.as_retriever().get_relevant_documents(question)
docs_milvus_r = vs_milvus.as_retriever().get_relevant_documents(question)
docs_faiss_r  = vs_faiss.as_retriever().get_relevant_documents(question)


# Just retriever (bypass vectorstores)
docs_svm = svm_retriever.get_relevant_documents(question)

In [None]:
def print_doc(doc):
    for i, d in enumerate(doc):
        print('-'*100)
        print('|', str(i+1)+'. Page', d.metadata['page'], '|')
        print('-'*14)
        print(d.page_content)
    print('-'*100)

In [None]:
print_doc(docs_chroma_ss)

----------------------------------------------------------------------------------------------------
| 1. Page 29 |
--------------
UPSTREAM  STATISTICAL RECAP 2018 2017 2016 2015 2014
Earnings (millions of dollars) 14,079 13,355 196 7,101 27,548
Liquids production (net, thousands of barrels per day) 2,266 2,283 2,365 2,345 2,111
Natural gas production available for sale (net, millions of cubic feet per day) 9,405 10,211 10,127 10,515 11,145
----------------------------------------------------------------------------------------------------
| 2. Page 5 |
--------------
(4)  Competitor data estimated on a consistent 
basis with ExxonMobil and based on 
public information.
(5)  Net income attributable to ExxonMobil.(6)  S&P 500 and CPI indexed to 1982 Exxon 
dividend.
(7)  CPI based on historical yearly average 
from U.S. Bureau of Labor Statistics.40
302010
0
–4FUNCTIONAL EARNINGS AND NET INCOME
(billions of dollars)Upstream Downstream Chemical
-------------------------------------------

In [None]:
print_doc(docs_milvus_ss)

----------------------------------------------------------------------------------------------------
| 1. Page 29 |
--------------
UPSTREAM  STATISTICAL RECAP 2018 2017 2016 2015 2014
Earnings (millions of dollars) 14,079 13,355 196 7,101 27,548
Liquids production (net, thousands of barrels per day) 2,266 2,283 2,365 2,345 2,111
Natural gas production available for sale (net, millions of cubic feet per day) 9,405 10,211 10,127 10,515 11,145
----------------------------------------------------------------------------------------------------
| 2. Page 5 |
--------------
(4)  Competitor data estimated on a consistent 
basis with ExxonMobil and based on 
public information.
(5)  Net income attributable to ExxonMobil.(6)  S&P 500 and CPI indexed to 1982 Exxon 
dividend.
(7)  CPI based on historical yearly average 
from U.S. Bureau of Labor Statistics.40
302010
0
–4FUNCTIONAL EARNINGS AND NET INCOME
(billions of dollars)Upstream Downstream Chemical
-------------------------------------------

In [None]:
print_doc(docs_faiss_ss)

----------------------------------------------------------------------------------------------------
| 1. Page 29 |
--------------
UPSTREAM  STATISTICAL RECAP 2018 2017 2016 2015 2014
Earnings (millions of dollars) 14,079 13,355 196 7,101 27,548
Liquids production (net, thousands of barrels per day) 2,266 2,283 2,365 2,345 2,111
Natural gas production available for sale (net, millions of cubic feet per day) 9,405 10,211 10,127 10,515 11,145
----------------------------------------------------------------------------------------------------
| 2. Page 5 |
--------------
(4)  Competitor data estimated on a consistent 
basis with ExxonMobil and based on 
public information.
(5)  Net income attributable to ExxonMobil.(6)  S&P 500 and CPI indexed to 1982 Exxon 
dividend.
(7)  CPI based on historical yearly average 
from U.S. Bureau of Labor Statistics.40
302010
0
–4FUNCTIONAL EARNINGS AND NET INCOME
(billions of dollars)Upstream Downstream Chemical
-------------------------------------------

In [None]:
print_doc(docs_svm)

----------------------------------------------------------------------------------------------------
| 1. Page 29 |
--------------
UPSTREAM  STATISTICAL RECAP 2018 2017 2016 2015 2014
Earnings (millions of dollars) 14,079 13,355 196 7,101 27,548
Liquids production (net, thousands of barrels per day) 2,266 2,283 2,365 2,345 2,111
Natural gas production available for sale (net, millions of cubic feet per day) 9,405 10,211 10,127 10,515 11,145
----------------------------------------------------------------------------------------------------
| 2. Page 28 |
--------------
DOWNSTREAM
CHEMICAL2018 10-year average
27
----------------------------------------------------------------------------------------------------
| 3. Page 38 |
--------------
37
----------------------------------------------------------------------------------------------------
| 4. Page 38 |
--------------
Earnings per common share – assuming dilution  (dollars) 4.88 4.63 1.88
The information in the Summary statement of 

In [None]:
question = "What's the upstream earnings minus income tax in 2017?"


# Vectorstores + similarity search
docs_chroma_ss = vs_chroma.similarity_search(question)
docs_milvus_ss = vs_milvus.similarity_search(question)
docs_faiss_ss  = vs_faiss.similarity_search(question)


# Vectorstores + transformed to retriever
docs_chroma_r = vs_chroma.as_retriever().get_relevant_documents(question)
docs_milvus_r = vs_milvus.as_retriever().get_relevant_documents(question)
docs_faiss_r  = vs_faiss.as_retriever().get_relevant_documents(question)


# Just retriever (bypass vectorstores)
docs_svm = svm_retriever.get_relevant_documents(question)

In [None]:
print_doc(docs_chroma_ss)

----------------------------------------------------------------------------------------------------
| 1. Page 29 |
--------------
UPSTREAM  STATISTICAL RECAP 2018 2017 2016 2015 2014
Earnings (millions of dollars) 14,079 13,355 196 7,101 27,548
Liquids production (net, thousands of barrels per day) 2,266 2,283 2,365 2,345 2,111
Natural gas production available for sale (net, millions of cubic feet per day) 9,405 10,211 10,127 10,515 11,145
----------------------------------------------------------------------------------------------------
| 2. Page 5 |
--------------
(4)  Competitor data estimated on a consistent 
basis with ExxonMobil and based on 
public information.
(5)  Net income attributable to ExxonMobil.(6)  S&P 500 and CPI indexed to 1982 Exxon 
dividend.
(7)  CPI based on historical yearly average 
from U.S. Bureau of Labor Statistics.40
302010
0
–4FUNCTIONAL EARNINGS AND NET INCOME
(billions of dollars)Upstream Downstream Chemical
-------------------------------------------

In [None]:
print_doc(docs_milvus_ss)

----------------------------------------------------------------------------------------------------
| 1. Page 29 |
--------------
UPSTREAM  STATISTICAL RECAP 2018 2017 2016 2015 2014
Earnings (millions of dollars) 14,079 13,355 196 7,101 27,548
Liquids production (net, thousands of barrels per day) 2,266 2,283 2,365 2,345 2,111
Natural gas production available for sale (net, millions of cubic feet per day) 9,405 10,211 10,127 10,515 11,145
----------------------------------------------------------------------------------------------------
| 2. Page 5 |
--------------
(4)  Competitor data estimated on a consistent 
basis with ExxonMobil and based on 
public information.
(5)  Net income attributable to ExxonMobil.(6)  S&P 500 and CPI indexed to 1982 Exxon 
dividend.
(7)  CPI based on historical yearly average 
from U.S. Bureau of Labor Statistics.40
302010
0
–4FUNCTIONAL EARNINGS AND NET INCOME
(billions of dollars)Upstream Downstream Chemical
-------------------------------------------

In [None]:
print_doc(docs_faiss_ss)

----------------------------------------------------------------------------------------------------
| 1. Page 29 |
--------------
UPSTREAM  STATISTICAL RECAP 2018 2017 2016 2015 2014
Earnings (millions of dollars) 14,079 13,355 196 7,101 27,548
Liquids production (net, thousands of barrels per day) 2,266 2,283 2,365 2,345 2,111
Natural gas production available for sale (net, millions of cubic feet per day) 9,405 10,211 10,127 10,515 11,145
----------------------------------------------------------------------------------------------------
| 2. Page 5 |
--------------
(4)  Competitor data estimated on a consistent 
basis with ExxonMobil and based on 
public information.
(5)  Net income attributable to ExxonMobil.(6)  S&P 500 and CPI indexed to 1982 Exxon 
dividend.
(7)  CPI based on historical yearly average 
from U.S. Bureau of Labor Statistics.40
302010
0
–4FUNCTIONAL EARNINGS AND NET INCOME
(billions of dollars)Upstream Downstream Chemical
-------------------------------------------

In [None]:
print_doc(docs_svm)

----------------------------------------------------------------------------------------------------
| 1. Page 29 |
--------------
UPSTREAM  STATISTICAL RECAP 2018 2017 2016 2015 2014
Earnings (millions of dollars) 14,079 13,355 196 7,101 27,548
Liquids production (net, thousands of barrels per day) 2,266 2,283 2,365 2,345 2,111
Natural gas production available for sale (net, millions of cubic feet per day) 9,405 10,211 10,127 10,515 11,145
----------------------------------------------------------------------------------------------------
| 2. Page 28 |
--------------
DOWNSTREAM
CHEMICAL2018 10-year average
27
----------------------------------------------------------------------------------------------------
| 3. Page 38 |
--------------
37
----------------------------------------------------------------------------------------------------
| 4. Page 38 |
--------------
Earnings per common share – assuming dilution  (dollars) 4.88 4.63 1.88
The information in the Summary statement of 

## 5. Model
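The LLMs we use are Flan-T5 and Mistral-7B. Both are loaded with Hugging Face `transformers` and wrapped in LangChain's `HuggingFacePipeline`.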

### 5.1 Flan-t5

In [None]:
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM, AutoModelForCausalLM

model_id_flan = 'google/flan-t5-large'
tokenizer_flan = AutoTokenizer.from_pretrained(model_id_flan)
model_flan = AutoModelForSeq2SeqLM.from_pretrained(model_id_flan)

pipe_flan = pipeline(
    "text2text-generation",
    model = model_flan,
    tokenizer = tokenizer_flan,
    max_length = 800
)

pipe_flan.model.config.pad_token_id = pipe_flan.model.config.eos_token_id
llm_flan = HuggingFacePipeline(pipeline = pipe_flan)

### 5.2 Mistral-7b

In [None]:
# model_id_mistral = "ehartford/samantha-mistral-7b"  # alternative checkpoint (unused)
model_id_mistral = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer_mistral = AutoTokenizer.from_pretrained(model_id_mistral)
model_mistral = AutoModelForCausalLM.from_pretrained(model_id_mistral)

pipe_mistral = pipeline(
    "text-generation",
    model = model_mistral,
    tokenizer = tokenizer_mistral,
    max_length = 800
)

pipe_mistral.model.config.pad_token_id = pipe_mistral.model.config.eos_token_id
llm_mistral = HuggingFacePipeline(pipeline = pipe_mistral)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## 6. Generate Answer
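The key component of this part is a question-answering chain. `RetrievalQA` combines a model, a retriever, and a prompt into a single Q&A object. The wrapper below instead retrieves the documents explicitly and passes them to `load_qa_chain`, so that retrieval time and model time can be measured separately.

For details on RetrievalQA, refer to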
https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval_qa.base.RetrievalQA.html
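
The wrapper in this section builds its chain with `chain_type="stuff"`, which places all retrieved chunks into a single prompt. The sketch below shows the idea in plain Python. The `TEMPLATE` wording and the `build_stuff_prompt` helper are assumptions for illustration; LangChain's built-in prompt text may differ.

```python
# Hypothetical template; the wording of LangChain's built-in prompt may differ.
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(chunks, question):
    """Concatenate every retrieved chunk into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    [
        "Earnings (millions of dollars) 14,079 13,355 196 7,101 27,548",
        "Net income attributable to ExxonMobil.",
    ],
    "What is the upstream earnings after income tax in 2017?",
)
print(prompt)
```

Because every chunk goes into one prompt, the number and size of retrieved chunks directly affects how much of the model's input budget is used.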

In [None]:
vectorstores = [vs_chroma, vs_milvus, vs_faiss]
retrivers = [vs_chroma.as_retriever(), vs_milvus.as_retriever(), vs_faiss.as_retriever(), svm_retriever]
models = [llm_flan, llm_mistral]

In [None]:
# wrapper function
from langchain.chains.question_answering import load_qa_chain
import time
def get_answer(q, vs, r, llm):
    # Time the retrieval step: similarity search on a vectorstore if given, otherwise a retriever
    s1 = time.time()
    if vs is not None:
        doc = vs.similarity_search(q)
    else:
        doc = r.get_relevant_documents(q)
    t1 = time.time() - s1

    # Time the generation step
    s2 = time.time()
    chain = load_qa_chain(llm, chain_type="stuff")
    # Use the function argument `q`, not the global `question`
    res = chain({"input_documents": doc, "question": q}, return_only_outputs=True)
    t2 = time.time() - s2
    return res['output_text'], round(t1, 2), round(t2, 2)


def report(label, q, vs, r, llm):
    # Print the answer and both timings for one retrieval/model combination
    res, t1, t2 = get_answer(q, vs, r, llm)
    print("    - " + label + ":", res)
    print("         retriver time: ", t1, 's')
    print("         model time:    ", t2, 's')


def show_results(q):
    print("-" * 100)
    print(q)
    print("-" * 100)

    # Milvus is left out of the comparison; add ("Milvus", vs_milvus) to include it
    stores = [("Chroma", vs_chroma), ("FAISS", vs_faiss)]

    for name, llm in [("Flan-t5", llm_flan), ("Mistral-7b", llm_mistral)]:
        print("|", name, "|")
        print("-" * 100)

        print("Vectorstore + similarity search: ")
        for label, vs in stores:
            report(label, q, vs, None, llm)
        print("-" * 100)

        print("Vectorstore + retriver: ")
        for label, vs in stores:
            report(label, q, None, vs.as_retriever(), llm)
        print("-" * 100)

        print("Retriver only: ")
        report("SVM", q, None, svm_retriever, llm)
        print("-" * 100)

## 7. Testing

In [None]:
question = 'What is the upstream earnings after income tax in 2017?'
show_results(question)

----------------------------------------------------------------------------------------------------
What is the upstream earnings after income tax in 2017?
----------------------------------------------------------------------------------------------------
| Flan-t5 |
----------------------------------------------------------------------------------------------------
Vectorstore + similarity search: 
    - Chroma: 14,079
         retriver time:  0.04 s
         model time:     3.18 s
    - FAISS: 14,079
         retriver time:  0.02 s
         model time:     2.81 s
----------------------------------------------------------------------------------------------------
Vectorstore + retriver: 
    - Chroma: 14,079
         retriver time:  0.02 s
         model time:     2.06 s
    - FAISS: 14,079
         retriver time:  0.02 s
         model time:     3.32 s
----------------------------------------------------------------------------------------------------
Retriver only: 
    - SVM: 14,

In [None]:
question = 'What is the upstream earnings minus income tax in 2017?'
show_results(question)

----------------------------------------------------------------------------------------------------
What is the upstream earnings minus income tax in 2017?
----------------------------------------------------------------------------------------------------
| Flan-t5 |
----------------------------------------------------------------------------------------------------
Vectorstore + similarity search: 
    - Chroma: 0
         retriver time:  0.03 s
         model time:     2.32 s
    - FAISS: 0
         retriver time:  0.02 s
         model time:     2.26 s
----------------------------------------------------------------------------------------------------
Vectorstore + retriver: 
    - Chroma: 0
         retriver time:  0.02 s
         model time:     2.25 s
    - FAISS: 0
         retriver time:  0.02 s
         model time:     3.2 s
----------------------------------------------------------------------------------------------------
Retriver only: 
    - SVM: 14,079
         retriver

Adjust `max_length`
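
One reason `max_length` matters here: for decoder-only models such as Mistral, the `transformers` generation setting `max_length` counts prompt tokens and generated tokens together, while for encoder-decoder models such as Flan-T5 it limits only the generated sequence. The arithmetic below is a simplified sketch; `generation_budget` is a hypothetical helper and the 700-token prompt is an assumed figure, not a measurement from this notebook.

```python
def generation_budget(prompt_tokens, max_length, decoder_only=True):
    """Tokens left for the answer under a max_length cap (simplified model)."""
    if decoder_only:
        # Prompt and answer share the same budget.
        return max(0, max_length - prompt_tokens)
    # Encoder-decoder: the cap applies to the generated sequence only.
    return max_length


# Assumed prompt size of 700 tokens, chosen only for illustration.
print(generation_budget(700, 800))
print(generation_budget(700, 1500))
print(generation_budget(700, 800, decoder_only=False))
```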

In [None]:
pipe_flan = pipeline(
    "text2text-generation",
    model = model_flan,
    tokenizer = tokenizer_flan,
    max_length = 1500
)

pipe_flan.tokenizer.pad_token_id = pipe_flan.model.config.eos_token_id
llm_flan = HuggingFacePipeline(pipeline = pipe_flan)

pipe_mistral = pipeline(
    "text-generation",
    model = model_mistral,
    tokenizer = tokenizer_mistral,
    max_length = 1500
)

llm_mistral = HuggingFacePipeline(pipeline = pipe_mistral)

question = 'What is the upstream earnings minus income tax in 2017?'
show_results(question)

----------------------------------------------------------------------------------------------------
What is the upstream earnings minus income tax in 2017?
----------------------------------------------------------------------------------------------------
| Flan-t5 |
----------------------------------------------------------------------------------------------------
Vectorstore + similarity search: 
    - Chroma: 0
         retriver time:  0.03 s
         model time:     3.54 s
    - FAISS: 0
         retriver time:  0.03 s
         model time:     2.35 s
----------------------------------------------------------------------------------------------------
Vectorstore + retriver: 
    - Chroma: 0
         retriver time:  0.02 s
         model time:     2.25 s
    - FAISS: 0
         retriver time:  0.02 s
         model time:     2.36 s
----------------------------------------------------------------------------------------------------
Retriver only: 
    - SVM: 14,079
         retrive