## 0. Installation and Setup

In [None]:
%%capture output
# hide output (the cell magic must be the first line of the cell)

! pip install pdfplumber
! pip install chromadb
! pip install grpcio==1.58.0
! pip install milvus
! pip install pymilvus
! pip install sentence-transformers
! pip install langchain
! pip install pypdf
! pip install faiss-gpu

## 1. Load Data
In LangChain, we use document loaders to load our data. We can import the loader that matches our data source from langchain.document_loaders:
1. folder: DirectoryLoader
2. Azure: AzureBlobStorageContainerLoader
3. CSV file: CSVLoader
4. Google Drive: GoogleDriveLoader
5. Website: UnstructuredHTMLLoader
6. PDF: PyPDFLoader
7. Youtube: YoutubeLoader

For more data loaders, refer to the following link:
https://python.langchain.com/docs/modules/data_connection/document_loaders.html
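Whatever the source, a loader returns a list of documents, each carrying its text plus metadata such as the source path. The sketch below imitates that shape with the standard library only. `Doc` and `load_text_folder` are hypothetical names used for illustration and are not part of LangChain.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_folder(folder):
    """Read every .txt file in `folder` into a Doc, recording its source path."""
    docs = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):
            full = os.path.join(folder, name)
            with open(full, encoding="utf-8") as fh:
                docs.append(Doc(fh.read(), {"source": full}))
    return docs


# Demo on a throwaway folder holding two small files.
sample_dir = tempfile.mkdtemp()
for name, text in [("a.txt", "first report"), ("b.txt", "second report")]:
    with open(os.path.join(sample_dir, name), "w", encoding="utf-8") as fh:
        fh.write(text)

sample_docs = load_text_folder(sample_dir)
print([d.page_content for d in sample_docs])
```

The real loaders listed above follow the same idea, with format-specific parsing in place of the plain file read.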

In [None]:
import os
from google.colab import drive
# Access drive
drive.mount('/content/drive')
path = '/content/drive/MyDrive/Capstone/'


# companies
companies = os.listdir(os.path.join(path, 'Company Reports'))
for i, comp in enumerate(companies):
    print(i, ": ", comp)


# get reports
def get_reports(comp, year:int, rep_type:int = 1):
    """
    comp:       company name (string) or index into `companies`
    year:       a specific year, or 1-10 for that many most recent years; 0 for all years
    rep_type:   report type, 1 for annual report, 2 for sustainability report, 0 for both
    ret:        list of report paths
    """
    if type(comp) == str:
        if comp not in companies:
            print("Error: ", comp, " does not exist")
            return
    elif type(comp) == int:
        if comp not in range(len(companies)):
            print("Error: invalid index")
            return
        comp = companies[comp]
    else:
        print("Error: invalid company")
        return

    file_path = os.path.join(path, 'Company Reports', comp)
    files = os.listdir(file_path)
    files.sort(reverse=True)

    years = range(2013,2023)
    if year in range(11):
        if year:
            years = years[-year:]
    else:
        years = [year]

    if rep_type == 0:
        reps = ["", "_sus"]
    elif rep_type == 1:
        reps = [""]
    elif rep_type == 2:
        reps = ["_sus"]
    else:
        print("Error: invalid report type")
        return

    ret = []
    for year in years:
        for rep in reps:
            file = comp + '_' + str(year) + rep + '.pdf'
            if file in files:
                ret.append(file)
    return [os.path.join(file_path, file) for file in ret]

Mounted at /content/drive
0 :  ExxonMobil
1 :  Shell plc
2 :  BP PLC
3 :  Saudi Aramco
4 :  Chevron
5 :  TotalEnergies
6 :  Valero Energy
7 :  Marathon Petroleum Corporation
8 :  Sinopec
9 :  PetroChina
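The `year` argument of `get_reports` does double duty, which is easy to miss. The sketch below isolates just that branch so its behaviour can be checked on its own. `select_years` is a hypothetical helper written for this illustration; the 2013 to 2022 range is taken from the function above.

```python
def select_years(year, first=2013, last=2022):
    """Mirror how get_reports interprets its `year` argument.

    0        -> every year from `first` to `last`
    1 to 10  -> that many most recent years
    other    -> treated as one specific calendar year
    """
    years = list(range(first, last + 1))
    if year in range(11):
        return years[-year:] if year else years
    return [year]


print(select_years(0))     # all available years
print(select_years(3))     # the three most recent years
print(select_years(2018))  # one specific year
```

One consequence of this design is that a calendar year between 1 and 10 can never be requested, which is harmless for reports dated 2013 onward.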


In [None]:
files = get_reports(0, 2018)
file = files[0]
file

'/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf'

In [None]:
# Take a PDF as an example. This is helpful if we download the documents directly from the company website.
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader(file)
data = loader.load_and_split()


# We can also use github (Website type) to store our original data.

# from langchain.document_loaders import WebBaseLoader

# loader = WebBaseLoader("https://drive.google.com/file/d/1EA8Iifu4kSIfziXAYz33P7Zon_u_beWb/view?usp=drive_link")
# data = loader.load()

## 2. Split the data
Once we have loaded the documents, we need to transform them to better suit our application. The simplest example is splitting a long document into smaller chunks that fit into the model's context window. The most common splitters in LangChain include:

1. RecursiveCharacterTextSplitter()
2. CharacterTextSplitter()

The parameters of the above splitters:
 - length_function: how the length of chunks is calculated. Defaults to just counting number of characters, but it's pretty common to pass a token counter here.
 - chunk_size: the maximum size of your chunks (as measured by the length function).
 - chunk_overlap: the maximum overlap between chunks. It can be nice to have some overlap to maintain some continuity between chunks (e.g. do a sliding window).
 - add_start_index: whether to include the starting position of each chunk within the original document in the metadata.
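To make `chunk_size`, `chunk_overlap`, and the start index concrete, here is a minimal sliding-window sketch in plain Python. It is deliberately simpler than RecursiveCharacterTextSplitter, which also tries to break on separators such as paragraphs and sentences; `split_with_overlap` is a hypothetical name, and length is measured in characters.

```python
def split_with_overlap(text, chunk_size, chunk_overlap=0):
    """Cut `text` into fixed-size character windows that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
for chunk in sample_chunks:
    print(chunk)
```

With an overlap of 2, each chunk repeats the last two characters of the previous one, which is the continuity that `chunk_overlap` is meant to provide.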

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 200, chunk_overlap = 0)
all_splits = text_splitter.split_documents(data)


## 3. Vectorstores
Since the model takes vectors rather than characters as input, we need to map the text data into a vector space (embedding). Several vector databases are already available, such as ChromaDB, Milvus, and pgvector.

Before loading the data into a vector database, we need a suitable embedding model. The Embeddings class is designed for interfacing with text embedding models. There are many embedding model providers (OpenAI, Cohere, Hugging Face, etc.).

https://python.langchain.com/en/latest/modules/indexes/vectorstores.html
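As a rough illustration of what "text to vector" means, the sketch below builds a bag-of-words count vector over a tiny vocabulary. This is only a toy stand-in: the Hugging Face model used in this notebook produces dense learned vectors, not word counts. `embed` and `vocab` are hypothetical names.

```python
from collections import Counter


def embed(text, vocab):
    """Toy embedding: count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


vocab = ["income", "tax", "earnings"]
sample_vec = embed("Income tax and income", vocab)
print(sample_vec)
```

Texts that share words end up with similar vectors, which is the property a vector store relies on when it searches.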

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()

In [None]:
def get_vs_path(file_path, vs):
    # Use the argument (not the global `file`) and drop the ".pdf" extension.
    return os.path.join(os.path.splitext(file_path)[0], vs)

### 3.1 Chroma

In [None]:
from langchain.vectorstores import Chroma

vs_path_chroma = get_vs_path(file, 'chroma')


# load from document
#vs_chroma = Chroma.from_documents(all_splits, embeddings, persist_directory=vs_path_chroma)


# load from disk
vs_chroma = Chroma(persist_directory=vs_path_chroma, embedding_function=embeddings)
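The cell above switches between "load from document" and "load from disk" by commenting lines in and out. One possible alternative is to decide at run time whether a persisted store already exists. The sketch below shows the pattern with stand-in callables; `load_or_build`, `demo_build`, and `demo_load` are hypothetical, and a real `build` would wrap a call such as `Chroma.from_documents`.

```python
import os
import tempfile


def load_or_build(persist_dir, build, load):
    """Load a persisted store if the directory is non-empty, otherwise build it."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load(persist_dir)
    os.makedirs(persist_dir, exist_ok=True)
    return build(persist_dir)


def demo_build(directory):
    # Stand-in for building and persisting a vector store.
    with open(os.path.join(directory, "index.txt"), "w", encoding="utf-8") as fh:
        fh.write("built")
    return "built"


def demo_load(directory):
    # Stand-in for reopening a persisted vector store.
    return "loaded"


store_dir = os.path.join(tempfile.mkdtemp(), "chroma")
first = load_or_build(store_dir, demo_build, demo_load)
second = load_or_build(store_dir, demo_build, demo_load)
print(first, second)
```

The first call finds no directory and builds; the second finds the persisted file and loads instead.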

### 3.2 Milvus

In [None]:
from milvus import default_server
from pymilvus import connections, utility
from langchain.vectorstores import Milvus

default_server.start()

connections.connect(host='127.0.0.1', port=default_server.listen_port)

print(utility.get_server_version())

vs_milvus = Milvus.from_documents(all_splits, embedding=embeddings)

#default_server.stop()

v2.3.1-lite


### 3.3 FAISS

In [None]:
from langchain.vectorstores import FAISS

vs_path_faiss = get_vs_path(file, 'faiss')

# load from document
#vs_faiss = FAISS.from_documents(all_splits, embeddings)
#vs_faiss.save_local(vs_path_faiss)


# load from disk
vs_faiss = FAISS.load_local(vs_path_faiss, embeddings)

## 4. Retrieve
Retrieve relevant splits for any question using similarity search. There are several ways to retrieve:

*   Vectorstores + similarity search
*   Vectorstores + transformed to retriever
*   Just retriever (bypass vectorstores)

Vectorstores + similarity_search is the most commonly used.
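Conceptually, a similarity search scores every stored vector against the query vector and returns the best matches. The sketch below does this with cosine similarity in plain Python. It is an assumption-laden illustration: real vector stores use optimized indexes and may use other distance metrics, and `cosine`, `top_k`, and the toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=4):
    """Return the texts of the k entries in `indexed` most similar to `query_vec`.

    `indexed` is a list of (text, vector) pairs.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


toy_index = [
    ("income taxes", [1.0, 1.0, 0.0]),
    ("upstream earnings", [0.0, 0.0, 1.0]),
    ("carbon tax", [0.0, 1.0, 0.0]),
]
hits = top_k([1.0, 1.0, 0.0], toy_index, k=2)
print(hits)
```

A retriever wraps the same operation behind a common interface, so different search back ends can be swapped without changing the calling code.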

In [None]:
# Candidate questions; only the last assignment takes effect.
question = "What's the project savings in manufacturing"
question = "What's the upstream earnings after income tax in 2017?"
question = "What's the income tax in 2017?"


# Vectorstores + similarity search
docs_chroma_ss = vs_chroma.similarity_search(question)
#docs_milvus_ss = vs_milvus.similarity_search(question)
docs_faiss_ss  = vs_faiss.similarity_search(question)


# Vectorstores + transformed to retriever
docs_chroma_r = vs_chroma.as_retriever().get_relevant_documents(question)
#docs_milvus_r = vs_milvus.as_retriever().get_relevant_documents(question)
docs_faiss_r  = vs_faiss.as_retriever().get_relevant_documents(question)


# Just retriever (bypass vectorstores)
from langchain.retrievers import SVMRetriever

svm_retriever = SVMRetriever.from_documents(all_splits, embeddings)
docs_svm = svm_retriever.get_relevant_documents(question)

In [None]:
docs_chroma_ss

[Document(page_content='cumulative earnings contribution in our Downstream business between 2017 and 2020.', metadata={'page': 14, 'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf'}),
 Document(page_content='UPSTREAM  STATISTICAL RECAP 2018 2017 2016 2015 2014\nEarnings (millions of dollars) 14,079 13,355 196 7,101 27,548\nLiquids production (net, thousands of barrels per day) 2,266 2,283 2,365 2,345 2,111', metadata={'page': 29, 'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf'}),
 Document(page_content='Earnings (millions of dollars) 6,010 5,597 4,201 6,557 3,045\nRefinery throughput (thousands of barrels per day) 4,272 4,291 4,269 4,432 4,476', metadata={'page': 31, 'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf'}),
 Document(page_content='CHEMICAL(5-year average)EARNINGS BY BUSINESS SEGMENT\n(percent)RETURN ON AVERAGE CAPITAL EMPLOYE D(3)\n0 5 10 15 20 25UPST

In [None]:
docs_milvus_ss

[Document(page_content='OVER $1 BILLION OF \nPROJECT SAVINGS \nIN MANUFACTURING \nBEST PRACTICE DEVELOPMENT\nA series of global networks provide the platform to develop', metadata={'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf', 'page': 22}),
 Document(page_content='We aggressively identify efficiencies and cost reductions during project design and development, such as the implementation of facility-related optimizations that reduce plant complexity. In', metadata={'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf', 'page': 22}),
 Document(page_content='projects – involves constructing equipment off-site at a lower cost, and then transporting it to the site fully built. We also successfully utilized this practice in the construction of the Antwerp', metadata={'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf', 'page': 22}),
 Document(page_content='reduce the inten

In [None]:
docs_faiss_ss

[Document(page_content='cumulative earnings contribution in our Downstream business between 2017 and 2020.', metadata={'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf', 'page': 14}),
 Document(page_content='UPSTREAM  STATISTICAL RECAP 2018 2017 2016 2015 2014\nEarnings (millions of dollars) 14,079 13,355 196 7,101 27,548\nLiquids production (net, thousands of barrels per day) 2,266 2,283 2,365 2,345 2,111', metadata={'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf', 'page': 29}),
 Document(page_content='Earnings (millions of dollars) 6,010 5,597 4,201 6,557 3,045\nRefinery throughput (thousands of barrels per day) 4,272 4,291 4,269 4,432 4,476', metadata={'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf', 'page': 31}),
 Document(page_content='CHEMICAL(5-year average)EARNINGS BY BUSINESS SEGMENT\n(percent)RETURN ON AVERAGE CAPITAL EMPLOYE D(3)\n0 5 10 15 20 25UPST

In [None]:
docs_svm # q: earnings after income tax

[Document(page_content='Other taxes and duties 32,663 30,104 29,020\nTotal costs and other deductions 259,259 225,689 200,145\nIncome before income taxes 30,953 18,674 7,969\nIncome taxes 9,532 (1,174) (406)', metadata={'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf', 'page': 38}),
 Document(page_content='UPSTREAM  STATISTICAL RECAP 2018 2017 2016 2015 2014\nEarnings (millions of dollars) 14,079 13,355 196 7,101 27,548\nLiquids production (net, thousands of barrels per day) 2,266 2,283 2,365 2,345 2,111', metadata={'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf', 'page': 29}),
 Document(page_content='cumulative earnings contribution in our Downstream business between 2017 and 2020.', metadata={'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf', 'page': 14}),
 Document(page_content='UPSTREAM  IS A', metadata={'source': '/content/drive/MyDrive/Capstone/Company Rep

In [None]:
docs_svm # q: income tax

[Document(page_content='Other taxes and duties 32,663 30,104 29,020\nTotal costs and other deductions 259,259 225,689 200,145\nIncome before income taxes 30,953 18,674 7,969\nIncome taxes 9,532 (1,174) (406)', metadata={'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf', 'page': 38}),
 Document(page_content='37', metadata={'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf', 'page': 38}),
 Document(page_content='Agreement and market-based approaches to reduce greenhouse gas emissions, such as a revenue-neutral carbon tax. \nOur 2019 Energy & Carbon Summary  provides a', metadata={'source': '/content/drive/MyDrive/Capstone/Company Reports/ExxonMobil/ExxonMobil_2018.pdf', 'page': 35}),
 Document(page_content='forecast to grow from about 7.4 billion people in 2016 to about 9.2 billion people by 2040. According to research by the Brookings Institution, the global middle class is expected to grow by about 80', me

## 5. Model
The LLMs we use to generate answers.

### 5.1 Flan-t5

In [None]:
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM, AutoModelForCausalLM

model_id_flan = 'google/flan-t5-large'
tokenizer_flan = AutoTokenizer.from_pretrained(model_id_flan)
model_flan = AutoModelForSeq2SeqLM.from_pretrained(model_id_flan)

pipe_flan = pipeline(
    "text2text-generation",
    model = model_flan,
    tokenizer = tokenizer_flan,
    max_length = 500
)

llm_flan = HuggingFacePipeline(pipeline = pipe_flan)

### 5.2 Mistral-7b

In [None]:
# model_id_mistral = "ehartford/samantha-mistral-7b"  # alternative checkpoint, overridden by the next line
model_id_mistral = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer_mistral = AutoTokenizer.from_pretrained(model_id_mistral)
model_mistral = AutoModelForCausalLM.from_pretrained(model_id_mistral)

pipe_mistral = pipeline(
    "text-generation",
    model = model_mistral,
    tokenizer = tokenizer_mistral,
    max_length = 500
)

llm_mistral = HuggingFacePipeline(pipeline = pipe_mistral)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## 6. Generate Answer
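The key function of this part is RetrievalQA(). We need to feed our model, retriever, and prompt into it to create a Q&A object. The wrapper below instead uses load_qa_chain with the "stuff" chain type, which takes the retrieved documents directly.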

For details on RetrievalQA, refer to
https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval_qa.base.RetrievalQA.html
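The "stuff" strategy simply places all retrieved chunks into a single prompt ahead of the question. The sketch below shows the idea in plain Python. The template wording is an assumption made for illustration and is not LangChain's actual prompt; `build_stuff_prompt` is a hypothetical name.

```python
def build_stuff_prompt(doc_texts, question):
    """Concatenate every retrieved chunk into one context block, then ask the question."""
    context = "\n\n".join(doc_texts)
    return (
        "Use the following context to answer the question.\n\n"
        + context
        + "\n\nQuestion: " + question
        + "\nAnswer:"
    )


sample_prompt = build_stuff_prompt(
    ["chunk one", "chunk two"],
    "What is the income tax in 2017?",
)
print(sample_prompt)
```

Because everything goes into one prompt, the total length of the retrieved chunks plus the question has to fit within the model's context window, which is one reason the chunks were kept small earlier.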

In [None]:
vectorstores = [vs_chroma, vs_milvus, vs_faiss]
retrivers = [vs_chroma.as_retriever(), vs_milvus.as_retriever(), vs_faiss.as_retriever(), svm_retriever]
models = [llm_flan, llm_mistral]

In [None]:
# wrapper function
from langchain.chains.question_answering import load_qa_chain

def get_answer(docs, model, q):
    # Pass the question explicitly instead of reading the global `question`.
    chain = load_qa_chain(model, chain_type="stuff")
    res = chain({"input_documents": docs, "question": q}, return_only_outputs=True)
    return res['output_text']

def show_results(q):
    sep = "-" * 100
    print(sep)
    print(q)
    print(sep)
    # Run the same retrieval strategies for each model.
    for name, llm in [("Flan-t5", llm_flan), ("Mistral", llm_mistral)]:
        print("| " + name + " |")
        print(sep)
        print("Vectorstore + similarity search: ")
        print("    - Chroma:", get_answer(vs_chroma.similarity_search(q), llm, q))
        #print("    - Milvus:", get_answer(vs_milvus.similarity_search(q), llm, q))
        print("    - FAISS: ", get_answer(vs_faiss.similarity_search(q), llm, q))
        print(sep)
        print("Vectorstore + retriver: ")
        print("    - Chroma:", get_answer(vs_chroma.as_retriever().get_relevant_documents(q), llm, q))
        #print("    - Milvus:", get_answer(vs_milvus.as_retriever().get_relevant_documents(q), llm, q))
        print("    - FAISS: ", get_answer(vs_faiss.as_retriever().get_relevant_documents(q), llm, q))
        print(sep)
        print("Retriver only: ")
        print("    - SVM:   ", get_answer(svm_retriever.get_relevant_documents(q), llm, q))
        print(sep)

In [None]:
question = 'What is the upstream earnings after income tax in 2017?'
show_results(question)

----------------------------------------------------------------------------------------------------
What is the upstream earnings after income tax in 2017?
----------------------------------------------------------------------------------------------------
| Flan-t5 |
----------------------------------------------------------------------------------------------------
Vectorstore + similarity search: 
    - Chroma: 14,079
    - FAISS:  14,079
----------------------------------------------------------------------------------------------------
Vectorstore + retriver: 
    - Chroma: 14,079
    - FAISS:  14,079
----------------------------------------------------------------------------------------------------
Retriver only: 


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


    - SVM:    14,079
----------------------------------------------------------------------------------------------------
| Mistral |
----------------------------------------------------------------------------------------------------
Vectorstore + similarity search: 
    - Chroma:  The upstream earnings after income tax in 2017 is $13,355 million.
    - FAISS:   The upstream earnings after income tax in 2017 is $13,355 million.
----------------------------------------------------------------------------------------------------
Vectorstore + retriver: 
    - Chroma:  The upstream earnings after income tax in 2017 is $13,355 million.
    - FAISS:   The upstream earnings after income tax in 2017 is $13,355 million.
----------------------------------------------------------------------------------------------------
Retriver only: 
    - SVM:     The upstream earnings after income tax in 2017 is 13,355 - 406 = 12,949 million dollars.
----------------------------------------------------------------------------------------------------


In [None]:
question = 'What is the income tax in 2017?'
show_results(question)

----------------------------------------------------------------------------------------------------
What is the income tax in 2017?
----------------------------------------------------------------------------------------------------
| Flan-t5 |
----------------------------------------------------------------------------------------------------
Vectorstore + similarity search: 
    - Chroma: 9,532
    - FAISS:  9,532
----------------------------------------------------------------------------------------------------
Vectorstore + retriver: 
    - Chroma: 9,532
    - FAISS:  9,532
----------------------------------------------------------------------------------------------------
Retriver only: 


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


    - SVM:    9,532
----------------------------------------------------------------------------------------------------
| Mistral |
----------------------------------------------------------------------------------------------------
Vectorstore + similarity search: 
    - Chroma:  The income tax in 2017 is $9,532.
    - FAISS:   The income tax in 2017 is $9,532.
----------------------------------------------------------------------------------------------------
Vectorstore + retriver: 
    - Chroma:  The income tax in 2017 is $9,532.
    - FAISS:   The income tax in 2017 is $9,532.
----------------------------------------------------------------------------------------------------
Retriver only: 
    - SVM:     The income tax in 2017 is 9,532.
----------------------------------------------------------------------------------------------------
