##  Installation and Setup

In [1]:
%%capture output
# hide output

! pip install pdfplumber
! pip install chromadb
! pip install grpcio==1.58.0
! pip install milvus
! pip install pymilvus
! pip install sentence-transformers
! pip install langchain
! pip install pypdf
! pip install faiss-gpu
! pip install pdf2image
! apt-get install -y poppler-utils

##  Load multiple PDFs
This time, we load the 2022 annual reports of all target companies to test whether the model can reliably distinguish between the data of different companies.

In [2]:
import os
from google.colab import drive
from langchain.document_loaders import PyPDFDirectoryLoader

# Mount Google Drive so the report PDFs are accessible
drive.mount('/content/drive')
path = '/content/drive/MyDrive/Capstone/'

# Load every PDF in the folder; each page becomes one Document
loader = PyPDFDirectoryLoader(os.path.join(path, "Company Reports 2022"))
data = loader.load()
len(data)

Mounted at /content/drive



## Split the data
Once the documents are loaded, we need to transform them to better suit our application. The simplest transformation is to split a long document into smaller chunks that fit into the model's context window. The most common splitters in LangChain are:

1. RecursiveCharacterTextSplitter()
2. CharacterTextSplitter()

The parameters of these splitters:
 - length_function: how the length of chunks is calculated. Defaults to just counting number of characters, but it's pretty common to pass a token counter here.
 - chunk_size: the maximum size of your chunks (as measured by the length function).
 - chunk_overlap: the maximum overlap between chunks. It can be nice to have some overlap to maintain some continuity between chunks (e.g. do a sliding window).
 - add_start_index: whether to include the starting position of each chunk within the original document in the metadata.
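
To make the effect of chunk_size and chunk_overlap concrete, here is a minimal sliding-window sketch in plain Python. It is an illustration only and not LangChain's actual splitting algorithm: the helper name split_with_overlap is hypothetical, lengths are counted in characters, and the start_index field only mimics what add_start_index stores in the metadata.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Naive character-based sliding window (illustration, not LangChain's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        # Record the chunk together with its position in the original text
        chunks.append({"text": text[start:start + chunk_size], "start_index": start})
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


for chunk in split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2):
    print(chunk)
```

With an overlap of 2, consecutive chunks share two characters, which is the continuity the chunk_overlap parameter is meant to provide. RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and sentences, which this sketch does not attempt.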

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)
all_splits = text_splitter.split_documents(data)


## Vectorstores
Because similarity search works on vectors rather than raw characters, we need to map the text into a vector space (embedding). Several vector databases are already available, such as ChromaDB, Milvus, and pgvector.

Before loading the data into a vector database, we need a suitable embedding model. The Embeddings class in LangChain is designed for interfacing with text embedding models from many providers (OpenAI, Cohere, Hugging Face, etc.).

https://python.langchain.com/en/latest/modules/indexes/vectorstores.html

This time, we use FAISS.
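
As a rough picture of what a vector store does at query time, the sketch below ranks stored chunks by cosine similarity to a query vector using brute force. The vectors are hand-written toy values rather than real embeddings, the helper names are hypothetical, and cosine similarity is only one common choice of metric; FAISS uses optimized index structures and its configured metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "vector store": (chunk text, made-up 3-dimensional embedding)
toy_store = [
    ("chunk about upstream earnings", [0.9, 0.1, 0.0]),
    ("chunk about board members", [0.1, 0.9, 0.0]),
    ("chunk about revenue", [0.7, 0.2, 0.1]),
]

print(top_k([1.0, 0.0, 0.0], toy_store, k=2))
```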

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()

In [None]:
from langchain.vectorstores import FAISS

# vs_path_faiss = get_vs_path(file, 'faiss')

# load from document
vs_faiss = FAISS.from_documents(all_splits, embeddings)
#vs_faiss.save_local(vs_path_faiss)


# load from disk
#vs_faiss = FAISS.load_local(vs_path_faiss, embeddings)

##  Model
For the LLM, we use Mistral-7B (the mistralai/Mistral-7B-Instruct-v0.1 checkpoint).

In [None]:
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, pipeline, AutoModelForCausalLM

# Alternative checkpoint: "ehartford/samantha-mistral-7b"
model_id_mistral = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer_mistral = AutoTokenizer.from_pretrained(model_id_mistral)
model_mistral = AutoModelForCausalLM.from_pretrained(model_id_mistral)

pipe_mistral = pipeline(
    "text-generation",
    model = model_mistral,
    tokenizer = tokenizer_mistral,
    max_length = 2000
)

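# Reuse the end-of-sequence token as the padding token
pipe_mistral.model.config.pad_token_id = pipe_mistral.model.config.eos_token_id
# Wrap the transformers pipeline so LangChain chains can call it as an LLM
llm_mistral = HuggingFacePipeline(pipeline = pipe_mistral)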

##  Conversation


In [None]:
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate

DEFAULT_SYSTEM_PROMPT = """
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
""".strip()

def generate_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
[INST] <<SYS>>
{system_prompt}
<</SYS>>

{prompt} [/INST]
""".strip()

SYSTEM_PROMPT = "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer."
template = generate_prompt(
    "Combine the chat history and follow up question into "
    "a standalone question. Chat History: {chat_history} "
    "Follow up question: {question}",
    system_prompt=SYSTEM_PROMPT,
)
prompt = PromptTemplate.from_template(template)

memory = ConversationBufferWindowMemory(
    memory_key="history", k=6, return_only_outputs=True
)

question_generator_chain = LLMChain(
    llm=llm_mistral, prompt=prompt)
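
The idea behind a windowed memory with k=6 is that only the most recent exchanges are kept and passed along as chat history. The sketch below shows that idea in plain Python. It is not LangChain's implementation: the class name WindowedHistory and the "Human:" / "Assistant:" labels are assumptions made for illustration.

```python
from collections import deque


class WindowedHistory:
    """Keeps only the most recent k (question, answer) exchanges."""

    def __init__(self, k):
        # A deque with maxlen silently drops the oldest entry when full
        self.turns = deque(maxlen=k)

    def add(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        # Flatten the kept exchanges into one string for a {chat_history} slot
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)


history = WindowedHistory(k=2)
history.add("q1", "a1")
history.add("q2", "a2")
history.add("q3", "a3")  # with k=2, the first exchange is dropped here
print(history.as_text())
```

A small window keeps the prompt short, which matters here because the pipeline caps the total sequence length with max_length.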


In [None]:
qa = ConversationalRetrievalChain.from_llm(
    llm = llm_mistral,  retriever = vs_faiss.as_retriever(), return_source_documents=True)

chat_history = []
while True:
  question = input('Send a question:')
  # Use chat_history to store history data
  if question == 'exit':
    break
  if question == 'clear':
    chat_history = []
    continue
  result = qa({'question': question, 'chat_history': chat_history})
  chat_history.append((question, result['answer']))
  print(result['answer'])
  print('I find this answer from', result['source_documents'][0].metadata['source'], ' page', result['source_documents'][0].metadata['page'])
  print('-'*100)

Send a question:What is the upstreaming earnings of ExxonMobil in 2022?




 The upstreaming earnings of ExxonMobil in 2022 is $12.6 billion.
I find this answer from /content/drive/MyDrive/Capstone/Company Reports 2022/ExxonMobil_2022.pdf  page 148
----------------------------------------------------------------------------------------------------
Send a question:Who is the Chairman of Chevron?
  The Chairman of Chevron is Michael K. Wirth.
I find this answer from /content/drive/MyDrive/Capstone/Company Reports 2022/Chevron_2022.pdf  page 14
----------------------------------------------------------------------------------------------------
Send a question:In the past three years, in which year did BP PLC have the highest revenue? How much exactly?
  The highest revenue BP PLC had in the past three years was in 2019, with a revenue of £154 million.
I find this answer from /content/drive/MyDrive/Capstone/Company Reports 2022/BP PLC_2022.pdf  page 292
----------------------------------------------------------------------------------------------------
Send a ques
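
The loop above reports only the first retrieved document as the source. A possible refinement, sketched below, is to list every distinct (source, page) pair among the retrieved documents. The helper name unique_sources is hypothetical, and SimpleNamespace objects with made-up file names stand in for LangChain Document objects, assuming only that each has a metadata dict with 'source' and 'page' keys as in the loop above.

```python
from types import SimpleNamespace


def unique_sources(source_documents):
    """Return distinct (source, page) pairs in retrieval order."""
    seen = set()
    ordered = []
    for doc in source_documents:
        key = (doc.metadata.get("source"), doc.metadata.get("page"))
        if key not in seen:
            seen.add(key)
            ordered.append(key)
    return ordered


# Stand-ins for retrieved documents (file names are placeholders)
fake_docs = [
    SimpleNamespace(metadata={"source": "report_a.pdf", "page": 3}),
    SimpleNamespace(metadata={"source": "report_b.pdf", "page": 10}),
    SimpleNamespace(metadata={"source": "report_a.pdf", "page": 3}),
]

for source, page in unique_sources(fake_docs):
    print("Source:", source, "page", page)
```

Listing all sources makes it easier to spot answers that mix chunks from different companies' reports.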

In [None]:
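from pdf2image import convert_from_path

# Render each PDF page as an image (relies on poppler-utils, installed above)
images = convert_from_path("/content/drive/MyDrive/Capstone/Company Reports/Chevron/Chevron_2018.pdf")
print(len(images))  # number of pages rendered
images[0]  # display the first page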