# 2. Chunking

Chunking is necessary because of the limits of an LLM's context window, but there are a few things to consider before chunking. [Reference](https://www.pinecone.io/learn/chunking-strategies/)

1. Document Structure & Length

 - Long (Book, academic articles, ...) or Short (Social Media post, reviews, ...)
 - Format (html, markdown, pdf, ...)


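When embedding at the sentence level, the embedding focuses on the specific meaning of that sentence, so it can miss the broader context found in the surrounding paragraph or document. Conversely, embedding a whole paragraph or document yields a more comprehensive vector representation that accounts for both the overall context and the relationships between sentences and phrases, capturing the broader meaning and themes of the text. However, a large input can dilute the importance of individual sentences or phrases and make it harder to find precise matches when querying the index.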

2. Embedding Model:

 - The chunk size matters because it determines which embedding model is a good fit.
 - Sentence-Transformer (works well on individual sentences) vs OpenAI "text-embedding-ada-002" (chunks of 256 or 512 tokens)

3. Expected Queries:
 - Will user queries be short and specific, or long and complex?

4. Top-K retrieve
 - The more chunks are retrieved, the more tokens go into the prompt, which increases inference time and memory consumption.
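
The trade-off in item 4 can be made concrete with rough arithmetic. The sketch below estimates the prompt size for a given top-k. The function name and the ratio of about 4 characters per token are assumptions for illustration only; a real tokenizer will give different numbers.

```python
def estimate_prompt_tokens(k: int, chunk_chars: int, query_chars: int,
                           chars_per_token: float = 4.0) -> int:
    """Rough prompt size when k retrieved chunks are stuffed into one prompt.

    chars_per_token is an assumed average for English text, not a measurement.
    """
    total_chars = k * chunk_chars + query_chars
    return round(total_chars / chars_per_token)


# For a fixed chunk size the prompt grows linearly with k.
for k in (1, 3, 5):
    print(k, estimate_prompt_tokens(k, chunk_chars=4000, query_chars=200))
```

With large chunks, even a small k can take up a sizeable part of the context window, which is why chunk size and top-k are best chosen together.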

### Chunking Methods

1. Fixed-size chunking

2. Content-aware Chunking
 - Sentence splitting
    - Naive splitting
    ```python
    # based on number of characters
    from langchain.text_splitter import CharacterTextSplitter
    splitter = CharacterTextSplitter(
      chunk_size=100,
      chunk_overlap=10,
      separator='\n\n'
    )
    chunks = splitter.split_text(text)

    # based on sentences
    chunks = text.split(".")
    ```

    - NLTK sentence splitting
    ```python
    from langchain.text_splitter import NLTKTextSplitter
    splitter = NLTKTextSplitter()
    chunks = splitter.split_text(text)
    ```
    - spaCy sentence splitting
    ```python
    from langchain.text_splitter import SpacyTextSplitter
    splitter = SpacyTextSplitter()
    chunks = splitter.split_text(text)
    ```
 - Recursive Chunking
 - Specialized Chunking
    - Markdown
    - Latex
    - HTML
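
Fixed-size chunking is listed above without an example, so here is a minimal sketch in plain Python. The helper name `fixed_size_chunks` is hypothetical, and it counts characters rather than tokens, which is a simplification.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters, so text near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1))
```

The windows ignore sentence and paragraph boundaries entirely, which is the main motivation for the content-aware methods in item 2.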
    

### Make documents from crawled data

Strategy

1. Concatenate each question with its (accepted) answer into a single passage, turn that into a document, and store the documents together.

2. Build documents from the answers only.

In [None]:
# %load_ext autoreload
# %autoreload 2

In [None]:
import os
import json
import glob

from tqdm import tqdm
from dotenv import load_dotenv
load_dotenv()
STACKEXCHANGE_API_KEY=os.environ.get('STACKEXCHANGE_API_KEY', None)

In [None]:
# load crawled data
concat_contents = []
answer_contents = []
question_contents = []

for json_path in sorted(
    glob.glob("../data/law_stackexchange/raw/*.json"),
    key=lambda x: int(x.split("/")[-1].split(".")[0]),
):
    with open(json_path, "r") as f:
        data = json.load(f)

    for question in data["items"]:
        # filter only answered questions
        if question["is_answered"] and question["answer_count"] > 0:
            question_content = (question["body"] + "\n" + question["title"]).strip()
            concat_content = (question["body"] + "\n" + question["title"]).strip()
            accepted_answer_id = question.get("accepted_answer_id", None)
            
            # concat question with accepted answers
            for answer in question["answers"]:
                if accepted_answer_id is not None and answer["answer_id"] == accepted_answer_id:
                    concat_content += "\n" + answer["body"].strip()
                    answer_contents.append(answer["body"].strip())
                
                elif int(answer["score"]) >= 5:
                    answer_contents.append(answer["body"].strip())

            concat_contents.append(concat_content.strip())
            question_contents.append(question_content.strip())

In [None]:
len(concat_contents), len(answer_contents), len(question_contents)

(25629, 17802, 25629)

Would it work to summarize the contents with an OpenAI model and store the summaries in the DB?
```python
# summarize concat contents
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import Ollama, LlamaCpp
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI  # needed for ChatOpenAI below
# docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:latest

# Define prompt
prompt_template = """Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:"""
# prompt_template = """Write a summary of the following:
# "{text}"
# SUMMARY:"""
prompt = PromptTemplate.from_template(prompt_template)

# Define LLM chain
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-1106")
# llm = Ollama(model="llama2")
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Define StuffDocumentsChain
stuff_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")

# save summarized results one by one to keep track
with open("../data/law_stackexchange/summarized.txt", 'w') as f:
    f.write('')
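failed_chunk_ids = []
# chunks: a list of Document objects, e.g. built from concat_contents
for chunk_idx, chunk in enumerate(tqdm(chunks)):
    try:
        content = stuff_chain.run([chunk])
    except Exception:
        # record the failure and move on so one bad chunk does not stop the run
        failed_chunk_ids.append(chunk_idx)
        continue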
    with open("../data/law_stackexchange/summarized.txt", 'a') as f:
        f.write(content + '\n\n')
# 2029 so far
# summarized_chunks = [stuff_chain.run([chunk]) for chunk in tqdm(chunks)]
# summarized_contents = '\n\n'.join([summarized_chunk.page_content for summarized_chunk in summarized_chunks])

```

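I tried stripping the markdown tags and doing some additional preprocessing, but concluded that it works better to have GPT summarize the raw documents directly.


However, GPT is expensive: summarizing everything was estimated at roughly $15 to $20. I tried an open-source model instead, but the CPU version of the model was far too slow, so I decided to just use the raw data.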

See the [Ollama](https://github.com/ollama/ollama?tab=readme-ov-file) homepage for Ollama.
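
For reference, the tag-stripping preprocessing mentioned above could look roughly like the sketch below. It uses only the standard library and assumes the crawled bodies are HTML fragments (the retrieved documents in this notebook contain `<p>` tags). The helper `strip_html` is hypothetical and is not the preprocessing that was actually tried.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops the tags themselves."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(body: str) -> str:
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(body)
    parser.close()
    # Collapse the runs of whitespace left behind by removed tags.
    return " ".join("".join(parser.parts).split())


print(strip_html("<p>Is this <em>legal</em>?</p>\n\n<p>Asking for a friend.</p>"))
```

Note that this also discards paragraph breaks, which a splitter could otherwise use as separators, so it is not obviously an improvement over the raw text.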


In [None]:
# chunks = [Document(page_content=page_content) for page_content in concat_contents]

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000, # a sentence transformer calls for a smaller value, but a large one is used so the vectors capture more context
    chunk_overlap=200,
)
concat_chunks = text_splitter.create_documents(concat_contents)
answer_chunks = text_splitter.create_documents(answer_contents)
question_chunks = text_splitter.create_documents(question_contents)

In [None]:
print("concat_chunks", len(concat_chunks))
print("answer_chunks", len(answer_chunks))
print("question_chunks", len(question_chunks))

concat_chunks 30036
answer_chunks 20326
question_chunks 26135
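
To illustrate the idea behind recursive character splitting, here is a simplified sketch in plain Python. It is not LangChain's implementation: it has no overlap, it drops the separators at chunk boundaries, and the function name and separator list are assumptions. The principle is to split on the coarsest separator first and fall back to finer ones only for pieces that are still too long.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Simplified recursive splitting (no overlap, boundary separators dropped)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", chunk_size=7))
```

Paragraphs that already fit stay intact, and only oversized paragraphs are broken at line or word boundaries.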


# 3. Indexing

Embedding each raw document whole with the OpenAI embedding model, without splitting it, is also expensive,
so I used a sentence transformer for the embeddings instead of the existing OpenAIEmbeddings.
Of course, the sentence transformer now receives multiple sentences as input, so I expect the embedding quality to drop.
Even so, it seems to work better than expected.

Below is the code that uses the OpenAI embedding model.


```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
# text-embedding-3-small is cheaper than text-embedding-ada-002 and performs well
embeddings_model = OpenAIEmbeddings(openai_api_key=os.environ.get('OPENAI_API_KEY'), model="text-embedding-3-small")
vectorstores = FAISS.from_documents(chunks, embeddings_model)
vectorstores.save_local("faiss_index_law_stackexchange")

```

In [None]:
# replace OpenAI embeddings with a sentence transformer to save costs
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

EMBEDDINGS_MODEL_NAME = "all-MiniLM-L6-v2"

embeddings_model = HuggingFaceEmbeddings(model_name=EMBEDDINGS_MODEL_NAME)

for chunk_type, chunks in [('question', question_chunks), ('concat', concat_chunks), ('answer', answer_chunks)]:
    DB_PATH = f"../db/law_stackexchange/sentence_transformer/faiss_index_{chunk_type}"
    os.makedirs(os.path.dirname(DB_PATH), exist_ok=True)

    try:
        # reuse an index saved by a previous run
        vectorstores = FAISS.load_local(DB_PATH, embeddings_model)
    except Exception as e:
        print(e)
        print("Failed to load FAISS index, creating new one..")
        vectorstores = FAISS.from_documents(chunks, embeddings_model)
        vectorstores.save_local(DB_PATH)


Error in faiss::FileIOReader::FileIOReader(const char*) at /project/faiss/faiss/impl/io.cpp:67: Error: 'f' failed: could not open ../db/law_stackexchange/sentence_transformer/faiss_index_answer/index.faiss for reading: No such file or directory
Failed to load FAISS index, creating new one..


In [None]:
# example code for retrieving

# retrived_docs = vectorstores.similarity_search("What is the difference between a contract and a deed?", k=3)
# retrived_docs_with_score = vectorstores.similarity_search_with_score("How can I Find the license for a widely-used photograph", k=3)
retrived_docs_with_score = vectorstores.similarity_search_with_score(
    """I'm contemplating the development of a music transcription service as a personal side project. Through this service, clients would have the option to commission me to transcribe a song or specific sections of a song in exchange for a fee. Following completion, I would provide the client with a PDF version of the transcription.
As far as my understanding goes, distributing or selling any derivative work of a copyrighted song, including sheet music, is considered illegal. However, in the case of this service, my intention is not to publish or sell the sheet music; rather, I would create the transcription exclusively for the client who requested it in exchange for a fee.
I am aware that it is legally permissible to transcribe a song for personal use if you have purchased the song yourself. However, I am uncertain about the legality when someone else pays for a transcription of a song they have purchased, with the understanding that they are the sole recipient of the transcription.""",
      k=3)

In [None]:
retrived_docs_with_score

[(Document(page_content="<p>I'm considering creating a music transcription service as a side hobby project. With this service, clients could request that I transcribe a song (or parts of a song) for them in exchange for a fee. I would then send the client a PDF version of my transcription.</p>\n\n<p>From my understanding, it is illegal when you sell or give away any derivative form of a copyrighted song (including sheet music). With this service, I do not intend to publish or sell sheet music. I would merely create the transcription, and give it to the client that requested it for a fee. </p>\n\n<p>I understand it is perfectly legal to transcribe a song you purchased yourself, as long as it is for personal use. Does this become copyright infringement when someone else pays you to transcribe a song they've purchased, and this person is the only one who will get the transcription? </p>\n\nLegality of Music Transcription Service"),
  0.0892093),
 (Document(page_content="<p>I'm learning to
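
The score next to each document is a distance, so lower means closer: the near-identical question above scores about 0.09. The toy sketch below shows that ranking in plain Python. It assumes an L2-based index that reports squared Euclidean distance, and the vectors and helper names are made up for illustration.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int):
    """Return the k (name, distance) pairs with the smallest distance."""
    scored = [(name, squared_l2(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda item: item[1])[:k]


# Toy 2-d "embeddings"; real sentence-transformer vectors have hundreds of dimensions.
docs = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [1.0, 0.5]}
for name, dist in top_k([1.0, 0.0], docs, k=2):
    print(name, round(dist, 3))
```

If a similarity-style score is preferred, where higher is better, the distance has to be converted or a different index metric chosen.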

# 4. Evaluation

This step builds a QA set for evaluation. A chat model generates the QA set using the prompt below.


In [None]:
import os
import re
import wandb
import random
import pandas as pd
from typing import List
from pydantic import BaseModel, Field, validator

from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
from langchain.output_parsers import OutputFixingParser
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import QAGenerationChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

from langchain.callbacks import get_openai_callback


### Generate QA Eval Set

In [None]:
templ = """You are a smart assistant designed to come up with meaningful question and answer pair. The question should be to the point and the answer should be as detailed as possible.
Given a piece of text, you must come up with a question and answer pair that can be used to evaluate a QA bot. Do not make up stuff. Stick to the text to come up with the question and answer pair.
When coming up with this question/answer pair, you must respond in the following format:
```
{{
    "question": "$YOUR_QUESTION_HERE",
    "answer": "$THE_ANSWER_HERE"
}}
```

Everything between the ``` must be valid json.

Please come up with a question/answer pair, in the specified JSON format, for the following text:
----------------
{text}"""

PROMPT = PromptTemplate.from_template(templ)
num_qa_pairs = 5
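
The prompt asks the model to reply with a JSON object wrapped in triple backticks. `QAGenerationChain` does its own parsing, so the sketch below only illustrates what handling such a reply involves. The helper `parse_qa_response` is hypothetical.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays fence-safe


def parse_qa_response(response: str) -> dict:
    """Extract the JSON object from a fenced model reply (hypothetical helper)."""
    pattern = FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE
    match = re.search(pattern, response, re.DOTALL)
    # Fall back to the whole reply if the model forgot the fences.
    payload = match.group(1) if match else response
    qa = json.loads(payload)
    if not {"question", "answer"} <= qa.keys():
        raise ValueError("response is missing 'question' or 'answer'")
    return qa


reply = FENCE + '\n{"question": "Q?", "answer": "A."}\n' + FENCE
print(parse_qa_response(reply))
```

A reply that is not valid JSON raises `json.JSONDecodeError`, which a caller could catch in order to retry the generation.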


In [None]:
# Generate QA
EVAL_QA_PAIRS_PATH = '../data/law_stackexchange/eval_qa_pairs.json'
if not os.path.exists(EVAL_QA_PAIRS_PATH):
    llm = ChatOpenAI(temperature=0.9)
    chain = QAGenerationChain.from_llm(llm=llm, prompt=PROMPT)

    random_chunks = []
    for i in range(num_qa_pairs):
        # randrange excludes the upper bound, so every value is a valid index
        random_chunks.append(random.randrange(len(concat_chunks)))

    print("random_chunks", random_chunks)
    eval_qa_pairs = []

    for idx in random_chunks:
        qa = chain.run(concat_chunks[idx].page_content)
        eval_qa_pairs.extend(qa)

    with open(EVAL_QA_PAIRS_PATH, 'w') as f:
        json.dump(eval_qa_pairs, f, indent=4)
else:
    with open(EVAL_QA_PAIRS_PATH, 'r') as f:
        eval_qa_pairs = json.load(f)


In [None]:
eval_qa_pairs

[{'question': 'Can a nation prevent another nation from issuing visas to its citizens?',
  'answer': "No, a nation cannot directly tell another nation to not grant a particular visa. However, a nation can require its citizens to obtain an exit permit or work permit, as exemplified by Nepal's requirement for citizens emigrating to the United States on an H-1B visa. This exit permit needs to be presented to immigration in order to leave the country. Imposing exit visas, though possible, is not common and is often associated with authoritarian regimes. It is important to note that imposing an exit visa may raise concerns or objections in western society."},
 {'question': 'What principles of sentencing are applied in Canada?',
  'answer': 'In Canada, the declared purposes of sentencing do not include revenge. The purposes of sentencing, as listed in section 718 of the Criminal Code, are to denounce unlawful conduct and the harm done to victims or the community, to deter offenders and other

### QA Pipeline

In [None]:
from langchain.chains import RetrievalQA

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS


embeddings_model = HuggingFaceEmbeddings(model_name=EMBEDDINGS_MODEL_NAME)
faiss_db = FAISS.load_local(
    f"../db/law_stackexchange/sentence_transformer/faiss_index_answer", embeddings_model
)
retriever = faiss_db.as_retriever()

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

In [None]:
qa.run("Can a nation prevent another nation from issuing visas to its citizens?")

"No, a nation cannot directly prevent another nation from issuing visas to its citizens. Each nation has sovereignty over its own immigration policies and can decide who to grant visas to. However, a nation can indirectly restrict its citizens from traveling to another country by implementing exit visas, as mentioned in the initial context. Exit visas require citizens to obtain permission from their own government before leaving the country. But it's worth noting that exit visas are not common and can be seen as limiting individual freedom."

In [None]:
predictions = []

for qa_pair in eval_qa_pairs:
    question = qa_pair["question"]
    print(question)
    predictions.append({"response": qa.run(question)})

Can a nation prevent another nation from issuing visas to its citizens?
What principles of sentencing are applied in Canada?
According to the definition of a right, can a child in a developing country go to a private school or private hospital without paying any fee and tuition just because that's their right?
How is Citymapper not breaking the Google Maps Terms of Service?
Under what circumstances can a person use deadly physical force in Oregon?


### Eval

In [None]:
from langchain.evaluation.qa import QAEvalChain

In [None]:
eval_chain = QAEvalChain.from_llm(llm = OpenAI(temperature=0))


In [None]:
graded_outputs = eval_chain.evaluate(
    eval_qa_pairs,
    predictions,
    question_key="question",
    answer_key="answer",
    prediction_key="response",
)

In [None]:
graded_outputs

[{'results': ' CORRECT'},
 {'results': ' CORRECT'},
 {'results': ' CORRECT'},
 {'results': ' CORRECT'},
 {'results': ' CORRECT'}]

In [None]:
correct = 0
for graded_output in graded_outputs:
    assert isinstance(graded_output, dict)
    if graded_output["results"].strip() == "CORRECT":
        correct+=1

correct/len(graded_outputs)

1.0

In [None]:
from evaluate import load

In [None]:
squad_metric = load("squad")

In [None]:
squad_metric

EvaluationModule(name: "squad", module_type: "metric", features: {'predictions': {'id': Value(dtype='string', id=None), 'prediction_text': Value(dtype='string', id=None)}, 'references': {'id': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}}, usage: """
Computes SQuAD scores (F1 and EM).
Args:
    predictions: List of question-answers dictionaries with the following key-values:
        - 'id': id of the question-answer pair as given in the references (see below)
        - 'prediction_text': the text of the answer
    references: List of question-answers dictionaries with the following key-values:
        - 'id': id of the question-answer pair (see above),
        - 'answers': a Dict in the SQuAD dataset format
            {
                'text': list of possible texts for the answer, as a list of strings
                'answer_start': list of start positions for 

In [None]:
# Some data munging to get the examples in the right format
for i, eg in enumerate(eval_qa_pairs):
    eg["id"] = str(i)
    eg["answers"] = {"text": [eg["answer"]], "answer_start": [0]}
    predictions[i]["id"] = str(i)
    predictions[i]["prediction_text"] = predictions[i]["response"]

for p in predictions:
    del p["response"]

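# build fresh reference dicts; a shallow .copy() shares the inner dicts, so deleting keys would also strip eval_qa_pairs
new_qa_pairs = [{"id": eg["id"], "answers": eg["answers"]} for eg in eval_qa_pairs]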

In [None]:
results = squad_metric.compute(
    references=[new_qa_pairs[1]],
    predictions=[predictions[1]],
) # can also get mean scores

In [None]:
results

{'exact_match': 0.0, 'f1': 60.416666666666664}
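
An exact match of 0 alongside an F1 of about 60 is expected for free-form answers: the prediction shares many tokens with the reference but is not identical to it. The sketch below shows the SQuAD-style computation in plain Python. It follows the usual normalization (lowercase, strip punctuation and English articles) but is a reimplementation for illustration, not the `evaluate` library's code, and it returns values in the range 0 to 1 rather than 0 to 100.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


pred = "No, a nation cannot"
ref = "a nation cannot do that"
print(exact_match(pred, ref), round(token_f1(pred, ref), 3))
```

Because long generated answers rarely match a reference word for word, token-level F1 is the more informative of the two numbers here.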