### Text Splitter

document ➡️ chunks ➡️ vector store ➡️ retrieval ➡️ LLM answer using the retrieved chunks + the question
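
To make the flow above concrete, here is a toy sketch in plain Python. It is not how a real pipeline is built: word overlap stands in for vector similarity, a list stands in for the vector store, and no LLM is called. All function names (`chunk_text`, `score`, `select_chunks`, `build_prompt`) are hypothetical and exist only for this illustration.

```python
# Toy sketch: chunk -> "store" -> select -> prompt.
# Word overlap is an assumed stand-in for vector similarity.

def chunk_text(text, separator="\n\n"):
    """Split a document into chunks on a separator, dropping empty pieces."""
    return [piece.strip() for piece in text.split(separator) if piece.strip()]

def score(chunk, question):
    """Count the distinct lowercase words shared by the chunk and the question."""
    return len(set(chunk.lower().split()) & set(question.lower().split()))

def select_chunks(chunks, question, k=1):
    """Return the k chunks with the highest word-overlap score."""
    return sorted(chunks, key=lambda c: score(c, question), reverse=True)[:k]

def build_prompt(selected, question):
    """Combine the selected chunks and the question into one prompt string."""
    context = "\n---\n".join(selected)
    return f"Context:\n{context}\n\nQuestion: {question}"

document = "Tokyo is on fire tonight.\n\nThe radio plays all night long."
question = "What does the radio play?"
chunks = chunk_text(document)
selected = select_chunks(chunks, question)
prompt = build_prompt(selected, question)
print(prompt)
```

The prompt string is what would be sent to the LLM. The quality of the answer depends on how the document was chunked, which is why the splitters below matter.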

### CharacterTextSplitter

Splits on a single separator, so a chunk can exceed the maximum size (`chunk_size`) when no separator falls within the limit.
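
The following plain-Python sketch shows why a single separator cannot guarantee the limit. It is a simplified illustration, not LangChain's implementation, and `split_on_separator` is a hypothetical helper: pieces are packed greedily, and a piece that is already too long is emitted unchanged because there is nothing finer to split on.

```python
def split_on_separator(text, separator="\n\n", chunk_size=150):
    """Greedily pack separator-delimited pieces into chunks.

    A single piece longer than chunk_size is emitted as is,
    because there is no second separator to fall back on.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "short paragraph\n\n" + "x" * 200 + "\n\nanother short one"
chunks = split_on_separator(sample, chunk_size=50)
print([len(c) for c in chunks])  # [15, 200, 17]
```

The middle paragraph contains no `"\n\n"`, so it stays at 200 characters even though the limit is 50.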

In [40]:
with open("docs/doc_1.txt") as f:
    text = f.read()

In [45]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator = "\n\n", # split on paragraph breaks
    chunk_size = 150, # maximum chunk size, measured by length_function
    chunk_overlap = 20, # overlap between adjacent chunks
    length_function = len
)

docs = text_splitter.split_text(text)
print("Chunks that exceed the limit:", [len(doc) for doc in docs
                      if len(doc) > 150], end="\n\n")
print(docs[0])
print("-"*100)
print(docs[1])

Created a chunk of size 374, which is longer than the specified 150
Created a chunk of size 460, which is longer than the specified 150
Created a chunk of size 191, which is longer than the specified 150
Created a chunk of size 201, which is longer than the specified 150


Chunks that exceed the limit: [374, 459, 190, 200, 190]

(4, 3, 2, 1)
No one sleep in Tokyo
All right crossing the line
No one quit the radio
Tokyo is on fire
Even if you say: I have been the world wide
I'll take you where surely you have never been
All night in the fight I'm okay, come on
Come on
Hey do you feel the night is breathable
Look at this town which is unbelievable
No other places like that in the world, world, world
----------------------------------------------------------------------------------------------------
(1, 2, 3, 4)
No one sleep in Tokyo
All night crossing the line
No one quit the radio
Tokyo is on fire
No one sleep in Tokyo
All night crossing the line
No one quit the radio
Tokyo is on fire
Turning to the left easy chicks and red lights
And to the right crazy music everywhere
All right in the fight I'm okay, come on
Come on
Hey do you feel the night is breathable
Look at this town which is unbelievable
No other places like that in the world, world, world


In [51]:
# Build a list of Document objects
text_splitter.create_documents(docs)

[Document(metadata={}, page_content="(4, 3, 2, 1)\nNo one sleep in Tokyo\nAll right crossing the line\nNo one quit the radio\nTokyo is on fire\nEven if you say: I have been the world wide\nI'll take you where surely you have never been\nAll night in the fight I'm okay, come on\nCome on\nHey do you feel the night is breathable\nLook at this town which is unbelievable\nNo other places like that in the world, world, world"),
 Document(metadata={}, page_content="(1, 2, 3, 4)\nNo one sleep in Tokyo\nAll night crossing the line\nNo one quit the radio\nTokyo is on fire\nNo one sleep in Tokyo\nAll night crossing the line\nNo one quit the radio\nTokyo is on fire\nTurning to the left easy chicks and red lights\nAnd to the right crazy music everywhere\nAll right in the fight I'm okay, come on\nCome on\nHey do you feel the night is breathable\nLook at this town which is unbelievable\nNo other places like that in the world, world, world"),
 Document(metadata={}, page_content='(1, 2, 3, 4)\nNo one s

### **RecursiveCharacterTextSplitter**
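Splits recursively, trying a list of separators in order (by default paragraph breaks, line breaks, spaces, then individual characters), so every chunk can stay within the maximum size.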
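
The recursive idea can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not LangChain's actual code: it ignores `chunk_overlap`, and `recursive_split` is a hypothetical name. A piece that is still too long is split again with the next, finer separator.

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=150):
    """Split text so that every chunk is at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # No separators left, or the empty separator: cut into fixed-size slices.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    separator, rest = separators[0], separators[1:]
    if separator not in text:
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for piece in text.split(separator):
        if len(piece) > chunk_size:
            # Flush what has been packed so far, then split the long piece further.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "para one line one\npara one line two\n\n" + "word " * 40
chunks = recursive_split(sample, chunk_size=50)
print(max(len(c) for c in chunks))
```

The long run of words has no paragraph or line breaks, so the sketch falls through to the space separator and still keeps every chunk within 50 characters.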

In [60]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 150, # maximum chunk size, measured by length_function
    chunk_overlap = 20, # overlap between adjacent chunks
    length_function = len
)

docs = text_splitter.split_text(text)
print("No chunk exceeds the limit:", [len(doc) for doc in docs], end="\n\n")
print(docs[0])
print("-"*100)
print("last chunk:\n")
print(docs[-1])

No chunk exceeds the limit: [145, 135, 92, 123, 113, 128, 92, 123, 66, 147, 52, 123, 66]

(4, 3, 2, 1)
No one sleep in Tokyo
All right crossing the line
No one quit the radio
Tokyo is on fire
Even if you say: I have been the world wide
----------------------------------------------------------------------------------------------------
last chunk:

All night crossing the line
No one quit the radio
Tokyo is on fire


### Other Splitters (Python splitter, **token-based splitter**)

- Python splitter

In [95]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

python_code = """
- The following is Python code.
```python
def hello_world():
    print("hello world")

# Call the function
hello_world()
```
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language = Language.PYTHON,
    chunk_size = 50,
    chunk_overlap=0
    )


In [97]:
# Check the separators
python_splitter.get_separators_for_language(Language.PYTHON)

['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']

In [96]:
python_splitter.split_text(python_code)

['- The following is Python code.\n```python',
 'def hello_world():\n    print("hello world")',
 '# Call the function\nhello_world()\n```']

- Token-based text splitter

In [108]:
import tiktoken # tokenizer

tokenizer = tiktoken.get_encoding("cl100k_base")


def tiktoken_get_len(text):
    tokens = tokenizer.encode(text)
    return len(tokens)

In [110]:
tiktoken_get_len(text)

404

In [130]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 150, # maximum chunk size, measured by length_function (tokens here)
    chunk_overlap = 20, # overlap between adjacent chunks
    length_function = tiktoken_get_len # token-count length function defined above
)

texts = text_splitter.split_text(text)
print(f"By character length : {[len(t) for t in texts]}")
print(f"By token count : {[tiktoken_get_len(t) for t in texts]}")

By character length : [374, 459, 393, 190]
By token count : [103, 122, 117, 59]
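
In the output above, the chunks are far longer than 150 characters yet all stay under 150 tokens, because `chunk_size` is always measured in whatever unit `length_function` returns. The sketch below illustrates that effect without any dependencies. It is an assumption-laden stand-in: `count_words` counts whitespace-separated words rather than real tokens, and `pack_lines` is a hypothetical greedy packer, not the LangChain splitter.

```python
def count_words(text):
    """Stand-in length function: counts whitespace-separated words, not real tokens."""
    return len(text.split())

def pack_lines(text, chunk_size, length_function=len):
    """Greedily pack lines into chunks whose measured length stays within chunk_size."""
    chunks, current = [], ""
    for line in text.splitlines():
        candidate = line if not current else current + "\n" + line
        if length_function(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = line
    if current:
        chunks.append(current)
    return chunks

sample = (
    "No one sleep in Tokyo\n"
    "All night crossing the line\n"
    "No one quit the radio\n"
    "Tokyo is on fire"
)
by_chars = pack_lines(sample, chunk_size=50, length_function=len)
by_words = pack_lines(sample, chunk_size=50, length_function=count_words)
print(len(by_chars), len(by_words))  # 2 1
```

With the same `chunk_size`, measuring in characters yields two chunks while measuring in words yields one, so the limit should be chosen with the unit of the length function in mind.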
