Document Summarization with LangChain

In [8]:
!pip install langchain transformers




In [9]:
!pip install langchain_community

Collecting langchain_community
  Downloading langchain_community-0.2.12-py3-none-any.whl.metadata (2.7 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading marshmallow-3.21.3-py3-none-any.whl.metadata (7.1 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading mypy_extensions-1.0.0-py3-none-any.whl.metadata (1.1 kB)
Downloading langchain_community-0.2.12-py3-none-any.whl (2.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 24.8 MB/s eta 0:00:00
Downloading dataclasses_json-0.6.7-

In [10]:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from transformers import pipeline

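# HuggingFacePipeline lives in langchain_community (installed above);
# importing it from langchain.llms is deprecated.
from langchain_community.llms import HuggingFacePipeline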

In [11]:
# 1. Example text
text_to_summarize = """
Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think
like humans and mimic their actions. The term may also be applied to any machine that exhibits traits associated with
a human mind such as learning and problem-solving. The ideal characteristic of artificial intelligence is its ability
to rationalize and take actions that have the best chance of achieving a specific goal. A subset of AI is machine
learning, which refers to the concept that computer programs can automatically learn from and adapt to new data without
being assisted by humans. Deep learning techniques enable this automatic learning through the absorption of huge amounts
of unstructured data such as text, images, or video.
"""


In [12]:
# 2. Create a prompt template for text summarization
# Define the prompt used to generate the summary
prompt_template = """
Summarize the following text in one concise paragraph:

{text}
"""
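
The `{text}` placeholder in the template above is filled in when the prompt is formatted. As a rough sketch of that substitution, the snippet below uses Python's built-in `str.format` as a stand-in for `PromptTemplate`; the sample sentence is made up for illustration and is not part of the notebook.

```python
# Stand-in for PromptTemplate: plain str.format fills the {text} placeholder.
prompt_template = """
Summarize the following text in one concise paragraph:

{text}
"""

# Hypothetical input, used only to show the substitution.
sample_text = "LangChain connects prompts and models."

filled_prompt = prompt_template.format(text=sample_text)
print(filled_prompt)
```

Keep in mind that a summarization model such as distilbart is not instruction-tuned, so it will most likely treat the instruction line as part of the text to be summarized rather than as a command.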

In [13]:
# 3. Set up the summarizer model (Hugging Face model)
# pipeline: loads a pretrained summarization model provided by Hugging Face
# Configure a pipeline for the summarization task

summarizer = pipeline("summarization")



No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [14]:
# 4. Create the chain
# Use LLMChain to combine the prompt template with the summarization model

# prompt = PromptTemplate(template=prompt_template, input_variables=["text"])
# chain = LLMChain(llm=summarizer, prompt=prompt)

llm = HuggingFacePipeline(pipeline=summarizer)

prompt = PromptTemplate(template=prompt_template, input_variables=["text"])
chain = LLMChain(llm=llm, prompt=prompt)  # Use the wrapped llm


  warn_deprecated(


In [15]:
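# 5. Run the summarization
# Summarize the text by calling the summarizer pipeline directly (the chain from step 4 is not used here)
# Generate a summary of at most 100 and at least 30 tokens (max_length/min_length count tokens, not words)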
summary = summarizer(text_to_summarize, max_length=100, min_length=30, do_sample=False)

In [16]:
# Print the result
print("Original Text:\n", text_to_summarize)
print("\nSummarized Text:\n", summary[0]['summary_text'])


Original Text:
 
Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think
like humans and mimic their actions. The term may also be applied to any machine that exhibits traits associated with
a human mind such as learning and problem-solving. The ideal characteristic of artificial intelligence is its ability
to rationalize and take actions that have the best chance of achieving a specific goal. A subset of AI is machine
learning, which refers to the concept that computer programs can automatically learn from and adapt to new data without
being assisted by humans. Deep learning techniques enable this automatic learning through the absorption of huge amounts
of unstructured data such as text, images, or video.


Summarized Text:
  Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions . The ideal characteristic of artificial intelli

Korean Summarization

In [17]:
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

In [18]:
# 1. KoBART model and tokenizer for Korean text
# Use the gogamza/kobart-summarization model to summarize Korean documents

model = BartForConditionalGeneration.from_pretrained('gogamza/kobart-summarization')
tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-summarization')

config.json:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels wil be overwritten to 2.


model.safetensors:   0%|          | 0.00/496M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/682k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/4.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels wil be overwritten to 2.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BartTokenizer'. 
The class this function is called from is 'PreTrainedTokenizerFast'.


In [48]:
# 2. Example text (Korean)
text_to_summarize = """
무야호가 뭐야??
"""

In [49]:
# 3. Tokenize
inputs = tokenizer.encode(text_to_summarize, return_tensors='pt', max_length=512, truncation=True)

In [50]:
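# 4. Generate the summary
# Use model.generate() to produce the summary token IDs
# length_penalty: exponent applied to the sequence length when beam hypotheses are scored
#   (score = sum of log-probabilities / length ** length_penalty); it only matters with beam search
   # 0.0: no length normalization
   # > 0.0: promotes longer sequences; the larger the value, the stronger the preference
   # < 0.0: promotes shorter sequences

# num_beams: number of candidate sequences (beams) kept at each generation step
   # More beams explore more candidates and can improve output quality
   # 1: no beam search; greedy search picks the most probable token at each step
   # > 1: several paths are explored at once; larger values explore more paths at a higher compute cost
   # e.g. num_beams=5: five paths are explored in parallel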

summary_ids = model.generate(inputs, max_length=128, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
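
To make the effect of `length_penalty` concrete, here is a minimal sketch of how a finished beam hypothesis is scored: the summed log-probability is divided by the length raised to `length_penalty`. The function name `beam_score` and the two example hypotheses are assumptions made for illustration; they are not taken from the notebook or from the library's source.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized score of a beam hypothesis (higher is better)."""
    return sum_logprobs / (length ** length_penalty)


# Hypothetical hypotheses: (summed log-probability, number of tokens).
short_hyp = (-4.0, 5)
long_hyp = (-6.0, 10)

for lp in (0.0, 1.0, 2.0):
    short_score = beam_score(*short_hyp, lp)
    long_score = beam_score(*long_hyp, lp)
    winner = "long" if long_score > short_score else "short"
    print(f"length_penalty={lp}: short={short_score:.3f}, long={long_score:.3f}, preferred={winner}")
```

Because log-probabilities are negative, dividing by a larger power of the length moves the score of a long hypothesis closer to zero, which is why `length_penalty=2.0` in the call above favors longer summaries.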


In [51]:
# 5. Print the result
print("Original Text:\n", text_to_summarize)
print("\nSummarized Text:\n", summary)


Original Text:
 
무야호가 뭐야??


Summarized Text:
  무무야호가 무무무무야호가 뭐야?                                                             무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무무
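
The repeated characters in the output above are probably a side effect of `min_length=30`: the input is a single short question, yet generation is not allowed to stop before 30 tokens. One possible guard, sketched below, derives the length bounds from the input length. The helper `choose_length_bounds` and its default values are hypothetical and not part of the notebook.

```python
def choose_length_bounds(n_input_tokens: int, max_cap: int = 128, min_floor: int = 30) -> tuple[int, int]:
    """Return (min_length, max_length) that never ask for more tokens than the input has."""
    # A summary should not be longer than its input, nor longer than the cap.
    max_length = max(1, min(max_cap, n_input_tokens))
    # Require at most half of the allowed maximum, and never more than the usual floor.
    min_length = min(min_floor, max_length // 2)
    return min_length, max_length


print(choose_length_bounds(8))    # very short input
print(choose_length_bounds(500))  # long input
```

In the notebook, `inputs.shape[1]` would be a natural value for `n_input_tokens`, with the returned pair passed to `model.generate` as `min_length` and `max_length`.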

