## Data Preprocessing and Model Loading with Transformers

The Hugging Face `transformers` library is a widely used toolkit for natural language processing (NLP). This guide shows how to use it to preprocess text data, load a pretrained model, and run predictions with that model.

### Installing the required libraries

First, install the Hugging Face `transformers` library and `torch`.

In [1]:
!pip install transformers torch

Defaulting to user installation because normal site-packages is not writeable
Collecting transformers
  Downloading transformers-4.44.0-py3-none-any.whl (9.5 MB)
Collecting torch
  Downloading torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)
Collecting tokenizers<0.20,>=0.19
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Collecting safetensors>=0.4.1
  Downloading safetensors-0.4.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (435 kB)

[notice] A new release of pip is available: 23.0.1 -> 24.2
[notice] To update, run: pip install --upgrade pip


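### Data preprocessing
Preprocessing text data requires a tokenizer.

A tokenizer converts text into the numeric form a model consumes: token IDs plus supporting tensors such as the attention mask. Hugging Face provides a matching tokenizer for each of its models.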

In [2]:
from transformers import AutoTokenizer

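# Load the tokenizer for a BERT-based model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Sample text data
texts = ["Hello, how are you?", "I am using transformers library!"]

# Tokenize the texts: pad to the longest sequence, truncate to the model's
# maximum length, and return PyTorch tensors
tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")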

print(tokens)




{'input_ids': tensor([[  101,  7592,  1010,  2129,  2024,  2017,  1029,   102],
        [  101,  1045,  2572,  2478, 19081,  3075,   999,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1]])}
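
Both sample sentences happen to produce eight token IDs, so the output above shows no padding. As a rough, library-free sketch of what `padding=True` does when lengths differ, the hypothetical `pad_batch` helper below right-pads shorter sequences and builds the matching attention mask. The shortened ID lists are for illustration only, and the padding ID of 0 is an assumption that matches `bert-base-uncased`; in real code, rely on the tokenizer itself.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks.

    Hypothetical helper for illustration; pad_id=0 is an assumption.
    """
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of different lengths (101 = [CLS], 102 = [SEP])
batch = pad_batch([[101, 7592, 102], [101, 1045, 2572, 2478, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model skip the padded positions, which is why the tokenizer returns it alongside `input_ids`.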




In [4]:
!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Thu Aug  8 01:52:32 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:17:00.0 Off |                   On |
| N/A   36C    P0              64W / 300W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+------------------------------------------------------------------

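### Loading the model
Load a pretrained model to run predictions.

This example uses BERT, one of the best-known NLP models. Note that the `bert-base-uncased` checkpoint does not include a trained classification head, so the head is newly initialized (see the warning printed under the cell) and its predictions are not meaningful until the model is fine-tuned.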

In [3]:
from transformers import AutoModelForSequenceClassification
import torch

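# Load a BERT-based sequence classification model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')

# Switch the model to evaluation mode (disables dropout)
model.eval()

# Move the model and the tokenized inputs to the GPU if one is available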
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
tokens = {key: val.to(device) for key, val in tokens.items()}


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
tokens

{'input_ids': tensor([[  101,  7592,  1010,  2129,  2024,  2017,  1029,   102],
         [  101,  1045,  2572,  2478, 19081,  3075,   999,   102]],
        device='cuda:0'),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

### Running prediction (inference)
Use the model to run prediction (inference) on the tokenized inputs.

In [8]:
# Forward pass with gradient tracking still on (note the grad_fn in the output)
outputs = model(**tokens)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-0.0799, -0.0861],
        [-0.0246, -0.1385]], device='cuda:0', grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [9]:
# Disable gradient tracking for inference, then pick the highest-scoring class per input
with torch.no_grad():
    outputs = model(**tokens)
    predictions = torch.argmax(outputs.logits, dim=-1)

print(predictions)

tensor([0, 0], device='cuda:0')
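
The `argmax` above hides how confident the model is. A minimal sketch, using only the standard library and the logits printed in the earlier output, converts each row to probabilities with a softmax. The `softmax` helper is written here for illustration and is not part of `transformers`.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of logits (illustrative helper)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the SequenceClassifierOutput shown above
logits = [[-0.0799, -0.0861], [-0.0246, -0.1385]]
probs = [softmax(row) for row in logits]

for row in probs:
    print([round(p, 3) for p in row])
```

Both rows come out close to 50/50, which is consistent with the warning that the classification head of `bert-base-uncased` is newly initialized: the predicted class 0 carries no meaning until the model is fine-tuned.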


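## Examples of other models and tasks
The Hugging Face `transformers` library covers a wide range of NLP tasks through its `pipeline` API.

### 1. Sentiment analysis
A sentiment analysis model predicts the sentiment expressed in a text.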

In [10]:
from transformers import pipeline

# Create a sentiment analysis pipeline
sentiment_pipeline = pipeline('sentiment-analysis')

# Run it on sample texts
sentiments = sentiment_pipeline(["I love using transformers!", "I hate waiting in traffic."])

print(sentiments)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'POSITIVE', 'score': 0.9994327425956726}, {'label': 'NEGATIVE', 'score': 0.9964919686317444}]
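
Each result is a plain dictionary with `label` and `score` keys, returned in the same order as the inputs. One possible way to pair results with their texts and flag low-confidence predictions is sketched below; the `label_texts` helper and the `min_score` threshold of 0.9 are illustrative assumptions, not pipeline features.

```python
def label_texts(texts, results, min_score=0.9):
    """Pair each text with its label; low-confidence results become 'UNCERTAIN'.

    Hypothetical helper; the 0.9 threshold is an arbitrary assumption.
    """
    labeled = []
    for text, result in zip(texts, results):
        label = result["label"] if result["score"] >= min_score else "UNCERTAIN"
        labeled.append((text, label))
    return labeled


# Inputs and results copied from the cell above
texts = ["I love using transformers!", "I hate waiting in traffic."]
sentiments = [
    {"label": "POSITIVE", "score": 0.9994327425956726},
    {"label": "NEGATIVE", "score": 0.9964919686317444},
]

labeled = label_texts(texts, sentiments)
for text, label in labeled:
    print(f"{label:9s} {text}")
```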


### 2. Translation
A translation model translates text into another language.

In [15]:
# Create a translation pipeline (English -> French)
translation_pipeline = pipeline('translation_en_to_fr')

# Translate a sample sentence
translations = translation_pipeline("Transformers library is amazing!")

print(translations)


No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'translation_text': 'La bibliothèque Transformers est étonnante !'}]


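### 3. Summarization
A summarization model condenses a long text into a shorter one. The sample input here is already short, so the model returns it almost verbatim, as the `max_length` warning in the output suggests.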

In [14]:
# Create a summarization pipeline
summarization_pipeline = pipeline('summarization')

# Sample text data
article = """
The Hugging Face library provides a wide range of state-of-the-art NLP models for various tasks,
including text classification, named entity recognition, question answering, and more.
It is an essential tool for anyone working in the field of natural language processing.
"""

summary = summarization_pipeline(article)

print(summary)


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Your max_length is set to 142, but your input_length is only 61. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=30)


[{'summary_text': ' The Hugging Face library provides a wide range of state-of-the-art NLP models for various tasks, including text classification, named entity recognition, question answering, and more . It is an essential tool for anyone working in the field of natural language processing .'}]


### 4. Question answering
A question answering model finds the answer to a question within a given passage of text.

In [16]:
# Create a question answering pipeline
qa_pipeline = pipeline('question-answering')

# Sample context and question
context = """
The Transformers library by Hugging Face provides a simple and powerful way to work with state-of-the-art NLP models.
It includes pre-trained models for a variety of tasks and allows for easy fine-tuning.
"""
question = "What does the Transformers library provide?"

answer = qa_pipeline(question=question, context=context)

print(answer)


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'score': 0.02247987501323223, 'start': 51, 'end': 76, 'answer': 'a simple and powerful way'}
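
The `start` and `end` values are character offsets into `context`, so the answer can be recovered by slicing. The sketch below copies the context and result from the cell above. The 0.5 cutoff is an arbitrary assumption, used only to show that a score this low deserves caution.

```python
# Context and result copied from the cell above
context = """
The Transformers library by Hugging Face provides a simple and powerful way to work with state-of-the-art NLP models.
It includes pre-trained models for a variety of tasks and allows for easy fine-tuning.
"""
answer = {"score": 0.02247987501323223, "start": 51, "end": 76,
          "answer": "a simple and powerful way"}

# Offsets count from the start of the string, including its leading newline
span = context[answer["start"]:answer["end"]]
print(span)

# Assumed threshold: treat anything under 0.5 as a low-confidence answer
low_confidence = answer["score"] < 0.5
print("low confidence" if low_confidence else "confident")
```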


### 5. Text generation
A text generation model continues a given prompt with newly generated text.

In [17]:
# Create a text generation pipeline
text_generation_pipeline = pipeline('text-generation', model='gpt2')

# Sample prompt
prompt = "Once upon a time, in a land far away,"

# Generate three continuations, each up to 50 tokens long including the prompt
generated_texts = text_generation_pipeline(prompt, max_length=50, num_return_sequences=3)

print(generated_texts)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Once upon a time, in a land far away, a man called 'Shun' set to work to free his son, a small boy with a very small nose, and a big head. He had been captured by the Emperor's armies on"}, {'generated_text': "Once upon a time, in a land far away, the most powerful and peaceful people ever to live was the man, that was called 'Sir, the King' or 'the King of England'. He stood like a statue about to rise over a"}, {'generated_text': "Once upon a time, in a land far away, the world lived within the boundaries of it's mind, just for a time...a long while. Then its mind began to fade away. After a while, however, when they finally heard that"}]
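
Each `generated_text` starts with the prompt itself. If only the continuation is wanted, one simple option is to strip the prompt prefix, as in this sketch. The `continuation` helper is hypothetical, and the sample string is the first generation from the output above.

```python
def continuation(prompt, generated_text):
    """Return the generated text without the leading prompt (illustrative helper)."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    # Leave the text untouched if it does not begin with the prompt
    return generated_text


prompt = "Once upon a time, in a land far away,"
generated = (
    "Once upon a time, in a land far away, a man called 'Shun' set to work "
    "to free his son, a small boy with a very small nose, and a big head. "
    "He had been captured by the Emperor's armies on"
)

rest = continuation(prompt, generated)
print(rest)
```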


### 6. Named entity recognition (NER)
A NER model identifies named entities, such as organizations and locations, in a text.

In [19]:
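# Create a NER pipeline; aggregation_strategy="simple" merges word pieces into
# whole entities and replaces the deprecated grouped_entities=True
ner_pipeline = pipeline('ner', aggregation_strategy="simple")

# Sample text data
text = "Hugging Face Inc. is a company based in New York City."

# Run NER prediction
entities = ner_pipeline(text)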

print(entities)


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is

[{'entity_group': 'ORG', 'score': 0.9937138, 'word': 'Hugging Face Inc', 'start': 0, 'end': 16}, {'entity_group': 'LOC', 'score': 0.99901754, 'word': 'New York City', 'start': 40, 'end': 53}]
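
As with question answering, `start` and `end` are character offsets into the input text. A small sketch, assuming the result shape shown above, groups the recognized spans by entity type. The scores are copied here as plain Python floats, and the `by_type` grouping is an illustrative choice, not something the pipeline returns.

```python
from collections import defaultdict

# Text and entities copied from the cell above
text = "Hugging Face Inc. is a company based in New York City."
entities = [
    {"entity_group": "ORG", "score": 0.9937138, "word": "Hugging Face Inc", "start": 0, "end": 16},
    {"entity_group": "LOC", "score": 0.99901754, "word": "New York City", "start": 40, "end": 53},
]

# Slice each span out of the original text and bucket it by entity type
by_type = defaultdict(list)
for ent in entities:
    by_type[ent["entity_group"]].append(text[ent["start"]:ent["end"]])

print(dict(by_type))
```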
