<a href="https://colab.research.google.com/github/rtajeong/ChatGPT_for_Management/blob/main/5_huggingface_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face Transformers
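- An open-source Python library that provides a wide variety of pre-trained models for natural language processing (NLP), computer vision (CV), and audio tasks.
- Key features:
  - Model Hub: hosts thousands of pre-trained models
  - Pipeline: a high-level API for text classification, question answering, sentiment analysis, text generation, translation, summarization, and more
  - Fine-tuning: adapt a pre-trained model to a new dataset
  - Tokenizer: converts text into the input format a model expects
- Supported models (examples):
  - BERT: a representative model for natural language understanding
  - GPT: text generation
  - T5: text-to-text transformation (translation, summarization, etc.)
  - DistilBERT: a lightweight (distilled) version of BERT
  - RoBERTa: a BERT variant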
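- Usage considerations:
  - Compared with large-scale models such as OpenAI's GPT-3.5 or GPT-4, the models typically used here are pre-trained on less data with a narrower scope, so their out-of-the-box performance tends to be lower.
  - Additional fine-tuning for a specific domain or task is usually needed to get the best performance. (Hub models are uploaded by users, so a given model may not reflect the latest techniques.)
  - By contrast, the OpenAI GPT API is designed to cover a broad range of use cases and often performs better without any fine-tuning.
- Advantages:
  - One of the biggest advantages of Hugging Face is that a model can easily be fine-tuned for a specific task or domain, which gives users more flexibility.
  - Open-source freedom
  - Cost efficiency
  - Control over model architecture and parameters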

- Key functions:
  - pipeline(task, model=None, tokenizer=None, framework=None, device=-1)
    - task: the type of task to run (e.g., question-answering, text-generation, sentiment-analysis, summarization, translation, zero-shot-classification, ner)
    - model: name or path of the pre-trained model to use; if omitted, a default model for the task is loaded (e.g., "bert-base-cased", "gpt2", "deepset/bert-base-cased-squad2")
    - tokenizer: the tokenizer to use; defaults to the one that matches the model
    - framework: the deep learning framework to use: 'pt' (PyTorch) or 'tf' (TensorFlow)
    - device: the device to run on (-1: CPU (default), 0: GPU)
  - model.generate()
    - Use when text must be generated: text generation, summarization, translation, task-specific generation (data transformation)
    - input_ids: the token sequence fed to the model.
    - max_length: the maximum length of the generated sequence (for decoder-only models such as GPT-2, this count includes the prompt tokens).
    - Sampling parameters:
      - temperature: controls the diversity of the generated text (higher values produce more varied, "creative" text)
      - top_k: consider only the k most probable tokens.
      - top_p: sort tokens by probability in descending order, accumulate the probabilities, and consider only the smallest set of top tokens whose cumulative probability reaches p.
    - A low-level API that gives fine-grained control over how the model generates text
      - pipeline("text-generation"), by contrast, is a high-level API that applies sensible presets, which makes it well suited to quick prototyping.
  - model(): use when no text generation is needed
    - Text classification: uses models such as BERT or DistilBERT
    - Question answering: uses models such as BERT
    - Named entity recognition (NER): uses models such as BERT
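
To make the sampling parameters above more concrete, here is a minimal pure-Python sketch of how temperature, top_k, and top_p narrow down the candidate tokens. The helper names (`softmax`, `top_k_top_p_filter`) and the toy logits are illustrative assumptions, not part of the Transformers API; the library's own implementation works on logit tensors and differs in detail.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Return the token indices kept after top-k, then top-p, filtering (simplified)."""
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k most probable tokens
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # smallest set whose cumulative mass reaches p
            break
    return kept

logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a hypothetical 4-token vocabulary
probs = softmax(logits, temperature=0.7)
kept = top_k_top_p_filter(probs, top_k=3, top_p=0.9)
print(kept)
```

The next token would then be sampled only from the indices in `kept`, after renormalizing their probabilities.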

# High-level API: pipeline()

## Text generation

In [None]:
from transformers import pipeline

# Initialize a text-generation pipeline
generator = pipeline("text-generation", model="gpt2")

# Generate text
output = generator("Once upon a time", max_length=50, num_return_sequences=1)
print(output[0]["generated_text"])


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time the story suddenly went on a tear. The story was about someone who had done great things and now had to live with how they could do another great thing in his life… But in terms of actual physicality that is completely understandable


## Sentiment analysis

In [None]:
from transformers import pipeline

# Initialize a sentiment-analysis pipeline
classifier = pipeline("sentiment-analysis")

# Analyze sentiment
result = classifier("Hugging Face makes working with Transformers easy!")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9969885945320129}]


## Text summarization

In [None]:
from transformers import pipeline

# Initialize a summarization pipeline
summarizer = pipeline("summarization")

# Summarize text
text = """
The Transformers library by Hugging Face provides pre-trained models for NLP tasks.
It allows easy use of models for tasks like text generation, summarization, translation, etc.
"""
summary = summarizer(text, max_length=25, min_length=20, do_sample=False)

summary[0]["summary_text"]

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


' The Transformers library by Hugging Face provides pre-trained models for NLP tasks . It allows easy use of'

## Translation

In [None]:
from transformers import pipeline

# Initialize a translation pipeline
translator = pipeline("translation_en_to_ko", model="facebook/m2m100_418M")

# Translate text
result = translator("Hugging Face is an amazing library!")
print(result[0]["translation_text"])


Device set to use cpu


Hugging Face는 놀라운 도서관입니다!


## NER (Named Entity Recognition)

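- Named entity recognition (NER): a powerful tool for turning unstructured text into structured data
  - A subtask of natural language processing (NLP) that identifies specific entities mentioned in text and classifies them into predefined categories, which may include:
    - Person names (PER): e.g., "Albert Einstein"
    - Locations (LOC): e.g., "Paris", "Mount Everest"
    - Organizations (ORG): e.g., "Google", "United Nations"
    - Dates: e.g., "January 1, 2024"
    - Times: e.g., "2:30 PM"
    - Monetary values: e.g., "$100"
  - Why NER matters
    - Information extraction: automatically extracts key details from large text corpora
    - Search and retrieval: indexing entities enables better search algorithms
    - Content tagging: organizes content based on the identified entities
  - Real-world applications:
    - News categorization (e.g., linking names to events)
    - Customer-support chatbots (e.g., identifying names or account numbers provided by users)
    - Legal document processing (e.g., extracting clauses, dates, and party names)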

In [None]:
from transformers import pipeline

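# Initialize an NER pipeline; aggregation_strategy="simple" merges sub-word tokens
# into whole entities (it replaces the deprecated grouped_entities=True)
ner = pipeline("ner", aggregation_strategy="simple")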

# Perform NER
text = "Barack Obama was born in Honolulu, Hawaii, and served as President of the United States."
entities = ner(text)

print(entities)


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


[{'entity_group': 'PER',
  'score': 0.99929297,
  'word': 'Barack Obama',
  'start': 0,
  'end': 12},
 {'entity_group': 'LOC',
  'score': 0.99826163,
  'word': 'Honolulu',
  'start': 25,
  'end': 33},
 {'entity_group': 'LOC',
  'score': 0.9994584,
  'word': 'Hawaii',
  'start': 35,
  'end': 41},
 {'entity_group': 'LOC',
  'score': 0.9980311,
  'word': 'United States',
  'start': 74,
  'end': 87}]
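
The pipeline returns a flat list of dictionaries, so a small post-processing step is often useful. The sketch below groups the entities shown above by label and recovers each entity from the original text through its `start`/`end` character offsets. The `group_entities` helper and its `min_score` threshold are hypothetical names introduced here for illustration, and the scores are rounded copies of the output above.

```python
from collections import defaultdict

text = "Barack Obama was born in Honolulu, Hawaii, and served as President of the United States."

# Entities copied from the pipeline output above (scores rounded for brevity).
entities = [
    {"entity_group": "PER", "score": 0.9993, "word": "Barack Obama", "start": 0, "end": 12},
    {"entity_group": "LOC", "score": 0.9983, "word": "Honolulu", "start": 25, "end": 33},
    {"entity_group": "LOC", "score": 0.9995, "word": "Hawaii", "start": 35, "end": 41},
    {"entity_group": "LOC", "score": 0.9980, "word": "United States", "start": 74, "end": 87},
]

def group_entities(text, entities, min_score=0.9):
    """Group entity strings by label, keeping only confident predictions."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            # start/end are character offsets into the original text.
            grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)

by_label = group_entities(text, entities)
print(by_label)
```

Slicing the original text by offsets, rather than reading `word`, preserves the exact casing and spacing of the source string.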

# Low-level API
- model.generate() or model()
- When text must be generated: model.generate()
  - Text generation, summarization, translation, task-specific generation (data transformation)
- When no text generation is needed: call model() directly (e.g., BERT)
  - Text classification, question answering, named entity recognition (NER)

## Text generation

In [None]:
from transformers import pipeline
generator = pipeline("text-generation", model="gpt2")
result = generator("The future of AI is", max_length=50)

print(result)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "The future of AI is an endless battle. It's one that's always going to be interesting to watch.\n\nWill it be useful? I doubt it. Maybe. But that shouldn't make it an easy fight.\n\nIs it the"}]


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer.encode("The future of AI is", return_tensors="pt")
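# Note: temperature, top_k, and top_p only take effect when do_sample=True.
# Without it, this call runs deterministic beam search (num_beams=5).
output = model.generate(
    input_ids,
    max_length=50,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    num_beams=5,
    repetition_penalty=1.2,
)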

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The future of AI is in the hands of the next generation of scientists and engineers.

This article was originally published on The Conversation. Read the original article.

Join the conversation See the latest news and share your comments with CNN Health on


## Seq2Seq models
- Well suited to tasks that transform input text into another form (summarization, translation, etc.)

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

input_text = "summarize: The Hugging Face Transformer library is a powerful \
and widely-used open-source library for natural language processing (NLP) and \
other machine learning tasks. It provides a seamless interface to utilize \
state-of-the-art Transformer-based models like BERT, GPT, T5, RoBERTa, \
and more. Hugging Face aims to simplify the use of these advanced models by \
offering easy-to-use APIs for both beginners and experts in machine learning."

input_ids = tokenizer.encode(input_text, return_tensors="pt")

output = model.generate(
    input_ids,
    max_length=20,
    num_beams=4,
    length_penalty=2.0)

print(tokenizer.decode(output[0], skip_special_tokens=True))


the Hugging Face Transformer library is a powerful and widely-used open-source library for
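
The `length_penalty` argument used above influences which beam wins. As a rough sketch, beam search can be thought of as ranking each candidate by its summed token log-probabilities divided by `length ** length_penalty`. The `beam_score` helper and the toy log-probabilities below are assumptions for illustration; the library's scorer may differ in detail, for example in which tokens count toward the length.

```python
def beam_score(token_logprobs, length_penalty=1.0):
    """Length-normalized score: sum of log-probs divided by length ** length_penalty."""
    return sum(token_logprobs) / (len(token_logprobs) ** length_penalty)

short = [-0.5, -0.5]   # toy 2-token candidate, total log-prob -1.0
long = [-0.4] * 5      # toy 5-token candidate, total log-prob -2.0

for penalty in (0.0, 1.0, 2.0):
    winner = "long" if beam_score(long, penalty) > beam_score(short, penalty) else "short"
    print(f"length_penalty={penalty}: {winner} candidate wins")
```

Because log-probabilities are negative, dividing by a larger power of the length pulls long candidates closer to zero, so higher values of `length_penalty` favor longer outputs.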


## Question Answering

In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

context = "Hugging Face is creating tools for NLP."
question = "What is Hugging Face creating?"

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)
answer_start = outputs.start_logits.argmax()
answer_end = outputs.end_logits.argmax()

answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(inputs.input_ids[0][answer_start:answer_end + 1])
)
print(answer)



Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tools


In [None]:
answer

'tools'
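
The cell above picks the start and end positions independently with `argmax`, which can yield an end index that comes before the start. A slightly more robust approach scores every valid (start, end) pair together. The sketch below uses toy tokens and toy logits rather than real model output, and `best_span` and `max_answer_len` are hypothetical names; a real implementation would also restrict the search to context tokens.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return the (start, end) pair with the highest combined score, with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy example: independent argmax would give start=4, end=3 (an empty span).
tokens = ["hugging", "face", "is", "creating", "tools", "for", "nlp", "."]
start_logits = [0.1, 0.0, 0.2, 0.3, 4.0, 0.5, 0.2, 0.0]
end_logits = [0.0, 0.1, 0.0, 4.5, 4.2, 0.3, 1.0, 0.1]

start, end = best_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]))
```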

## Sentiment analysis

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("I love Hugging Face!", return_tensors="pt")
outputs = model(**inputs)
probabilities = outputs.logits.softmax(dim=-1)
sentiment = "Positive" if probabilities[0][1] > probabilities[0][0] else "Negative"

print(sentiment)


Positive
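
The cell above hard-codes index 1 as the positive class. A more general pattern is to look up the winning index in a label mapping; in Transformers such a mapping is available as `model.config.id2label`. The pure-Python sketch below illustrates the idea with toy logits. The `classify` helper and the literal `id2label` dictionary are assumptions that mirror the index order used above.

```python
import math

def classify(logits, id2label):
    """Apply softmax to the logits and return the best label with its probability."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Assumed mapping, matching the index order used in the cell above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

label, score = classify([-4.2, 4.5], id2label)  # toy logits, not real model output
print(label, round(score, 4))
```

Reading the mapping from the model configuration avoids silent errors when a checkpoint orders its labels differently.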


# Exercise