# What can you do with Hugging Face?

https://huggingface.co/learn/nlp-course/ko/chapter1/3?fw=pt

* feature-extraction: feature extraction (extracts a vector representation of the text)
* fill-mask: mask filling
* ner: named entity recognition
* question-answering: question answering
* sentiment-analysis: sentiment analysis
* summarization: summarization
* text-generation: text generation
* translation: translation
* zero-shot-classification: zero-shot classification

In [1]:
!pip install datasets evaluate transformers[sentencepiece]


In [2]:
from transformers import pipeline

## Zero-shot classification

In [3]:
classifier = pipeline("zero-shot-classification")
classifier(
    "We will learn about deep learning algorithm.",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'We will learn about deep learning algorithm.',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.6590580344200134, 0.2252979725599289, 0.1156439483165741]}

In [4]:
# Korean input, roughly: "In this session, we will learn the basic principles of deep learning."
classifier(
    "우리는 이번 시간에 딥러닝의 기본 원리에 대해 배워보겠습니다.",
    candidate_labels=["education", "politics", "business", "health"],
)

{'sequence': '우리는 이번 시간에 딥러닝의 기본 원리에 대해 배워보겠습니다.',
 'labels': ['health', 'education', 'business', 'politics'],
 'scores': [0.4566043019294739,
  0.24176481366157532,
  0.20801809430122375,
  0.09361273795366287]}
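
The result dictionaries above list `labels` in descending order of score, with `scores` in the same order. The sketch below reads them back using the two printed results as literal data, so no model is loaded. The helper name `top_label` is hypothetical and not part of transformers.

```python
# Results copied from the two zero-shot calls above (no model is loaded here).
english_result = {
    "sequence": "We will learn about deep learning algorithm.",
    "labels": ["education", "business", "politics"],
    "scores": [0.6590580344200134, 0.2252979725599289, 0.1156439483165741],
}
korean_result = {
    "labels": ["health", "education", "business", "politics"],
    "scores": [0.4566043019294739, 0.24176481366157532,
               0.20801809430122375, 0.09361273795366287],
}


def top_label(result):
    """Return the (label, score) pair with the highest score (hypothetical helper)."""
    return max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])


print(top_label(english_result))
print(top_label(korean_result))
# The scores across the candidate labels add up to roughly 1.
print(round(sum(english_result["scores"]), 4))
```

For the Korean sentence about deep learning, the top label is `health` rather than `education`. This suggests that the default checkpoint, which was trained on English data, should not be relied on for Korean input.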

## Sentiment analysis

In [5]:
classifier = pipeline("sentiment-analysis")
classifier("I have been waiting for you.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.996815025806427}]

In [6]:
classifier(
    [
        "I hate this!",
        "I like it",
        "I love you",
        "행복하다",  # Korean: "I am happy"
        "슬프다",  # Korean: "I am sad"
        "힘들다",  # Korean: "I am having a hard time"
    ]
)

[{'label': 'NEGATIVE', 'score': 0.9995765089988708},
 {'label': 'POSITIVE', 'score': 0.9998593330383301},
 {'label': 'POSITIVE', 'score': 0.9998656511306763},
 {'label': 'POSITIVE', 'score': 0.7440058588981628},
 {'label': 'POSITIVE', 'score': 0.7086785435676575},
 {'label': 'POSITIVE', 'score': 0.7637549638748169}]

In [7]:
classifier("이 영화는 너무 재미있었다.")  # Korean: "This movie was really fun."

[{'label': 'POSITIVE', 'score': 0.7929632663726807}]

In [8]:
classifier("행복한 기억이었다.")  # Korean: "It was a happy memory."

[{'label': 'POSITIVE', 'score': 0.7657269835472107}]

In [9]:
classifier("너무 피곤하고 힘들다")  # Korean: "I am so tired and worn out."

[{'label': 'POSITIVE', 'score': 0.6452802419662476}]
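
The English inputs above score above 0.99, while every Korean input is labelled `POSITIVE` with a much lower score, including the sentences that mean "sad" and "tired". One way to surface this is to flag low-confidence predictions. The sketch below uses the batch output above as literal data; the `0.9` threshold and the helper name `flag_uncertain` are assumptions chosen for illustration.

```python
# Inputs paired with the predictions printed for the batch call above.
predictions = [
    ("I hate this!", {"label": "NEGATIVE", "score": 0.9995765089988708}),
    ("I like it", {"label": "POSITIVE", "score": 0.9998593330383301}),
    ("I love you", {"label": "POSITIVE", "score": 0.9998656511306763}),
    ("행복하다", {"label": "POSITIVE", "score": 0.7440058588981628}),
    ("슬프다", {"label": "POSITIVE", "score": 0.7086785435676575}),
    ("힘들다", {"label": "POSITIVE", "score": 0.7637549638748169}),
]

CONFIDENCE_THRESHOLD = 0.9  # arbitrary cut-off, chosen only for this illustration


def flag_uncertain(pairs, threshold=CONFIDENCE_THRESHOLD):
    """Return the input texts whose prediction score is below the threshold."""
    return [text for text, prediction in pairs if prediction["score"] < threshold]


uncertain = flag_uncertain(predictions)
print(uncertain)
```

With this threshold, only the three Korean inputs are flagged. A low score is a hint rather than a guarantee, so a model trained on Korean text is likely a better choice for these sentences.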

## Text generation

In [10]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will learn about how to deep learning advanced")

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "In this course, we will learn about how to deep learning advanced concepts in your application. If you want to learn about learning to extend a technique, that's really really great, but it's really about not focusing on the goal as much. It"}]

In [11]:
generator("lunch menu?")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "lunch menu?\n\nYeah, that's pretty much the most popular type of food. A lot of people like bacon. A lot of people like lettuce. We have sandwiches out of salads. We have all kinds of sandwiches out of sandwiches."}]

In [12]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to build your own computer system.\n\n\n\nOur goal with this course is to set your goals'},
 {'generated_text': 'In this course, we will teach you how to identify the key features of your training sessions to make this an effective solution.\n\nIn your course'}]
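
Each `generated_text` above starts with the prompt itself, and `num_return_sequences=2` yields two dictionaries. The sketch below separates the newly generated continuation from the prompt, using the printed output as literal data. The helper name `continuation` is hypothetical.

```python
prompt = "In this course, we will teach you how to"

# Output copied from the distilgpt2 call above.
outputs = [
    {"generated_text": "In this course, we will teach you how to build your own "
                       "computer system.\n\n\n\nOur goal with this course is to set your goals"},
    {"generated_text": "In this course, we will teach you how to identify the key "
                       "features of your training sessions to make this an effective "
                       "solution.\n\nIn your course"},
]


def continuation(prompt_text, generated_text):
    """Return only the text that follows the prompt (hypothetical helper)."""
    if generated_text.startswith(prompt_text):
        return generated_text[len(prompt_text):]
    return generated_text


for output in outputs:
    print(repr(continuation(prompt, output["generated_text"])))
```

The text-generation pipeline also accepts `return_full_text=False`, which should give a similar result without post-processing.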

## Fill-mask

In [13]:
mask_fill = pipeline("fill-mask")
mask_fill("This course will teach you about <mask> theorem.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.3432292938232422,
  'token': 42,
  'token_str': ' this',
  'sequence': 'This course will teach you about this theorem.'},
 {'score': 0.09307415038347244,
  'token': 5,
  'token_str': ' the',
  'sequence': 'This course will teach you about the theorem.'}]

In [14]:
mask_fill("I eat bread with <mask>.", top_k=2)

[{'score': 0.057292189449071884,
  'token': 9050,
  'token_str': ' butter',
  'sequence': 'I eat bread with butter.'},
 {'score': 0.029135676100850105,
  'token': 7666,
  'token_str': ' rice',
  'sequence': 'I eat bread with rice.'}]

### klue/bert-base

In [15]:
mask_fill_ko = pipeline("fill-mask", model="klue/bert-base")  # BERT model pretrained on Korean text
# Korean input, roughly: "Koreans' favorite food is [MASK]."
mask_fill_ko("한국인이 가장 좋아하는 음식은 [MASK] 입니다.", top_k=2)

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.11091504991054535,
  'token': 15764,
  'token_str': '비빔밥',
  'sequence': '한국인이 가장 좋아하는 음식은 비빔밥 입니다.'},
 {'score': 0.11054597795009613,
  'token': 6260,
  'token_str': '김치',
  'sequence': '한국인이 가장 좋아하는 음식은 김치 입니다.'}]

In [16]:
mask_fill_ko = pipeline("fill-mask", model="klue/bert-base")
# Korean input, roughly: "The country Koreans like is [MASK]."
mask_fill_ko("한국인이 좋아하는 국가는 [MASK] 이다.", top_k=2)

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.192066952586174,
  'token': 3666,
  'token_str': '미국',
  'sequence': '한국인이 좋아하는 국가는 미국 이다.'},
 {'score': 0.15815035998821259,
  'token': 3708,
  'token_str': '일본',
  'sequence': '한국인이 좋아하는 국가는 일본 이다.'}]
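
Each fill-mask prediction above carries a `score`, a `token` id, a `token_str`, and the completed `sequence`. The sketch below reduces the printed predictions to `(word, score)` pairs, using them as literal data. Note the leading space in the distilroberta tokens such as `' butter'`, which is stripped here. The helper name `candidates` is hypothetical.

```python
# Predictions copied from the distilroberta-base and klue/bert-base calls above.
roberta_predictions = [
    {"score": 0.057292189449071884, "token": 9050, "token_str": " butter",
     "sequence": "I eat bread with butter."},
    {"score": 0.029135676100850105, "token": 7666, "token_str": " rice",
     "sequence": "I eat bread with rice."},
]
klue_predictions = [
    {"score": 0.192066952586174, "token": 3666, "token_str": "미국",
     "sequence": "한국인이 좋아하는 국가는 미국 이다."},
    {"score": 0.15815035998821259, "token": 3708, "token_str": "일본",
     "sequence": "한국인이 좋아하는 국가는 일본 이다."},
]


def candidates(predictions):
    """Return (word, rounded score) pairs, stripping tokenizer whitespace."""
    return [(p["token_str"].strip(), round(p["score"], 3)) for p in predictions]


print(candidates(roberta_predictions))
print(candidates(klue_predictions))
```

The two models also use different mask tokens: `<mask>` for distilroberta-base and `[MASK]` for klue/bert-base. The expected token can be read from the pipeline's tokenizer via its `mask_token` attribute.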

## Named entity recognition

In [None]:
from transformers import pipeline

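# aggregation_strategy="simple" merges sub-word tokens into whole entities;
# it replaces the deprecated grouped_entities=True.
ner = pipeline("ner", aggregation_strategy="simple")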
ner("Alice lives in Seattle and works as a software engineer at Amazon.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
ner("My Name is Sieun Hyeon. I live in Korea.")

## Question answering

In [None]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    context="Alice lives in Seattle and works as a software engineer at Amazon.",
    question="What is Alice's job?",
)

## Summarization

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
The rapid advancement of technology has significantly transformed our daily lives.
Mobile devices like smartphones have revolutionized the way we communicate,
while social media platforms enable us to connect with people around the world.
Additionally, artificial intelligence and machine learning technologies are increasing efficiency in various industries,
and cutting-edge innovations such as autonomous vehicles are fundamentally changing our modes of transportation.
These changes create new opportunities but also present new challenges, such as concerns about privacy.
Therefore, it is essential to maximize the benefits of technology while carefully managing its potential risks.
"""
)

https://huggingface.co/gogamza/kobart-summarization

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="gogamza/kobart-summarization")
# Korean input: the same paragraph as the English example above, written in Korean.
summarizer(
    """기술의 급속한 발전은 우리의 일상 생활을 크게 변화시키고 있다. 스마트폰과 같은 모바일 기기는 우리의 소통 방식을 혁신하였으며, 소셜 미디어 플랫폼은 전 세계 사람들과의 연결을 가능하게 만들었다. 또한, 인공지능과 머신러닝 기술은 다양한 산업에서 효율성을 증대시키고 있으며, 자율주행차와 같은 첨단 기술은 우리의 이동 수단을 근본적으로 바꾸고 있다. 이러한 변화는 새로운 기회를 창출하는 동시에, 개인 정보 보호와 같은 새로운 도전 과제를 제시하고 있다. 따라서 우리는 기술의 이점을 최대한 활용하면서도, 그로 인한 잠재적 위험을 신중하게 관리해야 한다.
"""
)

## Translation

* https://huggingface.co/models

In [None]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

https://huggingface.co/Helsinki-NLP/opus-mt-ko-en

In [None]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-ko-en")
# Korean input, roughly: "Hello. After a lot of rain, the weather has become hot."
translator("""안녕하세요. 비가 많이 내리고 나서 날씨가 더워졌습니다.""")

In [None]:
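# Korean input, roughly: "Today's lecture will cover the latest deep learning techniques.
# Data preprocessing can greatly improve the performance of machine learning models.
# Understanding the various layers of a neural network helps you design models more effectively."
translator("""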
오늘의 강의에서는 딥러닝의 최신 기법들을 다룰 예정입니다.
데이터 전처리는 머신러닝 모델의 성능을 크게 향상시킬 수 있습니다.
신경망의 다양한 층을 이해하면 모델을 더 효과적으로 설계할 수 있습니다.
""")

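The last call passes three sentences as one multi-line string. Pipelines also accept a list of strings, as the sentiment-analysis batch earlier shows, so one option is to split the text into one sentence per line first. The sketch below covers only the splitting step. The helper name `split_lines` is hypothetical, and whether this improves translation quality for this model is an assumption that would need checking.

```python
text = """
오늘의 강의에서는 딥러닝의 최신 기법들을 다룰 예정입니다.
데이터 전처리는 머신러닝 모델의 성능을 크게 향상시킬 수 있습니다.
신경망의 다양한 층을 이해하면 모델을 더 효과적으로 설계할 수 있습니다.
"""


def split_lines(block):
    """Return the non-empty lines of a text block, stripped of whitespace."""
    return [line.strip() for line in block.splitlines() if line.strip()]


sentences = split_lines(text)
print(len(sentences))

# Assuming the `translator` pipeline created above is available:
# translations = translator(sentences)
```
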
# Loading a dataset

https://huggingface.co/datasets/dair-ai/emotion

In [None]:
from datasets import load_dataset

emotion_dataset = load_dataset("dair-ai/emotion")
emotion_dataset