In [1]:
from IPython.display import display, HTML
display(HTML("""
<style>
div.container{width:90% !important;}
div.cell.code_cell.rendered{width:100%;}
div.input_prompt{padding:0px;}
div.CodeMirror {font-family:Consolas; font-size:12pt;}
div.text_cell_render.rendered_html{font-size:12pt;}
div.output {font-size:12pt; font-weight:bold;}
div.input{font-family:Consolas; font-size:12pt;}
div.prompt {min-width:70px;}
div#toc-wrapper {padding-top:120px;}
div.text_cell_render ul li{font-size:12pt;padding:5px;}
table.dataframe {font-size:12px;}
</style>
"""))

<font size='6' color='red'><b>ch1 Hugging Face</b></font>
- Inference API: the model runs on a server, which returns the result
- pipeline(): the model is downloaded and run locally
    - raw text -> tokenizer -> model -> output (logits: one raw score per label) -> probabilities -> prediction
    
```
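Tasks supported by Hugging Face transformers
"sentiment-analysis" : alias of "text-classification" (used for sentiment analysis)
"text-classification" : general sentence classification such as sentiment analysis, news classification, and review classification
"zero-shot-classification" : classifies text into one of the given candidate labels without training on those labels
"token-classification" : token-level labeling such as named entity recognition (NER)
"ner" : alias of "token-classification"
"text-generation" : text generation (used with GPT-style models)
"text2text-generation" : input -> output transformations such as translation and summarization
"translation" : translation
"summarization" : text summarization
"image-to-text" : image captioning
"image-classification" : image classification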
```
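
The `raw text -> tokenizer -> model -> logits -> prediction` flow listed above ends by turning logits into a label and a score. Below is a minimal sketch of that last step only, assuming a two-label sentiment model. The logit values and the `id2label` mapping are made up for illustration and do not come from a real model run.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values: index 0 = NEGATIVE, index 1 = POSITIVE.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-1.5, 1.7]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
prediction = {"label": id2label[best], "score": probs[best]}
print(prediction)
```

The resulting dictionary has the same `label`/`score` shape as the pipeline outputs shown later in this notebook.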

In [1]:
import warnings
import os
import logging
# Suppress Python warnings
warnings.filterwarnings('ignore')

# Raise the transformers logging level so that only errors are shown
logging.getLogger("transformers").setLevel(logging.ERROR)

# Suppress the Hugging Face symlink warning
os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = '1'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# from transformers import pipeline, logging as hf_logging
# hf_logging.set_verbosity_error()


# 1. Text-based sentiment analysis (positive/negative)
- Downloaded models are cached under C:/Users/(user name)/.cache/huggingface/hub/(model name)
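
A rough sketch of how that cache path is put together. The helper names `hub_cache_dir` and `model_folder` are hypothetical, and the rules are assumptions based on the commonly documented behaviour: `HF_HUB_CACHE` or `HF_HOME` override the default `~/.cache/huggingface` location, and each model gets a `models--org--name` folder. The real library handles more cases than this.

```python
import os
from pathlib import Path

def hub_cache_dir(env=None):
    """Best-guess location of the Hugging Face hub cache (assumed rules)."""
    env = os.environ if env is None else env
    if "HF_HUB_CACHE" in env:
        return Path(env["HF_HUB_CACHE"])
    hf_home = env.get("HF_HOME") or Path.home() / ".cache" / "huggingface"
    return Path(hf_home) / "hub"

def model_folder(repo_id):
    """Folder name assumed for a model repo id such as 'org/name'."""
    return "models--" + repo_id.replace("/", "--")

repo_id = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
print(hub_cache_dir() / model_folder(repo_id))
```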

In [17]:
from transformers import pipeline
classifier = pipeline(task="sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9598049521446228}]

In [18]:
classifier = pipeline(task="text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
classifier(["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"])

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [5]:
# The first input is the Korean version of the English sentence that follows it
classifier(["이 영화 정말 최고였어요. 감동적이고 연기가 대단해",
            "This was the best movie. It was touching and the acting is amazing"])

[{'label': 'POSITIVE', 'score': 0.857815682888031},
 {'label': 'POSITIVE', 'score': 0.9998776912689209}]

In [6]:
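# The last two inputs are Korean: "나 너 싫어" means "I hate you", "힘들어요" means "I'm having a hard time"
classifier(["I like you", "I hate you", "나 너 싫어", "힘들어요"])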

[{'label': 'POSITIVE', 'score': 0.9998695850372314},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079},
 {'label': 'NEGATIVE', 'score': 0.5795602202415466},
 {'label': 'POSITIVE', 'score': 0.8669533729553223}]
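
The scores above hint at how poorly an English-only model handles Korean input. One simple way to surface doubtful predictions is to flag those below a confidence cut-off. The sketch below reuses the inputs and outputs printed above; the `THRESHOLD` value and the `low_confidence` helper are hypothetical and chosen only for illustration.

```python
# Inputs and outputs copied from the cell above.
texts = ["I like you", "I hate you", "나 너 싫어", "힘들어요"]
predictions = [
    {'label': 'POSITIVE', 'score': 0.9998695850372314},
    {'label': 'NEGATIVE', 'score': 0.9991129040718079},
    {'label': 'NEGATIVE', 'score': 0.5795602202415466},
    {'label': 'POSITIVE', 'score': 0.8669533729553223},
]

THRESHOLD = 0.6  # hypothetical cut-off, not a recommended value

def low_confidence(texts, predictions, threshold=THRESHOLD):
    """Return the inputs whose top score falls below the threshold."""
    return [t for t, p in zip(texts, predictions) if p['score'] < threshold]

uncertain = low_confidence(texts, predictions)
print(uncertain)
```

A high score does not guarantee a correct label: "힘들어요" is labeled POSITIVE with a score of about 0.87 and is not flagged, which is one reason to switch to a Korean sentiment model.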

In [19]:
from transformers import pipeline
classifier = pipeline(task="sentiment-analysis",
                     model="matthewburke/korean_sentiment")
# Korean inputs: "I like you", "I hate you", "I'm having a hard time", "I feel great today"
texts = ['나는 너가 좋아', "당신이 싫어요", "힘들어요", "오늘 기분이 최고야"]
result = classifier(texts)

Device set to use cpu


In [11]:
for text, pred in zip(texts, classifier(texts)):
    # LABEL_1 is treated as positive, anything else as negative
    label = "positive" if pred['label']=='LABEL_1' else "negative"
    print(f"{text} => {label} : {pred['score']:.4f}")

나는 너가 좋아 => positive : 0.9558
당신이 싫어요 => negative : 0.9093
힘들어요 => negative : 0.9140
오늘 기분이 최고야 => positive : 0.9714


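# 2. Zero-shot classification
- In machine learning and NLP, a model that can perform a task without any training specific to that task. The candidate labels are supplied at inference time, and the model has never been trained on them. This is not unsupervised learning: the model itself is trained with supervision (the default model is fine-tuned on the MNLI natural language inference dataset), and only the target labels are unseen.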

In [20]:
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
classifier(
    "I have a problem with my iphone that needs to be resolved asap!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"]
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'sequence': 'I have a problem with my iphone that needs to be resolved asap!',
 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'],
 'scores': [0.5227580070495605,
  0.45814019441604614,
  0.0142647260800004,
  0.0026850001886487007,
  0.002152054337784648]}
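
The result dictionary keeps `labels` and `scores` in matching order, sorted by score. The small sketch below pairs them up using the values printed above. The scores sum to about 1, which is consistent with the pipeline's default single-label mode, where the candidate labels compete with each other.

```python
# Values copied from the output above.
result = {
    'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'],
    'scores': [0.5227580070495605, 0.45814019441604614, 0.0142647260800004,
               0.0026850001886487007, 0.002152054337784648],
}

ranked = list(zip(result['labels'], result['scores']))
for label, score in ranked:
    print(f"{label:<12}{score:.4f}")

total = sum(result['scores'])
top_label, top_score = ranked[0]
print("top label:", top_label)
```

When several labels can apply at once (here both "urgent" and "phone" arguably do), the pipeline accepts `multi_label=True`, which scores each label independently instead of making them compete.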

In [16]:
sequence_to_classify = "One day I well see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)

{'sequence': 'One day I well see the world',
 'labels': ['travel', 'cooking', 'dancing'],
 'scores': [0.9938077926635742, 0.003099897177889943, 0.003092351369559765]}
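
A rough sketch of the idea behind NLI-based zero-shot classification, which is how a model such as `facebook/bart-large-mnli` can rank labels it was never trained on. Each candidate label is inserted into a hypothesis sentence, the model scores whether the input entails that hypothesis, and the entailment scores are normalized across labels. The template string is assumed to match the pipeline's default, and the entailment logits are invented for illustration rather than taken from the model.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into a hypothesis sentence."""
    return [template.format(label) for label in labels]

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

sequence = "One day I well see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
hypotheses = build_hypotheses(candidate_labels)

# Made-up entailment logits, one per (sequence, hypothesis) pair.
entailment_logits = [4.0, -1.8, -1.8]
scores = softmax(entailment_logits)

ranking = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
for hypothesis in hypotheses:
    print(hypothesis)
print(ranking[0][0])
```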

# 3. Text generation

In [21]:
from transformers import pipeline
generation = pipeline("text-generation", "gpt2") # text generation; GPT-3 weights were never published on Hugging Face, so GPT-2 is used
generation(
    "in this cotokenizer. We will teach you how to",
    pad_token_id=generation.tokenizer.eos_token_id
) # pad_token_id is set to suppress the pad_token_id warning

Device set to use cpu


[{'generated_text': 'in this cotokenizer. We will teach you how to create a cotokenizer as well as to make your own. First you need to make a new cotokenizer. Choose your favorite cotokenizer, or you can choose from our 5 favorite cotokenizers and place them in our category. Then you can put a text in your cotokenizer and use that text to create a new cotokenizer. If you want to add a new cotokenizer and place it next to your favorite cotokenizer, choose the option "Add to your category" and add the text you want and then choose "Copy to folder".\n\nThe cotokenizer will have a folder in the background with your favorite text and a folder in the center that will contain your favorite cotokenizer.\n\nMake the cotokenizer.\n\nStep-by-step instructions for creating and using this cotokenizer\n\n1. In the cotokenizer, copy and paste the text you want to create and paste the text into your text editor.\n\n2. In the text editor, edit the lines.\n\n3. In the text editor, copy and paste the tex

In [14]:
result = generation(
    "in this course. We will teach you how to",
    pad_token_id=generation.tokenizer.eos_token_id
)
print(result[0]['generated_text'])

in this course. We will teach you how to use an HTML5/CSS3 library to create your own web applications.

The course is also available as an online course.

Course Features

Course Overview

The course is designed to teach you the basics of Web application development. This course is designed to teach you how to write HTML5 applications in Go.

Learning Objective-C for Web Applications

This course provides an easy way for you to learn Objective-C in Go, while still learning Objective-C for Web Applications. We will cover:

HTML5 development (using the Go library)

The concepts and tools of the library

The Go documentation

How to create a web application in Go

What the library does

The course will demonstrate the basic concepts and tools of the library, while also providing you with a demonstration of the underlying Go implementation.

Note: Students must be familiar with Go. It is recommended that students have access to Go 4.2 or below.

What the Go Library Does

The library creat

In [22]:
generation = pipeline("text-generation", "skt/kogpt2-base-v2")
result = generation(
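    "이 과정은 다음과 같은 방법을 알려드려요", # Korean prompt: "This course teaches you the following methods"
    pad_token_id=generation.tokenizer.eos_token_id,
    max_new_tokens = 100, # maximum number of new tokens to generate
    num_return_sequences=1, # number of sequences to generate
    do_sample=True, # sample from the distribution instead of always taking the most likely token
    top_k=50, # top-k sampling (only the 50 most probable tokens are candidates)
    top_p=0.95, # top-p sampling (candidates are the most probable tokens whose cumulative probability reaches 95%)
    temperature=1.2, # controls randomness (lower is more conservative)
    no_repeat_ngram_size=2 # prevents any 2-gram from repeating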
)
print(result[0]['generated_text'])

Device set to use cpu


이 과정은 다음과 같은 방법을 알려드려요~
1. 모든 사람이 동의한다.
2. 이 절차를 통해 얻은 정보가 얼마나 유용할까?
3. 그 정보는 어떤 방식으로 입수하고 어디에서 무엇을 얻었는지 구체적으로 설명되지 않는다.
4. 다른 사람의 아이디어는 어느 날 어떤 곳에서 입수했고 어떤 분야에서 더 많은 이익을 얻고 더 많이 얻은 것이 있을까?
5. 모든 사람들에게 얼마나 많은 것을 얻기 위해서 얼마나 노력했는지를 알려준다.
6. 우리는 다른 사람과 똑같은 정보를 입수했는지 궁금해지는가?
7. 그들은 자신의 아이디어와 관련한 정보를 습득해 왔는지?
8. 다른 사람들의 아이디어, 특히 사람들이
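
The `temperature`, `top_k`, and `top_p` arguments used above all reshape the next-token distribution before a token is sampled. The sketch below shows the idea on a toy vocabulary. It is a simplified illustration, not the library's implementation, and the vocabulary, logits, and helper names are all made up.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax with temperature: higher values flatten the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most probable tokens, then the shortest prefix of them
    whose cumulative (renormalized) probability reaches top_p.
    Returns {token_index: renormalized probability}."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}

vocab = ["the", "a", "cat", "dog", "zebra"]  # toy vocabulary
logits = [2.0, 1.5, 1.0, 0.5, -1.0]          # made-up next-token logits

probs = softmax(logits, temperature=1.0)
candidates = filter_top_k_top_p(probs, top_k=4, top_p=0.8)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
next_index = rng.choices(list(candidates), weights=list(candidates.values()))[0]
print([vocab[i] for i in candidates], "->", vocab[next_index])
```

With these toy numbers, `top_k=4` drops "zebra" and `top_p=0.8` then drops "dog", so sampling happens over the three remaining tokens only.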


# 4. Fill-mask (fill in the blank)

In [3]:
from transformers import pipeline
unmasker = pipeline(task='fill-mask', model='distilbert/distilroberta-base') # fill-mask
unmasker("I'm going to hospital and meet a <mask>", top_k=2) # top_k defaults to 5

Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


[{'score': 0.19275707006454468,
  'token': 3299,
  'token_str': ' doctor',
  'sequence': "I'm going to hospital and meet a doctor"},
 {'score': 0.06794589757919312,
  'token': 27321,
  'token_str': ' psychiatrist',
  'sequence': "I'm going to hospital and meet a psychiatrist"}]

In [6]:
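# Korean input meaning "Hello! I am a <mask> model"; this English model only proposes whitespace or punctuation
unmasker('안녕하세요! 나는 <mask>모델이에요', top_k=3)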

[{'score': 0.2594446539878845,
  'token': 1437,
  'token_str': ' ',
  'sequence': '안녕하세요! 나는 모델이에요'},
 {'score': 0.14142775535583496,
  'token': 12,
  'token_str': '-',
  'sequence': '안녕하세요! 나는-모델이에요'},
 {'score': 0.09121906757354736,
  'token': 34437,
  'token_str': '~',
  'sequence': '안녕하세요! 나는~모델이에요'}]

In [10]:
unmasker = pipeline(task='fill-mask', model='google-bert/bert-base-uncased')
unmasker("Hello, I'm a [MASK] model.")

Some weights of the model checkpoint at google-bert/bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.1441437155008316,
  'token': 2535,
  'token_str': 'role',
  'sequence': "hello, i ' m a role model."},
 {'score': 0.14175789058208466,
  'token': 4827,
  'token_str': 'fashion',
  'sequence': "hello, i ' m a fashion model."},
 {'score': 0.062214579433202744,
  'token': 2047,
  'token_str': 'new',
  'sequence': "hello, i ' m a new model."},
 {'score': 0.041028350591659546,
  'token': 3565,
  'token_str': 'super',
  'sequence': "hello, i ' m a super model."},
 {'score': 0.025911200791597366,
  'token': 2449,
  'token_str': 'business',
  'sequence': "hello, i ' m a business model."}]
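
The two models above expect different mask tokens (`<mask>` for the RoBERTa-based model, `[MASK]` for BERT), so a prompt written for one will not work with the other. The sketch below builds the prompt from a placeholder instead of hard-coding the token. The `MASK_TOKENS` table and `make_prompt` helper are hypothetical; in a live notebook the token can be read from `unmasker.tokenizer.mask_token` instead.

```python
# Mask tokens as used in the cells above.
MASK_TOKENS = {
    "distilbert/distilroberta-base": "<mask>",
    "google-bert/bert-base-uncased": "[MASK]",
}

def make_prompt(template, model_name):
    """Fill the '{mask}' placeholder with the given model's mask token."""
    return template.format(mask=MASK_TOKENS[model_name])

template = "Hello, I'm a {mask} model."
prompts = {name: make_prompt(template, name) for name in MASK_TOKENS}
for name, prompt in prompts.items():
    print(f"{name}: {prompt}")
```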

## ※ Using the Inference API

In [13]:
from dotenv import load_dotenv
import os
load_dotenv()
# os.environ['HF_TOKEN']

True
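
Once `load_dotenv()` has loaded the `.env` file, the access token is available through the environment. The sketch below reads it defensively and builds a bearer-token header, assuming the token is stored under the `HF_TOKEN` key as in the commented line above. The `auth_headers` helper and the placeholder token value are hypothetical, and no request is sent.

```python
import os

def auth_headers(env=None):
    """Build an Authorization header from HF_TOKEN, failing loudly if it is missing."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to .env or the environment")
    return {"Authorization": f"Bearer {token}"}

# Demonstration with a placeholder value rather than a real token.
headers = auth_headers({"HF_TOKEN": "hf_example_token"})
print(list(headers))
```

Printing only the header names, not the values, keeps the token out of the notebook output.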