### Hugging Face
- [Tokenizers](https://huggingface.co/docs/tokenizers/index)
- [Datasets](https://huggingface.co/docs/datasets/index)
- [Accelerate](https://huggingface.co/docs/accelerate/index)
- Open challenges
    - Language diversity
    - Data limitations
    - Long-document processing
    - Opacity (black box problem)
    - Bias (ethical issues)

### transformers pipeline - text classification

In [2]:
from transformers import pipeline

classifier = pipeline("text-classification", device="cuda")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [3]:
review_text = "Nvidia Corporation is a major American technology company, \
    best known for its graphics processing units (GPUs) that are widely used in gaming, professional graphics, \
    and high-performance computing. Founded in 1993 and headquartered in Santa Clara, \
    California, Nvidia has grown into a key player in various industries, including artificial intelligence (AI),\
     data science, automotive technology, and mobile computing."

In [4]:
import pandas as pd
output = classifier(review_text)
pd.DataFrame(output)


| | label | score |
|---|---|---|
| 0 | POSITIVE | 0.99914 |
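
The `score` value is the probability the model assigns to the predicted label. As a minimal sketch, assuming the classification head emits one raw logit per class and a softmax turns them into probabilities, the label and score could be derived as follows. The logit values are made up for illustration and are not taken from the run above.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (NEGATIVE, POSITIVE); illustrative values only.
labels = ["NEGATIVE", "POSITIVE"]
logits = [-3.5, 3.5]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
result = {"label": labels[best], "score": probs[best]}
print(result)
```

A score close to 1 only says the model is confident, not that the label is meaningful: the input here is a neutral company description, while the default checkpoint was fine-tuned on sentiment data.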


### transformers pipeline - named entity recognition (NER)
- ORG = organization
- MISC = miscellaneous
- LOC = location

etc.

In [5]:
ner_tagger = pipeline("ner", device="cuda")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
entities = ner_tagger(review_text)
pd.DataFrame(entities)

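| | entity | score | index | word | start | end |
|---|---|---|---|---|---|---|
| 0 | I-ORG | 0.999609 | 1 | N | 0 | 1 |
| 1 | I-ORG | 0.999246 | 2 | ##vid | 1 | 4 |
| 2 | I-ORG | 0.999594 | 3 | ##ia | 4 | 6 |
| 3 | I-ORG | 0.999056 | 4 | Corporation | 7 | 18 |
| 4 | I-MISC | 0.997677 | 8 | American | 30 | 38 |
| 5 | I-LOC | 0.99805 | 46 | Santa | 243 | 248 |
| 6 | I-LOC | 0.997077 | 47 | Clara | 249 | 254 |
| 7 | I-LOC | 0.99753 | 49 | California | 260 | 270 |
| 8 | I-ORG | 0.998946 | 51 | N | 272 | 273 |
| 9 | I-ORG | 0.999167 | 52 | ##vid | 273 | 276 |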
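
The output is token-level, so "Nvidia" arrives as the WordPiece fragments `N`, `##vid`, `##ia`. The sketch below shows one way to merge consecutive tokens with the same label into whole entities. It uses rows 0 to 7 of the listing above; rows 8 and 9 are left out because the listing stops partway through the second "Nvidia". The helper `group_entities` is hypothetical and only a simplified stand-in for the grouping the pipeline can do itself through its `aggregation_strategy` option.

```python
def group_entities(tokens):
    """Merge consecutive tokens that share an entity label into one span."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # drop the I-/B- prefix
        prev = groups[-1] if groups else None
        if prev and prev["entity_group"] == label and tok["index"] == prev["last_index"] + 1:
            if tok["word"].startswith("##"):
                # WordPiece continuation: attach without a space
                prev["word"] += tok["word"][2:]
            else:
                prev["word"] += " " + tok["word"]
            prev["end"] = tok["end"]
            prev["last_index"] = tok["index"]
        else:
            groups.append({
                "entity_group": label,
                "word": tok["word"],
                "start": tok["start"],
                "end": tok["end"],
                "last_index": tok["index"],
            })
    return groups

# Rows 0-7 of the listing: (entity, index, word, start, end)
rows = [
    ("I-ORG", 1, "N", 0, 1),
    ("I-ORG", 2, "##vid", 1, 4),
    ("I-ORG", 3, "##ia", 4, 6),
    ("I-ORG", 4, "Corporation", 7, 18),
    ("I-MISC", 8, "American", 30, 38),
    ("I-LOC", 46, "Santa", 243, 248),
    ("I-LOC", 47, "Clara", 249, 254),
    ("I-LOC", 49, "California", 260, 270),
]
keys = ("entity", "index", "word", "start", "end")
tokens = [dict(zip(keys, r)) for r in rows]

groups = group_entities(tokens)
for g in groups:
    print(g["entity_group"], g["word"], g["start"], g["end"])
```

"Santa Clara" and "California" stay separate because the comma between them (token index 48) is not tagged as part of an entity.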


### transformers pipeline - question answering
- Extractive question answering

In [7]:
answer = pipeline("question-answering", device="cuda")
question = "what is gpu?"
output = answer(question=question, context=review_text)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [8]:
output["answer"]


'graphics processing units'
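
Extractive question answering returns a substring of the context instead of generating new text. A minimal sketch of the span-selection step follows, assuming the model yields one start score and one end score per context token. The tokens and scores are invented for illustration, and `best_span` is a hypothetical helper, not the library's implementation.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end) token indices maximizing start + end score, with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best

# Illustrative tokens and scores (not real model output)
tokens = ["best", "known", "for", "its", "graphics", "processing", "units", "(", "GPUs", ")"]
start_scores = [0.1, 0.0, 0.2, 0.3, 4.0, 0.5, 0.2, 0.1, 1.5, 0.0]
end_scores = [0.0, 0.1, 0.0, 0.1, 0.3, 0.8, 3.5, 0.2, 2.0, 0.4]

start, end = best_span(start_scores, end_scores)
answer_text = " ".join(tokens[start:end + 1])
print(answer_text)  # graphics processing units
```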

### transformers pipeline - summarization

In [9]:
summarizer = pipeline("summarization", device="cuda")
output = summarizer(review_text, max_length=12)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Your min_length=56 must be inferior than your max_length=12.


In [10]:
print(output[0]["summary_text"])

 Nvidia Corporation is a major American technology company .
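
The `min_length=56 must be inferior than your max_length=12` warning appears because only `max_length` was overridden, so the default `min_length` of 56 reported in the warning stayed in effect. One way to avoid it is to pass both values together. The sketch below uses a hypothetical helper, `length_kwargs`, that keeps the pair consistent; halving `max_length` to get the minimum is an arbitrary choice made for illustration.

```python
def length_kwargs(max_length, min_length=56):
    """Return length arguments with min_length kept below max_length.

    The default of 56 mirrors the value reported in the warning;
    this helper is hypothetical and not part of transformers.
    """
    if max_length < 1:
        raise ValueError("max_length must be positive")
    # Clamp the minimum so it never exceeds half of the maximum
    return {"max_length": max_length, "min_length": min(min_length, max_length // 2)}

kwargs = length_kwargs(12)
print(kwargs)  # {'max_length': 12, 'min_length': 6}
```

The resulting dictionary can be forwarded as keyword arguments to the summarization call in place of the lone `max_length=12`.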


### transformers pipeline - translation

In [21]:
translator = pipeline("translation_en_to_de",
                      model="Helsinki-NLP/opus-mt-en-de",
                      device="cuda")
output = translator(review_text, clean_up_tokenization_spaces=True, min_length=100)

In [22]:
output[0]['translation_text']

'Nvidia Corporation ist ein großes amerikanisches Technologieunternehmen, das am besten für seine Grafikverarbeitungseinheiten (GPUs) bekannt ist, die weit verbreitet in Gaming, professioneller Grafik und Hochleistungs-Computing verwendet werden. Nvidia wurde 1993 gegründet und hat seinen Hauptsitz in Santa Clara, Kalifornien, und ist zu einem wichtigen Akteur in verschiedenen Branchen gewachsen, einschließlich künstlicher Intelligenz (KI), Datenwissenschaft, Automobiltechnologie und mobiles Computing..................................................................................................................................................................................................................'

### transformers pipeline - text generation

In [118]:
text_gen = pipeline("text-generation")
input_text = "me" + review_text + "\nanswer:"
output = text_gen(input_text, max_length=200)

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


In [119]:
print(output[0]["generated_text"].split(review_text)[-1])


answer: There is no such thing as the "Nvidia GPUs" listed above. In fact the Nvidia GPUs are sold as part of the Intel "Nvidia Series D" which consist of the "Nvidia GPUs 3xxx, Kepler, and Kepler Mega APUs --------------" series and "Nvidia GPUs 3xxx, Kepler Mega APUs 3", or "Nvidia GeForce GPUs 3xxx, --------------" series. The    GPUs are known as "Nvidia GPUs Pro"  (GPUs 3xx
