<a href="https://colab.research.google.com/github/younghun-cha/Healthcare-Big-Data-Engineer/blob/main/AI/05-NLP-Transformers/02_Transformers_Pipeline_Task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP: Using the Transformers Pipeline

**Hugging Face** (https://huggingface.co) pipelines support tasks such as:
* feature-extraction (returns a vector representation of the text)
* fill-mask
* ner (Named Entity Recognition)
* question-answering
* sentiment-analysis
* summarization
* text-generation
* translation
* zero-shot-classification

In [1]:
%pip install tensorflow torch transformers --quiet

In [2]:
import logging

import torch
import transformers
from transformers import pipeline

# Monkey-patch get_verbosity so that it always reports logging.NOTSET
transformers.logging.get_verbosity = lambda: logging.NOTSET

## 1. Sentiment Analysis

In [3]:
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
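
The warning above recommends naming the model and revision explicitly. A minimal sketch of how that could look is shown below; the model name and revision are copied from the warning message, while `PINNED_SENTIMENT` and `build_pinned_classifier` are hypothetical names introduced only for this illustration.

```python
# Values copied from the warning message printed by pipeline('sentiment-analysis').
# PINNED_SENTIMENT and build_pinned_classifier are illustrative names (assumptions).
PINNED_SENTIMENT = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_pinned_classifier(config=PINNED_SENTIMENT):
    """Create the pipeline with an explicit model and revision."""
    # Imported lazily: building the pipeline downloads the model on first use.
    from transformers import pipeline

    return pipeline(
        config["task"],
        model=config["model"],
        revision=config["revision"],
    )
```

Calling `build_pinned_classifier()` should return a classifier that behaves like the default one used here, without relying on the library's default choice.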


In [4]:
result = classifier(["Biotech is a great field", "Chemistry is difficult"])
result

[{'label': 'POSITIVE', 'score': 0.9998420476913452},
 {'label': 'NEGATIVE', 'score': 0.9974084496498108}]

In [5]:
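# Korean inputs: "This food tastes really bad." and
# "I got very satisfying results from this class."
result2 = classifier(["이 음식은 굉장이 맛이 없습니다.", "나는 이번 수업으로부터 굉장히 만족한 결과를 얻었다."])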
result2

[{'label': 'POSITIVE', 'score': 0.9223692417144775},
 {'label': 'POSITIVE', 'score': 0.6938512921333313}]
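
Both Korean sentences come back as POSITIVE even though the first one is negative, which is what one would expect from the English-only default checkpoint. One possible guard, sketched below, is to separate out inputs containing Hangul before they reach the English model. The helper names `contains_hangul` and `split_by_script` are hypothetical, and checking only the Hangul Syllables block is a simplifying assumption.

```python
def contains_hangul(text: str) -> bool:
    """Return True if any character is in the Hangul Syllables block (U+AC00 to U+D7A3)."""
    return any("\uac00" <= ch <= "\ud7a3" for ch in text)


def split_by_script(texts):
    """Split inputs into (no_hangul, with_hangul) lists, preserving order."""
    no_hangul, with_hangul = [], []
    for text in texts:
        (with_hangul if contains_hangul(text) else no_hangul).append(text)
    return no_hangul, with_hangul


no_hangul, with_hangul = split_by_script([
    "Biotech is a great field",
    "이 음식은 굉장이 맛이 없습니다.",
])
print(no_hangul)    # inputs that can go to the English classifier
print(with_hangul)  # inputs that need a model trained on Korean
```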

## 2. Question Answering

In [6]:
question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [7]:
paragraph = """Recent studies suggest,however, that a single clone may synthesize antibody molecules of the same speci-ficity which differ in heavy chain class. Thus, idiotypic determinants, which are con-sidered to be a function of the antibody combining site and therefore a variable re-gion marker, have been shown to be shared among the IgM and IgG anti-Sal-monella antibodies produced by individual rabbits. Furthermore, IgG and IgMmyeloma proteins derived from a single individual have been shown to share idio-typic determinants while structural studies indicate that these proteins have identical light chains and heavy chain variable region subgroups."""

In [8]:
paragraph

'Recent studies suggest,however, that a single clone may synthesize antibody molecules of the same speci-ficity which differ in heavy chain class. Thus, idiotypic determinants, which are con-sidered to be a function of the antibody combining site and therefore a variable re-gion marker, have been shown to be shared among the IgM and IgG anti-Sal-monella antibodies produced by individual rabbits. Furthermore, IgG and IgMmyeloma proteins derived from a single individual have been shown to share idio-typic determinants while structural studies indicate that these proteins have identical light chains and heavy chain variable region subgroups.'

In [9]:
result = question_answerer(question="What type of molecule can a single clone synthesize?", context=paragraph)
result

{'score': 0.4665096402168274,
 'start': 67,
 'end': 85,
 'answer': 'antibody molecules'}

In [10]:
result = question_answerer(question="How do the molecules differ?", context=paragraph)
result

{'score': 0.7798488140106201,
 'start': 127,
 'end': 144,
 'answer': 'heavy chain class'}
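
The `start` and `end` values in these results are character offsets into the context string, so slicing the context reproduces the `answer` field. The sketch below checks this using only the first sentence of `paragraph` and the offsets from the two results above; `answer_span` is a hypothetical helper name.

```python
# First sentence of `paragraph`, copied verbatim (including the extraction hyphen).
context = (
    "Recent studies suggest,however, that a single clone may synthesize "
    "antibody molecules of the same speci-ficity which differ in heavy chain class."
)

# Offsets and answers copied from the two question-answering results above.
results = [
    {"start": 67, "end": 85, "answer": "antibody molecules"},
    {"start": 127, "end": 144, "answer": "heavy chain class"},
]


def answer_span(context, result):
    """Slice the context with the character offsets returned by the pipeline."""
    return context[result["start"]:result["end"]]


spans = [answer_span(context, r) for r in results]
print(spans)
```

Because the offsets index the original string, any cleanup of the context (for example, removing the stray hyphens) has to happen before the pipeline call, not after it.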

## 3. Text Generation

In [11]:
text_generator = pipeline("text-generation")
text_generator("The field of biotech is", max_length=50, do_sample=False)

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The field of biotech is growing rapidly, and the number of companies that are developing new drugs is growing at a rapid pace.\n\n"We\'re seeing a lot of new drugs coming out of the labs," said Dr. David S. Gorman'}]
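
Note that `generated_text` begins with the prompt itself. If only the newly generated part is needed, the prompt can be stripped afterwards, as in the sketch below. `continuation` is a hypothetical helper, and the sample output is a shortened copy of the result above. The text-generation pipeline also documents a `return_full_text` argument for this purpose.

```python
def continuation(prompt, outputs):
    """Return each generated text with the leading prompt removed."""
    texts = []
    for out in outputs:
        text = out["generated_text"]
        # Only strip the prompt when the text really starts with it.
        texts.append(text[len(prompt):] if text.startswith(prompt) else text)
    return texts


prompt = "The field of biotech is"
# Shortened copy of the pipeline output shown above.
outputs = [{
    "generated_text": "The field of biotech is growing rapidly, and the number of "
                      "companies that are developing new drugs is growing at a rapid pace."
}]

new_text = continuation(prompt, outputs)
print(new_text[0])
```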

## 4. Text Summarization

In [12]:
summarizer = pipeline("summarization")
summarizer(paragraph, max_length=30, min_length=20, do_sample=False)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' Recent studies suggest that a single clone may synthesize antibody molecules of the same speci-ficity which differ in heavy chain class .'}]
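
The summary comes back with a leading space and a space before the final period. A small post-processing step can tidy this up; the sketch below is one way to do it, with `tidy_summary` as a hypothetical helper name. It only normalizes whitespace and does not repair the `speci-ficity` hyphenation inherited from the input.

```python
import re


def tidy_summary(text):
    """Trim surrounding whitespace and remove spaces before punctuation."""
    text = text.strip()
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


# summary_text copied from the summarizer output above.
raw = (" Recent studies suggest that a single clone may synthesize antibody "
       "molecules of the same speci-ficity which differ in heavy chain class .")

clean = tidy_summary(raw)
print(clean)
```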

## 5. Using a Pretrained Language Model (PLM) in a Pipeline

In [13]:
# Load the distilgpt2 model
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "In this course, we will teach you how to use the power of memory and how to make use of it. So, we're teaching you how"},
 {'generated_text': 'In this course, we will teach you how to create code in your app. We will do our best to use the first principles of this course with'}]