<a href="https://colab.research.google.com/github/xpdlaldam/nlp/blob/master/Hugging%20Face/transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. pipeline**

In [22]:
!pip install datasets gradio evaluate transformers[sentencepiece]

# Libraries

In [1]:
from transformers import pipeline
from datasets import load_dataset
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

## 1-1. sentiment analysis

In [5]:
### sentiment-analysis
classifier = pipeline("sentiment-analysis")

sents = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
    "neutral i'd say"
    ]
# classifier(sents[2]) # one by one
classifier(sents) # simultaneous

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455},
 {'label': 'NEGATIVE', 'score': 0.9988003969192505}]
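
The pipeline returns one dict per input, in input order, so the predictions can be zipped back onto the sentences. A minimal sketch follows, using the scores printed above as literal data so it runs without downloading a model; `pair_predictions` is a hypothetical helper, not part of transformers.

```python
# Hypothetical helper: attach each prediction to the sentence that produced it.
def pair_predictions(sentences, predictions):
    if len(sentences) != len(predictions):
        raise ValueError("expected one prediction per sentence")
    return [
        {"text": s, "label": p["label"], "score": round(p["score"], 4)}
        for s, p in zip(sentences, predictions)
    ]

sents = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
    "neutral i'd say",
]
# Copied from the cell output above.
preds = [
    {"label": "POSITIVE", "score": 0.9598049521446228},
    {"label": "NEGATIVE", "score": 0.9994558691978455},
    {"label": "NEGATIVE", "score": 0.9988003969192505},
]

paired = pair_predictions(sents, preds)
for row in paired:
    print(f'{row["label"]:<8} {row["score"]:.4f}  {row["text"]}')
```

The third sentence is labeled NEGATIVE with high confidence because the default SST-2 checkpoint only has two classes (POSITIVE and NEGATIVE), so it has no way to answer "neutral".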

## 1-2. customize labels

In [7]:
### zero-shot-classification: lets you supply your own candidate labels
classifier = pipeline("zero-shot-classification")
sents = [
    "this is biology 101",
    "president trump",
    "capex was over 1B this time",
]

classifier(
    sents,
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'sequence': 'this is biology 101',
  'labels': ['education', 'business', 'politics'],
  'scores': [0.9758106470108032, 0.015421471558511257, 0.008767769671976566]},
 {'sequence': 'president trump',
  'labels': ['politics', 'business', 'education'],
  'scores': [0.8464727401733398, 0.11240741610527039, 0.04111983999609947]},
 {'sequence': 'capex was over 1B this time',
  'labels': ['business', 'politics', 'education'],
  'scores': [0.9776291847229004, 0.013183980248868465, 0.009186833165585995]}]
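
Each result carries parallel `labels` and `scores` lists, so the predicted class is the label with the highest score. A small sketch of reducing the output above to one label per sentence; `top_label` is a hypothetical helper and the data is copied from the cell output.

```python
# Hypothetical helper: pick the highest-scoring label from one zero-shot result.
def top_label(result):
    score, label = max(zip(result["scores"], result["labels"]))
    return label, score

# Copied from the cell output above.
results = [
    {"sequence": "this is biology 101",
     "labels": ["education", "business", "politics"],
     "scores": [0.9758106470108032, 0.015421471558511257, 0.008767769671976566]},
    {"sequence": "president trump",
     "labels": ["politics", "business", "education"],
     "scores": [0.8464727401733398, 0.11240741610527039, 0.04111983999609947]},
    {"sequence": "capex was over 1B this time",
     "labels": ["business", "politics", "education"],
     "scores": [0.9776291847229004, 0.013183980248868465, 0.009186833165585995]},
]

summary = {r["sequence"]: top_label(r)[0] for r in results}
print(summary)
```

In the output shown above the labels already appear sorted by descending score, so `result["labels"][0]` would give the same answer; taking the maximum simply avoids relying on that ordering.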

## 1-3. generate text

In [11]:
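# distilgpt2 is a small causal language model, suitable for text generation.
# deepset/roberta-base-squad2 is an extractive question-answering model, so it does not fit this task.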
generator = pipeline("text-generation", model="distilgpt2")
generator("summarize AMD's most recent financial report")

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'summarize AMD\'s most recent financial report, which includes market estimates and market analyses, released today (April 12, 2013). According to the report\'s conclusions, AMD is currently in the midst of a "reorganization" of financial markets because'}]
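
Note that `generated_text` starts with the prompt itself. A sketch of separating out only the continuation; `continuation` is a hypothetical helper, and the sample string is a shortened copy of the output above. The text-generation pipeline also accepts `return_full_text=False`, which should make this post-processing unnecessary.

```python
# Hypothetical helper: return only what the model appended to the prompt.
def continuation(prompt, generated_text):
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Fall back to the full text if the prompt is not a prefix.
    return generated_text

prompt = "summarize AMD's most recent financial report"
# Shortened copy of the pipeline output shown above.
output = [{"generated_text": prompt + ", which includes market estimates and market analyses"}]

new_text = continuation(prompt, output[0]["generated_text"])
print(new_text)
```

distilgpt2 is a plain language model rather than an instruction-following one, so it continues the prompt instead of actually summarizing anything, which is what the output above shows.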

In [5]:
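# No task is given, so it is inferred from the model (text classification).
# roberta-large-mnli is an NLI model, so it predicts NLI labels rather than sentiment.
pipe = pipeline(model="FacebookAI/roberta-large-mnli")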

sents = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
    "neutral i'd say"
    ]

pipe(sents)

Some weights of the model checkpoint at FacebookAI/roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'label': 'NEUTRAL', 'score': 0.737484872341156},
 {'label': 'NEUTRAL', 'score': 0.5799961686134338},
 {'label': 'ENTAILMENT', 'score': 0.6553459167480469}]
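
An NLI model is meant to judge whether a hypothesis follows from a premise, so feeding it single sentences, as above, gives labels that are hard to interpret. A sketch of building premise/hypothesis inputs, assuming the text-classification pipeline's documented `{"text": ..., "text_pair": ...}` input form; `make_nli_inputs` and the hypothesis string are hypothetical.

```python
# Hypothetical helper: pair every premise with the same hypothesis.
def make_nli_inputs(premises, hypothesis):
    return [{"text": premise, "text_pair": hypothesis} for premise in premises]

sents = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
    "neutral i'd say",
]

nli_inputs = make_nli_inputs(sents, "The speaker is happy.")
print(nli_inputs[0])

# With the pipeline from the cell above, these could then be scored with:
#   pipe(nli_inputs)
```

This premise/hypothesis framing is also how the zero-shot pipeline in section 1-2 uses its MNLI checkpoint: each candidate label is turned into a hypothesis and scored against the input text.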

## 1-4. speech recognition

In [9]:
import torch

# Use the first GPU when one is available, otherwise fall back to the CPU (-1).
device = 0 if torch.cuda.is_available() else -1
pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=device)
# The superb loader runs custom code; pass trust_remote_code=True to skip the interactive prompt.
dataset = load_dataset("superb", name="asr", split="test")

# KeyDataset (PyTorch only) yields just item["file"] for each dataset item,
# because the pipeline does not need the *target* transcription.
# For sentence pairs, use KeyPairDataset instead.
for out in tqdm(pipe(KeyDataset(dataset, "file"))):
    print(out)
    # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
    # {"text": ....}
    # ....

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cpu


The repository for superb contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/superb.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


In [15]:
dataset

Dataset({
    features: ['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'],
    num_rows: 2620
})

In [16]:
for out in tqdm(pipe(KeyDataset(dataset, "file"))):
    print(out)

  0%|          | 0/2620 [00:00<?, ?it/s]

{'text': 'HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOWER FAT AND SAUCE'}
{'text': 'STUFFERED INTO YOU HIS BELLY COUNSELLED HIM'}
{'text': 'AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS'}
{'text': 'HO BERTIE ANY GOOD IN YOUR MIND'}


KeyboardInterrupt: 
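
The loop above walks all 2,620 test clips, which is why it was interrupted by hand. A sketch of previewing only the first few results with `itertools.islice`, assuming the pipeline yields results lazily when it is given a dataset; `fake_pipe` is a hypothetical stand-in for `pipe(KeyDataset(dataset, "file"))` so the snippet runs without any downloads.

```python
from itertools import islice

# Hypothetical stand-in that mimics a pipeline yielding results one at a time.
def fake_pipe(items):
    for item in items:
        yield {"text": item.upper()}

# Consume at most n results from a lazy iterator, then stop.
def preview(results, n=3):
    return list(islice(results, n))

first = preview(fake_pipe(["one", "two", "three", "four", "five"]), n=2)
print(first)

# In the notebook this would look like:
#   preview(pipe(KeyDataset(dataset, "file")), n=5)
```

Because `islice` stops pulling from the iterator after `n` items, the remaining clips are never transcribed.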

In [None]:
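from huggingface_hub import list_datasets
# list_datasets() iterates over every dataset on the Hub, so this can take a long time.
# The loop variable is named ds to avoid confusion with the `dataset` object loaded earlier.
print([ds.id for ds in list_datasets()])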

In [12]:
minds = load_dataset("PolyAI/minds14", name="ko-KR", split="train")
minds

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
    num_rows: 592
})

In [13]:
minds[0]

{'path': '/root/.cache/huggingface/datasets/downloads/extracted/f9018fd3747971e77d59e6c5da3fdf9d5bb914c495e16c23e1fe47c921d76a7a/ko-KR~ATM_LIMIT/602bef265f67b421554f65e7.wav',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/f9018fd3747971e77d59e6c5da3fdf9d5bb914c495e16c23e1fe47c921d76a7a/ko-KR~ATM_LIMIT/602bef265f67b421554f65e7.wav',
  'array': array([0.00024414, 0.        , 0.        , ..., 0.00073242, 0.00048828,
         0.00048828]),
  'sampling_rate': 8000},
 'transcription': 'app Manager 하고 싶은데 최대 금액이 얼마인지요',
 'english_transcription': 'I want to do app manager, what is the maximum amount',
 'intent_class': 3,
 'lang_id': 8}
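
The `audio` entry holds the raw waveform plus its sampling rate, so a clip's length in seconds is the number of samples divided by `sampling_rate`. A sketch with a hypothetical `clip_seconds` helper and a synthetic waveform (the real array is elided in the output above):

```python
# Hypothetical helper: duration in seconds of one decoded audio entry.
def clip_seconds(audio):
    return len(audio["array"]) / audio["sampling_rate"]

# Synthetic stand-in for minds[0]["audio"]: 20,000 silent samples at 8 kHz.
example_audio = {"array": [0.0] * 20000, "sampling_rate": 8000}

seconds = clip_seconds(example_audio)
print(seconds)  # 2.5

# This dataset is sampled at 8 kHz, while facebook/wav2vec2-base-960h was
# trained on 16 kHz speech. Resampling would likely be needed before
# transcribing, for example with the datasets library:
#   from datasets import Audio
#   minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
```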

In [14]:
id2label = minds.features["intent_class"].int2str
id2label(minds[0]["intent_class"])

'atm_limit'

In [19]:
minds.shuffle()[0]

{'path': '/root/.cache/huggingface/datasets/downloads/extracted/f9018fd3747971e77d59e6c5da3fdf9d5bb914c495e16c23e1fe47c921d76a7a/ko-KR~DIRECT_DEBIT/603f0615d7d083c1cb57a8d1.wav',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/f9018fd3747971e77d59e6c5da3fdf9d5bb914c495e16c23e1fe47c921d76a7a/ko-KR~DIRECT_DEBIT/603f0615d7d083c1cb57a8d1.wav',
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 8000},
 'transcription': '자동 이체를 어떻게 사용해요',
 'english_transcription': 'How to use automatic debit',
 'intent_class': 8,
 'lang_id': 8}

In [None]:
import gradio as gr

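def generate_audio():
    # Pick a random example on each call; minds[0] would show the same clip four times.
    example = minds.shuffle()[0]
    audio = example["audio"]
    # gr.Audio accepts a (sampling_rate, waveform) tuple; the intent name is used as the label.
    return (
        audio["sampling_rate"],
        audio["array"],
    ), id2label(example["intent_class"])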


with gr.Blocks() as demo:
    with gr.Column():
        for _ in range(4):
            audio, label = generate_audio()
            output = gr.Audio(audio, label=label)

demo.launch(debug=True)



Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
