In [1]:
import librosa
import pytesseract
import soundfile

from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset
from pytesseract import Output



In [2]:
transcriber = pipeline(task="automatic-speech-recognition")

No model was supplied, defaulted to facebook/wav2vec2-base-960h and revision 55bb623 (https://huggingface.co/facebook/wav2vec2-base-960h).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebo
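
The warning above advises against relying on the default model in production. A minimal sketch of pinning both values, assuming the checkpoint name and short revision hash reported in the log (`facebook/wav2vec2-base-960h`, `55bb623`). `PINNED` and `build_transcriber` are hypothetical names introduced here for illustration, and the import is deferred so the settings can be inspected without loading the library.

```python
# Pin the checkpoint and revision named in the log instead of relying on defaults.
PINNED = {
    "task": "automatic-speech-recognition",
    "model": "facebook/wav2vec2-base-960h",
    "revision": "55bb623",  # short commit hash copied from the warning above
}


def build_transcriber():
    """Create the ASR pipeline from the pinned settings (downloads weights on first use)."""
    from transformers import pipeline  # lazy import: nothing is loaded until this is called

    return pipeline(**PINNED)
```

Calling `build_transcriber()` should give the same kind of object as `pipeline(task="automatic-speech-recognition")`, but without the "No model was supplied" warning.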

In [3]:
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")


{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP LIVE UP THE TRUE MEANING OF ITS TREES'}

In [4]:
transcriber = pipeline(model="openai/whisper-large-v2")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}
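
The Whisper warning above mentions passing `language='en'`. A hedged sketch of how that could be done through the pipeline, assuming the ASR pipeline forwards `generate_kwargs` to the model's generation call. `GENERATE_KWARGS` and `transcribe_english` are hypothetical names, not part of the original notebook.

```python
# Generation options to force English decoding instead of automatic language detection.
GENERATE_KWARGS = {"language": "en", "task": "transcribe"}


def transcribe_english(transcriber, audio):
    """Run an ASR pipeline on `audio`, forwarding the fixed language settings."""
    return transcriber(audio, generate_kwargs=GENERATE_KWARGS)
```

With the `transcriber` from cell 4, this would be called as `transcribe_english(transcriber, "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")`.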

In [6]:
# Using pipelines on a dataset: inputs are streamed from a generator
def data():
    for i in range(10):
        yield f"My example {i}"


pipe = pipeline(model="openai-community/gpt2", pad_token_id=50256)
generated_characters = 0
for out in pipe(data()):
    print(out[0]["generated_text"])
    generated_characters += len(out[0]["generated_text"])

My example 0.0.50 (and even older) is a different matter. In its early development form the standard is the current-day WSD-3 specification, which was developed as one of the first open source standards to include hardware control
My example 1 - for example, you can find the list of 'deleted' files on your local filesystem... 1 2 3 4 5 6 7 8 9 10 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
My example 2: If the player was really unlucky and it was in their favor, instead of a 2-1 situation against Fnatic, they should have chosen a 3-0 and maybe pushed mid laner Maokai's gank if needed.
My example 3 is a lot more comfortable having a nice little "T-shirt" that doesn't include a collar.  I think the collar of a "T-shirt" looks pretty neat at this point. The only downside to my T-
My example 4" and 5" are designed with your eye in mind, so these will be easy to order.
My example 5 is an example which uses the same language of SQL as MySQL on its own, and does not use the following s
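
The loop in cell 6 relies on two things: the pipeline pulls inputs from the generator one at a time, and each result is a list of dicts with a `generated_text` key. The sketch below reproduces that shape without loading a model. `echo_pipe` is a hypothetical stand-in for the pipeline, not the real GPT-2 call.

```python
def data():
    """Same generator as in the notebook: ten short prompts."""
    for i in range(10):
        yield f"My example {i}"


def echo_pipe(inputs):
    """Hypothetical stand-in for a text-generation pipeline.

    Lazily yields one result list per input, mirroring the output shape
    used in the notebook: out[0]["generated_text"].
    """
    for text in inputs:
        yield [{"generated_text": text + " ..."}]


generated_characters = 0
for out in echo_pipe(data()):
    # Accumulate the length of every generated string, as the notebook loop does.
    generated_characters += len(out[0]["generated_text"])
```

Because both `data()` and `echo_pipe()` are generators, no prompt is produced until the loop asks for the next result, so the full input set never has to sit in memory.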

In [7]:
# KeyDataset is a utility that yields only the chosen column ("audio") from each dataset row.
pipe = pipeline(model="hf-internal-testing/tiny-random-wav2vec2")
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:10]")

for out in pipe(KeyDataset(dataset, "audio")):
    print(out)

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at hf-internal-testing/tiny-random-wav2vec2 and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.feature_extractor.conv_layers.1.layer_norm.bias', 'wav2vec2.feature_extractor.conv_layers.1.layer_norm.weight', 'wav2vec2.feature_extractor.conv_layers.2.layer_norm.bias', 'wav2vec2.feature_extractor.conv_layers.2.layer_norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


{'text': "EYB  ZB COE C BEZCYCZ HO MOWWB EM BWOB ZMEG  B COEB BE BEC B U OB BE BCB BEWUBB BXYWBESWYCB SBBB SSEZ C Z WH UB F IGVB SB Z<unk> XOES CZ BBXOXFBB  OBY W B VM OFOWUONFWB ZCX B M WZ Q S C Q BC CQBF FOMB BOT ZWYBZ WB  B CM B C B WZCWWW BHU EOYTO YWB BZ SHZBGEM Q OO T B BM XZ QW C OFBZMSEHB BE ZZBX M Q XB<unk> CEVWZ FOHSB W B O Z ZW S ZB O VM <s> D EUCKH XNC D Q BG B O BW U  U  MBE CBYE  WB HFQUBQBUWZ B MW BMPY F ZBU  EB B WBOF S XFOBB ZB X B MOT W B CEO WBM   BBXBBEOBECB B UM C BP FMBWB BZ WFCED Z B B FXB Z OZ OBBZ NVD UBZC W B WYCWY X CE CW B WB MWU BWN B DECF GEF'C WZS CS BYWB<s>FZ'Z<s>ZGBU ECFEY BF ZOZ O UWBSSZBBBBW   O O DBB BZWFUW ZWOZYCGOYCOT WC O CZ BD BBBBBBX X W T B BC BZC FWYBFO FBCE X Z PEZ CE B WEDBMBO BN B BY Y  W B BMCB XOXQ  BSZES Z M CF S FB BBXBB B C CSZ EF SEQF S BEC BNO BN  SU EH  WRFBS WB  W B OEZ WS X B F B X ZBBE BBEHB B BU BECBSXHB BSQWFW BSZXH BWSEG W VQETZMCZ UCXW Z DBE<s> O SXZX MB W RX YYOBSUBWOCFYEF O B O B C Z UBEZBE BTB C   CBFCB V W B BF W ZBBESBBE
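
The output above is gibberish because `tiny-random-wav2vec2` has random weights; the cell only demonstrates the iteration pattern. To show what `KeyDataset` contributes to that pattern, here is a minimal pure-Python stand-in. `ColumnView` is a hypothetical class written for illustration and is not the library implementation.

```python
class ColumnView:
    """Illustrative stand-in for KeyDataset: exposes one field of each row."""

    def __init__(self, dataset, key):
        self.dataset = dataset
        self.key = key

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        # Return only the field of interest instead of the whole row dict.
        return self.dataset[index][self.key]


# Toy rows shaped loosely like dataset records (values are made up for the example).
rows = [
    {"audio": "clip-0", "text": "first"},
    {"audio": "clip-1", "text": "second"},
]
audio_only = ColumnView(rows, "audio")
items = [audio_only[i] for i in range(len(audio_only))]
```

Here `items` holds only the `audio` values, which is the form the pipeline expects as input.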

In [8]:
# Text pipeline
# This is a `zero-shot-classification` model.
# It classifies text against whatever candidate labels you supply; no fixed label set is required.
classifier = pipeline(model="facebook/bart-large-mnli")
classifier(
    "I have a problem with my iphone that needs to be resolved asap!!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
)

{'sequence': 'I have a problem with my iphone that needs to be resolved asap!!',
 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'],
 'scores': [0.5036367177963257,
  0.47879841923713684,
  0.012600354850292206,
  0.002655796706676483,
  0.0023087686859071255]}
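
The result above lists labels in descending score order, and in the default single-label mode the scores are normalized across the candidate labels. A short sketch of reading it, using the values printed above; `result`, `ranked`, and `total` are names chosen for this example.

```python
# Output copied from the cell above.
result = {
    "sequence": "I have a problem with my iphone that needs to be resolved asap!!",
    "labels": ["urgent", "phone", "computer", "not urgent", "tablet"],
    "scores": [
        0.5036367177963257,
        0.47879841923713684,
        0.012600354850292206,
        0.002655796706676483,
        0.0023087686859071255,
    ],
}

# Pair each label with its score; the first pair is the top prediction.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]

# The scores form a distribution over the candidate labels, so they sum to about 1.
total = sum(result["scores"])
```

Note how close "urgent" and "phone" are: both describe the sentence, but the single-label normalization makes them compete. Passing `multi_label=True` to the classifier scores each label independently instead.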

In [9]:
# Multi-modal pipeline: document question answering (image plus text question)
vqa = pipeline(model="impira/layoutlm-document-qa")
vqa(
    image="https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png",
    # question="What is the invoice number?",
    question="What is the total amount on the invoice?"
)

[{'score': 0.8941310048103333, 'answer': '$154.06', 'start': 74, 'end': 74}]
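
The pipeline returns a list of candidate answers, each with a confidence score. The `start` and `end` fields appear to be word indices into the OCR'd text rather than character offsets, which would fit a one-word answer having both equal to 74. A minimal sketch of selecting an answer only when it is confident enough; `best_answer` and the 0.5 threshold are assumptions made for illustration.

```python
# Output copied from the cell above.
results = [{"score": 0.8941310048103333, "answer": "$154.06", "start": 74, "end": 74}]


def best_answer(results, min_score=0.5):
    """Return the highest-scoring answer dict, or None if it falls below min_score."""
    if not results:
        return None
    top = max(results, key=lambda r: r["score"])
    return top if top["score"] >= min_score else None


answer = best_answer(results)
```

A stricter threshold such as `best_answer(results, min_score=0.95)` would reject this answer and return `None`.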