# Hugging Face Learning
Working through the Hugging Face NLP course. The first time a model is used it has to be downloaded, which takes a while; after that it is cached locally in `C:\Users\Trevor_Kinsey\.cache\huggingface\hub`, so later runs reuse the local copy.
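
The cache location is not fixed. As a hedged sketch, the helper below mirrors the lookup order that `huggingface_hub` documents (`HF_HUB_CACHE`, then `HF_HOME`, then `XDG_CACHE_HOME`, then `~/.cache/huggingface/hub`). The function name `default_hub_cache` is made up for illustration, and the library's own resolution logic may differ in detail between versions.

```python
import os
from pathlib import Path


def default_hub_cache(env=None):
    """Return the directory where downloaded models are expected to be cached.

    Assumption: this mirrors the documented precedence of the HF_HUB_CACHE,
    HF_HOME and XDG_CACHE_HOME environment variables; it is not library code.
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"
    cache_root = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(cache_root) / "huggingface" / "hub"


print(default_hub_cache())
```

Passing a plain dict instead of the real environment makes it easy to see what each variable would do without changing any settings.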

In [1]:
from transformers import pipeline




# Pipelines
A pipeline bundles pre-processing, the model, and post-processing into a single call. Here is a tour of some of the predefined pipelines built into Hugging Face `transformers`.
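
To make the three stages concrete, here is a toy sketch in plain Python. Everything in it is a stand-in: `toy_tokenize`, `toy_model` and the hard-coded word scores are invented for illustration and have nothing to do with the real tokenizer or network. Only the shape of the flow (text to tokens, tokens to logits, logits to a `label`/`score` dict) mirrors what a classification pipeline returns.

```python
import math

# Hypothetical word scores standing in for a trained model's weights.
TOY_WEIGHTS = {"love": 2.0, "great": 1.5, "hate": -2.0, "boring": -1.5}
LABELS = ["NEGATIVE", "POSITIVE"]


def toy_tokenize(text):
    """Pre-processing: lowercase and split on whitespace (real tokenizers use sub-words)."""
    return text.lower().split()


def toy_model(tokens):
    """Model: return one raw score (logit) per label."""
    total = sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)
    return [-total, total]


def postprocess(logits):
    """Post-processing: softmax the logits and report the most likely label."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, as a real pipeline does behind a single call."""
    return postprocess(toy_model(toy_tokenize(text)))


print(toy_pipeline("I love this course"))
```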

## Sentiment Analysis

In [2]:
# instantiate the pipeline
sentiment = pipeline(task="sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")




All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [3]:
statements = [
    "I've been waiting for a HuggingFace course my whole life.",
    "looking for a job is soul-crushing",
    "mashed potatoes, eh?",
    "chocolate, eh?"
]
responses = sentiment(statements)
for statement, response in zip(statements, responses):
    print(f"{response}: {statement}")

{'label': 'POSITIVE', 'score': 0.9598047137260437}: I've been waiting for a HuggingFace course my whole life.
{'label': 'NEGATIVE', 'score': 0.9913671612739563}: looking for a job is soul-crushing
{'label': 'NEGATIVE', 'score': 0.9899002909660339}: mashed potatoes, eh?
{'label': 'POSITIVE', 'score': 0.9837786555290222}: chocolate, eh?


## Zero-shot Classification
This lets you supply your own candidate labels at inference time and classify text against them, without fine-tuning the model on those labels.


In [4]:
zero_shot = pipeline("zero-shot-classification", model="roberta-large-mnli")

All PyTorch model weights were used when initializing TFRobertaForSequenceClassification.

All the weights of TFRobertaForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


In [5]:
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
zero_shot(sequence_to_classify, candidate_labels)

{'sequence': 'one day I will see the world',
 'labels': ['travel', 'cooking', 'dancing'],
 'scores': [0.9799639582633972, 0.010605032555758953, 0.009431014768779278]}

In [6]:
sequence_to_classify = "The CEO had a strong handshake."
candidate_labels = ['male', 'female']
hypothesis_template = "This text speaks about a {} profession."
zero_shot(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)

{'sequence': 'The CEO had a strong handshake.',
 'labels': ['male', 'female'],
 'scores': [0.8384835720062256, 0.16151642799377441]}
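
The `hypothesis_template` argument hints at how zero-shot classification is framed: each candidate label is slotted into the template to form a hypothesis, and the NLI model then judges whether the input text entails it. The sketch below only reproduces that string-building step and a softmax over entailment scores. The logits are made-up numbers, not model output, and `build_hypotheses` and `rank_labels` are illustrative names rather than library functions. The default template shown is assumed to match the pipeline's default.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Slot each candidate label into the template to form one hypothesis per label."""
    return [template.format(label) for label in labels]


def rank_labels(labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by probability."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


labels = ["travel", "cooking", "dancing"]
for hypothesis in build_hypotheses(labels):
    print(hypothesis)

# Made-up entailment logits, purely to show the ranking step.
print(rank_labels(labels, [4.0, -0.5, -0.6]))
```

Seen this way, the wording of the template matters: the model scores whole sentences, so a template that reads naturally with every label tends to be a safer choice.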

## Text Generation


In [8]:
generator = pipeline("text-generation", model="gpt2")

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [9]:
prompt = "When you are finished your training you will"
generator(prompt)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'When you are finished your training you will receive a check of your own for more money. If you need assistance, please let me know.'}]

In [10]:
generator2 = pipeline("text-generation", model="distilgpt2")
generator2(
    prompt,
    max_length=30,
    num_return_sequences=2,
)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'When you are finished your training you will become the type of person who wants to spend the day playing a soccer game. I like to think about it'},
 {'generated_text': 'When you are finished your training you will notice a big change in the intensity of the first three minutes of your training. You will expect to increase your'}]

## Mask Filling
This involves filling in the blank (the `<mask>` token) in a prompt.


In [11]:
unmasker = pipeline("fill-mask", model="distilroberta-base")

All PyTorch model weights were used when initializing TFRobertaForMaskedLM.

All the weights of TFRobertaForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.


In [12]:
prompt = "I like to drink <mask> with my dinner."
unmasker(prompt, top_k=3)

[{'score': 0.2740415930747986,
  'token': 3984,
  'token_str': ' wine',
  'sequence': 'I like to drink wine with my dinner.'},
 {'score': 0.0926973968744278,
  'token': 3895,
  'token_str': ' coffee',
  'sequence': 'I like to drink coffee with my dinner.'},
 {'score': 0.05292801186442375,
  'token': 6845,
  'token_str': ' tea',
  'sequence': 'I like to drink tea with my dinner.'}]

In [13]:
prompts = [
    "I have 3 teenage children so my house can get very <mask>.",
    "I do <mask> to stay in shape"
]
unmasker(prompts, top_k=3)

[[{'score': 0.3800225257873535,
   'token': 18882,
   'token_str': ' messy',
   'sequence': 'I have 3 teenage children so my house can get very messy.'},
  {'score': 0.11004576086997986,
   'token': 11138,
   'token_str': ' crowded',
   'sequence': 'I have 3 teenage children so my house can get very crowded.'},
  {'score': 0.04842783138155937,
   'token': 3610,
   'token_str': ' busy',
   'sequence': 'I have 3 teenage children so my house can get very busy.'}],
 [{'score': 0.3566606640815735,
   'token': 236,
   'token_str': ' want',
   'sequence': 'I do want to stay in shape'},
  {'score': 0.16791412234306335,
   'token': 860,
   'token_str': ' try',
   'sequence': 'I do try to stay in shape'},
  {'score': 0.10354165732860565,
   'token': 240,
   'token_str': ' need',
   'sequence': 'I do need to stay in shape'}]]
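
Note the change in output shape between the two calls above: a single prompt returns a flat list of candidates, while a list of prompts returns one list per prompt. A small helper, sketched here under the name `best_fill` (not part of the library), can smooth over that difference. The sample data is trimmed from the outputs above.

```python
def best_fill(results):
    """Return the highest-scoring sequence for each prompt.

    Accepts either a flat list of candidate dicts (one prompt) or a
    list of such lists (several prompts).
    """
    if results and isinstance(results[0], dict):
        results = [results]  # normalise the single-prompt case
    return [max(candidates, key=lambda c: c["score"])["sequence"] for candidates in results]


single = [
    {"score": 0.2740, "token_str": " wine", "sequence": "I like to drink wine with my dinner."},
    {"score": 0.0927, "token_str": " coffee", "sequence": "I like to drink coffee with my dinner."},
]
batch = [
    single,
    [
        {"score": 0.3567, "token_str": " want", "sequence": "I do want to stay in shape"},
        {"score": 0.1679, "token_str": " try", "sequence": "I do try to stay in shape"},
    ],
]

print(best_fill(single))
print(best_fill(batch))
```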

## Named Entity Recognition

In [14]:
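# grouped_entities=True merges word pieces into whole entities;
# newer transformers releases express this as aggregation_strategy="simple"
ner = pipeline("ner", grouped_entities=True, model="dbmdz/bert-large-cased-finetuned-conll03-english")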

All PyTorch model weights were used when initializing TFBertForTokenClassification.

All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


In [15]:
text = "My daughter Naomi goes to New West Secondary in New Westminster and is going on an exchange to France next year"
ner(text)

[{'entity_group': 'PER',
  'score': 0.9983121,
  'word': 'Naomi',
  'start': 12,
  'end': 17},
 {'entity_group': 'ORG',
  'score': 0.80635494,
  'word': 'New West Secondary',
  'start': 26,
  'end': 44},
 {'entity_group': 'LOC',
  'score': 0.99673724,
  'word': 'New Westminster',
  'start': 48,
  'end': 63},
 {'entity_group': 'LOC',
  'score': 0.99989116,
  'word': 'France',
  'start': 95,
  'end': 101}]
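
The `start` and `end` values are character offsets into the original string, so each entity can be recovered by slicing. The check below uses the output above (scores omitted) and groups the entities by type. `group_entities` is an illustrative helper, not a library call.

```python
from collections import defaultdict

text = ("My daughter Naomi goes to New West Secondary in New Westminster "
        "and is going on an exchange to France next year")

# Copied from the pipeline output above, without the scores.
entities = [
    {"entity_group": "PER", "word": "Naomi", "start": 12, "end": 17},
    {"entity_group": "ORG", "word": "New West Secondary", "start": 26, "end": 44},
    {"entity_group": "LOC", "word": "New Westminster", "start": 48, "end": 63},
    {"entity_group": "LOC", "word": "France", "start": 95, "end": 101},
]


def group_entities(text, entities):
    """Map each entity type to the text spans found for it, in order of appearance."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


print(group_entities(text, entities))
```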

## Question Answering
Extracts the answer to a question from a context you supply; the model does not consult any external sources.


In [16]:
question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

All PyTorch model weights were used when initializing TFDistilBertForQuestionAnswering.

All the weights of TFDistilBertForQuestionAnswering were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForQuestionAnswering for predictions without further training.


In [17]:
question = "How old am I?"
context = "My name is Trevor and I am 47 years old."
question_answerer(question=question, context=context)

{'score': 0.5341746807098389, 'start': 27, 'end': 29, 'answer': '47'}
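
Because this kind of model is extractive, the answer is a span copied out of the context, located by `start` and `end`. The sketch below checks that against the result above and adds a confidence cut-off. The 0.5 threshold and the `accept_answer` name are arbitrary choices for illustration.

```python
def accept_answer(result, context, min_score=0.5):
    """Return the answer span if the model is confident enough, else None."""
    span = context[result["start"]:result["end"]]
    if span != result["answer"]:
        raise ValueError("offsets do not match the reported answer")
    return span if result["score"] >= min_score else None


context = "My name is Trevor and I am 47 years old."
# Copied from the pipeline output above.
result = {"score": 0.5341746807098389, "start": 27, "end": 29, "answer": "47"}

print(accept_answer(result, context))
print(accept_answer(result, context, min_score=0.9))
```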

## Summarizing
Produces a shorter summary of a longer text.


In [18]:
summarizer = pipeline("summarization", model="t5-small")

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [19]:
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

[{'summary_text': 'the number of graduates in traditional engineering disciplines has declined . in most of the premier american universities engineering curricula now concentrate on and encourage largely the study of engineering science . rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

## Translation

In [20]:
translator = pipeline("translation_en_to_fr", model="t5-base")

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [21]:
eng = "hello good sir, get ready for a nautical adventure!"

translator(eng)

[{'translation_text': 'Bonjour sir, préparez-vous à une aventure nautique!'}]