# Lab - Transformers

## Lab Summary:
In this lab, we explore transformer models using the Hugging Face transformers library for Python.

## Lab Goal:
Upon completion of this lab, the student should be able to:
<ul>
    <li> Apply Python to implement machine learning transformers</li>
</ul>


## Packages and Classes
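In this lab we will be using the following:
<ol>
    <li> transformers (the Hugging Face library) </li>
    <li> pipeline (a function imported from transformers) </li>
    <li> torch (PyTorch, installed alongside transformers in the first cell) </li>
</ol>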



In [1]:
# Install torch and transformers if they are not already installed in your Python environment.
! pip install torch transformers



# Example 1: question-answering Transformer Pipeline

This example demonstrates how to apply a simple "question-answering" pipeline to a passage of text.

In [2]:
# This set of text is used as the "context" for the pipeline.
context = "Sweden is a Nordic country in Northern Europe. \
It borders Norway to the west and north, Finland to the east, \
and is connected to Denmark in the southwest by a bridge-tunnel \
across the Öresund Strait. At 450,295 square kilometres (173,860 sq mi), \
Sweden is the largest country in Northern Europe, the third largest country \
in the European Union, and the fifth largest country in Europe. The capital \
city is Stockholm. Sweden has a total population of 10.4 million;\
and a low population density of 25 inhabitants per square kilometre \
(65/sq mi). 87% of Swedes live in urban areas, which cover 1.5% of the \
entire land area. The highest concentration is in the central and \
southern half of the country.Sweden is part of the geographical area \
of Fennoscandia. The climate is in general mild for its northerly \
latitude due to significant maritime influence. In spite of the \
high latitude, Sweden often has warm continental summers, being \
located in between the North Atlantic, the Baltic Sea, and vast \
Russia. The general climate and environment vary significantly from \
the south and north due to the vast latitudal difference, and much of \
Sweden has reliably cold and snowy winters. Southern Sweden is \
predominantly agricultural, while the north is heavily forested and \
includes a portion of the Scandinavian Mountains."

In [3]:
# The pipeline() function from the transformers library lets us perform basic NLP tasks with little code.
from transformers import pipeline
question_answerer = pipeline('question-answering')
question_answerer({
    'question': 'What is the capital of Sweden?',
    'context': context})

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


{'score': 0.9498878121376038, 'start': 406, 'end': 415, 'answer': 'Stockholm'}
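The result is a dictionary with four keys. As a minimal sketch of how they relate, the block below builds a dictionary of the same shape by hand, using a short stand-in context rather than the full Sweden passage. The names `demo_context` and `demo_result` and the score value are placeholders for illustration, not pipeline output.

```python
# A short stand-in context (not the full passage used in the lab).
demo_context = "The capital city is Stockholm. Sweden has a total population of 10.4 million."

# Hand-built dictionary in the same shape as the pipeline result above.
# The score here is a made-up placeholder.
answer_text = "Stockholm"
answer_start = demo_context.find(answer_text)
demo_result = {
    "score": 0.95,
    "start": answer_start,
    "end": answer_start + len(answer_text),
    "answer": answer_text,
}

# 'start' and 'end' are character offsets into the context string,
# with 'end' exclusive, so slicing the context recovers the answer.
span = demo_context[demo_result["start"]:demo_result["end"]]
print(span)
```

In the real output above, `end - start` is 415 - 406 = 9, which matches the nine characters of "Stockholm".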

Here is another example that uses the same context.

Note that this time we save the result to a variable.

In [4]:
# The question_answerer pipeline created in the previous cell is reused here.
density = question_answerer({
    'question': 'What is the population density of Sweden?',
    'context': context})
density

{'score': 0.6592300534248352,
 'start': 495,
 'end': 530,
 'answer': '25\xa0inhabitants\xa0per\xa0square\xa0kilometre'}
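The answer above contains `\xa0` characters, which are non-breaking spaces carried over from the source text. A small sketch of one way to tidy the answer before showing it to a user follows; the dictionary is copied from the output above, and the name `density_result` is chosen here only to avoid overwriting the lab's own variable.

```python
import unicodedata

# Result copied from the cell output above.
density_result = {
    "score": 0.6592300534248352,
    "start": 495,
    "end": 530,
    "answer": "25\xa0inhabitants\xa0per\xa0square\xa0kilometre",
}

# NFKC normalization maps the non-breaking space (U+00A0) to a regular space.
clean_answer = unicodedata.normalize("NFKC", density_result["answer"])

# Format the score as a percentage for readability.
score_text = f"{density_result['score']:.1%}"
print(f"Population density: {clean_answer} (score: {score_text})")
```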

In this next example, we use the keys of the result dictionary to present the answer in a more user-friendly way.

In [None]:
from transformers import pipeline
question_answerer = pipeline('question-answering')
geography = question_answerer({
    'question': 'Which geographical area is Sweden a part of?',
    'context': context})
print("Sweden is located within the geographical area of:", geography["answer"])

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


Sweden is located within the geographical area of: Fennoscandia
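The log message above appears whenever a pipeline is created without a model name. One way to follow its advice is to pass the model and revision explicitly. The sketch below reuses the model name and revision printed in the log; the helper name `build_question_answerer` is made up for this illustration, and the first call downloads the model if it is not already cached.

```python
def build_question_answerer():
    """Create a question-answering pipeline with an explicit model and revision."""
    # Imported inside the function so that defining it does not load transformers.
    from transformers import pipeline

    return pipeline(
        "question-answering",
        # Model name and revision taken from the log message above.
        model="distilbert/distilbert-base-cased-distilled-squad",
        revision="564e9b5",
    )
```

Calling `question_answerer = build_question_answerer()` should give a pipeline that behaves like the default one, since it names the same model the default resolved to.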


# Practice - use a pipeline

Now, try it yourself.  Find out what the area of Sweden is in square miles.

1. Save the result of a question_answerer() call to a variable.
2. Print "Land area of Sweden in square miles is approximately" followed by the answer to the question.

In [None]:
# Your Code Here
land_area = question_answerer({
    'question': 'How large is Sweden in square miles?',
    'context': context})
print("Land area of Sweden in square miles is approximately: ", land_area["answer"])

Land area of Sweden in square miles is approximately:  173,860


# Masked Language Model

This next section explores a different transformer pipeline: <b>fill-mask</b>.

<code>fill-mask</code> also gives us answers, but it works differently from question-answering: instead of extracting an answer from a context passage, it predicts the word hidden behind a mask token.

Reference: https://huggingface.co/tasks/fill-mask


In [10]:
# create a pipeline using 'fill-mask'.
# This time, we specify which model to use: bert-base-cased. 
unmasker = pipeline('fill-mask', model='bert-base-cased')
capital = unmasker("The capital of Australia is [MASK].")
capital

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


[{'score': 0.7679561972618103,
  'token': 13400,
  'token_str': 'Canberra',
  'sequence': 'The capital of Australia is Canberra.'},
 {'score': 0.0704473927617073,
  'token': 4141,
  'token_str': 'Melbourne',
  'sequence': 'The capital of Australia is Melbourne.'},
 {'score': 0.0512290894985199,
  'token': 6908,
  'token_str': 'Adelaide',
  'sequence': 'The capital of Australia is Adelaide.'},
 {'score': 0.0295842457562685,
  'token': 3122,
  'token_str': 'Sydney',
  'sequence': 'The capital of Australia is Sydney.'},
 {'score': 0.02096281200647354,
  'token': 7217,
  'token_str': 'Brisbane',
  'sequence': 'The capital of Australia is Brisbane.'}]

Notice a few differences with how this works:

1. fill-mask does not need a context passage; it relies on what the model learned during training.
2. fill-mask provides multiple results, and a score associated with each result.
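Because fill-mask returns a list of candidates, a little post-processing helps when only the best guess is needed. This sketch works on values copied from the output above, trimmed to the two fields it uses; the variable names are chosen for illustration.

```python
# Scores and tokens copied from the bert-base-cased output above.
predictions = [
    {"score": 0.7679561972618103, "token_str": "Canberra"},
    {"score": 0.0704473927617073, "token_str": "Melbourne"},
    {"score": 0.0512290894985199, "token_str": "Adelaide"},
    {"score": 0.0295842457562685, "token_str": "Sydney"},
    {"score": 0.02096281200647354, "token_str": "Brisbane"},
]

# Pick the candidate with the highest score.
best = max(predictions, key=lambda p: p["score"])

# Print a compact ranking table.
for rank, p in enumerate(predictions, start=1):
    print(f"{rank}. {p['token_str']:<10} {p['score']:.3f}")

# The five scores sum to less than 1; the remainder belongs to
# candidates outside the five that were returned.
top5_total = sum(p["score"] for p in predictions)
print("Best guess:", best["token_str"])
```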

# Practice - fill-mask

Create a fill-mask pipeline the same way, but use the model "distilroberta-base" instead.

NOTE: When using the distilroberta-base model, write the mask as "\<mask>" instead of "[MASK]".
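Since the mask token differs between models, one option is to keep the sentence as a template and fill in the token separately. The helper `make_masked_prompt` below is a hypothetical convenience function, not part of the transformers library.

```python
def make_masked_prompt(template: str, mask_token: str) -> str:
    """Fill the {mask} placeholder in `template` with a model's mask token."""
    return template.format(mask=mask_token)

template = "The capital of Australia is {mask}."

# BERT models use "[MASK]"; RoBERTa models use "<mask>".
bert_prompt = make_masked_prompt(template, "[MASK]")
roberta_prompt = make_masked_prompt(template, "<mask>")

print(bert_prompt)
print(roberta_prompt)
```

For a pipeline that is already loaded, the token can also be read from `unmasker.tokenizer.mask_token` rather than typed by hand.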

Answer these questions: 
1. What is the top-ranked result?
2. What is the first result that differs from the bert-base-cased results?
3. How might this inform your choice of base model for mask questions?

In [11]:
# Your Code Here:
unmasker = pipeline('fill-mask', model='distilroberta-base')
capital_distilroberts = unmasker("The capital of Australia is <mask>.")
capital_distilroberts

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


[{'score': 0.6051315069198608,
  'token': 16773,
  'token_str': ' Canberra',
  'sequence': 'The capital of Australia is Canberra.'},
 {'score': 0.1604301631450653,
  'token': 5703,
  'token_str': ' Melbourne',
  'sequence': 'The capital of Australia is Melbourne.'},
 {'score': 0.06049101799726486,
  'token': 10157,
  'token_str': ' Brisbane',
  'sequence': 'The capital of Australia is Brisbane.'},
 {'score': 0.044494856148958206,
  'token': 4290,
  'token_str': ' Sydney',
  'sequence': 'The capital of Australia is Sydney.'},
 {'score': 0.030387917533516884,
  'token': 11426,
  'token_str': ' Perth',
  'sequence': 'The capital of Australia is Perth.'}]

### Answer these questions: 
1. What is the top-ranked result?  
    Canberra
2. What is the first result that differs from the bert-base-cased results?  
    The third result: bert-base-cased returned Adelaide, whereas distilroberta-base returned Brisbane.
3. How might this inform your choice of base model for mask questions?  
    Both models rank the correct answer (Canberra) first, but bert-base-cased gives it a higher score (about 0.77 versus 0.61 for distilroberta-base), so it looks slightly more confident on this prompt. The two models also disagree on the lower-ranked candidates. The token numbers are IDs in each model's own vocabulary, so they cannot be compared across models or read as a measure of size. A single prompt is not enough to choose a base model; it is worth comparing several models on prompts that resemble the real use case.
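The comparison in question 2 can also be done in code. This sketch uses the token strings copied from the two outputs above; note that the distilroberta-base tokens carry a leading space, so they are stripped before comparing. The function name `first_difference` is made up for this illustration.

```python
# Token strings copied from the two fill-mask outputs above, in rank order.
bert_tokens = ["Canberra", "Melbourne", "Adelaide", "Sydney", "Brisbane"]
roberta_tokens = [" Canberra", " Melbourne", " Brisbane", " Sydney", " Perth"]

def first_difference(a, b):
    """Return (rank, token_a, token_b) for the first rank where the lists differ."""
    for rank, (x, y) in enumerate(zip(a, b), start=1):
        # Strip whitespace because RoBERTa tokens include a leading space.
        if x.strip() != y.strip():
            return rank, x.strip(), y.strip()
    return None

difference = first_difference(bert_tokens, roberta_tokens)
print(difference)
```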

# One more Mask question example

Our previous example showed good results.  What happens if we ask a different kind of question?


In [12]:
unmasker = pipeline('fill-mask', model='bert-base-cased')
president = unmasker("The current President of the United States is [MASK].")
president

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


[{'score': 0.18453288078308105,
  'token': 10942,
  'token_str': 'Hon',
  'sequence': 'The current President of the United States is Hon.'},
 {'score': 0.08316890150308609,
  'token': 7834,
  'token_str': 'Democrats',
  'sequence': 'The current President of the United States is Democrats.'},
 {'score': 0.0527765117585659,
  'token': 3215,
  'token_str': 'Republican',
  'sequence': 'The current President of the United States is Republican.'},
 {'score': 0.033744927495718,
  'token': 11115,
  'token_str': 'Republicans',
  'sequence': 'The current President of the United States is Republicans.'},
 {'score': 0.0264601893723011,
  'token': 7661,
  'token_str': 'Obama',
  'sequence': 'The current President of the United States is Obama.'}]

# Practice

Work with the question-answering pipeline, using a PDF training manual as the context.

NOTE: The final project asks students to create a chatbot using an external data source as input. If you decide to use a PDF document, these steps may help.

### First, find a PDF and convert it to Text.

Steps for the student to complete:
1. Download the coffee shop operations manual from Smith here (or available in this week's resources): https://static1.squarespace.com/static/5ad7d558f93fd4cf94252090/t/5c54d0e4fa0d6014a4772292/1549062373251/SmithTraining.pdf
2. After downloading it, open it with Adobe Acrobat.
3. Save the file as a text file using the "Save as Text..." option in the "Menu" hamburger menu in the top left of Acrobat.


### The following steps are already done for the student (just use the code provided):

#### Next, import the text file into the notebook.  

4. Import the text file into your notebook and save the data into a variable.  Example code is provided.

NOTE: The example code assumes the .txt file is in the same directory as your notebook. It assumes your file name is "SmithTrainingManual.txt".

In [16]:
# Read in the file:
with open('SmithTrainingManual.txt', 'r', encoding='utf-8') as f:
    manual = f.read()
    
# Test to ensure the file imported.
print("length of dataset in characters: ", len(manual))

length of dataset in characters:  55323


#### Create a pipeline object for question-answering. 

5. Use a pipeline for question-answering, following the procedure above. Example code is provided.

In [17]:
manual_question = pipeline('question-answering')


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


#### Ask a question.

6. In your question-answering pipeline, ask a question that you might expect the code to answer. An example has been given.


In [None]:
# Example Question:
question = "What is the first task in the opening checklist?"






## Perform these steps:

7. Save the response into a variable.


In [21]:
# Your Code Here:
manual_answer = manual_question({
    'question': question,
    'context': manual})

8. Print the response.

In [22]:
# Your Code Here:
print("The first task in the opening checklist is:", manual_answer["answer"])


The first task in the opening checklist is: simply letting the guest know you understand the issue


## Reflection

### Answer the following questions.

### 9a. Does the result make sense?

Answer: Not really. It appears to describe the first step in resolving customer issues, not the first task in the opening checklist.
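One possible way to investigate a mismatch like this is to narrow the context to the parts of the document that mention the topic before asking the question. The sketch below is a plain keyword search, not a feature of the pipeline; `select_passages`, the `window` size, and the sample text are all placeholders for illustration, and the sample text is not taken from the manual.

```python
import re
from typing import List

def select_passages(text: str, keyword: str, window: int = 500) -> List[str]:
    """Return slices of `text` around each case-insensitive match of `keyword`."""
    passages = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        # Keep `window` characters on each side of the match, within bounds.
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        passages.append(text[start:end])
    return passages

# Placeholder text standing in for the manual.
sample = "Section A covers guest issues. Section B is the Opening Checklist: placeholder task."
passages = select_passages(sample, "opening checklist", window=20)
print(passages)
```

With the real file, something like `select_passages(manual, "opening checklist")` could supply a shorter context to pass to `manual_question`, assuming the manual uses that phrase.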

### 9b. Can you see this tool being useful for an organization?  Why or why not?

Answer: Yes. With better accuracy and a larger body of context, I can see it being useful for answering employee questions that would otherwise require asking a manager or looking up the answer in a manual.

## Practice
10. Ask a question irrelevant to the contents of the document, save the response into a variable, and print the response.


In [25]:
# Example Question:
question = "How many licks does it take to get to the center of a Tootsie Pop?"


In [26]:
# Save the response into a variable:
# Your Code Here:
candy_answer = manual_question({
    'question': question,
    'context': manual})

In [27]:
# Print the response:
# Your Code Here:
print(f"It takes {candy_answer['answer']} licks to get to the center of a Tootsie Pop.")

It takes 4 licks to get to the center of a Tootsie Pop.


## Reflection

### Answer the following questions:


### 11a. How likely do you think it is that someone would use this tool in this way?


Answer: Very unlikely. However, if you ask a question that signals what kind of answer you're looking for (like a quantity), the tool can provide answers that _look_ correct.

### 11b. What are some potential implications for designing a tool that accepts user input against a user manual like this?

Answer: You'd want some way to validate that an answer is correct, such as requiring the model to provide a source or checking whether multiple models agree, to ensure that you are giving factual information. In extreme cases, the model could provide incorrect information that, when acted upon, increases legal or compliance risk. In these cases, human or rule-based answers make more sense.
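As one small illustration of the validation idea, the score in the result dictionary can be used to decide whether to show an answer at all. This is only a sketch: `answer_or_fallback`, the 0.5 threshold, and the fallback message are assumptions made here, and a high score does not guarantee a correct answer. The second dictionary uses a made-up score.

```python
def answer_or_fallback(result: dict, min_score: float = 0.5,
                       fallback: str = "I'm not sure. Please check the manual or ask a manager.") -> str:
    """Return the answer only when the model's score meets the threshold."""
    if result["score"] >= min_score:
        return result["answer"]
    return fallback

# Score and answer copied from the first question-answering result in this lab.
confident = {"score": 0.9498878121376038, "answer": "Stockholm"}

# Hand-built example with a made-up low score.
uncertain = {"score": 0.02, "answer": "4"}

print(answer_or_fallback(confident))
print(answer_or_fallback(uncertain))
```

A suitable threshold would need to be chosen by testing on real questions, and a chatbot for the final project might combine it with other checks such as showing the source passage.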