# Transformers, what can they do?  

### &nbsp;&nbsp; ... and how to *call*  &nbsp;&nbsp; them, in Python?  &nbsp;&nbsp; [Econ176 version]

<br>

***Be sure to make your own copy of this notebook***

<br>

This notebook follows the advice, arc, and ideas of the [Hugging Face Natural Language Processing course](https://huggingface.co/learn/nlp-course/chapter1/1?fw=pt) &nbsp; <font size="-1">with many thanks to all at HF!</font>

<br>

The idea is to get familiar with the interactions available from Transformer models, including what they do well (and not so well), in library form.

It would be surprising if you ***don't*** run into the prompting, fine-tuning, and programmatic use of these models in the future!

<br>

In this notebook, you'll see <font color="DodgerBlue">Econ176 Tasks</font> at various points...

Most of them invite you to create new examples for each Transformer capability --

and to comment on how well - <i>or not</i> - the LLMs can handle those tasks.

#### Installing the libraries needed

These next cells should install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install transformers



In [2]:
!pip install datasets evaluate transformers[sentencepiece]

Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)

<hr>

## Sentiment-classification

This is an "encoder-only" application.

It uses one classification layer on top of the encoder's "semantic connections":

In [3]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print()  # blank line
print("Complete. Libraries loaded...")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Device set to use cpu



Complete. Libraries loaded...


In [4]:
#
# Try out a single sentence...
#

classifier("Some people are skeptical about generative AI.")

[{'label': 'NEGATIVE', 'score': 0.9846604466438293}]

In [5]:
#
# Try out multiple sentences...
#

classifier(
    ["I've been waiting for a HuggingFace course my whole life.",
     "Aargh! I loathe this so much!"]  # could change this to "love"  :)
)

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9984083771705627}]

#### <font color="DodgerBlue">Econ176 Task</font>
+ create a list of 5-6 sentences below and run them through the classifier...
+ briefly, comment on how much you agree/disagree with the LLM's judgments!
+ You'll note that the default sentiment classifier is "extreme": it very rarely gives _neutral_ scores, i.e., ones near 0.5 (the score is the model's confidence in the label it chose).
+ See if you can find a sentence whose score is less than 0.9, for either label

In [6]:
# Feel free to use this cell -- or edit the one above...


my_sentences = [
    "The stock market is experiencing a lot of volatility lately.",
    "This new technology seems promising for future development.",
    "I'm unsure about the outcome of this project.",
    "The customer service was neither good nor bad.",
    "It's just an average day.",
    "The report was factually correct but lacked depth."
]
classifier(my_sentences)

# Agree with all classifications, impressive!

[{'label': 'NEGATIVE', 'score': 0.995173990726471},
 {'label': 'POSITIVE', 'score': 0.9985445737838745},
 {'label': 'NEGATIVE', 'score': 0.9994297623634338},
 {'label': 'NEGATIVE', 'score': 0.9936608076095581},
 {'label': 'NEGATIVE', 'score': 0.9906170964241028},
 {'label': 'NEGATIVE', 'score': 0.9990177154541016}]
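One way to look for less "extreme" sentences is to turn each result into a signed polarity and to filter on the raw score. The sketch below is only an illustration: `polarity` and `below_threshold` are hypothetical helper names, the results are copied (rounded) from the cells above, and it assumes a two-label classifier whose score is the probability of the label it predicted.

```python
def polarity(result):
    """Map a {'label', 'score'} dict onto [-1, 1], where 0 would be neutral.

    Assumes a two-label classifier whose score is the probability of the
    predicted label (so the score itself never drops below 0.5).
    """
    signed = 2 * result["score"] - 1
    return signed if result["label"] == "POSITIVE" else -signed


def below_threshold(sentences, results, threshold=0.9):
    """Return (sentence, result) pairs whose score is under the threshold."""
    return [(s, r) for s, r in zip(sentences, results) if r["score"] < threshold]


# Sentences and (rounded) results copied from the cells above.
sentences = [
    "Some people are skeptical about generative AI.",
    "I've been waiting for a HuggingFace course my whole life.",
    "It's just an average day.",
]
results = [
    {"label": "NEGATIVE", "score": 0.9847},
    {"label": "POSITIVE", "score": 0.9598},
    {"label": "NEGATIVE", "score": 0.9906},
]

for sentence, result in zip(sentences, results):
    print(f"{polarity(result):+.3f}  {sentence}")

# An empty list means no sentence scored under 0.9.
print(below_threshold(sentences, results))
```

With a live pipeline, the same helpers could be applied to `classifier(my_sentences)` directly.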

<hr>

## <i>Zero-shot</i> classification (no additional training)

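This is another classification application of transformers, though the default model loaded below (facebook/bart-large-mnli) is an encoder-decoder model rather than encoder-only.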

Above, the classifier used only two fixed labels, _positive_ and _negative_.

Here, you get to choose the classification categories yourself. Because the labels are adjustable, this approach is more likely to be useful in business applications:

In [7]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

print()  # blank line
print("Complete. Libraries loaded...")


No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cpu



Complete. Libraries loaded...


In [8]:
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445994257926941, 0.11197380721569061, 0.04342673346400261]}

In [9]:
classifier(
    "I am not looking forward to the election this year...",
    candidate_labels=["education", "politics", "business"],
)

{'sequence': 'I am not looking forward to the election this year...',
 'labels': ['politics', 'business', 'education'],
 'scores': [0.9777106642723083, 0.016249822452664375, 0.006039463449269533]}
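The default model name (`bart-large-mnli`) hints at how this works: each candidate label is turned into a hypothesis sentence, a natural-language-inference model scores how strongly the text entails it, and those scores are normalized across the labels. The sketch below illustrates only that bookkeeping. The template string is believed to be the pipeline's default but should be treated as an assumption, `build_hypotheses` and `softmax` are hypothetical helpers, and the logits are made-up numbers, not model output.

```python
import math


def build_hypotheses(candidate_labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in candidate_labels]


def softmax(values):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["education", "politics", "business"]
hypotheses = build_hypotheses(labels)
print(hypotheses)

# Made-up "entailment" logits, one per hypothesis (not real model output).
toy_entailment_logits = [2.5, -0.5, 0.5]
scores = softmax(toy_entailment_logits)

# Sort labels by score, highest first, mirroring the order of the pipeline's output.
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

This also suggests why label wording matters: the label is read as part of a sentence, so "business" and "corporate earnings" may score quite differently.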

#### <font color="DodgerBlue">Econ176 Task</font>
+ First, create another example of the above classifier, where <font color="Coral"><i>business</i></font> comes out as the most likely label
+ Then, create a <i>completely different example</i>, with <i><b>three other</b></i> <tt>candidate_labels</tt>, or more...
+ Construct an example to show that _each label_ you have chosen is the likeliest for that sentence or text
+ Briefly comment on how much you agree/disagree with the LLM's judgements...

In [12]:
# Feel free to use this cell -- or the one above...


classifier(
    "The company announced record profits in the last quarter.",
    candidate_labels=["education", "politics", "business"],
)

#


{'sequence': 'The company announced record profits in the last quarter.',
 'labels': ['business', 'politics', 'education'],
 'scores': [0.9913414120674133, 0.004493010230362415, 0.004165545105934143]}

In [13]:

new_labels = ["sports", "technology", "food", "travel"]

# Example for sports
classifier(
    "The home team won the championship game in a thrilling overtime victory.",
    candidate_labels=new_labels,
)


{'sequence': 'The home team won the championship game in a thrilling overtime victory.',
 'labels': ['sports', 'technology', 'travel', 'food'],
 'scores': [0.9145822525024414,
  0.03826862573623657,
  0.034547239542007446,
  0.012601900845766068]}

In [14]:

# Example for technology
classifier(
    "The latest smartphone features a revolutionary new camera system and AI capabilities.",
    candidate_labels=new_labels,
)


{'sequence': 'The latest smartphone features a revolutionary new camera system and AI capabilities.',
 'labels': ['technology', 'travel', 'sports', 'food'],
 'scores': [0.9774075150489807,
  0.013180059380829334,
  0.006610660348087549,
  0.0028017729055136442]}

In [15]:

# Example for food
classifier(
    "This restaurant is famous for its authentic pasta dishes and delightful desserts.",
    candidate_labels=new_labels,
)


{'sequence': 'This restaurant is famous for its authentic pasta dishes and delightful desserts.',
 'labels': ['food', 'travel', 'sports', 'technology'],
 'scores': [0.9929121732711792,
  0.0037896151188760996,
  0.001950043486431241,
  0.0013481521746143699]}

In [16]:

# Example for travel
classifier(
    "We're planning a backpacking trip through Southeast Asia next summer.",
    candidate_labels=new_labels,
)

{'sequence': "We're planning a backpacking trip through Southeast Asia next summer.",
 'labels': ['travel', 'sports', 'technology', 'food'],
 'scores': [0.9823882579803467,
  0.008485807105898857,
  0.005118268076330423,
  0.004007652401924133]}
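When comparing several runs like these, it can help to pull out just the winning label and how far ahead it is. A minimal sketch, assuming (as in every output above) that the pipeline returns labels already sorted from highest to lowest score; `top_label` is a hypothetical helper and the sample result is copied, rounded, from the "record profits" cell.

```python
def top_label(result):
    """Return (label, score, margin over the runner-up) for a zero-shot result."""
    labels, scores = result["labels"], result["scores"]
    margin = scores[0] - scores[1] if len(scores) > 1 else scores[0]
    return labels[0], scores[0], margin


# Result copied from the "record profits" cell above (scores rounded).
result = {
    "sequence": "The company announced record profits in the last quarter.",
    "labels": ["business", "politics", "education"],
    "scores": [0.9913, 0.0045, 0.0042],
}

label, score, margin = top_label(result)
print(f"{label}: score={score:.3f}, margin={margin:.3f}")
```

A small margin would be a hint that the sentence sits between two categories, which is often the more interesting case to comment on.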

<hr>

## Text generation applications

This is a "decoder-only" application of transformers.

Strictly speaking, GPT-2 (the default model here) has no separate encoder: the same decoder stack reads the prompt and then predicts each next token:

In [17]:
from transformers import pipeline

generator = pipeline("text-generation")

print()  # blank line
print("Complete. Libraries loaded...")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cpu



Complete. Libraries loaded...


In [18]:
generator("In this course, we will teach you how to")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to set up, configure and apply VBScript commands for Visual Basic to use for building and running the projects.\n\nThe first step with this course – to get started – is to install VB'}]

In [19]:
from transformers import pipeline

generator2 = pipeline("text-generation", model="distilgpt2")

print()  # blank line
print("Complete. distilgpt2 library loaded...")


config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/353M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cpu



Complete. distilgpt2 library loaded...


In [20]:
generator2(
    "In this course, we will teach you how to",
    max_length=50,
    truncation=True,
    num_return_sequences=3,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "In this course, we will teach you how to generate and execute new functions, as described below. If you haven't used the Haskell interface yet, you may have missed out on a great article that explains how some libraries use its methods, but this"},
 {'generated_text': 'In this course, we will teach you how to manipulate the game to make you better on your own.\n\nOne of the key topics on this workshop is that you will be able to customize the default difficulty slider on your game and, to this'},
 {'generated_text': 'In this course, we will teach you how to use PowerShell, and demonstrate the way you can use your own tools.'}]
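The three continuations differ because the generator appears to sample each next token from a probability distribution rather than always taking the single most likely one. The toy sketch below illustrates that idea with the standard library only. The vocabulary and logits are invented for illustration, `sample_next_token` is a hypothetical helper, and the real model works over tens of thousands of tokens.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample one token from a {token: logit} dict after temperature scaling."""
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up logits for the token after "...teach you how to" (not model output).
toy_logits = {" use": 2.0, " build": 1.5, " write": 1.2, " cook": -1.0}

rng = random.Random(0)  # seeded so the run is repeatable

# At temperature 1.0 the draws are spread across the plausible tokens.
draws = [sample_next_token(toy_logits, temperature=1.0, rng=rng) for _ in range(5)]
print(draws)

# At a very low temperature the highest-logit token wins every time.
cold = [sample_next_token(toy_logits, temperature=0.01, rng=rng) for _ in range(5)]
print(cold)
```

Repeating this one token at a time is what makes each run wander off in a different direction, even from the same prompt.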

#### <font color="DodgerBlue">Econ176 Task</font>
+ Run the above prompt 2-3 more times to see the results...
+ Then, create a <i>completely different prompt</i>, and again run it 2-3 times to get a sense of the "space of possibilities" the generator will create...
+ As before, briefly comment on ***how smoothly expressed*** and ***how thematically natural*** the generator's results are ...

In [21]:
# Feel free to use this cell -- or the one above...

new_prompt = "The future of finance will likely involve"
generator2(
    new_prompt,
    max_length=50,
    truncation=True,
    num_return_sequences=3,
)

#

# The generated text seems to be consistently on theme but it doesn't make much sense.

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "The future of finance will likely involve a new framework for the market that's important for the environment and has a good way of addressing that.”"},
 {'generated_text': 'The future of finance will likely involve different groups and individuals, who both will support and understand that they have the greatest resources, and, by the time it gets done, both institutions will be in control.”\n\n\n\nIt will also'},
 {'generated_text': 'The future of finance will likely involve the most ambitious and high-tech and powerful finance firms. In a bid to create opportunities for their clients, the Swiss Bank of England and Credit Agricole (CRB), which oversees these firms, will spend less'}]

<hr>

## Mask-filling / word-replacement applications

In [22]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=5)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/331M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cpu


[{'score': 0.19619767367839813,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052715748548508,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'},
 {'score': 0.033018019050359726,
  'token': 27930,
  'token_str': ' predictive',
  'sequence': 'This course will teach you all about predictive models.'},
 {'score': 0.03194151446223259,
  'token': 745,
  'token_str': ' building',
  'sequence': 'This course will teach you all about building models.'},
 {'score': 0.024522872641682625,
  'token': 3034,
  'token_str': ' computer',
  'sequence': 'This course will teach you all about computer models.'}]
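Two practical details are worth noting when writing your own prompts. First, the mask token depends on the model: the default model here uses `<mask>`, while BERT-style models use `[MASK]` (with a loaded pipeline, `unmasker.tokenizer.mask_token` should report the right one). Second, the top five scores need not add up to much. The sketch below uses a hypothetical helper, `make_masked_prompt`, and the rounded scores from the output above.

```python
def make_masked_prompt(template, mask_token, placeholder="___"):
    """Swap a neutral placeholder for the model-specific mask token."""
    if template.count(placeholder) != 1:
        raise ValueError("template must contain exactly one placeholder")
    return template.replace(placeholder, mask_token)


template = "This course will teach you all about ___ models."
print(make_masked_prompt(template, "<mask>"))  # RoBERTa-style token, as used above
print(make_masked_prompt(template, "[MASK]"))  # BERT-style token

# Scores of the five suggestions shown above (rounded).
top5_scores = [0.1962, 0.0405, 0.0330, 0.0319, 0.0245]
print(f"probability mass covered by the top 5: {sum(top5_scores):.3f}")
```

Here the five suggestions cover only about a third of the probability mass, so the model considers many other fillers plausible as well.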

#### <font color="DodgerBlue">Econ176 Task</font>
+ Create <i>another prompt</i>, and take a look at the top five or so mask-fill suggestions...
+ As with each example, briefly comment on how well you feel the model has done, relative to your intuition (or overall human expectations)

In [23]:
# Feel free to use this cell -- or the one above...

unmasker("Artificial intelligence is rapidly transforming the <mask> industry.", top_k=5)

# All of these mask-fill suggestions make sense. It is impressive this simple model has the world knowledge to generate believable results.

[{'score': 0.15909530222415924,
  'token': 8568,
  'token_str': ' automotive',
  'sequence': 'Artificial intelligence is rapidly transforming the automotive industry.'},
 {'score': 0.07248751819133759,
  'token': 3717,
  'token_str': ' healthcare',
  'sequence': 'Artificial intelligence is rapidly transforming the healthcare industry.'},
 {'score': 0.06515815854072571,
  'token': 9016,
  'token_str': ' pharmaceutical',
  'sequence': 'Artificial intelligence is rapidly transforming the pharmaceutical industry.'},
 {'score': 0.0631415843963623,
  'token': 4000,
  'token_str': ' entertainment',
  'sequence': 'Artificial intelligence is rapidly transforming the entertainment industry.'},
 {'score': 0.054845914244651794,
  'token': 15064,
  'token_str': ' aerospace',
  'sequence': 'Artificial intelligence is rapidly transforming the aerospace industry.'}]

<hr>

## Named-entity recognition and question-answering

<font color="DodgerBlue">Econ176 Task</font> &nbsp;&nbsp; Run these two examples, then <font color="black"><i>create another example - for each - of your own design</i></font> &nbsp;&nbsp; How does it do?

In [24]:
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
print("\n")

ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/998 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/1.33G [00:00<?, ?B/s]

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/60.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Device set to use cpu






[{'entity_group': 'PER',
  'score': np.float32(0.9981694),
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': np.float32(0.9796019),
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': np.float32(0.9932106),
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
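The `start` and `end` values in this output are character offsets into the input string, so each entity can be recovered by slicing the original text. A small sketch using the entities above (scores rounded); `entities_by_type` is a hypothetical helper name.

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entities copied from the output above (scores rounded).
entities = [
    {"entity_group": "PER", "score": 0.998, "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "score": 0.980, "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "score": 0.993, "word": "Brooklyn", "start": 49, "end": 57},
]


def entities_by_type(text, entities):
    """Group entity spans by type, slicing each span out of the original text."""
    grouped = {}
    for ent in entities:
        span = text[ent["start"]:ent["end"]]
        grouped.setdefault(ent["entity_group"], []).append(span)
    return grouped


print(entities_by_type(text, entities))
```

Slicing by offset, rather than trusting the `word` field, keeps the original casing and spacing of the source text.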

In [25]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
print("\n")

question_answerer(
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
    question="Where do I work?",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Device set to use cpu






{'score': 0.6949766278266907, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

In [26]:
# Feel free to use this cell -- or the one above...

my_ner_sentence = "[REDACTED] attends Claremont McKenna College, which is located in California, and is studying Financial Technology."
ner(my_ner_sentence)


#

[{'entity_group': 'PER',
  'score': np.float32(0.9992169),
  'word': 'Econ176_Participant_5',
  'start': 0,
  'end': 5},
 {'entity_group': 'ORG',
  'score': np.float32(0.9934234),
  'word': 'Claremont McKenna College',
  'start': 14,
  'end': 39},
 {'entity_group': 'LOC',
  'score': np.float32(0.9993088),
  'word': 'California',
  'start': 61,
  'end': 71}]

In [28]:
my_context = "The European Central Bank, headquartered in Frankfurt, Germany, recently adjusted its monetary policy in response to rising inflation."
my_question = "Where is the European Central Bank located?"
question_answerer(
    context=my_context,
    question=my_question,
)


# Looks like these models completed all the tasks I indicated correctly

{'score': 0.9496544599533081,
 'start': 44,
 'end': 62,
 'answer': 'Frankfurt, Germany'}
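The question-answering output is extractive: the answer is a span of the context, located by `start` and `end`, with a confidence score attached. The sketch below checks both, using the result above. `checked_answer` is a hypothetical helper, and the 0.5 cutoff is an arbitrary choice for illustration, not a recommended value.

```python
def checked_answer(result, context, min_score=0.5):
    """Return the answer if it matches its span and clears min_score, else None."""
    span = context[result["start"]:result["end"]]
    if span != result["answer"] or result["score"] < min_score:
        return None
    return span


context = (
    "The European Central Bank, headquartered in Frankfurt, Germany, recently "
    "adjusted its monetary policy in response to rising inflation."
)
# Result copied from the cell above (score rounded).
result = {"score": 0.9497, "start": 44, "end": 62, "answer": "Frankfurt, Germany"}

print(checked_answer(result, context))                  # passes the default cutoff
print(checked_answer(result, context, min_score=0.99))  # rejected by a stricter cutoff
```

Because the model can only return text that appears in the context, a question whose answer is absent will still produce some span, usually with a low score.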

<hr>

## Summarization

Run this example - and then <font color="DodgerBlue"><i>create another of your own design</i></font> &nbsp;&nbsp; How does it do?

Feel free to grab <i>some of your own writing in the past</i> for it to summarize -- or something else that would be interesting to see...

In [29]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Device set to use cpu


[{'summary_text': ' The number of engineering graduates in the United States has declined in recent years . China and India graduate six and eight times as many traditional engineers as the U.S. does . Rapidly developing economies such as China continue to encourage and advance the teaching of engineering . There are declining offerings in engineering subjects dealing with infrastructure, infrastructure, the environment, and related issues .'}]
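Generated summaries can contain small glitches that are easy to miss when skimming, such as the doubled "infrastructure" in the output above. A quick automated check is one way to catch them. This is only a sketch: `repeated_words` is a hypothetical helper, and it flags nothing beyond immediately repeated words.

```python
import re


def repeated_words(text):
    """Return words that appear twice in a row (ignoring case and punctuation)."""
    return re.findall(r"\b(\w+)\W+\1\b", text, flags=re.IGNORECASE)


# Last sentence of the summary produced above.
summary_tail = (
    "There are declining offerings in engineering subjects dealing with "
    "infrastructure, infrastructure, the environment, and related issues ."
)

print(repeated_words(summary_tail))
print(len(summary_tail.split()), "whitespace-separated tokens in this sentence")
```

A check like this says nothing about whether the summary is faithful to the source; that still takes a careful read against the original text.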

In [30]:
# Feel free to use this cell -- or the one above...

my_text_to_summarize = """
Understanding how citizens form opinions in an era dominated by partisan media is crucial, as reliance on sources like Fox News and MSNBC can shape divergent perceptions of reality. This thesis moves beyond static channel choice to investigate the relationship between exposure to dynamic, time-varying characteristics of partisan cable news content and individual policy attitudes. It specifically asks how the volume (salience) and ideological slant (framing) of recent news coverage associate with opinions on key policy issues. To address this, the study integrates individual-level data from the Cooperative Election Study (CES) from 2020-2023 (approx. 68,000 observations across seven policy issues) with a high-frequency dataset of Fox News and MSNBC transcripts. Leveraging large language models, every relevant news segment broadcast during this period was classified for topic and ideological stance (liberal/conservative). These classifications were used to construct daily time series of content volume and slant for each channel and policy topic. For each CES respondent, 7-day rolling aggregates of media content ending the day before their interview were calculated and interacted with their self-reported channel viewership status. Ordinary Least Squares regression models were estimated to predict standardized policy attitudes (0=Conservative, 1=Liberal), incorporating these media exposure interaction terms alongside controls for demographics, ideology, party identification, and county and year fixed effects. The analysis reveals a robust association between the ideological slant of recent media exposure and policy attitudes. Controlling for content volume and other factors, exposure to more liberal-slanted coverage in the preceding week was strongly and statistically significantly associated with holding more liberal views across nearly all policy domains for viewers of both channels (most p<0.001). 
For instance, a hypothetical shift from perfectly balanced to exclusively liberal slant (+1 change) in Fox News' coverage of assault weapons over a week was associated with a 0.146 (SE=0.012) increase in support for a ban among Fox viewers. The effects associated with normalized slant generally dominated those related to content volume, which were smaller and less consistent. A simpler model examining only the net directional tone (liberal minus conservative segments) also showed highly significant associations in the expected direction (e.g., each additional net liberal Fox segment on CO2 regulation associated with a 0.068 point increase in support, SE=0.004, p<0.001). These findings provide quantitative support for the hypothesis that how partisan news outlets frame issues is strongly correlated with audience opinion, distinct from mere channel selection or the overall amount of coverage. This demonstrates the potential role of specific, time-varying media narratives in reinforcing and potentially driving policy attitude polarization, showing the value of integrating granular, AI-assisted content analysis with large-scale survey data to understand media influence.
"""
summarizer(my_text_to_summarize)

# This seems like an ok summary of my thesis abstract!

[{'summary_text': " Study: Exposure to more liberal-slanted coverage in the preceding week was strongly and statistically significantly associated with holding more liberal views across nearly all policy domains . For instance, a hypothetical shift from perfectly balanced to exclusively liberal slant (+1 change) in Fox News' coverage of assault weapons over a week was associated with a 0.146 (SE=0.012) increase in support for a ban among Fox viewers ."}]

<hr>

## Translation!

This was the original application that motivated the development of the Transformer model.

In [31]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

config.json:   0%|          | 0.00/1.42k [00:00<?, ?B/s]



pytorch_model.bin:   0%|          | 0.00/301M [00:00<?, ?B/s]



generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/301M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/802k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/778k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.34M [00:00<?, ?B/s]

Device set to use cpu


[{'translation_text': 'This course is produced by Hugging Face.'}]

In [32]:
# Feel free to use this cell -- or the one above...

translator("Ce cours est produit par Hugging Face.")


#

[{'translation_text': 'This course is produced by Hugging Face.'}]

#### <font color="DodgerBlue">Econ176 Task</font>
+ Look around and find _another language model_ that HF offers
+ See if you can load it (use the "copy" button that looks like two pieces of paper -- often it includes the _whole path_ to the model)
+ Then, create two more <i>translation prompts</i>, and
+ As with each example, briefly comment on how well you feel the model has done, relative to your intuition (or overall human expectations)
+ Languages in which we've found success so far include Spanish, French, and Hindi - feel free to use one of these or try another...
  

In [33]:
from transformers import pipeline

translator_en_es = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
print("English to Spanish translator loaded.")

config.json:   0%|          | 0.00/1.47k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/312M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/312M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/44.0 [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/802k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/826k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.59M [00:00<?, ?B/s]

Device set to use cpu


English to Spanish translator loaded.


In [34]:
# Prompt 1
translation1 = translator_en_es("Hello, how are you today?")
print(translation1)

# Prompt 2
translation2 = translator_en_es("This is an interesting exercise in machine learning.")
print(translation2)


# Translation worked! Great notebook, thanks!

[{'translation_text': 'Hola, ¿cómo estás hoy?'}]
[{'translation_text': 'Este es un ejercicio interesante en el aprendizaje automático.'}]
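Both translation models used in this notebook follow the same naming pattern, which makes it easy to guess the name for another language pair. The helper below is a hypothetical convenience that only builds the string. The pattern is inferred from the two names used here, and not every pair is guaranteed to exist on the Hub, so it is worth checking the model page before loading.

```python
def opus_mt_model_name(source_lang, target_lang):
    """Build a Helsinki-NLP OPUS-MT model id from two language codes."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


print(opus_mt_model_name("fr", "en"))  # the model used earlier in this notebook
print(opus_mt_model_name("en", "es"))  # the model loaded in the cell above
```

The resulting string can be passed as the `model` argument of `pipeline("translation", ...)`, as in the cells above.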


<br>
<br>

<hr>

## You've _transformed_ !

In fact, you've completed -- and expanded upon -- <font color="DodgerBlue"><b>Section 1</b></font> of the [Hugging Face NLP course](https://huggingface.co/learn/llm-course/en/chapter1/1) ...

That is all that's asked for this _Transformer-based_ assignment.

That said, your future path, whether for Econ 176 or something else entirely, may bring you back to experiment more with natural language processing.

If so, you'll be able to pick up where you left off, and then
+ look inside the Transformer models' individual components
+ fine-tune existing models into special-purpose classifiers
  + fine-tuning might help with some of the business-exploration
+ explore other resources from the HF collection of models and libraries
+ all with the goal of increasing our own sophistication, namely about how sophisticated (or not) LLMs are...

<br>

Big-picture, _programming-focused_ launching points like Hugging Face are likely to become an increasingly common way to interact with computational libraries. And _Transformers_ are likely to be around - and improving - for a while!


