# <center> **Introduction to Natural Language Processing with Hugging Face Transformers** </center>


In [1]:
### Installing Required Libraries

In [2]:
!pip install torch



In [3]:
!pip install --upgrade torch



In [4]:
!pip install -q transformers

In [5]:
!pip install datasets evaluate transformers[sentencepiece]



In [6]:
!pip install sacremoses



In [7]:
### Importing Required Libraries

In [8]:
import warnings
warnings.filterwarnings('ignore')

In [9]:
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import AutoModel



In [10]:
### Example 1 - Sentiment Analysis

In [12]:
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

In [13]:
classifier("Having three long haired, heavy shedding dogs at home, I was pretty skeptical that this could hold up to all the hair and dirt they trek in, but this wonderful piece of tech has been nothing short of a godsend for me! ")

[{'label': 'POSITIVE', 'score': 0.9982457160949707}]

In [14]:
classifier("I like cat")

[{'label': 'POSITIVE', 'score': 0.9994874000549316}]

In [15]:
### Example 2 - Topic Classification

In [17]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "Exploratory Data Analysis is the first course in Machine Learning Program that introduces learners to the broad range of Machine Learning concepts, applications, challenges, and solutions, while utilizing interesting real-life datasets",
    candidate_labels=["art", "natural science", "data analysis"],
)

{'sequence': 'Exploratory Data Analysis is the first course in Machine Learning Program that introduces learners to the broad range of Machine Learning concepts, applications, challenges, and solutions, while utilizing interesting real-life datasets',
 'labels': ['data analysis', 'art', 'natural science'],
 'scores': [0.9957792162895203, 0.0026982580311596394, 0.0015224860981106758]}

In [18]:
### Example 3 - Text Generation
generator = pipeline("text-generation", model="gpt2")
generator("This course will teach you")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'This course will teach you many more ways to use JavaScript code than just learning to program.\n\nCourse objectives\n\nCreate an application for PHP, jQuery and jQuery 2 or older. Create an Excel spreadsheet. Send all of your data into, say'}]

In [19]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "This course will teach you",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'This course will teach you how to get started with the development and development of an app and to see how to get started with your project quickly. You'},
 {'generated_text': 'This course will teach you how to use the right software for your project design:\n\n\n\n\n\n\n\n\n\n\n\n\n\n'}]

In [20]:
### Example 4 - Named Entity Recognition (NER)

In [22]:
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")
ner("My name is Roberta and I work with IBM Skills Network in Toronto")

[{'entity_group': 'PER',
  'score': 0.9993105,
  'word': 'Roberta',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9976597,
  'word': 'IBM Skills Network',
  'start': 35,
  'end': 53},
 {'entity_group': 'LOC',
  'score': 0.99702173,
  'word': 'Toronto',
  'start': 57,
  'end': 64}]
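
The grouped NER output is a list of dictionaries, so it can be reorganised with plain Python. The sketch below is only an illustration: `group_by_type` is a hypothetical helper (not part of `transformers`), and the `entities` list is copied from the output above so the snippet runs without loading the model.

```python
from collections import defaultdict

# Output copied from the NER cell above.
entities = [
    {"entity_group": "PER", "score": 0.9993105, "word": "Roberta", "start": 11, "end": 18},
    {"entity_group": "ORG", "score": 0.9976597, "word": "IBM Skills Network", "start": 35, "end": 53},
    {"entity_group": "LOC", "score": 0.99702173, "word": "Toronto", "start": 57, "end": 64},
]


def group_by_type(entities):
    """Collect entity words under their entity_group label (hypothetical helper)."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


by_type = group_by_type(entities)
print(by_type)
```

Grouping by `entity_group` makes it easy to answer questions such as "which organisations are mentioned?" with `by_type.get("ORG", [])`.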

In [23]:
del ner

In [24]:
### Example 5 - Question Answering
qa_model = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
question = "Which name is also used to describe the Amazon rainforest in English?"
context = "The Amazon rainforest, also known in English as Amazonia or the Amazon Jungle."
qa_model(question=question, context=context)

{'score': 0.8247062563896179, 'start': 48, 'end': 56, 'answer': 'Amazonia'}
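
The `start` and `end` values in the result are character offsets into `context`, which is what makes this model extractive: the answer is a slice of the input rather than newly generated text. A minimal sketch, assuming the result dictionary shown above (`answer_span` is a hypothetical helper name):

```python
# Context and result copied from the question-answering cell above.
context = "The Amazon rainforest, also known in English as Amazonia or the Amazon Jungle."
result = {"score": 0.8247062563896179, "start": 48, "end": 56, "answer": "Amazonia"}


def answer_span(context, result):
    """Slice the answer out of the context using the reported character offsets."""
    return context[result["start"]:result["end"]]


span = answer_span(context, result)
print(span)
```

The slice matches the `answer` field, so the offsets can be used to highlight the answer inside the original passage.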

In [25]:
### Example 6 - Text Summarization

In [26]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
summarizer(
    """
Exploratory Data Analysis is the first course in Machine Learning Program that introduces learners to the broad range of Machine Learning concepts, applications, challenges, and solutions, while utilizing interesting real-life datasets. So, what is EDA and why is it important to perform it before we dive into any analysis?
EDA is a visual and statistical process that allows us to take a glimpse into the data before the analysis. It lets us test the assumptions that we might have about the data, proving or disproving our prior believes and biases. It lays foundation for the analysis, so our results go along with our expectations. In a way, it’s a quality check for our predictions.
As any data scientist would agree, the most challenging part in any data analysis is to obtain a good quality data to work with. Nothing is served to us on a silver plate, data comes in different shapes and formats. It can be structured and unstructured, it may contain errors or be biased, it may have missing fields, it can have different formats than what an untrained eye would perceive. For example, when we import some data, very often it would contain a time stamp. To a human it is understandable format that can interpreted. But to a machine, it is not interpretable, so it needs to be told what that means, the data needs to be transformed into simple numbers first. There are also different date-time conventions depending on a country (i.e., Canadian versus USA), metric versus imperial systems, and many other data features that need to be recognized before we start doing the analysis. Therefore, the first step before performing any analysis – is get really aquatinted with your data!
This course will teach you to ‘see’ and to ‘feel’ the data as well as to transform it into analysis-ready format. It is introductory level course, so no prior knowledge is required, and it is a good starting point if you are interested in getting into the world of Machine Learning. The only thing that is needed is some computer with internet, your curiosity and eagerness to learn and to apply acquired knowledge.  If you live in Canada, you might be interested about gasoline prices in different cities or if you are an insurance actuary you need to analyze the financial risks that you will take based on your clients information. Whatever is the case, you will be able to do your own analysis, and confirm or disprove some of the existing information.
The course contains videos and reading materials, as well as well as a lot of interactive practice labs that learners can explore and apply the skills learned. It will allow you to use Python language in Jupyter Notebook, a cloud-based skills network environment that is pre-set for you with all available to be downloaded packages and libraries. It will introduce you to the most common visualization libraries such as Pandas, Seaborn, and Matplotlib to demonstrate various EDA techniques with some real-life datasets.

"""
)

[{'summary_text': ' Exploratory Data Analysis is the first course in Machine Learning Program that introduces learners to the broad range of Machine Learning concepts, applications, challenges, and solutions . EDA is a visual and statistical process that allows us to take a glimpse into the data before the analysis . It lays foundation for the analysis so our results go along with our expectations .'}]

In [27]:
del summarizer

In [28]:
### Example 7 - Translation

In [29]:
en_fr_translator = pipeline("translation_en_to_fr", model="t5-small")
en_fr_translator("How old are you?")

[{'translation_text': 'Quel est votre âge ?'}]

In [30]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("La science des données est la meilleure.")

[{'translation_text': 'Data science is the best.'}]

# **Let's Practice**


In [31]:
### Exercise 1 - Sentiment Analysis

In [32]:
specific_model = pipeline(model="cardiffnlp/twitter-roberta-base-sentiment")
data = "Artificial intelligence and automation are already causing friction in the workforce. Should schools revamp existing programs for topics like #AI, or are new research areas required?"
specific_model(data)

[{'label': 'LABEL_1', 'score': 0.5272255539894104}]
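
The generic `LABEL_1` is hard to interpret on its own. According to the model card for `cardiffnlp/twitter-roberta-base-sentiment`, the labels correspond to negative (0), neutral (1), and positive (2). The sketch below applies that mapping to the output above; `LABEL_NAMES` and `readable` are hypothetical names introduced here, and the mapping should be re-checked against the model card if a different checkpoint is used.

```python
# Assumed mapping, based on the model card of cardiffnlp/twitter-roberta-base-sentiment.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}


def readable(predictions):
    """Replace generic LABEL_n ids with human-readable names; unknown labels pass through."""
    return [
        {"label": LABEL_NAMES.get(p["label"], p["label"]), "score": p["score"]}
        for p in predictions
    ]


# Output copied from the cell above.
raw = [{"label": "LABEL_1", "score": 0.5272255539894104}]
named = readable(raw)
print(named)
```

Read this way, the Twitter-tuned model rates the sentence as neutral with moderate confidence.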

In [33]:
original_model = pipeline("sentiment-analysis")
data = "Artificial intelligence and automation are already causing friction in the workforce. Should schools revamp existing programs for topics like #AI, or are new research areas required?"
original_model(data)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9989722967147827}]

In [34]:
### Exercise 2 - Topic Classification

In [35]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "I like reading novels and listening to songs",
    candidate_labels=["art", "education", "travel"],
)

{'sequence': 'I like reading novels and listening to songs',
 'labels': ['art', 'education', 'travel'],
 'scores': [0.398181289434433, 0.37836214900016785, 0.2234565168619156]}
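
Here the top two scores are close, so the predicted label is far less certain than in Example 2. One way to surface this is to compare the best score with the runner-up. The sketch below is an illustration only: `top_label_with_margin` and the `MARGIN_THRESHOLD` value of 0.1 are assumptions made for this example, not part of the `transformers` API.

```python
# Output copied from the zero-shot classification cell above.
result = {
    "sequence": "I like reading novels and listening to songs",
    "labels": ["art", "education", "travel"],
    "scores": [0.398181289434433, 0.37836214900016785, 0.2234565168619156],
}


def top_label_with_margin(result):
    """Return the best label and its lead over the second-best label."""
    ranked = sorted(
        zip(result["labels"], result["scores"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    (best_label, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    return best_label, best_score - runner_up_score


MARGIN_THRESHOLD = 0.1  # hypothetical cut-off; tune for your own data

label, margin = top_label_with_margin(result)
is_ambiguous = margin < MARGIN_THRESHOLD
print(label, round(margin, 4), is_ambiguous)
```

A small margin suggests either adding more specific candidate labels or treating the prediction as undecided.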

In [36]:
### Exercise 3 - Text Generation Models

In [38]:
generator = pipeline("text-generation", model="gpt2")
generator("Hello, I'm a language model", max_length=30, num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model manager. I know what it's like. I'm not a computer programmer. People like programming languages. I'm"},
 {'generated_text': "Hello, I'm a language model\n\nI want to teach people to code with, just like we taught you to code with Python.\n\n"},
 {'generated_text': 'Hello, I\'m a language model at MIT." What kind of a job? "Well, I do a lot of writing, a lot of translation'}]

In [39]:
### Exercise 4 - Named Entity Recognition

In [40]:
nlp = pipeline("ner", model="Jean-Baptiste/camembert-ner", aggregation_strategy="simple")
example = "Her name is Nabilla and she lives in Batam."

ner_results = nlp(example)
print(ner_results)

[{'entity_group': 'PER', 'score': 0.99728644, 'word': 'Nabilla', 'start': 11, 'end': 19}, {'entity_group': 'LOC', 'score': 0.9981779, 'word': 'Batam', 'start': 36, 'end': 42}]


In [42]:
### Exercise 5 - Question Answering

In [41]:
question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
question_answerer(
    question="Which lake is one of the five Great Lakes of North America?",
    context="Lake Ontario is one of the five Great Lakes of North America. It is surrounded on the north, west, and southwest by the Canadian province of Ontario, and on the south and east by the U.S. state of New York, whose water boundaries, along the international border, meet in the middle of the lake.",
)

{'score': 0.9834363460540771, 'start': 0, 'end': 12, 'answer': 'Lake Ontario'}

In [43]:
### Exercise 6 - Text Summarization

In [2]:
# pipeline must be imported again in a fresh kernel session
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", max_length=59)
summarizer(
    """
Lake Superior in central North America is the largest freshwater lake in the world by surface area and the third-largest by volume, holding 10% of the world's surface fresh water. The northern and westernmost of the Great Lakes of North America, it straddles the Canada–United States border with the province of Ontario to the north, and the states of Minnesota to the northwest and Wisconsin and Michigan to the south. It drains into Lake Huron via St. Marys River and through the lower Great Lakes to the St. Lawrence River and the Atlantic Ocean.
"""
)

In [None]:
### Exercise 7 - Translation

In [1]:
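# pipeline must be imported again in a fresh kernel session
from transformers import pipeline

# Note: t5-small was trained for English-to-German, -French, and -Romanian translation,
# so its Indonesian output is unlikely to be reliable.
translator = pipeline("translation_en_to_id", model="t5-small")
print(translator("New York is my favourite city", max_length=40))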