<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# <center> **Introduction to Natural Language Processing with Hugging Face Transformers** </center>


<center> <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX0AIAEN/Transformers_models.png" width="60%" alt="Transformer models"> </center>

Image Source: [nvidia](https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transformer-model/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01)


Estimated time needed: **30** minutes


This Guided Project will walk you through some of the applications of Hugging Face Transformers in Natural Language Processing (NLP).

The Hugging Face Transformers package is a very popular and versatile Python library that provides pre-trained models for a variety of applications in NLP, as well as other areas such as image analysis, audio analysis, and multimodal analysis (optical character recognition, video classification, visual question answering and many more).


This Guided Project will focus on text analysis tasks, which are:

*   Text Classification.
    *   Sentiment Analysis. Classifies the polarity of a given text.
*   Topic Classification. Classifies sequences into specified class names.
*   Text Generator. Generates text from a given input.
*   Token Classification.
    *   Named Entity Recognition (NER). Labels each word with the entity it represents.
*   Question answering. Extracts the answer from the context.
*   Text Summarization. Generates a summary of a long sequence of text or document.
*   Translation. Translates text into another language.


## __Table of Contents__

<ol>
    <li><a href="#Objectives">Objectives</a></li>
    <li>
        <a href="#Setup">Setup</a>
        <ol>
            <li><a href="#Installing-Required-Libraries">Installing Required Libraries</a></li>
            <li><a href="#Importing-Required-Libraries">Importing Required Libraries</a></li>
        </ol>
    </li>
    <li>
        <a href="#Background-(optional)">Background (optional)</a>
        <ol>
            <li><a href="#Example-1--Sentiment-Analysis?">Example 1 - Sentiment Analysis</a></li>
            <li><a href="#Example-2--Topic-Classification?">Example 2 - Topic Classification</a></li>
            <li><a href="#Example-3--Text-Generator?">Example 3 - Text Generator: Masked Language Modeling</a></li>
            <li><a href="#Example-4--Name-Entity-Recognition?">Example 4 - Name Entity Recognition</a></li>
            <li><a href="#Example-5--Question-Answering?">Example 5 - Question Answering</a></li>
            <li><a href="#Example-6--Text-Summarization?">Example 6 - Text Summarization</a></li>
            <li><a href="#Example-7--Translation?">Example 7 - Translation</a></li>
        </ol>
    </li>
</ol>

<a href="#Exercises">Exercises</a>
<ol>
    <li><a href="#Exercise-1---Sentiment-Analysis">Exercise 1. Sentiment Analysis</a></li>
    <li><a href="#Exercise-2---Topic-Classification">Exercise 2. Topic Classification</a></li>
    <li><a href="#Exercise-3---Text-Generation">Exercise 3. Text Generation</a></li>
    <li><a href="#Exercise-4---Name-Entity-Recognition">Exercise 4. Name Entity Recognition</a></li>
    <li><a href="#Exercise-5---Question-Answering">Exercise 5. Question Answering</a></li>
    <li><a href="#Exercise-6---Text-Summarization">Exercise 6. Text Summarization</a></li>
    <li><a href="#Exercise-7---Translation">Exercise 7. Translation</a></li>
</ol>


## Objectives

After completing this lab you will be able to:

 - Use Hugging Face Transformers to do:
   - Sentiment Analysis
   - Topic Classification
   - Text Generation
   - Name Entity Recognition
   - Question Answering
   - Text Summarization
   - Text Translation


----


## Setup


For this lab, we will be using the following libraries:

*   [`pandas`](https://pandas.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for managing the data.


### Installing Required Libraries

The following required libraries are pre-installed in the Skills Network Labs environment. However, if you run these notebook commands in a different Jupyter environment (e.g. Watson Studio or Anaconda), you will need to install these libraries by removing the `#` sign before `!mamba` in the code cell below.


In [1]:
# All Libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented.
# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1
# Note: If your environment doesn't support "!mamba install", use "!pip install"

The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:


In [2]:
!pip install torch



In [3]:
!pip install --upgrade torch



In [4]:
!pip install -q transformers

In [5]:
!pip install datasets evaluate transformers[sentencepiece]



In [6]:
!pip install sacremoses



Please **restart the kernel** after running the above installs.


### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [7]:
import warnings
warnings.filterwarnings('ignore')

In [8]:
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import AutoModel




## Background (optional)

### What is a Transformer Model?

A Transformer Model is a neural network that learns context and thus meaning by tracking relationships in sequential data, for example, words in a sentence. Transformer models apply an evolving set of mathematical techniques, called attention or self-attention, differently weighing the significance of each part of the input data. They are used primarily in the fields of natural language processing and computer vision. Similarly to recurrent neural networks (RNNs), transformers are designed to process sequential data, for example a body of text, with applications such as translation and text summarization. However, unlike RNNs, transformers process the entire input all at once. The attention mechanism provides context for any position in the input sequence. For example, if the input data is a natural language sentence, the transformer does not have to process one word at a time. This allows for more parallelization than RNNs, and therefore, reduces training times. The additional training parallelization allows training on larger datasets. This led to the development of pre-trained systems such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), which were trained with large language datasets, such as the Wikipedia Corpus and Common Crawl, and can be fine-tuned for specific tasks.
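To make the idea of attention more concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is only an illustration under simplifying assumptions: real transformers apply learned projection matrices to obtain queries, keys and values and use several attention heads, all of which are omitted here. The two-dimensional token vectors are hypothetical values, not taken from any model.

```python
import math


def softmax(values):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(values)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this position's query to every position's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        mixed = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Hypothetical 2-dimensional embeddings for a three-token input.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
print(attended)
```

Every output vector depends on all input positions at once, which is why the whole sequence can be processed in parallel rather than one word at a time.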

Transformer models were first described in a [2017 paper](https://arxiv.org/abs/1706.03762?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01) from Google. They are among the newest and most powerful classes of models developed to date, and have replaced RNNs in some applications.

### What do Transformer Models do?

Transformer Models have many useful applications in today's world. They are widely used in near real-time translation tasks, opening communication spaces to language-diverse and hearing-impaired audiences. These models are advancing medical research by helping scientists understand DNA and amino acid sequences, which speeds up the development and improves the quality of treatments for different diseases. The models can also detect anomalies to prevent fraud, streamline manufacturing, make recommendations and overall improve our day-to-day quality of life.

### Transformer Model Architecture

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX0AIAEN/The-Transformer-model-architecture.png" width="50%" alt="Transformer model architecture"> 

Image Source: [wikipedia](https://en.wikipedia.org/wiki/Transformer_\(machine_learning_model\)?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01)

The diagram above describes the pipeline between the input, output and positional encoding. For more information on Transformer Architecture visit this [wikipedia](https://en.wikipedia.org/wiki/Transformer_\(machine_learning_model\)?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01) page. 


### How to use Hugging Face Transformers

A very easy way to apply Hugging Face Transformers is to use the `pipeline()` function. The pipeline makes it simple to use any model from the Hugging Face Hub for inference on any language, computer vision, speech, or multimodal task. Each task has an associated pipeline. However, it is simpler to use the general pipeline abstraction, which contains all the task-specific pipelines. The `pipeline()` function automatically loads a default model and a pre-processing class capable of inference for your desired task.

The links beside each of the 7 Examples shown here come from the Hugging Face community and contain more information about the models and the data they were trained on.
This [Hugging Face Page](https://huggingface.co/models?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01) also contains ALL the models created by this community.


### Example 1 - [Sentiment Analysis](https://huggingface.co/blog/sentiment-analysis-python?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01) 


**Sentiment analysis** is a natural language processing technique that identifies the polarity of a given text. Some of the most common practical uses of sentiment analysis are tweet analysis, product reviews, and support ticket classification. Sentiment analysis allows quick processing of large amounts of real-time data. The Diagram below describes the process of Sentiment Analysis: we can use it to predict a label and a score for a random Amazon product review. 

<center> <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX0AIAEN/sent_analysis6.png" width="70%"> </center>

The evaluation metric for this type of analysis is the accuracy score. The default model behind the "sentiment-analysis" pipeline is "distilbert-base-uncased-finetuned-sst-2-english". Its base model, DistilBERT, was pre-trained on a large corpus of English data in a self-supervised fashion, that is, on raw texts only, with no humans labelling them in any way. This checkpoint was then fine-tuned on the labelled SST-2 sentiment dataset. For more information on this model read [here.](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01&text=I+like+you.+I+love+you)


We can start by loading the text classification pipeline from `pipeline()` using the "sentiment-analysis" task identifier. It uses the default "distilbert-base-uncased-finetuned-sst-2-english" model for sentiment analysis. Note that you can skip specifying a model, but you will receive a warning message.


In [9]:
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")



We can copy a random product review from Amazon and pass it to the selected classifier. [Review Source.](https://www.amazon.ca/iRobot-Roomba-Self-Emptying-Robot-Vacuum/dp/B094NYHTMF/ref=sr_1_1_sspa?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01&crid=2LO2ZA75UKXKQ&keywords=roomba&qid=1663097073&sprefix=roomba%2Caps%2C194&sr=8-1-spons&psc=1#customerReviews) 


In [10]:
classifier("Having three long haired, heavy shedding dogs at home, I was pretty skeptical that this could hold up to all the hair and dirt they trek in, but this wonderful piece of tech has been nothing short of a godsend for me! ")


As we see, the sentiment is classified as Positive with a score of 99.8%. Note that this score is the model's confidence in its prediction, not an accuracy measurement.
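The score reported next to a label is a probability derived from the model's raw outputs (logits). The sketch below shows how a softmax turns two logits into a label and a score. The logit values are hypothetical and chosen only for illustration, and `logits_to_prediction` is a helper written for this sketch, not a function of the `transformers` library.

```python
import math


def logits_to_prediction(logits, labels):
    """Convert raw class logits into the most likely label and its probability."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


# Hypothetical logits for the classes NEGATIVE and POSITIVE.
prediction = logits_to_prediction([-3.1, 3.2], ["NEGATIVE", "POSITIVE"])
print(prediction)
```

A large gap between the two logits yields a score close to 1, while equal logits yield a score of 0.5.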


### Example 2 - [Topic Classification](https://huggingface.co/facebook/bart-large-mnli?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01)


The **Topic Classification** task classifies sequences into specified class names. It uses the "zero-shot-classification" pipeline to perform this task. In Zero-Shot Learning (ZSL), a classifier learns on one set of labels and is then evaluated on a different set of labels that it has never seen before. [This link](https://joeddav.github.io/blog/2020/05/29/ZSL.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01) has more information about ZSL. 
As shown in the Diagram below, we also need to specify some kind of descriptor (input labels) for an unseen class (such as a set of visual attributes or simply the class name) in order for our model to be able to predict that class without training data.

<center> <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX0AIAEN/sent_analysis5.png" width="70%"> </center>

First, we load a pipeline with "zero-shot-classification", pass a sequence that we want to classify and a list of candidate labels and see how the model will assign corresponding labels to the input.


In [11]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "Exploratory Data Analysis is the first course in Machine Learning Program that introduces learners to the broad range of Machine Learning concepts, applications, challenges, and solutions, while utilizing interesting real-life datasets",
    candidate_labels=["art", "natural science", "data analysis"],
)


As we see, 'data analysis' is the most likely topic for this input, with a score of 99.6%.


### Example 3 - [Text Generator](https://huggingface.co/tasks/text-generation?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01)


A **Text Generation** model, also known as a causal language model, performs the task of predicting the next word in a sentence, given some previous input. This task is very similar to the auto-complete function we have on our phones. Classification metrics cannot be used in this task, as there is no single correct answer. Instead, the text distribution produced by the model is evaluated by cross-entropy loss and perplexity. The default model behind Text Generation is the Generative Pre-trained Transformer 2, [GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01) model. It can receive an input like "This course will teach you" and proceed to complete the sentence based on those first words, as shown in the Diagram below. 

<center> <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX0AIAEN/text_gener2.png" width="60%" alt="Text generation diagram"> </center>
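As a rough illustration of the two evaluation measures mentioned above, the sketch below computes cross-entropy and perplexity from the probabilities a model assigned to the tokens that actually occurred. The probability lists are hypothetical values, not real model outputs.

```python
import math


def cross_entropy(token_probs):
    """Average negative log-probability assigned to the true next tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is the exponential of the cross-entropy."""
    return math.exp(cross_entropy(token_probs))


# Hypothetical per-token probabilities for two generated continuations.
confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.1, 0.1, 0.1, 0.1]

print(perplexity(confident_model))
print(perplexity(uncertain_model))
```

Lower perplexity means the model was less "surprised" by the text: assigning probability 0.5 to every true token is equivalent to choosing between two equally likely words at each step.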

Similarly to Completion Generation Models, we also have Text-to-Text Models. These models are trained to learn the mapping between a pair of texts (e.g. translation from one language to another). The most popular variants of these models are T5, T0 and BART. Text-to-Text models are trained with multi-tasking capabilities, so they can accomplish a wide range of tasks, including summarization, translation, and text classification. 

Interestingly, causal language models can also be used to generate code, helping programmers with their repetitive coding tasks.

To begin our analysis, we can load a pipeline with the default "text-generation" model:


In [12]:
generator = pipeline("text-generation", model="gpt2")
generator("This course will teach you")


Alternatively, we can also use the "distilgpt2" model, along with some parameters, such as the maximum length and the number of sequences to return. The Distilled GPT-2 model is an English-language model pre-trained with the supervision of the smallest version of GPT-2. Like GPT-2, DistilGPT2 can be used to generate text. For more information about this model, please visit this [link](https://huggingface.co/distilgpt2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01).


In [13]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "This course will teach you",
    max_length=30,
    num_return_sequences=2,
)


#### Masked Language Modeling
 

Sometimes, it is useful to use Masked Language modeling, which also has Text Generation capabilities. **Masked language** modeling is the task of masking some of the words in a sentence and predicting which words should replace those masks. These models are useful when we want to get a statistical understanding of the language on which the model is trained. Masked language models do not require labelled data! They are trained by masking a couple of words in sentences, and the model is expected to guess the masked words. The Diagram below shows a simple representation of this concept.

<center> <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX0AIAEN/fill_mask3.png" width="60%" alt="Fill-mask diagram"> </center>

For example, masked language modeling is used to train large models for domain-specific problems. If you have to work on a domain-specific task, such as retrieving information from medical research papers, you can train a masked language model using those papers.

For more information about "fill-mask" model you can read [here.](https://huggingface.co/tasks/fill-mask?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01)
The Example below shows a few options for a 'masked' word in the input sentence. Let's see 4 options in the output.


In [14]:
unmasker = pipeline("fill-mask", "distilroberta-base")
unmasker("This course will teach you all about <mask> models.", top_k=4)


### Example 4 - [Name Entity Recognition (NER)](https://huggingface.co/Jean-Baptiste/camembert-ner?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01)


**NER**, sometimes also referred to as entity chunking, extraction, or identification, is the task of identifying and categorizing key information (entities) in text. The model labels entities by type, such as person ('PER'), organization ('ORG'), and location ('LOC'), and returns a confidence score and the token location for each. 

The model linked in the heading above, "camembert-ner", was trained and fine-tuned on the wikiner-fr dataset (~170,634 sentences), so it targets French text. The Example below instead loads the English model "dbmdz/bert-large-cased-finetuned-conll03-english". Click [here](https://medium.com/mysuperai/what-is-named-entity-recognition-ner-and-how-can-i-use-it-2b68cf6f545d?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01) to learn more about NER.

<center> <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX0AIAEN/ner2.png" width="60%" alt="Named entity recognition diagram"> </center> 

Let's look at the specific Example, where we load a pipeline with "ner" model and some input:


In [15]:
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", grouped_entities=True)
ner("My name is Roberta and I work with IBM Skills Network in Toronto")



In [16]:
del ner


As we see, the model properly identifies all entities in the sentence with high confidence scores. 
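Token classification models label individual tokens, and neighbouring tokens are then merged into whole entities. The sketch below illustrates that merging step on hand-written tags in the common `B-`/`I-` scheme. It is a simplified illustration: the tags are hypothetical rather than real model output, `group_entities` is a helper written for this sketch, and the real pipeline additionally handles sub-word pieces and confidence scores.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens of the same entity type into one span."""
    entities = []
    current = None
    for word, tag in tagged_tokens:
        if tag == "O":  # "O" marks a token outside any entity
            current = None
            continue
        prefix, _, entity_type = tag.partition("-")
        if current is not None and prefix == "I" and current["entity_group"] == entity_type:
            # Continuation of the entity that is currently open.
            current["word"] += " " + word
        else:
            # Start of a new entity.
            current = {"entity_group": entity_type, "word": word}
            entities.append(current)
    return entities


# Hand-written (hypothetical) tags for the sentence used in this Example.
tagged = [
    ("My", "O"), ("name", "O"), ("is", "O"), ("Roberta", "B-PER"),
    ("and", "O"), ("I", "O"), ("work", "O"), ("with", "O"),
    ("IBM", "B-ORG"), ("Skills", "I-ORG"), ("Network", "I-ORG"),
    ("in", "O"), ("Toronto", "B-LOC"),
]
print(group_entities(tagged))
```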


### Example 5 - [Question Answering](https://huggingface.co/tasks/question-answering?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01)


Another widely used application of Hugging Face transformers is the Question Answering task. **Question Answering** is the task of extracting an answer from a document. QA models take in a context parameter, which is a document in which you are searching for some information, and a question, and return an answer. The answer is extracted, not generated. The task is evaluated on two metrics: exact match and F1-score.
As discussed in Example 3, there are also other QA models that are not extractive but generative. They generate free text directly based on the context, as shown in the Diagram below.

<center> <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX0AIAEN/q_a1.png" width="60%"> </center> 
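The two evaluation metrics named above, exact match and F1-score, can be sketched as follows. This is a simplified version that only lowercases and splits on whitespace. Benchmark implementations usually also strip punctuation and articles, so treat it as an illustration rather than a reference implementation.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the answers are identical after simple normalization, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def f1_score(prediction, reference):
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Amazonia", "amazonia"))
print(f1_score("the Amazon Jungle", "Amazon Jungle"))
```

Exact match rewards only perfect answers, while F1 gives partial credit when the predicted span overlaps the reference answer.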

Let's look at the Example where we have some context and we try to extract some information from it.
First, we load the `pipeline()` with the "question-answering" identifier. We also define our question and context. Then we apply our model to the input.


In [17]:
qa_model = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
question = "Which name is also used to describe the Amazon rainforest in English?"
context = "The Amazon rainforest, also known in English as Amazonia or the Amazon Jungle."
qa_model(question = question, context = context)


As we see, the correct answer has been extracted with an 82% confidence score.


### Example 6 - [Text Summarization](https://huggingface.co/tasks/summarization?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01)


The next Example of this Guided Project deals with Text Summarization. **Text Summarization** is the task of creating a shorter version of a document, while preserving the relevant information and importance of the original document. The summarizer model takes in the whole document as input and outputs the summarized version. The evaluation metric used in this analysis is called ROUGE. It is a benchmark based on the overlapping token sequences shared between the produced summary and a reference summary.

<center> <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX0AIAEN/summarization.png" width="60%"> </center> 
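To give a feel for how a ROUGE-style score is computed, here is a minimal sketch of ROUGE-N as n-gram overlap between a produced summary and a reference summary. It uses plain whitespace tokenization and no stemming, so its numbers will not necessarily match those of established ROUGE packages. The two sentences are hypothetical examples.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Precision, recall and F1 of n-gram overlap between two texts."""

    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Shared n-grams, each counted at most as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical summary and reference.
print(rouge_n("the cat sat", "the cat sat on the mat", n=1))
print(rouge_n("the cat sat", "the cat sat on the mat", n=2))
```

Recall tells us how much of the reference is covered by the summary, and precision tells us how much of the summary is supported by the reference.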

Let's load the "summarization" pipeline, input some text that we want to summarize, and see what the output will look like.


In [18]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
summarizer(
    """
Exploratory Data Analysis is the first course in Machine Learning Program that introduces learners to the broad range of Machine Learning concepts, applications, challenges, and solutions, while utilizing interesting real-life datasets. So, what is EDA and why is it important to perform it before we dive into any analysis?
EDA is a visual and statistical process that allows us to take a glimpse into the data before the analysis. It lets us test the assumptions that we might have about the data, proving or disproving our prior believes and biases. It lays foundation for the analysis, so our results go along with our expectations. In a way, it’s a quality check for our predictions.
As any data scientist would agree, the most challenging part in any data analysis is to obtain a good quality data to work with. Nothing is served to us on a silver plate, data comes in different shapes and formats. It can be structured and unstructured, it may contain errors or be biased, it may have missing fields, it can have different formats than what an untrained eye would perceive. For example, when we import some data, very often it would contain a time stamp. To a human it is understandable format that can interpreted. But to a machine, it is not interpretable, so it needs to be told what that means, the data needs to be transformed into simple numbers first. There are also different date-time conventions depending on a country (i.e., Canadian versus USA), metric versus imperial systems, and many other data features that need to be recognized before we start doing the analysis. Therefore, the first step before performing any analysis – is get really aquatinted with your data!
This course will teach you to ‘see’ and to ‘feel’ the data as well as to transform it into analysis-ready format. It is introductory level course, so no prior knowledge is required, and it is a good starting point if you are interested in getting into the world of Machine Learning. The only thing that is needed is some computer with internet, your curiosity and eagerness to learn and to apply acquired knowledge.  If you live in Canada, you might be interested about gasoline prices in different cities or if you are an insurance actuary you need to analyze the financial risks that you will take based on your clients information. Whatever is the case, you will be able to do your own analysis, and confirm or disprove some of the existing information.
The course contains videos and reading materials, as well as well as a lot of interactive practice labs that learners can explore and apply the skills learned. It will allow you to use Python language in Jupyter Notebook, a cloud-based skills network environment that is pre-set for you with all available to be downloaded packages and libraries. It will introduce you to the most common visualization libraries such as Pandas, Seaborn, and Matplotlib to demonstrate various EDA techniques with some real-life datasets.

"""
)


In [19]:
del summarizer


The result is a short summary of our paragraph.


### Example 7 - [Translation](https://huggingface.co/course/chapter7/4?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01&fw=pt)

The last Example of this Guided Project shows the Translation application. **Translation** models take an input in some source language and output the translation in a target language. The evaluation metric used for this task is called BLEU (bilingual evaluation understudy). It is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. It is on a scale from 0 to 1, 1 meaning perfect score.
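The following sketch shows the core ingredients of a BLEU-style score: clipped n-gram precision combined with a brevity penalty that punishes translations shorter than the reference. It is a simplified single-reference version limited to unigrams and bigrams, without the smoothing used by standard implementations, and `simple_bleu` is a name chosen for this sketch.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("how old are you", "how old are you"))
print(simple_bleu("the cat sat", "the cat sat on the mat"))
```

A perfect match scores 1, and a translation that is correct but too short is pulled below 1 by the brevity penalty.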

There are two types of models: *monolingual* models, trained on data for one specific language pair, and *multilingual* models, trained on datasets covering multiple languages. 


<center> <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX0AIAEN/translation.png" width="70%"> </center> 

The Example below shows a monolingual, English-to-French model, "translation_en_to_fr", which leverages a T5 model (here, "t5-small") under the hood. These models add a task prefix, e.g., "_en_to_fr", indicating the task itself, such as translating English to French. There are also multilingual models for inference. Their inference usage differs from monolingual models. For more information on how to use multilingual models, please visit this [documentation](https://huggingface.co/transformers/v3.0.2/multilingual.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMGPXX0AIAEN102-2022-01-01).


In [20]:
en_fr_translator = pipeline("translation_en_to_fr", model="t5-small")
en_fr_translator("How old are you?")


If you would like to use a specific model that translates from one particular language to another, you can also use the generic "translation" pipeline and pass the name of that model directly. 


In [21]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("La science des données est la meilleure.")


# **Let's Practice**


### Exercise 1 - Sentiment Analysis
For sentiment analysis, we can also use a specific model that is better suited to our use case by providing the name of the model. For example, if we want a sentiment analysis model for tweets, we can specify the following model id: "cardiffnlp/twitter-roberta-base-sentiment". This model has been trained on ~58M tweets and fine-tuned for sentiment analysis with the "TweetEval" benchmark.
The output labels for this model are: 0 -> Negative; 1 -> Neutral; 2 -> Positive.

In this Exercise, use the "cardiffnlp/twitter-roberta-base-sentiment" model, pre-trained on tweet data, to analyze any tweet of your choice. Optionally, run the default model (used in Example 1) on the same tweet to see whether the result changes.


In [22]:
specific_model = pipeline(model="cardiffnlp/twitter-roberta-base-sentiment")
data = "Artificial intelligence and automation are already causing friction in the workforce. Should schools revamp existing programs for topics like #AI, or are new research areas required?"
specific_model(data)


In [23]:
original_model = pipeline("sentiment-analysis")
data = "Artificial intelligence and automation are already causing friction in the workforce. Should schools revamp existing programs for topics like #AI, or are new research areas required?"
original_model(data)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



<details>
    <summary>Click here for Solution</summary>

```python
specific_model = pipeline(model="cardiffnlp/twitter-roberta-base-sentiment")
data = "Artificial intelligence and automation are already causing friction in the workforce. Should schools revamp existing programs for topics like #AI, or are new research areas required?"
specific_model(data)

```

</details>


<details>
    <summary>Click here for Solution</summary>

```python
original_model = pipeline("sentiment-analysis")
data = "Artificial intelligence and automation are already causing friction in the workforce. Should schools revamp existing programs for topics like #AI, or are new research areas required?"
original_model(data)

```

</details>


<details>
    <summary>Click here for a Hint</summary>
    
As we see, the tweet-specific model classifies the tweet as neutral (LABEL_1), which it actually might be, whereas the general model classifies the same tweet as 'NEGATIVE', even though it is probably not entirely negative.

</details>
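
The tweet-specific model reports its classes as `LABEL_0`, `LABEL_1` and `LABEL_2`. Using the mapping given at the start of this exercise (0 -> Negative; 1 -> Neutral; 2 -> Positive), a small post-processing step can make the result easier to read. The sketch below assumes the usual list-of-dictionaries output of a classification pipeline; `sample_output` and its score are made-up placeholder values, not a real model result, and `to_readable` is a hypothetical helper.

```python
# Mapping taken from the exercise text: 0 -> Negative; 1 -> Neutral; 2 -> Positive.
LABEL_NAMES = {
    "LABEL_0": "Negative",
    "LABEL_1": "Neutral",
    "LABEL_2": "Positive",
}


def to_readable(results):
    """Replace raw label ids with readable names, leaving unknown labels unchanged."""
    return [
        {"label": LABEL_NAMES.get(item["label"], item["label"]), "score": item["score"]}
        for item in results
    ]


# Placeholder output with an illustrative score (not produced by the model).
sample_output = [{"label": "LABEL_1", "score": 0.75}]
readable = to_readable(sample_output)
print(readable)
```

Labels that are already readable, such as 'NEGATIVE' from the default model, pass through unchanged.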


### Exercise 2 - Topic Classification
In this Exercise, classify any sentence of your choice under any classes/topics of your choice. Use the "zero-shot-classification" task and specify model="facebook/bart-large-mnli".


In [24]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "I love travelling and learning new cultures",
    candidate_labels=["art", "education", "travel"],
)


<details>
    <summary>Click here for Solution</summary>

```python
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "I love travelling and learning new cultures",
    candidate_labels=["art", "education", "travel"],
)
```

</details>
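
A zero-shot classification result is typically a dictionary with the input `sequence`, the candidate `labels`, and one score per label. The sketch below shows one possible way to pick the most likely topic from such a result. The scores in `sample_result` are invented for illustration only, and both `top_label` and its `threshold` parameter are hypothetical.

```python
# Placeholder result in the shape returned by a zero-shot pipeline.
# The scores are invented for illustration and are not real model output.
sample_result = {
    "sequence": "I love travelling and learning new cultures",
    "labels": ["travel", "education", "art"],
    "scores": [0.90, 0.07, 0.03],
}


def top_label(result, threshold=0.5):
    """Return the highest-scoring label, or None if no score reaches the threshold."""
    best_label, best_score = max(
        zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
    )
    return best_label if best_score >= threshold else None


best = top_label(sample_result)
print(best)
```

A threshold like this can be useful when none of the candidate labels fits the sentence well, since the scores are then spread across all labels.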


### Exercise 3 - Text Generation Models

In this Exercise, use the 'text-generation' task and the 'gpt2' model to complete any sentence. Define any desired number of returned sentences.


In [25]:
generator = pipeline('text-generation', model = 'gpt2')
generator("Hello, I'm a language model", max_length = 30, num_return_sequences=3)


<details>
    <summary>Click here for Solution</summary>

```python
generator = pipeline('text-generation', model = 'gpt2')
generator("Hello, I'm a language model", max_length = 30, num_return_sequences=3)
```

</details>


### Exercise 4 - Named Entity Recognition
In this Exercise, use any sentence of your choice to extract entities (person, location and organization) using the Named Entity Recognition task, and specify the model as "Jean-Baptiste/camembert-ner".


In [26]:
nlp = pipeline("ner", model="Jean-Baptiste/camembert-ner", grouped_entities=True)
example = "Her name is Anjela and she lives in Seoul."

ner_results = nlp(example)
print(ner_results)


<details>
    <summary>Click here for Solution</summary>

```python
nlp = pipeline("ner", model="Jean-Baptiste/camembert-ner", grouped_entities=True)
example = "Her name is Anjela and she lives in Seoul."

ner_results = nlp(example)
print(ner_results)
```

</details>
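
With grouped entities, the NER pipeline usually returns a list of dictionaries containing an `entity_group`, the matched `word`, and its `start`/`end` character offsets. The sketch below shows one way to collect the words by entity type. The entries in `sample_entities` are hypothetical placeholders built from the example sentence, not real model output, and `group_by_type` is an illustrative helper.

```python
from collections import defaultdict

example = "Her name is Anjela and she lives in Seoul."


def make_entity(group, word):
    """Build a placeholder entity whose offsets are computed from the sentence."""
    start = example.find(word)
    return {"entity_group": group, "word": word, "start": start, "end": start + len(word)}


# Hypothetical entities in the shape a grouped NER result usually has.
sample_entities = [make_entity("PER", "Anjela"), make_entity("LOC", "Seoul")]


def group_by_type(entities):
    """Collect the detected words under their entity type."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


by_type = group_by_type(sample_entities)
print(by_type)
```

The character offsets make it possible to highlight each entity in the original sentence, for example with `example[entity["start"]:entity["end"]]`.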


### Exercise 5 - Question Answering
In this Exercise, use any context and a question of your choice to extract some information, using the "distilbert-base-cased-distilled-squad" model.


In [27]:
question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
question_answerer(
    question="Which lake is one of the five Great Lakes of North America?",
    context="Lake Ontario is one of the five Great Lakes of North America. It is surrounded on the north, west, and southwest by the Canadian province of Ontario, and on the south and east by the U.S. state of New York, whose water boundaries, along the international border, meet in the middle of the lake.",
)


<details>
    <summary>Click here for Solution</summary>

```python
question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
question_answerer(
    question="Which lake is one of the five Great Lakes of North America?",
    context="Lake Ontario is one of the five Great Lakes of North America. It is surrounded on the north, west, and southwest by the Canadian province of Ontario, and on the south and east by the U.S. state of New York, whose water boundaries, along the international border, meet in the middle of the lake.",
)
```

</details>
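
Because this kind of question answering is extractive, the answer is a span of the context. A question-answering result usually includes the `answer` text together with its `start` and `end` character positions and a `score`. The sketch below uses a hypothetical `sample_answer` (the score is a made-up placeholder) to show how those positions relate to the context.

```python
context = (
    "Lake Ontario is one of the five Great Lakes of North America. "
    "It is surrounded on the north, west, and southwest by the Canadian "
    "province of Ontario."
)

# Hypothetical result in the shape a question-answering pipeline usually returns.
# The score is a placeholder, not real model output.
sample_answer = {"score": 0.9, "start": 0, "end": 12, "answer": "Lake Ontario"}

# The answer is the slice of the context between the start and end offsets.
extracted = context[sample_answer["start"]:sample_answer["end"]]
print(extracted)
```

This also explains why the model cannot answer a question whose answer does not appear anywhere in the context.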



### Exercise 6 - Text Summarization
In this Exercise, summarize any document/paragraph of your choice, using the "sshleifer/distilbart-cnn-12-6" model.


In [29]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6",  max_length=59)
summarizer(
    """
Lake Superior in central North America is the largest freshwater lake in the world by surface area and the third-largest by volume, holding 10% of the world's surface fresh water. The northern and westernmost of the Great Lakes of North America, it straddles the Canada–United States border with the province of Ontario to the north, and the states of Minnesota to the northwest and Wisconsin and Michigan to the south. It drains into Lake Huron via St. Marys River and through the lower Great Lakes to the St. Lawrence River and the Atlantic Ocean.
"""
)


<details>
    <summary>Click here for Solution</summary>

```python
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6",  max_length=59)
summarizer(
    """
Lake Superior in central North America is the largest freshwater lake in the world by surface area and the third-largest by volume, holding 10% of the world's surface fresh water. The northern and westernmost of the Great Lakes of North America, it straddles the Canada–United States border with the province of Ontario to the north, and the states of Minnesota to the northwest and Wisconsin and Michigan to the south. It drains into Lake Huron via St. Marys River and through the lower Great Lakes to the St. Lawrence River and the Atlantic Ocean.
"""
)
```

</details>


### Exercise 7 - Translation
In this Exercise, translate any sentence of your choice from English to German. The translation task you can use is "translation_en_to_de", together with the "t5-small" model.


In [28]:
translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("New York is my favourite city", max_length=40))


<details>
    <summary>Click here for Solution</summary>

```python
translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("New York is my favourite city", max_length=40))
```

</details>


## Congratulations! You have completed this guided project.


## Author


[Svitlana Kramar](https://www.linkedin.com/in/svitlana-kramar)


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2022-09-20|0.1|Svitlana K.|Created first version|
|2022-09-27|0.2|Svitlana K.|Created second version. Added diagrams.|


Copyright © 2022 IBM Corporation. All rights reserved.
