# Analyzing 10K
### Item 7 - Management's Discussion and Analysis

## Common NLP Tasks
<ol>
    <li><strong>Classifying whole sentences: </strong>Getting the sentiment from a text, detecting if an email is spam, determining whether two pieces of text are logically related or not</li>
    <li><strong>Classifying each word in a sentence:</strong> Identifying the named entities (person, location, organization)</li> 
    <li><strong>Generating text content:</strong> Summarizing text or completing a prompt with auto-generated text, as in chatbots or search engines</li> 
    <li><strong>Extracting an answer from a text:</strong> Given a question and a context, extracting the answer to the question based on the information provided in the context</li>
    
</ol>

## Pipeline
We experiment with models for these tasks using the high-level API called pipeline. The pipeline takes care of all preprocessing and returns cleaned up predictions. The pipeline is primarily used for inference where we apply fine-tuned models to new examples.

<img src="images/pipeline.png?raw=1" alt="Alt text that describes the graphic" title="Title text" width=800>
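
To make these stages concrete, here is a toy sketch of what a pipeline does conceptually: preprocess text into token ids, run a model to get raw scores (logits), and postprocess them into a readable prediction. Everything in it (the vocabulary, the weights, and the function names `preprocess`, `toy_model`, `postprocess`, and `toy_pipeline`) is a made-up stand-in for illustration, not the transformers implementation.

```python
import math

# Hypothetical toy vocabulary and per-token weights; a real pipeline uses a
# trained tokenizer and a transformer model instead.
VOCAB = {"[UNK]": 0, "growth": 1, "success": 2, "loss": 3, "decline": 4}
# One (negative, positive) weight pair per token id.
WEIGHTS = [(0.0, 0.0), (-1.0, 1.0), (-1.5, 1.5), (1.5, -1.5), (1.0, -1.0)]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Lowercase, split on whitespace, and map each word to a token id."""
    words = [w.strip(".,;:!?\"'") for w in text.lower().split()]
    return [VOCAB.get(w, VOCAB["[UNK]"]) for w in words]


def toy_model(token_ids):
    """Sum the per-token weights into one logit per class."""
    neg = sum(WEIGHTS[i][0] for i in token_ids)
    pos = sum(WEIGHTS[i][1] for i in token_ids)
    return [neg, pos]


def postprocess(logits):
    """Turn logits into probabilities with a softmax and pick the best label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": LABELS[best], "score": probs[best]}]


def toy_pipeline(text):
    """Chain the three stages, mirroring the shape of a pipeline call."""
    return postprocess(toy_model(preprocess(text)))


result = toy_pipeline("We expect long-term success and growth.")
print(result)
```

The returned value deliberately mimics the list-of-dicts shape that the real pipelines return, so the later cells in this notebook read the same way.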

## Setup (This has already been installed for you)

Before we start, we need to make sure the transformers library is installed, along with the sentencepiece tokenizer, which some models need, and a deep learning backend (PyTorch or TensorFlow).

In [15]:
%%capture
!pip install transformers
!pip install sentencepiece
!pip install torch
!pip install tensorflow

Furthermore, we create a textwrapper to format long texts nicely.

In [16]:
import textwrap
wrapper = textwrap.TextWrapper(width=80, break_long_words=False, break_on_hyphens=False)

## Classification
  <li><strong>Classifying whole sentences: </strong>Getting the sentiment from a text, detecting if an email is spam, determining whether two pieces of text are logically related or not</li>

We start by setting up an example text that we would like to analyze with a transformer model. Here we use an excerpt from a Bed Bath and Beyond 10-K filing:

In [17]:
# Extracted from Bed Bath and Beyond 10K statement
text="""
"We are executing on a comprehensive plan to transform our business and position us for long-term success under the leadership of our President and CEO Mark Tritton, who joined the Company on November 4, 2019. Mr. Tritton has been assessing our operations, portfolio, capabilities and culture and is developing and implementing the initial stages of a strategic plan designed to re-establish our leading position as the preferred omnichannel home destination, which is grounded in five key pillars: product, price, promise, place and people. With these five pillars as our framework, and a singular purpose to make it easy for customers to feel at home, we are embracing a commitment to build and manage a modern, durable omnichannel model. Early actions include the extensive restructure of our leadership team. Interim leaders were appointed in merchandising, marketing, digital, stores, operations, finance, legal and human resources. During fiscal 2020, we announced the hiring of a new leadership team, consisting of the following: On March 4, 2020, Joe Hartsig joined the Company as Executive Vice President, Chief Merchandising Officer of the Company and President of Harmon Stores Inc.; On May 4, 2020, Gustavo Arnal joined the Company as Executive Vice President, Chief Financial Officer and Treasurer; On May 11, 2020, Rafeh Masood joined the Company as Executive Vice President, Chief Digital Officer; On May 11, 2020, Gregg Melnick assumed the role of Executive Vice President, Chief Stores Officer. Previously, Mr. 
Melnick served as the Company’s interim Chief Digital Officer; On May 18, 2020, John Hartmann joined the Company as Chief Operating Officer of the Company and President, buybuy BABY; On May 18, 2020, Arlene Hong joined the Company as Executive Vice President, Chief Legal Officer and Corporate Secretary; On May 26, 2020, Cindy Davis joined the Company as Executive Vice President, Chief Brand Officer of the Company and President, Decorist; and On September 28, 2020, Lynda Markoe joined the Company as Executive Vice President, Chief People and Culture Officer. As discussed in "Overview" above, as part of our business transformation, we are also pursuing deliberate actions as part of our restructuring program to drive profit improvement over the next two-to-three years. We expect to reinvest a portion of the expected cost savings into future growth initiatives."
"""

print(wrapper.fill(text))

 "We are executing on a comprehensive plan to transform our business and
position us for long-term success under the leadership of our President and CEO
Mark Tritton, who joined the Company on November 4, 2019. Mr. Tritton has been
assessing our operations, portfolio, capabilities and culture and is developing
and implementing the initial stages of a strategic plan designed to re-establish
our leading position as the preferred omnichannel home destination, which is
grounded in five key pillars: product, price, promise, place and people. With
these five pillars as our framework, and a singular purpose to make it easy for
customers to feel at home, we are embracing a commitment to build and manage a
modern, durable omnichannel model. Early actions include the extensive
restructure of our leadership team. Interim leaders were appointed in
merchandising, marketing, digital, stores, operations, finance, legal and human
resources. During fiscal 2020, we announced the hiring of a new leadersh

One of the most common tasks in NLP, especially when dealing with customer texts, is _sentiment analysis_. We would like to know whether a customer is satisfied with a service or product, and potentially aggregate the feedback across all customers for reporting.

For text classification the model gets all the inputs and makes a single prediction as shown in the following example:

<img src="images/clf_arch.png" alt="Alt text that describes the graphic" title="Title text" width=550>

We can achieve this by setting up a `pipeline` object which wraps a transformer model. When initializing it, we need to specify the task. Sentiment analysis is a type of text classification, where a single label is assigned to a piece of text.

In [18]:
from transformers import pipeline

sentiment_pipeline = pipeline('text-classification')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


You can see a warning message: we did not specify in the pipeline which model we would like to use. In that case it loads a default model. The `distilbert-base-uncased-finetuned-sst-2-english` model is a small BERT variant trained on [SST-2](https://paperswithcode.com/sota/sentiment-analysis-on-sst-2-binary) which is a sentiment analysis dataset.

You'll notice that the first time you run the pipeline, the model is downloaded from the 🤗 Hub. On subsequent runs, the cached model is used.

Now we are ready to run our example through pipeline and look at some predictions:

In [19]:
sentiment_pipeline(text)

[{'label': 'POSITIVE', 'score': 0.9960073828697205}]

The model predicts positive sentiment with high confidence, which makes sense given the optimistic tone of the passage. You can see that the pipeline returns a list of dicts with the predictions. We can also pass several texts at once, in which case the list contains one dict per text.
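
Since the pipeline returns one dict per input text, aggregating sentiment across many texts is a short post-processing step. The sketch below works on a hand-written list shaped like the pipeline output; the sample predictions and the helper name `summarize_sentiment` are illustrative assumptions, not results from the model.

```python
from collections import Counter

# Hypothetical predictions, shaped like the list of dicts the pipeline returns.
predictions = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.80},
    {"label": "POSITIVE", "score": 0.75},
    {"label": "POSITIVE", "score": 0.60},
]


def summarize_sentiment(preds):
    """Return label counts and the share of texts labeled POSITIVE."""
    counts = Counter(p["label"] for p in preds)
    positive_share = counts["POSITIVE"] / len(preds) if preds else 0.0
    return {"counts": dict(counts), "positive_share": positive_share}


report = summarize_sentiment(predictions)
print(report)
```

A report like this ignores the individual scores; weighting each label by its score would be a natural refinement if low-confidence predictions should count less.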

## Named entity recognition
<li><strong>Classifying each word in a sentence:</strong> Identifying the named entities (person, location, organization)</li> 

Let's see if we can do something a little more sophisticated. Instead of just finding the overall sentiment let's see if we can extract named entities such as organizations, locations, or individuals from the text. This task is called named entity recognition (NER). Instead of predicting just a class for the whole text a class is predicted for each token, thus this task belongs to the category of token classification:

<img src="images/ner_arch.png?raw=1" alt="Alt text that describes the graphic" title="Title text" width=550>

Again, we just load a pipeline for the NER task without specifying a model. This will load a default BERT model that has been trained on the [CoNLL-2003](https://huggingface.co/datasets/conll2003) dataset.

In [20]:
ner_pipeline = pipeline('ner')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


When we pass our text through the model, we get a long list of dicts: each dict corresponds to one detected entity. Since multiple tokens can correspond to a single entity, we can apply an aggregation strategy that merges entities if the same class appears in consecutive tokens.

In [21]:
entities = ner_pipeline(text, aggregation_strategy="simple")
print(entities)

[{'entity_group': 'PER', 'score': 0.9992383, 'word': 'Mark Tritton', 'start': 153, 'end': 165}, {'entity_group': 'PER', 'score': 0.9971431, 'word': 'Tritton', 'start': 215, 'end': 222}, {'entity_group': 'PER', 'score': 0.947806, 'word': 'Joe Hartsig', 'start': 1056, 'end': 1067}, {'entity_group': 'ORG', 'score': 0.48377684, 'word': 'Company', 'start': 1079, 'end': 1086}, {'entity_group': 'ORG', 'score': 0.999298, 'word': 'Harmon Stores Inc', 'start': 1176, 'end': 1193}, {'entity_group': 'PER', 'score': 0.99962455, 'word': 'Gustavo Arnal', 'start': 1212, 'end': 1225}, {'entity_group': 'ORG', 'score': 0.5372771, 'word': 'Company', 'start': 1237, 'end': 1244}, {'entity_group': 'PER', 'score': 0.99922913, 'word': 'Rafeh Masood', 'start': 1330, 'end': 1342}, {'entity_group': 'PER', 'score': 0.9995363, 'word': 'Gregg Melnick', 'start': 1431, 'end': 1444}, {'entity_group': 'PER', 'score': 0.99836326, 'word': 'Melnick', 'start': 1529, 'end': 1536}, {'entity_group': 'PER', 'score': 0.9996248, '
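
As a rough picture of what such an aggregation step does, the sketch below merges consecutive token-level predictions that share an entity class into one entity. It is a simplified illustration under assumed inputs (hand-written tokens with character offsets and a hypothetical `merge_tokens` helper), not the library's actual implementation, which also has to deal with sub-word pieces and tag prefixes.

```python
def merge_tokens(text, tokens):
    """Merge adjacent tokens with the same entity class into entity groups."""
    groups = []
    for tok in tokens:
        last = groups[-1] if groups else None
        # "Adjacent" here means the token starts right after the previous one,
        # allowing for a single separating character such as a space.
        same_class = last is not None and last["entity_group"] == tok["entity"]
        if same_class and tok["start"] - last["end"] <= 1:
            last["end"] = tok["end"]
            last["scores"].append(tok["score"])
        else:
            groups.append({"entity_group": tok["entity"], "start": tok["start"],
                           "end": tok["end"], "scores": [tok["score"]]})
    # Recover the merged surface string from the text and average the scores.
    return [{"entity_group": g["entity_group"],
             "word": text[g["start"]:g["end"]],
             "score": sum(g["scores"]) / len(g["scores"]),
             "start": g["start"],
             "end": g["end"]} for g in groups]


sample = "Mark Tritton joined Harmon Stores"
# Hypothetical token-level predictions with character offsets into `sample`.
tokens = [
    {"entity": "PER", "score": 0.99, "start": 0, "end": 4},
    {"entity": "PER", "score": 0.97, "start": 5, "end": 12},
    {"entity": "ORG", "score": 0.90, "start": 20, "end": 26},
    {"entity": "ORG", "score": 0.80, "start": 27, "end": 33},
]
merged = merge_tokens(sample, tokens)
for m in merged:
    print(m["word"], m["entity_group"], round(m["score"], 2))
```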

Let's clean up the output a bit:

In [22]:
for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")

Mark Tritton: PER (1.00)
Tritton: PER (1.00)
Joe Hartsig: PER (0.95)
Company: ORG (0.48)
Harmon Stores Inc: ORG (1.00)
Gustavo Arnal: PER (1.00)
Company: ORG (0.54)
Rafeh Masood: PER (1.00)
Gregg Melnick: PER (1.00)
Melnick: PER (1.00)
John Hartmann: PER (1.00)
BABY: ORG (0.90)
Arlene Hong: PER (1.00)
Company: ORG (0.62)
Cindy Davis: PER (1.00)
Mark: PER (1.00)


The model found most of the named entities, but it also made some mistakes: it tagged the generic word "Company" as an organization with low confidence, and it extracted only "BABY" from "buybuy BABY". This is not surprising, since the default model was trained on CoNLL-2003, which is unlikely to contain much text that looks like a financial filing. For this reason, it makes sense to further fine-tune a model on your own dataset!
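
Before fine-tuning, a light-weight way to clean up results like these is to drop low-confidence entities and group the rest by class. The sketch below does this on a few entries copied from the output above; the threshold of 0.9 and the helper name `group_entities` are arbitrary assumptions for illustration.

```python
from collections import defaultdict

# A few entries copied from the aggregated NER output above.
sample_entities = [
    {"entity_group": "PER", "score": 0.9992383, "word": "Mark Tritton"},
    {"entity_group": "PER", "score": 0.9971431, "word": "Tritton"},
    {"entity_group": "ORG", "score": 0.48377684, "word": "Company"},
    {"entity_group": "ORG", "score": 0.999298, "word": "Harmon Stores Inc"},
    {"entity_group": "PER", "score": 0.99962455, "word": "Gustavo Arnal"},
]


def group_entities(ents, min_score=0.9):
    """Group unique entity strings by class, skipping low-confidence hits."""
    grouped = defaultdict(list)
    for ent in ents:
        if ent["score"] < min_score:
            continue  # e.g. the generic "Company" hits
        if ent["word"] not in grouped[ent["entity_group"]]:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


grouped = group_entities(sample_entities)
print(grouped)
```

Note that this does not link "Tritton" to "Mark Tritton"; resolving such partial mentions would need an extra step.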

## Summarization
<li><strong>Generating text content:</strong> Summarizing text or completing a prompt with auto-generated text, as in chatbots or search engines</li> 

Let's see if we can go beyond these natural language understanding tasks (NLU) where BERT excels and delve into the generative domain. Note that generation is much more expensive since we usually generate one token at a time and need to run this several times.

<img src="images/gen_steps.png?raw=1" alt="Alt text that describes the graphic" title="Title text" width=600>
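
The one-token-at-a-time loop can be sketched with a toy stand-in for the model. Here a hypothetical lookup table `NEXT_TOKEN_SCORES` plays the role of the network, and the assumed helper `greedy_generate` picks the highest-scoring next token until it reaches an end marker or a length limit. Real models score the whole vocabulary given all previous tokens and often use other decoding strategies, such as beam search.

```python
# Hypothetical scores for the next token given only the previous token.
NEXT_TOKEN_SCORES = {
    "<s>": {"we": 0.9, "the": 0.1},
    "we": {"expect": 0.7, "are": 0.3},
    "expect": {"growth": 0.8, "savings": 0.2},
    "growth": {"</s>": 1.0},
}


def greedy_generate(max_new_tokens=10):
    """Generate tokens one at a time, always taking the most likely next one."""
    tokens = ["<s>"]
    for _ in range(max_new_tokens):
        # One "model call" per generated token; this is why generation is costly.
        scores = NEXT_TOKEN_SCORES.get(tokens[-1], {"</s>": 1.0})
        next_token = max(scores, key=scores.get)
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


generated = greedy_generate()
print(" ".join(generated))
```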

A popular task involving generation is summarization. Let's see if we can use a transformer to generate a summary for us:

In [23]:
summarization_pipeline = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


This model was trained on the [CNN/Dailymail dataset](https://huggingface.co/datasets/cnn_dailymail) to summarize news articles.

In [24]:
outputs = summarization_pipeline(text, max_length=45, clean_up_tokenization_spaces=True)
print(wrapper.fill(outputs[0]['summary_text']))

Your min_length=56 must be inferior than your max_length=45.


 "We are embracing a commitment to build and manage a modern, durable
omnichannel model," says CEO Mark Tritton. Early actions include the extensive
restructure of our leadership team. Interim leaders were appointed


## Question-answering
<li><strong>Extracting an answer from a text:</strong> Given a question and a context, extracting the answer to the question based on the information provided in the context</li>

We have now seen examples of text classification, token classification, and summarization using transformers. However, there are more interesting tasks we can use transformers for. One of them is question-answering. In this task the model is given a question and a context and needs to find the answer to the question within the context. This problem can be rephrased as a classification problem: for each token, the model needs to predict whether it is the start or the end of the answer. In the end, we can extract the answer by looking at the span between the token with the highest start probability and the token with the highest end probability:

<img src="images/qa_arch.png?raw=1" alt="Alt text that describes the graphic" title="Title text" width=600>
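
The span selection described above can be sketched as follows. Given per-token start and end probabilities, this variant picks the pair with the highest combined score (the product of the two probabilities), subject to the end not preceding the start. The tokens, the probabilities, the `max_answer_len` limit, and the helper name `best_span` are made up for illustration; the pipeline's real post-processing is more involved.

```python
def best_span(start_probs, end_probs, max_answer_len=5):
    """Return (start, end, score) maximizing start_prob * end_prob, start <= end."""
    best = (0, 0, 0.0)
    for s, p_start in enumerate(start_probs):
        # Only consider end positions at or after the start, within the limit.
        for e in range(s, min(s + max_answer_len, len(end_probs))):
            score = p_start * end_probs[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Hypothetical context tokens and model probabilities.
tokens = ["our", "President", "and", "CEO", "Mark", "Tritton", "joined"]
start_probs = [0.01, 0.05, 0.01, 0.03, 0.85, 0.04, 0.01]
end_probs = [0.01, 0.02, 0.01, 0.05, 0.06, 0.84, 0.01]

start, end, score = best_span(start_probs, end_probs)
answer = " ".join(tokens[start:end + 1])
print(answer, round(score, 3))
```

Scoring pairs jointly, rather than taking the two maxima independently, avoids returning a span whose end comes before its start.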

You can imagine that this requires quite a bit of pre- and post-processing logic. Good thing that the pipeline takes care of all that!

In [25]:
qa_pipeline = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


This default model is trained on the canonical [SQuAD dataset](https://huggingface.co/datasets/squad). Let's see if we can ask some questions:

In [26]:
questions = ["Who is the president of the company?", 
             "What are the five key pillars of the strategic plan?", 
             'How does the company plan to grow?']

In [27]:
for q in questions:
    outputs = qa_pipeline(question=q, context=text)
    print(q)
    print(outputs)
    print("-----")

Who is the president of the company?
{'score': 0.9902592301368713, 'start': 153, 'end': 165, 'answer': 'Mark Tritton'}
-----
What are the five key pillars of the strategic plan?
{'score': 0.9782744646072388, 'start': 500, 'end': 541, 'answer': 'product, price, promise, place and people'}
-----
How does the company plan to grow?
{'score': 0.049509648233652115, 'start': 2319, 'end': 2366, 'answer': 'reinvest a portion of the expected cost savings'}
-----


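Awesome, the first two answers sound about right! The third answer has a very low score (about 0.05), so it should be treated with caution.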

## More pipelines
There are many more pipelines that you can experiment with. Look at the following list for an overview:

In [28]:
from transformers import pipelines
for task in pipelines.SUPPORTED_TASKS:
    print(task)

audio-classification
automatic-speech-recognition
feature-extraction
text-classification
token-classification
question-answering
table-question-answering
fill-mask
summarization
translation
text2text-generation
text-generation
zero-shot-classification
conversational
image-classification
image-segmentation
object-detection
