## 💻 UnpackAI DL201 Bootcamp - Week 3 - NLP pipelines

### 📕 Learning Objectives

* Getting working examples that cover the main NLP tasks
* Knowing that Hugging Face exists, and the strength of its pre-trained models and all-in-one pipelines

### 📖 Concepts map

* Pipeline

In [1]:
# Imports (when installing packages, quiet mode can be used, e.g. "pip install -Uqq transformers", if you are sure that there is no dependency error)

from transformers import pipeline
import pandas as pd
import numpy as np

# Part 1. Introduction

This part gives working examples to inspire you to build your own AI project.

Hugging Face provides all-in-one ***pipelines*** that bundle the main steps of an NLP task (see https://huggingface.co/course/chapter2/2?fw=pt):
* choosing a pre-trained model
* adapting the input text to this model (tokenization, vectorization)
* running the model on the transformed input data
* adapting the model output for human readers (e.g. de-tokenization, to get an output text from an output vector or numbers)
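
The four steps above can be sketched by hand for the sentiment-analysis task. This is a minimal sketch, not the library's internal implementation: `postprocess` and `manual_sentiment` are hypothetical helper names, the checkpoint is the default one that the sentiment-analysis pipeline reports in Part 3, and the second function assumes that PyTorch is installed alongside `transformers`.

```python
import math


def postprocess(logits, id2label):
    """Turn raw model scores (logits) into a readable label and a probability."""
    # Softmax, shifted by the maximum for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


def manual_sentiment(texts, checkpoint="distilbert-base-uncased-finetuned-sst-2-english"):
    """Run the pipeline steps one by one (downloads the model on first use)."""
    # Local imports, so that `postprocess` can be tried without the heavy libraries.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Step 1: choose a pre-trained model and its matching tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    # Step 2: adapt the input text to the model (tokenization, padding, tensors).
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Step 3: run the model on the transformed input.
    with torch.no_grad():
        logits = model(**inputs).logits
    # Step 4: adapt the model output for human readers.
    return [postprocess(row, model.config.id2label) for row in logits.tolist()]
```

Calling `manual_sentiment(["I hate this so much!"])` should give a result in the same shape as the pipeline output, a list of dictionaries with `label` and `score` keys.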

Once the pipeline works, you can tune it little by little, as one would tune a car for a race.
For example, you can decide to:
* fine-tune the model or train it from scratch (instead of using the pre-trained model as-is)
* use your own tokenizer (instead of the default one)
* clean the training data before feeding it to the model


# Part 2. Example of question answering

In [2]:
question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


In [3]:
my_answer = question_answerer(
    question="Where do I work?",
    context="My name is John and I work at unpackAI in Beijing."
)

In [4]:
my_answer

{'score': 0.5068178772926331, 'start': 30, 'end': 38, 'answer': 'unpackAI'}

In [5]:
my_answer['answer']

'unpackAI'
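
The `start` and `end` fields are character offsets into the context, so the answer can be recovered by slicing. A small check using the result shown above, copied by hand rather than recomputed (the `example_` names are illustrative):

```python
example_context = "My name is John and I work at unpackAI in Beijing."
# Result copied from the question-answering output above.
example_answer = {"score": 0.5068178772926331, "start": 30, "end": 38, "answer": "unpackAI"}

# `start` is inclusive and `end` is exclusive, like a Python slice.
span = example_context[example_answer["start"]:example_answer["end"]]
print(span)
```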

In [6]:
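def my_question_answerer(my_question, my_context):
    """Return only the answer text found by the question-answering pipeline."""
    complete_answer = question_answerer(
        question=my_question,
        context=my_context
    )
    # Keep the 'answer' field and drop the score and the offsets.
    return complete_answer['answer']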

In [7]:
context = 'My name is James and I work in Shenzhen'
question = 'Where do you work?'

In [8]:
my_question_answerer(question,context)

'Shenzhen'
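
The `score` field can also be used to reject uncertain answers. A minimal sketch, assuming a hypothetical helper `answer_or_none` and an arbitrary threshold `min_score` (neither is part of the library). The answerer is passed in as an argument so that the function can be tried with a stand-in before loading a real model:

```python
def answer_or_none(answerer, question, context, min_score=0.3):
    """Return the answer text, or None when the model's score is below min_score."""
    result = answerer(question=question, context=context)
    if result["score"] < min_score:
        return None
    return result["answer"]


def fake_answerer(question, context):
    """Stand-in that returns the result shown earlier in this notebook."""
    return {"score": 0.5068178772926331, "start": 30, "end": 38, "answer": "unpackAI"}


sample_context = "My name is John and I work at unpackAI in Beijing."
print(answer_or_none(fake_answerer, "Where do I work?", sample_context))
print(answer_or_none(fake_answerer, "Where do I work?", sample_context, min_score=0.9))
```

With the real pipeline, pass `question_answerer` in place of `fake_answerer`.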

# Part 3. Example of Sentiment Analysis

Here, let's apply a pipeline to a pandas Series (i.e. one column of a DataFrame) containing text data.

In [9]:
Text_series = pd.Series([ "I've been waiting for a HuggingFace course my whole life.","I hate this so much!"])

In [10]:
Text_series.shape

(2,)

Now, we have to transform this series into a plain Python list, because pipelines accept lists but not series.

In [11]:
Text_list = Text_series.to_list()

In [12]:
len(Text_list)

2

We load the sentiment analysis pipeline.

In [13]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [14]:
list_of_answers = classifier(Text_list)
list_of_answers

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [15]:
len(list_of_answers)

2

Let's try to get the list of labels, or the list of scores.

In [16]:
#list_of_answers[:]['label'] # TypeError: list indices must be integers or slices, not str

One way to get them is to transform the list back into a series.

In [17]:
series_of_answers = pd.Series(list_of_answers)
series_of_answers.shape

(2,)

In [18]:
series_of_labels = series_of_answers.apply(lambda x: x['label'])
series_of_labels

0    POSITIVE
1    NEGATIVE
dtype: object

In [19]:
series_of_scores = series_of_answers.apply(lambda x: x['score'])
series_of_scores

0    0.959805
1    0.999456
dtype: float64

Note that the score alone does not tell us whether the sentiment is positive or negative: it is the model's confidence in the predicted label.
So, we combine the score and the label to get a signed score.

In [20]:
series_of_signed_scores = np.where(series_of_labels == 'POSITIVE', series_of_scores, -1 * series_of_scores)
series_of_signed_scores

array([ 0.95980495, -0.99945587])
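
Converting back to a series is one option; a list comprehension reads the same fields directly from the list of dictionaries. A small sketch using the classifier output shown above, copied by hand (the names `example_answers`, `labels`, `scores`, and `signed_scores` are illustrative):

```python
# Output copied from the sentiment-analysis cell above.
example_answers = [
    {"label": "POSITIVE", "score": 0.9598049521446228},
    {"label": "NEGATIVE", "score": 0.9994558691978455},
]

labels = [answer["label"] for answer in example_answers]
scores = [answer["score"] for answer in example_answers]
# Positive labels keep their score; any other label gets a minus sign.
signed_scores = [
    answer["score"] if answer["label"] == "POSITIVE" else -answer["score"]
    for answer in example_answers
]
print(labels)
print(signed_scores)
```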

# Part 4. Example of Text Generation

In [21]:
generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)


In [22]:
generator(
    "In this course, we will teach you how to utilize NLP",
    max_length=30,
    num_return_sequences=2
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "In this course, we will teach you how to utilize NLP's features to make life in a mobile environment enjoyable for everyone.\n\nYou can"},
 {'generated_text': 'In this course, we will teach you how to utilize NLP during your session while getting your data from NLP sources. In our next version,'}]
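
Each `generated_text` starts with the prompt itself. When only the continuation is wanted, the prompt can be sliced off. A sketch using the first result above, copied by hand (`continuation_only` is a hypothetical helper, not a library function):

```python
def continuation_only(prompt, generated_text):
    """Return the newly generated part, without the prompt it started from."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Fall back to the full text if the prompt is not a prefix.
    return generated_text


prompt = "In this course, we will teach you how to utilize NLP"
generated = (
    "In this course, we will teach you how to utilize NLP's features to make "
    "life in a mobile environment enjoyable for everyone.\n\nYou can"
)
print(repr(continuation_only(prompt, generated)))
```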

# Part 5. Example of Named Entity Recognition (NER)

Named Entity Recognition (NER) is the task of identifying and categorizing key information (entities) in text. An entity can be any word or series of words that consistently refers to the same thing. Examples could be entities such as person (PER), organization (ORG), date (DATE), location (LOC), or more.

In [23]:
ner_pipeline = pipeline("ner", grouped_entities=True)  # deprecated flag; newer releases use aggregation_strategy="simple" instead

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


  f'`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'


In [24]:
ner_pipeline("My name is John and I work at unpackAI in Beijing.")

[{'entity_group': 'PER',
  'score': 0.9986534,
  'word': 'John',
  'start': 11,
  'end': 15},
 {'entity_group': 'ORG',
  'score': 0.77543575,
  'word': 'unpackAI',
  'start': 30,
  'end': 38},
 {'entity_group': 'LOC',
  'score': 0.99954224,
  'word': 'Beijing',
  'start': 42,
  'end': 49}]
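
The result is a list of dictionaries, so it can be regrouped by entity type. A sketch using the output above, copied by hand (`entities_by_type` is an illustrative name); the `start` and `end` offsets index into the input sentence:

```python
from collections import defaultdict

sentence = "My name is John and I work at unpackAI in Beijing."
# Output copied from the NER cell above.
entities = [
    {"entity_group": "PER", "score": 0.9986534, "word": "John", "start": 11, "end": 15},
    {"entity_group": "ORG", "score": 0.77543575, "word": "unpackAI", "start": 30, "end": 38},
    {"entity_group": "LOC", "score": 0.99954224, "word": "Beijing", "start": 42, "end": 49},
]

entities_by_type = defaultdict(list)
for entity in entities:
    # Recover each entity's text from its character offsets.
    text = sentence[entity["start"]:entity["end"]]
    entities_by_type[entity["entity_group"]].append(text)

print(dict(entities_by_type))
```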

# Part 6. How to fine-tune a transformer?

Fine-tuning a transformer is not hard, but transformers were designed to work with shared data, and shared data must follow strict rules so that it can be used with different models. For this reason, a dedicated data type was created: the ***Dataset***.

The main difficulty in fine-tuning a transformer is converting your training data into a Dataset.

You can see an example below:

<a href="https://www.kaggle.com/code/philanoe/nlp-transformer-training?scriptVersionId=102105898" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>
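
As a rough illustration of that conversion step, the sketch below reshapes a list of labelled examples into the column-oriented dictionary that `Dataset.from_dict` (from the separate `datasets` library) accepts. The sample rows, the `text` and `label` column names, and the helper names are assumptions made for illustration; they are not taken from the linked notebook.

```python
def records_to_columns(records):
    """Convert a list of row dictionaries into a dictionary of column lists."""
    columns = {key: [] for key in records[0]}
    for record in records:
        for key in columns:
            columns[key].append(record[key])
    return columns


def to_hf_dataset(columns):
    """Build a Hugging Face Dataset (requires `pip install datasets`)."""
    from datasets import Dataset  # imported lazily: optional dependency

    return Dataset.from_dict(columns)


# Hypothetical training rows: 1 = positive, 0 = negative.
records = [
    {"text": "I've been waiting for a HuggingFace course my whole life.", "label": 1},
    {"text": "I hate this so much!", "label": 0},
]
columns = records_to_columns(records)
print(columns["label"])
```

Calling `to_hf_dataset(columns)` then returns a Dataset with one row per record, ready for the tokenization step that fine-tuning requires.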