In [1]:
%run supportvectors-common.ipynb



<center><img src="https://d4x5p7s4.rocketcdn.me/wp-content/uploads/2016/03/logo-poster-smaller.png"/> </center>
<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Only participants in SupportVectors workshops are currently allowed to study the notebooks for educational purposes, and they are prohibited from copying or using them for any other purpose without written permission.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. We therefore request that you not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



# Hugging Face

Hugging Face has become a de facto repository of pre-trained models. Indeed, it hosts a collection of models for most of the common natural language processing tasks.

We will start with the concept of the pipeline in the Hugging Face `transformers` library.

## Pipeline

The `pipeline()` function returns a ready-to-use object for a specific task, say classification. It bundles a pre-trained model with the pre-processing and post-processing steps that the task needs.

We can then supply it with input, and it returns task-specific output. A classifier, for instance, returns the most likely label for the input, along with a score.
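As a rough illustration of the final step a classification pipeline performs, the sketch below turns raw model scores (logits) into a dictionary holding a label and a score. The logits and the `id2label` mapping are made-up values for illustration and are not taken from a real model; an actual pipeline also tokenizes the text and runs the model before this step.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating, for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Pick the most likely label and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical values, for illustration only.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-2.0, 3.0]

prediction = postprocess(logits, id2label)
print(prediction)
```

The score is therefore the probability the model assigns to the winning label, not a measure of how strongly positive or negative the text is.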

Let us illustrate this with an example. Suppose we would like to infer the sentiment of a piece of text as either positive or negative. For this, let us consider a few examples.

### Sentiment analysis classifier

Let us first instantiate the classifier:

In [2]:
from transformers import pipeline
sentiment_classifier = pipeline(task='sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Observe the warning: it recommends that we be more specific about which model to use. Hugging Face contains a sizeable repository of pre-trained models for sentiment analysis in various contexts.

Let us start with a model suited to English text, `distilbert-base-uncased-finetuned-sst-2-english`. (This also happens to be the default when we do not specify a model.)

In [3]:
sentiment_classifier = pipeline('sentiment-analysis',
                                model='distilbert-base-uncased-finetuned-sst-2-english')
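Since the warning mentions both a model name and a revision, one option is to pin both. The sketch below uses the `revision` argument of `pipeline()` and reuses the revision `af0f99b` reported in the warning above. The helper name `build_sentiment_classifier` and the `PINNED_MODEL` dictionary are hypothetical, introduced only for illustration.

```python
# Model name and revision as reported in the warning message above.
PINNED_MODEL = {
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_sentiment_classifier(pinned=PINNED_MODEL):
    """Create a sentiment pipeline tied to a fixed model revision."""
    # Imported here so that the pinned settings can be inspected
    # without loading the transformers library.
    from transformers import pipeline

    return pipeline("sentiment-analysis", **pinned)


# Usage (downloads the model on the first call):
# sentiment_classifier = build_sentiment_classifier()
```

Pinning the revision means that a later update to the model repository should not silently change the predictions.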

If we would like sentiment inference on a single text instance, we can pass the string directly:

In [4]:
text = "The NLP sessions have proved useful so far -- not!."

sentiment_classifier(text)

[{'label': 'NEGATIVE', 'score': 0.9982361793518066}]

In [5]:
#text = "Are you sure the nlp sessions are actually useful?"

text = "Huh!"
sentiment_classifier(text)

[{'label': 'POSITIVE', 'score': 0.8588464260101318}]

The classifier can take a list of inputs too; it returns one prediction per input, in the same order.

In [6]:
texts = [
    "It is become so difficult to reach customer service for my product!", #1
    "It is endless call waiting to begin with.", #2
    "O frabjous day! Callooh! Callay!" #3
]
sentiment_classifier(texts)

[{'label': 'NEGATIVE', 'score': 0.9998125433921814},
 {'label': 'NEGATIVE', 'score': 0.9894068837165833},
 {'label': 'POSITIVE', 'score': 0.9910879731178284}]
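Because the predictions come back in input order, they can be paired with the texts and filtered by score. The sketch below flags low-confidence predictions; it reuses the inputs and outputs printed in the cells above, so it runs without a model. The `THRESHOLD` value and the `flag_uncertain` helper are hypothetical choices, not part of the `transformers` library.

```python
# Inputs and outputs copied from the cells above.
texts = [
    "Huh!",
    "It is become so difficult to reach customer service for my product!",
    "It is endless call waiting to begin with.",
    "O frabjous day! Callooh! Callay!",
]
predictions = [
    {"label": "POSITIVE", "score": 0.8588464260101318},
    {"label": "NEGATIVE", "score": 0.9998125433921814},
    {"label": "NEGATIVE", "score": 0.9894068837165833},
    {"label": "POSITIVE", "score": 0.9910879731178284},
]

THRESHOLD = 0.9  # hypothetical cut-off; tune it for the application


def flag_uncertain(texts, predictions, threshold=THRESHOLD):
    """Return (text, label, score) triples whose score is below the threshold."""
    return [
        (text, pred["label"], pred["score"])
        for text, pred in zip(texts, predictions)
        if pred["score"] < threshold
    ]


uncertain = flag_uncertain(texts, predictions)
for text, label, score in uncertain:
    print(f"{label:<8} {score:.3f}  {text}")
```

With this cut-off, only the ambiguous "Huh!" is flagged, which seems a reasonable candidate for manual review.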

#### Domain-specific sentiment analysis models

Certain domains, such as finance, have a more specific vocabulary, and the language used has a slightly different flavor. We should, therefore, take a model specifically pre-trained on a financial text corpus. Let us pick a model that is rather popular at the time of this writing: `ProsusAI/finbert`



In [7]:
MODEL = 'ProsusAI/finbert'
fin_sentiment_classifier = pipeline('sentiment-analysis',
                                    model=MODEL)


In [8]:
texts = ['The trading is very bullish today', 
         'Today the market rallied, and Nasdaq gained but not much.']

fin_sentiment_classifier(texts)

[{'label': 'neutral', 'score': 0.7950862050056458},
 {'label': 'positive', 'score': 0.944980263710022}]

Let us contrast this with what the previous model, trained on a generic text corpus, infers:

In [9]:
sentiment_classifier(texts)

[{'label': 'NEGATIVE', 'score': 0.9991171956062317},
 {'label': 'NEGATIVE', 'score': 0.9976709485054016}]

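In these examples, pre-training on a domain-specific text corpus produces clearly more plausible results: the generic model labels both sentences NEGATIVE with high confidence, whereas FinBERT labels the market rally as positive, although it rates the "bullish" sentence only as neutral.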
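To make the contrast between the two models easier to read, the sketch below lines up the outputs shown above, side by side. The predictions are copied from the cells above, so no model is needed; the `compare` helper is a hypothetical name. Labels are lower-cased because the two models use different casing.

```python
# Inputs and outputs copied from the cells above.
texts = [
    "The trading is very bullish today",
    "Today the market rallied, and Nasdaq gained but not much.",
]
finbert_predictions = [
    {"label": "neutral", "score": 0.7950862050056458},
    {"label": "positive", "score": 0.944980263710022},
]
generic_predictions = [
    {"label": "NEGATIVE", "score": 0.9991171956062317},
    {"label": "NEGATIVE", "score": 0.9976709485054016},
]


def compare(texts, finbert, generic):
    """Line up both models' labels per text and note whether they agree."""
    rows = []
    for text, fin, gen in zip(texts, finbert, generic):
        fin_label = fin["label"].lower()
        gen_label = gen["label"].lower()
        rows.append(
            {
                "text": text,
                "finbert": fin_label,
                "generic": gen_label,
                "agree": fin_label == gen_label,
            }
        )
    return rows


rows = compare(texts, finbert_predictions, generic_predictions)
for row in rows:
    print(f"{row['finbert']:<9} {row['generic']:<9} {row['agree']!s:<6} {row['text']}")
```

Note also that FinBERT's predictions here include a `neutral` label, which does not appear in any of the generic model's outputs above.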