# Introduction to Hugging Face

Data source: https://www.kaggle.com/datasets/szymonjanowski/internet-articles-data-with-users-engagement?resource=download

In [89]:
# libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Load the data from the path
data_path = "../data/articles_data.csv"
news_data = pd.read_csv(data_path, index_col=0)

# Show data information
news_data.head()

Unnamed: 0,source_id,source_name,author,title,description,url,url_to_image,published_at,content,top_article,engagement_reaction_count,engagement_comment_count,engagement_share_count,engagement_comment_plugin_count
0,reuters,Reuters,Reuters Editorial,NTSB says Autopilot engaged in 2018 California...,The National Transportation Safety Board said ...,https://www.reuters.com/article/us-tesla-crash...,https://s4.reutersmedia.net/resources/r/?m=02&...,2019-09-03T16:22:20Z,WASHINGTON (Reuters) - The National Transporta...,0.0,0.0,0.0,2528.0,0.0
1,the-irish-times,The Irish Times,Eoin Burke-Kennedy,Unemployment falls to post-crash low of 5.2%,Latest monthly figures reflect continued growt...,https://www.irishtimes.com/business/economy/un...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T10:32:28Z,The States jobless rate fell to 5.2 per cent l...,0.0,6.0,10.0,2.0,0.0
2,the-irish-times,The Irish Times,Deirdre McQuillan,"Louise Kennedy AW2019: Long coats, sparkling t...",Autumn-winter collection features designer’s g...,https://www.irishtimes.com/\t\t\t\t\t\t\t/life...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T14:40:00Z,Louise Kennedy is showing off her autumn-winte...,1.0,,,,
3,al-jazeera-english,Al Jazeera English,Al Jazeera,North Korean footballer Han joins Italian gian...,Han is the first North Korean player in the Se...,https://www.aljazeera.com/news/2019/09/north-k...,https://www.aljazeera.com/mritems/Images/2019/...,2019-09-03T17:25:39Z,"Han Kwang Song, the first North Korean footbal...",0.0,0.0,0.0,7.0,0.0
4,bbc-news,BBC News,BBC News,UK government lawyer says proroguing parliamen...,"The UK government's lawyer, David Johnston arg...",https://www.bbc.co.uk/news/av/uk-scotland-4956...,https://ichef.bbci.co.uk/news/1024/branded_new...,2019-09-03T14:39:21Z,,0.0,0.0,0.0,0.0,0.0


In [7]:
# showing the information
news_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10437 entries, 0 to 10436
Data columns (total 14 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   source_id                        10437 non-null  object 
 1   source_name                      10437 non-null  object 
 2   author                           9417 non-null   object 
 3   title                            10435 non-null  object 
 4   description                      10413 non-null  object 
 5   url                              10436 non-null  object 
 6   url_to_image                     9781 non-null   object 
 7   published_at                     10436 non-null  object 
 8   content                          9145 non-null   object 
 9   top_article                      10435 non-null  float64
 10  engagement_reaction_count        10319 non-null  float64
 11  engagement_comment_count         10319 non-null  float64
 12  engagement_share_count 
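
The `Non-Null Count` column shows that `title` has 10435 non-null values for 10437 rows, so two titles are missing. pandas stores missing text as `NaN` (a float), which is not meaningful input for the models used later. A minimal sketch of a filter that could be applied before passing titles to a model; `keep_valid_texts` is a hypothetical helper name, not part of the original notebook:

```python
def keep_valid_texts(values):
    """Return only the entries that are non-empty strings."""
    return [v for v in values if isinstance(v, str) and v.strip()]

# Stand-in for news_data["title"].tolist(); float("nan") mimics a missing title
sample_titles = [
    "NTSB says Autopilot engaged in 2018 California Tesla crash",
    float("nan"),
    "   ",
    "Unemployment falls to post-crash low of 5.2%",
]
clean_titles = keep_valid_texts(sample_titles)
print(len(clean_titles))  # 2
```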

## Translation

Using Hugging Face usually means downloading a model from the Hub, and some models (like the one used below) depend on the PyTorch framework. We therefore need to install PyTorch first, here the build for CUDA 11.7, with the following command: `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117`
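
The `cu117` index in the command above serves builds compiled against CUDA 11.7; on a machine without a compatible GPU, models simply run on the CPU. A small sketch to check which device is usable after installation. `pick_device` is a hypothetical helper, and the `try`/`except` is only there so that the snippet also runs where PyTorch is not installed yet:

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed: nothing can run on a GPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print("Selected device:", device)
```

A model loaded with `from_pretrained` can then be moved with `model.to(device)`; the tokenized inputs have to be moved to the same device.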

In [90]:
from transformers import MarianTokenizer, MarianMTModel
import torch

# Get the name of the model
model_name = 'Helsinki-NLP/opus-mt-en-fr'

# Get the tokenizer
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Instantiate the model
model = MarianMTModel.from_pretrained(model_name)

In [17]:
# Get the text
news_data["title"][:2].tolist()

['NTSB says Autopilot engaged in 2018 California Tesla crash',
 'Unemployment falls to post-crash low of 5.2%']

In [91]:
def format_batch_texts(language_code, batch_texts):
  """
  Prefix each text with the target-language token (e.g. ">>fr<<"),
  the convention multi-target Marian models use to select the output language
  """
  formatted_batch = [">>{}<< {}".format(language_code, text) for text in batch_texts]
  return formatted_batch

def perform_translation(batch_texts, model, tokenizer, language="fr"):
  """
  Translate a batch of texts and return the translations as a list of strings
  """
  # Prepare the text data into the appropriate format for the model
  formatted_batch_texts = format_batch_texts(language, batch_texts)

  # Tokenize the batch, then generate the translation with the model
  inputs = tokenizer(formatted_batch_texts, return_tensors="pt", padding=True)
  translated = model.generate(**inputs)

  # Convert the generated token indices back into text
  translated_texts = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

  return translated_texts

# Check the model translation from the original language (English) to French
english_texts = news_data["title"].values # NumPy array; the tokenizer needs a list, which format_batch_texts builds
trans_model = model
trans_model_tkn = tokenizer
translated_texts = perform_translation(english_texts[0:10], trans_model, trans_model_tkn)

for orig_text, trans_text in zip(english_texts, translated_texts):
  print("Original text: \n", orig_text)
  print("Translation : \n", trans_text)
  print("")

Original text: 
 NTSB says Autopilot engaged in 2018 California Tesla crash
Translation : 
 NTSB dit pilote automatique engagé dans 2018 Californie Tesla accident

Original text: 
 Unemployment falls to post-crash low of 5.2%
Translation : 
 Le chômage tombe à un bas niveau post-crash de 5,2 %

Original text: 
 Louise Kennedy AW2019: Long coats, sparkling tweed dresses and emerald knits
Translation : 
 Louise Kennedy AW2019: Robes longues, robes de tweed étincelantes et tricots émeraudes

Original text: 
 North Korean footballer Han joins Italian giants Juventus
Translation : 
 Le footballeur nord-coréen Han rejoint les géants italiens Juventus

Original text: 
 UK government lawyer says proroguing parliament 'political not legal'
Translation : 
 L'avocat du gouvernement britannique dit prorogeant le parlement "politique pas légal"

Original text: 
 'This Tender Land' is an affecting story about growing up
Translation : 
 "Cette terre d'appel d'offres" est une histoire touchante sur le

However, translation can quickly become quite slow, at which point you will need more computing power.
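
One common way to keep memory use bounded when translating many titles is to process them in fixed-size batches instead of one large call. The sketch below assumes a translation callable with the same shape as `perform_translation` (a sequence of texts in, a list of translations out); `iter_batches`, `translate_in_batches` and the `fake_translate` stand-in are hypothetical names used only for illustration:

```python
def iter_batches(texts, batch_size):
    """Yield consecutive slices of `texts` with at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

def translate_in_batches(texts, translate_fn, batch_size=8):
    """Apply `translate_fn` batch by batch and concatenate the results."""
    translations = []
    for batch in iter_batches(texts, batch_size):
        translations.extend(translate_fn(batch))
    return translations

# Stand-in for the real model call so the example runs without downloading anything
def fake_translate(batch):
    return ["[fr] " + text for text in batch]

titles = ["title {}".format(i) for i in range(20)]
result = translate_in_batches(titles, fake_translate, batch_size=8)
print(len(result))  # 20
```

In the notebook, `translate_fn` could be `lambda batch: perform_translation(batch, trans_model, trans_model_tkn)`.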

## Zero-Shot Classification

In [30]:
# original text
english_texts[0]

'NTSB says Autopilot engaged in 2018 California Tesla crash'

In [43]:
# import transformers
from transformers import pipeline


# load the pipeline once, so the model is not re-instantiated on every call
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def classify_text(text, candidate_labels):
    # classify the text against the candidate labels
    prediction = classifier(text, candidate_labels)
    return prediction

def print_classification_result(prediction):
    print("\n Input sentence: ", prediction['sequence'])
    print("\n Zero-shot classification scores:")
    display(pd.DataFrame(prediction).drop(["sequence"], axis=1))

# list of candidate labels
candidate_labels = ["tech", "politics", "business", "finance"]

# example text
english_text = english_texts[0]

# classify the text and print the result
prediction = classify_text(english_text, candidate_labels)
print_classification_result(prediction)



 Input sentence:  NTSB says Autopilot engaged in 2018 California Tesla crash

 Zero-shot classification scores:


Unnamed: 0,labels,scores
0,tech,0.958968
1,finance,0.020088
2,business,0.015485
3,politics,0.00546


The first time the code above is run, and without using the `model` parameter, the `bart-large-mnli` model is downloaded to the folder `\.cache\huggingface\hub\models--facebook--bart-large-mnli`

In [44]:
# test with the last sentence
english_text = english_texts[-1]
prediction = classify_text(english_text, candidate_labels)
print_classification_result(prediction)


 Input sentence:  Love, Hate & Obsession

 Zero-shot classification scores:


Unnamed: 0,labels,scores
0,tech,0.448697
1,politics,0.199477
2,business,0.192651
3,finance,0.159175
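
In both tables the four scores sum to 1, so the top label always receives some share even when none of the candidates really fits, as with `Love, Hate & Obsession`. A possible safeguard is to accept the top label only above a minimum score. The sketch below works on the dictionary returned by the classifier (`labels` and `scores` sorted from best to worst, as in the outputs above); `top_label_or_none` and the `0.5` threshold are illustrative assumptions, not part of the original notebook:

```python
def top_label_or_none(prediction, min_score=0.5):
    """Return the best label if its score reaches `min_score`, else None."""
    best_label = prediction["labels"][0]
    best_score = prediction["scores"][0]
    return best_label if best_score >= min_score else None

# Values copied from the two outputs above
confident = {
    "sequence": "NTSB says Autopilot engaged in 2018 California Tesla crash",
    "labels": ["tech", "finance", "business", "politics"],
    "scores": [0.958968, 0.020088, 0.015485, 0.00546],
}
uncertain = {
    "sequence": "Love, Hate & Obsession",
    "labels": ["tech", "politics", "business", "finance"],
    "scores": [0.448697, 0.199477, 0.192651, 0.159175],
}

print(top_label_or_none(confident))  # tech
print(top_label_or_none(uncertain))  # None
```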


## Sentiment analysis

In [45]:
# model selection
model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

# model pipeline
distil_bert_model = pipeline(task="sentiment-analysis", model=model_checkpoint)



In [62]:
# Run the predictions
from tqdm import tqdm

# display the whole column with pandas
pd.set_option("display.max_colwidth", None)

sentence_list = english_texts[:10].tolist()
sentiment = []
score = []
for sentence in tqdm(sentence_list):
    result = distil_bert_model(sentence)
    sentiment.append(result[0]['label'])
    score.append(result[0]['score'])

pd.DataFrame({"sentence": sentence_list,"sentiment": sentiment, "score": score})

100%|██████████| 10/10 [00:00<00:00, 57.56it/s]


Unnamed: 0,sentence,sentiment,score
0,NTSB says Autopilot engaged in 2018 California Tesla crash,NEGATIVE,0.996138
1,Unemployment falls to post-crash low of 5.2%,NEGATIVE,0.999525
2,"Louise Kennedy AW2019: Long coats, sparkling tweed dresses and emerald knits",POSITIVE,0.99878
3,North Korean footballer Han joins Italian giants Juventus,POSITIVE,0.901961
4,UK government lawyer says proroguing parliament 'political not legal',NEGATIVE,0.999319
5,'This Tender Land' is an affecting story about growing up,POSITIVE,0.999831
6,EU wants to see if lawmakers will block Brexit before striking new deal: UK's Johnson,NEGATIVE,0.957
7,European third quarter profit outlook improves slightly but still in recession: Refinitv,NEGATIVE,0.994691
8,How are emotional support animals allowed on flights?,NEGATIVE,0.997652
9,Boris Johnson to meet Leo Varadkar in Dublin on Monday,POSITIVE,0.99913
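
The loop above calls the model once per sentence. A `transformers` pipeline also accepts a list of strings and then returns one `{'label': ..., 'score': ...}` dictionary per input, in the same order. The sketch below shows how such a result list could be turned into table rows; `to_rows` is a hypothetical helper, and `mock_results` reuses the first two scores from the output above instead of calling the model:

```python
def to_rows(sentences, results):
    """Pair each sentence with its predicted label and score."""
    return [
        {"sentence": sentence, "sentiment": result["label"], "score": result["score"]}
        for sentence, result in zip(sentences, results)
    ]

sentences = [
    "NTSB says Autopilot engaged in 2018 California Tesla crash",
    "Unemployment falls to post-crash low of 5.2%",
]
# In the notebook this would be: mock_results = distil_bert_model(sentences)
mock_results = [
    {"label": "NEGATIVE", "score": 0.996138},
    {"label": "NEGATIVE", "score": 0.999525},
]

rows = to_rows(sentences, mock_results)
for row in rows:
    print(row["sentiment"], round(row["score"], 3), row["sentence"])
```

Passing `rows` to `pd.DataFrame` then gives a table with the same three columns as above.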


## Question answering

In [63]:
# import the pipeline factory (already imported above; repeated so this cell runs on its own)
from transformers import pipeline

# specify the model
model_checkpoint = "deepset/roberta-base-squad2"

# specify the task
task = 'question-answering'

# instantiate the pipeline
QA_model = pipeline(task, model=model_checkpoint, tokenizer=model_checkpoint)



In [76]:
text_sample = news_data['content'][10204]
print(text_sample)

A top Pakistani health official says authorities are battling one of the worst-ever dengue fever outbreaks in the country, including the capital Islamabad as hospitals continued to receive scores of patients, putting strain on emergency services.
Rana Mohamm… [+562 chars]


In [82]:
# Q&A function
def get_model_response(question, context):
    QA_input = {
        'question': question,
        'context': context
    }

    model_response = QA_model(QA_input)

    return pd.DataFrame([model_response])

# Usage:
question = 'what is the capital of Pakistan?'
context = text_sample

result = get_model_response(question, context)
result

Unnamed: 0,score,start,end,answer
0,0.963601,145,154,Islamabad


In [88]:
result = get_model_response("what is the problem?", text_sample)
result

Unnamed: 0,score,start,end,answer
0,0.276675,84,106,dengue fever outbreaks
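
The `start` and `end` columns are character offsets into the context string, so slicing the context with them gives back the `answer`. A quick check using the first sentence of `text_sample` and the offsets from the two outputs above (`answer_span` is a hypothetical helper name):

```python
def answer_span(context, start, end):
    """Return the substring of `context` between the two character offsets."""
    return context[start:end]

# First sentence of text_sample, copied from the printed output above
context = (
    "A top Pakistani health official says authorities are battling one of "
    "the worst-ever dengue fever outbreaks in the country, including the "
    "capital Islamabad as hospitals continued to receive scores of patients, "
    "putting strain on emergency services."
)

print(answer_span(context, 145, 154))  # Islamabad
print(answer_span(context, 84, 106))   # dengue fever outbreaks
```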


To go further with Hugging Face, you can browse the available models at: https://huggingface.co/models