## Installation on Google Colab

The `transformers` package is not installed by default on Google Colab, so let's install it with pip:

In [1]:
!pip install transformers[sentencepiece]

Collecting transformers[sentencepiece]
  Downloading https://files.pythonhosted.org/packages/b5/d5/c6c23ad75491467a9a84e526ef2364e523d45e2b0fae28a7cbe8689e7e84/transformers-4.8.1-py3-none-any.whl (2.5MB)
Collecting huggingface-hub==0.0.12
  Downloading https://files.pythonhosted.org/packages/2f/ee/97e253668fda9b17e968b3f97b2f8e53aa0127e8807d24a547687423fe0b/huggingface_hub-0.0.12-py3-none-any.whl
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (

## Sentiment analysis in English

In this article, we will use the high-level pipeline interface, which makes it extremely easy to use pre-trained transformer models.

Basically, we just need to tell the pipeline which task to perform and, optionally, which model to use for it.

Here we're going to do sentiment analysis in English, so we select the `sentiment-analysis` task, and the default model: 

In [2]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")

The pipeline is ready, and we can now use it: 

In [3]:
classifier(["this is a great tutorial, thank you", 
            "your content just sucks"])

[{'label': 'POSITIVE', 'score': 0.9998582601547241},
 {'label': 'NEGATIVE', 'score': 0.9971919059753418}]

We sent two sentences through the pipeline. The first one is predicted to be positive and the second one negative, both with very high confidence.

Sounds good! 

Now let's see what happens if we send in French sentences:

In [4]:
classifier(["Ton tuto est vraiment bien", 
            "il est complètement nul"])

[{'label': 'POSITIVE', 'score': 0.7650704979896545},
 {'label': 'POSITIVE', 'score': 0.8282670974731445}]

This time, the classification does not work... 

Indeed, the second sentence, which means "it is complete crap", is classified as positive.

That's not a surprise: the default model for the sentiment analysis task has been trained on English text, so it does not understand French.

### Sentiment analysis in Dutch, German, French, Spanish and Italian

So what can you do if you want to work with text in another language, say French? 

You just need to search the hub for a [French classification model](https://huggingface.co/models?filter=fr&pipeline_tag=text-classification&sort=downloads).

Several models are available, and I decided to select [nlptown/bert-base-multilingual-uncased-sentiment](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment). 

We can specify this model as the one to be used when we create our `sentiment-analysis` pipeline: 

In [5]:
multilang_classifier = pipeline("sentiment-analysis", 
                                model="nlptown/bert-base-multilingual-uncased-sentiment")

In [6]:
multilang_classifier(["Ton tuto est vraiment bien", 
                      "il est complètement nul"])

[{'label': '5 stars', 'score': 0.5787978172302246},
 {'label': '1 star', 'score': 0.9223358035087585}]

And it worked! The second sentence is properly classified as very negative. 

You might be wondering why the confidence for the first sentence is lower. I'm pretty sure that it's because this sentence scores high on '4 stars' as well.
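
To make that intuition concrete, here is a minimal sketch that does not call the model at all. It assumes the scores come from a softmax over the five star classes, and it uses made-up logits (the numbers are hypothetical, chosen only for illustration) to show that when two neighbouring classes both score high, the top probability drops even though the predicted label stays the same.

```python
import math

LABELS = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=lambda i: probs[i])
    return LABELS[best], probs[best]

# Hypothetical logits, NOT taken from the model: "4 stars" and "5 stars" are close.
ambiguous_logits = [-2.0, -1.5, 0.0, 2.0, 2.4]
# Hypothetical logits where a single class clearly dominates.
clear_cut_logits = [4.0, 1.0, -1.0, -2.0, -2.5]

for name, logits in [("ambiguous", ambiguous_logits), ("clear-cut", clear_cut_logits)]:
    label, prob = top_prediction(logits)
    print(f"{name}: {label} with probability {prob:.2f}")
```

To check this on the real model, the pipeline can be asked to return the score of every label instead of only the best one. The argument that controls this depends on the installed version of `transformers`, so look it up in the documentation for your version.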

Now let's try with an actual review for a restaurant near my place: 

In [8]:
import pprint
sentence="Contente de pouvoir retourner au restaurant... Quelle déception... L accueil peu chaleureux... Un plat du jour plus disponible à 12h45...rien à me proposer à la place... Une pizza pas assez cuite et pour finir une glace pleine de glaçons... Et au gout très fade... Je pensais que les serveuses seraient plus aimable à l idée de retrouver leur clientèle.. Dommage"
pprint.pprint(sentence)

('Contente de pouvoir retourner au restaurant... Quelle déception... L accueil '
 'peu chaleureux... Un plat du jour plus disponible à 12h45...rien à me '
 'proposer à la place... Une pizza pas assez cuite et pour finir une glace '
 'pleine de glaçons... Et au gout très fade... Je pensais que les serveuses '
 'seraient plus aimable à l idée de retrouver leur clientèle.. Dommage')


In [9]:
multilang_classifier([sentence])

[{'label': '2 stars', 'score': 0.5843755602836609}]

2 stars! On Google Reviews, this review has 1 star, so the prediction is not far off.

### Translation 

In [None]:
en_to_fr = pipeline("translation_en_to_fr", 
                    model="Helsinki-NLP/opus-mt-en-fr")

In [None]:
en_to_fr("your tutorial is really good")

[{'translation_text': 'votre tutoriel est vraiment bon'}]

In [None]:
fr_to_en = pipeline("translation_fr_to_en", 
                    model="Helsinki-NLP/opus-mt-fr-en")

In [None]:
fr_to_en("ton tutoriel est super")

[{'translation_text': 'Your tutorial is great.'}]

### Zero-shot classification in French

In [None]:
classifier = pipeline("zero-shot-classification", 
                      model="BaptisteDoyen/camembert-base-xlni")

In [None]:
sequence = "Colin est en train d'écrire un article au sujet du traitement du langage naturel"
candidate_labels = ["science","politique","education", "news"]
classifier(sequence, candidate_labels)     

{'labels': ['science', 'news', 'education', 'politique'],
 'scores': [0.4613836407661438,
  0.20861364901065826,
  0.20573210716247559,
  0.12427058815956116],
 'sequence': "Colin est en train d'écrire un article au sujet du traitement du langage naturel"}
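
The result is a dictionary in which `labels` and `scores` are parallel lists, sorted here from the most to the least likely label. As a small sketch of how to read it, the snippet below works on a copy of the output above (so it runs without downloading the model) and pairs each label with its score. The `MARGIN` rule at the end is a hypothetical decision rule added for illustration; it is not part of the pipeline.

```python
# Output copied from the cell above, so this sketch runs without the model.
result = {
    "labels": ["science", "news", "education", "politique"],
    "scores": [0.4613836407661438, 0.20861364901065826,
               0.20573210716247559, 0.12427058815956116],
    "sequence": "Colin est en train d'écrire un article au sujet du traitement du langage naturel",
}

# labels and scores are parallel lists: zip pairs each label with its score.
ranked = list(zip(result["labels"], result["scores"]))
best_label, best_score = ranked[0]

for label, score in ranked:
    print(f"{label:<10} {score:.3f}")

# Hypothetical rule: trust the top label only if it beats the runner-up
# by at least MARGIN.
MARGIN = 0.1
confident = (best_score - ranked[1][1]) >= MARGIN
print("best:", best_label, "| confident:", confident)
```

Here 'science' wins by a comfortable margin, while 'news' and 'education' are almost tied for second place.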

### Writing summaries in French

In [None]:
summarizer = pipeline("summarization", 
                       model="moussaKam/barthez-orangesum-title")

In [None]:
import pprint
sentence = "Le premier tour des élections régionales, dimanche 20 juin, a été marqué par un niveau d’abstention inédit (66,7 %). Une première clé est de savoir si ce second tour mobilisera davantage les électeurs que le premier. Il a également fait apparaître un paysage politique fragmenté. Ce morcellement se retrouvera lors du second tour, dimanche 27 juin, lui aussi inédit à bien des égards."
pprint.pprint(sentence)

('Le premier tour des élections régionales, dimanche 20 juin, a été marqué par '
 'un niveau d’abstention inédit (66,7 %). Une première clé est de savoir si ce '
 'second tour mobilisera davantage les électeurs que le premier. Il a '
 'également fait apparaître un paysage politique fragmenté. Ce morcellement se '
 'retrouvera lors du second tour, dimanche 27 juin, lui aussi inédit à bien '
 'des égards.')


In [None]:
summarizer(sentence, max_length=50)

[{'summary_text': "Régionales : le niveau d'abstention inédit au second tour"}]

### Named entity recognition

In [None]:
ner = pipeline("token-classification", model="Jean-Baptiste/camembert-ner")

In [None]:
nes = ner("Colin est parti à Saint-André acheter de la mozzarella")
pprint.pprint(nes)

[{'end': 5,
  'entity': 'PER',
  'index': 1,
  'score': 0.94243556,
  'start': 0,
  'word': '▁Colin'},
 {'end': 23,
  'entity': 'LOC',
  'index': 5,
  'score': 0.99605554,
  'start': 17,
  'word': '▁Saint'},
 {'end': 24,
  'entity': 'LOC',
  'index': 6,
  'score': 0.9967083,
  'start': 23,
  'word': '-'},
 {'end': 29,
  'entity': 'LOC',
  'index': 7,
  'score': 0.99609375,
  'start': 24,
  'word': 'André'}]


In [None]:
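# Group consecutive sub-word tokens that share the same entity type
cur = None  # entity type of the group being built
agg = []    # tokens collected for the current group
for ne in nes:
  entity = ne['entity']
  if entity != cur:
    # the entity type changed: print the finished group, if any
    if agg:
      print(cur, ner.tokenizer.convert_tokens_to_string(agg))
      agg = []
    cur = entity
  agg.append(ne['word'])
# print the last group (nothing to print if no entity was found)
if agg:
  print(cur, ner.tokenizer.convert_tokens_to_string(agg))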


PER Colin
LOC Saint-André
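
The same grouping can also be sketched without the tokenizer, using the `start` and `end` character offsets that the pipeline returns. The snippet below is an illustration, not the method used in the cell above: it works on a copy of the token-level output shown earlier (scores omitted), and `merge_entities` is a hypothetical helper name. It groups consecutive tokens of the same entity type with `itertools.groupby` and slices the original sentence to recover the full entity text.

```python
from itertools import groupby

text = "Colin est parti à Saint-André acheter de la mozzarella"

# Token-level predictions copied from the output above (scores omitted).
tokens = [
    {"entity": "PER", "index": 1, "start": 0, "end": 5, "word": "▁Colin"},
    {"entity": "LOC", "index": 5, "start": 17, "end": 23, "word": "▁Saint"},
    {"entity": "LOC", "index": 6, "start": 23, "end": 24, "word": "-"},
    {"entity": "LOC", "index": 7, "start": 24, "end": 29, "word": "André"},
]

def merge_entities(tokens, text):
    """Merge runs of consecutive tokens that share an entity type."""
    merged = []
    for entity, run in groupby(tokens, key=lambda t: t["entity"]):
        run = list(run)
        # Slice the sentence from the first token's start to the last token's end;
        # strip() drops the leading space that the '▁' marker stands for.
        span = text[run[0]["start"]:run[-1]["end"]].strip()
        merged.append((entity, span))
    return merged

entities = merge_entities(tokens, text)
for entity, span in entities:
    print(entity, span)
```

Like the loop above, this sketch would merge two distinct entities of the same type if they happened to be adjacent in the token list. Recent versions of `transformers` also document an `aggregation_strategy` argument for the token classification pipeline that performs this grouping for you; check the documentation of your installed version before relying on it.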
