# pysentimiento: A multilingual toolkit for Sentiment Analysis and SocialNLP tasks

In this notebook we show a brief example of how to use [pysentimiento](https://github.com/pysentimiento/pysentimiento/), a multilingual toolkit for opinion mining and sentiment analysis (although focused on Spanish).

`pysentimiento` is a library that uses pre-trained [transformers](https://github.com/huggingface/transformers) models for different SocialNLP tasks. Its base models are [BETO](https://github.com/dccuchile/beto) and [RoBERTuito](https://github.com/pysentimiento/robertuito) for Spanish, BERTweet for English, and similar models for Italian and Portuguese.

 
First, let's install the library:

In [26]:
!pip install pysentimiento

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pysentimiento==0.6.1rc2
  Downloading pysentimiento-0.6.1rc2-py3-none-any.whl (36 kB)
Installing collected packages: pysentimiento
  Attempting uninstall: pysentimiento
    Found existing installation: pysentimiento 0.6.1rc1
    Uninstalling pysentimiento-0.6.1rc1:
      Successfully uninstalled pysentimiento-0.6.1rc1
Successfully installed pysentimiento-0.6.1rc2


Let's create an analyzer. The `create_analyzer` function takes the task and the language as parameters; this notebook uses "es" and "en".

In [27]:
from pysentimiento import create_analyzer
analyzer = create_analyzer(task="sentiment", lang="es")


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--pysentimiento--robertuito-sentiment-analysis/snapshots/12e030859ce19539e24b486ac84ffebb9b68ecf1/config.json
Model config RobertaConfig {
  "_name_or_path": "pysentimiento/robertuito-sentiment-analysis",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "NEG",
    "1": "NEU",
    "2": "POS"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "NEG": 0,
    "NEU": 1,
    "POS": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 130,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_


Let's check out some examples:

In [28]:
analyzer.predict("Qué gran jugador es Messi")

AnalyzerOutput(output=POS, probas={POS: 0.946, NEU: 0.037, NEG: 0.017})

In [29]:
analyzer.predict("Esto es pésimo")

AnalyzerOutput(output=NEG, probas={NEG: 0.887, NEU: 0.098, POS: 0.014})

In [30]:
analyzer.predict("Qué es esto?")

AnalyzerOutput(output=NEU, probas={NEU: 0.548, NEG: 0.412, POS: 0.041})
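
The probabilities printed above are worth reading alongside the predicted label: the third example is labelled NEU, but with a narrow lead over NEG. A minimal sketch of that check follows. It uses plain dictionaries copied from the outputs above as stand-ins for the analyzer results, and the helper `top_two_margin` is hypothetical, not part of `pysentimiento`.

```python
def top_two_margin(probas):
    """Return (best_label, margin), where margin is the gap between
    the two highest probabilities."""
    ranked = sorted(probas.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_label, round(best_p - second_p, 3)


# Probabilities copied from the three predictions above.
examples = {
    "Qué gran jugador es Messi": {"POS": 0.946, "NEU": 0.037, "NEG": 0.017},
    "Esto es pésimo": {"NEG": 0.887, "NEU": 0.098, "POS": 0.014},
    "Qué es esto?": {"NEU": 0.548, "NEG": 0.412, "POS": 0.041},
}

for sentence, probas in examples.items():
    label, margin = top_two_margin(probas)
    print(f"{sentence!r}: {label} (margin {margin})")
```

A small margin, as in the last sentence, suggests treating the label with more caution than the first two.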

### Batch prediction

Given a list of sentences, `pysentimiento` can predict them together in a single, more efficient batched call.

In [31]:
%%time
from tqdm.auto import tqdm
oraciones = [
    "Qué gran jugador es Messi",
    "Esto es pésimo",
    "No sé, cómo se llama?",
] * 20
for sent in tqdm(oraciones):
    analyzer.predict(sent)


CPU times: user 6.84 s, sys: 44.7 ms, total: 6.88 s
Wall time: 11.9 s


In [32]:
%%time
rets = analyzer.predict(oraciones)


The following columns in the test set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 60
  Batch size = 32
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


CPU times: user 5.01 s, sys: 518 ms, total: 5.52 s
Wall time: 8.23 s
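
The log above reports `Batch size = 32` for 60 examples, so the list is processed in two chunks rather than in 60 separate calls. A minimal sketch of that chunking in plain Python follows. The `chunks` helper is illustrative only; it is not how `pysentimiento` or `transformers` implement batching internally.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


oraciones = [
    "Qué gran jugador es Messi",
    "Esto es pésimo",
    "No sé, cómo se llama?",
] * 20

# 60 sentences with a batch size of 32 give one full and one partial batch.
batch_sizes = [len(batch) for batch in chunks(oraciones, 32)]
print(batch_sizes)  # [32, 28]
```

In the two runs above, wall time was 11.9 s for the per-sentence loop and 8.23 s for the single batched call.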


### Emojis

`pysentimiento` also handles emojis, using the [emoji](https://pypi.org/project/emoji/) library:

In [33]:
analyzer.predict("🤢")

AnalyzerOutput(output=NEG, probas={NEG: 0.936, NEU: 0.057, POS: 0.007})

Or hashtags:

In [34]:
analyzer.predict("#EstoEsUnaMierda")

AnalyzerOutput(output=NEG, probas={NEG: 0.976, NEU: 0.020, POS: 0.004})

## Emotion Analysis

`pysentimiento` provides emotion analysis through models pre-trained on the [EmoEvent](https://github.com/fmplaza/EmoEvent-multilingual-corpus/) datasets.

In [35]:
emotion_analyzer = create_analyzer(task="emotion", lang="en")

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--finiteautomata--bertweet-base-emotion-analysis/snapshots/64046df9cc41eab40e1ecde7d2b7fb42b971be5b/config.json
Model config RobertaConfig {
  "_name_or_path": "finiteautomata/bertweet-base-emotion-analysis",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "others",
    "1": "joy",
    "2": "sadness",
    "3": "anger",
    "4": "surprise",
    "5": "disgust",
    "6": "fear"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "anger": 3,
    "disgust": 5,
    "fear": 6,
    "joy": 1,
    "others": 0,
    "sadness": 2,
    "surprise": 4
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 130,
  "model_

In [36]:
emotion_analyzer.predict("This is so terrible...")

AnalyzerOutput(output=sadness, probas={sadness: 0.978, fear: 0.013, disgust: 0.003, others: 0.002, surprise: 0.002, anger: 0.001, joy: 0.001})

In [37]:
emotion_analyzer.predict("omg")

AnalyzerOutput(output=surprise, probas={surprise: 0.982, others: 0.007, fear: 0.003, joy: 0.003, sadness: 0.002, anger: 0.002, disgust: 0.001})

In [38]:
emotion_analyzer.predict("yayyyy")

AnalyzerOutput(output=joy, probas={joy: 0.879, others: 0.106, surprise: 0.005, anger: 0.005, sadness: 0.002, disgust: 0.002, fear: 0.002})

In [39]:
emotion_analyzer.predict("People in the world is really worried because of Coronavirus")

AnalyzerOutput(output=fear, probas={fear: 0.939, others: 0.043, surprise: 0.005, joy: 0.004, disgust: 0.004, sadness: 0.002, anger: 0.002})

## Hate Speech

`pysentimiento` also supports hate speech detection, with models trained on the [HatEval](https://competitions.codalab.org/competitions/19935) dataset.

In [40]:
hate_speech_analyzer = create_analyzer(task="hate_speech", lang="es")

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--pysentimiento--robertuito-hate-speech/snapshots/db125ee7be2ad74457b900ae49a7e0f14f7a496c/config.json
Model config RobertaConfig {
  "_name_or_path": "pysentimiento/robertuito-hate-speech",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "hateful",
    "1": "targeted",
    "2": "aggressive"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "aggressive": 2,
    "hateful": 0,
    "targeted": 1
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 130,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "probl

This model is a multi-label classifier that answers three questions at once:

- Is the message hateful or not?
- Is the hateful message targeted at a specific person or a group?
- Is the hateful message aggressive?

In [41]:
hate_speech_analyzer.predict("Esto es una mierda pero no es odio")

AnalyzerOutput(output=[], probas={hateful: 0.020, targeted: 0.006, aggressive: 0.016})

In [42]:
hate_speech_analyzer.predict("Esto es odio porque los inmigrantes deben ser aniquilados")

AnalyzerOutput(output=['hateful', 'aggressive'], probas={hateful: 0.902, targeted: 0.009, aggressive: 0.539})

In [43]:
hate_speech_analyzer.predict("Vaya guarra barata y de poca monta es Juana Pérez!")

AnalyzerOutput(output=['hateful', 'targeted', 'aggressive'], probas={hateful: 0.982, targeted: 0.982, aggressive: 0.964})
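
The three outputs above are consistent with each label being decided independently from its own probability. A minimal sketch of that reading follows. The 0.5 threshold is an assumption that happens to reproduce the `output` lists shown above, and `active_labels` is a hypothetical helper, not the library's own code.

```python
def active_labels(probas, threshold=0.5):
    """Return the labels whose probability exceeds `threshold`,
    in the order they appear in `probas`."""
    return [label for label, p in probas.items() if p > threshold]


# Probabilities copied from the three predictions above.
outputs = [
    {"hateful": 0.020, "targeted": 0.006, "aggressive": 0.016},
    {"hateful": 0.902, "targeted": 0.009, "aggressive": 0.539},
    {"hateful": 0.982, "targeted": 0.982, "aggressive": 0.964},
]

for probas in outputs:
    print(active_labels(probas))
```

Unlike the sentiment and emotion analyzers, the probabilities here do not need to sum to one, so zero, one or several labels can be active for the same text.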

## Token Labeling tasks

`pysentimiento` also features POS tagging and NER analyzers, designed specifically for Twitter data and built with the multilingual [LinCE](https://ritual.uh.edu/lince/) dataset.


In [44]:
ner_analyzer = create_analyzer("ner", lang="es")



loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--pysentimiento--robertuito-ner/snapshots/43dde6356afd3e8bf4f1b00a191b5122ccdfd9b3/config.json
Model config RobertaConfig {
  "_name_or_path": "pysentimiento/robertuito-ner",
  "architectures": [
    "RobertaForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-EVENT",
    "2": "I-EVENT",
    "3": "B-GROUP",
    "4": "I-GROUP",
    "5": "B-LOC",
    "6": "I-LOC",
    "7": "B-ORG",
    "8": "I-ORG",
    "9": "B-OTHER",
    "10": "I-OTHER",
    "11": "B-PER",
    "12": "I-PER",
    "13": "B-PROD",
    "14": "I-PROD",
    "15": "B-TIME",
    "16": "I-TIME",
    "17": "B-TITLE",
    "18": "I-TITLE"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label

In [45]:
ner_analyzer.predict("Me voy de vacaciones a República Dominicana 😎")

[{'type': 'LOC', 'text': 'República Dominicana', 'start': 23, 'end': 43}]

In [46]:
ner_analyzer.predict("Me llamo Juan Manuel Pérez y vivo en 🇦🇷😎")

[{'type': 'PER', 'text': 'Juan Manuel Pérez', 'start': 9, 'end': 26}]
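
The `start` and `end` values in these results look like character offsets into the input string, with an exclusive end. The sketch below checks that reading by slicing, then uses the offsets to mask entities. The entity list is copied from the first NER output above, and `mask_entities` is a hypothetical helper, not part of `pysentimiento`.

```python
def mask_entities(text, entities):
    """Replace each entity span with a [TYPE] placeholder.

    Spans are processed right to left so that earlier offsets stay valid
    while the string changes length.
    """
    masked = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        masked = masked[:ent["start"]] + f"[{ent['type']}]" + masked[ent["end"]:]
    return masked


text = "Me voy de vacaciones a República Dominicana 😎"
entities = [
    {"type": "LOC", "text": "República Dominicana", "start": 23, "end": 43},
]

# Slicing with the reported offsets recovers the entity text exactly.
for ent in entities:
    assert text[ent["start"]:ent["end"]] == ent["text"]

print(mask_entities(text, entities))
# Me voy de vacaciones a [LOC] 😎
```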

In [47]:
pos_tagger = create_analyzer("pos", "es")

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--pysentimiento--robertuito-pos/snapshots/c65b4a1da16bbf15cb89a7cadc4dbb7b11ccd22d/config.json
Model config RobertaConfig {
  "_name_or_path": "pysentimiento/robertuito-pos",
  "architectures": [
    "RobertaForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "B-VERB",
    "1": "B-PUNCT",
    "2": "B-PRON",
    "3": "B-NOUN",
    "4": "B-DET",
    "5": "B-ADV",
    "6": "B-ADP",
    "7": "B-INTJ",
    "8": "B-CONJ",
    "9": "B-ADJ",
    "10": "B-AUX",
    "11": "B-SCONJ",
    "12": "B-PART",
    "13": "B-PROPN",
    "14": "B-NUM",
    "15": "B-UNK",
    "16": "B-X"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-ADJ": 9,
    "B-ADP": 6,
    

In [48]:
pos_tagger.predict("Me llamo Juan Manuel Pérez y vivo en Argentina")

[{'type': 'PRON', 'text': 'Me', 'start': 0, 'end': 2},
 {'type': 'VERB', 'text': 'llamo', 'start': 3, 'end': 8},
 {'type': 'PROPN', 'text': 'Juan', 'start': 9, 'end': 13},
 {'type': 'PROPN', 'text': 'Manuel', 'start': 14, 'end': 20},
 {'type': 'PROPN', 'text': 'Pérez', 'start': 21, 'end': 26},
 {'type': 'CONJ', 'text': 'y', 'start': 27, 'end': 28},
 {'type': 'VERB', 'text': 'vivo', 'start': 29, 'end': 33},
 {'type': 'ADP', 'text': 'en', 'start': 34, 'end': 36},
 {'type': 'PROPN', 'text': 'Argentina', 'start': 37, 'end': 46}]
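
Because the tagger returns one entry per token, multi-word names arrive as several consecutive `PROPN` tokens. A minimal sketch of grouping them back into spans follows. The token list is copied from the output above, and `merge_runs` is a hypothetical post-processing helper, not something the library provides.

```python
def merge_runs(text, tokens, tag="PROPN"):
    """Merge runs of consecutive tokens labelled `tag` and return the
    corresponding substrings of `text`.

    "Consecutive" means adjacent in the token list, so the whitespace
    between merged tokens is kept as it appears in `text`.
    """
    spans = []
    current = None
    for tok in tokens:
        if tok["type"] == tag:
            if current is None:
                current = [tok["start"], tok["end"]]
            else:
                current[1] = tok["end"]  # extend the open run
        elif current is not None:
            spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return [text[start:end] for start, end in spans]


text = "Me llamo Juan Manuel Pérez y vivo en Argentina"
tokens = [
    {"type": "PRON", "text": "Me", "start": 0, "end": 2},
    {"type": "VERB", "text": "llamo", "start": 3, "end": 8},
    {"type": "PROPN", "text": "Juan", "start": 9, "end": 13},
    {"type": "PROPN", "text": "Manuel", "start": 14, "end": 20},
    {"type": "PROPN", "text": "Pérez", "start": 21, "end": 26},
    {"type": "CONJ", "text": "y", "start": 27, "end": 28},
    {"type": "VERB", "text": "vivo", "start": 29, "end": 33},
    {"type": "ADP", "text": "en", "start": 34, "end": 36},
    {"type": "PROPN", "text": "Argentina", "start": 37, "end": 46},
]

print(merge_runs(text, tokens))
# ['Juan Manuel Pérez', 'Argentina']
```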

## Preprocessing

`pysentimiento` features a tweet preprocessing module with various options for handling hashtags, emojis, character repetition, and so on.

In [49]:
from pysentimiento.preprocessing import preprocess_tweet

preprocess_tweet("📢 @realDonaldTrump ha sido banneado de Twitter #BreakingNews")

'emoji altavoz de mano emoji  @usuario ha sido banneado de Twitter hashtag breaking news'

In [50]:
preprocess_tweet("📢 @realDonaldTrump ha sido banneado de Twitter #BreakingNews", preprocess_handles=False, demoji=False)

'📢 @realDonaldTrump ha sido banneado de Twitter hashtag breaking news'
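
To make the two text transformations visible in the outputs above more concrete, here is a simplified, regex-based sketch of handle replacement and hashtag splitting. It only illustrates the idea and is not the implementation of `preprocess_tweet`. The function names are hypothetical, the patterns only split ASCII camel case, and emoji handling is left out.

```python
import re


def replace_handles(text, placeholder="@usuario"):
    """Replace every @handle with a fixed placeholder."""
    return re.sub(r"@\w+", placeholder, text)


def _expand(match):
    body = match.group(1)
    # Split camel case: capitalised words, runs of capitals, digit runs.
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", body)
    if not words:  # e.g. non-ASCII only: keep the hashtag body as is
        words = [body]
    return "hashtag " + " ".join(word.lower() for word in words)


def expand_hashtags(text):
    """Rewrite '#CamelCase' as 'hashtag camel case'."""
    return re.sub(r"#(\w+)", _expand, text)


tweet = "@realDonaldTrump ha sido banneado de Twitter #BreakingNews"
print(expand_hashtags(replace_handles(tweet)))
# @usuario ha sido banneado de Twitter hashtag breaking news
```

Skipping `replace_handles` leaves the handle untouched while still expanding the hashtag, which mirrors the effect of `preprocess_handles=False` in the second call above.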

In [51]:
preprocess_tweet??