<a href="https://colab.research.google.com/github/rvats20/pysentimiento/blob/master/notebooks/examples/pysentimiento_sentiment_analysis_in_spanish.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# pysentimiento: A multilingual toolkit for Sentiment Analysis and SocialNLP tasks

In this notebook we show a brief example of how to use [pysentimiento](https://github.com/pysentimiento/pysentimiento/), a multilingual toolkit for opinion mining and sentiment analysis (with a focus on Spanish).

`pysentimiento` is a library that uses pre-trained [transformers](https://github.com/huggingface/transformers) models for several SocialNLP tasks. Its base models are [BETO](https://github.com/dccuchile/beto) and [RoBERTuito](https://github.com/pysentimiento/robertuito) for Spanish, BERTweet for English, and similar models for Italian and Portuguese.


First, let's install the library:

In [None]:
!pip install pysentimiento


Let's create an analyzer. The `create_analyzer` function takes the task and the language as parameters.

In [None]:
from pysentimiento import create_analyzer
import transformers

transformers.logging.set_verbosity(transformers.logging.ERROR)

analyzer = create_analyzer(task="sentiment", lang="es")



Let's check out some examples:

In [None]:
analyzer.predict("Qué gran jugador es Messi")

AnalyzerOutput(output=POS, probas={POS: 0.946, NEU: 0.037, NEG: 0.017})

In [None]:
analyzer.predict("Esto es pésimo")

AnalyzerOutput(output=NEG, probas={NEG: 0.887, NEU: 0.098, POS: 0.014})

In [None]:
analyzer.predict("Qué es esto?")

AnalyzerOutput(output=NEU, probas={NEU: 0.548, NEG: 0.412, POS: 0.041})
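The printed results suggest that `output` is simply the label with the highest probability in `probas`. The sketch below reproduces that relationship with plain dictionaries copied from the outputs above. `top_label` and `margin` are hypothetical helpers written for illustration, not part of `pysentimiento`.

```python
# Probabilities copied from the three predictions shown above.
examples = {
    "Qué gran jugador es Messi": {"POS": 0.946, "NEU": 0.037, "NEG": 0.017},
    "Esto es pésimo": {"NEG": 0.887, "NEU": 0.098, "POS": 0.014},
    "Qué es esto?": {"NEU": 0.548, "NEG": 0.412, "POS": 0.041},
}


def top_label(probas):
    """Return the label with the highest probability (hypothetical helper)."""
    return max(probas, key=probas.get)


def margin(probas):
    """Gap between the two most likely labels; a small gap means an uncertain prediction."""
    first, second = sorted(probas.values(), reverse=True)[:2]
    return first - second


for text, probas in examples.items():
    print(f"{text!r}: {top_label(probas)} (margin {margin(probas):.3f})")
```

The third sentence has a margin of roughly 0.14 between `NEU` and `NEG`, far smaller than the other two, so its label deserves less trust than the headline `output` alone conveys.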

### Batch prediction

If we have a collection of sentences, `pysentimiento` can predict on all of them together efficiently:

In [None]:
%%time
from tqdm.auto import tqdm
oraciones = [
    "Qué gran jugador es Messi",
    "Esto es pésimo",
    "No sé, cómo se llama?",
] * 20
for sent in tqdm(oraciones):
    analyzer.predict(sent)

CPU times: user 1.17 s, sys: 13.8 ms, total: 1.19 s
Wall time: 1.18 s


In [None]:
%%time
rets = analyzer.predict(oraciones)

The following columns in the test set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 60
  Batch size = 32
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


CPU times: user 3.98 s, sys: 522 ms, total: 4.5 s
Wall time: 4.48 s
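The log above reports 60 examples and a batch size of 32. As a rough illustration of what batching means, the sketch below splits a list into fixed-size chunks. `chunks` is a hypothetical helper, and nothing here describes how `pysentimiento` batches internally.

```python
def chunks(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


oraciones = [
    "Qué gran jugador es Messi",
    "Esto es pésimo",
    "No sé, cómo se llama?",
] * 20

batches = list(chunks(oraciones, batch_size=32))
print([len(batch) for batch in batches])  # [32, 28]
```

With 60 sentences and a batch size of 32, the list splits into one full batch and one partial batch of 28.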


### Emojis

It also supports emojis, through the [emoji](https://pypi.org/project/emoji/) library:

In [None]:
analyzer.predict("🤢")

AnalyzerOutput(output=NEG, probas={NEG: 0.936, NEU: 0.057, POS: 0.007})

Or hashtags:

In [None]:
analyzer.predict("#EstoEsUnaMierda")

AnalyzerOutput(output=NEG, probas={NEG: 0.976, NEU: 0.020, POS: 0.004})

## Emotion Analysis

`pysentimiento` provides emotion analysis through models pre-trained on the [EmoEvent](https://github.com/fmplaza/EmoEvent-multilingual-corpus/) datasets.

In [None]:
emotion_analyzer = create_analyzer(task="emotion", lang="en")

loading configuration file config.json from cache at /users/jmperez/.cache/huggingface/hub/models--finiteautomata--bertweet-base-emotion-analysis/snapshots/c482c9e1750a29dcc393234816bcf468ff77cd2d/config.json
Model config RobertaConfig {
  "_name_or_path": "finiteautomata/bertweet-base-emotion-analysis",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "others",
    "1": "joy",
    "2": "sadness",
    "3": "anger",
    "4": "surprise",
    "5": "disgust",
    "6": "fear"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "anger": 3,
    "disgust": 5,
    "fear": 6,
    "joy": 1,
    "others": 0,
    "sadness": 2,
    "surprise": 4
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 130,


In [None]:
emotion_analyzer.predict("This is so terrible...")

AnalyzerOutput(output=sadness, probas={sadness: 0.978, fear: 0.013, disgust: 0.003, others: 0.002, surprise: 0.002, anger: 0.001, joy: 0.001})

In [None]:
emotion_analyzer.predict("omg")

AnalyzerOutput(output=surprise, probas={surprise: 0.982, others: 0.007, fear: 0.003, joy: 0.003, sadness: 0.002, anger: 0.002, disgust: 0.001})

In [None]:
emotion_analyzer.predict("yayyyy")

AnalyzerOutput(output=joy, probas={joy: 0.879, others: 0.106, surprise: 0.005, anger: 0.005, sadness: 0.002, disgust: 0.002, fear: 0.002})

In [None]:
emotion_analyzer.predict("People in the world is really worried because of Coronavirus")

AnalyzerOutput(output=fear, probas={fear: 0.939, others: 0.043, surprise: 0.005, joy: 0.004, disgust: 0.004, sadness: 0.002, anger: 0.002})
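The `id2label` mapping in the configuration dump above ties each output index to an emotion name. Here is a minimal sketch of how a classifier's raw scores could become the probabilities shown, assuming a standard softmax over seven logits. The logit values are invented for illustration and do not come from the model.

```python
import math

# Mapping copied from the configuration dump above.
id2label = {
    0: "others", 1: "joy", 2: "sadness", 3: "anger",
    4: "surprise", 5: "disgust", 6: "fear",
}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits, one per class index (not real model output).
logits = [0.5, -1.0, 0.0, -0.5, 0.2, -0.8, 4.0]

probas = {id2label[i]: p for i, p in enumerate(softmax(logits))}
ranked = sorted(probas.items(), key=lambda item: item[1], reverse=True)
print(ranked[0][0])  # fear
```

Sorting by probability gives the same descending order in which `probas` is printed in the outputs above.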

## Hate Speech

`pysentimiento` also supports hate speech detection, using models trained on the [HatEval](https://competitions.codalab.org/competitions/19935) dataset.

In [None]:
hate_speech_analyzer = create_analyzer(task="hate_speech", lang="es")

loading configuration file config.json from cache at /users/jmperez/.cache/huggingface/hub/models--pysentimiento--robertuito-hate-speech/snapshots/db125ee7be2ad74457b900ae49a7e0f14f7a496c/config.json
Model config RobertaConfig {
  "_name_or_path": "pysentimiento/robertuito-hate-speech",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "hateful",
    "1": "targeted",
    "2": "aggressive"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "aggressive": 2,
    "hateful": 0,
    "targeted": 1
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 130,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",

This model is a multi-label classifier that returns three outputs at once:

- Is the message hateful or not?
- Is the hateful message targeted at a specific person or a group?
- Is the hateful message aggressive?

In [None]:
hate_speech_analyzer.predict("Esto es una mierda pero no es odio")

AnalyzerOutput(output=[], probas={hateful: 0.020, targeted: 0.006, aggressive: 0.016})

In [None]:
hate_speech_analyzer.predict("Esto es odio porque los inmigrantes deben ser aniquilados")

AnalyzerOutput(output=['hateful', 'aggressive'], probas={hateful: 0.902, targeted: 0.009, aggressive: 0.539})

In [None]:
hate_speech_analyzer.predict("Vaya guarra barata y de poca monta es Juana Pérez!")

AnalyzerOutput(output=['hateful', 'targeted', 'aggressive'], probas={hateful: 0.982, targeted: 0.982, aggressive: 0.964})
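In the three outputs above, a label appears in `output` exactly when its probability exceeds 0.5. The sketch below applies that rule to the printed probabilities. The 0.5 threshold is an assumption inferred from these examples, and `active_labels` is a hypothetical helper rather than a library function.

```python
def active_labels(probas, threshold=0.5):
    """Return the labels whose probability is above `threshold` (assumed cutoff)."""
    return [label for label, p in probas.items() if p > threshold]


# Probabilities copied from the three predictions above.
not_hate = {"hateful": 0.020, "targeted": 0.006, "aggressive": 0.016}
group_hate = {"hateful": 0.902, "targeted": 0.009, "aggressive": 0.539}
targeted_hate = {"hateful": 0.982, "targeted": 0.982, "aggressive": 0.964}

print(active_labels(not_hate))       # []
print(active_labels(group_hate))     # ['hateful', 'aggressive']
print(active_labels(targeted_hate))  # ['hateful', 'targeted', 'aggressive']
```

Unlike the sentiment and emotion tasks, where exactly one label wins, each label here is decided independently, so the result can be empty or contain several labels.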

## Token Labeling tasks

`pysentimiento` also features POS tagging and NER analyzers, specifically designed for Twitter data, thanks to the multilingual [LinCE](https://ritual.uh.edu/lince/) dataset.


In [None]:
ner_analyzer = create_analyzer("ner", lang="es")

loading configuration file config.json from cache at /users/jmperez/.cache/huggingface/hub/models--pysentimiento--robertuito-ner/snapshots/43dde6356afd3e8bf4f1b00a191b5122ccdfd9b3/config.json
Model config RobertaConfig {
  "_name_or_path": "pysentimiento/robertuito-ner",
  "architectures": [
    "RobertaForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-EVENT",
    "2": "I-EVENT",
    "3": "B-GROUP",
    "4": "I-GROUP",
    "5": "B-LOC",
    "6": "I-LOC",
    "7": "B-ORG",
    "8": "I-ORG",
    "9": "B-OTHER",
    "10": "I-OTHER",
    "11": "B-PER",
    "12": "I-PER",
    "13": "B-PROD",
    "14": "I-PROD",
    "15": "B-TIME",
    "16": "I-TIME",
    "17": "B-TITLE",
    "18": "I-TITLE"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,

In [None]:
ner_analyzer.predict("Me voy de vacaciones a República Dominicana 😎")

TokenClassificationOutput(entities=[República Dominicana (LOC)], tokens=['Me', 'voy', 'de', 'vacaciones', 'a', 'República', 'Dominicana', '😎'], labels=['O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O'])

In [None]:
ner_analyzer.predict("Me llamo Juan Manuel Pérez y vivo en 🇦🇷😎")

TokenClassificationOutput(entities=[Juan Manuel Pérez (PER)], tokens=['Me', 'llamo', 'Juan', 'Manuel', 'Pérez', 'y', 'vivo', 'en', '🇦', '🇷', '😎'], labels=['O', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O'])
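The `labels` list uses the BIO scheme: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. The sketch below shows one way to group such labels into the spans listed under `entities`. `bio_to_entities` is a hypothetical helper, not the library's own decoding code.

```python
def bio_to_entities(tokens, labels):
    """Group B-/I- tagged tokens into (text, type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" closes any open entity; a stray I- tag is dropped in this sketch.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Tokens and labels copied from the first NER output above.
tokens = ["Me", "voy", "de", "vacaciones", "a", "República", "Dominicana", "😎"]
labels = ["O", "O", "O", "O", "O", "B-LOC", "I-LOC", "O"]

print(bio_to_entities(tokens, labels))  # [('República Dominicana', 'LOC')]
```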

In [None]:
pos_tagger = create_analyzer("pos", "es")

loading configuration file config.json from cache at /users/jmperez/.cache/huggingface/hub/models--pysentimiento--robertuito-pos/snapshots/c65b4a1da16bbf15cb89a7cadc4dbb7b11ccd22d/config.json
Model config RobertaConfig {
  "_name_or_path": "pysentimiento/robertuito-pos",
  "architectures": [
    "RobertaForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "B-VERB",
    "1": "B-PUNCT",
    "2": "B-PRON",
    "3": "B-NOUN",
    "4": "B-DET",
    "5": "B-ADV",
    "6": "B-ADP",
    "7": "B-INTJ",
    "8": "B-CONJ",
    "9": "B-ADJ",
    "10": "B-AUX",
    "11": "B-SCONJ",
    "12": "B-PART",
    "13": "B-PROPN",
    "14": "B-NUM",
    "15": "B-UNK",
    "16": "B-X"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-ADJ": 9,
    "B-ADP"

In [None]:
pos_tagger.predict("Me llamo Juan Manuel Pérez y vivo en Argentina")

TokenClassificationOutput(tokens=['Me', 'llamo', 'Juan', 'Manuel', 'Pérez', 'y', 'vivo', 'en', 'Argentina'], labels=['PRON', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'CONJ', 'VERB', 'ADP', 'PROPN'])
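The configuration dump lists tags with a `B-` prefix (for example `B-VERB`), while the prediction prints bare tags such as `VERB`. A small sketch of pairing tokens with tags and normalizing the prefix follows. `strip_prefix` is a hypothetical helper, and the two lists are copied from the output above.

```python
from collections import Counter

# Tokens and labels copied from the POS output above.
tokens = ["Me", "llamo", "Juan", "Manuel", "Pérez", "y", "vivo", "en", "Argentina"]
labels = ["PRON", "VERB", "PROPN", "PROPN", "PROPN", "CONJ", "VERB", "ADP", "PROPN"]


def strip_prefix(label):
    """Drop a leading 'B-' such as the one used in the model's id2label mapping."""
    return label[2:] if label.startswith("B-") else label


tagged = list(zip(tokens, (strip_prefix(label) for label in labels)))
proper_nouns = [token for token, tag in tagged if tag == "PROPN"]
tag_counts = Counter(tag for _, tag in tagged)

print(proper_nouns)  # ['Juan', 'Manuel', 'Pérez', 'Argentina']
```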

## Preprocessing

`pysentimiento` features a tweet preprocessing module with various options for handling hashtags, emojis, character repetition, and so on.

In [None]:
from pysentimiento.preprocessing import preprocess_tweet

preprocess_tweet("📢 @realDonaldTrump ha sido banneado de Twitter #BreakingNews")

'emoji altavoz de mano emoji  @usuario ha sido banneado de Twitter hashtag breaking news'

In [None]:
preprocess_tweet("📢 @realDonaldTrump ha sido banneado de Twitter #BreakingNews", preprocess_handles=False, demoji=False)

'📢 @realDonaldTrump ha sido banneado de Twitter hashtag breaking news'
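To make the transformations in the two calls above more concrete, here is a toy normalizer built on regular expressions. It is a sketch under stated assumptions, not the library's implementation: `simple_preprocess` and `split_hashtag` are hypothetical names, the hashtag splitter only handles ASCII CamelCase, and emojis are left untouched.

```python
import re


def split_hashtag(match):
    """Turn a match for '#BreakingNews' into 'hashtag breaking news' (ASCII only)."""
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", match.group(1))
    return "hashtag " + " ".join(word.lower() for word in words)


def simple_preprocess(text, user_token="@usuario", shorten=3):
    """Toy normalizer: mask handles, split hashtags, cap character repetition."""
    text = re.sub(r"@\w+", user_token, text)
    text = re.sub(r"#(\w+)", split_hashtag, text)
    # Collapse runs of more than `shorten` identical characters down to `shorten`.
    text = re.sub(r"(.)\1{%d,}" % shorten, lambda m: m.group(1) * shorten, text)
    return text


print(simple_preprocess("@realDonaldTrump ha sido banneado de Twitter #BreakingNews"))
print(simple_preprocess("holaaaaaa"))
```

The first call prints `@usuario ha sido banneado de Twitter hashtag breaking news`, and the second prints `holaaa`.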

In [None]:
preprocess_tweet??

Signature:
preprocess_tweet(
    text,
    lang='es',
    user_token=None,
    url_token=None,
    preprocess_hashtags=True,
    hashtag_token=None,
    char_replace=True,
    demoji=True,
    shorten=3,
    normalize_laughter=True,
    emoji_wrapper='emoji',
    preproces