# Natural Language Processing

Rodrigo S. Cortez Madrigal

<img src="https://pcic.posgrado.unam.mx/wp-content/uploads/Ciencia-e-Ingenieria-de-la-Computacion_color.png" alt="Logo PCIC" width="128" />  

In [1]:
import numpy as np
import pandas as pd
import re
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

from plotly import graph_objs as go
from plotly import express as px
from plotly.subplots import make_subplots


## Sentiment Analysis

In [2]:

text = "I actually don't think this comment will be classified correctly, " \
"because it has happy words, and I'm happy while writing it, " \
"even if I'm saying something that is not beneficial for the application itself."

### VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based and lexicon-based sentiment analysis model, developed by C.J. Hutto and Eric Gilbert in 2014. It is tuned specifically for social media text and is designed to be fast and easy to use. VADER requires no training, but its lexicon and rules target English, so it is not a good fit for text in other languages without translation.

In [3]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

score = analyzer.polarity_scores(text)

print(score)

fig = px.bar(x=list(score.keys()), y=list(score.values()))
fig.show()

{'neg': 0.059, 'neu': 0.76, 'pos': 0.181, 'compound': 0.7179}
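The `compound` value is the one usually used to assign a single label. A minimal sketch, assuming the thresholds suggested in the VADER README (`compound >= 0.05` positive, `compound <= -0.05` negative, otherwise neutral). The helper name `label_from_compound` is hypothetical, and the scores are copied from the output above:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the VADER output above.
score = {'neg': 0.059, 'neu': 0.76, 'pos': 0.181, 'compound': 0.7179}
label = label_from_compound(score['compound'])
print(label)
```

With a compound score of 0.7179 the comment is labeled positive, which is the misclassification the example text anticipates.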


### TextBlob

In [11]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

In [13]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [14]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("spacytextblob")

<spacytextblob.spacytextblob.SpacyTextBlob at 0x33ae1be00>

In [19]:
doc = nlp(text)

print(f'Polarity: {doc._.blob.polarity}')
print(f'Subjectivity: {doc._.blob.subjectivity}')

fig = px.bar(x=['Polarity', 'Subjectivity'], y=[doc._.blob.polarity, doc._.blob.subjectivity])
fig.show()

Polarity: 0.5333333333333333
Subjectivity: 0.7000000000000001


In [25]:
print(doc._.blob.sentiment_assessments.assessments)

[(['actually'], 0.0, 0.1, None), (['happy'], 0.8, 1.0, None), (['happy'], 0.8, 1.0, None)]


In [26]:
# ['happy'] is the most positive word in the text, with a polarity of 0.8 and a subjectivity of 1.0
# ['actually'] is the most neutral word in the text, with a polarity of 0.0 and a subjectivity of 0.1
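The document-level scores appear to be the plain averages of the per-word assessments listed above. A short check, using only the numbers printed by the previous cells (the variable names are illustrative):

```python
# Assessments copied from the output above: (words, polarity, subjectivity, label)
assessments = [
    (['actually'], 0.0, 0.1, None),
    (['happy'], 0.8, 1.0, None),
    (['happy'], 0.8, 1.0, None),
]

# Average the per-word polarity and subjectivity values.
polarity = sum(a[1] for a in assessments) / len(assessments)
subjectivity = sum(a[2] for a in assessments) / len(assessments)

print(f'Polarity: {polarity:.4f}')
print(f'Subjectivity: {subjectivity:.4f}')
```

Both averages agree with the reported polarity and subjectivity, so the two occurrences of "happy" drive the result. No other word, including "beneficial", appears among the assessments.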

### pysentimiento

In [27]:
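from pysentimiento import create_analyzer
# Note: lang="es" loads the Spanish RoBERTuito model (see the config log below), while `text` is English;
# lang="en" would select pysentimiento's English model instead.
analyzer = create_analyzer(task="sentiment", lang="es")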

loading configuration file https://huggingface.co/pysentimiento/robertuito-sentiment-analysis/resolve/main/config.json from cache at /Users/roicort/.cache/huggingface/transformers/034fd09e9530137fb6e6c042529972a92619fb02df8b40e7a4cfc50090943c46.ba567638740ab836f48b011b60649b828abc78b1aafda381bf9ac862d58d1ff5
Model config RobertaConfig {
  "_name_or_path": "pysentimiento/robertuito-sentiment-analysis",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "NEG",
    "1": "NEU",
    "2": "POS"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "NEG": 0,
    "NEU": 1,
    "POS": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 130,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "

In [8]:
prediction = analyzer.predict(text)

In [9]:
prediction

AnalyzerOutput(output=NEU, probas={NEU: 0.448, NEG: 0.318, POS: 0.234})
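The reported `output` corresponds to the class with the highest probability. A minimal sketch with the probabilities copied from the output above into a plain dict (the string keys and the variable names are assumptions made for illustration):

```python
# Probabilities copied from the AnalyzerOutput above.
probas = {"NEU": 0.448, "NEG": 0.318, "POS": 0.234}

# The predicted label is the class with the highest probability.
top_label = max(probas, key=probas.get)

# Gap between the best and second-best class, as a rough confidence indicator.
margin = probas[top_label] - sorted(probas.values())[-2]

print(top_label, round(margin, 3))
```

The gap between the top two classes is small, so the neutral label should be read as a low-confidence prediction.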

In [10]:
# Plot the predicted class probabilities

fig = px.bar(x=list(prediction.probas.keys()), y=list(prediction.probas.values()))
fig.show()