# **Sentiment Analysis Application**

This activity performs sentiment analysis on food reviews and comments written by Amazon users, using several natural language processing techniques.

In [None]:
!pip install nltk

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

In [None]:
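nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # POS tagger
nltk.download('vader_lexicon')               # lexicon needed by SentimentIntensityAnalyzer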

We start by loading the review dataset and taking a first look at it.

In [None]:
# Read only the first 500 reviews to keep the experiments fast
df = pd.read_csv('Reviews.csv', nrows=500)
print(df.shape)

(500, 10)


In [None]:
df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


As an example, we take the review at index 190 of the dataset. We first tokenize its text into words and then apply part-of-speech tagging to inspect its structure.

In [None]:
example = df['Text'][190]
print(example)

This coffee is great because it's all organic ingredients!  No pesticides to worry about, plus it tastes good, and you have the healing effects for ganoderma.


In [None]:
# Tokenization
tokens = nltk.word_tokenize(example)
tokens[:10]

['This',
 'coffee',
 'is',
 'great',
 'because',
 'it',
 "'s",
 'all',
 'organic',
 'ingredients']

In [None]:
# Part-of-speech tagging
tagged = nltk.pos_tag(tokens)
tagged[:10]

[('This', 'DT'),
 ('coffee', 'NN'),
 ('is', 'VBZ'),
 ('great', 'JJ'),
 ('because', 'IN'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('all', 'DT'),
 ('organic', 'JJ'),
 ('ingredients', 'NNS')]
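
The tagged pairs can be filtered by tag to pull out the words most likely to carry sentiment. A minimal sketch, reusing the ten pairs printed above and assuming that adjectives (Penn Treebank tags starting with `JJ`) are the words of interest; `words_with_tag` is a hypothetical helper, not part of NLTK:

```python
from collections import Counter

# The first ten (token, tag) pairs from the output above
tagged = [
    ('This', 'DT'), ('coffee', 'NN'), ('is', 'VBZ'), ('great', 'JJ'),
    ('because', 'IN'), ('it', 'PRP'), ("'s", 'VBZ'), ('all', 'DT'),
    ('organic', 'JJ'), ('ingredients', 'NNS'),
]

def words_with_tag(pairs, prefix):
    """Return the tokens whose POS tag starts with `prefix` (hypothetical helper)."""
    return [word for word, tag in pairs if tag.startswith(prefix)]

# Adjectives in this fragment, and how often each tag occurs
adjectives = words_with_tag(tagged, 'JJ')
tag_counts = Counter(tag for _, tag in tagged)

print(adjectives)
print(tag_counts.most_common(3))
```

Here the adjectives are `great` and `organic`, which matches the positive tone of the review.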

Next we move on to the sentiment analysis itself.

In [None]:
!pip install transformers

In [None]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm
from transformers import pipeline

In [None]:
sentiment_analyzer = SentimentIntensityAnalyzer()

We now apply a pretrained RoBERTa model, `cardiffnlp/twitter-roberta-base-sentiment`, which was trained for sentiment analysis on tweets. It scores each text as negative, neutral, or positive.

In [None]:
pretrained_model = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model)

In [None]:
encoded_text = tokenizer(example, return_tensors='pt')
output = model(**encoded_text)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
scores_dict = {
    'roberta_neg' : scores[0],
    'roberta_neu' : scores[1],
    'roberta_pos' : scores[2]
}
print(scores_dict)

{'roberta_neg': 0.0021076456, 'roberta_neu': 0.016815921, 'roberta_pos': 0.9810765}
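
The model returns raw logits, and `softmax` turns them into probabilities that sum to 1. A minimal pure-Python sketch of that step, using made-up logits (the values below are illustrative, not taken from the model):

```python
import math

def softmax_sketch(logits):
    """Convert raw scores to probabilities (the max is subtracted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits in the model's label order: negative, neutral, positive
logits = [-2.0, 0.5, 3.0]
probs = softmax_sketch(logits)
labels = ['roberta_neg', 'roberta_neu', 'roberta_pos']
print(dict(zip(labels, probs)))
```

The largest logit always receives the largest probability, so the ranking of the three classes is unchanged by the transformation.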


In [None]:
def polarity_scores_roberta(example):
    """Return the negative, neutral, and positive probabilities for one text."""
    encoded_text = tokenizer(example, return_tensors='pt')
    output = model(**encoded_text)
    # output[0] holds the logits; take the first (and only) item in the batch
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    scores_dict = {
        'roberta_neg' : scores[0],
        'roberta_neu' : scores[1],
        'roberta_pos' : scores[2]
    }
    return scores_dict

In [None]:
res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row['Text']
    myid = row['Id']
    try:
        res[myid] = polarity_scores_roberta(text)
    except RuntimeError:
        # Most likely the review exceeds the model's maximum input length; skip it
        print(f'Broke for id {myid}')

Broke for id 83
Broke for id 187
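
The two failures above are consistent with reviews that are longer than the model accepts (RoBERTa-base handles at most 512 tokens), although the notebook does not confirm the cause. Under that assumption, one option is to truncate long inputs at tokenization time. A hedged sketch follows; `polarity_scores_truncated` is a hypothetical variant of `polarity_scores_roberta` that takes the tokenizer and model as arguments:

```python
import math

def polarity_scores_truncated(text, tokenizer, model, max_length=512):
    """Score `text`, cutting it to `max_length` tokens so long reviews do not fail.

    Hypothetical variant of polarity_scores_roberta; assumes the failures
    come from inputs longer than the model's limit.
    """
    encoded = tokenizer(text, return_tensors='pt',
                        truncation=True, max_length=max_length)
    output = model(**encoded)
    logits = [float(x) for x in output[0][0].detach().numpy()]
    # Softmax over the three logits (negative, neutral, positive)
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    keys = ['roberta_neg', 'roberta_neu', 'roberta_pos']
    return {key: e / total for key, e in zip(keys, exps)}
```

With the notebook's objects this would be called as `polarity_scores_truncated(text, tokenizer, model)`. Truncation discards the end of a long review, so the scores for those reviews reflect only the first 512 tokens.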


In [None]:
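# One row per review Id, one column per RoBERTa score
results_df = pd.DataFrame(res).T
results_df = results_df.reset_index().rename(columns={'index': 'Id'})
# Attach the original review columns by Id
results_df = results_df.merge(df, how='left', on='Id')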

In [None]:
results_df.columns

Index(['Id', 'roberta_neg', 'roberta_neu', 'roberta_pos', 'ProductId',
       'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [None]:
# 1-star review that the model scores as most positive
results_df.query('Score == 1') \
    .sort_values('roberta_pos', ascending=False)['Text'].values[0]

'I felt energized within five minutes, but it lasted for about 45 minutes. I paid $3.99 for this drink. I could have just drunk a cup of coffee and saved my money.'

In [None]:
# 5-star review that the model scores as most negative
results_df.query('Score == 5') \
    .sort_values('roberta_neg', ascending=False)['Text'].values[0]

'this was sooooo deliscious but too bad i ate em too fast and gained 2 pds! my fault'
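
The two queries above pick out the reviews where the star rating and the model disagree most. The same idea can be written as a small helper over plain rows. This is a sketch with invented rows (the ids, ratings, and scores below are hypothetical, not taken from the dataset), and `most_confident_mismatch` is an illustrative name:

```python
def most_confident_mismatch(rows, star, score_key):
    """Among reviews with the given star rating, return the one with the highest `score_key`."""
    candidates = [row for row in rows if row['Score'] == star]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[score_key])

# Hypothetical rows shaped like results_df (e.g. results_df.to_dict('records'))
rows = [
    {'Id': 1, 'Score': 1, 'roberta_neg': 0.90, 'roberta_pos': 0.02},
    {'Id': 2, 'Score': 1, 'roberta_neg': 0.20, 'roberta_pos': 0.55},
    {'Id': 3, 'Score': 5, 'roberta_neg': 0.60, 'roberta_pos': 0.10},
    {'Id': 4, 'Score': 5, 'roberta_neg': 0.01, 'roberta_pos': 0.97},
]

# 1-star review rated most positive, and 5-star review rated most negative
print(most_confident_mismatch(rows, star=1, score_key='roberta_pos')['Id'])
print(most_confident_mismatch(rows, star=5, score_key='roberta_neg')['Id'])
```

Cases like these are worth reading by hand: they tend to be mixed or ironic reviews, where the wording and the rating point in different directions.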

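The `pipeline` helper from Transformers wraps tokenization, inference, and post-processing in a single call, which makes it a quick way to classify text. Since no model is specified here, it uses the library's default sentiment model. A few examples: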

In [None]:
sent_pipeline = pipeline("sentiment-analysis")

In [None]:
print(sent_pipeline('This is an amazing product, I love it!'))
print(sent_pipeline('I had a terrible experience with their customer service'))
print(sent_pipeline('This movie is amazing. I highly recommend it'))

[{'label': 'POSITIVE', 'score': 0.9998859167098999}]
[{'label': 'NEGATIVE', 'score': 0.999445378780365}]
[{'label': 'POSITIVE', 'score': 0.9998862743377686}]
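
Unlike the RoBERTa model used earlier, this pipeline returns a single label with a confidence score. To compare or plot such results it can help to fold both into one signed number. A small sketch using the three outputs printed above; `signed_score` is a hypothetical helper, and it assumes only the `POSITIVE` and `NEGATIVE` labels seen here:

```python
def signed_score(result):
    """Map a {'label', 'score'} dict to [-1, 1]: POSITIVE keeps its sign, NEGATIVE is flipped."""
    sign = 1.0 if result['label'] == 'POSITIVE' else -1.0
    return sign * result['score']

# The three pipeline outputs shown above (each call returns a one-item list)
outputs = [
    [{'label': 'POSITIVE', 'score': 0.9998859167098999}],
    [{'label': 'NEGATIVE', 'score': 0.999445378780365}],
    [{'label': 'POSITIVE', 'score': 0.9998862743377686}],
]
signed = [signed_score(batch[0]) for batch in outputs]
print(signed)
```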


Finally, we count the total number of sentences in the dataset.

In [None]:
num_sentences = 0

for i, row in df.iterrows():
    text = row['Text']
    sentences = nltk.sent_tokenize(text)
    num_sentences += len(sentences)

print(f"Total number of sentences in the dataset: {num_sentences}")

Total number of sentences in the dataset: 2230
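
Dividing that total by the 500 reviews loaded earlier gives the average review length in sentences. A quick check of the arithmetic, using only the two numbers reported in this notebook:

```python
total_sentences = 2230   # total reported above
num_reviews = 500        # rows kept from Reviews.csv

avg_sentences = total_sentences / num_reviews
print(f"Average sentences per review: {avg_sentences:.2f}")
```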
