# **Análisis de sentimiento sobre un corpus de tweets en español**

Para realizar este trabajo se utilizaron datasets provistos por la Sociedad Española del Procesado del Lenguaje Natural (SEPLN) para el Taller de Análisis de Sentimiento en español (TASS) en distintos años. Los archivos están en formato XML y poseen diversos campos, en este caso solo utilizaremos 'content' (el texto del tweet) y 'polarity' (la polaridad del tweet). Estas polaridades pueden ser:

- P: positiva.
- N: negativa.
- NEU: neutral.
- NONE: ninguna.

### **Parsing de los archivos**

Para poder usar los archivos, se realiza un parsing de los .xml a .csv para poder crear el dataframe con pandas. Esto se hace para los archivos correspondientes a los tweets de cada país. En caso de ejecutar el Google Colab, este paso no es necesario.

In [None]:
import pandas as pd
pd.set_option('max_colwidth',1000)

#Parseo de los tweets en el archivo TASS2019_country_UY_train.xml

try:
    general_tweets_corpus_train_UY = pd.read_csv('TASS\\general-tweets-train-tagged-UY.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_UY_train.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    general_tweets_corpus_train_UY = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        general_tweets_corpus_train_UY = general_tweets_corpus_train_UY.append(row_s)
    general_tweets_corpus_train_UY.to_csv('TASS\\general-tweets-train-tagged-UY.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo TASS2019_country_UY_dev.xml

try:
    dev_tweets_corpus_train_UY = pd.read_csv('TASS\\dev-tweets-train-tagged-UY.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_UY_dev.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    dev_tweets_corpus_train_UY = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        dev_tweets_corpus_train_UY = dev_tweets_corpus_train_UY.append(row_s)
    dev_tweets_corpus_train_UY.to_csv('TASS\\dev-tweets-train-tagged-UY.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo TASS2019_country_CR_train.xml

try:
    general_tweets_corpus_train_CR = pd.read_csv('TASS\\general-tweets-train-tagged-CR.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_CR_train.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    general_tweets_corpus_train_CR = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        general_tweets_corpus_train_CR = general_tweets_corpus_train_CR.append(row_s)
    general_tweets_corpus_train_CR.to_csv('TASS\\general-tweets-train-tagged-CR.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo TASS2019_country_CR_dev.xml

try:
    dev_tweets_corpus_train_CR = pd.read_csv('TASS\\dev-tweets-train-tagged-CR.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_CR_dev.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    dev_tweets_corpus_train_CR = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        dev_tweets_corpus_train_CR = dev_tweets_corpus_train_CR.append(row_s)
    dev_tweets_corpus_train_CR.to_csv('TASS\\dev-tweets-train-tagged-CR.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo TASS2019_country_ES_train.xml

try:
    general_tweets_corpus_train_ES = pd.read_csv('TASS\\general-tweets-train-tagged-ES.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_ES_train.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    general_tweets_corpus_train_ES = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        general_tweets_corpus_train_ES = general_tweets_corpus_train_ES.append(row_s)
    general_tweets_corpus_train_ES.to_csv('TASS\\general-tweets-train-tagged-ES.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo TASS2019_country_ES_dev.xml

try:
    dev_tweets_corpus_train_ES = pd.read_csv('TASS\\dev-tweets-train-tagged-ES.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_ES_dev.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    dev_tweets_corpus_train_ES = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        dev_tweets_corpus_train_ES = dev_tweets_corpus_train_ES.append(row_s)
    dev_tweets_corpus_train_ES.to_csv('TASS\\dev-tweets-train-tagged-ES.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo TASS2019_country_MX_train.xml

try:
    general_tweets_corpus_train_MX = pd.read_csv('TASS\\general-tweets-train-tagged-MX.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_MX_train.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    general_tweets_corpus_train_MX = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        general_tweets_corpus_train_MX = general_tweets_corpus_train_MX.append(row_s)
    general_tweets_corpus_train_MX.to_csv('TASS\\general-tweets-train-tagged-MX.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo TASS2019_country_MX_dev.xml

try:
    dev_tweets_corpus_train_MX = pd.read_csv('TASS\\dev-tweets-train-tagged-MX.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_MX_dev.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    dev_tweets_corpus_train_MX = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        dev_tweets_corpus_train_MX = dev_tweets_corpus_train_MX.append(row_s)
    dev_tweets_corpus_train_MX.to_csv('TASS\\dev-tweets-train-tagged-MX.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo TASS2019_country_PE_train.xml

try:
    general_tweets_corpus_train_PE = pd.read_csv('TASS\\general-tweets-train-tagged-PE.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_PE_train.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    general_tweets_corpus_train_PE = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        general_tweets_corpus_train_PE = general_tweets_corpus_train_PE.append(row_s)
    general_tweets_corpus_train_PE.to_csv('TASS\\general-tweets-train-tagged-PE.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo TASS2019_country_PE_dev.xml

try:
    dev_tweets_corpus_train_PE = pd.read_csv('TASS\\dev-tweets-train-tagged-PE.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_PE_dev.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    dev_tweets_corpus_train_PE = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        dev_tweets_corpus_train_PE = dev_tweets_corpus_train_PE.append(row_s)
    dev_tweets_corpus_train_PE.to_csv('TASS\\dev-tweets-train-tagged-PE.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo intertass-CR-development-tagged.xml

try:
    intertass_dev_CR = pd.read_csv('TASS\\intertass-dev-CR.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\intertass-CR-development-tagged.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    intertass_dev_CR = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        intertass_dev_CR = intertass_dev_CR.append(row_s)
    intertass_dev_CR.to_csv('TASS\\intertass-dev-CR.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo intertass-ES-development-tagged.xml

try:
    intertass_dev_ES = pd.read_csv('TASS\\intertass-dev-ES.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\intertass-ES-development-tagged.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    intertass_dev_ES = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        intertass_dev_ES = intertass_dev_ES.append(row_s)
    intertass_dev_ES.to_csv('TASS\\intertass-dev-ES.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo intertass-PE-development-tagged.xml

try:
    intertass_dev_PE = pd.read_csv('TASS\\intertass-dev-PE.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\intertass-PE-development-tagged.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    intertass_dev_PE = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        intertass_dev_PE = intertass_dev_PE.append(row_s)
    intertass_dev_PE.to_csv('TASS\\intertass-dev-PE.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo intertass-CR-train-tagged.xml

try:
    intertass_train_CR = pd.read_csv('TASS\\intertass-train-CR.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\intertass-CR-train-tagged.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    intertass_train_CR = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        intertass_train_CR = intertass_train_CR.append(row_s)
    intertass_train_CR.to_csv('TASS\\intertass-train-CR.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo intertass-ES-train-tagged.xml

try:
    intertass_train_ES = pd.read_csv('TASS\\intertass-train-ES.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\intertass-ES-train-tagged.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    intertass_train_ES = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        intertass_train_ES = intertass_train_ES.append(row_s)
    intertass_train_ES.to_csv('TASS\\intertass-train-ES.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo intertass-PE-train-tagged.xml

try:
    intertass_train_PE = pd.read_csv('TASS\\intertass-train-PE.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\intertass-PE-train-tagged.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    intertass_train_PE = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiment.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        intertass_train_PE = intertass_train_PE.append(row_s)
    intertass_train_PE.to_csv('TASS\\intertass-train-PE.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo politics-test-tagged.xml

try:
    politics_test = pd.read_csv('TASS\\politics-test.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\politics-test-tagged.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    politics_test = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content.text, tweet.sentiments.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        politics_test = politics_test.append(row_s)
    politics_test.to_csv('TASS\\politics-test.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo TASS2019_country_UY_test.xml, para testear

try:
    tweets_test_UY = pd.read_csv('TASS\\tweets-test-tagged-UY.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_UY_test.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    tweets_test_UY = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content'], [tweet.content.text]))
        row_s = pd.Series(row)
        row_s.name = i
        tweets_test_UY = tweets_test_UY.append(row_s)
    tweets_test_UY.to_csv('TASS\\tweets-test-tagged-UY.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo TASS2019_country_CR_test.xml, para testear

try:
    tweets_test_CR = pd.read_csv('TASS\\tweets-test-tagged-CR.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_CR_test.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    tweets_test_CR = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content'], [tweet.content.text]))
        row_s = pd.Series(row)
        row_s.name = i
        tweets_test_CR = tweets_test_CR.append(row_s)
    tweets_test_CR.to_csv('TASS\\tweets-test-tagged-CR.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo TASS2019_country_ES_test.xml, para testear

try:
    tweets_test_ES = pd.read_csv('TASS\\tweets-test-tagged-ES.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_ES_test.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    tweets_test_ES = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content'], [tweet.content.text]))
        row_s = pd.Series(row)
        row_s.name = i
        tweets_test_ES = tweets_test_ES.append(row_s)
    tweets_test_ES.to_csv('TASS\\tweets-test-tagged-ES.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo TASS2019_country_MX_test.xml, para testear

try:
    tweets_test_MX = pd.read_csv('TASS\\tweets-test-tagged-MX.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_MX_test.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    tweets_test_MX = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content'], [tweet.content.text]))
        row_s = pd.Series(row)
        row_s.name = i
        tweets_test_MX = tweets_test_MX.append(row_s)
    tweets_test_MX.to_csv('TASS\\tweets-test-tagged-MX.csv', index=False, encoding='utf-8')

In [None]:
#Parseo de los tweets en el archivo TASS2019_country_PE_test.xml, para testear

try:
    tweets_test_PE = pd.read_csv('TASS\\tweets-test-tagged-PE.csv', encoding='utf-8')
except:
    from lxml import objectify
    xml = objectify.parse(open('TASS\\TASS2019_country_PE_test.xml', encoding='utf-8'))
    #sample tweet object
    root = xml.getroot()
    tweets_test_PE = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content'], [tweet.content.text]))
        row_s = pd.Series(row)
        row_s.name = i
        tweets_test_PE = tweets_test_PE.append(row_s)
    tweets_test_PE.to_csv('TASS\\tweets-test-tagged-PE.csv', index=False, encoding='utf-8')

En caso de ejecutar desde Google Colab, ejecutar la siguiente celda, hacer click en 'choose files' y seleccionar los 22 archivos .csv provistos. En otro caso, no es necesario.

In [None]:


from google.colab import files
uploaded = files.upload()

import io
import pandas as pd

general_tweets_corpus_train_UY = pd.read_csv(io.BytesIO(uploaded['general-tweets-train-tagged-UY.csv']))
general_tweets_corpus_train_CR = pd.read_csv(io.BytesIO(uploaded['general-tweets-train-tagged-CR.csv']))
general_tweets_corpus_train_ES = pd.read_csv(io.BytesIO(uploaded['general-tweets-train-tagged-ES.csv']))
general_tweets_corpus_train_MX = pd.read_csv(io.BytesIO(uploaded['general-tweets-train-tagged-MX.csv']))
general_tweets_corpus_train_PE = pd.read_csv(io.BytesIO(uploaded['general-tweets-train-tagged-PE.csv']))

dev_tweets_corpus_train_UY = pd.read_csv(io.BytesIO(uploaded['dev-tweets-train-tagged-UY.csv']))
dev_tweets_corpus_train_CR = pd.read_csv(io.BytesIO(uploaded['dev-tweets-train-tagged-CR.csv']))
dev_tweets_corpus_train_ES = pd.read_csv(io.BytesIO(uploaded['dev-tweets-train-tagged-ES.csv']))
dev_tweets_corpus_train_MX = pd.read_csv(io.BytesIO(uploaded['dev-tweets-train-tagged-MX.csv']))
dev_tweets_corpus_train_PE = pd.read_csv(io.BytesIO(uploaded['dev-tweets-train-tagged-PE.csv']))

tweets_test_UY = pd.read_csv(io.BytesIO(uploaded['tweets-test-tagged-UY.csv']))
tweets_test_CR = pd.read_csv(io.BytesIO(uploaded['tweets-test-tagged-CR.csv']))
tweets_test_ES = pd.read_csv(io.BytesIO(uploaded['tweets-test-tagged-ES.csv']))
tweets_test_MX = pd.read_csv(io.BytesIO(uploaded['tweets-test-tagged-MX.csv']))
tweets_test_PE = pd.read_csv(io.BytesIO(uploaded['tweets-test-tagged-PE.csv']))

intertass_dev_CR = pd.read_csv(io.BytesIO(uploaded['intertass-dev-CR.csv']))
intertass_dev_ES = pd.read_csv(io.BytesIO(uploaded['intertass-dev-ES.csv']))
intertass_dev_PE = pd.read_csv(io.BytesIO(uploaded['intertass-dev-PE.csv']))

intertass_train_CR = pd.read_csv(io.BytesIO(uploaded['intertass-train-CR.csv']))
intertass_train_ES = pd.read_csv(io.BytesIO(uploaded['intertass-train-ES.csv']))
intertass_train_PE = pd.read_csv(io.BytesIO(uploaded['intertass-train-PE.csv']))

politics_test = pd.read_csv(io.BytesIO(uploaded['politics-test.csv']))


Saving dev-tweets-train-tagged-CR.csv to dev-tweets-train-tagged-CR (2).csv
Saving dev-tweets-train-tagged-ES.csv to dev-tweets-train-tagged-ES (2).csv
Saving dev-tweets-train-tagged-MX.csv to dev-tweets-train-tagged-MX (2).csv
Saving dev-tweets-train-tagged-PE.csv to dev-tweets-train-tagged-PE (2).csv
Saving dev-tweets-train-tagged-UY.csv to dev-tweets-train-tagged-UY (2).csv
Saving general-tweets-train-tagged-CR.csv to general-tweets-train-tagged-CR (2).csv
Saving general-tweets-train-tagged-ES.csv to general-tweets-train-tagged-ES (2).csv
Saving general-tweets-train-tagged-MX.csv to general-tweets-train-tagged-MX (2).csv
Saving general-tweets-train-tagged-PE.csv to general-tweets-train-tagged-PE (2).csv
Saving general-tweets-train-tagged-UY.csv to general-tweets-train-tagged-UY (2).csv
Saving intertass-dev-CR.csv to intertass-dev-CR (2).csv
Saving intertass-dev-ES.csv to intertass-dev-ES (2).csv
Saving intertass-dev-PE.csv to intertass-dev-PE (2).csv
Saving intertass-train-CR.csv to

Una vez realizado el parsing de todos los archivos de los datasets, formamos un solo gran corpus concatenando todos ellos. Se realiza una división en dos grupos:

- **tweets_corpus**: que corresponde al conjunto de entrenamiento, por lo tanto posee el contenido de los tweets y su polaridad.
- **tweets_test**: que corresponde al conjunto de testing y solo posee el  contenido de los tweets.

In [None]:
tweets_corpus = pd.concat([
    general_tweets_corpus_train_UY,
    dev_tweets_corpus_train_UY,
    general_tweets_corpus_train_CR,
    dev_tweets_corpus_train_CR,
    general_tweets_corpus_train_ES,
    dev_tweets_corpus_train_ES,
    general_tweets_corpus_train_MX,
    dev_tweets_corpus_train_MX,
    general_tweets_corpus_train_PE,
    dev_tweets_corpus_train_PE,
    intertass_dev_CR,
    intertass_dev_ES,
    intertass_dev_PE,
    intertass_train_CR,
    intertass_train_ES,
    intertass_train_PE,
    politics_test
])

tweets_test = pd.concat([
    tweets_test_UY,
    tweets_test_CR,
    tweets_test_ES,
    tweets_test_MX,
    tweets_test_PE
])

tweets_corpus.sample(10)

Unnamed: 0,content,polarity
195,"@jose97angel ¿eliminada? Nooo, no la han elimi...",P
235,@anacubilla @ppopular lo primero q hay q hacer...,NEU
300,@mario_hart de parte de mi hermana ella dice q...,NEU
558,@YurikitoKSP que bien yuriko. Felicidades,P
292,@eunhyukeeprince no te preocupes ^^ como dije....,NEU
2302,Independientes a favor de Pedro Costa Morata (...,P
829,"Penoso lo de hoy, pensando en que hacer en el ...",NEU
81,@lirondos Ya lo tenía fichado y tengo ganas de...,P
695,"Está navidad me la pasare durmiendo, para desp...",NONE
427,"Sopa Negra, opción plato fuerte en la soda. Di...",P


Vemos que, entre todos los archivos, obtenemos 13879 tweets de entrenamiento y 7264 de testing.

In [None]:
tweets_corpus.shape

(13879, 2)

### **Preprocesamiento del texto**

El texto escrito en redes sociales tiende a ser informal, por lo que es susceptible a un uso más "libre" del lenguaje. Esto puede incluir: risas, palabras mal escritas, palabras sin tildes, etc. Se buscará obtener un vocabulario más consistente eliminando las tildes, las risas y contenido propio de la red social Twitter (como hashtags, links o nombres de usuario en menciones). Antes de realizar todo este preprocesamiento, nos quedamos solo con aquellos tweets que tengan polaridad positiva, negativa o neutral, es decir, eliminamos los tweets con polaridad NONE.

In [None]:
import re


def elimina_tildes(s):
    replacements = (
        ("á", "a"),
        ("é", "e"),
        ("í", "i"),
        ("ó", "o"),
        ("ú", "u"),
        ("Á", "A"),
        ("É", "E"),
        ("Í", "I"),
        ("Ó", "O"),
        ("Ú", "U"),
        ("à", "a"),
        ("è", "e"),
        ("ì", "i"),
        ("ò", "o"),
        ("ù", "u"),
    )
    for a, b in replacements:
        s = s.replace(a, b).replace(a.upper(), b.upper())
    return s

#Elimino los de porlaridad NONE
tweets_corpus = tweets_corpus.query('polarity != "NONE"')

#Elimino links
tweets_corpus['content'] = tweets_corpus['content'].str.replace(r'^http.*$','')
tweets_corpus['content'] = tweets_corpus['content'].str.replace(r'^https.*$','')

#Elimino los nombres de ususario de las menciones(@usuario)
tweets_corpus['content'] = tweets_corpus['content'].str.replace(r'(@[A-Za-z0-9_]+)','')

#Elimino hashtags
tweets_corpus['content'] = tweets_corpus['content'].str.replace(r'(#[A-Za-z0-9_]+)','')

#Elimino tildes
for index, row in tweets_corpus.iterrows():
    row['content'] = elimina_tildes(row['content'])

#Elimino risas
tweets_corpus['content'] = tweets_corpus['content'].str.replace(r'ja(ja)+(j)*','')

tweets_corpus['content'] = tweets_corpus['content'].str.replace(r'Ja(ja)+(j)*','') 

tweets_corpus['content'] = tweets_corpus['content'].str.replace(r'JA(JA)+(JA)*','')

tweets_corpus['content'] = tweets_corpus['content'].str.replace(r'je(je)+(j)*','')

tweets_corpus.shape


(11318, 2)

Vemos que, luego de este proceso, nos quedamos con 11318 tweets

In [None]:
tweets_corpus.sample(20)

Unnamed: 0,content,polarity
27,"Estoy en el ""Taco loco""venden mas sopas,carnes...",N
576,que el domingo voy a votar a para daros la p...,N
395,"Que feo cuando se hacen cagones, pinches malcr...",N
280,"Si tu dia esta amargo ""pues dale una movida"" a...",NEU
499,"En mi cabeza, en mi cabeza vas dando vueltas! ...",P
352,por que siempre en las finales los hacen los ...,N
112,Buenos dias a todos!! Lamento tener que pospon...,N
426,Mi novia me compro un pajarito hermoso y como ...,NEU
738,"Hoy no es un dia cualquiera, hoy cumplo 2 años...",P
760,", Os recomiendo leer este informe antes de ir ...",P


### **Tokenización y Stemming**

Utilizaremos las stopwords en español provistas por nltk, a las cuales también les quitaremos las tildes para que sean consistentes con el vocabulario de nuestro corpus.

In [None]:
#descargar stopwords en español
import nltk
nltk.download("stopwords")

from nltk.corpus import stopwords
spanish_stopwords = stopwords.words('spanish')

espa_stopwords = []

#elimino los tildes de las stopwords, para que coincidan con el dataset
for word in spanish_stopwords:
  espa_stopwords.append(elimina_tildes(word))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
espa_stopwords

['de',
 'la',
 'que',
 'el',
 'en',
 'y',
 'a',
 'los',
 'del',
 'se',
 'las',
 'por',
 'un',
 'para',
 'con',
 'no',
 'una',
 'su',
 'al',
 'lo',
 'como',
 'mas',
 'pero',
 'sus',
 'le',
 'ya',
 'o',
 'este',
 'si',
 'porque',
 'esta',
 'entre',
 'cuando',
 'muy',
 'sin',
 'sobre',
 'tambien',
 'me',
 'hasta',
 'hay',
 'donde',
 'quien',
 'desde',
 'todo',
 'nos',
 'durante',
 'todos',
 'uno',
 'les',
 'ni',
 'contra',
 'otros',
 'ese',
 'eso',
 'ante',
 'ellos',
 'e',
 'esto',
 'mi',
 'antes',
 'algunos',
 'que',
 'unos',
 'yo',
 'otro',
 'otras',
 'otra',
 'el',
 'tanto',
 'esa',
 'estos',
 'mucho',
 'quienes',
 'nada',
 'muchos',
 'cual',
 'poco',
 'ella',
 'estar',
 'estas',
 'algunas',
 'algo',
 'nosotros',
 'mi',
 'mis',
 'tu',
 'te',
 'ti',
 'tu',
 'tus',
 'ellas',
 'nosotras',
 'vosotros',
 'vosotras',
 'os',
 'mio',
 'mia',
 'mios',
 'mias',
 'tuyo',
 'tuya',
 'tuyos',
 'tuyas',
 'suyo',
 'suya',
 'suyos',
 'suyas',
 'nuestro',
 'nuestra',
 'nuestros',
 'nuestras',
 'vuestro'

Definimos la lista de símbolos a eliminar. Se añaden a ella los signos de apertura de pregunta y admiración, que se utilizan en el español.

In [None]:
from string import punctuation
non_words = list(punctuation)

#añado los signos de pregunta y admiración usados en español
non_words.extend(['¿', '¡'])
non_words.extend(map(str,range(10)))
non_words

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~',
 '¿',
 '¡',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9']

Se definen las funciones de tokenización y stemming.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer     
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# basado on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = SnowballStemmer('spanish')

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # remover non words
    text = ''.join([c for c in text if c not in non_words])
    # tokenize
    tokens =  word_tokenize(text)

    # stem
    try:
        stems = stem_tokens(tokens, stemmer)
    except Exception as e:
        print(e)
        print(text)
        stems = ['']
    return stems



### **Modelo: entrenamiento**

Tomaremos a este problema como un problema de clasificación de 3 clases (positivo, negativo, neutral). Por lo tanto, creamos una nueva columna 'polarity_num' donde realizamos la siguiente asignación numérica para cada polaridad:

- 0: polaridad negativa.
- 1: polaridad positiva.
- 2: polaridad neutral.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Tres clases
# N = 0
# P = 1
# NEU = 2
tweets_corpus['polarity_num'] = 0
tweets_corpus.polarity_num[tweets_corpus.polarity.isin(['P'])] = 1
tweets_corpus.polarity_num[tweets_corpus.polarity.isin(['NEU'])] = 2

tweets_corpus.polarity_num.value_counts(normalize = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':


0    0.437180
1    0.341050
2    0.221771
Name: polarity_num, dtype: float64

A continuación se presenta el modelo utilizado: Linear Support Vector Classifier (LinearSVC) provisto por scikit learn y una representación de bolsa de palabras con un vectorizador CountVectorizer.

Se utilizó Grid Search para la búsqueda de los hiperparámetros y accuracy como medida de score, debido a que se trata de un problema con múltiples clases.

In [None]:
vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = espa_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', LinearSVC()),
])


parameters = {
    'vect__max_df': (0.5, 1.0, 1.9),
    'vect__min_df': (5, 10, 20),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2), (1,4)), 
    'cls__C': (0.003, 0.02, 0.2),
    'cls__loss': ('hinge', 'squared_hinge'),
    'cls__max_iter': (500, 20000)
}

import sklearn

nltk.download('punkt')
  
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1 , scoring='accuracy')
grid_search.fit(tweets_corpus.content, tweets_corpus.polarity_num)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


  'stop_words.' % sorted(inconsistent))


GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        p

In [None]:
grid_search.best_params_

{'cls__C': 0.2,
 'cls__loss': 'hinge',
 'cls__max_iter': 20000,
 'vect__max_df': 0.5,
 'vect__max_features': 1000,
 'vect__min_df': 10,
 'vect__ngram_range': (1, 1)}

Se utiliza crossvalidation para ver la performance del modelo. A pesar de el uso de Grid Search, se obtuvieron mejores resultados ajustando los hiperparámetros de forma manual. 

In [None]:
nltk.download('punkt')

model = LinearSVC(C=23, loss='squared_hinge',max_iter=1000000,multi_class='ovr',
              random_state=None,
              penalty='l1',
              tol=0.0001,
              dual=False
)

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = espa_stopwords,
    min_df = 5,
    max_df = 1.0,
    ngram_range=(1, 4),
    max_features=10000
)

pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word', # n-gramas de palabras
            tokenizer = tokenize,
            lowercase = True,
            stop_words = espa_stopwords,
            min_df = 5, # Añade al vocabulario palabras que aparezcan en al menos 5 tweets
            max_df = 1.0, # Ignora palabras que aparezcan en todos los tweets
            ngram_range=(1, 4), # Se usan desde unigramas a cuatrigramas
            max_features=10000
            )),
    ('cls', LinearSVC(C=23, loss='squared_hinge',max_iter=100000,multi_class='ovr',
             random_state=None,
             penalty='l1',
             tol=0.0001,
             dual=False
             )),
])

corpus_data_features = vectorizer.fit_transform(tweets_corpus.content)
corpus_data_features_nd = corpus_data_features.toarray()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


  'stop_words.' % sorted(inconsistent))


In [None]:
scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(tweets_corpus)],
    y=tweets_corpus.polarity_num,
    scoring='accuracy',
    cv=5
    )

scores.mean()

0.7104487773039635

Vemos que la accuracy producida es de 71%, que no está mal, pero no es muy alta. Esto puede deberse a la mezcla de vocabulario de distintos países, el uso de ironía u otros modos de expresión donde el texto se debe interpretar más allá del significado real de las palabras, palabras mal escritas que producen inconsistencia en el vocabulario, la relativamente poca cantidad de tweets, la cantidad de temáticas distintas en el texto, entre otros.

### **Modelo: predicciones de polaridad**

aplicamos el mismo preprocesamiento al conjunto de tweets de prueba y utilizamos el modelo para realizar preedicciones sobre la polaridad de los mismos.

In [None]:
#Elimino links
tweets_test['content'] = tweets_test['content'].str.replace(r'^http.*$','')
tweets_test['content'] = tweets_test['content'].str.replace(r'^https.*$','')

#Elimino los nombres de ususario de las menciones(@usuario)
tweets_test['content'] = tweets_test['content'].str.replace(r'(@[A-Za-z0-9_]+)','')

#Elimino hashtags
tweets_test['content'] = tweets_test['content'].str.replace(r'(#[A-Za-z0-9_]+)','')

#Elimino tildes
tweets_test['content'] = tweets_test['content'].apply(elimina_tildes)

#Elimino risas
tweets_test['content'] = tweets_test['content'].str.replace(r'ja(ja)+(j)*','')

tweets_test['content'] = tweets_test['content'].str.replace(r'Ja(ja)+(j)*','') 

tweets_test['content'] = tweets_test['content'].str.replace(r'JA(JA)+(JA)*','')

tweets_test['content'] = tweets_test['content'].str.replace(r'je(je)+(j)*','')

tweets_test.shape


(7264, 2)

In [None]:
pipeline.fit(tweets_corpus.content, tweets_corpus.polarity_num)
tweets_test['polarity'] = pipeline.predict(tweets_test.content)

  'stop_words.' % sorted(inconsistent))


Vemos los resultados, recordemos que las clases son:

- 0: polaridad negativa.
- 1: polaridad positiva.
- 2: polaridad neutral.

In [1]:
tweets_test.sample(50)

NameError: ignored