# POS Tagging

POS (part-of-speech) tagging, also known as morphological tagging, is the process of classifying the words of a text according to their lexical category.

Each word receives a lexical category from a set of coded tags, chosen according to its function in the corresponding language. The text must be tokenized before it can be POS tagged.

NLTK provides a function called pos_tag. It classifies English words according to a predefined tagset (the Penn Treebank tagset). This particular tagger is based on machine learning and was trained on thousands of manually tagged example sentences. It therefore estimates the most likely lexical category of a term, which does not mean it is free of errors.

The complete list of tag codes used by NLTK can be printed with nltk.help.upenn_tagset().

In [16]:
import nltk
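# Uncomment on the first run to download the required NLTK resources
# nltk.download('averaged_perceptron_tagger')
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('tagsets')
# wordcloud is a separate Python package (pip install wordcloud), not an NLTK resource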

The description of a specific tag can also be printed.

In [2]:
nltk.help.upenn_tagset("NNP")

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...


Because this tagger may not be accurate enough in some cases, tagging accuracy can be improved by combining it with manually built POS taggers.
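One simple way to combine a statistical tagger with hand-written knowledge is to post-process its output with manual rules. The sketch below overrides the tag of hashtags, mentions, and URLs, which a tagger trained on edited text tends to mislabel. The helper name `apply_manual_rules` and the tags `URL`, `HT`, and `USR` are assumptions made for illustration; they are not part of the Penn Treebank tagset. NLTK also offers rule-based taggers such as `RegexpTagger` that can be chained through a `backoff` argument, which may be a better fit for larger rule sets.

```python
import re

# Hypothetical override tags for tweet-specific tokens; they are not part of
# the Penn Treebank tagset used by nltk.pos_tag.
MANUAL_RULES = [
    (re.compile(r"^https?://\S+$"), "URL"),
    (re.compile(r"^#\w+$"), "HT"),
    (re.compile(r"^@\w+$"), "USR"),
]

def apply_manual_rules(tagged_tokens):
    """Replace the tag of any token matching a manual rule; keep the rest."""
    corrected = []
    for token, tag in tagged_tokens:
        for pattern, manual_tag in MANUAL_RULES:
            if pattern.match(token):
                tag = manual_tag
                break
        corrected.append((token, tag))
    return corrected

# Illustrative pairs in the (token, tag) format that nltk.pos_tag returns
sample = [("@sarkodie", "VBD"), ("the", "DT"), ("#GRAMMYs", "NNP"),
          ("https://t.co/0dVvRDr8lP", "NN")]
print(apply_manual_rules(sample))
```

Tokens that match no rule keep the tag assigned by the statistical tagger, so the rules only correct the cases they were written for.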

## Tagging

In [3]:
example = "The Palace of Westminster serves as the meeting place for both the House of Commons and the House of Lords, the two houses of the Parliament of the United Kingdom. Informally known as the Houses of Parliament after its occupants, the Palace lies on the north bank of the River Thames in the City of Westminster, in central London, England."
# Tokenize the text
tokenized_text = nltk.word_tokenize(example)
print(tokenized_text)
# Tag the tokens with pos_tag
text_pos = nltk.pos_tag(tokenized_text)
print(text_pos)

['The', 'Palace', 'of', 'Westminster', 'serves', 'as', 'the', 'meeting', 'place', 'for', 'both', 'the', 'House', 'of', 'Commons', 'and', 'the', 'House', 'of', 'Lords', ',', 'the', 'two', 'houses', 'of', 'the', 'Parliament', 'of', 'the', 'United', 'Kingdom', '.', 'Informally', 'known', 'as', 'the', 'Houses', 'of', 'Parliament', 'after', 'its', 'occupants', ',', 'the', 'Palace', 'lies', 'on', 'the', 'north', 'bank', 'of', 'the', 'River', 'Thames', 'in', 'the', 'City', 'of', 'Westminster', ',', 'in', 'central', 'London', ',', 'England', '.']
[('The', 'DT'), ('Palace', 'NNP'), ('of', 'IN'), ('Westminster', 'NNP'), ('serves', 'NNS'), ('as', 'IN'), ('the', 'DT'), ('meeting', 'NN'), ('place', 'NN'), ('for', 'IN'), ('both', 'CC'), ('the', 'DT'), ('House', 'NNP'), ('of', 'IN'), ('Commons', 'NNPS'), ('and', 'CC'), ('the', 'DT'), ('House', 'NNP'), ('of', 'IN'), ('Lords', 'NNPS'), (',', ','), ('the', 'DT'), ('two', 'CD'), ('houses', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Parliament', 'NNP'), ('of'
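The tagger returns a flat list of (token, tag) pairs, which is easier to inspect once grouped by tag. The sketch below does this for the first pairs of the output above, copied by hand; `group_by_tag` is a hypothetical helper name. Note that 'serves' was tagged NNS (plural noun) even though it acts as a verb in the sentence, a concrete case of the tagger not being free of errors.

```python
from collections import defaultdict

# First pairs of the tagger output shown above, copied by hand
text_pos_sample = [
    ("The", "DT"), ("Palace", "NNP"), ("of", "IN"), ("Westminster", "NNP"),
    ("serves", "NNS"), ("as", "IN"), ("the", "DT"), ("meeting", "NN"),
    ("place", "NN"),
]

def group_by_tag(tagged_tokens):
    """Map each tag to the list of tokens that received it, in order."""
    groups = defaultdict(list)
    for token, tag in tagged_tokens:
        groups[tag].append(token)
    return dict(groups)

groups = group_by_tag(text_pos_sample)
print(groups["NNP"])  # ['Palace', 'Westminster']
print(groups["NNS"])  # ['serves']
```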

# POS tagging in Spanish

To do POS tagging in another language, we need to either find a tagger already trained for that language or train one ourselves. We also need to know which word categories exist for that language.

The following <a href="https://colab.research.google.com/github/vitojph/kschool-nlp-18/blob/master/notebooks/pos-tagger-es.ipynb">link</a> shows a practical example of POS tagging in Spanish.
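To give an idea of what training a tagger ourselves involves, here is a minimal sketch of the simplest baseline: assign each word the tag it received most often in a hand-labeled sample. The training sentences are invented for illustration, and the tag names (DET, NOUN, VERB, ADJ) are an assumption rather than a standard Spanish tagset. A real tagger would need a large annotated corpus; NLTK's `UnigramTagger` implements the same idea on top of such a corpus.

```python
from collections import Counter, defaultdict

# Tiny hand-labeled sample, invented for illustration only. The tag names
# (DET, NOUN, VERB, ADJ) are an assumption, not a standard Spanish tagset.
train_sents = [
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB")],
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ")],
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB")],
]

def train_unigram_tagger(tagged_sents):
    """Learn the most frequent tag seen for each lowercased word."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(tokens, model, default_tag="NOUN"):
    """Tag tokens with the learned model; unseen words get default_tag."""
    return [(tok, model.get(tok.lower(), default_tag)) for tok in tokens]

model = train_unigram_tagger(train_sents)
print(tag(["El", "perro", "es", "grande"], model))
```

Words that never appeared in the training data fall back to `default_tag`, which is the main weakness of this baseline and the reason larger corpora and context-aware models are used in practice.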

In [4]:
import requests
import os
from dotenv import load_dotenv
import pandas as pd
import string
from nltk.tokenize import TweetTokenizer
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import nltk
url = "https://api.twitter.com/2/tweets/search/recent"
# Load the values from the .env file into environment variables
load_dotenv()
# Read the token value into a variable
bearer_token = os.environ.get("BEARER_TOKEN")
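`os.environ.get` returns `None` when the variable is missing, so a missing `.env` entry would only surface later as a rejected request. A small check, sketched below with a hypothetical helper named `require_env`, fails early with a clearer message.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "add it to the .env file before running the notebook."
        )
    return value

# In the notebook this would replace the plain lookup:
# bearer_token = require_env("BEARER_TOKEN")
```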

# Exercise

- Retrieve from the API a list of tweets in English that are not retweets and contain the hashtag #GRAMMYs.
- Tokenize the text
- Apply POS tagging with NLTK's pos_tag function
- Get the list and frequency of singular and plural nouns

# Exercise 1

In [5]:
params = {
    'query': '#GRAMMYs  lang:en -is:retweet',
    'tweet.fields':'created_at',
    'max_results':100
}
headers = {
    "Authorization": f"Bearer {bearer_token}",
    "User-Agent":"v2FullArchiveSearchPython"
}
response = requests.get(url, headers=headers, params=params)
print(response)
# Raise an exception if the request was not successful
if response.status_code != 200:
    raise Exception(response.status_code, response.text)
print(response.json())

<Response [200]>
{'data': [{'created_at': '2021-10-07T13:26:29.000Z', 'id': '1446104646729170950', 'text': 'evermore (album) is so good. I hope it gets nominated and eventually win album of the year at the #GRAMMYs'}, {'created_at': '2021-10-07T13:17:27.000Z', 'id': '1446102370744684551', 'text': 'Yo @sarkodie the 02 arena you say wossop or you’re heading straight to the grammys 😂💔 \n\n#GRAMMYs #KalyJaySpace'}, {'created_at': '2021-10-07T12:21:12.000Z', 'id': '1446088214549250051', 'text': 'The latest The Gospel music Daily! https://t.co/0dVvRDr8lP Thanks to @AMBankstw @the_tunesclub #grammys #gospel'}, {'created_at': '2021-10-07T12:16:16.000Z', 'id': '1446086974847209474', 'text': '@ctoLarsson @michael_saylor this will always be a classic.  Plan to is Video for the Crypto Music awards #GRAMMYs Nominate. #music \n#MTV #BTC'}, {'created_at': '2021-10-07T12:14:08.000Z', 'id': '1446086439922397189', 'text': '#GRAMMYs Beyoncé be like https://t.co/6qHAR2AsH3'}, {'created_at': '2021-10-07T12
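A single request returns at most the `max_results` tweets asked for. The sketch below shows how more pages could be collected, under the assumption that the response carries a `next_token` inside `meta` when more results exist and that the endpoint accepts it back as the `next_token` query parameter. `collect_pages` and `fetch_page` are hypothetical names; in the notebook, `fetch_page` would wrap the `requests.get` call above and add the token to `params`.

```python
def collect_pages(fetch_page, max_pages=3):
    """Collect tweets across pages by following meta.next_token.

    fetch_page is a hypothetical callable that takes a next_token (or None)
    and returns the decoded JSON body of one response.
    """
    tweets, next_token = [], None
    for _ in range(max_pages):
        body = fetch_page(next_token)
        # "data" may be absent when a page has no results
        tweets.extend(body.get("data", []))
        next_token = body.get("meta", {}).get("next_token")
        if next_token is None:
            break
    return tweets

# Offline stand-in for the API, used only to exercise the loop
fake_pages = {
    None: {"data": [{"id": "1"}, {"id": "2"}], "meta": {"next_token": "abc"}},
    "abc": {"data": [{"id": "3"}], "meta": {}},
}
print(len(collect_pages(lambda token: fake_pages[token])))  # 3
```

The `max_pages` cap keeps the loop from exhausting the rate limit if the result set is large.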

In [6]:
df = pd.json_normalize(response.json()['data'])
df

,created_at,id,text
0,2021-10-07T13:26:29.000Z,1446104646729170950,evermore (album) is so good. I hope it gets no...
1,2021-10-07T13:17:27.000Z,1446102370744684551,Yo @sarkodie the 02 arena you say wossop or yo...
2,2021-10-07T12:21:12.000Z,1446088214549250051,The latest The Gospel music Daily! https://t.c...
3,2021-10-07T12:16:16.000Z,1446086974847209474,@ctoLarsson @michael_saylor this will always b...
4,2021-10-07T12:14:08.000Z,1446086439922397189,#GRAMMYs Beyoncé be like https://t.co/6qHAR2AsH3
...,...,...,...
95,2021-10-05T14:31:38.000Z,1445396267123757068,and the grammy award for song of the year goes...
96,2021-10-05T13:32:04.000Z,1445381273531342855,Go Listen 🔥🔥🔥 you won’t regret it‼️\nD-Nasty -...
97,2021-10-05T13:22:49.000Z,1445378945206063107,🚨JUSTICE FOR LISA\n#YGLetLisaDoHerWork\n\n@Bul...
98,2021-10-05T12:54:20.000Z,1445371777215913988,Most tweeted Hashtags at rihanna:\nToday: \n#I...


In [7]:
# Tokenize

tt = TweetTokenizer()

tokenized_text = df['text'].apply(tt.tokenize)
df["tokenized_text"] = tokenized_text
df

,created_at,id,text,tokenized_text
0,2021-10-07T13:26:29.000Z,1446104646729170950,evermore (album) is so good. I hope it gets no...,"[evermore, (, album, ), is, so, good, ., I, ho..."
1,2021-10-07T13:17:27.000Z,1446102370744684551,Yo @sarkodie the 02 arena you say wossop or yo...,"[Yo, @sarkodie, the, 02, arena, you, say, woss..."
2,2021-10-07T12:21:12.000Z,1446088214549250051,The latest The Gospel music Daily! https://t.c...,"[The, latest, The, Gospel, music, Daily, !, ht..."
3,2021-10-07T12:16:16.000Z,1446086974847209474,@ctoLarsson @michael_saylor this will always b...,"[@ctoLarsson, @michael_saylor, this, will, alw..."
4,2021-10-07T12:14:08.000Z,1446086439922397189,#GRAMMYs Beyoncé be like https://t.co/6qHAR2AsH3,"[#GRAMMYs, Beyoncé, be, like, https://t.co/6qH..."
...,...,...,...,...
95,2021-10-05T14:31:38.000Z,1445396267123757068,and the grammy award for song of the year goes...,"[and, the, grammy, award, for, song, of, the, ..."
96,2021-10-05T13:32:04.000Z,1445381273531342855,Go Listen 🔥🔥🔥 you won’t regret it‼️\nD-Nasty -...,"[Go, Listen, 🔥, 🔥, 🔥, you, won, ’, t, regret, ..."
97,2021-10-05T13:22:49.000Z,1445378945206063107,🚨JUSTICE FOR LISA\n#YGLetLisaDoHerWork\n\n@Bul...,"[🚨, JUSTICE, FOR, LISA, #YGLetLisaDoHerWork, @..."
98,2021-10-05T12:54:20.000Z,1445371777215913988,Most tweeted Hashtags at rihanna:\nToday: \n#I...,"[Most, tweeted, Hashtags, at, rihanna, :, Toda..."
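The tokenized tweets contain URLs, emoji, and punctuation, none of which are useful when counting nouns, verbs, or adjectives. A minimal sketch of a filter follows; `is_word_like` is a hypothetical helper, and the rule it applies (not a URL, at least one letter or digit) is an assumption about what counts as a word here.

```python
import re

URL_PATTERN = re.compile(r"^https?://")

def is_word_like(token):
    """True for tokens that are not URLs and contain a letter or digit."""
    if URL_PATTERN.match(token):
        return False
    return any(ch.isalnum() for ch in token)

tokens = ["#GRAMMYs", "Beyoncé", "be", "like", "https://t.co/6qHAR2AsH3", "!", "🔥"]
print([t for t in tokens if is_word_like(t)])
# ['#GRAMMYs', 'Beyoncé', 'be', 'like']
```

Removing tokens before tagging changes the context the tagger sees, so it may be preferable to tag the full token list and apply the filter to the tagged pairs afterwards.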


In [8]:
# Apply POS tagging

data = []
# Build a flat list of tokens from all tweets
for x in tokenized_text:
    for word in x:
        data.append(word)

# Tag the tokens with pos_tag
data_pos = nltk.pos_tag(data)
print(data_pos)
# Keep only common nouns: singular (NN) and plural (NNS)
data_noun = []
for k,v in data_pos:
    if v in ["NN","NNS"]:
        data_noun.append(k)
print(data_noun)

[('evermore', 'NN'), ('(', '('), ('album', 'NN'), (')', ')'), ('is', 'VBZ'), ('so', 'RB'), ('good', 'JJ'), ('.', '.'), ('I', 'PRP'), ('hope', 'VBP'), ('it', 'PRP'), ('gets', 'VBZ'), ('nominated', 'VBN'), ('and', 'CC'), ('eventually', 'RB'), ('win', 'VB'), ('album', 'NN'), ('of', 'IN'), ('the', 'DT'), ('year', 'NN'), ('at', 'IN'), ('the', 'DT'), ('#GRAMMYs', 'NNP'), ('Yo', 'NNP'), ('@sarkodie', 'VBD'), ('the', 'DT'), ('02', 'CD'), ('arena', 'NN'), ('you', 'PRP'), ('say', 'VBP'), ('wossop', "''"), ('or', 'CC'), ('you', 'PRP'), ('’', 'VBP'), ('re', 'JJ'), ('heading', 'VBG'), ('straight', 'NN'), ('to', 'TO'), ('the', 'DT'), ('grammys', 'NN'), ('😂', 'NNP'), ('💔', 'NNP'), ('#GRAMMYs', 'NNP'), ('#KalyJaySpace', 'VBP'), ('The', 'DT'), ('latest', 'JJS'), ('The', 'DT'), ('Gospel', 'NNP'), ('music', 'NN'), ('Daily', 'NNP'), ('!', '.'), ('https://t.co/0dVvRDr8lP', 'NN'), ('Thanks', 'NNS'), ('to', 'TO'), ('@AMBankstw', 'VB'), ('@the_tunesclub', 'NNP'), ('#grammys', 'NNP'), ('#gospel', 'NNP'), ('@ct
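The cell above joins the tokens of all tweets into one list before tagging, so the tagger treats the end of one tweet as context for the start of the next. A sketch of the alternative, tagging each tweet separately and flattening afterwards, is shown below. `tag_per_document` and `fake_tagger` are hypothetical names; in the notebook, `tag_fn` would be `nltk.pos_tag`. NLTK also exposes `nltk.pos_tag_sents`, which takes a list of token lists directly.

```python
def tag_per_document(token_lists, tag_fn):
    """Tag each token list separately, then flatten the tagged pairs."""
    tagged = []
    for tokens in token_lists:
        tagged.extend(tag_fn(tokens))
    return tagged

# Stand-in tagger for illustration only; it labels every token as NN
def fake_tagger(tokens):
    return [(tok, "NN") for tok in tokens]

docs = [["evermore", "album"], ["Beyoncé"]]
print(tag_per_document(docs, fake_tagger))
```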

In [9]:
# Keep only: NN - NNS

from nltk.probability import FreqDist

# Get the frequency of each term
fdist = FreqDist(data_noun)
# Convert to a DataFrame
df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
df_fdist.columns = ['Frequency']
df_fdist.index.name = 'Term'
df_fdist.sort_values(by=['Frequency'], inplace=True, ascending=False)
#pd.set_option('display.max_rows', None)

df_fdist


Term,Frequency
#GRAMMYs,23
album,12
year,10
music,9
★,7
...,...
🎵,1
https://t.co/FjnEhR7PDa,1
bother,1
names,1


- Get the list and frequency of singular and plural proper nouns


# Exercise 2

In [10]:
# Apply POS tagging

data = []
# Build a flat list of tokens from all tweets
for x in tokenized_text:
    for word in x:
        data.append(word)

# Tag the tokens with pos_tag
data_pos = nltk.pos_tag(data)
print(data_pos)
# Keep only proper nouns: singular (NNP) and plural (NNPS)
data_names = []
for k,v in data_pos:
    if v in ["NNP","NNPS"]:
        data_names.append(k)
print(data_names)

[('evermore', 'NN'), ('(', '('), ('album', 'NN'), (')', ')'), ('is', 'VBZ'), ('so', 'RB'), ('good', 'JJ'), ('.', '.'), ('I', 'PRP'), ('hope', 'VBP'), ('it', 'PRP'), ('gets', 'VBZ'), ('nominated', 'VBN'), ('and', 'CC'), ('eventually', 'RB'), ('win', 'VB'), ('album', 'NN'), ('of', 'IN'), ('the', 'DT'), ('year', 'NN'), ('at', 'IN'), ('the', 'DT'), ('#GRAMMYs', 'NNP'), ('Yo', 'NNP'), ('@sarkodie', 'VBD'), ('the', 'DT'), ('02', 'CD'), ('arena', 'NN'), ('you', 'PRP'), ('say', 'VBP'), ('wossop', "''"), ('or', 'CC'), ('you', 'PRP'), ('’', 'VBP'), ('re', 'JJ'), ('heading', 'VBG'), ('straight', 'NN'), ('to', 'TO'), ('the', 'DT'), ('grammys', 'NN'), ('😂', 'NNP'), ('💔', 'NNP'), ('#GRAMMYs', 'NNP'), ('#KalyJaySpace', 'VBP'), ('The', 'DT'), ('latest', 'JJS'), ('The', 'DT'), ('Gospel', 'NNP'), ('music', 'NN'), ('Daily', 'NNP'), ('!', '.'), ('https://t.co/0dVvRDr8lP', 'NN'), ('Thanks', 'NNS'), ('to', 'TO'), ('@AMBankstw', 'VB'), ('@the_tunesclub', 'NNP'), ('#grammys', 'NNP'), ('#gospel', 'NNP'), ('@ct
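Each exercise repeats the same loop with a different list of tags. The selection step can be written once as a small function, sketched below under the hypothetical name `tokens_with_tags`, and reused for nouns, proper nouns, verbs, and adjectives.

```python
def tokens_with_tags(tagged_tokens, wanted_tags):
    """Return the tokens whose tag is in wanted_tags, keeping their order."""
    wanted = set(wanted_tags)
    return [token for token, tag in tagged_tokens if tag in wanted]

# Illustrative pairs in the (token, tag) format that nltk.pos_tag returns
sample = [("evermore", "NN"), ("is", "VBZ"), ("good", "JJ"),
          ("#GRAMMYs", "NNP"), ("Thanks", "NNS")]
print(tokens_with_tags(sample, ["NN", "NNS"]))    # ['evermore', 'Thanks']
print(tokens_with_tags(sample, ["NNP", "NNPS"]))  # ['#GRAMMYs']
```

With this helper the token list only needs to be tagged once, and each exercise reduces to one call with its own tag list.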

In [11]:
# Keep only: NNP - NNPS

from nltk.probability import FreqDist

# Get the frequency of each term
fdist = FreqDist(data_names)
# Convert to a DataFrame
df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
df_fdist.columns = ['Frequency']
df_fdist.index.name = 'Term'
df_fdist.sort_values(by=['Frequency'], inplace=True, ascending=False)
#pd.set_option('display.max_rows', None)

df_fdist

Term,Frequency
#GRAMMYs,36
Best,10
#grammys,10
"""",9
Album,8
...,...
#DespicableMe,1
#Cannonball,1
#FamilyGuy,1
#MorningJoe,1
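The table above lists #GRAMMYs and #grammys as separate terms because the counts are case-sensitive. A sketch of case-insensitive counting with the standard library follows; `count_terms` is a hypothetical helper and the input list is invented for illustration. NLTK's `FreqDist` is a subclass of `collections.Counter`, so the same normalization can be applied before building it.

```python
from collections import Counter

def count_terms(terms, normalize=str.lower):
    """Count terms after normalizing them, most frequent first."""
    return Counter(normalize(term) for term in terms).most_common()

names = ["#GRAMMYs", "#grammys", "Best", "#GRAMMYs", "Album", "best"]
print(count_terms(names))
# [('#grammys', 3), ('best', 2), ('album', 1)]
```

Lowercasing merges variants of the same hashtag, at the cost of also merging words whose capitalization carries meaning, such as a proper noun and a common noun spelled the same way.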


- Get the list and frequency of verbs in all tenses

# Exercise 3

In [12]:
# Apply POS tagging

data = []
# Build a flat list of tokens from all tweets
for x in tokenized_text:
    for word in x:
        data.append(word)

# Tag the tokens with pos_tag
data_pos = nltk.pos_tag(data)
print(data_pos)
# Keep only verbs, in every form tagged by the Penn Treebank tagset
data_verbs = []
for k,v in data_pos:
    if v in ["VBZ", "VBP", "VBN", "VBG", "VBD", "VB"]:
        data_verbs.append(k)
print(data_verbs)

[('evermore', 'NN'), ('(', '('), ('album', 'NN'), (')', ')'), ('is', 'VBZ'), ('so', 'RB'), ('good', 'JJ'), ('.', '.'), ('I', 'PRP'), ('hope', 'VBP'), ('it', 'PRP'), ('gets', 'VBZ'), ('nominated', 'VBN'), ('and', 'CC'), ('eventually', 'RB'), ('win', 'VB'), ('album', 'NN'), ('of', 'IN'), ('the', 'DT'), ('year', 'NN'), ('at', 'IN'), ('the', 'DT'), ('#GRAMMYs', 'NNP'), ('Yo', 'NNP'), ('@sarkodie', 'VBD'), ('the', 'DT'), ('02', 'CD'), ('arena', 'NN'), ('you', 'PRP'), ('say', 'VBP'), ('wossop', "''"), ('or', 'CC'), ('you', 'PRP'), ('’', 'VBP'), ('re', 'JJ'), ('heading', 'VBG'), ('straight', 'NN'), ('to', 'TO'), ('the', 'DT'), ('grammys', 'NN'), ('😂', 'NNP'), ('💔', 'NNP'), ('#GRAMMYs', 'NNP'), ('#KalyJaySpace', 'VBP'), ('The', 'DT'), ('latest', 'JJS'), ('The', 'DT'), ('Gospel', 'NNP'), ('music', 'NN'), ('Daily', 'NNP'), ('!', '.'), ('https://t.co/0dVvRDr8lP', 'NN'), ('Thanks', 'NNS'), ('to', 'TO'), ('@AMBankstw', 'VB'), ('@the_tunesclub', 'NNP'), ('#grammys', 'NNP'), ('#gospel', 'NNP'), ('@ct
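All six verb tags used above start with VB, just as the noun tags start with NN and the adjective tags with JJ. A sketch that exploits this to map tags to coarse categories is shown below; `COARSE_PREFIXES` and `coarse_category` are hypothetical names. Note that the NN prefix covers both common and proper nouns, so this mapping is coarser than the tag lists used in Exercises 1 and 2.

```python
# Coarse categories keyed by the first two letters of a Penn Treebank tag
COARSE_PREFIXES = {"NN": "noun", "VB": "verb", "JJ": "adjective"}

def coarse_category(tag):
    """Map a Penn Treebank tag to a coarse category, or None if unmapped."""
    return COARSE_PREFIXES.get(tag[:2])

# Illustrative pairs in the (token, tag) format that nltk.pos_tag returns
sample = [("is", "VBZ"), ("nominated", "VBN"), ("album", "NN"),
          ("latest", "JJS"), ("the", "DT")]
verbs = [tok for tok, tag in sample if coarse_category(tag) == "verb"]
print(verbs)  # ['is', 'nominated']
```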

In [13]:
from nltk.probability import FreqDist

# Get the frequency of each term
fdist = FreqDist(data_verbs)
# Convert to a DataFrame
df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
df_fdist.columns = ['Frequency']
df_fdist.index.name = 'Term'
df_fdist.sort_values(by=['Frequency'], inplace=True, ascending=False)
#pd.set_option('display.max_rows', None)

df_fdist

Term,Frequency
be,15
is,12
has,6
are,6
reveal,5
...,...
HopeULike,1
mixset,1
meant,1
according,1


- Get the list and frequency of all adjectives

# Exercise 4

In [14]:
# Apply POS tagging

data = []
# Build a flat list of tokens from all tweets
for x in tokenized_text:
    for word in x:
        data.append(word)

# Tag the tokens with pos_tag
data_pos = nltk.pos_tag(data)
print(data_pos)
# Keep only adjectives: base (JJ), comparative (JJR) and superlative (JJS)
data_adjective = []
for k,v in data_pos:
    if v in ["JJ", "JJR", "JJS"]:
        data_adjective.append(k)
print(data_adjective)

[('evermore', 'NN'), ('(', '('), ('album', 'NN'), (')', ')'), ('is', 'VBZ'), ('so', 'RB'), ('good', 'JJ'), ('.', '.'), ('I', 'PRP'), ('hope', 'VBP'), ('it', 'PRP'), ('gets', 'VBZ'), ('nominated', 'VBN'), ('and', 'CC'), ('eventually', 'RB'), ('win', 'VB'), ('album', 'NN'), ('of', 'IN'), ('the', 'DT'), ('year', 'NN'), ('at', 'IN'), ('the', 'DT'), ('#GRAMMYs', 'NNP'), ('Yo', 'NNP'), ('@sarkodie', 'VBD'), ('the', 'DT'), ('02', 'CD'), ('arena', 'NN'), ('you', 'PRP'), ('say', 'VBP'), ('wossop', "''"), ('or', 'CC'), ('you', 'PRP'), ('’', 'VBP'), ('re', 'JJ'), ('heading', 'VBG'), ('straight', 'NN'), ('to', 'TO'), ('the', 'DT'), ('grammys', 'NN'), ('😂', 'NNP'), ('💔', 'NNP'), ('#GRAMMYs', 'NNP'), ('#KalyJaySpace', 'VBP'), ('The', 'DT'), ('latest', 'JJS'), ('The', 'DT'), ('Gospel', 'NNP'), ('music', 'NN'), ('Daily', 'NNP'), ('!', '.'), ('https://t.co/0dVvRDr8lP', 'NN'), ('Thanks', 'NNS'), ('to', 'TO'), ('@AMBankstw', 'VB'), ('@the_tunesclub', 'NNP'), ('#grammys', 'NNP'), ('#gospel', 'NNP'), ('@ct

In [15]:
from nltk.probability import FreqDist

# Get the frequency of each term
fdist = FreqDist(data_adjective)
# Convert to a DataFrame
df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
df_fdist.columns = ['Frequency']
df_fdist.index.name = 'Term'
df_fdist.sort_values(by=['Frequency'], inplace=True, ascending=False)
#pd.set_option('display.max_rows', None)

df_fdist

Term,Frequency
#music,9
official,6
#GRAMMYs,6
early,5
latest,5
...,...
https://t.co/O0kYXsAfIS,1
eye-popping,1
#aoty,1
@BTS_twt,1
