# POS Tagging

POS (part-of-speech) tagging, also called morphological tagging, is the process of classifying each word of a text according to its lexical category.

Each word is assigned a lexical category from a fixed collection of coded tags, according to its function in the corresponding language. The text must be tokenized before it can be POS tagged.

NLTK offers a function called pos_tag. This function classifies English words according to a predefined tag set. This particular tagger is based on machine learning and has been trained on thousands of manually tagged example sentences. It can therefore estimate the most probable lexical category of a term, which does not mean that it is free of errors.

A complete list of the tag codes used by NLTK can be printed by calling nltk.help.upenn_tagset() with no arguments (this requires the tagsets resource).

In [1]:
import nltk

# Uncomment on the first run to download the required NLTK resources
# nltk.download('averaged_perceptron_tagger')
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('tagsets')

You can also obtain the description of a specific category.

In [2]:
nltk.help.upenn_tagset("NNS")
tokenized_text = ["texas"]
text_pos = nltk.pos_tag(tokenized_text)
print(text_pos)

NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
[('texas', 'NN')]


Since this tagger may not be good enough in some cases, tagging accuracy can be improved by combining it with manually created POS taggers.
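
As one way to picture a hand-made tagger layered on top of pos_tag, the sketch below post-processes tokens that are already tagged and overrides the tag of any token matching a regular expression. The rule list, the tag names `HT`, `USR` and `URL`, and the function `override_tags` are illustrative assumptions; they are not part of NLTK or of the Penn Treebank tag set.

```python
import re

# Hypothetical rules: (compiled pattern, replacement tag). These tags are
# made up for illustration and are not Penn Treebank tags.
OVERRIDE_RULES = [
    (re.compile(r"^#\w+$"), "HT"),           # hashtags
    (re.compile(r"^@\w+$"), "USR"),          # user mentions
    (re.compile(r"^https?://\S+$"), "URL"),  # links
]


def override_tags(tagged_tokens, rules=OVERRIDE_RULES):
    """Return a new list where tokens matching a rule get that rule's tag."""
    result = []
    for token, tag in tagged_tokens:
        for pattern, new_tag in rules:
            if pattern.match(token):
                tag = new_tag
                break  # the first matching rule wins
        result.append((token, tag))
    return result


# Hand-written sample in the (token, tag) shape returned by nltk.pos_tag.
sample = [
    ("#GRAMMYs", "NNP"),
    ("@RecordingAcad", "$"),
    ("voting", "NN"),
    ("https://t.co/NuOaPvT3ta", "NN"),
]
print(override_tags(sample))
```

Running the statistical tagger on the full token list first and correcting its output afterwards keeps the sentence context available to the tagger; only the tokens covered by a rule are changed.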

## Tagging

In [3]:
example = "The Palace of Westminster serves as the meeting place for both the House of Commons and the House of Lords, the two houses of the Parliament of the United Kingdom. Informally known as the Houses of Parliament after its occupants, the Palace lies on the north bank of the River Thames in the City of Westminster, in central London, England."
# Tokenize the text
tokenized_text = nltk.word_tokenize(example)
print(tokenized_text)
# Tag the text with pos_tag
text_pos = nltk.pos_tag(tokenized_text)
print(text_pos)

['The', 'Palace', 'of', 'Westminster', 'serves', 'as', 'the', 'meeting', 'place', 'for', 'both', 'the', 'House', 'of', 'Commons', 'and', 'the', 'House', 'of', 'Lords', ',', 'the', 'two', 'houses', 'of', 'the', 'Parliament', 'of', 'the', 'United', 'Kingdom', '.', 'Informally', 'known', 'as', 'the', 'Houses', 'of', 'Parliament', 'after', 'its', 'occupants', ',', 'the', 'Palace', 'lies', 'on', 'the', 'north', 'bank', 'of', 'the', 'River', 'Thames', 'in', 'the', 'City', 'of', 'Westminster', ',', 'in', 'central', 'London', ',', 'England', '.']
[('The', 'DT'), ('Palace', 'NNP'), ('of', 'IN'), ('Westminster', 'NNP'), ('serves', 'NNS'), ('as', 'IN'), ('the', 'DT'), ('meeting', 'NN'), ('place', 'NN'), ('for', 'IN'), ('both', 'CC'), ('the', 'DT'), ('House', 'NNP'), ('of', 'IN'), ('Commons', 'NNPS'), ('and', 'CC'), ('the', 'DT'), ('House', 'NNP'), ('of', 'IN'), ('Lords', 'NNPS'), (',', ','), ('the', 'DT'), ('two', 'CD'), ('houses', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Parliament', 'NNP'), ('of'

# POS tagging in Spanish

To perform POS tagging in another language, we need to find a tagger already trained for that language or train one ourselves. We also need to know which word categories exist for that language.

The following <a href="https://colab.research.google.com/github/vitojph/kschool-nlp-18/blob/master/notebooks/pos-tagger-es.ipynb">link</a> shows a practical example of POS tagging in Spanish.
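
To give a rough idea of what training a tagger ourselves involves, here is a minimal sketch of a most-frequent-tag (unigram) baseline written with the standard library only. The three training sentences, the coarse tags (`DET`, `NOUN`, `VERB`, `ADJ`) and the function names are made up for illustration; a real tagger would be trained on a large hand-tagged Spanish corpus, and NLTK's own `UnigramTagger` class follows a similar idea.

```python
from collections import Counter, defaultdict

# Toy hand-tagged Spanish sentences (invented for illustration only).
TRAIN = [
    [("el", "DET"), ("gato", "NOUN"), ("duerme", "VERB")],
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ")],
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB")],
]


def train_unigram(sentences):
    """Map each lowercased word to the tag it received most often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(tokens, model, default="NOUN"):
    """Tag tokens with the learned tag, falling back to a default tag."""
    return [(token, model.get(token.lower(), default)) for token in tokens]


model = train_unigram(TRAIN)
# "tranquilo" was never seen in training, so it receives the default tag.
print(tag(["El", "perro", "duerme", "tranquilo"], model))
```

The unseen word ends up tagged as a noun even though it is an adjective here, which shows why both the size of the training corpus and the fallback strategy matter.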

In [4]:
import requests
import os
from dotenv import load_dotenv
import pandas as pd
from nltk.tokenize import TweetTokenizer
import nltk

url = "https://api.twitter.com/2/tweets/search/recent"
# Load the values from the .env file into the environment variables
load_dotenv()
# Read the token value into a variable
bearer_token = os.environ.get("BEARER_TOKEN")

# Exercise

- Fetch from the API a list of tweets in English that are not retweets and that contain the hashtag #GRAMMYs.
- Tokenize the tweets
- POS tag them with NLTK's pos_tag function
- Get the list and frequency of singular and plural nouns

# Exercise 1

In [5]:
params = {
    "query": "#GRAMMYs  lang:en -is:retweet",
    "tweet.fields": "created_at",
    "max_results": 100,
}
headers = {
    "Authorization": f"Bearer {bearer_token}",
    "User-Agent": "v2FullArchiveSearchPython",
}
response = requests.get(url, headers=headers, params=params)
print(response)
# Raise an exception if the response is not successful
if response.status_code != 200:
    raise Exception(response.status_code, response.text)
print(response.json())

<Response [200]>
{'data': [{'created_at': '2021-10-23T14:32:32.000Z', 'id': '1451919474106552321', 'text': 'For your consideration:)\n@RecordingAcad 1st round voting is open!\n#contemporaryblues #bluesmusic #GRAMMYs https://t.co/NuOaPvT3ta'}, {'created_at': '2021-10-23T14:30:20.000Z', 'id': '1451918921456574469', 'text': 'Big up our President and CEO @harveymasonjr \n\nCongrats! #grammys #recordingacademy @recordingacademy @ Monom Recording Studio https://t.co/Dll5dB1GGI'}, {'created_at': '2021-10-23T14:29:42.000Z', 'id': '1451918761058054149', 'text': "#7 Grammy in the building nobody works harder than our founder @malikyusef100 . Oh you think you know someone else...I'll wait. #badkids #grammys #ye #donda https://t.co/wuRFbX62I2 https://t.co/HxOk7CF4Mq"}, {'created_at': '2021-10-23T14:29:14.000Z', 'id': '1451918641348284422', 'text': 'A moment in history. #Grammys #ARIAsMileyCyrus https://t.co/R4iyIsE100'}, {'created_at': '2021-10-23T14:25:03.000Z', 'id': '1451917592197742593', 'text

In [6]:
df = pd.json_normalize(response.json()["data"])
df

Unnamed: 0,created_at,id,text
0,2021-10-23T14:32:32.000Z,1451919474106552321,For your consideration:)\n@RecordingAcad 1st r...
1,2021-10-23T14:30:20.000Z,1451918921456574469,Big up our President and CEO @harveymasonjr \n...
2,2021-10-23T14:29:42.000Z,1451918761058054149,#7 Grammy in the building nobody works harder ...
3,2021-10-23T14:29:14.000Z,1451918641348284422,A moment in history. #Grammys #ARIAsMileyCyrus...
4,2021-10-23T14:25:03.000Z,1451917592197742593,#7 Grammy in the building nobody works harder ...
...,...,...,...
95,2021-10-23T02:56:45.000Z,1451744374245638146,For Ur Grammy Consideration #NCT127 #Sticker \...
96,2021-10-23T02:52:16.000Z,1451743246359662593,As seen in @Billboard Magazine's GRAMMY® Editi...
97,2021-10-23T02:50:24.000Z,1451742776220078081,OMG!!! YOU'RE TELLING ME THAT THE BOYS CAN VE ...
98,2021-10-23T02:40:14.000Z,1451740216436277251,#TOMORROW_X_TOGETHER is now eligible for 2022 ...


In [7]:
# Tokenize

tt = TweetTokenizer()

tokenized_text = df["text"].apply(tt.tokenize)
df["tokenized_text"] = tokenized_text
df

Unnamed: 0,created_at,id,text,tokenized_text
0,2021-10-23T14:32:32.000Z,1451919474106552321,For your consideration:)\n@RecordingAcad 1st r...,"[For, your, consideration, :), @RecordingAcad,..."
1,2021-10-23T14:30:20.000Z,1451918921456574469,Big up our President and CEO @harveymasonjr \n...,"[Big, up, our, President, and, CEO, @harveymas..."
2,2021-10-23T14:29:42.000Z,1451918761058054149,#7 Grammy in the building nobody works harder ...,"[#, 7, Grammy, in, the, building, nobody, work..."
3,2021-10-23T14:29:14.000Z,1451918641348284422,A moment in history. #Grammys #ARIAsMileyCyrus...,"[A, moment, in, history, ., #Grammys, #ARIAsMi..."
4,2021-10-23T14:25:03.000Z,1451917592197742593,#7 Grammy in the building nobody works harder ...,"[#, 7, Grammy, in, the, building, nobody, work..."
...,...,...,...,...
95,2021-10-23T02:56:45.000Z,1451744374245638146,For Ur Grammy Consideration #NCT127 #Sticker \...,"[For, Ur, Grammy, Consideration, #NCT127, #Sti..."
96,2021-10-23T02:52:16.000Z,1451743246359662593,As seen in @Billboard Magazine's GRAMMY® Editi...,"[As, seen, in, @Billboard, Magazine's, GRAMMY,..."
97,2021-10-23T02:50:24.000Z,1451742776220078081,OMG!!! YOU'RE TELLING ME THAT THE BOYS CAN VE ...,"[OMG, !, !, !, YOU'RE, TELLING, ME, THAT, THE,..."
98,2021-10-23T02:40:14.000Z,1451740216436277251,#TOMORROW_X_TOGETHER is now eligible for 2022 ...,"[#TOMORROW_X_TOGETHER, is, now, eligible, for,..."


In [8]:
# Apply POS tagging

data = []
# Build a flat list of words
for x in tokenized_text:
    for word in x:
        data.append(word)

# Tag the text with pos_tag
data_pos = nltk.pos_tag(data)
print(data_pos)
data_noun = []
for k, v in data_pos:
    if v in ["NN", "NNS"]:
        data_noun.append(k)
print(data_noun)

[('For', 'IN'), ('your', 'PRP$'), ('consideration', 'NN'), (':)', 'VBD'), ('@RecordingAcad', '$'), ('1st', 'CD'), ('round', 'NN'), ('voting', 'NN'), ('is', 'VBZ'), ('open', 'JJ'), ('!', '.'), ('#contemporaryblues', 'NNS'), ('#bluesmusic', 'JJ'), ('#GRAMMYs', 'NNP'), ('https://t.co/NuOaPvT3ta', 'NN'), ('Big', 'NNP'), ('up', 'IN'), ('our', 'PRP$'), ('President', 'NNP'), ('and', 'CC'), ('CEO', 'NNP'), ('@harveymasonjr', 'NNP'), ('Congrats', 'NNP'), ('!', '.'), ('#grammys', 'NN'), ('#recordingacademy', 'NNP'), ('@recordingacademy', 'NNP'), ('@', 'NNP'), ('Monom', 'NNP'), ('Recording', 'NNP'), ('Studio', 'NNP'), ('https://t.co/Dll5dB1GGI', 'NN'), ('#', '#'), ('7', 'CD'), ('Grammy', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('building', 'NN'), ('nobody', 'NN'), ('works', 'VBZ'), ('harder', 'JJR'), ('than', 'IN'), ('our', 'PRP$'), ('founder', 'NN'), ('@malikyusef100', 'NN'), ('.', '.'), ('Oh', 'UH'), ('you', 'PRP'), ('think', 'VBP'), ('you', 'PRP'), ('know', 'VBP'), ('someone', 'NN'), ('else', 'RB

In [9]:
# Frequency of common nouns only: NN - NNS

from nltk.probability import FreqDist

# Get the frequency of each term
fdist = FreqDist(data_noun)
# Convert to a DataFrame
df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
df_fdist.columns = ['Frequency']
df_fdist.index.name = 'Term'
df_fdist.sort_values(by=['Frequency'], inplace=True, ascending=False)
# pd.set_option('display.max_rows', None)

df_fdist


Unnamed: 0_level_0,Frequency
Term,Unnamed: 1_level_1
#GRAMMYs,13
album,13
consideration,10
"""",6
music,6
...,...
@caseydriessen,1
https://t.co/zHXUOBXlt1,1
#Inclusivity,1
#language,1


- Get the list and frequency of singular and plural proper nouns


# Exercise 2

In [10]:
# Apply POS tagging

data = []
# Build a flat list of words
for x in tokenized_text:
    for word in x:
        data.append(word)

# Tag the text with pos_tag
data_pos = nltk.pos_tag(data)
print(data_pos)
data_names = []
for k, v in data_pos:
    if v in ["NNP", "NNPS"]:
        data_names.append(k)
print(data_names)

[('For', 'IN'), ('your', 'PRP$'), ('consideration', 'NN'), (':)', 'VBD'), ('@RecordingAcad', '$'), ('1st', 'CD'), ('round', 'NN'), ('voting', 'NN'), ('is', 'VBZ'), ('open', 'JJ'), ('!', '.'), ('#contemporaryblues', 'NNS'), ('#bluesmusic', 'JJ'), ('#GRAMMYs', 'NNP'), ('https://t.co/NuOaPvT3ta', 'NN'), ('Big', 'NNP'), ('up', 'IN'), ('our', 'PRP$'), ('President', 'NNP'), ('and', 'CC'), ('CEO', 'NNP'), ('@harveymasonjr', 'NNP'), ('Congrats', 'NNP'), ('!', '.'), ('#grammys', 'NN'), ('#recordingacademy', 'NNP'), ('@recordingacademy', 'NNP'), ('@', 'NNP'), ('Monom', 'NNP'), ('Recording', 'NNP'), ('Studio', 'NNP'), ('https://t.co/Dll5dB1GGI', 'NN'), ('#', '#'), ('7', 'CD'), ('Grammy', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('building', 'NN'), ('nobody', 'NN'), ('works', 'VBZ'), ('harder', 'JJR'), ('than', 'IN'), ('our', 'PRP$'), ('founder', 'NN'), ('@malikyusef100', 'NN'), ('.', '.'), ('Oh', 'UH'), ('you', 'PRP'), ('think', 'VBP'), ('you', 'PRP'), ('know', 'VBP'), ('someone', 'NN'), ('else', 'RB

In [11]:
# Frequency of proper nouns only: NNP - NNPS

from nltk.probability import FreqDist

# Get the frequency of each term
fdist = FreqDist(data_names)
# Convert to a DataFrame
df_fdist = pd.DataFrame.from_dict(fdist, orient="index")
df_fdist.columns = ["Frequency"]
df_fdist.index.name = "Term"
df_fdist.sort_values(by=["Frequency"], inplace=True, ascending=False)
# pd.set_option('display.max_rows', None)

df_fdist

Unnamed: 0_level_0,Frequency
Term,Unnamed: 1_level_1
#GRAMMYs,43
Best,26
Grammy,20
Album,18
’,15
...,...
RYZE,1
FYC,1
@ryzemagazine,1
#adifferentleague,1


- Get the list and frequency of verbs in all tenses

# Exercise 3

In [12]:
# Apply POS tagging

data = []
# Build a flat list of words
for x in tokenized_text:
    for word in x:
        data.append(word)

# Tag the text with pos_tag
data_pos = nltk.pos_tag(data)
print(data_pos)
data_verbs = []
for k, v in data_pos:
    if v in ["VBZ", "VBP", "VBN", "VBG", "VBD", "VB"]:
        data_verbs.append(k)
print(data_verbs)

[('For', 'IN'), ('your', 'PRP$'), ('consideration', 'NN'), (':)', 'VBD'), ('@RecordingAcad', '$'), ('1st', 'CD'), ('round', 'NN'), ('voting', 'NN'), ('is', 'VBZ'), ('open', 'JJ'), ('!', '.'), ('#contemporaryblues', 'NNS'), ('#bluesmusic', 'JJ'), ('#GRAMMYs', 'NNP'), ('https://t.co/NuOaPvT3ta', 'NN'), ('Big', 'NNP'), ('up', 'IN'), ('our', 'PRP$'), ('President', 'NNP'), ('and', 'CC'), ('CEO', 'NNP'), ('@harveymasonjr', 'NNP'), ('Congrats', 'NNP'), ('!', '.'), ('#grammys', 'NN'), ('#recordingacademy', 'NNP'), ('@recordingacademy', 'NNP'), ('@', 'NNP'), ('Monom', 'NNP'), ('Recording', 'NNP'), ('Studio', 'NNP'), ('https://t.co/Dll5dB1GGI', 'NN'), ('#', '#'), ('7', 'CD'), ('Grammy', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('building', 'NN'), ('nobody', 'NN'), ('works', 'VBZ'), ('harder', 'JJR'), ('than', 'IN'), ('our', 'PRP$'), ('founder', 'NN'), ('@malikyusef100', 'NN'), ('.', '.'), ('Oh', 'UH'), ('you', 'PRP'), ('think', 'VBP'), ('you', 'PRP'), ('know', 'VBP'), ('someone', 'NN'), ('else', 'RB

In [13]:
from nltk.probability import FreqDist

# Get the frequency of each term
fdist = FreqDist(data_verbs)
# Convert to a DataFrame
df_fdist = pd.DataFrame.from_dict(fdist, orient="index")
df_fdist.columns = ["Frequency"]
df_fdist.index.name = "Term"
df_fdist.sort_values(by=["Frequency"], inplace=True, ascending=False)
# pd.set_option('display.max_rows', None)

df_fdist

Unnamed: 0_level_0,Frequency
Term,Unnamed: 1_level_1
is,20
submitted,14
has,14
be,12
been,9
...,...
#Grammys,1
IS,1
released,1
https://t.co/OYX46aJeVt,1


- Get the list and frequency of all adjectives

# Exercise 4

In [14]:
# Apply POS tagging

data = []
# Build a flat list of words
for x in tokenized_text:
    for word in x:
        data.append(word)

# Tag the text with pos_tag
data_pos = nltk.pos_tag(data)
print(data_pos)
data_adjective = []
for k, v in data_pos:
    if v in ["JJ", "JJR", "JJS"]:
        data_adjective.append(k)
print(data_adjective)

[('For', 'IN'), ('your', 'PRP$'), ('consideration', 'NN'), (':)', 'VBD'), ('@RecordingAcad', '$'), ('1st', 'CD'), ('round', 'NN'), ('voting', 'NN'), ('is', 'VBZ'), ('open', 'JJ'), ('!', '.'), ('#contemporaryblues', 'NNS'), ('#bluesmusic', 'JJ'), ('#GRAMMYs', 'NNP'), ('https://t.co/NuOaPvT3ta', 'NN'), ('Big', 'NNP'), ('up', 'IN'), ('our', 'PRP$'), ('President', 'NNP'), ('and', 'CC'), ('CEO', 'NNP'), ('@harveymasonjr', 'NNP'), ('Congrats', 'NNP'), ('!', '.'), ('#grammys', 'NN'), ('#recordingacademy', 'NNP'), ('@recordingacademy', 'NNP'), ('@', 'NNP'), ('Monom', 'NNP'), ('Recording', 'NNP'), ('Studio', 'NNP'), ('https://t.co/Dll5dB1GGI', 'NN'), ('#', '#'), ('7', 'CD'), ('Grammy', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('building', 'NN'), ('nobody', 'NN'), ('works', 'VBZ'), ('harder', 'JJR'), ('than', 'IN'), ('our', 'PRP$'), ('founder', 'NN'), ('@malikyusef100', 'NN'), ('.', '.'), ('Oh', 'UH'), ('you', 'PRP'), ('think', 'VBP'), ('you', 'PRP'), ('know', 'VBP'), ('someone', 'NN'), ('else', 'RB

In [15]:
from nltk.probability import FreqDist

# Get the frequency of each term
fdist = FreqDist(data_adjective)
# Convert to a DataFrame
df_fdist = pd.DataFrame.from_dict(fdist, orient="index")
df_fdist.columns = ["Frequency"]
df_fdist.index.name = "Term"
df_fdist.sort_values(by=["Frequency"], inplace=True, ascending=False)
# pd.set_option('display.max_rows', None)

df_fdist

Unnamed: 0_level_0,Frequency
Term,Unnamed: 1_level_1
#GRAMMYs,12
best,9
American,4
#Grammys,4
’,3
...,...
musician,1
dear,1
award-winning,1
nominated.then,1
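
The four exercises repeat the same steps: flatten the token lists, tag them, filter by tag, and count. As a hedged sketch of how that could be factored out, the helper below groups the Penn Treebank tags used above and counts terms per group with the standard library. The names `TAG_GROUPS`, `flatten` and `term_frequencies` are hypothetical, and the tagged sample is hand-written rather than real pos_tag output.

```python
from collections import Counter
from itertools import chain

# Penn Treebank tag groups matching the filters used in the four exercises.
TAG_GROUPS = {
    "common_nouns": {"NN", "NNS"},
    "proper_nouns": {"NNP", "NNPS"},
    "verbs": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
    "adjectives": {"JJ", "JJR", "JJS"},
}


def flatten(token_lists):
    """Turn a sequence of token lists (one per tweet) into one flat list."""
    return list(chain.from_iterable(token_lists))


def term_frequencies(tagged_tokens, tags):
    """Count the tokens whose tag belongs to the given set of tags."""
    return Counter(token for token, tag in tagged_tokens if tag in tags)


# Hand-written sample in the (token, tag) shape returned by nltk.pos_tag.
tagged = [
    ("voting", "NN"),
    ("is", "VBZ"),
    ("open", "JJ"),
    ("Grammy", "NNP"),
    ("voting", "NN"),
]
freqs = {name: term_frequencies(tagged, tags) for name, tags in TAG_GROUPS.items()}
print(freqs["common_nouns"].most_common())
```

In the notebook, `tagged` would presumably come from a single call such as `nltk.pos_tag(flatten(df["tokenized_text"]))`, so the tweets are tagged once instead of once per exercise.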

