---- -> Stemming <- ----

 Stemming is the process of reducing a word to its stem, the base form to which suffixes and prefixes are attached. The stem approximates the root of the word, known as the lemma, although it need not be a valid word itself. For example, words such as “likes”, “liked”, “likely” and “liking” are reduced to “like” after stemming.
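
To make the idea concrete, here is a minimal sketch of suffix stripping in plain Python. The suffix list and the name `naive_stem` are assumptions made up for illustration; this is not the Porter algorithm, which applies many more rules.

```python
# Hypothetical suffix list, checked in this order; real stemmers use many more rules
SUFFIXES = ("ing", "ed", "ly", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["likes", "liked", "likely", "liking"]:
    print(w, "->", naive_stem(w))
```

This toy version maps “likes” and “likely” to “like” but leaves “liked” and “liking” as “lik”, which shows why practical stemmers need extra rules beyond chopping off endings.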


---- -> This is how the Porter Stemmer stems words. <- ----

In [1]:

# Import the Porter stemmer from NLTK

import nltk
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()

palabras = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']

for p in palabras:
    print(p + ' -> ' + p_stemmer.stem(p))


run -> run
runner -> runner
running -> run
ran -> ran
runs -> run
easily -> easili
fairly -> fairli


---- -> This is how the Snowball Stemmer stems words. <- ----

In [2]:
from nltk.stem.snowball import SnowballStemmer

# The Snowball stemmer needs the language of the text as a parameter

s_stemmer = SnowballStemmer('english', ignore_stopwords=False)

palabras = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']

for p in palabras:
    print(p + ' --> ' + s_stemmer.stem(p))



run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair
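
The two outputs above differ only for 'fairly'. As a hedged sketch, the comparison can be made explicit by running both stemmers side by side and keeping only the words where they disagree. The names `comparacion` and `diferencias` are introduced here for illustration, and the snippet assumes NLTK is installed.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

palabras = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']

# Collect (word, Porter stem, Snowball stem) for every word
comparacion = [(p, porter.stem(p), snowball.stem(p)) for p in palabras]

# Keep only the rows where the two stemmers disagree
diferencias = [fila for fila in comparacion if fila[1] != fila[2]]

for palabra, stem_porter, stem_snowball in comparacion:
    print(f'{palabra:10} {stem_porter:10} {stem_snowball}')
```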


----  -> Lemmatization with spaCy and Stemming with NLTK <- ----

In [4]:
import spacy
nlp = spacy.load("en_core_web_sm")

texto1 = nlp("John Adam is one the researcher who invent the direction of way towards success!")

for token in texto1:
    print(token.text, '\t', token.pos_, '\t', token.lemma_) 

John 	 PROPN 	 John
Adam 	 PROPN 	 Adam
is 	 AUX 	 be
one 	 NUM 	 one
the 	 DET 	 the
researcher 	 NOUN 	 researcher
who 	 PRON 	 who
invent 	 VERB 	 invent
the 	 DET 	 the
direction 	 NOUN 	 direction
of 	 ADP 	 of
way 	 NOUN 	 way
towards 	 ADP 	 towards
success 	 NOUN 	 success
! 	 PUNCT 	 !


In [6]:
def mostrarLemmas(text):
    """Print each token's text, part-of-speech tag and lemma in aligned columns."""
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma_}')

In [7]:
texto2 = nlp("John Adam is one the researcher who invent the direction of way towards success!")

mostrarLemmas(texto2)

John         PROPN  John
Adam         PROPN  Adam
is           AUX    be
one          NUM    one
the          DET    the
researcher   NOUN   researcher
who          PRON   who
invent       VERB   invent
the          DET    the
direction    NOUN   direction
of           ADP    of
way          NOUN   way
towards      ADP    towards
success      NOUN   success
!            PUNCT  !


In [8]:
# Now we do stemming with NLTK's Porter Stemmer and the stopwords corpus

import nltk
nltk.download('popular')

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords



In [9]:
parrafo = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career """

In [12]:
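oraciones = nltk.sent_tokenize(parrafo)

stemmer = PorterStemmer()
# Build the stopword set once instead of once per token
stop_words = set(stopwords.words('english'))

for i in range(len(oraciones)):
    palabras = nltk.word_tokenize(oraciones[i])
    # The stopword check is case-sensitive, so capitalized tokens such as "I" or "We" are kept
    palabras = [stemmer.stem(palabra) for palabra in palabras if palabra not in stop_words]
    oraciones[i] = ' '.join(palabras)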

oraciones

['i three vision india .',
 'in 3000 year histori , peopl world come invad us , captur land , conquer mind .',
 'from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .',
 'yet done nation .',
 'we conquer anyon .',
 'we grab land , cultur , histori tri enforc way life .',
 'my good fortun work three great mind .',
 'dr. vikram sarabhai dept .',
 'space , professor satish dhawan , succeed dr. brahm prakash , father nuclear materi .',
 'i lucki work three close consid great opportun life .',
 'i see four mileston career']
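
Words such as 'i', 'in', 'from' and 'we' survive in the output above because the original tokens were capitalized, while the NLTK stopword list is lowercase and the membership test is case-sensitive. A minimal sketch of case-insensitive filtering follows. The tiny `STOP_WORDS` set and the helper `remove_stopwords` are assumptions made up for illustration, not the NLTK list or API.

```python
# Tiny illustrative stopword set (hypothetical; the NLTK English list is much longer)
STOP_WORDS = {"i", "in", "we", "have", "for", "of", "our", "not"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stopword."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["We", "have", "not", "conquered", "anyone", "."]

# Case-sensitive check, as in the cell above: "We" is kept because "We" != "we"
case_sensitive = [tok for tok in tokens if tok not in STOP_WORDS]

# Case-insensitive check: the token is lowercased only for the comparison
case_insensitive = remove_stopwords(tokens)

print(case_sensitive)
print(case_insensitive)
```

Lowercasing only for the comparison keeps the original casing of the tokens that remain, which matters if a later step relies on capitalization.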

----  -> Lemmatization with NLTK <- ----

Lemmatization in NLP (Natural Language Processing) reduces words to their base or canonical form, known as the lemma. It does so by using a vocabulary and a careful analysis of how words are built, so that only inflectional endings are removed.
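
As a rough illustration of the vocabulary-based approach, the sketch below looks words up in a small table instead of stripping suffixes. The table `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical and hand-written for this example; a real lemmatizer such as WordNetLemmatizer relies on a full lexical database.

```python
# Hypothetical mini-vocabulary mapping inflected forms to their lemmas
LEMMA_TABLE = {
    "ran": "run",
    "running": "run",
    "runs": "run",
    "children": "child",
    "better": "good",
}

def toy_lemmatize(word):
    """Return the lemma from the table, or the word unchanged if it is unknown."""
    return LEMMA_TABLE.get(word.lower(), word)

for w in ["ran", "children", "better", "easily"]:
    print(w, "->", toy_lemmatize(w))
```

Unlike the stemmers shown earlier, which leave 'ran' as 'ran', a vocabulary lookup can relate an irregular form to its lemma, and unknown words are returned as they are rather than being truncated.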

In [13]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

parrafo = """ Thank you all so very much. Thank you to the Academy. 
               Thank you to all of you in this room. I have to congratulate 
               the other incredible nominees this year. The Revenant was 
               the product of the tireless efforts of an unbelievable cast
               support leaders around the world who do not speak for the 
               big polluters, but who speak for all of humanity, for the
               indigenous people of the world, for the billions and 
               billions of underprivileged people out there who would be
               most affected by this. For our children’s children, and 
               for those people out there whose voices have been drowned
               out by the politics of greed. I thank you all for this 
               amazing award tonight. Let us not take this planet for 
               granted. I do not take tonight for granted. Thank you so very much."""

oraciones = nltk.sent_tokenize(parrafo)
lematizador = WordNetLemmatizer()
# Build the stopword set once instead of once per token
stop_words = set(stopwords.words('english'))

# Lemmatize the paragraph sentence by sentence

for i in range(len(oraciones)):
    palabras = nltk.word_tokenize(oraciones[i])
    palabras = [lematizador.lemmatize(palabra) for palabra in palabras if palabra not in stop_words]
    oraciones[i] = ' '.join(palabras)

oraciones

['Thank much .',
 'Thank Academy .',
 'Thank room .',
 'I congratulate incredible nominee year .',
 'The Revenant product tireless effort unbelievable cast support leader around world speak big polluter , speak humanity , indigenous people world , billion billion underprivileged people would affected .',
 'For child ’ child , people whose voice drowned politics greed .',
 'I thank amazing award tonight .',
 'Let u take planet granted .',
 'I take tonight granted .',
 'Thank much .']

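Now let's do it in Spanish. Keep in mind that WordNetLemmatizer is backed by the English WordNet, so Spanish words are mostly left unchanged here and the visible effect comes mainly from stopword removal: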

In [14]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

parrafo = """ Estoy profundamente honrado y agradecido por haber sido elegido para recibir este prestigioso premio. 
    Quiero agradecer al comité de selección por reconocer mi trabajo y esfuerzo.
    También quiero agradecer a mi familia y amigos por su apoyo constante en todo momento. Sin ellos, no estaría aquí hoy.
    Este premio es un gran logro para mí y me motiva a seguir trabajando duro para alcanzar mis metas. Espero poder seguir contribuyendo positivamente en mi campo de trabajo.
    Una vez más, muchas gracias por este honor. Lo aprecio mucho."""

oraciones = nltk.sent_tokenize(parrafo)
lematizador = WordNetLemmatizer()
# Build the Spanish stopword set once instead of once per token
stop_words = set(stopwords.words('spanish'))

# Lemmatize the paragraph sentence by sentence

for i in range(len(oraciones)):
    palabras = nltk.word_tokenize(oraciones[i])
    palabras = [lematizador.lemmatize(palabra) for palabra in palabras if palabra not in stop_words]
    oraciones[i] = ' '.join(palabras)

oraciones


['Estoy profundamente honrado agradecido haber sido elegido recibir prestigioso premio .',
 'Quiero agradecer comité selección reconocer trabajo esfuerzo .',
 'También quiero agradecer familia amigo apoyo constante momento .',
 'Sin , aquí hoy .',
 'Este premio gran logro motiva seguir trabajando duro alcanzar metas .',
 'Espero poder seguir contribuyendo positivamente campo trabajo .',
 'Una vez , muchas gracias honor .',
 'Lo aprecio .']