# NLP: Stemming and Lemmatization

This notebook explores **stemming** and **lemmatization**, two techniques used in natural language processing (NLP). Both reduce words to a base form, which shrinks the vocabulary and can improve text analysis and machine learning models.

## Contents
- **Stemming**: reduces a word to a stem by stripping suffixes with fixed rules, ignoring context; the result is not always a real word (e.g., "running" → "run", "replacement" → "replac").
- **Lemmatization**: reduces a word to its dictionary form (lemma), taking the word's part of speech into account (e.g., "better" → "good" when treated as an adjective).
- A comparison of the two approaches with practical examples.


## Dependencies
- Python 3.x
- NLTK, plus its `wordnet` and `averaged_perceptron_tagger_eng` data packages


In [1]:
import nltk

### Stemming
Reduces words to a stem without taking context into account.

In [2]:
from nltk.stem import PorterStemmer

In [3]:
porter = PorterStemmer()

In [4]:
porter.stem('walking')

'walk'

In [5]:
porter.stem('walked')

'walk'

In [6]:
porter.stem('walks')

'walk'

In [7]:
porter.stem('ran')

'ran'

In [8]:
porter.stem('running')

'run'

In [9]:
porter.stem('bosses')

'boss'

In [10]:
porter.stem('replacement')

'replac'

In [12]:
# Build a sentence and apply stemming to each token:
sentence = 'Lemmatization is more sophisticated than stemming'.split()
sentence

['Lemmatization', 'is', 'more', 'sophisticated', 'than', 'stemming']

In [14]:
for token in sentence:
    print(porter.stem(token), end=' ')

lemmat is more sophist than stem 

In [15]:
# More examples
porter.stem('unnecessary')

'unnecessari'

In [16]:
porter.stem('berry')

'berri'
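
Non-word stems such as 'replac' and 'berri' appear because a stemmer only applies suffix rules and never consults a dictionary. The sketch below is a deliberately tiny suffix stripper meant to illustrate that idea. `SUFFIX_RULES` and `toy_stem` are hypothetical names, not part of NLTK, and the rules are far simpler than the Porter algorithm.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# Rules are checked in order and the first one that applies is used.
SUFFIX_RULES = [
    ("sses", "ss"),  # bosses  -> boss
    ("ies", "i"),    # berries -> berri
    ("ing", ""),     # walking -> walk
    ("ed", ""),      # walked  -> walk
    ("s", ""),       # walks   -> walk
]


def toy_stem(word, min_stem_len=2):
    """Apply the first matching rule, leaving at least `min_stem_len` characters before the suffix."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem_len = len(word) - len(suffix)
        if word.endswith(suffix) and stem_len >= min_stem_len:
            return word[:stem_len] + replacement
    return word


for w in ["walking", "walked", "walks", "bosses", "was", "ran"]:
    print(w, "->", toy_stem(w))
```

Like the Porter results in this notebook, the toy rules turn 'was' into 'wa' and leave the irregular form 'ran' untouched, since no suffix rule knows that it is related to 'run'.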

### Lemmatization
Reduces words to their dictionary form, taking the part of speech into account.

In [17]:
from nltk.stem import WordNetLemmatizer

In [18]:
nltk.download('wordnet') # WordNet lexical database, used by the lemmatizer to look up word forms

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Estela\AppData\Roaming\nltk_data...


True

In [19]:
from nltk.corpus import wordnet

In [20]:
lemmatizer = WordNetLemmatizer()

In [21]:
lemmatizer.lemmatize('as')

'a'

In [22]:
lemmatizer.lemmatize('walking')

'walking'

In [23]:
lemmatizer.lemmatize('walking', pos=wordnet.VERB)

'walk'

In [24]:
lemmatizer.lemmatize('going')

'going'

In [25]:
lemmatizer.lemmatize('going', pos=wordnet.VERB)

'go'

In [26]:
lemmatizer.lemmatize('ran', pos=wordnet.VERB)

'run'

In [27]:
lemmatizer.lemmatize('mice')

'mouse'
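
These results hint at how a lemmatizer differs from a stemmer: it looks a word up by its form and part of speech instead of blindly stripping suffixes. As a rough mental model only, the sketch below uses a tiny hand-written table whose entries mirror the outputs shown in this section. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and this is not how `WordNetLemmatizer` is implemented, which relies on WordNet's morphology rules and exception lists.

```python
# Toy lookup table keyed by (word, part of speech); 'n' = noun, 'v' = verb.
# The entries mirror the WordNetLemmatizer outputs shown in this notebook.
TOY_LEMMAS = {
    ("walking", "v"): "walk",
    ("going", "v"): "go",
    ("ran", "v"): "run",
    ("mice", "n"): "mouse",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize("walking"))           # looked up as a noun: no entry, returned unchanged
print(toy_lemmatize("walking", pos="v"))  # looked up as a verb: entry found
print(toy_lemmatize("mice"))              # irregular plural resolved by lookup
```

Because the default part of speech is noun, 'walking' comes back unchanged unless it is explicitly looked up as a verb, which matches the behaviour of the real lemmatizer in the cells of this section.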

In [28]:
# Comparing stemming and lemmatization
porter.stem('was')

'wa'

In [29]:
lemmatizer.lemmatize('was', pos=wordnet.VERB)

'be'

In [31]:
porter.stem('is')

'is'

In [32]:
lemmatizer.lemmatize('is', pos=wordnet.VERB)

'be'

In [33]:
porter.stem('better')

'better'

In [34]:
lemmatizer.lemmatize('better', pos=wordnet.ADJ)

'good'

### Identifying each word's part of speech (noun, verb, etc.) to pass to the lemmatizer

In [35]:
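def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. 'VBZ') to the matching WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Fall back to noun, which is also the lemmatizer's default POS
        return wordnet.NOUN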
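
The same mapping can be sketched as a dictionary lookup on the first letter of the tag. The version below avoids importing NLTK by using the single-letter values held by the `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV` constants ('a', 'v', 'n', 'r'). `TAG_PREFIX_TO_POS` and `to_wordnet_pos` are hypothetical names introduced only for illustration.

```python
# First letter of a Penn Treebank tag -> WordNet POS letter.
# These letters are the values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.
TAG_PREFIX_TO_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}


def to_wordnet_pos(treebank_tag, default="n"):
    """Map a Penn Treebank tag such as 'VBZ' to a WordNet POS letter."""
    return TAG_PREFIX_TO_POS.get(treebank_tag[:1], default)


# A few Penn Treebank tags: proper noun, verb, determiner, adjective, adverb.
for tag in ["NNP", "VBZ", "DT", "JJ", "RB"]:
    print(tag, "->", to_wordnet_pos(tag))
```

Tags outside the four groups, such as 'DT', fall back to noun, mirroring the `else` branch of `get_wordnet_pos`.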

In [37]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Estela\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


True

In [38]:
sentence = "Demi Moore has a devoted following".split()
sentence

['Demi', 'Moore', 'has', 'a', 'devoted', 'following']

In [40]:
# Note that pos_tag labels each token with its part of speech (verb, adjective, etc.) and returns a list of (word, tag) tuples
words_and_tags = nltk.pos_tag(sentence)
words_and_tags

[('Demi', 'NNP'),
 ('Moore', 'NNP'),
 ('has', 'VBZ'),
 ('a', 'DT'),
 ('devoted', 'VBN'),
 ('following', 'NN')]

In [41]:
for word, tag in words_and_tags:
    lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
    print(lemma, end=' ')

Demi Moore have a devote following 

In [42]:
# Another example
sentence = 'The cat was following the bird as it flew by'.split()

In [43]:
words_and_tags = nltk.pos_tag(sentence)
words_and_tags

[('The', 'DT'),
 ('cat', 'NN'),
 ('was', 'VBD'),
 ('following', 'VBG'),
 ('the', 'DT'),
 ('bird', 'NN'),
 ('as', 'IN'),
 ('it', 'PRP'),
 ('flew', 'VBD'),
 ('by', 'IN')]

In [44]:
for word, tag in words_and_tags:
    lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
    print(lemma, end=' ')

The cat be follow the bird a it fly by 