# Tokenization and Lemmatization/Stemming in Python
 
- The goal of this notebook is to demonstrate the word stemming and lemmatization capabilities of the [nltk](https://www.nltk.org/) and [spaCy](https://spacy.io/) packages


### Use the Natural Language Toolkit to retrieve the lemma and stem of each word in a sample text file

In [1]:
# Import Natural Language Toolkit (NLTK) packages
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import porter, WordNetLemmatizer

In [2]:
# Read the stop-word file: a set of commonly used words that are likely irrelevant to
# proposal sorting and are therefore filtered out before any natural language processing occurs
with open('../utils/stopwords.txt', 'r') as test_file:
    tmp = test_file.readlines()
    stop_words = [val.strip('\n') for val in tmp]
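
Stop-word filtering is a repeated membership test, so a set is usually a more convenient container than a list. The following is a minimal sketch rather than the notebook's own method: `load_stop_words` is a hypothetical helper, and the demo writes a throwaway file instead of assuming that `../utils/stopwords.txt` exists.

```python
import os
import tempfile


def load_stop_words(path):
    """Hypothetical helper: read one stop word per line into a set."""
    with open(path, 'r') as stop_file:
        # strip() removes the newline and stray whitespace; blank lines are skipped
        return {line.strip() for line in stop_file if line.strip()}


# Demo with a throwaway file so the sketch runs anywhere
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp_file:
    tmp_file.write('the\nand\n\nof\nthe\n')
    demo_path = tmp_file.name

demo_stop_words = load_stop_words(demo_path)
os.remove(demo_path)
print(sorted(demo_stop_words))
```

Duplicates and blank lines in the file disappear automatically, and `word in demo_stop_words` stays fast however long the list grows.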

In [3]:
nltk.download('wordnet')  # lexical database for the English language
nltk.download('punkt')  # Punkt sentence-tokenizer models, which word_tokenize relies on
nltk.download('averaged_perceptron_tagger')  # tags words with their part of speech

[nltk_data] Downloading package wordnet to /Users/tking/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/tking/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/tking/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
# Read in some example text to play with.
with open('../proposal_data/Cycle25/0001.txtx', 'r') as test_file:
    text = test_file.readlines()
    text = [val.strip('\n') for val in text]

In [5]:
# Remove newline characters (already stripped in the previous cell, so this is a no-op)
text = [val.strip('\n') for val in text]

In [6]:
text

['Hubble Space Telescope',
 '',
 'Cycle 25 AR Proposal',
 '',
 '1',
 '',
 'Numerical Modeling of Superluminous and Peculiar',
 'Supernovae',
 'Scientific Category: Stellar Physics',
 'Scientific Keywords: Circumstellar Matter, Massive Stars, Radiative Transfer, Supernovae, Transients',
 'Budget Size: Regular',
 'Theory: Yes',
 '',
 'Abstract',
 'The Hubble Space Telescope (HST) has been instrumental in elucidating the nature of the intriguing',
 'superluminous supernovae (SLSNe) explosions by providing unparalleled observations of the progenitor stars,',
 'supernova imposters such as "Luminous Blue Variables" (LBVs) and their host galaxy properties. Furthermore,',
 'HST has directly imaged one of the earliest SLSN discovered, SN 2006gy, more than two years after the',
 'explosion. Now, more than a decade since the first modern discovery of SLSNe and with more than a hundred',
 'members of the class observed, the question on the explosion and energy input mechanism of these',
 'unpreced

In [7]:
# Generate a lexicon from the first space-delimited word of each non-empty line
# (note: split(' ')[0] keeps only the first word, not every word on the line)
lexicon = [val.split(' ')[0] for val in text if val != '']

In [8]:
for word in lexicon[:10]:
    print(word)

Hubble
Cycle
1
Numerical
Supernovae
Scientific
Scientific
Budget
Theory:
Abstract
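
The lexicon printed above holds only the first word of each line. If every word is wanted instead, a nested comprehension over `str.split()` is one option. This is a sketch under that assumption, not a change to the notebook's workflow; `sample_lines` is a small excerpt of the proposal text shown earlier.

```python
# A few lines copied from the proposal text shown earlier
sample_lines = [
    'Hubble Space Telescope',
    '',
    'Cycle 25 AR Proposal',
    'Theory: Yes',
]

# First word of each non-empty line (the behavior of the lexicon cell)
first_words = [line.split(' ')[0] for line in sample_lines if line != '']

# Every whitespace-delimited word; split() with no argument yields nothing
# for an empty line, so no separate emptiness check is needed
all_words = [word for line in sample_lines for word in line.split()]

print(first_words)
print(all_words)
```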


In [9]:
# Tokenize each lexicon entry; word_tokenize returns a list, so lexicon becomes a list of token lists
lexicon = [word_tokenize(word) for word in lexicon if len(word) != 0]

In [10]:
def nltk2wn_tag(nltk_tag):
    """
    Convenience function for converting NLTK part-of-speech (POS) tags to their WordNet equivalents
    """
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        # If it's unclear what it is, just assume the default [NOUN]
        return wordnet.NOUN
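
The same conversion can also be expressed as a lookup table keyed on the first letter of the Penn Treebank tag. The sketch below is an illustrative variant, not a replacement for the function above: `penn_to_wordnet` is a hypothetical name, and the literals `'a'`, `'v'`, `'n'`, `'r'` are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`, written out so the example runs without the WordNet corpus.

```python
# First letter of a Penn Treebank tag -> WordNet POS character.
# 'a', 'v', 'n', 'r' are the values of wordnet.ADJ, wordnet.VERB,
# wordnet.NOUN and wordnet.ADV respectively.
PENN_PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}


def penn_to_wordnet(nltk_tag, default='n'):
    """Hypothetical table-driven variant of nltk2wn_tag."""
    # Slicing with [:1] is safe even for an empty tag string
    return PENN_PREFIX_TO_WORDNET.get(nltk_tag[:1], default)


for tag in ['JJ', 'VBD', 'NNS', 'RB', 'CD']:
    print(tag, '->', penn_to_wordnet(tag))
```

Tags outside the four families, such as `CD` (cardinal number), fall back to the noun default, which matches the `else` branch above.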

In [11]:
# Determine the part-of-speech (POS) tag for each token and convert it from NLTK to WordNet.
# Note: each entry is tagged in isolation, so the tagger has no sentence context,
# and lexicon[:-1] skips the last entry.
final_lexicon = []
for lex in lexicon[:-1]:
    pos_tag = nltk.pos_tag(lex)  # determine the part of speech (POS) tag
    wdnet_pos_tag = nltk2wn_tag(pos_tag[0][1])  # retrieve the WordNet equivalent POS
    final_lexicon.append((lex[0], wdnet_pos_tag))  # append the word and its WordNet POS to a final lexicon

In [12]:
final_lexicon[0]

('Hubble', 'a')

In [13]:
# Use the wordnet lemmatizer to group together various forms of a word
lemmatizer = WordNetLemmatizer()

In [14]:
# Use the robust Porter Stemmer to reduce words to their stems
stemmer = porter.PorterStemmer()

In [15]:
for lex in final_lexicon[:10]:
    print(f'Word: {lex} \nLemma: {lemmatizer.lemmatize(lex[0], pos=lex[1])}\nStem: {stemmer.stem(lex[0])}\n')

Word: ('Hubble', 'a') 
Lemma: Hubble
Stem: hubbl

Word: ('Cycle', 'n') 
Lemma: Cycle
Stem: cycl

Word: ('1', 'n') 
Lemma: 1
Stem: 1

Word: ('Numerical', 'a') 
Lemma: Numerical
Stem: numer

Word: ('Supernovae', 'n') 
Lemma: Supernovae
Stem: supernova

Word: ('Scientific', 'n') 
Lemma: Scientific
Stem: scientif

Word: ('Scientific', 'n') 
Lemma: Scientific
Stem: scientif

Word: ('Budget', 'v') 
Lemma: Budget
Stem: budget

Word: ('Theory', 'n') 
Lemma: Theory
Stem: theori

Word: ('Abstract', 'n') 
Lemma: Abstract
Stem: abstract
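
The stems above are lowercased and are often not dictionary words (`hubbl`, `theori`), because the Porter algorithm strips suffixes by rule instead of consulting a vocabulary. The sketch below re-runs the stemmer on a few words from that output and then on a singular/plural pair chosen for illustration. It assumes `nltk` is installed; the stemmer itself needs no downloaded corpus data.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the output above
for word in ['Hubble', 'Cycle', 'Numerical', 'Theory']:
    print(f'{word} -> {stemmer.stem(word)}')

# Inflected forms of one word usually collapse to a shared stem,
# which is what makes stems useful as grouping keys
pair = ('galaxy', 'galaxies')
stems = [stemmer.stem(word) for word in pair]
print(pair, '->', stems)
```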



<hr>

### Perform the same steps using a class-based approach with spaCy

- Unlike the NLTK workflow above, spaCy takes the whole text as a single string and returns a `Doc` of tokens, so the lines of the file are joined before processing

In [16]:
# Import spaCy packages
import spacy
from spacy.lang.en import English
import string

In [17]:
nlp = spacy.load("en_core_web_sm")

In [18]:
# Stop words already exist within the package
print(len(nlp.Defaults.stop_words))

326


In [19]:
# Stop words from our text file
print(len(stop_words))

339


In [20]:
# Store stop words as a set
spacy_stop = set(nlp.Defaults.stop_words)
custom_stop = set(stop_words)

In [21]:
# Compare stop words in order to add in missing stop words
missing_stop_words = custom_stop.difference(spacy_stop)
print(len(missing_stop_words))

49


In [22]:
# Merge the missing stop words into spaCy's default stop-word set (in-place union)
nlp.Defaults.stop_words |= set(missing_stop_words)

In [23]:
print(len(nlp.Defaults.stop_words))

375
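
The counts above are consistent with each other: 326 built-in stop words plus the 49 missing ones gives 375. The sketch below reproduces the same difference-then-union pattern on toy sets; the words are illustrative stand-ins, not the real stop-word lists.

```python
# Toy stand-ins for the two stop-word collections (illustrative values only)
builtin_stop = {'the', 'and', 'of', 'a'}
custom_stop = {'the', 'of', 'proposal', 'hst'}

# Words in the custom list that the built-in set lacks
missing = custom_stop - builtin_stop  # same as custom_stop.difference(builtin_stop)

size_before = len(builtin_stop)
builtin_stop |= missing  # in-place union

# The merged size is the original size plus the number of missing words
assert len(builtin_stop) == size_before + len(missing)
print(sorted(missing), len(builtin_stop))
```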


In [24]:
# Read in example scientific justification text file
with open('../proposal_data/Cycle25/0002.txtx', 'r') as test_file:
    t = test_file.readlines()
    scijust = [val.strip('\n') for val in t]
    scijust = ' '.join(scijust) 

In [25]:
scidoc = nlp(scijust)
print(len(scidoc))

6151


In [26]:
trim_stop_words = []
autogen_stop_words = []

In [27]:
# Separate the stop-word tokens from the tokens that are kept
for token in scidoc:
    if token.is_stop:
        autogen_stop_words.append(token)
        continue
    trim_stop_words.append(token)

In [28]:
# Fraction of the document's tokens that remain after stop-word removal
print(len(trim_stop_words)/len(scidoc))

0.7120793366932207


In [29]:
# String of ASCII punctuation characters
punctuations = string.punctuation

# Create a blank English pipeline: English() provides only the tokenizer (no tagger, parser, NER, or word vectors)
parser = English()

# Define the tokenizer function
def spacy_tokenizer(text, return_type='str'):
    # Tokenize the text (note: return_type is currently unused)
    mytokens = parser(text)
    num_tokens = len(mytokens)
    # Lemmatize each token and lowercase it; "-PRON-" is the placeholder lemma
    # that spaCy v2 assigns to pronouns, so those keep their lowercased text
    mytokens = [
        word.lemma_.lower().strip()
        if word.lemma_ != "-PRON-" else word.lower_ 
        for word in mytokens 
    ]

    # Remove stop words (using the custom stop_words list read from file) and punctuation
    mytokens = [
        word for word in mytokens 
        if word not in stop_words and word not in punctuations
    ]
    print(f"Fraction of input tokens retained: {len(mytokens)/num_tokens:0.2f}")
    
    return mytokens

In [30]:
tokens = spacy_tokenizer(scijust)

Fraction of input tokens retained: 0.51
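
Two details of the filter in `spacy_tokenizer` are easy to miss. It checks tokens against the custom `stop_words` list, not the merged spaCy set, and because `string.punctuation` is a single string, `word not in punctuations` is a substring test: a multi-character token such as `--` is dropped only if it happens to be a contiguous substring of that string. The sketch below shows one possible set-based alternative on a plain list of strings, so it runs without spaCy; `filter_tokens` and the sample tokens are hypothetical.

```python
import string

PUNCTUATION = set(string.punctuation)  # set of single characters


def filter_tokens(tokens, stop_words):
    """Hypothetical helper: lowercase tokens, then drop stop words,
    empty strings, and tokens made up entirely of punctuation."""
    stop_words = set(stop_words)
    kept = []
    for token in tokens:
        token = token.lower().strip()
        if not token or token in stop_words:
            continue
        if all(char in PUNCTUATION for char in token):
            continue
        kept.append(token)
    return kept


raw_tokens = ['The', 'Hubble', 'Space', 'Telescope', ',', '--', 'and', 'WFC3', '.']
kept = filter_tokens(raw_tokens, stop_words=['the', 'and'])
print(kept)
print(f'Fraction of input tokens retained: {len(kept) / len(raw_tokens):0.2f}')
```

Checking every character against a set removes tokens such as `--` or `...` regardless of how the characters are ordered in `string.punctuation`.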


In [31]:
# Print list of tokens
print(tokens)

['hubble', 'space', 'telescope', '2', 'cycle', '25', 'proposal', 'blue', 'diffuse', 'dwarfs', 'missing', 'link', 'dwarf', 'galaxy', 'evolution', 'scientific', 'category', 'galaxies', 'igm', 'scientific', 'keywords', 'dwarf', 'galaxies', 'galaxy', 'formation', 'evolution', 'irregular', 'galaxies', 'star', 'formation', 'histories', 'stellar', 'populations', 'instruments', 'acs', 'wfc3', 'proprietary', 'period', '6', 'months', 'proposal', 'size', 'small', 'orbit', 'request', 'prime', 'parallel', 'cycle', '25', '10', '10', 'abstract', 'dwarf', 'galaxies', 'systems', 'form', 'stars', 'universe', 'understanding', 'starformation', 'histories', 'sfh', 'cosmic', 'time', 'imperative', 'galaxy', 'formation', 'evolution', 'studies', 'particular', 'understanding', 'metal', 'poor', 'systems', 'blue', 'compact', 'dwarf', 'bcd', 'galaxies', 'remain', 'chemically', 'pristine', 'despite', 'long', 'periods', 'moderate', 'sf', 'recent', 'burst', 'remains', 'challenge', 'understand', 'starbursting', 'bcds'