# Natural Language in Python

Natural language is fun to work with in Python, thanks to easy-to-use tools. Text can be processed quickly with regular expressions, or libraries like `nltk` and `spaCy` can run pre-trained models to tokenize, parse, and vectorize text.

This guide is intended as a quick overview of the options you have in Python.

* Regular expressions
    * Getting rid of punctuation
    * Scrubbing XML
* Natural Language ToolKit (nltk)
    * Tokenization
    * Part-of-speech tagging
    * Sentence tokenization
    * Stemming
    * Lemmatization
* spaCy library
    * Tokens and dependencies
    * Named entity recognition
    * Word vectors

## Regular Expressions

Regular expressions (regex) are extremely useful in natural language processing. You can use them through the [`re` library](https://docs.python.org/3/library/re.html) in Python. Regex may be intimidating, but it's worth the effort.

In [1]:
import re

### Punctuation

Removing punctuation from text is a common task, and regex makes it easy.

In [2]:
text = "The brown dog jumped over the lazy cheese; repeatedly. Without the cheese, there is boredom: the dog?"

print(re.sub(r'[.,;:!?-]', ' ', text))

The brown dog jumped over the lazy cheese  repeatedly  Without the cheese  there is boredom  the dog 


The substitution above leaves runs of extra spaces, which a second regex can collapse.

In [3]:
text = "The brown dog jumped over the lazy cheese; repeatedly. Without the cheese, there is boredom: the dog?"

spaces = re.sub(r'[.,;:!?-]', ' ', text)

print(re.sub(r'[ ]+', ' ', spaces))

The brown dog jumped over the lazy cheese repeatedly Without the cheese there is boredom the dog 
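
The two substitutions can also be folded into a single pass. The sketch below is one possible variant rather than the only way to do it: it assumes you want to drop every ASCII punctuation mark listed in `string.punctuation` (a broader set than the handful of characters used above), and it also trims the trailing space.

```python
import re
import string

text = "The brown dog jumped over the lazy cheese; repeatedly. Without the cheese, there is boredom: the dog?"

# Character class covering all ASCII punctuation plus whitespace;
# re.escape keeps characters such as ] and \ from breaking the class
punct_or_space = re.compile("[" + re.escape(string.punctuation) + r"\s]+")

# Replace each run of punctuation/whitespace with one space, then trim the ends
cleaned = punct_or_space.sub(" ", text).strip()
print(cleaned)
```

Keep in mind that this also strips apostrophes and hyphens inside words, so `don't` becomes `don t`. Narrow the character class if that matters for your data.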


Regular expressions look weird at first, but a [good cheatsheet](https://pycon2016.regex.training/cheat-sheet) helps. Microsoft also makes [printable cheatsheets](https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference), but they describe the .NET flavor, so there may be slight differences from Python's `re`.

### Getting rid of XML tags

If you're working with web data, regex is useful for cleaning. For example, the 100MB Wikipedia dataset (enwik8) has XML and HTML tags everywhere, which you normally don't want.

In [4]:
with open("./enwik8.txt", "r") as f:
    enwik8 = f.read().splitlines()

print(enwik8[50:60])

['        <id>8029</id>', '      </contributor>', '      <minor />', '      <comment>adding cur_id=5: {{R from CamelCase}}</comment>', '      <text xml:space="preserve">#REDIRECT [[Algeria]]{{R from CamelCase}}</text>', '    </revision>', '  </page>', '  <page>', '    <title>AmericanSamoa</title>', '    <id>6</id>']


Luckily, a researcher named [Matt Mahoney](http://mattmahoney.net/) wrote a nice Perl script for cleaning that stuff out. The fastText team has made that script available [here](https://github.com/facebookresearch/fastText/blob/master/wikifil.pl) and I've translated it into Python below.

In [5]:
cleaned_enwik8 = []

# I've kept the comments in the code, but I've otherwise tweaked it to run in Python

# Program to filter Wikipedia XML dumps to "clean" text consisting only of lowercase
# letters (a-z, converted from A-Z), and spaces (never consecutive).  
# All other characters are converted to spaces.  Only text which normally appears 
# in the web browser is displayed.  Tables are removed.  Image captions are 
# preserved.  Links are converted to normal text.  Digits are spelled out.

# Written by Matt Mahoney, June 10, 2006.  This program is released to the public domain.
for line in enwik8:
    if "<text" in line.lower() and "#redirect" not in line.lower():
        line = line.lower()
        line = re.sub(r"<.*>", r"", line) # remove xml tags
        line = re.sub(r"&amp;", r"&", line) # decode URL encoded chars
        line = re.sub(r"&lt;", r"<", line)
        line = re.sub(r"&gt;", r">", line)
        line = re.sub(r"<ref[^<]*<\/ref>", r"", line) # remove references <ref...> ... </ref>
        line = re.sub(r"<[^>]*>", r"", line) # remove xhtml tags
        line = re.sub(r"\[http:[^] ]*", r"[]", line) # remove normal url, preserve visible text
        line = re.sub(r"\|thumb", "", line) # remove images links, preserve caption
        line = re.sub(r"\|left", "", line)
        line = re.sub(r"\|right", "", line)
        line = re.sub(r"\|\d+px", "", line)
        line = re.sub(r"\[\[image:[^\[\]]*\|", "", line)
        line = re.sub(r"\[\[category:([^|\]]*)[^]]*\]\]", r"[[\1]]", line) # show categories without markup (\1 is Python's form of Perl's $1)
        line = re.sub(r"\[\[[a-z\-]*:[^\]]*\]\]", "", line) # remove links to other languages
        line = re.sub(r"\[\[[^\|\]]*\|", "[[", line) # remove wiki url, preserve visible text
        line = re.sub(r"\{\{[^\}]*\}\}", "", line) # remove {{icons}} and {tables}
        line = re.sub(r"\{[^\}]*\}", "", line)
        line = re.sub(r"\[", "", line) # remove [ and ]
        line = re.sub(r"\]", "", line)
        line = re.sub(r"&[^;]*;", "", line) # remove URL encoded chars
        # convert to lowercase letters and spaces, spell digits
        line = " "+line+" "
        line = re.sub(r"0", " zero ", line)
        line = re.sub(r"1", " one ", line)
        line = re.sub(r"2", " two ", line)
        line = re.sub(r"3", " three ", line)
        line = re.sub(r"4", " four ", line)
        line = re.sub(r"5", " five ", line)
        line = re.sub(r"6", " six ", line)
        line = re.sub(r"7", " seven ", line)
        line = re.sub(r"8", " eight ", line)
        line = re.sub(r"9", " nine ", line)
        line = re.sub(r"[^\w]+", " ", line)
        line = re.sub(r"[ ]+", " ", line)
        line = line.strip()
        if len(line) > 0 :
            cleaned_enwik8.append(line)

print(cleaned_enwik8[:5])

['notes', 'view of abu dhabi', 'for other uses see achilles disambiguation', 'for other uses of the name abraham lincoln see abraham lincoln disambiguation', 'infobox_philosopher']


The short script scrubs the data clean and leaves the text behind.
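
One detail worth checking when porting Perl substitutions: Perl writes a captured group in the replacement as `$1`, while Python's `re.sub` expects `\1` (or `\g<1>`) and treats `$1` as literal text. The sketch below runs three of the link-handling steps on a made-up line. The `sample` string is an assumption for illustration and is not taken from enwik8.

```python
import re

# Hypothetical wiki-markup line, invented for this example
sample = "see [[category:roman history|history]] and [[rome|the city]]"

category = r"\[\[category:([^|\]]*)[^]]*\]\]"

# Perl-style $1 is not a backreference in Python; it is inserted literally
perl_style = re.sub(category, "[[$1]]", sample)
print(perl_style)    # see [[$1]] and [[rome|the city]]

# \1 refers to the first capture group
line = re.sub(category, r"[[\1]]", sample)
# keep only the visible text of piped links
line = re.sub(r"\[\[[^\|\]]*\|", "[[", line)
# drop the remaining square brackets
line = re.sub(r"[\[\]]", "", line)
print(line)          # see roman history and the city
```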

## Natural Language ToolKit

The `nltk` Python package has lots of tools to help you work with text. The following functions may all appear to be magic, but they're mostly based on statistical models.

You can find tokenizers and part-of-speech taggers for languages other than English.

In [6]:
import nltk

# You will likely have to download nltk packages to use them
#nltk.download()

### Tokenization

You can split text into tokens (words) using the punkt tokenizer model (`punkt`). Tokenization is extremely useful for natural language modelling.

Notice that the punctuation is properly separated from the words.

In [7]:
paragram = "The brown dog jumped over the lazy cheese; repeatedly. Without the cheese, there is boredom: the dog?"

print(nltk.word_tokenize(paragram))

['The', 'brown', 'dog', 'jumped', 'over', 'the', 'lazy', 'cheese', ';', 'repeatedly', '.', 'Without', 'the', 'cheese', ',', 'there', 'is', 'boredom', ':', 'the', 'dog', '?']
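
To get a feel for what a word tokenizer does, here is a rough regex approximation. It is only a sketch and not how `nltk` works internally: `nltk.word_tokenize` handles contractions, abbreviations, and many other cases that this one-line pattern ignores. The name `simple_tokenize` is made up for this example.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one character that is neither word nor space
    return re.findall(r"\w+|[^\w\s]", text)

sample = "The brown dog jumped over the lazy cheese; repeatedly."
tokens = simple_tokenize(sample)
print(tokens)
# ['The', 'brown', 'dog', 'jumped', 'over', 'the', 'lazy', 'cheese', ';', 'repeatedly', '.']

# Plain str.split() leaves punctuation glued to the words
print(sample.split()[-2:])
# ['cheese;', 'repeatedly.']
```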


### Part of speech tagging

The natural language toolkit can also do "part-of-speech tagging" (`averaged_perceptron_tagger` + `treebank`). It labels each token with its word class, such as noun, verb, adjective, or determiner.

In [8]:
tokenized = nltk.word_tokenize(paragram)
nltk.pos_tag(tokenized)

[('The', 'DT'),
 ('brown', 'JJ'),
 ('dog', 'NN'),
 ('jumped', 'VBD'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('cheese', 'NN'),
 (';', ':'),
 ('repeatedly', 'RB'),
 ('.', '.'),
 ('Without', 'IN'),
 ('the', 'DT'),
 ('cheese', 'NN'),
 (',', ','),
 ('there', 'EX'),
 ('is', 'VBZ'),
 ('boredom', 'NN'),
 (':', ':'),
 ('the', 'DT'),
 ('dog', 'NN'),
 ('?', '.')]
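
The tags come from the Penn Treebank tag set: `NN` is a singular noun, `VBD` a past-tense verb, `JJ` an adjective, `DT` a determiner, and so on. Once text is tagged, ordinary Python is enough to filter or count by tag. A small sketch, using a few (word, tag) pairs copied from the output above rather than calling the tagger again:

```python
from collections import Counter

# (word, tag) pairs copied from the pos_tag output above
tagged = [("The", "DT"), ("brown", "JJ"), ("dog", "NN"), ("jumped", "VBD"),
          ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("cheese", "NN")]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)  # ['dog', 'cheese']

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(3))
```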

### Sentence tokenization

Sentence tokenization is helpful when you want to feed sentences to your model but your raw data is in paragraphs. This uses the `punkt` tokenizer from before.

In [9]:
with open("./Principio.txt", "r") as f:
    principio = " ".join(f.readlines()).replace("\n", "")

print("Raw text")
print(principio)
print()
print("Tokenized sentences")
print(nltk.sent_tokenize(principio))

Raw text
Urbem Romam a principio reges habuere; libertatem et consulatum L. Brutus instituit. Dictaturae ad tempus sumebantur; neque decemviralis potestas ultra biennium, neque tribunorum militum consulare ius diu valuit. Non Cinnae, non Sullae longa dominatio; et Pompei Crassique potentia cito in Caesarem, Lepidi atque Antonii arma in Augustum cessere, qui cuncta discordiis civilibus fessa nomine principis sub imperium accepit.

Tokenized sentences
['Urbem Romam a principio reges habuere; libertatem et consulatum L. Brutus instituit.', 'Dictaturae ad tempus sumebantur; neque decemviralis potestas ultra biennium, neque tribunorum militum consulare ius diu valuit.', 'Non Cinnae, non Sullae longa dominatio; et Pompei Crassique potentia cito in Caesarem, Lepidi atque Antonii arma in Augustum cessere, qui cuncta discordiis civilibus fessa nomine principis sub imperium accepit.']
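
Splitting on punctuation alone is not enough, which is why a trained model helps. As a rough comparison (a naive baseline written for this example, not something `nltk` does), a regex that splits after every `.`, `?`, or `!` breaks the first sentence at the abbreviation `L.`. The text is a shortened excerpt of the passage above.

```python
import re

# Shortened excerpt: the second sentence is cut off early for brevity
text = ("Urbem Romam a principio reges habuere; libertatem et consulatum "
        "L. Brutus instituit. Dictaturae ad tempus sumebantur.")

# Naive rule: a sentence ends at ., ? or ! followed by whitespace
naive_sentences = re.split(r"(?<=[.?!])\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))

# The first piece ends at 'L.', so the naive rule yields 3 pieces instead of 2
print(len(naive_sentences))  # 3
```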


### Stemming

[Stemming](https://en.wikipedia.org/wiki/Stemming) algorithms crop words to their roots. They're a way of reducing your vocabulary size.

In [10]:
# Snowball (sometimes called Porter2) is a refinement of the original Porter stemmer
snowball_stemmer = nltk.stem.snowball.SnowballStemmer("english")

paragram = "The quick foxes quickly jumped over the laziest dog. The dogs' owners are saddened."

for word in nltk.word_tokenize(paragram):
    stemmed = snowball_stemmer.stem(word)
    print(f"Original: {word:<10} Stemmed: {stemmed:<10} Changed: {'Y'*(word!=stemmed)}")

Original: The        Stemmed: the        Changed: Y
Original: quick      Stemmed: quick      Changed: 
Original: foxes      Stemmed: fox        Changed: Y
Original: quickly    Stemmed: quick      Changed: Y
Original: jumped     Stemmed: jump       Changed: Y
Original: over       Stemmed: over       Changed: 
Original: the        Stemmed: the        Changed: 
Original: laziest    Stemmed: laziest    Changed: 
Original: dog        Stemmed: dog        Changed: 
Original: .          Stemmed: .          Changed: 
Original: The        Stemmed: the        Changed: Y
Original: dogs       Stemmed: dog        Changed: Y
Original: '          Stemmed: '          Changed: 
Original: owners     Stemmed: owner      Changed: Y
Original: are        Stemmed: are        Changed: 
Original: saddened   Stemmed: sadden     Changed: Y
Original: .          Stemmed: .          Changed: 
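
To make the vocabulary-size point concrete, here is a deliberately crude suffix stripper. It is a toy written for this guide, far simpler than the Snowball algorithm used above, and the suffix list and minimum stem length are arbitrary choices.

```python
def toy_stem(word):
    """Lowercase a word and strip one common English suffix, if any."""
    word = word.lower()
    for suffix in ("es", "ed", "ly", "s"):
        # keep at least three characters so short words like "is" survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["fox", "foxes", "jump", "jumped", "jumps", "quick", "quickly"]
stems = [toy_stem(w) for w in words]

print(len(set(words)))      # 7 distinct surface forms
print(sorted(set(stems)))   # ['fox', 'jump', 'quick']
```

A rule list this short over-stems freely (it turns `this` into `thi`), which is one reason to prefer a tested stemmer in practice.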


### Lemmatization

The `wordnet` [lemmatizer](https://en.wikipedia.org/wiki/Lemmatisation) will group inflections together, which serves to reduce your vocabulary size. An inflection is a modification of a word to express grammatical information. Examples include plural nouns ending in `s` and verbs that change form depending on whether one person or many people perform the action.

The `lemmatize()` method takes a `pos` argument, but I haven't been able to find it well explained online. As a basic step, I suggest using POS tagging to distinguish nouns from verbs. Below you can see that this helps lemmatize the verb `learned` correctly.

In [11]:
wordnet_lemmatizer = nltk.stem.WordNetLemmatizer()

paragram = "I learned Latin. The learned learn Latin. Therefore I am learned."

for word, pos in nltk.pos_tag(nltk.word_tokenize(paragram)):
    if pos[0] == "V": pos_arg = "v"
    else: pos_arg = "n"
    lemmatized = wordnet_lemmatizer.lemmatize(word, pos=pos_arg)
    print(f"Original: {word:<10} Lemmatized: {lemmatized:<10} Changed: {'Y'*(word!=lemmatized):<5} POS: {pos}")

Original: I          Lemmatized: I          Changed:       POS: PRP
Original: learned    Lemmatized: learn      Changed: Y     POS: VBD
Original: Latin      Lemmatized: Latin      Changed:       POS: NNP
Original: .          Lemmatized: .          Changed:       POS: .
Original: The        Lemmatized: The        Changed:       POS: DT
Original: learned    Lemmatized: learned    Changed:       POS: JJ
Original: learn      Lemmatized: learn      Changed:       POS: NN
Original: Latin      Lemmatized: Latin      Changed:       POS: NNP
Original: .          Lemmatized: .          Changed:       POS: .
Original: Therefore  Lemmatized: Therefore  Changed:       POS: NNP
Original: I          Lemmatized: I          Changed:       POS: PRP
Original: am         Lemmatized: be         Changed: Y     POS: VBP
Original: learned    Lemmatized: learn      Changed: Y     POS: VBN
Original: .          Lemmatized: .          Changed:       POS: .
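
For reference, WordNet's `pos` values are single letters: `n` (noun), `v` (verb), `a` (adjective), and `r` (adverb). One way to extend the noun/verb split above is a small helper that maps Penn Treebank tags onto those letters. This is a sketch: the name `penn_to_wordnet` is made up here, and falling back to `n` for everything else is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet pos letter."""
    if tag.startswith("V"):
        return "v"   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("JJ"):
        return "a"   # adjectives: JJ, JJR, JJS
    if tag.startswith("RB"):
        return "r"   # adverbs: RB, RBR, RBS
    return "n"       # nouns and everything else

for tag in ["VBD", "JJ", "RB", "NNP", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

In the loop above, this would be used as `wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(pos))`.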


## spaCy library

The `spacy` library can also work with natural text. I find it has a more modern feel than `nltk`. I recommend visiting their [website](https://spacy.io/usage/) since it has a lot of examples.

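You have to download and install models before you can use them. The `en` shortcut used here comes from spaCy 2.x; spaCy 3 removed shortcut names, so newer installs need the full package name, such as `en_core_web_sm`.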

```
python -m spacy download en
```

After you've installed a model, you can then load it in spaCy.

In [12]:
import spacy

nlp = spacy.load('en')

### Part-of-speech, lemmas, etc

This [code snippet](https://spacy.io/usage/linguistic-features#section-pos-tagging) from the website shows how to pull several annotations out of your text at once: lemma, part of speech, dependency label, and more.

Notice that spaCy is making some errors with its POS tagging. It always interprets `learned` as a verb.

In [13]:
doc = nlp(u"I learned Latin. The learned learn Latin. Therefore I am learned.")

print("{:<10} {:<10} {:<10} {:<10} {:<10} {:<10} {:<10} {:<10}".format(
    "text", "lemma", "pos", "tag", "dep", "shape", "is_alpha", "is_stop"
     ))
print(80*"-")

for token in doc:
    print("{:<10} {:<10} {:<10} {:<10} {:<10} {:<10} {:<10} {:<10}".format(
        token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop
         ))

text       lemma      pos        tag        dep        shape      is_alpha   is_stop   
--------------------------------------------------------------------------------
I          -PRON-     PRON       PRP        nsubj      X          1          0         
learned    learn      VERB       VBD        ROOT       xxxx       1          0         
Latin      latin      PROPN      NNP        dobj       Xxxxx      1          0         
.          .          PUNCT      .          punct      .          0          0         
The        the        DET        DT         det        Xxx        1          0         
learned    learn      VERB       VBN        nsubj      xxxx       1          0         
learn      learn      VERB       VBP        ROOT       xxxx       1          0         
Latin      latin      PROPN      NNP        dobj       Xxxxx      1          0         
.          .          PUNCT      .          punct      .          0          0         
Therefore  therefore  ADV        RB    

spaCy can also draw visualizations for you. The `dep` style renders the dependency parse.

In [14]:
doc = nlp(u"The brown dog jumped over the lazy cheese.")

spacy.displacy.render(doc, style="dep", jupyter=True, options={"distance" : 120})

### Getting a bigger better model

I'm going to switch to the medium model `en_core_web_md`, which is 120MB. This will make the following examples work better.

Installing these larger models sometimes fails with a timeout. You can use the [`pip install` instructions](https://spacy.io/usage/models#download-pip) in the guide with the `--timeout=10000` option.

In [15]:
nlp = spacy.load('en_core_web_md')

### Named entity recognition

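spaCy can also do named entity recognition. The labels come from a statistical model, so they are not 100% accurate. The model seems to be trained on current events, and it mislabels several of the Roman names below (Caesar as `PRODUCT`, Augustus as `DATE`).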

In [16]:
doc = nlp("""The despotisms of Cinna and Sulla were brief; """
          """the rule of Pompeius and of Crassus soon yielded before Caesar; """
          """the arms of Lepidus and Antonius before Augustus; """
          """who, when the world was wearied by civil strife, subjected it to empire under the title of Princeps.""")
    
print("{:<20} {:<10} {:<10} {:<10}".format(
    "text", "start_char", "end_char", "label"
     ))
print(50*"-")

for ent in doc.ents:
    print("{:<20} {:<10} {:<10} {:<10}".format(
        ent.text, ent.start_char, ent.end_char, ent.label_
         ))

text                 start_char end_char   label     
--------------------------------------------------
Cinna                18         23         PERSON    
Sulla                28         33         PERSON    
Pompeius             58         66         ORG       
Crassus              74         81         ORG       
Caesar               102        108        PRODUCT   
Lepidus              122        129        WORK_OF_ART
Antonius             134        142        PERSON    
Augustus             150        158        DATE      
Princeps             251        259        ORG       
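
`start_char` and `end_char` are character offsets into the original string, so slicing the text with them gives back the entity. A quick check in plain Python, using offsets and labels copied from the table above:

```python
text = ("The despotisms of Cinna and Sulla were brief; "
        "the rule of Pompeius and of Crassus soon yielded before Caesar; ")

# (start_char, end_char, label) copied from the spaCy output above
spans = [(18, 23, "PERSON"), (28, 33, "PERSON"), (58, 66, "ORG")]

for start, end, label in spans:
    # end_char is exclusive, like a normal Python slice
    print(f"{text[start:end]:<10} {label}")

# Keep only the spans the model labelled as people
people = [text[start:end] for start, end, label in spans if label == "PERSON"]
print(people)  # ['Cinna', 'Sulla']
```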


Again, spaCy can draw these nicely.

In [17]:
doc = nlp("""The despotisms of Cinna and Sulla were brief; """
          """the rule of Pompeius and of Crassus soon yielded before Caesar; """
          """the arms of Lepidus and Antonius before Augustus; """
          """who, when the world was wearied by civil strife, subjected it to empire under the title of Princeps.""")

spacy.displacy.render(doc, style="ent", jupyter=True)

### Word vectors

The spaCy library also comes with pretrained word embeddings. They recommend using a larger model than the default `en` (the default is "sm" for small), so the `md` model we got above is suitable.

You can then check the similarity of tokens. Ham and bacon are similar to one another, and cars and trucks are similar to one another.

The examples I show below can also be found in the [spaCy vector examples](https://spacy.io/usage/vectors-similarity).

In [18]:
tokens = nlp(u'ham bacon cars trucks')

for token1 in tokens:
    print(f"{token1}\n-----")
    for token2 in tokens:
        print(f"{token1.text:<10} {token2.text:<10} Similarity: {token1.similarity(token2):5.2f}")
    print(f"{token1}\n-----")

ham
-----
ham        ham        Similarity:  1.00
ham        bacon      Similarity:  0.74
ham        cars       Similarity:  0.10
ham        trucks     Similarity:  0.12
ham
-----
bacon
-----
bacon      ham        Similarity:  0.74
bacon      bacon      Similarity:  1.00
bacon      cars       Similarity:  0.11
bacon      trucks     Similarity:  0.14
bacon
-----
cars
-----
cars       ham        Similarity:  0.10
cars       bacon      Similarity:  0.11
cars       cars       Similarity:  1.00
cars       trucks     Similarity:  0.72
cars
-----
trucks
-----
trucks     ham        Similarity:  0.12
trucks     bacon      Similarity:  0.14
trucks     cars       Similarity:  0.72
trucks     trucks     Similarity:  1.00
trucks
-----
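
By default, `similarity` is the cosine similarity of the two vectors: the dot product divided by the product of their lengths. A minimal sketch with made-up 3-dimensional vectors (the real vectors here have 300 dimensions, and these numbers are invented purely for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for this example
ham = [1.0, 0.9, 0.1]
bacon = [0.9, 1.0, 0.0]
cars = [0.1, 0.0, 1.0]

print(f"ham/bacon: {cosine_similarity(ham, bacon):.2f}")
print(f"ham/cars:  {cosine_similarity(ham, cars):.2f}")
```

Vectors that point in nearly the same direction score close to 1, and a vector compared with itself scores exactly 1, which matches the diagonal in the output above.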


You can look up vectors, and if your word isn't in the vocabulary you'll get an all-zero vector (`has_vector` is false and `vector_norm` is 0).

In [19]:
tokens = nlp(u'ham bus hambus')

print("{:<10} {:<10} {:>15} {:<10}".format(
    "token", "has_vector", "vector_norm", "is_oov"
     ))
print(45*"-")

for token in tokens:
    print("{:<10} {:<10} {:15.2f} {:<10}".format(
        token.text, token.has_vector, token.vector_norm, token.is_oov
         ))

token      has_vector     vector_norm is_oov    
---------------------------------------------
ham        1                     7.30 0         
bus        1                     7.10 0         
hambus     0                     0.00 1         


You can access the vector value with the `.vector` property. You get a `numpy` array.

In [20]:
print(nlp(u'ham').vector.shape)

(300,)


If you want to use these vectors in a model, you can retrieve them all from a sentence and then average them.

In [21]:
import numpy as np

tokens = nlp(u'The brown dog jumped over the lazy cheese.')

print(np.mean([token.vector for token in tokens], axis=0).shape)

(300,)
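
spaCy's `doc.vector` already defaults to the average of the token vectors, so the explicit mean is mainly useful when you want to customise it. One common tweak is to leave out tokens that have no vector, since their all-zero vectors drag the average toward zero. The sketch below shows the idea with toy 3-dimensional lists in plain Python; the `average_vectors` helper and the numbers are made up for illustration.

```python
def average_vectors(vectors):
    """Element-wise mean of the non-zero vectors; None if there are none."""
    # Skip all-zero vectors, which is what a token without a vector looks like
    usable = [v for v in vectors if any(x != 0.0 for x in v)]
    if not usable:
        return None
    return [sum(column) / len(usable) for column in zip(*usable)]

# Toy vectors standing in for 'ham', 'bus' and the out-of-vocabulary 'hambus'
vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [0.0, 0.0, 0.0]]

print(average_vectors(vectors))  # [2.0, 3.0, 4.0]
```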


I hope this was useful. Please report any issues!