# Natural Language in Python

In [1]:
import nltk

In [2]:
# You will likely need to download NLTK data packages before using it, e.g.
# nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
#nltk.download()

## Natural Language ToolKit

The `nltk` Python package has many tools for working with text. The functions below may look like magic, but most of them are based on statistical models.

NLTK also provides tokenizers and part-of-speech taggers for other languages.

In [3]:
paragram = "The quick brown fox jumps over the lazy dog"

You can split text into tokens (words) with `nltk.word_tokenize`, which relies on the punkt tokenizer.

In [4]:
nltk.word_tokenize(paragram)

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

The part-of-speech tagger, `nltk.pos_tag`, requires the averaged perceptron tagger package.

In [5]:
tokenized = nltk.word_tokenize(paragram)
nltk.pos_tag(tokenized)

[('The', 'DT'),
 ('quick', 'JJ'),
 ('brown', 'NN'),
 ('fox', 'NN'),
 ('jumps', 'VBZ'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('dog', 'NN')]
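
The tagger returns plain `(token, tag)` tuples, so ordinary Python is enough to work with them. The following is a minimal sketch that reuses the output above as a literal list; the names `tagged`, `tag_counts`, and `nouns` are illustrative and not part of NLTK.

```python
from collections import Counter

# Output of nltk.pos_tag, copied from the cell above
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN")]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)

# Keep the tokens whose tag starts with "NN" (the noun tags)
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(tag_counts.most_common(1))  # [('NN', 3)]
print(nouns)                      # ['brown', 'fox', 'dog']
```

Note that the tagger labelled "brown" as a noun here, a reminder that a statistical model can be wrong.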

In [6]:
with open("./Principio.txt", "r") as f:
    principio = " ".join(f.readlines()).replace("\n", "")

print(principio)

Urbem Romam a principio reges habuere; libertatem et consulatum L. Brutus instituit. Dictaturae ad tempus sumebantur; neque decemviralis potestas ultra biennium, neque tribunorum militum consulare ius diu valuit. Non Cinnae, non Sullae longa dominatio; et Pompei Crassique potentia cito in Caesarem, Lepidi atque Antonii arma in Augustum cessere, qui cuncta discordiis civilibus fessa nomine principis sub imperium accepit.


The `nltk` library can also split text into sentences. This is useful when you want to build a corpus of sentences.

In [7]:
nltk.sent_tokenize(principio)

['Urbem Romam a principio reges habuere; libertatem et consulatum L. Brutus instituit.',
 'Dictaturae ad tempus sumebantur; neque decemviralis potestas ultra biennium, neque tribunorum militum consulare ius diu valuit.',
 'Non Cinnae, non Sullae longa dominatio; et Pompei Crassique potentia cito in Caesarem, Lepidi atque Antonii arma in Augustum cessere, qui cuncta discordiis civilibus fessa nomine principis sub imperium accepit.']
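
Sentence tokenization is harder than splitting on periods, because abbreviations such as "L." also end in one. The sketch below applies a naive regex rule to the first two sentences of the passage to show the failure that the punkt output above avoids. The rule and the variable names are illustrative only.

```python
import re

text = ("Urbem Romam a principio reges habuere; libertatem et consulatum "
        "L. Brutus instituit. Dictaturae ad tempus sumebantur; neque "
        "decemviralis potestas ultra biennium, neque tribunorum militum "
        "consulare ius diu valuit.")

# Naive rule: a sentence ends at every period that is followed by whitespace
naive_sentences = re.split(r"(?<=\.)\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule produces three pieces, cutting the first sentence in two after "L.", whereas `nltk.sent_tokenize` kept "L. Brutus instituit." together.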

## Gensim for word2vec



In [13]:
import gensim.models.word2vec as word2vec

import re

In [27]:
with open("./enwik8.txt", "r") as f:
    enwik8 = f.read().splitlines()

In [28]:
enwik8[50:60]

['        <id>8029</id>',
 '      </contributor>',
 '      <minor />',
 '      <comment>adding cur_id=5: {{R from CamelCase}}</comment>',
 '      <text xml:space="preserve">#REDIRECT [[Algeria]]{{R from CamelCase}}</text>',
 '    </revision>',
 '  </page>',
 '  <page>',
 '    <title>AmericanSamoa</title>',
 '    <id>6</id>']

If you're doing word-based vectorization, splitting on whitespace alone wastes effort: "castle" (no adjacent period) and "castle." end up as two different vocabulary entries. Running `word_tokenize()` first fixes this, and it also saves memory by reducing the vocabulary size.
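
To make the "castle" problem concrete, here is a small sketch that compares whitespace splitting with a rough regex tokenizer. The regex is only a stand-in used to keep the example dependency-free; it is not how `nltk.word_tokenize` works internally, and the sample sentence is made up.

```python
import re
from collections import Counter

text = "We walked to the castle. The castle was closed."

# Splitting on whitespace leaves punctuation glued to the preceding word
naive_counts = Counter(text.lower().split())

# Rough tokenizer: runs of word characters, or single punctuation marks
token_counts = Counter(re.findall(r"\w+|[^\w\s]", text.lower()))

print(naive_counts["castle"], naive_counts["castle."])  # 1 1
print(token_counts["castle"])                           # 2
```

With whitespace splitting the two mentions of the castle are counted as different words; once punctuation is separated they are counted together.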

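The cleaning code below is a Python port of Matt Mahoney's `wikifil.pl` script, as distributed with fastText:

https://github.com/facebookresearch/fastText/blob/master/wikifil.pl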

I checked 

In [108]:
cleaned_enwik8 = []

# I've kept the comments in the code, but I've otherwise tweaked it to run in Python.
# Note: this loop only processes lines that contain the opening <text> tag,
# so only the first line of each article's text is kept.

# Program to filter Wikipedia XML dumps to "clean" text consisting only of lowercase
# letters (a-z, converted from A-Z), and spaces (never consecutive).  
# All other characters are converted to spaces.  Only text which normally appears 
# in the web browser is displayed.  Tables are removed.  Image captions are 
# preserved.  Links are converted to normal text.  Digits are spelled out.

# Written by Matt Mahoney, June 10, 2006.  This program is released to the public domain.
for line in enwik8:
    if "<text" in line.lower() and "#redirect" not in line.lower():
        line = line.lower()
        line = re.sub(r"<.*>", r"", line) # remove xml tags
        line = re.sub(r"&amp;", r"&", line) # decode URL encoded chars
        line = re.sub(r"&lt;", r"<", line)
        line = re.sub(r"&gt;", r">", line)
        line = re.sub(r"<ref[^<]*<\/ref>", r"", line) # remove references <ref...> ... </ref>
        line = re.sub(r"<[^>]*>", r"", line) # remove xhtml tags
        line = re.sub(r"\[http:[^] ]*", r"[]", line) # remove normal url, preserve visible text
        line = re.sub(r"\|thumb", "", line) # remove images links, preserve caption
        line = re.sub(r"\|left", "", line)
        line = re.sub(r"\|right", "", line)
        line = re.sub(r"\|\d+px", "", line)
        line = re.sub(r"\[\[image:[^\[\]]*\|", "", line)
        # Python writes the captured group as \1 (Perl's $1 would be inserted literally)
        line = re.sub(r"\[\[category:([^|\]]*)[^]]*\]\]", r"[[\1]]", line) # show categories without markup
        line = re.sub(r"\[\[[a-z\-]*:[^\]]*\]\]", "", line) # remove links to other languages
        line = re.sub(r"\[\[[^\|\]]*\|", "[[", line) # remove wiki url, preserve visible text
        line = re.sub(r"\{\{[^\}]*\}\}", "", line) # remove {{icons}} and {tables}
        line = re.sub(r"\{[^\}]*\}", "", line)
        line = re.sub(r"\[", "", line) # remove [ and ]
        line = re.sub(r"\]", "", line)
        line = re.sub(r"&[^;]*;", "", line) # remove URL encoded chars
        # convert to lowercase letters and spaces, spell digits
        line = " "+line+" "
        line = re.sub(r"0", " zero ", line)
        line = re.sub(r"1", " one ", line)
        line = re.sub(r"2", " two ", line)
        line = re.sub(r"3", " three ", line)
        line = re.sub(r"4", " four ", line)
        line = re.sub(r"5", " five ", line)
        line = re.sub(r"6", " six ", line)
        line = re.sub(r"7", " seven ", line)
        line = re.sub(r"8", " eight ", line)
        line = re.sub(r"9", " nine ", line)
        # \w keeps underscores and non-ASCII letters as well as a-z
        line = re.sub(r"[^\w]+", " ", line)
        line = re.sub(r"[ ]+", " ", line)
        line = line.strip()
        if len(line) > 0:
            cleaned_enwik8.append(line)

In [109]:
print(cleaned_enwik8[:20])

['notes', 'view of abu dhabi', 'for other uses see achilles disambiguation', 'for other uses of the name abraham lincoln see abraham lincoln disambiguation', 'infobox_philosopher', 'an american in paris is also a one nine five one film musical starring gene kelly', 'the academy awards popularly known as the oscars are the most prominent film awards in the united states and arguably the world the awards are granted by the academy of motion picture arts and sciences a professional honorary organization which as of two zero zero three had a voting membership of five eight one six actors with a membership of one three one one make up the largest voting bloc the votes have been tabulated and certified by auditing firm pricewaterhousecoopers since close to the awards inception', 'temps atomique international tai or international atomic time is a very accurate and stable time scale it is a weighted average of the time kept by about three zero zero atomic clocks including a large number of cae
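
One detail matters when porting the category rule from Perl: Perl writes a captured group in the replacement as `$1`, while Python's `re.sub` expects `\1` (or `\g<1>`). The following small check runs both on a made-up category link; `sample` is illustrative and not taken from the dump.

```python
import re

pattern = r"\[\[category:([^|\]]*)[^]]*\]\]"
sample = "[[category:film awards|academy]]"  # made-up input

# "$1" has no special meaning in a Python replacement string
perl_style = re.sub(pattern, "[[$1]]", sample)

# r"\1" expands to the text captured by the first group
python_style = re.sub(pattern, r"[[\1]]", sample)

print(perl_style)    # [[$1]]
print(python_style)  # [[film awards]]
```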

In [33]:
help(re.sub)

Help on function sub in module re:

sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used.
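
The callable form of `repl` described in the help text can replace a series of near-identical substitutions, such as spelling out digits one at a time. The sketch below is one way to do it; `DIGIT_NAMES` and `spell_digits` are hypothetical names chosen for this example.

```python
import re

DIGIT_NAMES = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def spell_digits(text):
    """Replace each ASCII digit with its English name, separated by single spaces."""
    # re.sub calls the lambda with the match object for every digit it finds
    spelled = re.sub(r"[0-9]", lambda m: f" {DIGIT_NAMES[m.group()]} ", text)
    # Collapse the extra spaces introduced by the padding
    return re.sub(r" +", " ", spelled).strip()

print(spell_digits("a 1951 film"))  # a one nine five one film
```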



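The cleaning step replaces every run of non-word characters with a space, so punctuation is no longer attached to words. For comparison, here are the first raw lines of the dump: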

In [34]:
print(enwik8[:10])

['<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">', '  <siteinfo>', '    <sitename>Wikipedia</sitename>', '    <base>http://en.wikipedia.org/wiki/Main_Page</base>', '    <generator>MediaWiki 1.6alpha</generator>', '    <case>first-letter</case>', '      <namespaces>', '      <namespace key="-2">Media</namespace>', '      <namespace key="-1">Special</namespace>', '      <namespace key="0" />']


In [98]:
blobbed_enwik8 = " ".join(cleaned_enwik8)

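How long is the joined text? Because `blobbed_enwik8` is a single string, `len()` counts characters, not sentences.

In [99]:
print(len(blobbed_enwik8))

1145801


In [100]:
print(blobbed_enwik8[:100])

notes view of abu dhabi for other uses see achilles disambiguation for other uses of the name abraha


In [101]:
# NOTE: LineSentence expects a file path or file object with one sentence per line,
# not the text itself. This object is not used below.
line_sentences = word2vec.LineSentence(blobbed_enwik8)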

In [102]:
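# NOTE: Word2Vec expects an iterable of tokenized sentences (lists of words).
# Iterating over a plain string yields single characters, so this call learns
# one vector per character rather than one per word.
word2vec_model = word2vec.Word2Vec(blobbed_enwik8, size=100, window=5, min_count=2, workers=4, sg=1, iter=10)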
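
gensim's `Word2Vec` treats its first argument as an iterable of sentences, each of which is itself an iterable of words. The sketch below uses only built-ins to show what that iteration yields for a joined string versus a list of token lists; `toy_lines` is made-up data standing in for `cleaned_enwik8`.

```python
toy_lines = ["the king spoke", "the queen listened"]  # made-up cleaned lines

# One big string: iterating over it yields single characters
blob = " ".join(toy_lines)
first_items_from_blob = list(blob)[:3]

# One token list per line: iterating yields sentences made of words
token_sentences = [line.split() for line in toy_lines]

print(first_items_from_blob)  # ['t', 'h', 'e']
print(token_sentences)        # [['the', 'king', 'spoke'], ['the', 'queen', 'listened']]
```

A list shaped like `token_sentences`, for example `[line.split() for line in cleaned_enwik8]`, is the form to pass as the first argument when word-level vectors are the goal.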

In [103]:
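# n_similarity takes two lists of words; given two plain strings it iterates over
# their characters and compares the averaged character vectors.
word2vec_model.wv.n_similarity("king", "queen")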

0.3256045857578719

In [104]:
word2vec_model.wv.n_similarity("prince", "princess")

0.7944913855594391

In [105]:
word2vec_model.wv.n_similarity("edible", "pikachu")

0.26912014338901086

In [106]:
word2vec_model.wv.n_similarity("pikachu", "raichu")

0.8298283016765463

In [107]:
word2vec_model.wv.n_similarity("pikachu", "bulbasaur")

0.5774525851874228
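
For reference, `n_similarity` averages the vectors in each of its two inputs and returns the cosine similarity of the two averages. Here is a dependency-free sketch of that computation with made-up 3-dimensional vectors; the helper names are hypothetical and not gensim's.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def n_similarity_sketch(vectors_a, vectors_b):
    """Cosine similarity between the means of two sets of vectors."""
    return cosine(mean_vector(vectors_a), mean_vector(vectors_b))

# Made-up vectors for illustration
set_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
set_b = [[1.0, 1.0, 0.0]]
set_c = [[0.0, 0.0, 1.0]]

same_direction = n_similarity_sketch(set_a, set_b)  # means point the same way
orthogonal = n_similarity_sketch(set_a, set_c)      # means are perpendicular
```

Because the inputs are averaged first, two strings that share many characters can score highly under a character-level model regardless of what the words mean.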

In [97]:
help(word2vec.LineSentence)

Help on class LineSentence in module gensim.models.word2vec:

class LineSentence(builtins.object)
 |  Simple format: one sentence = one line; words already preprocessed and separated by whitespace.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, source, max_sentence_length=10000, limit=None)
 |      `source` can be either a string or a file object. Clip the file to the first
 |      `limit` lines (or not clipped if limit is None, the default).
 |      
 |      Example::
 |      
 |          sentences = LineSentence('myfile.txt')
 |      
 |      Or for compressed files::
 |      
 |          sentences = LineSentence('compressed_text.txt.bz2')
 |          sentences = LineSentence('compressed_text.txt.gz')
 |  
 |  __iter__(self)
 |      Iterate through the lines in the source.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref
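
The "one sentence = one line" format described in the help text is easy to produce. The sketch below writes two lines to a temporary file and reads them back as token lists with plain Python. This approximates what iterating over a `LineSentence` yields, ignoring `max_sentence_length` and `limit`. The file name is hypothetical, and passing `path` to `word2vec.LineSentence` is the assumed real usage.

```python
import os
import tempfile

toy_lines = ["notes", "view of abu dhabi"]  # two cleaned lines from the output above

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned.txt")  # hypothetical file name

    # Write one sentence per line
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_lines) + "\n")

    # Read the file back: each line becomes a list of whitespace-separated tokens
    with open(path, encoding="utf-8") as f:
        token_sentences = [line.split() for line in f]

print(token_sentences)  # [['notes'], ['view', 'of', 'abu', 'dhabi']]
```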