# Natural Language in Python

In [19]:
import nltk

import gensim.models.word2vec as word2vec
import gensim.models.fasttext as fasttext

In [2]:
# The tokenizer and tagger used below need NLTK data packages.
# Uncomment the next line once to open the NLTK downloader and fetch them.
#nltk.download()

## Natural Language ToolKit

The `nltk` Python package provides many tools for working with text. The functions below may look like magic, but most of them rely on statistical models.

You can also find tokenizers and part-of-speech taggers for other languages.

In [3]:
paragram = "The quick brown fox jumps over the lazy dog"

You can split text into tokens (words) with `nltk.word_tokenize`, which requires the punkt tokenizer data.

In [4]:
nltk.word_tokenize(paragram)

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

Part-of-speech tagging with `nltk.pos_tag` requires the averaged perceptron tagger data.

In [5]:
tokenized = nltk.word_tokenize(paragram)
nltk.pos_tag(tokenized)

[('The', 'DT'),
 ('quick', 'JJ'),
 ('brown', 'NN'),
 ('fox', 'NN'),
 ('jumps', 'VBZ'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('dog', 'NN')]
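The tagged result is an ordinary list of `(token, tag)` tuples, so it can be filtered with plain Python. The sketch below copies the output shown above into a variable named `tagged` (a name chosen here for illustration) and keeps the tokens whose Penn Treebank tag starts with `NN`, the noun tags.

```python
# Tagged output copied from the cell above; each item is a (token, tag) tuple.
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN")]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [token for token, tag in tagged if tag.startswith("NN")]
print(nouns)
```

Note that "brown" ends up among the nouns because the tagger labelled it `NN`, although it reads as an adjective in this sentence. The tagger is a statistical model, so occasional mislabels like this are to be expected.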

In [6]:
with open("./Principio.txt", "r") as f:
    principio = " ".join(f.readlines()).replace("\n", "")

print(principio)

Urbem Romam a principio reges habuere; libertatem et consulatum L. Brutus instituit. Dictaturae ad tempus sumebantur; neque decemviralis potestas ultra biennium, neque tribunorum militum consulare ius diu valuit. Non Cinnae, non Sullae longa dominatio; et Pompei Crassique potentia cito in Caesarem, Lepidi atque Antonii arma in Augustum cessere, qui cuncta discordiis civilibus fessa nomine principis sub imperium accepit.


NLTK can also split text into sentences, which is useful when you want to build a corpus of sentences.

In [7]:
nltk.sent_tokenize(principio)

['Urbem Romam a principio reges habuere; libertatem et consulatum L. Brutus instituit.',
 'Dictaturae ad tempus sumebantur; neque decemviralis potestas ultra biennium, neque tribunorum militum consulare ius diu valuit.',
 'Non Cinnae, non Sullae longa dominatio; et Pompei Crassique potentia cito in Caesarem, Lepidi atque Antonii arma in Augustum cessere, qui cuncta discordiis civilibus fessa nomine principis sub imperium accepit.']
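Notice that the output above keeps "L. Brutus" inside a single sentence even though the abbreviation ends in a period. As a rough contrast, the sketch below applies a naive rule (split after every period followed by whitespace) to the first two sentences of the same text. The rule and the variable names are illustrative only and are not part of NLTK.

```python
import re

# First two sentences of the passage printed above.
text = ("Urbem Romam a principio reges habuere; libertatem et consulatum "
        "L. Brutus instituit. Dictaturae ad tempus sumebantur; neque "
        "decemviralis potestas ultra biennium, neque tribunorum militum "
        "consulare ius diu valuit.")

# Naive rule: a sentence ends at every period that is followed by whitespace.
naive_sentences = re.split(r"(?<=\.)\s+", text)

for piece in naive_sentences:
    print(repr(piece))
```

The naive rule produces three pieces instead of two, because it cuts the first sentence after "L.". Handling abbreviations like this is one reason to prefer a trained sentence tokenizer over a hand-written pattern.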

## Gensim for word2vec



In [8]:
with open("./enwik8.txt", "r") as f:
    enwik8 = f.readlines()

sentences = nltk.sent_tokenize(" ".join(enwik8))

How many sentences do we have now?

In [9]:
print(len(sentences))

515688


In [10]:
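# Word2Vec and FastText expect each sentence as a list of tokens, not a raw string.
# (LineSentence reads sentences from a file path, so it does not apply to a list.)
# Lowercasing lets lowercase queries such as "king" match the vocabulary.
tokenized_sentences = [nltk.word_tokenize(sentence.lower()) for sentence in sentences]

In [11]:
# Written for gensim 3.x; gensim 4.x renamed `size` to `vector_size`.
word2vec_model = word2vec.Word2Vec(tokenized_sentences, size=100, window=5, min_count=5, workers=4)

In [21]:
# Written for gensim 3.x; in gensim 4.x use `fasttext_model.epochs` instead of `.iter`.
fasttext_model = fasttext.FastText()
fasttext_model.build_vocab(tokenized_sentences)
fasttext_model.train(tokenized_sentences, total_examples=fasttext_model.corpus_count, epochs=fasttext_model.iter)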

KeyboardInterrupt: 
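To give a feel for the `min_count` and `window` arguments passed to `Word2Vec` above, here is a minimal pure-Python sketch on a toy corpus. The helper names `build_vocab` and `context_pairs` and the corpus itself are invented for illustration. This is not gensim's implementation, which applies further steps such as subsampling frequent words.

```python
from collections import Counter

# Toy corpus in the shape Word2Vec expects: one list of tokens per sentence.
corpus = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["the", "quick", "dog"],
]

def build_vocab(sentences, min_count):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, n in counts.items() if n >= min_count}

def context_pairs(sentence, window):
    """Yield (center, context) pairs for words at most `window` positions apart."""
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, sentence[j]

vocab = build_vocab(corpus, min_count=2)
pairs = list(context_pairs(corpus[0], window=1))
print(sorted(vocab))
print(pairs)
```

With `min_count=2` only "the", "quick" and "dog" survive, and with `window=1` each word is paired only with its immediate neighbours. A larger window, such as the 5 used above, pairs each word with more distant neighbours as well.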

In [None]:
# n_similarity compares two lists of words; use similarity for a pair of single words.
word2vec_model.wv.similarity("bacon", "dog")

In [None]:
word2vec_model.wv.similarity("bacon", "steak")

In [None]:
word2vec_model.wv.similarity("pikachu", "bulbasaur")
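The similarity queries above compare word vectors by cosine similarity, the cosine of the angle between two vectors. The sketch below computes it by hand. The three-dimensional vectors in `toy_vectors` are hypothetical values made up for illustration; they do not come from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, invented for illustration only.
toy_vectors = {
    "bacon": [0.9, 0.1, 0.0],
    "steak": [0.8, 0.2, 0.1],
    "dog":   [0.1, 0.9, 0.3],
}

bacon_steak = cosine_similarity(toy_vectors["bacon"], toy_vectors["steak"])
bacon_dog = cosine_similarity(toy_vectors["bacon"], toy_vectors["dog"])
print(bacon_steak, bacon_dog)
```

Vectors that point in nearly the same direction score close to 1, while vectors at right angles score 0, so in this toy setup "bacon" is closer to "steak" than to "dog".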

In [None]:
# doesnt_match lives on the keyed vectors (.wv) and takes a single list of words.
word2vec_model.wv.doesnt_match(["bacon", "ham", "cheese"])

In [None]:
word2vec_model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
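The query above is the classic analogy "king - man + woman". As a rough sketch of the idea, the code below adds and subtracts hypothetical toy vectors and then picks the remaining word whose vector is closest to the result by cosine similarity. The vectors and the helper `analogy` are invented for illustration. gensim's `most_similar` works on normalised vectors and differs in detail.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical vectors; the dimensions loosely stand for (royalty, gender, food).
toy = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "bacon": [0.0,  0.0, 1.0],
}

def analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative), skipping the inputs."""
    dims = len(next(iter(toy.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, toy[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, toy[word])]
    candidates = [w for w in toy if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, toy[w]))

best = analogy(positive=["woman", "king"], negative=["man"])
print(best)
```

The input words are excluded from the candidates, as `most_similar` also does, since otherwise "king" itself would often be the nearest vector.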