# Lecture I - Basic Text Processing, Data sources and Corpora

This is my first notebook for the master's course Natural Language Processing. It does not follow the lecture exactly; it is more of a summary, with some additional material of my own.


## Short Introduction to NLTK

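NLTK is a Python library for text processing.

First, let's import NLTK and download the corpora and models used in this notebook. Note that `word_tokenize` also needs the Punkt tokenizer models (`punkt`, or `punkt_tab` in recent NLTK releases), which are assumed to be installed already.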

In [1]:
import nltk
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")
nltk.download("reuters")
nltk.download("gutenberg")
nltk.download("wordnet")
nltk.download("tagsets")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/tavchija/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/tavchija/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /home/tavchija/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package reuters to /home/tavchija/nltk_data...
[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/tavchija/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package wordnet to /home/tavchija/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package tagsets to /home/tavchija/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


True

Example of tokenization, part-of-speech tagging and named-entity chunking of a sentence.

In [2]:
from nltk import word_tokenize, pos_tag, ne_chunk

text = "John works at OBI."

# Morphology level: split the sentence into tokens
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Syntax level: assign a part-of-speech tag to every token
tagged_tokens = pos_tag(tokens)
print("POS tagging:", tagged_tokens)

# Semantics level: group tagged tokens into named-entity chunks
ner_tree = ne_chunk(tagged_tokens)
print("Light parsing:", ner_tree)

Tokens: ['John', 'works', 'at', 'OBI', '.']
POS tagging: [('John', 'NNP'), ('works', 'VBZ'), ('at', 'IN'), ('OBI', 'NNP'), ('.', '.')]
Light parsing: (S (PERSON John/NNP) works/VBZ at/IN (ORGANIZATION OBI/NNP) ./.)
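
To make the tokenization step less of a black box, here is a minimal regex-based tokenizer in plain Python. It is only a sketch: `simple_tokenize` and `TOKEN_PATTERN` are names made up for this note, and the rule set is far simpler than what `word_tokenize` actually does.

```python
import re

# Hypothetical pattern, not NLTK's: a word (optionally with internal
# hyphens or apostrophes) or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("John works at OBI."))
```

On the example sentence this gives the same five tokens as the cell above. It will differ on harder input: for instance, it keeps "don't" as one token, whereas Treebank-style tokenizers such as `word_tokenize` split contractions.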


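Example of the `Text` object. It wraps a sequence of tokens and offers quick exploration helpers: words that appear in similar contexts, contexts shared by several words, and concordances.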

In [3]:
from nltk import Text 
from nltk.corpus import reuters

text = Text(reuters.words())

print("Similar words to Monday:")
text.similar('Monday', 5)

print("Common contexts to a list of words August, June")
text.common_contexts(['August', 'June'])

print("Contexts of a word Monday")
text.concordance('Monday')

Similar words to Monday:
april march friday february january
Common contexts to a list of words August, June
and_and in_the in_and last_the on_the last_when between_and in_to
last_that for_shipment in_because ended_shr from_to since_and and_at
last_to in_last for_to in_u in_they
Contexts of a word Monday
Displaying 25 of 240 matches:
said . Trade Minister Saleh said on Monday that Indonesia , as the world ' s s
Reuters to clarify his statement on Monday in which he said the pact should be
 the 11 - member CPA which began on Monday . They said producers agreed that c
ief Burkhard Junger was arrested on Monday on suspicion of embezzlement and of
ween one and 1 . 25 billion dlrs on Monday and Tuesday . The spokesman said Mo
ay and Tuesday . The spokesman said Monday ' s float included 500 mln dlrs in 
s ranged from minus 500 mln dlrs on Monday , when cash letter errors at two ea
 a deficit on Thursday , Friday and Monday but held excess reserves on the fin
ed temporary reserves indirectly
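
A concordance is simply every occurrence of a word shown with its surrounding context. The following sketch reproduces the idea in plain Python on a tiny hand-made token list. The function name `concordance`, the `width` parameter and the sample sentence are assumptions made for illustration; NLTK's own implementation works on character widths and formats its output differently.

```python
def concordance(tokens, word, width=3):
    """Return (left context, hit, right context) for every match of `word`.

    Matching ignores case; `width` is the number of context tokens per side.
    """
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Made-up sample text, already split into tokens
tokens = "Saleh said on Monday that talks , which began on Monday , will continue".split()

for left, hit, right in concordance(tokens, "monday"):
    print(f"{left:>20} [{hit}] {right}")
```

Aligning the hits in one column, as the output above does, makes recurring patterns such as "on Monday" easy to spot.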

Example of working with n-grams (bigrams, trigrams) and extracting collocations.

In [3]:
from nltk.collocations import (
    BigramAssocMeasures,
    BigramCollocationFinder,
    TrigramAssocMeasures,
    TrigramCollocationFinder,
)
from nltk.corpus import reuters

bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()

# Bigrams: drop pairs seen fewer than 5 times, then rank the rest by PMI
finder = BigramCollocationFinder.from_words(reuters.words())
finder.apply_freq_filter(5)

print("Best 50 bigrams according to PMI:", finder.nbest(bigram_measures.pmi, 50))

# Trigrams: same procedure with word triples
finder = TrigramCollocationFinder.from_words(reuters.words())
finder.apply_freq_filter(5)

print("Best 50 trigrams according to PMI:", finder.nbest(trigram_measures.pmi, 50))

Best 50 bigrams according to PMI: [('DU', 'PONT'), ('Keng', 'Yaik'), ('Kwik', 'Save'), ('Nihon', 'Keizai'), ('corenes', 'pora'), ('fluidized', 'bed'), ('Akbar', 'Hashemi'), ('Constructions', 'Telephoniques'), ('Elevator', 'Mij'), ('Entre', 'Rios'), ('Graan', 'Elevator'), ('JIM', 'WALTER'), ('Taikoo', 'Shing'), ('der', 'Vorm'), ('di', 'Clemente'), ('Borrowing', 'Requirement'), ('FOOTE', 'MINERAL'), ('Hawker', 'Siddeley'), ('JARDINE', 'MATHESON'), ('PRORATION', 'FACTOR'), ('Wildlife', 'Refuge'), ('Kohlberg', 'Kravis'), ('Almir', 'Pazzionotto'), ('Bankhaus', 'Centrale'), ('Corpus', 'Christi'), ('Kuala', 'Lumpur'), ('Maple', 'Leaf'), ('Stats', 'Oljeselskap'), ('Zoete', 'Wedd'), ('Neutral', 'Zone'), ('Tadashi', 'Kuranari'), ('Drawing', 'Rights'), ('EASTMAN', 'KODAK'), ('Martinez', 'Cuenca'), ('Mathematical', 'Applications'), ('Townsend', 'Thoresen'), ('Sector', 'Borrowing'), ('Hashemi', 'Rafsanjani'), ('Hossein', 'Mousavi'), ('Kitty', 'Hawk'), ('SLAUGHTER', 'GUESSTIMATES'), ('Task', 'Force'
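
To see what the PMI ranking measures, here is a small hand-rolled version for bigrams. It is a sketch under assumptions: `bigram_pmi` and `min_freq` are hypothetical names, the toy sentence is made up, and probabilities are estimated by dividing counts by the number of tokens, which may not match NLTK's bookkeeping in every detail.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_freq=1):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ),
    with probabilities estimated as count / number of tokens (an assumption).
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scores = {}
    for (w1, w2), joint in pair_counts.items():
        if joint < min_freq:
            continue  # same idea as a frequency filter on the candidates
        scores[(w1, w2)] = math.log2(
            joint * total / (word_counts[w1] * word_counts[w2])
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up toy corpus
tokens = "the bank of new york and the bank of hong kong".split()

print(bigram_pmi(tokens)[:3])           # dominated by pairs seen only once
print(bigram_pmi(tokens, min_freq=2))   # only the repeated pairs remain
```

Without a filter, pairs of words that each occur once get the highest possible score, which says little about whether they are real collocations. That is the reason for dropping rare candidates before ranking.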

Converting between different data representations

In [5]:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tag import untag, str2tuple, tuple2str
from nltk.chunk import tree2conllstr, conllstr2tree, conlltags2tree, tree2conlltags
 
text = "John works at OBI."
 
tokens = word_tokenize(text)
print("Tokens: ", tokens)
 
tagged_tokens = pos_tag(tokens)
print("\nTagged tokens: ", tagged_tokens)
 
print("\nUntagged tokens", untag(tagged_tokens))
 
tagged_tokens = [tuple2str(t) for t in tagged_tokens] 
print("\nTagged tokens to strings:", tagged_tokens)
 
tagged_tokens = [str2tuple(t) for t in tagged_tokens]
print("\nTagged tokens from strings to tuples:",  tagged_tokens)
 
ner_tree = ne_chunk(tagged_tokens)
print("\nNER tree:", ner_tree)
 
iob_tagged = tree2conlltags(ner_tree)
print("\nIOB tagged tree:", iob_tagged)
 
ner_tree = conlltags2tree(iob_tagged)
print("\nBack to tree:", ner_tree)
 
tree_str = tree2conllstr(ner_tree)
print("\nTree as CoNLL string:\n", tree_str)
 
ner_tree = conllstr2tree(tree_str, chunk_types=('PERSON', 'ORGANIZATION'))
print("\nCoNLL string to tree:", ner_tree)
 

Tokens:  ['John', 'works', 'at', 'OBI', '.']

Tagged tokens:  [('John', 'NNP'), ('works', 'VBZ'), ('at', 'IN'), ('OBI', 'NNP'), ('.', '.')]

Untagged tokens ['John', 'works', 'at', 'OBI', '.']

Tagged tokens to strings: ['John/NNP', 'works/VBZ', 'at/IN', 'OBI/NNP', './.']

Tagged tokens from strings to tuples: [('John', 'NNP'), ('works', 'VBZ'), ('at', 'IN'), ('OBI', 'NNP'), ('.', '.')]

NER tree: (S (PERSON John/NNP) works/VBZ at/IN (ORGANIZATION OBI/NNP) ./.)

IOB tagged tree: [('John', 'NNP', 'B-PERSON'), ('works', 'VBZ', 'O'), ('at', 'IN', 'O'), ('OBI', 'NNP', 'B-ORGANIZATION'), ('.', '.', 'O')]

Back to tree: (S (PERSON John/NNP) works/VBZ at/IN (ORGANIZATION OBI/NNP) ./.)

Tree as CoNLL string:
 John NNP B-PERSON
works VBZ O
at IN O
OBI NNP B-ORGANIZATION
. . O

CoNLL string to tree: (S (PERSON John/NNP) works/VBZ at/IN (ORGANIZATION OBI/NNP) ./.)
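
The IOB format marks the first token of a chunk with `B-`, continuation tokens with `I-`, and everything outside a chunk with `O`. The sketch below shows, in plain Python, how such tags can be grouped back into entity spans. `iob_to_spans` is a hypothetical helper written for this note, not an NLTK function, and it returns flat tuples instead of a tree.

```python
def iob_to_spans(iob_tagged):
    """Group (word, pos, iob_tag) triples into (label, [words]) spans."""
    spans = []
    label, words = None, []
    for word, _pos, tag in iob_tagged:
        starts_chunk = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_chunk:
            # Close any open chunk, then open a new one
            if words:
                spans.append((label, words))
            label, words = tag[2:], [word]
        elif tag.startswith("I-"):
            # Continue the currently open chunk
            words.append(word)
        else:
            # "O": outside any chunk, so close the open chunk if there is one
            if words:
                spans.append((label, words))
            label, words = None, []
    if words:
        spans.append((label, words))
    return spans


# The IOB triples printed by the cell above
iob_tagged = [
    ("John", "NNP", "B-PERSON"),
    ("works", "VBZ", "O"),
    ("at", "IN", "O"),
    ("OBI", "NNP", "B-ORGANIZATION"),
    (".", ".", "O"),
]
print(iob_to_spans(iob_tagged))
```

Treating an `I-` tag with a new label as the start of a chunk is a design choice of this sketch that makes it tolerant of slightly malformed input.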


# Sentence splitting