In [33]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, PunktSentenceTokenizer
from nltk.corpus import stopwords, state_union
from nltk.stem import PorterStemmer

### Tokenizing words and Sentences

In [3]:
#nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Tokenizers

A tokenizer splits text into smaller units called tokens.

* Word Tokenizers - separates by word
* Sentence Tokenizers - separates by sentence

### Corpora

A corpus (plural: corpora) is a body of text, e.g. medical journals or presidential speeches.

### Lexicon
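A dictionary: words and their meanings. The same word can mean different things in different lexicons, for example investor speak vs. everyday English:

* Investor speak - "bull" is slang for someone who expects the market to rise.
* English - "bull" is an animal.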



In [18]:
example_text = "Liverpool will annihilate Chelsea. Easily a 4-0 victory for Liverpool."

#### Word Tokenization
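The naive approach is to split on spaces. word_tokenize goes further and also separates punctuation into its own tokens.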

#### Sentence Tokenization
You could split on punctuation, but abbreviations such as "Dr." would trip you up, and covering every case with regular expressions is painful.
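To illustrate why naive splitting falls short, here is a minimal sketch in plain Python. The example sentence is made up for illustration and is not part of the notebook's data.

```python
import re

text = "Dr. Smith paid $4.50 for coffee. He didn't tip."

# Naive word tokenization: split on whitespace.
# Punctuation stays attached to the neighbouring word ("coffee.", "tip.").
naive_words = text.split()

# Naive sentence tokenization: split on whitespace that follows a period.
# The abbreviation "Dr." is wrongly treated as the end of a sentence.
naive_sentences = re.split(r"(?<=\.)\s+", text)

print(naive_words)
print(naive_sentences)  # ['Dr.', 'Smith paid $4.50 for coffee.', "He didn't tip."]
```

A trained sentence tokenizer is meant to handle abbreviations like this without hand-written rules.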

#### Example

In [19]:
print(sent_tokenize(example_text))
print(word_tokenize(example_text))

['Liverpool will annihilate Chelsea.', 'Easily a 4-0 victory for Liverpool.']
['Liverpool', 'will', 'annihilate', 'Chelsea', '.', 'Easily', 'a', '4-0', 'victory', 'for', 'Liverpool', '.']


#### Stop Words

English has many filler words that appear very frequently, such as “and”, “the”, and “a”. When doing statistics on text, these words introduce a lot of noise because they appear far more often than other words. Stop words are usually identified by checking a hardcoded list of known stop words. There is no standard list that suits every application, so the words to ignore can vary with the task.
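As a sketch of how the list can depend on the application: for sentiment analysis you may want to keep negations, which general-purpose lists often treat as stop words. The small base list below is hypothetical and only for illustration.

```python
# Hypothetical base list; a real project would start from a library list.
base_stop_words = {"the", "a", "an", "and", "is", "was", "not", "no"}

# For sentiment analysis, negations carry meaning, so keep them.
negations = {"not", "no"}
sentiment_stop_words = base_stop_words - negations

tokens = "the service was not good".split()

filtered_default = [t for t in tokens if t not in base_stop_words]
filtered_sentiment = [t for t in tokens if t not in sentiment_stop_words]

print(filtered_default)    # ['service', 'good']
print(filtered_sentiment)  # ['service', 'not', 'good']
```

With the default list the sentence appears to say the opposite of what was meant, which is why the list should be reviewed per task.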

In [21]:
stop_words = set(stopwords.words("english"))
print(stop_words)

{"it's", 'if', 'needn', "shouldn't", 'hadn', 'no', "didn't", 'on', 'how', 'too', 'has', 're', 'couldn', 'd', 'an', 'by', "shan't", "should've", "you're", 'we', 'her', 'as', 'to', 'll', 'these', 'while', 'is', "haven't", "that'll", "aren't", 'that', 'did', 'weren', 'between', 'can', "hadn't", 'had', 'itself', 'but', 'being', 'the', 'are', 'over', 'same', "weren't", 'who', 'yourself', 'both', 'below', 'then', 'ours', 'with', 'after', 'theirs', 'do', 'down', 'will', 'should', 'ma', 'its', 'until', 'any', "mightn't", 'doing', 'because', 'most', 'again', 'not', 'having', 'them', 'am', "isn't", 'shouldn', 'for', 'more', 'were', 'you', "wouldn't", 'won', 'a', 'been', 'so', 'into', 'isn', 'where', 'don', 'those', 'this', 'about', 'your', 'than', 'and', 'it', 'which', 'above', 'what', 'nor', "hasn't", 'all', 'few', 'me', 'further', 'such', 'o', 'up', 'our', 'haven', "mustn't", "don't", 'their', 'here', 'whom', 'he', 'she', "you've", 'him', 'of', 'against', 'there', "couldn't", 'why', 'yourselve

In [24]:
# The stop word list is lowercase, so compare the lowercased token
filtered_sentence = [w for w in word_tokenize(example_text) if w.lower() not in stop_words]
filtered_sentence

['Liverpool',
 'annihilate',
 'Chelsea',
 '.',
 'Easily',
 '4-0',
 'victory',
 'Liverpool',
 '.']

#### Stemming

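A form of "normalization": take each word and reduce it to its stem.
For example, riding reduces to the **root** ride. Irregular forms such as ridden may not reduce the same way, because a stemmer only strips suffixes by rule.

We do this because a sentence can use different variations of a word while its meaning stays essentially unchanged:

I was taking a ride in the car.

I was riding in the car.

Both sentences mean the same thing, so treating "ride" and "riding" as different words only adds redundancy.

This example uses the PorterStemmer, an algorithm dating from circa 1979.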

In [31]:
ps = PorterStemmer()
example_words = ["destroy","destroyed","destroying","destroys"]
stemmed_words = [ps.stem(w) for w in example_words]
stemmed_words

['destroy', 'destroy', 'destroy', 'destroy']
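The two "ride" sentences above can be compared the same way. This is a minimal sketch: it splits on whitespace instead of using word_tokenize so that it needs no NLTK data downloads, and the variable names are illustrative.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

sentence_a = "I was taking a ride in the car"
sentence_b = "I was riding in the car"

# Plain whitespace split keeps the sketch free of NLTK data downloads.
stems_a = {ps.stem(w) for w in sentence_a.split()}
stems_b = {ps.stem(w) for w in sentence_b.split()}

# Stems shared by both sentences; "ride" and "riding" collapse to one stem.
print(sorted(stems_a & stems_b))
print(ps.stem("ride") == ps.stem("riding"))
```

Note that a stem is not guaranteed to be a dictionary word, so inspect the printed stems before relying on them.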

#### Part of Speech Tagging
Labelling every word with its part of speech (noun, verb, adjective, and so on).


PunktSentenceTokenizer is a sentence boundary detection algorithm that must be trained before use [1]. NLTK already ships a pre-trained Punkt model, which is what sent_tokenize uses.

Note that creating PunktSentenceTokenizer() without any arguments does not load that model; it starts from default, untrained parameters.

You can also provide your own training text when creating the tokenizer. Punkt uses an unsupervised algorithm, meaning you train it on regular, unlabelled text.

[1] https://stackoverflow.com/questions/35275001/use-of-punktsentencetokenizer-in-nltk
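A minimal sketch of training on your own text, assuming a tiny made-up training string. A sample this small is far too little to learn reliable abbreviations, so treat the resulting splits as illustrative only.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Tiny illustrative training text (made up); real use needs far more data.
demo_train_text = (
    "Mr. Brown met Mr. Green at noon. Mr. Green was late. "
    "They spoke with Mr. White. Then Mr. Brown left."
)

# Passing text to the constructor trains the tokenizer on it (unsupervised).
demo_tokenizer = PunktSentenceTokenizer(demo_train_text)

demo_sample_text = "Mr. Brown called again. Nobody answered."
demo_sentences = demo_tokenizer.tokenize(demo_sample_text)

for s in demo_sentences:
    print(s)
```

Constructing the class directly like this does not require the downloaded punkt model.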

In [66]:
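# Note: the same speech is used here for both training and tokenizing
train_text = state_union.raw("2006-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
# Train a Punkt sentence tokenizer on the training text
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
# Split the sample text into sentences
tokenized = custom_sent_tokenizer.tokenize(sample_text)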


In [83]:
# Word-tokenize each sentence, then tag every token with its part of speech
tagged = [nltk.pos_tag(nltk.word_tokenize(i)) for i in tokenized]


#### Chunking

Chunking answers the question: who or what is the sentence talking about? It groups a noun (often a named entity) with the words that modify or describe it.

Chunking is a process of extracting phrases from unstructured text. Simple tokens may not represent the actual meaning of the text, so it's advisable to treat a phrase such as “South Africa” as a single unit instead of the separate words “South” and “Africa”.

A common use is finding noun phrases: “United States of America” needs to stay together, and so does “President Bush”. Chunks keep such phrases together.
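As a sketch of keeping such phrases together, the example below chunks runs of proper nouns. The tags are written by hand (an assumption made so that no tagger model is needed), and the variable names are illustrative.

```python
import nltk

# Hand-tagged tokens (tags written by hand so no tagger model is needed).
hand_tagged = [
    ("President", "NNP"), ("Bush", "NNP"), ("visited", "VBD"),
    ("South", "NNP"), ("Africa", "NNP"),
]

# Chunk one or more consecutive proper nouns into a single NP.
np_parser = nltk.RegexpParser("NP: {<NNP>+}")
np_tree = np_parser.parse(hand_tagged)

# Collect the words of every NP subtree as one phrase.
phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in np_tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(phrases)  # ['President Bush', 'South Africa']
```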



https://rikenshah.github.io/articles/natural-language-

https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb


In [78]:
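sentence = "President Obama Barack White House barked at the cat"
# NP chunk: optional determiner, any number of adjectives, then a singular noun
grammar = r"""
    NP: {<DT>?<JJ>*<NN>}
"""
chunkParser = nltk.RegexpParser(grammar)
# Tag each word with its part of speech, then apply the chunk grammar
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = chunkParser.parse(tagged)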


In [74]:
tree.draw()

#### Chinking
A chink is a sequence of tokens that we wish to remove from a chunk. As with chunks, chinks are described with regular expressions over part-of-speech tags.
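A minimal sketch of chinking with nltk.RegexpParser: first chunk everything, then cut verbs and prepositions back out. The tags are hand-written for illustration, so no tagger model is needed.

```python
import nltk

hand_tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
    ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# {...} defines a chunk, }...{ defines a chink removed from that chunk.
chink_grammar = r"""
NP:
    {<.*>+}        # chunk every token
    }<VBD|IN>+{    # chink sequences of VBD and IN
"""
chink_tree = nltk.RegexpParser(chink_grammar).parse(hand_tagged_sentence)

# What remains chunked after the chink has been removed.
remaining_chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in chink_tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(remaining_chunks)  # ['the little dog', 'the cat']
```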

#### Named Entity Recognition

In [82]:
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
namedEnt = nltk.ne_chunk(tagged, binary=True)

namedEnt.draw()

#### Lemmatizing

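Similar to stemming, but the result (the lemma) is a real dictionary word, whereas a stem need not be one. The lemma can differ from the input much like a synonym would: for example, “better” as an adjective lemmatizes to “good”.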