# 1 Tokenizing Text and WordNet Basics

- Tokenizing text into sentences
- Tokenizing sentences into words
- Tokenizing sentences using regular expressions
- Training a sentence tokenizer
- Filtering stopwords in a tokenized sentence
- Looking up Synsets for a word in WordNet
- Looking up lemmas and synonyms in WordNet
- Calculating WordNet Synset similarity
- Discovering word collocations

## Tokenizing text into sentences

In [1]:
from nltk.tokenize import sent_tokenize

para = "Hello World. It's good to see you. Thanks for buying this book."

sent_tokenize(para)

['Hello World.', "It's good to see you.", 'Thanks for buying this book.']
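For comparison, here is a minimal sketch of a naive splitter built only on the standard `re` module. It is not how `sent_tokenize` works; the helper name `naive_sent_split` and the second sample sentence are made up for illustration. The naive rule handles the paragraph above but breaks on an abbreviation, which is the kind of case a trained sentence tokenizer is meant to cope with.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

para = "Hello World. It's good to see you. Thanks for buying this book."
print(naive_sent_split(para))
# ['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

# The naive rule also splits after the abbreviation "Dr."
print(naive_sent_split("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```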

In [3]:
import nltk.data
from nltk.tokenize import PunktSentenceTokenizer
ps = PunktSentenceTokenizer()

#tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')

ps.tokenize(para)

['Hello World.', "It's good to see you.", 'Thanks for buying this book.']


## Tokenizing sentences into words

In [4]:
from nltk.tokenize import word_tokenize

word_tokenize('Hello World.')


['Hello', 'World', '.']

In [5]:
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

tokenizer.tokenize('Hello World.')

['Hello', 'World', '.']

In [6]:
word_tokenize("can't")

['ca', "n't"]
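The split into `ca` and `n't` follows Treebank-style conventions, where the negation is treated as its own token. The sketch below is a very rough imitation using only `re`. The function name `split_contractions` and both patterns are assumptions made for illustration, not the rules NLTK applies.

```python
import re

def split_contractions(text):
    """Rough imitation of Treebank-style handling of n't (illustrative only)."""
    # Put a space before "n't" so it becomes its own token: "can't" -> "ca n't".
    spaced = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Then pick out "n't", runs of word characters, or single punctuation marks.
    return re.findall(r"n't|\w+|[^\w\s]", spaced)

print(split_contractions("can't"))
# ['ca', "n't"]
print(split_contractions("Can't is a contraction."))
# ['Ca', "n't", 'is', 'a', 'contraction', '.']
```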

### PunktWordTokenizer

In [12]:
# Note: PunktWordTokenizer no longer exists in newer NLTK versions (it was removed), so PunktSentenceTokenizer is used below.
#from nltk.tokenize import PunktWordTokenizer
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()

tokenizer.tokenize("Can't is a contraction.")

["Can't is a contraction."]

In [13]:
from nltk.tokenize import word_tokenize

word_tokenize("Can't is a contraction.")

['Ca', "n't", 'is', 'a', 'contraction', '.']

### WordPunctTokenizer

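This example is skipped here. Note that WordPunctTokenizer itself is still available in nltk.tokenize in current NLTK releases; the tokenizer that was removed is PunktWordTokenizer.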
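A sketch of punctuation-splitting tokenization using only the standard `re` module. The pattern below is, as far as I can tell, the one NLTK's `WordPunctTokenizer` is built on, but treat that as an assumption and check your installed version. The name `WORD_OR_PUNCT` is made up for this example.

```python
import re

# Alternation: a run of word characters, or a run of non-word, non-space characters.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

tokens = WORD_OR_PUNCT.findall("Can't is a contraction.")
print(tokens)
# ['Can', "'", 't', 'is', 'a', 'contraction', '.']
```

Unlike the Treebank-style result, the apostrophe becomes a token of its own, so the contraction is split into three pieces.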

## Tokenizing sentences using regular expressions

In [14]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")  # raw string, so \w is not treated as a string escape
tokenizer.tokenize("Can't is a contraction.")

["Can't", 'is', 'a', 'contraction']

There is also a simple helper function, regexp_tokenize, that you can use if you don't want to
instantiate the class, as shown in the following code:

In [15]:
from nltk.tokenize import regexp_tokenize

regexp_tokenize("Can't is a contraction.", r"[\w']+")  # the stray + inside the character class was only a literal plus

["Can't", 'is', 'a', 'contraction']


### Simple whitespace tokenizer

RegexpTokenizer can also work by matching the gaps instead of the tokens. Rather than
using re.findall(), the RegexpTokenizer class then uses re.split(). This is how the
BlanklineTokenizer class in nltk.tokenize is implemented.

In [16]:
tokenizer = RegexpTokenizer(r'\s+', gaps=True)
tokenizer.tokenize("Can't is a contraction.")

["Can't", 'is', 'a', 'contraction.']
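The token-versus-gap distinction can be reproduced with the standard `re` module alone. This is a minimal sketch of the idea, not NLTK's implementation; the variable names are made up for illustration.

```python
import re

text = "Can't is a contraction."

# Matching tokens: every run of word characters or apostrophes is a token.
tokens_by_match = re.findall(r"[\w']+", text)

# Matching gaps: split on runs of whitespace and keep what lies between them.
# Empty strings (from leading or trailing whitespace) are dropped.
tokens_by_gap = [t for t in re.split(r"\s+", text) if t]

print(tokens_by_match)
# ["Can't", 'is', 'a', 'contraction']
print(tokens_by_gap)
# ["Can't", 'is', 'a', 'contraction.']
```

Because the gap pattern only describes whitespace, the final period stays attached to the last word, just as in the output above.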

## Training a sentence tokenizer

In [19]:
from nltk.tokenize import PunktSentenceTokenizer

from nltk.corpus import webtext

text = webtext.raw('overheard.txt')

sent_tokenizer = PunktSentenceTokenizer()

sents_1 = sent_tokenizer.tokenize(text)

sents_1[0]

'White guy: So, do you have any plans for this evening?'

In [21]:
from nltk.tokenize import sent_tokenize
sents_2 = sent_tokenize(text)

sents_2[0]

'White guy: So, do you have any plans for this evening?'

In [22]:
sents_1[678]

'I only have a dollar...Can you spare some change?'

In [23]:
sents_2[678]

'Girl: But you already have a Big Mac...\nHobo: Oh, this is all theatrical.'

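The two tokenizers split the text differently. This difference is a good demonstration of why it can
be useful to train your own sentence tokenizer, especially when your text isn't in the typical
paragraph-sentence structure. Note that PunktSentenceTokenizer() above is created without any training
text; to actually train it on this corpus, pass the raw text to the constructor, as in
PunktSentenceTokenizer(text).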

## Filtering stopwords in a tokenized sentence

In [24]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

words = ["Can't", 'is', 'a', 'contraction']

[word for word in words if word not in stop_words]

["Can't", 'contraction']
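The NLTK stopword lists are lowercase, so the membership test is case-sensitive. The sketch below shows the effect with a tiny hand-picked set. `demo_stop_words` and the sample word list are hypothetical; in practice the set would come from `stopwords.words('english')`.

```python
# Hypothetical, hand-picked stopword set for illustration only.
demo_stop_words = {"is", "a", "the", "of"}

words = ["Can't", "Is", "a", "contraction"]

# Case-sensitive test: the capitalized "Is" slips through.
case_sensitive = [w for w in words if w not in demo_stop_words]

# Lowercasing each word before the lookup catches it.
case_insensitive = [w for w in words if w.lower() not in demo_stop_words]

print(case_sensitive)
# ["Can't", 'Is', 'contraction']
print(case_insensitive)
# ["Can't", 'contraction']
```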

In [25]:
stopwords.fileids()

['arabic',
 'azerbaijani',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

In [26]:
stopwords.words('dutch')

['de',
 'en',
 'van',
 'ik',
 'te',
 'dat',
 'die',
 'in',
 'een',
 'hij',
 'het',
 'niet',
 'zijn',
 'is',
 'was',
 'op',
 'aan',
 'met',
 'als',
 'voor',
 'had',
 'er',
 'maar',
 'om',
 'hem',
 'dan',
 'zou',
 'of',
 'wat',
 'mijn',
 'men',
 'dit',
 'zo',
 'door',
 'over',
 'ze',
 'zich',
 'bij',
 'ook',
 'tot',
 'je',
 'mij',
 'uit',
 'der',
 'daar',
 'haar',
 'naar',
 'heb',
 'hoe',
 'heeft',
 'hebben',
 'deze',
 'u',
 'want',
 'nog',
 'zal',
 'me',
 'zij',
 'nu',
 'ge',
 'geen',
 'omdat',
 'iets',
 'worden',
 'toch',
 'al',
 'waren',
 'veel',
 'meer',
 'doen',
 'toen',
 'moet',
 'ben',
 'zonder',
 'kan',
 'hun',
 'dus',
 'alles',
 'onder',
 'ja',
 'eens',
 'hier',
 'wie',
 'werd',
 'altijd',
 'doch',
 'wordt',
 'wezen',
 'kunnen',
 'ons',
 'zelf',
 'tegen',
 'na',
 'reeds',
 'wil',
 'kon',
 'niets',
 'uw',
 'iemand',
 'geweest',
 'andere']

## Looking up Synsets for a word in WordNet

In [31]:
from nltk.corpus import wordnet

syn = wordnet.synsets('cookbook')
syn

[Synset('cookbook.n.01')]

In [28]:
syn[0]

Synset('cookbook.n.01')

In [32]:
syn[0].definition()

'a book of recipes and cooking directions'

In [33]:
wordnet.synsets('cooking')[0].examples()

['cooking can be a great art',
 'people are needed who have experience in cookery',
 'he left the preparation of meals to his wife']

### Hypernyms

In [34]:
syn

[Synset('cookbook.n.01')]

In [36]:
syn[0].hypernyms()

[Synset('reference_book.n.01')]

In [37]:
syn[0].hypernyms()[0].hypernyms()

[Synset('book.n.01')]

In [38]:
syn[0].root_hypernyms()

[Synset('entity.n.01')]

In [39]:
syn[0].hypernym_paths()

[[Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('artifact.n.01'),
  Synset('creation.n.02'),
  Synset('product.n.02'),
  Synset('work.n.02'),
  Synset('publication.n.01'),
  Synset('book.n.01'),
  Synset('reference_book.n.01'),
  Synset('cookbook.n.01')]]

### Part of speech (POS)

In [40]:
syn[0].pos()

'n'

In [41]:
len(wordnet.synsets('great'))

7

In [42]:
len(wordnet.synsets('great', pos='n'))

1

In [43]:
len(wordnet.synsets('great', pos='a'))

6

## Looking up lemmas and synonyms in WordNet

In [46]:
from nltk.corpus import wordnet

syn = wordnet.synsets('cookbook')[0]

lemmas = syn.lemmas()

lemmas

[Lemma('cookbook.n.01.cookbook'), Lemma('cookbook.n.01.cookery_book')]

In [50]:
lemmas[0].name()

'cookbook'

In [51]:
lemmas[1].name()

'cookery_book'

In this way, a Synset represents a group
of lemmas that all have the same meaning, while a lemma represents a distinct word form.

In [52]:
lemmas[0].synset() == lemmas[1].synset()

True

### All possible synonyms

In [54]:
synonyms=[]

for syn in wordnet.synsets('book'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        
synonyms

['book',
 'book',
 'volume',
 'record',
 'record_book',
 'book',
 'script',
 'book',
 'playscript',
 'ledger',
 'leger',
 'account_book',
 'book_of_account',
 'book',
 'book',
 'book',
 'rule_book',
 'Koran',
 'Quran',
 "al-Qur'an",
 'Book',
 'Bible',
 'Christian_Bible',
 'Book',
 'Good_Book',
 'Holy_Scripture',
 'Holy_Writ',
 'Scripture',
 'Word_of_God',
 'Word',
 'book',
 'book',
 'book',
 'reserve',
 'hold',
 'book',
 'book',
 'book']
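The list contains many repeats because the same lemma name appears in several Synsets. A small sketch of removing duplicates, using only the first few names copied from the output above (the variable names are made up for illustration):

```python
# First eleven names copied from the output above.
sample = ['book', 'book', 'volume', 'record', 'record_book',
          'book', 'script', 'book', 'playscript', 'ledger', 'leger']

# A set drops duplicates but does not keep the original order.
unique_unordered = set(sample)

# dict.fromkeys keeps the first occurrence of each name, in order.
unique_ordered = list(dict.fromkeys(sample))

print(len(sample), len(unique_unordered))
# 11 8
print(unique_ordered)
# ['book', 'volume', 'record', 'record_book', 'script', 'playscript', 'ledger', 'leger']
```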

### Antonyms

In [55]:
gn2 = wordnet.synset('good.n.02')

gn2.definition()

'moral excellence or admirableness'

In [56]:
evil = gn2.lemmas()[0].antonyms()[0]
evil.name()

'evil'

## Calculating WordNet Synset similarity

In [57]:
from nltk.corpus import wordnet

cb = wordnet.synset('cookbook.n.01')
ib = wordnet.synset('instruction_book.n.01')

cb.wup_similarity(ib)

0.9166666666666666
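The score can be reproduced by hand from hypernym paths. The sketch below is a simplified, single-path version of the Wu-Palmer idea (twice the depth of the deepest shared ancestor, divided by the sum of the two depths, counting the root as depth 1). It is not NLTK's implementation. It assumes that `instruction_book.n.01` sits directly under `reference_book.n.01`, and the function name `wup_from_paths` is made up for illustration.

```python
# Hypernym path for cookbook.n.01, as listed by hypernym_paths() in this chapter.
shared_path = [
    'entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02',
    'artifact.n.01', 'creation.n.02', 'product.n.02', 'work.n.02',
    'publication.n.01', 'book.n.01', 'reference_book.n.01',
]
cookbook_path = shared_path + ['cookbook.n.01']
# Assumption: instruction_book.n.01 is a direct hyponym of reference_book.n.01.
instruction_book_path = shared_path + ['instruction_book.n.01']

def wup_from_paths(path_a, path_b):
    """Wu-Palmer score from two root-to-node paths, counting the root as depth 1."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return 2 * common / (len(path_a) + len(path_b))

score = wup_from_paths(cookbook_path, instruction_book_path)
print(score)  # 2 * 11 / (12 + 12), roughly 0.9167
```

Real Synsets can have several hypernym paths, so a full implementation has to choose among them; this sketch sidesteps that by using one path per Synset.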

In [59]:
ref = cb.hypernyms()[0]
ref


Synset('reference_book.n.01')

In [60]:
cb.shortest_path_distance(ref)

1

In [61]:
ib.shortest_path_distance(ref)

1

In [62]:
cb.shortest_path_distance(ib)

2
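These distances count hypernym links. A minimal sketch that reproduces the three numbers from a hand-written child-to-parent map follows. The map only contains links visible in this chapter's outputs, it assumes one hypernym per Synset (not true in general), and the names `hypernym_of`, `ancestors_with_distance` and `path_distance` are hypothetical.

```python
# Hypernym links taken from the outputs in this chapter (child -> parent).
# Assumption: each synset here has exactly one hypernym.
hypernym_of = {
    'cookbook.n.01': 'reference_book.n.01',
    'instruction_book.n.01': 'reference_book.n.01',
    'reference_book.n.01': 'book.n.01',
}

def ancestors_with_distance(name):
    """Map each synset reachable by going up from `name` to its number of steps."""
    distances = {name: 0}
    steps = 0
    while name in hypernym_of:
        name = hypernym_of[name]
        steps += 1
        distances[name] = steps
    return distances

def path_distance(a, b):
    """Fewest hypernym links joining a and b through a shared ancestor, or None."""
    up_a = ancestors_with_distance(a)
    up_b = ancestors_with_distance(b)
    shared = set(up_a) & set(up_b)
    if not shared:
        return None
    return min(up_a[s] + up_b[s] for s in shared)

print(path_distance('cookbook.n.01', 'reference_book.n.01'))          # 1
print(path_distance('instruction_book.n.01', 'reference_book.n.01'))  # 1
print(path_distance('cookbook.n.01', 'instruction_book.n.01'))        # 2
```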

In [64]:
dog = wordnet.synsets('dog')[0]
dog.wup_similarity(cb)

0.38095238095238093

In [65]:
sorted(dog.common_hypernyms(cb))

[Synset('entity.n.01'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('whole.n.02')]

### Comparing verbs

In [67]:
cook = wordnet.synset('cook.v.01')
bake = wordnet.synset('bake.v.02')

In [68]:
cook.wup_similarity(bake)

0.6666666666666666

## Discovering word collocations