Topics:
    
    1. Language Processing and Python
    2. Accessing Text Corpora and Lexical Resources
    3. Processing Raw Text
    4. Writing Structured Programs
    5. Categorizing and Tagging Words (minor fixes still required)
    6. Learning to Classify Text
    7. Extracting Information from Text
    8. Analyzing Sentence Structure
    9. Building Feature Based Grammars
    10. Analyzing the Meaning of Sentences (minor fixes still required)
    11. Managing Linguistic Data (minor fixes still required)
    12. Afterword: Facing the Language Challenge

### WordNet

WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets. We'll begin by looking at synonyms and how they are accessed in WordNet.

In [1]:
from nltk.corpus import wordnet

In [2]:
wordnet.synsets('motorcar')

[Synset('car.n.01')]

Thus, motorcar has just one possible meaning and it is identified as car.n.01, the first noun sense of car. The entity car.n.01 is called a synset, or "synonym set", a collection of synonymous words (or "lemmas"):

In [5]:
wordnet.synset('car.n.01').lemma_names()

['car', 'auto', 'automobile', 'machine', 'motorcar']

Each word of a synset can have several meanings, e.g., car can also signify a train carriage, a gondola, or an elevator car. However, we are only interested in the single meaning that is common to all words of the above synset. Synsets also come with a prose definition and some example sentences:

In [8]:
wordnet.synset('car.n.01').definition()

'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

In [9]:
wordnet.synset('car.n.01').examples()

['he needs a car to get to work']

Although definitions help humans to understand the intended meaning of a synset, the words of the synset are often more useful for our programs. To eliminate ambiguity, we will identify these words as car.n.01.automobile, car.n.01.motorcar, and so on. This pairing of a synset with a word is called a lemma. We can get all the lemmas for a given synset:

In [10]:
wordnet.synset('car.n.01').lemmas()

[Lemma('car.n.01.car'),
 Lemma('car.n.01.auto'),
 Lemma('car.n.01.automobile'),
 Lemma('car.n.01.machine'),
 Lemma('car.n.01.motorcar')]
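
The identifiers above follow a regular pattern: a synset name such as car.n.01 combines a head word, a part-of-speech letter, and a sense number, and a lemma identifier appends the word itself. As a rough illustration of that naming scheme, the sketch below splits such strings with plain Python. `parse_lemma_id` and `POS_NAMES` are hypothetical helpers, not part of NLTK, and the code assumes the head word contains no period.

```python
# Hypothetical helper (not an NLTK function): split a lemma identifier such as
# 'car.n.01.automobile' into its parts. Assumes the head word has no '.' in it.
POS_NAMES = {'n': 'noun', 'v': 'verb', 'a': 'adjective',
             's': 'adjective satellite', 'r': 'adverb'}

def parse_lemma_id(identifier):
    # maxsplit=3 keeps any periods inside the lemma name itself (e.g. 'S.U.V.').
    head, pos, sense, lemma_name = identifier.split('.', 3)
    return {
        'synset': f'{head}.{pos}.{sense}',
        'pos': POS_NAMES.get(pos, pos),
        'sense_number': int(sense),
        'lemma_name': lemma_name,
    }

parsed = parse_lemma_id('car.n.01.automobile')
print(parsed['synset'], parsed['pos'], parsed['sense_number'], parsed['lemma_name'])
```

In NLTK itself the same information is available directly from the objects, for example via `lemma.name()` and `lemma.synset()`, so string parsing like this is mainly useful for understanding how the identifiers are built.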

In [11]:
wordnet.synsets('car')

[Synset('car.n.01'),
 Synset('car.n.02'),
 Synset('car.n.03'),
 Synset('car.n.04'),
 Synset('cable_car.n.01')]

In [12]:
wordnet.synsets('motorcar')

[Synset('car.n.01')]

Unlike the word motorcar, which is unambiguous and has one synset, the word car is ambiguous, having five synsets:

In [13]:
for synset in wordnet.synsets('car'):
    print(synset.lemma_names())

['car', 'auto', 'automobile', 'machine', 'motorcar']
['car', 'railcar', 'railway_car', 'railroad_car']
['car', 'gondola']
['car', 'elevator_car']
['cable_car', 'car']
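
To make the contrast between car and motorcar concrete, the sketch below inverts the lemma-name lists printed above into a word-to-synsets index. It uses plain Python on data copied from that output; `invert` is a hypothetical helper, and the index only covers the five synsets of car, so a word such as machine will look less ambiguous here than it is in the full WordNet.

```python
from collections import defaultdict

# Lemma names per synset, copied from the output above.
car_synsets = {
    'car.n.01': ['car', 'auto', 'automobile', 'machine', 'motorcar'],
    'car.n.02': ['car', 'railcar', 'railway_car', 'railroad_car'],
    'car.n.03': ['car', 'gondola'],
    'car.n.04': ['car', 'elevator_car'],
    'cable_car.n.01': ['cable_car', 'car'],
}

def invert(synset_to_lemmas):
    """Map each lemma name to the list of synsets it appears in."""
    word_to_synsets = defaultdict(list)
    for synset_name, lemma_names in synset_to_lemmas.items():
        for lemma_name in lemma_names:
            word_to_synsets[lemma_name].append(synset_name)
    return dict(word_to_synsets)

index = invert(car_synsets)
print(len(index['car']))   # 5 synsets: the word is ambiguous
print(index['motorcar'])   # ['car.n.01']: a single sense
```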


For convenience, we can access all the lemmas involving the word car as follows.

In [14]:
wordnet.lemmas('car')

[Lemma('car.n.01.car'),
 Lemma('car.n.02.car'),
 Lemma('car.n.03.car'),
 Lemma('car.n.04.car'),
 Lemma('cable_car.n.01.car')]

Let's try another word, dish, to see how its synsets and lemmas vary:

In [15]:
wordnet.synsets('dish')

[Synset('dish.n.01'),
 Synset('dish.n.02'),
 Synset('dish.n.03'),
 Synset('smasher.n.02'),
 Synset('dish.n.05'),
 Synset('cup_of_tea.n.01'),
 Synset('serve.v.06'),
 Synset('dish.v.02')]

In [17]:
for synset in wordnet.synsets('dish'):
    print(synset.lemmas())

[Lemma('dish.n.01.dish')]
[Lemma('dish.n.02.dish')]
[Lemma('dish.n.03.dish'), Lemma('dish.n.03.dishful')]
[Lemma('smasher.n.02.smasher'), Lemma('smasher.n.02.stunner'), Lemma('smasher.n.02.knockout'), Lemma('smasher.n.02.beauty'), Lemma('smasher.n.02.ravisher'), Lemma('smasher.n.02.sweetheart'), Lemma('smasher.n.02.peach'), Lemma('smasher.n.02.lulu'), Lemma('smasher.n.02.looker'), Lemma('smasher.n.02.mantrap'), Lemma('smasher.n.02.dish')]
[Lemma('dish.n.05.dish'), Lemma('dish.n.05.dish_aerial'), Lemma('dish.n.05.dish_antenna'), Lemma('dish.n.05.saucer')]
[Lemma('cup_of_tea.n.01.cup_of_tea'), Lemma('cup_of_tea.n.01.bag'), Lemma('cup_of_tea.n.01.dish')]
[Lemma('serve.v.06.serve'), Lemma('serve.v.06.serve_up'), Lemma('serve.v.06.dish_out'), Lemma('serve.v.06.dish_up'), Lemma('serve.v.06.dish')]
[Lemma('dish.v.02.dish')]


In [18]:
for synset in wordnet.synsets('dish'):
    print(synset.lemma_names())

['dish']
['dish']
['dish', 'dishful']
['smasher', 'stunner', 'knockout', 'beauty', 'ravisher', 'sweetheart', 'peach', 'lulu', 'looker', 'mantrap', 'dish']
['dish', 'dish_aerial', 'dish_antenna', 'saucer']
['cup_of_tea', 'bag', 'dish']
['serve', 'serve_up', 'dish_out', 'dish_up', 'dish']
['dish']


#### The WordNet Hierarchy

WordNet synsets correspond to abstract concepts, and they don't always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, and Event; these are called unique beginners or root synsets. Others, such as gas guzzler and hatchback, are much more specific.

WordNet makes it easy to navigate between concepts. For example, given a concept like motorcar, we can look at the concepts that are more specific: its (immediate) hyponyms.

In [19]:
motorcar = wordnet.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()

In [20]:
types_of_motorcar

[Synset('ambulance.n.01'),
 Synset('beach_wagon.n.01'),
 Synset('bus.n.04'),
 Synset('cab.n.03'),
 Synset('compact.n.03'),
 Synset('convertible.n.01'),
 Synset('coupe.n.01'),
 Synset('cruiser.n.01'),
 Synset('electric.n.01'),
 Synset('gas_guzzler.n.01'),
 Synset('hardtop.n.01'),
 Synset('hatchback.n.01'),
 Synset('horseless_carriage.n.01'),
 Synset('hot_rod.n.01'),
 Synset('jeep.n.01'),
 Synset('limousine.n.01'),
 Synset('loaner.n.02'),
 Synset('minicar.n.01'),
 Synset('minivan.n.01'),
 Synset('model_t.n.01'),
 Synset('pace_car.n.01'),
 Synset('racer.n.02'),
 Synset('roadster.n.01'),
 Synset('sedan.n.01'),
 Synset('sport_utility.n.01'),
 Synset('sports_car.n.01'),
 Synset('stanley_steamer.n.01'),
 Synset('stock_car.n.01'),
 Synset('subcompact.n.01'),
 Synset('touring_car.n.01'),
 Synset('used-car.n.01')]

In [23]:
sorted(lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas())

['Model_T',
 'S.U.V.',
 'SUV',
 'Stanley_Steamer',
 'ambulance',
 'beach_waggon',
 'beach_wagon',
 'bus',
 'cab',
 'compact',
 'compact_car',
 'convertible',
 'coupe',
 'cruiser',
 'electric',
 'electric_automobile',
 'electric_car',
 'estate_car',
 'gas_guzzler',
 'hack',
 'hardtop',
 'hatchback',
 'heap',
 'horseless_carriage',
 'hot-rod',
 'hot_rod',
 'jalopy',
 'jeep',
 'landrover',
 'limo',
 'limousine',
 'loaner',
 'minicar',
 'minivan',
 'pace_car',
 'patrol_car',
 'phaeton',
 'police_car',
 'police_cruiser',
 'prowl_car',
 'race_car',
 'racer',
 'racing_car',
 'roadster',
 'runabout',
 'saloon',
 'secondhand_car',
 'sedan',
 'sport_car',
 'sport_utility',
 'sport_utility_vehicle',
 'sports_car',
 'squad_car',
 'station_waggon',
 'station_wagon',
 'stock_car',
 'subcompact',
 'subcompact_car',
 'taxi',
 'taxicab',
 'tourer',
 'touring_car',
 'two-seater',
 'used-car',
 'waggon',
 'wagon']

We can also navigate up the hierarchy by visiting hypernyms. Some words have multiple paths, because they can be classified in more than one way. There are two paths between car.n.01 and entity.n.01 because wheeled_vehicle.n.01 can be classified as both a vehicle and a container.

In [24]:
motorcar.hypernyms()

[Synset('motor_vehicle.n.01')]

In [28]:
paths = motorcar.hypernym_paths()
paths

[[Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('artifact.n.01'),
  Synset('instrumentality.n.03'),
  Synset('container.n.01'),
  Synset('wheeled_vehicle.n.01'),
  Synset('self-propelled_vehicle.n.01'),
  Synset('motor_vehicle.n.01'),
  Synset('car.n.01')],
 [Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('artifact.n.01'),
  Synset('instrumentality.n.03'),
  Synset('conveyance.n.03'),
  Synset('vehicle.n.01'),
  Synset('wheeled_vehicle.n.01'),
  Synset('self-propelled_vehicle.n.01'),
  Synset('motor_vehicle.n.01'),
  Synset('car.n.01')]]

In [30]:
len(paths)

2
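
To see exactly where the two paths part and rejoin, we can compare them as plain lists. The sketch below copies the synset names from the output above; `shared_prefix` and `shared_suffix` are hypothetical helpers written for this illustration, not NLTK functions.

```python
# The two hypernym paths printed above, as synset names (root first).
path_a = ['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02',
          'artifact.n.01', 'instrumentality.n.03', 'container.n.01',
          'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01',
          'motor_vehicle.n.01', 'car.n.01']
path_b = ['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02',
          'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03',
          'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01',
          'motor_vehicle.n.01', 'car.n.01']

def shared_prefix(a, b):
    """Longest run of equal elements at the start of both lists."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix

def shared_suffix(a, b):
    """Longest run of equal elements at the end of both lists."""
    return shared_prefix(a[::-1], b[::-1])[::-1]

prefix = shared_prefix(path_a, path_b)
suffix = shared_suffix(path_a, path_b)
branch_a = path_a[len(prefix):len(path_a) - len(suffix)]
branch_b = path_b[len(prefix):len(path_b) - len(suffix)]

print(prefix[-1])   # instrumentality.n.03: last concept before the split
print(branch_a)     # ['container.n.01']
print(branch_b)     # ['conveyance.n.03', 'vehicle.n.01']
print(suffix[0])    # wheeled_vehicle.n.01: where the paths rejoin
```

With NLTK loaded, the same lists could be produced with `[s.name() for s in path]` for each `path` in `motorcar.hypernym_paths()`.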

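hypernym_distances() returns a set of (synset, distance) pairs, giving the number of steps from car.n.01 up to each of its hypernyms along each path:

In [31]:
distances = motorcar.hypernym_distances()

len(distances)

19

In [32]:
distances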

{(Synset('artifact.n.01'), 6),
 (Synset('artifact.n.01'), 7),
 (Synset('car.n.01'), 0),
 (Synset('container.n.01'), 4),
 (Synset('conveyance.n.03'), 5),
 (Synset('entity.n.01'), 10),
 (Synset('entity.n.01'), 11),
 (Synset('instrumentality.n.03'), 5),
 (Synset('instrumentality.n.03'), 6),
 (Synset('motor_vehicle.n.01'), 1),
 (Synset('object.n.01'), 8),
 (Synset('object.n.01'), 9),
 (Synset('physical_entity.n.01'), 9),
 (Synset('physical_entity.n.01'), 10),
 (Synset('self-propelled_vehicle.n.01'), 2),
 (Synset('vehicle.n.01'), 4),
 (Synset('wheeled_vehicle.n.01'), 3),
 (Synset('whole.n.02'), 7),
 (Synset('whole.n.02'), 8)}
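
One way to read this output is as the two hypernym paths walked backwards from car.n.01, recording how many steps up each ancestor lies. The sketch below reproduces the same 19 pairs from the path data printed earlier. It is an illustration in plain Python, not NLTK's implementation, and `distances_from_paths` is a hypothetical helper.

```python
# The two hypernym paths printed earlier, written as synset names (root first).
trunk = ['entity.n.01', 'physical_entity.n.01', 'object.n.01',
         'whole.n.02', 'artifact.n.01', 'instrumentality.n.03']
tail = ['wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01',
        'motor_vehicle.n.01', 'car.n.01']
path_a = trunk + ['container.n.01'] + tail
path_b = trunk + ['conveyance.n.03', 'vehicle.n.01'] + tail

def distances_from_paths(paths):
    """Collect (name, steps up from the last element) over every path."""
    pairs = set()
    for path in paths:
        for steps_up, name in enumerate(reversed(path)):
            pairs.add((name, steps_up))
    return pairs

pairs = distances_from_paths([path_a, path_b])
print(len(pairs))                # 19, matching the output above

# Keep only the shortest distance to each ancestor.
shortest = {}
for name, steps in pairs:
    shortest[name] = min(steps, shortest.get(name, steps))
print(shortest['entity.n.01'])   # 10
```

Ancestors above the point where the paths split appear twice, once per path, which is why entity.n.01 is listed at both distance 10 and distance 11.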

#### More Lexical Relations

Hypernyms and hyponyms are called lexical relations because they relate one synset to another. These two relations navigate up and down the "is-a" hierarchy. Another important way to navigate the WordNet network is from items to their components (meronyms) or to the things they are contained in (holonyms). For example, the parts of a tree are its trunk, crown, and so on; the part_meronyms(). The substance a tree is made of includes heartwood and sapwood; the substance_meronyms(). A collection of trees forms a forest; the member_holonyms():

In [35]:
wordnet.synset('tree.n.01').part_meronyms()

[Synset('burl.n.02'),
 Synset('crown.n.07'),
 Synset('limb.n.02'),
 Synset('stump.n.01'),
 Synset('trunk.n.01')]

In [36]:
wordnet.synset('tree.n.01').substance_meronyms()

[Synset('heartwood.n.01'), Synset('sapwood.n.01')]

In [38]:
wordnet.synset('tree.n.01').member_holonyms()

[Synset('forest.n.01')]

To see just how intricate things can get, consider the word mint, which has several closely-related senses. We can see that mint.n.04 is part of mint.n.02 and the substance from which mint.n.05 is made.

In [39]:
for synset in wordnet.synsets('mint', wordnet.NOUN):
    print(synset.name() + ':', synset.definition())

batch.n.02: (often followed by `of') a large number or amount or extent
mint.n.02: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
mint.n.03: any member of the mint family of plants
mint.n.04: the leaves of a mint plant used fresh or candied
mint.n.05: a candy that is flavored with a mint oil
mint.n.06: a plant where money is coined by authority of the government


In [40]:
wordnet.synset('mint.n.04').part_holonyms()

[Synset('mint.n.02')]

In [41]:
wordnet.synset('mint.n.04').substance_holonyms()

[Synset('mint.n.05')]
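
The direction of these relations is easy to mix up, so it can help to store each fact once as a (component, kind, whole) triple and read it in either direction. The sketch below does this in plain Python with the results shown above. `EDGES`, `meronyms`, and `holonyms` are hypothetical names, and the reverse lookups assume that every holonym link has a matching meronym link, which the outputs above do not demonstrate.

```python
# (component, kind, whole) triples copied from the outputs above.
EDGES = [
    ('burl.n.02', 'part', 'tree.n.01'),
    ('crown.n.07', 'part', 'tree.n.01'),
    ('limb.n.02', 'part', 'tree.n.01'),
    ('stump.n.01', 'part', 'tree.n.01'),
    ('trunk.n.01', 'part', 'tree.n.01'),
    ('heartwood.n.01', 'substance', 'tree.n.01'),
    ('sapwood.n.01', 'substance', 'tree.n.01'),
    ('tree.n.01', 'member', 'forest.n.01'),
    ('mint.n.04', 'part', 'mint.n.02'),
    ('mint.n.04', 'substance', 'mint.n.05'),
]

def meronyms(whole, kind, edges=EDGES):
    """Components of `whole`: read the triples from whole to component."""
    return sorted(c for c, k, w in edges if w == whole and k == kind)

def holonyms(component, kind, edges=EDGES):
    """Wholes containing `component`: read the triples the other way."""
    return sorted(w for c, k, w in edges if c == component and k == kind)

print(meronyms('tree.n.01', 'substance'))   # ['heartwood.n.01', 'sapwood.n.01']
print(holonyms('mint.n.04', 'part'))        # ['mint.n.02']
print(meronyms('forest.n.01', 'member'))    # ['tree.n.01'] (assumed reverse link)
```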

There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails stepping. Some verbs have multiple entailments:

In [42]:
wordnet.synset('walk.v.01').entailments()

[Synset('step.v.01')]

In [43]:
wordnet.synset('eat.v.01').entailments()

[Synset('chew.v.01'), Synset('swallow.v.01')]

In [45]:
wordnet.synset('tease.v.03').entailments()

[Synset('arouse.v.07'), Synset('disappoint.v.01')]

Some lexical relationships hold between lemmas, e.g., antonymy:

In [46]:
wordnet.lemma('supply.n.02.supply').antonyms()

[Lemma('demand.n.02.demand')]

In [47]:
wordnet.lemma('rush.v.01.rush').antonyms()

[Lemma('linger.v.04.linger')]

In [49]:
wordnet.lemma('horizontal.a.01.horizontal').antonyms()

[Lemma('vertical.a.01.vertical'), Lemma('inclined.a.02.inclined')]

In [50]:
wordnet.lemma('staccato.r.01.staccato').antonyms()

[Lemma('legato.r.01.legato')]

#### Semantic similarity

We have seen that synsets are linked by a complex network of lexical relations. Given a particular synset, we can traverse the WordNet network to find synsets with related meanings. Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term like vehicle will match documents containing specific terms like limousine.
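
As a minimal sketch of that indexing idea, the code below expands a query term with narrower terms before matching documents. It uses plain Python: the narrower terms are a small subset of the hyponym lemma names listed earlier for car.n.01, while the three documents and the helpers `expand` and `search` are made up for illustration.

```python
# A few hyponym lemma names of car.n.01, taken from the earlier listing.
narrower_terms = {
    'car': ['ambulance', 'cab', 'limo', 'limousine', 'minivan', 'sedan', 'taxi'],
}

# Hypothetical documents, invented for this example.
documents = {
    'doc1': 'the limousine waited outside the hotel',
    'doc2': 'she wrote a novel about whales',
    'doc3': 'a taxi and an ambulance blocked the road',
}

def expand(term, narrower):
    """Return the term together with its known narrower terms."""
    return {term, *narrower.get(term, [])}

def search(term, docs, narrower):
    """Ids of documents containing the term or any of its narrower terms."""
    terms = expand(term, narrower)
    return sorted(doc_id for doc_id, text in docs.items()
                  if terms & set(text.split()))

print(search('car', documents, {}))               # [] without expansion
print(search('car', documents, narrower_terms))   # ['doc1', 'doc3']
```

In a real application the narrower terms would come from WordNet itself, for example by collecting lemma names from a synset's hyponyms as shown earlier.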

Recall that each synset has one or more hypernym paths that link it to a root hypernym such as entity.n.01. Two synsets linked to the same root may have several hypernyms in common. If two synsets share a very specific hypernym (one that is low down in the hypernym hierarchy), they must be closely related.

In [52]:
right = wordnet.synset('right_whale.n.01')
orca = wordnet.synset('orca.n.01')
minke = wordnet.synset('minke_whale.n.01')
tortoise = wordnet.synset('tortoise.n.01')
novel = wordnet.synset('novel.n.01')
right.lowest_common_hypernyms(minke)

[Synset('baleen_whale.n.01')]

Of course we know that whale is very specific (and baleen whale even more so), while vertebrate is more general and entity is completely general. We can quantify this concept of generality by looking up the depth of each synset:

In [53]:
wordnet.synset('baleen_whale.n.01').min_depth()

14

In [54]:
wordnet.synset('whale.n.02').min_depth()

13

In [55]:
wordnet.synset('vertebrate.n.01').min_depth()

8

In [56]:
wordnet.synset('entity.n.01').min_depth()

0

Similarity measures have been defined over the collection of WordNet synsets which incorporate the above insight. For example, path_similarity assigns a score in the range 0–1 based on the shortest path that connects the concepts in the hypernym hierarchy (None is returned in those cases where a path cannot be found). Comparing a synset with itself will return 1. Consider the following similarity scores, relating right whale to minke whale, orca, tortoise, and novel. Although the numbers won't mean much on their own, they decrease as we move away from the semantic space of sea creatures to inanimate objects.

In [58]:
right.path_similarity(minke)

0.25

In [59]:

right.path_similarity(orca)

0.16666666666666666

In [60]:

right.path_similarity(tortoise)

0.07692307692307693

In [61]:

right.path_similarity(novel)

0.043478260869565216
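
These scores are consistent with the formula 1 / (d + 1), where d is the number of edges on the shortest path between the two synsets; to our understanding this is how NLTK computes path_similarity. The sketch below checks the formula against the scores above and recovers the implied distances. `path_score` and `implied_distance` are hypothetical helpers written in plain Python.

```python
import math

# Scores copied from the outputs above (right whale compared with each concept).
scores = {
    'minke': 0.25,
    'orca': 0.16666666666666666,
    'tortoise': 0.07692307692307693,
    'novel': 0.043478260869565216,
}

def path_score(distance):
    """Score for a shortest path of `distance` edges: 1 / (distance + 1)."""
    return 1 / (distance + 1)

def implied_distance(score):
    """Invert path_score: the number of edges that would give `score`."""
    return round(1 / score) - 1

for name, score in scores.items():
    d = implied_distance(score)
    # Confirm that the recovered distance reproduces the original score.
    print(name, d, math.isclose(path_score(d), score))
```

A distance of 0 gives a score of 1, which matches the statement that comparing a synset with itself returns 1. With NLTK loaded, d itself can be obtained with `right.shortest_path_distance(minke)`.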

### Summary

- A text corpus is a large, structured collection of texts. NLTK comes with many corpora, e.g., the Brown Corpus, nltk.corpus.brown.
- Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.
- A conditional frequency distribution is a collection of frequency distributions, each one for a different condition. They can be used for counting word frequencies, given a context or a genre.
- Python programs more than a few lines long should be entered using a text editor, saved to a file with a .py extension, and accessed using an import statement.
- Python functions permit you to associate a name with a particular block of code, and re-use that code as often as necessary.
- Some functions, known as "methods", are associated with an object and we give the object name followed by a period followed by the function, like this: x.funct(y), e.g., word.isalpha().
- To find out about some variable v, type help(v) in the Python interactive interpreter to read the help entry for this kind of object.
- WordNet is a semantically-oriented dictionary of English, consisting of synonym sets, or synsets, organized into a network.
- Some functions are not available by default, but must be accessed using Python's import statement.
