# WordNet Usage Examples
    https://wordnet.princeton.edu/
    http://www.nltk.org
    
    WordNet NLTK API
    http://www.nltk.org/api/nltk.corpus.reader.html?highlight=wordnet#module-nltk.corpus.reader.wordnet

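* Importing the wordnet package
* Returning all synsets
* Returning all lemma names
* Finding all synsets that contain a word
* Looking up the definition of a synset
* Getting example sentences for a word sense
* Finding the synsets of a word for a given part of speech
* Listing all words in a synset
* Printing synset-word pairs (lemmas)
* Finding antonyms through lemmas
* Measuring semantic similarity between two words
* Getting hypernym paths: hypernym_paths()
* Getting root hypernyms: root_hypernyms()
* Getting hypernyms
* Getting hyponyms
* Printing a word's tree structure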

#### WordNet
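    WordNet is an English lexical database grounded in cognitive linguistics, designed jointly by psychologists, linguists, and computer engineers at Princeton University. Rather than simply listing words in alphabetical order, it links words by meaning into a "network of words".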

Terminology:
* synset: a set of synonyms that share one meaning
* lemma: a word entry, i.e. one word paired with one synset

#### word2vec
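    Word2vec is a group of related models used to produce word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network takes a word as input and must predict the words in adjacent positions; under the bag-of-words assumption in word2vec, word order does not matter. After training, a word2vec model can map each word to a vector that captures word-to-word relationships; this vector is taken from the hidden layer of the network.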

In [1]:
import os
import numpy as np
import matplotlib.pyplot as plt
from nltk.corpus import wordnet as wn

# Importing the wordnet package

In [2]:
from nltk.corpus import wordnet as wn

# Returning all synsets

In [3]:
from nltk.corpus import wordnet as wn
lst=wn.all_synsets()
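#--do not mix list() and next() on the same generator: items already consumed are gone--

#fetch one result at a time (lazy, efficient)
val1=next(lst)   
print(val1)

#list() returns all results at once (slow, memory-hungry)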
#vals=list(lst)  
#print(vals[:2])

Synset('a_cappella.r.01')
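
The warning about mixing `next` and `list` follows from how Python generators work: each item can be consumed only once, so a `list()` call made after `next()` only sees what is left. A minimal sketch with a stand-in generator (`fake_synsets` is a hypothetical placeholder, not part of NLTK), which also shows `itertools.islice` as a way to take the first few items without materializing everything:

```python
from itertools import islice

def fake_synsets():
    """Hypothetical stand-in for a large lazy source such as wn.all_synsets()."""
    for i in range(5):
        yield f"synset_{i}"

gen = fake_synsets()
first = next(gen)   # consumes the first item
rest = list(gen)    # only the remaining four items are returned

# A fresh generator, truncated lazily to its first two items.
first_two = list(islice(fake_synsets(), 2))

print(first)
print(rest)
print(first_two)
```

With NLTK installed, the same `islice` call should work on `wn.all_synsets()` directly, since that method also returns a generator.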


# Returning all lemma names

In [4]:
from nltk.corpus import wordnet as wn
lst=wn.all_lemma_names()
print(type(lst))
val1=next(lst)
print(val1)

<class 'dict_keyiterator'>
serratus_anterior


# Finding all synsets that contain a word (synsets)

    synsets(lemma, pos=None, lang='eng', check_exceptions=True)

    Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.

    dog.n.01 => n: noun, 01: first sense
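
The naming scheme can be unpacked mechanically. The sketch below is a hypothetical helper (`parse_synset_name` is not an NLTK function) that splits a synset name into lemma, part of speech, and sense number; the part-of-speech letters are the ones listed under WordNetCorpusReader in this notebook.

```python
from typing import NamedTuple

# Part-of-speech letters as listed in the WordNetCorpusReader constants.
POS_LABELS = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective satellite",
    "r": "adverb",
}

class SynsetName(NamedTuple):
    lemma: str
    pos: str
    sense: int

def parse_synset_name(name: str) -> SynsetName:
    """Split 'lemma.pos.nn' from the right so a lemma containing dots survives."""
    lemma, pos, sense = name.rsplit(".", 2)
    return SynsetName(lemma, pos, int(sense))

parsed = parse_synset_name("dog.n.01")
print(parsed, "->", POS_LABELS[parsed.pos])
```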
    

In [5]:
wn.synsets('dog')

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

# Looking up the definition of a synset

In [6]:
print(wn.synset('dog.n.01').definition())

a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds


# Getting example sentences for a word sense

In [7]:
wn.synset('dog.n.01').examples()

['the dog barked all night']

# Finding the synsets of a word for a given part of speech
    Note: pos can be NOUN, VERB, ADJ, ADV, ...

In [8]:
wn.synsets('dog',pos=wn.NOUN) #nouns

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01')]

In [9]:
wn.synsets('dog',pos=wn.VERB) #verbs

[Synset('chase.v.01')]

# Listing all words in a synset

In [10]:
wn.synset('dog.n.01').lemma_names()

['dog', 'domestic_dog', 'Canis_familiaris']

In [11]:
wn.synsets('Canis_familiaris')

[Synset('dog.n.01')]

# Printing synset-word pairs (lemmas)

In [12]:
lst=wn.synset('dog.n.01').lemmas()
item=lst[0]
print(wn.synset('dog.n.01').lemmas())

print(item.antonyms()) # antonyms
print(item.count())    # Return the frequency count for this Lemma
print(item.derivationally_related_forms())
print(item.frame_ids())
print(item.frame_strings())
print(item.key())
print(item.lang())
print(item.name())
print(item.pertainyms())
print(item.synset())
print(item.syntactic_marker())
print(item.unicode_repr())

print(item.hypernyms)  # no parentheses: prints the bound method itself, not its result
print(item.hyponyms)   # same here

[Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')]
[]
42
[]
[]
[]
dog%1:05:00::
eng
dog
[]
Synset('dog.n.01')
None
Lemma('dog.n.01.dog')
<bound method _WordNetObject.hypernyms of Lemma('dog.n.01.dog')>
<bound method _WordNetObject.hyponyms of Lemma('dog.n.01.dog')>


# Finding antonyms through lemmas

In [13]:
good = wn.synset('good.a.01')
print(good.lemmas()[0].antonyms()) #antonyms

[Lemma('bad.a.01.bad')]


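# Measuring semantic similarity between two words
    path_similarity returns a value between 0 and 1; the larger the value, the more similar the two synsets.
    Note that nouns and verbs are organized into full hierarchical taxonomies, whereas adjectives and adverbs are not, so path_similarity cannot be used for them.
    The most useful relation for adjectives and adverbs is "similar to".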

In [14]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
dog.path_similarity(cat)

0.2
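
The value 0.2 can be reproduced by hand: path similarity is 1 / (1 + d), where d is the length of the shortest path between the two synsets in the hypernym hierarchy. Below is a minimal sketch over a hand-copied fragment of that hierarchy. The dog links come from the hypernym output elsewhere in this notebook; the cat → feline → carnivore links are an assumption added for illustration. It is a simplified breadth-first search, not NLTK's actual implementation.

```python
from collections import deque

# Child -> direct hypernyms (toy fragment; the cat/feline links are assumed).
HYPERNYMS = {
    "dog.n.01": ["canine.n.02", "domestic_animal.n.01"],
    "canine.n.02": ["carnivore.n.01"],
    "cat.n.01": ["feline.n.01"],
    "feline.n.01": ["carnivore.n.01"],
    "carnivore.n.01": ["placental.n.01"],
    "domestic_animal.n.01": ["animal.n.01"],
}

def shortest_distance(a, b):
    """Breadth-first search over hypernym links, treated as undirected edges."""
    neighbors = {}
    for child, parents in HYPERNYMS.items():
        for parent in parents:
            neighbors.setdefault(child, set()).add(parent)
            neighbors.setdefault(parent, set()).add(child)
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path in this fragment

def path_similarity(a, b):
    d = shortest_distance(a, b)
    return None if d is None else 1 / (1 + d)

print(path_similarity("dog.n.01", "cat.n.01"))
```

The shortest route is dog → canine → carnivore → feline → cat, four edges, which gives 1 / 5.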

In [15]:
beau=wn.synset('beautiful.a.01')
beau.similar_tos()

[Synset('beauteous.s.01'),
 Synset('bonny.s.01'),
 Synset('dishy.s.01'),
 Synset('exquisite.s.04'),
 Synset('fine-looking.s.01'),
 Synset('glorious.s.03'),
 Synset('gorgeous.s.01'),
 Synset('lovely.s.01'),
 Synset('picturesque.s.01'),
 Synset('pretty-pretty.s.01'),
 Synset('pretty.s.01'),
 Synset('pulchritudinous.s.01'),
 Synset('ravishing.s.01'),
 Synset('scenic.s.01'),
 Synset('stunning.s.04')]

# Getting hypernym paths: hypernym_paths()

In [16]:
from nltk.corpus import wordnet as wn
paths= wn.synset('dog.n.01').hypernym_paths()
print(paths)

[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('chordate.n.01'), Synset('vertebrate.n.01'), Synset('mammal.n.01'), Synset('placental.n.01'), Synset('carnivore.n.01'), Synset('canine.n.02'), Synset('dog.n.01')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('domestic_animal.n.01'), Synset('dog.n.01')]]
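
Two paths come back because `dog.n.01` has two direct hypernyms, and each one leads to the root by its own route. A rough sketch of that enumeration, using a plain dictionary copied from the output above in place of WordNet. This `hypernym_paths` is an illustrative stand-in, not NLTK's code.

```python
# Synset name -> direct hypernyms, transcribed from the printed paths.
HYPERNYMS = {
    "dog.n.01": ["canine.n.02", "domestic_animal.n.01"],
    "canine.n.02": ["carnivore.n.01"],
    "carnivore.n.01": ["placental.n.01"],
    "placental.n.01": ["mammal.n.01"],
    "mammal.n.01": ["vertebrate.n.01"],
    "vertebrate.n.01": ["chordate.n.01"],
    "chordate.n.01": ["animal.n.01"],
    "domestic_animal.n.01": ["animal.n.01"],
    "animal.n.01": ["organism.n.01"],
    "organism.n.01": ["living_thing.n.01"],
    "living_thing.n.01": ["whole.n.02"],
    "whole.n.02": ["object.n.01"],
    "object.n.01": ["physical_entity.n.01"],
    "physical_entity.n.01": ["entity.n.01"],
    "entity.n.01": [],
}

def hypernym_paths(name):
    """Return every root-to-synset path, one list per route."""
    parents = HYPERNYMS[name]
    if not parents:
        return [[name]]  # reached a root
    return [path + [name]
            for parent in parents
            for path in hypernym_paths(parent)]

paths = hypernym_paths("dog.n.01")
for path in paths:
    print(len(path), " -> ".join(path))
```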


#  Getting root hypernyms: root_hypernyms()

In [17]:
from nltk.corpus import wordnet as wn
paths= wn.synset('dog.n.01').root_hypernyms()
print(paths)

[Synset('entity.n.01')]


# Getting hypernyms

In [18]:
from nltk.corpus import wordnet as wn
top_syn=wn.synset('dog.n.01').hypernyms()
for item in top_syn:
    print(item)

Synset('canine.n.02')
Synset('domestic_animal.n.01')


# Getting hyponyms

In [19]:
from nltk.corpus import wordnet as wn
bot_syn=wn.synset('dog.n.01').hyponyms()
for item in bot_syn:
    print(item)

Synset('basenji.n.01')
Synset('corgi.n.01')
Synset('cur.n.01')
Synset('dalmatian.n.02')
Synset('great_pyrenees.n.01')
Synset('griffon.n.02')
Synset('hunting_dog.n.01')
Synset('lapdog.n.01')
Synset('leonberg.n.01')
Synset('mexican_hairless.n.01')
Synset('newfoundland.n.01')
Synset('pooch.n.01')
Synset('poodle.n.01')
Synset('pug.n.01')
Synset('puppy.n.01')
Synset('spitz.n.01')
Synset('toy_dog.n.01')
Synset('working_dog.n.01')


# Tree structure

In [20]:
from nltk.corpus import wordnet as wn
dog = wn.synset('dog.n.01')
hyp = lambda s:s.hypernyms()
from pprint import pprint
pprint(dog.tree(hyp))

[Synset('dog.n.01'),
 [Synset('canine.n.02'),
  [Synset('carnivore.n.01'),
   [Synset('placental.n.01'),
    [Synset('mammal.n.01'),
     [Synset('vertebrate.n.01'),
      [Synset('chordate.n.01'),
       [Synset('animal.n.01'),
        [Synset('organism.n.01'),
         [Synset('living_thing.n.01'),
          [Synset('whole.n.02'),
           [Synset('object.n.01'),
            [Synset('physical_entity.n.01'),
             [Synset('entity.n.01')]]]]]]]]]]]]],
 [Synset('domestic_animal.n.01'),
  [Synset('animal.n.01'),
   [Synset('organism.n.01'),
    [Synset('living_thing.n.01'),
     [Synset('whole.n.02'),
      [Synset('object.n.01'),
       [Synset('physical_entity.n.01'), [Synset('entity.n.01')]]]]]]]]]


# Hierarchical relations
<img src="images/wordnet_tree.png" />

In [21]:
dog=wn.synset('dog.n.01')
types_of_dog=dog.hyponyms()
types_of_dog[0]
# collect every lemma name across the direct hyponyms of dog.n.01
sorted(
    [lemma.name()
    for synset in types_of_dog
         for lemma in synset.lemmas()])

['Belgian_griffon',
 'Brussels_griffon',
 'Great_Pyrenees',
 'Leonberg',
 'Mexican_hairless',
 'Newfoundland',
 'Newfoundland_dog',
 'Welsh_corgi',
 'barker',
 'basenji',
 'bow-wow',
 'carriage_dog',
 'coach_dog',
 'corgi',
 'cur',
 'dalmatian',
 'doggie',
 'doggy',
 'griffon',
 'hunting_dog',
 'lapdog',
 'mongrel',
 'mutt',
 'pooch',
 'poodle',
 'poodle_dog',
 'pug',
 'pug-dog',
 'puppy',
 'spitz',
 'toy',
 'toy_dog',
 'working_dog']

In [22]:
print(wn.synsets(u'结婚', lang='cmn'))  # '结婚' means "to marry"; 'cmn' selects Mandarin Chinese
#for synset in wn.synsets(u'计算机', lang='cmn'):  # '计算机' means "computer"
#    types_of_computer = synset.hyponyms()
#    print(sorted([lemma.name() for synset in types_of_computer for lemma in synset.lemmas('cmn')]))
print(wn.synsets(u'windows', lang='eng'))

[Synset('marriage.n.03'), Synset('marry.v.01'), Synset('marriage.n.01')]
[Synset('windows.n.01'), Synset('window.n.01'), Synset('window.n.02'), Synset('window.n.03'), Synset('window.n.04'), Synset('window.n.05'), Synset('windowpane.n.01'), Synset('window.n.07'), Synset('window.n.08')]


#  WordNetCorpusReader
    class nltk.corpus.reader.wordnet.WordNetCorpusReader(root, omw_reader)
    
    A corpus reader used to access wordnet or its variants.
    ADJ = 'a'
    ADJ_SAT = 's'
    ADV = 'r'
    MORPHOLOGICAL_SUBSTITUTIONS = {'a': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'n': [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')], 'r': [], 's': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'v': [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''), ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')]}
    NOUN = 'n'
    VERB = 'v'
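
One way to read the MORPHOLOGICAL_SUBSTITUTIONS table: each pair `(old_suffix, new_suffix)` is a rewrite rule that turns an inflected form into a candidate base form. The sketch below applies the noun rules from the table in that simplified way. `candidate_base_forms` is a hypothetical helper. NLTK's own lemmatizer (`wn.morphy`) also consults exception lists and keeps only candidates found in WordNet, which this sketch does not attempt.

```python
# Noun rules copied from the 'n' entry of MORPHOLOGICAL_SUBSTITUTIONS.
NOUN_RULES = [("s", ""), ("ses", "s"), ("ves", "f"), ("xes", "x"), ("zes", "z"),
              ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y")]

def candidate_base_forms(word, rules):
    """Apply every matching suffix rule once; results may include non-words."""
    return [word[: -len(old)] + new
            for old, new in rules
            if word.endswith(old)]

print(candidate_base_forms("churches", NOUN_RULES))
print(candidate_base_forms("puppies", NOUN_RULES))
```

Only some candidates are real words ("church", "puppy"), which is why a lookup against the lexicon is needed afterwards.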

### all_lemma_names
    Returns all lemma names.
    
    all_lemma_names(pos=None, lang='eng')
    Return all lemma names for all synsets for the given part of speech tag and language or languages. If pos is not specified, all synsets for all parts of speech will be used.    

In [23]:
#return all lemma names
from nltk.corpus import wordnet as wn
lst=wn.all_lemma_names()
print('return Type:')
print(type(lst))
lst=list(lst)
#number of lemma names
print('number of all_lemma_names:')
print(len(lst))
#show the first 2 lemma names
print('display top 2 lemmas:')
for sname in lst[:2]:
    print(sname)

return Type:
<class 'dict_keyiterator'>
number of all_lemma_names:
147306
display top 2 lemmas:
serratus_anterior
genus_claviceps


### all_synsets
    Returns all synsets.
    
    all_synsets(pos=None)
    Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.

In [24]:
from nltk.corpus import wordnet as wn
lst=wn.all_synsets()
print(type(lst))

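#--do not mix list() and next() on the same generator: items already consumed are gone--

#fetch one result at a time (lazy, efficient)
val1=next(lst)   
print(type(val1))
print(val1)

#list() returns all results at once (slow, memory-hungry)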
#vals=list(lst)  
#print(type(vals))
#print(vals[:2])

<class 'generator'>
<class 'nltk.corpus.reader.wordnet.Synset'>
Synset('a_cappella.r.01')


### lemma
    lemma(name, lang='eng')
    Return lemma object that matches the name


In [25]:
from nltk.corpus import wordnet as wn
obj=wn.lemma('dog.n.01.dog')
print(type(obj))

<class 'nltk.corpus.reader.wordnet.Lemma'>
