NLTK is an NLP toolkit for Python that provides a rich set of APIs for text processing and text mining.  

Open-source natural language processing libraries:
* Natural language toolkit (NLTK)
* Apache OpenNLP
* Stanford NLP suite
* Gate NLP library

## Installing NLTK
Install NLTK with pip:

* Run (sudo) pip install nltk

Alternative installers:

* easy_install
* conda install

To check that NLTK is installed correctly, open a Python terminal and enter: import nltk  
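
The import check can also be scripted. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is an illustrative assumption and not part of NLTK.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("nltk"):
    import nltk
    print("NLTK version:", nltk.__version__)
else:
    print("NLTK is not installed; try: pip install nltk")
```

Because the check looks the module up before importing it, the script reports a missing installation instead of stopping with an ImportError.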


## Using Corpora in NLTK

* Downloading corpora

In [2]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

* Use Gutenberg Corpus

In [27]:
from nltk.corpus import gutenberg as gt

print(gt.fileids())

# Access the file shakespeare-macbeth.txt
shakespeare_macbeth = gt.words("shakespeare-macbeth.txt")  # list of word tokens
print(shakespeare_macbeth)

raw = gt.raw("shakespeare-macbeth.txt")  # raw text as a single string
sent = gt.sents("shakespeare-macbeth.txt")  # sentences, each a list of tokens

for fileid in gt.fileids():
    num_words = len(gt.words(fileid))
    num_sents = len(gt.sents(fileid))
    print("FileName: %s\nNumber of words: %s\nNumber of Sentences: %s\n\n"%(fileid, num_words, num_sents))

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', ...]
FileName: austen-emma.txt
Number of words: 192427
Number of Sentences: 7752


FileName: austen-persuasion.txt
Number of words: 98171
Number of Sentences: 3747


FileName: austen-sense.txt
Number of words: 141576
Number of Sentences: 4999


FileName: bible-kjv.txt
Number of words: 1010654
Number of Sentences: 30103


FileName: blake-poems.txt
Number of words: 8354
Number of Sentences: 438


FileName: bryant-stories.txt
Number of words: 55563
Number of Sentences: 2863


FileName: burgess-busterbrown.txt
Number of words: 189
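
The per-file counts can be combined into simple corpus statistics such as average sentence length. The sketch below hard-codes a few of the figures printed above instead of reloading the corpus; note that the "word" counts include punctuation tokens, so the averages are in tokens per sentence.

```python
# (token count, sentence count) pairs copied from the output above.
counts = {
    "austen-emma.txt": (192427, 7752),
    "austen-persuasion.txt": (98171, 3747),
    "bible-kjv.txt": (1010654, 30103),
    "blake-poems.txt": (8354, 438),
}

# Average number of tokens per sentence for each file.
avg_sentence_length = {
    name: words / sents for name, (words, sents) in counts.items()
}

# Print the files from longest to shortest average sentence.
for name, avg in sorted(avg_sentence_length.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {avg:.1f} tokens per sentence")
```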

## Tokenization

In [7]:
text = '''"China will place greater importance on imports," Xi told business leaders at the expo. More than 3,800 enterprises and 181 countries, regions and international organizations are attending the expo."'''
tokens = text.split()  # splits on whitespace only, so punctuation stays attached to words
print(tokens)

['"China', 'will', 'place', 'greater', 'importance', 'on', 'imports,"', 'Xi', 'told', 'business', 'leaders', 'at', 'the', 'expo.', 'More', 'than', '3,800', 'enterprises', 'and', '181', 'countries,', 'regions', 'and', 'international', 'organizations', 'are', 'attending', 'the', 'expo."']


In [2]:
from nltk.tokenize import word_tokenize, sent_tokenize

text = '''"China will place greater importance on imports," Xi told business leaders at the expo. More than 3,800 enterprises and 181 countries, regions and international organizations are attending the expo."'''

tokens = word_tokenize(text, language = "english")
print(tokens)

sent = sent_tokenize(text) 
print(sent)


['``', 'China', 'will', 'place', 'greater', 'importance', 'on', 'imports', ',', "''", 'Xi', 'told', 'business', 'leaders', 'at', 'the', 'expo', '.', 'More', 'than', '3,800', 'enterprises', 'and', '181', 'countries', ',', 'regions', 'and', 'international', 'organizations', 'are', 'attending', 'the', 'expo', '.', "''"]
['"China will place greater importance on imports," Xi told business leaders at the expo.', 'More than 3,800 enterprises and 181 countries, regions and international organizations are attending the expo."']
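
The two outputs above differ mainly in how punctuation is handled: `str.split()` leaves it attached to words, while `word_tokenize` emits it as separate tokens. A rough approximation of that behavior can be sketched with a regular expression. The pattern below is a hypothetical simplification, not NLTK's algorithm; for instance, it does not rewrite quotation marks as `` and ''.

```python
import re

# Hypothetical pattern, tried left to right:
#   1. digits with optional thousands separators (e.g. 3,800)
#   2. a run of word characters
#   3. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\d+(?:,\d+)*|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sample = "More than 3,800 enterprises and 181 countries, regions are attending."
print(sample.split())           # punctuation stays glued to the words
print(simple_tokenize(sample))  # punctuation becomes separate tokens
```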


## Stemming
* Porter stemmer

In [3]:
from nltk.stem import PorterStemmer

# `tokens` is the word_tokenize output from the previous cell
print(tokens)

ps = PorterStemmer()

for w in tokens:
    print(ps.stem(w))


['``', 'China', 'will', 'place', 'greater', 'importance', 'on', 'imports', ',', "''", 'Xi', 'told', 'business', 'leaders', 'at', 'the', 'expo', '.', 'More', 'than', '3,800', 'enterprises', 'and', '181', 'countries', ',', 'regions', 'and', 'international', 'organizations', 'are', 'attending', 'the', 'expo', '.', "''"]
``
china
will
place
greater
import
on
import
,
''
Xi
told
busi
leader
at
the
expo
.
more
than
3,800
enterpris
and
181
countri
,
region
and
intern
organ
are
attend
the
expo
.
''
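
In the output above, both "importance" and "imports" are reduced to "import", which is the point of stemming: related word forms collapse to a shared stem. The toy function below illustrates the general idea of suffix stripping. It is not the Porter algorithm, and the suffix list and minimum stem length are hypothetical choices made only for this example.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("ations", "ation", "ance", "ing", "ers", "er", "es", "s")


def toy_stem(word, min_stem=3):
    """Lowercase the word and strip the first matching suffix, if any."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Only strip when enough of the word is left to serve as a stem.
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w


for word in ["imports", "importance", "attending", "expo"]:
    print(word, "->", toy_stem(word))
```

A real stemmer such as Porter's applies several ordered rule phases with conditions on the remaining stem, so its results differ from this toy on many words (for example, Porter yields "leader" for "leaders", as shown above).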


## Stop Words

Stop words are high-frequency function words (such as "the", "is", and "at") that carry little meaning on their own. A common preprocessing step is to take a list of stop words and filter them out of the data before further processing.

In [19]:
from nltk.corpus import stopwords

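# Load the stop word list that NLTK stores for English
stop_words = stopwords.words("english")  # lowercase file ID; "English" can fail on case-sensitive file systems
print(stop_words)

# `tokens` is the word_tokenize output from an earlier cell
tokens_filtered = [w for w in tokens if w not in stop_words]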
print(tokens_filtered)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no
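
The stop word list printed above is all lowercase, so a capitalized token such as "More" is not removed by a plain membership test. The sketch below lowercases each token before the comparison. It uses a tiny hand-picked stop set for illustration only, and the function name `remove_stop_words` is an assumption, not an NLTK API.

```python
# A tiny illustrative stop set; NLTK's English list is much longer.
STOP_WORDS = {"the", "and", "are", "at", "on", "will", "more", "than"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["More", "than", "3,800", "enterprises", "are", "attending", "the", "expo", "."]
print(remove_stop_words(tokens))
```

Storing the stop words in a set rather than a list also makes each membership test constant-time, which matters on large corpora.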

## POS Tagging & Chunking with NLTK

Part-of-speech (POS) tagging produces a list of (word, tag) *tuples*, labeling each word in a sentence as a noun, adjective, verb, and so on. Tags can also encode finer distinctions such as verb tense (for example, VBD for a past-tense verb).

In [7]:
from nltk import word_tokenize, pos_tag

text = '''"China will place greater importance on imports," Xi told business leaders at the expo. More than 3,800 enterprises and 181 countries, regions and international organizations are attending the expo."'''
tokens = word_tokenize(text, language = "english")
print(tokens)

pos = pos_tag(tokens)
print(pos)

for item in pos:
    print(item[0] + "_" + item[1], end = " ")

['``', 'China', 'will', 'place', 'greater', 'importance', 'on', 'imports', ',', "''", 'Xi', 'told', 'business', 'leaders', 'at', 'the', 'expo', '.', 'More', 'than', '3,800', 'enterprises', 'and', '181', 'countries', ',', 'regions', 'and', 'international', 'organizations', 'are', 'attending', 'the', 'expo', '.', "''"]
[('``', '``'), ('China', 'NNP'), ('will', 'MD'), ('place', 'VB'), ('greater', 'JJR'), ('importance', 'NN'), ('on', 'IN'), ('imports', 'NNS'), (',', ','), ("''", "''"), ('Xi', 'NNP'), ('told', 'VBD'), ('business', 'NN'), ('leaders', 'NNS'), ('at', 'IN'), ('the', 'DT'), ('expo', 'NN'), ('.', '.'), ('More', 'JJR'), ('than', 'IN'), ('3,800', 'CD'), ('enterprises', 'NNS'), ('and', 'CC'), ('181', 'CD'), ('countries', 'NNS'), (',', ','), ('regions', 'NNS'), ('and', 'CC'), ('international', 'JJ'), ('organizations', 'NNS'), ('are', 'VBP'), ('attending', 'VBG'), ('the', 'DT'), ('expo', 'NN'), ('.', '.'), ("''", "''")]
``_`` China_NNP will_MD place_VB greater_JJR importance_NN on_IN 

In [10]:
from nltk import word_tokenize, pos_tag

text = '''"China will place greater importance on imports," Xi told business leaders at the expo. More than 3,800 enterprises and 181 countries, regions and international organizations are attending the expo."'''
tokens = word_tokenize(text, language = "english")

pos = pos_tag(tokens)
posString = [i[0] + "_" + i[1] for i in pos]
print(" ".join(posString))

``_`` China_NNP will_MD place_VB greater_JJR importance_NN on_IN imports_NNS ,_, ''_'' Xi_NNP told_VBD business_NN leaders_NNS at_IN the_DT expo_NN ._. More_JJR than_IN 3,800_CD enterprises_NNS and_CC 181_CD countries_NNS ,_, regions_NNS and_CC international_JJ organizations_NNS are_VBP attending_VBG the_DT expo_NN ._. ''_''
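
The `word_TAG` format above is easy to read back into tuples. The following standard-library sketch parses such a string and counts the tags; the helper name `parse_tagged` is hypothetical. Splitting on the last underscore keeps tokens that themselves contain an underscore intact.

```python
from collections import Counter

# A shortened copy of the tagged string printed above.
tagged_string = "China_NNP will_MD place_VB greater_JJR importance_NN on_IN imports_NNS ,_,"


def parse_tagged(s, sep="_"):
    """Turn 'word_TAG word_TAG ...' into a list of (word, tag) tuples."""
    pairs = []
    for item in s.split():
        word, tag = item.rsplit(sep, 1)  # split on the last separator only
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged(tagged_string)
tag_counts = Counter(tag for _, tag in pairs)
print(pairs)
print(tag_counts)
```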


* Chunking adds structure to a sentence on top of part-of-speech (POS) tagging by grouping tagged words into phrases. It is also known as shallow parsing, and the resulting groups of words are called "chunks".

* An existing chunking corpus: CoNLL2000

In [9]:
from nltk.corpus import conll2000

sent = conll2000.chunked_sents()
print(sent[0])
 
#chunked_sentence = conll2000.chunked_sents()[0]  # sentence 0 (the first sentence)
#print(chunked_sentence)

(S
  (NP Confidence/NN)
  (PP in/IN)
  (NP the/DT pound/NN)
  (VP is/VBZ widely/RB expected/VBN to/TO take/VB)
  (NP another/DT sharp/JJ dive/NN)
  if/IN
  (NP trade/NN figures/NNS)
  (PP for/IN)
  (NP September/NNP)
  ,/,
  due/JJ
  (PP for/IN)
  (NP release/NN)
  (NP tomorrow/NN)
  ,/,
  (VP fail/VB to/TO show/VB)
  (NP a/DT substantial/JJ improvement/NN)
  (PP from/IN)
  (NP July/NNP and/CC August/NNP)
  (NP 's/POS near-record/JJ deficits/NNS)
  ./.)


* Regular-expression chunking

In [12]:
import nltk

text = 'Ravi is the CEO of a company. He is very powerful public speaker also.'

# POS-based grammar rules
grammar = '\n'.join([
    'NP: {<DT>*<NNP>}',  # zero or more DT followed by one NNP
    'NP: {<JJ>*<NN>}',  # zero or more JJ followed by one NN
    'NP: {<NNP>+}',  # one or more NNP
])

# Build the parser once from the grammar rules
chunkparser = nltk.RegexpParser(grammar)

sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(words)
    result = chunkparser.parse(tags)
    print(result)

(S
  (NP Ravi/NNP)
  is/VBZ
  (NP the/DT CEO/NNP)
  of/IN
  a/DT
  (NP company/NN)
  ./.)
(S
  He/PRP
  is/VBZ
  very/RB
  (NP powerful/JJ public/JJ speaker/NN)
  also/RB
  ./.)
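
A chunk rule such as `NP: {<DT>*<NNP>}` is essentially a regular expression over the sequence of tags. The sketch below mimics that idea with the standard `re` module: it encodes the tags as one string, matches a pattern against it, and maps each match back to the words. This is an approximation for illustration, not NLTK's implementation; in particular, merging the first two rules into a single alternation is a simplifying assumption, and `chunk_np` is a hypothetical name.

```python
import re


def chunk_np(tagged, pattern=r"(?:<DT>)*<NNP>|(?:<JJ>)*<NN>"):
    """Return the word groups whose tag sequence matches the pattern."""
    tag_string = ""
    starts = []  # character offset in tag_string where each token's tag begins
    for _, tag in tagged:
        starts.append(len(tag_string))
        tag_string += f"<{tag}>"

    chunks = []
    for match in re.finditer(pattern, tag_string):
        # Collect the tokens whose tags lie inside the matched span.
        indices = [i for i, s in enumerate(starts) if match.start() <= s < match.end()]
        chunks.append(" ".join(tagged[i][0] for i in indices))
    return chunks


# Tagged tokens copied from the first sentence in the output above.
tagged = [("Ravi", "NNP"), ("is", "VBZ"), ("the", "DT"), ("CEO", "NNP"),
          ("of", "IN"), ("a", "DT"), ("company", "NN"), (".", ".")]
print(chunk_np(tagged))
```

Wrapping every tag in angle brackets keeps `<NN>` from matching inside `<NNP>` or `<NNS>`, since the closing bracket must follow the tag name directly.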


### Parsed Corpora

In [14]:
from nltk.corpus import treebank
#print(treebank.fileids())

# The Penn Treebank (Wall Street Journal sample)
tree = treebank.parsed_sents("wsj_0003.mrg")[0]
print(tree)
tree.draw()

(S
  (S-TPC-1
    (NP-SBJ
      (NP (NP (DT A) (NN form)) (PP (IN of) (NP (NN asbestos))))
      (RRC
        (ADVP-TMP (RB once))
        (VP
          (VBN used)
          (NP (-NONE- *))
          (S-CLR
            (NP-SBJ (-NONE- *))
            (VP
              (TO to)
              (VP
                (VB make)
                (NP (NNP Kent) (NN cigarette) (NNS filters))))))))
    (VP
      (VBZ has)
      (VP
        (VBN caused)
        (NP
          (NP (DT a) (JJ high) (NN percentage))
          (PP (IN of) (NP (NN cancer) (NNS deaths)))
          (PP-LOC
            (IN among)
            (NP
              (NP (DT a) (NN group))
              (PP
                (IN of)
                (NP
                  (NP (NNS workers))
                  (RRC
                    (VP
                      (VBN exposed)
                      (NP (-NONE- *))
                      (PP-CLR (TO to) (NP (PRP it)))
                      (ADVP-TMP
                        (NP
                 

* Syntactic parsing

In [15]:
import nltk

def SRParserExample(grammar, textlist):
    parser = nltk.parse.ShiftReduceParser(grammar)
    for text in textlist:
        sentence = nltk.word_tokenize(text)
        for tree in parser.parse(sentence):
            print(tree)
            tree.draw()

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> NNP VBZ
VP -> IN NNP | DT NN IN NNP
NNP -> 'Tajmahal' | 'Agra' | 'Bangalore' | 'Karnataka'
VBZ -> 'is'
IN -> 'in' | 'of'
DT -> 'the'
NN -> 'capital'
""")

text = [
    "Tajmahal is in Agra",
    "Bangalore is the capital of Karnataka",
]

SRParserExample(grammar,text)

(S (NP (NNP Tajmahal) (VBZ is)) (VP (IN in) (NNP Agra)))
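
The output above contains a tree for the first sentence only. One plausible explanation is that a shift-reduce parser reduces greedily and never backtracks. The simplified recognizer below works on category labels only and assumes that, after each shift, the first rule (in listed order) whose right-hand side matches the top of the stack is applied. It is a sketch under those assumptions, not NLTK's implementation, and the dictionary `LEXICON` stands in for the lexical rules such as `NNP -> 'Agra'`.

```python
# Phrase rules in the same order as the grammar above.
RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("NNP", "VBZ")),
    ("VP", ("IN", "NNP")),
    ("VP", ("DT", "NN", "IN", "NNP")),
]

# Same rules, but with the longer VP rule tried before the shorter one.
REORDERED = [RULES[0], RULES[1], RULES[3], RULES[2]]

LEXICON = {
    "Tajmahal": "NNP", "Agra": "NNP", "Bangalore": "NNP", "Karnataka": "NNP",
    "is": "VBZ", "in": "IN", "of": "IN", "the": "DT", "capital": "NN",
}


def shift_reduce(tokens, rules):
    """Return the final stack; ['S'] means the sentence was recognized."""
    stack = []
    for token in tokens:
        stack.append(LEXICON[token])  # shift the token's category
        reduced = True
        while reduced:  # keep reducing until no rule matches
            reduced = False
            for lhs, rhs in rules:
                if tuple(stack[-len(rhs):]) == rhs:
                    del stack[-len(rhs):]
                    stack.append(lhs)
                    reduced = True
                    break
    return stack


print(shift_reduce("Tajmahal is in Agra".split(), RULES))
print(shift_reduce("Bangalore is the capital of Karnataka".split(), RULES))
print(shift_reduce("Bangalore is the capital of Karnataka".split(), REORDERED))
```

With the original rule order, `IN NNP` is reduced to `VP` as soon as "Karnataka" is shifted, which leaves `DT NN` stranded on the stack. Trying the longer `VP` rule first lets this sketch recognize the second sentence.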


## Frequency Distribution

NLTK provides the FreqDist class, which lets us easily compute a frequency distribution from a list of tokens.

In [17]:
import nltk

text = '''"China will place greater importance on imports," Xi told business leaders at the expo. More than 3,800 enterprises and 181 countries, regions and international organizations are attending the expo."'''
tokens = nltk.word_tokenize(text, language = "english")

fd = nltk.FreqDist(tokens)
print(fd.most_common(10))

[(',', 2), ("''", 2), ('the', 2), ('expo', 2), ('.', 2), ('and', 2), ('``', 1), ('China', 1), ('will', 1), ('place', 1)]
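
`FreqDist` builds on Python's `collections.Counter`, so the core of what it computes can be sketched with the standard library alone. The token list below is made up for illustration.

```python
from collections import Counter

# Illustrative token list (not taken from a real corpus).
tokens = ["the", "expo", ".", "More", "than", "the", "expo", "and", "the", "regions"]

counts = Counter(tokens)

print(counts.most_common(2))  # the two most frequent tokens with their counts
print(counts["expo"])         # count of a single token
print(counts["missing"])      # tokens that never occur count as 0

# Relative frequency of one token.
relative = counts["the"] / sum(counts.values())
print(relative)
```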


In [20]:
import nltk

brown = nltk.corpus.brown.tagged_words()

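# Named `tags` to avoid shadowing the pos_tag function imported in earlier cells
tags = [tag for _, tag in brown]

fd = nltk.FreqDist(tags)
print(fd.most_common(10))  # the 10 most frequent POS tags

print(fd['JJR'])  # frequency of comparative adjectives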

[('NN', 152470), ('IN', 120557), ('AT', 97959), ('JJ', 64028), ('.', 60638), (',', 58156), ('NNS', 55110), ('CC', 37718), ('RB', 36464), ('NP', 34476)]
1958


## Loading your Own Corpus

In [31]:
import nltk
from nltk.corpus import PlaintextCorpusReader

# As it reads in a corpus, NLTK applies word tokenization and sentence tokenization

#corpusRoot = "E:/myCorpus"
myCorpus = PlaintextCorpusReader("E:/myCorpus", ".*txt")  # all files ending in 'txt'
#print(myCorpus.fileids())

# Build a word frequency distribution for the entire corpus
myCorpus_freq = nltk.FreqDist(myCorpus.words())
myCorpus_freq.most_common(100)

myCorpus_freq.get("she")  # frequency of one word

sent = myCorpus.sents()  # the sentences in the corpus

# Print every sentence that contains "she", prefixed with its index
for i, wordList in enumerate(sent):
    if "she" in wordList:
        print(str(i) + ": " + " ".join(wordList) + "\n")

#pos = pos_tag(sent)
#posString = [i[0] + "_" + i[1] for i in pos]
#print(" ".join(posString))  # get the POS-tagged string


5: Even before Miss Taylor had ceased to hold the nominal office of governess , the mildness of her temper had hardly allowed her to impose any restraint ; and the shadow of authority being now long passed away , they had been living together as friend and friend very mutually attached , and Emma doing just what she liked ; highly esteeming Miss Taylor ' s judgment , but directed chiefly by her own .

12: Her father composed himself to sleep after dinner , as usual , and she had then only to sit and think of what she had lost .

14: Mr . Weston was a man of unexceptionable character , easy fortune , suitable age , and pleasant manners ; and there was some satisfaction in considering with what self - denying , generous friendship she had always wished and promoted the match ; but it was a black morning ' s work for her .

16: She recalled her past kindness -- the kindness , the affection of sixteen years -- how she had taught and how she had played with her from five years old -- how sh

7501: " But it is ," returned she ; " for Mrs . Long has just been here , and she told me all about it ."

7527: When a woman has five grown - up daughters , she ought to give over thinking of her own beauty ."

7538: Lizzy is not a bit better than the others ; and I am sure she is not half so handsome as Jane , nor half so good - humoured as Lydia .

7551: When she was discontented , she fancied herself nervous .

7554: He had always intended to visit him , though to the last always assuring his wife that he should not go ; and till the evening after the visit was paid she had no knowledge of it .

7567: " Kitty has no discretion in her coughs ," said her father ; " she times them ill ." " I do not cough for my own amusement ," replied Kitty fretfully .

7570: " Aye , so it is ," cried her mother , " and Mrs . Long does not come back till the day before ; so it will be impossible for her to introduce him , for she will not know him herself ."

7576: But if we do not venture somebody e

13501: Mrs . John Dashwood had never been a favourite with any of her husband ' s family ; but she had had no opportunity , till the present , of shewing them with how little attention to the comfort of other people she could act when occasion required it .

13502: So acutely did Mrs . Dashwood feel this ungracious behaviour , and so earnestly did she despise her daughter - in - law for it , that , on the arrival of the latter , she would have quitted the house for ever , had not the entreaty of her eldest girl induced her first to reflect on the propriety of going , and her own tender love for all her three children determined her afterwards to stay , and for their sakes avoid a breach with their brother .

13504: She had an excellent heart ;-- her disposition was affectionate , and her feelings were strong ; but she knew how to govern them : it was a knowledge which her mother had yet to learn ; and which one of her sisters had resolved never to be taught .

13507: She was generous ,
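
The search loop above matches the lowercase token "she" only, so a sentence in which the word appears solely as "She" is skipped. The sketch below wraps the same idea in a function with an optional case-insensitive match. It runs on a small hand-made list of tokenized sentences rather than a corpus reader, and the name `find_sentences` is hypothetical.

```python
def find_sentences(sentences, word, ignore_case=True):
    """Return (index, text) for every tokenized sentence containing the word."""
    target = word.lower() if ignore_case else word
    hits = []
    for index, sent_tokens in enumerate(sentences):
        candidates = [t.lower() for t in sent_tokens] if ignore_case else sent_tokens
        if target in candidates:
            hits.append((index, " ".join(sent_tokens)))
    return hits


# Hand-made sample in the same shape as the output of a corpus reader's sents().
sentences = [
    ["She", "was", "generous", "."],
    ["He", "had", "always", "intended", "to", "visit", "him", "."],
    ["When", "she", "was", "discontented", ",", "she", "fancied", "herself", "nervous", "."],
]

for index, text in find_sentences(sentences, "she"):
    print(f"{index}: {text}")
```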