# Processing the 20 Newsgroups dataset

We download the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz) and store it at /Users/huangwaleking/git/detector/20news-18828 (see [http://qwone.com/~jason/20Newsgroups/](http://qwone.com/~jason/20Newsgroups/)).

First, we read it with NLTK.

In [2]:
import nltk
newsgroups = \
  nltk.corpus.PlaintextCorpusReader('/Users/huangwaleking/git/detector/20news-18828', '.*/[0-9]+', encoding='latin1')

In [12]:
ids = newsgroups.fileids()
print("There are %s files in 20newsgroups" % len(ids))
print("The ids of the first 20 files are:\n %s" % ids[:20])
print("The content of the first file is %s" % list(newsgroups.words(fileids=ids[0])))
print("The type of newsgroups.words is %s" % type(newsgroups.words(fileids=ids[0])))

There are 18828 files in 20newsgroups
The ids of the first 20 files are:
 ['alt.atheism/49960', 'alt.atheism/51060', 'alt.atheism/51119', 'alt.atheism/51120', 'alt.atheism/51121', 'alt.atheism/51122', 'alt.atheism/51123', 'alt.atheism/51124', 'alt.atheism/51125', 'alt.atheism/51126', 'alt.atheism/51127', 'alt.atheism/51128', 'alt.atheism/51130', 'alt.atheism/51131', 'alt.atheism/51132', 'alt.atheism/51133', 'alt.atheism/51134', 'alt.atheism/51135', 'alt.atheism/51136', 'alt.atheism/51139']
The content of the first file is [u'From', u':', u'mathew', u'<', u'mathew', u'@', u'mantis', u'.', u'co', u'.', u'uk', u'>', u'Subject', u':', u'Alt', u'.', u'Atheism', u'FAQ', u':', u'Atheist', u'Resources', u'Archive', u'-', u'name', u':', u'atheism', u'/', u'resources', u'Alt', u'-', u'atheism', u'-', u'archive', u'-', u'name', u':', u'resources', u'Last', u'-', u'modified', u':', u'11', u'December', u'1992', u'Version', u':', u'1', u'.', u'0', u'Atheist', u'Resources', u'Addresses', u'of', u'Atheist', u'Organizations', u'USA', u'FREEDOM', u'FROM', u'RELIG
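
Each file id has the form `newsgroup/article-number`, so the newsgroup label can be recovered from the id itself. The following is a minimal sketch, assuming that id format holds for every file. `newsgroup_of` and `sample_ids` are illustrative names, and the `sci.space` id is a hypothetical value added only to show a second group.

```python
from collections import Counter

# A few ids copied from the listing above, plus one hypothetical id
# ('sci.space/00000') that is NOT taken from the corpus output.
sample_ids = [
    'alt.atheism/49960',
    'alt.atheism/51060',
    'alt.atheism/51119',
    'sci.space/00000',  # hypothetical, for illustration only
]


def newsgroup_of(file_id):
    """Return the newsgroup part of an id such as 'alt.atheism/49960'."""
    return file_id.split('/', 1)[0]


# Count how many files belong to each newsgroup.
counts = Counter(newsgroup_of(file_id) for file_id in sample_ids)
for group, n_files in sorted(counts.items()):
    print("%s: %d" % (group, n_files))
```

In the notebook, passing `ids` (the result of `newsgroups.fileids()`) instead of `sample_ids` would give the per-newsgroup file counts for the whole corpus.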

### Syntactic parsing with NLTK
The NLTK examples are taken from [PyCon2016](http://pycon.districtdatalabs.com/tutorial).

In [13]:
import nltk
grammar = nltk.grammar.CFG.fromstring("""
S -> NP
NP -> N N | ADJP NP | DET N
ADJP -> ADJ NP
DET -> 'an'
N ->'airplane'
""")

parser = nltk.parse.ChartParser(grammar)
p=list(parser.parse(nltk.word_tokenize("an airplane")))
for a in p:
    a.pprint()

(S (NP (DET an) (N airplane)))


### Stemming and lemmatization with NLTK

In [9]:
import time
start = time.time()

import nltk
import string
lemmatizer= nltk.stem.wordnet.WordNetLemmatizer()
stopwords=set(nltk.corpus.stopwords.words('english'))
punctuation=string.punctuation

def normalize(text):
    for token in nltk.word_tokenize(text):
        token=token.lower()
        token=lemmatizer.lemmatize(token)
        if token not in stopwords and token not in punctuation:
            yield token

print(list(normalize("The eagle flies at midnight.")))


end = time.time()
print("The stemming and lemmatization cost %s seconds" % (end-start))

['eagle', u'fly', 'midnight']
The stemming and lemmatization cost 0.00257587432861 seconds


### Named entity recognition with NLTK

In [8]:
print(
    nltk.ne_chunk(
        nltk.pos_tag(
            nltk.word_tokenize(
                "John Smith is from the United States of America and works at Microsoft Research Labs"))))

(S
  (PERSON John/NNP)
  (PERSON Smith/NNP)
  is/VBZ
  from/IN
  the/DT
  (GPE United/NNP States/NNPS)
  of/IN
  (GPE America/NNP)
  and/CC
  works/VBZ
  at/IN
  (ORGANIZATION Microsoft/NNP Research/NNP Labs/NNP))


### Named entity recognition with NLTK and the Stanford NER system

In [55]:
# change the paths below to point to wherever you unzipped the Stanford NER download file
import os
import nltk.tag.stanford as st

stanford_root = '/Users/huangwaleking/Documents/stanford-ner-2014-01-04'
stanford_data = os.path.join(stanford_root,'classifiers/english.all.3class.distsim.crf.ser.gz')
stanford_jar  = os.path.join(stanford_root,'stanford-ner.jar')

tagger = st.StanfordNERTagger(stanford_data, stanford_jar, 'utf-8')

for tagged in tagger.tag(
  "John Bengfort is from the United States of America and "
  "works at Microsoft Research Labs".split()):
    print('[' + tagged[1] + '] ' + tagged[0])
    

import time
start = time.time()

for tagged in tagger.tag(
  "Pandas makes it super simple to apply custom functions over groups of data.".split()):
    print('[' + tagged[1] + '] ' + tagged[0])
    
end = time.time()
print("The NER process costs %s seconds" % (end-start))



[PERSON] John
[PERSON] Bengfort
[O] is
[O] from
[O] the
[LOCATION] United
[LOCATION] States
[LOCATION] of
[LOCATION] America
[O] and
[O] works
[O] at
[ORGANIZATION] Microsoft
[ORGANIZATION] Research
[ORGANIZATION] Labs
[O] Pandas
[O] makes
[O] it
[O] super
[O] simple
[O] to
[O] apply
[O] custom
[O] functions
[O] over
[O] groups
[O] of
[O] data.
The NER process costs 3.45166993141 seconds
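
The tagger returns one `(token, label)` pair per token, so multi-word entities such as "United States of America" are spread over several pairs. The sketch below shows one simple way to join them, assuming that consecutive tokens with the same non-`O` label belong to one entity. `merge_entities` is an illustrative helper, not part of NLTK, and two distinct entities of the same type that sit next to each other would be merged into one.

```python
from itertools import groupby

# (token, label) pairs copied from the first tagger output above.
tagged = [
    ("John", "PERSON"), ("Bengfort", "PERSON"),
    ("is", "O"), ("from", "O"), ("the", "O"),
    ("United", "LOCATION"), ("States", "LOCATION"),
    ("of", "LOCATION"), ("America", "LOCATION"),
    ("and", "O"), ("works", "O"), ("at", "O"),
    ("Microsoft", "ORGANIZATION"), ("Research", "ORGANIZATION"),
    ("Labs", "ORGANIZATION"),
]


def merge_entities(tagged_tokens):
    """Join runs of consecutive tokens that share a non-'O' label."""
    entities = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != "O":
            entities.append((label, " ".join(token for token, _ in run)))
    return entities


for label, text in merge_entities(tagged):
    print("[%s] %s" % (label, text))
```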


### Distributed representations
There are four common ways to encode a document as a vector:
(1) bag of words
(2) one-hot (used frequently in neural networks)
(3) TF-IDF
(4) distributed representation (doc2vec; see the [paper](https://cs.stanford.edu/~quocle/paragraph_vector.pdf), or [view it locally](http://localhost:8081/mypapers/Files/D9/D9263895-27FC-4213-9DCF-502F6EC7E7E4.pdf))
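
The first three encodings can be illustrated with the standard library alone. This is a rough sketch under stated assumptions: the two toy documents are made up, "one-hot" is interpreted here as a binary present/absent indicator per vocabulary word, and TF-IDF uses the textbook weight `tf * log(N / df)` rather than the smoothed variant that scikit-learn applies. The function names are illustrative.

```python
import math
from collections import Counter

# Two made-up toy documents, tokenized on whitespace.
docs = [
    "the eagle flies at midnight",
    "the eagle flies the eagle lands",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({token for doc in tokenized for token in doc})


def bag_of_words(tokens):
    """Raw count of every vocabulary word in the document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def one_hot(tokens):
    """1 if the vocabulary word occurs in the document, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]


def tf_idf(tokens):
    """Term count scaled by log(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    weights = []
    for word in vocab:
        df = sum(1 for doc in tokenized if word in doc)
        weights.append(counts[word] * math.log(n_docs / df))
    return weights


print(vocab)
print(bag_of_words(tokenized[1]))
print(one_hot(tokenized[1]))
print(tf_idf(tokenized[0]))
```

Words that occur in every document (here "the", "eagle" and "flies") get a TF-IDF weight of zero, which is the down-weighting of uninformative terms that the raw counts lack.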

In [10]:
import gensim
documents=gensim.models.doc2vec.TaggedLineDocument("usedForNLTK_example.txt")
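# In gensim 4.0 and later, use `vector_size` instead of `size`
# and `model.dv` instead of `model.docvecs`.
model = gensim.models.doc2vec.Doc2Vec(documents, size=7, min_count=0)
model.docvecs[0]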

array([-0.02257801, -0.02108288,  0.02712592,  0.00483102,  0.04937694,
       -0.03264765, -0.03207875], dtype=float32)

In [56]:
import sklearn.datasets
corpus=sklearn.datasets.fetch_20newsgroups(remove=('headers','footers','quotes'))

In [58]:
type(corpus)

sklearn.datasets.base.Bunch

In [60]:
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 50

# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.

t0 = time()
print("Loading dataset and extracting TF-IDF features...")
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))

vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features,
                             stop_words='english')
tfidf = vectorizer.fit_transform(dataset.data[:n_samples])
print("done in %0.3fs." % (time() - t0))

# Fit the NMF model
print("Fitting the NMF model with n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
nmf = NMF(n_components=n_topics, random_state=1).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

feature_names = vectorizer.get_feature_names()

for topic_idx, topic in enumerate(nmf.components_):
    print("Topic #%d:" % topic_idx)
    print(" ".join([feature_names[i]
                    for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

Loading dataset and extracting TF-IDF features...
done in 4.528s.
Fitting the NMF model with n_samples=2000 and n_features=1000...
done in 4.741s.
Topic #0:
god jesus believe faith christians bible christ truth people christian church rutgers does hell life evidence question atheists say true apr man jews religion did brian peace existence islam mean accept belief reason exist 1993 things religious fact word know cwru death earth world physical john come spirit muslims judge
()
Topic #1:
key chip encryption clipper keys algorithm secret security government public uni chips use des communications netcom number phone information bit using message law data house device bits available standard does clinton used text door att code private technology know 80 org known ca request able administration technical end mail systems
()
Topic #2:
edu cs university article nntp host posting writes cc pitt reply distribution science berkeley computer colorado caltech keith uiuc news cwru institute cmu 
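
The slice `topic.argsort()[:-n_top_words - 1:-1]` used above is compact but easy to misread: `argsort` returns indices in ascending order of weight, and the slice walks that list backwards from the end, taking `n_top_words` entries. A small sketch of the same idea, using a pure-Python stand-in for `argsort` and made-up weights and words:

```python
def top_indices(weights, n_top):
    """Indices of the n_top largest weights, largest first."""
    # Stand-in for numpy's argsort: indices sorted by ascending weight.
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Same slice as in the notebook: walk backwards from the last index.
    return ascending[:-n_top - 1:-1]


# Hypothetical topic weights over a five-word vocabulary.
weights = [0.1, 0.7, 0.0, 0.4, 0.2]
words = ["apple", "god", "key", "windows", "edu"]

top = top_indices(weights, 3)
print(" ".join(words[i] for i in top))
```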

In [None]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck <L.J.Buitinck@uva.nl>
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()


# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
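# All fetched posts are used here, so the n_samples value printed below
# does not limit the data.
data_samples = dataset.data
print("done in %0.3fs." % (time() - t0))

# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
# max_features is disabled, so the NMF vocabulary is not capped at n_features.
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,  # max_features=n_features,
                                   stop_words='english')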
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

# Fit the NMF model
print("Fitting the NMF model with tf-idf features,"
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online', learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)


Loading dataset...
done in 3.599s.
Extracting tf-idf features for NMF...
done in 5.298s.
Extracting tf features for LDA...
done in 5.165s.
Fitting the NMF model with tf-idf features,n_samples=2000 and n_features=1000...
done in 4.328s.

Topics in NMF model:
Topic #0:
edu com article writes don like just people university posting think host nntp know ca good cs time new distribution
Topic #1:
god jesus bible christians faith christian believe christ people life hell truth say church christianity religion sin heaven does existence
Topic #2:
pitt geb banks gordon cs cadre dsl n3jxp chastity shameful skepticism intellect surrender pittsburgh edu univ soon science computer reply
Topic #3:
ohio magnus state acs edu university ryan magnusug cis cleveland scharfy rscharfy nntp host drugs posting cwru article nielsen oil
Topic #4:
windows dos file window card files mouse ms program video use screen pc drivers problem thanks using graphics version help
Topic #5:
sandvik kent apple newton ksand a