<a href="https://colab.research.google.com/github/tanimuranaomichi/Information_System_Analysis/blob/master/Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Applying clustering with scikit-learn to NLTK corpus documents

## Introduction

### Import the required libraries and datasets

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
import collections

### Download the NLTK resources used in this notebook


In [None]:
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("reuters")
nltk.download("punkt")
nltk.download("brown")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

### Load the data

In [None]:
from nltk.corpus import brown as corpus
# !unzip /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora

### Inspect the contents of the dataset. In some cases you may need to run the following command first.
"!unzip /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora"

In [None]:
for n, item in enumerate(corpus.words(corpus.fileids()[0])[:300]):
    print(item, end=" ")
    if n % 25 == 24:
        print(" ")

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .  
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise  
and thanks of the City of Atlanta '' for the manner in which the election was conducted . The September-October term jury had been charged  
by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan  
Allen Jr. . `` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in  
the election , the number of voters and the size of this city '' . The jury said it did find that many of Georgia's  
registration and election laws `` are outmoded or inadequate and often ambiguous '' . It recommended that Fulton legislators act `

### Total number of documents

In [None]:
len(corpus.fileids())

500

### (Example) Training on only the first k documents

In [None]:
k = 100
docs=[corpus.words(fileid) for fileid in corpus.fileids()[:k]]

### Training on all documents

In [None]:
# docs=[corpus.words(fileid) for fileid in corpus.fileids()]

# print(docs[:5])
# print("num of docs:", len(docs))

## Preprocessing

### Example: building a stopword list

### NLTK's stopword list

In [None]:
en_stop = nltk.corpus.stopwords.words('english')

### Example (advanced): try removing symbols and digits with regular expressions

In [None]:
en_stop= ["``","/",",.",".,",";","--",":",")","(",'"','&',"'",'),',',"','-','.,','.,"','.-',"?",">","<"]                  \
         +["0","1","2","3","4","5","6","7","8","9","10","11","12","86","1986","1987","000"]                                                      \
         +["said","say","u","v","mln","ct","net","dlrs","tonne","pct","shr","nil","company","lt","share","year","billion","price"]          \
         +en_stop
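
The heading above mentions regular expressions, while the cell itself lists symbols and digits by hand. A minimal sketch of the regex alternative follows, assuming that any token containing no ASCII letter at all should be dropped. The names `NON_ALPHA` and `is_symbol_or_number` are hypothetical and not part of the notebook.

```python
import re

# Hypothetical pattern: a token made up entirely of non-letter characters
# (digits, punctuation, quotes), e.g. "``", "--", "1986", "63".
NON_ALPHA = re.compile(r"[^a-zA-Z]+")

def is_symbol_or_number(token):
    """Return True when the token contains no ASCII letters at all."""
    return NON_ALPHA.fullmatch(token) is not None

tokens = ["The", "jury", "``", "no", "evidence", "''", ",", "1986", "term-end", "."]
kept = [t for t in tokens if not is_symbol_or_number(t)]
print(kept)
```

A check like this could be called at the start of `preprocess_word` in place of the hand-written symbol and digit entries. Domain words such as "said" would still need the explicit list.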

### Define the preprocessing functions

In [None]:
from nltk.corpus import wordnet as wn  # needed for lemmatization with wn.morphy

def preprocess_word(word, stopwordset):
    # 1. Lowercase the word, e.g. Python => python
    word = word.lower()

    # 2. Remove "," and "."
    if word in [",", "."]:
        return None

    # 3. Remove stopwords, e.g. the => None
    if word in stopwordset:
        return None

    # 4. Lemmatize, e.g. cooked => cook
    lemma = wn.morphy(word)
    if lemma is None:
        # WordNet does not know the word, so keep it unchanged
        return word
    elif lemma in stopwordset:  # the lemma itself may be a stopword
        return None
    else:
        return lemma
    

def preprocess_document(document):
    document=[preprocess_word(w, en_stop) for w in document]
    document=[w for w in document if w is not None]
    return document

def preprocess_documents(documents):
    return [preprocess_document(document) for document in documents]

### Print the preprocessing results

### Before preprocessing

In [None]:
print(docs[0][:25]) 

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']


### After preprocessing

In [None]:
print(preprocess_documents(docs)[0][:25])

['fulton', 'county', 'grand', 'jury', 'friday', 'investigation', "atlanta's", 'recent', 'primary', 'election', 'produce', 'evidence', "''", 'irregularity', 'take', 'place', 'jury', 'term-end', 'presentment', 'city', 'executive', 'committee', 'over-all', 'charge', 'election']


## Clustering

### Vectorize the preprocessed documents above with TF-IDF
### Use a vectorizer (set the hyperparameters)

In [None]:
pre_docs=preprocess_documents(docs)
pre_docs=[" ".join(doc) for doc in pre_docs]
print(pre_docs[0])

vectorizer = TfidfVectorizer(max_features=200, token_pattern=u'(?u)\\b\\w+\\b' )



### Fit the vectorizer

In [None]:
tf_idf = vectorizer.fit_transform(pre_docs)
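
To make the weighting concrete, here is a rough pure-Python sketch of what `TfidfVectorizer` computes under its default settings: raw term counts multiplied by a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, with each row then L2-normalized. The toy documents and the helper `tfidf_row` are hypothetical, and the sketch ignores options such as `max_features`.

```python
import math
from collections import Counter

# Toy, already-preprocessed documents (hypothetical; not from the Brown corpus).
toy_docs = ["jury election city", "election vote city city", "bank account property"]

tokenized = [doc.split() for doc in toy_docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(w for doc in tokenized for w in set(doc))

# Smoothed idf, the default formula documented for scikit-learn's TfidfVectorizer.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
for w, value in zip(vocab, matrix[0]):
    print(f"{w:10s} {value:.3f}")
```

In the first toy document, "jury" occurs in only one document and therefore gets a larger weight than "election" or "city", which each occur in two.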

### K-means
### Configure K-means

In [None]:
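# With num_clusters = 1 every document is assigned to cluster 0.
# Increase this value (for example to 4, the number of categories in the
# labels used for the purity score) to obtain a meaningful clustering.
num_clusters = 1
km = KMeans(n_clusters=num_clusters, random_state=0)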

### Fit K-means and predict the cluster labels

In [None]:
clusters = km.fit_predict(tf_idf)

### Output

In [None]:
for doc, cls in zip(pre_docs, clusters):
    print(cls, doc)

0 austin texas committee approval gov. daniel's abandon property '' act seem certain thursday despite adamant protest texas banker daniel personally led fight measure water considerably since rejection two previous legislature public hearing house committee revenue taxation committee rule go automatically subcommittee one week question committee member taunt banker appearing witness left little doubt recommend passage daniel term extremely conservative '' estimate would produce 17 million dollar help erase anticipate deficit 63 million dollar end current fiscal next aug. 31 tell committee measure would merely provide means enforce escheat law book since texas republic '' permit state take bank account stocks personal property person miss seven years bill daniel draft personally would force banks insurance firm pipeline corporation report property state treasurer escheat law cannot enforce almost impossible locate property daniel declare dewey lawrence tyler lawyer represent texas banke
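
Printing every document with its label becomes hard to read once there are many documents. A short sketch of summarizing the cluster sizes with `collections.Counter` follows. The list `example_clusters` is a hypothetical stand-in for the `clusters` array returned by `fit_predict`.

```python
from collections import Counter

# Hypothetical labels standing in for the `clusters` array returned by fit_predict.
example_clusters = [0, 2, 1, 0, 0, 3, 1, 0]

# Count how many documents fall into each cluster.
sizes = Counter(example_clusters)
for label, count in sorted(sizes.items()):
    print(f"cluster {label}: {count} documents")
```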

### Evaluate the clustering with purity

In [None]:
import numpy as np
from sklearn import metrics

def purity_score(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # return purity
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)

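# Reference labels for the first 100 documents (k = 100):
# 44, 27, 17 and 12 documents in categories 0, 1, 2 and 3.
test_clusters0 = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 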
                 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
                 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
                 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3
                 ])
purity_score(test_clusters0, clusters)

0.44
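
Purity assigns each cluster to its most frequent reference label and reports the fraction of documents covered by those majority labels. With a single cluster the score is simply the share of the largest category, 44 / 100, which matches the 0.44 above. The following pure-Python sketch reproduces that arithmetic. `purity_from_labels` is a hypothetical helper, not the notebook's `purity_score`.

```python
from collections import Counter, defaultdict

def purity_from_labels(y_true, y_pred):
    """Purity: sum over clusters of the largest true-label count, divided by N."""
    members = defaultdict(Counter)
    for truth, cluster in zip(y_true, y_pred):
        members[cluster][truth] += 1
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(y_true)

# Same label sizes as test_clusters0: 44, 27, 17 and 12 documents.
y_true = [0] * 44 + [1] * 27 + [2] * 17 + [3] * 12

# A single cluster: purity equals the share of the largest class.
single = purity_from_labels(y_true, [0] * 100)
print(single)

# A perfect clustering (cluster ids need not match the label ids).
perfect = purity_from_labels(y_true, [3] * 44 + [2] * 27 + [1] * 17 + [0] * 12)
print(perfect)
```

Purity tends to rise as the number of clusters grows and reaches 1.0 when every document sits in its own cluster, so it is best compared across runs that use the same `num_clusters`.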

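## Exercises
Modify the code in the Clustering section as directed below and check how the results change.<br>
    (1) Vectorize the documents with another method covered in the lectures (e.g. bag-of-words).<br>
    (2) Try a method other than k-means, or visualize the k-means result (e.g. hierarchical clustering).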
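
One possible starting point for both exercises is sketched below. It assumes `CountVectorizer` for bag-of-words and `AgglomerativeClustering` for hierarchical clustering. The toy documents are hypothetical stand-ins for `pre_docs`, and other choices are equally valid.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering

# Toy stand-in for `pre_docs` (hypothetical; swap in the real list in the notebook).
toy_docs = [
    "jury election city vote",
    "election vote mayor city",
    "bank account property tax",
    "tax bank revenue property",
]

# (1) Bag-of-words: raw term counts instead of TF-IDF weights.
bow_vectorizer = CountVectorizer(max_features=200, token_pattern=r"(?u)\b\w+\b")
bow = bow_vectorizer.fit_transform(toy_docs)

# (2) Hierarchical (agglomerative) clustering; it expects a dense array.
agg = AgglomerativeClustering(n_clusters=2)
labels = agg.fit_predict(bow.toarray())
print(labels)
```

The resulting `labels` can be passed to `purity_score` in the same way as the k-means output, which makes the two approaches directly comparable.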


## Hints

scikit-learn's vectorizer and KMeans have many hyperparameters. Some of the vectorizer's hyperparameters also provide preprocessing features (e.g. stop_words).
    Changing the hyperparameter settings changes the final result. Visit the URLs below and try setting the hyperparameters yourself.<br>
    ・Parameters for TF-IDF<br>
    https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html<br>
    ・Parameters for KMeans<br>
    https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html<br>
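
As an illustration of the hint, the sketch below sets a few of these hyperparameters on hypothetical toy documents. The specific values (`max_features=50`, `n_init=10`, and so on) are arbitrary examples rather than recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy, unpreprocessed documents (hypothetical).
toy_docs = [
    "the jury said the election was fair",
    "the election and the vote in the city",
    "a bank account and the property tax",
    "the tax on bank property",
]

vectorizer = TfidfVectorizer(
    stop_words="english",  # built-in English stopword list as a preprocessing step
    max_features=50,       # keep at most the 50 most frequent terms
    ngram_range=(1, 1),    # unigrams only; (1, 2) would add bigrams
)
tf_idf = vectorizer.fit_transform(toy_docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(tf_idf)

print(sorted(vectorizer.vocabulary_))
print(labels)
```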

