# NLTK Corpus Clustering with scikit-learn package

## Preparation
First of all, let us import necessary libraries.
* nltk
* sklearn


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
import collections

We will use the following datasets in this tutorial.

In [15]:
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("semcor")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package semcor to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Load the corpus from NLTK package and check out the contents.

The corpus I used is Semcor corpus from NLTK data. Semcor, Semantic Concordance package is a corpus provided by Princeton University which contains 352 documents from Brown corpus. It contains Brown1 (103 semantically tagged Brown Corpus files contains all content words tagged), Brown2 (83 semantically tagged Brown Corpus files contains all content words tagged), and Brownv (166 semantically tagged Brown Corpus files contains verb only).

In [23]:
from nltk.corpus import semcor as corpus

for n,item in enumerate(corpus.words(corpus.fileids()[0])[:1000]):
    print(item, end=" ")
    if (n%25) ==24:
      print(" ")
#Total number of documents
print("")
print("")
print("Total number of documents:",len(corpus.fileids()))

The Fulton County Grand Jury said Friday an investigation of Atlanta 's recent primary election produced `` no evidence '' that any irregularities took place  
. The jury further said in term end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves  
the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted . The September October term jury  
had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won  
by Mayor-nominate Ivan Allen Jr. . `` Only a relative handful of such reports was received '' , the jury said , `` considering the  
widespread interest in the election , the number of voters and the size of this city '' . The jury said it did find that  
many of Georgia 's registration and election laws `` are outmoded or inadequate and often ambiguous '' . It recommended that Fulton legislators act

In [24]:
#Train the model with first K number of documents or all documents. 
# K=352
# docs=[corpus.words(fileid) for fileid in corpus.fileids()[:K]]

# All documents
docs=[corpus.words(fileid) for fileid in corpus.fileids()]

print(docs[:10])
print("num of documents:", len(docs))

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...], ['Committee', 'approval', 'of', 'Gov.', 'Price', ...], ['The', 'Orioles', 'tonight', 'retained', 'the', ...], ['A', 'Texas', 'halfback', 'who', 'does', "n't", ...], ['Rookie', 'Ron', 'Nischwitz', 'continued', 'his', ...], ['Nick', 'Skorich', ',', 'the', 'line', 'coach', 'for', ...], ['If', 'the', 'Cardinals', 'heed', 'Manager', 'Gene', ...], ['Sizzling', 'temperatures', 'and', 'hot', 'summer', ...], ['The', 'nuclear', 'war', 'is', 'already', 'being', ...], ['It', 'is', 'not', 'news', 'that', 'Nathan', ...]]
num of documents: 352


## Data preprocessing
First, let us define some stopwords. Here we consider English stopwords from the NLTK package and some noises that may affect our result.  
(Optional) Try to ignore numbers and words through regular expression.

In [34]:
# English stopwords defined by the NLTK package.
en_stop = nltk.corpus.stopwords.words('english')

# Ignore noises that might affect our result.
en_stop = ["``","/",",.",".,",";","--",":",")","(",'"','&',"'",'),',',"','-','.,','.,"','.-',"?",">","<","''"]                  \
         +["0","1","2","3","4","5","6","7","8","9","10","11","12","86","1986","1987","000"]                                                      \
         +en_stop

Next, let us define several preprocessing functions.

In [35]:
from nltk.corpus import wordnet as wn # import for lemmatize

def preprocess_word(word, stopwordset):
    
    #1.convert words to lowercase (e.g., Python =>python)
    word=word.lower()
    
    #2.remove "," and "."
    if word in [",",".","'s"]:
        return None
    
    #3.remove stopwords  (e.g., the => (None)) 
    if word in stopwordset:
        return None
    
    #4.lemmatize  (e.g., cooked=>cook)
    lemma = wn.morphy(word)
    if lemma is None:
        return word

    # lemmatized words could be in the stopwords set
    elif lemma in stopwordset: 
        return None
    else:
        return lemma
    

def preprocess_document(document):
    document=[preprocess_word(w, en_stop) for w in document]
    document=[w for w in document if w is not None]
    return document

def preprocess_documents(documents):
    return [preprocess_document(document) for document in documents]

Let us check out the preprocessing result.

In [36]:
# before
print(docs[0][:25]) 

# after
print(preprocess_documents(docs)[0][:25])

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', 'Atlanta', "'s", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place']
['fulton', 'county', 'grand', 'jury', 'say', 'friday', 'investigation', 'atlanta', 'recent', 'primary', 'election', 'produce', 'evidence', 'irregularity', 'take', 'place', 'jury', 'say', 'term', 'end', 'presentment', 'city', 'executive', 'committee', 'over-all']


## Clustering
Document vectorization with tf-idf. We use the TfidfVectorizer that provided by the sklearn package (and set the hyperparameter).

In [53]:
# define the vectorizer
pre_docs=preprocess_documents(docs)
pre_docs=["".join(doc) for doc in pre_docs]
print(pre_docs[0])

vectorizer = TfidfVectorizer(max_features=400, token_pattern=u'(?u)\\b\\w+\\b' )

# fit
tf_idf = vectorizer.fit_transform(pre_docs)
print(tf_idf)
print("")
print("")

#Using Cosinus Similarity
def cos_similarity(pre_docs):
    tfidf = vectorizer.fit_transform(pre_docs)
    return (tfidf * tfidf.T).toarray()
cos_similarity(pre_docs)

  (0, 53)	0.3485430776693097
  (0, 44)	0.3615994698104478
  (0, 121)	0.5388261313788167
  (0, 95)	0.5119512190380637
  (0, 118)	0.2858433467766268
  (0, 107)	0.3370897165650773
  (1, 51)	0.29824473529954854
  (1, 390)	0.25623171538187794
  (1, 355)	0.29966789821937906
  (1, 153)	0.46331134837597926
  (1, 36)	0.3557489493643548
  (1, 242)	0.37417717363825465
  (1, 60)	0.20612509396757228
  (1, 389)	0.30758811252669577
  (1, 44)	0.23461294242268235
  (1, 118)	0.1854608600349057
  (1, 107)	0.2187105260558536
  (2, 278)	0.25297317704012967
  (2, 387)	0.19616497026320626
  (2, 13)	0.23635784169127289
  (2, 48)	0.12920081938035904
  (2, 4)	0.6238612149084335
  (2, 146)	0.5059463540802593
  (2, 398)	0.17040555602351828
  (2, 390)	0.1732327775858986
  :	:
  (346, 356)	0.8407193516226817
  (346, 127)	0.3927503574541057
  (346, 101)	0.263569847029976
  (346, 375)	0.263569847029976
  (347, 385)	0.4805810146092568
  (347, 41)	0.6669026151965052
  (347, 379)	0.4172156988034108
  (347, 36)	0.2682756

array([[1.        , 0.21157374, 0.07168157, ..., 0.        , 0.        ,
        0.14847472],
       [0.21157374, 1.        , 0.17645886, ..., 0.20902381, 0.        ,
        0.20217679],
       [0.07168157, 0.17645886, 1.        , ..., 0.14131652, 0.        ,
        0.        ],
       ...,
       [0.        , 0.20902381, 0.14131652, ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.14847472, 0.20217679, 0.        , ..., 0.        , 0.        ,
        1.        ]])

We use K-means to cluster our documents.

In [38]:
# K-means setting, K parameter = 4
num_clusters = 4
km = KMeans(n_clusters=num_clusters, random_state = 0)

# fit
clusters = km.fit_predict(tf_idf)

for doc, cls in zip(preprocess_documents(docs)[0], clusters):
    print(cls,doc)

2 fulton
2 county
2 grand
2 jury
2 say
2 friday
2 investigation
3 atlanta
2 recent
2 primary
1 election
2 produce
2 evidence
2 irregularity
2 take
2 place
2 jury
2 say
2 term
3 end
2 presentment
3 city
2 executive
2 committee
2 over-all
2 charge
0 election
2 deserve
2 praise
3 thanks
3 city
2 atlanta
2 manner
2 election
0 conduct
2 september
3 october
1 term
2 jury
1 charge
2 fulton
2 superior
2 court
2 judge
2 durwood
2 pye
2 investigate
2 report
3 possible
2 irregularity
2 hard-fought
2 primary
2 mayor-nominate
2 ivan
2 allen
2 jr.
2 relative
2 handful
2 report
2 receive
2 jury
3 say
2 consider
2 widespread
2 interest
2 election
0 number
2 voter
2 size
2 city
2 jury
2 say
2 find
2 many
2 georgia
2 registration
2 election
3 laws
3 outmode
3 inadequate
2 often
2 ambiguous
2 recommend
2 fulton
2 legislator
2 act
2 laws
2 study
2 revise
2 end
2 modernize
2 improve
2 grand
2 jury
2 comment
2 number
2 topic
2 among
2 atlanta
2 fulton
2 county
2 purchasing
1 department
2 say
2 well
2 operat

Check out the clustering result.

In [39]:
# K-means setting, K parameter = 5
num_clusters = 5
km = KMeans(n_clusters=num_clusters, random_state = 0)

# fit
clusters = km.fit_predict(tf_idf)

for doc, cls in zip(preprocess_documents(doｃs)[0], clusters):
    print(cls,doc)



2 fulton
2 county
2 grand
2 jury
2 say
2 friday
2 investigation
3 atlanta
2 recent
2 primary
1 election
2 produce
2 evidence
2 irregularity
2 take
2 place
2 jury
2 say
2 term
3 end
2 presentment
3 city
2 executive
2 committee
2 over-all
2 charge
0 election
2 deserve
2 praise
2 thanks
3 city
2 atlanta
2 manner
2 election
0 conduct
2 september
3 october
1 term
2 jury
1 charge
2 fulton
2 superior
2 court
2 judge
2 durwood
2 pye
2 investigate
2 report
3 possible
2 irregularity
2 hard-fought
2 primary
2 mayor-nominate
2 ivan
2 allen
2 jr.
2 relative
2 handful
2 report
2 receive
2 jury
3 say
2 consider
2 widespread
2 interest
2 election
0 number
2 voter
2 size
2 city
2 jury
2 say
2 find
2 many
2 georgia
2 registration
2 election
3 laws
3 outmode
3 inadequate
2 often
2 ambiguous
2 recommend
2 fulton
4 legislator
2 act
2 laws
2 study
2 revise
2 end
2 modernize
2 improve
2 grand
2 jury
2 comment
2 number
2 topic
2 among
2 atlanta
2 fulton
2 county
2 purchasing
1 department
2 say
2 well
2 operat

In [40]:
# K-means setting, K parameter = 6
num_clusters = 6
km = KMeans(n_clusters=num_clusters, random_state = 0)

# fit
clusters = km.fit_predict(tf_idf)

for doc, cls in zip(preprocess_documents(doｃs)[0], clusters):
    print(cls,doc)


5 fulton
5 county
5 grand
3 jury
5 say
4 friday
5 investigation
1 atlanta
2 recent
5 primary
5 election
5 produce
5 evidence
5 irregularity
3 take
5 place
5 jury
0 say
5 term
1 end
5 presentment
1 city
3 executive
5 committee
5 over-all
5 charge
4 election
5 deserve
5 praise
3 thanks
1 city
0 atlanta
0 manner
0 election
4 conduct
0 september
1 october
4 term
5 jury
5 charge
5 fulton
5 superior
5 court
0 judge
5 durwood
0 pye
5 investigate
0 report
1 possible
0 irregularity
5 hard-fought
5 primary
5 mayor-nominate
5 ivan
5 allen
3 jr.
5 relative
5 handful
5 report
5 receive
5 jury
1 say
5 consider
5 widespread
5 interest
5 election
4 number
5 voter
5 size
5 city
5 jury
5 say
5 find
5 many
5 georgia
5 registration
5 election
5 laws
1 outmode
1 inadequate
5 often
5 ambiguous
5 recommend
5 fulton
4 legislator
5 act
5 laws
5 study
5 revise
5 end
5 modernize
5 improve
5 grand
5 jury
5 comment
5 number
5 topic
5 among
5 atlanta
5 fulton
5 county
5 purchasing
5 department
3 say
4 well
3 operat

NameError: ignored

## Hints

There are many hyperparameters in the vectorizer and kmeans of scikit-learn. The vectorizer method also provides data preprocessing functions with hyperparameters (e.g., stop_words). The clustering result will change according to the change of these hyperparameters. You can try different hyperparameter settings to check out the result refer to the following URL.   
* About TF-IDF   
    https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html   
* About K-means   
    https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html


## Try it yourself
Modify the above code with the following methods to check out the differences:
1. Try other vectorization methods (e.g., bag-of-words)
2. Try other clustering methods (e.g., hierarchical clustering) or visualize the result of K-means.