# Session 5. Web Content Mining: Text Clustering and Classification

As usual, we start by importing all the packages/classes that are of interest today.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline

from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import NMF, LatentDirichletAllocation

import numpy as np

## 1. Clustering

### 1.1. Hard Clustering
We start this session by applying the K-Means algorithm to a well-known dataset: the 20 newsgroups dataset. This dataset is composed of newsgroup posts on different topics. Each document is labelled with its topic, which makes the dataset well suited to classification and lets us check a clustering against the true labels.

First, let us load the dataset using the following command. A loader for it is included in the `sklearn` library.

In [2]:
dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=100, remove=('headers', 'footers', 'quotes'))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Try to explore this newly created `dataset` variable and store the labels (you can get them with the `dataset.target` command) and the number of classes into variables.

In [3]:
print("%d documents" % len(dataset.data))
print("%d categories" % len(dataset.target_names))
print(dataset.target_names[:5])
labels = dataset.target
true_k = np.unique(labels).shape[0]
print(true_k)
dataset["data"][0]
dataset.keys()

18846 documents
20 categories
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware']
20


dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])

Now it's time to start manipulating the data. First, create a variable to store the number of features we want to keep (i.e., the size of the vocabulary). Then, build the TF-IDF term-document matrix. For this, apply the `TfidfVectorizer` with the following parameter values:
  - Filter out words that appear in more than 50% of the documents
  - Filter out words that appear in fewer than 2 documents
  - Use the IDF weighting
  - Filter out English stop words
  - Keep only the `n_features` most frequent words

In [4]:
n_features = 1000 # vocabulary size
vectorizer = TfidfVectorizer(max_df=0.5, max_features=n_features,min_df=2, 
                             stop_words='english',use_idf=True)
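
To make the `max_df` and `min_df` filters more concrete, the sketch below recomputes them by hand on a toy corpus. It is a plain-Python illustration under simplifying assumptions (whitespace tokenisation, a hypothetical four-document corpus named `toy_docs`), not a description of how `TfidfVectorizer` is implemented internally.

```python
from collections import Counter

# Hypothetical toy corpus (not taken from the 20 newsgroups data).
toy_docs = [
    "the car engine is loud",
    "the car needs new oil",
    "the hockey game was great",
    "the hockey team won",
]

# Document frequency: number of documents in which each term occurs.
doc_freq = Counter(term for doc in toy_docs for term in set(doc.split()))

n_docs = len(toy_docs)
max_df, min_df = 0.5, 2  # same values as in the exercise

# Keep terms that occur in at least `min_df` documents and in at most
# a proportion `max_df` of the documents.
kept = sorted(
    term for term, df in doc_freq.items()
    if df >= min_df and df / n_docs <= max_df
)
print(kept)
```

Here "the" is dropped because it occurs in every document, and words seen in a single document are dropped by `min_df`; only "car" and "hockey" survive.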

Use the result of the vectorizer to fit and transform the data (accessible with `dataset.data`).

In [5]:
X = vectorizer.fit_transform(dataset.data)
print("n_samples: %d, n_features: %d" % X.shape)

n_samples: 18846, n_features: 1000


Now that we have the term-document matrix, we can perform the clustering. We will use the `MiniBatchKMeans` algorithm, which is a more efficient variant of the popular `KMeans` algorithm. Set the number of clusters to the number of distinct labels in the dataset and leave the other parameters unchanged.

In [6]:
km = MiniBatchKMeans(n_clusters=true_k)
km.fit(X)


MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++',
        init_size=None, max_iter=100, max_no_improvement=10, n_clusters=20,
        n_init=3, random_state=None, reassignment_ratio=0.01, tol=0.0,
        verbose=0)

Since we have the labels of the documents, it is possible to evaluate the quality of the clustering by checking whether it succeeded in grouping documents that share the same label. The `metrics` package of the `scikit-learn` library has some useful functions for doing so. Have a look at the documentation and calculate the following quality scores:
  - Homogeneity
  - Completeness
  - V-measure
  - Silhouette coefficient
  

In [7]:
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, km.labels_, sample_size=1000))

Homogeneity: 0.202
Completeness: 0.243
V-measure: 0.221
Silhouette Coefficient: -0.015
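
To get an intuition for what homogeneity, completeness and the V-measure capture, the sketch below recomputes them from their entropy-based definitions on a tiny hand-made example. The helper functions and the toy labels are hypothetical and written for illustration only; in practice the `metrics` functions used above should be preferred.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    """H(a | b): uncertainty left about `a` once `b` is known."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * log(c / b_counts[bv]) for (_, bv), c in joint.items())

def homogeneity(classes, clusters):
    """1 when every cluster contains members of a single class."""
    h_c = entropy(classes)
    return 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c

def completeness(classes, clusters):
    """1 when all members of a class fall in the same cluster."""
    # Completeness is homogeneity with the two roles swapped.
    return homogeneity(clusters, classes)

# Hypothetical example: class 1 is split across clusters 1 and 2.
toy_classes = [0, 0, 1, 1]
toy_clusters = [0, 0, 1, 2]

h = homogeneity(toy_classes, toy_clusters)
c = completeness(toy_classes, toy_clusters)
v = 2 * h * c / (h + c)  # harmonic mean of the two scores
print("h=%.3f c=%.3f v=%.3f" % (h, c, v))
```

In this toy case every cluster is pure, so homogeneity is 1, but one class is scattered over two clusters, so completeness drops to 2/3 and the V-measure to 0.8.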


We now have a look at the features (terms) that characterize each cluster. For this, run the following piece of code and check whether the clusters look consistent.

In [8]:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
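# get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0, call get_feature_names() instead.
terms = vectorizer.get_feature_names_out()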
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

Cluster 0: god jesus christ faith bible believe sin lord christian people
Cluster 1: government key gun chip clipper encryption law people keys public
Cluster 2: com ibm ve mail list like hp sun does said
Cluster 3: israel jews armenian israeli arab jewish armenians turkish people war
Cluster 4: insurance included car turned base necessary doubt needed friend week
Cluster 5: drive scsi drives hard disk ide controller floppy team apple
Cluster 6: windows dos window file program files use problem thanks mouse
Cluster 7: sure seen haven like ve make just know league years
Cluster 8: don know think people just like want use right good
Cluster 9: thanks use like used new need looking know problem mail
Cluster 10: car bike engine speed cars oil just new like ve
Cluster 11: game games hockey year team baseball night play goal don
Cluster 12: team win gm yes times thought hell says face article
Cluster 13: address mail send list email thanks know number phone interested
Cluster 14: server runn
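
The ranking of terms per cluster relies on `argsort` applied row by row. The following sketch reproduces the idea on a hypothetical 2-cluster, 4-term example (`toy_terms` and `toy_centers` are made up for illustration).

```python
import numpy as np

# Hypothetical centroids: 2 clusters over a 4-term vocabulary.
toy_terms = ["car", "engine", "god", "jesus"]
toy_centers = np.array([
    [0.7, 0.5, 0.1, 0.0],   # cluster 0 is dominated by car-related terms
    [0.0, 0.1, 0.6, 0.8],   # cluster 1 is dominated by religious terms
])

# argsort() sorts each row in ascending order; [:, ::-1] reverses the columns
# so that the index of the largest weight comes first.
toy_order = toy_centers.argsort()[:, ::-1]

# Keep the 2 highest-weighted terms of each cluster.
top_terms = [[toy_terms[j] for j in row[:2]] for row in toy_order]
print(top_terms)
```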

### 1.2. Soft Clustering
We now focus on soft clustering, and more precisely on topic extraction. In this session, two algorithms are applied: NMF and LDA. We start by loading the dataset and defining some variables that we will use in the following.

In [9]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1, 
                             remove=('headers', 'footers', 'quotes'))
documents = dataset.data

n_features = 1000
n_topics = 20

As usual, the first step is data preprocessing. Here, we apply the same preprocessing as in the previous section and then build the term-document matrix. However, NMF and LDA do not use the same weighting scheme for this matrix. Since LDA is a probabilistic graphical model of word counts, it makes no sense to feed it TF-IDF weights. Thus, for LDA, we use the raw term frequency (with the `CountVectorizer` class), whereas for NMF we use the TF-IDF (with the `TfidfVectorizer`, as in the previous section). Apply these treatments accordingly and store the feature names (the 1000 words that are used as columns of the matrices). This can be done with the `get_feature_names_out` method (`get_feature_names` in scikit-learn versions before 1.0) of the newly created vectorizer instances.

In [10]:
# NMF
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, max_features=n_features, 
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
# Use get_feature_names() instead on scikit-learn versions older than 1.0.
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

In [11]:
# LDA
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2, max_features=n_features, 
                                stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
# Use get_feature_names() instead on scikit-learn versions older than 1.0.
tf_feature_names = tf_vectorizer.get_feature_names_out()
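
The difference between the two representations is easier to see on a small example. The sketch below, on a hypothetical three-document corpus, shows that `CountVectorizer` yields integer counts while `TfidfVectorizer` yields real-valued weights whose rows are L2-normalised by default.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
toy_docs = ["car car engine", "car oil", "hockey game game"]

toy_tf = CountVectorizer().fit_transform(toy_docs)
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# Raw term frequencies: non-negative integers, suitable for LDA.
print(toy_tf.toarray())

# TF-IDF weights: real values, suitable for NMF.
print(np.round(toy_tfidf.toarray(), 2))

# With the default settings, each TF-IDF row has unit Euclidean norm.
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print(row_norms)
```

Both matrices share the same vocabulary and shape; only the weighting of the entries differs.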

We can now apply the two algorithms, `NMF` and `LatentDirichletAllocation` (LDA), by setting the number of components to the number of desired topics and leaving the other parameters at their default values (except for LDA, for which you should specify `learning_method="batch"` to avoid a warning). Once the algorithm has been run (i.e., the `fit` function has been applied), we store the resulting matrices, which describe the words that represent each topic as well as which topics are present in which documents. The "document-topic" matrix can be obtained with the `transform(tfidf)` function of the model (here `tfidf` is for NMF). The "topic-word" matrix can be obtained using the `components_` attribute of the model.

In [12]:
nmf = NMF(n_components=n_topics).fit(tfidf)
nmf_W = nmf.transform(tfidf)
nmf_H = nmf.components_

In [13]:
lda = LatentDirichletAllocation(n_components=n_topics, 
                                learning_method="batch").fit(tf)
lda_W = lda.transform(tf)
lda_H = lda.components_
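
To see how the two matrices relate to the input, here is a small sketch on a hypothetical 4-document, 5-term matrix. The values, `random_state` and `max_iter` are arbitrary choices made for the illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical term-document matrix: 4 documents (rows), 5 terms (columns).
toy_X = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 0.0],
])

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
toy_W = toy_nmf.fit_transform(toy_X)   # document-topic weights
toy_H = toy_nmf.components_            # topic-word weights

# Shapes are (n_documents, n_topics) and (n_topics, n_terms).
print(toy_W.shape, toy_H.shape)

# The product W @ H is a non-negative low-rank approximation of toy_X.
print(np.round(toy_W @ toy_H, 1))
```

Each row of `toy_W` tells how strongly a document expresses each topic, and each row of `toy_H` tells how strongly each term contributes to a topic, which is exactly what the display function below exploits.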

We are now going to visualize our results. Specifically, we are going to print, for each topic, the 10 words that "characterize" it and the 10 documents in which it is the most represented. Writing this function is outside the scope of this course, which is why it is given to you below:

In [35]:
def display_topics_full(H, W, feature_names, documents, no_top_words, no_top_documents):
    for topic_idx, topic in enumerate(H):
        print ("Topic %d:" % (topic_idx))
        print (" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print(documents[doc_index])
            print("------------------------")
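
The two slicing idioms used in this function may look cryptic. The sketch below, on a hypothetical weight vector, suggests that they select the same thing: the indices of the largest values, in decreasing order.

```python
import numpy as np

# Hypothetical weights of 4 words within one topic.
toy_topic = np.array([0.1, 0.9, 0.3, 0.7])
n_top = 2

# Negative-step slice: walk backwards from the end and stop after n_top items.
top_a = toy_topic.argsort()[:-n_top - 1:-1]

# More explicit form: reverse the ascending order, then take the first n_top.
top_b = np.argsort(toy_topic)[::-1][:n_top]

print(top_a, top_b)
```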

Apply this function to the results of NMF and LDA and discuss the results.

In [36]:
n_top_words = 10
n_top_documents = 10
display_topics_full(nmf_H, nmf_W, tfidf_feature_names, documents, 
                    n_top_words, n_top_documents)
display_topics_full(lda_H, lda_W, tf_feature_names, documents, 
                    n_top_words, n_top_documents)

Topic 0:
people right government did said israel time state law gun
not only is it improper etiquette AND illegal but the people who
are responsible for junk mailings are *EVIL*!!!!


------------------------
<-> > But, do you knew how much organization is required to training a large
<-> > group of poeple twice a year.  Just to try to get the same people
<-> > every year, provide a basic training to new people so they can
<-> > be integrated into the force, and find a suitable location, it 
<-> > requires a continually standing committee of organizers.  
<-> 
<-> Again, my response is, "so what?"  Is Mr. Rutledge arguing that since
<-> the local and federal governments have abandoned their charter to support
<-> such activity, and passed laws prohibiting private organizations from 
<-> doing so, that they have eliminated the basis for the RKBA?   On the
<-> contrary, to anyone who understands the game, they have strengthened it.
<
<No, I originally argued that the Second Amendment was

## 2. Classification
We now move to the supervised scenario and start by loading the training set as follows. The code below shows how to perform this classification and is taken from this [excellent blog article](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a). Please have a look at it for full explanations.

In [5]:
#Loading the data set - training data.

twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

In [12]:
twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [16]:
# Extracting features from text files
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)

In [18]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)

In [19]:
# Machine Learning
# Training Naive Bayes (NB) classifier on training data.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [20]:
# Building a pipeline: We can write less code and do all of the above, by building a pipeline as follows:
# The names ‘vect’ , ‘tfidf’ and ‘clf’ are arbitrary but will be used later.
# We will be using the 'text_clf' going forward.

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
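
To see what the pipeline does for us, the sketch below compares it with the manual chaining of the same three steps on a hypothetical toy corpus (the documents and labels are made up for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: class 0 is about cars, class 1 about religion.
toy_train = ["car engine oil", "engine speed car", "god faith bible", "faith church god"]
toy_y = [0, 0, 1, 1]
toy_test = ["car speed", "bible church"]

# Manual version: each step is fitted on the training data,
# then reused (transform only) on the test data.
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf_manual = MultinomialNB().fit(tfidf.fit_transform(vect.fit_transform(toy_train)), toy_y)
manual_pred = clf_manual.predict(tfidf.transform(vect.transform(toy_test)))

# Pipeline version: the same three steps behind a single fit/predict.
toy_pipe = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("clf", MultinomialNB())])
pipe_pred = toy_pipe.fit(toy_train, toy_y).predict(toy_test)

print(manual_pred, pipe_pred)
```

Besides saving code, the pipeline guarantees that the test data goes through exactly the transformations fitted on the training data.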

In [21]:
# Performance of NB Classifier
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.7738980350504514

In [22]:
# Training Support Vector Machines - SVM and calculating its performance

from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, max_iter=5, random_state=42))])

text_clf_svm = text_clf_svm.fit(twenty_train.data, twenty_train.target)
predicted_svm = text_clf_svm.predict(twenty_test.data)
np.mean(predicted_svm == twenty_test.target)

0.82381837493361654

In [23]:
# Grid Search
# Here, we are creating a dictionary of parameters for which we would like to do performance tuning.
# Each parameter name starts with the name of a pipeline step (remember the arbitrary names we gave),
# followed by two underscores and the name of the parameter of that step.
# E.g. vect__ngram_range; here we are telling it to try unigrams only, and unigrams plus bigrams, and to choose the better option.

from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf__alpha': (1e-2, 1e-3)}
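
As a rough illustration of what the grid search will explore, the sketch below enumerates the candidate settings of the same grid with the standard library only. The variable names (`toy_grid`, `combinations`, `step_and_param`) are hypothetical; `GridSearchCV` performs this enumeration itself and additionally cross-validates each candidate.

```python
from itertools import product

# Same grid as above: keys follow the "<step name>__<parameter name>" convention.
toy_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}

# Cartesian product of all candidate values: one dict per candidate setting.
names = sorted(toy_grid)
combinations = [dict(zip(names, values))
                for values in product(*(toy_grid[name] for name in names))]

print(len(combinations))
for combo in combinations[:2]:
    print(combo)

# Splitting a key on the double underscore recovers the step and its parameter.
step_and_param = [name.split("__", 1) for name in names]
print(step_and_param)
```

With two values for each of the three parameters, the grid contains 2 x 2 x 2 = 8 candidate settings, each of which is fitted once per cross-validation fold.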

In [24]:
# Next, we create an instance of the grid search by passing the classifier, parameters 
# and n_jobs=-1 which tells to use multiple cores from user machine.

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

In [25]:
# To see the best mean score and the params, run the following code

print(gs_clf.best_score_)
print(gs_clf.best_params_)

# Output for above should be: The accuracy has now increased to ~90.6% for the NB classifier (not so naive anymore! 😄)
# and the corresponding parameters are {‘clf__alpha’: 0.01, ‘tfidf__use_idf’: True, ‘vect__ngram_range’: (1, 2)}.

0.906752695775
{'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


In [26]:
# Similarly doing grid search for SVM
from sklearn.model_selection import GridSearchCV
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False),'clf-svm__alpha': (1e-2, 1e-3)}

gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)

In [27]:
print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)

0.89791408874
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


In [28]:
# NLTK
# Removing stop words
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), 
                     ('clf', MultinomialNB())])

In [29]:
# Stemming Code

import nltk
# Run once if the NLTK stop word list is missing (needed by ignore_stopwords=True):
# nltk.download('stopwords')

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])
    
stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect), ('tfidf', TfidfTransformer()), 
                             ('mnb', MultinomialNB(fit_prior=False))])

text_mnb_stemmed = text_mnb_stemmed.fit(twenty_train.data, twenty_train.target)

predicted_mnb_stemmed = text_mnb_stemmed.predict(twenty_test.data)

np.mean(predicted_mnb_stemmed == twenty_test.target)

0.81678173127987252
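
The key trick in the stemming code is overriding `build_analyzer`. The sketch below isolates that pattern with a deliberately naive, hypothetical stemmer (`toy_stem`, which only strips a trailing "s") so that it runs without NLTK; it is not a substitute for the Snowball stemmer used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

class ToyStemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the default analyzer (lowercasing, tokenisation, stop words) ...
        analyzer = super().build_analyzer()
        # ... and post-process each token it produces.
        return lambda doc: [toy_stem(w) for w in analyzer(doc)]

toy_vect = ToyStemmedCountVectorizer()
toy_counts = toy_vect.fit_transform(["clusters and cluster", "engines engine"])
toy_vocab = sorted(toy_vect.vocabulary_)

print(toy_vocab)
print(toy_counts.toarray())
```

Because "clusters" and "cluster" are mapped to the same token, they share a single column of the count matrix, which is the effect stemming is meant to achieve.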