#  Cluster articles by TF IDF and KMeans

## I have saved a number of articles from the site Medium as PDFs. This notebook is intended to allow them, or any other group of articles, to be clustered by topic.

Most of the imports are the usual suspects, with the exception of fitz. Fitz is used because PyPDF2 could not deal with the files; fitz is more robust.  
Fitz is the import name of PyMuPDF. PyMuPDF is a Python binding for MuPDF, "a lightweight PDF and XPS viewer". See `https://github.com/pymupdf`

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import os
from pathlib import Path
import PyPDF2
import fitz

from nltk.corpus import stopwords
import re
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.metrics.pairwise import cosine_similarity

### Helper Functions

It is my practice to group functions in one place rather than scatter them across the notebook.

In [2]:
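def rank_words(alist, features=1000, stops='english', ngram=(1, 1)):
    """Fit TF-IDF on a list of strings; return ranked terms, the dense matrix and the vectorizer."""
    vec = TfidfVectorizer(max_df=0.5, max_features=features,
                          min_df=2, stop_words=stops,
                          use_idf=True,
                          ngram_range=ngram)
    tfidf_matrix = vec.fit_transform(alist).toarray()
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    df = pd.DataFrame({'Words': vec.get_feature_names_out(),
                       'Summed TFIDF': tfidf_matrix.sum(axis=0)})
    sorted_df = df.sort_values('Summed TFIDF', ascending=False)
    return sorted_df, tfidf_matrix, vec

def count_words(alist):
    """Return the document-term count matrix for a list of strings."""
    # `stops` is the global stop-word set defined in the constants cell;
    # recent scikit-learn releases expect a list rather than a set
    cv = CountVectorizer(min_df=2, stop_words=list(stops))

    # this step generates word counts for the words in your docs
    wcv = cv.fit_transform(alist)
    return wcv

def count_words2(alist, ngram_range=(1, 1)):
    """Like count_words, but honours ngram_range and also returns the vectorizer."""
    cv = CountVectorizer(min_df=2, stop_words=list(stops),
                         ngram_range=ngram_range)

    # this step generates word counts for the words in your docs
    wcv = cv.fit_transform(alist)
    return wcv, cv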

### Constants and stuff

In [3]:
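# This pattern matches two groups of 1 or 2 digits, each followed by : / or ,
# (month and day, or hours and minutes), then 2 to 4 digits (year or seconds);
# alternatively it matches "am" or "pm" to pick up times.

date_time_str = r"^(?:(?:[0-9]{1,2}[:\/,]){2}[0-9]{2,4}|am|pm)$"

# Unanchored date pattern; the same pattern is applied when the articles are loaded
date_str = r"[0-9]{1,2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}"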

target ="C:\\Working\\Medium Articles"
inputs = os.listdir(target)
trace = False
other_stops = ['https','just','www', 'like','way','used','use','using','http', 'com']
stops = set(stopwords.words('english')+ other_stops)

nbr_features = 100
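
To see what the date pattern actually removes, here is a minimal sketch using only the standard library. It applies the same pattern that the loading loop uses to a made-up sentence; the names `DATE_PATTERN`, `drop_date_tokens` and `sample` are illustrative and not part of the notebook.

```python
import re

# Same pattern as the list comprehension in the loading cell.
DATE_PATTERN = re.compile(r'[0-9]{1,2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}')

def drop_date_tokens(tokens):
    """Return the tokens that do not contain a date- or time-like run."""
    return [tok for tok in tokens if not DATE_PATTERN.search(tok)]

# Hypothetical text of the kind a browser adds when printing a page to PDF.
sample = "Saved 8/15/2020, 10:30:45 from Medium in 2020".split()
kept = drop_date_tokens(sample)
print(kept)
```

Because the pattern is unanchored and applied with `re.search`, a token is dropped whenever a date-like run appears anywhere inside it, including tokens with trailing punctuation such as `8/15/2020,`. Bare numbers such as `2020` are kept.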

### Set up blank DF

In [5]:
cols = ['Title','Short_Title','Body']
art_df = pd.DataFrame(columns=cols)
art_df.head()

Unnamed: 0,Title,Short_Title,Body


### Get some data

In [8]:
for i in range(len(inputs)):
    if trace: print(i)
    somefo = os.path.join(target, inputs[i])
    doc = fitz.open(somefo)
    artstr = ""
    # Current PyMuPDF uses snake_case names; older releases used
    # doc.pageCount, doc.loadPage and page.getText instead
    for j in range(doc.page_count):
        page = doc.load_page(j)
        text = page.get_text('text')
        textout = str(text).replace("\n", " ")
        artstr = artstr + textout
    doc.close()
    tit = inputs[i][:-4]    # file name without the ".pdf" extension
    shrt_tit = tit[:20]
    artlist = artstr.split()
    # drop tokens that look like dates or times
    new_list = [item for item in artlist if not re.search(r'[0-9]{1,2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}', item)]
    art_df.loc[i] = [tit, shrt_tit, new_list]

art_df.head()

Unnamed: 0,Title,Short_Title,Body
0,10 Inventions from Arab Inventors We Were Neve...,10 Inventions from A,"[10, Inventions, from, Arab, Inventors, We, We..."
1,14 Different Types of Learning in Machine Lear...,14 Different Types o,"[14, Different, Types, of, Learning, in, Machi..."
2,4 Cognitive Bias Key Points Data Scientists Ne...,4 Cognitive Bias Key,"[4, Cognitive, Bias, Key, Points, Data, Scient..."
3,5 Statistical Traps Data Scientists Should Avoid,5 Statistical Traps,"[5, Statistical, Traps, Data, Scientists, Shou..."
4,A Beginners guide to Building your own Face Re...,A Beginners guide to,"[A, Beginners, guide, to, Building, your, own,..."


### Short examination of the data

In [9]:
words_in_articles = []
words_in_doc = []
# Glom into a list of strings
for i  in range(len(art_df)):
    words_in_articles.append(" ".join(art_df.iloc[i].Body) ) 

word_count,vec = count_words2( words_in_articles)
print("The shape of word counts, articles, unique words",word_count.shape)

The shape of word counts, articles, unique words (188, 9821)


In [10]:
sort_df = pd.DataFrame(word_count.sum(axis=0),
             columns=vec.get_feature_names()).T.sort_values(0,ascending=False)
sort_df.head(10)         

Unnamed: 0,0
data,3603
learning,3445
model,2418
100,2197
machine,1559
image,1467
python,1454
one,1343
science,1216
medium,1135


### Document clustering with TF IDF

In [11]:
sort_df,tf_matrix,vec = rank_words(words_in_articles)
sort_df.head(20)

Unnamed: 0,Words,Summed TFIDF
329,face,10.707552
262,detection,7.838385
729,recognition,7.658588
912,towardsdatascience,7.506433
443,images,5.519965
580,network,5.069635
601,object,4.70982
788,seaborn,4.566327
486,kdnuggets,4.493308
487,keras,4.441106


In [14]:
print("The TF IDF Matrix is articles by words, defaults to 1000 ",tf_matrix.shape)

The TF IDF Matrix is articles by words, defaults to 1000  (188, 1000)
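
To make the entries of this matrix less opaque, here is a hand-rolled sketch of a TF-IDF weight, assuming the smoothed IDF and L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The toy `docs`, and the helpers `idf` and `tfidf_row`, are illustrative only; the sketch also skips the `max_df`, `min_df` and stop-word filtering that `rank_words` applies.

```python
import math
from collections import Counter

# Three made-up "articles", already lower-cased.
docs = [
    "face detection with keras",
    "face recognition with opencv",
    "bayesian optimization with hyperopt",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for toks in tokenized for term in set(toks))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(tokens):
    """Raw term count times IDF, then scaled so the row has unit L2 norm."""
    counts = Counter(tokens)
    raw = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

rows = [tfidf_row(toks) for toks in tokenized]
for term, weight in sorted(rows[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

A term that appears in every document ("with") gets the smallest weight, and a term unique to one document ("keras") gets the largest. Summing a column of such weights over all articles gives the "Summed TFIDF" ranking shown above.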


### Distance is 1 - cosine similarity

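The result has the dimensions articles by articles: entry (i, j) is the distance between article i and article j, so the diagonal is zero up to floating-point rounding.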

In [15]:
dist = 1 - cosine_similarity(tf_matrix)
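
As a rough illustration of what this line computes, the sketch below builds the same kind of matrix with plain NumPy on a tiny made-up array. The function name `cosine_distance_matrix` and the array `toy` are hypothetical, and the sketch assumes no row is all zeros.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Return 1 - cosine similarity for every pair of rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms              # assumes no all-zero rows
    return 1.0 - unit @ unit.T    # shape: (rows, rows)

# Rows 0 and 1 point the same way; row 2 shares no terms with them.
toy = np.array([[1.0, 0.0, 1.0],
                [2.0, 0.0, 2.0],
                [0.0, 3.0, 0.0]])
d = cosine_distance_matrix(toy)
print(np.round(d, 3))
```

Rows that point in the same direction have distance 0 regardless of length, and rows with no shared terms have distance 1. Tiny values such as `-2.2e-16` on the diagonal are floating-point rounding, not real negative distances.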

In [17]:
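# Store each article's row of distances; assigning the whole column at once
# avoids chained indexing (art_df.Dist.iloc[i] = ...), which pandas warns about
art_df['Dist'] = list(dist)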

art_df.head()

Unnamed: 0,Title,Short_Title,Body,Dist
0,10 Inventions from Arab Inventors We Were Neve...,10 Inventions from A,"[10, Inventions, from, Arab, Inventors, We, We...","[-2.220446049250313e-16, 0.974749517654662, 0...."
1,14 Different Types of Learning in Machine Lear...,14 Different Types o,"[14, Different, Types, of, Learning, in, Machi...","[0.974749517654662, -4.440892098500626e-16, 0...."
2,4 Cognitive Bias Key Points Data Scientists Ne...,4 Cognitive Bias Key,"[4, Cognitive, Bias, Key, Points, Data, Scient...","[0.9851655938314126, 0.9298484722934943, -2.22..."
3,5 Statistical Traps Data Scientists Should Avoid,5 Statistical Traps,"[5, Statistical, Traps, Data, Scientists, Shou...","[0.9750108243336555, 0.8915423404430383, 0.547..."
4,A Beginners guide to Building your own Face Re...,A Beginners guide to,"[A, Beginners, guide, to, Building, your, own,...","[0.9854114570245276, 0.9617390194930834, 0.947..."
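
One possible use of a stored distance row is to look up the articles closest to a given one. The sketch below shows the idea with NumPy on a small made-up distance matrix; the helper `nearest` and the matrix `toy_dist` are assumptions for illustration and are not used elsewhere in the notebook.

```python
import numpy as np

def nearest(dist_row, self_index, k=3):
    """Return the indices of the k smallest distances, skipping the article itself."""
    order = np.argsort(dist_row)
    return [int(j) for j in order if j != self_index][:k]

# Hypothetical 3-article distance matrix (symmetric, zero diagonal).
toy_dist = np.array([[0.0, 0.2, 0.9],
                     [0.2, 0.0, 0.5],
                     [0.9, 0.5, 0.0]])

closest_to_first = nearest(toy_dist[0], self_index=0, k=2)
print(closest_to_first)
```

With the real data, the returned indices could be passed to `art_df.loc[...]` to read off the matching titles.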


### Cluster the articles based on the TF IDF Matrix

In [18]:
km = MiniBatchKMeans(n_clusters=20,
                    init='k-means++', n_init=1,
                    init_size=1000,
                    batch_size=1000 )

km.fit(tf_matrix)

MiniBatchKMeans(batch_size=1000, compute_labels=True, init='k-means++',
                init_size=1000, max_iter=100, max_no_improvement=10,
                n_clusters=20, n_init=1, random_state=None,
                reassignment_ratio=0.01, tol=0.0, verbose=0)

In [19]:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

terms = vec.get_feature_names()
for i in range(km.cluster_centers_.shape[0]):
    print("Cluster "+ str(i))
    for j in order_centroids[i,:10]:
        print(' %s' % terms[j], end='')
    print('\n')

Cluster 0
 object yolo detection vision visionwizard computer network boxes dive keras

Cluster 1
 reinforcement meta agent game policy reward state action actions planning

Cluster 2
 recognition facial detection object images projects 2020 towardsdatascience kdnuggets news

Cluster 3
 jupyter notebook colab windows notebooks transfer towardsdatascience task text gpu

Cluster 4
 optimization bayesian hyperparameter hyperopt xgboost acquisition search tuning hyperparameters optimize

Cluster 5
 seaborn visualization plot nerds neuralnets shall plots color random sns

Cluster 6
 face recognition faces detection images opencv celebrity facial keras cv2

Cluster 7
 life people schwartz things years day think utm_campaign utm_source book

Cluster 8
 kdnuggets cognitive bias html biases detection www distribution clustering sne

Cluster 9
 xgboost towardsdatascience sne feature variable false regression web hyper catboost

Cluster 10
 tensorflow tf cnn convolutional neural label multi class
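
The `argsort()[:, ::-1]` idiom used to pick the top terms can be checked on a tiny example. In this sketch the `terms` and `centers` arrays are made up; each row of `centers` stands in for one cluster centroid.

```python
import numpy as np

terms = np.array(["face", "keras", "plot", "seaborn"])

# Two hypothetical centroids, one weight per term.
centers = np.array([[0.9, 0.4, 0.0, 0.1],
                    [0.0, 0.1, 0.6, 0.8]])

# argsort sorts ascending, so reversing each row gives descending order.
order = centers.argsort()[:, ::-1]

# Keep the two heaviest terms for each centroid.
top2 = [terms[row[:2]].tolist() for row in order]
print(top2)
```

Each centroid is the mean TF-IDF vector of its cluster, so the terms with the largest centroid weights are the ones that characterize the cluster.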

### Add the clusters to the Data Frame

In [20]:
clusters = km.labels_.tolist()
art_df['Cluster'] = clusters

art_df.sort_values(['Cluster', 'Short_Title'])

Unnamed: 0,Title,Short_Title,Body,Dist,Cluster
11,A Simple Guide to Using Keras Pretrained Model...,A Simple Guide to Us,"[A, Simple, Guide, to, Using, Keras, Pretraine...","[0.9951149769326811, 0.9485160814292422, 0.977...",0
32,Computer Vision for Automatic Road Damage Dete...,Computer Vision for,"[Computer, Vision, for, Automatic, Road, Damag...","[0.9609625329766408, 0.9401862216903285, 0.989...",0
50,DETR _ Object Detection _ Facebook AI _ Vision...,DETR _ Object Detect,"[DETR, |, Object, Detection, |, Facebook, AI, ...","[0.9689487017940672, 0.9157330173729323, 0.977...",0
42,Deep Dive into the Computer Vision World_ Part...,Deep Dive into the C,"[Deep, Dive, into, the, Computer, Vision, Worl...","[0.9502165337103473, 0.9240473090544705, 0.973...",0
43,Deep Dive into the Computer Vision World_ Part...,Deep Dive into the C,"[Deep, Dive, into, the, Computer, Vision, Worl...","[0.941781243737909, 0.9259461616419888, 0.9697...",0
...,...,...,...,...,...
149,Sound Event Classification_ A to Z _ by Chathu...,Sound Event Classifi,"[Sound, Event, Classification:, A, to, Z, |, b...","[0.988232849639684, 0.9760101112847869, 0.9936...",18
0,10 Inventions from Arab Inventors We Were Neve...,10 Inventions from A,"[10, Inventions, from, Arab, Inventors, We, We...","[-2.220446049250313e-16, 0.974749517654662, 0....",19
58,Earliest hominin migrations into the Arabian P...,Earliest hominin mig,"[Earliest, hominin, migrations, into, the, Ara...","[0.8645867475945946, 0.9524164560443134, 0.963...",19
132,Origin of the Genus Homo _ SpringerLink,Origin of the Genus,"[Origin, of, the, Genus, Homo, |, SpringerLink...","[0.8796275220851925, 0.9820810317990794, 0.987...",19
