
##Large text analysis using Clustering

In this notebook, we apply NLP techniques to large collections
of real-world texts. This type of analysis is seemingly straightforward, given the
techniques presented thus far. For example, suppose we’re doing market research
across multiple online discussion forums. Each forum is composed of hundreds of
users who discuss a specific topic, such as politics, fashion, technology, or cars. We
want to automatically extract all the discussion topics based on the contents of the
user conversations. These extracted topics will be used to plan a marketing campaign,
which will target users based on their online interests.

How do we cluster user discussions into topics? 

One approach would be to do the following:
1. Convert all discussion texts into a matrix of word counts
2. Reduce the dimensionality of the word count matrix using singular value decomposition (SVD). This will allow us to efficiently compute all pairwise text similarities with matrix multiplication.
3. Utilize the matrix of text similarities to cluster the discussions into topics.
4. Explore the topic clusters to identify useful topics for our marketing campaign.
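
The four steps above can be sketched on a tiny corpus. This is only an illustration under stated assumptions: the `toy_posts` are invented, K-means stands in for whichever clustering algorithm is eventually chosen, and for brevity we cluster the normalized vectors directly rather than the similarity matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Hypothetical toy posts: two about cars, two about politics
toy_posts = [
    "my car engine needs a new bumper",
    "the car engine and the car doors",
    "the senate vote on the new election law",
    "election vote results and the senate law",
]

# Step 1: convert the texts into a matrix of word counts
counts = CountVectorizer().fit_transform(toy_posts)

# Step 2: reduce the matrix with SVD, then scale every row to unit length
# so that a single matrix product yields all pairwise cosine similarities
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(counts)
unit_vectors = normalize(reduced)
similarities = unit_vectors @ unit_vectors.T

# Step 3: cluster the posts (K-means is an assumption made for this sketch)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(unit_vectors)

# Step 4: inspect which posts landed in which cluster
for label, post in zip(labels, toy_posts):
    print(label, post)
print(np.round(similarities, 2))
```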



##Setup

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [42]:
from collections import defaultdict
from collections import Counter
import time
import numpy as np
from numpy.linalg import norm
import pandas as pd
import math
from math import sin, cos

from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import binarize

from sklearn.datasets import fetch_20newsgroups

import seaborn as sns
import matplotlib.pyplot as plt

##20Newsgroup dataset

Usenet is a well-established online collection
of discussion forums, which are called newsgroups. Each individual
newsgroup focuses on some topic of discussion, which is briefly outlined in the newsgroup name.

We can load these newsgroup posts by importing `fetch_20newsgroups` from `sklearn.datasets`.

In [3]:
# Fetching the newsgroup dataset
newsgroups = fetch_20newsgroups(remove=("headers", "footers"))

The newsgroups object contains posts from 20 different newsgroups.

In [None]:
# Printing the names of all 20 newsgroups
print(newsgroups.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [None]:
print(len(newsgroups.target_names))

20


Now, let’s turn our attention to the actual newsgroup texts, which are stored as a list in the `newsgroups.data` attribute.

In [None]:
# Printing the first newsgroup post
print(newsgroups.data[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


In [4]:
# Printing the newsgroup name at index 0
origin = newsgroups.target_names[newsgroups.target[0]]
print(f"The post at index 0 first appeared in the '{origin}' group.")

The post at index 0 first appeared in the 'rec.autos' group.


Let’s dive deeper into our newsgroup dataset by printing out the dataset size.

In [4]:
# Counting the number of newsgroup posts
dataset_size = len(newsgroups.data)
print(f"Our dataset contains {dataset_size} newsgroup posts")

Our dataset contains 11314 newsgroup posts


##Vectorizing documents

We need
to efficiently compute newsgroup post similarities by representing our text data as a
matrix. 

To do so, we need to transform each newsgroup post into a term-frequency
(TF) vector.

Scikit-learn provides a `CountVectorizer` class for transforming input texts into TF vectors.

In [None]:
# Computing a TF matrix
vectorizer = CountVectorizer()
tf_matrix = vectorizer.fit_transform(newsgroups.data)
print(tf_matrix)

  (0, 108644)	4
  (0, 110106)	1
  (0, 57577)	2
  (0, 24398)	2
  (0, 79534)	1
  (0, 100942)	1
  (0, 37154)	1
  (0, 45141)	1
  (0, 70570)	1
  (0, 78701)	2
  (0, 101084)	4
  (0, 32499)	4
  (0, 92157)	1
  (0, 100827)	6
  (0, 79461)	1
  (0, 39275)	1
  (0, 60326)	2
  (0, 42332)	1
  (0, 96432)	1
  (0, 67137)	1
  (0, 101732)	1
  (0, 27703)	1
  (0, 49871)	2
  (0, 65338)	1
  (0, 14106)	1
  :	:
  (11313, 55901)	1
  (11313, 93448)	1
  (11313, 97535)	1
  (11313, 93393)	1
  (11313, 109366)	1
  (11313, 102215)	1
  (11313, 29148)	1
  (11313, 26901)	1
  (11313, 94401)	1
  (11313, 89686)	1
  (11313, 80827)	1
  (11313, 72219)	1
  (11313, 32984)	1
  (11313, 82912)	1
  (11313, 99934)	1
  (11313, 96505)	1
  (11313, 72102)	1
  (11313, 32981)	1
  (11313, 82692)	1
  (11313, 101854)	1
  (11313, 66399)	1
  (11313, 63405)	1
  (11313, 61366)	1
  (11313, 7462)	1
  (11313, 109600)	1


In [None]:
# Checking the data type
print(type(tf_matrix))

<class 'scipy.sparse.csr.csr_matrix'>


In [None]:
# Converting a CSR matrix to a NumPy array
tf_np_matrix = tf_matrix.toarray()
print(tf_np_matrix)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [None]:
# Checking the vocabulary size
assert tf_np_matrix.shape == tf_matrix.shape
num_posts, vocabulary_size = tf_np_matrix.shape
print(f"Our collection of {num_posts} newsgroup posts contain a total of {vocabulary_size} unique words")

Our collection of 11314 newsgroup posts contain a total of 114751 unique words


In [None]:
# Counting the unique words in the car post
tf_vector = tf_np_matrix[0]
non_zero_indices = np.flatnonzero(tf_vector)
num_unique_words = non_zero_indices.size

print(f"The newsgroup in row 0 contains {num_unique_words} unique words.")
print("The actual word counts map to the following column indices:\n")
print(non_zero_indices)

The newsgroup in row 0 contains 64 unique words.
The actual word counts map to the following column indices:

[ 14106  15549  22088  23323  24398  27703  29357  30093  30629  32194
  32305  32499  37154  39275  42332  42333  43643  45089  45141  49871
  49881  50165  54442  55453  57577  58321  58842  60116  60326  64083
  65338  67137  67140  68931  69080  70570  72915  75280  78264  78701
  79055  79461  79534  82759  84398  87690  89161  92157  93304  95225
  96145  96432 100406 100827 100942 101084 101732 108644 109086 109254
 109294 110106 112936 113262]


Let's find a mapping between TF vector indices and word values.

In [None]:
# Printing the unique words in the car post
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out().tolist()
unique_words = [words[i] for i in non_zero_indices]
print(unique_words)

['60s', '70s', 'addition', 'all', 'anyone', 'be', 'body', 'bricklin', 'bumper', 'called', 'can', 'car', 'could', 'day', 'door', 'doors', 'early', 'engine', 'enlighten', 'from', 'front', 'funky', 'have', 'history', 'if', 'in', 'info', 'is', 'it', 'know', 'late', 'looked', 'looking', 'made', 'mail', 'me', 'model', 'name', 'of', 'on', 'or', 'other', 'out', 'please', 'production', 'really', 'rest', 'saw', 'separate', 'small', 'specs', 'sports', 'tellme', 'the', 'there', 'this', 'to', 'was', 'were', 'whatever', 'where', 'wondering', 'years', 'you']


In [None]:
# confirming first and last word
print(words[14106])
print(words[113262])

60s
you


In [None]:
# Printing the most frequent words in the car post
data = {"Word": unique_words, "Count": tf_vector[non_zero_indices]}
df = pd.DataFrame(data).sort_values("Count", ascending=False)
print(df[:10].to_string(index=False))

  Word  Count
   the      6
  this      4
   was      4
   car      4
    if      2
    is      2
    it      2
  from      2
    on      2
anyone      2


In [None]:
# free memory
del tf_matrix
del tf_np_matrix
del tf_vector

The common words are a source of noise and increase the likelihood
that two unrelated documents will cluster together. 

NLP practitioners refer to
such noisy words as stop words because they are blocked from appearing in the vectorized
results. 

Stop words are generally deleted from the text before vectorization.

In [21]:
# Removing stop words during vectorization
vectorizer = CountVectorizer(stop_words="english")
tf_matrix = vectorizer.fit_transform(newsgroups.data)
assert tf_matrix.shape[1] < 114751 

# Common stop words have been filtered out
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out().tolist()
for common_word in ["the", "this", "is", "was", "if", "it", "on"]:
  assert common_word not in words

In [22]:
# Reprinting the top words after stop-word deletion
tf_np_matrix = tf_matrix.toarray()
tf_vector = tf_np_matrix[0]
non_zero_indices = np.flatnonzero(tf_vector)
unique_words = [words[i] for i in non_zero_indices]

data = {"Word": unique_words, "Count": tf_vector[non_zero_indices]}
df = pd.DataFrame(data).sort_values("Count", ascending=False)
print(f"After stop-word deletion, {df.shape[0]} unique words remain.")
print(f"The 10 most frequent words are:\n")
print(df[:10].to_string(index=False))

After stop-word deletion, 34 unique words remain.
The 10 most frequent words are:

      Word  Count
       car      4
       60s      1
       saw      1
   looking      1
      mail      1
     model      1
production      1
    really      1
      rest      1
  separate      1


##Ranking words

Each of the 34 words in df.Word appears in a certain fraction of newsgroup posts. In
NLP, this fraction is referred to as the document frequency of a word. 

We hypothesize
that the document frequencies can improve our word rankings.

We can compute these frequencies
using a series of NumPy matrix manipulations.
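
The idea can be previewed on a small, made-up count matrix before we apply it to the real data. In this sketch, `toy_counts` is hypothetical: four posts (rows) and three words (columns).

```python
import numpy as np
from sklearn.preprocessing import binarize

# Hypothetical word counts: 4 posts (rows) by 3 words (columns)
toy_counts = np.array([[2, 0, 1],
                       [1, 0, 0],
                       [0, 3, 1],
                       [4, 0, 0]])

# 1 if the word is present in the post, 0 otherwise
presence = binarize(toy_counts)

# Number of posts that mention each word
post_mentions = presence.sum(axis=0)

# Fraction of posts that mention each word
toy_document_frequencies = post_mentions / toy_counts.shape[0]
print(post_mentions)
print(toy_document_frequencies)
```

The first word appears in three of the four posts, so its document frequency is 0.75, regardless of how many times it occurs within each post.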

In [23]:
# Filtering matrix columns with non_zero_indices
sub_matrix = tf_np_matrix[:, non_zero_indices]
print("Our sub-matrix corresponds to the 34 words within post 0.\nThe first row of the sub-matrix is:")
print(sub_matrix[0])

Our sub-matrix corresponds to the 34 words within post 0.
The first row of the sub-matrix is:
[1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


However, we are not currently interested
in exact word counts: we just want to know whether each word is present or
absent from each post. 

So, we need to convert our counts into binary values.

In [24]:
# Converting word counts to binary values
binary_matrix = binarize(sub_matrix)
print(binary_matrix)

[[1 1 1 ... 1 1 1]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


Now we need to add together the rows of our binary submatrix. 

Doing so will produce
a vector of integer counts.

In [25]:
# Summing matrix rows to obtain post counts
unique_post_mentions = binary_matrix.sum(axis=0)
print(f"This vector counts the unique posts in which each word is mentioned:\n {unique_post_mentions}")

This vector counts the unique posts in which each word is mentioned:
 [  18   21  202  314    4   26  802  536  842  154   67  348  184   25
    7  368  469 3093  238  268  780  901  292   95 1493  407  354  158
  574   95   98    2  295 1174]


In [26]:
# Computing post mention counts in a single line of code
np_post_mentions = binarize(tf_np_matrix[:, non_zero_indices]).sum(axis=0)
csr_post_mentions = binarize(tf_matrix[:, non_zero_indices]).sum(axis=0)

print(f"NumPy matrix-generated counts:\n {np_post_mentions}\n")
print(f"CSR matrix-generated counts:\n {csr_post_mentions}")

NumPy matrix-generated counts:
 [  18   21  202  314    4   26  802  536  842  154   67  348  184   25
    7  368  469 3093  238  268  780  901  292   95 1493  407  354  158
  574   95   98    2  295 1174]

CSR matrix-generated counts:
 [[  18   21  202  314    4   26  802  536  842  154   67  348  184   25
     7  368  469 3093  238  268  780  901  292   95 1493  407  354  158
   574   95   98    2  295 1174]]


The counts in `np_post_mentions` and `csr_post_mentions` are identical, although the CSR result is a 1-by-34 matrix (note the double brackets) rather than a flat array.
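
If a flat array is needed from the sparse computation, one option is to convert the result with `np.asarray` and flatten it. The sketch below uses a hypothetical `toy_counts` matrix to compare the two routes.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import binarize

# Hypothetical word counts: 3 posts by 3 words
toy_counts = np.array([[2, 0, 1],
                       [1, 0, 0],
                       [0, 3, 1]])

# Dense route: the column sums come back as a flat NumPy array
dense_mentions = binarize(toy_counts).sum(axis=0)

# Sparse route: summing a csr_matrix returns a two-dimensional result
sparse_mentions = binarize(csr_matrix(toy_counts)).sum(axis=0)

# Flatten the sparse result so both outputs share the same shape
flat_sparse_mentions = np.asarray(sparse_mentions).ravel()
print(dense_mentions.shape, sparse_mentions.shape, flat_sparse_mentions.shape)
```

The sparse route avoids building the full dense matrix, which matters for a vocabulary as large as ours.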

Let’s transform
these counts into document frequencies and align the frequencies with `df.Word`.

In [27]:
# Printing the words with the highest document frequency
document_frequencies = unique_post_mentions / dataset_size
data = {"Word": unique_words, "Count": tf_vector[non_zero_indices], "DF": document_frequencies}

df = pd.DataFrame(data)
# choose words with a document frequency of at least 1/10
df_common_words = df[df["DF"] >= .1]
print(df_common_words.to_string(index=False))

  Word  Count       DF
  know      1 0.273378
really      1 0.131960
 years      1 0.103765


As expected, these words are very general and not car specific. We thus can utilize document frequencies for ranking purposes.

Let’s rank our words by relevance in the following
manner. First, we sort the words by count, from greatest to smallest. 

Then, all words
with equal count are sorted by document frequency, from smallest to greatest.

In [28]:
# Ranking words by both count and document frequency
df_sorted = df.sort_values(["Count", "DF"], ascending=[False, True])
print(df_sorted[:10].to_string(index=False))

      Word  Count       DF
       car      4 0.047375
    tellme      1 0.000177
  bricklin      1 0.000354
     funky      1 0.000619
       60s      1 0.001591
       70s      1 0.001856
 enlighten      1 0.002210
    bumper      1 0.002298
     doors      1 0.005922
production      1 0.008397


Our sorting was successful. New car-related words, such as bumper, are now present in
our list of top-ranked words.

However, the actual sorting procedure was rather convoluted:
it required us to sort two columns separately. 

Perhaps we can simplify the process
by combining the word counts and document frequencies into a single score.

How can we do this? 

One approach is to divide each word count by its associated document
frequency. 

The resulting value will increase if either of the following is true:
* The word count goes up.
* The document frequency goes down.

Let’s combine the word counts and the document frequencies into a single score. We
start by computing 1 / document_frequencies. 

Doing so produces an array of inverse
document frequencies (IDFs). 

Next, we multiply df.Count by the IDF array to compute
the combined score.

In [29]:
# Combining counts and frequencies into a single score
inverse_document_frequencies = 1 / document_frequencies
df["IDF"] = inverse_document_frequencies
df["Combined"] = df.Count * inverse_document_frequencies
df_sorted = df.sort_values("Combined", ascending=False)
print(df_sorted[:10].to_string(index=False))

      Word  Count       DF         IDF    Combined
    tellme      1 0.000177 5657.000000 5657.000000
  bricklin      1 0.000354 2828.500000 2828.500000
     funky      1 0.000619 1616.285714 1616.285714
       60s      1 0.001591  628.555556  628.555556
       70s      1 0.001856  538.761905  538.761905
 enlighten      1 0.002210  452.560000  452.560000
    bumper      1 0.002298  435.153846  435.153846
     doors      1 0.005922  168.865672  168.865672
     specs      1 0.008397  119.094737  119.094737
production      1 0.008397  119.094737  119.094737


Our new ranking failed! The word car no longer appears at the top of the list.

There is a problem with the IDF values: some of them are huge!

Meanwhile, our word-count range is very small: from 1 to 4. 

Thus, when we multiply
word counts by IDF values, the IDF dominates, and the counts have no impact on the final results. We need to somehow make our IDF values smaller. 

What should we do?

Data scientists are commonly confronted with numeric values that are too large.

**One way to shrink the values is to apply a logarithmic function.**

In [30]:
# Shrinking a large value using its logarithm
assert np.log10(1000000) == 6
assert np.log10(10000) == 4
assert np.log10(100) == 2
assert np.log10(10) == 1
assert np.log10(1) == 0
assert np.log10(0) == -np.inf
assert math.isnan(np.log10(-1))  # the log of a negative number is undefined (NaN)

Let’s recompute our ranking score.

In [31]:
# Adjusting the combined score using logarithms
df["Combined"] = df.Count * np.log10(df.IDF)
df_sorted = df.sort_values("Combined", ascending=False)
print(df_sorted[:10].to_string(index=False))

     Word  Count       DF         IDF  Combined
      car      4 0.047375   21.108209  5.297806
   tellme      1 0.000177 5657.000000  3.752586
 bricklin      1 0.000354 2828.500000  3.451556
    funky      1 0.000619 1616.285714  3.208518
      60s      1 0.001591  628.555556  2.798344
      70s      1 0.001856  538.761905  2.731397
enlighten      1 0.002210  452.560000  2.655676
   bumper      1 0.002298  435.153846  2.638643
    doors      1 0.005922  168.865672  2.227541
    specs      1 0.008397  119.094737  2.075893


Our adjusted ranking score has yielded good results. The word car is once again present
at the top of the ranked list. 

Also, bumper still appears among the top 10 ranked
words. Meanwhile, really is missing from the list.
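
To see why the logarithm restores car to the top, we can redo the arithmetic for just two words. The post counts below (536 for car, 2 for tellme) are read off the document frequencies printed earlier; the variable names are chosen only for this sketch.

```python
import numpy as np

num_posts_total = 11314                       # size of the newsgroup dataset
word_counts = {"car": 4, "tellme": 1}         # counts within post 0
post_mentions = {"car": 536, "tellme": 2}     # posts mentioning each word

raw_scores, log_scores = {}, {}
for word, count in word_counts.items():
    idf = num_posts_total / post_mentions[word]
    raw_scores[word] = count * idf            # the IDF dominates the count
    log_scores[word] = count * np.log10(idf)  # the log keeps the count relevant

print(raw_scores)
print(log_scores)
```

Without the logarithm, the IDF of tellme is so large that the count of 4 for car cannot compensate. After taking the logarithm, the two IDF values are close enough that the count difference decides the ranking.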

**Our effective score is called the term frequency-inverse document frequency (TFIDF)**.

The TFIDF can be computed by taking the product of the TF (word count) and the
log of the IDF.

Mathematically, $\log(1 / x)$ is equal to $-\log(x)$. 

Therefore, we
can compute the TFIDF directly from the document frequencies.
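
Written out as a formula, with $\mathrm{TF}(w, d)$ denoting the count of word $w$ in document $d$ and $\mathrm{DF}(w)$ denoting its document frequency (notation introduced here for illustration only), the score used in this notebook is:

$$
\mathrm{TFIDF}(w, d) = \mathrm{TF}(w, d) \cdot \log_{10}\left(\frac{1}{\mathrm{DF}(w)}\right) = -\,\mathrm{TF}(w, d) \cdot \log_{10} \mathrm{DF}(w)
$$

For the word car in post 0, this gives $4 \cdot \log_{10}(11314 / 536) \approx 4 \cdot 1.3245 \approx 5.298$, which matches the `Combined` value in the table above.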

In [32]:
df["Combined"] = df.Count * -np.log10(document_frequencies)
df_sorted = df.sort_values("Combined", ascending=False)
print(df_sorted[:10].to_string(index=False))

     Word  Count       DF         IDF  Combined
      car      4 0.047375   21.108209  5.297806
   tellme      1 0.000177 5657.000000  3.752586
 bricklin      1 0.000354 2828.500000  3.451556
    funky      1 0.000619 1616.285714  3.208518
      60s      1 0.001591  628.555556  2.798344
      70s      1 0.001856  538.761905  2.731397
enlighten      1 0.002210  452.560000  2.655676
   bumper      1 0.002298  435.153846  2.638643
    doors      1 0.005922  168.865672  2.227541
    specs      1 0.008397  119.094737  2.075893


The TFIDF is a simple but powerful metric for ranking words in a document. Of
course, the metric is only relevant if that document is part of a larger document group. 

Otherwise, the computed TFIDF values all equal zero.
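
The following sketch illustrates that edge case with a single invented text. Every word appears in the only available document, so every document frequency equals 1 and every log term vanishes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus containing just one document
lone_text = ["this funky car has a funky bumper"]
lone_tf = CountVectorizer().fit_transform(lone_text).toarray()

# Every word appears in the one and only document, so each DF is 1.0
lone_df = (lone_tf > 0).sum(axis=0) / lone_tf.shape[0]

# log10(1 / 1.0) is 0, which zeroes out every TFIDF value
lone_tfidf = lone_tf * np.log10(1 / lone_df)
print(lone_df)
print(lone_tfidf)
```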

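The TFIDF also has
additional uses: it can be utilized to vectorize the words in a document. The values in `df.Combined` are essentially a vector, obtained by rescaling the TF vector stored in `df.Count`.

In this same manner, we can transform any TF vector into a
TFIDF vector. We just need to multiply the TF vector by the log of the inverse document frequencies.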

Is there a benefit to transforming TF vectors into more complicated TFIDF vectors?

Yes! In larger text datasets, TFIDF vectors provide a greater signal of textual similarity
and divergence.

For example, two texts that are both discussing cars are more
likely to cluster together if their irrelevant vector elements are penalized.

Thus, **penalizing common words using the IDF improves the clustering of large text collections.**

We therefore stand to gain by transforming our TF matrix into a TFIDF matrix.
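
A manual version of that transformation might look as follows. This is a sketch on a hypothetical `toy_tf` matrix using the base-10 score from this notebook; scikit-learn's own implementation uses a different IDF variant.

```python
import numpy as np

# Hypothetical TF matrix: 4 posts (rows) by 3 words (columns)
toy_tf = np.array([[2, 0, 1],
                   [1, 0, 0],
                   [0, 3, 1],
                   [4, 0, 0]])

# Document frequency of each word (fraction of posts that contain it)
toy_df = (toy_tf > 0).sum(axis=0) / toy_tf.shape[0]

# One log-IDF weight per column; rarer words receive larger weights
toy_idf_weights = np.log10(1 / toy_df)

# Broadcasting multiplies every row of the TF matrix by the weights
toy_tfidf = toy_tf * toy_idf_weights
print(np.round(toy_tfidf, 3))
```

The second word appears in only one post, so its column is weighted most heavily, while the first word, present in three of the four posts, is penalized the most.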

In [33]:
# free memory
del tf_matrix
del tf_np_matrix
del tf_vector

##Computing TF-IDF vectors

Scikit-learn's `TfidfVectorizer` class is nearly identical to `CountVectorizer`, except that it
takes IDF into account during the vectorization process.

In [None]:
# Computing a TFIDF matrix with scikit-learn
tfidf_vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf_vectorizer.fit_transform(newsgroups.data)

#assert tfidf_matrix.shape == tf_matrix.shape

In [37]:
# Confirming the preservation of vectorized word indices
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
assert tfidf_vectorizer.get_feature_names_out().tolist() == words

Since word order is preserved, we should expect the nonzero indices of
`tfidf_matrix[0]` to equal our previously computed `non_zero_indices` array.

In [38]:
# Confirming the preservation of nonzero indices
tfidf_np_matrix = tfidf_matrix.toarray()
tfidf_vector = tfidf_np_matrix[0]
tfidf_non_zero_indices = np.flatnonzero(tfidf_vector)

# The nonzero indices of tf_vector and tfidf_vector are identical.
assert np.array_equal(tfidf_non_zero_indices, non_zero_indices)

In [39]:
# Adding a TFIDF vector to the existing Pandas table
df["TF-IDF"] = tfidf_vector[non_zero_indices]

Sorting by `df["TF-IDF"]` should produce a relevance ranking that is consistent with our
previous observations. 

Let’s verify that both `df["TF-IDF"]` and `df.Combined` produce the
same word rankings after sorting.

In [41]:
# Sorting words by df["TF-IDF"]
df_sorted_old = df.sort_values("Combined", ascending=False)
df_sorted_new = df.sort_values("TF-IDF", ascending=False)

assert np.array_equal(df_sorted_old["Word"].values, df_sorted_new["Word"].values)

print(df_sorted_new[:10].to_string(index=False))

     Word  Count       DF         IDF  Combined   TF-IDF
      car      4 0.047375   21.108209  5.297806 0.459552
   tellme      1 0.000177 5657.000000  3.752586 0.262118
 bricklin      1 0.000354 2828.500000  3.451556 0.247619
    funky      1 0.000619 1616.285714  3.208518 0.234280
      60s      1 0.001591  628.555556  2.798344 0.209729
      70s      1 0.001856  538.761905  2.731397 0.205568
enlighten      1 0.002210  452.560000  2.655676 0.200827
   bumper      1 0.002298  435.153846  2.638643 0.199756
    doors      1 0.005922  168.865672  2.227541 0.173540
    specs      1 0.008397  119.094737  2.075893 0.163752


Our word rankings have remained unchanged. However, the values of the `TF-IDF` and `Combined` columns are not identical.

Why is this the case?

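As it turns out, scikit-learn automatically normalizes its TFIDF vector results. (Scikit-learn also computes the IDF with a smoothed natural logarithm rather than a base-10 logarithm, so the raw values would differ even without normalization.)

The magnitude of `df["TF-IDF"]` has been modified to equal 1. 

We can confirm by calling
`norm(df["TF-IDF"].values)`.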

In [44]:
# Confirming that our TFIDF vector is normalized
assert np.isclose(norm(df["TF-IDF"].values), 1)  # allow for floating-point rounding

Why would scikit-learn automatically normalize the vectors?

As discussed, it’s easier to compute text vector similarity when all vector
magnitudes equal 1. 

Consequently, our normalized TFIDF matrix is primed for similarity analysis.
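
To make this concrete: when every row has unit length, the cosine similarity between two texts is simply their dot product, so one matrix multiplication yields all pairwise similarities. The sketch below checks this on invented `toy_texts` against scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical texts: two about cars, one about politics
toy_texts = [
    "the car engine needs a new bumper",
    "a funky car with small doors",
    "the senate vote on the election law",
]

# Rows are normalized to unit length by default
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_texts)

# For unit-length rows, the dot product equals the cosine similarity
dot_products = (toy_tfidf @ toy_tfidf.T).toarray()
cosines = cosine_similarity(toy_tfidf)
print(np.round(dot_products, 3))
```

The two car texts share a word and therefore score above zero, whereas the car and politics texts have no non-stop words in common.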

>To turn off normalization, we must pass `norm=None` into the vectorizer’s
initialization function. Running `TfidfVectorizer(norm=None, stop_words='english')`
returns a vectorizer in which normalization has been deactivated.
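
The effect of the `norm` parameter can be checked by comparing row magnitudes with and without normalization. The `toy_texts` below are invented for this sketch.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus
toy_texts = ["the car engine needs a bumper", "a funky car with small doors"]

# Default behavior: every row is scaled to unit length
normalized = TfidfVectorizer(stop_words="english").fit_transform(toy_texts)

# norm=None leaves the raw TFIDF values untouched
raw = TfidfVectorizer(norm=None, stop_words="english").fit_transform(toy_texts)

normalized_lengths = np.linalg.norm(normalized.toarray(), axis=1)
raw_lengths = np.linalg.norm(raw.toarray(), axis=1)
print(normalized_lengths)
print(raw_lengths)
```

With normalization disabled, the row magnitudes depend on text length and word rarity, so dot products between rows no longer equal cosine similarities.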

##Computing document similarities