<a href="https://colab.research.google.com/github/rahiakela/data-science-research-and-practice/blob/main/data-science-bookcamp/case-study-4--job-resume-improvement/03_large_text_analysis_using_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Large text analysis using Clustering

In this notebook, we apply NLP techniques to large collections
of real-world texts. This type of analysis is seemingly straightforward, given the
techniques presented thus far. For example, suppose we’re doing market research
across multiple online discussion forums. Each forum is composed of hundreds of
users who discuss a specific topic, such as politics, fashion, technology, or cars. We
want to automatically extract all the discussion topics based on the contents of the
user conversations. These extracted topics will be used to plan a marketing campaign,
which will target users based on their online interests.

How do we cluster user discussions into topics? 

One approach would be to do the following:
1. Convert all discussion texts into a matrix of word counts.
2. Reduce the dimensionality of the word-count matrix using singular value decomposition (SVD). This will allow us to efficiently compute all-pairs text similarities with matrix multiplication.
3. Utilize the matrix of text similarities to cluster the discussions into topics.
4. Explore the topic clusters to identify useful topics for our marketing campaign.
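
The first three steps can be sketched on a toy corpus. This is only an illustration under stated assumptions, not the notebook's implementation: the four `docs` are made up, NumPy's exact `np.linalg.svd` stands in for scikit-learn's `TruncatedSVD`, and cosine similarity on the reduced vectors is an assumed choice of similarity measure.

```python
from collections import Counter
import numpy as np

# Hypothetical toy corpus: two posts about cars, two about hockey.
docs = [
    "car engine loud",
    "car engine new",
    "hockey game great season",
    "hockey game night season",
]

# Step 1: build a matrix of word counts (one row per document).
vocabulary = sorted({word for doc in docs for word in doc.split()})
column = {word: i for i, word in enumerate(vocabulary)}
counts = np.zeros((len(docs), len(vocabulary)))
for row, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        counts[row, column[word]] = count

# Step 2: reduce to 2 dimensions with SVD, then scale rows to unit length.
U, S, _ = np.linalg.svd(counts, full_matrices=False)
reduced = U[:, :2] * S[:2]
reduced /= np.linalg.norm(reduced, axis=1, keepdims=True)

# Step 3 (input): all-pairs cosine similarities from one matrix product.
similarities = reduced @ reduced.T
print(np.round(similarities, 2))
```

Posts on the same topic should score close to 1 and posts on different topics close to 0. A clustering algorithm would then group the rows of this similarity matrix.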



##Setup

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from collections import defaultdict
from collections import Counter
import time
import numpy as np
import pandas as pd
from math import sin, cos

from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.datasets import fetch_20newsgroups

import seaborn as sns
import matplotlib.pyplot as plt

##20Newsgroup dataset

Usenet is a well-established online collection
of discussion forums, which are called newsgroups. Each individual
newsgroup focuses on some topic of discussion, which is briefly outlined in the newsgroup name.

We can load these newsgroup posts by importing `fetch_20newsgroups` from `sklearn.datasets`.

In [3]:
# Fetching the newsgroup dataset
newsgroups = fetch_20newsgroups(remove=("headers", "footers"))

The `newsgroups` object contains posts from 20 different newsgroups.

In [4]:
# Printing the names of all 20 newsgroups
print(newsgroups.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [5]:
print(len(newsgroups.target_names))

20


Now, let’s turn our attention to the actual newsgroup texts, which are stored as a list in the `newsgroups.data` attribute.

In [6]:
# Printing the first newsgroup post
print(newsgroups.data[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


In [7]:
# Printing the newsgroup name at index 0
origin = newsgroups.target_names[newsgroups.target[0]]
print(f"The post at index 0 first appeared in the '{origin}' group.")

The post at index 0 first appeared in the 'rec.autos' group.


Let’s dive deeper into our newsgroup dataset by printing out the dataset size.

In [8]:
# Counting the number of newsgroup posts
dataset_size = len(newsgroups.data)
print(f"Our dataset contains {dataset_size} newsgroup posts")

Our dataset contains 11314 newsgroup posts


##Vectorizing documents

To compute newsgroup post similarities efficiently,
we need to represent our text data as a
matrix.

To do so, we need to transform each newsgroup post into a term-frequency
(TF) vector.
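
To make the idea concrete, here is a minimal sketch of building TF vectors by hand. The two `posts` are made up, and the `tokenize` helper is hypothetical: its lowercasing and its rule of keeping only tokens with two or more word characters are assumptions modeled on `CountVectorizer`'s default settings.

```python
import re
from collections import Counter

# Hypothetical two-post corpus.
posts = ["The car was a sports car.", "Was the engine loud?"]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters,
    # an assumption modeled on CountVectorizer's default settings.
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary assigns one column index to each unique word.
vocabulary = sorted({token for post in posts for token in tokenize(post)})

# Each TF vector holds the count of every vocabulary word in one post.
tf_vectors = []
for post in posts:
    counts = Counter(tokenize(post))
    tf_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in tf_vectors:
    print(vector)
```

Each row has one entry per vocabulary word, so most entries are zero once the vocabulary grows large.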

Scikit-learn provides a `CountVectorizer` class for transforming input texts into TF vectors.

In [9]:
# Computing a TF matrix
vectorizer = CountVectorizer()
tf_matrix = vectorizer.fit_transform(newsgroups.data)
print(tf_matrix)

  (0, 108644)	4
  (0, 110106)	1
  (0, 57577)	2
  (0, 24398)	2
  (0, 79534)	1
  (0, 100942)	1
  (0, 37154)	1
  (0, 45141)	1
  (0, 70570)	1
  (0, 78701)	2
  (0, 101084)	4
  (0, 32499)	4
  (0, 92157)	1
  (0, 100827)	6
  (0, 79461)	1
  (0, 39275)	1
  (0, 60326)	2
  (0, 42332)	1
  (0, 96432)	1
  (0, 67137)	1
  (0, 101732)	1
  (0, 27703)	1
  (0, 49871)	2
  (0, 65338)	1
  (0, 14106)	1
  :	:
  (11313, 55901)	1
  (11313, 93448)	1
  (11313, 97535)	1
  (11313, 93393)	1
  (11313, 109366)	1
  (11313, 102215)	1
  (11313, 29148)	1
  (11313, 26901)	1
  (11313, 94401)	1
  (11313, 89686)	1
  (11313, 80827)	1
  (11313, 72219)	1
  (11313, 32984)	1
  (11313, 82912)	1
  (11313, 99934)	1
  (11313, 96505)	1
  (11313, 72102)	1
  (11313, 32981)	1
  (11313, 82692)	1
  (11313, 101854)	1
  (11313, 66399)	1
  (11313, 63405)	1
  (11313, 61366)	1
  (11313, 7462)	1
  (11313, 109600)	1


In [10]:
# Checking the data type
print(type(tf_matrix))

<class 'scipy.sparse.csr.csr_matrix'>


In [11]:
# Converting a CSR matrix to a NumPy array
tf_np_matrix = tf_matrix.toarray()
print(tf_np_matrix)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [12]:
# Checking the vocabulary size
assert tf_np_matrix.shape == tf_matrix.shape
num_posts, vocabulary_size = tf_np_matrix.shape
print(f"Our collection of {num_posts} newsgroup posts contains a total of {vocabulary_size} unique words")

Our collection of 11314 newsgroup posts contains a total of 114751 unique words
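
The shape above explains why the TF matrix is stored in a sparse format. The arithmetic below is a rough estimate that assumes each count occupies 8 bytes (a 64-bit integer). The dictionary of coordinates is a simplified stand-in for a sparse layout, not SciPy's actual CSR implementation, and `dense_rows` is a made-up example.

```python
# Shape reported above: 11314 posts by 114751 vocabulary words.
num_posts, vocabulary_size = 11314, 114751
bytes_per_count = 8  # assumption: counts stored as 64-bit integers

dense_entries = num_posts * vocabulary_size
dense_gigabytes = dense_entries * bytes_per_count / 1e9
print(f"Dense array: {dense_entries:,} entries, about {dense_gigabytes:.1f} GB")

# A sparse layout stores only non-zero counts as (row, column) -> count,
# mirroring the "(0, 108644)  4" lines printed for the CSR matrix.
dense_rows = [[0, 2, 0, 0, 1], [0, 0, 0, 3, 0]]  # hypothetical tiny TF matrix
sparse = {
    (i, j): count
    for i, row in enumerate(dense_rows)
    for j, count in enumerate(row)
    if count
}
print(sparse)
```

Under that assumption, the dense array needs roughly 10 GB, so calling `toarray()` on the full matrix may exhaust memory on smaller machines.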


In [13]:
# Counting the unique words in the car post
tf_vector = tf_np_matrix[0]
non_zero_indices = np.flatnonzero(tf_vector)
num_unique_words = non_zero_indices.size

print(f"The newsgroup post in row 0 contains {num_unique_words} unique words.")
print("The actual word counts map to the following column indices:\n")
print(non_zero_indices)

The newsgroup post in row 0 contains 64 unique words.
The actual word counts map to the following column indices:

[ 14106  15549  22088  23323  24398  27703  29357  30093  30629  32194
  32305  32499  37154  39275  42332  42333  43643  45089  45141  49871
  49881  50165  54442  55453  57577  58321  58842  60116  60326  64083
  65338  67137  67140  68931  69080  70570  72915  75280  78264  78701
  79055  79461  79534  82759  84398  87690  89161  92157  93304  95225
  96145  96432 100406 100827 100942 101084 101732 108644 109086 109254
 109294 110106 112936 113262]
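
`np.flatnonzero` returns the indices of the non-zero elements of a flattened array, which is why the size of its output equals the number of unique words in the post. A small illustration on a made-up vector:

```python
import numpy as np

# Hypothetical TF vector over a six-word vocabulary.
tf_vector = np.array([0, 3, 0, 1, 0, 2])

# Indices of the non-zero counts, in ascending order.
non_zero_indices = np.flatnonzero(tf_vector)
print(non_zero_indices)             # column indices of words that occur
print(tf_vector[non_zero_indices])  # the matching counts
```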


Let's find a mapping between TF vector indices and word values.

In [14]:
# Printing the unique words in the car post
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
words = vectorizer.get_feature_names_out()
unique_words = [words[i] for i in non_zero_indices]
print(unique_words)

['60s', '70s', 'addition', 'all', 'anyone', 'be', 'body', 'bricklin', 'bumper', 'called', 'can', 'car', 'could', 'day', 'door', 'doors', 'early', 'engine', 'enlighten', 'from', 'front', 'funky', 'have', 'history', 'if', 'in', 'info', 'is', 'it', 'know', 'late', 'looked', 'looking', 'made', 'mail', 'me', 'model', 'name', 'of', 'on', 'or', 'other', 'out', 'please', 'production', 'really', 'rest', 'saw', 'separate', 'small', 'specs', 'sports', 'tellme', 'the', 'there', 'this', 'to', 'was', 'were', 'whatever', 'where', 'wondering', 'years', 'you']


In [15]:
# Confirming the first and last words
print(words[14106])
print(words[113262])

60s
you


In [16]:
# Printing the most frequent words in the car post
data = {"Word": unique_words, "Count": tf_vector[non_zero_indices]}
df = pd.DataFrame(data).sort_values("Count", ascending=False)
print(df[:10].to_string(index=False))

  Word  Count
   the      6
  this      4
   was      4
   car      4
    if      2
    is      2
    it      2
  from      2
    on      2
anyone      2


Common words such as "the", "this", and "was" are a source of noise: they appear in nearly every post, which increases the likelihood
that two unrelated documents will cluster together. 

NLP practitioners refer to
such noisy words as stop words because they are blocked from appearing in the vectorized
results. 

Stop words are generally deleted from the text before vectorization.
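
A minimal sketch of stop-word deletion before counting. The tiny `stop_words` set and the sample sentence are hypothetical stand-ins; scikit-learn's built-in English list is much longer.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
stop_words = {"the", "this", "was", "is", "if", "it", "on", "to", "and"}

tokens = "the car was called bricklin and this car was small".split()

# Delete stop words first, then count what remains.
content_tokens = [token for token in tokens if token not in stop_words]
counts = Counter(content_tokens)
print(counts.most_common(3))
```

Once the filler words are gone, the topical word "car" is the most frequent remaining token.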

In [9]:
# Removing stop words during vectorization
vectorizer = CountVectorizer(stop_words="english")
tf_matrix = vectorizer.fit_transform(newsgroups.data)
assert tf_matrix.shape[1] < 114751 

# Common stop words have been filtered out
words = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for common_word in ["the", "this", "is", "was", "if", "it", "on"]:
  assert common_word not in words

In [10]:
# Reprinting the top words after stop-word deletion
tf_np_matrix = tf_matrix.toarray()
tf_vector = tf_np_matrix[0]
non_zero_indices = np.flatnonzero(tf_vector)
unique_words = [words[i] for i in non_zero_indices]

data = {"Word": unique_words, "Count": tf_vector[non_zero_indices]}
df = pd.DataFrame(data).sort_values("Count", ascending=False)
print(f"After stop-word deletion, {df.shape[0]} unique words remain.")
print("The 10 most frequent words are:\n")
print(df[:10].to_string(index=False))

After stop-word deletion, 34 unique words remain.
The 10 most frequent words are:

      Word  Count
       car      4
       60s      1
       saw      1
   looking      1
      mail      1
     model      1
production      1
    really      1
      rest      1
  separate      1


##Ranking words