# Intro to Topic Modeling 

We will be using the 20 newsgroups dataset as an example.

In [1]:
import numpy as np
import pandas as pd
import pickle as pkl

## Load 20 newsgroups dataset

In [2]:
# Open the pickle in a context manager so the file handle is closed after loading
with open('20news.pkl', 'rb') as f:
    docs_raw = pkl.load(f)

In [3]:
docs_raw[0]

"Hello World,\n\t     just bought a new Stealth two weeks ago. Got a grad student \n rebate. Someone told me that there's another $400 reabet for 1st time\n Chrysler buyer. True ? If yes can I still get it or am I too late ?\n"

## Mining $k$ Topical Terms from Collection $C$

Use TF-IDF as the scoring function: a term is treated as topical if its TF-IDF weight is high in at least one document of the collection.

With [sklearn.feature_extraction.text.TfidfVectorizer()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html):

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(lowercase=True,
                             stop_words="english",
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             max_df=0.5,
                             min_df=10)
X = vectorizer.fit_transform(docs_raw).toarray()  # dense (n_docs, n_terms) TF-IDF matrix
terms_name = vectorizer.get_feature_names_out()

In [5]:
# Keep every term whose highest TF-IDF weight in any document exceeds 0.55
topics = terms_name[np.max(X, axis=0) > 0.55]

In [6]:
print(len(topics))

541


## Computing Topic Coverage $\pi_{ij}$ of the $i$-th Document
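
The notebook does not write the definition out. One common count-based form, which appears to match what the cells in this section compute, is given here as a sketch. The symbols $\theta_j$ (the $j$-th topical term), $d_i$ (the $i$-th document), and $c(\theta_j, d_i)$ (the number of times $\theta_j$ occurs in $d_i$) are notation assumed for illustration:

$$\pi_{ij} = \frac{c(\theta_j, d_i)}{\sum_{l=1}^{k} c(\theta_l, d_i)}$$

Under this definition the coverages of one document sum to 1 whenever the document contains at least one topical term.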

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(vocabulary=topics)
count_X = count_vectorizer.fit_transform(docs_raw)

In [8]:
count_X.shape

(3273, 541)

In [9]:
np.sum(count_X, axis=1)  # normalization denominator for each document

matrix([[ 5],
        [ 7],
        [ 1],
        ...,
        [14],
        [ 4],
        [11]])

In [10]:
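# Use 1 as the denominator for documents with no topical terms, so their coverage is 0 rather than NaN
row_totals = np.maximum(np.asarray(count_X.sum(axis=1)), 1)
coverage = count_X.toarray() / row_totals
coverage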

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [11]:
a = np.array([[1,1], [2,2], [3,3]])
np.sum(a, axis=1)

array([2, 4, 6])
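
On a plain NumPy array, `np.sum(..., axis=1)` returns a 1-D array, unlike the column-shaped matrix produced from the sparse count matrix earlier. The sketch below shows one way to normalize rows of a plain array with `keepdims=True`; the counts and the names `counts`, `totals`, `safe_totals`, and `toy_coverage` are hypothetical.

```python
import numpy as np

# Hypothetical term counts: 3 documents x 2 topical terms
counts = np.array([[1, 1],
                   [3, 1],
                   [0, 0]])  # the last document contains no topical term

# keepdims=True keeps the totals as a column of shape (3, 1),
# so broadcasting divides each row by its own total
totals = counts.sum(axis=1, keepdims=True)

# Replace zero totals with 1 so empty rows stay at 0 instead of becoming NaN
safe_totals = np.where(totals == 0, 1, totals)
toy_coverage = counts / safe_totals

print(toy_coverage)
```

Without `keepdims=True` the totals would have shape `(3,)`, which does not broadcast against a `(3, 2)` array in the intended row-wise way.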

In [12]:
doc = -1
print(docs_raw[doc])

Here is a story.  I bought a car about two weeks ago.  I finally can
get hold of the previous owner of the car and got all maintanence
history of the car.  In between '91 and '92, the instrument pannel 
of the car has been replaced and the odometer also has been reset
to zero.  Therefore, the true meter reading is the reading before
replacement plus current mileage.  That shows 35000 mile difference
comparing to the mileage on the odometer disclosure from.  The 
dealer never told me anything about that important story.

I hope that I can return the car with full refund.  Do u think this
is possible?  Does anyone have similar experiences?  Any comments
will be appreciated.  Thanks.


In [13]:
nonzero_index = np.array(coverage[doc] != 0).flatten()
print('This Doc, topic:', topics[nonzero_index])
print('This Doc, coverage:', coverage[doc][nonzero_index])

This Doc, topic: ['car' 'dealer' 'difference' 'similar' 'thanks' 'think' 'weeks']
This Doc, coverage: [0.45454545 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909
 0.09090909]
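
As a check on these numbers: the per-document totals printed earlier end with 11 for this last document, "car" occurs five times in the post, and each of the other six terms appears to occur once. Dividing each count by the document total gives:

$$\pi_{\text{car}} = \frac{5}{11} \approx 0.4545, \qquad \pi_{\text{dealer}} = \frac{1}{11} \approx 0.0909$$

The seven coverages add up to $(5 + 6 \cdot 1)/11 = 1$.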
