In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups

# remove headers, footers, and quotes of other posts, to prevent overfitting on unhelpful features
dataset = fetch_20newsgroups(shuffle=True, random_state=2, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

In [2]:
documents[0]

"Something about how Koresh had threatened to cause local \nproblems with all these wepaons he had and was alleged to\nhave.  \n\nSomeone else will post more details soon, I'm sure.\n\nOther News:\nSniper injures 9 outside MCA buildling in L.A.  Man arrested--suspect\nwas disgruntled employee of Universal Studios, which\nis a division of M.C.A.\n\n\nQUESTION:\nWhat will Californians do with all those guns after the Reginald\ndenny trial?"

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_df=0.8, min_df=100, stop_words='english')
X = cv.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() works on 1.0+
feature_names = cv.get_feature_names_out()
print("{} documents, {} word features.".format(X.shape[0], X.shape[1]))

11314 documents, 1336 word features.


In [4]:
from sklearn.decomposition import LatentDirichletAllocation
import time

no_topics = 20

# n_components: number of topics
# max_iter: the maximum number of passes over the corpus while updating the model
# learning_method: 'online' updates the model on mini-batches, which trains faster on larger datasets
# learning_offset: the higher this is, the less weight the early training batches get
# random_state: for reproducibility
# verbose: set to 1 or 2 for updates to be printed during training
# n_jobs: number of CPUs to use for parallelization

print("Training LDA.")
st = time.time()
lda = LatentDirichletAllocation(
    n_components=no_topics, 
    max_iter=5, 
    learning_method='online', 
    learning_offset=50, 
    random_state=0,
    n_jobs=4)
lda.fit(X)
print("Finished.")
print("{} seconds.".format(time.time() - st))

Training LDA.
Finished.
22.96079111099243 seconds.


# Exploring the learned model

The model learned by LDA consists of a _term-topic distribution_, which tells us the probability of emitting a word, given a topic. This distribution can be obtained as a matrix from the learned model, with `n_components` rows (one per "topic") and d columns, where d is the number of word features.

In [5]:
lda.components_.shape

(20, 1336)

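To find the weight of a certain word in a certain topic, look up the value at the topic's row and the word's column. scikit-learn stores these values as unnormalized pseudo-counts, so dividing a row by its sum turns it into a probability distribution over words.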
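
Since the rows of scikit-learn's `components_` are pseudo-counts that do not sum to 1, a short sketch of converting them to P(word | topic) may help. The matrix below is a made-up stand-in for `lda.components_`, not values from the trained model:

```python
import numpy as np

# Hypothetical pseudo-counts standing in for lda.components_ (2 topics, 3 words).
components = np.array([[2.0, 6.0, 2.0],
                       [1.0, 1.0, 8.0]])

# Divide each row by its sum so that every topic becomes a distribution over words.
topic_word = components / components.sum(axis=1, keepdims=True)

print(topic_word)
print(topic_word.sum(axis=1))  # each row now sums to 1
```

Normalizing does not change the order of words within a topic, so ranking words by the raw values gives the same top words.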

The best way to "get to know" the discovered topics is to look at the words with the highest probability per topic. 

Here's a function that can easily extract that info for us. In order to know which word corresponds to which column, we need to pull in the feature names from the vectorizer. 

In [6]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic {}:".format(topic_idx))
        # arrange indexes of topic ascending order
        # get the indexes of the last no_top_words in the arranged vector
        # reverse them (so biggest is first)
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words-1:-1]]))
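
To see what the slice `[:-no_top_words-1:-1]` does, here is a small sketch on a made-up weight vector (the words and values are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical topic weights and vocabulary, for illustration only.
topic = np.array([0.1, 0.7, 0.05, 0.9, 0.3])
words = ["car", "game", "price", "team", "space"]
no_top_words = 3

# argsort returns indexes in ascending order of weight: [2, 0, 4, 1, 3].
order = topic.argsort()

# Walking backwards from the end and stopping after no_top_words items
# yields the indexes of the largest weights, biggest first.
top = order[:-no_top_words - 1:-1]

print([words[i] for i in top])  # ['team', 'game', 'space']
```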


In [7]:
display_topics(lda, feature_names, 10)

Topic 0:
government president car people health cars congress house federal use
Topic 1:
mr does question true point evidence right answer say argument
Topic 2:
00 new price sale offer shipping condition 250 san sell
Topic 3:
edu server graphics cs mit uk com memory comp ac
Topic 4:
god jesus people bible christian life church christians does faith
Topic 5:
10 15 14 11 12 25 20 17 16 18
Topic 6:
jews war 000 people women children killed years men military
Topic 7:
software mac version pc user color using available sun image
Topic 8:
don just think like know good ve going time really
Topic 9:
state states israel public rights law new people israeli united
Topic 10:
space university internet nasa research information available anonymous ftp 1993
Topic 11:
game team year play games season hockey players league win
Topic 12:
problem time window use power speed work using like just
Topic 13:
said gun went people didn told guns says day time
Topic 14:
thanks com mail know does help send list

We can see that most of these topics are pretty coherent. Topic 0 is government, Topic 1 is debate, Topic 2 is sales, and so forth. There seem to be some unsavory topics as well; at any rate, LDA has helped "distil" the semantic content of this large text set and let us get an overview of what's "in" the corpus.

# Document-topic distribution

We can also give a trained LDA model a document and get back its estimate of that document's distribution over topics. Here is document 1, which is technical and about computers.

In [8]:
documents[1]

"I have an Okidata 2410 printer for which I would like to have a printer driver.\nHas anyone seen such a thing?  There is not one on the Microsoft BBS.\nI can print to it from Windows but I have no fonts available and with\nParadox for Windows I can't print labels on it unless there is a proper printer\ndefined.\n\n\nThanks,\n\nBryan K. Ward\nSurvey Research Center\nUniversity of Utah"

Let's extract the vector representation of this document, and feed it to the LDA model to see its corresponding topic distribution.

In [9]:
d1 = X[1, :]
d1_topics = lda.transform(d1)
d1_topics

array([[0.00263158, 0.00263158, 0.00263158, 0.00263158, 0.00263158,
        0.00263158, 0.00263158, 0.34075273, 0.23512517, 0.00263158,
        0.18215102, 0.00263158, 0.00263158, 0.00263158, 0.00263158,
        0.00263158, 0.19986582, 0.00263158, 0.00263158, 0.00263158]])

In [10]:
np.argsort(d1_topics)[0][::-1]

array([ 7,  8, 16, 10, 14, 18, 12, 19,  9, 17,  1,  4, 13,  6, 11,  5,  0,
        3,  2, 15])

The model says this document is about 34% topic 7, 24% topic 8, 20% topic 16, and 18% topic 10. Topics 7 and 16 are indeed computer-related topics. Topic 8 is a less coherent topic that seems to be storing "filler" words.
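
The ranking and the proportions can be read off together. A small sketch, with the array copied from the output of cell 9 so that it runs on its own (`top_k` and `ranked` are names introduced here):

```python
import numpy as np

# Topic proportions for document 1, copied from the output above.
d1_topics = np.array([[0.00263158, 0.00263158, 0.00263158, 0.00263158, 0.00263158,
                       0.00263158, 0.00263158, 0.34075273, 0.23512517, 0.00263158,
                       0.18215102, 0.00263158, 0.00263158, 0.00263158, 0.00263158,
                       0.00263158, 0.19986582, 0.00263158, 0.00263158, 0.00263158]])

top_k = 4
# Sort topic indexes by descending proportion and keep the first top_k.
ranked = np.argsort(d1_topics[0])[::-1][:top_k]

for topic_idx in ranked:
    print("Topic {}: {:.0%}".format(topic_idx, d1_topics[0, topic_idx]))
```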

# Sending a re-representation to a machine learning algorithm

LDA can be conceptualized as "summarizing" data. Instead of representing each document as a count vector with 1336 features, we can represent it as a vector of 20 topic proportions. On one hand, information is lost in this re-representation, but on the other hand, there is potential that this "summarization" makes patterns easier for machine learning classifiers to grasp.
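
As a rough sketch of this re-representation on a tiny made-up corpus (the documents, labels, and the choice of 2 topics are all hypothetical, and the corpus is far too small to give meaningful topics), the steps chain together like this:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus with two classes (0 = sports, 1 = computers).
docs = [
    "the team won the hockey game",
    "players scored in the final game",
    "the season ended with a big win",
    "install the printer driver on windows",
    "the software needs more memory",
    "graphics card driver for the pc",
]
labels = [0, 0, 0, 1, 1, 1]

# Counts -> topic proportions: each document becomes a 2-dimensional vector.
to_topics = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=2, random_state=0),
)
Z = to_topics.fit_transform(docs)
print(Z.shape)  # (6, 2)

# Any classifier can then be trained on the topic vectors instead of the counts.
clf = LinearSVC().fit(Z, labels)
print(clf.predict(to_topics.transform(["a new hockey season"])))
```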

Each document in this dataset belongs to one of 20 categories.

In [11]:
dataset.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [12]:
from collections import Counter
Counter(dataset.target).most_common()

[(10, 600),
 (15, 599),
 (8, 598),
 (9, 597),
 (11, 595),
 (7, 594),
 (13, 594),
 (5, 593),
 (14, 593),
 (2, 591),
 (12, 591),
 (3, 590),
 (6, 585),
 (1, 584),
 (4, 578),
 (17, 564),
 (16, 546),
 (0, 480),
 (18, 465),
 (19, 377)]

The labels aren't evenly distributed, but the most common label has a count of 600, so the majority-class baseline accuracy for this task is 600/11314, or about 5.3%.

In [13]:
600/11314

0.0530316422131872
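
This baseline is the accuracy of a classifier that always predicts the most frequent label. A sketch of computing it directly from a list of labels, using a short made-up list in place of `dataset.target` (`majority_baseline` is a helper introduced here, not part of the notebook):

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)

# Made-up labels: label 2 appears 3 times out of 6.
print(majority_baseline([0, 1, 2, 2, 2, 1]))  # 0.5
```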

First, let's train from the count vectors.

In [14]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

y = dataset.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1337)

In [15]:
clf = LinearSVC()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [16]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.48166151126822804

Not bad. Now let's try training it only on the topic vectors.

In [17]:
X_topic = lda.transform(X)

In [18]:
X_topic.shape

(11314, 20)

In [19]:
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(X_topic, y, test_size=0.2, random_state=1337)

In [20]:
clf = LinearSVC()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [21]:
accuracy_score(y_test, y_pred)

0.3588157313300928

So, that didn't improve our score: accuracy dropped from about 48% to about 36%. We can conclude that in this scenario, *IF* our goal is to predict a post's category, then a simple count of the words in a post is better than a 20-topic topic model for achieving this task.

However, the topic model gave us a human-understandable "glimpse" into the semantics of the dataset. And there are other scenarios where topic vectors may substitute for, or be appended to, other kinds of information before being input into an algorithm.
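
For the "appended" case, one possible sketch is to stack the topic proportions next to the original features. The arrays below are small made-up stand-ins for `X` and `X_topic`; for the sparse count matrix in this notebook, `scipy.sparse.hstack` would be the analogous call:

```python
import numpy as np

# Made-up stand-ins: 3 documents, 4 count features and 2 topic proportions.
X_counts = np.array([[1, 0, 2, 0],
                     [0, 3, 0, 1],
                     [2, 1, 0, 0]])
X_topic = np.array([[0.9, 0.1],
                    [0.2, 0.8],
                    [0.6, 0.4]])

# Place the topic proportions after the count columns, one row per document.
X_combined = np.hstack([X_counts, X_topic])
print(X_combined.shape)  # (3, 6)
```

Whether the combined features actually help a classifier is an empirical question that this notebook does not test.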

# Random Notes

1. The usual caveats for using machine learning apply. LDA is very sensitive to preprocessing and hyperparameters, and to tell the truth I tinkered with some settings to get the model we just saw. These include `min_df`, `max_df`, and the number of topics. Depending on the size of your data you may have to tweak `max_iter` and the online learning settings.
2. LDA probably won't work well out of the box on single tweets. As you now know, LDA relies on word counts, and since tweets are so short, the word counts are small and it's harder to estimate a document-topic distribution for a tweet.
3. However, this can work well if you have a reason to aggregate tweets, for example if each datapoint is a concatenation of all of a user's tweets and you have tweets for many users. You could apply LDA here, find topics in the texts, and for each user obtain a user-topic distribution showing how often that user writes about each topic.
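
The aggregation in note 3 could be sketched as follows. The (user, tweet) pairs are made up, and `by_user`, `users`, and `user_documents` are names introduced here for illustration:

```python
from collections import defaultdict

# Made-up (user, tweet) pairs, for illustration only.
tweets = [
    ("alice", "great hockey game tonight"),
    ("bob", "new graphics card arrived"),
    ("alice", "our team won again"),
    ("bob", "installing the driver now"),
]

# Collect each user's tweets, then join them into one document per user.
by_user = defaultdict(list)
for user, text in tweets:
    by_user[user].append(text)

users = sorted(by_user)
user_documents = [" ".join(by_user[u]) for u in users]

print(users)
print(user_documents)
```

`user_documents` could then take the place of `documents` in the vectorizer and LDA steps above, so that `lda.transform` returns one topic distribution per user.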