# Topic Models

We begin by fetching the **20 Newsgroups** [dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html?highlight=fetch_20newsgroups#sklearn.datasets.fetch_20newsgroups) and keeping the first 2000 articles of the shuffled data.

In [1]:
from sklearn.datasets import fetch_20newsgroups
data, _ = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             return_X_y=True)

data = data[:2000]

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


**Exercise:** Let's again use the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), but keep only words that are not stop words, occur at least 5 times in the corpus, appear in at most 30% of the documents, contain only the letters a-z, and are between 4 and 10 characters long. Hint: some of these requirements can be covered with `token_pattern='[a-zA-Z]{4,10}'`.
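
One possible way to approach this exercise is sketched below; it is not the only valid solution. It assumes that "at least 5 times in the corpus" can be approximated with `min_df=5`, which actually counts the number of *documents* a word appears in, and that scikit-learn's built-in `"english"` stop word list is acceptable. The helper name `make_vectorizer` and the synthetic `toy_docs` corpus are made up for illustration so that the effect of each filter is easy to see.

```python
from sklearn.feature_extraction.text import CountVectorizer


def make_vectorizer():
    """Hypothetical helper: a CountVectorizer with the filters from the exercise."""
    return CountVectorizer(
        stop_words="english",             # drop common English stop words
        min_df=5,                         # assumption: keep words found in >= 5 documents
        max_df=0.3,                       # keep words found in <= 30% of the documents
        token_pattern=r"[a-zA-Z]{4,10}",  # runs of 4 to 10 letters, as in the hint
    )


# Synthetic corpus of 20 tiny documents (illustration only).
toy_docs = []
for i in range(20):
    words = ["common", "the"]        # "common" appears in 100% of the documents
    if i < 5:
        words += ["topic", "cat"]    # both appear in 5 of 20 documents (25%)
    if i < 2:
        words.append("rare")         # appears in only 2 documents
    toy_docs.append(" ".join(words))

toy_vectorizer = make_vectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# "common" fails max_df, "rare" fails min_df, "cat" and "the" are too short.
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary, toy_counts.shape)
```

On the newsgroup articles, the same configuration could be applied with `vectorizer = make_vectorizer()` followed by `X_bag_of_words = vectorizer.fit_transform(data)`, which matches the variable names used in the later cells. Note that the hinted pattern has no word boundaries, so a word longer than 10 letters is truncated rather than dropped; a pattern such as `r"\b[a-zA-Z]{4,10}\b"` would drop it instead.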

We can then check the size of our vocabulary.

**Exercise:** Next, let's import [LatentDirichletAllocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) from `sklearn.decomposition` and initialize it to estimate 50 topics.
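
A minimal sketch of this step follows, under a few assumptions: `random_state=0` is added only to make the run reproducible, all other parameters are left at their defaults, and the random `toy_counts` matrix is a stand-in for a real bag-of-words matrix so the block runs on its own.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in for a document-term count matrix: 30 documents x 12 vocabulary terms.
rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 5, size=(30, 12))

# n_components is the number of topics to estimate (50, as in the exercise).
# random_state=0 is an assumption made here for reproducibility.
lda = LatentDirichletAllocation(n_components=50, random_state=0)

# fit_transform learns the topics and returns each document's topic mixture.
toy_doc_topics = lda.fit_transform(toy_counts)

print(lda.components_.shape)   # one row of term weights per topic
print(toy_doc_topics.shape)    # one row of topic proportions per document
```

Calling `fit` again on another matrix, such as the real bag-of-words matrix, discards what was learned from the toy data and refits the model from scratch.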

In [None]:
lda.fit(X_bag_of_words)

We then need some boilerplate code to display the top words per topic.

In [None]:
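def display_topics(model, feature_names, no_top_words):
    # Each row of model.components_ holds one topic's weight for every vocabulary term.
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % topic_idx)
        # argsort is ascending, so the reversed slice yields the highest-weighted terms first.
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
# code from https://gist.github.com/aneesha/440f3d104415c6ae21851a062f3880d8#file-displaytopics-py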
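
The slice `argsort()[:-no_top_words - 1:-1]` in `display_topics` can be hard to read at first. The small illustration below uses made-up weights and term names (not output from the model) to show that it selects the indices of the largest values in descending order.

```python
import numpy as np

# Made-up topic weights for a four-term vocabulary (illustration only).
weights = np.array([0.1, 2.5, 0.7, 1.3])
terms = ["alpha", "beta", "gamma", "delta"]
no_top_words = 2

# argsort returns indices from smallest to largest weight: [0, 2, 3, 1].
ascending = weights.argsort()

# Walking backwards from the end and stopping after no_top_words entries
# gives the indices of the largest weights first: [1, 3].
top_indices = ascending[:-no_top_words - 1:-1]

top_terms = [terms[i] for i in top_indices]
print(top_terms)  # ['beta', 'delta']
```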

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
display_topics(lda, vectorizer.get_feature_names_out(), 10)

A more in-depth review and comparison with [NMF](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF) is available in the scikit-learn [documentation](https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html). A popular alternative library for topic modeling is [Gensim](https://radimrehurek.com/gensim/).