# 04-Latent Dirichlet Allocation

LDA is a generative statistical model. It finds similarities between texts in the form of key words that belong to potential topics. The more of a topic's key words a text contains, the higher the probability that the text belongs to that topic. The topics can then be visualized or used to label data.

In this exercise, you will learn to build and adjust an LDA model to generate potential topics, explore their corresponding key words, and visualize the model. You will be working on 3 categories of the 20 Newsgroups dataset available in Sklearn.

Run the code below to load the data into a dataframe and preview the first rows.

In [None]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

data = fetch_20newsgroups(categories = ['talk.politics.mideast', 'rec.sport.hockey','soc.religion.christian']) 

mydataframe = pd.DataFrame({'text': data.data})

mydataframe.head()

## Preprocessing 

You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".
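
As a reminder of what cleaning can look like, here is a minimal sketch using only the standard library. The helper name `basic_clean` and the exact steps (lowercasing, dropping digits and punctuation, collapsing whitespace) are assumptions for illustration; your own pipeline may also remove stop words or lemmatize.

```python
import re
import string

def basic_clean(text):
    """Hypothetical helper: lowercase, strip digits and punctuation, collapse whitespace."""
    text = text.lower()
    # Replace runs of digits with a space
    text = re.sub(r"\d+", " ", text)
    # Replace every punctuation character with a space
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    # Collapse repeated whitespace and trim the ends
    return " ".join(text.split())

print(basic_clean("Re: Game 7, Rangers WIN 3-2!!!"))
```

On the dataframe, a helper like this could be applied with `mydataframe['text'].apply(basic_clean)`.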

In [1]:
# Code here


Run the code below to convert clean_text to string format.

In [None]:
mydataframe['clean_text'] = mydataframe['clean_text'].astype('str')

## Vectorize data

As in many Natural Language Processing tasks, the text needs to be vectorized before being used for LDA modelling. Instantiate a default CountVectorizer and vectorize the clean texts into a standard Bag of Words.

In [2]:
# Code here

## Initiate Latent Dirichlet Allocation model

Once again, Sklearn offers a class to build an LDA model. Use it to instantiate an LDA with the following parameter:
- 3 topics
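
In scikit-learn the class is `LatentDirichletAllocation` in `sklearn.decomposition`, and the number of topics is set with `n_components`. A minimal sketch follows; `random_state` is an optional addition here to make runs reproducible, and the variable name is illustrative.

```python
from sklearn.decomposition import LatentDirichletAllocation

# n_components is the number of topics; random_state is optional (assumption: added for reproducibility)
toy_lda = LatentDirichletAllocation(n_components=3, random_state=0)

print(toy_lda.n_components)
```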

In [17]:
# Code here

## Fit LDA to data

Fit the LDA model to the vectorized data.
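
As a rough sketch of what fitting produces, the example below trains a two-topic model on a made-up corpus (all `toy_` names and sentences are assumptions for illustration). After fitting, `components_` holds one row of word weights per topic.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, for illustration only
toy_corpus = [
    "the team won the hockey game",
    "the players scored in the game",
    "the church held a prayer service",
    "the prayer was read in church",
]

toy_bow = CountVectorizer().fit_transform(toy_corpus)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_lda.fit(toy_bow)   # learns the topic-word weights from the counts

# One row per topic, one column per vocabulary word
print(toy_lda.components_.shape)
```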

In [18]:
# Code here



## Visualize potential topics

The function to print the potential topics is already made for you. You just have to pass the correct arguments ;)

In [None]:
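def print_topics(model, vectorizer):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Ten highest-weighted words for this topic, in descending order
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])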
        

print_topics(# Code here

## Train LDA with different CountVectorizer parameters

The CountVectorizer parameter `max_df` is used to ignore overly frequent terms in the corpus: any word with a document frequency higher than the given threshold is ignored (treated as a corpus-specific stop word). A float is read as a proportion of documents, an integer as an absolute document count. Before an LDA model, tuning `max_df` can improve interpretability, because it removes words that are too common to distinguish one topic from another.

Adjust that parameter and train an LDA on the resulting vectors until you are satisfied with the interpretability of your model.
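
The effect of `max_df` can be sketched on a made-up corpus (sentences and `toy_` names are illustrative). With `max_df=0.5`, a word is dropped when it appears in more than half of the documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "the" appears in all 4 documents, "team" in 2 of them
toy_corpus = [
    "the team won the game",
    "the church opened the doors",
    "the election results came in",
    "the team lost the final",
]

default_vocab = CountVectorizer().fit(toy_corpus).vocabulary_
filtered_vocab = CountVectorizer(max_df=0.5).fit(toy_corpus).vocabulary_

print("the" in default_vocab, "the" in filtered_vocab)   # dropped only when max_df is set
print("team" in filtered_vocab)                          # 2/4 is not above the threshold, so it stays
```

The threshold value of 0.5 is arbitrary here; the right value for the newsgroups data is something to find by experiment.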

You could also try vectorizing as bigrams...

In [None]:
# Code here

## Predict topic of new text

Once you are happy with the interpretability of your LDA model, you can use it to predict the topic of a new text.

First, use vectorizer2 to vectorize the example. Then, pass the result to lda_model2's `transform` method to get the LDA vectors (do not fit the model again). The following code will print the results.
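
The same two steps can be sketched on a toy model. Everything below (corpus, `toy_` names, the new sentence) is made up for illustration; in the exercise you would use your own fitted vectorizer and LDA model instead.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training corpus, for illustration only
toy_corpus = [
    "the team won the hockey game",
    "the players scored in the game",
    "the church held a prayer service",
    "the prayer was read in church",
]
toy_vectorizer = CountVectorizer()
toy_bow = toy_vectorizer.fit_transform(toy_corpus)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_bow)

new_text = ["the team played a great game"]
new_bow = toy_vectorizer.transform(new_text)   # transform only: reuse the fitted vocabulary
topic_dist = toy_lda.transform(new_bow)        # transform only: reuse the fitted topics

print(topic_dist.shape)         # one row per text, one column per topic
print(topic_dist.sum(axis=1))   # each row sums to 1
```

Calling `fit` or `fit_transform` at this stage would relearn the vocabulary or the topics from the single new text, which is why only `transform` is used.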

In [3]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [None]:
example_vectorized = # Code here

lda_vectors = # Code here

In [None]:
print("topic 0 :", lda_vectors[0][0])
print("topic 1 :", lda_vectors[0][1])
print("topic 2 :", lda_vectors[0][2])

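The LDA model outputs a topic distribution for our example: one value per topic, summing to 1, which can be read as the probability of the example belonging to each topic. The topics are numbered, not named, so look at their key words to see which category each one matches. Feel free to try different examples.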
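
If you want a single predicted topic rather than the full distribution, one option is to take the index of the largest value. A minimal sketch, using made-up numbers in place of real model output:

```python
import numpy as np

# Made-up topic distribution for one text (assumption: not real model output)
example_vectors = np.array([[0.05, 0.90, 0.05]])

best_topic = int(np.argmax(example_vectors[0]))
print("most likely topic:", best_topic)
```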

## Graph LDA

Using pyLDAvis, prepare a graph with the following arguments:
- lda_model2
- data_vectorized2
- vectorizer2
- mds='tsne'

In [5]:
# Note: pyLDAvis 3.4+ renamed this module to pyLDAvis.lda_model
import pyLDAvis.sklearn
 
# Code here