# Topic Modeling

For more details on how topic modeling works, [see here](https://topix.io/tutorial/tutorial.html)

### Execute this cell to install required python module

After you've installed this once, you can delete this cell.

In [None]:
!pip install pyldavis

### Import dependencies

In [19]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups

# module to visualize topics
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings('ignore')

### Load 20newsgroups data

In [4]:
news = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame({"body": news.data})
df.shape

### Preprocess text

In [12]:
from utils import clean_text
df['body'] = df['body'].apply(lambda x: clean_text(x))

### Generate feature vectors

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

# LDA can only use raw term counts for LDA because it is a probabilistic graphical model
tf_vectorizer = CountVectorizer(stop_words='english')
tf = tf_vectorizer.fit_transform(df['body'])
tf_feature_names = tf_vectorizer.get_feature_names()

### Fit feature vectors to the LDA topic model

In [35]:
from sklearn.decomposition import LatentDirichletAllocation

no_topics = 20

lda = LatentDirichletAllocation(n_components=no_topics, random_state=4, evaluate_every=1).fit(tf)

### Display top words for each topic

In [39]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic: {topic_idx}")
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 10
display_topics(lda6, tf_feature_names, no_top_words)

Topic: 0
gordon banks surrender block soon intellect skepticism njxp chastity shameful
Topic: 1
people jesus armenian armenians jews turkish know time turkey went
Topic: 2
scsi time right left dead ride know good tanks people
Topic: 3
byte allocation book cross unit bits linked attack quran offset
Topic: 4
water theory good used fallacy larson dept universe cooling material
Topic: 5
drive disk card hard drives price scsi controller sale offer
Topic: 6
file windows program thanks window using know files problem version
Topic: 7
entry entries points point good printer year need think laser
Topic: 8
file information internet data email available university image computer software
Topic: 9
space nasa know data work time launch good problem power
Topic: 10
university article professor judas books history pain john written reply
Topic: 11
period flyers power graphics puck shots package soderstrom available goal
Topic: 12
encryption chip government clipper keys information people security use

### Visualizing our topics in 2-dimensional space

How to interpret this visualization:
1. Each bubble represents a topic
2. Larger topics are more frequent in the corpus
3. Topics closer together are more similar
4. When you click on a topic, the most relevant terms for that topic show in red on the right, and in blue is the frequency of that term in all other topics
5. When you hover over a word in the chart on the right, the bubbles will adjust according to how relevant that term is to each topic


In [40]:
pyLDAvis.sklearn.prepare(lda4, tf, tf_vectorizer)

### Create Document - Topic Matrix

In [41]:
lda_output = lda4.transform(tf)

# column names
topicnames = ["Topic" + str(i) for i in range(no_topics)]

# index names
docnames = ["Doc" + str(i) for i in range(len(df))]

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)

# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic

# Styling
def color_green(val):
    color = 'green' if val > .1 else 'black'
    return 'color: {col}'.format(col=color)

def make_bold(val):
    weight = 700 if val > .1 else 400
    return 'font-weight: {weight}'.format(weight=weight)

# Apply Style
df_document_topics = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
df_document_topics

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,Topic10,Topic11,Topic12,Topic13,Topic14,Topic15,Topic16,Topic17,Topic18,Topic19,dominant_topic
Doc0,0.1,0.0,0.0,0.34,0.0,0.0,0.0,0.0,0.0,0.19,0.35,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10
Doc1,0.0,0.0,0.0,0.51,0.26,0.0,0.0,0.0,0.21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
Doc2,0.09,0.0,0.0,0.6,0.0,0.12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.18,0.0,0.0,0.0,0.0,0.0,3
Doc3,0.01,0.01,0.01,0.34,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.18,0.01,0.34,0.01,0.01,0.01,0.01,3
Doc4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.15,0.0,0.0,0.0,0.82,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11
Doc5,0.0,0.0,0.0,0.0,0.0,0.81,0.0,0.0,0.0,0.0,0.16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
Doc6,0.0,0.0,0.0,0.51,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12,0.0,0.0,0.0,0.0,0.0,0.0,0.34,3
Doc7,0.0,0.0,0.0,0.99,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
Doc8,0.0,0.0,0.0,0.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12,0.0,0.0,0.0,0.0,0.0,3
Doc9,0.0,0.0,0.0,0.48,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3,0.0,0.21,0.0,0.0,0.0,0.0,0.0,0.0,3
