# Topic Modeling

For more details on how topic modeling works, see [this tutorial](https://topix.io/tutorial/tutorial.html).

### Execute this cell to install the required Python module

You only need to run this once; after that, you can delete this cell.

In [1]:
!pip install pyldavis

Successfully installed funcy-1.15 pyldavis-2.1.2


### Import dependencies

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups

# module to visualize topics
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings('ignore')


### Load 20newsgroups data

In [2]:
news = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame({"body": news.data})
df.shape

(11314, 1)

### Preprocess text

In [3]:
from utils import clean_text
df['body'] = df['body'].apply(clean_text)
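
The `utils` module is local to this project and is not shown in the notebook. As a rough, hypothetical stand-in (the real `clean_text` may apply different rules), a minimal cleaner could lowercase the text, drop e-mail addresses, and keep only alphabetic tokens. The name `clean_text_sketch` and both regular expressions are assumptions made for illustration.

```python
import re

# Hypothetical stand-in for utils.clean_text; the real helper is not shown
# in this notebook and may behave differently.
EMAIL_RE = re.compile(r"\S+@\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")

def clean_text_sketch(text):
    """Lowercase, strip e-mail addresses and non-letters, collapse whitespace."""
    text = text.lower()
    text = EMAIL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return " ".join(text.split())

print(clean_text_sketch("Mail me at bob@example.com -- the GAME starts at 7pm!"))
```

Whatever the real helper does, the goal is the same: reduce noise so that the vectorizer in the next step counts meaningful words rather than addresses, digits, and punctuation.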

### Generate feature vectors

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# LDA needs raw term counts (not TF-IDF weights) because it is a probabilistic graphical model
tf_vectorizer = CountVectorizer(stop_words='english')
tf = tf_vectorizer.fit_transform(df['body'])
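# Note: on scikit-learn >= 1.0 the replacement is get_feature_names_out(); get_feature_names() was removed in 1.2
tf_feature_names = tf_vectorizer.get_feature_names()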
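
To make "raw term counts" concrete, here is a small plain-Python sketch of the kind of document-term matrix that `CountVectorizer` produces. It is an illustration only: the toy documents are invented, and splitting on whitespace is a simplification of scikit-learn's tokenizer, which also applies a token pattern and the stop-word list.

```python
from collections import Counter

# Invented toy corpus, for illustration only.
docs = ["hockey game tonight", "game season game", "encryption chip keys"]

# Vocabulary: sorted unique tokens (CountVectorizer also orders its features alphabetically).
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary term, holding raw counts.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

In the notebook, `tf` plays the role of `matrix` (stored as a sparse matrix) and `tf_feature_names` plays the role of `vocab`.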

### Fit feature vectors to the LDA topic model

In [5]:
from sklearn.decomposition import LatentDirichletAllocation

no_topics = 20

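# evaluate_every=1 evaluates perplexity after every iteration, which helps
# check convergence but increases training time
lda = LatentDirichletAllocation(
    n_components=no_topics,
    random_state=4,
    evaluate_every=1,
).fit(tf)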

### Display top words for each topic

In [6]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic: {topic_idx}")
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 10
display_topics(lda, tf_feature_names, no_top_words)

Topic: 0
year israel think good israeli team game right time people
Topic: 1
think question believe point science true exist atheists make know
Topic: 2
government encryption chip clipper keys used security people right public
Topic: 3
armenian turkish armenians people turkey jews turks genocide armenia government
Topic: 4
game team play hockey games period season players league teams
Topic: 5
file information output internet email program anonymous space list data
Topic: 6
file firearms control crime police weapons guns weapon state firearm
Topic: 7
know judas bible fallacy hell argument really doug reply read
Topic: 8
file window available image program files using version windows software
Topic: 9
people jesus think know believe time life good bible christian
Topic: 10
cubs allocation greek cross suck john linked unit baltimore moncton
Topic: 11
april lines university states send santa sweden available group rates
Topic: 12
darren hawks company maria tomb station redesign patients c
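
The slice `[:-no_top_words - 1:-1]` in `display_topics` walks the ascending `argsort` result backwards, so the indices of the largest weights come out first. Below is a plain-Python sketch of the same idea; the weights and words are invented, and `sorted(range(...))` stands in for NumPy's `argsort`.

```python
# Invented toy weights for one topic; positions line up with feature_names.
feature_names = ["bible", "game", "hockey", "team", "windows"]
topic = [0.2, 5.0, 3.5, 4.1, 0.1]
no_top_words = 3

# Equivalent of topic.argsort(): indices ordered from smallest to largest weight.
ascending = sorted(range(len(topic)), key=lambda i: topic[i])

# Reverse slice: take the last no_top_words indices, largest weight first.
top = ascending[:-no_top_words - 1:-1]

top_words = [feature_names[i] for i in top]
print(" ".join(top_words))
```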

### Visualizing our topics in 2-dimensional space

How to interpret this visualization:
1. Each bubble represents a topic.
2. Larger bubbles are topics that are more prevalent in the corpus.
3. Topics that sit closer together are more similar.
4. When you click a topic, the bar chart on the right lists its most relevant terms: the red bar shows a term's estimated frequency within that topic, and the blue bar shows its overall frequency across the corpus.
5. When you hover over a term in the chart on the right, the bubbles resize to show how strongly that term is associated with each topic.
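
The "most relevant terms" ranking follows the relevance score from the LDAvis paper (Sievert and Shirley, 2014), which blends a term's probability within the topic with its lift over the corpus-wide probability. Here is a minimal sketch of that formula; the probabilities are invented, and `lam` stands in for the value of the λ slider in the visualization.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance = lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Invented numbers: "people" is common everywhere, "clipper" is rare but topic-specific.
terms = {
    "people": {"p_w_given_t": 0.020, "p_w": 0.015},
    "clipper": {"p_w_given_t": 0.010, "p_w": 0.001},
}

for lam in (1.0, 0.3):
    ranked = sorted(terms, key=lambda w: relevance(lam=lam, **terms[w]), reverse=True)
    print(lam, ranked)
```

With `lam` near 1 the ranking favors terms that are simply frequent within the topic; lowering it favors terms that are distinctive to the topic, which often makes topics easier to label.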


In [7]:
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

### Create document-topic matrix

In [8]:
lda_output = lda.transform(tf)

# column names
topicnames = ["Topic" + str(i) for i in range(no_topics)]

# index names
docnames = ["Doc" + str(i) for i in range(len(df))]

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)

# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic

# Styling
def color_green(val):
    color = 'green' if val > .1 else 'black'
    return f'color: {color}'

def make_bold(val):
    weight = 700 if val > .1 else 400
    return f'font-weight: {weight}'

# Apply the styles to the topic-weight columns only, so the integer
# dominant_topic column is not highlighted as if it were a weight
df_document_topics = (
    df_document_topic.head(15)
    .style
    .applymap(color_green, subset=topicnames)
    .applymap(make_bold, subset=topicnames)
)
df_document_topics

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,Topic10,Topic11,Topic12,Topic13,Topic14,Topic15,Topic16,Topic17,Topic18,Topic19,dominant_topic
Doc0,0.29,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.37,0.0,0.31,0.0,0.0,0.0,0.0,13
Doc1,0.0,0.38,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.54,0.0,0.0,0.0,0.0,15
Doc2,0.03,0.05,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.32,0.0,0.48,0.0,0.0,0.0,0.0,15
Doc3,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.84,0.01,0.01,0.01,0.01,0.01,14
Doc4,0.0,0.0,0.0,0.11,0.0,0.0,0.0,0.0,0.43,0.33,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,8
Doc5,0.0,0.0,0.0,0.0,0.0,0.0,0.11,0.0,0.16,0.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9
Doc6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3,0.0,0.0,0.0,0.0,0.67,0.0,0.0,0.0,0.0,0.0,0.0,13
Doc7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.31,0.69,0.0,0.0,0.0,0.0,15
Doc8,0.0,0.0,0.0,0.0,0.0,0.0,0.21,0.0,0.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8
Doc9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.18,0.0,0.0,0.0,0.0,0.0,0.0,0.55,0.26,0.0,0.0,0.0,15
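
`np.argmax(..., axis=1)` picks, for each row, the column index with the largest weight. The sketch below reproduces that logic in plain Python for three rows copied from the output above, then tallies how many documents fall under each dominant topic. The `rows` and `dominant` names are introduced here for illustration.

```python
from collections import Counter

# Topic weights for three documents, copied from the output above (Topic0..Topic19).
rows = {
    "Doc0": [0.29] + [0.0] * 12 + [0.37, 0.0, 0.31] + [0.0] * 4,
    "Doc3": [0.01] * 14 + [0.84] + [0.01] * 5,
    "Doc5": [0.0] * 6 + [0.11, 0.0, 0.16, 0.7] + [0.0] * 10,
}

# Index of the largest weight in each row; like np.argmax, ties resolve to the first index.
dominant = {doc: max(range(len(w)), key=w.__getitem__) for doc, w in rows.items()}
print(dominant)

# Number of documents per dominant topic.
print(Counter(dominant.values()))
```

On the full DataFrame, the same tally would be `df_document_topic['dominant_topic'].value_counts()`.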
