# Topic Modeling with LDA

In this notebook, I take my processed tweets (punctuation, stopwords, links, hashtags, and emojis removed, then lemmatized), fit LDA topic models with scikit-learn, and use pyLDAvis to start exploring the underlying topics in this corpus.

pyLDAvis is a useful tool both for getting the top keywords for each topic and for *visualizing* the topics in relation to one another.

## pyLDAvis Topic Modeling

In [46]:
# Basics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pickle 

#sklearn
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [4]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pyLDAvis
# Note: pyLDAvis 3.4+ renames this module to pyLDAvis.lda_model
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [5]:
with open('../data_files/processed_tweets.pickle', 'rb') as read_file:
    tweets = pickle.load(read_file)

In [6]:
tweets.head()

| Unnamed: 0 | clean | processed |
|---|---|---|
| 0 | islam kills are you trying to say that there w... | islam kill try say terrorist attack europe ref... |
| 1 | clinton trump should ve apologized more attack... | clinton trump apologize attack little |
| 2 | who was is the best president of the past year... | well president past retweet |
| 3 | i don t have to guess your religion christmasa... | guess religion christmasaftermath |
| 4 | pence and his lawyers decided which of his off... | pence lawyer decide official email public can see |


In [7]:
proc_tweets = tweets.processed

In [8]:
vectorizer = CountVectorizer(token_pattern="\\b[a-z][a-z]+\\b",
                             binary=True,
                             stop_words='english')

In [9]:
dtm_tf = vectorizer.fit_transform(proc_tweets)
print(dtm_tf.shape)

(203482, 77822)
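
To make the vectorizer settings concrete, here is a minimal pure-Python sketch of what the `token_pattern` and `binary=True` options do to a single tweet. `binary_tokens` is a hypothetical helper, not part of scikit-learn, and it does not reproduce the `stop_words='english'` filter.

```python
import re

# Same pattern passed to CountVectorizer above: lowercase alphabetic
# tokens that are at least two letters long.
TOKEN_PATTERN = re.compile(r"\b[a-z][a-z]+\b")


def binary_tokens(text):
    """Hypothetical helper: return the set of tokens present in `text`.

    Mirrors binary=True, where a token is recorded once per document no
    matter how often it occurs. Stop-word removal is not reproduced here.
    """
    return set(TOKEN_PATTERN.findall(text.lower()))


# "attack" appears twice but is kept once; "a" and "2016" match no token.
example = "clinton trump apologize attack attack a 2016"
print(sorted(binary_tokens(example)))
```

Because of `binary=True`, each cell of `dtm_tf` records whether a term appears in a tweet, not how many times it appears.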


### 4 Topics

In [10]:
%%time

lda_4 = LatentDirichletAllocation(n_components=4, random_state=42)
lda_4.fit(dtm_tf)

CPU times: user 6min 13s, sys: 1.34 s, total: 6min 15s
Wall time: 6min 16s


LatentDirichletAllocation(n_components=4, random_state=42)

In [11]:
pyLDAvis.sklearn.prepare(lda_4, dtm_tf, vectorizer)

**Topic Analysis**
1. 2016 Election (main entities = Trump, Clinton; additional keywords like president, campaign, election, maga, email, fbi)
2. General Twitter - emphasis on BLM
3. Conservative Twitter (christian, conservative, pjnet, ted cruz, tea party)
4. Twitter News - negative slant
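
The keywords behind labels like these can also be read directly from the fitted model instead of from the pyLDAvis panel. The sketch below assumes a topic-word weight matrix shaped like `lda_4.components_` and a vocabulary like the one returned by `vectorizer.get_feature_names_out()`. `top_words_per_topic` is a hypothetical helper, and the toy weights are made up for illustration.

```python
def top_words_per_topic(components, vocab, n_top=3):
    """Return the n_top highest-weighted words for each topic row."""
    result = []
    for weights in components:
        # Vocabulary indices sorted by descending topic-word weight.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([vocab[i] for i in ranked[:n_top]])
    return result


# Toy stand-ins (made-up numbers) for lda_4.components_ and the vocabulary.
toy_components = [
    [5.0, 0.1, 3.0, 0.2],
    [0.3, 4.0, 0.1, 2.5],
]
toy_vocab = ["trump", "police", "clinton", "blm"]

for topic_idx, words in enumerate(top_words_per_topic(toy_components, toy_vocab, n_top=2)):
    print(topic_idx, words)
```

By default pyLDAvis sorts topics by prevalence, so its topic numbers will generally not match the row order of `components_`.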

### 6 Topics

In [12]:
%%time

lda_6 = LatentDirichletAllocation(n_components=6, random_state=42)
lda_6.fit(dtm_tf)

CPU times: user 5min 26s, sys: 1.19 s, total: 5min 27s
Wall time: 5min 28s


LatentDirichletAllocation(n_components=6, random_state=42)

In [13]:
pyLDAvis.sklearn.prepare(lda_6, dtm_tf, vectorizer)

**Topic Analysis**
1. Trump
2. Violence
3. General Twitter
4. Clinton
5. Conservative Twitter (pjnet, christian, ted cruz, god, conservative)
6. News

### 8 Topics

In [36]:
%%time

lda_8 = LatentDirichletAllocation(n_components=8, random_state=42)
lda_8.fit(dtm_tf)

CPU times: user 5min 16s, sys: 1.1 s, total: 5min 17s
Wall time: 5min 18s


LatentDirichletAllocation(n_components=8, random_state=42)

In [37]:
pyLDAvis.sklearn.prepare(lda_8, dtm_tf, vectorizer)

**Topic Analysis**
1. Trump
2. Islam
3. Police Violence
4. Clinton
5. General Twitter
6. Obama
7. Conservative
8. Debates

### 12 Topics

In [38]:
%%time

lda_12 = LatentDirichletAllocation(n_components=12, random_state=42)
lda_12.fit(dtm_tf)

CPU times: user 5min 13s, sys: 1.49 s, total: 5min 14s
Wall time: 5min 16s


LatentDirichletAllocation(n_components=12, random_state=42)

In [39]:
pyLDAvis.sklearn.prepare(lda_12, dtm_tf, vectorizer)

**Topic Analysis**
1. Trump as a candidate
2. Trump's campaign
3. Police Violence
4. Clinton
5. General Twitter
6. Also Clinton
7. Conservative Twitter
8. Islam, refugees, ISIS
9. General Twitter
10. Conservative Twitter
11. General Twitter
12. German

Twelve topics is a lot, and several of them appear to repeat, so they can be reduced to 7 main topics:

* Trump (1, 2) 
* Clinton (4, 6)
* Police Violence (3)
* General Twitter (5, 9, 11)
* Middle East (8)
* Conservative (7, 10)
* German (12)
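
One way to apply this grouping is a lookup from the 12 numbered topics to the 7 merged labels, then summing a tweet's topic weights within each group. This is only a sketch: `MERGED_TOPICS` and `merge_topic_weights` are hypothetical names, the numbering follows the pyLDAvis labels above (which may not match the row order of `lda_12.components_`), and the example weights are made up.

```python
# Merged label -> 1-based topic numbers from the 12-topic list above.
MERGED_TOPICS = {
    "Trump": [1, 2],
    "Clinton": [4, 6],
    "Police Violence": [3],
    "General Twitter": [5, 9, 11],
    "Middle East": [8],
    "Conservative": [7, 10],
    "German": [12],
}


def merge_topic_weights(weights):
    """Sum a 12-element topic distribution into the 7 merged groups."""
    if len(weights) != 12:
        raise ValueError("expected one weight per original topic (12)")
    return {
        label: sum(weights[n - 1] for n in numbers)
        for label, numbers in MERGED_TOPICS.items()
    }


# Made-up topic distribution for a single tweet (weights sum to 1).
example = [0.30, 0.20, 0.05, 0.10, 0.05, 0.10, 0.02, 0.03, 0.05, 0.03, 0.05, 0.02]
merged = merge_topic_weights(example)
print(max(merged, key=merged.get))
```

Summing within groups keeps the merged weights a valid distribution, so the dominant merged topic for a tweet can be picked with a simple `max`.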

I'll use this framework going forward. In the next few notebooks, I'll explore these LDA topics further, do sentiment analysis, and use CorEx to tease these topics apart a little more.