# Topic Modeling Exercise
- Stephen W. Thomas
- Queen's MMAI 891

In this exercise, we will use LDA to create topics from a Twitter dataset. The dataset contains tweets from Donald Trump and Hillary Clinton in the run-up to the 2016 US presidential election.

**Your mission**: Try to create better topics. The main ways to do this are to enhance the preprocessing steps and to tweak the LDA parameters. (See code snippets marked with `## EDIT CODE HERE ##`.)

In [1]:
import pandas as pd
import os

# Make sure the text doesn't get truncated when printed
# (pandas 1.0 and later expect None here instead of -1)
pd.set_option('display.max_colwidth', -1)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Read in the Data

Let's read in the data and look at its shape and some sample rows.

In [2]:
## EDIT CODE HERE ##
# Change these paths to where you downloaded the data
in_dir = "C:/Users/st50/Documents/sandbox/data"
out_dir = "C:/Users/st50/Documents/sandbox/out"

df = pd.read_csv(os.path.join(in_dir, "election-tweets-2016.csv"))

list(df)
df.info()
df.shape
df.head()
df.tail()

['IsRetweet',
 'Time',
 'Language',
 'RetweetCount',
 'FavoriteCount',
 'Longitude',
 'Latitude',
 'Author',
 'SourceURL',
 'Content',
 'OriginalAuthor',
 'Place']

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6444 entries, 0 to 6443
Data columns (total 12 columns):
IsRetweet         6444 non-null bool
Time              6444 non-null object
Language          6444 non-null object
RetweetCount      6444 non-null float64
FavoriteCount     6444 non-null float64
Longitude         12 non-null float64
Latitude          12 non-null float64
Author            6444 non-null object
SourceURL         6444 non-null object
Content           6444 non-null object
OriginalAuthor    722 non-null object
Place             204 non-null object
dtypes: bool(1), float64(4), object(7)
memory usage: 560.2+ KB


(6444, 12)

Unnamed: 0,IsRetweet,Time,Language,RetweetCount,FavoriteCount,Longitude,Latitude,Author,SourceURL,Content,OriginalAuthor,Place
0,False,2016-09-28 00:22:34,en,218.0,651.0,,,HillaryClinton,https://studio.twitter.com,The question in this election: Who can put the plans into action that will make your life better? https://t.co/XreEY9OicG,,
1,True,2016-09-27 23:45:00,en,2445.0,5308.0,,,HillaryClinton,http://twitter.com,"Last night, Donald Trump said not paying taxes was ""smart."" You know what I call it? Unpatriotic. https://t.co/t0xmBfj7zF",timkaine,
2,True,2016-09-27 23:26:40,en,7834.0,27234.0,,,HillaryClinton,https://about.twitter.com/products/tweetdeck,Couldn't be more proud of @HillaryClinton. Her vision and command during last night's debate showed that she's ready to be our next @POTUS.,POTUS,
3,False,2016-09-27 23:08:41,en,916.0,2542.0,,,HillaryClinton,https://studio.twitter.com,"If we stand together, there's nothing we can't do. Make sure you're ready to vote: https://t.co/tTgeqxNqYm https://t.co/Q3Ymbb7UNy",,
4,False,2016-09-27 22:30:27,en,859.0,2882.0,,,HillaryClinton,https://about.twitter.com/products/tweetdeck,Both candidates were asked about how they'd confront racial injustice. Only one had a real answer. https://t.co/sjnEokckis,,


Unnamed: 0,IsRetweet,Time,Language,RetweetCount,FavoriteCount,Longitude,Latitude,Author,SourceURL,Content,OriginalAuthor,Place
6439,False,2016-01-05 03:47:14,en,1110.0,4024.0,,,realDonaldTrump,http://twitter.com/download/android,"""@lilredfrmkokomo: @realDonaldTrump My Facebook Groups are all voting TRUMP /4000 people! !!"" Great!",,
6440,False,2016-01-05 03:44:17,en,855.0,3181.0,,,realDonaldTrump,http://twitter.com/download/android,"""@marybnall01: @realDonaldTrump watched lowell mass speech. Awesome. Great crowd. Make America Great Again!!!!!!""",,
6441,False,2016-01-05 03:42:10,en,2315.0,5992.0,,,realDonaldTrump,http://twitter.com/download/android,"""@ghosthunter_lol: Iowa key endorsement for @realDonaldTrump Can't wait for the Iowa caucus in 4 weeks! #Trump2016 https://t.co/JBfyFrZfFb""",,
6442,False,2016-01-05 03:39:11,en,1054.0,3258.0,,,realDonaldTrump,http://twitter.com/download/android,"""@iLoveiDevices: @EdwinRo47796972 @happyjack225 @FoxNews @krauthammer Minimizing dependency on China is crucial.Only Trump talks about that",,
6443,False,2016-01-05 03:36:53,en,748.0,2658.0,,,realDonaldTrump,http://twitter.com/download/android,"""@SalRiccobono: @realDonaldTrump @troyconway Donald get big business back and# MAKE AMERICA GREAT AGAIN FOR 2016""",,


# Text Preprocessing

Here is where the preprocessing magic happens. Below, I've written a function called `preprocess` that applies a couple of standard steps: lowercasing and removing stop words. You can comment some of these steps out, or add more of your own (removing punctuation, lemmatizing, etc.). Up to you!

In [3]:
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer
import unidecode
import re

stop_words = set(stopwords.words('english') + stopwords.words('spanish'))

lemmer = WordNetLemmatizer()

# This function takes as input an entire document, preprocesses each word in the
# document, and returns the preprocessed document.
def preprocess(x):
    # Lower case
    x = x.lower()
    
    # Remove stop words
    x = ' '.join([w for w in x.split() if w not in stop_words])
    
    ## EDIT CODE HERE ##
    
    
    return x

# Some test cases of our function
preprocess("Steve is the man with the plan.")
preprocess("The arsonist had oddly shaped feet.")
preprocess("@PatBatement I'm not really hungry, but I'd like to have a reservation someplace.")
preprocess("GOOOOOAAAAAAALLLLLL!!!!!")

'steve man plan.'

'arsonist oddly shaped feet.'

"@patbatement i'm really hungry, i'd like reservation someplace."

'goooooaaaaaaallllll!!!!!'
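The test outputs above still carry punctuation, @-mentions, and elongated words, and the tweets themselves also contain URLs and `&amp;` entities. As a rough sketch of steps that could go under `## EDIT CODE HERE ##`, the function below uses only the standard `re` module. The name `extra_clean`, the regular expressions, and the choice to cap repeated characters at two are assumptions for illustration, not part of the original exercise.

```python
import re

# Hypothetical patterns; tune them to your own needs.
URL_RE = re.compile(r"https?://\S+")        # links such as https://t.co/...
MENTION_RE = re.compile(r"@\w+")            # @-mentions
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")   # anything but letters, digits, whitespace
REPEAT_RE = re.compile(r"(.)\1{2,}")        # a character repeated 3+ times

def extra_clean(x):
    x = x.lower()
    x = URL_RE.sub(" ", x)                      # drop URLs
    x = MENTION_RE.sub(" ", x)                  # drop @-mentions
    x = x.replace("&amp;", " ")                 # drop the HTML entity for "&"
    x = x.replace("'", "").replace("’", "")     # keep contractions as one token
    x = NON_ALNUM_RE.sub(" ", x)                # drop remaining punctuation
    x = REPEAT_RE.sub(r"\1\1", x)               # cap repeated characters at two
    return " ".join(x.split())                  # normalize whitespace

print(extra_clean("GOOOOOAAAAAAALLLLLL!!!!!"))  # gooaall
print(extra_clean("Make sure you're ready to vote: https://t.co/tTgeqxNqYm"))
```

Whether to drop mentions and hashtags entirely is a judgment call: they may carry topical signal, so try it both ways and compare the resulting topics.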

Now let's apply the `preprocess` function to the tweets in our dataframe.

In [4]:
df['Content_Clean'] = df['Content'].apply(preprocess)

Now, let's look at some of the preprocessed documents.

In [5]:
df.head()
df.tail()

Unnamed: 0,IsRetweet,Time,Language,RetweetCount,FavoriteCount,Longitude,Latitude,Author,SourceURL,Content,OriginalAuthor,Place,Content_Clean
0,False,2016-09-28 00:22:34,en,218.0,651.0,,,HillaryClinton,https://studio.twitter.com,The question in this election: Who can put the plans into action that will make your life better? https://t.co/XreEY9OicG,,,question election: put plans action make life better? https://t.co/xreey9oicg
1,True,2016-09-27 23:45:00,en,2445.0,5308.0,,,HillaryClinton,http://twitter.com,"Last night, Donald Trump said not paying taxes was ""smart."" You know what I call it? Unpatriotic. https://t.co/t0xmBfj7zF",timkaine,,"last night, donald trump said paying taxes ""smart."" know call it? unpatriotic. https://t.co/t0xmbfj7zf"
2,True,2016-09-27 23:26:40,en,7834.0,27234.0,,,HillaryClinton,https://about.twitter.com/products/tweetdeck,Couldn't be more proud of @HillaryClinton. Her vision and command during last night's debate showed that she's ready to be our next @POTUS.,POTUS,,proud @hillaryclinton. vision command last night's debate showed ready next @potus.
3,False,2016-09-27 23:08:41,en,916.0,2542.0,,,HillaryClinton,https://studio.twitter.com,"If we stand together, there's nothing we can't do. Make sure you're ready to vote: https://t.co/tTgeqxNqYm https://t.co/Q3Ymbb7UNy",,,"stand together, there's nothing can't do. make sure ready vote: https://t.co/ttgeqxnqym https://t.co/q3ymbb7uny"
4,False,2016-09-27 22:30:27,en,859.0,2882.0,,,HillaryClinton,https://about.twitter.com/products/tweetdeck,Both candidates were asked about how they'd confront racial injustice. Only one had a real answer. https://t.co/sjnEokckis,,,candidates asked they'd confront racial injustice. one real answer. https://t.co/sjneokckis


Unnamed: 0,IsRetweet,Time,Language,RetweetCount,FavoriteCount,Longitude,Latitude,Author,SourceURL,Content,OriginalAuthor,Place,Content_Clean
6439,False,2016-01-05 03:47:14,en,1110.0,4024.0,,,realDonaldTrump,http://twitter.com/download/android,"""@lilredfrmkokomo: @realDonaldTrump My Facebook Groups are all voting TRUMP /4000 people! !!"" Great!",,,"""@lilredfrmkokomo: @realdonaldtrump facebook groups voting trump /4000 people! !!"" great!"
6440,False,2016-01-05 03:44:17,en,855.0,3181.0,,,realDonaldTrump,http://twitter.com/download/android,"""@marybnall01: @realDonaldTrump watched lowell mass speech. Awesome. Great crowd. Make America Great Again!!!!!!""",,,"""@marybnall01: @realdonaldtrump watched lowell mass speech. awesome. great crowd. make america great again!!!!!!"""
6441,False,2016-01-05 03:42:10,en,2315.0,5992.0,,,realDonaldTrump,http://twitter.com/download/android,"""@ghosthunter_lol: Iowa key endorsement for @realDonaldTrump Can't wait for the Iowa caucus in 4 weeks! #Trump2016 https://t.co/JBfyFrZfFb""",,,"""@ghosthunter_lol: iowa key endorsement @realdonaldtrump can't wait iowa caucus 4 weeks! #trump2016 https://t.co/jbfyfrzffb"""
6442,False,2016-01-05 03:39:11,en,1054.0,3258.0,,,realDonaldTrump,http://twitter.com/download/android,"""@iLoveiDevices: @EdwinRo47796972 @happyjack225 @FoxNews @krauthammer Minimizing dependency on China is crucial.Only Trump talks about that",,,"""@iloveidevices: @edwinro47796972 @happyjack225 @foxnews @krauthammer minimizing dependency china crucial.only trump talks"
6443,False,2016-01-05 03:36:53,en,748.0,2658.0,,,realDonaldTrump,http://twitter.com/download/android,"""@SalRiccobono: @realDonaldTrump @troyconway Donald get big business back and# MAKE AMERICA GREAT AGAIN FOR 2016""",,,"""@salriccobono: @realdonaldtrump @troyconway donald get big business back and# make america great 2016"""


## Topic Modeling with scikit-learn

In sklearn, before we run LDA, we need to vectorize the text. The good news is that vectorizing gives us a nice opportunity to remove rare and common words.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

## EDIT CODE HERE ##

# The number of "features" (i.e., tokens) we want to keep in the BOW.
no_features = 1000

# ngram_range is given as a tuple, which recent scikit-learn versions require
vectorizer = CountVectorizer(max_df=0.99, min_df=0.0001,
                             max_features=no_features, ngram_range=(1, 3))
%time dtm = vectorizer.fit_transform(df['Content_Clean'])
print(dtm.shape)

Wall time: 510 ms
(6444, 1000)


And just for fun, let's look at the top 30 tokens.

In [7]:
from yellowbrick.text import FreqDistVisualizer

# In scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead
feature_names = vectorizer.get_feature_names()
visualizer = FreqDistVisualizer(features=feature_names, n=30)
visualizer.fit(dtm)
visualizer.poof()

FrequencyVisualizer(ax=<matplotlib.axes._subplots.AxesSubplot object at 0x000001A0E7A96978>,
          color=None,
          features=['00', '000', '000 000', '10', '100', '11', '12', '15', '16', '20', '2016', '30', '47246', '50', '7pm', 'able', 'access', 'across', 'act', 'action', 'actually', 'ad', 'address', 'ads', 'afford', 'affordable', 'african', 'african american', 'african americans', 'again', 'again https', 'again... 'you', 'you https', 'you https co', 'you makeamericagreatagain', 'you re', 'young', 'zero', 'zika'],
          n=None, orient='h')

<Figure size 800x550 with 1 Axes>

Now, we run LDA.

In [8]:
from sklearn.decomposition import LatentDirichletAllocation


## EDIT CODE HERE ##

lda_model = LatentDirichletAllocation(n_components=5,
                                      doc_topic_prior=None,
                                      topic_word_prior=None,
                                      max_iter=20, 
                                      learning_method='batch', 
                                      random_state=123,
                                      n_jobs=2,
                                      verbose=0)
%time lda_output = lda_model.fit(dtm)

# Log likelihood: higher is better
lda_model.score(dtm)

# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
lda_model.perplexity(dtm)

# Theta = document-topic matrix
# Beta = components_ = topic-term matrix
theta = pd.DataFrame(lda_model.transform(dtm))
beta = pd.DataFrame(lda_model.components_)

Wall time: 10.1 s


-369992.9272526116

330.3561452032795
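The comment in the cell above gives the relationship perplexity = exp(-log-likelihood per word). As a quick sanity check, the sketch below applies that formula using the standard `math` module. The helper names are hypothetical, and the token count is back-calculated from the two printed numbers rather than read from `dtm`, so treat it as an estimate.

```python
import math

def perplexity_from_log_likelihood(log_likelihood, n_tokens):
    """Perplexity = exp(-1 * log-likelihood per token)."""
    return math.exp(-log_likelihood / n_tokens)

def implied_token_count(log_likelihood, perplexity):
    """Invert the formula above to recover the number of tokens."""
    return -log_likelihood / math.log(perplexity)

# The two values printed by the cell above
log_likelihood = -369992.9272526116
perplexity = 330.3561452032795

n_tokens = implied_token_count(log_likelihood, perplexity)
print(round(n_tokens))

# Round trip: plugging the token count back in recovers the perplexity
print(perplexity_from_log_likelihood(log_likelihood, n_tokens))
```

In the notebook itself, `dtm.sum()` should give the actual token count to compare against. Because the log likelihood scales with corpus size, perplexity is the easier of the two numbers to compare across runs that use different vocabularies or preprocessing.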

# Look at the Topics

Let's inspect the topics by listing each one's top words. We'll also compute some simple metrics for each topic: how many documents contain it above a given share (_support_) and its total share summed over all documents (_weight_).

In [9]:
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_columns', 0)

# How many words to display for each topic
no_top_words = 10
weight = theta.sum(axis=0)

# Number of documents in which the topic's share is more than 50%
support50 = (theta > 0.5).sum(axis=0)

# Number of documents in which the topic's share is more than 10%
support10 = (theta > 0.1).sum(axis=0)
termss = list()
for topic_id, topic in enumerate(lda_model.components_):
    terms = " ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]])
    termss.append(terms)
topic_summary = pd.DataFrame({'TopicID': range(0, len(termss)), "Support50": support50, "Support10": support10, "Weight": weight, "Terms": termss})

# Display the topics; sort by Weight
topic_summary.sort_values('Weight', ascending=False)

Unnamed: 0,TopicID,Support50,Support10,Weight,Terms
2,2,1657,2386,1597.632208,co https co https trump donald donald trump hillary he realdonaldtrump it
3,3,1353,2416,1492.622506,co https co https thank trump2016 you makeamericagreatagain thank you trump2016 https trump2016 ...
0,0,1223,2450,1373.784758,people cruz amp great big trump realdonaldtrump ted like many
4,4,894,2387,1174.337207,hillary clinton we hillary clinton crooked crooked hillary re president women together
1,1,528,1722,805.623322,america great make again president make america first last america great great again
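To make the two metrics concrete, here is a toy version in plain Python. Each row of the hypothetical `toy_theta` is one document's topic mixture (rows sum to 1). Weight is the column sum and support is the count of documents above a threshold, mirroring `theta.sum(axis=0)` and `(theta > 0.5).sum(axis=0)` in the cell above. The numbers are made up for illustration.

```python
# Rows = documents, columns = topics; each row sums to 1.
toy_theta = [
    [0.90, 0.05, 0.05],
    [0.60, 0.30, 0.10],
    [0.20, 0.70, 0.10],
    [0.15, 0.15, 0.70],
]

def topic_weight(theta, topic_id):
    """Total share of a topic, summed over all documents."""
    return sum(row[topic_id] for row in theta)

def topic_support(theta, topic_id, threshold):
    """Number of documents whose share of the topic exceeds the threshold."""
    return sum(1 for row in theta if row[topic_id] > threshold)

n_topics = len(toy_theta[0])
weights = [topic_weight(toy_theta, k) for k in range(n_topics)]
support50 = [topic_support(toy_theta, k, 0.5) for k in range(n_topics)]
support10 = [topic_support(toy_theta, k, 0.1) for k in range(n_topics)]

print(weights)
print(support50)  # [2, 1, 1]
print(support10)  # [4, 3, 1]
```

Because every row sums to 1, the weights add up to the number of documents. The same holds for the table above, where the five `Weight` values total 6,444.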


# Show some Documents that Match a Given Topic

In [11]:
# Make sure the text doesn't get truncated when printed
# (pandas 1.0 and later expect None here instead of -1)
pd.set_option('display.max_colwidth', -1)

# Which topic are you interested in?
## EDIT CODE HERE ##
topic_id = 3

# Find the 5 documents with the largest membership (i.e., theta) for this topic
Memberships = theta.iloc[:, topic_id].nlargest(5)

Memberships

# Display those documents (and getting rid of some of the columns)
df[['Time', 'RetweetCount', 'FavoriteCount', 'Author', 'Content', 'Content_Clean']].iloc[Memberships.index]

260     0.963223
1658    0.962588
532     0.959687
6276    0.959604
2716    0.959503
Name: 3, dtype: float64

Unnamed: 0,Time,RetweetCount,FavoriteCount,Author,Content,Content_Clean
260,2016-09-21 22:35:19,10434.0,27350.0,realDonaldTrump,"Great new polls! Thank you Nevada, North Carolina &amp; Ohio. Join the MOVEMENT today &amp; lets #MAGA!… https://t.co/Y8Sb8MNyXA","great new polls! thank nevada, north carolina &amp; ohio. join movement today &amp; lets #maga!… https://t.co/y8sb8mnyxa"
1658,2016-08-04 15:17:01,5814.0,18830.0,realDonaldTrump,"Looking forward to IA &amp; WI with Gov. Pence, tomorrow. Join us! #MAGA https://t.co/3Hcnzj0Slx https://t.co/sEwLWkn1Sz https://t.co/0Ei3EdQdXB","looking forward ia &amp; wi gov. pence, tomorrow. join us! #maga https://t.co/3hcnzj0slx https://t.co/sewlwkn1sz https://t.co/0ei3edqdxb"
532,2016-09-15 02:33:39,6649.0,20014.0,realDonaldTrump,Great poll out of Nevada- thank you! See you soon. #MAGA #AmericaFirst https://t.co/3KWOl2ibaW https://t.co/27sR3MjjXc,great poll nevada- thank you! see soon. #maga #americafirst https://t.co/3kwol2ibaw https://t.co/27sr3mjjxc
6276,2016-01-18 17:48:53,2088.0,5831.0,realDonaldTrump,A great morning with everyone @LibertyU! Thank you! Off to New Hampshire now. #Trump2016 https://t.co/XUWGANbq8k https://t.co/aEMUMqSoWm,great morning everyone @libertyu! thank you! new hampshire now. #trump2016 https://t.co/xuwganbq8k https://t.co/aemumqsowm
2716,2016-07-14 14:53:46,8970.0,23690.0,realDonaldTrump,Another new poll. Thank you for your support! Join the MOVEMENT today! #ImWithYou https://t.co/3KWOl2ibaW https://t.co/miT4atHxQz,another new poll. thank support! join movement today! #imwithyou https://t.co/3kwol2ibaw https://t.co/mit4athxqz


# Visualize Topics with LDAVis

In [12]:
import pyLDAvis.sklearn
 
pyLDAvis.enable_notebook()
%time pyLDAvis.sklearn.prepare(lda_model, dtm, vectorizer, mds="tsne")

Wall time: 12.1 s


