### Context

As part of the House Intelligence Committee investigation into how Russia may have influenced the 2016 US election, Twitter released the screen names of almost 3,000 Twitter accounts believed to be connected to Russia’s Internet Research Agency, a company known for operating social media troll accounts. Twitter immediately suspended these accounts, deleting their data from Twitter.com and the Twitter API. A team at NBC News, including Ben Popken and EJ Fox, reconstructed a subset of the deleted data for their investigation and showed how these troll accounts went on the attack during key election moments. This dataset is the body of that open-sourced reconstruction.


### Content
The dataset contains two CSV files. tweets.csv includes details on individual tweets, while users.csv includes details on individual accounts.

To recreate a link to an individual tweet found in the dataset, take the template https://twitter.com/user_key/status/tweet_id and replace user_key with the screen name from the user_key field and tweet_id with the number from the tweet_id field.

Following these links leads to a suspended-account page on Twitter. However, copies of some tweets as they originally appeared, including images, can be found by entering the links into web archives such as archive.org and archive.is.
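
The link recipe above can be sketched in a few lines of Python. This is a minimal illustration and not part of the original notebook: `tweet_url` and `wayback_url` are hypothetical helper names, the values are placeholders, and the `https://web.archive.org/web/<url>` lookup form is an assumption about how the Wayback Machine addresses captures. Passing `tweet_id` as a string avoids the precision loss that can occur if the column is parsed as floating point.

```python
def tweet_url(user_key, tweet_id):
    """Build the original tweet link from the user_key and tweet_id fields."""
    return "https://twitter.com/{}/status/{}".format(user_key, tweet_id)


def wayback_url(url):
    """Assumed Wayback Machine lookup form for an archived copy of url."""
    return "https://web.archive.org/web/" + url


# Placeholder values, not rows from the dataset
link = tweet_url("some_user", "1234567890")
print(link)
print(wayback_url(link))
```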

In [1]:
import warnings
warnings.filterwarnings('ignore')
import nltk
import re
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
import json
import pandas as pd
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from gensim import similarities
from sklearn.metrics.pairwise import euclidean_distances


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/saviaga/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/saviaga/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
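# Load the tweets; the 'text' column holds the tweet bodies
datafile = pd.read_csv('tweets.csv')
data = datafile['text'].tolist()

NO_DOCUMENTS = len(data)
print(NO_DOCUMENTS)
print(type(data))
print(data[:5])

# Count missing values per column, then show (rows, columns)
print(datafile.isnull().sum(axis=0))
print(datafile.shape)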

203482
<class 'list'>
['#IslamKills Are you trying to say that there were no terrorist attacks in Europe before refugees were let in?', 'Clinton: Trump should’ve apologized more, attacked less https://t.co/eJampkoHFZ', 'RT @ltapoll: Who was/is the best president of the past 25 years? (Vote &amp; Retweet)', "RT @jww372: I don't have to guess your religion! #ChristmasAftermath", 'RT @Shareblue: Pence and his lawyers decided which of his official emails the public could see\r\n\r\nhttps://t.co/HjhPguBK1Y by @alisonrose711']
user_id                    8065
user_key                      0
created_at                   21
created_str                  21
retweet_count            145399
retweeted                145399
favorite_count           145399
text                         21
tweet_id                   2314
source                   145398
hashtags                      0
expanded_urls                 0
posted                        0
mentions                      0
retweeted_status_id      

In [3]:
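# Drop the 21 rows whose text is missing
datafile = datafile[datafile['text'].notna()]
print(datafile.shape)
# data is now a pandas Series that keeps the original row labels
data = datafile["text"]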

(203461, 16)


In [4]:
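NUM_TOPICS = 10
# NOTE: the tweets appear to be mostly English, so the Spanish stop-word list
# leaves words such as "the" and "and" in the topics printed further down;
# stopwords.words('english') is probably what was intended.
STOPWORDS = stopwords.words('spanish')
#STOPWORDS.extend(['https', 'verificado'])

def clean_text(text):
    """Lowercase and tokenize, then keep non-stop-word tokens that start with 3+ letters or hyphens."""
    tokenized_text = word_tokenize(text.lower())
    cleaned_text = [t for t in tokenized_text if t not in STOPWORDS and re.match(r'[a-zA-Z\-][a-zA-Z\-]{2,}', t)]
    return cleaned_text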
 
# For gensim we need to tokenize the data and filter out stopwords
tokenized_data = []
for text in data:
    tokenized_data.append(clean_text(text))
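
To see what the token filter in `clean_text` lets through, the sketch below applies the same regular expression to a hand-made token list. It skips `word_tokenize` and uses a one-word stop list purely for illustration; `keep_token` is a hypothetical helper. Because `re.match` anchors only at the start of the token, anything that begins with three letters or hyphens passes, which is why URL fragments such as `https` and the HTML-entity remnant `amp` show up in the topics.

```python
import re

# Same pattern as in clean_text: three or more letters/hyphens at the start
TOKEN_RE = re.compile(r'[a-zA-Z\-][a-zA-Z\-]{2,}')


def keep_token(token, stop_words=frozenset()):
    """True if the token is not a stop word and starts with 3+ letters or hyphens."""
    return token not in stop_words and TOKEN_RE.match(token) is not None


# Hand-made tokens, not output of word_tokenize
tokens = ["rt", "https", "amp", "the", "trump", "2016", "abc123", "--"]
kept = [t for t in tokens if keep_token(t, stop_words={"the"})]
print(kept)  # ['https', 'amp', 'trump', 'abc123']
```

Adding `https`, `http`, and `amp` to the stop list, or matching with `re.fullmatch`, would be two ways to tighten the filter.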

In [5]:
# Build a Dictionary - association word to numeric id
dictionary = corpora.Dictionary(tokenized_data)

# `text` is still bound to the last tweet from the loop in the previous cell
print(text)
# Transform the collection of texts to a numerical form
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
 
# Look at the document at index 20: [(word_id, count), ...]
print(corpus[20])
# [(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2),  ...


RT @futureguru100: U cant just Upload a CD online &amp; thats it. Where is ya Product? Where Ya Work?
U Gotta Represent &amp; Your Brand w/ a  Qual…
[(17, 1), (21, 1), (111, 1), (178, 1), (179, 1), (180, 1)]
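
The `(word_id, count)` pairs above are a bag-of-words encoding. The following standard-library sketch mimics the idea behind `corpora.Dictionary` and `doc2bow` on a toy corpus; `build_dictionary` and `to_bow` are hypothetical stand-ins, and the ids they assign (first-seen order) will not necessarily match the ids gensim would assign.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Map each distinct token to an integer id, in first-seen order."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(token2id, doc):
    """Return sorted (word_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Toy corpus, not taken from tweets.csv
docs = [["trump", "news", "trump"], ["news", "video"]]
token2id = build_dictionary(docs)
bows = [to_bow(token2id, d) for d in docs]
print(token2id)
print(bows)
```

Tokens missing from the dictionary are dropped silently, so a query made of unseen words produces an empty or nearly empty bag.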


In [6]:
# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)
 
# Build the LSI model
lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

print("LDA Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most heavily weighted words of each topic
    print("Topic #%s:" % idx, lda_model.print_topic(idx, 10))
 
print("=" * 20)


LDA Model:
Topic #0: 0.050*"the" + 0.043*"https" + 0.023*"this" + 0.017*"and" + 0.015*"that" + 0.014*"was" + 0.009*"she" + 0.009*"look" + 0.008*"one" + 0.008*"her"
Topic #1: 0.163*"https" + 0.014*"new" + 0.013*"the" + 0.013*"via" + 0.010*"news" + 0.010*"trump" + 0.010*"http" + 0.010*"for" + 0.010*"video" + 0.009*"with"
Topic #2: 0.097*"trump" + 0.096*"https" + 0.053*"clinton" + 0.049*"hillary" + 0.025*"donald" + 0.018*"for" + 0.017*"politics" + 0.011*"campaign" + 0.011*"obama" + 0.010*"says"
Topic #3: 0.050*"the" + 0.031*"you" + 0.031*"https" + 0.025*"and" + 0.022*"are" + 0.018*"they" + 0.017*"for" + 0.016*"that" + 0.015*"not" + 0.014*"people"
Topic #4: 0.070*"the" + 0.062*"https" + 0.028*"trump" + 0.020*"for" + 0.015*"and" + 0.012*"amp" + 0.010*"day" + 0.009*"media" + 0.008*"realdonaldtrump" + 0.007*"about"
Topic #5: 0.086*"https" + 0.031*"tcot" + 0.022*"pjnet" + 0.013*"ccot" + 0.013*"maga" + 0.011*"again" + 0.009*"america" + 0.009*"isis" + 0.008*"wakeupamerica" + 0.007*"the"
Topic #6
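
Each topic line above is a string of `weight*"word"` terms. If the weights are needed as numbers, one option is to parse the string, as in this sketch (`parse_topic` is a hypothetical helper and the sample string is abbreviated from Topic #2). gensim's `show_topic` should return the same pairs directly, which is usually the simpler route.

```python
import re

# One term looks like 0.097*"trump"
TERM_RE = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_str):
    """Turn a print_topic-style string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_RE.findall(topic_str)]


topic = '0.097*"trump" + 0.096*"https" + 0.053*"clinton"'
print(parse_topic(topic))
```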

In [7]:
# Spanish query: "Go to our website and download the available resources"
text = "Ingresa a nuestro sitio web y descarga los recursos disponibles"
bow = dictionary.doc2bow(clean_text(text))
 
print(lsi_model[bow])
# [(0, 0.091615426138426506), (1, -0.0085557463300508351), (2, 0.016744863677828108), (3, 0.040508186718598529), (4, 0.014201267714185898), (5, -0.012208538275305329), (6, 0.031254053085582149), (7, 0.017529584659403553), (8, 0.056957633371540077),
# (9, 0.025989149894888153)]
 
print(lda_model[bow])
# [(0, 0.020005183), (1, 0.020005869), (2, 0.02000626), (3, 0.020005472), (4, 0.020009108), (5, 0.020005926), (6, 0.81994385), (7, 0.020006068), (8, 0.020006327), (9, 0.020005994)]


[(0, 0.00040263501296447984), (1, -0.00019277469535821095), (2, 5.5608749833277724e-05), (3, -0.00029970111582406874), (4, 0.0001562308358922775), (5, -4.4730341114685035e-05), (6, 8.17114903444009e-05), (7, -0.0006038637518214931), (8, -0.00011906083217209624), (9, -2.2907524087936324e-05)]
[(0, 0.050030753), (1, 0.050030753), (2, 0.050030753), (3, 0.050030753), (4, 0.050030753), (5, 0.050030753), (6, 0.050030753), (7, 0.050030753), (8, 0.050030753), (9, 0.54972327)]
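
The LDA result above, nine topics near 0.05 and one near 0.55, is consistent with a query in which only a single token was found in the dictionary. As a rough sketch, assuming a symmetric document-topic prior of 1/NUM_TOPICS = 0.1 (an assumption about the library default, not something the notebook states) and assuming each matched word is assigned entirely to one topic, the mixture is approximately the normalized sum of the prior and the per-topic word counts. `approx_topic_mixture` is a hypothetical helper, not the inference routine the library runs.

```python
def approx_topic_mixture(counts_per_topic, alpha):
    """Normalize (prior + word count) per topic into a probability vector."""
    pseudo_counts = [alpha + c for c in counts_per_topic]
    total = sum(pseudo_counts)
    return [p / total for p in pseudo_counts]


# One matched word, assigned to the last of 10 topics
mixture = approx_topic_mixture([0] * 9 + [1], alpha=0.1)
print([round(m, 3) for m in mixture])
```

With no matched words at all the same formula gives a uniform 0.1 per topic, so a query in the wrong language carries almost no signal.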


In [8]:
# gensim similarities: index every document by its LDA topic distribution
lda_index = similarities.MatrixSimilarity(lda_model[corpus])
 
# Query the index with the topic distribution of the sample text.
# The result gets its own name so it does not shadow the gensim `similarities` module.
sims = lda_index[lda_model[bow]]
# Sort by similarity, highest first
sims = sorted(enumerate(sims), key=lambda item: -item[1])
 
# Top most similar documents:
print(sims[:10])
# [(104, 0.87591344), (178, 0.86124849), (31, 0.8604598), (77, 0.84932965), (85, 0.84843522), (135, 0.84421808), (215, 0.84184396), (353, 0.84038532), (254, 0.83498049), (13, 0.82832891)]
 
# Show the most similar document. `data` is a Series that kept its original
# row labels, so index it by position with .iloc
document_id, similarity = sims[0]
print(data.iloc[document_id][:1000])


[(24385, 1.0000001), (47582, 1.0000001), (73654, 1.0000001), (85584, 1.0000001), (120454, 1.0000001), (134648, 1.0000001), (156788, 1.0000001), (5, 1.0), (169, 1.0), (257, 1.0)]
RT @Delo_Taylor: At the rate this is going @HillaryClinton apologists are going to be voting for neoliberalism to avoid neofascism until th…
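
`MatrixSimilarity` scores documents by cosine similarity, so every document whose topic vector points in the same direction as the query scores 1.0. The standard-library sketch below (`cosine` is a hypothetical helper and the vectors are made up) suggests why a query dominated by a single topic ties with many documents in the list above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


query = [0.05] * 9 + [0.55]   # one dominant topic
same = [0.05] * 9 + [0.55]    # a document with the same mixture
other = [0.55] + [0.05] * 9   # a document dominated by a different topic
print(cosine(query, same))
print(cosine(query, other))
```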


In [9]:
NUM_TOPICS = 10
 
vectorizer = CountVectorizer(min_df=5, max_df=0.9,
                             stop_words='english', lowercase=True,
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)
 
# Build a Latent Dirichlet Allocation Model
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
print(lda_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)
 
# Build a Non-Negative Matrix Factorization Model
nmf_model = NMF(n_components=NUM_TOPICS)
nmf_Z = nmf_model.fit_transform(data_vectorized)
print(nmf_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)
 
# Build a Latent Semantic Indexing Model
lsi_model = TruncatedSVD(n_components=NUM_TOPICS)
lsi_Z = lsi_model.fit_transform(data_vectorized)
print(lsi_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)
 
 
# See what the first document in the corpus looks like in each topic space
print(lda_Z[0])
print(nmf_Z[0])
print(lsi_Z[0])
 
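def print_topics(model, vectorizer, top_n=10):
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is its replacement (available since 1.0)
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the top_n largest weights, in descending order
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-top_n - 1:-1]])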
 
print("LDA Model:")
print_topics(lda_model, vectorizer)
print("=" * 20)
 
print("NMF Model:")
print_topics(nmf_model, vectorizer)
print("=" * 20)
 
print("LSI Model:")
print_topics(lsi_model, vectorizer)
print("=" * 20)  

text = "Ingresa a nuestro sitio web y descarga los recursos disponibles"
x = nmf_model.transform(vectorizer.transform([text]))[0]
print("transformed: ",x)


 
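def most_similar(x, Z, top_n=5):
    """Return (row_index, distance) pairs for the top_n rows of Z closest to x."""
    dists = euclidean_distances(x.reshape(1, -1), Z)
    pairs = enumerate(dists[0])
    return sorted(pairs, key=lambda item: item[1])[:top_n]
 
# Despite the function name these are distances: smaller means more similar
nearest = most_similar(x, nmf_Z)
document_id, distance = nearest[0]
# `data` kept its original row labels, so index it by position with .iloc
print(data.iloc[document_id][:1000])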

import pandas as pd
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet
output_notebook()

svd = TruncatedSVD(n_components=2)
documents_2d = svd.fit_transform(data_vectorized)
 
df = pd.DataFrame(columns=['x', 'y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:,0], documents_2d[:,1], range(len(data))
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="document", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

svd = TruncatedSVD(n_components=2)
words_2d = svd.fit_transform(data_vectorized.T)
 
df = pd.DataFrame(columns=['x', 'y', 'word'])
df['x'], df['y'], df['word'] = words_2d[:,0], words_2d[:,1], vectorizer.get_feature_names_out()
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

 
NUM_TOPICS = 10
 
vectorizer = CountVectorizer(min_df=5, max_df=0.9,
                             stop_words='english', lowercase=True,
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)
 
# Build a Latent Dirichlet Allocation Model
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
 
text = "Ingresa a nuestro sitio web y descarga los recursos disponibles"
x = lda_model.transform(vectorizer.transform([text]))[0]
print(x, x.sum())



(203461, 10)
(203461, 10)
(203461, 10)
[0.01111111 0.01111111 0.01111111 0.01111545 0.01111111 0.78888772
 0.01111111 0.12221906 0.01111111 0.01111111]
[0.00051181 0.00100233 0.00157907 0.00157502 0.00174148 0.00412459
 0.00380679 0.00021304 0.01527405 0.00563054]
[0.02926112 0.02567382 0.02882798 0.07884688 0.00509846 0.04113675
 0.12253351 0.05108473 0.07157021 0.03833418]
LDA Model:
Topic 0:
[('https', 10550.271956088687), ('hillary', 8180.525346980807), ('clinton', 5923.264868326948), ('amp', 4783.661854290255), ('trump', 3528.660915054555), ('day', 2197.30768833215), ('debate', 1563.498829861811), ('party', 1375.1695049558791), ('work', 1373.1499790896435), ('big', 1297.299357961737)]
Topic 1:
[('https', 10848.567030962502), ('trump', 5899.993313344944), ('news', 5022.522842425017), ('vote', 3263.638157393209), ('election', 3235.0080299087176), ('amp', 3116.187270502312), ('media', 2791.387849531086), ('right', 2742.9002537862366), ('america', 2625.2441558987784), ('make', 2236.42

[('amp', 0.6219424404740275), ('obama', 0.03353685940132449), ('realdonaldtrump', 0.0295846245609621), ('people', 0.024338083392231268), ('don', 0.020941646785270874), ('america', 0.012564088886192624), ('vote', 0.011541176833321164), ('media', 0.009604892341914405), ('potus', 0.009248450990362636), ('make', 0.00799226449056618)]
Topic 5:
[('obama', 0.9125098778903641), ('president', 0.09960117902291993), ('tcot', 0.04432051189606114), ('news', 0.04409303798137925), ('barack', 0.039139586340414596), ('michelle', 0.038194380087034206), ('politics', 0.03375500225893405), ('pjnet', 0.03243843507698438), ('hillary', 0.03140501492950974), ('says', 0.02957937661377087)]
Topic 6:
[('people', 0.5766516033817707), ('don', 0.42058556304719824), ('hillary', 0.29884338379238307), ('just', 0.26570993284997746), ('like', 0.2502150908060375), ('want', 0.09159303040716431), ('know', 0.08987745265238381), ('realdonaldtrump', 0.07927827238309015), ('think', 0.07150513647045047), ('black', 0.068037632306

[0.03333333 0.03333333 0.03333333 0.03333333 0.36666666 0.36666667
 0.03333333 0.03333333 0.03333333 0.03333333] 1.0
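
The nearest-neighbour lookup in `most_similar` ranks documents by Euclidean distance in topic space. The sketch below reproduces that ranking with the standard library only; `nearest_rows` is a hypothetical stand-in for the scikit-learn based helper and the vectors are toy values, not rows of `nmf_Z`.

```python
import math


def nearest_rows(x, Z, top_n=5):
    """Return (row_index, distance) pairs for the top_n rows of Z closest to x."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, row))) for row in Z]
    return sorted(enumerate(dists), key=lambda item: item[1])[:top_n]


# Toy 2-topic document vectors
Z = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
x = [1.0, 0.0]
result = nearest_rows(x, Z, top_n=2)
print(result)
```

Unlike cosine similarity, Euclidean distance is sensitive to vector length, so the two measures can rank NMF vectors differently.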


In [10]:
# In pyLDAvis 3.4 and later this module is named pyLDAvis.lda_model
import pyLDAvis.sklearn
 
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, data_vectorized, vectorizer, mds='tsne')
panel