### Loading and preprocessing data

The comments from Donald Trump's official Facebook page were collected using the Facebook Graph API and stored in a MongoDB database. In this notebook I define five time periods of interest, draw a random sample of 1 million comments from the database for each period (5 million in total), and store each sample as a JSON file.

***Note: Unless you have collected the data on your own and have an identical database, you cannot run this code. It is here only to show how I preprocessed the data.***

In [1]:
from pymongo import MongoClient
from db_tools import *
from datetime import datetime  # used below to define the time periods
import re
import json
client = MongoClient('localhost', 27017)
db = client.DATABASE # Database name and credentials have been changed for security
db.authenticate('USERNAME', 'PASSWORD', source='USER')

True

In [3]:
t = db.DonaldTrump_comments

In [4]:
t.count()

16943290

In [8]:
t.find_one()

{'_id': ObjectId('59fdeaafdf6f4525f4ae8590'),
 'author_id': '1399832183389550',
 'comment_author': 'Hemin Badraddin',
 'comment_id': '10160057480090725_284567205380253',
 'comment_message': '#kurdistan \n#supportkurdistan\n#KurdistanBlockade\n#Peshmerga',
 'like_count': 0,
 'page_id': 'DonaldTrump',
 'position': 9,
 'status_id': '153080620724_10160057480090725',
 'timestamp': datetime.datetime(2017, 10, 26, 19, 27, 36)}

Creating indices on the collection to speed up query time.

In [9]:
%%time
import pymongo
t.create_index([('author_id', pymongo.ASCENDING)],unique=False)
t.create_index([('timestamp', pymongo.ASCENDING)],unique=False)
t.create_index([('comment_message', pymongo.TEXT)], unique=False, 
                               default_language='english')

CPU times: user 1.97 s, sys: 960 ms, total: 2.93 s
Wall time: 46min 53s


In [15]:
def SampleByDate(collection, start_date, end_date, sample_size=5):
    """Returns a random sample of the records in a collection whose
    timestamp falls in the half-open range [start_date, end_date).
    Sample size can also be set."""
    cursor = collection.aggregate([
        # Keep only comments posted within the period
        {'$match': {'timestamp': {'$gte': start_date, '$lt': end_date}}},
        # Draw sample_size documents at random from the matches
        {'$sample': {'size': sample_size}}
    ], allowDiskUse=True)
    return list(cursor)
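
For readers without access to the database, the behaviour of this pipeline can be approximated in plain Python. This is a minimal sketch, not the code used for the analysis: `sample_by_date_in_memory` and the toy records are hypothetical, and `random.sample` stands in for MongoDB's `$sample` stage.

```python
import random
from datetime import datetime

def sample_by_date_in_memory(records, start_date, end_date, sample_size=5, seed=None):
    """Hypothetical in-memory analogue of a $match + $sample pipeline:
    keep records with start_date <= timestamp < end_date, then draw a
    random sample without replacement."""
    matched = [r for r in records if start_date <= r["timestamp"] < end_date]
    rng = random.Random(seed)
    # Cap the sample size at the number of matches, since random.sample
    # raises ValueError when asked for more items than are available
    k = min(sample_size, len(matched))
    return rng.sample(matched, k)

# Toy records, one per day in June 2015 (made-up data for illustration)
toy_records = [
    {"comment_message": f"comment {day}", "timestamp": datetime(2015, 6, day)}
    for day in range(1, 31)
]

toy_sample = sample_by_date_in_memory(
    toy_records, datetime(2015, 6, 10), datetime(2015, 6, 20), sample_size=3, seed=0
)
print(len(toy_sample))
```

Because the upper bound is exclusive, a record stamped exactly at the end date is never returned.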

Defining functions to clean up the raw text data.

In [20]:
def remove_url(text_string):
    """
    Accepts a text string and replaces:
    1) urls with the placeholder URL

    """
    # Raw string so that the backslash escapes reach the regex engine unchanged
    giant_url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    parsed_text = re.sub(giant_url_regex, 'URL', text_string)
    return parsed_text

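def tokenize(comment):
    """Replaces URLs, removes punctuation, digits & excess whitespace, and
    sets to lowercase. Returns a list of tokens (no stemming is applied)."""
    comment = remove_url(comment)
    # Split on runs of non-letters; '+' (rather than '*') keeps the pattern
    # from matching the empty string
    tokens = re.split("[^a-zA-Z]+", comment.lower().strip())
    return [tok for tok in tokens if tok]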

In [25]:
s = 'This is a test string!!! http://www.test.com'
tokenize(s)


['this', 'is', 'a', 'test', 'string', 'url']
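
A note on the split pattern used for tokenization: a pattern such as `[^a-zA-Z]*` can match the empty string. Older Python versions emit a `FutureWarning` for such patterns in `re.split`, and from Python 3.7 onwards empty matches are treated as split points. The sketch below sidesteps both by requiring at least one non-letter per separator. The helper name `split_on_non_letters` is hypothetical and is not part of the original notebook.

```python
import re

def split_on_non_letters(text):
    """Lowercase the text and split it on runs of non-letter characters."""
    # '+' requires at least one non-letter per separator, so the pattern
    # can never match the empty string
    pieces = re.split(r"[^a-zA-Z]+", text.lower().strip())
    # Drop the empty strings produced by leading or trailing separators
    return [tok for tok in pieces if tok]

print(split_on_non_letters("This is a test string!!! URL"))
# ['this', 'is', 'a', 'test', 'string', 'url']
```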

In [29]:
S = SampleByDate(t, datetime(2015,6,16), datetime(2015,12,16), 100)

In [33]:
S_ = [x['comment_message'] for x in S]

In [34]:
S_

['The best part- he is NOT a politician.',
 "Don't sign the pledge, don't sign the pledge, don't sign the pledge. You can win this thing even as an independent!",
 'Trump 2016',
 "I'm pretty sure he doesn't actually want yo be president. Like it was a joke that got too far and now has has to say all these ridiculous things so he doesn't have to back out",
 'As someone who lived in MD under his tenure as Governor. I can assure you he is as stupid as he seems.  He destroyed that state.',
 'RACISTS homophobes>>>lol pc?',
 "I'm on board!",
 "If Trump actually read the Bill of Rights, he'd know that he can't close a Mosque. So he either didn't read it. Doesn't understand it, or is pandering to the fools who think he'll still be in the race in 4 months. This is a publicity stunt. He's trailing Carson in the Iowa polls. He has zero chance of ever being elected to any office.",
 'Do it if the GOP establishment (old guard politicians) have turned their backs on you. And if the election is lost 

In [None]:
# Set of time periods to analyse. SampleByDate treats each pair as a
# half-open range, so the end date itself is not included.
dates = [(datetime(2015,6,16), datetime(2015,12,15)), # start of campaign to final debate
         (datetime(2015,12,16), datetime(2016,5,1)), # until Trump named presumptive nominee
         (datetime(2016,5,2), datetime(2016,11,8)), # until election day
         (datetime(2016,11,9), datetime(2017,4,26)), # until 100th day in office
         (datetime(2017,4,27), datetime(2017,11,3)), # until last Friday (at the time of writing)
        ]
# For each period, get a sample of comments, clean them up, and store on disk
for i, (start, end) in enumerate(dates):
    print(i)  # index of the period being processed
    sample = SampleByDate(t, start, end, sample_size=1000000)
    print(len(sample))  # number of comments actually sampled
    # Keep only the tokenized text of each comment
    sample = [tokenize(x['comment_message']) for x in sample]
    # One file per period: 0.json, 1.json, ...
    with open(str(i)+'.json','w') as f:
        json.dump(sample, f)

0
1000000
1
1000000
2
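
Each stored file holds a list of comments, and each comment is a list of tokens. As a rough sketch of how such a file could be read back for later analysis, the example below writes a toy file and loads it again. The helper `load_tokenized_sample` and the toy data are hypothetical and the file is written to a temporary directory, so nothing here touches the real samples.

```python
import json
import os
import tempfile
from collections import Counter

def load_tokenized_sample(path):
    """Load one period's sample: a list of comments, each a list of tokens."""
    with open(path) as f:
        return json.load(f)

# Toy stand-in for a file such as 0.json (made-up data, not real comments)
toy_sample = [["trump", "for", "president"],
              ["make", "america", "great", "again"],
              ["trump"]]
toy_path = os.path.join(tempfile.mkdtemp(), "0.json")
with open(toy_path, "w") as f:
    json.dump(toy_sample, f)

loaded = load_tokenized_sample(toy_path)
# Count how often each token occurs across all comments in the sample
token_counts = Counter(tok for comment in loaded for tok in comment)
print(len(loaded), token_counts.most_common(1))
```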
