In [38]:
import tweepy #https://github.com/tweepy/tweepy
import csv
import re
import sys
import os
%matplotlib inline

Set up API and such

In [39]:
#Twitter API credentials
consumer_key = "CONSUMER_KEY"
consumer_secret = "CONSUMER_SECRET"
access_key = "ACCESS_KEY"
access_secret = "MOTHERS_MAIDEN_NAME"



OAUTH_KEYS = {'consumer_key':consumer_key, 'consumer_secret':consumer_secret,
 'access_token_key':access_key, 'access_token_secret':access_secret}
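auth = tweepy.OAuthHandler(OAUTH_KEYS['consumer_key'], OAUTH_KEYS['consumer_secret'])
# Attach the access token too; otherwise the handler only carries the consumer credentials.
auth.set_access_token(OAUTH_KEYS['access_token_key'], OAUTH_KEYS['access_token_secret'])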

# In order to manage the rate limiting, use these options below. 
# You will find later that rate limiting is, well, a limiting factor.

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

So what we want to do here is build a corpus of text from which the Markov model can estimate word-to-word transition probabilities. The steps look like this:

1. Collect the data
2. 'Clean' the data
3. Combine the data
4. Apply Markov model
5. Generate some example tweets!
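
To make step 4 a bit more concrete, here is a minimal sketch of a word-level Markov chain. It is a toy, not how markovify is implemented: the names `build_chain`, `generate`, and `toy_corpus` are made up for illustration, and it uses one-word states, whereas markovify defaults to two-word states.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, max_words=10, seed=None):
    """Walk the chain from `start`, picking each next word at random."""
    rng = random.Random(seed)
    word, output = start, [start]
    # Stop when the current word has no known follower or the cap is reached.
    while word in chain and len(output) < max_words:
        word = rng.choice(chain[word])
        output.append(word)
    return " ".join(output)

# Hypothetical toy corpus, not the real tweet data.
toy_corpus = "patch your systems and patch your ego and trust your logs"
chain = build_chain(toy_corpus)
print(chain["your"])
print(generate(chain, "patch", seed=1))
```

Words that appear more often after a given word show up more often in its follower list, so `rng.choice` samples them in proportion to the observed transition frequencies.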

First I want to collect tweets from all the 'friends' of the account:

In [40]:

THECORPUSOFMYMIND = []
thought_leaders = []
user = 'infosec_truths'
print("importing thought leadership from friends of {}".format(user))
for friend in api.friends_ids(user):
    sn = api.get_user(friend).screen_name
    thought_leaders.append(sn)

    

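for thought_leader in thought_leaders:
    #print("importing thought leadership from {}".format(thought_leader))
    # The API returns at most 200 tweets per user_timeline call.
    tweets = api.user_timeline(screen_name=thought_leader, count=200)
    for tweet in tweets:
        # tweet.retweeted only says whether *we* retweeted it;
        # actual retweets carry a retweeted_status attribute.
        if not hasattr(tweet, 'retweeted_status'):
            THECORPUSOFMYMIND.append(tweet.text)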

importing thought leadership from friends of infosec_truths


Then I want to remove not-thought-leading tweets, like retweets, @ mentions, and stuff with useful links in them. I want just the pure, text-only, stream-of-consciousness thought leadership that makes infosec twitter great.

In [41]:
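# Drop every tweet that mentions someone (build a new list instead of
# calling remove() on the list being looped over).
THECORPUSOFMYMIND = [tweet for tweet in THECORPUSOFMYMIND if "@" not in tweet]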
    
http_regex = r"http.+?(?=[ \"\']|$|\n)"
at_regex = r"@.+?(?= |$)"
RT_regex = r"^RT.+?(?=[^ ])"


for idx, tweet in enumerate(THECORPUSOFMYMIND):
    tweet = re.sub(http_regex, '', tweet)
    tweet = re.sub(at_regex, '', tweet)
    tweet = re.sub(RT_regex, '', tweet)
    THECORPUSOFMYMIND[idx] = tweet
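
Filtering a list by calling `remove()` while looping over it is a classic pitfall: each removal shifts the remaining items left, so the item that slides into the freed slot is never visited and a single pass can leave `@` tweets behind. A small sketch with made-up tweets (the `sample` list and variable names are hypothetical, not part of the collected data):

```python
# Hypothetical sample data, only for illustration.
sample = ["@a hello", "@b hi", "plain thought", "@c hey", "@d yo"]

# Removing while iterating: the loop index keeps advancing even though
# the list just got shorter, so some items are skipped.
in_place = list(sample)
for tweet in in_place:
    if "@" in tweet:
        in_place.remove(tweet)

# Building a new list visits every item exactly once.
filtered = [tweet for tweet in sample if "@" not in tweet]

print(in_place)
print(filtered)
```

The in-place version still contains some `@` tweets after one pass, while the list comprehension keeps only the plain one.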

In [42]:
def clean_tweets(tweetlist):
    """Strip links, @ mentions, and leading RT markers from each tweet, in place."""
    for idx, tweet in enumerate(tweetlist):
        tweet = re.sub(http_regex, '', tweet)
        tweet = re.sub(at_regex, '', tweet)
        tweet = re.sub(RT_regex, '', tweet)
        tweetlist[idx] = tweet
    return tweetlist
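
To see what the three patterns actually strip, here is a small self-contained sketch that applies them to two made-up tweets. The regexes are copied from the cell above; the helper `clean_tweet` and the `samples` list are hypothetical names introduced only for this example.

```python
import re

# Same patterns as in the notebook.
http_regex = r"http.+?(?=[ \"\']|$|\n)"
at_regex = r"@.+?(?= |$)"
RT_regex = r"^RT.+?(?=[^ ])"

def clean_tweet(tweet):
    """Strip links, @ mentions, and a leading RT marker from one tweet."""
    for pattern in (http_regex, at_regex, RT_regex):
        tweet = re.sub(pattern, '', tweet)
    return tweet

# Hypothetical tweets, made up for this example.
samples = [
    "RT @bob: patch your stuff https://t.co/abc",
    "Thought leadership is a mindset",
]
cleaned = [clean_tweet(t) for t in samples]
print(cleaned)
```

The substitutions leave stray whitespace where a link or mention used to be, which matches the leading and trailing spaces visible in the corpus preview; a `.strip()` on each tweet would tidy that up if it matters.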

Take a sneak peek just to make sure we're getting what we want:

In [43]:
THECORPUSOFMYMIND[:10]

['Important point made: Your personal email account is the key to everything else. ',
 '"Our product will fix the problem that is thought to have caused X beach"  is a misleading sales pitch, and also re… ',
 'You should not be doing anything sexy until you can tell me that AV is installed on each endpoint and have mechanis… ',
 'somewhere in China a forensic analyst just found out that he wasted days and nights analyzing old systems ',
 '  True, having the data to search through is a great thing.',
 '#braggingrights ',
 'Once a transient city, over the last few years people have fallen in love with it -  buying homes and investing in… ',
 ' Later they would find a way to get our military technology without putting lives at risk...',
 'I want to know the collective time wasted by a single milkshake order at Potbelly during the lunch rush.',
 'This is going to exacerbate my issues with session hoarding. ']

Then I join the corpus into a single document that I can use to build the Markov model.

In [44]:
FINALCORPUS = " ".join(THECORPUSOFMYMIND)

In [45]:
FINALCORPUS = FINALCORPUS.replace('\r', '').replace('\n', '')

In [46]:
ASCIICORPUS = str(FINALCORPUS.encode('ascii',errors='ignore'))
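
One detail worth knowing about this step: calling `str()` on a `bytes` object returns its printable representation, which is why the corpus preview starts with `b'` and contains backslash-escaped quotes. A minimal sketch of the alternative, decoding the bytes back to text (the `sample` string is made up for illustration):

```python
# Hypothetical text containing non-ASCII punctuation and an apostrophe.
sample = "Don\u2019t panic, it's fine \u2013 really"

# Drop every character that has no ASCII encoding.
ascii_bytes = sample.encode('ascii', errors='ignore')

# str() gives the repr of the bytes object, including the b prefix and quotes.
as_repr = str(ascii_bytes)

# decode() gives plain text with the non-ASCII characters removed.
as_text = ascii_bytes.decode('ascii')

print(as_repr)
print(as_text)
```

Decoding keeps the corpus free of the wrapper and escape characters, so they cannot leak into generated sentences.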

In [47]:
ASCIICORPUS[:1000]

'b\'Important point made: Your personal email account is the key to everything else.  "Our product will fix the problem that is thought to have caused X beach"  is a misleading sales pitch, and also re  You should not be doing anything sexy until you can tell me that AV is installed on each endpoint and have mechanis  somewhere in China a forensic analyst just found out that he wasted days and nights analyzing old systems    True, having the data to search through is a great thing. #braggingrights  Once a transient city, over the last few years people have fallen in love with it -  buying homes and investing in   Later they would find a way to get our military technology without putting lives at risk... I want to know the collective time wasted by a single milkshake order at Potbelly during the lunch rush. This is going to exacerbate my issues with session hoarding.  And couldn\\\'t manage to steal a better metal than bronze  Couple is watching videos on phone without headphones. Fligh

In [48]:
# Write the corpus to disk. Mode "w" overwrites any existing corpus.txt; use "a" to append instead.
with open("corpus.txt", "w") as text_file:
    text_file.write(ASCIICORPUS)

For the Markov model, I use 'markovify', which I found via this @ChrisAlbon page: https://chrisalbon.com/python/other/generate_tweets_using_markov_chain/

Sidenote: if you don't use Chris Albon's pages to learn Python/data science, you should.

Now the fun part! Let's thought lead!

In [53]:
import markovify

# Get raw text as string.
with open("corpus.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

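# Print five randomly generated sentences, each capped at 280 characters
# (make_short_sentence returns None if it cannot build one that fits).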
for i in range(5):
    print(text_model.make_short_sentence(280))
    print()

Sound only coming out of prison to find evil, always apply them against older datasets.

There should be horrified Dont make important mathematical decisions for your story of how If you\'re in US jurisdiction.

Defcon is for everyone, but also praise good web UI... and hmm this is called PowerShell Direct and it\'s what allows you to alert on critical events, search as needed.

Unnamed pass through most likely used by Windows, boy does this page have you covered: CSS is the most promising trends in information security is the 5th floor for #DayofShecurity.

A pipeline for telemetry that allows you to check that i was not a security vendor #infosec #Uber I wonder if hackers laugh or enjoy how we refer to them...



## Next Steps:

As you can see, the Markov approach _kind of_ works... but makes a lot of nonsense. That's partly because the Markov model picks each next word based only on the last few words (two, by markovify's default), not on the whole sentence or thought. To build a model more aware of the structure of language, other folks have used neural networks, perhaps the most hilarious/infamous being @deepdrumpf: https://twitter.com/deepdrumpf

News here: 

https://www.forbes.com/sites/janetwburns/2016/10/19/deepdrumpf-is-an-uncanny-twitterbot-thats-fundraising-for-girls-in-stem/#1f078fa649da

https://www.theguardian.com/technology/2016/mar/04/donald-trump-deep-drumpf-twitter-bot

This seems like the logical next step for infosec_truths. There are a bunch of tutorials online as well:

https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/

https://dzone.com/articles/generating-tweets-using-a-recurrent-neural-net-tor
