# 1. Introduction
We are interested in observing discussions on Twitter to identify vulnerabilities and exposures.
We plan to collect tweets that contain particular words of interest. The collected data can then be cleaned and analyzed along the same lines as the Full Disclosure mailing list.

This notebook is inspired by the article [Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits](https://www.umiacs.umd.edu/~tdumitra/papers/USENIX-SECURITY-2015.pdf), where the idea is to use Twitter analytics for the early detection of exploits. The [presentation](http://www.umiacs.umd.edu/~tdumitra/blog/2015/08/02/predicting-vulnerability-exploits/) shares instances in which vulnerabilities were mentioned and discussed on Twitter before they were disclosed; this is the motivation for using Twitter as part of the research.

The researchers in the article started collecting tweets based on a list of 50 words. The list of words is not given in the article, so we start our analysis by collecting tweets for keywords that we identified manually by researching vulnerability-related discussions on Twitter.

This task can be achieved in two ways:
1. Search for historical tweets with specific words of interest using the **search API**
2. Monitor the Twitter feed for specific words of interest using the **streaming API**


In [30]:
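# Read the API credentials from a local file with one NAME=VALUE pair per line
myvars = {}
with open("Twitter_keys.txt") as myfile:
    for line in myfile:
        # partition("=") yields (name, "=", value); [::2] keeps name and value
        name, var = line.partition("=")[::2]
        myvars[name.strip()] = var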

In [31]:
APP_KEY = myvars["APP_KEY"].rstrip()
APP_SECRET = myvars["APP_SECRET"].rstrip()
OAUTH_TOKEN = myvars["OAUTH_TOKEN"].rstrip()
OAUTH_TOKEN_SECRET = myvars["OAUTH_TOKEN_SECRET"].rstrip()
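
The cells above imply that `Twitter_keys.txt` holds one `NAME=VALUE` pair per line. As a minimal sketch under that assumption, the parsing step could be wrapped in a helper that also skips blank lines and comments. The helper name `parse_keys`, the `#` comment convention, and the sample values are hypothetical and not part of the original notebook.

```python
def parse_keys(lines):
    """Parse NAME=VALUE lines into a dict, skipping blanks and '#' comments."""
    keys = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # no '=' on this line, so there is nothing to store
        keys[name.strip()] = value.strip()
    return keys


# Placeholder credentials for illustration only (not real keys)
sample = """
APP_KEY = abc123
APP_SECRET=def456
# a comment line
OAUTH_TOKEN = tok=en
"""
creds = parse_keys(sample.splitlines())
print(sorted(creds))
```

Because `partition` splits on the first `=` only, a value that itself contains `=` is kept intact.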

# 2. Tweet Extraction

We use the [twython](https://github.com/ryanmcgrath/twython) library for the tweet extraction. Twython is an actively maintained, pure Python wrapper for the Twitter API. It supports both normal and streaming Twitter APIs.

The first step is to obtain the keys and tokens required to access the API; the functions in the wrapper can then be called with them.

The scripts below have two configurable parameters:
1. The **query_word** variable needs to be initialized with the keywords we are looking for on the Twitter feed
2. The **max_tweets** variable is the number of tweets we plan to extract for the keywords mentioned above

If query_word is initialized to multiple words, the code retrieves the set of tweets that contain all of the words.
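
To make the "all of the words" behaviour concrete, here is a small local illustration. It is only a simplified sketch: the function `contains_all_words` and the sample tweets are hypothetical, and Twitter's real matching (punctuation, hashtags, URLs) is more involved than a whitespace split.

```python
def contains_all_words(text, query):
    """Return True if every whitespace-separated query word occurs in text."""
    words_in_text = set(text.lower().split())
    return all(word in words_in_text for word in query.lower().split())


# Made-up example tweets
tweets = [
    "Critical security flaw found in popular router",
    "A design flaw in the new phone",
    "Security update released today",
]
query_word = "security flaw"
matching = [t for t in tweets if contains_all_words(t, query_word)]
print(matching)
```

Only the first sample tweet contains both "security" and "flaw", so it is the only one kept.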

## 2.1. Streaming API

The streaming API is used to access tweets as they are posted. It returns approximately 1% of all tweets, i.e. about 60 tweets per second, assuming that at most 6000 tweets are posted every second. There is no rate limit on the streaming API.
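
The estimate above is simple arithmetic. The sketch below reproduces it and extends it to a daily figure; the 6000 tweets per second and the 1% sample are the assumptions stated in the text, not measured values.

```python
tweets_per_second = 6000   # assumed total volume, taken from the text
sample_percent = 1         # approximate share returned by the streaming API

sampled_per_second = tweets_per_second * sample_percent // 100
sampled_per_day = sampled_per_second * 60 * 60 * 24

print(sampled_per_second)  # 60
print(sampled_per_day)     # 5184000
```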

The following links have details on the Twitter Streaming API limits:
1. [Twitter streaming API limits](https://stackoverflow.com/questions/34962677/twitter-streaming-api-limits)
2. [How many percent of the tweets does the Twitter sample API give](https://stackoverflow.com/questions/13055370/how-many-percent-of-the-tweets-does-twitter-sample-api-give)

In [44]:
#import libraries
from twython import TwythonStreamer
from twython import Twython, TwythonError
import time
import sys


#Configurable parameters
query_word="flaw"
max_tweets=2


searched_tweets_strm=[]

class MyStreamer(TwythonStreamer):
    def on_success(self, data):
        # Skip non-tweet messages (e.g. delete or limit notices), which have no 'text'
        if 'text' not in data:
            return
        searched_tweets_strm.append(data['text'])
        if len(searched_tweets_strm) >= max_tweets:
            print("Max tweets extracted")
            # Close the stream cleanly once enough tweets are collected
            self.disconnect()

    def on_error(self, status_code, data):
        print(status_code, data)
        print("Exception raised, waiting 15 minutes")
        time.sleep(15*60)

# Requires Authentication as of Twitter API v1.1
stream = MyStreamer(APP_KEY, APP_SECRET,
                    OAUTH_TOKEN, OAUTH_TOKEN_SECRET)


stream.statuses.filter(track=query_word)

Max tweets extracted

In [None]:
print (searched_tweets_strm)

## 2.2. Search API

The streaming API collects tweets from the current feed, not historical data. For historical data, the search API can be used to access tweets that are up to [three weeks old](https://stackoverflow.com/questions/1662151/getting-historical-data-from-twitter).
Once these historical tweets are collected, newer tweets can be gathered by running the streaming API continuously.

**Limitations:**
To collect tweets for two different words, the Twitter API is queried twice, once per word; this notebook does not collect both sets in a single function call.

**Example:**
Suppose we need to collect tweets containing the word "data" and also tweets containing the word "flaw". The search API is then called twice, once for the word "data" and once for the word "flaw".
1. twitter.search(q="data")
2. twitter.search(q="flaw")
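
The two calls above can be sketched as a loop over the keywords that merges the results and drops duplicates by tweet ID. This is a minimal sketch: `collect_for_keywords` is a hypothetical helper, and `fake_search` with its made-up tweets stands in for `twitter.search` so that the example runs without credentials.

```python
def collect_for_keywords(search_fn, keywords):
    """Call search_fn once per keyword and merge the results by tweet ID."""
    collected = {}
    for keyword in keywords:
        results = search_fn(q=keyword)
        for status in results.get("statuses", []):
            # A tweet matching several keywords is stored only once
            collected.setdefault(status["id"], status)
    return list(collected.values())


def fake_search(q):
    """Stand-in for twitter.search that filters a tiny made-up corpus."""
    corpus = [
        {"id": 1, "text": "new data breach reported"},
        {"id": 2, "text": "flaw found in parser"},
        {"id": 3, "text": "data flaw in export"},
    ]
    return {"statuses": [s for s in corpus if q in s["text"]]}


tweets = collect_for_keywords(fake_search, ["data", "flaw"])
print([t["id"] for t in tweets])
```

With the notebook's client, the same helper could presumably be called as `collect_for_keywords(twitter.search, ["data", "flaw"])`, since `twitter.search` accepts `q` and returns a dictionary with a `"statuses"` list.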

Note:
The search API limit is 180 requests every 15 minutes, so the code sleeps whenever the limit is reached. It waits 16 minutes rather than exactly 15 to leave a small margin before retrying.
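
To get a feel for what this limit means in practice, the sketch below estimates how many requests and 15-minute windows a collection run needs. It assumes 100 tweets per request (the `count` value used in this notebook); `plan_collection` is a hypothetical helper and the estimate ignores failed or partially filled requests.

```python
import math

REQUESTS_PER_WINDOW = 180   # search API limit stated in the note
WINDOW_MINUTES = 15
TWEETS_PER_REQUEST = 100    # assumption: every request returns a full page


def plan_collection(target_tweets):
    """Return (requests, windows, minutes spent waiting) for a target size."""
    requests_needed = math.ceil(target_tweets / TWEETS_PER_REQUEST)
    windows_needed = math.ceil(requests_needed / REQUESTS_PER_WINDOW)
    # The first window starts immediately; each later one costs a wait
    minutes_waiting = max(0, windows_needed - 1) * WINDOW_MINUTES
    return requests_needed, windows_needed, minutes_waiting


print(plan_collection(200))     # (2, 1, 0)
print(plan_collection(50000))   # (500, 3, 30)
```

Under these assumptions, up to 18,000 tweets fit in one window, so the `max_tweets=200` run in this notebook should not hit the limit on its own.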

In [40]:
#import libraries
from twython import Twython, TwythonError
import time
from time import gmtime, strftime


#Configurable parameters
query_word="flaw"
max_tweets=200


tweet_cnt=0
# Requires Authentication as of Twitter API v1.1
twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

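searched_tweets_srch = []
max_id = None  # upper ID bound used to page backwards through older tweets
while len(searched_tweets_srch) < max_tweets:
    remaining_tweets = max_tweets - len(searched_tweets_srch)
    try:
        # Ask for at most 100 tweets per call, and no more than are still needed
        params = {"q": query_word, "count": min(100, remaining_tweets)}
        if max_id is not None:
            params["max_id"] = max_id
        search_results = twitter.search(**params)
        statuses = search_results.get("statuses", [])

        if not statuses:
            print('no tweets found')
            break

        tweet_cnt=tweet_cnt+len(statuses)
        searched_tweets_srch.extend(statuses)
        # Continue just below the oldest tweet seen so results are not repeated
        max_id = min(status["id"] for status in statuses) - 1

    except TwythonError as e:
        print (e)
        print ("exception raised, waiting 16 minutes")
        print (strftime("%H:%M:%S", gmtime()))
        time.sleep(16*60)

print ("Total tweets extracted for "+query_word+": "+str(tweet_cnt))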

Total tweets extracted for flaw: 200


In [None]:
print (searched_tweets_srch)
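
Each entry returned by the search API is a nested dictionary, so printing the whole list is hard to read. As a sketch of a first cleaning step, the helper below keeps only a few fields per tweet. `summarize_status` is a hypothetical helper, the sample status is made up, and the field names (`id`, `created_at`, `user.screen_name`, `text`) are assumed to follow the Twitter API v1.1 tweet object.

```python
def summarize_status(status):
    """Keep a handful of fields from one tweet dictionary."""
    return {
        "id": status.get("id"),
        "created_at": status.get("created_at"),
        "user": status.get("user", {}).get("screen_name"),
        "text": status.get("text", ""),
    }


# Made-up status shaped like a v1.1 tweet object
sample_status = {
    "id": 42,
    "created_at": "Mon Jan 01 00:00:00 +0000 2018",
    "user": {"screen_name": "example_user", "followers_count": 10},
    "text": "Serious flaw reported in example software",
    "retweet_count": 3,
}
rows = [summarize_status(s) for s in [sample_status]]
print(rows[0]["user"], rows[0]["text"])
```

The same list comprehension could be applied to `searched_tweets_srch` to obtain one compact row per tweet for later analysis.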

References:

1. Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits, https://www.umiacs.umd.edu/~tdumitra/papers/USENIX-SECURITY-2015.pdf

2. Presentation on predicting vulnerability exploits, http://www.umiacs.umd.edu/~tdumitra/blog/2015/08/02/predicting-vulnerability-exploits/