# Analysis of Twitter stream data with the IPython Notebook

In this example, we use the IPython notebook to mine data from Twitter with the [Twython library](https://github.com/ryanmcgrath/twython).  Once we have fetched the raw stream for a specific query, we will at first do some basic word frequency analysis on the results using Python's builtin dictionaries, and then we will use the excellent [NetworkX](http://networkx.lanl.gov) library developed at Los Alamos National Laboratory to look at the results as a network and understand some of its properties.  

Using NetworkX, we aim to answer the following questions: for a given query, which words tend to appear together in tweets, and global pattern of relationships between these words emerges from the entire set of results?

Obviously the analysis of text corpora of this kind is a complex topic at the intersection of natural language processing, graph theory and statistics, and here we do not pretend to provide an exhaustive coverage of it.  Rather, we want to show you how with a small amount of easy to write code, it is possible to do a few non-trivial things based on real-time data from the Twitter stream.  Hopefully this will serve as a good starting point;  for further reading you can find in-depth discussions of analysing social network data in Python in the book [Mining the Social Web](http://shop.oreilly.com/product/0636920010203.do).


### Create a [Twitter App](https://apps.twitter.com/app/14491982/show)


## Initialization and libraries

We start by loading the pylab plot support and selecting our figure size to be a bit different than the automatic defaults.

In [5]:
#!conda install networkx -y
!pip install twython

Collecting twython
  Downloading https://files.pythonhosted.org/packages/8c/2b/c0883f05b03a8e87787d56395d6c89515fe7e0bf80abd3778b6bb3a6873f/twython-3.7.0.tar.gz
Collecting requests_oauthlib>=0.4.0 (from twython)
  Downloading https://files.pythonhosted.org/packages/94/e7/c250d122992e1561690d9c0f7856dadb79d61fd4bdd0e598087dce607f6c/requests_oauthlib-1.0.0-py2.py3-none-any.whl
Collecting oauthlib>=0.6.2 (from requests_oauthlib>=0.4.0->twython)
[?25l  Downloading https://files.pythonhosted.org/packages/e6/d1/ddd9cfea3e736399b97ded5c2dd62d1322adef4a72d816f1ed1049d6a179/oauthlib-2.1.0-py2.py3-none-any.whl (121kB)
[K    100% |████████████████████████████████| 122kB 3.4MB/s ta 0:00:01
[?25hBuilding wheels for collected packages: twython
  Running setup.py bdist_wheel for twython ... [?25ldone
[?25h  Stored in directory: /Users/brian/Library/Caches/pip/wheels/c2/b0/a3/5c4b4b87b8c9e4d99f1494a0b471f0134a74e5fb33d426d009
Successfully built twython
Installing collected packages: oauthlib, req

In [6]:
#%pylab inline
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(8, 5))
import networkx as nx
from twython import Twython


Now, we load a local library with some analysis utilities whose code is a bit long to display inline. The python module is called `text_utils.py` and can be [downloaded here](text_utils.py).

In [10]:
import text_utils as tu  # shorthand for convenience
import getpass

Finally, we'll need to use the free [Twython library](https://github.com/ryanmcgrath/twython) to query Twitter's stream:

In [None]:
!pip2 install twython

In [11]:
APP_KEY = getpass.getpass('YOUR_APP_KEY')
APP_SECRET = getpass.getpass('YOUR_APP_SECRET')

YOUR_APP_KEY········
YOUR_APP_SECRET········


In [12]:
twitter = Twython(APP_KEY, APP_SECRET, oauth_version=2)
ACCESS_TOKEN = twitter.obtain_access_token()
twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)


## Query declaration

Here we define which query we want to perform, as well as which words we want to filter out from our analysis because they appear very commonly and we're not interested in them.  

Typically you want to run the query once, and after seeing what comes out, fine-tune the removal list, as which words are 'noise' is fairly query-specific (and also changes over time, depending on what's happening out there on Twitter):

In [13]:
query = "big data"
words_to_remove = """with some your just have from it's /via &amp; that they your there this"""

In [14]:
tweets = twitter.search(q='biomedical informatics'+" lang:en")


In [None]:
help(twitter.search)

## Perform query to Twitter servers

This is the cell that actually fetches data from Twitter.  We limit the output to the first 30 pages of search max (typically Twitter stops returning results before that).

In [15]:
for key, value in tweets.items():
    print(key)
    print(value)

statuses
[{'created_at': 'Thu Jul 12 20:20:06 +0000 2018', 'id': 1017503889204170752, 'id_str': '1017503889204170752', 'text': 'RT @schap9899: Day 4 Summer Bridge for Health Sciences Academy students w/@GWSMHS Holland codes personality type, citation hunt, biomedical…', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'schap9899', 'name': 'Sherri Chapman', 'id': 119956804, 'id_str': '119956804', 'indices': [3, 13]}, {'screen_name': 'GWSMHS', 'name': 'GW SMHS', 'id': 49979511, 'id_str': '49979511', 'indices': [74, 81]}], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 144931880, 'id_str': '144931880', 'name': 'Alexandria City Public Schools

In [16]:
results = twitter.cursor(twitter.search, q='python lang:en')
for result in results:
    print (result)

{'created_at': 'Fri Jul 13 05:30:26 +0000 2018', 'id': 1017642385814310912, 'id_str': '1017642385814310912', 'text': '#Spectrum My #InternetSpeed :\nPing: 28.414 ms\nDownload: 185.77 Mbit/s\nUpload: 11.03 Mbit/s\n#automagic #python #corporateaccountability', 'truncated': False, 'entities': {'hashtags': [{'text': 'Spectrum', 'indices': [0, 9]}, {'text': 'InternetSpeed', 'indices': [13, 27]}, {'text': 'automagic', 'indices': [91, 101]}, {'text': 'python', 'indices': [102, 109]}, {'text': 'corporateaccountability', 'indices': [110, 134]}], 'symbols': [], 'user_mentions': [], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="https://www.mncarpenter.ninja" rel="nofollow">Internet tweeter</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 418307927, 'id_str': '418307927', 'name': 'Mark Carpenter Jr', 'screen_name':

{'created_at': 'Fri Jul 13 05:25:51 +0000 2018', 'id': 1017641234708549632, 'id_str': '1017641234708549632', 'text': 'RT @circl_lu: We just released an IMAP proxy in Python which can be used to sanitize malicious (based on PyCIRCLean library) attachment or…', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'circl_lu', 'name': 'CIRCL', 'id': 184762389, 'id_str': '184762389', 'indices': [3, 12]}], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 189118361, 'id_str': '189118361', 'name': 'Rayna', 'screen_name': 'MaliciaRogue', 'location': 'Here and there', 'description': 'Lady Data Security. Award-winning writer #Crisis/#risk mgment #OSINT #

{'created_at': 'Fri Jul 13 05:23:07 +0000 2018', 'id': 1017640544460296192, 'id_str': '1017640544460296192', 'text': 'RT @yashaslokesh_: Day 14: Made a stopwatch script that supports lapping using the time module. Learned to use exception catching to print…', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'yashaslokesh_', 'name': 'Yashas Lokesh', 'id': 2885151280, 'id_str': '2885151280', 'indices': [3, 17]}], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="https://100daysofcode.com" rel="nofollow">30days30sites</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 842956176958476289, 'id_str': '842956176958476289', 'name': 'CodersNotes', 'screen_name': '_30days30sites', 'location': 'codeanywhere.com', 'description': 'RT bot for #30Days30Sites #100DaysOfCode an

{'created_at': 'Fri Jul 13 05:18:25 +0000 2018', 'id': 1017639363939745792, 'id_str': '1017639363939745792', 'text': 'RT @QuantInsti: Have you ever tested your trading strategy in python using Fibonacci Retracement? https://t.co/tdUSgmIICr', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'QuantInsti', 'name': 'QuantInsti', 'id': 869660137, 'id_str': '869660137', 'indices': [3, 14]}], 'urls': [{'url': 'https://t.co/tdUSgmIICr', 'expanded_url': 'https://www.quantinsti.com/blog/fibonacci-retracement-trading-strategy-python/?utm_campaign=coschedule&utm_source=twitter&utm_medium=QuantInsti', 'display_url': 'quantinsti.com/blog/fibonacci…', 'indices': [98, 121]}]}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_rep

  


In [18]:
user_tweets = twitter.get_user_timeline(screen_name='chapmanbe',
                                        include_rts=True)
for tweet in user_tweets:
    tweet['text'] = Twython.html_for_tweet(tweet)
    print (tweet['text'])

Adela Grando leading a discussion about improving informatics education <a href="https://twitter.com/search?q=%23IEF2018" class="twython-hashtag">#IEF2018</a> <a href="https://twitter.com/search?q=%23AMIA" class="twython-hashtag">#AMIA</a> <a href="https://t.co/MuMwF7Xfqz" class="twython-media">pic.twitter.com/MuMwF7Xfqz</a>
<a href="https://t.co/U48uJyreLj" class="twython-url">goo.gl/forms/50RWHDGc…</a> <a href="https://twitter.com/search?q=%23IEF2018" class="twython-hashtag">#IEF2018</a> <a href="https://twitter.com/search?q=%23Jupyter" class="twython-hashtag">#Jupyter</a>
I'd like to be conference adept too. 
 <a href="https://t.co/iCFdVMK7sF" class="twython-url">sinews.siam.org/Default.aspx?t…</a> <a href="https://twitter.com/search?q=%23SIAM" class="twython-hashtag">#SIAM</a> <a href="https://twitter.com/search?q=%23SiNews" class="twython-hashtag">#SiNews</a>
This is very cool: an index of science comics and animations <a href="https://t.co/kmOPhsmMJu" class="twython-url">cartoons

## Text statistics

Let's see what the first 10 tweets look like:

In [21]:
type(tweets)
list(tweets.items())[:10]

[('statuses',
  [{'created_at': 'Thu Jul 12 20:20:06 +0000 2018',
    'id': 1017503889204170752,
    'id_str': '1017503889204170752',
    'text': 'RT @schap9899: Day 4 Summer Bridge for Health Sciences Academy students w/@GWSMHS Holland codes personality type, citation hunt, biomedical…',
    'truncated': False,
    'entities': {'hashtags': [],
     'symbols': [],
     'user_mentions': [{'screen_name': 'schap9899',
       'name': 'Sherri Chapman',
       'id': 119956804,
       'id_str': '119956804',
       'indices': [3, 13]},
      {'screen_name': 'GWSMHS',
       'name': 'GW SMHS',
       'id': 49979511,
       'id_str': '49979511',
       'indices': [74, 81]}],
     'urls': []},
    'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
    'source': '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
    'in_reply_to_status_id': None,
    'in_reply_to_status_id_str': None,
    'in_reply_to_user_id': None,
    'in_reply_to_user_id_str':

Now we do some cleanup of the common words above, so that we can then compute some basic statistics:

In [22]:
remove = tu.removal_set(words_to_remove, query)
lines = tu.lines_cleanup([tweet['text'].encode('utf-8') for tweet in results], remove=remove)
words = '\n'.join(lines).split()

Compute frequency histogram:

In [24]:
import importlib
importlib.reload(tu)

<module 'text_utils' from '/Users/brian/Code/decart_advanced_python_2018/module8-networks/text_utils.py'>

In [28]:
wf = tu.word_freq(words)
sorted_wf = tu.sort_freqs(wf)
words

[]

Let's look at a summary of the word frequencies from this dataset:

In [26]:
tu.summarize_freq_hist(sorted_wf)

Number of unique words: 0

10 least frequent words:


ValueError: max() arg is an empty sequence

Now we can plot the histogram of the `n_words` most frequent words:

In [None]:
n_words = 10
tu.plot_word_histogram(sorted_wf, n_words,"Frequencies for %s most frequent words" % n_words);

Above we trimmed the historgram to only show `n_words` because the distribution is very sharply peaked; this is what the histogram for the whole word list looks like:

In [None]:
tu.plot_word_histogram(sorted_wf, 1.0, "Frequencies for entire word list");

## Co-occurrence graph

An interesting question to ask is: which pairs of words co-occur in the same tweets?  We can find these relations and use them to construct a graph, which we can then analyze with NetworkX and plot with Matplotlib.

We limit the graph to have at most `n_nodes` (for the most frequent words) just to keep the visualization easier to read.

In [None]:
n_nodes = 10
popular = sorted_wf[-n_nodes:]
pop_words = [wc[0] for wc in popular]
co_occur = tu.co_occurrences(lines, pop_words)
wgraph = tu.co_occurrences_graph(popular, co_occur, cutoff=1)
wgraph = nx.connected_component_subgraphs(wgraph)[0]

An interesting summary of the graph structure can be obtained by ranking nodes based on a centrality measure.  NetworkX offers several centrality measures, in this case we look at the [Eigenvector Centrality](http://networkx.lanl.gov/reference/generated/networkx.algorithms.centrality.eigenvector_centrality.html#networkx.algorithms.centrality.eigenvector_centrality):

In [None]:
centrality = nx.eigenvector_centrality_numpy(wgraph)
tu.summarize_centrality(centrality)

And we can use this measure to provide an interesting view of the structure of our query dataset:

In [None]:
print "Graph visualization for query:", query
tu.plot_graph(wgraph, tu.centrality_layout(wgraph, centrality), plt.figure(figsize=(8,8)),
    title='Centrality and term co-occurrence graph, q="%s"' % query)

# References

