In order to authorise our app to access Twitter on our behalf, we need to use the OAuth interface:

In [1]:
import tweepy
from tweepy import OAuthHandler
 
consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'
 
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
 
api = tweepy.API(auth)

### Environment Testing

In [None]:
<name_variable> = api.home_timeline()
for status in <name_variable>:
    print(status.text)

For example, we can read our own timeline (i.e. our Twitter homepage) with:

In [None]:
for status in tweepy.xxxx(api.home_xxxx).items(10):
    # Process a single status
    print(status.text)

Tweepy provides the convenient Cursor interface to iterate through different types of objects. In the example above we’re using 10 to limit the number of tweets we’re reading, but we can of course access more. The status variable is an instance of the Status() class, a nice wrapper to access the data. The JSON response from the Twitter API is available in the attribute _json (with a leading underscore), which is not the raw JSON string, but a dictionary.

So the code above can be re-written to process/store the JSON:

In [None]:
for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    print(status.xxxx)
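
Since `_json` is already a dictionary, storing a status comes down to serialising that dictionary. Below is a minimal sketch, assuming we want one JSON document per line. The helper name `store_statuses` and the file name are hypothetical, and `SimpleNamespace` objects stand in for real `Status` instances so that the snippet runs without Twitter credentials.

```python
import json
import os
import tempfile
from types import SimpleNamespace


def store_statuses(statuses, path):
    """Append each status' `_json` dictionary to `path`, one JSON document per line."""
    written = 0
    with open(path, 'a', encoding='utf-8') as f:
        for status in statuses:
            f.write(json.dumps(status._json) + '\n')
            written += 1
    return written


# Stand-ins for tweepy Status objects (illustrative data, not real tweets)
fake_statuses = [
    SimpleNamespace(_json={'id': 1, 'text': 'first tweet'}),
    SimpleNamespace(_json={'id': 2, 'text': 'second tweet'}),
]

demo_path = os.path.join(tempfile.mkdtemp(), 'timeline.jsonl')
n_written = store_statuses(fake_statuses, demo_path)

# Read the file back to check that every line is a valid JSON document
with open(demo_path, encoding='utf-8') as f:
    stored = [json.loads(line) for line in f]
print(n_written, stored[0]['text'])
```

With real data, the `statuses` argument would be the iterator returned by `tweepy.Cursor(api.home_timeline).items(10)`.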

What if we want a list of all our followers? Here you go:

In [None]:
for friend in tweepy.Cursor(api.xxxx).items():
    print(friend._json)

And how about a list of all our tweets? Simple:

In [None]:
for tweet in tweepy.Cursor(api.user_xxxx).items():
    print(tweet._json)

To collect 2000 tweets from @TelUniversity and save the text, retweet count, and like count of each one in a list, we can proceed as follows.

In [None]:
telu_cursor = tweepy.Cursor(api.user_timeline, screen_name = 'TelU twitter account')
telu_tweets = [(tweet.text, tweet.retweet_count, tweet.favorite_count) \
                 for tweet in telu_cursor.items(2000)]

In [None]:
telu_tweets[0]  # index
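
Once the tweets are stored as `(text, retweet_count, favorite_count)` tuples, simple questions can be answered with plain Python. The sketch below uses made-up tuples in the same layout (not real @TelUniversity data), and the helper `top_by` is a hypothetical name introduced for illustration.

```python
# Illustrative (text, retweet_count, favorite_count) tuples, not real data
sample_tweets = [
    ('Welcome, new students!', 12, 40),
    ('Campus closed on Friday', 30, 25),
    ('Graduation photos are up', 7, 90),
]


def top_by(tweets, field_index, n=1):
    """Return the n tuples with the largest value at position field_index."""
    return sorted(tweets, key=lambda t: t[field_index], reverse=True)[:n]


most_retweeted = top_by(sample_tweets, 1)[0]   # index 1 holds the retweet count
most_liked = top_by(sample_tweets, 2)[0]       # index 2 holds the like count
total_retweets = sum(t[1] for t in sample_tweets)
print(most_retweeted[0], '|', most_liked[0], '|', total_retweets)
```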

Showing tweets from the account @<your account>:

In [None]:
variable = api.user_timeline(screen_name='<your twitter account>')
for status in variable:
    print(status.text)  # use the loop variable; `tweet` is not defined here

So far, we have seen how we can use Tweepy to collect actual Twitter data. Now let’s be curious and visualize it using Matplotlib.

### Minimum Data Visualization

To start, let us plot the username frequency of the tweets in our home timeline. The username frequency is the number of tweets from a specific username that appear in our data. The data consists of 200 tweets from the home timeline. Below is the code sample, followed by the resulting bar plot.

In [None]:
import xxxxx as xxxx
home_cursor = tweepy.Cursor(api.home_timeline)
tweets = [i.author.screen_name for i in home_cursor.items(200)]
unq = set(tweets)
freq = {uname: tweets.count(uname) for uname in unq}
plt.bar(range(len(unq)), freq.values())
# plt.show()

As you can see, the plot shows that xxx different Twitter accounts tweeted in the timeline, and that a single account tweeted the most (xxx tweets).
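
The dictionary comprehension above calls `tweets.count(uname)` once per username, which rescans the whole list each time. As an alternative sketch, `collections.Counter` produces the same counts in a single pass and can return them already sorted. The sample names below are made up, and `username_frequency` is a hypothetical helper name.

```python
from collections import Counter


def username_frequency(screen_names):
    """Return (screen_name, count) pairs, most frequent first."""
    return Counter(screen_names).most_common()


sample_names = ['alice', 'bob', 'alice', 'carol', 'alice', 'bob']
freq_pairs = username_frequency(sample_names)
labels = [name for name, _ in freq_pairs]
counts = [count for _, count in freq_pairs]
print(freq_pairs)

# With Matplotlib imported as plt, the bars could then be labelled with:
#   plt.bar(range(len(counts)), counts, tick_label=labels)
```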

### Case 1. Objective: accounts comparison, Model: modified horizontal bar chart

In this case, we will use a dataset of 192 tweets that I collected from my home timeline a few months ago. The data will be transformed so that we can group the tweets by account, and then we will compare the numbers through a data visualization. Note that we aim for a good data visualization that combines various Matplotlib features, rather than restricting ourselves to the plot types it provides out of the box.
The modules that we will use in Python are:

In [None]:
import numpy
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import operator
import itertools

The code for data representation is shown below. It groups the tweets by username, then records the number of tweets (the username frequency) and the number of followers for each username. It also sorts the data by username frequency.

### Download testfile.npy from https://github.com/rizaldp/BigDataDTS

In [None]:
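# The file stores pickled Python objects, so recent NumPy versions need allow_pickle=True
home_tweets = list(numpy.load('testfile.npy', allow_pickle=True))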
unicity_key = operator.attrgetter('author.screen_name')
tweets = sorted(home_tweets, key=unicity_key)
authors_tweets = {}
followers_count = {}
sample_count = {}
        
for screen_name, author_tweets in itertools.groupby(tweets, key=unicity_key):
    author_tweets = list(author_tweets)
    author = author_tweets[0].author
    authors_tweets[screen_name] = author
    followers_count[screen_name] = author.followers_count
    sample_count[screen_name] = len(author_tweets)
maxfolls = max(followers_count.values())
sorted_screen_name = sorted(sample_count.keys(), key = lambda x: sample_count[x])
sorted_folls = [followers_count[i] for i in sorted_screen_name]
sorted_count = [sample_count[i] for i in sorted_screen_name]
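
The cell above sorts the tweets before grouping because `itertools.groupby` only merges consecutive items that share a key. The sketch below illustrates the difference on stand-in objects; `make_tweet` is a hypothetical helper, and the accounts and follower counts are made up.

```python
import itertools
import operator
from types import SimpleNamespace


def make_tweet(screen_name, followers):
    """Build a minimal stand-in for a tweepy Status with an author."""
    author = SimpleNamespace(screen_name=screen_name, followers_count=followers)
    return SimpleNamespace(author=author)


demo_tweets = [make_tweet('bob', 50), make_tweet('alice', 900), make_tweet('bob', 50)]
by_name = operator.attrgetter('author.screen_name')

# Without sorting, 'bob' shows up as two separate groups
unsorted_groups = [(name, len(list(group)))
                   for name, group in itertools.groupby(demo_tweets, key=by_name)]

# After sorting by the same key, each account forms exactly one group
sorted_groups = [(name, len(list(group)))
                 for name, group in itertools.groupby(sorted(demo_tweets, key=by_name),
                                                      key=by_name)]
print(unsorted_groups)
print(sorted_groups)
```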

Next, we could easily draw the horizontal bar plot with `plt.barh(range(len(sorted_count)), sorted_count)` and then set the titles for the figure and the x-y axes, but let us improvise. We will add another visual dimension through color transparency: the transparency of each bar represents the number of followers of the account. We will also change the tick labels on the y axis from numbers to username strings. Here is the code:

In [None]:
colors = [(0.1, 0.1, 1, i/maxfolls) for i in sorted_folls]
fig, ax = plt.subplots(1,1)
y_pos = range(len(colors))
ax.set_yticks(y_pos)
ax.set_yticklabels(sorted_screen_name, font_properties = helv_8)
ax.barh(y_pos, sorted_count, \
        height = 0.7, left = 0.5, \
        color = colors)
for i in y_pos:
    ax.text(sorted_count[i]+1, i, style_the_num(str(sorted_folls[i])), \
            font_properties = helv_10, color = colors[i])
ax.set_ylim(-1, y_pos[-1]+1+0.7)
ax.set_xlim(0, max(sorted_count) + 8)
ax.set_xlabel('Tweets count', font_properties = helv_14)
plt.tight_layout(pad=3)
fig.show()
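
The plotting cell relies on a few names that are not defined in this section: `helv_8`, `helv_10`, `helv_14` and `style_the_num`. The sketch below is only a hypothetical stand-in, assuming that `style_the_num` formats a follower count with thousands separators and that the `helv_*` objects are Matplotlib font properties of the given sizes. `follower_colors` is an illustrative name that repeats the alpha computation from the `colors` line.

```python
def style_the_num(num_str):
    """Hypothetical stand-in: add thousands separators to a numeric string."""
    return '{:,}'.format(int(num_str))


def follower_colors(followers, rgb=(0.1, 0.1, 1)):
    """Return RGBA tuples whose alpha is each count relative to the maximum."""
    top = max(followers)
    return [rgb + (count / top,) for count in followers]


# The font objects are presumably matplotlib FontProperties instances, e.g.:
#   import matplotlib.font_manager as fm
#   helv_8, helv_10, helv_14 = (fm.FontProperties(size=s) for s in (8, 10, 14))

demo_colors = follower_colors([250, 1000, 500])
print(style_the_num('1234567'), demo_colors[0])
```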

In [None]:
from statistweepy import models
import matplotlib.pyplot as plt
import numpy

# You can get this data from the 'samples' folder.
# allow_pickle=True is needed if the file stores pickled Python objects.
stats = numpy.load('ProSyn.npy', allow_pickle=True)
tweets = models.Tweets(stats)
model = models.Authors(tweets)

fig, ax = plt.subplots(1, 1)
model.hbar_plot(ax, meas = 'total_tweets', \
                incolor_meas = 'total_tweets', \
                text_sizes = [10, 15], \
                freq_lim = (10000, 1000000)) 
fig.show()

### Streaming

In case we want to “keep the connection open” and gather all the upcoming tweets about a particular event, the streaming API is what we need. We need to extend StreamListener() to customise the way we process the incoming data. A working example (written for Tweepy 3.x, where StreamListener is a separate class) that gathers all the new tweets containing a given hashtag:

In [None]:
from tweepy import Stream
from tweepy.streaming import StreamListener
 
class MyListener(StreamListener):
 
    def on_data(self, data):
        try:
            with open('xxxxx.json', 'a') as f:
                f.write(data)
                return True
        except Exception as e:  # Exception, so that Ctrl+C can still stop the stream
            print("Error on_data: %s" % str(e))
        return True
 
    def on_error(self, status):
        print(status)
        return True
 
twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=['#hashtag'])

Assuming that you have collected a number of tweets and stored them as JSON with the streaming example above, let’s have a look at the structure of a tweet:

In [None]:
import json
 
with open('xxxxx.json', 'r') as f:
    line = f.readline() # read only the first tweet/line
    tweet = json.loads(line) # load it as Python dict
    print(json.dumps(tweet, indent=4)) # pretty-print
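
To go beyond the first line, the whole file can be read lazily, one tweet at a time. This is a minimal sketch, assuming the file holds one JSON document per line, possibly with blank lines in between. `load_tweets` is a hypothetical helper, and the demo file is written on the spot with made-up content so that the snippet runs on its own.

```python
import json
import os
import tempfile


def load_tweets(path):
    """Yield one dict per non-empty line of a JSON-lines file."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines between documents
            yield json.loads(line)


# Write a small demo file (illustrative content, not real tweets)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.json')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('{"id": 1, "text": "hello #python"}\n\n{"id": 2, "text": "second"}\n')

loaded = list(load_tweets(demo_path))
print(len(loaded), loaded[0]['text'])
```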

### How to Tokenise a Tweet Text
Let’s see an example that uses regular expressions to tokenise a fictitious tweet:

In [None]:
import re
 
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
 
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
 
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
    
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)
 
def tokenize(s):
    return tokens_re.findall(s)
 
def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens
 
tweet = 'RT @marcobonzanini: just an example! :D http://example.com #NLP'
print(preprocess(tweet))
# ['RT', '@marcobonzanini', ':', 'just', 'an', 'example', '!', ':D', 'http://example.com', '#NLP']

As you can see, @-mentions, emoticons, URLs and #hash-tags are now preserved as individual tokens.

If we want to process all our tweets, previously saved on file:

In [None]:
import operator 
import json
from collections import Counter
 
fname = 'xxxxx.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))

Given the nature of our data and our tokenisation, we should also be careful with all the punctuation marks and with terms like RT (used for re-tweets) and via (used to mention the original author of an article or a re-tweet), which are not in the default stop-word list.

In [None]:
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')
import string
 
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']


We can now substitute the variable terms_all in the first example with something like:

In [None]:
terms_stop = [term for term in preprocess(tweet['text']) if term not in stop]
print(terms_stop)
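
One detail to watch: the `stop` list holds lowercase entries such as 'rt' and 'via', while `preprocess()` keeps the original case unless `lowercase=True` is passed, so a token like 'RT' would slip through. A minimal sketch of a case-insensitive filter follows. `remove_stop` is a hypothetical helper, and `demo_stop` is a small hand-written set used here in place of the NLTK list so that the snippet runs without downloads.

```python
import string

# Small illustrative stop set; the real one would come from NLTK as shown above
demo_stop = set(string.punctuation) | {'rt', 'via', 'just', 'an'}


def remove_stop(tokens, stop):
    """Drop tokens whose lowercase form is in `stop`, keeping the original case."""
    return [token for token in tokens if token.lower() not in stop]


tokens = ['RT', '@marcobonzanini', ':', 'just', 'an', 'example', '!',
          ':D', 'http://example.com', '#NLP']
filtered = remove_stop(tokens, demo_stop)
print(filtered)
```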

Besides stop-word removal, we can further customise the list of terms/tokens we are interested in. Here you have some examples that you can embed in the first fragment of code:

In [None]:
# Count terms only once, equivalent to Document Frequency
terms_single = set(terms_all)
print("Terms single: ", terms_single)

# Count hashtags only
terms_hash = [term for term in preprocess(tweet['text']) 
              if term.startswith('#')]
print("\nTerms hashtag: ",terms_hash)

# Count terms only (no hashtags, no mentions)
terms_only = [term for term in preprocess(tweet['text']) 
              if term not in stop and
              not term.startswith(('#', '@'))] 
              # mind the ((double brackets))
              # startswith() takes a tuple (not a list) if 
              # we pass a list of inputs
print("\nTerms Only: ",terms_only)

While the other frequent terms represent a clear topic, more often than not simple term frequencies don’t give us a deep explanation of what the text is about. To put things in context, let’s consider sequences of two terms (a.k.a. bigrams).

In [None]:
from nltk import bigrams 
 
terms_bigram = list(bigrams(terms_stop))  # bigrams() returns a generator in NLTK 3

The bigrams() function from NLTK takes a list of tokens and yields tuples of adjacent tokens (in NLTK 3 it returns a generator, so wrap it in list() if the result needs to be reused). Notice that we could use terms_all to compute the bigrams, but we would probably end up with a lot of garbage. If we decide to analyse longer n-grams (sequences of n tokens), it could make sense to keep the stop-words, in case we want to capture phrases like “to be or not to be”.
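
To make the idea concrete without NLTK, adjacent pairs can also be built with `zip`. The sketch below is a plain-Python illustration of the same pairing, not NLTK's implementation. `adjacent_pairs` is a hypothetical name and the tokens are made up.

```python
from collections import Counter


def adjacent_pairs(tokens):
    """Return the list of (token, next_token) tuples for a list of tokens."""
    return list(zip(tokens, tokens[1:]))


demo_terms = ['data', 'science', 'is', 'fun', 'data', 'science']
pairs = adjacent_pairs(demo_terms)

# Counting the pairs shows which two-term sequences occur most often
pair_counts = Counter(pairs)
print(pair_counts.most_common(1))
```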

In [None]:
from collections import defaultdict
# remember to include the imports and definitions from the earlier cells (json, preprocess, stop)
 
com = defaultdict(lambda : defaultdict(int))
 
# f is the file pointer to the JSON data set
for line in f: 
    tweet = json.loads(line)
    terms_only = [term for term in preprocess(tweet['text']) 
                  if term not in stop 
                  and not term.startswith(('#', '@'))]
 
    # Build co-occurrence matrix
    for i in range(len(terms_only)-1):            
        for j in range(i+1, len(terms_only)):
            w1, w2 = sorted([terms_only[i], terms_only[j]])                
            if w1 != w2:
                com[w1][w2] += 1
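
Once the co-occurrence matrix is filled, a natural next step is to list the pairs that occur together most often. The sketch below repeats the counting loop on a few made-up token lists and then flattens the nested dictionary. `build_cooccurrence` and `most_common_pairs` are hypothetical helper names.

```python
from collections import defaultdict


def build_cooccurrence(token_lists):
    """Count how often two distinct terms appear in the same token list."""
    com = defaultdict(lambda: defaultdict(int))
    for terms in token_lists:
        for i in range(len(terms) - 1):
            for j in range(i + 1, len(terms)):
                # Sorting the pair stores (a, b) and (b, a) under the same key
                w1, w2 = sorted([terms[i], terms[j]])
                if w1 != w2:
                    com[w1][w2] += 1
    return com


def most_common_pairs(com, n=5):
    """Flatten the nested dict and return the n most frequent term pairs."""
    pairs = [((w1, w2), count)
             for w1, row in com.items()
             for w2, count in row.items()]
    return sorted(pairs, key=lambda item: item[1], reverse=True)[:n]


demo_docs = [['python', 'data', 'pandas'], ['python', 'data'], ['pandas', 'plot']]
demo_com = build_cooccurrence(demo_docs)
top_pairs = most_common_pairs(demo_com, n=1)
print(top_pairs)
```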

### End of session