# Overview

An overview of the data collected during our monitoring period. 

---

In this notebook we aim to investigate the distribution of tweets, retweets, replies, hashtags and links within our dataset. By doing this we hope to gain some understanding of the data we have collected. It will also allow us to determine how active the two distinct groups of users, journalists and news organisations, are on social media.

In [None]:
import collections
import json
import matplotlib.pyplot as plt
import pandas as pd
import sys
import os
import requests
import bs4
import re

# add penemue to path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
from utils import twiterate
from utils import Collect

%matplotlib inline

We begin by loading the user profiles, from the Twitter API, of each user contained within the original set of Twitter lists that we provided to Penemue when we began our data collection. We then extract only the `id_str` of each user, classifying them into their two distinct groups using the appropriately named variables `journalists` and `organisations`.

To do this we make use of the `Collect` class within the Penemue `utils`. By passing it a list of strings containing the URLs of the Twitter lists, we can extract the user profiles through rate-limited calls to the Twitter API. Once the profiles have been collected, we retrieve them using the `members` property of the class. As you can see below, our Twitter list URLs are stored in a JSON file, so we must first open the appropriate file and then pass its contents to the `Collect` class.

In [None]:
# read the list URLs, closing the file once it has been parsed
with open('../data/journalists.json') as f:
    j = json.load(f)
journalists = [user['id_str'] for user in Collect(lists=j).members]

In [None]:
# read the list URLs, closing the file once it has been parsed
with open('../data/organisations.json') as f:
    o = json.load(f)
organisations = [user['id_str'] for user in Collect(lists=o).members]

Now that we have established our users of interest we can begin extracting some figures from the data. To present these figures we'll define two functions: one draws a pie chart of how the activity is split between the two groups, and the other draws a bar chart of the top 10 occurrences in the data.
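
The two summaries behind these charts can be computed without any plotting. The following is a minimal sketch rather than part of the notebook: the helper name `shares` and the sample screen names are made up for illustration, and the guard against an empty input is an addition that the plotting functions do not have.

```python
from collections import Counter


def shares(joi, ooi):
    """Return each group's percentage share of the combined items."""
    total = len(joi) + len(ooi)
    if total == 0:
        return 0.0, 0.0  # assumption: report zero shares for empty input
    return 100 * len(joi) / total, 100 * len(ooi) / total


# Made-up screen names standing in for the values returned by a callback.
joi = ["alice", "alice", "bob"]
ooi = ["newsdesk"]

print(shares(joi, ooi))

# The bar chart is built from the most frequent names across both groups.
top = Counter(joi + ooi).most_common(1)
print(top)
```

Here three of the four items belong to journalists, so the shares are 75% and 25%, and `alice` is the most frequent name with two occurrences.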

In [None]:
def pie(joi, ooi):
    # calculate each group's percentage share of the total
    joi_size = len(joi)
    ooi_size = len(ooi)
    total = joi_size + ooi_size

    joi_mean = (joi_size / total) * 100
    ooi_mean = (ooi_size / total) * 100

    # data to plot
    sizes = [joi_mean, ooi_mean]
    labels = "Journalists", "Organisations"
    colors = ['lightskyblue', 'lightcoral']
    
    # plot
    plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', 
        shadow=True, startangle=90)

    plt.axis('equal')
    plt.show()

In [None]:
def bar(joi, ooi, label):
    # create list of users
    id_strs = ["@%s" % screen_name for screen_name in joi]
    id_strs += ["@%s" % screen_name for screen_name in ooi]

    # count occurrences
    counter = collections.Counter(id_strs)
    most_common = counter.most_common(10)

    # data to plot
    labels, y = zip(*most_common)
    x = range(len(labels))
    
    # plot
    plt.bar(x, y, alpha=0.5)
    plt.xticks(x, labels, rotation=90)
    plt.ylabel(label)
    plt.show()

We begin by establishing the number of _original tweets_ authored by either a journalist of interest (`joi`) or a news organisation of interest (`ooi`) during the period of data collection. An original tweet is one that has been written and published by a user; that is, it is neither a reply nor a retweet.

In order to do this we make use of the `twiterate` function found in the Penemue `utils`. This function allows us to iterate through a JSON file of any size containing any number of [Tweet objects](https://dev.twitter.com/overview/api/tweets) by making use of a callback function. This callback accepts a single `tweet object` as a parameter and should return the attribute(s) of the tweet that we require. The reason for using such an approach is to avoid loading the full list of tweets into memory, which may lead to an out of memory exception as the number of tweets grows.
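
The callback pattern can be illustrated with a minimal sketch. This is not the Penemue implementation: the name `iterate_tweets` is hypothetical, the sketch assumes the tweets are stored one JSON object per line (which may not match the real file layout), and it assumes that tweets for which the callback returns `None` are simply skipped.

```python
import io
import json


def iterate_tweets(handle, callback):
    """Apply `callback` to each tweet in a line-delimited JSON stream.

    Hypothetical stand-in, not Penemue's `twiterate`. Only one parsed tweet
    is held in memory at a time; just the extracted values are kept.
    """
    results = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        value = callback(json.loads(line))
        if value is not None:  # assumption: non-matching tweets are dropped
            results.append(value)
    return results


# Two made-up tweets standing in for a file on disk.
sample = io.StringIO(
    '{"user": {"screen_name": "alice"}, "text": "hello"}\n'
    '{"user": {"screen_name": "bob"}, "text": "world"}\n'
)
names = iterate_tweets(sample, lambda tweet: tweet["user"]["screen_name"])
print(names)
```

Because the file is read line by line, memory use depends on the size of the extracted values rather than on the size of the file.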

To determine whether a tweet is an _original tweet_ we must first check to see whether the tweet is a reply to another tweet and then check if the tweet is a retweet. As you can see in the callback function defined below, we do this using the `in_reply_to_status_id_str` attribute and the `retweeted_status` attribute of the tweet.

In [None]:
def get_original(tweet, search):
    if (tweet["user"]["id_str"] in search 
        and tweet["in_reply_to_status_id_str"] is None 
        and "retweeted_status" not in tweet):
            return tweet["user"]["screen_name"]

joi_originals = twiterate(lambda tweet : get_original(tweet, journalists))
ooi_originals = twiterate(lambda tweet : get_original(tweet, organisations))

In [None]:
pie(joi_originals, ooi_originals)

In [None]:
bar(joi_originals, ooi_originals, label="Original Tweets")

Next we will establish the number of retweets created by our users of interest throughout the period of data collection. To do this we define a new callback function that will this time only look at the `retweeted_status` of the tweet. According to the [Twitter documentation](https://dev.twitter.com/overview/api/tweets) a retweet can be identified by the presence of the `retweeted_status` attribute.

In [None]:
def get_retweet(tweet, search):
    if (tweet["user"]["id_str"] in search 
        and "retweeted_status" in tweet):
            return tweet["user"]["screen_name"]

joi_retweets = twiterate(lambda tweet : get_retweet(tweet, journalists))
ooi_retweets = twiterate(lambda tweet : get_retweet(tweet, organisations))

In [None]:
pie(joi_retweets, ooi_retweets)

In [None]:
bar(joi_retweets, ooi_retweets, label="Retweets")

Similar to the process above, we will now determine the number of direct replies to other tweets that were created during the period of data collection. To do this we will define a callback that examines the `in_reply_to_status_id_str` attribute of a tweet. If the attribute is not `None` then we can classify the tweet as a reply.
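
Taken together with the earlier definitions, the three categories can be expressed as a single classifier. The following is a minimal sketch, assuming the same two attributes are sufficient; `classify` is a hypothetical helper and not part of Penemue. It checks `retweeted_status` first, so a tweet carrying both attributes is labelled a retweet. That ordering is a choice made for this sketch only.

```python
def classify(tweet):
    """Label a tweet dict as 'retweet', 'reply' or 'original'."""
    if "retweeted_status" in tweet:
        return "retweet"
    if tweet.get("in_reply_to_status_id_str") is not None:
        return "reply"
    return "original"


# Made-up minimal tweets; real Tweet objects carry many more fields.
examples = [
    {"in_reply_to_status_id_str": None},
    {"in_reply_to_status_id_str": None, "retweeted_status": {}},
    {"in_reply_to_status_id_str": "123"},
]
labels = [classify(tweet) for tweet in examples]
print(labels)
```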

In [None]:
def get_reply(tweet, search):
    if (tweet["user"]["id_str"] in search 
        and tweet["in_reply_to_status_id_str"] is not None):
            return tweet["user"]["screen_name"]

joi_replies = twiterate(lambda tweet : get_reply(tweet, journalists))
ooi_replies = twiterate(lambda tweet : get_reply(tweet, organisations))

In [None]:
pie(joi_replies, ooi_replies)

In [None]:
bar(joi_replies, ooi_replies, label="Replies")


Now that we have an idea of how active our users of interest are on social media, we thought it would be interesting to extract the most common content shared throughout the data collection period. Below we have extracted the top 10 links from all tweets in the dataset as well as the top 10 hashtags.

To extract the link from each tweet in the dataset we again look to the `twiterate` function. We must therefore define an appropriate callback to examine each tweet. Below you will notice that we are examining the `entities` attribute of the tweet to extract the `expanded_url`. For more information on how Twitter stores its tweet entities please see the [entities documentation](https://dev.twitter.com/overview/api/entities).
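
The layout of the `entities` attribute can be illustrated with a made-up tweet. The sketch below collects every `expanded_url` attached to a tweet rather than a single one. `get_urls` is a hypothetical helper, the URLs are invented, and whether `twiterate` accepts a list-valued return from its callback is not established here.

```python
def get_urls(tweet):
    """Return every expanded URL attached to a tweet dict."""
    entities = tweet.get("entities", {})
    return [url["expanded_url"] for url in entities.get("urls", [])]


# Made-up tweet with two links; only the fields used here are included.
tweet = {
    "entities": {
        "urls": [
            {"url": "https://t.co/a", "expanded_url": "https://example.com/one"},
            {"url": "https://t.co/b", "expanded_url": "https://example.com/two"},
        ]
    }
}
links = get_urls(tweet)
print(links)
```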

In [None]:
def get_url(tweet):
    # return only the first link in the tweet, if it has any
    for url in tweet["entities"]["urls"]:
        return url["expanded_url"]

urls = twiterate(get_url)

Once we have extracted all of the links we must then establish the top 10 links that were shared and collect their associated title (i.e. the HTML `<title>` tag associated with that webpage). To do this we are going to define a new function `get_title` which we will call for the 10 most common links. This function will load the webpage using the provided link and extract the contents of its `<title>` tag.

In [None]:
def get_title(url):
    # get title text
    html = requests.get(url)
    page = bs4.BeautifulSoup(html.text, "html.parser")
    title = page.title.string if page.title is not None else ""
    # fall back to an empty string when the <title> has no plain text
    title = title or ""
    # replace newlines, pipes and whitespace runs (markdown grammar) with spaces
    title = re.sub(r"\r|\n|\||\s+", " ", title)
    # remove leading & trailing whitespace
    title = title.strip()
    
    return title

In [None]:
lc = collections.Counter(urls)
mcl = [(url, get_title(url), occ) 
       for (url, occ) in lc.most_common(10)]

In [None]:
pd.DataFrame(mcl, 
             range(1, len(mcl) + 1), 
             ['Link', 'Title', 'Occurrences'])

In [None]:
pd.DataFrame([len(urls), len(set(urls))], 
             ['Total', 'Unique'], 
             ['Links'])

Next we will extract the top 10 hashtags from the dataset. To do this we simply look at the `hashtags` attribute within the `entities` object of the tweet and count the occurrences of each hashtag.
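
The counting step can be sketched on a few made-up tweets. `count_hashtags` is a hypothetical helper, not part of the notebook, and its `fold_case` option is an assumption: it treats `#Brexit` and `#brexit` as the same hashtag, which may or may not be what the analysis wants.

```python
from collections import Counter


def count_hashtags(tweets, fold_case=True):
    """Count hashtag texts across an iterable of tweet dicts."""
    counts = Counter()
    for tweet in tweets:
        for tag in tweet.get("entities", {}).get("hashtags", []):
            # assumption: optionally ignore case when comparing hashtags
            text = tag["text"].lower() if fold_case else tag["text"]
            counts["#" + text] += 1
    return counts


# Made-up tweets; only the fields used here are included.
tweets = [
    {"entities": {"hashtags": [{"text": "Brexit"}, {"text": "news"}]}},
    {"entities": {"hashtags": [{"text": "brexit"}]}},
    {"entities": {"hashtags": []}},
]
top = count_hashtags(tweets).most_common(1)
print(top)
```

With case folding switched off, the two spellings are counted separately and each appears once.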

In [None]:
def get_hashtag(tweet):
    for hashtags in tweet["entities"]["hashtags"]:
        return hashtags["text"]
    
hashtags = twiterate(get_hashtag)
hc = collections.Counter(hashtags)
mch = [("#" + key, value) 
       for (key, value) in hc.most_common(10)]

In [None]:
labels, y = zip(*mch)
x = range(len(labels))

plt.bar(x, y, alpha=0.5)
plt.xticks(x, labels, rotation=90)
plt.ylabel("Tweets")
plt.show()

While the data above might not provide us with any insight into the activity of our users of interest, it clearly highlights the discussions that took place during our monitoring window. With some further analysis, such as grouping journalists by hashtags, we may be able to build up a detailed picture of what exactly journalists are talking about and whether they stick to their domain of reporting or whether there is some crossover.
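
As a starting point for that further analysis, grouping users by hashtag could look something like the sketch below. It is only an illustration under stated assumptions: `group_by_hashtag` is a hypothetical helper, the tweets are made up, and no such grouping has been run on the dataset here.

```python
from collections import defaultdict


def group_by_hashtag(tweets):
    """Map each hashtag text to the set of screen names that used it."""
    groups = defaultdict(set)
    for tweet in tweets:
        name = tweet["user"]["screen_name"]
        for tag in tweet["entities"]["hashtags"]:
            groups[tag["text"]].add(name)
    return dict(groups)


# Made-up tweets; only the fields used here are included.
tweets = [
    {"user": {"screen_name": "alice"}, "entities": {"hashtags": [{"text": "budget"}]}},
    {"user": {"screen_name": "bob"}, "entities": {"hashtags": [{"text": "budget"}]}},
    {"user": {"screen_name": "alice"}, "entities": {"hashtags": [{"text": "football"}]}},
]
groups = group_by_hashtag(tweets)
print(sorted(groups["budget"]))
```

A user who appears under hashtags from several unrelated domains, like `alice` here, would be a candidate example of crossover.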