# Journalist Discussions

_Do journalists talk to others, or mostly amongst themselves?_

---

Using the collected data, we can begin to answer this question. By analysing the tweets collected during our monitoring period, together with the profile description of each user contained in those tweets, we can classify users into two distinct groups: journalists and non-journalists.

By then examining all tweets authored by users classified as journalists that are either replies or contain a mention of another Twitter user, we can begin to establish whether journalists talk mostly amongst themselves or interact with those outside their community.


In [1]:
import json
import sys
import os.path
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

# add penemue to path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
from utils import user_has_keyword
from utils import twiterate
from utils import lookup

First, we must develop a method of classification for our set of Twitter users. Using the same filtering technique as in the data collection section of our research, we examine each user description for the presence of a journalism-related keyword. To achieve this, we use the `user_has_keyword` function, found in the `utils` directory.

If the user description contains a keyword, we classify the user as a _journalist_; otherwise they are pooled as _other_ users.

In [2]:
def classify(users):
    """Classify users as journalists or other.
    
    :param users: A list of twitter user objects
                  Each may be reduced to contain
                  only the id_str and description
                  props.
                  
    :return joi:  A list of journalist id_str values
    :return oth:  A list of non journalist id_str values
    """
    
    joi = list(set([user['id_str'] 
                    for user in users 
                        if user_has_keyword(user)]))
    
    oth = list(set([user['id_str'] 
                    for user in users 
                        if not user_has_keyword(user)]))
    
    return joi, oth
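The `user_has_keyword` helper lives in `utils` and is not reproduced in this section. As a rough illustration of the idea only, the sketch below assumes a hypothetical keyword list and a simple substring match on the lowercased description; the real helper and its keywords may differ. The names `EXAMPLE_KEYWORDS`, `has_keyword_sketch` and `classify_sketch` are made up for this example.

```python
# Hypothetical stand-in for utils.user_has_keyword; the keyword list
# below is an assumption for illustration, not the one used in the study.
EXAMPLE_KEYWORDS = ('journalist', 'reporter', 'editor')


def has_keyword_sketch(user, keywords=EXAMPLE_KEYWORDS):
    """Return True if any keyword occurs in the user's description."""
    description = (user.get('description') or '').lower()
    return any(keyword in description for keyword in keywords)


def classify_sketch(users, predicate=has_keyword_sketch):
    """Split users into two de-duplicated lists of id_str in one pass."""
    matched, unmatched = set(), set()
    for user in users:
        target = matched if predicate(user) else unmatched
        target.add(user['id_str'])
    return sorted(matched), sorted(unmatched)


# Made-up profiles, reduced to the two properties that are needed.
sample_users = [
    {'id_str': '1', 'description': 'Political reporter, views my own'},
    {'id_str': '2', 'description': 'Coffee enthusiast'},
    {'id_str': '3', 'description': None},
]
journalists, others = classify_sketch(sample_users)
```

Because this variant reads `users` only once, it would also work if the profiles arrived as a generator rather than a list.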

To extract user profiles (i.e. the user `id_str` and `description`) from each tweet, we must create a callback for the `twiterate` function defined in the `utils` directory. `twiterate` iterates through each tweet in the provided JSON file, calls the callback with the tweet object as its parameter, and returns the tweet information we have requested.

In [3]:
def get_user(t):
    """Return only the id_str, description for users.
    
    :param t: A tweet object
    :return:  A python dict
    """
    
    return {'id_str': t['user']['id_str'],
            'description': t['user']['description'].lower() 
                if t['user']['description'] else ''}
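`twiterate` is defined in `utils` and its source is not shown in this section. The sketch below is one plausible shape for such a function, assuming the tweets are stored one JSON object per line. The file layout, the `path` parameter and the name `iterate_tweets` are all assumptions, so the real function may work differently.

```python
import json


def iterate_tweets(callback, path):
    """Apply callback to every tweet in a line-delimited JSON file.

    Hypothetical stand-in for utils.twiterate; assumes one tweet per line.
    """
    results = []
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                results.append(callback(json.loads(line)))
    return results
```

A callback such as `get_user` would then be passed as the first argument, and the returned list would hold one extracted dictionary per tweet.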

To extract all tweets that are in reply to another tweet we again make use of the `twiterate` function and so we must define an appropriate callback. Here we tag the author of the reply as the sender (from) and the receiver (to) as the user specified by the `in_reply_to_user_id_str` property of the tweet.

In [4]:
def get_reply(t):
    """Format replies to show sender, receiver.
    
    :param t: A tweet object
    :return:  A python dict
    """
    
    return {'from': t['user']['id_str'], 
            'to': t['in_reply_to_user_id_str']}

Similar to the process above, in order to extract user mentions we define a callback to pass to the `twiterate` function. However, in this case we must iterate through the list of mentions, extract the `id_str` of each user and place the resulting value in a list.

In [5]:
def get_mention(t):
    """Extract mentioning user and the users mentioned.
    
    :param t: A tweet object
    :return:  A python dict
              The mentions key of the dictionary
              contains a list of strings
    """
    
    user_mentions = [mention['id_str'] 
                     for mention in t['entities']['user_mentions']]
    
    return {'user_id_str': t['user']['id_str'],
            'mentions': user_mentions}
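Replies and mentions come back in different shapes: a reply is a single sender and receiver pair, while a mention record holds a list of receivers. If a uniform shape is preferred, mention records can be flattened into the same `from`/`to` pairs. The helper name `mention_edges` below is hypothetical and the sample record is made up.

```python
def mention_edges(record):
    """Expand one get_mention-style record into a list of from/to pairs."""
    return [{'from': record['user_id_str'], 'to': mentioned}
            for mentioned in record['mentions']]


# Made-up record in the shape returned by the mention callback.
sample = {'user_id_str': '10', 'mentions': ['20', '30']}
edges = mention_edges(sample)
```

A tweet with no mentions yields an empty list, so such records drop out naturally when the pairs are concatenated.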

Now we can begin to collect data from the set of collected tweets. 

First, we iterate through each tweet and extract the user profile it contains using the `twiterate` function. We can then pass the result directly to the `classify` function defined above to classify each user as either a journalist or not a journalist. The results are unpacked into two variables.

In [6]:
joi, oth = classify(twiterate(get_user))


Next we can extract all the replies and mentions made by journalists during the monitoring period.

In [7]:
replies = [reply 
           for reply in twiterate(get_reply) 
               if reply['to'] is not None
               and reply['from'] in joi]


In [8]:
mentions = [mention
            for mention in twiterate(get_mention)
                if len(mention['mentions']) > 0
                and mention['user_id_str'] in joi]


Collecting the users who received a reply or a mention produces a set of users for whom we may have no data. This is because we currently only hold profiles for users who created a tweet within our monitoring period; users who were replied to or mentioned may not have tweeted during this time, so we have no stored profiles for them.

In order to classify these users we must request their profiles using the Twitter API. Using the `users/lookup` method we can receive up to 100 user profiles per request, and once we have the profile information we can classify these users.

In [9]:
# extract users by id_str
known = set(joi + oth)

unclassified = [reply['to'] 
                for reply in replies 
                    if reply['to'] not in known]

unclassified += [id_str 
                 for m in mentions 
                     for id_str in m['mentions'] 
                         if id_str not in known]

# remove duplicates
unclassified = list(set(unclassified))

# lookup users in batches of up to 100 ids; stepping to len(unclassified)
# (not len + 1) avoids an empty final request
users = []
for i in range(0, len(unclassified), 100):
    users += [{'id_str': user['id_str'],
               # normalise the description as in get_user
               'description': (user['description'] or '').lower()}
                  for user in lookup(','.join(unclassified[i : i + 100]))]

# classify users
c = classify(users)

# add classified users
joi = joi + c[0]
oth = oth + c[1]
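The batching step can be isolated into a small helper. The sketch below only slices the ids into groups of at most 100 and does not call the Twitter API; the name `batches` is hypothetical. Stepping to `len(ids)` rather than `len(ids) + 1` avoids an empty final batch when the number of ids is an exact multiple of the batch size.

```python
def batches(ids, size=100):
    """Yield consecutive slices of ids, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# 250 made-up ids, split into consecutive batches.
example_ids = [str(n) for n in range(250)]
batch_sizes = [len(batch) for batch in batches(example_ids)]

# Each batch can be joined into the comma-separated string passed to lookup.
comma_separated = ','.join(next(batches(example_ids)))
```

An empty list of ids produces no batches at all, so no request would be made when every user is already classified.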

Now that all users are classified as either journalists or non-journalists, we can begin to produce some meaningful results from our data.

---

Number of users classified as journalists and non-journalists.

In [10]:
pd.DataFrame([len(joi), len(oth)], 
             ['Journalists', 'Non Journalists'], 
             ['Users'])

| | Users |
|---|---|
| Journalists | 671 |
| Non Journalists | 4435 |


Number of direct mentions made by journalists to journalists and to non-journalists.

In [11]:
joi_mentions_joi = [n for m in mentions for n in m['mentions'] if n in joi]
joi_mentions_oth = [n for m in mentions for n in m['mentions'] if n in oth]

pd.DataFrame([len(joi_mentions_joi), len(joi_mentions_oth)],
             ['To Journalists', 'To Non Journalists'],
             ['Mentions by Journalists'])

| | Mentions by Journalists |
|---|---|
| To Journalists | 434 |
| To Non Journalists | 279 |


Number of direct replies made by journalists to journalists and to non-journalists.

In [12]:
joi_to_joi = [reply for reply in replies if reply['to'] in joi]
joi_to_oth = [reply for reply in replies if reply['to'] in oth]

pd.DataFrame([len(joi_to_joi), len(joi_to_oth)], 
             ['To Journalists', 'To Non Journalists'], 
             ['Replies by Journalists'])

| | Replies by Journalists |
|---|---|
| To Journalists | 67 |
| To Non Journalists | 36 |
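To relate these counts back to the original question, they can be expressed as shares of all classified interactions. The sketch below uses only the counts reported in the tables above; the helper name `share` is hypothetical, and interactions with users who could not be classified are not included.

```python
def share(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Counts reported in the tables above.
mentions_to_journalists, mentions_to_others = 434, 279
replies_to_journalists, replies_to_others = 67, 36

mention_share = share(mentions_to_journalists,
                      mentions_to_journalists + mentions_to_others)
reply_share = share(replies_to_journalists,
                    replies_to_journalists + replies_to_others)

print('Mentions to journalists: {}%'.format(mention_share))
print('Replies to journalists: {}%'.format(reply_share))
```

These shares are best read alongside the user counts, since journalists make up a minority of the classified users.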
