# Timeline Analysis
The goal of this analysis is to investigate whether the previously analyzed hashtags actually play a significant role in the underlying network structure: does a specific topic (hashtag) strengthen ties or increase diversity among Twitter users? The idea is therefore to build a metric that extracts this information by combining the Tweets behind the mutual interactions with the network structure, which has been the sole core of the analysis so far.

In [155]:
import snap
from snap import TUNGraph
import os
import sys
import operator
import pandas as pd
import subprocess
import numpy as np
import csv
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib import dates as mdates
import seaborn as sns
from __future__ import print_function
from IPython.display import HTML, display
import tabulate
import json
import datetime
from datetime import timedelta
from dateutil.relativedelta import relativedelta
from collections import Counter
import re
from itertools import combinations

# Set Seaborn defaults
sns.set()
%matplotlib inline
pd.set_option("display.precision", 3)
mpl.rcParams['figure.dpi'] = 100
mpl.rcParams['savefig.dpi'] = 100
mpl.rcParams['figure.autolayout'] = True

# Global variables
data_dir = "../data"
pictures_path = os.path.join("../Pictures", "8.TimelineAnalysis")
tweets_path = "../lib/GetOldTweets-python/out/completed"

## 0. Creation of Model Classes
Performance matters here, since some Tweet files are fairly big (a few GBs), so I'd like to optimize some operations as much as possible:
- Since Tweets naturally come with a unique ID, I create a custom `Tweet` type that is compared and hashed by that ID;
- A class `Interaction` represents an (undirected) interaction between 2 users.

In [42]:
class Tweet:
    def __init__(self, tweet_id, users, tweet_dict):
        self.tweet_id = tweet_id
        self.tweet_dict = tweet_dict
        self.users = users
        
    def __eq__(self, other):
        if isinstance(other, Tweet):
            return self.tweet_id == other.tweet_id
        return NotImplemented
    
    def __ne__(self, other):
        x = self.__eq__(other)
        if x is not NotImplemented:
            return not x
        return NotImplemented
    
    def __hash__(self):
        return hash(self.tweet_id)

In [148]:
class Interaction:
    def __init__(self, source, target):
        self.source = source
        self.target = target
        
    def __eq__(self, other):
        if isinstance(other, Interaction):
            return (self.source == other.source and self.target == other.target) or (self.source == other.target and self.target == other.source)
        return NotImplemented
    
    def __ne__(self, other):
        x = self.__eq__(other)
        if x is not NotImplemented:
            return not x
        return NotImplemented
    
    def __hash__(self):
        return hash(hash(self.source)+hash(self.target))
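The symmetric `__eq__` and `__hash__` are what allow an interaction to be used as a dictionary key regardless of the direction in which the two users appear. The following is a minimal sketch of that property, not the notebook's class: `UndirectedPair` is a hypothetical stand-in that stores a `frozenset` of the two usernames instead of summing their hashes.

```python
class UndirectedPair(object):
    """Hypothetical stand-in for Interaction: an unordered pair of usernames."""

    def __init__(self, source, target):
        self.source = source
        self.target = target
        # A frozenset ignores order, so (a, b) and (b, a) share the same key
        self._key = frozenset((source, target))

    def __eq__(self, other):
        if isinstance(other, UndirectedPair):
            return self._key == other._key
        return NotImplemented

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        return hash(self._key)


# Count toy interactions; both directions land on the same dictionary entry
counts = {}
for a, b in [("alice", "bob"), ("bob", "alice"), ("alice", "carol")]:
    pair = UndirectedPair(a, b)
    counts[pair] = counts.get(pair, 0) + 1

print(counts[UndirectedPair("alice", "bob")])
```

Equal objects must hash equally for dictionary lookups to work, which is why both methods have to ignore the order of `source` and `target`.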

Here is a collection of utility functions:

In [162]:
def get_relative_percentage(n,m):
    return n*100.0/m

def load_graph_from_backup(filename):
    FIn = snap.TFIn(os.path.join(data_dir, filename + ".bin"))
    graph = snap.TUNGraph.Load(FIn)
    return graph

def read_large_file(file_object):
    while True:
        data = file_object.readline()
        if not data:
            break
        yield data.rstrip('\n')
        
def process_edge_line(line):
    source, target, prop = line.split(',')
    return int(source), int(target), prop
        
def get_usernames_from_basic_tweet_info(hashtag, tweet):
    usernames = set()
    # (1): Has tweeted using hashtag
    if hashtag in [h.lower() for h in tweet["entities"]["hashtags"]]:
        usernames.add(tweet["user"]["screen_name"].lower())

    # (2): Has been mentioned / replied to
    if tweet["in_reply_to_screen_name"] is not None:
        usernames.add(tweet["in_reply_to_screen_name"].lower())
    for mentions in tweet["entities"]["user_mentions"]:
        usernames.add(mentions["screen_name"].lower())
    return usernames

def get_tweet_usernames(hashtag, tweet):
    usernames = set()
    usernames.update(get_usernames_from_basic_tweet_info(hashtag, tweet))
    if "retweeted_status" in tweet:
        usernames.update(get_usernames_from_basic_tweet_info(hashtag, tweet["retweeted_status"]))
    if "quoted_status" in tweet:
        usernames.update(get_usernames_from_basic_tweet_info(hashtag, tweet["quoted_status"]))
    return usernames

def extract_hashtag_usernames(hashtag, tweets):
    hashtag_usernames = set()
    for tweet in tweets:
        hashtag_usernames.update(tweet.users)
    print("Total unique usernames involved in '#%s' hashtag conversations from %d tweets: %d" %(hashtag, len(tweets), len(hashtag_usernames)))
    return hashtag_usernames

# Extract tweets given a specific hashtag
def get_tweets(hashtag):
    tweets_filename = os.path.join(tweets_path,"tweets_#" + hashtag + "_2013-09-01_2016-12-31.json")
    tweets = set()
    with open(tweets_filename) as fin:
        for line in read_large_file(fin):
            tweet_dict = json.loads(line)
            tweet_id = np.int64(tweet_dict["id_str"])
            tweet_users = get_tweet_usernames(hashtag, tweet_dict)
            tweets.add(Tweet(tweet_id, tweet_users, tweet_dict))
    print("Imported %d tweets from %s" %(len(tweets),tweets_filename))
    return tweets

## 1. Extract Tweets: filter usernames and tweets

In [44]:
hashtag = "jesuischarlie"
hashtag_full = "#JeSuisCharlie"

Load hashtag subgraph from backup:

In [24]:
hashtag_subgraph = load_graph_from_backup("mmr_subgraph_"+hashtag)

In [45]:
%%time
tweets = get_tweets(hashtag)

Imported 413857 tweets from ../lib/GetOldTweets-python/out/completed/tweets_#jesuischarlie_2013-09-01_2016-12-31.json
CPU times: user 20.5 s, sys: 1.05 s, total: 21.5 s
Wall time: 21.5 s


In [53]:
hashtag_usernames = extract_hashtag_usernames(hashtag, tweets)

Total unique usernames involved in '#jesuischarlie' hashtag conversations from 413857 tweets: 224646


In [85]:
%%time
usernames_to_id_dict = {}
with open("../data/usernames.csv") as usernames_f:
    for line in read_large_file(usernames_f):
        username = line.split(',')[0]
        encoding = int(line.split(',')[1])
        # Add to dict only if username is part of the hashtag subgraph
        if hashtag_subgraph.IsNode(encoding):
            usernames_to_id_dict[username] = encoding

CPU times: user 2min 44s, sys: 400 ms, total: 2min 45s
Wall time: 2min 45s


The next step is to filter out those Tweets whose involved users are not part of the corresponding $H$ subgraph (i.e. Tweets for which none of the involved users is a node in $H$). The keys of the `usernames_to_id_dict` dictionary correspond to all the usernames of the $H$ subgraph, so it is sufficient to check the following criterion:
- A tweet is kept if any of its involved users is part of the $H$ subgraph:

In [140]:
# Build a list (not a lazy iterator) so that len() works on Python 2 and 3 alike
tweets_filtered = [t for t in tweets if any(u in usernames_to_id_dict for u in t.users)]
print("Number of filtered tweets (with at least 1 involved user within MMR graph data): %d (%.2f%% of %d total tweets)" %(len(tweets_filtered), get_relative_percentage(len(tweets_filtered), len(tweets)), len(tweets)))

Number of filtered tweets (with at least 1 involved user within MMR graph data): 218940 (52.90% of 413857 total tweets)


Of the 413,857 tweets in total, **218,940** (52.90%) involve at least one user who is part of the MMR graph data.

In [141]:
def count_tot_users(tweets):
    tot_users = set()
    for t in tweets:
        tot_users.update(t.users)
    return len(tot_users)

print("Total number of unique users involved in %d filtered tweets: %d" %(len(tweets_filtered), count_tot_users(tweets_filtered)))

Total number of unique users involved in 218940 filtered tweets: 156245


**Note**: The number of users involved in the filtered tweets may be higher than the actual number of nodes in $H$, since each Tweet might include users that have not been captured by the MMR data. I apply this filtering step because the users collected in the MMR graph data carry some extra interaction properties: for example, 2 users are related if their total number of interactions within a 3-month period is sufficiently high, so that the edge represents a sustained interaction over time and not a random one. This already removes a lot of noisy or irrelevant data and speeds up the algorithms in the next steps.

## 2. User Interactions Statistics
I first show some statistics about the interactions.

### 2.1 Top Interactions
I start with the interactions that occurred most often throughout the years under consideration. It's convenient to create a dictionary of $K \rightarrow V$ pairs where $K$ = interaction and $V$ = count (how many times two people have interacted with each other within the whole timeline period).

The interaction count dictionary can be collected in a single linear pass over the tweets. The steps are:
1. For each tweet, list all possible pairs of its involved users and create an `Interaction` instance for each pair;
2. If the interaction is not yet in the dictionary, add a new entry with value 1;
3. If the interaction already exists, increment its value.
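As a toy illustration of these steps, independent of the notebook's own implementation, the sketch below counts pairs with a `Counter` keyed by `frozenset`. The usernames and the `toy_tweets` structure are made up for the example.

```python
from collections import Counter
from itertools import combinations

# Each toy "tweet" is just the set of usernames involved in it (hypothetical data)
toy_tweets = [
    {"alice", "bob", "carol"},  # 3 users -> 3 pairs
    {"bob", "alice"},           # 2 users -> 1 pair
    {"dave"},                   # 1 user  -> no pair
]

pair_counts = Counter()
for users in toy_tweets:
    # Step 1: all unordered pairs of involved users
    for a, b in combinations(sorted(users), 2):
        # Steps 2 and 3: Counter creates the entry on first use, then increments it
        pair_counts[frozenset((a, b))] += 1

for pair, count in pair_counts.most_common():
    print(sorted(pair), count)
```

A tweet involving $n$ users contributes $n(n-1)/2$ pairs, so tweets with many mentions weigh heavily in the totals.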

In [149]:
def create_interactions_count_dict(tweets):
    interactions_count = {}
    for t in tweets:
        pairs = list(combinations(t.users, 2))
        for p in pairs:
            i = Interaction(p[0], p[1])
            if i in interactions_count:
                interactions_count[i] += 1
            else:
                interactions_count[i] = 1
    return interactions_count

In [150]:
interactions_count = create_interactions_count_dict(tweets_filtered)

Let's visualize the results through a DataFrame, sorted by descending count (I only show the top interactions, i.e. those whose count is at least 100):

In [151]:
interactions_df = pd.DataFrame(data=[(el[0].source, el[0].target, el[1]) for el in interactions_count.items()], columns=["Source", "Target", "Count"])

In [152]:
interactions_df.sort_values(by="Count", ascending=False, inplace=True)
interactions_df[interactions_df["Count"]>=100]

| | Source | Target | Count |
|---|---|---|---|
| 3983 | bibleloussegond | linformatrice | 569 |
| 91363 | mumtazceltik | whitehouse | 282 |
| 33120 | botcharlie | pressmoustache | 278 |
| 132721 | gendarmerie | pnationale | 190 |
| 45210 | mumtazceltik | fhollande | 180 |
| 80880 | vp | mumtazceltik | 172 |
| 101191 | ivorydove | johnkerry | 162 |
| 59980 | fhollande | whitehouse | 144 |
| 116031 | vp | whitehouse | 132 |
| 138969 | mumtazceltik | senjohnmccain | 124 |


The above DataFrame shows the pairs of users that interacted the most over the whole timeline about hashtag **#JeSuisCharlie**. Keep in mind that, because of the filtering performed in the previous section, the DataFrame might include some users that are not part of the graph data.

Another perspective is to highlight which individual users have been involved in the highest number of interactions:

In [153]:
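# Sum the counts per user, once as source and once as target; selecting "Count"
# explicitly avoids summing the username columns
temp1 = interactions_df.groupby('Source')['Count'].sum().reset_index().rename(columns={'Source':'User'})
temp2 = interactions_df.groupby('Target')['Count'].sum().reset_index().rename(columns={'Target':'User'})
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
temp = pd.concat([temp1, temp2])
temp = temp.groupby('User')['Count'].sum().reset_index()
temp.sort_values(by="Count", ascending=False, inplace=True)
temp[temp["Count"]>=500]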

| | User | Count |
|---|---|---|
| 70879 | pressmoustache | 13229 |
| 92740 | youtube | 4448 |
| 16237 | charlie_hebdo_ | 2761 |
| 29537 | fhollande | 2406 |
| 61545 | mumtazceltik | 1382 |
| 11096 | bfmtv | 1284 |
| 49947 | lemondefr | 1240 |
| 90996 | whitehouse | 1074 |
| 87537 | twitter | 1008 |
| 50684 | libe | 974 |


The result above gives an idea of which users have been most active around the hashtag topic over time.

## 3. Measuring interaction consistency over time
What I'm now interested in, and what actually represents the end goal of this whole work, is identifying a measure of consistency for the interactions of people taking part in controversies on Twitter. Given the data I have, I can identify 2 paths:
- **MMR graph data**: there is a property of the graph data that hasn't been used so far but is still present in the CSV files produced by the data transformation steps of the first notebooks. This property already gives a built-in definition of *interaction consistency*; however, the information we can infer from it is quite limited, because for each pair of users we only know whether they have interacted (consistently) within a 3-month period, and given the range of our timeline this amounts to a total of 13 periods. It is, however, the only way to compare how interaction consistency has changed over time;
- **Tweets**: when limiting the scope to the tweets related to a specific topic (hashtag), I have no information about which other topics the involved users have been tweeting about, nor the related temporal information. Therefore, I can only know when the *first* Twitter interaction between each pair of mutually interacting users occurred, along with all the subsequent ones. The previous section already touched on this; however, I can narrow it down further and measure interaction consistency over time by counting the interactions *per month*.
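The Tweet-based path can be sketched as follows: count interactions per pair and per calendar month, and look up the first interaction of a pair. This is only an illustration on made-up records; the `(user_a, user_b, timestamp)` layout and the helper `first_interaction` are assumptions, not structures from the notebook.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: (user_a, user_b, timestamp of the tweet)
toy_interactions = [
    ("alice", "bob", datetime(2015, 1, 7)),
    ("bob", "alice", datetime(2015, 1, 9)),
    ("alice", "bob", datetime(2015, 2, 1)),
    ("alice", "carol", datetime(2015, 1, 8)),
]

# Count interactions per (unordered pair, year, month)
monthly_counts = Counter()
for user_a, user_b, when in toy_interactions:
    pair = frozenset((user_a, user_b))
    monthly_counts[(pair, when.year, when.month)] += 1


def first_interaction(records, user_a, user_b):
    """Return the earliest timestamp for the unordered pair, or None if absent."""
    pair = frozenset((user_a, user_b))
    times = [when for a, b, when in records if frozenset((a, b)) == pair]
    return min(times) if times else None


print(monthly_counts[(frozenset(("alice", "bob")), 2015, 1)])
print(first_interaction(toy_interactions, "bob", "alice"))
```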

### 3.1 Long-term interaction consistency comparison: Analyzing behavioral changes over time
The idea of this analysis is to compare interaction consistency before and after the first $H$-type interaction, in order to highlight the bonding effect of a given hashtag: does it actually strengthen or weaken ties over time? Are users more or less likely to communicate with each other after their $H$ interaction? We address these questions in this first timeline analysis by using the metadata provided as edge attributes by the MMR graph data.

In detail: we have a total of 13 periods, ranging from 2013-09 to 2016-12, each lasting 3 months. For each edge in $H$ and for each period, we have a binary variable set to 1 if the interaction occurred in the respective 3-month period and 0 otherwise (a simplification of True/False values). Given a period with 0-based index $i$ and an edge $e$, $IsInPeriod(e,i)$ is defined as follows:

$$
IsInPeriod(e,i) =
    \begin{cases}
        1 & \text{if $e$ occurred in period $i$,}\\
        0 & \text{otherwise.}
    \end{cases}
$$

Then, the **average mutual interaction consistency** $\langle MIC \rangle$ of edge $e$, calculated over $P$ consecutive periods starting from period with index $i$ until period with index $j$, is given by:

$$\langle MIC \rangle = \frac{1}{P}\sum_{k=i}^{j} IsInPeriod(e,k)$$

Note that $P = j-i+1$.

$\langle MIC \rangle$ provides a metric I may calculate as:
- $\langle MIC \rangle_{T}$, where $T$ stands for *total*, and $P = 13$ (the whole timeline period);
- $\langle MIC \rangle_{B}$ ($B$ = *before*), $P$ is the number of consecutive periods *before* the first $H$ interaction;
- $\langle MIC \rangle_{A}$ ($A$ = *after*), $P$ is the number of consecutive periods *after* the first $H$ interaction.

$\langle MIC \rangle_{B}$ and $\langle MIC \rangle_{A}$ may then be compared directly.
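The three variants can be sketched as plain averages over slices of an edge's 0/1 period flags. The function names `mic` and `mic_summary` are hypothetical, and so is one modelling choice the text leaves open: here the period containing the first $H$ interaction is excluded from both the *before* and the *after* window.

```python
def mic(flags, start, end):
    """Average of flags[start..end] (both inclusive); None if the window is empty."""
    if start > end:
        return None
    window = flags[start:end + 1]
    return sum(window) / float(len(window))


def mic_summary(flags, first_h_period):
    """Total / before / after MIC for one edge.

    Assumption: the period of the first H interaction itself belongs to
    neither the 'before' nor the 'after' window.
    """
    last = len(flags) - 1
    return {
        "total": mic(flags, 0, last),
        "before": mic(flags, 0, first_h_period - 1),
        "after": mic(flags, first_h_period + 1, last),
    }


# Toy edge: 13 period flags, first H interaction in period 4
flags = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0]
summary = mic_summary(flags, first_h_period=4)
print(summary)
```

In this toy case the edge is active in 1 of the 4 periods before the $H$ interaction and in 6 of the 8 periods after it, so $\langle MIC \rangle_{A} > \langle MIC \rangle_{B}$. When the first $H$ interaction falls in the first or last period, the corresponding window is empty and the sketch returns `None` rather than dividing by zero.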

As a first step, we need to store the edge attributes conveniently. This metadata has already been transformed into a light representation in which periods are encoded with indexes ranging from 0 to 12, where period $0$ spans 2013-09 to 2013-11 (endpoints included) and period $12$ spans 2016-09 to 2016-11.
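For reference, the period encoding described above can be reproduced from a calendar date with integer arithmetic. This is a minimal sketch: `period_index` and the constants are hypothetical names, and rejecting dates outside the 13 periods is an assumption of the example.

```python
from datetime import date

FIRST_YEAR, FIRST_MONTH = 2013, 9  # period 0 starts in 2013-09
MONTHS_PER_PERIOD = 3
NUM_PERIODS = 13                   # indexes 0..12


def period_index(day):
    """Map a date to its 0-based 3-month period index."""
    months_since_start = (day.year - FIRST_YEAR) * 12 + (day.month - FIRST_MONTH)
    index = months_since_start // MONTHS_PER_PERIOD
    # Assumption: dates outside the encoded periods are treated as errors
    if months_since_start < 0 or index >= NUM_PERIODS:
        raise ValueError("date %s is outside the encoded periods" % day)
    return index


print(period_index(date(2013, 9, 1)))
print(period_index(date(2015, 1, 7)))
print(period_index(date(2016, 11, 30)))
```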

In [None]:
# Create dict with Interaction as keys and list of periods as values

### 3.2 Detailed Interaction consistency: how long do ties survive?