## Sentiment140 Dataset (as taken from Kaggle):
The dataset comprises 1.6 million tweets, each labelled with a sentiment value of 0 (negative), 2 (neutral) or 4 (positive). We look for tweets whose mentions connect one user to another and use them to build a directed graph.

The zip file can be downloaded from https://www.kaggle.com/kazanova/sentiment140/downloads/sentiment140.zip.

In [1]:
import pandas as pd
import numpy as np
import networkx as nx

In [2]:
base_dataframe = pd.read_csv("sentiment140.zip", delimiter=',', encoding='latin-1', compression='zip', header=None)
print("The imported dataset has a total of", len(base_dataframe), "tweets.")

The imported dataset has a total of 1600000 tweets.


In [3]:
biased_dataframe = base_dataframe.loc[base_dataframe[0] != 2]
print("Tweets which aren't neutral are", len(biased_dataframe), "in number.")
print("Thereby, all tweets in the dataset are either positive or negative.")

Tweets which aren't neutral are 1600000 in number.
Thereby, all tweets in the dataset are either positive or negative.


In [4]:
src_users = biased_dataframe[4].values
src_users_set = set(src_users)
print("There are", len(src_users_set), "distinct source users who tweeted a total of", len(biased_dataframe), "tweets.")

There are 659775 distinct source users who tweeted a total of 1600000 tweets.


In [5]:
tweets = biased_dataframe[5].values
uni = []
multi = []
for tweet in tweets:
    words = tweet.split()
    uniflag = False
    multiflag = False
    for word in words:
        if word[0] == '@' and len(word) > 1 and not uniflag:
            uniflag = True
        elif word[0] == '@' and len(word) > 1 and uniflag:
            multiflag = True
    if multiflag:
        multi.append(tweet)
    elif uniflag:
        uni.append(tweet)
print("There are a total of", len(uni), "tweets directed to single other users.")
print("There are a total of", len(multi), "tweets directed to multiple other users.")

There are a total of 703311 tweets directed to single other users.
There are a total of 33194 tweets directed to multiple other users.


The following function looks for targeted tweets and separates them into two lists: tweets directed at one user and tweets directed at multiple users.

In [6]:
def extract_directed_tweets(src_list, tweet_list, sentiment_list):
    unidirected = []
    multidirected = []
    for index, tweet in enumerate(tweet_list):
        words = tweet.split()
        uniflag = False
        target = ''    # first handle mentioned in the tweet
        targets = []   # any further handles mentioned after the first
        for word in words:
            if word[0] == '@' and len(word) > 1 and not uniflag:
                target = word[1:]
                uniflag = True
            elif word[0] == '@' and len(word) > 1 and uniflag:
                targets.append(word[1:])
        if len(targets) != 0:
            targets.append(target)
            multidirected.append([src_list[index], targets, sentiment_list[index]])
        elif uniflag:
            unidirected.append([src_list[index], target, sentiment_list[index]])
    return unidirected, multidirected

In [7]:
src_list = biased_dataframe[4].values
tweet_list = biased_dataframe[5].values
sentiment_list = biased_dataframe[0].values
uni_list, multi_list = extract_directed_tweets(src_list, tweet_list, sentiment_list)
print("After tweet extraction, we have", len(uni_list), "and", len(multi_list), "tweets being unidirected and multidirected.")

After tweet extraction, we have 703311 and 33194 tweets being unidirected and multidirected.
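
To make the split concrete, here is a minimal sketch on made-up data. It is not the notebook's function: `split_mentions` and `toy_rows` are hypothetical names introduced only for illustration, and the handles are kept in order of appearance (the function above appends the first handle last). As in the notebook, a token counts as a mention when it starts with `@` and has at least one more character, so trailing punctuation such as `@bob:` stays part of the handle.

```python
def split_mentions(tweet):
    """Return the handles mentioned in a tweet, in order of appearance."""
    # A token is a mention if it starts with '@' and is longer than one character.
    return [w[1:] for w in tweet.split() if w.startswith('@') and len(w) > 1]

# Hypothetical toy rows: (source user, tweet text, sentiment label).
toy_rows = [
    ("alice", "@bob great talk today", 4),
    ("carol", "@dave @erin this is awful", 0),
    ("frank", "no mentions here @", 4),  # a lone '@' is not a mention
]

unidirected, multidirected = [], []
for src, text, sentiment in toy_rows:
    handles = split_mentions(text)
    if len(handles) == 1:
        unidirected.append([src, handles[0], sentiment])
    elif len(handles) > 1:
        multidirected.append([src, handles, sentiment])

print(unidirected)    # [['alice', 'bob', 4]]
print(multidirected)  # [['carol', ['dave', 'erin'], 0]]
```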


### Sentiment Validity is unpredictable for multi-directed tweets. First, let us consider only uni-directed tweets.

In [8]:
src_users = [x[0] for x in uni_list]
tgt_users = [x[1] for x in uni_list]
src_users_set = set(src_users)
tgt_users_set = set(tgt_users)
intersection = src_users_set.intersection(tgt_users_set)

print("There are", len(src_users_set), "distinct source users who tweeted a total of", len(uni_list), "tweets.")
print("These", len(uni_list), "tweets are directed at", len(tgt_users_set), "distinct users.")
print("There are a total of", len(intersection), "users common to both sets.")

There are 293410 distinct source users who tweeted a total of 703311 tweets.
These 703311 tweets are directed at 334272 distinct users.
There are a total of 96584 users common to both sets.


In [9]:
user_cluster = src_users_set.union(tgt_users_set)
user_list = list(user_cluster)
hashmap = {user:index for index, user in enumerate(user_list)}
print("We have a hashmap for a user base of", len(user_cluster), "individuals.")
print("Also,", len(user_cluster), "=", len(src_users_set), "+", len(tgt_users_set), "-", len(intersection))

We have a hashmap for a user base of 531098 individuals.
Also, 531098 = 293410 + 334272 - 96584
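
The size of the user base follows from inclusion-exclusion over the distinct source and target users: the union has as many members as the two sets combined, minus the users counted twice. A small sketch with made-up handles (the names `src`, `tgt`, `common` and `union` are illustrative only):

```python
# Hypothetical toy sets of distinct source and target users.
src = {"alice", "bob", "carol"}
tgt = {"bob", "carol", "dave", "erin"}

common = src & tgt   # users who both send and receive directed tweets
union = src | tgt    # every user that appears at either end of an edge

# |A ∪ B| = |A| + |B| - |A ∩ B|
assert len(union) == len(src) + len(tgt) - len(common)
print(len(union), "=", len(src), "+", len(tgt), "-", len(common))  # 5 = 3 + 4 - 2
```

Note that the identity holds for the sizes of the sets of distinct users, not for the lengths of the per-tweet lists, which count repeated users once per tweet.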


In [10]:
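sign = {0: -1, 4: 1}  # map Sentiment140 labels to signed edge weights
tuples = []
for row in uni_list:
    v = sign[row[2]]  # raises KeyError on an unexpected label instead of reusing a stale value
    s, t = hashmap[row[0]], hashmap[row[1]]
    tuples.append((s, t, v))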
print("Edge count in tuples list is " + str(len(tuples)) + ".")

Edge count in tuples list is 703311.


In [11]:
edges = [(x,y) for (x,y,z) in tuples]
distinct_edges = set(edges)
print("The number of distinct edge connections is " + str(len(distinct_edges)) + ".")

The number of distinct edge connections is 581815.


The following two functions:
- Check whether two tuples correspond to the same directed edge.
- Collapse multiple edges between the same pair of nodes into one equivalent edge, and report whether the result is valid (`False` when positive and negative tweets cancel out and the edge is neutral).
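
The two steps listed above can also be sketched with a dictionary keyed by `(source, target)`, which avoids sorting and tracking the previous edge. This is an alternative illustration under the same majority-sign rule, not the implementation used in the cells below; `collapse_parallel_edges` and `toy_edges` are hypothetical names.

```python
from collections import defaultdict

def collapse_parallel_edges(signed_edges):
    """Collapse parallel (src, tgt, +1/-1) edges into one edge per pair.

    The sign of the summed weights decides the result; pairs whose
    weights cancel out to zero are dropped.
    """
    totals = defaultdict(int)
    for s, t, v in signed_edges:
        totals[(s, t)] += v

    collapsed = []
    for (s, t), total in sorted(totals.items()):
        if total > 0:
            collapsed.append((s, t, 1))
        elif total < 0:
            collapsed.append((s, t, -1))
        # total == 0: balanced edge, ignored
    return collapsed

# Hypothetical toy edges: two positive and one negative tweet on 0 -> 1,
# a balanced pair on 2 -> 3, and a single negative tweet on 4 -> 5.
toy_edges = [(0, 1, 1), (0, 1, 1), (0, 1, -1), (2, 3, 1), (2, 3, -1), (4, 5, -1)]
print(collapse_parallel_edges(toy_edges))  # [(0, 1, 1), (4, 5, -1)]
```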

In [38]:
def edges_are_same(a, b):
    # Two tuples describe the same directed edge when source and target match.
    return a[0:2] == b[0:2]

def get_equivalent_edge(edges):
    # Sum the +1/-1 weights; the sign of the sum is the majority sentiment.
    total = 0
    for edge in edges:
        total += edge[2]
    if total == 0:
        return (edges[0][0], edges[0][1], 0), False
    elif total > 0:
        return (edges[0][0], edges[0][1], 1), True
    else:
        return (edges[0][0], edges[0][1], -1), True

In [39]:
# Sort by (source, target) so that parallel edges sit next to each other.
tuples.sort(key=lambda x: (x[0], x[1]))
final_tuples = []
prev_edge = tuples[0]
average_flag = False
average_set = []
for i in range(1, len(tuples)):
    curr_edge = tuples[i]
    if not edges_are_same(curr_edge, prev_edge) and not average_flag:
        final_tuples.append(prev_edge)
    elif edges_are_same(curr_edge, prev_edge) and not average_flag:
        average_set.append(prev_edge)
        average_flag = True
    elif edges_are_same(curr_edge, prev_edge) and average_flag:
        average_set.append(prev_edge)
    elif not edges_are_same(curr_edge, prev_edge) and average_flag:
        average_set.append(prev_edge)
        equivalent_edge, valid = get_equivalent_edge(average_set)
        average_flag = False
        average_set = []
        if valid:
            final_tuples.append(equivalent_edge)
    prev_edge = curr_edge
if average_flag:
    average_set.append(prev_edge)
    equivalent_edge, valid = get_equivalent_edge(average_set)
    if valid:
        final_tuples.append(equivalent_edge)
else:
    final_tuples.append(prev_edge)
    
print("Taking the majority sign of all parallel directed edges between two nodes and dropping the edges that are")
print("balanced out by positive and negative tweets, we have a total of", len(final_tuples), "edges left in the network.")

Taking the majority sign of all parallel directed edges between two nodes and dropping the edges that are
balanced out by positive and negative tweets, we have a total of 566173 edges left in the network.


In [40]:
G = nx.DiGraph()
G.add_weighted_edges_from(final_tuples)

print("Number of edges in the graph object is " + str(G.number_of_edges()) + ".")
print("This is less than", len(distinct_edges), "because balanced (effectively zero-weight) edges are ignored.")

print("NetworkX graph created, saving...")
import pickle
# nx.write_gpickle was removed in NetworkX 3.0; the standard pickle module works across versions.
with open('sentiment140.gpickle', 'wb') as f:
    pickle.dump(G, f, pickle.HIGHEST_PROTOCOL)
print("Graph saved.")

Number of edges in the graph object is 566173.
This is less than 581815 because balanced (effectively zero-weight) edges are ignored.
NetworkX graph created, saving...
Graph saved.
