## Extracting a multilayer network from Reddit data

### First we load the network and check some basic stats

In [1]:
from redditnetwork.network_extractor import extract_week_network
import networkx as nx

In [2]:
# We will use /r/politics subreddit as the running example
# We extract a network for this subreddit, corresponding to the first week of 2014
politics_net = extract_week_network("politics", 2014, 1)

Processed 45097 comments, of which 12378 were removed for missing post and 5888 for missing parent


Ignore the warning about using the week argument instead of month. This is just an internal complication: the data is stored at the monthly level, but we are accessing it by week.

Once processing finishes, the extractor reports how many comments it processed and how many it removed because their parent or post was missing (e.g., they were replying to an old post from an earlier week).

The returned object is a networkx DiGraph (directed graph).

In [3]:
## some basic stats:
from collections import Counter

# tally node types in a single pass instead of scanning the graph three times
type_counts = Counter(attrs["type"] for _, attrs in politics_net.nodes(data=True))
print("There are {:d} users, {:d} comments, and {:d} posts in the graph".format(
    type_counts["user"], type_counts["comment"], type_counts["post"]))

There are 8992 users, 26832 comments, and 2368 posts in the graph


### Okay, and now some details on the data 

The underlying structure is a directed graph (DiGraph) and additional information is stored as node and edge attributes.

#### Node types

Every node has a "type" attribute that is one of "user", "comment", or "post".
Users are indexed by their username, and posts and comments by unique string ids.

#### Edge types

Every edge has a "type" attribute as well, which is one of the following:
* "user_post": a directed edge from a user to a post they made.
* "user_comment": a directed edge from a user to a comment they made.
* "post_comment": a directed edge from a post to a top-level comment in that post.
* "comment_comment": a directed edge from a comment to a comment that replies to it. 
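
The four edge types above imply a fixed pairing of source and target node types. A minimal sketch of a consistency check, assuming nodes and edges are supplied as the `(node, attrs)` pairs and `(source, target, attrs)` triples that `nodes(data=True)` and `edges(data=True)` yield. The helper name `find_schema_violations` and the toy data are hypothetical and not part of `redditnetwork`:

```python
# Expected (source type, target type) for each edge type listed above.
EDGE_SCHEMA = {
    "user_post": ("user", "post"),
    "user_comment": ("user", "comment"),
    "post_comment": ("post", "comment"),
    "comment_comment": ("comment", "comment"),
}


def find_schema_violations(nodes, edges):
    """Return the edges whose endpoint types do not match EDGE_SCHEMA."""
    node_type = {node: attrs["type"] for node, attrs in nodes}
    violations = []
    for source, target, attrs in edges:
        expected = EDGE_SCHEMA.get(attrs["type"])
        actual = (node_type.get(source), node_type.get(target))
        if expected != actual:
            violations.append((source, target, attrs["type"]))
    return violations


# Toy data standing in for politics_net.nodes(data=True) / edges(data=True).
toy_nodes = [
    ("alice", {"type": "user"}),
    ("p1", {"type": "post"}),
    ("c1", {"type": "comment"}),
    ("c2", {"type": "comment"}),
]
toy_edges = [
    ("alice", "p1", {"type": "user_post"}),
    ("p1", "c1", {"type": "post_comment"}),
    ("c1", "c2", {"type": "comment_comment"}),
    ("c2", "p1", {"type": "post_comment"}),  # deliberately points the wrong way
]

print(find_schema_violations(toy_nodes, toy_edges))
```

On the real graph, the same call would take `politics_net.nodes(data=True)` and `politics_net.edges(data=True)`; an empty result would indicate that every edge matches the description above.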

#### Node attributes/features

Comment nodes and post nodes also have additional features/attributes (which can be listed by running politics_net.graph; see the example below). User nodes currently have no features (besides those that are implicit in the graph structure).

##### Comment features
* score: score that the comment received
* time: when the comment was made during the week (hour offset from 12:00am on Monday of that week)
* post_time_offset: how old the post was when the comment was made (in hours)
* length: number of words in the comment
* word_vecs: 300-dimensional vector embedding of the comment (tf-idf average of GloVe vectors)

##### Post features
* score: score that the post received
* time: when the post was made during the week (hour offset from 12:00am on Monday of that week)
* length: number of words in the title
* word_vecs: vector embedding of the post title (average of GloVe vectors)

*NOTE THAT NONE OF THESE FEATURES ARE THE "LABELS" WE WANT TO PREDICT.* That data is stored elsewhere for now because I don't want to clutter the network representations and because the "labels" are in flux. See the bottom of this notebook for an example of how to get the labels for predictions.

In [4]:
# this prints info about what features there are and the dimensionality of these features
politics_net.graph

{'comment_feats': {'length': 1,
  'post_time_offset': 1,
  'score': 1,
  'subreddit': 1,
  'time': 1,
  'word_vecs': 300},
 'post_feats': {'length': 1,
  'num_comments': 1,
  'score': 1,
  'subreddit': 1,
  'time': 1,
  'word_vecs': 300},
 'user_feats': {}}
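
The dimensionalities reported in `politics_net.graph` are enough to lay a node's features out as one flat vector. A minimal sketch, assuming each selected feature is numeric (a scalar when its dimension is 1, a sequence otherwise). The function `flatten_features`, the chosen feature order, and the toy values are illustrative only, and `word_vecs` is shortened to 3 dimensions here:

```python
def flatten_features(attrs, feat_dims, feature_names):
    """Concatenate the chosen features of one node into a flat list.

    attrs         -- the node's attribute dict
    feat_dims     -- mapping of feature name -> dimensionality,
                     e.g. politics_net.graph["comment_feats"]
    feature_names -- which features to include, in output order
    """
    vector = []
    for name in feature_names:
        value = attrs[name]
        if feat_dims[name] == 1:
            vector.append(float(value))
        else:
            values = [float(x) for x in value]
            if len(values) != feat_dims[name]:
                raise ValueError("unexpected size for feature %r" % name)
            vector.extend(values)
    return vector


# Toy stand-ins; real comment nodes carry a 300-dimensional word_vecs.
toy_dims = {"length": 1, "score": 1, "time": 1, "word_vecs": 3}
toy_comment = {
    "type": "comment",
    "length": 12,
    "score": 4,
    "time": 37.5,
    "word_vecs": [0.25, -0.5, 0.125],
}

print(flatten_features(toy_comment, toy_dims, ["score", "time", "length", "word_vecs"]))
```

The caller picks the feature names explicitly, so attributes whose value type is not documented here (such as `subreddit`) can simply be left out.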

In [5]:
# let's access the node for a random user
# and get all comments and posts that this user made
# (list() keeps this working if successors() returns an iterator)
user_out_nodes = list(politics_net.successors("RedSquirrelFtw"))
print(user_out_nodes)

['cejaksn']


In [6]:
# this user made only one comment... but I think you get the picture
# e.g., we could access the attributes for this comment
print(politics_net.node[user_out_nodes[0]])

{'word_vecs': array([  2.45881882e-02,  -8.85956455e-03,   4.07702522e-03,
        -3.59144271e-03,  -5.35505451e-03,   3.04689351e-03,
        -2.86572031e-05,   6.46826986e-04,   3.98649042e-03,
        -3.48688639e-03,   3.40964980e-02,   3.39702074e-03,
        -2.66911592e-02,   9.43523180e-03,  -2.05980968e-02,
        -2.33542006e-02,  -2.23564263e-02,  -4.97682840e-02,
         2.15058471e-03,   6.99266186e-03,  -1.03599476e-02,
        -3.42106936e-03,  -1.32135861e-03,  -3.16169374e-02,
         1.49107622e-02,   4.38282602e-02,  -1.15861988e-03,
        -5.54729579e-03,  -6.17341464e-03,   1.52532337e-02,
         1.30888699e-02,   1.42863719e-02,   5.32958051e-03,
         6.43259101e-03,  -2.33824583e-04,  -1.21295080e-02,
        -4.83304122e-03,  -6.96073147e-03,  -6.06134126e-04,
         1.71746537e-02,   8.68919492e-03,   1.78009700e-02,
         1.27696199e-02,  -1.09810466e-02,  -3.44701274e-03,
        -4.43779491e-03,  -2.83656735e-03,  -1.35982307e-02,
         9

There is still a lot of graph management left unspecified (e.g., what's the best way to get all nodes of a certain type), but I figure this is just networkx/bookkeeping and doesn't need to be baked into the representation.
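
One possible answer to the "all nodes of a certain type" question is to build an index once and reuse it. A minimal sketch, assuming the input is the sequence of `(node, attrs)` pairs that `politics_net.nodes(data=True)` yields; `index_nodes_by_type` is a hypothetical helper, not something `redditnetwork` or networkx provides:

```python
from collections import defaultdict


def index_nodes_by_type(nodes):
    """Group node ids by their "type" attribute in a single pass."""
    by_type = defaultdict(list)
    for node, attrs in nodes:
        by_type[attrs["type"]].append(node)
    return dict(by_type)


# Toy data standing in for politics_net.nodes(data=True).
toy_nodes = [
    ("alice", {"type": "user"}),
    ("bob", {"type": "user"}),
    ("p1", {"type": "post"}),
    ("c1", {"type": "comment"}),
]

nodes_by_type = index_nodes_by_type(toy_nodes)
print(sorted(nodes_by_type["user"]))
```

The index is a plain dict, so it lives alongside the graph rather than inside it, which is in keeping with treating this as bookkeeping.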

## We can also extract networks for multiple subreddits

In [1]:
from redditnetwork.network_extractor import extract_week_network_multisubreddits

In [2]:
multi_test = extract_week_network_multisubreddits(["politics", "Libertarian"], 2014, 2)

Processed 57517 comments, of which 11175 were removed for missing post and 8704 for missing parent


In [4]:
len([node for node in multi_test.nodes(data=True) if node[1]["type"] == "post"])

11196
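
With several subreddits in one graph, it can be useful to see how the posts split between them. A minimal sketch, assuming each post node's `subreddit` attribute holds a hashable identifier for its subreddit (the attribute appears in the feature listing earlier in this notebook, but its value format is not documented here, so the strings below are a guess). `count_posts_per_subreddit` and the toy data are hypothetical:

```python
from collections import Counter


def count_posts_per_subreddit(nodes):
    """Tally post nodes by their "subreddit" attribute (assumed present)."""
    return Counter(
        attrs["subreddit"] for _, attrs in nodes if attrs["type"] == "post"
    )


# Toy data standing in for multi_test.nodes(data=True).
toy_nodes = [
    ("p1", {"type": "post", "subreddit": "politics"}),
    ("p2", {"type": "post", "subreddit": "Libertarian"}),
    ("p3", {"type": "post", "subreddit": "politics"}),
    ("c1", {"type": "comment", "subreddit": "politics"}),
    ("alice", {"type": "user"}),
]

counts = count_posts_per_subreddit(toy_nodes)
print(sorted(counts.items()))
```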