## Extracting a multilayer network from Reddit data

### First we load the network and check some basic stats

In [39]:
# We will use the /r/politics subreddit as the running example
# We extract a network for this subreddit, corresponding to the first week of 2014
from redditnetwork.network_extractor import extract_week_network
import networkx as nx
politics_net = extract_week_network("politics", 2014, 1)

Processed 45097 comments, of which 12378 were removed for missing post and 5888 for missing parent


Ignore the warning about using the week argument instead of month; it is an internal complication that arises because the data is stored at the monthly level but accessed by week.

Once processing finishes, the extractor reports how many comments it processed and how many it removed because their parent or post was missing (e.g., they were replying to an old post from an earlier week).

The returned object is a networkx DiGraph (directed graph).

In [11]:
## some basic stats: count the nodes of each type in a single pass
from collections import Counter
type_counts = Counter(data["type"] for _, data in politics_net.nodes(data=True))
print("There are {:d} users, {:d} comments, and {:d} posts in the graph"
      .format(type_counts["user"], type_counts["comment"], type_counts["post"]))

There are 8992 users, 26832 comments, and 2368 posts in the graph


### Okay, and now some details on the data 

The underlying structure is a directed graph (DiGraph) and additional information is stored as node and edge attributes.

#### Node types

Every node has a "type" attribute that is one of "user", "comment", or "post".
Users are indexed by their username, and posts and comments by unique string ids.

#### Edge types

Every edge has a "type" attribute as well, which is one of the following:
* "user_post": a directed edge from a user to a post they made.
* "user_comment": a directed edge from a user to a comment they made.
* "post_comment": a directed edge from a post to a top-level comment in that post.
* "comment_comment": a directed edge from a comment to a comment that replies to it. 
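
The edge directions above mean that a whole comment thread can be recovered by starting at a post and following successors. The sketch below builds a toy graph with made-up ids (alice, p1, c1, and so on are hypothetical, as is the thread_comments helper) and walks the "post_comment" and "comment_comment" edges depth-first. It assumes only the "type" attributes described above.

```python
import networkx as nx

# Toy graph with hypothetical ids; the real graph comes from extract_week_network.
g = nx.DiGraph()
g.add_node("alice", type="user")
g.add_node("bob", type="user")
g.add_node("p1", type="post")
g.add_node("c1", type="comment")
g.add_node("c2", type="comment")
g.add_edge("alice", "p1", type="user_post")        # alice made post p1
g.add_edge("bob", "c1", type="user_comment")       # bob wrote comment c1
g.add_edge("alice", "c2", type="user_comment")     # alice wrote comment c2
g.add_edge("p1", "c1", type="post_comment")        # c1 is a top-level comment on p1
g.add_edge("c1", "c2", type="comment_comment")     # c2 replies to c1


def thread_comments(graph, post_id):
    """Return every comment under a post by following reply edges depth-first."""
    found = []
    stack = [post_id]
    while stack:
        node = stack.pop()
        for child in graph.successors(node):
            if graph[node][child]["type"] in ("post_comment", "comment_comment"):
                found.append(child)
                stack.append(child)
    return found


print(thread_comments(g, "p1"))  # ['c1', 'c2']
```

Because user nodes only have outgoing edges, the walk never wanders from a thread back into a user's other activity.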

#### Node attributes/features

Comment nodes and post nodes also have additional features/attributes (which can be listed by running politics_net.graph; see the example below). User nodes currently have no features (besides those that are implicit in the graph structure).

##### Comment features
* score: the score that the comment received
* time: when the comment was made during the week (hour offset from 12:00am on Monday of that week)
* post_time_offset: how old the post was when the comment was made, in hours (this appears to be the feature listed as time_offset in the politics_net.graph output below)
* length: number of words in the comment
* word_vec: 300-dimensional vector embedding of the comment (average of GloVe vectors)
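
As a rough illustration of the two time features, the sketch below computes an hour offset from the start of a week and the age of a post at comment time. The timestamps and the hours_since helper are hypothetical, and the week boundary used here (Monday 2014-01-06, 12:00am) is only an assumption for the example; the source does not state how the package defines week boundaries.

```python
from datetime import datetime


def hours_since(start, moment):
    """Elapsed time from start to moment, in (fractional) hours."""
    return (moment - start).total_seconds() / 3600.0


# Hypothetical timestamps, chosen only for illustration.
week_start = datetime(2014, 1, 6)             # a Monday, 12:00am
post_made = datetime(2014, 1, 8, 9, 0)        # Wednesday, 9:00am
comment_made = datetime(2014, 1, 8, 14, 30)   # Wednesday, 2:30pm

time_feature = hours_since(week_start, comment_made)     # 62.5
post_time_offset = hours_since(post_made, comment_made)  # 5.5
print(time_feature, post_time_offset)
```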

##### Post features
* score: the score that the post received
* time: when the post was made during the week (hour offset from 12:00am on Monday of that week)
* length: number of words in the title
* word_vec: vector embedding of the post title (average of GloVe vectors)
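
To make "average of GloVe vectors" concrete, here is a minimal sketch with a made-up 3-dimensional lookup table standing in for the real 300-dimensional GloVe vectors. The tokenization (lowercasing and splitting on whitespace), the skipping of unknown words, and the zero-vector fallback are all assumptions for illustration; the source does not say how the package handles these cases.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; real GloVe vectors have 300 dimensions.
toy_vectors = {
    "tax": np.array([1.0, 0.0, 2.0]),
    "cuts": np.array([3.0, 2.0, 0.0]),
}


def average_embedding(text, vectors, dim):
    """Average the vectors of the known words in text (zero vector if none are known)."""
    words = text.lower().split()
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


title = "Tax cuts now"
word_vec = average_embedding(title, toy_vectors, dim=3)  # "now" is skipped as unknown
length = len(title.split())                               # word count, as in the length feature
print(word_vec, length)
```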

*NOTE THAT NONE OF THESE FEATURES ARE THE "LABELS" WE WANT TO PREDICT.* That data is stored elsewhere for now because I don't want to clutter the network representations and because the "labels" are in flux. See the bottom of this notebook for an example of how to get the labels for predictions.

In [43]:
# this prints info about what features there are and the dimensionality of these features
politics_net.graph

{'comment_feats': {'length': 1,
  'score': 1,
  'time': 1,
  'time_offset': 1,
  'word_vec': 300},
 'post_feats': {'score': 1, 'time': 1, 'word_vec': 300},
 'user_feats': {}}
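
One use for this metadata, assuming the features are later flattened into a single vector per node (a downstream choice, not something the package is stated to do), is to work out the vector width for each node type. The sketch below copies the dictionary printed above; total_dims is a hypothetical helper.

```python
# Copy of the metadata printed by politics_net.graph above.
feature_dims = {
    "comment_feats": {"length": 1, "score": 1, "time": 1, "time_offset": 1, "word_vec": 300},
    "post_feats": {"score": 1, "time": 1, "word_vec": 300},
    "user_feats": {},
}


def total_dims(meta):
    """Sum the per-feature dimensionalities for each node type."""
    return {group: sum(dims.values()) for group, dims in meta.items()}


totals = total_dims(feature_dims)
print(totals["comment_feats"], totals["post_feats"], totals["user_feats"])  # 304 302 0
```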

In [42]:
# let's access the node for a random user
# and get all comments and posts that this user made
user_out_nodes = list(politics_net.successors("RedSquirrelFtw"))
print(user_out_nodes)

['cejaksn']


In [45]:
# this user made only one comment... but I think you get the picture
# e.g., we could access the attributes for this comment 
print(politics_net.node[user_out_nodes[0]])

{'type': 'comment', 'time': 196.05277777777778, 'length': 26, 'score': 28, 'word_vecs': array([  4.01195958e-02,  -1.20655401e-02,   1.38737066e-02,
        -5.05111087e-03,   5.46537992e-03,   1.39144640e-02,
        -1.47369644e-03,  -1.15743780e-03,   1.36278768e-03,
        -6.99382462e-03,   3.86757664e-02,  -1.19612238e-03,
        -5.26179001e-02,   1.56097841e-02,  -3.74819189e-02,
        -4.27772626e-02,  -5.47819696e-02,  -8.62767547e-02,
        -4.51631006e-03,   1.23882452e-02,  -2.33349986e-02,
        -2.14756629e-03,   3.54487333e-03,  -4.24631499e-02,
         3.80612463e-02,   8.60011578e-02,   7.25141494e-03,
        -7.39442371e-03,  -1.87714286e-02,   2.38258410e-02,
         1.63386296e-02,   5.72118908e-02,  -1.17002837e-02,
         1.87530424e-02,   5.77666052e-03,  -4.24845144e-02,
        -1.39498822e-02,  -9.69437137e-03,   1.16598764e-02,
         2.97847353e-02,   1.72986519e-02,   4.32692170e-02,
         3.43013853e-02,  -1.29871527e-02,  -1.82183404e-0

There is still a lot of graph management that is left unspecified (e.g., what's the best way to get all nodes of a certain type), but I figure this is just networkx bookkeeping and doesn't need to be baked into the representation.
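
As one possible approach to that bookkeeping, the sketch below filters nodes by their "type" attribute and separates a user's comments from their posts by edge type. The helpers nodes_of_type and authored are hypothetical and not part of redditnetwork, and the toy graph uses made-up ids.

```python
import networkx as nx


def nodes_of_type(graph, node_type):
    """List the ids of all nodes whose "type" attribute equals node_type."""
    return [n for n, data in graph.nodes(data=True) if data.get("type") == node_type]


def authored(graph, user, edge_type):
    """List what a user made, e.g. edge_type="user_comment" or "user_post"."""
    return [n for n in graph.successors(user) if graph[user][n]["type"] == edge_type]


# Toy graph with hypothetical ids, standing in for politics_net.
g = nx.DiGraph()
g.add_node("alice", type="user")
g.add_node("p1", type="post")
g.add_node("c1", type="comment")
g.add_node("c2", type="comment")
g.add_edge("alice", "p1", type="user_post")
g.add_edge("alice", "c2", type="user_comment")
g.add_edge("p1", "c1", type="post_comment")
g.add_edge("c1", "c2", type="comment_comment")

print(sorted(nodes_of_type(g, "comment")))   # ['c1', 'c2']
print(authored(g, "alice", "user_comment"))  # ['c2']
```

Indexing edges as graph[u][v] rather than through newer view objects keeps the helpers usable across networkx versions.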

### What about labels?

The attributes in the provided graph are all *features* but we also need some labels to make predictions.
For now, the label data is stored elsewhere.

To get the labels, use the following approach:

In [56]:
import pandas as pd
from redditnetwork import constants
# the user_scores folder in the data directory contains the info we want 
#(i.e., info about the future comment scores obtained by users)
# we load this data in using pandas.
# NOTE that we specify the subreddit we want and the week we are making predictions from (week 1)
# these values should match the extract_week_network parameters
future = pd.read_csv(constants.DATA_HOME + "user_scores/{}_2014_wf{:02d}.csv".format("politics", 1))
future = future.set_index("user")
future

| user | count | sum | max | above_one | above_one_sum | mean | median | std | k_index |
|---|---|---|---|---|---|---|---|---|---|
| kittentitten | 2 | 3 | 2 | 1 | 2 | 1.500000 | 1.5 | 0.500000 | 2 |
| therealtman | 1 | 1 | 1 | 0 | 0 | 1.000000 | 1.0 | 0.000000 | 1 |
| sn00gan | 1 | -13 | -13 | 0 | 0 | -13.000000 | -13.0 | 0.000000 | 0 |
| RedSquirrelFtw | 6 | 4 | 2 | 1 | 2 | 0.666667 | 1.0 | 1.247219 | 1 |
| Metalcamra | 1 | 1 | 1 | 0 | 0 | 1.000000 | 1.0 | 0.000000 | 1 |
| Transfatcarbokin | 2 | 41 | 40 | 1 | 40 | 20.500000 | 20.5 | 19.500000 | 1 |
| bmwnut | 1 | -1 | -1 | 0 | 0 | -1.000000 | -1.0 | 0.000000 | 0 |
| becausefahq | 1 | -8 | -8 | 0 | 0 | -8.000000 | -8.0 | 0.000000 | 0 |
| cellardweller1234 | 1 | 7 | 7 | 1 | 7 | 7.000000 | 7.0 | 0.000000 | 1 |
| Pandaro81 | 1 | 29 | 29 | 1 | 29 | 29.000000 | 29.0 | 0.000000 | 1 |


The above should print out the contents of a pandas data frame. Each column is a different summary statistic of the user's future comment scores. So, for example, the cell below gets the sum of the future comment scores for the user RedSquirrelFtw:

In [58]:
future.loc["RedSquirrelFtw"]["sum"]

4.0

Trying to predict the sum is totally reasonable, but to make things a bit easier, I recommend turning it into a binary task where we try to predict whether a user will be in the top 10% in terms of their future comment scores:

In [60]:
import numpy as np
# subtract a tiny epsilon so that a user exactly at the 90th percentile gets -1 rather than 0
future["label"] = np.sign(future["sum"] - np.percentile(future["sum"], 90) - 1e-10)
# (do)

In [61]:
# the "future" dataframe now has a "label" column that is either 1 or -1.
future

| user | count | sum | max | above_one | above_one_sum | mean | median | std | k_index | label |
|---|---|---|---|---|---|---|---|---|---|---|
| kittentitten | 2 | 3 | 2 | 1 | 2 | 1.500000 | 1.5 | 0.500000 | 2 | -1 |
| therealtman | 1 | 1 | 1 | 0 | 0 | 1.000000 | 1.0 | 0.000000 | 1 | -1 |
| sn00gan | 1 | -13 | -13 | 0 | 0 | -13.000000 | -13.0 | 0.000000 | 0 | -1 |
| RedSquirrelFtw | 6 | 4 | 2 | 1 | 2 | 0.666667 | 1.0 | 1.247219 | 1 | -1 |
| Metalcamra | 1 | 1 | 1 | 0 | 0 | 1.000000 | 1.0 | 0.000000 | 1 | -1 |
| Transfatcarbokin | 2 | 41 | 40 | 1 | 40 | 20.500000 | 20.5 | 19.500000 | 1 | -1 |
| bmwnut | 1 | -1 | -1 | 0 | 0 | -1.000000 | -1.0 | 0.000000 | 0 | -1 |
| becausefahq | 1 | -8 | -8 | 0 | 0 | -8.000000 | -8.0 | 0.000000 | 0 | -1 |
| cellardweller1234 | 1 | 7 | 7 | 1 | 7 | 7.000000 | 7.0 | 0.000000 | 1 | -1 |
| Pandaro81 | 1 | 29 | 29 | 1 | 29 | 29.000000 | 29.0 | 0.000000 | 1 | -1 |
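
The top-10% thresholding can be sanity-checked on a small toy array. The sketch below uses made-up score sums (not taken from the real data) and np.where with a strict comparison instead of np.sign, so a user sitting exactly at the 90th percentile gets -1 without needing an epsilon. This is an alternative formulation for illustration, not the notebook's own code.

```python
import numpy as np

# Hypothetical future score sums for ten users (not taken from the real data).
sums = np.array([5, 2, -3, 0, 12, 40, 1, 7, -8, 100])

# 90th percentile of the sums; values strictly above it count as "top 10%".
threshold = np.percentile(sums, 90)
labels = np.where(sums > threshold, 1, -1)

print(threshold)
print(labels)  # only the last user (sum of 100) is labelled 1
```

With ten users, only the single highest scorer clears the threshold, which is a quick way to confirm that the labels are not all identical.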
