# Part 3: Building the GME redditors network

Ok, enough with theory :) It is time to go back to our cool dataset it took us so much pain to download! And guess what? We will build the network of GME Redditors. Then, we will use some Network Science to study some of its properties.


> 
> *Exercise*: Build the network of Redditors discussing about GME on r\wallstreetbets. In this network, nodes correspond to authors of comments, and a direct link going from node _A_ to node _B_ exists if _A_ ever answered a submission or a comment by _B_. The weight on the link corresponds to the number of times _A_ answered _B_. You can build the network as follows:
>
> 1. Open the _comments dataset_ and the _submission datasets_ (the first contains all the comments and the second cointains all the submissions) and store them in two Pandas DataFrames.
> 2. Create three dictionaries, using the command ``dict(zip(keys,values))``, where keys and values are columns in your dataframes. The three dictionaries are the following:
>     * __comment_authors__: (_comment id_, _comment author_)
>     * __parent__:  (_comment id_ , _parent id_)
>     * __submission_authors__: (_submission id_, _submission author_)
>
> where above I indicated the (key, value) tuples contained in each dictionary.
>
> 3. Create a function that take as input a _comment id_ and outputs the author of its parent. The function does two things:
>     * First, it calls the dictionary __parent__, to find the _parent id_ of the comment identified by a given _comment id_. 
>     * Then, it finds the author of  _parent id_. 
>          * if the _parent id_ starts with "t1_", call the __comment_authors__ dictionary (for key=parent_id[3:])
>          * if the _parent id_ starts with "t3_", call the __submission_authors__ dictionars (for key=parent_id[3:])
>
> where by parent_id[3:], I mean that the first three charachters of the _parent id_ (either "t1_" or "t3_" should be ingnored).
>
> 4. Apply the function you created in step 3. to all the comment ids in your comments dataframe. Store the output in a new column, _"parent author"_, of the comments dataframe. 
> 5. For now, we will focus on the genesis of the GME community on Reddit, before all the hype started and many new redditors jumped on board. For this reason, __filter all the comments written before Dec 31st, 2020__. Also, remove deleted users by filtering all comments whose author or parent author is equal to "[deleted]". 
> 6. Create the weighted edge-list of your network as follows: consider all comments (after applying the filtering step above), groupby ("_author_", _"parent author"_) and count. 
> 7. Create a [``DiGraph``](https://networkx.org/documentation/stable//reference/classes/digraph.html) using networkx. Then, use the networkx function [``add_weighted_edges_from``](https://networkx.org/documentation/networkx-1.9/reference/generated/networkx.DiGraph.add_weighted_edges_from.html) to create a weighted, directed, graph starting from the edgelist you created in step 5.

### Imports

In [2]:
import os
import pandas as pd
import networkx as nx
import matplotlib.pylab as plt

### 1) Dataframes

In [3]:
submissions = pd.read_csv(os.path.join('Data', 'wallstreetbets_submissions.csv'))

In [35]:
import datetime
def dateparse (time_in_secs):    
    return datetime.datetime.fromtimestamp(float(time_in_secs))

comments = pd.read_csv(os.path.join('Data', 'wallstreetbets_comments.csv'), parse_dates=['created'], date_parser=dateparse)

### 2) Dictionaries

In [8]:
comment_authors = dict(zip(comments.id, comments.author))
parent = dict(zip(comments.id, comments.parent_id))
submission_authors = dict(zip(submissions.id, submissions.author))

### 3) Comment id to Parent Author Function

In [22]:
def comment_to_parent_author(comment_id):
    try:
        pid = parent[comment_id]
        if pid.startswith('t1_'):
            return comment_authors[pid[3:]] #ignore the first 3 character
        elif pid.startswith('t3_'):
            return submission_authors[pid[3:]]
    except:
        return None

### 4) Parent author Comments Dataframe

In [45]:
comments['parent_author'] = comments.id.apply(comment_to_parent_author)

### 5) Filter comments, with authors and before hype

In [51]:
import datetime as dt
filtered_comments = comments[comments.created < dt.datetime.strptime('31 December, 2020', '%d %B, %Y')]
filtered_comments = filtered_comments[filtered_comments.author is not None and filtered_comments.author != 'deleted']
filtered_comments = filtered_comments[filtered_comments.parent_author is not None and filtered_comments.parent_author != 'deleted']
len(filtered_comments)

85958

### 6) Weighted edge list

In [53]:
weighted_edge_list = filtered_comments.groupby(['author', 'parent_author']).count()

### 7) Directed graph 

In [62]:
G = nx.DiGraph()
G.add_nodes_from(filtered_comments.author)
G.add_edges_from([(a,pa, {'weight': w}) for a,pa,w in zip(filtered_comments.author, filtered_comments.parent_author, filtered_comments.id)])

# Part 4: Preliminary analysis of the GME redditors network

We begin with a preliminary analysis of the network.

> 
> *Exercise: Basic Analysis of the Redditors Network*
> * Why do you think I want you guys to use a _directed_ graph? Could have we used an undirected graph instead?
> * What is the total number of nodes in the network? What is the total number of links? What is the density of the network (the total number of links over the maximum number of links)?
> * What are the average, median, mode, minimum and maximum value of the in-degree (number of incoming edges per redditor)? And of the out-degree (number of outgoing edges per redditor)? How do you intepret the results?
> * List the top 5 Redditors by in-degree and out-degree. What is their average score over time? At which point in time did they join the discussion on GME? When did they leave it?
> * Plot the distribution of in-degrees and out-degrees, using a logarithmic binning (see last week's exercise 4). 
> * Plot a scatter plot of the the in- versus out- degree for all redditors. Comment on the relation between the two.
> * Plot a scatter plot of the the in- degree versus average score for all redditors. Comment on the relation between the two.


### Bullet Point 1)
I think, you want us to see the child parent relation ship. For instance it could show that some people are "first movers" in the comments and other are "followers", who mainly comment on the "first movers" postings

### Bullet Point 2)

In [63]:
# Total number of nondes
len(G.nodes)

24198

In [64]:
# Totale number of links
len(G.edges)

67480

In [66]:
n = len(G.nodes)
max_link_count = n*(n-1)/2
actual_link_count = len(G.edges)
density = actual_link_count/max_link_count
print(f'Network density: {density}')

Network density: 0.00023049636069371248


### Bullet Point 3)

In [79]:
import numpy as np
in_degrees = list(dict(G.in_degree).values())
out_degrees = list(dict(G.out_degree).values())

In [87]:
import numpy as np
from scipy import stats

def degree_calc(deg_type:str, degrees):
    print(f'Average {deg_type}: {np.mean(degrees):.3}')
    print(f'Median {deg_type}: {np.median(degrees):.3}')
    print(f'{deg_type} mode: {stats.mode(degrees).mode[0]}')
    print(f'Min {deg_type}: {min(degrees)}')
    print(f'Max {deg_type}: {max(degrees)}')

In [88]:
degree_calc('in-degree', in_degrees)

Average in-degree: 2.79
Median in-degree: 0.0
in-degree mode: 0
Min in-degree: 0
Max in-degree: 5322


In [89]:
degree_calc('out-degree', out_degrees)

Average out-degree: 2.79
Median out-degree: 1.0
out-degree mode: 1
Min out-degree: 0
Max out-degree: 1496


* From the median an mode it seems like it is more typical that people do not get comments on what they post (in-degree median and mode is 0). While the out-degree median and mode is 1. This could mean that mostly people make a single comment and they do it on content created by few people.

* Also we can see that at least one author gets a lot of comments and at least one author is very active commenting as the max in/out degrees are thousand of degrees higher than the min, mode, median, and average

### Bullet Point 4)

### Bullet Point 5)

### Bullet Point 6)

### Bullet Point 7)