# Instructions

- Import the NetworkX library.

- Create a directed graph object using the NetworkX library and add all rows as edges to the graph. Attach the ratings to the edges as weight.

- Print out the total node count in the network object.

- Start generating features in a DataFrame.
    - Create a DataFrame named feature_df and add the average rating of each target user as the first feature. Each row of this DataFrame represents one target user; these rows will be the instances of our clustering model in the next milestone.
    - In a directed network (in other words, a directed graph), the number of inbound edges to a node is called its in-degree and the number of outbound edges is called its out-degree. In the previous step, we calculated the average rating of each target user; call this simply the user's average rating. In this step, for each target user, calculate the average of the inbound users' average ratings and add it as a second feature to feature_df. If a user has no inbound users, assign 0.
    - As the third and fourth features, add the in-degree and out-degree of each target user to feature_df.
    - As a final feature, calculate the PageRank value of each target user and add it to feature_df.
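
To make the feature definitions above concrete, here is a minimal worked sketch in plain Python on a hypothetical four-edge toy network. The edge values and the helper name `avg_rating_inbound_users` are illustrative assumptions, not part of the dataset. Inbound users who were never rated themselves have no average rating and are skipped, which matches how the notebook below handles them.

```python
from collections import defaultdict

# Hypothetical toy edge list: (source, target, rating)
edges = [(1, 2, 4), (3, 2, 2), (2, 3, 5), (1, 3, -1)]

received = defaultdict(list)  # target -> ratings received
raters = defaultdict(set)     # target -> inbound users (who rated the target)
rated = defaultdict(set)      # source -> outbound users (whom the source rated)

for source, target, rating in edges:
    received[target].append(rating)
    raters[target].add(source)
    rated[source].add(target)

# First feature: average rating received by each target user.
avg_rating = {user: sum(r) / len(r) for user, r in received.items()}

# Third and fourth features: in-degree and out-degree.
in_degree = {user: len(users) for user, users in raters.items()}
out_degree = {user: len(users) for user, users in rated.items()}


def avg_rating_inbound_users(user):
    # Second feature: mean of the inbound users' own average ratings.
    # Inbound users without an average rating are skipped; 0 if none remain.
    known = [avg_rating[u] for u in raters.get(user, ()) if u in avg_rating]
    return sum(known) / len(known) if known else 0


for user in sorted(avg_rating):
    print(user, avg_rating[user], avg_rating_inbound_users(user))
```

In this toy network, user 2 is rated by users 1 and 3, but only user 3 has an average rating of its own, so the second feature for user 2 is just user 3's average.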

## Import libraries

In [1]:
import pandas as pd
import seaborn as sb
import networkx as nx
from datetime import datetime
import matplotlib.pyplot as plt

## Create a directed graph

### Load data

In [2]:
df = (
    pd
    .read_csv('soc-sign-bitcoinotc.csv', names=['source', 'target', 'rating', 'time_epoch'])
    .assign(time=lambda df: df.time_epoch.apply(datetime.fromtimestamp))
)
df.head()

| | source | target | rating | time_epoch | time |
|---|---|---|---|---|---|
| 0 | 6 | 2 | 4 | 1289242000.0 | 2010-11-08 19:45:11.728360 |
| 1 | 6 | 5 | 2 | 1289242000.0 | 2010-11-08 19:45:41.533780 |
| 2 | 1 | 15 | 1 | 1289243000.0 | 2010-11-08 20:05:40.390490 |
| 3 | 4 | 3 | 7 | 1289245000.0 | 2010-11-08 20:41:17.369750 |
| 4 | 13 | 16 | 8 | 1289254000.0 | 2010-11-08 23:10:54.447460 |


### Prepare data to create graph

#### Nodes

In [3]:
len(df.source.unique().tolist()), len(df.target.unique().tolist())

(4814, 5858)

In [4]:
nodes = set(df.source.unique().tolist() + df.target.unique().tolist())
len(nodes)

5881

#### Edges

In [5]:
edges = (
    df
    [['source', 'target', 'rating']]
    .assign(rating=lambda df: df.rating.apply(lambda x: {'rating': x}))
    .to_records(index=False)
)
edges[:2]

rec.array([(6, 2, {'rating': 4}), (6, 5, {'rating': 2})],
          dtype=[('source', '<i8'), ('target', '<i8'), ('rating', 'O')])

In [6]:
len(edges)

35592

In [7]:
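# TODO: use this instead of creating the nodes and edges lists by hand
#       (create_using=nx.DiGraph is needed, otherwise the result is an undirected Graph)
# nx.from_pandas_edgelist(df, source='source', target='target', edge_attr='rating', create_using=nx.DiGraph)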
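
The TODO above could be sketched as follows. This is a minimal example on a hypothetical three-row frame (`toy_df` and `toy_G` are illustrative names, not objects from this notebook), assuming the same `source`, `target`, and `rating` column layout.

```python
import networkx as nx
import pandas as pd

# Toy rows in the same column layout as the notebook's df (values are illustrative).
toy_df = pd.DataFrame({
    "source": [6, 6, 1],
    "target": [2, 5, 15],
    "rating": [4, 2, 1],
})

# create_using=nx.DiGraph keeps the edge direction;
# the default would build an undirected Graph.
toy_G = nx.from_pandas_edgelist(
    toy_df,
    source="source",
    target="target",
    edge_attr="rating",
    create_using=nx.DiGraph,
)

print(toy_G.number_of_nodes())     # 5 distinct users appear in the toy rows
print(toy_G.has_edge(6, 2), toy_G.has_edge(2, 6))
```

Nodes are created implicitly from the edge endpoints, so a separate `add_nodes_from` call would not be needed with this approach.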

### Build the directed graph

In [8]:
G = nx.DiGraph()

G.add_nodes_from(nodes)
G.add_edges_from(edges)

In [9]:
len(G.nodes)

5881

## Create features

In [10]:
feature_df = df.copy()

### Average rating for each target user

In [11]:
feature_df = (
    feature_df
    .groupby('target')
    [['rating']]
    .mean()
    .reset_index()
    .rename(columns={'rating': 'avg_rating'})
)
feature_df.head()

| | target | avg_rating |
|---|---|---|
| 0 | 1 | 3.544248 |
| 1 | 2 | 3.0 |
| 2 | 3 | -0.285714 |
| 3 | 4 | 3.111111 |
| 4 | 5 | 2.333333 |


### Average of inbound users' average ratings for each target user
For each target user, take the users who rated them (the inbound users, i.e. the node's predecessors in the graph), look up each of those users' own average rating from the previous step, and average these values. If there are no inbound users, assign 0. In this notebook, inbound users who were never rated themselves have no average rating and are left out of the average.

#### Add avg. ratings as node attribute

In [12]:
nodes_with_attributes = (
    feature_df
    .assign(attr=lambda df: df.avg_rating.apply(lambda x: {'avg_rating': x}))
    .drop(columns='avg_rating')
    .to_records(index=False)
)

In [13]:
# this will only update the already existing nodes with the attribute information
G.add_nodes_from(nodes_with_attributes)

#### Calculate the new feature

In [14]:
# TODO: check if there's another way to implement this using some feature from NetworkX
#       it feels like there should be a more straightforward way to do something like this

# TODO: reimplement as simple pandas operations

In [15]:
def get_avg_rating_of_all_predecessors(node, graph, feature_df):
    """Return the mean of the predecessors' own average ratings, or 0 if none is available."""
    predecessors = list(graph.predecessors(node))
    # only predecessors that were rated themselves have a row in feature_df
    predecessors_info = feature_df[feature_df.target.isin(predecessors)]['avg_rating']

    # 0 is the value when there are no inbound users with an average rating
    if predecessors_info.empty:
        return 0

    return predecessors_info.mean()

In [16]:
%%time
avg_rating_inbound_users = feature_df.target.apply(
    lambda node: get_avg_rating_of_all_predecessors(node, G, feature_df))
feature_df['avg_rating_inbound_users'] = avg_rating_inbound_users

feature_df.head()

CPU times: user 6.54 s, sys: 42.2 ms, total: 6.58 s
Wall time: 6.83 s


| | target | avg_rating | avg_rating_inbound_users |
|---|---|---|---|
| 0 | 1 | 3.544248 | 1.640546 |
| 1 | 2 | 3.0 | 1.73565 |
| 2 | 3 | -0.285714 | 2.819381 |
| 3 | 4 | 3.111111 | 1.812079 |
| 4 | 5 | 2.333333 | 2.591068 |
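
The TODO about reimplementing this step with plain pandas operations could look roughly like the sketch below. It runs on a hypothetical toy frame (`toy_df`, `toy_features`, `rater_avg`, and `inbound_avg` are illustrative names) and has not been timed against the full dataset, so treat it as a starting point, not a verified replacement.

```python
import pandas as pd

# Toy data in the notebook's column layout (values are illustrative).
toy_df = pd.DataFrame({
    "source": [1, 3, 2, 1],
    "target": [2, 2, 3, 3],
    "rating": [4, 2, 5, -1],
})

# First feature: average rating received by each target user.
toy_features = (
    toy_df.groupby("target")["rating"].mean()
    .rename("avg_rating")
    .reset_index()
)

# Look up each rater's own average rating (NaN when the rater was never rated).
rater_avg = toy_df["source"].map(toy_features.set_index("target")["avg_rating"])

# Average those values per target. mean() skips NaN, and a group with
# only NaN values yields NaN, which is replaced by 0.
inbound_avg = (
    toy_df.assign(rater_avg=rater_avg)
    .drop_duplicates(["source", "target"])  # a DiGraph keeps one edge per pair
    .groupby("target")["rater_avg"].mean()
    .fillna(0)
)

toy_features["avg_rating_inbound_users"] = toy_features["target"].map(inbound_avg)
print(toy_features)
```

The lookup and the grouped mean are both vectorized, so this avoids filtering `feature_df` once per target user as the `apply`-based version does.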


### Add the in-degree and out-degree as new features

In [17]:
feature_df['in_degree'] = feature_df.target.apply(G.in_degree)
feature_df['out_degree'] = feature_df.target.apply(G.out_degree)

In [18]:
feature_df.head()

| | target | avg_rating | avg_rating_inbound_users | in_degree | out_degree |
|---|---|---|---|---|---|
| 0 | 1 | 3.544248 | 1.640546 | 226 | 215 |
| 1 | 2 | 3.0 | 1.73565 | 41 | 45 |
| 2 | 3 | -0.285714 | 2.819381 | 21 | 0 |
| 3 | 4 | 3.111111 | 1.812079 | 54 | 63 |
| 4 | 5 | 2.333333 | 2.591068 | 3 | 3 |
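
As an alternative to calling `G.in_degree` once per row, the degrees of all nodes can be collected into dictionaries first and then mapped onto the column. The sketch below shows this on a hypothetical toy graph; `toy_G` and `toy_features` are illustrative names.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy graph with four directed edges.
toy_G = nx.DiGraph()
toy_G.add_edges_from([(1, 2), (3, 2), (2, 3), (1, 3)])

toy_features = pd.DataFrame({"target": [2, 3]})

# Called without arguments, in_degree() and out_degree() yield
# (node, degree) pairs for every node, which dict() turns into a lookup table.
in_deg = dict(toy_G.in_degree())
out_deg = dict(toy_G.out_degree())

toy_features["in_degree"] = toy_features["target"].map(in_deg)
toy_features["out_degree"] = toy_features["target"].map(out_deg)
print(toy_features)
```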


### Calculate the page rank value as a new feature

In [19]:
%%time
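# note: nx.pagerank_numpy was removed in NetworkX 3.0; on newer versions call nx.pagerank(G) instead
pageranks = nx.pagerank_numpy(G)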

CPU times: user 2min 14s, sys: 2.17 s, total: 2min 16s
Wall time: 1min 29s


In [20]:
feature_df['page_rank'] = feature_df.target.apply(lambda x: pageranks[x])

In [21]:
feature_df.head()

| | target | avg_rating | avg_rating_inbound_users | in_degree | out_degree | page_rank |
|---|---|---|---|---|---|---|
| 0 | 1 | 3.544248 | 1.640546 | 226 | 215 | 0.005028 |
| 1 | 2 | 3.0 | 1.73565 | 41 | 45 | 0.000978 |
| 2 | 3 | -0.285714 | 2.819381 | 21 | 0 | 0.000382 |
| 3 | 4 | 3.111111 | 1.812079 | 54 | 63 | 0.001289 |
| 4 | 5 | 2.333333 | 2.591068 | 3 | 3 | 9.3e-05 |
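
For intuition about what the page_rank column measures, here is a minimal pure-Python power-iteration sketch of unweighted PageRank. It is an illustration under stated assumptions (damping factor 0.85, a fixed number of iterations, rank of nodes without outbound edges spread evenly over all nodes), not NetworkX's implementation; `pagerank_sketch` is a hypothetical helper name.

```python
def pagerank_sketch(edges, damping=0.85, iterations=100):
    """Unweighted PageRank by power iteration over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    successors = {n: set() for n in nodes}
    for source, target in edges:
        successors[source].add(target)

    # Start from a uniform distribution over all nodes.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outbound edges is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not successors[n])
        new_rank = {
            n: (1 - damping + damping * dangling) / len(nodes) for n in nodes
        }
        # Every node passes its damped rank on to its successors in equal shares.
        for n in nodes:
            for target in successors[n]:
                new_rank[target] += damping * rank[n] / len(successors[n])
        rank = new_rank
    return rank


toy_ranks = pagerank_sketch([(1, 2), (3, 2), (2, 3), (1, 3)])
print(toy_ranks)
```

In this toy graph node 1 has no inbound edges, so it only receives the baseline share, while nodes 2 and 3 rate each other and end up with equal, higher scores.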


In [22]:
feature_df.to_csv('features.csv', index=False)

## Describe the features

In [23]:
feature_df.describe()

| | target | avg_rating | avg_rating_inbound_users | in_degree | out_degree | page_rank |
|---|---|---|---|---|---|---|
| count | 5858.0 | 5858.0 | 5858.0 | 5858.0 | 5858.0 | 5858.0 |
| mean | 3003.711676 | 0.728609 | 1.497921 | 6.075794 | 6.067771 | 0.000171 |
| std | 1721.680985 | 2.827039 | 1.250479 | 17.705675 | 21.126901 | 0.000421 |
| min | 1.0 | -10.0 | -10.0 | 1.0 | 0.0 | 3.7e-05 |
| 25% | 1509.25 | 1.0 | 1.222598 | 1.0 | 1.0 | 5.5e-05 |
| 50% | 2998.5 | 1.0 | 1.64437 | 2.0 | 2.0 | 7.6e-05 |
| 75% | 4494.75 | 1.7 | 2.02159 | 5.0 | 4.0 | 0.000143 |
| max | 6005.0 | 10.0 | 8.0 | 535.0 | 763.0 | 0.015023 |
