# Homework 2: Graph Mining
*Due Wednesday November 14th, 2018 at 11:59 pm*

*Notebook Author: Koki Sasagawa*

Mining a large social network to uncover how well homophily can predict identity as well as the network structure. 

**Task 2:** Attribute Prediction - most of the nodes in the social network are provided with one or more attributes, which can be drawn from different types (e.g., age, occupation, musical preference). Predict the probabilities of attributes for a set of completely unlabeled nodes.

## Data

1. `labeled-vertices.train.tsv` & `labeled-vertices.dev.tsv` & `unlabeled-vertices.test.tsv`
   - Users with attributes, formatted as follows:

   > - **vertex1** T1:3 T7:1 T4:2
   > - **vertex2** T2:4
   > - **vertex3** T4:3 T3:1
   
   - Each value is specified as `AttributeType:Value`.
   - Not every user has their attributes listed.
   - The majority of users should have at least 2 attributes set.
2. `unlabeled-vertices.test.txt` - a plain list of vertices whose attributes and values must be predicted.
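
As a minimal sketch of how one attribute string breaks down, assuming whitespace-separated `AttributeType:Value` tokens as described above. The helper name `parse_attributes` is hypothetical and is not part of the assignment code.

```python
def parse_attributes(attr_string):
    """Split a string such as 'T1:3 T7:1 T4:2' into (type, value) pairs."""
    pairs = []
    for token in attr_string.split():
        # Each token is assumed to look like 'AttributeType:Value'
        attr_type, value = token.split(":", 1)
        pairs.append((attr_type, value))
    return pairs


# Example using the first sample row shown above
print(parse_attributes("T1:3 T7:1 T4:2"))
```

Values are kept as strings here, since the data description does not say whether they should be treated as numbers or as category labels.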

## Submission 

The **attribute prediction** submission should be a CSV file with two columns: `id` and `attr`. 
The `attr` column should contain a space-delimited list of the attributes you think the user with that id has. The file should have the following structure:

> id, attr
> 
> 123, T0:0 T1:1 


In [1]:
# Import libraries
import pandas as pd
import numpy as np
import networkx as nx
import time
# import os
# from pprint import pprint

## 1. Create Graph

In [2]:
print("Creating network graph...")
start_time = time.time() 

with open("../data/network.tsv", 'rb') as f:
    grph = nx.read_edgelist(path=f, delimiter='\t', encoding='utf8')

end_time = time.time()
print("Network graph created. Process took {:.04f} seconds".format(end_time - start_time))

# Check that graph is of correct size
print("Number of edges: {}".format(grph.number_of_edges())) # There should be 30915267
print("Number of nodes: {}".format(grph.number_of_nodes())) # There should be 6626753

Creating network graph...
Network graph created. Process took 277.1276 seconds
Number of edges: 30915267
Number of nodes: 6626753


## 2. Load all files for attribute prediction 

In [8]:
print("Reading in dev set...")
start_time = time.time()

dev_set_chunks = pd.read_csv("../data2/labeled-vertices.dev.tsv",
                      delimiter='\t',
                      usecols=[0,1],
                      names=['id', 'attr'],
                      header=None,
                      chunksize=100000)

dev_set = pd.concat(dev_set_chunks)

# Free memory 
dev_set_chunks = None

# with open('../data2/labeled-vertices.dev.test.tsv') as f:
#     for line in f:
#         dev_set.append(line.rstrip())

end_time = time.time()
print("Dev set loaded. Process took {:.04f} seconds".format(end_time - start_time))

# Check 
dev_set.head()

Reading in dev set...
Dev set loaded. Process took 0.2679 seconds


Unnamed: 0,id,attr
0,2666403,T0:2 T1:99
1,2627940,T0:0 T1:26
2,4843136,T0:0 T1:26
3,5396835,T0:0 T1:1813
4,5438188,T0:1 T1:1733


In [11]:
print("Reading in training set...")
start_time = time.time()

train_set_chunks = pd.read_csv("../data2/labeled-vertices.train.tsv",
                      delimiter='\t',
                      usecols=[0,1],
                      names=['id', 'attr'],
                      header=None,
                      chunksize=100000)

train_set = pd.concat(train_set_chunks)

# Free memory 
train_set_chunks = None

# with open('../data2/labeled-vertices.dev.test.tsv') as f:
#     for line in f:
#         dev_set.append(line.rstrip())

end_time = time.time()
print("Train set loaded. Process took {:.04f} seconds".format(end_time - start_time))

# Check 
print(train_set.shape[0])
train_set.head()

Reading in training set...
train set loaded. Process took 1.9288 seconds


Unnamed: 0,id,attr
0,5509623,T0:0 T1:0
1,6334893,T0:0 T1:1
2,1218900,T0:1 T1:2
3,3871398,T0:1 T1:2
4,3942361,T0:0 T1:3


In [12]:
print('Reading in test set...')
start_time = time.time()

test_set = []

with open('../data2/unlabeled-vertices.test.txt') as f:
    for line in f:
        test_set.append(line.rstrip())

end_time = time.time()
print("Test set loaded. Process took {:.04f} seconds".format(end_time - start_time))

# Check 
test_set[:5]

Reading in test set...
Test set loaded. Process took 0.2228 seconds


['4546232', '3711008', '6394112', '5883774', '2843733']

## Define the functions for attribute prediction

The following functions each compute a similarity metric between a target node and each of its neighbors; the target node then inherits the attributes of its highest-scoring neighbor. 

In [13]:
def highest_jaccard_similarity(test_set, grph):
    '''Calculate the Jaccard similarity between each target node and its neighbors and return the most similar neighbor
    
    If the highest Jaccard similarity for a node is zero, 0 is stored for that node instead
    
    :param test_set: target nodes we want to find the most similar neighbor for
    :type test_set: list
    :param grph: network containing target nodes
    :type grph: networkx graph
    :return: mapping of each target node to its most similar neighboring node
    :rtype: dict
    '''
    
    # Store results
    sim_results = {}
    
    for i in test_set:      
        nearest_neighbors = list(grph.neighbors(i))
        # If a node has only 1 neighbor, inherit that neighbor's attributes 
        if len(nearest_neighbors) == 1: 
            sim_results[i] = nearest_neighbors[0]
        else:
            # Generate node-neighbor pairings 
            node_pairs = [(i, j) for j in nearest_neighbors]
            preds = nx.jaccard_coefficient(grph, ebunch=node_pairs)

            # Sort by similarity score
            preds = list(preds)
            preds.sort(key=lambda x: x[2], reverse=True)

            # Take the node with highest sim score
            # If similarity is 0, return 0
            if preds[0][2] == 0.0:
                sim_results[i] = 0
            else:
                sim_results[i] = preds[0][1]

    return sim_results

In [14]:
def highest_adamic_adar_similarity(test_set, grph):
    '''Calculate the Adamic/Adar similarity between each target node and its neighbors and return the most similar neighbor
    
    If the highest Adamic/Adar similarity for a node is zero, 0 is stored for that node instead
    
    :param test_set: target nodes we want to find the most similar neighbor for
    :type test_set: list 
    :param grph: network containing target nodes
    :type grph: networkx graph
    :return: mapping of each target node to its most similar neighboring node
    :rtype: dict
    '''
    
    # Store results
    sim_results = {}
    
    for i in test_set:
        nearest_neighbors = list(grph.neighbors(i))
        # If a node has only 1 neighbor, inherit that neighbor's attributes 
        if len(nearest_neighbors) == 1: 
            sim_results[i] = nearest_neighbors[0]
        else:
            # Generate node-neighbor pairings 
            node_pairs = [(i, j) for j in nearest_neighbors]
            preds = nx.adamic_adar_index(grph, ebunch=node_pairs)

            # Sort by similarity score
            preds = list(preds)
            preds.sort(key=lambda x: x[2], reverse=True)

            # Take the node with highest sim score
            # If similarity is 0, return 0
            if preds[0][2] == 0.0:
                sim_results[i] = 0
            else:
                sim_results[i] = preds[0][1]

    return sim_results

In [15]:
def highest_preferential_attachment_score(test_set, grph):
    '''Calculate the preferential attachment score between each target node and its neighbors and return the highest scoring neighbor
    
    :param test_set: target nodes we want to find the highest scoring neighbor for
    :type test_set: list
    :param grph: network containing target nodes
    :type grph: networkx graph
    :return: mapping of each target node to its highest scoring neighboring node
    :rtype: dict
    '''
    
    # Store results
    sim_results = {}
    
    for i in test_set:
        nearest_neighbors = list(grph.neighbors(i))
        # If a node has only 1 neighbor, inherit that neighbor's attributes 
        if len(nearest_neighbors) == 1: 
            sim_results[i] = nearest_neighbors[0]
        else:
            # Generate node-neighbor pairings 
            node_pairs = [(i, j) for j in nearest_neighbors]
            preds = nx.preferential_attachment(grph, ebunch=node_pairs)

            # Sort by score
            preds = list(preds)
            preds.sort(key=lambda x: x[2], reverse=True)

            # Take the node with highest score
            # If score is 0, return 0
            if preds[0][2] == 0.0:
                sim_results[i] = 0
            else:
                sim_results[i] = preds[0][1]

    return sim_results

## Generate node pairs

Jaccard similarity and Adamic/Adar similarity both require the two nodes to have common neighbors; otherwise the similarity score is zero. Preferential attachment, however, can still be calculated because it depends only on the degrees of the two nodes: it is based on the idea that nodes tend to attach to nodes of higher degree. 
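
To make the difference concrete, here is a small hand-rolled sketch of the three scores on a toy graph. It follows the standard definitions (shared neighbors over the union of neighbor sets, a sum of `1 / log(degree)` over shared neighbors, and a product of degrees) rather than calling networkx; the toy graph and function names are illustrative assumptions.

```python
import math

# Toy undirected graph as adjacency sets: path a - b - c plus the triangle c, d, e
adj = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d", "e"},
    "d": {"c", "e"},
    "e": {"c", "d"},
}


def jaccard(u, v):
    # Shared neighbors divided by the union of both neighbor sets
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(u, v):
    # Each shared neighbor contributes 1 / log(its degree)
    return sum(1 / math.log(len(adj[w])) for w in adj[u] & adj[v])


def preferential_attachment(u, v):
    # Product of the two degrees; no shared neighbors required
    return len(adj[u]) * len(adj[v])


for u, v in [("a", "b"), ("d", "e")]:
    print(u, v, jaccard(u, v), adamic_adar(u, v), preferential_attachment(u, v))
```

The linked pair `a`, `b` shares no neighbors, so only preferential attachment gives it a nonzero score, while `d` and `e` share neighbor `c` and score above zero on all three.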

In [16]:
print('Running jaccard similarity...')
start_time = time.time()

jaccard_similarity_nodes = highest_jaccard_similarity(test_set, grph)

end_time = time.time()
print("Node pairs generated. Process took {:.04f} seconds".format(end_time - start_time))

Running jaccard similarity...
Node pairs generated. Process took 31.3890 seconds


In [None]:
print('Running adamic/adar similarity...')
start_time = time.time()

adamic_adar_nodes = highest_adamic_adar_similarity(test_set, grph)

end_time = time.time()
print("Node pairs generated. Process took {:.04f} seconds".format(end_time - start_time))

In [18]:
print('Running preferential attachment...')
start_time = time.time()

preferential_attachment_nodes = highest_preferential_attachment_score(test_set, grph)

end_time = time.time()
print("Node pairs generated. Process took {:.04f} seconds".format(end_time - start_time))

Running preferential attachment...
Node pairs generated. Process took 4.9312 seconds


In [115]:
predictions = pd.DataFrame({
    'id': test_set,
    # Look up each result by node id so every row stays aligned with test_set
    'jaccard_similarity': [jaccard_similarity_nodes[i] for i in test_set],
    'preferential_attachment': [preferential_attachment_nodes[i] for i in test_set],
})

In [None]:
# Free memory of variables no longer used
jaccard_similarity_nodes = None
adamic_adar_nodes = None
preferential_attachment_nodes = None
grph = None
del jaccard_similarity_nodes
del adamic_adar_nodes
del preferential_attachment_nodes
del grph

In [116]:
predictions.head()

Unnamed: 0,id,jaccard_similarity,preferential_attachment
0,4546232,2494614,2494614
1,3711008,2444912,2174169
2,6394112,6223074,6223074
3,5883774,2967901,4485305
4,2843733,3931905,3931905


In [117]:
# Find the number of cells where jaccard similarity did not return a similarity score and returned 0
print("Number of 0\'s: {}".format(predictions.shape[0] - np.count_nonzero(predictions['jaccard_similarity'])))

Number of 0's: 58932


In [118]:
# For every row where jaccard similarity failed to return a candidate node due to lack of common neighbors,
# use the node returned by preferential attachment
predictions['attr'] = np.where(predictions['jaccard_similarity'] == 0, predictions['preferential_attachment'], predictions['jaccard_similarity'])

In [119]:
print("Number of 0\'s: {}".format(predictions.shape[0] - np.count_nonzero(predictions['attr'])))

Number of 0's: 0


In [120]:
# We no longer need columns jaccard_similarity and preferential attachment, so drop them
predictions.drop(['jaccard_similarity', 'preferential_attachment'], axis=1, inplace=True)

In [121]:
predictions.head()

Unnamed: 0,id,attr
0,4546232,2494614
1,3711008,2444912
2,6394112,6223074
3,5883774,2967901
4,2843733,3931905


In [170]:
def find_attributes(node_pairs, attr_set):
    '''Retrieve the corresponding attributes from the training set
    
    NOTE: work in progress. This version only prints the first row of
    attr_set as a sanity check and returns None. The per-node lookup is
    left commented out at the end of the function.
    
    :param node_pairs: nodes with unknown attributes
    :type node_pairs: pandas dataframe 
    :param attr_set: nodes with known attributes
    :type attr_set: pandas dataframe
    :return: predicted attributes 
    :rtype: list
    '''
    
    pred_attr = []
    
    all_nodes = set(node_pairs['attr'])
    
    print('Getting attributes...')
    start_time = time.time()
    
    # Sanity check: inspect the first row of the attribute set
    for i, r in attr_set.iterrows():
        print(r['id'])
        print(r['attr'])
        break

    end_time = time.time()
    print("Attributes found. Process took {:.04f} seconds".format(end_time - start_time))

    # Find the corresponding attributes from the training set
#     for i in node_pairs['attr']:
#         attribute = attr_set.loc[attr_set['id'] == int(i), 'attr']
#         if attribute.empty:
#             pred_attr.append(None)
#         else:
#             pred_attr.append(attribute.values[0])
        
#     return pred_attr

In [171]:
find_attributes(predictions, train_set)

Getting attributes...
5509623
T0:0 T1:0
Attributes found. Process took 0.7670 seconds


In [152]:
print('Getting attributes...')
start_time = time.time()

# Fill attr column with predicted attributes 
predictions['attr'] = find_attributes(predictions, train_set)

end_time = time.time()
print("Attributes found. Process took {:.04f} seconds".format(end_time - start_time))

Getting attributes...


KeyboardInterrupt: 
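
The commented-out loop in `find_attributes` filters the whole training frame once per node, so its cost grows with (number of nodes) x (number of training rows). A minimal sketch of one alternative is to build an id-to-attributes dictionary once and then look each node up directly. The helper name `lookup_attributes` and the toy data below are hypothetical stand-ins, not the notebook's actual frames.

```python
def lookup_attributes(neighbor_ids, known_attrs):
    """Return the known attribute string for each id, or None if it is unlabeled."""
    # known_attrs is built once, so each lookup is a single dictionary access
    return [known_attrs.get(int(node_id)) for node_id in neighbor_ids]


# Toy stand-ins: training labels keyed by integer id, and neighbor ids stored as strings
known_attrs = {5509623: "T0:0 T1:0", 6334893: "T0:0 T1:1"}
neighbor_ids = ["6334893", "42", "5509623"]

print(lookup_attributes(neighbor_ids, known_attrs))
```

With the frames used in this notebook, the dictionary could presumably be built as `dict(zip(train_set['id'], train_set['attr']))`, assuming the `id` column was parsed as integers.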

## Step 1: From the common-neighbor (CN) node pairs produced for the link prediction problem, find the node with the highest CN value and inherit its attributes

In [61]:
candidate_nodes = pd.read_csv('../data/candidate_pairs.csv',
                              dtype={'node1': np.int32, 'node2': np.int32, 'CN': np.int32})

# Check
print(candidate_nodes.shape[0])
candidate_nodes.head()

231167


Unnamed: 0,node1,node2,CN
0,1091804,967845,67
1,1091804,1354523,51
2,1091804,2309755,68
3,4573414,967845,101
4,4573414,1354523,54


In [69]:
# Sort by CN value and reset index
candidate_nodes.sort_values('CN', inplace=True, ascending=False)
candidate_nodes.reset_index(drop=True, inplace=True)

print('Number of rows before dropping duplicates: {}'.format(candidate_nodes.shape[0]))

# Drop duplicates. The first match we get is the one with the highest CN.
candidate_nodes.drop_duplicates(subset='node1', keep='first', inplace=True)

print('Number of rows after dropping duplicates: {}'.format(candidate_nodes.shape[0]))
candidate_nodes.head()

Number of rows before dropping duplicates: 231167
Number of rows after dropping duplicates: 17465


Unnamed: 0,node1,node2,CN
0,6247815,1542000,593
1,5401224,988315,557
2,304552,6314086,545
5,353603,988315,497
6,4258628,1542000,496


In [70]:
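# Work in progress: iterate over the de-duplicated candidate pairs for each test node.
# c_pairs is not defined in this notebook; candidate_nodes is assumed here.
# The boolean Series shown below is not produced by this loop.
for i in test_set:
    for idx, r in candidate_nodes.iterrows():
        break
    break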


0         True
1         True
2         True
5         True
6         True
7         True
8         True
9         True
10        True
11        True
12        True
13        True
14        True
15        True
16        True
17        True
19        True
20        True
21        True
23        True
25        True
26        True
27        True
28        True
29        True
31        True
32        True
38        True
41        True
42        True
          ... 
230561    True
230566    True
230584    True
230595    True
230598    True
230602    True
230619    True
230637    True
230652    True
230656    True
230677    True
230720    True
230722    True
230725    True
230743    True
230745    True
230753    True
230775    True
230843    True
230846    True
230865    True
230902    True
230927    True
230945    True
230985    True
231004    True
231061    True
231156    True
231158    True
231162    True
Name: node1, Length: 17465, dtype: bool


In [48]:
for i in test_set:
    print(candidate_nodes.node1.ne(int(i)).idxmin())
    break
        

140294


In [43]:
# Solution 1: Take advantage of the common neighbors data produced earlier to find candidate nodes for a node.
# Assume their attributes will be shared with the mystery node. 
count = 0
for i in test_set:
    print(adj_list[i])
    print(adj_list[adj_list[i][0]])
    count += 1 
    if count == 1:
        break


['2494614']
['2120159', '3119624', '2817288', '848669', '4546232', '1242663', '3191796']


Use the following intuition:
1. Nodes sharing common neighbors are likely to be friends and thus to share similar interests. Use Jaccard similarity to find the most similar node. 
2. For nodes with few neighbors, look at nodes fewer than 4 steps away (the small-world phenomenon suggests that, on average, nodes are about 4 steps apart), rank them by connectivity (preferential attachment), and inherit the attributes of the node with the highest degree. 
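
The second point can be sketched as a bounded breadth-first search followed by a degree ranking. This is only an illustration under stated assumptions: "closer than 4 nodes away" is read as at most 3 hops, the graph is a plain dictionary of adjacency sets, and the function names are hypothetical. For a fixed source node, ranking by preferential attachment (a product of degrees) gives the same order as ranking by the other node's degree.

```python
from collections import deque


def nodes_within(adj, source, max_hops):
    """Breadth-first search: return {node: distance} for nodes within max_hops of source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    del dist[source]
    return dist


def highest_degree_nearby(adj, source, max_hops=3):
    """Pick the reachable node with the largest degree; ties go to the closer node."""
    nearby = nodes_within(adj, source, max_hops)
    if not nearby:
        return None
    return max(nearby, key=lambda n: (len(adj[n]), -nearby[n]))


# Toy graph: path a - b - c - d - e, with c also linked to x and y
adj = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d", "x", "y"},
    "d": {"c", "e"},
    "e": {"d"},
    "x": {"c"},
    "y": {"c"},
}

print(nodes_within(adj, "a", 3))
print(highest_degree_nearby(adj, "a"))
```

Node `e` is 4 hops from `a`, so it falls outside the search, and the hub `c` is chosen because it has the highest degree among the reachable nodes.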

In [None]:
# Create a user x user matrix whose values are 0 or 1 to indicate a link. 
# Using cosine similarity, the most similar node will be returned for a target node.
# Use that similar node as a key to look it up in the training set.
# Retrieve its attributes and assign them as the attributes for the target node. 

# If multiple candidate nodes exist (for example, when using a threshold to define similarity), 
# choose the nearest one. Perhaps I can find the shortest path from node1 to each candidate node,
# then choose the closest one. 
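
A rough sketch of the cosine-similarity idea, assuming the 0/1 user x user matrix is stored sparsely as adjacency sets (a dense matrix over millions of nodes would be impractical to hold in memory). For binary rows, the dot product is the number of shared neighbors and each norm is the square root of the node's degree. The function names and toy graph are illustrative, not part of the notebook.

```python
import math


def cosine_similarity(adj, u, v):
    """Cosine similarity of the 0/1 adjacency rows of u and v."""
    if not adj[u] or not adj[v]:
        return 0.0
    # dot product = shared neighbors; norms = sqrt(degree)
    return len(adj[u] & adj[v]) / math.sqrt(len(adj[u]) * len(adj[v]))


def most_similar(adj, target):
    """Return the other node whose adjacency row is most similar to target's."""
    others = [n for n in adj if n != target]
    return max(others, key=lambda n: cosine_similarity(adj, target, n))


# Toy graph: a and b both link to c and d; b also links to e
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}

print(most_similar(adj, "a"))
```

Note that, unlike the neighbor-only functions earlier in the notebook, this sketch compares the target against every other node, so the most similar node need not be directly linked to it.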




In [None]:
# In an adjacency matrix, we can assume that popular neighbors have more influence than less connected neighbors.
# For a given node, find all neighbors.
# Return the "most" connected neighbor and their attributes.
# These attributes can be the new attributes for the user. 



In [None]:
# Triangular (triadic) closure can indicate a tight community of friends.
# If a neighbor of a friend is friends with another neighbor, this triangular relationship can indicate a close-knit 
# circle; assume that these nodes share a lot of attributes.
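
One way to make the triangle idea measurable is to count, for a given node, how many pairs of its neighbors are themselves linked. The sketch below is an assumption about how this could be done, using plain adjacency sets and hypothetical function names; the ratio it computes is the usual local clustering coefficient.

```python
from itertools import combinations


def triangles_through(adj, node):
    """List neighbor pairs (v, w) of `node` that are themselves linked, closing a triangle."""
    return [
        (v, w)
        for v, w in combinations(sorted(adj[node]), 2)
        if w in adj[v]
    ]


def local_clustering(adj, node):
    """Fraction of neighbor pairs that are linked (0.0 when fewer than two neighbors)."""
    k = len(adj[node])
    if k < 2:
        return 0.0
    return len(triangles_through(adj, node)) / (k * (k - 1) / 2)


# Toy graph: u, a, b form a triangle; c hangs off u
adj = {
    "u": {"a", "b", "c"},
    "a": {"u", "b"},
    "b": {"u", "a"},
    "c": {"u"},
}

print(triangles_through(adj, "u"))
print(local_clustering(adj, "u"))
```

Neighbors that close a triangle with the target node (here `a` and `b`) would be the candidates whose attributes are assumed to be shared.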

In [None]:
# SVD approach
# Create a user x attribute matrix
# presence of attribute is indicated by 1, absence is 0,
# use this 