# Homework 2: Graph Mining
*Due Wednesday November 14th, 2018 at 11:59 pm*

*Notebook Author: Koki Sasagawa*

Mining a large social network to uncover how well homophily can predict identity as well as the network structure. 

**Task 2:** Attribute Prediction - most of the nodes in the social network are provided with one or more attributes, which can be drawn from different types (e.g., age, occupation, musical preference). Predict the probabilities of attributes for a set of completely unlabeled nodes.

## Data

1. `labeled-vertices.train.tsv` & `labeled-vertices.dev.tsv`
   - users with attributes, formatted as follows:

   > - **vertex1** T1:3 T7:1 T4:2
   > - **vertex2** T2:4
   > - **vertex3** T4:3 T3:1
   
   - Each value is specified as `AttributeType:Value`
   - Not every user will have their attributes listed
   - The majority of users should have at least two attributes set
2. `unlabeled-vertices.test.txt` - a plain list of the vertices whose attributes and values should be predicted
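
The `AttributeType:Value` format described above can be parsed with a few lines of plain Python. The following is a minimal sketch, not part of the original notebook: the helper name `parse_attributes` is hypothetical, and it assumes that fields are separated by whitespace and that a user may carry more than one value for the same attribute type (values are therefore collected into lists and kept as strings).

```python
def parse_attributes(line):
    """Split one labeled-vertex line into (vertex_id, {attribute_type: [values]}).

    Hypothetical helper: assumes whitespace-separated fields of the form
    `AttributeType:Value` following the vertex id.
    """
    fields = line.split()
    vertex_id, pairs = fields[0], fields[1:]
    attributes = {}
    for pair in pairs:
        # Split on the first colon only, so the value is kept verbatim
        attr_type, value = pair.split(":", 1)
        attributes.setdefault(attr_type, []).append(value)
    return vertex_id, attributes


vertex_id, attributes = parse_attributes("vertex1 T1:3 T7:1 T4:2")
print(vertex_id, attributes)
```

A vertex with no listed attributes simply yields an empty dictionary.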

## Submission 

**Attribute prediction** should be a CSV file with two columns: id and attr. 
The attr column should contain a space-delimited list of the attributes you think the user with that id has. The file should have the following structure:

> id, attr
> 
> 123, T0:0 T1:1 
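
As a rough illustration of this structure, the sketch below writes a submission with Python's standard `csv` module. The function name `write_submission` and the in-memory buffer are assumptions made for the example. Note that `csv.writer` emits `id,attr` without a space after the comma, which is also what `DataFrame.to_csv` produces later in this notebook.

```python
import csv
import io


def write_submission(rows, handle):
    """Write (vertex_id, [attributes]) pairs as a two-column CSV (hypothetical helper)."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "attr"])
    for vertex_id, attrs in rows:
        # Attributes are joined into a single space-delimited field
        writer.writerow([vertex_id, " ".join(attrs)])


buffer = io.StringIO()
write_submission([(123, ["T0:0", "T1:1"])], buffer)
print(buffer.getvalue())
```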


In [1]:
# Import libraries
import pandas as pd
import numpy as np
import networkx as nx
import time
# import os
# from pprint import pprint

## 1. Create Graph

In [2]:
print("Creating network graph...")
start_time = time.time() 

with open("../data/network.tsv", 'rb') as f:
    grph = nx.read_edgelist(path=f, delimiter='\t', encoding='utf8')

end_time = time.time()
print("Network graph created. Process took {:.04f} seconds".format(end_time - start_time))

# Check that graph is of correct size
print("Number of edges: {}".format(grph.number_of_edges())) # There should be 30915267
print("Number of nodes: {}".format(grph.number_of_nodes())) # There should be 6626753

Creating network graph...
Network graph created. Process took 255.3354 seconds
Number of edges: 30915267
Number of nodes: 6626753


## 2. Load all files for attribute prediction 

In [3]:
# print("Reading in dev set...")
# start_time = time.time()

# dev_set_chunks = pd.read_csv("../data2/labeled-vertices.dev.tsv",
#                       delimiter='\t',
#                       usecols=[0,1],
#                       names=['id', 'attr'],
#                       header=None,
#                       chunksize=100000)

# dev_set = pd.concat(dev_set_chunks)

# # Free memory 
# dev_set_chunks = None

# # with open('../data2/labeled-vertices.dev.test.tsv') as f:
# #     for line in f:
# #         dev_set.append(line.rstrip())

# end_time = time.time()
# print("Dev set loaded. Process took {:.04f} seconds".format(end_time - start_time))

# # Check 
# dev_set.head()

In [3]:
print("Reading in training set...")
start_time = time.time()

train_set_chunks = pd.read_csv("../data2/labeled-vertices.train.tsv",
                      delimiter='\t',
                      usecols=[0,1],
                      names=['id', 'attr'],
                      header=None,
                      chunksize=100000)

train_set = pd.concat(train_set_chunks)

# Free memory 
train_set_chunks = None

# with open('../data2/labeled-vertices.dev.test.tsv') as f:
#     for line in f:
#         dev_set.append(line.rstrip())

end_time = time.time()
print("Train set loaded. Process took {:.04f} seconds".format(end_time - start_time))

# Check 
print(train_set.shape[0])
train_set.head()

Reading in training set...
Train set loaded. Process took 2.0877 seconds
5301403


| | id | attr |
|---|---|---|
| 0 | 5509623 | T0:0 T1:0 |
| 1 | 6334893 | T0:0 T1:1 |
| 2 | 1218900 | T0:1 T1:2 |
| 3 | 3871398 | T0:1 T1:2 |
| 4 | 3942361 | T0:0 T1:3 |


In [4]:
print('Reading in test set...')
start_time = time.time()

test_set = []

with open('../data2/unlabeled-vertices.test.txt') as f:
    for line in f:
        test_set.append(line.rstrip())

end_time = time.time()
print("Test set loaded. Process took {:.04f} seconds".format(end_time - start_time))

# Check 
test_set[:5]

Reading in test set...
Test set loaded. Process took 0.2285 seconds


['4546232', '3711008', '6394112', '5883774', '2843733']

## Define the functions for attribute prediction

Each of the following functions computes a similarity metric between a target node and each of its neighbors; the target node then inherits the attributes of its highest-scoring neighbor. 
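
The selection rule can be sketched independently of any particular metric. The example below is a dependency-free illustration, not the notebook's implementation: it uses a plain dictionary of neighbor sets instead of a networkx graph, and the names `most_similar_neighbor` and `common_neighbor_count` are hypothetical. The common-neighbor count is only a stand-in score; it is not one of the metrics used in this notebook.

```python
def common_neighbor_count(adjacency, u, v):
    """Stand-in score: number of neighbors shared by u and v."""
    return len(adjacency[u] & adjacency[v])


def most_similar_neighbor(node, adjacency, score):
    """Return the highest-scoring neighbor of `node`, or 0 if the best score is 0."""
    neighbors = sorted(adjacency[node])  # sorted for deterministic tie-breaking
    if not neighbors:
        return 0
    if len(neighbors) == 1:
        # A single neighbor is inherited from directly, without scoring
        return neighbors[0]
    scored = [(score(adjacency, node, nbr), nbr) for nbr in neighbors]
    best_score, best_neighbor = max(scored, key=lambda pair: pair[0])
    return best_neighbor if best_score > 0 else 0


# Toy graph: "c" shares two neighbors with "a"; "g" is the center of a star
adjacency = {
    "a": {"b", "c", "f"},
    "b": {"a", "c"},
    "c": {"a", "b", "f", "d"},
    "d": {"c"},
    "f": {"a", "c"},
    "g": {"h", "i"},
    "h": {"g"},
    "i": {"g"},
}

print(most_similar_neighbor("a", adjacency, common_neighbor_count))
print(most_similar_neighbor("d", adjacency, common_neighbor_count))
print(most_similar_neighbor("g", adjacency, common_neighbor_count))
```

The star center `"g"` shares no neighbors with either of its neighbors, so it maps to 0, which is the case the notebook later fills in with a different metric.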

In [5]:
def highest_jaccard_similarity(test_set, grph):
    '''Calculate the Jaccard similarity between each target node and its neighbors,
    and return the most similar neighbor for each target node
    
    If the highest Jaccard similarity for a node is zero, map that node to 0 
    
    :param test_set: target nodes we want to find the most similar neighbor for
    :type test_set: list
    :param grph: network containing target nodes
    :type grph: networkx graph
    :return: mapping of each target node to its most similar neighboring node
    :rtype: dict
    '''
    
    # Store results
    sim_results = {}
    
    for i in test_set: 
        nearest_neighbors = list(grph.neighbors(i))
        # If a node has only 1 neighbor, inherit that neighbor's attributes 
        if len(nearest_neighbors) == 1: 
            sim_results[i] = nearest_neighbors[0]
        else:
            # Generate node-neighbor pairings 
            node_pairs = [(i, j) for j in nearest_neighbors]
            preds = nx.jaccard_coefficient(grph, ebunch=node_pairs)

            # Sort by similarity score
            preds = list(preds)
            preds.sort(key=lambda x: x[2], reverse=True)

            # Take the node with highest sim score
            # If similarity is 0, return 0
            if preds[0][2] == 0.0:
                sim_results[i] = 0
            else:
                sim_results[i] = preds[0][1]

    return sim_results

In [6]:
def highest_adamic_adar_similarity(test_set, grph):
    '''Calculate the Adamic/Adar similarity between each target node and its neighbors,
    and return the most similar neighbor for each target node
    
    If the highest Adamic/Adar similarity for a node is zero, map that node to 0
    
    :param test_set: target nodes we want to find the most similar neighbor for
    :type test_set: list 
    :param grph: network containing target nodes
    :type grph: networkx graph
    :return: mapping of each target node to its most similar neighboring node
    :rtype: dict
    '''
    
    # Store results
    sim_results = {}
    
    for i in test_set:
        nearest_neighbors = list(grph.neighbors(i))
        # If a node has only 1 neighbor, inherit that neighbor's attributes 
        if len(nearest_neighbors) == 1: 
            sim_results[i] = nearest_neighbors[0]
        else:
            # Generate node-neighbor pairings 
            node_pairs = [(i, j) for j in nearest_neighbors]
            preds = nx.adamic_adar_index(grph, ebunch=node_pairs)

            # Sort by similarity score
            preds = list(preds)
            preds.sort(key=lambda x: x[2], reverse=True)

            # Take the node with highest sim score
            # If similarity is 0, return 0
            if preds[0][2] == 0.0:
                sim_results[i] = 0
            else:
                sim_results[i] = preds[0][1]

    return sim_results

In [7]:
def highest_preferential_attachment_score(test_set, grph):
    '''Calculate the preferential attachment score between each target node and its neighbors,
    and return the highest-scoring neighbor for each target node
    
    :param test_set: target nodes we want to find the highest-scoring neighbor for
    :type test_set: list
    :param grph: network containing target nodes
    :type grph: networkx graph
    :return: mapping of each target node to its highest-scoring neighboring node
    :rtype: dict
    '''
    
    # Store results
    sim_results = {}
    
    for i in test_set:
        nearest_neighbors = list(grph.neighbors(i))
        # If a node has only 1 neighbor, inherit that neighbor's attributes 
        if len(nearest_neighbors) == 1: 
            sim_results[i] = nearest_neighbors[0]
        else:
            # Generate node-neighbor pairings 
            node_pairs = [(i, j) for j in nearest_neighbors]
            preds = nx.preferential_attachment(grph, ebunch=node_pairs)

            # Sort by score
            preds = list(preds)
            preds.sort(key=lambda x: x[2], reverse=True)

            # Take the node with highest score
            # If score is 0, return 0
            if preds[0][2] == 0.0:
                sim_results[i] = 0
            else:
                sim_results[i] = preds[0][1]

    return sim_results

## Generate node pairs

Jaccard similarity and the Adamic/Adar index both depend on common neighbors: if two nodes share none, the score is zero. Preferential attachment can still be calculated in that case, because it is the product of the two nodes' degrees, reflecting the idea that nodes tend to attach to nodes of higher degree. 
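
To make the difference concrete, the sketch below re-implements the three scores in plain Python on a toy graph. It follows the standard definitions (Jaccard: shared neighbors over the union of neighbors; Adamic/Adar: sum of 1/log(degree) over shared neighbors; preferential attachment: product of degrees). The adjacency dictionary and function names are assumptions for illustration only; the notebook itself uses the networkx implementations.

```python
import math


def jaccard(adj, u, v):
    """|N(u) & N(v)| / |N(u) | N(v)|, or 0.0 when both neighbor sets are empty."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    """Sum of 1 / log(degree(w)) over the common neighbors w of u and v."""
    return sum(1.0 / math.log(len(adj[w])) for w in adj[u] & adj[v])


def preferential_attachment(adj, u, v):
    """Product of the degrees of u and v."""
    return len(adj[u]) * len(adj[v])


# Toy graph: a star (hub-x, hub-y) and a triangle (t1, t2, t3)
adj = {
    "hub": {"x", "y"},
    "x": {"hub"},
    "y": {"hub"},
    "t1": {"t2", "t3"},
    "t2": {"t1", "t3"},
    "t3": {"t1", "t2"},
}

for u, v in [("hub", "x"), ("t1", "t2")]:
    print(u, v, jaccard(adj, u, v), adamic_adar(adj, u, v),
          preferential_attachment(adj, u, v))
```

For the pair `("hub", "x")` there are no common neighbors, so the first two scores are zero while preferential attachment is still positive; for the triangle pair all three scores are non-zero.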

In [8]:
# print('Running jaccard similarity...')
# start_time = time.time()

# jaccard_similarity_nodes = highest_jaccard_similarity(test_set, grph)

# end_time = time.time()
# print("Node pairs generated. Process took {:.04f} seconds".format(end_time - start_time))

Running jaccard similarity...
Node pairs generated. Process took 18.2419 seconds


In [37]:
print('Running adamic/adar similarity...')
start_time = time.time()

adamic_adar_nodes = highest_adamic_adar_similarity(test_set, grph)

end_time = time.time()
print("Node pairs generated. Process took {:.04f} seconds".format(end_time - start_time))

Running adamic/adar similarity...
Node pairs generated. Process took 8.6248 seconds


In [10]:
print('Running preferential attachment...')
start_time = time.time()

preferential_attachment_nodes = highest_preferential_attachment_score(test_set, grph)

end_time = time.time()
print("Node pairs generated. Process took {:.04f} seconds".format(end_time - start_time))

Running preferential attachment...
Node pairs generated. Process took 2.4117 seconds


In [11]:
# # Predictions with jaccard
# predictions = pd.DataFrame({
#                         'id': test_set,
#                         'jaccard_similarity': list(jaccard_similarity_nodes.values()), 
#                         'preferential_attachment': list(preferential_attachment_nodes.values()),
#                         })

In [38]:
# Predictions with Adamic/Adar
predictions = pd.DataFrame({
                        'id': test_set,
                        'adamic_adar': list(adamic_adar_nodes.values()), 
                        'preferential_attachment': list(preferential_attachment_nodes.values()),
                        })

In [39]:
predictions.head()

| | id | adamic_adar | preferential_attachment |
|---|---|---|---|
| 0 | 4546232 | 2494614 | 2494614 |
| 1 | 3711008 | 2444912 | 2174169 |
| 2 | 6394112 | 6223074 | 6223074 |
| 3 | 5883774 | 4485305 | 4485305 |
| 4 | 2843733 | 3931905 | 3931905 |


## Use Predictions from Jaccard

In [13]:
# Find the number of rows where Jaccard similarity returned 0 instead of a candidate node
print("Number of 0\'s: {}".format(predictions.shape[0] - np.count_nonzero(predictions['jaccard_similarity'])))

Number of 0's: 58932


In [14]:
# For every row where jaccard similarity failed to return a candidate node due to lack of common neighbors,
# use the node returned by preferential attachment
predictions['attr'] = np.where(predictions['jaccard_similarity'] == 0, predictions['preferential_attachment'], predictions['jaccard_similarity'])

In [15]:
print("Number of 0\'s: {}".format(predictions.shape[0] - np.count_nonzero(predictions['attr'])))

Number of 0's: 0


In [16]:
# We no longer need the jaccard_similarity and preferential_attachment columns, so drop them
predictions.drop(['jaccard_similarity', 'preferential_attachment'], axis=1, inplace=True)

## Use Predictions from Adamic/Adar

In [40]:
# Find the number of rows where Adamic/Adar returned 0 instead of a candidate node
print("Number of 0\'s: {}".format(predictions.shape[0] - np.count_nonzero(predictions['adamic_adar'])))

Number of 0's: 58932


In [43]:
# For every row where Adamic/Adar failed to return a candidate node due to lack of common neighbors,
# use the node returned by preferential attachment
predictions['attr'] = np.where(predictions['adamic_adar'] == 0, predictions['preferential_attachment'], predictions['adamic_adar'])

In [44]:
print("Number of 0\'s: {}".format(predictions.shape[0] - np.count_nonzero(predictions['attr'])))

Number of 0's: 0


In [45]:
# We no longer need the adamic_adar and preferential_attachment columns, so drop them
predictions.drop(['adamic_adar', 'preferential_attachment'], axis=1, inplace=True)

## Obtain attributes for nodes

In [46]:
predictions.head()

| | id | attr |
|---|---|---|
| 0 | 4546232 | 2494614 |
| 1 | 3711008 | 2444912 |
| 2 | 6394112 | 6223074 |
| 3 | 5883774 | 4485305 |
| 4 | 2843733 | 3931905 |


In [47]:
# Variable type in train_set 
print(type(train_set['id'][0]))

<class 'numpy.int64'>


In [48]:
# The values in the prediction dataframe are currently str type
# convert them to int64 to allow merge with train_set
print(type(predictions['attr'][0]))

predictions = predictions.astype(dtype=np.int64, copy=True)
print(type(predictions['attr'][0]))

<class 'str'>
<class 'numpy.int64'>


In [49]:
# Merge predictions dataframe with train_set by id 
results = predictions.merge(train_set, left_on='attr', right_on='id', how='left')
# Check results
results.head()

| | id_x | attr_x | id_y | attr_y |
|---|---|---|---|---|
| 0 | 4546232 | 2494614 | 2494614.0 | T0:0 T1:1766 |
| 1 | 3711008 | 2444912 | 2444912.0 | T0:0 T1:1762 |
| 2 | 6394112 | 6223074 | 6223074.0 | T0:0 T1:1914 T8:0 |
| 3 | 5883774 | 4485305 | 4485305.0 | T0:0 T1:944 |
| 4 | 2843733 | 3931905 | 3931905.0 | T0:0 T1:538 |
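
Because this is a left merge, any predicted neighbor that does not appear in the training set ends up with a missing `attr_y`. The float-valued `id_y` column in the output above suggests that at least one such row exists, since pandas upcasts integer columns to float when missing values are introduced. The sketch below shows one way to count those rows using the `indicator` option of `merge`; the small DataFrames are made-up stand-ins for `predictions` and `train_set`, not data from this notebook.

```python
import pandas as pd

# Made-up stand-ins: neighbor 99 has no entry in the toy training set
toy_predictions = pd.DataFrame({"id": [10, 11, 12], "attr": [1, 2, 99]})
toy_train_set = pd.DataFrame({"id": [1, 2], "attr": ["T0:0 T1:5", "T0:1"]})

merged = toy_predictions.merge(
    toy_train_set,
    left_on="attr",
    right_on="id",
    how="left",
    indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
)

# Rows whose predicted neighbor was not found in the training set
unmatched = merged[merged["_merge"] == "left_only"]
print("Unmatched rows: {}".format(len(unmatched)))
print(unmatched[["id_x", "attr_x"]])
```

Rows flagged as `left_only` would be written to the submission with an empty attribute list unless they are handled separately.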


In [50]:
# Drop unnecessary columns
results.drop(['attr_x', 'id_y'], inplace=True, axis=1)

In [51]:
# Rename columns
results.columns = ['id', 'attr']

# Show results
results.head()

| | id | attr |
|---|---|---|
| 0 | 4546232 | T0:0 T1:1766 |
| 1 | 3711008 | T0:0 T1:1762 |
| 2 | 6394112 | T0:0 T1:1914 T8:0 |
| 3 | 5883774 | T0:0 T1:944 |
| 4 | 2843733 | T0:0 T1:538 |


In [52]:
# Check results
print('Results contain {} nodes'.format(results.shape[0]))

Results contain 662675 nodes


In [53]:
# Save Results
results.to_csv('../data2/attribute_predictions_v2.csv', index=False)