# Homework 2: Graph Mining
*Due Wednesday November 14th, 2018 at 11:59 pm*

*Notebook Author: Koki Sasagawa*

Mining a large social network to uncover how well homophily can predict identity as well as the network structure. 

**Task 2:** Attribute Prediction. Most of the nodes in the social network have one or more attributes, which can be drawn from different types (e.g., age, occupation, musical preference). Predict the probabilities of attributes for a set of completely unlabeled nodes.

## Data

1. `labeled-vertices.train.tsv`, `labeled-vertices.dev.tsv`, and `unlabeled-vertices.test.tsv`
   - users with attributes formatted as the following: 

   > - **vertex1** T1:3 T7:1 T4:2
   > - **vertex2** T2:4
   > - **vertex3** T4:3 T3:1
   
   - Each value is specified as `AttributeType:Value`
   - Not every user will have their attributes listed 
   - The majority of users should have at least 2 attributes set
2. `unlabeled-vertices.test.txt` - a plain list of the vertices whose attributes and values should be predicted
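
As a minimal sketch of how one line of the labeled files could be parsed, assuming the vertex ID and the attribute list are separated by a tab (as the `.tsv` extension suggests) and the attributes by spaces. `parse_vertex_line` is a hypothetical helper, not part of the assignment code:

```python
def parse_vertex_line(line):
    """Split 'vertex<TAB>T1:3 T7:1' into (vertex_id, [(attribute_type, value), ...])."""
    # Assumption: a tab separates the vertex ID from its attribute list
    vertex_id, _, attr_field = line.rstrip('\n').partition('\t')
    pairs = []
    for token in attr_field.split():
        attr_type, _, value = token.partition(':')
        pairs.append((attr_type, value))
    return vertex_id, pairs


parsed = parse_vertex_line('vertex1\tT1:3 T7:1 T4:2\n')
unlabeled = parse_vertex_line('vertex9\n')  # a vertex with no attributes listed
print(parsed)
print(unlabeled)
```

Keeping the pairs in a list rather than a dictionary avoids silently dropping a value if a vertex ever lists the same attribute type twice.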

## Submission 

**Attribute prediction** should be a csv file with two columns: id and attr. 
The attr column should contain a space-delimited list of the attributes you think the user with that id has. The file should have the following structure:

> id, attr
> 
> 123, T0:0 T1:1 
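
A minimal sketch of writing predictions in this layout with the standard `csv` module. `write_submission` and the sample prediction are illustrative assumptions, and the sketch writes no space after the comma, which may or may not matter to the grader:

```python
import csv
import io


def write_submission(predictions, handle):
    """Write {node_id: [attribute strings]} as rows of id,attr."""
    writer = csv.writer(handle, lineterminator='\n')
    writer.writerow(['id', 'attr'])
    for node_id, attrs in predictions.items():
        # Attributes for one node are joined into a single space-delimited field
        writer.writerow([node_id, ' '.join(attrs)])


# StringIO stands in for a real file opened with open(path, 'w', newline='')
buffer = io.StringIO()
write_submission({'123': ['T0:0', 'T1:1']}, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```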


In [61]:
# Import libraries
import pandas as pd
import numpy as np
import networkx as nx
import os
import time
from pprint import pprint

## 1. Create Graph

In [31]:
# # Open file and create dictionary 
# adj_list = {}

# with open('../data/adjacency_list.txt', 'r') as f:
#     # For each line in the file, create a dictionary that has a key = node and value = edges
#     for line in f:
#         adj_list[line.split(',')[0]] = line.split(',')[1].rstrip().split(' ')

In [3]:
print("Creating network graph...")
start_time = time.time() 

with open("../data/network.tsv", 'rb') as f:
    grph = nx.read_edgelist(path=f, delimiter='\t', encoding='utf8')

end_time = time.time()
print("Network graph created. Process took {:.04f} seconds".format(end_time - start_time))

# Check that graph is of correct size
print("Number of edges: {}".format(grph.number_of_edges())) # There should be 30915267
print("Number of nodes: {}".format(grph.number_of_nodes())) # There should be 6626753

Creating network graph...
Network graph created. Process took 251.0850 seconds
Number of edges: 30915267
Number of nodes: 6626753


In [4]:
print('Reading in test set...')
start_time = time.time()

test_set = []

with open('../data2/unlabeled-vertices.test.txt') as f:
    for line in f:
        test_set.append(line.rstrip())

# Check 
test_set[:5]

Reading in test set...


['4546232', '3711008', '6394112', '5883774', '2843733']

In [67]:
# Find the jaccard similarity between two nodes, and return the highest similarity pair 
# One with the highest similarity value --> inherit their values 

def highest_jaccard_similarity(test_set):
    '''Calculate the Jaccard similarity of each target node to its neighbors and return the most similar one
    
    If the highest Jaccard similarity for a node is zero, 0 is stored instead of a neighbor.
    
    :param test_set: target nodes we want to find the most similar neighbor for
    :type test_set: list 
    :return: mapping of each node to its most similar neighboring node
    :rtype: dict
    '''
    
    # For testing
    count = 0 
    
    # Store results
    sim_results = {}
    
    for i in test_set:
        
        count += 1 
        if count < 10:
            nearest_neighbors = list(grph.neighbors(i))
            # If a node has only 1 neighbor, inherit that neighbor's attributes
            if len(nearest_neighbors) == 1: 
                print('1 neighbor: ({}, {})'.format(i, nearest_neighbors[0]))
                sim_results[i] = nearest_neighbors[0]
            else:
                # Generate node-neighbor pairings 
                node_pairs = [(i, j) for j in nearest_neighbors]
                preds = nx.jaccard_coefficient(grph, ebunch=node_pairs)
                
                # Sort by similarity score
                preds = list(preds)
                print('Before sorting: {}'.format(preds))
                preds.sort(key=lambda x: x[2], reverse=True)
                print('After sorting: {}'.format(preds))
                
                # Take the node with highest sim score
                # If the best similarity is 0, store 0
                if preds[0][2] == 0.0:
                    sim_results[i] = 0
                else:
                    print(preds[0][1])
                    sim_results[i] = preds[0][1]

    return sim_results

# Run 
prediction_1 = highest_jaccard_similarity(test_set)
print(prediction_1)

# print(sim_results)

#         # From specified node in graph, perform dfs of 2 to find relevant neighbors 
#         sub_grph = nx.dfs_edges(grph, source=i, depth_limit=2)

# predicted_df = pd.DataFrame.from_dict(predicted_dict, orient='index', columns = [])

1 neighbor: (4546232, 2494614)
Before sorting: [('3711008', '2444912', 0.07142857142857142), ('3711008', '2582444', 0.004761904761904762), ('3711008', '2174169', 0.0022624434389140274)]
After sorting: [('3711008', '2444912', 0.07142857142857142), ('3711008', '2582444', 0.004761904761904762), ('3711008', '2174169', 0.0022624434389140274)]
2444912
1 neighbor: (6394112, 6223074)
Before sorting: [('5883774', '2967901', 0.07692307692307693), ('5883774', '4485305', 0.045454545454545456)]
After sorting: [('5883774', '2967901', 0.07692307692307693), ('5883774', '4485305', 0.045454545454545456)]
2967901
1 neighbor: (2843733, 3931905)
1 neighbor: (2571713, 26197)
1 neighbor: (1672856, 2305106)
1 neighbor: (1899234, 923650)
Before sorting: [('1257763', '1096777', 0.0), ('1257763', '2650764', 0.0)]
After sorting: [('1257763', '1096777', 0.0), ('1257763', '2650764', 0.0)]
{'4546232': '2494614', '3711008': '2444912', '6394112': '6223074', '5883774': '2967901', '2843733': '3931905', '2571713': '26197

In [55]:
def highest_adamic_adar_similarity(test_set):
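    '''Rank the neighbors of each target node by link-prediction score, highest first
    
    Note: despite the function name, the score used below is
    nx.preferential_attachment (a degree product), not the Adamic/Adar index.
    Results are only printed for now; nothing is returned.
    
    :param test_set: target nodes we want to find the most similar neighbor for
    :type test_set: list 
    '''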
    
    # For testing
    count = 0 
    
    # Store results
    sim_results = {}
    
    for i in test_set:
        
        count += 1 
        if count < 10:
            nearest_neighbors = list(grph.neighbors(i))
            # If a node has only 1 neighbor, inherit that neighbor's attributes
            if len(nearest_neighbors) == 1: 
                print('1 neighbor: {}, {}'.format(i, nearest_neighbors))
                sim_results[i] = nearest_neighbors
            else:
                # Generate node-neighbor pairings 
                node_pairs = [(i, j) for j in nearest_neighbors]
                preds = nx.preferential_attachment(grph, ebunch=node_pairs)
                
                # Sort by similarity score
                preds = list(preds)
                print('Before sorting: {}'.format(preds))
                preds.sort(key=lambda x: x[2], reverse=True)
                print('After sorting: {}'.format(preds))
                
                ties = [] 
                
                # Check for ties in similarity values
                # Use k so the outer loop's node variable i is not overwritten
                for k in range(1, len(preds)): 
                    if preds[k][2] == preds[k-1][2]:
                        ties.append(grph.degree(preds[k-1][1]))
                
                print(ties)
                
highest_adamic_adar_similarity(test_set)

1 neighbor: 4546232, ['2494614']
Before sorting: [('3711008', '2444912', 81), ('3711008', '2582444', 624), ('3711008', '2174169', 1320)]
After sorting: [('3711008', '2174169', 1320), ('3711008', '2582444', 624), ('3711008', '2444912', 81)]
[]
1 neighbor: 6394112, ['6223074']
Before sorting: [('5883774', '2967901', 24), ('5883774', '4485305', 42)]
After sorting: [('5883774', '4485305', 42), ('5883774', '2967901', 24)]
[]
1 neighbor: 2843733, ['3931905']
1 neighbor: 2571713, ['26197']
1 neighbor: 1672856, ['2305106']
1 neighbor: 1899234, ['923650']
Before sorting: [('1257763', '1096777', 10), ('1257763', '2650764', 92)]
After sorting: [('1257763', '2650764', 92), ('1257763', '1096777', 10)]
[]
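
The integer scores printed above are preferential attachment scores, which for a pair (u, v) are the product of the two node degrees. Because the degree of the target node is the same in every pair, sorting by this score amounts to sorting the neighbors by their own degree. A minimal sketch on a made-up toy adjacency dictionary (`toy_adj` and its node names are illustrative, not taken from the data):

```python
# Toy undirected graph as {node: set of neighbors}; made-up for illustration
toy_adj = {
    'a': {'b', 'c', 'd'},
    'b': {'a', 'c'},
    'c': {'a', 'b', 'd', 'e'},
    'd': {'a', 'c'},
    'e': {'c'},
}


def preferential_attachment(adj, u, v):
    """Preferential attachment score: degree(u) * degree(v)."""
    return len(adj[u]) * len(adj[v])


target = 'a'
# Rank the target's neighbors by score (ties broken by node name)
scores = sorted(
    ((target, v, preferential_attachment(toy_adj, target, v)) for v in toy_adj[target]),
    key=lambda triple: (-triple[2], triple[1]),
)
# Rank the same neighbors by their own degree only
by_degree = sorted(toy_adj[target], key=lambda v: (-len(toy_adj[v]), v))
print(scores)
print(by_degree)
```

Both rankings put the same neighbor first, so for this use the score adds no information beyond neighbor degree.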


## 2. Generate candidate node pairs by CN

The exponential growth in nodes starts around threshold L < 10, so we set the threshold at 15 to be able to calculate the common neighbors (CN) for 

In [None]:
# # Define Threshold
# L = 15

# print("Step 1: Filter adjacency list")
# f_adj_list = filter_by_lemma1(adj_list, L)

# print("Step 2: Invert adjacency list")
# inv_adj_list = invert_adjacency_list(f_adj_list)

# # Clear Variables
# f_adj_list = None

# print("Step 3: Create accompanied groups")
# acc_groups = generate_accompanied_groups(inv_adj_list)

# print("Step 4: Filter accompanied groups")
# f_acc_groups = filter_by_lemma2(acc_groups, L)

# # Clear variables
# acc_groups = None

# candidate_node_pairs = generate_node_pairs(f_acc_groups, inv_adj_list, L)
# print("Candidate node pairs generated!")

# # Clear variables
# f_acc_groups = None
# inv_adj_list = None

## Read data for attribute prediction

In [4]:
start_time = time.time()
print('Reading in adjacency list and creating dictionary...')

# Read adjacency list
adj_list = {}

with open('../data/adjacency_list.txt', 'r') as f:
    # For each line in the file, create a dictionary that has a key = node and value = edges
    for line in f:
        adj_list[line.split(',')[0]] = line.split(',')[1].rstrip().split(' ')

end_time = time.time()
print("Adjacency list loaded. Process took {:.04f} seconds".format(end_time - start_time))

Reading in adjacency list and creating dictionary...
Adjacency list loaded. Process took 26.0404 seconds


In [7]:
start_time = time.time()
print('Reading in test set...')

test_set = []

with open('../data2/unlabeled-vertices.test.txt') as f:
    for line in f:
        test_set.append(line.rstrip())

# Check 
test_set[:5]

Reading in test set...


['4546232', '3711008', '6394112', '5883774', '2843733']

In [10]:
# Len of test_set
print('test set has {} nodes'.format(len(test_set)))

test set has 662675 nodes


In [15]:
# Check that every node in the test set exists in adj_list
for i in test_set:
    if i in adj_list:
        continue
    else:
        print('Test set contains new nodes not in original network')

## Fact 1: every node in the test set exists in the adjacency list, i.e. in the original network.

In [18]:
# Convert it to a pandas series for additional functions 
test_set = pd.Series(test_set)
# Check
test_set.size

662675

In [20]:
# Load trainset 
print("Reading training set...")
start_time = time.time()

train_df_chunks = pd.read_csv("../data/labeled-vertices.train.tsv",
                       delimiter='\t',
                       usecols=[0,1],
                       names=['id', 'attr'],
                       header=None,
                       chunksize=100000)

# Combine
train_df = pd.concat(train_df_chunks)

end_time = time.time()
print("Process took {:.04f} seconds".format(end_time - start_time))

# Free memory
train_df_chunks = None 

Reading training set...
Process took 1.8339 seconds


In [30]:
# Total number of rows in the training set
print("There are {} total nodes in the training set".format(train_df.shape[0]))
# Total number of unique nodes
print("Of which, {} are unique".format(train_df['id'].nunique()))

There are 5301403 total nodes in the training set
Of which, 5301403 are unique


In [31]:
train_df.head()

Unnamed: 0,id,attr
0,5509623,T0:0 T1:0
1,6334893,T0:0 T1:1
2,1218900,T0:1 T1:2
3,3871398,T0:1 T1:2
4,3942361,T0:0 T1:3


In [39]:
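# `in` on a pandas Series checks index labels, not values, so compare against a set of IDs
test_ids = set(test_set)
for i in train_df['id']:
    if str(i) in test_ids:
        print('Some nodes in training set exist in the test set.')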

## None of the nodes in the test set exist in the training set

## Step 1: Use the common neighbor node pairs produced for the link prediction problem: for each node, find the partner with the highest CN value and inherit its attributes.

In [61]:
candidate_pairs = candidate_nodes = pd.read_csv('../data/candidate_pairs.csv',
                              dtype={'node1': np.int32, 'node2': np.int32, 'CN': np.int32})

# Check
print(candidate_nodes.shape[0])
candidate_nodes.head()

231167


Unnamed: 0,node1,node2,CN
0,1091804,967845,67
1,1091804,1354523,51
2,1091804,2309755,68
3,4573414,967845,101
4,4573414,1354523,54


In [69]:
# Sort by CN value and reset index
candidate_nodes.sort_values('CN', inplace=True, ascending=False)
candidate_nodes.reset_index(drop=True, inplace=True)

print('Number of rows before dropping duplicates: {}'.format(candidate_nodes.shape[0]))

# Drop duplicates. The first match we get is the one with the highest CN.
candidate_nodes.drop_duplicates(subset='node1', keep='first', inplace=True)

print('Number of rows after dropping duplicates: {}'.format(candidate_nodes.shape[0]))
candidate_nodes.head()

Number of rows before dropping duplicates: 231167
Number of rows after dropping duplicates: 17465


Unnamed: 0,node1,node2,CN
0,6247815,1542000,593
1,5401224,988315,557
2,304552,6314086,545
5,353603,988315,497
6,4258628,1542000,496
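
The idea of keeping, for each `node1`, the partner with the highest CN and then inheriting that partner's attributes can be sketched without pandas as follows. The rows, the label dictionary, and `best_partner_by_cn` are made-up illustrations, not values from `candidate_pairs.csv` or the training set:

```python
# Toy (node1, node2, CN) rows; made-up values for illustration
toy_pairs = [
    (1, 10, 67),
    (1, 11, 51),
    (1, 12, 68),
    (2, 10, 101),
    (2, 11, 54),
]
# Hypothetical training labels keyed by node id
toy_train_attrs = {10: 'T0:1 T1:2', 12: 'T0:0 T1:1'}


def best_partner_by_cn(pairs):
    """Return {node1: node2}, keeping for each node1 the partner with the largest CN."""
    best = {}
    for node1, node2, cn in pairs:
        # Strict '>' keeps the first partner seen when two CN values tie
        if node1 not in best or cn > best[node1][1]:
            best[node1] = (node2, cn)
    return {node1: partner for node1, (partner, _) in best.items()}


partners = best_partner_by_cn(toy_pairs)
# A node inherits its best partner's attribute string, or None if that partner is unlabeled
inherited = {node: toy_train_attrs.get(partner) for node, partner in partners.items()}
print(partners)
print(inherited)
```

One caveat of this approach: grouping on `node1` alone only covers nodes that appear in the first column, so a node that occurs only as `node2` would get no prediction.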


In [70]:
for i in test_set:
    # Boolean mask over candidate_nodes: True where node1 differs from this test node
    print(candidate_nodes.node1.ne(int(i)))
    break


0         True
1         True
2         True
5         True
6         True
7         True
8         True
9         True
10        True
11        True
12        True
13        True
14        True
15        True
16        True
17        True
19        True
20        True
21        True
23        True
25        True
26        True
27        True
28        True
29        True
31        True
32        True
38        True
41        True
42        True
          ... 
230561    True
230566    True
230584    True
230595    True
230598    True
230602    True
230619    True
230637    True
230652    True
230656    True
230677    True
230720    True
230722    True
230725    True
230743    True
230745    True
230753    True
230775    True
230843    True
230846    True
230865    True
230902    True
230927    True
230945    True
230985    True
231004    True
231061    True
231156    True
231158    True
231162    True
Name: node1, Length: 17465, dtype: bool


In [48]:
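for i in test_set:
    # ne() is True where node1 differs from i, so idxmin() gives the index label of the
    # first matching row (or the first label overall if nothing matches)
    print(candidate_nodes.node1.ne(int(i)).idxmin())
    break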
        

140294


In [43]:
# Solution 1: Take advantage of the common neighbors data produced earlier to find candidate nodes for a node.
# Assume their attributes will be shared with the mystery node. 
count = 0
for i in test_set:
    print(adj_list[i])
    print(adj_list[adj_list[i][0]])
    count += 1 
    if count == 1:
        break


['2494614']
['2120159', '3119624', '2817288', '848669', '4546232', '1242663', '3191796']


Use the following intuition:
1. Nodes sharing common neighbors are likely to be friends and thus to share similar interests. Use Jaccard similarity to find the most similar node. 
2. For nodes with few neighbors, look at nodes fewer than 4 steps away (the small-world phenomenon suggests that, on average, nodes are about 4 steps apart), rank them by connectivity (preferential attachment), and inherit the attributes of the node with the highest degree. 
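
Both intuitions can be sketched in plain Python on a toy graph. The graph, the helper names, and the hop limit of 3 are assumptions made for illustration; the notebook itself uses `networkx` on the full network:

```python
from collections import deque

# Toy undirected graph as {node: set of neighbors}; made-up for illustration
toy_adj = {
    'a': {'b', 'c'},
    'b': {'a', 'c', 'd'},
    'c': {'a', 'b'},
    'd': {'b', 'e'},
    'e': {'d', 'f'},
    'f': {'e'},
}


def jaccard(adj, u, v):
    """Shared neighbors divided by all neighbors of u and v (0.0 if there are none)."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def most_similar_neighbor(adj, node):
    """Neighbor with the highest Jaccard similarity, or None if every score is zero."""
    # Ties are broken by node name so the result is deterministic
    best = max(adj[node], key=lambda v: (jaccard(adj, node, v), v), default=None)
    if best is None or jaccard(adj, node, best) == 0.0:
        return None
    return best


def highest_degree_within(adj, node, max_hops):
    """Breadth-first search up to max_hops; return the reachable node of highest degree."""
    seen = {node}
    queue = deque([(node, 0)])
    reachable = []
    while queue:
        current, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adj[current]:
            if nxt not in seen:
                seen.add(nxt)
                reachable.append(nxt)
                queue.append((nxt, dist + 1))
    return max(reachable, key=lambda v: (len(adj[v]), v), default=None)


similar_to_a = most_similar_neighbor(toy_adj, 'a')
similar_to_f = most_similar_neighbor(toy_adj, 'f')
fallback_for_f = highest_degree_within(toy_adj, 'f', 3)
print(similar_to_a, similar_to_f, fallback_for_f)
```

Node `f` has a single neighbor with which it shares no common neighbors, so the Jaccard step yields nothing and the degree-based fallback is used instead.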

In [None]:
# Create a user x user matrix, with each value being 0 or 1 to indicate a link. 
# Using cosine similarity, the most similar node will be returned for a target node.
# Use the similar node as a key to look it up in the training set.
# Retrieve its attributes and assign them as the attributes of the target node. 

# If multiple nodes exist, for example when using a threshold to define similarity, 
# choose the nearest one. Perhaps I can find the shortest path from node1 to each candidate node,
# then choose the closest one. 
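
A minimal sketch of the cosine similarity idea, under the assumption that each row of the user x user matrix is a 0/1 adjacency row. For such rows the cosine reduces to the number of shared neighbors divided by the square root of the product of the two degrees, so it can be computed from neighbor sets without materialising the matrix. The toy graph and helper names are illustrative:

```python
import math

# Toy undirected graph as {node: set of neighbors}; made-up for illustration
toy_adj = {
    'a': {'b', 'c'},
    'b': {'a', 'c', 'd'},
    'c': {'a', 'b'},
    'd': {'b'},
}


def cosine_similarity(adj, u, v):
    """Cosine similarity of the 0/1 adjacency rows of u and v."""
    if not adj[u] or not adj[v]:
        return 0.0
    return len(adj[u] & adj[v]) / math.sqrt(len(adj[u]) * len(adj[v]))


def most_similar_node(adj, target):
    """Node other than target whose adjacency row is most similar to target's."""
    others = [v for v in adj if v != target]
    # Ties are broken by node name so the result is deterministic
    return max(others, key=lambda v: (cosine_similarity(adj, target, v), v))


sim_a_d = cosine_similarity(toy_adj, 'a', 'd')
closest_to_a = most_similar_node(toy_adj, 'a')
print(sim_a_d, closest_to_a)
```

Comparing a target against every other node is quadratic overall, so on a network of this size the candidates would probably need to be restricted, for example to nodes within two hops.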




In [None]:
# In an adjacency matrix, we can assume that popular neighbors have more influence than less connected neighbors.
# For a given node, find all  neighbors.
# Return the "most" connected neighbor and their attributes.
# These attributes can be the new attributes for the user. 
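
A minimal sketch of this idea on a toy graph. Restricting the choice to neighbors that have known labels is an added assumption (an unlabeled neighbor has nothing to inherit), and `toy_adj`, `toy_train_attrs`, and `most_connected_neighbor` are illustrative names:

```python
# Toy undirected graph as {node: set of neighbors}; made-up for illustration
toy_adj = {
    'a': {'b', 'c', 'd'},
    'b': {'a', 'c', 'e'},
    'c': {'a', 'b', 'e', 'f'},
    'd': {'a'},
    'e': {'b', 'c'},
    'f': {'c'},
}
# Hypothetical training labels; 'c' is deliberately left unlabeled
toy_train_attrs = {'b': 'T0:1 T1:2', 'd': 'T0:0 T1:0'}


def most_connected_neighbor(adj, node, labeled):
    """Highest-degree neighbor of node that has known attributes, or None."""
    candidates = [v for v in adj[node] if v in labeled]
    return max(candidates, key=lambda v: (len(adj[v]), v), default=None)


donor_for_a = most_connected_neighbor(toy_adj, 'a', toy_train_attrs)
donor_for_f = most_connected_neighbor(toy_adj, 'f', toy_train_attrs)
predicted_a = toy_train_attrs.get(donor_for_a)
print(donor_for_a, predicted_a, donor_for_f)
```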



In [None]:
# Triadic closure can indicate a tight community of friends.
# If a neighbor of a friend is friends with another neighbor, this triangular relationship can indicate a close-knit
# circle; assume that these nodes share a lot of attributes.
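
A minimal sketch of how such triangles could be detected for one node, using the local clustering coefficient as a measure of how tightly knit its neighborhood is. The toy graph and helper names are illustrative assumptions:

```python
from itertools import combinations

# Toy undirected graph as {node: set of neighbors}; made-up for illustration
toy_adj = {
    'a': {'b', 'c', 'd'},
    'b': {'a', 'c'},
    'c': {'a', 'b'},
    'd': {'a'},
}


def triangles_through(adj, node):
    """Pairs of neighbors of node that are also connected to each other."""
    return [(u, v) for u, v in combinations(sorted(adj[node]), 2) if v in adj[u]]


def local_clustering(adj, node):
    """Fraction of neighbor pairs that close a triangle (0.0 for degree < 2)."""
    degree = len(adj[node])
    if degree < 2:
        return 0.0
    return len(triangles_through(adj, node)) / (degree * (degree - 1) / 2)


closed_pairs = triangles_through(toy_adj, 'a')
clustering_a = local_clustering(toy_adj, 'a')
print(closed_pairs, clustering_a)
```

Neighbors that appear in a closed pair (here `b` and `c`) would be the preferred sources of attributes under this heuristic.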

In [None]:
# SVD approach
# Create a user x attribute matrix
# presence of attribute is indicated by 1, absence is 0,
# use this 
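
A minimal sketch of the first step only, building the binary user x attribute matrix from attribute strings in the `Type:Value` format used by the training file. The sample users and `build_user_attribute_matrix` are made up, and the factorization itself (for example with `np.linalg.svd`) is not shown:

```python
# Hypothetical attribute strings keyed by user id
toy_train_attrs = {
    'u1': 'T0:0 T1:1',
    'u2': 'T0:1 T1:2',
    'u3': 'T0:0 T1:2',
}


def build_user_attribute_matrix(attr_strings):
    """Return (users, columns, matrix) where matrix[i][j] is 1 if user i has attribute j."""
    users = sorted(attr_strings)
    # One column per distinct 'Type:Value' token
    columns = sorted({tok for s in attr_strings.values() for tok in s.split()})
    col_index = {tok: j for j, tok in enumerate(columns)}
    matrix = [[0] * len(columns) for _ in users]
    for i, user in enumerate(users):
        for tok in attr_strings[user].split():
            matrix[i][col_index[tok]] = 1
    return users, columns, matrix


users, columns, matrix = build_user_attribute_matrix(toy_train_attrs)
print(columns)
for user, row in zip(users, matrix):
    print(user, row)
```

With millions of users a dense list of lists would not fit in memory, so a sparse representation would likely be needed in practice.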