# Data Preparation
*Notebook Author: Koki Sasagawa*  
*Date: 11/14/2018*

Prepare the data for the link prediction task.

## Data:
`network.tsv` - a large social network specified in edge list format
   - two columns: each column holds the label of one node, and each row is an edge between the two
   - undirected graph
   - nodes are assigned random numeric IDs that have no special meaning
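
As a rough illustration of the format described above, the sketch below parses a tiny edge list with `nx.parse_edgelist`. The three sample rows and the names `sample_lines` and `sample_graph` are made up for this example and are not taken from `network.tsv`.

```python
import networkx as nx

# Hypothetical three-row sample in the same tab-separated layout;
# these IDs are invented and do not come from network.tsv.
sample_lines = ["101\t202", "202\t303", "303\t101"]

# Without a nodetype argument, node labels are kept as strings.
sample_graph = nx.parse_edgelist(sample_lines, delimiter="\t")

# The graph is undirected, so an edge can be queried in either direction.
has_forward = sample_graph.has_edge("101", "202")
has_reverse = sample_graph.has_edge("202", "101")

print(sample_graph.number_of_nodes(), sample_graph.number_of_edges())
```

Because the IDs carry no meaning, keeping them as strings is harmless here; passing `nodetype=int` would convert them to integers instead.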

In [1]:
import pandas as pd
import numpy as np
import networkx as nx
import time
from decorators import timer

## 1. Read network tsv file and create graph

In [2]:
print("Creating network graph...")
start_time = time.perf_counter() 

with open("../raw_data/network.tsv", 'rb') as f:
    grph = nx.read_edgelist(path=f, delimiter='\t', encoding='utf8')

end_time = time.perf_counter()
print("Network graph created. Process took {:.4f} seconds".format(end_time - start_time))

# Check that graph is of correct size
print("Number of edges: {}".format(grph.number_of_edges())) # There should be 30915267
print("Number of nodes: {}".format(grph.number_of_nodes())) # There should be 6626753

Creating network graph...
Network graph created. Process took 260.0808 seconds
Number of edges: 30915267
Number of nodes: 6626753


## 2. Generate the adjacency list and save as text file 

The function below was written using the following Stack Overflow example as a reference:

Stack Overflow. (n.d.). Write a Graph into a file in an adjacency list form [mentioning all neighbors of each node in each line] [online] Available at: https://stackoverflow.com/questions/34917550/write-a-graph-into-a-file-in-an-adjacency-list-form-mentioning-all-neighbors-of [Accessed 18 Dec. 2018].

In [3]:
def save_adjacency_list(graph, file_name):
    '''Write an adjacency list to a text file, one node per line.

    Each line holds a node, a comma, and then that node's neighbors,
    each followed by a single space.

    :param graph: network graph
    :type graph: networkx.classes.graph.Graph
    :param file_name: path of the output file
    :type file_name: str
    :returns: None; the adjacency list is written to ``file_name``
    '''

    with open(file_name, "w") as f:
        for n in graph.nodes():
            f.write(str(n) + ',')
            for neighbor in graph.neighbors(n):
                f.write(str(neighbor) + ' ')
            f.write('\n')      

In [4]:
save_adjacency_list(grph, '../temp_data/adjacency_list.txt')
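
A later step will probably need to read this file back. The sketch below shows one possible way to parse the `node,neighbor neighbor ` layout written by `save_adjacency_list`. The helper `load_adjacency_list` and the small demo file are hypothetical and are not part of the original notebook.

```python
import os
import tempfile


def load_adjacency_list(file_name):
    """Hypothetical helper: parse 'node,nbr1 nbr2 ' lines into a dict of lists."""
    adjacency = {}
    with open(file_name) as f:
        for line in f:
            # Split on the first comma: node label on the left, neighbors on the right.
            node, _, neighbors = line.rstrip("\n").partition(",")
            # split() with no arguments ignores the trailing space on each line.
            adjacency[node] = neighbors.split()
    return adjacency


# Round-trip demo on a tiny made-up file in the same format.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "adjacency_list.txt")
    with open(demo_path, "w") as f:
        f.write("1,2 3 \n2,1 \n3,1 \n4,\n")
    demo_adjacency = load_adjacency_list(demo_path)

print(demo_adjacency)
```

For the full graph with millions of nodes, holding every neighbor list in one dictionary may use a lot of memory, so processing the file line by line could be a better fit.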