# 1. Introduction
  
  
**Note:** `CD2L-KnowlTransfer` corresponds to Section 4.2 of the paper "Cross-domain network representations" [[2]](#refs).

<a name="refs"></a>  
## References

[1] Xue, Shan, Jie Lu, Guangquan Zhang, and Li Xiong. "A framework of transferring structures across large-scale information networks." 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 2018. ([link](https://ieeexplore.ieee.org/abstract/document/8489037))

[2] Xue, Shan, Jie Lu, and Guangquan Zhang. "Cross-domain network representations." Pattern Recognition 94 (2019): 135-148. ([link](https://www.sciencedirect.com/science/article/pii/S0031320319301852))

In [1]:
import os
import networkx as nx
import pandas as pd
import numpy as np

from stellargraph.data import BiasedRandomWalk
from stellargraph import StellarGraph

from gensim.models import Word2Vec

Using TensorFlow backend.


**Choose a case to run the notebook**

In [2]:
case = 'merge'
# case = 'split'

# 2. Dataset  

The dataset is the citation network Cora. It can be downloaded by clicking [here](https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz). The following is the description of the dataset from the publisher:

> The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words. The README file in the dataset provides more details. 

For this demo, we ignore the word vectors associated with each paper. We are only interested in the network structure and the **subject** attribute of each paper.

Download and unzip the cora.tgz file to a location on your computer. 

We assume that the dataset is stored in the directory

`~/data/cora/`

where the files `cora.cites` and `cora.content` can be located.

We are going to load the data into a networkx object.

In [3]:
# data_dir = "~/data/cora"
data_dir = "~/Desktop/stellargraph/data/cora"

In [4]:
# load directed graph from ordering (cited_paper, citing_paper)
data_location = os.path.expanduser(data_dir)
g_nx = nx.read_edgelist(path=os.path.join(data_location,"cora.cites"), create_using=nx.DiGraph()).reverse()

# convert to undirected graph for processing
g_nx = g_nx.to_undirected()

# load the node attribute data
node_attr = pd.read_csv(os.path.join(data_location,"cora.content"), sep='\t', header=None)
values = { str(row.tolist()[0]): row.tolist()[-1] for _, row in node_attr.iterrows() }
nx.set_node_attributes(g_nx, values, 'subject')

print("Graph statistics: {} nodes, {} edges".format(
    g_nx.number_of_nodes(), g_nx.number_of_edges()))

Graph statistics: 2708 nodes, 5278 edges


## 2.1 Data Preparation for Source and Target Graph  

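The source and target graphs are connected components of the Cora graph, selected according to `case`. The smaller graph is the largest component whose node count is below one tenth of the largest component's node count (rounded down).

Case 1 (merge): the largest connected component is the **Source Graph** and the smaller component is the **Target Graph**.  
Case 2 (split): the largest connected component is the **Target Graph** and the smaller component is the **Source Graph**.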

In [5]:
# For clarity we ignore isolated nodes and subgraphs; having these in the data does not prevent the
# algorithm from running and producing valid results.

g_nx_ccs = ( g_nx.subgraph(c).copy() for c in nx.connected_components(g_nx) )

if case == 'merge':  # compare strings by value; `is` tests object identity
    # Select the largest connected component as the source graph. 
    g_nx_s = max(g_nx_ccs, key=len)
    v_no_s = g_nx_s.number_of_nodes()
    print("Largest subgraph (source graph) statistics: {} nodes, {} edges".format(
        v_no_s, g_nx_s.number_of_edges()))
    
    # Select a smaller connected component as the target graph. 
    # Assumption one: target graph node scale < source graph node scale
    threshold = int(v_no_s / 10)
    print("Target graph number of nodes threshold: {}".format(threshold))
    
    g_nx_ccs_t = ( g_nx.subgraph(c).copy() for c in nx.connected_components(g_nx) 
                  if len(c) < threshold)
    g_nx_t = max(g_nx_ccs_t, key=len)
    v_no_t = g_nx_t.number_of_nodes()
    print("Smaller subgraph (target graph) statistics: {} nodes, {} edges".format(
        v_no_t, g_nx_t.number_of_edges()))
    
if case == 'split':  # compare strings by value; `is` tests object identity
    # Select the largest connected component as the target graph. 
    g_nx_t = max(g_nx_ccs, key=len)
    v_no_t = g_nx_t.number_of_nodes()
    print("Largest subgraph (target graph) statistics: {} nodes, {} edges".format(
        v_no_t, g_nx_t.number_of_edges()))
    
    # Select a smaller connected component as the source graph. 
    threshold = int(v_no_t / 10)
    print("Source graph number of nodes threshold: {}".format(threshold))
    
    g_nx_ccs_s = ( g_nx.subgraph(c).copy() for c in nx.connected_components(g_nx) 
                  if len(c) < threshold)
    g_nx_s = max(g_nx_ccs_s, key=len)
    v_no_s = g_nx_s.number_of_nodes()
    print("Smaller subgraph (source graph) statistics: {} nodes, {} edges".format(
        v_no_s, g_nx_s.number_of_edges()))

Largest subgraph (source graph) statistics: 2485 nodes, 5069 edges
Target graph number of nodes threshold: 248
Smaller subgraph (target graph) statistics: 26 nodes, 43 edges


## 2.2 Algorithm Inputs

### Input 1: Random Walks on the Original Source Graph Using the Node2Vec Algorithm

In [6]:
# rw_s = BiasedRandomWalk(StellarGraph(g_nx_s))

# walks_s = rw_s.run(
#     nodes=list(g_nx_s.nodes()), # root nodes
#     length=100,  # maximum length of a random walk
#     n=10,        # number of random walks per root node 
#     p=0.5,       # Defines (unnormalised) probability, 1/p, of returning to the source node
#     q=2.0        # Defines (unnormalised) probability, 1/q, of moving away from the source node
# )
# print("Number of random walks of source graph: {}".format(len(walks_s)))

# np.save('cora_walk_s_merge_case.npy', walks_s)

In [7]:
walks_s = np.load('cora_walk_s_merge_case.npy')
print("Number of random walks of source graph: {}".format(len(walks_s)))

Number of random walks of source graph: 24850
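
With `n=10` walks per root node and 2485 source nodes, the run above yields 24850 walks. To make the roles of `p` and `q` concrete, the following is a minimal pure-Python sketch of a node2vec-style second-order walk on a toy graph. It is not the StellarGraph implementation; `biased_walk` and `adj` are hypothetical names introduced only for this illustration, and the toy graph is made up.

```python
import random

def biased_walk(adj, start, length, p, q, rng):
    """Node2vec-style second-order walk on an unweighted graph.

    adj maps each node to a list of its neighbours (hypothetical helper,
    not the StellarGraph implementation).
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = adj[cur]
        if not nbrs:
            break  # dead end: stop early
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay at distance 1 from the previous node
            else:
                weights.append(1.0 / q)  # move further away from the previous node
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Toy graph: a triangle (a, b, c) with a tail c - d.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
rng = random.Random(0)
walks = [biased_walk(adj, node, 5, p=0.5, q=2.0, rng=rng)
         for node in adj for _ in range(10)]
print("Number of random walks: {}".format(len(walks)))
```

With `p=0.5` the unnormalised weight for stepping back is 2, and with `q=2.0` the weight for moving outward is 0.5, so walks tend to stay close to where they started.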


### Input 2: Weighted Target Graph

Since the edges in the Cora dataset are unweighted, we need to synthetically add weights to the links in the graph. The weights are initialized to 1 and are later updated with the weights transferred by `CD2L-KnowlTransfer`.

In [8]:
for u,v in g_nx_t.edges():
    g_nx_t[u][v]['weight'] = 1

# g_nx_t[list(g_nx_t.edges())[0][0]][list(g_nx_t.edges())[0][1]]['weight']

### Input 3: Cross Graph Links and Weights by CD2L-NodeBalance

In [9]:
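# the 'supernode' entry is a pickled dict, so recent NumPy versions need allow_pickle=True
data = np.load('cora_NodeBalance_result_{}_case.npz'.format(case), allow_pickle=True)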

e_cg = data['edge']
w_cg = data['weight']
supernode = data['supernode'].item()

print("shape of cross graph edge set: {}".format(e_cg.shape))
print("shape of cross graph weight set: {}".format(w_cg.shape))
print("keys of supernodes: {}".format(supernode.keys()))

shape of cross graph edge set: (26, 9)
shape of cross graph weight set: (26, 9)
keys of supernodes: dict_keys(['supernode1', 'supernode2', 'supernode3', 'supernode4', 'supernode5', 'supernode6', 'supernode7', 'supernode8', 'supernode9'])


## 2.3 Dataset Summary  

| Graph | Number of Nodes  | Number of Edges | Average Degree |
|------|------|------|------|
|   Source Graph  | 2485 | 5069 | 4.080 |
|   Target Graph  | 26   | 43   | 3.308 |
|   Super Graph   | 9    | --   | --    |
|   Target Cluster| 7    | --   | --    |
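
The Average Degree column follows from the node and edge counts: in an undirected graph every edge contributes to the degree of two nodes, so the average degree is 2E/N. A minimal check of the two rows, where `average_degree` is a hypothetical helper name used only here:

```python
def average_degree(n_nodes, n_edges):
    """Average degree of an undirected graph: each edge is counted at both endpoints."""
    return 2 * n_edges / n_nodes

print("Source Graph: {:.3f}".format(average_degree(2485, 5069)))
print("Target Graph: {:.3f}".format(average_degree(26, 43)))
```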

# 3. Knowledge Transfer

![CD2L-KnowlTransfer](knowltransfer.png)  

In [10]:
id_v_to_sn = {node_id: sn_name for sn_name in supernode.keys() for node_id in supernode[sn_name]['node_ids']}
print("number of supernode ids index by node: {}".format(len(id_v_to_sn)))

number of supernode ids index by node: 2485


In [12]:
walks_sn = np.array([[str(id_v_to_sn[v_id]) for v_id in walk] for walk in walks_s])
print("shape of walks_sn: {}".format(walks_sn.shape))

shape of walks_sn: (24850, 100)


In [13]:
walks_sn[0]

array(['supernode1', 'supernode7', 'supernode6', 'supernode7',
       'supernode7', 'supernode7', 'supernode3', 'supernode7',
       'supernode7', 'supernode8', 'supernode7', 'supernode8',
       'supernode7', 'supernode8', 'supernode7', 'supernode8',
       'supernode7', 'supernode7', 'supernode7', 'supernode7',
       'supernode7', 'supernode7', 'supernode8', 'supernode8',
       'supernode8', 'supernode8', 'supernode8', 'supernode7',
       'supernode8', 'supernode7', 'supernode9', 'supernode8',
       'supernode9', 'supernode8', 'supernode9', 'supernode8',
       'supernode9', 'supernode8', 'supernode9', 'supernode8',
       'supernode9', 'supernode8', 'supernode9', 'supernode7',
       'supernode9', 'supernode7', 'supernode9', 'supernode8',
       'supernode9', 'supernode8', 'supernode6', 'supernode7',
       'supernode3', 'supernode7', 'supernode7', 'supernode7',
       'supernode6', 'supernode7', 'supernode6', 'supernode7',
       'supernode7', 'supernode7', 'supernode7', 'super

In [None]:
def supernode_edge_and_weight(supernode,walks_s):
    # dtype=float replaces the np.float alias, which was removed in NumPy 1.24
    weight = np.zeros((len(supernode), len(supernode)), dtype=float)
    
    return sn_w, sn_e
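
The body of `supernode_edge_and_weight` is not filled in above. As a rough sketch only, one plausible way to derive supernode edges and weights from the supernode walks is to count how often consecutive walk steps move between two different supernodes and normalise the counts. This is an assumption, not the method from the paper: the helper name `supernode_transition_weights`, the choice to ignore steps that stay inside one supernode, and the normalisation are all hypothetical.

```python
from collections import Counter

def supernode_transition_weights(walks_sn):
    """Count transitions between distinct supernodes in consecutive walk steps.

    Hypothetical helper: returns (edges, weights), where the weights are the
    undirected transition counts normalised to sum to 1.
    """
    counts = Counter()
    for walk in walks_sn:
        for a, b in zip(walk, walk[1:]):
            if a != b:  # assumption: skip steps that stay inside one supernode
                counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    edges = sorted(counts)
    weights = [counts[e] / total for e in edges] if total else []
    return edges, weights

# Toy supernode walks (made up for illustration).
toy_walks = [
    ["supernode1", "supernode7", "supernode7", "supernode8"],
    ["supernode8", "supernode7", "supernode1", "supernode1"],
]
sn_e, sn_w = supernode_transition_weights(toy_walks)
for edge, weight in zip(sn_e, sn_w):
    print(edge, weight)
```

Treating transitions as undirected matches the undirected graph used in this notebook; a directed variant would simply drop the `sorted` call.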

In [None]:
def cross_graph_weight(edge_t, v_ind_sn):
    return cg_w_l, cg_w_r

In [None]:
def update_target_graph_weight(g_nx_t):
    # part 1: weight on random walk over supernodes, sn_w
    # part 2: cg_w (two sides)
    
    return g_nx_t
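
The comments in `update_target_graph_weight` name two ingredients: the supernode weights `sn_w` and the cross-graph weights `cg_w` on both sides of a target edge. Purely as an illustration of how such ingredients could be combined, and not as the formula from the paper, the sketch below weights a target edge (u, v) by the bilinear form of u's cross-graph weights, the supernode weight matrix, and v's cross-graph weights. The function name `transferred_edge_weight` and all toy values are hypothetical.

```python
def transferred_edge_weight(cg_u, cg_v, sn_w):
    """Assumed bilinear combination: sum over a, b of cg_u[a] * sn_w[a][b] * cg_v[b]."""
    return sum(
        cg_u[a] * sn_w[a][b] * cg_v[b]
        for a in range(len(cg_u))
        for b in range(len(cg_v))
    )

# Toy inputs with two supernodes and one target edge (u, v); all values are made up.
sn_w = [[0.0, 1.0],
        [1.0, 0.0]]
cg_w = {"u": [1.0, 0.0], "v": [0.5, 0.5]}
target_edges = {("u", "v"): 1.0}  # weights initialised to 1, as in Input 2

updated = {
    (u, v): transferred_edge_weight(cg_w[u], cg_w[v], sn_w)
    for (u, v) in target_edges
}
print(updated)
```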

# Random Walk Distributions of Source and Target Graphs