Our GeoAI approach to geodemographic classification consists of four consecutive steps: **Spatial Graph Construction**, **Geo-spatial Embedding Generation**, **Canonical-correlation Analysis-based Embedding Generation** and **K-Means Clustering**. This notebook demonstrates the **Geo-spatial Embedding Generation** step. **Spatial Graph Construction** can be found in *Step1-GeoAIGeodemographicClassification-SpatialGraphConstruction.ipynb*; **Canonical-correlation Analysis-based Embedding Generation** and **K-Means Clustering** can be found in *Step3-GeoAIGeodemographicClassification.ipynb*.

**Step 2: Geo-spatial Embedding Generation**: GraphSAGE is an iterative algorithm, used here in its unsupervised form, that learns an embedding for every node in a graph. Each node is represented by aggregating the features of its neighbourhood. In other words, GraphSAGE learns node embeddings using the graph structure (the spatial graph constructed in *Step 1*) and the node features (167 z-scores from census data) as input.
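
To make the idea of "representing a node by aggregating its neighbourhood" concrete, here is a minimal NumPy sketch of a single mean-aggregation layer. It is an illustration only, not the StellarGraph implementation: the toy features, the adjacency lists and the random weight matrices `weight_self` and `weight_neigh` are all hypothetical.

```python
import numpy as np

def mean_aggregate(features, neighbours, weight_self, weight_neigh):
    """One GraphSAGE-style layer with a mean aggregator (illustrative only)."""
    out = []
    for node, neigh in enumerate(neighbours):
        neigh_mean = features[neigh].mean(axis=0)           # average the neighbours' features
        h = np.concatenate([features[node] @ weight_self,   # transform the node's own features
                            neigh_mean @ weight_neigh])     # transform the aggregated neighbourhood
        out.append(np.maximum(h, 0.0))                      # ReLU non-linearity
    out = np.array(out)
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.where(norms == 0, 1.0, norms)           # L2-normalise each embedding

# Toy graph: 4 nodes with 3 features each (hypothetical values, not census data)
features = np.array([[1.0, 0.0, 2.0],
                     [0.0, 1.0, 0.0],
                     [2.0, 2.0, 1.0],
                     [0.5, 0.5, 0.5]])
neighbours = [[1, 2], [0], [0, 3], [2]]

rng = np.random.default_rng(0)
weight_self = rng.normal(size=(3, 2))    # hypothetical, untrained weights
weight_neigh = rng.normal(size=(3, 2))

embeddings = mean_aggregate(features, neighbours, weight_self, weight_neigh)
print(embeddings.shape)  # (4, 4)
```

Stacking two such layers lets each node draw on information from its 2-hop neighbourhood, which is what the two-layer encoder built later in this notebook does.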

In [None]:
import pandas as pd
import networkx as nx

import numpy as np

import stellargraph as sg
from stellargraph.mapper import GraphSAGELinkGenerator
from stellargraph.layer import GraphSAGE, link_classification
from stellargraph.layer.graphsage import MeanAggregator, MaxPoolingAggregator
from stellargraph.data import UnsupervisedSampler
from sklearn.model_selection import train_test_split

from tensorflow import keras

Read the neighbouring information from the csv file saved in *Step 1* and convert it into a graph.

In [1]:
#Read neighbouring information from the saved csv file
#Specify where the csv file is located
edgelist = pd.read_csv('Data/Output/Spatial-Graph/SpatialGraphs.csv')

#Convert the neighbouring information into edges and nodes in the graph.
#Gnx is needed when the StellarGraph object is created later. This call assumes the csv has
#'source' and 'target' columns (the networkx defaults) and a 'neighbouring' column.
Gnx = nx.from_pandas_edgelist(edgelist, edge_attr="neighbouring")
# nx.set_node_attributes(Gnx, "oa", "neighbouring")

#Read heading information from the csv file
columns = pd.read_csv('Data/Input/Census-Data/GreaterLondon_2011_OAC_Raw_uVariables--zscores.csv', nrows=1).columns.tolist()

#The first column holds the area codes; the remaining columns are the census variables
graph_col = columns[1:]
#Read node values, treating the first row as the header and the first column (area codes) as the index
node_data = pd.read_csv('Data/Input/Census-Data/GreaterLondon_2011_OAC_Raw_uVariables--zscores.csv', sep=',', header=0, index_col=0)

#Assign z-scored census data to the nodes in the graph
node_features = node_data[graph_col]
#Note that StellarGraph needs to read nodes_features as formatted below
node_features

Unnamed: 0,u001,u002,u003,u004,u005,u006,u007,u008,u009,u010,...,u158,u159,u160,u161,u162,u163,u164,u165,u166,u167
E00023264,-0.095331,0.095331,0.055545,-0.055545,-0.103801,-0.044297,-0.076303,-0.040292,0.009062,-0.021628,...,-0.149701,-0.088053,0.436106,0.152557,-0.316420,0.120145,0.074700,0.201705,-0.848502,-0.307093
E00003359,-1.418286,1.418286,-0.560974,0.560974,0.048329,-0.442293,0.992393,-1.378535,-1.075055,-0.211811,...,-1.292145,-0.350277,0.468944,1.037567,-0.245804,-0.394771,-1.405887,0.102967,-0.646988,0.131114
E00023266,-1.371893,1.371893,0.186027,-0.186027,-0.218357,0.301579,0.272683,-0.124018,0.567885,1.285076,...,0.283671,-0.021209,-0.799988,-0.374153,-0.350084,-0.278863,1.607125,0.654204,0.967081,0.063448
E00020264,0.345864,-0.345864,0.186027,-0.186027,-0.089596,-0.141357,1.021013,-0.168908,-0.405402,-0.929133,...,-0.763256,0.353044,0.821773,0.418832,-0.357110,-1.505983,-0.416591,-0.038058,-0.070071,-0.383369
E00023263,-1.017971,1.017971,0.186027,-0.186027,-0.203236,0.209187,0.595391,-0.183896,0.332768,-0.610983,...,-0.877876,-0.628536,0.153515,-0.640127,-0.217981,-0.867562,0.688924,1.544802,0.823315,-0.095388
E00007412,-1.253213,1.253213,0.186027,-0.186027,-0.169327,0.210639,0.762391,2.263252,0.632166,-0.106779,...,0.123814,-1.140056,-1.045402,-0.863918,0.757327,1.321249,1.024746,0.776772,0.730610,0.568879
E00007413,-0.040293,0.040293,0.186027,-0.186027,-0.198653,0.270535,1.638754,1.063687,1.398593,0.015190,...,-1.655582,-0.275790,-0.765285,-1.646045,-0.863622,-0.242018,1.479263,1.940778,1.190198,1.023155
E00175260,-0.047131,0.047131,0.186027,-0.186027,-0.260055,2.193498,1.658488,2.814288,1.733825,0.955345,...,1.175945,-0.836752,-1.537505,-0.916767,0.050342,0.808891,1.250769,1.322407,0.466372,1.523466
E00175261,0.471880,-0.471880,0.186027,-0.186027,-0.232104,0.724844,0.734991,0.669807,1.146768,1.761611,...,-0.776797,-0.882745,-1.346617,-1.446362,0.182456,0.455521,0.042802,1.189588,0.992008,2.545052
E00022334,-0.221103,0.221103,0.186027,-0.186027,-0.152373,-0.078450,-0.093964,-0.779963,-0.993903,-0.362274,...,1.034214,-0.441980,1.411479,0.193345,-0.070762,-0.142090,-0.357918,-0.210547,-0.670315,-0.987635
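
Before the graph and the feature table are combined, it can be worth checking that every node in the edge list has a feature row, since the feature index and the graph node ids need to match. The sketch below shows one way to do this on toy data; the column names `source` and `target` and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy edge list and feature table (not the real London data)
edgelist = pd.DataFrame({"source": ["A", "A", "B"], "target": ["B", "C", "D"]})
node_features = pd.DataFrame(
    {"u001": [0.1, -0.2, 0.3], "u002": [1.0, 0.5, -1.5]},
    index=["A", "B", "C"],
)

# Every node id that appears at either end of an edge
graph_nodes = set(edgelist["source"]) | set(edgelist["target"])

# Nodes in the graph without a feature row, and feature rows that have no edges
missing_features = sorted(graph_nodes - set(node_features.index))
isolated_rows = sorted(set(node_features.index) - graph_nodes)

print(missing_features)  # ['D']
print(isolated_rows)     # []
```

An empty `missing_features` list means every graph node has features available.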


Use the *StellarGraph* API to load the graph structure together with the node features.

In [None]:
G = sg.StellarGraph(Gnx, node_features=node_features)

Specify the remaining sampler parameters: the root nodes, the number of walks to take per node and the length of each walk. No random seed is set here.

In [None]:
nodes = list(G.nodes())
number_of_walks = 1
length = 5

Create the *UnsupervisedSampler* instance with the relevant parameters. The *UnsupervisedSampler* class takes a *StellarGraph* graph instance. Its generator method produces an equal number of positive and negative node pair samples from the graph for training. The samples are generated by performing uniform random walks over the graph. Positive (target, context) node pairs are extracted from the walks, and for each positive pair a corresponding negative pair (target, node) is generated by randomly sampling a node from the degree distribution of the graph. Once batch_size samples have accumulated, the generator yields a list of positive and negative node pairs along with their respective 1/0 labels.

In [None]:
unsupervised_samples = UnsupervisedSampler(G, nodes=nodes, length=length, number_of_walks=number_of_walks)
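
The sampling procedure described above can be sketched in plain Python. This is a simplified illustration, not the StellarGraph code: `sample_pairs`, the toy adjacency lists and the plain degree-proportional negative sampling are assumptions, and the library may differ in detail (for example in how it weights the degree distribution).

```python
import random

def sample_pairs(adjacency, nodes, length, number_of_walks, seed=0):
    """Return (target, context, label) triples drawn from uniform random walks."""
    rng = random.Random(seed)
    # Each node appears once per incident edge, so choosing from this list
    # picks nodes in proportion to their degree
    degree_pool = [n for n, neigh in adjacency.items() for _ in neigh]
    samples = []
    for root in nodes:
        for _ in range(number_of_walks):
            walk = [root]
            while len(walk) < length:
                walk.append(rng.choice(adjacency[walk[-1]]))         # uniform step to a neighbour
            for context in walk[1:]:
                samples.append((root, context, 1))                   # positive pair from the walk
                samples.append((root, rng.choice(degree_pool), 0))   # matching negative pair
    return samples

# Hypothetical toy graph
adjacency = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
pairs = sample_pairs(adjacency, list(adjacency), length=5, number_of_walks=1)

print(len(pairs))  # 32
```

With 4 root nodes, 1 walk per node and 4 context nodes per walk, the sketch yields 16 positive and 16 negative pairs.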

Specify:

1. The minibatch size (number of node pairs per minibatch).
2. The number of epochs for training the model.
3. The sizes of the 1- and 2-hop neighbour samples for GraphSAGE; in our case, they are 28 and 8.

In [None]:
batch_size = 50
epochs = 4
num_samples = [28, 8]
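
These settings determine how many node feature vectors are gathered for each minibatch. The short calculation below is a back-of-the-envelope sketch; `nodes_per_head` is a hypothetical helper, and it assumes every node samples exactly the requested number of neighbours at each hop.

```python
def nodes_per_head(num_samples):
    """Count the nodes whose features are gathered for one head node."""
    total, layer = 1, 1            # start with the head node itself
    for n in num_samples:
        layer *= n                 # each node in the previous hop samples n neighbours
        total += layer
    return total

num_samples = [28, 8]
batch_size = 50

per_node = nodes_per_head(num_samples)   # 1 + 28 + 28 * 8
per_pair = 2 * per_node                  # a node pair has two head nodes
per_batch = batch_size * per_pair

print(per_node, per_pair, per_batch)     # 253 506 25300
```

Larger neighbour samples give each node more context but increase the cost of every training step accordingly.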

Next, create the node pair generator and connect it to the *UnsupervisedSampler*, so that training samples are generated on demand.

In [None]:
generator = GraphSAGELinkGenerator(G, batch_size, num_samples)
train_gen = generator.flow(unsupervised_samples)

Build the model: a 2-layer GraphSAGE encoder acting as the node representation learner, with a link classification layer applied to the combined (OA, OA) node embeddings.

In [None]:
layer_sizes = [50, 50]
graphsage = GraphSAGE(
    layer_sizes=layer_sizes, aggregator=MeanAggregator, generator=train_gen,
    bias=True, dropout=0.0, normalize="l2",
)

In [None]:
# Build the model and expose input and output sockets of graphsage, for node pair inputs:
x_inp, x_out = graphsage.build()

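The final node pair classification layer takes the pair of node embeddings produced by the GraphSAGE encoder, applies a binary operator to them (here `'avg'`, the element-wise average) to produce the corresponding node pair embedding, and passes that embedding through a dense layer to obtain the prediction.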

In [None]:
prediction = link_classification(
    output_dim=1, output_act="tanh", edge_embedding_method='avg'
)(x_out)
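
As a rough picture of what this layer computes for one node pair, the NumPy sketch below averages two embeddings and applies a single dense unit with a tanh activation. The embeddings, `weights` and `bias` are hypothetical, untrained values chosen for illustration.

```python
import numpy as np

def score_pair(emb_a, emb_b, weights, bias):
    """Average two node embeddings, then apply one dense unit with tanh."""
    pair_embedding = (emb_a + emb_b) / 2.0           # 'avg' binary operator
    return np.tanh(pair_embedding @ weights + bias)  # dense layer with output_dim=1

emb_a = np.array([0.6, 0.8, 0.0])      # hypothetical node embeddings
emb_b = np.array([0.0, 0.6, 0.8])
weights = np.array([1.0, -1.0, 0.5])   # hypothetical, untrained dense weights
bias = 0.0

score = score_pair(emb_a, emb_b, weights, bias)
print(score)
```

Because averaging is symmetric, swapping the two nodes gives the same score.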

Stack the GraphSAGE encoder and prediction layer into a Keras model, and specify the loss

In [None]:
model = keras.Model(inputs=x_inp, outputs=prediction)

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss=keras.losses.binary_crossentropy,
    metrics=[keras.metrics.binary_accuracy],
)
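
For reference, the loss and metric named above can be written out in a few lines of NumPy. This is a sketch of the standard formulas, assuming predictions that lie strictly between 0 and 1; the labels and predictions are made-up values.

```python
import numpy as np

def binary_crossentropy(labels, predictions, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    predictions = np.clip(predictions, eps, 1.0 - eps)
    losses = labels * np.log(predictions) + (1 - labels) * np.log(1 - predictions)
    return float(-np.mean(losses))

def binary_accuracy(labels, predictions, threshold=0.5):
    """Fraction of pairs whose thresholded prediction matches the label."""
    return float(np.mean((predictions > threshold) == (labels == 1)))

labels = np.array([1, 0, 1, 0])                 # 1 = positive pair, 0 = negative pair
predictions = np.array([0.9, 0.2, 0.6, 0.4])    # hypothetical model outputs

loss = binary_crossentropy(labels, predictions)
accuracy = binary_accuracy(labels, predictions)
print(round(loss, 4), accuracy)  # 0.3375 1.0
```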

Train the model

In [None]:
history = model.fit_generator(
    train_gen,
    epochs=epochs,
    verbose=1,
    use_multiprocessing=False,
    workers=4,
    shuffle=True,
)

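Extract the node embeddings: build a model that maps the source-node inputs to the source-node output of the trained encoder, then predict an embedding for every node.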

In [None]:
from stellargraph.mapper import GraphSAGENodeGenerator
import numpy as np

x_inp_src = x_inp[0::2]
x_out_src = x_out[0]
embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)

node_ids = node_data.index
node_gen = GraphSAGENodeGenerator(G, batch_size, num_samples).flow(node_ids)

node_embeddings = embedding_model.predict_generator(node_gen, workers=4, verbose=1)

X = node_embeddings
#Please specify where to save the extracted embeddings
np.save('Data/Output/Graph-Embedding/knn8_GraphSAGE.npy', X)
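
The slices `x_inp[0::2]` and `x_out[0]` used above pick out the source-node half of the link model. The sketch below illustrates the slicing with placeholder strings in place of tensors, assuming the inputs are interleaved as (source, destination) at each hop; the names are purely illustrative.

```python
# Placeholder names standing in for the Keras input/output tensors
x_inp = ["src_head", "dst_head", "src_hop1", "dst_hop1", "src_hop2", "dst_hop2"]
x_out = ["src_embedding", "dst_embedding"]

x_inp_src = x_inp[0::2]   # every second input, starting with the source head node
x_out_src = x_out[0]      # embedding of the source node

print(x_inp_src)  # ['src_head', 'src_hop1', 'src_hop2']
print(x_out_src)  # src_embedding
```

With `num_samples = [28, 8]` there is one input for the head node and one per hop, so the node-level embedding model takes three inputs.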