Our GeoAI approach to geodemographic classification consists of four consecutive steps: **Spatial Graph Construction**, **Geo-spatially Embedding Generation**, **Canonical-correlation Analysis-based Embedding Generation** and **K-Means clustering**. This notebook demonstrates the last two steps, **Canonical-correlation Analysis-based Embedding Generation** and **K-Means clustering**. The first two steps, **Spatial Graph Construction** and **Geo-spatially Embedding Generation**, can be found in *Step1-GeoAIGeodemographicClassification.ipynb* and *Step2-GeoAIGeodemographicClassification.ipynb*.

**Step 3 and 4**: CorrNet is a machine learning approach for learning common representations from heterogeneous sources of data (i.e., multimodal data). Its architecture is similar to a conventional single-view deep autoencoder, but it includes one encoder-decoder pair for each modality of data. We create a joint representation that maximises the correlation between geographic location (the graph-based embedding produced by GraphSAGE) and census data attributes.

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from keras.layers import Embedding, Input, LSTM, RepeatVector, Dense, Dropout,concatenate,Conv2D,UpSampling2D,MaxPooling2D,BatchNormalization,Activation,Add,GlobalMaxPool2D
from keras.models import Model
import keras.backend as K
from keras.engine.topology import Layer, InputSpec

Build a canonical-correlation analysis-based loss layer

In [None]:
class CorrnetCost(Layer):
    def __init__(self,lamda, **kwargs):
        super(CorrnetCost, self).__init__(**kwargs)
        self.lamda = lamda

    def cor(self,y1, y2, lamda):
        y1_mean = K.mean(y1, axis=0)
        y1_centered = y1 - y1_mean
        y2_mean = K.mean(y2, axis=0)
        y2_centered = y2 - y2_mean
        corr_nr = K.sum(y1_centered * y2_centered, axis=0)
        corr_dr1 = K.sqrt(K.sum(y1_centered * y1_centered, axis=0) + 1e-8)
        corr_dr2 = K.sqrt(K.sum(y2_centered * y2_centered, axis=0) + 1e-8)
        corr_dr = corr_dr1 * corr_dr2
        corr = corr_nr / corr_dr
        return K.sum(corr) * lamda

    def call(self ,x ,mask=None):
        h1=x[0]
        h2=x[1]

        corr = self.cor(h1,h2,self.lamda)

        #self.add_loss(corr,x)
        #we output junk but be sure to use it for the loss to be added
        return corr

    def get_output_shape_for(self, input_shape):
        #print input_shape[0][0]
        return (input_shape[0][0],input_shape[0][1])
    
#ZeroPadding layer replaces one view with zeros so that CorrNet can reconstruct from a single modality
class ZeroPadding(Layer):
    def __init__(self, **kwargs):
        super(ZeroPadding, self).__init__(**kwargs)

    def call(self, x, mask=None):
        return K.zeros_like(x)

    def get_output_shape_for(self, input_shape):
        return input_shape

def corr_loss(y_true, y_pred):
    #print y_true.type,y_pred.type
    #return K.zeros_like(y_pred)
    return y_pred
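
To make the quantity returned by `CorrnetCost.cor` easier to inspect, the sketch below mirrors it in plain NumPy. It is an illustration only: the function name `corr_term` and the toy arrays are hypothetical and not part of the notebook, and the function assumes two 2-D arrays with one row per sample.

```python
import numpy as np

def corr_term(h1, h2, lamda, eps=1e-8):
    """NumPy mirror of CorrnetCost.cor (hypothetical helper).

    Computes the Pearson correlation between h1 and h2 for every
    hidden dimension, sums over dimensions and scales by lamda.
    """
    h1_centered = h1 - h1.mean(axis=0)
    h2_centered = h2 - h2.mean(axis=0)
    numerator = (h1_centered * h2_centered).sum(axis=0)
    denominator = (np.sqrt((h1_centered ** 2).sum(axis=0) + eps)
                   * np.sqrt((h2_centered ** 2).sum(axis=0) + eps))
    return float((numerator / denominator).sum() * lamda)

# Toy check with 8 hidden dimensions, matching the size of the joint layer
rng = np.random.default_rng(0)
h = rng.normal(size=(200, 8))
aligned = corr_term(h, 2.0 * h + 1.0, lamda=-0.02)  # perfectly correlated views
opposed = corr_term(h, -h, lamda=-0.02)             # perfectly anti-correlated views
```

Because the layer is created with a negative `lamda`, perfectly correlated representations give the lowest value of this term, so minimising it pushes the two hidden representations towards higher correlation.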

We use Keras to define the architecture of the deep CorrNet. The network takes input from two modalities of the data, and each encoder consists of three dense layers.

In [None]:
lamda = 0.02
h_loss = 50

#Take inputs from two modalities of the data
#input tensor to the model, inpx = view1 and inpy = view2
inpx = Input(shape=(50,))
inpy = Input(shape=(167,))

#Three-layer architecture for each part of encoder
#Adding dense layers for view1, hx is the hidden representation for view1
hx = Dense(32,activation='sigmoid')(inpx)
hx = Dense(16, activation='sigmoid',name='hid_l1')(hx)
hx = Dense(8, activation='sigmoid',name='hid_l')(hx)

#Adding dense layers for view2, hy is the hidden representation for view2
hy = Dense(63,activation='sigmoid')(inpy)
hy = Dense(32, activation='sigmoid',name='hid_r1')(hy)
hy = Dense(8, activation='sigmoid',name='hid_r')(hy)

#Combine the encoded representations from both encoders
h = Add(name='combined_features')([hx,hy]) 

#Each decoder corresponds to each encoder
recx = Dense(50)(h)
recy = Dense(167)(h)

#Create an intermediate model
branchModel = Model( [inpx,inpy],[recx,recy,h])

#reconstruction from view1, view2 = 0-vector
[recx1,recy1,h1] = branchModel( [inpx, ZeroPadding()(inpy)])
#reconstruction from view2, view1 = 0-vector
[recx2,recy2,h2] = branchModel( [ZeroPadding()(inpx), inpy ])

#reconstruction from both views combined
[recx3,recy3,h] = branchModel([inpx, inpy])

#adding the correlation loss
corr=CorrnetCost(-lamda)([h1,h2])

#Create an intermediate model to extract the joint representation
feature_extraction = Model([inpx,inpy],h)   
model = Model( [inpx,inpy],[recy1,recx2,recx3,recx1,recy2,recy3,corr])
model.compile( loss=["mse","mse","mse","mse","mse","mse",corr_loss],optimizer="rmsprop")
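
Reading the compile call together with the `CorrnetCost(-lamda)` layer, the objective being minimised can be summarised as below. This is an interpretation of the code in this notebook, assuming Keras' default equal loss weights. The symbols are notation introduced here for illustration: $x$ and $y$ are the two views, $\hat{x}(v)$ and $\hat{y}(v)$ are the reconstructions of each view from input $v$, $h_j(\cdot)$ is the $j$-th of the 8 units of the combined layer, and $\lambda = 0.02$.

$$
\mathcal{L} \;=\; \sum_{v \,\in\, \{(x,0),\,(0,y),\,(x,y)\}} \Big[ \mathrm{MSE}\big(\hat{x}(v),\, x\big) + \mathrm{MSE}\big(\hat{y}(v),\, y\big) \Big] \;-\; \lambda \sum_{j=1}^{8} \mathrm{corr}\big(h_j(x,0),\; h_j(0,y)\big)
$$

The first sum covers the six reconstruction outputs (each view reconstructed from view 1 alone, from view 2 alone, and from both). The second term is the correlation computed per mini-batch by `CorrnetCost.cor`, entering with a negative sign so that higher correlation lowers the loss.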

Read data from each modality as input for the CorrNet. These are the geo-spatially aware embeddings created in the **Geo-spatially Embedding Generation** step and the z-scored census data.

In [None]:
#Read the geo-spatially aware embeddings created in the Geo-spatially Embedding Generation step
X = np.load('Data/Output/Graph-Embedding/knn8_GraphSAGE.npy')

#Reading the z-scored census data follows the same procedure used when assigning the data to each node,
#as described in the notebook GeoAIGeodemographicClassification-Step2
colums = pd.read_csv('Data/Input/Census-Data/GreaterLondon_2011_OAC_Raw_uVariables--zscores.csv', nrows=1).columns.tolist()
graph_col = colums[1:]
node_data = pd.read_csv('Data/Input/Census-Data/GreaterLondon_2011_OAC_Raw_uVariables--zscores.csv',  sep=',', header=None, names=graph_col)

node_features = node_data[graph_col]
node=np.asarray(node_features.values.tolist())
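
Before training, it can help to confirm that the two views line up row by row and match the input sizes declared in the model (50 and 167). The sketch below is one possible check. The helper `check_views` is hypothetical and not part of the notebook, and the demo uses random arrays rather than the real data.

```python
import numpy as np

def check_views(view_left, view_right, left_dim=50, right_dim=167):
    """Return a list of problems found in the two views (empty if none).

    Hypothetical helper: left_dim and right_dim default to the Input
    shapes used in the CorrNet definition.
    """
    arrays = {}
    problems = []
    for name, view in (("left", view_left), ("right", view_right)):
        try:
            arrays[name] = np.asarray(view, dtype=float)
        except (TypeError, ValueError):
            problems.append(f"{name} view contains non-numeric values")
    if problems:
        return problems

    left, right = arrays["left"], arrays["right"]
    if left.ndim != 2 or right.ndim != 2:
        return ["both views must be 2-D arrays"]
    if left.shape[0] != right.shape[0]:
        problems.append("views have different numbers of rows")
    if left.shape[1] != left_dim:
        problems.append(f"left view has {left.shape[1]} columns, expected {left_dim}")
    if right.shape[1] != right_dim:
        problems.append(f"right view has {right.shape[1]} columns, expected {right_dim}")
    if not (np.isfinite(left).all() and np.isfinite(right).all()):
        problems.append("views contain NaN or infinite values")
    return problems

# Demo on random stand-in data
rng = np.random.default_rng(0)
ok = check_views(rng.normal(size=(10, 50)), rng.normal(size=(10, 167)))
bad = check_views(rng.normal(size=(10, 50)), rng.normal(size=(9, 167)))
```

In the notebook this would be called as `check_views(X, node)`. A non-empty result, for example one reporting non-numeric values, would suggest revisiting how the CSV file is parsed.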

Training

In [None]:
X_train_l = X
X_train_r = node
model.fit([X_train_l,X_train_r], [X_train_r,X_train_l,X_train_l,X_train_l,X_train_r,X_train_r,np.zeros((X_train_l.shape[0],h_loss))],
                  nb_epoch=300,
                  batch_size=140)
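
The order of the targets passed to `fit` has to match the order of the model outputs `[recy1, recx2, recx3, recx1, recy2, recy3, corr]`. The sketch below spells out that pairing. The helper `build_targets` is hypothetical and simply reproduces the target list used above, with stand-in arrays for the demo.

```python
import numpy as np

def build_targets(view_left, view_right, h_loss=50):
    """Targets in the order of the model outputs (hypothetical helper)."""
    # Placeholder target for the correlation output; corr_loss ignores y_true
    dummy = np.zeros((view_left.shape[0], h_loss))
    return [
        view_right,  # recy1: view 2 reconstructed from view 1 alone
        view_left,   # recx2: view 1 reconstructed from view 2 alone
        view_left,   # recx3: view 1 reconstructed from both views
        view_left,   # recx1: view 1 reconstructed from view 1 alone
        view_right,  # recy2: view 2 reconstructed from view 2 alone
        view_right,  # recy3: view 2 reconstructed from both views
        dummy,       # corr: value comes from the CorrnetCost layer
    ]

# Demo with stand-in arrays of the same widths as the two views
targets = build_targets(np.ones((5, 50)), np.ones((5, 167)))
target_shapes = [t.shape for t in targets]
```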

Extract the joint representation from the intermediate layer

In [None]:
extraced_feat= feature_extraction.predict([X_train_l,X_train_r])
np.save('Data/Output/Corrnet-Embedding/corrnet.npy',extraced_feat)

Feed the extracted representations to the K-Means clustering algorithm

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=8, random_state=0).fit(extraced_feat)
labels = pd.DataFrame(kmeans.labels_)
#Convert the clusters created by K-Means to csv file
labels.to_csv('Data/Output/Geodemographic-Clusters/Clusters.csv')
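
The notebook fixes the number of clusters at 8. As an optional diagnostic that is not part of the original workflow, the sketch below compares several values of k using scikit-learn's silhouette score. The helper `silhouette_by_k` and the toy data are hypothetical. In the notebook it would be applied to `extraced_feat`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(features, k_values, random_state=0):
    """Fit K-Means for each k and return {k: silhouette score} (hypothetical helper)."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, random_state=random_state,
                        n_init=10).fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores

# Toy data: three well-separated 2-D blobs standing in for the embeddings
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centres])

scores = silhouette_by_k(toy, k_values=range(2, 6))
best_k = max(scores, key=scores.get)
```

A higher silhouette score indicates more compact and better separated clusters. It is only one of several criteria that can inform the choice of k, alongside the interpretability of the resulting geodemographic groups.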