# Creating a feature matrix from a networkx graph

In this notebook we will look at a few ways to quickly create a feature matrix from a networkx graph.

In [1]:
import networkx as nx
import pandas as pd
import pickle

# Load the pickled graph; the with block closes the file once loading is done
with open('major_us_cities', 'rb') as f:
    G = pickle.load(f)

## Node based features

In [12]:
list(G.nodes(data=True))[0:10]

[('El Paso, TX', {'population': 674433, 'location': (-106, 31)}),
 ('Long Beach, CA', {'population': 469428, 'location': (-118, 33)}),
 ('Dallas, TX', {'population': 1257676, 'location': (-96, 32)}),
 ('Oakland, CA', {'population': 406253, 'location': (-122, 37)}),
 ('Albuquerque, NM', {'population': 556495, 'location': (-106, 35)}),
 ('Baltimore, MD', {'population': 622104, 'location': (-76, 39)}),
 ('Raleigh, NC', {'population': 431746, 'location': (-78, 35)}),
 ('Mesa, AZ', {'population': 457587, 'location': (-111, 33)}),
 ('Arlington, TX', {'population': 379577, 'location': (-97, 32)}),
 ('Sacramento, CA', {'population': 479686, 'location': (-121, 38)})]

In [3]:
# Initialize the dataframe, using the nodes as the index
df = pd.DataFrame(index=G.nodes)
df

"El Paso, TX"
"Long Beach, CA"
"Dallas, TX"
"Oakland, CA"
"Albuquerque, NM"
"Baltimore, MD"
"Raleigh, NC"
"Mesa, AZ"
"Arlington, TX"
"Sacramento, CA"
"Wichita, KS"


### Extracting attributes

Using `nx.get_node_attributes` it's easy to extract the node attributes in the graph into DataFrame columns.

In [4]:
df['location'] = pd.Series(nx.get_node_attributes(G, 'location'))
df['population'] = pd.Series(nx.get_node_attributes(G, 'population'))

df.head()

| | location | population |
|---|---|---|
| El Paso, TX | (-106, 31) | 674433 |
| Long Beach, CA | (-118, 33) | 469428 |
| Dallas, TX | (-96, 32) | 1257676 |
| Oakland, CA | (-122, 37) | 406253 |
| Albuquerque, NM | (-106, 35) | 556495 |

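As a rough, self-contained illustration of the same pattern, the sketch below uses a hypothetical three-node graph (`toy`, with made-up attribute values) in place of the pickled city graph. It also shows one possible alternative, `pd.DataFrame.from_dict`, for pulling every node attribute into columns at once.

```python
import networkx as nx
import pandas as pd

# Hypothetical three-node graph standing in for the pickled city graph
toy = nx.Graph()
toy.add_node('A', population=100, location=(0, 0))
toy.add_node('B', population=200, location=(1, 0))
toy.add_node('C', population=300, location=(0, 1))
toy.add_edges_from([('A', 'B'), ('B', 'C')])

# get_node_attributes returns a plain dict keyed by node
populations = nx.get_node_attributes(toy, 'population')

# Wrapping the dict in a Series lets pandas align the values to the node index
node_df = pd.DataFrame(index=list(toy.nodes()))
node_df['population'] = pd.Series(populations)

# Alternative: turn the node -> attribute-dict mapping into one column per attribute
all_attrs = pd.DataFrame.from_dict(dict(toy.nodes(data=True)), orient='index')
```

The one-attribute-at-a-time form is convenient when only a few attributes are needed; the `from_dict` form may be simpler when every attribute should become a column.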

### Creating node based features

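Many networkx functions that compute per-node measures, such as `nx.clustering`, return a dictionary keyed by node, which can be added to our dataframe just as easily. `G.degree()` returns a `DegreeView` of (node, degree) pairs instead, so it has to be converted with `dict()` first. Passing the view straight to `pd.Series` produces a Series that does not align with the node index, which leaves the `degree` column empty, as in the output captured below.

In [5]:
df['clustering'] = pd.Series(nx.clustering(G))
# Convert the DegreeView to a dict so the Series is indexed by node name
df['degree'] = pd.Series(dict(G.degree()))

df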

| | location | population | clustering | degree |
|---|---|---|---|---|
| El Paso, TX | (-106, 31) | 674433 | 0.7 | |
| Long Beach, CA | (-118, 33) | 469428 | 0.745455 | |
| Dallas, TX | (-96, 32) | 1257676 | 0.763636 | |
| Oakland, CA | (-122, 37) | 406253 | 1.0 | |
| Albuquerque, NM | (-106, 35) | 556495 | 0.52381 | |
| Baltimore, MD | (-76, 39) | 622104 | 0.8 | |
| Raleigh, NC | (-78, 35) | 431746 | 0.615385 | |
| Mesa, AZ | (-111, 33) | 457587 | 0.75 | |
| Arlington, TX | (-97, 32) | 379577 | 0.763636 | |
| Sacramento, CA | (-121, 38) | 479686 | 0.777778 | |

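The following minimal sketch shows node-level features on a hypothetical four-node graph (`toy`: a triangle with one pendant node), chosen so the values are easy to check by hand. `nx.degree_centrality` is included only as one more example of a function that returns a node-keyed dictionary; it is not used elsewhere in this notebook.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy graph: a triangle A-B-C with a pendant node D attached to C
toy = nx.Graph([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])

features = pd.DataFrame(index=list(toy.nodes()))

# nx.clustering already returns a dict keyed by node
features['clustering'] = pd.Series(nx.clustering(toy))

# toy.degree() is a DegreeView of (node, degree) pairs; dict() makes it node-keyed
features['degree'] = pd.Series(dict(toy.degree()))

# Centrality functions such as degree_centrality also return node-keyed dicts
features['degree_centrality'] = pd.Series(nx.degree_centrality(toy))
```

Here `A` and `B` each have two neighbors that are themselves connected, so their clustering coefficient is 1.0, while `D` has a single neighbor and a coefficient of 0.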

## Edge based features

In [11]:
list(G.edges(data=True))[0:10]

[('El Paso, TX', 'Albuquerque, NM', {'weight': 367.88584356108345}),
 ('El Paso, TX', 'Mesa, AZ', {'weight': 536.256659972679}),
 ('El Paso, TX', 'Tucson, AZ', {'weight': 425.41386739988224}),
 ('El Paso, TX', 'Phoenix, AZ', {'weight': 558.7835703774161}),
 ('El Paso, TX', 'Colorado Springs, CO', {'weight': 797.7517116740046}),
 ('Long Beach, CA', 'Oakland, CA', {'weight': 579.5829987228403}),
 ('Long Beach, CA', 'Mesa, AZ', {'weight': 590.156204210031}),
 ('Long Beach, CA', 'Sacramento, CA', {'weight': 611.0649790490104}),
 ('Long Beach, CA', 'Tucson, AZ', {'weight': 698.6566667728368}),
 ('Long Beach, CA', 'San Jose, CA', {'weight': 518.2330606219175})]

In [7]:
# Initialize the dataframe, using the edges as the index
df = pd.DataFrame(index=G.edges())
df.head()

| | |
|---|---|
| El Paso, TX | Albuquerque, NM |
| El Paso, TX | Mesa, AZ |
| El Paso, TX | Tucson, AZ |
| El Paso, TX | Phoenix, AZ |
| El Paso, TX | Colorado Springs, CO |


### Extracting attributes

Using `nx.get_edge_attributes`, it's easy to extract the edge attributes in the graph into DataFrame columns.

In [8]:
df['weight'] = pd.Series(nx.get_edge_attributes(G, 'weight'))

df

| | | weight |
|---|---|---|
| El Paso, TX | Albuquerque, NM | 367.885844 |
| El Paso, TX | Mesa, AZ | 536.256660 |
| El Paso, TX | Tucson, AZ | 425.413867 |
| El Paso, TX | Phoenix, AZ | 558.783570 |
| El Paso, TX | Colorado Springs, CO | 797.751712 |
| ... | ... | ... |
| Detroit, MI | Columbus, OH | 263.423765 |
| Nashville-Davidson, TN | Milwaukee, WI | 770.146706 |
| Nashville-Davidson, TN | Columbus, OH | 536.274548 |
| Milwaukee, WI | Columbus, OH | 532.568423 |

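A small sketch of the edge-attribute pattern, using a hypothetical weighted triangle (`toy`) with made-up weights. Both the index and the attribute dictionary are derived from the same graph's edge iteration, so each `(u, v)` tuple has the same orientation in both and the values line up. `nx.to_pandas_edgelist` is shown as a possible alternative when a flat table with `source` and `target` columns is preferred over an edge-tuple index.

```python
import networkx as nx
import pandas as pd

# Hypothetical weighted triangle standing in for the city graph
toy = nx.Graph()
toy.add_edge('A', 'B', weight=1.5)
toy.add_edge('B', 'C', weight=2.0)
toy.add_edge('A', 'C', weight=0.5)

# A list of (u, v) tuples becomes a two-level index
edge_df = pd.DataFrame(index=list(toy.edges()))

# get_edge_attributes returns a dict keyed by the same (u, v) tuples
edge_df['weight'] = pd.Series(nx.get_edge_attributes(toy, 'weight'))

# Alternative: one row per edge, with endpoints and attributes as ordinary columns
edge_list = nx.to_pandas_edgelist(toy)
```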

### Creating edge based features

Many of the networkx functions related to edges return nested data structures. For example, `nx.preferential_attachment` yields `(u, v, score)` tuples, one per node pair passed in, so we can extract the score with a list comprehension.

In [9]:
# Each item is a (u, v, score) tuple; keep only the score
df['preferential attachment'] = [score for _, _, score in nx.preferential_attachment(G, df.index)]
df

| | | weight | preferential attachment |
|---|---|---|---|
| El Paso, TX | Albuquerque, NM | 367.885844 | 35 |
| El Paso, TX | Mesa, AZ | 536.256660 | 40 |
| El Paso, TX | Tucson, AZ | 425.413867 | 40 |
| El Paso, TX | Phoenix, AZ | 558.783570 | 45 |
| El Paso, TX | Colorado Springs, CO | 797.751712 | 30 |
| ... | ... | ... | ... |
| Detroit, MI | Columbus, OH | 263.423765 | 165 |
| Nashville-Davidson, TN | Milwaukee, WI | 770.146706 | 130 |
| Nashville-Davidson, TN | Columbus, OH | 536.274548 | 195 |
| Milwaukee, WI | Columbus, OH | 532.568423 | 150 |

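The list comprehension relies on the scores coming back in the same order as the index. As a hedged variant, the sketch below keys each score by its `(u, v)` pair instead, so pandas aligns the values by edge rather than by position. The graph `toy` is hypothetical, and `nx.jaccard_coefficient` is included only to show that other link-prediction functions in networkx yield the same `(u, v, score)` shape.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy graph: a triangle A-B-C with a pendant node D attached to C
toy = nx.Graph([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])
pairs = list(toy.edges())

# Each function yields (u, v, score) triples for the node pairs passed in
pref_attach = {(u, v): p for u, v, p in nx.preferential_attachment(toy, pairs)}
jaccard = {(u, v): p for u, v, p in nx.jaccard_coefficient(toy, pairs)}

edge_df = pd.DataFrame(index=pairs)
# Series built from pair-keyed dicts align to the edge index by label
edge_df['preferential attachment'] = pd.Series(pref_attach)
edge_df['jaccard'] = pd.Series(jaccard)
```

The preferential attachment score is the product of the two endpoint degrees, so the edge `('A', 'C')` scores 2 × 3 = 6 in this toy graph.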

When a function expects the two nodes to be passed in as separate arguments, we can map a lambda function over the index.

In [10]:
df['Common Neighbors'] = df.index.map(lambda city: len(list(nx.common_neighbors(G, city[0], city[1]))))
df

| | | weight | preferential attachment | Common Neighbors |
|---|---|---|---|---|
| El Paso, TX | Albuquerque, NM | 367.885844 | 35 | 4 |
| El Paso, TX | Mesa, AZ | 536.256660 | 40 | 3 |
| El Paso, TX | Tucson, AZ | 425.413867 | 40 | 3 |
| El Paso, TX | Phoenix, AZ | 558.783570 | 45 | 3 |
| El Paso, TX | Colorado Springs, CO | 797.751712 | 30 | 1 |
| ... | ... | ... | ... | ... |
| Detroit, MI | Columbus, OH | 263.423765 | 165 | 10 |
| Nashville-Davidson, TN | Milwaukee, WI | 770.146706 | 130 | 7 |
| Nashville-Davidson, TN | Columbus, OH | 536.274548 | 195 | 9 |
| Milwaukee, WI | Columbus, OH | 532.568423 | 150 | 6 |

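The same idea can be written with a small named helper instead of a lambda, which may read more clearly when several pairwise functions are applied. This is only a sketch: `count_common_neighbors` and the graph `toy` are hypothetical names introduced for illustration.

```python
import networkx as nx
import pandas as pd


def count_common_neighbors(graph, edge):
    """Return how many neighbors the two endpoints of `edge` share."""
    u, v = edge  # unpack the (u, v) tuple into the two arguments networkx expects
    return sum(1 for _ in nx.common_neighbors(graph, u, v))


# Hypothetical toy graph: a triangle A-B-C with a pendant node D attached to C
toy = nx.Graph([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])

edge_df = pd.DataFrame(index=list(toy.edges()))
# Iterating over the edge index yields one (u, v) tuple per row
edge_df['common neighbors'] = [count_common_neighbors(toy, edge) for edge in edge_df.index]
```

Each edge of the triangle shares exactly one neighbor (the third corner), while the pendant edge `('C', 'D')` shares none.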