# Creating a feature matrix from a networkx graph

In this notebook we will look at a few ways to quickly create a feature matrix from a networkx graph.

In [1]:
import networkx as nx
import pandas as pd

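# Note: nx.read_gpickle was removed in networkx 3.0; on newer versions,
# load the file with the standard pickle module instead.
G = nx.read_gpickle('major_us_cities')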

## Node based features

In [2]:
G.nodes(data=True)

[('Mesa, AZ', {'location': (-111, 33), 'population': 457587}),
 ('New York, NY', {'location': (-74, 40), 'population': 8405837}),
 ('Phoenix, AZ', {'location': (-112, 33), 'population': 1513367}),
 ('El Paso, TX', {'location': (-106, 31), 'population': 674433}),
 ('Kansas City, MO', {'location': (-94, 39), 'population': 467007}),
 ('New Orleans, LA', {'location': (-90, 29), 'population': 378715}),
 ('Tucson, AZ', {'location': (-110, 32), 'population': 526116}),
 ('Long Beach, CA', {'location': (-118, 33), 'population': 469428}),
 ('Jacksonville, FL', {'location': (-81, 30), 'population': 842583}),
 ('Chicago, IL', {'location': (-87, 41), 'population': 2718782}),
 ('Virginia Beach, VA', {'location': (-75, 36), 'population': 448479}),
 ('Wichita, KS', {'location': (-97, 37), 'population': 386552}),
 ('San Francisco, CA', {'location': (-122, 37), 'population': 837442}),
 ('Nashville-Davidson, TN', {'location': (-86, 36), 'population': 634464}),
 ('Sacramento, CA', {'location': (-121, 38),

In [3]:
# Initialize the dataframe, using the nodes as the index
df = pd.DataFrame(index=G.nodes())

### Extracting attributes

Using `nx.get_node_attributes` it's easy to extract the node attributes in the graph into DataFrame columns.

In [4]:
df['location'] = pd.Series(nx.get_node_attributes(G, 'location'))
df['population'] = pd.Series(nx.get_node_attributes(G, 'population'))

df.head()

Unnamed: 0,location,population
"Mesa, AZ","(-111, 33)",457587
"New York, NY","(-74, 40)",8405837
"Phoenix, AZ","(-112, 33)",1513367
"El Paso, TX","(-106, 31)",674433
"Kansas City, MO","(-94, 39)",467007

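One detail worth knowing: `nx.get_node_attributes` only returns entries for nodes that actually carry the attribute, so pandas fills the remaining rows with `NaN` when the Series is aligned to the index. A minimal sketch on a made-up three-node graph (the graph `toy` and its values are hypothetical, not part of the cities data):

```python
import networkx as nx
import pandas as pd

# Hypothetical toy graph standing in for the cities graph
toy = nx.Graph()
toy.add_node('A', population=100, location=(0, 0))
toy.add_node('B', population=250, location=(1, 2))
toy.add_node('C', location=(3, 1))  # deliberately has no 'population'
toy.add_edges_from([('A', 'B'), ('B', 'C')])

node_df = pd.DataFrame(index=list(toy.nodes()))

# Only 'A' and 'B' appear in the returned dict; 'C' becomes NaN after alignment
node_df['population'] = pd.Series(nx.get_node_attributes(toy, 'population'))
node_df['location'] = pd.Series(nx.get_node_attributes(toy, 'location'))

print(node_df)
```

If missing values are a problem downstream, they can be handled afterwards with the usual pandas tools such as `fillna` or `dropna`.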

### Creating node based features

Most of the networkx functions related to nodes return a dictionary keyed by node, which can be added to our dataframe just as easily.

In [5]:
df['clustering'] = pd.Series(nx.clustering(G))
# dict() is needed on networkx 2.0+, where G.degree() returns a view instead of a dict
df['degree'] = pd.Series(dict(G.degree()))

df

Unnamed: 0,location,population,clustering,degree
"Mesa, AZ","(-111, 33)",457587,0.75,8
"New York, NY","(-74, 40)",8405837,0.833333,9
"Phoenix, AZ","(-112, 33)",1513367,0.694444,9
"El Paso, TX","(-106, 31)",674433,0.7,5
"Kansas City, MO","(-94, 39)",467007,0.472527,14
"New Orleans, LA","(-90, 29)",378715,0.607143,8
"Tucson, AZ","(-110, 32)",526116,0.75,8
"Long Beach, CA","(-118, 33)",469428,0.745455,11
"Jacksonville, FL","(-81, 30)",842583,0.5,4
"Chicago, IL","(-87, 41)",2718782,0.618182,11

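The same pattern should carry over to other per-node measures. The sketch below uses a hypothetical four-node graph (`toy`) and adds `nx.degree_centrality` as one more example of a function that returns a node-keyed dictionary; the column names are only illustrative.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy graph: a triangle A-B-C with a pendant node D attached to C
toy = nx.Graph([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])

feat = pd.DataFrame(index=list(toy.nodes()))

# Each call returns {node: value}; pd.Series aligns it to the node index
feat['clustering'] = pd.Series(nx.clustering(toy))
feat['degree'] = pd.Series(dict(toy.degree()))  # dict() handles the degree view
feat['degree_centrality'] = pd.Series(nx.degree_centrality(toy))

print(feat)
```

Because every column is aligned on the node labels, the order in which a function happens to return its nodes does not matter.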

## Edge based features

In [6]:
G.edges(data=True)

[('Mesa, AZ', 'Los Angeles, CA', {'weight': 596.6944422568216}),
 ('Mesa, AZ', 'San Diego, CA', {'weight': 502.32635614606744}),
 ('Mesa, AZ', 'El Paso, TX', {'weight': 536.256659972679}),
 ('Mesa, AZ', 'Phoenix, AZ', {'weight': 22.79553039579591}),
 ('Mesa, AZ', 'Tucson, AZ', {'weight': 157.26017307785148}),
 ('Mesa, AZ', 'Long Beach, CA', {'weight': 590.156204210031}),
 ('Mesa, AZ', 'Las Vegas, NV', {'weight': 429.8988310173471}),
 ('Mesa, AZ', 'Albuquerque, NM', {'weight': 514.5675468665884}),
 ('New York, NY', 'Virginia Beach, VA', {'weight': 461.65690712679395}),
 ('New York, NY', 'Columbus, OH', {'weight': 765.9634514216572}),
 ('New York, NY', 'Washington D.C.', {'weight': 327.37847523720734}),
 ('New York, NY', 'Cleveland, OH', {'weight': 649.4502672715655}),
 ('New York, NY', 'Boston, MA', {'weight': 305.91369197099885}),
 ('New York, NY', 'Raleigh, NC', {'weight': 680.9077807661519}),
 ('New York, NY', 'Baltimore, MD', {'weight': 272.38422181145063}),
 ('New York, NY', 'Phila

In [7]:
# Initialize the dataframe, using the edges as the index
df = pd.DataFrame(index=G.edges())

### Extracting attributes

Using `nx.get_edge_attributes`, it's easy to extract the edge attributes in the graph into DataFrame columns.

In [8]:
df['weight'] = pd.Series(nx.get_edge_attributes(G, 'weight'))

df

Unnamed: 0,weight
"(Mesa, AZ, Los Angeles, CA)",596.694442
"(Mesa, AZ, San Diego, CA)",502.326356
"(Mesa, AZ, El Paso, TX)",536.256660
"(Mesa, AZ, Phoenix, AZ)",22.795530
"(Mesa, AZ, Tucson, AZ)",157.260173
"(Mesa, AZ, Long Beach, CA)",590.156204
"(Mesa, AZ, Las Vegas, NV)",429.898831
"(Mesa, AZ, Albuquerque, NM)",514.567547
"(New York, NY, Virginia Beach, VA)",461.656907
"(New York, NY, Columbus, OH)",765.963451

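It can help to look at what `nx.get_edge_attributes` actually returns: a dictionary keyed by `(u, v)` tuples. The following sketch uses a hypothetical two-edge graph (`toy`) and builds the edge index explicitly as a pandas `MultiIndex`, which is an assumption about how you might want to store the edges rather than something the notebook requires.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy graph with a 'weight' attribute on every edge
toy = nx.Graph()
toy.add_edge('A', 'B', weight=1.5)
toy.add_edge('B', 'C', weight=2.0)

# The returned dict is keyed by (u, v) tuples
weights = nx.get_edge_attributes(toy, 'weight')
print(weights)

# One row per edge, indexed by the two endpoints
edge_df = pd.DataFrame(index=pd.MultiIndex.from_tuples(list(toy.edges())))
edge_df['weight'] = pd.Series(weights)

print(edge_df)
```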

### Creating edge based features

Many of the networkx functions related to edges return nested data structures. We can extract the relevant data using a list comprehension.

In [9]:
df['preferential attachment'] = [i[2] for i in nx.preferential_attachment(G, df.index)]

df

Unnamed: 0,weight,preferential attachment
"(Mesa, AZ, Los Angeles, CA)",596.694442,88
"(Mesa, AZ, San Diego, CA)",502.326356,88
"(Mesa, AZ, El Paso, TX)",536.256660,40
"(Mesa, AZ, Phoenix, AZ)",22.795530,72
"(Mesa, AZ, Tucson, AZ)",157.260173,64
"(Mesa, AZ, Long Beach, CA)",590.156204,88
"(Mesa, AZ, Las Vegas, NV)",429.898831,96
"(Mesa, AZ, Albuquerque, NM)",514.567547,56
"(New York, NY, Virginia Beach, VA)",461.656907,81
"(New York, NY, Columbus, OH)",765.963451,135

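To make the "nested" structure concrete: the link prediction functions yield `(u, v, score)` triples, one per node pair, in the order the pairs were passed in. Here is a small sketch on a hypothetical graph (`toy`); `nx.jaccard_coefficient` is included only as a second function with the same return shape.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy graph: a triangle A-B-C with a pendant node D attached to C
toy = nx.Graph([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])
pairs = list(toy.edges())

# Each function yields (u, v, score); keep only the score
pa = [score for _, _, score in nx.preferential_attachment(toy, pairs)]
jc = [score for _, _, score in nx.jaccard_coefficient(toy, pairs)]

edge_feat = pd.DataFrame(
    {'preferential attachment': pa, 'jaccard': jc},
    index=pd.MultiIndex.from_tuples(pairs),
)
print(edge_feat)
```

Unpacking the triple by name rather than indexing with `i[2]` is purely a readability choice; both extract the same value.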

In the case where the function expects two nodes to be passed in, we can map the index to a lambda function.

In [10]:
df['Common Neighbors'] = df.index.map(lambda city: len(list(nx.common_neighbors(G, city[0], city[1]))))

df

Unnamed: 0,weight,preferential attachment,Common Neighbors
"(Mesa, AZ, Los Angeles, CA)",596.694442,88,5
"(Mesa, AZ, San Diego, CA)",502.326356,88,5
"(Mesa, AZ, El Paso, TX)",536.256660,40,3
"(Mesa, AZ, Phoenix, AZ)",22.795530,72,7
"(Mesa, AZ, Tucson, AZ)",157.260173,64,7
"(Mesa, AZ, Long Beach, CA)",590.156204,88,5
"(Mesa, AZ, Las Vegas, NV)",429.898831,96,6
"(Mesa, AZ, Albuquerque, NM)",514.567547,56,4
"(New York, NY, Virginia Beach, VA)",461.656907,81,7
"(New York, NY, Columbus, OH)",765.963451,135,7

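If the lambda gets hard to read, the same idea can be written with a small named helper. This is only a sketch: `common_neighbor_count` and the graph `toy` are hypothetical names introduced here, not part of networkx or the cities data.

```python
import networkx as nx

# Hypothetical toy graph: a triangle A-B-C with a pendant node D attached to C
toy = nx.Graph([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])
pairs = list(toy.edges())


def common_neighbor_count(graph, pair):
    """Count the nodes adjacent to both endpoints of `pair`."""
    u, v = pair
    # Counting by iteration works whatever iterable type networkx returns here
    return sum(1 for _ in nx.common_neighbors(graph, u, v))


counts = [common_neighbor_count(toy, pair) for pair in pairs]
print(dict(zip(pairs, counts)))
```

The helper can be passed to `df.index.map` in the same way as the lambda, for example `df.index.map(lambda pair: common_neighbor_count(G, pair))`.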