# StellarGraph example: Load the CORA citation network

Import pandas, `os`, and the StellarGraph `to_edge_data` helper:

In [1]:
import pandas as pd
import os

from stellargraph.core.edge_data import to_edge_data

Using TensorFlow backend.


### Loading the CORA network

**Downloading the CORA dataset:**
    
The dataset used in this demo can be downloaded from [here](https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz).

The following is the description of the dataset:
> The Cora dataset consists of 2708 scientific publications classified into one of seven classes.
> The citation network consists of 5429 links. Each publication in the dataset is described by a
> 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary.
> The dictionary consists of 1433 unique words. The README file in the dataset provides more details.

Download and extract `cora.tgz` to a location on your computer, then set the `data_dir` variable to
the directory that contains `cora.cites` and `cora.content`.

In [2]:
data_dir = os.path.expanduser("~/data/cora")
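
Before loading anything, it can help to confirm that `data_dir` really contains both files. The following is a minimal sketch, not part of StellarGraph; the helper name `missing_cora_files` is hypothetical and only the two file names come from the text above.

```python
import os

# File names the extracted archive is expected to contain, per the text above.
EXPECTED_FILES = ("cora.cites", "cora.content")


def missing_cora_files(data_dir):
    """Return the expected CORA files that are not present in data_dir."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(data_dir, name))
    ]


# Same location as in the cell above; adjust it to wherever the archive was extracted.
data_dir = os.path.expanduser("~/data/cora")
missing = missing_cora_files(data_dir)
if missing:
    print("Missing from {}: {}".format(data_dir, ", ".join(missing)))
else:
    print("Found all expected files in {}".format(data_dir))
```

An empty list means both files were found, so the loading cells that follow should be able to open them.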

Load the edge list. Each row of `cora.cites` is in `cited-paper` <- `citing-paper` order, so the first column is the target and the second is the source:

In [3]:
edgelist = pd.read_csv(os.path.join(data_dir, "cora.cites"), sep='\t', header=None, names=["target", "source"])
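
To see how the column order maps onto citation direction, here is a small sketch that parses two tab-separated rows from memory instead of from `cora.cites`. The first row uses the IDs of the first edge shown later in this notebook; the second row (ID 9999) is made up for illustration.

```python
import io

import pandas as pd

# Rows in the cora.cites layout: cited paper first, citing paper second.
# The second row is a made-up example, not taken from the dataset.
sample = io.StringIO("35\t1033\n35\t9999\n")

edges = pd.read_csv(sample, sep="\t", header=None, names=["target", "source"])

# Read each row as "<source> cites <target>".
first = edges.iloc[0]
print("{} cites {}".format(first["source"], first["target"]))  # prints "1033 cites 35"
```

Naming the columns `target` then `source` keeps the file's column order while making the direction of each citation explicit.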

**Encapsulate the edge data**

In [4]:
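IS_DIRECTED = True  # a citation points from the citing paper to the cited paper
ed = to_edge_data(
    edgelist,
    IS_DIRECTED,
    "source",  # column holding the citing paper's ID
    "target",  # column holding the cited paper's ID
    default_edge_type="cites"
)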

In [5]:
print("Type of data = {}".format(type(ed).__name__))
print("Number of edges = {}".format(ed.num_edges()))
print("Is homogeneous? {}".format(ed.is_homogeneous()))

Type of data = PandasEdgeData
Number of edges = 5429
Is homogeneous? True


Examine the first edge:

In [6]:
print(next(iter(ed.edges())))

EdgeDatum(source_id=1033, target_id=35, edge_id=0, edge_type=cites)


Check against the raw data:

In [7]:
print("Number of edges = {}".format(len(edgelist)))
print("First row = {}".format(edgelist.iloc[0]))

Number of edges = 5429
First row = target      35
source    1033
Name: 0, dtype: int64
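
The comparison above is done by eye. As a rough sketch, the same check can be written as a small function. The `EdgeDatum` below is a stand-in `namedtuple` whose field names are copied from the printed output; it is an assumption that the real StellarGraph object exposes the same attributes. The helper name `matches_row` is hypothetical.

```python
from collections import namedtuple

# Stand-in for the EdgeDatum printed earlier; field names are taken from that output.
EdgeDatum = namedtuple("EdgeDatum", ["source_id", "target_id", "edge_id", "edge_type"])


def matches_row(edge, row):
    """Return True if the edge's endpoints equal the row's 'source' and 'target' values."""
    return edge.source_id == row["source"] and edge.target_id == row["target"]


# Values copied from the outputs shown in this notebook.
first_edge = EdgeDatum(source_id=1033, target_id=35, edge_id=0, edge_type="cites")
first_row = {"target": 35, "source": 1033}

print(matches_row(first_edge, first_row))  # prints "True"
```

In the notebook itself, the same function could presumably be called with `next(iter(ed.edges()))` and `edgelist.iloc[0]`, since a pandas row also supports lookup by column name.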
