# Stellar Graph Analytics
## Python Client Walkthrough
This notebook is designed to show the functionality of the Stellar Graph Analytics library v0.2.

This demo will show the following functionality from the Stellar library:

- **Construct A Graph from Relational Data**: Pull in multiple CSV datasets and convert to a graph

- **Run Entity Resolution**: Run basic entity resolution on the graph (demo only)

- **Predict A Node Attribute**: Run a predictive model to predict an unknown attribute

- **Export the Output for Gephi**: Export the graph to GraphML so that it can be visualised in Gephi

Note that this library is simply a prototype, and has limitations:

- Small graphs only (e.g. fewer than 50,000 nodes)

- Entity Resolution is limited to the CORA dataset

- Only CSV files are supported for the data ingestion

The dataset used in this demo is synthetic and derived from the CORA publication network. It describes an artificial database of people, each with a name, an address, and a 'risk' level.

In [None]:
import stellar as st

## Connect to Stellar

The first step is to connect to the backend, which will be processing the data.

The Python client sends REST commands to the backend to control the pipeline.

In [None]:
ss = st.create_session("localhost", 8000)

## The Dataset

In this example, we'll use the CORA-Terror dataset. It consists of two files:
- **people.csv**: People with names, addresses, and a terror-risk level
- **links.csv**: Associations between people

Note that this is a synthetic dataset and has no connection to the real world.

In [None]:
import pandas as pd

people = pd.read_csv('/tmp/stellar/people.csv')
# Preview the first ten rows (a [1:10] slice would skip row 0)
people.head(10)

In [None]:
links = pd.read_csv('/tmp/stellar/links.csv')
# Preview the first ten rows (a [1:10] slice would skip row 0)
links.head(10)
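
Before building a graph, it can help to confirm that every link refers to a person that actually exists. The sketch below uses only the standard library and small inline CSV strings. The column names `id`, `PERSON1`, and `PERSON2` follow the mappings used later in this notebook; the sample rows and the helper `dangling_links` are made up for illustration.

```python
import csv
import io

# Tiny stand-ins for people.csv and links.csv (illustrative rows only).
PEOPLE_CSV = """id,name,address,risk
1,Bill Johnson,1 Main St,low
2,Ann Lee,2 High St,high
3,Raj Patel,3 Park Rd,low
"""

LINKS_CSV = """PERSON1,PERSON2
1,2
2,3
3,9
"""

def dangling_links(people_csv, links_csv):
    """Return the links whose endpoints are missing from the people table."""
    people_ids = {row["id"] for row in csv.DictReader(io.StringIO(people_csv))}
    return [
        (row["PERSON1"], row["PERSON2"])
        for row in csv.DictReader(io.StringIO(links_csv))
        if row["PERSON1"] not in people_ids or row["PERSON2"] not in people_ids
    ]

# The last link points at id 9, which is not in the people table.
print(dangling_links(PEOPLE_CSV, LINKS_CSV))
```

An empty result means that every edge endpoint can be matched to a node id.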

## Create A Graph Schema
- First we need to define a schema for our graph.

- For this we need to define types for our **nodes** (vertices) and **edges** (links). For example, we may want a *Person* type connected to a *Business* type via a *works-for* edge.

- On both nodes and edges, you can also add **properties** to hold additional information.

- To create a schema, we can use `create_schema`.

- To add nodes and edges we use `add_node_type` and `add_edge_type` respectively.

In our example graph, we will describe people and their connections. We will describe our node as:
- **Person**
    - A name property
    - An address property
    - A risk level property
    - A connection (edge) to another Person (**Association**)

Note that the schema is arbitrary: you can define any schema you like to fit your relational data.

In [None]:
# homogeneous network of people and their associations
schema = (st.create_schema()
     .add_node_type(
        name='Person',
        attribute_types={
            'name': 'string',
            'address': 'string',
            'risk': 'string'
        })
     .add_edge_type(
        name='Association',
        src_type='Person',
        dst_type='Person'))
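
The chained calls above work because each `add_*` method returns the schema object itself. The following hypothetical stand-in class (`ToySchema` is not part of Stellar) sketches this builder pattern and also checks that an edge type only refers to node types that have already been declared.

```python
class ToySchema:
    """Hypothetical, simplified schema builder; not Stellar's implementation."""

    def __init__(self):
        self.node_types = {}
        self.edge_types = {}

    def add_node_type(self, name, attribute_types=None):
        self.node_types[name] = dict(attribute_types or {})
        return self  # returning self is what enables method chaining

    def add_edge_type(self, name, src_type, dst_type):
        # Reject edges whose endpoints were never declared as node types.
        for endpoint in (src_type, dst_type):
            if endpoint not in self.node_types:
                raise ValueError(f"unknown node type: {endpoint}")
        self.edge_types[name] = (src_type, dst_type)
        return self


toy = (ToySchema()
       .add_node_type("Person", {"name": "string", "address": "string", "risk": "string"})
       .add_edge_type("Association", src_type="Person", dst_type="Person"))

print(toy.node_types)
print(toy.edge_types)
```

Declaring node types before edge types, as the notebook does, keeps this kind of validation simple.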

## Connect Data to the Graph Schema

Now that we have our schema, we need to let Stellar know which column goes where.

To do this we need to create a **Mapping**. This mapping object defines where the schema can find its data. We use `create_map` to map our data sources to a part of the schema.

For each **node** type we need to tell Stellar which column holds the 'id'. Its value identifies each new node.

We can also assign columns to properties. Note that each property value is read from the same CSV row as the 'id' value.

Here we map the **people.csv** file onto the **Person** nodes, and we map the **links.csv** file onto the **Association** edges.

In [None]:
#
# First we create a mapping for the Person node. We use create_map
# to point to the 'people.csv' dataset. The identifier for the Person
# node is specified in the 'column' argument. This should be a unique
# identifier. All other properties can be mapped to columns using the
# map_attributes argument (property name -> CSV column name).
#
people_map = schema.node['Person'].create_map(
    path='people.csv',
    column='id',
    map_attributes={
        'name': 'name',
        'address': 'address',
        'risk': 'risk'
    }
)

#
# We define the person's 'Association' edges using the 'links.csv' file.
# For an edge we need to define a source (src) and destination (dst).
# These must point to the columns that contain the node ids of the src
# and dst node types defined in the schema.
#
assoc_map = schema.edge['Association'].create_map(
    path='links.csv',
    src='PERSON1',
    dst='PERSON2'
)
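
Conceptually, a mapping says which CSV column supplies the node id, which columns fill each property, and which columns hold the two endpoints of an edge. The sketch below applies mappings of that shape to a few inline rows in plain Python. The dictionaries `person_mapping` and `assoc_mapping` and the function `apply_mappings` are illustrative only and say nothing about how the Stellar backend is implemented.

```python
import csv
import io

SAMPLE_PEOPLE = "id,name,address,risk\n1,Bill Johnson,1 Main St,low\n2,Ann Lee,2 High St,high\n"
SAMPLE_LINKS = "PERSON1,PERSON2\n1,2\n"

# Same kind of information as the create_map calls: id column,
# property-to-column pairs, and the two edge endpoint columns.
person_mapping = {"column": "id",
                  "map_attributes": {"name": "name", "address": "address", "risk": "risk"}}
assoc_mapping = {"src": "PERSON1", "dst": "PERSON2"}

def apply_mappings(people_csv, links_csv, node_map, edge_map):
    """Turn CSV text into a {node_id: properties} dict and a list of edges."""
    nodes = {}
    for row in csv.DictReader(io.StringIO(people_csv)):
        nodes[row[node_map["column"]]] = {
            prop: row[col] for prop, col in node_map["map_attributes"].items()
        }
    edges = [(row[edge_map["src"]], row[edge_map["dst"]])
             for row in csv.DictReader(io.StringIO(links_csv))]
    return nodes, edges

toy_nodes, toy_links = apply_mappings(SAMPLE_PEOPLE, SAMPLE_LINKS,
                                      person_mapping, assoc_mapping)
print(toy_nodes)
print(toy_links)
```

Because the property columns and the id column are read from the same row, each property ends up on the node that row describes.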

## Create A Graph
Now that we have our dataset, our schema, and our mappings defined, we can tell Stellar to create our graph.

In [None]:
graph = ss.ingest(
    schema=schema,
    mappings=[people_map, assoc_map],
    timeout=60,
    label='papers'
)
graph

We've also included a `to_graphml` conversion function. We can use this to view our graph in **Gephi**.

In [None]:
graph.to_graphml('/tmp/stellar/graph-ingest.graphml')
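
GraphML is an XML format that Gephi can open directly. For readers curious about what such a file contains, here is a minimal standard-library sketch that serialises a two-node graph. It is not what `to_graphml` does internally; the function `toy_graphml` and the sample data are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def toy_graphml(nodes, edges):
    """Serialise {node_id: name} and [(src, dst)] into a GraphML string."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare a node attribute called "name" so viewers can display it.
    ET.SubElement(root, "key", {"id": "d0", "for": "node",
                                "attr.name": "name", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", edgedefault="undirected")
    for node_id, name in nodes.items():
        node = ET.SubElement(graph, "node", id=node_id)
        ET.SubElement(node, "data", key="d0").text = name
    for src, dst in edges:
        ET.SubElement(graph, "edge", source=src, target=dst)
    return ET.tostring(root, encoding="unicode")

xml_text = toy_graphml({"1": "Bill Johnson", "2": "Ann Lee"}, [("1", "2")])
print(xml_text)
```

Each node property is declared once as a `key` element and then referenced from `data` elements inside the nodes.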

## Entity Resolution

Stellar includes an Entity Resolution module. This module takes a graph object and matches fields on its nodes to determine which nodes may be identical. For example, a person may appear twice in the graph, once as 'Bill J. Johnson' and once as 'Bill Johnson'. The goal of the Entity Resolution module is to identify potential duplicates by resolving the true entities.

The module returns a new graph in which **is-same-as** edges connect nodes that the algorithm thinks may be duplicates. A confidence score is attached to each of these edges as a property.

Note that this module is a baseline only, and is configured to work with the CORA-Terror dataset only.
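
To make the idea concrete, the sketch below compares names with `difflib.SequenceMatcher` from the Python standard library and emits an **is-same-as** edge with a confidence score for every pair above a threshold. This is a simple stand-in, not the algorithm Stellar uses; the function `toy_entity_resolution`, the 0.85 threshold, and the sample names are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def toy_entity_resolution(names, threshold=0.85):
    """Return (id_a, id_b, 'is-same-as', confidence) for similar name pairs."""
    edges = []
    for (id_a, name_a), (id_b, name_b) in combinations(sorted(names.items()), 2):
        # ratio() is a similarity score between 0.0 and 1.0.
        confidence = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        if confidence >= threshold:
            edges.append((id_a, id_b, "is-same-as", round(confidence, 2)))
    return edges

sample_names = {"1": "Bill J. Johnson", "2": "Bill Johnson", "3": "Ann Lee"}
same_as_edges = toy_entity_resolution(sample_names)
for edge in same_as_edges:
    print(edge)
```

Comparing every pair of nodes is quadratic in the number of nodes, which is acceptable for a toy example but is one reason practical entity resolution usually narrows down candidate pairs first.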


## Prediction
This module uses machine learning to predict missing properties on nodes. Under the hood there is a small pipeline that:
- Performs a train/test split
- Trains a machine learning model
- Uses the model to predict missing values

The module also allows different models to be chosen. The models available in this demo are the following:

- **Node2Vec**: Creates machine learning features describing the structure of the graph

- **GCN**: The Graph Convolutional Network model uses deep learning to learn the surrounding structure of the graph
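
Node2Vec derives its features from random walks over the graph, which are then fed to a word2vec-style embedding model. The sketch below shows only the walk-generation step with uniform transitions, which corresponds to node2vec with its return and in-out parameters both set to 1. The function `uniform_random_walks` and the toy adjacency list are illustrative and are not taken from Stellar.

```python
import random

def uniform_random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks of up to walk_length nodes from every node."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(adjacency):
        for _ in range(walks_per_node):
            walk = [start]
            # Step to a uniformly chosen neighbour until the walk is long enough
            # or the current node has no neighbours.
            while len(walk) < walk_length and adjacency[walk[-1]]:
                walk.append(rng.choice(adjacency[walk[-1]]))
            walks.append(walk)
    return walks

toy_adjacency = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
walks = uniform_random_walks(toy_adjacency, walk_length=5, walks_per_node=2)
print(len(walks), walks[0])
```

Nodes that often appear close together in these walks end up with similar embeddings, which is how the graph structure becomes a feature vector.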

To run the module, a `target_attribute` needs to be chosen. This is the attribute that will be predicted. You can also use `attributes_to_ignore` to exclude attributes from the calculation.

The module also accepts heterogeneous graphs, that is, graphs with more than one node or edge type. These are handled automatically.
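
The three pipeline steps can be sketched on a toy graph with nothing but the standard library. The example below holds out some known labels as a test set, uses a trivial model that predicts the most common label among a node's labelled neighbours, and then fills in the node whose label is missing. It is a stand-in for illustration only, not Node2Vec or GCN, and every name in it (`neighbour_vote`, `toy_edges`, `toy_risk`) is hypothetical.

```python
from collections import Counter

# Toy graph: a "low" cluster (1-3) and a "high" cluster (4-7) joined by one edge.
toy_edges = [("1", "2"), ("1", "3"), ("2", "3"), ("3", "4"),
             ("4", "5"), ("4", "6"), ("5", "6"), ("5", "7"), ("6", "7")]
toy_risk = {"1": "low", "2": "low", "3": "low",
            "4": "high", "5": "high", "6": "high", "7": None}  # node 7 is unknown

toy_neighbours = {}
for a, b in toy_edges:
    toy_neighbours.setdefault(a, []).append(b)
    toy_neighbours.setdefault(b, []).append(a)

def neighbour_vote(node, known_labels):
    """Predict the most common label among a node's labelled neighbours."""
    votes = Counter(known_labels[n] for n in toy_neighbours[node] if n in known_labels)
    return votes.most_common(1)[0][0] if votes else None

# 1. Train/test split over the nodes whose label is known.
labelled = sorted(n for n, risk in toy_risk.items() if risk is not None)
test_nodes = labelled[::3]
train_labels = {n: toy_risk[n] for n in labelled if n not in test_nodes}

# 2. "Training" here just means keeping the training labels;
#    the held-out nodes are used to measure accuracy.
correct = sum(neighbour_vote(n, train_labels) == toy_risk[n] for n in test_nodes)
accuracy = correct / len(test_nodes)

# 3. Use every known label to predict the missing value.
all_labels = {n: toy_risk[n] for n in labelled}
predicted = {n: neighbour_vote(n, all_labels)
             for n, risk in toy_risk.items() if risk is None}

print(accuracy, predicted)
```

A real model replaces the neighbour vote with learned features, but the split, fit, and predict structure stays the same.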

In [None]:
graph_predicted = ss.predict(
    graph=graph,
    model=st.model.Node2Vec(),
    target_attribute='risk',
    node_type='Person',
    attributes_to_ignore=['name', 'address', '__id'],
    timeout=600
)

In [None]:
graph_predicted.to_graphml('/home/ubuntu/papers.graphml')