# Gremlin Getting Started

# Tinkerpop, Gremlin, Python Example

A sandbox repo showing how to use Gremlin from Python against a TinkerPop graph database.

This repo assumes you are running a local Gremlin Server with the pre-initialized *modern* database:

```./bin/gremlin-server.sh conf/gremlin-server-modern.yaml```

This will create the *modern* pre-initialized database as documented in the [Getting Started](http://tinkerpop.apache.org/docs/current/tutorials/getting-started/) tutorial.

## Apache Tinkerpop

[Tinkerpop](http://tinkerpop.apache.org)

Graph computing framework for both graph databases and graph analytic systems.

From the project site:
```
Apache TinkerPop™ is an open source, vendor-agnostic, graph computing
framework distributed under the commercial friendly Apache2 license.
 When a data system is TinkerPop-enabled, its users are able to model
 their domain as a graph and analyze that graph using the Gremlin
 graph traversal language.
```


## AWS Neptune

All of this work is to get familiar with *Gremlin* in anticipation of working with *Neptune*, the AWS cloud graph database.

```
Amazon Neptune supports popular graph models Property Graph and
W3C's RDF, and their respective query languages Apache TinkerPop Gremlin
and SPARQL, allowing you to easily build queries that efficiently
navigate highly connected datasets.
```

[AWS Neptune](https://docs.aws.amazon.com/neptune/latest/userguide/intro.html)

Running locally first, to learn how to interface Python with Gremlin, is faster and cheaper than starting on Neptune right away.

## Environment

I created this on macOS with Python 3.6.1 and Apache TinkerPop 3.3.3.

## Instructions

- First, run the gremlin server as shown above

- Second, create a python virtualenv

- Third, **install the requirements**: `pip install -r requirements.txt`

- Fourth, run the simple_test.py file.
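
The contents of `simple_test.py` are not reproduced in this README, but a connectivity check along the following lines is a reasonable first step. This is a minimal sketch, not the repo's actual script: the helper names `connect` and `count_vertices` are hypothetical, and the URL assumes the default local endpoint used later in this document.

```python
def connect(url='ws://localhost:8182/gremlin', traversal_source='g'):
    """Open a remote connection and return (g, connection).

    The gremlin_python imports live inside the function so this module
    can be imported even where the driver is not installed.
    """
    from gremlin_python.structure.graph import Graph
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

    connection = DriverRemoteConnection(url, traversal_source)
    g = Graph().traversal().withRemote(connection)
    return g, connection


def count_vertices(g):
    """Return the number of vertices visible through traversal source g."""
    return g.V().count().next()
```

With the server running, `g, connection = connect()` followed by `print(count_vertices(g))` should print 6 for the *modern* data set. Call `connection.close()` when you are done.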

## Additional Reading

- http://tinkerpop.apache.org/docs/current/tutorials/getting-started/
- https://www.datastax.com/dev/blog/a-gremlin-implementation-of-the-gremlin-traversal-machine
- https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html
- https://github.com/nedlowe/gremlin-python-example
- http://tinkerpop.apache.org/docs/3.3.3/reference/#gremlin-python
- https://docs.aws.amazon.com/neptune/latest/userguide/intro.html

### Start Gremlin Server

There is a shell script, *start_gremlin.sh*, that you can adapt to your local installation.

This script starts the GraphDB with an initialized set of vertices and edges: the *modern* data set described in the *Getting Started* guide (link above).

![Modern GraphDB](images/ModernGraphDB.png)

In [2]:
from gremlin_python import statics
from gremlin_python.structure.graph import Graph
from gremlin_python.process.traversal import T, P, Operator, Order
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection


In [3]:
# Helper function to return the properties of a Vertex as a dict (a JSON-like document).
# Note: despite the parameter name, this expects a traversal such as g.V(1),
# not an unwrapped Vertex object.
def vertex_to_json(vertex):
    # TODO - Almost certainly a better way of doing this
    # True - also include the Vertex tokens: id and label
    values = vertex.valueMap(True).toList()[0]
    # values["id"] = vertex.id
    return values


### Create Graph object and create a GraphTraversal instance

This notebook assumes the Gremlin Server is running locally and has already been started.

In [4]:
graph = Graph()
g = graph.traversal().withRemote(DriverRemoteConnection('ws://localhost:8182/gremlin','g'))

At this point *g* is the interface to the Graph database.

Get a list of all Vertex objects in the GraphDB.  Since this data set is small, fetching everything is not an issue.

In [5]:
g.V().toList()

[v[1], v[2], v[3], v[4], v[5], v[6]]

#### toList

V() returns an iterable traversal, so we have to iterate through it or use something like toList to actually get the values.

The Vertex objects returned are *shallow* Vertex objects that contain only the *id* and *label* of the Vertex.  None of the additional properties are returned.  To get those properties you have to query them through a traversal that starts at that Vertex, for example by id.

In [6]:
v1 = g.V(1)
print(type(v1))

<class 'gremlin_python.process.graph_traversal.GraphTraversal'>


Notice the type of the object returned: it is a GraphTraversal, not a Vertex, which means it is an iterable over the result that we still have to unwrap.

In [7]:
v1 = v1.toList()
print(v1)
print(type(v1))

[v[1]]
<class 'list'>


But this returns a list containing one Vertex, which we still have to index to get at the Vertex itself.

In [8]:
v1 = v1[0]

In [9]:
print(type(v1))
print(v1)

<class 'gremlin_python.structure.graph.Vertex'>
v[1]
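
The unwrap-then-index sequence above can be folded into a small helper. This is a sketch under assumptions: `first_or_none` is a hypothetical name, not part of gremlin_python, and it relies only on the `limit` and `toList` steps of a traversal.

```python
def first_or_none(traversal):
    """Return the first result of a traversal, or None if it is empty.

    limit(1) asks for at most one result, and toList() executes the
    traversal and collects whatever comes back.
    """
    results = traversal.limit(1).toList()
    return results[0] if results else None
```

For example, `first_or_none(g.V(1))` would return the Vertex `v[1]`, while an id that does not exist yields `None` rather than an `IndexError` from indexing an empty list.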


In [10]:
print(vertex_to_json(v1))

AttributeError: 'Vertex' object has no attribute 'valueMap'

How can that be?  I have a Vertex object but there is no *valueMap* attribute.  That is because *valueMap* is a step on the GraphTraversal, not a method on the Vertex.  To get to the additional properties, do not unwrap the traversal into a Vertex first.

Instead, pass the GraphTraversal itself into vertex_to_json, which calls valueMap and then toList.

In [11]:
v1 = g.V(1)
v1_json = vertex_to_json(v1)

In [12]:
print(v1_json)

{'name': ['marko'], <T.id: 1>: 1, <T.label: 3>: 'person', 'age': [29]}


Get each of the values from the Vertex.  Note that property values come back as lists, while *T.id* and *T.label* are single values.

In [14]:
print(v1_json['name'])

['marko']


In [15]:
print(v1_json['age'])

[29]


In [16]:
print(v1_json[T.id])

1


In [17]:
print(v1_json[T.label])

person
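
Because the property values are wrapped in lists and the id and label are keyed by enum members, the result is not directly JSON-serializable. A minimal sketch of one way to flatten it follows. The function name `flatten_value_map` is hypothetical, and `FakeT` is a stand-in for gremlin_python's `T` enum so the snippet runs without the driver installed.

```python
import json
from enum import Enum

# Stand-in for gremlin_python.process.traversal.T (assumption for this demo).
FakeT = Enum('FakeT', 'id label')


def flatten_value_map(value_map):
    """Turn a valueMap(True) result into a plain, JSON-friendly dict."""
    flat = {}
    for key, value in value_map.items():
        # Enum keys such as T.id and T.label become the strings 'id' and 'label'.
        name = getattr(key, 'name', key)
        # Property values arrive as lists; unwrap single-element lists.
        if isinstance(value, list) and len(value) == 1:
            value = value[0]
        flat[name] = value
    return flat


sample = {'name': ['marko'], FakeT.id: 1, FakeT.label: 'person', 'age': [29]}
print(json.dumps(flatten_value_map(sample), sort_keys=True))
# {"age": 29, "id": 1, "label": "person", "name": "marko"}
```

Unwrapping only single-element lists is a design choice: a property that genuinely holds several values stays a list.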


Get a count of all of the Vertex objects.  Notice that we have to call *next* because the return from count is an iterable, and *next* steps through it to fetch the first (and here only) value.

In [18]:
c = g.V().count().next()
print(c)

6


In [19]:
for x in g.V():
    print(f'V[id, label]: {x.id}, {x.label}')
    print(f'\t{vertex_to_json(g.V(x.id))}')
    print('\n')


V[id, label]: 1, person
	{'name': ['marko'], <T.id: 1>: 1, <T.label: 3>: 'person', 'age': [29]}


V[id, label]: 2, person
	{'name': ['vadas'], <T.id: 1>: 2, <T.label: 3>: 'person', 'age': [27]}


V[id, label]: 3, software
	{'name': ['lop'], <T.id: 1>: 3, 'lang': ['java'], <T.label: 3>: 'software'}


V[id, label]: 4, person
	{'name': ['josh'], <T.id: 1>: 4, <T.label: 3>: 'person', 'age': [32]}


V[id, label]: 5, software
	{'name': ['ripple'], <T.id: 1>: 5, 'lang': ['java'], <T.label: 3>: 'software'}


V[id, label]: 6, person
	{'name': ['peter'], <T.id: 1>: 6, <T.label: 3>: 'person', 'age': [35]}




Get the details for a single Vertex in one line

In [20]:
values = g.V(3).valueMap(True).toList()[0]
print(values)

{'name': ['lop'], <T.id: 1>: 3, 'lang': ['java'], <T.label: 3>: 'software'}


### Using the GraphDB to answer questions about the data

Let's look at how to use the GraphDB and Gremlin syntax to answer questions about the data.  Using the edges, we can start to find relationships in the data that might otherwise be difficult to uncover.

#### How many people does Marko know?

In [21]:
count = g.V().has('name','marko').out('knows').count().next()
print(count)

2


The above statement says: find all of the Vertex objects that have a *name* property with value *marko*.  For each of those Vertex objects, follow the outbound edges labeled *knows* and count the Vertex objects reached.

Remember the API always returns an iterable, so we have to step next to get the actual values.
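
To make the semantics of that query concrete, here is a plain-Python sketch that computes the same answer over an in-memory copy of the *modern* graph's edges. It does not use gremlin_python or a server. The `EDGES` list and the `out` helper are illustrative stand-ins, not part of any library.

```python
# Modern-graph edges as (source name, edge label, target name) tuples.
EDGES = [
    ('marko', 'knows', 'vadas'),
    ('marko', 'knows', 'josh'),
    ('marko', 'created', 'lop'),
    ('josh', 'created', 'ripple'),
    ('josh', 'created', 'lop'),
    ('peter', 'created', 'lop'),
]


def out(names, label):
    """Follow outbound edges with the given label from each name in names."""
    return [dst for name in names
            for (src, lbl, dst) in EDGES
            if src == name and lbl == label]


known = out(['marko'], 'knows')
print(len(known))  # 2
```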

#### What are the names of the people that Marko knows?

In [22]:
people = g.V(1).outE('knows').inV().values('name').toList()
print(people)

['vadas', 'josh']


The above statement can be interpreted as:
Get the Vertex with id=1.  Take the outbound edges labeled *knows*.  At this point the traverser is sitting on the edge.  Next, the *inV* call says to move to the Vertex the edge points into, and *values* gets its *name* property.

If you do not need to stop at the edge, you can call:

*.out('knows')* which is the same as *.outE('knows').inV()*


In [24]:
people = g.V(1).out('knows').values('name').toList()
print(people)

['vadas', 'josh']
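
The reason to stop at the edge is usually to read or filter on an edge property. The plain-Python sketch below models the two *knows* edges of the *modern* graph, including their *weight* property, to show that `out` is just `outE` followed by `inV`. The helper names `out_e`, `in_v`, and `out` are illustrative and do not come from gremlin_python.

```python
# The two 'knows' edges of the modern graph, with their weight property.
KNOWS_EDGES = [
    {'out': 'marko', 'label': 'knows', 'in': 'vadas', 'weight': 0.5},
    {'out': 'marko', 'label': 'knows', 'in': 'josh', 'weight': 1.0},
]


def out_e(name, label):
    """Return the outbound edges of `name` with the given label."""
    return [e for e in KNOWS_EDGES if e['out'] == name and e['label'] == label]


def in_v(edges):
    """Move from each edge to the vertex it points into."""
    return [e['in'] for e in edges]


def out(name, label):
    """Shortcut that skips the edge: the same as out_e followed by in_v."""
    return in_v(out_e(name, label))


edges = out_e('marko', 'knows')
print([e['weight'] for e in edges])  # [0.5, 1.0]
print(in_v(edges) == out('marko', 'knows'))  # True
```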


#### Who does Marko know that is older than 30?

In [25]:
people = g.V(1).out('knows').has('age', P.gt(30)).values('name').toList()
print(people)

['josh']


#### List the people that are older than 30 from oldest to youngest

In this example, no edges are traversed.  We are just searching for any Vertex with label *person* whose *age* is greater than 30, and ordering the results from oldest to youngest.

In [26]:
people = g.V().hasLabel('person').has('age',P.gt(30)).order().by('age',Order.decr).values('name').toList()
print(people)

['peter', 'josh']
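
For comparison, the same filter-then-order logic in plain Python over the four *person* vertices of the *modern* graph looks like this. It is only an illustration of what the query computes; the `PEOPLE` list is a hand-written stand-in for the data on the server.

```python
# (name, age) for each person vertex in the modern graph.
PEOPLE = [('marko', 29), ('vadas', 27), ('josh', 32), ('peter', 35)]

# has('age', P.gt(30)): keep only people older than 30.
older = [person for person in PEOPLE if person[1] > 30]

# order().by('age', Order.decr): sort by age, oldest first.
older.sort(key=lambda person: person[1], reverse=True)

# values('name'): project each remaining person to their name.
names = [name for name, _age in older]
print(names)  # ['peter', 'josh']
```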


By loading the statics of Gremlin-Python, the class prefixes (e.g. P, Order) can be omitted.
For example, P.gt becomes gt and Order.decr becomes decr.  The IDE cannot see names that are
injected at runtime and flags them as errors, but the code will execute.


In [27]:
statics.load_statics(globals())

Same query as above, without the **P** and **Order** prefixes

In [28]:
people = g.V().hasLabel('person').has('age',gt(30)).order().by('age',decr).values('name').toList()
print(people)

['peter', 'josh']


#### What is the average age of the friends of the people who created LinkedProcess?

In [29]:
avg_age = g.V().has('name','lop').in_('created').out('knows').dedup().age.mean().next()
print(avg_age)

29.5


The above statement can be interpreted as: get the Vertex with the name *lop*.  For all incoming edges with label *created*, go to the source Vertex.  From those Vertex objects, follow the outbound edges labeled *knows*.  Deduplicate the resulting Vertex objects, get the *age* property (here *.age* is shorthand for *.values('age')*), and average all of the values.
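
The steps above can be followed one at a time in plain Python over an in-memory copy of the *modern* graph. This is a sketch for checking the reasoning, not a replacement for the query. The data structures are hand-written stand-ins.

```python
from statistics import mean

AGES = {'marko': 29, 'vadas': 27, 'josh': 32, 'peter': 35}
EDGES = [
    ('marko', 'knows', 'vadas'),
    ('marko', 'knows', 'josh'),
    ('marko', 'created', 'lop'),
    ('josh', 'created', 'ripple'),
    ('josh', 'created', 'lop'),
    ('peter', 'created', 'lop'),
]

# in_('created') from lop: who created it?
creators = [src for (src, lbl, dst) in EDGES if dst == 'lop' and lbl == 'created']

# out('knows') from each creator: their friends.
friends = [dst for c in creators
           for (src, lbl, dst) in EDGES if src == c and lbl == 'knows']

# dedup(): drop repeats while keeping first-seen order.
unique_friends = list(dict.fromkeys(friends))

# values('age').mean(): average the ages.
avg_age = mean(AGES[name] for name in unique_friends)
print(creators, unique_friends, avg_age)
```

Only marko has outbound *knows* edges, so the friends are vadas (27) and josh (32), which average to 29.5.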

#### Rank all people by how central they are in the knows-subgraph.

In [30]:
people = g.V().hasLabel('person').repeat(both('knows')).times(5).groupCount().by('name').next()
print(people)

{'vadas': 4, 'josh': 4, 'marko': 8}
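
The numbers are easier to trust after simulating the walk by hand. The plain-Python sketch below starts one traverser on each person, moves every traverser across all of its *knows* edges (in either direction) five times, and then counts where the traversers end up. It is an illustration of the counting, using hand-written stand-in data rather than the server.

```python
from collections import Counter

PEOPLE = ['marko', 'vadas', 'josh', 'peter']
KNOWS = [('marko', 'vadas'), ('marko', 'josh')]


def both(name):
    """Neighbours of `name` across 'knows' edges, ignoring direction."""
    neighbours = []
    for src, dst in KNOWS:
        if src == name:
            neighbours.append(dst)
        if dst == name:
            neighbours.append(src)
    return neighbours


# One traverser per person to start with.
traversers = list(PEOPLE)

# repeat(both('knows')).times(5): each traverser splits across its neighbours.
for _ in range(5):
    traversers = [n for t in traversers for n in both(t)]

# groupCount().by('name'): count traversers per final vertex.
counts = dict(Counter(traversers))
print(counts)
```

peter has no *knows* edges, so his traverser dies on the first step and he does not appear in the result. marko sits between vadas and josh, so traversers pass through him on every other step, which is why he ends up with the highest count.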
