### Check notebook configuration and Neptune cluster status

In [None]:
%graph_notebook_version

In [None]:
%%graph_notebook_config
{
  "host": "neptune-cluster-poc-identity-graph.cluster-c4k0tumhelmt.us-east-2.neptune.amazonaws.com",
  "port": 8182,
  "auth_mode": "IAM",
  "load_from_s3_arn": "",
  "ssl": true,
  "aws_region": "us-east-2",
  "sparql": {
    "path": "sparql"
  },
  "gremlin": {
    "traversal_source": "g"
  }
}

In [None]:
%graph_notebook_config

In [None]:
%status

### Loading data into the Identity Graph

In [None]:
%load

### Exploring the Identity Graph
To better understand the graph's data model and schema, we can run the graph queries below to identify its common entities: nodes (vertices) and relationships (edges).

#### Number of nodes/vertices grouped by label

In [None]:
%%gremlin

g.V().groupCount().by(label).unfold()

#### Number of relationships/edges grouped by label

In [None]:
%%gremlin

g.E().groupCount().by(label).unfold()
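
As a rough mental model for the two counting queries above, the sketch below reproduces `groupCount().by(label)` on a tiny in-memory graph. The vertex and edge data are hypothetical and only reuse label names that appear in this notebook; it does not talk to Neptune.

```python
from collections import Counter

# Hypothetical stand-in for the graph: (id, label) vertices and
# (from_id, label, to_id) edges. None of these IDs come from the real dataset.
vertices = [
    ("s1", "SessionID"),
    ("s2", "SessionID"),
    ("10.0.0.1", "ClientIP"),
    ("d1", "DeviceID"),
    ("alice", "User"),
]
edges = [
    ("s1", "lastSeenAtIP", "10.0.0.1"),
    ("s2", "lastSeenAtIP", "10.0.0.1"),
    ("s1", "usedDevice", "d1"),
    ("s1", "loggedAs", "alice"),
]


def group_count_by_label(labels):
    """Mimic groupCount().by(label): map each label to how often it occurs."""
    return dict(Counter(labels))


vertex_counts = group_count_by_label(label for _, label in vertices)
edge_counts = group_count_by_label(label for _, label, _ in edges)

# unfold() turns the single result map into one entry per label.
for label, count in vertex_counts.items():
    print(label, count)
```

The output has one row per label, which gives a quick inventory of the entity and relationship types in the graph.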

#### Top 10 Device IDs by number of known users linked

In [None]:
%%gremlin

g.V().hasLabel('DeviceID').
    project('device_id','known_users_linked').
        by(id).
        by(in('usedDevice').out('loggedAs').count()).
    order().
        by('known_users_linked',desc).
    limit(10)

#### Top 10 Client IPs by number of known users linked

In [None]:
%%gremlin

g.V().hasLabel('ClientIP').
    project('client_ip','known_users_linked').
        by(id).
        by(in('lastSeenAtIP').out('loggedAs').count()).
    order().
        by('known_users_linked',desc).
    limit(10)

#### Top 10 Client IPs by number of anonymous sessions linked

In [None]:
%%gremlin

g.V().hasLabel('ClientIP').
    project('client_ip','anonymous_sessions_linked').
        by(id).
        by(in('lastSeenAtIP').not(out('loggedAs')).count()).
    order().
        by('anonymous_sessions_linked',desc).
    limit(10)
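
The two Client IP rankings above share one pattern: project a count per vertex, order by that count descending, and keep the first ten. The sketch below imitates that pattern in plain Python on hypothetical data, assuming each session has at most one `loggedAs` edge. It also illustrates one detail of the known-users query: `count()` counts traversers, so a user who reaches the same IP through two sessions is counted twice unless a `dedup()` is added before `count()`.

```python
# Hypothetical sample edges, named after the edge labels used in this notebook.
last_seen_at_ip = [  # (session_id, client_ip)
    ("s1", "10.0.0.1"),
    ("s2", "10.0.0.1"),
    ("s3", "10.0.0.1"),
    ("s4", "10.0.0.2"),
]
logged_as = {"s1": "alice", "s2": "alice", "s4": "bob"}  # session_id -> user


def rank_client_ips(limit=10):
    """Return (known_users_linked, anonymous_sessions_linked) rankings."""
    # Start every ClientIP at zero, as project() emits a row for every vertex.
    ips = dict.fromkeys(ip for _, ip in last_seen_at_ip)
    known = {ip: 0 for ip in ips}
    anonymous = {ip: 0 for ip in ips}
    for session, ip in last_seen_at_ip:
        if session in logged_as:
            known[ip] += 1      # one per traverser, duplicates included
        else:
            anonymous[ip] += 1  # not(out('loggedAs'))

    def top(counts):
        # order().by(..., desc).limit(limit)
        return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:limit]

    return top(known), top(anonymous)


known_top, anonymous_top = rank_client_ips()
print(known_top)
print(anonymous_top)
```

Here `10.0.0.1` reports two known users even though both sessions belong to `alice`, which mirrors how the query above behaves without `dedup()`.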

#### Anonymous sessions related to a given Client IP

In [None]:
%%gremlin -p inv,outv

g.V('147.35.190.53').
    in('lastSeenAtIP').
    not(out('loggedAs')).
    path().
    by()

#### Top 10 customer external IDs by number of product purchases

In [None]:
%%gremlin

g.V().hasLabel('CustomerID').
    project('external_id','amount_of_purchases').
        by(id).
        by(in('hasExternalId').out().out('hasPurchased').count()).
    order().
        by('amount_of_purchases',desc).
    limit(10)

### Find out information about user interests based on the activity of the user across all devices
Suppose you host a web platform and collect clickstream data as users browse your site or use your mobile app. In most cases, the users on your platform will be anonymous (that is, not registered or not logged in). However, these anonymous users may be linked to known users who have used the platform before. We can join (or resolve) the identity of an anonymous user with attributes we know about existing users and, based on known user behavior and heuristics, make some assumptions that tell us more about this anonymous user. We can then use this information to target the user with advertising, special offers, discounts, etc.

Let's use an example where we have an anonymous user interaction through the AnyCompany Marketplace website (i.e. a web session). The user's interaction is registered as session id **'3f553df3-307c-4fbf-ba16-2f20817d7c86'**. We want to know more about this user and whether it is linked to other users on our platform. This anonymous user is considered a **"transient ID"** in our graph data model. Assuming this user does not have a link to a known user, or **"persistent ID"**, how might we find connections from this transient ID to other known user IDs?

Looking at the data model, you can see that **"SessionID"** (a.k.a. transient ID) vertices in our graph are connected to **"ClientIP"** vertices by an outgoing edge. We can traverse across "ClientIP" vertices to reach other "SessionID" vertices that might be linked to a known user (a.k.a. persistent ID). Let's do that in the following graph query.

In [None]:
%%gremlin -p outv,inv,outv,inv

g.V('3f553df3-307c-4fbf-ba16-2f20817d7c86').
    out('lastSeenAtIP').
    in('lastSeenAtIP').
    out('loggedAs').
    dedup().
    path()
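
To make the hops in the query above concrete, here is a minimal Python sketch of the same walk on hypothetical in-memory edges: from the anonymous session to its client IP, back to every session seen at that IP, and on to the known users of those sessions. The session IDs, IPs, and user names are made up, and the sketch simplifies the model by assuming one IP per session.

```python
# Hypothetical edges, keyed by the source vertex of each edge.
last_seen_at_ip = {  # session_id -> client_ip
    "anon-session": "10.0.0.1",
    "s2": "10.0.0.1",
    "s3": "10.0.0.1",
    "s4": "10.0.0.9",
}
logged_as = {"s2": "alice", "s3": "alice", "s4": "bob"}  # session_id -> user


def resolve_known_users(session_id):
    """Known users reachable from a session through a shared client IP."""
    ip = last_seen_at_ip.get(session_id)           # out('lastSeenAtIP')
    if ip is None:
        return []
    users = []
    for other_session, other_ip in last_seen_at_ip.items():
        if other_ip != ip:                         # in('lastSeenAtIP')
            continue
        user = logged_as.get(other_session)        # out('loggedAs')
        if user is not None and user not in users:  # dedup()
            users.append(user)
    return users


print(resolve_known_users("anon-session"))
```

The starting session is visited again on the way back from the IP, but it drops out at the `loggedAs` hop because it has no such edge, just as in the Gremlin traversal.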

Now we can dive deeper into the subgraph that describes the online interactions of one of the identified known users.

In [None]:
%%gremlin -p inv,outv

g.V('victorwilson').
    in('loggedAs').
    out().
    simplePath().
    path()

Let's use another example where we have an anonymous user interaction through AnyCompany's Mobile Application. The user's interaction is coming from a device with id **'5D693FB50F103CC1'**. We want to know more about this user and if it is linked to other users on our platform. This anonymous user is considered a "transient ID" in our graph data model. Assuming this user does not have a link to a known user, or "persistent ID", how might we find connections from this transient ID to other known user IDs?

Looking at the data model, you can see that **"DeviceID"** vertices in our graph are connected to **"SessionID"** (a.k.a. transient ID) vertices by an incoming edge. We can traverse across the "SessionID" vertices that are connected to the same "DeviceID" (5D693FB50F103CC1) and might be linked to a known user (a.k.a. persistent ID). Let's do that in the following graph query.

In [None]:
%%gremlin -p inv,outv,inv,inv,inv

g.V('5D693FB50F103CC1').
    in('usedDevice').
    out('loggedAs').
    order().
        by(id).
    dedup().
    out('hasEmail','hasExternalId').
    out('hasPurchased').
    path()

Let's use another example where we have a user interaction at the register of an AnyCompany physical store. The user's interaction is registered with the customer external ID **'709-31-6436'** provided by the user at the register. We want to know more about this user and their purchase history, including purchases made through other identifiers linked to the same known user on our platform.

Looking at the data model, you can see that **"ExternalID"** vertices in our graph are connected to **"User"** (a.k.a. persistent ID) vertices by an incoming edge. We can traverse across "User" vertices to get other linked purchase identifiers (e.g. Email) that might have been used by the same user when purchasing other products. Let's do that in the following graph query.

In [None]:
%%gremlin -p inv,outv,inv

g.V('460-74-7963').
    in('hasExternalId').
    out().
    out('hasPurchased').
    path()
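
The sketch below illustrates, on hypothetical data, why the query above hops through the "User" vertex: purchases are attached to identifiers (external ID, email), so collecting a full purchase history means fanning out from the user to every identifier it owns. The names and values are invented for illustration, and the only outgoing user edges modeled are `hasExternalId` and `hasEmail`.

```python
# Hypothetical edges; none of these values come from the real dataset.
has_external_id = {"alice": "ext-001"}        # user -> external id
has_email = {"alice": "alice@example.com"}    # user -> email
has_purchased = [                             # (identifier, product)
    ("ext-001", "product-A"),
    ("alice@example.com", "product-B"),
    ("other@example.com", "product-C"),
]


def purchase_history(external_id):
    """Purchases made through any identifier of the user owning external_id."""
    history = []
    for user, ext in has_external_id.items():
        if ext != external_id:                # in('hasExternalId')
            continue
        identifiers = [ext]                   # out(): every identifier of the user
        if user in has_email:
            identifiers.append(has_email[user])
        for identifier in identifiers:
            for owner, product in has_purchased:  # out('hasPurchased')
                if owner == identifier:
                    history.append((user, identifier, product))
    return history


for row in purchase_history("ext-001"):
    print(row)
```

The in-store external ID alone would surface only `product-A`; going through the user also surfaces the purchase made with the email address.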

### Adding new data into the Identity Graph

#### Generate raw data simulating incremental updates from source datasets

Mock data will be generated for the following source datasets:

* First-party database (e.g. CRM)
* Transactional database (e.g. purchases)
* Cookie database
* Click Stream database

The resulting CSV files with raw data, one per dataset:

* first-party: `first_party_data_inc.csv`
* cookie: `cookie_data_inc.csv`
* clickstream: `clickstream_data_inc.csv`
* transactional: `transactional_data_inc.csv`

In [None]:
%%bash

pip install Faker >/dev/null 2>&1
cd /home/ec2-user/SageMaker
python3 utils/create-source-datasets.py --records 100 --uniqueness 50 --incremental 1

#### Visualizing raw data from clickstream dataset

In [None]:
%%python3

import pandas as pd

csv_file_path = '/home/ec2-user/SageMaker/clickstream_data_inc.csv'
cols1 = [ 'session_id', 'client_platform', 'canonical_url', 
    'app_id', 'events', 'start_timestamp', 'start_event', 'end_timestamp', 
    'end_event', 'session_duration_sec' ]
cols2 = [ 'session_id', 'client_ip' ]
cols3 = [ 'session_id', 'device_id' ]
cols4 = [ 'session_id', 'user_name' ]
df1 = pd.read_csv(csv_file_path, usecols=cols1, nrows=1)
print('Sample data for SessionID vertex')
print('--------------------------------')
print(df1.squeeze())
print('')
df2 = pd.read_csv(csv_file_path, usecols=cols2, nrows=1)
print('Sample data for SessionID to ClientIP edge')
print('------------------------------------------')
print(df2)
print('')
df3 = pd.read_csv(csv_file_path, usecols=cols3, nrows=1)
print('Sample data for SessionID to DeviceID edge')
print('------------------------------------------')
print(df3)
print('')
df4 = pd.read_csv(csv_file_path, usecols=cols4, nrows=1)
print('Sample data for SessionID to User edge')
print('--------------------------------------')
print(df4)
print('')
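
The samples printed above suggest how one clickstream row maps onto the graph: one SessionID vertex plus up to three outgoing edges. The following sketch spells out that mapping for a plain dictionary row. The edge labels are the ones used in the queries of this notebook; the row values are hypothetical, and treating a missing or empty `user_name` as an anonymous session is an assumption about the generated CSV.

```python
# Column -> edge label, using the edge labels from the queries in this notebook.
EDGE_COLUMNS = {
    "client_ip": "lastSeenAtIP",
    "device_id": "usedDevice",
    "user_name": "loggedAs",
}


def row_to_edges(row):
    """Turn one clickstream row into (from_id, edge_label, to_id) tuples."""
    edges = []
    for column, label in EDGE_COLUMNS.items():
        target = row.get(column)
        # Skip None, empty strings, and pandas NaN (a float) alike.
        if isinstance(target, str) and target:
            edges.append((row["session_id"], label, target))
    return edges


# Hypothetical row for an anonymous session: no user_name, so no loggedAs edge.
row = {
    "session_id": "session-1",
    "client_ip": "10.0.0.1",
    "device_id": "DEVICE1",
    "user_name": float("nan"),
}
print(row_to_edges(row))
```

With a pandas DataFrame, the same function could be applied to each record of `df.to_dict('records')`.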

#### Create property graph vertices and edges

In [None]:
%%gremlin

//New ClientIP vertex
g.addV('ClientIP').
    property(id,'73.90.98.110')

In [None]:
%%gremlin

//New DeviceID vertex
g.addV('DeviceID').
    property(id,'57797B2259422AD5')

In [None]:
%%gremlin

//New SessionID vertex
g.addV('SessionID').
    property(id,'c6876df1-b8a9-4d49-bd62-f206fae899db').
    property('client_platform','mobile').
    property('canonical_url','http://www.alvarez-smith.com/').
    property('app_id','ecommerce').
    property('events',25).
    property('start_timestamp','2019-02-10 09:35:52').
    property('start_event','CompareProducts').
    property('end_timestamp','2019-02-10 10:03:37').
    property('end_event','SingUp').
    property('session_duration_sec',1665)
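
Statements like the one above could also be generated from the raw rows rather than typed by hand. The sketch below is one possible way to render an `addV` statement as text, with hypothetical helper names (`gremlin_literal`, `add_vertex_query`) and a made-up row. Building query strings is shown only for illustration; it needs careful escaping, and a parameterized or bytecode-based client is usually a safer choice.

```python
def gremlin_literal(value):
    """Render a Python value as a Gremlin (Groovy) literal."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape backslashes and single quotes inside string values.
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def add_vertex_query(label, vertex_id, properties):
    """Build a g.addV(...) statement with a user-supplied id and properties."""
    steps = [f"property(id,{gremlin_literal(vertex_id)})"]
    for key, value in properties.items():
        steps.append(f"property({gremlin_literal(key)},{gremlin_literal(value)})")
    body = ".\n".join(f"    {step}" for step in steps)
    return f"g.addV({gremlin_literal(label)}).\n{body}"


# Hypothetical row; the column names follow the clickstream dataset.
row = {"client_platform": "mobile", "app_id": "ecommerce", "events": 25}
query = add_vertex_query("SessionID", "session-1", row)
print(query)
```

Numeric columns are emitted without quotes so that they are stored as numbers rather than strings, matching the `events` and `session_duration_sec` properties above.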

In [None]:
%%gremlin

//New loggedAs edge from the new SessionID vertex to an existing User vertex
g.V('c6876df1-b8a9-4d49-bd62-f206fae899db').as('session').
  addE('loggedAs').
    from('session').
    to(__.V('victorwilson'))