In [2]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv('./Dataset Activity.csv', low_memory=False)

I know nothing about this data beyond the column names and types, so I will start with a few basic commands to get a feel for it. Three questions guide the exploration:

1. What is the meaning of each column?
2. Which columns correspond to entity types?
3. What activity does each row record, and which columns identify the initiator, the recipient, or something else?
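
A quick per-column summary can help with the first two questions. The sketch below is not part of the original analysis: it rebuilds a few columns from the first rows of the `data.head()` output as a toy frame (`sample` and `summary` are names assumed here) so that it runs without the CSV. The same `summary` expression could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame: a few columns from the first five rows of the data.head() output.
sample = pd.DataFrame({
    "Name": ["Daniel Gouveia", "Daniel Gouveia", "Daniel Gouveia",
             "Ron Karas", "Daniel Gouveia"],
    "Action": ["CREATED", "INVITED", "INVITED_TO", "LOGGEDIN", "VIEWED"],
    "Object_Type": ["HUDDLE", "USER", "CHANNEL", np.nan, "CARD"],
    "Type": ["USER"] * 5,
    "Device": [np.nan] * 5,
})

# One row per column: its dtype, how many values are missing,
# and how many distinct non-null values it holds.
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "missing": sample.isna().sum(),
    "unique_non_null": sample.nunique(),
})
print(summary)
```

Columns with few distinct values and few missing entries are the natural candidates for categorical fields such as entity types.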

In [4]:
data.head()

Unnamed: 0,Object_Name,User_ID,Source_ID,Name,Action,Object_Type,Object_ID,Type,Event_Time,Device,Browser_Details
0,'@Daniel Gouveia social has higher than averag...,98726482,98726482,Daniel Gouveia,CREATED,HUDDLE,ec30b903-5026-421e-8537-f8e2441b5832,USER,2018-04-10 17:41:32,,
1,Daniel Gouveia,98726482,98726482,Daniel Gouveia,INVITED,USER,98726482,USER,2018-04-10 17:41:32,,
2,Cost per Lead by Source,98726482,98726482,Daniel Gouveia,INVITED_TO,CHANNEL,d8aaa700-70b8-481f-9e05-2b0cdb6999a0,USER,2018-04-10 17:41:32,,
3,,217072912,217072912,Ron Karas,LOGGEDIN,,,USER,2018-04-10 17:41:35,,
4,Cost per Lead by Source,98726482,98726482,Daniel Gouveia,VIEWED,CARD,1504216089,USER,2018-04-10 17:42:19,,


In [5]:
# Number of unique values in each column (NaN counts as a value)

print(data.nunique(dropna=False))

Object_Name         25482
User_ID               641
Source_ID           19022
Name                11371
Action                 94
Object_Type            62
Object_ID           74833
Type                   14
Event_Time         830507
Device                  3
Browser_Details       840
dtype: int64


In [8]:
# print unique values in each column 

print(data.apply(lambda col: col.unique()))

Object_Name        ['@Daniel Gouveia social has higher than avera...
User_ID            [98726482, 217072912, 827313924, 794923353, 16...
Source_ID          [98726482, 217072912, 827313924, 794923353, 16...
Name               [Daniel Gouveia, Ron Karas, Steven Monk, Nick ...
Action             [CREATED, INVITED, INVITED_TO, LOGGEDIN, VIEWE...
Object_Type        [HUDDLE, USER, CHANNEL, nan, CARD, PAGE, PAGE_...
Object_ID          [ec30b903-5026-421e-8537-f8e2441b5832, 9872648...
Type               [USER, DATA_SOURCE, CARD, PAGE, PROJECT_TASK, ...
Event_Time         [2018-04-10 17:41:32, 2018-04-10 17:41:35, 201...
Device                                        [nan, desktop, mobile]
Browser_Details    [nan, Mozilla/5.0 (Macintosh; Intel Mac OS X 1...
dtype: object


From the output so far, "Object_Type" and "Type" look like the columns most likely to hold entity types. Next, I want to find which combination of columns best describes the initiator and the recipient of each activity.
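
One way to probe whether the two columns draw on the same vocabulary is to compare their label sets. This is a hedged sketch rather than a step from the original notebook: `type_col` and `object_col` are stand-ins holding only the labels visible in the truncated output above, so the result on the full data will differ.

```python
import numpy as np
import pandas as pd

# Stand-ins containing only the labels visible in the truncated unique() output.
type_col = pd.Series(["USER", "DATA_SOURCE", "CARD", "PAGE", "PROJECT_TASK"],
                     name="Type")
object_col = pd.Series(["HUDDLE", "USER", "CHANNEL", np.nan, "CARD", "PAGE"],
                       name="Object_Type")

# Ignore missing values, then compare the two label sets.
type_labels = set(type_col.dropna())
object_labels = set(object_col.dropna())

shared = sorted(type_labels & object_labels)
only_type = sorted(type_labels - object_labels)
only_object = sorted(object_labels - type_labels)

print("in both columns:    ", shared)
print("only in Type:       ", only_type)
print("only in Object_Type:", only_object)
```

With the real frame, `type_col` and `object_col` would be `data['Type']` and `data['Object_Type']`. A large overlap would suggest that both columns describe the same kind of entity in different roles.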

In [9]:
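# Count how often each (Object_Name, Object_Type, Action, Name, Type) combination occurs.
# Note: groupby drops rows with NaN in any key by default (e.g. LOGGEDIN rows with no object).

ndata = data.groupby(['Object_Name','Object_Type','Action','Name','Type']).size().reset_index(name='count')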

In [12]:
ndata.head()

Unnamed: 0,Object_Name,Object_Type,Action,Name,Type,count
0,Why are you spending so much money on T&E?,HUDDLE,CREATED,Mike Harding,USER,1
1,DP18 Employee Details & Quotas,DATA_LINEAGE,VIEWED,Amos Oaks,USER,12
2,DP18 Employee Details & Quotas,DATA_SOURCE,EXPORTED,JJ Persaud,USER,1
3,DP18 Employee Details & Quotas,DATA_SOURCE,UPDATED,Mike Kirkeide,USER,1
4,DP18 Employee Details & Quotas,DATA_SOURCE,UPDATED,Will Vaughan,USER,1


My hypothesis is that, in each row, "Name" is the initiator's name, "Type" is the initiator's entity type, "Action" is what the initiator did, "Object_Name" is the name of the entity acted upon, and "Object_Type" is that entity's type. To find out which entity types interact with each other, I group the data by "Type" and "Object_Type".
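
A simple sanity check for this reading is to render rows as initiator, action, receiver sentences and see whether they make sense. The sketch below is illustrative only: `sample` holds rows 1 to 4 of the `data.head()` output (row 0 is left out because its "Object_Name" is truncated there), and `describe` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Rows 1-4 of the data.head() output, restricted to the five columns in question.
sample = pd.DataFrame({
    "Name": ["Daniel Gouveia", "Daniel Gouveia", "Ron Karas", "Daniel Gouveia"],
    "Type": ["USER", "USER", "USER", "USER"],
    "Action": ["INVITED", "INVITED_TO", "LOGGEDIN", "VIEWED"],
    "Object_Name": ["Daniel Gouveia", "Cost per Lead by Source", np.nan,
                    "Cost per Lead by Source"],
    "Object_Type": ["USER", "CHANNEL", np.nan, "CARD"],
})

def describe(row):
    """Render one row as 'initiator [type] ACTION receiver [type]'."""
    actor = f"{row['Name']} [{row['Type']}]"
    # Some actions (e.g. LOGGEDIN) have no receiver at all.
    if pd.isna(row["Object_Name"]) and pd.isna(row["Object_Type"]):
        return f"{actor} {row['Action']}"
    return f"{actor} {row['Action']} {row['Object_Name']} [{row['Object_Type']}]"

sentences = sample.apply(describe, axis=1).tolist()
for sentence in sentences:
    print(sentence)
```

If the sentences read naturally on a larger random sample of `data`, that supports the hypothesis; rows that read backwards would point to actions where the roles are swapped.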

In [16]:
edata = data.groupby(['Type', 'Object_Type']).size().reset_index(name='count')

In [17]:
edata.sort_values('count')

Unnamed: 0,Type,Object_Type,count
40,USER,CUSTOMER_FEATURE,1
54,USER,LICENSE_REQUEST,1
73,USER,TEMPLATE,1
26,USER,ACTIVITY_LOG_CSV,2
55,USER,METRIC,3
...,...,...,...
52,USER,JOB,103310
15,DATA_SOURCE,USER,131776
7,CARD,USER,238644
33,USER,CARD,270889


With these counts in hand, I can build an entity graph. Each distinct value appearing in either "Object_Type" or "Type" becomes a node. For each ("Type", "Object_Type") pair, I add a directed edge from the "Type" node to the "Object_Type" node, weighted by the number of rows with that pair.
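
The construction just described can be sketched with a plain dictionary of dictionaries. This is a minimal illustration under stated assumptions, not the original implementation: `build_entity_graph` is a hypothetical helper, the edge direction follows the hypothesis that "Type" is the initiator, and the input is limited to the four largest pairs visible in the sorted output above.

```python
from collections import defaultdict

def build_entity_graph(edges):
    """Build a weighted directed graph from (type, object_type, count) triples.

    Returns the set of nodes and an adjacency mapping graph[source][target] = weight.
    """
    nodes = set()
    graph = defaultdict(dict)
    for source, target, count in edges:
        nodes.update((source, target))
        # Accumulate in case the same pair appears more than once.
        graph[source][target] = graph[source].get(target, 0) + int(count)
    return nodes, dict(graph)

# The four largest (Type, Object_Type, count) rows from the sorted edata output.
top_pairs = [
    ("USER", "JOB", 103310),
    ("DATA_SOURCE", "USER", 131776),
    ("CARD", "USER", 238644),
    ("USER", "CARD", 270889),
]

nodes, graph = build_entity_graph(top_pairs)
for source, targets in graph.items():
    for target, weight in targets.items():
        print(f"{source} -> {target} (weight {weight})")

# With the full frame, the triples could come from:
# edata[["Type", "Object_Type", "count"]].itertuples(index=False, name=None)
```

If networkx is available, `nx.from_pandas_edgelist(edata, source='Type', target='Object_Type', edge_attr='count', create_using=nx.DiGraph)` should give an equivalent graph object with layout and traversal utilities included.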
