To use D4M, D4M needs to be on your path. In most environments on the LLSC systems D4M will already be on your path.

D4M is frequently under development, so it's good to update before use. In the txe1 command line terminal:

`module load julia-1.0
export JULIA_DEPOT_PATH=/home/gridsan/USERNAME/.julia
julia
] up D4M`

and then restart the notebook, and run these:

In [1]:
using D4M

┌ Info: Precompiling D4M [ca196bdc-a701-11e8-3d50-3b5cc8577617]
└ @ Base loading.jl:1186


Database capabilities loaded!
D4M loaded!


In [2]:
D4Mver()

version 0.5.3


You can check that the version matches the latest version of D4M.jl, found here: https://github.com/n8kim1/D4M.jl/blob/master/src/version.jl

# D4M Warmup

D4M is a package for working with Associative Arrays. An Associative Array is a bit like a sparse matrix, but the rows, columns, and values can be either numbers or strings.

In [34]:
row = "a,a,a,a,a,a,a,aa,aaa,b,bb,bbb,a,aa,aaa,b,bb,bbb,"
column = "a,aa,aaa,b,bb,bbb,a,a,a,a,a,a,a,aa,aaa,b,bb,bbb,"
values = "a-a,a-aa,a-aaa,a-b,a-bb,a-bbb,a-a,aa-a,aaa-a,b-a,bb-a,bbb-a,a-a,aa-aa,aaa-aaa,b-b,bb-bb,bbb-bbb,"

A = Assoc(row,column,values)

printFull(A)

printFull(logical(A))

7×7 Array{Union{AbstractString, Number},2}:
 ""     "a"      "aa"     "aaa"      "b"    "bb"     "bbb"    
 "a"    "a-a"    "a-aa"   "a-aaa"    "a-b"  "a-bb"   "a-bbb"  
 "aa"   "aa-a"   "aa-aa"  ""         ""     ""       ""       
 "aaa"  "aaa-a"  ""       "aaa-aaa"  ""     ""       ""       
 "b"    "b-a"    ""       ""         "b-b"  ""       ""       
 "bb"   "bb-a"   ""       ""         ""     "bb-bb"  ""       
 "bbb"  "bbb-a"  ""       ""         ""     ""       "bbb-bbb"

7×7 Array{Union{AbstractString, Number},2}:
 ""      "a"   "aa"   "aaa"   "b"   "bb"   "bbb"
 "a"    1     1      1       1     1      1     
 "aa"   1     1      0       0     0      0     
 "aaa"  1     0      1       0     0      0     
 "b"    1     0      0       1     0      0     
 "bb"   1     0      0       0     1      0     
 "bbb"  1     0      0       0     0      1     

This flexibility makes D4M Associative Arrays ideal for representing graph data. Often this involves having string row and column labels, representing the names of vertices and/or edges, and numeric values representing the existence of an edge or the weight of that edge.

_TODO fix put_

In [38]:
row = "v1,v1,v1,v1,v1,v1,v1,v2,v3,v4,v5,v6,v1,v2,v3,v4,v5,v6,"
column = "v1,v2,v3,v4,v5,v6,v1,v1,v1,v1,v1,v1,v1,v2,v3,v4,v5,v6,"
values = "1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,"

A = Assoc(row,column,values)

printFull(A)

printFull(logical(A))

7×7 Array{Union{AbstractString, Number},2}:
 ""    "v1"  "v2"  "v3"  "v4"  "v5"  "v6"
 "v1"  "1"   "1"   "1"   "1"   "1"   "1" 
 "v2"  "1"   "1"   ""    ""    ""    ""  
 "v3"  "1"   ""    "1"   ""    ""    ""  
 "v4"  "1"   ""    ""    "1"   ""    ""  
 "v5"  "1"   ""    ""    ""    "1"   ""  
 "v6"  "1"   ""    ""    ""    ""    "1" 

7×7 Array{Union{AbstractString, Number},2}:
 ""     "v1"   "v2"   "v3"   "v4"   "v5"   "v6"
 "v1"  1      1      1      1      1      1    
 "v2"  1      1      0      0      0      0    
 "v3"  1      0      1      0      0      0    
 "v4"  1      0      0      1      0      0    
 "v5"  1      0      0      0      1      0    
 "v6"  1      0      0      0      0      1    

<img src="images/graphEx.png" alt="Drawing" style="width: 200px;"/>

With Associative Arrays, you can extract a subgraph by indexing into the Associative Array. For example, let's just get all the columns with odd vertices and rows with vertices 1-3.

Note that the last character in an index string is the delimiter. This allows us to index on multiple values with a single string, which matters because cell arrays of strings can get very slow.
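To make the delimiter convention concrete, here is a minimal sketch in plain Julia of how such a string can be broken into its keys. The helper name `splitkeys` is hypothetical and is not part of D4M; it only illustrates the convention, not how D4M parses strings internally.

```julia
# Hypothetical helper (not a D4M function): split a key string whose
# last character is the delimiter.
function splitkeys(s::AbstractString)
    isempty(s) && return String[]
    delim = s[end]                           # last character is the delimiter
    return String.(split(chop(s), delim))    # drop trailing delimiter, then split
end

println(splitkeys("v1,v3,v5,"))   # comma-delimited
println(splitkeys("v1;v3;v5;"))   # any single-character delimiter works
```

Because the delimiter is read from the string itself, keys that contain commas can still be listed by picking a different final character.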

In [42]:
printFull(A[:,"v1,v3,v5,"])

printFull(A["v1,:,v3,",:])

7×4 Array{Union{AbstractString, Number},2}:
 ""    "v1"  "v3"  "v5"
 "v1"  "1"   "1"   "1" 
 "v2"  "1"   ""    ""  
 "v3"  "1"   "1"   ""  
 "v4"  "1"   ""    ""  
 "v5"  "1"   ""    "1" 
 "v6"  "1"   ""    ""  

4×7 Array{Union{AbstractString, Number},2}:
 ""    "v1"  "v2"  "v3"  "v4"  "v5"  "v6"
 "v1"  "1"   "1"   "1"   "1"   "1"   "1" 
 "v2"  "1"   "1"   ""    ""    ""    ""  
 "v3"  "1"   ""    "1"   ""    ""    ""  

We can also add, subtract, and multiply Associative Arrays.

In [46]:
printFull(A + A[:,"v1,v3,v5,"]) # A, with odd columns doubled

printFull(A - A["v1,:,v3,",:]) # A, with first three rows gone

printFull(A * A[:,"v2,v4,"]) # A times just a couple of its columns

7×7 Array{Union{AbstractString, Number},2}:
 ""     "v1"   "v2"   "v3"   "v4"   "v5"   "v6"
 "v1"  2.0    1.0    2.0    1.0    2.0    1.0  
 "v2"  2.0    1.0    0.0    0.0    0.0    0.0  
 "v3"  2.0    0.0    2.0    0.0    0.0    0.0  
 "v4"  2.0    0.0    0.0    1.0    0.0    0.0  
 "v5"  2.0    0.0    0.0    0.0    2.0    0.0  
 "v6"  2.0    0.0    0.0    0.0    0.0    1.0  

4×5 Array{Union{AbstractString, Number},2}:
 ""     "v1"   "v4"   "v5"   "v6"
 "v4"  1.0    1.0    0.0    0.0  
 "v5"  1.0    0.0    1.0    0.0  
 "v6"  1.0    0.0    0.0    1.0  

7×3 Array{Union{AbstractString, Number},2}:
 ""     "v2"   "v4"
 "v1"  2      2    
 "v2"  2      1    
 "v3"  1      1    
 "v4"  1      2    
 "v5"  1      1    
 "v6"  1      1    

This is just a small taste of D4M to get you started. You'll see more as we go through this notebook.

# Creating Incidence and Adjacency Matrices

Usually when we parse data into D4M Associative Arrays we put them in an Incidence Matrix format. For review, the rows of an incidence matrix correspond to edges, while the columns correspond to vertices. Incidence matrices can represent a variety of types of graphs, and so can be very useful. You can also easily form any adjacency matrix you need from an incidence matrix.

To see how an incidence matrix is formed, let's say you have some raw data in a TSV file. The row IDs of the incidence matrix are some unique identifier of the rows in the original file; in this example they are just the row numbers in the file. Each row in the TSV file generally has several columns. We combine the column name with each column value to create our column IDs, which are the vertex IDs in our incidence matrix. Notice how we have parsed out each individual word in column 3 as well.

<img src="images/TSVtoAssoc.png" alt="Drawing" style="width: 600px;"/>
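The construction described above can be sketched with Julia's standard `SparseArrays` library instead of D4M. The records, field names, and variable names here are made up for illustration; the only idea taken from the text is that each column key is `"field|value"` and that free-text fields are split into individual words.

```julia
using SparseArrays

# Toy records standing in for rows of a TSV file (made-up data).
records = [
    Dict("user" => "alice", "word" => "hello world"),
    Dict("user" => "bob",   "word" => "hello again"),
]

# Collect (row number, "field|value") entries, splitting text into words.
entries = Tuple{Int,String}[]
for (i, rec) in enumerate(records)
    for (field, value) in rec
        for token in split(value)
            push!(entries, (i, field * "|" * token))
        end
    end
end

# Sorted, unique column keys become the vertex IDs.
colkeys = sort(unique(last.(entries)))
colindex = Dict(k => j for (j, k) in enumerate(colkeys))

rows = first.(entries)
cols = [colindex[k] for (_, k) in entries]
E = sparse(rows, cols, ones(length(entries)), length(records), length(colkeys))

println(colkeys)
println(Matrix(E))
```

Each row of `E` is one record (an edge) and each column is one `"field|value"` vertex, which is the same shape as the incidence matrices loaded in the cells that follow.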

Let's look at an example. We are loading an Associative Array with Twitter data (in 10 parts), and looking at the user columns in the first three tweets:

In [10]:
A = ReadCSV("graphclassdata/A1.csv");
A += ReadCSV("graphclassdata/A2.csv");#error
A += ReadCSV("graphclassdata/A3.csv");#error
A += ReadCSV("graphclassdata/A4.csv"); #error ?
A += ReadCSV("graphclassdata/A5.csv");
A += ReadCSV("graphclassdata/A6.csv");
A += ReadCSV("graphclassdata/A7.csv");
A += ReadCSV("graphclassdata/A8.csv");
A += ReadCSV("graphclassdata/A9.csv");
A += ReadCSV("graphclassdata/A10.csv");

In [11]:
length(A.row) # if loaded all 10 files, should be 10000

10000

In [12]:
length(A.col) # if loaded all 10 files, should be 129895

129895

In [18]:
printFull(A[1:3,StartsWith("user|,")])

4×4 Array{Union{AbstractString, Number},2}:
              ""   "user|Blocker57"   "user|Wangjeje"   "user|gyanpratika"
  821278781733    0.0                0.0               1.0                
 4857479781733    1.0                0.0               0.0                
 8653317781733    0.0                1.0               0.0                

Since the full matrix display is not practical for viewing much more than this, we often look at the triples of the Associative Array. Here are the first two rows:

In [19]:
print(A[1:2,:])

  [2, latlon|++004315..82003012019000000000]  =  1.0
  [1, latlon|-+010067..10923207153171000000]  =  1.0
  [2, lat|+041.8031090000                  ]  =  1.0
  [1, lat|-006.1930137000                  ]  =  1.0
  [2, lon|+035.2002100000                  ]  =  1.0
  [1, lon|+107.0227511000                  ]  =  1.0
  [2, place|682c5a667856ef42               ]  =  1.0
  [1, place|85858f3447b85e2b               ]  =  1.0
  [1, time|2013-05-22 12:47:08             ]  =  1.0
  [2, time|2013-05-22 12:47:32             ]  =  1.0
  [1, userID|204683306                     ]  =  1.0
  [2, userID|258075949                     ]  =  1.0
  [2, user_lower|blocker57                 ]  =  1.0
  [1, user_lower|gyanpratika               ]  =  1.0
  [2, user|Blocker57                       ]  =  1.0
  [1, user|gyanpratika                     ]  =  1.0
  [1, word_lower|:(                        ]  =  1.0
  [2, word_lower|pastanesi                 ]  =  1.0
  [2, word_lower|tarhan                    ]  

From here we can form a variety of adjacency matrices. For example:

In [20]:
# First, we make a bunch of query/filter objects that we can pass into indexing.

# All words that start with a letter:
realwords = "word|A,:,word|z" * Char(127) * ",";

# All hashtags:
hashtags = StartsWith("word|#");

# Tweets @ someone:
directedtweet = StartsWith("word|@");

# Usernames:
users = StartsWith("user|");

# Locations:
locs = StartsWith("latlon|");

In [9]:
# Users-Location Graph
println("Bipartite User-Location Graph") # number of tweets for each user/location pair
Auserloc = transpose(A[:,users]) * A[:,locs]
print(Auserloc[1:2,:]) # print the first two rows to get a feel

println()

# Word-Word Graph
println("Small Symmetric Word-Word Graph") # yay for quadratic runtime
Asmall = A[1:1000, :]
Arealwordsfilter = Asmall[:, realwords] # looking at just the words 
Awords = transpose(Arealwordsfilter) * Arealwordsfilter # number of tweets for each word/word pair
Awords = removediag(Awords) 
print(Awords>4) # display sufficiently high

println()

# Directed tweets
println("Directed User-User Graph")
Adir = transpose(A[:,users]) * A[:,directedtweet] # number of tweets for each "sender/recipient" pair
print(Adir[11:20, :]) # just a few users

Bipartite User-Location Graph
  [user|0000The  , latlon|++002459..46269574254468000000]  =  1.0
  [user|000chiaki, latlon|++013369..30212434630140720000]  =  1.0

Small Symmetric Word-Word Graph
  [word|T?rkiye) , word|I'm      ]  =  8.0
  [word|w/       , word|I'm      ]  =  14.0
  [word|see      , word|It       ]  =  5.0
  [word|aku      , word|RT       ]  =  6.0
  [word|lah      , word|RT       ]  =  5.0
  [word|yg       , word|RT       ]  =  5.0
  [word|I'm      , word|T?rkiye) ]  =  8.0
  [word|w/       , word|T?rkiye) ]  =  7.0
  [word|eu       , word|To       ]  =  5.0
  [word|eu       , word|aii      ]  =  10.0
  [word|RT       , word|aku      ]  =  6.0
  [word|ga       , word|aku      ]  =  8.0
  [word|jadi     , word|aku      ]  =  6.0
  [word|kamu     , word|aku      ]  =  11.0
  [word|eu       , word|baga?eira]  =  5.0
  [word|var      , word|bir      ]  =  6.0
  [word|eu       , word|cabare   ]  =  5.0
  [word|eu       , word|caminh?o ]  =  5.0
  [word|lah      , word|dak 

# Graph Algorithms in D4M

D4M has some graph algorithm capabilities built in. Let's take a look at a few examples before we start looking at in-database algorithms.

## Breadth First Search

From our word-word graph, we can see which words are used together. But what if we want to go out one or two levels more? We can use breadth first search (BFS). We'll start by specifying a few source vertices.

In [73]:
words = "word|I'm,word|RT,"
Asearch = logical(Awords)

## Run Breadth First Search on Adjacency Schema
numsteps=3
minDegree=10
maxDegree=10000
v0 = words # initial vertices
Adeg = sum(logical(A),1) # Degree of each node; represents number of tweets each word appeared in
# degree filtering lets us get words with sufficiently high frequency

v = adjbfs(Asearch,Adeg,v0,numsteps,minDegree,maxDegree, takeunion=true)
result = v.row

# Vertices reached in numsteps hops from the specified words
wordsrep = replace(words, "word|" => " ")
println("Vertices reached in " * string(numsteps) * " hops from " * wordsrep * ":")
for i = 1:length(result)
    print(result[i][6:end] * " ")
end

Vertices reached in 3 hops from  I'm, RT,:
A Ada Ah Aku And Apa Bom Buenos But Cafe Calum Center Es Eu Facebook Fast Furious Good HP Haha Hahaha Happy Home House I I'm If It It's Iya Jangan Just Kalo La Lo Lol Love Mau Me My No O Oh Pero Pues Que RT Regi?n Restaurant Si So T?rkiye) Tapi Te The To U Udah What Wkwk Y Ya You abis about actually ada adalah ade again ah ahora ai aja ajak akan ako aku al always ama amo anak anda apa aq atau ba bagus baik banyak baru beli belum berapa besok biar biasa bien bikin bile bir birthday bisa blm bola boleh bom bu buat buka call cara che comer como con cuando d d?a da dah dalam dan dapat dari day deh del dengan depan di dia diri disneyland done dong dos dulu e eh el ele em emang en era es eso esok est? este eu excited fast feel first follback follow foto fuck full ga gak ganti gitu give gk good gracias great gua gw ha hacer haha hahahaha hanya happy hari harus hate hati heart ho hoje home hora hours hoy i ih ik ikut il ini itu iya j? jadi jaja jajaja
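To show what a degree-filtered BFS does, here is a minimal sketch on a plain sparse adjacency matrix. It assumes the filter means "do not expand vertices whose degree is outside the allowed range", which is one plausible reading and not necessarily how `adjbfs` applies it. The function name `filtered_bfs` and the toy graph are hypothetical.

```julia
using SparseArrays

# Degree-filtered BFS sketch. A[i, j] != 0 means an edge i -> j.
# Frontier vertices whose degree is outside [mindeg, maxdeg] are not expanded.
# Returns every vertex reached within `numsteps` hops (union over steps).
function filtered_bfs(A::SparseMatrixCSC, deg::Vector{<:Real}, sources::Vector{Int},
                      numsteps::Int, mindeg::Real, maxdeg::Real)
    reached = Set{Int}()
    frontier = sources
    for _ in 1:numsteps
        next = Set{Int}()
        for v in frontier
            mindeg <= deg[v] <= maxdeg || continue   # degree filter
            for j in findnz(A[v, :])[1]              # neighbors of v
                push!(next, j)
            end
        end
        union!(reached, next)
        frontier = collect(next)
    end
    return sort(collect(reached))
end

# Toy undirected graph: 1-2, 2-3, 3-4, 2-5 (vertex 2 is the high-degree hub).
edges = [(1, 2), (2, 3), (3, 4), (2, 5)]
src = first.(edges)
dst = last.(edges)
A = sparse(vcat(src, dst), vcat(dst, src), ones(Int, 2 * length(edges)), 5, 5)
deg = vec(sum(A, dims=2))

println(filtered_bfs(A, deg, [1], 2, 1, 10))   # hub is expanded
println(filtered_bfs(A, deg, [1], 2, 1, 2))    # hub (degree 3) is filtered out
```

Lowering the maximum degree stops the search from fanning out through very common vertices, which is the reason for filtering on words like "RT" in the cell above.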

## Jaccard Index

The Jaccard index is a metric of similarity which can be calculated on a graph. In the context of words, this takes into account not only the number of times two words occur together, but also the overall degrees of each word.

_TODO make this work, once Jaccard works too_

In [None]:
# % First remove low-degree nodes, those will throw off the results
# w = Row(Adeg(realwords,:) > 2);
# Atest = Awords(:,w);
# Atest = Atest(w,:);

# J = Jaccard(Atest);

# J(['word|coffee' nl ],:)> 0.1

# Database Setup

Now we will bind to a database. Often we put these lines in a script called "DBsetup" and just run that.

Be sure to edit this to include a unique name for your tables. This prevents multiple people from running on the same tables.

In [3]:
# Initialize DB connectors
dbinit()

# Connect to Database

DB = dbsetup("class-db02")

13 Aug 2019 13:47:17,297 WARN - ClientConfiguration.loadFromSearchPath(227) -  Found no client.conf in default paths. Using default client configuration values.


D4M.DBserver("class-db02", "class-db02.cloud.llgrid.txe1.mit.edu:2181", "AccumuloUser", "aDvx@T_OwqfypNSnbcAMa14FV", "BigTableLike", JavaCall.JavaObject{Symbol("edu.mit.ll.graphulo.MatlabGraphulo")}(Ptr{Nothing} @0x0000000003dab6c0))

In [4]:
# Give a unique name prefix to prevent collisions in Accumulo
myName = "t2_" # CHANGE THIS TO SOMETHING UNIQUE

deleteprefix(DB, myName)

# Bind to tables
Tedge = DB[myName * "Tedge", myName * "TedgeT"]
TedgeDeg = DB[myName * "TedgeDeg"]
TedgeTxt = DB[myName * "TedgeTxt"]

Deleting t2_Tedge in class-db02
Deleting t2_TedgeDeg in class-db02
Deleting t2_TedgeT in class-db02
Deleting t2_TedgeTxt in class-db02
Deleting t2_Tedge_BFS in class-db02
Deleting t2_Tedge_BFST in class-db02
Deleting t2_hashtag in class-db02
Deleting t2_hashtag_user in class-db02
Deleting t2_user_hashtag in class-db02
Deleting t2_user_share_hashtag in class-db02
Deleting t2_user_share_hashtag_bfs in class-db02
Deleting t2_user_share_hashtag_bfsT in class-db02
Deleting t2_user_share_hashtag_deg in class-db02
Deleting t2_user_share_hashtag_jaccard in class-db02
Deleting t2_wordword in class-db02
Creating t2_Tedge in class-db02
Creating t2_TedgeT in class-db02
Creating t2_TedgeDeg in class-db02
Creating t2_TedgeTxt in class-db02


D4M.DBtable(D4M.DBserver("class-db02", "class-db02.cloud.llgrid.txe1.mit.edu:2181", "AccumuloUser", "aDvx@T_OwqfypNSnbcAMa14FV", "BigTableLike", JavaCall.JavaObject{Symbol("edu.mit.ll.graphulo.MatlabGraphulo")}(Ptr{Nothing} @0x0000000003dab6c0)), "t2_TedgeTxt", "", 0, 0, "", 500000.0, JavaCall.JavaObject{Symbol("edu.mit.ll.d4m.db.cloud.D4mDataSearch")}(Ptr{Nothing} @0x000000000c3f5e58), JavaCall.JavaObject{Symbol("edu.mit.ll.d4m.db.cloud.D4mDbTableOperations")}(Ptr{Nothing} @0x000000000c3f5df0))

If you check the Accumulo monitor, these tables are now present!

Now we will ingest our data.

In [None]:
# for i = 1:10
#     A = ReadCSV("graphclassdata/A" * string(i) * ".csv")
#     put(Tedge,A, clear = (i==1 ? true : false))
#     put(TedgeDeg,transpose(sum(A,1)), clear = (i==1 ? true : false))
#     println(i)
# end

In [None]:
# A = ReadCSV("graphclassdata/A1.csv");
# println("1")
# A += ReadCSV("graphclassdata/A2.csv")
# println("2")
# A += ReadCSV("graphclassdata/A3.csv")
# println("3")
# A += ReadCSV("graphclassdata/A4.csv")
# println("4")
# A += ReadCSV("graphclassdata/A5.csv")
# println("5")
# A += ReadCSV("graphclassdata/A6.csv")
# println("6")
# A += ReadCSV("graphclassdata/A7.csv")
# println("7")
# A += ReadCSV("graphclassdata/A8.csv")
# println("8")
# A += ReadCSV("graphclassdata/A9.csv")
# println("9")
# A += ReadCSV("graphclassdata/A10.csv")
# println("10")

# A = Assoc(A.row, A.col, A.val, convert.(Int64, A.A))
# # put(Tedge,A, clear=true)
# # using JLD
# # put(TedgeDeg,transpose(sum(A,1)), clear=true)

In [37]:
using SparseArrays
A = ReadJLD("graphclassdata/A.jld");
put(Tedge,A, clear=true)
put(TedgeDeg,transpose(sum(A,1)), clear=true)

Creating t2_Tedge in class-db02
Creating t2_TedgeT in class-db02
Creating t2_TedgeDeg in class-db02


# Creating Adjacency Graphs

We store our data in an incidence matrix form using the standard D4M Schema. In this way you have the flexibility to create adjacency matrices of individual graphs as you need them.

<img src="images/d4mschema.png" alt="Drawing" style="width: 800px;"/>

We can use Graphulo's table multiply to create our adjacency matrices. For our Twitter data, we can create several that may be interesting. First let's get the column keys of the columns we may want to filter on, and get an idea of how many columns there are in each using the degree table. This information is important when deciding how to form your adjacency matrix.

In [6]:
# All words that start with a letter:
realwords = "word_lower|a,:,word_lower|z" * Char(127) * ",";
println("There are " * string(nnz(TedgeDeg[realwords,:])) * " words that start with a letter.")
# should be 25147

# All hashtags:
hashtags = StartsWith("word|#");
println("There are " * string(nnz(TedgeDeg[hashtags,:])) * " hashtags.")
# should be 1118

# Tweets @ someone:
directedtweet = StartsWith("word|@");
println("There are " * string(nnz(TedgeDeg[directedtweet,:])) * " words that start with @.")
# should be 5474

# Usernames:
users = StartsWith("user|");
println("There are " * string(nnz(TedgeDeg[users,:])) * " usernames.")
# should be 8510

# Locations:
locs = StartsWith("latlon|");
println("There are " * string(nnz(TedgeDeg[locs,:])) * " locations.")
# should be 7705

There are 25147 words that start with a letter.
There are 1118 hashtags.
There are 5474 words that start with @.
There are 8510 usernames.
There are 7705 locations.


Now let's make a few graphs. A Graphulo table multiply computes the product A\*B, but it takes A' (the transpose of A) and B as inputs. Letting E be our incidence matrix, we get an adjacency matrix from E'\*E, so A = E' and B = E. Since Graphulo wants A', and A' = (E')' = E, we pass E for both inputs.
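The E'\*E identity is easy to check on a toy incidence matrix using `SparseArrays`. This sketch runs locally and does not call Graphulo; the data and the column ranges standing in for the column filters are made up.

```julia
using SparseArrays

# Incidence matrix: 3 tweets (rows) x 4 vertices (columns).
# Columns 1-2 are users, columns 3-4 are hashtags (made-up data).
E = sparse([1 0 1 0;
            0 1 1 0;
            1 0 0 1])

users    = 1:2   # stands in for a "user|" column filter
hashtags = 3:4   # stands in for a "word|#" column filter

# Hashtag-hashtag graph: entry (i, j) counts tweets containing both hashtags.
Ahashtag = E[:, hashtags]' * E[:, hashtags]

# Bipartite user-hashtag graph: entry (u, h) counts tweets by u that use h.
Auserhashtag = E[:, users]' * E[:, hashtags]

println(Matrix(Ahashtag))
println(Matrix(Auserhashtag))
```

The two column selections play the role of the two column filters: choosing the same columns on both sides gives a symmetric graph, and choosing different ones gives a bipartite graph.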

First let's make a word-word graph. On the full dataset, this takes about 3-5 minutes, but the result is nearly a 1Mx1M graph with about 40M edges! Since this is a small subset, it only takes a few seconds.

In [7]:
# Set Parameters
ATtable = Tedge
Btable = Tedge
Cname = myName * "wordword"

# Multiply Tables
wordword = tablemult(ATtable, Btable, Cname, colfilterAT = realwords, colfilterB = realwords, clear=true);

13 Aug 2019 13:48:10,698 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 525654 entries processed


Let's take a look at some of it.

In [8]:
# change to iterators
print(wordword[:, StartsWith("word_lower|abri")])

  [word_lower|abrigado , word_lower|abrigado]  =  1
  [word_lower|hijo     , word_lower|abrigado]  =  1
  [word_lower|m?s      , word_lower|abrigado]  =  1
  [word_lower|voy      , word_lower|abrigado]  =  1
  [word_lower|abrir    , word_lower|abrir   ]  =  2
  [word_lower|acabo    , word_lower|abrir   ]  =  1
  [word_lower|acho     , word_lower|abrir   ]  =  1
  [word_lower|adivinha?, word_lower|abrir   ]  =  1
  [word_lower|atrasada , word_lower|abrir   ]  =  2
  [word_lower|boca     , word_lower|abrir   ]  =  1
  [word_lower|cortando , word_lower|abrir   ]  =  1
  [word_lower|estou    , word_lower|abrir   ]  =  1
  [word_lower|eu       , word_lower|abrir   ]  =  2
  [word_lower|gengiva  , word_lower|abrir   ]  =  1
  [word_lower|kkkk     , word_lower|abrir   ]  =  1
  [word_lower|loja..   , word_lower|abrir   ]  =  1
  [word_lower|minha    , word_lower|abrir   ]  =  1
  [word_lower|muito    , word_lower|abrir   ]  =  1
  [word_lower|pra      , word_lower|abrir   ]  =  2
  [word_lowe

Maybe we want to see what hashtags are used together. Let's make a hashtag-hashtag graph. On the full dataset this is much faster than the word-word graph and takes about a minute. On this subset it takes only a few seconds.

In [9]:
# Set Parameters
ATtable = Tedge
Btable = Tedge
Cname = myName * "hashtag"

rowFilter = ""
colFilterAT = hashtags
colFilterB = hashtags

# Multiply Tables
@time hashtag = tablemult(ATtable, Btable, Cname, colfilterAT = hashtags, colfilterB = hashtags, clear=true);

13 Aug 2019 13:48:17,302 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 3921 entries processed
  2.308884 seconds (121.16 k allocations: 6.033 MiB)


In [10]:
print(hashtag[:,StartsWith("word|#ea")])

  [word|#1            , word|#earthquake]  =  1
  [word|#16           , word|#earthquake]  =  1
  [word|#Breaking?    , word|#earthquake]  =  2
  [word|#PastHour     , word|#earthquake]  =  2
  [word|#earthquake   , word|#earthquake]  =  3
  [word|#prayfromjapan, word|#earthquake]  =  2
  [word|#quake        , word|#earthquake]  =  1
  [word|#tsunami      , word|#earthquake]  =  2
  [word|#Aunty        , word|#eat       ]  =  1
  [word|#Thanks       , word|#eat       ]  =  1
  [word|#cake         , word|#eat       ]  =  1
  [word|#cheesecake.  , word|#eat       ]  =  1
  [word|#delicious    , word|#eat       ]  =  1
  [word|#eat          , word|#eat       ]  =  1
  [word|#iloveit      , word|#eat       ]  =  1
  [word|#sweet        , word|#eat       ]  =  1
  [word|#yummi        , word|#eat       ]  =  1


Let's make a directed graph that shows which users use which hashtags.

If you are creating an adjacency matrix for a directed graph, you may want to create a transpose result graph as well, so you can query quickly for both in and out vertices (Accumulo is indexed by row key, so it is fastest to query rows). All previous graphs were symmetric, so they didn't need a transpose adjacency matrix.

I have also found that you want to use the filter with the larger number of values as your colFilterB. This allows the iterator to process more entries at a time when it is scanning from your A and B tables.

For example, the first time I created the user-hashtag graph on the full dataset I set colFilterAT as the "user|" prefix and colFilterB as the "word|#" prefix. I eventually killed the process because it was taking a very long time (see below). When I swapped them, it took about a minute.

<img src="images/graphulo-colfilters.png" alt="Drawing" style="width: 500px;"/>

Here is how you would multiply two tables and produce a transpose result matrix as well. Since you are calling a Java function, it is very picky about what inputs you use. I am just using the default values for the added parameters.

In [11]:
# Set Parameters
ATtable = Tedge
Btable = Tedge
Cname = myName * "hashtag_user"
CTname = myName * "user_hashtag"

# Multiply Tables
@time hashtaguser = tablemult(ATtable, Btable, Cname, CTname, colfilterAT = hashtags, colfilterB = users);

13 Aug 2019 13:48:19,635 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 1553 entries processed
  2.202923 seconds (13.98 k allocations: 794.123 KiB)


In [12]:
print(hashtaguser[:,StartsWith("user|An")])

  [word|#OmSpikTanya   , user|AnesEsa       ]  =  2
  [word|#kepo          , user|Annisa1309_   ]  =  1
  [word|#hoytocasiesta!, user|AntonellaVolpi]  =  1


What if we want a graph that describes the users that use the same hashtags? That will take two steps. First create an adjacency matrix of users-hashtags by multiplying (we already created this in the previous example). Then, using the graph you have just made, multiplying again will create the graph you are looking for.

Since we are multiplying the full table, we don't need to provide any filters.

In [13]:
# Set Parameters
# Multiply in terms of A*B = C, so if we want to do A'*B, then AT is just A
ATtable = hashtaguser
Btable = hashtaguser
Cname = myName * "user_share_hashtag"

# Multiply Tables
usersharehashtag = tablemult(ATtable, Btable, Cname);

13 Aug 2019 13:48:19,996 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 7142 entries processed


In [14]:
# usersharehashtaglocal = usersharehashtag[:, :]
# usersharehashtaglocalnodiag = removediag(usersharehashtaglocal)
# print(usersharehashtaglocalnodiag[:, StartsWith("user|Ac")])
print(usersharehashtag[:, StartsWith("user|Ac")])

  [user|AceATamayo     , user|AceATamayo]  =  1
  [user|Ace_Cauyan     , user|AceATamayo]  =  1
  [user|Camimimimille  , user|AceATamayo]  =  1
  [user|FRlENSHIP      , user|AceATamayo]  =  1
  [user|FaryllJoiz     , user|AceATamayo]  =  1
  [user|HeyJirko27     , user|AceATamayo]  =  1
  [user|HolaAngelicAnne, user|AceATamayo]  =  2
  [user|KhimUlit       , user|AceATamayo]  =  2
  [user|aiiiidy        , user|AceATamayo]  =  1
  [user|buscasFaye13   , user|AceATamayo]  =  1
  [user|camillevllnva  , user|AceATamayo]  =  2
  [user|conhalili      , user|AceATamayo]  =  1
  [user|elsiejeline    , user|AceATamayo]  =  1
  [user|howtogain      , user|AceATamayo]  =  1
  [user|iamrenzijohn   , user|AceATamayo]  =  1
  [user|im_shufflin    , user|AceATamayo]  =  1
  [user|itsjhengXI     , user|AceATamayo]  =  1
  [user|izecanFLY      , user|AceATamayo]  =  2
  [user|juliannejomocan, user|AceATamayo]  =  1
  [user|katrinatella   , user|AceATamayo]  =  1
  [user|librebetina    , user|AceATamayo
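The two-step construction above can be checked on a toy example with `SparseArrays`. The matrix `H` is a made-up stand-in for the hashtag-user table, and nothing here calls Graphulo. One thing the sketch makes visible is that the second multiply sums products of counts, so values can exceed the number of shared hashtags unless the counts are first reduced to 0/1.

```julia
using SparseArrays

# Step 1 result: hashtag x user count matrix (made-up data).
# Rows: #a, #b. Columns: u1, u2, u3.
H = sparse([1 1 0;
            0 2 1])

# Step 2: user-user graph. Entry (i, j) is nonzero iff users i and j share
# a hashtag; the value is a sum of products of usage counts.
S = H' * H

# Reducing counts to 0/1 first makes entry (i, j) the number of shared hashtags.
B = Int.(H .> 0)
Sshared = B' * B

println(Matrix(S))
println(Matrix(Sshared))
```

In both versions the diagonal is nonzero for every user that used at least one hashtag, which is why the results above include each user paired with themselves.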

What other graphs can you create that might be interesting? Try making some now.

# Running Graph Algorithms

## Degree-Filtered Breadth First Search

Say you are interested in seeing people that talk about similar things. You may have a handful of target individuals (maybe selected because they talk a lot about a topic), and want to get their 2-hop neighbors. You could do this by running breadth first search (BFS) on those individuals on the graph we generated above.

Since we are doing degree filtering, we'll first generate a degree table.

In [15]:
# Create Degree Table for Degree Filtering
usersharehashtagdegname=myName * "user_share_hashtag_deg"

usersharehashtagdeg = makedegreetable(usersharehashtag, usersharehashtagdegname, countColumns=true, colq="deg");

In [16]:
# Lots of degree 1 -- we didn't filter out the diagonal on Graphulo. Degree filtering can help us with this!
# Also, for convenience, we can also view all the entries that are greater than 1, as follows:
# u = usersharehashtagdeg[:, :]
# uparse = str2num(u)
# printFull(uparse>1)

Now we can run BFS. I picked three users mostly at random as our starting vertices.

In [21]:
users = "user|smilevvsmilevv,user|OrenTsur,user|viiocious,"

## Run Breadth First Search on Adjacency Schema
numsteps=2 # Number of steps
v0=users
Atable=DB[myName * "user_share_hashtag"]

# Set results table
Rname=myName * "user_share_hashtag_bfs"
RnameTable = DB[Rname]
delete(RnameTable)

# Other BFS Params
RnameT=Rname * "T"

# Do BFS
v = adjbfs(Atable, v0, numsteps, Rname, RnameT; minDegree=5, maxDegree=20, 
    ADegtable=usersharehashtagdegname, degColumn="deg", degInColQ=false)
bfsresult = RnameTable[:,:]

# Cleanly printing the output
v0rep = replace(v0, "user|" => " ")
println("\nVertices reached in " * string(numsteps) * " hops from " * v0rep * " (" * string(length(bfsresult.col)) * " total):")
for i = 1:length(bfsresult.col)
    print(bfsresult.col[i] * ", ")
end

Creating t2_user_share_hashtag_bfs in class-db02
Deleting t2_user_share_hashtag_bfs in class-db02
13 Aug 2019 13:51:02,338 DEBUG - Graphulo.OneTable(987) -  user|viiocious :%00; [3] -> 33 entries processed
13 Aug 2019 13:51:02,461 DEBUG - Graphulo.OneTable(987) -  user|viiocious :%00; [24] -> 315 entries processed

Vertices reached in 2 hops from  smilevvsmilevv, OrenTsur, viiocious, (49 total):
user|3lowshal7aeyaii, user|AleshkaLamanov, user|Alesyafedotova, user|Canniagoyessie, user|DudnikD, user|Ghassan366, user|HajiArefi, user|Indraputr4, user|IndryOktafiany, user|Mmmjj55, user|NadiaSaftiyan, user|Nakkarin_P, user|NatashaTyugaeva, user|NottinghamTIC, user|OrenTsur, user|PRINCESS_mony9, user|PalamarVeronika, user|SAIF_HA, user|SalmaAlQibti, user|StasAntonov, user|SueChua, user|SuzanneTee0217, user|Vanya_Cherevko, user|WulaaanWP, user|ZELO96_BTHB, user|_fosa, user|aboabdolla8, user|aisyyahs, user|akhbarhurra, user|an_NoY_nr, user|andrebranca94, user|deadinyati, user|dianaraflata_, use

How do your results change if you change the min and max degrees?

## Jaccard Index

The Jaccard index -- a similarity measure of two sets -- is another metric for saying how similar two users may be. In our example, we look at each user's set of hashtag sharers, and compute how similar the two sets are.

A good explanation of the Jaccard Index, and how it works in the context of graph theory: https://medium.com/rapids-ai/similarity-in-graphs-jaccard-versus-the-overlap-coefficient-610e083b877d
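As a rough local illustration of the definition, the Jaccard coefficient of two vertices is the size of the intersection of their neighbor sets divided by the size of the union. For a 0/1 adjacency matrix this can be computed from one sparse multiply and the degrees. The function name `jaccard_matrix` and the toy graph are hypothetical, and Graphulo's implementation may differ in details such as how it treats the diagonal.

```julia
using SparseArrays

# Jaccard coefficients for all vertex pairs of an unweighted graph.
# A is a 0/1 adjacency matrix; row i lists the neighbors N(i) of vertex i.
function jaccard_matrix(A::SparseMatrixCSC)
    common = A * A'                  # common[i, j] = |N(i) ∩ N(j)|
    deg = vec(sum(A, dims=2))        # deg[i] = |N(i)|
    rows, cols, vals = findnz(common)
    # |N(i) ∪ N(j)| = deg[i] + deg[j] - |N(i) ∩ N(j)|
    J = [c / (deg[i] + deg[j] - c) for (i, j, c) in zip(rows, cols, vals)]
    return sparse(rows, cols, J, size(A, 1), size(A, 1))
end

# Toy undirected graph: 1-2, 1-3, 2-3, 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
src = first.(edges)
dst = last.(edges)
A = sparse(vcat(src, dst), vcat(dst, src), ones(Int, 2 * length(edges)), 4, 4)

J = jaccard_matrix(A)
println(J[1, 2])   # N(1) = {2, 3}, N(2) = {1, 3}: one common neighbor out of three
println(J[1, 4])   # N(1) = {2, 3}, N(4) = {3}: one common neighbor out of two
```

Dividing by the union is what keeps high-degree vertices from looking similar to everything simply because they have many neighbors.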

We can run Jaccard on the entire user-user graph in the database using Graphulo.

In [22]:
# Set Params
Aname=myName * "user_share_hashtag"
Atable = DB[Aname]
ADegtable=Aname * "_deg"
Rfinal=Aname * "_jaccard"

# Do Jaccard
jaccard(Atable, ADegtable, Rfinal)

# Set up Results Table
TadjJaccard = DB[Rfinal];

13 Aug 2019 13:51:11,577 DEBUG - Graphulo.OneTable(987) -   :%00; [1] -> 121104 entries processed
13 Aug 2019 13:51:11,577 DEBUG - Graphulo.Jaccard(3429) -  Jaccard #partial products 121104


In [23]:
# Printing selected Jaccard Coefficients
using Random

userDeg=DB[ADegtable]
users = str2num(userDeg[:,:])
u = getrow(strictbounded(users, 5, 20))
J = str2float(TadjJaccard[u,:])
cols = 3
printFull( J[:, rand(1:size(J)[2], 2)] )

9×3 Array{Union{AbstractString, Number},2}:
 ""                       "user|RizaIcha12"   "user|NottinghamTIC"
 "user|12N_nadiya2"      0.0327869           0.0                  
 "user|Alesyafedotova"   0.0                 0.125                
 "user|DebbyAngrainiHD"  0.0327869           0.0                  
 "user|Indraputr4"       0.0625              0.0                  
 "user|IndryOktafiany"   0.0625              0.0                  
 "user|Mmmjj55"          0.0                 0.2                  
 "user|NadiaSaftiyan"    0.0625              0.0                  
 "user|NikeDcn4"         0.0327869           0.0                  

## Topic Modeling

Those users we found in our BFS example earlier, what do they talk about? We can run NMF to do some topic modeling.

Since NMF runs on an Incidence matrix, we need to first filter our original Edge table down to the tweets written by our users of interest. The first step to doing this is running BFS on the incidence matrix. This will give us the rows of Tedge that correspond to those users. Then we can filter out just the words by using the Graphulo OneTable function.
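For readers who have not seen NMF before, here is a minimal local sketch of the idea: approximate a nonnegative matrix A by W\*H, with W and H nonnegative and k columns in W (one per topic). It uses the textbook Lee-Seung multiplicative updates, which is an assumption made for illustration; Graphulo's `nmf` may use a different algorithm. The function name `nmf_sketch` and the toy matrix are hypothetical.

```julia
using LinearAlgebra, Random

# NMF sketch with Lee-Seung multiplicative updates: A ≈ W * H, W and H >= 0.
function nmf_sketch(A::AbstractMatrix, k::Int; maxiter::Int=200, seed::Int=0)
    rng = MersenneTwister(seed)
    m, n = size(A)
    W = rand(rng, m, k) .+ 0.1        # strictly positive starting point
    H = rand(rng, k, n) .+ 0.1
    tiny = 1e-9                       # guards against division by zero
    for _ in 1:maxiter
        H .*= (W' * A) ./ (W' * W * H .+ tiny)
        W .*= (A * H') ./ (W * H * H' .+ tiny)
    end
    return W, H
end

# Toy tweet x word matrix with two obvious "topics" (made-up data).
A = Float64[1 1 0 0;
            1 1 0 0;
            0 0 1 1;
            0 0 1 1]

W, H = nmf_sketch(A, 2)
println("reconstruction error: ", norm(A - W * H))
```

Each row of `H` weights the words in one topic, and each row of `W` says how strongly a tweet belongs to each topic, which is how the result tables are read in the NMF step of this section.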

First we run BFS (unfortunately calling BFS on the EdgeTable is a little messy):

In [24]:
## Run Breadth First Search on Incidence Schema
k=1 # Number of steps
v0 = v
Etablename=myName * "Tedge"
Etable = DB[Etablename]

# Set results table
Rtablename=Etablename * "_BFS"
TadjBFS = DB[Rtablename]

# Other BFS Params
RTtablename=Rtablename * "T"

# Do BFS
vGraphulo = edgebfs(Etable, v0, k, Rtablename, RTtablename;
    EDegtable=Etablename * "Deg", degColumn="", degInColQ=false);

Creating t2_Tedge_BFS in class-db02
13 Aug 2019 13:51:26,628 DEBUG - Graphulo.EdgeBFS(1445) -  fetchColumn :	
13 Aug 2019 13:51:26,630 DEBUG - EdgeBFSReducer.parseOptions(34) -  inColumnPrefixes: ,


Now the original incidence matrix is filtered to just tweets with the users in question.

Then we can filter out the words:

In [25]:
# TODO make this work in Graphulo, not locally

# Making some handy bindings
RtableT = DB[RTtablename]
Rtable = DB[Rtablename]

# Filter to just words
RtableFilterlocal = RtableT[realwords,:]

# Create a DB table, and upload
RtableFilter = DB[myName * "Tedge_filtered"]
put(RtableFilter, RtableFilterlocal, clear=true)
RtableFilterT = DB[myName * "Tedge_filteredT"]
put(RtableFilterT, RtableFilterlocal', clear=true)

Creating t2_Tedge_filtered in class-db02
Deleting t2_Tedge_filtered in class-db02
Creating t2_Tedge_filtered in class-db02
Creating t2_Tedge_filteredT in class-db02
Deleting t2_Tedge_filteredT in class-db02
Creating t2_Tedge_filteredT in class-db02
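
As a local picture of this word filter: in the transposed table the row keys are users and words, so keeping only the rows whose key starts with a word prefix leaves a words-by-tweets matrix. The sketch below assumes the `word_lower|` prefix seen in the NMF output later in this section. The data is made up, and `realwords` in the cell above may be built differently.

```julia
using SparseArrays

# Toy transposed incidence matrix: rows are users and words, columns are tweets.
# Labels and values are made up for illustration.
keys_T = ["user|alice", "word_lower|apa", "word_lower|itu", "user|bob"]
ET = sparse([1, 2, 3, 3, 4], [1, 1, 2, 3, 3], ones(5), 4, 3)

# Keep only the rows whose key carries the word prefix.
iswordrow = startswith.(keys_T, "word_lower|")
words = keys_T[iswordrow]
A = ET[iswordrow, :]          # words x tweets
AT = copy(A')                 # tweets x words, kept alongside A

println(words)
```

Keeping both the matrix and its transpose mirrors the two tables uploaded above, which is what NMF takes as its two inputs.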


Finally we can run NMF:

In [28]:
## NMF on Incidence/Edge Schema
# Note: takes some time to run

# Set results tables
tname_W=myName * "word_NMF_W"
TedgeNMF_W = DB[tname_W, tname_W * "T"]
tname_H=myName * "word_NMF_H"
TedgeNMF_H = DB[tname_H, tname_H * "T"]

# Set Params
Aorig=DB[myName * "Tedge_filtered"]
ATorig=DB[myName * "Tedge_filteredT"]
Wfinal= tname_W
WTfinal= tname_W * "T"
Hfinal= tname_H
HTfinal= tname_H * "T"
k=3 # Number of topics
maxiter=20 # Maximum number of iterations

# Do NMF
nmf(Aorig, ATorig, k, Wfinal, WTfinal, Hfinal, HTfinal, maxiter)

13 Aug 2019 13:52:12,235 DEBUG - Graphulo.OneTable(987) -   :%00; [1] -> 468 entries processed
13 Aug 2019 13:52:12,510 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 9 entries processed
13 Aug 2019 13:52:13,524 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 93 entries processed
13 Aug 2019 13:52:14,067 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 93 entries processed
13 Aug 2019 13:52:15,255 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 9 entries processed
13 Aug 2019 13:52:16,207 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 351 entries processed
13 Aug 2019 13:52:16,716 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 468 entries processed
13 Aug 2019 13:52:17,215 DEBUG - Graphulo.NMF(3791) -  NMF Iteration 1 to t2_Tedge_filtered_NMF_Hprev: hdiff 0.0
13 Aug 2019 13:52:17,438 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 9 entries processed
13 Aug 2019 13:52:18,756 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 71 entries processed
13 Aug 2019 13:52:19,291 DEBUG - Graphulo

0.0

First, let's look at the words of each topic. The information is stored in the W table of the result:

In [29]:
WAssoc = str2float(TedgeNMF_W[:,:])
printFull(WAssoc)

157×4 Array{Union{AbstractString, Number},2}:
 ""                                    "1"          "2"          "3"       
 "word_lower|ada"                     0.0          0.0          3.25061    
 "word_lower|ado"                     0.0          0.0          0.288681   
 "word_lower|aja"                     0.0          0.582558     0.691934   
 "word_lower|alasannya?"              0.00131378   0.76175      0.0182819  
 "word_lower|ang"                     0.0          8.72533e-5   0.0        
 "word_lower|anjing"                  0.0250904    0.0          0.173126   
 "word_lower|apa"                    11.5075       0.0          0.0        
 "word_lower|apaan?"                  1.43874      0.0          0.0        
 "word_lower|atau"                    0.0          0.0          0.000453263
 "word_lower|ayah"                    0.0          0.206679     0.204627   
 "word_lower|ba"                      0.0          8.72533e-5   0.0        
 "word_lower|baik"                    0.0 

Now, we split the words into topics, and filter out the ones with low values.
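
The idea, on a plain dense matrix: zero every entry of a row except its largest, so that each word belongs to exactly one topic, and then take the N largest surviving weights per topic. The sketch below uses Base Julia's `argmax` with `dims` and `partialsortperm` instead of the explicit loops and cutoff values in the notebook cell. The matrix, the labels, and the helper name `topwords` are made up for illustration.

```julia
# Toy W matrix: one row per word, one column per topic (made-up values).
words = ["apa", "calon", "dan", "yg", "gak", "itu"]
W = [5.0 0.1 0.0;
     2.0 0.0 0.3;
     0.0 4.0 0.2;
     0.1 3.0 0.0;
     0.0 0.2 6.0;
     0.4 0.0 1.0]

# Keep only the largest entry of each row.
best = argmax(W, dims=2)          # one CartesianIndex per row
W1 = zeros(size(W))
W1[best] = W[best]

# Hypothetical helper: the (at most) n highest-weighted words of one topic.
function topwords(W1, words, topic, n)
    col = W1[:, topic]
    n = min(n, count(!iszero, col))
    n == 0 && return similar(words, 0)
    order = partialsortperm(col, 1:n, rev=true)
    return words[order]
end

for t in 1:size(W1, 2)
    println("Topic ", t, ": ", topwords(W1, words, t, 2))
end
```

Unlike a numeric cutoff, selecting by rank returns at most n words per topic even when several words tie at the threshold.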

In [30]:
# Keep only the highest value in each row, so that each word is associated with exactly one topic

M = Matrix(WAssoc.A)

# TODO write a sortperm-based version

width = size(M)[2]
for i = 1:size(M)[1]
    index = argmax(M[i,:])
    for j = 1:width
        if j != index
            M[i, j] = 0
        end
    end
end

WAssoc2 = putAdj(WAssoc, D4M.sparse(M))

# split into individual columns, with each column containing only one topic
# also keep only the words with the highest weights
# for more / fewer results, change topN, or change the scalars
topN = 11
col1 = WAssoc2[:, 1].A
cut1 = sort(col1, dims=1, rev=true)[topN] - 0.01
# cut1 = sum(col1)/length(col1) * 2
col2 = WAssoc2[:, 2].A
# cut2 = sum(col2)/length(col2) * 2.5
cut2 = sort(col2, dims=1, rev=true)[topN] - 0.01
col3 = WAssoc2[:, 3].A
# cut3 = sum(col3)/length(col3) * 1
cut3 = sort(col3, dims=1, rev=true)[topN] - 0.01

println("Topic 1:")
println.((WAssoc2[:, 1] > cut1).row)
println("\nTopic 2:")
println.((WAssoc2[:, 2] > cut2).row)
println("\nTopic 3:")
println.((WAssoc2[:, 3] > cut3).row);

Topic 1:
word_lower|apa
word_lower|calon
word_lower|eyang
word_lower|kabar
word_lower|pencalonan
word_lower|pendapatmu
word_lower|presiden
word_lower|punya
word_lower|sbg
word_lower|subur
word_lower|ttg

Topic 2:
word_lower|bekerja
word_lower|dalam
word_lower|dan
word_lower|disiplin
word_lower|kebohongan
word_lower|pemimpin
word_lower|pribadi
word_lower|sama
word_lower|seorang
word_lower|tidak
word_lower|yg

Topic 3:
word_lower|ada
word_lower|gak
word_lower|gak?
word_lower|itu
word_lower|kalau
word_lower|kamu
word_lower|kata
word_lower|menurut
word_lower|orang
word_lower|pacar
word_lower|rasa
word_lower|salah


It's a bit hard to see if this makes any sense, since it's not in English. Using Google Translate or similar, we can see that at least the first two could make sense:

<img src="images/topic1.png" alt="Drawing" style="width: 700px;"/>
<img src="images/topic2.png" alt="Drawing" style="width: 700px;"/>
<img src="images/topic3.png" alt="Drawing" style="width: 700px;"/>

Now, let's take a look at the tweets that are grouped into topics. First, we have to find the sets of Tweet IDs. We follow a similar process as before.
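
For the H table the roles are transposed: topics are rows and tweets are columns, so each tweet is assigned to the topic with the largest value in its column. A plain-Julia sketch of that grouping, with made-up IDs and weights:

```julia
# Toy H matrix: one row per topic, one column per tweet (made-up values).
tweetids = ["101", "102", "103", "104", "105"]
H = [0.9 0.0 0.1 0.0 0.3;
     0.1 0.8 0.0 0.2 0.3;
     0.0 0.1 0.7 0.6 0.2]

# Topic index with the largest weight in each column.
# argmax returns the first maximum, so ties go to the lower topic number.
topicof = [argmax(H[:, j]) for j in 1:size(H, 2)]

# Group tweet IDs by their assigned topic.
groups = [tweetids[topicof .== t] for t in 1:size(H, 1)]

for (t, ids) in enumerate(groups)
    println("Topic ", t, ": ", ids)
end
```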

In [31]:
HAssoc = str2float(TedgeNMF_H[:,:])

M = Matrix(HAssoc.A)

l = size(M)[1]
for i = 1:size(M)[2]
    index = argmax(M[:,i])
    for j = 1:l
        if j != index
            M[j, i] = 0
        end
    end
end

HAssoc2 = putAdj(HAssoc, D4M.sparse(M))

row1 = HAssoc2[1, :].A
# cut1 = sum(row1)/length(row1) * 0.00001
cut1 = 10^-9
row2 = HAssoc2[2, :].A
cut2 = 10^-9
# cut2 = sum(row2)/length(row2) * 0.1
row3 = HAssoc2[3, :].A
cut3 = 10^-8
# cut3 = sum(row3)/length(row3)

# collect the tweet IDs (column keys) of each topic, so that we can use them to look up the actual tweets
topic1cols = (HAssoc2[1, :] > cut1).col
topic2cols = (HAssoc2[2, :] > cut2).col
topic3cols = (HAssoc2[3, :] > cut3).col

println("Topic 1:")
println.(topic1cols)
println("\nTopic 2:")
println.(topic2cols)
println("\nTopic 3:")
println.(topic3cols);

Topic 1:
4030149240881733
422083534270881733
589104590429781733
6522994496781733
677304766330881733
739531693730881733
804184452207781733
844063821428781733

Topic 2:
121375375429781733
253486513169781733
35874956448781733
359502688949781733
378746561148781733
4053126776781733
624514187498781733

Topic 3:
42472996876781733
441260120156781733
635779414759781733
652489168756781733
673924555778781733
825479379077781733
846990361667781733
848249482340881733
8866479030881733


Now we load the full text of the tweets that were grouped into topics.

In [32]:
fulltext = ReadCSV("graphclassdata/tweetsfulltext2.csv", quotes=false);

Printing associative arrays in databases is asynchronous, so here are the results printed one cell at a time.

In [33]:
printFull(fulltext[topic1cols, :])

9×2 Array{Union{AbstractString, Number},2}:
 ""                    …  "Text"                                                                                                                           
 "4030149240881733"       "Apaan? NO!!!!\"@JawabJUJUR: Apa pendapatmu ttg kabar pencalonan eyang subur sbg calon presiden 2014? #JJ | @diahpuspa\""        
 "422083534270881733"     "sinting RT @JawabJUJUR: Apa pendapatmu ttg kabar pencalonan eyang subur sbg calon presiden 2014? #JJ | @diahpuspa"              
 "589104590429781733"     "maymoop ituv\"@JawabJUJUR: Apa pendapatmu ttg kabar pencalonan eyang subur sbg calon presiden 2014? #JJ | @diahpuspa\""         
 "6522994496781733"       "Haha itu lucuRT@JawabJUJUR: Apa pendapatmu ttg kabar pencalonan eyang subur sbg calon presiden 2014? #JJ | @diahpuspa"          
 "677304766330881733"  …  "Ga punya malu \"@JawabJUJUR: Apa pendapatmu ttg kabar pencalonan eyang subur sbg calon presiden 2014? #JJ | @diahpuspa\""       
 "739531693730881733

In [34]:
printFull(fulltext[topic2cols, :])

8×3 Array{Union{AbstractString, Number},2}:
 ""                    …  "Text"                                                                                                                                   
 "121375375429781733"     "Cuekin nnti jg baik sndri kok RT @JawabJUJUR: Kalo doi marah"                                                                           
 "253486513169781733"     "@fdrifda coy jangan ke saya coy hahaha"                                                                                                 
 "35874956448781733"      "RT @TweetRAMALAN: #Capricorn seorang pribadi yg tidak suka kebohongan"                                                                  
 "359502688949781733"     "Hhhaaa RT@TweetRAMALAN: #Capricorn seorang pribadi yg tidak suka kebohongan"                                                            
 "378746561148781733"  …  "- Natuklasan Mo na ba ang Tunay Nitong Ganda? http://t.co/UztKtdHaPG #Philippine #philippines #Mindanao #Mani

In [35]:
printFull(fulltext[topic3cols, :])

10×3 Array{Union{AbstractString, Number},2}:
 ""                    …  "Text"                                                                                                                                     
 "42472996876781733"      "Gak\"@AhSpeakDoang: #OmSpikTanya lo dirumah punya anjing gak ?\""                                                                         
 "441260120156781733"     "masa bodoooo :D RT @nankkatro04: @deadinyati masa?? RT"                                                                                   
 "635779414759781733"     "Ibu\"@AhSpeakDoang: #OmSpikTanya lo lebih deket sama ayah / ibu ?\""                                                                      
 "652489168756781733"     "typo RT@rahna27: Mawah merah. @jawabJUJUR: [Cewek] Pilih mawar putih atau mawar merah? #JJ | @Dhea_Vanny\""                               
 "673924555778781733"  …  "Ndk ado lai"                                                                                      

Topic modeling tends to be fairly sensitive to the number of topics. Try varying k above and see how the results change. You can use Google Translate to see what the tweets are saying, to some degree. You can also try varying the maximum number of iterations.

# Deleting Tables

If you want to start over, you can run this to delete your tables.

In [36]:
deleteprefix(DB, myName)

Deleting t2_Tedge in class-db02
Deleting t2_TedgeDeg in class-db02
Deleting t2_TedgeT in class-db02
Deleting t2_TedgeTxt in class-db02
Deleting t2_Tedge_BFS in class-db02
Deleting t2_Tedge_BFST in class-db02
Deleting t2_Tedge_filtered in class-db02
Deleting t2_Tedge_filteredT in class-db02
Deleting t2_hashtag in class-db02
Deleting t2_hashtag_user in class-db02
Deleting t2_user_hashtag in class-db02
Deleting t2_user_share_hashtag in class-db02
Deleting t2_user_share_hashtag_bfs in class-db02
Deleting t2_user_share_hashtag_bfsT in class-db02
Deleting t2_user_share_hashtag_deg in class-db02
Deleting t2_user_share_hashtag_jaccard in class-db02
Deleting t2_word_NMF_H in class-db02
Deleting t2_word_NMF_HT in class-db02
Deleting t2_word_NMF_W in class-db02
Deleting t2_word_NMF_WT in class-db02
Deleting t2_wordword in class-db02
