# D4M Warmup

To use D4M, it needs to be on your path. In most environments on the LLSC systems D4M will already be on your path; however, in this specific case (an Octave kernel in Jupyter) it sometimes is not.

In [1]:
using D4M

D4M is a package for working with Associative Arrays. An Associative Array is a bit like a sparse matrix, but the rows, columns, and values can be either numbers or strings.
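As a rough mental model, an Associative Array can be pictured as a dictionary keyed by (row, column) pairs. The following is a minimal sketch in plain Python, not D4M code: `split_keys`, `make_assoc`, and `logical` are hypothetical helpers written for this illustration, and the sketch assumes that a repeated (row, column) pair simply keeps the last value seen, which may differ from D4M's own collision rule.

```python
def split_keys(s):
    """Split a D4M-style string whose last character is the delimiter."""
    delimiter = s[-1]
    return s[:-1].split(delimiter)


def make_assoc(row, column, values):
    """Build a {(row, col): value} dictionary from three delimited strings."""
    rows, cols, vals = split_keys(row), split_keys(column), split_keys(values)
    if not (len(rows) == len(cols) == len(vals)):
        raise ValueError("row, column, and values must have the same number of entries")
    # Assumption of this sketch: a repeated (row, col) pair keeps the last value.
    return {(r, c): v for r, c, v in zip(rows, cols, vals)}


def logical(assoc):
    """Replace every stored value with 1, keeping only the sparsity pattern."""
    return {key: 1 for key in assoc}


A = make_assoc("a,a,b,", "a,bb,a,", "a-a,a-bb,b-a,")
print(A[("a", "bb")])        # a-bb
print(sorted(logical(A)))    # [('a', 'a'), ('a', 'bb'), ('b', 'a')]
```

Only the non-empty entries are stored, which is what makes the structure behave like a sparse matrix with string labels.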

In [34]:
row = "a,a,a,a,a,a,a,aa,aaa,b,bb,bbb,a,aa,aaa,b,bb,bbb,"
column = "a,aa,aaa,b,bb,bbb,a,a,a,a,a,a,a,aa,aaa,b,bb,bbb,"
values = "a-a,a-aa,a-aaa,a-b,a-bb,a-bbb,a-a,aa-a,aaa-a,b-a,bb-a,bbb-a,a-a,aa-aa,aaa-aaa,b-b,bb-bb,bbb-bbb,"

A = Assoc(row,column,values)

printFull(A)

printFull(logical(A))

7×7 Array{Union{AbstractString, Number},2}:
 ""     "a"      "aa"     "aaa"      "b"    "bb"     "bbb"    
 "a"    "a-a"    "a-aa"   "a-aaa"    "a-b"  "a-bb"   "a-bbb"  
 "aa"   "aa-a"   "aa-aa"  ""         ""     ""       ""       
 "aaa"  "aaa-a"  ""       "aaa-aaa"  ""     ""       ""       
 "b"    "b-a"    ""       ""         "b-b"  ""       ""       
 "bb"   "bb-a"   ""       ""         ""     "bb-bb"  ""       
 "bbb"  "bbb-a"  ""       ""         ""     ""       "bbb-bbb"

7×7 Array{Union{AbstractString, Number},2}:
 ""      "a"   "aa"   "aaa"   "b"   "bb"   "bbb"
 "a"    1     1      1       1     1      1     
 "aa"   1     1      0       0     0      0     
 "aaa"  1     0      1       0     0      0     
 "b"    1     0      0       1     0      0     
 "bb"   1     0      0       0     1      0     
 "bbb"  1     0      0       0     0      1     

This flexibility makes D4M Associative Arrays ideal for representing graph data. Often this involves string row and column labels, representing the names of vertices and/or edges, and numeric values representing the existence of an edge or the weight of that edge.

_TODO fix put_

In [38]:
row = "v1,v1,v1,v1,v1,v1,v1,v2,v3,v4,v5,v6,v1,v2,v3,v4,v5,v6,"
column = "v1,v2,v3,v4,v5,v6,v1,v1,v1,v1,v1,v1,v1,v2,v3,v4,v5,v6,"
values = "1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,"

A = Assoc(row,column,values)

printFull(A)

printFull(logical(A))

7×7 Array{Union{AbstractString, Number},2}:
 ""    "v1"  "v2"  "v3"  "v4"  "v5"  "v6"
 "v1"  "1"   "1"   "1"   "1"   "1"   "1" 
 "v2"  "1"   "1"   ""    ""    ""    ""  
 "v3"  "1"   ""    "1"   ""    ""    ""  
 "v4"  "1"   ""    ""    "1"   ""    ""  
 "v5"  "1"   ""    ""    ""    "1"   ""  
 "v6"  "1"   ""    ""    ""    ""    "1" 

7×7 Array{Union{AbstractString, Number},2}:
 ""     "v1"   "v2"   "v3"   "v4"   "v5"   "v6"
 "v1"  1      1      1      1      1      1    
 "v2"  1      1      0      0      0      0    
 "v3"  1      0      1      0      0      0    
 "v4"  1      0      0      1      0      0    
 "v5"  1      0      0      0      1      0    
 "v6"  1      0      0      0      0      1    

<img src="images/graphEx.png" alt="Drawing" style="width: 200px;"/>

With Associative Arrays, you can extract a subgraph by indexing into the Associative Array. For example, let's get just the columns for the odd-numbered vertices, and then just the rows for vertices 1-3.

Note that the last character in an index string is the delimiter. This lets us index on multiple values with a single string, which matters because cell arrays of strings can get very slow.
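How such an index string might be interpreted can be sketched in plain Python. This is an illustration under stated assumptions, not D4M's implementation: `select_keys` is a hypothetical helper, and it assumes that a `:` between two keys denotes an inclusive range in lexicographic (string) order.

```python
def select_keys(index, keys):
    """Return the keys picked out by a delimited index string.

    The last character of `index` is the delimiter. A triple of the form
    start, ':', stop is treated as an inclusive lexicographic range
    (an assumption made for this sketch).
    """
    parts = index[:-1].split(index[-1])
    if len(parts) == 3 and parts[1] == ":":
        start, stop = parts[0], parts[2]
        return [k for k in keys if start <= k <= stop]
    wanted = set(parts)
    return [k for k in keys if k in wanted]


vertices = ["v1", "v2", "v3", "v4", "v5", "v6"]
print(select_keys("v1,v3,v5,", vertices))   # ['v1', 'v3', 'v5']
print(select_keys("v1,:,v3,", vertices))    # ['v1', 'v2', 'v3']
```

Because the comparison is on strings, a label such as `v10` sorts between `v1` and `v2`, so zero-padded labels are safer when ranges are used.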

In [42]:
printFull(A[:,"v1,v3,v5,"])

printFull(A["v1,:,v3,",:])

7×4 Array{Union{AbstractString, Number},2}:
 ""    "v1"  "v3"  "v5"
 "v1"  "1"   "1"   "1" 
 "v2"  "1"   ""    ""  
 "v3"  "1"   "1"   ""  
 "v4"  "1"   ""    ""  
 "v5"  "1"   ""    "1" 
 "v6"  "1"   ""    ""  

4×7 Array{Union{AbstractString, Number},2}:
 ""    "v1"  "v2"  "v3"  "v4"  "v5"  "v6"
 "v1"  "1"   "1"   "1"   "1"   "1"   "1" 
 "v2"  "1"   "1"   ""    ""    ""    ""  
 "v3"  "1"   ""    "1"   ""    ""    ""  

We can also add, subtract, and multiply Associative Arrays.

In [46]:
printFull(A + A[:,"v1,v3,v5,"]) # A, with odd columns doubled

printFull(A - A["v1,:,v3,",:]) # A, with first three rows gone

printFull(A * A[:,"v2,v4,"]) # A times just a couple of its columns

7×7 Array{Union{AbstractString, Number},2}:
 ""     "v1"   "v2"   "v3"   "v4"   "v5"   "v6"
 "v1"  2.0    1.0    2.0    1.0    2.0    1.0  
 "v2"  2.0    1.0    0.0    0.0    0.0    0.0  
 "v3"  2.0    0.0    2.0    0.0    0.0    0.0  
 "v4"  2.0    0.0    0.0    1.0    0.0    0.0  
 "v5"  2.0    0.0    0.0    0.0    2.0    0.0  
 "v6"  2.0    0.0    0.0    0.0    0.0    1.0  

4×5 Array{Union{AbstractString, Number},2}:
 ""     "v1"   "v4"   "v5"   "v6"
 "v4"  1.0    1.0    0.0    0.0  
 "v5"  1.0    0.0    1.0    0.0  
 "v6"  1.0    0.0    0.0    1.0  

7×3 Array{Union{AbstractString, Number},2}:
 ""     "v2"   "v4"
 "v1"  2      2    
 "v2"  2      1    
 "v3"  1      1    
 "v4"  1      2    
 "v5"  1      1    
 "v6"  1      1    

This is just a small taste of D4M to get you started. You'll see more as we go through this notebook.

# Creating Incidence and Adjacency Matrices

Usually when we parse data into D4M Associative Arrays we put them in an Incidence Matrix format. For review, the rows of an incidence matrix correspond to edges, while the columns correspond to vertices. Incidence matrices can represent a variety of types of graphs, and so can be very useful. You can also easily form any adjacency matrix you need from an incidence matrix.

To see how an incidence matrix is formed, suppose you have some raw data in a TSV file. The row IDs of the incidence matrix are unique identifiers of the rows in the original file; in this example they are just the row numbers. Each row in the TSV file generally has several columns. We combine each column name with each column value to create the column IDs, or vertex IDs, of our incidence matrix. Notice how we have also parsed out each individual word in column 3.

<img src="images/TSVtoAssoc.png" alt="Drawing" style="width: 600px;"/>
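The construction described above, and the step from an incidence matrix to an adjacency matrix, can be sketched in plain Python. Everything here is hypothetical and for illustration only: the `records` data, the `to_incidence` and `adjacency` helpers, and the simplification that every incidence entry has value 1 (so a word repeated within one record is counted once).

```python
from collections import defaultdict

# Hypothetical records standing in for rows of a TSV file.
records = [
    {"user": "alice", "text": "coffee time"},
    {"user": "bob", "text": "coffee break"},
    {"user": "alice", "text": "break time"},
]


def to_incidence(records):
    """Map row number -> set of 'column|value' vertex IDs (one set per edge)."""
    incidence = {}
    for row_id, record in enumerate(records, start=1):
        vertices = {"user|" + record["user"]}
        # Split the free-text column into one vertex per word.
        vertices.update("word|" + w for w in record["text"].split())
        incidence[row_id] = vertices
    return incidence


def adjacency(incidence, left_prefix, right_prefix):
    """Count, for each (left, right) vertex pair, the rows containing both.

    This is the product E_left' * E_right when every incidence entry is 1.
    """
    counts = defaultdict(int)
    for vertices in incidence.values():
        for u in vertices:
            if not u.startswith(left_prefix):
                continue
            for v in vertices:
                if v.startswith(right_prefix):
                    counts[(u, v)] += 1
    return dict(counts)


E = to_incidence(records)
user_word = adjacency(E, "user|", "word|")
print(user_word[("user|alice", "word|time")])   # 2
```

Choosing different prefixes for the two sides yields different adjacency matrices from the same incidence data, which is the flexibility the incidence format provides.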

Let's look at an example. We are loading an Associative Array with Twitter data, and looking at the user columns in the first three tweets:

In [2]:
A = ReadCSV("graphclassdata/A1.csv");
A += ReadCSV("graphclassdata/A2.csv");
A += ReadCSV("graphclassdata/A3.csv");
A += ReadCSV("graphclassdata/A4.csv");
A += ReadCSV("graphclassdata/A5.csv");
A += ReadCSV("graphclassdata/A6.csv");
A += ReadCSV("graphclassdata/A7.csv");
A += ReadCSV("graphclassdata/A8.csv");
A += ReadCSV("graphclassdata/A9.csv");
A += ReadCSV("graphclassdata/A10.csv");

In [3]:
length(A.row) # should be 10000

10000

In [4]:
length(A.col) # should be 129895

129895

In [5]:
printFull(A[1:3,StartsWith("user|,")])

4×4 Array{Union{AbstractString, Number},2}:
              ""   "user|Blocker57"   "user|Wangjeje"   "user|gyanpratika"
  821278781733    0.0                0.0               1.0                
 4857479781733    1.0                0.0               0.0                
 8653317781733    0.0                1.0               0.0                

Since the full matrix display is not practical for viewing much more than this, we often look at the triples of the Associative Array. Here are the first two rows:

In [6]:
print(A[1:2,:])

  [2, latlon|++004315..82003012019000000000]  =  1.0
  [1, latlon|-+010067..10923207153171000000]  =  1.0
  [2, lat|+041.8031090000                  ]  =  1.0
  [1, lat|-006.1930137000                  ]  =  1.0
  [2, lon|+035.2002100000                  ]  =  1.0
  [1, lon|+107.0227511000                  ]  =  1.0
  [2, place|682c5a667856ef42               ]  =  1.0
  [1, place|85858f3447b85e2b               ]  =  1.0
  [1, time|2013-05-22 12:47:08             ]  =  1.0
  [2, time|2013-05-22 12:47:32             ]  =  1.0
  [1, userID|204683306                     ]  =  1.0
  [2, userID|258075949                     ]  =  1.0
  [2, user_lower|blocker57                 ]  =  1.0
  [1, user_lower|gyanpratika               ]  =  1.0
  [2, user|Blocker57                       ]  =  1.0
  [1, user|gyanpratika                     ]  =  1.0
  [1, word_lower|:(                        ]  =  1.0
  [2, word_lower|pastanesi                 ]  =  1.0
  [2, word_lower|tarhan                    ]  

From here we can form a variety of adjacency matrices. For example:

In [6]:
# First, we make a bunch of query/filter objects that we can pass into indexing.

# All words that start with a letter:
realwords = "word|A,:,word|z,";

# All hashtags:
hashtags = StartsWith("word|#");

# Tweets @ someone:
directedtweet = StartsWith("word|@");

# Usernames:
users = StartsWith("user|");

# Locations:
locs = StartsWith("latlon|");

In [18]:
# Users-Location Graph
println("Bipartite User-Location Graph") # number of tweets for each user/location pair
Auserloc = transpose(A[:,users]) * A[:,locs]
print(Auserloc[1:2,:]) # print the first two rows to get a feel

println()

# Word-Word Graph
println("Small Symmetric Word-Word Graph") # yay for quadratic runtime
Asmall = A[1:1000, :]
Arealwordsfilter = Asmall[:, realwords] # looking at just the words 
Awords = transpose(Arealwordsfilter) * Arealwordsfilter # number of tweets for each word/word pair
Awords = removediag(Awords) 
print(Awords>4) # display sufficiently high

println()

# Directed tweets
println("Directed User-User Graph")
Adir = transpose(A[:,users]) * A[:,directedtweet] # number of tweets for each "sender/recipient" pair
print(Adir[11:20, :]) # just a few users

Bipartite User-Location Graph
  [user|0000The  , latlon|++002459..46269574254468000000]  =  1.0
  [user|000chiaki, latlon|++013369..30212434630140720000]  =  1.0

Small Symmetric Word-Word Graph
  [word|T?rkiye) , word|I'm      ]  =  8.0
  [word|w/       , word|I'm      ]  =  14.0
  [word|see      , word|It       ]  =  5.0
  [word|aku      , word|RT       ]  =  6.0
  [word|lah      , word|RT       ]  =  5.0
  [word|yg       , word|RT       ]  =  5.0
  [word|I'm      , word|T?rkiye) ]  =  8.0
  [word|w/       , word|T?rkiye) ]  =  7.0
  [word|eu       , word|To       ]  =  5.0
  [word|eu       , word|aii      ]  =  10.0
  [word|RT       , word|aku      ]  =  6.0
  [word|ga       , word|aku      ]  =  8.0
  [word|jadi     , word|aku      ]  =  6.0
  [word|kamu     , word|aku      ]  =  11.0
  [word|eu       , word|baga?eira]  =  5.0
  [word|var      , word|bir      ]  =  6.0
  [word|eu       , word|cabare   ]  =  5.0
  [word|eu       , word|caminh?o ]  =  5.0
  [word|lah      , word|dak 

# Graph Algorithms in D4M

D4M has some graph algorithm capabilities built in. Let's take a look at a few examples before we start looking at in-database algorithms.

## Breadth First Search

From our word-word graph, we can see which words are used together. But what if we want to go out one or two levels more? We can use breadth first search (BFS). We'll start by specifying a few source vertices.

In [None]:
words = "word|RT,"
Asearch = logical(Awords)

## Run Breadth First Search on Adjacency Schema
k=3 # Number of steps
minDegree=5
maxDegree=10
v0 = words # initial vertices
Adeg = transpose(sum(logical(A),1))        # Degree of each node.

v = AdjBFS(Asearch,Adeg,v0,k,minDegree,maxDegree, takeunion=true)
v = Row(v)

# Vertices reached in 3 hops from specified users
# println("Vertices reached in " *  num2str(k) " hops from " * strrep(v0(1:end-1),char(10),', ') ':']) 
# println(strrep(v(1:end-1),char(10),', '))
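The idea behind a degree-filtered BFS can be sketched in plain Python. This is not the D4M `AdjBFS` implementation: `degree_filtered_bfs` and the small `adj` graph are hypothetical, and the sketch assumes the filter decides which vertices are expanded (a vertex whose degree falls outside the range is reached but not searched from).

```python
def degree_filtered_bfs(adj, sources, k, min_degree, max_degree):
    """Return the union of vertices reached within k hops of `sources`.

    `adj` maps each vertex to the set of its neighbours. A vertex is only
    expanded when its degree lies in [min_degree, max_degree]; where exactly
    the filter applies is an assumption of this sketch.
    """
    reached = set()
    frontier = set(sources)
    for _ in range(k):
        next_frontier = set()
        for u in frontier:
            neighbours = adj.get(u, set())
            if min_degree <= len(neighbours) <= max_degree:
                next_frontier |= neighbours
        next_frontier -= reached      # keep only newly discovered vertices
        reached |= next_frontier
        frontier = next_frontier
    return reached


# Hypothetical undirected word graph.
adj = {
    "RT": {"aku", "lah", "yg"},
    "aku": {"RT", "ga", "kamu"},
    "lah": {"RT"},
    "yg": {"RT"},
    "ga": {"aku"},
    "kamu": {"aku"},
}
print(sorted(degree_filtered_bfs(adj, ["lah"], 2, 1, 10)))  # ['RT', 'aku', 'lah', 'yg']
print(sorted(degree_filtered_bfs(adj, ["lah"], 2, 1, 2)))   # ['RT']
```

Lowering the maximum degree stops the search from fanning out through very common vertices such as `RT`, which is usually the point of the filter.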

## Jaccard Index

The Jaccard index is a metric of similarity which can be calculated on a graph. In the context of words, this takes into account not only the number of times two words occur together, but also the overall degrees of each word.

_TODO make this work, once Jaccard works too_

In [None]:
# % First remove low-degree nodes, those will throw off the results
# w = Row(Adeg(realwords,:) > 2);
# Atest = Awords(:,w);
# Atest = Atest(w,:);

# J = Jaccard(Atest);

# J(['word|coffee' nl ],:)> 0.1
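For reference, the standard Jaccard index of two vertices is the size of the intersection of their neighbour sets divided by the size of the union. The following plain-Python sketch illustrates that definition only; the `jaccard` helper and the `neighbours` data are hypothetical, and whether the D4M `Jaccard` function treats weights and self-loops the same way is not assumed here.

```python
def jaccard(neighbours, u, v):
    """Jaccard index of u and v: |N(u) & N(v)| / |N(u) | N(v)|."""
    nu, nv = neighbours[u], neighbours[v]
    union = nu | nv
    if not union:
        return 0.0
    return len(nu & nv) / len(union)


# Hypothetical neighbour sets for three words.
neighbours = {
    "coffee": {"cup", "morning", "hot"},
    "tea": {"cup", "hot", "green"},
    "RT": {"cup", "morning", "hot", "green", "aku", "lah", "yg", "ga"},
}
print(jaccard(neighbours, "coffee", "tea"))  # 0.5
print(jaccard(neighbours, "coffee", "RT"))   # 0.375
```

Here `RT` shares more neighbours with `coffee` than `tea` does, but its high degree enlarges the union and pulls the score down.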

# Database Setup

Now we will bind to a database. Often we put these lines in a script called "DBsetup" and just run that.

Be sure to edit this to include a unique name for your tables. This will prevent multiple people from running on the same tables.

In [2]:
# Initialize DB connectors
dbinit()

# Connect to Database
DB = dbsetup("uno", "AccumuloConfig.jl")

08 Jul 2019 14:53:12,545 WARN - ClientConfiguration.loadFromSearchPath(227) -  Found no client.conf in default paths. Using default client configuration values.


D4M.DBserver("uno", "localhost:2181", "root", "secret", "BigTableLike", JavaCall.JavaObject{Symbol("edu.mit.ll.graphulo.MatlabGraphulo")}(Ptr{Nothing} @0x0000000004093458))

In [3]:
# Give a unique name prefix to prevent collisions in Accumulo
myName = "my_tweets_"

# Bind to tables
Tedge = DB[myName * "Tedge", myName * "TedgeT"]
TedgeDeg = DB[myName * "TedgeDeg"]
TedgeTxt = DB[myName * "TedgeTxt"]

Creating my_tweets_Tedge in uno
Creating my_tweets_TedgeT in uno
Creating my_tweets_TedgeDeg in uno
Creating my_tweets_TedgeTxt in uno


D4M.DBtable(D4M.DBserver("uno", "localhost:2181", "root", "secret", "BigTableLike", JavaCall.JavaObject{Symbol("edu.mit.ll.graphulo.MatlabGraphulo")}(Ptr{Nothing} @0x0000000004093458)), "my_tweets_TedgeTxt", "", 0, 0, "", 500000.0, JavaCall.JavaObject{Symbol("edu.mit.ll.d4m.db.cloud.D4mDataSearch")}(Ptr{Nothing} @0x0000000004093510), JavaCall.JavaObject{Symbol("edu.mit.ll.d4m.db.cloud.D4mDbTableOperations")}(Ptr{Nothing} @0x0000000002aae3a0))

If you check the Accumulo monitor, these tables are now present!

Now we will ingest our data.

In [22]:
# Load .mat files containing data
fname = '/home/gridsan/ledwards/Graphulo/GraphClass/tweets.mat';
load(fname);
nl = char(10);

# Insert into database
A=NewSep(A,nl);
put(Tedge,num2str(A));
put(TedgeDeg,putCol(num2str(sum(A,1).'),['Degree' nl]));
put(TedgeTxt,Atxt);

# Creating Adjacency Graphs

We store our data in an incidence matrix form using the standard D4M Schema. In this way you have the flexibility to create adjacency matrices of individual graphs as you need them.

<img src="images/d4mschema.png" alt="Drawing" style="width: 800px;"/>

We can use Graphulo's table multiply to create our adjacency matrices. For our Twitter data, we can create several that may be interesting. First let's get the column keys of the columns we may want to filter on, and use the degree table to get an idea of how many columns each filter matches. This information is important when deciding how to form your adjacency matrix.

In [23]:
# All words that start with a letter:
realwords = ['word_lower|a' char(10) ':' char(10) 'word_lower|z' char(127) char(10)];
disp(['There are ' num2str(nnz(TedgeDeg(realwords,:))) ' words that start with a letter.'])

# All hashtags:
hashtags = StartsWith(['word|#' char(10)]);
disp(['There are ' num2str(nnz(TedgeDeg(hashtags,:))) ' hashtags.'])

# Tweets @ someone:
directedtweet = StartsWith(['word|@' char(10)]);
disp(['There are ' num2str(nnz(TedgeDeg(directedtweet,:))) ' words that start with @.'])

# Usernames:
users = StartsWith(['user|' char(10)]);
disp(['There are ' num2str(nnz(TedgeDeg(users,:))) ' usernames.'])

# Locations:
locs = StartsWith(['latlon|' char(10)]);
disp(['There are ' num2str(nnz(TedgeDeg(locs,:))) ' locations.'])

There are 25147 words that start with a letter.
There are 1118 hashtags.
There are 5474 words that start with @.
There are 8510 usernames.
There are 7705 locations.


Now let's make a few graphs. A Graphulo Table Multiply is in terms of A\*B. If you think of E as our incidence matrix, we can get an adjacency matrix by multiplying E'\*E, so A = E' and B = E. Since Graphulo requires the transpose of the "A" table to do the multiply, we pass A' = (E')' = E.
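One way to see why the transpose of the "A" table is the convenient input: if both A' and B are stored by row, the product can be accumulated one shared row key at a time as a sum of outer products. The plain-Python sketch below illustrates that arithmetic only. It is not Graphulo code; `table_mult` is a hypothetical helper, and its prefix-based column filters are a simplification of the filter strings used in this notebook.

```python
from collections import defaultdict


def table_mult(at_table, b_table, col_filter_at=None, col_filter_b=None):
    """Compute C = A * B given A' (at_table) and B as {row: {col: value}}.

    Each row key present in both tables contributes the outer product of
    its A' entries and its B entries to C.
    """
    c = defaultdict(float)
    for row, at_cols in at_table.items():
        b_cols = b_table.get(row)
        if b_cols is None:
            continue
        for i, a_val in at_cols.items():
            if col_filter_at and not i.startswith(col_filter_at):
                continue
            for j, b_val in b_cols.items():
                if col_filter_b and not j.startswith(col_filter_b):
                    continue
                c[(i, j)] += a_val * b_val
    return dict(c)


# Hypothetical incidence table: one row per tweet.
E = {
    "t1": {"user|ann": 1.0, "word|#a": 1.0},
    "t2": {"user|ann": 1.0, "word|#a": 1.0, "word|#b": 1.0},
    "t3": {"user|bob": 1.0, "word|#b": 1.0},
}
C = table_mult(E, E, "word|#", "user|")   # hashtag-by-user counts
print(C[("word|#a", "user|ann")])         # 2.0
```

Passing the same incidence table as both inputs, with different column filters on each side, is what produces E'\*E restricted to the vertex types of interest.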

First let's make a word-word graph. On the full dataset, this takes about 3-5 minutes, but the result is nearly a 1Mx1M graph with about 40M edges! Since this is a small subset, it only takes a few seconds.

In [24]:
% Set Parameters
% Multiply in terms of A*B = C, so if we want to do A'*A, then AT is just A
ATtable = [myName 'Tedge'];
Btable = [myName 'Tedge'];
Ctable = [myName 'word_graph'];
rowFilter = '';
colFilterAT = realwords;
colFilterB = realwords;


# Multiply Tables
tic;
numpp = G.TableMult(ATtable, Btable, Ctable, rowFilter, colFilterAT, colFilterB);
toc;

15 Oct 2018 15:13:21,095 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 525616 entries processed
Elapsed time is 2.30288 seconds.


Maybe we want to see what hashtags are used together. Let's make a hashtag-hashtag graph. On the full dataset this is much faster than the previous one and takes about a minute. On this subset it takes only a few seconds.

In [29]:
% Set Parameters
% Multiply in terms of A*B = C, so if we want to do A'*B, then AT is just A
ATtable = [myName 'Tedge'];
Btable = [myName 'Tedge'];
Ctable = [myName 'hashtag_graph'];
rowFilter = '';
colFilterAT = hashtags;
colFilterB = hashtags;

# Multiply Tables
tic;
numpp = G.TableMult(ATtable, Btable, Ctable, rowFilter, colFilterAT, colFilterB);
toc;

15 Oct 2018 15:15:10,426 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 3921 entries processed
Elapsed time is 1.74382 seconds.


Let's make a directed graph that shows which users use which hashtags.

If you are creating an adjacency matrix for a directed graph, you may want to create a transpose result graph as well, so you can query quickly for both in and out vertices (Accumulo is indexed by row key, so it is fastest to query rows). All previous graphs were symmetric, so they did not need a transpose adjacency matrix.

I have also found that you want to use the filter with the larger number of values as your colFilterB: it allows the iterator to process more entries at a time when it is scanning your A and B tables.

For example, the first time I created the user-hashtag graph on the full dataset I set colFilterAT as the "user|" prefix and colFilterB as the "word|#" prefix. I eventually killed the process because it was taking a very long time (see below). When I swapped them, it took about a minute.

<img src="images/graphulo-colfilters.png" alt="Drawing" style="width: 500px;"/>

Here is how you would multiply two tables and produce a transpose result matrix as well. Since you are calling a Java function, it is very picky about what inputs you use. I am just using the default values for the added parameters.

In [30]:
% Set Parameters
% Multiply in terms of A*B = C, so if we want to do A'*B, then AT is just A
ATtable = [myName 'Tedge'];
Btable = [myName 'Tedge'];
Ctable = [myName 'hashtag_user_graph'];
CTtable = [myName 'user_hashtag_graph'];
rowFilter = '';
colFilterAT = hashtags;
colFilterB = users;
presumCacheSize = -1; % this is the default value
numEntriesCheckpoint = -1; % this is the default value
trace_param = true; % this is the default value

# Multiply Tables
tic;
numpp = G.TableMult(ATtable, Btable, Ctable, CTtable, rowFilter, ...
    colFilterAT, colFilterB, presumCacheSize, numEntriesCheckpoint, trace_param);
toc;

15 Oct 2018 15:15:16,126 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 1553 entries processed
Elapsed time is 1.67015 seconds.


What if we want a graph that describes the users that use the same hashtags? That will take two steps. First create an adjacency matrix of users-hashtags by multiplying (we already created this in the previous example). Then, using the graph you have just made, multiplying again will create the graph you are looking for.

Since we are multiplying the full table, we don't need to provide any filters.

In [31]:
% Set Parameters
% Multiply in terms of A*B = C, so if we want to do A'*B, then AT is just A
ATtable = [myName 'hashtag_user_graph'];
Btable = [myName 'hashtag_user_graph'];
Ctable = [myName 'user_graph'];

# Multiply Tables
tic;
numpp = G.TableMult(ATtable, Btable, Ctable);
toc;

15 Oct 2018 15:15:39,877 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 7142 entries processed
Elapsed time is 0.289115 seconds.


What other graphs can you create that might be interesting? Try making some now.

# Running Graph Algorithms

## Degree-Filtered Breadth First Search

Say you are interested in seeing people that talk about similar things. You may have a handful of target individuals (maybe selected because they talk a lot about a topic), and want to get their 2-hop neighbors. You could do this by running breadth first search (BFS) on those individuals on the graph we generated above.

Since we are doing degree filtering, we'll first generate a degree table.

In [32]:
% Create Degree Table for Degree Filtering
Atable=[myName 'user_graph'];
ADegtable=[Atable '_Deg'];
countColumns = true; % true counts the columns, false sums the weights
G.generateDegreeTable(Atable, ADegtable, countColumns);
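The difference between counting columns and summing weights can be sketched in plain Python. This illustrates the two options described in the comment above rather than Graphulo's implementation; `degree_table` and the `user_graph` data are hypothetical.

```python
def degree_table(adj, count_columns):
    """Return {row: degree} for a table stored as {row: {col: weight}}.

    count_columns=True counts the non-empty columns in each row;
    count_columns=False sums the weights in each row.
    """
    if count_columns:
        return {row: len(cols) for row, cols in adj.items()}
    return {row: sum(cols.values()) for row, cols in adj.items()}


# Hypothetical weighted user-user graph.
user_graph = {
    "user|ann": {"user|bob": 3.0, "user|cy": 1.0},
    "user|bob": {"user|ann": 3.0},
}
print(degree_table(user_graph, True))    # {'user|ann': 2, 'user|bob': 1}
print(degree_table(user_graph, False))   # {'user|ann': 4.0, 'user|bob': 3.0}
```

Counting columns gives the number of distinct neighbours, which is the natural quantity to compare against a minimum and maximum degree.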

Now we can run BFS. I picked three users mostly at random as our starting vertices.

In [33]:
users = 'user|smilevvsmilevv,user|OrenTsur,user|viiocious,';

%% Run Breadth First Search on Adjacency Schema
k=2; % Number of steps
minDegree=5;
maxDegree=20;
v0 = users;
Atable=[myName 'user_graph'];

% Set results table
Rtable=[Atable '_BFS'];
TadjBFS = DB(Rtable);
if nnz(TadjBFS)
    deleteForce(TadjBFS);
    TadjBFS = DB(Rtable);
end

% Other BFS Params
RtableTranspose=[Rtable 'T'];
ADegtable=[Atable '_Deg'];
degColumn='';
degInColQ=false;

% Do BFS
v = G.AdjBFS(Atable, v0, k, Rtable, RtableTranspose, ADegtable, degColumn, degInColQ, minDegree, maxDegree);

% Vertices reached in k hops from specified users
disp([char(10) 'Vertices reached in ' num2str(k) ' hops from ' strrep(v0(1:end-1),char(10),', ') ':'])
strrep(v(1:end-1),char(10),', ')

Creating my_tweets_user_graph_BFS in class-db01.cloud.llgrid.txe1.mit.edu:2181 Accumulo
15 Oct 2018 15:16:14,772 DEBUG - Graphulo.OneTable(987) -  user|viiocious :%00; [3] -> 33 entries processed
15 Oct 2018 15:16:14,905 DEBUG - Graphulo.OneTable(987) -  user|viiocious :%00; [24] -> 315 entries processed

Vertices reached in 2 hops from user|smilevvsmilevv,user|OrenTsur,user|viiocious:
ans = user|PRINCESS_mony9,user|talhi20,user|ghida0,user|SuzanneTee0217,user|Indraputr4,user|dianaraflata_,user|aisyyahs,user|an_NoY_nr,user|viiocious,user|DudnikD,user|franchiaraihc,user|ZELO96_BTHB,user|StasAntonov,user|SueChua,user|OrenTsur,user|refa_mn_,user|tungnail_hee,user|Vanya_Cherevko,user|NottinghamTIC,user|WulaaanWP,user|deadinyati,user|rachmawatidewi,user|Canniagoyessie,user|SAIF_HA,user|3lowshal7aeyaii,user|mymemoly,user|akhbarhurra,user|SalmaAlQibti,user|Mmmjj55,user|iina_ona,user|IndryOktafiany,user|PalamarVeronika,user|Nakkarin_P,user|t_8man,user|no_kawaii,user|noOpeXky,user|mo7ammed____,

How do your results change if you change the min and max degrees?

## Jaccard Index

Our user-user graph says a bit about how similar users are by counting how many hashtags they have in common. The Jaccard index is another similarity metric for two users, scaled by their overall popularity, that is, by how many other users they are similar to.

We can run Jaccard on the entire user-user graph in the database using Graphulo.



In [34]:
% Set Params
Atable=[myName 'user_graph'];
ADegtable=[Atable '_Deg'];
Rfinal=[Atable '_Jaccard'];
filterRowCol=[];
Aauthorizations=[];
RNewVisibility=[];

% Set up Results Table
TadjJaccard = DB(Rfinal);
if nnz(TadjJaccard)
    deleteForce(TadjJaccard);
    TadjJaccard = DB(Rfinal);
end

% Do Jaccard
tic;
G.Jaccard(Atable, ADegtable, Rfinal, filterRowCol, Aauthorizations, RNewVisibility);
toc;

% Jaccard Coefficients
disp(char(10))
disp('Some Jaccard Coefficients')
userDeg=DB(ADegtable);
users = str2num(userDeg(:,:));
u = Row((users > 5) < 20);
J = str2num(TadjJaccard(u,:));
displayFull(NewSep(J(:,ceil(rand(3,1).*size(J,2))),','))

Creating my_tweets_user_graph_Jaccard in class-db01.cloud.llgrid.txe1.mit.edu:2181 Accumulo
15 Oct 2018 15:22:36,213 DEBUG - Graphulo.OneTable(987) -   :%00; [1] -> 121104 entries processed
15 Oct 2018 15:22:36,214 DEBUG - Graphulo.Jaccard(3429) -  Jaccard #partial products 121104
Elapsed time is 0.459159 seconds.


Some Jaccard Coefficients
                     user|Amie99_,user|Lulu_ltfh,user|Restikhrnsa_,
user|12N_nadiya2,    0.016129,    0.016129,      0.016129,         
user|DebbyAngrainiHD,             0.016129,      0.016129,         
user|Indraputr4,                  0.030303,      0.030303,         
user|IndryOktafiany,              0.030303,      0.030303,         
user|NadiaSaftiyan,                              0.030303,         
user|NikeDcn4,                                   0.016129,         


Try running Jaccard on the word-word graph. Do you find anything interesting?

## Topic Modeling

Those users we found in our BFS example earlier, what do they talk about? We can run NMF to do some topic modeling.

Since NMF runs on an Incidence matrix, we need to first filter our original Edge table down to the tweets written by our users of interest. The first step to doing this is running BFS on the incidence matrix. This will give us the rows of Tedge that correspond to those users. Then we can filter out just the words by using the Graphulo OneTable function.

First we run BFS (unfortunately calling BFS on the EdgeTable is a little messy):

In [35]:
%% Run Breadth First Search on Incidence Schema
k=1; % Number of steps
minDegree=0;
maxDegree=javaObject("java.lang.Integer",0).MAX_VALUE;
v0 = v;
Etable=[myName 'Tedge'];

% Set results table
Rtable=[Etable '_BFS'];
TadjBFS = DB(Rtable);
if nnz(TadjBFS)
    deleteForce(TadjBFS);
    TadjBFS = DB(Rtable);
end

% Other BFS Params
RTtable=[Rtable 'T'];
ETDegtable=[Etable 'Deg'];
startPrefixes=',';
endPrefixes=',';
degColumn='';
degInColQ=false;
plusOp=[];
EScanIteratorPriority=-1;
Eauthorizations=[];
EDegauthorizations=[];
newVisibility=[];
useNewTimestamp=true;
outputUnion=true;
numEntriesWritten=[];

% Do BFS
vGraphulo = G.EdgeBFS(Etable, v0, k, Rtable, RTtable, startPrefixes, endPrefixes,...
    ETDegtable, degColumn, degInColQ, minDegree, maxDegree,...
    plusOp, EScanIteratorPriority, Eauthorizations, EDegauthorizations, newVisibility,...
    useNewTimestamp, outputUnion, numEntriesWritten);

Creating my_tweets_Tedge_BFS in class-db01.cloud.llgrid.txe1.mit.edu:2181 Accumulo
15 Oct 2018 15:23:44,740 DEBUG - Graphulo.EdgeBFS(1445) -  fetchColumn :	
15 Oct 2018 15:23:44,741 DEBUG - EdgeBFSReducer.parseOptions(34) -  inColumnPrefixes: ,


Then we can filter out the words:

In [36]:
% Filter to just words
Atable = [Etable '_BFS'];
Rtable = [myName 'Tedge_filtered'];
RTtable = [myName 'Tedge_filteredT'];
clientResultMap = [];
AScanIteratorPriority = -1;
reducer = [];
reducerOpts = [];
plusOp = [];
rowFilter = '';
colFilter = realwords;
midIterator = [];
bs = [];
authorizations = [];

G.OneTable(Atable, Rtable, RTtable, clientResultMap, AScanIteratorPriority, reducer,...
    reducerOpts, plusOp, rowFilter, colFilter, midIterator, bs, authorizations)

15 Oct 2018 15:24:09,908 DEBUG - Graphulo.OneTable(987) -   :%00; [1] -> 274 entries processed
ans =  274


Finally we can run NMF:

In [37]:
%% NMF on Incidence/Edge Schema
% Note: takes some time to run

% Set results table
tname_W=[myName 'word' '_NMF_W'];
TedgeNMF_W = DB(tname_W,[tname_W 'T']);
if nnz(TedgeNMF_W)
    deleteForce(TedgeNMF_W);
    TedgeNMF_W = DB(tname_W,[tname_W 'T']);
end
tname_H=[myName 'word' '_NMF_H'];
TedgeNMF_H = DB(tname_H,[tname_H 'T']);
if nnz(TedgeNMF_H)
    deleteForce(TedgeNMF_H);
    TedgeNMF_H = DB(tname_H,[tname_H 'T']);
end

% Set Params
Aorig=[myName 'Tedge_filtered'];
ATorig=[myName 'Tedge_filteredT'];
Wfinal= tname_W;
WTfinal= [tname_W 'T'];
Hfinal= tname_H;
HTfinal= [tname_H 'T'];
k=3; % Number of topics
maxiter=10; % Maximum number of iterations
forceDelete=true;
cutoffThreshold=0;
maxColsPerTopic=0;

% Do NMF
G.NMF(Aorig, ATorig, Wfinal, WTfinal, Hfinal, HTfinal, k, maxiter,...
    forceDelete, cutoffThreshold, maxColsPerTopic);
    
% NMF Results
disp('Edge Assignments (W)')
W = Abs0(TopColPerRow(str2num(TedgeNMF_W(:,:)),1));

disp('Vertex Assignments (H)')
H = Abs0(TopColPerRow(str2num(TedgeNMF_H(:,:)),1));

Creating my_tweets_word_NMF_W in class-db01.cloud.llgrid.txe1.mit.edu:2181 Accumulo
Creating my_tweets_word_NMF_WT in class-db01.cloud.llgrid.txe1.mit.edu:2181 Accumulo
Creating my_tweets_word_NMF_H in class-db01.cloud.llgrid.txe1.mit.edu:2181 Accumulo
Creating my_tweets_word_NMF_HT in class-db01.cloud.llgrid.txe1.mit.edu:2181 Accumulo
15 Oct 2018 15:24:21,218 DEBUG - Graphulo.OneTable(987) -   :%00; [1] -> 93 entries processed
15 Oct 2018 15:24:21,406 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 9 entries processed
15 Oct 2018 15:24:22,345 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 468 entries processed
15 Oct 2018 15:24:22,729 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 468 entries processed
15 Oct 2018 15:24:23,956 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 9 entries processed
15 Oct 2018 15:24:24,837 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 76 entries processed
15 Oct 2018 15:24:25,133 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 93 entries processed
15 O

15 Oct 2018 15:25:31,146 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 330 entries processed
15 Oct 2018 15:25:31,510 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 468 entries processed
15 Oct 2018 15:25:34,202 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 9 entries processed
15 Oct 2018 15:25:35,102 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 57 entries processed
15 Oct 2018 15:25:35,409 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 93 entries processed
15 Oct 2018 15:25:35,935 DEBUG - Graphulo.OneTable(987) -   :%00; [1] -> 216 entries processed
15 Oct 2018 15:25:35,955 DEBUG - Graphulo.TwoTable(770) -   :%00; [1] -> 19 entries processed
15 Oct 2018 15:25:35,956 DEBUG - Graphulo.NMF(3791) -  NMF Iteration 10 to my_tweets_word_NMF_H: hdiff 0.08796296296296297
15 Oct 2018 15:25:35,956 DEBUG - Graphulo.NMF(3797) -  EVEN Hfinal is my_tweets_word_NMF_H
15 Oct 2018 15:25:35,956 DEBUG - Graphulo.NMF(3798) -  EVEN Hprev is my_tweets_Tedge_filtered_NMF_Hprev
Edge Assignments (W)
V

Here is the full text of the tweets that were grouped into topics. The first topic in particular is about a presidential nomination.

In [38]:
disp('Topic 1:')
TedgeTxt(Row(W(:,1)),:)
disp('')
disp('Topic 2:')
TedgeTxt(Row(W(:,2)),:)
disp('')
disp('Topic 3:')
TedgeTxt(Row(W(:,3)),:)

Topic 1:
(004030149240881733,Text)     Apaan? NO!!!!"@JawabJUJUR: Apa pendapatmu ttg kabar pencalonan eyang subur sbg calon presiden 2014? #JJ | @diahpuspa"
(006522994496781733,Text)     Haha itu lucuRT@JawabJUJUR: Apa pendapatmu ttg kabar pencalonan eyang subur sbg calon presiden 2014? #JJ | @diahpuspa
(422083534270881733,Text)     sinting RT @JawabJUJUR: Apa pendapatmu ttg kabar pencalonan eyang subur sbg calon presiden 2014? #JJ | @diahpuspa
(589104590429781733,Text)     maymoop ituv"@JawabJUJUR: Apa pendapatmu ttg kabar pencalonan eyang subur sbg calon presiden 2014? #JJ | @diahpuspa"
(677304766330881733,Text)     Ga punya malu "@JawabJUJUR: Apa pendapatmu ttg kabar pencalonan eyang subur sbg calon presiden 2014? #JJ | @diahpuspa"
(739531693730881733,Text)     gak bnget @JawabJUJUR Apa pendapatmu ttg kabar pencalonan eyang subur sbg calon presiden 2014? #JJ | @diahpuspa
(804184452207781733,Text)     Kurang kerjaan pisan "@JawabJUJUR: Apa pendapatmu ttg kabar pencalonan eyang subur 

Here are a few of the top words in each topic:

In [39]:
disp('Topic 1:')
str2num(TedgeNMF_H('1,',:))> 0.15
disp('')
disp('Topic 2:')
str2num(TedgeNMF_H('2,',:))> 0.15
disp('')
disp('Topic 3:')
str2num(TedgeNMF_H('3,',:))> 0.15

Topic 1:
(1,word_lower|apa)     0.27405
(1,word_lower|calon)     0.27405
(1,word_lower|eyang)     0.27405
(1,word_lower|kabar)     0.27405
(1,word_lower|pencalonan)     0.27405
(1,word_lower|pendapatmu)     0.27405
(1,word_lower|presiden)     0.27405
(1,word_lower|sbg)     0.27405
(1,word_lower|subur)     0.27405
(1,word_lower|ttg)     0.27405

Topic 2:
(2,word_lower|gak?)     0.19025
(2,word_lower|itu)     0.31736
(2,word_lower|kamu)     0.19025
(2,word_lower|menurut)     0.19025
(2,word_lower|orang)     0.19025
(2,word_lower|pacar)     0.19025
(2,word_lower|rt)     0.30501
(2,word_lower|salah)     0.31112
(2,word_lower|sama)     0.24648
(2,word_lower|suka)     0.18494

Topic 3:
(3,word_lower|bekerja)     0.17198
(3,word_lower|dalam)     0.17198
(3,word_lower|dan)     0.17198
(3,word_lower|disiplin)     0.17198
(3,word_lower|kebohongan)     0.17198
(3,word_lower|pemimpin)     0.17198
(3,word_lower|pribadi)     0.17198
(3,word_lower|sama)     0.16899
(3,word_lower|seorang)     0.17198


It's a bit hard to see if this makes any sense, since it's not in English. Using Google Translate or similar, we can see that at least the first two could make sense:

<img src="images/topic1.png" alt="Drawing" style="width: 700px;"/>
<img src="images/topic2.png" alt="Drawing" style="width: 700px;"/>
<img src="images/topic3.png" alt="Drawing" style="width: 700px;"/>

Topic modeling tends to be fairly sensitive to the number of topics. Try varying k above and see how the results change. You can use Google Translate to see what the tweets are saying, to some degree. You can also try varying the maximum number of iterations.

# Deleting your Tables

If you want to start over, you can run this to delete your tables. Because it calls deleteForce, every table whose name contains your prefix is removed without a confirmation prompt.

In [20]:
tbls = strsplit(ls(DB),' ');

for i = 1:length(tbls)
    if strfind(tbls{i},myName)
        deleteForce(DB(tbls{i}));
    end
end