In [None]:
using D4M

# Intro

## Assoc Intro

### AI1 Setup

This section demonstrates basic associative array construction. An associative array is built from triples (row, column, value). Each argument can be a single string that holds a list of substrings separated by a one-character delimiter.

Note that the delimiter is taken from the last character of the string.

In [None]:
row = "a,a,a,a,a,a,a,aa,aaa,b,bb,bbb,a,aa,aaa,b,bb,bbb,"
column = "a,aa,aaa,b,bb,bbb,a,a,a,a,a,a,a,aa,aaa,b,bb,bbb,"
values = "a-a,a-aa,a-aaa,a-b,a-bb,a-bbb,a-a,aa-a,aaa-a,b-a,bb-a,bbb-a,a-a,aa-aa,aaa-aaa,b-b,bb-bb,bbb-bbb,"

# Create assoc array and list triples.
A = Assoc(row,column,values)
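
To make the delimiter convention concrete, here is a minimal sketch in plain Julia 1.x that does not use D4M. The helper `split_list` and the `Dict` representation are hypothetical illustrations, not how `Assoc` is implemented, and the overwrite-on-duplicate behaviour is an assumption of this sketch only.

```julia
# Split a delimited list string; the last character is the delimiter.
# `split_list` is a hypothetical helper, not a D4M function.
split_list(s::AbstractString) = String.(split(chop(s), last(s)))

rows = split_list("a,a,b,")
cols = split_list("a,aa,b,")
vals = split_list("a-a,a-aa,b-b,")

# Store the triples in a Dict keyed by (row, column).
# Assumption: a later triple with the same key overwrites an earlier one.
triples = Dict(zip(zip(rows, cols), vals))
```

Because the delimiter is read from the string itself, `split_list("x|y|")` splits on `|` without any extra argument.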

This is the data structure of the Associative Array Class:

In [None]:
dump(A)

The printFull function prints the array in tabular form.

In [None]:
printFull(A)

When written to a CSV file, the data is stored in the same tabular form.

In [None]:
WriteCSV(A,"A.csv")

### AI2 Subsref

This section demonstrates sub-referencing (indexing) of associative arrays.

In [None]:
A = ReadCSV("A.csv")

Get rows a and b.

In [None]:
A1r = A["a,b,",:]

print(A1r)

Get rows containing a and columns 1 thru 3 (not yet supported).

In [None]:
#A2r = A["a *,",1:3]

#print(A2r)

Get rows a to b.

In [None]:
A3r = A["a,:,b,",:]

print(A3r)

Get rows starting with a or c.

In [None]:
A4r = A[StartsWith("a,c,"),:]

print(A4r)
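
The range and prefix queries above can be pictured as filters over the sorted key list. The following is a minimal sketch in plain Julia 1.x without D4M; it assumes lexicographic ordering and inclusive range endpoints, and the variable names are illustrative only.

```julia
keys_sorted = ["a", "aa", "aaa", "b", "bb", "bbb"]

# Range query, like "a,:,b,": keep keys between the two endpoints (inclusive).
in_range = filter(k -> "a" <= k <= "b", keys_sorted)

# Prefix query, like StartsWith("a,c,"): keep keys beginning with any prefix.
prefixes = ["a", "c"]
with_prefix = filter(k -> any(p -> startswith(k, p), prefixes), keys_sorted)
```

Here no key starts with `c`, so the prefix query returns only the keys that start with `a`.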

Get cols a and b.

In [None]:
A1c = A[:,"a,b,"]

print(A1c)

Get rows 1 thru 3 and cols containing a (not yet supported).

In [None]:
#A2c = A[1:3,"a *,"]

#print(A2c)

Get cols a to b.

In [None]:
A3c = A[:,"a,:,b,"]

print(A3c)

Get cols starting with a or c.

In [None]:
A4c = A[:,StartsWith("a,c,")]

print(A4c)

Get all values less than b (not yet supported for string values).

In [None]:
#A1v = (A < "b,")

#print(A1v)

### AI3 Math

This section demonstrates some of the mathematical operations on associative arrays.

In [None]:
A = ReadCSV("A.csv")
A = logical(A)

printFull(A)

We can sum down rows and across columns.

In [None]:
printFull(sum(A,1))
printFull(sum(A,2))
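
For comparison, the same two reductions on an ordinary sparse matrix look like this in plain Julia 1.x. This is a sketch on a made-up 3×2 matrix; it uses the `dims` keyword of current Julia rather than the positional form used with the associative array above.

```julia
using SparseArrays

# Rows 1..3, columns 1..2; every stored entry is 1.
M = sparse([1, 1, 2, 3], [1, 2, 2, 2], [1, 1, 1, 1], 3, 2)

col_totals = sum(M, dims=1)   # sum down the rows: one total per column (1×2)
row_totals = sum(M, dims=2)   # sum across the columns: one total per row (3×1)
```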

Compute a simple join.

In [None]:
Aa = A[:,"a,"]
Ab = A[:,"b,"]
Aab = nocol(Aa) & nocol(Ab)

printFull(Aab)

Compute a histogram (facets) and normalized histogram of other columns that are in rows with both a and b.

In [None]:
F =  ( Aab )' * A
printFull(F)

Fn = F ./ sum(A,1)
printFull(Fn)
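
The facet computation can be sketched on a small dense 0/1 matrix in plain Julia 1.x. The matrix and column meanings below are made up for illustration (columns stand for a, b and c), and the code mirrors the idea of the cell above rather than D4M's implementation.

```julia
# Rows are records; columns are the keys a, b, c.
M = [1 1 0;
     1 0 1;
     0 1 1;
     1 1 1]

# Indicator of rows that contain both a (column 1) and b (column 2).
both = Int.((M[:, 1] .> 0) .& (M[:, 2] .> 0))

# Histogram: how often each column occurs within those rows.
F = M' * both

# Normalized histogram: divide by the overall column totals.
Fn = F ./ vec(sum(M, dims=1))
```

Rows 1 and 4 contain both a and b, so c appears once in the selected rows out of three occurrences overall.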

Compute correlation.

In [None]:
AtA = sqIn(A)
d = diag(adj(AtA))
AtA = putAdj(AtA,adj(AtA) - sparse(diagm(d)))
printFull(AtA)
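
A dense analogue of this correlation step, written in plain Julia 1.x without D4M, is sketched below. It assumes the correlation is the product of the transposed matrix with itself, with the diagonal (self-correlation) removed; the example matrix is made up.

```julia
using LinearAlgebra

M = [1 1 0;
     1 0 1;
     0 1 1;
     1 1 1]

C = M' * M                          # column-by-column co-occurrence counts
C_offdiag = C - Diagonal(diag(C))   # remove the self-correlation on the diagonal
```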

### AI4 Advanced Construction

Mixed string and numeric associative arrays.

In [None]:
# String  vectors *must* be ROW vectors
iStr =  "01,02,03,04,21,22,23,24,41,51,61,62,63,64,"

# Numeric vectors *must* be COLUMN vectors
iNum =  [ 1  1  1  1  4  3  2  1 4  5  6  6  6  6 ]'

Mixed type empty arrays.

In [None]:
# All empty.
A00 = Assoc("","","")

# Empty row.
A01 = Assoc("",iStr,iNum)

# Empty value.
A02 = Assoc(iNum,iStr,"")

# Empty column and value.
A03 = Assoc(iNum,[],"")

# All empty.
A04 = Assoc("",[],[])

# All empty.
A05 = Assoc("","",[])

# Empty row and value.
A06 = Assoc("",iNum,[])


Mixed type non-empty arrays. (Commented out cells are not yet supported)

String scalar,  string vector,  numeric vector.

In [None]:
# A11 = Assoc("a,",iStr,iNum)

# printFull(A11)

Numeric vector, string vector,  string scalar.

In [None]:
#A12 = Assoc(iNum,iStr,"a,")

# printFull(A12)

Numeric vector, numeric scalar, string scalar.

In [None]:
#A13 = Assoc(iNum,1,"a,")

# printFull(A13)

String scalar,  numeric scalar, numeric scalar.

In [None]:
A14 = Assoc("a,",1,1)

printFull(A14)

String scalar,  string scalar,  numeric scalar.

In [None]:
A15 = Assoc("a,","a,",1)

printFull(A15)

String scalar,  numeric vector, numeric scalar.

In [None]:
#A16 = Assoc("a,",iNum,1)

# printFull(A16)

## Edge Art

### EA1 Graph

Form adjacency graphs.

Read CSV file into associative array. Get vertices and convert to numbers.

In [None]:
E = ReadCSV("Edge.csv")

Ev = logical( E[:, StartsWith("V,")] )
printFull(Ev)

Compute vertex adjacency graph.

In [None]:
Av = sqIn(Ev)
printFull(Av)

Compute edge adjacency graph.

In [None]:
Ae = sqOut(Ev)
printFull(Ae)
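
The relationship between an edge-by-vertex incidence matrix and the two adjacency graphs can be sketched with ordinary matrices in plain Julia 1.x. This assumes the vertex adjacency is the transposed incidence matrix times itself and the edge adjacency is the product in the other order; the two-edge graph is made up.

```julia
# Incidence matrix: rows are edges, columns are vertices.
# Edge 1 touches vertices 1 and 2; edge 2 touches vertices 2 and 3.
Ev = [1 1 0;
      0 1 1]

Av = Ev' * Ev   # vertex-by-vertex: entry (i, j) counts edges shared by vertices i and j
Ae = Ev * Ev'   # edge-by-edge: entry (i, j) counts vertices shared by edges i and j
```

The off-diagonal 1 in `Ae` records that the two edges meet at vertex 2.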

### EA2 Subsref

Show different ways to index associative arrays.

Read CSV file into associative array.

In [None]:
E = ReadCSV("Edge.csv");
printFull(E);

Get orange edges.

In [None]:
Eo = E[(E[:,"Color,"] == "Orange" ).row,:];
printFull(Eo);

Get orange and green edges.

In [None]:
Eog = E[ StartsWith("O,G,") ,:];
printFull(Eog);

### EA3 SubGraph

Show some associative array math.

Read CSV file into associative array, get vertices and convert to numbers.

In [None]:
E = ReadCSV("Edge.csv");
Ev = logical( E[:, StartsWith("V,")] );

Get orange and green edges.

In [None]:
EvO = Ev[StartsWith("O,"),:];
EvG = Ev[StartsWith("G,"),:];

Compute (empty) vertex adjacency graph.

In [None]:
AvOG = transpose(EvO) * EvG;
printFull(AvOG);

Compute edge adjacency graph.

In [None]:
AeOG = EvO * transpose(EvG)
printFull(AeOG)

Compute edge adjacency graph preserving keys.

In [None]:
AeOG = CatKeyMul(EvO,transpose(EvG))
printFull(AeOG)
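
One way to picture a key-preserving multiply is that each result entry lists the inner keys that matched, instead of counting them. The sketch below is plain Julia 1.x and is only an illustration of that idea: the function `cat_key_mul_sketch`, the `Dict` representation, and the `;` separator are all assumptions, not the D4M implementation.

```julia
# Hypothetical sketch: A and B map (row, col) => 1.
# Result (i, j) lists every inner key k with both A[i, k] and B[k, j] present.
function cat_key_mul_sketch(A::Dict{Tuple{String,String},Int},
                            B::Dict{Tuple{String,String},Int})
    C = Dict{Tuple{String,String},Vector{String}}()
    for (i, k) in keys(A), (k2, j) in keys(B)
        k == k2 || continue
        push!(get!(C, (i, j), String[]), k)
    end
    # Sort each key list so the output does not depend on Dict iteration order.
    return Dict(ij => join(sort(ks), ";") * ";" for (ij, ks) in C)
end

EvO  = Dict(("O1", "V1") => 1, ("O1", "V2") => 1)
EvGt = Dict(("V1", "G1") => 1, ("V2", "G1") => 1, ("V3", "G2") => 1)
AeOG_sketch = cat_key_mul_sketch(EvO, EvGt)
```

In this made-up example, edges O1 and G1 share vertices V1 and V2, so the single result entry carries both vertex names.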

# Apps

## Entity Analysis

### EA1 Read

Read entity data and organize into sparse associative array.

Entity data are derived summaries obtained from automated
entity extraction algorithms applied to <1% of the NIST Reuters Corpus.
See: http://trec.nist.gov/data/reuters/reuters.html

In [None]:
using PyPlot,JLD2

file_dir = Pkg.dir("D4M")*"/examples/2Apps/1EntityAnalysis/Entity.csv"
save_dir = "Entity.jld"

E_raw = ReadCSV(file_dir)
printFull(E_raw[1:5,:])

Organize data into new format.

In [None]:
row,col,doc_val      = find(E_raw[:,"doc,"])
row,col,entity_val   = find(E_raw[:,"entity,"])
row,col,position_val = find(E_raw[:,"position,"])
row,col,type_val     = find(E_raw[:,"type,"])

typeEntity_val = CatStr(type_val, "/" , entity_val);
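
The effect of joining the type and entity values with a separator can be sketched on plain Julia vectors using broadcasting. This is an analogue in plain Julia 1.x, not the `CatStr` implementation, and the sample values are chosen for illustration.

```julia
type_val   = ["PERSON", "LOCATION"]
entity_val = ["michael chang", "new york"]

# Element-wise concatenation with "/" between the two parts.
type_entity = type_val .* "/" .* entity_val
```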

Create a sparse associative array of all the data.

In [None]:
E = Assoc(doc_val,typeEntity_val,position_val)

Show a few rows and plot a spy plot.

In [None]:
print(E[1:5,:])
spy(E[1:1000,:]')

Save associative array.

In [None]:
save(save_dir,"E",E)

### EA2 Statistics

Compute statistics on entity data.

In [None]:
using JLD2, PyPlot

E = load("./Entity.jld")["E"];

Calculate number of entities in each category, then count the number of times each entity occurs.

In [None]:
print(sum(logical(col2type(E,"/")),1))
En = sum(logical(E),1)

Make a log-log plot of location frequencies. Notice the power-law distribution.

In [None]:
row,entity,count = find(En)
An = Assoc(count,entity,1)

loglog(full(sum(adj(An[:,StartsWith("LOCATION/,")]),2)) ,"o")
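
The quantity being plotted is a histogram of counts: for each occurrence count, how many entities have that count. A minimal sketch in plain Julia 1.x, on made-up counts and without D4M or plotting, is:

```julia
# Number of occurrences of each entity (made-up data).
counts = [1, 1, 2, 5, 1, 2]

# hist[c] = number of entities that occur exactly c times.
hist = Dict{Int,Int}()
for c in counts
    hist[c] = get(hist, c, 0) + 1
end

xs = sort(collect(keys(hist)))   # distinct occurrence counts
ys = [hist[x] for x in xs]       # how many entities have each count
```

Plotting `ys` against `xs` on log-log axes is what reveals a power-law shape on real data.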

### EA3 Facet

Entity facet search. Shows next most common terms.

In [None]:
using JLD2

E = load("./Entity.jld")["E"]
E = logical(E);

Facet search: find entities that commonly co-occur with LOCATION/new york and PERSON/michael chang.

In [None]:
x = "LOCATION/new york,"
p = "PERSON/michael chang,"
F = ( nocol(E[:,x]) & nocol(E[:,p]))' * E
print(F' > 1 )

Normalize the previous result.

In [None]:
Fn = F ./ sum(E,1)
print((Fn' > 0.02))

### EA4 Graph

Compute graphs from entity edge data.

In [None]:
using JLD2,PyPlot
E = load("./Entity.jld")["E"];

Computing adjacency matrix for the Entity-Entity graph.

In [None]:
Es = E
E = logical(E)
Ae = sqIn(E)

spy(Ae)

Compute entity-entity graph that preserves keys (documents).

In [None]:
# Limit to people with names starting with j
p = StartsWith("PERSON/j,");
Ep = E[:,p]

# Correlate while preserving keys
Ap = CatKeyMul(Ep',Ep)
spy(Ap)

Create document-document graph: documents that contain the same entities.

In [None]:
Ad = sqOut(Ep)
spy(Ad)

### EA5 Graph Query

Various ways to query subgraphs.

In [None]:
using JLD2,PyPlot

file_dir = "./Entity.jld"
E = load(file_dir)["E"]
E = logical(E);

Compute the entity-entity graph (all facet pairs).

In [None]:
A = sqIn(E)
d = diag(adj(A))
A = A - diag(A)

print(A[2,:])

Compute normalized correlation.

In [None]:
i,j,v = findnz(adj(A))
An = putAdj(A, sparse(i,j,v ./ min.(d[i],d[j])))

print(An[2,:])
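
The normalization divides each co-occurrence count by the smaller of the two entities' own counts. A sketch on a small made-up sparse matrix in plain Julia 1.x, without D4M, is shown below; the variable names with the `_s` suffix are illustrative.

```julia
using SparseArrays, LinearAlgebra

# Documents (rows) by entities (columns), made-up data.
E_s = sparse([1 1 0;
              1 1 1;
              1 0 1;
              0 1 0])

A_s = E_s' * E_s                          # entity-entity co-occurrence counts
d_s = Vector(diag(A_s))                   # how often each entity occurs
A0  = dropzeros!(A_s - spdiagm(0 => d_s)) # remove the diagonal

# Divide each off-diagonal count by the smaller of the two entity counts.
i, j, v = findnz(A0)
An_s = sparse(i, j, v ./ min.(d_s[i], d_s[j]), size(A0)...)
```

A value of 1.0 means the rarer entity of the pair never occurs without the other one.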

Multi-facet queries.

In [None]:
x = "LOCATION/new york,"
p = StartsWith("PERSON/,")
printFull( (A[p,x] > 4) & (An[p,x] > 0.3))

Find triangles.

In [None]:
p0 = "PERSON/john kennedy,"

p1 = row(A[p,p0] + A[p0,p])
spy(A[p1,p1])

p2 = row( A[p1,p1] - (A[p,p0]+ A[p0,p]))
print(A[p2,p2] > 1)

## Track Analysis

### TA1 Build

General approach to computing tracks from entity edge data.

In [None]:
using JLD2,PyPlot

E = load("Entity.jld")["E"]
E = logical(E);

Show general purpose method for building tracks.

In [None]:
p = StartsWith("PERSON/,")      # Set entity range.
t = StartsWith("TIME/,")        # Set time range.
x = StartsWith("LOCATION/,")    # Set spatial range.
a = StartsWith("PERSON/,TIME/,LOCATION/,");

Limit to edges with all three.

In [None]:
E3 = E[row( sum(E[:,p],2) & sum(E[:,t],2) & sum(E[:,x],2) ),a];
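
The row filter keeps only records that have at least one entry in each of the three column groups. A sketch on a small Boolean matrix in plain Julia 1.x, with made-up column names and without D4M, is:

```julia
cols = ["PERSON/a", "TIME/t1", "LOCATION/x", "LOCATION/y"]
M = Bool[1 1 1 0;
         1 0 1 1;
         0 1 1 0]

# Column indices for each prefix group.
group(prefix) = findall(c -> startswith(c, prefix), cols)
pc, tc, xc = group("PERSON/"), group("TIME/"), group("LOCATION/")

# Keep rows with at least one person, one time and one location.
keep = [r for r in axes(M, 1) if any(M[r, pc]) && any(M[r, tc]) && any(M[r, xc])]
E3_sketch = M[keep, vcat(pc, tc, xc)]
```

Row 2 has no time and row 3 has no person, so only row 1 survives.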

Collapse to get unique time and space for each edge and get triples.

In [None]:
edge,time  = find(  val2col(col2type(E3[:,t],"/"),"/")  )
edge,space = find(  val2col(col2type(E3[:,x],"/"),"/")  )

Etx = Assoc(edge,time,space)     # Combine edge, time and space.
Ext = Assoc(edge,space,time)     # Combine edge, space and time.

Construct time tracks with matrix multiply.

In [None]:
At = CatValMul(transpose(Etx),E3[:,p]) 
spy(At')
axis("auto")

Construct space tracks with matrix multiply.

In [None]:
Ax = CatValMul(transpose(Ext),E3[:,p]) 
spy(Ax')
axis("auto")

### TA2 Query

Compute tracks from entity edge data.

In [None]:
function findtracks(A,t,p,l);
#findtracks creates track associative array.

    # Find docs that have person
    DocIDwPer = row(A[:,p]);

    # Find docs that have person and location.
    DocIDwPerLoc = row(A[DocIDwPer,l]);

    # Find docs that have person, location and time.
    DocIDwPerLocTime = row(A[DocIDwPerLoc,t]);

    # Limit to these documents.
    AA = A[DocIDwPerLocTime,:];

    # Get person sub array.
    Aper = AA[DocIDwPerLocTime,p];
    TrackPer,DocAper = find(Aper');

    # Get location sub array.
    Aloc = AA[DocIDwPerLocTime,l];
    EntAloc,DocAloc = find(Aloc');

    # Get Single location per document- order by actors
    uLocDocs = unique(DocAloc)
    uLocDocIdx = [1; indexin(uLocDocs,DocAloc)[1:end-1]+1] # get first index of unique docs (correct for getting highest)
    uDocLocs = EntAloc[uLocDocIdx] # single locations per document
    uLocDocIdxinDocAper = indexin(DocAper,uLocDocs) # locations of unique loc docs in per doc list
    TrackLoc = uDocLocs[uLocDocIdxinDocAper]

    # Get time sub array.
    Atime = AA[DocIDwPerLocTime,t];
    EntAtime,DocAtime = find(Atime');

    # Get Single time per document- order by actors
    uTimeDocs = unique(DocAtime)
    uTimeDocIdx = [1; indexin(uTimeDocs,DocAtime)[1:end-1]+1] # get first index of unique docs (correct for getting highest)
    uDocTimes = EntAtime[uTimeDocIdx] # single times per document
    uTimeDocIdxinDocAper = indexin(DocAper,uTimeDocs) # locations of unique time docs in per doc list
    TrackTime = uDocTimes[uTimeDocIdxinDocAper]

    Tr = Assoc(TrackTime,TrackPer,TrackLoc);
                    
end
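
The idea behind the function above can be sketched without D4M: keep documents that mention a person, a time and a location, reduce each document to one time and one location, and record (time, person) => location. The plain Julia 1.x sketch below uses a hypothetical `find_tracks_sketch` and a `Dict` of documents; choosing the first time and location in sorted order is an assumption of this sketch, not necessarily what `findtracks` selects.

```julia
function find_tracks_sketch(docs::Dict{String,Vector{String}})
    tracks = Dict{Tuple{String,String},String}()
    pick(ents, prefix) = sort(filter(e -> startswith(e, prefix), ents))
    for id in sort(collect(keys(docs)))
        people = pick(docs[id], "PERSON/")
        times  = pick(docs[id], "TIME/")
        places = pick(docs[id], "LOCATION/")
        # Skip documents missing any of the three entity types.
        (isempty(people) || isempty(times) || isempty(places)) && continue
        for person in people
            # Assumption: one time and one location per document (first in sorted order).
            tracks[(first(times), person)] = first(places)
        end
    end
    return tracks
end

docs = Dict(
    "d1" => ["PERSON/michael chang", "TIME/1996-09-03", "LOCATION/austria"],
    "d2" => ["PERSON/michael chang", "TIME/1996-09-05", "LOCATION/new york", "LOCATION/austria"],
    "d3" => ["PERSON/javier sanchez", "LOCATION/spain"],   # no time, so it is dropped
)
tracks = find_tracks_sketch(docs)
```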

Load edge incidence matrix.

In [None]:
using JLD2

E = load("Entity.jld")["E"]
Es = E
A = logical(E);

Set prefixes and build entity tracks using findtracks function.

In [None]:
p = StartsWith("PERSON/,")
t = StartsWith("TIME/,")
x = StartsWith("LOCATION/,")

A = findtracks(E,t,p,x);

Track queries (Where have Michael Chang and Javier Sanchez been when?)

In [None]:
p1 = "PERSON/michael chang,"
p2 = "PERSON/javier sanchez,"
printFull( A[:,p1*p2] )

Track windows (Who was in Austria during this time?)

In [None]:
t = "TIME/1996-09-03,:,TIME/1996-09-06,"
x = "LOCATION/austria"
col(A[t,:] == x)
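
A track window query selects a time range and then keeps the people whose track value equals the requested location. The sketch below shows this in plain Julia 1.x on a made-up `Dict` of tracks, with hypothetical person names; it relies on ISO-formatted dates comparing correctly as strings.

```julia
# (time, person) => location, made-up data.
tracks_s = Dict(
    ("TIME/1996-09-03", "PERSON/a") => "LOCATION/austria",
    ("TIME/1996-09-05", "PERSON/b") => "LOCATION/austria",
    ("TIME/1996-09-05", "PERSON/c") => "LOCATION/spain",
    ("TIME/1996-09-09", "PERSON/d") => "LOCATION/austria",
)

lo, hi, place = "TIME/1996-09-03", "TIME/1996-09-06", "LOCATION/austria"

# People seen at `place` inside the inclusive time window.
who = sort([p for ((t, p), loc) in tracks_s if lo <= t <= hi && loc == place])
```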

### TA3 Graph

Compute a graph of locations from entity tracks.

In [None]:
function findtrackgraph(Atrack)
#findtrackgraph forms graph of locations from Track Associative array.

    # Find 1 hop and >1 hop tracks.
    AtrackHop = sum(Atrack,1);
    Hop1 = col(AtrackHop == 1);
    Hop2 = col(AtrackHop > 1); 

    # Get track list.  Naturally comes out sorted by p.
    t1,p1,x1 = find(Atrack[:,Hop1]);
    AtrackGraph1 = Assoc(x1,x1,1,(+));

    t2,p2,x2 = find(Atrack[:,Hop2]);

    # Create matrices and shifted matrices.
    p22 = circshift(p2,-1);
    x22 = circshift(x2,-1);

    # Find where p2 and p22 are the same.
    test = p2 .== p22;   
    x21 = x2[test];   
    x22 = x22[test];   

    AtrackGraph2 = Assoc(x21,x22,1);

    AtrackGraph = AtrackGraph1 + AtrackGraph2;

end
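
The shift trick used for multi-hop tracks can be sketched on plain vectors in plain Julia 1.x: shift the person and location lists by one, keep positions where the person is unchanged, and count the resulting location pairs. The data below is made up, and the sketch assumes the list is already sorted by person and then by time. Because `circshift` wraps around, the last entry is compared with the first, so the sketch also assumes the first and last person differ.

```julia
# One entry per track point, sorted by person and then by time.
p2 = ["A", "A", "A", "B", "B"]
x2 = ["x1", "x2", "x1", "x3", "x4"]

# Positions whose next entry belongs to the same person.
same = p2 .== circshift(p2, -1)

from = x2[same]                  # location at step n
to   = circshift(x2, -1)[same]   # location at step n + 1

# Count how often each location-to-location transition occurs.
G_sketch = Dict{Tuple{String,String},Int}()
for edge in zip(from, to)
    G_sketch[edge] = get(G_sketch, edge, 0) + 1
end
```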

In [None]:
using JLD2,PyPlot

# Load edge incidence matrix.
E = load("Entity.jld")["E"]
E = logical(E);

Build entity tracks with findtracks function.

In [None]:
p = StartsWith("PERSON/,")
t = StartsWith("TIME/,")
x = StartsWith("LOCATION/,")

A = findtracks(E,t,p,x);

Build track graph.

In [None]:
G = findtrackgraph(A)
print(G > 5)
spy(G)

Track graph pattern.

In [None]:
o = "ORGANIZATION/international monetary fund,"
p = StartsWith("PERSON/,")
Go = findtrackgraph(A[:,col(E[row(E[:,o]),p])])

print((Go > 2) & ((Go ./ G) > 0.2))