# Intro to D4M

Load the D4M module

In [None]:
using D4M, PyPlot.axis

In [None]:
using PyPlot

## Create, Display, Save an Associative Array

Create lists of row keys, column keys, and values as strings. Note: the last character of each string is the divider. It can be any character; common choices are ",", " ", tab, and newline.

In [None]:
row = "a,a,a,a,a,a,a,aa,aaa,b,bb,bbb,a,aa,aaa,b,bb,bbb,"
column = "a,aa,aaa,b,bb,bbb,a,a,a,a,a,a,a,aa,aaa,b,bb,bbb,";
values = "a-a,a-aa,a-aaa,a-b,a-bb,a-bbb,a-a,aa-a,aaa-a,b-a,bb-a,bbb-a,a-a,aa-aa,aaa-aaa,b-b,bb-bb,bbb-bbb,";
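
The divider convention can be illustrated without D4M. The following is a minimal sketch in plain Julia; `split_list` is a hypothetical helper written for this illustration and is not part of the D4M API.

```julia
# Hypothetical helper (not part of D4M): split a divider-terminated list.
# The last character of the string is taken as the divider.
function split_list(s::AbstractString)
    delim = s[end]                 # last character is the divider
    return split(chop(s), delim)   # drop the trailing divider, then split on it
end

keysComma = split_list("a,aa,aaa,")   # divider is ","
keysSpace = split_list("a aa aaa ")   # divider is " "
```

Both calls yield the same three keys, which is why the choice of divider is free as long as it terminates the string.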

Create an associative array, A, from row, column, and values.

In [None]:
A = Assoc(row,column,values)

Display the associative array in tabular form.

In [None]:
printFull(A)

Save the associative array to a CSV file.

In [None]:
WriteCSV(A,"data/A.csv");

## Read and Select Sub Associative Arrays

Read CSV file into an associative array.

In [None]:
A = ReadCSV("data/A.csv");

Select a subset of rows.

In [None]:
printFull(  A["a,b,",:]  );

Convert values to 0 and 1.

In [None]:
printFull(  logical(A["a,b,",:])  );

Select a subset of columns.

In [None]:
printFull(  A[:,"a,b,"]  );

Convert values to 0 and 1.

In [None]:
printFull(  logical(A[:,"a,b,"])  );

# Analyze Entities in News Articles

Load entities from 10,000 news articles and print the first few rows.

In [None]:
A = ReadCSV("data/entity.csv");

printFull(  A[1:5,:]  );

Show dimensions, number of entries, and density of A.

In [None]:
print( [size(A),nnz(A)] );

nnz(A)/(size(A)[1]*size(A)[2])

## Construct and Display a Sparse Associative Array of the Data

Grab the doc, entity, position, and type columns, and combine type and entity with a '|' separator.

In [None]:
row, col, doc      = find(A[:,"doc,"]);          # Get doc column.
row, col, entity   = find(A[:,"entity,"]);       # Get entity column.
row, col, position = find(A[:,"position,"]);     # Get position column.
row, col, rowType  = find(A[:,"type,"]);         # Get type column.
typeEntity = CatStr(rowType,"|",entity);         # Interleave type and entity strings.

In [None]:
typeEntity

Create a sparse associative array of all the data.

In [None]:
E = Assoc(doc,typeEntity,position);

Show a few rows.

In [None]:
print(E[1:2,:]) # the first two rows as an array

printFull(E[1:2,:]) # the entries of the first two rows, all written out

Display dimensions of data, number of non-zero entries, and density of E.

In [None]:
print( [size(E), nnz(E)]  );

nnz(E)/(size(E)[1]*size(E)[2])

Plot transpose of the sparse data.

In [None]:
spy(transpose(E[1:1000,:]));
axis("auto")

Create an adjacency matrix by multiplying E<sup>T</sup> * E.

In [None]:
E = logical(E)
spy(E'*E);
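
To see what the product computes, here is a small sketch on made-up data. It uses the `SparseArrays` standard library rather than D4M, so it assumes Julia 1.x; the names `toyE` and `toyAdj` and all values are illustrative only.

```julia
using SparseArrays

# Toy incidence matrix: 3 documents (rows) x 4 entities (columns).
# A 1 means the entity appears in the document. Values are made up.
toyRows = [1, 1, 2, 2, 3]
toyCols = [1, 2, 2, 3, 4]
toyE = sparse(toyRows, toyCols, ones(Int, 5), 3, 4)

# Entry (i, j) of the product counts the documents that contain both
# entity i and entity j; the diagonal counts documents per entity.
toyAdj = toyE' * toyE
```

Here entity 2 appears in documents 1 and 2, so `toyAdj[2, 2]` is 2, while entities 1 and 4 never share a document, so `toyAdj[1, 4]` is 0.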

## Analyze Relationships

Define relationships to examine.

In [None]:
l = "LOCATION|boston,";
P = StartsWith("PERSON|,");
L = StartsWith("LOCATION|,");

Show all people who appear in more than one news article that mentions Boston.

In [None]:
people = getcol(sum(E[getrow(E[:,l]),P],1)>1)

Show the locations that co-occur most often with the people found in Boston articles.

In [None]:
print(sum(  transpose(E[:,people]) * E[:,L]  ,1) > 15)

Combine the above into one line:

In [None]:
print(sum(  transpose( E[:,getcol(sum(E[getrow(E[:,l]),P],1)>1)] ) * E[:,L]  ,1) > 15)

Scale to multiple cities at once.

In [None]:
l = "LOCATION|boston,LOCATION|chicago,LOCATION|detroit,";
people = getcol(sum(E[getrow(E[:,l]),P],1)>1)
print(sum( transpose(E[:,people]) * E[:,L]  ,1) > 15)

Let's make a Location-Location graph:

In [None]:
Locs = E[:,L]'*E[:,L]
Locs = Locs - diag(Locs)

spy(Locs);

Which location pairs occur together the most?

In [None]:
print(Locs > 200)

# Analyze DNA Data

Define a function that reads sequences from a CSV file and splits each one into words of a fixed length.

In [None]:
function SplitSequenceCSV(CSVfile::String,DNAwordsize::Integer)

    A = ReadCSV(CSVfile)    # Read in file.
    r, c, v = find(A);      # Get row keys, column keys, and sequences.
    v = map(lowercase,v)    # Convert sequences to lower case.

    # Create the new column keys
    col=matchall.(Regex("(.{" * string(DNAwordsize) * "})") ,v)
    sizes = length.(col) # Save the lengths to create the row strings
    oneString=join(join.(col,"\n"),"\n")
    col = split(oneString,"\n")
    
    # Create the new row keys
    oneString = join(map(^,r.*"\n",sizes),"")
    newR = split(oneString[1:end-1],"\n")
    
    # Create the Associative Array
    A = Assoc(newR,col,1)
    
    return A
   
end
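
The word-splitting step can be sketched on a single sequence in plain Julia. `split_words` is a hypothetical helper for illustration, not part of D4M, and it assumes ASCII input such as DNA letters. Like the regular expression above, it produces consecutive, non-overlapping words and drops a trailing partial word.

```julia
# Hypothetical helper (not part of D4M): cut a sequence into consecutive,
# non-overlapping words of length k. A trailing partial word is dropped.
# Assumes ASCII input, so character positions equal byte indices.
function split_words(seq::AbstractString, k::Integer)
    s = lowercase(seq)
    return [s[i:i+k-1] for i in 1:k:(length(s) - k + 1)]
end

words = split_words("ACGTACGTAC", 4)   # 10 letters, word size 4
```

With a 10-letter sequence and a word size of 4, the result holds two words and the last two letters are discarded.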

Read bacteria reference DNA and palm sample DNA data into associative arrays.

In [None]:
DNAwordsize = 10;
Eref = SplitSequenceCSV("data/bacteria.csv",DNAwordsize);
Esamp = SplitSequenceCSV("data/palm.csv",DNAwordsize);

Perform BLAST DNA sequence analysis in one line of code to find the best bacteria matches.

In [None]:
bestMatches = sum( Eref * Esamp.' ,2) > 20;

print(bestMatches);

# Analyze Network Data

Read in 80,000 simulated network traffic logs from 1 day and print the first few rows.

In [None]:
A = ReadCSV("data/network.csv");

print(  A[1:5,:]  );

Make data sparse and show dimensions and number of entries.

In [None]:
E = val2col(A,"|");

display( [size(E) nnz(E)] )

print(E[1:5,:])

In [None]:
size(E[:,StartsWith("src|,")])

Select fields and time windows to explore.

In [None]:
S = StartsWith("src|,");         T1 = StartsWith("time|01:,");
D = StartsWith("dest|,");        T2 = StartsWith("time|05:,");

E1 = E[getrow(E[:,T1]),:];          # Data from time window 1.
E2 = E[getrow(E[:,T2]),:];          # Data from time window 2.

Create adjacency array of network traffic in each time window.

In [None]:
A1 = E1[:,S]' * E1[:,D];
A2 = E2[:,S]' * E2[:,D];

Find source/destination pairs that are common to both time windows.

In [None]:
print(A1 .* A2)
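
The element-wise product keeps only entries that are non-zero in both arrays, which is why it picks out the shared pairs. A minimal sketch on made-up data follows; it uses the `SparseArrays` standard library instead of D4M (so it assumes Julia 1.x), and the names `toyA1`, `toyA2`, and `toyCommon` are illustrative only.

```julia
using SparseArrays

# Toy source x destination traffic counts for two time windows (made up).
toyA1 = sparse([1, 1, 2], [1, 2, 3], [3, 1, 2], 2, 3)
toyA2 = sparse([1, 2, 2], [2, 3, 1], [4, 5, 1], 2, 3)

# Element-wise product: non-zero only where both windows saw traffic.
toyCommon = toyA1 .* toyA2
```

Only the pairs (1, 2) and (2, 3) carry traffic in both windows, so those are the only non-zero entries of `toyCommon`.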