<h2>Test of creating a paragraph byte map for Information Retrieval</h2>


Given a set of n properties of a document (for example, a list of subjects found across an entire document collection) and m documents, create an n x m matrix.  Each vector represents some subject or facet of interest, for instance [cats, dogs, birds, red_things, and so on], and each of these terms has its own index (cats is at 0, dogs at 1, and so on).  An algorithm or human judge reads each unit [a unit could be some part of a document, such as an abstract, a title, an individual paragraph, an entire section, or the entire document (or the equivalent, depending on the medium)] and assigns a binary value per facet (0 = this facet is not in this unit; 1 = this facet is in this unit).  The result looks something like:
0, 1, 0, 0, 0, 0, 1, 1, 1, 0, .... 
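As a minimal sketch of that step, the snippet below builds one 0/1 row per unit of text. The facet list, the toy documents, and the helper name `facet_vector` are all made up for illustration, and the "judge" here is just a naive whole-word match; it builds one row per document, so transpose it if you want one row per facet.

```python
# Hypothetical facets and toy documents; nothing here comes from a real collection.
FACETS = ["cats", "dogs", "birds", "red_things"]

def facet_vector(text, facets=FACETS):
    """Return a list of 0/1 flags, one per facet, for a single unit of text."""
    words = set(text.lower().split())
    return [1 if facet in words else 0 for facet in facets]

docs = [
    "cats chase birds",
    "dogs and cats",
    "red_things only",
]

# One row per document, one column per facet.
matrix = [facet_vector(d) for d in docs]
for row in matrix:
    print(row)
```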
<br />
Once the matrix is created, use it to find intersections, but let's also use it to winnow the data (that is, reduce the space of data to be inspected for inclusion).  Pretend we have millions of records coming in rapidly to be processed.  We need some technique to pre-process the data as fast as we can to reduce the set that is actually used for other analysis.  What's more, the ever-streaming data set must be available simultaneously for other reductions and analysis.
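One way to picture the winnowing is a bit mask applied to each incoming record. This is only a sketch: the record ids, the bit values, the mask, and the function name `winnow` are invented for the example, and each record's facets are assumed to be packed into a single Python int.

```python
def winnow(records, required_mask):
    """Yield only the (doc_id, bits) pairs that contain every facet in required_mask."""
    for doc_id, bits in records:
        if (bits & required_mask) == required_mask:
            yield doc_id, bits

# Toy stream: each record is (id, facet bits as an int). Bit 0 is the rightmost facet.
stream = [
    ("d1", 0b1011),
    ("d2", 0b0100),
    ("d3", 0b0011),
    ("d4", 0b1000),
]

mask = 0b0011  # keep records that carry both of the two lowest facets
kept = list(winnow(stream, mask))
print(kept)
```

Because `winnow` is a generator, it never holds the whole stream in memory, which leaves the original records free for other reductions.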
<br />
Write a script and time several runs.
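A possible starting point for the timing, using the standard library's `timeit`. The record count, the 45-bit width, and the mask are arbitrary choices for this sketch, and the records are random rather than real data.

```python
import random
import timeit

random.seed(0)  # fixed seed so repeated runs see the same data
N_DOCS, N_FACETS = 20_000, 45
records = [random.getrandbits(N_FACETS) for _ in range(N_DOCS)]
mask = 0b101  # hypothetical facets of interest

def count_matches():
    """Count the records that contain every facet in the mask."""
    return sum(1 for bits in records if (bits & mask) == mask)

# Three runs of five passes each; each entry is the total time for one run.
timings = timeit.repeat(count_matches, number=5, repeat=3)
print(["%.4f s" % t for t in timings])
```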

In [3]:
def bin_add(*args): return bin(sum(int(x, 2) for x in args))[2:]

In [14]:
doc1 = '010010101101010100110101010101010100111111111'
doc2 = '000000000000000000001000000000000000000010001'
doc3 = '110101010101111111000000000000000000000000000'

bin_add(doc1)

'10010101101010100110101010101010100111111111'

In [24]:
print(int('010010101101010100110101010101010100111111111',2))
print(int('100000000000000000001000000000000000000010001',2))
print(int('110101010101111111000000000000000000000000000',2))
print(int('010101010101111111000000000000000000000000000',2))
print(int('00000000',2))
print(int('00000001',2))
print(int('00000010',2))
print(int('00000011',2))
print(int('00000100',2))
print(int('00000101',2))

10284947909119
17592202821649
29325902479360
11733716434944
0
1
2
3
4
5
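Note that `bin_add` does arithmetic addition, so carries spill into neighbouring facets. For intersections and unions, the bitwise operators on the integer form may be closer to what we want. A small sketch using the same three strings (the helper names `to_int` and `to_bits` are made up here):

```python
doc1 = '010010101101010100110101010101010100111111111'
doc2 = '000000000000000000001000000000000000000010001'
doc3 = '110101010101111111000000000000000000000000000'

def to_int(bits):
    """Parse a string of 0s and 1s as a base-2 integer."""
    return int(bits, 2)

def to_bits(value, width):
    """Format an integer as a zero-padded bit string of the given width."""
    return format(value, '0{}b'.format(width))

width = len(doc1)
shared = to_int(doc1) & to_int(doc2)   # facets present in both doc1 and doc2
either = to_int(doc1) | to_int(doc3)   # facets present in doc1 or doc3

print(to_bits(shared, width))
print(bin(shared).count('1'))          # number of shared facets
print(to_bits(either, width))
```

Padding back to the original width keeps the leading zeros that `bin()` drops, so positions still line up with facets.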


In [33]:
# Do the same as a pandas dataframe
# for varieties of approaches see http://pbpython.com/pandas-list-dict.html
import pandas as pd

#
items = [('title',['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']),
         ('t1',[0,1,0,0,1,0,1,0,1,1,0,1,0,1,0,1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,1,1,1,1,1,1,1]),
         ('t2',[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1]),
         ('t3',[1,1,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]),
        ]
df = pd.DataFrame.from_dict(items)
print(df)


       0                                                  1
0  title  [A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, ...
1     t1  [0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, ...
2     t2  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3     t3  [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, ...


<hr />We can import using lists and dictionaries, and other techniques too, to be sure.  from_items used to be a common choice, but it was deprecated in pandas 0.23 and removed in pandas 1.0.  Now the challenge is to extract a subset of data that conforms to our needs.  For instance, say columns A and G are conceptually related for this project and we want to extract them.  
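Here is one way the A and G extraction might look once the facets are real columns. This is a sketch under a loud assumption: the title list has 26 labels but each t-row has 45 bits, so only the first 26 bits of each row are used here, one per label.

```python
import string
import pandas as pd

facets = list(string.ascii_uppercase)  # 'A'..'Z', 26 labels
bits = {
    't1': '010010101101010100110101010101010100111111111',
    't2': '100000000000000000001000000000000000000010001',
    't3': '110101010101111111000000000000000000000000000',
}

# Assumption: keep only the first 26 bits of each row so they match the labels.
rows = [[int(b) for b in s[:len(facets)]] for s in bits.values()]
df = pd.DataFrame(rows, index=list(bits), columns=facets)

subset = df[['A', 'G']]   # one row per title, just the two facets
totals = subset.sum()     # column totals, to see which facet occurs more often
print(subset)
print(totals)
```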

<h2>A challenge ... </h2>
Imagine we have x number of titles (or documents, represented above by t<i>n</i>) and a series of facets (or subjects, or whatever interests your group has; A - ...).  
Now, say you need to extract the vectors for A and G (because you have back-end data that says A and G are related to some need).  Once you've extracted A and G, can you determine which of the 2 has the higher total value?  [That means we start off with binary values ... 1/0 depending on whether topic A or G is there ...] and then we have to find facets that are conceptually close to A|G (here you can make up values, such as B and Q).  Then take the bit-wise sum of the original A|G and those subsets, and then "rank by relevancy."  
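A rough sketch of that ranking in plain Python. B and Q are the made-up related facets from the text, the score is simply the count of core and related facets present, and (as an assumption) the first 26 bits of each row are mapped to the labels A through Z.

```python
import string

FACETS = string.ascii_uppercase  # assumption: bit i belongs to the i-th letter
docs = {
    't1': '010010101101010100110101010101010100111111111',
    't2': '100000000000000000001000000000000000000010001',
    't3': '110101010101111111000000000000000000000000000',
}

def facet_bit(bits, facet):
    """Return 0 or 1 for the given facet letter in a bit string."""
    return int(bits[FACETS.index(facet)])

core = ['A', 'G']
related = ['B', 'Q']  # made-up "conceptually close" facets

def score(bits):
    """Unweighted sum of the core and related facet bits."""
    return sum(facet_bit(bits, f) for f in core + related)

scores = {name: score(bits) for name, bits in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranked)
```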
<br />
The concept of "relevancy" is the very heart of search engine differences. In addition, there is also a "weight scheme" that's included.  For instance, if A is conceptually related to B, C, F, Q, and X then how do we integrate them all and then normalize 'em...  and given the variance in these data, what kinds of weight schemes would we want to add (that is, are there metadata or other inputs to include?  [E.g., PageRank's weighting scheme]).  
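To make the weighting question concrete, here is one possible scheme, not a recommendation: the primary facet A gets weight 1.0, each related facet gets an arbitrary 0.5, and the score is divided by the total weight so it lands between 0 and 1. The weights and the mapping of the first 26 bits to A through Z are assumptions.

```python
import string

FACETS = string.ascii_uppercase  # assumption: bit i belongs to the i-th letter
docs = {
    't1': '010010101101010100110101010101010100111111111',
    't2': '100000000000000000001000000000000000000010001',
    't3': '110101010101111111000000000000000000000000000',
}

# Made-up weights: A is the primary facet, the rest are "related" to it.
weights = {'A': 1.0, 'B': 0.5, 'C': 0.5, 'F': 0.5, 'Q': 0.5, 'X': 0.5}
total_weight = sum(weights.values())

def weighted_score(bits):
    """Weighted sum of the facet bits, normalized to the range 0..1."""
    raw = sum(w * int(bits[FACETS.index(f)]) for f, w in weights.items())
    return raw / total_weight

scores = {name: weighted_score(bits) for name, bits in docs.items()}
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

Swapping in other inputs (metadata, link counts, and so on) would only change how `weights` is filled in; the normalization step stays the same.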

<h4>How fast and how else?</h4>
Here are 2 other challenges:  (a) Can you create a group of data, import them as a list and as a dictionary, and then time the differences in retrieval speed?
(b) Another real-world question is how to reduce the volume of data before processing.  For instance, what's the probability of a set of n facets being sufficiently close to include in further processing?  [This is a real-world question posed to me yesterday by a Boston company.]
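For the list-versus-dictionary timing, something along these lines could be a starting point. The data is synthetic, the key names are invented, and the lookup deliberately targets the last key, which is the worst case for a linear scan of the list.

```python
import timeit

N = 10_000
keys = ["doc%d" % i for i in range(N)]
as_list = [(k, i % 2) for i, k in enumerate(keys)]  # list of (key, bit) pairs
as_dict = dict(as_list)                             # same data keyed by name

def list_lookup(key):
    """Linear scan through the list of pairs."""
    for k, v in as_list:
        if k == key:
            return v
    return None

def dict_lookup(key):
    """Hash lookup in the dictionary."""
    return as_dict.get(key)

target = keys[-1]  # last key: worst case for the list scan
t_list = timeit.timeit(lambda: list_lookup(target), number=200)
t_dict = timeit.timeit(lambda: dict_lookup(target), number=200)
print("list: %.5f s   dict: %.5f s" % (t_list, t_dict))
```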

<h3>LMK!</h3>
This is <i>entirely</i> optional - a fun thing to play with ... cheers!

The idea for this comes from (the hazy remembrance from grad school) Frakes &amp; Baeza-Yates&rsquo; <i>Information retrieval: data structures &amp; algorithms</i>.  And someone has posted this text online!  [https://theswissbay.ch/pdf/Gentoomen%20Library/Information%20Retrieval/Information%20Retrieval%20Data%20Structures%20And%20Algorithms_FRAKES%20WB%20%282004%29.pdf]