In [1]:
# paper name

# motivation (of paper)

# walkthrough
# using life expectancy as an example...

## A Sketch-based Index for Correlated Dataset Search

### Motivation

With the ever-rising amount of data available, recent research has explored queries that enlarge a dataset by finding related data. This means finding the top-k tables that are joinable with and correlated to our initial dataset. Because naively scanning the whole data collection can take a long time, the authors propose a more efficient and effective way of finding these tables. Their approach has been shown to achieve better ranking accuracy and recall than other approaches. 
In this notebook we offer a step-by-step explanation of the proposed solution and hope it helps you understand how it works.

### Discovering data with joinable keys and correlated values

Steps of the algorithm:
1. build an index of the tables in the database
    - split larger tables into 2-column tables (cross product)
    - create a sketch of size n for each table in the database
    - build the index
    
    
2. query
    - create a sketch of the query table
    - find correlated and joinable tables for the query table by measuring the overlap between the query sketch and the indexed sketches
    

### Building the index

1. all tables with more than 2 columns are split:
    - every numerical column is combined with every column containing categorical values
<br>

2. all keys k (categorical values) per table are hashed and stored as tuple with their numeric value c_k
    - <h(k), c_k>
    - for performance reasons the sketch size is applied here: only the n tuples with the smallest hashed values are kept
<br>

3. all keys k are labeled according to their numeric value c_k
    - if c_k is below or above the mean of the numeric values, the key is labeled -1 or +1 respectively
    - <h(k), +/-1>, e.g. Armenia-1 or Armenia+1 when the readable key is shown instead of its hash
    - tables that share many labeled keys with the query are likely to be correlated with it
<br>

4. picking a specific sample (=sketch) of size n per table, using n tuples with smallest hash-value
    - this way the samples are comparable
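
The four steps above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the code from `qcr.py`: `toy_hash`, `build_terms` and `build_toy_index` are hypothetical names, MD5 is used only as a convenient stable hash, and the mean is taken over the sketch, as in the single-table example of this notebook.

```python
import hashlib
from collections import defaultdict


def toy_hash(key: str) -> int:
    """Hypothetical stand-in for qcr.hash_function: a stable integer hash of the key."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def build_terms(rows, n):
    """rows: list of (key, value) pairs taken from one 2-column table."""
    # steps 2 and 4: keep the n rows whose keys have the smallest hashes
    sketch = sorted(rows, key=lambda row: toy_hash(row[0]))[:n]
    # step 3: label each key by comparing its value with the sketch mean
    mean = sum(value for _, value in sketch) / len(sketch)
    return [f"{key}{'+1' if value > mean else '-1'}" for key, value in sketch]


def build_toy_index(tables, n):
    """tables: dict mapping a table id to its list of (key, value) pairs."""
    index = defaultdict(set)
    for table_id, rows in tables.items():
        for term in build_terms(rows, n):
            index[term].add(table_id)
    return index


# two made-up tables whose values rise and fall together
tables = {
    "a": [("x", 1.0), ("y", 2.0), ("z", 9.0)],
    "b": [("x", 10.0), ("y", 20.0), ("z", 90.0)],
}
toy_index = build_toy_index(tables, n=3)
print(dict(toy_index))
```

Because both toy tables move in the same direction, every term ends up pointing to both table ids.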


In [2]:
import pandas as pd
import qcr
from collections import Counter
from collections import defaultdict
from qcr import load_index, get_kc, get_c, hash_function, build_index, key_labeling

#### Step 1
In this example we want to find data that correlates with the life expectancy in certain countries. 
We start by building the index. To do so, we first load three tables containing example data. 

In [3]:
# part 1: building the index

## The following Notebook will walk you through the steps using the following sample data. #TODO: sample data with less or more columns? Table with more columns?
## todo: table multiple columns
## The tables are used to create our index.

table1 = pd.read_csv('data/t1.csv')
table1.columns.name = 't1'
display(table1)

table2 = pd.read_csv('data/t2.csv')
table2.columns.name = 't2'
display(table2)

table3 = pd.read_csv('data/t3.csv')
table3.columns.name = 't3'
display(table3)

t1,Country,Alcohol
0,Armenia,3.702667
1,Colombia,4.419333
2,Equatorial Guinea,7.342
3,Germany,11.628667
4,Kazakhstan,6.641333
5,Montenegro,2.584286
6,Nicaragua,3.596667
7,Nigeria,8.646667
8,Paraguay,5.527333
9,United States of America,8.579333


t2,Country,BMI
0,Armenia,44.70625
1,Colombia,49.54375
2,Equatorial Guinea,17.85625
3,Germany,51.99375
4,Kazakhstan,45.15625
5,Montenegro,50.4875
6,Nicaragua,42.68125
7,Nigeria,19.75
8,Paraguay,39.525
9,United States of America,58.45


t3,Country,Area (sq. mi.)
0,Armenia,29800.0
1,Colombia,1138910.0
2,Equatorial Guinea,28051.0
3,Germany,357021.0
4,Kazakhstan,2717300.0
5,Montenegro,5333.0
6,Nicaragua,129494.0
7,Nigeria,923768.0
8,Paraguay,406750.0
9,United States of America,3796742.0


In [4]:
# 1. extracting numerical column (feature) and categorical column (key)

cat_col = get_kc(table1)
num_col = get_c(table1)

print("cat_col:")
print(cat_col)
print("num_col:")
print(num_col)

# (if a table consists of more than 2 cols, all numerical and categorical columns are combined into 2-col-tables (cross product). here, we simply keep using our original table)

# table1, table2, table3 = build_cross_product(cat_col, num_col)

table1

cat_col:
['Armenia', 'Colombia', 'Equatorial Guinea', 'Germany', 'Kazakhstan', 'Montenegro', 'Nicaragua', 'Nigeria', 'Paraguay', 'United States of America']
num_col:
[3.7026666666666666, 4.419333333333333, 7.342, 11.628666666666668, 6.641333333333334, 2.584285714285714, 3.5966666666666667, 8.646666666666667, 5.527333333333333, 8.579333333333333]


t1,Country,Alcohol
0,Armenia,3.702667
1,Colombia,4.419333
2,Equatorial Guinea,7.342
3,Germany,11.628667
4,Kazakhstan,6.641333
5,Montenegro,2.584286
6,Nicaragua,3.596667
7,Nigeria,8.646667
8,Paraguay,5.527333
9,United States of America,8.579333


#### Step 2
Now we have tables containing only one categorical column and one numerical column. <br>
We can start hashing the keys, i.e. the values of the categorical column. For performance reasons it is advised to limit the sketch size. <br>
Although not necessary in this small case, we want to emphasize the scalability of this approach and use a limit of 3.
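
The reason for keeping the rows with the smallest hashes, rather than simply the first rows, can be illustrated with a small experiment. The sketch below is an assumption-laden toy: `toy_hash` and `sketch_keys` are hypothetical helpers, and the key names are made up. It compares two tables that hold the same keys in a different row order.

```python
import hashlib


def toy_hash(key: str) -> int:
    """Hypothetical stand-in for qcr.hash_function (not the notebook's actual hash)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def sketch_keys(keys, n):
    """Return the n keys with the smallest hash values."""
    return set(sorted(keys, key=toy_hash)[:n])


keys_a = [f"country_{i}" for i in range(100)]
keys_b = list(reversed(keys_a))  # same keys, stored in a different row order

# selecting by smallest hash picks the same keys from both tables
same_by_hash = sketch_keys(keys_a, 5) == sketch_keys(keys_b, 5)

# selecting the first n rows depends on row order and shares nothing here
shared_by_position = set(keys_a[:5]) & set(keys_b[:5])

print(same_by_hash, len(shared_by_position))
```

Since the selection depends only on the key itself, two tables that share keys tend to keep the same keys in their sketches, which is what makes the sketches joinable.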

In [15]:
# 2. create the sketch by hashing the keys and keeping the n rows with the smallest hashes. #TODO: the Step 2 text must match this cell: the keys are hashed, not the numerical values

hashed_table1 = table1.copy()
hashed_table1['hashed_values'] = table1['Country'].map(hash_function)
print(hashed_table1)

# now we only keep n rows with the smallest hashed values. those rows form the sketch for this table:

sketch = hashed_table1.nsmallest(3, 'hashed_values')
print('\nsketch of t1:')
sketch


t1                   Country    Alcohol  hashed_values
0                    Armenia   3.702667  -1.040467e-39
1                   Colombia   4.419333  -3.321928e-41
2          Equatorial Guinea   7.342000  -6.294744e-40
3                    Germany  11.628667   9.140805e-40
4                 Kazakhstan   6.641333   5.083309e-40
5                 Montenegro   2.584286   3.881950e-40
6                  Nicaragua   3.596667  -9.162508e-40
7                    Nigeria   8.646667   1.278178e-39
8                   Paraguay   5.527333   6.044821e-41
9   United States of America   8.579333  -3.869834e-40

sketch of t1:


t1,Country,Alcohol,hashed_values
0,Armenia,3.702667,-1.0404670000000001e-39
6,Nicaragua,3.596667,-9.162508e-40
2,Equatorial Guinea,7.342,-6.294744e-40


#### Step 3
Next, following the paper, we label the keys according to the mean of the numeric values.<br>
To increase readability we use the actual keys instead of the hashed keys here.<br>
Any key whose numeric value is below the mean is concatenated with "-1", while any key whose numeric value is above the mean is concatenated with "+1".
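
Why these labels hint at correlation can be seen on made-up data. In the sketch below, `label_keys` is a hypothetical helper that applies the rule just described; the three toy columns are assumptions chosen so that one moves with the base column and one moves against it.

```python
def label_keys(pairs):
    """Label each key with +1 if its value is above the mean, else -1."""
    mean = sum(value for _, value in pairs) / len(pairs)
    return {f"{key}{'+1' if value > mean else '-1'}" for key, value in pairs}


base = [("a", 1.0), ("b", 2.0), ("c", 6.0), ("d", 7.0)]
rising = [("a", 10.0), ("b", 30.0), ("c", 50.0), ("d", 90.0)]  # moves with base
falling = [(key, -value) for key, value in rising]              # moves against base

shared_with_rising = label_keys(base) & label_keys(rising)
shared_with_falling = label_keys(base) & label_keys(falling)

print(len(shared_with_rising), len(shared_with_falling))
```

The column that moves with the base shares all of its labeled keys, while the one that moves against it shares none, so the size of the overlap acts as a rough correlation signal.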

In [16]:
# 3. categorize keys by value and use as new key

# first we calculate the mean of the numeric column of the sketch
mean = sketch['Alcohol'].mean()


# then label each key by value > mean (key+1) or value <= mean (key-1)
sketch['labeled_keys'] = [
    f'{key}{"+1" if value > mean else "-1"}'
    for key, value in zip(sketch['Country'], sketch['Alcohol'])
]

print(sketch)

t1            Country   Alcohol  hashed_values         labeled_keys
0             Armenia  3.702667  -1.040467e-39            Armenia-1
6           Nicaragua  3.596667  -9.162508e-40          Nicaragua-1
2   Equatorial Guinea  7.342000  -6.294744e-40  Equatorial Guinea+1


#### Step 4
Finally, we store the labeled keys of the sketch as terms in the inverted index, where each term maps to the set of tables that contain it. 

In [17]:
# 4. store hashed and categorized terms of sketch in inverted index #TODO: Are we building the inverted index from the sketches here? And afterwards from the complete tables in a quick run-through?
#TODO: mind the order of cell execution. Renaming required
table_id = 't1'
inverted_index = defaultdict(set)

for term in sketch['labeled_keys']:
    inverted_index[term].add(table_id)


inverted_index


defaultdict(set,
            {'Armenia-1': {'t1'},
             'Nicaragua-1': {'t1'},
             'Equatorial Guinea+1': {'t1'}})

In [18]:
### the above is an example for one table; the function "build_index" from qcr.py executes all of the above steps. The full code can be found in qcr.py

# now we import all three tables into the same index
build_index([table1, table2, table3], n=5)
inverted_index = load_index()
inverted_index = dict(sorted(inverted_index.items()))  # sort index for better comparison
print('inverted Index:')
inverted_index

inverted Index:


{'Armenia+1': {'t2'},
 'Armenia-1': {'t1', 't3'},
 'Colombia+1': {'t2', 't3'},
 'Colombia-1': {'t1', 't3'},
 'Equatorial Guinea+1': {'t1'},
 'Equatorial Guinea-1': {'t2', 't3'},
 'Nicaragua+1': {'t2', 't3'},
 'Nicaragua-1': {'t1', 't3'},
 'United States of America+1': {'t1', 't2', 't3'},
 'United States of America-1': {'t3'}}

We have now built our inverted index for searching. Next, we use a query table to find correlated data.

### Querying the index
#### Step 1
We start by building a sketch of our query table and generating its search terms, just as we did for the indexed tables.

In [19]:
# load query table: (key & target)
query = pd.read_csv('data/q.csv')
query.columns.name = 'q'
display(query)

# as above:
# 1. build sketch of query table
sketch = qcr.create_sketch(query['Country'], query['Life expectancy '], hash_function, n=5)
# 2. generate terms
search_terms = qcr.key_labeling(sketch)
print("search terms:")
search_terms

q,Country,Life expectancy
0,Armenia,73.4
1,Colombia,73.2875
2,Equatorial Guinea,55.3125
3,Germany,81.175
4,Kazakhstan,66.7625
5,Montenegro,74.5
6,Nicaragua,73.45
7,Nigeria,51.35625
8,Paraguay,73.1125
9,United States of America,78.0625


search terms:


['Armenia+1',
 'Nicaragua+1',
 'Equatorial Guinea-1',
 'United States of America+1',
 'Colombia+1']

#### Negative correlation
As we might also be interested in negative correlation, we negate the values and build a second set of labeled keys.
This results in two sets of search terms: one for positive correlation, one for negative.
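
The effect of negating the values can be checked on a toy sketch. The helper `label_keys` below is hypothetical and assumes the same strict `value > mean` rule used in the single-table example earlier in this notebook; the data is made up.

```python
def label_keys(pairs):
    """Label each key with +1 if its value is strictly above the mean, else -1."""
    mean = sum(value for _, value in pairs) / len(pairs)
    return [f"{key}{'+1' if value > mean else '-1'}" for key, value in pairs]


def negate(pairs):
    """Flip the sign of every value, keeping the keys."""
    return [(key, -value) for key, value in pairs]


query = [("a", 1.0), ("b", 2.0), ("c", 6.0)]
terms = label_keys(query)
inverse_terms = label_keys(negate(query))
print(terms, inverse_terms)

# edge case: "b" sits exactly on the mean, so it is labeled -1 in both directions
on_mean = [("a", 1.0), ("b", 2.0), ("c", 3.0)]
print(label_keys(on_mean), label_keys(negate(on_mean)))
```

Every label flips as long as no value equals the mean. Under the strict comparison assumed here, a value exactly on the mean keeps its "-1" label after negation, which is usually harmless but worth knowing when reading the search terms.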

In [20]:
# 3. negate the values of the sketch
# for negative correlation
inverse_search_terms = key_labeling(
    list(map((lambda key_value: (key_value[0], -key_value[1])), sketch))
    ) # same function as above, but each value is negated
print("inverse search terms:")
inverse_search_terms

inverse search terms:


['Armenia-1',
 'Nicaragua-1',
 'Equatorial Guinea+1',
 'United States of America-1',
 'Colombia-1']

#### Step 2
Using our new search terms, we can now query the index. We then count how often each table comes up in the results.
We limit the output to the ten tables with the most matches.
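
The counting step can be tried in isolation on a made-up index. In this sketch, `toy_index`, the term lists and `count_matches` are all hypothetical; the lookup uses `dict.get` so that a search term missing from a plain `dict` index is skipped instead of raising a `KeyError`.

```python
from collections import Counter

# made-up inverted index: term -> set of table ids
toy_index = {
    "x+1": {"t_pos"},
    "y-1": {"t_pos", "t_neg"},
    "x-1": {"t_neg"},
    "y+1": {"t_neg"},
}
search_terms = ["x+1", "y-1", "z+1"]    # "z+1" is not in the index
inverse_terms = ["x-1", "y+1", "z-1"]   # terms built from the negated values


def count_matches(index, terms, prefix):
    """Count, per table, how many of the given terms it contains."""
    return Counter(
        prefix + table_id
        for term in terms
        for table_id in index.get(term, ())
    )


scores = count_matches(toy_index, search_terms, "+:")
scores += count_matches(toy_index, inverse_terms, "-:")
print(scores.most_common(10))
```

The "+:" and "-:" prefixes keep the two directions apart, so one table can appear twice in the ranking, once per direction.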

In [21]:
# execute query
# 1. load index
inverted_index = load_index()

# count how many search terms each table matches
result = Counter()
result.update(
    "+:" + table_id for term in search_terms for table_id in inverted_index[term]
)
result.update(
    "-:" + table_id for term in inverse_search_terms for table_id in inverted_index[term]
)

top_tables = result.most_common(10)
top_tables

[('+:t2', 5), ('+:t3', 4), ('-:t1', 4), ('-:t3', 4), ('+:t1', 1)]

In [22]:
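# we see that t2 matches the positive search terms best (5 matches), while t1 mostly matches the negated ones (4 matches); t3 matches both directions equally often, so it gives no clear signal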

In [23]:
result = query.join(table1.set_index('Country'), on='Country' , how='left', lsuffix='_who', rsuffix='_kaggle')

result = result.join(table2.set_index('Country'), on='Country' , how='left', lsuffix='_who', rsuffix='_kaggle')

result

t1,Country,Alcohol
0,Armenia,3.702667
1,Colombia,4.419333
2,Equatorial Guinea,7.342
3,Germany,11.628667
4,Kazakhstan,6.641333
5,Montenegro,2.584286
6,Nicaragua,3.596667
7,Nigeria,8.646667
8,Paraguay,5.527333
9,United States of America,8.579333
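
As a sanity check on the ranking, one could compute the actual Pearson correlation between the query column and the joined columns. The sketch below does this in plain Python on the ten rows shown in this notebook; `pearson` is a hypothetical helper and not part of `qcr`.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# values copied from the tables above, in the same country order
life = [73.4, 73.2875, 55.3125, 81.175, 66.7625,
        74.5, 73.45, 51.35625, 73.1125, 78.0625]
alcohol = [3.702667, 4.419333, 7.342, 11.628667, 6.641333,
           2.584286, 3.596667, 8.646667, 5.527333, 8.579333]
bmi = [44.70625, 49.54375, 17.85625, 51.99375, 45.15625,
       50.4875, 42.68125, 19.75, 39.525, 58.45]

r_alcohol = pearson(life, alcohol)
r_bmi = pearson(life, bmi)
print(f"life expectancy vs alcohol: {r_alcohol:.3f}")
print(f"life expectancy vs BMI:     {r_bmi:.3f}")
```

On the joined DataFrame the same check could presumably be done with pandas, for example `result['Life expectancy '].corr(result['Alcohol'])`, assuming the column names shown above.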
