# Profiling
Comparing two storage strategies for the inverted index:
* Storing the index as integers in binary format
* Storing the index as integers in variable-length encoding, with document IDs stored as gaps (intervals) between consecutive IDs
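
To make the second strategy concrete, here is a minimal sketch of variable-length encoding applied to document ID gaps. It is not the code from `inverted_index`: the function names are hypothetical, and the byte layout (7 data bits per byte, low-order group first, high bit set when more bytes follow) is only one common convention, assumed here for illustration.

```python
from typing import Iterable, List


def vbyte_encode(number: int) -> bytes:
    """Encode a non-negative integer using 7 data bits per byte.

    Assumed convention: low-order group first, high bit set when more
    bytes follow. The notebook's real on-disk layout may differ.
    """
    if number < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = bytearray()
    while True:
        low7 = number & 0x7F
        number >>= 7
        if number:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)  # last byte of this integer
            return bytes(out)


def encode_doc_ids(doc_ids: Iterable[int]) -> bytes:
    """Encode ascending document IDs as variable-length gaps."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += vbyte_encode(doc_id - previous)  # store the gap, not the ID
        previous = doc_id
    return bytes(out)


def decode_doc_ids(data: bytes) -> List[int]:
    """Rebuild the document IDs by summing the decoded gaps."""
    doc_ids = []
    gap = 0
    shift = 0
    previous = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # the current gap continues in the next byte
        else:
            previous += gap
            doc_ids.append(previous)
            gap = 0
            shift = 0
    return doc_ids


doc_ids = [26, 30, 80, 82, 109, 293]
packed = encode_doc_ids(doc_ids)
assert decode_doc_ids(packed) == doc_ids
print(len(packed), "bytes, versus", 4 * len(doc_ids), "with 4-byte integers")
```

Because consecutive IDs in a postings list tend to be close together, most gaps fit in a single byte, which is where the space saving comes from.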

In [1]:
from inverted_index import (create_doc_ids, load_doc_ids, save_doc_ids, 
    save_field_inv_index, save_inv_index, load_field_inv_index, load_inv_index,
    DATA_PATH, InvertedIndex)

## Experiments with the full index

First, we compare the loading time of the two indexes.

In [2]:
%%time
index_packed = load_inv_index(compress=True)

CPU times: total: 16.9 s
Wall time: 17.6 s


In [3]:
%%time
index = load_inv_index(compress=False)

CPU times: total: 4 s
Wall time: 4.48 s


Next, we compare query times on the same indexes, starting with terms at the beginning of the vocabulary and moving toward the end.
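
A single `%%time` run can be too coarse for lookups this fast. One way to get a finer estimate is to repeat the call many times with the standard `timeit` module and average. The sketch below uses a hypothetical dictionary-backed `search_term` as a stand-in, since the notebook's index objects are not available outside it; `time_per_call` is likewise a helper introduced here for illustration.

```python
import timeit

# Hypothetical stand-in for an index: maps a term to its postings list.
toy_postings = {"alcorão": [(1604, 1), (2898, 1)]}


def search_term(term: str):
    """Look up a term in the toy index (not the notebook's implementation)."""
    return toy_postings.get(term, [])


def time_per_call(func, *args, number: int = 10_000, repeat: int = 5) -> float:
    """Return the best average time per call, in seconds."""
    totals = timeit.repeat(lambda: func(*args), number=number, repeat=repeat)
    return min(totals) / number


seconds = time_per_call(search_term, "alcorão")
print(f"{seconds * 1e6:.3f} µs per call")
```

In the notebook itself, the same helper could be called as `time_per_call(index_packed.search_term, 'alcorão')`, or the IPython `%timeit` magic could be used in place of `%%time`.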

In [4]:
%%time
index_packed.search_term('alcorão')

CPU times: total: 0 ns
Wall time: 0 ns


[(1604, 1), (2898, 1)]

In [5]:
%%time
index_packed.search_term('evolução')

CPU times: total: 0 ns
Wall time: 1.01 ms


[(26, 1),
 (30, 2),
 (80, 1),
 (82, 2),
 (109, 2),
 (293, 1),
 (307, 1),
 (319, 1),
 (321, 3),
 (332, 1),
 (346, 1),
 (371, 1),
 (383, 1),
 (609, 1),
 (646, 1),
 (719, 1),
 (836, 1),
 (983, 1),
 (997, 4),
 (998, 5),
 (1000, 5),
 (1035, 1),
 (1272, 2),
 (1422, 2),
 (1494, 2),
 (1552, 2),
 (1618, 1),
 (1620, 1),
 (1644, 1),
 (1965, 1),
 (1976, 1),
 (1980, 1),
 (1982, 1),
 (1989, 1),
 (1991, 1),
 (1994, 1),
 (1995, 1),
 (2013, 1),
 (2019, 1),
 (2033, 1),
 (2039, 2),
 (2041, 1),
 (2053, 1),
 (2055, 2),
 (2056, 1),
 (2057, 1),
 (2063, 2),
 (2070, 1),
 (2073, 1),
 (2113, 2),
 (2119, 2),
 (2134, 1),
 (2135, 1),
 (2136, 1),
 (2137, 1),
 (2141, 1),
 (2152, 1),
 (2180, 1),
 (2184, 1),
 (2187, 2),
 (2191, 1),
 (2192, 1),
 (2193, 1),
 (2210, 1),
 (2314, 1),
 (2339, 1),
 (2356, 1),
 (2361, 1),
 (2372, 1),
 (2408, 1),
 (2410, 3),
 (2442, 1),
 (2474, 1),
 (2534, 1),
 (2554, 1),
 (2562, 1),
 (2565, 2),
 (2642, 2),
 (2661, 1),
 (2663, 2),
 (2684, 1),
 (2689, 1),
 (2692, 1),
 (2759, 1),
 (2762, 1),
 (27

In [6]:
%%time
index_packed.search_term('ziraldo')

CPU times: total: 0 ns
Wall time: 1 ms


[(871, 1),
 (899, 1),
 (902, 1),
 (917, 1),
 (918, 1),
 (1043, 1),
 (1051, 1),
 (1385, 1),
 (1460, 1),
 (1568, 1),
 (1611, 1),
 (2262, 1),
 (2269, 1),
 (2270, 1),
 (2878, 1),
 (3149, 1),
 (3296, 11),
 (4308, 1),
 (4310, 1),
 (4454, 3),
 (4479, 2),
 (4605, 3),
 (5406, 1),
 (5532, 1),
 (5814, 3),
 (6202, 1),
 (6209, 1),
 (6210, 1),
 (6783, 1),
 (6790, 1),
 (6933, 9),
 (7044, 9)]

In [7]:
%%time
index.search_term('alcorão')

CPU times: total: 0 ns
Wall time: 0 ns


[(1604, 1), (2898, 1)]

In [8]:
%%time
index.search_term('evolução')

CPU times: total: 0 ns
Wall time: 0 ns


[(26, 1),
 (30, 2),
 (80, 1),
 (82, 2),
 (109, 2),
 (293, 1),
 (307, 1),
 (319, 1),
 (321, 3),
 (332, 1),
 (346, 1),
 (371, 1),
 (383, 1),
 (609, 1),
 (646, 1),
 (719, 1),
 (836, 1),
 (983, 1),
 (997, 4),
 (998, 5),
 (1000, 5),
 (1035, 1),
 (1272, 2),
 (1422, 2),
 (1494, 2),
 (1552, 2),
 (1618, 1),
 (1620, 1),
 (1644, 1),
 (1965, 1),
 (1976, 1),
 (1980, 1),
 (1982, 1),
 (1989, 1),
 (1991, 1),
 (1994, 1),
 (1995, 1),
 (2013, 1),
 (2019, 1),
 (2033, 1),
 (2039, 2),
 (2041, 1),
 (2053, 1),
 (2055, 2),
 (2056, 1),
 (2057, 1),
 (2063, 2),
 (2070, 1),
 (2073, 1),
 (2113, 2),
 (2119, 2),
 (2134, 1),
 (2135, 1),
 (2136, 1),
 (2137, 1),
 (2141, 1),
 (2152, 1),
 (2180, 1),
 (2184, 1),
 (2187, 2),
 (2191, 1),
 (2192, 1),
 (2193, 1),
 (2210, 1),
 (2314, 1),
 (2339, 1),
 (2356, 1),
 (2361, 1),
 (2372, 1),
 (2408, 1),
 (2410, 3),
 (2442, 1),
 (2474, 1),
 (2534, 1),
 (2554, 1),
 (2562, 1),
 (2565, 2),
 (2642, 2),
 (2661, 1),
 (2663, 2),
 (2684, 1),
 (2689, 1),
 (2692, 1),
 (2759, 1),
 (2762, 1),
 (27

In [9]:
%%time
index.search_term('ziraldo')

CPU times: total: 0 ns
Wall time: 0 ns


[(871, 1),
 (899, 1),
 (902, 1),
 (917, 1),
 (918, 1),
 (1043, 1),
 (1051, 1),
 (1385, 1),
 (1460, 1),
 (1568, 1),
 (1611, 1),
 (2262, 1),
 (2269, 1),
 (2270, 1),
 (2878, 1),
 (3149, 1),
 (3296, 11),
 (4308, 1),
 (4310, 1),
 (4454, 3),
 (4479, 2),
 (4605, 3),
 (5406, 1),
 (5532, 1),
 (5814, 3),
 (6202, 1),
 (6209, 1),
 (6210, 1),
 (6783, 1),
 (6790, 1),
 (6933, 9),
 (7044, 9)]

Finally, we measure the in-memory size of each index's postings:

In [10]:
def calc_size(index: InvertedIndex):
    size = 0
    for posting in index.postings:
        for element in posting:
            size += len(element) # count only the length of the byte array, not the overhead of the bytes object
    return size

In [11]:
calc_size(index_packed)

10091510

In [12]:
calc_size(index)

38835952

In [25]:
calc_size(index_packed)/calc_size(index)

0.2598496877326453

## Experiments with the field index
Repeating the previous experiments for the field index.

In [13]:
%%time
field_index_packed = load_field_inv_index(compress=True)

CPU times: total: 1.97 s
Wall time: 2.09 s


In [14]:
%%time
field_index = load_field_inv_index(compress=False)

CPU times: total: 828 ms
Wall time: 936 ms


In [15]:
%%time
field_index.search_term('aurelio.author')

CPU times: total: 0 ns
Wall time: 0 ns


[(7384, 1)]

In [16]:
%%time
field_index.search_term('japão.description')

CPU times: total: 0 ns
Wall time: 1 ms


[(225, 1),
 (317, 2),
 (1642, 1),
 (1654, 1),
 (2657, 1),
 (2888, 1),
 (2996, 1),
 (2999, 1),
 (3196, 2),
 (3360, 2),
 (3379, 1),
 (3561, 1),
 (4937, 1),
 (5091, 1),
 (5454, 1),
 (6596, 1),
 (6873, 1),
 (6994, 2),
 (7162, 1)]

In [17]:
%%time
field_index.search_term('zuckerberg.description')

CPU times: total: 0 ns
Wall time: 0 ns


[(4387, 1)]

In [18]:
%%time
field_index_packed.search_term('aurelio.author')

CPU times: total: 0 ns
Wall time: 0 ns


[(7384, 1)]

In [19]:
%%time
field_index_packed.search_term('japão.description')

CPU times: total: 0 ns
Wall time: 0 ns


[(225, 1),
 (317, 2),
 (1642, 1),
 (1654, 1),
 (2657, 1),
 (2888, 1),
 (2996, 1),
 (2999, 1),
 (3196, 2),
 (3360, 2),
 (3379, 1),
 (3561, 1),
 (4937, 1),
 (5091, 1),
 (5454, 1),
 (6596, 1),
 (6873, 1),
 (6994, 2),
 (7162, 1)]

In [21]:
%%time
field_index_packed.search_term('zuckerberg.description')

CPU times: total: 0 ns
Wall time: 0 ns


[(4387, 1)]

In [22]:
calc_size(field_index_packed)

1237529

In [23]:
calc_size(field_index)

4230512

In [24]:
calc_size(field_index_packed)/calc_size(field_index)

0.2925246400435692

## Conclusions
* Regarding loading time, the compressed file takes considerably longer to process, especially for the full index: 17.6 s versus 4.48 s of wall time for the full index, and 2.09 s versus 936 ms for the field index. We believe this is because the file is read byte by byte (instead of 4 bytes at a time), and each read requires extra work to determine whether the current number or posting has ended.
* Regarding query time, no significant difference was found: every measurement was about 1 ms or less, and most were reported as 0 ns, which is too short for a single `%%time` run to resolve.
* Finally, the compressed indexes occupy about 26% (full index) and 29% (field index) of the memory used by the uncompressed ones, a significant space saving.
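
The first conclusion can be illustrated by contrasting the two reading styles. This is a sketch under assumptions, not the loader from `inverted_index`: it assumes little-endian unsigned 4-byte integers for the uncompressed format and a continuation-bit convention (high bit set when more bytes follow, low-order group first) for the compressed one. Both function names are hypothetical.

```python
import struct
from typing import List


def read_fixed_width(data: bytes) -> List[int]:
    """Decode consecutive 4-byte little-endian unsigned integers in one call."""
    count = len(data) // 4
    return list(struct.unpack(f"<{count}I", data[: count * 4]))


def read_variable_length(data: bytes) -> List[int]:
    """Decode variable-length integers, inspecting every byte in Python."""
    numbers = []
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # the number continues in the next byte
        else:
            numbers.append(current)  # the number ended on this byte
            current = 0
            shift = 0
    return numbers


fixed = struct.pack("<3I", 5, 300, 70000)
variable = bytes([0x05, 0xAC, 0x02, 0xF0, 0xA2, 0x04])

assert read_fixed_width(fixed) == [5, 300, 70000]
assert read_variable_length(variable) == [5, 300, 70000]
print(len(fixed), "bytes fixed-width,", len(variable), "bytes variable-length")
```

The fixed-width reader hands the whole buffer to `struct.unpack`, while the variable-length reader runs a mask, a shift, and a branch for every byte in interpreted Python. That per-byte work is consistent with the slower loading observed for the compressed index.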