
Eagerly fill marks and matches cache #75

Merged
8 commits merged on Sep 1, 2019

Conversation

gin-ahirsch

As promised in #37, here's the rebased work.

For some reason I also need to apply the following patch; otherwise my item gets corrupted. I think this occurs sometimes when adding marks. While the patch does fix it, I don't see why the previous code would be wrong. Maybe something else in my setup is triggering the failure, so I didn't include it.
Make sure to play with the cache a little! I'm sure you will anyway, to evaluate whether you want this change or not.

diff --git a/src/logdata/include/data/logfiltereddata.h b/src/logdata/include/data/logfiltereddata.h
index b66489b..d529f74 100644
--- a/src/logdata/include/data/logfiltereddata.h
+++ b/src/logdata/include/data/logfiltereddata.h
@@ -220,8 +220,8 @@ class LogFilteredData : public AbstractLogData {
     void regenerateFilteredItemsCache() const;
     // start_index can be passed in as an optimization when finding the item.
     // It refers to the index of the singular arrays (Marks or SearchResultArray) where the item was inserted.
-    void insertIntoFilteredItemsCache( size_t start_index, FilteredItem&& item );
-    void insertIntoFilteredItemsCache( FilteredItem&& item );
+    void insertIntoFilteredItemsCache( size_t start_index, const FilteredItem& item );
+    void insertIntoFilteredItemsCache( const FilteredItem& item );
     // Insert new matches into matching_lines_ and filteredItemsCache_
     void insertNewMatches( const SearchResultArray& new_matches );
     // remove_index can be passed in as an optimization when finding the item.
diff --git a/src/logdata/src/logfiltereddata.cpp b/src/logdata/src/logfiltereddata.cpp
index 5978fe5..03df611 100644
--- a/src/logdata/src/logfiltereddata.cpp
+++ b/src/logdata/src/logfiltereddata.cpp
@@ -593,7 +593,7 @@ void LogFilteredData::regenerateFilteredItemsCache() const
     LOG(logDEBUG) << "finished regenerateFilteredItemsCache, size " << filteredItemsCache_.size();
 }
 
-void LogFilteredData::insertIntoFilteredItemsCache( size_t insert_index, FilteredItem&& item )
+void LogFilteredData::insertIntoFilteredItemsCache( size_t insert_index, const FilteredItem& item )
 {
     using std::begin;
     using std::end;
@@ -616,7 +616,7 @@ void LogFilteredData::insertIntoFilteredItemsCache( size_t insert_index, Filtere
     }
 }
 
-void LogFilteredData::insertIntoFilteredItemsCache( FilteredItem&& item )
+void LogFilteredData::insertIntoFilteredItemsCache( const FilteredItem& item )
 {
     insertIntoFilteredItemsCache( 0, std::move( item ) );
 }

Excerpts from the commit messages (leading lines of some messages were not captured):

- "… and progressively fill/remove items instead of completely regenerating the whole cache."
- "… by passing in a guess for the index."
- "Instead of copying the result data, we exchange it for an empty array and return the old result data. The caller can do the potentially slow copy if necessary, or handle the new data differently, skipping a perhaps unnecessary copy. Currently we still do the copy, but this allows us to optimize that step. To reflect the new behavior, the methods dealing with the exchange have been renamed accordingly. There is also a small change in that the pointer parameters are turned into references."
- "We cannot properly guess a good start_index like for glogg, which has a single-threaded linear search. But we can collapse two loops when we receive new matches: the first loop, which sorts the new matches into matching_lines_, and the second, which puts them into the filteredItemsCache_."
- "… except when we absolutely can't, i.e. for an empty container."
variar added a commit that referenced this pull request Jul 28, 2019
@variar
Owner

variar commented Jul 28, 2019

I've adapted this to the latest changes in a separate branch, gin-ahirsch-eager-filtered-items-cache-klogg (a055fe6). It seems to work fine; however, I notice some decrease in search performance on my computer.

For the original build with lazy cache:

17:49:11.859 INFO  [24883] [LogFilteredDataWorker::interrupt@245] Search interruption requested
17:49:11.860 INFO  [25297] [SearchOperation::doSearch@316] Using 3 matching threads
17:49:15.587 INFO  [25297] [SearchOperation::doSearch@456] Searching done, took 3727.69 ms
17:49:15.587 INFO  [25297] [SearchOperation::doSearch@457] Searching perf 2441404 lines/s
17:49:18.926 INFO  [24883] [LogFilteredDataWorker::interrupt@245] Search interruption requested
17:49:18.928 INFO  [25297] [SearchOperation::doSearch@316] Using 3 matching threads
17:49:22.583 INFO  [25297] [SearchOperation::doSearch@456] Searching done, took 3655.21 ms
17:49:22.583 INFO  [25297] [SearchOperation::doSearch@457] Searching perf 2489813 lines/s
17:49:26.111 INFO  [24883] [LogFilteredDataWorker::interrupt@245] Search interruption requested
17:49:26.112 INFO  [25297] [SearchOperation::doSearch@316] Using 3 matching threads
17:49:29.859 INFO  [25297] [SearchOperation::doSearch@456] Searching done, took 3746.57 ms
17:49:29.859 INFO  [25297] [SearchOperation::doSearch@457] Searching perf 2429105 lines/s

For the build with eager cache:

17:48:03.536 INFO  [24763] [LogFilteredDataWorker::interrupt@250] Search interruption requested
17:48:03.537 INFO  [24991] [SearchOperation::doSearch@319] Using 3 matching threads
17:48:07.482 INFO  [24991] [SearchOperation::doSearch@459] Searching done, took 3945.03 ms
17:48:07.482 INFO  [24991] [SearchOperation::doSearch@460] Searching perf 2306905 lines/s
17:48:10.661 INFO  [24763] [LogFilteredDataWorker::interrupt@250] Search interruption requested
17:48:10.662 INFO  [24991] [SearchOperation::doSearch@319] Using 3 matching threads
17:48:14.719 INFO  [24991] [SearchOperation::doSearch@459] Searching done, took 4057.1 ms
17:48:14.719 INFO  [24991] [SearchOperation::doSearch@460] Searching perf 2243178 lines/s
17:48:18.095 INFO  [24763] [LogFilteredDataWorker::interrupt@250] Search interruption requested
17:48:18.097 INFO  [24991] [SearchOperation::doSearch@319] Using 3 matching threads
17:48:22.214 INFO  [24991] [SearchOperation::doSearch@459] Searching done, took 4117.24 ms
17:48:22.214 INFO  [24991] [SearchOperation::doSearch@460] Searching perf 2210413 lines/s

So the lazy cache runs at about 2.4M lines/s, while the eager cache gives 2.2M lines/s for the same file and regular expression.

@gin-ahirsch
Author

I guess the commit "Let LogFilteredDataWorkerThread::updateSearchResult() move the result" could affect cache locality negatively. Could you try running the search on that commit and compare it to the previous one? If that is the cause, we can maybe mitigate the cost somewhat.

It might also be due to SearchData::getAllMissing now doing a resize plus copy (i.e. potentially copying from two sources when reallocating), versus previously always copying just once (though it copied the whole array, not just the new results). If reallocation happens a lot, the SearchResultArray could reserve more memory from the start, though I wouldn't expect reallocation to happen often. You could add a check for that: `if ( matches->capacity() < matches_.size() ) LOG( logINFO ) << "reallocation necessary";` and see how often it fires.

In any case, UI latency should improve, since we can be smarter about updating the filteredItemsCache_ when MarksAndMatches is on. Maybe that's just a trade-off this change incurs; whether it is an acceptable one I can't decide.

@gin-ahirsch
Author

gin-ahirsch commented Jul 31, 2019

In case moving the SearchResultArray is the culprit, this patch should help:

diff --git a/src/logdata/src/logfiltereddataworkerthread.cpp b/src/logdata/src/logfiltereddataworkerthread.cpp
index 25de198..2fe10c9 100644
--- a/src/logdata/src/logfiltereddataworkerthread.cpp
+++ b/src/logdata/src/logfiltereddataworkerthread.cpp
@@ -109,7 +109,11 @@ SearchResultArray SearchData::takeAll( LineLength& length, LinesCount& nbLinesPr
 
     length           = maxLength_;
     nbLinesProcessed = nbLinesProcessed_;
-    return std::exchange( matches_, {} );
+
+    SearchResultArray copy{ matches_.begin(), matches_.end() };
+    matches_.clear();
+
+    return std::move( copy );
 }
 
 void SearchData::setAll( LineLength length, SearchResultArray&& matches )

Edit:
Actually, since the move means we do need to reallocate matches_ for the new results, maybe that is indeed the problem, though not due to cache misses. If so, a better fix would probably be to reserve some memory (matches_.size()?) for the new matches_ before returning:

SearchResultArray matches = std::exchange( matches_, {} );
matches_.reserve( matches.size() );
return matches;

@gin-ahirsch
Author

There's a stale comment in your merge-commit:

// TODO: We might be a bit smarter and not regenerate the whole thing when

@variar
Owner

variar commented Aug 11, 2019

Thanks for your research. I'll need to do more profiling. Also, my benchmarks are not very accurate, and improving UI latency might be worth some minor decrease in search speed.

@variar
Owner

variar commented Sep 1, 2019

The performance hit was also caused by merging new search results in the UI thread. I've switched back to providing all search data as well as the new lines for incremental cache filling. Thanks to immer's persistent structures, this does not involve too much data copying.

@variar
Owner

variar commented Sep 1, 2019

Another good test case for this is searching for a very common line (I used a 1 GB log file and a line that matches 6867600 times). Without eager cache filling, the search gradually gets slower. With this PR it progresses at an almost constant speed. Thanks again 👍

@gin-ahirsch gin-ahirsch deleted the eager-filtered-items-cache-klogg branch October 15, 2019 13:23