Add cachew? #6

Closed
seanbreckenridge opened this issue Sep 5, 2020 · 1 comment

@seanbreckenridge
Owner

Doesn't seem like it'd be useful here, since we're already reading from a database (the Firefox history database). Caching that info to another cachew database wouldn't make much sense.

Can't cache the live Firefox history file because it keeps changing, so the only place cachew would improve performance is if we were spending a long time in merge_visits. But that doesn't even do any IO; it's just a loop with a set, so doubtful.
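For illustration, a set-based dedup loop is just something like this (a rough sketch of the idea, not the actual ffexport merge_visits implementation; the dedup key fields are assumed):

# minimal sketch of set-based dedup, not the real ffexport.merge_db.merge_visits
from typing import Iterable, Iterator
from ffexport.model import Visit

def dedupe_visits(*sources: Iterable[Visit]) -> Iterator[Visit]:
    seen = set()
    for source in sources:
        for visit in source:
            key = (visit.url, visit.visit_date)  # assumed dedup key
            if key in seen:
                continue
            seen.add(key)
            yield visit

No IO happens here, just hashing and set membership checks, which is why caching it wouldn't buy much.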

For reference:

[ ~ ] $ time sh -c  'HPI_LOGS=debug python3 -c "from my.browsing import history; x = list(history())"'
[DEBUG   2020-09-05 03:07:21,267 my.browsing __init__.py:681] using inferred type <class 'ffexport.model.Visit'>
[D 200905 03:07:21 save_hist:66] backing up /home/sean/.mozilla/firefox/lsinsptf.dev-edition-default/places.sqlite to /tmp/tmpxvxci5yl/places-20200905100721.sqlite
[D 200905 03:07:21 save_hist:70] done!
[D 200905 03:07:21 merge_db:48] merging information from 2 databases...
[DEBUG   2020-09-05 03:07:21,303 my.browsing __init__.py:728] using /tmp/browser-cachw/homeseandatafirefoxdbsplaces-20200828223058.sqlite for db cache
[DEBUG   2020-09-05 03:07:21,303 my.browsing __init__.py:734] new hash: cachew: 0.7.0, schema: {'url': <class 'str'>, 'visit_date': <class 'datetime.datetime'>, 'visit_type': <class 'int'>, 'title': typing.Union[str, NoneType], 'description': typing.Union[str, NoneType], 'preview_image': typing.Union[str, NoneType]}, hash: 1598653858
[DEBUG   2020-09-05 03:07:21,310 my.browsing __init__.py:761] old hash: cachew: 0.7.0, schema: {'url': <class 'str'>, 'visit_date': <class 'datetime.datetime'>, 'visit_type': <class 'int'>, 'title': typing.Union[str, NoneType], 'description': typing.Union[str, NoneType], 'preview_image': typing.Union[str, NoneType]}, hash: 1598653858
[DEBUG   2020-09-05 03:07:21,310 my.browsing __init__.py:764] hash matched: loading from cache
[DEBUG   2020-09-05 03:07:22,083 my.browsing __init__.py:728] using /tmp/browser-cachw/tmptmpxvxci5ylplaces-20200905100721.sqlite for db cache
[DEBUG   2020-09-05 03:07:22,083 my.browsing __init__.py:734] new hash: cachew: 0.7.0, schema: {'url': <class 'str'>, 'visit_date': <class 'datetime.datetime'>, 'visit_type': <class 'int'>, 'title': typing.Union[str, NoneType], 'description': typing.Union[str, NoneType], 'preview_image': typing.Union[str, NoneType]}, hash: 1599300441
[DEBUG   2020-09-05 03:07:22,085 my.browsing __init__.py:761] old hash: None
[DEBUG   2020-09-05 03:07:22,085 my.browsing __init__.py:770] hash mismatch: computing data and writing to db
[D 200905 03:07:22 parse_db:69] Parsing visits from /tmp/tmpxvxci5yl/places-20200905100721.sqlite...
[D 200905 03:07:22 parse_db:88] Parsing sitedata from /tmp/tmpxvxci5yl/places-20200905100721.sqlite...
[D 200905 03:07:28 merge_db:60] Summary: removed 91,787 duplicates...
[D 200905 03:07:28 merge_db:61] Summary: returning 98,609 visit entries...
sh -c   7.46s user 0.19s system 99% cpu 7.711 total
[ ~ ] $ time sh -c 'HPI_LOGS=debug python3 -c "from my.browsing import history; x = list(history())"'
[D 200905 03:07:48 save_hist:66] backing up /home/sean/.mozilla/firefox/lsinsptf.dev-edition-default/places.sqlite to /tmp/tmpsvri7hr8/places-20200905100748.sqlite
[D 200905 03:07:48 save_hist:70] done!
[D 200905 03:07:48 merge_db:48] merging information from 2 databases...
[D 200905 03:07:48 parse_db:69] Parsing visits from /home/sean/data/firefox/dbs/places-20200828223058.sqlite...
[D 200905 03:07:48 parse_db:88] Parsing sitedata from /home/sean/data/firefox/dbs/places-20200828223058.sqlite...
[D 200905 03:07:49 parse_db:69] Parsing visits from /tmp/tmpsvri7hr8/places-20200905100748.sqlite...
[D 200905 03:07:49 parse_db:88] Parsing sitedata from /tmp/tmpsvri7hr8/places-20200905100748.sqlite...
[D 200905 03:07:50 merge_db:60] Summary: removed 91,787 duplicates...
[D 200905 03:07:50 merge_db:61] Summary: returning 98,609 visit entries...
sh -c   1.65s user 0.10s system 99% cpu 1.759 total

The first run takes about 7 seconds, with a cachew cache hit for the backed-up database. The second run reads both databases directly and takes about 1.6 seconds.

For reference, this is how I modified my.browsing from HPI:

diff --git a/my/browsing.py b/my/browsing.py
index 9f44322..af66530 100644
--- a/my/browsing.py
+++ b/my/browsing.py
@@ -25,17 +25,25 @@ import tempfile
 from pathlib import Path
 from typing import Iterator, Sequence
 
-from .core.common import listify, get_files
+from .core.common import listify, get_files, mcachew
 
 
+from .kython.klogging import LazyLogger, mklevel
 # monkey patch ffexport logs
 if "HPI_LOGS" in os.environ:
-    from .kython.klogging import mklevel
     os.environ["FFEXPORT_LOGS"] = str(mklevel(os.environ["HPI_LOGS"]))
 
+logger = LazyLogger(__name__, level="info")
 
-from ffexport import read_and_merge, Visit
+CACHEW_PATH = "/tmp/browser-cachw"
+
+# create cache path
+os.makedirs(CACHEW_PATH, exist_ok=True)
+
+from ffexport import Visit
 from ffexport.save_hist import backup_history
+from ffexport.parse_db import read_visits
+from ffexport.merge_db import merge_visits
 
 @listify
 def inputs() -> Sequence[Path]:
@@ -60,7 +68,20 @@ def history(from_paths=inputs) -> Results:
     import my.browsing
     visits = list(my.browsing.history())
     """
-    yield from read_and_merge(*from_paths())
+    # only load items that are in the config.export path using cachew
+    # the 'live_file' is always going to be uncached
+    db_paths = list(from_paths())
+    tmp_path = db_paths.pop()
+    yield from merge_visits(*map(_read_history, db_paths), _read_history(tmp_path))
+
+
+def _browser_mtime(p: Path) -> int:
+    return int(p.stat().st_mtime)
+
+@mcachew(hashf=_browser_mtime, logger=logger, cache_path=lambda db_path: f"{CACHEW_PATH}/{str(db_path).replace('/','')}")
+def _read_history(db: Path) -> Iterator[Visit]:
+    yield from read_visits(db)
+
 
 def stats():
     from .core import stat
@seanbreckenridge
Owner Author

After using this for a while, I can say with a fair amount of confidence that it'd end up being slower. Better to read/merge from the DBs themselves.

If you're doing this all the time, I think it'd be most efficient to periodically read everything in using read_and_merge, dump it to a pickle file, and load that back into memory whenever you need it.
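Something along these lines (a rough sketch; the cache file location and the dump/load helpers are placeholders, not part of ffexport or HPI):

# rough sketch of the pickle approach; CACHE_FILE and the helper names are hypothetical
import pickle
from pathlib import Path
from ffexport import read_and_merge

CACHE_FILE = Path("/tmp/browser-visits.pickle")  # hypothetical location

def dump_visits(*db_paths: Path) -> None:
    # run periodically (e.g. from cron) to refresh the merged snapshot
    visits = list(read_and_merge(*db_paths))
    with CACHE_FILE.open("wb") as f:
        pickle.dump(visits, f)

def load_visits():
    # fast path: just unpickle the previously merged visits
    with CACHE_FILE.open("rb") as f:
        return pickle.load(f)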
