Add cachew? #6

Closed
seanbreckenridge opened this issue Sep 5, 2020 · 1 comment

@seanbreckenridge
Owner

Doesn't seem like it'd be useful here, since we're already reading from a database (the Firefox history database). Caching that info to another cachew database wouldn't make much sense.

Can't cache the live Firefox history file because it keeps changing, so the only place cachew would improve performance is if we were spending a long time in merge_visits. But that doesn't even do any IO; it's just a loop with a set, so doubtful.
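For illustration, a set-based dedup loop is just something like this (a rough sketch of the idea, not the actual ffexport merge_visits implementation; the dedup key fields are assumed):

# minimal sketch of set-based dedup, not the real ffexport.merge_db.merge_visits
from typing import Iterable, Iterator
from ffexport.model import Visit

def dedupe_visits(*sources: Iterable[Visit]) -> Iterator[Visit]:
    seen = set()
    for source in sources:
        for visit in source:
            key = (visit.url, visit.visit_date)  # assumed dedup key
            if key in seen:
                continue
            seen.add(key)
            yield visit

No IO happens here, just hashing and set membership checks, which is why caching it wouldn't buy much.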

For reference:

[ ~ ] $ time sh -c  'HPI_LOGS=debug python3 -c "from my.browsing import history; x = list(history())"'
[DEBUG   2020-09-05 03:07:21,267 my.browsing __init__.py:681] using inferred type <class 'ffexport.model.Visit'>
[D 200905 03:07:21 save_hist:66] backing up /home/sean/.mozilla/firefox/lsinsptf.dev-edition-default/places.sqlite to /tmp/tmpxvxci5yl/places-20200905100721.sqlite
[D 200905 03:07:21 save_hist:70] done!
[D 200905 03:07:21 merge_db:48] merging information from 2 databases...
[DEBUG   2020-09-05 03:07:21,303 my.browsing __init__.py:728] using /tmp/browser-cachw/homeseandatafirefoxdbsplaces-20200828223058.sqlite for db cache
[DEBUG   2020-09-05 03:07:21,303 my.browsing __init__.py:734] new hash: cachew: 0.7.0, schema: {'url': <class 'str'>, 'visit_date': <class 'datetime.datetime'>, 'visit_type': <class 'int'>, 'title': typing.Union[str, NoneType], 'description': typing.Union[str, NoneType], 'preview_image': typing.Union[str, NoneType]}, hash: 1598653858
[DEBUG   2020-09-05 03:07:21,310 my.browsing __init__.py:761] old hash: cachew: 0.7.0, schema: {'url': <class 'str'>, 'visit_date': <class 'datetime.datetime'>, 'visit_type': <class 'int'>, 'title': typing.Union[str, NoneType], 'description': typing.Union[str, NoneType], 'preview_image': typing.Union[str, NoneType]}, hash: 1598653858
[DEBUG   2020-09-05 03:07:21,310 my.browsing __init__.py:764] hash matched: loading from cache
[DEBUG   2020-09-05 03:07:22,083 my.browsing __init__.py:728] using /tmp/browser-cachw/tmptmpxvxci5ylplaces-20200905100721.sqlite for db cache
[DEBUG   2020-09-05 03:07:22,083 my.browsing __init__.py:734] new hash: cachew: 0.7.0, schema: {'url': <class 'str'>, 'visit_date': <class 'datetime.datetime'>, 'visit_type': <class 'int'>, 'title': typing.Union[str, NoneType], 'description': typing.Union[str, NoneType], 'preview_image': typing.Union[str, NoneType]}, hash: 1599300441
[DEBUG   2020-09-05 03:07:22,085 my.browsing __init__.py:761] old hash: None
[DEBUG   2020-09-05 03:07:22,085 my.browsing __init__.py:770] hash mismatch: computing data and writing to db
[D 200905 03:07:22 parse_db:69] Parsing visits from /tmp/tmpxvxci5yl/places-20200905100721.sqlite...
[D 200905 03:07:22 parse_db:88] Parsing sitedata from /tmp/tmpxvxci5yl/places-20200905100721.sqlite...
[D 200905 03:07:28 merge_db:60] Summary: removed 91,787 duplicates...
[D 200905 03:07:28 merge_db:61] Summary: returning 98,609 visit entries...
sh -c   7.46s user 0.19s system 99% cpu 7.711 total
[ ~ ] $ time sh -c 'HPI_LOGS=debug python3 -c "from my.browsing import history; x = list(history())"'
[D 200905 03:07:48 save_hist:66] backing up /home/sean/.mozilla/firefox/lsinsptf.dev-edition-default/places.sqlite to /tmp/tmpsvri7hr8/places-20200905100748.sqlite
[D 200905 03:07:48 save_hist:70] done!
[D 200905 03:07:48 merge_db:48] merging information from 2 databases...
[D 200905 03:07:48 parse_db:69] Parsing visits from /home/sean/data/firefox/dbs/places-20200828223058.sqlite...
[D 200905 03:07:48 parse_db:88] Parsing sitedata from /home/sean/data/firefox/dbs/places-20200828223058.sqlite...
[D 200905 03:07:49 parse_db:69] Parsing visits from /tmp/tmpsvri7hr8/places-20200905100748.sqlite...
[D 200905 03:07:49 parse_db:88] Parsing sitedata from /tmp/tmpsvri7hr8/places-20200905100748.sqlite...
[D 200905 03:07:50 merge_db:60] Summary: removed 91,787 duplicates...
[D 200905 03:07:50 merge_db:61] Summary: returning 98,609 visit entries...
sh -c   1.65s user 0.10s system 99% cpu 1.759 total

The first run takes about 7 seconds, with a cachew cache hit for the backed-up database. The second run reads both databases directly and takes about 1.6 seconds.

For reference, this is how I modified my.browsing from HPI:

diff --git a/my/browsing.py b/my/browsing.py
index 9f44322..af66530 100644
--- a/my/browsing.py
+++ b/my/browsing.py
@@ -25,17 +25,25 @@ import tempfile
 from pathlib import Path
 from typing import Iterator, Sequence
 
-from .core.common import listify, get_files
+from .core.common import listify, get_files, mcachew
 
 
+from .kython.klogging import LazyLogger, mklevel
 # monkey patch ffexport logs
 if "HPI_LOGS" in os.environ:
-    from .kython.klogging import mklevel
     os.environ["FFEXPORT_LOGS"] = str(mklevel(os.environ["HPI_LOGS"]))
 
+logger = LazyLogger(__name__, level="info")
 
-from ffexport import read_and_merge, Visit
+CACHEW_PATH = "/tmp/browser-cachw"
+
+# create cache path
+os.makedirs(CACHEW_PATH, exist_ok=True)
+
+from ffexport import Visit
 from ffexport.save_hist import backup_history
+from ffexport.parse_db import read_visits
+from ffexport.merge_db import merge_visits
 
 @listify
 def inputs() -> Sequence[Path]:
@@ -60,7 +68,20 @@ def history(from_paths=inputs) -> Results:
     import my.browsing
     visits = list(my.browsing.history())
     """
-    yield from read_and_merge(*from_paths())
+    # only load items that are in the config.export path using cachew
+    # the 'live_file' is always going to be uncached
+    db_paths = list(from_paths())
+    tmp_path = db_paths.pop()
+    yield from merge_visits(*map(_read_history, db_paths), _read_history(tmp_path))
+
+
+def _browser_mtime(p: Path) -> int:
+    return int(p.stat().st_mtime)
+
+@mcachew(hashf=_browser_mtime, logger=logger, cache_path=lambda db_path: f"{CACHEW_PATH}/{str(db_path).replace('/','')}")
+def _read_history(db: Path) -> Iterator[Visit]:
+    yield from read_visits(db)
+
 
 def stats():
     from .core import stat
@seanbreckenridge
Owner Author

After using this for a while, I can say with a fair amount of confidence that it'd end up being slower. Better to read/merge from the DBs themselves.

If you're doing this all the time, I think it'd be most efficient to periodically read everything in using read_and_merge, dump it to a pickle file, and load that back into memory whenever you need it.
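Something along these lines (a rough sketch; the cache file location and the dump/load helpers are placeholders, not part of ffexport or HPI):

# rough sketch of the pickle approach; CACHE_FILE and the helper names are hypothetical
import pickle
from pathlib import Path
from ffexport import read_and_merge

CACHE_FILE = Path("/tmp/browser-visits.pickle")  # hypothetical location

def dump_visits(*db_paths: Path) -> None:
    # run periodically (e.g. from cron) to refresh the merged snapshot
    visits = list(read_and_merge(*db_paths))
    with CACHE_FILE.open("wb") as f:
        pickle.dump(visits, f)

def load_visits():
    # fast path: just unpickle the previously merged visits
    with CACHE_FILE.open("rb") as f:
        return pickle.load(f)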
