red-fox-star/search

Running This

Prerequisites

If configured, the server uses Redis as a low-level persistent cache store. A Redis instance running on localhost is sufficient.

Run shards to install dependencies.

Running

Without persistent cache: crystal run src/server.cr, then open http://0.0.0.0:3000.

With persistent cache: crystal run src/server.cr -Dpersistent_cache
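
The -Dpersistent_cache option defines a compile-time flag. A minimal sketch of how such a flag can be checked with Crystal's flag? macro method follows; the cache_mode method and its return strings are hypothetical and are not taken from this repository.

```crystal
# Hypothetical helper: reports which cache tiers were compiled in.
# `flag?` is evaluated at compile time, so only one branch ends up in the binary.
def cache_mode : String
  {% if flag?(:persistent_cache) %}
    "memory + redis"
  {% else %}
    "memory only"
  {% end %}
end

puts cache_mode
```

Because the check happens at compile time, a build made without the flag contains no trace of the other branch.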

Explanation

When Parse Dates is selected:

  • Each page in the list is scraped.
  • Any pages which have already been cached are returned immediately.
  • Any pages which are not yet cached are requested asynchronously over a basic JavaScript WebSocket connection.

When Search is selected:

  • Each page in the list is fetched from cache.
  • A ranked text search is performed over the full set of pages in the cache.
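
The README does not say how the rank is computed. As one possible illustration, the sketch below ranks pages by plain term frequency. The Page record, score, and rank are hypothetical names, and the real implementation may use a different scoring scheme.

```crystal
# Hypothetical page record; the real cache entries likely hold more fields.
record Page, url : String, text : String

# Count how often the query terms occur in the page text (case-insensitive).
def score(page : Page, terms : Array(String)) : Int32
  words = page.text.downcase.scan(/\w+/).map(&.[0])
  terms.sum { |term| words.count(term) }
end

# Return the pages with a non-zero score, best match first.
def rank(pages : Array(Page), query : String) : Array(Page)
  terms = query.downcase.split
  scored = pages.map { |page| {page, score(page, terms)} }
  scored.select { |entry| entry[1] > 0 }
    .sort_by { |entry| -entry[1] }
    .map { |entry| entry[0] }
end

sample = [
  Page.new("a", "crystal is fast"),
  Page.new("b", "Crystal crystal shards"),
  Page.new("c", "nothing here"),
]
puts rank(sample, "crystal").map(&.url).join(", ")
```

Pages that match none of the query terms are dropped instead of being listed with a zero score.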

Scraping

  • I used the fantastic and wicked fast lexbor shard to parse and navigate HTML.
  • No consideration is given to "too much parallelism": complete trust is given to the Crystal runtime to manage the fibers.
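
The unbounded fiber-per-page approach can be sketched roughly as follows. This is an illustration only: scrape is a hypothetical stand-in that sleeps instead of making an HTTP request and parsing with lexbor, and scrape_all is not a name from this repository.

```crystal
# Hypothetical stand-in for the real fetch-and-parse step.
def scrape(url : String) : String
  sleep 10.milliseconds # simulate network latency
  "scraped #{url}"
end

# Spawn one fiber per URL with no concurrency limit, then collect every result.
def scrape_all(urls : Array(String)) : Hash(String, String)
  results = Channel(Tuple(String, String)).new
  urls.each do |url|
    spawn { results.send({url, scrape(url)}) }
  end

  pages = {} of String => String
  urls.size.times do
    url, body = results.receive
    pages[url] = body
  end
  pages
end

puts scrape_all(["https://a.example", "https://b.example"]).size
```

Since the fibers overlap while waiting, total time stays close to the slowest single page instead of the sum of all pages.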

Date parsing algorithms

  • A hierarchy of parsing algorithms is implemented, and the first matching method is trusted.
  • For speed, pages which contain some sort of standardized declaration of publish date are quickly parsed with a dedicated routine for that kind of page.
  • Algorithms are vaguely sorted by speed, so the first to match is probably the fastest that will work for that page.
  • The final search is a free-form text/regex based fallback, which takes the longest by far.
  • Faster algorithms have a higher degree of trust. For example, it is trusted that the YouTube itemprop=datePublished meta tag is correct.
  • See scraper/date_extractor.cr and scraper/date_extractors/.
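
The first-match-wins hierarchy might look something like the sketch below. All class and method names here are hypothetical and do not reflect the actual contents of scraper/date_extractors/. The free-text fallback is reduced to a single ISO-style regex for brevity.

```crystal
# Hypothetical shared parser: returns nil instead of raising on bad input.
def parse_day(text : String) : Time?
  Time.parse_utc(text, "%Y-%m-%d")
rescue ex : Time::Format::Error | ArgumentError
  nil
end

# Hypothetical extractor interface.
abstract class DateExtractor
  abstract def extract(html : String) : Time?
end

# Fast path: a standardized <meta itemprop="datePublished" content="..."> tag.
class MetaTagExtractor < DateExtractor
  PATTERN = /itemprop="datePublished"\s+content="(\d{4}-\d{2}-\d{2})/

  def extract(html : String) : Time?
    if match = PATTERN.match(html)
      parse_day(match[1])
    end
  end
end

# Slow fallback: the first ISO-like date anywhere in the text.
class FreeTextExtractor < DateExtractor
  PATTERN = /\b(\d{4}-\d{2}-\d{2})\b/

  def extract(html : String) : Time?
    if match = PATTERN.match(html)
      parse_day(match[1])
    end
  end
end

# Ordered fastest first; the first non-nil result is trusted.
EXTRACTORS = [MetaTagExtractor.new, FreeTextExtractor.new] of DateExtractor

def publish_date(html : String) : Time?
  EXTRACTORS.each do |extractor|
    if date = extractor.extract(html)
      return date
    end
  end
  nil
end
```

Adding a new page-specific routine then only means inserting another extractor at the appropriate position in the list.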

Caching

  • A simple hash is used for caching fully parsed and analyzed pages in memory.
  • A second-tier persistent cache is implemented in Redis. This stores only the raw HTML, so a page fetched from the persistent cache needs to be re-analyzed.
  • No thought has been given to expiring cache entries at either tier.
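
A rough sketch of the two-tier lookup follows, with a plain Hash standing in for Redis so the example stays self-contained. TieredCache, AnalyzedPage, and the word-count "analysis" are hypothetical and only illustrate the lookup order.

```crystal
# Hypothetical analyzed-page type; stands in for the real parsed result.
record AnalyzedPage, url : String, word_count : Int32

class TieredCache
  def initialize
    @memory = {} of String => AnalyzedPage # tier 1: fully analyzed pages
    @persistent = {} of String => String   # tier 2: raw HTML (Redis stand-in)
  end

  # Try tier 1, then tier 2 (which requires re-analysis), and only then
  # call the block to fetch the raw HTML.
  def fetch(url : String, &block : String -> String) : AnalyzedPage
    if page = @memory[url]?
      return page
    end

    html = @persistent[url]?
    unless html
      html = yield url
      @persistent[url] = html
    end
    @memory[url] = analyze(url, html)
  end

  # Placeholder analysis: counts whitespace-separated words.
  private def analyze(url : String, html : String) : AnalyzedPage
    AnalyzedPage.new(url, html.split.size)
  end
end

cache = TieredCache.new
page = cache.fetch("https://a.example") { |_url| "one two three" }
puts page.word_count
```

As in the description above, nothing is ever evicted from either tier in this sketch.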
