A cookie-cutter library to support data scraping operations

trietmnj/scraperCookie

scraperCookie

This repo is a generalized library. Each specific implementation should include two or three components (the proxy is optional):

> Proxy (optional)
    TODO

> Scraper
    Implemented with a builder pattern so that client code is never exposed
    to a scraper that is only partially constructed
    https://golangbyexample.com/builder-pattern-golang/

    endpointScraper - targets API endpoints that return JSON
    htmlTableScraper - targets HTML <table> tags
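
    As a rough illustration of the builder idea described above, the sketch below keeps the scraper's fields unexported and only returns a scraper from Build() once the required settings are present. All names here (Scraper, ScraperBuilder, URL, Kind, Build) are hypothetical and are not taken from this library's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// Scraper is a hypothetical, fully constructed scraper. Its fields are
// unexported so code outside the package cannot assemble one by hand.
type Scraper struct {
	url  string
	kind string
}

// ScraperBuilder collects settings before a Scraper exists (hypothetical name).
type ScraperBuilder struct {
	url  string
	kind string
}

func NewScraperBuilder() *ScraperBuilder { return &ScraperBuilder{} }

// URL sets the target address and returns the builder for chaining.
func (b *ScraperBuilder) URL(u string) *ScraperBuilder { b.url = u; return b }

// Kind selects the scraper flavor: "endpoint" or "htmlTable" (assumed labels).
func (b *ScraperBuilder) Kind(k string) *ScraperBuilder { b.kind = k; return b }

// Build validates the collected settings and only then returns a Scraper,
// so callers never receive a half-configured value.
func (b *ScraperBuilder) Build() (*Scraper, error) {
	if b.url == "" {
		return nil, errors.New("scraper: url is required")
	}
	switch b.kind {
	case "endpoint", "htmlTable":
	default:
		return nil, fmt.Errorf("scraper: unknown kind %q", b.kind)
	}
	return &Scraper{url: b.url, kind: b.kind}, nil
}

func main() {
	s, err := NewScraperBuilder().
		URL("https://example.com/api").
		Kind("endpoint").
		Build()
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	fmt.Println("built", s.kind, "scraper for", s.url)
}
```

    Returning an error from Build() rather than panicking is one possible design choice; it lets a caller decide how to handle an incomplete configuration.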

> Store
    Automated store as an abstract interface
        type IStore interface {
            Init()
            Store(l Locator, data io.Reader) error
            Read(l Locator) []byte
            KeyExists(l Locator) (bool, error)
        }
    Data are stored under a predetermined key format:
        bucketName/ingest/repoName/sourceUrl/year/month/date/timeStamp-number.format
        timeStamp is referenced to UTC

Config should be located in .devcontainer/dev.env. BUCKET, DATASOURCE, and REPONAME are used with S3JsonStore.

AWS_REGION =
AWS_ACCESS_KEY_ID =
AWS_SECRET_ACCESS_KEY =
BUCKET =
DATASOURCE =
REPONAME =
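
One possible way to read these settings at startup is sketched below. It assumes the variables from dev.env have already been exported into the process environment, for example by the dev container. The Config struct and LoadConfig function are hypothetical names, not part of this library. The AWS_* variables are left to whichever AWS client the implementation uses.

```go
package main

import (
	"fmt"
	"os"
)

// Config holds the three values the README says S3JsonStore uses.
// The struct itself is a hypothetical helper.
type Config struct {
	Bucket     string
	DataSource string
	RepoName   string
}

// LoadConfig reads the settings through getenv (normally os.Getenv) and
// reports the first variable that is missing or empty.
func LoadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		Bucket:     getenv("BUCKET"),
		DataSource: getenv("DATASOURCE"),
		RepoName:   getenv("REPONAME"),
	}
	required := []struct{ name, value string }{
		{"BUCKET", cfg.Bucket},
		{"DATASOURCE", cfg.DataSource},
		{"REPONAME", cfg.RepoName},
	}
	for _, r := range required {
		if r.value == "" {
			return Config{}, fmt.Errorf("config: %s is not set", r.name)
		}
	}
	return cfg, nil
}

func main() {
	cfg, err := LoadConfig(os.Getenv)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("bucket=%s datasource=%s repo=%s\n", cfg.Bucket, cfg.DataSource, cfg.RepoName)
}
```

Passing the lookup function as a parameter keeps the loader easy to test without touching the real environment.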
