News headers

Scrape Swedish news sites to get the headers. All methods are allowed; the current scrapers use DOM parsing, GraphQL, parsing of <script> tags and more. :)

* Must be initialized with a sha256 hash
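
As an illustration of the <script> tag approach mentioned above, here is a minimal sketch using only the standard library. The page source, the JSON layout and the class name ScriptJSONParser are made up for the example and are not taken from this repository; real sites embed their data in their own formats.

```python
import json
from html.parser import HTMLParser


class ScriptJSONParser(HTMLParser):
    """Collect the JSON payload of every <script type="application/json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            self.payloads.append(json.loads("".join(self._buffer)))


# Hypothetical page source; a real scraper would download this first.
SAMPLE_HTML = """
<html><body>
<script type="application/json">
{"articles": [{"title": "A Title", "url": "https://a-url.se"}]}
</script>
</body></html>
"""

parser = ScriptJSONParser()
parser.feed(SAMPLE_HTML)
parser.close()

titles = [a["title"] for p in parser.payloads for a in p["articles"]]
print(titles)
```

Reading embedded JSON like this tends to be less brittle than matching CSS classes, but it only works on sites that ship their article data inside the page.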

Example usage

>>> from scraper import Aftonbladet, SVT
>>> s = SVT()
>>> headers = s.headers()
>>> print(headers[0])
Dödsfall som kopplas till e-cigg ökarny studie analyserar skadorna
Forskare: Som att utsättas för senapsgas
https://svt.se/nyheter/utrikes/antal-dodsfall-kopplade-till-e-cigg-okar
>>> a = Aftonbladet()
>>> headers = a.headers()
>>> print(headers[5])
Varför ska vi amma för att rädda klimatet?
Öhagen Britterna kan väl sluta dricka te i stället
https://www.aftonbladet.se/family/a/P9w4Q5/varfor-ska-vi-amma-for-att-radda-klimatet
>>> headers[3].title
Stänger alla butikeroch ger ledigt för fest
>>> headers[3].url
https://www.aftonbladet.se/nyheter/a/vQygkp/jysk-ger-alla-anstallda-ledigt--dagen-efter-personalfest

Implement a new subclass

Just extend Scraper and implement url and headers.

import header
from scraper import Scraper  # assumption: the base class lives in the scraper module

class MySite(Scraper):
    @classmethod
    def url(cls):
        """
        The URL for the site.
        """
        return "https://mysite.se"

    def headers(self):
        """
        Return a list of all headers for the site.
        """

        return [
            header.Header(
                "A Title",
                "A text",
                "https://a-url.se",
                False,  # paywall flag: True if the article is behind a paywall
            )
        ]
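
For a slightly fuller picture, the sketch below shows how headers could be filled in with DOM parsing. It is self-contained, so Header here is a hypothetical stand-in for header.Header (the title and url attributes and the three-line printout follow the usage example above; the text and paywall field names are assumed), MySite does not extend the real base class, and the HTML and the "teaser" class name are invented.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Header:
    """Hypothetical stand-in for header.Header (field names are assumed)."""
    title: str
    text: str
    url: str
    paywall: bool

    def __str__(self):
        return f"{self.title}\n{self.text}\n{self.url}"


class TeaserParser(HTMLParser):
    """Collect (href, link text) for every <a class="teaser"> element."""

    def __init__(self):
        super().__init__()
        self._href = None
        self._text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "teaser":
            self._href = attrs.get("href", "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


class MySite:
    """Sketch only: a real scraper would extend the repository's base class."""

    # Invented markup; a real scraper would fetch the page at url().
    SAMPLE_HTML = '<a class="teaser" href="/a/1">A Title</a><a href="/about">About</a>'

    @classmethod
    def url(cls):
        return "https://mysite.se"

    def headers(self):
        parser = TeaserParser()
        parser.feed(self.SAMPLE_HTML)
        parser.close()
        return [
            Header(title, "", self.url() + href, False)
            for href, title in parser.links
        ]


headers = MySite().headers()
print(headers[0])
```

Only the link marked as a teaser ends up in the result; the "About" link is skipped because it lacks the class.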

Watcher

A simple watcher is bundled with the repository to make it easier to watch for new articles from the scrapers you choose. Example usage:

from scraper import SVT, DN
from watcher import Watcher

scrapers = [SVT(), DN()]
w = Watcher(scrapers, 60)

for a in w.articles():
    print("New article posted!")
    print(a)
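
How the bundled Watcher works internally is not described here. One plausible way to build such a loop is to poll every scraper and remember which URLs have already been seen, as in the sketch below. The watch function, the max_polls parameter, FakeScraper and Article are all hypothetical names made up for this example, and it assumes the second Watcher argument is a polling interval in seconds.

```python
import time
from collections import namedtuple


def watch(scrapers, interval, max_polls=None):
    """Yield headers whose URL has not been seen before.

    Polls every scraper, then sleeps `interval` seconds. `max_polls`
    exists only so the example terminates; None means poll forever.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for scraper in scrapers:
            for h in scraper.headers():
                if h.url not in seen:
                    seen.add(h.url)
                    yield h
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval)


# Hypothetical stand-ins so the sketch runs without network access.
Article = namedtuple("Article", "title url")


class FakeScraper:
    """Returns one pre-baked batch of headers per call to headers()."""

    def __init__(self, batches):
        self._batches = iter(batches)

    def headers(self):
        return next(self._batches, [])


a1 = Article("First", "https://a-url.se/1")
a2 = Article("Second", "https://a-url.se/2")
fake = FakeScraper([[a1], [a1, a2]])

new_titles = [h.title for h in watch([fake], interval=0, max_polls=2)]
print(new_titles)
```

In this sketch everything found on the first poll counts as new; the bundled Watcher may treat the initial set of articles differently.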