A wrapper around the newspaper3k/newspaper Python library for scraping Nepali news sites

License

MIT (LICENSE), Apache-2.0 (GOOSE-LICENSE.txt)

pykancha/newspaper3k_wrapper

 
 


Newspaper3k Wrapper: Nepali Article scraping & curation

Newspaper3k Wrapper

A wrapper over the newspaper3k library that adds support for Nepali sites.

Current State

  • Works for the majority of Nepali news sites.
  • The list of tested sites (50+) is given in sites.json.
  • Currently only the title and text fields are guaranteed to contain data.
  • Extraction of images and dates is also supported on most sites; check that it is supported on the sites you need before relying on it.
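
Because only the title and text are guaranteed, it can help to check which fields a given site actually fills in before depending on them. The following is a minimal sketch, not part of the wrapper: populated_fields is a hypothetical helper, and it assumes the parsed article exposes the title, text, top_image, and publish_date attributes that newspaper3k's Article provides. A stand-in object is used so the sketch runs without network access.

```python
from types import SimpleNamespace

# Attribute names follow newspaper3k's Article; the helper itself is hypothetical.
FIELDS = ("title", "text", "top_image", "publish_date")


def populated_fields(article, fields=FIELDS):
    """Map each field name to True if the parsed article has a non-empty value."""
    return {name: bool(getattr(article, name, None)) for name in fields}


# Stand-in for a parsed article (assumption: a real one comes from article.parse()).
sample = SimpleNamespace(title="Headline", text="Body", top_image="", publish_date=None)
report = populated_fields(sample)
print(report)
```

In real use, the object passed in would be an Article on which download() and parse() have already been called.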

Installation

If you run into trouble while performing the installation steps below, visit newspaper3k for detailed usage and installation help.

$ git clone https://github.com/pykancha/newspaper3k_wrapper.git
$ cd newspaper3k_wrapper
$ pip install -r requirements.txt
$ python setup.py install
$ python download_corpora.py

Use as dependency

Once you have run the python download_corpora.py command on your machine, you can run:

python -m pip install git+https://github.com/pykancha/newspaper3k_wrapper.git#egg=newspaper_wrapper

to install it as a regular package without cloning the repo.

You can get the download_corpora.py file without cloning the repo through:

curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py -o download_corpora.py

To specify this git dependency for your project, add the following line to your requirements.txt file:

-e git+https://github.com/pykancha/newspaper3k_wrapper.git#egg=newspaper_wrapper

Alternatively, if you use Poetry, run the command:

poetry add git+https://github.com/pykancha/newspaper3k_wrapper.git

Or add this entry to the dependencies in your pyproject.toml file:

newspaper3k_wrapper = { git = "https://github.com/pykancha/newspaper3k_wrapper.git" }
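
Whichever install route is used, a quick check that the package is importable can catch a failed install early. This is a minimal sketch using only the standard library; is_installed is a hypothetical helper name, and the module name newspaper_wrapper is taken from the import used in the sample usage.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# Prints True only if the wrapper was installed into the current environment.
print(is_installed("newspaper_wrapper"))
```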

Sample Usage

>>> from newspaper_wrapper import Article

>>> url = 'https://www.himalkhabar.com/news/113640'
>>> article = Article(url, language='hi')
>>> article.download()

>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'

>>> article.parse()
>>> article.nlp()

>>> print(article.title, article.text)
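
For scraping more than one URL, the download and parse steps from the session above can be wrapped in a small function. This is a sketch under stated assumptions, not part of the wrapper: scrape is a hypothetical helper, and FakeArticle is an offline stand-in that mimics only the calls used above so the example runs without network access. In real use, pass the Article class from newspaper_wrapper instead.

```python
def scrape(url, article_cls, language="hi"):
    """Download and parse one article, returning the two guaranteed fields."""
    article = article_cls(url, language=language)
    article.download()
    article.parse()
    return {"url": url, "title": article.title, "text": article.text}


class FakeArticle:
    """Offline stand-in (hypothetical) with the same calls as Article above."""

    def __init__(self, url, language="en"):
        self.url, self.language = url, language
        self.title, self.text = "", ""

    def download(self):
        pass  # a real Article fetches the page HTML here

    def parse(self):
        self.title, self.text = "Sample title", "Sample body"


result = scrape("https://www.himalkhabar.com/news/113640", FakeArticle)
print(result["title"])
```

Passing the class as an argument keeps the helper testable offline; only title and text are returned because those are the fields stated as guaranteed.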

Development

Refer to: Docs - Adding new languages

$ git clone https://github.com/pykancha/newspaper3k_wrapper
$ cd newspaper3k_wrapper
$ python -m pip install -e .
$ python download_corpora.py

Make changes and run the tests

$ python tests/unit_tests.py 
$ python tests/unit_tests.py fulltext

