Newspaper3k Wrapper: Nepali Article scraping & curation

Newspaper3k Wrapper

A wrapper over the newspaper3k library that adds support for Nepali sites.

Current State

Works for the majority of Nepali news sites.
The list of tested sites (50+) is given in sites.json.
Currently, only the title and text fields are guaranteed to contain data.
Extraction of images and dates is also supported on most sites; check that it works on the sites you need before relying on it.
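
One way to check which fields a given site fills in is to parse a sample article and see which attributes come back non-empty. The sketch below is only an illustration, not part of the wrapper: `populated_fields` is a hypothetical helper, and it assumes the wrapper's Article exposes the same `title`, `text`, `top_image` and `publish_date` attributes as newspaper3k's Article.

```python
from types import SimpleNamespace

# Attribute names follow newspaper3k's Article; assumed unchanged by the wrapper.
FIELDS = ("title", "text", "top_image", "publish_date")


def populated_fields(article):
    """Return {field: bool} telling which fields hold data after parse()."""
    return {name: bool(getattr(article, name, None)) for name in FIELDS}


# Stand-in object so the sketch runs offline; with the real library you would
# pass an Article on which download() and parse() have already been called.
sample = SimpleNamespace(
    title="Sample title", text="Sample body", top_image="", publish_date=None
)
report = populated_fields(sample)
print(report)
```

Running this against one or two articles from each site you care about shows quickly whether images and dates can be relied on there.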

Installation

If you run into trouble while performing the steps below, see the newspaper3k documentation for detailed usage and installation help.

$ git clone https://github.com/pykancha/newspaper3k_wrapper.git
$ cd newspaper3k_wrapper
$ pip install -r requirements.txt
$ python setup.py install
$ python download_corpora.py

Use as dependency

Once you have run python download_corpora.py on your machine, you can use:

python -m pip install git+https://github.com/pykancha/newspaper3k_wrapper.git#egg=newspaper_wrapper

to install it as a regular package without cloning the repo.

You can get the download_corpora.py file without cloning the repo by running:

curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py -o download_corpora.py

To specify this git dependency for your project

Add this line to your requirements.txt file:

-e git+https://github.com/pykancha/newspaper3k_wrapper.git#egg=newspaper_wrapper

Alternatively, if you use Poetry

Run the command:

poetry add git+https://github.com/pykancha/newspaper3k_wrapper.git

Or add this line under [tool.poetry.dependencies] in your pyproject.toml file:

newspaper3k_wrapper = { git = "https://github.com/pykancha/newspaper3k_wrapper.git" }
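
Whichever route you take, a quick way to confirm the package landed in the active environment is to ask Python whether the module can be found. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, and `newspaper_wrapper` is the import name taken from the usage example in this README.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# "newspaper_wrapper" is the import name used in this README's usage example.
print("newspaper_wrapper installed:", is_installed("newspaper_wrapper"))
```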

Sample Usage

>>> from newspaper_wrapper import Article

>>> url = 'https://www.himalkhabar.com/news/113640'
>>> article = Article(url, language='hi')
>>> article.download()

>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'

>>> article.parse()
>>> article.nlp()

>>> print(article.title, article.text)
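
For scripts that process many URLs, the same steps can be wrapped in a small function. The sketch below is an illustration under assumptions, not the wrapper's own API: `extract_article` and its `article_cls` parameter are hypothetical names, and the function returns only the two fields this README says are guaranteed. It skips `nlp()`, since the title and text are available after `parse()`.

```python
def extract_article(url, language="hi", article_cls=None):
    """Download and parse one article, returning its title and text."""
    if article_cls is None:
        # Imported lazily so the function can be exercised with a stub class.
        from newspaper_wrapper import Article as article_cls
    article = article_cls(url, language=language)
    article.download()  # fetch the page HTML
    article.parse()     # populate title, text and the other parsed fields
    return {"url": url, "title": article.title, "text": article.text}


# Usage with the real library (needs network access):
#   data = extract_article("https://www.himalkhabar.com/news/113640")
#   print(data["title"])
```

Passing the Article class as a parameter is a design choice made here so the function can be tested offline with a stand-in class.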

Development

Refer to: Docs - Adding new languages

$ git clone https://github.com/pykancha/newspaper3k_wrapper
$ cd newspaper3k_wrapper
$ python -m pip install -e .
$ python download_corpora.py

Make your changes, then run the tests:

$ python tests/unit_tests.py
$ python tests/unit_tests.py fulltext
