A simple Python scraper to collect data from the most popular Peruvian news sites

rodp63/yams

Yet Another Media Scraper

Code style: black Imports: isort

This project aims to provide seamless access to news and media information. For now, it is a simple Python scraper that collects posts from the most popular Peruvian news websites.

Installation

Standard installation via pip:

$ pip install cli-yams

Or, you can install it manually:

$ git clone https://github.com/rodp63/yams.git
$ cd yams
$ pip install .

Quickstart

Let's start by getting all the posts from El Comercio containing the keyword "peru" from the last month (the default date range).

$ yams start newspaper elcomercio -k peru

We can define the date range to extract by using the -s (since) and -t (to) options. Let's get all the posts from Perú 21 containing the keyword "congreso" from the second half of 2023.

$ yams start newspaper peru21 -k congreso -s "2023-07-01" -t "2023-12-31"
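When the command is generated from a script, the two dates can be computed rather than typed by hand. The following is a minimal sketch, assuming the -s and -t options take dates in the YYYY-MM-DD form shown above; half_year_range and date_range_args are hypothetical helper names and are not part of YAMS.

```python
from datetime import date


def half_year_range(year, half):
    """Return (since, to) as YYYY-MM-DD strings for the given half of a year."""
    if half == 1:
        start, end = date(year, 1, 1), date(year, 6, 30)
    elif half == 2:
        start, end = date(year, 7, 1), date(year, 12, 31)
    else:
        raise ValueError("half must be 1 or 2")
    return start.isoformat(), end.isoformat()


def date_range_args(since, to):
    """Build the -s/-t part of a yams command line (hypothetical helper)."""
    return ["-s", since, "-t", to]


since, to = half_year_range(2023, 2)
print(" ".join(["yams", "start", "newspaper", "peru21", "-k", "congreso"]
               + date_range_args(since, to)))
```

The printed line matches the command above, except that the dates are not quoted; the quotes are optional in the shell because the dates contain no spaces.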

As you may notice, the output is printed to the screen by default. Usually we want to save the information instead, and the -o option stores the output in a JSON file. Let's get all the posts from El Correo containing the keyword "futbol" from the last month and save the output in the file futbol_posts.json.

$ yams start newspaper diariocorreo -k futbol -o futbol_posts
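Once the file exists, it can be loaded with the standard json module. This is a sketch under one assumption that the text above does not confirm: that the file holds a top-level JSON array with one entry per post. No field names are assumed, and load_posts is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_posts(path):
    """Load a YAMS output file, assuming it contains a JSON array of posts."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        # The array layout is an assumption; fail loudly if it does not hold.
        raise ValueError(f"expected a JSON array in {path}, got {type(data).__name__}")
    return data
```

For example, len(load_posts("futbol_posts.json")) would give the number of posts collected, if the file follows the assumed layout.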

So far we have used a single keyword, but we can use as many as we want. Let's get all the posts from El Comercio containing the words "messi" or "ronaldo" from the last month.

$ yams start newspaper elcomercio -k messi -k ronaldo

Although we can use keywords with more than one word (e.g. -k "chipi chapa"), it is not recommended, since keywords are stemmed and split into clean words. Sometimes, however, we want to search for exact terms, taking accents and letter case into account. In such situations we can use the --exact-match option. Let's get all the posts from Perú 21 containing the exact keyword "Luis Advíncula" from the last month.

$ yams start newspaper peru21 -k "Luis Advíncula" --exact-match
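To see why multi-word keywords behave differently in the two modes, here is a simplified illustration. It is not YAMS's actual implementation: stemming is left out, accent stripping and lowercasing stand in for the "clean words" step, and it assumes that each clean word is treated as its own keyword (as with repeated -k options). All function names are hypothetical.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase and strip accents (a stand-in for cleaning; no stemming here)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def loose_match(keyword, text):
    """True if any clean word of the keyword appears as a word in the text."""
    text_words = set(re.findall(r"\w+", normalize(text)))
    return any(word in text_words for word in re.findall(r"\w+", normalize(keyword)))


def exact_match(keyword, text):
    """True only if the keyword appears verbatim, accents and case included."""
    return keyword in text


headline = "Luis Suárez anotó dos goles"
print(loose_match("Luis Advíncula", headline))  # the word "luis" alone is enough
print(exact_match("Luis Advíncula", headline))  # the full term is not present
```

Under these assumptions the loose search also returns posts about any other "Luis", which is the kind of noise the exact search avoids.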

The task parameters are always displayed at the beginning of the process, so we are always aware of the details of the search that is about to start. We can also define parameters through environment variables, which is very useful when we want to run YAMS within containers, for example. We can check them using the info command.

$ yams info
...
  Environment:
    YAMS_NEWSPAPER: 
    YAMS_NEWSPAPER_SINCE: 
    YAMS_NEWSPAPER_TO: 
    YAMS_NEWSPAPER_KEYWORDS:
    YAMS_NEWSPAPER_OUTPUT:

Let's get all the posts from El Correo containing the keyword "ambiente" from the last month using environment variables.

$ export YAMS_NEWSPAPER="diariocorreo"
$ export YAMS_NEWSPAPER_KEYWORDS="ambiente"
$ yams start newspaper
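The same run can be launched from a Python script, for instance in a scheduled job. The sketch below sets only the two variables used above and passes a single keyword, since the format for several keywords in YAMS_NEWSPAPER_KEYWORDS is not described here. yams_environment and run_yams are hypothetical helper names, and run_yams does nothing when the yams executable is not on the PATH.

```python
import os
import shutil
import subprocess


def yams_environment(newspaper, keyword, base_env=None):
    """Return a copy of the environment with the YAMS variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["YAMS_NEWSPAPER"] = newspaper
    env["YAMS_NEWSPAPER_KEYWORDS"] = keyword
    return env


def run_yams(env):
    """Run `yams start newspaper` with the given environment.

    Returns the process exit code, or None if yams is not installed.
    """
    if shutil.which("yams") is None:
        return None
    completed = subprocess.run(["yams", "start", "newspaper"], env=env, check=False)
    return completed.returncode
```

Calling run_yams(yams_environment("diariocorreo", "ambiente")) is then equivalent to the three shell lines above, without changing the environment of the calling process.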

Please refer to the Dockerfile and manifest.yaml files to run YAMS within Docker and Kubernetes.
