The NewsLookout web scraping application gathers and classifies financial events from public news websites and market data for India. It is a scalable, modular, configurable, multi-threaded Python console application. It is readily extended through its plugin architecture: plugins can be added to scrape additional news sources, perform custom data pre-processing, and run NLP-based news text analytics such as entity recognition, negative event classification, and economy and industry trend analysis.
There are already a number of Python libraries available for web scraping, so why consider this application for scraping news? Because it has been built specifically for sourcing financial news events and offers several useful features. A few notable ones are:
- Text tone classification using a deep learning NLP model to indicate positive, neutral or negative news
- Text de-duplication using a deep learning NLP model
- Built-in NLP models for keyword extraction
- Multi-threaded for scraping several news sites in parallel
- Reduces network traffic, and consequently web server load, by pausing between network requests. Naive scraping code harms web servers by generating intense bursts of requests, which is why websites rightly detect and block it. This application introduces delays between successive fetches to limit bandwidth use and avoid overloading the news web servers; in other words, it is built to behave responsibly.
- Includes a data processing pipeline that is configurable by defining the execution order of the data-processing plugins
- Processes multiple news items in parallel to speed up handling of thousands of articles
- Extensible with custom plugins that can be written rapidly, with minimal additional code, to support additional news sources; writing a new plugin does not require writing low-level code to handle network traffic and the HTTP protocol
- Rigorously tested against the specific websites enabled in the plugins; handles several quirks and formatting problems caused by inconsistent, non-standard HTML
- Rigorous text cleaning, tested for each of the implemented sites
- Keeps track of failures and of the pages already scraped, to avoid re-visiting them
- Highly configurable functionality
- Enterprise-ready functionality: configurable event logging, segregation of data storage locations from program executables, minimal permissions required to run the executable, etc.
- Works with proxy servers
- Runnable without a frontend, as a daemon.
- Extensible data processing plugins to customize the data processing required after web scraping
- Enables scraping of news archives to retrieve articles from previous dates and establish a history for analysis
- Saves the current session state and resumes downloading unfinished URLs if the application is shut down midway through web scraping
- Dockerfile available to build and deploy the application in a container, either standalone or in a Kubernetes cluster
Install the dependencies using pip:
pip install -r requirements.txt
Install the application via pip:
pip install newslookout
Caution: As a security best practice, it is strongly recommended to run the application under its own separate Operating System level user ID without any special privileges.
Next, create and configure separate locations for:
- The application itself (not required if you're installing via the wheel or pip installer)
- The data files downloaded from the news websites, e.g. -
/var/cache/newslookout
- The log file, e.g. -
/var/log/newslookout/newslookout.log
- The PID file, e.g. -
/var/run/newslookout.pid
Set these parameters in the configuration file.
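If you are creating these locations by hand, a short Python snippet along these lines will do it (the paths are the examples above; creating directories under /var typically requires elevated privileges):
from pathlib import Path

# Example locations from above; adjust them to match your configuration
for directory in ("/var/cache/newslookout", "/var/log/newslookout", "/var/run"):
    Path(directory).mkdir(parents=True, exist_ok=True)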
Download the spaCy model using this command:
python -m spacy download en_core_web_lg
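To confirm that the model is available in the environment the application will run under, you can load it from a Python shell (a quick sanity check, not part of the application itself):
import spacy

# Loading the model verifies the download; this raises an OSError if the model is missing
nlp = spacy.load("en_core_web_lg")
print(nlp.meta["lang"], nlp.meta["name"], nlp.meta["version"])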
For NLTK, refer to the NLTK website for instructions on downloading the data - https://www.nltk.org/data.html. Specifically, the following datasets need to be downloaded:
- reuters
- universal_treebanks_v20
- maxent_treebank_pos_tagger
- punkt
To do this, you can either use the NLTK downloader, by running the following commands in a Python shell:
import nltk
nltk.download('punkt')
nltk.download('maxent_treebank_pos_tagger')
nltk.download('reuters')
nltk.download('universal_treebanks_v20')
Alternatively, you could manually download these from the source location - https://github.com/nltk/nltk_data/tree/gh-pages/packages
If these are not installed in one of the standard locations, you will need to set the NLTK_DATA environment variable to point to the directory containing the NLTK data.
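If you prefer to keep the NLTK data in a custom directory, a sketch such as the one below (the directory path is only an example) downloads the datasets into that directory and makes it visible to NLTK:
import nltk

# Illustrative only: download the required datasets into a custom directory
custom_dir = "/var/cache/newslookout/nltk_data"   # example path, adjust as needed
for dataset in ("punkt", "maxent_treebank_pos_tagger", "reuters", "universal_treebanks_v20"):
    nltk.download(dataset, download_dir=custom_dir)

# Either set the NLTK_DATA environment variable to this directory before starting
# the application, or register it programmatically:
nltk.data.path.append(custom_dir)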
All the parameters for the application can be configured via the configuration file. Both the configuration file and the date for which the web scraper is to be run are passed as command-line arguments to the application.
The key parameters that need to be configured are:
- Application root folder: prefix
- Data directory: data_dir
- Plugin directory: plugins_dir
- Contributed plugins directory: plugins_contributed_dir
- Enabled plugins: add the name of the Python file (without the file extension) under the plugins section, e.g. plugin01=mod_my_plugin
- Network proxy (if any): proxy_url_https
- The level of logging: log_level
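As an illustration, an INI-style fragment covering the parameters above might look like the following; the [plugins] section name is the one referred to in this document, while the other section name and all the values are placeholders to be adapted from the sample configuration file shipped with the application:
[installation]
prefix=/opt/newslookout
data_dir=/var/cache/newslookout
plugins_dir=/opt/newslookout/plugins
plugins_contributed_dir=/opt/newslookout/plugins_contrib
proxy_url_https=http://proxy.example.com:8080
log_level=INFO

[plugins]
plugin01=mod_my_plugin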
Installing the wheel or installing via pip will generate a script named newslookout, placed in your local folder.
This invokes the main method of the application and should be passed the two required arguments: the configuration file and the date for which the application is to be run.
For example:
newslookout -c myconfigfile.conf -d 2020-01-01
In addition, two scripts are provided, one for UNIX-like operating systems and one for Microsoft Windows. For convenience, you may run these shell scripts to start the application; they automatically generate the current date and supply it as an argument to the Python application. For small setups, it is best to schedule the scripts or the command line via the UNIX cron scheduler or the Microsoft Windows Task Scheduler for automated execution. In large enterprise environments, batch job coordination software such as Control-M, IBM Tivoli, or any other job scheduling framework may be configured to run it for reliable and automated execution.
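For instance, a crontab entry along these lines (the schedule and the configuration file path are placeholders) would run the scraper once a day with the current date, much like the provided shell scripts do; note that % characters must be escaped in crontab entries:
30 18 * * * newslookout -c /etc/newslookout/newslookout.conf -d $(date +\%Y-\%m-\%d)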
The application creates a PID (process identifier) file upon startup to prevent multiple instances from being launched at the same time. On startup, it checks whether this file exists; if it does, the application stops. If the application is killed or shuts down abruptly without cleaning up, the PID file will remain and will need to be deleted manually.
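The behaviour described above is the usual PID-file pattern; a minimal sketch of the idea (not the application's actual code) looks like this:
import os
import sys

PID_FILE = "/var/run/newslookout.pid"   # the same path configured earlier

if os.path.exists(PID_FILE):
    # Another instance appears to be running, or a stale file was left behind
    sys.exit("PID file exists; refusing to start a second instance.")

with open(PID_FILE, "w") as pid_file:
    pid_file.write(str(os.getpid()))

# ... run the application, then remove the file on a clean shutdown:
os.remove(PID_FILE)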
The application displays its progress on stdout, for example:
NewsLookout Web Scraping Application, Version 1.9.9
Python version: 3.8.5 (Linux)
Run date: 2021-06-10
Reading configuration from: conf/newslookout.conf
Logging events to file: logs/newslookout.log
Using PID file: data/newslookout.pid
URLs identified: 100%|██████████████████████████████████████████████████████████| 14/14 [1h 48:14<00:00, 0.00 Plugins/s]
Data downloaded: 100%|██████████████████████████████████████████████████████| 1474/1474 [1h 48:14<00:00, 0.23 URLs/s]
Data processed: 38%|███████████████████▌ | 384/1007 [1h 48:14<2h 55:36, 0.06 Files/s]
For a more detailed log of events, refer to the log file. It captures all events with a timestamp and the name of the module that generated each event.
2021-01-01 01:31:50:[INFO]:queue_manager:4360: 13 worker threads available to fetch content.
...
2021-01-01 02:07:51:[INFO]:worker:320: Progress Status: 1117 URLs successfully scraped out of 1334 processed; 702 URLs remain.
...
2021-01-01 03:02:10:[INFO]:queue_manager:5700: Completed fetching data on all worker threads
You can extend the web scraper to cover any additional website you need scraped by taking the template file template_for_plugin.py from the plugins_contrib folder and customising it. Name your custom plugin file with the same name as its class. Place it in the plugins_contrib folder (or whichever folder you have set in the configuration file). Next, add the plugin's name in the configuration file. It will be read, instantiated and run automatically by the application on the next startup. Take a look at the code of one of the already implemented plugins for examples of how a plugin can be written.
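To give a rough feel for the structure, a skeleton could look like the sketch below; the attribute and method names here are purely hypothetical, and the authoritative structure is the one in template_for_plugin.py and the bundled plugins:
# mod_my_plugin.py -- hypothetical skeleton; copy template_for_plugin.py for the real structure

class mod_my_plugin:
    """Plugin for scraping a hypothetical news site; the class name matches the file name."""

    def __init__(self):
        # Site-specific settings, for example the base URL of the news site to scrape
        self.mainURL = "https://news.example.com/"

    def extractArticleBody(self, html_content):
        # Site-specific parsing of the article text from the fetched HTML goes here
        return ""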
The application automatically rotates the log file once it reaches the configured maximum size. The data directory needs to be monitored, since its size can grow quickly due to the data scraped from the web.
For enterprise installations, log watching may be enabled to monitor the operation of the application by watching for specific event entries in the log file.
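As a simple illustration (this is not a built-in feature), a monitoring job could scan the log for warning and error entries, assuming the standard Python logging level names appear in the log file:
# Illustrative log scan; the log path is the one set in the configuration file
log_path = "/var/log/newslookout/newslookout.log"

with open(log_path, encoding="utf-8") as log_file:
    for line in log_file:
        if "[ERROR]" in line or "[WARNING]" in line:
            print(line.rstrip())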
If news portals change their page structure, the web scraping code in the respective plugin will need to be updated to continue retrieving information reliably. This requires careful monitoring of the output to check for parsing-related problems.