
NewsLookout Web Scraping Application

The NewsLookout web scraping application gathers and classifies financial events from public news websites and market data for India. It is a scalable, modular and configurable multi-threaded Python console application. The application is readily extended by adding custom modules via its 'plugin' architecture. Plugins can be added for a variety of tasks, including scraping additional news sources, performing custom data pre-processing, and running NLP-based news text analytics such as entity recognition, negative event classification, economy trends and industry trends.

Features

There are already a number of Python libraries available for web scraping, so why consider this application for scraping news? Because it has been built specifically for sourcing financial news events and offers several useful features. A few notable ones are:

  • Text tone classification using deep learning NLP model to indicate positive, neutral or negative news
  • Text de-duplication using deep-learning NLP model
  • Built-in NLP models for keyword extraction
  • Multi-threaded for scraping several news sites in parallel
  • Reduces network traffic, and consequently web server load, by pausing between network requests. Naive scraping code harms web servers by generating intense bursts of requests, which is why websites detect and block such scrapers for good reason. This application introduces delays between successive fetches to limit bandwidth usage and avoid overloading the news web servers. In other words, it is built to behave responsibly.
  • Includes a data processing pipeline that is configurable by defining the execution order of the data-processing plugins
  • Performs data processing on multiple news items in parallel to speed up processing of thousands of articles
  • Extensible with custom plugins that can be written rapidly with minimal additional code to support additional news sources. Writing a new plugin does not require writing low-level code to handle network traffic and the HTTP protocol.
  • Rigorously tested for the specific websites enabled in the plugins; handles several quirks and formatting problems caused by inconsistent and non-standard HTML code
  • Rigorous text cleaning tested for each of the sites implemented
  • Keeps track of failures and the history of sites scraped to avoid re-visiting them
  • Highly configurable functionality
  • Enterprise ready functionality - configurable event logging, segregation of data storage locations vs. program executables, minimum permissions to run the executable, etc.
  • Works with proxy servers
  • Runnable without a frontend, as a daemon.
  • Extensible data processing plugins to customize the data processing required after web scraping
  • Enables web-scraping news archives to get news from previous dates for establishing history for analysis
  • Saves the current session state and resumes downloading unfinished URLs in case the application is shut-down midway during web scraping
  • Docker file available to build and deploy the application in a container either standalone or in a Kubernetes cluster

Installation

Install the dependencies using pip:

pip install -r requirements.txt

Install the application via pip:

pip install newslookout

Caution: As a security best practice, it is strongly recommended to run the application under its own separate Operating System level user ID without any special privileges.

Next, create and configure separate locations for:

  • The application itself (not required if you're installing via the wheel or pip installer)
  • The data files downloaded from the news websites, e.g. - /var/cache/newslookout
  • The log file, e.g. - /var/log/newslookout/newslookout.log
  • The PID file, e.g. - /var/run/newslookout.pid

Set these parameters in the configuration file.
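
For example, on a Linux system these locations could be prepared as follows (the paths and the dedicated newslookout user are only suggestions matching the examples above; adapt them to your environment):

# example paths and user name only - adjust to your setup
mkdir -p /var/cache/newslookout /var/log/newslookout
chown -R newslookout: /var/cache/newslookout /var/log/newslookout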

NLP Data

Download the spacy model using this command:

python -m spacy download en_core_web_lg

For NLTK, refer to the NLTK website for instructions on downloading the data - https://www.nltk.org/data.html. Specifically, the following data needs to be downloaded:

  1. reuters
  2. universal_treebanks_v20
  3. maxent_treebank_pos_tagger
  4. punkt

To do this, you can use the NLTK downloader by running the following commands in a Python session:

import nltk
nltk.download('punkt')
nltk.download('maxent_treebank_pos_tagger')
nltk.download('reuters')
nltk.download('universal_treebanks_v20')

Alternatively, you could manually download these from the source location - https://github.com/nltk/nltk_data/tree/gh-pages/packages

If these are not installed to one of the standard locations, you will need to set the NLTK_DATA environment variable to specify the location of this NLTK data. Refer to the instructions given at the NLTK website about downloading these model files - https://www.nltk.org/data.html.
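
For example, on Linux the environment variable could be set as follows (the path shown is only a placeholder for wherever you placed the NLTK data):

# placeholder path - point it at your NLTK data directory
export NLTK_DATA=/opt/newslookout/nltk_data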

Configuration

All the parameters for the application can be configured via the configuration file. Both the configuration file, and the date for which the web scraper is to be run, are passed as command line arguments to the application.

The key parameters that need to be configured are listed below; an illustrative configuration fragment follows the list:

  1. Application root folder: prefix
  2. Data directory: data_dir
  3. Plugin directory: plugins_dir
  4. Contributed Plugins: plugins_contributed_dir
  5. Enabled plugins: Add the name of the python file (without file extension) under the plugins section as: plugin01=mod_my_plugin
  6. Network proxy (if any): proxy_url_https
  7. The level of logging: log_level
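
A minimal configuration fragment using these parameters might look like the following. The parameter names are the ones listed above, but the section names and all values are assumptions shown for illustration only; refer to the configuration file shipped with the application for the authoritative layout.

# illustrative only: section names and values are assumptions
[installation]
prefix=/opt/newslookout
data_dir=/var/cache/newslookout
plugins_dir=/opt/newslookout/plugins
plugins_contributed_dir=/opt/newslookout/plugins_contrib

[plugins]
plugin01=mod_my_plugin

[operation]
proxy_url_https=http://proxy.example.com:8080
log_level=INFO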

Usage

Installing the wheel or via pip generates a script named newslookout in your Python environment's scripts folder. It invokes the main method of the application and must be passed the two required arguments - the configuration file and the date for which the application is run. For example:

newslookout -c myconfigfile.conf -d 2020-01-01

In addition, two scripts are provided, one for UNIX-like systems and one for Windows. For convenience, you may run these to start the application; they automatically generate the current date and supply it as an argument to the Python application. For small setups, it is best to schedule the script or command line via the UNIX cron scheduler or the Microsoft Windows Task Scheduler. In large enterprise environments, batch job coordination software such as Ctrl-M, IBM Tivoli, or any other job scheduling framework may be configured to run it for reliable and automated execution.
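
For example, a crontab entry such as the following could run the scraper every morning (the installation paths are assumptions; note that the % character must be escaped inside crontab entries):

# example paths only; \% escaping is required inside a crontab
30 6 * * * /usr/local/bin/newslookout -c /etc/newslookout/newslookout.conf -d $(date +\%Y-\%m-\%d)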

PID File

The application creates a PID (process identifier) file upon startup to prevent multiple instances from running at the same time. On startup, it checks whether this file exists; if it does, the application stops. If the application is killed or shuts down abruptly without cleaning up, the PID file will remain and must be deleted manually.
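
A minimal sketch of this common PID-file pattern is shown below. It illustrates the mechanism only and is not the application's actual code; the path is the example used in the Installation section.

import os
import sys

PID_FILE = "/var/run/newslookout.pid"  # example path, set via the configuration file

def acquire_pid_file(pid_file: str = PID_FILE) -> None:
    """Refuse to start if a PID file already exists, otherwise record our PID."""
    if os.path.exists(pid_file):
        # Another instance appears to be running, or a stale file was left
        # behind after an abrupt shutdown and must be deleted manually.
        sys.exit(f"PID file {pid_file} already exists, exiting.")
    with open(pid_file, "w") as f:
        f.write(str(os.getpid()))

def release_pid_file(pid_file: str = PID_FILE) -> None:
    """Remove the PID file on a clean shutdown."""
    if os.path.exists(pid_file):
        os.remove(pid_file)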

Console Display

The application displays its progress on stdout, for example:

NewsLookout Web Scraping Application, Version  1.9.9
Python version:  3.8.5 (Linux)
Run date: 2021-06-10
Reading configuration from: conf/newslookout.conf
Logging events to file: logs/newslookout.log
Using PID file: data/newslookout.pid


URLs identified: 100%|██████████████████████████████████████████████████████████| 14/14 [1h 48:14<00:00, 0.00 Plugins/s]
Data downloaded: 100%|██████████████████████████████████████████████████████| 1474/1474 [1h 48:14<00:00, 0.23    URLs/s]
 Data processed:  38%|███████████████████▌                               |  384/1007 [1h 48:14<2h 55:36, 0.06   Files/s]

Event Log

For a more detailed log of events, refer to the log file. It captures every event with a timestamp and the name of the module that generated it.

2021-01-01 01:31:50:[INFO]:queue_manager:4360: 13 worker threads available to fetch content.
...
2021-01-01 02:07:51:[INFO]:worker:320: Progress Status: 1117 URLs successfully scraped out of 1334 processed; 702 URLs remain.
...
2021-01-01 03:02:10:[INFO]:queue_manager:5700: Completed fetching data on all worker threads

Customizing and Writing your own Plugins

You can extend the web scraper to cover any additional website you need scraped by customising the template file template_for_plugin.py from the plugins_contrib folder. Give your custom plugin file the same name as the class it defines. Place it in the plugins_contrib folder (or whichever folder you have set in the config file). Next, add the plugin's name in the configuration file. It will be read, instantiated and run automatically by the application on the next startup.

Take a look at the code of one of the already implemented plugins for examples of how a plugin can be written.
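
The general shape of a plugin module is sketched below. Every name in this sketch (the attribute and method names, and the example URL) is a hypothetical placeholder; copy template_for_plugin.py for the actual interface the application expects.

# mod_my_plugin.py -- the file name matches the class name and the entry
# added to the configuration file, e.g. plugin01=mod_my_plugin
# NOTE: attribute and method names below are hypothetical placeholders;
# see template_for_plugin.py for the real interface.
class mod_my_plugin:
    """Skeleton of a plugin for a hypothetical news website."""

    # Placeholder: URLs the scraper should start crawling from
    start_urls = ["https://news.example.com/latest"]

    def extract_article_text(self, html_content: str) -> str:
        # Placeholder: site-specific parsing logic goes here, working around
        # the site's HTML quirks to return clean article text.
        return html_content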

Maintenance and Monitoring

Data Size

The application will automatically rotate the log file upon reaching the set maximum size. The data directory will need to be monitored since its size could grow quickly due to the data scraped from the web.

Event Monitoring

For enterprise installations, a log watcher may be set up to monitor the application's operation by watching for specific event entries in the log file.
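
For example, a simple check could scan the log for warnings and errors (the level names follow the standard Python logging levels; adjust the path to your configured log file):

# adjust the log file path to match your configuration
grep -E ":\[(ERROR|WARNING)\]:" /var/log/newslookout/newslookout.log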

The data folder should be monitored for growth in its size.

HTML parsing code updates

If a news portal changes its page structure, the web scraper code in its respective plugin will need to be updated to continue retrieving information reliably. This requires careful monitoring of the output to catch parsing-related problems.
