NewsCrunch is an application that scrapes articles from news sites and uses a combination of NLP methods to present condensed versions of the daily top stories.
- Articles scraped from reputable sites
- Similarity check to avoid article repetition
- Combination of abstractive and extractive summary methods
- Cron scheduler to scrape news daily
NewsCrunch was inspired by a need to keep up with world news. While summarising news does come with caveats, I found that a combination of extractive and abstractive summaries resulted in the most accurate representation of articles.
Clone this repo, and install all required packages using
pip install -r requirements.txt
Run the following commands to set up the SQLite database
python manage.py makemigrations
python manage.py migrate
Run the following command to start the Django server
python manage.py runserver
In a separate terminal, run the Django custom command using
python manage.py aggregate
This will run a cron scheduler to scrape and summarize articles daily.
The main stack used is Django, PyTorch, BS4, NLTK and the Transformers module.
- Extractive summary using TF-IDF (Term Frequency-Inverse Document Frequency)
- Abstractive summary using the Small2Bert model
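To illustrate the extractive step, here is a minimal sketch of TF-IDF sentence scoring using only the Python standard library. It is not the code used in this repo: the function names (`split_sentences`, `tokenize`, `extractive_summary`), the naive regex sentence splitter, and the choice to treat each sentence as a "document" when computing IDF are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase word tokens (hypothetical helper, no stop-word removal)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def extractive_summary(text, top_n=5):
    """Return the top_n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    tokenized = [tokenize(s) for s in sentences]
    n_sentences = len(sentences)

    # Document frequency: the number of sentences each word appears in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        term_freq = Counter(tokens)
        # Sentence score = sum of TF * IDF over its distinct words, where
        # TF is the word's relative frequency within the sentence.
        score = sum(
            (count / len(tokens)) * math.log(n_sentences / doc_freq[word])
            for word, count in term_freq.items()
        )
        scores.append(score)

    # Rank sentence indices by score, keep the best, then restore order.
    ranked = sorted(range(n_sentences), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:top_n])
    return [sentences[i] for i in keep]
```

Calling `extractive_summary(article_text, top_n=5)` returns at most five sentences. Words that appear in every sentence get an IDF of zero, so sentences made up of common words rank low.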
- Call NewsApi to get headlines and links.
- Check similarity between headlines.
- Scrape article using Requests and BeautifulSoup4.
- Create word corpus using Spacy.
- Perform extractive summary to reduce article to the top 5 sentences.
- Perform abstractive summary using Small2Bert model.
- Add to database after performing grammar check.
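The headline similarity check in the steps above could be done in several ways. The sketch below assumes cosine similarity over bag-of-words counts with a cutoff of 0.6. Both the method and the threshold are illustrative assumptions, not necessarily what this repo does, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(headline):
    """Word-count vector for a headline (lowercased, punctuation dropped)."""
    return Counter(re.findall(r"[a-z0-9]+", headline.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors, in [0, 1]."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_similar_headlines(headlines, threshold=0.6):
    """Keep a headline only if it is below the threshold (an assumed
    value) against every headline already kept."""
    kept = []
    kept_vectors = []
    for headline in headlines:
        vec = bag_of_words(headline)
        if all(cosine_similarity(vec, other) < threshold for other in kept_vectors):
            kept.append(headline)
            kept_vectors.append(vec)
    return kept
```

Because earlier headlines win, the order of the input list decides which of two near-duplicate stories is kept.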