Use the files in the crawler folder for scraping URLs from different Spanish websites. test.py is a sample crawler combined with a metadata generator; read.py is a generic metadata generator for all the URLs stored in a text file; the remaining files are crawlers for specific websites.
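As a hedged sketch of what a generic metadata generator like read.py might look like, the snippet below reads URLs from a text file and extracts article metadata with news-please (listed in the resources below). The file names urls.txt and metadata.jsonl and the chosen fields are assumptions for illustration, not the script's actual interface.

```python
# Hedged sketch of a generic metadata generator; urls.txt input and
# JSON-lines output are assumptions, the real read.py may differ.
import json

from newsplease import NewsPlease

with open("urls.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

with open("metadata.jsonl", "w", encoding="utf-8") as out:
    for url in urls:
        article = NewsPlease.from_url(url)  # download and parse one article
        out.write(json.dumps({
            "url": url,
            "title": article.title,
            "date_publish": str(article.date_publish),
            "language": article.language,
        }, ensure_ascii=False) + "\n")
```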
Download the required NLTK models:

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
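These models back NLTK's part-of-speech tagging and named-entity chunking. A minimal sketch of how they are typically used; the sample sentence is illustrative, not from the project:

```python
import nltk

sentence = "El País is a Spanish newspaper published in Madrid."
tokens = sentence.split()        # simple whitespace split; avoids the extra 'punkt' download
tagged = nltk.pos_tag(tokens)    # uses averaged_perceptron_tagger
tree = nltk.ne_chunk(tagged)     # uses maxent_ne_chunker and words
print(tree)                      # named entities appear as labeled subtrees
```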
Push to Kafka:

python write_to_kafka.py

Example:

python write_to_kafka.py C:\Users\Saurabh\Downloads\courses\bg\project\data.json guardian2

Check via consumer:

.\kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic guardian2
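For context, here is a minimal sketch of what a script like write_to_kafka.py could do using the kafka-python package; the library choice and the JSON-array input format are assumptions, not a description of the actual script.

```python
# Hedged sketch of a JSON-to-Kafka publisher (assumptions: kafka-python,
# input file holds a JSON array; the real write_to_kafka.py may differ).
import json
import sys

from kafka import KafkaProducer

json_path, topic = sys.argv[1], sys.argv[2]

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

with open(json_path, encoding="utf-8") as f:
    records = json.load(f)

for record in records:
    producer.send(topic, record)  # one message per article record

producer.flush()  # block until all messages are delivered
```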
Run the similar-stories pipeline:

python similar_stories_pipeline.py

Example:

python similar_stories_pipeline.py C:\Users\Saurabh\Downloads\courses\bg\project\data.json C:\Users\Saurabh\Downloads\courses\bg\project\result_
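As a hedged illustration of one common way to score story similarity (TF-IDF vectors plus cosine similarity, in the spirit of the deduplication resources below), not necessarily what similar_stories_pipeline.py implements:

```python
# Sketch: pairwise story similarity via TF-IDF + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Spain's government announced new economic measures today.",
    "The Spanish government unveiled fresh economic measures.",
    "Local football team wins the regional championship.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
sim = cosine_similarity(tfidf)
print(sim.round(2))  # docs 0 and 1 score far higher with each other than with doc 2
```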
- News-please https://github.com/fhamborg/news-please
- Scrapy https://scrapy.org/
- Apache Kafka https://kafka.apache.org/
- SPEC paper https://ieeexplore.ieee.org/document/7474330
- Universal Dependency https://universaldependencies.org/
- ufal.udpipe Python package (UDPipe bindings); see the sketch after this list.
- Deduplication papers/resources:
  a. https://www.hindawi.com/journals/mpe/2016/3919043/
  b. https://www.aclweb.org/anthology/P16-4019
  c. Dissertation of Dr. Ahmad Mustafa, https://search.proquest.com/pqdtlocal1006281/docview/2086379093/EDE7D1E6ED9843E5PQ/1?accountid=7120
  d. https://www.eventregistry.org/documentation?tab=semanticSimilarity
- List of news sources https://docs.google.com/spreadsheets/d/13DmJ140wW8pCp6nyRSAk911S7AoF-6zJOJF77qoMuM/edit?usp=sharing
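A minimal sketch of loading a Universal Dependencies model through the ufal.udpipe bindings; the model file name spanish-ud.udpipe is a placeholder, not a file shipped with this repo:

```python
# Hedged sketch of UD parsing via ufal.udpipe.
# 'spanish-ud.udpipe' is a placeholder path for whichever trained model is used.
from ufal.udpipe import Model, Pipeline

model = Model.load("spanish-ud.udpipe")
if model is None:
    raise RuntimeError("cannot load UDPipe model")

# tokenize raw text, tag and parse with the model defaults, emit CoNLL-U
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
print(pipeline.process("El Gobierno anunció nuevas medidas económicas."))
```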