GitHub - saurabhkumar2015/large-scale-spanish-news-nlp: Large Scale data collection, meta-data extraction, dependency parse tree creation and de-duplication of similar news articles

Large Scale Data Collection of Spanish News

1) Web crawler and metadata generator:

Use the files in the Crawler folder to scrape URLs from different Spanish websites. test.py is a sample crawler with a metadata generator. read.py is a generic metadata generator for all the URLs stored in a text file. The remaining files are crawlers for specific websites.
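As a rough illustration of the generic metadata step, the sketch below reads URLs from a text file (one per line) and builds one metadata record per URL. It is not the code in read.py: the function names, the record layout, and the `fetch_metadata` callback are assumptions made for this example. In practice the callback could wrap an article extractor such as news-please.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def read_urls(path: str) -> List[str]:
    """Return unique, non-empty URLs from a text file, preserving order."""
    seen = set()
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def build_metadata(urls: List[str],
                   fetch_metadata: Callable[[str], Dict]) -> List[Dict]:
    """Call the (hypothetical) extractor for each URL, skipping failures."""
    records = []
    for url in urls:
        try:
            meta = fetch_metadata(url)
        except Exception as exc:  # one bad page should not stop the run
            print(f"skipping {url}: {exc}")
            continue
        records.append({"url": url, **meta})
    return records


def write_json(records: List[Dict], out_path: str) -> None:
    """Write records as a JSON array, keeping accented characters readable."""
    Path(out_path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

Passing the extractor in as a callback keeps the file handling testable without network access, and lets each site-specific crawler reuse the same output format.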

2) NLTK Downloads

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

3) Use JSON file to write to Kafka

Push to Kafka:
python write_to_kafka.py <JSON file> <topic>

Example:
python write_to_kafka.py C:\Users\Saurabh\Downloads\courses\bg\project\data.json guardian222

Check via consumer:
.\kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic guardian2
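For readers who want to see what such a script might look like, here is a minimal sketch. It is not the contents of write_to_kafka.py. It assumes the JSON file holds a top-level array of article objects, that the kafka-python client is used, and that the broker listens on localhost:9092 (the address used by the consumer command above). The helper names are hypothetical.

```python
import json
import sys


def load_records(json_path):
    """Load articles from a JSON file (assumed: a top-level array or one object)."""
    with open(json_path, encoding="utf-8") as fh:
        data = json.load(fh)
    return data if isinstance(data, list) else [data]


def push_records(records, topic, producer):
    """Send each record to `topic` as UTF-8 JSON and return the number sent."""
    count = 0
    for record in records:
        payload = json.dumps(record, ensure_ascii=False).encode("utf-8")
        producer.send(topic, value=payload)
        count += 1
    producer.flush()  # block until buffered messages have been delivered
    return count


def main(argv):
    if len(argv) != 3:
        print("usage: python push_sketch.py <JSON file> <topic>")
        return
    # Imported here so the helpers above work without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    sent = push_records(load_records(argv[1]), argv[2], producer)
    print(f"sent {sent} records to topic {argv[2]}")


if __name__ == "__main__":
    main(sys.argv)
```

When checking the result with the console consumer, pass the same topic name that was used when pushing.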

4) Pipeline to generate matched article pairs

Usage:
python similar_stories_pipeline.py <input JSON file> <output path>

Example:
python similar_stories_pipeline.py C:\Users\Saurabh\Downloads\courses\bg\project\data.json C:\Users\Saurabh\Downloads\courses\bg\project\result_
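To give a feel for what matching article pairs involves, the sketch below uses a generic baseline: word shingles compared with Jaccard similarity. This is a swapped-in technique for illustration only and is not the method implemented in similar_stories_pipeline.py. The function names, the shingle size, and the threshold are assumptions.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of `text` (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(articles, threshold=0.5, k=3):
    """articles: dict of id -> text. Returns a list of (id_a, id_b, score)."""
    sets = {aid: shingles(text, k) for aid, text in articles.items()}
    pairs = []
    for a, b in combinations(sorted(sets), 2):
        score = jaccard(sets[a], sets[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


articles = {
    "a": "El gobierno anunció nuevas medidas económicas hoy",
    "b": "El gobierno anunció nuevas medidas económicas ayer",
    "c": "El equipo ganó el partido de fútbol",
}
for id_a, id_b, score in similar_pairs(articles):
    print(id_a, id_b, round(score, 3))
```

Comparing every pair is quadratic in the number of articles, so a large collection would likely need an approximate scheme such as MinHash with locality-sensitive hashing instead.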

References:

News-please https://github.com/fhamborg/news-please
Scrapy https://scrapy.org/
Apache Kafka https://kafka.apache.org/
SPEC paper https://ieeexplore.ieee.org/document/7474330
Universal Dependency https://universaldependencies.org/
ufal-udpipe Python package
Deduplication Papers/Resources
a. https://www.hindawi.com/journals/mpe/2016/3919043/
b. https://www.aclweb.org/anthology/P16-4019
c. Dissertation of Dr. Ahmad Mustafa, https://search.proquest.com/pqdtlocal1006281/docview/2086379093/EDE7D1E6ED9843E5PQ/1?accountid=7120
d. https://www.eventregistry.org/documentation?tab=semanticSimilarity
List of news sources https://docs.google.com/spreadsheets/d/13DmJ140wW8pCp6nyRSAk911S7AoF-6zJOJF77qoMuM/edit?usp=sharing

Folders and files (26 commits):

.idea
Crawler
Mongo-Connector
poc
ReadMe.md
data.json
similar_stories_pipeline.py
spanish-ancora-ud-2.3-181115.udpipe
write_to_kafka.py
