PIB Crawler

Overview

This repository houses a Flask application, built incrementally, for extracting aligned sentences across multiple languages given a translation system already in place.

The application was originally built to crawl and store the multilingual news articles available on the Press Information Bureau (PIB) website. It can, however, be repurposed to prototype, inspect and build corpora from other multilingual sources as well.

The web application is needed for the following reasons:

Multilingual samples and their alignments need to be verified, which is easy to do through a web interface.
The data is stored in a DBMS, which suits the nature of the data and makes incremental updates efficient.
Tokenization and other under-the-hood processing has to be repeated, but should stay hidden from the user, whether layperson or expert, so that simple feedback can be gathered.
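To make the point about incremental updates concrete, here is a minimal sketch of how re-crawled articles could be written to a database without creating duplicates. It uses Python's standard `sqlite3` module; the `entry` table, its columns, and the helper names `open_store` and `upsert_entry` are hypothetical and chosen for illustration only, and the repository's actual schema may differ.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store with a hypothetical `entry` table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS entry (
            id      INTEGER NOT NULL,   -- article identifier (assumed)
            lang    TEXT    NOT NULL,   -- language code, e.g. 'en', 'hi'
            content TEXT    NOT NULL,   -- article text
            PRIMARY KEY (id, lang)
        )
        """
    )
    return conn


def upsert_entry(conn, entry_id, lang, content):
    """Insert an article, replacing any earlier copy with the same (id, lang)."""
    conn.execute(
        "INSERT OR REPLACE INTO entry (id, lang, content) VALUES (?, ?, ?)",
        (entry_id, lang, content),
    )
    conn.commit()


if __name__ == "__main__":
    conn = open_store()
    upsert_entry(conn, 1, "en", "first crawl")
    upsert_entry(conn, 1, "en", "second crawl")  # replaces the first row
    print(conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0])
```

Keying each row on the article identifier and language means a repeated crawl overwrites the existing row instead of appending a new one, which is the property the incremental-update argument relies on.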

Installation

# --user is optional
python3 -m pip install -r requirements.txt --user

After installing the required packages, run the following script to download the PIB database containing the crawled articles. The script also downloads the pretrained multilingual model used for alignment.

bash scripts/get-resources.sh

Usage

Once the database and the pretrained model are in place, run the following command to extract the parallel corpus from the database.

bash scripts/export-parallel-corpus.sh

Resources

The CVIT-PIB and CVIT-MKB (Mann-Ki-Baat) datasets are available here.
The database containing the crawled news articles, from which the parallel corpus is extracted.
The multilingual NMT model used for sentence alignment, and the associated vocabulary files.
We additionally release a multilingual model augmented with the PIB corpus.

Publications

If you use CVIT-PIB or CVIT-MKB, please cite our paper:

@inproceedings{siripragada-etal-2020-multilingual,
    title = "A Multilingual Parallel Corpora Collection Effort for {I}ndian Languages",
    author = "Siripragada, Shashank and Philip, Jerin and Namboodiri, Vinay P. and Jawahar, C V",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.462",
    pages = "3743--3751",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
