
Code to extract multilingual parallel corpus from Press Information Bureau (PIB) website.


siripragadashashank/pib-crawl

 
 


PIB Crawler

Overview

This repository houses a Flask application, built incrementally, for extracting aligned sentences across multiple languages with the help of a translation system.

The application was originally built to crawl and store the multilingual news articles published on the Press Information Bureau (PIB) website. It can, however, be repurposed to prototype, inspect, and build corpora from other multilingual sources as well.

We require the web application for the reasons below:

  1. Multilingual samples need verification, both of the alignment and of the retrieved samples themselves, and this is easy to do through a web interface.
  2. The data is stored in a DBMS, which suits its nature and lets incremental updates be performed efficiently.
  3. Tokenization and other under-the-hood processing has to be run repeatedly, but should stay hidden from the user, whether layperson or expert, so that simple feedback can be gathered.
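The second point can be illustrated with a minimal sketch of incremental storage, using Python's standard `sqlite3` module. The `entry` table, its columns, and the helper names below are hypothetical and chosen only for illustration; they are not the schema or API used by this repository.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small article store. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS entry (
               id INTEGER PRIMARY KEY,
               lang TEXT NOT NULL,
               content TEXT NOT NULL
           )"""
    )
    return conn


def upsert_entry(conn, entry_id, lang, content):
    """Insert a crawled article, or overwrite it if the id was seen before."""
    conn.execute(
        "INSERT OR REPLACE INTO entry (id, lang, content) VALUES (?, ?, ?)",
        (entry_id, lang, content),
    )
    conn.commit()


def entries_in(conn, lang):
    """Return (id, content) pairs for one language, ordered by id."""
    rows = conn.execute(
        "SELECT id, content FROM entry WHERE lang = ? ORDER BY id", (lang,)
    )
    return rows.fetchall()
```

Because each article is keyed by its id, re-running a crawl only rewrites the rows it touches, so repeated crawls do not duplicate rows that already exist.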

Installation

# --user is optional
python3 -m pip install -r requirements.txt --user

After installing the required packages, run the following script to download the PIB database containing the crawled articles. The script also downloads the pretrained multilingual model used for alignment.

bash scripts/get-resources.sh

Usage

Once the database and the pretrained model are in place, run the following command to extract the parallel corpus from the database.

bash scripts/export-parallel-corpus.sh

Resources

  1. The CVIT-PIB and CVIT-MKB (Mann-Ki-Baat) datasets are available here.
  2. The database containing the crawled news articles, which is used to extract the parallel corpus.
  3. The multilingual NMT model used for sentence alignment, and the associated vocabulary files.
  4. We additionally release a multilingual model augmented with the PIB corpus.

Publications

If you use the CVIT-PIB or CVIT-MKB datasets, please cite our paper:

@inproceedings{siripragada-etal-2020-multilingual,
    title = "A Multilingual Parallel Corpora Collection Effort for {I}ndian Languages",
    author = "Siripragada, Shashank and Philip, Jerin and Namboodiri, Vinay P. and Jawahar, C V",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.462",
    pages = "3743--3751",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
