Skip to content

thepushkarp/chhattisgarhi

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Scrape Chhattigarhi

Script to collect scrape and clean sentences in Chhattisgarhi for the project Speech Recognition in Agriculture and Finance for the Poor in India


What is this

This script uses Google Sheets API to fetch data from a Google Sheet containing links of sites containing Chhattisgarhi text in the domains of Agriculture and Finance and then does the following:

  • Identifies duplicate links
  • Generates all the links from the sitemaps of popular Chhattisgarhi news portals and gets a list of all links that are not in the sheet (Note: Not all links are useful for us)
  • Scrapes and extracts useful text from the links
  • Optionally, can query all the links to get links containing a particular substring in them
  • Cleans the extracted text and tokenizes them into words to form a vocabulary of Chhattisgarhi words
  • Stores the clean sentences to form our Chhattisgarhi corpus

How to use?

Create a .env file and add a SPREADSHEET_ID field with the unique ID of your Google Sheet. Then run scrape.ipynb.

Authors

About

Script to collect scrape and clean sentences in Chhattisgarhi

Topics

Resources

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published