Scrape Chhattisgarhi

Script to collect, scrape, and clean sentences in Chhattisgarhi for the project "Speech Recognition in Agriculture and Finance for the Poor in India".

What is this?

This script uses the Google Sheets API to fetch data from a Google Sheet that lists links to sites with Chhattisgarhi text in the domains of agriculture and finance. It then does the following:

Identifies duplicate links
Collects all links from the sitemaps of popular Chhattisgarhi news portals and lists those that are not yet in the sheet (note: not all of these links are useful for the project)
Scrapes and extracts useful text from the links
Optionally filters the links to those containing a particular substring
Cleans the extracted text and tokenizes it into words to form a vocabulary of Chhattisgarhi words
Stores the clean sentences to form our Chhattisgarhi corpus
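
The link-handling and text-cleaning steps above can be sketched roughly as follows. This is not the notebook's actual code: every function name here is hypothetical, network access (Google Sheets, HTTP scraping) is left out, and the cleaning rule (keep only characters from the Devanagari Unicode block and split sentences on the danda) is an assumption about what "clean" means for this corpus.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def find_duplicates(links):
    """Return links that occur more than once, in first-seen order."""
    counts = Counter(link.strip() for link in links)
    return [link for link, n in counts.items() if n > 1]


def links_from_sitemap(xml_text):
    """Extract every <loc> URL from a sitemap given as an XML string."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS) if loc.text]


def links_not_in_sheet(sitemap_links, sheet_links):
    """Return sitemap links that the sheet does not list yet."""
    known = set(sheet_links)
    return [link for link in sitemap_links if link not in known]


def links_containing(links, substring):
    """Optional query: keep only links that contain the given substring."""
    return [link for link in links if substring in link]


def clean_sentences(text):
    """Split on danda, double danda, '?', '!' or newline, then keep only
    Devanagari characters (U+0900 to U+097F) and single spaces.
    Assumption: anything outside that block is noise for this corpus."""
    sentences = []
    for raw in re.split(r"[\u0964\u0965?!\n]+", text):
        kept = re.sub(r"[^\u0900-\u097F\s]", " ", raw)
        kept = " ".join(kept.split())  # collapse runs of whitespace
        if kept:
            sentences.append(kept)
    return sentences


def build_vocabulary(sentences):
    """Whitespace-tokenize the cleaned sentences into a sorted list of unique words."""
    return sorted({word for sentence in sentences for word in sentence.split()})


if __name__ == "__main__":
    sample = "किसान खेत म हे। Bank खाता!"
    cleaned = clean_sentences(sample)
    print(cleaned)
    print(build_vocabulary(cleaned))
```

Whitespace tokenization is the simplest possible choice and will keep Devanagari digits and inflected forms as separate vocabulary entries; a real pipeline may want additional normalization.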

How to use?

Create a .env file and add a SPREADSHEET_ID entry set to the unique ID of your Google Sheet. Then run script.ipynb.
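
As an illustration of how the SPREADSHEET_ID entry might be read, here is a minimal sketch that uses only the standard library. The notebook may well load the file differently (for example with a dotenv package), so the helper names `read_env_file` and `get_spreadsheet_id` are hypothetical and the parsing rules (one `KEY=value` pair per line, `#` for comments) are an assumption.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=value lines; blank lines and '#' comments are skipped."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_spreadsheet_id(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    value = os.environ.get("SPREADSHEET_ID")
    if not value and Path(path).exists():
        value = read_env_file(path).get("SPREADSHEET_ID")
    if not value:
        raise RuntimeError("SPREADSHEET_ID is not set in the environment or in " + str(path))
    return value
```

Failing early with a clear error when the ID is missing is usually easier to debug than a later failure inside a Sheets API call.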

thepushkarp/chhattisgarhi
