
Added link to raw data for electoral rolls

raphael-susewind committed Nov 28, 2019
1 parent 085964e commit a52b885fcccb6eeae61fe5ca4b239773af6261d9
@@ -93,6 +93,8 @@ The dataset in its entirety is **licensed** under an [ODC Open Database license]

> Susewind, R. (2016). Data on religion and politics in India. Published under an ODbL 1.0 license. Available from https://github.com/raphael-susewind/india-religion-politics.
Last but not least, **raw data** behind this dataset (e.g. original files downloaded from ECI websites over the years) is generally not included here, both to save space (it runs into several TB by now) and for privacy concerns (even though all data was originally put in the public domain by the ECI, some of it might be considered sensitive in aggregate). I do archive all relevant original downloads in a restricted access [Zenodo collection](https://zenodo.org/communities/india-religion-politics-raw) though and will make it available to legitimate academic users upon request.

So I invite all to download and use this dataset for more localized quantitative analyses of political, religious and demographic dynamics in India in the spirit of Open Data sharing. Please let me know if you find the dataset useful and alert me to errors and mistakes. I provide this dataset without any guarantee - see [troubleshooting notes](https://github.com/raphael-susewind/india-religion-politics/blob/master/TROUBLESHOOTING.md) for **known general problems** with this data, alongside the various table READMEs.

Raphael Susewind, mail@raphael-susewind.de, GPG key [10AEE42F](https://keybase.io/raphaelsusewind)
@@ -11,15 +11,19 @@ As a next step, I intend to expand to all-India level for the 2014 general elect
* Add polling booth locality data for 2014 across India (from my GIS dataset, practically done, just need to convert to SQL tables and document)
* Add frontpage information from electoral roll for all states that don't already have them
* Add MODIS 500m rural/urban classification for booth localities (rather than current 1km ones)
* Wait for a few weeks to see whether any bugs crop up, then make the formal release

## Third proper release:

As a more distant goal - and only if I find time after the UP 2017 elections consumed all of my energy - I aim to take an experimental shot at integrating booth-level electoral and village-level Census data. I already have a processing chain built up, but need to find a way to verify the results' quality before moving ahead with it. Ideally, though, the following should be added across India:
As a more distant goal - and only if I find time - I aim to take an experimental shot at integrating booth-level electoral and village-level Census data. I already have a processing chain built up, but need to find a way to verify the results' quality before moving ahead with it. Ideally, though, the following should be added across India:

* Match 2014 booth IDs to Census 2001 and Census 2011 village / ward IDs (either spatially and/or using the electoral roll front pages), and implicitly with the whole administrative hierarchy up to district level
* If copyright permits, match in MOSPI data as well as village-level Census data and/or PCAs on higher administrative units
* Use this Census data to add weights to my namematching estimates (so that the latter only decide the distribution within a given census tract, not the average) - for electoral analyses, it makes sense to stay with the estimates of the electorate, but for demographic analyses, it might make sense to circumscribe the same by census data in a kind of Bayesian way
* Use the latter weighted data for a beautifully tiled atlas
* Add namematched BPL data (which is also tied in with admin boundaries)
* Wait for a few weeks to see whether any bugs crop up, then make the formal release

## Fourth proper release

In the long run, I intend to update this dataset annually for UP (the state I am most invested in) and for every general election across India - pending practicability and time considerations...

* Add 2019 data for across India (largely scraped, but not processed yet)
@@ -30,7 +30,7 @@ female_*_percent_14 | Percentage of female electors among electors estimated to
Originally, the electoral rolls were crawled in July 2014 from http://ceoaperms.ap.gov.in/Electoral_Rolls/PDFGeneration.aspx using run-in-osc/downloadpdf.pl; the "last updated on" entry on the rolls' cover sheet reads "1/1/2014".
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number).
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number). I do archive all relevant original downloads in a restricted access [Zenodo collection](https://zenodo.org/communities/india-religion-politics-raw) though and will make it available to legitimate academic users upon request.
The subsequent processing chain is however preserved in the run-in-arc folder for reference. It ran on the [Oxford Advanced Research Computing cluster](https://www.arc.ox.ac.uk) and several hardcoded binary paths as well as the Torque scheduler commands are unique to this environment. After running createnamedb.pl once and putting all additional software in place, the chain was started using run.sh, which basically sparked 294 parallel processes (one for each assembly segment), in which roll PDFs were downloaded, relevant data extracted, names of electors matched to likely religion and ultimately booth-wise estimates of religious demography calculated, using ngram technology to further reduce missing_percent_14 (see scripts for details).
@@ -30,7 +30,7 @@ female_*_percent_14 | Percentage of female electors among electors estimated to
Originally, the electoral rolls were crawled in July 2014 from http://ceodelhi.gov.in/Content/AccemblyConstituenty.aspx using run-in-osc/downloadpdf.pl; the "last updated on" entry on the rolls' cover sheet reads "1/1/2014".
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number).
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number). I do archive all relevant original downloads in a restricted access [Zenodo collection](https://zenodo.org/communities/india-religion-politics-raw) though and will make it available to legitimate academic users upon request.
The subsequent processing chain is however preserved in the run-in-arc folder for reference. It ran on the [Oxford Advanced Research Computing cluster](https://www.arc.ox.ac.uk) and several hardcoded binary paths as well as the Torque scheduler commands are unique to this environment. After running createnamedb.pl once and putting all additional software in place, the chain was started using run.sh, which basically sparked 294 parallel processes (one for each assembly segment), in which roll PDFs were downloaded, relevant data extracted, names of electors matched to likely religion and ultimately booth-wise estimates of religious demography calculated, using ngram technology to further reduce missing_percent_14 (see scripts for details).
@@ -30,7 +30,7 @@ female_*_percent_14 | Percentage of female electors among electors estimated to
Originally, the electoral rolls were crawled in July 2014 from http://www.ceogoa.nic.in/PER_Search.aspx using run-in-arc/downloadpdf.pl; the "last updated on" entry on the rolls' cover sheet reads "1/1/2014".
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number).
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number). I do archive all relevant original downloads in a restricted access [Zenodo collection](https://zenodo.org/communities/india-religion-politics-raw) though and will make it available to legitimate academic users upon request.
The subsequent processing chain is however preserved in the run-in-arc folder for reference. It ran on the [Oxford Advanced Research Computing cluster](https://www.arc.ox.ac.uk) and several hardcoded binary paths as well as the Torque scheduler commands are unique to this environment. After running createnamedb.pl once and putting all additional software in place, the chain was started using run.sh, which basically sparked 40 parallel processes (one for each assembly segment), in which roll PDFs were downloaded, relevant data extracted, names of electors matched to likely religion and ultimately booth-wise estimates of religious demography calculated, using ngram technology to further reduce missing_percent_14 (see scripts for details).
@@ -30,7 +30,7 @@ female_*_percent_14 | Percentage of female electors among electors estimated to
Originally, the electoral rolls were crawled in July 2014 from http://ceouttarpradesh.nic.in/_RollPDF.aspx using run-in-osc/downloadpdf.pl; the "last updated on" entry on the rolls' cover sheet reads "1/1/2014".
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number).
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number). I do archive all relevant original downloads in a restricted access [Zenodo collection](https://zenodo.org/communities/india-religion-politics-raw) though and will make it available to legitimate academic users upon request.
The subsequent processing chain is however preserved in the run-in-osc, run-in-osc-add-ngram, run-in-osc-add-gender and run-in-arc-frontpage folders for reference. It ran on the [Oxford Advanced Research Computing cluster](https://www.arc.ox.ac.uk) and several hardcoded binary paths as well as the Torque scheduler commands are unique to this environment. After running createnamedb.pl once and putting all additional software in place, the chain was started using run.sh, which basically sparked 403 parallel processes (one for each assembly segment), in which roll PDFs were downloaded, relevant data extracted, names of electors matched to likely religion and ultimately booth-wise estimates of religious demography calculated. After the main processing chain, the run-in-osc-add-ngram chain attempted to further reduce missing_percent_14 using ngram technology (see scripts there for details). There was also a glitch with gender extraction; this was rectified using the run-in-osc-add-gender scripts. Finally, run-in-arc-frontpage added frontpage details for the gujid table.
@@ -30,7 +30,7 @@ female_*_percent_14 | Percentage of female electors among electors estimated to
Originally, the electoral rolls were crawled in July 2014 from http://ceoharyana.nic.in/?module=electoralroll using run-in-osc/downloadpdf.pl; the "last updated on" entry on the rolls' cover sheet reads "1/1/2014".
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number).
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number). I do archive all relevant original downloads in a restricted access [Zenodo collection](https://zenodo.org/communities/india-religion-politics-raw) though and will make it available to legitimate academic users upon request.
The subsequent processing chain is however preserved in the run-in-arc folder for reference. It ran on the [Oxford Advanced Research Computing cluster](https://www.arc.ox.ac.uk) and several hardcoded binary paths as well as the Torque scheduler commands are unique to this environment. After running createnamedb.pl once and putting all additional software in place, the chain was started using run.sh, which basically sparked 294 parallel processes (one for each assembly segment), in which roll PDFs were downloaded, relevant data extracted, names of electors matched to likely religion and ultimately booth-wise estimates of religious demography calculated, using ngram technology to further reduce missing_percent_14 (see scripts for details).
@@ -30,7 +30,7 @@ female_*_percent_14 | Percentage of female electors among electors estimated to
Originally, the electoral rolls were crawled in July 2014 from http://ceokarnataka.kar.nic.in/draftroll2014/dist_list.aspx using run-in-osc/downloadpdf.pl; the "last updated on" entry on the rolls' cover sheet reads "1/1/2014".
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number).
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number). I do archive all relevant original downloads in a restricted access [Zenodo collection](https://zenodo.org/communities/india-religion-politics-raw) though and will make it available to legitimate academic users upon request.
The subsequent processing chain is however preserved in the run-in-arc folder for reference. It ran on the [Oxford Advanced Research Computing cluster](https://www.arc.ox.ac.uk) and several hardcoded binary paths as well as the Torque scheduler commands are unique to this environment. After running createnamedb.pl once and putting all additional software in place, the chain was started using run.sh, which basically sparked 294 parallel processes (one for each assembly segment), in which roll PDFs were downloaded, relevant data extracted, names of electors matched to likely religion and ultimately booth-wise estimates of religious demography calculated, using ngram technology to further reduce missing_percent_14 (see scripts for details).
@@ -30,7 +30,7 @@ female_*_percent_14 | Percentage of female electors among electors estimated to
Originally, the electoral rolls were crawled in July 2014 from http://www.ceo.kerala.gov.in/electoralrolls.html using run-in-osc/downloadpdf.pl; the "last updated on" entry on the rolls' cover sheet reads "1/1/2014".
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number).
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number). I do archive all relevant original downloads in a restricted access [Zenodo collection](https://zenodo.org/communities/india-religion-politics-raw) though and will make it available to legitimate academic users upon request.
The subsequent processing chain is however preserved in the run-in-arc folder for reference. It ran on the [Oxford Advanced Research Computing cluster](https://www.arc.ox.ac.uk) and several hardcoded binary paths as well as the Torque scheduler commands are unique to this environment. After running createnamedb.pl once and putting all additional software in place, the chain was started using run.sh, which basically sparked 294 parallel processes (one for each assembly segment), in which roll PDFs were downloaded, relevant data extracted, names of electors matched to likely religion and ultimately booth-wise estimates of religious demography calculated, using ngram technology to further reduce missing_percent_14 (see scripts for details).
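
For readers unfamiliar with the last step of the chain described in these READMEs (per-elector classifications rolled up into booth-wise percentages), the following is a rough Python sketch of the idea only. It is not the repository's code, which lives in the Perl and shell scripts named above; the function name `booth_shares`, the `(booth_id, religion)` input format and the `"missing"` label for unmatched names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def booth_shares(classified):
    """Aggregate per-elector labels into booth-wise percentages.

    `classified` is an iterable of (booth_id, religion) pairs, where
    religion is None if the name could not be matched (hypothetical format).
    """
    counts = defaultdict(Counter)
    for booth_id, religion in classified:
        # Unmatched electors are counted under an illustrative "missing" label.
        label = religion if religion is not None else "missing"
        counts[booth_id][label] += 1

    shares = {}
    for booth_id, counter in counts.items():
        total = sum(counter.values())
        shares[booth_id] = {
            label: 100.0 * n / total for label, n in counter.items()
        }
    return shares


# Made-up example input: four electors in booth "b1", one in booth "b2".
electors = [
    ("b1", "muslim"),
    ("b1", "hindu"),
    ("b1", None),
    ("b1", "hindu"),
    ("b2", None),
]
result = booth_shares(electors)
```

In this sketch only the aggregated shares leave the function, which mirrors the decision above to publish booth-level estimates while keeping voter-by-voter classifications out of the repository.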
