
Initial commit; contains only UP Vidhan Sabha results for 2007; more to follow once this works out
raphael-susewind committed Jan 31, 2016
1 parent 97bd720 commit cf3a71540708a8c571de281c556675e9d6b0cb53
Showing with 250,121 additions and 2 deletions.
  1. +541 −0 LICENSE.md
  2. +34 −2 README.md
  3. +29 −0 TROUBLESHOOTING.md
  4. +1 −0 combined.sql
  5. +80,783 −0 upvidhansabha2007.csv
  6. +40 −0 upvidhansabha2007.md
  7. +80,792 −0 upvidhansabha2007.sql
  8. +23 −0 upvidhansabha2007/download.pl
  9. +87,554 −0 upvidhansabha2007/results.csv
  10. +324 −0 upvidhansabha2007/transform.pl
@@ -1,2 +1,34 @@
# Data on religion and politics in India
This repository provides highly localized statistics on religion and politics in India under an open license, focusing largely on North Indian states, and especially on Uttar Pradesh.
Fortunately, recent transparency initiatives by the Election Commission of India in general and the Chief Electoral Officer of UP in particular now allow researchers to shift the central unit of quantitative political analyses from the constituency level to that of polling booths, stations, and villages (earlier, such data had to be interpolated or estimated). Often, this data is not very user-friendly, though (think garbled, scanned PDFs). The purpose of this repository is to curate this data in a more accessible format and to share the scraping and cleanup code for reference.
Moreover, official data is supplemented with estimates of religious demography based on the religious connotations of electors' names in the voter lists. Upscaling was generously sponsored by the [Oxford Advanced Research Computing unit](http://arc.ox.ac.uk); the algorithm itself is on [GitHub](https://github.com/raphael-susewind/name2community/tree/ngram) and described more fully in the following article of mine:
> Susewind, Raphael (2015). [What's in a name? Probabilistic inference of religious community from South Asian names](http://dx.doi.org/10.1177/1525822X14564275). Field Methods 27(4), 319-332.
Another useful source that complements this data are the GIS shapefiles for polling booths, stations, assembly segments and parliamentary constituencies which I published here:
> Susewind, R. (2014). [GIS shapefiles for India's parliamentary and assembly constituencies including polling booth localities](http://dx.doi.org/10.4119/unibi/2674065). Published under a CC-BY-NC-SA 4.0 license.
From 2013 to 2016, the whole dataset was located on my [personal website](https://www.raphael-susewind.de), and the [blog there](https://www.raphael-susewind.de/blog/category/quantitativemethods) continues to provide bits and pieces of advice on how to use it, as do my various [publications](https://writing.raphael-susewind.de). This created unnecessary hurdles for collaboration, though, and posed its own challenges in terms of long-term availability. After pondering various options, I decided to move to GitHub entirely. Technically, the final dataset comes as a **SQLite database** with a number of relational tables (generally state by state rather than India-wide, since the raw data can be very different). For each table, this repository contains:
* table - a directory containing the scraping and cleanup code used to generate this table from raw data. Note that the raw data itself often cannot be redistributed for legal reasons and may no longer be available at its erstwhile URL - a chief reason to curate this repository. If you want access to the original raw data in order to check the scripts, drop me an email and we can arrange something.
* table.md - a description of each variable in this table alongside notes on raw data sources, notes on accuracy, and, if relevant, additional license information.
* table.csv - a CSV dump of said table. I personally prefer to work straight from SQLite, but you may not.
* table.sql - a set of SQLite commands that you can use to add the table to your master database.
One particularly important set of tables are the *id ones - they map the various ID codes in use across the dataset against each other (there is one id table per state, re-generated after each addition to the dataset). Unfortunately, but necessarily, the Election Commission changes polling booth IDs and names once in a while and we had a delimitation exercise in 2008 with even starker impact on precincts. Consequently, you cannot simply assume that, for instance, booth 143 in constituency 47 of Uttar Pradesh in the uploksabha2014 table is the same entity as booth 143 in constituency 47 of Uttar Pradesh in the upvidhansabha2012 table. Likewise, spatial matching - for instance used to tell which district and taluqa a given polling station falls into - has its own set of inaccuracies. So if you need to combine tables with a different set of ID codes, you need to look up what matches what in the state's id table (id codes with the same name are directly compatible across tables within the same state). For more on this and other **general problems** with the dataset, make sure to study TROUBLESHOOTING.md with great care!
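To make the id-table point concrete, here is a minimal sketch of routing a cross-year join through a state's id table, using Python's built-in sqlite3 module against an in-memory stand-in for combined.sqlite. The table and column names used here (upid, booth_2012, booth_2014, winner, turnout) are hypothetical - check the relevant table.md files for the actual schema.

```python
import sqlite3

# Hypothetical in-memory stand-in for combined.sqlite; the real table and
# column names may differ -- consult the table.md files for the real schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE upvidhansabha2012 (booth_2012 TEXT, winner TEXT);
CREATE TABLE uploksabha2014 (booth_2014 TEXT, turnout REAL);
-- one id table per state maps the various ID codes against each other
CREATE TABLE upid (booth_2012 TEXT, booth_2014 TEXT);
INSERT INTO upvidhansabha2012 VALUES ('47-143', 'INC');
INSERT INTO uploksabha2014 VALUES ('47-151', 0.61);
INSERT INTO upid VALUES ('47-143', '47-151');
""")
# Never join 2012 and 2014 data on the raw booth number directly;
# route the join through the state's id table instead.
row = con.execute("""
    SELECT v.winner, l.turnout
    FROM upvidhansabha2012 v
    JOIN upid i ON i.booth_2012 = v.booth_2012
    JOIN uploksabha2014 l ON l.booth_2014 = i.booth_2014
""").fetchone()
print(row)  # ('INC', 0.61)
```

Note that the same booth carries different IDs in the two election tables; only the id table knows they are the same entity.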
If you wish to **recreate the whole database**, the easiest way would be to clone this repository in its entirety, and then run the equivalent of `cat combined.sql | sqlite3 combined.sqlite` on your system. This will automatically create a new combined.sqlite file by running all table.sql files in the correct order. You can then extract your data from one or multiple tables for further processing.
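If you prefer to rebuild from Python rather than the sqlite3 shell, note that combined.sql only chains `.read table.sql` lines - a dot-command the command-line shell understands but Python's sqlite3 module does not. A minimal sketch that executes the individual table.sql files directly (sorted() here merely stands in for the order given in combined.sql):

```python
import pathlib
import sqlite3

# Sketch: rebuild combined.sqlite without the sqlite3 CLI. Python's sqlite3
# module cannot interpret the `.read` dot-commands in combined.sql, so we
# run each table.sql file ourselves.
def build_database(repo_dir, db_path="combined.sqlite"):
    con = sqlite3.connect(db_path)
    for sql_file in sorted(pathlib.Path(repo_dir).glob("*.sql")):
        if sql_file.name == "combined.sql":  # skip the .read wrapper itself
            continue
        con.executescript(sql_file.read_text(encoding="utf-8"))
    con.commit()
    return con

# Usage (from a clone of this repository):
# con = build_database(".")
```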
If you wish to **add or correct stuff** in the dataset, you can either send me an informal email (see below) or, if sufficiently technically minded, create a pull request against this repository. If adding new tables, please follow the structure outlined above. If making corrections or merely adding more variables to an existing table, please update the respective table.md with an explanation, update table.sql with the necessary SQL code, and create a new table.csv dump (code for which should already be included in the table.sql).
The dataset in its entirety is **licensed** under an [ODC Open Database license](http://www.opendatacommons.org/licenses/odbl/1.0/). This allows you to download, copy, use and redistribute it, as long as you attribute correctly, abstain from technical methods of copy protection, and most importantly make any additions and modifications publicly available on equal terms. A number of tables in this dataset come with their own legal baggage, which is mentioned and explained further in their respective table.md file. In an academic context, I suggest you attribute using this reference:
> Susewind, R. (2016). Data on religion and politics in India. Published under an ODbL 1.0 license. https://data.raphael-susewind.de.
I invite all to download and use this dataset for more localized quantitative analyses of political, religious and demographic dynamics in India in the spirit of Open Data sharing and subject to applicable license conditions - in particular the requirement to share alterations and additions to the dataset with the research community on equivalent terms. Please let me know if you find the dataset useful and alert me to errors and mistakes. I provide this dataset without any guarantee - see TROUBLESHOOTING.md.
Raphael Susewind, mail@raphael-susewind.de, GPG key 10AEE42F
@@ -0,0 +1,29 @@
# Data on religion and politics in India
## General troubleshooting notes
There are numerous potential problems with a dataset of this magnitude and I provide all data without any guarantee. I urge you to a) look at raw data closely and b) run your own plausibility checks before using this data. Also, if you use this dataset and discover errors or implausible values, please let me know! Let me highlight four main areas of (potential) trouble:
## Problems in raw data quality
I can't do anything if raw data is inaccurate or incomplete. Inaccuracies are more likely in election result data, since the Election Commission uses a number of different and mutually incompatible report sheets for Form 20 reports, at times manipulated further by Returning Officers - and since I cannot guarantee that I integrated all of these error-free even where they themselves are correct, there is a certain degree of potential trouble in that area. Incomplete data, in turn, is more likely a problem with the GIS variables, since the latitude/longitude database of the Election Commission is still in draft form and roughly one fifth of its coordinates had to be discarded as implausible (see my comments below).
Also note that polling booth and station names and other text fields are given in different forms and languages in raw sources: latin script, devanagari unicode, and devanagari Kruti Dev. I did not unify these.
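If you need to unify these fields yourself, a first step is to detect which script a given value uses. A rough sketch using only the standard library (note the caveat in the comments: Kruti Dev text occupies Latin codepoints, so this check will misclassify it as Latin script - there is no reliable automatic fix without a Kruti Dev to Unicode converter):

```python
import unicodedata

# Sketch: classify the dominant script of a text field so rows can at least
# be partitioned before any unification attempt. Kruti Dev strings are
# Latin-range bytes pretending to be Devanagari and will (wrongly) come out
# as LATIN here.
def dominant_script(text):
    scripts = []
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.append("DEVANAGARI" if name.startswith("DEVANAGARI") else "LATIN")
    if not scripts:
        return "UNKNOWN"
    return max(set(scripts), key=scripts.count)

print(dominant_script("लखनऊ"))      # DEVANAGARI
print(dominant_script("Lucknow"))  # LATIN
```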
## Problems in namematching accuracy
A specific subset of raw data quality issues concerns the booth-level data on religious demography (as well as the religious categorization of candidate and MLA names) derived from my name-matching algorithm. First of all, this algorithm is - by design and by necessity - probabilistic and thus never 100% accurate. Secondly, it is hard to establish the accuracy irrespective of context. I myself used the algorithm largely to distinguish Muslim from non-Muslim names in UP, Delhi, Gujarat, and other North Indian states, and it works fairly well for this purpose. On first look, percentages of smaller religious minorities (Parsi, Buddhist especially) don't look reasonable to me, though - the n-gram part of the algorithm in particular might lead to an overestimation of these. If you want to use any other religious categories than Muslim/non-Muslim and/or data outside the core Hindu belt, you **need** to do your own accuracy testing, **and** read my Field Methods paper as well as the various blog posts on the matter.
This problem is particularly prominent in the case of candidate and MLA categorizations, since wrong decisions by the algorithm here cannot be balanced out through aggregation. Take extra care when using these variables! Finally: should you have access to an additional test corpus of names with known religious affiliation, please contact me - any additional robustness test is more than welcome!
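The contrast can be sketched numerically: averaging per-name probabilities over a booth lets individual errors cancel, while a hard decision on a single candidate's name carries the full error of that one call. The classifier probabilities below are made up for illustration; the actual algorithm is the one described in the Field Methods paper above.

```python
# Why aggregation is safer than individual categorization. Assume a
# (hypothetical) classifier returning P(Muslim | name) per elector.
def booth_muslim_share(name_probs):
    """Expected Muslim share among a booth's electors -- errors tend to cancel."""
    return sum(name_probs) / len(name_probs)

def categorize_candidate(prob, threshold=0.5):
    """Hard individual decision -- use with extra care (see above)."""
    return "muslim" if prob >= threshold else "non-muslim"

probs = [0.9, 0.1, 0.8, 0.2, 0.55]          # made-up per-name probabilities
print(round(booth_muslim_share(probs), 2))  # 0.51
print(categorize_candidate(0.55))           # "muslim", with a 45% chance of being wrong
```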
## Problematic integration across years
The dataset uses data from various years and sources; not all of them can smoothly be integrated with each other. The Election Commission changes polling booth IDs and names once in a while and we had a delimitation exercise in 2008 with even starker impact (2007 is thus almost not integrated at all with the other years). Four main sets of ID codes stick out: 2007, 2009, 2011-13, 2014-16. As long as you only look at data within one of these four groups, you are fairly safe - but data across years had to be integrated with a fuzzy matching algorithm (that's how the id table was generated). This algorithm attempts to find the equivalent of a 2007 polling booth in a 2012 dataset based on that booth's constituency, name, number of electors, etc. Since names are sometimes given in devanagari and sometimes in latin script, and since numbers of electors vary as do constituencies before and after delimitation, all three indicators come with a certain degree of fuzziness: the same empirical entity might occur multiple times in the dataset (once with 2007 data and once with 2012 data) because the algorithm wasn't sure enough that both refer to the same thing - or the algorithm made a mistake and considered a polling booth in different years to be the same entity when in fact it wasn't.
I optimized the matching algorithm so that it integrates the three key data years to the maximum possible extent without producing sizeable errors (and programmed it to be cautious: rather integrate less than wrongly). The id table includes a number of control variables (merge_XY) which one could use as quality benchmarks; please contact me if you need to cross years and would be interested in using these benchmarks. Also, I fine-tuned this logic using only Lucknow as a test case - and cannot guarantee similarly successful results for other parts of India. Please be extra careful when correlating variables across these three main years - the subset which overlaps might be too small and the matching inaccurate.
The safest parts of the dataset with respect to this problem are variables with data years from 2011 onwards (i.e. the namematching variables and election results for 2012) and/or integrated through spatial matching (i.e. census data).
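For intuition, here is a much-simplified sketch of this kind of cautious fuzzy matching (the real logic lives in the transform.pl scripts and is considerably more elaborate); booth names, elector counts, and thresholds below are invented:

```python
from difflib import SequenceMatcher

# Sketch: match a 2007 booth to a 2012 booth only if both the name similarity
# and the elector counts are close enough; otherwise leave it unmatched
# ("rather integrate less than wrongly").
def match_booth(booth, candidates, name_cutoff=0.85, elector_tolerance=0.1):
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, booth["name"], cand["name"]).ratio()
        electors_close = (abs(booth["electors"] - cand["electors"])
                          <= elector_tolerance * booth["electors"])
        if score >= name_cutoff and electors_close and score > best_score:
            best, best_score = cand, score
    return best  # None means: kept as separate entities in the id table

b2007 = {"name": "PRIMARY SCHOOL GOMTI NAGAR", "electors": 980}
b2012 = [{"name": "PRIMARY SCHOOL GOMTINAGAR", "electors": 1010},
         {"name": "JUNIOR HIGH SCHOOL ALAMBAGH", "electors": 950}]
print(match_booth(b2007, b2012)["name"])  # PRIMARY SCHOOL GOMTINAGAR
```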
## Problems in spatial matching
Spatial matching, finally, has one central problem: the latitude/longitude data used by the Election Commission contains a number of inaccuracies, the extent of which tends to vary district by district (seemingly depending on which agency the task was outsourced to). Consequently, some booths might have been matched to the wrong taluqa and/or district polygon (which mostly affects Census data). While hard-and-fast cleaning removed the most outrageous ones, I cannot guarantee the accuracy of these data. For Lucknow and most parts of UP, they seem fine - but one never knows. You most likely need local knowledge to judge the extent of this problem.
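The failure mode can be illustrated with a minimal ray-casting point-in-polygon test: with a coordinate off by a few hundredths of a degree, a booth near a boundary lands in the neighbouring polygon even though the matching code itself is correct. The polygons and coordinates below are invented:

```python
# Minimal point-in-polygon sketch (ray casting) of the spatial matching step.
# With noisy coordinates, a booth near a boundary is assigned to the wrong
# polygon -- exactly the failure mode described above.
def point_in_polygon(x, y, polygon):
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # toggle on each edge the horizontal ray from (x, y) crosses
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

# Two made-up adjacent "district" squares sharing the x = 81.0 boundary
district_a = [(80.0, 26.0), (81.0, 26.0), (81.0, 27.0), (80.0, 27.0)]
district_b = [(81.0, 26.0), (82.0, 26.0), (82.0, 27.0), (81.0, 27.0)]
booth = (80.98, 26.5)   # true position, just inside district A
noisy = (81.02, 26.5)   # +0.04 degrees of coordinate error
print(point_in_polygon(*booth, district_a))  # True
print(point_in_polygon(*noisy, district_a))  # False -- mis-assigned to B
```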
@@ -0,0 +1 @@
.read upvidhansabha2007.sql
