Added UP namematching for 2015 and 2016 rolls
raphael-susewind committed Jan 11, 2017
1 parent 22a6525 commit a1033d8
Showing 21 changed files with 279,324 additions and 0 deletions.
177 changes: 177 additions & 0 deletions uprolls2015/LICENSE.md

Large diffs are not rendered by default.

46 changes: 46 additions & 0 deletions uprolls2015/README.md
@@ -0,0 +1,46 @@
# Data on religion and politics in India

## uprolls2015

This table contains booth-level estimates of religious demography, based on the connotations of electors' names in the electoral rolls of Uttar Pradesh (revision 2015) and computed with an optimized version of my [name2community](https://github.com/raphael-susewind/name2community) algorithm.

## Variables

name | description
--- | ---
id | unique code for each row, in case one ever needs it
ac_id_09 | ID code of the assembly segment assigned by the Election Commission (identical with all other post-delimitation codes, hence the _09)
booth_id_15 | ID code of the polling booth assigned by the Election Commission (identical to the 2014 booth IDs; together with ac_id_09, this should suffice for matching with other tables)
electors_15 | Number of registered electors
missing_percent_15 | Percentage of electors whose names could not be matched by the algorithm (one crude aggregate measure of reliability)
hindu_percent_15 | Estimated percentage of electors who are Hindu
muslim_percent_15 | Estimated percentage of electors who are Muslim
christian_percent_15 | Estimated percentage of electors who are Christian (be aware that accuracy of the algorithm has only been tested for Hindu and Muslim names, not Christian ones)
sikh_percent_15 | Estimated percentage of electors who are Sikh (be aware that accuracy of the algorithm has only been tested for Hindu and Muslim names, not Sikh ones)
jain_percent_15 | Estimated percentage of electors who are Jain (be aware that accuracy of the algorithm has only been tested for Hindu and Muslim names, not Jain ones)
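buddhist_percent_15 | Estimated percentage of electors who are Buddhist (be aware that accuracy of the algorithm has only been tested for Hindu and Muslim names, not Buddhist ones)
parsi_percent_15 | Estimated percentage of electors who are Parsi (be aware that accuracy of the algorithm has only been tested for Hindu and Muslim names, not Parsi ones)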
age_avg_15 | Average age of all electors
age_stddev_15 | Standard deviation of the age distribution of all electors
women_percent_15 | Percentage of female electors among all electors
age_*_avg_15 | Average age of electors estimated to be * (Hindu / Muslim / Christian / Sikh / Jain / Buddhist / Parsi)
age_*_stddev_15 | Standard deviation of the age distribution of electors estimated to be * (Hindu / Muslim / Christian / Sikh / Jain / Buddhist / Parsi)
women_*_percent_15 | Percentage of female electors among electors estimated to be * (Hindu / Muslim / Christian / Sikh / Jain / Buddhist / Parsi)
revision_percent_new_15 | Percentage of electors added to this booth's rolls in 2015, against the baseline of 2014
revision_percent_deleted_15 | Percentage of electors deleted from this booth's rolls in 2015, against the baseline of 2014
revision_percent_modified_15 | Percentage of electors modified in this booth's rolls in 2015, against the baseline of 2014
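
Since every percentage is a booth-level figure, aggregating to the assembly segment requires weighting each booth by `electors_15`. A minimal Python sketch, assuming the table has been loaded into an SQLite database under the name `uprolls2015` (the name used in transform.pl); the function name and the 20% cut-off on `missing_percent_15` are illustrative assumptions, not recommendations.

```python
import sqlite3


def segment_muslim_share(conn, max_missing=20.0):
    """Return {ac_id_09: elector-weighted muslim_percent_15}.

    Booths whose missing_percent_15 exceeds max_missing (or is NULL)
    are left out; the threshold is an illustrative assumption.
    """
    rows = conn.execute(
        """
        SELECT ac_id_09,
               SUM(muslim_percent_15 * electors_15) / SUM(electors_15)
        FROM uprolls2015
        WHERE missing_percent_15 <= ?
          AND electors_15 > 0
          AND muslim_percent_15 IS NOT NULL
        GROUP BY ac_id_09
        ORDER BY ac_id_09
        """,
        (max_missing,),
    ).fetchall()
    return dict(rows)
```

It would be called with an open connection, for example `segment_muslim_share(sqlite3.connect("main.sqlite"))`, where the file name is a placeholder. An unweighted average of booth percentages would overstate the influence of small booths.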

## Raw data

The electoral rolls were crawled in August 2016 from http://ceouttarpradesh.nic.in/_RollPDF.aspx using run-in-osc/downloadpdf.pl. The "last updated on" entry on the rolls' cover sheet reads "1/1/2014", but the PDFs include later additions from the 2015 and 2016 roll revisions (each dated 1 January of the respective year).

The raw data (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) is not shared here, both to save space (it amounts to several GB of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors would like to have their probable religion searchable by EPIC card number).

The subsequent processing chain is however preserved in the run-in-arc folder of the [uprolls2016](https://github.com/raphael-susewind/india-religion-politics/blob/master/uprolls2016) table (from where it originally ran, processing 2015 and 2016 in one go).

Fortunately, at least booth IDs did not change between 2014 and 2015, so this table uses the same IDs as those in use for the 2014 General Elections.
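
Because the booth IDs are stable, a 2015 row can be paired with 2014 data on assembly segment plus booth ID. A hedged Python sketch: the 2014 table name `other2014` and its booth column `booth_id_14` are hypothetical placeholders, while `booth_id_15` is the column name used in the `CREATE TABLE` statement of transform.pl.

```python
import sqlite3


def join_with_2014(conn, table_2014="other2014", booth_col_2014="booth_id_14"):
    """Pair each 2015 booth with the same booth's row in a 2014 table.

    table_2014 and booth_col_2014 are hypothetical names. Identifiers
    cannot be bound as SQL parameters, so they are checked before being
    interpolated into the query.
    """
    for name in (table_2014, booth_col_2014):
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unsafe identifier: {name!r}")
    query = f"""
        SELECT r.ac_id_09, r.booth_id_15, r.electors_15, o.*
        FROM uprolls2015 AS r
        JOIN {table_2014} AS o
          ON o.ac_id_09 = r.ac_id_09
         AND o.{booth_col_2014} = r.booth_id_15
        ORDER BY r.ac_id_09, r.booth_id_15
    """
    return conn.execute(query).fetchall()
```

This is an inner join, so booths that appear in only one of the two tables are dropped from the result.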

The final step of pulling everything together is handled by combine.pl, which merges the per-constituency databases into one large booths.sqlite file, shared here as a .tgz archive (even this is quite large). transform.pl then renames the table and its variables and generates the SQL code (uprolls2015.sql) used to put the table into the main database, from which the CSV dumps are subsequently created.
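
To look inside the shared archive without running the Perl scripts, it can be unpacked and queried directly. A minimal Python sketch using only the standard library; the function name and default paths are illustrative, and it assumes, as combine.pl and transform.pl imply, that the archive holds a single `booths.sqlite` file containing a `booths` table.

```python
import sqlite3
import tarfile
from pathlib import Path


def booths_columns(archive="booths.sqlite.tgz", workdir="."):
    """Extract booths.sqlite from the archive and list the columns of `booths`."""
    workdir = Path(workdir)
    with tarfile.open(archive, "r:gz") as tar:
        # Extract only the one expected member rather than the whole archive.
        member = tar.getmember("booths.sqlite")
        tar.extract(member, path=workdir)
    conn = sqlite3.connect(str(workdir / "booths.sqlite"))
    try:
        # PRAGMA table_info yields one row per column; index 1 is the name.
        return [row[1] for row in conn.execute("PRAGMA table_info(booths)")]
    finally:
        conn.close()
```

The column names returned here are the raw ones (for instance `constituency`, `booth`, `voters_total`), which transform.pl maps to the documented variable names.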

## License

While the database in its entirety is subject to an [ODC Open Database License](http://opendatacommons.org/licenses/odbl/), as explained in the main [README](https://github.com/raphael-susewind/india-religion-politics/blob/master/README.md) and [LICENSE](https://github.com/raphael-susewind/india-religion-politics/blob/master/LICENSE.md) files, the content of this specific table as well as code used for crawling and compilation is subject to a [CC-BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license: you can use it for non-commercial purposes as long as you attribute and share any additions or modifications on equal terms.
Binary file added uprolls2015/booths.sqlite.tgz
Binary file not shown.
24 changes: 24 additions & 0 deletions uprolls2015/combine.pl
@@ -0,0 +1,24 @@
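#!/usr/bin/perl

# Merge the per-constituency SQLite files (1/1.sqlite ... 403/403.sqlite)
# into one booths.sqlite and pack it as booths.sqlite.tgz.

use strict;
use warnings;

system("rm -f booths.sqlite");

my $created = 0; # set once the booths table has been created in booths.sqlite

for my $i (1..403) {
    next if !-e $i;
    # Dump this constituency's database as SQL text
    system("cd $i && echo '.dump' | sqlite3 $i.sqlite > $i.sql");
    open (my $in, '<', "$i/$i.sql") or die "Cannot read $i/$i.sql: $!";
    my @file = <$in>;
    close ($in);
    open (my $out, '>', "$i/$i.sql") or die "Cannot write $i/$i.sql: $!";
    my $insert = '';
    foreach my $line (@file) {
        if ($line =~ /^CREATE TABLE booths (.*?);/) {
            # Strip the types to turn the column definitions into a bare column list
            $insert = $1;
            $insert =~ s/ (?:CHAR|FLOAT|INTEGER)//g;
            # Keep the CREATE TABLE statement only the first time it is seen
            next if $created++;
        }
        # Name the columns explicitly in each INSERT (the table name may or
        # may not be quoted, depending on the sqlite3 version)
        $line =~ s/^INSERT INTO "?booths"?(?= )/INSERT INTO "booths" $insert/;
        print $out $line;
    }
    close ($out);
    system("cd $i && cat $i.sql | sqlite3 ../booths.sqlite");
}

system("tar -czf booths.sqlite.tgz booths.sqlite");

system("rm -f booths.sqlite");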
26 changes: 26 additions & 0 deletions uprolls2015/transform.pl
@@ -0,0 +1,26 @@
#!/usr/bin/perl

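use strict;
use warnings;
use DBI;
use DBD::SQLite;

# Unpack the archive produced by combine.pl if necessary
if (!-e "booths.sqlite") {system("tar -xzf booths.sqlite.tgz")}

#
# Create and populate temporary tables with proper table and variable names
#

# Work on an in-memory copy of booths.sqlite; RaiseError aborts on any SQL error
my $dbh = DBI->connect("DBI:SQLite:dbname=:memory:", "","", {sqlite_unicode=>1, RaiseError=>1});
$dbh->sqlite_backup_from_file('booths.sqlite');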
$dbh->do ("CREATE TABLE uprolls2015 (id INTEGER PRIMARY KEY AUTOINCREMENT, ac_id_09 INTEGER, booth_id_15 INTEGER, electors_15 INTEGER, missing_percent_15 FLOAT, age_avg_15 FLOAT, age_stddev_15 FLOAT, age_muslim_avg_15 FLOAT, age_muslim_stddev_15 FLOAT, women_percent_15 FLOAT, women_muslim_percent_15 FLOAT, muslim_percent_15 FLOAT, buddhist_percent_15 FLOAT, age_buddhist_avg_15 FLOAT, age_buddhist_stddev_15 FLOAT, women_buddhist_percent_15 FLOAT, hindu_percent_15 FLOAT, age_hindu_avg_15 FLOAT, age_hindu_stddev_15 FLOAT, women_hindu_percent_15 FLOAT, jain_percent_15 FLOAT, age_jain_avg_15 FLOAT, age_jain_stddev_15 FLOAT, women_jain_percent_15 FLOAT, parsi_percent_15 FLOAT, age_parsi_avg_15 FLOAT, age_parsi_stddev_15 FLOAT, women_parsi_percent_15 FLOAT, sikh_percent_15 FLOAT, age_sikh_avg_15 FLOAT, age_sikh_stddev_15 FLOAT, women_sikh_percent_15 FLOAT, christian_percent_15 FLOAT, age_christian_avg_15 FLOAT, age_christian_stddev_15 FLOAT, women_christian_percent_15 FLOAT, revision_percent_new_15 FLOAT, revision_percent_deleted_15 FLOAT, revision_percent_modified_15 FLOAT)");
$dbh->do ("INSERT INTO uprolls2015 (ac_id_09, booth_id_15, electors_15, missing_percent_15, age_avg_15, age_stddev_15, age_muslim_avg_15, age_muslim_stddev_15, women_percent_15, women_muslim_percent_15, muslim_percent_15, buddhist_percent_15, age_buddhist_avg_15, age_buddhist_stddev_15, women_buddhist_percent_15, hindu_percent_15, age_hindu_avg_15, age_hindu_stddev_15, women_hindu_percent_15, jain_percent_15, age_jain_avg_15, age_jain_stddev_15, women_jain_percent_15, parsi_percent_15, age_parsi_avg_15, age_parsi_stddev_15, women_parsi_percent_15, sikh_percent_15, age_sikh_avg_15, age_sikh_stddev_15, women_sikh_percent_15, christian_percent_15, age_christian_avg_15, age_christian_stddev_15, women_christian_percent_15, revision_percent_new_15, revision_percent_deleted_15, revision_percent_modified_15) SELECT constituency, booth, voters_total, missing_percent, age_avg, age_stddev, age_muslim_avg, age_muslim_stddev, women_percent, women_muslim_percent, muslim_percent, buddhist_percent, age_buddhist_avg, age_buddhist_stddev, women_buddhist_percent, hindu_percent, age_hindu_avg, age_hindu_stddev, women_hindu_percent, jain_percent, age_jain_avg, age_jain_stddev, women_jain_percent, parsi_percent, age_parsi_avg, age_parsi_stddev, women_parsi_percent, sikh_percent, age_sikh_avg, age_sikh_stddev, women_sikh_percent, christian_percent, age_christian_avg, age_christian_stddev, women_christian_percent, revision15_percent_new, revision15_percent_deleted, revision15_percent_modified FROM booths");

#
# Finally create sqlite dump
#

print "Create SQL dump\n";

$dbh->sqlite_backup_to_file("temp.sqlite");

system("sqlite3 temp.sqlite '.dump uprolls2015' > uprolls2015.sql");

system("rm temp.sqlite booths.sqlite");
