Added frontpage details for Gujarat

raphael-susewind committed Mar 8, 2017
1 parent ac56831 commit 7ef1f71fac010e9b41729b0e380896991f63ff7e
@@ -17,6 +17,17 @@ ac_reserved_14 | Reservation status of that assembly segment, as assigned by the
booth_id_14 | ID code of the polling booth, as assigned by the Election Commission in 2014
station_id_14 | ID code of the polling station, i.e. the physical unit housing this polling booth (note that this is a concept not used by the Election Commission, but introduced by me - basically all polling booths with subsequent ID codes and roughly similar names are considered to fall within one station)
station_name_14 | Name of the polling station, i.e. the physical unit housing this polling booth (cleaned up to be the same across all booths within this station)
booth_name_14 | Name of the polling booth as listed on the cover sheet of the electoral rolls of 2014 (this will be in Gujarati, while station_name_14 is in English - different raw source)
address_14 | Address of the polling booth as listed on the cover sheet of the electoral rolls of 2014
district_14 | District into which this booth falls as listed on the cover sheet of the electoral rolls of 2014
taluk_14 | Taluk into which this booth falls as listed on the cover sheet of the electoral rolls of 2014
thana_14 | Police station (thana) jurisdiction into which this booth falls as listed on the cover sheet of the electoral rolls of 2014
revenue_14 | Revenue circle into which this booth falls as listed on the cover sheet of the electoral rolls of 2014
ward_14 | Ward (if urban) into which this booth falls as listed on the cover sheet of the electoral rolls of 2014
village_14 | Village / main town (if urban) into which this booth falls as listed on the cover sheet of the electoral rolls of 2014
parts_14 | 'Parts' (usually streets) covered by this booth as listed on the cover sheet of the electoral rolls of 2014
pincode_14 | Pincode of this booth as listed on the cover sheet of the electoral rolls of 2014
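
As an illustration of how the frontpage columns listed above might be queried, here is a minimal Python sketch. It builds an in-memory stand-in for the gujid table with only a handful of the documented columns; the row values, the pincode 999999 and the helper name `booths_in_pincode` are placeholders chosen for this example, not real data or part of the repository.

```python
import sqlite3

# Hypothetical stand-in for the real gujid table: only a few of the documented
# columns are created, and the single row is placeholder data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gujid (ac_id_09 INTEGER, booth_id_14 INTEGER, "
    "station_name_14 CHAR, booth_name_14 CHAR, district_14 CHAR, "
    "taluk_14 CHAR, pincode_14 INTEGER)"
)
conn.execute(
    "INSERT INTO gujid VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, 1, "Example School", "placeholder booth name",
     "Example District", "Example Taluk", 999999),
)


def booths_in_pincode(conn, pincode):
    """Return (ac_id_09, booth_id_14, booth_name_14) for every booth with this pincode."""
    cur = conn.execute(
        "SELECT ac_id_09, booth_id_14, booth_name_14 FROM gujid "
        "WHERE pincode_14 = ? ORDER BY ac_id_09, booth_id_14",
        (pincode,),
    )
    return cur.fetchall()


rows = booths_in_pincode(conn, 999999)
print(rows)
```

Against the real database, the same query would run on the gujid table after the frontpage columns have been added; keep in mind that booth_name_14 is OCR output in Gujarati and may contain recognition errors.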


## Processing

@@ -32,7 +32,7 @@ Originally, the electoral rolls were crawled in July 2014 from http://ceouttarpr
Raw data itself (electoral roll PDFs as well as the voter-by-voter name classifications derived from them) are not shared here, though, both to save space (it amounts to several GBs of binary dumps) and in light of privacy concerns (electoral rolls are public data, but I doubt that electors like to have their probable religion searchable by EPIC card number).
The subsequent processing chain is however preserved in the run-in-osc, run-in-osc-add-ngram and run-in-osc-add-ngram folders for reference. It ran on the [Oxford Advanced Research Computing cluster](https://www.arc.ox.ac.uk) and several hardcoded binary paths as well as the Torque scheduler commands are unique to this environment. After running createnamedb.pl once and putting all additional software in place, the chain was started using run.sh, which basically sparked 403 parallel processes (one for each assembly segment), in which roll PDFs were downloaded, relevant data extracted, names of electors matched to likely religion and ultimately booth-wise estimates of religious demography calculated. After the main processing chain, the run-in-osc-add-ngram chain attempted to further reduce missing_percent_14 using ngram technology (see scripts there for details). Finally there was a glitch with gender extraction; this was rectified using the run-in-osc-add-gender scripts.
The subsequent processing chain is however preserved in the run-in-osc, run-in-osc-add-ngram, run-in-osc-add-gender and run-in-arc-frontpage folders for reference. It ran on the [Oxford Advanced Research Computing cluster](https://www.arc.ox.ac.uk) and several hardcoded binary paths as well as the Torque scheduler commands are unique to this environment. After running createnamedb.pl once and putting all additional software in place, the chain was started using run.sh, which basically sparked 403 parallel processes (one for each assembly segment), in which roll PDFs were downloaded, relevant data extracted, names of electors matched to likely religion and ultimately booth-wise estimates of religious demography calculated. After the main processing chain, the run-in-osc-add-ngram chain attempted to further reduce missing_percent_14 using ngram technology (see scripts there for details). There was also a glitch with gender extraction; this was rectified using the run-in-osc-add-gender scripts. Finally, run-in-arc-frontpage added frontpage details for the gujid table.
The major problem was something else, though: Because the PDFs were corrupted, one could not simply extract non-Latin text from them as was the case in earlier years - it came out garbled. It turns out the version of Crystal Reports used in 2014 resulted in wrong ToUnicode CMaps in the PDF - an unfixable problem. Ultimately, I thus settled on an OCR solution - see pdf2list.pl for the gory details (each electoral roll is dissected into tiny TIFFs, which are then fed through tesseract).
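
The frontpage step uses the same crop-then-OCR idea: frontpage.pl cuts fixed regions out of the first page of each roll and feeds each one to tesseract. The sketch below restates that coordinate arithmetic in Python (not the Perl of the actual script). The function names `crop_box` and `gs_command` are hypothetical; the constants and Ghostscript flags are copied from frontpage.pl, and reading 842 as the A4 page height in points is an assumption.

```python
DPI = 300               # render resolution used in frontpage.pl (-r300)
POINTS_PER_INCH = 72    # PostScript points per inch
PAGE_HEIGHT_PT = 842    # hardcoded in frontpage.pl; assumed to be A4 height in points


def crop_box(left, top, right, bottom):
    """Turn a pixel box (origin top-left, at 300 dpi) into the output size in
    pixels and the translate offsets in points (origin bottom-left)."""
    width = right - left
    height = bottom - top
    offset_x = int(left / DPI * POINTS_PER_INCH)
    # Flip the vertical axis: distance from the page bottom to the box bottom.
    offset_y = int(PAGE_HEIGHT_PT - (top + height) / DPI * POINTS_PER_INCH)
    return width, height, offset_x, offset_y


def gs_command(pdf_path, box, out_path="temp.tif"):
    """Build the Ghostscript argument list that renders only the given box."""
    width, height, offset_x, offset_y = crop_box(*box)
    install = "<</Install {-%d -%d translate}>> setpagedevice" % (offset_x, offset_y)
    return [
        "gs", "-q", "-r%d" % DPI, "-dFirstPage=1", "-dLastPage=1",
        "-sDEVICE=tiffgray", "-sCompression=lzw", "-o", out_path,
        "-g%dx%d" % (width, height), "-c", install, "-f", pdf_path,
    ]


# The 'parts' region from frontpage.pl: left, top, right, bottom in pixels.
parts_box = (228, 1154, 2260, 2262)
print(crop_box(*parts_box))
print(gs_command("roll.pdf", parts_box))
```

The command list is only constructed here, not executed; "roll.pdf" is a placeholder file name. Passing the list to `subprocess.run` would avoid the shell quoting that the single-string `system()` call in the Perl script relies on.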
BIN +6.24 MB (270%) gujrolls2014/booths.sqlite.tgz
Binary file not shown.

@@ -0,0 +1,14 @@
#!/usr/bin/perl

use Parallel::ForkManager;
$pm = new Parallel::ForkManager(16);

for ($i=1;$i<=182;$i++) {
next if -e "/data/area-mnni/rsusewind/ceogujarat.nic.in/Voter-List-2014/$i/donefront";
$pm->start and next;
system('cp *.pl /data/area-mnni/rsusewind/ceogujarat.nic.in/Voter-List-2014/'.$i);
exec('cd /data/area-mnni/rsusewind/ceogujarat.nic.in/Voter-List-2014/'.$i.' && perl subcontrolfront.pl '.$i);
$pm->finish;
}

$pm->wait_all_children;
@@ -0,0 +1,114 @@
#!/usr/bin/perl -CSDA

use DBI;
use utf8;

my $constituency=$ARGV[0];
chomp $constituency;

# Connect to database and alter structure
my $dbh = DBI->connect("dbi:SQLite:dbname=$constituency.sqlite","","",{sqlite_unicode => 1});

$dbh->do ("ALTER TABLE booths ADD COLUMN name CHAR");
$dbh->do ("ALTER TABLE booths ADD COLUMN address CHAR");
$dbh->do ("ALTER TABLE booths ADD COLUMN parts CHAR");
$dbh->do ("ALTER TABLE booths ADD COLUMN village CHAR");
$dbh->do ("ALTER TABLE booths ADD COLUMN ward CHAR");
$dbh->do ("ALTER TABLE booths ADD COLUMN taluk CHAR");
$dbh->do ("ALTER TABLE booths ADD COLUMN district CHAR");
$dbh->do ("ALTER TABLE booths ADD COLUMN thana CHAR");
$dbh->do ("ALTER TABLE booths ADD COLUMN revenue CHAR");
$dbh->do ("ALTER TABLE booths ADD COLUMN pincode INTEGER");


# Iterate through frontpages
my @files= `ls *pdf`;

foreach my $file (@files) {
$file =~ /(\d+)-(\d+)/gs;
$booth=$2;
chomp ($file);

my $frontpage = `pdftotext -f 1 -l 1 -nopgbrk $file -`;

$frontpage =~ /(\d\d\d\d\d\d)/gs;
my $pincode = $1;
if ($pincode !~ /\d\d\d\d\d\d/) {undef($pincode)}

my $right=2260;
my $left=228;
my $top=1154;
my $bottom=2262;
my $width=$right-$left;
my $height=$bottom-$top;
my $bufferx=int($left/300*72);
my $buffery=int(842-($top+$height)/300*72);
system("gs -q -r300 -dFirstPage=1 -dLastPage=1 -sDEVICE=tiffgray -sCompression=lzw -o temp.tif -g".$width."x".$height." -c '<</Install {-$bufferx -$buffery translate}>> setpagedevice' -f $file");
my $parts = `tesseract -psm 4 -l guj --tessdata-dir /home/area-mnni/rsusewind/share/tessdata temp.tif stdout`;

my $right=1436;
my $left=228;
my $top=2666;
my $bottom=2733;
my $width=$right-$left;
my $height=$bottom-$top;
my $bufferx=int($left/300*72);
my $buffery=int(842-($top+$height)/300*72);
system("gs -q -r300 -dFirstPage=1 -dLastPage=1 -sDEVICE=tiffgray -sCompression=lzw -o temp.tif -g".$width."x".$height." -c '<</Install {-$bufferx -$buffery translate}>> setpagedevice' -f $file");
my $name = `tesseract -psm 4 -l guj --tessdata-dir /home/area-mnni/rsusewind/share/tessdata temp.tif stdout`;

my $right=1436;
my $left=228;
my $top=2784;
my $bottom=2918;
my $width=$right-$left;
my $height=$bottom-$top;
my $bufferx=int($left/300*72);
my $buffery=int(842-($top+$height)/300*72);
system("gs -q -r300 -dFirstPage=1 -dLastPage=1 -sDEVICE=tiffgray -sCompression=lzw -o temp.tif -g".$width."x".$height." -c '<</Install {-$bufferx -$buffery translate}>> setpagedevice' -f $file");
my $address = `tesseract -psm 4 -l guj --tessdata-dir /home/area-mnni/rsusewind/share/tessdata temp.tif stdout`;

my $right=2260;
my $left=1870;
my $top=2284;
my $bottom=2516;
my $width=$right-$left;
my $height=$bottom-$top;
my $bufferx=int($left/300*72);
my $buffery=int(842-($top+$height)/300*72);
system("gs -q -r300 -dFirstPage=1 -dLastPage=1 -sDEVICE=tiffgray -sCompression=lzw -o temp.tif -g".$width."x".$height." -c '<</Install {-$bufferx -$buffery translate}>> setpagedevice' -f $file");
my $taluk = `tesseract -psm 4 -l guj --tessdata-dir /home/area-mnni/rsusewind/share/tessdata temp.tif stdout`;

my $right=1060;
my $left=625;
my $top=2284;
my $bottom=2455;
my $width=$right-$left;
my $height=$bottom-$top;
my $bufferx=int($left/300*72);
my $buffery=int(842-($top+$height)/300*72);
system("gs -q -r300 -dFirstPage=1 -dLastPage=1 -sDEVICE=tiffgray -sCompression=lzw -o temp.tif -g".$width."x".$height." -c '<</Install {-$bufferx -$buffery translate}>> setpagedevice' -f $file");
my $boxleft = `tesseract -psm 4 -l guj --tessdata-dir /home/area-mnni/rsusewind/share/tessdata temp.tif stdout`;
$boxleft =~ s/\n\s+/\n/gs;
my @boxleft = split(/\n/,$boxleft);

my $right=1700;
my $left=1334;
my $top=2284;
my $bottom=2516;
my $width=$right-$left;
my $height=$bottom-$top;
my $bufferx=int($left/300*72);
my $buffery=int(842-($top+$height)/300*72);
system("gs -q -r300 -dFirstPage=1 -dLastPage=1 -sDEVICE=tiffgray -sCompression=lzw -o temp.tif -g".$width."x".$height." -c '<</Install {-$bufferx -$buffery translate}>> setpagedevice' -f $file");
my $boxright = `tesseract -psm 4 -l guj --tessdata-dir /home/area-mnni/rsusewind/share/tessdata temp.tif stdout`;
$boxright =~ s/\n\s+/\n/gs;
my @boxright = split(/\n/,$boxright);

$dbh->do("UPDATE booths SET pincode = ?, name = ?, address = ?, parts = ?, village = ?, thana = ?, district = ?, revenue = ?, ward = ?, taluk = ? WHERE booth = ?",undef,$pincode,$name,$address,$parts,$boxleft[0],$boxleft[1],$boxright[0],$boxright[1],$boxright[2], $taluk, $booth);
}

system("rm temp.tif");

$dbh->disconnect;
undef($dbh);
@@ -0,0 +1,27 @@
#!/bin/bash

# set cpu requirement
#PBS -l nodes=1

# set max wallclock time MAXIMUM 100 hours
#PBS -l walltime=24:00:00

# set name of job
#PBS -N gujfront

# mail alert at start, end and abortion of execution
#PBS -M raphael.susewind@area.ox.ac.uk
#PBS -m bea

# use submission environment
#PBS -V

# start job from the directory it was submitted


module load tesseract

export PATH=$HOME/bin:$PATH

cd $PBS_O_WORKDIR
perl -Mlocal::lib -I$HOME/perl5/lib/perl5 controlfront.pl $PBS_ARRAYID
@@ -0,0 +1,7 @@
#!/usr/bin/perl

my $i=$ARGV[0];

system("perl -CSDA -Mlocal::lib -I$ENV{HOME}/perl5/lib/perl5 frontpage.pl $i");
system("rm -r *.pl");
system("touch donefront");
@@ -1,7 +1,8 @@
#!/usr/bin/perl
#!/usr/bin/perl -CSDA

if (!-e "booths.sqlite") {system("tar -xzf booths.sqlite.tgz")}

use utf8;
use DBD::SQLite;

#
@@ -13,6 +14,48 @@
$dbh->do ("CREATE TABLE gujrolls2014 (id INTEGER PRIMARY KEY AUTOINCREMENT, ac_id_09 INTEGER, booth_id_14 INTEGER, electors_14 INTEGER, missing_percent_14 FLOAT, age_avg_14 FLOAT, age_stddev_14 FLOAT, age_muslim_avg_14 FLOAT, age_muslim_stddev_14 FLOAT, women_percent_14 FLOAT, women_muslim_percent_14 FLOAT, muslim_percent_14 FLOAT, buddhist_percent_14 FLOAT, age_buddhist_avg_14 FLOAT, age_buddhist_stddev_14 FLOAT, women_buddhist_percent_14 FLOAT, hindu_percent_14 FLOAT, age_hindu_avg_14 FLOAT, age_hindu_stddev_14 FLOAT, women_hindu_percent_14 FLOAT, jain_percent_14 FLOAT, age_jain_avg_14 FLOAT, age_jain_stddev_14 FLOAT, women_jain_percent_14 FLOAT, parsi_percent_14 FLOAT, age_parsi_avg_14 FLOAT, age_parsi_stddev_14 FLOAT, women_parsi_percent_14 FLOAT, sikh_percent_14 FLOAT, age_sikh_avg_14 FLOAT, age_sikh_stddev_14 FLOAT, women_sikh_percent_14 FLOAT, christian_percent_14 FLOAT, age_christian_avg_14 FLOAT, age_christian_stddev_14 FLOAT, women_christian_percent_14 FLOAT)");
$dbh->do ("INSERT INTO gujrolls2014 (ac_id_09, booth_id_14, electors_14, missing_percent_14, age_avg_14, age_stddev_14, age_muslim_avg_14, age_muslim_stddev_14, women_percent_14, women_muslim_percent_14, muslim_percent_14, buddhist_percent_14, age_buddhist_avg_14, age_buddhist_stddev_14, women_buddhist_percent_14, hindu_percent_14, age_hindu_avg_14, age_hindu_stddev_14, women_hindu_percent_14, jain_percent_14, age_jain_avg_14, age_jain_stddev_14, women_jain_percent_14, parsi_percent_14, age_parsi_avg_14, age_parsi_stddev_14, women_parsi_percent_14, sikh_percent_14, age_sikh_avg_14, age_sikh_stddev_14, women_sikh_percent_14, christian_percent_14, age_christian_avg_14, age_christian_stddev_14, women_christian_percent_14) SELECT constituency, booth, voters_total, missing_percent, age_avg, age_stddev, age_muslim_avg, age_muslim_stddev, women_percent, women_muslim_percent, muslim_percent, buddhist_percent, age_buddhist_avg, age_buddhist_stddev, women_buddhist_percent, hindu_percent, age_hindu_avg, age_hindu_stddev, women_hindu_percent, jain_percent, age_jain_avg, age_jain_stddev, women_jain_percent, parsi_percent, age_parsi_avg, age_parsi_stddev, women_parsi_percent, sikh_percent, age_sikh_avg, age_sikh_stddev, women_sikh_percent, christian_percent, age_christian_avg, age_christian_stddev, women_christian_percent FROM booths");

open (FILE, ">gujrolls2014-b.sql");

print FILE "ALTER TABLE gujid ADD COLUMN pincode_14 INTEGER;\n";
print FILE "ALTER TABLE gujid ADD COLUMN booth_name_14 CHAR;\n";
print FILE "ALTER TABLE gujid ADD COLUMN address_14 CHAR;\n";
print FILE "ALTER TABLE gujid ADD COLUMN parts_14 CHAR;\n";
print FILE "ALTER TABLE gujid ADD COLUMN village_14 CHAR;\n";
print FILE "ALTER TABLE gujid ADD COLUMN ward_14 CHAR;\n";
print FILE "ALTER TABLE gujid ADD COLUMN taluk_14 CHAR;\n";
print FILE "ALTER TABLE gujid ADD COLUMN district_14 CHAR;\n";
print FILE "ALTER TABLE gujid ADD COLUMN thana_14 CHAR;\n";
print FILE "ALTER TABLE gujid ADD COLUMN revenue_14 CHAR;\n";

print FILE "BEGIN TRANSACTION;\n";

my $sth = $dbh->prepare("SELECT * FROM booths");
$sth->execute();
while (my $row=$sth->fetchrow_hashref) {
my $parts = $row->{parts};
my $name = $row->{name};
my $address = $row->{address};
my $taluk = $row->{taluk};
$parts =~ s/\n/-/gs;
$parts =~ s/^[ \-]+//gs;
$parts =~ s/[ \-]+$//gs;
$name =~ s/\n/-/gs;
$name =~ s/^[ \-]+//gs;
$name =~ s/[ \-]+$//gs;
$address =~ s/\n/-/gs;
$address =~ s/^[ \-]+//gs;
$address =~ s/[ \-]+$//gs;
$taluk =~ s/\n/-/gs;
$taluk =~ s/^[ \-]+//gs;
$taluk =~ s/[ \-]+$//gs;
print FILE "UPDATE gujid SET pincode_14 = ".$dbh->quote($row->{pincode}).", booth_name_14 = ".$dbh->quote($name).", address_14 = ".$dbh->quote($address).", parts_14 = ".$dbh->quote($parts).", village_14 = ".$dbh->quote($row->{village}).", ward_14 = ".$dbh->quote($row->{ward}).", taluk_14 = ".$dbh->quote($taluk).", thana_14 = ".$dbh->quote($row->{thana}).", revenue_14 = ".$dbh->quote($row->{revenue}).", district_14 = ".$dbh->quote($row->{district})." WHERE ac_id_09 = ".$row->{constituency}." AND booth_id_14 = ".$row->{booth}.";\n";
}
$sth->finish();

print FILE "COMMIT;\n";

close (FILE);

#
# Finally create sqlite dump
#
