Spousal Age Gap in India. Estimate: ~2x of US.

The Older Half: Spousal Age Gap in India

Using Indian electoral roll data, we estimate the gap between the ages of husband and wife, and how the age difference varies across states and by the age of husband and wife. In particular, we use data from the following states and union territories: Andaman and Nicobar, Andhra Pradesh, Arunachal Pradesh, Dadra and Nagar Haveli, Daman and Diu, Goa, Jammu and Kashmir, Manipur, Meghalaya, Mizoram, Nagaland, and Puducherry.

The average age gap between spouses is 5.5 years (the median is 5 and the 25th percentile is 2 years), with husbands generally older than their wives. The gap is more than double that in the US, where the average gap is 2.3 years (538, CPS data). In the US, the man is older in 64% of couples; in India, the man is older in ~90% of couples.

The age gap between spouses varies across states, with a median gap of about 3 years in Jammu and Kashmir, 4 years in Manipur, Mizoram, and Dadra and Nagar Haveli, and 6 years in Puducherry and Andaman and Nicobar Islands. The spread also varies by the age of husband and wife, with the age gap being larger for older husbands.

Research Design

We exploit the fact that for married women, electoral rolls list the husband's name. The basic analysis is as follows: within each household, we find all married couples (where both spouses are alive). For each married couple, we calculate the difference between their ages. Our final dataset has the following fields: husband_age, wife_age, household_id, state, electoral_roll_year. We then normalize ages so that all ages are expressed as of 2017. Next, we draw a density plot of the differences and present the mean, median, and standard deviation. We then check whether the difference is statistically significantly different from 0. Next, we present boxplots by state. Lastly, we plot the difference as a function of the age of husband and wife.
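The steps above can be sketched roughly as follows. This is an illustrative sketch, not the project's notebook code: the sample rows are invented, the helper names (`normalize_age`, `age_gaps`, `summarize`) are hypothetical, and it assumes each age was recorded as of the row's electoral_roll_year. It uses only the standard library and computes a one-sample t statistic by hand.

```python
import math
import statistics

REFERENCE_YEAR = 2017  # ages are normalized to this year, as described above

# Hypothetical rows using field names from the final dataset (values are made up).
couples = [
    {"husband_age": 40, "wife_age": 35, "electoral_roll_year": 2017},
    {"husband_age": 30, "wife_age": 28, "electoral_roll_year": 2015},
    {"husband_age": 52, "wife_age": 44, "electoral_roll_year": 2016},
    {"husband_age": 27, "wife_age": 29, "electoral_roll_year": 2017},
]


def normalize_age(age, roll_year, reference_year=REFERENCE_YEAR):
    """Shift an age recorded in roll_year to the reference year (assumed rule)."""
    return age + (reference_year - roll_year)


def age_gaps(rows):
    """Return husband age minus wife age for each couple, after normalization."""
    gaps = []
    for row in rows:
        husband = normalize_age(row["husband_age"], row["electoral_roll_year"])
        wife = normalize_age(row["wife_age"], row["electoral_roll_year"])
        gaps.append(husband - wife)
    return gaps


def summarize(gaps):
    """Mean, median, standard deviation, and one-sample t statistic against 0."""
    mean = statistics.mean(gaps)
    sd = statistics.stdev(gaps)
    t_stat = mean / (sd / math.sqrt(len(gaps)))
    return {"mean": mean, "median": statistics.median(gaps), "sd": sd, "t": t_stat}


gaps = age_gaps(couples)
summary = summarize(gaps)
print(gaps, summary)
```

Under this normalization rule both spouses' ages shift by the same amount, so the gap itself is unchanged; normalization matters only when the gap is plotted against the age of husband or wife.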


Finding Couples takes the path to a CSV (output from pdfparser) and a maximum Levenshtein string distance (if 0, only exact matching is done), and outputs a CSV with the following fields: household_id, wife_id, wife_name, wife_age, husband_name, husband_id, husband_age, state, electoral_roll_year. Each couple is a separate row.

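Functionality

For each state file, within each household (house_no), the script uses the household_no as the key to find all members of a house and then finds all cases where husband_name == elector_name, that is, where the husband name listed for a woman matches the name of another elector in the same household.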

If there is more than one match, it writes the id to the file state_name_more_than_one_match.csv and writes multiple rows to the final CSV, one for each husband match. For example, say there is an elector named Sita whose husband name is listed as Ram, and say there are 2 electors named Ram in the household, one with id = 1234 and another with id = 2345. In this case, the script will add 2 rows to the final output: ..., sita, 1234, ram,... and ..., sita, 2345, ram, ....

If there is no exact match, the script checks all the names in the household for matches within the maximum Levenshtein distance specified by the user. Again, the script adds as many rows as there are matches and writes the ids where more than 1 match is made to state_name_more_than_one_match.csv.

If no matches are found within the maximum Levenshtein distance, the script writes the wife id to the file state_name_no_match_found.csv.
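The matching logic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the script itself: the record layout (`id`, `elector_name`, `husband_name`), the helper names (`find_husbands`, `classify`), and the rule that a woman is never matched to her own record are all hypothetical. The pure-Python `levenshtein` function stands in for the python-Levenshtein library so that the sketch has no dependencies.

```python
def levenshtein(a, b):
    """Edit distance by dynamic programming (stand-in for python-Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def find_husbands(wife, household, max_distance=0):
    """Return household members whose name matches the wife's husband_name.

    Exact matches are preferred; fuzzy matching is tried only when there is
    no exact match and max_distance > 0.
    """
    target = wife["husband_name"]
    others = [m for m in household if m["id"] != wife["id"]]  # assumed rule
    exact = [m for m in others if m["elector_name"] == target]
    if exact or max_distance == 0:
        return exact
    return [m for m in others
            if levenshtein(m["elector_name"], target) <= max_distance]


def classify(matches):
    """Label the outcome, mirroring the three kinds of output file."""
    if not matches:
        return "no_match_found"
    if len(matches) > 1:
        return "more_than_one_match"
    return "single_match"


# The Sita/Ram example: two electors named "ram" in the same household.
sita = {"id": 1, "elector_name": "sita", "husband_name": "ram"}
household = [
    sita,
    {"id": 1234, "elector_name": "ram", "husband_name": ""},
    {"id": 2345, "elector_name": "ram", "husband_name": ""},
]
matches = find_husbands(sita, household)
print([m["id"] for m in matches], classify(matches))
```

With `max_distance=1`, a husband name such as "shyam" would also match an elector recorded as "shyaam", which is the kind of spelling variation the fuzzy step is meant to absorb.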


To see the usage information, run the script with --help. The script requires the python-Levenshtein library to be installed.

cd finding-couples/
python --help
usage: [-h] [--ldcost LD_COST] [--extra-ld] [--version] [--releases]

Find gaps between the ages of husband and wife using parsed Electoral Roll
data in CSV format created by "pdfparser" tool.

positional arguments:
  FILE              a CSV file parsed by pdfparser

optional arguments:
  -h, --help        show this help message and exit
  --ldcost LD_COST  maximum Levenshtein distance cost accepted for matching
                    husband name (default is 0)
  --extra-ld        +1 Levenshtein distance if husband name length longer than
  --version         show program's version number and exit
  --releases        display release notes and exit


# Find gaps of a state in CSV using default settings
python /path/to/electoral_rolls/state.csv

# Specify maximum Levenshtein distance cost for name searching
python /path/to/electoral_rolls/state.csv --ldcost 1



The parsed electoral rolls can be found here.

Our final state-wise datasets are posted here. Each state folder has 6 files: 3 exact-match files (state_name_couples_exact_match_lev_0.csv, state_name_more_than_one_match_lev_0.csv, and state_name_no_match_found_lev_0.csv) and 3 files for a Levenshtein distance of 1 (with the lev_1 suffix).


Suriyan Laohaprapanon and Gaurav Sood
