### Q6.  (10 marks) For each state, district and overall, find the following ratios: total number of females vaccinated (either 1 or 2 doses) to total number of males vaccinated (same). For that district/state/country, find the ratio of population of females to males. (If a district is absent in 2011 census, drop it from analysis.) Now find the ratio of the two ratios, i.e., vaccination ratio to population ratio. Output them in the following manner: districtid, vaccinationratio, populationratio, ratioofratios. Call this output file vaccination-population-ratio.csv and the script/program to generate this vaccination-population-ratio-generator.sh. Sort the output by the final ratio.
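As a minimal sketch of the quantities asked for here, the three ratios can be computed from four counts. The function name `gender_ratios` and the sample counts are hypothetical, chosen only for illustration; they are not taken from the datasets.

```python
import math

def gender_ratios(females_vaccinated, males_vaccinated, females, males):
    """Return (vaccinationratio, populationratio, ratioofratios).

    Any ratio whose denominator is zero is reported as NaN.
    """
    vaccination_ratio = females_vaccinated / males_vaccinated if males_vaccinated else math.nan
    population_ratio = females / males if males else math.nan
    if population_ratio == 0 or math.isnan(population_ratio):
        ratio_of_ratios = math.nan
    else:
        ratio_of_ratios = vaccination_ratio / population_ratio
    return vaccination_ratio, population_ratio, ratio_of_ratios

# hypothetical counts, for illustration only
vacc, pop, ror = gender_ratios(450, 500, 900, 1000)
print(vacc, pop, ror)
```

A ratio of ratios below 1 means that women are under-represented among the vaccinated relative to their share of the population; a value above 1 means the opposite.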

# 1. Importing the necessary libraries

In [1]:
import json
import numpy as np
import pandas as pd
import datetime
from dateutil import relativedelta  # used for handling dates and doing relative arithmetic

# 2. Load the cleaned vaccination data (done in Q1)

In [2]:
# Load the cowin_vaccine_data_districtwise.csv
cowin_vaccine_data_districtwise = pd.read_csv('./dataset/cowin_vaccine_data_districtwise_clean.csv', dtype='string')
cowin_vaccine_data_districtwise.head()

# convert the columns holding numeric values to a numeric dtype
cowin_vaccine_data_districtwise.iloc[:, 4:] = cowin_vaccine_data_districtwise.iloc[:, 4:].apply(
    pd.to_numeric, errors='ignore')
print('The datatypes of the columns containing numeric values have been changed from string to numeric')

The datatypes of the columns containing numeric values have been changed from string to numeric


# 3. Load census data and clean

In [3]:
# Load the census data
census_data = pd.read_csv('./dataset/DDW_PCA0000_2011_Indiastatedist.csv', dtype='string')

## (i) Discard the columns which are not useful for our purpose
- Remove the rows containing data for rural and urban population separately. Keep only total population.
- Keep only relevant columns: 'State', 'Level', 'Name', 'TOT_P', 'TOT_M', 'TOT_F'

In [4]:
# Keep only total counts data, i.e. discard separate counts of rural and urban population
census_data = census_data[census_data['TRU'] == 'Total']
# Keep relevant columns only
census_data = census_data[['State', 'Level', 'Name', 'TOT_P', 'TOT_M', 'TOT_F']]

## (ii) Convert columns containing numbers to numeric datatype
All columns were read as strings, so we convert the columns that hold counts (TOT_P, TOT_M, TOT_F) from string to a numeric datatype.

In [5]:
# Convert the columns containing numbers to numeric datatype
census_data.iloc[:, 3:] = census_data.iloc[:, 3:].apply(pd.to_numeric, errors='ignore')
print('The datatypes of the columns containing numeric values have been changed from string to numeric')
print(census_data.dtypes)

The datatypes of the columns containing numeric values have been changed from string to numeric
State    string
Level    string
Name     string
TOT_P     int64
TOT_M     int64
TOT_F     int64
dtype: object


## (iii) Change the state names in the census data as per the vaccination data

The state names in the census data are a bit different from the state names in the vaccination data.

The following changes will be done:
- NCT OF DELHI --> Delhi
- DAMAN AND DIU --> DADRA AND NAGAR HAVELI AND DAMAN AND DIU
- DADRA AND NAGAR HAVELI --> DADRA AND NAGAR HAVELI AND DAMAN AND DIU
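The renaming rules above can also be written as a single lookup. The following is only a sketch: the helper `normalise_state_name` and the dictionary `STATE_RENAMES` are hypothetical names, and the sample inputs assume that the census file spells these names in upper case with '&'.

```python
# Special cases listed above; keys are the names after title-casing and replacing '&'
STATE_RENAMES = {
    'Nct Of Delhi': 'Delhi',
    'Daman and Diu': 'Dadra and Nagar Haveli and Daman and Diu',
    'Dadra and Nagar Haveli': 'Dadra and Nagar Haveli and Daman and Diu',
}

def normalise_state_name(census_name):
    """Convert a census state name to the spelling used in the vaccine data."""
    # title-case first, then replace '&' so that the inserted 'and' stays lower case
    name = census_name.strip().title().replace('&', 'and')
    return STATE_RENAMES.get(name, name)

# assumed spellings, for illustration only
for raw in ['NCT OF DELHI', 'JAMMU & KASHMIR', 'DAMAN & DIU']:
    print(raw, '->', normalise_state_name(raw))
```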

In [6]:
# Iterate over the rows in census_data dataframe
for i, row in census_data.iterrows():

    # Update the names of states
    if(row['Level'] == 'STATE'):
        
        # Change the state name to title format. For ex. 'JAMMU & KASHMIR' to 'Jammu & Kashmir'
        # Replace any occurrence of '&' by 'and', since 'and' is used in vaccine data
        census_data.at[i, 'Name'] = census_data.at[i, 'Name'].title()
        census_data.at[i, 'Name'] = census_data.at[i, 'Name'].replace('&', 'and')
        
        # Change the state name for Delhi from 'Nct Of Delhi' to 'Delhi'
        if(census_data.at[i, 'Name'] == 'Nct Of Delhi'):
            census_data.at[i, 'Name'] = census_data.at[i, 'Name'].replace('Nct Of Delhi', 'Delhi')
            
        # Change the state name of 'Daman and Diu' and 'Dadra and Nagar Haveli'
        if(census_data.at[i, 'Name'] == 'Daman and Diu'):
            census_data.at[i, 'Name'] = census_data.at[i, 'Name'].replace('Daman and Diu', 'Dadra and Nagar Haveli and Daman and Diu')
        if(census_data.at[i, 'Name'] == 'Dadra and Nagar Haveli'):
            census_data.at[i, 'Name'] = census_data.at[i, 'Name'].replace('Dadra and Nagar Haveli', 'Dadra and Nagar Haveli and Daman and Diu')
    
    if(census_data.at[i, 'State'] == '26'):
        # Change the state code of 'Dadra and Nagar Haveli' from '26' to '25' (same as Daman and Diu)
        census_data.at[i, 'State'] = '25'

## (iv) Merge the data for all components of Delhi into a single district 'Delhi'

In [7]:
# find the stateid for Delhi (it is '07' in the census_data)
delhi_stateid = census_data[(census_data['Level'] == 'STATE') & (census_data['Name'] == 'Delhi')]['State'].values[0]

# get the row indexes for districts under 'Delhi' state
delhi_districts_index = census_data[(census_data['Level'] == 'DISTRICT') & (census_data['State'] ==  delhi_stateid)].index

# sum the numeric columns over all Delhi districts
# (use .loc here: the index holds row labels, which are no longer positional after filtering)
delhi_aggregate_data = list(census_data.loc[delhi_districts_index, ['TOT_P', 'TOT_M', 'TOT_F']].sum())

# add the aggregate row for 'Delhi' under a temporary label; the index is reset below
census_data.loc[-1] = [delhi_stateid, 'DISTRICT', 'Delhi'] + delhi_aggregate_data

# drop all the indexes corresponding to the 'Delhi' components and then reset indexes
census_data.drop(delhi_districts_index, axis=0, inplace=True) 
census_data.reset_index(drop=True, inplace=True)  # reset indexes

print('Operation Successful')
print('Components of Delhi merged and replaced by single district name "Delhi"')

Operation Successful
Components of Delhi merged and replaced by single district name "Delhi"


## (v) Merge the data for Daman and Diu & Dadra and Nagar Haveli

In [8]:
# find the state id for 'Dadra and Nagar Haveli and Daman and Diu'
dadra_nagar_haveli_daman_diu_stateid = census_data[(census_data['Level'] == 'STATE') & (census_data['Name'] == 'Dadra and Nagar Haveli and Daman and Diu')]['State'].values[0]

# get the row indexes for the state 'Dadra and Nagar Haveli' and 'Daman and Diu'
dadra_nagar_haveli_daman_diu_index = census_data[(census_data['Level'] == 'STATE') & (census_data['State'] ==  dadra_nagar_haveli_daman_diu_stateid)].index

# get the sum all the data of 'Dadra and Nagar Haveli' and 'Daman and Diu' for the numeric columns
aggregate_data = list(census_data.iloc[dadra_nagar_haveli_daman_diu_index, 3:].sum())

# insert the aggregate data to our dataframe, append at last and increment index
census_data.loc[-1] = ['25', 'STATE', 'Dadra and Nagar Haveli and Daman and Diu'] + aggregate_data

# drop the old indexes corresponding to the state
census_data.drop(dadra_nagar_haveli_daman_diu_index, axis=0, inplace=True) 
census_data.reset_index(drop=True, inplace=True)  # reset indexes

## (vi) Split Andhra Pradesh into Andhra Pradesh and Telangana

In [9]:
################# Add entry for Telangana #################

# The following districts are now part of 'Telangana' state
tl_districts = ['Adilabad', 'Hyderabad', 'Karimnagar', 'Khammam', 'Mahbubnagar', 'Medak', 'Nalgonda', 'Nizamabad', 'Rangareddy', 'Warangal']
tot_p = 0
tot_m = 0
tot_f = 0

# Iterate over the rows in census_data dataframe
for i, row in census_data.iterrows():
    if(census_data.at[i, 'State'] == '28'):
        if(census_data.at[i, 'Name'] in tl_districts):
            census_data.at[i, 'State'] = '26'
            tot_p += row['TOT_P']
            tot_m += row['TOT_M']
            tot_f += row['TOT_F']

census_data.loc[-1] = ['26', 'STATE', 'Telangana', tot_p, tot_m, tot_f] 
census_data.index += 1

################# Update data for Andhra Pradesh ##################

# Iterate over the rows in census_data dataframe
for i, row in census_data.iterrows():
    if(row['State'] == '28' and row['Level'] == 'STATE'):
        census_data.at[i, 'TOT_P'] = row['TOT_P'] - tot_p
        census_data.at[i, 'TOT_M'] = row['TOT_M'] - tot_m
        census_data.at[i, 'TOT_F'] = row['TOT_F'] - tot_f

## (vii) Update the district names for 'Sikkim' state

In [10]:
sikkim_stateid = census_data[(census_data['Level'] == 'STATE') & (census_data['Name'] == 'Sikkim')]['State'].values[0]

# Iterate over the rows in census_data dataframe
for i, row in census_data.iterrows():
    
    # update the district names
    if(census_data.at[i, 'State'] == sikkim_stateid and census_data.at[i, 'Level'] == 'DISTRICT'):
        census_data.at[i, 'Name'] = census_data.at[i, 'Name'].replace('North  District', 'North Sikkim')
        census_data.at[i, 'Name'] = census_data.at[i, 'Name'].replace('South District', 'South Sikkim')
        census_data.at[i, 'Name'] = census_data.at[i, 'Name'].replace('East District', 'East Sikkim')
        census_data.at[i, 'Name'] = census_data.at[i, 'Name'].replace('West District', 'West Sikkim')

## (ix) Find the unique districts in the vaccine data and the census data

In [11]:
# Find the unique district names in the vaccine data
district_names_from_vaccine_data = cowin_vaccine_data_districtwise['District'].dropna().unique()
district_names_from_vaccine_data = [district_names.lower() for district_names in district_names_from_vaccine_data]
print('Number of unique districts in vaccine data =', len(district_names_from_vaccine_data))

# Find the unique district_names in census_data
# Note: A lot of district names in the census data have trailing whitespace, remove it using strip()
district_names_from_census_data = census_data[census_data['Level'] == 'DISTRICT']['Name'].dropna().unique()
district_names_from_census_data = [district_names.lower().strip() for district_names in district_names_from_census_data]
print('Number of unique districts in district census data =', len(district_names_from_census_data))

Number of unique districts in vaccine data = 714
Number of unique districts in district census data = 626


## (x) Find the common district names between district_census_data and vaccine data

In [12]:
common_districts_vaccine_and_census = set(district_names_from_census_data).intersection(district_names_from_vaccine_data)
print('There are', len(common_districts_vaccine_and_census), 'districts common between the vaccine data and district census data')

There are 540 districts common between the vaccine data and district census data
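The matching in this step and the next reduces to set operations on normalised names. A small sketch follows; the helper `normalise` and both sample lists are hypothetical, not taken from the datasets.

```python
def normalise(name):
    """Lowercase, trim, and collapse runs of whitespace to one space."""
    return ' '.join(name.lower().split())

census_names = ['Gurgaon ', 'North  District', 'Pune']   # hypothetical sample
vaccine_names = ['Gurugram', 'North District', 'Pune']   # hypothetical sample

census_set = {normalise(n) for n in census_names}
vaccine_set = {normalise(n) for n in vaccine_names}

common = census_set & vaccine_set          # names present in both sources
only_in_census = census_set - vaccine_set  # names that still need a manual rename

print(sorted(common))
print(sorted(only_in_census))
```

Normalising before comparing matters: without it, 'North  District' (two spaces) and 'North District' would count as different districts.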


## (xi) Find the district in district_census_data that are not matching in vaccine data

In [13]:
# Find the districts in district_census_data that are not in the vaccine data
districts_not_in_vaccine_data = set(district_names_from_census_data) - set(district_names_from_vaccine_data)
print('\nThere are', len(districts_not_in_vaccine_data),
      'Districts not in vaccine data = ', districts_not_in_vaccine_data)


There are 86 Districts not in vaccine data =  {'jyotiba phule nagar', 'kanniyakumari', 'darjiling', 'khargone (west nimar)', 'debagarh', 'banas kantha', 'mahamaya nagar', 'buldana', 'north twenty four parganas', 'sabar kantha', 'jalor', 'sri potti sriramulu nellore', 'purbi singhbhum', 'kachchh', 'dibang valley', 'narsimhapur', 'kheri', 'chittaurgarh', 'bagalkot', 'pashchimi singhbhum', 'chikmagalur', 'shupiyan', 'bellary', 'shimoga', 'koch bihar', 'bangalore rural', 'mumbai suburban', 'gurgaon', 'mahrajganj', 'sahibzada ajit singh nagar', 'dadra & nagar haveli', 'barddhaman', 'haora', 'baudh', 'gondiya', 'gulbarga', 'baramula', 'south twenty four parganas', 'warangal', 'dhaulpur', 'hugli', 'badgam', 'mahbubnagar', 'ahmadnagar', 'mysore', 'hardwar', 'kaimur (bhabua)', 'maldah', 'the dangs', 'y.s.r.', 'tumkur', 'garhwal', 'kodarma', 'lahul & spiti', 'chamarajanagar', 'muktsar', 'purba champaran', 'faizabad', 'bandipore', 'bara banki', 'firozpur', 'jhunjhunun', 'mewat', 'allahabad', 'be

## (xii) For each unmatched district in district_census_data, find the closest matching district name in the vaccine data
- Many district names have changed in recent years, so the names in district_census_data must be updated to match the vaccine data.

- To find candidates, we compare each unmatched district name from district_census_data with every district name in the vaccine data, using the length of their longest common subsequence (LCS) as a similarity heuristic.

- For each unmatched district, we list the vaccine-data names with the maximum LCS length and use these suggestions to modify the district names in district_census_data.
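To make the heuristic concrete, here is a self-contained sketch in plain Python. The names `lcs_length` and `best_matches` and the short candidate list are hypothetical and serve only as an illustration.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of strings x and y."""
    m, n = len(x), len(y)
    # dp[i][j] = LCS length of x[:i] and y[:j]; row 0 and column 0 stay 0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def best_matches(name, candidates):
    """Return every candidate that attains the maximum LCS length with name."""
    scores = {c: lcs_length(name, c) for c in candidates}
    best = max(scores.values())
    return [c for c, s in scores.items() if s == best]

candidates = ['gurugram', 'gurdaspur', 'nuh']  # hypothetical short list
print(lcs_length('gurgaon', 'gurugram'))
print(best_matches('gurgaon', candidates))
```

The output is only a suggestion. Renamed districts such as 'allahabad' (now 'prayagraj') need not resemble their new names, so each proposed match should be reviewed by hand.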

In [14]:
def lcs(x, y):
    '''
    This function is the dynamic programming implementation of longest-common-subsequence.
    Input: strings x and y
    Output: length of longest-common-subsequence
    '''
    m, n = len(x), len(y)  # m and n contain the length of strings x and y respectively
    dp = np.zeros((m+1, n+1), dtype='int64')  # 2d array for dp initialized with zeros
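    # start both loops at 1: row 0 and column 0 are the empty-prefix base case and must stay 0
    # (starting at 0 would read x[-1] / y[-1] and wrap around to the last row and column)
    for i in range(1, m + 1):
        for j in range(1, n + 1):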
            # if the characters match then current cell will be diagonally previous cell value + 1
            if(x[i-1] == y[j-1]):
                dp[i][j] = dp[i-1][j-1] + 1
            # else find the max of cell on the left and the cell above
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    # return the bottom right cell value that contains the lcs of x and y
    return dp[m][n]

In [15]:
def find_matches():
    '''
    This function finds the closest district name match in cowin data
    for each unmatched district in district_census_data
    '''
    
    # for each district in district_census_data that didn't match in vaccine data
    for unmatched_district in districts_not_in_vaccine_data:
        # compute the lcs length with every vaccine-data district once
        lcs_lengths = {x: lcs(unmatched_district, x) for x in district_names_from_vaccine_data}
        # find the size of the longest match
        longest_size_match = max(lcs_lengths.values())
        # collect the district names that have the max size
        # common subsequence with the unmatched district
        similar_districts = [district for district, length in lcs_lengths.items()
                             if length == longest_size_match]
        print('{', unmatched_district, '} ~ ', similar_districts)

## **NOTE: Run below cell only if you want to check the LCS heuristic output**

In [16]:
print('The following unmatched districts are having max lcs with the following districts:')

'''NOTE: Uncomment below line to execute the lcs heuristic'''

# find_matches()

The following unmatched districts are having max lcs with the following districts:


'NOTE: Uncomment below line to execute the lcs heuristic'

## (xiii) Modify the district names in the district_census_data

Based on the output of the longest-common-subsequence heuristic above, we will replace the old district names with their current names.
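One compact way to express such renames is an exact-match dictionary lookup applied after normalising the name. This is a sketch only: `DISTRICT_RENAMES` holds just a few of the pairs from this step, and `clean_district_name` is a hypothetical helper.

```python
# A few old-name -> new-name pairs; the complete list is much longer
DISTRICT_RENAMES = {
    'gurgaon': 'gurugram',
    'bangalore': 'bengaluru',
    'bangalore rural': 'bengaluru rural',
    'north twenty four parganas': 'north 24 parganas',
}

def clean_district_name(raw_name):
    """Lowercase, trim, collapse repeated spaces, then apply an exact-match rename."""
    name = ' '.join(raw_name.lower().split())
    return DISTRICT_RENAMES.get(name, name)

for raw in [' Gurgaon ', 'Bangalore Rural', 'North  Twenty Four Parganas', 'Pune']:
    print(repr(raw), '->', clean_district_name(raw))
```

Looking up the whole name, rather than replacing substrings, means the 'bangalore' rule cannot partially rewrite 'bangalore rural'.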

In [17]:
# define the modifications to be done, the below list contains pair of values
# the first value in pair is the old name and the second value in pair is the new name
district_modifications = [
    ['  ', ' '],  # some district names have extra whitespace, replace it by a single whitespace
    ['mahbubnagar', 'mahabubnagar'],
    ['rangareddy', 'ranga reddy'],
    ['sri potti sriramulu nellore', 's.p.s. nellore'],
    ['y.s.r.', 'y.s.r. kadapa'],
    ['kaimur (bhabua)', 'kaimur'],
    ['pashchim champaran', 'west champaran'],
    ['purba champaran', 'east champaran'],
    ['janjgir - champa', 'janjgir champa'],
    ['ahmadabad', 'ahmedabad'],
    ['banas kantha', 'banaskantha'],
    ['dohad', 'dahod'],
    ['kachchh', 'kutch'],
    ['mahesana', 'mehsana'],
    ['panch mahals', 'panchmahal'],
    ['sabar kantha', 'sabarkantha'],
    ['the dangs', 'dang'],
    ['lahul & spiti', 'lahaul and spiti'],
    ['gurgaon', 'gurugram'],
    ['mewat', 'nuh'],
    ['kodarma', 'koderma'],
    ['pashchimi singhbhum', 'west singhbhum'],
    ['purbi singhbhum', 'east singhbhum'],
    ['saraikelakharsawan', ''],
    ['badgam', 'budgam'],
    ['bandipore', 'bandipora'],
    ['baramula', 'baramulla'],
    ['shupiyan', 'shopiyan'],
    ['bagalkot', 'bagalkote'],
    ['bangalore', 'bengaluru'],
    ['bangalore rural', 'bengaluru rural'],
    ['belgaum', 'belagavi'],
    ['bellary', 'ballari'],
    ['bijapur', 'vijayapura'],
    ['chamarajanagar', 'chamarajanagara'],
    ['chikmagalur', 'chikkamagaluru'],
    ['gulbarga', 'kalaburagi'],
    ['mysore', 'mysuru'],
    ['shimoga', 'shivamogga'],
    ['tumkur', 'tumakuru'],
    ['ahmadnagar', 'ahmednagar'],
    ['bid', 'beed'],
    ['buldana', 'buldhana'],
    ['gondiya', 'gondia'],
    ['khandwa (east nimar)', 'khandwa'],
    ['khargone (west nimar)', 'khargone'],
    ['narsimhapur', 'narsinghpur'],
    ['anugul', 'angul'],
    ['baleshwar', 'balasore'],
    ['baudh', 'boudh'],
    ['debagarh', 'deogarh'],
    ['jagatsinghapur', 'jagatsinghpur'],
    ['jajapur', 'jajpur'],
    ['firozpur', 'ferozepur'],
    ['muktsar', 'sri muktsar sahib'],
    ['sahibzada ajit singh nagar', 's.a.s. nagar'],
    ['chittaurgarh', 'chittorgarh'],
    ['dhaulpur', 'dholpur'],
    ['jalor', 'jalore'],
    ['jhunjhunun', 'jhunjhunu'],
    ['kanniyakumari', 'kanyakumari'],
    ['the nilgiris', 'nilgiris'],
    ['allahabad', 'prayagraj'],
    ['bara banki', 'barabanki'],
    ['faizabad', 'ayodhya'],
    ['jyotiba phule nagar', 'amroha'],
    ['kanshiram nagar', 'kasganj'],
    ['mahamaya nagar', 'hathras'],
    ['mahrajganj', 'maharajganj'],
    ['sant ravidas nagar (bhadohi)', 'bhadohi'],
    ['garhwal', 'pauri garhwal'],
    ['hardwar', 'haridwar'],
    ['darjiling', 'darjeeling'],
    ['haora', 'howrah'],
    ['hugli', 'hooghly'],
    ['koch bihar', 'cooch behar'],
    ['maldah', 'malda'],
    ['north twenty four parganas', 'north 24 parganas'],
    ['puruliya', 'purulia'],
    ['south twenty four parganas', 'south 24 parganas'],
    ['north & middle andaman', 'north and middle andaman'],
    ['leh(ladakh)', 'leh'],
    ['dadra & nagar haveli', 'dadra and nagar haveli'],
    ['lakhimpur', 'lakhimpur kheri'],
    ['kheri', 'lakhimpur kheri'],
    ['dibang valley', 'upper dibang valley']
]

print('The number of changes to be done =', len(district_modifications))

# iterate over the rows in census_data dataframe
for i, row in census_data.iterrows():
    # update the district names
    if(census_data.at[i, 'Level'] == 'DISTRICT'):
        # remove the trailing whitespace and convert them to lowercase
        census_data.at[i, 'Name'] = census_data.at[i, 'Name'].strip().lower()
        # replace the double space by a single space (some district names have double space between words)
        census_data.at[i, 'Name'] = census_data.at[i, 'Name'].replace('  ', ' ')
        # for each modification in the above nested list, update the district_names_from_census_data
        for modification in district_modifications:
            if(census_data.at[i, 'Name'] == modification[0]):
                census_data.at[i, 'Name'] = census_data.at[i, 'Name'].replace(modification[0], modification[1])

print('Operation Successful\n Census Data District Names Modified!')

The number of changes to be done = 86
Operation Successful
 Census Data District Names Modified!


## (xiv) Save the cleaned dataset for future use

In [18]:
census_data.to_csv('./dataset/census_data_clean.csv', index=False)

# 4. Find the common districts between census data and vaccine data

In [19]:
# Find the unique district names in the vaccine data
district_names_from_vaccine_data = cowin_vaccine_data_districtwise['District'].dropna().unique()
district_names_from_vaccine_data = [district_names.lower() for district_names in district_names_from_vaccine_data]
print('Number of unique districts in vaccine data =', len(district_names_from_vaccine_data))

# Find the unique district_names in census_data
district_names_from_census_data = census_data[census_data['Level'] == 'DISTRICT']['Name'].dropna().unique()
district_names_from_census_data = [district_names.lower() for district_names in district_names_from_census_data]
print('Number of unique districts in district census data =', len(district_names_from_census_data))

# Find the common districts between the vaccine and census data
common_districts_vaccine_and_census = set(district_names_from_census_data).intersection(district_names_from_vaccine_data)
print('There are', len(common_districts_vaccine_and_census), 'districts common between the vaccine data and district census data')

Number of unique districts in vaccine data = 714
Number of unique districts in district census data = 625
There are 620 districts common between the vaccine data and district census data


# 5. Prepare district-vaccination-population-ratio.csv

In [20]:
# Prepare a file for storing vaccination population ratio for each district
district_vaccination_population_ratio = pd.DataFrame(columns=['districtid', 'vaccinationratio', 'populationratio', 'ratioofratios'])

for district in list(common_districts_vaccine_and_census):                                                 
    
    # find the vaccination data for this district
    district_data = cowin_vaccine_data_districtwise[cowin_vaccine_data_districtwise['District'].str.lower() == district]
    
    # find population data for this district
    population_data = census_data[(census_data['Level'] == 'DISTRICT') & (census_data['Name'].str.lower() == district)]
                                                     
    # find the district_key for this district
    district_key = district_data['District_Key'].values[0]

    # define start_date and end_date
    # the vaccination data starts from 16 January 2021
    start_date = datetime.datetime.strptime('16/01/2021', '%d/%m/%Y')
    end_date = datetime.datetime.strptime('14/08/2021', '%d/%m/%Y')
    
    # change the date format to match the format in dataframe
    start_date = start_date.strftime('%d/%m/%Y')
    end_date = end_date.strftime('%d/%m/%Y')
    
    # calculate total males and females vaccinated                                                 
    total_males_vaccinated = district_data[end_date + '-' + 'Male(Individuals Vaccinated)'].values[0]
    total_females_vaccinated = district_data[end_date + '-' + 'Female(Individuals Vaccinated)'].values[0]                                                 

    # calculate population of males and females
    total_males = population_data['TOT_M'].values[0]
    total_females = population_data['TOT_F'].values[0]
    
    # calculate the required ratios, put NaN if division by zero occurs
    if(total_males_vaccinated == 0):
        vaccination_ratio = float('NaN')
    else:
        vaccination_ratio = total_females_vaccinated / total_males_vaccinated
    
    if(total_males == 0):
        population_ratio = float('NaN')
    else:
        population_ratio = total_females / total_males
        
    # NaN never compares equal to anything (including itself), so test for it with pd.isna
    if(population_ratio == 0 or pd.isna(population_ratio)):
        ratioofratios = float('NaN')
    else:
        ratioofratios = vaccination_ratio / population_ratio

    # append data to dataframe
    district_vaccination_population_ratio.loc[-1] = [district_key, vaccination_ratio, population_ratio, ratioofratios]
    district_vaccination_population_ratio.index += 1

# dump data to csv files
district_vaccination_population_ratio = district_vaccination_population_ratio.sort_values('ratioofratios')
district_vaccination_population_ratio.to_csv('./output/district-vaccination-population-ratio.csv', index=False)
district_vaccination_population_ratio.head()

Unnamed: 0,districtid,vaccinationratio,populationratio,ratioofratios
130,NL_Kiphire,0.531756,0.956225,0.556099
48,DN_Dadra and Nagar Haveli,0.44121,0.77389,0.570119
380,HR_Nuh,0.537717,0.9071,0.592787
283,JK_Srinagar,0.539059,0.899529,0.599268
333,TN_Kancheepuram,0.652007,0.986257,0.661093
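The division guards in the cell above depend on how NaN behaves in comparisons. Under IEEE 754 floating-point rules, NaN is not equal to any value, itself included, so an `==` test cannot detect it. A short sketch, with `safe_divide` as a hypothetical helper:

```python
import math

nan = float('nan')
print(nan == nan)        # False: NaN is never equal to itself
print(math.isnan(nan))   # True: the reliable way to test for NaN

def safe_divide(numerator, denominator):
    """Divide, returning NaN when the denominator is zero or NaN."""
    if denominator == 0 or math.isnan(denominator):
        return math.nan
    return numerator / denominator

print(safe_divide(3, 4))
print(safe_divide(1, 0))
```

Rows whose ratio is NaN are placed at the end by `sort_values`, whose default is `na_position='last'`.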


# 6. Find the common states between census data and vaccine data

In [21]:
# Find the unique state names in the vaccine data
state_names_from_vaccine_data = cowin_vaccine_data_districtwise['State'].dropna().unique()
state_names_from_vaccine_data = [state_name.lower() for state_name in state_names_from_vaccine_data]
print('Number of unique state in vaccine data =', len(state_names_from_vaccine_data))

# Find the unique state names in the census
state_names_from_census_data = census_data[census_data['Level'] == 'STATE']['Name'].dropna().unique()
state_names_from_census_data = [state_name.lower() for state_name in state_names_from_census_data]
print('Number of unique state in census data =', len(state_names_from_census_data))

# Find the common states between the vaccine and census data
common_states_vaccine_and_census = set(state_names_from_census_data).intersection(state_names_from_vaccine_data)
print('There are', len(common_states_vaccine_and_census), 'states common between the vaccine data and census data')

Number of unique state in vaccine data = 36
Number of unique state in census data = 35
There are 35 states common between the vaccine data and census data


# 7. Prepare state-vaccination-population-ratio.csv

In [22]:
# Prepare a file for storing vaccination population ratio for each state
state_vaccination_population_ratio = pd.DataFrame(columns=['stateid', 'vaccinationratio', 'populationratio', 'ratioofratios'])

for state in list(common_states_vaccine_and_census):                                                 
    
    # find the vaccination data for this state
    state_data = cowin_vaccine_data_districtwise[cowin_vaccine_data_districtwise['State'].str.lower() == state]
    
    # find population data for this state
    population_data = census_data[(census_data['Level'] == 'STATE') & (census_data['Name'].str.lower() == state)]
                                                     
    # find the state_code for this state
    state_code = state_data.iloc[0]['State_Code']

    # define start_date and end_date
    # the vaccination data starts from 16 January 2021
    start_date = datetime.datetime.strptime('16/01/2021', '%d/%m/%Y')
    end_date = datetime.datetime.strptime('14/08/2021', '%d/%m/%Y')
    
    # change the date format to match the format in dataframe
    start_date = start_date.strftime('%d/%m/%Y')
    end_date = end_date.strftime('%d/%m/%Y')
    
    # calculate total males and females vaccinated                                                 
    total_males_vaccinated = sum(state_data[end_date + '-' + 'Male(Individuals Vaccinated)'])
    total_females_vaccinated = sum(state_data[end_date + '-' + 'Female(Individuals Vaccinated)'])                                                 

    # calculate population of males and females
    total_males = population_data['TOT_M'].values[0]
    total_females = population_data['TOT_F'].values[0]
    
    # calculate the required ratios, put NaN if division by zero occurs
    if(total_males_vaccinated == 0):
        vaccination_ratio = float('NaN')
    else:
        vaccination_ratio = total_females_vaccinated / total_males_vaccinated
    
    if(total_males == 0):
        population_ratio = float('NaN')
    else:
        population_ratio = total_females / total_males
        
    # NaN never compares equal to itself, so test for it with pd.isna
    if(population_ratio == 0 or pd.isna(population_ratio)):
        ratioofratios = float('NaN')
    else:
        ratioofratios = vaccination_ratio / population_ratio

    # append data to dataframe
    state_vaccination_population_ratio.loc[-1] = [state_code, vaccination_ratio, population_ratio, ratioofratios]
    state_vaccination_population_ratio.index += 1

# dump data to csv files
state_vaccination_population_ratio = state_vaccination_population_ratio.sort_values('ratioofratios')
state_vaccination_population_ratio.to_csv('./output/state-vaccination-population-ratio.csv', index=False)
state_vaccination_population_ratio.head()

Unnamed: 0,stateid,vaccinationratio,populationratio,ratioofratios
2,DN,0.497221,0.705965,0.704315
14,NL,0.752385,0.930907,0.808228
34,JK,0.745552,0.888562,0.839055
11,DL,0.739717,0.867957,0.852251
6,UP,0.779276,0.912437,0.85406
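The state-level vaccination ratio is obtained by summing the district counts first and dividing afterwards, which is not the same as averaging the district ratios. A minimal sketch with hypothetical rows (the state codes 'XX' and 'YY' and all counts are made up):

```python
from collections import defaultdict

# hypothetical district rows: (state_code, males_vaccinated, females_vaccinated)
district_rows = [
    ('XX', 100, 90),
    ('XX', 300, 150),
    ('YY', 200, 200),
]

# accumulate [males, females] per state
totals = defaultdict(lambda: [0, 0])
for state_code, males, females in district_rows:
    totals[state_code][0] += males
    totals[state_code][1] += females

# ratio of summed counts (this sample has no state with zero males)
state_vaccination_ratio = {
    state: females / males for state, (males, females) in totals.items()
}
print(state_vaccination_ratio)
```

For 'XX' the summed counts give 240 / 400 = 0.6, whereas the mean of the two district ratios (0.9 and 0.5) would be 0.7. Summing first weights each district by its size.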


# 8. Prepare overall-vaccination-population-ratio.csv

In [23]:
# Prepare a file for storing vaccination population ratio for India (overall)
overall_vaccination_population_ratio = pd.DataFrame(columns=['overallid', 'vaccinationratio', 'populationratio', 'ratioofratios'])

# define start_date and end_date
# the vaccination data starts from 16 January 2021
start_date = datetime.datetime.strptime('16/01/2021', '%d/%m/%Y')
end_date = datetime.datetime.strptime('14/08/2021', '%d/%m/%Y')

# change the date format to match the format in dataframe
start_date = start_date.strftime('%d/%m/%Y')
end_date = end_date.strftime('%d/%m/%Y')
    
# calculate total males and females vaccinated                                                 
total_males_vaccinated = sum(cowin_vaccine_data_districtwise[end_date + '-' + 'Male(Individuals Vaccinated)'])
total_females_vaccinated = sum(cowin_vaccine_data_districtwise[end_date + '-' + 'Female(Individuals Vaccinated)'])                                                 

# calculate population of males and females
total_males = census_data[census_data['Level'] == 'India']['TOT_M'].values[0]
total_females = census_data[census_data['Level'] == 'India']['TOT_F'].values[0]

# calculate the required ratios, put NaN if division by zero occurs
if(total_males_vaccinated == 0):
    vaccination_ratio = float('NaN')
else:
    vaccination_ratio = total_females_vaccinated / total_males_vaccinated

if(total_males == 0):
    population_ratio = float('NaN')
else:
    population_ratio = total_females / total_males

# NaN never compares equal to itself, so test for it with pd.isna
if(population_ratio == 0 or pd.isna(population_ratio)):
    ratioofratios = float('NaN')
else:
    ratioofratios = vaccination_ratio / population_ratio

# append data to dataframe
overall_vaccination_population_ratio.loc[-1] = ['India', vaccination_ratio, population_ratio, ratioofratios]
overall_vaccination_population_ratio.index += 1

# dump data to csv files
overall_vaccination_population_ratio.to_csv('./output/overall-vaccination-population-ratio.csv', index=False)
overall_vaccination_population_ratio.head()

Unnamed: 0,overallid,vaccinationratio,populationratio,ratioofratios
0,India,0.890081,0.942745,0.944138


--------------------------------------------------------------------------------- END of Q6 ---------------------------------------------------------------------------------------------