### Question 1. (10 marks) Modify the neighbor districts data according to the districts found from the Covid19 portal. A neighbor of a larger district is a combination of all the neighbors of its components. Output the new data as neighbor-districts-modified.json. Use the state code and district codes from vaccination data as their ids. Arrange all the district and state keys in alphabetical order. Only include common districts from vaccination data and covid data.

# 1. Importing the necessary libraries

In [1]:
import json
import numpy as np
import pandas as pd

# 2. Load vaccination data and clean

In [2]:
# Load the cowin_vaccine_data_districtwise.csv
cowin_vaccine_data_districtwise = pd.read_csv('./dataset/cowin_vaccine_data_districtwise.csv', dtype='string')
cowin_vaccine_data_districtwise.head()

Unnamed: 0,S No,State_Code,State,District_Key,Cowin Key,District,16/01/2021,16/01/2021.1,16/01/2021.2,16/01/2021.3,...,31/10/2021,31/10/2021.1,31/10/2021.2,31/10/2021.3,31/10/2021.4,31/10/2021.5,31/10/2021.6,31/10/2021.7,31/10/2021.8,31/10/2021.9
0,,,,,,,Total Individuals Registered,Sessions,Sites,First Dose Administered,...,Total Doses Administered,Sessions,Sites,First Dose Administered,Second Dose Administered,Male(Individuals Vaccinated),Female(Individuals Vaccinated),Transgender(Individuals Vaccinated),Covaxin (Doses Administered),CoviShield (Doses Administered)
1,1.0,AN,Andaman and Nicobar Islands,AN_Nicobars,Nicobar,Nicobars,745,0,0,0,...,,,,,,,,,,
2,2.0,AN,Andaman and Nicobar Islands,AN_North and Middle Andaman,North and Middle Andaman,North and Middle Andaman,1496,0,0,0,...,,,,,,,,,,
3,3.0,AN,Andaman and Nicobar Islands,AN_South Andaman,South Andaman,South Andaman,6028,2,2,23,...,,,,,,,,,,
4,4.0,AP,Andhra Pradesh,AP_Anantapur,Anantapur,Anantapur,20781,28,26,287,...,,,,,,,,,,


## (i) Merge column headers with the first row headers

The column information is split across the header row and the first data row. To make things easier, we merge both into a single column header. For example, '16/01/2021' and 'Sessions' are merged as '16/01/2021-Sessions'.

While merging we ignore missing (NA) values. For example, 'District' and NA are merged as 'District'.
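As a quick illustration of the merge rule, here is a minimal sketch on a toy frame. The frame `raw` and the names `top`, `sub` and `toy` are made up for this example; only the column naming pattern is taken from the real file.

```python
import pandas as pd

# Toy stand-in for the raw CSV: pandas has already added '.1' to the repeated date header,
# and the first data row carries the second level of the header
raw = pd.DataFrame(
    [[None, 'Sessions', 'Sites'],
     ['Nicobars', '0', '0']],
    columns=['District', '16/01/2021', '16/01/2021.1'],
)

top = [c.split('.')[0] for c in raw.columns]   # strip the '.1' suffix added by pandas
sub = list(raw.loc[0])                         # second header level stored in row 0
merged = [t if pd.isna(s) else f'{t}-{s}' for t, s in zip(top, sub)]

toy = raw.drop(0).reset_index(drop=True)       # row 0 is no longer needed
toy.columns = merged
print(list(toy.columns))
```

The identifier column keeps its plain name, while each date column picks up the metric name from the first row.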


In [3]:
# Get the list of default column names
# Note that due to conflicting column names, pandas adds '.number_id' to repeated column names
# We will use split() function over dot character ('.') and remove the number_id suffix from the name
default_column_names = [x.split('.')[0] for x in list(cowin_vaccine_data_districtwise.columns)]

# Get the list of column names that are in the first row
secondary_column_names = list(cowin_vaccine_data_districtwise.loc[0])

# Get the merged column names by concatenating the corresponding elements of the above two lists
merged_column_names = [i + '-' + j
                       if not(pd.isna(j)) else i  # ignore the concatenation where the values are <NA>
                       for i, j in zip(default_column_names, secondary_column_names)]

# Update the column names in the dataframe
cowin_vaccine_data_districtwise.columns = merged_column_names

# Drop the first row since we already merged the names to original column headers
cowin_vaccine_data_districtwise.drop(0, axis=0, inplace=True)
cowin_vaccine_data_districtwise.drop(['S No'], axis=1, inplace=True)  # delete 'S No' column
cowin_vaccine_data_districtwise.drop(['Cowin Key'], axis=1, inplace=True)  # delete 'Cowin Key' column
cowin_vaccine_data_districtwise.dropna(axis=1, how='all', inplace=True)  # drop columns in which every value is <NA>
cowin_vaccine_data_districtwise.reset_index(drop=True, inplace=True)  # reset indexes
cowin_vaccine_data_districtwise.fillna('0', inplace=True)  # fill missing values as '0'
cowin_vaccine_data_districtwise.head()

Unnamed: 0,State_Code,State,District_Key,District,16/01/2021-Total Individuals Registered,16/01/2021-Sessions,16/01/2021-Sites,16/01/2021-First Dose Administered,16/01/2021-Second Dose Administered,16/01/2021-Male(Individuals Vaccinated),...,01/09/2021-Total Doses Administered,01/09/2021-Sessions,01/09/2021-Sites,01/09/2021-First Dose Administered,01/09/2021-Second Dose Administered,01/09/2021-Male(Individuals Vaccinated),01/09/2021-Female(Individuals Vaccinated),01/09/2021-Transgender(Individuals Vaccinated),01/09/2021-Covaxin (Doses Administered),01/09/2021-CoviShield (Doses Administered)
0,AN,Andaman and Nicobar Islands,AN_Nicobars,Nicobars,745,0,0,0,0,0,...,30560,121,1,22234,8326,16468,14090,2,0,30560
1,AN,Andaman and Nicobar Islands,AN_North and Middle Andaman,North and Middle Andaman,1496,0,0,0,0,0,...,104841,3847,10,73655,31186,54856,49972,13,0,104841
2,AN,Andaman and Nicobar Islands,AN_South Andaman,South Andaman,6028,2,2,23,0,12,...,229879,6970,18,163017,66862,123444,106397,38,0,229879
3,AP,Andhra Pradesh,AP_Anantapur,Anantapur,20781,28,26,287,0,28,...,2471041,263380,106,1678857,792184,1141050,1328828,1163,342472,2128569
4,AP,Andhra Pradesh,AP_Chittoor,Chittoor,26285,63,31,424,0,93,...,2774228,193860,160,1935756,838472,1288979,1484856,393,436745,2322388


## (ii) Convert columns containing numbers to numeric datatype
At this point every column in our dataframe has been read as a string, so we convert the columns that hold numbers (position 4 onwards) from string to a numeric datatype.

In [4]:
cowin_vaccine_data_districtwise.iloc[:, 4:] = cowin_vaccine_data_districtwise.iloc[:, 4:].apply(
                                                                    pd.to_numeric, errors='ignore')
print('The datatypes of columns containing numeric values have been changed from string to numeric')
print(cowin_vaccine_data_districtwise.dtypes)

The datatypes of columns containing numeric values have been changed from string to numeric
State_Code                                        string
State                                             string
District_Key                                      string
District                                          string
16/01/2021-Total Individuals Registered            int64
                                                   ...  
01/09/2021-Male(Individuals Vaccinated)            int64
01/09/2021-Female(Individuals Vaccinated)          int64
01/09/2021-Transgender(Individuals Vaccinated)     int64
01/09/2021-Covaxin (Doses Administered)            int64
01/09/2021-CoviShield (Doses Administered)         int64
Length: 2294, dtype: object
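A note on portability: to the best of our knowledge, newer pandas releases (2.2 onwards) deprecate `errors='ignore'` in `pd.to_numeric`. The sketch below shows one way to do the same conversion without that option, on a toy frame. The frame `toy` and the list `numeric_columns` are names introduced only for this example.

```python
import pandas as pd

# Toy frame read as strings, like the vaccination data
toy = pd.DataFrame(
    {'District': ['Nicobars', 'Anantapur'],
     '16/01/2021-Sessions': ['0', '28'],
     '16/01/2021-Sites': ['0', '26']},
    dtype='string',
)

numeric_columns = list(toy.columns[1:])  # everything after the identifier column(s)

# Convert column by column; with the default errors='raise', a non-numeric cell
# raises an error instead of silently leaving the column as strings
toy = toy.assign(**{col: pd.to_numeric(toy[col]) for col in numeric_columns})
print(toy.dtypes)
```

Replacing whole columns this way avoids relying on how `iloc` assignment treats the existing string dtype. The exact integer dtype that is reported may differ between pandas versions.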


## (iii) Merge the data for the rows having the same 'District_Key'
For example, Ahmedabad and Ahmedabad Corporation have the same district_key; all such rows will be merged.

In [5]:
# Get a list of duplicate rows in our dataframe that have the same 'District_Key' using groupby
duplicate_rows = pd.concat(r for _, r in cowin_vaccine_data_districtwise.groupby("District_Key") if len(r) > 1)

# Get the list of index of all such rows
index = list(duplicate_rows.index)

# Create an empty list to record all the indexes that need to be removed
index_to_remove = list()

# Walk through the duplicate rows two at a time and merge each pair
# (this assumes every duplicated district_key occurs exactly twice, with the pair adjacent in `index`)
for i in range(0, len(index), 2):
    # add the data of the duplicate row index[i+1] to the row index[i]
    # do this only from column position 4 onwards (the 5th column), which holds the numeric values
    cowin_vaccine_data_districtwise.iloc[index[i], 4:] += cowin_vaccine_data_districtwise.iloc[index[i+1], 4:]
    # the next index for iteration is the index of duplicate row, these rows will be removed later
    index_to_remove.append(index[i+1])

cowin_vaccine_data_districtwise.drop(index_to_remove, axis=0, inplace=True)  # delete the duplicate rows
cowin_vaccine_data_districtwise.reset_index(drop=True, inplace=True)  # reset indexes
cowin_vaccine_data_districtwise.head()

Unnamed: 0,State_Code,State,District_Key,District,16/01/2021-Total Individuals Registered,16/01/2021-Sessions,16/01/2021-Sites,16/01/2021-First Dose Administered,16/01/2021-Second Dose Administered,16/01/2021-Male(Individuals Vaccinated),...,01/09/2021-Total Doses Administered,01/09/2021-Sessions,01/09/2021-Sites,01/09/2021-First Dose Administered,01/09/2021-Second Dose Administered,01/09/2021-Male(Individuals Vaccinated),01/09/2021-Female(Individuals Vaccinated),01/09/2021-Transgender(Individuals Vaccinated),01/09/2021-Covaxin (Doses Administered),01/09/2021-CoviShield (Doses Administered)
0,AN,Andaman and Nicobar Islands,AN_Nicobars,Nicobars,745,0,0,0,0,0,...,30560,121,1,22234,8326,16468,14090,2,0,30560
1,AN,Andaman and Nicobar Islands,AN_North and Middle Andaman,North and Middle Andaman,1496,0,0,0,0,0,...,104841,3847,10,73655,31186,54856,49972,13,0,104841
2,AN,Andaman and Nicobar Islands,AN_South Andaman,South Andaman,6028,2,2,23,0,12,...,229879,6970,18,163017,66862,123444,106397,38,0,229879
3,AP,Andhra Pradesh,AP_Anantapur,Anantapur,20781,28,26,287,0,28,...,2471041,263380,106,1678857,792184,1141050,1328828,1163,342472,2128569
4,AP,Andhra Pradesh,AP_Chittoor,Chittoor,26285,63,31,424,0,93,...,2774228,193860,160,1935756,838472,1288979,1484856,393,436745,2322388
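The pairwise loop above relies on each duplicated key appearing exactly twice. As an alternative sketch that works for any number of rows per key, the same merge can be expressed with `groupby` and an aggregation dictionary. The toy frame, its values and the column name `doses` are hypothetical.

```python
import pandas as pd

# Toy frame: two rows share the same District_Key and should collapse into one
toy = pd.DataFrame({
    'State_Code': ['GJ', 'GJ', 'GJ'],
    'District_Key': ['GJ_Ahmedabad', 'GJ_Ahmedabad', 'GJ_Surat'],
    'District': ['Ahmedabad', 'Ahmedabad Corporation', 'Surat'],
    'doses': [100, 250, 70],
})

# Keep the first value of the identifier columns and sum every numeric column
agg = {'State_Code': 'first', 'District': 'first', 'doses': 'sum'}
merged = toy.groupby('District_Key', as_index=False, sort=False).agg(agg)
print(merged)
```

On the real dataframe the aggregation dictionary would map the remaining identifier columns to `'first'` and every date column to `'sum'`.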


## (iv) Merge the components of Delhi
The data for Delhi is split across 11 component districts, so we merge them into one.

Merge the following 11 components:

- central delhi
- east delhi
- new delhi
- north delhi
- north east delhi
- north west delhi
- shahdara
- south delhi
- south east delhi
- south west delhi
- west delhi

In [6]:
# get the row indexes for districts under 'Delhi' state
delhi_data_index = cowin_vaccine_data_districtwise[cowin_vaccine_data_districtwise['State'] == 'Delhi'].index

# get the sum of all the 'Delhi' data for the numeric columns
delhi_aggregate_data = list(cowin_vaccine_data_districtwise.iloc[delhi_data_index, 4:].sum())

# insert the aggregate data for 'Delhi' into our dataframe as a new row labelled -1 (renumbered by reset_index below)
cowin_vaccine_data_districtwise.loc[-1] = ['DL', 'Delhi', 'DL_Delhi', 'Delhi'] + delhi_aggregate_data

# drop all the indexes corresponding to the 'Delhi' components and then reset indexes
cowin_vaccine_data_districtwise.drop(delhi_data_index, axis=0, inplace=True) 
cowin_vaccine_data_districtwise.reset_index(drop=True, inplace=True)  # reset indexes

print('Operation Successful')
print('Components of Delhi merged and replaced by single district name "Delhi"')

Operation Successful
Components of Delhi merged and replaced by single district name "Delhi"


## (v) Save the cleaned dataset for future use

In [7]:
cowin_vaccine_data_districtwise.to_csv('./dataset/cowin_vaccine_data_districtwise_clean.csv', index=False)

# 3. Load districts cases data and clean

In [8]:
# Load the districts.csv file
districts_cases = pd.read_csv('./dataset/districts.csv', dtype='string')
districts_cases.head()

Unnamed: 0,Date,State,District,Confirmed,Recovered,Deceased,Other,Tested
0,2020-04-26,Andaman and Nicobar Islands,Unknown,33,11,0,0,
1,2020-04-26,Andhra Pradesh,Anantapur,53,14,4,0,
2,2020-04-26,Andhra Pradesh,Chittoor,73,13,0,0,
3,2020-04-26,Andhra Pradesh,East Godavari,39,12,0,0,
4,2020-04-26,Andhra Pradesh,Guntur,214,29,8,0,


## (i) Drop the columns which are not required

In [9]:
# delete the columns which are not required
districts_cases.drop(['Recovered', 'Deceased', 'Other', 'Tested'], axis=1, inplace=True)
districts_cases.head()

Unnamed: 0,Date,State,District,Confirmed
0,2020-04-26,Andaman and Nicobar Islands,Unknown,33
1,2020-04-26,Andhra Pradesh,Anantapur,53
2,2020-04-26,Andhra Pradesh,Chittoor,73
3,2020-04-26,Andhra Pradesh,East Godavari,39
4,2020-04-26,Andhra Pradesh,Guntur,214


# 4. Find the common district names between vaccine data and cases data
## (i) Find the unique districts in the vaccine data and the cases data

In [10]:
# Find the unique district names in the vaccine data
district_names_from_vaccine_data = cowin_vaccine_data_districtwise['District'].dropna().unique()
district_names_from_vaccine_data = [district_names.lower() for district_names in district_names_from_vaccine_data]
print('Number of unique districts in vaccine data =', len(district_names_from_vaccine_data))
      
# Find the unique district names from the districts cases data
district_names_from_districts_cases = districts_cases['District'].dropna().unique()
district_names_from_districts_cases = [district_name.lower() for district_name in district_names_from_districts_cases]
print('Number of unique districts in cases data =', len(district_names_from_districts_cases))

Number of unique districts in vaccine data = 714
Number of unique districts in cases data = 643


In [11]:
# Find the districts in vaccine data that are not present in districts.csv
districts_not_in_cases_data = set(district_names_from_vaccine_data) - set(district_names_from_districts_cases)
print('There are', len(districts_not_in_cases_data),
      'Districts not in districts cases data = ', districts_not_in_cases_data)

# Find the districts in districts.csv that are not present in vaccine data
districts_not_in_vaccine_data = set(district_names_from_districts_cases) - set(district_names_from_vaccine_data)
print('\nThere are', len(districts_not_in_vaccine_data),
      'Districts not in vaccine data = ', districts_not_in_vaccine_data)

There are 88 Districts not in districts cases data =  {'chandel', 'narayanpet', 'vikarabad', 'bishnupur', 'warangal rural', 'nirmal', 'tamenglong', 'komaram bheem', 'adilabad', 'kangpokpi', 'peddapalli', 'nizamabad', 'karbi anglong', 'sangareddy', 'siddipet', 'nagaon', 'medchal malkajgiri', 'jorhat', 'majuli', 'west sikkim', 'mancherial', 'south andaman', 'south sikkim', 'kakching', 'tinsukia', 'goalpara', 'noney', 'kokrajhar', 'north and middle andaman', 'east sikkim', 'mulugu', 'sonitpur', 'north goa', 'ukhrul', 'nicobars', 'rajanna sircilla', 'jogulamba gadwal', 'imphal west', 'jiribam', 'dhubri', 'karimnagar', 'lakhimpur', 'bhadradri kothagudem', 'hojai', 'wanaparthy', 'west karbi anglong', 'biswanath', 'imphal east', 'darrang', 'barpeta', 'bongaigaon', 'dima hasao', 'south salmara mankachar', 'baksa', 'warangal urban', 'kamrup metropolitan', 'tengnoupal', 'kamrup', 'kamjong', 'khammam', 'pherzawl', 'south goa', 'morigaon', 'jagtial', 'nalgonda', 'north sikkim', 'jayashankar bhupal

## (ii) Find the common districts between vaccine data and cases data

In [12]:
common_districts_vaccine_and_cases = set(district_names_from_districts_cases).intersection(district_names_from_vaccine_data)
print('There are', len(common_districts_vaccine_and_cases), 'districts common between the vaccine data and cases data')

There are 626 districts common between the vaccine data and cases data


# 5. Load neighbor-districts.json and clean

In [13]:
# Load the neighbor_districts.json file as a dictionary
with open('./dataset/neighbor-districts.json') as json_file:
    neighbor_districts = json.load(json_file)

## (i) Remove the Q codes from the district entries

In [14]:
# Renaming the district codes in neighbor_districts.json file
# For Ex: "leh_district/Q1921210" will be renamed as "leh_district"

# Create a dictionary to store the modified json file
neighbor_districts_modified = dict()

# Iterate over the original json file and update
for key in neighbor_districts:
    new_key = key.split('/')[0]  # Keep only the district name and discard the codes
    new_value = list()
    for value in neighbor_districts[key]:
        new_value.append(value.split('/')[0])
    neighbor_districts_modified[new_key] = new_value

# 6. Update the neighbor-districts.json file based on the common districts from the vaccine and cases data

## (i) Find the unique districts in neighbor_districts_json file

In [15]:
# Find the unique district names from neighbor_districts_modified
district_names_from_neighbor_districts_json = [district_name.lower() for district_name in neighbor_districts_modified.keys()]
print('Number of unique districts in neighbor_districts_json =', len(district_names_from_neighbor_districts_json))

Number of unique districts in neighbor_districts_json = 718


## (ii) Find the district in neighbor_districts_json that are not in vaccine data

In [16]:
# Find the districts in neighbor_districts_json that are not in vaccine data
neighbor_districts_not_in_vaccine_data = set(district_names_from_neighbor_districts_json) - \
                                         set(district_names_from_vaccine_data)
print('\nThere are', len(neighbor_districts_not_in_vaccine_data),
      'Districts not in vaccine data = ', neighbor_districts_not_in_vaccine_data)


There are 252 Districts not in vaccine data =  {'shi_yomi', 'tapi_district', 'west_delhi', 'amreli_district', 'sahibzada_ajit_singh_nagar', 'yadadri_bhuvanagiri', 'sangli_district', 'debagarh', 'kabirdham', 'vellore_district', 'west_garo_hills', 'bhavnagar_district', 'tiruvanamalai_district', 'pattanamtitta', 'jamnagar_district', 'faizabad', 'tirunelveli_kattabo', 'imphal_east', 'ramanagara_district', 'paschim_medinipur', 'kannur_district', 'niwari', 'dhaulpur', 'lower_siang', 'dima_hasao_district', 'jalor', 'uttar_dinajpur', 'surendranagar_district', 'anugul', 'jayashankar_bhupalapally', 'thiruvarur_district', 'kheda_district', 'gadag_district', 'the_nilgiris_district', 'firozpur', 'chhota_udaipur_district', 'south_garo_hills', 'ashok_nagar', 'bangalore_rural', 'rajanna_sircilla', 'ariyalur_district', 'konkan_division', 'jajapur', 'lower_subansiri', 'muktsar', 'thanjavur_district', 'nandubar', 'kamrup_metropolitan', 'north_24_parganas', 'chamarajanagar_district', 'west_khasi_hills', 

## (iii) For each unmatched district in json file, find the closest matching district name in the vaccine data
- Many district names have changed in recent years or are spelled differently across sources, so the names in neighbor_districts_json must be updated to match the vaccine data.

- To find candidates, we use a longest-common-subsequence (LCS) heuristic: each unmatched district name from the json file is compared with every district name in the vaccine data.

- The vaccine-data names with the maximum LCS length are reported as the closest matches, and we use them to decide the name modifications for the json file.

In [17]:
def lcs(x, y):
    '''
    Dynamic programming implementation of the longest common subsequence (LCS).
    Input: strings x and y
    Output: length of the longest common subsequence of x and y
    '''
    m, n = len(x), len(y)  # m and n contain the length of strings x and y respectively
    dp = np.zeros((m+1, n+1), dtype='int64')  # dp[i][j] = LCS length of x[:i] and y[:j]
    # start both loops at 1: row 0 and column 0 stand for the empty prefix and must stay 0
    # (starting at 0 would read x[-1], y[-1] and dp[-1][...] through negative indexing)
    for i in range(1, m+1):
        for j in range(1, n+1):
            # if the characters match then current cell will be diagonally previous cell value + 1
            if x[i-1] == y[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            # else find the max of cell on the left and the cell above
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    # return the bottom right cell value that contains the lcs length of x and y
    return dp[m][n]
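To make the recurrence easy to check by hand, here is a small self-contained sketch of the same LCS length computation that keeps only two rows of the table and does not need NumPy. The function name `lcs_length` is introduced for this example only.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (two-row DP)."""
    prev = [0] * (len(y) + 1)          # row for the empty prefix of x
    for cx in x:
        curr = [0]                     # column 0: empty prefix of y
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)            # extend the diagonal match
            else:
                curr.append(max(prev[j], curr[j - 1]))  # best of cell above / cell to the left
        prev = curr
    return prev[-1]


print(lcs_length('dhaulpur', 'dholpur'))   # spelling variant: shares d-h-l-p-u-r
print(lcs_length('faizabad', 'ayodhya'))   # true rename: very little in common
```

The first pair is a spelling variant, so the common subsequence covers almost the whole name. The second pair is a real rename, which is why entries such as 'faizabad' to 'ayodhya' cannot come from the heuristic and have to be added by hand.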

In [18]:
def find_matches():
    '''
    This function finds the closest district name match in cowin data
    for each unmatched district in neighbor-districts.json
    '''
    
    # for each district in json file that didn't match in vaccine data
    for unmatched_district in neighbor_districts_not_in_vaccine_data:
        # find the size of longest match
        longest_size_match = max([lcs(unmatched_district, x) for x in district_names_from_vaccine_data])
        # create an empty list to store the district names that have max
        # size common subsequence with the unmatched district
        similar_districts = list()
        for district in district_names_from_vaccine_data:
            if lcs(unmatched_district, district) == longest_size_match:
                similar_districts.append(district)
        print('{', unmatched_district, '} ~ ', similar_districts)
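As a cross-check on the LCS heuristic, the standard library offers `difflib.get_close_matches`, which ranks candidates by a similarity ratio rather than by raw LCS length. This is a different technique from the one used in this notebook and is shown only as a sketch. The list `vaccine_names` is a tiny stand-in for `district_names_from_vaccine_data`.

```python
from difflib import get_close_matches

# Tiny stand-in for the vaccine-data district names
vaccine_names = ['pathanamthitta', 'palakkad', 'patna', 'dholpur']

# Names as they would look after replacing '_' with ' ' and removing ' district'
for name in ['pattanamtitta', 'dhaulpur']:
    # n limits the number of suggestions, cutoff drops weak candidates (0..1 scale)
    suggestions = get_close_matches(name, vaccine_names, n=3, cutoff=0.6)
    print(name, '~', suggestions)
```

Because the ratio is normalised by the lengths of both strings, long district names are not favoured just for containing more characters. The suggestions still need a manual review before they go into the list of modifications.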

## **NOTE: Run below cell only if you want to check the LCS heuristic output**

In [19]:
print('The following unmatched districts are having max lcs with the following districts:')

'''NOTE: Uncomment below line to execute the lcs heuristic'''

# find_matches()

The following unmatched districts are having max lcs with the following districts:


'NOTE: Uncomment below line to execute the lcs heuristic'

## (iv) Modify the district_names in neighbor_districts_json

Based on the matches from the above heuristic, we make the following changes:
- Replace all dashes ('-') and underscores ('_') with a space (' ')
- Remove all occurrences of ' district' in the district name
- Modify the old names to the new ones
- Delhi district has 11 components which we will merge later

In [20]:
# define the modifications to be done, the below list contains pair of values
# the first value in pair is the old name and the second value in pair is the new name
district_modifications = [
    ['_', ' '],  # replace '_' (underscore) by ' ' (space)
    ['-', ' '],  # replace '-' (dash) by ' ' (space)
    [' district', ''],  # remove ' district' word from the district names
    ['lahul and spiti', 'lahaul and spiti'],
    ['bangalore rural', 'bengaluru rural'],
    ['bangalore urban', 'bengaluru urban'],
    ['komram bheem', 'komaram bheem'],
    ['purba champaran', 'east champaran'],
    ['pashchim champaran', 'west champaran'],
    ['faizabad', 'ayodhya'],
    ['aizwal', 'aizawl'],
    ['anugul', 'angul'],
    ['ashok nagar', 'ashoknagar'],
    ['badgam', 'budgam'],
    ['baleshwar', 'balasore'],
    ['baramula', 'baramulla'],
    ['banas kantha', 'banaskantha'],
    ['baudh', 'boudh'],
    ['belgaum', 'belagavi'],
    ['bellary', 'ballari'],
    ['bemetara', 'bametara'],
    ['bid', 'beed'],
    ['beedar', 'bidar'],
    ['bishwanath', 'biswanath'],
    ['chamarajanagar', 'chamarajanagara'],
    ['dantewada', 'dakshin bastar dantewada'],
    ['debagarh', 'deogarh'],
    ['devbhumi dwaraka', 'devbhumi dwarka'],
    ['dhaulpur', 'dholpur'],
    ['east karbi anglong', 'karbi anglong'],
    ['fategarh sahib', 'fatehgarh sahib'],
    ['firozpur', 'ferozepur'],
    ['gondiya', 'gondia'],
    ['hugli', 'hooghly'],
    ['jagatsinghapur', 'jagatsinghpur'],
    ['jajapur', 'jajpur'],
    ['jalor', 'jalore'],
    ['jhunjhunun', 'jhunjhunu'],
    ['jyotiba phule nagar', 'amroha'],
    ['kabirdham', 'kabeerdham'],
    ['kaimur (bhabua)', 'kaimur'],
    ['kanchipuram', 'kancheepuram'],
    ['kheri', 'lakhimpur'],
    ['lakhimpur', 'lakhimpur kheri'],
    ['kochbihar', 'cooch behar'],
    ['kodarma', 'koderma'],
    ['mahesana', 'mehsana'],
    ['marigaon', 'morigaon'],
    ['mahrajganj', 'maharajganj'],
    ['maldah', 'malda'],
    ['muktsar', 'sri muktsar sahib'],
    ['mumbai city', 'mumbai'],
    ['medchal-malkajgiri', 'medchal malkajgiri'],
    ['nandubar', 'nandurbar'],
    ['narsimhapur', 'narsinghpur'],
    ['nav sari', 'navsari'],
    ['pakaur', 'pakur'],
    ['palghat', 'palakkad'],
    ['panch mahal', 'panchmahal'],
    ['pashchimi singhbhum', 'west singhbhum'],
    ['pattanamtitta', 'pathanamthitta'],
    ['purbi singhbhum', 'east singhbhum'],
    ['puruliya', 'purulia'],
    ['rae bareilly', 'rae bareli'],
    ['rajauri', 'rajouri'],
    ['rangareddy', 'ranga reddy'],
    ['ri bhoi', 'ribhoi'],
    ['sabar kantha', 'sabarkantha'],
    ['sahibzada ajit singh nagar', 's.a.s. nagar'],
    ['sait kibir nagar', 'sant kabir nagar'],
    ['sant ravidas nagar', 'bhadohi'],
    ['sepahijala', 'sipahijala'],
    ['seraikela kharsawan', 'saraikela-kharsawan'],
    ['shaheed bhagat singh nagar', 'shahid bhagat singh nagar'],
    ['sharawasti', 'shrawasti'],
    ['shimoga', 'shivamogga'],
    ['shopian', 'shopiyan'],
    ['siddharth nagar', 'siddharthnagar'],
    ['sivagangai', 'sivaganga'],
    ['sonapur', 'subarnapur'],
    ['sri ganganagar', 'ganganagar'],
    ['sri potti sriramulu nellore', 's.p.s. nellore'],
    ['the dangs', 'dang'],
    ['the nilgiris', 'nilgiris'],
    ['thoothukudi', 'thoothukkudi'],
    ['tiruchchirappalli', 'tiruchirappalli'],
    ['tiruvanamalai', 'tiruvannamalai'],
    ['tirunelveli kattabo', 'tirunelveli'],
    ['tumkur', 'tumakuru'],
    ['yadagiri', 'yadgir'],
    ['ysr', 'y.s.r. kadapa'],
]

print('The number of changes to be done', len(district_modifications))

# for each modification in the above nested list, change the neighbor_districts_modified_dictionary
for modification in district_modifications:
    modified = dict()
    for key in neighbor_districts_modified:
        value = [x.replace(modification[0], modification[1]) for x in neighbor_districts_modified[key]]
        modified[key.replace(modification[0], modification[1])] = value
    neighbor_districts_modified = modified

print('Operation Successful\n', 'Neighbor districts dictionary modified!')

The number of changes to be done 91
Operation Successful
 Neighbor districts dictionary modified!
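One caveat of the loop above is that `str.replace` works on substrings, so a short old name can also change longer names that contain it. That is why the list needs the follow-up pair `['beedar', 'bidar']` after `['bid', 'beed']`. A minimal sketch of an exact-match alternative for the rename pairs is given below. The helper `rename_exact` and the toy graph are hypothetical, and the generic steps (underscores, dashes, ' district') would still need substring replacement.

```python
def rename_exact(graph, renames):
    """Rename whole district names only; names that merely contain an old name are untouched."""
    def fix(name):
        return renames.get(name, name)
    return {fix(key): [fix(v) for v in values] for key, values in graph.items()}


# Toy adjacency data for illustration
toy_graph = {'bid': ['latur'], 'latur': ['bid', 'bidar'], 'bidar': ['latur']}
renames = {'bid': 'beed'}

print('bidar'.replace('bid', 'beed'))     # substring replacement gives 'beedar'
print(rename_exact(toy_graph, renames))   # 'bidar' is left alone
```

With exact matching the order of the rename pairs no longer matters, and chains such as 'kheri' to 'lakhimpur' to 'lakhimpur kheri' can be written as one direct mapping per name.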


## (v) Merge the components of Delhi into a single district name Delhi.
Merge the following 11 components:

- central delhi
- east delhi
- new delhi
- north delhi
- north east delhi
- north west delhi
- shahdara
- south delhi
- south east delhi
- south west delhi
- west delhi

In [21]:
# first find the neighbors of delhi by finding the neighbors of all of its components
components_of_delhi = ['central delhi', 'east delhi', 'new delhi', 'north delhi',
                       'north east delhi', 'north west delhi', 'shahdara', 'south delhi',
                       'south east delhi', 'south west delhi', 'west delhi']

neighbors_of_delhi = set()
for key in list(neighbor_districts_modified.keys()):
    if key in components_of_delhi:
        neighbors_of_delhi.update(neighbor_districts_modified[key])
        neighbor_districts_modified.pop(key, None)

# the final neighbors of delhi can be found out by removing delhi components from this list
neighbors_of_delhi = [x for x in neighbors_of_delhi if x not in components_of_delhi]
print('The neighbors of Delhi are', neighbors_of_delhi)

# now replace all delhi components in the neighbor lists by the new name 'delhi'
# (the set removes the duplicates created when several components collapse into one name)
for key in neighbor_districts_modified:
    value = list({'delhi' if x in components_of_delhi else x for x in neighbor_districts_modified[key]})
    neighbor_districts_modified[key] = value

# now add delhi to the json file
neighbor_districts_modified['delhi'] = neighbors_of_delhi

print('Operation Successful\n', 'The components of Delhi have been successfully merged!')

The neighbors of Delhi are ['baghpat', 'jhajjar', 'gurugram', 'faridabad', 'sonipat', 'ghaziabad', 'gautam buddha nagar']
Operation Successful
 The components of Delhi have been successfully merged!
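The question statement says that the neighbors of a larger district are the combination of the neighbors of all its components. The Delhi merge is one instance of that rule, and the sketch below expresses it as a reusable function on a toy adjacency dictionary. The function name `merge_nodes` and the toy data are assumptions made for this example.

```python
def merge_nodes(graph, components, merged_name):
    """Collapse `components` into a single node called `merged_name`."""
    components = set(components)
    merged_neighbors = set()
    result = {}
    for node, neighbors in graph.items():
        if node in components:
            # union of the components' neighbors, minus the components themselves
            merged_neighbors.update(n for n in neighbors if n not in components)
        else:
            # point every reference to a component at the merged node, without duplicates
            result[node] = sorted({merged_name if n in components else n for n in neighbors})
    result[merged_name] = sorted(merged_neighbors)
    return result


toy_graph = {
    'north delhi': ['new delhi', 'sonipat'],
    'new delhi': ['north delhi', 'faridabad'],
    'sonipat': ['north delhi'],
    'faridabad': ['new delhi', 'sonipat'],
}
print(merge_nodes(toy_graph, ['north delhi', 'new delhi'], 'delhi'))
```

The same function could be reused for any other district that the Covid19 portal reports as a single unit but the json file splits into parts.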


## (vi) Find the number of common districts between vaccine data and neighbor-districts.json

Now that we have done a lot of cleaning, let's find the number of common districts between the vaccine data and the json file.

In [22]:
# Find the unique district names from neighbor_districts_modified
district_names_from_neighbor_districts_json = [district_name.lower() for district_name in neighbor_districts_modified.keys()]
print('Number of unique districts in neighbor_districts_json =', len(district_names_from_neighbor_districts_json))

# Find the common districts with vaccine data
common_districts_vaccine_and_json = set(district_names_from_neighbor_districts_json).intersection(district_names_from_vaccine_data)
print('There are', len(common_districts_vaccine_and_json), 'districts common between the vaccine data and json file')

Number of unique districts in neighbor_districts_json = 706
There are 701 districts common between the vaccine data and json file


## (vii) Remove the unmatched districts in json file

After a lot of data cleaning, we still find some districts in json file that do not match with the vaccine data. We will remove all such districts

In [23]:
# Find the districts in neighbor_districts_json that are not in vaccine data
neighbor_districts_not_in_vaccine_data = set(district_names_from_neighbor_districts_json) - \
                                         set(district_names_from_vaccine_data)
print('\nThere are', len(neighbor_districts_not_in_vaccine_data),
      'Districts not in vaccine data = ', neighbor_districts_not_in_vaccine_data)

districts_to_remove = list(neighbor_districts_not_in_vaccine_data)

# Remove these districts from the json file
modified = dict()
for key in neighbor_districts_modified:
    if key not in districts_to_remove:
        value = [x for x in neighbor_districts_modified[key] if x not in districts_to_remove]
        modified[key] = value
neighbor_districts_modified = modified

print('These districts have now been removed from the json file')


There are 5 Districts not in vaccine data =  {'noklak', 'niwari', 'medchal–malkajgiri', 'mumbai suburban', 'konkan division'}
These districts have now been removed from the json file


## (viii) Removing some districts as specified in the Note section of assignment PDF

Avoid considering the below districts
- FROM COWIN DATA:
    Chengalpattu, Gaurela Pendra Marwahi, Nicobars, North and Middle Andaman, Saraikela-Kharsawan,
    South Andaman, Tenkasi, Tirupathur, Yanam
- FROM neighbor-district.json, remove all entries of:
    Kheri, Konkan division, Niwari, Noklak, Parbhani, Pattanamtitta
    
In our program we have already removed such districts as part of data cleaning, and we go ahead with only the intersection of districts between the vaccine data and neighbor-districts.json.

# 7. Final update on the neighbor_json with district_keys

In [24]:
# To prepare the final neighbor-districts-modified.json we will replace all district names with district_keys

def find_key(district):
    '''
    This function returns state-key_district-name (i.e. district_key) for a given district name
    '''
    return cowin_vaccine_data_districtwise[cowin_vaccine_data_districtwise['District'].str.lower() == district]['District_Key'].values[0]

modified = dict()
for key in neighbor_districts_modified:
    # replace every district name with its district_key and store the neighbors in sorted order
    values = sorted(find_key(x) for x in neighbor_districts_modified[key])
    modified[find_key(key)] = values
# Sort the keys in modified dictionary
neighbor_districts_modified = dict(sorted(modified.items()))
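`find_key` filters the whole dataframe on every call. A possible alternative, sketched here on toy data, is to build the name-to-key lookup once and reuse it. The frame `toy`, the dictionary `key_by_name` and the two-district graph are made up for this example.

```python
import pandas as pd

# Toy stand-in for the cleaned vaccination dataframe
toy = pd.DataFrame({
    'District': ['Anantapur', 'Chittoor'],
    'District_Key': ['AP_Anantapur', 'AP_Chittoor'],
})

# Build the lookup once: lowercase district name -> district key
key_by_name = dict(zip(toy['District'].str.lower(), toy['District_Key']))

graph = {'chittoor': ['anantapur'], 'anantapur': ['chittoor']}

# Translate keys and neighbors, sorting the neighbors and then the top-level keys
keyed = {key_by_name[d]: sorted(key_by_name[n] for n in ns) for d, ns in graph.items()}
keyed = dict(sorted(keyed.items()))
print(keyed)
```

Both approaches look districts up by name alone, so they share one limitation: when two states have a district with the same name (for instance, Aurangabad exists in both Bihar and Maharashtra), only one key can be returned for that name. `find_key` picks the first matching row, while the dictionary keeps the last one.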

# 8. Save the updated dictionary as neighbor-districts-modified.json

In [25]:
# save the neighbor_districts_modified dictionary as neighbor-districts-modified.json
with open('./output/neighbor-districts-modified.json', 'w') as f:
    json.dump(neighbor_districts_modified, f, indent=2)
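Before submitting the file, it can help to run a few sanity checks on the final dictionary. The sketch below is an optional check, not part of the assignment. The function `check_graph` is a hypothetical helper, and the symmetry test assumes that the neighbor relation is meant to be mutual.

```python
def check_graph(graph):
    """Return a list of human-readable problems; an empty list means every check passed."""
    problems = []
    keys = list(graph)
    if keys != sorted(keys):
        problems.append('top-level keys are not in alphabetical order')
    for district, neighbors in graph.items():
        if neighbors != sorted(neighbors):
            problems.append(f'neighbors of {district} are not sorted')
        for n in neighbors:
            if n not in graph:
                problems.append(f'{n} (neighbor of {district}) is not a key')
            elif district not in graph[n]:
                problems.append(f'{district} lists {n}, but not the other way round')
    return problems


good = {'AP_Anantapur': ['AP_Chittoor'], 'AP_Chittoor': ['AP_Anantapur']}
bad = {'AP_Chittoor': ['AP_Anantapur'], 'AP_Anantapur': []}
print(check_graph(good))
print(check_graph(bad))
```

In the notebook the function would be called on `neighbor_districts_modified`. A non-empty result does not necessarily mean the output is wrong, since the source json itself may contain one-sided entries.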

--------------------------------------------------------------------------------- END of Q1 ---------------------------------------------------------------------------------------------