### Question 4. (10 marks)

For every district, state and overall, find the week and month having the peak (highest) number of active cases for wave-1 and wave-2. The output file contains the columns: districtid, wave1-weekid, wave2-weekid, wave1-monthid, wave2-monthid. Call this output file peaks.csv and the script/program to generate it peaks-generator.sh. A week starts on Sunday and runs till Saturday. The next week starts on Thursday and ends on the following Wednesday. Thus, two consecutive weeks overlap. A wave starts when cases start rising, and ends when cases flatten out. The peak of a wave is its highest point. Identify the two most important peaks. (Roughly, wave-1 was in the summer of 2020, while wave-2 was in April-May of 2021.)
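
The overlapping week scheme described in the question can be sketched as a small generator. The function name `overlapping_weeks` and its arguments are illustrative assumptions, not part of the assignment; the date range is the analysis period used in the cells of this notebook.

```python
import datetime


def overlapping_weeks(first_sunday, last_day):
    """Yield (weekid, start, end) for 7-day windows that alternately
    start on Sunday and on Thursday, so consecutive windows overlap."""
    weekid = 1
    start = first_sunday
    while start < last_day:
        end = start + datetime.timedelta(days=6)
        yield weekid, start, end
        # Sunday -> Thursday is a 4-day step; Thursday -> Sunday is a 3-day step
        step = 4 if weekid % 2 == 1 else 3
        start = start + datetime.timedelta(days=step)
        weekid += 1


weeks = list(overlapping_weeks(datetime.date(2020, 3, 15), datetime.date(2021, 8, 14)))
for weekid, start, end in weeks[:3]:
    # print the id, the window bounds and their weekday names
    print(weekid, start, start.strftime('%a'), end, end.strftime('%a'))
```

Odd week ids then always denote Sunday-to-Saturday windows and even ids Thursday-to-Wednesday windows.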


# 1. Importing the necessary libraries

In [1]:
import json
import numpy as np
import pandas as pd
import datetime
from dateutil import relativedelta  # used for handling dates and doing relative arithmetic

# 2. Load districts cases data and clean

In [2]:
# Load the districts.csv file
districts_cases = pd.read_csv('./dataset/districts.csv', dtype='string')
districts_cases.head()

Unnamed: 0,Date,State,District,Confirmed,Recovered,Deceased,Other,Tested
0,2020-04-26,Andaman and Nicobar Islands,Unknown,33,11,0,0,
1,2020-04-26,Andhra Pradesh,Anantapur,53,14,4,0,
2,2020-04-26,Andhra Pradesh,Chittoor,73,13,0,0,
3,2020-04-26,Andhra Pradesh,East Godavari,39,12,0,0,
4,2020-04-26,Andhra Pradesh,Guntur,214,29,8,0,


## (i) Drop the columns which are not required

In [3]:
# delete the columns which are not required
districts_cases.drop(['Recovered', 'Deceased', 'Other', 'Tested'], axis=1, inplace=True)
districts_cases.head()

Unnamed: 0,Date,State,District,Confirmed
0,2020-04-26,Andaman and Nicobar Islands,Unknown,33
1,2020-04-26,Andhra Pradesh,Anantapur,53
2,2020-04-26,Andhra Pradesh,Chittoor,73
3,2020-04-26,Andhra Pradesh,East Godavari,39
4,2020-04-26,Andhra Pradesh,Guntur,214


## (ii) Convert the 'Confirmed' column to numeric type

In [4]:
districts_cases['Confirmed'] = districts_cases['Confirmed'].apply(pd.to_numeric, errors='ignore')
print(districts_cases.dtypes)

Date         string
State        string
District     string
Confirmed     int64
dtype: object
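
Element-wise `apply` works here, but a single vectorized call is usually simpler. The sketch below runs on made-up values rather than the real file and uses `errors='coerce'`, so that entries that cannot be parsed become `NaN` instead of being left as strings. Whether coercion is appropriate for this dataset is an assumption.

```python
import pandas as pd

# toy column standing in for 'Confirmed'; the values are invented for illustration
confirmed_raw = pd.Series(['33', '53', 'n/a', '214'])

# one vectorized call; entries that cannot be parsed become NaN
confirmed_num = pd.to_numeric(confirmed_raw, errors='coerce')

print(confirmed_num.dtype)         # float64, because NaN forces a float column
print(confirmed_num.isna().sum())  # number of entries that failed to parse
```

Counting the `NaN` entries afterwards gives a quick check on how many rows were not clean numbers.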


# 3. Load the cleaned vaccination data (done in Q1)

In [5]:
# Load the cowin_vaccine_data_districtwise.csv
cowin_vaccine_data_districtwise = pd.read_csv('./dataset/cowin_vaccine_data_districtwise_clean.csv', dtype='string')
cowin_vaccine_data_districtwise.head()

# convert number values columns to numeric
cowin_vaccine_data_districtwise.iloc[:, 4:] = cowin_vaccine_data_districtwise.iloc[:, 4:].apply(
                                                                    pd.to_numeric, errors='ignore')
print('The datatypes of columns containing numeric values have been changed from string to numeric')

The datatypes of columns containing numeric values have been changed from string to numeric


# 4. Find the common district names between vaccine data and cases data

In [6]:
# Find the unique district names in the vaccine data
district_names_from_vaccine_data = cowin_vaccine_data_districtwise['District'].dropna().unique()
district_names_from_vaccine_data = [district_name.lower() for district_name in district_names_from_vaccine_data]
print('Number of unique districts in vaccine data =', len(district_names_from_vaccine_data))
      
# Find the unique district names from the districts cases data
district_names_from_districts_cases = districts_cases['District'].dropna().unique()
district_names_from_districts_cases = [district_name.lower() for district_name in district_names_from_districts_cases]
print('Number of unique districts in cases data =', len(district_names_from_districts_cases))

# find the common districts between the unique districts of cases data and vaccine data
common_districts_vaccine_and_cases = set(district_names_from_districts_cases).intersection(district_names_from_vaccine_data)
print('There are', len(common_districts_vaccine_and_cases), 'districts common between the vaccine data and cases data')

Number of unique districts in vaccine data = 714
Number of unique districts in cases data = 643
There are 626 districts common between the vaccine data and cases data
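
Matching on lowercase district names alone can merge districts that share a name across states (for example, Aurangabad exists in both Bihar and Maharashtra). A minimal sketch of matching on (state, district) pairs instead, using invented rows; `name_pairs` is a hypothetical helper and not part of the notebook:

```python
import pandas as pd

# toy stand-ins for the two datasets; rows are invented for illustration
vaccine = pd.DataFrame({
    'State': ['Bihar', 'Maharashtra', 'Kerala'],
    'District': ['Aurangabad', 'Aurangabad', 'Idukki'],
})
cases = pd.DataFrame({
    'State': ['Bihar', 'Maharashtra', 'Kerala', 'Kerala'],
    'District': ['Aurangabad', 'Aurangabad', 'Idukki', 'Unknown'],
})


def name_pairs(df):
    """Return the set of normalized (state, district) pairs in df."""
    states = df['State'].str.strip().str.lower()
    districts = df['District'].str.strip().str.lower()
    return set(zip(states, districts))


common_by_name = set(vaccine['District'].str.lower()) & set(cases['District'].str.lower())
common_by_pair = name_pairs(vaccine) & name_pairs(cases)

print(len(common_by_name))  # 2: name-only matching merges the two Aurangabads
print(len(common_by_pair))  # 3: pair matching keeps them apart
```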


# 5. Find wave1 and wave2 peaks for each district

In [7]:
def cases_between_time(data, start_date, end_date):
    '''
    Helper function to extract the number of cases that arise in a given duration.
    Input: data, start_date, end_date
    Output: cases in this duration
    Logic: cases = cases on end_date - cases on the day before start_date
    Note: The data value is cumulative.
    '''
    # calculate the day before start date (will be useful since the data is cumulative)
    day_before_start_date = start_date - datetime.timedelta(days=1)
    # change date format to match the format in dataframe
    start_date = start_date.strftime('%Y-%m-%d')
    end_date = end_date.strftime('%Y-%m-%d')
    day_before_start_date = day_before_start_date.strftime('%Y-%m-%d')
    try:
        cases_r = data[data['Date'] == end_date]['Confirmed'].values[0]
    except IndexError:
        # no row for end_date: .values is empty, so treat the cumulative count as 0
        cases_r = 0
    try:
        cases_l = data[data['Date'] == day_before_start_date]['Confirmed'].values[0]
    except IndexError:
        # no row for the day before start_date: treat the cumulative count as 0
        cases_l = 0
    return cases_r - cases_l
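
Because the counts are cumulative, the same window totals can also be read from a date-indexed series. The following is a minimal sketch on invented numbers; `window_cases` is a hypothetical name. Treating days before the first record as 0 cases mirrors the fallback in the helper above, whereas forward-filling gaps inside the series is an added assumption that the helper does not make.

```python
import pandas as pd

# invented cumulative counts with one missing day (2020-04-28)
cum = pd.Series(
    [10, 15, 30, 42],
    index=pd.to_datetime(['2020-04-26', '2020-04-27', '2020-04-29', '2020-04-30']),
)

# reindex to every calendar day; carry the last known total forward,
# and treat days before the first record as 0 cases
all_days = pd.date_range('2020-04-24', '2020-04-30', freq='D')
cum_daily = cum.reindex(all_days).ffill().fillna(0)


def window_cases(cum_daily, start, end):
    """New cases in [start, end] = total on end - total on the day before start."""
    day_before = pd.Timestamp(start) - pd.Timedelta(days=1)
    before = cum_daily.get(day_before, 0)  # 0 if the day lies outside the index
    return cum_daily.loc[pd.Timestamp(end)] - before


print(window_cases(cum_daily, '2020-04-27', '2020-04-30'))
```

With the forward fill, a missing day contributes 0 new cases rather than producing a negative window total.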

In [8]:
# Prepare a DataFrame to store the weekid and monthid of both waves for each district
district_peaks = pd.DataFrame(columns=['districtid', 'wave1-weekid', 'wave2-weekid', 'wave1-monthid', 'wave2-monthid'])

for district in list(common_districts_vaccine_and_cases):

    # find the district_key for this district
    district_key = cowin_vaccine_data_districtwise[cowin_vaccine_data_districtwise['District'].str.lower() == district]['District_Key'].values[0]
    
    # find data for this district
    district_data = districts_cases[districts_cases['District'].str.lower() == district]
    
    # define start_date and end_date based on our time period of analysis
    start_date = datetime.datetime.strptime('15/03/2020', '%d/%m/%Y')
    end_date = datetime.datetime.strptime('14/08/2021', '%d/%m/%Y')
    
    # create an empty list to store the number of cases each week
    weekly_data = list()
    
    # iterate from start_date to end_date, one week at a time
    # note that consecutive weeks overlap:
    # odd-numbered weeks run from Sunday to Saturday, even-numbered weeks from Thursday to Wednesday
    # week 1 is 15/03/2020-21/03/2020 and week 2 is 19/03/2020-25/03/2020
    weekid = 1
    while start_date < end_date:
        # the current week ends 6 days after start_date (on a Saturday or a Wednesday)
        week_end_date = start_date + datetime.timedelta(days=6)
        # calculate the cases for this week using a helper function we defined earlier
        cases = cases_between_time(district_data, start_date, week_end_date)
        # append data to our list
        weekly_data.append(cases)
        weekid += 1
        # update the start_date for the next week
        if weekid % 2 == 0:
            # if next weekid is even, it runs from thursday to wednesday
            start_date = week_end_date - datetime.timedelta(days=2)
        else:
            # if next weekid is odd, it runs from sunday to saturday
            start_date = week_end_date - datetime.timedelta(days=3)
    
    # Now we will find the weekid of both the peaks
    mid = len(weekly_data) // 2
    # find the index of the peak value in the first half and in the second half of weekly_data
    # note that the index starts from 0 but our weekid starts from 1, so we add 1 to the result
    wave1_weekid = weekly_data.index(max(weekly_data[:mid])) + 1
    # search only the second half and offset by mid; list.index on the whole list
    # would return an earlier week if the same value also occurs in the first half
    wave2_weekid = mid + weekly_data[mid:].index(max(weekly_data[mid:])) + 1
    
    # define start_date and end_date based on our time period of analysis
    start_date = datetime.datetime.strptime('15/03/2020', '%d/%m/%Y')
    end_date = datetime.datetime.strptime('14/08/2021', '%d/%m/%Y')
    
    # create an empty list to store the number of cases each month
    monthly_data = list()
    
    # iterate from start_date to end_date with step size of one month
    # First month is 15/03/2020-14/04/2020
    # Last month is 15/07/2021-14/08/2021
    # Total number of cases in a month = (cases on last date) - (cases on a day before first day)
    monthid = 1
    while start_date < end_date:
        # the current month ends on 14th of next month
        month_end_date = start_date + relativedelta.relativedelta(months=1) - datetime.timedelta(days=1)
        # calculate the cases for this month using a helper function we defined earlier
        cases = cases_between_time(district_data, start_date, month_end_date)
        # append data to our list
        monthly_data.append(cases)
        # update the start_date for the next month
        start_date = month_end_date + datetime.timedelta(days=1)
        monthid += 1
    
    # Now we will find the monthid of both the peaks
    mid = len(monthly_data) // 2
    # find the index of the peak value in the first half and in the second half of monthly_data
    # note that the index starts from 0 but our monthid starts from 1, so we add 1 to the result
    wave1_monthid = monthly_data.index(max(monthly_data[:mid])) + 1
    # search only the second half and offset by mid; list.index on the whole list
    # would return an earlier month if the same value also occurs in the first half
    wave2_monthid = mid + monthly_data[mid:].index(max(monthly_data[mid:])) + 1

    # append the weekid and monthid of both the waves to our dataframe
    district_peaks.loc[-1] = [district_key, wave1_weekid, wave2_weekid, wave1_monthid, wave2_monthid]
    district_peaks.index += 1

district_peaks = district_peaks.sort_values('districtid')
district_peaks.to_csv('./output/district-peaks.csv', index=False)  # save the file to csv
district_peaks.head()

Unnamed: 0,districtid,wave1-weekid,wave2-weekid,wave1-monthid,wave2-monthid
385,AP_Anantapur,40,122,5,15
110,AP_Chittoor,49,123,6,15
124,AP_East Godavari,52,123,6,15
494,AP_Guntur,39,121,5,14
106,AP_Krishna,58,123,7,15
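
Splitting the series at its midpoint and taking the maximum of each half is a heuristic for separating the two waves. One detail is worth illustrating on toy numbers: `list.index` searches from the start of the list, so the position of the second-half maximum should be computed on the slice and then offset by `mid`. The helper name `peak_ids` is hypothetical.

```python
def peak_ids(values):
    """Return 1-based ids of the maximum in the first and second half of values."""
    mid = len(values) // 2
    first, second = values[:mid], values[mid:]
    wave1 = first.index(max(first)) + 1
    # offset by mid because the index is relative to the second-half slice
    wave2 = mid + second.index(max(second)) + 1
    return wave1, wave2


# invented weekly counts: the value 7 appears in both halves
weekly = [0, 7, 3, 1, 2, 7, 4, 0]
print(peak_ids(weekly))                   # (2, 6)
print(weekly.index(max(weekly[4:])) + 1)  # 2: a whole-list search lands in the first half
```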


# 6. Find wave1 and wave2 peaks for each state

In [9]:
# Find the unique state names in the vaccine data
state_names_from_vaccine_data = cowin_vaccine_data_districtwise['State'].dropna().unique()
state_names_from_vaccine_data = [state_name.lower() for state_name in state_names_from_vaccine_data]
print('Number of unique states in vaccine data =', len(state_names_from_vaccine_data))

# Find the unique state names from the districts cases data
state_names_from_districts_cases = districts_cases['State'].dropna().unique()
state_names_from_districts_cases = [state_name.lower() for state_name in state_names_from_districts_cases]
print('Number of unique state in cases data =', len(state_names_from_districts_cases))

# Find the common state names between the vaccine and districts cases data
common_states_vaccine_and_cases = set(state_names_from_districts_cases).intersection(state_names_from_vaccine_data)
print('There are', len(common_states_vaccine_and_cases), 'states common between the vaccine data and cases data')

Number of unique states in vaccine data = 36
Number of unique state in cases data = 36
There are 36 states common between the vaccine data and cases data


In [10]:
def cases_between_time_for_series(data, start_date, end_date):
    '''
    Helper function to extract the number of cases that arise in a given duration.
    Input: data, start_date, end_date
    Output: cases in this duration
    Logic: cases = cases on end_date - cases on the day before start_date
    Note: The data is cumulative.
    '''
    # calculate the day before start date (will be useful since the data is cumulative)
    day_before_start_date = start_date - datetime.timedelta(days=1)
    # change date format to match the format in dataframe
    start_date = start_date.strftime('%Y-%m-%d')
    end_date = end_date.strftime('%Y-%m-%d')
    day_before_start_date = day_before_start_date.strftime('%Y-%m-%d')
    # sum over all rows on that date (one row per district);
    # the sum of an empty selection is 0, so missing dates need no special handling
    cases_r = sum(data[data['Date'] == end_date]['Confirmed'])
    cases_l = sum(data[data['Date'] == day_before_start_date]['Confirmed'])
    return cases_r - cases_l
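
Summing `Confirmed` over all rows that share a date is what turns district-level cumulative totals into a state-level (or national) total. The same aggregation can be sketched with `groupby` on invented rows; the variable names here are illustrative only.

```python
import pandas as pd

# invented district-level cumulative counts
rows = pd.DataFrame({
    'Date': ['2020-04-26', '2020-04-26', '2020-04-27', '2020-04-27', '2020-04-27'],
    'State': ['Andhra Pradesh', 'Andhra Pradesh', 'Andhra Pradesh', 'Andhra Pradesh', 'Kerala'],
    'District': ['Anantapur', 'Chittoor', 'Anantapur', 'Chittoor', 'Idukki'],
    'Confirmed': [53, 73, 60, 80, 5],
})

# cumulative total per state and date
state_totals = rows.groupby(['State', 'Date'])['Confirmed'].sum()
# cumulative total per date over all states
overall_totals = rows.groupby('Date')['Confirmed'].sum()

print(state_totals.loc[('Andhra Pradesh', '2020-04-27')])
print(overall_totals.loc['2020-04-27'])
```

Computing these totals once avoids filtering the full table for every week and month window.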

In [11]:
# Prepare a DataFrame to store the weekid and monthid of both waves for each state
state_peaks = pd.DataFrame(columns=['stateid', 'wave1-weekid', 'wave2-weekid', 'wave1-monthid', 'wave2-monthid'])

for state in list(common_states_vaccine_and_cases):

    # find the state_code for this state
    state_code = cowin_vaccine_data_districtwise[cowin_vaccine_data_districtwise['State'].str.lower() == state]['State_Code'].values[0]
    
    # find data for this state
    state_data = districts_cases[districts_cases['State'].str.lower() == state]
    
    # define start_date and end_date based on our time period of analysis
    start_date = datetime.datetime.strptime('15/03/2020', '%d/%m/%Y')
    end_date = datetime.datetime.strptime('14/08/2021', '%d/%m/%Y')
    
    # create an empty list to store the number of cases each week
    weekly_data = list()
    
    # iterate from start_date to end_date, one week at a time
    # note that consecutive weeks overlap:
    # odd-numbered weeks run from Sunday to Saturday, even-numbered weeks from Thursday to Wednesday
    # week 1 is 15/03/2020-21/03/2020 and week 2 is 19/03/2020-25/03/2020
    weekid = 1
    while start_date < end_date:
        # the current week ends 6 days after start_date (on a Saturday or a Wednesday)
        week_end_date = start_date + datetime.timedelta(days=6)
        # calculate the cases for this week using a helper function we defined earlier
        cases = cases_between_time_for_series(state_data, start_date, week_end_date)
        # append data to our list
        weekly_data.append(cases)
        weekid += 1
        # update the start_date for the next week
        if weekid % 2 == 0:
            # if next weekid is even, it runs from thursday to wednesday
            start_date = week_end_date - datetime.timedelta(days=2)
        else:
            # if next weekid is odd, it runs from sunday to saturday
            start_date = week_end_date - datetime.timedelta(days=3)
    
    # Now we will find the weekid of both the peaks
    mid = len(weekly_data) // 2
    # find the index of the peak value in the first half and in the second half of weekly_data
    # note that the index starts from 0 but our weekid starts from 1, so we add 1 to the result
    wave1_weekid = weekly_data.index(max(weekly_data[:mid])) + 1
    # search only the second half and offset by mid; list.index on the whole list
    # would return an earlier week if the same value also occurs in the first half
    wave2_weekid = mid + weekly_data[mid:].index(max(weekly_data[mid:])) + 1
    
    # define start_date and end_date based on our time period of analysis
    start_date = datetime.datetime.strptime('15/03/2020', '%d/%m/%Y')
    end_date = datetime.datetime.strptime('14/08/2021', '%d/%m/%Y')
    
    # create an empty list to store the number of cases each month
    monthly_data = list()
    
    # iterate from start_date to end_date with step size of one month
    # First month is 15/03/2020-14/04/2020
    # Last month is 15/07/2021-14/08/2021
    # Total number of cases in a month = (cases on last date) - (cases on a day before first day)
    monthid = 1
    while start_date < end_date:
        # the current month ends on 14th of next month
        month_end_date = start_date + relativedelta.relativedelta(months=1) - datetime.timedelta(days=1)
        # calculate the cases for this month using a helper function we defined earlier
        cases = cases_between_time_for_series(state_data, start_date, month_end_date)
        # append data to our list
        monthly_data.append(cases)
        # update the start_date for the next month
        start_date = month_end_date + datetime.timedelta(days=1)
        monthid += 1
    
    # Now we will find the monthid of both the peaks
    mid = len(monthly_data) // 2
    # find the index of the peak value in the first half and in the second half of monthly_data
    # note that the index starts from 0 but our monthid starts from 1, so we add 1 to the result
    wave1_monthid = monthly_data.index(max(monthly_data[:mid])) + 1
    # search only the second half and offset by mid; list.index on the whole list
    # would return an earlier month if the same value also occurs in the first half
    wave2_monthid = mid + monthly_data[mid:].index(max(monthly_data[mid:])) + 1

    # append the weekid and monthid of both the waves to our dataframe
    state_peaks.loc[-1] = [state_code, wave1_weekid, wave2_weekid, wave1_monthid, wave2_monthid]
    state_peaks.index += 1

state_peaks = state_peaks.sort_values('stateid')
state_peaks.to_csv('./output/state-peaks.csv', index=False)  # save the file to csv
state_peaks.head()

Unnamed: 0,stateid,wave1-weekid,wave2-weekid,wave1-monthid,wave2-monthid
28,AN,43,117,5,14
33,AP,49,122,6,14
27,AR,56,141,7,15
26,AS,49,123,6,15
7,BR,43,118,5,14
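
The month ids in these tables refer to windows that run from the 15th of one month to the 14th of the next. A minimal sketch of that windowing with `relativedelta`, useful for translating a monthid back into dates; `month_windows` is a hypothetical name:

```python
import datetime
from dateutil.relativedelta import relativedelta


def month_windows(first_day, last_day):
    """Yield (monthid, start, end) for consecutive one-month windows."""
    monthid = 1
    start = first_day
    while start < last_day:
        # the window ends one day before the same day-of-month in the next month
        end = start + relativedelta(months=1) - datetime.timedelta(days=1)
        yield monthid, start, end
        start = end + datetime.timedelta(days=1)
        monthid += 1


months = list(month_windows(datetime.date(2020, 3, 15), datetime.date(2021, 8, 14)))
print(len(months))   # number of month windows in the analysis period
print(months[0])     # first window
print(months[13])    # window for monthid 14
```

Under this scheme monthid 14 covers 15/04/2021 to 14/05/2021, which agrees with the rough timing of wave-2 given in the question.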


# 7. Find wave1 and wave2 peaks for India (overall)

In [14]:
# Prepare a DataFrame to store the weekid and monthid of both waves for India (overall)
overall_peaks = pd.DataFrame(columns=['overallid', 'wave1-weekid', 'wave2-weekid', 'wave1-monthid', 'wave2-monthid'])

# define start_date and end_date based on our time period of analysis
start_date = datetime.datetime.strptime('15/03/2020', '%d/%m/%Y')
end_date = datetime.datetime.strptime('14/08/2021', '%d/%m/%Y')

# create an empty list to store the number of cases each week
weekly_data = list()

# iterate from start_date to end_date, one week at a time
# note that consecutive weeks overlap:
# odd-numbered weeks run from Sunday to Saturday, even-numbered weeks from Thursday to Wednesday
# week 1 is 15/03/2020-21/03/2020 and week 2 is 19/03/2020-25/03/2020
weekid = 1
while start_date < end_date:
    # the current week ends 6 days after start_date (on a Saturday or a Wednesday)
    week_end_date = start_date + datetime.timedelta(days=6)
    # calculate the cases for this week using a helper function we defined earlier
    cases = cases_between_time_for_series(districts_cases, start_date, week_end_date)
    # append data to our list
    weekly_data.append(cases)
    weekid += 1
    # update the start_date for the next week
    if weekid % 2 == 0:
        # if next weekid is even, it runs from thursday to wednesday
        start_date = week_end_date - datetime.timedelta(days=2)
    else:
        # if next weekid is odd, it runs from sunday to saturday
        start_date = week_end_date - datetime.timedelta(days=3)

# Now we will find the weekid of both the peaks
mid = len(weekly_data) // 2
# find the index of the peak value in the first half and in the second half of weekly_data
# note that the index starts from 0 but our weekid starts from 1, so we add 1 to the result
wave1_weekid = weekly_data.index(max(weekly_data[:mid])) + 1
# search only the second half and offset by mid; list.index on the whole list
# would return an earlier week if the same value also occurs in the first half
wave2_weekid = mid + weekly_data[mid:].index(max(weekly_data[mid:])) + 1
    
# define start_date and end_date based on our time period of analysis
start_date = datetime.datetime.strptime('15/03/2020', '%d/%m/%Y')
end_date = datetime.datetime.strptime('14/08/2021', '%d/%m/%Y')

# create an empty list to store the number of cases each month
monthly_data = list()

# iterate from start_date to end_date with step size of one month
# First month is 15/03/2020-14/04/2020
# Last month is 15/07/2021-14/08/2021
# Total number of cases in a month = (cases on last date) - (cases on a day before first day)
monthid = 1
while start_date < end_date:
    # the current month ends on 14th of next month
    month_end_date = start_date + relativedelta.relativedelta(months=1) - datetime.timedelta(days=1)
    # calculate the cases for this month using a helper function we defined earlier
    cases = cases_between_time_for_series(districts_cases, start_date, month_end_date)
    # append data to our list
    monthly_data.append(cases)
    # update the start_date for the next month
    start_date = month_end_date + datetime.timedelta(days=1)
    monthid += 1

# Now we will find the monthid of both the peaks
mid = len(monthly_data) // 2
# find the index of the peak value in the first half and in the second half of monthly_data
# note that the index starts from 0 but our monthid starts from 1, so we add 1 to the result
wave1_monthid = monthly_data.index(max(monthly_data[:mid])) + 1
# search only the second half and offset by mid; list.index on the whole list
# would return an earlier month if the same value also occurs in the first half
wave2_monthid = mid + monthly_data[mid:].index(max(monthly_data[mid:])) + 1

# append the weekid and monthid of both the waves to our dataframe
overall_peaks.loc[-1] = ['India', wave1_weekid, wave2_weekid, wave1_monthid, wave2_monthid]
overall_peaks.index += 1

overall_peaks.to_csv('./output/overall-peaks.csv', index=False)  # save the file to csv
overall_peaks.head()

Unnamed: 0,overallid,wave1-weekid,wave2-weekid,wave1-monthid,wave2-monthid
0,India,52,119,6,14
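
The question asks for a single output file, while the cells above write three separate files. One possible way to stack the three tables is sketched below. The shared column name `id` and the single-row stand-in tables are assumptions made for illustration; the assignment itself names the column districtid.

```python
import pandas as pd

cols = ['wave1-weekid', 'wave2-weekid', 'wave1-monthid', 'wave2-monthid']

# tiny stand-ins for the three result tables (values copied from the outputs above)
district = pd.DataFrame([['AP_Anantapur', 40, 122, 5, 15]], columns=['districtid'] + cols)
state = pd.DataFrame([['AP', 49, 122, 6, 14]], columns=['stateid'] + cols)
overall = pd.DataFrame([['India', 52, 119, 6, 14]], columns=['overallid'] + cols)

# give every table the same id column name, then stack them
frames = [
    district.rename(columns={'districtid': 'id'}),
    state.rename(columns={'stateid': 'id'}),
    overall.rename(columns={'overallid': 'id'}),
]
peaks = pd.concat(frames, ignore_index=True)

# show the combined table as CSV text
print(peaks.to_csv(index=False))
```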


--------------------------------------------------------------------------------- END of Q4 ---------------------------------------------------------------------------------------------