**Name: Snigdha Yalam**

**Course: HDS-5230-07**
**Assignment: Week 5 - COVID-19 Data Analysis**


In [11]:
import dask.dataframe as dd
import pandas as pd
import numpy as np
import requests
import zipfile
import os
from io import BytesIO

In [13]:
# Download the zipped time series and extract it into a local directory.
url = "https://coronadatascraper.com/timeseries.csv.zip"
response = requests.get(url, timeout=60)
response.raise_for_status()
data_dir = "covid_data"
os.makedirs(data_dir, exist_ok=True)
with zipfile.ZipFile(BytesIO(response.content), 'r') as zip_ref:
    zip_ref.extractall(data_dir)
csv_file = os.path.join(data_dir, "timeseries.csv")

# Read lazily with Dask; explicit dtypes avoid type-inference mismatches between partitions.
df = dd.read_csv(
    csv_file,
    dtype={'population': 'float64', 'deaths': 'float64', 'cases': 'float64', 'state': 'object'},
    assume_missing=True,
)


In [15]:
# Keep US rows that carry a state label.
df_us = df[df['country'] == 'United States']
df_states = df_us.dropna(subset=['state'])

# Restrict to the analysis window; unparseable dates become NaT and fall out of the filter.
start_date = "2020-01-01"
end_date = "2021-02-28"
df_states['date'] = dd.to_datetime(df_states['date'], errors='coerce')
df_states = df_states[(df_states['date'] >= start_date) & (df_states['date'] <= end_date)]
# Per-capita mortality per state: summed deaths divided by mean population.
# Note: summing 'deaths' over rows treats every row as a separate, new count.
def compute_per_capita_mortality(df):
    deaths = df.groupby('state').agg({'deaths': 'sum'}).compute()
    avg_population = df.groupby('state').agg({'population': 'mean'}).compute()
    per_capita_mortality = deaths['deaths'] / avg_population['population']
    return per_capita_mortality.dropna()

per_capita_mortality = compute_per_capita_mortality(df_states)
state_ranking = per_capita_mortality.sort_values(ascending=False)
print("Per-Capita Mortality Ranking:\n", state_ranking)

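# Monthly case fatality ratio (%) per state, returned as a state x month matrix.
def compute_cfr(df):
    # assign() returns a new frame, so the caller's DataFrame is left unchanged.
    df = df.assign(month=df['date'].dt.to_period('M'))
    # Keep only rows with reported cases so the ratio never divides by zero.
    df_filtered = df[df['cases'] > 0]
    cfr_matrix = df_filtered.groupby(['state', 'month']).agg({'deaths': 'sum', 'cases': 'sum'}).compute()
    cfr_matrix['CFR'] = (cfr_matrix['deaths'] / cfr_matrix['cases']) * 100
    cfr_matrix = cfr_matrix.pivot_table(values='CFR', index='state', columns='month')
    # Drop months that have no CFR value for any state.
    return cfr_matrix.dropna(axis=1, how='all')

cfr_matrix = compute_cfr(df_states)
print("CFR Matrix:\n", cfr_matrix)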

# Rank states by net CFR change: month-to-month differences summed across each row.
def compute_cfr_change_ranking(cfr_matrix):
    cfr_changes = cfr_matrix.diff(axis=1).sum(axis=1)
    return cfr_changes.sort_values(ascending=False).dropna()

cfr_change_ranking = compute_cfr_change_ranking(cfr_matrix)
print("CFR Change Ranking:\n", cfr_change_ranking)

Per-Capita Mortality Ranking:
 state
New York                        15.838824
Texas                            9.136704
Michigan                         8.862964
Louisiana                        8.508834
Georgia                          7.937739
Illinois                         7.774370
Mississippi                      7.038137
New Jersey                       5.708752
Pennsylvania                     5.698869
Virginia                         5.131386
Ohio                             3.920150
Minnesota                        3.673553
Iowa                             3.579520
Florida                          3.422462
Kentucky                         3.215970
Colorado                         3.086699
Alabama                          3.040476
Missouri                         3.033844
North Carolina                   2.869425
Massachusetts                    2.517199
Tennessee                        2.421777
Nebraska                         2.328852
South Carolina                   2.2649
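
The ratios printed above are greater than 1, which cannot be a true deaths-per-person figure. One possible explanation, assuming the `deaths` column holds cumulative totals per location (an assumption, not verified here), is that summing over dates counts the same deaths again on every later day. The sketch below uses made-up numbers and hypothetical names (`toy`, `summed`, `latest`) to show the difference between summing cumulative values and taking the last one.

```python
import pandas as pd

# Hypothetical toy data: cumulative deaths reported on three days for two states.
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2020-04-01", "2020-04-02", "2020-04-03"] * 2),
    "deaths": [10.0, 15.0, 20.0, 1.0, 1.0, 4.0],   # cumulative totals (assumed)
    "population": [1000.0] * 3 + [500.0] * 3,
})

# Summing cumulative totals counts earlier deaths again on each later day.
summed = toy.groupby("state")["deaths"].sum()

# The last cumulative value per state is the total over the period.
latest = toy.sort_values("date").groupby("state")["deaths"].last()

population = toy.groupby("state")["population"].mean()
per_capita_summed = summed / population
per_capita_latest = latest / population

print(pd.DataFrame({"summed": per_capita_summed, "latest": per_capita_latest}))
```

If the real data is cumulative, the same idea would apply to the Dask pipeline, although the ranking itself may or may not change.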

**Question 3c**

**Approach Used**

The code follows the naïve (crude) CFR method described by the World Health Organization (WHO) in its scientific brief.

The code aggregates the number of deaths and cases per state per month and then computes the ratio of deaths to cases.
This corresponds to the crude CFR outlined by WHO, where:

CFR (%) = (number of deaths / number of confirmed cases) × 100

This approach does not account for the lag between infection and death. While an outbreak is ongoing, some reported cases have not yet resolved, so the ratio can underestimate the eventual CFR; undercounting of mild cases pushes it the other way and can overestimate it.
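
As a small worked example of the crude CFR calculation, the sketch below applies the deaths-to-cases ratio to made-up monthly totals. The data and the names `toy`, `cfr`, and `net_change` are hypothetical, and plain pandas is used instead of Dask to keep it short.

```python
import pandas as pd

# Hypothetical monthly totals for two states.
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "month": ["2020-04", "2020-05", "2020-04", "2020-05"],
    "deaths": [5.0, 8.0, 2.0, 3.0],
    "cases": [100.0, 400.0, 50.0, 60.0],
})

# Crude CFR in percent: deaths divided by confirmed cases, times 100.
toy["CFR"] = toy["deaths"] / toy["cases"] * 100

# State x month matrix, mirroring the layout of the CFR matrix in the notebook.
cfr = toy.pivot(index="state", columns="month", values="CFR")

# Net change across months: the month-to-month differences summed per state.
net_change = cfr.diff(axis=1).sum(axis=1)

print(cfr)
print(net_change)
```

With complete rows, the summed differences reduce to the last month's CFR minus the first month's, so the ranking reflects the net change only and not the path in between.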

**Assumptions Made**

1. Deaths are correctly attributed to the state and time period in the dataset.
2. Case and death data are complete and accurate (no missing cases or underreporting).
3. Deaths are counted in the month they are reported; the lag between case reporting and death is ignored, which may cause some inaccuracies.
4. Population sizes are stable throughout the given time period.

**Question 4**

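Using Dask is beneficial for three reasons:
1. Large dataset size: the COVID-19 time series covers many locations and dates, and Dask reads the CSV in partitions, so the whole file does not have to fit in memory at once.
2. Aggregations: grouping, summing, and computing CFR can be evaluated per partition in parallel and then combined.
3. Speed: Dask builds a lazy task graph and runs it across multiple cores or a distributed cluster only when `.compute()` is called.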
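
The partition-then-combine idea behind these points can be sketched in plain pandas. This is not Dask's implementation, only a rough illustration of the pattern: each chunk is aggregated on its own and the partial results are then merged. The inline CSV text and the names `partials` and `totals` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file too large to load at once.
csv_text = """state,deaths,cases
A,1,10
B,2,20
A,3,30
B,4,40
A,5,50
"""

# Step 1: aggregate each chunk (partition) independently.
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    partials.append(chunk.groupby("state")[["deaths", "cases"]].sum())

# Step 2: combine the per-chunk sums into the final per-state totals.
totals = pd.concat(partials).groupby(level="state").sum()

print(totals)
```

Dask automates the same two steps and can additionally run the per-partition work in parallel.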