Asritha Suraparaju

HDS 5230 - High Performance Computing

Week 05 - Dask Programming Assignment

In [5]:
# Import required libraries
import pandas as pd
import dask.dataframe as dd
from datetime import datetime

In [6]:
# Define data types for columns
dtypes = {
    'county': 'object',
    'state': 'object',
    'country': 'object',
    'level': 'object',
    'city': 'object',
    'aggregate': 'object',
    'population': 'float64',
    'deaths': 'float64',
    'cases': 'float64',
    'date': 'object'
}

In [7]:
# Read the CSV with explicit data types; assume_missing=True reads any
# unlisted integer columns as floats so that missing values are allowed
df = dd.read_csv('timeseries.csv', dtype=dtypes, assume_missing=True)

In [8]:
# Converting the 'date' column to datetime objects
df['date'] = dd.to_datetime(df['date'])

In [10]:
# Filter for US states and specified date range
mask = (
    (df['country'] == 'United States') &
    (df['level'] == 'state') &
    (df['date'] >= '2020-01-01') &
    (df['date'] <= '2021-02-28')
)
us_states_df = df[mask]

Loading the COVID-19 dataset and filtering it to U.S. states benefits from parallelization. The dataset is large, with millions of records covering regions and countries worldwide. Dask reads the CSV in partitions, distributes those partitions across multiple cores, and applies the filter to each partition separately. The operation parallelizes easily because each row can be parsed and tested against the filter independently of every other row.
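The row-independence argument can be illustrated with a minimal sketch that uses only the Python standard library. This is not how Dask is implemented; the `rows` list, the partition size, and the helper `keep_us_states` are hypothetical stand-ins chosen for illustration. The point is only that filtering each partition separately and concatenating the pieces gives the same rows as filtering everything in one pass.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical miniature of the dataset: (country, level, state)
rows = [
    ("United States", "state", "New Jersey"),
    ("United States", "county", "New Jersey"),
    ("Italy", "country", ""),
    ("United States", "state", "New York"),
    ("United States", "state", "Connecticut"),
    ("France", "country", ""),
]

def keep_us_states(partition):
    """Filter one partition; no row needs information from any other row."""
    return [r for r in partition if r[0] == "United States" and r[1] == "state"]

# Split the rows into partitions of two rows each (an arbitrary size for the sketch)
partitions = [rows[i:i + 2] for i in range(0, len(rows), 2)]

# Filter the partitions concurrently, then concatenate the pieces in order
with ThreadPoolExecutor(max_workers=3) as pool:
    filtered_parts = list(pool.map(keep_us_states, partitions))
parallel_result = [r for part in filtered_parts for r in part]

# Filtering everything in a single pass selects the same rows
serial_result = keep_us_states(rows)
print(parallel_result == serial_result)
```

Because no partition depends on another, the work can be split across as many workers as there are partitions without changing the result.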

In [11]:
# Group data by state and compute key metrics:
# - Maximum reported deaths per state (assuming final recorded death count is needed)
# - Mean population per state (averaging over the dataset timeframe)
state_metrics = us_states_df.groupby('state').agg({
    'deaths': 'max',
    'population': 'mean'
}).compute()

# Calculate per-capita mortality:
# - Deaths per 100,000 people for better comparability between states
state_metrics['per_capita_mortality'] = (
    state_metrics['deaths'] / state_metrics['population'] * 100000
)

# Rank states based on per-capita mortality in descending order
ranked_states = state_metrics.sort_values(
    'per_capita_mortality',
    ascending=False
)

ranked_states = ranked_states.round(2)
# print the results of ranked states
print(ranked_states[['deaths', 'population', 'per_capita_mortality']])

                               deaths  population  per_capita_mortality
state                                                                  
New Jersey                    15211.0   8882190.0                171.25
New York                      24904.0  19453561.0                128.02
Connecticut                    4335.0   3565287.0                121.59
Massachusetts                  8183.0   6892503.0                118.72
Rhode Island                    960.0   1059361.0                 90.62
Washington, D.C.                559.0    705749.0                 79.21
Louisiana                      3288.0   4648794.0                 70.73
Michigan                       6218.0   9986857.0                 62.26
Illinois                       7020.0  12671821.0                 55.40
Maryland                       3243.0   6045680.0                 53.64
Pennsylvania                   6753.0  12801989.0                 52.75
Delaware                        512.0    973764.0               

Computing per-capita mortality rates by state gains little from parallelization. After grouping by state, the analysis works with only about 50 rows, one per state or territory, which is a small dataset. The death counts and population figures fit entirely in memory, and the rate is a single division per row. The overhead of setting up parallel processing would exceed the cost of running the calculation sequentially, so plain pandas operations are sufficient for this step.
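As a small check of the arithmetic, the sketch below recomputes the rate for three rows copied from the printed output above, using pandas only. The names `state_metrics_small` and `ranked_small` are introduced here for illustration and are not part of the assignment code.

```python
import pandas as pd

# Three rows copied from the printed output above; once the Dask groupby has
# been computed, the result is an ordinary in-memory pandas DataFrame like this
state_metrics_small = pd.DataFrame(
    {
        "deaths": [24904.0, 15211.0, 4335.0],
        "population": [19453561.0, 8882190.0, 3565287.0],
    },
    index=pd.Index(["New York", "New Jersey", "Connecticut"], name="state"),
)

# Deaths per 100,000 people
state_metrics_small["per_capita_mortality"] = (
    state_metrics_small["deaths"] / state_metrics_small["population"] * 100000
).round(2)

# Highest per-capita mortality first
ranked_small = state_metrics_small.sort_values(
    "per_capita_mortality", ascending=False
)
print(ranked_small)
```

The recomputed rates agree with the table (171.25, 128.02, and 121.59), and the whole calculation is a handful of vectorized operations on a tiny frame.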

In [12]:
# Extracting the month and year from the date column for monthly aggregation
us_states_df['month_year'] = us_states_df['date'].dt.strftime('%Y-%m')

# Group data by state and month-year, by aggregating the cases and deaths per month
monthly_metrics = us_states_df.groupby(['state', 'month_year']).agg({
    'cases': 'max',
    'deaths': 'max'
}).compute()

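# Reset index so 'state' and 'month_year' are columns, then sort so that
# diff() below always compares a month with the month before it
monthly_metrics = monthly_metrics.reset_index().sort_values(['state', 'month_year'])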

# Calculating the new cases per month by taking the difference from the previous month's cases
monthly_metrics['new_cases'] = monthly_metrics.groupby('state')['cases'].diff().fillna(monthly_metrics['cases'])

# Calculate new deaths per month by taking the difference from the previous month's deaths
monthly_metrics['new_deaths'] = monthly_metrics.groupby('state')['deaths'].diff().fillna(monthly_metrics['deaths'])

# Compute Case Fatality Rate (CFR) per month:
monthly_metrics['cfr'] = (monthly_metrics['new_deaths'] / monthly_metrics['new_cases'] * 100).round(2)

# Pivot the data to create a CFR matrix with states as rows and months as columns
cfr_matrix = monthly_metrics.pivot(
    index='state',
    columns='month_year',
    values='cfr'
)
# print the results
print(cfr_matrix)

month_year                    2020-01  2020-02  2020-03  2020-04  2020-05  \
state                                                                       
Alabama                           NaN      NaN     1.30     4.25     3.30   
Alaska                            NaN      NaN      NaN     4.05     0.88   
American Samoa                    NaN      NaN      NaN      NaN      NaN   
Arizona                           NaN      NaN      NaN     5.03     4.74   
Arkansas                          NaN      NaN     1.42     1.97     1.89   
California                        NaN      NaN     2.13     4.43     3.47   
Colorado                          NaN      NaN     2.43     5.85     5.85   
Connecticut                       NaN      NaN     1.80     8.79    11.78   
Delaware                          NaN      NaN     3.13     3.22     4.54   
Florida                           NaN      NaN     1.26     4.39     5.28   
Georgia                           NaN      NaN     3.04     4.51     4.46   
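The step from cumulative totals to a monthly CFR can be traced on a toy example. The numbers below are hypothetical and are not taken from the dataset; the frame `toy` exists only for this sketch. It applies the same `diff()` and `fillna()` pattern as the cell above to a single state.

```python
import pandas as pd

# Hypothetical cumulative month-end totals for a single state
toy = pd.DataFrame({
    "state": ["A", "A", "A"],
    "month_year": ["2020-03", "2020-04", "2020-05"],
    "cases": [100.0, 400.0, 1000.0],
    "deaths": [2.0, 17.0, 35.0],
})

# The first month has no predecessor, so diff() yields NaN there and the
# cumulative value itself is used as that month's new count
toy["new_cases"] = toy.groupby("state")["cases"].diff().fillna(toy["cases"])
toy["new_deaths"] = toy.groupby("state")["deaths"].diff().fillna(toy["deaths"])

# Monthly CFR as a percentage of that month's new cases
toy["cfr"] = (toy["new_deaths"] / toy["new_cases"] * 100).round(2)
print(toy[["month_year", "new_cases", "new_deaths", "cfr"]])
```

Here the new cases are 100, 300, and 600, the new deaths are 2, 15, and 18, and the monthly CFR values are 2.0, 5.0, and 3.0 percent. The pattern relies on the rows of each state being in chronological order.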

In [13]:
# Calculate month-to-month changes in the CFR for each state
cfr_changes = cfr_matrix.diff(axis=1)

# Create a new DataFrame to store state-level metrics based on CFR changes
state_metrics = pd.DataFrame(index=cfr_matrix.index)

# Compute the total absolute change in CFR over time
state_metrics['total_absolute_change'] = cfr_changes.abs().sum(axis=1)

# Compute the average monthly change in CFR for each state
state_metrics['avg_monthly_change'] = cfr_changes.mean(axis=1)

# Measure the volatility in CFR changes using standard deviation
state_metrics['volatility'] = cfr_changes.std(axis=1)

# Count the number of months where CFR increased
state_metrics['positive_changes'] = (cfr_changes > 0).sum(axis=1)
# Count the number of months where CFR decreased
state_metrics['negative_changes'] = (cfr_changes < 0).sum(axis=1)

# Count the number of months where CFR remained unchanged
state_metrics['no_changes'] = (cfr_changes == 0).sum(axis=1)

# Rank states based on total absolute CFR change, with the highest change first
ranked_states = state_metrics.sort_values('total_absolute_change', ascending=False)

# Print the 10 states with the largest total CFR change
print(ranked_states.head(10))

                              total_absolute_change  avg_monthly_change  \
state                                                                     
United States Virgin Islands                 127.28           -2.020000   
Hawaii                                        51.13          -51.130000   
Alaska                                        44.05          -14.683333   
Rhode Island                                  33.70            0.710000   
New Jersey                                    31.86            7.965000   
Pennsylvania                                  24.35           -1.382500   
Connecticut                                   18.57            0.347500   
Michigan                                      17.18           -0.620000   
Washington                                    15.82           -1.628000   
New Hampshire                                 14.78            1.440000   

                              volatility  positive_changes  negative_changes  \
state              

In [14]:
# Print the 10 states with the smallest total CFR change
print(ranked_states.tail(10))

                total_absolute_change  avg_monthly_change  volatility  \
state                                                                   
Minnesota                        4.11           -1.370000    2.329742   
Texas                            4.00           -0.180000    1.382389   
West Virginia                    4.00           -1.333333    0.748487   
Tennessee                        3.95            0.097500    1.416977   
Wyoming                          3.37           -0.550000    1.349704   
Nebraska                         2.91           -0.370000    1.126810   
Utah                             2.11            0.082500    0.664649   
Arkansas                         2.01           -0.227500    0.605275   
South Dakota                     1.84            0.270000    0.609043   
American Samoa                   0.00                 NaN         NaN   

                positive_changes  negative_changes  no_changes  
state                                                     

Ranking states by CFR change should not be parallelized, because doing so would cost more than it saves. The calculation needs each state's full monthly series and then compares every state against all the others. By this point the data has been reduced to about 50 rows, and the sort that produces the ranking must see the complete set at once. The overhead of splitting and recombining such a small table would outweigh any gain from parallel processing.
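One way to see why ranking is a global step is a sketch with two hypothetical partitions. The `total_absolute_change` values are copied from the printed top-10 output above; the split into `part_a` and `part_b` and the helper `rank` are assumptions made for illustration. Each partition can be sorted on its own, but the overall order only exists after a merge that sees every partial result.

```python
import heapq

# total_absolute_change values from the printed top-10 output above,
# split into two hypothetical partitions
part_a = {"Hawaii": 51.13, "New Jersey": 31.86, "Michigan": 17.18}
part_b = {
    "United States Virgin Islands": 127.28,
    "Alaska": 44.05,
    "Rhode Island": 33.70,
}

def rank(part):
    """Sort one partition by total change, largest first."""
    return sorted(part.items(), key=lambda kv: kv[1], reverse=True)

# Each partition can be ranked independently...
local_a, local_b = rank(part_a), rank(part_b)

# ...but the overall ranking requires a merge across all partitions
global_rank = list(
    heapq.merge(local_a, local_b, key=lambda kv: kv[1], reverse=True)
)
print([state for state, _ in global_rank])
```

With only a few dozen rows, the final merge is essentially the whole job, so splitting the work beforehand adds coordination without saving time.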