Johns Hopkins Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Deaths time series (the rest of the notebook treats the values as cumulative deaths)
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'

df = pd.read_csv(url)
df.head()

In [None]:
print(df.shape)
print(df.columns[:10])

First I separate the location columns from the date columns, so that I can reshape the table into one row per location per date.

In [None]:
meta_cols = ['Province/State', 'Country/Region', 'Lat', 'Long']
date_cols = [c for c in df.columns if c not in meta_cols]

In [None]:
df_long = df.melt(id_vars=meta_cols,
                  value_vars=date_cols,
                  var_name='Date',
                  value_name='Cumulative_Deaths')

In [None]:
df_long['Date'] = pd.to_datetime(df_long['Date'], format='%m/%d/%y')

df_long.head(10)

In [None]:
df_long = df_long.rename(columns={
    'Province/State': 'Province_State',
    'Country/Region': 'Country_Region',
    'Lat': 'Latitude',
    'Long': 'Longitude'
})

df_long['Province_State'] = df_long['Province_State'].fillna('')

df_long.info()

In [None]:
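# Build a single label per series, e.g. "Australia, Victoria" or just "Brazil"
df_long['Location'] = df_long['Country_Region'] + \
    df_long['Province_State'].apply(lambda x: f", {x}" if x else "")

# diff() depends on row order, so sort by date within each location first
df_long = df_long.sort_values(['Location', 'Date'])

# Day-over-day change in the cumulative total; the first date of each location has no previous day, so it becomes 0
df_long['Daily_New_Deaths'] = df_long.groupby('Location')['Cumulative_Deaths'].diff().fillna(0)

# Cumulative totals are sometimes revised downward; treat those negative differences as 0
df_long['Daily_New_Deaths'] = df_long['Daily_New_Deaths'].clip(lower=0)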

df_long.head(10)


In [None]:
df_long['Rolling_7Day_Avg'] = (
    df_long.groupby('Location')['Daily_New_Deaths']
    .transform(lambda x: x.rolling(7, min_periods=1).mean())
)
df_long.head(10)


In [None]:
latest_date = df_long['Date'].max()

summary = (df_long[df_long['Date'] == latest_date]
           .groupby('Country_Region')['Cumulative_Deaths']
           .sum()
           .sort_values(ascending=False)
           .head(10))

summary

In [None]:
# The dataset labels the United States as 'US' in Country/Region
countries = ['US', 'India', 'Brazil', 'France']

plt.figure(figsize=(10,6))
for c in countries:
    subset = df_long[df_long['Country_Region'] == c]

    series = subset.groupby('Date')['Rolling_7Day_Avg'].sum()
    plt.plot(series.index, series.values, label=c)

plt.title('7-Day Rolling Average of Daily COVID-19 Deaths')
plt.xlabel('Date')
plt.ylabel('Deaths (7-day avg)')
plt.legend()
plt.grid(True)
plt.show()


I started with the Johns Hopkins global deaths time-series dataset. The data was in a wide format, with one row per location and one column per date.

Using pandas.melt(), I reshaped it into a tidy form where each row represents one location on one date. Then I created new columns for daily new deaths and a 7-day rolling average.
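To make the reshape concrete, here is a minimal sketch on a made-up table with two locations and three dates. The `toy_wide` and `toy_long` names and all of the values are illustrative assumptions, not rows from the Johns Hopkins file.

```python
import pandas as pd

# Hypothetical wide-format table: one row per location, one column per date
toy_wide = pd.DataFrame({
    'Country/Region': ['A', 'B'],
    '1/22/20': [0, 0],
    '1/23/20': [1, 0],
    '1/24/20': [3, 2],
})

# melt() turns every date column into its own row
toy_long = toy_wide.melt(id_vars=['Country/Region'],
                         var_name='Date',
                         value_name='Cumulative_Deaths')
toy_long['Date'] = pd.to_datetime(toy_long['Date'], format='%m/%d/%y')

# 2 locations x 3 dates gives 6 rows
print(toy_long.shape)
print(toy_long.sort_values(['Country/Region', 'Date']))
```

Leaving out `value_vars` makes melt() unpivot every column not listed in `id_vars`, which is the same result the notebook gets by passing `date_cols` explicitly.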

When I first calculated daily new deaths, most of the early dates showed zeros. This is because the dataset starts before any deaths were reported, so the cumulative totals were still at zero. Since daily deaths are calculated as the difference from the previous day, the result stays at zero until the first increase occurs. These leading zeros mark the initial phase, before COVID-19 deaths began to be reported.
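The leading zeros are easy to reproduce on a small made-up series. The sketch below uses hypothetical cumulative values for a single location (the dip from 5 to 4 imitates a downward data revision) and applies the same diff, clip, and rolling steps as the notebook.

```python
import pandas as pd

# Hypothetical cumulative totals for one location over seven days
cumulative = pd.Series([0, 0, 0, 2, 5, 4, 9])

# Day-over-day difference; the first day has no previous value, so fill it with 0
daily = cumulative.diff().fillna(0)

# A downward revision gives a negative difference; clip it to 0
daily = daily.clip(lower=0)

# 7-day rolling mean; min_periods=1 averages over however many days exist so far
rolling = daily.rolling(7, min_periods=1).mean()

print(daily.tolist())  # [0.0, 0.0, 0.0, 2.0, 3.0, 0.0, 5.0]
print(rolling.round(2).tolist())
```

The first three values stay at zero because the cumulative total has not moved yet, which is the same pattern seen at the start of the real data.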