# Mapping the Pandemic: A Statistical Approach to COVID-19

The COVID-19 pandemic has had a profound global impact, with varying effects across different countries. This analysis aims to provide a comprehensive overview of the pandemic’s progression, focusing on confirmed cases, death tolls, growth, and mortality rates over time. By comparing key countries and identifying trends, the analysis offers insights into the virus’s spread, the effectiveness of public health measures, and the evolving nature of the pandemic through its various waves. This data-driven approach helps in understanding the pandemic's trajectory and its implications on a global scale.

## Data Cleaning and Exploratory Data Analysis

In [None]:
import pandas as pd
import numpy as np

In [None]:
covid_df = pd.read_csv("covid_19_data.csv")

In [None]:
covid_df.head()

In [None]:
covid_df.info()

In [None]:
covid_df.isnull().sum()

In [None]:
# Change data types
covid_df[["Confirmed", "Deaths", "Recovered"]] = covid_df[["Confirmed", "Deaths", "Recovered"]].astype(int)

In [None]:
# Convert columns to date time columns
covid_df['ObservationDate'] = pd.to_datetime(covid_df['ObservationDate'], errors='coerce')

# Parse 'Last Update' as datetime, then store it as a DD/MM/YYYY string
# (after strftime the column holds text, not datetimes)
covid_df['Last Update'] = pd.to_datetime(covid_df['Last Update'], errors='coerce')
covid_df['Last Update'] = covid_df['Last Update'].dt.strftime('%d/%m/%Y')

In [None]:
# Rename the column names
covid_df = covid_df.rename(columns={'ObservationDate': 'Observation Date', 'Country/Region': 'Country'})

In [None]:
# Replace multiple values in the Country column
covid_df['Country'] = covid_df['Country'].replace(['Mainland China', 'Macau', 'Hong Kong'], 'China')
covid_df['Country'] = covid_df['Country'].replace(['French Guiana', 'Reunion'], 'France')
covid_df['Country'] = covid_df['Country'].replace('Holy See', 'Rome')
covid_df['Country'] = covid_df['Country'].replace('occupied Palestinian territory', 'Palestine')
covid_df['Country'] = covid_df['Country'].replace("('St. Martin',)", 'St. Martin')
covid_df['Country'] = covid_df['Country'].replace('Ireland', 'Republic of Ireland')
covid_df['Country'] = covid_df['Country'].replace('Congo (Brazzaville)', 'Republic of the Congo')
covid_df['Country'] = covid_df['Country'].replace('Congo (Kinshasa)', 'DR Congo')
covid_df['Country'] = covid_df['Country'].replace(['Gambia', 'Gambia, The'], 'The Gambia')

In [None]:
# Delete irrelevant column
covid_df = covid_df.drop("SNo", axis=1)

In [None]:
covid_df.info()

In [None]:
# Delete duplicates
covid_df = covid_df.drop_duplicates()

In [None]:
covid_df.shape

### What is the overall trend of confirmed COVID-19 cases globally?

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Plot the trend in the total number of confirmed cases over time
plt.figure(figsize=(10, 6))
plt.plot(covid_df.groupby("Observation Date")["Confirmed"].sum())
plt.xlabel('Observation Date')
plt.ylabel('Confirmed Cases')
plt.title('Confirmed Cases Over Time')
plt.xticks(rotation=45)  # Rotate month labels for better readability
plt.yticks([30000000, 60000000, 90000000, 120000000, 150000000, 180000000], 
           ['30,000,000', '60,000,000', '90,000,000', '120,000,000', '150,000,000', '180,000,000'])
plt.tight_layout()
plt.show()

This line graph shows cumulative confirmed COVID-19 cases rising steeply from early 2020 to mid-2021. Growth is slow at first, then accelerates from mid-2020 as the virus spreads globally through multiple waves. By May 2021, the curve is still climbing with no sign of flattening, indicating ongoing rapid increases in case counts despite public health efforts.
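One way to check how close the growth is to exponential is to estimate the doubling time of the cumulative series: under steady exponential growth it stays constant, and it lengthens as growth slows. The sketch below uses a synthetic series rather than the notebook's data, and `doubling_time_days` is a hypothetical helper introduced only for illustration.

```python
import numpy as np
import pandas as pd

def doubling_time_days(cumulative: pd.Series, window: int = 7) -> pd.Series:
    """Estimate doubling time (days) from the mean daily log-growth over `window` days."""
    log_growth = np.log(cumulative).diff(window) / window  # mean daily growth rate
    return np.log(2) / log_growth

# Synthetic cumulative series that doubles every 5 days (illustrative, not real data)
dates = pd.date_range("2020-03-01", periods=30, freq="D")
toy_cumulative = pd.Series(100 * 2 ** (np.arange(30) / 5), index=dates)

toy_doubling = doubling_time_days(toy_cumulative)
print(toy_doubling.dropna().round(2).unique())
```

On the notebook's data, the same helper could be applied to `covid_df.groupby("Observation Date")["Confirmed"].sum()`. A doubling time that grows over the months would indicate that the curve, while still rising, is no longer strictly exponential.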

### Which countries had the highest number of confirmed COVID-19 cases?

In [None]:
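# 'Confirmed' is cumulative, so rank countries by their totals on the latest date
latest = covid_df[covid_df['Observation Date'] == covid_df['Observation Date'].max()]
top_countries = latest.groupby('Country')['Confirmed'].sum().sort_values(ascending=False).head(10)
print(top_countries)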

The U.S. leads by a significant margin, followed by India and Brazil, highlighting the pandemic's severe impact in these countries. Lower counts in Russia, France, and the UK may reflect differences in population size, testing, healthcare responses, and transmission rates.

### How does the mortality rate (deaths/confirmed cases) vary across different countries?

In [None]:
covid_df['Mortality Rate'] = np.where(covid_df['Confirmed'] != 0,
                                      covid_df['Deaths'] / covid_df['Confirmed'], 
                                      np.nan)
mortality_rate = covid_df.groupby('Country')['Mortality Rate'].mean()
top_mortality_rate = mortality_rate.sort_values(ascending=False).head(10)

print(f"The average mortality rate is {mortality_rate.mean():.2f}")

In [None]:
print(top_mortality_rate)

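A mortality rate of 0.02 (or 2%) means that, on average, 2 out of every 100 confirmed cases resulted in death. Because this figure is an unweighted average of per-record ratios across countries, it is a rough case fatality estimate rather than a population-level rate. It still points to a significant risk, especially for vulnerable populations, while also suggesting that most confirmed cases can be managed with proper healthcare responses.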
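The averaging method affects this number. A minimal sketch on made-up records (not the notebook's data) shows how an unweighted mean of ratios can differ from a pooled ratio of total deaths to total confirmed cases:

```python
import pandas as pd

# Toy records (illustrative only): one small and one large outbreak
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Confirmed": [100, 10_000],
    "Deaths": [10, 100],
})

# Unweighted mean of per-country ratios: every country counts equally
mean_of_ratios = (toy["Deaths"] / toy["Confirmed"]).mean()

# Pooled ratio: total deaths over total confirmed, so large outbreaks weigh more
pooled_ratio = toy["Deaths"].sum() / toy["Confirmed"].sum()

print(f"mean of ratios: {mean_of_ratios:.4f}")
print(f"pooled ratio:   {pooled_ratio:.4f}")
```

In this toy case the small outbreak pulls the mean of ratios well above the pooled ratio. Which one is appropriate depends on whether the question is about a typical country or about cases overall.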

### What are the recovery rates for different countries?

In [None]:
covid_df['Recovery Rate'] = np.where(covid_df['Confirmed'] != 0,
                                      covid_df['Recovered'] / covid_df['Confirmed'], 
                                      np.nan)
recovery_rate = covid_df.groupby('Country')['Recovery Rate'].mean()
top_recovery_rate = recovery_rate.sort_values(ascending=False).head(10)

print(f"The average recovery rate is {recovery_rate.mean():.2f}")

In [None]:
print(top_recovery_rate)

A recovery rate of 0.59 (or 59%) means that, averaged across countries and observation dates, 59% of confirmed cases were recorded as recovered. The remaining 41% are not necessarily unrecovered: they may include patients still in treatment, complications, fatalities, or recoveries that were not reported. While the rate reflects some success in recovery efforts, it underscores the continued burden on healthcare systems and the need for ongoing public health measures.

### What is the trend in daily new confirmed cases globally and per country?

In [None]:
covid_df.sort_values(by=['Country', 'Observation Date'], inplace=True)

# Group by date and country, summing the confirmed cases
daily_cases = covid_df.groupby(['Observation Date', 'Country'])['Confirmed'].sum().reset_index()

# Calculate daily new confirmed cases
daily_cases['New Confirmed'] = daily_cases.groupby('Country')['Confirmed'].diff().fillna(0)

global_daily_cases = daily_cases.groupby('Observation Date')['New Confirmed'].sum().reset_index()

top_daily_cases = daily_cases.groupby('Country')['New Confirmed'].sum().reset_index().sort_values(by='New Confirmed', ascending=False).head(10)

print(top_daily_cases)

In [None]:
# Plot global daily new confirmed cases
plt.figure(figsize=(12, 6))
plt.plot(global_daily_cases['Observation Date'], global_daily_cases['New Confirmed'], label='Global Daily New Confirmed Cases', color='blue')
plt.xlabel('Date')
plt.ylabel('Number of New Cases')
plt.title('Trend in Global Daily New Confirmed COVID-19 Cases')
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.show()

The line graph shows global daily new confirmed COVID-19 cases from early 2020 to mid-2021, with the date on the x-axis and the number of new cases on the y-axis.

Daily new cases trend upward overall, with noticeable fluctuations and distinct peaks. A significant surge appears towards the end of 2020, followed by a period of sustained high numbers.
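Day-to-day fluctuations in a series like this often come from reporting patterns rather than changes in transmission, and a trailing 7-day average is a common way to smooth them. The sketch below uses made-up daily counts with a weekend dip; the values are illustrative only.

```python
import pandas as pd

# Toy daily counts with a weekly reporting dip (illustrative only)
dates = pd.date_range("2020-11-02", periods=14, freq="D")  # starts on a Monday
toy_new = pd.Series([100, 100, 100, 100, 100, 30, 30] * 2, index=dates)

# Trailing 7-day mean; the first six days have no full window and stay NaN
toy_smoothed = toy_new.rolling(window=7).mean()
print(toy_smoothed.dropna())
```

Applied to the notebook's data, this would be something like `global_daily_cases.set_index('Observation Date')['New Confirmed'].rolling(7).mean()`, plotted alongside the raw series.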

In [None]:
countries_to_plot = ['US', 'India', 'Brazil']
for country in countries_to_plot:
    country_data = daily_cases[daily_cases['Country'] == country]
    plt.plot(country_data['Observation Date'], country_data['New Confirmed'], label=country)

plt.xlabel('Date')
plt.ylabel('Number of New Cases')
plt.title('Trend in Daily New Confirmed COVID-19 Cases by Country')
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.show()

The line graph shows the number of daily new COVID-19 cases in the US, India, and Brazil. The US had two large peaks, one in late 2020 and one in early 2021. India had one very large peak in mid-2021. Brazil's daily cases were steadier, with a small peak in early 2021.

### Are there any seasonal trends in the spread of COVID-19?

In [None]:
covid_df['Month'] = covid_df['Observation Date'].dt.strftime('%B')
seasonal_trend = covid_df.groupby('Month')['Confirmed'].sum()

In [None]:
# Sort the months by their calendar order
months_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 
                'September', 'October', 'November', 'December']

In [None]:
# Plot total confirmed counts by calendar month
plt.figure(figsize=(12, 6))
plt.plot(seasonal_trend.reindex(months_order).dropna())
plt.xlabel('Month')
plt.ylabel('Number of Cases')
plt.title('Seasonal Trend of COVID-19 Cases')
plt.xticks(rotation=50)
plt.yticks([200000000, 1000000000, 2000000000, 3000000000, 4000000000, 5000000000], 
           ['200,000,000', '1,000,000,000', '2,000,000,000', '3,000,000,000', '4,000,000,000', '5,000,000,000'])
plt.tight_layout()
plt.show()

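Summed by calendar month, confirmed counts peak in spring (May), drop significantly in early summer (June), and then rise steadily from summer through fall and winter. This pattern should be read with caution: the Confirmed column is cumulative and the data run from early 2020 to May 2021, so January through May are counted in both years while June through December appear only once. The rise from summer into winter is consistent with more indoor gatherings and holiday travel, but the May peak and June drop likely reflect data coverage rather than true seasonality.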
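A seasonal comparison is easier to interpret when it is built from new cases per month rather than from sums of a cumulative column. The sketch below uses a synthetic cumulative series with a constant 10 new cases per day (not the notebook's data) to show the difference:

```python
import pandas as pd

# Toy cumulative totals (illustrative only): 10 new cases a day for 90 days
dates = pd.date_range("2020-01-01", periods=90, freq="D")
toy_cumulative = pd.Series(range(10, 910, 10), index=dates)

# Convert cumulative totals to daily new cases, then total them per year-month
toy_new = toy_cumulative.diff().fillna(toy_cumulative.iloc[0])
monthly_new = toy_new.groupby(toy_new.index.to_period("M")).sum()

# Summing the cumulative column instead grows month over month
# even though the daily rate never changes
monthly_cumulative_sum = toy_cumulative.groupby(toy_cumulative.index.to_period("M")).sum()

print(monthly_new)
print(monthly_cumulative_sum)
```

Grouping by year-month (`to_period("M")`) rather than by month name also keeps the same calendar month in different years separate, so any month-of-year averaging can be done explicitly afterwards.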

### Impact of Lockdown on Cases

In [None]:
# Flag an illustrative lockdown window (example dates, applied to all countries)
covid_df['Is_Lockdown'] = covid_df['Observation Date'].between('2020-03-15', '2020-06-01')

In [None]:
# 'Confirmed' is cumulative, so total it per date first (each date has a single lockdown status)
daily_cases = covid_df.groupby(['Observation Date', 'Is_Lockdown'])['Confirmed'].sum().reset_index()

In [None]:
# Daily new cases are the day-over-day change in the global cumulative total
daily_cases['New_Confirmed'] = daily_cases['Confirmed'].diff().fillna(0)

In [None]:
# Plotting
plt.figure(figsize=(14, 7))
for label, df in daily_cases.groupby('Is_Lockdown'):
    plt.plot(df['Observation Date'], df['New_Confirmed'], label='Lockdown' if label else 'No Lockdown')

plt.axvline(x=pd.to_datetime('2020-03-15'), color='gray', linestyle='--', label='Lockdown Start')
plt.axvline(x=pd.to_datetime('2020-06-01'), color='red', linestyle='--', label='Lockdown End')
plt.title('Impact of Lockdown on Daily New Confirmed Cases')
plt.xlabel('Date')
plt.ylabel('New Confirmed Cases')
plt.legend()
plt.show()

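The graph suggests a decrease in daily new confirmed cases following the implementation of the lockdown. Because the lockdown window is a single example period applied globally, this comparison is descriptive and does not by itself establish that lockdowns caused the change.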
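A simple numeric companion to the plot is the mean of daily new cases before, during, and after the window. The sketch below uses made-up daily counts and reuses the example window dates from above. It is a descriptive summary under those assumptions, not a causal estimate.

```python
import pandas as pd

# Toy daily new cases (illustrative only); window dates mirror the example lockdown
dates = pd.date_range("2020-03-01", "2020-06-30", freq="D")
toy_new = pd.Series(range(len(dates)), index=dates, dtype=float)

start, end = pd.Timestamp("2020-03-15"), pd.Timestamp("2020-06-01")

# Label each date as before, during, or after the window
period = pd.Series("during", index=dates)
period.loc[dates < start] = "before"
period.loc[dates > end] = "after"

# Mean daily new cases in each period
summary = toy_new.groupby(period).mean()
print(summary)
```

On real data, effects of an intervention would be expected to show up with a delay, so comparing periods shifted by a week or two may be more informative than using the exact start and end dates.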

## Conclusion

In conclusion, this COVID-19 analysis highlights the dynamic and evolving nature of the pandemic across different regions. While early 2020 saw rapid growth in confirmed cases and mortality rates, global responses such as lockdowns and improved medical interventions helped stabilize the situation over time. However, second waves and regional disparities, particularly in countries like the U.S., India, and Brazil, highlight the ongoing challenges in managing the virus. Understanding these trends is crucial for guiding future public health strategies and ensuring preparedness for similar crises.