# Task 6: Data Visualization - Python Charts for Insights

**Author:** [Your Name]  
**Date:** 2026-01-23  
**Domain:** Data Analysis (COVID-19 Trends)

## 📌 Objective
In this task, we will analyze the **COVID-19 dataset** (US States) to uncover trends, correlations, and distributions. We will use `pandas` for data manipulation and `matplotlib` for creating professional visualizations.

### 🔹 Scope of Analysis
1. **Data Loading & Cleaning:** Import from a reliable source (NYT GitHub) and handle dates/missing entries.
2. **Visualizations:**
   - 📊 **Bar Chart:** Top 10 states with the highest cumulative cases.
   - 📈 **Line Chart:** Trend of total cases over time.
   - 📉 **Histogram:** Distribution of cumulative deaths across states.
   - 🔵 **Scatter Plot:** Correlation between cases and deaths.
3. **Insights:** Meaningful, data-driven conclusions.

---

In [None]:
# 1️⃣ Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Setting professional plot styles
plt.style.use('ggplot')  # Using a clean, professional style
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

## 📂 2. Load & Explore Dataset
We load the **New York Times COVID-19 US States** dataset directly from its GitHub repository, so the analysis runs on real, publicly available data.

In [None]:
# Load dataset
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
df = pd.read_csv(url)

# Display basic info
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
display(df.head())
print("\nData Types:")
print(df.dtypes)

## 🧹 3. Data Cleaning & Preparation
- Convert the `date` column to datetime objects.
- Check for missing values.
- Filter to the latest date in the dataset. The raw counts are cumulative, so this snapshot gives each state's running total.

In [None]:
# Convert date column
df['date'] = pd.to_datetime(df['date'])

# Check for missing values
print("Missing Values:\n", df.isnull().sum())

# PREP: To get accurate 'Top 10 States' (latest snapshot)
# We will filter for the latest date in the dataset
latest_date = df['date'].max()
latest_df = df[df['date'] == latest_date].copy()
print(f"\nLatest Date in Dataset: {latest_date.date()}")
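Because the raw counts are cumulative, a per-day figure has to be derived rather than read off directly. The sketch below shows one way to do it. The helper name `add_daily_cases` and the column name `daily_cases` are illustrative, not part of the dataset, and clipping negative differences to zero is an assumption about how to treat downward revisions.

```python
import pandas as pd

def add_daily_cases(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a 'daily_cases' column derived from cumulative 'cases'."""
    out = df.sort_values(['state', 'date']).copy()
    # Difference between consecutive rows within each state
    daily = out.groupby('state')['cases'].diff()
    # The first row of each state has no predecessor; use its cumulative value
    daily = daily.fillna(out['cases'])
    # A downward revision would yield a negative count; clip to zero (assumption)
    out['daily_cases'] = daily.clip(lower=0).astype(int)
    return out
```

Calling `add_daily_cases(df)` after the date conversion would give a frame whose `daily_cases` column could feed a histogram or a rolling average.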

## 📊 4. Data Visualization

### 🔹 Chart 1: Bar Chart (Top 10 States by Cases)
Visualizing the states with the highest cumulative cases to identify the hotspots.

In [None]:
# Data Prep
top_10 = latest_df.sort_values(by='cases', ascending=False).head(10)

# Plotting
plt.figure(figsize=(12, 6))
# Scale to millions so the bar heights match the axis label
bars = plt.bar(top_10['state'], top_10['cases'] / 1e6, color='#3498db', alpha=0.9)

# Formatting
plt.title('Top 10 US States by Total COVID-19 Cases', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('State', fontsize=14)
plt.ylabel('Total Cases (in Millions)', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Adding value labels on top of bars (heights are already in millions)
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, height,
             f'{height:.1f}M',  # one decimal instead of truncating to an integer
             ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

### 📈 Chart 2: Line Chart (Time-Series Trend)
Analyzing how the total number of cases in the US increased over time.

In [None]:
# Data Prep: Group by date to get national total
national_trend = df.groupby('date')['cases'].sum().reset_index()

# Plotting
plt.figure(figsize=(12, 6))
plt.plot(national_trend['date'], national_trend['cases'], color='#e74c3c', linewidth=2.5)

# Formatting
plt.title('Total US COVID-19 Cases Over Time', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Total Cases', fontsize=14)
plt.grid(True, linestyle='--', alpha=0.6)

# Format x-axis dates
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%b'))
plt.gca().xaxis.set_major_locator(mdates.MonthLocator(interval=6)) # Show tick every 6 months
plt.xticks(rotation=0)

plt.tight_layout()
plt.show()
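A cumulative curve only ever rises, so changes in pace are easier to see in the daily increments. As a rough sketch, the national total can be differenced and smoothed with a rolling mean. The function name `national_daily_average` and the 7-day default window are assumptions made for illustration.

```python
import pandas as pd

def national_daily_average(df: pd.DataFrame, window: int = 7) -> pd.Series:
    """Rolling mean of national daily new cases, indexed by date."""
    # National cumulative total per date
    national = df.groupby('date')['cases'].sum().sort_index()
    # Day-over-day increase; the first day keeps its own total
    daily = national.diff().fillna(national.iloc[0]).clip(lower=0)
    # Smooth out weekday reporting effects
    return daily.rolling(window=window, min_periods=1).mean()
```

The returned series could be passed to `plt.plot(series.index, series.values)` with the same date formatting used above.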

### 📉 Chart 3: Histogram (Distribution of Deaths per State)
Examining the distribution of death counts across all states to check for skewness or outliers.

In [None]:
# Plotting
plt.figure(figsize=(10, 6))
plt.hist(latest_df['deaths'], bins=20, color='#9b59b6', edgecolor='black', alpha=0.7)

# Formatting
plt.title('Distribution of COVID-19 Deaths Across States', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Number of Deaths', fontsize=14)
plt.ylabel('Frequency (Number of States)', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()
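The histogram gives a visual impression of skewness and outliers; a numeric check can back it up. The sketch below computes the sample skewness and flags values outside the conventional 1.5 × IQR fences. The helper name `skew_and_outliers` and the multiplier `k` are illustrative choices, not something prescribed by the dataset.

```python
import pandas as pd

def skew_and_outliers(values: pd.Series, k: float = 1.5):
    """Return (skewness, outliers) using the k * IQR rule."""
    skewness = values.skew()  # > 0 means a long right tail
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    # Values beyond the lower or upper fence are flagged as outliers
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return skewness, values[mask]
```

For this notebook the call would be `skew_and_outliers(latest_df['deaths'])`.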

### 🔵 Chart 4: Scatter Plot (Cases vs Deaths)
Investigating the correlation between the number of cases and deaths. Do states with more cases always have more deaths?

In [None]:
# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(latest_df['cases'], latest_df['deaths'], color='#2ecc71', alpha=0.6, s=100, edgecolor='black')

# Formatting
plt.title('Correlation: Total Cases vs Total Deaths', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Total Cases', fontsize=14)
plt.ylabel('Total Deaths', fontsize=14)
plt.grid(True, linestyle='--', alpha=0.6)

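# Reference line at the overall deaths-to-cases ratio (visual aid, not a fitted regression line)
overall_ratio = latest_df['deaths'].sum() / latest_df['cases'].sum()
x_ref = [0, latest_df['cases'].max()]
plt.plot(x_ref, [overall_ratio * x for x in x_ref], color='gray', linestyle='--',
         label=f'Overall ratio ({overall_ratio:.2%})')
plt.legend()
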
plt.tight_layout()
plt.show()
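The scatter plot suggests a relationship, and a correlation coefficient quantifies it. This is a minimal sketch with a hypothetical helper name, `cases_deaths_correlation`. It reports Pearson's coefficient and a rank-based (Spearman) coefficient, computed here as Pearson on ranks so that no extra dependency is needed.

```python
import pandas as pd

def cases_deaths_correlation(latest: pd.DataFrame) -> dict:
    """Pearson and Spearman correlation between 'cases' and 'deaths'."""
    cases, deaths = latest['cases'], latest['deaths']
    return {
        # Linear association on the raw totals
        'pearson': cases.corr(deaths),
        # Spearman is Pearson applied to ranks; less sensitive to a few very large states
        'spearman': cases.rank().corr(deaths.rank()),
    }
```

With the snapshot from earlier this would be called as `cases_deaths_correlation(latest_df)`. A large gap between the two values would hint that a few large states dominate the linear fit.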

## 💡 5. Key Insights

1. **Concentration of Cases (Bar Chart):**  
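   The bar chart ranks states by raw cumulative count, so the top of the ranking is expected to be the most populous states (likely California, Texas, and Florida). Raw totals mostly track population size and do not by themselves show where the virus spread fastest; per-capita rates would be needed for that comparison.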

2. **Growth Trajectory (Line Chart):**  
   The cumulative curve does not rise linearly: its slope steepens in distinct "waves" (e.g., winter 2020-2021), which is consistent with seasonal effects on transmission.

3. **High Correlation (Scatter Plot):**  
   There is a strong positive correlation between cases and deaths: the points in the scatter plot fall close to a straight line. Part of this is expected, because both totals scale with population. States that sit off the line may reflect differences in healthcare capacity, age demographics, or reporting, though this dataset alone cannot separate those causes.
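
The insights above are stated qualitatively, and two small calculations could put numbers behind them. The following is a sketch under assumptions: the helper names `top_n_share` and `deaths_per_case` are hypothetical, and the input is expected to be a latest-date snapshot with `state`, `cases`, and `deaths` columns.

```python
import pandas as pd

def top_n_share(latest: pd.DataFrame, n: int = 3, column: str = 'cases') -> float:
    """Fraction of the column total held by the n largest rows."""
    top_total = latest.nlargest(n, column)[column].sum()
    return top_total / latest[column].sum()

def deaths_per_case(latest: pd.DataFrame) -> pd.Series:
    """Deaths divided by cases for each state, highest first."""
    indexed = latest.set_index('state')
    ratio = indexed['deaths'] / indexed['cases']
    return ratio.sort_values(ascending=False)
```

For example, `top_n_share(latest_df, n=3)` would give the share of cases held by the three largest states, and the head and tail of `deaths_per_case(latest_df)` would list the states furthest from the overall ratio.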