<a href="https://colab.research.google.com/github/ysc4/CCDATSCL_EXERCISES_COM222/blob/main/Exercise4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 4

This exercise focuses on data visualization and interpretation using a real-world COVID-19 dataset. The dataset contains daily records of confirmed cases, deaths, recoveries, and active cases across countries and regions, along with temporal and geographic information.
The goal of this exercise is not only to create charts, but to choose appropriate visualizations, apply correct data aggregation, and draw meaningful insights from the data. You will work with time-based, categorical, numerical, and geographic variables, and you are expected to think critically about how design choices affect interpretation.

Your visualizations should follow good practices:
- Use clear titles, axis labels, and legends
- Choose chart types appropriate to the data and question
- Avoid misleading scales or cluttered designs
- Clearly explain patterns, trends, or anomalies you observe

Unless stated otherwise, you may filter, aggregate, or group the data as needed.

<img src="https://d3i6fh83elv35t.cloudfront.net/static/2020/03/Screen-Shot-2020-03-05-at-6.29.29-PM-1024x574.png"/>

In [1]:
import kagglehub
import os
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("imdevskp/corona-virus-report")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/imdevskp/corona-virus-report?dataset_version_number=166...
100%|██████████| 19.0M/19.0M [00:01<00:00, 17.2MB/s]
Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/imdevskp/corona-virus-report/versions/166


In [2]:
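# Confirm the download location exists before reading from it
if os.path.isdir(path):
    print(True)

# os.listdir order is not guaranteed; this assumes the first entry is the
# cleaned daily dataset (the columns shown by df.info() below)
contents = os.listdir(path)
mydataset = os.path.join(path, contents[0])

df = pd.read_csv(mydataset)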

True
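
Relying on `contents[0]` works only while the directory listing happens to put the right file first. A minimal sketch of a more defensive lookup follows, assuming the target file can be recognized by its column names. The helper `find_csv_with_columns` and the throwaway demo files are hypothetical and not part of the dataset.

```python
import os
import tempfile

import pandas as pd


def find_csv_with_columns(folder, required):
    """Return the path of the first CSV in `folder` whose header has all `required` columns."""
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".csv"):
            continue
        candidate = os.path.join(folder, name)
        # nrows=0 reads only the header row, so large files stay cheap to inspect
        header = pd.read_csv(candidate, nrows=0).columns
        if set(required).issubset(header):
            return candidate
    raise FileNotFoundError(f"No CSV in {folder} has columns {sorted(required)}")


# Demo with two throwaway files (hypothetical names and contents)
with tempfile.TemporaryDirectory() as demo_dir:
    pd.DataFrame({"a": [1]}).to_csv(os.path.join(demo_dir, "a_other.csv"), index=False)
    pd.DataFrame({"Date": ["2020-01-22"], "Confirmed": [0]}).to_csv(
        os.path.join(demo_dir, "b_daily.csv"), index=False
    )
    found = os.path.basename(find_csv_with_columns(demo_dir, {"Date", "Confirmed"}))
```

In the notebook, the same call could be pointed at `path` with the columns listed by `df.info()`.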


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49068 entries, 0 to 49067
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Province/State  14664 non-null  object 
 1   Country/Region  49068 non-null  object 
 2   Lat             49068 non-null  float64
 3   Long            49068 non-null  float64
 4   Date            49068 non-null  object 
 5   Confirmed       49068 non-null  int64  
 6   Deaths          49068 non-null  int64  
 7   Recovered       49068 non-null  int64  
 8   Active          49068 non-null  int64  
 9   WHO Region      49068 non-null  object 
dtypes: float64(2), int64(4), object(4)
memory usage: 3.7+ MB
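
The summary above lists `Date` as `object`, meaning the dates are stored as strings. ISO-formatted strings sort correctly, but parsing them enables date-aware operations. The sketch below uses a small stand-in frame (`toy_daily`, a hypothetical name) rather than the real dataset.

```python
import pandas as pd

# Stand-in frame; the real notebook would apply the same call to df["Date"]
toy_daily = pd.DataFrame({
    "Date": ["2020-01-22", "2020-01-23", "2020-01-24"],
    "Confirmed": [555, 654, 941],
})

# Parse the ISO-formatted strings into datetime64 values
toy_daily["Date"] = pd.to_datetime(toy_daily["Date"], format="%Y-%m-%d")

# With real datetimes, resampling works; for a cumulative count the weekly
# maximum is the value reached by the end of each week
weekly_max = toy_daily.set_index("Date")["Confirmed"].resample("W").max()
```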


In [4]:
df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
0,,Afghanistan,33.93911,67.709953,2020-01-22,0,0,0,0,Eastern Mediterranean
1,,Albania,41.1533,20.1683,2020-01-22,0,0,0,0,Europe
2,,Algeria,28.0339,1.6596,2020-01-22,0,0,0,0,Africa
3,,Andorra,42.5063,1.5218,2020-01-22,0,0,0,0,Europe
4,,Angola,-11.2027,17.8739,2020-01-22,0,0,0,0,Africa


In [5]:
df.query("`Country/Region` == 'Philippines'")

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
180,,Philippines,12.879721,121.774017,2020-01-22,0,0,0,0,Western Pacific
441,,Philippines,12.879721,121.774017,2020-01-23,0,0,0,0,Western Pacific
702,,Philippines,12.879721,121.774017,2020-01-24,0,0,0,0,Western Pacific
963,,Philippines,12.879721,121.774017,2020-01-25,0,0,0,0,Western Pacific
1224,,Philippines,12.879721,121.774017,2020-01-26,0,0,0,0,Western Pacific
...,...,...,...,...,...,...,...,...,...,...
47943,,Philippines,12.879721,121.774017,2020-07-23,74390,1871,24383,48136,Western Pacific
48204,,Philippines,12.879721,121.774017,2020-07-24,76444,1879,24502,50063,Western Pacific
48465,,Philippines,12.879721,121.774017,2020-07-25,78412,1897,25752,50763,Western Pacific
48726,,Philippines,12.879721,121.774017,2020-07-26,80448,1932,26110,52406,Western Pacific


In [15]:
import plotly.express as px
import plotly.graph_objects as go

## A. Time-Based Visualizations

1. Global Trend `(5 pts)`

Aggregate the data by Date and create a line chart showing the global number of confirmed COVID-19 cases over time.

In [6]:
# put your answer here

global_confirmed = df.groupby("Date")["Confirmed"].sum().reset_index()
global_confirmed.head()

Unnamed: 0,Date,Confirmed
0,2020-01-22,555
1,2020-01-23,654
2,2020-01-24,941
3,2020-01-25,1434
4,2020-01-26,2118


In [16]:
fig = px.line(global_confirmed, x="Date", y="Confirmed", title="Global Confirmed Cases Over Time")
fig.show()
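
Because `Confirmed` is a running total, the global curve can only rise or stay flat, which makes changes in pace hard to read. One possible companion view, sketched here under the assumption that daily increments are of interest, takes day-over-day differences. The frame `toy_global` is a hypothetical name holding the first five totals from the output above.

```python
import pandas as pd

# First five global cumulative totals, copied from the aggregated output
toy_global = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25", "2020-01-26"]
    ),
    "Confirmed": [555, 654, 941, 1434, 2118],
})

# Day-over-day difference; the first day has no predecessor, so fill it with 0
toy_global["New Cases"] = toy_global["Confirmed"].diff().fillna(0).astype(int)
```

Plotting `New Cases` instead of `Confirmed` would show acceleration and slowdowns directly.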

2. Country-Level Trends `(5 pts)`

Select three countries and visualize their confirmed case counts over time on the same plot.

In [20]:
selected_countries = ['Philippines', 'US', 'India']
country_confirmed = df[df['Country/Region'].isin(selected_countries)]
country_confirmed = country_confirmed.groupby(['Date', 'Country/Region'])['Confirmed'].sum().reset_index()
display(country_confirmed.head())

Unnamed: 0,Date,Country/Region,Confirmed
0,2020-01-22,India,0
1,2020-01-22,Philippines,0
2,2020-01-22,US,1
3,2020-01-23,India,0
4,2020-01-23,Philippines,0


In [56]:
fig = px.line(country_confirmed, x="Date", y="Confirmed", color='Country/Region',
              title="Confirmed Cases Over Time in Philippines, US, and India")
fig.show()

3. Active vs Recovered `(5 pts)`

For a selected country, create a line chart showing Active and Recovered cases over time.

In [22]:
# put your answer here

ph_active_recovered = df[df['Country/Region'] == 'Philippines'][['Date', 'Active', 'Recovered']]
ph_active_recovered.head()

Unnamed: 0,Date,Active,Recovered
180,2020-01-22,0,0
441,2020-01-23,0,0
702,2020-01-24,0,0
963,2020-01-25,0,0
1224,2020-01-26,0,0


In [23]:
fig = px.line(ph_active_recovered, x="Date", y=["Active", "Recovered"],
              labels={'value': 'Count', 'variable': 'Case Type'},
              title="Active and Recovered Cases Over Time in the Philippines")
fig.show()

## B. Comparative Visualizations

4. Country Comparison `(5 pts)`

Using data from a single date, create a bar chart showing the top 10 countries by confirmed cases.

In [25]:
# put your answer here

date = '2020-04-13'
df_date = df[df['Date'] == date]
top_10_confirmed = df_date.groupby("Country/Region")["Confirmed"].sum().reset_index()
top_10_confirmed = top_10_confirmed.sort_values(by="Confirmed", ascending=False).head(10)
top_10_confirmed

Unnamed: 0,Country/Region,Confirmed
173,US,581813
157,Spain,170099
85,Italy,159516
65,Germany,130072
61,France,125394
177,United Kingdom,98017
36,China,83213
81,Iran,73303
172,Turkey,61049
16,Belgium,30589


In [68]:
fig = px.bar(top_10_confirmed, x="Country/Region", y="Confirmed",
             title=f"Top 10 Countries by Confirmed Cases on {date}")
fig.show()

5. WHO Region Comparison `(5 pts)`

Aggregate confirmed cases by WHO Region and visualize the result using a bar chart.

In [27]:
# put your answer here

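# Note: Confirmed is a cumulative daily count, so summing over every date adds up
# case-days rather than cases; filter to a single date first to get regional totals.
region_confirmed = df.groupby("WHO Region")["Confirmed"].sum().reset_index()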
region_confirmed

Unnamed: 0,WHO Region,Confirmed
0,Africa,21791827
1,Americas,402261194
2,Eastern Mediterranean,74082892
3,Europe,248879793
4,South-East Asia,55118365
5,Western Pacific,26374411


In [28]:
fig = px.bar(region_confirmed, x="WHO Region", y="Confirmed",
             title="Confirmed Cases by WHO Region")
fig.show()
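
Since each row holds a cumulative count for one day, regional totals are usually taken from a single-date snapshot rather than from a sum over all dates. The sketch below contrasts the two on a toy frame. All numbers in `toy_regions` are made up for illustration.

```python
import pandas as pd

# Toy data: two regions observed on two consecutive days (illustrative values)
toy_regions = pd.DataFrame({
    "Date": ["2020-07-26", "2020-07-26", "2020-07-27", "2020-07-27"],
    "WHO Region": ["Europe", "Africa", "Europe", "Africa"],
    "Confirmed": [100, 40, 110, 50],
})

# Snapshot approach: keep only the latest date, then sum within each region
latest = toy_regions["Date"].max()
snapshot = toy_regions[toy_regions["Date"] == latest]
region_totals = snapshot.groupby("WHO Region")["Confirmed"].sum()

# Summing across all dates counts each cumulative value once per day
all_dates_sum = toy_regions.groupby("WHO Region")["Confirmed"].sum()
```

Here `region_totals` gives 110 for Europe, while `all_dates_sum` gives 210, which overstates the count because it adds both days' running totals.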

## C. Geographic Visualization

6. Geographic Spread `(10 pts)`

Using Latitude and Longitude, create a map-based visualization showing confirmed cases for a selected date.

In [34]:
# put your answer here

date = '2020-04-13'
df_date = df[df['Date'] == date]
geo_confirmed = df_date.groupby(["Lat", "Long", "Country/Region"])["Confirmed"].sum().reset_index()
geo_confirmed.head()

Unnamed: 0,Lat,Long,Country/Region,Confirmed
0,-51.7963,-59.5236,United Kingdom,5
1,-42.8821,147.3272,Australia,144
2,-40.9006,174.886,New Zealand,1349
3,-38.4161,-63.6167,Argentina,2208
4,-37.8136,144.9631,Australia,1281


In [36]:
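# Plot one marker per (Lat, Long) location; a country-name choropleth would ignore
# the coordinates and receive several rows for countries split into provinces.
fig = px.scatter_geo(
    geo_confirmed,
    lat="Lat",
    lon="Long",
    size="Confirmed",
    color="Confirmed",
    hover_name="Country/Region",
    projection="natural earth",
    title=f"Confirmed Cases by Location on {date}",
    color_continuous_scale=px.colors.sequential.Plasma
)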

fig.show()
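
A choropleth keyed on country names reads one value per country, so rows split by province need to be summed first. A minimal sketch of that aggregation follows, using four rows from the table above. The frame name `toy_geo` is hypothetical.

```python
import pandas as pd

# Four rows from the 2020-04-13 table; Australia appears once per province
toy_geo = pd.DataFrame({
    "Lat": [-42.8821, -40.9006, -38.4161, -37.8136],
    "Long": [147.3272, 174.886, -63.6167, 144.9631],
    "Country/Region": ["Australia", "New Zealand", "Argentina", "Australia"],
    "Confirmed": [144, 1349, 2208, 1281],
})

# Collapse province-level rows into one total per country
country_totals = toy_geo.groupby("Country/Region", as_index=False)["Confirmed"].sum()
```

The resulting frame has a single Australia row (144 + 1281 = 1425) and can be passed to a country-level map without duplicate locations.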

7. Regional Clustering `(15 pts)`

Create a visualization that shows how confirmed cases are distributed geographically within a single WHO Region.

In [65]:
# put your answer here

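# Note: this attaches one region-wide sum (taken over all dates of the cumulative
# count) to every row, so all countries in a region share the same value. Per-country
# totals on a single date would show the distribution within one region.
df['Region Confirmed Cases'] = df.groupby('WHO Region')['Confirmed'].transform('sum')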
who_confirmed = df[['WHO Region', 'Country/Region', 'Region Confirmed Cases']]
who_confirmed

Unnamed: 0,WHO Region,Country/Region,Region Confirmed Cases
0,Eastern Mediterranean,Afghanistan,74082892
1,Europe,Albania,248879793
2,Africa,Algeria,21791827
3,Europe,Andorra,248879793
4,Africa,Angola,21791827
...,...,...,...
49063,Africa,Sao Tome and Principe,21791827
49064,Eastern Mediterranean,Yemen,74082892
49065,Africa,Comoros,21791827
49066,Europe,Tajikistan,248879793


In [67]:
fig = px.choropleth(
    who_confirmed,
    locations="Country/Region",
    locationmode='country names',
    color="Region Confirmed Cases",
    hover_name="Country/Region",
    hover_data={'Region Confirmed Cases': True, 'WHO Region': False},
    projection="natural earth",
    title="Confirmed Cases by WHO Region",
    color_continuous_scale=px.colors.sequential.Plasma
)
fig.show()