
In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
imdevskp_corona_virus_report_path = kagglehub.dataset_download('imdevskp/corona-virus-report')

print('Data source import complete.')

Downloading from https://www.kaggle.com/api/v1/datasets/download/imdevskp/corona-virus-report?dataset_version_number=166...
100%|██████████| 19.0M/19.0M [00:00<00:00, 75.6MB/s]
Extracting files...
Data source import complete.


# 🦠 Comparative Analysis of K-Means and K-Medoids Algorithm on Covid-19 Dataset

# 📄 **Covid-19 Dataset Metadata**

## 🗂️ **Dataset Overview**
This dataset provides detailed information on Covid-19 cases, deaths, recoveries, and trends across various countries and regions. The data helps analyze the pandemic's progression and regional impacts.

---

## 📝 **Metadata**
| **Attribute**          | **Description**                                                                 |
|-------------------------|---------------------------------------------------------------------------------|
| 🌍 **Country/Region**   | Name of the country or region.                                                  |
| 🟢 **Confirmed**        | Total number of confirmed Covid-19 cases.                                       |
| ⚰️ **Deaths**           | Total number of deaths caused by Covid-19.                                     |
| 💚 **Recovered**        | Total number of recoveries from Covid-19.                                      |
| 🟡 **Active**           | Number of active Covid-19 cases.                                               |
| 🆕 **New Cases**        | Newly reported Covid-19 cases.                                                 |
| ⚠️ **New Deaths**       | Newly reported deaths caused by Covid-19.                                      |
| 💹 **New Recovered**    | Newly reported recoveries from Covid-19.                                       |
| ⚰️ **Deaths / 100 Cases** | Death rate as a percentage of confirmed cases.                                |
| 💚 **Recovered / 100 Cases** | Recovery rate as a percentage of confirmed cases.                           |
| ⚖️ **Deaths / 100 Recovered** | Death rate as a percentage of recoveries.                                  |
| 🔄 **Confirmed Last Week** | Number of confirmed cases reported in the last week.                         |
| 📉 **1 Week Change**    | Change in confirmed cases over the past week.                                  |
| 📈 **1 Week % Increase**| Percentage increase in confirmed cases over the past week.                     |
| 🌐 **WHO Region**       | World Health Organization region classification of the country/region's location.          |

---
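
The three ratio columns in the table can be derived from the raw counts. The following is a minimal sketch on made-up numbers, assuming each ratio is simply `100 * numerator / denominator`; the toy frame and the helper name `add_rate_columns` are hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd


def add_rate_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with per-100 ratio columns derived from raw counts."""
    out = frame.copy()
    # Replace zero denominators with NaN so the ratio is undefined, not infinite
    confirmed = out["Confirmed"].replace(0, np.nan)
    recovered = out["Recovered"].replace(0, np.nan)
    out["Deaths / 100 Cases"] = 100 * out["Deaths"] / confirmed
    out["Recovered / 100 Cases"] = 100 * out["Recovered"] / confirmed
    out["Deaths / 100 Recovered"] = 100 * out["Deaths"] / recovered
    return out


# Made-up numbers for illustration only
toy = pd.DataFrame(
    {
        "Country/Region": ["A", "B"],
        "Confirmed": [200, 50],
        "Deaths": [10, 0],
        "Recovered": [100, 0],
    }
)
rates = add_rate_columns(toy)
print(rates)
```

Country "B" has no recoveries, so its deaths-per-recovered ratio is left as `NaN` rather than dividing by zero.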


### 🧑‍💻 Let’s Dive In!

## 📚 Importing Libraries

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

## 📥 Loading the Dataset

In [None]:
df = pd.read_csv(imdevskp_corona_virus_report_path + "/covid_19_clean_complete.csv")

## 📊 Exploring the Dataset

In [None]:
df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
0,,Afghanistan,33.93911,67.709953,2020-01-22,0,0,0,0,Eastern Mediterranean
1,,Albania,41.1533,20.1683,2020-01-22,0,0,0,0,Europe
2,,Algeria,28.0339,1.6596,2020-01-22,0,0,0,0,Africa
3,,Andorra,42.5063,1.5218,2020-01-22,0,0,0,0,Europe
4,,Angola,-11.2027,17.8739,2020-01-22,0,0,0,0,Africa


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49068 entries, 0 to 49067
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Province/State  14664 non-null  object 
 1   Country/Region  49068 non-null  object 
 2   Lat             49068 non-null  float64
 3   Long            49068 non-null  float64
 4   Date            49068 non-null  object 
 5   Confirmed       49068 non-null  int64  
 6   Deaths          49068 non-null  int64  
 7   Recovered       49068 non-null  int64  
 8   Active          49068 non-null  int64  
 9   WHO Region      49068 non-null  object 
dtypes: float64(2), int64(4), object(4)
memory usage: 3.7+ MB
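
The `Date` column is loaded as `object`, that is, as plain strings. ISO-formatted strings still sort correctly, but converting them to datetimes makes date arithmetic possible. A small sketch on a made-up frame (the `toy` data is hypothetical):

```python
import pandas as pd

# Made-up rows using the same date format as the dataset
toy = pd.DataFrame(
    {"Date": ["2020-01-22", "2020-01-23", "2020-07-27"], "Confirmed": [0, 1, 5]}
)

# Parse the strings into datetime64 values
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

# Date arithmetic now works directly on the column
span_days = (toy["Date"].max() - toy["Date"].min()).days
print(toy.dtypes)
print(span_days)
```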


## 📊 Data Cleaning

In [None]:
df = df.drop(columns=['Province/State'])

## 📊 Brief EDA

In [None]:
df["Country/Region"].value_counts()

Unnamed: 0_level_0,count
Country/Region,Unnamed: 1_level_1
China,6204
Canada,2256
France,2068
United Kingdom,2068
Australia,1504
...,...
Sao Tome and Principe,188
Yemen,188
Comoros,188
Tajikistan,188
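
The counts above are consistent with 188 rows per reporting location: 6204 = 33 × 188 for China and 2256 = 12 × 188 for Canada, which suggests those countries report several provinces per date. A hedged sketch of how such rows could be collapsed to one row per country and date, using made-up data:

```python
import pandas as pd

# Made-up rows: country "X" reports two provinces, country "Y" reports one
toy = pd.DataFrame(
    {
        "Country/Region": ["X", "X", "Y", "X", "X", "Y"],
        "Date": ["2020-01-22"] * 3 + ["2020-01-23"] * 3,
        "Confirmed": [1, 2, 0, 3, 4, 1],
        "Deaths": [0, 0, 0, 1, 0, 0],
    }
)

# Sum the province rows so each (country, date) pair appears exactly once
country_daily = toy.groupby(["Country/Region", "Date"], as_index=False)[
    ["Confirmed", "Deaths"]
].sum()
print(country_daily)
```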


In [None]:
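# 'Confirmed' is cumulative, so summing over every date would overcount.
# Take the latest date and sum across the rows of each country.
latest = df[df['Date'] == df['Date'].max()]
country_cases_grouped = latest.groupby('Country/Region', as_index=False)['Confirmed'].sum()

# Selecting the top 10 countries with the most confirmed cases
top_countries = country_cases_grouped.nlargest(10, 'Confirmed')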

# Create a bar chart using Plotly
fig = go.Figure(
    data=[
        go.Bar(
            x=top_countries['Country/Region'],
            y=top_countries['Confirmed'],
            marker=dict(color='blue'),
        )
    ]
)

# Update layout for the black background and other styles
fig.update_layout(
    title={
        'text': 'Top 10 Countries with Most Confirmed Cases',
        'x': 0.5,  # Center the title
        'font': {'size': 16, 'color': 'white'},
    },
    xaxis=dict(
        # `titlefont` is deprecated; the font is nested under `title` instead
        title=dict(text='Country', font=dict(size=14, color='white')),
        tickangle=45,
        tickfont=dict(color='white'),
    ),
    yaxis=dict(
        title=dict(text='Total Confirmed Cases', font=dict(size=14, color='white')),
        gridcolor='gray',
    ),
    plot_bgcolor='black',  # Black background for the plot
    paper_bgcolor='black',  # Black background for the figure
)

# Show the plot
fig.show()

In [None]:
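# 'Recovered' is cumulative, so use only the latest date instead of summing over all dates
country_recovered = df[df['Date'] == df['Date'].max()].groupby('Country/Region')['Recovered'].sum().sort_values(ascending=False)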

In [None]:
# Convert to DataFrame and reset the index
top_recovered_countries = country_recovered.head(10).reset_index()

# Plotting in Plotly (`go` was imported at the top of the notebook)

fig = go.Figure(
    data=[
        go.Bar(
            x=top_recovered_countries['Country/Region'],  # Countries as x-axis
            y=top_recovered_countries['Recovered'],  # Recovered counts as y-axis
            marker=dict(color='green'),
        )
    ]
)

# Update layout for the black background and other styles
fig.update_layout(
    title={
        'text': 'Top 10 Countries with Most Recovered Cases',
        'x': 0.5,  # Center the title
        'font': {'size': 16, 'color': 'white'},
    },
    xaxis=dict(
        # `titlefont` is deprecated; the font is nested under `title` instead
        title=dict(text='Country', font=dict(size=14, color='white')),
        tickangle=45,
        tickfont=dict(color='white'),
    ),
    yaxis=dict(
        title=dict(text='Total Recovered Cases', font=dict(size=14, color='white')),
        gridcolor='gray',  # Grid line color (the dash style is set below)
        gridwidth=0.5,
    ),
    plot_bgcolor='black',  # Black background for the plot
    paper_bgcolor='black',  # Black background for the figure
)

# Add dashed grid lines
fig.update_yaxes(showgrid=True, gridwidth=0.5, gridcolor='gray', griddash='dash')

# Show the plot
fig.show()

In [None]:
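# 'Deaths' is cumulative, so use only the latest date instead of summing over all dates
country_deaths = df[df['Date'] == df['Date'].max()].groupby('Country/Region')['Deaths'].sum().sort_values(ascending=False)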

In [None]:
top_death_countries = country_deaths.head(10).reset_index()  # Convert to DataFrame for Plotly

# Plotting with Plotly
fig = go.Figure(
    data=[
        go.Bar(
            x=top_death_countries['Country/Region'],  # Countries as x-axis
            y=top_death_countries['Deaths'],  # Death counts as y-axis
            marker=dict(color='red'),  # Red bars for deaths
        )
    ]
)

# Update layout for black background and other styles
fig.update_layout(
    title={
        'text': 'Top 10 Countries with Most Deaths',
        'x': 0.5,  # Center the title
        'font': {'size': 16, 'color': 'white'},
    },
    xaxis=dict(
        # `titlefont` is deprecated; the font is nested under `title` instead
        title=dict(text='Country', font=dict(size=14, color='white')),
        tickangle=45,  # Rotate labels for better readability
        tickfont=dict(color='white'),
    ),
    yaxis=dict(
        title=dict(text='Total Deaths', font=dict(size=14, color='white')),
        gridcolor='gray',  # Grid line color (the dash style is set below)
        gridwidth=0.5,
    ),
    plot_bgcolor='black',  # Black background for the plot area
    paper_bgcolor='black',  # Black background for the entire figure
)

# Add dashed grid lines
fig.update_yaxes(showgrid=True, gridwidth=0.5, gridcolor='gray', griddash='dash')

# Show the plot
fig.show()

In [None]:
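# 'Confirmed' is cumulative, so use only the latest date instead of summing over all dates
region_cases = df[df['Date'] == df['Date'].max()].groupby('WHO Region')['Confirmed'].sum().sort_values(ascending=False).reset_index()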

In [None]:
# Plotting with Plotly
fig = go.Figure(
    data=[
        go.Bar(
            x=region_cases['WHO Region'],  # WHO Regions as x-axis
            y=region_cases['Confirmed'],  # Confirmed cases as y-axis
            marker=dict(color='blue'),  # Blue bars for confirmed cases
        )
    ]
)

# Update layout for black background and other styles
fig.update_layout(
    title={
        'text': 'Total Confirmed Cases by WHO Region',
        'x': 0.5,  # Center the title
        'font': {'size': 16, 'color': 'white'},
    },
    xaxis=dict(
        # `titlefont` is deprecated; the font is nested under `title` instead
        title=dict(text='WHO Region', font=dict(size=14, color='white')),
        tickangle=45,
        tickfont=dict(color='white'),
    ),
    yaxis=dict(
        title=dict(text='Total Confirmed Cases', font=dict(size=14, color='white')),
        gridcolor='gray',  # Grid line color (the dash style is set below)
        gridwidth=0.5,
    ),
    plot_bgcolor='black',  # Black background for the plot
    paper_bgcolor='black',  # Black background for the figure
)

# Add dashed grid lines
fig.update_yaxes(showgrid=True, gridwidth=0.5, gridcolor='gray', griddash='dash')

# Show the plot
fig.show()

In [None]:
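# 'Recovered' is cumulative, so use only the latest date instead of summing over all dates
region_recoveries = df[df['Date'] == df['Date'].max()].groupby('WHO Region')['Recovered'].sum().sort_values(ascending=False).reset_index()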

In [None]:
# Plotting with Plotly
fig = go.Figure(
    data=[
        go.Bar(
            x=region_recoveries['WHO Region'],  # WHO Regions as x-axis
            y=region_recoveries['Recovered'],  # Recovered cases as y-axis
            marker=dict(color='green'),  # Green bars for recovered cases
        )
    ]
)

# Update layout for black background and other styles
fig.update_layout(
    title={
        'text': 'Total Recoveries by WHO Region',
        'x': 0.5,  # Center the title
        'font': {'size': 16, 'color': 'white'},
    },
    xaxis=dict(
        # `titlefont` is deprecated; the font is nested under `title` instead
        title=dict(text='WHO Region', font=dict(size=14, color='white')),
        tickangle=45,
        tickfont=dict(color='white'),
    ),
    yaxis=dict(
        title=dict(text='Total Recovered Cases', font=dict(size=14, color='white')),
        gridcolor='gray',  # Grid line color (the dash style is set below)
        gridwidth=0.5,
    ),
    plot_bgcolor='black',  # Black background for the plot
    paper_bgcolor='black',  # Black background for the figure
)

# Add dashed grid lines
fig.update_yaxes(showgrid=True, gridwidth=0.5, gridcolor='gray', griddash='dash')

# Show the plot
fig.show()

In [None]:
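# 'Deaths' is cumulative, so use only the latest date instead of summing over all dates
region_deaths = df[df['Date'] == df['Date'].max()].groupby('WHO Region')['Deaths'].sum().sort_values(ascending=False).reset_index()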

In [None]:
# Plotting with Plotly
fig = go.Figure(
    data=[
        go.Bar(
            x=region_deaths['WHO Region'],  # WHO Regions as x-axis
            y=region_deaths['Deaths'],  # Deaths as y-axis
            marker=dict(color='red'),  # Red bars for deaths
        )
    ]
)

# Update layout for black background and other styles
fig.update_layout(
    title={
        'text': 'Total Deaths by WHO Region',
        'x': 0.5,  # Center the title
        'font': {'size': 16, 'color': 'white'},
    },
    xaxis=dict(
        # `titlefont` is deprecated; the font is nested under `title` instead
        title=dict(text='WHO Region', font=dict(size=14, color='white')),
        tickangle=45,
        tickfont=dict(color='white'),
    ),
    yaxis=dict(
        title=dict(text='Total Deaths', font=dict(size=14, color='white')),
        gridcolor='gray',  # Grid line color (the dash style is set below)
        gridwidth=0.5,
    ),
    plot_bgcolor='black',  # Black background for the plot
    paper_bgcolor='black',  # Black background for the figure
)

# Add dashed grid lines
fig.update_yaxes(showgrid=True, gridwidth=0.5, gridcolor='gray', griddash='dash')

# Show the plot
fig.show()

# 🤖 Machine Learning

In [None]:
from sklearn.preprocessing import StandardScaler

data = df[['Confirmed', 'Deaths', 'Recovered', 'Active']]

scaler = StandardScaler()
data_normalized = scaler.fit_transform(data)

print(data_normalized[:5])

[[-0.13263982 -0.14004535 -0.14444638 -0.10602164]
 [-0.13263982 -0.14004535 -0.14444638 -0.10602164]
 [-0.13263982 -0.14004535 -0.14444638 -0.10602164]
 [-0.13263982 -0.14004535 -0.14444638 -0.10602164]
 [-0.13263982 -0.14004535 -0.14444638 -0.10602164]]
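
The five rows printed above are identical because the first five raw rows are all zeros, and standardization maps equal inputs to equal outputs. As a small illustration on a made-up array, `StandardScaler` subtracts each column's mean and divides by its population standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two all-zero rows followed by two non-zero rows
toy = np.array([[0.0, 0.0], [0.0, 0.0], [10.0, 2.0], [30.0, 6.0]])

# Manual z-score: (x - column mean) / column standard deviation (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
print(scaled[:2])
```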


## KMeans

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(data_normalized)

kmeans_labels = kmeans.labels_
kmeans_centroids = kmeans.cluster_centers_

df['KMeans_Cluster'] = kmeans_labels

In [None]:
df['KMeans_Cluster'].value_counts()

Unnamed: 0_level_0,count
KMeans_Cluster,Unnamed: 1_level_1
0,48854
2,153
1,61
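
The notebook fixes `n_clusters=3`. One common heuristic for sanity-checking that choice is to compare the inertia (within-cluster sum of squares) over a range of `k` values and look for an elbow. The sketch below uses synthetic blobs from `make_blobs`, not the Covid-19 data, so the numbers it prints say nothing about this dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with three well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Fit one model per candidate k and record its inertia
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this, the inertia typically drops sharply up to the true number of groups and flattens afterwards.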


In [None]:
df[df['KMeans_Cluster'] == 1].head()

Unnamed: 0,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region,KMeans_Cluster
33370,US,40.0,-100.0,2020-05-28,1730260,102643,399991,1227626,Americas,1
33631,US,40.0,-100.0,2020-05-29,1754764,103809,406446,1244509,Americas,1
33892,US,40.0,-100.0,2020-05-30,1779214,104778,416461,1257975,Americas,1
34153,US,40.0,-100.0,2020-05-31,1799124,105364,444758,1249002,Americas,1
34414,US,40.0,-100.0,2020-06-01,1816479,106136,458231,1252112,Americas,1


In [None]:
df.head()

Unnamed: 0,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region,KMeans_Cluster
0,Afghanistan,33.93911,67.709953,2020-01-22,0,0,0,0,Eastern Mediterranean,0
1,Albania,41.1533,20.1683,2020-01-22,0,0,0,0,Europe,0
2,Algeria,28.0339,1.6596,2020-01-22,0,0,0,0,Africa,0
3,Andorra,42.5063,1.5218,2020-01-22,0,0,0,0,Europe,0
4,Angola,-11.2027,17.8739,2020-01-22,0,0,0,0,Africa,0


In [None]:
# Scatter plot using Plotly Express
fig = px.scatter(
    df,
    x='Confirmed',  # X-axis: Confirmed Cases
    y='Deaths',  # Y-axis: Deaths
    color='KMeans_Cluster',  # Cluster label for coloring the points
    color_continuous_scale='Viridis',  # Color scale
    title='K-Means Clustering: Confirmed vs Deaths',  # Title
    labels={'Confirmed': 'Confirmed Cases', 'Deaths': 'Deaths'},  # Axis labels
)

# Update layout for black background and other styles
fig.update_layout(
    plot_bgcolor='black',  # Black background for the plot
    paper_bgcolor='black',  # Black background for the figure
    title={
        'text': 'K-Means Clustering: Confirmed vs Deaths',
        'x': 0.5,  # Center the title
        'font': {'size': 16, 'color': 'white'},
    },
    xaxis=dict(
        # `titlefont` is deprecated; the font is nested under `title` instead
        title=dict(text='Confirmed Cases', font=dict(size=14, color='white')),
        tickfont=dict(color='white'),
    ),
    yaxis=dict(
        title=dict(text='Deaths', font=dict(size=14, color='white')),
        tickfont=dict(color='white'),
    ),
    coloraxis_colorbar=dict(
        title='Cluster',
        # Label every cluster id, not only the smallest and largest
        tickvals=sorted(df['KMeans_Cluster'].unique()),
        ticktext=[f"Cluster {c}" for c in sorted(df['KMeans_Cluster'].unique())],
        ticks='outside',
        tickfont=dict(color='white'),
    ),
)

# Show the plot
fig.show()

## KMedoids

In [None]:
!pip install pyclustering

Collecting pyclustering
  Downloading pyclustering-0.10.1.2.tar.gz (2.6 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyclustering
  Building wheel for pyclustering (setup.py) ... done
  Created wheel for pyclustering: filename=pyclustering-0.10.1.2-py3-none-any.whl size=2395099 sha256=a77f4837f8aebe44860a1004d51f619b16511d2e23ed5c1f1a7dd9284f642689
  Stored in directory: /root/.cache/pip/wheels/9f/99/15/e881f46a92690ae77c2e3b255b89ea45d3a867b1b6c2ab3ba9
Successfully built pyclustering
Installing collected packages: pyclustering
Successfully installed pyclustering-0.10.1.2


In [None]:
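from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.cluster import cluster_visualizer

# Row indices used as the starting medoids. Rows 0-2 are identical after
# scaling (see the normalized output above), so the three starting medoids coincide.
initial_medoids = [0, 1, 2]

kmedoids_instance = kmedoids(data_normalized.tolist(), initial_medoids)
kmedoids_instance.process()

# get_clusters() returns one list of row indices per cluster
kmedoids_labels = kmedoids_instance.get_clusters()

# Flatten the per-cluster index lists into a single label column
df['KMedoids_Cluster'] = -1
for cluster_id, cluster in enumerate(kmedoids_labels):
    for idx in cluster:
        df.at[idx, 'KMedoids_Cluster'] = cluster_id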
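
The nested loop above writes one cell at a time, which can be slow on tens of thousands of rows. A possible alternative, sketched here with a hypothetical helper `clusters_to_labels` and made-up index lists, fills a NumPy label array one cluster at a time. It assumes the clusters are lists of positional row indices, as in the loop above.

```python
import numpy as np


def clusters_to_labels(clusters, n_samples):
    """Convert per-cluster index lists into one label per sample (-1 = unassigned)."""
    labels = np.full(n_samples, -1, dtype=int)
    for cluster_id, members in enumerate(clusters):
        # Assign the whole cluster in a single vectorized write
        labels[members] = cluster_id
    return labels


# Made-up index lists: sample 4 belongs to no cluster
labels = clusters_to_labels([[0, 2], [1, 3]], n_samples=5)
print(labels)
```

With the notebook's variables, the result could then be assigned in one step, for example `df['KMedoids_Cluster'] = clusters_to_labels(kmedoids_labels, len(df))`, assuming `df` still has its default `RangeIndex`.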

In [None]:
# Scatter plot using Plotly Express
fig = px.scatter(
    df,
    x='Confirmed',  # X-axis: Confirmed Cases
    y='Deaths',  # Y-axis: Deaths
    color='KMedoids_Cluster',  # Cluster label for coloring the points
    color_continuous_scale='Plasma',  # Color scale (similar to 'plasma' colormap)
    title='K-Medoids Clustering: Confirmed vs Deaths',  # Title
    labels={'Confirmed': 'Confirmed Cases', 'Deaths': 'Deaths'},  # Axis labels
)

# Update layout for black background and other styles
fig.update_layout(
    plot_bgcolor='black',  # Black background for the plot
    paper_bgcolor='black',  # Black background for the figure
    title={
        'text': 'K-Medoids Clustering: Confirmed vs Deaths',
        'x': 0.5,  # Center the title
        'font': {'size': 16, 'color': 'white'},
    },
    xaxis=dict(
        # `titlefont` is deprecated; the font is nested under `title` instead
        title=dict(text='Confirmed Cases', font=dict(size=14, color='white')),
        tickfont=dict(color='white'),
    ),
    yaxis=dict(
        title=dict(text='Deaths', font=dict(size=14, color='white')),
        tickfont=dict(color='white'),
    ),
    coloraxis_colorbar=dict(
        title='Cluster',
        # Label every cluster id, not only the smallest and largest
        tickvals=sorted(df['KMedoids_Cluster'].unique()),
        ticktext=[f"Cluster {c}" for c in sorted(df['KMedoids_Cluster'].unique())],
        ticks='outside',
        tickfont=dict(color='white'),
    ),

# Show the plot
fig.show()

In [None]:
from sklearn.metrics import silhouette_score

kmeans_silhouette = silhouette_score(data_normalized, kmeans_labels)

kmedoids_labels_flat = df['KMedoids_Cluster'].values
kmedoids_silhouette = silhouette_score(data_normalized, kmedoids_labels_flat)

print(f"Silhouette Score for K-Means: {kmeans_silhouette:.4f}")
print(f"Silhouette Score for K-Medoids: {kmedoids_silhouette:.4f}")

Silhouette Score for K-Means: 0.9767
Silhouette Score for K-Medoids: 0.9542
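
For each point, the silhouette coefficient is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the smallest mean distance to the points of any other cluster. The reported score is the mean over all points, so a cluster holding almost all rows (as in the K-Means counts above) dominates it. The sketch below computes the score by hand on made-up points and compares it with `silhouette_score`; the helper `manual_silhouette` is hypothetical and assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score


def manual_silhouette(X, labels):
    """Mean silhouette coefficient; assumes every cluster has at least 2 points."""
    dist = pairwise_distances(X)
    labels = np.asarray(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        a = dist[i, same].mean()
        # Smallest mean distance to the members of any other cluster
        b = min(
            dist[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Made-up points forming three tight pairs
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
labels = [0, 0, 1, 1, 2, 2]

manual = manual_silhouette(X, labels)
reference = silhouette_score(X, labels)
print(manual, reference)
```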


## PCA analysis

In [None]:
# import the PCA algorithm and apply it to the dataset
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_result = pca.fit_transform(data_normalized)

# convert pca_result into df
pca_result = pd.DataFrame(pca_result, columns=['PC1', 'PC2'])
pca_result["clusters"] = kmeans_labels
pca_result.head()

Unnamed: 0,PC1,PC2,clusters
0,-0.261045,-0.026449,0
1,-0.261045,-0.026449,0
2,-0.261045,-0.026449,0
3,-0.261045,-0.026449,0
4,-0.261045,-0.026449,0
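
In the rows shown earlier, `Active` equals `Confirmed - Deaths - Recovered`, so the four features are strongly related and two components may already capture most of the variance. One way to check this is `explained_variance_ratio_`. The sketch below uses synthetic correlated columns rather than the notebook's data, so its numbers are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))

# Four strongly correlated columns plus a little independent noise
X = np.hstack([base * w for w in (1.0, 0.9, 0.8, 0.7)]) + 0.05 * rng.normal(size=(200, 4))

pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each retained component
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

In the notebook, the same attribute could be read from the fitted `pca` object after `fit_transform`.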


In [None]:
# Visualize PC1 and PC2, colored by K-Means cluster
fig = px.scatter(
    pca_result,
    x="PC1",
    y="PC2",
    color="clusters",
    title="PCA Clustering",
)

fig.show()

## TSNE analysis

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
tsne_result = tsne.fit_transform(data_normalized)

tsne_result = pd.DataFrame(tsne_result, columns=['TSNE1', 'TSNE2'])
tsne_result["clusters"] = kmeans_labels
tsne_result

Unnamed: 0,TSNE1,TSNE2,clusters
0,-78.670761,-1.319201,0
1,-78.670761,-1.319201,0
2,-78.670761,-1.319201,0
3,-78.670761,-1.319201,0
4,-78.670761,-1.319201,0
...,...,...,...
49063,8.580592,102.594086,0
49064,48.799522,-31.051170,0
49065,-23.407118,134.172974,0
49066,60.860821,44.257179,0
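
t-SNE can be slow on tens of thousands of rows. One option, sketched here as an assumption rather than something the notebook does, is to embed a random subset of rows. The array `X` below is random stand-in data, and the sample size of 200 is arbitrary.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 4))  # stand-in for a scaled feature matrix

# Draw a random subset of row positions without replacement
sample_idx = rng.choice(len(X), size=200, replace=False)

# Embed only the sampled rows; perplexity must be smaller than the sample size
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X[sample_idx])
print(embedding.shape)
```

Keeping `sample_idx` makes it possible to look up the matching cluster labels for the sampled rows when coloring the plot.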


In [None]:
# Plot the t-SNE embedding, colored by K-Means cluster
fig = px.scatter(
    tsne_result,
    x="TSNE1",
    y="TSNE2",
    color="clusters",
    title="TSNE Clustering",
)

fig.show()

## Isolation Forest

In [None]:
from sklearn.ensemble import IsolationForest

forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
forest.fit(data_normalized)

df['IsolationForest_Score'] = forest.decision_function(data_normalized)
df['IsolationForest_Score'].describe()

Unnamed: 0,IsolationForest_Score
count,49068.0
mean,0.40595
std,0.104611
min,-0.104915
25%,0.415824
50%,0.45242
75%,0.462362
max,0.462567
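
`decision_function` returns scores where negative values mark outliers, and `predict` returns `-1` for exactly those rows. With `contamination=0.01`, the threshold is placed so that roughly 1% of the training rows score below zero. A small sketch on synthetic points (not the notebook's data) showing that the two views agree:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
inliers = rng.normal(size=(500, 2))
outliers = rng.uniform(low=6, high=8, size=(5, 2))  # points far from the main cloud
X = np.vstack([inliers, outliers])

forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=42).fit(X)
scores = forest.decision_function(X)
pred = forest.predict(X)  # -1 for outliers, 1 for inliers

# Both counts refer to the same rows
print((scores < 0).sum(), (pred == -1).sum())
```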


In [None]:
# plot the isolation forest score using histogram
fig = px.histogram(
    df,
    x="IsolationForest_Score",
    nbins=50,
    title="Isolation Forest Score Distribution",
)

fig.show()

In [None]:
df[df["IsolationForest_Score"] < 0]

Unnamed: 0,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region,KMeans_Cluster,KMedoids_Cluster,IsolationForest_Score
19537,US,40.000000,-100.000000,2020-04-05,337573,12470,17448,307655,Americas,0,1,-0.004174
19798,US,40.000000,-100.000000,2020-04-06,367215,14138,19581,333496,Americas,0,1,-0.008897
20059,US,40.000000,-100.000000,2020-04-07,397992,16447,21763,359782,Americas,0,1,-0.012062
20320,US,40.000000,-100.000000,2020-04-08,429686,18563,23559,387564,Americas,0,1,-0.022169
20581,US,40.000000,-100.000000,2020-04-09,464442,20638,25410,418394,Americas,0,1,-0.029700
...,...,...,...,...,...,...,...,...,...,...,...,...
48992,Russia,61.524010,105.318756,2020-07-27,816680,13334,602249,201097,Europe,2,1,-0.063358
49005,South Africa,-30.559500,22.937500,2020-07-27,452529,7067,274925,170537,Africa,0,1,-0.022705
49006,Spain,40.463667,-3.749220,2020-07-27,272421,28432,150376,93613,Europe,0,1,-0.003129
49028,United Kingdom,55.378100,-3.436000,2020-07-27,300111,45759,0,254352,Europe,0,1,-0.029700


**Notebook Author: Muhammad Hassan Saboor**