
In [None]:
# Run this cell first to download the Kaggle data source via kagglehub.
# Note: this notebook environment differs from Kaggle's Python environment,
# so some libraries used by the notebook may be missing.
import kagglehub
allen_institute_for_ai_cord_19_research_challenge_path = kagglehub.dataset_download('allen-institute-for-ai/CORD-19-research-challenge')

print('Data source import complete.')


# COVID-19 Research Dataset Analysis


## 📂 Data Loading

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed.
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory


import os

for filename in os.listdir('/kaggle/input/CORD-19-research-challenge'):
    print(filename)


In [None]:
df = pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv')
df.head()


## 📊 Dataset Overview

This dataset contains metadata of research articles related to COVID-19 and other coronaviruses. Below is a brief description of each column in the **metadata.csv** file:

| Column Name         | Description                                                                                         |
|---------------------|-----------------------------------------------------------------------------------------------------|
| `cord_uid`          | Unique identifier for each research article                                                        |
| `sha`               | SHA hash of the full text PDF, used to match full-text files                                       |
| `source_x`          | Source of the publication (e.g., PMC, Elsevier)                                                    |
| `title`             | Title of the research paper                                                                        |
| `doi`               | Digital Object Identifier (DOI) of the paper                                                       |
| `pmcid`             | PubMed Central ID                                                                                   |
| `pubmed_id`         | PubMed ID                                                                                           |
| `license`           | License type of the paper                                                                          |
| `abstract`          | Abstract (summary) of the research paper                                                           |
| `publish_time`      | Publication date                                                                                   |
| `authors`           | List of authors                                                                                    |
| `journal`           | Journal where the article was published                                                            |
| `mag_id`, `who_covidence_id`, `arxiv_id` | Additional IDs for cross-referencing (many may be missing)                    |
| `pdf_json_files`    | Path to the parsed PDF JSON file                                                                   |
| `pmc_json_files`    | Path to the parsed PMC JSON file                                                                   |
| `url`               | URL of the article                                                                                 |
| `s2_id`             | Semantic Scholar paper ID                                                                          |

This dataset is large and diverse, covering a wide range of COVID-19 research literature. The main purpose of this project is to explore trends, identify missing or anomalous data, and visualize key features of the literature.
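
Because the metadata file is large, it can help to load only the columns a given analysis needs. The following is a minimal sketch that uses a tiny in-memory CSV as a stand-in for `metadata.csv`; the sample rows and the `wanted` list are illustrative assumptions, not part of the dataset.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for metadata.csv (made-up rows, for illustration only)
sample_csv = io.StringIO(
    "cord_uid,sha,source_x,title,publish_time,journal\n"
    "a1,,PMC,Title one,2020-03-01,Journal A\n"
    "b2,,WHO,Title two,2020,Journal B\n"
)

# Hypothetical subset of columns; adjust it to the analysis at hand
wanted = ["cord_uid", "source_x", "title", "publish_time", "journal"]

# usecols skips the unneeded columns; dtype=str keeps IDs and dates as raw text
sample_df = pd.read_csv(sample_csv, usecols=wanted, dtype=str)
print(sample_df.columns.tolist())
```

Reading everything as text defers type decisions (for example date parsing) to later, explicit steps.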


In [None]:
# Columns and their data types
df.info()


In [None]:
df.describe(include='all')


## 🔍 Missing Data Analysis

In [None]:
# Calculate the number of missing values per column
missing_values = df.isnull().sum().sort_values()

# Calculate the percentage of missing values
missing_percent = (missing_values / len(df)) * 100

# Create a summary DataFrame
missing_data = pd.DataFrame({
    'Missing Values': missing_values,
    'Missing (%)': missing_percent
})

# Display columns with more than 10% missing data
missing_data[missing_data['Missing (%)'] > 10]


In [None]:
msno.matrix(df)
plt.title("Missing Data Matrix", fontsize=14)
plt.show()


In [None]:
# Bar plot for missing values percentage
missing_percent = (df.isnull().sum() / len(df)) * 100

plt.figure(figsize=(12, 6))
missing_percent[missing_percent > 0].sort_values(ascending=False).plot(kind='bar', color='salmon')
plt.ylabel("Missing Value Percentage (%)")
plt.title("Percentage of Missing Values by Column")
plt.xticks(rotation=45)
plt.show()


In [None]:
# Drop irrelevant columns with high missing data
df_clean = df.drop(columns=['mag_id', 'sha', 'pmcid', 'pdf_json_files', 'pmc_json_files', 'arxiv_id'])


In [None]:
# Drop rows where key columns are missing
df_clean = df_clean.dropna(subset=['title', 'abstract', 'publish_time'])


In [None]:
# New dataset size
print(f"The cleaned dataset contains {df_clean.shape[0]} rows and {df_clean.shape[1]} columns.")

# Check for remaining missing data
df_clean.isnull().sum().sort_values(ascending=False)


## 📑 Statistical Summary

In this section, we will explore the statistical summary of numeric and text-based columns in the dataset.  
We will also look at the distribution of key categorical variables.
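
As a rough illustration of what a summary of text-based columns reports, the sketch below calls `describe()` on a small made-up frame. The `toy` data is an assumption for demonstration only; the column names mirror those in the metadata table.

```python
import pandas as pd

# Made-up rows that mimic a few metadata columns
toy = pd.DataFrame({
    "source_x": ["PMC", "WHO", "WHO", "Medline"],
    "license": ["cc-by", "unk", "unk", "no-cc"],
    "pubmed_id": [1.0, 2.0, None, 4.0],
})

# For text columns, describe() reports count, unique, top (most frequent value) and freq
text_summary = toy[["source_x", "license"]].describe()
print(text_summary)

# Number of distinct non-missing values per column
print(toy.nunique())
```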


In [None]:
# Summary statistics for numeric columns
df_clean.describe()


In [None]:
# Check unique values for key categorical columns
print("Source Types:", df_clean['source_x'].unique())
print("License Types:", df_clean['license'].unique())
print("Example Journals:", df_clean['journal'].unique()[:10])  # Show only the first 10, since there may be many


## 🚨 Outlier Detection



In [None]:
df_clean['publish_time'] = pd.to_datetime(df_clean['publish_time'], errors='coerce')

In [None]:
print("Number of invalid publish_time entries:", df_clean['publish_time'].isna().sum())
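
If `publish_time` mixes full dates with year-only strings, `pd.to_datetime(..., errors='coerce')` may, depending on the pandas version, turn some otherwise valid entries into `NaT`. One possible workaround, sketched here under the assumption that every usable entry starts with a four-digit year, is to extract the year directly from the raw text. The `raw` values below are made up for illustration.

```python
import pandas as pd

# Made-up examples of formats that publish_time might contain (assumption)
raw = pd.Series(["2020-03-15", "2020", "2019-12", None, "not a date"])

# Pull a leading four-digit year out of each string; rows without one become NaN
year = pd.to_numeric(raw.str.extract(r"^(\d{4})", expand=False), errors="coerce")

print(year.tolist())
```

This only recovers the year, so it complements rather than replaces the datetime conversion used for monthly trends.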

In [None]:
# Distribution of publication years
df_clean['publish_year'] = df_clean['publish_time'].dt.year

# Earliest and latest publication year
print("Earliest publish year:", df_clean['publish_year'].min())
print("Latest publish year:", df_clean['publish_year'].max())


In [None]:


plt.figure(figsize=(12,6))
df_clean['publish_year'].hist(bins=50)
plt.title("Publication Year Distribution")
plt.xlabel("Year")
plt.ylabel("Number of Publications")
plt.show()


In [None]:
outliers_early = df_clean[df_clean['publish_year'] < 1900]
outliers_late = df_clean[df_clean['publish_year'] > 2025]

print("Number of publications before 1900:", len(outliers_early))
print("Number of publications after 2025:", len(outliers_late))


In [None]:
# Calculate Q1, Q3, and IQR for publish_year
Q1 = df_clean['publish_year'].quantile(0.25)
Q3 = df_clean['publish_year'].quantile(0.75)
IQR = Q3 - Q1

# Define lower and upper bounds for outliers using the 1.5 * IQR rule
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Q1 (25th percentile): {Q1}")
print(f"Q3 (75th percentile): {Q3}")
print(f"IQR (Q3 - Q1): {IQR}")
print(f"Lower bound for outliers (Q1 - 1.5*IQR): {lower_bound}")
print(f"Upper bound for outliers (Q3 + 1.5*IQR): {upper_bound}")

# Count the number of outliers based on the IQR method
outliers_iqr = df_clean[(df_clean['publish_year'] < lower_bound) | (df_clean['publish_year'] > upper_bound)]

print(f"\nNumber of potential outliers based on IQR method: {len(outliers_iqr)}")
print("Potential outlier publish_year values:")
print(outliers_iqr['publish_year'].unique())

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(y=df_clean['publish_year'])
plt.title("Boxplot of Publication Year")
plt.ylabel("Publication Year")
plt.show()

### Outlier Analysis for Publication Year

Analysis of the `publish_year` distribution, using the boxplot and the IQR method, shows that most publications in the dataset are tightly concentrated between Q1 and Q3 (the values printed by the IQR cell above). The narrow IQR highlights this concentration in recent years, likely due to the CORD-19 dataset's focus.

Under the 1.5 × IQR rule, publications with years outside the printed lower and upper bounds are flagged as potential outliers, and the IQR cell reports how many there are. They come from various years, notably very early ones such as 1879, which stand out from the rest of the data.

Although these years are statistical outliers, many are legitimate older publications rather than errors; they are flagged only because the dataset is dominated by recent COVID-19 research. Extremely early years require further investigation.
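
To make the 1.5 × IQR rule concrete, here is a small worked sketch on made-up years (not values from the dataset). It shows how a tight cluster of recent years produces a narrow IQR, so that even moderately older publications fall outside the bounds.

```python
import pandas as pd

# Made-up publication years: a tight recent cluster plus two older entries
years = pd.Series([1879, 2003, 2019, 2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021, 2022])

q1 = years.quantile(0.25)
q3 = years.quantile(0.75)
iqr = q3 - q1

# Anything below q1 - 1.5*IQR or above q3 + 1.5*IQR is flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = years[(years < lower) | (years > upper)]

print(f"bounds: [{lower}, {upper}]")
print(flagged.tolist())
```

In this toy sample both 1879 and 2003 are flagged, although only the former looks suspicious, which is why flagged years deserve a manual look before removal.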

## 📈 Data Visualization

In [None]:
# Count the occurrences of each source type
source_counts = df_clean['source_x'].value_counts()

# Create a bar plot
plt.figure(figsize=(10, 6))
source_counts.plot(kind='bar', color='skyblue')
plt.title("Distribution of Publication Sources (source_x)")
plt.xlabel("Source")
plt.ylabel("Number of Publications")
plt.xticks(rotation=45, ha='right') # Rotate labels for better readability
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()

This bar chart shows the distribution of publications by source. The majority of the articles come from WHO, followed by other sources such as PMC, Medline, and Elsevier.
This indicates that a significant portion of the COVID-19 literature is aggregated from major scientific repositories, with WHO being the largest contributor.
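
If some `source_x` entries list several sources in one string, counting raw values treats each combination as its own category. The sketch below assumes a `"; "` separator and uses made-up values; check the unique values printed earlier before relying on it.

```python
import pandas as pd

# Made-up source_x values; the "; " separator is an assumption
sources = pd.Series(["WHO", "Medline; PMC", "PMC", "Medline; PMC; WHO"])

# Split each entry into a list, then give every source its own row
per_source = sources.str.split("; ").explode().str.strip()

# Count how often each individual source appears
counts = per_source.value_counts()
print(counts)
```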

In [None]:
license_counts = df_clean['license'].value_counts()

plt.figure(figsize=(10, 6))
license_counts.plot(kind='bar', color='lightgreen')
plt.title("Distribution of Publication Licenses")
plt.xlabel("License Type")
plt.ylabel("Number of Publications")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

This bar chart shows the distribution of publication licenses. The most common license is 'unk' (unknown), indicating that many articles lack clear license information.
Among the defined licenses, 'cc-by', 'no-cc', and 'cc-by-nc' are the most frequent, reflecting various levels of open-access permissions.
This distribution highlights the variability in licensing within the dataset.
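
The share of unknown licenses can be quantified rather than read off the chart. A minimal sketch on made-up license labels, where `licenses` stands in for `df_clean['license']`:

```python
import pandas as pd

# Made-up license labels for illustration
licenses = pd.Series(["unk", "cc-by", "unk", "no-cc", "cc-by-nc", "unk", "cc-by", "unk"])

# normalize=True returns fractions instead of raw counts
shares = licenses.value_counts(normalize=True) * 100

# .get avoids a KeyError if the label is absent
unk_share = shares.get("unk", 0.0)
print(f"'unk' share: {unk_share:.1f}%")
```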

In [None]:
# Get the top 10 journals with the most publications
top_journals = df_clean['journal'].value_counts().head(10)

# Create a bar chart
plt.figure(figsize=(10, 6))
top_journals.plot(kind='bar', color='skyblue')
plt.title("Top 10 Journals by Number of Publications")
plt.xlabel("Journal")
plt.ylabel("Number of Publications")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


This bar chart highlights the top 10 journals with the highest number of COVID-19-related publications.
PLoS One, bioRxiv, and International Journal of Environmental Research and Public Health are the leading journals, indicating their crucial role in disseminating pandemic-related scientific findings.
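
Journal rankings like this can be sensitive to spelling variants of the same journal name. Whether such variants occur in this dataset is an assumption; the sketch below shows, on made-up names, how normalizing case and whitespace before counting would merge them.

```python
import pandas as pd

# Made-up journal names with hypothetical spelling variants
journals = pd.Series(["PLoS One", "PLOS ONE", "bioRxiv", "plos one ", None])

# Trim whitespace and lowercase so variants collapse into one label
normalized = journals.str.strip().str.lower()

# Missing journal names are ignored by value_counts by default
journal_counts = normalized.value_counts()
print(journal_counts)
```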

In [None]:
df_clean['publish_time'] = pd.to_datetime(df_clean['publish_time'], errors='coerce')



In [None]:
# Create a new column in 'YYYY-MM' format
df_clean['publish_month'] = df_clean['publish_time'].dt.to_period('M')

# Count publications per month, filtering for years >= 2019
monthly_counts = df_clean[df_clean['publish_time'].dt.year >= 2019]['publish_month'].value_counts().sort_index()


In [None]:
plt.figure(figsize=(14, 6))
monthly_counts.plot(kind='line', marker='o', color='teal')
plt.title("Monthly Publication Trends Since 2019")
plt.xlabel("Publication Month")
plt.ylabel("Number of Publications")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


This line chart visualizes the monthly publication trend over time, focusing on publications from 2019 onwards. It shows the evolution of research output related to COVID-19 and other coronaviruses over this critical period.

The graph shows a sharp increase in the number of publications starting in early 2020, coinciding with the global spread of the COVID-19 pandemic. This surge reflects the intense global research effort to understand and combat the virus. The months after the initial surge show where research output peaked and how it changed afterward.
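
The peak month can also be located programmatically instead of being read off the chart. A minimal sketch on made-up dates, following the same period-based counting used for `monthly_counts`:

```python
import pandas as pd

# Made-up publication dates for illustration
dates = pd.to_datetime(pd.Series([
    "2019-12-05", "2020-01-10", "2020-03-02", "2020-03-20",
    "2020-03-28", "2020-04-11", "2020-04-15",
]))

# Count publications per calendar month, in chronological order
monthly = dates.dt.to_period("M").value_counts().sort_index()

# idxmax returns the month (a Period) with the highest count
peak_month = monthly.idxmax()
peak_count = monthly.max()
print(peak_month, peak_count)
```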