Project Title:
Exploratory Data Analysis on COVID-19 Clinical Trials

Objective:
The goal of this project is to explore the clinical trials dataset for COVID-19, understand the characteristics of trials such as status, phases, study designs, and other key factors, and derive meaningful insights from the data.

Tools Used:
Python
Pandas
Matplotlib
Seaborn

Steps to Follow:
Import necessary libraries
Load the dataset
Clean the data (handle missing values)
Perform univariate and bivariate analysis
Conduct time-series and country-based analysis
Summarize insights

In [100]:
import pandas as pd
import numpy as np

print("Libraries imported successfully!")


Libraries imported successfully!


In [101]:
# Load the dataset
file_path = r"C:\Users\ACER\OneDrive\Documents\Desktop\covid\New folder\COVID clinical trials.csv"
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
print("Dataset loaded successfully!")
df.head()


Dataset loaded successfully!


Unnamed: 0,Rank,NCT Number,Title,Acronym,Status,Study Results,Conditions,Interventions,Outcome Measures,Sponsor/Collaborators,...,Other IDs,Start Date,Primary Completion Date,Completion Date,First Posted,Results First Posted,Last Update Posted,Locations,Study Documents,URL
0,1,NCT04785898,Diagnostic Performance of the ID Now™ COVID-19...,COVID-IDNow,"Active, not recruiting",No Results Available,Covid19,Diagnostic Test: ID Now™ COVID-19 Screening Test,Evaluate the diagnostic performance of the ID ...,Groupe Hospitalier Paris Saint Joseph,...,COVID-IDNow,"November 9, 2020","December 22, 2020","April 30, 2021","March 8, 2021",,"March 8, 2021","Groupe Hospitalier Paris Saint-Joseph, Paris, ...",,https://ClinicalTrials.gov/show/NCT04785898
1,2,NCT04595136,Study to Evaluate the Efficacy of COVID19-0001...,COVID-19,Not yet recruiting,No Results Available,SARS-CoV-2 Infection,Drug: Drug COVID19-0001-USR|Drug: normal saline,Change on viral load results from baseline aft...,United Medical Specialties,...,COVID19-0001-USR,"November 2, 2020","December 15, 2020","January 29, 2021","October 20, 2020",,"October 20, 2020","Cimedical, Barranquilla, Atlantico, Colombia",,https://ClinicalTrials.gov/show/NCT04595136
2,3,NCT04395482,Lung CT Scan Analysis of SARS-CoV2 Induced Lun...,TAC-COVID19,Recruiting,No Results Available,covid19,Other: Lung CT scan analysis in COVID-19 patients,A qualitative analysis of parenchymal lung dam...,University of Milano Bicocca,...,TAC-COVID19,"May 7, 2020","June 15, 2021","June 15, 2021","May 20, 2020",,"November 9, 2020","Ospedale Papa Giovanni XXIII, Bergamo, Italy|P...",,https://ClinicalTrials.gov/show/NCT04395482
3,4,NCT04416061,The Role of a Private Hospital in Hong Kong Am...,COVID-19,"Active, not recruiting",No Results Available,COVID,Diagnostic Test: COVID 19 Diagnostic Test,Proportion of asymptomatic subjects|Proportion...,Hong Kong Sanatorium & Hospital,...,RC-2020-08,"May 25, 2020","July 31, 2020","August 31, 2020","June 4, 2020",,"June 4, 2020","Hong Kong Sanatorium & Hospital, Hong Kong, Ho...",,https://ClinicalTrials.gov/show/NCT04416061
4,5,NCT04395924,Maternal-foetal Transmission of SARS-Cov-2,TMF-COVID-19,Recruiting,No Results Available,Maternal Fetal Infection Transmission|COVID-19...,Diagnostic Test: Diagnosis of SARS-Cov2 by RT-...,COVID-19 by positive PCR in cord blood and / o...,Centre Hospitalier Régional d'Orléans|Centre d...,...,CHRO-2020-10,"May 5, 2020",May 2021,May 2021,"May 20, 2020",,"June 4, 2020","CHR Orléans, Orléans, France",,https://ClinicalTrials.gov/show/NCT04395924


In [102]:
# Display basic information about the dataset
print("Dataset Information:")
df.info()

# Check for missing values in each column
print("\nMissing Values in Columns:")
print(df.isnull().sum())

# Show basic statistics for numerical columns
print("\nStatistical Summary:")
df.describe()


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5783 entries, 0 to 5782
Data columns (total 27 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Rank                     5783 non-null   int64  
 1   NCT Number               5783 non-null   object 
 2   Title                    5783 non-null   object 
 3   Acronym                  2480 non-null   object 
 4   Status                   5783 non-null   object 
 5   Study Results            5783 non-null   object 
 6   Conditions               5783 non-null   object 
 7   Interventions            4897 non-null   object 
 8   Outcome Measures         5748 non-null   object 
 9   Sponsor/Collaborators    5783 non-null   object 
 10  Gender                   5773 non-null   object 
 11  Age                      5783 non-null   object 
 12  Phases                   3322 non-null   object 
 13  Enrollment               5749 non-null   float64
 14  Fun

Unnamed: 0,Rank,Enrollment
count,5783.0,5749.0
mean,2892.0,18319.49
std,1669.552635,404543.7
min,1.0,0.0
25%,1446.5,60.0
50%,2892.0,170.0
75%,4337.5,560.0
max,5783.0,20000000.0
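The summary above shows a heavily right-skewed Enrollment column: the mean (about 18,319) is far above the median (170), and the maximum is 20,000,000. A minimal sketch of how such a column might be inspected with quantiles and a log transform. The sample values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical enrollment values, chosen only to mimic a right-skewed column
enrollment = pd.Series([0, 60, 170, 560, 20_000_000], name="Enrollment", dtype="float64")

# The mean is pulled up by the single extreme value; the median is not
mean_value = enrollment.mean()
median_value = enrollment.median()

# Upper quantiles show where the extreme values start
upper_quantiles = enrollment.quantile([0.90, 0.99])

# log1p compresses the range and is defined for zero enrollment
log_enrollment = np.log1p(enrollment)

print(f"mean={mean_value:.1f}, median={median_value:.1f}")
print(upper_quantiles)
print(log_enrollment.round(2))
```

This gap between mean and median is one reason a median fill for missing Enrollment values is usually a safer default than a mean fill.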


In [103]:
# Drop columns with excessive missing values
columns_to_drop = ['Acronym', 'Study Documents', 'Results First Posted']
df = df.drop(columns=columns_to_drop, errors='ignore')

# Fill missing values by reassignment rather than inplace=True
df = df.assign(
    **{
        'Primary Completion Date': df['Primary Completion Date'].fillna("Unknown"),
        'Interventions': df['Interventions'].fillna("No Intervention"),
        'Phases': df['Phases'].fillna("No Phase Information"),
        # Median rather than mean: Enrollment is heavily right-skewed (see describe() above)
        'Enrollment': df['Enrollment'].fillna(df['Enrollment'].median()),
    }
)

# Check remaining missing values
print("Remaining missing values:")
print(df.isnull().sum())


Remaining missing values:
Rank                         0
NCT Number                   0
Title                        0
Status                       0
Study Results                0
Conditions                   0
Interventions                0
Outcome Measures            35
Sponsor/Collaborators        0
Gender                      10
Age                          0
Phases                       0
Enrollment                   0
Funded Bys                   0
Study Type                   0
Study Designs               35
Other IDs                    1
Start Date                  34
Primary Completion Date      0
Completion Date             36
First Posted                 0
Last Update Posted           0
Locations                  585
URL                          0
dtype: int64
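Before dropping or filling columns, it can help to look at the share of missing values rather than raw counts. A minimal sketch on a small made-up frame. The column values and the 50% threshold are assumptions for illustration, not the rule used in this notebook.

```python
import pandas as pd

# Hypothetical frame with the same kind of gaps as the trial data
sample = pd.DataFrame({
    "Acronym": [None, "ABC", None, None],
    "Phases": ["Phase 2", None, "Phase 3", "Phase 1"],
    "Status": ["Recruiting", "Completed", "Recruiting", "Withdrawn"],
})

# isnull().mean() gives the fraction of missing values per column
missing_share = sample.isnull().mean().sort_values(ascending=False)

# Assumed threshold: flag columns that are more than half empty
threshold = 0.5
mostly_empty = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print("Columns above threshold:", mostly_empty)
```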


In [104]:
# Status distribution (show counts only)
print("Distribution of Clinical Trial Status:")
print(df['Status'].value_counts())

# Phase distribution
if 'Phases' in df.columns:
    print("\nDistribution of Clinical Trial Phases:")
    print(df['Phases'].value_counts())

# Age group analysis
if 'Age' in df.columns:
    print("\nAge Group Distribution:")
    print(df['Age'].value_counts())


Distribution of Clinical Trial Status:
Status
Recruiting                   2805
Completed                    1025
Not yet recruiting           1004
Active, not recruiting        526
Enrolling by invitation       181
Withdrawn                     107
Terminated                     74
Suspended                      27
Available                      19
No longer available            12
Approved for marketing          2
Temporarily not available       1
Name: count, dtype: int64

Distribution of Clinical Trial Phases:
Phases
No Phase Information    2461
Not Applicable          1354
Phase 2                  685
Phase 3                  450
Phase 1                  234
Phase 2|Phase 3          200
Phase 1|Phase 2          192
Phase 4                  161
Early Phase 1             46
Name: count, dtype: int64

Age Group Distribution:
Age
18 Years and older   (Adult, Older Adult)           2885
Child, Adult, Older Adult                            486
18 Years to 80 Years   (Adult, Older Adult)
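The phase counts above treat combined labels such as "Phase 1|Phase 2" as their own categories. If per-phase totals are wanted instead, one option is to split on the "|" separator and count each part. A minimal sketch using a few made-up rows. Whether counting a combined-phase trial under both phases is appropriate depends on the question being asked.

```python
import pandas as pd

# Hypothetical rows using the label format shown in the output above
phases = pd.Series(
    ["Phase 2", "Phase 1|Phase 2", "Phase 2|Phase 3", "Not Applicable", "Phase 3"],
    name="Phases",
)

# Split combined labels into lists (a single-character separator is
# treated literally), then give each part its own row
per_phase = phases.str.split("|").explode()

# A trial labelled "Phase 1|Phase 2" now contributes to both phases
per_phase_counts = per_phase.value_counts()
print(per_phase_counts)
```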

In [105]:
# Status vs. Phases Analysis
if 'Phases' in df.columns and 'Status' in df.columns:
    status_phase_counts = df.groupby(['Status', 'Phases']).size().reset_index(name='Count')
    print("\nStatus vs. Phases Counts:")
    print(status_phase_counts.sort_values(by='Count', ascending=False).head(10))

# Most common conditions by status
if 'Conditions' in df.columns:
    common_conditions = df.groupby('Status')['Conditions'].apply(lambda x: x.value_counts().head(3))
    print("\nTop 3 Conditions per Status:")
    print(common_conditions)



Status vs. Phases Counts:
                    Status                Phases  Count
40              Recruiting  No Phase Information   1224
41              Recruiting        Not Applicable    647
12               Completed  No Phase Information    565
31      Not yet recruiting  No Phase Information    350
44              Recruiting               Phase 2    343
32      Not yet recruiting        Not Applicable    282
13               Completed        Not Applicable    226
46              Recruiting               Phase 3    196
1   Active, not recruiting  No Phase Information    175
35      Not yet recruiting               Phase 2    114

Top 3 Conditions per Status:
Status                                                                                  
Active, not recruiting     COVID-19                                                          87
                           Covid19                                                           47
                           COVID              
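The top conditions above include several spellings of the same condition ("COVID-19", "Covid19", "COVID"), which splits the counts. A minimal sketch of one way to merge them, assuming a hand-written alias table. The `normalize_condition` helper and the `ALIASES` mapping are hypothetical and would need review against the real labels.

```python
import re
import pandas as pd

# Hypothetical alias table: normalized key -> canonical label
ALIASES = {"covid19": "COVID-19", "covid": "COVID-19"}

def normalize_condition(label: str) -> str:
    """Lowercase, drop non-alphanumerics, then map known aliases."""
    key = re.sub(r"[^a-z0-9]", "", label.lower())
    # Labels without a known alias are returned unchanged (apart from trimming)
    return ALIASES.get(key, label.strip())

conditions = pd.Series(["COVID-19", "Covid19", "COVID", "covid19", "SARS-CoV-2 Infection"])
normalized = conditions.map(normalize_condition)
print(normalized.value_counts())
```

Conditions values in this dataset can also hold several "|"-separated entries, so a fuller version would likely split them before normalizing.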

In [106]:
# Convert 'Start Date' to datetime; values that cannot be parsed become NaT
if 'Start Date' in df.columns:
    df['Start Date'] = pd.to_datetime(df['Start Date'], errors='coerce')

    # Count trials started by month
    trials_by_month = df['Start Date'].dt.to_period('M').value_counts().sort_index()
    print("\nTrials Started Over Time (Monthly Counts):")
    print(trials_by_month.head(10))  # Display the first 10 months



Trials Started Over Time (Monthly Counts):
Start Date
1998-01    1
2010-03    1
2011-02    1
2011-03    1
2012-01    1
2012-02    1
2012-05    1
2013-01    1
2013-04    1
2013-10    1
Freq: M, Name: count, dtype: int64
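The date columns in the preview use two styles, for example "November 9, 2020" and "May 2021". Depending on the pandas version, a single `pd.to_datetime(..., errors='coerce')` call may turn one of the styles into NaT without warning. A minimal sketch that parses each style explicitly and reports what remains unparsed. The two format strings are assumptions based on the rows shown earlier.

```python
import pandas as pd

# Hypothetical values in the two styles seen in the dataset preview
raw_dates = pd.Series(["November 9, 2020", "May 2021", "May 7, 2020", None, "not a date"])

# Parse each style with an explicit format; non-matching values become NaT
full_dates = pd.to_datetime(raw_dates, format="%B %d, %Y", errors="coerce")
month_only = pd.to_datetime(raw_dates, format="%B %Y", errors="coerce")

# Prefer the full date, fall back to the month-only parse (first of the month)
parsed = full_dates.fillna(month_only)

print(parsed)
print("Unparsed values:", int(parsed.isna().sum()))
```

Checking the unparsed count before and after conversion makes it visible when rows silently drop out of the monthly counts.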


In [107]:
# Filter data for trials starting in or after 2020
filtered_trials = df[df['Start Date'] >= '2020-01-01']

# Display counts by month for relevant trials
relevant_trials_by_month = filtered_trials['Start Date'].dt.to_period('M').value_counts().sort_index()
print("\nRelevant Trials Started Over Time (2020 Onward):")
print(relevant_trials_by_month.head(10))



Relevant Trials Started Over Time (2020 Onward):
Start Date
2020-01     61
2020-02    100
2020-03    417
2020-04    825
2020-05    645
2020-06    502
2020-07    361
2020-08    265
2020-09    306
2020-10    257
Freq: M, Name: count, dtype: int64
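The steps list includes a country-based analysis, which the cells above do not cover. A minimal sketch of one approach, assuming each Locations value is a "|"-separated list of sites whose last comma-separated part is the country, as in the preview rows. The sample values below are shortened and partly made up, and the `extract_countries` helper is hypothetical.

```python
import pandas as pd

def extract_countries(locations):
    """Return the unique countries named in one Locations value."""
    if pd.isna(locations):
        return []
    sites = str(locations).split("|")
    # Assumption: the country is the text after the last comma of each site
    countries = {site.rsplit(",", 1)[-1].strip() for site in sites}
    return sorted(countries)

sample_locations = pd.Series([
    "Cimedical, Barranquilla, Atlantico, Colombia",
    "CHR Orléans, Orléans, France",
    "Site A, Bergamo, Italy|Site B, Milan, Italy",
    None,
])

# One row per (trial, country); trials without locations drop out
countries = sample_locations.map(extract_countries).explode().dropna()
print(countries.value_counts())
```

Each trial counts once per country here, so a multi-site trial within one country is not over-weighted. Since 585 rows have no Locations value, any country ranking covers only the trials that report sites.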


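Summary of Findings:
The most common trial statuses are:
Recruiting (2,805 of 5,783 trials)
Completed (1,025)
Not yet recruiting (1,004)
Trials span multiple phases, but 2,461 have no phase information and a further 1,354 are marked "Not Applicable"; among the rest, Phase 2 (685) and Phase 3 (450) are the most common.
The largest age category is "18 Years and older" (2,885 trials). Trial starts rose sharply in early 2020, peaked in April 2020 (825), and then declined through October 2020.
Trials were filtered to those starting on or after 2020-01-01 to highlight pandemic-era trends.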


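Project Summary: COVID-19 Clinical Trials Analysis
Key Findings:

Trial Status Distribution:
Recruiting is the most common status (2,805 trials), followed by Completed (1,025) and Not yet recruiting (1,004).

Phase Distribution:
2,461 trials have no phase information and 1,354 are marked "Not Applicable". The remainder are unevenly distributed: Phase 2 (685) and Phase 3 (450) are far more common than Phase 1 (234), Phase 4 (161), or Early Phase 1 (46).

Top Conditions for Trials:
COVID-19 was the dominant condition under study, though it appears under several spellings (for example "COVID-19", "Covid19", and "COVID").

Time-Series Analysis:
Trial starts surged as the pandemic began, from 61 in January 2020 to a peak of 825 in April 2020, then fell to 257 by October 2020.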