# COVID Vaccine: Clinical Trials and Progress EDA

<img src="https://images.moneycontrol.com/static-mcnews/2020/06/coronavirus-vaccine-770x433.jpg?impolicy=website&width=770&height=431" width=500><br>

[ClinicalTrials.gov](https://clinicaltrials.gov) is a database of privately and publicly funded clinical studies conducted around the world. It is maintained by the U.S. National Library of Medicine at the National Institutes of Health. All data is publicly available, and the site provides a direct download feature that makes it easy to use the relevant data for analysis.

In this notebook we analyse the ongoing studies on vaccines for COVID and pneumonia-related diseases, their progress, and their timelines.

In [None]:
!pip install calmap

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.express as px
from plotly.subplots import make_subplots
from collections import defaultdict 
import calmap
plt.rcParams['figure.figsize'] = 8, 5
plt.style.use("fivethirtyeight")
pd.options.plotting.backend = "plotly"
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import re

In [None]:
data = pd.read_csv('../input/covid19-clinical-trials-dataset/COVID clinical trials.csv')
data.head()

In [None]:
# Parse every date column into its own datetime column
data['Start Date'] = pd.to_datetime(data['Start Date'])
data['Completion Date'] = pd.to_datetime(data['Completion Date'])
data['Primary Completion Date'] = pd.to_datetime(data['Primary Completion Date'])
data['First Posted'] = pd.to_datetime(data['First Posted'])
data['Results First Posted'] = pd.to_datetime(data['Results First Posted'])
data['Last Update Posted'] = pd.to_datetime(data['Last Update Posted'])

# Exploring Null and Unique values distribution

In [None]:
def NullUnique(df):
    dic = defaultdict(list)
    for col in df.columns:
        dic['Feature'].append(col)
        dic['NumUnique'].append(len(df[col].unique()))
        dic['NumNull'].append(df[col].isnull().sum())
        dic['%Null'].append(round(df[col].isnull().sum()/df.shape[0] * 100,2))
    return pd.DataFrame(dict(dic)).sort_values(['%Null'],ascending=False).style.background_gradient()

In [None]:
NullUnique(data)

**Observations**:
- Null Observations
    - Results First Posted has 99.9% null values (only 1 non-null value)
    - Study Documents has 98.15% Null Values
    - Acronyms has 53.1% Null Values
    - Phases has 43.43% Null Values
- Unique Observations
    - URL, Rank and NCT Number have Unique values for every data point
    - Some data points share the same title and the same 'Other Ids'
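The null percentages listed above can also be computed in a vectorized way. The following is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the real dataset), and the 90% cutoff used to flag mostly-empty columns is a hypothetical choice.

```python
import pandas as pd

# Toy stand-in for the trials table; the values are made up for illustration.
toy = pd.DataFrame({
    "NCT Number": ["NCT001", "NCT002", "NCT003", "NCT004"],
    "Phases": ["Phase 2", None, "Phase 3", None],
    "Results First Posted": [None, None, None, None],
})

# isnull().mean() gives the fraction of missing values per column.
null_pct = (toy.isnull().mean() * 100).round(2)

# Hypothetical cutoff: flag columns that are more than 90% empty.
mostly_empty = null_pct[null_pct > 90].index.tolist()
print(null_pct)
print(mostly_empty)
```

Columns flagged this way are candidates to drop or to treat separately before plotting.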
    

In [None]:
# Returns category labels and counts for categories with count >= threshold;
# everything rarer is pooled into a single 'Others' entry
def popularity(col,threshold):
    idx = []
    counts = []
    other = 0
    for index,vcount in data[col].value_counts().items():
        if vcount < threshold:
            # Add the occurrences (not the number of categories) to 'Others'
            other+=vcount
            continue
        idx.append(index)
        counts.append(vcount)
    idx.append('Others')
    counts.append(other)
    return idx,counts

# Exploring Study Results

In [None]:
fig = px.pie(data,'Study Results')
fig.update_layout(title='Do we have any results to study?')
fig.show()

**Observations**:
- Only 0.038% of studies have posted results
- The remaining 99.97% of studies have no results
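Percentages like these can be read directly from `value_counts(normalize=True)` instead of the pie chart. A minimal sketch, assuming the column holds labels such as "Has Results" and "No Results Available" (the toy counts below are made up):

```python
import pandas as pd

# Toy column: 1 study with results out of 8 (illustrative values only).
results = pd.Series(
    ["No Results Available"] * 7 + ["Has Results"], name="Study Results"
)

# normalize=True returns fractions instead of raw counts.
share = (results.value_counts(normalize=True) * 100).round(2)
print(share)
```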

# Exploring the Phases across Studies
<img src="https://lupustrials.org/wp-content/uploads/2019/01/clinical-trial-phases-graphic.jpg" width=800>

In [None]:
# Count studies per phase once and reuse the result
phase_counts = data.groupby('Phases').agg('count')['Rank'].sort_values(ascending=False)
fig = go.Figure(go.Bar(
    x=phase_counts.index,
    y=phase_counts.values,
    text=phase_counts.index,
    textposition='outside',
    marker_color=phase_counts.values
))
fig.update_layout(title='Phases across Studies')
fig.show()

**Observations**:
- The most relevant takeaway is that a large number of studies (378) are in Phase 2
- 77 studies are in Phase 3
- 72 studies are in Phase 4 (post-approval studies of interventions that are already marketed)
- The largest group of studies has a *Not Applicable* phase
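Note that `groupby('Phases')` silently drops rows where the phase is missing, so the bar chart covers only studies with a recorded phase. A small sketch on made-up data shows how `value_counts(dropna=False)` keeps the missing phases as their own bucket:

```python
import pandas as pd

# Toy data; the labels mirror the dataset's style but the counts are made up.
phases = pd.Series(["Phase 2", "Not Applicable", None, "Phase 2", None, "Phase 3"])

# dropna=False keeps missing phases as a separate bucket.
phase_counts_all = phases.value_counts(dropna=False)
missing = int(phases.isnull().sum())
print(phase_counts_all)
print("missing:", missing)
```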

# Exploring Status

In [None]:
data.Status.hist()

Status: indicates the current recruitment status or the expanded access status<br>
**Observations**:
- Most studies are recruiting, meaning they are currently enrolling participants
- The next largest group is Not Yet Recruiting, meaning those studies have not yet started enrolling participants

# Most Popular Interventions

In [None]:
idx , counts = popularity('Interventions',8)
fig = go.Figure([go.Pie(labels=idx,values=counts,textinfo='label+percent')])
fig.update_layout(title='What are the top Interventions?')
fig.show()

Intervention refers to the medicinal product (e.g. drug, device, vaccine, placebo) given to the patients in a study<br>
**Observations**:
- Hydroxychloroquine is the most used intervention
- The next most common entry is *No Intervention*
- NOTE: Others includes all interventions with fewer than 8 occurrences
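The chart above treats each full Interventions string as one category, so a study combining two drugs counts as its own category. Here is a hedged sketch of counting individual interventions instead. It assumes that combined entries are separated by `|` and follow a `Type: name` layout, which should be checked against the actual CSV; `count_interventions` is a hypothetical helper and the rows are made up.

```python
from collections import Counter

# Toy rows in the style of the Interventions column. The "|" separator and
# the "Type: name" layout are assumptions about the export format.
rows = [
    "Drug: Hydroxychloroquine|Drug: Azithromycin",
    "Drug: Hydroxychloroquine",
    "Other: No intervention",
    None,  # missing value
]

def count_interventions(values, sep="|"):
    """Count each individual intervention across combined entries."""
    counter = Counter()
    for value in values:
        if not isinstance(value, str):
            continue  # skip missing entries
        counter.update(part.strip() for part in value.split(sep) if part.strip())
    return counter

intervention_counts = count_interventions(rows)
print(intervention_counts.most_common(3))
```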

# What Conditions are these Studies treating?
The disease, disorder, syndrome, illness, or injury that is being studied

In [None]:
conditions=list(data['Conditions'].dropna().unique())
fig, (ax2) = plt.subplots(1,1,figsize=[17, 10])
wordcloud2 = WordCloud(width=1000,height=400).generate(" ".join(conditions))
ax2.imshow(wordcloud2,interpolation='bilinear')
ax2.axis('off')
ax2.set_title('What Conditions are we trying to treat',fontsize=20)

**Observations**:
- The dominant keywords are COVID, Coronavirus, SARS, and CoV, indicating that most of the research is aimed at finding a cure for these diseases
- Less common conditions include Hypoxemia, Viral Pneumonia, and Pregnancy

# What Age Bracket and Gender are these Studies considering?

In [None]:
def cleanAge(age):
    if len(re.findall(r'\(.*\)',age)):
        return re.findall(r'\(.*\)',age)[0]
    return '('+age+')'

In [None]:
ageData = data.Age.apply(lambda x : cleanAge(x))
ageData.hist()

**Observations**:
- Most studies involve the (Adult, Older Adult) population
- Child-only studies are very rare
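To make the bracket extraction concrete, here is a small standalone sketch of the same regex idea. The sample strings are assumptions about how the Age column is formatted, and `extract_bracket` is a hypothetical name mirroring `cleanAge`.

```python
import re

def extract_bracket(age):
    """Return the parenthesised age group, or wrap the raw text if none is found."""
    match = re.search(r"\(.*\)", age)
    return match.group(0) if match else f"({age})"

# Assumed examples of Age values; the last one has no parentheses.
samples = [
    "18 Years and older   (Adult, Older Adult)",
    "up to 17 Years   (Child)",
    "Child, Adult",
]
brackets = [extract_bracket(s) for s in samples]
print(brackets)
```

Entries without parentheses are wrapped so that all labels end up in the same form.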

In [None]:
data['AgeBrackets'] = ageData

In [None]:
# Count studies per gender within each age bracket (rows: brackets, columns: genders)
genderCounts = data.groupby(['AgeBrackets'])['Gender'].value_counts().unstack()
i = 0
# The 3x2 grid assumes six distinct age brackets
fig = make_subplots(rows=3, cols=2, subplot_titles=list(genderCounts.index))
for row in range(1,4):
    for col in range(1,3):
        dt = genderCounts.iloc[i]
        fig.add_trace(go.Bar(x=dt.index,y=dt.values),row = row, col = col)
        i+=1
fig.show()

**Observations**:
- Most studies enroll participants of all genders
- In the (Adult) and (Child, Adult) categories, a notable number of studies enroll only female participants

# Number of Patients participating in studies

In [None]:
data.Enrollment.hist()

**Observations**:
- Most studies have between 0 and 40k participants
- Some data points go up to 10M participants
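With a tail this long, a plain histogram puts almost every study into the first bin. Quantiles give a clearer picture of the typical study size. A minimal sketch on made-up enrollment figures:

```python
import pandas as pd

# Toy enrollment figures with one extreme value (illustrative only).
enrollment = pd.Series([20, 50, 100, 150, 300, 1000, 5000, 10_000_000])

# Quantiles are far less sensitive to a single huge study than the mean.
summary = enrollment.quantile([0.5, 0.9])
median = enrollment.median()
mean = enrollment.mean()
print(summary)
print("median:", median, "mean:", mean)
```

The gap between the median and the mean is a quick indicator of how strongly a few very large studies skew the distribution.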

# Analysing Study Type
The nature of a clinical study. Study types include interventional studies (also called clinical trials), observational studies, and expanded access.

In [None]:
data['Study Type'].hist()

**Observations**:
- Most studies are either observational or interventional
- Studies rarely fall under the Expanded Access type

# Where are these studies taking place?

In [None]:
def splitLoc(loc):
    return loc.split(',')[-1].strip()

In [None]:
data['Loc'] = data.Locations.apply(lambda x:splitLoc(str(x)))

In [None]:
fig = go.Figure([go.Choropleth(
    locations=data.groupby(['Loc']).agg('count')['Rank'].index,
    z=data.groupby(['Loc']).agg('count')['Rank'].values.astype(float),
    locationmode='country names',
    colorscale='Blues',
    autocolorscale=False,
    marker_line_color='white',
    showscale = True,
)])
fig.update_layout(title='Study Locations')
fig.show()

**Observations**:
- Most Studies take place in USA (517)
- Next highest count is in France (349)
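`splitLoc` keeps only the text after the last comma, so a study with several sites is attributed to the country of its last listed site. The sketch below counts every country once per study instead. It assumes that multiple sites are separated by `|`, which should be verified against the CSV; the site names are made up and `countries_per_study` is a hypothetical helper.

```python
from collections import Counter

# Toy Locations entries; the "|" separator between sites is an assumption
# about the export format, and the site names are made up.
locations = [
    "Hospital A, Paris, France|Hospital B, Boston, Massachusetts, United States",
    "Clinic C, Lyon, France",
    None,  # study without a listed location
]

def countries_per_study(entry, sep="|"):
    """Return the set of countries named in one Locations entry."""
    if not isinstance(entry, str):
        return set()
    return {site.split(",")[-1].strip() for site in entry.split(sep) if site.strip()}

country_counts = Counter()
for entry in locations:
    country_counts.update(countries_per_study(entry))
print(country_counts)
```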

# Who Funded these Studies?

In [None]:
idx , counts = popularity('Funded Bys',0)
fig = go.Figure([go.Pie(labels=idx,values=counts,textinfo='label+percent')])
fig.update_layout(title='Who are the top Funders?')
fig.show()

**Observations**:
- Industry is the largest funder
- The NIH (National Institutes of Health) and U.S. federal agencies (U.S. Fed) have also funded many studies

# Exploring Studies Timelines

In [None]:
fig,ax = calmap.calendarplot(data.groupby(['Start Date']).Rank.count(), monthticks=1, daylabels='MTWTFSS',cmap='YlGn',
                    linewidth=0, fig_kws=dict(figsize=(20,5)))
fig.suptitle('Start Date of Studies' )
fig.colorbar(ax[0].get_children()[1], ax=ax.ravel().tolist())
fig.show()

**Observations**:
- Initial studies began in early February
- Large number of studies kicked off in May-June

## What are the target Completion Dates for studies?

In [None]:
data['Completion Date'].dt.year.hist()

**Observations**:
- Some studies mark their completion date in 2099, which seems to be an outlier
- Most studies aim to complete between 2020-2025
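Placeholder-style years such as 2099 stretch the histogram's x-axis. One option is to filter them out before plotting. A minimal sketch on made-up dates, where `MAX_YEAR` is a hypothetical cutoff:

```python
import pandas as pd

# Toy completion dates, including a placeholder-style year (illustrative only).
completion = pd.to_datetime(
    pd.Series(["2020-12-31", "2022-06-30", "2099-12-31", None])
)

# Hypothetical cutoff year; missing dates are dropped by the comparison.
MAX_YEAR = 2030
plausible = completion[completion.dt.year <= MAX_YEAR]
print(plausible.dt.year.value_counts().sort_index())
```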

**Final Summary:**
- Among ongoing studies with a numbered phase, Phase 2 is the most common, with only 2.77% of studies in Phase 4
- Most studies are recruiting, meaning they are currently enrolling participants
- Hydroxychloroquine is the most used intervention
- Most studies involve the (Adult, Older Adult) population, with very few studies focused on children
- Most studies have between 0 and 40k participants
- Most studies are either observational or interventional
- Most studies take place in the USA (517) and France (349)
- Industry is the largest funder of these clinical trials; some studies are funded by the NIH and U.S. federal agencies
- Most studies aim to be completed between 2020 and 2025