<a href="https://colab.research.google.com/github/stogaja/clinical-trials/blob/main/clinical_trials.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CLINICAL TRIALS**

### 1. a) Defining the Question

> What are the key trends and patterns observed in clinical studies related to Cancer, malaria, Covid-19, HIV, Heart Conditions, and pneumonia, with a specific focus on studies conducted in Kenya?

### b) Defining the metric of success

1. **Data Completeness:** Ensure that the datasets for clinical studies related to Cancer, malaria, Covid-19, HIV, Heart Conditions, and pneumonia are comprehensive and contain a substantial amount of relevant information.

2. **Trend Identification:** Successfully identify and analyze trends within the clinical studies, such as the frequency of studies, common study locations, prevalent health conditions, and emerging areas of research.

3. **Geographical Coverage:** Assess the extent of geographical coverage in the clinical studies, evaluating how many different states and countries are involved. This metric's success would be determined by a wide and diverse coverage.

4. **Conclusions and Recommendations:** Produce meaningful and actionable conclusions and recommendations derived from the analysis, providing valuable insights for stakeholders and researchers.

5. **Relevance to Leading Causes of Death:** Ensure that the selected clinical studies are highly relevant to addressing the leading causes of death in Kenya, which include Cancer, malaria, Covid-19, HIV, Heart Conditions, and pneumonia.

6. **Use of Appropriate Tools and Software:** Utilize suitable software or tools proficiently for data analysis. Success in this metric would involve employing the chosen resources effectively.

7. **Documentation and References:** Maintain proper documentation of the analysis process, including references to external sources and datasets. This ensures transparency and reproducibility, which are critical indicators of success in data analysis.

### c) Understanding the context

> The context of the study revolves around the clinical trials database, accessible through https://clinicaltrials.gov/. This database encompasses a comprehensive collection of information regarding ongoing, upcoming, and past clinical research studies. These studies span across all 50 states in the United States and extend to over 200 countries globally.

> For this particular study, data has been extracted from clinical trials focusing on major health conditions including Cancer, malaria, Covid-19, HIV, Heart Conditions, and pneumonia. These conditions are significant contributors to mortality rates in Kenya.

> The objective of this study is to conduct a thorough analysis of the clinical study data. This includes investigating trends within the studies, examining geographical coverage, and ensuring data completeness. Additionally, the study aims to draw meaningful conclusions and recommendations from the analysis.

> The success of this study will be measured by the ability to effectively analyze the data, identify trends, and produce valuable insights relevant to the leading causes of death in Kenya. Additionally, the study's documentation and transparency in the analytical process will be crucial indicators of its success.

### d) Recording the experimental design

1. Data sourcing/loading
2. Data Understanding
3. Data Relevance
4. External Dataset Validation
5. Data Preparation
6. Univariate Analysis
7. Bivariate Analysis
8. Multivariate Analysis
9. Implementing the solution
10. Challenging the solution
11. Conclusion
12. Follow up questions

### e) Data relevance

> The data should have variables that adequately contribute to effective analysis of clinical trials.

> The dataset should lead to a high model fit (high accuracy) after all possible model optimization procedures have been applied.


# 2. Data Understanding

In [22]:
# let's import the libraries we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder, Normalizer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import r2_score, f1_score, precision_score, recall_score, classification_report, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB, CategoricalNB
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import export_graphviz
from six import StringIO
from IPython.display import Image
import pydotplus
import os

# filter to ignore warnings
import warnings
warnings.filterwarnings('ignore')


### a) reading the data

In [23]:
# let's mount our Google Drive folder
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
# let's create a function to read the data
def read_data(data):
    df = pd.read_csv(data)
    return df

In [24]:
# let's define the directory path containing the CSV files
directory_path = '/content/drive/MyDrive/DTE/'

# get a list of all CSV files in the directory
csv_files = [file for file in os.listdir(directory_path) if file.endswith('.csv')]

# initialize an empty dictionary to store the DataFrames
data_frames = {}

# iterate through the CSV files and read them into DataFrames
for file_name in csv_files:
    # generate a key for the DataFrame from the file name (excluding the extension)
    key = os.path.splitext(file_name)[0]

    # read the CSV file into a DataFrame
    file_path = os.path.join(directory_path, file_name)
    data_frames[key] = pd.read_csv(file_path)

# optionally, print the first few rows of each DataFrame
#for key, df in data_frames.items():
#    print(f"==============DataFrame=========== '{key}':")
#    print(df.head())
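
The loading loop above could also be wrapped in a small reusable helper. The following is only a sketch: `load_csv_folder` is a hypothetical name that does not appear in the notebook, it uses `pathlib` instead of `os`, and the demonstration writes two tiny made-up CSV files to a temporary folder rather than reading the Drive directory.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv_folder(folder):
    """Read every .csv file in `folder` into a dict keyed by file name without extension."""
    frames = {}
    for path in sorted(Path(folder).glob("*.csv")):
        frames[path.stem] = pd.read_csv(path)
    return frames


# Demonstration on a throwaway folder with two tiny, made-up CSV files
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"NCT Number": ["NCT-A"]}).to_csv(Path(tmp) / "a-studies.csv", index=False)
    pd.DataFrame({"NCT Number": ["NCT-B", "NCT-C"]}).to_csv(Path(tmp) / "b-studies.csv", index=False)
    demo_frames = load_csv_folder(tmp)

# Print the shape of each loaded DataFrame
print({name: df.shape for name, df in demo_frames.items()})
```

Sorting the paths keeps the dictionary order stable between runs, which `os.listdir` does not guarantee.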

### b) Checking the data

Number of records in the datasets

In [31]:
# let's see the number of rows and columns in each dataset using a for loop
for name, df in data_frames.items():
    print(f'{name}: rows = {df.shape[0]} and columns = {df.shape[1]}')


Copy of Cancer-studies: rows = 100262 and columns = 30
Copy of Covid 19-studies: rows = 9342 and columns = 30
Copy of Heart: rows = 32176 and columns = 30
Copy of HIV-studies: rows = 9113 and columns = 30
Copy of Malaria-studies: rows = 1330 and columns = 30
Copy of Pneumonia-studies: rows = 9640 and columns = 30


> The datasets are large, ranging from 1,330 rows (malaria) to 100,262 rows (cancer), each with 30 columns, which gives the analysis a substantial sample to draw insights from.

Top view of datasets

In [35]:
# getting individual csv files from drive and printing the top view of the datasets
# for cancer dataset
cancer = data_frames['Copy of Cancer-studies']
cancer.head(3)

Unnamed: 0,NCT Number,Study Title,Study URL,Acronym,Study Status,Brief Summary,Study Results,Conditions,Interventions,Primary Outcome Measures,...,Study Design,Other IDs,Start Date,Primary Completion Date,Completion Date,First Posted,Results First Posted,Last Update Posted,Locations,Study Documents
0,NCT02426125,A Study of Ramucirumab (LY3009806) Plus Doceta...,https://clinicaltrials.gov/study/NCT02426125,RANGE,COMPLETED,The main purpose of this study is to evaluate ...,YES,Urothelial Carcinoma,DRUG: Ramucirumab|DRUG: Docetaxel|DRUG: Placebo,"Progression Free Survival (PFS), PFS defined a...",...,Allocation: RANDOMIZED|Intervention Model: PAR...,15679|I4T-MC-JVDC|2014-003655-66,2015-07-13,2017-04-21,2022-07-26,2015-04-24,2019-01-25,2023-08-21,"Highlands Oncology Group, Fayetteville, Arkans...",Study Protocol|Statistical Analysis Plan
1,NCT04910425,PSMA-Targeted 18F-DCFPyL PET/MRI for the Detec...,https://clinicaltrials.gov/study/NCT04910425,,NOT_YET_RECRUITING,This phase II trial studies how well 18F-DCFPy...,NO,Prostate Carcinoma,DRUG: Fluorine F 18 DCFPyL|DRUG: Gadobenate Di...,18-F-DCFPyL positron emission tomography (PET)...,...,Allocation: NA|Intervention Model: SINGLE_GROU...,NU 19U05|NCI-2021-05593|STU00212326|NU 19U05|P...,2023-06-17,2026-06-17,2028-07,2021-06-02,,2022-08-03,"Northwestern University, Chicago, Illinois, 60...",
2,NCT04116125,Omitting Biopsy of SEntinel Lymph Node With Ra...,https://clinicaltrials.gov/study/NCT04116125,OBSERB,NOT_YET_RECRUITING,"The OBSERB study is a multi-center, non-blinde...",NO,Breast Neoplasm Female|Lymphatic Metastasis,PROCEDURE: Radiotherapy,"Disease-free survival, Disease free survival i...",...,Allocation: RANDOMIZED|Intervention Model: SIN...,2019-09-023,2020-07-01,2023-06-30,2025-06-30,2019-10-04,,2019-10-04,,


In [36]:
# for covid dataset
covid = data_frames['Copy of Covid 19-studies']
covid.head(3)

Unnamed: 0,NCT Number,Study Title,Study URL,Acronym,Study Status,Brief Summary,Study Results,Conditions,Interventions,Primary Outcome Measures,...,Study Design,Other IDs,Start Date,Primary Completion Date,Completion Date,First Posted,Results First Posted,Last Update Posted,Locations,Study Documents
0,NCT04614025,Open-label Multicenter Study to Evaluate the E...,https://clinicaltrials.gov/study/NCT04614025,,ACTIVE_NOT_RECRUITING,This clinical trial will examine if a new trea...,NO,COVID|ARDS,BIOLOGICAL: PLX-PAD,"Number of ventilator-free days, 28 days",...,Allocation: RANDOMIZED|Intervention Model: PAR...,PLX-COV-03,2020-10-19,2021-08-04,2023-01,2020-11-03,,2022-12-21,"Charite Campus Virchow, Berlin, 10117, Germany...",
1,NCT04646525,The Relationship Between Covid-19 Infection in...,https://clinicaltrials.gov/study/NCT04646525,,UNKNOWN,We aimed to find out whether the tonsils and n...,NO,Covid19|Immune Deficiency|Tonsillitis|Tonsil H...,DIAGNOSTIC_TEST: Physical examination,The primary outcome of our study was the evalu...,...,Observational Model: |Time Perspective: p,2020/480,2020-10-01,2021-02-01,2021-02-01,2020-11-30,,2020-11-30,"Selcuk University, Konya, Selcuklu, 42100, Turkey",
2,NCT04333225,Hydroxychloroquine in the Prevention of COVID-...,https://clinicaltrials.gov/study/NCT04333225,,COMPLETED,In order to assess the efficacy of hydroxychlo...,YES,COVID-19,DRUG: Hydroxychloroquine,Number of Participants Infected With COVID-19 ...,...,Allocation: NON_RANDOMIZED|Intervention Model:...,020-132,2020-04-03,2020-06-30,2020-06-30,2020-04-03,2021-08-02,2021-08-20,"Baylor University Medical Center, Dallas, Texa...",Study Protocol|Statistical Analysis Plan


In [37]:
# for heart dataset
heart = data_frames['Copy of Heart']
heart.head(3)

Unnamed: 0,NCT Number,Study Title,Study URL,Acronym,Study Status,Brief Summary,Study Results,Conditions,Interventions,Primary Outcome Measures,...,Study Design,Other IDs,Start Date,Primary Completion Date,Completion Date,First Posted,Results First Posted,Last Update Posted,Locations,Study Documents
0,NCT00291525,Randomized On-X Anticoagulation Trial,https://clinicaltrials.gov/study/NCT00291525,PROACT,ACTIVE_NOT_RECRUITING,Various patient groups with the On-X Valve can...,NO,Heart Valve Disease,DEVICE: On-X valve using reduced anticoagulati...,"Thromboembolism, Rate of thromboembolism evalu...",...,Allocation: RANDOMIZED|Intervention Model: PAR...,2005-01|G050208,2006-06-06,2022-12-31,2022-12-31,2006-02-14,,2022-09-14,"Tucson Medical Center, Tucson, Arizona, 85718,...",
1,NCT03419325,A Genomic Approach for Clopidogrel in Caribbea...,https://clinicaltrials.gov/study/NCT03419325,,ACTIVE_NOT_RECRUITING,Clopidogrel is a prescription medicine used to...,NO,Cardiovascular Disease (CVD)|Stroke|Acute Coro...,GENETIC: CYP2C19 test|DIAGNOSTIC_TEST: P2RY12 ...,Major adverse cardiovascular events (MACE) red...,...,Allocation: NON_RANDOMIZED|Intervention Model:...,A4070417|2U54MD007600-31,2020-09-01,2023-04-30,2023-12-30,2018-02-01,,2023-05-30,"University Hospital at Carolina, Carolina, 009...",
2,NCT03176225,Evaluate Safety and Effectiveness of XenoSure ...,https://clinicaltrials.gov/study/NCT03176225,,RECRUITING,The purpose of this clinical trial is to colle...,NO,Heart Diseases,PROCEDURE: Open heart surgery to address the h...,Leakage rate at 6 month post-procedure measure...,...,Allocation: RANDOMIZED|Intervention Model: PAR...,P15077-1,2017-08-15,2025-11-15,2026-02-15,2017-06-05,,2023-03-10,"Chinese PLA General Hospital, Beijing, Beijing...",


In [38]:
# for HIV dataset
hiv = data_frames['Copy of HIV-studies']
hiv.head(3)

Unnamed: 0,NCT Number,Study Title,Study URL,Acronym,Study Status,Brief Summary,Study Results,Conditions,Interventions,Primary Outcome Measures,...,Study Design,Other IDs,Start Date,Primary Completion Date,Completion Date,First Posted,Results First Posted,Last Update Posted,Locations,Study Documents
0,NCT00914225,Effect of Bednets and a Water Purification Dev...,https://clinicaltrials.gov/study/NCT00914225,ITN,COMPLETED,In many areas of the world most severely affec...,NO,HIV Infections|Human Immunodeficiency Virus|Ma...,OTHER: Bednets and Water Purification,To determine the effect of LLIN and a simple m...,...,Observational Model: |Time Perspective: p,35464-B|SSC#1554,2009-09,2011-12,2011-12,2009-06-04,,2015-05-29,"Kisii Provincial Hospital, Kisii, Kenya|Kisumu...",
1,NCT02167425,Study of Integrating Antiretroviral Therapy Wi...,https://clinicaltrials.gov/study/NCT02167425,IMAT,UNKNOWN,To improve ART initiation among people who inj...,NO,HIV|Opioid Dependence,OTHER: IMAT,"Time to CD4 Screening, Number of days between ...",...,Observational Model: |Time Perspective: p,IMAT-01|R34DA037787,2015-02,2017-03,2017-03,2014-06-19,,2015-11-23,,
2,NCT01423825,Evaluating the Safety and Immune Response to a...,https://clinicaltrials.gov/study/NCT01423825,,COMPLETED,This is an extension of the HVTN 073/SAAVI 102...,NO,HIV Infections,BIOLOGICAL: Sub C gp140 Vaccine|BIOLOGICAL: MF...,"Safety data, including signs and symptoms of l...",...,Allocation: RANDOMIZED|Intervention Model: PAR...,HVTN 073E/SAAVI 102|11824,2011-08,2013-07,2013-07,2011-08-26,,2021-10-14,Brigham and Women's Hospital Vaccine CRS (BWH ...,


In [39]:
# for malaria dataset
malaria = data_frames['Copy of Malaria-studies']
malaria.head(3)

Unnamed: 0,NCT Number,Study Title,Study URL,Acronym,Study Status,Brief Summary,Study Results,Conditions,Interventions,Primary Outcome Measures,...,Study Design,Other IDs,Start Date,Primary Completion Date,Completion Date,First Posted,Results First Posted,Last Update Posted,Locations,Study Documents
0,NCT01976325,Evaluating the Ottawa Malaria Decision Aid,https://clinicaltrials.gov/study/NCT01976325,OMDA,UNKNOWN,BRIEF SUMMARY\n\nCanadians often visit areas w...,NO,Malaria,OTHER: Ottawa Malaria Decision Aid,"Travellers' Knowledge Score, The traveller's k...",...,Allocation: RANDOMIZED|Intervention Model: PAR...,2010462-01H,2014-01,2015-11,2015-11,2013-11-05,,2015-04-16,National Capital Region Occupational Health Cl...,
1,NCT00914225,Effect of Bednets and a Water Purification Dev...,https://clinicaltrials.gov/study/NCT00914225,ITN,COMPLETED,In many areas of the world most severely affec...,NO,HIV Infections|Human Immunodeficiency Virus|Ma...,OTHER: Bednets and Water Purification,To determine the effect of LLIN and a simple m...,...,Observational Model: |Time Perspective: p,35464-B|SSC#1554,2009-09,2011-12,2011-12,2009-06-04,,2015-05-29,"Kisii Provincial Hospital, Kisii, Kenya|Kisumu...",
2,NCT05605925,Ivermectin-artemisinin Combination Therapy for...,https://clinicaltrials.gov/study/NCT05605925,IVIME,RECRUITING,Malaria remains a leading cause of morbidity a...,NO,Malaria,DRUG: Artemether/lumefantrine|DRUG: Ivermectin,"Malaria transmission rates in a household, Mal...",...,Allocation: RANDOMIZED|Intervention Model: PAR...,MAKSHSREC-2021-237,2022-08-04,2022-12-31,2022-12-31,2022-11-04,,2022-11-04,"ST. Paul's Health Center, Kasese, Uganda",


In [40]:
# for pneumonia dataset
pneumonia = data_frames['Copy of Pneumonia-studies']
pneumonia.head(3)

Unnamed: 0,NCT Number,Study Title,Study URL,Acronym,Study Status,Brief Summary,Study Results,Conditions,Interventions,Primary Outcome Measures,...,Study Design,Other IDs,Start Date,Primary Completion Date,Completion Date,First Posted,Results First Posted,Last Update Posted,Locations,Study Documents
0,NCT02708225,The Influence of Medical Clowns on the Perform...,https://clinicaltrials.gov/study/NCT02708225,clowns,UNKNOWN,Medical clowns are known to assist in relaxing...,NO,Asthma|Pneumonia,BEHAVIORAL: medical clown,"length of experium (seconds), length of experi...",...,Allocation: RANDOMIZED|Intervention Model: PAR...,0026-15,2016-03,2016-10,2017-10,2016-03-15,,2016-03-15,,
1,NCT03962725,Avoiding Neuromuscular Blockers to Reduce Comp...,https://clinicaltrials.gov/study/NCT03962725,,TERMINATED,The goal of this study to evaluate whether eli...,NO,Respiratory Failure|Respiratory Infection|Aspi...,DRUG: Neuromuscular Blocking Agents|DRUG: Anes...,Number of participants who either had postoper...,...,Allocation: RANDOMIZED|Intervention Model: PAR...,2019P000260,2019-08-07,2022-12-19,2022-12-19,2019-05-24,,2023-02-02,"Massachusetts General Hospital, Boston, Massac...",
2,NCT04646525,The Relationship Between Covid-19 Infection in...,https://clinicaltrials.gov/study/NCT04646525,,UNKNOWN,We aimed to find out whether the tonsils and n...,NO,Covid19|Immune Deficiency|Tonsillitis|Tonsil H...,DIAGNOSTIC_TEST: Physical examination,The primary outcome of our study was the evalu...,...,Observational Model: |Time Perspective: p,2020/480,2020-10-01,2021-02-01,2021-02-01,2020-11-30,,2020-11-30,"Selcuk University, Konya, Selcuklu, 42100, Turkey",


The datasets have missing values, the column names need reformatting, and the null values need to be handled. First, though, we append the datasets, since they have the **same structure and we want to combine their rows.**
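
Before appending, it may be worth confirming programmatically that the column names really do match, rather than relying on the previews alone. A minimal sketch, assuming a dictionary of DataFrames like `data_frames`; the helper name `columns_match` and the toy frames are hypothetical and only stand in for the real datasets.

```python
import pandas as pd


def columns_match(frames):
    """Return True if every DataFrame in `frames` has the same column names in the same order."""
    column_sets = [tuple(df.columns) for df in frames.values()]
    return all(cols == column_sets[0] for cols in column_sets)


# Toy frames standing in for the real datasets
toy_frames = {
    "cancer": pd.DataFrame({"NCT Number": ["NCT-A"], "Study Status": ["COMPLETED"]}),
    "malaria": pd.DataFrame({"NCT Number": ["NCT-B"], "Study Status": ["RECRUITING"]}),
}
same_structure = columns_match(toy_frames)

# Adding a frame with a missing column makes the check fail
toy_frames["odd"] = pd.DataFrame({"NCT Number": ["NCT-C"]})
mismatch = columns_match(toy_frames)

print(same_structure, mismatch)
```

If the check fails, `pd.concat` still runs but fills the unmatched columns with `NaN`, which would inflate the missing-value counts later on.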

In [46]:
# let's append the datasets to make analysis easier
# concatenate the DataFrames; 'ignore_index=True' resets the index after concatenation

main_df = pd.concat([cancer, covid, heart, hiv, malaria, pneumonia], ignore_index=True)

# let's see the top view of main_df
main_df.head()


Unnamed: 0,NCT Number,Study Title,Study URL,Acronym,Study Status,Brief Summary,Study Results,Conditions,Interventions,Primary Outcome Measures,...,Study Design,Other IDs,Start Date,Primary Completion Date,Completion Date,First Posted,Results First Posted,Last Update Posted,Locations,Study Documents
0,NCT02426125,A Study of Ramucirumab (LY3009806) Plus Doceta...,https://clinicaltrials.gov/study/NCT02426125,RANGE,COMPLETED,The main purpose of this study is to evaluate ...,YES,Urothelial Carcinoma,DRUG: Ramucirumab|DRUG: Docetaxel|DRUG: Placebo,"Progression Free Survival (PFS), PFS defined a...",...,Allocation: RANDOMIZED|Intervention Model: PAR...,15679|I4T-MC-JVDC|2014-003655-66,2015-07-13,2017-04-21,2022-07-26,2015-04-24,2019-01-25,2023-08-21,"Highlands Oncology Group, Fayetteville, Arkans...",Study Protocol|Statistical Analysis Plan
1,NCT04910425,PSMA-Targeted 18F-DCFPyL PET/MRI for the Detec...,https://clinicaltrials.gov/study/NCT04910425,,NOT_YET_RECRUITING,This phase II trial studies how well 18F-DCFPy...,NO,Prostate Carcinoma,DRUG: Fluorine F 18 DCFPyL|DRUG: Gadobenate Di...,18-F-DCFPyL positron emission tomography (PET)...,...,Allocation: NA|Intervention Model: SINGLE_GROU...,NU 19U05|NCI-2021-05593|STU00212326|NU 19U05|P...,2023-06-17,2026-06-17,2028-07,2021-06-02,,2022-08-03,"Northwestern University, Chicago, Illinois, 60...",
2,NCT04116125,Omitting Biopsy of SEntinel Lymph Node With Ra...,https://clinicaltrials.gov/study/NCT04116125,OBSERB,NOT_YET_RECRUITING,"The OBSERB study is a multi-center, non-blinde...",NO,Breast Neoplasm Female|Lymphatic Metastasis,PROCEDURE: Radiotherapy,"Disease-free survival, Disease free survival i...",...,Allocation: RANDOMIZED|Intervention Model: SIN...,2019-09-023,2020-07-01,2023-06-30,2025-06-30,2019-10-04,,2019-10-04,,
3,NCT03566225,Pioglitazone Versus Metformin as First Treatme...,https://clinicaltrials.gov/study/NCT03566225,,COMPLETED,Participants with PCOS will be divided into tw...,NO,Pioglitazone,DRUG: Pioglitazone|DRUG: Metformin|DRUG: Clomi...,"Clinical pregnancy rate, Pregnancy rate diagno...",...,Allocation: RANDOMIZED|Intervention Model: SIN...,AinShamaU,2018-01-30,2021-02-28,2021-03-30,2018-06-25,,2021-06-02,"Ain Shams Univerisity, Cairo, Egypt",
4,NCT01756625,"PREMIUM, Observational Study",https://clinicaltrials.gov/study/NCT01756625,,UNKNOWN,PREMIUM is an observational pharmaco-epidemiol...,NO,First Line WT KRAS mCRC,,To compare PFS rate at 1 year with PFS in clin...,...,Observational Model: |Time Perspective: p,PREMIUM,2010-01,2012-03,2013-06,2012-12-27,,2012-12-27,"Institut Sainte-Catherine, Avignon, Vaucluse, ...",
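
Once the rows are combined, the original dataset each row came from is no longer visible. One possible way to keep it, sketched here on toy data, is to add a label column before concatenating. The column name `source` and the toy frames are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy frames standing in for the per-condition datasets
toy_frames = {
    "cancer": pd.DataFrame({"NCT Number": ["NCT-A", "NCT-B"]}),
    "malaria": pd.DataFrame({"NCT Number": ["NCT-C"]}),
}

# Tag each frame with the name of the dataset it came from, then stack the rows
labelled = [df.assign(source=name) for name, df in toy_frames.items()]
combined = pd.concat(labelled, ignore_index=True)

print(combined)
```

With a label like this, later steps can still group or filter by condition after the datasets have been merged.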


### c) Checking the Datatypes

In [47]:
# let's see the data types of main_df
main_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161863 entries, 0 to 161862
Data columns (total 30 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   NCT Number                  161863 non-null  object 
 1   Study Title                 161863 non-null  object 
 2   Study URL                   161863 non-null  object 
 3   Acronym                     48532 non-null   object 
 4   Study Status                161863 non-null  object 
 5   Brief Summary               161863 non-null  object 
 6   Study Results               161863 non-null  object 
 7   Conditions                  161857 non-null  object 
 8   Interventions               144610 non-null  object 
 9   Primary Outcome Measures    153676 non-null  object 
 10  Secondary Outcome Measures  119675 non-null  object 
 11  Other Outcome Measures      13107 non-null   object 
 12  Sponsor                     161863 non-null  object 
 13  Collaborators 

> We have 30 columns in total, many of them with a lot of null values, which we will handle after further analysis. The column names will be edited for easier reference, and some of the data types need to be corrected.
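
To put numbers on the null values, the share of missing entries per column can be computed directly. For instance, the `info()` output shows 48,532 non-null `Acronym` values out of 161,863 rows, so roughly 70% of that column is missing. The sketch below shows the calculation on a small toy frame; the values are made up and only the column names are taken from the dataset.

```python
import pandas as pd

# Toy frame with made-up values; column names borrowed from the dataset
toy = pd.DataFrame({
    "NCT Number": ["NCT-A", "NCT-B", "NCT-C", "NCT-D"],
    "Acronym": ["RANGE", None, None, None],
    "Interventions": ["DRUG: A", "DRUG: B", None, "OTHER: C"],
})

# isna() marks missing cells; the column-wise mean of that mask is the missing fraction
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)

print(missing_pct)
```

Applied to `main_df`, the same expression would rank all 30 columns by how incomplete they are, which can guide which columns to drop and which to impute.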

# 3. External Data Validation

> After cross-referencing the information gathered from clinical trials with authoritative sources such as medical journals and regulatory bodies, we can confirm the validity of the data and identify any discrepancies or errors that may have occurred during data collection. On that basis, we conclude that the dataset is credible and can be used for the research question, as it falls within its scope.

# 4. Data Preparation

### a) Consistency

Here we check for duplicates in the combined dataset before dropping any columns.

In [48]:
# check whether any fully duplicated rows exist
main_df.duplicated().any()

True

In [49]:
# percentage of rows that are exact duplicates of an earlier row
percentage_duplicates = (main_df.duplicated().sum() / len(main_df)) * 100
percentage_duplicates

5.995811272495876
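
Some of these duplicates are to be expected, because one study can be listed under more than one condition. In the previews above, NCT00914225 appears in both the HIV and malaria datasets, and NCT04646525 appears in both the Covid-19 and pneumonia datasets. The sketch below shows, on toy data with placeholder IDs and made-up values, how such rows could be counted and removed. Dropping duplicates at this point is an assumption about the next step, not something the notebook has done yet.

```python
import pandas as pd

# Toy frame: the study "NCT-A" was picked up by two condition searches
toy = pd.DataFrame({
    "NCT Number": ["NCT-A", "NCT-B", "NCT-A", "NCT-C"],
    "Conditions": ["HIV|Malaria", "HIV", "HIV|Malaria", "Malaria"],
})

# Rows that repeat an earlier row exactly, as a percentage of all rows
dup_pct = toy.duplicated().mean() * 100

# Studies whose identifier appears more than once
repeated_ids = toy["NCT Number"].duplicated().sum()

# Keep the first occurrence of each duplicated row and renumber the index
deduped = toy.drop_duplicates(ignore_index=True)

print(dup_pct, repeated_ids, len(deduped))
```

Comparing the full-row count with the count by `NCT Number` would also reveal studies that share an identifier but differ in some other column.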