<a href="https://colab.research.google.com/github/stogaja/clinical-trials/blob/main/clinical_trials.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CLINICAL TRIALS**

### 1. a) Defining the Question

> What are the key trends and patterns observed in clinical studies related to Cancer, malaria, Covid-19, HIV, Heart Conditions, and pneumonia, with a specific focus on studies conducted in Kenya?

### b) Defining the metric of success

1. **Data Completeness:** Ensure that the datasets for clinical studies related to Cancer, malaria, Covid-19, HIV, Heart Conditions, and pneumonia are comprehensive and contain a substantial amount of relevant information.

2. **Trend Identification:** Successfully identify and analyze trends within the clinical studies, such as the frequency of studies, common study locations, prevalent health conditions, and emerging areas of research.

3. **Geographical Coverage:** Assess the geographical coverage of the clinical studies by counting how many different states and countries are involved. Wide and diverse coverage indicates success on this metric.

4. **Conclusions and Recommendations:** Produce meaningful and actionable conclusions and recommendations derived from the analysis, providing valuable insights for stakeholders and researchers.

5. **Relevance to Leading Causes of Death:** Ensure that the selected clinical studies are highly relevant to addressing the leading causes of death in Kenya, which include Cancer, malaria, Covid-19, HIV, Heart Conditions, and pneumonia.

6. **Use of Appropriate Tools and Software:** Use suitable software and tools for the data analysis, and apply them effectively.

7. **Documentation and References:** Maintain proper documentation of the analysis process, including references to external sources and datasets. This ensures transparency and reproducibility, which are critical indicators of success in data analysis.
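
Metrics 1 and 3 above can be turned into numbers. The following is a minimal sketch on a toy DataFrame, not the notebook's actual method: the column names (`NCT Number`, `Conditions`, `Locations`) and the `|`-separated "site, city, country" layout of `Locations` are assumptions made for illustration and should be checked against the real exports.

```python
import pandas as pd

# Toy stand-in for one study export. The column names and the
# "site, city, country|site, city, country" layout are assumptions.
studies = pd.DataFrame({
    "NCT Number": ["NCT001", "NCT002", "NCT003"],
    "Conditions": ["Malaria", None, "HIV"],
    "Locations": [
        "Site A, Nairobi, Kenya|Site B, Kampala, Uganda",
        "Site C, Kisumu, Kenya",
        None,
    ],
})

# Metric 1 (completeness): share of non-missing values per column,
# where 1.0 means the column has no missing entries.
completeness = studies.notna().mean()

# Metric 3 (geographical coverage): split each row into sites, then keep
# the text after the last comma of every site as the country.
countries = (
    studies["Locations"]
    .dropna()
    .str.split("|")
    .explode()
    .str.rsplit(",", n=1)
    .str[-1]
    .str.strip()
)
n_countries = countries.nunique()

print(completeness)
print(f"Distinct countries: {n_countries}")
```

Taking the last comma-separated token as the country is a heuristic; site names that do not end with a country would need separate handling.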

### c) Understanding the context

> The context of the study revolves around the clinical trials database, accessible through https://clinicaltrials.gov/. This database holds a comprehensive collection of information on ongoing, upcoming, and past clinical research studies. These studies span all 50 states in the United States and extend to over 200 countries globally.

> For this particular study, data has been extracted from clinical trials focusing on major health conditions including Cancer, malaria, Covid-19, HIV, Heart Conditions, and pneumonia. These conditions are significant contributors to mortality rates in Kenya.

> The objective of this study is to conduct a thorough analysis of the clinical study data. This includes investigating trends within the studies, examining geographical coverage, and ensuring data completeness. Additionally, the study aims to draw meaningful conclusions and recommendations from the analysis.

> The success of this study will be measured by the ability to effectively analyze the data, identify trends, and produce valuable insights relevant to the leading causes of death in Kenya. Additionally, the study's documentation and transparency in the analytical process will be crucial indicators of its success.

### d) Recording the experimental design

1. Data sourcing/loading
2. Data Understanding
3. Data Relevance
4. External Dataset Validation
5. Data Preparation
6. Univariate Analysis
7. Bivariate Analysis
8. Multivariate Analysis
9. Implementing the solution
10. Challenging the solution
11. Conclusion
12. Follow up questions

### e) Data relevance

> The data provided for this study is highly relevant and valuable. It encompasses clinical trial information from a diverse range of studies, covering major health conditions such as Cancer, malaria, Covid-19, HIV, Heart Conditions, and pneumonia. These conditions represent significant health challenges globally, particularly in Kenya.

> Furthermore, the data is sourced from the clinical trials database (https://clinicaltrials.gov/), which is a reputable and comprehensive platform. It aggregates studies conducted across all 50 states in the U.S. and over 200 countries worldwide. This extensive reach ensures that the data captures a broad spectrum of medical research and provides a comprehensive view of clinical studies.

> In conclusion, the data's relevance is underscored by its focus on critical health issues and its extensive coverage, making it a robust foundation for conducting meaningful analysis and deriving valuable insights for the study.

# 2. Data Understanding

In [22]:
# lets import the libraries we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder, Normalizer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import r2_score, f1_score, precision_score, recall_score, classification_report, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB, CategoricalNB
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import export_graphviz
from six import StringIO
from IPython.display import Image
import pydotplus
import os

# filter to ignore warnings
import warnings
warnings.filterwarnings('ignore')


In [23]:
# let's mount our Google Drive folder
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
# let's create a function to read the data
def read_data(data):
    """Read the CSV file at path `data` into a DataFrame."""
    return pd.read_csv(data)

In [24]:
# lets define the directory path containing your CSV files
directory_path = '/content/drive/MyDrive/DTE/'

# lets get a list of all CSV files in the directory
csv_files = [file for file in os.listdir(directory_path) if file.endswith('.csv')]

# lets initialize an empty dictionary to store DataFrames
data_frames = {}

# lets iterate through the CSV files and read them into DataFrames
for file_name in csv_files:
    # here we generate a key for the DataFrame using the file name (excluding extension)
    key = os.path.splitext(file_name)[0]

    # finally read the CSV file into a DataFrame
    file_path = os.path.join(directory_path, file_name)
    data_frames[key] = pd.read_csv(file_path)

# let's print the first few rows of each DataFrame
# for key, df in data_frames.items():
#     print(f"==============DataFrame=========== '{key}':")
#     print(df.head())

In [29]:
# getting individual csv files from drive
cancer = data_frames['Copy of Cancer-studies']
cancer.head()
cancer.shape

(100262, 30)
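
Because the study question focuses on trials conducted in Kenya, a loaded frame such as `cancer` would eventually need to be narrowed to Kenyan sites. The sketch below shows one possible way to do that on toy data; the `Locations` column name is an assumption about the export layout, and `kenya_studies` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for a loaded frame; "Locations" is an assumed column name.
toy_studies = pd.DataFrame({
    "NCT Number": ["NCT001", "NCT002", "NCT003", "NCT004"],
    "Locations": [
        "Site A, Nairobi, Kenya",
        "Site B, Boston, Massachusetts, United States",
        None,
        "Site C, Kampala, Uganda|Site D, Kisumu, Kenya",
    ],
})

# Case-insensitive substring match; na=False treats missing locations
# as "not in Kenya" instead of propagating NaN into the mask.
in_kenya = toy_studies["Locations"].str.contains("Kenya", case=False, na=False)
kenya_studies = toy_studies[in_kenya]

print(kenya_studies["NCT Number"].tolist())
```

A plain substring match keeps multi-site trials that include at least one Kenyan site, which seems to fit the question, but it would also match any site whose name merely contains the word "Kenya".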

In [14]:
# reading the data from the drive host
cancer = read_data('/content/drive/MyDrive/DTE/Copy of Cancer-studies.csv')
covid = read_data('/content/drive/MyDrive/DTE/Copy of Covid 19-studies.csv')
heart = read_data('/content/drive/MyDrive/DTE/Copy of Heart.csv')
hiv = read_data('/content/drive/MyDrive/DTE/Copy of HIV-studies.csv')
malaria = read_data('/content/drive/MyDrive/DTE/Copy of Malaria-studies.csv')
pneumonia = read_data('/content/drive/MyDrive/DTE/Copy of Pneumonia-studies.csv')


In [19]:
# lets define a list of the csv file names
file_names = [
   '/content/drive/MyDrive/DTE/Copy of Cancer-studies.csv',
   '/content/drive/MyDrive/DTE/Copy of Covid 19-studies.csv',
   '/content/drive/MyDrive/DTE/Copy of Heart.csv',
   '/content/drive/MyDrive/DTE/Copy of HIV-studies.csv',
   '/content/drive/MyDrive/DTE/Copy of Malaria-studies.csv',
   '/content/drive/MyDrive/DTE/Copy of Pneumonia-studies.csv'
]

# let's create an empty dictionary to store the dataframes
data_frames = {}

# let's loop through the file names and read the data into dataframes
for file_name in file_names:
    name = os.path.splitext(file_name)[0]  # drop the '.csv' extension; the key keeps the full path
    data_frames[name] = read_data(file_name)


In [20]:
# lets iterate through the dictionary and print the name of each dataframe
for name, df in data_frames.items():
    print(f'{name}: rows = {df.shape[0]} and columns = {df.shape[1]}')

/content/drive/MyDrive/DTE/Copy of Cancer-studies: rows = 100262 and columns = 30
/content/drive/MyDrive/DTE/Copy of Covid 19-studies: rows = 9342 and columns = 30
/content/drive/MyDrive/DTE/Copy of Heart: rows = 32176 and columns = 30
/content/drive/MyDrive/DTE/Copy of HIV-studies: rows = 9113 and columns = 30
/content/drive/MyDrive/DTE/Copy of Malaria-studies: rows = 1330 and columns = 30
/content/drive/MyDrive/DTE/Copy of Pneumonia-studies: rows = 9640 and columns = 30
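
All six exports report the same 30 columns, so they could be stacked into a single frame with a label recording which export each row came from. This is a minimal sketch on toy frames, assuming the real columns also match by name; `condition_group` is a hypothetical column name, and in the notebook the dictionary would be `data_frames` with shorter keys.

```python
import pandas as pd

# Toy stand-ins for the loaded exports, keyed by a short condition label.
frames = {
    "Malaria": pd.DataFrame({"NCT Number": ["NCT001", "NCT002"]}),
    "HIV": pd.DataFrame({"NCT Number": ["NCT003"]}),
}

# pd.concat on a dict uses the keys as an outer index level; that level is
# moved into a regular column and the leftover row index is discarded.
combined = (
    pd.concat(frames, names=["condition_group", "row"])
    .reset_index(level="condition_group")
    .reset_index(drop=True)
)

# Row counts per source, comparable to the per-file counts printed above.
rows_per_group = combined["condition_group"].value_counts()
print(rows_per_group)
```

A trial listed under more than one condition would appear once per export in the combined frame, so duplicates may need to be handled before counting studies.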


In [None]:
# lets preview the top entries of the datasets using a for loop
for name, df in data_frames.items():
   print(f"Top view of {name}:")
   print(df.head(4))
   print("="*30) # separator for clarity

NameError: ignored