# **Cancer Clinical Trial qualification prediction**

**The aim of this project is to predict whether a patient qualifies for a cancer clinical trial, given the patient's condition. The corresponding study intervention indicates the treatment required for that subject.**

# Business Use Case:

**Clinical trials are research studies performed on people to find out whether a new drug or treatment is safe and effective. They face various challenges, such as entering and transferring data and determining the correct dosage. To improve these trials, researchers are turning to AI and NLP to smooth the process.**

**NLP applied to medicine can go through doctors' notes, i.e. unstructured data, and extract meaningful information from them in less time.**

**The steps followed in this project are:**

1. Import the libraries
2. Exploratory data analysis (number of classes, NaN values, types of cancer and corresponding study etc.)
3. Text data preprocessing on the condition column
4. Feature extraction (encode words as integer indices and pass them to an Embedding layer)
5. Build an LSTM model and pass the embedded vectors to it
6. Performance analysis using metrics

# A] Import the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

In [None]:
df=pd.read_csv('../input/clinical-trial/cancer_clinical_trials.csv')
df.head()

In [None]:
df.shape

In [None]:
df.dtypes


In [None]:
df['qualification'].value_counts().plot.bar()   #balanced dataset

In [None]:
#check for NAN values

features_with_nan=[feature for feature in df.columns if df[feature].isnull().sum()>=1]
features_with_nan    #no null values

# Independent and dependent variables

In [None]:
X=df['condition']
y=df['qualification']

In [None]:
#declaring the vocab size: the number of integer indices available for the words in the condition column

voc_size=5000
messages=X.copy()

# Data Preprocessing: removal of stopwords and punctuation, stemming/lemmatization, lowercasing, and storing the result in the corpus variable

**We will use the NLTK library for this. Stemming is the process of reducing words to their stem (base form), which brings uniformity to the corpus. PorterStemmer does this for us.**

In [None]:
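import re
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # fetch the stop-word list if it is missing
ps=PorterStemmer()
stop_words=set(stopwords.words('english'))  # build the set once instead of once per word
corpus=[]
for message in messages:
    review=re.sub('[^a-zA-Z]',' ',message)  # keep letters only
    review=review.lower().split()           # lowercase and tokenize
    review=[ps.stem(word) for word in review if word not in stop_words]  # drop stopwords, stem the rest
    corpus.append(' '.join(review))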
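
**To see what the cleaning steps do to a single string, here is a minimal sketch that does not need NLTK. The sample text and `STOP_WORDS` (a tiny hypothetical subset of a real stop-word list) are assumptions for illustration, and stemming is left out.**

```python
import re

# Hypothetical, tiny stop-word set; NLTK's English list is much longer.
STOP_WORDS = {"with", "the", "of", "and", "in"}

def clean(text):
    """Keep letters only, lowercase, tokenize and drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

sample = "Stage II Breast Cancer, with 3 positive lymph nodes."
cleaned = clean(sample)
print(cleaned)  # stage ii breast cancer positive lymph nodes
```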

# Feature extraction

**There are various techniques for feature extraction (converting words to vectors), such as bag of words, TF-IDF and embeddings. We will use an embedding layer here because it captures semantic information better and produces smaller vectors: embeddings form a dense matrix instead of a sparse matrix (mostly 0's).**

**Now we convert each word in the corpus into an integer index within the vocabulary size (5000) declared earlier. Despite its name, Keras `one_hot` returns these indices, assigned by hashing, rather than one-hot vectors. We then pass the index sequences to the embedding layer.**

In [None]:
onehot_repr=[one_hot(words,voc_size) for words in corpus]
onehot_repr
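
**The idea of mapping words to indices by hashing can be sketched in plain Python. This is an illustration under assumptions, not Keras's exact implementation: `hash_index` and `encode` are hypothetical helpers, and MD5 is used only to get a hash that is stable between runs. Note that two different words can collide on the same index.**

```python
import hashlib

def hash_index(word, vocab_size):
    """Map a word to an integer in [1, vocab_size - 1]; 0 stays free for padding."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (vocab_size - 1) + 1

def encode(text, vocab_size):
    """Encode a whitespace-separated string as a list of word indices."""
    return [hash_index(word, vocab_size) for word in text.split()]

encoded = encode("breast cancer stage breast", 5000)
print(encoded)  # the first and last index are equal because the word repeats
```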

**A word embedding converts each word into a dense vector of learned features. Similar words end up with similar vectors, which helps capture semantic information. The representation is dense and low-dimensional, unlike Bag of Words/TF-IDF, which is sparse and high-dimensional.**

**Before passing the encoded sequences to the embedding layer, all sequences must have the same length. We therefore fix a sentence length and pre-pad shorter sequences with zeros; `pad_sequences` also truncates longer ones to that length.**
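
**Conceptually, an embedding layer is a lookup table with one row per word index. The NumPy sketch below uses a random matrix in place of learned weights; the sizes and the index sequence are made-up values for illustration.**

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4

# One row per word index; in a trained model these rows are learned.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A padded sequence of word indices (0 is the padding index).
sequence = np.array([0, 0, 3, 7, 3])

# The lookup is plain row indexing: one dense vector per position.
vectors = embedding_matrix[sequence]
print(vectors.shape)  # (5, 4)
```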

In [None]:
sent_length=20
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)

In [None]:
print(embedded_docs)
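
**Pre-padding can be shown on plain lists. `pad_pre` is a hypothetical helper that mimics `pad_sequences` with `padding='pre'` and its default truncation from the front; it is a sketch, not the Keras code.**

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad seq with `value` up to maxlen, or keep only its last maxlen items."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([12, 7, 431], 5)
long_ = pad_pre([1, 2, 3, 4, 5, 6], 4)
print(short)  # [0, 0, 12, 7, 431]
print(long_)  # [3, 4, 5, 6]
```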

# Model Building LSTM

In [None]:
from tensorflow.keras.layers import Dropout
embedding_vector_features=40
model=Sequential()
model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model.add(Dropout(0.5))
model.add(LSTM(200))
model.add(Dropout(0.5))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

In [None]:
import numpy as np

X_final=np.array(embedded_docs)
y_final=np.array(y)

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X_final,y_final,test_size=0.33,random_state=42)

In [None]:
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=64)

# Performance Metrics

**In healthcare, choosing the right performance metric to evaluate the model is crucial. Healthcare datasets are often imbalanced, and accuracy is then unreliable because it is biased towards the class with more occurrences. Precision, recall and F1 score are generally used in these cases.**

**In the clinical trial use case we focus on false positives (precision): predicting a patient as eligible when the patient is actually not eligible is the more dangerous error.**

# Accuracy = (TP+TN)/(TP+TN+FP+FN)

**Accuracy is the ratio of correct predictions to the total number of predictions.**

In [None]:
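# one sigmoid unit gives one probability per row: threshold at 0.5 (argmax over an (n, 1) array is always 0)
y_pred=(model.predict(X_test)>0.5).astype(int).ravel()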
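
**With a single sigmoid unit, `model.predict` returns one probability per row, i.e. an array of shape `(n, 1)`. The sketch below compares two ways of turning such an array into labels; the probabilities are made-up values, not model output.**

```python
import numpy as np

# Made-up sigmoid outputs with shape (4, 1).
probs = np.array([[0.10], [0.80], [0.55], [0.30]])

# argmax over the last axis sees only one column, so it is always 0.
argmax_labels = np.argmax(probs, axis=-1)

# Thresholding at 0.5 gives the usual binary decision.
threshold_labels = (probs > 0.5).astype(int).ravel()

print(argmax_labels)     # [0 0 0 0]
print(threshold_labels)  # [0 1 1 0]
```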

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2 and is not used below
from sklearn.metrics import confusion_matrix,classification_report

In [None]:
print(classification_report(y_test,y_pred,digits=2))

In [None]:
set(y_test) - set(y_pred)

**`set(y_test) - set(y_pred)` lists the labels that occur in the test set but were never predicted. If label 1 appears there, scikit-learn has no predicted samples from which to compute its precision, so the precision and F-score for that label are set to 0.0 and a warning is raised. Because we request an averaged score, that 0.0 is included in the average. Note that applying `np.argmax(..., axis=-1)` to the output of a single sigmoid unit always returns 0, which is one way to end up never predicting label 1.**

# Solution :

**If we are not interested in the scores of labels that were never predicted, we can explicitly specify the labels we are interested in, i.e. the labels that were predicted at least once. Keep in mind that this hides the missing class from the average rather than fixing the model.**
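
**The effect of restricting the labels can be sketched by hand. `f1_for_label` is a hypothetical helper, the labels are toy values, and an unweighted mean is used for simplicity instead of the weighted average.**

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one label; 0.0 when the label has no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 0]  # label 1 is never predicted

# Average over every label in the test set: the 0.0 for label 1 counts.
f1_all = [f1_for_label(y_true, y_pred, lbl) for lbl in sorted(set(y_true))]
macro_all = sum(f1_all) / len(f1_all)

# Average over predicted labels only: label 1 is silently left out.
f1_seen = [f1_for_label(y_true, y_pred, lbl) for lbl in sorted(set(y_pred))]
macro_seen = sum(f1_seen) / len(f1_seen)

print(round(macro_all, 3), round(macro_seen, 3))  # 0.333 0.667
```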



In [None]:
cm=confusion_matrix(y_test,y_pred)
cm

In [None]:
from sklearn import metrics

# F1 Score : Harmonic mean of Precision and Recall

In [None]:
metrics.f1_score(y_test, y_pred, average='weighted', labels=np.unique(y_pred))

# Recall=TP/(TP+FN)

In [None]:
metrics.recall_score(y_test, y_pred, average='weighted', labels=np.unique(y_pred))

# Precision=TP/(TP+FP)

In [None]:
metrics.precision_score(y_test, y_pred, average='weighted', labels=np.unique(y_pred))
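
**The formulas used in this section can be checked by hand from the four confusion-matrix counts. The counts below are made-up numbers for illustration, not results from this model.**

```python
# Hypothetical confusion-matrix counts for the positive (eligible) class.
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # how many predicted eligible are truly eligible
recall = tp / (tp + fn)     # how many truly eligible were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.85 0.889 0.8 0.842
```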