# Cancer Clinical Trials
Input features: `study` (the type of study intervention) and `condition` (a short description of the disease).

Output feature: `qualification`.

This model predicts whether a patient is eligible (qualified) for a clinical trial.

### Approach to the problem
* First, read the dataset with pandas `read_csv`.
* Perform basic EDA (check the DataFrame info, null values, value counts, etc.).
* Separate the independent and dependent columns.
* Combine both independent features into one string per row.
* Clean the text data:
    1. Remove all characters except letters.
    2. Convert all letters to lower case.
    3. Split into a list of words.
    4. Remove the stopwords and stem the remaining words with the Porter stemmer.
    5. Join the words back into a sentence.
* Convert the words into vectors with `CountVectorizer` (a TF-IDF vectorizer would also work).
* Split the dataset into train and test sets.
* Fit a multinomial naive Bayes classifier, which generally works well for text data.
* Check performance with the confusion matrix and the accuracy score.
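
The text-cleaning steps in the list above can be sketched as a single function. This is a minimal illustration, not the notebook's exact code: `clean_text` and `TOY_STOPWORDS` are hypothetical names, the stopword list is a tiny hand-written stand-in for NLTK's English list, and stemming is left out so the example runs with the standard library only.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# the notebook itself uses NLTK's English stopword list.
TOY_STOPWORDS = {"are", "and", "or", "to", "no", "the", "of"}


def clean_text(raw, stop_words=TOY_STOPWORDS):
    """Keep letters only, lowercase, drop stopwords, and re-join the tokens."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", raw)   # step 1: letters only
    tokens = letters_only.lower().split()           # steps 2-3: lowercase and split
    kept = [t for t in tokens if t not in stop_words]  # step 4: remove stopwords
    return " ".join(kept)                           # step 5: join back together


sample = "study interventions are BI 836909   multiple myeloma diagnosis and indwelling"
cleaned = clean_text(sample)
print(cleaned)
```

Digits such as `836909` disappear because only letters are kept, so drug codes that consist mostly of numbers contribute little to the final features.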



In [1]:
import pandas as pd
import numpy as np

In [2]:
df=pd.read_csv("cancer_clinical_trials.csv")
df.head(20)

| | study | condition | qualification |
|---|---|---|---|
| 0 | study interventions are recombinant CD40-ligand | melanoma skin diagnosis and no active cns met... | 0 |
| 1 | study interventions are Liposomal doxorubicin | colorectal cancer diagnosis and cardiovascular | 0 |
| 2 | study interventions are BI 836909 | multiple myeloma diagnosis and indwelling cen... | 0 |
| 3 | study interventions are Immunoglobulins | recurrent fallopian tube carcinoma diagnosis ... | 0 |
| 4 | study interventions are Paclitaxel | stage ovarian cancer diagnosis and patients m... | 0 |
| 5 | study interventions are Antibodies, Monoclonal | recurrent verrucous carcinoma of the oral cav... | 0 |
| 6 | study interventions are Hormones | prostate cancer diagnosis and imaging examina... | 0 |
| 7 | study interventions are Bendamustine Hydrochlo... | diffuse large cell lymphoma diagnosis and no ... | 0 |
| 8 | study interventions are Nivolumab | recovered from all toxicities associated with... | 0 |
| 9 | study interventions are Thalidomide | kidney cancer diagnosis and no diabetes mellitus | 0 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   study          1000000 non-null  object
 1   condition      1000000 non-null  object
 2   qualification  1000000 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 22.9+ MB


In [4]:
df.isnull().sum()

study            0
condition        0
qualification    0
dtype: int64

In [5]:
df.qualification.value_counts()

0    500000
1    500000
Name: qualification, dtype: int64

In [6]:
X=df[['study','condition']]
y=df['qualification']

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
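# Join the two text columns into one string per row.
# The vectorised form avoids a slow Python loop over one million rows.
text = (X['study'].astype(str) + ' ' + X['condition'].astype(str)).tolist()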

In [9]:
text[2]

'study interventions are BI 836909   multiple myeloma diagnosis and indwelling central venous cateder or willingness to undergo intra venous central line placement'

In [11]:
import nltk
import re
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Vaishali\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [12]:
corpus=[]
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [None]:

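# Build the stopword set once; calling stopwords.words() for every word is very slow.
stop_words = set(stopwords.words('english'))

for i in range(len(text)):
    # Keep letters only, lowercase, and split into tokens.
    review = re.sub('[^a-zA-Z]', ' ', text[i])
    review = review.lower()
    review = review.split()

    # Drop stopwords and stem the remaining tokens.
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)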

In [None]:
## Applying Countvectorizer
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
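cv = CountVectorizer(max_features=5000,ngram_range=(1,3))
# Keep the sparse matrix: a dense 1,000,000 x 5,000 int64 array would need
# roughly 40 GB, and MultinomialNB accepts sparse input directly.
X_cv = cv.fit_transform(corpus)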
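
The approach notes mention TF-IDF as an alternative to raw counts. A minimal sketch of that swap is shown here on a made-up three-document corpus; `toy_corpus`, `tfidf`, and `X_tfidf` are illustrative names, and the `max_features` and `ngram_range` values simply mirror the `CountVectorizer` settings above rather than being tuned choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-cleaned documents standing in for the real corpus.
toy_corpus = [
    "studi intervent paclitaxel ovarian cancer diagnosi",
    "studi intervent nivolumab recov toxic",
    "studi intervent hormon prostat cancer diagnosi",
]

# Same vocabulary limits as the CountVectorizer cell, but with TF-IDF weights.
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(toy_corpus)  # stays a sparse matrix

print(X_tfidf.shape)
```

TF-IDF down-weights terms such as `studi intervent` that appear in nearly every document, which may or may not help here; comparing both vectorizers on the held-out set is the only way to tell.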

In [None]:
## Divide the dataset into Train and Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_cv, y, test_size=0.33, random_state=0)

In [None]:
from sklearn.naive_bayes import MultinomialNB
classifier=MultinomialNB()

In [None]:
from sklearn import metrics
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)
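
The vectorising and fitting steps above can also be chained into a single scikit-learn pipeline, which keeps the vocabulary and the classifier together when predicting on new text. The sketch below uses a made-up four-document dataset with arbitrary labels, so `toy_texts`, `toy_labels`, `model`, and `toy_pred` are illustrative only and say nothing about real trial eligibility.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up, already-cleaned documents with arbitrary labels (illustration only).
toy_texts = [
    "studi intervent paclitaxel ovarian cancer diagnosi",
    "studi intervent nivolumab recov toxic",
    "studi intervent hormon prostat cancer diagnosi",
    "studi intervent thalidomid kidney cancer diagnosi diabet",
]
toy_labels = [0, 1, 0, 1]

# Bag-of-words features feed straight into multinomial naive Bayes.
model = make_pipeline(CountVectorizer(ngram_range=(1, 3)), MultinomialNB())
model.fit(toy_texts, toy_labels)

# New text goes through the same vocabulary before classification.
toy_pred = model.predict(["studi intervent paclitaxel ovarian cancer"])
print(toy_pred)
```

Fitting the vectorizer inside the pipeline also means it only ever sees training text, provided the pipeline is fitted after the train/test split.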

In [None]:
score = metrics.accuracy_score(y_test, pred)
print("accuracy:   %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred)
print(cm)  # rows are true labels, columns are predicted labels
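
For a binary problem, the confusion matrix from scikit-learn is laid out as `[[TN, FP], [FN, TP]]`, so accuracy, precision, and recall can be read off it directly. The sketch below shows the arithmetic on made-up counts; `summarize_confusion` is a hypothetical helper and the numbers are not results from this notebook.

```python
def summarize_confusion(cm):
    """Derive accuracy, precision, and recall from a 2x2 confusion matrix.

    Expects the scikit-learn layout: rows are true labels, columns are
    predicted labels, i.e. [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),  # of predicted positives, how many are right
        "recall": tp / (tp + fn),     # of true positives, how many are found
    }


# Made-up counts for illustration only.
example_cm = [[40, 10], [5, 45]]
summary = summarize_confusion(example_cm)
print(summary)
```

Because the two classes are perfectly balanced in this dataset, accuracy is a reasonable headline metric, but precision and recall show whether the errors lean toward false qualifications or missed ones.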