# Cancer Diagnosis : Approaches Comparison

In this notebook we use the **Breast Cancer Wisconsin (Diagnostic) Data Set** and apply several classification methods to it.

This is a binary classification problem: given a set of features, we want to know whether a specific tumor is Malignant or Benign.

## The Approaches We're Going to Test Are

1. Machine Learning Approaches :
    1. Logistic Regression
    2. K Neighbors Classifier
    3. C-Support Vector Classification (SVC) With Linear Kernel
    4. C-Support Vector Classification (SVC) With rbf Kernel
    5. Gaussian Naive Bayes
    6. Decision Trees
    7. Random Forest

2. A Fuzzy Logic Approach
3. Neural Network Approach
4. A Neural Network Trained With Genetic Algorithms (Hybrid Approach)

## Data Preparation

Let's start by loading some important libraries

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

Now let's import our dataset

In [None]:
dataset = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

Let's take a look at our dataset

In [None]:
print(dataset.count().unique())
dataset.head()

As we can see, we have 569 rows and 33 columns: an ID column, the classification label, 30 feature columns, and one trailing unnamed column.

Let's check for null values in the dataset

In [None]:
dataset.isnull().sum()

We can see that there is a whole column of nulls in our dataset, so let's drop that column

In [None]:
del dataset['Unnamed: 32']

### After Cleaning Our Data We do The Necessary Splits

First we need to split the dataset into **Features:X** and **Labels:y**

In [None]:
X = dataset.iloc[:, 2:].values
y = dataset.iloc[:, 1].values

Let's take a look at the first values of X and y

In [None]:
X[0]

In [None]:
y[:20]

As we can see, our **DataFrame** turned into a **NumPy array**, which is what we need for our models.

However, our labels `y` are characters, so we need to convert them to binary values. Let's do that using LabelEncoder.

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)

In [None]:
y[:20]
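
As a rough illustration of what the encoder does: it sorts the distinct class names and numbers them from 0, so `B` becomes 0 and `M` becomes 1. The sample labels below are hypothetical and the code is a plain-Python sketch of the idea, not scikit-learn's implementation.

```python
# Hypothetical sample of diagnosis labels (not taken from the dataset)
labels = ['M', 'B', 'B', 'M', 'B']

# Sort the distinct classes and number them starting from 0
classes = sorted(set(labels))
mapping = {label: code for code, label in enumerate(classes)}

# Replace every label by its integer code
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```

With this mapping, a prediction of 1 means Malignant and 0 means Benign in every confusion matrix that follows.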

Great. Now we need to split the data into **Train Data** and **Test Data**.
We're going to set the ratio to 20% Test : 80% Train.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
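
To make the split sizes concrete, here is a minimal sketch of a seeded 80/20 split over row indices. It assumes the test size is rounded up and the rest goes to training, and it uses Python's own shuffling, so the rows it selects will not match the ones `train_test_split` picks.

```python
import math
import random

n_samples = 569   # number of rows in the dataset
test_size = 0.2

# Assumption: the test set size is rounded up, the remainder is the training set
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

# Shuffle the row indices with a fixed seed so the split is repeatable
indices = list(range(n_samples))
random.Random(0).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(n_train, n_test)
```

Under these assumptions the test set holds 114 rows, which means a single misclassified sample moves the accuracy by a little under one percentage point.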

Now let's take a look at X_train for example

In [None]:
print(len(X_train))
X_train

Great, now we have one last problem to solve in our dataset. Let's take a look at some statistics from our original data

In [None]:
dataset.describe()

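Looking at the mean values of each feature, we notice that our features have very different ranges, so we need to **Normalize** our data, also called **Feature Scaling**. Here we standardize each feature to zero mean and unit variance. The scaler is fitted on the training data only and then applied to the test data, so nothing about the test set leaks into training.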

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
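
The following sketch shows the arithmetic behind fitting on the training data and transforming both sets. The toy arrays are hypothetical, and the code only mirrors the idea of `StandardScaler` using NumPy.

```python
import numpy as np

# Hypothetical toy data: 2 features on very different scales
train = np.array([[10.0, 200.0], [12.0, 400.0], [14.0, 600.0]])
test = np.array([[11.0, 500.0]])

# "fit": learn the per-feature mean and standard deviation from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the same statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```

After scaling, both features of the training set have mean 0 and standard deviation 1, so neither dominates distance-based models such as K Neighbors or SVC.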

## 1. Machine Learning Approaches

## 1.A Logistic Regression

Our first classifier is Logistic Regression. Let's create a model

In [None]:
from sklearn.linear_model import LogisticRegression
LogisticRegressionModel = LogisticRegression(random_state = 0)

Now let's train the Model using our train data

In [None]:
LogisticRegressionModel.fit(X_train, y_train)

Now we start classifying the test data

In [None]:
y_pred_A1 = LogisticRegressionModel.predict(X_test)

Now let's create a Confusion Matrix to see what our results look like

In [None]:
from sklearn.metrics import confusion_matrix
cm_A1 = confusion_matrix(y_test, y_pred_A1)
print('Confusion Matrix for Logistic Regression Model')
sns.heatmap(cm_A1,annot=True)

In [None]:
print("Logistic Regression Model accuracy is {}%".format(((cm_A1[0][0] + cm_A1[1][1])/cm_A1.sum())*100))
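
The accuracy expression above adds the diagonal of the confusion matrix and divides by the total. The sketch below spells that out on a hypothetical matrix (the counts are made up for illustration) and also computes recall for the malignant class, which matters in a diagnostic setting because it counts missed tumors.

```python
# Hypothetical confusion matrix in scikit-learn layout:
# rows are the true class, columns are the predicted class
cm = [[50, 2],   # true 0 (Benign):    50 true negatives, 2 false positives
      [3, 45]]   # true 1 (Malignant):  3 false negatives, 45 true positives

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

# Accuracy: share of all samples on the diagonal
accuracy = (tn + tp) / total * 100

# Recall for the malignant class: share of malignant tumors that were detected
recall = tp / (tp + fn) * 100

print("accuracy = {:.2f}%, malignant recall = {:.2f}%".format(accuracy, recall))
```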

## 1.B K Neighbors Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier
KNeighborsModel = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p=2)

KNeighborsModel.fit(X_train, y_train)

y_pred_A2 = KNeighborsModel.predict(X_test)

from sklearn.metrics import confusion_matrix
cm_A2 = confusion_matrix(y_test, y_pred_A2)
print('Confusion Matrix for KNeighbors Model')
sns.heatmap(cm_A2,annot=True)

In [None]:
print("KNeighbors Model accuracy is {}%".format(((cm_A2[0][0] + cm_A2[1][1])/cm_A2.sum())*100))
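
For intuition about the parameters used above: `metric='minkowski'` with `p=2` is the ordinary Euclidean distance, and `n_neighbors=5` means the 5 closest training points vote on the class. The sketch below is a plain-Python illustration on hypothetical 2-feature points, not scikit-learn's implementation.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    # Minkowski distance with p=2 is the Euclidean distance
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k closest training points
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical, already scaled points: three of class 0 and four of class 1
points = [(-1.0, -1.2), (-0.8, -0.9), (-1.1, -0.7),
          (1.0, 1.1), (0.9, 1.3), (1.2, 0.8), (0.1, 0.2)]
labels = [0, 0, 0, 1, 1, 1, 1]

prediction_near_ones = knn_predict(points, labels, query=(0.8, 0.9))
prediction_near_zeros = knn_predict(points, labels, query=(-1.0, -1.0))
print(prediction_near_ones, prediction_near_zeros)
```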

## 1.C C-Support Vector Classification (SVC) With Linear Kernel

In [None]:
from sklearn.svm import SVC
SVCModel = SVC(kernel = 'linear', random_state=0)

SVCModel.fit(X_train, y_train)

y_pred_A3 = SVCModel.predict(X_test)

from sklearn.metrics import confusion_matrix
cm_A3 = confusion_matrix(y_test, y_pred_A3)
print('Confusion Matrix for SVC Model')
sns.heatmap(cm_A3,annot=True)

In [None]:
print("SVC Model accuracy is {}%".format(((cm_A3[0][0] + cm_A3[1][1])/cm_A3.sum())*100))

## 1.D C-Support Vector Classification (SVC) With rbf Kernel

In [None]:
from sklearn.svm import SVC
SVCrModel = SVC(kernel = 'rbf', random_state = 0)

SVCrModel.fit(X_train, y_train)

y_pred_A4 = SVCrModel.predict(X_test)

from sklearn.metrics import confusion_matrix
cm_A4 = confusion_matrix(y_test, y_pred_A4)
print('Confusion Matrix for SVC Kernelized Model')
sns.heatmap(cm_A4,annot=True)

In [None]:
print("SVC Kernelized Model accuracy is {}%".format(((cm_A4[0][0] + cm_A4[1][1])/cm_A4.sum())*100))

## 1.E Gaussian Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
GaussianNBModel = GaussianNB()

GaussianNBModel.fit(X_train, y_train)

y_pred_A5 = GaussianNBModel.predict(X_test)

from sklearn.metrics import confusion_matrix
cm_A5 = confusion_matrix(y_test, y_pred_A5)
print('Confusion Matrix for Gaussian NB Model')
sns.heatmap(cm_A5,annot=True)

In [None]:
print("Gaussian NB Model accuracy is {}%".format(((cm_A5[0][0] + cm_A5[1][1])/cm_A5.sum())*100))

## 1.F Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier
DecisionTreeModel = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)

DecisionTreeModel.fit(X_train, y_train)

y_pred_A6 = DecisionTreeModel.predict(X_test)

from sklearn.metrics import confusion_matrix
cm_A6 = confusion_matrix(y_test, y_pred_A6)
print('Confusion Matrix for Decision Tree Model')
sns.heatmap(cm_A6,annot=True)

In [None]:
print("Decision Tree Model accuracy is {}%".format(((cm_A6[0][0] + cm_A6[1][1])/cm_A6.sum())*100))

## 1.G Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
RandomForestModel = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)

RandomForestModel.fit(X_train, y_train)

y_pred_A7 = RandomForestModel.predict(X_test)

from sklearn.metrics import confusion_matrix
cm_A7 = confusion_matrix(y_test, y_pred_A7)
print('Confusion Matrix for Random Forest Model')
sns.heatmap(cm_A7,annot=True)

In [None]:
print("Random Forest Model accuracy is {}%".format(((cm_A7[0][0] + cm_A7[1][1])/cm_A7.sum())*100))

### Now Let's Rank These Machine Learning Approaches Before Moving Forward

1. SVC Model accuracy is 98.24561403508771%
2. SVC Kernelized Model accuracy is 98.24561403508771%
3. Random Forest Model accuracy is 97.36842105263158%
4. Logistic Regression Model accuracy is 96.49122807017544%
5. KNeighbors Model accuracy is 95.6140350877193%
6. Decision Tree Model accuracy is 92.98245614035088%
7. Gaussian NB Model accuracy is 90.35087719298247%
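
To put these percentages in perspective, the sketch below converts each accuracy back into a count of correctly classified test samples. It assumes the test set has 114 rows (20% of 569, rounded up); the model names used as dictionary keys are shortened for readability.

```python
n_test = 114  # assumption: 20% of 569 rows, rounded up

# Accuracies copied from the ranking above
accuracies = {
    'SVC (linear)': 98.24561403508771,
    'SVC (rbf)': 98.24561403508771,
    'Random Forest': 97.36842105263158,
    'Logistic Regression': 96.49122807017544,
    'KNeighbors': 95.6140350877193,
    'Decision Tree': 92.98245614035088,
    'Gaussian NB': 90.35087719298247,
}

# Convert each percentage into a number of correct test predictions
correct_counts = {name: round(acc / 100 * n_test) for name, acc in accuracies.items()}

for name, correct in correct_counts.items():
    print("{}: {}/{} correct, {} misclassified".format(name, correct, n_test, n_test - correct))
```

Under that assumption, neighboring models in the ranking often differ by a single test sample, so the ordering near the top should be read with some caution.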

## 2. A Fuzzy Logic Approach

To use fuzzy logic on our dataset, we first need to create 2 new features:
- uniformity : difference between the radius extreme value and the radius mean value
- homogeneity : difference between the extreme value of symmetry and the mean value of symmetry

So let's create these 2 columns in a new dataset copied from our cleaned data

In [None]:
fuzzy_data = dataset.copy()

In [None]:
fuzzy_data['uniformity'] = fuzzy_data['radius_worst'] - fuzzy_data['radius_mean']
fuzzy_data['homogeneity'] = fuzzy_data['symmetry_worst'] - fuzzy_data['symmetry_mean']

Now let's take a look at our modified new dataset

In [None]:
fuzzy_data.head()

We'll base our model on 4 features 
- AREA
- PERIMETER
- UNIFORMITY
- HOMOGENEITY

Let's limit our dataset only to these values

In [None]:
fuzzy_data = fuzzy_data[['area_mean', 'perimeter_mean', 'uniformity', 'homogeneity', 'diagnosis']]
fuzzy_data.head()

Now we import the fuzzy libraries

In [None]:
!pip install scikit-fuzzy
import numpy as np
import skfuzzy as fuzz
from skfuzzy import control as ctrl

Now before we create our antecedents let's first check the ranges of values in our dataset so we can set the limits of our universe

In [None]:
fuzzy_data.describe()

Now let's set our limits based on these maximum values

In [None]:
AREA = ctrl.Antecedent(np.arange(0, 2501.000000, 0.0001), 'AREA')
PERIMETER = ctrl.Antecedent(np.arange(0, 188.500000, 0.0001), 'PERIMETER')
UNIFORMITY = ctrl.Antecedent(np.arange(0, 11.760000, 0.0001), 'UNIFORMITY')
HOMOGENEITY = ctrl.Antecedent(np.arange(0, 0.404100, 0.0001), 'HOMOGENEITY')

Our diagnosis will take a value in the range [0, 1], which will then be mapped to Benign or Malignant

In [None]:
DIAGNOSIS = ctrl.Consequent(np.arange(0, 1, 0.0001), 'DIAGNOSIS')

Now let's define the membership functions over the universe of each Antecedent. Note that these values are based on research by the "National Center for Biotechnology Information, U.S. National Library of Medicine"

Area Universe

In [None]:
AREA['Smaller'] = fuzz.trapmf(AREA.universe, [0, 0, 748.8,1000])
AREA['Larger'] = fuzz.trapmf(AREA.universe, [508.1, 2194, 2501,2501])
AREA.view()

Perimeter Universe

In [None]:
PERIMETER['Smaller'] = fuzz.trapmf(PERIMETER.universe, [0, 0, 92.58,103])
PERIMETER['Larger'] = fuzz.trapmf(PERIMETER.universe, [85.1, 159.8, 188.5,188.5])
PERIMETER.view()

Uniformity Universe

In [None]:
UNIFORMITY['Smaller'] = fuzz.trapmf(UNIFORMITY.universe, [0, 0, 1.669,2.6])
UNIFORMITY['Larger'] = fuzz.trapmf(UNIFORMITY.universe, [0.65, 6.205, 11.76,11.76])
UNIFORMITY.view()

Homogeneity Universe

In [None]:
HOMOGENEITY['Smaller'] = fuzz.trapmf(HOMOGENEITY.universe, [0, 0, 0.1232,.19])
HOMOGENEITY['Larger'] = fuzz.trapmf(HOMOGENEITY.universe, [0.0295, 0.2168, 0.4041,0.4041])
HOMOGENEITY.view()
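
To see what the four numbers passed to `fuzz.trapmf` mean, here is a small plain-Python sketch of a trapezoidal membership function evaluated at a single point. The helper name `trapmf` is reused only for readability, and the input value 874.4 is a hypothetical area chosen to fall in the overlap of the two sets.

```python
def trapmf(x, a, b, c, d):
    """Trapezoid: 0 outside [a, d], rising on [a, b], 1 on [b, c], falling on [c, d]."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

area = 874.4  # hypothetical tumor area

# Same corner points as AREA['Smaller'] and AREA['Larger'] above
smaller = trapmf(area, 0, 0, 748.8, 1000)
larger = trapmf(area, 508.1, 2194, 2501, 2501)

print("Smaller: {:.3f}, Larger: {:.3f}".format(smaller, larger))
```

A value in the overlap belongs partly to both sets at once, which is the main difference from a hard threshold.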

And finally the Diagnosis Universe, which will use triangular rather than trapezoidal membership functions

In [None]:
DIAGNOSIS['B'] = fuzz.trimf(DIAGNOSIS.universe, [0, 0, 1])
DIAGNOSIS['M'] = fuzz.trimf(DIAGNOSIS.universe, [0, 1, 1])
DIAGNOSIS.view()

#### Now we need to define our rules

Based on the research mentioned above, only 2 of the 16 possible rule combinations have a defined output; the other 14 are undefined, neither Malignant nor Benign.

So we're only going to define these 2 rules

In [None]:
rule1 = ctrl.Rule(AREA['Smaller'] & PERIMETER['Smaller'] & UNIFORMITY['Smaller'] & HOMOGENEITY['Smaller'], DIAGNOSIS['B'])
rule2 = ctrl.Rule(AREA['Larger'] & PERIMETER['Larger'] & UNIFORMITY['Larger'] & HOMOGENEITY['Larger'], DIAGNOSIS['M'])
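
As a sketch of how such rules are evaluated, assume the `&` operator takes the minimum of the membership degrees, which is the common choice for fuzzy AND. The degrees below are hypothetical values for a single tumor, not computed from the dataset.

```python
# Hypothetical membership degrees of one tumor in each fuzzy set
degrees = {
    'AREA':        {'Smaller': 0.0, 'Larger': 0.4},
    'PERIMETER':   {'Smaller': 1.0, 'Larger': 0.0},
    'UNIFORMITY':  {'Smaller': 0.7, 'Larger': 0.2},
    'HOMOGENEITY': {'Smaller': 1.0, 'Larger': 0.1},
}

def firing_strength(term):
    # Assumption: AND is the minimum over all four antecedents
    return min(feature[term] for feature in degrees.values())

benign_strength = firing_strength('Smaller')    # strength of rule1
malignant_strength = firing_strength('Larger')  # strength of rule2
no_rule_fires = benign_strength == 0 and malignant_strength == 0

print(benign_strength, malignant_strength, no_rule_fires)
```

In this example a single zero membership is enough to switch off each rule, so neither rule fires and the system has nothing to base an output on. Mixed cases like this one correspond to the 14 undefined combinations.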

Now let's create a Control System using these 2 rules

In [None]:
Diag_ctrl = ctrl.ControlSystem([rule1, rule2])

And in order to simulate this control system, we will create a ControlSystemSimulation

In [None]:
Diag = ctrl.ControlSystemSimulation(Diag_ctrl)

### Now that we've created our Fuzzy System, let's use it to classify our dataset

In [None]:
# A list to store predictions generated by the fuzzy system
fuzzy_preds = []
# A list to store the matching real labels; needed because some rows are skipped (explained below)
fuzzy_real_vals = []

# looping over the rows of the dataset
for index, row in fuzzy_data.iterrows():
    
    #assigning antecedents values
    Diag.input['AREA'] = row['area_mean']
    Diag.input['PERIMETER'] = row['perimeter_mean']
    Diag.input['UNIFORMITY'] = row['uniformity']
    Diag.input['HOMOGENEITY'] = row['homogeneity']
    '''
    here we'll try to compute the output of the fuzzy system
    but why TRY ? why not compute it directly
    do you remember when we said we have a total of 16 rules for our fuzzy system
    and we're using only 2 of them because the rest output Undefined
    these undefined values will cause some errors so we'll just ignore these rows
    that's why we're using Try Except, ok let's continue :D
    '''
    try:
        Diag.compute()
        
        #as we said the fuzzy system outputs value in range [0:1]
        #so here we discretize it and then store it
        fuzzy_preds.append(Diag.output['DIAGNOSIS'] > 0.5)
        
        fuzzy_real_vals.append(y[index])
    except Exception:
        pass
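
The 0.5 threshold used above can be motivated with a small numerical sketch. It assumes the usual Mamdani procedure: each output set is clipped at its rule's firing strength, the clipped sets are merged with max, and the crisp output is the centroid of the result. The helper `centroid_output` is hypothetical and only approximates what the control system computes.

```python
import numpy as np

x = np.linspace(0, 1, 1001)
benign_mf = 1 - x    # same shape as trimf [0, 0, 1]
malignant_mf = x     # same shape as trimf [0, 1, 1]

def centroid_output(benign_strength, malignant_strength):
    # Clip each output set at its rule's firing strength, then merge with max
    aggregated = np.fmax(np.fmin(benign_strength, benign_mf),
                         np.fmin(malignant_strength, malignant_mf))
    # Centroid of the merged shape
    return float((x * aggregated).sum() / aggregated.sum())

only_benign = centroid_output(1.0, 0.0)
only_malignant = centroid_output(0.0, 1.0)
print(round(only_benign, 3), round(only_malignant, 3))
```

When only the benign rule fires the centroid sits near 1/3, and when only the malignant rule fires it sits near 2/3, so 0.5 separates the two cases. If both strengths are zero the merged shape has no area and the centroid is undefined, which fits the skipped rows.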

Let's see how many rows got skipped

In [None]:
print(len(y) - len(fuzzy_preds))

#### Now let's see the results of our fuzzy system, shall we!

In [None]:
# confusion_matrix expects (y_true, y_pred), in the same order as the earlier sections
cm_fuzz = confusion_matrix(fuzzy_real_vals, fuzzy_preds)
sns.heatmap(cm_fuzz,annot=True)

In [None]:
print("Fuzzy System accuracy is {}%".format(((cm_fuzz[0][0] + cm_fuzz[1][1])/cm_fuzz.sum())*100))

Well, Not Bad Right?

## 3. Neural Network Approach

In this approach we'll use a simple Neural Network with only 3 layers using Keras library, so let's start

First we import some libraries to be used

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input

Next we create a sequential model to add layers to

In [None]:
NNmodel = Sequential()

Now we'll add our 3 layers to the model. All 3 are Dense layers, and the output layer has only 1 unit because we're doing binary classification. We also add a dropout after each of the first 2 layers to reduce overfitting

In [None]:
#We add the input layer
NNmodel.add(Input(shape=(30,)))

#we add our first hidden layer and a dropout to avoid overfitting
NNmodel.add(Dense(30, activation='relu', kernel_initializer='uniform'))
NNmodel.add(Dropout(0.1))


#we add the second layer and a dropout to avoid overfitting
NNmodel.add(Dense(16, activation='relu', kernel_initializer='uniform'))
NNmodel.add(Dropout(0.1))


#now we add the output layer with a sigmoid activation because it's a binary classification
NNmodel.add(Dense(1, activation='sigmoid', kernel_initializer='uniform'))

Let's Look at a summary of our model before compiling it

In [None]:
NNmodel.summary()
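
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-output pair plus one bias per output unit, and Dropout layers have no parameters. A minimal sketch for the layer sizes used above:

```python
# Input features, first hidden layer, second hidden layer, output unit
layer_sizes = [30, 30, 16, 1]

# Each Dense layer has n_in * n_out weights and n_out biases
params_per_layer = [n_in * n_out + n_out
                    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(params_per_layer)

print(params_per_layer, total_params)
```

The total is also the number of values a flat weight vector for this network must hold.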

Now let's compile our model

In [None]:
NNmodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Now let's train it using our training data

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint
#first we create a checkpointer to store best epochs
checkpointer = ModelCheckpoint(
    filepath="NNweights",
    save_weights_only=True,
    monitor='accuracy',
    mode='max',
    save_best_only=True)

#we'll set epochs to 300, it won't take long
history = NNmodel.fit(X_train, y_train, batch_size=100, epochs=300, callbacks=[checkpointer])

Now let's classify our test data and discretize the results

In [None]:

#first we load the best saved weights
NNmodel.load_weights("NNweights")

nn_pred = NNmodel.predict(X_test)
nn_pred = (nn_pred > 0.5)

Let's see a confusion matrix and some accuracy

In [None]:
cm_nn = confusion_matrix(y_test, nn_pred)
sns.heatmap(cm_nn,annot=True)

In [None]:
print("Neural Network accuracy is {}%".format(((cm_nn[0][0] + cm_nn[1][1])/cm_nn.sum())*100))

## 4. Neural Network Trained With Genetic Algorithms

This is a Hybrid Approach: we'll use the same Neural Network model as before, but train it with a Genetic Algorithm instead of the usual Adam optimizer and fit function.

We'll use the PyGAD library for the genetic algorithm.

First let's import some libraries

In [None]:
!pip install pygad
import tensorflow.keras
import pygad.kerasga
import numpy
import pygad

Now we define the fitness function, which is used to evaluate each solution in the population. Since our problem is binary classification, we'll use the inverse of tensorflow.keras.losses.BinaryCrossentropy() as the measure of fitness, so a lower loss gives a higher fitness

In [None]:
# Note: PyGAD 2.20.0 and later expect the signature fitness_func(ga_instance, solution, sol_idx)
def fitness_func(solution, sol_idx):
    model_weights_matrix = pygad.kerasga.model_weights_as_matrix(model=NNmodel,
                                                                 weights_vector=solution)
    NNmodel.set_weights(weights=model_weights_matrix)

    predictions = NNmodel.predict(X_train)
    
    bce = tensorflow.keras.losses.BinaryCrossentropy()
    solution_fitness = 1.0 / (bce(y_train, predictions).numpy() + 0.00000001)

    return solution_fitness
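
To make the fitness value less abstract, here is a plain-Python sketch of binary cross-entropy and its inverse. The clipping constant `eps` is an assumption meant to keep `log` away from zero, and the labels and probabilities are hypothetical.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Mean of -(y*log(p) + (1-y)*log(1-p)); eps is an assumed clipping constant
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def fitness(y_true, y_prob):
    # Same shape as fitness_func above: inverse of the loss, with a small constant
    return 1.0 / (binary_crossentropy(y_true, y_prob) + 0.00000001)

# Hypothetical labels with confident-correct and mostly-wrong predictions
good = fitness([1, 0, 1], [0.9, 0.1, 0.8])
bad = fitness([1, 0, 1], [0.4, 0.6, 0.3])
print(good > bad)
```

Because the genetic algorithm maximizes fitness, taking the inverse turns "minimize the loss" into "maximize the score".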


We also define a callback to show some statistics while training

In [None]:
def callback_generation(ga_instance):
    print("Generation = {generation}".format(generation=ga_instance.generations_completed))
    print("Fitness    = {fitness}".format(fitness=ga_instance.best_solution()[1]))


In [None]:
#initialize the weights vector as a chromosome from the NN Model
weights_vector = pygad.kerasga.model_weights_as_vector(model=NNmodel)

#Create Genetic Population
keras_ga = pygad.kerasga.KerasGA(model=NNmodel,
                                 num_solutions=100)


num_generations = 100
num_parents_mating = 50
crossover_type = "single_point"
mutation_type = "random"
mutation_percent_genes = 10

initial_population = keras_ga.population_weights

ga_instance = pygad.GA(num_generations=num_generations, 
                       num_parents_mating=num_parents_mating, 
                       initial_population=initial_population,
                       fitness_func=fitness_func,
                       on_generation=callback_generation,
                       crossover_type=crossover_type,
                       mutation_type=mutation_type,
                       mutation_percent_genes=mutation_percent_genes)
ga_instance.run()

# After the generations complete, plot how the fitness values evolve over generations.
ga_instance.plot_result(title="PyGAD & Keras - Iteration vs. Fitness", linewidth=4)

# Returning the details of the best solution.
solution, solution_fitness, solution_idx = ga_instance.best_solution()
print("Fitness value of the best solution = {solution_fitness}".format(solution_fitness=solution_fitness))
print("Index of the best solution : {solution_idx}".format(solution_idx=solution_idx))
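
The two operators configured above can be sketched in a few lines. This is a simplified illustration of single-point crossover and random mutation on toy chromosomes, not PyGAD's implementation; the helper names and the mutation range of -1.0 to 1.0 are assumptions.

```python
import random

rng = random.Random(0)

def single_point_crossover(parent_a, parent_b):
    # Take the genes before a random cut point from one parent, the rest from the other
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def random_mutation(chromosome, percent_genes=10, low=-1.0, high=1.0):
    # Add a random value to a random subset of genes (assumed range: low to high)
    mutated = list(chromosome)
    n_genes = max(1, len(mutated) * percent_genes // 100)
    for i in rng.sample(range(len(mutated)), n_genes):
        mutated[i] += rng.uniform(low, high)
    return mutated

# Toy parents: in the notebook each chromosome is the flattened network weights
parent_a = [0.0] * 20
parent_b = [1.0] * 20

child = single_point_crossover(parent_a, parent_b)
mutant = random_mutation(child)
print(child)
print(mutant)
```

Crossover recombines weights that already work in two parents, while mutation injects small random changes so the search does not stall on the initial population.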



Let's load the best chromosome of weights and classify the test data

In [None]:
# Fetch the parameters of the best solution.
best_solution_weights = pygad.kerasga.model_weights_as_matrix(model=NNmodel,
                                                              weights_vector=solution)
# Set the Weights
NNmodel.set_weights(best_solution_weights)

#predict
predictions = NNmodel.predict(X_test)
predictions = (predictions > 0.5)

Now let's see a confusion matrix and some accuracy

In [None]:
cm_ga = confusion_matrix(y_test, predictions)
sns.heatmap(cm_ga,annot=True)

In [None]:
print("Hybrid Neural Network/Genetic accuracy is {}%".format(((cm_ga[0][0] + cm_ga[1][1])/cm_ga.sum())*100))