# Breast Cancer Prediction using Predictive Analysis Techniques

Authored By: Vivek Poddar

### Description

For the final project of DSE200x (edX), I am using the Breast Cancer Wisconsin (Original) Data Set from the UCI ML Repository.<br>(Source: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/)

### Objective
The aim is to classify each sample as either 'Benign' or 'Malignant', based on certain cell attributes. <br>A further aim is to compare the prediction accuracy of several machine learning classification algorithms and choose the best among them.

### Approach
The following steps will be involved in the process:
   <ol>Feature Selection</ol>
   

##### Importing Required Libraries

In [None]:
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Loading .data file

In [None]:
cancer_data = pd.read_csv('breast-cancer-wisconsin.data', header=None)

In [None]:
cancer_data.shape


Retrieving the column names listed in the data description and assigning them to the DataFrame columns.
<br>Class = 2 represents 'Benign'
<br>Class = 4 represents 'Malignant'

In [None]:
colnames = ['Sample_code', 'Clump_Thickness', 'Uniformity_of_Cell_Size', 'Uniformity_of_Cell_Shape', 'Marginal_Adhesion', 'Single_Epithelial_Cell_Size', 'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli', 'Mitoses', 'Class']
cancer_data.columns = colnames

In [None]:
cancer_data.head()


## Data Cleaning

Making sure that all columns are numeric.

In [None]:
df = cancer_data.apply(pd.to_numeric, errors='coerce')

for col in colnames[1:10]:
    print(type(df[col][1]))
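
To illustrate what `errors='coerce'` does in the cell above, here is a minimal sketch on a made-up column. The sample values are hypothetical, and the `'?'` placeholder is an assumption about how a missing entry might appear in the raw file.

```python
import pandas as pd

# Hypothetical raw column: digits stored as strings, with '?' standing in for a missing value
raw = pd.Series(['1', '10', '?', '4'], name='Bare_Nuclei')

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising an error
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric.dtype)         # NaN forces the column to a float dtype
print(numeric.isna().sum())  # count of values that could not be parsed
```

Any value coerced to NaN this way is what the null check and `dropna()` in the following cells pick up.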

Checking for Null Values in New DataFrame.

In [None]:
df.isnull().any()

In [None]:
df = df.dropna()

In [None]:
df.isnull().any()

Dropping the Sample_code column, as it is only an identifier and carries no predictive information.

In [None]:
del df['Sample_code']

## Exploring Data Using Histogram to analyze Data Distribution

In [None]:
hist = df.hist(bins=10, figsize = (15,10))

### Now we will compare the accuracy of three prediction models available in sklearn

Namely,

<ul>Gaussian Naive Bayes</ul>
<ul>Support Vector Machine Classifier</ul>
<ul>Decision Tree Classifier</ul>

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
clean_df  = df.copy()

Labelling Class value 2 as 0 ('Benign') and Class value 4 as 1 ('Malignant')

In [None]:
clean_df['target'] = (clean_df['Class']>2)*1
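
As an alternative view of the same encoding, the sketch below uses an explicit mapping on a small made-up `Class` column. The sample values are hypothetical; only the 2/4 coding comes from the data description above.

```python
import pandas as pd

# Hypothetical Class column using the dataset's coding: 2 = benign, 4 = malignant
classes = pd.Series([2, 4, 4, 2, 2], name='Class')

# Explicit mapping: an unexpected code would become NaN instead of being silently labelled
target = classes.map({2: 0, 4: 1})

# The comparison-based encoding used in this notebook gives the same labels
assert target.tolist() == ((classes > 2) * 1).tolist()
print(target.tolist())
```

The comparison form is shorter, while the mapping form makes the meaning of each code visible at a glance.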

In [None]:
features = colnames[1:10]
X = clean_df[features].copy()
y = clean_df['target'].copy()

In [None]:
X.head()

In [None]:
y.describe()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)

In [None]:
GNB_Classifier = GaussianNB()

In [None]:
GNB_Classifier.fit(X_train, y_train)

In [None]:
prediction1 = GNB_Classifier.predict(X_test)

In [None]:
print(prediction1)

In [None]:
accuracy1 = accuracy_score(y_true = y_test, y_pred = prediction1)
print(accuracy1)

#### Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree_classifier = DecisionTreeClassifier(max_leaf_nodes = 15, random_state=0)

In [None]:
tree_classifier.fit(X_train, y_train)

In [None]:
prediction2 = tree_classifier.predict(X_test)

In [None]:
print(prediction2)

In [None]:
accuracy2 = accuracy_score(y_true = y_test, y_pred = prediction2)
print(accuracy2)

#### Support Vector Machine Classifier

In [None]:
from sklearn import svm

In [None]:
X.head()

In [None]:
y.head()

In [None]:
classifier = svm.SVC(kernel='linear')

In [None]:
classifier.fit(X_train, y_train)

In [None]:
prediction3 = classifier.predict(X_test)

In [None]:
accuracy3 = accuracy_score(y_true = y_test, y_pred = prediction3)
print(accuracy3)
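
Since the objective is to compare the three classifiers and pick the best, the comparison can also be run in one loop. The following is a minimal sketch, not the notebook's own procedure: it uses synthetic data from `make_classification` as a stand-in (an assumption, so that the block runs without the data file), and the names `X_demo`, `y_demo`, `models`, and `accuracies` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Wisconsin data): 9 features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     n_informative=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324)

# The same three model families and settings used in this notebook
models = {
    'Gaussian Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(max_leaf_nodes=15, random_state=0),
    'Linear SVM': SVC(kernel='linear'),
}

# Fit each model on the training split and score it on the held-out split
accuracies = {}
for name, model in models.items():
    model.fit(Xd_train, yd_train)
    accuracies[name] = accuracy_score(yd_test, model.predict(Xd_test))

# Report every score and the model with the highest test accuracy
best_model = max(accuracies, key=accuracies.get)
for name, acc in accuracies.items():
    print(f'{name}: {acc:.3f}')
print('Highest test accuracy:', best_model)
```

Replacing `X_demo` and `y_demo` with the notebook's `X` and `y` would apply the same comparison to the real data. A single split can favour one model by chance, so small differences in accuracy should be read with care.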

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
confusion_matrix_graph = confusion_matrix(y_test, prediction1)

In [None]:
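# Rows are true labels and columns are predicted labels (0 = Benign, 1 = Malignant)
pd.DataFrame(confusion_matrix_graph,
             index=['True Benign', 'True Malignant'],
             columns=['Predicted Benign', 'Predicted Malignant'])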

In [None]:
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
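
The normalization step inside `plot_confusion_matrix` can be checked by hand on a small example. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical counts: rows are true labels, columns are predicted labels
cm = np.array([[90, 10],
               [5, 45]])

# Divide each row by its total so that every row sums to 1
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

print(cm_normalized)
print(cm_normalized.sum(axis=1))  # each row total after normalization
```

After row normalization, each diagonal entry is the fraction of that true class predicted correctly (the per-class recall), which makes classes of different sizes directly comparable.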

In [None]:
class_names=['Benign', 'Malignant']
# Compute confusion matrix
cnf_matrix1 = confusion_matrix(y_test, prediction1)
np.set_printoptions(precision=2)

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix1, classes=class_names, normalize=True,
                      title='Normalized confusion matrix for GaussianNB')

plt.show()

cnf_matrix2 = confusion_matrix(y_test, prediction2)
np.set_printoptions(precision=2)

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix2, classes=class_names, normalize=True,
                      title='Normalized confusion matrix for Decision Tree')

plt.show()

cnf_matrix3 = confusion_matrix(y_test, prediction3)
np.set_printoptions(precision=2)

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix3, classes=class_names, normalize=True,
                      title='Normalized confusion matrix for SVM')

plt.show()