<h1><center>Breast Cancer Prediction with AI</center></h1>

<center><img src="https://static-01.hindawi.com/styles/hindawi_wide/s3/2019-11/Cancer_Awareness-2019_blog_v1.0_noText.jpg?itok=CR034IE-"></center>

# **Introduction**
Cancer occurs when changes called mutations take place in genes that regulate cell growth. The mutations let the cells divide and multiply in an uncontrolled way.

Breast cancer is cancer that develops in breast cells. Typically, the cancer forms in either the lobules or the ducts of the breast.

Lobules are the glands that produce milk, and ducts are the pathways that bring the milk from the glands to the nipple. Cancer can also occur in the fatty tissue or the fibrous connective tissue within the breast.

The uncontrolled cancer cells often invade other healthy breast tissue and can travel to the lymph nodes under the arms. The lymph nodes are a primary pathway that helps the cancer cells move to other parts of the body.

## Description
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The linear program used to obtain the separating plane in the 3-dimensional space is described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

It can also be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
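
The three summary statistics described above can be sketched for a single feature as follows. This is a minimal illustration, not the dataset authors' code: the function name `summarize_feature` and the sample radii are hypothetical, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of nuclei.

```python
import numpy as np

def summarize_feature(values):
    """Return (mean, standard error, worst) for one feature measured on each nucleus."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    # Assumption: standard error of the mean, using the sample standard deviation (ddof=1).
    standard_error = values.std(ddof=1) / np.sqrt(values.size)
    # "Worst" is the mean of the three largest values.
    worst = np.sort(values)[-3:].mean()
    return mean, standard_error, worst

# Hypothetical radii for the nuclei found in one image (made-up numbers).
radii = [10.0, 12.0, 14.0, 16.0, 18.0]
mean, se, worst = summarize_feature(radii)
print(mean, se, worst)
```

Applying the same function to each of the ten features would give the 30 columns (mean, SE, worst) described above.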

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant
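
The class distribution and the integer encoding of the labels can be checked directly in the copy of the dataset that ships with scikit-learn, which is the one this notebook loads. The sketch below only reads the dataset; the variable name `label_counts` is chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names lists the label for each integer code, in order:
# index 0 is the name for target == 0, index 1 for target == 1.
label_counts = {
    str(name): int(np.sum(data.target == code))
    for code, name in enumerate(data.target_names)
}
print(list(data.target_names))
print(label_counts)
```

In this copy, code 0 stands for malignant and code 1 for benign, which matters whenever the numeric target is turned back into a label.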

## Contents

1. [Import data and Python packages](#t1.)

2. [Data visualization](#t2.)

3. [Classification](#t3.)

    3.1 [Split data for train and test](#t3.1)
    
    3.2 [Functions for models](#t3.2)
    
    3.3 [Models](#t3.3)

4. [Result](#t4.)

<a id="t1."></a>
# 1. Import data and Python packages

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore')

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import tree
from sklearn.cluster import KMeans
from xgboost import XGBClassifier,XGBRFClassifier
from sklearn.metrics import confusion_matrix,accuracy_score

In [None]:
plt.style.use('ggplot')

# Colour palette; every entry is a '#'-prefixed hex code
orange_black = ['#fdc029', '#df861d', '#FF6347', '#aa3d01',
                '#a30e15', '#800000', '#171820']

plt.rcParams['figure.figsize'] = (10,5)
plt.rcParams['figure.facecolor'] = '#FFFACD'
plt.rcParams['axes.facecolor'] = '#FFFFE0'
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.color'] = orange_black[3]
plt.rcParams['grid.linestyle'] = '--'

In [None]:
# Load the dataset once and reuse the returned object
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
# scikit-learn encodes the target as 0 = malignant, 1 = benign
df['target'] = data.target
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe().T.style.bar(color='#d65f5f')

<a id="t2."></a>
# 2. Data visualization

In [None]:
# Compute the Spearman correlation matrix once and reuse it
corr = df.corr(method='spearman')
# Mask the upper triangle (including the diagonal) so each pair is shown once
mask = np.triu(np.ones_like(corr, dtype=bool))
f,ax=plt.subplots(figsize = (17,15),dpi=250)
sns.heatmap(corr,annot= True,fmt = ".0%",ax=ax,
            vmin = -1,
            vmax = 1, mask = mask,cmap = "coolwarm",
            linewidths = 0.2,linecolor = "white")
plt.xticks(rotation=70)
plt.yticks(rotation=0)
plt.title('Correlation matrix (Spearman)', size = 30)
plt.show()
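
With 31 columns the annotated heatmap is dense, so it can help to list the strongest correlations as a table as well. The helper below is a minimal sketch: the name `top_correlated_pairs` and the small `demo` frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=5, method="spearman"):
    """Return the n column pairs with the largest absolute correlation."""
    corr = frame.corr(method=method)
    # Keep the strict upper triangle so each pair appears once and
    # self-correlations on the diagonal are excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by absolute correlation, strongest first.
    order = np.argsort(-pairs.abs().to_numpy(), kind="stable")
    return pairs.iloc[order].head(n)

# Hypothetical toy data: 'b' is a monotonic function of 'a'.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [1, 4, 9, 16, 25],
    "c": [1, 3, 2, 5, 4],
})
top = top_correlated_pairs(demo)
print(top)
```

In this notebook the call would be `top_correlated_pairs(df.drop(columns='target'))`.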

In [None]:
plt.figure(figsize=(25,25))
for col,index in zip(df.columns,range(1,31)):
    if col == 'target':
        pass
    else:
        plt.subplot(6,5,index)
        # scikit-learn encodes the target as 0 = malignant, 1 = benign
        plt.hist(df.loc[df["target"]==1][col],alpha=0.7,label="benign",density=True,bins=20)
        plt.hist(df.loc[df["target"]==0][col],alpha=0.7,label="malignant",density=True,bins=20)
        plt.legend()
        plt.title(col.upper())
plt.tight_layout()
plt.show()

In [None]:
def scatterAndBoxen(x, y):
    # Work on a copy so the original DataFrame is left unchanged
    data = df.copy()

    # scikit-learn encodes the target as 0 = malignant, 1 = benign
    data['TARGET'] = data['target'].map({0: 'MALIGNANT', 1: 'BENIGN'})
    plt.figure(figsize=(15,10))
    plt.subplot(2,2,(1,2))
    sns.scatterplot(data = data, x = x, y = y, hue = 'TARGET')
    plt.xlabel(x.upper())
    plt.ylabel(y.upper())
    plt.subplot(2,2,3)
    sns.boxenplot(data=data, x='TARGET', y = x)
    plt.xlabel('')
    plt.ylabel(x.upper())
    plt.subplot(2,2,4)
    sns.boxenplot(data=data, x='TARGET', y= y)
    plt.xlabel('')
    plt.ylabel(y.upper())
    plt.show()

In [None]:
scatterAndBoxen('mean radius','mean area' )

In [None]:
scatterAndBoxen('mean radius','mean perimeter' )

In [None]:
scatterAndBoxen('worst perimeter','mean radius' )

In [None]:
feature = []
for col in range(30):
    if df.iloc[:,col].max() < 1:
        feature.append(col)
plt.figure(figsize=(14,10))
sns.violinplot(data=df.iloc[:,feature], 
            orient="h", palette=["teal"])
plt.title("Violin plot of features with maximum below 1")
plt.show()

In [None]:
feature = []
for col in range(30):
    if df.iloc[:,col].max() > 1 and df.iloc[:,col].max() < 10:
        feature.append(col)
plt.figure(figsize=(14,6))
sns.violinplot(data=df.iloc[:,feature], 
            orient="h", palette=["teal"])
plt.title("Violin plot of features with maximum between 1 and 10")
plt.show()

In [None]:
feature = []
for col in range(30):
    if df.iloc[:,col].max() > 10 and df.iloc[:,col].max() < 100:
        feature.append(col)
plt.figure(figsize=(14,8))
sns.violinplot(data=df.iloc[:,feature], 
            orient="h", palette=["teal"])
plt.title("Violin plot of features with maximum between 10 and 100")
plt.show()

In [None]:
feature = []
for col in range(30):
    if df.iloc[:,col].max() > 100 and df.iloc[:,col].max() < 1000:
        feature.append(col)
plt.figure(figsize=(14,6))
sns.violinplot(data=df.iloc[:,feature], 
            orient="h", palette=["teal"])
plt.title("Violin plot of features with maximum between 100 and 1000")
plt.show()

In [None]:
feature = []
for col in range(30):
    if df.iloc[:,col].max() > 1000:
        feature.append(col)
plt.figure(figsize=(14,6))
sns.violinplot(data=df.iloc[:,feature], 
            orient="h", palette=["teal"])
plt.title("Violin plot of features with maximum above 1000")
plt.show()

<a id="t3."></a>
# 3. Classification

In [None]:
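# The scaler is fitted on the training split only (see 3.1), so that
# no information from the test set leaks into the preprocessing.
scaler = MinMaxScaler(feature_range=(-1, 1))

<a id="t3.1"></a>
## 3.1 Split data for train and test

In [None]:
X = df.drop('target',axis=1)
y = df.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123)

# Fit the scaler on the training data, then apply it to both splits
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)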

In [None]:
warnings.simplefilter(action='ignore', category=FutureWarning)
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import ConfusionMatrix
from yellowbrick.classifier import DiscriminationThreshold

In [None]:
# Re-apply the plot style, since importing yellowbrick may change matplotlib's rcParams
plt.style.use('ggplot')

orange_black = ['#fdc029', '#df861d', '#FF6347', '#aa3d01',
                '#a30e15', '#800000', '#171820']

plt.rcParams['figure.figsize'] = (10,5)
plt.rcParams['figure.facecolor'] = '#FFFACD'
plt.rcParams['axes.facecolor'] = '#FFFFE0'
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.color'] = orange_black[3]
plt.rcParams['grid.linestyle'] = '--'

In [None]:
# Label names in the order of the class codes: 0 = malignant, 1 = benign
classes = ['MALIGNANT', 'BENIGN']

<a id="t3.2"></a>
## 3.2 Functions for models
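
The model cells in the next section all repeat the same fit, predict and score steps. One way to factor that out is sketched below; the helper name `evaluate_model` and the short variable names in the demo are hypothetical, and the demo wraps the scaler and classifier in a scikit-learn pipeline so that the block runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit `model` on the training split and score it on the test split."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }

# Self-contained demo with its own variable names, so it does not
# overwrite the notebook's X_train / X_test.
X_all, y_all = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_all, y_all, test_size=0.30, random_state=123)
knn = make_pipeline(MinMaxScaler(feature_range=(-1, 1)), KNeighborsClassifier(12))
report = evaluate_model(knn, Xtr, ytr, Xte, yte)
print("Acc:", report["accuracy"])
print(report["confusion_matrix"])
```

The same helper accepts any estimator with `fit` and `predict`, so it could be called once per classifier instead of repeating the cell body.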

<a id="t3.3"></a>
## 3.3 Models

In [None]:
model = KNeighborsClassifier(12)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Acc:",accuracy_score(y_test, y_pred))

visualizer = ROCAUC(model, classes=classes)
visualizer.fit(X_train, y_train)      
visualizer.score(X_test, y_test)        
visualizer.show();

plt.figure(figsize=(3,3))
cm = ConfusionMatrix(model, classes=classes)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)
plt.xticks(rotation=0)
cm.show();

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Acc:",accuracy_score(y_test, y_pred))

visualizer = ROCAUC(model, classes=classes)
visualizer.fit(X_train, y_train)      
visualizer.score(X_test, y_test)        
visualizer.show();

plt.figure(figsize=(3,3))
cm = ConfusionMatrix(model, classes=classes)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)
plt.xticks(rotation=0)
cm.show();

In [None]:
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Acc:",accuracy_score(y_test, y_pred))

visualizer = ROCAUC(model, classes=classes)
visualizer.fit(X_train, y_train)      
visualizer.score(X_test, y_test)        
visualizer.show();

plt.figure(figsize=(3,3))
cm = ConfusionMatrix(model, classes=classes)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)
plt.xticks(rotation=0)
cm.show();

In [None]:
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Acc:",accuracy_score(y_test, y_pred))

visualizer = ROCAUC(model, classes=classes)
visualizer.fit(X_train, y_train)      
visualizer.score(X_test, y_test)        
visualizer.show();

plt.figure(figsize=(3,3))
cm = ConfusionMatrix(model, classes=classes)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)
plt.xticks(rotation=0)
cm.show();

In [None]:
model = XGBRFClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Acc:",accuracy_score(y_test, y_pred))

visualizer = ROCAUC(model, classes=classes)
visualizer.fit(X_train, y_train)      
visualizer.score(X_test, y_test)        
visualizer.show();

plt.figure(figsize=(3,3))
cm = ConfusionMatrix(model, classes=classes)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)
plt.xticks(rotation=0)
cm.show();

<a id="t4."></a>
# 4. Result

In [None]:
def Prediction(test):
    # Uses the global `model`, i.e. the most recently fitted classifier
    pred = model.predict(test.reshape(1, -1))
    # The predicted class code (0 or 1) indexes into `classes`
    return classes[int(pred[0])]

In [None]:
# Pick one random test sample and compare the prediction with the true label
idx = np.random.randint(0, len(X_test))
print("Predict:", Prediction(X_test[idx]))
print("Actual:", classes[y_test.values[idx]])