##Project Title:- Breast Cancer Prediction.

##BY :- VIVEK SINGH

##Introduction:

Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Breast cancer is diagnosed when an abnormal lump is found (from self-examination or x-ray) or a tiny speck of calcium is seen (on an x-ray). After a suspicious lump is found, the doctor will conduct a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body.

This breast cancer dataset was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.

##Problem Statement:

The goal is to predict whether a patient's breast tumor is malignant (cancerous) or benign (not cancerous).

##Data Description:

The dataset provides patient information. It contains 569 records and 32 attributes. Each attribute other than the ID and the diagnosis is a potential predictor.

The data come from breast fine needle aspiration (FNA) tests. FNA is a quick and simple procedure that removes some fluid or cells from a breast lesion or cyst (a lump, sore or swelling) with a fine needle, similar to a blood sample needle. From these results we build a model that classifies a breast tumor into one of two classes:

1 = Malignant (Cancerous) - Present

0 = Benign (Not Cancerous) - Absent

The Breast Cancer dataset is available from the machine learning repository maintained by the University of California, Irvine. The dataset contains 569 samples of malignant and benign tumor cells.

The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnosis (M=malignant, B=benign), respectively. The columns 3-32 contain 30 real-value features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant.
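
As a cross-check of the layout described above, scikit-learn bundles a copy of this dataset that loads without any download. The sketch below assumes scikit-learn is installed; the `diagnosis` series is built here purely for illustration. Note that the bundled copy has no ID column and encodes the target the other way round (0 = malignant, 1 = benign), so it is mapped back to M/B before any comparison with the CSV.

```python
from sklearn.datasets import load_breast_cancer

# Bundled copy of the Wisconsin diagnostic data (no ID column).
bunch = load_breast_cancer(as_frame=True)
features = bunch.data      # 30 real-valued features per sample
target = bunch.target      # 0 = malignant, 1 = benign in this copy

# Rebuild an M/B diagnosis column to mirror the CSV layout.
diagnosis = target.map({0: 'M', 1: 'B'})

print(features.shape)
print(diagnosis.value_counts())
```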

##Loading Dataset and Libraries.

In [None]:
##importing libraries
import pandas as pd

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno

import warnings
warnings.filterwarnings('ignore')

sns.set()
plt.style.use('ggplot')


In [None]:
#importing dataset from google drive

from google.colab import drive
drive.mount('/content/drive')



In [None]:
df = pd.read_csv('/content/drive/MyDrive/coders cave dataset & certificate/breast-cancer.csv')

In [None]:
# Inspecting the data
df.head()

In [None]:
# Inspecting the number of rows and columns in dataframe
df.shape
# There are 569 rows and 32 columns in the dataset

In [None]:
#Reviewing data type with info of dataframe.
df.info()

In [None]:
# getting summary statistics of the dataframe
df.describe()

In [None]:
# checking for any missing values in the dataset
df.isnull().any()

- There is no data missing from the dataset

In [None]:
#getting to know about unique parameters.
df.diagnosis.unique()

From the results above, diagnosis is a categorical variable, because it takes a fixed number of possible values (i.e., Malignant or Benign). Machine learning algorithms expect numbers, not strings, as their inputs, so we need to encode these labels numerically.
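
One way to do this encoding is an explicit dictionary mapping, which has the advantage of exposing unexpected labels instead of silently turning them into 0. This is a minimal sketch on a made-up `labels` series, not the notebook's data; the names `labels`, `mapping` and `encoded` are illustrative.

```python
import pandas as pd

# Toy stand-in for the diagnosis column.
labels = pd.Series(['M', 'B', 'B', 'M', 'B'], name='diagnosis')

# Explicit mapping: 1 = malignant, 0 = benign.
mapping = {'M': 1, 'B': 0}
encoded = labels.map(mapping)

# Any label missing from the mapping becomes NaN, so it can be caught here.
if encoded.isna().any():
    raise ValueError('unexpected diagnosis label found')

encoded = encoded.astype(int)
print(encoded.tolist())
```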

## Performing Data Wrangling.

In [None]:
# Id column is redundant and not useful, we want to drop it
df.drop('id', axis =1, inplace=True)
df


In [None]:
#converting
#1= Malignant (Cancerous) - Present
#0= Benign (Not Cancerous) -Absent
df['diagnosis']=np.where(df['diagnosis']=='M',1,0)
df

In [None]:
# counting samples per class (malignant vs. benign) without overwriting df
df.groupby('diagnosis').size()


- Malignant = 1 (indicates presence of cancer cells)
- Benign = 0 (indicates absence)

## DATA VISUALIZATION

In [None]:
#lets get the frequency of cancer diagnosis
plt.figure(figsize=(8,4))
sns.countplot(x = df.diagnosis)
plt.title('frequency of cancer diagnosis')
plt.show()

In [None]:
df.describe()
plt.hist(df['diagnosis'])
plt.title('Diagnosis (M=1 , B=0)')
plt.show()

In [None]:
plt.figure(figsize = (20, 15))
plotnumber = 1

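# plot only the 30 feature columns so that each one gets a panel in the 5 x 6 grid
for column in df.drop(columns = 'diagnosis'):
    if plotnumber <= 30:
        ax = plt.subplot(5, 6, plotnumber)
        # histplot with a KDE overlay replaces the deprecated sns.distplot
        sns.histplot(df[column], kde = True)
        plt.xlabel(column)

    plotnumber += 1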

plt.tight_layout()
plt.show()

##Observation

We can see that the perimeter, radius, area, concavity and compactness attributes may have an exponential-like (right-skewed) distribution. The texture, smoothness and symmetry attributes appear to have a Gaussian or nearly Gaussian distribution.
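
The visual impression of skew can be backed by a number. The sketch below uses synthetic columns (an exponential one and a Gaussian one, both invented for illustration) to show how pandas' `skew()` separates the two shapes, and how a `log1p` transform reduces right skew. On the real data the same call would be `df.skew()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: one right-skewed column, one symmetric column.
demo = pd.DataFrame({
    'skewed_feature': rng.exponential(scale=1.0, size=5000),
    'symmetric_feature': rng.normal(loc=0.0, scale=1.0, size=5000),
})

# Sample skewness per column: near 0 for symmetric data, clearly positive for right skew.
skewness = demo.skew()

# A log transform is one common way to pull in a long right tail.
log_skew = np.log1p(demo['skewed_feature']).skew()

print(skewness)
print(log_skew)
```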

In [None]:
# heatmap

plt.figure(figsize = (20, 12))

corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype = bool))

sns.heatmap(corr, mask = mask, linewidths = 1, annot = True, fmt = ".2f")
plt.show()

##Observation

- A strong positive relationship (r between 0.75 and 1) exists among several of the mean-value parameters.
- The mean area of the cell nucleus has a strong positive correlation with the mean values of radius and perimeter.
- Some parameters are moderately positively correlated (r between 0.5 and 0.75), for example concavity and area, or concavity and perimeter.
- Some negative correlation appears between fractal_dimension and the mean values of radius, texture and perimeter.
- Many columns are very highly correlated, which causes multicollinearity, so the highly correlated features have to be removed.
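
Before dropping anything, it can help to list which pairs of columns actually exceed the chosen threshold. The helper below, `correlated_pairs`, is a hypothetical name introduced for this sketch, and the data is synthetic (a `perimeter` column derived from `radius`, plus an unrelated `texture` column).

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.92):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strictly lower triangle so each pair is listed once.
    lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))
    pairs = lower.stack().dropna()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
radius = rng.normal(14, 3, 200)
demo = pd.DataFrame({
    'radius': radius,
    'perimeter': 2 * np.pi * radius + rng.normal(0, 0.5, 200),  # nearly a function of radius
    'texture': rng.normal(19, 4, 200),                          # independent of radius
})

pairs = correlated_pairs(demo)
print(pairs)
```

Only one column from each listed pair needs to be removed to break the redundancy.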

In [None]:
# removing highly correlated features

corr_matrix = df.corr().abs()

mask = np.triu(np.ones_like(corr_matrix, dtype = bool))
tri_df = corr_matrix.mask(mask)

to_drop = [x for x in tri_df.columns if any(tri_df[x] > 0.92)]

df = df.drop(to_drop, axis = 1)

print(f"The reduced dataframe has {df.shape[1]} columns.")

- After removing highly correlated features, we are left with only 23 columns.

In [None]:
to_drop

In [None]:
df.info()

In [None]:
# creating features and label

X = df.drop('diagnosis', axis = 1)
y = df['diagnosis']

##Assessing Model Accuracy:

- Split the data into training and test sets.
- The simplest way to evaluate a machine learning algorithm is to train and
  test it on different data. Here the available data is split into a
  training set and a test set (70% training, 30% test).
- Train the algorithm on the first part,
  make predictions on the second part and
  evaluate the predictions against the expected results.
- The size of the split depends on the size and specifics of the dataset,
  although a split of roughly 67% for training and 33% for testing is common.
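
Because the two classes are not equally frequent, one option worth considering is a stratified split, which keeps the malignant/benign ratio nearly identical in both parts. The sketch below uses random features and a label vector with a 212/357 class split (the class counts of the standard Wisconsin diagnostic data); `X_demo` and `y_demo` are illustrative names, and `stratify` is a standard `train_test_split` parameter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Illustrative data: 569 samples, 212 positives and 357 negatives.
X_demo = rng.normal(size=(569, 5))
y_demo = np.array([1] * 212 + [0] * 357)

# stratify=y_demo preserves the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=0
)

print(len(y_tr), len(y_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```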

In [None]:
# splitting data into training and test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)

In [None]:
# scaling data

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
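
The scaler is fitted on the training data only, so no information from the test set leaks into the preprocessing. The same guarantee can be kept during cross-validation by wrapping the scaler and the model in a pipeline. This is a minimal sketch on synthetic data from `make_classification`, assuming logistic regression as the model; it is not the notebook's actual workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is re-fitted inside every fold on that fold's training part only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print(scores.mean())
```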

##Logistic Regression

In [None]:
# fitting data to model

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

In [None]:
# model predictions

y_pred = log_reg.predict(X_test)

In [None]:
# accuracy score

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print(accuracy_score(y_train, log_reg.predict(X_train)))

log_reg_acc = accuracy_score(y_test, log_reg.predict(X_test))
print(log_reg_acc)

In [None]:
# confusion matrix

print(confusion_matrix(y_test, y_pred))

In [None]:
# classification report

print(classification_report(y_test, y_pred))
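
For a cancer screening task, accuracy alone can hide missed malignant cases, so it is useful to read sensitivity and specificity off the confusion matrix. The sketch below uses small hand-written label lists (not model output) to show how the four cells map to those two rates.

```python
from sklearn.metrics import confusion_matrix

# Hand-made example: 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)   # share of malignant cases that were detected
specificity = tn / (tn + fp)   # share of benign cases correctly cleared

print(sensitivity, specificity)
```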

##K Neighbors Classifier (KNN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

In [None]:
# model predictions

y_pred = knn.predict(X_test)

In [None]:
# accuracy score

print(accuracy_score(y_train, knn.predict(X_train)))

knn_acc = accuracy_score(y_test, knn.predict(X_test))
print(knn_acc)

In [None]:
# confusion matrix

print(confusion_matrix(y_test, y_pred))

In [None]:
# classification report

print(classification_report(y_test, y_pred))

##Support Vector Machine (SVM)

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svc = SVC(probability=True)
parameters = {
    'gamma' : [0.0001, 0.001, 0.01, 0.1],
    'C' : [0.01, 0.05, 0.5, 0.1, 1, 10, 15, 20]
}

grid_search = GridSearchCV(svc, parameters)
grid_search.fit(X_train, y_train)

In [None]:
# best parameters

grid_search.best_params_

In [None]:
# best score

grid_search.best_score_

In [None]:
svc = SVC(C = 10, gamma = 0.01, probability=True)
svc.fit(X_train, y_train)
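
Rather than copying the best values by hand, the fitted search object already holds a model refitted with them in `best_estimator_`, which avoids transcription mistakes if the grid or data changes. A small sketch on synthetic data, with a reduced grid chosen only to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {'gamma': [0.001, 0.01, 0.1], 'C': [0.1, 1, 10]}
search = GridSearchCV(SVC(probability=True), param_grid, cv=5)
search.fit(X_demo, y_demo)

# With the default refit=True, this is an SVC trained on all of X_demo
# using the best parameter combination found.
best_svc = search.best_estimator_

print(search.best_params_)
```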

In [None]:
# model predictions

y_pred = svc.predict(X_test)

In [None]:
# accuracy score

print(accuracy_score(y_train, svc.predict(X_train)))

svc_acc = accuracy_score(y_test, svc.predict(X_test))
print(svc_acc)

In [None]:
# confusion matrix

print(confusion_matrix(y_test, y_pred))

In [None]:
# classification report

print(classification_report(y_test, y_pred))

##Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()

parameters = {
    'criterion' : ['gini', 'entropy'],
    'max_depth' : range(2, 32, 1),
    'min_samples_leaf' : range(1, 10, 1),
    'min_samples_split' : range(2, 10, 1),
    'splitter' : ['best', 'random']
}

grid_search_dt = GridSearchCV(dtc, parameters, cv = 5, n_jobs = -1, verbose = 1)
grid_search_dt.fit(X_train, y_train)

In [None]:
# best parameters

grid_search_dt.best_params_

In [None]:
# best score

grid_search_dt.best_score_

In [None]:
dtc = DecisionTreeClassifier(criterion= 'entropy', max_depth= 19, min_samples_leaf= 4, min_samples_split= 6, splitter= 'random')
dtc.fit(X_train, y_train)

In [None]:
y_pred = dtc.predict(X_test)

In [None]:
# accuracy score

print(accuracy_score(y_train, dtc.predict(X_train)))

dtc_acc = accuracy_score(y_test, dtc.predict(X_test))
print(dtc_acc)

In [None]:
# confusion matrix

print(confusion_matrix(y_test, y_pred))

In [None]:
# classification report

print(classification_report(y_test, y_pred))

##Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

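# max_features='auto' was removed in scikit-learn 1.3; 'sqrt' is the equivalent setting for classifiers
rand_clf = RandomForestClassifier(criterion = 'entropy', max_depth = 10, max_features = 'sqrt', min_samples_leaf = 2, min_samples_split = 3, n_estimators = 130)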
rand_clf.fit(X_train, y_train)

In [None]:
y_pred = rand_clf.predict(X_test)

In [None]:
# accuracy score

print(accuracy_score(y_train, rand_clf.predict(X_train)))

ran_clf_acc = accuracy_score(y_test, y_pred)
print(ran_clf_acc)

In [None]:
# confusion matrix

print(confusion_matrix(y_test, y_pred))

In [None]:
# classification report

print(classification_report(y_test, y_pred))

##Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier()

parameters = {
    # 'deviance' was renamed to 'log_loss' in scikit-learn 1.1 and removed in 1.3
    'loss': ['log_loss', 'exponential'],
    'learning_rate': [0.001, 0.1],
    'n_estimators': [100, 150, 180]
}

grid_search_gbc = GridSearchCV(gbc, parameters, cv = 2, n_jobs = -5, verbose = 1)
grid_search_gbc.fit(X_train, y_train)

In [None]:
# best parameters

grid_search_gbc.best_params_

In [None]:
# best score

grid_search_gbc.best_score_

In [None]:
gbc = GradientBoostingClassifier(learning_rate = 0.1, loss = 'exponential', n_estimators = 180)
gbc.fit(X_train, y_train)

In [None]:
y_pred = gbc.predict(X_test)

In [None]:
# accuracy score

print(accuracy_score(y_train, gbc.predict(X_train)))

gbc_acc = accuracy_score(y_test, y_pred)
print(gbc_acc)

In [None]:
# confusion matrix

print(confusion_matrix(y_test, y_pred))

In [None]:
# classification report

print(classification_report(y_test, y_pred))

##Extreme Gradient Boosting

In [None]:
from xgboost import XGBClassifier

xgb = XGBClassifier(objective = 'binary:logistic', learning_rate = 0.01, max_depth = 5, n_estimators = 180)

xgb.fit(X_train, y_train)

In [None]:
y_pred = xgb.predict(X_test)

In [None]:
# accuracy score

print(accuracy_score(y_train, xgb.predict(X_train)))

xgb_acc = accuracy_score(y_test, y_pred)
print(xgb_acc)

In [None]:
# confusion matrix

print(confusion_matrix(y_test, y_pred))

In [None]:
# classification report

print(classification_report(y_test, y_pred))

##Model Comparison

In [None]:
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'KNN', 'SVM', 'Decision Tree Classifier', 'Random Forest Classifier', 'Gradient Boosting Classifier', 'XgBoost'],
    'Score': [100*round(log_reg_acc,4), 100*round(knn_acc,4), 100*round(svc_acc,4), 100*round(dtc_acc,4), 100*round(ran_clf_acc,4),
              100*round(gbc_acc,4), 100*round(xgb_acc,4)]
})
models.sort_values(by = 'Score', ascending = False)

In [None]:
import pickle

# persist the fitted SVM; the context manager closes the file handle
model = svc
with open("breast_cancer.pkl", 'wb') as f:
    pickle.dump(model, f)
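
The SVM was trained on standardized features, so a saved model alone expects inputs that have already been scaled the same way. One option is to pickle the scaler and the model together as a single pipeline. The sketch below does an in-memory round trip on synthetic data; the hyperparameters mirror the SVC above, and everything else is illustrative.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Scaler and classifier travel together, so raw features can be passed at predict time.
pipe = make_pipeline(StandardScaler(), SVC(C=10, gamma=0.01, probability=True))
pipe.fit(X_demo, y_demo)

# Serialize to bytes and load back, as a file round trip would.
payload = pickle.dumps(pipe)
restored = pickle.loads(payload)

same = np.array_equal(pipe.predict(X_demo), restored.predict(X_demo))
print(same)
```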

In [None]:
from sklearn import metrics
plt.figure(figsize=(8,5))
models = [
{
    'label': 'LR',
    'model': log_reg,
},
{
    'label': 'DT',
    'model': dtc,
},
{
    'label': 'SVM',
    'model': svc,
},
{
    'label': 'KNN',
    'model': knn,
},
{
    'label': 'XGBoost',
    'model': xgb,
},
{
    'label': 'RF',
    'model': rand_clf,
},
{
    'label': 'GBDT',
    'model': gbc,
}
]
for m in models:
    model = m['model']  # every model was already fitted in its own section above
    # ROC and AUC are both computed from predicted probabilities, not hard 0/1 labels
    y_score = model.predict_proba(X_test)[:,1]
    fpr1, tpr1, thresholds = metrics.roc_curve(y_test, y_score)
    auc = metrics.roc_auc_score(y_test, y_score)
    plt.plot(fpr1, tpr1, label='%s - ROC (area = %0.2f)' % (m['label'], auc))

plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.01, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity (False Positive Rate)', fontsize=12)
plt.ylabel('Sensitivity (True Positive Rate)', fontsize=12)
plt.title('ROC - Breast Cancer Prediction', fontsize=12)
plt.legend(loc="lower right", fontsize=12)
plt.show()
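
A ROC curve and its area are defined over continuous scores such as predicted probabilities. Passing hard 0/1 predictions to `roc_auc_score` collapses the curve to a single threshold and usually gives a different, lower value. The sketch below shows the gap on a tiny hand-made example; the scores are invented for illustration.

```python
from sklearn.metrics import roc_auc_score

# Hand-made example: true labels and predicted probabilities of class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]

# Hard labels obtained by thresholding the scores at 0.5.
hard_labels = [int(s >= 0.5) for s in scores]

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_scores, auc_from_labels)
```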

In [None]:
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
models = [
{
    'label': 'LR',
    'model': log_reg,
},
{
    'label': 'DT',
    'model': dtc,
},
{
    'label': 'SVM',
    'model': svc,
},
{
    'label': 'KNN',
    'model': knn,
},
{
    'label': 'XGBoost',
    'model': xgb,
},
{
    'label': 'RF',
    'model': rand_clf,
},
{
    'label': 'GBDT',
    'model': gbc,
}
]

means_roc = []
means_accuracy = [100*round(log_reg_acc,4), 100*round(dtc_acc,4), 100*round(svc_acc,4), 100*round(knn_acc,4), 100*round(xgb_acc,4),
                  100*round(ran_clf_acc,4), 100*round(gbc_acc,4)]

for m in models:
    model = m['model']  # every model was already fitted in its own section above
    # AUC is computed from predicted probabilities, not hard 0/1 labels
    y_score = model.predict_proba(X_test)[:,1]
    auc = metrics.roc_auc_score(y_test, y_score)
    auc = 100*round(auc,4)
    means_roc.append(auc)

print(means_accuracy)
print(means_roc)

# data to plot
n_groups = 7
means_accuracy = tuple(means_accuracy)
means_roc = tuple(means_roc)

# create plot
fig, ax = plt.subplots(figsize=(8,5))
index = np.arange(n_groups)
bar_width = 0.35
opacity = 0.8

rects1 = plt.bar(index, means_accuracy, bar_width,
alpha=opacity,
color='mediumpurple',
label='Accuracy (%)')

rects2 = plt.bar(index + bar_width, means_roc, bar_width,
alpha=opacity,
color='rebeccapurple',
label='ROC AUC (%)')

plt.xlim([-1, 8])
plt.ylim([70, 104])

plt.title('Performance Evaluation - Breast Cancer Prediction', fontsize=12)
# center each tick label between its pair of bars
plt.xticks(index + bar_width / 2, ('LR', 'DT', 'SVM', 'KNN', 'XGBoost', 'RF', 'GBDT'), rotation=40, ha='center', fontsize=12)
plt.legend(loc="upper right", fontsize=10)
plt.show()

##Summary


Worked through a classification predictive modeling machine learning problem from end-to-end using Python. Specifically, the steps covered were:

- Problem Definition (Breast Cancer data).
- Loading the Dataset.
- Analyze Data (same scale but different distributions of data).
- Evaluate Algorithms (KNN looked good).
- Evaluate Algorithms with Standardization (KNN and SVM looked good).
- Finalize Model (use all training data and confirm using validation dataset).