# <center> Breast Cancer Prediction using XGBoost </center>
## <center> Authored by: Pratham Tripathi </center>

# Contents:

## <u>1.Aim:</u>
To predict whether a patient's cells are malignant (1) or benign (0).
## <u>2.Approach:</u> 
The approach is to build a classifier that can make this prediction efficiently.
## <u>3.Model:</u> 
The model used here is XGBoost Classifier.
## <u>4.About the model:</u> 
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately.
With proper tuning, the model can identify the significant columns on its own and produce an effective model that generalizes well.
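
To make the idea of gradient boosting concrete, here is a minimal NumPy sketch. It is not XGBoost's actual algorithm (which uses second-order gradients, regularization and deeper trees): each round simply fits a one-split stump to the negative gradient of the log loss and adds it to the running score with a learning rate. The helper names `fit_stump` and `boost` and the synthetic one-feature data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_stump(x, residual):
    """Pick the threshold on a 1-D feature that best fits the residuals (least squares)."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the maximum so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return t, left_value, right_value

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Additively build a score (margin) from stumps fitted to the log-loss gradient."""
    margin = np.zeros_like(x, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - sigmoid(margin)  # negative gradient of log loss w.r.t. the margin
        t, left_value, right_value = fit_stump(x, residual)
        margin += learning_rate * np.where(x <= t, left_value, right_value)
        stumps.append((t, left_value, right_value))
    return stumps, margin

# Synthetic, well-separated two-class data (not the breast cancer dataset)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 0.5, 50), rng.normal(1, 0.5, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])

stumps, margin = boost(x, y)
train_accuracy = ((sigmoid(margin) > 0.5) == y).mean()
print(train_accuracy)
```

The parameters `n_estimators` and `learning_rate` used later in this notebook play roughly the roles of `n_rounds` and `learning_rate` in this sketch.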

# Importing Major Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sn

%matplotlib inline

# Reading The CSV file

In [None]:
df = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")
df = df.drop(['Unnamed: 32'], axis=1)
df.head()
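
The notebook does not say where the `Unnamed: 32` column comes from. One common cause, assumed here, is a trailing comma at the end of every line of the CSV, which pandas reads as an extra, empty column. The sketch below reproduces that situation with a made-up three-column file, so the column is called `Unnamed: 3` rather than `Unnamed: 32`.

```python
import io
import pandas as pd

# Hypothetical CSV in which every line ends with a trailing comma
csv_text = "id,diagnosis,radius_mean,\n1,M,17.99,\n2,B,13.54,\n"
demo = pd.read_csv(io.StringIO(csv_text))

# The empty fourth header field is given a placeholder name by pandas
print(demo.columns.tolist())

# The extra column holds only missing values, so dropping it loses nothing
all_missing = demo["Unnamed: 3"].isnull().all()
demo = demo.drop(columns=["Unnamed: 3"])
print(demo)
```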

# Data Analysis 

In [None]:
df.dtypes

In [None]:
df.isnull().sum()

In [None]:
df["diagnosis"].unique()

# Label Encoding "Diagnosis"

In [None]:
le = LabelEncoder()
le.fit(["M","B"])
df["diagnosis"] = le.transform(df["diagnosis"])
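
To confirm which integer each diagnosis letter receives, the fitted encoder can be inspected. `LabelEncoder` stores its classes in sorted order, so the order passed to `fit` does not matter. The small sketch below uses a separate encoder named `le_demo` (a name chosen for this example) and a hand-written list of labels.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
le_demo.fit(["M", "B"])

# classes_ is sorted, and each class's position is its integer code
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# Round trip: letters to codes and back
encoded = le_demo.transform(["M", "B", "B"]).tolist()
decoded = le_demo.inverse_transform([1, 0]).tolist()
print(encoded, decoded)
```

This matches the convention stated in the aim: benign is 0 and malignant is 1.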

In [None]:
df.head()

# Feature and Target Set

In [None]:
# Every column except the target; any identifier column in the CSV (e.g. an id) stays in X
X = df.drop(columns=["diagnosis"])
X.head()

In [None]:
y = df["diagnosis"]
y.head()

# Creating DMatrix for xgb.cv()

In [None]:
dmatrix = xgb.DMatrix(data = X,label = y)

# Stratified Splitting of Data

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test = train_test_split(X,y,test_size = 0.3,random_state = 123, stratify = y)
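
The effect of `stratify` is that the class proportions in the train and test sets match those of the full data. The sketch below checks this on synthetic labels (700 of class 0 and 300 of class 1, numbers chosen only for illustration) using the same `test_size` and `random_state` as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels, not the real dataset
y_demo = np.array([0] * 700 + [1] * 300)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123, stratify=y_demo
)

# Share of class 1 in each split; both should be close to the overall 0.3
train_share = y_tr.mean()
test_share = y_te.mean()
print(train_share, test_share)
```

Without `stratify`, the two shares can drift apart by chance, which matters most when one class is much rarer than the other.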

# XGBoost Classifier Model

In [None]:
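# "binary:logistic" is the objective meant for two-class problems;
# reg_alpha is the scikit-learn wrapper's name for the L1 regularization term (alpha)
xgb_css = xgb.XGBClassifier(n_estimators=100, objective="binary:logistic", colsample_bytree=0.3, learning_rate=0.1, max_depth=5, reg_alpha=10)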

# Training the Model

In [None]:
xgb_css.fit(X_train,y_train)

# Predicting Outcomes

In [None]:
pred = xgb_css.predict(X_test)
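
`predict` returns hard class labels rather than probabilities. For a binary logistic model this can be thought of as squashing a raw score through a sigmoid and thresholding the resulting probability at 0.5. The sketch below illustrates that idea with made-up raw scores; it is a conceptual illustration, not XGBoost's internal code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores (margins) for five patients
raw_scores = np.array([-2.0, -0.1, 0.0, 0.3, 4.0])

# Probability of the positive class (malignant = 1)
probabilities = sigmoid(raw_scores)

# Label 1 only when the probability is strictly above 0.5
labels = (probabilities > 0.5).astype(int)
print(probabilities.round(3), labels)
```

With the fitted classifier, `xgb_css.predict_proba(X_test)[:, 1]` gives the class-1 probabilities, so a different threshold could be applied if missing a malignant case is considered more costly than a false alarm.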

# Confusion Matrix

In [None]:
# Evaluation
from sklearn.metrics import confusion_matrix, classification_report
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion Matrix',
                          cmap='Wistia'):
    """Print and plot a confusion matrix (rows: true labels, columns: predictions)."""
    if normalize:
        # Divide each row by its total so every row sums to 1
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("After Normalization")
    else:
        print("Without Normalization")
    print(cm)
    # Draw the matrix with the requested colormap
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    # White labels assume a dark notebook theme
    plt.xticks(tick_marks, classes, rotation=0, color='white')
    plt.yticks(tick_marks, classes, rotation=0, color='white')
    # Two decimals for normalized values, integers for raw counts
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2
    # Write each cell's value, switching text color for contrast
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color='white' if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.xlabel("Predicted", color='white', size=20)
    plt.ylabel("True", color='white', size=20)

In [None]:
cnf_matrix=confusion_matrix(y_test,pred,labels=[0,1])
np.set_printoptions(precision = 2)
plt.figure()
plot_confusion_matrix(cnf_matrix,classes=['benign(0)','malignant(1)'],normalize=False,title='Confusion Matrix')
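
The `normalize=True` option of the plotting helper divides each row of the matrix by its row total. The sketch below applies the same arithmetic to a made-up 2x2 matrix (the counts are hypothetical, not this notebook's results) to show that each row then sums to 1 and the diagonal holds the per-class recall.

```python
import numpy as np

# Hypothetical counts: rows are true classes, columns are predicted classes
cm = np.array([[105, 2],
               [5, 59]])

# Row-wise normalization, as in plot_confusion_matrix(normalize=True)
cm_normalized = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]
row_sums = cm_normalized.sum(axis=1)

# After normalization the diagonal is the recall of each class
recall_per_class = np.diag(cm_normalized)
print(cm_normalized.round(2), recall_per_class.round(3))
```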

# Classification Report

In [None]:
print(classification_report(y_test,pred))

# F1 Score of the Model

In [None]:
from sklearn.metrics import f1_score
f1_score(y_test,pred,average='weighted')
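
With `average='weighted'`, the F1 score is computed per class and then averaged using each class's number of true samples (its support) as the weight. The sketch below reproduces that calculation by hand from a hypothetical confusion matrix; the counts are invented for illustration and are not the results of this notebook.

```python
import numpy as np

# Hypothetical counts: rows are true classes, columns are predicted classes
cm = np.array([[105, 2],
               [5, 59]])

tp = np.diag(cm).astype(float)   # correct predictions per class
fp = cm.sum(axis=0) - tp         # predicted as the class but actually another
fn = cm.sum(axis=1) - tp         # actually the class but predicted as another

# F1 = 2TP / (2TP + FP + FN), one value per class
f1_per_class = 2 * tp / (2 * tp + fp + fn)

# Weight each class's F1 by its support (row total)
support = cm.sum(axis=1)
weighted_f1 = float((f1_per_class * support).sum() / support.sum())
print(f1_per_class.round(3), round(weighted_f1, 3))
```

Because the weights are class sizes, the weighted score leans toward the majority class; the per-class values in the classification report show whether the minority class is doing noticeably worse.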

# K-fold Cross Validation using xgb.cv()

In [None]:
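# Booster parameters for xgb.cv (native API names, e.g. alpha for L1 regularization)
params = {"objective": "binary:logistic", "colsample_bytree": 0.3, "learning_rate": 0.1, "max_depth": 5, "alpha": 10}
# num_boost_round is left at its default of 10, so each fold trains at most 10 trees
cv_results = xgb.cv(dtrain=dmatrix, params=params, nfold=3, early_stopping_rounds=10, metrics="error", as_pandas=True, seed=123)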
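
The idea behind `nfold = 3` is that the rows are divided into three folds, and each fold serves once as the validation set while the other two are used for training. The sketch below shows only that index bookkeeping with NumPy on ten hypothetical rows; the helper `kfold_indices` is a name invented here, and this is not how `xgb.cv` is implemented internally.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed):
    """Shuffle row indices and cut them into n_folds nearly equal validation folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    return np.array_split(shuffled, n_folds)

n_samples = 10  # tiny hypothetical dataset
folds = kfold_indices(n_samples, n_folds=3, seed=123)

for k, val_idx in enumerate(folds):
    # Training rows are all rows that are not in the current validation fold
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
    print(f"fold {k}: train={len(train_idx)}, validation={len(val_idx)}")

fold_sizes = [len(f) for f in folds]
all_validation = np.sort(np.concatenate(folds))
```

For every boosting round, `xgb.cv` then reports the mean and standard deviation of the chosen metric across the folds, in columns such as `test-error-mean` and `test-error-std`.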

# CV Results

In [None]:
cv_results.head()

# Last Validation Score

In [None]:
print((cv_results["test-error-mean"]).tail(1))

# About Test-error-Mean and Accuracy
The cross-validated test error is low (about 0.06, i.e. roughly 94% accuracy), and the weighted F1 score on the held-out test set is high (0.96).
Hence, the model performs well on this dataset.