# Breast Cancer Prediction

## Data Set Information:

There are 9 predictors, all quantitative, and a binary dependent variable indicating the presence or absence of breast cancer.
The predictors are anthropometric data and parameters which can be gathered in routine blood analysis.
Prediction models based on these predictors, if accurate, can potentially be used as a biomarker of breast cancer.

## Attribute Information:

Quantitative Attributes:
- Age (years)
- BMI (kg/m2)
- Glucose (mg/dL)
- Insulin (µU/mL)
- HOMA
- Leptin (ng/mL)
- Adiponectin (µg/mL)
- Resistin (ng/mL)
- MCP-1 (pg/dL)

Labels:
- 1=Healthy controls
- 2=Patients

Abbreviations:
- BC: Breast cancer
- BMI: Body mass index
- HOMA: Homeostasis Model Assessment
- MCP-1: Chemokine Monocyte Chemoattractant Protein 1

## Goals
The goal of this exploratory study was to develop and assess a prediction model which can potentially
be used as a biomarker of breast cancer, based on anthropometric data and parameters which can be gathered in
routine blood analysis.

In medical testing, the two key indicators of success are sensitivity and specificity. Every medical test strives to reach 100% on both criteria.

Sensitivity (recall):
- How good a test is at detecting positives: the share of actual positives it flags. A test can trivially maximize this by always returning "positive".

Specificity:
- How good a test is at avoiding false alarms: the share of actual negatives it clears. A test can trivially maximize this by always returning "negative".
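
To make the two definitions concrete, here is a minimal sketch that computes both metrics from lists of true and predicted labels, with 1 standing for patients and 0 for healthy controls. The helper name `sensitivity_specificity` and the toy labels are illustrative assumptions, not part of the original analysis.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # patients correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # patients missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls falsely flagged
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels (made up for illustration): 4 patients followed by 4 healthy controls
y_true = [1, 1, 1, 1, 0, 0, 0, 0]

# A test that always answers "positive": perfect sensitivity, zero specificity
always_positive = [1] * len(y_true)
print(sensitivity_specificity(y_true, always_positive))  # (1.0, 0.0)

# A test that always answers "negative": the opposite extreme
always_negative = [0] * len(y_true)
print(sensitivity_specificity(y_true, always_negative))  # (0.0, 1.0)
```

Neither trivial test is useful, which is why the two metrics have to be read together.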

# Data Exploration

## Load libraries and read the data

In [None]:
# importing the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score,roc_curve, auc
import warnings
warnings.filterwarnings('ignore')

import plotly.express as px
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.figure_factory as ff

In [None]:
#importing the dataset 
df = pd.read_csv("../input/breast-cancer-coimbra-data-set/dataR2.csv")

## Description

In [None]:
df.info()

## Data Column Identification

In [None]:
df.columns

## Numeric data

In [None]:
numeric=['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_num=df.select_dtypes(include=numeric)
df_num.head(3)

In [None]:
df.Classification.unique()

# Exploratory Data Analysis (EDA)

## Recoding the 'Classification' labels

In [None]:
# Recode the labels: 1 (healthy control) -> 0, 2 (patient) -> 1
df['Classification'] = df.Classification.map({1:0, 2:1})
df

## Statistical Summary

In [None]:
describeNum = df.describe(include =['float64', 'int64', 'float', 'int'])
describeNum.T.style.background_gradient(cmap='viridis',low=0.2,high=0.1)

## Missing Values

In [None]:
null=pd.DataFrame(df.isnull().sum(),columns=["Null Values"])
null["% Missing Values"]=(df.isna().sum()/len(df)*100)
null = null[null["% Missing Values"] > 0]
null.style.background_gradient(cmap='viridis',low =0.2,high=0.1) 

## Graphic Approach

### Correlation heatmap

In [None]:
df.corr()

In [None]:
plt.figure(figsize=(30,20))
ax = sns.heatmap(data = df.corr(),cmap='YlGnBu',annot=True)

bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5,top - 0.5)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,10))
df_cor = df.corr()
# mask for the upper triangle (np.bool was removed from NumPy, so use the built-in bool)
half = np.triu(np.ones_like(df_cor, dtype=bool))

my_colors = ['#bbeb44','#1db1cf']
cmap = matplotlib.colors.LinearSegmentedColormap.from_list('Custom', my_colors)

heatmap = sns.heatmap(df_cor, 
            square=True, 
            mask=half,
            linewidth=2.5, 
            vmax=0.4, vmin=0, 
            cmap=cmap, 
            cbar=False, 
            ax=ax,annot=True)

heatmap.set(title="Heatmap of continuous variables")
heatmap.set_yticklabels(heatmap.get_yticklabels(), rotation = 0)
heatmap.spines['top'].set_visible(True)
plt.tight_layout()

### Scatter plot

In [None]:
fig, ax = plt.subplots()
_ = plt.scatter(x=df['HOMA'], y=df['Insulin'], edgecolors="#000000", linewidths=0.5)
_ = ax.set(xlabel="HOMA", ylabel="Insulin")
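
HOMA is a derived index rather than an independent measurement, so a strong relationship with insulin is plausible. One widely used formulation of HOMA-IR multiplies fasting glucose (mg/dL) by fasting insulin (µU/mL) and divides by 405. Whether this dataset used exactly that formula is an assumption here, and the helper name `homa_ir` and the input values below are made up for illustration.

```python
def homa_ir(glucose_mg_dl, insulin_uU_ml):
    """HOMA-IR under the common glucose (mg/dL) * insulin (µU/mL) / 405 formulation."""
    return glucose_mg_dl * insulin_uU_ml / 405.0


# Hypothetical fasting values, not taken from the dataset
print(homa_ir(90.0, 9.0))   # 2.0

# At fixed glucose, doubling insulin doubles the index
print(homa_ir(90.0, 18.0))  # 4.0
```

If the assumption holds for this data, `df['Glucose'] * df['Insulin'] / 405` computed on the untransformed columns should closely track `df['HOMA']`.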

In [None]:
fig, ax = plt.subplots()
_ = plt.scatter(x=df['Leptin'], y=df['BMI'], edgecolors="#000000", linewidths=0.5)
_ = ax.set(xlabel="Leptin", ylabel="BMI")

### Box plots

In [None]:
featuresNum = ['Age', 'BMI', 'Glucose', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin', 'Resistin', 'MCP.1']
plt.figure(figsize=(15, 7))
for i in range(0, len(featuresNum)):
    plt.subplot(1, len(featuresNum), i+1)
    sns.boxplot(y=df[featuresNum[i]], color='#1db1cf', orient='v')
    plt.tight_layout()

### How many subjects have breast cancer, and how many do not?

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

_ = sns.countplot(x="Classification", data=df, palette="nipy_spectral",
              order=df.Classification.value_counts().index)

_ = ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
_ = ax.set(xlabel="Classification", ylabel="No. of patients")
plt.legend(bbox_to_anchor=(0.945, 0.90))

In [None]:
# Number of breast cancer patients (Classification == 1)
len(df[(df["Classification"] == 1)].index)

In [None]:
# Number of healthy controls (Classification == 0)
len(df[(df["Classification"] == 0)].index)

### How is age distributed among patients and healthy controls?

In [None]:
fig, ax = plt.subplots(figsize=(17,10))

sns.countplot(x="Age", hue="Classification", data=df, palette="nipy_spectral",
              order=df.Age.value_counts().index)
plt.legend(bbox_to_anchor=(0.945, 0.90))

### Breast Cancer Patients Corresponding to Age

In [None]:
fig = px.histogram(df, x="Age",color="Classification",
                   marginal="box",
                   hover_data=df.columns,
                  color_discrete_sequence=['#bbeb44','#1db1cf'])
fig.update_layout(
    title="Breast Cancer Patients Corresponding to Age"
)
fig.show()

### Breast Cancer Patients corresponding to Glucose

In [None]:
more = df[df['Classification']==1]['Glucose']
less = df[df['Classification']==0]['Glucose']
fig = ff.create_distplot([less, more],['Healthy Control', 'Breast Cancer Patients']
                         , show_hist=False, 
                        colors=['#bbeb44','#1db1cf'])
fig.update_layout(
    title="Breast Cancer Patients Corresponding to Glucose",
    xaxis_title="Glucose",
)
fig.show()

In [None]:
more = df[df['Classification']==1]['Glucose']
less = df[df['Classification']==0]['Glucose']
fig = ff.create_distplot([less, more],['Healthy Control', 'Breast Cancer Patients']
                         , bin_size=5,
                        colors=['#bbeb44','#1db1cf'])
fig.update_layout(
    title="Breast Cancer Patients Corresponding to Glucose",
    xaxis_title="Glucose",
)
fig.show()

### Breast Cancer Patients corresponding to Resistin

In [None]:
more = df[df['Classification']==1]['Resistin']
less = df[df['Classification']==0]['Resistin']
fig = ff.create_distplot([less, more],['Healthy Control', 'Breast Cancer Patients']
                         , show_hist=False, 
                        colors=['#bbeb44','#1db1cf'])
fig.update_layout(
    title="Breast Cancer Patients Corresponding to Resistin",
    xaxis_title="Resistin",
)
fig.show()

In [None]:
more = df[df['Classification']==1]['Resistin']
less = df[df['Classification']==0]['Resistin']
fig = ff.create_distplot([less, more],['Healthy Control', 'Breast Cancer Patients']
                         , bin_size=5,
                        colors=['#bbeb44','#1db1cf'])
fig.update_layout(
    title="Breast Cancer Patients Corresponding to Resistin",
    xaxis_title="Resistin",
)
fig.show()

### Breast Cancer Patients corresponding to BMI

In [None]:
more = df[df['Classification']==1]['BMI']
less = df[df['Classification']==0]['BMI']
fig = ff.create_distplot([less, more],['Healthy Control', 'Breast Cancer Patients']
                         , show_hist=False, 
                        colors=['#bbeb44','#1db1cf'])
fig.update_layout(
    title="Breast Cancer Patients Corresponding to BMI",
    xaxis_title="BMI",
)
fig.show()

In [None]:
more = df[df['Classification']==1]['BMI']
less = df[df['Classification']==0]['BMI']
fig = ff.create_distplot([less, more],['Healthy Control', 'Breast Cancer Patients']
                         , bin_size=5,
                        colors=['#bbeb44','#1db1cf'])
fig.update_layout(
    title="Breast Cancer Patients Corresponding to BMI",
    xaxis_title="BMI",
)
fig.show()

### Breast Cancer Patients corresponding to Insulin

In [None]:
more = df[df['Classification']==1]['Insulin']
less = df[df['Classification']==0]['Insulin']
fig = ff.create_distplot([less, more],['Healthy Control', 'Breast Cancer Patients']
                         , show_hist=False, 
                        colors=['#bbeb44','#1db1cf'])
fig.update_layout(
    title="Breast Cancer Patients Corresponding to Insulin",
    xaxis_title="Insulin",
)
fig.show()

### Breast Cancer Patients corresponding to HOMA

In [None]:
more = df[df['Classification']==1]['HOMA']
less = df[df['Classification']==0]['HOMA']
fig = ff.create_distplot([less, more],['Healthy Control', 'Breast Cancer Patients']
                         , show_hist=False, 
                        colors=['#bbeb44','#1db1cf'])
fig.update_layout(
    title="Breast Cancer Patients Corresponding to HOMA",
    xaxis_title="HOMA",
)
fig.show()

### Breast Cancer Patients corresponding to Leptin

In [None]:
more = df[df['Classification']==1]['Leptin']
less = df[df['Classification']==0]['Leptin']
fig = ff.create_distplot([less, more],['Healthy Control', 'Breast Cancer Patients']
                         , show_hist=False, 
                        colors=['#bbeb44','#1db1cf'])
fig.update_layout(
    title="Breast Cancer Patients Corresponding to Leptin",
    xaxis_title="Leptin",
)
fig.show()

### Breast Cancer Patients corresponding to Adiponectin

In [None]:
more = df[df['Classification']==1]['Adiponectin']
less = df[df['Classification']==0]['Adiponectin']
fig = ff.create_distplot([less, more],['Healthy Control', 'Breast Cancer Patients']
                         , show_hist=False, 
                        colors=['#bbeb44','#1db1cf'])
fig.update_layout(
    title="Breast Cancer Patients Corresponding to Adiponectin",
    xaxis_title="Adiponectin",
)
fig.show()

### Breast Cancer Patients corresponding to MCP.1

In [None]:
more = df[df['Classification']==1]['MCP.1']
less = df[df['Classification']==0]['MCP.1']
fig = ff.create_distplot([less, more],['Healthy Control', 'Breast Cancer Patients']
                         , show_hist=False, 
                        colors=['#bbeb44','#1db1cf'])
fig.update_layout(
    title="Breast Cancer Patients Corresponding to MCP.1",
    xaxis_title="MCP.1",
)
fig.show()

# Data Preprocessing

## Duplicate Values

In [None]:
# check for fully duplicated rows
duplicate = df[df.duplicated()]

print("Duplicate Rows :", len(duplicate))
duplicate

## Negative value

In [None]:
columns=['Age', 'BMI', 'Glucose', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin', 'Resistin', 'MCP.1']

# none of these measurements can be negative, so any negative value signals a data error
for col in columns:
    print("Is there any negative value in '{}' column  : {} ".format(col,(df[col]<0).any()))

## Outliers

In [None]:
featuresNum = ['Age', 'BMI', 'Glucose', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin', 'Resistin', 'MCP.1']
plt.figure(figsize=(15, 7))
for i in range(0, len(featuresNum)):
    plt.subplot(1, len(featuresNum), i+1)
    sns.boxplot(y=df[featuresNum[i]], color='green', orient='v')
    plt.tight_layout()
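
Box plots conventionally draw their whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles and show anything outside as an individual point. The same rule can be applied numerically to list the flagged values. This is a minimal standard-library sketch; the helper name `iqr_outliers` and the sample values are illustrative assumptions, and plotting libraries may interpolate quartiles slightly differently.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up sample with one extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(iqr_outliers(sample))  # [100]
```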

## Log-transforming variables

In [None]:
df["Glucose"] = np.log(df["Glucose"])
df["Insulin"] = np.log(df["Insulin"])
df["HOMA"] = np.log(df["HOMA"])
df["Leptin"] = np.log(df["Leptin"])
df["Adiponectin"] = np.log(df["Adiponectin"])
df["Resistin"] = np.log(df["Resistin"])
df["MCP.1"] = np.log(df["MCP.1"])
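
The natural logarithm compresses large values far more than small ones, which pulls in long right tails, but it is only defined for strictly positive inputs. The sketch below illustrates both points with the standard library; the helper name `log_transform` and the values are illustrative assumptions rather than dataset values.

```python
import math


def log_transform(values):
    """Apply the natural log, refusing inputs for which it is undefined."""
    if any(v <= 0 for v in values):
        raise ValueError("log is undefined for non-positive values")
    return [math.log(v) for v in values]


# Each raw value is 10 times the previous one
raw = [1.0, 10.0, 100.0, 1000.0]
logged = log_transform(raw)

# After the transform, equal ratios become equal gaps of log(10)
gaps = [b - a for a, b in zip(logged, logged[1:])]
print(gaps)
```

This is why the negative-value check earlier in the preprocessing matters: a zero or negative entry in any transformed column would make `np.log` return `-inf` or `NaN`.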

In [None]:
featuresNum = ['Age', 'BMI', 'Glucose', 'Resistin', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin', 'MCP.1']
plt.figure(figsize=(15, 7))
for i in range(0, len(featuresNum)):
    plt.subplot(1, len(featuresNum), i+1)
    sns.boxplot(y=df[featuresNum[i]], color='green', orient='v')
    plt.tight_layout()

## Drop unused columns for modelling

In [None]:
# drop all unused columns in a single call
df = df.drop(columns=['Insulin', 'MCP.1', 'Adiponectin', 'Leptin', 'HOMA'])

In [None]:
df

# Modeling

In [None]:
from sklearn.model_selection import train_test_split
# Split the data
# Input/independent variables
X = df.drop('Classification', axis = 1) # drop the target column; 'X' holds the input features. df itself is unchanged
                                        # because 'inplace = True' is not used

y = df['Classification'] # Output/dependent variable
# train_x, test_x,train_y,test_y = train_test_split(X,y)

In [None]:
# Split first, so that the scaler never sees the test rows
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# let's print the shapes
print("Shape of the X Train :", X_train.shape)
print("Shape of the y Train :", y_train.shape)
print("Shape of the X test :", X_test.shape)
print("Shape of the y test :", y_test.shape)

In [None]:
# Scaling the data: fit on the training set only, then apply the same transform to both sets
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train
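
Standardization parameters are ideally estimated from the training rows only and then reused on the test rows, so that no information from the held-out data leaks into preprocessing. The sketch below shows that idea for a single column with the standard library; the helper names `fit_scaler` and `apply_scaler` and the numbers are illustrative assumptions.

```python
from statistics import mean, pstdev


def fit_scaler(train_column):
    """Estimate the mean and population standard deviation from training data only."""
    return mean(train_column), pstdev(train_column)


def apply_scaler(column, mu, sigma):
    """Standardize a column with previously fitted parameters."""
    return [(v - mu) / sigma for v in column]


# Made-up values for one feature
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mu, sigma = fit_scaler(train)                  # learned from the training rows
train_scaled = apply_scaler(train, mu, sigma)  # mean 0, standard deviation 1
test_scaled = apply_scaler(test, mu, sigma)    # same parameters, no refitting
print(train_scaled, test_scaled)
```

Tree-based models such as the ones used below are largely insensitive to feature scaling, so the effect on this notebook is likely small, but the pattern matters for scale-sensitive models.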

In [None]:
# Model Build
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score,roc_curve, auc, precision_recall_curve, f1_score
import warnings
warnings.filterwarnings('ignore')

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

rf.fit(X_train,y_train)

y_pred_rf = rf.predict(X_test)

print("Training Accuracy :", rf.score(X_train, y_train))
print("Testing Accuracy :", rf.score(X_test, y_test))

cm = confusion_matrix(y_test, y_pred_rf)
plt.rcParams['figure.figsize'] = (3, 3)
sns.heatmap(cm, annot = True, cmap = 'YlGnBu', fmt = '.8g')
plt.show()

cr = classification_report(y_test, y_pred_rf)
print(cr)

print("------------------------------------------")

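# ROC AUC from the predicted probability of the positive class (1 = patient)
y_score_rf = rf.predict_proba(X_test)[:, 1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test,y_score_rf)
roc_auc = auc(false_positive_rate, true_positive_rate)
print("ROC AUC : ",roc_auc)

# confusion_matrix rows are true labels and columns are predictions:
# cm[0,0]=TN, cm[0,1]=FP, cm[1,0]=FN, cm[1,1]=TP
sensitivity1 = cm[1,1]/(cm[1,0]+cm[1,1])
print('Sensitivity : ', sensitivity1)

specificity1 = cm[0,0]/(cm[0,0]+cm[0,1])
print('Specificity : ', specificity1)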
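
ROC AUC is a ranking metric, so it is normally computed from continuous scores such as predicted probabilities. Passing hard 0/1 predictions collapses the curve to a single operating point and can understate how well the model ranks cases. The sketch below illustrates the difference with a pairwise definition of AUC; the helper name `pairwise_auc` and the toy scores are illustrative assumptions.

```python
def pairwise_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: two controls (0) and two patients (1) with made-up probabilities
y_true = [0, 0, 1, 1]
proba = [0.1, 0.4, 0.45, 0.8]

# Thresholding at 0.5 discards the ordering among cases on the same side of the cut
hard = [1 if p >= 0.5 else 0 for p in proba]

print(pairwise_auc(y_true, proba))  # 1.0: every patient is ranked above every control
print(pairwise_auc(y_true, hard))   # 0.75: ties introduced by thresholding cost ranking credit
```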

## LGBM

In [None]:
from lightgbm import LGBMClassifier
clf = LGBMClassifier()

clf.fit(X_train,y_train)

y_pred_lgb = clf.predict(X_test)

print("Training Accuracy :", clf.score(X_train, y_train))
print("Testing Accuracy :", clf.score(X_test, y_test))

cm = confusion_matrix(y_test, y_pred_lgb)
plt.rcParams['figure.figsize'] = (3, 3)
sns.heatmap(cm, annot = True, cmap = 'YlGnBu', fmt = '.8g')
plt.show()

cr = classification_report(y_test, y_pred_lgb)
print(cr)

print("------------------------------------------")

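# ROC AUC from the predicted probability of the positive class (1 = patient)
y_score_lgb = clf.predict_proba(X_test)[:, 1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test,y_score_lgb)
roc_auc = auc(false_positive_rate, true_positive_rate)
print("ROC AUC : ",roc_auc)

# confusion_matrix rows are true labels and columns are predictions:
# cm[0,0]=TN, cm[0,1]=FP, cm[1,0]=FN, cm[1,1]=TP
sensitivity1 = cm[1,1]/(cm[1,0]+cm[1,1])
print('Sensitivity : ', sensitivity1)

specificity1 = cm[0,0]/(cm[0,0]+cm[0,1])
print('Specificity : ', specificity1)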

## XGBoost

In [None]:
from xgboost import XGBClassifier

#train
xgb = XGBClassifier()

xgb.fit(X_train,y_train)

#predict
y_predicted_xgb = xgb.predict(X_test)

print("Training Accuracy :", xgb.score(X_train, y_train))
print("Testing Accuracy :", xgb.score(X_test, y_test))

#eval
cm = confusion_matrix(y_test, y_predicted_xgb)
plt.rcParams['figure.figsize'] = (3, 3)
sns.heatmap(cm, annot = True, cmap = 'YlGnBu', fmt = '.8g')
plt.show()

cr = classification_report(y_test, y_predicted_xgb)
print(cr)

print("------------------------------------------")

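# ROC AUC from the predicted probability of the positive class (1 = patient)
y_score_xgb = xgb.predict_proba(X_test)[:, 1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test,y_score_xgb)
roc_auc = auc(false_positive_rate, true_positive_rate)
print("ROC AUC : ",roc_auc)

# confusion_matrix rows are true labels and columns are predictions:
# cm[0,0]=TN, cm[0,1]=FP, cm[1,0]=FN, cm[1,1]=TP
sensitivity1 = cm[1,1]/(cm[1,0]+cm[1,1])
print('Sensitivity : ', sensitivity1)

specificity1 = cm[0,0]/(cm[0,0]+cm[0,1])
print('Specificity : ', specificity1)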

## Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
#train
gbc = GradientBoostingClassifier(learning_rate=0.02,
                    max_depth=4,
                    random_state=100, n_estimators=1000)


gbc.fit(X_train,y_train)

#predict
y_predicted_gb = gbc.predict(X_test)

print("Training Accuracy :", gbc.score(X_train, y_train))
print("Testing Accuracy :", gbc.score(X_test, y_test))

#eval
cm = confusion_matrix(y_test, y_predicted_gb)
plt.rcParams['figure.figsize'] = (3, 3)
sns.heatmap(cm, annot = True, cmap = 'YlGnBu', fmt = '.8g')
plt.show()

cr = classification_report(y_test, y_predicted_gb)
print(cr)


print("------------------------------------------")

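# ROC AUC from the predicted probability of the positive class (1 = patient)
y_score_gb = gbc.predict_proba(X_test)[:, 1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test,y_score_gb)
roc_auc = auc(false_positive_rate, true_positive_rate)
print("ROC AUC : ",roc_auc)

# confusion_matrix rows are true labels and columns are predictions:
# cm[0,0]=TN, cm[0,1]=FP, cm[1,0]=FN, cm[1,1]=TP
sensitivity1 = cm[1,1]/(cm[1,0]+cm[1,1])
print('Sensitivity : ', sensitivity1)

specificity1 = cm[0,0]/(cm[0,0]+cm[0,1])
print('Specificity : ', specificity1)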