# Predicting the Presence of Breast Cancer

The goal of this project is to create a model which accurately determines the presence of breast cancer. The dataset used within this notebook was found on kaggle.com (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data), uploaded by UCI Machine Learning.

In this project, we shall undertake the following tasks:

0. Package and Data Imports
1. Data Cleaning
2. Exploratory Data Analysis and Visualisations
3. Feature Selection
4. Model Building
5. Model Analysis

## 0: Package and Data Imports

Let us begin by importing the packages necessary for our exploratory data analysis and visualisation. Packages required for scaling and model building are imported later, in the sections where they are first used.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Let us now use Pandas to import our csv file into a dataframe.

In [None]:
df = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')

## 1: Data Cleaning

In this section we shall attempt to engineer new features from our dataset, as well as deal with any missing values.

### 1.1: Missing Data

Let us determine whether there are any missing data points within this dataset.

In [None]:
df.isnull().sum()

We observe that there are no missing data points within any column except for the column called "Unnamed: 32". It appears that this column is redundant and as a result we shall remove it from the dataset.

In [None]:
df = df.drop('Unnamed: 32',axis=1)
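
As a hedged aside, a column that contains nothing but missing values can also be dropped without naming it. The sketch below uses a small hypothetical frame (`toy`, with invented values) rather than the real dataset; `dropna(axis=1, how="all")` is standard pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw file: one column is entirely empty.
toy = pd.DataFrame({
    "radius_mean": [14.1, 20.6, 11.4],
    "texture_mean": [19.3, 17.8, 20.4],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Drop every column in which all values are missing.
cleaned = toy.dropna(axis=1, how="all")

print(cleaned.columns.tolist())
```

Dropping by name, as done above, is more explicit; the `dropna` form is only convenient when the empty column's name is not known in advance.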

Let us now check the head of the dataframe to investigate the types of data we have stored for each datapoint.

In [None]:
df.head()

The "id" column is simply a unique number for each item within the dataset. Unfortunately, there is no useful information to be gained from this column and as a result it shall be removed from the dataset.

In [None]:
df = df.drop('id',axis=1)

The "diagnosis" column is our target column. Let us check the different values this column can take.

In [None]:
df['diagnosis'].unique()

We notice that our target column contains two possible values, "M" or "B". "M" stands for malignant and is used when there is presence of breast cancer. "B" stands for benign and is used when there are no signs of breast cancer. Let us change these values for use in our machine learning algorithms. We shall record the presence of breast cancer as a 1 and use 0 to denote no presence of breast cancer.

In [None]:
df['diagnosis'] = df['diagnosis'].apply(lambda x: 1 if x == "M" else 0)
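
A possible alternative to the lambda, sketched here on a hypothetical toy column rather than the real data, is an explicit dictionary passed to `Series.map`. One assumption-free difference is that any label missing from the dictionary becomes NaN instead of being silently encoded as 0.

```python
import pandas as pd

# Toy stand-in for the "diagnosis" column (hypothetical values, not the real data).
labels = pd.Series(["M", "B", "B", "M"], name="diagnosis")

# An explicit mapping: labels that are not listed would become NaN.
encoded = labels.map({"M": 1, "B": 0})

print(encoded.tolist())
```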

### 1.2: Feature Extraction

Let us check the info method of our dataframe.

In [None]:
df.info()

We notice that there are 10 distinct features stored within our dataset, with 3 statistics recorded for each: the mean, the standard error and the "worst" (largest) value (UCI Machine Learning). Each of these 30 columns is of the type "float" and is therefore numeric. As a result, there is no scope for feature extraction within this project, which means that feature selection becomes particularly important when it comes to model building.
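
The "10 features with 3 statistics each" structure can be read off the column names. The following is a minimal sketch on a handful of the dataset's column names; the grouping helper (`groups`) is introduced here purely for illustration.

```python
from collections import defaultdict

# A few of the dataset's column names, used here for illustration.
columns = [
    "radius_mean", "radius_se", "radius_worst",
    "concave points_mean", "concave points_se", "concave points_worst",
    "fractal_dimension_mean", "fractal_dimension_se", "fractal_dimension_worst",
]

groups = defaultdict(list)
for name in columns:
    # Split on the last underscore only, so "fractal_dimension" stays intact.
    base, statistic = name.rsplit("_", 1)
    groups[base].append(statistic)

print(dict(groups))
```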

Our data has now been successfully cleaned.

## 2: Exploratory Data Analysis and Visualisations

In this section we shall attempt to determine which features have the most impact on the presence of breast cancer. 

### 2.1: Target Variable

Let us first investigate the distribution of the points within our target variable.

In [None]:
sns.countplot(x='diagnosis',data=df)

In [None]:
df['diagnosis'].value_counts()

In [None]:
len(df[df['diagnosis']==1]) * 100 / len(df)

We notice that 212 of our datapoints are classified as malignant, which equates to approximately 37% of the total dataset. These classes are relatively balanced and as a result it is not necessary to synthetically produce more instances of the minority class using a technique such as SMOTE.
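
The percentage above can also be obtained with `value_counts(normalize=True)`. The sketch below rebuilds the target column from its class counts, assuming the dataset's 569 rows (212 malignant and 357 benign), instead of loading the file.

```python
import pandas as pd

# Assumption: 569 rows in total, 212 malignant (1) and the remaining 357 benign (0).
diagnosis = pd.Series([1] * 212 + [0] * 357, name="diagnosis")

# normalize=True returns class proportions instead of raw counts.
shares = diagnosis.value_counts(normalize=True) * 100
malignant_share = shares[1]

print(round(malignant_share, 1))
```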

### 2.2: Independent Variables

Let us now begin investigating our independent variables and their impact on the presence of breast cancer. Let us check the describe method of the dataframe.

In [None]:
df.describe()

We observe that some of the variables stored within the dataset have an extremely small range of values, whereas others have an extremely large range of possible values. In order to make identifying differences between classes easier, let us scale the data using the StandardScaler from scikit-learn.

In [None]:
y = df['diagnosis']              # Target Variable
X = df.drop('diagnosis',axis=1)  # Independent Variables
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
df = pd.DataFrame(X,columns=df.columns[1:])
df['diagnosis'] = y

Let us check the head and describe methods of the dataframe to ensure that the scaler has worked correctly.

In [None]:
df.head()

In [None]:
df.describe()

We can see that each column now has a mean of approximately 0 and a standard deviation of approximately 1, and we can therefore conclude that the data has been scaled successfully. Let us now produce boxplots for each feature to determine the effect they have on the presence of breast cancer. We shall split the features into 3 groups of ten to make visualisation easier.
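
A quick way to confirm what the scaler does is to check the column means and standard deviations directly. This is a minimal sketch on hypothetical values (the `toy` frame is not the real data); it also shows why `describe()` reports a standard deviation slightly above 1 after scaling.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements on very different scales (not the real data).
toy = pd.DataFrame({
    "area_mean": [400.0, 650.0, 1200.0, 900.0],
    "smoothness_mean": [0.08, 0.10, 0.12, 0.09],
})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# StandardScaler divides by the population standard deviation (ddof=0),
# whereas describe() reports the sample standard deviation (ddof=1).
means = scaled.mean()
population_std = scaled.std(ddof=0)
sample_std = scaled.std(ddof=1)

print(means.round(6).tolist(), population_std.round(6).tolist())
```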

In [None]:
x = df.drop('diagnosis',axis=1)
y = df['diagnosis']
data = pd.concat([y,x.iloc[:,0:10]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.boxplot(x="features", y="value", hue="diagnosis", data=data)
plt.xticks(rotation=90)

The box plot above shows the distribution of the mean values obtained for each of the ten features measured within our dataset. We see that for each feature except fractal dimension, the median value for the malignant class is much higher than for the benign class. The malignant class also seems to have a wider interquartile range for each feature in comparison to the benign class.

Let us now plot the second group of ten features.

In [None]:
data = pd.concat([y,x.iloc[:,10:20]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.boxplot(x="features", y="value", hue="diagnosis", data=data)
plt.xticks(rotation=90)

For most of the features here, we again observe a difference in the median value for each of the two target classes. 

Let us plot the final group of features.

In [None]:
data = pd.concat([y,x.iloc[:,20:30]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.boxplot(x="features", y="value", hue="diagnosis", data=data)
plt.xticks(rotation=90)

Again we are able to observe a significant difference between the median values by class for each of the features shown above.

Let us investigate the relationships between the variables by producing a correlation heatmap using Seaborn.

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(df.corr(), annot=True,cmap='viridis')

We can observe that there are variables within the dataset that are almost perfectly correlated. Let us investigate the values that are higher than 0.9.

In [None]:
corr = df.corr()  # compute the correlation matrix once rather than inside the loop
for i in df.columns:
    print("Features highly related to column {}:".format(i))
    related_list = []
    for j in df.columns:
        if i != j and abs(corr.loc[i, j]) > 0.9:
            related_list.append(j)
    print(related_list)
    print("-" * 50)
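
A more compact way to list highly correlated pairs, each reported once, is to keep only the upper triangle of the correlation matrix. The sketch below runs on a hypothetical three-column frame (`toy`) with one deliberately redundant column; it is an illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Hypothetical data: "b" is almost a rescaled copy of "a", "c" is unrelated noise.
toy = pd.DataFrame({
    "a": base,
    "b": 2 * base + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep the strict upper triangle so that each pair appears only once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]

print(high_pairs)
```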

After looking through the list, we notice that there are two groups of relationships within this dataset. First, the radius, perimeter and area measurements are all related, which follows intuition. Under the assumption that the cells are circular, the formulas for the perimeter and area of a circle can be used. Both formulas, given by perimeter = Pi x diameter = 2 x Pi x radius, and area = Pi x radius^2, depend only on the radius, hence the extremely high positive correlation.
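
The circular-cell argument can be checked numerically. The sketch below uses hypothetical, evenly spaced radii (not values from the dataset) and computes the Pearson correlation of the radius with the implied perimeter and area.

```python
import numpy as np

# Hypothetical radii spanning a plausible range; not taken from the dataset.
radius = np.linspace(6.0, 28.0, 200)
perimeter = 2 * np.pi * radius   # exactly linear in the radius
area = np.pi * radius ** 2       # quadratic in the radius

corr_perimeter = np.corrcoef(radius, perimeter)[0, 1]
corr_area = np.corrcoef(radius, area)[0, 1]

print(corr_perimeter, corr_area)
```

The perimeter is a linear function of the radius, so its correlation is 1 up to rounding; the area is quadratic, so its correlation is slightly below 1 but still well above the 0.9 threshold used here.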

We also observe high correlation between the pairs of variables "texture_worst" and "texture_mean" and "concave points_worst" and "concave points_mean". Let us produce joint plots of these pairs to highlight the relationship between the variables.

In [None]:
sns.jointplot(x='texture_worst',y='texture_mean',data=df,kind='reg',color='purple')

In [None]:
sns.jointplot(x='concave points_mean',y='concave points_worst',data=df,kind='reg')

The plots above clearly show the strong positive relationship within each of these pairs of variables. Let us now produce a pair plot of the 9 variables relating to the radius, perimeter and area of the cells.

In [None]:
sns.pairplot(df[['radius_mean','radius_se','radius_worst','perimeter_mean','perimeter_se','perimeter_worst','area_mean','area_se','area_worst','diagnosis']],hue='diagnosis')

We can clearly see the positive linear relationships between each of these variables in a pairwise format, highlighting the large values for the correlation shown in the heatmap.

## 3: Feature Selection

Within this dataset, we have 30 different numerical features which may have an impact on the presence of breast cancer, with some features being more significant than others. In order to determine which features are most important, we shall make use of both the Random Forest Classifier and XGBoost Classifier. First, we must split our dataset into the input features and output variable.

In [None]:
X = df.iloc[:,:-1]
y = df.iloc[:,-1]


Let us now determine the most important features using a Random Forest Classifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=5000,random_state=11)
rf.fit(X,y)
feat_imp = pd.DataFrame(rf.feature_importances_)
feat_imp.index = pd.Series(df.iloc[:,:-1].columns)
feat_imp = (feat_imp*100).copy().sort_values(by=0,ascending=False)
feat_imp = feat_imp.reset_index()
feat_imp

It appears that the standard error measurements recorded within the dataset are the least influential factors on the presence of breast cancer. However, the results shown above do not provide a clear indication of the optimal number of features for use within our model. Let us implement a backward elimination technique in order to determine this.

We shall analyse the performance of the models created using the F1 score metric, because our target classes are slightly unbalanced. With unbalanced classes, a model that simply predicts the majority class would still obtain a misleadingly respectable accuracy. The F1 score is more informative in this case since it takes into account both false positives and false negatives: in order for a high F1 score to be achieved, the model must produce both high precision and high recall.
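
To illustrate the point, the sketch below scores a hypothetical "model" that always predicts the majority class. It assumes the dataset's class counts (357 benign and 212 malignant) and uses scikit-learn's `accuracy_score` and `f1_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Assumption: the same class counts as the dataset (357 benign = 0, 212 malignant = 1).
y_true = np.array([0] * 357 + [1] * 212)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# F1 for the malignant class; zero_division=0 avoids the undefined-precision warning.
f1_malignant = f1_score(y_true, y_pred, zero_division=0)
# The weighted average, which is the scoring metric used in this notebook.
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(accuracy, f1_malignant, f1_weighted)
```

The accuracy equals the benign share of the data, while the F1 score for the malignant class is 0 because no malignant case is ever found; the weighted F1 score falls below the accuracy for the same reason.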

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

results_list = []
for var in np.arange(feat_imp.shape[0],9,-1):
    X_new = X[feat_imp.iloc[:var,0]].copy()
    X_train, X_test, y_train,y_test = train_test_split(X_new,y,test_size=0.2,random_state=11)
    final_rf = RandomForestClassifier(random_state=11)
    gscv = GridSearchCV(estimator=final_rf,param_grid={
        "n_estimators":[100,500,1000,5000],
        "criterion":["gini","entropy"]
    },cv=5,n_jobs=-1,scoring="f1_weighted")

    model = gscv.fit(X_train,y_train)
    
    results_list.append((var, model.best_score_))
    print("Model Created using the top {} variables".format(var))
    print("F1 Score: {}".format(model.best_score_))
    print("-"*30)
    

We notice that the highest F1 score is achieved using the top 17 variables. There is a relatively large drop in performance between 12 and 11 variables, which is to be expected due to the large difference between the importance values found in the feature importance dataframe.
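
The mechanics of the loop above (rank the features once, then score progressively shorter prefixes of that ranking) can be sketched in a few lines. This is a simplified illustration under stated assumptions: it uses scikit-learn's bundled copy of the data via `load_breast_cancer` (where 0 denotes malignant and 1 benign, the reverse of this notebook), a small forest of 50 trees, plain cross-validation instead of a grid search, and only three values of k.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumption: scikit-learn's bundled copy of the data (0 = malignant, 1 = benign).
data = load_breast_cancer(as_frame=True)
X_demo, y_demo = data.data, data.target

# Rank the features once with a small forest (50 trees keeps the sketch fast).
ranker = RandomForestClassifier(n_estimators=50, random_state=11).fit(X_demo, y_demo)
ranking = pd.Series(ranker.feature_importances_, index=X_demo.columns)
ranking = ranking.sort_values(ascending=False)

# Score the top-k features for a few values of k with 5-fold cross-validation.
scores = {}
for k in (30, 20, 10):
    top_k = ranking.index[:k]
    model = RandomForestClassifier(n_estimators=50, random_state=11)
    cv_scores = cross_val_score(model, X_demo[top_k], y_demo, cv=5, scoring="f1_weighted")
    scores[k] = cv_scores.mean()

print(scores)
```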

Let us now investigate the use of SMOTE within this backward elimination process.

In [None]:
from imblearn.over_sampling import SMOTE
SMOTE_list = []
for var in np.arange(feat_imp.shape[0],9,-1):
    X_new = X[feat_imp.iloc[:var,0]].copy()
    X_train, X_test, y_train,y_test = train_test_split(X_new,y,test_size=0.2,random_state=11)
    smote = SMOTE(random_state = 11) 
    X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
    final_rf = RandomForestClassifier(random_state=11)
    gscv = GridSearchCV(estimator=final_rf,param_grid={
        "n_estimators":[100,500,1000,5000],
        "criterion":["gini","entropy"]
    },cv=5,n_jobs=-1,scoring="f1_weighted")

    model = gscv.fit(X_train_smote,y_train_smote)
    SMOTE_list.append((var, model.best_score_))
    print("SMOTE Model Created using the top {} variables".format(var))
    print("F1 Score: {}".format(model.best_score_))
    print("Best Model {}".format(model.best_estimator_))
    print("-"*30)

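We observe that the use of SMOTE to balance the target classes resulted in a better F1 score for each number of variables used, as seen in the plot below. Note that SMOTE is applied before the cross-validation split, so the validation folds contain synthetic samples and the two sets of scores are not strictly comparable.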

In [None]:
x_plot = range(10,31)
# The result lists run from 30 variables down to 10, so reverse them to match x_plot.
y_1 = [score for _, score in reversed(results_list)]
y_2 = [score for _, score in reversed(SMOTE_list)]

plt.figure(figsize=(10,6))
plt.plot(x_plot, y_1, '-b', label='Without SMOTE')
plt.plot(x_plot, y_2, '-r', label='With SMOTE')
plt.legend()
plt.xlabel('Number of Variables')
plt.ylabel('F1 Score')
plt.title('Figure Comparing the F1 Scores obtained with and without using SMOTE')


The model that achieved the highest F1 score was built using SMOTE with 18 variables and the "entropy" criterion. In the model building section below, we shall implement the final Random Forest Classifier model using these parameters. 

Let us now repeat the process using the XGBoost classifier. First we shall create a feature importance dataframe.

In [None]:
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=5000,random_state=11)
xgb.fit(X,y)
feat_imp_xgb = pd.DataFrame(xgb.feature_importances_)
feat_imp_xgb.index = pd.Series(df.iloc[:,:-1].columns)
feat_imp_xgb = (feat_imp_xgb*100).copy().sort_values(by=0,ascending=False)
feat_imp_xgb = feat_imp_xgb.reset_index()
feat_imp_xgb

We notice that there are 4 features that seem to have the most important impact on the presence of breast cancer. As with the Random Forest Classifier above, it is still difficult to determine the optimal number of variables. Since we found that the use of SMOTE improved the F1 score obtained, we shall use it within the backward elimination process using the XGBoost classifier.

In [None]:
SMOTE_list_xgb = []
# Note: the features are ranked by the Random Forest importances (feat_imp), not by feat_imp_xgb.
for var in np.arange(feat_imp.shape[0],9,-1):
    X_new = X[feat_imp.iloc[:var,0]].copy()
    X_train, X_test, y_train,y_test = train_test_split(X_new,y,test_size=0.2,random_state=11)
    smote = SMOTE(random_state = 11) 
    X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
    final_xgb = XGBClassifier(random_state=11)
    gscv = GridSearchCV(estimator=final_xgb,param_grid={
        "n_estimators":[100,500,1000,5000],
        "criterion":["gini","entropy"]
    },cv=5,n_jobs=-1,scoring="f1_weighted")

    model = gscv.fit(X_train_smote,y_train_smote)
    SMOTE_list_xgb.append((var, model.best_score_))
    print("SMOTE XGB Model Created using the top {} variables".format(var))
    print("F1 Score: {}".format(model.best_score_))
    print("Best Model {}".format(model.best_estimator_))
    print("-"*30)

Let us plot these values to visually determine which model has produced the highest F1 score.

In [None]:
xgb_results = []
for i in range(20,-1,-1):
    xgb_results.append(SMOTE_list_xgb[i][1])
    
plt.figure(figsize=(10,6))
plt.plot(x_plot, xgb_results, '-b')
plt.xlabel('Number of Variables')
plt.ylabel('F1 Score')
plt.title('Figure Comparing the F1 Scores obtained using XGBoost for various numbers of input variables')

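We can see that the model created using 26 variables obtained the highest F1 score of 0.97689, with the grid search reporting the 'gini' criterion and 1000 estimators. Since 'criterion' is a scikit-learn tree option rather than a documented XGBoost parameter, the number of estimators is most likely the only setting that matters here. We shall implement the final XGBoost classifier in the model building section.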

## 4: Model Building

In this section we shall create final models using the Random Forest Classifier and XGBoost classifier using the parameters determined in the previous section. 

### 4.1: Random Forest Classifier

When selecting features for this model, we found that the top 18 features, along with SMOTE, produced the highest F1 score. Let us create the final random forest classifier.

In [None]:
X_new = X[feat_imp.iloc[:18,0]].copy()
X_train, X_test, y_train,y_test = train_test_split(X_new,y,test_size=0.2,random_state=11)
smote = SMOTE(random_state = 11) 
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
final_rf = RandomForestClassifier(random_state=11)
gscv = GridSearchCV(estimator=final_rf,param_grid={
    "n_estimators":[100,500,1000,5000],
    "criterion":["gini","entropy"]
},cv=5,n_jobs=-1,scoring="f1_weighted")

model = gscv.fit(X_train_smote,y_train_smote)
final_rfc_model = model.best_estimator_
    

We shall now use this model to generate predictions which we shall analyse in the Model Analysis section.

In [None]:
rfc_preds = final_rfc_model.predict(X_test)

### 4.2: XGBoost Classifier

Let us now implement the final version of the XGBoost classifier. In the feature selection section, we found that the best F1 score was obtained using the 26 most important features.

In [None]:
# Top 26 features of the Random Forest ranking (feat_imp), as in the search above.
X_new = X[feat_imp.iloc[:26,0]].copy()
X_train, X_test, y_train,y_test = train_test_split(X_new,y,test_size=0.2,random_state=11)
smote = SMOTE(random_state = 11) 
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
final_xgb = XGBClassifier(random_state=11)
gscv = GridSearchCV(estimator=final_xgb,param_grid={
   "n_estimators":[100,500,1000,5000],
   "criterion":["gini","entropy"]
},cv=5,n_jobs=-1,scoring="f1_weighted")

model = gscv.fit(X_train_smote,y_train_smote)
final_xgb_model = model.best_estimator_

We shall now create predictions using this model.

In [None]:
xgb_preds = final_xgb_model.predict(X_test)

## 5: Model Analysis

In this section we shall analyse the performance achieved by the random forest and XGBoost classifiers. We shall produce confusion matrices and classification reports for each model. A confusion matrix displays the predictions made by the model against the actual labels associated with them, whereas the classification report displays values such as precision, recall and F1 score for each class individually, together with the overall accuracy.
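
As a small illustration of how these quantities relate, the sketch below builds a confusion matrix for eight hypothetical predictions and derives precision and recall for the malignant class from its four cells.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for eight samples (0 = benign, 1 = malignant).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of predicted malignant cases that are malignant
recall = tp / (tp + fn)     # share of malignant cases that were found

print(cm.tolist(), precision, recall)
```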

### 5.1: Random Forest Classifier

We shall now analyse the predictions made by the random forest classifier. 

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
print(classification_report(y_test, rfc_preds))

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(confusion_matrix(y_test, rfc_preds),annot=True)
plt.ylabel('Actual Class')
plt.xlabel('Predicted Class')
plt.title('Predictions Using the Random Forest Classifier')

We manage to achieve near-perfect predictions on the test set using the Random Forest Classifier, with only 1 instance being incorrectly classified. As a result, an accuracy of 99.1% has been achieved. As well as this, we have been able to produce a model with an F1-score of 99%.
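
The 99.1% figure follows from the size of the test set. A brief worked check, assuming the dataset's 569 rows and the `test_size=0.2` used in this notebook:

```python
import math

n_samples = 569    # assumption: number of rows in the dataset
test_size = 0.2    # as passed to train_test_split in this notebook

# train_test_split rounds the test-set size up when test_size is a fraction.
n_test = math.ceil(test_size * n_samples)
accuracy = (n_test - 1) / n_test   # one misclassified instance

print(n_test, round(accuracy * 100, 1))
```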

### 5.2: XGBoost Classifier

In this section, we shall repeat the analysis from above using the XGBoost classifier.

In [None]:
print(classification_report(y_test, xgb_preds))

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(confusion_matrix(y_test, xgb_preds),annot=True)
plt.ylabel('Actual Class')
plt.xlabel('Predicted Class')
plt.title('Predictions Using the XGBoost Classifier')

It turns out that we achieve exactly the same predictions as the Random Forest Classifier whilst using the XGBoost classifier.