## Breast Cancer Classifier from scratch

### Load the required libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

### Load and Read the dataset
### Change the columns name also

In [2]:
df = pd.read_csv("datasets/breast-cancer-wisconsin.data", header = None)
df.rename(columns = {0:"id", 
                     1:"clump-thickness", 
                     2:"cell-size", 
                     3:"cell-shape", 
                     4:"marginal-adhesion", 
                     5:"epithelial-cell-size", 
                     6:"bare-nuclei", 
                     7:"bland-chromatin", 
                     8:"normal-nucleoli", 
                     9:"mitoses", 
                     10:"class"}, 
          inplace = True)

In [3]:
df.head(10)

Unnamed: 0,id,clump-thickness,cell-size,cell-shape,marginal-adhesion,epithelial-cell-size,bare-nuclei,bland-chromatin,normal-nucleoli,mitoses,class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2
5,1017122,8,10,10,8,7,10,9,7,1,4
6,1018099,1,1,1,1,2,10,3,1,1,2
7,1018561,2,1,2,1,2,1,3,1,1,2
8,1033078,2,1,1,1,2,1,1,1,5,2
9,1033078,4,2,1,1,2,1,2,1,1,2


### Count the observations for different classes

In [4]:
df['class'].value_counts()
# 2 is for benign cancer
# 4 is for malignant cancer

2    458
4    241
Name: class, dtype: int64

### Create input matrix and labels for the given dataset

In [5]:
label_vector = df.iloc[:, 10]   #class labels: 2 = benign, 4 = malignant
feature_vector = df.iloc[:, 1:10] #features vectors

In [6]:
feature_vector

Unnamed: 0,clump-thickness,cell-size,cell-shape,marginal-adhesion,epithelial-cell-size,bare-nuclei,bland-chromatin,normal-nucleoli,mitoses
0,5,1,1,1,2,1,3,1,1
1,5,4,4,5,7,10,3,2,1
2,3,1,1,1,2,2,3,1,1
3,6,8,8,1,3,4,3,7,1
4,4,1,1,3,2,1,3,1,1
...,...,...,...,...,...,...,...,...,...
694,3,1,1,1,3,2,1,1,1
695,2,1,1,1,2,1,1,1,1
696,5,10,10,3,7,3,8,10,2
697,4,8,6,4,3,4,10,6,1


### Relabel the observed values as 0 and 1

In [7]:
# type(label_vector)
label_vector = label_vector.replace(2,0)
label_vector = label_vector.replace(4,1)
label_vector.value_counts()

0    458
1    241
Name: class, dtype: int64

### Data Preprocessing

In [8]:
print("Does the feature set contain any null values:",feature_vector.isnull().values.any())
print("\nFeature set information")
feature_vector.info()

Does the feature set contain any null values: False

Feature set information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   clump-thickness       699 non-null    int64 
 1   cell-size             699 non-null    int64 
 2   cell-shape            699 non-null    int64 
 3   marginal-adhesion     699 non-null    int64 
 4   epithelial-cell-size  699 non-null    int64 
 5   bare-nuclei           699 non-null    object
 6   bland-chromatin       699 non-null    int64 
 7   normal-nucleoli       699 non-null    int64 
 8   mitoses               699 non-null    int64 
dtypes: int64(8), object(1)
memory usage: 49.3+ KB


The last cell says that the values for the feature "bare-nuclei" is an object, which is not compatible for our calculations.  
So we need to change it into the int64 type, so that it becomes homogeneous with the entire dataset.

In [9]:
feature_vector.mean()

clump-thickness         4.417740
cell-size               3.134478
cell-shape              3.207439
marginal-adhesion       2.806867
epithelial-cell-size    3.216023
bland-chromatin         3.437768
normal-nucleoli         2.866953
mitoses                 1.589413
dtype: float64

The last cell does not contain the mean value for the "bare-nuclei" feature, as it is of type object.  
Lets try to change the type for this.

In [11]:
# feature_vector["bare-nuclei"].astype('int64')

We got an error while trying to change the type for the "bare-nuclei" feature.  
ValueError: invalid literal for int() with base 10: '?'  
This means that there are some entries within that feature which contains '?' as values (hence incompatible).

In [16]:
# tmp = feature_vector["bare-nuclei"].where(lambda x : x != '?')
# tmp.str.contains("?", regex=False).value_counts()
feature_vector["bare-nuclei"].str.contains("?", regex=False).value_counts()

False    683
True      16
Name: bare-nuclei, dtype: int64

There are 16 such entries which are making the "bare-nuclei" feature incompatible for modelling.  
Now we have to get rid of such entries first before changing the type for that feature.  
One possible way of dealing with such entries would be to drop them and all the corresponding entries of the other features as well.  
But doing so would leave us with lesses data for modelling.  
We will even lose the patterns or insights that these entries might be holding to.  
There is another way to solve this issue, where we can replace the incompatible entries with the mean of all the entries for that particular feature.  
This will save us from lossing the insights into the dataset.  
Although dropping the entries would be advisable if the number of incompatible entries are insignificant than the voulume of the entire dataset.

In [17]:
def get_mean_for_bare_nulcei():
    tmp = feature_vector["bare-nuclei"].to_numpy()
    sum = 0
    for n in tmp:
        if (n == '?'):
            continue
        else:
            sum += int(n)
    return sum/np.size(tmp)
## END

In [20]:
mean_value = get_mean_for_bare_nulcei()
# mean_value  # 3.463519313304721

We will replace the incompatible values with the mean values after converting them into int64.

In [21]:
feature_vector["bare-nuclei"] = feature_vector["bare-nuclei"].replace('?', int(mean_value))

Now we will change the type of the "bare-nuclei" feature to int64.

In [22]:
feature_vector["bare-nuclei"] = feature_vector["bare-nuclei"].astype('int64')

In [23]:
feature_vector.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   clump-thickness       699 non-null    int64
 1   cell-size             699 non-null    int64
 2   cell-shape            699 non-null    int64
 3   marginal-adhesion     699 non-null    int64
 4   epithelial-cell-size  699 non-null    int64
 5   bare-nuclei           699 non-null    int64
 6   bland-chromatin       699 non-null    int64
 7   normal-nucleoli       699 non-null    int64
 8   mitoses               699 non-null    int64
dtypes: int64(9)
memory usage: 49.3 KB


### Split the training and testing data
Here we will keep 30% of the entire dataset for testing.

In [24]:
X_train, X_test, Y_train, Y_test = train_test_split(feature_vector, label_vector, test_size = 0.3, random_state = 2020)

print("shape of input training data:",X_train.shape)
print("shape of output training data:",Y_train.shape)
print("shape of input testing data:",X_test.shape)
print("shape of output testing data:",Y_test.shape)

shape of input training data: (489, 9)
shape of output training data: (489,)
shape of input testing data: (210, 9)
shape of output testing data: (210,)


### Z-score Normalization

In [25]:
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

### Train the model first

In [26]:
def sigmoid_function(args):
    return (1/(1 + np.exp(-args)))
## END

In [27]:
def logLiklihood(z, y):
    """Log-liklihood function (cost function to be minimized in logistic regression classification)"""
    return -1 * np.sum((y * np.log(sigmoid_function(z))) + ((1 - y) * np.log(1 - sigmoid_function(z))))
## END

In [28]:
def model_optimize(w, b, X, Y):
    m = X.shape[0]
    #Prediction
    final_result = sigmoid_function(np.dot(w, X.T) + b)
    # Y_T = Y.T
    # cost = (-1/m)*(np.sum((Y_T*np.log(final_result)) + ((1 - Y_T)*(np.log(1 - final_result)))))
    cost = (-1/m)*(np.sum((np.asarray(Y.T)*np.log(final_result)) + ((1 - np.asarray(Y.T)*(np.log(1 - final_result))))))
    #Gradient calculation
    dw = (1/m)*(np.dot(X.T, (final_result - np.asarray(Y.T)).T))
    db = (1/m)*(np.sum(final_result - np.asarray(Y.T)))
    grads = {"dw": dw, "db": db}
    return grads, cost
## END
print("model_optimize")

model_optimize


In [29]:
def model_predict(w, b, X, Y, learning_rate, no_iterations):
    costs = []
    for i in range(no_iterations):
        grads, cost = model_optimize(w, b, X, Y)
        dw = grads["dw"]
        db = grads["db"]
        #weight update
        w = w - (learning_rate * (dw.T))
        b = b - (learning_rate * db)
        if (i % 100 == 0):
            costs.append(cost)
            print("Cost after %i iteration is %f" %(i, cost))
    #final parameters
    coeff = {"w": w, "b": b}
    gradient = {"dw": dw, "db": db}
    return coeff, gradient, costs
## END
print("model_predict")

model_predict


In [30]:
def weightInitialization(n_features):
    w = np.zeros((1, n_features))
    b = 0
    return w, b
## END
print("weightInitialization")

weightInitialization


In [31]:
def predict(final_pred, m):
    y_pred = np.zeros((1,m))
    for i in range(final_pred.shape[1]):
        if final_pred[0][i] > 0.5:
            y_pred[0][i] = 1
    return y_pred
## END
print("predict")

predict


In [32]:
n_features = X_train_std.shape[1]
print('Number of Features', n_features)
w, b = weightInitialization(n_features)

Number of Features 9


In [33]:
coeff, gradient, costs = model_predict(w, b, X_train_std, Y_train, learning_rate = 0.01, no_iterations = 5001)

Cost after 0 iteration is -1.000000
Cost after 100 iteration is -1.571390
Cost after 200 iteration is -1.821066
Cost after 300 iteration is -1.976312
Cost after 400 iteration is -2.086463
Cost after 500 iteration is -2.170232
Cost after 600 iteration is -2.236726
Cost after 700 iteration is -2.291061
Cost after 800 iteration is -2.336395
Cost after 900 iteration is -2.374819
Cost after 1000 iteration is -2.407784
Cost after 1100 iteration is -2.436343
Cost after 1200 iteration is -2.461281
Cost after 1300 iteration is -2.483200
Cost after 1400 iteration is -2.502572
Cost after 1500 iteration is -2.519774
Cost after 1600 iteration is -2.535109
Cost after 1700 iteration is -2.548829
Cost after 1800 iteration is -2.561143
Cost after 1900 iteration is -2.572223
Cost after 2000 iteration is -2.582220
Cost after 2100 iteration is -2.591258
Cost after 2200 iteration is -2.599447
Cost after 2300 iteration is -2.606881
Cost after 2400 iteration is -2.613641
Cost after 2500 iteration is -2.61980

In [34]:
#Final prediction
w = coeff["w"]
b = coeff["b"]
print('Optimized weights', w)
print('Optimized intercept', b)

Optimized weights [[0.94887462 0.54407288 0.71067117 0.42633928 0.32278778 1.20356732
  0.69247462 0.4263541  0.50740409]]
Optimized intercept -0.9738483095350601


In [35]:
final_train_pred = sigmoid_function(np.dot(w, X_train_std.T) + b)
final_test_pred = sigmoid_function(np.dot(w, X_test_std.T) + b)

In [36]:
m_tr =  X_train_std.shape[0]
m_ts =  X_test_std.shape[0]

In [37]:
y_tr_pred = predict(final_train_pred, m_tr)
print('Training Accuracy', accuracy_score(y_tr_pred.T, Y_train))

Training Accuracy 0.9652351738241309
