**ML | Breast Cancer Wisconsin Diagnosis Prediction using Logistic Regression**

The Breast Cancer Wisconsin Diagnosis dataset is commonly used in machine learning to classify breast tumors as malignant (cancerous) or benign (non-cancerous) based on features extracted from breast mass images. In this project I will apply logistic regression, a binary classification algorithm, to predict the nature of breast tumors. Logistic regression suits tasks where the goal is to assign each sample to one of two categories. I will walk through preprocessing the data, training the model and evaluating its performance, showing how such a model can support early detection and diagnosis of breast cancer.

In [1]:
# 1. Loading Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [4]:
# 2. Loading dataset
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
data = pd.read_csv("/content/drive/MyDrive/Machine_Learning/Project-2/data.csv")
print(data.head())  # call head() so only the first five rows are printed

           id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0      842302         M        17.99         10.38          122.80     1001.0   
1      842517         M        20.57         17.77          132.90     1326.0   
2    84300903         M        19.69         21.25          130.00     1203.0   
3    84348301         M        11.42         20.38           77.58      386.1   
4    84358402         M        20.29         14.34          135.10     1297.0   

In [6]:
#Getting Information about the dataset.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

**3. Processing Dataset**

The 'id' and 'Unnamed: 32' columns are dropped because they play no role in prediction: 'id' is only a record identifier, and 'Unnamed: 32' is not one of the measured features.

**data['diagnosis'].map():** Converts the diagnosis column, which contains 'M' (Malignant) and 'B' (Benign), into binary values. 'M' becomes 1 and 'B' becomes 0, giving the numeric 0/1 target that logistic regression needs.

In [7]:
# Preprocess data
data.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})

y = data['diagnosis'].values
x_data = data.drop(['diagnosis'], axis=1)
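
One detail of `map()` worth checking: any label that is missing from the mapping dictionary becomes NaN instead of raising an error. The sketch below shows this on a made-up frame (`toy` and the `'?'` label are hypothetical, not taken from data.csv), along with a quick class-count check.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data.csv
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M", "?"]})

# Labels that are not keys of the mapping silently become NaN
encoded = toy["diagnosis"].map({"M": 1, "B": 0})

print(encoded.isna().sum())    # number of labels that failed to map
print(encoded.value_counts())  # class counts among the mapped labels
```

Running the same two checks on the real `data['diagnosis']` column after mapping would confirm that every row received a 0 or a 1.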

**Normalization**

Here we apply min-max normalization, which rescales every feature column to the range [0, 1].

**x_data - x_data.min():** subtracts each column's minimum from every value in that column, so the smallest value in each column becomes 0.

**x_data.max() - x_data.min():** calculates each column's range (the difference between its maximum and minimum values). Dividing by this range makes the largest value in each column 1.

In [8]:
# Normalize features
x = (x_data - x_data.min()) / (x_data.max() - x_data.min())
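
As a minimal sketch of the same column-wise scaling on a plain NumPy array, the example below also guards against a constant column, where the range is 0 and the formula above would divide by zero. The helper name `min_max_scale` and the toy values are assumptions for illustration.

```python
import numpy as np

def min_max_scale(features):
    """Scale each column of a 2-D array to [0, 1]; constant columns map to 0."""
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    # Avoid 0/0 for constant columns by dividing by 1 instead
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (features - col_min) / safe_range

toy = np.array([[1.0, 10.0, 5.0],
                [2.0, 30.0, 5.0],
                [3.0, 20.0, 5.0]])  # third column is constant
scaled = min_max_scale(toy)
print(scaled)
```

Note that the notebook computes the minimum and maximum over the whole dataset before splitting; computing them on the training portion only is a common alternative that keeps test-set statistics out of preprocessing.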

**4. Splitting data for training and testing.**

**train_test_split:** This function splits the data into two parts, one for training the model and one for testing it.

**test_size=0.15:** 15% of the data will be used for testing and 85% for training.

**x_train = x_train.T:** Transposes the feature matrix so that rows are features and columns are samples, the (features, samples) shape that the matrix operations in the logistic regression code expect. The label arrays are 1-D, so transposing them leaves their shape unchanged.

In [9]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size = 0.15, random_state = 42)

x_train = x_train.T
x_test = x_test.T
y_train = y_train.T
y_test = y_test.T

print("x train: ", x_train.shape)
print("x test: ", x_test.shape)
print("y train: ", y_train.shape)
print("y test: ", y_test.shape)

x train:  (30, 483)
x test:  (30, 86)
y train:  (483,)
y test:  (86,)
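
The shapes printed above can be traced through the matrix product used later. This is a small sketch with random stand-in data (the array names `X`, `w` and `labels` are illustrative, not the notebook's variables) showing why the (features, samples) layout works and that transposing a 1-D label array has no effect.

```python
import numpy as np

n_features, n_samples = 30, 483          # shapes reported above for the training set
rng = np.random.default_rng(0)

X = rng.random((n_features, n_samples))  # stand-in for x_train after .T
w = np.zeros((n_features, 1))            # one weight per feature
b = 0.0

z = np.dot(w.T, X) + b                   # (1, 30) @ (30, 483) -> (1, 483)
print(z.shape)

labels = np.arange(n_samples)            # 1-D array, like y_train
print(labels.T.shape == labels.shape)    # transposing a 1-D array changes nothing
```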


**5. Initializing Model Architecture**

In [10]:
# Initializing Weight and bias
def initialize_weights_and_bias(dimension):
    w = np.random.randn(dimension, 1) * 0.01  # Initialize with small random values
    b = 0.0
    return w, b

**Sigmoid function to turn the z value into a probability.**

**sigmoid():** Squashes the input value z into the range (0, 1), so the output can be read as the probability of the positive class in binary classification.

In [11]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
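
For very negative inputs, `np.exp(-z)` in the formula above can overflow and trigger a NumPy warning, although the result still rounds to 0. A possible variant that avoids this is sketched below; `stable_sigmoid` is a hypothetical name and this is not the version used in the rest of the notebook.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a positive number."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))  # always in [0, 1], so it cannot overflow
    # For z >= 0: 1 / (1 + exp(-z)); for z < 0: exp(z) / (1 + exp(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

With features normalized to [0, 1] and small initial weights, the simple formula is unlikely to hit this case, so the variant mainly matters for unscaled inputs.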

**Forward-Backward Propagation**

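**np.dot(w.T, x_train):** Computes the matrix product of the transposed weights and the input data. Adding the bias b gives one linear score z per training sample.

**cost = (-1/m) * np.sum(y_train * np.log(y_head) + (1 - y_train) * np.log(1 - y_head)):** The binary cross-entropy (log loss) averaged over the m training samples. It measures the difference between the predicted probability (y_head) and the true label (y_train), and it grows as the predictions move away from the labels.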

**derivative_weight = (1/m) * np.dot(x_train, (y_head - y_train).T):** This calculates the gradient of the cost with respect to the weights w. It tells us how much we need to change the weights to reduce the cost.

**derivative_bias = (1/m) * np.sum(y_head - y_train):** This computes the gradient of the cost with respect to the bias b. It is simply the average of the difference between predicted probabilities (y_head) and actual labels (y_train).
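
Written as equations, the quantities described above look as follows. The notation is an assumption chosen to match the code: $X$ is the (features, samples) matrix with columns $x^{(i)}$, $\hat{y}$ corresponds to `y_head`, and $m$ is the number of samples.

```latex
z^{(i)} = w^{\top} x^{(i)} + b,
\qquad
\hat{y}^{(i)} = \sigma\!\left(z^{(i)}\right) = \frac{1}{1 + e^{-z^{(i)}}}

J(w, b) = -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\!\left(1 - \hat{y}^{(i)}\right) \right]

\frac{\partial J}{\partial w}
  = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right) x^{(i)}
  = \frac{1}{m}\, X \left(\hat{y} - y\right)^{\top},
\qquad
\frac{\partial J}{\partial b}
  = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)
```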

In [12]:
def forward_backward_propagation(w, b, x_train, y_train):
    m = x_train.shape[1]  # number of training samples (columns)

    # Forward pass: linear scores, then predicted probabilities
    z = np.dot(w.T, x_train) + b
    y_head = sigmoid(z)

    # Binary cross-entropy cost averaged over the m samples
    cost = (-1/m) * np.sum(y_train * np.log(y_head) + (1 - y_train) * np.log(1 - y_head))

    # Backward pass: gradients of the cost with respect to w and b
    derivative_weight = (1/m) * np.dot(x_train, (y_head - y_train).T)
    derivative_bias = (1/m) * np.sum(y_head - y_train)

    gradients = {"derivative_weight": derivative_weight, "derivative_bias": derivative_bias}
    return cost, gradients
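
One way to gain confidence in hand-written gradients like these is a finite-difference check: nudge each parameter slightly and compare the change in cost with the analytic gradient. The sketch below repeats the same formulas in a hypothetical helper `cost_and_grads` and uses random toy data, so every name and value in it is an assumption for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_and_grads(w, b, X, y):
    """Same cost and gradient formulas as forward_backward_propagation."""
    m = X.shape[1]
    y_head = sigmoid(np.dot(w.T, X) + b)
    cost = (-1 / m) * np.sum(y * np.log(y_head) + (1 - y) * np.log(1 - y_head))
    dw = (1 / m) * np.dot(X, (y_head - y).T)
    db = (1 / m) * np.sum(y_head - y)
    return cost, dw, db

rng = np.random.default_rng(0)
X = rng.random((3, 8))              # 3 features, 8 samples (toy data)
y = rng.integers(0, 2, size=8)      # random 0/1 labels
w = rng.normal(size=(3, 1)) * 0.1
b = 0.05

_, dw, db = cost_and_grads(w, b, X, y)

# Central finite differences, one parameter at a time
eps = 1e-6
numeric_dw = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j, 0] = eps
    numeric_dw[j, 0] = (cost_and_grads(w + step, b, X, y)[0]
                        - cost_and_grads(w - step, b, X, y)[0]) / (2 * eps)
numeric_db = (cost_and_grads(w, b + eps, X, y)[0]
              - cost_and_grads(w, b - eps, X, y)[0]) / (2 * eps)

max_dw_error = np.max(np.abs(dw - numeric_dw))
db_error = abs(db - numeric_db)
print(max_dw_error, db_error)       # both should be very close to zero
```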

**Updating Parameters**

**w -= learning_rate * gradients["derivative_weight"] and b -= learning_rate * gradients["derivative_bias"]:** Weight(w) and Bias(b) are updated by subtracting the gradient scaled by the learning rate.

In [13]:
def update(w, b, x_train, y_train, learning_rate, num_iterations):
    costs = []
    gradients = {}
    for i in range(num_iterations):
        # Compute the current cost and gradients, then take one gradient-descent step
        cost, gradients = forward_backward_propagation(w, b, x_train, y_train)
        w -= learning_rate * gradients["derivative_weight"]
        b -= learning_rate * gradients["derivative_bias"]

        # Record and report the cost every 100 iterations
        if i % 100 == 0:
            costs.append(cost)
            print(f"Cost after iteration {i}: {cost}")

    # gradients now holds the values from the final iteration
    parameters = {"weight": w, "bias": b}
    return parameters, gradients, costs
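
The learning rate controls how far each update moves, so it strongly affects how quickly the cost falls. The sketch below mirrors the update loop on made-up, linearly separable toy data and compares two learning rates. The `train` helper, the data and the value 0.5 are assumptions for illustration, not tuned settings for the breast cancer dataset.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, learning_rate, num_iterations):
    """Plain batch gradient descent; returns the cost at every iteration."""
    m = X.shape[1]
    w = np.zeros((X.shape[0], 1))
    b = 0.0
    costs = []
    for _ in range(num_iterations):
        y_head = sigmoid(np.dot(w.T, X) + b)
        costs.append((-1 / m) * np.sum(y * np.log(y_head) + (1 - y) * np.log(1 - y_head)))
        w = w - learning_rate * (1 / m) * np.dot(X, (y_head - y).T)
        b = b - learning_rate * (1 / m) * np.sum(y_head - y)
    return costs

rng = np.random.default_rng(0)
X = rng.random((2, 200))                   # toy features already in [0, 1]
y = (X[0] + X[1] > 1.0).astype(float)      # toy labels from a linear rule

slow = train(X, y, learning_rate=0.01, num_iterations=500)
fast = train(X, y, learning_rate=0.5, num_iterations=500)
print(f"lr=0.01: {slow[0]:.3f} -> {slow[-1]:.3f}")
print(f"lr=0.5:  {fast[0]:.3f} -> {fast[-1]:.3f}")
```

Both runs start from the same cost, log(2), because zero weights predict 0.5 for every sample; the larger step size then reduces the cost further within the same number of iterations on this toy problem.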

**6. Making Predictions**

***np.dot(w.T, x_test):*** Multiplies the transposed weights (w.T) by the test data (x_test). Adding the bias b gives the logits.

***sigmoid(z):*** Applies the sigmoid activation function to the logits, mapping them to probabilities in the range (0, 1).

***z[0, i] > 0.5:*** In the code, z holds the sigmoid output. If this probability of the positive class (class 1) is greater than 0.5, the prediction is 1; otherwise it is 0.

In [14]:
def predict(w, b, x_test):
    m = x_test.shape[1]
    y_prediction = np.zeros((1, m))
    z = sigmoid(np.dot(w.T, x_test) + b)

    for i in range(z.shape[1]):
        y_prediction[0, i] = 1 if z[0, i] > 0.5 else 0

    return y_prediction
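
The thresholding loop can also be expressed as a single array comparison. This is a sketch of an equivalent vectorized form; `predict_vectorized` and its `threshold` parameter are hypothetical additions, and the weights and inputs are toy values.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict_vectorized(w, b, X, threshold=0.5):
    """Return a (1, m) array of 0.0/1.0 labels without an explicit loop."""
    probabilities = sigmoid(np.dot(w.T, X) + b)
    # Comparison gives booleans; astype(float) turns them into 0.0 / 1.0
    return (probabilities > threshold).astype(float)

w = np.array([[2.0], [-1.0]])
b = 0.0
X = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 1.0]])   # three samples as columns
print(predict_vectorized(w, b, X))
```

The third sample has a score of exactly 0, so its probability is 0.5 and, with a strict `>` comparison, it is assigned to class 0, the same tie-breaking as in the loop version.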

**Logistic Regression**

**logistic_regression(x_train, y_train, x_test, y_test, learning_rate=0.01, num_iterations=1000):** Trains the logistic regression model on the training data with a learning rate of 0.01 for 1000 iterations, then prints the accuracy on both the training and test sets.

In [15]:
def logistic_regression(x_train, y_train, x_test, y_test, learning_rate=0.01, num_iterations=1000):
    dimension = x_train.shape[0]
    w, b = initialize_weights_and_bias(dimension)
    parameters, gradients, costs = update(w, b, x_train, y_train, learning_rate, num_iterations)

    y_prediction_test = predict(parameters["weight"], parameters["bias"], x_test)
    y_prediction_train = predict(parameters["weight"], parameters["bias"], x_train)

    print(f"Train accuracy: {100 - np.mean(np.abs(y_prediction_train - y_train)) * 100}%")
    print(f"Test accuracy: {100 - np.mean(np.abs(y_prediction_test - y_test)) * 100}%")

# Run logistic regression
logistic_regression(x_train, y_train, x_test, y_test, learning_rate=0.01, num_iterations=1000)

Cost after iteration 0: 0.6923864756917433
Cost after iteration 100: 0.6651584286832788
Cost after iteration 200: 0.640772226400971
Cost after iteration 300: 0.6184354604101723
Cost after iteration 400: 0.5978537847391072
Cost after iteration 500: 0.5788432059290065
Cost after iteration 600: 0.5612519202377051
Cost after iteration 700: 0.544944781830411
Cost after iteration 800: 0.5297999882891344
Cost after iteration 900: 0.5157079094134956
Train accuracy: 90.47619047619048%
Test accuracy: 88.37209302325581%
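
Accuracy alone does not show how many malignant tumors were missed, which matters in a diagnostic setting. As a hedged sketch, the helper below derives confusion-matrix counts, precision and recall from 0/1 arrays. `binary_metrics` is a hypothetical name, and the labels and predictions are made up rather than taken from the run above.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and basic metrics for 0/1 labels."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()   # accepts the (1, m) shape from predict()
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / y_true.size,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical labels and predictions, not the notebook's actual results
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([[1, 0, 0, 0, 0, 0, 0, 1]])
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this made-up case the accuracy is 62.5% while recall is only one third, meaning two of the three positive cases were missed. Calling the helper with `y_test` and the output of `predict()` would give the corresponding breakdown for the trained model.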
