# Logistic Regression on the Wisconsin Breast Cancer Dataset

In [2]:
from sklearn.datasets import load_breast_cancer

In [3]:
data = load_breast_cancer()

In [15]:
print(dir(data))

['DESCR', 'data', 'data_module', 'feature_names', 'filename', 'frame', 'target', 'target_names']


## Business Understanding

In [12]:
print(data.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

Each sample contains features of the cell nuclei (plural of nucleus) measured in an image of one patient's breast mass. Malignant cases are cancerous and potentially dangerous. Benign cases are considered harmless.

## Data Understanding

### Description of the data

There are in total 569 samples, each containing 30 numeric features. 212 samples are labeled as malignant (0) and the remaining 357 samples are considered benign (1).
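
The counts above can be checked directly against the loaded dataset. The following is a minimal sketch; the variable names `n_samples`, `n_features`, and `counts` are chosen here for illustration and do not appear elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Number of rows (samples) and columns (features) in the feature matrix
n_samples, n_features = data.data.shape

# Count how often each class label occurs; index 0 is malignant, index 1 is benign
counts = np.bincount(data.target)

print("Samples:", n_samples, "Features:", n_features)
for name, count in zip(data.target_names, counts):
    print(name, count)
```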

The first 10 features contain the mean values of the nuclei characteristics. The next 10 features contain the standard errors of those characteristics. The final 10 features contain the "worst" values, that is, the mean of the three largest values of each characteristic.
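
To make this grouping concrete, the feature names can be sliced into three blocks of ten. This is only a sketch; the names `mean_feats`, `error_feats`, and `worst_feats` are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer

names = load_breast_cancer().feature_names

# Columns 0-9: means, columns 10-19: standard errors, columns 20-29: "worst" values
mean_feats = names[:10]
error_feats = names[10:20]
worst_feats = names[20:]

print(mean_feats[0], "|", error_feats[0], "|", worst_feats[0])
```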

In [23]:
import numpy as np

The cell below is a quick check for missing values. If the dataset contains a `NaN`, the sum over all values is also `NaN`, so an output of `True` means that at least one value is missing.

In [37]:
np.isnan(np.sum(data.data))

False

The output is `False`. Therefore, there are no `NaN` values in the dataset.
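
The sum-based check only reports whether a `NaN` exists somewhere. If one did, a per-column count would show where. The sketch below uses a small hypothetical matrix, `X_demo`, rather than the real dataset, which has no missing values.

```python
import numpy as np

# Hypothetical toy matrix with one missing value, for illustration only
X_demo = np.array([[1.0, 2.0],
                   [np.nan, 4.0],
                   [5.0, 6.0]])

# True if any entry is NaN
has_nan = np.isnan(X_demo).any()

# Number of NaN entries in each column
nan_per_column = np.isnan(X_demo).sum(axis=0)

print(has_nan, nan_per_column)
```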

### Feature characteristics

For every feature, the mean and standard deviation are calculated below.

In [87]:
X = data.data

for i in range(X.shape[1]):
    mean = round(np.mean(X[:, i]), 2)
    std = round(np.std(X[:, i]), 2)
    print(data.feature_names[i])
    print("Mean:", mean)
    print("Standard deviation:", std)
    print()

mean radius
Mean: 0.34
Standard deviation: 0.17

mean texture
Mean: 0.32
Standard deviation: 0.15

mean perimeter
Mean: 0.33
Standard deviation: 0.17

mean area
Mean: 0.22
Standard deviation: 0.15

mean smoothness
Mean: 0.39
Standard deviation: 0.13

mean compactness
Mean: 0.26
Standard deviation: 0.16

mean concavity
Mean: 0.21
Standard deviation: 0.19

mean concave points
Mean: 0.24
Standard deviation: 0.19

mean symmetry
Mean: 0.38
Standard deviation: 0.14

mean fractal dimension
Mean: 0.27
Standard deviation: 0.15

radius error
Mean: 0.11
Standard deviation: 0.1

texture error
Mean: 0.19
Standard deviation: 0.12

perimeter error
Mean: 0.1
Standard deviation: 0.1

area error
Mean: 0.06
Standard deviation: 0.08

smoothness error
Mean: 0.18
Standard deviation: 0.1

compactness error
Mean: 0.17
Standard deviation: 0.13

concavity error
Mean: 0.08
Standard deviation: 0.08

concave points error
Mean: 0.22
Standard deviation: 0.12

symmetry error
Mean: 0.18
Standard deviation: 0.12

fract

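The raw feature values span several orders of magnitude (from about 1e-2 or smaller up to 1e2 or larger). Rescaling each feature to the range [0, 1] is therefore advisable: it helps gradient-based fitting converge and keeps regularization from penalizing features unevenly. Note that the statistics printed above all lie between 0 and 1, which suggests that the cell was run after `X` had already been rescaled.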
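
As an illustration of the rescaling described here, min-max normalization maps each value `x` of a feature to `(x - min) / (max - min)`. The sketch below applies it to a fresh copy of the raw data; `X_raw`, `col_min`, `col_range`, and `X_scaled` are names assumed for this example only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load a fresh copy so the raw values are guaranteed to be unscaled
X_raw = load_breast_cancer().data

# Spread of the raw per-feature means, to show the difference in scale
feature_means = X_raw.mean(axis=0)
print("Smallest feature mean:", feature_means.min())
print("Largest feature mean:", feature_means.max())

# Min-max scaling per feature (column): (x - min) / (max - min)
col_min = X_raw.min(axis=0)
col_range = np.ptp(X_raw, axis=0)  # max - min for every column
X_scaled = (X_raw - col_min) / col_range

print("Scaled range:", X_scaled.min(), "to", X_scaled.max())
```

Scaling the full dataset like this is only meant to show the formula. For modeling, the minimum and range should come from the training data alone.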

## Data Preparation

The data is already largely prepared. The only transformation left is normalization. Before normalizing, the data is shuffled and the test set is separated from the combined training and validation set.

In [104]:
import pandas as pd

In [105]:
df = pd.DataFrame(data=X, columns=data.feature_names)

In [112]:
df["target"] = data.target

In [114]:
n = len(df)
idx = np.arange(n)
np.random.seed(42)
np.random.shuffle(idx)

In [115]:
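split_trainval_test = 0.2  # fraction of samples held out for testing
n_test = int(n * split_trainval_test)
# Index with the shuffled positions; plain slicing would ignore the shuffle
df_test = df.iloc[idx[:n_test]]
df_trainval = df.iloc[idx[n_test:]]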
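
As an alternative to shuffling by hand, scikit-learn's `train_test_split` can shuffle and split in one call. The sketch below is not the method used in this notebook; it additionally assumes a stratified split (`stratify=y`), which keeps the malignant/benign ratio roughly equal in both parts.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

# Hold out 20% of the samples as the test set.
# random_state fixes the shuffle; stratify=y preserves the class proportions.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_trainval.shape, X_test.shape)
print("Benign fraction (all / test):", y.mean(), y_test.mean())
```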

Below, each feature is min-max scaled to the range [0, 1].

In [65]:
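# Min-max scale each feature with statistics from the train/validation rows only
X_trainval = df_trainval.drop(columns="target").to_numpy()
col_min = X_trainval.min(axis=0)
X_trainval = (X_trainval - col_min) / np.ptp(X_trainval, axis=0)

In [101]:
# Split off a validation set of the same size as the test set
X_val = X_trainval[:n_test]
X_train = X_trainval[n_test:]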
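
One way to avoid leaking information from held-out data is to compute the scaling parameters on the training rows only and reuse them for the other sets. The sketch below uses scikit-learn's `MinMaxScaler` for this, which is a swapped-in alternative to the manual loop above. The arrays are hypothetical random data, not the breast cancer features.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in data: 80 training rows and 20 validation rows, 3 features
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 100, size=(80, 3))
X_val = rng.uniform(0, 100, size=(20, 3))

# Learn per-feature min and max from the training rows only
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same min and max to the validation rows
X_val_scaled = scaler.transform(X_val)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Because the validation rows are scaled with the training minimum and maximum, their scaled values can fall slightly outside [0, 1]. That is expected and not an error.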