Predict the class of breast cancer (malignant or ‘bad’ versus benign or ‘good’) from the features of images taken from breast samples. The dataset has eleven columns: a sample id, nine attributes of the cancer cell nuclei calculated from the images and scored from 1 to 10, and the class label, as described below:
| Attribute | Domain |
|---|---|
| 1. Sample code number | id number |
| 2. Clump Thickness | 1 - 10 |
| 3. Uniformity of Cell Size | 1 - 10 |
| 4. Uniformity of Cell Shape | 1 - 10 |
| 5. Marginal Adhesion | 1 - 10 |
| 6. Single Epithelial Cell Size | 1 - 10 |
| 7. Bare Nuclei | 1 - 10 |
| 8. Bland Chromatin | 1 - 10 |
| 9. Normal Nucleoli | 1 - 10 |
| 10. Mitoses | 1 - 10 |
| 11. Class | 2 for benign, 4 for malignant |

The data can be found here.

In [20]:
# Importing Essential modules
import pandas as pd 
import numpy as np 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

### Feature preparation

In [2]:
df = pd.read_csv('data/cancer.data', header=None)
df.head()

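|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |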


In [3]:
column_names = {0:'id number', 1:"Clump Thickness", 2:"Uniformity of Cell Size", 3:"Uniformity of Cell Shape", 4: "Marginal Adhesion", 5:"Single Epithelial Cell Size",
              6:"Bare Nuclei", 7:"Bland Chromatin", 8:"Normal Nucleoli", 9:"Mitosesi", 10:"Class"}
df = df.rename(columns=column_names)
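
The loading and renaming steps can also be combined into a single call. The following is a minimal sketch, assuming the file is comma-separated, has no header row, and marks missing values with `?`. The in-memory `sample_csv` and the `sample_df` name are illustrative stand-ins; the first row is copied from the output above and the second row is made up to show a missing value.

```python
import io
import pandas as pd

# Column names follow the attribute table at the top of the notebook.
names = ["id number", "Clump Thickness", "Uniformity of Cell Size",
         "Uniformity of Cell Shape", "Marginal Adhesion",
         "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
         "Normal Nucleoli", "Mitoses", "Class"]

# Illustrative stand-in for 'data/cancer.data' (assumption: same layout).
# The second row is invented and contains a '?' in the "Bare Nuclei" field.
sample_csv = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "9999999,8,4,5,1,2,?,7,3,1,4\n"
)

# names= labels the columns on load; na_values='?' turns '?' into NaN,
# so "Bare Nuclei" is parsed as a numeric column instead of object.
sample_df = pd.read_csv(sample_csv, header=None, names=names, na_values="?")
print(sample_df.dtypes["Bare Nuclei"])
print(sample_df["Bare Nuclei"].isna().sum())
```

Reading `?` as NaN at load time would make a later string-to-number conversion of that column unnecessary.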

In [4]:
df.describe()

|   | id number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bland Chromatin | Normal Nucleoli | Mitosesi | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 1071704.0 | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 617095.7 | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 61634.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 870688.5 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 1171710.0 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 1238298.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 13454350.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   id number                    699 non-null    int64 
 1   Clump Thickness              699 non-null    int64 
 2   Uniformity of Cell Size      699 non-null    int64 
 3   Uniformity of Cell Shape     699 non-null    int64 
 4   Marginal Adhesion            699 non-null    int64 
 5   Single Epithelial Cell Size  699 non-null    int64 
 6   Bare Nuclei                  699 non-null    object
 7   Bland Chromatin              699 non-null    int64 
 8   Normal Nucleoli              699 non-null    int64 
 9   Mitosesi                     699 non-null    int64 
 10  Class                        699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB


In [6]:
df.tail(20)

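|   | id number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitosesi | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 679 | 1368882 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 680 | 1369821 | 10 | 10 | 10 | 10 | 5 | 10 | 10 | 10 | 7 | 4 |
| 681 | 1371026 | 5 | 10 | 10 | 10 | 4 | 10 | 5 | 6 | 3 | 4 |
| 682 | 1371920 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 2 | 1 | 2 |
| 683 | 466906 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 684 | 466906 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 685 | 534555 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 686 | 536708 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 687 | 566346 | 3 | 1 | 1 | 1 | 2 | 1 | 2 | 3 | 1 | 2 |
| 688 | 603148 | 4 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |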


In [7]:
df.iloc[:,6].value_counts()

1     402
10    132
2      30
5      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: Bare Nuclei, dtype: int64

In [8]:
df = df.replace('?', np.nan)

In [9]:
df.isnull().sum()

id number                       0
Clump Thickness                 0
Uniformity of Cell Size         0
Uniformity of Cell Shape        0
Marginal Adhesion               0
Single Epithelial Cell Size     0
Bare Nuclei                    16
Bland Chromatin                 0
Normal Nucleoli                 0
Mitosesi                        0
Class                           0
dtype: int64

In [10]:
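# Fill missing values in object columns (only "Bare Nuclei" here).
# Note: index[6] selects the seventh most frequent value, not the mode;
# index[0] would impute with the most frequent value instead.
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].fillna(df[col].value_counts().index[6])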

In [11]:
df.isnull().sum()

id number                      0
Clump Thickness                0
Uniformity of Cell Size        0
Uniformity of Cell Shape       0
Marginal Adhesion              0
Single Epithelial Cell Size    0
Bare Nuclei                    0
Bland Chromatin                0
Normal Nucleoli                0
Mitosesi                       0
Class                          0
dtype: int64
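
The fill step in `In [10]` takes its replacement from `value_counts().index[6]`, which is the seventh most frequent value in the column. If the intent is to impute with the most frequent value, `index[0]` is the entry to use. The following is a minimal sketch of that variant on a toy column; the `toy` frame and its values are illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy column standing in for "Bare Nuclei" (values are illustrative).
toy = pd.DataFrame({"Bare Nuclei": ["1", "1", "10", np.nan, "2", "1", np.nan]})

# value_counts() sorts by frequency (descending) and skips NaN by default,
# so index[0] is the most frequent value.
most_frequent = toy["Bare Nuclei"].value_counts().index[0]

# Fill only this column, leaving any other columns untouched.
toy["Bare Nuclei"] = toy["Bare Nuclei"].fillna(most_frequent)
print(most_frequent)
print(toy["Bare Nuclei"].isna().sum())
```

Switching the notebook to this variant would change the imputed values, so the scores reported further down might shift slightly.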

In [12]:
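# Convert object (string) columns to integer codes.
# Note: LabelEncoder orders string labels lexicographically,
# so the value '10' is encoded as 1, between '1' (0) and '2' (2).
le = LabelEncoder()
for col in df.columns:
    if df[col].dtypes == 'object':
        df[col]=le.fit_transform(df[col])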

In [13]:
df.dtypes

id number                      int64
Clump Thickness                int64
Uniformity of Cell Size        int64
Uniformity of Cell Shape       int64
Marginal Adhesion              int64
Single Epithelial Cell Size    int64
Bare Nuclei                    int64
Bland Chromatin                int64
Normal Nucleoli                int64
Mitosesi                       int64
Class                          int64
dtype: object
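
`LabelEncoder` is designed for class labels and assigns codes in sorted label order. For digit strings that order is lexicographic, which does not preserve the 1 - 10 scale of "Bare Nuclei". The sketch below contrasts it with `pd.to_numeric` on a few illustrative values; it is an alternative to consider, not the approach used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative string values, as they appear in an object column.
values = pd.Series(["1", "2", "10", "3"])

# LabelEncoder sorts the distinct labels; for strings the order is
# lexicographic, so "10" lands between "1" and "2".
encoded = LabelEncoder().fit_transform(values)

# pd.to_numeric parses the strings and keeps the original scale.
numeric = pd.to_numeric(values)

print(list(encoded))
print(list(numeric))
```

With `pd.to_numeric` the largest score stays the largest number, which matters for a model that treats the feature as ordinal.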

In [14]:
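# Convert the DataFrame to a NumPy array and separate features from the target.
# Note: columns 0-9 include the id number (column 0) as a feature,
# and the scaler is fitted on all rows before the train/test split.
df = df.values
X, y = df[:,0:10], df[:,10]
scaler = MinMaxScaler(feature_range=(0, 1))
X_transformed = scaler.fit_transform(X)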

In [15]:
# Splitting the data into test and train
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=.36, random_state=42)
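
In the cells above the id number is kept as a feature and the scaler is fitted before the split, so the scaling ranges are computed partly from test rows. A common alternative is to drop the id column, split first, and wrap the scaler and the classifier in a pipeline. The following is a minimal sketch of that arrangement on randomly generated stand-in data; the names `data`, `X_demo`, `y_demo`, and `model` are hypothetical and the rule used to generate labels is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the dataset: an id column, nine 1-10 features,
# and a 2/4 class label (all values randomly generated for illustration).
rng = np.random.default_rng(0)
n = 200
ids = rng.integers(100_000, 2_000_000, size=(n, 1))
features = rng.integers(1, 11, size=(n, 9))
labels = np.where(features.mean(axis=1) > 5.5, 4, 2)
data = np.hstack([ids, features, labels.reshape(-1, 1)])

# Skip column 0 (the id) so it is not used as a predictor.
X_demo, y_demo = data[:, 1:10], data[:, 10]

# Split first; the pipeline then fits the scaler on the training part only.
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.36, random_state=42, stratify=y_demo)
model = make_pipeline(MinMaxScaler(), LogisticRegression())
model.fit(Xtr, ytr)
print(Xtr.shape, Xte.shape)
```

Here `stratify=y_demo` keeps the class proportions similar in both parts, which is an optional addition that the original split does not use.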

### Build logistic model

In [16]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [17]:
y_pred = logreg.predict(X_test)

In [18]:
print(f'Accuracy of logistic regression classifier: {logreg.score(X_test, y_test)}')

Accuracy of logistic regression classifier: 0.9642857142857143


In [19]:
col_index= ['Benign', 'Malignant']
pd.DataFrame(data=confusion_matrix(y_test, y_pred), index= col_index, columns= col_index)

|   | Benign | Malignant |
|---|---|---|
| Benign | 165 | 4 |
| Malignant | 5 | 78 |


In [22]:
print('Classification Report')
print('\n')
print(classification_report(y_test, y_pred, target_names= col_index))

Classification Report


              precision    recall  f1-score   support

      Benign       0.97      0.98      0.97       169
   Malignant       0.95      0.94      0.95        83

    accuracy                           0.96       252
   macro avg       0.96      0.96      0.96       252
weighted avg       0.96      0.96      0.96       252
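
The figures in the report can be checked by hand against the confusion matrix printed earlier. The sketch below recomputes accuracy, precision, recall, and F1 with NumPy, assuming rows are true classes and columns are predicted classes in the order Benign, Malignant (the layout `confusion_matrix` produces for the sorted labels 2 and 4).

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes,
# columns are predicted classes, in the order [Benign, Malignant].
cm = np.array([[165, 4],
               [5, 78]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class, over column (predicted) totals
recall = np.diag(cm) / cm.sum(axis=1)     # per class, over row (true) totals
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))
print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
```

For example, Malignant recall is 78 / (5 + 78), about 0.94, meaning 5 of the 83 malignant samples in the test set were predicted as benign.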

