<p style="font-size: 40px; color: red;">Predict whether the cancer is benign or malignant</p><br>
Breast Cancer Wisconsin (Diagnostic) [Data Set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data)

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/

It can also be found on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

1) ID number <br>
2) Diagnosis (M = malignant, B = benign) <br>
3-32) Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)<br> 
b) texture (standard deviation of gray-scale values) <br> 
c) perimeter <br> 
d) area <br> 
e) smoothness (local variation in radius lengths) <br> 
f) compactness (perimeter^2 / area - 1.0) <br> 
g) concavity (severity of concave portions of the contour) <br> 
h) concave points (number of concave portions of the contour) <br> 
i) symmetry <br> 
j) fractal dimension ("coastline approximation" - 1)<br> 

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
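
The naming scheme can be reproduced in a few lines. The sketch below rebuilds the 30 feature names from the ten base measurements and the three summary statistics. The spellings follow the column names used in `data.csv` (for example `concave points` with a space), and the variable names are illustrative only.

```python
# Base measurements, spelled as in the data.csv column names.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Each measurement is summarised three ways per image.
suffixes = ["mean", "se", "worst"]

# All "mean" columns come first, then all "se", then all "worst".
feature_columns = [f"{name}_{suffix}" for suffix in suffixes for name in base_features]

print(len(feature_columns))
# Positions 0, 10 and 20 correspond to fields 3, 13 and 23 (after ID and diagnosis).
print(feature_columns[0], feature_columns[10], feature_columns[20])
```

The offset of 10 between the mean, standard error and "worst" versions of a measurement is what makes field 3, field 13 and field 23 all refer to the radius.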

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
%matplotlib inline

## Importing Dataset

In [2]:
data = pd.read_csv('data.csv')

In [3]:
type(data)

pandas.core.frame.DataFrame

## Preliminary Observations

In [4]:
data.tail()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
564,926424,M,21.56,22.39,142.0,1479.0,0.111,0.1159,0.2439,0.1389,...,26.4,166.1,2027.0,0.141,0.2113,0.4107,0.2216,0.206,0.07115,
565,926682,M,20.13,28.25,131.2,1261.0,0.0978,0.1034,0.144,0.09791,...,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,
566,926954,M,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,...,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,
567,927241,M,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,...,39.42,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124,
568,92751,B,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,...,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039,


In [5]:
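# total number of elements (rows x columns = 569 x 33), not the number of rows;
# use len(data) or data.shape[0] for the number of training examples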
data.size

18777

In [6]:
# all possible features(including label)
# label for this dataset is the 'diagnosis' column
data.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

In [7]:
len(data.columns)

33
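
The difference between `size`, `shape` and `len` is easy to mix up, so here is a small sketch on a toy frame (the frame `toy` and its columns are made up for illustration and are not part of the dataset).

```python
import pandas as pd

# Toy frame standing in for the real data: 3 rows, 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape   # (rows, columns)
print(toy.shape)
print(toy.size)              # total number of cells = rows * columns
print(len(toy))              # number of rows only
```

Applied to the notebook's frame, the same relation gives 569 rows x 33 columns = 18777 cells, which is the value printed by `data.size` above.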

### Check for null values

In [8]:
#checking for null values
data.isnull().any().any()     # This returns a boolean value

True

In [9]:
data.isnull().sum().sum()     # This returns an integer of the total number of NaN values

569

In [10]:
data.isnull().any()
# Conclusion: only the last column has null values

id                         False
diagnosis                  False
radius_mean                False
texture_mean               False
perimeter_mean             False
area_mean                  False
smoothness_mean            False
compactness_mean           False
concavity_mean             False
concave points_mean        False
symmetry_mean              False
fractal_dimension_mean     False
radius_se                  False
texture_se                 False
perimeter_se               False
area_se                    False
smoothness_se              False
compactness_se             False
concavity_se               False
concave points_se          False
symmetry_se                False
fractal_dimension_se       False
radius_worst               False
texture_worst              False
perimeter_worst            False
area_worst                 False
smoothness_worst           False
compactness_worst          False
concavity_worst            False
concave points_worst       False
symmetry_w

In [11]:
data.drop('Unnamed: 32', axis=1, inplace=True)

In [12]:
data.isnull().any().any()

False
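
A per-column count makes it easier to see where the missing values sit than a single boolean. The following is a minimal sketch on a toy frame; `toy`, `null_counts` and `empty_columns` are illustrative names, and the all-NaN column only mimics the `Unnamed: 32` column seen above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],  # mimics the empty trailing column
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# Columns in which every single row is missing.
empty_columns = null_counts[null_counts == len(toy)].index.tolist()
print(empty_columns)

# Drop columns that are entirely NaN without naming them explicitly.
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

In the notebook the total of 569 missing values equals the number of rows, which is consistent with exactly one column being empty in every row.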

In [13]:
data.drop('id', axis=1, inplace=True)

In [14]:
data.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


### Converting categorical values to numeric values - for 'diagnosis'

In [15]:
# Since this is a binary classification problem, I will use LabelEncoder.
# LabelEncoder sorts the classes alphabetically, so 'B' -> 0 and 'M' -> 1.
le = LabelEncoder()
le.fit(['M', 'B'])
data['label'] = le.transform(data['diagnosis'])
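
To make the encoding explicit, the short sketch below fits a `LabelEncoder` on the same two class strings and prints the resulting mapping. The sample list `['M', 'B', 'B']` is made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
le_demo.fit(['M', 'B'])

# classes_ is sorted, and the index of each class is its integer code.
print(list(le_demo.classes_))

encoded = le_demo.transform(['M', 'B', 'B'])
print(encoded.tolist())

# inverse_transform maps the integer codes back to the original strings.
decoded = le_demo.inverse_transform(encoded)
print(decoded.tolist())
```

With this mapping, label 1 is the malignant class, which matters when reading the precision and recall figures later on.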

In [16]:
data.drop('diagnosis', axis=1, inplace=True)

In [17]:
data.shape

(569, 31)

In [18]:
data.head(40)

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,label
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,1
5,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,0.07613,...,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,1
6,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,0.1794,0.05742,...,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,1
7,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,0.07451,...,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151,1
8,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,0.07389,...,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072,1
9,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,0.08243,...,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075,1


In [19]:
Y = data['label']

In [20]:
data = data.drop(columns=['label'])  # 'columns=' already implies axis=1

In [21]:
data.head()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


### Scaling

In [22]:
data.iloc[100]

radius_mean                 13.610000
texture_mean                24.980000
perimeter_mean              88.050000
area_mean                  582.700000
smoothness_mean              0.094880
compactness_mean             0.085110
concavity_mean               0.086250
concave points_mean          0.044890
symmetry_mean                0.160900
fractal_dimension_mean       0.058710
radius_se                    0.456500
texture_se                   1.290000
perimeter_se                 2.861000
area_se                     43.140000
smoothness_se                0.005872
compactness_se               0.014880
concavity_se                 0.026470
concave points_se            0.009921
symmetry_se                  0.014650
fractal_dimension_se         0.002355
radius_worst                16.990000
texture_worst               35.270000
perimeter_worst            108.600000
area_worst                 906.500000
smoothness_worst             0.126500
compactness_worst            0.194300
concavity_wo

In [23]:
# The features are on very different scales. For example, perimeter_worst spans:
print(max(data['perimeter_worst']), min(data['perimeter_worst']))

251.2 50.41


In [24]:
# whereas radius_mean spans a much narrower range:
print(max(data['radius_mean']), min(data['radius_mean']))

28.11 6.981


In [25]:
from sklearn.preprocessing import StandardScaler

In [26]:
scale = StandardScaler()

In [27]:
scale.fit(data)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [28]:
scaled_data = scale.transform(data)

In [29]:
scaled_data[39]

array([-1.83840043e-01,  3.56123065e-01, -1.47009200e-01, -2.72149630e-01,
        3.72886904e-01,  4.00994653e-01,  2.19720584e-01,  1.41115115e-01,
       -3.34494429e-01,  1.97385794e-01, -6.93589311e-01, -1.13478761e+00,
       -6.53964756e-01, -4.80013039e-01, -5.58015594e-01, -1.72594657e-01,
       -4.65430550e-02,  1.33638528e-01, -8.19979801e-01, -2.29940453e-01,
       -1.53073295e-01,  5.58190740e-02,  1.15531363e-03, -2.46429703e-01,
        1.25508320e+00,  1.07020937e+00,  1.10732368e+00,  1.69310287e+00,
       -1.51676173e-01,  1.28310797e+00])

In [30]:
scaled_data.shape

(569, 30)
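
What `StandardScaler` does per column is subtract the mean and divide by the standard deviation. The sketch below reproduces that with plain NumPy on a toy array. The first and last rows reuse the minimum and maximum values printed above, while the middle row is a made-up value for illustration.

```python
import numpy as np

# Columns mimic perimeter_worst and radius_mean; the middle row is hypothetical.
X = np.array([
    [50.41, 6.981],
    [120.0, 14.0],
    [251.2, 28.11],
])

mean = X.mean(axis=0)
std = X.std(axis=0)        # population standard deviation (ddof=0), as StandardScaler uses
Z = (X - mean) / std       # z-score of every value within its own column

print(Z.mean(axis=0))      # approximately 0 for each column
print(Z.std(axis=0))       # approximately 1 for each column
```

After this transformation both columns have zero mean and unit variance, so neither dominates distance-based or variance-based steps such as PCA simply because of its units.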

### PCA

In [31]:
from sklearn.decomposition import PCA

In [32]:
pca = PCA(n_components=10)

In [33]:
#applying PCA on the scaled data
pca.fit(scaled_data)

PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [34]:
pca_data = pca.transform(scaled_data)  # returns a NumPy ndarray

In [35]:
pca_data.shape

(569, 10)
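
One way to judge whether a given number of components is reasonable is to look at the cumulative explained variance. The sketch below does this on synthetic data, not on the breast cancer data, so the numbers it prints say nothing about this dataset; `X_demo`, `cumulative` and `n_needed` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 rows, 6 correlated columns built from 2 latent factors plus noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

pca_demo = PCA(n_components=6).fit(X_demo)

# Share of total variance captured by the first k components, for k = 1..6.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(np.round(cumulative, 3))

# Smallest number of components whose cumulative share reaches 95%.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_needed)
```

The same two lines (`np.cumsum` on `explained_variance_ratio_`) can be run on the fitted `pca` object from the cells above to see how much variance the 10 retained components capture.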

In [36]:
# converting the pca applied data into a data frame
df_data = pd.DataFrame(data=pca_data, columns=[f'Component{i}' for i in range(1, 11)])

In [37]:
df_data.head()

Unnamed: 0,Component1,Component2,Component3,Component4,Component5,Component6,Component7,Component8,Component9,Component10
0,9.192837,1.948583,-1.123166,3.633731,-1.19511,1.411424,2.15937,-0.398407,-0.157119,-0.877402
1,2.387802,-3.768172,-0.529293,1.118264,0.621775,0.028656,0.013358,0.240988,-0.711905,1.106995
2,5.733896,-1.075174,-0.551748,0.912083,-0.177086,0.541452,-0.668166,0.097374,0.024066,0.454276
3,7.122953,10.275589,-3.23279,0.152547,-2.960878,3.053422,1.429911,1.059565,-1.405439,-1.116974
4,3.935302,-1.948072,1.389767,2.940639,0.546747,-1.226495,-0.936213,0.636376,-0.263806,0.377704


### Applying LogisticRegression

In [39]:
from sklearn.model_selection import train_test_split

In [40]:
X = df_data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=7)
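
Note that in the cells above the scaler and PCA were fitted on all 569 rows before the split, so the test rows influence the transformation. A common alternative is to split first and wrap the steps in a `Pipeline`, so that scaling and PCA are learned from the training rows only. The sketch below shows that arrangement on synthetic data from `make_classification` (an assumption made to keep the example self-contained); it is not the notebook's method and its score is unrelated to the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the notebook's feature matrix (569 x 30).
X_demo, y_demo = make_classification(
    n_samples=569, n_features=30, n_informative=10, random_state=0
)

# Split first, using the same test fraction and seed as the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", LogisticRegression(C=1.5)),
])
pipe.fit(X_tr, y_tr)                    # scaler and PCA see training rows only
test_accuracy = pipe.score(X_te, y_te)  # test rows are transformed, never fitted on
print(f"test accuracy: {test_accuracy:.3f}")
```

Passing `stratify=y_demo` to `train_test_split` is another option worth considering when the classes are imbalanced, since it keeps the class proportions similar in both splits.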

In [41]:
from sklearn.linear_model import LogisticRegression

In [42]:
lg = LogisticRegression(C=1.5)
#C : float, default: 1.0
#Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

In [43]:
lg.fit(X_train, y_train)

LogisticRegression(C=1.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [44]:
predictions = lg.predict(X_test)

In [45]:
from sklearn.metrics import classification_report

In [46]:
print(classification_report(y_test, predictions))
#The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
#The recall is intuitively the ability of the classifier to find all the positive samples.

             precision    recall  f1-score   support

          0       0.95      1.00      0.97       116
          1       1.00      0.89      0.94        55

avg / total       0.97      0.96      0.96       171



Verify the results using a confusion matrix

In [47]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, predictions)  # use a new name so the imported function is not overwritten
print(cm)

[[116   0]
 [  6  49]]


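The confusion matrix shows 116 + 49 = 165 correct predictions and 6 + 0 = 6 incorrect ones out of 171 test samples. Rows are the true labels and columns the predicted labels, so all 6 errors are malignant cases (label 1) predicted as benign (label 0), i.e. false negatives.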
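
The figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below does so in plain Python using the four counts printed above; the variable names are illustrative.

```python
# Counts from the logistic regression confusion matrix (rows = true, columns = predicted).
tn, fp = 116, 0   # true label 0 (benign)
fn, tp = 6, 49    # true label 1 (malignant)

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Class 1 (malignant) treated as the positive class.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (benign) treated as the positive class.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy: {accuracy:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

The recall of 0.89 for class 1 corresponds directly to the 6 malignant cases out of 55 that were missed, which is usually the costlier kind of error in a diagnostic setting.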

### Applying SVM

In [48]:
from sklearn.svm import LinearSVC

In [121]:
svm = LinearSVC(random_state=0, C=0.1)
svm.fit(X_train, y_train)

LinearSVC(C=0.1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
     verbose=0)

In [122]:
predictions2 = svm.predict(X_test)

In [123]:
print(classification_report(y_test, predictions2))

             precision    recall  f1-score   support

          0       0.96      1.00      0.98       116
          1       1.00      0.91      0.95        55

avg / total       0.97      0.97      0.97       171



In [124]:
from sklearn.metrics import confusion_matrix
confusion_matrix2 = confusion_matrix(y_test, predictions2)
print(confusion_matrix2)

[[116   0]
 [  5  50]]
