## Breast Cancer Prediction

### Data Description

<br>**1) ID number**
<br>**2) Diagnosis** (M = malignant, B = benign)
<br>Ten real-valued features are computed for each cell nucleus:

<br>**a) radius** (mean of distances from center to points on the perimeter)
<br>**b) texture** (standard deviation of gray-scale values)
<br>**c) perimeter**
<br>**d) area**
<br>**e) smoothness** (local variation in radius lengths)
<br>**f) compactness** (perimeter^2 / area - 1.0)
<br>**g) concavity** (severity of concave portions of the contour)
<br>**h) concave points** (number of concave portions of the contour)
<br>**i) symmetry**
<br>**j) fractal dimension** ("coastline approximation" - 1)
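
To make the compactness formula concrete, here is a minimal sketch that applies perimeter^2 / area - 1.0 to two idealized shapes. The helper name `compactness` and the shapes are illustrative assumptions; the values in the data set are computed from digitized images and need not be on this scale.

```python
import math

def compactness(perimeter, area):
    """Compactness as written in the list above: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# A circle of radius r has perimeter 2*pi*r and area pi*r^2.
r = 3.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# A square of side s has perimeter 4*s and area s^2.
s = 3.0
square = compactness(4 * s, s ** 2)

print("circle:", round(circle, 3))  # equals 4*pi - 1 for any radius
print("square:", round(square, 3))  # equals 16 - 1 for any side length
```

The result does not depend on the size of the shape, only on its outline, and a circle gives the smallest possible value. That is why compactness describes shape rather than size.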

In [None]:
#Importing necessary libraries.
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt 
from sklearn.linear_model import LogisticRegression
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
import os

In [None]:
#Importing warnings library so as to remove warnings from the output.
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
#Reading the data set.
bcancer = pd.read_csv("../input/data.csv")
bcancer.head(4)

In [None]:
#Checking the dimensions of the data set.
bcancer.shape

In [None]:
#Checking the data types of the variables.
bcancer.dtypes

In [None]:
#Dropping the 'Unnamed: 32' column, which is not one of the attributes described above.
bcancer.drop('Unnamed: 32', axis = 1, inplace = True)

In [None]:
#Dropping the ID column as it is not important.
bcancer.drop('id',axis = 1, inplace=True)

In [None]:
#checking for missing values.
bcancer.isnull().sum()

In [None]:
#Checking the summary statistics of the object attribute 'diagnosis'.
bcancer.describe(include='object')

In [None]:
#Checking the summary statistics of the numeric attributes.
bcancer.describe()

In [None]:
#Comparing the mean of every numeric attribute between the two diagnosis groups.
bcancer.groupby('diagnosis').mean()

In [None]:
#Converting the target variable in to integer where B is 0 and M is 1.
bcancer['diagnosis'] = (bcancer['diagnosis'] == 'M').astype('int')

In [None]:
#Checking the count of malignant as well as benign observations.
bcancer['diagnosis'].value_counts()
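
The class counts also give a baseline for the accuracy figures reported later: a model that always predicts the majority class is already right for that share of the observations. The sketch below assumes the counts published for the Wisconsin Diagnostic Breast Cancer data set (357 benign, 212 malignant); if this CSV differs, substitute the output of `value_counts()`.

```python
# Class counts assumed from the published data set description, not read from the CSV.
counts = {0: 357, 1: 212}  # 0 = benign, 1 = malignant

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print("Majority class:", majority_class)
print("Baseline accuracy: {:.3f}".format(baseline_accuracy))
```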

In [None]:
#Visualizing the count of values in diagnosis variable.
sns.countplot(x='diagnosis',data = bcancer,palette='hls')
plt.show()

In [None]:
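#Checking the distribution of the variable 'diagnosis'.
#histplot is used because distplot is deprecated in recent seaborn releases.
sns.histplot(bcancer['diagnosis'], kde=True, stat='density')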

In [None]:
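#Checking the distribution of the attribute radius mean.
#histplot is used because distplot is deprecated in recent seaborn releases.
sns.histplot(bcancer['radius_mean'], kde=True, stat='density')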

In [None]:
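#Checking the distribution of the attribute perimeter mean.
#histplot is used because distplot is deprecated in recent seaborn releases.
sns.histplot(bcancer['perimeter_mean'], kde=True, stat='density')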

In [None]:
#Nucleus features vs diagnosis
features_mean=list(bcancer.columns[1:11])
# split dataframe into two based on diagnosis
bcancer_M = bcancer[bcancer['diagnosis'] == 1]
bcancer_B = bcancer[bcancer['diagnosis'] == 0]

In [None]:
#Generating a scatter plot matrix with the "mean" columns
cols = ['diagnosis',
        'radius_mean', 
        'texture_mean', 
        'perimeter_mean', 
        'area_mean', 
        'smoothness_mean', 
        'compactness_mean', 
        'concavity_mean', 
        'symmetry_mean']

sns.pairplot(data = bcancer[cols], hue = 'diagnosis', palette = 'RdBu')

There are almost perfectly linear patterns between the radius, perimeter and area attributes, which hint at multicollinearity between these variables. Another set of variables that possibly implies multicollinearity is concavity, concave points and compactness.

I have also generated a correlation matrix in the cells below to show the multicollinearity among the variables.
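
The near-linear pattern among radius, perimeter and area is what geometry predicts: for a roughly circular nucleus the perimeter is about 2πr and the area about πr². The sketch below uses hypothetical radii and a hand-written `pearson` helper (both are illustrative assumptions, not part of the notebook) to show how strongly such derived quantities correlate with the radius.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical radii, chosen only to illustrate the relationship.
radii = [7.0, 10.0, 12.5, 15.0, 18.0, 22.0, 28.0]
perimeters = [2 * math.pi * r for r in radii]  # exactly linear in the radius
areas = [math.pi * r ** 2 for r in radii]      # quadratic in the radius

r_perimeter = pearson(radii, perimeters)
r_area = pearson(radii, areas)

print("radius vs perimeter: {:.4f}".format(r_perimeter))
print("radius vs area:      {:.4f}".format(r_area))
```

Because perimeter is an exact multiple of the radius, its correlation is 1 up to rounding; area is not linear in the radius, yet over a realistic range it still correlates very strongly.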

In [None]:
#Plotting stacked, normalized histograms of the ten "mean" features, split by diagnosis.
plt.rcParams.update({'font.size': 8})
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(8,10))
axes = axes.ravel()
for idx,ax in enumerate(axes):
    feature = features_mean[idx]
    #Using 50 equal-width bins spanning the full range of the feature.
    bins = np.linspace(bcancer[feature].min(), bcancer[feature].max(), 51)
    #density=True replaces the 'normed' argument, which newer matplotlib versions no longer accept.
    ax.hist([bcancer_M[feature],bcancer_B[feature]], bins=bins, alpha=0.5, stacked=True,
            density=True, label=['M','B'], color=['r','g'])
    ax.legend(loc='upper right')
    ax.set_title(feature)
plt.tight_layout()
plt.show()

Here we can see that the mean values of cell radius, perimeter, area, compactness, concavity and concave points can be used to classify the cancer. Larger values of these parameters tend to be associated with malignant tumors.
The mean values of texture, smoothness, symmetry and fractal dimension do not show a particular preference for one diagnosis over the other. None of the above histograms shows noticeably large outliers that would require further cleanup.

## Building the logistic model with all the attributes.

In [None]:
x=bcancer.iloc[:,1:31]
y=bcancer['diagnosis']
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = .2, random_state=10) 

In [None]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
#Displaying all the columns.
pd.options.display.max_columns = None
x_train.head(3)

In [None]:
#Creating an instance of logistic regression model
from sklearn.linear_model import LogisticRegression
logistic_model1 = LogisticRegression()

#We fit our model to data
fitted_model1 = logistic_model1.fit(x_train,y_train)

#We use predict() to predict the class labels (0 = benign, 1 = malignant)
predictedvalues1 = fitted_model1.predict(x_test)

#We print the predicted labels to take a glance
print(predictedvalues1)

In [None]:
#Checking the accuracy of the above model.
print('Accuracy of logistic regression classifier on test set: {:.3f}'.format(logistic_model1.score(x_test, y_test)))

In [None]:
#Generating the confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix1 = confusion_matrix(y_test,predictedvalues1)
print(confusion_matrix1)

In [None]:
#Calculating sensitivity and specificity
#Rows of the confusion matrix are the true classes and columns are the predicted classes,
#so with labels 0 (benign) and 1 (malignant) the layout is [[TN, FP], [FN, TP]].
tn1, fp1, fn1, tp1 = confusion_matrix1.ravel()

sensitivity1 = tp1/(tp1+fn1)
print('Sensitivity : ', sensitivity1 )

specificity1 = tn1/(tn1+fp1)
print('Specificity : ', specificity1)
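
scikit-learn's `confusion_matrix` puts the true classes on the rows and the predicted classes on the columns, so with labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The following sketch, using hypothetical counts rather than this notebook's output and an illustrative helper named `binary_rates`, shows which cells go into each rate. Dividing down a column instead of across a row yields the predictive values, not sensitivity or specificity.

```python
def binary_rates(cm):
    """Compute rates from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return {
        "sensitivity": tp / (tp + fn),  # share of true malignant cases that are caught
        "specificity": tn / (tn + fp),  # share of true benign cases that are cleared
        "precision": tp / (tp + fp),    # share of malignant predictions that are right
        "npv": tn / (tn + fn),          # share of benign predictions that are right
    }

# Hypothetical counts for illustration only.
example_cm = [[70, 5], [4, 35]]
rates = binary_rates(example_cm)

for name, value in rates.items():
    print("{:<12} {:.3f}".format(name, value))
```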

In [None]:
#Generating the roc and calculating the auc.
fpr, tpr, thresholds = roc_curve(y_test, predictedvalues1)

fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for breast cancer classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
print("Area under the curve: {:.3f}".format(auc(fpr,tpr)))
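
The ROC cell above passes hard 0/1 predictions to `roc_curve`, which leaves only one threshold between the two corners of the plot. A ROC curve is usually drawn from continuous scores; with scikit-learn that would be the malignant-class column of `predict_proba(x_test)`. The sketch below uses hypothetical labels and scores, with hand-written helpers (`roc_points` and `trapezoid_auc` are illustrative names), to show how the two inputs can give different areas.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists with one point per distinct score threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    fpr, tpr = zip(*points)
    return list(fpr), list(tpr)

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the ROC points."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.10, 0.20, 0.35, 0.60, 0.40, 0.70, 0.80, 0.90]
labels = [1 if p >= 0.5 else 0 for p in proba]  # what predict() would return

auc_proba = trapezoid_auc(*roc_points(y_true, proba))
auc_labels = trapezoid_auc(*roc_points(y_true, labels))

print("AUC from probabilities: {:.4f}".format(auc_proba))
print("AUC from hard labels:   {:.4f}".format(auc_labels))
```

With hard labels the curve is a single bend at the 0.5 cut-off, so the area reflects only that one operating point rather than the ranking quality of the model.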

In [None]:
#Generating the classification report.
from sklearn.metrics import classification_report
print(classification_report(y_test, predictedvalues1))

In [None]:
#Generating the correlation matrix.
fig, ax=plt.subplots(figsize=(20,20))
correlation=bcancer.corr()
sns.heatmap(correlation,square=True, vmin=-0.2, vmax=0.8,cmap="YlGnBu", annot=True)

As seen in the heatmap above, radius_mean, perimeter_mean, area_mean, radius_worst and perimeter_worst are highly correlated with one another.

We can also see multicollinearity between the "mean" columns and the "worst" columns. For instance, the radius_mean column has a correlation of 0.97 with the radius_worst column. In fact, each of the 10 key attributes displays a very high correlation (from 0.7 up to 0.97) between its "mean" and "worst" columns. This suggests that the "worst" columns carry largely the same information as the "mean" columns; each "worst" value is itself a mean, taken over the three largest values among all observations. Therefore, I think we can discard the "worst" columns from our analysis and focus only on the "mean" columns.
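
To illustrate why a "mean" column and its "worst" counterpart move together, the sketch below computes both summaries from one hypothetical set of per-nucleus measurements. The helper `summarize` and the numbers are assumptions for illustration; "worst" is taken as the mean of the three largest values, as described above.

```python
from statistics import mean

def summarize(values):
    """Return the 'mean' and 'worst' summaries of one feature for one image."""
    top_three = sorted(values, reverse=True)[:3]
    return {"mean": mean(values), "worst": mean(top_three)}

# Hypothetical radii of the nuclei found in a single image.
nuclei_radii = [11.2, 12.8, 13.1, 14.0, 15.5, 16.3, 19.7]
summary = summarize(nuclei_radii)

print("mean:  {:.3f}".format(summary["mean"]))
print("worst: {:.3f}".format(summary["worst"]))
```

Both numbers are averages over overlapping subsets of the same measurements, so an image with larger nuclei raises both at once.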

So here we will drop all "worst" columns from the dataset, and pick only one of the three attributes that describe the size of cells.

A cell's **radius** is the basic building block of its size, so it is reasonable to choose radius as the attribute that represents the size of a cell.

There is also multicollinearity among the compactness, concavity and concave points attributes. As with the size attributes, we should pick only one of these three attributes, which contain information on the shape of the cell. We will keep the compactness attribute, as it somewhat describes the size of the cell, and remove the concavity attribute.

### So now we will remove the unnecessary columns.

In [None]:
#Dropping the "worst" columns (the list below covers eight of them).
cols = ['radius_worst', 
        'texture_worst', 
        'perimeter_worst', 
        'area_worst', 
        'smoothness_worst', 
        'compactness_worst', 
        'concavity_worst', 
        'symmetry_worst']      
bcancer = bcancer.drop(cols, axis = 1)

In [None]:
#Dropping the perimeter and area attributes.
cols1 = ['area_se', 'perimeter_se', 'perimeter_mean', 'area_mean']
bcancer = bcancer.drop(cols1, axis = 1)

In [None]:
#Dropping the concavity attributes.
cols2 = ['concavity_mean', 'concavity_se']
bcancer = bcancer.drop(cols2, axis = 1)

In [None]:
#Drawing the heatmap again, with the new correlation matrix
corr = bcancer.corr().round(2)
mask = np.zeros_like(corr, dtype=bool)  #np.bool was removed from recent NumPy versions
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(corr, mask=mask, cmap='YlGnBu', vmin=-1, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)
plt.tight_layout()

## Rebuilding the logistic regression model.

In [None]:
x = bcancer.drop('diagnosis', axis = 1)
y = bcancer['diagnosis']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=20)

In [None]:
#Checking the shape of train as well as test data.
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
#Displaying all the columns.
pd.options.display.max_columns = None
x_train.head(3)

In [None]:
#Creating an instance of logistic regression model
from sklearn.linear_model import LogisticRegression
logistic_model2 = LogisticRegression()

#Fitting our model to data
fitted_model2 = logistic_model2.fit(x_train,y_train)

#We use predict() to predict the class labels (0 = benign, 1 = malignant)
predictedvalues2 = fitted_model2.predict(x_test)

#We print the predicted labels to take a glance
print(predictedvalues2)

In [None]:
#Checking the accuracy of the above model.
print('Accuracy of logistic regression classifier on test set: {:.3f}'.format(logistic_model2.score(x_test, y_test)))

In [None]:
#Generating the confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix2 = confusion_matrix(y_test,predictedvalues2)
print(confusion_matrix2)

In [None]:
#Calculating sensitivity and specificity
#Rows of the confusion matrix are the true classes and columns are the predicted classes,
#so with labels 0 (benign) and 1 (malignant) the layout is [[TN, FP], [FN, TP]].
tn2, fp2, fn2, tp2 = confusion_matrix2.ravel()

sensitivity2 = tp2/(tp2+fn2)
print('Sensitivity : ', sensitivity2 )

specificity2 = tn2/(tn2+fp2)
print('Specificity : ', specificity2)

In [None]:
#Generating the roc and calculating the auc.
fpr, tpr, thresholds = roc_curve(y_test, predictedvalues2)

fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for breast cancer classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
print("Area under the curve: {:.3f}".format(auc(fpr,tpr)))

In [None]:
#Generating the classification report.
from sklearn.metrics import classification_report
print(classification_report(y_test, predictedvalues2))

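Here we can see that after eliminating the multicollinear attributes the accuracy decreased and the AUC dropped from 0.945 to 0.909. Note that the two models were evaluated on different train/test splits (random_state=10 versus random_state=20), so part of the difference may come from the split rather than from the removed attributes.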