# Breast Cancer Data - Statistical Analysis 

-- Rohith R Pai

# 1. Introduction 

![](https://www.cancer.org/cancer/breast-cancer/about/what-is-breast-cancer/_jcr_content/par/image.img.gif/1499806644266.gif)

**Breast Cancer:**

The American Cancer Society defines breast cancer as the condition in which the cells in the breast grow out of control. These cells usually form a tumor that can often be seen on an x-ray or felt as a lump. The tumor is malignant (cancer) if the cells can grow into (invade) surrounding tissues or spread (metastasize) to distant areas of the body. Breast cancer occurs almost entirely in women, but men can get breast cancer, too.

**Breast Cancer Signs and Symptoms:**
    
The most common symptom of breast cancer is a new lump or mass. A painless, hard mass that has irregular edges is more likely to be cancer, but breast cancers can be tender, soft, or rounded. They can even be painful. For this reason, it is important to have any new breast mass or lump or breast change checked by a health care provider experienced in diagnosing breast diseases.

The data set used in this analysis records the diagnoses of lumps and masses that were found in patients. Based on the diagnosis, each tumor or lump is classified as either malignant (denoted by the letter 'M') or benign (denoted by the letter 'B').
    
    

In [None]:
# importing the modules for analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

The data set is imported using the Python pandas module.

In [None]:
df = pd.read_csv('../input/data.csv')

In [None]:
df.info()

In [None]:
df.head()

The data is a 2-D table of 569 rows and 33 columns. We need to trim the unwanted columns before we proceed with the analysis.
It can be observed from df.info() that the column named "Unnamed: 32" has no values in it, and the "id" column has no relevance to the analysis at hand. It is best to remove these columns before we start to visualize the data and decide on a classification algorithm.


In [None]:
cancer_df =df.drop(['id','Unnamed: 32'], axis = 1)
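
The empty column can also be found programmatically instead of by reading the df.info() output. The following is a minimal sketch on a made-up miniature table (the values and the name `toy` are hypothetical), assuming that a column counts as unwanted when every value in it is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw table: an id, a label, one feature,
# and a column that is empty in every row.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing carry no information.
empty_cols = toy.columns[toy.isna().all()].tolist()

# Drop the empty columns together with the identifier column.
toy_clean = toy.drop(columns=empty_cols + ["id"])
print(empty_cols)
print(toy_clean.columns.tolist())
```

This avoids hard-coding the name of the empty column, which may differ if the file is exported differently.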

Before we start visualizing the data, we need to convert the diagnosis labels into numbers (0 and 1 in this case), because the classifiers used later work with numeric targets.

A malignant diagnosis is denoted by 1 and a benign diagnosis by 0.


In [None]:
# One-hot encode 'diagnosis' and drop the first level ('B'), leaving a single
# 'diagnosis_M' column that is 1 for malignant and 0 for benign
cancer_df = pd.get_dummies(cancer_df, columns=['diagnosis'], drop_first=True, dtype=int)
cancer_df.head()
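
An explicit mapping is another way to write the same 0/1 coding, and it makes the meaning of each number visible at a glance. This is only a sketch on made-up labels (the series `labels` is hypothetical), assuming the column contains nothing but 'M' and 'B':

```python
import pandas as pd

# Hypothetical labels in the same 'M'/'B' coding as the diagnosis column.
labels = pd.Series(["M", "B", "B", "M"], name="diagnosis")

# Explicit mapping: malignant -> 1, benign -> 0.
encoded = labels.map({"M": 1, "B": 0})
print(encoded.tolist())
```

Any label that is missing from the mapping would become NaN, which doubles as a quick check for unexpected values.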

# 2. Data visualization

A simple count plot gives a good indication of the split between malignant and benign diagnoses in the data set.

In [None]:
sns.countplot(x='diagnosis',data = df,palette='BrBG')

It can be observed from the above plot that there were roughly 350 cases in which the lump/tumor was diagnosed as benign, and about 200 cases in which it was diagnosed as malignant.
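
The exact counts behind a count plot can be printed rather than read off the axis. A minimal sketch on made-up labels follows; in the notebook the series would presumably be df['diagnosis']:

```python
import pandas as pd

# Made-up labels standing in for the diagnosis column.
diagnosis = pd.Series(["B", "B", "M", "B", "M"])

counts = diagnosis.value_counts()                 # absolute count per class
shares = diagnosis.value_counts(normalize=True)   # fraction of each class
print(counts.to_dict())
print(shares.round(2).to_dict())
```

The class shares matter later on: with an uneven split, accuracy alone can look good even when the minority class is predicted poorly.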

A simple distribution plot of each attribute of the data set:

In [None]:
colors = 'b g r c m y k'.split()  # candidate colors for the histograms
feature_cols = cancer_df.columns[:-1]  # every column except the diagnosis_M label

fig, axes = plt.subplots(nrows=15, ncols=2, sharey=True, figsize=(15, 50))
# Walk the 15 x 2 grid row by row, drawing one histogram per feature
# (histplot replaces the deprecated distplot)
for ax, col in zip(axes.flat, feature_cols):
    sns.histplot(cancer_df[col], kde=False, edgecolor='w', linewidth=2,
                 color=np.random.choice(colors), ax=ax)
axes[0, 0].set_ylim(0, 200)  # sharey=True applies this limit to every panel
plt.tight_layout()

It can be observed that most attributes of this data set follow a more or less normal distribution. Let's try out a few relationship plots between different attributes. The standard deviation of each attribute is shown below.


In [None]:
cancer_df.std()

In [None]:
plt.figure(figsize =(20,6))
sns.barplot(x='radius_mean',y='texture_mean',data =df, hue= 'diagnosis',palette='viridis')
plt.xlabel('Mean Radius of the lump')
plt.ylabel('Texture of the lump')

In [None]:
plt.figure(figsize =(20,6))
sns.barplot(x='perimeter_worst',y='area_worst',data =df, hue= 'diagnosis')


In [None]:
plt.figure(figsize= (10,10), dpi=100)
sns.heatmap(cancer_df.corr()) # plotting the correlation matrix of the dataset

It can be observed from the two relationship plots that the lump attributes have a roughly linear relationship with one another. As seen from the correlation heatmap and the bar plot, the mean radius of the lump varies approximately linearly with the mean texture. The same is the case with almost all the parameters involved in the diagnosis.
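
A heatmap shows the overall pattern, but the most strongly related pairs are easier to read from a sorted list. The sketch below uses synthetic data (the columns `radius`, `perimeter` and `texture` are generated here and are not the notebook's columns) and ranks feature pairs by Pearson correlation, which measures linear association only:

```python
import numpy as np
import pandas as pd

# Synthetic features: perimeter is built from radius, texture is independent noise.
rng = np.random.default_rng(0)
radius = rng.normal(14.0, 3.0, size=200)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0.0, 1.0, size=200),
    "texture": rng.normal(19.0, 4.0, size=200),
})

corr = toy.corr()  # Pearson correlation by default

# Keep only the upper triangle so each pair appears once, then sort.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)
print(pairs)
```

Applied to cancer_df.corr(), the same steps would presumably list which attribute pairs are nearly redundant, which is useful to know before fitting a model.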

# 3. Prediction based on Logistic Regression Model

Logistic regression is a widely used classification algorithm for problems involving binary classification of the data.

Some examples of logistic regression problems are:
1) Spam vs. ham email detection.
2) Loan default chances (yes/no) based on customer transaction data.
3) Disease diagnosis.

The convention for binary classification is to have two classes, 0 and 1. We can't use linear regression on these binary groups because a straight line does not give a good fit: its output is not confined to the range between 0 and 1. Instead we use a sigmoid function, which takes in any real value and gives an output between 0 and 1 that can be read as the probability of class 1.

In the case of breast cancer the diagnosis can be either malignant or benign. Therefore we can use a logistic regression model to train on this data and predict the outcome of a diagnosis.
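
The sigmoid function mentioned above can be written in a few lines. This is an illustrative sketch, not the code scikit-learn runs internally; the 0.5 threshold is the usual convention and is assumed here:

```python
import numpy as np

def sigmoid(z):
    """Map any real value (or array of values) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
probabilities = sigmoid(z)

# Threshold at 0.5 to turn probabilities into class labels 0 / 1.
labels = (probabilities >= 0.5).astype(int)
print(probabilities)
print(labels)
```

Large negative inputs map close to 0, large positive inputs close to 1, and an input of 0 maps to exactly 0.5.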




![Confusion Matrix Representation](https://qph.ec.quoracdn.net/main-qimg-7c9b7670c90b286160a88cb599d1b733)

The code for the logistic regression algorithm is as follows:

In [None]:
from sklearn.model_selection import train_test_split #Importing module
X = cancer_df.drop('diagnosis_M',axis=1)
y = cancer_df['diagnosis_M']

Now we need to split the original data into training data and test data. For the purpose of this analysis, 70% of the original data is used as the training data and the remaining 30% is held back to check the effectiveness of the model at the end.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3) # splitting the data for training and testing
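
Because the split above is random, the reported accuracy changes from run to run. The sketch below shows one way to make a split repeatable and to keep the class ratio similar in both parts. The data is made up, and the seed value 42 is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 2 features, 12 of class 0 and 8 of class 1.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.3,      # 30% held out for testing
    random_state=42,    # fixed seed so the split is repeatable
    stratify=y_toy,     # keep the class ratio similar in both parts
)
print(X_tr.shape, X_te.shape)
```

With a fixed seed, the confusion matrix and the accuracy quoted in the text can be reproduced exactly by anyone who reruns the notebook.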

In [None]:
from sklearn.linear_model import LogisticRegression # Logistic regression module
lm = LogisticRegression()
lm.fit(X_train,y_train)
prediction = lm.predict(X_test)

Confusion matrix and Classification Report

![Confusion Matrix Representation](http://dni-institute.in/blogs/wp-content/uploads/2015/02/ConfusionMatrix.png)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix
print("Confusion Matrix")
print(confusion_matrix(y_test,prediction))
CM = confusion_matrix(y_test,prediction)
accuracy = (CM[0,0]+CM[1,1])/CM.sum()*100
error = (CM[0,1]+CM[1,0])/CM.sum()*100
print('Accuracy of the model: {0:.2f}%'.format(accuracy))
print('Error/ Misclassification rate: {0:.2f}%'.format(error))

print('\n\n')
print('Classification Report')
print(classification_report(y_test,prediction))

From the confusion matrix, the accuracy of the model was found to be 94.7% and the error rate around 5.3% for this particular random split. This indicates that the logistic regression model is a good predictor of the diagnosis when supplied with the various attributes of the lump.

The classification report shows how precisely the model predicts the diagnosis. The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative. The model has a precision of 94% for this particular data set.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
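
The two ratios can be computed by hand from a confusion matrix. The matrix below is hypothetical and is not the notebook's result; it assumes scikit-learn's layout, in which rows are true classes and columns are predicted classes:

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes, class 0 first.
cm = np.array([[50, 2],
               [3, 45]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

For a cancer diagnosis, recall on the malignant class is usually the figure to watch, since a false negative means a missed tumor.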

# 4. Prediction based on K Nearest Neighbors Model (KNN Model)

K Nearest Neighbors is a classification algorithm that operates on a very simple principle. It is a non-parametric method of classification. A majority vote among the data records in the neighbourhood of a point decides its class, with or without distance-based weighting. However, to apply kNN we need to choose an appropriate value for "k", and the success of classification depends heavily on this value.

The working principle of this algorithm is as follows:

Training Algorithm:
1. Store all the Data (Preferably in the form of a Data Frame) 

Prediction Algorithm:
1. Calculate the distance from x to all points in your data. ('x' here refers to the data you want to test with)
2. Sort the points in your data by increasing distance from x.
3. Predict the majority label of the “k” closest points.
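
The three prediction steps above can be sketched directly in NumPy. This is an illustration on made-up points, assuming Euclidean distance and an unweighted vote; the function name `knn_predict` is hypothetical and this is not how scikit-learn implements the search:

```python
from collections import Counter

import numpy as np

def knn_predict(X_stored, y_stored, x, k=3):
    """Predict the label of one point x by majority vote of its k nearest neighbours."""
    # Step 1: Euclidean distance from x to every stored point.
    distances = np.linalg.norm(X_stored - x, axis=1)
    # Step 2: indices of the stored points, sorted by increasing distance.
    nearest = np.argsort(distances)[:k]
    # Step 3: majority label among the k closest points.
    votes = Counter(y_stored[nearest].tolist())
    return votes.most_common(1)[0][0]

# Made-up 2-D points: class 0 near the origin, class 1 near (5, 5).
X_toy = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
                  [5.0, 5.0], [5.2, 4.8], [4.7, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3))
print(knn_predict(X_toy, y_toy, np.array([4.9, 5.0]), k=3))
```

Note that "training" is nothing more than keeping `X_toy` and `y_toy` around; all the work happens at prediction time.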

The KNN algorithm is one of the simplest classification algorithms. Even with such simplicity, it can give highly competitive results. KNN can also be used for regression problems; the only difference from the methodology discussed above is that the prediction is the average of the nearest neighbors' values rather than a vote among them.

![Principle diagram of the kNN classification algorithm](https://www.researchgate.net/profile/Victor_Sheng/publication/260612049/figure/fig2/AS:214207917236228@1428082555895/The-principle-diagram-of-the-kNN-classification-algorithm.png)

Standardization of the data:

KNN classifies a point by its distance to other points, so attributes measured on large scales (such as area_worst) would dominate the distance over attributes measured on small scales. To give every attribute a comparable influence, each one is standardized to zero mean and unit variance with StandardScaler before the distances are computed.

Note that the scaled features only affect the model if they are the ones passed on to the train/test split and to the classifier.


In [None]:
from sklearn.preprocessing import StandardScaler
scalar = StandardScaler()
scalar.fit(cancer_df.drop('diagnosis_M',axis=1))
scalar_features = scalar.transform(cancer_df.drop('diagnosis_M',axis=1))

# Converting these features into a Data frame
df_feat = pd.DataFrame(scalar_features,columns=cancer_df.columns[:-1])
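
What StandardScaler computes for each column is a z-score: subtract the column mean and divide by the column standard deviation. A minimal sketch of that arithmetic on a made-up matrix (the values are hypothetical):

```python
import numpy as np

# Made-up feature matrix: two columns on very different scales.
X_toy = np.array([[10.0, 1000.0],
                  [12.0, 1500.0],
                  [14.0, 2000.0]])

# z-score per column: subtract the column mean, divide by the column
# standard deviation (population form, ddof=0, as StandardScaler uses).
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)
X_scaled = (X_toy - mean) / std

print(X_scaled.mean(axis=0))  # close to 0 for every column
print(X_scaled.std(axis=0))   # close to 1 for every column
```

After scaling, both columns hold the same values even though the raw numbers differed by two orders of magnitude, so neither dominates a distance calculation.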

Predicting using KNN model:

In [None]:
from sklearn.neighbors import KNeighborsClassifier # importing KNN module
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)  # fit on the training split; fitting on X_test would leak the labels being predicted

# Predicting the outcome for the held-out test data
prediction = knn.predict(X_test)

In [None]:
# Confusion matrix and classification report for K = 1
from sklearn.metrics import classification_report,confusion_matrix
print('Confusion matrix and classification report for K = 1')
print('\n')

print("Confusion Matrix")
print(confusion_matrix(y_test,prediction))
CM = confusion_matrix(y_test,prediction)
accuracy = (CM[0,0]+CM[1,1])/CM.sum()*100
error = (CM[0,1]+CM[1,0])/CM.sum()*100
print('Accuracy of the model: {0:.2f}%'.format(accuracy))
print('Error/ Misclassification rate: {0:.2f}%'.format(error))

print('\n\n')
print('Classification Report')
print(classification_report(y_test,prediction))

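The accuracy printed above is a meaningful estimate for K = 1 only when the model was fitted on X_train and scored on X_test. A K = 1 model that is fitted and scored on the same rows reaches 100% accuracy by construction, because every point is its own nearest neighbour, and that says nothing about how it performs on new patients.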

Just for the sake of comparison, we can find how the model behaves for the other K values. The model is iterated over a K value ranging between 1 & 40.


In [None]:
error_rate = []
for k in range(1,41):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train,y_train)  # fit on the training split only
    predi = knn.predict(X_test)  # predict the held-out test split
    error_rate.append(np.mean(predi!=y_test))

    

In [None]:
plt.figure(figsize = (10,10))
plt.plot(range(1,41),error_rate,ls = '--',color = 'blue',marker = 'o',markerfacecolor = 'red')
plt.xlabel("K- value")
plt.ylabel("Error Rate")
plt.xlim((0,40))
plt.title("Error Rate vs K- value")


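From the above plot, the K value with the lowest error rate on the test data can be read off. Note that when the model is fitted and scored on the same rows, the error is zero at K = 1 by construction and rises as K increases, so that pattern on its own says nothing about how well K = 1 generalizes.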
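
A self-contained sketch of the same sweep, with the model fitted on training rows and scored on held-out rows, is given below. It makes several assumptions that are not part of this notebook: it uses scikit-learn's bundled copy of the Wisconsin diagnostic data instead of data.csv, a fixed seed of 0, and a pipeline so that scaling is learned from the training rows only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled copy of the data (assumed comparable to data.csv;
# note that it codes malignant as 0 and benign as 1).
X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0, stratify=y_all)

error_rate = {}
for k in range(1, 41):
    # Scale inside the pipeline so the scaler only ever sees training rows.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_tr, y_tr)                          # fit on training rows only
    error_rate[k] = 1.0 - model.score(X_te, y_te)  # error on unseen rows

best_k = min(error_rate, key=error_rate.get)
print(best_k, round(error_rate[best_k], 3))
```

Picking K from the test split still makes the final score slightly optimistic; cross-validating on the training rows and keeping the test split for a single final check would avoid that.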

**Cross validation of the KNN algorithm**


In [None]:
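from sklearn.model_selection import cross_val_score
# 5-fold cross-validation of a K = 1 model on the full data set, so every
# fold is scored on rows the model was not fitted on
CV_scores = cross_val_score(KNeighborsClassifier(n_neighbors=1),X,y,cv =5)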
CV_scores

In [None]:
print("Accuracy: %0.2f (+/- %0.2f)" % (CV_scores.mean(),CV_scores.std() * 2))


# 5. Conclusion 

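Comparing the results of the two models, KNN at k = 1 reported the higher accuracy for the given set of data. That comparison only holds if both models are scored on data they were not fitted on, because a 1-nearest-neighbor model reproduces the labels of its own training rows exactly. The data set is also small: 569 samples with 30 attributes each (17,070 values in total), so it would be interesting to observe the behaviour of the KNN model on a considerably larger data set.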