# Breast Cancer Prediction

Breast cancer is a type of cancer that starts in the breast. Cancer starts when cells begin to grow out of control. Breast cancer cells usually form a tumor that can often be seen on an x-ray or felt as a lump. 

Depending on the types of cells in a tumor, it can be:

1. **Benign** - The tumor doesn’t contain cancerous cells.
2. **Malignant** - The tumor contains cancerous cells.

![](https://gotalktogetherdotcom.files.wordpress.com/2016/05/cancerbenignmalig1.jpg)

In this notebook, we are going to predict whether a breast tumor is benign or malignant based on 30 features in the dataset. This prediction can be useful in diagnosing patients with suspected breast cancer.

# Breast Cancer Dataset Attributes Information:

1st column - ID number,
2nd column -  Diagnosis (M = malignant, B = benign),
3rd to 32nd column -  10 real-valued features are computed for each cell nucleus:

1. radius (mean of distances from center to points on the perimeter)
2. texture (standard deviation of gray-scale values)
3. perimeter
4. area
5. smoothness (local variation in radius lengths)
6. compactness (perimeter² / area - 1.0)
7. concavity (severity of concave portions of the contour)
8. concave points (number of concave portions of the contour)
9. symmetry
10. fractal dimension ("coastline approximation" - 1)

The **"mean"**, **"standard error (se)"**, and **"worst"** (mean of the three largest values) of each of these features were computed for each image, resulting in 30 features.

For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
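The field numbering above can be reproduced with a small sketch. It assumes the columns are grouped by statistic (all means, then all standard errors, then all worst values) and uses the column names that appear later in this notebook; `field` is a hypothetical helper introduced only for illustration.

```python
# Ten base features, in the order listed above (names follow the CSV headers
# used later in this notebook).
features = ["radius", "texture", "perimeter", "area", "smoothness",
            "compactness", "concavity", "concave points", "symmetry",
            "fractal_dimension"]
stats = ["mean", "se", "worst"]

# Columns are grouped by statistic: all means, then all SEs, then all worsts.
columns = ["id", "diagnosis"] + [
    f"{feat}_{stat}" for stat in stats for feat in features
]

def field(n):
    """Return the column name for a 1-based field number (hypothetical helper)."""
    return columns[n - 1]

print(len(columns))                     # 32
print(field(3), field(13), field(23))   # radius_mean radius_se radius_worst
```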

Loading the initial libraries

In [None]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Let us load the data set

In [None]:
cancer_df= pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

In [None]:
cancer_df.head()

In [None]:
cancer_df.tail()

In [None]:
cancer_df.info()

The last column looks fishy: it contains nothing but NaN values. Let's get rid of it.

In [None]:
cancer_df = cancer_df.drop('Unnamed: 32', axis=1)

In [None]:
cancer_df.info()

In [None]:
cancer_df.shape

Here we have successfully dropped the last column, named "Unnamed: 32". The shape of the data now shows 569 rows and 32 columns.
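An all-NaN "Unnamed" column like this often appears when every line of a CSV file ends with a trailing comma. The sketch below reproduces that on a made-up three-column CSV (whether this is the cause in our file is an assumption) and removes any fully empty column without naming it.

```python
import io
import pandas as pd

# Toy CSV (hypothetical values) in which every line ends with a trailing comma.
raw = "id,diagnosis,radius_mean,\n1,M,17.99,\n2,B,13.54,\n"
toy_df = pd.read_csv(io.StringIO(raw))
print(toy_df.columns.tolist())   # the extra empty header becomes its own column

# Drop every column whose values are all NaN, whatever it is called.
cleaned = toy_df.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Dropping by name, as we did above, is more explicit; `dropna(axis=1, how="all")` is handy when the name of the empty column is not known in advance.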

Now, let's quickly go through the data type of each column.

In [None]:
cancer_df.dtypes

All columns have numeric data types except "diagnosis". Let's quickly analyze the "diagnosis" column.

In [None]:
cancer_df["diagnosis"].unique()

In [None]:
cancer_df["diagnosis"].value_counts()

From the output above, we can see that the "diagnosis" column has 2 unique categories, where 'B' stands for Benign and 'M' stands for Malignant. Let's convert them into numeric values.

In [None]:
cancer_df['diagnosis'] = cancer_df['diagnosis'].map({'M':1,'B':0})
cancer_df.head()

In [None]:
cancer_df.info()

So here we have successfully converted the "diagnosis" column to a numeric data type.

From here on, Malignant ('M') is encoded as 1 and Benign ('B') as 0.
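One thing worth checking after a `map` call is that no label was silently missed, because values absent from the dictionary become NaN. A minimal sketch on a made-up label series (the lowercase "m" is a deliberately planted error):

```python
import pandas as pd

# Hypothetical labels; the lowercase "m" is not a key in the mapping.
labels = pd.Series(["M", "B", "B", "m", "M"])
encoded = labels.map({"M": 1, "B": 0})

# Any label missing from the dictionary shows up as NaN.
n_missed = encoded.isna().sum()
print(n_missed)

# Class proportions among the labels that were mapped.
print(encoded.value_counts(normalize=True))
```

On our data, `cancer_df['diagnosis'].isna().sum()` should be 0 if every row was either 'M' or 'B'.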

Now, let's explore the data.

In [None]:
cancer_df.describe()

In [None]:
plt.figure(figsize=(8, 4))
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='diagnosis', data=cancer_df, palette=('Orange','DarkBlue'))

In [None]:
cols = ['diagnosis',
        'radius_mean', 
        'texture_mean', 
        'perimeter_mean', 
        'area_mean', 
        'smoothness_mean', 
        'compactness_mean', 
        'concavity_mean',
        'concave points_mean', 
        'symmetry_mean', 
        'fractal_dimension_mean']

sns.pairplot(data=cancer_df[cols], hue='diagnosis', palette=('Orange','DarkBlue'))

Here, we have analyzed the relationship between the 10 key attributes and the diagnosis variable by only choosing the "mean" columns.

In [None]:
cols = ['diagnosis',
        'radius_se', 
        'texture_se', 
        'perimeter_se', 
        'area_se', 
        'smoothness_se', 
        'compactness_se', 
        'concavity_se',
        'concave points_se', 
        'symmetry_se', 
        'fractal_dimension_se']

sns.pairplot(data=cancer_df[cols], hue='diagnosis', palette=('Orange','DarkBlue'))

Here, we have analyzed the relationship between the 10 key attributes and the diagnosis variable by only choosing the "se" columns.

In [None]:
cols = ['diagnosis',
        'radius_worst', 
        'texture_worst', 
        'perimeter_worst', 
        'area_worst', 
        'smoothness_worst', 
        'compactness_worst', 
        'concavity_worst',
        'concave points_worst', 
        'symmetry_worst', 
        'fractal_dimension_worst']

sns.pairplot(data=cancer_df[cols], hue='diagnosis', palette=('Orange','DarkBlue'))

In [None]:
corr=cancer_df.corr().round(2)
plt.figure(figsize=(20,20))
sns.heatmap(corr, annot = True)

In [None]:
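# np.bool was removed from NumPy; the built-in bool does the same job
mask = np.zeros_like(corr, dtype=bool)
# Hide the upper triangle (including the diagonal), which mirrors the lower one
mask[np.triu_indices_from(mask)] = True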

plt.figure(figsize=(20,20))
sns.heatmap(corr, mask=mask, annot = True)

Looking at the above plot, we can verify the presence of **multicollinearity** between some of our variables. For instance, the **radius_mean** column has a correlation of **1** and **0.99** with the **perimeter_mean** and **area_mean** columns, respectively. This is probably because the three columns essentially contain the same information, which is the physical size of the observation. Therefore we should only pick one of the three columns when we go into further analysis.
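To see why the three size columns move together, here is a small sketch on synthetic data. It assumes idealized circular nuclei (so perimeter and area follow directly from the radius) and made-up value ranges; it then lists every column pair whose absolute correlation exceeds 0.9, a threshold chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
radius = rng.uniform(6, 28, size=500)           # hypothetical radii
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,            # circle assumption
    "area": np.pi * radius ** 2,                # circle assumption
    "texture": rng.normal(19, 4, size=500),     # unrelated to size
})

corr_abs = toy.corr().abs()
# Keep only the strict upper triangle so each pair is listed once.
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9].sort_values(ascending=False)
print(high_pairs)
```

The same three lines (`where`, `stack`, filter) could be pointed at `cancer_df.corr()` to list the strongly correlated pairs instead of reading them off the heatmap.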

Another place where **multicollinearity** is apparent is between the **"mean"** columns and the **"worst"** columns. For instance, the **radius_mean** column has a correlation of **0.97** with the **radius_worst** column. In fact, each of the 10 key attributes displays a very high correlation (from 0.7 up to 0.97) between its **"mean"** and **"worst"** columns.

This is somewhat inevitable, because each **"worst"** value is itself a mean, taken over the three largest of the same measurements that feed the corresponding **"mean"** column. Therefore, I think we should discard the **"worst"** columns from our analysis and only focus on the **"mean"** columns.
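A quick simulation illustrates the point. Everything here is an assumption made for the sketch: 300 tumors, 30 nuclei per tumor, and normally distributed radii around a tumor-level value.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tumors, n_nuclei = 300, 30                      # hypothetical sizes

# Each tumor has its own typical radius; nuclei scatter around it.
tumor_level = rng.uniform(10, 20, size=(n_tumors, 1))
radii = tumor_level + rng.normal(0, 1, size=(n_tumors, n_nuclei))

mean_radius = radii.mean(axis=1)
# "worst" = mean of the three largest values per tumor
worst_radius = np.sort(radii, axis=1)[:, -3:].mean(axis=1)

r = np.corrcoef(mean_radius, worst_radius)[0, 1]
print(round(r, 2))
```

Because both statistics are driven by the same tumor-level value, the simulated correlation comes out very high, in line with what the heatmap shows for the real columns.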

In short, we will drop all **"worst"** columns from our dataset, then pick only one of the three attributes that describe the size of cells.

Similarly, it seems like there is **multicollinearity** between the attributes **compactness**, **concavity**, and **concave points**. Just like what we did with the size attributes, we should pick only one of these three attributes that contain information on the shape of the cell. I think **compactness** is an attribute name that is straightforward, so we will remove the other two attributes.

We will now go ahead and drop all unnecessary columns.

First, let's store the full data set in another variable for our records.

In [None]:
# .copy() gives an independent backup rather than a second name for the same object
cancer_df1 = cancer_df.copy()
cancer_df1.head()
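A minimal sketch of the difference between binding a second name to a DataFrame and taking an independent copy, using a made-up two-row frame:

```python
import pandas as pd

original = pd.DataFrame({"id": [1, 2], "diagnosis": [1, 0]})
alias = original            # second name for the same object
backup = original.copy()    # independent copy

original["flag"] = True     # in-place change to the shared object
print("flag" in alias.columns, "flag" in backup.columns)   # True False
```

The `drop` calls that follow return new DataFrames, so they would not disturb either kind of backup; the distinction only matters if the frame is ever modified in place.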

First, drop all "worst" columns

In [None]:
cols = ['radius_worst', 
        'texture_worst', 
        'perimeter_worst', 
        'area_worst', 
        'smoothness_worst', 
        'compactness_worst', 
        'concavity_worst',
        'concave points_worst', 
        'symmetry_worst', 
        'fractal_dimension_worst']
cancer_df = cancer_df.drop(cols, axis=1)
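As an alternative to typing out every name, the columns can be selected by their suffix. This is only a sketch on an empty toy frame with a handful of the real column names:

```python
import pandas as pd

toy = pd.DataFrame(columns=["diagnosis", "radius_mean", "radius_se",
                            "radius_worst", "texture_worst"])

# Collect every column whose name ends with "_worst", then drop them together.
worst_cols = [c for c in toy.columns if c.endswith("_worst")]
trimmed = toy.drop(columns=worst_cols)
print(trimmed.columns.tolist())   # ['diagnosis', 'radius_mean', 'radius_se']
```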

Then, drop all columns related to the "perimeter" and "area" attributes

In [None]:
cols = ['perimeter_mean',
        'perimeter_se', 
        'area_mean', 
        'area_se']
cancer_df = cancer_df.drop(cols, axis=1)

Drop all columns related to the "concavity" and "concave points" attributes

In [None]:
cols = ['concavity_mean',
        'concavity_se', 
        'concave points_mean', 
        'concave points_se']
cancer_df = cancer_df.drop(cols, axis=1)

In [None]:
cancer_df.columns

Lastly, also drop the id column

In [None]:
cancer_df = cancer_df.drop('id', axis=1)

In [None]:
cancer_df.columns

Let's take a look at the correlation matrix once again, this time created with our trimmed-down set of variables.

In [None]:
corr=cancer_df.corr().round(2)

mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize=(20,20))
sns.heatmap(corr, mask=mask, annot = True)

Looks great! Now let's move on to our model.

Let's describe the data once more.

In [None]:
cancer_df.describe()

Let's assign x and y, and accordingly split the data.

In [None]:
x= cancer_df.drop('diagnosis', axis=1)
y= cancer_df['diagnosis']

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.3, random_state = 0)

Finding shape of split data

In [None]:
x_train.shape, x_test.shape

In [None]:
y_train.shape, y_test.shape
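The shapes can be checked by hand. The sketch below assumes that `train_test_split` rounds the test fraction up to a whole number of rows and gives the remainder to the training set, which is my understanding of its behavior; the feature count follows from the columns we dropped above.

```python
import math

n_samples, test_size = 569, 0.3

# Assumed rounding rule: test rows are rounded up, training rows get the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# 30 features minus 10 "worst", 4 perimeter/area, 4 concavity/concave points.
n_features = 30 - 10 - 4 - 4

print((n_train, n_features), (n_test, n_features))   # (398, 12) (171, 12)
```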

Now, let's build a logistic regression model on our data set.

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)

Predict on the test data

In [None]:
y_pred = logreg.predict(x_test)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

Creating confusion matrix

In [None]:
# scikit-learn expects (y_true, y_pred): rows are true labels, columns are predictions
confmat = confusion_matrix(y_test, y_pred)
confmat
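For a diagnostic task, accuracy alone can hide missed malignant cases, so it may help to read recall and precision off the matrix as well. The sketch below uses ten made-up labels and builds the 2x2 matrix by hand in the layout scikit-learn documents for `confusion_matrix(y_true, y_pred)`: rows are true labels, columns are predictions, with 1 = malignant.

```python
import numpy as np

# Hypothetical labels: 1 = malignant, 0 = benign
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows index the true label, columns the predicted label.
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_hat):
    cm[t, p] += 1

tn, fp, fn, tp = cm.ravel()
recall = tp / (tp + fn)          # share of malignant cases that were caught
precision = tp / (tp + fp)       # share of malignant predictions that were right
specificity = tn / (tn + fp)     # share of benign cases correctly cleared
accuracy = (tp + tn) / cm.sum()

print(cm)
print(recall, precision, accuracy)   # 0.75 0.75 0.8
```

If the arguments are passed in the other order, the matrix is transposed and the false positive and false negative counts swap places.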

And here is the accuracy

In [None]:
accuracy_score(y_test, y_pred)

The prediction accuracy on the test data set using the above logistic regression model is **91.22%**.
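One possible refinement, not part of the analysis above, is to standardize the features before fitting, since logistic regression solvers tend to behave better on features with comparable scales. The sketch below is an assumption-laden variant: it uses scikit-learn's bundled copy of the Wisconsin diagnostic data as a stand-in for our CSV, keeps all 30 features, adds stratification to the split, and raises `max_iter`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: the bundled set keeps all 30 features and encodes
# malignant as 0 and benign as 1, the reverse of the mapping used above,
# so its score is not comparable with the figure reported in this notebook.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# The scaler is fitted on the training rows only, then applied to the test rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"{test_accuracy:.4f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the test rows out of the scaling statistics, which avoids a subtle form of leakage.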