<p style="text-align:center; color: #f21170; font-size: 25px;"><b>Breast Cancer Data Analysis and Predictions</b></p>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<h2 style="font-weight:bold; color:#022e57;">Introduction</h2>

<p style="font-size: 16px">Breast cancer is cancer that forms in the cells of the breasts. Signs of breast cancer may include a lump in the breast, a change in breast shape, dimpling of the skin, fluid coming from the nipple, a newly inverted nipple, or a red or scaly patch of skin. <br><br>Most types of breast cancer are easy to diagnose by microscopic analysis of a sample (a biopsy) of the affected area of the breast. Some types, however, require specialized lab exams.<br><br>The uncontrolled cancer cells often invade other healthy breast tissue and can travel to the lymph nodes under the arms. The lymph nodes are a primary pathway that helps the cancer cells move to other parts of the body. </p>

<h2 style="font-weight:bold; color:#022e57;">Description of Attributes</h2>

1) ID number

2) Diagnosis (M = malignant, B = benign)

3-32) Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
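
As a rough sketch of how the 30 feature columns come about, the ten base features can be combined with the three statistics. The `_mean`, `_se` and `_worst` suffixes follow the column names used later in this notebook; the list names `base_features`, `suffixes` and `feature_columns` are made up for this illustration.

```python
# Base feature names, written the way they appear in the column headers.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Each base feature is reported three ways: mean, standard error, and "worst".
suffixes = ["mean", "se", "worst"]

# All ten means first, then the ten standard errors, then the ten worst values.
feature_columns = [f"{name}_{suffix}" for suffix in suffixes for name in base_features]

print(len(feature_columns))
print(feature_columns[0], feature_columns[10], feature_columns[20])
```

Positions 0, 10 and 20 in this list hold the three radius columns, which matches fields 3, 13 and 23 once the ID and diagnosis fields are counted in front of them.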

<h2 style="font-weight:bold; color:#022e57;">Content</h2>

1. [Exploratory Data Analysis](#section1)
2. [Data Preprocessing and Building Models](#section2)
3. [Results](#section3)


<a id="section1"></a>
<h1 style="font-weight:bold; color:#022e57;">1. Exploratory Data Analysis</h1>

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.isnull().sum()

<p style="font-size: 16px;">Although this information was already given, we still checked for NaN values and found that the entire 'Unnamed: 32' column contains only NaN values. So I will drop this column.</p>

In [None]:
df.describe()

Dropping 'Unnamed: 32' column.

In [None]:
df.drop("Unnamed: 32", axis=1, inplace=True)
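
If the empty column had a different name, it could also be found programmatically. The snippet below is only a sketch on a toy DataFrame with made-up values; `toy`, `empty_cols` and `cleaned` are names invented for this example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; "Unnamed: 32" mimics the empty column.
toy = pd.DataFrame({
    "radius_mean": [14.1, 20.3, 11.8],
    "texture_mean": [19.2, 17.5, 22.4],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Names of the columns in which every value is missing.
empty_cols = toy.columns[toy.isnull().all()].tolist()
print(empty_cols)

# Drop those columns and keep the result in a new frame.
cleaned = toy.drop(columns=empty_cols)
print(cleaned.shape)
```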

<h2 style="font-weight:bold; color:#005a8d;">1.1 Data Visualizations</h2>

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize = (12,6))
sns.countplot(x="diagnosis", data=df, palette='magma')

In [None]:
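plt.figure(figsize=(20,17))
# Correlate numeric columns only; 'diagnosis' is still a string column at this point.
corr = df.select_dtypes(include='number').corr()
# Mask the upper triangle so that each pair of features is shown only once.
matrix = np.triu(corr)
sns.heatmap(corr, annot=True, linewidth=.5, mask=matrix, cmap="Purples")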

As the heatmap shows, there are several negative correlations in this dataset. Let's take a closer look at them by plotting them.

<h3 style="font-weight:bold; color:#005a8d;">Negative Correlations</h3>

<p style="font-size: 16px;">The column <b>'fractal_dimension_mean'</b> is negatively correlated with several other attributes, such as <b>'area_mean'</b> and <b>'area_worst'</b>. We'll draw some scatter plots for these.<br><br>For context, fractal analysis of images of breast tissue specimens provides a numeric description of tumour growth patterns as a continuous number between 1 and 2. This number is known as the fractal dimension.</p>

In [None]:
fig, ax = plt.subplots(2,2, figsize=(15,15))
sns.scatterplot(x='fractal_dimension_mean', y='area_mean', hue="diagnosis", 
                data=df, ax=ax[0][0], palette='magma')
sns.scatterplot(x='fractal_dimension_worst', y='area_worst', hue="diagnosis", 
                data=df, ax=ax[0][1], palette='magma')
sns.scatterplot(x='smoothness_se', y='radius_worst', hue="diagnosis", 
                data=df, ax=ax[1][0], palette='magma')
sns.scatterplot(x='symmetry_se', y='radius_worst', hue="diagnosis", 
                data=df, ax=ax[1][1], palette='magma')
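
Negatively correlated pairs can also be listed directly instead of being read off the heatmap. This is a minimal sketch on synthetic data, not on the notebook's DataFrame; the columns `a`, `b`, `c` and the names `toy`, `lower` and `pairs` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: "b" falls as "a" rises, "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": -a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the strict lower triangle so that each pair appears once.
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))

# Flatten to a Series indexed by (row, column) and sort ascending,
# so the most negative correlations come first.
pairs = lower.stack().dropna().sort_values()
print(pairs.head(3))
```

On the real data, the same steps could be applied to the correlation matrix of the numeric columns.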

<h3 style="font-weight:bold; color:#005a8d;">Some Pairplots</h3>

In [None]:
# Creating a list of columns with only the columns that represent the mean.
mean_cols = ['diagnosis','radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']

# Creating a list of columns with only the columns that represent the worst values.
worst_cols = ['diagnosis','radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst']

In [None]:
sns.pairplot(df[mean_cols], hue="diagnosis", palette='magma')

In [None]:
sns.pairplot(df[worst_cols], hue="diagnosis", palette='viridis')

<a id="section2"></a>
<h1 style="font-weight:bold; color:#022e57;">2. Data Preprocessing and Building Models</h1>

<h2 style="font-weight:bold; color:#005a8d;">2.1 Data Preprocessing</h2>

In [None]:
df['diagnosis']

We need to convert this categorical column into a numerical one using LabelEncoder.

In [None]:
tgt = df['diagnosis']
from sklearn.preprocessing import LabelEncoder
encode_lbl = LabelEncoder()
target = encode_lbl.fit_transform(tgt)

'target' is our new numerical target array for modelling.
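
To see which number each label receives, here is a small sketch on a handful of illustrative labels (not rows from the dataset). LabelEncoder sorts the classes, so 'B' becomes 0 and 'M' becomes 1; the names `labels`, `encoder` and `encoded` are made up for this example.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["M", "B", "B", "M", "B"]  # illustrative values only

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so index 0 is "B" (benign) and index 1 is "M" (malignant).
print(encoder.classes_.tolist())
print(encoded.tolist())

# inverse_transform maps the numeric codes back to the original labels.
print(encoder.inverse_transform([0, 1]).tolist())
```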

<h2 style="font-weight:bold; color:#005a8d;">2.2 Splitting the Data into train and test</h2>

In [None]:
from sklearn.model_selection import train_test_split
# 'id' is an identifier, not a measurement, so it is excluded from the features.
X = df.drop(['id', 'diagnosis'], axis=1)
y = target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

print("Shape of training set:", X_train.shape)
print("Shape of test set:", X_test.shape)

In [None]:
from sklearn.preprocessing import StandardScaler
s_sc = StandardScaler()
# Fit the scaler on the training data only, then apply the same scaling to the test data.
X_train = s_sc.fit_transform(X_train)
X_test = s_sc.transform(X_test)
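
One way to keep the test split out of the scaling statistics is to wrap the scaler and the model in a pipeline. The sketch below uses synthetic data from `make_classification` rather than the notebook's data, and the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te` and `pipe` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and the encoded target.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# fit() learns the scaling statistics from the training split only;
# score() reuses those statistics to transform the test split before predicting.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
print(pipe.score(X_te, y_te))
```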

<h2 style="font-weight:bold; color:#005a8d;">2.3 Classification Models</h2>

<h3 style="font-weight:bold; color:#005a8d;">2.3.1 Logistic Regression</h3>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score

In [None]:
logmodel = LogisticRegression()

In [None]:
logmodel.fit(X_train, y_train)
predictions1 = logmodel.predict(X_test)

In [None]:
print("Confusion Matrix: \n", confusion_matrix(y_test, predictions1))
print('\n')
print(classification_report(y_test, predictions1))

In [None]:
logmodel_acc = accuracy_score(y_test, predictions1)
print("Accuracy of the Logistic Regression Model is: ", logmodel_acc)

<h3 style="font-weight:bold; color:#005a8d;">2.3.2 K Nearest Neighbours</h3>

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
predictions2 = knn.predict(X_test)

In [None]:
print(confusion_matrix(y_test, predictions2))
print("\n")
print(classification_report(y_test, predictions2))

The classification report shows an accuracy of around 0.95. I'll try to improve it a bit by choosing a better value for n_neighbors (the K value).

In [None]:
error_rate = []

for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,40), error_rate, color='purple', linestyle="--",marker='o', markersize=10, markerfacecolor='red')
plt.title('Error_Rate vs K value')
# xlabel and ylabel are functions, so they must be called rather than assigned to.
plt.xlabel('K')
plt.ylabel('Error Rate')

From this graph, K values of 3 and 7 appear to give the lowest error rate. So I'll use one of these values and check.
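
The error rates above are measured on the test set. An alternative, shown here only as a sketch and not as what this notebook does, is to pick K by cross-validation so that the test set is not used for tuning. The example runs on synthetic data; `X_demo`, `y_demo`, `cv_error` and `best_k` are names invented for it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the (scaled) training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Mean 5-fold cross-validated error rate for each candidate K.
cv_error = {}
for k in range(1, 40):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn_k, X_demo, y_demo, cv=5)
    cv_error[k] = 1 - scores.mean()

# K with the lowest cross-validated error.
best_k = min(cv_error, key=cv_error.get)
print(best_k, cv_error[best_k])
```

In the notebook itself, the training split would take the place of the synthetic data.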

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions2 = knn.predict(X_test)

In [None]:
print(confusion_matrix(y_test, predictions2))
print("\n")
print(classification_report(y_test, predictions2))

So, there was no significant change in the accuracy score other than a slight increase in the macro avg. For now, I'll use this as my accuracy score for KNN.
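
For reference, accuracy and the macro averages in the classification report can be derived from the confusion matrix. The matrix below is hypothetical and is not the output of any model in this notebook; it only illustrates the arithmetic.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
cm = np.array([[90, 10],
               [5, 45]])

# Accuracy: correct predictions (the diagonal) divided by all predictions.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: correct predictions divided by the true count of each class.
recall = np.diag(cm) / cm.sum(axis=1)
# Per-class precision: correct predictions divided by the predicted count of each class.
precision = np.diag(cm) / cm.sum(axis=0)

# Macro averages weight both classes equally, regardless of class size.
macro_recall = recall.mean()
macro_precision = precision.mean()

print(round(accuracy, 3), round(macro_recall, 3), round(macro_precision, 3))
```

Because the macro average treats both classes equally, it can move even when the overall accuracy barely changes.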

In [None]:
knn_model_acc = accuracy_score(y_test, predictions2)
print("Accuracy of K Neighbors Classifier Model is: ", knn_model_acc)

<h3 style="font-weight:bold; color:#005a8d;">2.3.3 Decision Tree</h3>

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
predictions3 = dtree.predict(X_test)

In [None]:
print("Confusion Matrix: \n", confusion_matrix(y_test, predictions3))
print("\n")
print(classification_report(y_test, predictions3))

In [None]:
dtree_acc = accuracy_score(y_test, predictions3)
print("Accuracy of Decision Tree Model is: ", dtree_acc)

<h3 style="font-weight:bold; color:#005a8d;">2.3.4 Random Forests</h3>

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(n_estimators=300)
rfc.fit(X_train, y_train)
predictions4 = rfc.predict(X_test)

In [None]:
print("Confusion Matrix: \n", confusion_matrix(y_test, predictions4))
print("\n")
print(classification_report(y_test, predictions4))

In [None]:
rfc_acc = accuracy_score(y_test, predictions4)
print("Accuracy of Random Forests Model is: ", rfc_acc)

<h3 style="font-weight:bold; color:#005a8d;">2.3.5 Support Vector Machines (SVM)</h3>

In [None]:
from sklearn.svm import SVC

In [None]:
svc_model = SVC(kernel="rbf")
svc_model.fit(X_train, y_train)
predictions5 = svc_model.predict(X_test)

In [None]:
print("Confusion Matrix: \n", confusion_matrix(y_test, predictions5))
print("\n")
print(classification_report(y_test, predictions5))

In [None]:
svm_acc = accuracy_score(y_test, predictions5)
print("Accuracy of SVM model is: ", svm_acc)

<a id="section3"></a>
<h1 style="font-weight:bold; color:#022e57;">3. Results</h1>

<p style="font-size:18px; color: #fb3640; font-weight: 500;">In this run:<br>The accuracy of Logistic Regression Model is 98.245%<br>The accuracy of KNN model is 95.321%<br>The accuracy of Decision Tree Model is 90.643%<br>The accuracy of Random Forest Model is 94.736%<br>The accuracy of SVM Model is 97.660%</p>
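
To put these percentages in perspective, they can be turned back into counts of correctly classified test samples. This is a back-of-the-envelope sketch that assumes the dataset has 569 rows (a figure not stated in this notebook) and that the test split is the 30% share rounded up; the counts are inferred from the reported percentages, not read from the confusion matrices.

```python
import math

n_rows = 569                       # assumed dataset size, not stated in the notebook
n_test = math.ceil(0.3 * n_rows)   # assumed test-set size for test_size=0.3

# Accuracies as reported above, in percent.
reported = {
    "LogisticRegression": 98.245,
    "KNN": 95.321,
    "DecisionTree": 90.643,
    "RandomForests": 94.736,
    "SVM": 97.660,
}

# Number of correct predictions implied by each percentage.
implied = {name: round(pct / 100 * n_test) for name, pct in reported.items()}

for name, correct in implied.items():
    print(name, correct, "of", n_test)
```

Under these assumptions, the gap between Logistic Regression and SVM comes down to a single test sample.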

In [None]:
print(logmodel_acc)
print(knn_model_acc)
print(dtree_acc)
print(rfc_acc)
print(svm_acc)

In [None]:
plt.figure(figsize=(12,6))
model_acc = [logmodel_acc, knn_model_acc, dtree_acc, rfc_acc, svm_acc]
name_of_model = ['LogisticRegression', 'KNN', 'DecisionTree', 'RandomForests', 'SVM']
sns.barplot(x= model_acc, y=name_of_model, palette='magma')

<h3 style="color: #2978b5; text-align:center;">LOGISTIC REGRESSION MODEL PERFORMED THE BEST WITH AN ACCURACY OF 98.24%</h3>
<h3 style="color: #2978b5; text-align:center;">SVM IS JUST BEHIND, ALSO WITH A GOOD ACCURACY OF 97.66%</h3>

<p style="font-size:18px; color: #185adb; text-align:center;">Please leave your feedback in the comments. Any improvements or suggestions are welcome!</p>
<h1 style="color: #f55c47; text-align:center;">The End</h1>