# Breast Cancer Detection

## Dataset Description
Source : https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Dataset Description : Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.  They describe characteristics of the cell nuclei present in the image. A few of the images can be found at http://www.cs.wisc.edu/~street/images/

The dataset contains 569 instances, each with 32 attributes.



## Attribute Information
ID number
Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

    1) radius (mean of distances from center to points on the perimeter)
    2) texture (standard deviation of gray-scale values)
    3) perimeter
    4) area
    5) smoothness (local variation in radius lengths)
    6) compactness (perimeter^2 / area - 1.0)
    7) concavity (severity of concave portions of the contour)
    8) concave points (number of concave portions of the contour)
    9) symmetry
    10) fractal dimension ("coastline approximation" - 1)
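
As a quick illustration of the compactness formula in the list above, the sketch below evaluates it for a perfectly circular outline. The function name `compactness` and the radius value are hypothetical, chosen only for this example; the dataset itself already ships the computed values.

```python
import math

def compactness(perimeter: float, area: float) -> float:
    """Compactness as defined in the attribute list: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# Hypothetical example: a perfectly circular nucleus of radius 5 (arbitrary units).
radius = 5.0
circle_perimeter = 2 * math.pi * radius
circle_area = math.pi * radius ** 2

# For any circle the value is 4*pi - 1, independent of the radius.
circle_compactness = compactness(circle_perimeter, circle_area)
print(round(circle_compactness, 3))  # 11.566
```

A circle gives the smallest possible value for a given area, so less regular outlines produce larger compactness values.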


## Task
Predict whether a breast mass is malignant or benign

### EDA and Pre-Processing

In this section of our notebook, we perform some exploratory data analysis on our dataframe to get a general idea of what it contains and to clean it up where required.

In [None]:
#importing libraries


import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
#importing the dataset


df = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')
df.head()

In [None]:
#Let us get some basic insight on our columns
#and understand their properties and datatypes
df.info()

In [None]:
df.describe()

In [None]:
#We will drop the column "Unnamed: 32", because it has 0 non-null values.
#Also, we can see that no other attributes have any missing values


df.drop(['Unnamed: 32'], axis=1, inplace=True)
df.head()

In [None]:
#We will also need to convert our categorical columns into numerical ones by using
#label encoding on the dataset


from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler



for column in df.columns:
  if df[column].dtype == np.int64 or df[column].dtype == np.float64:
    continue
  df[column] = LabelEncoder().fit_transform(df[column])


df.head()
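
To make the encoding step above concrete, here is a minimal pure-Python sketch of what label encoding does to the `diagnosis` column. It assumes, as `LabelEncoder` does, that classes are numbered in sorted order; the helper `label_encode` and the sample values are hypothetical and only for illustration.

```python
def label_encode(values):
    """Map each distinct label to an integer, using the sorted order of the labels."""
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# Hypothetical sample of diagnosis labels.
sample = ["M", "B", "B", "M", "B"]
encoded, mapping = label_encode(sample)

print(mapping)  # {'B': 0, 'M': 1}
print(encoded)  # [1, 0, 0, 1, 0]
```

Under this assumption, benign is encoded as 0 and malignant as 1, which is how the encoded `diagnosis` values should be read in the rest of the notebook.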

In [None]:
#Let's check the ratio of Benign to Malignant cancer


plt.figure(figsize=(13,6))
df.diagnosis.value_counts().plot.pie(autopct="%.1f%%")
plt.title("Diagnosis Ratio", fontsize = 20)
plt.legend(['Benign','Malignant'])

From the above pie chart, we can see that 62.7% of our entries are benign tumors and 37.3% are malignant.
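
The percentages can be reproduced by hand. The sketch below assumes the class counts given in the UCI documentation for this dataset (357 benign, 212 malignant); these counts are not printed anywhere in this notebook, so treat them as an assumption.

```python
# Class counts as listed in the UCI dataset documentation (assumed, not computed here).
benign, malignant = 357, 212
total = benign + malignant

benign_pct = 100 * benign / total
malignant_pct = 100 * malignant / total

print(total)                   # 569
print(f"{benign_pct:.1f}%")    # 62.7%
print(f"{malignant_pct:.1f}%") # 37.3%
```

The benign share is also a useful baseline: a model that always predicts "benign" would already reach roughly this accuracy on data with the same class balance.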

In [None]:
#A heatmap is used to graphically represent the correlation between the attributes in our dataset
#We will plot a heatmap to check for the highly correlated columns

plt.figure(figsize=(25,20))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")

In [None]:
#Many attributes have a correlation below 0.5 with the diagnosis.
#Let us select the columns whose absolute correlation with the diagnosis is at least 0.5


high_corr_data = df.corr()
high_corr_columns = high_corr_data.index[abs(high_corr_data['diagnosis'])>=0.5]
high_corr_columns
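
The selection above keeps every column whose absolute Pearson correlation with `diagnosis` is at least 0.5. The sketch below shows the same idea in plain Python on toy data; the `pearson` helper, the feature names and the values are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy target (0 = benign, 1 = malignant) and two made-up features.
target = [0, 0, 0, 1, 1, 1]
features = {
    "strong_feature": [1.0, 1.2, 0.9, 3.1, 2.8, 3.3],  # clearly higher for class 1
    "weak_feature": [5.0, 4.0, 6.0, 5.0, 6.0, 4.0],    # same distribution in both classes
}

# Keep features whose absolute correlation with the target is at least 0.5.
selected = [name for name, values in features.items()
            if abs(pearson(values, target)) >= 0.5]
print(selected)  # ['strong_feature']
```

Note that in the notebook cell the `diagnosis` column itself passes the filter, since its correlation with itself is 1.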

In [None]:
#Plotting a heatmap of these highly correlated columns

plt.figure(figsize=(16,8))
sns.heatmap(df[high_corr_columns].corr(), annot=True, cmap="coolwarm")

In [None]:
#Let us compare the distributions of the mean-value attributes for the two types of tumor by using
#seaborn's displot function.

mean_col = ['radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']

for col in mean_col:
    sns.displot(df, x=col, hue="diagnosis", kind="kde", multiple="stack")

From the plots above, it appears that the radius mean, texture mean, perimeter mean, area mean, smoothness mean,
compactness mean, concavity mean, concave points mean, symmetry mean and fractal dimension mean all
vary significantly between the two types of tumors.

### Model Selection

In [None]:
#importing libraries
from sklearn.model_selection import train_test_split 

#Splitting dependent and independent columns
#Note: the 'id' column is still part of x here, although it carries no diagnostic information
x = df.drop(columns = 'diagnosis')
y = df['diagnosis']

In [None]:
#Splitting data into training and test sets

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)

Now that we have split our data and performed our basic analysis, we will move on to testing several models on our dataset to find the best one.
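
For reference, the sizes of the two sets produced by `test_size=0.2` on 569 rows can be worked out by hand. This sketch assumes scikit-learn rounds the test share up to a whole number of rows and gives the remainder to the training set.

```python
import math

n_samples = 569
test_size = 0.2

# Assumed rounding rule: test rows are rounded up, the rest go to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 455 114
```

With only 114 test rows, each misclassified sample changes the accuracy by almost 0.9 percentage points, which is worth keeping in mind when comparing models.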

In [None]:
#Let us first import the models and metrics from the sklearn and xgboost modules


from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report , confusion_matrix , accuracy_score

#### Testing for Logistic Regression

In [None]:
model_logistic = LogisticRegression()
model_logistic.fit(x_train, y_train)
print('Logistic regression accuracy: {:.4f}'.format(accuracy_score(y_test, model_logistic.predict(x_test))))

In [None]:
confusionmatrix = confusion_matrix(y_test, model_logistic.predict(x_test))

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(3, 3))
ax.matshow(confusionmatrix, cmap=plt.cm.Blues, alpha=0.3)
for i in range(confusionmatrix.shape[0]):
    for j in range(confusionmatrix.shape[1]):
        ax.text(x=j, y=i,s=confusionmatrix[i, j], va='center', ha='center', size='xx-large')
 
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)

#### Testing for Random Forest Classifier

In [None]:
model_randomforest = RandomForestClassifier()
model_randomforest.fit(x_train, y_train)
print('Random Forest accuracy: {:.4f}'.format(accuracy_score(y_test, model_randomforest.predict(x_test))))

In [None]:
confusionmatrix = confusion_matrix(y_test, model_randomforest.predict(x_test))

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(3, 3))
ax.matshow(confusionmatrix, cmap=plt.cm.Blues, alpha=0.3)
for i in range(confusionmatrix.shape[0]):
    for j in range(confusionmatrix.shape[1]):
        ax.text(x=j, y=i,s=confusionmatrix[i, j], va='center', ha='center', size='xx-large')
 
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)

#### Testing for K Neighbours Classifier

In [None]:
model_knnclassfier = KNeighborsClassifier()
model_knnclassfier.fit(x_train, y_train)
print('KNeighborsClassifier accuracy: {:.4f}'.format(accuracy_score(y_test, model_knnclassfier.predict(x_test))))

In [None]:
confusionmatrix = confusion_matrix(y_test, model_knnclassfier.predict(x_test))

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(3, 3))
ax.matshow(confusionmatrix, cmap=plt.cm.Blues, alpha=0.3)
for i in range(confusionmatrix.shape[0]):
    for j in range(confusionmatrix.shape[1]):
        ax.text(x=j, y=i,s=confusionmatrix[i, j], va='center', ha='center', size='xx-large')
 
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)

#### Testing for XGBoostClassifier

In [None]:
model_xgb = XGBClassifier()
model_xgb.fit(x_train, y_train)
print('XGBoostClassifier accuracy: {:.4f}'.format(accuracy_score(y_test, model_xgb.predict(x_test))))

In [None]:
confusionmatrix = confusion_matrix(y_test, model_xgb.predict(x_test))

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(3, 3))
ax.matshow(confusionmatrix, cmap=plt.cm.Blues, alpha=0.3)
for i in range(confusionmatrix.shape[0]):
    for j in range(confusionmatrix.shape[1]):
        ax.text(x=j, y=i,s=confusionmatrix[i, j], va='center', ha='center', size='xx-large')
 
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)

#### Testing for SVM

In [None]:
model_svm = SVC()
model_svm.fit(x_train, y_train)
print('SVM accuracy: {:.4f}'.format(accuracy_score(y_test, model_svm.predict(x_test))))

In [None]:
confusionmatrix = confusion_matrix(y_test, model_svm.predict(x_test))

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(3, 3))
ax.matshow(confusionmatrix, cmap=plt.cm.Blues, alpha=0.3)
for i in range(confusionmatrix.shape[0]):
    for j in range(confusionmatrix.shape[1]):
        ax.text(x=j, y=i,s=confusionmatrix[i, j], va='center', ha='center', size='xx-large')
 
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)

In [None]:
model_selection_dict = {"Logistic Regression" : accuracy_score(y_test, model_logistic.predict(x_test)),
                           "Random Forest Classifier" : accuracy_score(y_test, model_randomforest.predict(x_test)),
                               "XGBoost Classifier" :accuracy_score(y_test, model_xgb.predict(x_test)),
                                    "KNN Classifier":accuracy_score(y_test, model_knnclassfier.predict(x_test)),
                                        "SVM": accuracy_score(y_test, model_svm.predict(x_test))
                       }

pd.DataFrame(model_selection_dict.items(), columns=['Model','Accuracy Score'])

In our model selection analysis, we can see that the XGBoostClassifier model has the highest accuracy, 98.25%. Let us further analyse this model before confirming our predictions.
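
To put the 98.25% figure in perspective, the sketch below converts it back into a number of test rows. It assumes the test set holds 114 rows (20% of 569, rounded up); the variable names are only for illustration.

```python
n_test = 114                 # assumed size of the test set
reported_accuracy = 0.9825   # accuracy reported for XGBoost above

# Number of correctly classified rows consistent with the reported accuracy.
correct = round(reported_accuracy * n_test)
errors = n_test - correct

print(correct, errors)              # 112 2
print(f"{correct / n_test:.4f}")    # 0.9825
```

Under that assumption the best model misclassifies about two test samples, so differences between the top models may come down to one or two rows.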

### XGBoost Classifier

Let us analyse our XGBoost model by checking its precision, recall and F1 score. We do this by printing a classification report that compares y_test, the true labels we held out from the dataset, with the model's predictions on x_test, which the model did not see during training.
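
As a reminder of what the three metrics measure, here is a small sketch that computes them for a single class from confusion-matrix counts. The helper `binary_metrics` and the counts are hypothetical and are not taken from this notebook's results.

```python
def binary_metrics(tp: int, fp: int, fn: int):
    """Precision, recall and F1 for one class from true/false positive and false negative counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for the malignant class.
tp, fp, fn = 45, 1, 2
precision, recall, f1 = binary_metrics(tp, fp, fn)
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")  # 0.978 0.957 0.968
```

For a diagnostic task, recall on the malignant class deserves particular attention, because a false negative means a malignant tumor classified as benign.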

In [None]:
print(classification_report(y_test, model_xgb.predict(x_test)))

A 0.99 precision is not bad! But there is still some scope for improvement. We might be able to get a slightly
better result by tuning our model, which we will do next.

In [None]:
predictedvalues= pd.DataFrame({'Actual': y_test, 'Predicted': model_xgb.predict(x_test)})
predictedvalues

Since our model already reaches 98.25% accuracy, we might not need hyperparameter tuning, but let us try it in case it helps.

In [None]:
from sklearn.model_selection import GridSearchCV

param_test1 = {
 'max_depth':range(3,10,2),
 'min_child_weight':range(1,6,2)
}
gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
 param_grid = param_test1, scoring='roc_auc', n_jobs=4, cv=5)

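#Note: the search is fitted on the full x, y, so the test rows are seen during tuning
#and any later score on x_test is optimistic
gsearch1.fit(x,y)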


print("Tuned XGBoost Parameters: {}".format(gsearch1.best_params_))
print("Best score is {}".format(gsearch1.best_score_))
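
To get a feel for the cost of the search above, the sketch below enumerates the same grid with `itertools.product` and counts the model fits implied by 5-fold cross-validation. It uses only the standard library and does not train anything; the variable names are for illustration only.

```python
from itertools import product

# Same grid as param_test1 in the cell above.
param_grid = {
    "max_depth": range(3, 10, 2),        # 3, 5, 7, 9
    "min_child_weight": range(1, 6, 2),  # 1, 3, 5
}
cv_folds = 5

# Every combination of the two parameters, as a list of dicts.
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]

print(len(combinations))             # 12
print(len(combinations) * cv_folds)  # 60 cross-validation fits
print(combinations[0])               # {'max_depth': 3, 'min_child_weight': 1}
```

With the default `refit=True`, `GridSearchCV` trains one more model on all the data passed to `fit` using the best combination, and that refitted model is what `gsearch1.predict` uses.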

After tuning our hyperparameters, the best cross-validated score is 99.45%. Note that this score is the mean ROC AUC over the 5 folds (scoring='roc_auc'), not accuracy, so it is not directly comparable
with the 98.25% accuracy obtained earlier. To analyse our tuned model further, we will plot the confusion matrix and print a classification report.

In [None]:
confusionmatrix = confusion_matrix(y_test, gsearch1.predict(x_test))

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(3, 3))
ax.matshow(confusionmatrix, cmap=plt.cm.Blues, alpha=0.3)
for i in range(confusionmatrix.shape[0]):
    for j in range(confusionmatrix.shape[1]):
        ax.text(x=j, y=i,s=confusionmatrix[i, j], va='center', ha='center', size='xx-large')
 
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)

In [None]:
print(classification_report(y_test, gsearch1.predict(x_test)))

A 100% precision. Keep in mind that the grid search was fitted on the full dataset (x, y), which includes the test rows, so this result is optimistic.
The model might produce some errors on real-world data, but this was just a beginner project and everything built can always be improved further.

In [None]:
#Use the tuned model (gsearch1), not the untuned model_xgb, for the final predictions
predicted_tuned_values= pd.DataFrame({'Actual': y_test, 'Predicted': gsearch1.predict(x_test)})
predicted_tuned_values.to_csv("final_predictions_with_tuning.csv", index=False)
predicted_tuned_values