# Breast Cancer Wisconsin Prognostic
## Context
The data come from the UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.names 

## Content

1. Importing Libraries
2. Exploration of Data
3. Normalization of Data
4. Modelling of Data
5. Comparing Model Performance
6. Fitting Data to Final Model
7. Conclusion

## Objective
The main goal here is to fit a model that predicts whether a breast mass is malignant or benign based on 30 features, and to identify which variables contribute the most to that prediction.


### 1. Importing libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder # for encoding the class labels as integers
from sklearn.preprocessing import MinMaxScaler # for normalising data

### 2.1 Importing Data

In [None]:
# Reading csv file into dataframe
df = pd.read_csv("../input/uci-wisconsin-breast-cancer/BreastCancer.csv")
df.head()

### Explanation of variables

- ID number
- Diagnosis (M = malignant, B = benign)

The mean, standard error and “worst” (mean of the three largest values) of ten features were computed for each image, resulting in 30 features. Below is a list of the ten real-valued features computed for each cell nucleus:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension
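
The 30 feature names follow from crossing the ten base measurements with the three statistics. The sketch below is only illustrative: it assumes the CSV uses the `<feature>_<statistic>` pattern seen later in this notebook (for example `radius_mean`). The `_se` and `_worst` suffixes and the exact spelling of each column stem are assumptions and should be checked against `df.columns`.

```python
# Ten base measurements listed above (the spelling of each stem is an assumption)
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Three statistics computed per image (the suffixes are an assumption)
statistics = ["mean", "se", "worst"]

# Cross statistics with features: 3 x 10 = 30 expected column names
expected_columns = [f"{feat}_{stat}" for stat in statistics for feat in base_features]
print(len(expected_columns))
print(expected_columns[:3])
```

Comparing `expected_columns` with the actual column list is a quick way to spot unexpected or missing columns before modelling.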

### 2.2 Data cleaning

In [None]:
# Dropping id column
df1 = df.drop(columns="id")
df1.head()

In [None]:
# Total missing values for each feature
df1.isnull().sum()

A statistical summary of the features shows that their means vary widely, so the data will have to be normalised before modelling.

In [None]:
# summary of the DataFrame
df1.info()

### 2.3 Data Visualization

In [None]:
#  some basic statistical details for all features
df1.describe()

All variables except **diagnosis** are numeric variables.

In [None]:
# Check the number of malignant(M) and benign(B) cases
sns.countplot(x="diagnosis", data=df1)

Initial visualisation shows that patients with a malignant diagnosis had higher mean radius, perimeter, area and smoothness than those with a benign diagnosis.

In [None]:
sns.set(rc={'figure.figsize':(5,5)})
plt.subplot(2, 2, 1)
sns.boxplot(x='diagnosis', y='radius_mean', data=df1)
plt.ylabel('Radius Mean')
plt.xlabel('Diagnosis')
plt.title('Diagnosis vs Radius Mean')
plt.subplot(2, 2, 2)
sns.boxplot(x='diagnosis', y='perimeter_mean', data=df1)
plt.ylabel('Perimeter Mean')
plt.xlabel('Diagnosis')
plt.title('Diagnosis vs Perimeter Mean')
plt.subplot(2, 2, 3)
sns.boxplot(x='diagnosis', y='area_mean', data=df1)
plt.ylabel('Area Mean')
plt.xlabel('Diagnosis')
plt.title('Diagnosis vs Area Mean')
plt.subplot(2, 2, 4)
sns.boxplot(x='diagnosis', y='smoothness_mean', data=df1)
plt.ylabel('Smoothness Mean')
plt.xlabel('Diagnosis')
plt.title('Diagnosis vs Smoothness Mean')
plt.tight_layout()
plt.show()

In [None]:
labels = ['radius_mean', 'perimeter_mean','smoothness_mean', 'compactness_mean', 'concavity_mean',
       'texture_mean', 'symmetry_mean','diagnosis']


In [None]:
# let's examine how features determine prognostics
sns.pairplot(df1[labels], hue='diagnosis')
plt.show()

In [None]:
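# correlations are only defined for numeric columns, so drop the categorical diagnosis first
corr_matrix = df1.drop(columns="diagnosis").corr().round(2)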

In [None]:
sns.set(rc={'figure.figsize':(15,15)})
sns.heatmap(corr_matrix, cmap='BuPu', annot_kws={'size': 8}, cbar = True, annot=True)
plt.title('Variable Correlation Plot')
plt.show()

### 3. Normalising data

In [None]:
# dividing the data into X and Y
X=df1.iloc[:,1:31]
X.head(2)

In [None]:
Y=df1.iloc[:,0:1].copy() # copy so that a new column can be added without touching df1
Y.head(2)

LabelEncoder is used to convert the categorical response into integer labels. Classes are sorted alphabetically, so B (benign) becomes 0 and M (malignant) becomes 1.
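
As a quick check of that mapping, here is a minimal sketch on a made-up list of labels. The names `toy_diagnosis`, `toy_encoder` and `toy_codes` are hypothetical and not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Toy diagnosis column (made-up values, for illustration only)
toy_diagnosis = ["M", "B", "B", "M", "B"]

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_diagnosis)

# classes_ holds the sorted class names; a class's position is its integer code
print(list(toy_encoder.classes_))
print(toy_codes.tolist())

# inverse_transform maps integer codes back to the original labels
print(list(toy_encoder.inverse_transform([0, 1])))
```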

In [None]:
le = LabelEncoder()

In [None]:
# converting diagnosis to integer labels (B -> 0, M -> 1)
Y['diagnosis_new'] = le.fit_transform(Y.diagnosis)
Y.head()
Y_new=Y.iloc[:,1:2]
Y_new.tail()

Because the means of the features differ so widely, we will have to normalise the features for any learning algorithm that computes distances between data points, such as KNN. The same scaled features are also used for logistic regression and the support vector machine below.
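
To illustrate what min-max scaling does, the sketch below applies the formula (x - min) / (max - min) column by column to a small made-up matrix. This is the transformation `MinMaxScaler` performs with its default `feature_range=(0, 1)`. The array `toy_X` and its values are invented for the example.

```python
import numpy as np

# Toy feature matrix: column 0 is on a much larger scale than column 1
toy_X = np.array([[1000.0, 0.10],
                  [1500.0, 0.20],
                  [2000.0, 0.40]])

# Min-max scaling per column: (x - min) / (max - min)
col_min = toy_X.min(axis=0)
col_max = toy_X.max(axis=0)
toy_scaled = (toy_X - col_min) / (col_max - col_min)

# After scaling, every column spans the range from 0 to 1
print(toy_scaled)
```

After scaling, both columns contribute on a comparable scale to a Euclidean distance, whereas before scaling the first column would dominate it.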

In [None]:
scaler = MinMaxScaler()

In [None]:
scaler.fit(X)
X1 = scaler.transform(X)
X_new=pd.DataFrame(X1, columns=X.columns)
X_new.head(2)

The features are now normalised, as the summary statistics below show: each feature has a minimum of 0 and a maximum of 1.

In [None]:
X_new.describe()

### 4.1 Importing libraries for fitting data

In [None]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, cross_val_score # This is for cross-validation
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix, balanced_accuracy_score

### 4.2 Instantiating Models

In [None]:
# search for an optimal value of K for KNN

# range of k we want to try
k_range = range(1, 31)
# empty list to store scores
k_scores = []

# 1. we will loop through reasonable values of k
for k in k_range:
    # 2. run KNeighborsClassifier with k neighbours
    knn = KNeighborsClassifier(n_neighbors=k)
    # 3. obtain cross_val_score for KNeighborsClassifier with k neighbours
    scores = cross_val_score(knn, X_new, Y_new, cv=10,  n_jobs=10)
    # 4. append mean of scores for k neighbors to k_scores list
    k_scores.append(scores.mean())


In [None]:
# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
sns.set(rc={'figure.figsize':(5,5)})
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-validated accuracy')

In [None]:
#finding the best k
k_df = pd.DataFrame(k_scores, index=k_range)
best_kest = int(k_df[0].idxmax()) # column 0 holds the scores; idxmax returns the k with the highest one
best_kest

In [None]:
knn = KNeighborsClassifier(n_neighbors=best_kest)

In [None]:
svm = SVC(random_state=100, C=1.0,
    kernel='linear',
    probability=True,
    ) 

In [None]:
logit = LogisticRegression(penalty='l2',
    tol=0.0001,
    random_state=10)

In [None]:
etc = ExtraTreesClassifier(criterion='entropy',
    min_samples_split=3,
    min_samples_leaf=1,
    n_jobs=10,
    random_state=100,
    verbose=2
    )

In [None]:
bagging = BaggingClassifier(n_estimators=1000,
    n_jobs=10,
    random_state=100,
    verbose=0)

In [None]:
nb = GaussianNB()

In [None]:
rf = RandomForestClassifier(n_estimators=10, random_state=None)

### 4.3 Fitting Models

In [None]:
# Fitting models that do not require scaling
models_1 = [["ExtraTreesClassifier",etc],
         ["BaggingClassifier",bagging],
         ["GaussianNB",nb],
         ["RandomForestClassifier",rf]]

In [None]:
m_accuracy = []
for i in models_1:
    y_predict = cross_val_predict(i[1], X, Y_new, cv=10, n_jobs=10)
    ACC = round(accuracy_score(Y_new, y_predict), 2)
    # recall of the malignant class (label 1), i.e. sensitivity
    recall = round(recall_score(Y_new, y_predict, pos_label=1), 2)
    B_ACC = round(balanced_accuracy_score(Y_new, y_predict), 2)
    # specificity is the recall of the benign class (label 0)
    Specificiti = round(recall_score(Y_new, y_predict, pos_label=0), 2)
    m_accuracy.append([i[0],ACC,recall,B_ACC,Specificiti])

In [None]:
# Fitting models that require scaling
models_2 = [["LogisticRegression",logit],
         ["SupportVector Machine",svm],
         ["KNeighborsClassifier",knn]]

In [None]:
for i in models_2:
    y_predict = cross_val_predict(i[1], X_new, Y_new, cv=10, n_jobs=10)
    ACC = round(accuracy_score(Y_new, y_predict), 2)
    # recall of the malignant class (label 1), i.e. sensitivity
    recall = round(recall_score(Y_new, y_predict, pos_label=1), 2)
    B_ACC = round(balanced_accuracy_score(Y_new, y_predict), 2)
    # specificity is the recall of the benign class (label 0)
    Specificiti = round(recall_score(Y_new, y_predict, pos_label=0), 2)
    m_accuracy.append([i[0],ACC,recall,B_ACC,Specificiti])
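
For reference, the relationship between these metrics can be checked on a small made-up example. For a binary problem, balanced accuracy is the average of sensitivity (recall of the positive class) and specificity (recall of the negative class). The labels below are invented, with 1 standing for malignant and 0 for benign. All variable names are hypothetical.

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, recall_score

# Made-up labels for illustration: 1 = malignant, 0 = benign
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# confusion_matrix rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

sensitivity = tp / (tp + fn)   # recall of the malignant class
specificity = tn / (tn + fp)   # recall of the benign class
balanced = balanced_accuracy_score(toy_true, toy_pred)

print(sensitivity, specificity, balanced)
print(recall_score(toy_true, toy_pred, pos_label=0))  # same value as specificity
```

Note that `recall_score(..., average='weighted')` weights each class's recall by its support, which makes it equal to plain accuracy, so it cannot stand in for sensitivity in this identity.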
    

### 5. Comparing model performance

In [None]:
performace_table = pd.DataFrame(m_accuracy)
performace_table.columns = ['Model','Accuracy', 'Recall','Bal. Accuracy','Specificity']
performace_table.style.bar(subset=["Accuracy",], color='#0d8ca6')\
                 .bar(subset=["Recall"], color='#50cce6')\
                 .bar(subset=["Bal. Accuracy"], color='#17990e')\
.bar(subset=["Specificity"], color='#6ed667')

In [None]:
plt.figure(figsize=(10,5))
plt.barh(performace_table.Model, performace_table.Accuracy, color='#f5ec42', edgecolor='black')
plt.tight_layout()
plt.show()

In [None]:
# list of feature importances in descending order
rf.fit(X, Y_new)
importance = pd.DataFrame(rf.feature_importances_, index=X_new.columns, columns=['FeatureImportance'])
importance.sort_values(by='FeatureImportance', ascending=False)

### 6.1 Fitting final model

In [None]:
# Refit the selected model (SVM) on the full scaled dataset
svm.fit(X_new,Y_new)

### 6.2 Saving final Model

In [None]:
# Python pickle module is used for serializing and de-serializing a Python object structure
import pickle

In [None]:
# Save the model; the with-block closes the file automatically when done
with open('breast_cancer_svm_model', 'wb') as f1: # wb => write binary
    pickle.dump(svm, f1)
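
A saved model can later be read back with `pickle.load`. The sketch below shows the round trip on synthetic stand-in data rather than the breast cancer dataset. The file name `demo_svm_model` and all `demo_*` variables are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in data (not the breast cancer dataset)
rng = np.random.default_rng(0)
demo_X = rng.random((40, 3))
demo_y = (demo_X[:, 0] > 0.5).astype(int)

demo_model = SVC(kernel="linear").fit(demo_X, demo_y)

# Write the model to a temporary file, then read it back
demo_path = os.path.join(tempfile.mkdtemp(), "demo_svm_model")  # hypothetical file name
with open(demo_path, "wb") as fh:   # wb => write binary
    pickle.dump(demo_model, fh)
with open(demo_path, "rb") as fh:   # rb => read binary
    restored_model = pickle.load(fh)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(demo_model.predict(demo_X),
                                  restored_model.predict(demo_X))
print(same_predictions)
```

Because the final SVM in this notebook is fitted on `X_new`, new samples would need the same fitted `MinMaxScaler` transformation before prediction, so it may be worth pickling `scaler` alongside the model. Only unpickle files from a trusted source.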