In this kernel, we will conduct exploratory data analysis (EDA) to select important variables for breast cancer prediction, then apply several models (logistic regression, decision tree, random forest, SVM) to find out which one performs best. Enjoy!

In [None]:
#suppress warnings
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn import tree


In [None]:
data = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")

In [None]:
data.head()

Attribute Information:

1) ID number  
2) Diagnosis (M = malignant, B = benign)


Ten real-valued features are computed for each cell nucleus:  

a) radius (mean of distances from center to points on the perimeter)  
b) texture (standard deviation of gray-scale values)  
c) perimeter  
d) area  
e) smoothness (local variation in radius lengths)  
f) compactness (perimeter^2 / area - 1.0)  
g) concavity (severity of concave portions of the contour)  
h) concave points (number of concave portions of the contour)  
i) symmetry  
j) fractal dimension ("coastline approximation" - 1)  

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
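
The naming scheme described above can be sketched in a few lines. The exact spellings below (for example the space in "concave points" and the underscore in "fractal_dimension") are assumed to match the header of the Kaggle CSV; check them against `data.columns` before relying on them.

```python
# Ten base measurements, as listed above (spellings assumed to match the CSV header).
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
suffixes = ["mean", "se", "worst"]

# All means come first, then the standard errors, then the "worst" values.
feature_names = [f"{base}_{suffix}" for suffix in suffixes for base in base_features]

print(len(feature_names))  # 30
print(feature_names[0], feature_names[10], feature_names[20])
```

With id and diagnosis occupying fields 1 and 2, positions 0, 10 and 20 of this list correspond to fields 3, 13 and 23 mentioned above.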

In [None]:
data.info()

In [None]:
#remove the last column; we also don't need id
data.drop(data.columns[len(data.columns)-1], axis=1, inplace=True)
data.drop('id', axis=1, inplace=True)

# Exploratory Data Analysis (EDA)

In [None]:
data.diagnosis.value_counts()

In [None]:
sns.set_style('whitegrid')
data.diagnosis.value_counts().plot(kind='bar',color=["lightblue", "salmon"])

In [None]:
categorical_val=[]
continuous_val=[]
for c in data.columns:
    #print('==================')
    #print(f"{c}:{data[c].unique()}")
    if len(data[c].unique()) <= 10:
        categorical_val.append(c)
    else:
        continuous_val.append(c)

In [None]:
print(categorical_val)
print(continuous_val)

In [None]:
plt.figure(figsize=(20,50))
for i, column in enumerate(continuous_val,1):
    plt.subplot(10,3,i)
    sns.distplot(data[data['diagnosis']=='M'][column],rug=False,label="M")
    sns.distplot(data[data['diagnosis']=='B'][column],rug=False,label='B')
    plt.xlabel(column)
    plt.legend()

We can see that malignant and benign tumors show different distributions in some columns:  
radius_mean: malignant tumors have a larger mean radius.  
perimeter_mean: malignant tumors have a larger mean perimeter.  
area_mean: malignant tumors have a larger mean area.  
compactness_mean: malignant tumors have a larger mean compactness.  
concavity_mean: malignant tumors have a larger mean concavity.  
concave points_mean: malignant tumors have a larger mean number of concave points.  


In [None]:
df = data.replace({'diagnosis':{"M":1,"B":0}})

In [None]:
df.head()

In [None]:
corr_matrix=df.corr()
fig, ax = plt.subplots(figsize=(15,15))
ax = sns.heatmap(corr_matrix, annot=True, linewidths=0.5, fmt='.2f',cmap='YlGnBu')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom+0.5, top-0.5)

Some columns are strongly correlated with each other. For example, radius_mean, perimeter_mean and area_mean basically measure the same thing (tumor size). Thus, we can keep just one of them to avoid collinearity. For other groups of similar columns, we'll do the same thing.
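
As a rough sketch of how such groups could be found without reading the heatmap by eye, the snippet below lists every pair of columns whose absolute correlation exceeds a threshold. The data is synthetic and the 0.9 cut-off is an assumption, not a value used in this kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
radius = rng.uniform(5.0, 25.0, size=200)

# Synthetic stand-ins: perimeter and area are functions of radius,
# texture is unrelated noise.
columns = {
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0, 1.0, size=200),
    "area": np.pi * radius**2 + rng.normal(0, 5.0, size=200),
    "texture": rng.normal(20.0, 4.0, size=200),
}
names = list(columns)

# Each row of the stacked array is one variable, as np.corrcoef expects.
corr = np.corrcoef(np.vstack([columns[n] for n in names]))

threshold = 0.9  # assumed cut-off for "basically the same thing"
high_pairs = [
    (names[i], names[j], round(float(corr[i, j]), 3))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
for pair in high_pairs:
    print(pair)
```

On the real data, the same loop could run over `corr_matrix.values` with `corr_matrix.columns` as the names.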

In [None]:
col_drop = ['perimeter_mean','radius_mean','compactness_mean',
            'concave points_mean','radius_se','perimeter_se',
            'radius_worst','perimeter_worst','compactness_worst',
            'concave points_worst','compactness_se','concave points_se',
            'texture_worst','area_worst','concavity_worst']
df2 = df.drop(col_drop,axis=1)

In [None]:
df2.head()

In [None]:
fig, ax = plt.subplots(figsize=(15,15))
ax = sns.heatmap(df2.corr(), annot=True, linewidths=0.5, fmt='.2f',cmap='YlGnBu')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom+0.5, top-0.5)

In [None]:
x = df2.drop('diagnosis',axis=1)

In [None]:
x.shape

In [None]:
#Calculate VIF 
#from statsmodels.stats.outliers_influence import variance_inflation_factor

#vif = pd.DataFrame()
#vif["features"] = x.columns
#vif["VIF Factor"] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]


In [None]:
#vif

In [None]:
#while vif[vif['VIF Factor'] > 10]['VIF Factor'].any():    
#    remove = vif.sort_values('VIF Factor',ascending=0)['features'][1] 
    #print(remove)
    #print(continuous_val)
#    x.drop(remove,axis=1,inplace=True)
#    vif = pd.DataFrame()
#    vif["features"] = x.columns
#    vif["VIF Factor"] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
#    print(vif)
#    print('======================')
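
The commented-out cells above remove features by variance inflation factor (VIF). Below is a minimal NumPy-only sketch of the same idea on synthetic data, using VIF = 1 / (1 - R^2), where R^2 comes from regressing one column on all the others. This sketch includes an intercept in each regression, so its numbers may differ from those of `variance_inflation_factor` applied to raw `x.values`. The helper name `vif` and the threshold of 10 are illustrative choices.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of every column of X."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = []
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1.0 - residual.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)  # nearly a + b, so collinear
d = rng.normal(size=500)

names = ["a", "b", "c", "d"]
X = np.column_stack([a, b, c, d])

# Drop the feature with the largest VIF until every VIF is at most 10.
kept = list(names)
while True:
    factors = vif(X)
    worst = int(np.argmax(factors))
    if factors[worst] <= 10:
        break
    X = np.delete(X, worst, axis=1)
    kept.pop(worst)

print(kept)
```

Note that the sketch always drops the column with the highest VIF (via `argmax`), whereas `['features'][1]` in the commented code selects the row with index label 1 regardless of the sort order.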
    

In [None]:
df2.drop('diagnosis',axis=1).corrwith(df.diagnosis).plot(kind='bar',grid=True,figsize=(12,8),
                                                       title='Correlation with diagnosis')

We see that many of the remaining features are strongly correlated with diagnosis.

# Model Preparation

In [None]:
#there is no categorical variable other than our dependent variable, so we don't have to create dummy variables for our models.
categorical_val

In [None]:
#store variable names 
col_sc = list(df2.columns)
col_sc.remove('diagnosis')

In [None]:
col_sc

In [None]:
#scale our data
#from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
df2[col_sc] = sc.fit_transform(df2[col_sc])

In [None]:
df2.head()

# Applying Machine Learning Algorithms

In [None]:
#from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

def score(m, x_train, y_train, x_test, y_test, train=True):
    #pick the split to evaluate: training data when train=True, test data otherwise
    if train:
        x_eval, y_eval, label = x_train, y_train, 'Train'
    else:
        x_eval, y_eval, label = x_test, y_test, 'Test'

    pred = m.predict(x_eval)
    print(f'{label} Result:\n')
    print(f"Accuracy Score: {accuracy_score(y_eval, pred)*100:.2f}%")
    print(f"Precision Score: {precision_score(y_eval, pred)*100:.2f}%")
    print(f"Recall Score: {recall_score(y_eval, pred)*100:.2f}%")
    print(f"F1 score: {f1_score(y_eval, pred)*100:.2f}%")
    print(f"Confusion Matrix:\n {confusion_matrix(y_eval, pred)}")
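
To make the printed numbers easier to read, here is a small sketch of how the four scores follow from the confusion matrix. For labels 0 and 1, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`. The counts below are made up for illustration and are not results from this kernel; the helper name is hypothetical.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from the four cell counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of the predicted positives, how many are right
    recall = tp / (tp + fn)      # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up counts, purely illustrative.
acc, prec, rec, f1 = metrics_from_confusion(tn=90, fp=10, fn=5, tp=45)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
# accuracy=0.9000 precision=0.8182 recall=0.9000 f1=0.8571
```

In a cancer setting, recall on the malignant class is usually the number to watch, since a false negative is a missed malignant tumor.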
            
    

In [None]:
#from sklearn.model_selection import train_test_split

x = df2.drop('diagnosis',axis=1)
y = df2['diagnosis']

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3,random_state=42)
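
One thing to keep in mind: the scaler earlier was fit on the full data set before this split, so the test rows contributed to the means and standard deviations. A common alternative is to estimate the scaling parameters on the training rows only and reuse them for the test rows. The sketch below shows the idea with plain NumPy on synthetic data; with scikit-learn the equivalent would be `sc.fit_transform(x_train)` followed by `sc.transform(x_test)`.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic features

# Shuffle the row indices and hold out 30% as a test set.
indices = rng.permutation(len(X))
n_test = int(0.3 * len(X))
test_idx, train_idx = indices[:n_test], indices[n_test:]
X_train, X_test = X[train_idx], X[test_idx]

# Estimate the scaling parameters from the training rows only ...
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0), like StandardScaler

# ... and reuse them for both splits.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The scaled test set will generally not have exactly zero mean and unit variance, which is expected: it is treated like unseen data.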

## M1: Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg = logreg.fit(x_train, y_train)

In [None]:
score(logreg, x_train, y_train, x_test, y_test, train=True)

In [None]:
score(logreg, x_train, y_train, x_test, y_test, train=False)

The result looks pretty great. How about tuning our model to prevent over-fitting and make it generalize better to unseen samples?

In [None]:
#C is the inverse of the regularization strength: higher values of C correspond to less regularization
C = [1, .5, .25, .1, .05, .025, .01, .005, .0025] 
l1_metrics = np.zeros((len(C), 5)) 
l1_metrics[:,0] = C

for index in range(0, len(C)):
    logreg = LogisticRegression(penalty='l1', C=C[index], solver='liblinear') 
    logreg = logreg.fit(x_train, y_train)
    pred_test_Y = logreg.predict(x_test)
    l1_metrics[index,1] = np.count_nonzero(logreg.coef_) 
    l1_metrics[index,2] = accuracy_score(y_test, pred_test_Y) 
    l1_metrics[index,3] = precision_score(y_test, pred_test_Y) 
    l1_metrics[index,4] = recall_score(y_test, pred_test_Y)
    
col_names = ['C','Non-Zero Coeffs','Accuracy','Precision','Recall'] 
print(pd.DataFrame(l1_metrics, columns=col_names))

We finally choose C=0.25 because it gives the best performance with fewer non-zero coefficients.
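
The table above compares values of C on the test split, so the test score of the chosen model is somewhat optimistic. A hedged alternative is to pick C by cross-validation on the training split only. The sketch below is not the procedure used in this kernel: it uses scikit-learn's bundled copy of the data via `load_breast_cancer` (all 30 features, and there 0 = malignant, 1 = benign), a shorter list of C values, and 5 folds, all of which are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the Wisconsin data; note 0 = malignant and 1 = benign here.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

C_values = [1, 0.5, 0.25, 0.1, 0.05]
cv_scores = []
for C in C_values:
    # The pipeline refits the scaler inside every fold, so no fold leaks into another.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", C=C, solver="liblinear"),
    )
    # 5-fold cross-validation on the training split only.
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    cv_scores.append(scores.mean())

best_C = C_values[int(np.argmax(cv_scores))]
print(best_C)
```

The test split is then touched only once, to score the model refit with `best_C`.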

In [None]:
logreg_t = LogisticRegression(penalty='l1', C=0.25, solver='liblinear')
logreg_t = logreg_t.fit(x_train,y_train)

In [None]:
score(logreg_t, x_train, y_train, x_test, y_test, train=True)

In [None]:
score(logreg_t, x_train, y_train, x_test, y_test, train=False)

Great! This model does slightly better on the test sample than the original one.

## M2: Decision Tree

In [None]:
from sklearn import tree

tree1 = tree.DecisionTreeClassifier()
tree1 = tree1.fit(x_train, y_train)

In [None]:
score(tree1, x_train, y_train, x_test, y_test, train=True)

In [None]:
score(tree1, x_train, y_train, x_test, y_test, train=False)

This looks like over-fitting. Again, let's tune the model, this time by limiting the depth of the tree.

In [None]:
#decide the tree depth!
depth_list = list(range(2,15))
depth_tuning = np.zeros((len(depth_list), 4)) 
depth_tuning[:,0] = depth_list

for index in range(len(depth_list)):
    mytree = tree.DecisionTreeClassifier(max_depth=depth_list[index]) 
    mytree = mytree.fit(x_train, y_train)
    pred_test_Y = mytree.predict(x_test)
    depth_tuning[index,1] = accuracy_score(y_test, pred_test_Y) 
    depth_tuning[index,2] = precision_score(y_test, pred_test_Y) 
    depth_tuning[index,3] = recall_score(y_test, pred_test_Y)
    
col_names = ['Max_Depth','Accuracy','Precision','Recall'] 
print(pd.DataFrame(depth_tuning, columns=col_names))

Max depth = 3 seems a good choice!

In [None]:
tree2 = tree.DecisionTreeClassifier(max_depth=3)
tree2 = tree2.fit(x_train,y_train)

In [None]:
score(tree2, x_train, y_train, x_test, y_test, train=True)

In [None]:
score(tree2, x_train, y_train, x_test, y_test, train=False)

Next, we can plot the tree!

In [None]:
import graphviz
exported = tree.export_graphviz( decision_tree=tree2,
                                out_file=None,
                                feature_names=x.columns,
                                precision=1,
                                class_names=['B','M'], 
                                filled = True)
graph = graphviz.Source(exported) 
display(graph)
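
If graphviz is not installed, scikit-learn can also print a fitted tree as plain text with `export_text`. The sketch below is an illustration on scikit-learn's bundled copy of the data (all 30 features), not on the reduced `x_train` used in this kernel, so the splits it shows will differ from the plot above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(data.data, data.target)

# export_text returns the fitted tree as an indented plain-text string.
report = export_text(clf, feature_names=list(data.feature_names))
print(report)
```

For the tree in this kernel, the call would be `export_text(tree2, feature_names=list(x.columns))`.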

## M3: Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=1000, random_state= 42)
forest = forest.fit(x_train,y_train)

In [None]:
score(forest, x_train, y_train, x_test, y_test, train=True)

In [None]:
score(forest, x_train, y_train, x_test, y_test, train=False)

Next, let's tune it! This article on random forests covers hyperparameter tuning in detail: https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

In [None]:
from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
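#'auto' was an alias for 'sqrt' in classifiers and has been removed in recent scikit-learn releases
max_features = ['sqrt', 'log2']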
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators, 'max_features': max_features,
               'max_depth': max_depth, 'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf, 'bootstrap': bootstrap}

In [None]:
random_grid

In [None]:
forest2 = RandomForestClassifier(random_state=42)

#Random search of parameters, using 3 fold cross validation, search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = forest2, param_distributions=random_grid,
                              n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)

rf_random.fit(x_train,y_train)
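
To get a feel for how much of the grid a random search with `n_iter=100` actually covers, we can count the combinations. The counts below are read off the lists defined for `random_grid` above.

```python
from math import prod

# Number of candidate values per hyperparameter in the grid above.
grid_sizes = {
    "n_estimators": 10,       # 200, 400, ..., 2000
    "max_features": 2,
    "max_depth": 12,          # 10, 20, ..., 110 plus None
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}
total_combinations = prod(grid_sizes.values())

n_iter, cv_folds = 100, 3
sampled_fraction = n_iter / total_combinations
total_fits = n_iter * cv_folds  # cross-validation fits, not counting the final refit

print(total_combinations, total_fits, f"{sampled_fraction:.1%}")
# 4320 300 2.3%
```

So the search tries roughly 2% of the full grid, which is why it finishes in reasonable time but is not guaranteed to find the best combination.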


In [None]:
rf_random.best_params_

In [None]:
forest3 = RandomForestClassifier(bootstrap=True,
                                 max_depth=20, 
                                 max_features='sqrt', 
                                 min_samples_leaf=2, 
                                 min_samples_split=2,
                                 n_estimators=1200)
forest3 = forest3.fit(x_train, y_train)

In [None]:
score(forest3, x_train, y_train, x_test, y_test, train=True)

In [None]:
score(forest3, x_train, y_train, x_test, y_test, train=False)

## M4: SVM 

Details of the important SVM parameters can be found here: https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769 
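
As a quick illustration of what `gamma` does in the RBF kernel, K(x, z) = exp(-gamma * ||x - z||^2), here is a minimal pure-Python sketch. The two points are arbitrary; the helper name is only for illustration.

```python
import math

def rbf_kernel(x, z, gamma):
    """Similarity of two points under the RBF kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

x, z = [0.0, 0.0], [1.0, 1.0]  # squared distance is 2
for gamma in (0.001, 0.01, 0.1, 1):
    print(gamma, round(rbf_kernel(x, z, gamma), 4))
# 0.001 0.998
# 0.01 0.9802
# 0.1 0.8187
# 1 0.1353
```

A small gamma makes distant points still look similar (a smoother decision boundary), while a large gamma makes the similarity drop off quickly, which lets the model follow the training points more closely.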

In [None]:
from sklearn.svm import SVC

svm = SVC()
svm = svm.fit(x_train,y_train)

In [None]:
score(svm, x_train, y_train, x_test, y_test, train=True)

In [None]:
score(svm, x_train, y_train, x_test, y_test, train=False)

In [None]:
from sklearn.model_selection import GridSearchCV

svm_model = SVC()

params = {"C":(0.1, 0.5, 1, 2, 5, 10, 20), 
          "gamma":(0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1), 
          "kernel":('poly', 'rbf')}

svm_grid = GridSearchCV(svm_model, params, n_jobs=-1, cv=5, verbose=1, scoring="accuracy")
svm_grid.fit(x_train, y_train)

In [None]:
svm_grid.best_params_

In [None]:
svm2 = SVC(C=2, gamma=0.01, kernel='rbf')
svm2 = svm2.fit(x_train, y_train)

In [None]:
score(svm2, x_train, y_train, x_test, y_test, train=True)

In [None]:
score(svm2, x_train, y_train, x_test, y_test, train=False)

# Conclusion

In this kernel, we selected useful features through visual analysis and checked the correlation matrix to avoid collinearity. After that, we used logistic regression, decision tree, random forest and SVM models for prediction, and tuned each of these models to reduce over-fitting. Comparing the outcomes, the logistic regression model gives the most accurate predictions on our test data.


Thanks for your time!  
If this kernel is helpful, please upvote and write comments below to let me know. It would be such a great motivation for me:)