In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
import pandas as pd
import seaborn as sns 
%matplotlib inline

# Import data and select "id" column as index

path = '../input/breast-cancer-csv/breastCancer.csv'
df = pd.read_csv(path, index_col = "id")

# Examine the dataframe

df.info()

First, the datatype information from the dataframe shows that "class", our target variable for the classifier model, is stored as an integer. Because it encodes a label rather than a quantity, we convert it to a categorical type. Categorical data needs one further step, one-hot encoding, which we save for just before implementing the model.
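
As a minimal sketch of what one-hot encoding looks like, the toy frame below uses made-up values with the dataset's 2/4 coding. `pd.get_dummies` is one common way to do it, not necessarily the method used later in this notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; "class" uses the dataset's 2/4 coding.
toy = pd.DataFrame({"class": pd.Categorical([2, 4, 2, 4])})

# One indicator column per category; dtype=int keeps the output as 0/1.
one_hot = pd.get_dummies(toy["class"], prefix="class", dtype=int)
print(one_hot)
```

Each row has a 1 in exactly one of the `class_2` / `class_4` columns.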

In [None]:
# Convert the "class" column (target feature for classifier) to categorical data

df["class"] = df["class"].astype("category")

# Examine dataframe again

df.info()

The column "bare_nucleoli" has to be changed to an integer data type. Here we have two problems: the values are stored as strings (object), and missing values are reported as "?". First, "?" is replaced with None and the Series is converted to a numeric datatype, which turns None into NaN. Second, we use scikit-learn's SimpleImputer to fill the missing values with the column median. Lastly, we save the output back into its corresponding column, stored as integers.
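
The same cleaning idea can be sketched with pandas alone. This is an alternative under stated assumptions, not the notebook's method: the Series values are invented, and `pd.to_numeric(errors="coerce")` plus `fillna` stand in for the lambda and the imputer.

```python
import pandas as pd

# Hypothetical raw values, stored as strings with "?" marking a missing entry.
raw = pd.Series(["1", "10", "?", "3", "2", "1"], name="bare_nucleoli")

# errors="coerce" turns anything non-numeric (here "?") into NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# Fill the gap with the median of the observed values, then store as integers.
cleaned = numeric.fillna(numeric.median()).astype(int)
print(cleaned.tolist())
```

Note that `astype(int)` truncates, so a non-integer median such as 2.5 would be stored as 2.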

In [None]:
from sklearn.impute import SimpleImputer
import numpy as np

# Apply a lambda function to replace "?" with None
# (compare with ==; "is" tests object identity and is unreliable for strings)

df["bare_nucleoli"] = df["bare_nucleoli"].apply(lambda x: None if x == "?" else x)

# Convert column to numeric type data

df["bare_nucleoli"] = pd.to_numeric(df.bare_nucleoli)

# Initialize SimpleImputer

imputer = SimpleImputer(missing_values = np.nan, strategy = "median")

# Reshape the imputer input 

imp_input = df.bare_nucleoli.to_numpy().reshape(-1,1)

# Fit and transform imputer

imputer.fit(imp_input)
imp_input_transformed = imputer.transform(imp_input)

# Save the imputer output into the dataframe column and convert to integer datatype

df["bare_nucleoli"] = imp_input_transformed.astype(int)
df.info()

With the data cleaned, we gather further insight through descriptive statistics and Exploratory Data Analysis (EDA). Descriptive statistics show the range of values within each feature and how similar or distant the features are. These insights are useful for model selection, but we still need visual insight.

In [None]:
print(df.describe())

Our next step in the process is EDA. We look at how the features relate to one another through a heatmap of the feature correlation matrix, and use bee-swarm plots to explore how the data is distributed among the categories.

In [None]:
import seaborn as sns 
import matplotlib.pyplot as plt

# Define target 

y = df["class"]

# Compute correlation matrix

corr = df.corr()

# display heatmap from correlation matrix

sns.heatmap(corr,cmap="Blues",  annot=True)


Looking at the heatmap of the correlation matrix, we observe features that are highly correlated. We want to eliminate redundant features while still extracting information from the data. Shape uniformity and size uniformity are one such pair, so instead of eliminating either one, we compute the ratio between them, which keeps information from both. With the ratio in place, size uniformity and shape uniformity are redundant, so they are removed.
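
Highly correlated pairs can also be listed programmatically rather than read off the heatmap. The sketch below runs on synthetic data: the column names mirror the dataset, but the values and the 0.9 cut-off are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the values are generated, not taken from the dataset.
rng = np.random.default_rng(0)
size = rng.integers(1, 11, size=200)
toy = pd.DataFrame({
    "size_uniformity": size,
    "shape_uniformity": np.clip(size + rng.integers(-1, 2, size=200), 1, 10),
    "mitoses": rng.integers(1, 11, size=200),
})

corr = toy.corr().abs()

# Keep the upper triangle only so each pair is reported once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

threshold = 0.9  # hypothetical cut-off for "highly correlated"
high = pairs[pairs > threshold]
print(high)
```

Pairs that appear in `high` are candidates for being combined or dropped, as done with the ratio feature here.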

In [None]:
# Extract ratio between shape uniformity and size uniformity

df["shape_size_uniformity"] = df["shape_uniformity"]/df["size_uniformity"]

# Drop redundant features

df.drop(["shape_uniformity", "size_uniformity"], axis = 1, inplace = True)

# Compute new correlation matrix

corr = df.corr()

# Plot heatmap from the correlation matrix

sns.heatmap(corr, cmap = "Blues", annot = True)

# Check the distribution of the newly created feature in a bee-swarm plot

plt.figure()
sns.swarmplot(x = "class", y = "shape_size_uniformity", data = df)



An important fact to consider is that, according to the Wisconsin Breast Cancer Dataset, the "class" column shows 2 for a benign tumor and 4 for a malignant tumor. For the model, these codes are replaced by 0 (benign) and 1 (malignant).
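
A small sketch of the recoding, using an invented target Series: it maps the 2/4 coding to integer codes for a model and, separately, to readable labels for tables and plots. The variable names are hypothetical.

```python
import pandas as pd

# Hypothetical target values using the dataset's coding: 2 = benign, 4 = malignant.
target = pd.Series([2, 4, 4, 2, 2], name="class")

# Integer codes for the model.
y_codes = target.map({2: 0, 4: 1})

# Readable labels for tables and plots.
y_labels = target.map({2: "Benign", 4: "Malignant"})

print(pd.DataFrame({"raw": target, "code": y_codes, "label": y_labels}))
```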

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Prepare features and target for the model

y = df["class"].replace({2:0, 4:1})
X = df.drop(["class"], axis = 1)

# Generate train and test data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Instantiate Random Forest Classifier

rfc = RandomForestClassifier(n_estimators = 100, max_depth = 4, random_state = 42)

# Fit the model to the training data

rfc.fit(X_train,y_train)

# Predict on train and test data

y_pred_train = rfc.predict(X_train)
y_pred_test = rfc.predict(X_test)

# Compute accuracy scores on train and test data

accuracy_train = accuracy_score(y_train, y_pred_train)
accuracy_test = accuracy_score(y_test, y_pred_test)

# Compute confusion matrix

conf_matrix_rf = confusion_matrix(y_test,y_pred_test)



Our Random Forest Classifier is evaluated through accuracy and a confusion matrix.
The accuracy score is the ratio of correct predictions to the total number of predictions, while the confusion matrix displays the results as a 2 x 2 matrix where:

                  | predicted no | predicted yes |
    actual no     |      TN      |      FP       |
    actual yes    |      FN      |      TP       |

    TN = True Negative      FP = False Positive
    FN = False Negative     TP = True Positive
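
To make the layout concrete, here is a short sketch with invented labels (0 = benign, 1 = malignant). It unpacks the four cells of scikit-learn's confusion matrix and recomputes accuracy and recall from them by hand.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 0 = benign, 1 = malignant.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# ravel() flattens the 2 x 2 matrix row by row: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)  # share of actual positives that were caught

print(tn, fp, fn, tp)
print(accuracy, recall)
```

Recall is the quantity that drops when false negatives rise, which is why it is worth tracking alongside accuracy in a diagnostic setting.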



In [None]:
# Display results

print("The accuracy score of the Random Forest Classifier on train data is {:.2f}".format(accuracy_train))
print("The accuracy score of the Random Forest Classifier on test data is {:.2f}".format(accuracy_test))
print("")
print("Confusion matrix:")
print("")
print(conf_matrix_rf)

# Build a pandas Series of feature importances

importances = pd.Series(data=rfc.feature_importances_, index= X_train.columns)

# Sort importances

importances_sorted = importances.sort_values()

# Plot a horizontal bar plot of the feature importances

importances_sorted.plot(kind = "barh")


The confusion matrix shows 7 false negatives and 5 false positives. The false negatives are the ones we most want to reduce, since we do not want to tell patients they do not have cancer when they actually do. This model has not been tuned yet. As a next step, we use scikit-learn's GridSearchCV to find the best parameters for our Random Forest Classifier from a grid of chosen parameters and values.
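
If reducing false negatives is the priority, one option is to let the grid search optimise recall instead of accuracy. The sketch below is an assumption-laden illustration, not this notebook's configuration: it uses synthetic data from `make_classification` and a deliberately tiny, hypothetical grid so it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training split.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Small hypothetical grid; a real search would cover more values.
param_grid = {"n_estimators": [20, 50], "min_samples_leaf": [2, 8]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="recall",  # assumption: favour parameter sets with fewer false negatives
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_score_` here is the mean cross-validated recall of the best parameter combination, so it is not directly comparable to an accuracy score.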

In [None]:
from sklearn.model_selection import GridSearchCV

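# Establish parameters for GridSearchCV, as a dictionary
# ("auto" is left out of max_features: for classifiers it was an alias of
# "sqrt", and newer scikit-learn releases no longer accept it)

rfc_params = {"n_estimators":[50, 70, 90, 100, 110, 130, 140], 
              "max_features":["log2", "sqrt"],
              "min_samples_leaf":[2, 4, 8, 10]}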

# Set up the grid search over the parameter grid with 5-fold cross-validation

rfc_gscv = GridSearchCV(estimator = rfc,
                       param_grid = rfc_params,
                       cv = 5,
                       scoring = "accuracy",
                       verbose = 2,
                       n_jobs = -1)

# Fit the grid search on the train data (the best model is refit automatically)

rfc_gscv.fit(X_train,y_train)

# show results
print("Best parameters for Grid Search CV Random Forest Classifier model:")
print(rfc_gscv.best_params_)



The Grid Search Cross Validation method gave the best parameters for the Random Forest Classifier. 
The new confusion matrix shows two false negatives and four false positives, an improvement over the confusion matrix of the previous model. False negatives deserve particular care in this problem, since we do not want to diagnose a patient as having a benign tumor when they actually have a malignant one.



In [None]:
# Compute predictions from new model

y_pred = rfc_gscv.predict(X_test)

# Compute confusion matrix from test data and prediction

conf_matrix_rfc_gscv = confusion_matrix(y_test, y_pred)

# show results

print(conf_matrix_rfc_gscv)