# _Exploratory Data Analysis of the Coimbra Breast Cancer Data Set_

## Dataset Summary.

The dataset used in this project consists of anthropometric data and parameters gathered in a standard blood analysis. It was created by Miguel Patrício, José Pereira, Joana Crisóstomo, Paulo Matafome, Raquel Seiça and Francisco Caramelo, all from the Faculty of Medicine of the University of Coimbra, and Manuel Gomes from the University Hospital Centre of Coimbra (Patrício et al., 2018). The dataset was sourced from the UCI Machine Learning Repository (Dua and Graff, 2017) and can be found [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra), specifically in [this file](https://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv). Each row represents the observations for one patient and each column represents a variable. There are 116 observations and 9 features, all of which are numerical. Neither class contains observations with missing values. The target column, `Classification`, is a binary dependent variable that indicates the presence (Classification = 2) or absence (Classification = 1) of breast cancer.


### Exploratory Data Analysis checklist:

- Formulate the question
- Read in the data
- Check the packaging
- Look at the top and the bottom of your data
- Make a plot
- Follow up

### Formulate the Question: 

Given the clinical and anthropometric data available, can we predict whether a patient has breast cancer?

## Load Required Packages

In [None]:
import matplotlib.pyplot as plt
import altair as alt
import numpy as np
import pandas as pd
import seaborn as sns

# Classifiers 
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# train test split and cross validation
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)

# Feature and model selection metrics
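# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay is its replacement.
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    confusion_matrix,
)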

# Preprocessing and pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    FunctionTransformer,
    Normalizer,
    StandardScaler,
    normalize,
    scale,
)

## Read in the data and Check the packaging

In [None]:
bc_df = pd.read_csv("../data/raw/dataR2.csv")

bc_df
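
The packaging check can also be made explicit in code. The sketch below is only an illustration: `check_packaging` is a hypothetical helper that is not part of this notebook, and the frame it runs on contains made-up values, with column names assumed from the dataset description.

```python
import pandas as pd


def check_packaging(df, target="Classification"):
    """Summarise shape, missing values, dtypes and class balance of a frame."""
    return {
        "n_rows": df.shape[0],
        "n_features": df.shape[1] - 1,  # every column except the target
        "n_missing": int(df.isna().sum().sum()),
        "all_numeric": all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes),
        "class_counts": df[target].value_counts().to_dict(),
    }


# Made-up placeholder rows; these are NOT rows from dataR2.csv.
toy_df = pd.DataFrame(
    {
        "Age": [40, 55, 62, 70],
        "BMI": [22.0, 27.5, 30.1, 24.3],
        "Glucose": [85, 98, 110, 92],
        "Classification": [1, 1, 2, 2],
    }
)

summary = check_packaging(toy_df)
print(summary)
```

If the file matches the dataset summary, calling `check_packaging(bc_df)` should report 116 rows, 9 features, no missing values and only numeric columns.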

### The Workflow to which we should adhere.

To avoid breaking the golden rule (the test data must not influence model development in any way), and thereby obtaining an overly optimistic estimate of our model's performance, we split the dataset before performing the exploratory data analysis.

In [None]:
train_df, test_df = train_test_split(bc_df, test_size = 0.2, random_state = 123)
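
As a variation on the split above, the sketch below passes `stratify` so that both parts keep the same class proportions, which can matter with a dataset this small. This is an assumption about a possible refinement, not what the notebook does, and the frame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame: 25 rows, one placeholder feature, labels coded 1 and 2.
toy_df = pd.DataFrame(
    {
        "Glucose": np.arange(25),
        "Classification": [1] * 10 + [2] * 15,
    }
)

# stratify keeps the 40% / 60% label ratio the same in both parts.
toy_train, toy_test = train_test_split(
    toy_df,
    test_size=0.2,
    random_state=123,
    stratify=toy_df["Classification"],
)

# The two parts share no rows, so EDA on toy_train never sees a test row.
print(set(toy_train.index).isdisjoint(toy_test.index))
print(toy_train["Classification"].value_counts().to_dict())
print(toy_test["Classification"].value_counts().to_dict())
```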

In [None]:
train_df.info()

In [None]:
train_df.describe(include = "all")

## Make Plots:

In [None]:
# All predictors are numeric; drop the target before looping over the columns.
features = train_df.drop(columns = ["Classification"]).select_dtypes(include = np.number)

# Overlay one histogram per class (1 = absence, 2 = presence) for every feature.
for feat in features:
    train_df.groupby("Classification")[feat].plot.hist(bins = 20, alpha = 0.4, legend = True)
    plt.xlabel(feat)
    plt.title("Histogram of " + feat)
    plt.show()

## Follow up:

Looking at the histograms above, there seem to be some interesting features (such as Glucose, Insulin, HOMA, and Resistin) that can be used to predict the presence or absence of breast cancer. We therefore plan to explore classification evaluation metrics, develop a baseline model, explore more complex models, choose a model based on our evaluation metrics, and optimize the hyperparameters of the chosen model.

# Analysis 



In [None]:
X_train, y_train = train_df.drop(columns=["Classification"]), train_df["Classification"]

X_test, y_test = test_df.drop(columns=["Classification"]), test_df["Classification"]

In [None]:
numeric_features = X_train.select_dtypes(include=np.number).columns.tolist()

numeric_features

# Data Preprocessing

Because all of our features are numerical and there are no missing values, the machine learning pipeline only needs to scale the features. Scaling puts the features on comparable ranges, which matters for distance-based and regularized models such as kNN, the RBF SVM and logistic regression.

In [None]:
# Since there are no missing values, we do not need to impute the data
numeric_transformer = make_pipeline(StandardScaler())

preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features)
)

preprocessor.fit(X_train)  # Fit so that the fitted transformers can be inspected
preprocessor.named_transformers_

In [None]:
preprocessor.fit(X_train, y_train);
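
To illustrate what the scaling step does, the sketch below fits a `StandardScaler` inside a column transformer on a few made-up training rows and then applies the learned means and standard deviations to a held-out row. The numbers are placeholders and the `toy_*` names are hypothetical; only the column names echo the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Made-up numbers for illustration only.
toy_train = pd.DataFrame({"Glucose": [70.0, 90.0, 110.0], "BMI": [20.0, 25.0, 30.0]})
toy_test = pd.DataFrame({"Glucose": [130.0], "BMI": [25.0]})

toy_preprocessor = make_column_transformer((StandardScaler(), ["Glucose", "BMI"]))

# Learn each column's mean and standard deviation from the training rows only...
scaled_train = toy_preprocessor.fit_transform(toy_train)
# ...then reuse those training statistics on the held-out row.
scaled_test = toy_preprocessor.transform(toy_test)

print(scaled_train.mean(axis=0))  # close to 0 for each column
print(scaled_train.std(axis=0))   # close to 1 for each column
print(scaled_test)
```

Because the held-out row is transformed with the training statistics, no information from it leaks into the fitted preprocessor.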

## Fitting a baseline model and Multiple models

We decided to fit a baseline model to give us an idea of what a minimum score would look like. Furthermore, we attempted to fit different classifiers to observe how they perform on the data.

In [None]:
classifiers = {
    "DummyClassifier": DummyClassifier(strategy="most_frequent"),
    "Decision tree": DecisionTreeClassifier(),
    "kNN": KNeighborsClassifier(),
    "RBF SVM": SVC(),
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(),
}

In [None]:
results_dict = {}

scoring = ['recall', 'accuracy', 'precision', 'f1']

# Cross-validate each classifier inside the same preprocessing pipeline and
# keep the mean of every timing and metric across the folds.
for name, classifier in classifiers.items():
    pipe_classifier = make_pipeline(preprocessor, classifier)
    scores = cross_validate(pipe_classifier, X_train, y_train, return_train_score = True, scoring = scoring)
    results_dict[name] = pd.DataFrame(scores).mean().tolist()

# Rows are classifiers; columns are the keys returned by cross_validate
pd.DataFrame(results_dict, index = scores.keys()).round(4).T
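
One detail worth checking when reading these scores: scikit-learn's string scorers `'recall'`, `'precision'` and `'f1'` treat label 1 as the positive class by default, and in this dataset label 1 means absence of breast cancer. The sketch below, on made-up labels, shows the difference and builds scorers with `pos_label=2`. Treating the cancer class as the positive class is our assumption about the intent, and `cancer_scoring` is a hypothetical name.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_validate

# Toy labels using the dataset's coding: 1 = absence, 2 = presence.
y_true = [1, 1, 2, 2, 2, 2]
y_pred = [1, 2, 2, 2, 2, 1]

recall_controls = recall_score(y_true, y_pred)               # pos_label=1 by default
recall_patients = recall_score(y_true, y_pred, pos_label=2)  # score the cancer class
print(recall_controls, recall_patients)

# Hypothetical replacement for the string-based `scoring` list.
cancer_scoring = {
    "recall": make_scorer(recall_score, pos_label=2),
    "precision": make_scorer(precision_score, pos_label=2),
    "f1": make_scorer(f1_score, pos_label=2),
}

# Made-up data in which label 2 is the majority class.
X_toy = np.zeros((25, 1))
y_toy = np.array([1] * 10 + [2] * 15)
dummy_scores = cross_validate(
    DummyClassifier(strategy="most_frequent"),
    X_toy,
    y_toy,
    scoring={
        "recall_label_1": make_scorer(recall_score),  # default positive label
        "recall_label_2": cancer_scoring["recall"],
    },
)
print(dummy_scores["test_recall_label_1"].mean())
print(dummy_scores["test_recall_label_2"].mean())
```

On this toy data the most-frequent baseline always predicts label 2, so its recall depends entirely on which label is treated as positive. Whether the same holds for the training split here depends on its class balance.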

## Analysis and Results

The problem statement asks us to predict the presence or absence of breast cancer from the available clinical and anthropometric data. We therefore want to reduce the number of false negatives that our model produces (i.e. we care most about correctly detecting the presence of breast cancer), which makes the recall score an important evaluation metric. In the table above, the dummy classifier has the lowest recall score (0), so we can rule it out. The decision tree classifier has high train scores for recall, accuracy and f1 but much lower validation ("test") scores. This indicates overfitting to the training data, so we eliminate this classifier as well. The random forest classifier can be discarded for the same reason: very high train scores and much lower validation ("test") scores.

The k-nearest neighbours classifier, the support vector machine (SVM) with the radial basis function (RBF) kernel, and the logistic regression classifier each have similar train and validation scores. Of the three, logistic regression handles the fundamental tradeoff best (i.e. it has the smallest gap between train and validation scores), and it also has the highest validation f1 ("test_f1") score, which makes it a very promising classifier. The f1 score is important for a second reason: it combines the recall and precision scores into a single number, which gives us one metric to optimize if we later tune the hyperparameters of the chosen classifier.
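
The gap between train and validation scores can be computed directly from the output of `cross_validate`. The sketch below is a minimal illustration on synthetic stand-in data from `make_classification`, not the Coimbra data, so its numbers say nothing about the results above; `gap_df` is a hypothetical name.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 90 rows, 9 numeric features, labels 0 and 1.
X_toy, y_toy = make_classification(n_samples=90, n_features=9, random_state=123)

toy_models = {
    "Decision tree": DecisionTreeClassifier(random_state=123),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

rows = {}
for name, model in toy_models.items():
    scores = cross_validate(model, X_toy, y_toy, scoring="f1", return_train_score=True)
    rows[name] = {
        "train_f1": scores["train_score"].mean(),
        "validation_f1": scores["test_score"].mean(),
    }

gap_df = pd.DataFrame(rows).T
# A large positive gap suggests the model fits the training folds much better
# than it generalises to the validation folds.
gap_df["gap"] = gap_df["train_f1"] - gap_df["validation_f1"]
print(gap_df.round(4))
```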

Finally, based on this analysis, we choose logistic regression as the model for this problem.
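
The hyperparameter optimization mentioned earlier could look roughly like the sketch below. It is a sketch under stated assumptions rather than the analysis itself: the data are synthetic (`make_classification`, labels 0 and 1), and the grid of `C` values is a hypothetical choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the Coimbra data.
X_toy, y_toy = make_classification(n_samples=90, n_features=9, random_state=123)

# Scaling lives inside the pipeline, so it is refit on every training fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid: C is the inverse of the regularization strength.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5, return_train_score=True)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 4))
```

The step name `logisticregression` is the one `make_pipeline` generates from the lowercased class name, which is why the grid key carries that prefix.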

# _References_

Patrício, M., Pereira, J., Crisóstomo, J., Matafome, P., Gomes, M., Seiça, R. and Caramelo, F., 2018. Using Resistin, glucose, age and BMI to predict the presence of breast cancer. BMC Cancer, 18(1). https://doi.org/10.1186/s12885-017-3877-1

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.