In [114]:
import pandas as pd

# Breast Cancer Tumor Image Classification

## Model Exploration

This notebook uses the cleaned data from data_handling.ipynb and evaluates several ML models with their default hyperparameters (apart from a raised iteration limit where a solver needs it). The goal is to determine which models have the best predictive power.

Models of interest include the following:

1. Linear Regression
2. Logistic Regression
3. K Nearest Neighbors
4. Support Vector Classifier
5. Decision Trees
6. Random Forests
7. Extra Tree (a single extremely randomized tree)

In [115]:
training_data = pd.read_csv('Resources/training_data.csv')
y_train = training_data.pop('malignant')
X_train = training_data

testing_data = pd.read_csv('Resources/testing_data.csv')
y_test = testing_data.pop('malignant')
X_test = testing_data

The following function tests estimators that follow the scikit-learn interface (`fit`, `predict`, `score`) and reports their training and testing scores.

In [116]:
from sklearn.metrics import confusion_matrix

def classifier_test(classifier_name, classifier, X_train, y_train, X_test, y_test):
    """Fit an estimator, print its train/test scores, and return them.

    For classifiers, score() is mean accuracy; for LinearRegression it is R^2.
    """
    classifier.fit(X_train, y_train)
    training_score = classifier.score(X_train, y_train)
    testing_score = classifier.score(X_test, y_test)

    # Linear regression predicts continuous values, so it has no confusion matrix
    is_classifier = classifier_name != 'Linear Regression'

    if is_classifier:
        y_pred = classifier.predict(X_test)
        cm = confusion_matrix(y_test, y_pred)
        cm = pd.DataFrame(cm, index=['Actual 0', 'Actual 1'], columns=['Predicted 0', 'Predicted 1'])
        # Derived metrics (computed here but not yet printed or returned)
        precision = cm.iloc[1, 1] / (cm.iloc[1, 1] + cm.iloc[0, 1])
        recall = cm.iloc[1, 1] / (cm.iloc[1, 1] + cm.iloc[1, 0])  # same as sensitivity
        specificity = cm.iloc[0, 0] / (cm.iloc[0, 0] + cm.iloc[0, 1])
        sensitivity = recall
        f1_score = 2 * precision * recall / (precision + recall)

    print(f"Classifier: {classifier_name}")
    print(f"Training Data Score: {100 * training_score:.2f}%")
    print(f"Testing Data Score: {100 * testing_score:.2f}%")

    if is_classifier:
        print('\nConfusion Matrix')
        print(cm)

    print('\n---------------------------------------------------\n')

    return (training_score, testing_score)
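
The function above derives precision, recall, specificity, and F1 from the confusion matrix but does not print them. As a worked illustration, the sketch below applies the same formulas to the Logistic Regression confusion matrix reported later in this notebook's output (128 true negatives, 0 false positives, 7 false negatives, 115 true positives). The variable names `tn`, `fp`, `fn`, and `tp` are chosen here for readability and do not appear in the original code.

```python
# Counts taken from the Logistic Regression confusion matrix in the output:
#           Predicted 0  Predicted 1
# Actual 0          128            0
# Actual 1            7          115
tn, fp, fn, tp = 128, 0, 7, 115

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted malignant that are malignant
recall = tp / (tp + fn)                      # share of malignant cases that are found (sensitivity)
specificity = tn / (tn + fp)                 # share of benign cases correctly identified
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:    {accuracy:.4f}")
print(f"Precision:   {precision:.4f}")
print(f"Recall:      {recall:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"F1 score:    {f1:.4f}")
```

The accuracy computed this way (243 / 250 = 0.972) agrees with the 97.20% testing score printed for Logistic Regression. The 7 false negatives are malignant tumors predicted as benign, which is why recall is worth reporting alongside accuracy for this problem.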

In [117]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rs = 42

classifiers = [
    {'name': 'Linear Regression', 'classifier': LinearRegression()},
    {'name': 'Logistic Regression', 'classifier': LogisticRegression(max_iter=10000, random_state=rs)},
    {'name': 'K-Nearest Neighbors', 'classifier': KNeighborsClassifier()},
    {'name': 'Support Vector Machine', 'classifier': SVC(max_iter=10000, random_state=rs)},
    {'name': 'Decision Tree', 'classifier': DecisionTreeClassifier(random_state=rs)},
    {'name': 'Random Forest', 'classifier': RandomForestClassifier(random_state=rs)},
    {'name': 'Extra Tree', 'classifier': ExtraTreeClassifier(random_state=rs)}
    ]
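
Note that `ExtraTreeClassifier` from `sklearn.tree` is a single extremely randomized tree, whereas `ExtraTreesClassifier` from `sklearn.ensemble` is an ensemble of such trees. The sketch below contrasts the two on synthetic data generated with `make_classification`; the data, split, and variable names are assumptions for illustration only and are not the notebook's tumor dataset, so the scores it produces say nothing about the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import ExtraTreeClassifier

# Synthetic stand-in data (hypothetical; not the notebook's dataset)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# A single extremely randomized tree, as used in the classifier list above
single_tree = ExtraTreeClassifier(random_state=42).fit(X_tr, y_tr)

# An ensemble of extremely randomized trees
ensemble = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

single_score = single_tree.score(X_te, y_te)
ensemble_score = ensemble.score(X_te, y_te)

print(f"Single extra tree accuracy:    {single_score:.3f}")
print(f"Extra trees ensemble accuracy: {ensemble_score:.3f}")
```

If the ensemble variant is what was intended, it could be added to the `classifiers` list as another entry, in which case the results would need to be regenerated.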

In [118]:
classifier_results = []

for classifier in classifiers:
    training_score, testing_score = classifier_test(classifier['name'], classifier['classifier'], X_train, y_train, X_test, y_test)
    classifier_results.append({'name': classifier['name'], 'training_score': training_score, 'testing_score': testing_score})

Classifier: Linear Regression
Training Data Score: 79.52%
Testing Data Score: 77.31%

---------------------------------------------------

Classifier: Logistic Regression
Training Data Score: 96.67%
Testing Data Score: 97.20%

Confusion Matrix
          Predicted 0  Predicted 1
Actual 0          128            0
Actual 1            7          115

---------------------------------------------------

Classifier: K-Nearest Neighbors
Training Data Score: 96.27%
Testing Data Score: 92.80%

Confusion Matrix
          Predicted 0  Predicted 1
Actual 0          121            7
Actual 1           11          111

---------------------------------------------------

Classifier: Support Vector Machine
Training Data Score: 90.27%
Testing Data Score: 91.60%

Confusion Matrix
          Predicted 0  Predicted 1
Actual 0          122            6
Actual 1           15          107

---------------------------------------------------

Classifier: Decision Tree
Training Data Score: 100.00%
Testing Dat

In [119]:
# plot the results as an interactive plotly plot
import plotly.express as px
import plotly.io as pio

df = pd.DataFrame(classifier_results)
fig = px.bar(df, x='name', y=['training_score', 'testing_score'], barmode='group')
fig.update_layout(title='Classifier Scores', yaxis_title='Score', xaxis_title='Classifier')
fig.show()

# save the plot as a static image (requires the kaleido package and an existing Plots/ directory)
pio.write_image(fig, 'Plots/classifier_scores.png')

The Random Forest classifier shows the best predictive power on the supplied testing set, with an accuracy of 99.2%.

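All other classifiers also scored above 90% accuracy on the testing set. Linear regression is the exception, but its score is the coefficient of determination (R²) rather than accuracy, so it is not directly comparable with the classifier scores.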
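
To illustrate why a regression score and a classification accuracy measure different things, the sketch below fits ordinary least squares to a binary target, computes R² (the quantity that `LinearRegression.score` returns), and then computes the accuracy obtained by thresholding the predictions at 0.5. The data is synthetic and the 0.5 threshold is an assumption made for this example; neither comes from the notebook.

```python
import numpy as np

# Synthetic binary-target data (hypothetical; not the notebook's dataset)
rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Ordinary least squares with an intercept column
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef  # continuous predictions, not class labels

# R^2: share of the target's variance explained by the fit
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Accuracy after converting predictions to labels with a 0.5 threshold
accuracy = np.mean((y_hat >= 0.5).astype(int) == y)

print(f"R^2:                  {r2:.3f}")
print(f"Thresholded accuracy: {accuracy:.3f}")
```

The two numbers generally differ for the same fitted model, which is why the linear regression score is best read separately from the classifier accuracies.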

We will focus our optimization on the random forest model, though the other models may offer additional benefit.