# Advanced Machine Learning and Deep Learning Pathway
## Sasha DiVall, Data Science Co-op, Bentley Systems
### Predict Breast Cancer
The Wisconsin Breast Cancer dataset poses a classification task. From features extracted from scanned images, **predict whether a tumor is malignant or benign.**

Even before you begin, consider the metrics and figures you should include to give an accurate picture of your implementation's performance. In particular, is accuracy the best metric here?

Here are the performance levels you may expect (all three are sufficient for the AI Champion pathway, but the last two are an interesting challenge):
* Minimum: 85% Accuracy
* Challenge: 92% Accuracy
* Expert: 97% Accuracy

Don't forget to clean and look at the data first! Also, try a few machine learning approaches and compare their accuracy.
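
On the question of whether accuracy is the best metric, here is a minimal sketch using made-up labels (not the real dataset). It shows how accuracy can look acceptable while every malignant case is missed. The toy labels and variable names are assumptions for illustration only.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical toy labels (not from the dataset): 8 benign, 2 malignant
y_true = ["B"] * 8 + ["M"] * 2
# A degenerate "model" that always predicts benign
y_pred = ["B"] * 10

accuracy = accuracy_score(y_true, y_pred)
# Recall for the malignant class: share of true malignant cases that were caught
malignant_recall = recall_score(y_true, y_pred, pos_label="M")

print(f"Accuracy: {accuracy:.2f}")
print(f"Malignant recall: {malignant_recall:.2f}")
```

For a screening task like this one, recall on the malignant class is often worth reporting alongside accuracy, since a missed malignant tumor is usually the costlier error.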

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
cancer_data = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")

### Cleaning Data
Investigate null values and find the best strategy to handle null values. Think about which attributes are best in predicting whether the tumor is malignant or benign. 
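
One possible strategy, sketched on a small made-up frame rather than the real CSV: drop columns that are entirely null, then fill any remaining gaps with the column median. The values and the `empty_col` column are hypothetical; whether the real file contains such a column is something to check with `isna()` first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the numbers are invented for illustration
toy = pd.DataFrame({
    "radius_mean": [14.2, 20.1, 11.8],
    "texture_mean": [19.0, np.nan, 17.5],
    "empty_col": [np.nan, np.nan, np.nan],
})

# Share of missing values per column
null_share = toy.isna().mean()
print(null_share)

# Drop columns with no data at all, then impute the rest with the median
cleaned = toy.dropna(axis=1, how="all")
cleaned = cleaned.fillna(cleaned.median(numeric_only=True))
print(cleaned)
```

Median imputation is only one option; dropping the affected rows may be just as reasonable when very few values are missing.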


In [None]:
# Testing for imbalanced classes - how many more benign observations are there than malignant
len(cancer_data[cancer_data["diagnosis"] == "B"]) - len(cancer_data[cancer_data["diagnosis"] == "M"])
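
The difference in counts above can also be read as proportions. A small sketch on a hypothetical label column (the five labels are invented), using `value_counts`:

```python
import pandas as pd

# Hypothetical labels standing in for cancer_data["diagnosis"]
diagnosis = pd.Series(["B", "B", "B", "M", "M"], name="diagnosis")

counts = diagnosis.value_counts()                 # absolute counts per class
shares = diagnosis.value_counts(normalize=True)   # class proportions

print(counts)
print(shares)
```

Proportions make it easier to judge how far the classes are from a 50/50 split, which in turn tells you what accuracy a trivial majority-class predictor would reach.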

In [None]:
# Null values
cancer_data.isna().sum()

In [None]:
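# Dropping id column (and any column that is entirely null) and describing numeric data
cancer_data = cancer_data.drop(columns = ["id"])
cancer_data = cancer_data.dropna(axis = 1, how = "all") # no-op if no column is completely empty
cancer_num = cancer_data.drop(columns = ["diagnosis"])
cancer_num.describe().T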

### Brainstorming
* Since there are two classes of tumor (benign and malignant), a binary classification algorithm is suited here.
#### Models
* Let's test a couple of models and see which ones perform best

In [None]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# define X and y variables
X = cancer_data.drop(columns = ["diagnosis"])
y = cancer_data["diagnosis"]

In [None]:
# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state = 0)

In [None]:
X_train.shape, X_val.shape, y_train.shape, y_val.shape
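
Because the classes are not perfectly balanced, one option is a stratified split, which keeps the class proportions the same in both parts. This is a sketch on toy arrays (the data and the `test_size` value are assumptions, not the notebook's settings):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 12 benign and 8 malignant
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array(["B"] * 12 + ["M"] * 8)

# stratify=y_toy preserves the B/M ratio in the train and validation parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

train_share_m = np.mean(y_tr == "M")
val_share_m = np.mean(y_va == "M")
print(train_share_m, val_share_m)
```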

## Logistic Regression

In [None]:
log_reg = LogisticRegression(solver = "liblinear")
log_reg.fit(X_train, y_train)

In [None]:
log_reg_preds = log_reg.predict(X_val)
print("Benign: ", sum(log_reg_preds == 'B'))
print("Malignant: ", sum(log_reg_preds == 'M'))

In [None]:
ConfusionMatrixDisplay.from_predictions(y_val, log_reg_preds, 
                                        labels = log_reg.classes_, cmap = "Blues")

In [None]:
print(classification_report(y_val, log_reg_preds))

The LogisticRegression Model has an accuracy of **96%**.
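
The figures in the report can also be pulled out programmatically instead of being read off the printed table. A minimal sketch with invented labels (not the notebook's predictions), using the `output_dict=True` option of `classification_report`:

```python
from sklearn.metrics import classification_report

# Hypothetical labels, invented for illustration
y_true = ["B", "B", "B", "M", "M", "M"]
y_pred = ["B", "B", "M", "M", "M", "B"]

report = classification_report(y_true, y_pred, output_dict=True)

overall_accuracy = report["accuracy"]
malignant_recall = report["M"]["recall"]        # true M that were predicted M
malignant_precision = report["M"]["precision"]  # predicted M that were truly M

print(overall_accuracy, malignant_recall, malignant_precision)
```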

In [None]:
# Decision Tree Classifier
decision_tree = DecisionTreeClassifier(random_state = 0)
decision_tree.fit(X_train, y_train)

In [None]:
decision_tree_preds = decision_tree.predict(X_val)
print("Benign: ", sum(decision_tree_preds == 'B'))
print("Malignant: ", sum(decision_tree_preds == 'M'))

In [None]:
ConfusionMatrixDisplay.from_predictions(y_val, decision_tree_preds, 
                                       labels = decision_tree.classes_, cmap = 'Blues')
plt.show()

In [None]:
print(classification_report(y_val, decision_tree_preds))

The accuracy of the DecisionTreeClassifier is **88%**.

In [None]:
# Figuring out which features are most important in the classification
decision_tree_feature_imps = pd.DataFrame({"Feature": decision_tree.feature_names_in_, 
                                          "Importance": decision_tree.feature_importances_})

decision_tree_feature_imps["Importance"] = decision_tree_feature_imps["Importance"].round(4)

decision_tree_feature_imps.sort_values(by = ["Importance"], ascending = False).reset_index(drop = True)

In [None]:
# Find which attributes don't contribute to the model at all
decision_tree_feature_imps[decision_tree_feature_imps["Importance"] == 0.0].reset_index(drop = True)

In [None]:
# Random Forest Classifier
random_forest = RandomForestClassifier(random_state = 0)
random_forest.fit(X_train, y_train)

In [None]:
random_forest_predictions = random_forest.predict(X_val)
print("Benign: ", sum(random_forest_predictions == 'B'))
print("Malignant: ", sum(random_forest_predictions == 'M'))

In [None]:
ConfusionMatrixDisplay.from_predictions(y_val, random_forest_predictions, 
                                       labels = random_forest.classes_, cmap = "Blues")

In [None]:
print(classification_report(y_val, random_forest_predictions))

The RandomForestClassifier model is **97%** accurate.

In [None]:
# KNN Classifier
error_uniform = [] # error rates from models with uniformly distributed weights
error_distance = [] # error rates from models with distance based weights

k_range = range(1,31)

for k in k_range:
    knn_clf = KNeighborsClassifier(n_neighbors = k, weights = 'uniform')
    knn_clf.fit(X_train, y_train)
    predictions = knn_clf.predict(X_val)
    error_uniform.append(1 - accuracy_score(y_val, predictions))
    
    knn_clf = KNeighborsClassifier(n_neighbors = k, weights = 'distance')
    knn_clf.fit(X_train, y_train)
    predictions = knn_clf.predict(X_val)
    error_distance.append(1 - accuracy_score(y_val, predictions))
    

In [None]:
# Plotting the error rates
plt.figure(figsize=(16,9))
plt.plot(k_range, error_uniform, c = 'blue',
         linestyle = "solid", marker = "o", markerfacecolor = "black", 
         label = "Error Uniform")
plt.plot(k_range, error_distance, c  = 'green', 
         linestyle = "dashed", marker = "o", markerfacecolor = "black",
         label = "Error Distance")

plt.xlabel("K value")
plt.ylabel("Error Rate")
plt.title("Error Rates for Each K Value")
plt.legend()
plt.show()

The lowest error rate is first reached at k = 8 with uniform weights.
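
The best k can also be read off in code rather than from the plot. A small sketch with invented error rates (the numbers are hypothetical, not the notebook's results); `np.argmin` returns the first position of the minimum, which matches "occurs first":

```python
import numpy as np

# Hypothetical error rates for k = 1..5, invented for illustration
k_values = range(1, 6)
errors = [0.08, 0.06, 0.05, 0.05, 0.07]

# argmin gives the index of the first minimum; map it back to the k value
best_k = k_values[int(np.argmin(errors))]
print(best_k)
```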

In [None]:
knn = KNeighborsClassifier(n_neighbors = 8, weights = 'uniform')
knn.fit(X_train, y_train)

In [None]:
knn_preds = knn.predict(X_val)
print("Benign: ", sum(knn_preds == 'B'))
print("Malignant: ", sum(knn_preds == "M"))

In [None]:
ConfusionMatrixDisplay.from_predictions(y_val, knn_preds, 
                                       labels = knn.classes_, cmap = "Blues")
plt.show()

In [None]:
print(classification_report(y_val, knn_preds))

The KNN Classifier is **96%** accurate.
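
KNN relies on distances, so features with large ranges can dominate the result. The following sketch compares KNN with and without standardization. It assumes scikit-learn's bundled Wisconsin breast cancer data (`load_breast_cancer`) as a stand-in for the Kaggle CSV; labels there are integers rather than "B"/"M", so the numbers need not match the ones in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled copy of the dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

# KNN on raw features
raw_knn = KNeighborsClassifier(n_neighbors=8).fit(X_tr, y_tr)

# KNN on standardized features; the scaler is fit on the training part only
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=8)
).fit(X_tr, y_tr)

raw_acc = raw_knn.score(X_va, y_va)
scaled_acc = scaled_knn.score(X_va, y_va)
print(f"raw: {raw_acc:.3f}, scaled: {scaled_acc:.3f}")
```

Wrapping the scaler in a pipeline keeps validation data out of the scaling statistics. The same idea may be worth trying for the SVM models.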

In [None]:
# Support Vector Machines (SVM) Classifier
## 1-Linear Kernel
linear_svm = SVC(C = 1, kernel = 'linear')
linear_svm.fit(X_train, y_train)

In [None]:
linear_svm_preds = linear_svm.predict(X_val)
print("Benign: ", sum(linear_svm_preds == 'B'))
print("Malignant: ", sum(linear_svm_preds == 'M'))

In [None]:
ConfusionMatrixDisplay.from_predictions(y_val, linear_svm_preds, 
                                       labels = linear_svm.classes_, cmap = "Blues")
plt.show()

In [None]:
print(classification_report(y_val, linear_svm_preds))

The linear SVM model is **96%** accurate.

In [None]:
# 2-Polynomial Kernel
poly_svm = SVC(C = 1, kernel = 'poly')
poly_svm.fit(X_train, y_train)

In [None]:
poly_svm_preds = poly_svm.predict(X_val)

print("Benign: ", sum(poly_svm_preds == 'B'))
print("Malignant: ", sum(poly_svm_preds == 'M'))

In [None]:
ConfusionMatrixDisplay.from_predictions(y_val, poly_svm_preds,
                                       labels = poly_svm.classes_, cmap = "Blues")
plt.show()

In [None]:
print(classification_report(y_val, poly_svm_preds))

The polynomial-kernel SVM model is **92%** accurate.
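
A note on the polynomial kernel: in scikit-learn, `SVC(kernel='poly')` uses `degree=3` unless told otherwise, so a quadratic kernel has to be requested explicitly. A minimal sketch (the variable names are illustrative):

```python
from sklearn.svm import SVC

# Default polynomial kernel: degree is 3 when not specified
default_poly = SVC(C=1, kernel="poly")

# Quadratic kernel: degree must be set explicitly
quadratic_poly = SVC(C=1, kernel="poly", degree=2)

print(default_poly.degree, quadratic_poly.degree)
```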

# Conclusion
In conclusion, while all models performed reasonably well, the **RandomForestClassifier** performed best on the validation split, with 97% accuracy.
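
Since the ranking above rests on a single train/validation split, a cross-validated comparison is one way to check how stable it is. This is a sketch, not the method used in this notebook: it assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the Kaggle CSV and leaves out the SVM models to keep it quick.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled copy of the dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)

models = {
    "LogisticRegression": LogisticRegression(solver="liblinear"),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "KNN (k=8)": KNeighborsClassifier(n_neighbors=8),
}

# Mean 5-fold accuracy per model
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If two models end up within one standard deviation of each other, the difference between them is probably not meaningful on a dataset of this size.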