In [None]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.feature_selection import SelectKBest, f_classif

Convert string columns to integer codes, fill missing values with the column mean, and min-max normalize every column.
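
As a small illustration of these three steps, the sketch below runs them on a hypothetical toy frame. The column names `grade` and `size` and their values are made up for the example and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; names and values are illustrative only.
toy = pd.DataFrame({
    "grade": ["low", "high", "mid", "high"],
    "size": [10.0, 30.0, None, 20.0],
})

# Encode strings as integer codes starting at 1 (categories are sorted alphabetically).
toy["grade"] = pd.Categorical(toy["grade"]).codes + 1

# Fill the missing numeric value with the column mean, then rescale each column to [0, 1].
toy = toy.fillna(toy.mean())
toy = (toy - toy.min()) / (toy.max() - toy.min())
print(toy)
```

Note that the integer codes impose an arbitrary (alphabetical) order on the categories, which distance-based models such as KNN will treat as meaningful.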

In [None]:
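df = pd.read_csv('Breast_Cancer_dataset.csv')
# Encode every non-numeric column as integer codes starting at 1
# (a missing string gets code -1 from pandas, so it becomes 0 here).
for col in df.columns:
    if not np.issubdtype(df[col].dtype, np.number):
        df[col] = pd.Categorical(df[col]).codes + 1
# Replace missing numeric values with the column mean.
df.fillna(df.mean(), inplace=True)
# Min-max normalize every column (including 'Status') to [0, 1].
df = (df - df.min()) / (df.max() - df.min())
df.to_csv('python_preprocessed.csv', index=False)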

In [None]:
def split_data():
    df = pd.read_csv('python_preprocessed.csv')
    X = df.drop(columns=['Status'])
    y = df['Status']

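    # Keep the 10 features with the highest ANOVA F-value.
    # Note: the scores are computed on the full dataset, before the train/test split.
    selector = SelectKBest(score_func=f_classif, k=10)
    X_selected = selector.fit_transform(X, y)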

    selected_features = X.columns[selector.get_support()]
    print("Selected Features:", selected_features)

    df_selected = pd.DataFrame(X_selected, columns=selected_features)
    df_selected['Status'] = y.values


    # Random 80/20 split; no random_state is set, so every call yields a different split.
    training_set = df_selected.sample(frac=0.8)
    test_set = df_selected.drop(training_set.index)

    X_train = training_set.drop(columns=['Status'])
    y_train = training_set['Status']
    X_test = test_set.drop(columns=['Status'])
    y_test = test_set['Status']

    return (X_train, y_train), (X_test, y_test)

# test
(X_train, y_train), (X_test, y_test) = split_data()
print(X_train.shape, y_train.shape)

1. **Data Cleaning and Missing Value Replacement**:  
   Missing values were addressed by replacing them with the average value of each respective feature.

2. **Normalization**:  
   Normalization was performed using the formula:
   $\text{val} = \frac{\text{val} - \text{min}}{\text{max} - \text{min}}$
   

3. **Balancing the Dataset**:  
   The dataset contains significantly more patients marked as "alive" than patients marked as "dead." For certain algorithms (especially KNN, which we coded manually), we adjusted the training set by rescaling the proportion of "alive" patients to achieve better balance.

4. **Feature Selection**:  
   Feature selection was conducted using `SelectKBest` with the `f_classif` scoring function from the `scikit-learn` package. \
   This method selects the top $k$ features based on the ANOVA F-value, calculated as $F = \frac{\text{between-group variance}}{\text{within-group variance}}$. The top 10 features were selected for the final model.
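
To make the F-value concrete, here is a minimal sketch that computes it by hand for one feature and compares it with `f_classif`. For $k$ classes and $n$ samples it uses $F = \frac{SS_\text{between}/(k-1)}{SS_\text{within}/(n-k)}$. The data is synthetic and the helper name `anova_f` is hypothetical, not part of the notebook.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
# Synthetic, illustrative data: one feature, two classes with shifted means.
x = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(1.0, 1.0, 30)])
y = np.array([0] * 50 + [1] * 30)

def anova_f(x, y):
    """One-way ANOVA F-statistic of feature x grouped by class label y."""
    classes = np.unique(y)
    n, k = len(x), len(classes)
    grand_mean = x.mean()
    # Between-group sum of squares: spread of the class means around the grand mean.
    ss_between = sum(len(x[y == c]) * (x[y == c].mean() - grand_mean) ** 2 for c in classes)
    # Within-group sum of squares: spread of samples around their own class mean.
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

manual_f = anova_f(x, y)
sklearn_f = f_classif(x.reshape(-1, 1), y)[0][0]
print(manual_f, sklearn_f)
```

A feature whose class means are far apart relative to the spread inside each class gets a large F-value and is more likely to be kept.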


KNN finds the k nearest neighbors of a point and takes a majority vote among their labels; we implemented it from scratch.
The main hyperparameter is the number of neighbors that vote, which is set to 10 in this case.

KNN is very easy to implement and understand, but every prediction compares the query point against the whole training set, so it is computationally expensive and scales poorly to large datasets.
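
For comparison, the same idea can be sketched in vectorized NumPy, which avoids the Python-level loops. This is an illustrative alternative under the assumption that the features fit in a dense array; `knn_predict` and the demo data are hypothetical and are not the implementation used for the reported results.

```python
import numpy as np
from collections import Counter

def knn_predict(point, X_train, y_train, k=10):
    """Majority vote among the k training rows closest to `point` (Euclidean distance)."""
    X = np.asarray(X_train, dtype=float)
    p = np.asarray(point, dtype=float)
    # Distances from the query point to all training rows at once.
    dists = np.sqrt(((X - p) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    labels = np.asarray(y_train)[nearest]
    return Counter(labels.tolist()).most_common(1)[0][0]

# Tiny made-up example: three points near the origin (class 0), two near (1, 1) (class 1).
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0]])
y_demo = np.array([0, 0, 0, 1, 1])
print(knn_predict([0.05, 0.05], X_demo, y_demo, k=3))
```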

In [None]:
# KNN
# Euclidean distance between two feature rows
def distance(row1, row2):
    dist = 0
    # Rows hold features only (the label was dropped), so include every entry.
    for i in range(len(row1)):
        dist += (row1.iloc[i] - row2.iloc[i]) ** 2
    return dist ** 0.5


def knn(point, x_train, y_train, k=10):
    distances = []
    for i in range(len(x_train)):
        dist = distance(point, x_train.iloc[i])
        distances.append((y_train.iloc[i], dist))
    distances.sort(key=lambda x: x[1])
    neighbors = [x[0] for x in distances[:k]]
    return max(set(neighbors), key=neighbors.count)

(X_train, y_train), (X_test, y_test) = split_data()
total_correct = 0
for i in range(len(X_test)):
    prediction = knn(X_test.iloc[i], X_train, y_train)
    if y_test.iloc[i] == prediction:
        total_correct += 1

print(f'Accuracy: {total_correct / len(X_test)}')

Naive Bayes is a probabilistic classifier that applies Bayes' theorem under the assumption that the features are conditionally independent given the class.
The main hyperparameter of `GaussianNB` is the variance smoothing parameter (`var_smoothing`), which is left at its default of 1e-9 in this case.

Naive Bayes is very efficient and works well on both small and large datasets, but its accuracy suffers when the independence assumption is strongly violated.

In [None]:
# Naive Bayes
(X_train, y_train), (X_test, y_test) = split_data()

nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Report:\n", classification_report(y_test, y_pred))

A C4.5-style decision tree is a recursive algorithm that splits the data on the most informative feature at each node. Note that scikit-learn's `DecisionTreeClassifier` implements an optimized version of CART; setting `criterion='entropy'` makes it choose entropy-based splits in the spirit of C4.5.
The main hyperparameters are `criterion`, the measure used to choose splits, and `max_depth` and `min_samples_split`, which constrain the shape of the tree.
This model is easy to visualize and understand and performs feature selection automatically, but it is prone to overfitting.
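
To show what "most informative" means under the entropy criterion, here is a minimal sketch of entropy and information gain for a single threshold split. The function names and the toy arrays are hypothetical and only illustrate the calculation.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting on `feature <= threshold`."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

# Toy data: the threshold 0.5 separates the two classes perfectly.
feature = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
labels = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(feature, labels, 0.5))
```

The tree evaluates candidate splits like this one and keeps the split with the largest gain, then recurses into each child.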

In [None]:
# C4.5 Decision Tree
(X_train, y_train), (X_test, y_test) = split_data()

dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Report:\n", classification_report(y_test, y_pred))

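A random forest is an ensemble of decision trees, each trained on a random bootstrap sample of the data and restricted to a random subset of features at each split; the predictions of the trees are then combined. This reduces the overfitting that a single decision tree is prone to.
The main hyperparameters are the same as for the decision tree, plus the number of trees (`n_estimators`).

This model is accurate and works very well with large datasets, but it can be computationally expensive.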
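
The bagging idea can be sketched by hand with plain decision trees, as below. This is an illustration on synthetic data, not how `RandomForestClassifier` is implemented internally (for instance, scikit-learn averages class probabilities rather than counting hard votes).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic, illustrative data: two classes, four features.
X = np.vstack([rng.normal(0.0, 1.0, (60, 4)), rng.normal(1.5, 1.0, (60, 4))])
y = np.array([0] * 60 + [1] * 60)

trees = []
for seed in range(25):
    # Bootstrap sample: draw n rows with replacement.
    idx = rng.integers(0, len(X), len(X))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote across trees (labels are 0/1, so the mean vote is the share voting 1).
votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("Training accuracy of the hand-rolled forest:", (forest_pred == y).mean())
```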

In [None]:
# Random forest
(X_train, y_train), (X_test, y_test) = split_data()


rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Report:\n", classification_report(y_test, y_pred))

Gradient boosting builds an ensemble of tree models sequentially, with each new tree fitted to correct the errors of the ensemble built so far.
The main hyperparameters are `learning_rate`, which scales the contribution of each tree, and `n_estimators`, which is the number of boosting stages to perform.

This model is very accurate and performed best in our experiment, but it can sometimes overfit.
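
One way to see the effect of `n_estimators` is to score the model after every boosting stage with `staged_predict`. The sketch below does this on synthetic data with an arbitrary labeling rule; the data and the chosen parameter values are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Synthetic, illustrative data with a nonlinear decision rule.
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

gb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100, random_state=0)
gb.fit(X_tr, y_tr)

# Test accuracy after 1, 2, ..., n_estimators boosting stages.
staged_acc = [accuracy_score(y_te, pred) for pred in gb.staged_predict(X_te)]
print(len(staged_acc), staged_acc[0], staged_acc[-1])
```

If the staged accuracy peaks early and then declines, that is a sign that later stages are overfitting and fewer stages or a smaller `learning_rate` may help.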

In [None]:
#  Gradient Boosting
(X_train, y_train), (X_test, y_test) = split_data()

gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Report:\n", classification_report(y_test, y_pred))

Neural networks are a state-of-the-art family of models that consist of layers of nodes and can capture complex patterns in the data.
The main hyperparameters are the number of hidden layers and the number of nodes in each layer.

This model is very accurate, but it is computationally expensive and can be hard to interpret.
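
To illustrate what "layers of nodes" computes, the sketch below fits a small `MLPClassifier` on synthetic data and repeats its forward pass by hand: a ReLU hidden layer (the scikit-learn default) followed by a single logistic output unit, which is what `MLPClassifier` uses for binary problems. The layer size and data are assumptions for the example.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Synthetic, illustrative data: two classes, four features.
X = np.vstack([rng.normal(0.0, 1.0, (50, 4)), rng.normal(1.5, 1.0, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

nn = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0).fit(X, y)

# Manual forward pass through the fitted weights.
W1, W2 = nn.coefs_
b1, b2 = nn.intercepts_
hidden = np.maximum(0.0, X @ W1 + b1)                   # ReLU hidden layer
logits = hidden @ W2 + b2                               # single output unit
manual_prob = (1.0 / (1.0 + np.exp(-logits))).ravel()   # logistic output

print(np.allclose(manual_prob, nn.predict_proba(X)[:, 1]))
```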

In [None]:
# NN
(X_train, y_train), (X_test, y_test) = split_data()
nn = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300)

nn.fit(X_train, y_train)
y_pred = nn.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Report:\n", classification_report(y_test, y_pred))

In [None]:
# Hyperparameter tuning on NN

(X_train, y_train), (X_test, y_test) = split_data()

nn = MLPClassifier()
hidden_layer_sizes = [(100,), (200,), (300,) , (400,), (500,)]
max_iter = [100, 200, 300, 400, 500]

# Display the performance
all_results = []

for hidden_layer_size in hidden_layer_sizes:
    for n_iter in max_iter:  # named n_iter to avoid shadowing the built-in iter()
        nn.set_params(hidden_layer_sizes=hidden_layer_size, max_iter=n_iter)
        nn.fit(X_train, y_train)
        y_pred = nn.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Hidden Layer Size: {hidden_layer_size}, Max Iter: {n_iter}, Accuracy: {accuracy}")
        all_results.append((hidden_layer_size, n_iter, accuracy))

# find best hyperparameters and the accuracy
best_hyperparameters = max(all_results, key=lambda x: x[2])
print(f"Best Hyperparameters: {best_hyperparameters[0]}, {best_hyperparameters[1]}, Accuracy: {best_hyperparameters[2]}")

### Neural Network Performance
| Hidden Layer Size | Max Iterations | Accuracy             |
|-------------------|----------------|----------------------|
| (100,)            | 100            | 0.8907               |
| (100,)            | 200            | 0.8957               |
| (100,)            | 300            | 0.8957               |
| (100,)            | 400            | 0.8932               |
| (100,)            | 500            | **0.8994**           |
| (200,)            | 100            | 0.8994               |
| (200,)            | 200            | 0.8919               |
| (200,)            | 300            | 0.8919               |
| (200,)            | 400            | 0.8981               |
| (200,)            | 500            | 0.8932               |
| (300,)            | 100            | 0.8981               |
| (300,)            | 200            | 0.8944               |
| (300,)            | 300            | 0.8981               |
| (300,)            | 400            | 0.8994               |
| (300,)            | 500            | 0.8957               |

### Best Hyperparameters
 **Hidden Layer Size:** (100,) **Max Iterations:** 500  **Accuracy:** 0.8994 (tied in the table above with (200,) at 100 iterations and (300,) at 400 iterations)

In [None]:
# Hyperparameter tuning on Random Forest

n_estimators = [100, 200, 300, 400, 500]
max_depth = [10, 50, 100, 200, 300]
results = []

for n_estimator in n_estimators:
    for depth in max_depth:
        rf = RandomForestClassifier()
        rf.set_params(n_estimators=n_estimator, max_depth=depth)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        print(f"n_estimators: {n_estimator}, max_depth: {depth}, Accuracy: {accuracy}")
        results.append((n_estimator, depth, accuracy))

best_hyperparameters = max(results, key=lambda x: x[2])
print(f"Best Hyperparameters: {best_hyperparameters[0]}, {best_hyperparameters[1]}, Accuracy: {best_hyperparameters[2]}")

### Random Forest Parameter Tuning Performance
| n_estimators | max_depth | Accuracy             |
|--------------|-----------|----------------------|
| 200          | 10        | 0.9081               |
| 200          | 50        | 0.9043               |
| 200          | 100       | 0.8957               |
| 200          | 200       | 0.9019               |
| 200          | 300       | 0.9056               |
| 300          | 10        | **0.9093**           |
| 300          | 50        | 0.9006               |
| 300          | 100       | 0.8981               |
| 300          | 200       | 0.8994               |
| 300          | 300       | 0.9019               |
| 400          | 10        | 0.9056               |
| 400          | 50        | 0.9019               |
| 400          | 100       | 0.9031               |
| 400          | 200       | 0.9019               |
| 400          | 300       | 0.9006               |

### Best Hyperparameters
**n_estimators:** 300 **max_depth:** 10 **Accuracy:** 0.9093
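
The loops above score each setting on the test set. As a hedged alternative, the same kind of search can be sketched with `GridSearchCV`, which picks hyperparameters by cross-validation on the training data and keeps the test set for a final check. The synthetic data and the small grid below are assumptions chosen to keep the example fast; they are not the settings used for the tables above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the notebook itself would use split_data().
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10]},
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation on the training data only
)
grid.fit(X_tr, y_tr)

print(grid.best_params_, grid.best_score_)
print("Held-out accuracy:", grid.score(X_te, y_te))
```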


## Conclusion
| Feature selection | KNN   | Naive Bayes | C4.5 Decision Tree | Random Forest | Gradient Boosting | Neural Network |
| ---               | ---   | ---         | ---                | ---           | ---               | ---            |
| No Selection      | 0.746 | 0.806       | 0.816              | 0.891         | 0.896             | 0.861          |
| ANOVA F-value     | 0.846 | 0.831       | 0.847              | 0.875         | 0.909             | 0.902          |


With feature selection, the accuracy of most models improves; Random Forest is the exception. The selected features are "T Stage", "N Stage", "6th Stage", "Grade", "A Stage", "Tumor Size", "Estrogen Status", "Progesterone Status", "Reginol Node Positive", and "Survival Months". The features deemed not useful are Age, Race, Marital Status, Differentiate, and Regional Node Examined.

The best model is Gradient Boosting, followed by the Neural Network, but most models exceed 80% accuracy.