<a href="https://colab.research.google.com/github/w4bo/teaching-handsondatapipelines/blob/main/materials/15-BreastCancer.solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The `BreastCancer` challenge

### Goal

It is your job to predict the `diagnosis` for each data item.

### Metric

Submissions are evaluated using the accuracy score. When splitting train and test datasets, the test dataset should contain 30% of the data.

### Requirements

You are allowed to use the `numpy`, `pandas`, `matplotlib`, `seaborn`, and `scikit-learn` Python libraries. You can import any model from `scikit-learn`.

In [None]:
# Import the libraries used for machine learning
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv), data manipulation as in SQL
import matplotlib.pyplot as plt # plotting
import seaborn as sns # statistical plots built on top of matplotlib
%matplotlib inline
from sklearn.model_selection import train_test_split # to split the data into training and testing sets
from sklearn.model_selection import KFold # used for cross-validation
from sklearn.model_selection import GridSearchCV # for hyperparameter tuning
from sklearn.ensemble import RandomForestClassifier # random forest classifier
from sklearn.neighbors import KNeighborsClassifier # k-nearest-neighbors classifier
from sklearn import metrics # to measure the error and accuracy of the model
import random
import os

# SEED all random generators
seed = 42
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)

# read the data
# df = pd.read_csv("datasets/breastcancer.csv")
df = pd.read_csv("https://raw.githubusercontent.com/w4bo/teaching-handsondatapipelines/main/materials/datasets/breastcancer.csv")

## Data understanding

Hints

- There are 569 observations with 30 features each
- Each observation is labelled with a `diagnosis`

Take a first glance at the dataset
- Do we consider all features?
- Are there null values?
- Which are the attribute types?
- Which are the attribute ranges?
- How many labels?
- Are classes unbalanced?
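
The questions above can be answered with a handful of pandas calls. The sketch below runs them on a small hand-made frame (`toy`, a hypothetical stand-in for `df`), so the printed numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`: two numeric features, a label, and an empty column
toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 11.42, 13.54, 12.45],
    "smoothness_mean": [0.118, 0.085, 0.143, 0.097, 0.128],
    "diagnosis": ["M", "M", "M", "B", "B"],
    "Unnamed: 32": [np.nan] * 5,
})

null_counts = toy.isnull().sum()                              # null values per column
dtypes = toy.dtypes                                           # attribute types
ranges = toy.describe().loc[["min", "max"]]                   # ranges of the numeric attributes
class_share = toy["diagnosis"].value_counts(normalize=True)   # share of each label

print(null_counts)
print(dtypes)
print(ranges)
print(class_share)
```

The same four expressions can be applied to `df` once it is loaded.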



In [None]:
df

In [None]:
df.info()

### Feature semantics

Hint:
- id of the observation
- diagnosis (M = malignant, B = benign)
- Ten real-valued features are computed for each cell nucleus:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

`*_mean`: the mean over all cells

`*_se`: the standard error over all cells

`*_worst`: the worst (largest) value, computed as the mean of the three largest values
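
To make the three suffixes concrete, here is a minimal sketch that derives them from per-nucleus measurements. The `radii` array is hypothetical, and the standard error is assumed to follow the usual definition (sample standard deviation divided by the square root of the number of cells).

```python
import numpy as np

# Hypothetical radius measurements for the nuclei found in one image
radii = np.array([10.0, 12.0, 11.0, 15.0, 14.0, 13.0])

radius_mean = radii.mean()
# standard error of the mean: sample standard deviation divided by sqrt(n)
radius_se = radii.std(ddof=1) / np.sqrt(len(radii))
# "worst": mean of the three largest values
radius_worst = np.sort(radii)[-3:].mean()

print(radius_mean, radius_se, radius_worst)
```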


In [None]:
df.describe()

In [None]:
df['diagnosis'].value_counts()

In [None]:
df['diagnosis'].value_counts().plot(kind="bar", label="Count", legend=True)

In [None]:
sns.countplot(data=df, x='diagnosis', hue='diagnosis')

### Summing up

| Question | Answer | Do we need action? |
| -        | -      | - |
| Are there null values? | Yes, but only in `Unnamed: 32` (all of its values are null) | Drop the column, no imputation needed |
| Which are the attribute types? | All attributes are numeric but `diagnosis` | Encode diagnosis |
| Which are the attribute ranges? | Attribute ranges differ by orders of magnitude (e.g., `area_mean` vs `smoothness_mean`) | We could apply normalization |
| How many labels? | 2 | - |
| Are classes unbalanced? | Mildly: about 63% benign vs 37% malignant | No rebalancing needed |

## Data preprocessing

Drop the unnecessary columns

In [None]:
# `Unnamed: 32` contains only null values and `id` carries no predictive information. Drop both columns
df.drop(["id", "Unnamed: 32"], axis=1, inplace=True)

In [None]:
# map the diagnosis column to numeric
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

### Data visualization

- Check the attribute's distribution
- Check the relationships between attributes (e.g., the correlation). Should we keep all attributes?

For now, let's just focus on `*_mean` attributes

In [None]:
# Data can be divided into three parts (i.e., families of columns)
features_mean = list(df.columns[1:11]) + ["diagnosis"]
features_se = list(df.columns[11:21]) + ["diagnosis"]
features_worst = list(df.columns[21:31]) + ["diagnosis"]
print("features_mean: " + str(features_mean))
print("features_se: " + str(features_se))
print("features_worst: " + str(features_worst))

In [None]:
df[features_mean].hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
g = sns.pairplot(df[features_mean], hue='diagnosis', markers='+')
plt.show()

In [None]:
# df[features_se].hist(bins=50, figsize=(20,15))
# plt.show()

In [None]:
# g = sns.pairplot(df[features_se], hue='diagnosis', markers='+')
# plt.show()

In [None]:
#  df[features_worst].hist(bins=50, figsize=(20,15))
# plt.show()

In [None]:
# g = sns.pairplot(df[features_worst], hue='diagnosis', markers='+')
# plt.show()

In [None]:
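# keep only the `*_mean` attributes (and the label) from now on
df = df[features_mean]
from scipy.stats import pearsonr
rho = df.corr(method='pearson')
# p-value of each pairwise correlation; `corr` puts 1 on the diagonal for a custom method, so subtract the identity to get 0 there
pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*rho.shape)
# one star per significance threshold met: 0.1 (*), 0.05 (**), 0.01 (***)
p = pval.apply(lambda col: col.map(lambda x: ''.join(['*' for t in [0.01, 0.05, 0.1] if x <= t])))
rho.round(2).astype(str) + p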

In [None]:
min_corr = 0.3
# show only correlations with |rho| >= min_corr and hide the diagonal (rho == 1)
kot = rho[(abs(rho) >= min_corr) & (rho < 1)]
plt.figure(figsize=(14, 14))
sns.heatmap(kot, cmap=sns.color_palette("coolwarm", as_cmap=True), annot=True, fmt='.2f', annot_kws={'size': 15})

### Should we drop some attributes?

- `radius_mean`, `perimeter_mean`, and `area_mean` are highly correlated, keep `perimeter_mean`
- `compactness_mean`, `concavity_mean`, and `concave points_mean` are highly correlated, keep `compactness_mean`
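
The manual selection above can also be approximated programmatically. The sketch below is one possible approach, not the notebook's method: `correlated_columns` is a hypothetical helper, the `0.9` threshold is an arbitrary assumption, and the `toy` frame is synthetic. It always keeps the first column of each correlated group, so the surviving attribute may differ from a hand-picked one.

```python
import numpy as np
import pandas as pd

# Synthetic data: perimeter and area are (noisy) functions of radius, texture is independent
rng = np.random.default_rng(42)
radius = rng.uniform(5, 25, size=200)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0, 1, size=200),
    "area": np.pi * radius ** 2 + rng.normal(0, 10, size=200),
    "texture": rng.uniform(10, 40, size=200),
})

def correlated_columns(frame, threshold=0.9):
    """Return the columns whose |Pearson r| with an earlier column exceeds `threshold`."""
    corr = frame.corr().abs()
    # keep only the strict upper triangle so that each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

to_drop = correlated_columns(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop)
print(list(reduced.columns))
```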

In [None]:
# now these are the variables that we will use for prediction
prediction_var = ['texture_mean', 'perimeter_mean', 'smoothness_mean', 'compactness_mean', 'symmetry_mean']

## Modeling with scikit-learn

Preparing the dataset for the ML pipeline.
- X: the dataset
- y: the labels

In [None]:
def set_dataset(feature_list, normalize=False):
    X = df[[x for x in feature_list if x != "diagnosis"]]
    y = df['diagnosis']

    if normalize:
        # z-score normalization: zero mean and unit standard deviation per column
        X = (X - X.mean()) / X.std()
    
    # print(X.head())
    print(X.shape)
    # print(y.head())
    print(y.shape)
    return X, y

X, y = set_dataset(prediction_var, True)

In [None]:
X

In [None]:
y

## Split the dataset into a training set and a testing set

### Advantages
- By splitting the dataset pseudo-randomly into two separate sets, we can train using one set and test using the other.
- This ensures that we won't use the same observations in both sets.
- More flexible and faster than creating a model using all of the dataset for training.

### Disadvantages
- The accuracy scores for the testing set can vary depending on what observations are in the set. 
- This disadvantage can be countered using k-fold cross-validation.

### Notes
- The accuracy score of the models depends on the observations in the testing set, which is determined by the seed of the pseudo-random number generator (random_state parameter).
- As a model's complexity increases, the training accuracy (accuracy you get when you train and test the model on the same data) increases.
- If a model is too complex (overfitting) or not complex enough (underfitting), the testing accuracy is lower.
- For KNN models, the value of k determines the level of complexity. A lower value of k means that the model is more complex.
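
Both effects described in these notes can be observed directly. The sketch below is illustrative and makes a few assumptions: it uses scikit-learn's bundled copy of the Wisconsin breast cancer data (`load_breast_cancer`) as a stand-in for the CSV, works on unscaled features, and the seeds and `k` values are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled dataset stands in for the CSV used in this notebook
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# 1) same model, different splits: the test accuracy changes with the seed
split_scores = {}
for state in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=state, stratify=y_demo)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    split_scores[state] = knn.score(X_te, y_te)
print(split_scores)

# 2) same split, different k: training accuracy vs testing accuracy
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)
complexity_scores = {}
for k in (1, 5, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    complexity_scores[k] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))
print(complexity_scores)
```

Each entry of `complexity_scores` is a `(training accuracy, testing accuracy)` pair, which makes the gap between the two visible as `k` shrinks.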

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
result = pca.fit_transform(X_train)

plt.scatter(
    x=result[:,0], 
    y=result[:,1] , 
    c=y_train,
    cmap='viridis'
)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(projection='3d')

pca = PCA(n_components=3)
result = pca.fit_transform(X_train)

ax.scatter(
    xs=result[:,0], 
    ys=result[:,1], 
    zs=result[:,2], 
    c=y_train,
    cmap='viridis'
)

In [None]:
from sklearn.ensemble import IsolationForest
clf = IsolationForest()
clf.fit(X_train)
is_outlier = clf.predict(X_train)

plt.scatter(
    x=result[:,0], 
    y=result[:,1], 
    s=[10 if x > 0 else 20 for x in is_outlier],
    c=is_outlier,
    cmap='viridis'
)

In [None]:
# all parameters not specified are set to their defaults
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train, y_train)
y_pred = logisticRegr.predict(X_test)
metrics.accuracy_score(y_test, y_pred)

Fit your model and try it with several parameters

In [None]:
def fit_knn(X_train, y_train, X_test, y_test):
    # experimenting with different k values
    k_range = list(range(1, 30))
    scores = []
    for k in k_range:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        y_pred = knn.predict(X_test)
        scores.append(metrics.accuracy_score(y_test, y_pred))

    plt.plot(k_range, scores)
    plt.xticks(k_range)
    plt.xlabel('Value of k for KNN')
    plt.ylabel('Accuracy Score')
    plt.title('k-Nearest-Neighbors')
    plt.show()
    return y_pred # predictions of the last model only (largest k)

fit_knn(X_train, y_train, X_test, y_test)

What if we evaluate the model on the same data it was trained on?

In [None]:
fit_knn(X_train, y_train, X_train, y_train)

What if I choose a more complex model?

In [None]:
def fit_forest(X_train, y_train, X_test, y_test):
    model = RandomForestClassifier(n_estimators=100) # a simple random forest model
    model.fit(X_train, y_train) # fit the model on the training data
    y_pred = model.predict(X_test) # predicted `diagnosis` for the test inputs
    # accuracy: fraction of test observations whose predicted label matches the true one
    print("Accuracy: " + str(metrics.accuracy_score(y_test, y_pred)))
    # a random forest also estimates the importance of each feature it was trained on
    featimp = pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)
    print("\nFeatures sorted by descending importance:")
    print(featimp)

fit_forest(X_train, y_train, X_test, y_test)

Now let's do this for all the `features_mean` attributes, so that the random forest can tell us which features are important

In [None]:
X, y = set_dataset(features_mean, True) # taking all features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
fit_forest(X_train, y_train, X_test, y_test)

What if we use cross-validation?

https://scikit-learn.org/stable/modules/cross_validation.html
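
When features are normalized on the whole dataset before cross-validation, statistics from the held-out fold leak into training. One way to avoid this is to place the scaler inside a pipeline, so that it is re-fitted on the training folds of each split. The sketch below shows the idea under some assumptions: it uses scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV, and `StandardScaler` plus a shuffled `StratifiedKFold` are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for the CSV used in this notebook
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# the scaler is fitted on the training folds only, then applied to the held-out fold
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=folds, scoring="accuracy")

print(fold_scores)
print("%0.2f accuracy with a standard deviation of %0.2f" % (fold_scores.mean(), fold_scores.std()))
```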

In [None]:
from sklearn.model_selection import cross_val_score

def cv(model, X, y):
    scores = cross_val_score(model, X, y, cv=5)
    print("Scores: " + str(scores))
    print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

model = RandomForestClassifier(n_estimators=100)
cv(model, X, y)

In [None]:
model = KNeighborsClassifier(n_neighbors=10)
cv(model, X, y)

Grid search

In [None]:
# helper that runs a grid search with 5-fold cross-validation and prints the best configuration
def gridsearch_cv(model, param_grid, X_train, y_train):
    clf = GridSearchCV(model, param_grid, cv=5, scoring="accuracy", verbose=1, n_jobs=2)
    clf.fit(X_train, y_train)
    print("The best parameters are:")
    print(clf.best_params_)
    print("The best estimator is " + str(clf.best_estimator_))
    print("The best score is " + str(clf.best_score_))

In [None]:
model = KNeighborsClassifier()

k_range = list(range(1, 30, 3))
leaf_size = list(range(1, 30, 3))
# note: `leaf_size` only affects the speed and memory of the neighbor search, not the predictions
param_grid = {'n_neighbors': k_range, 'leaf_size': leaf_size} #, 'weights': ['uniform', 'distance']}

gridsearch_cv(model, param_grid, X, y)

In [None]:
model = RandomForestClassifier()

estimator_range = [10, 50, 100]
param_grid = {'n_estimators': estimator_range}

gridsearch_cv(model, param_grid, X, y)
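
The best cross-validated score of a grid search tends to be optimistic, because it is the maximum over all tried configurations. A common remedy is to tune on the training set only and report the accuracy on a test set that was never used during tuning. The following is a minimal sketch of that workflow, not a definitive solution: it assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV, and the step names `scale` and `knn` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for the CSV used in this notebook
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
# parameters of a pipeline step are addressed as <step name>__<parameter>
grid = {"knn__n_neighbors": list(range(1, 30, 3)), "knn__weights": ["uniform", "distance"]}

search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)  # the test set is never seen during tuning

# accuracy of the best estimator (refitted on the whole training set) on the held-out test set
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, search.best_score_, test_accuracy)
```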