# **Predict Diagnosis**

## Objectives

* Train an ML pipeline using hyperparameter optimization.
* Use the best features to predict cancer diagnosis.

## Tasks
* Load the data.
* Create the ML classification pipelines.
* Split the Train and Test sets.
* Run Grid Search CV (Sklearn).

## Inputs

* outputs/datasets/collection/breast-cancer.csv
* Instructions on which variables to use for data cleaning and feature engineering. They are found in each respective notebook.

## Outputs

* Train set (features and target)
* Test set (features and target)
* Data cleaning and Feature Engineering pipeline
* Modeling pipeline
* Feature importance plot

## Additional Comments

* This notebook was written based on the guidelines provided in the walkthrough project 2: 'Churnometer'.
* This notebook relates to the Modelling and Evaluation steps of the CRISP-DM methodology.
* This notebook and the ones that follow represent the learning outcomes of the Code Institute Predictive Analytics and Machine Learning module.


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
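
As an alternative sketch, the same parent-directory lookup can be written with `pathlib`. The helper name `parent_directory` is hypothetical and not part of the original notebook, and the `os.chdir` call is left commented out so the example has no side effects.

```python
from pathlib import Path


def parent_directory(path):
    """Return the parent folder of `path` (hypothetical helper for illustration)."""
    return Path(path).resolve().parent


current = Path.cwd()
target = parent_directory(current)
print(f"Current directory: {current}")
print(f"Parent directory:  {target}")

# To actually switch, as the cells above do:
# import os
# os.chdir(target)
```

Note that running the change-directory cell more than once moves one level further up each time, so it is worth confirming the result before loading any files.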

# Step 1: Load Data

In [None]:
import numpy as np
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/breast-cancer.csv")
    .drop(labels=['id'], axis=1)
)

print(df.shape)
df.head(3)

# Step 2: ML Pipeline with all data

## ML pipeline for Data Cleaning and Feature Engineering

In [None]:

%matplotlib inline
# This line is used to display plots inline in Jupyter notebooks

In [None]:
power_vars = ['concavity_mean', 'concave points_mean', 'concavity_worst']
power_vars

In [None]:
yj_vars = df.drop(columns=['diagnosis'] + power_vars).columns.tolist()
yj_vars

In [None]:
from sklearn.pipeline import Pipeline

# Feature Engineering
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine import transformation as vt

def PipelineDataCleaningAndFeatureEngineering():
    pipeline_base = Pipeline([
        ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(variables=yj_vars)),
        ("PowerTransformation", vt.PowerTransformer(variables=power_vars)),
        ("SmartCorrelatedSelection", SmartCorrelatedSelection(
            variables=None, method="spearman", threshold=0.8,
            selection_method="variance")),
    ])

    return pipeline_base

PipelineDataCleaningAndFeatureEngineering()
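
To make the first pipeline step more concrete, here is a minimal sketch of the Yeo-Johnson formula for a single value. It is not the feature_engine implementation: the transformer estimates a separate lambda for each variable when it is fitted, whereas the lambda of 0.5 below is an arbitrary value chosen for illustration, and the function name `yeo_johnson` is our own.

```python
import math


def yeo_johnson(y, lmbda):
    """Apply the Yeo-Johnson transform to one value for a given lambda."""
    if y >= 0:
        if abs(lmbda) < 1e-12:          # lambda == 0
            return math.log1p(y)
        return ((y + 1) ** lmbda - 1) / lmbda
    if abs(lmbda - 2) < 1e-12:          # lambda == 2
        return -math.log1p(-y)
    return -(((-y + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)


# Illustrative values only; lambda = 0.5 is an assumption, not a fitted value
values = [0.0, 1.0, 3.0, 8.0]
transformed = [yeo_johnson(v, 0.5) for v in values]
print([round(t, 4) for t in transformed])
```

With a lambda below 1 the larger values are compressed more than the smaller ones, which is why the transform helps with right-skewed variables. A lambda of 1 leaves the values unchanged.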

## ML Pipeline for Modelling

In [None]:
# Feat Scaling
from sklearn.preprocessing import StandardScaler

# Feat Selection
from sklearn.feature_selection import SelectFromModel

# ML algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier


def PipelineClf(model):
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("feat_selection", SelectFromModel(model)),
        ("model", model),
    ])

    return pipeline_base
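
The `feat_selection` step keeps only the features whose importance reaches a threshold. The sketch below mimics that idea in plain Python, assuming the threshold is the mean importance (to our knowledge the default when no threshold is passed and the estimator has no L1 penalty). The feature names and importance values are made up for illustration and do not come from this dataset.

```python
# Hypothetical importances, invented for illustration only
importances = {
    "feature_a": 0.40,
    "feature_b": 0.35,
    "feature_c": 0.15,
    "feature_d": 0.10,
}


def select_by_mean_importance(scores):
    """Keep the features whose importance is at least the mean importance."""
    threshold = sum(scores.values()) / len(scores)
    return [name for name, score in scores.items() if score >= threshold]


selected = select_by_mean_importance(importances)
print(selected)
```

Because the same `model` object is passed to both `SelectFromModel` and the final step, the algorithm that ranks the features is also the one trained on the selected subset.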

## Hyperparameter Optimisation

* **Custom Class for Hyperparameter Optimisation**

In [None]:
from sklearn.model_selection import GridSearchCV


class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")

            model = PipelineClf(self.models[key])
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                            verbose=verbose, scoring=scoring, )
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score',
                'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
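
For each hyperparameter combination, `score_summary` collapses the per-fold test scores into four numbers. The sketch below reproduces that aggregation with the standard library for a single combination, using made-up scores for `cv=5`.

```python
import statistics

# Made-up per-fold scores for one hyperparameter combination (cv=5)
split_scores = [0.90, 0.95, 1.00, 0.85, 0.90]

summary = {
    "min_score": min(split_scores),
    "mean_score": statistics.fmean(split_scores),
    "max_score": max(split_scores),
    # np.std defaults to the population standard deviation (ddof=0),
    # which corresponds to statistics.pstdev
    "std_score": statistics.pstdev(split_scores),
}

for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Sorting by `mean_score` ranks combinations by average performance, while `min_score` and `std_score` show how stable a combination is across folds.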

## Split Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['diagnosis'], axis=1),
    df['diagnosis'],
    test_size=0.2,
    random_state=0,
)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

## Handle Target Imbalance

In [None]:
pipeline_data_cleaning_feat_eng = PipelineDataCleaningAndFeatureEngineering()
X_train = pipeline_data_cleaning_feat_eng.fit_transform(X_train)
X_test = pipeline_data_cleaning_feat_eng.transform(X_test)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

* **Check Train Set Target distribution**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

* **Use SMOTE (Synthetic Minority Oversampling TEchnique) to balance Train Set target**

In [None]:
from imblearn.over_sampling import SMOTE
oversample = SMOTE(sampling_strategy='minority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
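
The core idea behind SMOTE is to create a synthetic minority sample on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step. It is a simplified illustration rather than the imblearn implementation: the neighbour search is omitted, and the two points are made up.

```python
import random


def synthetic_sample(point, neighbour, rng):
    """Interpolate a new sample between `point` and `neighbour`."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]


rng = random.Random(0)              # fixed seed so the sketch is repeatable
minority_point = [1.0, 2.0]         # made-up minority sample
minority_neighbour = [3.0, 6.0]     # made-up neighbour from the same class

new_point = synthetic_sample(minority_point, minority_neighbour, rng)
print(new_point)
```

Only the Train Set is resampled. The Test Set keeps its original class distribution so that the evaluation reflects the data the model would see in practice.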

* **Check Train Set Target distribution after resampling**

In [None]:
import matplotlib.pyplot as plt
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

## Grid Search CV - Sklearn

### Use standard hyperparameters to find most suitable algorithm

In [None]:
models_quick_search = {
    "LogisticRegression": LogisticRegression(random_state=0),
    "XGBClassifier": XGBClassifier(random_state=0),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
}

params_quick_search = {
    "LogisticRegression": {},
    "XGBClassifier": {},
    "DecisionTreeClassifier": {},
    "RandomForestClassifier": {},
    "GradientBoostingClassifier": {},
    "ExtraTreesClassifier": {},
    "AdaBoostClassifier": {},
}

* **Quick GridSearch CV - Binary Classifier**

In [None]:
from sklearn.metrics import make_scorer, recall_score
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train,
        scoring =  make_scorer(recall_score, pos_label=1),
        n_jobs=-1, cv=5)

* **Check results**

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary 

* **We do an extensive search on the most suitable algorithm to find the best hyperparameter configuration**

In [None]:
# Define model and parameters, for Extensive Search
models_search = {
    "AdaBoostClassifier":AdaBoostClassifier(random_state=0),
}
params_search = {
    "AdaBoostClassifier":{
        'model__learning_rate': [0.1, 0.5, 1.0, 1.1],
        'model__n_estimators': [50, 100, 200],
    }
}
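
To get a feel for the cost of this search, the sketch below enumerates the grid defined above with `itertools.product`. The `model__` prefix routes each parameter to the pipeline step named `model`. The value `cv_folds = 5` is an assumption that matches the `cv=5` used in the search cells.

```python
from itertools import product

params = {
    "model__learning_rate": [0.1, 0.5, 1.0, 1.1],
    "model__n_estimators": [50, 100, 200],
}
cv_folds = 5  # assumed to match cv=5 in the search call

# Every combination of one value per hyperparameter
combinations = [dict(zip(params, values)) for values in product(*params.values())]
n_fits = len(combinations) * cv_folds

print(f"{len(combinations)} combinations, {n_fits} cross-validation fits")
```

Each extra value added to a list multiplies the number of fits, so it pays to keep the grid small and widen it only around promising values.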

* Extensive GridSearch CV - Binary Classifier

In [None]:
from sklearn.metrics import recall_score, make_scorer
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
        scoring =  make_scorer(recall_score, pos_label=1),
        n_jobs=-1, cv=5)

* Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

* Get best model name programmatically

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

* Parameters for best model

In [None]:
best_parameters = grid_search_pipelines[best_model].best_params_
best_parameters

* Define the best clf pipeline

In [None]:
pipeline_clf = grid_search_pipelines[best_model].best_estimator_
pipeline_clf

## Assess feature importance

In [None]:
X_train.head(3)

* With the current model, we can assess feature importance with .feature_importances_

In [None]:
# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Feature': X_train.columns[pipeline_clf['feat_selection'].get_support()],
    'Importance': pipeline_clf['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)

# re-assign best_features order
best_features = df_feature_importance['Feature'].to_list()

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
    f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

* Evaluate Pipeline on Train and Test Sets

In [None]:
from sklearn.metrics import classification_report, confusion_matrix


def confusion_matrix_and_report(X, y, pipeline, label_map):

    prediction = pipeline.predict(X)

    print(pd.DataFrame(
        confusion_matrix(y_true=y, y_pred=prediction),
        columns=[["Predicted " + sub for sub in label_map]],
        index=[["Actual " + sub for sub in label_map]]
    ))
    print("\n")
    
    print('---  Classification Report  ---')
    print(classification_report(y, prediction, target_names=label_map), "\n")


def clf_performance(X_train, y_train, X_test, y_test, pipeline, label_map):
    print("#### Train Set #### \n")
    confusion_matrix_and_report(X_train, y_train, pipeline, label_map)

    print("#### Test Set ####\n")
    confusion_matrix_and_report(X_test, y_test, pipeline, label_map)

Evaluation: we cross-check the results against the metrics defined in the ML business case.

* 90% Recall for Malignant, on train and test set
* 90% Precision for Benign on train and test set
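
Both targets can be read directly from a confusion matrix. The sketch below shows the arithmetic with hypothetical counts that are made up for illustration and are not results from this project.

```python
# Hypothetical confusion counts, made up purely for illustration.
# Keys are (actual label, predicted label).
counts = {
    ("Benign", "Benign"): 70,
    ("Benign", "Malignant"): 2,
    ("Malignant", "Benign"): 3,
    ("Malignant", "Malignant"): 39,
}


def recall(label):
    """Share of actual `label` cases that were predicted as `label`."""
    actual_total = sum(n for (actual, _), n in counts.items() if actual == label)
    return counts[(label, label)] / actual_total


def precision(label):
    """Share of `label` predictions that were actually `label`."""
    predicted_total = sum(n for (_, pred), n in counts.items() if pred == label)
    return counts[(label, label)] / predicted_total


malignant_recall = recall("Malignant")    # 39 / (39 + 3)
benign_precision = precision("Benign")    # 70 / (70 + 3)

print(f"Recall (Malignant): {malignant_recall:.3f}")
print(f"Precision (Benign): {benign_precision:.3f}")
print("Meets 90% targets:", malignant_recall >= 0.90 and benign_precision >= 0.90)
```

Both metrics are lowered by the same kind of error, a malignant case predicted as benign, which is the error the business case is most concerned with.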

In [None]:
clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline_clf,
                label_map= ['Benign', 'Malignant'] 
                )

# Step 3: Refit pipeline with best features

## Refit ML Pipeline and Resampling

* In theory, a pipeline fitted using only the most important features should give the same result as the one fitted with all variables and feature selection.

* However, this project also includes a resampling step, SMOTE(), which balances the Train Set target, so we refit the pipeline and redo the resampling.

### Rewrite ML pipeline for Data Cleaning and Feature Engineering

In [None]:
best_features

* New Pipeline for DataCleaning And FeatureEngineering

In [None]:
yj_vars_new = ['area_mean','smoothness_worst','perimeter_se','texture_worst','symmetry_mean']
power_vars_new = ['concavity_worst']

yj_vars_new, power_vars_new

In [None]:
def PipelineDataCleaningAndFeatureEngineering():
    pipeline_base = Pipeline([
        ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(variables=yj_vars_new)),
        ("PowerTransformation", vt.PowerTransformer(variables=power_vars_new)),

    ])

    return pipeline_base

PipelineDataCleaningAndFeatureEngineering()

---

# Load and Inspect Kaggle Data

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/breast-cancer.csv")
df.head(10)

In [None]:
df.info()

**Abbreviations explained:**

`id`
Unique ID


`diagnosis`
Target: M - Malignant, B - Benign


`radius_mean`
Mean Radius of Lobes


`texture_mean`
Mean of Surface Texture


`perimeter_mean`
Mean Outer Perimeter of Lobes


`area_mean`
Mean Area of Lobes


`smoothness_mean`
Mean of Smoothness Levels


`compactness_mean`
Mean of Compactness


`concavity_mean`
Mean of Concavity


`concave points_mean`
Mean of Concave Points


`symmetry_mean`
Mean of Symmetry


`fractal_dimension_mean`
Mean of Fractal Dimension


`radius_se`
SE of Radius


`texture_se`
SE of Texture


`perimeter_se`
SE of Perimeter


`area_se`
SE of Area


`smoothness_se`
SE of Smoothness


`compactness_se`
SE of Compactness


`concavity_se`
SE of Concavity


`concave points_se`
SE of Concave Points


`symmetry_se`
SE of Symmetry


`fractal_dimension_se`
SE of Fractal Dimension


`radius_worst`
Worst Radius


`texture_worst`
Worst Texture


`perimeter_worst`
Worst Perimeter


`area_worst`
Worst Area


`smoothness_worst`
Worst Smoothness


`compactness_worst`
Worst Compactness


`concavity_worst`
Worst Concavity


`concave points_worst`
Worst Concave Points


`symmetry_worst`
Worst Symmetry


`fractal_dimension_worst`
Worst Fractal Dimension
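
The 30 feature names above follow one pattern: ten measurements, each reported as a mean, a standard error (SE) and a "worst" value. As a small sketch, the full column list can be rebuilt from those two parts. The variable names are our own.

```python
measurements = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
suffixes = ["mean", "se", "worst"]

# Same order as the list above: all means, then all SEs, then all worst values
feature_columns = [f"{m}_{s}" for s in suffixes for m in measurements]
all_columns = ["id", "diagnosis"] + feature_columns

print(len(feature_columns), "features,", len(all_columns), "columns in total")
```

Note that the concave points columns contain a space in their names, so they must be quoted exactly, for example `df['concave points_mean']`.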

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os

# Create the outputs/datasets/collection folder;
# exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/breast-cancer.csv", index=False)
