# Use of Machine Learning in Diagnosing Breast Cancer
# [Part 2] Hyperparameter optimization using Grid Search

When you're choosing a model for machine learning, it's hard to know which model, and which hyperparameter settings, will work best on your data without trying them. Several methods exist for finding the best hyperparameters; here we use **GridSearch**, which scikit-learn provides as `GridSearchCV`.

Hopefully, it will improve on the results from Part 1.

## Import libraries

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

## Import the data from the previous part

In [2]:
df = pd.read_csv('Processed_breast_cancer_features.csv')

### From Part 1, let's list the features we obtained from the correlation analysis

In [3]:
prediction_var = ['radius_mean',
 'perimeter_mean',
 'area_mean',
 'compactness_mean',
 'concavity_mean',
 'concave points_mean',
 'radius_se',
 'perimeter_se',
 'area_se',
 'concave points_se',
 'radius_worst',
 'perimeter_worst',
 'area_worst',
 'concavity_worst',
 'concave points_worst',
 'fractal_dimension_worst']

In [4]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.15, random_state=1)
# random_state=1 seeds the random number generator, so the same split is produced on every run

train_X = train[prediction_var]
train_y = train['diagnosis']
test_X = test[prediction_var]
test_y = test['diagnosis']
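
To illustrate what `random_state` does in the split above, here is a minimal sketch on toy data. The arrays `X` and `y` are hypothetical stand-ins, not the Part 1 features; only the `train_test_split` call mirrors the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real feature table (hypothetical values).
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Same seed, same split: both calls pick identical test rows.
_, test_a, _, _ = train_test_split(X, y, test_size=0.15, random_state=1)
_, test_b, _, _ = train_test_split(X, y, test_size=0.15, random_state=1)
same_seed_equal = np.array_equal(test_a, test_b)

# Without a seed, the split is drawn afresh on every call and usually differs.
_, test_c, _, _ = train_test_split(X, y, test_size=0.15)

print('same seed gives same test rows:', same_seed_equal)
```

Fixing the seed matters here because the scores reported later are only comparable with Part 1 if both parts evaluate on the same test rows.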

In [5]:
from sklearn.model_selection import GridSearchCV

### Let's use the RandomForestClassifier as our example model

In [6]:
model = RandomForestClassifier()

In [7]:
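# Candidate values to try for each hyperparameter (4 x 4 = 16 combinations)
parameters = {'max_depth': (1,2,3,4), 'n_estimators': (10,50,100,500)}

# Wrap the model in a grid search that scores every combination by cross-validation
best_model = GridSearchCV(model, parameters)
# Uncomment to list the names of all hyperparameters that can be tuned:
#best_model.estimator.get_params().keys()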
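
It can help to see what the grid search is about to try. The sketch below uses scikit-learn's `ParameterGrid` to enumerate the same grid; the fold count of 5 assumes the `GridSearchCV` default `cv` (scikit-learn 0.22 and later), since the cell above does not set it.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'max_depth': (1, 2, 3, 4), 'n_estimators': (10, 50, 100, 500)}

# Every combination of the candidate values, as a list of dicts.
grid = list(ParameterGrid(parameters))
n_candidates = len(grid)

# Assumption: default 5-fold cross-validation, one fit per candidate per fold.
cv_folds = 5
n_cv_fits = n_candidates * cv_folds

print('candidates:', n_candidates)
print('cross-validation fits:', n_cv_fits)
print('first candidate:', grid[0])
```

With the default `refit=True`, one more fit follows on the whole training set using the winning combination, which is why `best_model.predict` works directly afterwards.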

In [8]:
best_model.fit(train_X, train_y)

GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'max_depth': (1, 2, 3, 4),
                         'n_estimators': (10, 50, 100, 500)})

In [9]:
best_model.best_params_

{'max_depth': 4, 'n_estimators': 100}
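
Besides `best_params_`, a fitted `GridSearchCV` exposes `best_score_` and `cv_results_`, which show how close the other candidates came. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled copy of the Wisconsin breast cancer data instead of the Part 1 CSV, a smaller grid to keep it fast, and a fixed `random_state` on the forest. Without a seed, as in the cell above, the selected parameters can change from run to run.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumption: the bundled dataset stands in for the Part 1 CSV.
X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(
    RandomForestClassifier(random_state=1),  # seeded so reruns agree
    {'max_depth': (2, 4), 'n_estimators': (10, 50)},
)
search.fit(X, y)

# One row per candidate, best first.
columns = ['param_max_depth', 'param_n_estimators',
           'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')

print(results)
print(search.best_params_, round(search.best_score_, 3))
```

If the top few rows differ by less than their `std_test_score`, the choice between them is mostly noise, and the simpler or cheaper candidate is a reasonable pick.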

In [10]:
prediction = best_model.predict(test_X)

In [11]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score

In [12]:
confusion_matrix(test_y, prediction)

array([[51,  1],
       [ 5, 29]])

In [13]:
precision = precision_score(test_y, prediction)
print('The precision score is %.2f' % precision)
recall = recall_score(test_y, prediction)
print('The recall score is %.2f' % recall)
accuracy = accuracy_score(test_y, prediction)
print('The accuracy score is %.2f' % accuracy)

The precision score is 0.97
The recall score is 0.85
The accuracy score is 0.93
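
The three scores can be recomputed by hand from the confusion matrix shown earlier, which makes clear what each one measures. This sketch assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and that label 1, the second row and column, is the positive class.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[51, 1],
               [5, 29]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # of the predicted positives, how many were right
recall = tp / (tp + fn)          # of the actual positives, how many were found
accuracy = (tp + tn) / cm.sum()  # share of all predictions that were right

print('The precision score is %.2f' % precision)
print('The recall score is %.2f' % recall)
print('The accuracy score is %.2f' % accuracy)
```

The 5 false negatives are what pull recall down to 0.85, while the single false positive keeps precision at 0.97.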


Comparing the model with default parameters against the model tuned with grid search, the scores are the same.
This suggests that the default parameters already suit this dataset well, at least within the range of values we searched.

`Note:` Hyperparameter optimization should be guided by the goal of the ML task, because the metric being optimized determines which model is selected. Keep in mind that there is often a trade-off between the precision score and the recall score.
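
One way to tie the search to a goal is the `scoring` parameter of `GridSearchCV`, which replaces the default accuracy with another metric. The sketch below optimizes for recall. It is an illustration under assumptions: it uses scikit-learn's bundled copy of the Wisconsin data rather than the Part 1 CSV, recodes the target so that 1 means malignant, and uses a reduced grid with a seeded forest.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
# The bundled data codes malignant as 0; flip it so malignant is the positive class.
y = (y == 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=1)

# scoring='recall' ranks candidates by cross-validated recall instead of accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    {'max_depth': (2, 4), 'n_estimators': (10, 50)},
    scoring='recall',
)
search.fit(X_train, y_train)

pred = search.predict(X_test)
test_recall = recall_score(y_test, pred)
test_precision = precision_score(y_test, pred)

print(search.best_params_)
print('recall %.2f, precision %.2f' % (test_recall, test_precision))
```

Swapping in `scoring='precision'` or `scoring='f1'` changes the selection criterion without touching the rest of the code.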

For example, for our dataset, we can consider achieving a high recall to be more important than a high precision, because we would like to detect as many cancers as possible. In a practical hospital setting, however, we also don't want to subject patients to surgeries and medications when there is a chance that the result is just a false positive. On the other hand, if the doctor deems that even false-positive cases may indicate some other underlying condition, then the precision score becomes equally important, and one would aim for a good F1-score, which balances precision and recall.
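
As a quick check of what the F1-score would be for the model above, the sketch below rebuilds label vectors that reproduce the confusion matrix `[[51, 1], [5, 29]]` and passes them to `f1_score`. The vectors are reconstructed for illustration and are not the actual test rows; label 1 is assumed to be the positive class.

```python
from sklearn.metrics import f1_score

# 52 true negatives-class rows followed by 34 positive-class rows.
y_true = [0] * 52 + [1] * 34
# Predictions arranged to give 51 TN, 1 FP, 5 FN, 29 TP.
y_pred = [0] * 51 + [1] * 1 + [0] * 5 + [1] * 29

# F1 is the harmonic mean of precision and recall.
precision = 29 / (29 + 1)
recall = 29 / (29 + 5)
f1_manual = 2 * precision * recall / (precision + recall)

f1 = f1_score(y_true, y_pred)
print('The F1 score is %.2f' % f1)
```

Because the harmonic mean is pulled toward the smaller of the two values, the F1-score sits closer to the recall of 0.85 than a simple average would.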