A model `hyperparameter` is a setting that is external to the model and whose value cannot be estimated from data. It has to be set before the learning process begins. Examples include `C` in `SVM`, `k` in `KNN`, and the number of hidden layers in a neural network.

In contrast, a parameter is an internal characteristic of the model whose value is estimated from data. Examples include the beta coefficients of linear/logistic regression and the support vectors in `SVM`.

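>Grid search is used to find the hyperparameter values of a model that result in the most ‘accurate’ predictions. It evaluates every combination in a user-specified grid of candidate values, typically with cross-validation.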
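
To make the distinction concrete, here is a minimal sketch of grid search written as a plain loop. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the breast cancer data), and the names `X_demo`, `y_demo`, `candidate_C`, and `cv_scores` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the breast cancer dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# C is a hyperparameter: we choose its candidate values before fitting.
candidate_C = [0.01, 1, 100]

# Grid search in miniature: score each candidate with 5-fold CV, keep the best.
cv_scores = {}
for C in candidate_C:
    model = LogisticRegression(C=C, solver='liblinear', random_state=0)
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_scores[C] = scores.mean()

best_C = max(cv_scores, key=cv_scores.get)
print('Mean CV accuracy per C : ' + str(cv_scores))
print('Best C : ' + str(best_C))

# The coefficients are parameters: they are estimated from the data during fit.
final_model = LogisticRegression(C=best_C, solver='liblinear', random_state=0)
final_model.fit(X_demo, y_demo)
print('Learned coefficients : ' + str(final_model.coef_))
```

`GridSearchCV` automates this loop and extends it to several hyperparameters at once.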

### Import Packages

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sn
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels
from sklearn.model_selection import GridSearchCV
np.set_printoptions(precision=2)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Import the dataset

In [None]:
#import data
data = pd.read_csv('/kaggle/input/breast-cancer-csv/breastCancer.csv')

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.columns

In [None]:
data.info()

Each row in the dataset has one of two possible classes: benign (represented by 2) and malignant (represented by 4). Also, there are 10 attributes in this dataset (shown above) which will be used for prediction.

### Data Cleaning
Clean the data and rename the class values as 0/1 for model building (where 1 represents a malignant case). Also, let’s observe the distribution of the class.

In [None]:
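data = data.drop(['id'],axis=1) #Drop 1st column
data = data[data['bare_nucleoli'] != '?'].copy() #Remove rows with missing data (copy avoids SettingWithCopyWarning)
data['bare_nucleoli'] = pd.to_numeric(data['bare_nucleoli']) #Column was read as text because of the '?' entries
data['class'] = np.where(data['class'] ==2,0,1) #Change the Class representation
data['class'].value_counts() #Class distribution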

There are 444 benign and 239 malignant cases.
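
The cleaning steps can be tried in isolation on a tiny frame. The sketch below uses a hypothetical four-row `toy` DataFrame with invented values (an assumption for illustration). It also converts `bare_nucleoli` to a numeric dtype, since a column containing `'?'` is read as text.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset (values invented for illustration).
toy = pd.DataFrame({
    'id': [101, 102, 103, 104],
    'bare_nucleoli': ['1', '?', '10', '3'],
    'class': [2, 4, 4, 2],
})

toy = toy.drop(['id'], axis=1)                  # drop the identifier column
toy = toy[toy['bare_nucleoli'] != '?'].copy()   # remove rows with missing data
# The column holds strings because of the '?' entries; convert it to numbers.
toy['bare_nucleoli'] = pd.to_numeric(toy['bare_nucleoli'])
toy['class'] = np.where(toy['class'] == 2, 0, 1)  # 2 (benign) -> 0, 4 (malignant) -> 1

print(toy)
print(toy['class'].value_counts())
```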

In [None]:
#Split data into attributes and class
X = data.drop(['class'],axis=1)
y = data['class']

In [None]:
#perform training and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## Dummy Classifier
`DummyClassifier` is a classifier that makes predictions using simple rules. This classifier is useful as a simple baseline to compare with other (real) classifiers. Do not use it for real problems.
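
As a quick illustration of what the `most_frequent` strategy does, the sketch below fits it on hypothetical toy labels (7 of class 0, 3 of class 1; an assumption, not the notebook's data). Every prediction is the majority class, so the accuracy equals that class's share of the labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (an assumption for illustration): 7 of class 0, 3 of class 1.
y_toy = np.array([0] * 7 + [1] * 3)
X_toy = np.zeros((len(y_toy), 1))  # DummyClassifier ignores the feature values

baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
toy_pred = baseline.predict(X_toy)

# Accuracy of the baseline versus the share of the majority class.
baseline_accuracy = baseline.score(X_toy, y_toy)
majority_share = np.bincount(y_toy).max() / len(y_toy)

print('Predictions : ' + str(toy_pred))
print('Baseline accuracy : ' + str(baseline_accuracy))
print('Majority share : ' + str(majority_share))
```

Because class 1 is never predicted, recall for the positive class is zero. This is why accuracy alone can be misleading on imbalanced data.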

In [None]:
#Dummy Classifier
# clf=DummyClassifier(strategy="most_frequent")
# clf.fit(X_train,y_train)
clf = DummyClassifier(strategy= 'most_frequent',random_state=42).fit(X_train,y_train)
y_pred = clf.predict(X_test)

In [None]:

#Distribution of y test
print('y actual : \n' +  str(y_test.value_counts()))
#Distribution of y predicted
print('y predicted : \n' + str(pd.Series(y_pred).value_counts()))

### Calculate the evaluation metrics of model

In [None]:
# Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred,labels=np.unique(y_pred))))

In [None]:

#Dummy Classifier Confusion matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))

### Function for confusion matrix plot

In [None]:
#Function to plot intuitive confusion matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

In [None]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

In [None]:
# Plot non-normalized confusion matrix
plt.figure()
class_names = [0,1]
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix - DummyClassifier')
a = plt.gcf()
a.set_size_inches(8,4)
plt.show()

## Logistic Regression model with default hyperparameters

In [None]:
#Logistic regression
clf = LogisticRegression(solver="lbfgs",random_state=42).fit(X_train,y_train)
y_pred = clf.predict(X_test)

In [None]:
# Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))

In [None]:
#Logistic Regression Classifier Confusion matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))

In [None]:
cnf_matrix = confusion_matrix(y_test, y_pred)

In [None]:
# Plot non-normalized confusion matrix
plt.figure()
class_names = [0,1]
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix - LogisticRegression')
a = plt.gcf()
a.set_size_inches(8,4)
plt.show()

## Logistic Regression + Grid Search



In [None]:
#Grid Search
clf = LogisticRegression(solver='liblinear',random_state=42)
grid_values = {'penalty': ['l1', 'l2'],'C':[0.001,.009,0.01,.09,1,5,10,25]}
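# Candidates are ranked by recall; the `iid` argument was removed in scikit-learn 0.24, so it is omitted here
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values,scoring = 'recall',cv=5)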
grid_clf_acc.fit(X_train, y_train)

#Predict values based on new parameters
y_pred_acc = grid_clf_acc.predict(X_test)

In [None]:
# New Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc)))
print('Precision Score : ' + str(precision_score(y_test,y_pred_acc)))
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc)))
print('F1 Score : ' + str(f1_score(y_test,y_pred_acc)))

In [None]:
#Logistic Regression (Grid Search) Confusion matrix
confusion_matrix(y_test,y_pred_acc)

In [None]:
# Use the grid-search predictions (y_pred_acc), not the earlier y_pred
cnf_matrix = confusion_matrix(y_test, y_pred_acc)

In [None]:
# Plot non-normalized confusion matrix
plt.figure()
class_names = [0,1]
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix - Logistic Regression (Grid Search)')
a = plt.gcf()
a.set_size_inches(8,4)
plt.show()

The hyperparameters we tuned are:
- Penalty: `l1` or `l2`, which specifies the norm used in the penalization.
- C: Inverse of regularization strength; smaller values of C specify stronger regularization.
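
The effect of these two hyperparameters can be seen directly in the fitted coefficients. The sketch below is an illustration on synthetic data (`X_demo`, `y_demo`, and the dictionary `coef_abs_sum` are assumptions, not part of the analysis above). It compares coefficient magnitudes across `penalty` and `C`, then shows how to read the winning combination from `GridSearchCV`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the breast cancer dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=3, random_state=0)

# Smaller C means stronger regularization, which shrinks the coefficients.
coef_abs_sum = {}
for penalty in ['l1', 'l2']:
    for C in [0.01, 1, 100]:
        model = LogisticRegression(penalty=penalty, C=C, solver='liblinear',
                                   random_state=0).fit(X_demo, y_demo)
        coef_abs_sum[(penalty, C)] = float(np.abs(model.coef_).sum())
        n_zero = int(np.sum(model.coef_ == 0))  # l1 can set coefficients exactly to zero
        print(penalty, C, 'sum|coef| =', round(coef_abs_sum[(penalty, C)], 3),
              'zero coefs =', n_zero)

# GridSearchCV records which combination scored best under cross-validation.
grid = GridSearchCV(LogisticRegression(solver='liblinear', random_state=0),
                    param_grid={'penalty': ['l1', 'l2'], 'C': [0.01, 1, 100]},
                    scoring='recall', cv=5)
grid.fit(X_demo, y_demo)
print('Best parameters : ' + str(grid.best_params_))
print('Best mean CV recall : ' + str(grid.best_score_))
```

On the notebook's own search, the same attributes (`grid_clf_acc.best_params_` and `grid_clf_acc.best_score_`) report which `penalty` and `C` were selected.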

## HistGradientBoostingClassifier
`HistGradientBoostingClassifier` is a histogram-based gradient boosting classification tree.

In [None]:
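# explicitly require this experimental feature
# (needed only on scikit-learn < 1.0; from 1.0 onward the estimator is stable and this import is unnecessary)
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
# now you can import normally from ensemble
from sklearn.ensemble import HistGradientBoostingClassifier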

In [None]:
clf = HistGradientBoostingClassifier(learning_rate=0.005,random_state=42).fit(X_train, y_train)
y_pred=clf.predict(X_test)

In [None]:
clf.score(X_test,y_test)

In [None]:
#HistGradientBoostingClassifier Confusion matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))

In [None]:
cnf_matrix_hgbc = confusion_matrix(y_test, y_pred)

In [None]:
# Plot non-normalized confusion matrix
plt.figure()
class_names = [0,1]
plot_confusion_matrix(cnf_matrix_hgbc, classes=class_names,
                      title='Confusion matrix - HistGradientBoostingClassifier')
a = plt.gcf()
a.set_size_inches(8,4)
plt.show()