## Table of Contents
<a href="#Class-Distributions"><font size="+0.5">Class Distributions</font></a>
* What they are
* Basic Model
* Stratification

<a href="#Logistic-Regression"><font size="+0.5">Logistic Regression</font></a>
* What the model is

<a href="#Model-Evaluation"><font size="+0.5">Model Evaluation</font></a>
* Truth Tables and Confusion Matrices
* Precision, Recall and F1 Scores
* ROC and AUC

<a href="#Hyperparameter-Optimisation"><font size="+0.5">Hyperparameter Optimisation</font></a>
* Finding the Best Parameter Values
* Grid Searching
* Alternatives to Brute Force

---
<center><h1><font size=6> Supervised Learning</font></h1></center>
<center><h1><font size=7> Classification </font></h1></center>

## Learning Objectives
- Understand what classification is
- Understand how class distributions can affect our models
- Be able to train a logistic regression model
- Know different methods to evaluate the performance of a classification model
- Be able to find optimal hyperparameter tunings

Classification problems are a powerful tool for automating work and using data to make informed decisions. With regression, we use our given data to build a model that predicts a continuous numerical value. In contrast, for classification problems we want to predict a categorical value called the label.

The workflow required for a classification problem is much the same as that for a regression model. We will prepare our data, train a model, evaluate its performance and improve the model with hyperparameter tuning.

The difference between classification and regression lies in how we measure the performance of our model. We can no longer use the difference between **true** and **predicted** values in the same manner, as our targets are now categorical, not numeric.

We are going to focus on binary classification in this chapter, where our model needs to pick one of two options. However, this will not always be the case. More complex problems may have multiple classes to predict from, such as yes, no and maybe. In addition, we may want to use a model to predict more than one attribute, which is known as a multi-label problem. But, for now, we will keep to the simple binary case to understand the principles.

Before we get started, we should first load and understand our data.

## Data Preparation

### Load Relevant Libraries

In [None]:
# Load initial relevant libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

Our data set is a subset of the 1990 US census with some attributes removed. We will be using the different features to predict the **`salary_over_50K`** variable which denotes whether the individual described earns more than $50,000. 

In [None]:
# Load data
income_filepath = '../../data/income.csv'
income_data = pd.read_csv(filepath_or_buffer=income_filepath, delimiter=",")
income_data.sample(n=7)

Our target variable is going to be the binary attribute **`salary_over_50K`**. Our data has a mix of numerical data, such as the **`age`** and **`education_years`**. It also has many categorical variables. This means we need to one-hot encode our categorical data, scale our data, and convert our target data to binary labels.  

Our data also contains the following information about the features:

* age - integer (years)
* Sector - categorical
* education_years - integer (years)
* profession - categorical
* marital - categorical (marital status)
* ethnicity - categorical
* gender - categorical (binary)
* work_hours - integer (hours)
* origin_country - categorical (country of origin)
* salary_over_50K - categorical (binary)

### Data Cleaning
We now need to handle missing data. As this is a large data set we are going to assume we can just drop missing data. We then reset the indexes of the data frame to prevent mis-matching indexes and rows.

In [None]:
# Dropping the rows that contain missing data for any column.
print("\tBefore cleaning:\n", income_data.isna().sum())
income_data = income_data.dropna()
income_data = income_data.reset_index(drop=True)
print("\tAfter cleaning:\n", income_data.isna().sum())

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 1:</font></b> 

<p> 
Using a <b>pandas</b> data frame, plot one of the following attributes: <b>"age"</b>, <b>"education_years"</b> or <b>"work_hours"</b>.

This can be done using <b>dataframe_name["column_name"].plot.density()</b>

Are the variables normally distributed?

What does the distribution of the attribute tell us? What scaler should we use for this data?
</p> 
</div>

In [None]:
# Write your code here


### Data Scaling

In this example we are going to use the **```RobustScaler()```** throughout. The following code separates our numerical, categorical and target data so that each can be processed separately. The data will then be combined into the correct format for our machine learning models. For the model we are using later we do not need to scale the encoded categorical data, but it's always worth checking this.

In [None]:
# Load preprocessing libraries
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, RobustScaler

# Create the one hot encoder object
one_hot_encoder = OneHotEncoder(handle_unknown='ignore')

# Create the scaler object
robust_scaler = RobustScaler()

# Set our target variable to be a separate array.
target_data = pd.DataFrame(income_data["salary_over_50K"])
y = target_data["salary_over_50K"].to_numpy()

# Remove target variable in order to have only feature data
income_data = income_data.drop(columns=["salary_over_50K"])

# Separate the numerical data and categorical data.
X_numerical = income_data.select_dtypes(exclude="object").to_numpy()
X_categorical = income_data.select_dtypes(include="object").to_numpy()

# Fit the one hot encoder to the categorical data.
one_hot_encoder.fit(X_categorical)

# Transform the categorical data frame and convert it to an array.
X_categorical = one_hot_encoder.transform(X_categorical).toarray()

# Fit the scaler to the numeric data.
robust_scaler.fit(X_numerical)

# Transform the raw numerical data to a scaled version.
X_numerical = robust_scaler.transform(X_numerical)

# Combine the categorical and numerical data back to a single array.
X = np.concatenate((X_numerical, X_categorical), axis=1)

We want to convert our "Yes"/"No" target data into binary 1's and 0's. We can check to see what each value corresponds to from the original data using the **```inverse_transform()```** function once our encoder is fitted.

In [None]:
# Create the encoder object
label_encoder = LabelEncoder()

# Fit and transform the data.
y = label_encoder.fit_transform(y)

# What does our transformed data look like?
print(y[:10])

# Checking what the 1's and 0's correspond to in our "salary_over_50K" variable.
print("The classes of this variable are: \n", label_encoder.classes_)
print("1 corresponds to: \n", label_encoder.inverse_transform([1]))
print("0 corresponds to: \n", label_encoder.inverse_transform([0]))

Our data is now model ready! Before we start trying to make predictions, we should try to understand the data and the model we are using.

---
# Class Distributions

We have already discussed the distribution of numerical data earlier in the course. We used this information to decide what method of scaling we should use. The distribution of different variables is crucial information that informs our whole workflow. 

The most important distribution with regards to our model is often the class distribution of our target data. We need to know how much of the data is a certain label, so that when we make a model we can evaluate it properly. 

> What is a class distribution? Simply, it is the amount of data that has each label. 

Below is a plot of the class distribution of our **```income_data```** target variable, the **`salary_over_50K`**. The y-axis represents the proportion of all the data which has a certain class.

In [None]:
# Plot the proportion of the data that falls into each target class.
target_data["salary_over_50K"].value_counts(normalize=True).plot(kind="bar",
                                                        color=["salmon", "limegreen"],
                                                        title="salary_over_50K Class Distribution")
plt.xlabel("Class")
plt.ylabel("Proportion");

We can clearly see that the majority of our data has the label "No" in the target variable. This will have a significant effect on our model. 

Let's first consider how we might measure the performance of a classification model. We want to make predictions, good predictions, and so we may be able to use the accuracy as a measure. The accuracy is fundamentally the proportion of correct predictions we make out of all predictions.

> As an equation, the accuracy is: 
$$ acc = \frac{\#~correct~predictions}{\#~total~predictions} $$

If we used a train/test split to evaluate our model we would be making predictions on the *test* set. The $\#~correct~predictions$ would be the number of times our model's ```y_pred``` equalled its corresponding ```y_test```. The $\#~total~predictions$ is simply the number of predictions, or the size of the test set. For example, if ```y_pred = [0, 0, 1]``` and ```y_test = [0, 1, 1]```, then the model got two out of three predictions correct, giving $acc = \frac{2}{3}$.
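
The worked example above can be checked in a few lines. This is a minimal sketch using the three illustrative predictions from the text (the variable names `y_test_demo` and `y_pred_demo` are introduced here only for illustration), comparing a manual calculation with **`sklearn`**'s `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Illustrative true and predicted labels from the worked example.
y_test_demo = np.array([0, 1, 1])
y_pred_demo = np.array([0, 0, 1])

# Manual accuracy: the fraction of predictions that match the true labels.
manual_acc = np.mean(y_pred_demo == y_test_demo)

# The same calculation using sklearn.
sklearn_acc = accuracy_score(y_test_demo, y_pred_demo)

print("Manual accuracy: ", manual_acc)
print("sklearn accuracy:", sklearn_acc)
```

Both values should agree, as `accuracy_score` counts matching labels in exactly this way.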

## Basic Model

Using the information we have, that the data is largely "No" target data, we can build a simple classifier. This model is called a Most Frequent Classifier. 

The model takes into account the count of each target value and, regardless of the features of the data that is input into the model, it will predict the most frequent target class. In our case, it will predict the "No" class. 

To build this in **`sklearn`** we are going to use a model called the **```DummyClassifier()```**. This is a really useful model for starting out as we can give it different strategies easily and it will predict using naive assumptions.

<div class="alert alert-block alert-warning">
<b><font size=3> Key Point</font> </b> 
<p> 
The following model is not a good model. It is a useful model for demonstrating class distribution and evaluation related issues. We will use and discuss real models later in the chapter.
</p>
</div>

Below we are going to 'train' a most frequent model on our data, and then evaluate it using the accuracy measure discussed. 

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split data to training test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Create the model object
most_frequent_model = DummyClassifier(strategy="most_frequent")

# Train model
most_frequent_model.fit(X_train, y_train)

# Predict for test data
y_pred = most_frequent_model.predict(X_test)

# Calculate the accuracy of the predictions.
most_frequent_acc = accuracy_score(y_test, y_pred)

print("Model accuracy: \n", most_frequent_acc)

# Calculate the class distribution of the data
values = target_data["salary_over_50K"].value_counts()
proportion_no = values["No"] / (values["No"] + values["Yes"])

print("Proportion \"No\": \n", proportion_no)

This shows us that always predicting the most frequent class can get us an accuracy score greater than 50% without even building a real machine learning model. Unfortunately, this could cause us to badly misjudge how good our model is. For example, suppose we were trying to predict whether a patient has a disease that 99% of patients do not have. A "most_frequent" model would always predict that the patient does not have the disease, and by doing so would get an accuracy score of 0.99, which is really high! Yet it would never identify a single patient who has the disease, which misses the point of trying to build a model to predict something useful.
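
The disease example can be reproduced with made-up numbers. The sketch below assumes a hypothetical screening data set of 1,000 patients, of whom 10 have the disease; the data and variable names are invented purely for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical data: 990 healthy patients (0) and 10 with the disease (1).
y_disease = np.array([0] * 990 + [1] * 10)

# The dummy model ignores the features, so a placeholder column is enough.
X_disease = np.zeros((len(y_disease), 1))

# 'Train' the most frequent model and predict on the same data.
baseline_model = DummyClassifier(strategy="most_frequent")
baseline_model.fit(X_disease, y_disease)
y_pred_disease = baseline_model.predict(X_disease)

# Accuracy looks excellent, but no patient with the disease is found.
disease_acc = accuracy_score(y_disease, y_pred_disease)
n_detected = int(np.sum((y_pred_disease == 1) & (y_disease == 1)))

print("Accuracy:", disease_acc)
print("Patients with the disease detected:", n_detected, "out of", int(y_disease.sum()))
```

The high accuracy here comes entirely from the class distribution, not from anything the model has learned.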

<div class="alert alert-block alert-warning">
<b><font size=3>Key Point</font> </b> 
<p> 
Although an intuitive measure, the accuracy score of a model is a poor method for evaluating classification outputs. We need to think about our class distribution when approaching a problem, and how it is going to affect our measurements. Accuracy performs poorly when there is an imbalanced data set. 
</p>
</div>


## Stratification

We clearly have a problem with class distributions, and need to take them into account. Stratification is a concept that allows us to do so. As we know the class distribution of our data, in our example case 40% Yes, 60% No, we need to make sure that the data we give our model retains that class distribution, so that it is representative of the problem as a whole. There are a range of methods in **```sklearn```** which allow us to take this distribution into account. For example, when doing K-fold cross validation, we can ensure that the distribution is maintained in each of the K folds by using **```StratifiedKFold()```**.

Where the class distribution is highly imbalanced, we may need to rebalance the data set so that one class does not dominate the others. This can be achieved by "re-weighting" the different classes. There are two options when doing this:

* Change the **`class_weight`** argument within the chosen model to **`"balanced"`**. The estimator will then weight each sample inversely to the frequency of its class, so that each class contributes equally to the model training.

* The data itself can be re-sampled in order to either produce more of the smaller class, or decrease the amount of the larger class. These methods are called oversampling and undersampling respectively. This can either be implemented manually or done using a library function such as **`sklearn.utils.resample`**.
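
As a rough illustration of both ideas, the sketch below uses synthetic labels with a 60/40 split (not the income data) to show that **`StratifiedKFold()`** keeps the class proportions in every fold, and that `resample` from `sklearn.utils` can oversample the smaller class. All variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic labels: 60% "No" (0) and 40% "Yes" (1).
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Stratified K-fold keeps the class proportions the same in every fold.
stratified_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_yes_proportions = []
for train_index, test_index in stratified_folds.split(X_demo, y_demo):
    fold_yes_proportions.append(y_demo[test_index].mean())

print("Proportion of 'Yes' in each test fold:", fold_yes_proportions)

# Oversampling: draw from the smaller class, with replacement,
# until it has as many samples as the larger class.
X_minority = X_demo[y_demo == 1]
X_minority_oversampled = resample(X_minority, replace=True,
                                  n_samples=60, random_state=0)

print("Minority samples before:", len(X_minority))
print("Minority samples after: ", len(X_minority_oversampled))
```

Resampling should only ever be applied to the training data, so that the test set still reflects the real class distribution.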

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 2:</font></b> 

<p> 
Using <b>X</b> and <b>y</b> train a new model using <b>DummyClassifier()</b> that uses the <b>strategy="stratified"</b> argument. What does this do? How do our results change? 
</p> </div>

In [None]:
# Write your answer here


---
# Logistic Regression

One of the most powerful classification models is built using the principles of regression discussed in the previous chapter. Fundamentally, a logistic regression model calculates the probability that a given data point has a certain label.

If the probability of having a certain label is above a threshold (0.5 by default for a binary problem in **`sklearn`**) then the model gives the data that label.
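
The link between probabilities and labels can be sketched on synthetic data. The example below assumes a binary problem and a threshold of 0.5; the data comes from `make_classification` and is unrelated to the income data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

demo_model = LogisticRegression(solver="liblinear")
demo_model.fit(X_demo, y_demo)

# Column 1 of predict_proba holds the probability of class 1.
prob_class_1 = demo_model.predict_proba(X_demo)[:, 1]

# Apply the threshold ourselves and compare with the model's own labels.
manual_labels = (prob_class_1 > 0.5).astype(int)
agreement = np.mean(manual_labels == demo_model.predict(X_demo))

print("First five probabilities:", np.around(prob_class_1[:5], 3))
print("First five labels:       ", manual_labels[:5])
print("Agreement with predict():", agreement)
```

Working with the probabilities directly makes it possible to choose a different threshold when one type of error is more costly than the other.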

Underlying the logistic regression model is a logistic function, shown below. Instead of fitting a linear model using regression, a logistic function is fitted to the data. The value of the function corresponds to the probability that a data point is in a certain class. This is done using a method called Maximum Likelihood Estimation, which calculates the probability of the data being what it is given a certain model. It then tries different models and selects the one with the highest probability of producing the training data. 

> This is what the logistic function looks like mathematically $$ f(x) = \frac{1}{1+e^{-x}}$$
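
To make Maximum Likelihood Estimation more concrete, here is a minimal sketch that scores two hand-picked candidate models on a tiny made-up data set. The helper functions and the candidate weights are assumptions for illustration; a real solver searches over the weights automatically rather than comparing two guesses.

```python
import numpy as np

def logistic(z):
    """The logistic function, mapping any real number into (0, 1)."""
    return 1 / (1 + np.exp(-z))

def log_likelihood(weight, intercept, x, y):
    """Log of the probability of observing labels y given the model."""
    prob_class_1 = logistic(weight * x + intercept)
    return np.sum(y * np.log(prob_class_1) + (1 - y) * np.log(1 - prob_class_1))

# Tiny made-up data set: larger x values tend to have label 1.
x_demo = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y_demo = np.array([0, 0, 0, 1, 1, 1])

# Two candidate models: one slopes the right way, one the wrong way.
ll_increasing = log_likelihood(weight=2.0, intercept=0.0, x=x_demo, y=y_demo)
ll_decreasing = log_likelihood(weight=-1.0, intercept=0.0, x=x_demo, y=y_demo)

print("Log-likelihood (weight = 2): ", ll_increasing)
print("Log-likelihood (weight = -1):", ll_decreasing)
```

The candidate with the higher log-likelihood is the one more likely to have produced the training labels, so it is the one Maximum Likelihood Estimation would prefer.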

In [None]:
# Example plot of logistic function

# Generate some data points over a domain.
x_generated_data = np.linspace(-10,10,100)
y_generated_data = 1/(1+np.exp(-x_generated_data))

# Plot the model
plt.plot(x_generated_data, y_generated_data, color="orchid")
plt.title("Logistic Function")
plt.xlabel("Input value")
plt.ylabel("Output value");

That's the background of the model, but what we really need is to be able to use it to make predictions. We shall now train a model to predict our **`salary_over_50K`** target using logistic regression. We can then compare the accuracy of our new model with the old one.

In [None]:
# Import model class
from sklearn.linear_model import LogisticRegression

# Split training and test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

# Create the model object, "liblinear" is a method of fitting the logistic curve.
# Ignore the arguments for now.
logistic_regression_model = LogisticRegression(solver="liblinear",  
                                               C=0.001, 
                                               random_state=1234, 
                                               class_weight="balanced")

# Fit the model
logistic_regression_model.fit(X_train, y_train)

# Predict values using the test data.
y_pred = logistic_regression_model.predict(X_test)

# Compare predicted and true values to yield the accuracy.
logistic_regression_acc = accuracy_score(y_test, y_pred)

print("Logistic Regression Model accuracy: \n", logistic_regression_acc)



In [None]:
# Predicting Probabilities
y_pred_prob = logistic_regression_model.predict_proba(X_test)
print(y_pred_prob[:5])
print(y_pred[:5])


We can clearly see that our logistic regression model performs better than the "most_frequent" model. Next, we are going to look at other methods to measure the performance of a classification model. This will allow us to better evaluate our models and look towards improving our predictions even more.

---
# Model Evaluation

The accuracy score we have used is an intuitive way to view our model performance, but as we have seen it is an easy metric to "game" based on the class distribution. In this section we are going to look at a number of other methods which allow us to quantify key areas of weakness and highlight qualities to improve in our model.

## Truth Tables

We are going to consider a **"No"** prediction as a **positive** one, and a *"Yes"* as *negative*. Not that in this example either is good or bad, but it is helpful to label them and be consistent.

- A **True Positive** is therefore a *correct* prediction of the positive class, i.e. our model predicts "No" when the true value is "No".

- A **False Positive** is an *incorrect* prediction of the positive class, i.e. our model predicts "No" when the true value is "Yes".

- As a result, a **True Negative** is a *correct* prediction of the negative class. Our model predicts "Yes" for a true "Yes".

- And finally a **False Negative** is an *incorrect* prediction of the negative class. Our model predicts "Yes" when the true value is "No".

The *False* values may be more familiar to you by another name, as errors. 

A *Type I Error* is a **False Positive**.

A *Type II Error* is a **False Negative**.

This is summarised in the image below, a truth table. 

<img src="../../images/truthtable.png"  width="500" height="550" alt="A truth table showing the names of true positive, true negative, false positive and false negative with respect to the true and predicted data values">
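
The four outcomes can be counted directly from the true and predicted labels. The sketch below uses a handful of made-up labels and follows this section's convention that "No" (encoded as 0) is the positive class; note that many **`sklearn`** functions instead treat the label 1 as positive by default.

```python
import numpy as np

# Made-up labels for illustration: 0 is "No", 1 is "Yes".
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# Following the convention in this section, "No" (0) is the positive class.
positive = 0

predicted_positive = (y_pred_demo == positive)
actually_positive = (y_true_demo == positive)

true_positives = int(np.sum(predicted_positive & actually_positive))
false_positives = int(np.sum(predicted_positive & ~actually_positive))
true_negatives = int(np.sum(~predicted_positive & ~actually_positive))
false_negatives = int(np.sum(~predicted_positive & actually_positive))

print("TP:", true_positives, "FP:", false_positives)
print("TN:", true_negatives, "FN:", false_negatives)
```

The four counts always sum to the total number of predictions, which is a quick sanity check on any truth table.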

## Confusion Matrix

The accuracy score effectively weights an incorrect prediction of "Yes" equally to an incorrect prediction of "No". By having one overall output score we lose information about the specific predictions our model makes. 

A confusion matrix allows us to visualise how our model performs with each target class. This allows us to understand where the model is making mistakes, and whether it is favouring certain classes disproportionately. We are going to first look at the confusion matrix of our Logistic Regression model, then compare it to the Most Frequent model. 

> In our matrix, the y-axis represents the **true** value of the target, and the x-axis is the **predicted** value. 

The confusion matrix is the count or proportion of the True/False Positive/Negatives given by the truth table.

In [None]:
from sklearn.metrics import confusion_matrix

# Set our labels for the matrix and plot, ensuring they correspond ie 0->"No".
labels = [0, 1]
tick_labels = ["No", "Yes"]

# Generate the confusion matrix from true and predicted values.
logistic_regression_conf_matrix = confusion_matrix(y_test, y_pred, labels=labels)
print("Confusion matrix (count): \n", logistic_regression_conf_matrix)

# This step normalises our output so we see a proportion rather than the raw count of values.
logistic_regression_conf_matrix = confusion_matrix(y_test, y_pred, labels=labels, normalize='true')
print("Confusion matrix (proportion true): \n", np.around(logistic_regression_conf_matrix, 2))

# Plotting the confusion matrix with color scale.
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(logistic_regression_conf_matrix, cmap="coolwarm")
fig.colorbar(cax)
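# Fix the tick positions first so each label lines up with its class.
ax.set_xticks(labels)
ax.set_xticklabels(tick_labels)
ax.set_yticks(labels)
ax.set_yticklabels(tick_labels)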
plt.xlabel('Predicted')
ax.xaxis.set_label_position('top') 
plt.ylabel('True');

The confusion matrix tells us some interesting things about our model. The different classes do not have equal success rates of predictions.

What does this mean? Our model is better at predicting "Yes" than "No". If we want to improve our model we need to give it more information, or the ability to predict "No". 

Using a proportion of the true value allows us to compare the different classes without taking into account the potentially large difference in the size of the classes. 

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 3:</font></b> 

<p> 
Calculate and show the confusion matrix for the "most_frequent" model (<b>most_frequent_model</b>) we used earlier. What is the difference in the matrix with the Logistic Regression model?
</p> </div>

In [None]:
# Creating predictions from the model.
y_pred_most_frequent = most_frequent_model.predict(X_test)

# Write your code here


## Precision, Recall and F1 Scores

Now we have seen both the accuracy and how to generate the confusion matrix. The accuracy was a naive way of viewing the data, but the fact it was able to give us a single output was useful. The confusion matrix was important as it told us about where our model was performing well, and where it wasn't. 

In this section we are going to introduce more in-depth measures of how a prediction has worked, allowing us to describe our model's performance.

Using the terminology introduced with the truth table and confusion matrix we can now describe our different predictions based on the true value of the data. They also allow us to measure the performance, whether we are getting TPs and TNs (what we want!) or FPs and FNs.

From these values we can create new metrics to measure performance.

### Recall

The recall of a model, otherwise known as the true positive rate (TPR), is the proportion of true positive predictions out of all the actual positive targets. This tells us how good our model is at finding the positive values.

>  $$ recall = \frac{\#~true~positive~predictions}{\#~actual~positives} = \frac{\#~TP}{\#~TP+\#~FN}$$

### Precision

The precision of a model measures how good the model's positive predictions are, taking into account the false positives it produces. It is the proportion of true positive results out of the total predicted positive results.

> $$precision = \frac{\#~true~positive~predictions}{\#~predicted~positive} = \frac{\#~TP}{\#~TP+\#~FP}$$

### Healthcare Example

Whilst these two measurements may seem similar, their difference is powerful. If we are predicting whether a patient has a disease, we care much more about detecting the positive case (has disease) than about avoiding false alarms. For this reason we want a model that maximises the recall, so that we detect as many instances of the disease as possible. We are therefore less concerned about the precision, and are willing to accept a lower value, as predicting that someone has the disease when they don't (a false positive) will not be as damaging as missing a patient who does have it (a false negative).

*This example assumes that the medical professionals wouldn't operate or medicate without using other tests to confirm findings.*

### F1 Score

The precision and recall of a model give us insight into how it performs with respect to all the actual positive values and all the predicted positive values. We can combine the two metrics in order to get one overall score that takes both properties into account. This is the F1-score.

> $$ F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} $$

Using our True Positive notation this is equivalent to:

> $$ F1 = \frac{TP}{TP + \frac{1}{2} (FP + FN)} $$
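
As a quick check of these formulas, the sketch below builds made-up labels with 6 true positives, 2 false positives and 4 false negatives, then compares the manual calculations with **`sklearn`**. It assumes `sklearn`'s default of treating the label 1 as the positive class (the `pos_label` argument changes this).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels: 10 actual positives (1) followed by 10 actual negatives (0).
y_true_demo = np.array([1] * 10 + [0] * 10)
# The model finds 6 of the positives, and wrongly flags 2 negatives.
y_pred_demo = np.array([1] * 6 + [0] * 4 + [1] * 2 + [0] * 8)

tp, fp, fn = 6, 2, 4

# Manual calculations using the formulas above.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_from_rates = 2 * (precision * recall) / (precision + recall)
f1_from_counts = tp / (tp + 0.5 * (fp + fn))

print("Precision:", precision, "Recall:", recall)
print("F1 (from precision and recall):", f1_from_rates)
print("F1 (from counts):              ", f1_from_counts)

# The same metrics from sklearn, with label 1 as the positive class.
sk_precision = precision_score(y_true_demo, y_pred_demo)
sk_recall = recall_score(y_true_demo, y_pred_demo)
sk_f1 = f1_score(y_true_demo, y_pred_demo)
print("sklearn:", sk_precision, sk_recall, sk_f1)
```

Both forms of the F1 formula give the same value, and the score sits between the precision and the recall, closer to the lower of the two.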


We can generate the results of these different tests all together for each class by implementing a useful **```sklearn```** function called the **```classification_report()```**. We can then apply this to our logistic regression model to find the different scores.

In [None]:
from sklearn.metrics import classification_report

# Set the names for our report to produce.
target_names = ["No", "Yes"]

# Generate the report using the target test and prediction values.
classif_report = classification_report(y_test, y_pred, target_names=target_names)

print(classif_report)

In our example the weighted average F1-score produced is very similar to the accuracy. This is because we have a very simple problem, with binary labels and not a huge class imbalance. If we were to have data that was 95% one class and the remaining 5% split between two others we would see the scores diverge. Nonetheless, we should always use an informative metric rather than the accuracy for evaluating a model.

As a single output value of the F1 score we often use the macro or weighted average. The macro average takes the score of each class then averages them. The weighted average takes the F1 score of each class and averages them with the weight of the class's true value proportion.
 
The macro average tells us how our model performs when every class is treated equally, and the weighted average shows performance based on the abundance of each class.
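
The difference between the two averages can be seen on a small imbalanced example. The labels below are made up for illustration (15 of class 0 and 5 of class 1), and the averages are computed both by hand and with the `average` argument of `f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up imbalanced labels: 15 samples of class 0 and 5 of class 1.
y_true_demo = np.array([0] * 15 + [1] * 5)
y_pred_demo = np.array([0] * 13 + [1] * 2 + [1] * 3 + [0] * 2)

# F1 score for each class separately.
f1_per_class = f1_score(y_true_demo, y_pred_demo, average=None)

# Macro average: a plain mean, every class counts equally.
macro_manual = f1_per_class.mean()

# Weighted average: each class is weighted by its number of true samples.
class_counts = np.bincount(y_true_demo)
weighted_manual = np.average(f1_per_class, weights=class_counts)

macro_sklearn = f1_score(y_true_demo, y_pred_demo, average="macro")
weighted_sklearn = f1_score(y_true_demo, y_pred_demo, average="weighted")

print("F1 per class:", f1_per_class)
print("Macro average:   ", macro_manual, macro_sklearn)
print("Weighted average:", weighted_manual, weighted_sklearn)
```

Because the larger class is predicted better here, the weighted average is higher than the macro average; a large gap between the two is a hint that the smaller class is being predicted poorly.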

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 4:</font></b> 

<p> 
Calculate and print the classification report for the <b>most_frequent_model</b> using <b>y_test</b> and <b>y_pred_most_frequent</b>. How is this report different from the logistic regression report? 
</p> </div>

In [None]:
# Write your code here


## ROC and AUC

We can produce a plot that shows how our true positive rate varies with our false positive rate over a range of thresholds. This is called the Receiver Operating Characteristic (ROC) curve. 

The ROC curve can help us understand the overlap between our predictions of different classes.

This shows us our performance compared to a random prediction, which is represented by a straight diagonal line. The further above this diagonal line our model is, the better it is performing.

The Area Under Curve (AUC) measures the... area under the ROC curve, which will range from $0\rightarrow1$. An AUC value of 0.5 corresponds to the random model's straight diagonal line. We want a value as close to 1 as we can get!
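
Each point on a ROC curve comes from choosing one threshold and measuring the resulting true and false positive rates. The sketch below does this by hand for four made-up scores; the helper `rates_at_threshold` is a hypothetical function written for this illustration, with label 1 treated as the positive class.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up true labels and model scores, for illustration only.
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

def rates_at_threshold(y_true, scores, threshold):
    """Return (TPR, FPR) when scores at or above the threshold are labelled 1."""
    y_pred = (scores >= threshold).astype(int)
    true_positives = np.sum((y_pred == 1) & (y_true == 1))
    false_positives = np.sum((y_pred == 1) & (y_true == 0))
    tpr = true_positives / np.sum(y_true == 1)
    fpr = false_positives / np.sum(y_true == 0)
    return float(tpr), float(fpr)

# Lowering the threshold moves us along the curve from (0, 0) towards (1, 1).
for threshold in [0.9, 0.5, 0.3, 0.0]:
    tpr, fpr = rates_at_threshold(y_true_demo, scores_demo, threshold)
    print("threshold={:.1f}  TPR={:.2f}  FPR={:.2f}".format(threshold, tpr, fpr))

demo_auc = roc_auc_score(y_true_demo, scores_demo)
print("AUC:", demo_auc)
```

The AUC can also be read as the probability that a randomly chosen positive sample is given a higher score than a randomly chosen negative one.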

A more in-depth explanation of ROC curves can be found [here](https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5).

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# For ROC curves we cannot use our y_pred, we need "scores"
# which are produced from the model's decision function.
y_pred_scores = logistic_regression_model.decision_function(X_test)

# The roc_curve function returns the fpr, tpr and the thresholds used to calculate them,
# we will use the tpr and fpr values to plot.
fpr, tpr, _thresholds = roc_curve(y_test, y_pred_scores, pos_label=1)

# Calculate the AUC value.
AUC_score = roc_auc_score(y_test, y_pred_scores)

# Plot our ROC curve against what would be an unbiased random model.
plt.plot(fpr, tpr, color='firebrick', label="ROC curve area: {:.3f}".format(AUC_score))
plt.plot([0, 1], [0, 1], color='steelblue', linestyle='--', label="Expected Random Model")
plt.title("Logistic Regression ROC Curve")
plt.legend(loc="lower right")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate');

---
# Hyperparameter Optimisation

In the previous chapter we introduced the concept of hyperparameters, parameters in our model that affect how our model will learn from the data. We used the example of $\alpha$, the regularization constant to demonstrate that changing a parameter of a model can help us improve the performance. 

So far we have largely used the default model for our problems, but now we know how to properly measure the performance of our model we can try and improve on the default hyperparameters.

What we did last time was manually tune the hyperparameter by plugging in numbers and seeing the impact it had on the model. There are, unsurprisingly, more sophisticated methods of finding an optimal set of hyperparameter values. We are going to explore one of the most widely used approaches called a *parameter sweep* or *grid search*.

## Finding the Best Parameter Values

In order to find the best parameters for our model and data we need a number of things.

- A model/estimator
- A set of possible parameter values
- A method to measure the performance of the model

The first of these is easy, use the model we have chosen, in this case the **`LogisticRegression()`** model. 

What we are going to do is give a set of values of parameters to a function in **`sklearn`** which will build a model for every combination of parameters we have given it. It will then measure the performance of each model using a metric we have designated and tell us what the highest performing set of parameter values were. We list the possible parameters and input them into the function. 
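
To see what "every combination" means in practice, the sketch below uses `ParameterGrid` from **`sklearn`** to list the combinations for an illustrative grid of two solvers and four values of `C`. The grid values here are an assumption chosen for the example.

```python
from sklearn.model_selection import ParameterGrid

# An illustrative grid: 2 solvers and 4 values of C.
demo_grid = {"solver": ["liblinear", "newton-cg"], "C": [0.1, 1, 10, 100]}

# ParameterGrid expands the dictionary into every combination of values.
combinations = list(ParameterGrid(demo_grid))

for parameter_set in combinations:
    print(parameter_set)

print("Number of combinations:", len(combinations))
```

The number of combinations is the product of the number of values for each parameter, so adding more parameters makes the search grow very quickly. With cross validation, each combination is also fitted once per fold.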

In addition, we must decide which method we are going to use to evaluate the models. This process is called a grid search and is implemented in the **`GridSearchCV()`** class in **`sklearn`**. The method uses K-fold cross validation as discussed in the previous chapter in order to ensure the performance is properly measured. This method for optimising performance can be done for both regression and classification problems, or any problem with hyperparameters. 

Below is an example of optimising our model using the **`solver=`** and **`C=`** hyperparameters. **`solver=`** specifies the algorithm used to optimise the model when it is fitted. **`C`** is the inverse of the regularisation constant $\alpha$: the smaller the value of **`C`**, the stronger the regularisation.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

# We need two different sets of data, one to train our model and one to evaluate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

# Define the parameters and the values we want to search.
parameters_to_search = {"solver":["liblinear", "newton-cg"], "C":[0.1, 1, 10, 100]}

# With 2 different parameters for "solver" and 4 for "C", the search will train 2*4=8 models
# For many parameters and values this could increase greatly.

# Select the model type we have chosen.
logistic_regression_model = LogisticRegression(class_weight="balanced")

# Set the number of folds we want to have,
K = 5

# Define our grid to find optimal model.
optimised_model = GridSearchCV(estimator=logistic_regression_model,
                               param_grid=parameters_to_search, scoring="f1", cv=K)

# Fit the parameter search model.
optimised_model.fit(X_train, y_train);

Fitting our model can often generate convergence warnings. These mean that, for at least one attempt at fitting the logistic function, the solver was unable to converge on an optimal solution within its iteration limit. Whilst this is an interesting result, and pulls back the curtain a bit on what is going on behind the scenes, we can ignore it for now as long as it is just a warning.

Below we are going to evaluate this "best" parameter model and compare it to our previous model.

In [None]:
print("The original parameters were: \n\n", logistic_regression_model.get_params())
print("\nThe best parameters found are: \n\n", optimised_model.best_params_)

# Predict target based on best model found.
y_pred = optimised_model.best_estimator_.predict(X_test)

# Generate the report using the target test and prediction values.
classif_report_new = classification_report(y_test, y_pred, target_names=target_names)

print("\nOld parameter report: \n", classif_report)
print("Optimised parameter report: \n", classif_report_new)

We can see that for all classes the improved model has better evaluation values.

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 5:</font></b> 

<p> 
Using our <b>"best parameter"</b> values found above, implement a grid search on the logistic regression model, this time searching for the optimal value of the argument <b>tol</b>. This is the tolerance for the stopping criteria, which affects how quickly our model will converge and the granularity of the solution. Discuss what range of values we should consider exploring and why you think the result is what it is.
</p> </div>

In [None]:
# Write your code here


### Alternative To Brute Force Searching

With many hyperparameters and wide ranges of potential values, the number of possible parameter combinations for our model can be huge, and it is not always feasible to search through every one to determine the optimal values. One way to combat this is to sample from the ranges of hyperparameter values. This lets us control how many combinations we try, and we can then refine the search around the best parameters found.

An implementation of this method is **`RandomizedSearchCV`**, which takes as input the chosen model, the distributions of values to sample from and the number of samples we want to take.
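
A minimal sketch of a randomised search is shown below. It assumes synthetic data from `make_classification` rather than the chapter's data set, and it samples `C` from a log-uniform distribution (`scipy.stats.loguniform`); both the distribution and its range are illustrative choices, not recommendations.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data (an assumption for this illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1234)

# Lists are sampled uniformly; objects with an rvs method are sampled as distributions.
distributions_to_sample = {
    "solver": ["liblinear", "newton-cg"],
    "C": loguniform(1e-2, 1e2),
}

random_search = RandomizedSearchCV(
    estimator=LogisticRegression(class_weight="balanced"),
    param_distributions=distributions_to_sample,
    n_iter=10,          # number of parameter settings to sample
    scoring="f1",
    cv=5,
    random_state=1234,  # makes the sampled settings reproducible
)
random_search.fit(X_demo, y_demo)

print("Best sampled parameters:", random_search.best_params_)
print("Best mean cross-validated F1:", round(random_search.best_score_, 3))
```

Here 10 sampled settings are each evaluated with 5-fold cross-validation, so the cost is set by `n_iter` rather than by how wide the ranges are.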

## Summary

In this chapter we have introduced classification problems. This type of problem requires a slightly different approach to that of regression: we need to choose a different type of model, and we also need to measure that model's performance in a different manner.

Logistic Regression is a powerful tool for classification problems, which lets us make categorical predictions based on input data. It is not the only model that can be trained for classification problems; others include K-nearest-neighbours, Support Vector Machines and Naïve Bayes.

We have shown that the class distribution of a data set is an important factor when designing our model, and as a result we should not rely on the accuracy score alone to measure the model's performance. In this section we have looked at appropriate ways to evaluate a classification model, including confusion matrices, precision/recall/F1 scores and ROC curves.

Building on what we learnt about hyperparameters in Chapter 2: Regression, we looked at a method that helps us choose optimal values for these parameters. Grid searching trains a model for each combination of hyperparameter values in a grid, measures each one's performance against a scorer, and returns the best performing set of parameters.

This course has covered the basics required to complete supervised machine learning problems. The key steps we have covered are:

**Data Preparation**
- Handling missing data
- Standardizing data
- Engineering features for our model
- Selecting features
- **`sklearn`** data structures
- Training/Test Splits

**Model Training and Prediction**
- Regression
- Classification
- Predicting target values

**Model Evaluation**
- Cross validation
- Regression measures (MSE, MAE)
- Class distributions
- Classification measures (confusion matrices, F1 scores, ROC curves)

**Model Optimisation**
- Regularization
- Hyperparameters
- Parameter Grid Search

It is important to remember that the methods and skills taught here are not the whole of machine learning; they are an introduction to important concepts. Nor is the approach taken representative of a real problem workflow, as we have used basic examples with key learning points. When approaching a new problem there will often be much more back and forth between the different steps. The method for cleaning the data you start with won't necessarily be the one you finish with, and each step can be revisited to increase the performance of your final model. For example, you may want to add a new feature to improve the recall of your model in a certain class, or remove a feature that is causing overfitting.



<div class="alert alert-block alert-success">
<b><font size="4"> Next: Case Studies A & B</font> </b> 
<p> 
You now have the skills required to complete the case studies provided, which will give you experience putting together a machine learning workflow. Case Study A will introduce a new type of model and allow you to practise a classification project from start to finish. Case Study B will allow you to practise a regression task with a goal and little prompting.
</p>
</div>