# Lab 1: Machine Learning Supervised Binary Classification 

References: 
- [The Machine Learning Simplified: A Gentle Introduction to Supervised Learning](https://www.amazon.com/dp/B0B216KMM4/qid=1653304321)
- [Machine Learning for Neuroscience Notebook](https://github.com/PBarnaghi/ML4NS/blob/main/00-%20Tutorials/Machine%20Learning%20for%20Beginners%20Tutorial%20and%20Assessment/Machine%20Learning%20for%20Beginners%20(run).ipynb)

# Artificial Intelligence

![AI-ML-DL-Data-Science](https://miro.medium.com/v2/resize:fit:1358/1*UQiAwDtQHP_MunDx5QfnHw.png)

**Artificial Intelligence** (AI): 
- Anything a machine does that looks like a human task; it could be simple code, rules, or a smart system.
- Example: A program that sends a reminder email when a due date passes. It's "AI" if we call that behavior intelligent.

**Machine Learning** (ML): 
- The system learns from data and finds patterns instead of being fully hard-coded.
- Example: A model that looks at past invoices and learns which ones get paid late, then predicts future late payments.

**Deep Learning** (DL): 
- A kind of ML that uses deep neural networks (many layers) and often works directly with raw data (images, audio, text).
- Example: A CNN that learns from raw photos to decide if there's a cat; it figures out edges, shapes, and features automatically.

**Data Science**: 
- The broader process around data: collecting, cleaning, exploring, visualizing, and using ML/DL to answer questions and make decisions.
- Example: Inspecting sales data, cleaning it, plotting trends, building a model to forecast demand, and reporting the result.

In summary, 
- AI = any machine behavior we call "intelligent."
- ML = AI that learns from data.
- DL = ML using deep neural networks.
- Data Science = turning data into insight (includes ML/DL).

# Machine Learning

- The basic idea of machine learning, or ML, is to learn to do a certain task from data.
- At a high level, it can be understood as recognizing and extracting patterns from the data.
- According to [Wikipedia](https://en.wikipedia.org/wiki/Machine_learning): 
    - Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms 
        - that can **learn from data** and 
        - **generalize** to unseen data, and 
        - thus perform tasks **without** explicit **instructions**.


**Machine Learning**
- **Supervised Learning (Labelled Data) (this notebook)**
    - Regression
    - Classification
- Unsupervised Learning (Unlabelled Data)
- Reinforcement Learning (not covered in the syllabus)

**Deep Learning**
- Multilayer Neural Network (Multilayer Perceptron, MLP)
- Convolutional NN
- Sequence to Sequence Models

After the course you will be able to understand: 
- What kind of problems supervised (and unsupervised) ML can be used to solve;
- How a typical supervised learning algorithm works;
- The concepts of overfitting and underfitting;
- Detailed pipeline for a full ML system;
- How to quantitatively evaluate any model;
- The inner workings of the gradient descent algorithm;
- How we can use basis expansion to increase a model's complexity;
- Why we need regularization and when it is helpful;
- What types of errors any model consists of, and how to decrease/minimize them;
- How to mathematically decompose bias and variance errors from a cost function;
- Three main feature selection families;
- Different feature selection procedures and philosophies;
- What procedures in data preparation exist; (Data Cleaning, Feature Transformation, Feature Engineering, Class Label Imbalance)
- Why we need them and how they are performed

So, a question:
<!-- <p style="margin-bottom: 200px;"> <b>Q. What are the problems where we can utilize ML?</b></p> -->
**Q. What are the problems where we can utilize ML?**

![Source: 2020 Machine Learning Roadmap Video](./resources/images/comment_2020_AI_roadmap.png)

Rule #1 of [Google's Machine Learning Handbook](https://developers.google.com/machine-learning/guides/rules-of-ml/)
    - **If you can build a simple rule-based system that doesn't require ML, do that.**

<!-- <p style="margin-bottom: 200px;"> <b>If you can build a simple rule-based system that doesn't require ML, do that. </b></p> -->

But if you cannot, then you may need to **use ML**.

If your ML models (emphasis on models, plural) cannot properly learn the patterns in the data, then consider DL.

`Note`: As your model becomes **deeper**, its **interpretability** decreases. That is, as

depth $\uparrow$, interpretability $\downarrow$.


## 1. Supervised ML Pipeline

![Source: 2020 Machine Learning Roadmap Video](./resources/images/ml_pipeline.png)

Question: If this is the ML pipeline, what would the pipeline for DL look like?

### **1.1. Data Extraction**

- Labels are usually obtained in one of two ways: 
    - Either by past observations (such as what price a house actually sold for) or 
    - By subjective evaluations of a human being (such as determining if an email is spam or not).

`Note` that the supervised ML algorithm is not independently smart (as some sci-fi depictions of AI suggest) but instead is only learning to mimic
the labels you gave it.

- Garbage in, Garbage out. 


#### **1.1.0. Data Retrieval & Collection**

- Data retrieval and collection is the first step in any ML pipeline, involving gathering relevant raw data from diverse sources such as databases, APIs, sensors, and third-party providers. 
- Good collection practices ensure the data is representative, timely, and accompanied by metadata that supports reproducibility and traceability. 
- Attention to data quality, privacy, and legal constraints during collection helps prevent bias and compliance issues later in the pipeline. 
- Well-documented and structured retrieval workflows reduce downstream cleaning effort and improve model reliability.

If available, collect the data from reputable sources.

If reputable sources are not available, then create the dataset yourself.

For our example we will use a dataset from the `sklearn` library.

But before that let's import some dependencies. 

In [None]:
# dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

In [3]:
# from sklearn import datasets
# data = datasets.load_diabetes(as_frame=True) # loads the dataset as a dictionary-like Bunch object
# features = data.data # features as a DataFrame (442 instances by 10 features)
# labels = data.target # continuous-valued target as a Series (442 instances)

In [4]:
data = datasets.load_breast_cancer(as_frame=True) # loads the dataset as a dictionary-like Bunch object
features = data.data # features as a DataFrame (569 instances by 30 features)
labels = data.target # labels as a Series (569 instances)

`features` are also called `inputs`. And, `labels` are also called `targets`. 

The breast cancer dataset is a classic and very easy binary classification dataset.


In [5]:
# check the shape of the dataset
print("Features shape:", features.shape)
print("Labels shape:", labels.shape)

Features shape: (569, 30)
Labels shape: (569,)


In [29]:
# check the type of data

print("Features type:", type(features))
print("Labels type:", type(labels))

Features type: <class 'pandas.core.frame.DataFrame'>
Labels type: <class 'pandas.core.series.Series'>


In [30]:
# let's  understand the data
features.head(5) # first 5 rows of the features dataframe

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [31]:
labels.tail(5) # last 5 rows of the labels Series

564    0
565    0
566    0
567    0
568    1
Name: target, dtype: int64

Note: The labels are binary. In this scikit-learn dataset, 0 indicates a malignant tumor (a positive breast cancer diagnosis) and 1 indicates a benign tumor (a negative diagnosis).

Because the labels are discrete classes rather than continuous values, this is a classification task.
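
To double-check how the labels are encoded, one could inspect the dataset's metadata. The following is a minimal sketch, assuming the loaded object exposes scikit-learn's `target_names` attribute; it reloads the dataset so that it runs on its own, and the variable names `label_names` and `class_counts` are illustrative.

```python
from sklearn import datasets

# Reload the dataset so this snippet is self-contained
data = datasets.load_breast_cancer(as_frame=True)

# target_names[i] is the class name encoded by the integer label i
label_names = list(data.target_names)
print(label_names)

# Count how many instances fall into each class
class_counts = data.target.value_counts().sort_index()
for label, count in class_counts.items():
    print(label, label_names[label], count)
```

Printing the class names next to the counts also gives a first impression of how balanced the two classes are.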

In [32]:
features.describe()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 16.26919 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | 0.00706 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | 0.04996 | ... | 7.93 | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | 0.0577 | ... | 13.01 | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | 0.06154 | ... | 14.97 | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 | ... | 18.79 | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 | ... | 36.04 | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 |


Common Statistical Outputs:

- `count`: Number of non-NaN values.
- `mean`: The average of the values.
- `std`: Standard deviation of the values.
- `min`: The smallest value.
- `25%`: The 25th percentile (first quartile); means that 25% of the values in your data are below this number.
- `50%`: The median (second quartile).
- `75%`: The 75th percentile (third quartile).
- `max`: The largest value.
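
As a sanity check on how these statistics are defined, the sketch below recomputes them by hand for the `mean area` column and compares them with `describe()`. It assumes pandas' defaults (sample standard deviation with `ddof=1`, linear interpolation for percentiles) and reloads the dataset so it runs on its own; the `manual` dictionary is just an illustrative name.

```python
from sklearn import datasets

features = datasets.load_breast_cancer(as_frame=True).data
area = features["mean area"]

summary = area.describe()

# Each entry of describe() matches the corresponding Series method
manual = {
    "count": area.count(),
    "mean": area.mean(),
    "std": area.std(),            # sample standard deviation (ddof=1)
    "min": area.min(),
    "25%": area.quantile(0.25),
    "50%": area.median(),
    "75%": area.quantile(0.75),
    "max": area.max(),
}

for name, value in manual.items():
    print(f"{name:>5}: {float(value):.4f} (describe: {float(summary[name]):.4f})")
```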

Here, we are going to focus on two features (mean area and mean smoothness), so the first thing we'll do is extract this data.

`Note`: This is done for ease of understanding and to avoid overwhelming students.

In [33]:
area_and_smoothness = features[['mean area', 'mean smoothness']]
area_and_smoothness

| | mean area | mean smoothness |
|---|---|---|
| 0 | 1001.0 | 0.11840 |
| 1 | 1326.0 | 0.08474 |
| 2 | 1203.0 | 0.10960 |
| 3 | 386.1 | 0.14250 |
| 4 | 1297.0 | 0.10030 |
| ... | ... | ... |
| 564 | 1479.0 | 0.11100 |
| 565 | 1261.0 | 0.09780 |
| 566 | 858.1 | 0.08455 |
| 567 | 1265.0 | 0.11780 |


In [None]:
labels

For this example, we are going to use logistic regression, which, despite its name, is a `linear classification model`.

To use the Logistic Regression classifier, we must first import it using the following line of code:

In [34]:
from sklearn.linear_model import LogisticRegression

### **2. Data Preparation**
1. Data Cleaning
2. Feature Design


`Assumption`: The data is in an ideal state.

#### **2.1. Data Cleaning**

Practical datasets often have 
- missing values, 
- improperly scaled measurements, 
- outlier data points, or 
- non-numeric structured data (like strings) 
<!-- - erroneous or  -->
that cannot be directly fed into the ML algorithm.

In [16]:
# checking missing values
# well we do not have to code any additional lines because in the .describe() method above, we can see that count for all features is 569 which means there are no missing values
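
Because `.describe()` elides some columns in its display, an explicit check can be more reassuring. The following is a minimal sketch using pandas' `isna()`; it reloads the dataset so it runs on its own, and the names `missing_per_column` and `total_missing` are illustrative.

```python
from sklearn import datasets

features = datasets.load_breast_cancer(as_frame=True).data

# Number of missing (NaN) entries in each column
missing_per_column = features.isna().sum()

# Total number of missing entries in the whole DataFrame
total_missing = int(missing_per_column.sum())
print("Total missing values:", total_missing)

# List only the columns that contain missing values
print(missing_per_column[missing_per_column > 0])
```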

Feature scaling is one of the most critical pre-processing steps in machine learning, with the most common techniques being standardization and normalization.

Machine learning algorithms that calculate distance or assume normality are sensitive to relative scales of features, meaning that if the data is not scaled, features with a higher value range start dominating the model's decision-making process. Feature scaling is therefore needed to bring features with different ranges into comparable ranges.

Feature scaling also allows for much faster model convergence.

In this scenario, we are going to use scikit-learn's `StandardScaler` to standardize our data. Another commonly used scaler is `MinMaxScaler`, which normalizes data.

**Standardization (z-score)**
- Shift the data so its average is 0 and scale it so a "typical" spread is 1. Useful when features have very different units (e.g., cm vs kg) so they contribute comparably to algorithms that use distances or assume roughly normal inputs.
- Formula: $z = \frac{x - \mu}{\sigma}$
    - $\mu$ = mean of the feature, $\sigma$ = standard deviation of the feature

**Normalization (min-max)**
- Squeeze the data into a fixed range (commonly 0 to 1). Keeps the shape of the distribution but rescales extremes to the chosen bounds; useful for neural nets or when you need values in a known interval.

1. **Formula (to [0, 1]):**
This formula scales the values of $x$ to a range between 0 and 1.

$$
x' = \frac{x - \min(x)}{\max(x) - \min(x)}
$$

Where:
- $x$ is the original value,
- $\min(x)$ is the minimum value in the dataset,
- $\max(x)$ is the maximum value in the dataset,
- $x'$ is the normalized value between 0 and 1.

2. **Variant (to [-1, 1]):**
This variant scales the values of $x$ to a range between -1 and 1.

$$
x' = 2 \times \frac{x - \min(x)}{\max(x) - \min(x)} - 1
$$

Where:
- $x'$ is the normalized value between -1 and 1.

Notes
- Standardization centers and scales; normalization rescales to a bounded interval.
- Min-max is sensitive to outliers; standardization is less affected but still influenced by extreme values. Use robust scalers (median/IQR) if outliers are a concern.
- In scikit-learn: `StandardScaler` and `MinMaxScaler` implement these transforms.
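
To connect the formulas to the library calls, the sketch below applies each formula by hand with NumPy and compares the result with scikit-learn. The five input values are a hypothetical toy feature, and the comparison assumes `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature (one column, five samples)
x = np.array([[143.5], [420.3], [551.1], [782.7], [2501.0]])

# Standardization: z = (x - mean) / std, using the population std (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
z_sklearn = StandardScaler().fit_transform(x)

# Min-max normalization to [0, 1]
mm_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
mm_sklearn = MinMaxScaler().fit_transform(x)

# Variant to [-1, 1]
mm11_manual = 2 * mm_manual - 1
mm11_sklearn = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x)

print(np.allclose(z_manual, z_sklearn))
print(np.allclose(mm_manual, mm_sklearn))
print(np.allclose(mm11_manual, mm11_sklearn))
```

Note how the single large value (2501.0) pushes the other min-max values towards 0, which illustrates the outlier sensitivity mentioned in the notes.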

In [21]:
# scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [22]:
# Next, let's implement our standardization.

standardized_data = pd.DataFrame(scaler.fit_transform(area_and_smoothness),
                   columns=['mean area','mean smoothness'])
standardized_data.head(5) # first 5 rows of the standardized data

| | mean area | mean smoothness |
|---|---|---|
| 0 | 0.984375 | 1.568466 |
| 1 | 1.908708 | -0.826962 |
| 2 | 1.558884 | 0.94221 |
| 3 | -0.764464 | 3.283553 |
| 4 | 1.826229 | 0.280372 |


For the remaining checks, you need to write the code yourself. Use ChatGPT, Grok, Gemini, or ask your grandmother; anything goes. But think about how, for the same dataset, you would check for the `outliers` and `non-numeric structured data` listed above.

Other techniques similar to data cleaning (which get the data into an acceptable format for ML) are:
- converting numerical values into categories (called `feature binning`);
- converting categorical values into numerical ones (called `feature encoding`);
- scaling feature measurements to a similar range (called `feature scaling`).

This will be taught later, or material will be provided. 
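
Until that material arrives, here is a minimal sketch of what binning and encoding could look like in pandas. The toy DataFrame, the bin edges, and the column names are all hypothetical and are not part of the breast cancer dataset.

```python
import pandas as pd

# Hypothetical toy data, not part of the breast cancer dataset
df = pd.DataFrame({
    "age": [23, 37, 45, 61, 78],
    "smoker": ["yes", "no", "no", "yes", "no"],
})

# Feature binning: numerical ages -> ordered categories
# Bins are right-inclusive: (0, 30], (30, 60], (60, 120]
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 30, 60, 120],
    labels=["young", "middle", "senior"],
)

# Feature encoding: categorical strings -> numbers
df["smoker_encoded"] = df["smoker"].map({"no": 0, "yes": 1})

# One-hot encoding for a category with more than two values
age_dummies = pd.get_dummies(df["age_group"], prefix="age")

print(df)
print(age_dummies)
```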


#### **2.2. Feature Design**

**Feature Transformation**
- What: Convert raw values into machine-friendly numeric form (scaling, encoding, log transform).
- When: You have categorical text, widely different numeric scales, or heavy skew.
- Example: "yes"/"no" → 1/0; apply StandardScaler to features before training an SVM.
- Tip: Encode before scaling; fit transformers on train set only.

**Feature Engineering**
- What: Create new features that capture useful relationships from raw features.
- When: Domain knowledge or intuition suggests combinations/ratios/patterns might matter.
- Example: height and width → height/width ratio for fruit shape; day/time → is_weekend flag.
- Tip: Simple engineered features often beat complex models; iterate and validate.

**Feature Selection**
- What: Remove irrelevant or harmful features to improve generalization and speed.
- When: Many features, noisy or redundant inputs, or overfitting risk.
- Example: drop near-constant columns, use mutual information, L1 regularization, or tree-based feature importances.
- Tip: Combine automated selection with domain checks; always evaluate on hold-out set.


Transform inputs so models can use them, engineer features to expose signal, and select features to reduce noise and overfitting.
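
The three ideas can be sketched on a tiny example. The fruit table below is entirely hypothetical (column names, values, and dates are made up for illustration), and dropping single-valued columns is only the simplest possible form of feature selection.

```python
import pandas as pd

# Hypothetical fruit measurements, purely for illustration
fruits = pd.DataFrame({
    "height_cm": [7.0, 7.5, 18.0, 19.0],
    "width_cm": [7.2, 7.4, 3.5, 3.8],
    "organic": ["yes", "no", "yes", "no"],
    "country": ["NP", "NP", "NP", "NP"],   # constant column, carries no signal
    "sold_on": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"]
    ),
})

# Feature transformation: text flag -> 0/1
fruits["organic"] = fruits["organic"].map({"no": 0, "yes": 1})

# Feature engineering: shape ratio and weekend flag (Saturday=5, Sunday=6)
fruits["height_width_ratio"] = fruits["height_cm"] / fruits["width_cm"]
fruits["is_weekend"] = (fruits["sold_on"].dt.dayofweek >= 5).astype(int)

# Feature selection: drop columns that take a single value
constant_columns = [c for c in fruits.columns if fruits[c].nunique() <= 1]
fruits = fruits.drop(columns=constant_columns)

print(constant_columns)
print(fruits)
```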

Whenever we are building a machine learning model that uses a supervised learning algorithm, it is important we have some way of evaluating the performance of that model. 

However, learning the parameters of the prediction function and testing it on the same data would mean a model would just be repeating the labels of the samples it has just seen, deriving a perfect score, but failing to predict anything useful on as-of-yet unseen data. 

This situation is known as overfitting, and to avoid it the most common practice is to hold out part of the available data as a test set.

The simplest way to do this is to apply a technique called the train-test split to our data.

In [35]:
from sklearn.model_selection import train_test_split

In this case, we want to have a train_size of 80% and a test size of 20%.

We also want to shuffle our data before splitting because, without this, we risk creating subsets that are not representative of the overall dataset.

Finally, we use the random_state parameter to control the shuffling applied to the data before the split to ensure a reproducible output across multiple calls of the function.

In [36]:
x_train, x_test, y_train, y_test = train_test_split(standardized_data, labels, test_size = 0.20, shuffle=True, random_state = 42)
print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

x_train shape: (455, 2)
x_test shape: (114, 2)
y_train shape: (455,)
y_test shape: (114,)
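
The split above is applied to data that was standardized using all 569 rows. Following the earlier tip to fit transformers on the train set only, one possible variation is to split the raw features first and let a scikit-learn `Pipeline` fit the scaler on the training portion. This is a sketch of that alternative, not the approach used in the rest of this notebook; the names `X`, `y`, and `pipeline` are illustrative.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = datasets.load_breast_cancer(as_frame=True)
X = data.data[["mean area", "mean smoothness"]]
y = data.target

# Split the raw (unscaled) features first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, shuffle=True, random_state=42
)

# The pipeline fits the scaler on the training data only,
# then reuses the training mean and std to transform the test data
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipeline.fit(X_train, y_train)

print("Test accuracy:", pipeline.score(X_test, y_test))
```

With this arrangement the test rows never influence the scaling statistics, so the test score is a cleaner estimate of performance on unseen data.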


### **3. Model Building**

Once we have pre-processed the data into an acceptable format, we then build an ML model. This is where most of the "real ML" takes place.

#### **3.1. Algorithm Selection**

These algorithms include:
- Linear and Polynomial models: simple, interpretable models; polynomial features capture non-linear relationships.
- Logit models: logistic regression for probabilistic binary/multiclass classification.
- Maximum margin models: SVMs that maximize class separation; effective in high dimensions.
- Tree-based models: decision trees that handle non-linearity and interactions with easy interpretation.
- Ensemble models: bagging/boosting (Random Forest, Gradient Boosting) for improved accuracy and robustness.
- Bayesian models: probabilistic approaches (e.g., Naive Bayes) that incorporate priors and quantify uncertainty.

For most tabular problems, tree-based models have shown better performance than most other ML models and even many DL models. However, for the sake of the course, we will *start with the simplest option.*

**The logistic regression classifier.**

<!-- A logistic regression classifier is a supervised machine learning model that predicts the probability of a categorical outcome (like yes/no, spam/not spam) by fitting an S-shaped curve (sigmoid function) to the data, outputting values between 0 and 1, which are then thresholded to assign a class label. 

![Logistic Regression Classifier](https://zd-brightspot.s3.us-east-1.amazonaws.com/wp-content/uploads/2022/04/11040521/46-4-e1715636469361.png) -->

<!-- Sigmoid function formula: 

![sigmoid](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/sigmoid-activation-function-1_0.png) -->

#### **3.2. Loss function Selection** 

After selecting a specific algorithm, we need to decide on its loss function: the method which
the algorithm would use to learn from the data. (Yes, there are different ways of learning for an
algorithm) 

For example, a linear regression typically uses the famous least squares error loss function. 

The selection of the learning algorithm and the selection of the loss function are, of course, coupled: for example, you cannot use a software package that implements a linear regression algorithm with a misclassification loss function.

In [38]:
## loss function selection

# here we do not need to select a loss function manually because sklearn's LogisticRegression class uses the log-loss function by default for binary classification tasks.

Therefore, the default loss function for logistic regression is the `log-loss` function.
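
For reference, the log-loss of $n$ samples with true labels $y_i \in \{0, 1\}$ and predicted probabilities $p_i$ of class 1 is $-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$. The sketch below evaluates this by hand on hypothetical labels and probabilities and compares it with `sklearn.metrics.log_loss`. Note that `LogisticRegression` also adds an L2 penalty on the weights by default, which this sketch leaves out.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities of class 1
y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.6, 0.8, 0.1])

# Log-loss: average negative log-likelihood of the true labels
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print("manual :", manual)
print("sklearn:", log_loss(y_true, p_pred))
```

Confident and correct predictions (such as 0.9 for a true 1) contribute little to the loss, while confident mistakes would be penalized heavily.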

#### **3.3. Model Learning**

Once we have selected the learning algorithm and its loss function, we need to train the model. 

It is simply a mathematical optimization problem. 

At a high level, learning amounts to finding the set of parameters that minimizes
the loss function on the training data.
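
To make the optimization view concrete, the sketch below minimizes the log-loss of a one-feature logistic model with plain gradient descent. The data is synthetic, the learning rate and step count are assumed values, and the helper names are hypothetical. This is only an illustration of the idea: scikit-learn's `LogisticRegression` uses a different solver (`lbfgs` by default) together with regularization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_loss(w, b, X, y):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical, synthetic 1-D data: larger x tends to mean class 1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)

w, b = np.zeros(1), 0.0
learning_rate = 0.1   # assumed step size
losses = []

for step in range(500):
    p = sigmoid(X @ w + b)
    # Gradients of the mean log-loss with respect to w and b
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    # Move the parameters a small step against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    losses.append(mean_log_loss(w, b, X, y))

print("first loss:", losses[0])
print("last loss :", losses[-1])
print("w, b      :", w, b)
```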

In this case, we then train the model with the following line of code.



In [40]:
model = LogisticRegression(random_state=42).fit(x_train, y_train)
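
To see what the fitted model has actually learned, one can read its weights and intercept and reproduce its predictions by hand with the sigmoid function. The sketch below is self-contained and, for brevity, fits on the full standardized dataset rather than on `x_train`, so its numbers will differ slightly from the model above; the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = datasets.load_breast_cancer(as_frame=True)
X = StandardScaler().fit_transform(data.data[["mean area", "mean smoothness"]])
y = data.target

model = LogisticRegression(random_state=42).fit(X, y)

# Learned parameters: one weight per feature plus an intercept
weights, intercept = model.coef_[0], model.intercept_[0]
print("weights:", weights, "intercept:", intercept)

# Linear score, squashed by the sigmoid into a probability of class 1
z = X @ weights + intercept
prob_manual = 1.0 / (1.0 + np.exp(-z))
prob_sklearn = model.predict_proba(X)[:, 1]
print(np.allclose(prob_manual, prob_sklearn))

# Thresholding the probability at 0.5 reproduces model.predict
pred_manual = (prob_manual > 0.5).astype(int)
print(np.array_equal(pred_manual, model.predict(X)))
```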

#### **3.4. Model Evaluation**

Evaluating performance: To assess how well a model performs, we need to test it on unseen data (test data), not just the training data, because an overfitted model can score well on data it has already seen.

Training vs. test data: Ideally, we would have separate training and test datasets, but in practice, we often split a single large dataset into two parts for training and testing.

Advanced methods: In some cases, more advanced techniques like cross-validation are used, where the data is split into multiple parts and the model is tested on different combinations to get a better understanding of its performance.

In [41]:
scores = model.score(x_test, y_test) 
scores

0.9298245614035088

Now, accuracy (which is what `model.score` returns for a classifier) is not the only metric that can be used to evaluate model performance.

To compute the other metrics, we must first use our model to predict the labels of the test feature data.



In [None]:
pred_labels = model.predict(x_test)

We now want to use these predicted labels to derive a confusion matrix. A confusion matrix displays the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

`TPs` are test results that correctly indicate the presence of a condition or characteristic.

`TNs` are test results that correctly indicate the absence of a condition or characteristic.

`FPs` are test results that incorrectly indicate the presence of a condition or characteristic.

`FNs` are test results that incorrectly indicate the absence of a condition or characteristic.

Now let's import the confusion matrix helper function.

In [43]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, pred_labels)
print(cm)

[[38  5]
 [ 3 68]]


From this confusion matrix, we can then derive further metrics, including precision, recall, and the F1-score.

`Precision` is the ratio of correctly classified positive instances to the total predicted positive classifications.

`Recall` or sensitivity is the ratio of correctly classified positive instances to the total positive instances.

Precision helps us understand how useful results are; however, recall helps us understand how complete the results are.

The F1-score balances the two previous scores, being the harmonic mean of precision and recall.

Note: accuracy is the ratio of correctly classified instances to the total instances.
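
These definitions can be checked by hand against the confusion matrix printed above. The sketch below assumes scikit-learn's conventions: rows are true labels, columns are predicted labels, and label 1 is treated as the positive class, which in this dataset's encoding is the benign class.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions
cm = np.array([[38, 5],
               [3, 68]])

# With labels 0 and 1, ravel() yields TN, FP, FN, TP (class 1 is "positive")
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy :", accuracy)
print("precision:", precision)
print("recall   :", recall)
print("f1       :", f1)
```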

In [44]:
from numpy import mean
from numpy import std

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

For precision, recall, and the F1-score, you should report the average and standard deviation of these scores across classes. This is because precision, recall, and the F1-score are computed for each class in the dataset.

However, particularly when working with multiclass data, it is important to understand the performance of the model for each class. For this, we print a classification report, which gives the precision, recall, and F1-score for each class, along with its support (the number of true instances of that class). The report also gives macro averages (unweighted means over classes) and weighted averages (means weighted by support).

In [45]:
from sklearn.metrics import classification_report

f1 = f1_score(y_test, pred_labels)
recall = recall_score(y_test, pred_labels)
precision = precision_score(y_test, pred_labels)

f1_avg = mean(f1_score(y_test, pred_labels, average=None))
recall_avg = mean(recall_score(y_test, pred_labels, average=None))
precision_avg = mean(precision_score(y_test, pred_labels, average=None))

f1_sd = std(f1_score(y_test, pred_labels, average=None))
recall_sd = std(recall_score(y_test, pred_labels, average=None))
precision_sd = std(precision_score(y_test, pred_labels, average=None))

print('\nf1:\t\t',f1)
print('recall\t\t',recall)
print('precision\t',precision)

print('\nf1_avg:\t\t',f1_avg)
print('recall_avg\t',recall_avg)
print('precision_avg\t',precision_avg)

print('\nf1_sd:\t\t',f1_sd)
print('recall_sd\t',recall_sd)
print('precision_sd\t',precision_sd)

print('\n',classification_report(y_test, pred_labels))


f1:		 0.9444444444444444
recall		 0.9577464788732394
precision	 0.9315068493150684

f1_avg:		 0.9246031746031746
recall_avg	 0.9207337045528987
precision_avg	 0.9291680588038758

f1_sd:		 0.019841269841269826
recall_sd	 0.03701277432034061
precision_sd	 0.0023387905111927343

               precision    recall  f1-score   support

           0       0.93      0.88      0.90        43
           1       0.93      0.96      0.94        71

    accuracy                           0.93       114
   macro avg       0.93      0.92      0.92       114
weighted avg       0.93      0.93      0.93       114
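
The `macro avg` and `weighted avg` rows can be reproduced from the per-class scores. The sketch below rebuilds label vectors that are consistent with the test-set confusion matrix `[[38, 5], [3, 68]]` (the ordering of the samples is made up, only the counts matter) and compares scikit-learn's averaging options with manual averages.

```python
import numpy as np
from sklearn.metrics import precision_score

# Rebuild label vectors that reproduce the confusion matrix [[38, 5], [3, 68]]
y_true = np.array([0] * 43 + [1] * 71)
y_pred = np.array([0] * 38 + [1] * 5 + [0] * 3 + [1] * 68)

# One precision value per class, and the number of true instances per class
per_class = precision_score(y_true, y_pred, average=None)
support = np.bincount(y_true)

macro = precision_score(y_true, y_pred, average="macro")
weighted = precision_score(y_true, y_pred, average="weighted")

print("per class:", per_class)
print("macro    :", macro, "=", per_class.mean())
print("weighted :", weighted, "=", np.average(per_class, weights=support))
```

The macro average treats both classes equally, while the weighted average gives more influence to the larger class.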



#### **3.5. Hyperparameter Tuning**

Hyperparameter: A hyperparameter is a setting or configuration used to control the learning process of a machine learning model. Unlike model parameters (like weights in a neural network, which are learned from the data), hyperparameters are set before training the model and determine how the model behaves during training. Examples:
- learning rate
- regularization strength
- batch size etc. 

Overfitting and underfitting: A common challenge in machine learning is finding the right balance between `overfitting` (where the model is too complex and fits the training data too well) and `underfitting` (where the model is too simple and doesn't capture enough of the data's patterns). This balance is achieved through hyperparameter tuning.

Hyperparameter tuning: The best hyperparameters (settings that control the model's behavior) aren't usually known upfront. We often have to try many different combinations to find the ones that work best. For example, in the fruit classification task, we tested different values of $k$ in a KNN model. Too many neighbors can lead to underfitting, while too few can lead to overfitting.

Algorithm-specific tuning: Each machine learning algorithm has its own set of hyperparameters, and the process of adjusting these parameters to improve the model's performance is called hyperparameter tuning.
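
As a preview of what such tuning might look like for logistic regression, the sketch below searches over `C`, the inverse regularization strength, with scikit-learn's `GridSearchCV`. The grid of candidate values and the choice of 5 folds are assumptions made for illustration, and the snippet reloads the data so it runs on its own.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = datasets.load_breast_cancer(as_frame=True)
X = data.data[["mean area", "mean smoothness"]]
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, shuffle=True, random_state=42
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))

# Candidate values for C, the inverse regularization strength (assumed grid)
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0, 100.0]}

# Each candidate is scored with 5-fold cross-validation on the training set
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best C:", search.best_params_["logisticregression__C"])
print("best CV accuracy:", search.best_score_)
print("test accuracy:", search.score(X_test, y_test))
```

The test set is only touched once, after the best value has been chosen on the training folds.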


In [54]:
# hyperparameter will be tuned in the upcoming notebook

#### **3.6. Model Validation**


Now, even using a train-test split, there is still a risk of `overfitting` on the test set when the hyperparameters of a model are tuned until the estimator performs optimally on it. In this scenario, knowledge about the test set can leak into the model, and the evaluation metrics no longer report on generalization performance.

One way of solving this is to hold out yet another part of the available dataset as a so-called validation set, on which an initial evaluation is done. However, the problem with partitioning the available data into three sets is that we drastically reduce the number of samples that can be used for learning the model (a particular issue for small datasets and imbalanced datasets). It also means that the results can depend on a particular random choice for the pair of train and validation sets.

A solution to this, and another way of evaluating model performance, is to use `cross-validation`, which removes the need for a separate validation set.

In the basic approach, known as k-fold cross-validation, the data is split into k folds. The model is trained using (k - 1) of the folds as training data and validated on the remaining fold. This is repeated so that each fold serves as the validation set once. The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.

![Cross Validation](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)
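
The k-fold loop described above can be written out explicitly. The sketch below uses a plain, shuffled `KFold` with 5 folds (an assumed choice; for classifiers, scikit-learn's integer `cv` argument uses stratified folds instead) and checks that the manual loop agrees with `cross_val_score` when both use the same splitter.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = datasets.load_breast_cancer(as_frame=True)
X = data.data[["mean area", "mean smoothness"]].to_numpy()
y = data.target.to_numpy()

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for train_idx, val_idx in kfold.split(X):
    # Train on k - 1 folds, validate on the held-out fold
    model = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

print("per-fold accuracy:", np.round(fold_scores, 3))
print("mean accuracy    :", np.mean(fold_scores))

# cross_val_score runs the same loop internally
auto_scores = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression(random_state=42)),
    X, y, cv=kfold,
)
print(np.allclose(fold_scores, auto_scores))
```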

In [47]:
from sklearn.model_selection import cross_val_predict

model = LogisticRegression(random_state=42) # reset the model

cv_pred_labels = cross_val_predict(model, standardized_data, labels, cv=10) # Here we use a 10-fold cross-validation.


In [None]:
accuracy = accuracy_score(labels, cv_pred_labels)

cm = confusion_matrix(labels, cv_pred_labels)

f1 = f1_score(labels, cv_pred_labels)
recall = recall_score(labels, cv_pred_labels)
precision = precision_score(labels, cv_pred_labels)

f1_avg = mean(f1_score(labels, cv_pred_labels, average=None))
recall_avg = mean(recall_score(labels, cv_pred_labels, average=None))
precision_avg = mean(precision_score(labels, cv_pred_labels, average=None))

f1_sd = std(f1_score(labels, cv_pred_labels, average=None))
recall_sd = std(recall_score(labels, cv_pred_labels, average=None))
precision_sd = std(precision_score(labels, cv_pred_labels, average=None))

print('accuracy:\t', accuracy)

print(cm)

print('\nf1:\t\t',f1)
print('recall\t\t',recall)
print('precision\t',precision)

print('\nf1_avg:\t\t',f1_avg)
print('recall_avg\t',recall_avg)
print('precision_avg\t',precision_avg)

print('\nf1_sd:\t\t',f1_sd)
print('recall_sd\t',recall_sd)
print('precision_sd\t',precision_sd)

print('\n',classification_report(labels, cv_pred_labels))

# print(roc_auc_score(labels, cv_pred_labels))

accuracy:	 0.8980667838312829
[[173  39]
 [ 19 338]]

f1:		 0.9209809264305178
recall		 0.9467787114845938
precision	 0.896551724137931

f1_avg:		 0.888708284997437
recall_avg	 0.8814082236668253
precision_avg	 0.8987966954022988

f1_sd:		 0.03227264143308067
recall_sd	 0.06537048781776861
precision_sd	 0.00224497126436779

               precision    recall  f1-score   support

           0       0.90      0.82      0.86       212
           1       0.90      0.95      0.92       357

    accuracy                           0.90       569
   macro avg       0.90      0.88      0.89       569
weighted avg       0.90      0.90      0.90       569

0.8814082236668251


**Iterative Development of ML pipeline**


Building a machine learning model is an iterative process: You often test different algorithms, features, and settings multiple times to find the best combination.

It's common to switch algorithms (e.g., from a decision tree to a neural network) and experiment with different configurations to see what works best for the data.

While the final prediction follows a sequential process, the model-building and tuning phase involves going back and forth, adjusting, and re-testing until the model performs optimally.