# Intro to Machine Learning & Transparency

<!-- **Question**: What is machine learning and what is AI transparency? -->

```{admonition} Objectives
- Understand real-world datasets used in this course
- Understand components of typical machine learning systems
- Understand steps in typical machine learning processes
- Understand AI transparency definition & taxonomy
```

**Expected time to complete**: 4 hours

In this chapter, we will learn about the components of typical machine learning systems and the steps in typical machine learning processes. We will also learn about the definition and taxonomy of AI transparency. We will start with the real-world datasets used in this course to see how machine learning can be applied in the real world.

```{note}
If you do not have prior knowledge/experience with linear algebra, Python programming, and probability and statistics, please go through {doc}`00-prerequisites` before starting this course.
```

## What is machine learning?

### Human learning

Learning is familiar to humans (and animals) as a continuous process that starts at birth and continues throughout life. A human child (and adult) learns new things, acquires new skills, and improves existing skills to survive and thrive in the surrounding world. A child's brain and senses perceive facts about their surroundings and gradually learn the hidden patterns of life, which help the child craft logical rules to identify learned patterns and predict future events.

Learning is the process of acquiring new knowledge, skills, and behaviors. More precisely, it can be defined as a change in behavior or knowledge that results from experience. Learning can be divided into two categories: **acquisition** and **performance**. Acquisition is the process of gaining new knowledge, skills, and behaviors. Performance is the process of using the acquired knowledge, skills, and behaviors to achieve a goal. We learn from our own experiences and from others. For example, we learn to walk by watching others walk, to speak by listening to others speak, to drive by watching others drive, and to cook by watching others cook.

Since you are here, you may already know that machines can also learn, both from their own experiences and from others. That is the subject of this course.

### A definition of machine learning

> Machine learning learns a model from data.

Machine learning takes **data** as **input** to learn a **model**, a mathematical representation of the data. This learning process is called the **training** phase and the data used in training is called the **training data**. After training, the learned model can take new data, the **test data**, as input to generate **output** for making predictions on the test data or for exploring/explaining the test data, which is the **test** phase.
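
The training and test phases described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the data are made up, the model is a straight line $y = wx + b$, and the helper names `train` and `predict` are ours rather than part of any library.

```python
def train(xs, ys):
    """Training phase: learn a line y = w * x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b  # the learned model


def predict(model, xs):
    """Test phase: apply the learned model to new inputs."""
    w, b = model
    return [w * x + b for x in xs]


# Made-up training data that follow y = 2x + 1 exactly.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]
model = train(train_x, train_y)

# New (test) data that were not seen during training.
test_x = [4.0, 5.0]
predictions = predict(model, test_x)
print("model (w, b):", model)
print("predictions:", predictions)
```

The learned pair `(w, b)` is the model: a compact mathematical representation of the training data that can then be applied to inputs it has never seen.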

The above definition is the one we will use in this course for clarity, loosely inspired by how the human brain learns certain things based on the data it perceives from the outside world. From this definition, we can see machine learning as a set of **software** tools for **modelling** and **understanding** complex **datasets**. You may find various definitions of machine learning elsewhere. For example, machine learning, from a systems perspective, is defined as the creation of automated systems that can learn hidden patterns from data to aid in making intelligent decisions. 


### AI, machine learning, and deep learning

Besides machine learning (ML), you may have also heard about artificial intelligence (AI), deep learning, and data science. Some use these terms interchangeably, but they are not the same, as shown in the figure below.

```{figure} https://github.com/microsoft/ML-For-Beginners/raw/main/1-Introduction/1-intro-to-ML/images/ai-ml-ds.png
---
height: 250px
name: fig-ai-ml-ds
---
The relationships between AI, ML, deep learning, and data science, by [Jen Looper](https://twitter.com/jenlooper), adapted from [Stack Exchange](https://softwareengineering.stackexchange.com/questions/366996/distinction-between-ai-ml-neural-networks-deep-learning-and-data-mining).
```

**AI** is a broad field of study that aims to create intelligent machines that can perform tasks that normally require human intelligence. There are multiple ways to achieve AI, and machine learning is one of them. As defined above, machine learning is a subfield of AI in which machines acquire their experience from data, in the form of a mathematical model. There are multiple approaches to machine learning, and deep learning is one of them. **Deep learning** is a subfield of machine learning that uses deep (i.e. many-layered) neural networks to learn from data. **Data science** is a broad field of study that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data, not necessarily in the form of a mathematical model as machine learning does.


## Data and problems

This section introduces the real-world datasets used in this course. We will also learn about the machine learning problems that can be solved using these datasets.

### Real-world datasets in this course

In this course, we will use real-world datasets to introduce machine learning from the perspective of AI transparency. We will use the following datasets from the textbook. You can click on the name of the dataset to see the actual data.

```{list-table} Datasets used in this course, from the textbook
:header-rows: 1
:widths: "auto"
:name: datasets-table
* - Name
  - Data provided
  - Machine learning problem
* - [Auto](https://github.com/pykale/transparentML/blob/main/data/Auto.csv)
  - Gas mileage, horsepower, and other information for cars.
  - Predict gas mileage for a car.
* - [Bikeshare](https://github.com/pykale/transparentML/blob/main/data/Bikeshare.csv)
  - Hourly usage of a bike sharing program in Washington, DC.
  - Predict the number of bikes rented per hour.
* - [Boston](https://github.com/pykale/transparentML/blob/main/data/Boston.csv)
  - Housing values and other information about Boston census tracts.
  - Predict the median value of a house. 
* - [BrainCancer](https://github.com/pykale/transparentML/blob/main/data/BrainCancer.csv)
  - Survival times for patients diagnosed with brain cancer.
  - Predict the survival time for a patient.
* - [Caravan](https://github.com/pykale/transparentML/blob/main/data/Caravan.csv)
  - Information about individuals offered caravan insurance.
  - Predict whether an individual will buy caravan insurance.
* - [Carseats](https://github.com/pykale/transparentML/blob/main/data/Carseats.csv)
  - Information about car seat sales in 400 stores.
  - Predict the sales of a car seat.
* - [College](https://github.com/pykale/transparentML/blob/main/data/College.csv)
  - Demographic characteristics, tuition, and more for USA colleges.
  - Predict the number of applications received by a college.
* - [Credit](https://github.com/pykale/transparentML/blob/main/data/Credit.csv)
  - Information about credit card debt for 10,000 customers.
  - Predict the amount of credit card debt for a customer.
* - [Default](https://github.com/pykale/transparentML/blob/main/data/Default.csv) 
  - Customer default records for a credit card company.
  - Predict whether a customer will default on a credit card payment.
* - [Fund](https://github.com/pykale/transparentML/blob/main/data/Fund.csv)  
  - Returns of 2,000 hedge fund managers over 50 months.
  - Predict the returns of a hedge fund manager.
* - [Hitters](https://github.com/pykale/transparentML/blob/main/data/Hitters.csv)  
  - Records and salaries for baseball players.
  - Predict the salary of a baseball player.
* - [Khan](https://github.com/pykale/transparentML/blob/main/data/Khan.json)  
  - Gene expression measurements for four cancer types.
  - Predict the cancer type for a patient.
* - [NCI60](https://en.wikipedia.org/wiki/NCI-60)  
  - Gene expression measurements for 64 cancer cell lines.
  - Find clusters or groups among the cell lines for personalised treatment.
* - [OJ](https://github.com/pykale/transparentML/blob/main/data/OJ.csv)  
  - Sales information for Citrus Hill and Minute Maid orange juice.
  - Predict the sales of orange juice.
* - [Portfolio](https://github.com/pykale/transparentML/blob/main/data/Portfolio.csv)  
  - Past values of financial assets, for use in portfolio allocation.
  - Predict the value of a financial asset.
* - [Publication](https://github.com/pykale/transparentML/blob/main/data/Publication.csv)  
  - Time to publication for 244 clinical trials.
  - Predict the time to publication for a clinical trial.
* - [Smarket](https://github.com/pykale/transparentML/blob/main/data/Smarket.csv)  
  - Daily percentage returns for S&P 500 over a 5-year period.
  - Predict whether the stock index will increase or decrease.
* - [USArrests](https://github.com/pykale/transparentML/blob/main/data/USArrests.csv)  
  - Crime statistics per 100,000 residents in 50 states of USA.
  - Predict the crime rate in a state.
* - [Wage](https://github.com/pykale/transparentML/blob/main/data/Wage.csv)  
  - Income survey data for men in the central Atlantic region of the USA.
  - Predict the income of men.
* - [Weekly](https://github.com/pykale/transparentML/blob/main/data/Weekly.csv)  
  - 1,089 weekly stock market returns for 21 years.
  - Predict the stock market return in a week.
``` 
<!-- * - [NYSE](https://github.com/pykale/transparentML/blob/main/data/.csv)  
  - Returns, volatility, and volume for the New York Stock Exchange.
  - Predict the returns of a stock. -->

The above datasets show the diverse range of problems that machine learning can solve, yet they are only the tip of the iceberg. Applications of machine learning are everywhere, from healthcare to finance, from manufacturing to agriculture, from transportation to education, and so on. The datasets used in this course are from the textbook, which is a good starting point for learning about machine learning. However, you can also find many other datasets online, such as [Kaggle](https://www.kaggle.com/datasets), [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), [OpenML](https://www.openml.org/), [Google Dataset Search](https://datasetsearch.research.google.com/), and so on.

Next, let us learn about the machine learning problems in their typical definitions.

### Machine learning problems

Machine learning problems can be broadly classified into two categories: **supervised learning** and **unsupervised learning**. In supervised learning, the data is labelled, and the goal is to predict or estimate the label for new data. In unsupervised learning, the data is not labelled, and the goal is to find patterns, such as relationships or structures, in the data. Machine learning models can generate two types of outputs: discrete and continuous. Discrete outputs are categorical, such as the class of an image. Continuous outputs are numerical, such as the price of a house. 

Thus, supervised learning can be further divided into classification and regression. In **classification**, the output is a discrete (i.e. categorical or _qualitative_) value, such as a category or a class. In **regression**, the output is a continuous (i.e. _quantitative_) value, such as a number or a probability. Unsupervised learning can be further divided into clustering and dimensionality reduction. In **clustering**, the goal is to find groups of similar data points, so the output is a discrete value, such as a cluster index. In **dimensionality reduction**, the goal is to find a lower-dimensional representation of the data, so the output consists of continuous values, such as a vector.

The following table shows the different types of machine learning problems in the context of the above definitions. 

```{list-table} Supervised and unsupervised machine learning
:header-rows: 1
:widths: "auto"
:name: mlproblems-table
* - Machine Learning
  - Supervised
  - Unsupervised
* - **Discrete output**
  - Classification
  - Clustering
* - **Continuous output**
  - Regression
  - Dimensionality reduction
```
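
The two questions behind the table above (is the data labelled, and is the output discrete?) can be written as a small lookup. The function name `problem_type` is our own and purely illustrative; the example tasks are taken from {numref}`datasets-table`.

```python
def problem_type(has_labels: bool, discrete_output: bool) -> str:
    """Map the two questions from the table to a problem type."""
    if has_labels:
        return "classification" if discrete_output else "regression"
    return "clustering" if discrete_output else "dimensionality reduction"


# (has_labels, discrete_output) for a few tasks from the datasets table.
examples = {
    "Default: will a customer default?": (True, True),
    "Auto: gas mileage of a car": (True, False),
    "NCI60: groups among cell lines": (False, True),
}

for task, (has_labels, discrete_output) in examples.items():
    print(f"{task} -> {problem_type(has_labels, discrete_output)}")
```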

### Exercises

1. Choose three or more datasets that interest you from {numref}`datasets-table`. Click on the name of each chosen dataset to explore it and get a sense of the data. For larger datasets, the preview may be poorly rendered or not available at all. Using the terminology in {numref}`mlproblems-table`, write down the possible machine learning problems that can be solved using each of your chosen datasets. Click below for a sample answer.

   ```{toggle}
   Sample answer: to be completed.
   ```
2. To be completed
  
    ```{toggle}
    Sample answer: to be completed.
    ```




## Machine learning systems

### Basic notations

We use notations slightly different from those in the textbook, hopefully to reduce the cognitive load:

- $N$: the number of **samples**, i.e. distinct data points or observations.
- $D$: the number of **features**, i.e. distinct variables or attributes that are available for learning a model, also known as the **dimensionality**.
- $C$: the number of **classes**, i.e. distinct categories or labels.
- $\mathbf{x}_n$: the $n$-th sample, where $n = 1, 2, \ldots, N$.
- $x_{nd}$: the $d$-th feature of the $n$-th sample, where $d = 1, 2, \ldots, D$.
- $y_n$: the label of the $n$-th sample, where $y_n \in \{1, 2, \ldots, C\}$.
- $\mathbf{X}$: the **feature matrix**, also known as the **data matrix**, where $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]$.
- $\mathbf{y}$: the label vector, where $\mathbf{y} = [y_1, y_2, \ldots, y_N]^\top$.
- $\mathbf{W}$: the **weight matrix**, where $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_C]$ has one weight vector $\mathbf{w}_c$ of length $D$ per class.
- $\mathbf{b}$: the **bias vector**, where $\mathbf{b} = [b_1, b_2, \ldots, b_C]^\top$.
- $\mathbf{z}_n$: the linear combination of the $n$-th sample, where $\mathbf{z}_n = \mathbf{W}^\top \mathbf{x}_n + \mathbf{b}$, so that its $c$-th entry is $\mathbf{w}_c^\top \mathbf{x}_n + b_c$.
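
To make the notation concrete, here is a toy sketch in plain Python with made-up numbers ($N = 3$, $D = 2$, $C = 2$). We assume that each class $c$ has its own weight vector $\mathbf{w}_c$ of length $D$, so the $c$-th entry of $\mathbf{z}_n$ is the dot product of $\mathbf{w}_c$ and $\mathbf{x}_n$ plus $b_c$. Note that Python indices start at 0, whereas the text counts from 1.

```python
# Toy data (made up): each inner list X[n] plays the role of one sample x_n.
X = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
y = [1, 2, 1]  # labels y_n, each in {1, ..., C}

N = len(X)       # number of samples
D = len(X[0])    # number of features (dimensionality)
C = len(set(y))  # number of classes

# One weight vector w_c (length D) and one bias b_c per class.
W = [[0.5, -1.0], [1.0, 2.0]]
b = [0.0, 1.0]


def linear_combination(x_n, W, b):
    """Return z_n, whose c-th entry is the dot product of w_c and x_n plus b_c."""
    return [
        sum(w_cd * x_nd for w_cd, x_nd in zip(w_c, x_n)) + b_c
        for w_c, b_c in zip(W, b)
    ]


Z = [linear_combination(x_n, W, b) for x_n in X]
print("N, D, C =", N, D, C)
print("Z =", Z)
```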
 
 <!-- We use $\mathbf{a}_n$ to represent the activation of the $n$-th sample, where $\mathbf{a}_n = \sigma(\mathbf{z}_n)$. We use $\mathbf{Z}$ to represent the linear combination matrix, where $\mathbf{Z} = \{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N\}$. We use $\mathbf{A}$ to represent the activation matrix, where $\mathbf{A} = \{\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_N\}$. We use $\mathbf{Z}^T$ to represent the transpose of the linear combination matrix, where $\mathbf{Z}^T = \{\mathbf{z}_1^T, \mathbf{z}_2^T, \ldots, \mathbf{z}_N^T\}$. We use $\mathbf{A}^T$ to represent the transpose of the activation matrix, where $\mathbf{A}^T = \{\mathbf{a}_1^T, \mathbf{a}_2^T, \ldots, \mathbf{a}_N^T\}$. -->

### Machine learning ingredients

Machine learning systems are composed of three main ingredients: **data**, **model**, and **loss function**. The data is the input to the machine learning system. The model is the core of the system: it maps the input to an output. The loss function measures how well the model's output matches what is desired, and thus guides the learning.

In more detail, a typical machine learning system is composed of the following ingredients:
- **Data/sample**: a data point or observation $\mathbf{x}_n$, with $n = 1, 2, \ldots, N$, where $N$ is the total number of samples.
- **Feature**: each sample vector $\mathbf{x}_n$ has $D$ features as its representation, i.e. $D$ variables or attributes that are available for learning a model.
- **Prediction**: each sample will have a prediction $\hat{y}_n$ as its output.
- **Label**: only in supervised learning, each sample will have a label $y_n$ as its ground truth. In classification, $y_n \in \{1, 2, \ldots, C\}$, where $C$ is the total number of classes.
- **Labeled dataset**: a set of $N$ tuples of the form $(\mathbf{x}_n, y_n)$, where $n = 1, 2, \ldots, N$.
- **Unlabeled dataset**: a set of $N$ samples $\mathbf{x}_n$, where $n = 1, 2, \ldots, N$.
- **Model**: a function $f(\mathbf{x})$ that maps a sample $\mathbf{x}$ to an output $\hat{y}$. This is the _focus_ of machine learning: the objective is to estimate a good model $f(\mathbf{x})$. In classification, $f(\mathbf{x})$ maps a sample $\mathbf{x}$ to a class label $\hat{y} \in \{1, 2, \ldots, C\}$. In regression, $f(\mathbf{x})$ maps a sample $\mathbf{x}$ to a real number $\hat{y}$. In clustering, $f(\mathbf{x})$ maps a sample $\mathbf{x}$ to a cluster label $\hat{y} \in \{1, 2, \ldots, C\}$. In dimensionality reduction, $f(\mathbf{x})$ maps a sample $\mathbf{x}$ to a low-dimensional vector $\hat{\mathbf{y}}$.
  - **Hyperparameters**: the high-level parameters of a model that typically need to be _specified_ before learning a model. They determine the model structure/architecture and how the model is trained. Examples include the number of layers and the number of neurons in each layer of a neural network, the activation function, the loss function, and the optimizer.
  - **Parameters**: the model parameters are the specific realisation of a model to be _learned_ during training, such as the weights and biases $\mathbf{W}$ and $\mathbf{b}$. In machine learning, it is common to denote all parameters as $\boldsymbol{\theta}$.
- **Loss function**: a function $L(y, \hat{y})$ (also known as **error function** or **objective function**) that measures the difference between the predicted output $\hat{y}$ and the true or desired output $y$ in supervised learning, or a function $L(\hat{y})$ that measures some desired property (or properties) of the output in unsupervised learning. In classification, $L(y, \hat{y})$ measures the difference between the predicted class label $\hat{y}$ and the true class label $y$. In regression, $L(y, \hat{y})$ measures the difference between the predicted real number $\hat{y}$ and the true real number $y$. In clustering, $L(\hat{y})$ typically measures the coherence and separation of clusters. In dimensionality reduction, $L(\hat{y})$ typically measures the preservation of information in the input $\{\mathbf{x}_n\}$.
  - **Evaluation metric/measure**: a metric (or measure) used to assess how well a learned model performs. It can be the same as the loss function or different from it. For example, in classification, the evaluation metric is typically the accuracy, which is a function of the predicted label $\hat{y}$ and the true label $y$.
- **Learning/optimization algorithm**: an algorithm that finds the best model $f(\mathbf{x})$ by minimizing the loss function $L(y, \hat{y})$ or $L(\hat{y})$. Optimization algorithms are typically iterative: they repeatedly update the model parameters to reduce the loss. Nowadays, they are usually available in libraries (software packages) and do not need to be implemented by the user, so they are often _black boxes_ to the user. The user only needs to specify the loss function and the optimization algorithm, which will then find the best model $f(\mathbf{x})$.
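
The ingredients above can be seen working together in a minimal sketch. It assumes a one-feature linear model, a mean squared error loss, and plain gradient descent as the learning algorithm. The data, function names, and hyperparameter values are made up for illustration and are not a recommended setup.

```python
# Data: made-up samples that follow y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def model(x, w, b):
    """Model f(x) with parameters theta = (w, b)."""
    return w * x + b


def mse_loss(ys, y_hats):
    """Loss function L(y, y_hat): mean squared error."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats)) / len(ys)


def mean_absolute_error(ys, y_hats):
    """Evaluation metric, reported after training."""
    return sum(abs(y - y_hat) for y, y_hat in zip(ys, y_hats)) / len(ys)


def fit(xs, ys, learning_rate=0.05, n_steps=2000):
    """Learning algorithm: gradient descent on the MSE loss.

    learning_rate and n_steps are hyperparameters, set before training.
    """
    w, b = 0.0, 0.0  # parameters, learned during training
    n = len(xs)
    for _ in range(n_steps):
        y_hats = [model(x, w, b) for x in xs]
        # Gradients of the MSE loss with respect to w and b.
        grad_w = sum(2 * (y_hat - y) * x for x, y, y_hat in zip(xs, ys, y_hats)) / n
        grad_b = sum(2 * (y_hat - y) for y, y_hat in zip(ys, y_hats)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = fit(xs, ys)
y_hats = [model(x, w, b) for x in xs]
print(f"w = {w:.3f}, b = {b:.3f}")
print(f"loss = {mse_loss(ys, y_hats):.6f}, MAE = {mean_absolute_error(ys, y_hats):.6f}")
```

In practice the `fit` step is usually provided by a library. Writing it out once shows that it is nothing more than repeated small updates to the parameters in the direction that reduces the loss.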

### Machine learning models

This course focuses on machine learning models (or methods) that are most widely used in practice, while NOT aiming to be exhaustive in covering all the models. The following table shows the machine learning models that we will cover in this course. 

```{list-table} Machine learning models/methods 
:header-rows: 1
:widths: "auto"
:name: mlmethods-table
* - Method
  - Description
  - Example
* - **Linear regression**
  - A linear model for regression.
  - Predicting the price of a house.
* - **Logistic regression**
  - A linear model for classification.
  - Predicting whether a customer will default on a credit card payment.
* - **Support vector machine**
  - A maximum-margin model for classification, which can be made non-linear via kernels.
  - Predicting whether a customer will default on a credit card payment.
* - **Decision tree**
  - A non-linear model for classification and regression.
  - Predicting whether a customer will default on a credit card payment.
* - **Random forest**
  - An ensemble of decision trees for classification and regression.
  - Predicting whether a customer will default on a credit card payment.
* - **Neural network**
  - A non-linear model for classification and regression.
  - Predicting whether a customer will default on a credit card payment.
* - **K-means**
  - A clustering algorithm.
  - Finding groups of similar customers.
* - **Principal component analysis**
  - A dimensionality reduction algorithm.
  - Finding a few directions that capture most of the variation in a dataset.
```    

No single model will perform well in all possible scenarios. Therefore, it is important to understand the assumptions and trade-offs of each model so that you can choose the right model for a given problem.

## Machine learning process

The machine learning process can be described in terms of lifecycle phases. The phases of an ML system’s lifecycle are a number of analytically distinct activities throughout the stages of system design, development, and deployment. There is no universally agreed breakdown of lifecycle phases for ML systems. However, the following illustrative typology is suitable for a range of contexts and intersects with prominent lifecycle frameworks. Some of the activities below only apply to certain types of ML systems.

### Lifecycle phases for design & development

- **Business case and problem definition**: Establishing the need for the ML system and the tasks it is meant to perform.
- **System requirements specification**: Translating the problem definition into technical design and performance requirements.
- **Data acquisition and preparation**: Where relevant, acquiring any data that may be needed to build the system, checking its suitability, and preparing it for use, e.g. via data pre-processing and/or data augmentation.
- **Building**: Creating a system that meets the design requirements previously specified. In the case of ML projects, this involves choosing between ML methods, developing and evaluating candidate models, and selecting the best performing model.
- **Validation and verification**: Verifying, on an on-going basis, that the system meets the relevant design and performance requirements. Depending on the nature of the system, assessment can rely on empirical testing or formal verification.

### Lifecycle phases for system deployment

- **Integration**: Preparing the ML system for operation by integrating it into the relevant real-world (e.g. business) environment. This can involve technical aspects of integration with other systems or technology infrastructure. It also includes the introduction of users to the operation of the system, the delivery of user training, and other relevant aspects of organisational change management.
- **Operation**: Using the ML system to perform the real-world (e.g. business) tasks for which it was intended.
- **Monitoring and evaluation**: Observing and recording system behaviour in order to assess system performance and compliance during operation, including any procedures of periodic re-validation.
- **Updating/system retirement**: Making changes to the ML system as needed, for example to improve performance or prevent performance deterioration. In the context of supervised ML models, such changes take the form of retraining the model based on new training data. Successful updating is followed by another iteration of lifecycle steps outlined above.
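
As a rough illustration of how the monitoring and updating phases can connect, the sketch below checks an error metric on batches of data collected during operation and flags when it exceeds a performance requirement. The threshold, the batches, and the function names are all hypothetical. Real monitoring would track many more aspects, including compliance.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def needs_update(y_true, y_pred, max_error):
    """Return the observed error and whether it breaches the requirement."""
    error = mean_absolute_error(y_true, y_pred)
    return error, error > max_error


MAX_ERROR = 1.0  # hypothetical performance requirement

# Made-up monitoring batches: (observed values, model predictions).
batches = {
    "week 1": ([10.0, 12.0, 11.0], [10.5, 11.5, 11.0]),
    "week 2": ([10.0, 12.0, 11.0], [13.0, 9.0, 14.0]),
}

results = {}
for name, (y_true, y_pred) in batches.items():
    error, flag = needs_update(y_true, y_pred, MAX_ERROR)
    results[name] = flag
    print(f"{name}: error = {error:.2f}, update needed: {flag}")
```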

### Execution of lifecycle phases

The lifecycle phases capture activities that are conceptually distinct, but do not necessarily occur in succession. During an ML system’s design and development, for example, agile processes can involve iterative cycles and adjustments across the different phases outlined above. When it comes to deployment, operation and monitoring/evaluation typically occur in parallel. Moreover, in the case of adaptive systems, updating can occur continually during operation. 

The volumes of data needed for ML systems and the complexity of technology supply chains mean that different activities across lifecycle phases are not always performed by actors within the same organisation. In contexts that involve third-party data providers, outsourcing different aspects of system design and development, or reliance on off-the-shelf tools, certain activities will be carried out by actors outside of the firm using the system.

Indeed, some of these activities might not be carried out by human actors, but by ML systems: recent innovations make it possible to automate large sections of an ML system’s development. In all of these cases, however, the structure of lifecycle phases remains unaffected, as the fundamental steps in designing, developing, and deploying an ML system stay the same.



## Machine learning transparency 

Transparency is a fundamental AI ethics principle key to _responsible_ AI innovation {cite}`ostmann2021ai`. It plays a crucial role in the development of ML systems, as well as in the evaluation of their performance and the _trust_ that people place in them. We follow the definition and framework of transparency in {cite}`ostmann2021ai` in this course.

### Why is transparency important?

For ML systems to be trustworthy and to be used responsibly, it is vital to ensure that they are transparent, i.e. stakeholders have _access to information_ relevant to them. Addressing concerns of AI and preventing potential harms of AI requires information being available to individuals involved in designing an ML system, developing it, deploying it, and using it, as well as to the general public, regulators, and other stakeholders for them to understand decisions made by the system, trust it, and hold it accountable. Different stakeholders are likely to have different information needs. 

Transparency and accountability are closely related and reinforce each other. Accountability mechanisms depend on the availability of information about an ML system, and accountability is a key motivation for transparency. Transparency also acts as an enabler of other ML/AI ethics principles, including fairness, sustainability, and safety.

### What is transparency?

ML transparency relates to disclosing information about ML systems, and it can be understood as relevant stakeholders having access to relevant information about a given ML system. Transparency involves gathering and sharing information about an ML system’s logic (i.e. _explainability_) and how it was designed, developed, and deployed. 

The three key questions to ask when considering transparency are: 

- **What** types of information are relevant?
- **Who** are the relevant stakeholders?
- **Why** are stakeholders interested in information about an ML system?

Transparency is a _property_ of the system, not of the information itself. It is a _relative_ concept rather than an absolute one, and it is _context-dependent_ rather than fixed. For example, a system that is transparent to a doctor (or a data scientist) is not necessarily transparent to a patient (or a customer). External stakeholders (e.g. regulators, the general public) may have different information needs from internal stakeholders (e.g. data scientists, developers, and engineers).

### Relevant information

There are two broad categories of information considered relevant for transparency:

- **System logic information**: Information that relates to the operational logic of a given ML system, i.e. information about the system’s ‘inner workings’. Examples include information about the input features that a system relies on or information about the relationship between the system’s inputs and outputs.
- **Process information**: Information that relates to the processes surrounding the ML system’s design, development, and deployment. Examples include information about data management practices, assessments of system performance, quality assurance (including of data) and governance arrangements, or the training of system users.

Respectively, the above categories of information define two forms of transparency:

- **System transparency**: Stakeholders having access to system logic information
- **Process transparency**: Stakeholders having access to process information

In this course, we will study machine learning systems from the perspectives of these two forms of transparency.

### Relevant stakeholders

The information that is relevant to a given stakeholder depends on the stakeholder's role and the context. For example, a data scientist may need to know the details of the data collection process, while a data subject may need to know how their data is used. We can split those who may have an interest in system or process transparency into two categories:

- **Internal stakeholders**: Those individuals who are involved in the design, development, and procurement of the ML system. Examples include data scientists, developers, and engineers. They also include individuals who make decisions about its deployment, operate the system, manage external communications, or perform corporate governance and oversight functions. Examples include members of development or procurement teams, risk and compliance teams, audit teams, senior management, company boards, operational teams using the ML system, and customer service teams.
- **External stakeholders**: Those individuals who are external to the organisation deploying the ML system but have a significant relationship with that organisation or may be affected by the ML system’s use. They are not involved in the design, development, procurement, and deployment of the ML system. Examples include regulators, customers, shareholders, academics, and the general public.

Based on these two categories of stakeholders, we can make a second distinction in mapping out different types of transparency:

- **Internal transparency**: Information being accessible to internal stakeholders
- **External transparency**: Information being accessible to external stakeholders

This second distinction intersects with the first one, between system and process transparency.
System logic information or process information can be accessible to internal stakeholders,
external stakeholders, or both. The resulting four-fold transparency typology is summarised
in {numref}`fig6-ai-transparency` below.

```{figure} images/fig6-ai-transparency.png
---
name: fig6-ai-transparency
---
AI transparency typology {cite}`ostmann2021ai`.
```
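
The four-fold typology can also be written down as a tiny data structure, which makes the two independent distinctions explicit. This is only a sketch: the class and function names are ours, and the quadrant labels it prints are shorthand rather than terms from {cite}`ostmann2021ai`.

```python
from enum import Enum


class InformationType(Enum):
    SYSTEM_LOGIC = "system"  # information about the system's inner workings
    PROCESS = "process"      # information about design, development, deployment


class StakeholderType(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"


def transparency_type(info: InformationType, stakeholder: StakeholderType) -> str:
    """Combine the two distinctions into one of four quadrants."""
    return f"{stakeholder.value} {info.value} transparency"


# Example pairings, using information types and stakeholders named in the text.
cases = [
    ("input features, for a data scientist",
     InformationType.SYSTEM_LOGIC, StakeholderType.INTERNAL),
    ("data management practices, for a regulator",
     InformationType.PROCESS, StakeholderType.EXTERNAL),
]

for description, info, stakeholder in cases:
    print(f"{description}: {transparency_type(info, stakeholder)}")
```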

### Reasons for accessing information

As mentioned above, transparency is a _relative_ concept. Not all types of information about an ML system will be equally important to all types of stakeholders. The reasons that underpin stakeholders’ interests in information about a given system (i.e. their ‘transparency interests’) are important in determining the types of information they may seek access to. When these reasons differ between stakeholders, the definition of what constitutes relevant information can change. For example, customers faced with an ML system used to make credit eligibility decisions may wish to understand the impact of, say, a 3% pay raise on their credit eligibility. The answer to this question can involve types of information that may not be relevant to the transparency interests, say, of regulators, which may be motivated by the goal of understanding different aspects of system performance and compliance.
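
The customer's question about a 3% pay raise is a 'what-if' query on the system. The sketch below shows the idea with an entirely hypothetical scoring rule and threshold, invented for illustration. Real credit models are more complex, and nothing here reflects an actual lender's system.

```python
def credit_score(income, debt):
    """Hypothetical scoring rule, invented purely for illustration."""
    return income / 1000 - debt / 500


def is_eligible(income, debt, threshold=35.0):
    """Hypothetical eligibility decision based on the score above."""
    return credit_score(income, debt) >= threshold


# A made-up customer.
income, debt = 40_000.0, 3_000.0

# What-if query: the same customer after a 3% pay raise.
raised_income = income * 1.03

print("eligible now:", is_eligible(income, debt))
print("eligible after a 3% raise:", is_eligible(raised_income, debt))
```

Answering such a query requires access to system logic information (how inputs relate to outputs), which is a different need from a regulator's interest in overall performance and compliance.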

Stakeholders’ transparency interests can differ even when their reasons for seeking information are the same. For example, a risk and compliance officer may seek information about an ML system for the same reasons and look for answers to the same questions as a different internal stakeholder (e.g. a customer service representative) or an external stakeholder (e.g. a member of the public). Each of these stakeholders, however, might expect different levels of detail.

{numref}`fig7-ai-transparency` summarises the three key questions of AI transparency discussed so far.

```{figure} images/fig7-ai-transparency.png
---
name: fig7-ai-transparency
---
Summary of the three key questions of transparency {cite}`ostmann2021ai` (maybe redraw later).
```

### Trade-offs

There can be reasons for not making some types of information about ML systems accessible to certain stakeholders. Such
reasons often play a prominent role in discussions about the disclosure of information to external stakeholders in particular. The applicability of such countervailing reasons is context dependent. In particular, these reasons, where relevant, do not speak against the disclosure of system logic and process information in a wholesale manner. Instead, they typically apply to the disclosure of specific types of information (e.g. specific aspects of system logic information rather than all types of system logic information) to specific types of stakeholders (e.g. customers rather than all external stakeholders), for specific types of use cases.

In addition, disclosing information that is irrelevant or excessively detailed in response to stakeholders’ questions may generate undue distrust. Avoiding ‘information overload’ is one possible reason against the disclosure of some types of information to certain stakeholders.

Three other potential reasons are worth noting:

- **Preventing system manipulation or ‘gaming’**: In some cases, firms employing ML systems may seek to protect certain aspects of information to prevent the subversion of these systems. In the case of fraud detection systems, for instance, preventing adversarial actors from finding ways to evade detection can speak against disclosing to customers information about system logic or the data used. Yet, this countervailing reason does not necessarily apply to the disclosure of the same information to regulators, or the disclosure of other types of information to customers.
- **Protecting commercially sensitive information**: Certain types of information may be considered commercially sensitive by the firm employing an ML system or by third-party providers involved in the system’s development. For example, an investment management firm that relies on proprietary ML systems to identify profitable investment opportunities has an interest in protecting the competitive advantage enabled by these systems. Similarly, third-party providers may want to protect the IP contained in their products. As such, firms may be reluctant to disclose information that is central to their commercial success. Once again, however, this reason typically only applies to specific types of information (eg details of a system’s logic or proprietary source code) and their disclosure to certain stakeholders.
- **Protecting personal data**: Certain forms of information disclosure can conflict with firms’ obligation to protect personal data. This includes, most obviously, the direct sharing of personal data – be it data used in the development or the operation of ML systems – in ways that violate data protection legislation. In addition, where ML systems are trained with personal data, it may be possible to infer protected personal information through, for example, model inversion or membership inference attacks. While concerns about such attacks only apply in limited circumstances, they can speak against the disclosure of certain aspects of system logic information to stakeholders.

The applicability and implications of transparency trade-offs depend on context and vary between use cases. Regardless of the applicability of different countervailing reasons, large segments of the information that is of interest to stakeholders will remain unaffected.

## System transparency 

We will now focus on system transparency, i.e. the transparency of the system logic information. We aim to answer three questions:

- What types of information fall under the category of system transparency?
- Why can different stakeholders be interested in them?
- How can such information be obtained and communicated?

{numref}`fig8-ai-transparency` summarises system transparency in terms of the three key questions above.

```{figure} images/fig8-ai-transparency.png
---
name: fig8-ai-transparency
---
The what, why, and how of system transparency {cite}`ostmann2021ai` (maybe redraw later).
```

### What relevant information?

System transparency refers to access to information about the operational logic of a system. The most transparent systems are simple systems where system logic information can be inferred purely from a system’s formal
representation. Three types of information are considered relevant for system transparency:

> (1) The input variables that a given system relies on: what are the types of information that the system uses in operation?
> 
> (2) The way in which the system transforms inputs into outputs: what is the relationship between input variables and system results?
> 
> (3) The conditions under which the system would produce a certain output: for what values of the input variables would the system return a specific value of interest?

```{admonition} Example
Let us now illustrate how these three types of information can be inferred from the formal expression of a
simple system. The linear model below calculates a person’s credit score $y$ (the output variable) as a function of their weekly income $x$ (the input variable):

$$y = 200 + 0.5x$$

This simple equation provides answers to all three questions outlined above:
- the model relies on a single input variable, namely weekly income $x$;
- the model transforms the input variable into a credit score $y$ (output) by multiplying it by a coefficient of 0.5 and adding a constant of 200;
- in order for the model to yield an output value (credit score) of 600 (for example), the value
of weekly income $x$ would need to be £800.

Given that it is possible to infer these three types of information from its formal expression,
this simple model is **fully transparent**. Since the input variable $x$ is an easily understandable real-world property, this model is also **fully interpretable**.
```
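
The three types of information in the example above can also be read off a few lines of code. The following is a minimal sketch of the same linear model; the function names `credit_score` and `income_needed` are illustrative choices, not part of any library.

```python
# Illustrative sketch of the linear credit-score model y = 200 + 0.5x.
INTERCEPT = 200.0   # constant term of the model
COEFFICIENT = 0.5   # weight applied to weekly income


def credit_score(weekly_income: float) -> float:
    """(2) How inputs become outputs: multiply by 0.5 and add 200."""
    return INTERCEPT + COEFFICIENT * weekly_income


def income_needed(target_score: float) -> float:
    """(3) Conditions for a given output: solve the equation for x."""
    return (target_score - INTERCEPT) / COEFFICIENT


# (1) The only input variable is weekly income (in pounds).
print(credit_score(800))   # 600.0
print(income_needed(600))  # 800.0
```

Because the model is a single linear equation, question (3) can be answered exactly by inverting it. For more complex models such an inverse is generally not available in closed form.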

Many of the models that financial services firms use meet the definition of _interpretable models_. Their interpretation may require a higher level of mathematical knowledge, but their structure makes it possible to infer answers to the three questions above based on a _formal model expression_. The increases in model complexity enabled by ML methods can entail
a decrease in or loss of model interpretability. It will generally be possible to identify the input variables that ML models rely on (information in category (1) above). Yet, model complexity can make it difficult to understand – from a formal expression of the model – how inputs are transformed into outputs (information in category (2)) or the conditions under which the model yields a specific output (information in category (3)).

Decreases in interpretability can take two forms. First, as model complexity increases, interpreting models requires greater technical skills. This possibility of opacity due to non-expertise shows that interpretability is a relative concept. Whether an ML system is considered interpretable can depend on the level of technical expertise of those trying to understand it. Second, model complexity can take forms that make ML systems inscrutable, affecting their interpretability regardless of expertise. In such cases, experts may still be able to give partial answers to the question of how the model transforms inputs into outputs from a formal representation of it – for example, by providing a high-level description of the model’s structure. Yet, these partial answers fall far short of the complete understanding that can be gained from the formal expression of the simple linear model above.

The lack of interpretability of certain types of models does not necessarily mean that adequate forms of system logic information are unobtainable for these ML systems. Instead of obtaining information from the formal expression of models, system logic information can also be obtained indirectly, by using auxiliary strategies and tools, such as _explainability methods_.  However, these explainability methods cannot fully compensate for the information that can be obtained from interpretable systems.

### Why such info?

Access to system logic information can serve to address relevant concerns (e.g. ensuring trustworthiness and responsible use) as well as to provide assurance about possible concerns (e.g. demonstrating trustworthiness and responsible use), as shown in {numref}`fig9-ai-transparency`.

```{figure} images/fig9-ai-transparency.png
---
name: fig9-ai-transparency
---
Areas of concern related to system transparency {cite}`ostmann2021ai` (maybe redraw later).
```

#### System performance

System logic information can be vital to understanding and improving the effectiveness, reliability, and robustness of ML systems. Where testing during system development reveals shortcomings, the analysis of input-output relationships can help identify possible improvements. Knowledge of input-output relationships can also be crucial when assessing the extent of possible performance issues that may arise during deployment. Stakeholders that may be interested in system logic information for these reasons include those involved in or making decisions about the development and use of ML systems as well as those seeking assurance about an ML system’s performance (including evaluation).

#### System compliance

Knowledge of the input variables that a system relies on and other aspects of system logic can be crucial to ensuring compliance with legal and regulatory standards and rules. For example, an understanding of system logic can be critical to avoiding unlawful discrimination; ensuring the adequacy of risk management; assessing risks; or avoiding the unlawful
processing of personal data. As in the case of system performance, stakeholders that may be interested in system logic information for these reasons include those involved in or making decisions about the development and use of ML systems as well as those seeking assurance about system compliance.

#### Competent use & human oversight

System users may need access to system logic information to ensure competent use. For example, knowledge of the input variables that a system relies on can be necessary to ensure that factors already accounted for in system outputs are not accounted for more than once (and therefore distort results) within a given decision process as a whole. Similarly, internal stakeholders in charge of oversight arrangements may need an understanding of system logic to determine what kind of oversight is required and to anticipate situations that call for intervention.

#### Providing explanations

System logic information can be at the core of explanations sought by
decision recipients. For instance, it can provide assurance that decisions are taken in non-arbitrary
and methodologically sound ways. In contexts such as credit or insurance underwriting, for
example, access to system logic information can also be important in order for decision recipients
to understand the effect that their behaviour may have on the decisions they receive.

#### Responsiveness

Customer service representatives, for example, may need to understand which
input variables a system relies on, how the system transforms inputs into outputs, or under what
conditions a system would yield certain results to be able to respond to customer queries.

#### Social and economic impact

System logic information can be essential to assessing potential
social and economic impacts or providing assurance in relation to concerns about such impacts.
For example, knowledge of the input variables used and the relationship between inputs and
outputs can be relevant to understanding whether the system relies on inferences whose use
may be considered ethically objectionable. Regulators, academics, or indeed wider civil society
stakeholders may have an interest in system logic information in order to assess social and
economic implications.

### How to obtain & communicate such info?

We now discuss how to obtain system logic information and how to communicate it to
relevant stakeholders.

#### Obtaining system logic information

There are two methodological paths to obtaining information about an ML system’s input-output
relationships and conditions under which it produces certain outputs:
- Direct interpretation: Where complexity allows, relevant information can be obtained by analysing
a formal representation of the system (as illustrated by the [example of the simple linear model](#what-relevant-information)). This will be possible for many ML systems covered in this course, including those that are linear or non-linear but have a relatively simple structure. However, it will not be possible for all ML systems, including those that are non-linear and have a complex structure.
- Indirect analysis using explainability methods: Various auxiliary methods can help shed light
on system logic. Many of these methods are perturbation-based – relying on the analysis of
changes in system outputs in response to changes in input values – and can be used without
access to a formal representation of the system. This is out of the scope of this course and hence will _NOT_ be covered.
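
Without going into any specific explainability method, the perturbation idea itself can be sketched in a few lines. In this sketch, `black_box` is a hypothetical stand-in for a system that can be queried but not inspected, and `perturbation_effect` is an illustrative helper, not a library function.

```python
def black_box(weekly_income: float) -> float:
    """Hypothetical system: we may call it, but assume we cannot read it."""
    return 200.0 + 0.5 * weekly_income


def perturbation_effect(model, x: float, delta: float = 1.0) -> float:
    """Change in output per unit change in input around the point x."""
    return (model(x + delta) - model(x)) / delta


# Probe the system around a weekly income of 800.
print(perturbation_effect(black_box, 800.0))  # 0.5
```

Here the probe recovers the coefficient of the linear model exactly. For a non-linear system the result would only describe the behaviour near the chosen point, which is one reason such methods cannot fully replace direct interpretation.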


#### Machine learning interpretability

In general, linear models are more interpretable but less flexible, while non-linear models are more flexible but less interpretable.

A trade-off between model flexibility and interpretability is shown in the figure below. 

```{figure} images/fig2_7.png
---
height: 300px
name: fig-trade-off-interpretable-flexible
---
A trade-off between model flexibility and interpretability {cite}`james2021statistical`.
```

The decision to limit model complexity for the sake of interpretability is often portrayed as a tradeoff with model accuracy. You may come across figures that look like the one above but with accuracy in place of flexibility. The basis for this argument is the assumption that more complex models have higher accuracy than simpler ones. Yet, this assumption is not always true. In many modelling contexts, interpretable models can be designed to achieve the same or comparable levels of accuracy as models that would be considered uninterpretable. Significant research efforts are underway to advance the field of _interpretable machine learning_. Over time, these research efforts can be expected to further reduce the range of contexts in which interpretability-accuracy tradeoffs are perceived to exist.

Decisions in favour of interpretability do not necessarily come at the expense of
accuracy. Where trade-offs between interpretability and accuracy do exist, it may be preferable
to accept a lower level of accuracy in the interest of enabling direct interpretation by system
developers and other relevant actors. Conversely, where uninterpretable models are being used,
it is important to be mindful of the limitations of explainability methods. Ignoring these limitations
risks having a false sense of understanding, potentially resulting in misplaced trust in ML systems
and unexpected harmful outcomes. Governance arrangements play a key role when it comes to
choosing appropriate types of models.

#### Communicating system logic information
System logic information is only useful if it is communicated to stakeholders in ways that are
intelligible and meaningful.

Stakeholders differ in their familiarity with technical concepts. Depending on the audience, system logic information may need to be translated from technical into plain language to make it intelligible. The form and degree of translation required can vary between audiences. For example, while customers may seek information that is presented in non-technical language, senior managers may be more comfortable with technical terms. Non-textual forms of presenting system logic information, including visuals or interactive dashboards, can also enhance intelligibility. Whether information is meaningful depends on the questions that stakeholders seek to answer. Questions can differ significantly between stakeholders, as can the level of detail expected in the answer to each question.

This can be particularly relevant when comparing the transparency interests of customers with
those involved in managing or monitoring the performance of ML systems. Three considerations
are worth highlighting:

- **The role of counterfactuals**: The interest of customers in accessing system logic information can often be driven by questions about the conditions under which a system would yield a certain output (eg a favourable decision outcome). Such counterfactual explanations differ from the types of information that are of interest to other stakeholders, eg those who want to understand system performance.

- **Relevance**: Excessively detailed information or information that is irrelevant to customers’ queries can cause confusion and generate distrust.

- **Intuitiveness and simplicity**: Customers may expect the logic of systems to be sufficiently intuitive and simple, so that they are able to remember it in day-to-day life and make informed choices about aspects of their behaviour that may affect decision outcomes. Intelligibility alone does not guarantee that these expectations are met.


## Process transparency 

Now let us consider process transparency. We aim to answer three questions:

- What information falls under the category of process transparency?
- Why can stakeholders be interested in such information?
- How can such information be managed and communicated?
 
{numref}`fig10-ai-transparency` summarises process transparency in the three key questions above.

```{figure} images/fig10-ai-transparency.png
---
name: fig10-ai-transparency
---
The what, why, and how of process transparency {cite}`ostmann2021ai` (maybe redraw later).
```

### What relevant information?

Process transparency concerns access to any information about an ML system’s design, development,
and deployment apart from the system’s logic. As with system logic information, such process
information is important for addressing and providing assurance about concerns raised by AI
systems. Correspondingly, there is a growing amount of work on how process information can
be recorded, managed, and made accessible in practice.

We can categorise process information regarding ML systems along two dimensions:

- **Different lifecycle phases**: Process information can relate to (i) the design and development or (ii) the deployment of an ML system. In both areas, more specific lifecycle phases can be distinguished, each of them associated with unique aspects of information.
- **Different levels of information**: In considering a given lifecycle phase, different levels of process information can be distinguished, corresponding to the kinds of questions that the information serves to answer.

These two dimensions lead to a typology for process information whose general structure can be
represented in the form of a matrix, as illustrated in {numref}`fig11-ai-transparency`. 

```{figure} images/fig11-ai-transparency.png
---
name: fig11-ai-transparency
---
Process transparency matrix {cite}`ostmann2021ai` (maybe redraw later).
```

Let us study the different levels of information below. For each lifecycle phase of an ML system, there are various aspects of information that can be of interest to internal or external stakeholders. These aspects of information can answer questions at different levels of abstraction. The following four levels of information can be distinguished, moving from more concrete to more abstract questions:

- **Substantive information** relates to questions about substantive aspects of activities within a given lifecycle phase for an ML system. Examples of such information include: the content of problem definition or system requirement statements; the content of or summary statistics for datasets used during the ML system’s development and operation; source code or other formal representations of the ML system; and the results of tests conducted to assess system performance or compliance.
- **Procedural information** answers questions about the procedures that were followed in performing the activities within a given lifecycle phase. Examples include descriptions of: the process that led to the agreed problem definition or system requirements (eg the actors and the steps involved); the procedures employed to collect and assemble the data used during the ML system’s development or operation; the nature of data quality checks or processing steps carried out; the process followed to select the type of ML method used for developing models; or the procedures used to conduct system tests.
- **Governance information** answers questions about governance arrangements for activities conducted within a lifecycle phase. This information may take the form of statements of accountability and liability, or descriptions of the structure of relevant oversight mechanisms (including, where relevant, the role of risk and compliance teams, ethics review boards, audit teams, senior managers, or board members).
- **Information on adherence to norms and standards** refers to compliance with norms or standards in the design, development, and deployment of an ML system. Such norms or standards may touch on substantive, procedural, or governance questions.

The distinction between these four levels of information remains unaffected by the complexities of sourcing data for ML systems or technology supply chains. Where firms rely on third-party providers, all four levels of information can be applied to activities carried out within and outside the firm. Additionally, relevant governance information in such cases can include information about accountability structures and mechanisms that govern the relationship between third-party providers and the firm employing the ML system in question. These four levels of information, combined with the typology of lifecycle phases, lead to a more concrete version of the matrix we introduced at the beginning of this section to map out different types of process information. 

{numref}`fig12-ai-transparency` incorporates specific categories into {numref}`fig11-ai-transparency`.

```{figure} images/fig12-ai-transparency.png
---
name: fig12-ai-transparency
---
Detailed process transparency matrix {cite}`ostmann2021ai` (maybe redraw later).
```

### Why such info?

Process information, like system logic information, can help to address concerns related to AI
systems (ensuring the trustworthiness and responsible use of these systems) as well as to provide
assurance that concerns have been addressed adequately (demonstrating trustworthiness and
responsible use). In the following paragraphs, we illustrate the importance of process transparency
in addressing each of the six areas of concern shown in {numref}`fig9-ai-transparency`.

- **System performance**: Information about the content of system requirement specifications, about the quality and origin of data used during an ML system’s development or operation, or about validation procedures is crucial for understanding the effectiveness, reliability, and robustness of ML systems. This information can be of interest to those involved in or making decisions about the development and use of an ML system, as well as those seeking assurance about the system’s performance (eg members of audit teams, board members, regulators or customers).
- **System compliance**: Process information is crucial to assessing ML systems’ adherence to compliance requirements. For example, information about the quality of data used and system tests conducted is essential for a holistic understanding of potential risks of unlawful discrimination. Similarly, where ML systems use personal data, information about the provenance, content, and quality of this data is important for data protection assessments. Process information can be of interest to those ensuring system compliance or can demonstrate system compliance to stakeholders.
- **Competent use and human oversight**: Information about an ML system’s intended purpose, system requirements specifications, or system performance measurements can be essential to ensuring competent use and preventing the inappropriate repurposing of ML systems. This information can also be crucial to determine what forms of human oversight are needed and to enable overseers to exercise their role effectively.
- **Providing explanations**: Explanations of an ML system’s outputs can involve system logic information as well as process information. Indeed, a complete understanding of a particular decision requires both. {numref}`fig13-ai-transparency` illustrates this using the example of a loan eligibility decision. In terms of process information, decision recipients seeking to understand an outcome may want to know the content of the input data about them that an ML system used. This knowledge is a precondition, for example, for being able to identify erroneous decisions.

```{figure} images/fig13-ai-transparency.png
---
name: fig13-ai-transparency
---
The combined relevance of process and system logic information in explaining system outputs {cite}`ostmann2021ai` (maybe redraw later).
```

- **Responsiveness**: Telling users about ways in which they can ask for information, help, or redress is important to reassure them of the existence of pathways for expressing such requests. In addition, internal stakeholders may need access to different forms of process information, such as the data used during an ML system’s operation, to be able to respond to customer requests. Finally, stakeholders seeking assurance about the responsible use of ML systems may be interested in information about how issues of responsiveness are managed.
- **Social and economic impact**: Various types of process information may be needed to manage and provide assurance regarding the social and economic impacts of an ML system. For example, information about system test results can be important for understanding an ML system’s potential financial exclusion implications. Similarly, information about how firms communicate personal data use to customers can be of interest to stakeholders seeking assurance in relation to concerns regarding consumer empowerment.

### How to manage such info?

The appropriate level of detail and technical sophistication in providing process information will depend on the purpose that the information is meant to serve. For example, actors involved directly in the validation of an ML system are likely to need a system requirements statement. Customers or other stakeholders interested in ensuring that the right procedures have been followed in validating an ML system will likely need less detailed information, expressed in easy-to-understand language.

Existing best practices within firms – even if they are not specifically designed for ML systems – can guide the process of identifying suitable ways of recording and presenting process information. We consider three areas of research and development related to managing and communicating process information below.

**Recording and presenting process information for ML systems**: Recent years have seen a rapidly growing literature on topics such as documentation, assurance, traceability, and audit trails for ML systems. Contributions to this literature often give examples of how different aspects of process information can be recorded and made accessible to different stakeholders. In many cases, these examples involve proposals for different ‘documentation artefacts’ and templates that can be used to structure process information in practice.

Some contributions to this debate are focused on subsets of process information or the information needs of particular stakeholders. Increasingly, however, contributions adopt a holistic perspective on documentation needs, covering all phases of an ML system’s lifecycle as well as the information needs of all relevant stakeholders. An approach that is growing in popularity – especially in the context of high-stakes applications of ML/AI – is the use of ‘argument-based assurance cases’, often following a specified template, in support of claims about an ML system’s properties.

Recent years have also seen an increase in the number of open-source tools for testing ML systems and examining their properties. These tools can be useful for generating some of the process information that is of interest to stakeholders.

**Emerging norms and standards**: A second evolving area with relevance to managing and communicating process information consists of work on standards for ML systems and on professional standards. Several national and international bodies are currently working to develop standards for ML systems. These standards serve as a useful point of reference though their applicability to real-world use cases will depend on context. 

In addition, recent years have seen growing support for initiatives to professionalise the field of data science. Efforts in this space are aimed at establishing commonly agreed curricula for data science courses and possible forms of professional accreditation for data scientists. For example, a group of professional bodies led by the Royal Statistical Society (RSS) is currently working to develop commonly agreed professional standards for data science. Concurrently, some professional bodies in the financial services space are turning their attention to codes of conduct for the use of data and emerging technologies.

**Mechanisms for verifying process information**: A third area of emerging work concerns the verifiability of process information for ML systems. This includes forms of independent certification for relevant norms and standards. Currently, declarations of adherence to norms and standards take the form of self-declared adherence (‘self-certification’). However, in some contexts, stakeholders may place greater trust in such declarations if they are supported by independently administered certification or labelling schemes.

There is also an emerging literature on the role of auditors in examining system design, development, and deployment processes (including evaluation). In contrast to certification, auditors may verify process information at a more detailed level. ML system auditors can be internal or external to the firm that is employing a given ML system.

Finally, growing research and development efforts are being dedicated to technical solutions that automate the generation and recording of process information. Software-generated ‘audit trails’ and related concepts can contribute to the reliability and verifiability of some types of process information, while at the same time reducing the cost of recording and making the information available to stakeholders.
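
As a rough illustration of what a software-generated audit trail might look like, the sketch below records process information as timestamped entries tagged with a lifecycle phase and a level of information. The class `ProcessRecord`, the function `log_event`, and the example entries are all hypothetical. They are not taken from any standard or tool.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ProcessRecord:
    """One entry in a hypothetical audit trail for an ML system."""
    phase: str    # lifecycle phase, e.g. "design and development"
    level: str    # level of information, e.g. "procedural"
    detail: str   # what was done or decided
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


audit_trail = []


def log_event(phase: str, level: str, detail: str) -> ProcessRecord:
    """Append a timestamped record to the in-memory audit trail."""
    record = ProcessRecord(phase, level, detail)
    audit_trail.append(record)
    return record


# Made-up example entries, for illustration only.
log_event("design and development", "procedural",
          "Removed rows with missing income values before training.")
log_event("deployment", "governance",
          "Monitoring responsibility assigned to the model risk team.")

# Serialise as JSON lines so the trail can be stored and inspected later.
for record in audit_trail:
    print(json.dumps(asdict(record)))
```

A real audit trail would also need to address storage, access control, and tamper resistance, which this sketch leaves out.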

### Reproducibility of ML systems

Reproducibility is more related to process transparency. Maybe to add later.


## K-Nearest Neighbors (KNN) classification

The K-Nearest Neighbors (KNN) algorithm is a supervised learning algorithm. It is a _non-parametric_ method used for classification and regression. In both cases, the input consists of the $k$ closest training examples in the feature space. The output depends on whether KNN is used for classification or regression. Here, we focus on the classification case.

### Ingredients & transparency
For all machine learning models covered in this course, we aim to talk about their ingredients and transparency in a standard way to facilitate understanding their similarities and differences. For transparency, we will focus on the system transparency, i.e. system logic. Process transparency is not specific to ML models and it will be discussed when we cover the ML process/lifecycle.

The only ingredient of a KNN model is the training data itself. Its system logic is the distance between the test point and the training points: each prediction can be traced back to the specific training points that are closest to the test point.
```{admonition} Ingredients
- Input: training data
- Label: class
```

```{admonition} Transparency
System logic: distance between test point and training points
```
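One way to inspect this system logic in practice is to ask a fitted model which training points it used for a given prediction. The sketch below uses the `kneighbors` method of scikit-learn's `KNeighborsClassifier`; the query point and the choice of `n_neighbors=5` are arbitrary values picked for illustration.

```python
from sklearn import datasets, neighbors

iris = datasets.load_iris()
X = iris.data[:, :2]  # use only the first two features (sepal length and width)
y = iris.target

clf = neighbors.KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)

# An arbitrary query point: sepal length 5.0 cm, sepal width 3.5 cm.
x_query = [[5.0, 3.5]]

# kneighbors returns the distances to, and the indices of, the nearest training points.
distances, indices = clf.kneighbors(x_query)
predicted = clf.predict(x_query)[0]

print("Predicted class:", iris.target_names[predicted])
for dist, idx in zip(distances[0], indices[0]):
    print(f"training point {idx}: distance={dist:.3f}, class={iris.target_names[y[idx]]}")
```

Listing the neighbors alongside the prediction shows exactly which training examples were responsible for it, which is the sense in which KNN's system logic is transparent.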

### Example: Iris classification

We adapt the [KNN example from scikit-learn](https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py) to illustrate the use of KNN for classification. 

To do so, we use the Iris dataset, which is a classic dataset in machine learning and statistics. It is included in scikit-learn and we load it as follows.

```{admonition} Launch (<i class="fas fa-rocket"></i>)
Click the rocket symbol (<i class="fas fa-rocket"></i>) to launch this page as an interactive notebook in Google Colab (faster but requiring a Google account) or Binder.
```

```{code-block} python
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
from sklearn.inspection import DecisionBoundaryDisplay

n_neighbors = 15

# import some data to play with
iris = datasets.load_iris()

# we only take the first two features. We could avoid this ugly
# slicing by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target

# Create color maps
cmap_light = ListedColormap(["orange", "cyan", "cornflowerblue"])
cmap_bold = ["darkorange", "c", "darkblue"]

for weights in ["uniform", "distance"]:
    # we create an instance of KNeighborsClassifier and fit the data.
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y)

    _, ax = plt.subplots()
    DecisionBoundaryDisplay.from_estimator(
        clf,
        X,
        cmap=cmap_light,
        ax=ax,
        response_method="predict",
        plot_method="pcolormesh",
        xlabel=iris.feature_names[0],
        ylabel=iris.feature_names[1],
        shading="auto",
    )

    # Plot also the training points
    sns.scatterplot(
        x=X[:, 0],
        y=X[:, 1],
        hue=iris.target_names[y],
        palette=cmap_bold,
        alpha=1.0,
        edgecolor="black",
    )
    plt.title(
        "3-Class classification (k = %i, weights = '%s')" % (n_neighbors, weights)
    )

plt.show()
```
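
The loop above compares two weighting schemes. In scikit-learn, `weights="uniform"` gives every neighbor an equal vote, while `weights="distance"` weights each neighbor by the inverse of its distance, so closer neighbors count for more. The following sketch, using a tiny one-dimensional dataset made up for illustration, shows a case where the two schemes disagree.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up): one class-0 point close to the query, two class-1 points further away.
X_toy = np.array([[0.0], [2.0], [2.2]])
y_toy = np.array([0, 1, 1])
x_query = np.array([[0.5]])

predictions = {}
for weights in ["uniform", "distance"]:
    clf = KNeighborsClassifier(n_neighbors=3, weights=weights)
    clf.fit(X_toy, y_toy)
    predictions[weights] = int(clf.predict(x_query)[0])
    # predict_proba shows the (normalized) vote share of each class.
    print(weights, predictions[weights], clf.predict_proba(x_query)[0].round(3))
```

With uniform weights, the two class-1 points outvote the single class-0 point. With distance weights, the class-0 point is much closer (distance 0.5 versus 1.5 and 1.7), so its vote outweighs the other two combined and the prediction becomes class 0.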

### Exercises

min 3 max 5

## Quiz

_Not for now. To finish in the next cycle._ Complete [Quiz](https://forms.gle/8Q5Z7Z7Z7Z7Z7Z7Z7) to check your understanding of this topic. You are advised to score at least 50% to proceed to the next topic.

## Summary

Machine learning automates the process of learning a model from data. The model captures hidden patterns or relationships that help us make intelligent, data-driven predictions or decisions. Machine learning has proven useful in many applications, such as image recognition, speech recognition, and natural language processing. Understanding and applying machine learning through this course will help you acquire essential skills for solving many real-world problems.

In this topic, you learned:
- Machine learning learns a model from data to make predictions or decisions.

## References and further reading

This material is based on the following resources:
- [Machine Learning for Beginners - A Curriculum](https://github.com/microsoft/ML-For-Beginners), Microsoft
- [Deep Learning for Molecules & Materials](https://dmol.pub/)

```{bibliography}
```