# Intro: Machine Learning & Transparency

<!-- **Question**: What is machine learning and what is AI transparency? -->

```{admonition} Objectives
- Understand real-world datasets used in this course
- Understand components of typical machine learning systems
- Understand steps in typical machine learning processes
- Understand AI transparency definition & taxonomy
```

**Expected time to complete**: 4 hours

In this chapter, we will learn about the components of typical machine learning systems and the steps in typical machine learning processes. We will also learn about the definition and taxonomy of AI transparency. We will start with the real-world datasets used in this course to see how machine learning can be used in the real world.

```{note}
If you do not have prior knowledge/experience with linear algebra, Python programming, and probability and statistics, please go through {doc}`00-prerequisites` before starting this course.
```

## What is machine learning?

### Human learning

Learning is familiar to humans (and animals) as a continuous process that starts at birth and continues throughout life. A human child (and adult) learns new things, acquires new skills, and improves existing skills to survive and thrive in the surrounding world. A child's brain and senses perceive facts about their surroundings and gradually learn the hidden patterns of life, which help the child craft logical rules to identify learned patterns and predict future events.

Learning is the process of acquiring new knowledge, skills, and behaviors. It can be defined as a change in behavior or knowledge that results from experience. Learning can be divided into two categories: **acquisition** and **performance**. Acquisition is the process of gaining new knowledge, skills, and behaviors. Performance is the process of using what has been acquired to achieve a goal. We learn from our own experiences and we learn from others. For example, we learn to walk by watching others walk, to speak by listening to others speak, to drive by watching others drive, and to cook by watching others cook.

Since you are here, you may already know that machines can also learn, from their own experiences and from others. That is the subject of this course.

### A definition of machine learning

> Machine learning learns a model from data.

Machine learning takes **data** as **input** to learn a **model**, a mathematical representation of the data. This learning process is called the **training** phase and the data used in training is called the **training data**. After training, the learned model can take new data, the **test data**, as input to generate **output** for making predictions on the test data or for exploring/explaining the test data, which is the **test** phase.

The above definition is the one we will use in this course for clarity, loosely inspired by how the human brain learns certain things based on the data it perceives from the outside world. From this definition, we can see machine learning as a set of **software** tools for **modelling** and **understanding** complex **datasets**. You may find various definitions of machine learning elsewhere. For example, machine learning, from a systems perspective, is defined as the creation of automated systems that can learn hidden patterns from data to aid in making intelligent decisions. 
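
The training and test phases described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the toy data are synthetic (not one of the course datasets), the model is a straight line, and NumPy's `polyfit` stands in for the learning algorithm.

```python
import numpy as np

# Hypothetical toy data: one feature x and a noisy linear target y = 2x + 1.
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=50)
y_train = 2.0 * x_train + 1.0 + rng.normal(0, 0.5, size=50)

# Training phase: learn the model (a slope and an intercept) from the training data.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Test phase: apply the learned model to new data to generate predictions.
x_test = np.array([2.0, 5.0, 8.0])
y_pred = slope * x_test + intercept
print(slope, intercept)
print(y_pred)
```

Here the learned model is just the pair `(slope, intercept)`; more complex models follow the same pattern of learning from training data and then predicting on test data.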


### AI, machine learning, and deep learning

Besides machine learning (ML), you may have also heard about artificial intelligence (AI), deep learning, and data science. Some use these terms interchangeably, but they are not the same, as shown in the figure below.

```{figure} https://github.com/microsoft/ML-For-Beginners/raw/main/1-Introduction/1-intro-to-ML/images/ai-ml-ds.png
---
height: 250px
name: fig-ai-ml-ds
---
The relationships between AI, ML, deep learning, and data science, by [Jen Looper](https://twitter.com/jenlooper) adapted from [stackexchange](https://softwareengineering.stackexchange.com/questions/366996/distinction-between-ai-ml-neural-networks-deep-learning-and-data-mining) (can be redrawn by us later).
```

**AI** is a broad field of study that aims to create intelligent machines that can perform tasks that normally require human intelligence. There are multiple ways to achieve AI, and machine learning is one of them. As defined above, machine learning is a subfield of AI in which machines acquire their experience from data, in the form of a mathematical model. There are multiple approaches to machine learning, and deep learning is one of them. **Deep learning** is a subfield of machine learning that uses deep (i.e. many-layer) neural networks to learn from data. **Data science** is a broad field of study that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data, not necessarily in the form of a mathematical model as machine learning does.


## Data and machine learning problems

This section introduces the real-world datasets used in this course and the machine learning problems that can be solved using these datasets.

### Real-world datasets in this course

In this course, we will use real-world datasets to introduce machine learning from the perspective of AI transparency. We will use the following datasets from the textbook. You can click on the name of the dataset to see the actual data.

```{list-table} Datasets used in this course, from the textbook (to refine)
:header-rows: 1
:widths: "auto"
:name: datasets-table
* - Name
  - Data provided
  - Machine learning problem
* - [Auto](https://github.com/pykale/transparentML/blob/main/data/Auto.csv)
  - Gas mileage, horsepower, and other information for cars.
  - Predict gas mileage for a car.
* - [Bikeshare](https://github.com/pykale/transparentML/blob/main/data/Bikeshare.csv)
  - Hourly usage of a bike sharing program in Washington, DC.
  - Predict the number of bikes rented per hour.
* - [Boston](https://github.com/pykale/transparentML/blob/main/data/Boston.csv)
  - Housing values and other information about Boston census tracts.
  - Predict the median value of a house. 
* - [BrainCancer](https://github.com/pykale/transparentML/blob/main/data/BrainCancer.csv)
  - Survival times for patients diagnosed with brain cancer.
  - Predict the survival time for a patient.
* - [Caravan](https://github.com/pykale/transparentML/blob/main/data/Caravan.csv)
  - Information about individuals offered caravan insurance.
  - Predict whether an individual will buy caravan insurance.
* - [Carseats](https://github.com/pykale/transparentML/blob/main/data/Carseats.csv)
  - Information about car seat sales in 400 stores.
  - Predict the sales of a car seat.
* - [College](https://github.com/pykale/transparentML/blob/main/data/College.csv)
  - Demographic characteristics, tuition, and more for USA colleges.
  - Predict the number of applications received by a college.
* - [Credit](https://github.com/pykale/transparentML/blob/main/data/Credit.csv)
  - Information about credit card debt for 10,000 customers.
  - Predict the amount of credit card debt for a customer.
* - [Default](https://github.com/pykale/transparentML/blob/main/data/Default.csv) 
  - Customer default records for a credit card company.
  - Predict whether a customer will default on a credit card payment.
* - [Fund](https://github.com/pykale/transparentML/blob/main/data/Fund.csv)  
  - Returns of 2,000 hedge fund managers over 50 months.
  - Predict the returns of a hedge fund manager.
* - [Hitters](https://github.com/pykale/transparentML/blob/main/data/Hitters.csv)  
  - Records and salaries for baseball players.
  - Predict the salary of a baseball player.
* - [Khan](https://github.com/pykale/transparentML/blob/main/data/Khan.json)  
  - Gene expression measurements for four cancer types.
  - Predict the cancer type for a patient.
* - [NCI60](https://en.wikipedia.org/wiki/NCI-60)  
  - Gene expression measurements for 64 cancer cell lines.
  - Find clusters or groups among the cell lines for personalised treatment.
* - [OJ](https://github.com/pykale/transparentML/blob/main/data/OJ.csv)  
  - Sales information for Citrus Hill and Minute Maid orange juice.
  - Predict the sales of orange juice.
* - [Portfolio](https://github.com/pykale/transparentML/blob/main/data/Portfolio.csv)  
  - Past values of financial assets, for use in portfolio allocation.
  - Predict the value of a financial asset.
* - [Publication](https://github.com/pykale/transparentML/blob/main/data/Publication.csv)  
  - Time to publication for 244 clinical trials.
  - Predict the time to publication for a clinical trial.
* - [Smarket](https://github.com/pykale/transparentML/blob/main/data/Smarket.csv)  
  - Daily percentage returns for S&P 500 over a 5-year period.
  - Predict whether the stock index will increase or decrease.
* - [USArrests](https://github.com/pykale/transparentML/blob/main/data/USArrests.csv)  
  - Crime statistics per 100,000 residents in 50 states of USA.
  - Predict the crime rate in a state.
* - [Wage](https://github.com/pykale/transparentML/blob/main/data/Wage.csv)  
  - Income survey data for men in central Atlantic region of USA.
  - Predict the income of men.
* - [Weekly](https://github.com/pykale/transparentML/blob/main/data/Weekly.csv)  
  - 1,089 weekly stock market returns for 21 years.  
  - Predict the stock market return in a week.
``` 
<!-- * - [NYSE](https://github.com/pykale/transparentML/blob/main/data/.csv)  
  - Returns, volatility, and volume for the New York Stock Exchange.
  - Predict the returns of a stock. -->

The above datasets illustrate the diverse range of problems that machine learning can solve, and they are only the tip of the iceberg. Applications of machine learning are everywhere, from healthcare to finance, from manufacturing to agriculture, from transportation to education, and so on. The datasets used in this course are from the textbook, which is a good starting point for learning about machine learning. However, you can also find many other datasets online, such as [Kaggle](https://www.kaggle.com/datasets), [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), [OpenML](https://www.openml.org/), [Google Dataset Search](https://datasetsearch.research.google.com/), and so on.

Next, let us learn about the machine learning problems in their typical definitions.

### Machine learning problems

Machine learning problems can be broadly classified into two categories: **supervised learning** and **unsupervised learning**. In supervised learning, the data is labelled, and the goal is to predict or estimate the label for new data. In unsupervised learning, the data is not labelled, and the goal is to find patterns, such as relationships or structures, in the data. Machine learning models can generate two types of outputs: discrete and continuous. Discrete outputs are categorical, such as the class of an image. Continuous outputs are numerical, such as the price of a house. 

Thus, supervised learning can be further divided into classification and regression. In **classification**, the output is a discrete (i.e. categorical or qualitative) value, such as a category or a class. In **regression**, the output is a continuous (i.e. quantitative) value, such as a number or a probability. Unsupervised learning can be further divided into clustering and dimensionality reduction. In **clustering**, the goal is to find groups of similar data points, so the output is a discrete value, such as a cluster index. In **dimensionality reduction**, the goal is to find a lower-dimensional representation of the data, so the output consists of continuous values, such as a vector.

The following table shows the different types of machine learning problems in the context of the above definitions. 

```{list-table} Supervised and unsupervised machine learning
:header-rows: 1
:widths: "auto"
:name: mlproblems-table
* - Machine Learning
  - Supervised
  - Unsupervised
* - **Discrete output**
  - Classification
  - Clustering
* - **Continuous output**
  - Regression
  - Dimensionality reduction
```
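
The two questions behind the table above (is the data labelled, and is the output discrete?) can be written as a small helper. This is only a sketch: the function name `problem_type` is hypothetical, and the dataset examples in the comments follow the problem descriptions given in the dataset table earlier.

```python
def problem_type(has_labels: bool, discrete_output: bool) -> str:
    """Return the problem type for one cell of the supervised/unsupervised table."""
    if has_labels:
        return "classification" if discrete_output else "regression"
    return "clustering" if discrete_output else "dimensionality reduction"


# Default: predict whether a customer will default (labelled, discrete output).
print(problem_type(has_labels=True, discrete_output=True))    # classification
# Auto: predict gas mileage for a car (labelled, continuous output).
print(problem_type(has_labels=True, discrete_output=False))   # regression
# NCI60: find groups among cell lines (unlabelled, discrete output).
print(problem_type(has_labels=False, discrete_output=True))   # clustering
# Unlabelled data mapped to a low-dimensional vector (continuous output).
print(problem_type(has_labels=False, discrete_output=False))  # dimensionality reduction
```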

### Exercises

1. Choose three or more datasets of your interest from {numref}`datasets-table`. Click on the name of each chosen dataset to explore and get a sense of the data. You may not get a well-formatted view, or any view at all, for the larger ones. Using the terminology in {numref}`mlproblems-table`, write down the possible machine learning problems that can be solved using each of your chosen datasets. Click below for a sample answer.

   ```{toggle}
   Sample answer: to be completed.
   ```
2. To be completed
  
    ```{toggle}
    Sample answer: to be completed.
    ```




## Machine learning systems

### Basic notations

We use notations slightly different from those in the textbook to reduce the cognitive load (hopefully):

- $N$: the number of **samples**, i.e. distinct data points or observations.
- $D$: the number of **features**, i.e. distinct variables or attributes that are available for learning a model, also known as the **dimensionality**.
- $C$: the number of **classes**, i.e. distinct categories or labels.
- $\mathbf{x}_n$: the $n$-th sample, where $n = 1, 2, \ldots, N$.
- $x_{nd}$: the $d$-th feature of the $n$-th sample, where $d = 1, 2, \ldots, D$.
- $y_n$: the label of the $n$-th sample, where $y_n \in \{1, 2, \ldots, C\}$.
- $\mathbf{X}$: the **feature matrix**, also known as the **data matrix**, where $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$.
- $\mathbf{y}$: the label vector, where $\mathbf{y} = \{y_1, y_2, \ldots, y_N\}$.
- $\mathbf{W}$: the **weight matrix**, where $\mathbf{W} = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_C\}$.
- $\mathbf{b}$: the **bias vector**, where $\mathbf{b} = \{b_1, b_2, \ldots, b_C\}$.
- $\mathbf{z}_n$: the linear combination of the $n$-th sample, where $\mathbf{z}_n = \mathbf{W} \mathbf{x}_n + \mathbf{b}$.
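
To make the notation above concrete, the following sketch maps each symbol to a NumPy array. The sizes and random values are arbitrary, and the layouts are assumptions made for illustration: samples are stored as rows of `X`, and the weight vectors $\mathbf{w}_c$ are stored as rows of `W` so that $\mathbf{W} \mathbf{x}_n$ has one entry per class.

```python
import numpy as np

N, D, C = 5, 3, 2  # samples, features, classes (arbitrary toy sizes)
rng = np.random.default_rng(0)

X = rng.normal(size=(N, D))          # feature matrix: row n holds sample x_n
y = rng.integers(1, C + 1, size=N)   # label vector with values in {1, ..., C}
W = rng.normal(size=(C, D))          # weight matrix: row c holds w_c (assumed layout)
b = np.zeros(C)                      # bias vector, one entry per class

n = 0                                # Python counts from 0, the text counts from 1
z_n = W @ X[n] + b                   # linear combination z_n = W x_n + b
Z = X @ W.T + b                      # the same computation for all N samples at once

print(X.shape, y.shape, W.shape, b.shape)  # (5, 3) (5,) (2, 3) (2,)
print(z_n.shape, Z.shape)                  # (2,) (5, 2)
```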
 
 <!-- We use $\mathbf{a}_n$ to represent the activation of the $n$-th sample, where $\mathbf{a}_n = \sigma(\mathbf{z}_n)$. We use $\mathbf{Z}$ to represent the linear combination matrix, where $\mathbf{Z} = \{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N\}$. We use $\mathbf{A}$ to represent the activation matrix, where $\mathbf{A} = \{\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_N\}$. We use $\mathbf{Z}^T$ to represent the transpose of the linear combination matrix, where $\mathbf{Z}^T = \{\mathbf{z}_1^T, \mathbf{z}_2^T, \ldots, \mathbf{z}_N^T\}$. We use $\mathbf{A}^T$ to represent the transpose of the activation matrix, where $\mathbf{A}^T = \{\mathbf{a}_1^T, \mathbf{a}_2^T, \ldots, \mathbf{a}_N^T\}$. -->

### Machine learning ingredients

Machine learning systems are composed of three main ingredients: **data**, **model**, and **loss function**. The data is the input to the machine learning system. The model is the core of the system and maps the input to an output. The loss function measures how good that output is, which guides the learning of the model.

A typical machine learning system is composed of the following ingredients:
- **Sample**: a data point or observation $\mathbf{x}_n$, with $n = 1, 2, \ldots, N$, where $N$ is the total number of samples.
- **Feature**: each sample vector $\mathbf{x}_n$ has $D$ features as its representation, i.e. $D$ variables or attributes that are available for learning a model.
- **Prediction**: each sample will have a prediction $\hat{y}_n$ as its output.
- **Label**: only in supervised learning, each sample will have a label $y_n$ as its ground truth. In classification, $y_n \in \{1, 2, \ldots, C\}$, where $C$ is the total number of classes.
- **Labeled dataset**: a set of $N$ tuples of the form $(\mathbf{x}_n, y_n)$, where $n = 1, 2, \ldots, N$.
- **Unlabeled dataset**: a set of $N$ samples $\mathbf{x}_n$, where $n = 1, 2, \ldots, N$.
- **Model**: a function $f(\mathbf{x})$ that maps a sample $\mathbf{x}$ to an output $\hat{y}$. In classification, $f(\mathbf{x})$ maps a sample $\mathbf{x}$ to a class label $\hat{y} \in \{1, 2, \ldots, C\}$. In regression, $f(\mathbf{x})$ maps a sample $\mathbf{x}$ to a real number $\hat{y}$. In clustering, $f(\mathbf{x})$ maps a sample $\mathbf{x}$ to a cluster label $\hat{y} \in \{1, 2, \ldots, C\}$. In dimensionality reduction, $f(\mathbf{x})$ maps a sample $\mathbf{x}$ to a low-dimensional vector $\hat{\mathbf{y}}$.
- **Loss function**: a function $L(y, \hat{y})$ (also known as **error function** or **objective function**) that measures the difference between the predicted output $\hat{y}$ and the true or desired output $y$ in supervised learning, or a function $L(\hat{y})$ that measures some desired property (or properties) of the output in unsupervised learning. In classification, $L(y, \hat{y})$ measures the difference between the predicted class label $\hat{y}$ and the true class label $y$. In regression, $L(y, \hat{y})$ measures the difference between the predicted real number $\hat{y}$ and the true real number $y$. In clustering, $L(\hat{y})$ typically measures the coherence and separation of clusters. In dimensionality reduction, $L(\hat{y})$ typically measures the preservation of information in the input $\{\mathbf{x}_n\}$.
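
As a rough illustration of the supervised loss functions $L(y, \hat{y})$ described above, the sketch below implements two common choices: the 0-1 loss for classification and the mean squared error for regression. These particular losses and the function names are chosen here for illustration; many other losses exist.

```python
import numpy as np


def zero_one_loss(y, y_hat):
    """Classification: fraction of samples whose predicted class is wrong."""
    return float(np.mean(np.asarray(y) != np.asarray(y_hat)))


def mean_squared_error(y, y_hat):
    """Regression: average squared difference between truth and prediction."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))


# One of four class labels is predicted incorrectly.
print(zero_one_loss([1, 2, 2, 3], [1, 2, 3, 3]))   # 0.25
# Errors of 1 and -2 give squared errors 1 and 4, averaging to 2.5.
print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))  # 2.5
```

In both cases a smaller value means the predictions are closer to the labels, and a perfect prediction gives a loss of zero.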

### Machine learning methods 

This course focuses on machine learning methods that are most widely used in practice, while NOT aiming to be exhaustive in covering all the methods. The following table shows the machine learning methods that we will cover in this course. 

```{list-table} Machine learning methods 
:header-rows: 1
:widths: "auto"
:name: mlmethods-table
* - Method
  - Description
  - Example
* - **Linear regression**
  - A linear model for regression.
  - Predicting the price of a house.
* - **Logistic regression**
  - A linear model for classification.
  - Predicting whether a customer will default on a credit card payment.
* - **Support vector machine**
  - A maximum-margin model for classification, linear or (with kernels) non-linear.
  - Predicting whether a customer will default on a credit card payment.
* - **Decision tree**
  - A non-linear model for classification and regression.
  - Predicting whether a customer will default on a credit card payment.
* - **Random forest**
  - An ensemble of decision trees for classification and regression.
  - Predicting whether a customer will default on a credit card payment.
* - **Neural network**
  - A non-linear model for classification and regression.
  - Predicting whether a customer will default on a credit card payment.
* - **K-means**
  - A clustering algorithm.
  - Finding groups of similar customers.
* - **Principal component analysis**
  - A dimensionality reduction algorithm.
  - Finding a few directions that capture most of the variation in a dataset.
```    

No single method will perform well in all possible scenarios. Therefore, it is important to understand the assumptions and trade-offs of each method so that you can choose the right method for a given problem.


## Machine learning models

## Machine learning tasks

## Machine learning evaluation



## Machine learning process

The process of building, using and maintaining a machine learning system is 

### Typical steps

1. **Data collection**: Collect data from the real world.
2. 



### Machine learning deployment

### Machine learning reproducibility


## AI transparency

### Machine learning ethics

### Machine learning interpretability

### Machine learning transparency


### Exercises

min 3 max 5

## Quiz

_Not for now. To finish in the next cycle._ Complete [Quiz](https://forms.gle/8Q5Z7Z7Z7Z7Z7Z7Z7) to check your understanding of this topic. You are advised to score at least 50% to proceed to the next topic.

## Summary

Machine learning automates the process of learning a model from data that captures hidden patterns or relationships to help us make intelligent, data-driven predictions or decisions. It has proven very useful in many applications, such as image recognition, speech recognition, natural language processing, and many more. Understanding and applying machine learning through this course will help you acquire essential skills for solving many real-world problems.

In this topic, you learned:
- Machine learning learns a model from data to make predictions or decisions.
- 

## References and further reading

This material is based on the following resources:
- [Machine Learning for Beginners - A Curriculum](https://github.com/microsoft/ML-For-Beginners), Microsoft
- [Deep Learning for Molecules & Materials](https://dmol.pub/)