# Intro to Machine Learning & Transparency

**Question**: What is machine learning and what is AI transparency?

```{admonition} Objectives
- Understand real-world datasets used in this course
- Understand components of typical machine learning systems
- Understand steps in typical machine learning processes
- Understand AI transparency definition & taxonomy
```

**Expected time to complete**: 4 hours

Machine learning takes data as **input** to learn a **model**, a mathematical representation of the data. The learned model can then take new data as input and generate **output**, either to make predictions about the new data or to explore and explain it. In this chapter, we will learn about the components of typical machine learning systems and the steps in typical machine learning processes, as well as the definition and taxonomy of AI transparency. We will start with the real-world datasets used in this course to see how machine learning is used in the real world.
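
The input, model, and output roles described above can be illustrated with a minimal sketch. It assumes a toy setting that is not part of the course material: the data values are made up, the model is a straight line fitted by ordinary least squares, and the function names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Learn a model (slope, intercept) from input data by least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance-like and variance-like sums for the closed-form solution.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict(model, x_new):
    """Use the learned model to generate output for new data."""
    slope, intercept = model
    return slope * x_new + intercept


# Made-up input data that follows y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

model = fit_line(xs, ys)          # input data -> learned model
prediction = predict(model, 4.0)  # new data -> output
print(model, prediction)
```

Here the learned model is just two numbers. The models covered later in the course can be far more complex, but they play the same role between input and output.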

```{note}
If you do not have prior knowledge of or experience with linear algebra, Python programming, or probability and statistics, please go through {doc}`00-prerequisites` before starting this course.
```

## Data and machine learning problems

This section introduces the real-world datasets used in this course. We will also learn about the machine learning problems that can be solved using these datasets.

### Real-world datasets in this course

In this course, we will use real-world datasets from the textbook to introduce machine learning from the perspective of AI transparency. You can click on the name of a dataset to see the actual data.

```{list-table} Datasets used in this course, from the textbook
:header-rows: 1
:widths: "auto"
:name: datasets-table
* - Name
  - Data provided
  - Machine learning problem
* - [Auto](https://github.com/pykale/transparentML/blob/main/data/Auto.csv)
  - Gas mileage, horsepower, and other information for cars.
  - Predict gas mileage for a car.
* - [Bikeshare](https://github.com/pykale/transparentML/blob/main/data/Bikeshare.csv)
  - Hourly usage of a bike sharing program in Washington, DC.
  - Predict the number of bikes rented per hour.
* - [Boston](https://github.com/pykale/transparentML/blob/main/data/Boston.csv)
  - Housing values and other information about Boston census tracts.
  - Predict the median value of a house. 
* - [BrainCancer](https://github.com/pykale/transparentML/blob/main/data/BrainCancer.csv)
  - Survival times for patients diagnosed with brain cancer.
  - Predict the survival time for a patient.
* - [Caravan](https://github.com/pykale/transparentML/blob/main/data/Caravan.csv)
  - Information about individuals offered caravan insurance.
  - Predict whether an individual will buy caravan insurance.
* - [Carseats](https://github.com/pykale/transparentML/blob/main/data/Carseats.csv)
  - Information about car seat sales in 400 stores.
  - Predict the sales of a car seat.
* - [College](https://github.com/pykale/transparentML/blob/main/data/College.csv)
  - Demographic characteristics, tuition, and more for USA colleges.
  - Predict the number of applications received by a college.
* - [Credit](https://github.com/pykale/transparentML/blob/main/data/Credit.csv)
  - Information about credit card debt for 10,000 customers.
  - Predict the amount of credit card debt for a customer.
* - [Default](https://github.com/pykale/transparentML/blob/main/data/Default.csv) 
  - Customer default records for a credit card company.
  - Predict whether a customer will default on a credit card payment.
* - [Fund](https://github.com/pykale/transparentML/blob/main/data/Fund.csv)  
  - Returns of 2,000 hedge fund managers over 50 months.
  - Predict the returns of a hedge fund manager.
* - [Hitters](https://github.com/pykale/transparentML/blob/main/data/Hitters.csv)  
  - Records and salaries for baseball players.
  - Predict the salary of a baseball player.
* - [Khan](https://github.com/pykale/transparentML/blob/main/data/Khan.json)  
  - Gene expression measurements for four cancer types.
  - Predict the cancer type for a patient.
* - [NCI60](https://en.wikipedia.org/wiki/NCI-60)  
  - Gene expression measurements for 64 cancer cell lines.
  - Find clusters or groups among the cell lines for personalised treatment.
* - [OJ](https://github.com/pykale/transparentML/blob/main/data/OJ.csv)  
  - Sales information for Citrus Hill and Minute Maid orange juice.
  - Predict the sales of orange juice.
* - [Portfolio](https://github.com/pykale/transparentML/blob/main/data/Portfolio.csv)  
  - Past values of financial assets, for use in portfolio allocation.
  - Predict the value of a financial asset.
* - [Publication](https://github.com/pykale/transparentML/blob/main/data/Publication.csv)  
  - Time to publication for 244 clinical trials.
  - Predict the time to publication for a clinical trial.
* - [Smarket](https://github.com/pykale/transparentML/blob/main/data/Smarket.csv)  
  - Daily percentage returns for S&P 500 over a 5-year period.
  - Predict whether the stock index will increase or decrease.
* - [USArrests](https://github.com/pykale/transparentML/blob/main/data/USArrests.csv)  
  - Crime statistics per 100,000 residents in the 50 states of the USA.
  - Predict the crime rate in a state.
* - [Wage](https://github.com/pykale/transparentML/blob/main/data/Wage.csv)  
  - Income survey data for men in the central Atlantic region of the USA.
  - Predict the income of men.
* - [Weekly](https://github.com/pykale/transparentML/blob/main/data/Weekly.csv)  
  - 1,089 weekly stock market returns for 21 years.  
  - Predict the stock market return in a week.
``` 
<!-- * - [NYSE](https://github.com/pykale/transparentML/blob/main/data/.csv)  
  - Returns, volatility, and volume for the New York Stock Exchange.
  - Predict the returns of a stock. -->
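
Besides viewing a dataset in the browser, the CSV datasets linked in the table can also be loaded programmatically. The sketch below is one possible way to do this, under some assumptions: it relies on GitHub's `raw.githubusercontent.com` URL pattern for raw files, it assumes `pandas` is installed, and the helper names `to_raw_url` and `load_csv_dataset` are hypothetical. It does not apply to Khan (a JSON file) or NCI60 (a Wikipedia link).

```python
def to_raw_url(blob_url):
    """Convert a GitHub 'blob' page URL into the matching raw-file URL."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError(f"Not a GitHub blob URL: {blob_url}")
    # Split 'owner/repo/blob/branch/path' into 'owner/repo' and 'branch/path'.
    owner_repo, branch_and_path = blob_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{branch_and_path}"


def load_csv_dataset(blob_url):
    """Download a CSV dataset into a DataFrame (assumes pandas and network access)."""
    import pandas as pd  # imported here so to_raw_url works without pandas

    return pd.read_csv(to_raw_url(blob_url))


auto_page = "https://github.com/pykale/transparentML/blob/main/data/Auto.csv"
print(to_raw_url(auto_page))

# Example usage (requires pandas and an internet connection):
# auto = load_csv_dataset(auto_page)
# print(auto.head())
```

Calling `head()` on the resulting DataFrame shows the first few rows, which is often a more convenient view than the browser for the larger files.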

The datasets above show the diverse range of problems that machine learning can solve, and they are only the tip of the iceberg. Next, let us look at how machine learning problems are typically defined.

### Machine learning problems

Machine learning problems can be broadly classified into two categories: **supervised learning** and **unsupervised learning**. In supervised learning, the data is labelled, and the goal is to predict the label for new data. In unsupervised learning, the data is not labelled, and the goal is to find patterns in the data. Machine learning models can generate two types of outputs: discrete and continuous. Discrete outputs are categorical, such as the class of an image. Continuous outputs are numerical, such as the price of a house. 

Thus, supervised learning can be further divided into classification and regression. In **classification**, the output is a discrete value, such as a label or a class. In **regression**, the output is a continuous value, such as a number or a probability. Unsupervised learning can be further divided into clustering and dimensionality reduction. In **clustering**, the goal is to find groups of similar data points, so the output is a discrete value, such as a cluster index. In **dimensionality reduction**, the goal is to find a lower-dimensional representation of the data, so the output consists of continuous values, such as a vector.
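
To make the difference between using labels and not using labels concrete, here is a small sketch on made-up one-dimensional data. It is only an illustration under stated assumptions: the supervised part is a nearest-mean classifier, the unsupervised part is a two-cluster variant of k-means, and all function names and data values are hypothetical rather than taken from the textbook.

```python
from statistics import mean


def fit_nearest_mean(xs, labels):
    """Supervised: learn one mean per label from labelled data."""
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: mean(values) for label, values in groups.items()}


def classify(class_means, x):
    """Classification: predict the label whose mean is closest to x."""
    return min(class_means, key=lambda label: abs(x - class_means[label]))


def two_means(xs, n_iter=10):
    """Unsupervised: assign unlabelled numbers to two clusters (1-D k-means, k=2)."""
    centres = [min(xs), max(xs)]  # simple initialisation from the extremes
    assignment = [0] * len(xs)
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        assignment = [
            0 if abs(x - centres[0]) <= abs(x - centres[1]) else 1 for x in xs
        ]
        # Move each centre to the mean of its assigned points.
        for k in (0, 1):
            members = [x for x, a in zip(xs, assignment) if a == k]
            if members:
                centres[k] = mean(members)
    return assignment


# Made-up data: three small values and three large values.
xs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["low", "low", "low", "high", "high", "high"]

class_means = fit_nearest_mean(xs, labels)  # uses the labels
predicted_label = classify(class_means, 4.0)
cluster_index = two_means(xs)               # ignores the labels
print(predicted_label, cluster_index)
```

The classifier outputs a named label because it was given labels to learn from, whereas the clustering routine can only output cluster indices whose meaning must be interpreted afterwards.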

The following table summarises these four types of machine learning problems according to the definitions above.

```{list-table} Supervised and unsupervised machine learning
:header-rows: 1
:widths: "auto"
:name: mlproblems-table
* - Machine Learning
  - Supervised
  - Unsupervised
* - **Discrete output**
  - Classification
  - Clustering
* - **Continuous output**
  - Regression
  - Dimensionality reduction
```
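
The two-by-two taxonomy in the table can also be written as a small lookup, which may help when working through the datasets. This is a sketch only: the function name `problem_type` and its boolean parameters are hypothetical, and the example calls reuse problem descriptions from the datasets table.

```python
def problem_type(has_labels, discrete_output):
    """Map the two questions in the taxonomy to a problem type."""
    taxonomy = {
        (True, True): "classification",
        (True, False): "regression",
        (False, True): "clustering",
        (False, False): "dimensionality reduction",
    }
    return taxonomy[(bool(has_labels), bool(discrete_output))]


# Default: predict whether a customer will default (labelled, discrete).
print(problem_type(has_labels=True, discrete_output=True))
# Auto: predict gas mileage for a car (labelled, continuous).
print(problem_type(has_labels=True, discrete_output=False))
# NCI60: find clusters among the cell lines (unlabelled, discrete).
print(problem_type(has_labels=False, discrete_output=True))
```

Asking these two questions of any dataset, namely whether labels are available and whether the desired output is discrete, is a quick first step towards naming the problem.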

### Exercises

1. Choose three or more datasets that interest you from {numref}`datasets-table`. Click on the name of each chosen dataset to explore it and get a sense of the data. The larger datasets may not render nicely, or at all, in the browser. Using the terminology in {numref}`mlproblems-table`, write down the machine learning problems that could be solved with each of your chosen datasets. Click below for a sample answer.

   ```{toggle}
   Sample answer: to be completed.
   ```
2. To be completed
  
    ```{toggle}
    Sample answer: to be completed.
    ```

## Machine learning ingredients

## Machine learning systems

## Machine learning models

## Machine learning tasks

## Machine learning evaluation

## Machine learning deployment

## Machine learning ethics

## Machine learning interpretability

## Machine learning transparency

## Machine learning reproducibility

## AI, machine learning, and deep learning



### Exercises

min 3 max 5

## Quiz

_Not for now. To finish in the next cycle._ Complete [Quiz](https://forms.gle/8Q5Z7Z7Z7Z7Z7Z7Z7) to check your understanding of this topic. You are advised to score at least 50% to proceed to the next topic.

## Summary

In this topic, you learned how to:
- Use ...

## References and further reading

This material is based on the following resources:
- Reference 1