# Introduction to Machine Learning

The content in this part is primarily from:
- [Deep Learning for molecules & materials](https://dmol.pub/ml)

```{seealso}
1. [<ins>Introductory Machine Learning</ins>](https://ai.stanford.edu/~nilsson/mlbook.html)
2. Two reviews of machine learning in materials{cite}`fung2021benchmarking,balachandran2019machine`
3. A review of machine learning in computational chemistry{cite}`gomez2020machine`
4. A review of machine learning in metals{cite}`nandy2018strategies`
```

Machine learning is a method of modeling data, typically with predictive functions. It includes many techniques, but here we focus only on those needed to transition into deep learning. For example, random forests, support vector machines, and nearest neighbor methods are widely used and effective machine learning techniques, but they are not covered here.

```{admonition} Objectives
In this chapter:
  * Define features and labels
  * Distinguish between supervised and unsupervised learning
  * Understand what a loss function is and how it can be minimized with gradient descent
  * Understand what a model is and its connection to features and labels
  * Be able to cluster data and describe what the clusters tell us about the data
```

## The Ingredients 

Machine learning is the fitting of models $\hat{f}(\vec{x})$ to data $\vec{x}, y$ that we know came from some "data generation" process $f(\vec{x})$. First, some definitions:

**Features** 

&nbsp;&nbsp;&nbsp;&nbsp;A set of $N$ vectors $\{\vec{x}_i\}$ of dimension $D$. The elements can be reals, integers, etc.

**Labels** 

&nbsp;&nbsp;&nbsp;&nbsp;A set of $N$ integers or reals $\{y_i\}$. Each $y_i$ is usually a scalar.
  
**Labeled Data** 

&nbsp;&nbsp;&nbsp;&nbsp;A set of $N$ tuples $\{\left(\vec{x}_i, y_i\right)\}$.

**Unlabeled Data** 

&nbsp;&nbsp;&nbsp;&nbsp;A set of $N$ feature vectors $\{\vec{x}_i\}$ whose labels $y$ may be unknown.

**Data generation process**

&nbsp;&nbsp;&nbsp;&nbsp;The unseen process $f(\vec{x})$ that takes a given feature vector in and returns a real label $y$ (what we're trying to model)

**Model**

&nbsp;&nbsp;&nbsp;&nbsp;A function $\hat{f}(\vec{x})$ that takes a given feature vector in and returns a predicted $\hat{y}$

**Predictions**

&nbsp;&nbsp;&nbsp;&nbsp; $\hat{y}$, our predicted output for a given input $\vec{x}$.
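
The definitions above can be made concrete with a small sketch in NumPy. Everything here is illustrative and not from the source: the data are synthetic, `data_generation_process` is a hypothetical stand-in for the unseen $f(\vec{x})$ (which in practice we never get to write down), and the linear form of `model` is just one assumed choice of $\hat{f}(\vec{x})$.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 5, 3  # number of data points, feature dimension

# Features: N vectors {x_i}, each of dimension D (synthetic values)
features = rng.normal(size=(N, D))


def data_generation_process(x):
    # Hypothetical stand-in for the unseen f(x); unknown in real problems
    return 2.0 * x[..., 0] - x[..., 1] + 0.5 * x[..., 2]


# Labels: one scalar y_i per feature vector
labels = data_generation_process(features)

# Labeled data: N tuples (x_i, y_i)
labeled_data = list(zip(features, labels))


def model(x, w):
    # f_hat(x): an assumed linear model with adjustable weights w
    return x @ w


# Predictions y_hat from an untrained model (all weights zero)
w_guess = np.zeros(D)
predictions = model(features, w_guess)
```

With untrained weights the predictions do not match the labels; training a model means adjusting `w` so that `predictions` moves closer to `labels`.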


## Supervised Learning

**Supervised learning** means predicting $y$ from $\vec{x}$ with a model trained on data. It is *supervised* because we tell the algorithm what the labels are in our dataset. Another method we'll explore is **unsupervised learning**, where we do not tell the algorithm the labels. We'll see later that this supervised/unsupervised distinction can be more subtle, but this is a good working definition for now.
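
As a minimal sketch of this distinction, the example below fits a linear model to synthetic labeled data. The data, the hypothetical `true_w`, and the use of ordinary least squares via `np.linalg.lstsq` are all assumptions made for illustration; least squares is swapped in here only because it is a compact way to show a fit. The key point is which arrays each step is allowed to see.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic labeled data: 100 feature vectors of dimension 2
x = rng.normal(size=(100, 2))
true_w = np.array([1.5, -0.7])  # hypothetical weights of the data generation process
y = x @ true_w + 0.01 * rng.normal(size=100)  # labels with a little noise

# Supervised: the fitting step is given both features x and labels y
w_fit, *_ = np.linalg.lstsq(x, y, rcond=None)
y_hat = x @ w_fit  # predictions from the trained model

# Unsupervised: the computation only ever sees x, never y.
# Here we just summarize the structure of the features with a covariance matrix.
feature_cov = np.cov(x, rowvar=False)
```

Because the labels were available during fitting, `w_fit` should land close to `true_w`. The unsupervised step can only describe how the features themselves are distributed.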

To see an example, we will use AqSolDB{cite}`Sorkun2019`, a dataset of about 10,000 unique compounds with measured solubility in water (the label). The dataset also includes molecular properties (features) that we can use for machine learning. The solubility is reported in units of log molarity.
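
To get a feel for the label units, here is a short sketch that converts log molarity back to molarity. It assumes the logarithm is base 10 and the underlying unit is mol/L; the values in `log_s` are made up for illustration and are not taken from AqSolDB.

```python
import numpy as np

# Hypothetical label values in log molarity (not from the dataset)
log_s = np.array([-2.0, 0.0, 1.0])

# Assuming a base-10 logarithm, invert it to recover molarity in mol/L
molarity = 10.0 ** log_s
```

Under that assumption, a label of $-2$ corresponds to $10^{-2}$ mol/L, so each unit change in the label is a tenfold change in solubility.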