# Course Overview

Astronomy and astrophysics are witnessing dramatic increases in data volume as detectors, telescopes, and computers become ever more powerful. During the last decade sky surveys across the electromagnetic spectrum have collected hundreds of terabytes of astronomical data for hundreds of millions of sources. Over the next decade, data volumes will enter the petabyte domain, and provide accurate measurements for billions of sources. Standard analysis methods employed in astronomy often lag behind the rapid progress in statistics and computer science. This course provides the interface between astronomical data analysis problems and modern statistical methods.   

The primary focus of this course is on techniques related to Data Mining, Machine Learning and Knowledge Discovery, providing a practical introduction to these concepts through the use of the Python programming language. Most of the example applications of the techniques covered will rely on astronomical data, primarily coming from the [Sloan Digital Sky Survey (SDSS)](https://www.sdss.org/).


## Course overview

- 4 ECTS Credits
- 100% Assignment(s)
- Course material will be provided
- There is no need to buy any books; there are plenty of resources online
- You are encouraged to bring your laptop with you and work on the course material in class

If you have any questions you can contact Alessio Magro (alessio.magro@um.edu.mt) or Andrea DeMarco (andrea.demarco@um.edu.mt). You are also welcome to come by our office with any questions: Rooms 229 and 230, Maths and Physics building. It is advised to send an email and book a time slot to make sure that one of us will be available.

## Programming Environment

We will be using the Python programming language to demonstrate the techniques that will be discussed during the course. You need to have Python installed and set up on your machine (in either a conda environment or a virtual environment). You will need to have the following packages installed: `scipy matplotlib numpy tensorflow keras astropy sklearn opencv` (when installing with pip, `sklearn` is installed as `scikit-learn` and `opencv` as `opencv-python`).

If you have a MacBook with an M-series CPU, follow these instructions to set up your environment: [apple_instructions.pdf]
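
Once the environment is set up, a quick check along the following lines can confirm that the packages import correctly. This is only a minimal sketch: the helper name `check_environment` is hypothetical, and the module list assumes the usual import names (`sklearn` for scikit-learn, `cv2` for OpenCV).

```python
import importlib

# Import names differ from some install names:
# scikit-learn is imported as `sklearn`, OpenCV as `cv2`.
REQUIRED_MODULES = ["scipy", "matplotlib", "numpy", "tensorflow",
                    "keras", "astropy", "sklearn", "cv2"]


def check_environment(modules=REQUIRED_MODULES):
    """Return {module name: version string, or None if the import fails}."""
    report = {}
    for name in modules:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            report[name] = getattr(module, "__version__", "unknown")
    return report
```

Calling `check_environment()` in a Python session returns a dictionary in which any entry equal to `None` marks a package that still needs to be installed.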


## Data Mining, Machine Learning and Knowledge Discovery

*Data mining*, *machine learning* and *knowledge discovery* refer to research areas which can all be thought of as outgrowths of multivariate statistics. Their common themes are the analysis and interpretation of data, often relying on numerical methods. The rapid development of these fields over the last few decades was led by computer scientists, often in collaboration with statisticians. To an outsider, data mining, machine learning and knowledge discovery compared to statistics are akin to engineering compared to fundamental physics and chemistry: applied fields that "make things work". The techniques in all of these areas are well studied, and rest upon the same firm statistical foundation. While there are many varying definitions in the literature and on the web, we will adopt the following:
- **Data mining** is a set of techniques for analysing and describing structured data, for example, finding patterns in large data sets. Common methods include density estimation, unsupervised classification, clustering and principal component analysis. Often, the term *knowledge discovery* is used interchangeably with data mining. Data mining techniques result in an understanding of data set properties such as "My measurements of the size and temperature of stars form a well-defined sequence in the size-temperature diagram, though I find some stars in three clusters far away from this sequence". From the data mining point of view, it is not important to immediately contrast these data with a model (of stellar structure in this case), but rather to quantitatively describe the "sequence", as well as the behaviour of measurements falling "far away" from it. In short, data mining is about what the data themselves are telling us.
- **Machine learning** is an umbrella term for a set of techniques for interpreting data by comparing them to models for data behaviour (including the so-called nonparametric models), such as various regression methods, supervised classification methods, maximum likelihood estimators, and the Bayesian method. They are often called inference techniques, data-based statistical inferences, or just plain old "fitting". Following the above example, a physical stellar structure model can predict the position and shape of the so-called main sequence in the size-temperature diagram for stars, and when combined with galaxy formation and evolution models, the model can even predict the distribution of stars away from that sequence. There could then be more than one competing model, and the data might tell us whether (at least) one of them can be rejected.

Historically, the emphasis in data mining and knowledge discovery has been on what statisticians call *exploratory data analysis*: that is, learning qualitative features of the data that were not previously known. Much of this is captured under the heading of "unsupervised learning" techniques. The emphasis in machine learning has been on *prediction* of one variable based on the other variables - much of this is captured under the heading of "supervised learning".

## Course Introduction

Most of the techniques covered in this course concern the extraction of knowledge from data, where "knowledge" means a quantitative summary of data behaviour, and "data" essentially means the results of measurements. Let us start with the simple case of a scalar quantity, $x$, that is measured $N$ times, and use the notation $x_i$ for a single measurement with $i = 1, \ldots, N$. We will use $\{x_i\}$ to refer to the set of all $N$ measurements. In statistics, the data $x$ are viewed as realizations of a random variable $X$ (random variables are functions on the sample space, or the set of all outcomes of an experiment). In most cases, $x$ is a real number (e.g., stellar brightness measurement) but it can also take discrete values (e.g., stellar spectral type).

Possibly the most important single problem in data mining is how to estimate the distribution $h(x)$ from which values of $x$ are drawn (or which "generates" $x$). The function $h(x)$ quantifies the probability that a value lies between $x$ and $x + dx$, equal to $h(x)dx$, and is called the *probability density function* (pdf) or simply *probability distribution*. When $x$ is discrete, statisticians use the term "probability mass function". The integral of the pdf,

$$
\begin{equation*}
H(x) = \int^x_{-\infty}{h(x')dx'},
\end{equation*}
$$

is called the "cumulative distribution function" (cdf). The inverse of the cumulative distribution function is called the "quantile function".
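
As a small illustration of how the pdf, the cdf and the quantile function relate, the sketch below assumes that $h(x)$ is a standard Gaussian (an arbitrary choice made only for this example). It integrates the pdf numerically and compares the result with the closed-form cdf from Python's standard-library `statistics.NormalDist`, whose `inv_cdf` method plays the role of the quantile function. The helper names `gaussian_pdf` and `cdf_numeric` are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Population pdf h(x); a Gaussian is assumed purely for illustration."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def cdf_numeric(x, lower=-10.0, n_points=100_001):
    """Approximate H(x) by trapezoidal integration of h(x') from `lower` to x."""
    grid = np.linspace(lower, x, n_points)
    h = gaussian_pdf(grid)
    return float(np.sum(0.5 * (h[1:] + h[:-1]) * np.diff(grid)))


standard_normal = NormalDist(mu=0.0, sigma=1.0)

H_numeric = cdf_numeric(1.0)                    # numerical integral of the pdf
H_exact = standard_normal.cdf(1.0)              # closed-form cdf for comparison
x_recovered = standard_normal.inv_cdf(H_exact)  # quantile function inverts the cdf
median = standard_normal.inv_cdf(0.5)           # the 50% quantile

print(f"H(1) numeric = {H_numeric:.6f}, exact = {H_exact:.6f}")
print(f"quantile(H(1)) = {x_recovered:.6f}, median = {median:.6f}")
```

The same quantities are available in `scipy.stats.norm` as `cdf` and `ppf`, which may be more convenient when working with arrays.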

To distinguish the true pdf $h(x)$ (called the *population pdf*) from a data-derived estimate (called the *empirical pdf*), we shall call the latter $f(x)$ (and its cumulative counterpart $F(x)$). Hereafter, we will assume that both $h(x)$ and $f(x)$ are properly normalized probability density functions (though this is not a necessary assumption), that is

$$
\begin{equation*}
H(\infty) = \int^{+\infty}_{-\infty}{h(x')dx'} = 1,
\end{equation*}
$$

and analogously for $F(\infty)$. Given that data sets are never infinitely large, $f(x)$ can never be exactly equal to $h(x)$. Furthermore, we shall also consider cases when measurement errors for $x$ are not negligible and thus $f(x)$ will not tend to $h(x)$ even for an infinitely large sample (in this case $f(x)$ will be a "broadened" or "blurred" version of $h(x)$).
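
Both effects can be seen in a short simulation. The sketch below assumes a standard Gaussian population and Gaussian measurement errors with unit standard deviation (both arbitrary choices for illustration), and measures the largest gap between the empirical cdf $F(x)$ and the population cdf $H(x)$ at the sample points. The helper name `max_cdf_gap` is hypothetical.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(42)

# Assumed population: h(x) is a standard normal, so H(x) is its cdf.
H = np.vectorize(NormalDist(mu=0.0, sigma=1.0).cdf)


def max_cdf_gap(sample):
    """Largest |F(x_i) - H(x_i)| over the sorted sample points x_i."""
    x = np.sort(sample)
    F = np.arange(1, x.size + 1) / x.size  # empirical cdf at each sorted point
    return float(np.max(np.abs(F - H(x))))


# Finite sample size: the gap shrinks as N grows, but never reaches zero.
gap_small = max_cdf_gap(rng.normal(size=100))
gap_large = max_cdf_gap(rng.normal(size=100_000))

# Measurement errors (assumed sigma = 1) blur the distribution, so the gap
# stays large no matter how many points are collected.
true_values = rng.normal(size=100_000)
observed = true_values + rng.normal(scale=1.0, size=true_values.size)
gap_blurred = max_cdf_gap(observed)

print(f"N = 100:                 max gap = {gap_small:.4f}")
print(f"N = 100000:              max gap = {gap_large:.4f}")
print(f"N = 100000, with errors: max gap = {gap_blurred:.4f}")
```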

$f(x)$ is a *model* of the true distribution $h(x)$. Only samples from $h(x)$ are observed (i.e. data points); the functional form of $h(x)$, used to constrain the model $f(x)$, must be guessed. Such forms can range from relatively simple *parametric* models, such as a single Gaussian, to much more complicated and flexible *nonparametric* models, such as the superposition of many Gaussians. Once the functional form of the model is chosen, the best-fitting member of that model family, corresponding to the best settings of the model's parameters (such as the Gaussian's mean and standard deviation), must be chosen.
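
For the simplest parametric case, a single Gaussian, choosing the best-fitting member can be sketched as follows. The "true" parameters are hypothetical values used only to generate mock data, and maximum likelihood is assumed here as the fitting criterion; for a Gaussian it reduces to the sample mean and the (biased, `ddof=0`) sample standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population parameters, used only to generate mock data.
true_mu, true_sigma = 5.0, 2.0
x = rng.normal(true_mu, true_sigma, size=10_000)


def log_likelihood(data, mu, sigma):
    """Gaussian log-likelihood of the data for parameters (mu, sigma)."""
    z = (data - mu) / sigma
    return float(np.sum(-0.5 * z**2 - np.log(sigma * np.sqrt(2.0 * np.pi))))


# Maximum-likelihood estimates for a Gaussian model.
mu_hat = x.mean()
sigma_hat = x.std()  # ddof=0 gives the maximum-likelihood value

best = log_likelihood(x, mu_hat, sigma_hat)
shifted = log_likelihood(x, mu_hat + 0.1, sigma_hat)
widened = log_likelihood(x, mu_hat, 1.1 * sigma_hat)

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
print(f"log-likelihood: best = {best:.1f}, shifted = {shifted:.1f}, widened = {widened:.1f}")
```

Any other member of the Gaussian family, such as the shifted or widened versions above, gives a lower log-likelihood for the same data.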

A model can be as simple as an analytic function (e.g. a straight line), or it can be the result of complex simulations and other computations. Irrespective of the model's origin, it is important to remember that we can never prove that a model is correct; we can only test it against the data, and sometimes reject it. Furthermore, within the Bayesian logical framework, we cannot reject a model if it is the only one we have at our disposal - we can only compare models against each other and rank them by their success.

These analysis steps are often not trivial and can be quite complex. The simplest nonparametric model to determine $f(x)$ is a histogram: bin the $x$ data and count how many measurements fall in each bin. Very quickly several complications arise: First, what is the optimal choice of the bin size? Does it depend on the sample size or other measurement properties? How does one determine the count error in each bin, and can we treat them as Gaussian errors?
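
A minimal histogram estimate of $f(x)$ might look like the sketch below. It uses mock Gaussian data and two of the bin-width rules built into `numpy.histogram` (`"scott"` and `"fd"`, the Freedman-Diaconis rule), both of which depend on the sample size. The $\sqrt{N}$ count errors are the usual Poisson approximation, which is an assumption here and only resembles a Gaussian error when the counts per bin are large.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=1000)  # mock data, assumed Gaussian

# Compare the number of bins chosen by two common bin-width rules.
for rule in ("scott", "fd"):
    counts_rule, edges_rule = np.histogram(x, bins=rule)
    print(f"{rule:>5s}: {counts_rule.size} bins of width {edges_rule[1] - edges_rule[0]:.3f}")

counts, edges = np.histogram(x, bins="scott")
widths = np.diff(edges)

# Normalize the counts so that the histogram integrates to one.
f_hat = counts / (counts.sum() * widths)

# Poisson approximation for the count error in each bin.
count_err = np.sqrt(counts)
f_err = count_err / (counts.sum() * widths)

print(f"integral of f_hat = {np.sum(f_hat * widths):.3f}")
```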

An additional frequent complication is that the quantity $x$ is measured with some uncertainty or error distribution, $e(x)$, defined as the probability of measuring value $x$ if the true value is $\mu$,

$$
\begin{equation*}
e(x) = p(x|\mu, I),
\end{equation*}
$$

where $I$ stands for all other information that specifies the details of the error distribution, and "$|$" is read as "given". This should be interpreted as giving a probability $e(x)dx$ that the measurement will be between $x$ and $x+dx$.

For the commonly used Gaussian (or normal) error distribution, the probability is given by

$$
\begin{equation*}
p(x|\mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp \left( \frac{-(x-\mu)^2}{2\sigma^2} \right),
\end{equation*}
$$

where in this case $I$ is simply $\sigma$, the standard deviation (it is related to the uncertainty estimate popularly known as the "error bar"). The error distribution function could also include a bias $b$, and $(x - \mu)$ in the above expression would become $(x-b-\mu)$. That is, the bias $b$ is a *systematic* offset of all measurements from the true value $\mu$, and $\sigma$ controls their "scatter". How exactly the measurements are "scattered around" is described by the shape of $e(x)$. In astronomy, error distributions are often non-Gaussian or, even when they are Gaussian, $\sigma$ might not be the same for all measurements, and often depends on the signal strength (i.e. on $x$; each measured $x_i$ is accompanied by a different $\sigma_i$). These types of errors are called *heteroscedastic*, as opposed to *homoscedastic* errors, in which the error distribution is the same for each point.
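
The roles of $\mu$, $b$ and $\sigma$ can be made concrete with a short simulation. In the sketch below all numerical values are hypothetical, the function name `gaussian_error_pdf` is introduced only for this example, and the heteroscedastic $\sigma_i$ are drawn from a uniform range purely for illustration.

```python
import numpy as np


def gaussian_error_pdf(x, mu, sigma, bias=0.0):
    """Gaussian e(x) = p(x | mu, sigma) with an optional systematic offset."""
    z = (x - bias - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))


rng = np.random.default_rng(7)
mu_true = 10.0  # hypothetical true value
bias = 0.5      # hypothetical systematic offset b
n = 50_000

# Homoscedastic errors: the same sigma for every measurement.
sigma = 1.0
x_homo = mu_true + bias + rng.normal(0.0, sigma, size=n)

# Heteroscedastic errors: each x_i comes with its own sigma_i.
sigma_i = rng.uniform(0.2, 2.0, size=n)
x_hetero = mu_true + bias + sigma_i * rng.normal(size=n)

# Averaging does not remove the bias: the mean converges to mu + b, not mu.
print(f"mean of x_homo = {x_homo.mean():.3f} (mu = {mu_true}, b = {bias})")

# Scaling each residual by its own sigma_i recovers unit scatter.
z = (x_hetero - bias - mu_true) / sigma_i
print(f"scatter of scaled heteroscedastic residuals = {z.std():.3f}")

# Check that e(x) integrates to one on a wide grid.
grid = np.linspace(mu_true - 10.0, mu_true + 10.0, 20_001)
norm = float(np.sum(gaussian_error_pdf(grid, mu_true, sigma, bias)) * (grid[1] - grid[0]))
print(f"integral of e(x) = {norm:.4f}")
```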

Quantities described by $f(x)$ (e.g. astronomical measurements) can have different meanings in practice. A special case often encountered in practice is when the "intrinsic" or "true" (population pdf) $h(x)$ is a delta function, $\delta(x)$; that is, we are measuring some specific single-valued quantity (e.g. the length of a rod) and the "observed" (empirical pdf) $f(x)$, sampled by our measurements $x_i$, simply reflects their error distribution ($e(x)$). Another special case involves measurements with negligible measurement errors, but the underlying intrinsic or true pdf $h(x)$ has a finite width (as opposed to a delta function). Hence, in addition to the obvious effects of finite sample size, the difference between $f(x)$ and $h(x)$ can have *two very different origins*: at one extreme it can reflect our measurement error distribution (we can measure the same rod over and over again to improve our knowledge of the length), and at the other extreme it can represent measurements of a number of different rods (or the same rod at different times, if we suspect its length may vary with time) with measurement errors *much smaller* than the expected and/or observed length variation. Despite being extremes, these two limiting cases are often found in practice, and may sometimes be treated with the same techniques because of their mathematical similarity (e.g., when fitting a Gaussian to $f(x)$, we do not distinguish the case where its width is due to measurement errors from the case when we measure a population property using a finite sample).
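
The mathematical similarity of the two limiting cases is easy to reproduce. The sketch below uses hypothetical numbers for the rod length, the measurement error and the intrinsic scatter, chosen so that both samples have the same width even though that width has a different origin in each case.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
length = 100.0  # hypothetical rod length

# Case 1: one rod of fixed length (h is a delta function), measured n times
# with a Gaussian measurement error sigma_e.
sigma_e = 0.5
x_one_rod = length + rng.normal(0.0, sigma_e, size=n)

# Case 2: n different rods with intrinsic length scatter sigma_pop,
# each measured with negligible error.
sigma_pop = 0.5
x_many_rods = rng.normal(length, sigma_pop, size=n)

# The two samples have the same width and cannot be told apart from f(x) alone.
print(f"width, one rod:   {x_one_rod.std():.3f}")
print(f"width, many rods: {x_many_rods.std():.3f}")

# In case 1, repeated measurements pin down the single length to roughly
# sigma_e / sqrt(n); in case 2, the width is a property of the population.
length_hat = x_one_rod.mean()
length_err = x_one_rod.std(ddof=1) / np.sqrt(n)
print(f"estimated length = {length_hat:.3f} +/- {length_err:.3f}")
```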

The next level of complication when analysing $f(x)$ comes from the sample size and dimensionality. There can be a large number of different scalar quantities, such as $x$, that we measure for each object, and each of these quantities can have a different error distribution (and sometimes even different selection functions). In addition, some of these quantities may not be statistically independent. When there is more than one dimension, analysis can get complicated and is prone to pitfalls; when there are many dimensions, analysis is always complicated. If the sample size is measured in hundreds of millions, even the most battle-tested algorithms and tools can choke and become too slow.

Classification of a set of measurements is another important data analysis task. We can often "tag" each $x$ measurement by some "class descriptor" (such quantities are called "categorical" in the statistics literature). For example, we could be comparing the velocity of stars, $x$, around the Galaxy centre with subsamples of stars classified by other means as "halo" and "disk" stars. In such cases we would determine two independent distributions $f(x)$ - one for each of these two subsamples. Any new measurement of $x$ could then be classified as a "halo" or "disk" star. This simple example can become nontrivial when $x$ is heteroscedastic or multidimensional, and also raises the question of completeness vs. purity trade-offs (e.g., do we care more about never ever misclassifying a halo star, or do we want to minimize the total number of misclassifications for both disk and halo stars?). Even in the case of discrete variables, such as "halo" and "disk" stars, or "star" vs. "galaxy" in astronomical images, we can assign them a continuous variable, which often is interpreted as the probability of belonging to a class. At first it may be confusing to talk about the probability that an object is a star vs. being a galaxy because it cannot be both at the same time. However, in this context we are talking about *our current state of knowledge about a given object* and its classification, which can be elegantly expressed using the framework of probability.
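
A minimal sketch of such a probabilistic classification is given below. It assumes that the two $f(x)$ are Gaussians with made-up means, widths and class fractions (these are not real Galactic values), and combines them with Bayes' rule to obtain the probability that a star with velocity $x$ belongs to the halo. The names `prob_halo` and `completeness_purity` are hypothetical.

```python
import numpy as np


def gaussian_pdf(x, mu, sigma):
    """Gaussian pdf used as the per-class model f(x)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


# Hypothetical velocity distributions (km/s) and class fractions.
disk = {"mu": 220.0, "sigma": 30.0, "prior": 0.9}
halo = {"mu": 0.0, "sigma": 100.0, "prior": 0.1}


def prob_halo(x):
    """Probability that a star with velocity x is a halo star (Bayes' rule)."""
    w_halo = halo["prior"] * gaussian_pdf(x, halo["mu"], halo["sigma"])
    w_disk = disk["prior"] * gaussian_pdf(x, disk["mu"], disk["sigma"])
    return w_halo / (w_halo + w_disk)


# Mock sample with known labels, used to evaluate the classifier.
rng = np.random.default_rng(11)
n = 20_000
is_halo = rng.random(n) < halo["prior"]
x = np.where(is_halo,
             rng.normal(halo["mu"], halo["sigma"], size=n),
             rng.normal(disk["mu"], disk["sigma"], size=n))


def completeness_purity(threshold):
    """Completeness and purity of the sample selected with p(halo) > threshold."""
    selected = prob_halo(x) > threshold
    true_positives = np.sum(selected & is_halo)
    completeness = true_positives / is_halo.sum()
    purity = true_positives / max(selected.sum(), 1)
    return float(completeness), float(purity)


for threshold in (0.5, 0.9):
    c, p = completeness_purity(threshold)
    print(f"threshold {threshold}: completeness = {c:.3f}, purity = {p:.3f}")
```

Raising the probability threshold can only shrink the selected sample, so completeness cannot increase. How much purity is gained in return depends on how strongly the two distributions overlap.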