# Machine learning in Python

# Machine learning basics

## What is machine learning?
Fundamentally, machine learning involves building mathematical models that help understand data. The learning component arises from the way models have tunable parameters that can be adjusted to fit to some training data. Ideally, once these models have been fit to previously seen training data, they are able to predict some aspect of newly observed data.

This lecture aims to teach you the basic concepts of machine learning and typical machine learning workflows, and works through code-based examples for simple problems using Python and [scikit-learn](https://scikit-learn.org).

## Three different types of machine learning
<BR CLEAR="left">
<img src="images_ML/03_01.png" style="width: 400px;" align="left"/>
<BR CLEAR="left">
    
    

## Unsupervised learning:  Discovering structure in data
Unsupervised learning involves models that describe data without reference to any known labels.


### Clustering - Inferring labels on unlabeled data
One common case of unsupervised learning is clustering. Here data is automatically assigned to some number of discrete groups. For example, we might have some two-dimensional data like that shown in the following figure:
<BR CLEAR="left">
<img src="images_ML/03_02.png" style="width: 300px;" align="left"/>
<BR CLEAR="left">

By eye, it is very clear that the points belong to three distinct and separate classes. Given the input of points, a clustering algorithm will use the intrinsic structure of the data to determine which points are related to each other. An example algorithm for this is [k-means clustering](https://scikit-learn.org/stable/modules/clustering.html#k-means).
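As a minimal sketch of k-means in scikit-learn, the example below clusters synthetic two-dimensional data. The data is generated with `make_blobs` and is an assumption standing in for the points in the figure; the centre positions and spread are chosen purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data with three well-separated groups (illustrative only;
# this is not the data shown in the figure).
X, _ = make_blobs(
    n_samples=150,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

# The number of clusters has to be chosen up front. n_init is set
# explicitly because its default differs between scikit-learn versions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print(cluster_ids[:10])         # cluster index assigned to the first ten points
print(kmeans.cluster_centers_)  # one (x, y) centre per cluster
```

Note that the cluster indices are arbitrary: k-means recovers the grouping, not any meaningful name for each group.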




### Dimensionality reduction and manifold learning - Inferring structure on unlabeled data
Dimensionality reduction is another example of unsupervised learning. However, unlike clustering, dimensionality reduction attempts to extract a lower-dimensional representation of the data whilst preserving the relevant qualities of the full dataset. In the following examples, we can see how different dimensionality reduction algorithms transform 3D data into a 2D representation.

<BR CLEAR="left">
<img src="images_ML/03_03.png" style="width: 800px;" align="left"/>
<BR CLEAR="left">
<img src="images_ML/03_04.png" style="width: 800px;" align="left"/>
<BR CLEAR="left">

Scikit-learn has a large number of algorithms for [dimensionality reduction](https://scikit-learn.org/stable/modules/decomposition.html#decompositions).
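One of the simplest of these algorithms is principal component analysis (PCA). The sketch below projects synthetic 3D points onto two dimensions. The data is an assumption made for illustration: points that lie close to a flat plane, so very little information is lost in the projection.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic 3D data lying close to a 2D plane: the third coordinate is
# a linear combination of the first two plus a little noise.
xy = rng.normal(size=(200, 2))
z = 0.5 * xy[:, 0] - 0.3 * xy[:, 1] + rng.normal(scale=0.05, size=200)
X = np.column_stack([xy, z])

# Keep the two directions that capture the most variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                           # 200 points, now with 2 features each
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

PCA is linear, so it works well for data like this. Curved structures generally need one of the manifold learning methods instead.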


## Supervised learning: Making predictions

### Classification - Predicting discrete labels
Classification algorithms are supervised learning methods that aim to assign unseen data to one of a fixed set of labels. In this example, we have two-dimensional data. That is, we have two features for each point, represented by the (x,y) positions of the points on the plot.

Further, we have one of two class labels for each point - represented by the colors of the points. From these features and labels, we would like to create a model that will let us decide whether a new point should be labeled "blue" or "red."

There are many classification algorithms available, each with their own strengths and weaknesses, but in this example, we will use a very simple linear classifier to separate the points. To do so, we will make the assumption that the two groups can be separated by drawing a straight line through the plane between them, such that points on each side of the line fall in the same group. The parameters of this model are the numbers that describe the position and orientation of the line. The optimal values for these parameters are learned from the data (this is the "learning" in machine learning), which is often called training the model.
<BR CLEAR="left">
<img src="images_ML/03_05.png" style="width: 300px;" align="left"/>
<BR CLEAR="left">

This is the basic idea of a classification task in machine learning, where "classification" indicates that the data has known and discrete class labels. In the above example, this may look fairly trivial. However, most real datasets are significantly more complex, due to the size of the dataset and the number of extra dimensions. For example, a protein sequence of length 100 has 2,000 dimensions if each residue is one-hot encoded over the 20 standard amino acids. That being said, the advantage of the more advanced machine learning algorithms is that they are able to handle such sizes and dimensionality, whilst generalising to unseen data.
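The figure of 2,000 dimensions is consistent with one-hot encoding, where each of the 100 residues becomes a vector of length 20 (100 × 20 = 2,000). The sketch below assumes this encoding, since the text does not say which one is meant. The helper `one_hot_encode` and the sequence are hypothetical and used only for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids


def one_hot_encode(sequence):
    """Return a flat vector with one 20-element indicator block per residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for position, aa in enumerate(sequence):
        encoded[position, index[aa]] = 1.0
    return encoded.ravel()


# A made-up sequence of length 100 (a 10-residue pattern repeated 10 times).
sequence = "MKTAYIAKQR" * 10
features = one_hot_encode(sequence)

print(len(sequence))   # 100
print(features.shape)  # (2000,)
```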

Scikit-learn has a large number of common algorithms for [classification](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning).
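A linear classifier of the kind described above can be sketched as follows. The choice of `LogisticRegression` is an assumption, since the text does not name the classifier used for the figure. The data is synthetic, with label 0 standing in for "blue" and label 1 for "red".

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Two synthetic groups of 2D points; y holds the class label (0 or 1).
X, y = make_blobs(
    n_samples=100,
    centers=[[-3, -3], [3, 3]],
    cluster_std=1.0,
    random_state=0,
)

# Training learns the parameters of the separating line
# w0 * x + w1 * y + b = 0.
clf = LogisticRegression()
clf.fit(X, y)
print(clf.coef_, clf.intercept_)  # (w0, w1) and b

# Predict the label of two new, unseen points, one near each group.
new_points = [[-2.5, -3.0], [3.0, 2.5]]
print(clf.predict(new_points))
```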


### Regression - Predicting continuous labels
In contrast with classification, which uses discrete labels, regression tasks are those where the labels are continuous quantities.

Consider the following plot. As with the classification problem, the data is two-dimensional. However, unlike classification, this regression problem aims to predict the value of y for a given value of x.
In this example, we have fit a simple linear regression model to the data.
<BR CLEAR="left">
<img src="images_ML/03_06.png" style="width: 300px;" align="left"/>
<BR CLEAR="left">

Many of you will have come across regression models, such as linear and logistic regression, when dealing with biological data.
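Fitting a simple linear regression model might look like the sketch below. The data is synthetic: it is generated from the line y = 2x + 1 with added noise, an assumption chosen so that the learned slope and intercept can be compared with known values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data scattered around the line y = 2x + 1.
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# scikit-learn expects a 2D feature matrix of shape (n_samples, n_features),
# so the single feature is reshaped into one column.
model = LinearRegression()
model.fit(x.reshape(-1, 1), y)
print(model.coef_[0], model.intercept_)  # should be close to 2 and 1

# Predict y for a new value of x.
y_new = model.predict(np.array([[4.0]]))
print(y_new)
```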

### Solving interactive problems with reinforcement learning
<BR CLEAR="left">
<img src="images_ML/03_07.png" style="width: 400px;" align="left"/>
<BR CLEAR="left">




## A roadmap for building machine learning systems

<BR CLEAR="left">
<img src="images_ML/03_08.png" style="width: 1000px;" align="left"/>
<BR CLEAR="left">



## Scikit-learn

[Scikit-learn](https://scikit-learn.org/stable/index.html) is a free machine learning library for Python that provides a wide range of supervised and unsupervised learning algorithms through a consistent interface.
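The consistent interface means that different estimators are trained and used with the same method names, such as `fit`, `predict` and `score`. A small sketch, using the iris dataset bundled with scikit-learn and three classifiers picked arbitrarily for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
]

# Every estimator is trained and scored with exactly the same calls.
scores = {}
for model in models:
    model.fit(X, y)
    # Accuracy on the training data itself, so an optimistic estimate.
    scores[type(model).__name__] = model.score(X, y)

print(scores)
```

Because the calls are identical, swapping one algorithm for another usually means changing a single line.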

Scikit-learn is built on NumPy and SciPy, and is typically used alongside the wider SciPy (Scientific Python) stack:
* __SciPy__: Fundamental library for scientific computing
* __NumPy__: Base n-dimensional array package
* __Matplotlib__: Comprehensive 2D/3D plotting
* __IPython__: Enhanced interactive console
* __Sympy__: Symbolic mathematics
* __Pandas__: Data structures and analysis

Extensions or modules for SciPy are called [SciKits](https://scikits.appspot.com/scikits). As such, the module that provides learning algorithms is named scikit-learn.

The library strives for a level of robustness and support required for use in production environments. This means there is a deep focus on ease of use, code quality, collaboration, documentation and performance.

Scikit-learn is focused on modelling data. It is not intended for loading and manipulating your own datasets; for this, you will need to use [NumPy](https://www.numpy.org/), [SciPy](https://www.scipy.org/), [Pandas](https://pandas.pydata.org/) and/or [scikit-image](https://scikit-image.org/).

Scikit-learn provides functions and tools for the following:

* __Clustering__: for grouping unlabeled data, using algorithms such as KMeans.
* __Cross Validation__: for estimating the performance of supervised models on unseen data.
* __Datasets__: for test datasets and for generating datasets with specific properties when investigating model behavior.
* __Dimensionality Reduction__: for reducing the number of attributes in data for summarization, visualization and feature selection, using methods such as principal component analysis (PCA).
* __Ensemble methods__: for combining the predictions of multiple supervised models.
* __Feature extraction__: for defining attributes in image and text data.
* __Feature selection__: for identifying meaningful attributes from which to create supervised models.
* __Parameter Tuning__: for getting the most out of supervised models.
* __Manifold Learning__: for summarizing and depicting complex multi-dimensional data.
* __Supervised Models__: a vast array of algorithms, including but not limited to generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines and decision trees.
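Several of the items above can be combined in a few lines. The sketch below uses a bundled dataset, a supervised model, cross validation and parameter tuning together. The choice of `SVC` and the parameter grid are assumptions made for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Datasets: a small bundled test dataset.
X, y = load_iris(return_X_y=True)

# Parameter tuning: try every combination in the grid, scoring each
# one with 5-fold cross validation.
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # best combination found in the grid
print(search.best_score_)   # its mean cross-validated accuracy
```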

When approaching a machine learning problem, Scikit-learn provides a convenient flowchart that can guide you in algorithm selection.

<BR CLEAR="left">
<img src="images_ML/03_09.png" style="width: 1000px;" align="left"/>
<BR CLEAR="left">

## Preprocessing data

### Data representation in Scikit-learn

* Features/targets
* Data dimensions
* Biological sequences
* Images
* Time series measurements


### Scaling and normalisation
* Applies to input and targets
* Standardisation
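Standardisation rescales each feature to zero mean and unit variance. A minimal sketch using `StandardScaler` on a made-up feature matrix whose two columns are on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: the second column is 100 times larger than the first.
X = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 400.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

In a real workflow the scaler would normally be fitted on the training data only and then applied to the test data with `scaler.transform`, so that no information from the test set leaks into training.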


### Whitening data
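Whitening transforms the features so that they are uncorrelated and each has unit variance. One way to do this in scikit-learn is `PCA` with `whiten=True`. The sketch below is an illustration under that assumption, using two synthetic, strongly correlated features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two strongly correlated synthetic features.
a = rng.normal(size=500)
X = np.column_stack([a, 2.0 * a + rng.normal(scale=0.5, size=500)])

# Rotate onto the principal components and rescale each to unit variance.
X_white = PCA(whiten=True).fit_transform(X)

# The covariance matrix of the whitened data is approximately the identity.
cov = np.cov(X_white, rowvar=False)
print(np.round(cov, 3))
```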


### Missing values
* Remove all rows
* Fill in with mean, zeros, predicted values
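The options above can be sketched with NumPy and scikit-learn's `SimpleImputer`. The small array is made up for illustration. Filling in with predicted values is not shown here.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with two missing values, marked as NaN.
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [5.0, np.nan],
])

# Option 1: remove every row that contains a missing value.
complete_rows = X[~np.isnan(X).any(axis=1)]

# Option 2: replace missing values with the mean of their column.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Option 3: replace missing values with zeros.
X_zero = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

print(complete_rows)  # only the first row survives
print(X_mean)         # both column means are 3.0
print(X_zero)
```

Removing rows is the simplest option but, as here, it can throw away most of a small dataset.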


### Dataset balance
* SMOTE
* Oversampling
* Collect more data
* Note: imbalanced data can cause problems.
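Simple random oversampling can be sketched with `sklearn.utils.resample`, which draws samples from the minority class with replacement until the classes are the same size. The 90/10 class split is made up for illustration. SMOTE itself is not part of scikit-learn; it is provided by the separate imbalanced-learn package and is not shown here.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Synthetic imbalanced dataset: 90 samples of class 0, 10 of class 1.
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

X_majority, y_majority = X[y == 0], y[y == 0]
X_minority, y_minority = X[y == 1], y[y == 1]

# Draw minority samples with replacement until both classes match in size.
X_up, y_up = resample(
    X_minority,
    y_minority,
    replace=True,
    n_samples=len(y_majority),
    random_state=0,
)

X_balanced = np.vstack([X_majority, X_up])
y_balanced = np.concatenate([y_majority, y_up])

print(np.bincount(y))           # [90 10]
print(np.bincount(y_balanced))  # [90 90]
```

Oversampling should be applied to the training data only, after splitting, otherwise copies of the same sample can end up in both the training and test sets.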




## Examples

- Arbitrary data
  - Classify colours as warm, hot, cool, dull etc.



## Data exploration
* Feature selection



## Unsupervised learning

- Mathematical
- PCA
- t-SNE

- Hierarchical
- DBSCAN


- Read docs for more examples
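As a rough sketch of two of the clustering methods listed above, the example below runs DBSCAN and hierarchical (agglomerative) clustering on the same synthetic "two moons" dataset. The dataset and the parameter values (`eps`, `min_samples`, `n_clusters`) are assumptions chosen for illustration.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons

# Two interleaving half circles with a little noise.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# DBSCAN groups points that lie in dense regions. The number of clusters
# is not specified in advance, and points labelled -1 are treated as noise.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
n_db_clusters = len(set(db_labels.tolist()) - {-1})
print(n_db_clusters)

# Hierarchical clustering merges points bottom-up until the requested
# number of clusters remains.
agg_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(agg_labels[:10])
```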




## Supervised learning

- Logistic regression
- k-nearest neighbours
- SVMs
- XGBoost


## Algorithm selection





## Model training
* Cross validation
* Training/validation/test sets
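One common arrangement is to hold out a test set, use cross validation on the remaining training data to estimate performance, and touch the test set only once at the end. A minimal sketch, using the bundled iris dataset and a logistic regression model chosen purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as a final test set, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)

# 5-fold cross validation on the training data: each fold takes a turn
# as the validation set while the model is trained on the other four.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(cv_scores.mean())

# Final check on the held-out test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```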





## Evaluating the model

### Baselines
* Linear regression/logistic regression
* Null hypothesis
* Random prediction
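A trivial baseline gives a reference point for judging a real model. The sketch below compares `DummyClassifier`, which always predicts the most frequent class, with a logistic regression model on the bundled breast cancer dataset. The dataset and the pipeline are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Baseline: ignore the features and always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent")

# Simple model baseline: standardise the features, then fit logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

baseline_score = cross_val_score(baseline, X, y, cv=5).mean()
model_score = cross_val_score(model, X, y, cv=5).mean()

print(baseline_score)  # accuracy achievable without learning anything
print(model_score)
```

A more complex model is only worth its cost if it clearly beats both of these numbers.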

### Loss metrics
Different tasks (classification/regression) require different loss metrics.
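For illustration, the sketch below computes two common metrics for each task type with `sklearn.metrics`. The true and predicted values are made up, and which metrics suit a given problem is a judgement call that this example does not settle.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
)

# Classification: compare discrete predicted labels with the true labels.
y_true_cls = [0, 1, 1, 0, 1]
y_pred_cls = [0, 1, 0, 0, 1]
accuracy = accuracy_score(y_true_cls, y_pred_cls)  # 4 of 5 correct
f1 = f1_score(y_true_cls, y_pred_cls)              # balances precision and recall

# Regression: measure how far continuous predictions are from the truth.
y_true_reg = [2.0, 3.0, 5.0]
y_pred_reg = [2.5, 3.0, 4.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)   # mean of squared errors
mae = mean_absolute_error(y_true_reg, y_pred_reg)  # mean of absolute errors

print(accuracy, f1)
print(mse, mae)
```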






## Ensemble models





## Keras and neural networks