# Pretraining

Supervised deep learning models are generally trained on labeled data to perform a single task. This works well when large amounts of labeled data are available, but it generalizes poorly when labeled data is scarce {cite}`mao2020pretraining`. For most practical problems, labeled examples are limited or expensive to obtain. Self-supervised learning (SSL) is an unsupervised learning approach in which the model derives its own supervisory signal from unlabeled data, for example by predicting a hidden part of the input from the rest. This removes the need for manual labels during pretraining. The pretrained model can then be fine-tuned on the limited labeled data for downstream tasks, which also allows for transfer learning in low-data regimes.

How does pretraining work?

In its classic, layer-wise form, pretraining involves training shallow neural networks with a self-supervised approach and then stacking them together to create a deep neural network.
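The layer-wise idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: each shallow network is taken to be a one-hidden-layer autoencoder whose self-supervised task is reconstructing its own input, and the toy data, layer sizes, and helper names (`train_autoencoder`, `encode`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(x, hidden_dim, steps=300, lr=0.1):
    """Train a shallow autoencoder on x; return encoder weights and loss history."""
    d = x.shape[1]
    w_enc = rng.normal(scale=0.1, size=(d, hidden_dim))
    w_dec = rng.normal(scale=0.1, size=(hidden_dim, d))
    losses = []
    for _ in range(steps):
        h = np.tanh(x @ w_enc)              # encode
        err = h @ w_dec - x                 # linear decode, reconstruction error
        losses.append(float(np.mean(err**2)))
        g_out = 2 * err / err.size          # gradient of the mean squared error
        g_dec = h.T @ g_out
        g_enc = x.T @ ((g_out @ w_dec.T) * (1 - h**2))
        w_enc -= lr * g_enc
        w_dec -= lr * g_dec
    return w_enc, losses

x = rng.normal(size=(64, 8))                # toy unlabeled data (hypothetical)
layer_sizes = [6, 4]                        # hidden sizes of the stacked layers

encoders, history, inputs = [], [], x
for size in layer_sizes:
    w, losses = train_autoencoder(inputs, size)
    encoders.append(w)
    history.append(losses)
    inputs = np.tanh(inputs @ w)            # this layer's output feeds the next

def encode(x, encoders):
    """Deep encoder built by stacking the pretrained shallow layers."""
    for w in encoders:
        x = np.tanh(x @ w)
    return x

features = encode(x, encoders)
print(features.shape)                       # (64, 4)
```

The stacked encoder would normally be followed by a task-specific layer and trained further on labeled data; that step is omitted here.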

Why does pretraining work?


## Transfer learning vs fine-tuning


New dataset small but similar to the original data: use transfer learning. Freeze the feature extraction layers and replace the last layer with one for the task at hand.

New dataset large but similar to the original data: use fine-tuning. Replace the last layer with a task-specific one and also update parts of the feature extraction layers.
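In code, the two regimes differ only in which parameters receive gradient updates. The following is a minimal NumPy sketch, assuming a tiny two-layer regression model; `pretrained` is a random stand-in for real pretrained feature-extraction weights, and the function name `adapt` and all data are hypothetical.

```python
import numpy as np

def adapt(w_feat, x, y, freeze_features, steps=100, lr=0.05, seed=0):
    """Fit a new last layer on (x, y); optionally also update the feature weights."""
    rng = np.random.default_rng(seed)
    w_feat = w_feat.copy()                           # leave the pretrained copy intact
    w_head = rng.normal(scale=0.1, size=(w_feat.shape[1], 1))  # new last layer
    for _ in range(steps):
        h = np.tanh(x @ w_feat)                      # feature extraction
        err = h @ w_head - y                         # prediction error
        g_out = 2 * err / len(x)                     # gradient of the mean squared error
        g_head = h.T @ g_out
        g_feat = x.T @ ((g_out @ w_head.T) * (1 - h**2))
        w_head -= lr * g_head                        # the head is always trained
        if not freeze_features:
            w_feat -= lr * g_feat                    # only in the fine-tuning regime
    return w_feat, w_head

rng = np.random.default_rng(1)
pretrained = rng.normal(scale=0.5, size=(5, 3))      # stand-in for pretrained weights
x = rng.normal(size=(20, 5))                         # toy "new dataset"
y = rng.normal(size=(20, 1))

tl_feat, tl_head = adapt(pretrained, x, y, freeze_features=True)    # transfer learning
ft_feat, ft_head = adapt(pretrained, x, y, freeze_features=False)   # fine-tuning

print(np.allclose(tl_feat, pretrained))   # True: feature layers were frozen
print(np.allclose(ft_feat, pretrained))   # False: feature layers were updated
```

With a small dataset, freezing the feature layers keeps the number of trainable parameters low, which is the motivation for the first regime above.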

[image for TL vs FT]

## Pretraining for graph models

Self-supervised pretraining methods for graph models fall into two broad families, contrastive learning and predictive learning {cite}`xie2021graphpretraining`.

Contrastive learning - "Given training graphs, contrastive learning aims to learn one or more encoders such that representations of similar graph instances agree with each other, and that representations of dissimilar graph instances disagree with each other."
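One common way to turn this objective into a loss is an InfoNCE-style (NT-Xent) loss; the quoted definition does not prescribe a particular loss, so this choice is an assumption. In the sketch below, the embeddings are random stand-ins for the output of a graph encoder, and two noisy copies of each embedding play the role of two views of the same graph.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE loss: row i of z1 and row i of z2 are two views of graph i."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature                    # scaled cosine similarity, all pairs
    sim = sim - sim.max(axis=1, keepdims=True)       # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))        # positives lie on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                         # stand-in graph embeddings
view_a = z + 0.05 * rng.normal(size=z.shape)         # two perturbed "views"
view_b = z + 0.05 * rng.normal(size=z.shape)

aligned = info_nce(view_a, view_b)
mismatched = info_nce(view_a, view_b[::-1])          # pair each graph with a different one
print(aligned < mismatched)                          # True
```

The loss is low when representations of the two views of the same graph agree and representations of different graphs disagree, which matches the definition above.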

## Applications in chemistry

### Noisy student/teacher method (AlphaFold)


### ESM and UniRep for protein sequence embedding

The protein representation and feature extraction layers in the UniRep[cite] and ESM[cite] models have been used by researchers for transfer learning as well as fine-tuning on downstream tasks.

The UniRep model was trained on 24 million UniRef50 primary amino-acid sequences. The model was trained to perform next amino-acid prediction (minimizing cross-entropy loss) and, in so doing, was forced to learn how to internally represent proteins. During application, the trained model is used to generate a single fixed-length vector representation of the input sequence by globally averaging intermediate mLSTM numerical summaries (the hidden states). This representation acts as a featurization of the input sequence. A top model (for example, a sparse linear regression or random forest) trained on top of it enables supervised learning on diverse protein informatics tasks.
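The averaging step and the top model can be sketched as follows. This is not the UniRep code: `hidden_states` is a hypothetical stand-in that returns a fixed random vector per residue instead of real mLSTM states, the sequences and labels are toy values, and ridge regression is swapped in for the sparse linear regression mentioned above.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(0)
embed = rng.normal(size=(len(AMINO_ACIDS), 32))   # hypothetical per-residue states

def hidden_states(seq):
    """Stand-in for the language model: one 32-dim vector per residue."""
    return embed[[AMINO_ACIDS.index(a) for a in seq]]

def represent(seq):
    """Fixed-length representation: average the hidden states over positions."""
    return hidden_states(seq).mean(axis=0)

seqs = ["MKTAYIAKQR", "GSHMLEDPVW", "ACDEFGHIK", "MKV"]   # toy sequences
reps = np.stack([represent(s) for s in seqs])
print(reps.shape)    # (4, 32), regardless of sequence length

# Top model: ridge regression on the representations (toy labels).
y = np.array([0.1, 0.7, 0.3, 0.9])
lam = 1e-2
w = np.linalg.solve(reps.T @ reps + lam * np.eye(32), reps.T @ y)
pred = reps @ w
```

Because the average is taken over sequence positions, proteins of different lengths map to vectors of the same size, which is what lets a standard regression or random forest model consume them.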

The ESM model from Facebook Research is a transformer language model pretrained on protein sequences from the UniRef dataset with a masked language modeling objective. [add details]

### ChemBERTa - pretraining in the context of chemistry and molecules

ChemBERTa is based on the RoBERTa transformer implementation in HuggingFace. The implementation uses 12 attention heads and 6 layers, resulting in 72 distinct attention mechanisms. The model is pretrained with the RoBERTa procedure on the PubChem 77M dataset, masking 15% of the tokens in each input string. The pretrained model is then fine-tuned on several MoleculeNet classification tasks.
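The masking step can be illustrated with a short sketch. It makes two simplifying assumptions that differ from the real pipeline: tokens are single SMILES characters rather than the output of a trained tokenizer, and every selected token is replaced by the mask token. The function name `mask_tokens` is hypothetical.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Hide about 15% of the tokens; return the masked input and the targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(positions)}   # what the model must predict
    return masked, targets

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"    # aspirin
tokens = list(smiles)                   # character-level tokens, a simplification
masked, targets = mask_tokens(tokens)

print("".join(masked))
print(targets)

# The count of attention mechanisms quoted above: heads per layer times layers.
n_heads, n_layers = 12, 6
print(n_heads * n_layers)               # 72
```

During pretraining the model sees `masked` as input and is trained to recover the entries of `targets`, so no property labels are needed.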

## Code Example

Reference - ChemBERTa tutorials - possibly use for the solubility dataset used in the book.

## Videos

https://www.youtube.com/watch?v=3nbin3bT8ec&t=1s - TL vs FT

https://www.youtube.com/watch?v=qWUslmU7BjY - TL in NLP and HuggingFace