# ResNets
ResNets (residual networks) are a neural net architecture designed for image classification tasks. The ResNet itself was introduced in 2015, but the skip connections it builds on have a long (for machine learning) history dating back to the 1980s.
Generally, they are a solution to the problem that very deep neural nets are hard to train: in a net that isn't already well-trained, image information can get "lost" as it passes through many layers. This makes it hard
for gradients to be propagated deep into the network to identify useful features. The solution was to introduce
skip connections (also called residual connections or identity mappings) around stacks of convolutional layers, which allow gradients to flow directly through the network and mitigate the vanishing gradient problem in very deep networks.
Residual blocks also make it easy for the model to learn identity mappings, so adding more layers does not have to degrade what the shallower layers already provide.

<img src="ResBlock.png" width="20%">
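
As a rough illustration of the residual block idea, here is a minimal NumPy sketch. It is not a real ResNet layer: small dense matrices (`w1`, `w2`, hypothetical names) stand in for the convolutional layers, and there is no batch normalization. The point is only to show the pattern "output = activation(F(x) + x)", and that the block collapses to the identity when the residual branch contributes nothing.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two dense layers standing in for conv layers."""
    f = relu(x @ w1) @ w2   # residual branch F(x)
    return relu(f + x)      # skip connection: add the input back before the activation

rng = np.random.default_rng(0)
# Non-negative inputs, as if they came out of an earlier ReLU layer
x = np.abs(rng.normal(size=(4, 8)))
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = rng.normal(size=(8, 8)) * 0.1
y = residual_block(x, w1, w2)

# If the branch weights are zero, F(x) = 0 and the block simply passes x through
zeros = np.zeros((8, 8))
y_identity = residual_block(x, zeros, zeros)
```

Because the identity is the "do nothing" state of the block, the layers inside only need to learn a correction to their input rather than a full transformation.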

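__Loss Function__: a quantitative measure of the distance between two tensors, typically the model's predictions and the target values (labels)

__Learning Rate__: a hyperparameter that controls how much the model's parameters/weights are adjusted during each training step (an epoch, one full pass through the training set, usually consists of many such steps)

__Backpropagation__: using the chain rule to compute the gradient of the loss function with respect to the weights at each layer in the network; these gradients are then used to update the weights in order to minimize the loss function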
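
These three terms can be tied together with a toy example. The sketch below is plain Python and assumes a one-parameter model `prediction = w * x` with a mean squared error loss; the data, the starting weight, and the learning rate are all made up for illustration. With a single weight there are no layers to propagate through, so the gradient is written out by hand.

```python
def mse_loss(predictions, targets):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def mse_grad_w(w, xs, targets):
    """Gradient of the loss with respect to w: mean of 2 * (w*x - t) * x."""
    return sum(2 * (w * x - t) * x for x, t in zip(xs, targets)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
targets = [2.0, 4.0, 6.0, 8.0]   # generated with a "true" weight of 2
w = 0.0                          # initial guess
learning_rate = 0.01

losses = []
for step in range(200):
    predictions = [w * x for x in xs]
    losses.append(mse_loss(predictions, targets))
    # Gradient descent update: step against the gradient, scaled by the learning rate
    w -= learning_rate * mse_grad_w(w, xs, targets)
```

After the loop, `w` has moved from 0 toward 2 and the recorded losses shrink toward zero.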

### Vanishing Gradients problem

* Consider a network with a learning rate $\lambda$ and loss function $L$
* Update the weights such that $w_i^\prime = w_i + \Delta w_i = w_i -\lambda\frac{\partial L}{\partial w_i}$
    - We're updating each weight by a very small amount based on the gradient of the loss function and the learning rate
    - For example, imagine $\lambda = 10^{-4}$ and an average gradient of $10^{-15}$, which yields $|\Delta w_i| = 10^{-19}$ for each step
    - Per the chain rule, the gradient at a given layer is a product of factors from every layer after it, so when those factors are smaller than one it shrinks with each backpropagation step through the network. Some of the first (and perhaps more informative) layers won't see this change at all
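
The shrinking can be seen numerically in a toy setting. The sketch below is plain Python and assumes a chain of scalar "layers", each computing `sigmoid(weight * input)`; this is a simplification chosen for illustration, not a real network. It backpropagates the derivative of the final output to each layer's input and records its magnitude.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_magnitudes(depth, weight=1.0, x=0.5):
    """Return |d output / d a_k| for each layer input a_k in the chain
    a_{k+1} = sigmoid(weight * a_k), ordered from first layer to last."""
    # Forward pass: store every activation
    activations = [x]
    for _ in range(depth):
        activations.append(sigmoid(weight * activations[-1]))

    # Backward pass: multiply local derivatives, starting from the output
    grad = 1.0
    grads = []
    for a_out in reversed(activations[1:]):
        grad *= weight * a_out * (1.0 - a_out)  # derivative of sigmoid(weight * a_in) wrt a_in
        grads.append(abs(grad))
    grads.reverse()
    return grads

grads = gradient_magnitudes(depth=30)
learning_rate = 1e-4
first_layer_update = learning_rate * grads[0]  # size of the update reaching the first layer
```

Each sigmoid contributes a factor of at most 0.25 here, so after 30 layers `grads[0]` is many orders of magnitude smaller than `grads[-1]`, and the corresponding weight update is smaller still.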

### ResNet

* A key feature of the ResNet is the introduction of __Skip Connections__, which let gradient information backpropagate along a path that skips layers of the network.  This was introduced, in part, to help mitigate the vanishing gradients problem
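
To see why a skip connection helps, it can be useful to write out the chain rule for a single block. As a simplified sketch (treating the block as scalar and ignoring any activation applied after the addition, both assumptions made here for illustration), suppose a block computes $y = x + F(x)$, where $F$ stands for the stack of layers being skipped. Then

$$
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial x} = \frac{\partial L}{\partial y}\left(1 + \frac{\partial F}{\partial x}\right)
$$

The $1$ comes from the identity path. Even if $\partial F/\partial x$ is close to zero, $\partial L/\partial x \approx \partial L/\partial y$, so the gradient arrives at the earlier layers largely intact. Stacking $n$ such blocks, with $x_{k+1} = x_k + F_k(x_k)$, gives

$$
\frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_n}\prod_{k=0}^{n-1}\left(1 + \frac{\partial F_k}{\partial x_k}\right)
$$

Without the skip connections, each factor would be just $\partial F_k/\partial x_k$, and a product of many small factors is what makes the gradient vanish.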


<img src="Original-ResNet-18-Architecture.png" width="50%">

And of course, there are many flavors of residual networks with varying depths and numbers of parameters:

<img src="resnet_param_counts.png" width="50%">

This general architecture, whereby the outputs of earlier layers are fed forward past intermediate layers that provide augmented context, is also very widely used, including
in U-Nets (used for image segmentation), Transformer networks (e.g. the "T" in GPT), and many others. But because we are interested in image classification,
we'll start with ResNets here.

Here's an example of some of the intermediate layers of a ResNet-18 that was trained on a very large and generic set of millions of labeled images, and is now being shown a hops berry (I think?).

<img src="resnet_feature_visual.png" width="50%">

What's interesting here is how generic the filter shapes are, especially in the first few layers.

It is important to have large training sets, and, in many cases, to augment our training sets with transformations in order to avoid overfitting.
Well, it turns out that training a ResNet on a huge number of images that have nothing to do with galaxies can lead it to identify features
that are generally useful for image classification, and these features can help jump-start galaxy classification.

Even better, someone else spent a huge amount of computing time getting you these weights, so you can leverage training runs that might have taken months on large GPU clusters.

Harnessing pre-trained networks for new tasks is called **transfer learning**, and it can be quite powerful.
The trick is knowing where to create the splice between your pre-trained network and a custom network that focuses on your particular task.