<a href="https://colab.research.google.com/github/vanderbilt-data-science/ai_summer/blob/main/2_images_and_audio_solns.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Images and Audio
> Learning more about transformers for image and audio processing

Today we'll take what we've learned about the general structure of transformers, what to expect when training models, and how training can differ between models, and apply it to two new modalities: audio and images.

* Quick reference for breakout room document: https://docs.google.com/document/d/15deDo3TBlgue_7ueoHake-O3HoEqCZKBZOHWfmfUlFQ/edit?usp=sharing

# Learning more about standard vision and audio models

In this section, we'll explore the most standard audio and image transformer models currently implemented in HuggingFace. The section is structured to give you some insight into the models themselves and into the HF API that you'll see.

# Image models - Vision Transformer (ViT)
* [Paper reference](https://arxiv.org/abs/2010.11929)
* [Pdf shortcut](https://arxiv.org/pdf/2010.11929.pdf)

## Questions to Answer (Breakout Rooms)

1. Identify the blocks in the ViT diagram that correspond to the blocks in the Text Pipeline documentation. Starting with the transformer block may help to inform the other pieces.
2. What do you expect to go into the transformer portion of ViT?
3. What do you expect the outputs of the transformer portion of ViT to be?

<center>
<img src="https://github.com/vanderbilt-data-science/ai_summer/blob/main/img/vit-relationships.png?raw=true" width="1000"/>
</center>
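
As a rough aid for question 2, the sketch below counts how many patch tokens a ViT-style model would feed into its transformer. The 224-pixel image size, 16-pixel patch size, and extra class token are assumptions taken from the base configuration described in the ViT paper; other checkpoints may differ, and the function names are illustrative only.

```python
def vit_sequence_length(image_size=224, patch_size=16, add_class_token=True):
    """Number of tokens the transformer sees for one square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # The paper prepends one learnable classification token to the patch sequence.
    return num_patches + (1 if add_class_token else 0)


def flattened_patch_size(patch_size=16, num_channels=3):
    """Length of one flattened patch before the linear projection."""
    return patch_size * patch_size * num_channels


print(vit_sequence_length())   # 197
print(flattened_patch_size())  # 768
```

Each image therefore becomes a short sequence of patch vectors, which plays the same role that a sequence of token embeddings plays in the text pipeline.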

## Looking into the HF API for ViT

Let's now look at the HF API for some insights into the model. Using the [HF API for ViT](https://huggingface.co/docs/transformers/model_doc/vit), answer the following questions:
1. What's the class name of the element that is functionally equivalent to a tokenizer in text models?
2. What tasks are currently supported for ViT?
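
To make the "tokenizer equivalent" in question 1 more concrete, here is a minimal pure-Python sketch of the kind of work such a component does: rescale pixel intensities, normalize each channel, and reorder to channels-first. This is not the HF implementation; the function name `to_pixel_values` is hypothetical, and the mean and standard deviation of 0.5 are assumed values, so check the checkpoint's preprocessing config for the real ones.

```python
def to_pixel_values(image_hwc, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    """Convert an H x W x C image of 0..255 ints into C x H x W normalized floats.

    mean and std are assumed values for illustration, not read from any checkpoint.
    """
    height = len(image_hwc)
    width = len(image_hwc[0])
    pixel_values = []
    for c in range(len(mean)):
        plane = []
        for y in range(height):
            # Rescale to 0..1, then normalize with the per-channel statistics.
            row = [((image_hwc[y][x][c] / 255.0) - mean[c]) / std[c] for x in range(width)]
            plane.append(row)
        pixel_values.append(plane)
    return pixel_values


# A tiny 2 x 2 RGB image: black, white, pure red, pure blue.
tiny_image = [
    [[0, 0, 0], [255, 255, 255]],
    [[255, 0, 0], [0, 0, 255]],
]
pixel_values = to_pixel_values(tiny_image)
print(len(pixel_values), len(pixel_values[0]), len(pixel_values[0][0]))  # 3 2 2
```

A real image processor would also resize the image to the resolution the model expects, which is omitted here.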

### HuggingFace Implementation
**Guided Exploration**
* [How-to Guide](https://huggingface.co/docs/transformers/tasks/image_classification)

**Breakout Room Exploration**

* [HuggingFace Exploratory Notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)

During your exploration of ViT, note the following observations in the breakout room document as compared to our exploration of training text transformers:
* What differences do you see in the *preprocessing* of the data? Will this be relevant to your data?
* What differences do you see in the *feature extraction* portion of the code? Before, we saw `input_ids` as one of the fields produced by tokenizers. How does this differ with ViT? Will this be relevant to your data?
* What differences do you see in *data collation* in the process? Do you think this will be relevant to your process?
* What differences do you see in training, if any? Do you think these differences will be relevant to your process?
* What differences do you see in post-processing, if any? Do you think these differences will be relevant to your process?

Make sure to include the answers to these questions in the breakout room document for your room!
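
When comparing data collation, it may help to see how little an image collator needs to do once every image has been resized to the same shape. The sketch below is a pure-Python stand-in under that assumption: it uses nested lists instead of tensors, and the names `collate_images` and `shape_of` are illustrative rather than part of any library.

```python
def shape_of(nested):
    """Return the shape of a rectangular nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return tuple(shape)


def collate_images(examples):
    """Stack same-shaped images and their labels into one batch dictionary."""
    shapes = {shape_of(ex["pixel_values"]) for ex in examples}
    if len(shapes) != 1:
        # Unlike text, there is no padding step here: shapes must already match.
        raise ValueError(f"images must share one shape, got {sorted(shapes)}")
    return {
        "pixel_values": [ex["pixel_values"] for ex in examples],
        "labels": [ex["label"] for ex in examples],
    }


def make_image(value, channels=3, height=2, width=2):
    """Build a constant-valued C x H x W image for the demo."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


batch = collate_images([
    {"pixel_values": make_image(0.0), "label": 1},
    {"pixel_values": make_image(1.0), "label": 0},
])
print(shape_of(batch["pixel_values"]), batch["labels"])  # (2, 3, 2, 2) [1, 0]
```

With text, the collator usually has to pad sequences to a common length; here the resizing done during preprocessing has already taken care of that.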



---



# Audio models - Wav2Vec2
## References
* [Paper reference](https://arxiv.org/abs/2006.11477)
* [Pdf shortcut](https://arxiv.org/pdf/2006.11477.pdf)

## Questions to Answer (Breakout Rooms)

1. Identify the blocks in the wav2vec2 diagram that correspond to the blocks in the Text Pipeline documentation. Starting with the transformer block may help to inform the other pieces.
2. What do you expect to go into the transformer portion of the wav2vec2 transformer?
3. [A little harder] What do you expect the outputs of the transformer portion of the wav2vec2 transformer to be?
4. [Much harder] From the paper:
    * Two types of training task were used to train this model: one to learn the context representations and the other for a downstream application. What are these two tasks?
    * Because of the fine-tuning application, what elements might you expect to see as a part of the API for wav2vec2?

<center>
<img src="https://github.com/vanderbilt-data-science/ai_summer/blob/main/img/wav2vec-relationships.png?raw=true" width="1000"/>
</center>
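
For question 2, it can help to estimate how many feature vectors reach the transformer for a given clip. The sketch below applies the standard unpadded 1D convolution length formula to the kernel widths and strides reported in the wav2vec 2.0 paper for the feature encoder. It assumes 16 kHz audio and no padding, and the function name is illustrative.

```python
import math

# Kernel widths and strides of the seven feature-encoder blocks, as reported in the paper.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_encoder_frames(num_samples):
    """Number of frames produced from a raw waveform, assuming no padding."""
    length = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        length = (length - kernel) // stride + 1
    return length


sample_rate = 16_000  # assumed sampling rate in Hz
total_stride = math.prod(STRIDES)

print(num_encoder_frames(sample_rate))      # 49 frames for one second of audio
print(total_stride)                         # 320 samples between frames
print(1000 * total_stride / sample_rate)    # 20.0 ms between frames
```

So one second of raw audio is compressed into a few dozen latent vectors, and that sequence is what the transformer attends over.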

## Looking into the HF API for wav2vec2

Let's now look at the HF API for some insights into the model. Using the [HF API for wav2vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2), answer the following questions:
1. What's the class name of the element that is functionally equivalent to a tokenizer in text models?
2. [Hard] Given the fine-tuning application of wav2vec2, what functionality do you think the CTC _Tokenizer_ and Wav2Vec2 _Processor_ relate to? Are those geared more towards text, or towards "textless" NLP capabilities?
3. [Very hard] How do you think you would do domain adaptation for this model?
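
As a hint for question 2, the sketch below shows greedy CTC decoding, the usual way per-frame predictions are turned into text: merge consecutive repeats, then drop the blank symbol. The toy vocabulary, the choice of id 0 as the blank, and the function name are assumptions made for illustration and are not read from any checkpoint.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions and remove blanks."""
    decoded = []
    previous = None
    for idx in frame_ids:
        # Emit a character only when the prediction changes and is not the blank.
        if idx != previous and idx != blank_id:
            decoded.append(id_to_char[idx])
        previous = idx
    return "".join(decoded)


# Hypothetical vocabulary; id 0 plays the role of the CTC blank.
vocab = {0: "<blank>", 1: "h", 2: "e", 3: "l", 4: "o"}

# One predicted id per audio frame. The blank between the two runs of 3
# is what lets the double "l" survive the collapsing step.
frames = [1, 1, 0, 2, 2, 3, 3, 0, 3, 4, 4, 0]
print(ctc_greedy_decode(frames, vocab))  # hello
```

Note that this step works on the text side of the model, after the audio has already been encoded.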



### HuggingFace Implementation
**Guided Exploration**

* [How-to Guide](https://huggingface.co/docs/transformers/tasks/audio_classification)

**Breakout Room Exploration**
* [HuggingFace Exploratory Notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb)

During your exploration of wav2vec2, note the following observations in the breakout room document as compared to our exploration of training text transformers:
* What differences do you see in the *preprocessing* of the data? Will this be relevant to your data?
* What differences do you see in the *feature extraction* portion of the code? Will this be relevant to your data?
* What differences do you see in *data collation* in the process? Do you think this will be relevant to your process?
* What differences do you see in training, if any? Do you think these differences will be relevant to your process?
* What differences do you see in post-processing, if any? Do you think these differences will be relevant to your process?

Make sure to include the answers to these questions in the breakout room document for your room!
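
When comparing collation for audio, keep in mind that clips rarely share a length, so they are typically truncated or padded to a common number of samples. The sketch below is a pure-Python stand-in for that step, not the HF feature extractor; the function name `collate_audio`, the zero padding value, and the toy `max_length` are assumptions for illustration.

```python
def collate_audio(examples, max_length):
    """Truncate or zero-pad each waveform to max_length samples and batch them."""
    input_values, attention_mask, labels = [], [], []
    for ex in examples:
        samples = list(ex["input_values"][:max_length])  # truncate long clips
        pad = max_length - len(samples)
        # 1 marks real audio samples, 0 marks padding.
        attention_mask.append([1] * len(samples) + [0] * pad)
        input_values.append(samples + [0.0] * pad)
        labels.append(ex["label"])
    return {
        "input_values": input_values,
        "attention_mask": attention_mask,
        "labels": labels,
    }


batch = collate_audio(
    [
        {"input_values": [0.1, -0.2, 0.3], "label": 0},
        {"input_values": [0.5, 0.4, 0.3, 0.2, 0.1], "label": 1},
    ],
    max_length=4,
)
print(batch["input_values"])    # [[0.1, -0.2, 0.3, 0.0], [0.5, 0.4, 0.3, 0.2]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

In practice the maximum length would be a number of samples derived from a duration and the sampling rate, rather than the toy value used here.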

# Other HuggingFace Tasks and Models for Audio and Images

What support is offered for which models? The [Auto Classes](https://huggingface.co/docs/transformers/model_doc/auto) documentation can help us out with this!