1. Fix the model's understanding of fingers.

2. Video: Fix the issues of objects suddenly disappearing or changing color.

3. Make the model handle prompts like "Empire State Building on its side", which it currently doesn't understand.

4. Conditional Stable Diffusion and image-to-image take text and/or an image as input. What other modalities could be taken as input? Sound (e.g., music or a bird call)? Speech, in particular the emotion, gender, and accent aspects of the voice? Imagine creating a talking video character (realistic or cartoony) based on voice only: you could create a video radio station or podcast from speech alone. The lip movements should be based on the text, target language, and time alignment. Could a CLIP-style model be used for audio encoding? Tabular data (e.g., generating visualizations from an Excel sheet)?

5. An image captioning model uses an image encoder and an autoregressive text decoder. Is there any equivalent to the next-token prediction task in computer vision? Next-pixel prediction? Next-patch prediction? The latter may be possible. If so, is it possible to replace diffusion models with the analogue of image captioning models, i.e., encode the text, and use an autoregressive decoder to generate the image? How about using diffusion to generate the initial patch, and then using autoregression to generate the rest? Then we could create a Midjourney-like zoom-out feature.

6. Have the concepts of guidance scale, negative prompt, sound-to-sound, textual inversion & DreamBooth been implemented for audio stable diffusion / other modalities? If not, they may be opportunities, especially sound-to-sound.

7. Not related to stable diffusion: Has a high-quality audio captioning model been created? This should work similarly to image captioning.

8. The textual inversion trick of fine-tuning a single token: Can it not be used for models like T5 / Donut to teach them new concepts (without having to fine-tune the full model)? Try it out on T5 and Donut to add new tasks / keys to an existing fine-tuned model! Compare the results to a stage-1 model fine-tuned on the full dataset with all the tasks / keys. Explore how this is related to machine unlearning: Can it be used to forget just a single task / key? The latter may be a new type of unlearning. So far, unlearning has focused on forgetting training examples, but not full concepts. (Perhaps just resetting the embedding vector for that task / key special token to random values will do the trick?)
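
A minimal sketch of the single-token idea on a toy model, not on T5 or Donut. Everything here is an assumption for illustration: the tiny `emb` + `head` "model", the sizes, and `new_token_id` standing in for the new task / key special token. The gradient hook zeroes every embedding row except the new one, so only that vector is trained. The last step shows the "reset the embedding to random values" idea for forgetting.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes; id 9 plays the role of the new task / key special token.
vocab_size, dim, new_token_id = 10, 8, 9

emb = nn.Embedding(vocab_size, dim)
head = nn.Linear(dim, 2)
for p in head.parameters():
    p.requires_grad_(False)          # the rest of the "model" stays frozen

# Mask the gradient so that only the new token's row is ever updated.
mask = torch.zeros(vocab_size, 1)
mask[new_token_id] = 1.0
emb.weight.register_hook(lambda g: g * mask)

before = emb.weight.detach().clone()
opt = torch.optim.SGD([emb.weight], lr=0.1)

# Toy batch: the new token appears only in the first sequence.
tokens = torch.tensor([[1, 2, new_token_id], [3, 4, 5]])
labels = torch.tensor([0, 1])
for _ in range(5):
    logits = head(emb(tokens).mean(dim=1))
    loss = nn.functional.cross_entropy(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

after = emb.weight.detach().clone()

# "Unlearning" by resetting just that one embedding vector to random values.
with torch.no_grad():
    emb.weight[new_token_id].normal_()
```

Plain SGD without weight decay is used on purpose: an optimizer with weight decay would still move the masked rows even though their gradient is zero.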

9. A one-shot style transfer model. Input a new image and a reference image (containing a style), and the model should be able to perform style transfer. Does it already exist? This is as opposed to full pre-training / fine-tuning (on a particular style without prompting, or a limited set of styles with prompting) / textual inversion approaches.

10. Research ideas related to DreamBooth mentioned in Lesson 9:

> Other ideas that may work include: use Exponential Moving Average (EMA) so that the final weights preserve some of the previous knowledge, use progressive learning rates for fine-tuning, or combine the best of Textual Inversion with DreamBooth. These could make for some interesting projects to try out!

These are related to prior preservation.
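
A rough sketch of the EMA idea from the quote, assuming a plain PyTorch training loop. The `ema_update` helper, the tiny `nn.Linear` model, and the decay value are all hypothetical stand-ins, not the DreamBooth training code. The averaged copy starts at the pre-trained weights and only drifts slowly towards the fine-tuned ones, which is why it may preserve some of the previous knowledge.

```python
import copy
import torch
from torch import nn


@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """Blend the current weights into the running average, in place."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)


torch.manual_seed(0)
model = nn.Linear(4, 2)              # hypothetical stand-in for the model being fine-tuned
ema_model = copy.deepcopy(model)     # starts as a copy of the "pre-trained" weights
for p in ema_model.parameters():
    p.requires_grad_(False)

opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 2)
for _ in range(10):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(ema_model, model)     # call once after every optimizer step
```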

11. Jeremy's formulation of Stable Diffusion: According to him, it can take you in some innovative research directions. What are they?

12. **Repetition of a point in 'Lesson_9_Notes.ipynb' because it's a research idea:** The first function (`f`) he mentions: it outputs the probability that an image is a handwritten digit. Why not just create a binary classification model (with a single output unit & a sigmoid activation function)? The model can be trained using MNIST (handwritten digits) & Fashion-MNIST (not handwritten digits)! According to Jeremy, if we have this function, we can actually use it to generate handwritten digits. **Note:** With the MNIST, Fashion-MNIST approach, you don't have to start with random noise to perform gradient ascent.

13. **Repetition of a point in 'Lesson_9_Notes.ipynb' because it's a research idea:** For `f.backward()` and `X_3.grad` to work, we need `X_3` to have `requires_grad=True`. But `X_3` is an image (an input tensor with `requires_grad=False`), not an `nn.Parameter` with `requires_grad=True`. So how would that work? A hack might be: after the model is trained, freeze all its parameters. Then set `requires_grad=True` for the input tensor, and do a forward pass without using the `torch.no_grad()` context manager. Then you might be able to get the gradient of `f` w.r.t. `X_3`? **Note:** This idea of starting from an input and changing it (using gradient ascent) until it becomes something else might be more general than just the image generation task. What other tasks could we use this strategy for? Accent softening? Music enhancement (e.g., adding more instruments)? Style transfer? What else? (First, we have to validate that this strategy works using the MNIST dataset.)
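
A minimal sketch of the hack described above, to check that the gradient w.r.t. the input is obtainable. The small network `f` is an untrained, hypothetical stand-in for the trained "is this a handwritten digit?" model, and the step size `c` is an arbitrary choice.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the trained model f (untrained here).
f = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 32), nn.ReLU(), nn.Linear(32, 1))

# Freeze the parameters: only the input should receive a gradient.
for p in f.parameters():
    p.requires_grad_(False)

# X_3 is a plain input tensor that is told to track gradients.
X_3 = torch.rand(1, 1, 28, 28).requires_grad_(True)

score = f(X_3).sum()     # forward pass, NOT inside torch.no_grad()
score.backward()         # fills X_3.grad
grad = X_3.grad.clone()

# One gradient-ascent step on the image itself.
c = 0.1
with torch.no_grad():
    X_3_new = X_3 + c * grad
```

Repeating the forward pass, `backward()`, and the update in a loop would give the "keep changing the input until it becomes something else" strategy.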

14. A potential alternative to using a UNet to predict noise: Maybe generate 784-length label vectors holding the noise added to each pixel of a 28×28 image, and train an MLP regression model to predict them?
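
A quick sketch of what that could look like, assuming flattened 28×28 inputs. The random `clean` tensor is a stand-in for real MNIST digits, and the layer sizes, noise schedule, and number of steps are arbitrary assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

# MLP regression model: noisy 784-vector in, predicted 784-vector of noise out.
mlp = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 784))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)

clean = torch.rand(64, 784)          # stand-in for a batch of flattened MNIST digits
losses = []
for _ in range(20):
    # A random noise level per image, then the noise itself (this is the label).
    noise = torch.randn_like(clean) * torch.rand(64, 1)
    noisy = clean + noise
    loss = nn.functional.mse_loss(mlp(noisy), noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```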

15. Apart from Stable Diffusion, in what other situations is training a large neural net prohibitively expensive? Video data? Can we use a VAE to compress the data in such cases? Can this be used to perform modeling experiments that are currently only accessible to the largest AGI labs due to compute limitations?

16. The idea of contrastive loss: it is surely being used in audio applications as well, such as zero-shot audio classification and audio diffusion. What other audio applications does contrastive loss have? What about other modalities such as video (GIFs, for example), tabular data, graphs & charts, etc.? What are the potential applications of putting two different modalities into the same space?
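
For reference, a sketch of a CLIP-style symmetric contrastive loss between two batches of embeddings. It does not care which modalities produced them. The function name, the temperature value, and the random "audio" and "text" embeddings are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def clip_style_loss(a, b, temperature=0.07):
    """Row i of `a` and row i of `b` are a matching pair; all other rows are negatives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                 # pairwise cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Average the a->b and b->a classification losses.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


torch.manual_seed(0)
audio_emb = torch.randn(8, 32)   # hypothetical output of an audio encoder
text_emb = torch.randn(8, 32)    # hypothetical output of a text encoder
loss = clip_style_loss(audio_emb, text_emb)
```

Minimizing this loss pulls matching pairs together and pushes mismatched pairs apart, which is what puts the two modalities into the same space.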

17. Related to the above: Is text the only form of guidance possible? How about reversing the process? That is, providing the model with an image, and asking it to generate some text / audio / video data. (This is already done for image captioning. But the idea is: can we do it with diffusion? What would the inputs & outputs of such a model look like?) How about an audio file as guidance for audio stable diffusion? (This is not about creating an audio-to-audio pipeline. This is about using an input audio as guidance. For example, in the case of accent conversion.)

18. **Repetition of a point in 'Lesson_9_Notes.ipynb' because it's a research idea:** Questions like how to choose the value of the constant `c` are decided by the "*diffusion sampler*". But this looks a lot like deep learning optimizers (Momentum, RMSProp, Adam, etc.). In a deep learning optimizer, the constant `c` is the learning rate. So concepts such as momentum should be applicable to diffusion samplers as well! This is an area of research that Jeremy is exploring. Diffusion models originally came from the world of differential equations, and there are a whole lot of parallel concepts in the two worlds of (a) optimizers and (b) differential equations. So differential equation solvers use a lot of the same kinds of ideas that deep learning optimizers use. One thing that differential equation solvers do is that they tend to take '*time*' as an input. And in fact, pretty much all diffusion models take not just the noisy latent and the prompt as an input; they also take '*time*' as an input. The idea is: the model will be better at removing the noise if you tell it how much noise still exists (after removing noise in the previous time steps). Jeremy very strongly suspects that this premise is incorrect, because figuring out how noisy an image is should be very straightforward for a fancy neural net. (**Idea:** Train a separate regression model to do this, i.e., predict how much noise there is. Or maybe use an auxiliary target in the same neural net.) So Jeremy very much doubts that we actually need to pass in '*time*' as an input. And as soon as you stop doing that, things stop looking like differential equations, and they start looking like optimizers.
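
A toy sketch of the "sampler as optimizer" view, written as an SGD-with-momentum update. Everything here is an assumption for illustration: `predict_noise` is a hypothetical oracle that already knows the clean image (a real model would be a trained network), it takes no '*time*' input, and the values of `c` and `beta` are arbitrary.

```python
import torch

torch.manual_seed(0)
target = torch.rand(784)             # stand-in for a clean (flattened) image


def predict_noise(x):
    # Hypothetical oracle in place of a trained noise predictor.
    # Note that it takes no `time` input.
    return x - target


x = torch.randn(784)                 # start from pure noise
initial_error = (x - target).norm().item()

c, beta = 0.1, 0.9                   # step size ("learning rate") and momentum
velocity = torch.zeros_like(x)
for _ in range(200):
    velocity = beta * velocity + predict_noise(x)   # momentum buffer, as in SGD
    x = x - c * velocity                            # remove a bit of the predicted noise

final_error = (x - target).norm().item()
```

With a real model, the interesting experiment would be to compare this kind of update against a standard sampler at the same number of steps.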

19. **Repetition of a point in 'Lesson_9_Notes.ipynb' because it's a research idea:** We decided that the loss function of the UNet is MSE. The truth is, in statistics & ML, every time you see somebody use MSE, it's because the math worked out easier that way. What if we replaced MSE with more sophisticated loss functions like "*perceptual loss*"? This loss function tells us: after removing noise from a noisy latent, how good is the denoised latent? Does it look like a (compressed) digit? Does it have the qualities of a (compressed) digit?
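
A rough sketch of how a perceptual loss could be wired up: compare the features that a frozen network extracts from the prediction and the target, instead of comparing raw values. The `PerceptualLoss` class is hypothetical, and the random-weight `feature_net` is only a placeholder. In practice the feature extractor would be a pre-trained network, and for latents it would presumably have to be trained on latents.

```python
import torch
from torch import nn
import torch.nn.functional as F


class PerceptualLoss(nn.Module):
    """MSE between features of a frozen network rather than between raw values."""

    def __init__(self, feature_net):
        super().__init__()
        self.feature_net = feature_net.eval()
        for p in self.feature_net.parameters():
            p.requires_grad_(False)      # the feature extractor is never trained

    def forward(self, pred, target):
        return F.mse_loss(self.feature_net(pred), self.feature_net(target))


torch.manual_seed(0)
# Placeholder feature extractor with random weights (an assumption, see above).
feature_net = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 16, 3, padding=1)
)
loss_fn = PerceptualLoss(feature_net)

denoised = torch.rand(2, 1, 28, 28, requires_grad=True)   # stand-in for the model output
clean = torch.rand(2, 1, 28, 28)
loss = loss_fn(denoised, clean)
loss.backward()                                           # gradient reaches the prediction
```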