# IDS 576: Assignment 2 (10 points)

## Learning Objectives

By completing this assignment, you will:

1. **Understand transfer learning**: Learn how to leverage pretrained models (ResNet18) for feature extraction and finetuning on new classification tasks, understanding when each approach is appropriate.

2. **Compare training strategies**: Analyze the trade-offs between using frozen feature extractors versus end-to-end finetuning in terms of accuracy, training time, and computational resources.

3. **Implement embedding models**: Build and train embedding representations for recommendation systems using gradient-based optimization, applying concepts from word embeddings to a new domain.

4. **Visualize high-dimensional representations**: Apply dimensionality reduction techniques (t-SNE) to interpret and evaluate learned embeddings, connecting visualization to model understanding.

5. **Analyze model behavior**: Investigate how hyperparameters (embedding dimension, learning rate) and data characteristics (popularity, cold-start) affect model performance and recommendation quality.

---

## Submission Guidelines

- **Format**: Submit both a Jupyter notebook (`.ipynb`) and a PDF version of it on the submission site.
- **Naming Convention**: `Assignment2_GroupNumber.ipynb` and `Assignment2_GroupNumber.pdf`
- **Structure**: Organize your notebook with clear section headers matching the question numbers (Q1, Q2, etc.). Each question should have:
  - Code cells with comments
  - Markdown cells explaining your approach and findings
  - Output cells showing results/figures (do not clear outputs before submission)
- **Citations**: Always cite all sources (papers, tutorials, Stack Overflow, etc.)
- **Collaboration**: Within-group discussion allowed; cross-group collaboration is **not** allowed.
- **No need to submit**: Datasets, Word documents, or external files.

---

### Question 1: CNNs and Finetuning (4 pt)

In this question, you will explore transfer learning using a pretrained ResNet18 model on the CIFAR-10 dataset. You will compare two approaches: using the pretrained model as a fixed feature extractor versus finetuning the entire network.

**Dataset:**
Download the CIFAR-10 dataset (original data can be found [here](http://www.cs.toronto.edu/~kriz/cifar.html), and here is a link to the pickled [python version](https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz)).

#### Part A: Feature Extraction (2 pt)

Use the pretrained ResNet18 model (from `torchvision.models`) to extract features. Freeze all layers and use the extracted features as inputs to a new multi-class logistic regression model (use `nn.Linear`/`nn.Module` to define your model).

**Sub-questions:**
- (a) Describe the preprocessing steps applied to CIFAR-10 images to match ResNet18's expected input format. Report the final test accuracy. (0.5 pt)
- (b) Train the logistic regression classifier for at least 10 epochs. Plot the training and validation loss curves. Report train, validation, and test accuracy. (0.5 pt)
- (c) Display the top 5 correct predictions and the top 5 incorrect predictions for each of the 10 classes. Show the images with their predicted and true labels in a compact grid format. (1 pt)

#### Part B: Finetuning (2 pt)

Finetune the ResNet18 model's parameters (unfreeze some or all layers) and repeat the analysis from Part A.

**Sub-questions:**
- (a) Describe your finetuning strategy (which layers were unfrozen, learning rate choices, number of epochs). Report the final test accuracy. (0.5 pt)
- (b) Plot the training and validation loss curves. Report train, validation, and test accuracy. (0.5 pt)
- (c) Display the top 5 correct predictions and the top 5 incorrect predictions for each of the 10 classes (same format as Part A). (0.5 pt)
- (d) Create a comparison table showing: training time, number of trainable parameters, and test accuracy for both approaches (feature extraction vs. finetuning). Discuss when you would prefer one approach over the other. (0.5 pt)

**Deliverables:**
- [ ] Code cells showing data loading, preprocessing, and model definition
- [ ] Training/validation loss plots for both approaches
- [ ] Grid visualizations of correct/incorrect predictions (10 classes × 5 images × 2 categories = 100 images per approach)
- [ ] Comparison table with metrics for both approaches
- [ ] Written analysis comparing the two approaches (at least 3-4 sentences)

**Grading Criteria:**
- Full credit: All deliverables present, code runs without errors, visualizations are clear and properly labeled, analysis demonstrates understanding of transfer learning concepts
- Partial credit: Missing deliverables, unclear visualizations, or superficial analysis
- No credit: Code does not run or fundamental misunderstanding of the task

---

### Question 2: Movie Embeddings (4 pt)

Instead of embedding words, we will embed movies. If we can embed movies, then similar movies will be close to each other and can be recommended. This reasoning is analogous to the [distributional hypothesis of word meanings](https://en.wikipedia.org/wiki/Distributional_semantics). For words, this roughly translates to: words that appear in similar sentences should have similar vector representations. For movies, vectors for two movies should be similar if they are watched by similar people.

**Mathematical Formulation:**

Let the total number of movies be $M$. Let $X_{i,j}$ be the number of users that liked both movies $i$ and $j$. We want to obtain vectors $v_1,...,v_i,...,v_j,...,v_M$ for all movies such that we minimize the cost:

$$c(v_1,...,v_M) = \sum_{i=1}^{M}\sum_{j=1}^{M}\mathbf{1}_{[i\neq j]}(v_i^Tv_j - X_{i,j})^2$$

Here $\mathbf{1}_{[i\neq j]}$ is an indicator function that is $0$ when $i=j$ and $1$ otherwise.
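
Because $X$ is symmetric ($X_{i,j} = X_{j,i}$ by definition), each pair $(i, j)$ with $i \neq j$ appears twice in the double sum. Under that observation, the gradient with respect to a single movie vector $v_k$ can be sketched as follows. It is offered as a sanity check for an autograd-based implementation, not as a prescribed method:

$$\frac{\partial c}{\partial v_k} = 2\sum_{j \neq k}(v_k^Tv_j - X_{k,j})\,v_j + 2\sum_{i \neq k}(v_i^Tv_k - X_{i,k})\,v_i = 4\sum_{j \neq k}(v_k^Tv_j - X_{k,j})\,v_j$$

The factor of 4 comes from the chain rule (a factor of 2) combined with the two symmetric copies of each term.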

#### Part A: Data Preparation (0.5 pt)

Compute the co-occurrence matrix $X_{i,j}$ from the MovieLens (small) [dataset](https://files.grouplens.org/datasets/movielens/ml-latest-small.zip). You can also download using the link to `ml-latest-small.zip` from this [page](https://grouplens.org/datasets/movielens/) (be sure to read the corresponding [description](https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html)).

**Sub-questions:**
- (a) Describe your data preprocessing workflow. How did you define "liked" (what rating threshold did you use)? Report the number of movies, users, and non-zero entries in your co-occurrence matrix.

**Deliverables:**
- [ ] Code cells showing data loading and preprocessing
- [ ] Summary statistics: number of movies, users, ratings, and co-occurrence matrix sparsity

#### Part B: Optimization (1.5 pt)

Optimize function $c(v_1,...,v_M)$ over $v_1,...,v_M$ using gradient descent (using PyTorch or TensorFlow).

**Sub-questions:**
- (a) Implement the loss function and training loop. Use an embedding dimension of 50 as your baseline. (0.5 pt)
- (b) Plot the loss as a function of iteration for at least 3 different learning rates (e.g., 0.001, 0.01, 0.1). Train for at least 100 epochs. (0.5 pt)
- (c) Compare at least 2 different optimizers (e.g., SGD, Adam, RMSprop). Plot the loss curves on the same figure with a legend. Discuss which optimizer converges faster and why. (0.5 pt)

**Deliverables:**
- [ ] Code cells with loss function implementation and training loop
- [ ] Loss vs. iteration plot comparing learning rates
- [ ] Loss vs. iteration plot comparing optimizers
- [ ] Written discussion of optimizer comparison (at least 2-3 sentences)

#### Part C: Movie Recommendations (2 pt)

Recommend the top 10 movies (not vectors or indices, but movie names) for the following query movies:
- (a) _Apollo 13_
- (b) _Toy Story_
- (c) _Home Alone_

**Sub-questions:**
- (a) Describe your recommendation strategy. How do you find similar movies given a query movie's embedding vector? (0.5 pt)
- (b) Present the recommendations for each query movie in a clear table format. (0.5 pt)
- (c) Do the recommendations change when you change learning rates or optimizers? Run the experiment with at least 2 different configurations and report the results. Explain why the recommendations are stable or unstable. (1 pt)

**Deliverables:**
- [ ] Description of similarity/recommendation strategy
- [ ] Three tables showing top 10 recommendations for each query movie
- [ ] Comparison of recommendations across different training configurations
- [ ] Written analysis explaining recommendation stability (at least 3-4 sentences)

**Grading Criteria:**
- Full credit: All deliverables present, loss curves show convergence, recommendations are sensible and well-presented, analysis demonstrates understanding of embedding optimization
- Partial credit: Missing deliverables, unconverged training, or superficial analysis
- No credit: Code does not run or recommendations are clearly incorrect (e.g., returning random movies)

---

### Question 3: Embedding Visualization (1 pt)

Visualize the learned movie embeddings using dimensionality reduction to understand what the model has learned.

#### Part A: t-SNE Visualization (0.5 pt)

Apply t-SNE to reduce the movie embedding vectors to 2 dimensions and create a scatter plot.

**Sub-questions:**
- (a) Apply t-SNE with perplexity=30 to your learned movie embeddings. Create a scatter plot of the 2D projections.
- (b) Color-code the points by movie genre (use the primary genre from the MovieLens dataset). Include a legend.
- (c) Annotate at least 10 well-known movies on the plot (e.g., the query movies from Question 2 plus others).

**Deliverables:**
- [ ] t-SNE scatter plot with genre color-coding
- [ ] Legend showing genre-to-color mapping
- [ ] Annotations for at least 10 recognizable movies

#### Part B: Cluster Analysis (0.5 pt)

Analyze the structure of the embedding space.

**Sub-questions:**
- (a) Identify 2-3 clusters of movies that appear close together in the t-SNE visualization. List the movies in each cluster and explain what they have in common (genre, era, themes, etc.).
- (b) Find 2 movies that are close in embedding space but seem different on the surface. Hypothesize why the model considers them similar.

**Deliverables:**
- [ ] Description of 2-3 identified clusters with movie lists
- [ ] Analysis of 2 surprising similar movies with hypothesis

**Grading Criteria:**
- Full credit: Clear, well-labeled visualization with insightful cluster analysis
- Partial credit: Visualization present but analysis is superficial
- No credit: No visualization or completely incorrect implementation

---

### Question 4: Embedding Analysis (1 pt)

Investigate how design choices affect the quality of learned embeddings.

#### Part A: Embedding Dimension Ablation (0.5 pt)

Study how the embedding dimension affects model performance and recommendations.

**Sub-questions:**
- (a) Train embedding models with dimensions: 10, 25, 50, 100, and 200. Use the same optimizer and learning rate for fair comparison.
- (b) Create a table showing: embedding dimension, final training loss, and training time for each configuration.
- (c) For one query movie (e.g., Toy Story), show how the top 5 recommendations change across different embedding dimensions.
- (d) Discuss the trade-offs: When might you prefer smaller vs. larger embedding dimensions?

**Deliverables:**
- [ ] Table comparing embedding dimensions (loss, time)
- [ ] Comparison of recommendations across dimensions for one query movie
- [ ] Written analysis of dimension trade-offs (at least 2-3 sentences)

#### Part B: Cold-Start Problem Analysis (0.5 pt)

Analyze how the model handles movies with few ratings.

**Sub-questions:**
- (a) Identify 5 movies with very few ratings (< 10 ratings) and 5 movies with many ratings (> 100 ratings) in the dataset.
- (b) For each of these 10 movies, find their 3 nearest neighbors in embedding space and report them.
- (c) Compare the quality of recommendations for popular vs. unpopular movies. Are the recommendations equally sensible? Why or why not?

**Deliverables:**
- [ ] Table of 5 low-rating and 5 high-rating movies with their rating counts
- [ ] Nearest neighbor recommendations for each of the 10 movies
- [ ] Written analysis comparing recommendation quality (at least 3-4 sentences)

**Grading Criteria:**
- Full credit: Thorough ablation study with insightful analysis of results
- Partial credit: Experiments run but analysis is superficial
- No credit: Missing experiments or no analysis

---

## Hints

### Question 1 Hints

**Data Loading:**
- Use `torchvision.datasets.CIFAR10` for easy data loading
- CIFAR-10 images are 32×32, but the pretrained ResNet18 was trained on 224×224 inputs. Use either `transforms.Resize(224)` on its own, or `transforms.Resize(256)` followed by `transforms.CenterCrop(224)`
- Apply ImageNet normalization: `transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])`

**Feature Extraction:**
- Load the pretrained model with `models.resnet18(weights='IMAGENET1K_V1')`. The older `models.resnet18(pretrained=True)` form is deprecated in torchvision 0.13 and later
- To extract features, remove the final fully connected layer or use a forward hook
- Freeze parameters: `for param in model.parameters(): param.requires_grad = False`
- ResNet18 produces 512-dimensional feature vectors (before the final FC layer)

**Finetuning:**
- Use a smaller learning rate for pretrained layers (e.g., 1e-4) and larger for the new classifier head (e.g., 1e-3)
- Consider unfreezing only the last few layers initially
- Use `torch.optim.lr_scheduler` for learning rate scheduling
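
The two-learning-rate idea above maps onto optimizer parameter groups. The following is a rough sketch under stated assumptions: the classifier head is assumed to live under the attribute `fc` (as in torchvision's ResNet18), Adam and `StepLR` are arbitrary choices, and `TinyNet` is a hypothetical stand-in used only to keep the example fast.

```python
import torch
import torch.nn as nn


def build_finetune_optimizer(model: nn.Module, head_prefix: str = "fc.",
                             backbone_lr: float = 1e-4, head_lr: float = 1e-3):
    """Adam with separate learning rates for the backbone and the head."""
    backbone, head = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue                  # frozen layers are left out entirely
        (head if name.startswith(head_prefix) else backbone).append(param)
    optimizer = torch.optim.Adam([
        {"params": backbone, "lr": backbone_lr},
        {"params": head, "lr": head_lr},
    ])
    # Assumed schedule: multiply both learning rates by 0.1 every 5 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
    return optimizer, scheduler


def count_trainable(module: nn.Module) -> int:
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


class TinyNet(nn.Module):
    """Hypothetical stand-in: `body` plays the backbone, `fc` the new head."""

    def __init__(self):
        super().__init__()
        self.body = nn.Linear(8, 4)
        self.fc = nn.Linear(4, 10)

    def forward(self, x):
        return self.fc(torch.relu(self.body(x)))


tiny = TinyNet()
optimizer, scheduler = build_finetune_optimizer(tiny)
print(count_trainable(tiny))
```

`count_trainable` gives the "number of trainable parameters" figure needed for the comparison table in Part B.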

**Visualization:**
- Use `matplotlib.pyplot.subplots()` with appropriate grid dimensions
- Remember to denormalize images before displaying: `img = img * std + mean`
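
The denormalization step can be written as a small helper. This is a minimal sketch assuming the ImageNet statistics from the data-loading hints; the name `denormalize` is chosen for illustration.

```python
import torch

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)


def denormalize(img: torch.Tensor) -> torch.Tensor:
    """Undo ImageNet normalization on a (3, H, W) tensor; return (H, W, 3)."""
    img = img.detach().cpu() * IMAGENET_STD + IMAGENET_MEAN
    return img.clamp(0.0, 1.0).permute(1, 2, 0)   # channels last for imshow


# Round trip on a random image with values in [0, 1].
original = torch.rand(3, 32, 32)
normalized = (original - IMAGENET_MEAN) / IMAGENET_STD
restored = denormalize(normalized)
```

The returned tensor is in the height-width-channel layout that `matplotlib`'s `imshow` expects.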

### Question 2 Hints

**Data Preparation:**
- Read `ratings.csv` using `pandas.read_csv()`
- A common threshold for "liked" is rating ≥ 4.0 (out of 5)
- Build a user-movie matrix first, then compute co-occurrence: $X = A^T A$ where $A$ is the binary user-movie "liked" matrix
- The diagonal of $X$ counts how many users liked each movie; ignore it in the loss
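
As a rough illustration of the $X = A^T A$ construction, the sketch below builds the co-occurrence matrix from a toy ratings table. The function name `cooccurrence` and the toy data are assumptions; only the column names `userId`, `movieId`, and `rating` come from `ratings.csv`. In this sketch, movies that no user liked are dropped from the matrix.

```python
import numpy as np
import pandas as pd


def cooccurrence(ratings: pd.DataFrame, threshold: float = 4.0):
    """Return (X, movie_ids) where X[i, j] counts users who liked both movies."""
    liked = ratings[ratings["rating"] >= threshold]
    user_ids, u = np.unique(liked["userId"].to_numpy(), return_inverse=True)
    movie_ids, m = np.unique(liked["movieId"].to_numpy(), return_inverse=True)
    A = np.zeros((len(user_ids), len(movie_ids)), dtype=np.float32)
    A[u, m] = 1.0                     # binary user-by-movie "liked" matrix
    return A.T @ A, movie_ids


# Toy ratings in the same column layout as ratings.csv.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30, 30, 10],
    "rating":  [5.0, 4.0, 4.5, 4.0, 2.0, 5.0, 3.0],
})
X, movie_ids = cooccurrence(toy)
print(X)
```

Row and column `i` of `X` correspond to `movie_ids[i]`, so keep that array around for mapping indices back to titles.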

**Optimization:**
- Define embeddings using `nn.Embedding(num_movies, embedding_dim)` or `nn.Parameter(torch.randn(num_movies, embedding_dim))`
- Compute the loss efficiently using matrix operations: `predictions = embeddings @ embeddings.T`
- Create a mask to exclude diagonal elements: `mask = ~torch.eye(M, dtype=torch.bool)`
- Start with a smaller subset of movies if training is slow
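
The optimization hints above can be assembled into a compact full-batch training loop. This is a sketch rather than a reference solution: the names `masked_loss` and `train_embeddings`, the choice of Adam, the 0.1 initialization scale, and the toy matrix are all assumptions made for illustration.

```python
import torch


def masked_loss(V: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    """Sum of squared errors between V V^T and X over off-diagonal entries."""
    mask = ~torch.eye(V.shape[0], dtype=torch.bool, device=V.device)
    diff = V @ V.T - X
    return (diff[mask] ** 2).sum()


def train_embeddings(X, dim=50, lr=0.01, epochs=100, seed=0):
    """Full-batch training with Adam; returns embeddings and loss history."""
    torch.manual_seed(seed)
    V = torch.nn.Parameter(0.1 * torch.randn(X.shape[0], dim))
    optimizer = torch.optim.Adam([V], lr=lr)
    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = masked_loss(V, X)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return V.detach(), losses


# Toy co-occurrence matrix for three movies (the mask ignores the diagonal).
X_toy = torch.tensor([[2.0, 2.0, 0.0], [2.0, 2.0, 0.0], [0.0, 0.0, 1.0]])
V, losses = train_embeddings(X_toy, dim=2, lr=0.05, epochs=200)
print(losses[0], losses[-1])
```

The returned `losses` list can be plotted directly for the learning-rate and optimizer comparisons; swapping the optimizer class inside `train_embeddings` is enough to compare SGD, Adam, and RMSprop.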

**Recommendation:**
- Use cosine similarity or Euclidean distance to find similar movies
- `torch.nn.functional.cosine_similarity()` or `sklearn.metrics.pairwise.cosine_similarity()`
- Use `movies.csv` to map movie IDs to titles
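
A nearest-neighbor lookup with cosine similarity might look like the sketch below. It uses plain NumPy instead of the two library functions named above, and the helper name `top_k_similar` and the toy embeddings are assumptions.

```python
import numpy as np


def top_k_similar(embeddings: np.ndarray, query_idx: int, k: int = 10) -> list:
    """Indices of the k rows most cosine-similar to row query_idx."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)   # guard against zero rows
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf                          # never return the query
    return np.argsort(-sims)[:k].tolist()


# Four toy 2-d embeddings; row 1 points almost the same way as row 0.
E = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(top_k_similar(E, query_idx=0, k=2))   # -> [1, 2]
```

The returned indices refer to rows of the embedding matrix, so they still need to be mapped to movie IDs and then to titles via `movies.csv`.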

### Question 3 Hints

**t-SNE:**
- Use `sklearn.manifold.TSNE(n_components=2, perplexity=30, random_state=42)`
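- The default iteration budget is already 1000; if the projection has not converged on a large dataset, raise it with `max_iter` (named `n_iter` in scikit-learn versions before 1.5)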
- t-SNE is stochastic; set `random_state` for reproducibility
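
A minimal t-SNE projection following the hints above is sketched here. The wrapper `project_2d` and the random stand-in embeddings are assumptions; in the assignment the input would be the learned movie embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_2d(embeddings: np.ndarray, perplexity: float = 30.0,
               seed: int = 42) -> np.ndarray:
    """Reduce (n_movies, dim) embeddings to (n_movies, 2) with t-SNE."""
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(embeddings)


# Random stand-in for learned embeddings: 100 movies, 50 dimensions.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 50)).astype(np.float32)
coords = project_2d(fake_embeddings)
print(coords.shape)
```

Note that recent scikit-learn versions require the perplexity to be smaller than the number of samples, which matters when experimenting on a small subset of movies.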

**Genre Coloring:**
- Read `movies.csv` for genre information
- Movies can have multiple genres; use the first listed genre as "primary"
- Use a categorical colormap: `plt.cm.tab20` or `sns.color_palette("husl", n_genres)`
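
Extracting the primary genre and turning it into a color index could be done as sketched below. The helper name `primary_genre_codes` is hypothetical, and the three rows are a toy excerpt in the `movies.csv` layout, where genres are separated by `|`.

```python
import pandas as pd


def primary_genre_codes(movies: pd.DataFrame):
    """Return (codes, labels): an integer color index per movie and genre names."""
    primary = movies["genres"].str.split("|").str[0]   # first listed genre
    codes, labels = pd.factorize(primary, sort=True)
    return codes, list(labels)


# Toy rows in the same column layout as movies.csv.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 6],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Action|Crime|Thriller"],
})
codes, labels = primary_genre_codes(toy_movies)
print(labels, codes)
```

The integer `codes` can be passed as `c=codes` together with `cmap="tab20"` to a matplotlib scatter call, and `labels` supplies the legend text.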

**Annotations:**
- Use `plt.annotate()` or `ax.text()` to label points
- Offset labels slightly to avoid overlapping with points

### Question 4 Hints

**Ablation Study:**
- Keep all hyperparameters except embedding dimension constant
- Train for the same number of epochs for fair comparison
- Use `time.time()` to measure training time
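
The ablation loop can be kept separate from the training code. The sketch below assumes a user-supplied `train_fn(dim)` that returns the final loss; it uses `time.perf_counter()` in place of `time.time()` because it is better suited to measuring intervals, and `fake_train` is a hypothetical placeholder rather than a real training run.

```python
import time


def run_ablation(train_fn, dims=(10, 25, 50, 100, 200)):
    """Call train_fn(dim) for each dimension; record final loss and wall time."""
    rows = []
    for dim in dims:
        start = time.perf_counter()
        final_loss = train_fn(dim)
        elapsed = time.perf_counter() - start
        rows.append({"dim": dim, "final_loss": final_loss, "seconds": elapsed})
    return rows


# Hypothetical placeholder for a real training run (not a real result).
def fake_train(dim):
    return 1.0 / dim


results = run_ablation(fake_train)
for row in results:
    print(row)
```

The list of dictionaries converts directly into a table with `pandas.DataFrame(results)`.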

**Cold-Start Analysis:**
- Count ratings per movie: `ratings_df.groupby('movieId').size()`
- Popular movies should have better embeddings due to more training signal
- Consider: What happens when a movie only co-occurs with a few others?
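
Selecting rarely and frequently rated movies from the rating counts might be sketched as follows. The function name `split_by_popularity` and the toy data are assumptions; the default thresholds mirror the "< 10" and "> 100" values in Question 4.

```python
import pandas as pd


def split_by_popularity(ratings: pd.DataFrame, low: int = 10, high: int = 100):
    """Return (rare, popular): rating counts for movies below low and above high."""
    counts = ratings.groupby("movieId").size()
    rare = counts[counts < low].sort_values()
    popular = counts[counts > high].sort_values(ascending=False)
    return rare, popular


# Toy data: movie 1 has 5 ratings, movie 2 has 2, movie 3 has 1.
toy = pd.DataFrame({"movieId": [1] * 5 + [2] * 2 + [3], "rating": [4.0] * 8})
rare, popular = split_by_popularity(toy, low=2, high=3)
print(rare.index.tolist(), popular.index.tolist())
```

A rarely rated movie may have no ratings above the "liked" threshold at all, in which case it might not appear in the co-occurrence matrix; it is worth checking this before looking up its neighbors.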

---

## FAQ

**Q: Do I need a GPU for this assignment?**  
A: A GPU will significantly speed up training, especially for Question 1 (CNN finetuning). However, the assignment can be completed on a CPU. Expected times:
- Question 1 (feature extraction): ~10 min on CPU, ~1 min on GPU
- Question 1 (finetuning): ~30-60 min on CPU, ~5 min on GPU
- Questions 2-4 (embeddings): ~5-10 min on CPU, faster on GPU

**Q: What rating threshold should I use for "liked" in Question 2?**  
A: You have flexibility here. Common choices are ≥ 4.0 or ≥ 3.5. Document your choice and justify it briefly. The key is to be consistent.

**Q: How many epochs should I train for?**  
A: For Question 1, train until the validation loss plateaus (typically 10-25 epochs). For Question 2, train until the loss converges (typically 100-500 epochs depending on learning rate).

**Q: What if a query movie isn't in my dataset?**  
A: The three query movies (Apollo 13, Toy Story, Home Alone) are all present in the MovieLens small dataset. If you have trouble finding them, note that titles in `movies.csv` include the release year (for example, `Toy Story (1995)`) and that exact string matching is case-sensitive.
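
A case-insensitive substring search is one way to sidestep exact-match problems when looking up titles. The sketch below uses a toy table in the `movies.csv` layout; the helper name `find_titles` is hypothetical and the rows are illustrative only.

```python
import pandas as pd


def find_titles(movies: pd.DataFrame, query: str) -> pd.DataFrame:
    """Case-insensitive literal substring search over the title column."""
    hits = movies["title"].str.contains(query, case=False, regex=False)
    return movies.loc[hits, ["movieId", "title"]]


# Toy rows for illustration only.
toy_movies = pd.DataFrame({
    "movieId": [1, 150, 586, 3114],
    "title": ["Toy Story (1995)", "Apollo 13 (1995)",
              "Home Alone (1990)", "Toy Story 2 (1999)"],
})
matches = find_titles(toy_movies, "toy story")
print(matches)
```

A substring search can return sequels as well as the intended film, so check the matches and pick the right row by year.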

**Q: How should I handle movies with multiple genres?**  
A: For visualization purposes, use the first genre listed as the primary genre. You may also try other approaches (e.g., one-hot encoding multiple genres) and discuss the differences.

**Q: What if my t-SNE visualization looks random/unstructured?**  
A: This could indicate: (1) embeddings haven't converged – train longer, (2) perplexity is too high/low – try perplexity in [5, 50], (3) the embedding model isn't learning meaningful representations – check your loss is decreasing.

**Q: Should I normalize embeddings before computing similarity?**  
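A: Cosine similarity already divides by the vector norms, so no separate normalization step is needed. If you L2-normalize the embeddings first, a plain dot product gives the same values. For Euclidean distance, normalization is optional but can help. Be consistent and document your choice.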

---

## Resources

### Lecture Slides
- [`M03a_cnn_and_transfer.pdf`](slides/M03a_cnn_and_transfer.pdf) – CNNs, transfer learning, feature extraction, and finetuning strategies
- [`M04_text.pdf`](slides/M04_text.pdf) – Word embeddings and distributional semantics (concepts applicable to movie embeddings)

### Example Notebooks
- [`ConvolutionalNet_Classifier_Example.ipynb`](examples/M03_cnn_transfer/ConvolutionalNet_Classifier_Example.ipynb) – Transfer learning with ResNet, feature extraction vs. finetuning
- [`TSNE_Embedding_Example_MNIST.ipynb`](examples/M03_cnn_transfer/TSNE_Embedding_Example_MNIST.ipynb) – t-SNE visualization of embeddings
- [`FFN_Classifier_Example.ipynb`](examples/M02_feedforward/FFN_Classifier_Example.ipynb) – PyTorch training loop patterns

### External Documentation

**PyTorch:**
- [torchvision.models.resnet18](https://pytorch.org/vision/stable/models/generated/torchvision.models.resnet18.html)
- [torch.nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
- [Transfer Learning Tutorial](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html)

**Scikit-learn:**
- [sklearn.manifold.TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)
- [Manifold Learning Examples](https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html)

### Dataset Links
- [CIFAR-10 Dataset](http://www.cs.toronto.edu/~kriz/cifar.html)
- [CIFAR-10 Python Version (direct download)](https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz)
- [MovieLens Small Dataset](https://files.grouplens.org/datasets/movielens/ml-latest-small.zip)
- [MovieLens Dataset Description](https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html)

---

## Point Summary

| Question | Topic | Points |
|----------|-------|--------|
| Q1 | CNNs and Finetuning | 4.0 |
| Q2 | Movie Embeddings | 4.0 |
| Q3 | Embedding Visualization | 1.0 |
| Q4 | Embedding Analysis | 1.0 |
| **Total** | | **10.0** |