### Week 1

#### Q1. Why do we usually begin with a linear model?

##### Answer


Key Advantages of Linear Models

Linear models are valuable for several reasons, which make them a popular choice for a wide range of applications:

$(1)$ Simplicity and Interpretability

$(2)$ Computational Efficiency

$(3)$ Well-Understood Statistical Properties

$(4)$ Foundation for More Complex Models


When to Use a Linear Model

A linear model is often the best choice in the following situations:

$(1)$ When the relationship is likely linear

$(2)$ For baseline performance

$(3)$ When interpretability is a priority

Limitations of Linear Models

Their main limitation is their assumption of a linear relationship between the dependent and independent variables.
If the underlying data has a complex, non-linear pattern, a linear model will likely fail to capture it accurately. 
This can lead to underfitting and poor predictive performance.
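
The underfitting described above can be seen in a minimal sketch. It assumes two toy datasets that are not part of the original notes (one exactly linear, one quadratic) and uses a hypothetical helper `fit_line` that implements closed-form least squares for a single feature.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form, one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def mse(xs, ys, a, b):
    """Mean squared error of the line a*x + b on the data."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Assumed toy data: 41 evenly spaced points in [-2, 2].
xs = [i / 10 for i in range(-20, 21)]
ys_linear = [3 * x + 1 for x in xs]   # truly linear relationship
ys_quadratic = [x ** 2 for x in xs]   # non-linear relationship

a1, b1 = fit_line(xs, ys_linear)
a2, b2 = fit_line(xs, ys_quadratic)
mse_linear = mse(xs, ys_linear, a1, b1)
mse_quadratic = mse(xs, ys_quadratic, a2, b2)

print("linear data:    slope, intercept, mse =", a1, b1, mse_linear)
print("quadratic data: slope, intercept, mse =", a2, b2, mse_quadratic)
```

On the linear data the fit recovers the slope and intercept with essentially zero error. On the quadratic data the best line is nearly flat and the error stays large however much data is added, which is what underfitting looks like.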

---

#### Q2. What is the difference between BGD, SGD, and Mini-Batch Gradient Descent, and what are their advantages?


##### Answer
$(1)$ Batch Gradient Descent (BGD)

Advantages:

Stable convergence because the gradient is computed using all data.

Smooth and predictable updates.

Disadvantages:

Very slow on large datasets since every step requires going through all data.

Not suitable for online or streaming data.

Use Case: Small datasets

$(2)$ Stochastic Gradient Descent (SGD)

Advantages:

Much faster per update because it uses only one example.

Can handle very large datasets or streaming data.

Can escape local minima more easily due to noisy updates.

Disadvantages:

Updates are noisy and can fluctuate heavily.

Convergence is less stable; may require learning rate decay.

Use Case: Very large/streaming data

$(3)$ Mini-Batch Gradient Descent

Advantages:

Faster per update than BGD because each step uses only a small subset of the data instead of the full dataset.

Less noisy than SGD; more stable convergence.

Can exploit optimized matrix operations on hardware (like GPUs).

Strikes a balance between speed and accuracy.

Disadvantages:

Still requires tuning of batch size.

Can have some noise in gradient estimate, though usually beneficial.

Use Case: Most deep learning tasks
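
The three variants differ only in how many examples feed each gradient step, so one loop with a `batch_size` parameter can cover all of them. The sketch below is illustrative only: the one-parameter model `y = w*x`, the noiseless toy data, the learning rate and the function name `gradient_descent` are all assumptions made for this example.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y = w*x with squared loss; batch_size selects the variant.

    batch_size == len(xs) -> batch gradient descent (BGD)
    batch_size == 1       -> stochastic gradient descent (SGD)
    anything in between   -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    w = 0.0
    n_updates = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
            n_updates += 1
    return w, n_updates


# Assumed toy data: 20 noiseless points with true weight w = 2.
xs = [i / 10 for i in range(1, 21)]
ys = [2.0 * x for x in xs]

w_bgd, updates_bgd = gradient_descent(xs, ys, batch_size=len(xs))
w_sgd, updates_sgd = gradient_descent(xs, ys, batch_size=1)
w_mini, updates_mini = gradient_descent(xs, ys, batch_size=4)

print("BGD       ", w_bgd, updates_bgd)
print("SGD       ", w_sgd, updates_sgd)
print("mini-batch", w_mini, updates_mini)
```

For the same number of epochs, BGD makes one update per pass, SGD makes one per example, and mini-batch sits in between. Because the data here is noiseless, all three reach the same weight. With noisy data the SGD iterates would keep fluctuating around the solution unless the learning rate is decayed.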

---

### Week 2

#### Q1. Does having more layers or more neurons in a neural network lead to a better result?

Adding more layers (depth) or more neurons per layer (width) both increase the capacity of a neural network, but the effect on performance depends on the task, data, and optimization process.

1. More layers (depth)

Pros: Capture complex, hierarchical features.

Cons: Harder to train, higher compute cost.

2. More neurons (width)

Pros: Increases expressiveness, easier to train.

Cons: More parameters, risk of overfitting.

3. Right Model Size

Too few parameters → underfitting.

Too many parameters → overfitting.

It depends on:

(1) Dataset size/complexity.

(2) Compute budget.

(3) Regularization methods.

4. Key idea

If your model underfits → add more layers/neurons.

If your model overfits → reduce size, or add regularization / more data.

If training is unstable → depth might be too large without proper techniques.
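
To make "model size" concrete, here is a small sketch that counts the trainable parameters of a fully connected network. The two layer configurations and the helper name `count_parameters` are hypothetical, chosen only to compare a wide shallow network with a narrow deep one.

```python
def count_parameters(layer_sizes):
    """Number of weights plus biases in a fully connected network.

    layer_sizes lists the units per layer, input first and output last.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Assumed example architectures with 10 inputs and 1 output.
wide = [10, 256, 1]              # one wide hidden layer
deep = [10, 32, 32, 32, 32, 1]   # four narrow hidden layers

params_wide = count_parameters(wide)
params_deep = count_parameters(deep)
print("wide:", params_wide, "deep:", params_deep)
```

The two networks have a similar parameter budget, so choosing between depth and width is a question of how that budget is spent, not only of how large it is.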

---

### Week 3

#### Q1. In class we saw that a neural network can approximate any polynomial. Can we use the same idea to approximate other functions, e.g. $\sin(x)$ or $\cos(x)$, through their Taylor expansions? What might happen, and what is the result?

The Result: What happens? (The Problem)\
If you train a standard neural network to approximate $\sin(x)$ by mimicking its Taylor polynomial (or simply train it on data near $x=0$), three major issues arise:

A. Local Accuracy vs. Global Divergence\
Taylor series are approximations centered around a specific point (usually $x=0$).

Result: The network will be very accurate near $x=0$.\
Failure: As $x$ moves away from zero (e.g., $x=100$), the highest-degree term of the truncated polynomial (like $x^7$) dominates and the approximation diverges toward $+\infty$ or $-\infty$, whereas the real $\sin(x)$ stays bounded in $[-1, 1]$.

B. Failure to Extrapolate (The Extrapolation Problem)\
This is a well-known limitation in Deep Learning. Most standard neural networks (e.g., those using ReLU activation functions) are piecewise linear.\
Result: If you train the network on the range $[-\pi, \pi]$, it can fit the wave closely there. However, if you ask it to predict at $x = 10\pi$, it will not repeat the wave pattern. Outside the training range a ReLU network is linear, so it simply continues the slope of its last linear piece. It cannot "learn" the concept of periodicity through polynomials.

C. Inefficiency\
Approximating many cycles of a sine wave needs a very high-degree polynomial. This requires a very large and deep neural network, which is computationally inefficient compared to using a periodic activation function.
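
The local-accuracy versus global-divergence behavior can be checked directly on the Taylor polynomial itself, without a network. The sketch below assumes a degree-7 truncation around $x=0$; the function name `taylor_sin` and the two test points are illustrative choices.

```python
import math


def taylor_sin(x, degree=7):
    """Taylor polynomial of sin around 0, keeping terms up to x**degree."""
    total = 0.0
    for k in range(degree // 2 + 1):
        n = 2 * k + 1          # only odd powers appear in the series of sin
        if n > degree:
            break
        total += (-1) ** k * x ** n / math.factorial(n)
    return total


error_near_zero = abs(taylor_sin(0.5) - math.sin(0.5))
value_far_away = taylor_sin(10.0)

print("error at x = 0.5:", error_near_zero)
print("polynomial at x = 10:", value_far_away, "true sin(10):", math.sin(10.0))
```

Near zero the error is tiny, while at $x = 10$ the polynomial is already far outside $[-1, 1]$. A network that only reproduces this polynomial inherits the same failure away from the expansion point.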

References: \
**Ziyin, L., Hartwig, T., & Ueda, M. (2020). Neural networks fail to learn periodic functions and how to fix it. Advances in Neural Information Processing Systems, 33.**

**Morra, S., Tosello, F., & Zoso, D. (2022). NN-Poly: Approximating common neural networks with Taylor polynomials. Frontiers in Robotics and AI, 9.**

---

### Week 4

#### Q1. Generative and discriminative models perform differently under different conditions. When does a generative model do better or worse than a discriminative model?

When is Generative "Better"?\
A. Small Data Regimes (Few Training Examples)\
With few examples, a discriminative model is prone to overfitting or cannot yet locate the boundary, whereas the distributional assumptions of a generative model let it approach its own asymptotic error with fewer examples.\
B. Missing Data\
A generative model can marginalize over the missing variables and still make a prediction.\
C. Out-of-Distribution (OOD) & Anomaly Detection\
Because generative models learn what "normal" data looks like ($P(X)$), they can flag a new input that is unusual.\
D. Unsupervised / Semi-Supervised Learning\
Generative models can use unlabeled data to improve their estimate of $P(X)$. Discriminative models generally require every $X$ to have a label $Y$.

When is Discriminative "Better"?\
A. Large Data Regimes\
With enough data, discriminative models usually reach a lower asymptotic error because they optimize $P(Y|X)$ directly.\
B. Complex Feature Correlations\
If the relationship between input features is complex (e.g., in image pixels), simple generative models (like Naive Bayes) fail because their independence assumptions are violated. Discriminative models (like CNNs) do not model the input distribution; they just learn the mapping $X \to Y$.



References: \
**Ng, A. Y., & Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14.**




---

### Week 5

#### Q1. What may happen if the real-world data we use does not follow a normal distribution, as the GDA model assumes? What is the result?

A. The "Multimodal" Failure (Catastrophic)\
Setup: Suppose the data of one class forms two separate clusters (a bimodal distribution).\
The GDA Result: GDA fits a single Gaussian "bell curve" over the two peaks. The estimated mean ($\mu$) lands between the two clusters, where little or no actual data exists.\
Consequence: The model assigns its highest density to a region where the true data density is close to zero.
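
A minimal one-dimensional sketch of this failure, assuming a single class whose samples come from two clusters centred at $-5$ and $+5$ (a toy setup invented for illustration). It fits one Gaussian by maximum likelihood, as GDA does per class, and counts how many samples actually lie near the fitted mean.

```python
import math
import random

rng = random.Random(0)
# Assumed toy class: two well-separated clusters of 500 points each.
data = [rng.gauss(-5.0, 1.0) for _ in range(500)] + \
       [rng.gauss(5.0, 1.0) for _ in range(500)]

# Maximum-likelihood estimates of a single Gaussian (what GDA fits per class).
mu = sum(data) / len(data)
var = sum((x - mu) ** 2 for x in data) / len(data)


def gaussian_pdf(x, mean, variance):
    """Density of N(mean, variance) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


density_at_mean = gaussian_pdf(mu, mu, var)
density_at_cluster = gaussian_pdf(5.0, mu, var)
points_near_mean = sum(1 for x in data if abs(x - mu) < 1.0)

print("fitted mean and variance:", mu, var)
print("fitted density at mean vs. at a cluster centre:", density_at_mean, density_at_cluster)
print("samples within 1.0 of the fitted mean:", points_near_mean)
```

The fitted Gaussian peaks at its mean, yet almost no samples fall there, while the real cluster centres receive a lower fitted density.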

B. The "Heavy Tail" / Outlier Problem\
The GDA Result: GDA estimates its parameters with the sample mean and sample covariance, which are not robust. A single extreme outlier can noticeably shift the estimated mean and inflate the covariance matrix.\
Consequence: The decision boundary shifts toward the outlier to accommodate it, degrading accuracy for the majority of "normal" points.

C. Asymptotic Error (The Theoretical Limit)\
If assumptions hold: GDA is asymptotically efficient (it needs the least data to approach the best achievable error).\
If assumptions fail: GDA approaches a biased limit. Even with 1,000,000 data points, it converges to a boundary that is not the best one.\
Contrast: Discriminative models (like Logistic Regression) make fewer assumptions. As data increases, they converge to the best boundary within their model class even if the data is not Gaussian.

References: \
**Ng, A. Y., & Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14.**

**Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.**

**Hubert, M., & Van Driessen, K. (2004). Fast and robust discriminant analysis. Computational Statistics & Data Analysis, 45(2).**


---

### Week 6

#### Q1. When the data are Gaussian, which gives more accurate results, GDA or logistic regression? What happens if the data deviate slightly from Gaussian?



Scenario A: The Data is Perfectly Gaussian\
Winner: GDA\
Why?\
GDA uses the strong assumption of normality to "fill in the gaps" where data is scarce. Logistic Regression ignores this distributional information.

Scenario B: The Data Deviates Slightly\
Winner: Logistic Regression (given enough data)\
Why?\
With small samples GDA can still be competitive, but as the dataset grows, Logistic Regression will eventually outperform GDA because GDA is stuck with a "modeling error" (bias), whereas Logistic Regression does not depend on the Gaussian assumption.


References: \
**Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70(352).**

**Ng, A. Y., & Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14.**

---


#### Q2. What methods can adjust the decision boundary of GDA so that it takes different shapes? Can it become an irregular curve?

A. Mixture Discriminant Analysis (MDA) - The key to "Irregular" shapes\
Standard GDA (one Gaussian per class) gives a linear boundary when the classes share a covariance matrix and a quadratic boundary when each class has its own. If you want the boundary to be an irregular or wavy curve, that is not enough; you must use mixture models.\
Logic: $P(X|Y=A) \approx \sum_i w_i \cdot \mathcal{N}(X; \mu_i, \Sigma_i)$.\
Result: The decision boundary becomes the intersection of these multiple "hills." This allows the boundary to be highly irregular, disconnected, or wobbly, approximating complex shapes as more components are added.
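
A one-dimensional sketch of how a mixture class density changes the decision regions. All numbers are hypothetical: class A is modeled as an equal-weight mixture of two Gaussians at $-3$ and $+3$, class B as a single Gaussian at $0$, and the priors are equal.

```python
import math


def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def density_a(x):
    # Class A: mixture of two Gaussians (assumed weights and means).
    return 0.5 * normal_pdf(x, -3.0, 1.0) + 0.5 * normal_pdf(x, 3.0, 1.0)


def density_b(x):
    # Class B: a single Gaussian, as in standard GDA.
    return normal_pdf(x, 0.0, 1.0)


def predict(x, prior_a=0.5):
    """Bayes rule: pick the class with the larger prior times density."""
    return "A" if prior_a * density_a(x) > (1.0 - prior_a) * density_b(x) else "B"


labels = [predict(x) for x in (-4.0, 0.0, 4.0)]
print(labels)
```

Class A is predicted on both sides of class B, so its decision region is disconnected. A single Gaussian per class with equal variances cannot produce that shape, since it yields a single threshold in one dimension.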

B. Kernel GDA (Generalized Discriminant Analysis)\
Method: You map the original input data $x$ into a very high-dimensional feature space $\phi(x)$ (e.g., using a Gaussian Radial Basis Function kernel). You then perform standard GDA in that high-dimensional space.\
Result: When you project the boundary back to the original 2D space, it appears as a highly complex, non-linear, and irregular contour enclosing the data points.

References: \
**Hastie, T., & Tibshirani, R. (1996). Mixture discriminant analysis. Journal of the Royal Statistical Society: Series B (Methodological), 58(1).**

**Baudat, G., & Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Computation, 12(10).**

**Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American Statistical Association, 84(405).**


---

### Week 7

#### Q1. When using DSM, how should we choose the noise so that the computation is convenient and the result is accurate?

The standard choice is Gaussian Noise combined with a Geometric Noise Schedule (Annealing).

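A. Choice of Distribution: Why Gaussian?\
For a Gaussian perturbation $\tilde{x} = x + \sigma z$ with $z \sim \mathcal{N}(0, I)$, the score of the perturbation kernel has the closed form $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x) = -(\tilde{x} - x)/\sigma^2 = -z/\sigma$.\
The Result: This allows us to train the neural network simply by predicting the noise $z$ (or the clean image $x$) with a standard Mean Squared Error (MSE) loss. With other distributions (like Cauchy or Uniform), the target is no longer a rescaled copy of the noise; for a Uniform kernel the score is zero inside the support and undefined at its edge, so training becomes unstable or impractical.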

B. Choice of Parameters: Why a Schedule?\
If $\sigma$ is too small: The model only sees data near the manifold. In the empty space, the estimated gradients are random/inaccurate. When you try to generate data starting from random noise, the model won't know how to "push" the point toward the manifold.\
If $\sigma$ is too large: The noise destroys the data structure. The model learns a blurry "blob" rather than detailed features.

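To maximize both convenience and accuracy, set the perturbation kernel to $q_\sigma(\tilde{x}|x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)$ and train a single model conditioned on the noise level $\sigma$, with $\sigma$ taken from a geometric sequence of levels. At sampling time the levels are annealed from the largest $\sigma$ down to the smallest.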
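
A small sketch of the two ingredients: a geometric noise schedule and the regression target that DSM uses under a Gaussian kernel. The helper names `geometric_sigmas` and `dsm_sample`, the endpoint values and the toy data point are assumptions for illustration, not values from any particular paper.

```python
import random


def geometric_sigmas(sigma_max, sigma_min, num_levels):
    """Noise levels from sigma_max down to sigma_min with a constant ratio."""
    ratio = (sigma_min / sigma_max) ** (1.0 / (num_levels - 1))
    return [sigma_max * ratio ** i for i in range(num_levels)]


def dsm_sample(x, sigma, rng):
    """Perturb x with N(0, sigma^2 I) noise and return the DSM target."""
    z = [rng.gauss(0.0, 1.0) for _ in x]
    x_tilde = [xi + sigma * zi for xi, zi in zip(x, z)]
    # Score of the Gaussian kernel: -(x_tilde - x) / sigma^2, equal to -z / sigma.
    target = [-(xt - xi) / sigma ** 2 for xt, xi in zip(x_tilde, x)]
    return x_tilde, z, target


# Assumed schedule: 4 levels between 10 and 0.01 (ratio 0.1 between neighbours).
sigmas = geometric_sigmas(sigma_max=10.0, sigma_min=0.01, num_levels=4)

rng = random.Random(0)
x = [1.0, -2.0, 0.5]                      # hypothetical clean data point
x_tilde, z, target = dsm_sample(x, sigmas[1], rng)

print("sigmas:", sigmas)
print("target:", target)
print("-z / sigma:", [-zi / sigmas[1] for zi in z])
```

A score model $s_\theta(\tilde{x}, \sigma)$ would then be trained to match `target` with a squared error, usually weighted per noise level so that no single $\sigma$ dominates the loss.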





References:\
**Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7).**

**Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32.**

**Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6.**

---

#### Q2. Can we combine the ideas of DSM and SSM, that is, noise perturbation and projection onto random vectors, into a method that might work better?

Based on the evolution of Score-Based Generative Models, the answer is Yes. Combining DSM's noise perturbation with SSM's projection onto random vectors (slicing) gives a practical method that can be described as sliced score matching on noise-perturbed data.

The Solution: The Hybrid Approach

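A. Handling "Black Box" or Non-Gaussian Noise\
DSM needs the score of the perturbation kernel in closed form, which is why Gaussian noise is the usual choice. SSM only needs samples of the perturbed data, so the noise does not have to be Gaussian.\
Result: You can learn the score of the corrupted data even if the corruption process is complex or unknown (as long as you can sample from it).

B. Stabilizing SSM (Song & Ermon, 2019)\
Problem: Running pure SSM on real images is unstable because real data tends to lie on a lower-dimensional manifold, so the score is undefined in the empty space around it.\
Fix: Song & Ermon "inflated" the manifold by adding a small amount of noise to the data (the DSM concept), which gives the perturbed distribution full support.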
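
For reference, the hybrid objective can be written as the sliced score matching loss evaluated on perturbed data. The notation is introduced here for illustration: $s_\theta$ is the score model, $v \sim p_v$ is a random projection direction, and $q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x}|x)\, p_{\text{data}}(x)\, dx$ is the noise-perturbed data distribution.

$$J(\theta; \sigma) = \mathbb{E}_{v \sim p_v}\, \mathbb{E}_{\tilde{x} \sim q_\sigma} \left[ v^\top \nabla_{\tilde{x}} s_\theta(\tilde{x})\, v + \tfrac{1}{2} \left( v^\top s_\theta(\tilde{x}) \right)^2 \right]$$

The first term is a Jacobian-vector product along $v$, so the full Jacobian is never formed. Only samples $\tilde{x}$ are needed, not the score of the perturbation kernel.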


References:\
**Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32.**

**Song, Y., Garg, S., Shi, J., & Ermon, S. (2019). Sliced score matching: A scalable approach to density and score estimation. Proceedings of the 35th Uncertainty in Artificial Intelligence Conference.**

**Yoon, S., Lee, G. H., & Hwang, S. J. (2023). Robustifying generation of score-based models via divergence minimization. International Conference on Machine Learning.**

---

### Week 8

#### Q1. The long-term trend of most SDEs is governed by the drift term. Are there examples where it is governed by the noise instead?

Yes.

This is broadly known as "Noise-Induced Phenomena."

A. Geometric Brownian Motion (GBM)\
The SDE: $dX_t = \mu X_t\, dt + \sigma X_t\, dW_t$\
The Noise Effect: By Itô's lemma, the long-term trend of the log-price $\ln X_t$ is determined by
$$d(\ln X_t) = \left(\mu - \tfrac{1}{2}\sigma^2\right) dt + \sigma\, dW_t$$
The Result: Even if the drift is positive (e.g., $\mu = 5\%$), if the noise is too large ($\sigma^2 > 2\mu$), the effective growth rate $\left(\mu - \tfrac{1}{2}\sigma^2\right)$ becomes negative.\
Conclusion: High noise makes almost every path decay toward 0 in the long run, overpowering the positive drift.
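
This effect is easy to check numerically with the exact GBM solution $X_t = X_0 \exp\left((\mu - \tfrac{1}{2}\sigma^2)t + \sigma W_t\right)$. The parameter values below ($\mu = 0.05$, $\sigma = 0.5$, $t = 100$) and the helper name `gbm_terminal` are assumptions chosen so that $\sigma^2 > 2\mu$.

```python
import math
import random


def gbm_terminal(x0, mu, sigma, t, rng):
    """Sample X_t from the exact GBM solution (no time discretization)."""
    w_t = rng.gauss(0.0, math.sqrt(t))   # W_t ~ N(0, t)
    return x0 * math.exp((mu - 0.5 * sigma ** 2) * t + sigma * w_t)


# Assumed parameters: positive drift, but sigma^2 = 0.25 > 2 * mu = 0.10.
mu, sigma, t = 0.05, 0.5, 100.0
effective_rate = mu - 0.5 * sigma ** 2

rng = random.Random(0)
finals = [gbm_terminal(1.0, mu, sigma, t, rng) for _ in range(2000)]
fraction_below_start = sum(1 for x in finals if x < 1.0) / len(finals)

print("effective growth rate:", effective_rate)
print("fraction of paths below the starting value:", fraction_below_start)
```

Most sampled paths end below their starting value even though $\mu > 0$. The expectation $E[X_t] = X_0 e^{\mu t}$ still grows, but it is carried by a few rare, very large paths.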

B. Noise-Induced Transitions - Physics\
The Result: In a system with several stable states, the long-term behavior (the stationary distribution) is dictated by the noise level. The noise determines whether the system stays trapped in a sub-optimal state or crosses barriers often enough to reach the true global equilibrium (Boltzmann distribution).

C. Langevin Dynamics - Generative Models\
The SDE: $dX_t = -\nabla E(X_t)\, dt + \sqrt{2T}\, dW_t$\
The Result:\
If Noise = 0 ($T = 0$): The dynamics reduce to gradient descent on $E$, and the long-term result is a single point, a local minimum of $E$ (Optimization).\
If Noise > 0 ($T > 0$): The long-term result is a probability distribution with density proportional to $e^{-E(x)/T}$ (Sampling).\
Conclusion: Here, the noise changes the fundamental nature of the solution from "finding a value" to "generating a distribution."
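
A minimal Euler-Maruyama sketch of this contrast, assuming the quadratic energy $E(x) = x^2/2$ (so $-\nabla E(x) = -x$ and the target density for $T > 0$ is Gaussian with variance $T$). The step size, chain count and function names are illustrative choices.

```python
import math
import random


def langevin_final_states(temperature, n_chains=2000, n_steps=500,
                          dt=0.01, x0=3.0, seed=0):
    """Euler-Maruyama for dX = -E'(X) dt + sqrt(2T) dW with E(x) = x^2 / 2."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * temperature * dt)
    finals = []
    for _chain in range(n_chains):
        x = x0
        for _step in range(n_steps):
            x += -x * dt + noise_scale * rng.gauss(0.0, 1.0)
        finals.append(x)
    return finals


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


cold = langevin_final_states(temperature=0.0, n_chains=5)   # no noise
warm = langevin_final_states(temperature=1.0)               # with noise

print("T = 0: final states", cold)
print("T = 1: variance of final states", variance(warm))
```

Without noise every chain ends at the same point near the minimum of $E$. With noise the final states spread out with variance close to $T$, which matches the $e^{-E(x)/T}$ density up to discretization and sampling error.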








References:\
**Horsthemke, W., & Lefever, R. (1984). Noise-induced transitions: Theory and applications in physics, chemistry, and biology. Springer-Verlag.**

**Øksendal, B. (2003). Stochastic differential equations: An introduction with applications (6th ed.). Springer.**

**Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations.**

---

#### Q2. How does the long-term effect of Brownian motion differ across SDE systems?

In short, as time $t \to \infty$, noise can cause a system to stabilize into a distribution, die out (extinction), or continuously hop between states.

A. Mean-Reverting Systems (Additive Noise)\
Example: the Ornstein-Uhlenbeck (OU) process $dX_t = \theta(\mu - X_t)\, dt + \sigma\, dW_t$ with $\theta > 0$.\
Impact of Brownian Motion: "Maintenance of Equilibrium"\
Without Noise: The system would eventually stop exactly at $\mu$.\
With Noise: The system never stops; it continuously fluctuates around $\mu$.\
Long-term Result: The system converges to a Gaussian stationary distribution, $\mathcal{N}(\mu, \sigma^2/(2\theta))$.

B. Multiplicative Noise Systems\
Example: geometric Brownian motion $dX_t = \mu X_t\, dt + \sigma X_t\, dW_t$.\
Long-term Result:\
Low Noise ($\sigma^2 < 2\mu$): Almost every path grows exponentially.\
High Noise ($\sigma^2 > 2\mu$): The system will almost surely converge to 0 (Extinction), even if the drift $\mu$ is positive.\
Difference: Here, noise can "kill" a system, whereas in the OU process, noise "sustains" it.

C. Bi-stable / Multi-stable Systems\
Example: $dX_t = -V'(X_t)\, dt + \sigma\, dW_t$ with a double-well potential $V$.\
Impact of Brownian Motion: "Ergodicity & Barrier Crossing"\
Drift: Tries to trap the system in the nearest local minimum (the bottom of one well).\
Noise: Provides the energy to climb over the barrier separating the wells.\
Long-term Result: The system does not settle in a single state. Instead, it converges to a Boltzmann distribution with density proportional to $e^{-2V(x)/\sigma^2}$, visiting all stable states over time.
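
The barrier-crossing behavior can be illustrated with a short simulation. This is a sketch under assumptions: the SDE is $dX_t = -V'(X_t)\,dt + \sigma\,dW_t$ with the double-well potential $V(x) = x^4/4 - x^2/2$ (minima at $x = \pm 1$), integrated with Euler-Maruyama. The step size, horizon and noise level are arbitrary illustrative values.

```python
import math
import random


def simulate_double_well(sigma, n_steps=20000, dt=0.01, x0=1.0, seed=0):
    """Euler-Maruyama path for dX = -V'(X) dt + sigma dW, V(x) = x^4/4 - x^2/2."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(n_steps):
        drift = -(x ** 3 - x)                     # -V'(x)
        x += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


quiet = simulate_double_well(sigma=0.0)   # no noise: stays in the right well
noisy = simulate_double_well(sigma=1.0)   # noise: hops between the wells

time_in_left_well = sum(1 for x in noisy if x < 0.0) / len(noisy)
print("no noise:   min, max =", min(quiet), max(quiet))
print("with noise: min, max =", min(noisy), max(noisy))
print("fraction of time in the left well:", time_in_left_well)
```

Started at the bottom of the right well, the noiseless path never moves, while the noisy path visits both wells. Over a long enough horizon the time spent in each well reflects the stationary density.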


References:\
**Khasminskii, R. Z. (2011). Stochastic stability of differential equations (2nd ed.). Springer.**

**Evans, L. C. (2013). An introduction to stochastic differential equations. American Mathematical Society.**

**Uhlenbeck, G. E., & Ornstein, L. S. (1930). On the theory of the Brownian motion. Physical Review, 36(5), 823–841.**

---

### Week 10

#### Q1. When solving the reverse SDE with Euler-Maruyama, what concrete effect does the step size $\Delta t$ have? If $\Delta t$ is not small enough, how does the final data distribution change?

If $\Delta t$ is not small enough, it leads to severe discretization error, causing the generated data distribution to deviate significantly from the true distribution.

A. Drift Overshooting and Trajectory Deviation\
Cause: Each Euler-Maruyama step treats the drift (which contains the score) as constant over the whole interval $\Delta t$, so a large step overshoots along an outdated direction.\
Result: The generated images can exhibit geometric distortions, blurred edges, or loss of structural coherence.

B. Manifold Deviation\
Cause: Accumulated overshoot pushes samples into low-density regions where the score network saw little training data and its estimates are unreliable.\
Result: Once the trajectory enters these regions, the model may be unable to guide the data back to the correct path, resulting in pure noise or unrecognizable artifacts.

C. Noise-Drift Imbalance\
Cause: In each step the drift term scales with $\Delta t$ while the injected noise scales with $\sqrt{\Delta t}$.\
Result: A poorly chosen $\Delta t$ disrupts the delicate balance between removing noise and injecting noise. If $\Delta t$ is too large, the deterministic drift might overpower the stochastic diffusion (or vice versa), leading to over-smoothed textures or excessive graininess.

D. Distribution Shift (Statistical Divergence)\
Result:\
Degraded FID Scores: Quantitative metrics for image quality and diversity will worsen significantly.\
Mode Drop: The model may fail to generate diverse samples, capturing only the most common modes of the data while losing rare or detailed examples.
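
The distribution shift can be quantified exactly on a toy problem. The sketch below does not use a real reverse SDE with a learned score. It assumes the linear SDE $dX_t = -X_t\,dt + \sqrt{2}\,dW_t$ as a stand-in, whose true stationary variance is 1, and tracks the variance of the Euler-Maruyama iterates for several step sizes.

```python
def em_variance(dt, n_steps=200, v0=1.0):
    """Variance of Euler-Maruyama iterates for dX = -X dt + sqrt(2) dW.

    One step is X_{n+1} = (1 - dt) * X_n + sqrt(2 * dt) * Z with Z ~ N(0, 1),
    so the variance follows v_{n+1} = (1 - dt)^2 * v_n + 2 * dt.
    The exact SDE has stationary variance 1.
    """
    v = v0
    for _ in range(n_steps):
        v = (1.0 - dt) ** 2 * v + 2.0 * dt
    return v


var_small = em_variance(0.01)   # close to the true value 1
var_medium = em_variance(1.0)   # stable, but the distribution is twice too wide
var_large = em_variance(2.5)    # the scheme is unstable and the variance explodes

print("dt = 0.01:", var_small)
print("dt = 1.0: ", var_medium)
print("dt = 2.5: ", var_large)
```

For this toy SDE the discretized chain settles at variance $1/(1 - \Delta t/2)$ when $\Delta t < 2$ and diverges otherwise. The samples come from a biased distribution unless $\Delta t$ is small, which is the same mechanism that degrades samples from a reverse SDE.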


References:\
**Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations.**

**Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35.**

**Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2022). DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35.**

---