## 1) Training Dynamics (MLP vs LSTM)

From the MLP curves, training improves steadily and validation improves too, but there is a small generalization gap by the end:

- Train loss keeps decreasing while val loss flattens, which is a classic mild overfitting pattern.
- Train accuracy/F1 remain a bit higher than validation.


![](outputs/mlp/mlp_loss.png)
![](outputs/mlp/mlp_acc.png)
![](outputs/mlp/mlp_f1.png)

Some changes might reduce overfitting:

- Increase regularization: slightly higher dropout or stronger weight decay.
- Add early stopping (stop when val macro-F1 stops improving).
- Reduce the hidden size a bit: a smaller model has less capacity to overfit.



The LSTM shows stronger overfitting:

- Training loss falls sharply to very low values, but validation loss stops improving and then rises noticeably from about the midpoint of training.
- Train accuracy/macro-F1 rise very high, while validation levels off much lower.

This indicates the LSTM is learning patterns that fit training well but do not generalize as well.


![](outputs/lstm/lstm_loss.png)
![](outputs/lstm/lstm_acc.png)
![](outputs/lstm/lstm_f1.png)

Some changes to address LSTM overfitting:

- Use early stopping: save the best checkpoint by val macro-F1 and stop once it has not improved for a few epochs.
- Increase dropout (especially on the classification head; optionally also raise the LSTM's own dropout, noting that in PyTorch this argument only acts between stacked layers, so it requires `num_layers > 1`).
- Reduce model capacity (smaller hidden size, or remove bidirectionality if used).
- Stronger weight decay / slightly smaller learning rate can also help.
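
The early-stopping rule from the list above can be sketched in a few lines. This is a minimal illustration, not the training code used for this report: the `EarlyStopping` class, the `patience` value, and the validation macro-F1 numbers are all hypothetical.

```python
class EarlyStopping:
    """Track the best validation score and signal when to stop training."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum gain that counts as improvement
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, score, epoch):
        """Record this epoch's score; return True if training should stop."""
        if score > self.best_score + self.min_delta:
            self.best_score = score
            self.best_epoch = epoch   # a real loop would save a checkpoint here
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation macro-F1 values per epoch (not results from this report).
val_f1_per_epoch = [0.55, 0.62, 0.66, 0.65, 0.64, 0.66, 0.63]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, f1 in enumerate(val_f1_per_epoch):
    if stopper.step(f1, epoch):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch {stopper.best_epoch}")
```

With these toy values the best score is reached at epoch 2 and training stops three non-improving epochs later, so the checkpoint from epoch 2 would be the one evaluated on the test set.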


When using class weights, the confusion matrices show that both models learn to predict *neg* and *pos* rather than collapsing into mostly *neu*. The LSTM in particular makes more correct *pos* and *neg* predictions than the MLP.

Weighted loss can make updates “larger” for minority-class mistakes, which often causes the validation curves to look more jagged early on, but the final macro-F1 is usually better because the model is not ignoring minority classes.
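
To make the "larger updates" point concrete, here is a small sketch of one common weighting scheme, the "balanced" heuristic `n_samples / (n_classes * class_count)`. Whether this matches the exact weights used in the experiments is an assumption; the label counts and the helper names `inverse_frequency_weights` and `weighted_nll` are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Balanced class weights: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


def weighted_nll(prob_true_class, weight):
    """Weighted negative log-likelihood for a single example."""
    return -weight * math.log(prob_true_class)


# Toy label distribution (illustrative only): neu dominates, neg is rare.
labels = ["neu"] * 6 + ["pos"] * 3 + ["neg"] * 1
weights = inverse_frequency_weights(labels)

# The same confident mistake (true class given probability 0.2) costs more
# when the true class is rare, so its gradient contribution is larger too.
loss_neg = weighted_nll(0.2, weights["neg"])
loss_neu = weighted_nll(0.2, weights["neu"])
print(weights, loss_neg, loss_neu)
```

Because the rare class gets the largest weight, a handful of minority examples in a batch can dominate the update, which is consistent with the noisier early validation curves described above.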


![](outputs/mlp/mlp_confusion_matrix.png)
![](outputs/lstm/lstm_confusion_matrix.png)




## 2) Model Performance and Error Analysis (MLP vs LSTM)


The LSTM generalized better than the MLP based on the test confusion matrices (and the metrics derived from them):

- MLP test accuracy: \( \approx 0.715 \)  
- LSTM test accuracy: \( \approx 0.751 \)

- MLP test macro-F1 (from confusion matrix) 
- LSTM test macro-F1 (from confusion matrix)

So the LSTM has higher test accuracy and higher test macro-F1, which is stronger evidence of better generalization (macro-F1 is especially important for imbalanced classes).
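
For reference, accuracy and macro-F1 can both be derived directly from a confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes; the 3x3 matrix is a toy example, not one of the matrices from this report.

```python
def accuracy_from_confusion(cm):
    """Fraction of examples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total


def macro_f1_from_confusion(cm):
    """Unweighted mean of per-class F1; cm[i][j] = true class i, predicted j."""
    n = len(cm)
    f1_scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true k, predicted other
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, true other
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n


# Toy confusion matrix (illustrative values only).
toy_cm = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]

acc = accuracy_from_confusion(toy_cm)
macro_f1 = macro_f1_from_confusion(toy_cm)
print(acc, macro_f1)
```

Macro-F1 averages the per-class scores without weighting by class size, which is why it penalizes a model that neglects the smaller classes even when accuracy looks fine.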


![](outputs/mlp/mlp_confusion_matrix.png)  
![](outputs/lstm/lstm_confusion_matrix.png)




By raw count (most frequent overall): Neutral (neu) is misclassified the most in both models because it is the largest class and sits “between” negative and positive:

- MLP: neu misclassified 97 times  
- LSTM: neu misclassified 94 times

By error rate: Positive (pos) is misclassified the most:

- MLP pos error rate: \( \approx 41.7\% \)
- LSTM pos error rate: \( \approx 32.4\% \)
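
The difference between the two views (raw count vs. error rate) comes down to dividing by class support. A minimal sketch, assuming a confusion matrix with rows as true classes; the matrix values below are made up to show the effect and are not the report's matrices.

```python
def per_class_errors(cm, class_names):
    """Return {class: (misclassified count, error rate)} from a confusion matrix."""
    result = {}
    for k, name in enumerate(class_names):
        support = sum(cm[k])            # number of true examples of this class
        wrong = support - cm[k][k]      # everything off the diagonal in row k
        result[name] = (wrong, wrong / support if support else 0.0)
    return result


# Toy matrix, rows/columns ordered as neg, neu, pos (illustrative only).
toy_cm = [
    [35, 14, 1],
    [20, 160, 20],
    [4, 20, 36],
]
errors = per_class_errors(toy_cm, ["neg", "neu", "pos"])

most_by_count = max(errors, key=lambda c: errors[c][0])
most_by_rate = max(errors, key=lambda c: errors[c][1])
print(errors, most_by_count, most_by_rate)
```

In this toy case the largest class accumulates the most errors in absolute terms while a smaller class has the highest error rate, the same pattern described for neu and pos above.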



A likely reason: neutral is semantically close to both neg and pos, and many sentences carry only weak or implicit sentiment, so the model confuses “slightly positive/negative” with neutral. Financial text is also often phrased cautiously, with few emotional words. Small cues like “expected”, “may”, “forecast”, “pressure”, “improve” can flip sentiment but are easy to miss.




## 3) Cross-Model Comparison (MLP, RNN, LSTM, GRU, BERT, GPT)


The MLP uses mean-pooled FastText, which collapses each sentence into a single 300-d vector by averaging word vectors. This limits it in three ways. First, there is no word order: “profits fell despite strong guidance” and “strong guidance despite profits fell” contain the same words, so they average to the same vector. Second, there is no phrase structure: negation and modifiers like “not”, “barely”, “despite”, “however” get diluted in the average. Third, there is no token-level emphasis: one crucial sentiment word can be washed out by many neutral words.
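
The loss of word order under mean pooling is easy to verify. The sketch below uses hypothetical 3-d vectors in place of real 300-d FastText embeddings, and `mean_pool` is an illustrative helper rather than the pooling code from the assignment.

```python
def mean_pool(tokens, vectors):
    """Average the word vectors of `tokens` into one sentence vector."""
    dim = len(vectors[tokens[0]])
    total = [0.0] * dim
    for tok in tokens:
        for i, value in enumerate(vectors[tok]):
            total[i] += value
    return [x / len(tokens) for x in total]


# Hypothetical 3-d "embeddings"; real FastText vectors are 300-d.
toy_vectors = {
    "profits":  [1.0, 0.0, 0.5],
    "fell":     [-2.0, 1.0, 0.0],
    "despite":  [0.0, -1.0, 0.5],
    "strong":   [2.0, 0.5, 0.0],
    "guidance": [0.5, 0.0, 1.0],
}

sentence_a = "profits fell despite strong guidance".split()
sentence_b = "strong guidance despite profits fell".split()

pooled_a = mean_pool(sentence_a, toy_vectors)
pooled_b = mean_pool(sentence_b, toy_vectors)

# Different token order, same pooled representation.
same = all(abs(x - y) < 1e-12 for x, y in zip(pooled_a, pooled_b))
print(sentence_a != sentence_b, same)
```

Any classifier stacked on top of the pooled vector receives identical input for both sentences, so it cannot tell them apart regardless of how it is trained.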



The LSTM is a sequence model: it processes a fixed-length sequence in order, and the final hidden state summarizes the sentence.
Its advantages compared to the MLP are:

- Models word order.
- Captures patterns like negation, contrast, and clause-level shifts.
- Uses context: the meaning of a word can depend on surrounding words.


Did fine-tuned LLMs (BERT/GPT) outperform classical baselines?

Yes. BERT and GPT outperform the FastText+neural baselines because they start from pretrained contextual representations:
- Pretraining on massive corpora teaches general language patterns, syntax, semantics, and domain-relevant associations.
- Their embeddings are contextual: the representation of “rise”, “cut”, “beat”, “miss” changes based on surrounding words.
- Fine-tuning adapts those rich features to sentiment classification with relatively little labeled data.

**BERT**  
![](outputs/bert/bert_f1_learning_curves.png)  
![](outputs/bert/bert_confusion_matrix.png)

**GPT**  
![](outputs/gpt/gpt_f1_learning_curves.png)  
![](outputs/gpt/gpt_confusion_matrix.png)



Rank all six models

Using the test confusion matrices, the test macro-F1 ranking is:

1. **BERT** 
2. **GPT** 
3. **LSTM** 
4. **GRU**  
5. **MLP** 

Why this ranking makes sense:
- BERT/GPT: pretrained contextual features + transformer attention → strongest generalization and best handling of subtle sentiment cues.
- LSTM/GRU: sequence modeling captures order/negation/phrases; GRU and LSTM are similar, with small differences depending on hyperparameters and regularization.
- MLP: mean pooling removes order and weakens compositional meaning → struggles more on nuanced examples.
- Vanilla RNN: weaker at long-range dependencies (vanishing gradients) compared to gated models (LSTM/GRU), so it tends to confuse sentiment when important cues occur later in the sentence.



## AI Use Disclosure


- **Tool(s) used:** ChatGPT
- **How you used them:** I gave it the code I wrote and asked whether it was correct. I also asked it about the errors I got and for suggestions on how to fix them, and I used it to refine my answer to the open question.
- **What you verified yourself:** I checked the output to see whether it was reasonable.



