# Language Processing 1, Fall 2024: Assignment 3
Assignment 3 was assigned on November 25, 2024, and is due on **December 9, 23:59**.

Welcome to the third and last assignment for LP1. It has two parts:

In the first part, you will continue working on sentiment analysis. The goal is to test your ability to use a fine-tuned model for sentiment analysis and to see how it performs on your corpus. You will then compare it to the model we worked on in the last assignment. **You can get 60 points here.**

In the second part, you will work on text classification. To do that, you will build different text representations, and you will assess the performance for each of those. **You can get 50 points here.**

You can get a maximum of 110 points and you need **65 points** to pass the assignment.

# Part 1 (Sentiment analysis):

In this part, you will apply sentiment analysis to the corpus that we have been working with in the previous assignments. You will work on the English version only (level 0). The goal of this part is to check how different sentiment analysis models perform on sentences from novels. We compare a simple model with a fine-tuned `distilbert` model (`distilbert-base-uncased-finetuned-sst-2-english`), just like the one you saw in class (Lecture 11).


### Exercise 1.1 (15 points):

Please get the code for sentiment analysis from the second assignment and paste it here. If you did not implement this part in the second assignment, please use the Vader Sentiment Analysis package [1, 2].

Prepare a function that, given a sentence, writes the sentence and a "P" or "N" to a file, indicating whether the sentence is positive or negative.

[1] https://pypi.org/project/vaderSentiment/

[2] https://github.com/cjhutto/vaderSentiment

In [6]:
import os
import numpy as np
import random

In [8]:
# YOUR CODE HERE
def filetowordlist(path, sfx):
    """Return one list of whitespace-separated tokens per file in `path` whose name contains `sfx`."""
    words = []
    for item in sorted(os.listdir(path)):
        if sfx in item:
            # The context manager closes the file even if reading fails.
            with open(os.path.join(path, item), encoding="iso8859-1") as f:
                wordsinfile = []
                for line in f:
                    wordsinfile.extend(line.split())
            words.append(wordsinfile)
    return words

def log(number):
    return np.log(number)
    
posreviews_all = filetowordlist("mix20_rand700_tokens_0211/tokens/pos/", ".txt")
negreviews_all = filetowordlist("mix20_rand700_tokens_0211/tokens/neg/", ".txt")    

posreviews_train = posreviews_all[:550]
negreviews_train = negreviews_all[:550]

posreviews_test  = posreviews_all[550:]
negreviews_test  = negreviews_all[550:]

poswords_train=[word for sent in posreviews_train for word in sent]
negwords_train=[word for sent in negreviews_train for word in sent]

#Vocabularies for positive and negative reviews
pos_vocab_train = set(poswords_train)
neg_vocab_train = set(negwords_train)
vocab_train = pos_vocab_train.union(neg_vocab_train)

#Number of types (vocabulary size)
pos_vocab_size_train = len(pos_vocab_train)
neg_vocab_size_train = len(neg_vocab_train)
vocab_size_train = len(vocab_train)

#Number of words (tokens)
noposwords_train=len(poswords_train)
nonegwords_train=len(negwords_train)

#Number of reviews
noposreviews_train=len(posreviews_train)
nonegreviews_train=len(negreviews_train)

total_reviews_train = noposreviews_train + nonegreviews_train

# Log prior probability of each class
prior_probability_pos_train = np.log(noposreviews_train / total_reviews_train)
prior_probability_neg_train = np.log(nonegreviews_train / total_reviews_train)

from collections import Counter

# Word counts per class
pos_frequencies = Counter(poswords_train)
neg_frequencies = Counter(negwords_train)

pos_logprobs = {}
neg_logprobs = {}

# Add-one (Laplace) smoothed log-likelihood of every training word in each class
for word in vocab_train:
    pos_prob = (pos_frequencies[word] + 1) / (noposwords_train + vocab_size_train)
    pos_logprobs[word] = np.log(pos_prob)

    neg_prob = (neg_frequencies[word] + 1) / (nonegwords_train + vocab_size_train)
    neg_logprobs[word] = np.log(neg_prob)

# Words never seen in training get the smoothed probability of a zero count
pos_oov_word_logprob = np.log(1 / (noposwords_train + vocab_size_train))
neg_oov_word_logprob = np.log(1 / (nonegwords_train + vocab_size_train))


def positive_or_not(sentence):
    """Return 'P' if the Naive Bayes score of the positive class is higher, else 'N'."""
    log_prob_pos = prior_probability_pos_train
    log_prob_neg = prior_probability_neg_train

    for word in sentence.split():
        log_prob_pos += pos_logprobs.get(word, pos_oov_word_logprob)
        log_prob_neg += neg_logprobs.get(word, neg_oov_word_logprob)

    return 'P' if log_prob_pos > log_prob_neg else 'N'
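
The exercise also asks for a function that writes the sentence and its label to a file. The following is a minimal sketch of one way to do that, not part of the submitted code: `write_sentiment`, its `path` default, and `toy_classifier` are hypothetical names introduced for illustration, and the classifier is passed in as an argument so that `positive_or_not` (or a Vader-based function) could be used in its place.

```python
def toy_classifier(sentence):
    """Hypothetical stand-in classifier, used only to keep this sketch self-contained."""
    return "P" if "good" in sentence.lower() else "N"


def write_sentiment(sentence, classify, path="sentiment_results.txt"):
    """Append the sentence and its "P"/"N" label to `path` and return the label."""
    label = classify(sentence)
    # Strip the trailing newline so that each record stays on one line.
    with open(path, "a", encoding="utf-8") as out:
        out.write(f'"{sentence.strip()}", {label}\n')
    return label
```

With the Naive Bayes model above, the call would look like `write_sentiment(sentence, positive_or_not)`. The file is opened in append mode so that repeated calls accumulate records.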


### Exercise 1.2 (20 points):

Select 100 sentences of your choice from the English corpus that you obtained in the first assignment. For each sentence, apply both sentiment analysis models: Naive Bayes/Vader and the fine-tuned BERT model.
For the BERT model, you should also write a function that prints the results of the model and returns a "P" or "N" depending on whether the sentence is positive or negative.

The result of this exercise should be a list of 100 sentences, and two lists of "P" and "N", that state whether a sentence is positive or negative, according to two different Sentiment Analysis models.

Print the three lists on the screen and to a file, "sent_comp.txt", in the following format:
```
Sentence, ResNB/Vader, ResBERT 
"Lady...", P, P 
"He went...", N, P
```



In [10]:
from transformers import pipeline
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def truncate_sentence(sentence, max_length=512):
    tokens = tokenizer.encode(sentence, truncation=True, max_length=max_length, add_special_tokens=True)
    return tokenizer.decode(tokens, skip_special_tokens=True)

# device=0 runs the model on the first GPU; use device=-1 to run on the CPU.
bert_model = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device=0)

def bert_sentiment(sentence):
    truncated_sentence = truncate_sentence(sentence)
    result = bert_model(truncated_sentence)
    if result[0]['label']=='POSITIVE':
        return 'P'
    else:
        return 'N'

with open("en.txt", "r") as f:
    # Strip the trailing newline from each line and skip empty lines,
    # so that every record is printed and written on a single line.
    sentences = [line.strip() for line in f if line.strip()]

random_sentences = random.sample(sentences, 100)


def compare_sentiment_analysis(sentences):
    results = []

    for sentence in sentences:
        ResNB = positive_or_not(sentence)
        ResBERT = bert_sentiment(sentence)
        results.append((sentence,ResNB,ResBERT))

    for sentence,ResNB,ResBERT in results:
        print(f'"{sentence}", {ResNB}, {ResBERT}')

    with open("sent_comp.txt", "w") as file:
        for sentence, ResNB, ResBERT in results:
            file.write(f'"{sentence}", {ResNB}, {ResBERT}\n')

compare_sentiment_analysis(random_sentences)

""How?"", N, N
"Egremont smiled. “Your Convention is in its bloom, or rather its bud,” he said; “all is       fresh and pure now; but a little while and it will find the fate of all popular assemblies.       You will have factions.”", P, P
""And yet you want to see new lands?" he pursued. "What is it you want to see there?"", P, P
""Yes. Because, if women never see the serpent's face, neither do they ever scent the smell of     the paradise roses; and it will be hard for you to die without a single rose d'amour in your     pretty breast, poor little Bébée?"", N, P
""Then we must be quick." she answered. "Come."", N, P
"When she had thus brought him all she had, and he to please her had sat down to the simple     food, she gathered a spray of roses and set it in a pot beside him, then left him and went and     stood at a little distance, waiting, with her hands lightly crossed on her chest, to see if     there were anything that he might want.", N, N
"“Higher!” cried Birkin. “Ye

### Exercise 1.3 (5 points): 
Calculate how often the two models agree.

In [18]:
def compare_analysis(sentences):
    """Return the percentage of sentences on which the two models give the same label."""
    agree_count = 0
    total_count = len(sentences)

    for sentence in sentences:
        ResNB = positive_or_not(sentence)
        ResBERT = bert_sentiment(sentence)
        if ResNB == ResBERT:
            agree_count += 1

    agree_percentage = (agree_count / total_count) * 100
    return agree_percentage

agree_percentage = compare_analysis(random_sentences)
print(f"The two models agree {agree_percentage:.2f}% of the time.")

The two models agree 73.00% of the time.
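
The agreement rate can also be computed from stored label lists, which avoids running both models a second time. The sketch below is an illustration under that assumption; `agreement_stats` is a hypothetical helper, and the two short lists are the labels of the six complete records shown in the output of Exercise 1.2.

```python
from collections import Counter


def agreement_stats(labels_a, labels_b):
    """Return the agreement percentage and a count of each (label_a, label_b) pair."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = Counter(zip(labels_a, labels_b))
    agree = sum(n for (a, b), n in pairs.items() if a == b)
    return 100.0 * agree / len(labels_a), pairs


# Labels of the six complete example records printed in Exercise 1.2
nb_labels = ["N", "P", "P", "N", "N", "N"]
bert_labels = ["N", "P", "P", "P", "P", "N"]

percentage, pairs = agreement_stats(nb_labels, bert_labels)
print(f"{percentage:.2f}% agreement")  # 66.67% agreement
print(pairs[("N", "P")])               # 2 sentences where NB says N and BERT says P
```

The pair counts show the direction of the disagreements, which a single percentage hides.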


### Exercise 1.4 (20 points):

If the models agree, do they provide the right response? If not, which model is wrong? Briefly discuss the results from some examples from the selection (5 examples).

1. Sentence: "Egremont smiled. 'Your Convention is in its bloom, or rather its bud,' he said; 'all is fresh and pure now; but a little while and it will find the fate of all popular assemblies. You will have factions.'"

Outcome: Both models (NB and BERT): Positive (P)

Discussion: The sentence starts positively, with talk of new beginnings, but then turns cynical and predicts future problems. Both models tagged it as positive, probably because of "fresh and pure." They agree with each other but miss the shift in sentiment: the ending reads as neutral to negative, so neither label is fully right.

2. Sentence: "And yet you want to see new lands?" he pursued. "What is it you want to see there?"

Outcome: Both models: Positive (P)

Discussion: This sentence is quite neutral and carries no strong emotion; it is just a question. Both models marked it as positive, which is wrong, since nothing in it is happy or sad. Because both models can only output P or N, a neutral sentence like this one cannot be labelled correctly by either of them.

3. Sentence: "Yes. Because, if women never see the serpent's face, neither do they ever scent the smell of the paradise roses; and it will be hard for you to die without a single rose d'amour in your pretty breast, poor little Bébée?"

Outcome: NB: Negative (N) BERT: Positive (P)

Discussion: This sentence has mixed emotions. The "serpent's face" part feels dark and warning, but the tone then becomes more romantic and affectionate with "pretty breast" and "poor little Bébée." I think Naive Bayes is closer to the truth in this case, as the darker part makes the sentence feel slightly negative overall.

4. Sentence: "Then we must be quick." she answered. "Come."

Outcome: NB: Negative (N) BERT: Positive (P)

Discussion: This sentence is neutral, with a sense of urgency. It is not really positive or negative, just someone telling another person to hurry. I think BERT is closer here: the action-oriented tone fits a positive label better than a negative one.

5. Sentence: "When she had thus brought him all she had, and he to please her had sat down to the simple food, she gathered a spray of roses and set it in a pot beside him, then left him and went and stood at a little distance, waiting, with her hands lightly crossed on her chest, to see if there were anything that he might want."

Outcome: Both models: Negative (N)

Discussion: This sentence is subtle: it is not clearly sad, but there is a quiet sense of resignation. Both models tagged it as negative, which is defensible given the subdued tone.

Conclusion:
When the models agree, they usually get it right for simple sentences, but they can miss the mark with more complex ones.
When they disagree, Naive Bayes sometimes picks up on the subtler, darker emotions, while BERT tends to see things more positively or romantically when the sentiment is mixed.
Overall, both models are useful, but they have their own weaknesses. Sometimes they both get it right, but other times they miss the deeper meaning, especially in more complicated sentences.

---

# Part 2 (Text classification)

In the second part of the assignment, you will be working on text classification. We will use the same data set that we used for the TF-IDF example in Lecture 10, provided by scikit-learn and loaded with `fetch_20newsgroups`.

The goal of this assignment is to learn how to perform basic text classification using scikit-learn, but also to see whether named entities can positively contribute to a better classification performance.

In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

cats = ['alt.atheism', 'sci.space', 'comp.graphics']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test', categories=cats)

len(newsgroups_test['data']),len(newsgroups_train['data'])

(1102, 1657)

In [4]:
# This whole cell is just to get a smaller subset of the corpus.

whole_train_instances = newsgroups_train['data']
whole_y_train = [newsgroups_train['target_names'][el] for el in newsgroups_train['target']]

train_instances, _, y_train, _ = train_test_split(whole_train_instances, whole_y_train, train_size=0.15)

whole_test_instances = newsgroups_test['data']
whole_y_test = [newsgroups_test['target_names'][el] for el in newsgroups_test['target']]

test_instances, _, y_test, _ = train_test_split(whole_test_instances, whole_y_test, train_size=0.15)

len(train_instances), len(y_train), len(test_instances), len(y_test)

(248, 248, 165, 165)

In [7]:
# The goal of this cell is just to get a smaller subset of the corpus,
# as the third exercise of this part can be quite computationally expensive.
# If it is too expensive for your computer, feel free to use Google Colab or to contact Manex.

from sklearn.model_selection import train_test_split

whole_train_instances = newsgroups_train['data']
whole_y_train = [newsgroups_train['target_names'][el] for el in newsgroups_train['target']]

train_instances, _, y_train, _ = train_test_split(whole_train_instances, whole_y_train, train_size=0.25)

whole_test_instances = newsgroups_test['data']
whole_y_test = [newsgroups_test['target_names'][el] for el in newsgroups_test['target']]

test_instances, _, y_test, _ = train_test_split(whole_test_instances, whole_y_test, train_size=0.15)

len(train_instances), len(y_train), len(test_instances), len(y_test)

(414, 414, 165, 165)

### Exercise 2.1 (15 points):

In this first exercise, you need to use a Bag of Words representation for the articles from `fetch_20newsgroups`. Vectorize the news articles (please use `CountVectorizer` with its default parameters) and train a Logistic Regression model on the training data.

After that, apply the trained model on the test data and check the performance by printing the classification report.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_instances)
X_test = vectorizer.transform(test_instances)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred, target_names=newsgroups_test['target_names']))

               precision    recall  f1-score   support

  alt.atheism       0.93      0.86      0.89        49
comp.graphics       0.83      0.92      0.87        60
    sci.space       0.87      0.84      0.85        56

     accuracy                           0.87       165
    macro avg       0.88      0.87      0.87       165
 weighted avg       0.88      0.87      0.87       165
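
To make the columns of the report concrete, the sketch below shows how precision, recall, F1, and support are derived per class from the true and predicted labels. It is a pure-Python illustration with made-up labels, not a replacement for `classification_report`; `per_class_scores` is a hypothetical helper name.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted
        fp = sum(t != label and p == label for t, p in pairs)  # predicted but wrong
        fn = sum(t == label and p != label for t, p in pairs)  # missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1, tp + fn)
    return scores


# Made-up labels, for illustration only
toy_true = ["a", "a", "b", "b", "b"]
toy_pred = ["a", "b", "b", "b", "a"]
for label, (p, r, f1, support) in per_class_scores(toy_true, toy_pred).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
```

Support is the number of true instances of the class, which is why it depends only on the test split and not on the model.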



### Exercise 2.2 (15 points):

You should do the same as in Exercise 2.1, but instead of using a Bag of Words, you should normalize these frequencies with the inverse document frequency. Please use a relevant package from `sklearn`. You should print the classification report, as above. Are the results better? Why does this happen? Think about it.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_instances)
X_test = vectorizer.transform(test_instances)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred, target_names=newsgroups_test['target_names']))

               precision    recall  f1-score   support

  alt.atheism       0.98      0.84      0.90        49
comp.graphics       0.89      0.93      0.91        60
    sci.space       0.85      0.91      0.88        56

     accuracy                           0.90       165
    macro avg       0.91      0.89      0.90       165
 weighted avg       0.90      0.90      0.90       165



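Yes, on this split the results are slightly better: accuracy rises from 0.87 with Bag of Words to 0.90 with TF-IDF, and macro-averaged F1 rises from 0.87 to 0.90. TF-IDF often performs better than raw counts because it down-weights words that occur in most documents and gives more weight to rarer, category-defining terms, so the model focuses on more relevant features. The train and test subsets are drawn randomly, so a gap of this size should be confirmed over several splits.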
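
The down-weighting can be seen in a small worked example. The sketch below uses the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which is what `TfidfVectorizer` applies with its default settings. The three toy documents and the whitespace tokenization are assumptions made for illustration; the real vectorizer tokenizes differently.

```python
import math
from collections import Counter

# Toy documents, invented for illustration only
docs = [
    "the shuttle reached orbit",
    "the image uses the palette",
    "the argument about the belief",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
df = Counter(term for tokens in tokenized for term in set(tokens))


def smooth_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf_vector(tokens):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(tokens)
    weights = {t: c * smooth_idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


print(round(smooth_idf("the"), 3))    # 1.0: occurs in every document
print(round(smooth_idf("orbit"), 3))  # 1.693: occurs in one document only
print(tfidf_vector(tokenized[0]))
```

In the first document every word occurs once, yet "orbit" ends up with a larger weight than "the", because "the" appears in all three documents.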


### Exercise 2.3 (20 points):

We would now like to know whether the use of named entities can help in text classification. Currently, if a multiword named entity is present in the corpus, the bag of words treats it as separate words. We would like the model to also recognize the named entity as a whole.

In order to test this, we would like you to extract named entities, and somehow include those multiword named entities, as extra words. For instance, assuming that we have a 6 word dictionary,

```
dict = {"i": 0,
        "house": 1,
        "like": 2,
        "new": 3,
        "plant": 4,
        "york": 5}
```

we would return this as the representation of this sentence: `I like New York`

`bag_of_words = [1, 0, 1, 1, 0, 1]`

Now, we would like to have an extra word for named entities in the dictionary, which would represent the word `New York`, such as:

```
new_dict = {"i": 0,
            "house": 1,
            "like": 2,
            "new": 3,
            "plant": 4,
            "york": 5,
            "new york": 6}
```

`new_bag_of_words = [1, 0, 1, 1, 0, 1, 1]`

A trick can be to use an underscore to join multiword entities, such as `new_york`.
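
The underscore trick can be sketched without any NER library. In the illustration below, the entity list is supplied by hand as a stand-in for the output of a named entity recognizer, the dictionary is the one from the example with the last key written as `new_york`, and `add_entity_tokens` and `bag_of_words` are hypothetical helper names.

```python
new_dict = {"i": 0, "house": 1, "like": 2, "new": 3,
            "plant": 4, "york": 5, "new_york": 6}


def add_entity_tokens(text, entities):
    """Append each multiword entity to the text as one underscore-joined token."""
    extra = ["_".join(entity.split()) for entity in entities if len(entity.split()) > 1]
    if not extra:
        return text
    return text + " " + " ".join(extra)


def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the lowercased text."""
    vector = [0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            vector[vocab[token]] += 1
    return vector


# "New York" is given by hand here; in the exercise it would come from the NER step.
augmented = add_entity_tokens("I like New York", ["New York"])
print(augmented)                          # I like New York New_York
print(bag_of_words(augmented, new_dict))  # [1, 0, 1, 1, 0, 1, 1]
```

The same idea carries over to `CountVectorizer`: its default token pattern treats the underscore as a word character, so `new_york` stays a single feature. Note that the default pattern also drops one-character tokens such as `i`, so the vector it produces would differ from this toy example.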

---

Train a Logistic Regression model using `CountVectorizer`, where we also include named entities. Do the results improve? Check them with the classification report, just like in the previous two exercises.

In [10]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [41]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

nlp = spacy.load("en_core_web_sm")

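def extract_named_entities(text):
    """Return the text with its named entities appended as underscore-joined tokens."""
    doc = nlp(text)
    # Every entity found by spaCy is appended, including single-word ones.
    entities = [ent.text.replace(" ", "_") for ent in doc.ents]
    return text + " " + " ".join(entities)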

cats = ['alt.atheism', 'sci.space', 'comp.graphics']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test', categories=cats)

train_instances, _, y_train, _ = train_test_split(newsgroups_train['data'], 
                                                  newsgroups_train['target'], 
                                                  train_size=0.25)

test_instances, _, y_test, _ = train_test_split(newsgroups_test['data'], 
                                                newsgroups_test['target'], 
                                                train_size=0.15)

train_instances_with_ner = [extract_named_entities(text) for text in train_instances]
test_instances_with_ner = [extract_named_entities(text) for text in test_instances]

vectorizer = CountVectorizer()

X_train = vectorizer.fit_transform(train_instances_with_ner)
X_test = vectorizer.transform(test_instances_with_ner)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred, target_names=newsgroups_test['target_names']))


               precision    recall  f1-score   support

  alt.atheism       0.93      0.84      0.88        49
comp.graphics       0.79      0.92      0.85        53
    sci.space       0.92      0.86      0.89        63

     accuracy                           0.87       165
    macro avg       0.88      0.87      0.87       165
 weighted avg       0.88      0.87      0.87       165



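The addition of named entities did not improve the model's performance: accuracy stays at 0.87, the same as for the plain Bag of Words model in Exercise 2.1.

Comparison of per-class precision with Exercise 2.1, which uses the same `CountVectorizer` representation: alt.atheism is unchanged at 0.93, comp.graphics drops from 0.83 to 0.79, and sci.space rises from 0.87 to 0.92. Compared with the TF-IDF model from Exercise 2.2, precision drops from 0.98 to 0.93 for alt.atheism and from 0.89 to 0.79 for comp.graphics, and rises from 0.85 to 0.92 for sci.space. The data was re-split for this run (the class supports differ from the earlier reports), so small differences may simply reflect the different sample.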

Discussion: why might this happen?

1. Irrelevant named entities: the NER might have captured entities that were not helpful for the classification.
2. Feature overload: adding too many features can introduce noise instead of improving the model.
3. Model limitation: Logistic Regression might not be the best choice to take full advantage of these additional features.