
# Assignment 1
You should submit the **UniversityNumber.ipynb** file and your final prediction file **UniversityNumber.test.txt** to Moodle. Make sure your code does not use your local files and that the results are reproducible. Before submitting, please **1. clean all outputs and 2. run all cells in your notebook and keep all running logs** so that we can check.

# 1 $n$-gram Language Model

In [1]:
!mkdir -p data/lm
!wget -O data/lm/train.txt https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/lm/train.txt
!wget -O data/lm/dev.txt https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/lm/dev.txt
!wget -O data/lm/test.txt https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/lm/test.txt

--2024-02-19 16:39:30--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/lm/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8385238 (8.0M) [text/plain]
Saving to: ‘data/lm/train.txt’


2024-02-19 16:39:31 (101 MB/s) - ‘data/lm/train.txt’ saved [8385238/8385238]

--2024-02-19 16:39:31--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/lm/dev.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1680641 (1.6M) [text/plain]
Saving to: ‘data/lm/dev.txt

## 1.1 Building vocabulary

You will download and preprocess the tokenized training data to build the vocabulary. To handle out-of-vocabulary (OOV) words, you will convert tokens that occur fewer than three times in the training data into a special unknown token 〈UNK〉. You should also add start-of-sentence 〈START〉 and end-of-sentence 〈END〉 tokens. If you did this correctly, your language model’s vocabulary, including the 〈START〉, 〈END〉 and 〈UNK〉 tokens, should have 26,601 words.

Please show the vocabulary size and discuss the number of parameters of n-gram models.

### Code

In [2]:
from collections import defaultdict
import math

def load_data(file_path):
  with open(file_path, 'r') as file:
    data = file.readlines()
  return data

train_data = load_data('data/lm/train.txt')
dev_data = load_data('data/lm/dev.txt')
test_data = load_data('data/lm/test.txt')

def preprocess_data(train_data):
  count = defaultdict(int)
  bigram_count = defaultdict(int)
  trigram_count = defaultdict(int)
  vocab = set(['<START>', '<END>', '<UNK>'])

  # first pass: count unigrams to determine vocab
  for line in train_data:
    tokens = ['<START>'] + line.strip().split() + ['<END>']
    for token in tokens:
      count[token] += 1

  # determine vocab
  unk_count = 0
  for token, freq in list(count.items()):
    if freq < 3 and token not in ['<START>', '<END>', '<UNK>']:
      unk_count += freq
      del count[token]

  # add UNK count to vocab
  count['<UNK>'] = unk_count
  vocab.update(count.keys())

  # count bigrams and trigrams
  for line in train_data:
    tokens = ['<START>'] + line.strip().split() + ['<END>']
    tokens = [token if token in vocab else '<UNK>' for token in tokens]

    for i in range(len(tokens)):
      if i < len(tokens) - 1:
        bigram = (tokens[i], tokens[i+1])
        bigram_count[bigram] += 1
      if i < len(tokens) - 2:
        trigram = (tokens[i], tokens[i+1], tokens[i+2])
        trigram_count[trigram] += 1
  return vocab, count, bigram_count, trigram_count

vocab, unigram_count, bigram_count, trigram_count = preprocess_data(train_data)

print("Vocabulary size:", len(vocab))

Vocabulary size: 26601


### Discussion

Preprocessing converted every token that occurs fewer than three times in the training data into the special unknown token 〈UNK〉 and wrapped each sentence in start-of-sentence 〈START〉 and end-of-sentence 〈END〉 tokens. The resulting vocabulary, including these three special tokens, has 26,601 words.

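The number of parameters of an n-gram model depends on the vocabulary size and on n. A model that stores one probability for every possible n-gram has

$|V|^n$

parameters, where $|V|$ is the vocabulary size and $n$ is the order of the n-gram model.

With a vocabulary of 26,601 words, this gives 26,601 parameters for a unigram model, about $7.1 \times 10^8$ for a bigram model, and about $1.9 \times 10^{13}$ for a trigram model. In practice most of these n-grams never occur in the training data, so only the observed counts are stored.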
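As a quick sanity check on these magnitudes, the sketch below evaluates $|V|^n$ for the vocabulary size reported above. The helper name `ngram_table_size` is an illustrative assumption, and the result is an upper bound: it counts every possible n-gram, whereas the count dictionaries in this notebook only hold n-grams that were actually observed.

```python
def ngram_table_size(vocab_size: int, n: int) -> int:
    """Upper bound on the number of n-gram probabilities: |V|^n."""
    return vocab_size ** n

VOCAB_SIZE = 26601  # vocabulary size reported in section 1.1

# Table sizes for unigram, bigram, and trigram models
sizes = {n: ngram_table_size(VOCAB_SIZE, n) for n in (1, 2, 3)}
for n, size in sizes.items():
    print(f"n={n}: {size:,} possible n-grams")
```

Each extra word of context multiplies the table size by $|V|$, which is why higher-order models become sparse so quickly.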

## 1.2 $n$-gram Language Modeling

After preparing your vocabulary, you are expected to build bigram and unigram language models and report their perplexity on the training set and the dev set. Please discuss your experimental results. If you encounter any problems, please analyze them and explain why.

### Code

In [3]:
def calculate_unigram_probabilities(unigram_count):
  total_words = sum(unigram_count.values())
  unigram_probabilities = {word: count / total_words for word, count in unigram_count.items()}
  return unigram_probabilities

def calculate_bigram_probabilities(bigram_count, unigram_count):
  # MLE estimate: P(word2 | word1) = C(word1, word2) / C(word1)
  bigram_probabilities = {}
  for (word1, word2), count in bigram_count.items():
    bigram_probabilities[(word1, word2)] = count / unigram_count[word1]
  return bigram_probabilities

def calculate_trigram_probabilities(trigram_count, bigram_count):
  # MLE estimate: P(word3 | word1, word2) = C(word1, word2, word3) / C(word1, word2)
  trigram_probabilities = {}
  for (word1, word2, word3), count in trigram_count.items():
    trigram_probabilities[(word1, word2, word3)] = count / bigram_count[(word1, word2)]
  return trigram_probabilities

# calculate probabilities
unigram_probabilities = calculate_unigram_probabilities(unigram_count)
bigram_probabilities = calculate_bigram_probabilities(bigram_count, unigram_count)
trigram_probabilities = calculate_trigram_probabilities(trigram_count, bigram_count)

In [4]:
def calculate_unigram_perplexity(dataset, unigram_probabilities):
  log_sum = 0
  N = 0  # total number of words
  for line in dataset:
    tokens = ['<START>'] + line.strip().split() + ['<END>']
    for token in tokens:
      prob = unigram_probabilities.get(token, unigram_probabilities.get('<UNK>', 0))
      log_sum += math.log2(prob) if prob > 0 else 0
      N += 1
  return 2 ** (-log_sum / N)

def calculate_bigram_perplexity(dataset, bigram_probabilities, unigram_probabilities):
  log_sum = 0
  N = 0  # total number of bigrams
  for line in dataset:
    tokens = ['<START>'] + line.strip().split() + ['<END>']
    for i in range(len(tokens) - 1):
      bigram = (tokens[i], tokens[i+1])
      prob = bigram_probabilities.get(bigram, unigram_probabilities.get(bigram[1], unigram_probabilities.get('<UNK>', 0)))
      log_sum += math.log2(prob) if prob > 0 else 0
      N += 1
  return 2 ** (-log_sum / N)

def calculate_trigram_perplexity(dataset, trigram_probabilities, bigram_probabilities):
  log_sum = 0
  N = 0  # total number of trigrams
  for line in dataset:
    tokens = ['<START>'] + line.strip().split() + ['<END>', '<END>']
    for i in range(len(tokens) - 2):
      trigram = (tokens[i], tokens[i+1], tokens[i+2])
      prob = trigram_probabilities.get(trigram, bigram_probabilities.get((tokens[i], tokens[i+1]), bigram_probabilities.get((tokens[i+1], tokens[i+2]), 0)))
      log_sum += math.log2(prob) if prob > 0 else 0
      N += 1
  return 2 ** (-log_sum / N)

unigram_perplexity = calculate_unigram_perplexity(train_data, unigram_probabilities)
bigram_perplexity = calculate_bigram_perplexity(train_data, bigram_probabilities, unigram_probabilities)
# trigram_perplexity = calculate_trigram_perplexity(train_data, trigram_probabilities, bigram_probabilities)

print(f"Unigram Perplexity on Training Dataset: {unigram_perplexity}")
print(f"Bigram Perplexity on Training Dataset: {bigram_perplexity}")
# print(f"Trigram Perplexity on Training Dataset: {trigram_perplexity}")

unigram_perplexity = calculate_unigram_perplexity(dev_data, unigram_probabilities)
bigram_perplexity = calculate_bigram_perplexity(dev_data, bigram_probabilities, unigram_probabilities)
# trigram_perplexity = calculate_trigram_perplexity(dev_data, trigram_probabilities, bigram_probabilities)

print(f"Unigram Perplexity on Dev Dataset: {unigram_perplexity}")
print(f"Bigram Perplexity on Dev Dataset: {bigram_perplexity}")
# print(f"Trigram Perplexity on Dev Dataset: {trigram_perplexity}")

Unigram Perplexity on Training Dataset: 888.8280671470116
Bigram Perplexity on Training Dataset: 80.40609215648743
Unigram Perplexity on Dev Dataset: 815.9305409683114
Bigram Perplexity on Dev Dataset: 193.16625998212717


### Discussion

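Initially, the models assigned zero probability to words and bigrams that never occurred in the training data, which leads to infinite perplexity. To work around this, a word that is missing from the vocabulary falls back to the probability of 〈UNK〉, which is nonzero in this dataset, and an unseen bigram falls back to the unigram probability of its second word. With these changes the perplexities become much more reasonable: 888.83 (train) and 815.93 (dev) for the unigram model, and 80.41 (train) and 193.17 (dev) for the bigram model. The bigram model is far better than the unigram model on both sets, but its dev perplexity is more than twice its training perplexity. This is consistent with the model being estimated on the training set, so dev bigrams that were never seen in training receive only the fallback probability.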
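One caveat on the cells above: the expression `math.log2(prob) if prob > 0 else 0` contributes 0 to the log sum for a zero-probability event, which is equivalent to treating it as probability 1. A more conventional alternative, sketched below on toy data, is to map every out-of-vocabulary token to `<UNK>` before any lookup and to report infinite perplexity when an unsmoothed model meets an unseen bigram. The function names and the toy corpus are illustrative assumptions, and this sketch has not been run on the assignment data.

```python
import math
from collections import Counter

def map_to_vocab(tokens, vocab):
    """Replace every out-of-vocabulary token with <UNK> before any lookup."""
    return [tok if tok in vocab else '<UNK>' for tok in tokens]

def bigram_perplexity(sentences, bigram_counts, context_counts, vocab):
    """Perplexity of an unsmoothed (MLE) bigram model; inf if a bigram is unseen."""
    log_sum, n = 0.0, 0
    for sentence in sentences:
        tokens = map_to_vocab(['<START>'] + sentence.split() + ['<END>'], vocab)
        for prev, word in zip(tokens, tokens[1:]):
            count = bigram_counts.get((prev, word), 0)
            if count == 0:
                return float('inf')  # an unseen bigram has zero MLE probability
            log_sum += math.log2(count / context_counts[prev])
            n += 1
    return 2 ** (-log_sum / n)

# Toy data (hypothetical, for illustration only); 'c' is out of vocabulary
toy_vocab = {'<START>', '<END>', '<UNK>', 'a', 'b'}
toy_train = ['a b', 'a a', 'b c']

bigram_counts, context_counts = Counter(), Counter()
for sentence in toy_train:
    tokens = map_to_vocab(['<START>'] + sentence.split() + ['<END>'], toy_vocab)
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

seen_ppl = bigram_perplexity(['a b'], bigram_counts, context_counts, toy_vocab)
unseen_ppl = bigram_perplexity(['b a'], bigram_counts, context_counts, toy_vocab)
oov_ppl = bigram_perplexity(['b zzz'], bigram_counts, context_counts, toy_vocab)
print(seen_ppl, unseen_ppl, oov_ppl)
```

Returning infinity makes the zero-probability problem visible instead of hiding it, which is the motivation for the smoothing methods in section 1.3.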

## 1.3 Smoothing

### 1.3.1 Add-one (Laplace) smoothing

#### Code

In [5]:
def calculate_bigram_probabilities_with_smoothing(bigram_count, unigram_count, k, vocab_size):
  bigram_probabilities = {}
  for (word1, word2), count in bigram_count.items():
    adjusted_count = count + k
    total_count_for_word1 = unigram_count[word1] + k * vocab_size
    bigram_probabilities[(word1, word2)] = adjusted_count / total_count_for_word1
  return bigram_probabilities

k = 1 # k=1 in add-k smoothing is the same as add-1/laplace smoothing
bigram_probabilities_with_smoothing = calculate_bigram_probabilities_with_smoothing(bigram_count, unigram_count, k, len(vocab))

def calculate_bigram_perplexity_with_smoothing(dataset, bigram_probabilities, unigram_probabilities, k, vocab_size):
  log_sum = 0
  N = 0
  for line in dataset:
    tokens = ['<START>'] + line.strip().split() + ['<END>']
    for i in range(len(tokens) - 1):
      bigram = (tokens[i], tokens[i+1])
      prob = bigram_probabilities.get(bigram, (k / (unigram_probabilities.get(bigram[0], 0) + k * vocab_size)))
      log_sum += math.log2(prob) if prob > 0 else 0
      N += 1
  return 2 ** (-log_sum / N)

# calculating perplexity with smoothing
bigram_perplexity_with_smoothing_train = calculate_bigram_perplexity_with_smoothing(train_data, bigram_probabilities_with_smoothing, unigram_probabilities, k, len(vocab))
bigram_perplexity_with_smoothing_dev = calculate_bigram_perplexity_with_smoothing(dev_data, bigram_probabilities_with_smoothing, unigram_probabilities, k, len(vocab))

print(f"Bigram w/add-one smoothing Perplexity on Training Dataset: {bigram_perplexity_with_smoothing_train}")
print(f"Bigram w/add-one smoothing Perplexity on Dev Dataset: {bigram_perplexity_with_smoothing_dev}")

Bigram w/add-one smoothing Perplexity on Training Dataset: 2018.1786384515503
Bigram w/add-one smoothing Perplexity on Dev Dataset: 2463.4992695068327


#### Discussion

The main difference between the bigram model in the previous section and the add-one model here is how zero probabilities are handled. The previous model falls back to the probability of 〈UNK〉, whereas add-one smoothing adds one to every bigram count, so every bigram, seen or unseen, receives a nonzero probability. The cost is a much higher perplexity: 2018.18 on the training set and 2463.50 on the dev set, compared with 80.41 and 193.17 before. With a vocabulary of 26,601 words, adding one to every count moves a large share of the probability mass from observed bigrams to unseen ones. Even so, a finite perplexity is preferable to the infinite perplexity caused by a zero probability.
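For reference, the add-k estimate is $P(w_2 \mid w_1) = \frac{C(w_1, w_2) + k}{C(w_1) + k|V|}$, where both $C(\cdot)$ terms are counts. The perplexity function above passes `unigram_probabilities` into the fallback for unseen bigrams, where this formula calls for counts, so the reported numbers may shift if counts are used throughout. The sketch below uses counts in both places and checks on toy data that the smoothed distribution sums to 1; the function name and the toy counts are assumptions for illustration.

```python
from collections import Counter

def add_k_probability(prev, word, bigram_counts, context_counts, k, vocab_size):
    """P(word | prev) = (C(prev, word) + k) / (C(prev) + k * |V|)."""
    numerator = bigram_counts.get((prev, word), 0) + k
    denominator = context_counts.get(prev, 0) + k * vocab_size
    return numerator / denominator

# Toy counts (hypothetical): 'a' was followed by 'b' three times and by <END> once
toy_vocab = ['a', 'b', '<END>']
bigram_counts = Counter({('a', 'b'): 3, ('a', '<END>'): 1})
context_counts = Counter({'a': 4})

# Smoothed distribution over the next word after 'a' with k = 1 (add-one)
probs = {
    word: add_k_probability('a', word, bigram_counts, context_counts,
                            k=1, vocab_size=len(toy_vocab))
    for word in toy_vocab
}
total = sum(probs.values())

# A context that never occurred gets a uniform distribution: k / (k * |V|)
unseen_context = add_k_probability('b', 'a', bigram_counts, context_counts,
                                   k=1, vocab_size=len(toy_vocab))
print(probs, total, unseen_context)
```

Because the same formula covers seen and unseen bigrams, no separate fallback branch is needed.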

### 1.3.2 Add-k smoothing

#### Code

In [6]:
k = [0.5, 0.05, 0.01]

for x in k:
  bigram_probabilities_with_smoothing = calculate_bigram_probabilities_with_smoothing(bigram_count, unigram_count, x, len(vocab))

  # calculating perplexity with smoothing
  bigram_perplexity_with_smoothing_train = calculate_bigram_perplexity_with_smoothing(train_data, bigram_probabilities_with_smoothing, unigram_probabilities, x, len(vocab))
  bigram_perplexity_with_smoothing_dev = calculate_bigram_perplexity_with_smoothing(dev_data, bigram_probabilities_with_smoothing, unigram_probabilities, x, len(vocab))

  print(f"Bigram w/add-{x} smoothing Perplexity on Training Dataset: {bigram_perplexity_with_smoothing_train}")
  print(f"Bigram w/add-{x} smoothing Perplexity on Dev Dataset: {bigram_perplexity_with_smoothing_dev}\n")

Bigram w/add-0.5 smoothing Perplexity on Training Dataset: 1385.778829603733
Bigram w/add-0.5 smoothing Perplexity on Dev Dataset: 1846.816592358543

Bigram w/add-0.05 smoothing Perplexity on Training Dataset: 436.9516894347024
Bigram w/add-0.05 smoothing Perplexity on Dev Dataset: 829.8408326274347

Bigram w/add-0.01 smoothing Perplexity on Training Dataset: 239.76672030538336
Bigram w/add-0.01 smoothing Perplexity on Dev Dataset: 571.888391134207



#### Discussion

This section reuses the code from the previous one: add-one smoothing is the special case of add-k smoothing with k = 1, so the same functions apply with a different k. Here k is set to smaller values, namely 0.5, 0.05, and 0.01. Perplexity falls steadily as k decreases, and of the values tested, k = 0.01 performs best, with a perplexity of 239.77 on the training set and 571.89 on the dev set.

### 1.3.3 Linear Interpolation

#### Code

In [7]:
import numpy as np

def calculate_interpolated_probability(word, previous, unigram_prob, bigram_prob, trigram_prob, lambdas):
  lambda1, lambda2, lambda3 = lambdas
  unigram_p = unigram_prob.get(word, unigram_prob.get('<UNK>', 0))
  bigram_p = bigram_prob.get((previous[-1], word), bigram_prob.get(('<UNK>', word), unigram_prob.get(word, unigram_prob.get('<UNK>', 0))))
  trigram_p = trigram_prob.get((previous[-2], previous[-1], word), trigram_prob.get((previous[-1], word), unigram_prob.get(word, unigram_prob.get('<UNK>', 0))))

  return lambda3 * trigram_p + lambda2 * bigram_p + lambda1 * unigram_p

def calculate_interpolated_perplexity(dataset, unigram_prob, bigram_prob, trigram_prob, lambdas):
  log_sum = 0
  N = 0
  for line in dataset:
    tokens = ['<START>'] + line.strip().split() + ['<END>']
    for i in range(2, len(tokens)):
      prob = calculate_interpolated_probability(tokens[i], tokens[max(0, i-2):i], unigram_prob, bigram_prob, trigram_prob, lambdas)
      log_sum += math.log2(prob)
      N += 1
  return 2 ** (-log_sum / N)

lambdas = (0.2, 0.3, 0.5)  # given λ values
train_perplexity = calculate_interpolated_perplexity(train_data, unigram_probabilities, bigram_probabilities, trigram_probabilities, lambdas)
dev_perplexity = calculate_interpolated_perplexity(dev_data, unigram_probabilities, bigram_probabilities, trigram_probabilities, lambdas)

print(f"Hyperparameters used: {lambdas}")
print(f"Train Perplexity: {train_perplexity}")
print(f"Dev Perplexity: {dev_perplexity}\n")

best_lambdas = (0.0, 0.7, 0.3)
best_perplexity = float('inf')

# grid search for hyperparameters
# for lambda1 in np.arange(0.1, 1, 0.1):
#   for lambda2 in np.arange(0, 1 - lambda1, 0.1):
#     lambda3 = 1 - lambda1 - lambda2
#     current_perplexity = calculate_interpolated_perplexity(dev_data, unigram_probabilities, bigram_probabilities, trigram_probabilities, (lambda1, lambda2, lambda3))
#     if current_perplexity < best_perplexity:
#       best_perplexity = current_perplexity
#       best_lambdas = (lambda1, lambda2, lambda3)

train_perplexity = calculate_interpolated_perplexity(train_data, unigram_probabilities, bigram_probabilities, trigram_probabilities, best_lambdas)
dev_perplexity = calculate_interpolated_perplexity(dev_data, unigram_probabilities, bigram_probabilities, trigram_probabilities, best_lambdas)
test_perplexity = calculate_interpolated_perplexity(test_data, unigram_probabilities, bigram_probabilities, trigram_probabilities, best_lambdas)
print(f"Hyperparameters used: {best_lambdas}")
print(f"Train Perplexity: {train_perplexity}")
print(f"Dev Perplexity: {dev_perplexity}")
print(f"Test Perplexity: {test_perplexity}")

Hyperparameters used: (0.2, 0.3, 0.5)
Train Perplexity: 13.952334576753357
Dev Perplexity: 203.33597875162883

Hyperparameters used: (0.0, 0.7, 0.3)
Train Perplexity: 16.830567503808854
Dev Perplexity: 169.90879634178935
Test Perplexity: 168.70895077575767


#### Discussion

Assuming a weight of 0 is allowed, the best hyperparameters I found on the dev set were λ1 = 0, λ2 = 0.7, and λ3 = 0.3 for the unigram, bigram, and trigram models respectively. With these weights, the perplexity was 169.91 on the dev set and 168.71 on the test set, compared with a dev perplexity of 203.34 for the given weights (0.2, 0.3, 0.5). The main takeaway is that the unigram probabilities contribute little here, since the best setting gives them zero weight. The bigram probabilities carry the most weight, followed by the trigram probabilities.
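The commented-out grid search in the cell above starts λ1 at 0.1, so it would never visit a setting such as (0.0, 0.7, 0.3). A grid that covers the whole simplex, zero weights included, can be sketched as follows. The per-token probabilities in `dev_probs` are made up for illustration, and `interpolated_perplexity` and `simplex_grid` are hypothetical helpers, not the functions used in the notebook.

```python
import math
from itertools import product

def interpolated_perplexity(component_probs, lambdas):
    """component_probs holds one (p_unigram, p_bigram, p_trigram) tuple per token."""
    log_sum = sum(
        math.log2(sum(weight * p for weight, p in zip(lambdas, probs)))
        for probs in component_probs
    )
    return 2 ** (-log_sum / len(component_probs))

def simplex_grid(steps=10):
    """Yield every (l1, l2, l3) on a 1/steps grid with l_i >= 0 and l1 + l2 + l3 = 1."""
    for i, j in product(range(steps + 1), repeat=2):
        if i + j <= steps:
            yield (i / steps, j / steps, (steps - i - j) / steps)

# Hypothetical held-out probabilities from the three component models
dev_probs = [(0.01, 0.20, 0.00), (0.02, 0.05, 0.30),
             (0.05, 0.00, 0.00), (0.01, 0.10, 0.40)]

# Skip weight settings that give some token zero probability (infinite perplexity)
candidates = [
    lambdas for lambdas in simplex_grid()
    if all(sum(w * p for w, p in zip(lambdas, probs)) > 0 for probs in dev_probs)
]
best_lambdas = min(candidates, key=lambda l: interpolated_perplexity(dev_probs, l))
print(best_lambdas, interpolated_perplexity(dev_probs, best_lambdas))
```

In this toy example the third token is only covered by the unigram model, so every valid setting keeps λ1 above zero. Whether a zero weight is safe therefore depends on how the component models handle unseen events.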

### 1.3.4 Optimization

#### Discussion

There are various ways to find an optimal set of hyperparameters. One learning algorithm that is particularly useful here is Expectation-Maximization (EM). It is an iterative algorithm that alternates between estimating how much each component model contributes to each held-out token and re-estimating the λs from those contributions, and it converges on locally optimal λs.
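A minimal sketch of EM for interpolation weights is shown below, assuming the component models are fixed and only the λs are learned on held-out data. The probabilities in `held_out` are invented for illustration, and `em_interpolation_weights` is a hypothetical helper rather than part of the notebook.

```python
import math

def log_likelihood(component_probs, lambdas):
    """Held-out log-likelihood of the interpolated model."""
    return sum(
        math.log(sum(weight * p for weight, p in zip(lambdas, probs)))
        for probs in component_probs
    )

def em_interpolation_weights(component_probs, n_iter=50):
    """Estimate mixture weights for fixed component models via EM."""
    k = len(component_probs[0])
    lambdas = [1.0 / k] * k  # uniform initialisation
    for _ in range(n_iter):
        # E-step: share of each token attributed to each component
        totals = [0.0] * k
        for probs in component_probs:
            mix = sum(weight * p for weight, p in zip(lambdas, probs))
            for j in range(k):
                totals[j] += lambdas[j] * probs[j] / mix
        # M-step: new weights are the normalised expected counts
        lambdas = [t / len(component_probs) for t in totals]
    return lambdas

# Hypothetical (p_unigram, p_bigram, p_trigram) per held-out token
held_out = [(0.01, 0.20, 0.00), (0.02, 0.05, 0.30),
            (0.05, 0.00, 0.00), (0.01, 0.10, 0.40)]

uniform = [1 / 3] * 3
em_lambdas = em_interpolation_weights(held_out)
print(em_lambdas, log_likelihood(held_out, em_lambdas))
```

Each iteration cannot decrease the held-out likelihood, so the learned weights are at least as good as the uniform starting point. The weights should be fitted on data that was not used to estimate the component models, otherwise the highest-order model tends to take nearly all the weight.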

# 2 Word Vectors

In [8]:
def load_embedding_model():
    """ Load GloVe Vectors
        Return:
            wv_from_bin: All 400000 embeddings, each of length 200
    """
    import gensim.downloader as api
    wv_from_bin = api.load("glove-wiki-gigaword-200")
    print("Loaded vocab size %i" % len(list(wv_from_bin.index_to_key)))
    return wv_from_bin

wv_from_bin = load_embedding_model()

Loaded vocab size 400000


## 2.1 Find most similar word
Use cosine similarity to find the most similar word to each of these words. Report the most similar word and its cosine similarity.

In [9]:
words = ["dog", "whale", "before", "however", "fabricate"]

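# map each query word to its nearest neighbour and the cosine similarity
most_similar_words = {}
for word in words:
  similar_word, cosine_similarity = wv_from_bin.most_similar(positive=[word], topn=1)[0]
  most_similar_words[word] = (similar_word, cosine_similarity)
  print(word, ':', similar_word, cosine_similarity)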

dog : dogs 0.8136862516403198
whale : whales 0.7918056845664978
before : after 0.8931248188018799
however : although 0.9336162805557251
fabricate : fabricating 0.618401825428009
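The scores above are cosine similarities. To make the underlying computation explicit, here is a dependency-free sketch on toy three-dimensional vectors. The vectors and the helper `nearest_neighbour` are made up for illustration and are unrelated to the real GloVe embeddings.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbour(query, embeddings):
    """Return the word (other than the query itself) with the highest cosine similarity."""
    scores = {
        word: cosine_similarity(embeddings[query], vector)
        for word, vector in embeddings.items()
        if word != query
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy embeddings (hypothetical values)
toy_embeddings = {
    'dog': [1.0, 0.9, 0.0],
    'dogs': [0.9, 1.0, 0.1],
    'whale': [0.0, 0.2, 1.0],
}

neighbour, similarity = nearest_neighbour('dog', toy_embeddings)
print(neighbour, round(similarity, 4))
```

Cosine similarity depends only on the angle between two vectors, not on their lengths, which is why it is the usual choice for comparing word embeddings.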


## 2.2 Finding Analogies
Use vector addition and subtraction to compute target vectors for the analogies below. After computing each target vector, find the top three candidates by cosine similarity. Report the candidates and their similarities to the target vector.

- dog : puppy :: cat : ?
- speak : speaker :: sing : ?
- France : French :: England : ?
- France : wine :: England : ?

In [10]:
analogies = [('dog', 'puppy', 'cat'),
             ('speak', 'speaker', 'sing'),
             ('france', 'french', 'england'),
             ('france', 'wine', 'england')]

for a, b, c in analogies:
  # target vector is b - a + c; keep the three closest candidates
  result = wv_from_bin.most_similar(positive=[c, b], negative=[a], topn=3)
  analogy = f"{a} : {b} :: {c} : {result}"
  print(analogy)

dog : puppy :: cat : [('puppies', 0.6142244935035706), ('kitten', 0.5919069647789001), ('kittens', 0.5758378505706787)]
speak : speaker :: sing : [('sang', 0.48609817028045654), ('chorus', 0.4438377618789673), ('singing', 0.4332594871520996)]
france : french :: england : [('english', 0.7599895596504211), ('british', 0.5879749059677124), ('scottish', 0.5616408586502075)]
france : wine :: england : [('tea', 0.540773868560791), ('wines', 0.5329136848449707), ('yorkshire', 0.49405309557914734)]
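The vector arithmetic behind these results can be sketched without gensim: the target for "a : b :: c : ?" is b - a + c, and candidates are ranked by cosine similarity to that target, excluding the three input words. The toy vectors and the helper `solve_analogy` are assumptions for illustration. As far as I understand, gensim's `most_similar` also unit-normalises the input vectors before combining them, so its scores may differ slightly from this raw-vector version.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(a, b, c, embeddings, topn=3):
    """Rank candidate words by cosine similarity to the target vector b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [
        (word, cosine_similarity(target, vector))
        for word, vector in embeddings.items()
        if word not in (a, b, c)  # the input words are not valid answers
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy embeddings (hypothetical): axis 0 ~ "France", axis 1 ~ "language", axis 2 ~ "England"
toy_embeddings = {
    'france': [1.0, 0.0, 0.0],
    'french': [1.0, 1.0, 0.0],
    'england': [0.0, 0.0, 1.0],
    'english': [0.0, 1.0, 1.0],
    'wine': [1.0, 0.0, 0.5],
}

result = solve_analogy('france', 'french', 'england', toy_embeddings)
print(result)
```

Excluding the input words matters in practice: without that filter, the nearest word to the target vector is often one of the inputs themselves.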


# 3 Sentiment analysis

In [11]:
!mkdir -p data/classification
!wget -O data/classification/train.txt https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/classification/train.txt
!wget -O data/classification/dev.txt https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/classification/dev.txt
!wget -O data/classification/test-blind.txt https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/classification/test-blind.txt

--2024-02-19 16:42:04--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/classification/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 738844 (722K) [text/plain]
Saving to: ‘data/classification/train.txt’


2024-02-19 16:42:04 (74.6 MB/s) - ‘data/classification/train.txt’ saved [738844/738844]

--2024-02-19 16:42:04--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/classification/dev.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 94400 (92

## 3.1 Logistic Regression

In [12]:
train_data = load_data('data/classification/train.txt')
dev_data = load_data('data/classification/dev.txt')
test_data = load_data('data/classification/test-blind.txt')

In [13]:
def preprocess_data(data):
  labels = []
  texts = []
  for line in data:
    if line.strip():
      parts = line.strip().split('\t', 1)
      if len(parts) == 2:
        labels.append(int(parts[0]))
        texts.append(parts[1])
      else:
        texts.append(line.strip())
  return labels, texts

train_labels, train_texts = preprocess_data(train_data)
dev_labels, dev_texts = preprocess_data(dev_data)

### Unigram Features

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# extract unigram features
vectorizer_unigram = CountVectorizer(ngram_range=(1, 1))
X_train_unigram = vectorizer_unigram.fit_transform(train_texts)
X_dev_unigram = vectorizer_unigram.transform(dev_texts)

# training
lr_unigram = LogisticRegression(max_iter=1000)
lr_unigram.fit(X_train_unigram, train_labels)
preds_unigram = lr_unigram.predict(X_dev_unigram)

report_unigram = classification_report(dev_labels, preds_unigram, target_names=['Negative', 'Positive'])

print(report_unigram)

              precision    recall  f1-score   support

    Negative       0.80      0.76      0.78       428
    Positive       0.78      0.81      0.80       444

    accuracy                           0.79       872
   macro avg       0.79      0.79      0.79       872
weighted avg       0.79      0.79      0.79       872



### Bigram Features

In [15]:
# extract bigram features
vectorizer_bigram = CountVectorizer(ngram_range=(1, 2))
X_train_bigram = vectorizer_bigram.fit_transform(train_texts)
X_dev_bigram = vectorizer_bigram.transform(dev_texts)

# training
lr_bigram = LogisticRegression(max_iter=1000)
lr_bigram.fit(X_train_bigram, train_labels)
preds_bigram = lr_bigram.predict(X_dev_bigram)

report_bigram = classification_report(dev_labels, preds_bigram, target_names=['Negative', 'Positive'])

print(report_bigram)

              precision    recall  f1-score   support

    Negative       0.80      0.77      0.79       428
    Positive       0.79      0.81      0.80       444

    accuracy                           0.79       872
   macro avg       0.79      0.79      0.79       872
weighted avg       0.79      0.79      0.79       872



### GloVe Features

In [16]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

def text_to_glove_vector(text, wv_from_bin):
  vector_size = 200
  vectors = []
  for word in text.split():
    if word in wv_from_bin:
      vectors.append(wv_from_bin[word])
  if vectors:
    return np.mean(vectors, axis=0)
  else:
    return np.zeros(vector_size)

train_vectors = np.array([text_to_glove_vector(text, wv_from_bin) for text in train_texts])
dev_vectors = np.array([text_to_glove_vector(text, wv_from_bin) for text in dev_texts])

lr_glove = LogisticRegression(max_iter=1000)
lr_glove.fit(train_vectors, train_labels)

preds_glove = lr_glove.predict(dev_vectors)
report_glove = classification_report(dev_labels, preds_glove, target_names=['Negative', 'Positive'])

print(report_glove)

              precision    recall  f1-score   support

    Negative       0.78      0.74      0.76       428
    Positive       0.76      0.80      0.78       444

    accuracy                           0.77       872
   macro avg       0.77      0.77      0.77       872
weighted avg       0.77      0.77      0.77       872



Compare the performance of three types of features on dev set. Report the weighted average precision, recall and F1-score for each feature set.

| Feature | precision | recall | F1-score |
| ----------- | --------- | ------ | -------- |
| unigram     |     0.79      |    0.79    |      0.79    |
| bigram      |      0.79     |     0.79   |      0.79    |
| GloVe       |    0.77       |   0.77     |      0.77    |
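The weighted averages in this table are the per-class scores weighted by each class's support. As a check, the sketch below recomputes them for the unigram model from the per-class values printed in its classification report. Those inputs are already rounded to two decimals, so the result is approximate; `weighted_average` is an illustrative helper.

```python
def weighted_average(scores, supports):
    """Average per-class scores, weighting each class by its support."""
    total = sum(supports)
    return sum(score * support for score, support in zip(scores, supports)) / total

# Per-class values copied from the unigram report, in the order (Negative, Positive)
supports = [428, 444]
precision = [0.80, 0.78]
recall = [0.76, 0.81]
f1 = [0.78, 0.80]

unigram_weighted = {
    'precision': weighted_average(precision, supports),
    'recall': weighted_average(recall, supports),
    'f1': weighted_average(f1, supports),
}
for name, value in unigram_weighted.items():
    print(f"{name}: {value:.2f}")
```

Because the two classes are nearly balanced (428 versus 444 examples), the weighted and macro averages agree to two decimals in all three reports.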

## 3.2 Better Feature

In [17]:
!pip install transformers[torch]
!pip install accelerate -U



The code below shows how I fine-tuned a DistilBERT model for sentiment analysis. You do NOT have to run it; it is included to document the fine-tuning process. For computational reasons, I ran this code on a compute cluster rather than in Google Colab like the rest of the assignment.

**Uncomment if you want to train**

In [18]:
# from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
# from transformers import Trainer, TrainingArguments

# tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

# train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
# dev_encodings = tokenizer(dev_texts, truncation=True, padding=True, max_length=128)

# import torch

# class SentimentDataset(torch.utils.data.Dataset):
#   def __init__(self, encodings, labels):
#       self.encodings = encodings
#       self.labels = labels

#   def __getitem__(self, idx):
#       item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
#       item['labels'] = torch.tensor(self.labels[idx])
#       return item

#   def __len__(self):
#       return len(self.labels)

# train_dataset = SentimentDataset(train_encodings, train_labels)
# dev_dataset = SentimentDataset(dev_encodings, dev_labels)

# training_args = TrainingArguments(
#   output_dir='./results',          # output directory
#   num_train_epochs=3,              # total number of training epochs
#   per_device_train_batch_size=16,  # batch size per device during training
#   per_device_eval_batch_size=64,   # batch size for evaluation
#   warmup_steps=500,                # number of warmup steps for learning rate scheduler
#   weight_decay=0.01,               # strength of weight decay
#   logging_dir='./logs',            # directory for storing logs
#   logging_steps=10,
# )

# trainer = Trainer(
#   model=model,
#   args=training_args,
#   train_dataset=train_dataset,
#   eval_dataset=dev_dataset
# )

# trainer.train()

# model_path = "./save/model"
# tokenizer_path = "./save/tokenizer"

# model.save_pretrained(model_path)
# tokenizer.save_pretrained(tokenizer_path)

The cells below download the fine-tuned model, evaluate it on the dev set, and then generate predictions for the test set.

In [19]:
!mkdir -p finetuned-bert/model
!gdown --id 1AAeOu8LMrf4XIbhm63iY1_H2SmAOHBFw -O finetuned-bert/model/config.json
!gdown --id 1PFbzdDbuYL9ujOzNcPls1trl4uqj5HVh -O finetuned-bert/model/pytorch_model.bin

!mkdir -p finetuned-bert/tokenizer
!gdown --id 1fqIy-Qk3x-bkCSG-kUgJ3Bw_WggJcx1F -O finetuned-bert/tokenizer/special_tokens_map.json
!gdown --id 1KZPZy05uAHIwoZYmywxJUK6aipq1PY3F -O finetuned-bert/tokenizer/tokenizer_config.json
!gdown --id 1KthdGrViT39L8FlaWgaXh6oSSAFHksLR -O finetuned-bert/tokenizer/vocab.txt

Downloading...
From: https://drive.google.com/uc?id=1AAeOu8LMrf4XIbhm63iY1_H2SmAOHBFw
To: /content/finetuned-bert/model/config.json
100% 615/615 [00:00<00:00, 1.64MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1PFbzdDbuYL9ujOzNcPls1trl4uqj5HVh
From (redirected): https://drive.google.com/uc?id=1PFbzdDbuYL9ujOzNcPls1trl4uqj5HVh&confirm=t&uuid=b7129b82-fd52-421a-9770-46a1b78b733a
To: /content/finetuned-bert/model/pytorch_model.bin
100% 268M/268M [00:02<00:00, 127MB/s]
Downloading...
From: https://drive.google.com/uc?id=1fqIy-Qk3x-bkCSG-kUgJ3Bw_WggJcx1F
To: /content/finetuned-bert/tokenizer/special_tokens_map.json
100% 125/125 [00:00<00:00, 331kB/s]
Downloading...
From: https://drive.google.com/uc?id=1KZPZy05uAHIwoZYmywxJUK6aipq1PY3F
To: /content/finetuned-bert/tokenizer/tokenizer_config.json
100% 372/372 [00:00<00:00, 882kB/s]
Downloading...
From: https://drive.google.com/uc?id=1KthdGrViT39L8FlaWgaXh6oSSAFHksLR
To: /content/finetuned-bert/tokenizer/vocab.txt
100% 232

In [20]:
import numpy as np
import torch
from sklearn.metrics import classification_report
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer

model_path = "finetuned-bert/model"
tokenizer_path = "finetuned-bert/tokenizer"
tokenizer = DistilBertTokenizer.from_pretrained(tokenizer_path)
model = DistilBertForSequenceClassification.from_pretrained(model_path)

# tokenize
dev_encodings = tokenizer(dev_texts, truncation=True, padding=True, max_length=128)

class SentimentDataset(torch.utils.data.Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, idx):
    item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    item['labels'] = torch.tensor(self.labels[idx])
    return item

  def __len__(self):
    return len(self.labels)

dev_dataset = SentimentDataset(dev_encodings, dev_labels)

# calculate predictions
trainer = Trainer(model=model)
raw_pred, _, _ = trainer.predict(dev_dataset)
predictions = np.argmax(raw_pred, axis=1)

report = classification_report(dev_labels, predictions, target_names=['Negative', 'Positive'])

print(report)

              precision    recall  f1-score   support

    Negative       0.90      0.89      0.89       428
    Positive       0.90      0.90      0.90       444

    accuracy                           0.90       872
   macro avg       0.90      0.90      0.90       872
weighted avg       0.90      0.90      0.90       872



In [21]:
test_encodings = tokenizer(test_data, truncation=True, padding=True, max_length=128)

class TestDataset(torch.utils.data.Dataset):
  def __init__(self, encodings):
    self.encodings = encodings

  def __getitem__(self, idx):
    item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    return item

  def __len__(self):
    return len(self.encodings.input_ids)

test_dataset = TestDataset(test_encodings)

# calculate predictions
raw_pred, _, _ = trainer.predict(test_dataset)
predictions = np.argmax(raw_pred, axis=1)

# formatting predictions
output_lines = ["{}\t{}".format(pred, text) for pred, text in zip(predictions, test_data)]
output_content = "".join(output_lines)

print(output_content)

1	If you sometimes like to go to the movies to have fun , Wasabi is a good place to start .
1	Emerges as something rare , an issue movie that 's so honest and keenly observed that it does n't feel like one .
1	Offers that rare combination of entertainment and education .
1	Perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions .
1	Steers turns in a snappy screenplay that curls at the edges ; it 's so clever you want to hate it .
1	But he somehow pulls it off .
1	Take Care of My Cat offers a refreshingly different slice of Asian cinema .
1	This is a film well worth seeing , talking and singing heads and all .
1	What really surprises about Wisegirls is its low-key quality and genuine tenderness .
1	-LRB- Wendigo is -RRB- why we go to the cinema : to be fed through the eye , the heart , the mind .
1	One of the greatest family-oriented , fantasy-adventure movies ever .
1	An utterly compelling ` who wrote it ' in which the reputation of th


| Feature | precision | recall | F1-score |
| ----------- | --------- | ------ | -------- |
| unigram | 0.79 | 0.79 | 0.79 |
| bigram | 0.79 | 0.79 | 0.79 |
| GloVe | 0.77 | 0.77 | 0.77 |
| better feature (fine-tuned DistilBERT) | 0.90 | 0.90 | 0.90 |