**Initialization**

In [1]:
import random
import tqdm

**Feature Extraction**

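The feature extraction function 𝒇 maps an input sentence 𝒙 to a feature vector 𝒉. Here the features are bag-of-words counts: each space-separated token is a feature, and its value is the number of times that token occurs in 𝒙.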

In [2]:
def extract_features(x: str) -> dict[str, float]:
    # Bag-of-words features: count how often each space-separated token occurs
    features = {}
    for word in x.split(' '):
        features[word] = features.get(word, 0) + 1.0
    return features

# initialize the weights to zero
feature_weights = {}
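
As a quick check of what 𝒉 looks like, the sketch below applies the same bag-of-words counting to a made-up sentence. The function is re-declared so the snippet runs on its own; the sample sentence and the name `example_features` are assumptions for illustration and are not taken from the dataset.

```python
def extract_features(x: str) -> dict[str, float]:
    # Same bag-of-words counting as above, re-declared so this snippet is self-contained
    features: dict[str, float] = {}
    for word in x.split(' '):
        features[word] = features.get(word, 0) + 1.0
    return features

# Hypothetical sentence, not taken from the dataset
example_features = extract_features('a good , good movie')
print(example_features)
# {'a': 1.0, 'good': 2.0, ',': 1.0, 'movie': 1.0}
```

Note that tokens are case-sensitive and punctuation marks count as features of their own, since the text is split on single spaces only.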

**Data Reading**

This function reads a training, development, or test data file. Each line is expected to hold an integer label and the sentence text, separated by ` ||| `.

In [3]:
def read_xy_data(filename: str) -> tuple[list[str], list[int]]:
    x_data = []
    y_data = []
    with open(filename, 'r') as f:
        for line in f:
            # Split on the first ' ||| ' only, so the separator may also appear in the text
            label, text = line.strip().split(' ||| ', 1)
            x_data.append(text)
            y_data.append(int(label))
    return x_data, y_data
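
To make the expected file layout concrete, the following sketch writes a tiny file in the `label ||| text` format and reads it back. The two sentences, their labels, and the names `demo_path`, `demo_x`, and `demo_y` are made up for illustration; the reader function is re-declared so the snippet runs without the notebook state.

```python
import os
import tempfile

def read_xy_data(filename: str) -> tuple[list[str], list[int]]:
    # Same parsing logic as the notebook function, re-declared so the snippet runs alone
    x_data, y_data = [], []
    with open(filename, 'r') as f:
        for line in f:
            label, text = line.strip().split(' ||| ')
            x_data.append(text)
            y_data.append(int(label))
    return x_data, y_data

# Hypothetical two-line file in the "label ||| text" format
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('1 ||| A charming film .\n')
    tmp.write('-1 ||| Dull and lifeless .\n')
    demo_path = tmp.name

demo_x, demo_y = read_xy_data(demo_path)
os.remove(demo_path)  # clean up the temporary file
print(demo_x)
print(demo_y)
```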

In [4]:
from google.colab import drive
drive.mount('/content/drive')

x_train, y_train = read_xy_data('/content/drive/My Drive/data/sentiment/train.txt')
x_test, y_test = read_xy_data('/content/drive/My Drive/data/sentiment/dev.txt')

Mounted at /content/drive


In [6]:
print(x_train[1])
print(y_train[1])

The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer\/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .
1


**Classification**

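The classifier predicts the label of 𝒙 from the sign of the weighted feature sum: 1 for a positive score, -1 for a negative score, and 0 when the score is exactly zero.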

In [22]:
def run_classifier(features: dict[str, float]) -> int:
    score = 0
    for feat_name, feat_value in features.items():
        score = score + feat_value * feature_weights.get(feat_name, 0)
    if score > 0:
        return 1
    elif score < 0:
        return -1
    else:
        return 0
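
A small worked example may help show how the score is formed. The sketch below uses a hypothetical helper `classify` that takes the weights as an explicit argument (the notebook function reads the global `feature_weights` instead); the toy weights and features are assumptions chosen for illustration.

```python
def classify(features: dict[str, float], weights: dict[str, float]) -> int:
    # Weighted sum of feature values; features without a weight contribute 0
    score = sum(value * weights.get(name, 0.0) for name, value in features.items())
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0

# Hypothetical weights and features, not learned from the data
toy_weights = {'good': 2.0, 'bad': -3.0}
toy_features = {'good': 2.0, 'bad': 1.0, 'movie': 1.0}

toy_label = classify(toy_features, toy_weights)  # score = 2*2.0 + 1*(-3.0) + 1*0 = 1.0
untrained_label = classify(toy_features, {})     # every weight missing, so score = 0
print(toy_label, untrained_label)
# 1 0
```

With all weights at zero the score is 0 for every input, so an untrained classifier predicts 0 everywhere.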

**Training**

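The following code trains the feature weights on the training data with perceptron-style updates: whenever a prediction is wrong, the weight of each feature in the example is shifted toward the true label. Neutral examples (label 0) are skipped.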

In [14]:
NUM_EPOCHS = 5
for epoch in range(1, NUM_EPOCHS+1):
    # Shuffle the order of the data
    data_ids = list(range(len(x_train)))
    random.shuffle(data_ids)
    # Run over all data points
    for data_id in tqdm.tqdm(data_ids, desc=f'Epoch {epoch}'):
        x = x_train[data_id]
        y = y_train[data_id]
        # We will skip neutral examples
        if y == 0:
            continue
        # Make a prediction
        features = extract_features(x)
        predicted_y = run_classifier(features)
        # Update the weights if the prediction is wrong
        if predicted_y != y:
            for feature in features:
                feature_weights[feature] = feature_weights.get(feature, 0) + y * features[feature]

Epoch 1: 100%|██████████| 8544/8544 [00:00<00:00, 106706.61it/s]
Epoch 2: 100%|██████████| 8544/8544 [00:00<00:00, 105431.40it/s]
Epoch 3: 100%|██████████| 8544/8544 [00:00<00:00, 109933.87it/s]
Epoch 4: 100%|██████████| 8544/8544 [00:00<00:00, 107898.56it/s]
Epoch 5: 100%|██████████| 8544/8544 [00:00<00:00, 92538.14it/s]
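The update rule can be traced by hand on two toy examples. This is a minimal sketch, assuming a hypothetical helper `perceptron_step` that bundles prediction and update for one example; the feature dictionaries and labels are made up and are not from the training set.

```python
def perceptron_step(weights: dict[str, float], features: dict[str, float], y: int) -> int:
    # Predict from the sign of the weighted sum (0 when the score is exactly zero)
    score = sum(value * weights.get(name, 0.0) for name, value in features.items())
    predicted = 1 if score > 0 else -1 if score < 0 else 0
    # On a mistake, move each active feature's weight toward the true label
    if predicted != y:
        for name, value in features.items():
            weights[name] = weights.get(name, 0.0) + y * value
    return predicted

toy_weights: dict[str, float] = {}

# Step 1: score is 0, prediction 0 differs from y = 1, so both weights rise by 1
perceptron_step(toy_weights, {'good': 1.0, 'movie': 1.0}, 1)
# Step 2: score is 1 (from 'movie'), prediction 1 differs from y = -1, so both weights drop by 1
perceptron_step(toy_weights, {'bad': 1.0, 'movie': 1.0}, -1)

print(toy_weights)
# {'good': 1.0, 'movie': 0.0, 'bad': -1.0}
```

A token that appears in both positive and negative mistakes, like `movie` here, can end up with a weight of exactly 0.0 even though it has been updated.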


In [16]:
import itertools
# Print the first 10 key-value pairs
for key, value in itertools.islice(feature_weights.items(), 10):
    print(f'{key}: {value}')

The: 1.0
fluid: 5.0
motion: 1.0
is: 0.0
astounding: -1.0
on: -1.0
any: -1.0
number: -4.0
of: -2.0
levels: 2.0


**Evaluation**

This function computes accuracy: the fraction of examples whose predicted label matches the true label.

In [25]:
def calculate_accuracy(x_data: list[str], y_data: list[int]) -> float:
    total_number = 0
    correct_number = 0
    for x, y in zip(x_data, y_data):
        y_pred = run_classifier(extract_features(x))
        total_number += 1
        if y == y_pred:
            correct_number += 1
    return correct_number / float(total_number)

In [18]:
label_count = {}
for y in y_test:
    if y not in label_count:
        label_count[y] = 0
    label_count[y] += 1
print(label_count)

{1: 444, 0: 229, -1: 428}


In [26]:
train_accuracy = calculate_accuracy(x_train, y_train)
test_accuracy = calculate_accuracy(x_test, y_test)
print(f'Train accuracy: {train_accuracy}')
print(f'Dev/test accuracy: {test_accuracy}')

Train accuracy: 0.769311797752809
Dev/test accuracy: 0.5758401453224341
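
The label counts printed above give some context for the dev accuracy. The sketch below computes the accuracy of always predicting the most frequent label, plus the share of neutral examples; the counts are copied from the output above, and the variable names are assumptions for illustration.

```python
# Label counts copied from the dev-set output above
label_count = {1: 444, 0: 229, -1: 428}

total = sum(label_count.values())
majority_label = max(label_count, key=label_count.get)
majority_baseline = label_count[majority_label] / total
neutral_share = label_count[0] / total

print(f'Total examples: {total}')
print(f'Majority label: {majority_label}')
print(f'Majority-class baseline accuracy: {majority_baseline:.3f}')
print(f'Share of neutral examples: {neutral_share:.3f}')
```

Always predicting label 1 would score about 0.403 on this set, so the reported dev accuracy of roughly 0.576 is above that baseline. About 0.208 of the examples are neutral, a label the classifier only outputs when the score is exactly zero.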


**Error Analysis**

Error analysis is an important part of building an NLP system: it pinpoints where the system fails and guides improvements during development.

In [29]:
import random
def find_errors(x_data: list[str], y_data: list[int]) -> None:
    error_ids = []
    y_preds = []
    for i, (x, y) in enumerate(zip(x_data, y_data)):
        y_preds.append(run_classifier(extract_features(x)))
        if y != y_preds[-1]:
            error_ids.append(i)
    # Print up to five distinct, randomly chosen errors (none if there were no mistakes)
    for my_id in random.sample(error_ids, min(5, len(error_ids))):
        x, y, y_pred = x_data[my_id], y_data[my_id], y_preds[my_id]
        print(f'{x}\ntrue label: {y}\npredicted label: {y_pred}\n')

In [30]:
find_errors(x_train, y_train)

What we get ... is Caddyshack crossed with the Loyal Order of Raccoons .
true label: 0
predicted label: 1

This is the first full scale WWII flick from Hong Kong 's John Woo .
true label: 0
predicted label: 1

but ` Why ? '
true label: 0
predicted label: -1

Blithely anachronistic and slyly achronological .
true label: 0
predicted label: 1

Did it move me to care about what happened in 1915 Armenia ?
true label: 0
predicted label: -1
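
All five sampled errors above have true label 0, which is consistent with training skipping neutral examples and the classifier returning 0 only for a score of exactly zero. A random sample of five cannot show how the errors are distributed overall, so a tally by (true label, predicted label) pair may be a useful next step. The sketch below uses a hypothetical helper `tally_errors` and made-up label lists, not actual model output.

```python
from collections import Counter

def tally_errors(y_true: list[int], y_pred: list[int]) -> Counter:
    # Count each (true label, predicted label) pair where the two disagree
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)

# Hypothetical labels for illustration only, not model output
demo_true = [0, 0, 1, -1, 0, 1]
demo_pred = [1, -1, 1, -1, 1, -1]

error_counts = tally_errors(demo_true, demo_pred)
for (true_label, predicted_label), count in error_counts.most_common():
    print(f'true {true_label:>2} -> predicted {predicted_label:>2}: {count}')
```

In the notebook, the same helper could be called with `y_train` and a list of predictions built from `run_classifier(extract_features(x))` for each `x` in `x_train`.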

