**Feature Extraction**

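The feature extraction function 𝒇 maps the input text 𝒙 to a feature vector 𝒉 = 𝒇(𝒙), stored here as a dictionary from feature names to values. The feature weights defined after the function are set by hand rather than learned.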

In [1]:
def extract_features(x: str) -> dict[str, float]:
    features = {}
    x_split = x.split(' ')

    # Count the number of "good words" and "bad words" in the text
    good_words = ['love', 'good', 'nice', 'great', 'enjoy', 'enjoyed']
    bad_words = ['hate', 'bad', 'terrible', 'disappointing', 'sad', 'lost', 'angry']
    for x_word in x_split:
        if x_word in good_words:
            features['good_word_count'] = features.get('good_word_count', 0) + 1
        if x_word in bad_words:
            features['bad_word_count'] = features.get('bad_word_count', 0) + 1

    # The "bias" value is always one, to allow us to assign a "default" score to the text
    features['bias'] = 1

    return features

feature_weights = {'good_word_count': 1.0, 'bad_word_count': -1.0, 'bias': 0.5}
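As a small worked example of 𝒉 and the score it produces, the sketch below restates the extractor compactly so that it runs on its own. The name sketch_features and the two sample sentences are illustrative assumptions, not part of the notebook; the word lists and weights are copied from the cell above.

```python
# Word lists and weights copied from the notebook cell above.
GOOD_WORDS = {'love', 'good', 'nice', 'great', 'enjoy', 'enjoyed'}
BAD_WORDS = {'hate', 'bad', 'terrible', 'disappointing', 'sad', 'lost', 'angry'}
weights = {'good_word_count': 1.0, 'bad_word_count': -1.0, 'bias': 0.5}


def sketch_features(text: str) -> dict[str, float]:
    """Hypothetical compact re-statement of extract_features."""
    words = text.split(' ')
    features = {'bias': 1.0}
    good = sum(w in GOOD_WORDS for w in words)
    bad = sum(w in BAD_WORDS for w in words)
    # Like the original, only store counts that are non-zero.
    if good:
        features['good_word_count'] = float(good)
    if bad:
        features['bad_word_count'] = float(bad)
    return features


# Illustrative sentence: two good words ("love", "great") and one bad word ("sad").
h = sketch_features('I love this great but sad movie')
score = sum(value * weights.get(name, 0.0) for name, value in h.items())
print(h)      # {'bias': 1.0, 'good_word_count': 2.0, 'bad_word_count': 1.0}
print(score)  # 1.5

# Matching is exact and case-sensitive, so "Great" is not counted.
h_capitalized = sketch_features('Great movie .')
print(h_capitalized)  # {'bias': 1.0}
```

The second call illustrates one limitation of exact string matching: capitalized or inflected forms of a listed word contribute nothing to the score.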

**Data Reading**

This function reads the data from a training, development, or test data file. Each line of the file holds an integer label and the text, separated by ' ||| '.

In [4]:
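def read_xy_data(filename: str) -> tuple[list[str], list[int]]:
    x_data = []
    y_data = []
    # Read as UTF-8 explicitly, since the reviews contain non-ASCII characters
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            # Split only on the first separator, in case the text itself contains ' ||| '
            label, text = line.strip().split(' ||| ', 1)
            x_data.append(text)
            y_data.append(int(label))
    return x_data, y_data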
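To make the expected file format concrete, the following sketch parses a few lines from an in-memory buffer instead of a file on disk. The helper name parse_xy_lines and the sample lines are hypothetical; only the 'label ||| text' layout is taken from the reader above.

```python
import io


def parse_xy_lines(lines) -> tuple[list[str], list[int]]:
    """Hypothetical helper: parse 'label ||| text' lines from any iterable of strings."""
    x_data, y_data = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        # maxsplit=1 keeps any later ' ||| ' inside the text
        label, text = line.split(' ||| ', 1)
        x_data.append(text)
        y_data.append(int(label))
    return x_data, y_data


# Made-up sample content, including a blank line and a text that contains the separator.
sample_file = io.StringIO(
    '1 ||| a great movie\n'
    '-1 ||| bad ||| worse\n'
    '\n'
    '0 ||| it exists\n'
)
xs, ys = parse_xy_lines(sample_file)
print(xs)  # ['a great movie', 'bad ||| worse', 'it exists']
print(ys)  # [1, -1, 0]
```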

In [12]:
from google.colab import drive
drive.mount('/content/drive')

x_train, y_train = read_xy_data('/content/drive/My Drive/data/sentiment/train.txt')
x_test, y_test = read_xy_data('/content/drive/My Drive/data/sentiment/dev.txt')



In [15]:
print(x_train[0])
print(y_train[0])

The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .
1


**Classification and Evaluation**

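Run the classifier and compute the accuracy of the classification results. The classifier scores a text as the weighted sum of its feature values and predicts 1 for a positive score, -1 for a negative score, and 0 for a score of exactly zero. Accuracy is the fraction of examples whose predicted label matches the true label.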

In [6]:
def run_classifier(x: str) -> int:
    score = 0
    for feat_name, feat_value in extract_features(x).items():
        score = score + feat_value * feature_weights.get(feat_name, 0)
    if score > 0:
        return 1
    elif score < 0:
        return -1
    else:
        return 0
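One consequence of the hand-set weights is worth checking directly: the word counts are integers and the bias contributes 0.5, so the score is always a half-integer and can never be exactly zero. The sketch below enumerates small count combinations to illustrate this; the helper name sign_label and the range of counts are assumptions made for the example.

```python
weights = {'good_word_count': 1.0, 'bad_word_count': -1.0, 'bias': 0.5}


def sign_label(score: float) -> int:
    """Same decision rule as run_classifier, applied to a precomputed score."""
    if score > 0:
        return 1
    elif score < 0:
        return -1
    return 0


reachable_labels = set()
for good in range(6):
    for bad in range(6):
        score = (good * weights['good_word_count']
                 + bad * weights['bad_word_count']
                 + weights['bias'])
        reachable_labels.add(sign_label(score))

print(sorted(reachable_labels))  # [-1, 1]

# A text with no listed words gets only the bias, so it is labelled positive.
no_match_label = sign_label(weights['bias'])
print(no_match_label)  # 1
```

With these weights the neutral label 0 is never predicted, and any text containing none of the listed words defaults to 1.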

In [7]:
def calculate_accuracy(x_data: list[str], y_data: list[int]) -> float:
    total_number = 0
    correct_number = 0
    for x, y in zip(x_data, y_data):
        y_pred = run_classifier(x)
        total_number += 1
        if y == y_pred:
            correct_number += 1
    return correct_number / float(total_number)

In [13]:
label_count = {}
for y in y_test:
    if y not in label_count:
        label_count[y] = 0
    label_count[y] += 1
print(label_count)

{1: 444, 0: 229, -1: 428}


In [16]:
train_accuracy = calculate_accuracy(x_train, y_train)
test_accuracy = calculate_accuracy(x_test, y_test)
print(f'Train accuracy: {train_accuracy}')
print(f'Dev/test accuracy: {test_accuracy}')

Train accuracy: 0.4345739700374532
Dev/test accuracy: 0.4214350590372389
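To put these numbers in context, it helps to compare them with a majority-class baseline, that is, always predicting the most frequent label. The sketch below computes that baseline from the dev label counts printed earlier; the variable names are illustrative.

```python
from collections import Counter

# Dev-set label counts as printed by the label-counting cell.
label_count = Counter({1: 444, 0: 229, -1: 428})

total = sum(label_count.values())
majority_label, majority_count = label_count.most_common(1)[0]
baseline_accuracy = majority_count / total

print(total)                        # 1101
print(majority_label)               # 1
print(round(baseline_accuracy, 4))  # 0.4033
```

The reported dev accuracy of about 0.4214 is therefore less than two percentage points above always predicting label 1.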


**Error Analysis**

Error analysis is an important part of building an NLP system. Inspecting misclassified examples pinpoints where the system fails and suggests how to improve it during development.

In [3]:
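import random

def find_errors(x_data: list[str], y_data: list[int]) -> None:
    # Collect predictions and the indices of misclassified examples
    error_ids = []
    y_preds = []
    for i, (x, y) in enumerate(zip(x_data, y_data)):
        y_preds.append(run_classifier(x))
        if y != y_preds[-1]:
            error_ids.append(i)
    # Sample without replacement so no error is printed twice,
    # and cap the sample size so fewer than 5 errors does not raise
    for my_id in random.sample(error_ids, min(5, len(error_ids))):
        x, y, y_pred = x_data[my_id], y_data[my_id], y_preds[my_id]
        print(f'{x}\ntrue label: {y}\npredicted label: {y_pred}\n')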
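Random samples show individual failures; a complementary view is to count how often each true label is mapped to each predicted label. The sketch below does this with a Counter over (true, predicted) pairs. The function name confusion_counts and the toy label lists are hypothetical and are not results from the notebook's data.

```python
from collections import Counter


def confusion_counts(y_true: list[int], y_pred: list[int]) -> Counter:
    """Count occurrences of each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


# Toy labels for illustration only.
toy_true = [1, -1, 0, 0, -1, 1]
toy_pred = [1, 1, 1, 1, -1, 1]

counts = confusion_counts(toy_true, toy_pred)
for (true_label, pred_label), n in sorted(counts.items()):
    print(f'true {true_label:>2} -> predicted {pred_label:>2}: {n}')
```

In the notebook, the same function could be called with y_train and the list of run_classifier predictions to see which kinds of confusion dominate.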

In [17]:
find_errors(x_train, y_train)

But its storytelling prowess and special effects are both listless .
true label: -1
predicted label: 1

How inept is Serving Sara ?
true label: -1
predicted label: 1

While you have to admit it 's semi-amusing to watch Robert DeNiro belt out `` When you 're a Jet , you 're a Jet all the way , '' it 's equally distasteful to watch him sing the lyrics to `` Tonight . ''
true label: -1
predicted label: 1

Amid the cliché and foreshadowing , Cage manages a degree of casual realism ... that is routinely dynamited by Blethyn .
true label: 0
predicted label: 1

Make no mistake , ivans xtc .
true label: 0
predicted label: 1

