# Build a Rule-based Sentiment Classifier

This is a notebook for [CMU CS11-711 Advanced NLP](http://phontron.com/class/anlp2021/), in which you can attempt to build a rule-based sentiment classifier. It takes in a text `X` and returns a `label`: `1` if the sentiment of the text is positive, `-1` if it is negative, and `0` if it is neutral. You can test the accuracy of your classifier on the [Stanford Sentiment Treebank](http://nlp.stanford.edu/sentiment/index.html) by running the notebook all the way to the end.

The only cell that you should change in this notebook is the following one, which contains two important elements. The first is `extract_features(X)`, which extracts a dictionary of (named) feature values from the text. You should design these features by hand, and a simple example is shown for you. The second is `feature_weights`, a dictionary that assigns a weight to each extracted feature.

The classifier decides whether to assign a positive, negative, or neutral label by calculating the dot product of `feature_weights` and `extract_features(X)`, that is, the sum over all extracted features of the feature value times its weight. If the resulting score is greater than zero it returns 1, if it is less than zero it returns -1, and if it is exactly zero it returns 0.
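As a small worked example of this scoring rule, the sketch below computes the score by hand for one set of feature values. The feature values are hypothetical (what the example extractor might produce for a text containing "love" and "good" once each), and the weights are copied from the example cell.

```python
# Hypothetical feature values for a text containing "love" and "good" once each.
features = {'good_word_count': 2, 'bias': 1}
feature_weights = {'good_word_count': 1.0, 'bad_word_count': -1.0, 'bias': 0.5}

# Sparse dot product: a feature with no weight contributes zero.
score = sum(value * feature_weights.get(name, 0) for name, value in features.items())

# Map the sign of the score to a label.
label = 1 if score > 0 else -1 if score < 0 else 0
print(score, label)  # 2 * 1.0 + 1 * 0.5 = 2.5, so the label is 1
```

Note that with a bias weight of 0.5, a text with no matching words scores 0.5 and is labeled positive; a neutral label is returned only when the score is exactly zero.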

Let's have some fun trying to design a classifier 😁

In [None]:
def extract_features(X):
    features = {}
    X_split = X.split(' ')
    
    # Count the number of "good words" and "bad words" in the text
    good_words = ['love', 'good']
    bad_words = ['hate', 'bad']
    for x in X_split:
        if x in good_words:
            features['good_word_count'] = features.get('good_word_count', 0) + 1
        if x in bad_words:
            features['bad_word_count'] = features.get('bad_word_count', 0) + 1
    
    # The "bias" value is always one, to allow us to assign a "default" score to the text
    features['bias'] = 1
    
    return features

feature_weights = {'good_word_count': 1.0, 'bad_word_count': -1.0, 'bias': 0.5}
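The example extractor splits on single spaces and matches words case-sensitively, so a token such as `Good` or `bad.` would not be counted. One possible improvement, sketched here under the assumption that the input may contain capitalized words or attached punctuation, is to normalize tokens before matching. The names `tokenize` and `extract_features_normalized` are illustrative and not part of the notebook.

```python
import string

def tokenize(text):
    """Lowercase the text and strip leading/trailing punctuation from each token."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)
        if token:  # drop tokens that were punctuation only
            tokens.append(token)
    return tokens

def extract_features_normalized(text):
    """Same counting features as the example extractor, on normalized tokens."""
    good_words = {'love', 'good'}
    bad_words = {'hate', 'bad'}
    features = {'bias': 1}
    for token in tokenize(text):
        if token in good_words:
            features['good_word_count'] = features.get('good_word_count', 0) + 1
        if token in bad_words:
            features['bad_word_count'] = features.get('bad_word_count', 0) + 1
    return features

print(extract_features_normalized('Good acting , but I HATE the ending !'))
```

Whether normalization helps depends on the data, so it is worth comparing accuracy with and without it.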

## Data Reading

Read in the data from the training and dev (or, finally, test) sets. Each line of a data file has the form `label ||| text`.

In [None]:
def read_XY_data(filename):
    X_data = []
    Y_data = []
    with open(filename, 'r') as f:
        for line in f:
            # Split on the first separator only, in case the text itself contains ' ||| '
            label, text = line.strip().split(' ||| ', 1)
            X_data.append(text)
            Y_data.append(int(label))
    return X_data, Y_data
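To make the expected file format concrete, the following sketch parses a few in-memory lines the same way `read_XY_data` parses a file. The sample lines are made up for illustration and are not taken from the dataset; `parse_line` is a hypothetical helper.

```python
def parse_line(line):
    """Split one 'label ||| text' line into (text, label)."""
    label, text = line.strip().split(' ||| ', 1)
    return text, int(label)

# Made-up lines in the assumed 'label ||| text' format.
sample_lines = [
    '1 ||| A hypothetical positive review .\n',
    '-1 ||| A hypothetical negative review .\n',
    '0 ||| A hypothetical neutral review .\n',
]
parsed = [parse_line(line) for line in sample_lines]
for text, label in parsed:
    print(label, text)
```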

In [None]:
X_train, Y_train = read_XY_data('../data/sst-sentiment-text-threeclass/train.txt')
X_test, Y_test = read_XY_data('../data/sst-sentiment-text-threeclass/dev.txt')

In [None]:
print(X_train[0])
print(Y_train[0])

## Run the Classifier and Calculate Accuracy

Run the classifier over the training and dev (test) sets and calculate accuracy.

In [None]:
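def run_classifier(X):
    # Score is the sparse dot product of feature values and weights;
    # a feature with no entry in feature_weights contributes 0
    score = 0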
    for feat_name, feat_value in extract_features(X).items():
        score = score + feat_value * feature_weights.get(feat_name, 0)
    if score > 0:
        return 1
    elif score < 0:
        return -1
    else:
        return 0

In [None]:
def calculate_accuracy(X_data, Y_data):
    total_number = 0
    correct_number = 0
    for X, Y in zip(X_data, Y_data):
        Y_pred = run_classifier(X)
        total_number += 1
        if Y == Y_pred:
            correct_number += 1
    return correct_number / float(total_number)

In [None]:
train_accuracy = calculate_accuracy(X_train, Y_train)
test_accuracy = calculate_accuracy(X_test, Y_test)
print(f'Train accuracy: {train_accuracy}')
print(f'Dev/test accuracy: {test_accuracy}')
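An accuracy number is easier to interpret next to a trivial baseline. The sketch below, which is an optional addition and not part of the original notebook, computes the accuracy of always predicting the most frequent training label. The toy labels are hypothetical; in the notebook you could pass `Y_train` and `Y_test` instead.

```python
from collections import Counter

def majority_baseline_accuracy(Y_train, Y_eval):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(Y_train).most_common(1)[0]
    correct = sum(1 for y in Y_eval if y == majority_label)
    return majority_label, correct / len(Y_eval)

# Hypothetical toy labels for illustration only.
toy_train = [1, 1, -1, 0, 1]
toy_eval = [1, -1, 1, 0]
baseline_label, baseline_accuracy = majority_baseline_accuracy(toy_train, toy_eval)
print(baseline_label, baseline_accuracy)
```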

## Error Analysis

An important part of improving any system is figuring out where it goes wrong. The following function lets you inspect a random sample of misclassified examples, which may help you improve the classifier. Feel free to write more sophisticated methods for error analysis as well.

In [None]:
import random
def find_errors(X_data, Y_data):
    error_ids = []
    Y_preds = []
    for i, (X, Y) in enumerate(zip(X_data, Y_data)):
        Y_preds.append(run_classifier(X))
        if Y != Y_preds[-1]:
            error_ids.append(i)
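    # Show up to five distinct errors; random.sample avoids repeats and
    # does not fail when there are fewer than five (or zero) errors
    for my_id in random.sample(error_ids, min(5, len(error_ids))):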
        X, Y, Y_pred = X_data[my_id], Y_data[my_id], Y_preds[my_id]
        print(f'{X}\ntrue label: {Y}\npredicted label: {Y_pred}\n')

In [None]:
find_errors(X_train, Y_train)
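As one example of a slightly more systematic analysis, the sketch below counts how often each true label is predicted as each label, which shows which kinds of confusion dominate. The labels and predictions here are hypothetical; in the notebook, the predictions could come from calling `run_classifier` on each text. `confusion_counts` is an illustrative helper name.

```python
from collections import Counter

def confusion_counts(Y_true, Y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(Y_true, Y_pred))

# Hypothetical labels and predictions for illustration only.
toy_true = [1, 1, -1, 0, -1, 0]
toy_pred = [1, 1, 1, 1, -1, 0]
counts = confusion_counts(toy_true, toy_pred)
for (true_label, pred_label), n in sorted(counts.items()):
    print(f'true {true_label:>2} -> predicted {pred_label:>2}: {n}')
```

Large off-diagonal counts, such as many negative texts predicted as positive, suggest which features or weights to revisit first.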