# Rule-based classifier

We are building a rule-based classifier for sentiment analysis of movie reviews. Each review is labeled positive (`1`), neutral (`0`), or negative (`-1`).

The idea is to count how many words in a review appear in a list of positive words and how many appear in a list of negative words, then predict the sentiment that the counts lean towards.

In [2]:
import os

## Get the data

In [1]:
data_folder = "data/sst-sentiment-text-threeclass"

In [3]:
with open(os.path.join(data_folder, "train.txt"), "r") as f:
    lines = f.read().splitlines()
    for line in lines[:5]:
        print(line)

1 ||| The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .
1 ||| The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer\/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .
1 ||| Singer\/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .
0 ||| You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .
1 ||| Yet the act is still charming here .


In [4]:
x_train, y_train = [], []
x_dev, y_dev = [], []
x_test, y_test = [], []

In [7]:
for file_name in ["train.txt", "dev.txt", "test.txt"]:
    with open(os.path.join(data_folder, file_name), "r") as f:
        lines = f.read().splitlines()
        for line in lines:
            # Each line has the form "<label> ||| <review text>"
            label, review = line.split("|||", 1)
            label = int(label.strip())
            review = review.strip()

            if "train" in file_name:
                x_train.append(review)
                y_train.append(label)
            elif "dev" in file_name:
                x_dev.append(review)
                y_dev.append(label)
            elif "test" in file_name:
                x_test.append(review)
                y_test.append(label)
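
The parsing step in the loop above can be checked on a single line. The sketch below is a minimal, self-contained version; `parse_line` is a hypothetical helper name that does not appear in the notebook, and the sample line is taken from the training file preview.

```python
def parse_line(line):
    """Split one '<label> ||| <review>' line into (int label, review text)."""
    # maxsplit=1 keeps the review intact even if it contained '|||' itself
    label, review = line.split("|||", 1)
    return int(label.strip()), review.strip()


parsed = parse_line("1 ||| Yet the act is still charming here .")
print(parsed)  # (1, 'Yet the act is still charming here .')
```

Stripping the review only removes the space left over from the `|||` separator; the tokens themselves are unchanged.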

In [8]:
assert len(x_train) == len(y_train)
assert len(x_dev) == len(y_dev)
assert len(x_test) == len(y_test)

In [9]:
len(x_train), len(x_dev), len(x_test)

(8544, 1101, 2210)

## Write code to extract features

### Word lists from the professor's code

In [10]:
good_words_prof = ['love', 'good', 'nice', 'great', 'enjoy', 'enjoyed']
bad_words_prof = ['hate', 'bad', 'terrible', 'disappointing', 'sad', 'lost', 'angry']

In [14]:
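def extract_features(x, good_words, bad_words):
    """Count good and bad word occurrences in review x and add a constant bias feature."""
    features = {"good_words": 0.0, "bad_words": 0.0}
    # The reviews are already tokenized, so splitting on spaces yields one token per item
    words = x.split(' ')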

    for word in words:
        word = word.lower().strip()
        if word in good_words:
            features["good_words"] += 1.0
        if word in bad_words:
            features["bad_words"] += 1.0

    features["bias"] = 1.0

    return features
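
To see what the extractor returns, here is a small self-contained check. It repeats the function above in a compact form so that it runs on its own; the sample sentence is made up for illustration and is not from the dataset.

```python
good_words = ['love', 'good', 'nice', 'great', 'enjoy', 'enjoyed']
bad_words = ['hate', 'bad', 'terrible', 'disappointing', 'sad', 'lost', 'angry']


def extract_features(x, good_words, bad_words):
    """Compact equivalent of the extractor above: two counts plus a bias."""
    words = [w.lower().strip() for w in x.split(' ')]
    return {
        "good_words": float(sum(w in good_words for w in words)),
        "bad_words": float(sum(w in bad_words for w in words)),
        "bias": 1.0,
    }


# Hypothetical, pre-tokenized review (punctuation separated by spaces)
sample = "A great cast , but a terrible script ."
features = extract_features(sample, good_words, bad_words)
print(features)  # {'good_words': 1.0, 'bad_words': 1.0, 'bias': 1.0}
```

Matching is exact on lowercased tokens, so a word only counts if it appears in the list in that exact form (for example, `enjoyed` is listed but `enjoying` is not).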

## Set the initial feature weights

In [12]:
feature_weights = {"good_words": 1.0, "bad_words": -1.0, "bias": 0.5}

## Compute the score and label

In [17]:
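def get_label(x, good_words, bad_words):
    """Predict -1, 0, or 1 for review x from the sign of its weighted feature score."""
    features = extract_features(x, good_words, bad_words)

    # Weighted sum of the features, using the global feature_weights
    score = 0.0
    for key, value in features.items():
        score += feature_weights[key] * value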

    if score < 0:
        return -1
    elif score > 0:
        return 1
    else:
        return 0
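
With the weights above, the score works out to `good_words - bad_words + 0.5`. The following sketch evaluates that score for a few hand-picked counts; `score_from_counts` and `label_from_score` are hypothetical helper names introduced only for this illustration.

```python
feature_weights = {"good_words": 1.0, "bad_words": -1.0, "bias": 0.5}


def score_from_counts(n_good, n_bad):
    """Weighted score for a review with the given good/bad word counts."""
    features = {"good_words": float(n_good), "bad_words": float(n_bad), "bias": 1.0}
    return sum(feature_weights[key] * value for key, value in features.items())


def label_from_score(score):
    """Same thresholding as get_label: sign of the score, 0 only on an exact tie."""
    if score < 0:
        return -1
    if score > 0:
        return 1
    return 0


for n_good, n_bad in [(0, 0), (1, 0), (0, 1), (1, 1), (1, 2)]:
    s = score_from_counts(n_good, n_bad)
    print(n_good, n_bad, s, label_from_score(s))

# Labels reachable for any counts from 0 to 4
reachable = {label_from_score(score_from_counts(g, b)) for g in range(5) for b in range(5)}
print(reachable)
```

Because the counts are whole numbers and the bias contributes 0.5, the score is never exactly zero with these particular weights, so the neutral label `0` is never predicted. A review with no listed words, or with equally many good and bad words, is labeled positive.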

## Compute accuracy

In [20]:
def get_accuracy(x_arr, y_arr, good_words, bad_words):
    """Return the percentage of reviews in x_arr whose predicted label matches y_arr."""
    y_pred_arr = [get_label(x, good_words, bad_words) for x in x_arr]

    correct_predictions = sum(p == t for p, t in zip(y_pred_arr, y_arr))
    accuracy = (correct_predictions / len(y_arr)) * 100.00
    return accuracy
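
The accuracy computation can be verified by hand on a tiny example. This is a minimal sketch that works directly on label lists; `accuracy_percent` is a hypothetical name, and the labels are invented for illustration.

```python
def accuracy_percent(y_pred, y_true):
    """Percentage of positions where the prediction equals the true label."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(p == t for p, t in zip(y_pred, y_true))
    return correct / len(y_true) * 100.0


y_true = [1, 0, -1, 1]
y_pred = [1, 1, -1, -1]
acc = accuracy_percent(y_pred, y_true)
print(acc)  # 50.0
```

Two of the four predictions match, which gives 50%. The explicit checks guard against mismatched lists and against a division by zero on empty input.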

## Evaluate on the train and dev sets

In [21]:
train_accuracy = get_accuracy(x_train, y_train, good_words_prof, bad_words_prof)
dev_accuracy = get_accuracy(x_dev, y_dev, good_words_prof, bad_words_prof)

print(f"Train Accuracy: {train_accuracy:.2f}%")
print(f"Dev Accuracy: {dev_accuracy:.2f}%")

Train Accuracy: 43.56%
Dev Accuracy: 42.23%
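
One way to put these numbers in context is to compare them with a baseline that always predicts the most frequent training label. The sketch below shows the computation on invented toy labels only; `majority_baseline_accuracy` is a hypothetical helper, and no baseline figure for the actual dataset is claimed here.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_eval):
    """Accuracy (in percent) of always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    correct = sum(t == majority_label for t in y_eval)
    return majority_label, correct / len(y_eval) * 100.0


# Toy labels for illustration, not taken from the dataset
toy_train = [1, 1, -1, 0, 1, -1]
toy_dev = [1, -1, 0, 1]
label, baseline = majority_baseline_accuracy(toy_train, toy_dev)
print(label, baseline)  # 1 50.0
```

Calling it as `majority_baseline_accuracy(y_train, y_dev)` would give the corresponding figure for the real split.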
