## Problem: Sentiment Analysis

`Needless to say, I wasted my money.` is a **negative** sentence.

`Good case, Excellent value.` is a **positive** sentence.

**Word embeddings** represent words as numerical vectors. They are learned by training on billions of sentences, based on the contexts in which words appear. Below is a visualization of word embeddings after applying Principal Component Analysis (PCA) to them.
![Word Embeddings](https://miro.medium.com/max/1280/1*mWerYTuy9xH4SlRY9fFg1A.jpeg)
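
To make "numerical vectors" concrete, here is a minimal sketch with made-up 3-dimensional vectors (the fastText vectors loaded below have 300 dimensions). The words and values are purely illustrative assumptions. The point is only that words used in similar contexts tend to get vectors pointing in similar directions, which cosine similarity measures.

```python
import numpy as np

# Toy, hand-written embeddings (hypothetical values, not taken from any real model).
toy_embeddings = {
    "good": np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "terrible": np.array([-0.7, 0.3, 0.2]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, -1 opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

similar = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
dissimilar = cosine_similarity(toy_embeddings["good"], toy_embeddings["terrible"])
print(f"good vs great:    {similar:.3f}")
print(f"good vs terrible: {dissimilar:.3f}")
```

With these toy values the first similarity is close to 1 and the second is negative.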

## Before starting

In [1]:
import hnumpy as hnp
import io
import numpy as np
import pathlib
import pickle
import re
import timeit
import torch
import urllib.request
import zipfile

In [None]:
RAW_EMBEDDINGS_PATH = pathlib.Path("data/lstm/wiki-news-300d-1M.vec")
EMBEDDINGS_PATH = pathlib.Path("data/lstm/embeddings.pickle")

if not EMBEDDINGS_PATH.exists():
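    def load_word_embeddings(file):
        # The context manager closes the file even if parsing fails.
        with io.open(file, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
            # The first line holds the number of words and the vector dimension.
            n, d = map(int, fin.readline().split())
            embeddings = {
                "indices": {},
                "data": np.zeros((n, d)),
            }
            # Every following line is a word followed by its d vector components.
            for i, line in enumerate(fin):
                tokens = line.rstrip().split(' ')
                embeddings["indices"][tokens[0]] = i
                for j, value in enumerate(map(float, tokens[1:])):
                    embeddings["data"][i, j] = value

        return embeddings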

    if not RAW_EMBEDDINGS_PATH.exists():
        url = "https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip"
        extract_dir = pathlib.Path("data", "lstm")

        zip_path, _ = urllib.request.urlretrieve(url)
        with zipfile.ZipFile(zip_path, "r") as zip_file:
            zip_file.extractall(extract_dir)

    with open(EMBEDDINGS_PATH, "wb") as file:
        pickle.dump(load_word_embeddings(RAW_EMBEDDINGS_PATH), file)
else:
    print()

## Here we go

In [2]:
with open("data/lstm/embeddings.pickle", "rb") as file:
    embeddings = pickle.load(file)

def embed(tokens):
    # Look up the embedding row of each token; an unknown token raises KeyError.
    return embeddings["data"][[embeddings["indices"][token] for token in tokens], :]

In [3]:
index = embeddings["indices"]["encryption"]
print(index)

11196


In [4]:
embedding = embeddings["data"][index, :]
print(embedding)

[-5.000e-03 -1.992e-01 -4.550e-02  1.454e-01 -1.571e-01 -2.310e-02
  1.353e-01 -1.410e-01 -4.920e-02  1.590e-02  1.834e-01 -1.863e-01
  5.220e-02  4.200e-02 -6.340e-02  1.412e-01 -3.682e-01 -9.100e-02
 -2.147e-01  1.527e-01 -5.660e-01  4.370e-02 -1.274e-01  2.398e-01
  1.387e-01 -8.950e-02  1.634e-01 -1.001e-01  1.094e-01  4.300e-02
 -1.049e-01  1.742e-01 -1.222e-01 -2.710e-02 -1.227e-01  5.100e-02
 -9.210e-02  4.490e-02  6.960e-02  3.200e-02  4.250e-02 -2.030e-02
  1.830e-01  2.036e-01 -8.450e-02  6.350e-02 -6.380e-02  9.720e-02
  1.447e-01 -6.450e-02 -7.830e-02 -8.050e-02 -8.305e-01  1.572e-01
  3.028e-01 -3.560e-02  3.350e-01 -1.540e-01 -1.980e-02  1.585e-01
  1.698e-01 -1.545e-01  3.230e-02 -2.000e-03 -2.580e-01 -5.160e-02
  2.900e-02  8.570e-02  1.540e-02 -1.833e-01 -1.032e-01  1.003e-01
  6.960e-02  8.000e-04 -7.060e-02 -1.465e-01  1.549e-01 -1.060e-01
  1.040e-02  1.298e-01  1.740e-02 -1.158e-01 -1.030e-01 -1.747e-01
 -1.668e-01 -1.496e-01 -6.450e-02  2.748e-01  2.840e-01 -1.840

In [5]:
words_to_ignore = []
for word, index in embeddings["indices"].items():
    embedding = embeddings["data"][index, :]
    if embedding.min() < -1 or embedding.max() > 1:
        words_to_ignore.append(word)
for word in words_to_ignore:
    del embeddings["indices"][word]
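
The cell above drops every word whose vector has a component outside [-1, 1]. This presumably matches the `bounds=(-1, 1)` declared for the encrypted input when the model is compiled further down. A minimal sketch of the same check, vectorized over a toy matrix (the words and values are hypothetical):

```python
import numpy as np

# Hypothetical miniature vocabulary: word -> row index, plus a (3, 4) matrix of vectors.
toy = {
    "indices": {"fine": 0, "outlier": 1, "ok": 2},
    "data": np.array([
        [0.10, -0.20, 0.30, 0.00],
        [0.50, -1.50, 0.20, 0.10],   # one component is below -1
        [-1.00, 1.00, 0.00, 0.25],   # exactly on the bounds, so it is kept
    ]),
}

# True for every row whose components all lie inside [-1, 1].
in_range = np.all(np.abs(toy["data"]) <= 1, axis=1)

# Keep only the words whose row passed the check. The matrix itself is left
# untouched, as in the cell above, so the stored row indices remain valid.
toy["indices"] = {word: i for word, i in toy["indices"].items() if in_range[i]}
print(sorted(toy["indices"]))
```

Because only the index dictionary shrinks, a removed word later behaves like any other unknown word: looking it up raises `KeyError`.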

In [6]:
def encode(sentence):
    sentence = sentence.strip().lower()
    sentence = re.sub(r"[^\w\s]", ' ', sentence)
    sentence = re.sub(r"\s+", ' ', sentence)
    return embed(filter(lambda token: token != "", sentence.split(' ')))

In [7]:
encode("i like cookies").shape

(3, 300)
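
To see what `encode` feeds into the embedding lookup, the sketch below repeats its normalization steps but returns the tokens instead of their vectors. `tokenize` is a hypothetical helper introduced only for this illustration.

```python
import re

def tokenize(sentence):
    """Mirror the normalization steps of `encode`, but return the tokens themselves."""
    sentence = sentence.strip().lower()
    sentence = re.sub(r"[^\w\s]", " ", sentence)   # every punctuation character becomes a space
    sentence = re.sub(r"\s+", " ", sentence)       # runs of whitespace collapse to one space
    return [token for token in sentence.split(" ") if token != ""]

tokens = tokenize("Good case, Excellent value.")
print(tokens)  # ['good', 'case', 'excellent', 'value']
```

One consequence of replacing punctuation with spaces is that contractions are split: `tokenize("don't")` gives `['don', 't']`.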

In [8]:
DATASET_PATHS = ["data/lstm/amazon.txt", "data/lstm/imdb.txt", "data/lstm/yelp.txt"]

DATASET = []
for path in DATASET_PATHS:
    with open(path, "r") as file:
        for line in file:
            [line, orientation] = line.strip().split('\t')
            try:
                DATASET.append((encode(line), float(orientation)))
            except (KeyError, ValueError):
                # skip lines with a word that has no embedding or with a malformed label
                pass

In [9]:
print(DATASET[0])

(array([[-0.0154, -0.002 , -0.0725, ...,  0.1858,  0.105 , -0.0423],
       [-0.0424,  0.007 , -0.1028, ...,  0.2318, -0.01  ,  0.0948],
       [ 0.0156,  0.0752, -0.078 , ...,  0.0882, -0.0882, -0.0096],
       ...,
       [ 0.0242, -0.0265,  0.0822, ...,  0.0831, -0.0466,  0.0315],
       [ 0.0047,  0.0223, -0.0087, ...,  0.1479,  0.1324, -0.0318],
       [-0.0582, -0.1396, -0.045 , ...,  0.0076,  0.0079,  0.0541]]), 0.0)


In [10]:
print(len(DATASET))

2677


![LSTM](https://miro.medium.com/max/666/1*vnqygSyLIA3QVTe6teca4Q.png)
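
For reference, the LSTM cell can be written as follows, in the notation of `torch.nn.LSTM`. Here $x_t$ is the embedding of the $t$-th token, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the sigmoid function and $\odot$ the element-wise product. The same equations are implemented by hand in `lstm_cell` further down.

$$
\begin{aligned}
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The sentiment score is then obtained by applying a linear layer and a sigmoid to the hidden state after the last token.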

In [11]:
HIDDEN_SIZE = 100

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()

        self.lstm = torch.nn.LSTM(input_size=300, hidden_size=HIDDEN_SIZE)
        self.fc = torch.nn.Linear(HIDDEN_SIZE, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        _, (x, _) = self.lstm(x)
        x = self.fc(x)
        return self.sigmoid(x)

In [12]:
LEARNING_RATE = 0.001
EPOCHS = 10

model = Model()

optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = torch.nn.MSELoss()

model.train()
for i in range(EPOCHS):
    for sentence, score in DATASET:
        x = torch.tensor(sentence.reshape(-1, 1, 300))
        prediction = model(x.float())

        optimizer.zero_grad()
        loss = criterion(prediction, torch.tensor([[[score]]]).float())

        loss.backward()
        optimizer.step()

    print("Epoch", i + 1, "is completed...", flush=True)
model.eval()

Epoch 1 is completed...
Epoch 2 is completed...
Epoch 3 is completed...
Epoch 4 is completed...
Epoch 5 is completed...
Epoch 6 is completed...
Epoch 7 is completed...
Epoch 8 is completed...
Epoch 9 is completed...
Epoch 10 is completed...


Model(
  (lstm): LSTM(300, 100)
  (fc): Linear(in_features=100, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

In [13]:
class Inferer:
    def __init__(self, model):
        parameters = list(model.lstm.parameters())

        W_ii, W_if, W_ig, W_io = parameters[0].split(HIDDEN_SIZE)
        W_hi, W_hf, W_hg, W_ho = parameters[1].split(HIDDEN_SIZE)
        b_ii, b_if, b_ig, b_io = parameters[2].split(HIDDEN_SIZE)
        b_hi, b_hf, b_hg, b_ho = parameters[3].split(HIDDEN_SIZE)

        self.W_ii = W_ii.detach().numpy()
        self.b_ii = b_ii.detach().numpy()

        self.W_hi = W_hi.detach().numpy()
        self.b_hi = b_hi.detach().numpy()

        self.W_if = W_if.detach().numpy()
        self.b_if = b_if.detach().numpy()

        self.W_hf = W_hf.detach().numpy()
        self.b_hf = b_hf.detach().numpy()

        self.W_ig = W_ig.detach().numpy()
        self.b_ig = b_ig.detach().numpy()

        self.W_hg = W_hg.detach().numpy()
        self.b_hg = b_hg.detach().numpy()

        self.W_io = W_io.detach().numpy()
        self.b_io = b_io.detach().numpy()

        self.W_ho = W_ho.detach().numpy()
        self.b_ho = b_ho.detach().numpy()

        self.W = model.fc.weight.detach().numpy().T
        self.b = model.fc.bias.detach().numpy()

    def infer(self, x):
        x_t, h_t, c_t = None, np.zeros(HIDDEN_SIZE), np.zeros(HIDDEN_SIZE)
        for i in range(x.shape[0]):
            x_t = x[i]
            _, h_t, c_t = self.lstm_cell(x_t, h_t, c_t)

        r = np.dot(h_t, self.W) + self.b
        return self.sigmoid(r)

    def lstm_cell(self, x_t, h_tm1, c_tm1):
        i_t = self.sigmoid(
            np.dot(self.W_ii, x_t) + self.b_ii + np.dot(self.W_hi, h_tm1) + self.b_hi
        )
        f_t = self.sigmoid(
            np.dot(self.W_if, x_t) + self.b_if + np.dot(self.W_hf, h_tm1) + self.b_hf
        )
        g_t = np.tanh(
            np.dot(self.W_ig, x_t) + self.b_ig + np.dot(self.W_hg, h_tm1) + self.b_hg
        )
        o_t = self.sigmoid(
            np.dot(self.W_io, x_t) + self.b_io + np.dot(self.W_ho, h_tm1) + self.b_ho
        )

        c_t = f_t * c_tm1 + i_t * g_t
        h_t = o_t * np.tanh(c_t)

        return o_t, h_t, c_t

    @staticmethod
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))
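
`Inferer` relies on the way `torch.nn.LSTM` stores its parameters: the four gate matrices are stacked along the first axis in the order input, forget, cell, output, so cutting that axis into chunks of `HIDDEN_SIZE` rows recovers them. A minimal NumPy sketch of that layout with tiny, made-up sizes (all names and sizes here are illustrative assumptions):

```python
import numpy as np

hidden_size, input_size = 2, 3  # tiny sizes for illustration (the model above uses 100 and 300)

# Stand-in for the stacked input-to-hidden weights: shape (4 * hidden_size, input_size),
# with the gates stacked as i, f, g, o along the first axis.
stacked = np.arange(4 * hidden_size * input_size, dtype=float).reshape(4 * hidden_size, input_size)

# np.split with an integer cuts the first axis into that many equal chunks.
W_ii, W_if, W_ig, W_io = np.split(stacked, 4, axis=0)

print(W_ii.shape)
print(np.array_equal(W_if, stacked[hidden_size:2 * hidden_size]))
```

Note the difference in conventions: `torch.Tensor.split(HIDDEN_SIZE)` takes the size of each chunk, whereas `np.split(..., 4)` takes the number of chunks.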

In [14]:
SENTENCE_LENGTH_LIMIT = 5

inferer = Inferer(model)
homomorphic_inferer = hnp.compile_fhe(
    inferer.infer,
    {
        "x": hnp.encrypted_ndarray(bounds=(-1, 1), shape=(SENTENCE_LENGTH_LIMIT, 300))
    },
    config=hnp.config.CompilationConfig(
        parameter_optimizer="handselected",
        apply_topological_optimizations=True,
        probabilistic_bounds=6,
    ),
)

context = homomorphic_inferer.create_context()
keys = context.keygen()

operations = homomorphic_inferer.operation_count()
pbses = homomorphic_inferer.pbs_count()

print("\nTarget graph has", operations, "nodes and", pbses, "of them are PBS...")

2021-06-04 13:18:00.548 | INFO     | hnumpy.convert:compile_fhe:376 - Compiling infer into an FHE function
2021-06-04 13:18:00.549 | INFO     | hnumpy.convert:compile_fhe:378 - Checking input and output
2021-06-04 13:18:00.644 | INFO     | hnumpy.convert:compile_homomorphic:262 - Create target graph
2021-06-04 13:18:00.657 | INFO     | hnumpy.convert:compile_homomorphic:267 - Optimize target graph with optimizer `handselected`
2021-06-04 13:18:01.855 | INFO     | hnumpy.convert:compile_homomorphic:282 - Correct encoding
2021-06-04 13:18:01.866 | INFO     | hnumpy.convert:compile_homomorphic:285 - Create VM graph
2021-06-04 13:18:01.886 | INFO     | hnumpy.convert:compile_homomorphic:301 - Return the result to the caller
2021-06-04 13:18:01.894 | INFO     | hnumpy.client:keygen:28 - Creating 1 keyswitching key(s) and 1 bootstrapping key(s). This should take approximately 40 seconds (0.6666666666666666 minutes)
2021-06-04 13:18:21.958 | DEBUG    | hnumpy.client:keygen:42 - Key creation t

Target graph has 460 nodes and 91 of them are PBS...


In [17]:
def evaluate(sentence):
    try:
        embedded = encode(sentence)
    except KeyError as error:
        print("! the word", error, "is unknown")
        return

    if embedded.shape[0] > SENTENCE_LENGTH_LIMIT:
        print(f"! the sentence should not contain more than {SENTENCE_LENGTH_LIMIT} tokens")
        return

    padded = np.zeros((SENTENCE_LENGTH_LIMIT, 300))
    padded[SENTENCE_LENGTH_LIMIT - embedded.shape[0]:, :] = embedded

    original = model(torch.tensor(padded.reshape((-1, 1, 300))).float()).detach().numpy()[0, 0, 0]
    simulated = homomorphic_inferer.simulate(padded)[0]

    start = timeit.default_timer()

    actual = homomorphic_inferer.simulate(padded)[0]
    # to run actual homomorphic computation, comment the line above and uncomment the one below
    # actual = homomorphic_inferer.encrypt_and_run(keys, padded)[0]

    end = timeit.default_timer()

    if actual < 0.35:
        print("- the sentence was negative", end=' ')
    elif actual > 0.65:
        print("+ the sentence was positive", end=' ')
    else:
        print("~ the sentence was neutral", end=' ')

    print(
        f"("
        f"original: {original * 100:.2f}%, "
        f"simulated: {simulated * 100:.2f}%, "
        f"actual: {actual * 100:.2f}%, "
        f"difference: {np.abs(original - actual) * 100:.2f}%, "
        f"took: {end - start:.3f} seconds"
        f")"
    )
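
The compiled function expects a fixed input shape of `(SENTENCE_LENGTH_LIMIT, 300)`, so `evaluate` left-pads shorter sentences with zero vectors, which keeps the real tokens at the end of the sequence. A small sketch of that padding step with toy dimensions (`left_pad` is a hypothetical helper, not part of the notebook):

```python
import numpy as np

def left_pad(embedded, length_limit):
    """Place `embedded` at the bottom of a zero matrix with `length_limit` rows."""
    n_tokens, dim = embedded.shape
    if n_tokens > length_limit:
        raise ValueError(f"at most {length_limit} tokens are supported, got {n_tokens}")
    padded = np.zeros((length_limit, dim))
    padded[length_limit - n_tokens:, :] = embedded
    return padded

# Two "tokens" with 3-dimensional embeddings, padded to 5 rows.
embedded = np.array([[0.1, 0.2, 0.3],
                     [0.4, 0.5, 0.6]])
padded = left_pad(embedded, 5)
print(padded)
```

The zero rows are still fed through the LSTM, so a padded sentence is not guaranteed to score exactly like the unpadded one. `evaluate` computes the `original` value on the padded input as well, which keeps the comparison between the three numbers like for like.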

In [19]:
evaluate("Shipment was slow.")

- the sentence was negative (original: 0.01%, simulated: 0.01%, actual: 0.01%, difference: 0.01%, took: 31.229 seconds)


In [20]:
evaluate("This product is a disgrace!")

- the sentence was negative (original: 0.00%, simulated: 0.00%, actual: 0.01%, difference: 0.00%, took: 31.815 seconds)


In [22]:
evaluate("It become my new favourite!")

+ the sentence was positive (original: 99.47%, simulated: 99.19%, actual: 99.13%, difference: 0.34%, took: 31.160 seconds)


In [23]:
evaluate("This is perfect!")

+ the sentence was positive (original: 99.94%, simulated: 99.93%, actual: 99.53%, difference: 0.41%, took: 31.254 seconds)


In [25]:
evaluate("THIS IS A SCAM.")

- the sentence was negative (original: 0.56%, simulated: 2.72%, actual: 2.55%, difference: 1.99%, took: 31.205 seconds)


In [27]:
evaluate("Boring.")

- the sentence was negative (original: 0.08%, simulated: 0.22%, actual: 0.47%, difference: 0.40%, took: 31.208 seconds)


In [28]:
evaluate("Fun!")

+ the sentence was positive (original: 99.98%, simulated: 99.98%, actual: 99.85%, difference: 0.12%, took: 31.268 seconds)


In [37]:
evaluate("I would rather burn money.")

- the sentence was negative (original: 0.02%, simulated: 0.11%, actual: 0.54%, difference: 0.52%, took: 31.240 seconds)


In [40]:
evaluate("Great view.")

+ the sentence was positive (original: 99.80%, simulated: 99.59%, actual: 99.98%, difference: 0.18%, took: 31.208 seconds)


In [42]:
evaluate("I am happy.")

+ the sentence was positive (original: 99.82%, simulated: 99.79%, actual: 99.80%, difference: 0.02%, took: 31.801 seconds)


In [43]:
evaluate("What a lovely place!")

+ the sentence was positive (original: 99.94%, simulated: 99.91%, actual: 98.78%, difference: 1.16%, took: 31.210 seconds)


In [44]:
evaluate("It is bad, really bad!")

- the sentence was negative (original: 0.00%, simulated: 0.00%, actual: 0.00%, difference: 0.00%, took: 31.294 seconds)


In [58]:
evaluate("It is like a dream!")

+ the sentence was positive (original: 87.32%, simulated: 88.21%, actual: 76.75%, difference: 10.57%, took: 31.243 seconds)


In [69]:
evaluate("It was like a nightmare!")

- the sentence was negative (original: 0.00%, simulated: 0.01%, actual: 0.05%, difference: 0.05%, took: 31.302 seconds)
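
To summarize the runs above, the sketch below aggregates the `difference` column. The numbers are copied by hand from the 14 outputs in this notebook, so they describe these sentences only and say nothing about accuracy in general.

```python
import statistics

# "difference" values (in percentage points) copied from the 14 outputs above, in order.
differences = [0.01, 0.00, 0.34, 0.41, 1.99, 0.40, 0.12,
               0.52, 0.18, 0.02, 1.16, 0.00, 10.57, 0.05]

largest = max(differences)
median = statistics.median(differences)
below_one = sum(d < 1 for d in differences)

print("largest difference:", largest)
print("median difference: ", round(median, 2))
print("below 1 point:     ", below_one, "of", len(differences))
```

In every one of these runs the `original` and `actual` scores fall on the same side of the 35% / 65% thresholds, including the run with the largest difference ("It is like a dream!").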
