# IMDB sentiment with Vowpal Wabbit
An attempt to train a model on the largest available dataset: [IMDb Largest Review Dataset by Enam Biswas](https://www.kaggle.com/ebiswas/imdb-review-dataset)

My preprocessing of the dataset can be found in [this kernel](https://www.kaggle.com/andrii0yerko/preprocessing-for-vowpal-wabbit-sentiment-analysis)

The model is a linear classifier over a bag of words, which is easy to implement with VW.
The main benefit of VW is that it works out-of-core: we don't need to load the whole dataset into RAM or build a bag-of-words dictionary, which can be too large when RAM is limited. Instead, VW reads the dataset line by line and builds the bag of words implicitly using the hashing trick.
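
To make the hashing trick concrete, here is a minimal sketch of the idea. It is not VW's actual implementation: `hashed_bag_of_words` is a hypothetical helper, and `zlib.crc32` is only a stand-in for the hash function VW uses internally. The `bits` argument plays the role of `--bit_precision`.

```python
import zlib

def hashed_bag_of_words(text, bits=4):
    """Map tokens to a fixed-size count vector without building a vocabulary."""
    size = 1 << bits  # 2**bits buckets, analogous to VW's --bit_precision
    vector = [0] * size
    for token in text.split():
        # crc32 is a stand-in hash for illustration only (assumption)
        index = zlib.crc32(token.encode("utf-8")) & (size - 1)
        vector[index] += 1
    return vector

vec = hashed_bag_of_words("great movie great cast", bits=4)
print(len(vec), sum(vec))  # 16 4
```

The vector length depends only on `bits`, never on the number of distinct words, which is why no dictionary has to be kept in memory. The price is that different words can collide in the same bucket when `bits` is small.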

## Environment preparation

In [None]:
%%capture
!pip install bs4 --quiet
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
tqdm.pandas()

!pip install pandarallel
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from zipfile import ZipFile
import os
import re

In [None]:
TEMP = '/kaggle/temp'
try:
    os.mkdir(TEMP)
    print(f'{TEMP} created')
except FileExistsError:
    print(f'{TEMP} already exists')

In [None]:
%%capture
# install the latest VW version
!git clone --recursive https://github.com/VowpalWabbit/vowpal_wabbit.git $TEMP/vowpal_wabbit
!cd $TEMP/vowpal_wabbit/; make 
!cd $TEMP/vowpal_wabbit/; make install

Unzip and load the test data and the original training data; the latter will be used for validation.

In [None]:
directory = '/kaggle/input/word2vec-nlp-tutorial/'
for file in os.listdir(directory):
    if file.split('.')[-1] != 'zip':
        continue
    with ZipFile(directory+file, 'r') as archive:
        archive.extractall()

val_df = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
test_df = pd.read_csv("testData.tsv", header=0, delimiter="\t", quoting=3)
val_df.shape, test_df.shape

In [None]:
test_df.head()

## Test & validation sets preprocessing
Validation and test sets must be preprocessed in the same way as a training one.

In [None]:
stops = set(stopwords.words("english"))
stemmer = PorterStemmer()

# the same as in preprocessing notebook
def preprocess_review(raw_review):
    # Remove HTML
    review_text = BeautifulSoup(raw_review,).get_text()
    # Remove URLs
    review_text = re.sub(r"https?://[\w+./]+", " ", review_text)
    # Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    # Convert to lower case, split into individual words
    words = letters_only.lower().split()
    # Remove stop words (and stem others if needed)
    meaningful_words = [stemmer.stem(w) for w in words if w not in stops]

    return " ".join(meaningful_words)

In [None]:
%%time
val_df['review'] = val_df['review'].parallel_apply(preprocess_review)
val_df['sentiment'] = val_df['sentiment'].replace(0, -1)
test_df['review'] = test_df['review'].parallel_apply(preprocess_review)

Save in the Vowpal Wabbit format

In [None]:
np.savetxt("val.vw", val_df[['sentiment', 'review']], delimiter=' |text ', fmt='%s')  # with labels
np.savetxt('test.vw', '|text ' + test_df['review'], fmt='%s')
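
For reference, this is the line layout the `np.savetxt` calls above produce: an optional label, then a `|text` namespace followed by space-separated tokens. The helper `to_vw_line` below is hypothetical and only illustrates the format. It assumes the review has already been cleaned to letters and spaces, so it contains no `|` or `:` characters, which have special meaning in VW input.

```python
def to_vw_line(review, label=None, namespace="text"):
    """Format one already-cleaned review as a VW input line."""
    features = f"|{namespace} {review}"
    # Unlabeled lines (test set) start directly with the namespace
    return features if label is None else f"{label} {features}"

print(to_vw_line("great movi", 1))   # 1 |text great movi
print(to_vw_line("bad plot", -1))    # -1 |text bad plot
print(to_vw_line("bad plot"))        # |text bad plot
```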

Let's respect the rules and drop the test lines from the training data.

The same will be done for the validation set to prevent data leakage.

In [None]:
# the labels that the test reviews carry in the training data are unknown, so lines
# with both possible labels are generated and matched as fixed strings,
# which seems much faster than applying a "^-?1" regex
np.savetxt(TEMP+'/test_pattern0', '-1 |text ' + test_df['review'], fmt='%s')
np.savetxt(TEMP+'/test_pattern1', '1 |text ' + test_df['review'], fmt='%s')
!cat $TEMP/test_pattern0 $TEMP/test_pattern1 >$TEMP/test_pattern

In [None]:
%%time
INPUT_PATH = "/kaggle/input/preprocessing-for-vowpal-wabbit-sentiment-analysis/train.vw"
!wc -l $INPUT_PATH
# drop the train lines that appear in test_pattern
!grep -Fvxf $TEMP/test_pattern $INPUT_PATH >$TEMP/temp.vw
!wc -l $TEMP/temp.vw
!grep -Fvxf val.vw $TEMP/temp.vw >train.vw 
!wc -l train.vw 
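
The `grep` flags used above are `-F` (fixed strings, no regex), `-x` (match whole lines only), `-v` (invert, keep non-matching lines) and `-f` (read patterns from a file). On a toy scale the same filtering can be sketched in Python. The function name and the sample lines are made up for illustration, and unlike `grep` this version holds the patterns in memory as a set.

```python
def drop_exact_lines(train_lines, pattern_lines):
    """Keep only the lines that do not exactly equal any pattern line."""
    patterns = set(pattern_lines)
    return [line for line in train_lines if line not in patterns]

train = [
    "1 |text good film",
    "-1 |text bad film",
    "1 |text good film extend",  # only a prefix match, so it is kept
]
patterns = ["1 |text good film", "-1 |text good film"]

kept = drop_exact_lines(train, patterns)
print(kept)  # ['-1 |text bad film', '1 |text good film extend']
```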

# Vowpal Wabbit
Let's start with an SVM model (hinge loss), bigrams, and a 22-bit hash.

In [None]:
%%time
!vw --data=train.vw \
    --ngram=2 \
    --bit_precision=22 \
    --loss_function=hinge \
    --final_regressor=model.vw

In [None]:
%%time
# predict
!vw --initial_regressor=model.vw \
    --testonly \
    --data=val.vw \
    --ngram=2 \
    --binary \
    --predictions=pred.txt \
    --raw_predictions=pred_margins.txt

In [None]:
from sklearn.metrics import classification_report
y_pred = np.loadtxt('pred.txt', dtype='int')
y_true = val_df['sentiment']
print(classification_report(y_true, y_pred, digits=4))

In [None]:
from sklearn.metrics import hinge_loss, roc_auc_score
raw = np.loadtxt('pred_margins.txt')
print(f'Hinge loss: {hinge_loss(y_true, raw)}')
print(f'ROC AUC: {roc_auc_score(y_true, raw)}')
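
As a sanity check on what the hinge number means, here is a hand-rolled version of the loss on toy data. For labels in {-1, 1} it is the mean of `max(0, 1 - y * margin)`, which, as far as I understand, is what `sklearn.metrics.hinge_loss` computes in the binary case. The labels and margins below are invented for illustration.

```python
def mean_hinge_loss(labels, margins):
    """Average hinge loss for labels in {-1, 1} and raw decision margins."""
    losses = [max(0.0, 1.0 - y * m) for y, m in zip(labels, margins)]
    return sum(losses) / len(losses)

labels = [1, -1, 1, -1]
margins = [2.0, -0.5, -0.25, 1.0]

# per-example losses: 0.0, 0.5, 1.25, 2.0
print(mean_hinge_loss(labels, margins))  # 0.9375
```

Note that the second example is classified correctly but still contributes 0.5, because its margin is inside the (-1, 1) band.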

## Hyperparameter tuning
Let's explore how the hash dimension and the n-gram orders affect model quality.

In [None]:
from sklearn.metrics import accuracy_score, f1_score, hinge_loss, roc_auc_score

def vw_svm_gridsearch_scores(search_params, additional_params=""):
    '''
    Fits a VW SVM model (hinge loss) for each element of the search space
    and returns the accuracy, F1, hinge loss and ROC AUC of each fit.
    
    search_params: 1d iterable
        list of vw param strings to be tried
        E.g. ["--l2=0.1", "--l2=1", "--l2=10"]
    additional_params: string
        additional parameters to be applied to each fit
        E.g. "--bit_precision=26 --ngram=2"
    '''
    acc_list, f1_list, loss_list, auc_list = [], [], [], []
    
    for param in tqdm(search_params):
        # yeah, cursed
        # fit
        !vw --data=train.vw \
            $param $additional_params \
            --loss_function=hinge \
            --quiet \
            --final_regressor=model.vw

        # predict
        !vw --initial_regressor=model.vw \
            --testonly \
            $param $additional_params \
            --data=val.vw \
            --binary \
            --quiet \
            --predictions=val.pred \
            --raw_predictions=val_raw.pred

        y_pred = np.loadtxt('val.pred', dtype='int')
        raw = np.loadtxt('val_raw.pred')
        acc_list.append(accuracy_score(y_true, y_pred))
        f1_list.append(f1_score(y_true, y_pred))
        loss_list.append(hinge_loss(y_true, raw))
        auc_list.append(roc_auc_score(y_true, raw))
    return {
        'accuracy': acc_list,
        'f1': f1_list,
        'hinge': loss_list,
        'roc_auc': auc_list
    }

In [None]:
def plot_scores(scores, ticks=None):
    if ticks is None:
        ticks = range(len(scores['roc_auc']))
    fig, ax = plt.subplots(1, 3, figsize=(14, 5))
    ax[0].plot(ticks, scores['roc_auc'], "o-")
    ax[0].set_title('AUC')
    ax[1].plot(ticks, scores['f1'], "o-")
    ax[1].set_title('F1')
    ax[2].plot(ticks, scores['hinge'], "o-")
    ax[2].set_title('Hinge')

First, let's get some intuition for how the chosen hyperparameters affect the resulting score.

In [None]:
hashdims = [f"--bit_precision={i}" for i in range(18, 30, 2)]
scores = vw_svm_gridsearch_scores(hashdims, "--ngram=2")
plot_scores(scores)

In [None]:
hashdims = [f"--bit_precision={i}" for i in range(18, 30, 2)]
scores = vw_svm_gridsearch_scores(hashdims, "--ngram=2 --ngram=3 --ngram=4")
plot_scores(scores, range(18, 30, 2))

In [None]:
a = [f'--ngram={i}' for i in range(2, 7)]
ngrams = [" ".join(a[:i]) for i in range(len(a)) ]
scores = vw_svm_gridsearch_scores(ngrams, "--bit_precision=26 --binary")
plot_scores(scores, range(1, 6))

Searching for the best hashdim and ngrams combination

In [None]:
from itertools import product
a = [f'--ngram={i}' for i in range(2, 6)]
ngrams = [" ".join(a[:i]) for i in range(len(a))]
hashdims = [f"--bit_precision={i}" for i in range(24, 30, 2)]
search_space = [" ".join(i) for i in product(hashdims, ngrams)]
scores = vw_svm_gridsearch_scores(search_space, "--binary")
plot_scores(scores)
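
It may help to see what the search space built above actually contains. The sketch below repeats the same construction under different variable names (`ngram_flags`, `ngram_sets`, `demo_space` are just illustrative) and prints a few entries. Note that the first n-gram set is the empty string, i.e. unigrams only.

```python
from itertools import product

ngram_flags = [f"--ngram={n}" for n in range(2, 6)]
# Prefixes of the flag list: "", "--ngram=2", "--ngram=2 --ngram=3", ...
ngram_sets = [" ".join(ngram_flags[:i]) for i in range(len(ngram_flags))]
hashdims = [f"--bit_precision={b}" for b in range(24, 30, 2)]
demo_space = [" ".join(pair) for pair in product(hashdims, ngram_sets)]

print(len(demo_space))      # 12
print(repr(demo_space[0]))  # '--bit_precision=24 '
print(repr(demo_space[3]))  # '--bit_precision=24 --ngram=2 --ngram=3 --ngram=4'
```

So the grid covers 3 hash sizes times 4 n-gram settings, and the highest order tried is 4, since the prefix slices never include the last flag.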

In [None]:
best_idx = np.argmin(scores['hinge'])  # index of the lowest validation hinge loss
best_params = search_space[best_idx]
best_params

### Regularization
Regularization doesn't help on such sparse data, so I won't try it here. If you want to verify this yourself, you can run the following snippet, based on `vw-hyperopt.py` (shipped with VW).

In [None]:
# !python $TEMP/vowpal_wabbit/utl/vw-hyperopt.py \
#     --train=train.vw \
#     --holdout=val.vw \
#     --outer_loss_function=hinge \
#     --vw_space="--l2=1e-8..1e-2~LO --l1=1e-8..1e-2~LO" \
#     --additional_cmd="--binary --bit_precision=28 --ngram=2 --ngram=3 --loss_function=hinge --quiet" \
#     --max_eval=10

In [None]:
# !tail log.log -n 9

# Final model

In [None]:
%%time
!vw --data=train.vw \
     $best_params \
    --loss_function=hinge \
    --final_regressor=model.vw

# predict
!vw --initial_regressor=model.vw \
    --testonly \
    --data=val.vw \
     $best_params \
    --binary \
    --predictions=val.pred \
    --raw_predictions=val_raw.pred

In [None]:
y_true = val_df['sentiment']
y_pred = np.loadtxt('val.pred', dtype='int')
raw = np.loadtxt('val_raw.pred')
print(classification_report(y_true, y_pred, digits=4))

print(f'Hinge loss: {hinge_loss(y_true, raw)}')
print(f'ROC AUC: {roc_auc_score(y_true, raw)}')

## Making a submission

In [None]:
!vw --initial_regressor=model.vw \
    --testonly \
    --data=test.vw \
     $best_params \
    --binary \
    --quiet \
    --raw_predictions=test.pred

margins = np.loadtxt("test.pred")
test_df['sentiment'] = margins  # raw margins, not hard labels
test_df[['id','sentiment']].to_csv("submission.csv", index=False, quoting=3) # scored 0.97240 (the target metric is ROC AUC)
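
The submission above contains raw margins rather than probabilities or hard labels. That is fine for a ranking metric: ROC AUC depends only on the order of the scores, so any strictly increasing transform (for example a sigmoid) leaves it unchanged. A small pure-Python sketch on invented data, with a hypothetical pairwise `roc_auc` helper rather than the sklearn one:

```python
import math

def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, -1, 1, -1]
margins = [2.0, -0.5, -0.25, 1.0]
squashed = [1.0 / (1.0 + math.exp(-m)) for m in margins]  # sigmoid of each margin

print(roc_auc(labels, margins))   # 0.75
print(roc_auc(labels, squashed))  # 0.75
```

Thresholding the margins into -1/1 before submitting would throw away this ordering information and typically lowers the AUC.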

In [None]:
test_df

In [None]:
!rm *.log
!rm *.pred
!rm *.json
!rm *.cache
!rm *.txt
!rm *.tsv

# Ways for improvement
- Tuning of the convergence hyperparameters.
- Setting log loss as the objective is usually a good idea when ROC AUC is the target metric. This should be tried.
- Better preprocessing. What about lemmatization instead of stemming? Maybe we should choose higher and lower rating thresholds for the positive and negative classes than the ones used by the original dataset creators.
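
For the log loss idea, the sketch below compares the two objectives on a few margins. In VW the switch would be `--loss_function=logistic`; everything else here (function names, sample margins) is illustrative only and is not taken from any experiment in this notebook.

```python
import math

def hinge(y, margin):
    """Hinge loss: zero once the example is beyond the margin."""
    return max(0.0, 1.0 - y * margin)

def logistic(y, margin):
    """Logistic (log) loss for y in {-1, 1}: always positive, decays smoothly."""
    return math.log1p(math.exp(-y * margin))

def sigmoid(margin):
    """Maps a logistic-model margin to a probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-margin))

for m in (-2.0, 0.0, 0.5, 2.0):
    print(f"margin={m:+.1f}  hinge={hinge(1, m):.3f}  "
          f"logistic={logistic(1, m):.3f}  p={sigmoid(m):.3f}")
```

Hinge loss stops caring about an example once its margin exceeds 1, while logistic loss keeps pushing confident examples apart, which tends to give a smoother ranking of scores.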