# Using the coefficients of a logistic regression model as a query vector

Rather than scoring every example in a dataset with a logistic regression classifier to find the highest-scoring cases, we show here how to use the model's coefficient vector as the query in a vector search. Because this approach is amenable to approximate vector search, it promises to scale far better than traditional model scoring.

Here we demonstrate this process on the '[IMDB sentiment](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)' dataset from Kaggle. This dataset is available from other locations as well (the original paper about it was from Stanford), but the Kaggle version is convenient because it is formatted as a CSV file that loads easily into a Pandas dataframe. However, you need to log into Kaggle to download it.

In [3]:
import pandas as pd
import numpy as np
import os
from zipfile import ZipFile
from sklearn.linear_model import LogisticRegressionCV
from sentence_transformers import SentenceTransformer

ST_MODEL = 'all-MiniLM-L6-v2'

file = ZipFile("data/archive.zip")
text_df = pd.read_csv(file.open("IMDB Dataset.csv"), encoding='utf8') # 'latin1'
text_df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
# Featurization takes several minutes without a GPU, so cache it.
featurized_data_file = 'IMDB_sentiment.parquet'

if os.path.exists(featurized_data_file):
    print('Load featurized data')
    text_df = pd.read_parquet(featurized_data_file)
else:
    print('Featurize data')
    sentxformer = SentenceTransformer(ST_MODEL)
    text_df['vector'] = sentxformer.encode(text_df['review'].values).tolist()
    text_df.to_parquet(featurized_data_file)

text_df

Load featurized data


Unnamed: 0,review,sentiment,vector
0,One of the other reviewers has mentioned that ...,positive,"[0.030099309980869293, 0.050417669117450714, -..."
1,A wonderful little production. <br /><br />The...,positive,"[-0.012201860547065735, 0.05196147784590721, -..."
2,I thought this was a wonderful way to spend ti...,positive,"[0.014258158393204212, -0.0791383758187294, 0...."
3,Basically there's a family where a little boy ...,negative,"[-0.041720371693372726, 0.010464085265994072, ..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[-0.03167529031634331, 0.00642420444637537, -0..."
...,...,...,...
49995,I thought this movie did a down right good job...,positive,"[-0.031536709517240524, -0.06321507692337036, ..."
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,"[-0.04839408025145531, -0.08552412688732147, 0..."
49997,I am a Catholic taught in parochial elementary...,negative,"[-0.03935423120856285, -0.002203170442953706, ..."
49998,I'm going to have to disagree with the previou...,negative,"[-0.01833348162472248, -0.026902485638856888, ..."


In [35]:
from sklearn.model_selection import train_test_split

# train, test = train_test_split(text_df, test_size=0.2)

train = text_df[0:25000].copy()
test = text_df[25000:].copy()

X_train = [v for v in train['vector']]
y_train = [ s == 'positive' for s in train['sentiment'] ]

X_test = [v for v in test['vector']]
y_test = [ s == 'positive' for s in test['sentiment'] ]

In [36]:
from collections import Counter
print('text_df', Counter(text_df['sentiment']).most_common())
print('train', Counter(train['sentiment']).most_common())
print('test', Counter(test['sentiment']).most_common())


text_df [('positive', 25000), ('negative', 25000)]
train [('negative', 12526), ('positive', 12474)]
test [('positive', 12526), ('negative', 12474)]


In [40]:
# clf = LogisticRegressionCV(cv=5, scoring='roc_auc', n_jobs=-1, max_iter=10000)

clf = LogisticRegressionCV(cv=5, n_jobs=-1, max_iter=10000)  # default scoring metric is 'accuracy'
# NOTE: it does not help to use more folds (which should give bigger training sets). I got about the same accuracy using cv=15.

clf.fit(X_train, y_train)


0.8171593772481998

# Find top-scoring sentences the old-fashioned way

Here we use the classifier object to generate a score for each example.

In [38]:
test['score'] = clf.predict_proba(X_test)[:,1]

test.sort_values('score', ascending=False).head(5)[['review', 'score', 'sentiment']]

Unnamed: 0,review,score,sentiment
36819,I LOVE this movie! Beautifully funny and utter...,0.999997,positive
40795,I so much enjoyed this little musical fantasy ...,0.999982,positive
47466,"In my opinion, the best movie ever. I love whe...",0.999977,positive
42797,"Great movie, great actors, great soundtrack! I...",0.999938,positive
40722,Definitely one of my favourite movies. The sto...,0.999938,positive


## How good is this model?

Compare these results with those reported by [Maas et al.](https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf), Table 2. They report accuracies from 87.80% for bag of words to 88.89% for the approach described in their paper.

In [41]:
# Accuracy on cross-validation
# Note that we use the default scoring method (accuracy) to compare to Maas et al.
get_mean_xval_score_for_binary_classifier = lambda clf: np.mean([np.max(v) for v in clf.scores_[True]])
get_mean_xval_score_for_binary_classifier(clf)

0.8171593772481998

In [55]:
# Accuracy on held-out test set.
get_accuracy = lambda predicted, observed: np.sum(predicted == observed)/len(observed)

get_accuracy(clf.predict(X_test), y_test)

0.8188

## Try other classifiers

In [56]:
from sklearn import svm

clf_svm = svm.SVC().fit(X_train, y_train)  # SVC(kernel='rbf') by default
get_accuracy(clf_svm.predict(X_test), y_test)

0.83568

In [62]:
clf_lsvm = svm.SVC(kernel='linear').fit(X_train, y_train)
get_accuracy(clf_lsvm.predict(X_test), y_test)

0.81712

In [57]:
from sklearn.ensemble import RandomForestClassifier

clf_rf = RandomForestClassifier().fit(X_train, y_train)
get_accuracy(clf_rf.predict(X_test), y_test)

0.76792

In [60]:
from sklearn.ensemble import GradientBoostingClassifier

clf_gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=3, random_state=0).fit(X_train, y_train)
get_accuracy(clf_gbc.predict(X_test), y_test) # max_depth=1: 0.76012, max_depth=2: 0.76092

0.75792

# Find top-scoring sentences by similarity to coefficient vector

This approach is more scalable because it works with approximate vector similarity methods. Here we compute the exact similarity against each example, but we could in principle use approximate similarity instead; we'll show that later.
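Why ranking by the dot product matches ranking by the classifier's score: logistic regression computes its probability as a sigmoid of `w·x + b`, and the sigmoid is strictly increasing, so both scores order the examples identically. The following is a minimal sketch of that argument on synthetic data. The vectors, coefficients, and intercept are made-up stand-ins, not the fitted model from this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function; strictly increasing in z."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 1000 unit-length "embeddings" of dimension 8,
# plus made-up coefficients and an intercept (NOT the fitted model).
X = rng.normal(size=(1000, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)
w = rng.normal(size=8)
b = -0.3

dot_scores = X @ w                        # what a vector search ranks by
probabilities = sigmoid(dot_scores + b)   # what a logistic model would report

# Sort descending by each score and compare the resulting orderings.
order_by_dot = np.argsort(-dot_scores, kind="stable")
order_by_prob = np.argsort(-probabilities, kind="stable")

print(np.array_equal(order_by_dot, order_by_prob))
```

The intercept shifts every score by the same amount, so it affects the probabilities but not the ranking. That is why the coefficient vector alone can serve as the query.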

In [38]:
coef_vector = clf.coef_[0]

test['similarity'] = [np.dot(coef_vector, v) for v in test['vector']]
test.sort_values('similarity', ascending=False).head(5)[['review', 'similarity', 'sentiment']]

Unnamed: 0,review,similarity
42797,"Great movie, great actors, great soundtrack! I...",7.961847
46615,I smiled through the whole film. The music is ...,7.870014
9857,I think this show is definitely the greatest s...,7.678084
22061,This is one of the most beautiful films I have...,7.627164
22967,"Beautiful and touching movie. Rich colors, gre...",7.492701
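The list comprehension above computes one dot product per row and then sorts the whole table. As a rough sketch, the same exact ranking can be written as a single matrix-vector product followed by a partial sort that orders only the top `k` candidates. `top_k_by_inner_product` is a hypothetical helper introduced here for illustration, and the toy vectors are made up.

```python
import numpy as np

def top_k_by_inner_product(vectors, query, k):
    """Return (indices, scores) of the k rows with the largest dot product."""
    vectors = np.asarray(vectors, dtype=np.float64)
    scores = vectors @ np.asarray(query, dtype=np.float64)
    k = min(k, len(scores))
    # argpartition puts the k largest scores first without a full sort;
    # only those k candidates are then sorted in descending order.
    candidates = np.argpartition(-scores, k - 1)[:k]
    ordered = candidates[np.argsort(-scores[candidates])]
    return ordered, scores[ordered]

# Toy data for illustration only.
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
query = np.array([1.0, 2.0])

indices, scores = top_k_by_inner_product(vectors, query, k=2)
print(indices, scores)
```

With the notebook's data, the call would look something like `top_k_by_inner_product(np.stack(test['vector'].to_numpy()), coef_vector, k=5)`, assuming every entry of `test['vector']` has the same length. This is still an exact search; an approximate index would replace the matrix product.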


# About the suspicious similarity between the dot product and the cosine similarity

It turns out that the sentence transformer model I'm using here ('all-MiniLM-L6-v2') produces unit vectors, as shown below:

In [39]:
get_vector_length = lambda v: np.sqrt( sum( np.multiply(v, v) ) )

vector_length = [get_vector_length(v) for v in text_df['vector']]
print(f"Vector lengths are all between {np.min(vector_length)} and {np.max(vector_length)}.")

Vector lengths are all between 0.9999998492123835 and 1.0000001540798946.


The vector of model coefficients is not a unit vector, but because the example vectors are, the dot product differs from the cosine similarity only by a constant factor (the Euclidean length of the coefficient vector). The ranking of the examples is therefore the same under either measure.

In [40]:
get_vector_length(coef_vector)  # not a unit vector

48.87573453999129
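The relationship described above can be checked numerically. The sketch below uses synthetic unit vectors and a made-up, non-unit coefficient vector (assumptions for illustration, not the notebook's data) to confirm that the dot product equals the cosine similarity multiplied by the length of the coefficient vector, and that both produce the same ranking.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: unit-length example vectors and a
# coefficient vector that is deliberately NOT unit length.
X = rng.normal(size=(500, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)
w = 5.0 * rng.normal(size=16)

w_length = np.linalg.norm(w)
dot = X @ w
cosine = dot / (np.linalg.norm(X, axis=1) * w_length)

# For unit-length rows, dot == ||w|| * cosine, so the rankings agree.
same_values = np.allclose(dot, w_length * cosine)
same_ranking = np.array_equal(np.argsort(-dot), np.argsort(-cosine))
print(same_values, same_ranking)
```

If the example vectors were not normalized, the two measures could rank examples differently, so this equivalence depends on the embedding model producing unit vectors.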

# Discussion

Vector similarity searching is very flexible. Here we show that we can use a vector of coefficients from a logistic regression model as a query vector. To appreciate the power of this approach, try to imagine how you would come up with a semantic vector that would cover most of the positive movie reviews in this dataset!

Accelerated vector searching algorithms should work for this type of query just as well as they work for other vector queries. Stay tuned for our demonstration of using this approach to search a big vector database in (more or less) constant time.