# Reverse Recommendation for Anomaly Detection with Qdrant
Have you ever wondered how fraud detection systems spot suspicious transactions in a sea of normal ones? Detecting fraudulent transactions is essential to safeguarding customers' interests. In this blog, I'll explain how reverse recommendation can help: we use Qdrant to build a vector store of a customer's normal transactions and detect anomalies based on how "off" a new transaction is from that customer's usual patterns.

The system uses Qdrant's similarity search to check how far a new transaction lies from the normal transactions in vector space. If it is far away, it is flagged as fraudulent. We also use a RAG-based approach so that the system not only detects fraudulent transactions but also explains why a transaction was flagged, so that customer support can take the necessary action. Let's look at how to build this system in detail.

Why use reverse recommendations for anomaly detection? Simple: they help spot outliers by comparing new transactions against a baseline of normal behavior. A regular recommendation looks for the most similar entries; a reverse recommendation looks for the most dissimilar ones, which is exactly what we need to identify abnormal behavior in the data.
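To make the idea concrete, here is a minimal, library-free sketch: instead of ranking candidates by the highest cosine similarity to a baseline, it ranks them by the lowest. The toy vectors and the helper names (`cosine_similarity`, `most_dissimilar`) are assumptions made for illustration; they are not part of Qdrant or of this post's `utils` module.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_dissimilar(baseline, candidates, k=1):
    """Hypothetical helper: indices of the k candidates least similar to the baseline."""
    scored = [(cosine_similarity(baseline, c), i) for i, c in enumerate(candidates)]
    scored.sort()  # ascending: lowest similarity first
    return [i for _, i in scored[:k]]


# Toy 2-D "embeddings": two candidates point roughly along the baseline,
# one points in a very different direction.
baseline = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.2]]
outlier_indices = most_dissimilar(baseline, candidates, k=1)
```

In the rest of this post the same comparison is done by Qdrant over real transaction embeddings rather than by hand.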

In [1]:
import sys
sys.path.append('..')

In [2]:
import time
from tqdm import tqdm

from utils import (
    convert_transaction_data_to_str,
    get_user_basic_info,
    get_transactional_data,
    embed_transaction,
    insert_transaction,
    get_as_close_transactions,
    load_data,
    detect_anomalies
)

Let's start by loading the transaction dataset. The load_data function loads the original dataset, picks 5 random users, and extracts all the transactions for those 5 users. It also does some preprocessing: converting the dob to age, spelling out the gender of the user, combining the first and last names, and removing the 'fraud_' prefix from the merchant column. These steps make the data more readable in human terms.
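The body of load_data is not shown in this post, so the following is only a sketch of what some of these preprocessing steps might look like. The ISO date format for dob and the exact helper implementations are assumptions.

```python
from datetime import date


def dob_to_age(dob, today=None):
    """Convert an ISO 'YYYY-MM-DD' date of birth (assumed format) to an age in years."""
    today = today or date.today()
    born = date.fromisoformat(dob)
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


def clean_merchant(merchant, prefix="fraud_"):
    """Strip the 'fraud_' prefix from a merchant name if it is present."""
    return merchant[len(prefix):] if merchant.startswith(prefix) else merchant


def full_name(first, last):
    """Combine first and last names into a single readable field."""
    return f"{first} {last}"
```

On a DataFrame, helpers like these would typically be applied with `.map`. Working on an explicit `.copy()` of the filtered frame is the usual way to avoid the pandas "A value is trying to be set on a copy of a slice" warning.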

Since each user has a different pattern of transactions, we will build a separate vector store for each user and find anomalies based on how far away a new transaction is from the normal transactions of that particular user. In this example, we're extracting transactions for 5 random users. You can extend this to all users in your dataset.

In [3]:
root = '/home/quamer23nasim38/reverse-recommendation-for-anomaly-detection/'
data_path = 'data/fraudTrain.csv'

In [4]:
data, random_cc_num = load_data(root, data_path)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['age'] = data['dob'].map(dob_to_age)

Let's now load the embedding model that will be used later to convert the transaction data into vector embeddings.

In [5]:
from transformers import AutoTokenizer, AutoModel

# Load the pre-trained model
embedding_model_id = "BAAI/bge-small-en"
tokenizer = AutoTokenizer.from_pretrained(embedding_model_id)
model = AutoModel.from_pretrained(embedding_model_id)
model.eval()
print("Model loaded successfully")

Model loaded successfully


We will store only genuine transactions in the vector store, so we start by filtering out the fraudulent ones. For now we'll build a vector store for just one user: the first of the 5 extracted users with more than 1500 genuine transactions. You can extend this to all 5 users by running everything in a loop.

In [7]:
for user in random_cc_num:
    # Get the data for the user
    user_data = data[data['cc_num'] == user]
    # Filter out the fraud transactions
    user_data = user_data[user_data['is_fraud'] == 0]
    if user_data.shape[0] > 1500:
        # stop at the first user with more than 1500 genuine transactions;
        # this cutoff can be changed as required
        break

Customer Information Loaded Successfully for 213161869125933


Let's now create a new collection in Qdrant to store the vector embeddings of the genuine transactions. In this blog we keep the embeddings in memory, but Qdrant can also keep them in persistent storage. The vector size of 384 matches the output dimension of the bge-small-en model loaded above.

In [None]:
from qdrant_client import QdrantClient, models

# Initialize in-memory Qdrant client
client = QdrantClient(":memory:")

# Create a collection in Qdrant for storing transaction embeddings
client.create_collection(
    collection_name="transactions",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE)
)

Now we convert each transaction into a string, embed that string with the embedding model, and store the resulting vector in the Qdrant collection.

The dataset gives us the latitude and longitude of the merchant and of the payment address. We convert these coordinates into addresses using a reverse geocoding API.

Choosing the right columns in this step is very important. We keep only the columns that describe the transaction itself: the amount, the merchant, the payment address, and the merchant address. All of this information is needed to tell abnormal transactions apart from the user's normal pattern. We leave out the customer information because it doesn't change from one transaction to the next, and including it would artificially increase the similarity between transactions, which is not desirable.
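As a rough illustration of the string-building step, the sketch below produces a description with one section each for the amount, the merchant, the payment address, and the merchant address, separated by dashed lines. The function name `format_transaction`, its parameters, and the shortened address fields are hypothetical; the real convert_transaction_data_to_str in utils may differ.

```python
SEPARATOR = "-" * 23  # dashed divider between sections


def format_transaction(amount, merchant, category, payment_address, merchant_address):
    """Hypothetical formatter: build a plain-text description of one transaction.

    Customer fields are deliberately left out, since they are identical
    across all of a user's transactions.
    """
    sections = [
        str(amount),
        f"{merchant}; {category}",
        "; ".join(str(part) for part in payment_address),
        "; ".join(str(part) for part in merchant_address),
    ]
    return "\n" + f"\n{SEPARATOR}\n".join(sections) + "\n"


# Illustrative values only; the address lists are shortened for readability.
description = format_transaction(
    amount=63.57,
    merchant="Kling Inc",
    category="gas_transport",
    payment_address=["East Andover", "Maine", "US", 44.6084, -70.6993],
    merchant_address=["West Paris", "Maine", "US", 44.369583, -70.512889],
)
```

Keeping the text compact and limited to transaction-specific fields means that differences in amount, merchant, or location account for most of the difference between two embeddings.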

In [8]:
for idx, (_, transaction) in tqdm(enumerate(user_data.iterrows()), total=len(user_data)):
    # get the transactional information for a particular transaction
    transaction_information, merchant_information, payment_address, merchant_address = get_transactional_data(transaction, convert_coordinates_to_address=True)
    # convert the transaction information to string
    transaction_description = convert_transaction_data_to_str(transaction_information, merchant_information, payment_address, merchant_address)
    # embed the transaction description
    embedding = embed_transaction(transaction_description, model, tokenizer)
    embedding = embedding[0].tolist()
    # upload the transaction embedding and data to the qdrant client
    insert_transaction(embedding, transaction_description, idx, client)
    time.sleep(1)

    # stop after idx 200, i.e. once 201 transactions have been inserted,
    # to keep this demo short
    if idx == 200:
        break

 13%|█▎        | 200/1549 [09:49<1:06:16,  2.95s/it]


Now let's test the system with a new transaction that we created ourselves, which we hope it will flag as fraudulent. We embed the transaction and search the Qdrant collection for its nearest neighbors, extracting the top 10 most similar transactions.

The idea is that if the new transaction is far from the user's normal transactions in vector space, it is flagged as fraudulent. We measure this against a threshold on the mean cosine similarity of the retrieved transactions: if the mean is greater than 95%, we treat the transaction as genuine; if it is less than 95%, we treat it as fraudulent. The threshold can be tuned to the business requirements.
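The thresholding rule described above can be sketched in a few lines. The detect_anomalies function from utils is not shown in this post, so `is_anomalous` below is a hypothetical stand-in that only mirrors the rule as stated; the scores are made-up values, and treating a mean of exactly 95% as genuine is an assumption.

```python
from statistics import mean


def is_anomalous(similarity_scores, threshold=0.95):
    """Hypothetical check: flag a transaction whose mean similarity to its
    nearest genuine neighbors falls below the threshold."""
    if not similarity_scores:
        raise ValueError("need at least one similarity score")
    return mean(similarity_scores) < threshold


# Made-up scores for two hypothetical transactions.
looks_fraudulent = is_anomalous([0.89, 0.88, 0.90])  # mean is 0.89
looks_genuine = is_anomalous([0.97, 0.96, 0.98])     # mean is 0.97
```

Averaging over the top 10 neighbors rather than using only the single best match makes the decision less sensitive to one coincidentally similar transaction.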

In [9]:
new_transaction_info = '''
420000.54
-----------------------
Rajesh, Kumar; savings_account
-----------------------
Chandini Chowk; Delhi; India; 20.0583; 16.008
-----------------------
Vietnaam; 20.152538; 16.227746
'''

In [10]:
# Embed the new transaction information
new_embedding = embed_transaction(new_transaction_info, model, tokenizer)
results = get_as_close_transactions(new_embedding, client)

The mean similarity score between the new transaction and the genuine transactions is about 88%, below the 95% we require. Hence we flag this transaction as fraudulent.

In [12]:
results

[ScoredPoint(id=101, version=0, score=0.8875640384450292, payload={'transaction_data': '\n5.19\n-----------------------\nSchumm PLC; shopping_net\n-----------------------\nUnited States; East Andover; US; 04216; East Andover; Oxford County; house; N; place; 216; Farmers Hill Road; house; Maine; 44.6084; -70.6993\n-----------------------\nN; United States; natural; US; peak; 03812; Sawyer Rock; locality; 44.073571; -71.313451\n'}, vector=None, shard_key=None, order_value=None),
 ScoredPoint(id=53, version=0, score=0.886838667727958, payload={'transaction_data': '\n63.57\n-----------------------\nKling Inc; gas_transport\n-----------------------\nUnited States; East Andover; US; 04216; East Andover; Oxford County; house; N; place; 216; Farmers Hill Road; house; Maine; 44.6084; -70.6993\n-----------------------\nUnited States; West Paris; US; 04289; Oxford County; house; N; place; 20; Littlehale Road; house; Maine; 44.369583; -70.512889\n'}, vector=None, shard_key=None, order_value=None),

In [11]:
if detect_anomalies(results):
    print("The new transaction is fraudulent")
else:
    print("The new transaction is genuine")

The new transaction is fraudulent


That's great: our system detected the fraudulent transaction. But how do we explain why it was flagged? In the next section, we will use a RAG-based approach to explain why a transaction is flagged as fraudulent, so that customer support can take the necessary action.