# Combining Reverse Recommendation and RAG for Anomaly Detection
Now that we have seen how to build a reverse recommendation system based on similarity scores, let's combine it with a RAG model so that we not only detect anomalies but also explain why they are anomalies.

Everything up to storing the embeddings stays the same as in the previous reverse recommendation approach: we load the data, pick random users, convert the transactional data to text descriptions, embed the descriptions, and store the embeddings in the Qdrant database. The only difference is that we also load the customer's basic information and registered address, which will be part of the context given to the RAG model.

In [1]:
import sys
sys.path.append('..')

In [2]:
import time
from tqdm import tqdm

from utils import (
    convert_transaction_data_to_str,
    get_user_basic_info,
    get_transactional_data,
    embed_transaction,
    insert_transaction,
    get_context_for_anomaly_detection,
    load_data,
)

In [3]:
root = '/home/quamer23nasim38/reverse-recommendation-for-anomaly-detection/'
data_path = 'data/fraudTrain.csv'

In [4]:
data, random_cc_num = load_data(root, data_path)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['age'] = data['dob'].map(dob_to_age)

Let's now load the embedding model that will be used later to convert the transaction data into vector embeddings.

In [5]:
from transformers import AutoTokenizer, AutoModel

# Load the pre-trained model
embedding_model_id = "BAAI/bge-small-en"
tokenizer = AutoTokenizer.from_pretrained(embedding_model_id)
model = AutoModel.from_pretrained(embedding_model_id)
model.eval()
print("Model loaded successfully")

Model loaded successfully


So far we have loaded the data and the embedding model. In the RAG-based approach, we also extract the customer's basic information (name, age, gender, and job) and registered address. This information does not change from one transaction to the next, so we can store it separately without converting it into embeddings. We will use it in the RAG-based approach to help the LLM detect fraudulent transactions.

In [6]:
for user in random_cc_num:
    # Get the data for the user
    user_data = data[data['cc_num'] == user]
    # Filter out the fraud transactions
    user_data = user_data[user_data['is_fraud'] == 0]
    # Use the first user with more than 1500 genuine transactions
    if user_data.shape[0] > 1500:
        # Get the basic information for the user
        customer_information, registered_address = get_user_basic_info(user_data.iloc[0])
        print(f"Customer Information Loaded Successfully for {user}")
        break

Customer Information Loaded Successfully for 630424987505
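
The helper `get_user_basic_info` lives in `utils` and is not shown here. As a rough sketch only, the split it performs could look like the function below. The function name `split_customer_record`, the example row, and all values in it are hypothetical; the dictionary keys are the ones the prompt template reads later in this post.

```python
from typing import Any, Dict, Mapping, Tuple


def split_customer_record(row: Mapping[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical stand-in for get_user_basic_info: split one row into two static dicts."""
    # Fields that describe the customer and do not change between transactions
    info = {key: row[key] for key in ("name", "gender", "age", "job")}
    # Fields that describe the registered address
    address = {key: row[key] for key in ("street", "city", "state", "zip")}
    return info, address


# Made-up row for illustration; a real row would come from user_data.iloc[0]
example_row = {
    "name": "Jane Doe", "gender": "Female", "age": 41, "job": "Engineer",
    "street": "1 Example Street", "city": "Thomas", "state": "WV", "zip": 26292,
    "amt": 96.56,  # transaction-level field, ignored by the split
}
info, address = split_customer_record(example_row)
print(info)
print(address)
```

Keeping these two dictionaries outside the vector store means they are passed to the LLM verbatim instead of being retrieved by similarity.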


As in the previous approach, we now create a Qdrant collection, embed the transaction data (excluding the customer information and registered address), and store the embeddings in the Qdrant database. We also create a test transaction.

In [7]:
from qdrant_client import QdrantClient, models

# Initialize in-memory Qdrant client
client = QdrantClient(":memory:")

# Create a collection in Qdrant for storing transaction embeddings
client.create_collection(
    collection_name="transactions",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE)
)

True

In [8]:
for idx, (_, transaction) in tqdm(enumerate(user_data.iterrows()), total=len(user_data)):
    # get the transactional information for a particular transaction
    transaction_information, merchant_information, payment_address, merchant_address = get_transactional_data(transaction, convert_coordinates_to_address=True)
    # convert the transaction information to string
    transaction_description = convert_transaction_data_to_str(transaction_information, merchant_information, payment_address, merchant_address)
    # embed the transaction description
    embedding = embed_transaction(transaction_description, model, tokenizer)
    embedding = embedding[0].tolist()
    # upload the transaction embedding and data to the qdrant client
    insert_transaction(embedding, transaction_description, idx, client)
    time.sleep(1)

    # Stop early to keep the demo short; transactions 0 through 200 are stored
    if idx == 200:
        break

  6%|▋         | 200/3085 [10:51<2:36:38,  3.26s/it]


In [9]:
new_transaction_info = '''
420000.54
-----------------------
Rajesh, Kumar; savings_account
-----------------------
Chandini Chowk; Delhi; India; 20.0583; 16.008
-----------------------
Vietnaam; 20.152538; 16.227746
'''
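
The test transaction above follows the same four-part layout as the stored descriptions. The real `convert_transaction_data_to_str` helper is not shown in this post, so the following is only a sketch of how such a string could be assembled. The function name `describe_transaction` and its signature are assumptions; the separator is copied from the example above.

```python
SEPARATOR = "-----------------------"


def describe_transaction(amount, merchant, category, payment_address, merchant_address):
    """Hypothetical sketch: join the four parts of a transaction into one text block."""
    parts = [
        str(amount),                                  # amount
        f"{merchant}; {category}",                    # merchant name and category
        "; ".join(str(p) for p in payment_address),   # payment address fields
        "; ".join(str(p) for p in merchant_address),  # merchant address fields
    ]
    # Separate the parts with the dashed line on its own row
    return f"\n{SEPARATOR}\n".join(parts)


description = describe_transaction(
    420000.54,
    "Rajesh, Kumar",
    "savings_account",
    ["Chandini Chowk", "Delhi", "India", 20.0583, 16.008],
    ["Vietnaam", 20.152538, 16.227746],
)
print(description)
```

A fixed layout matters because the system prompt describes this template to the LLM, and the stored and new transactions are embedded from the same kind of text.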

Now that everything is set up, let's move on to the next step, where the LLM in the RAG pipeline detects anomalies and explains them.

The RAG-based approach starts by collecting context for the LLM. To do so, we first embed the new transaction and then query the Qdrant database for the k closest transactions. We extract the descriptions of these k transactions and store them in the context variable. This context gives the LLM an idea of what genuine transactions for this customer look like.

In [13]:
context = get_context_for_anomaly_detection(new_transaction_info, client, model, tokenizer, k=10)

In [15]:
print(context)


96.56
-----------------------
Schumm, Bauch and Ondricka; grocery_pos
-----------------------
United States; Thomas; US; 26292; Tucker County; street; W; highway; trunk; Seneca Trail; West Virginia; 39.1505; -79.503
-----------------------
W; United States; highway; US; motorway; 26452; Senator Jennings Randolph Highway; Lewis County; West Virginia; street; 39.019265; -80.426668


59.36
-----------------------
Goldner, Kovacek and Abbott; grocery_pos
-----------------------
United States; Thomas; US; 26292; Tucker County; street; W; highway; trunk; Seneca Trail; West Virginia; 39.1505; -79.503
-----------------------
W; United States; highway; US; service; Four M Road; Garrett County; Maryland; street; 39.448177; -79.2644


90.3
-----------------------
Heller, Gutmann and Zieme; grocery_pos
-----------------------
United States; Thomas; US; 26292; Tucker County; street; W; highway; trunk; Seneca Trail; West Virginia; 39.1505; -79.503
-----------------------
W; United States; highway; 
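
To make the retrieval step concrete, here is a brute-force stand-in written in plain Python. It is not how `get_context_for_anomaly_detection` or Qdrant is implemented; it only illustrates the idea of ranking stored descriptions by cosine similarity and joining the top k into a context string. The function names and the toy 3-dimensional vectors are made up for illustration, and the vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_context(query_vector, stored, k=10):
    """Rank (vector, description) pairs by similarity and join the k best descriptions."""
    scored = [(cosine_similarity(query_vector, vec), text) for vec, text in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    context = "\n\n".join(text for _, text in top)
    scores = [score for score, _ in top]
    return context, scores


# Toy data: real embeddings from bge-small-en would have 384 dimensions
stored = [
    ([1.0, 0.0, 0.0], "grocery A"),
    ([0.9, 0.1, 0.0], "grocery B"),
    ([0.0, 1.0, 0.0], "travel C"),
]
context_text, scores = top_k_context([1.0, 0.0, 0.0], stored, k=2)
print(context_text)
print(scores)
```

The similarity scores are returned alongside the text because they are also the raw signal for a threshold-based anomaly check.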

Now we write the system and user prompts that explain to the LLM what its task is: detecting anomalies and giving the reasons for them.

In [16]:
system_prompt = '''
You're an intelligent AI assistant that helps in detecting fraudulent transactions. 

You're provided with the three key information:
    1. CUSTOMER INFORMATION: This has all the basic information about the customer which should give some idea about customer behaviour. The template is provided below.
    2. CONTEXT: This has several  examples of a normal and non-fraudulent transactional information for the user. The template for each transaction is provided below.
    3. NEW TRANSACTIONAL INFORMATION: This is the new transactional information that you need to classify as fraudulent or not. The template is same as normal transactional information 

Template for CUSTOMER INFORMATION and TRANSACTIONAL INFORMATION are provided below:
    1. CUSTOMER INFORMATION TEMPLATE
        {NAME}; {GENDER}; {AGE}; {JOB}
        -----------------------
        {REGISTERED ADDRESS}

    2. TRANSACTIONAL INFORMATION TEMPLATE: 
        {AMOUNT}
        -----------------------
        {MERCHANT NAME}; {CATEGORY}
        -----------------------
        {PAYMENT ADDRESS}
        -----------------------
        {MERCHANT ADDRESS} 

Your task is to understand the USER's personal information, registered address, and examples of normal transactional information based on the template provided, classify the new transactional information as fraudulent or not based on the context provided, and also provide the reason for your classification.

You're only allowed to provide response in a json format with the following keys:
    1. classification: This should be either of the following:
        a. Fraudulent
        b. Non-Fraudulent
    2. reason: This should be a string explaining the reason for your classification.

Example of the response:
{
    "classification": "Fraudulent",
    "reason": "The transaction amount is significantly higher than the average transaction amount."
}
    
You can not provide any other response apart from the above mentioned json format with the keys mentioned above. In the classification key, you can only provide either "Fraudulent" or "Non-Fraudulent" as the value.
'''

prompt_template = f'''
1. CUSTOMER INFORMATION:
    {customer_information['name']}; {customer_information['gender']}; {customer_information['age']}; {customer_information['job']}
    -----------------------
    {registered_address['street']}; {registered_address['city']}; {registered_address['state']}; {registered_address['zip']}

2. CONTEXT:
    {context}

3. NEW TRANSACTIONAL INFORMATION:
    {new_transaction_info}

RESPONSE:
'''
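
Because `prompt_template` is an f-string, its values are filled in once, at the moment the cell runs. To score several transactions, one option is to wrap the same layout in a function and call it per transaction. The sketch below assumes the same dictionary keys as above; the function name `build_user_prompt` and the example values are hypothetical.

```python
def build_user_prompt(customer_information, registered_address, context, new_transaction_info):
    """Hypothetical helper: rebuild the user prompt for each new transaction."""
    customer_line = "; ".join(
        str(customer_information[key]) for key in ("name", "gender", "age", "job")
    )
    address_line = "; ".join(
        str(registered_address[key]) for key in ("street", "city", "state", "zip")
    )
    return (
        "1. CUSTOMER INFORMATION:\n"
        f"    {customer_line}\n"
        "    -----------------------\n"
        f"    {address_line}\n\n"
        "2. CONTEXT:\n"
        f"    {context}\n\n"
        "3. NEW TRANSACTIONAL INFORMATION:\n"
        f"    {new_transaction_info}\n\n"
        "RESPONSE:\n"
    )


# Made-up inputs for illustration only
prompt = build_user_prompt(
    {"name": "Jane Doe", "gender": "Female", "age": 41, "job": "Engineer"},
    {"street": "1 Example Street", "city": "Thomas", "state": "WV", "zip": 26292},
    "96.56 ...",
    "420000.54 ...",
)
print(prompt)
```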

We load the quantized LLM that will be used in our RAG-based approach.

In [17]:
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# Use separate names so the embedding model and tokenizer loaded earlier are not overwritten
llm_tokenizer = AutoTokenizer.from_pretrained(model_id)

llm_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config
)

pipeline = transformers.pipeline(
    "text-generation",
    model=llm_model,
    tokenizer=llm_tokenizer,
    model_kwargs={"torch_dtype": torch.bfloat16}
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Finally, we pass the system prompt and the user prompt, which contains the context retrieved from the Qdrant database, to the LLM to identify anomalies and explain the reasons for them.

In [18]:
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt_template},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [20]:
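import json

# Parse the JSON reply with json.loads instead of eval, which would execute arbitrary model output
json.loads(outputs[0]["generated_text"][-1]['content'])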

{'classification': 'Fraudulent',
 'reason': "The transaction amount is significantly higher than the average transaction amount. The customer's registered address is in Thomas, WV, but the transaction is initiated from India, which is a different country and does not match the customer's registered address."}
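
Model replies do not always contain only the JSON object, even when the system prompt asks for it. A defensive parser can extract the object and check it against the two allowed labels before the result is used downstream. This is a minimal sketch; the function name `parse_llm_response` and the sample reply are made up for illustration.

```python
import json

ALLOWED_LABELS = {"Fraudulent", "Non-Fraudulent"}


def parse_llm_response(raw_text):
    """Extract the outermost JSON object from a reply and validate its keys."""
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in the model reply")
    parsed = json.loads(raw_text[start:end + 1])
    if parsed.get("classification") not in ALLOWED_LABELS:
        raise ValueError(f"unexpected classification: {parsed.get('classification')!r}")
    if not isinstance(parsed.get("reason"), str):
        raise ValueError("missing or non-string reason")
    return parsed


# Made-up reply with extra text around the JSON object
reply = 'Sure! {"classification": "Fraudulent", "reason": "Amount is unusually high."} Done.'
result = parse_llm_response(reply)
print(result)
```

Raising an error on an unexpected label makes it easy to retry the request or route the transaction to manual review.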

We have successfully detected the anomaly and explained the reasons for it. You can make this approach more accurate by focusing on feature extraction and selection, prompt engineering, and threshold tuning. You can make this reverse recommendation architecture even more robust by keeping multiple vector stores for different types of transaction data and doing a granular analysis of the anomalies in each vector store before giving the final output.
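
As a rough illustration of threshold tuning across several vector stores, the sketch below flags a transaction in any store where its best similarity to genuine transactions falls below that store's threshold. The store names, scores, and thresholds are all hypothetical; in practice the thresholds would be tuned on held-out data.

```python
def granular_report(scores_by_store, thresholds):
    """Flag each store where the best similarity score is below its threshold."""
    report = {}
    for store, scores in scores_by_store.items():
        if not scores:
            raise ValueError(f"no similarity scores for store {store!r}")
        best = max(scores)
        report[store] = {"best_score": best, "is_anomaly": best < thresholds[store]}
    return report


# Hypothetical similarity scores of one new transaction against three stores
scores_by_store = {
    "amount": [0.91, 0.88, 0.85],
    "merchant": [0.83, 0.80],
    "location": [0.42, 0.39, 0.35],
}
thresholds = {"amount": 0.80, "merchant": 0.75, "location": 0.70}

report = granular_report(scores_by_store, thresholds)
flagged = [store for store, entry in report.items() if entry["is_anomaly"]]
print(flagged)
```

The list of flagged stores can then be added to the LLM prompt so that the explanation focuses on the aspects that actually deviate.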

# Conclusion

In this blog, we saw how to combine the reverse recommendation system with a RAG model to not only detect anomalies but also explain the reasons for them. This approach can be adapted to other use cases where anomalies need to be detected in data, such as insurance claims, customer support interactions, and network traffic logs. The key to this approach is to have a positive sample of the data and then measure how far the test data deviates from it. If the deviation is large, it is an anomaly, and the RAG model can help explain the reasons for it.