# Content Design for RAG
This notebook is part of a collection of material related to content design principles for retrieval-augmented generation (RAG).

You can explore the complete collection here: [Content Design for RAG on GitHub](https://github.com/spackows/ICAAI-2024_RAG-CD/blob/main/README.md)

**Example scenario**

Imagine your company sells seeds and gardening supplies online.  On your website, you have articles with gardening information and advice.  You are building a RAG solution for your company website that can answer customer questions about your products, using your website articles as a knowledge base.

# Remove HAP, PII
You need to remove hate, abuse, and profanity (HAP) as well as personally identifiable information (PII) from user input and from any output generated by a large language model.

This sample notebook demonstrates a simple approach to this problem using two Python libraries.

**Contents**
1. Remove profanity
2. Remove PII
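
Because HAP and PII must be removed from both the user input and the model output, the cleaning step runs twice around each model call. The following is only a minimal sketch of that wiring. The names `apply_sanitizers`, `guarded_answer`, `mask_badword`, and `echo_model` are hypothetical stand-ins for illustration and are not part of this notebook or of either library.

```python
from typing import Callable, Iterable

Sanitizer = Callable[[str], str]

def apply_sanitizers(text: str, sanitizers: Iterable[Sanitizer]) -> str:
    """Run each sanitizer over the text, in order."""
    for sanitize in sanitizers:
        text = sanitize(text)
    return text

def guarded_answer(user_input: str,
                   generate: Callable[[str], str],
                   sanitizers: Iterable[Sanitizer]) -> str:
    """Clean the user input, call the model, then clean the model output."""
    sanitizers = list(sanitizers)
    clean_input = apply_sanitizers(user_input, sanitizers)
    raw_output = generate(clean_input)
    return apply_sanitizers(raw_output, sanitizers)

# Hypothetical stand-ins: a toy sanitizer and a fake "model" that echoes its prompt.
def mask_badword(text: str) -> str:
    return text.replace("badword", "[expletive]")

def echo_model(prompt: str) -> str:
    return "You said: " + prompt + " (badword)"

print(guarded_answer("This badword plant died", echo_model, [mask_badword]))
# prints: You said: This [expletive] plant died ([expletive])
```

In a real solution, the sanitizer list would hold functions such as the profanity and PII removers shown in the sections below, and `generate` would be the call to your language model.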

## 1. Remove profanity
You can use the [profanity-police](https://pypi.org/project/profanity-police) library to identify profanity in text.

In [None]:
!pip install profanity-police | tail -n 1

In [16]:
from profanity_police.checker import Checker
import re
import json

def removeProfanity( txt_in, language_code ):
    checker = Checker()
    # Strip punctuation before checking the text
    txt_in = re.sub( r"[\,\;\:\.\!\?\~\@\#\$\%\^\&\*\(\)\_\-\+\{\}\[\]\|\\=\"\'\<\>\/]", "", txt_in )
    transcript = [ { "text" : txt_in } ]
    check_results = checker.check_swear_word( transcript, language_code )
    print( json.dumps( check_results, indent=3 ) )
    txt_out = txt_in
    if ( len( check_results ) > 0 ) and ( "found" in check_results[0] ):
        words_arr = check_results[0]["found"]
        for word in words_arr:
            # Escape the word so it is matched literally, ignoring case
            txt_out = re.sub( re.escape( word ), "[expletive]", txt_out, flags=re.IGNORECASE )
    return txt_out

In [20]:
txt = "The bitch is barking again"
clean_txt = removeProfanity( txt, "en" )
print( "\nCleaned text: " + clean_txt )

[
   {
      "text": "The bitch is barking again",
      "found": [
         "bitch"
      ]
   }
]

Cleaned text: The [expletive] is barking again
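
The replacement step in `removeProfanity` substitutes each flagged word wherever its characters appear, so it can also alter longer words that merely contain a flagged word. A stricter variant, sketched below with only the standard library, matches whole words only. The helper `mask_words` is hypothetical and is not part of profanity-police; the word list is passed in by hand here rather than coming from the checker.

```python
import re

def mask_words(text, words, mask="[expletive]"):
    """Replace whole-word, case-insensitive matches of each word with mask."""
    for word in words:
        # \b anchors the match at word boundaries; re.escape treats the word literally
        pattern = r"\b" + re.escape(word) + r"\b"
        text = re.sub(pattern, mask, text, flags=re.IGNORECASE)
    return text

print(mask_words("The ass passed the grass", ["ass"]))
# prints: The [expletive] passed the grass
```

Whether whole-word matching is the right choice depends on your content: it avoids mangling innocent words, but it also misses flagged words embedded in compounds.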


## 2. Remove PII
You can use the [scrubadub](https://scrubadub.readthedocs.io/en/stable/index.html) library to identify PII in text.

In [None]:
!pip install scrubadub | tail -n 1

In [30]:
import scrubadub

txt = "My id is bob@intrnet-email.com, and I want a refund"

words_arr = scrubadub.list_filth( txt )
print( words_arr )

[<EmailFilth text='bob@intrnet-email.com' beg=9 end=30 detector_name='email' locale='en_US'>]
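
The `beg` and `end` values in the output above are character offsets into the input string. To illustrate how offsets like these can drive masking, here is a small sketch that does not call scrubadub at all. The function `mask_spans` and the `(beg, end, label)` tuples are hypothetical stand-ins for the detected filth objects.

```python
def mask_spans(text, spans):
    """Replace each (beg, end, label) span with "[LABEL]".

    Spans are applied from right to left so that earlier offsets stay valid
    after each replacement changes the length of the string.
    """
    for beg, end, label in sorted(spans, reverse=True):
        text = text[:beg] + "[" + label.upper() + "]" + text[end:]
    return text

txt = "My id is bob@intrnet-email.com, and I want a refund"
print(mask_spans(txt, [(9, 30, "email")]))
# prints: My id is [EMAIL], and I want a refund
```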


In [34]:
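# Wrap replacement labels in square brackets, for example [EMAIL]
scrubadub.filth.base.Filth.prefix = "["
scrubadub.filth.base.Filth.suffix = "]"

def removePII( txt, b_mask=False ):
    # b_mask=True: replace each piece of PII with a label for its type
    if b_mask:
        return scrubadub.clean( txt )
    # b_mask=False: remove the PII from the text entirely
    scrubber = scrubadub.Scrubber( post_processor_list=[ scrubadub.post_processors.FilthRemover() ] )
    return scrubber.clean( txt )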

In [35]:
removePII( txt )

'My id is , and I want a refund'

In [36]:
removePII( txt, b_mask=True )

'My id is [EMAIL], and I want a refund'