# How to use Bedrock Titan FM embedding vectors to build an LLM content moderation engine with RandomForestClassifier

This demo notebook shows how to use Bedrock Titan FM embedding vectors to build an LLM content moderation engine with a RandomForestClassifier.

We will use the Bedrock Python SDK to generate the embeddings.


## Content moderation Pipeline

Email template content -> Bedrock Titan FM -> Embedding vectors -> RandomForestClassifier -> result

1. [Set Up](#1.-Set-Up)
2. [Embeddings Generation](#2.-Embeddings-Generation)
3. [Items Similarity](#3.-Items-Similarity)

Note: This notebook was tested in Amazon SageMaker Studio with Python 3 (Data Science 2.0) kernel.

### 1. Set Up

---
Before executing the notebook for the first time, execute this cell to add the Bedrock extensions to the Python boto3 SDK.

---

In [None]:
!python3 -m pip install dependencies/boto3-1.26.162-py3-none-any.whl
!python3 -m pip install dependencies/botocore-1.29.162-py3-none-any.whl

Let's initialize the boto3 client to use Bedrock.

In [2]:
import boto3
import json
bedrock = boto3.client(
 service_name='bedrock',
 region_name='us-east-1',
 endpoint_url='https://bedrock.us-east-1.amazonaws.com'
)

Let's test the endpoint to see which models are available.

In [3]:
bedrock.list_foundation_models()

{'ResponseMetadata': {'RequestId': '9b303ddb-8725-48c2-9f6c-2fe104bc1e17',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Mon, 04 Sep 2023 14:43:14 GMT',
   'content-type': 'application/json',
   'content-length': '1166',
   'connection': 'keep-alive',
   'x-amzn-requestid': '9b303ddb-8725-48c2-9f6c-2fe104bc1e17'},
  'RetryAttempts': 0},
 'modelSummaries': [{'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-tg1-large',
   'modelId': 'amazon.titan-tg1-large'},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-e1t-medium',
   'modelId': 'amazon.titan-e1t-medium'},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/stability.stable-diffusion-xl',
   'modelId': 'stability.stable-diffusion-xl'},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/ai21.j2-grande-instruct',
   'modelId': 'ai21.j2-grande-instruct'},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/ai21.j2-jumbo-instruct',
   'modelId': 'ai21.j2-jumbo-i

## Load training dataset CSV file into dataframe

In [4]:
import pandas as pd

# Specify the file path
csv_file = "compliance_dataset.csv"

# Load the CSV file into a DataFrame
df = pd.read_csv(csv_file)

# Display the first few rows of the DataFrame to verify the data loading
print(df.head())

                                                text is_toxic
0  Elon Musk is a piece of shit, greedy capitalis...    Toxic
1  The senile credit card shrill from Delaware ne...    Toxic
2  He does that a lot -- makes everyone look good...    Toxic
3                                         F*ck Lizzo    Toxic
4  Epstein and trump were best buds!!! Pedophiles...    Toxic


In [5]:
# Rename the 'is_toxic' column to 'label'
df.rename(columns={'is_toxic': 'label'}, inplace=True)

# Display the first few rows of the DataFrame to verify the column name change
print(df.head())

                                                text  label
0  Elon Musk is a piece of shit, greedy capitalis...  Toxic
1  The senile credit card shrill from Delaware ne...  Toxic
2  He does that a lot -- makes everyone look good...  Toxic
3                                         F*ck Lizzo  Toxic
4  Epstein and trump were best buds!!! Pedophiles...  Toxic


In [6]:

# Specify the file path
csv_file = "compliance_dataset.csv"

# Load the CSV file into a DataFrame
df = pd.read_csv(csv_file)

# Rename the 'is_toxic' column to 'label'
df.rename(columns={'is_toxic': 'label'}, inplace=True)

# Replace values in the 'label' column
df['label'] = df['label'].replace({'Toxic': 'non-compliance', 'Not Toxic': 'compliance'})

# Display the first few rows of the DataFrame to verify the changes
print(df.head())


                                                text           label
0  Elon Musk is a piece of shit, greedy capitalis...  non-compliance
1  The senile credit card shrill from Delaware ne...  non-compliance
2  He does that a lot -- makes everyone look good...  non-compliance
3                                         F*ck Lizzo  non-compliance
4  Epstein and trump were best buds!!! Pedophiles...  non-compliance


In [7]:
# Filter the DataFrame to show rows where 'label' is 'compliance'
compliance_df = df[df['label'] == 'compliance']

# Display the first few rows of the filtered DataFrame
print(compliance_df.head())


                                                  text       label
501                         They’re shallots, actually  compliance
502            me thinking in my head: mmm pizzaaaa...  compliance
503  I have never seen a dice hammer that is so fuc...  compliance
504  Got a question for you guys out there: did rev...  compliance
505  This sub’s notorious for it. It might be the m...  compliance


In [8]:
# Copy the original DataFrame into 'vectors_df' (an independent copy, so later edits do not affect 'df')
vectors_df = df.copy()


In [9]:
# Count the number of 'compliance' and 'non-compliance' labels
label_counts = vectors_df['label'].value_counts()

# Print the counts
print(label_counts)


non-compliance    501
compliance        499
Name: label, dtype: int64


In [10]:
# Export the DataFrame 'vectors_df' to a JSON file named 'compliance_dataset.json'
vectors_df.to_json('compliance_dataset.json', orient='records', lines=True)
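Because the export uses `orient='records'` with `lines=True`, the file is in JSON Lines format (one JSON object per line). As a minimal sketch, pandas can read such a file back in a single call instead of parsing it line by line. The toy DataFrame and the temporary file path below are illustrative assumptions, not part of the original dataset.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for 'vectors_df' (illustrative rows, not the real dataset)
toy_df = pd.DataFrame({
    "text": ["hello there", "you are awful"],
    "label": ["compliance", "non-compliance"],
})

# Write JSON Lines: one JSON object per row, one row per line
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.json")
toy_df.to_json(path, orient="records", lines=True)

# Read the same file back into a DataFrame in one call
restored_df = pd.read_json(path, lines=True)

# Convert to a list of dictionaries, the same shape a manual line parser builds
records = restored_df.to_dict(orient="records")
print(records)
```

The manual `json.loads` loop remains useful when individual malformed lines should be skipped rather than failing the whole read.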


In [11]:
import json

# Open the JSON file for reading
with open('compliance_dataset.json', 'r', encoding='utf-8') as file:
    data = []
    for line in file:
        # Parse each line as a separate JSON object
        try:
            record = json.loads(line.strip())
            data.append(record)
        except json.JSONDecodeError as e:
            print(f"Skipping invalid JSON: {e}")

# 'data' now contains a list of dictionaries, each representing a JSON object


In [12]:
import json

# Open the JSON file for reading
with open('compliance_dataset.json', 'r', encoding='utf-8') as file:
    data = []
    for line in file:
        # Parse each line as a separate JSON object
        try:
            record = json.loads(line.strip())
            data.append(record)
        except json.JSONDecodeError as e:
            print(f"Skipping invalid JSON: {e}")

# 'data' now contains a list of dictionaries, each representing a JSON object

# Specify the filename for the new JSON file
output_filename = 'parsed_compliance_data.json'

# Write the 'data' list to the new JSON file
with open(output_filename, 'w', encoding='utf-8') as output_file:
    json.dump(data, output_file, indent=4)

print(f'Data has been saved to {output_filename}')


Data has been saved to parsed_compliance_data.json


In [13]:
import json

# Specify the filename of the JSON file you want to load
input_filename = 'parsed_compliance_data.json'

# Load the JSON data from the file into a Python variable
with open(input_filename, 'r', encoding='utf-8') as input_file:
    loaded_data = json.load(input_file)

# Now, 'loaded_data' contains the JSON data as a Python data structure (likely a list of dictionaries)

# You can now work with 'loaded_data' in your Jupyter Notebook


In [14]:
# Assuming you've already loaded the JSON data into the 'loaded_data' variable

# Display the top 3 records
top_3_records = loaded_data[:3]
for record in top_3_records:
    print(record)



{'text': 'Elon Musk is a piece of shit, greedy capitalist who exploits workers, and offers nothing of real benefit to the world.\n All he’s done is make a name for himself on the backs of other people, using dirty money from his family’s emerald mine they acquired during apartheid.\n I don’t care that he’s autistic. He thinks we should be cured with his company’s AI chip. \n He is not a representation of our community. Don’t celebrate him on this page.', 'label': 'non-compliance'}
{'text': 'The senile credit card shrill from Delaware needs to resign!!', 'label': 'non-compliance'}
{'text': "He does that a lot -- makes everyone look good but him...I guess it's also probably the Dems and the Media that force him to compulsively tweet abject bullshit like a lying bitch. They're tricky, them libs.", 'label': 'non-compliance'}


### 2. Embeddings Generation

Embeddings are a key concept in generative AI and in machine learning in general. An embedding is a representation of an object (a word, image, video, and so on) in a vector space. Typically, semantically similar objects have embeddings that are close together in that space. This makes embeddings very powerful for use cases such as semantic search, recommendations, and classification.
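To make "close together in the vector space" concrete, here is a small sketch that compares vectors with cosine similarity, one common closeness measure. The three-dimensional vectors are hypothetical hand-written values chosen for illustration; they are not Titan outputs.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings" (hand-written, for illustration only)
emb_cat = [0.9, 0.1, 0.0]
emb_kitten = [0.8, 0.2, 0.1]
emb_invoice = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(emb_cat, emb_kitten)
sim_unrelated = cosine_similarity(emb_cat, emb_invoice)

print("cat vs kitten :", round(sim_related, 3))
print("cat vs invoice:", round(sim_unrelated, 3))
```

In this toy setup the related pair scores much higher than the unrelated pair, which is the property a downstream classifier relies on.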

# We will be using the Titan Embeddings Model to generate our Embeddings.

def get_embedding(body, modelId, accept, contentType):
    response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    embedding = response_body.get('embedding')
    return embedding

body = json.dumps({"inputText": "explain black holes to 8th graders"})
modelId = 'amazon.titan-e1t-medium'
accept = 'application/json'
contentType = 'application/json'

embedding = get_embedding(body, modelId, accept, contentType)
print(embedding)

In [15]:
import json

def get_embedding(body, modelId, accept, contentType):
    response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    embedding = response_body.get('embedding')
    return embedding

# Load the parsed JSON data from 'parsed_compliance_data.json'
with open('parsed_compliance_data.json', 'r', encoding='utf-8') as input_file:
    data = json.load(input_file)

# Initialize a list to store the results
results = []

# Loop through each record in the data
for record in data:
    text = record['text']
    label = record['label']

    # Calculate the embedding for the text
    body = json.dumps({"inputText": text})
    modelId = 'amazon.titan-e1t-medium'
    accept = 'application/json'
    contentType = 'application/json'
    embedding = get_embedding(body, modelId, accept, contentType)

    # Create a result dictionary with text, label, and embedding
    result = {
        #'text': text,
        'label': label,
        'embedding': embedding
    }

    # Append the result to the list of results
    results.append(result)

# Save the results to 'vectors.json'
with open('vectors.json', 'w', encoding='utf-8') as output_file:
    json.dump(results, output_file, indent=4)

print('Embedding vectors have been saved to vectors.json')


Embedding vectors have been saved to vectors.json


## Prepare the training dataset (800 records) and the test dataset (200 records)
Read the 'vectors.json' file, take the first 100 and last 100 records as the test set and save them to 'test.json', then save the remaining 800 records to 'train.json'. The following sample code does this:

In [16]:
import json

# Load the 'vectors.json' file
with open('vectors.json', 'r') as json_file:
    data = json.load(json_file)

# Extract the first 100 and last 100 records
first_100_records = data[:100]
last_100_records = data[-100:]

# Create 'test.json' with the combined 200 records
test_data = first_100_records + last_100_records
with open('test.json', 'w') as test_file:
    json.dump(test_data, test_file)

# Create 'train.json' with the remaining 800 records
train_data = data[100:-100]
with open('train.json', 'w') as train_file:
    json.dump(train_data, train_file)
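The positional split above gives a balanced test set only when the file is ordered by label, which the earlier DataFrame output suggests is the case here. When that ordering cannot be relied on, a stratified random split is a common alternative. The sketch below uses scikit-learn's `train_test_split` on toy records shaped like the entries in 'vectors.json'; the records themselves are made up for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy records shaped like 'vectors.json' entries (illustrative values only)
toy_records = (
    [{"label": "non-compliance", "embedding": [float(i), 1.0]} for i in range(10)]
    + [{"label": "compliance", "embedding": [float(i), 0.0]} for i in range(10)]
)
labels = [record["label"] for record in toy_records]

# Hold out 20% for testing while keeping the label proportions equal in both parts
train_records, test_records = train_test_split(
    toy_records, test_size=0.2, stratify=labels, random_state=42
)

test_counts = Counter(record["label"] for record in test_records)
print(len(train_records), len(test_records), dict(test_counts))
```

Fixing `random_state` keeps the split reproducible between runs.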


## Convert embedding vectors into NumPy arrays and train a RandomForestClassifier

In [17]:
!pip install numpy==1.16.5


Collecting numpy==1.16.5
  Downloading numpy-1.16.5.zip (5.1 MB)
     5.1/5.1 MB 38.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: numpy
  Building wheel for numpy (setup.py) ... done
  Created wheel for numpy: filename=numpy-1.16.5-cp38-cp38-linux_x86_64.whl size=10166035 sha256=cc5983f67f15f4591da4403b98176ddd409534f66d538c8db8af84469e56027d
  Stored in directory: /root/.cache/pip/wheels/8f/3f/d3/ac786baa3379136ed1069cf94478550de71616e0490b462e90
Successfully built numpy
DEPRECATION: pyodbc 4.0.0-unsupported has a non-standard version number. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pyodbc or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/

In [18]:
import json
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report



### Training a model

In [19]:


# Load the training dataset from 'train.json'
with open('train.json', 'r') as f:
    training_data = json.load(f)

# Load the test dataset from 'test.json'
with open('test.json', 'r') as f:
    test_data = json.load(f)

# Extract features (embedding vectors) and labels from the datasets
X_train = [data_point["embedding"] for data_point in training_data]
y_train = [data_point["label"] for data_point in training_data]

X_test = [data_point["embedding"] for data_point in test_data]
y_test = [data_point["label"] for data_point in test_data]

# Convert lists to numpy arrays for scikit-learn
X_train = np.array(X_train)
y_train = np.array(y_train)

X_test = np.array(X_test)
y_test = np.array(y_test)

# Build the classification model (Random Forest in this example)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Evaluate the model
y_pred = clf.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate and print precision
precision = precision_score(y_test, y_pred, average='weighted')
print("Precision:", precision)

# Calculate and print recall
recall = recall_score(y_test, y_pred, average='weighted')
print("Recall:", recall)

# Calculate and print F1-score
f1 = f1_score(y_test, y_pred, average='weighted')
print("F1-score:", f1)

# Calculate and print ROC-AUC score (Note: ROC-AUC is typically used for binary classification)
if len(np.unique(y_test)) == 2:  # Check if it's a binary classification problem
    roc_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print("ROC-AUC:", roc_auc)

# Print the detailed classification report
classification_report_str = classification_report(y_test, y_pred)
print("Classification Report:\n", classification_report_str)


Accuracy: 0.99
Precision: 0.99
Recall: 0.99
F1-score: 0.99
ROC-AUC: 0.99985
Classification Report:
                 precision    recall  f1-score   support

    compliance       0.99      0.99      0.99       100
non-compliance       0.99      0.99      0.99       100

      accuracy                           0.99       200
     macro avg       0.99      0.99      0.99       200
  weighted avg       0.99      0.99      0.99       200
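Aggregate scores do not show which kind of mistake the model makes. A confusion matrix breaks the errors down by class, which matters for moderation because a missed 'non-compliance' item usually costs more than a false alarm. The sketch below uses made-up labels and predictions, not the notebook's results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Fix the label order so rows and columns are unambiguous
label_order = ["compliance", "non-compliance"]

# Made-up ground truth and predictions (illustrative only)
y_true_toy = np.array(
    ["compliance", "compliance", "non-compliance", "non-compliance", "non-compliance"]
)
y_pred_toy = np.array(
    ["compliance", "non-compliance", "non-compliance", "non-compliance", "compliance"]
)

# Rows are true labels, columns are predicted labels, both in 'label_order'
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=label_order)
print(cm)

# cm[1, 0] counts 'non-compliance' items that were predicted as 'compliance'
missed_non_compliance = int(cm[1, 0])
print("Missed non-compliance items:", missed_non_compliance)
```

With the notebook's own variables, the same call would be `confusion_matrix(y_test, y_pred, labels=label_order)`.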



### Save and Load the Trained Model
Save the trained Random Forest classifier after training so that it can be reloaded later for inference. The joblib library can save and load scikit-learn models.

In [20]:
# saving the model after training:

from joblib import dump

# Refit the classifier (optional here, since 'clf' was already trained above)
clf.fit(X_train, y_train)

# Save the trained model to a file
dump(clf, 'trained_model.joblib')


['trained_model.joblib']

In [21]:
# loading the saved model for inference:

from joblib import load

# Load the trained model from a file
clf = load('trained_model.joblib')


## Prepare New Data

Preprocess new data the same way as the training and test data: obtain the embedding vector for the new text with the same 'get_embedding' function and the same Titan model that produced the training vectors.
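One way to keep preprocessing consistent is to wrap "embed, then predict" in a single helper so that inference always goes through the same embedding step. The sketch below is an assumption-laden illustration: `classify_text` and `toy_embed` are hypothetical names, and `toy_embed` is a two-feature stand-in for a call to the real embedding model so that the example runs offline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def classify_text(text, embed_fn, model):
    """Embed 'text' with 'embed_fn', then return the model's predicted label."""
    vector = np.asarray(embed_fn(text), dtype=float).reshape(1, -1)
    return model.predict(vector)[0]

def toy_embed(text):
    """Hypothetical stand-in embedder: [number of '!', 1.0 if text is all upper case]."""
    return [float(text.count("!")), float(text.isupper())]

# Tiny made-up training set, embedded with the same function used at inference time
train_texts = ["thank you"] * 10 + ["GO AWAY!!!"] * 10
train_labels = ["compliance"] * 10 + ["non-compliance"] * 10
X_toy = np.array([toy_embed(t) for t in train_texts])

toy_model = RandomForestClassifier(n_estimators=100, random_state=42)
toy_model.fit(X_toy, np.array(train_labels))

polite_label = classify_text("see you soon", toy_embed, toy_model)
rude_label = classify_text("GET LOST!!!", toy_embed, toy_model)
print(polite_label, rude_label)
```

In the notebook, `embed_fn` would be a small wrapper around `get_embedding` that builds the JSON body from the text.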

To load JSON data from the 'email_content_english.json' file into the new_text variable, you can use the following code:

In [22]:
import json

# Specify the filename of the JSON file
json_filename = 'email_content_english.json'

# Load the JSON data from the file into 'new_text'
with open(json_filename, 'r', encoding='utf-8') as json_file:
    data = json.load(json_file)

# Assuming that the JSON file has a key named 'inputText' containing the text data
new_text = data.get('inputText', '')

# Now, 'new_text' contains the text data from the JSON file


Now the new_text variable contains the text loaded from 'email_content_english.json', and you can use it to calculate the embedding:

In [23]:
new_text_embedding = get_embedding(json.dumps({"inputText": new_text}), modelId, accept, contentType)


In [24]:
# Assuming you have calculated 'new_text_embedding' using your get_embedding function
print("new_text_embedding:", new_text_embedding)


new_text_embedding: [-0.07324219, 0.036865234, -0.10107422, 0.004760742, 0.3125, 0.37304688, 0.06542969, 0.055908203, -0.028930664, 0.109375, -0.24414062, 0.0016021729, 0.0060424805, -0.046875, -0.018188477, 0.041015625, -0.068847656, -0.12695312, 0.07861328, -0.23242188, -0.008911133, -0.016845703, 0.060791016, 0.023803711, -0.018676758, 0.084472656, -0.076171875, 0.053710938, -0.011291504, -0.0056762695, -0.064941406, 0.021484375, -0.16601562, -0.16015625, -0.15820312, 0.08886719, -0.05078125, -0.16796875, -0.15527344, 0.08642578, 0.18945312, 0.06201172, -0.06982422, -0.15625, -0.07910156, 0.100097656, -0.14941406, -0.18945312, -0.045410156, -0.2265625, 0.03564453, -0.033691406, 0.111328125, -0.017333984, -0.068359375, -0.08886719, -0.16015625, -0.09472656, 0.080078125, 0.110839844, 0.04296875, -0.2109375, 0.16503906, 0.100097656, 0.1796875, 0.10107422, 0.028076172, 0.052734375, -0.05859375, -0.056152344, -0.080078125, 0.09716797, 0.033691406, 0.049560547, 0.12011719, -0.044921875, -

## Perform Inference

Use the loaded model to make predictions on the new data with the classifier's predict method.

In [25]:
# Predict the label for the new data
predicted_label = clf.predict([new_text_embedding])

# Print the predicted label
print("Predicted Label:", predicted_label[0])


Predicted Label: compliance


In [26]:
# Predict the label and obtain probability estimates
probability_estimates = clf.predict_proba([new_text_embedding])
predicted_label = clf.predict([new_text_embedding])

# Print the predicted label and probability estimates
print("Predicted Label:", predicted_label[0])
print("Probability Estimates:", probability_estimates[0])


Predicted Label: compliance
Probability Estimates: [0.75 0.25]
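The probability array follows the order of `clf.classes_`, which scikit-learn stores in sorted order, so the first value above belongs to 'compliance' and the second to 'non-compliance'. The sketch below pairs each probability with its label and applies a decision threshold. The toy one-dimensional data and the threshold value are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy one-dimensional features (illustrative): low values are 'compliance'
X_toy = np.array([[0.0], [0.1], [0.2], [0.3], [0.7], [0.8], [0.9], [1.0]])
y_toy = np.array(["compliance"] * 4 + ["non-compliance"] * 4)

toy_clf = RandomForestClassifier(n_estimators=100, random_state=42)
toy_clf.fit(X_toy, y_toy)

# predict_proba returns one column per class, in the order of 'classes_'
proba_row = toy_clf.predict_proba([[0.05]])[0]
proba_by_label = dict(zip(toy_clf.classes_, proba_row))
print(proba_by_label)

# Hypothetical policy: flag the item when P(non-compliance) reaches the threshold
NON_COMPLIANCE_THRESHOLD = 0.5
flagged = bool(proba_by_label["non-compliance"] >= NON_COMPLIANCE_THRESHOLD)
print("Flagged for review:", flagged)
```

Lowering the threshold flags more content for review at the cost of more false alarms, which may suit a moderation workflow with human reviewers.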
