# RAG - Exercise Solution

## Goal: Embed a few text documents as vectors, then use them via RAG to help answer user questions about sports events.

### Complete the tasks below

### TASK: Create a boto3 client connection to bedrock-runtime

In [1]:
# CODE HERE

In [2]:
import boto3

bedrock_runtime = boto3.client(region_name='us-east-1', service_name='bedrock-runtime')

### TASK: Create a function that takes in text and returns the Titan Embedding.

In [3]:
# Code Here
import json

def embed_text(text):
    '''
    INPUT: str text
    OUTPUT: an embedding, either Python list or Numpy Array
    '''
    json_request = {'inputText': text}
    body = json.dumps(json_request)
    response = bedrock_runtime.invoke_model(body=body, modelId='amazon.titan-embed-text-v1')
    return json.loads(response.get('body').read())['embedding']

In [4]:
#embed_text('hello how are you')
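The embedding call can be exercised without AWS credentials by passing the client in as an argument. The following is a minimal sketch, not part of the original solution: `embed_text_with` and `FakeBedrockClient` are hypothetical names, and the fake client returns a made-up three-value vector rather than a real Titan embedding. The `contentType` and `accept` arguments are optional and are spelled out here only for clarity.

```python
import io
import json


def embed_text_with(client, text, model_id="amazon.titan-embed-text-v1"):
    """Return the embedding for `text` using any client that exposes invoke_model."""
    response = client.invoke_model(
        body=json.dumps({"inputText": text}),
        modelId=model_id,
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream, so it is read once and then parsed as JSON.
    return json.loads(response["body"].read())["embedding"]


class FakeBedrockClient:
    """Hypothetical stand-in for the boto3 client; returns a fixed toy vector."""

    def invoke_model(self, body, modelId, contentType, accept):
        # Remember the request so it can be inspected afterwards.
        self.last_request = json.loads(body)
        payload = json.dumps({"embedding": [0.1, 0.2, 0.3]}).encode("utf-8")
        return {"body": io.BytesIO(payload)}


fake_client = FakeBedrockClient()
vector = embed_text_with(fake_client, "hello how are you")
print(len(vector))
```

Passing `bedrock_runtime` instead of `fake_client` should give the same behavior as `embed_text` above, assuming valid credentials and model access.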

### TASK: Using pandas and the os library, open the directory "00-Sports-Articles", read the text of each .txt file as a string, and insert it along with its filename into a pandas DataFrame.

Hint to loop through a directory of files: https://pieriantraining.com/iterate-over-files-in-directory-using-python/

Hint on how to add a new row to a dataframe: 
https://stackoverflow.com/questions/10715965/create-a-pandas-dataframe-by-appending-one-row-at-a-time
https://pandas.pydata.org/docs/reference/api/pandas.concat.html

In [5]:
# CODE HERE

import os
import pandas as pd

# Path to the directory
directory_path = "00-Sports-Articles"

data = []
for filename in os.listdir(directory_path):
    if filename.endswith('.txt'):
        file_path = os.path.join(directory_path, filename)
        
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
        data.append([filename, content])


In [6]:
df = pd.DataFrame(data, columns=['filename', 'text'])

In [7]:
df

| | filename | text |
|---|---|---|
| 0 | BaseballGame.txt | Extra-Inning Thriller: Yankees Outlast Red Sox... |
| 1 | BasketballGame.txt | Nail-Biting Overtime Battle Sees Lakers Triump... |
| 2 | FootballGame.txt | Epic Clash between 49ers and Buccaneers Ends i... |
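As an alternative to `os.listdir`, the same loading step can be sketched with `pathlib`. This is an illustrative variant, not the original solution: `read_text_files` is a hypothetical helper, and the files are created in a temporary directory so the example runs without the "00-Sports-Articles" folder. Sorting the paths makes the row order deterministic, which `os.listdir` does not guarantee.

```python
import tempfile
from pathlib import Path


def read_text_files(directory):
    """Return one {'filename', 'text'} record per .txt file, sorted by name."""
    records = []
    for path in sorted(Path(directory).glob("*.txt")):
        records.append({
            "filename": path.name,
            "text": path.read_text(encoding="utf-8"),
        })
    return records


# Build a throwaway directory with two .txt files and one file to be skipped.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b.txt").write_text("second", encoding="utf-8")
    Path(tmp, "a.txt").write_text("first", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    records = read_text_files(tmp)

print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)` to get the same `filename` and `text` columns.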


### TASK: Apply your Text Embedding function to create a new column in the dataframe of the vector embedding of the text column.

In [8]:
# CODE HERE

df['embedding'] = df['text'].apply(embed_text)

In [9]:
df

| | filename | text | embedding |
|---|---|---|---|
| 0 | BaseballGame.txt | Extra-Inning Thriller: Yankees Outlast Red Sox... | [0.2353478, -0.14758441, -0.22002964, -0.30111... |
| 1 | BasketballGame.txt | Nail-Biting Overtime Battle Sees Lakers Triump... | [-0.063942306, 0.14368992, -0.018919019, -0.17... |
| 2 | FootballGame.txt | Epic Clash between 49ers and Buccaneers Ends i... | [-0.48678175, -0.11641871, -0.38582614, 0.0014... |


### TASK: Create a function that calculates the cosine similarity between two vectors.

In [10]:
# CODE HERE

In [11]:
import numpy as np

def cosine_similarity(vector1, vector2):
    # Calculate the dot product of the two vectors
    dot_product = np.dot(vector1, vector2)

    # Calculate the magnitude (norm) of each vector
    magnitude_vector1 = np.linalg.norm(vector1)
    magnitude_vector2 = np.linalg.norm(vector2)

    # Calculate the cosine similarity
    return dot_product / (magnitude_vector1 * magnitude_vector2)
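
A quick way to check the formula is to evaluate it on vectors whose answer is known in advance. The sketch below uses a dependency-free version (`cosine_similarity_plain` is a hypothetical name, not part of the solution) so it runs without NumPy. Parallel vectors should score close to 1, orthogonal vectors 0, and opposite vectors close to -1.

```python
import math


def cosine_similarity_plain(vector1, vector2):
    """Cosine similarity using only the standard library."""
    dot_product = sum(a * b for a, b in zip(vector1, vector2))
    magnitude1 = math.sqrt(sum(a * a for a in vector1))
    magnitude2 = math.sqrt(sum(b * b for b in vector2))
    return dot_product / (magnitude1 * magnitude2)


same_direction = cosine_similarity_plain([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity_plain([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Neither version guards against a zero-length vector, which would cause a division by zero.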

### TASK: Create a function that takes in a string prompt, creates its vector embedding, and then retrieves the most similar text from the dataframe.

In [12]:
# CODE HERE

In [13]:
import numpy as np

def most_similar_text(prompt):
    prompt_embedding = embed_text(prompt)
    df['prompt_similarity'] = df['embedding'].apply(lambda vector: cosine_similarity(vector, prompt_embedding))
    return df.nlargest(1, 'prompt_similarity').iloc[0]['text']
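
The retrieval step above returns only the single best match and writes a `prompt_similarity` column into the shared `df`. As a hedged sketch of a variant, the function below ranks documents without mutating anything and returns the top `k`. The names `cosine` and `top_k_documents` are hypothetical, and the 2-D vectors are made-up toy values, not real Titan embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length sequences of numbers."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_product / (norm_u * norm_v)


def top_k_documents(query_vector, documents, k=2):
    """Return the k (score, filename) pairs with the highest similarity."""
    scored = [
        (cosine(query_vector, doc["embedding"]), doc["filename"])
        for doc in documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy embeddings, invented for illustration only.
documents = [
    {"filename": "BaseballGame.txt", "embedding": [1.0, 0.0]},
    {"filename": "BasketballGame.txt", "embedding": [0.0, 1.0]},
    {"filename": "FootballGame.txt", "embedding": [0.7, 0.7]},
]

ranked = top_k_documents([0.6, 0.8], documents, k=2)
print([filename for _, filename in ranked])
```

Returning more than one document can help when the answer is spread across several articles, at the cost of a longer prompt.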

In [14]:
most_similar_text("What was the score of the 49ers Football game?")

"Epic Clash between 49ers and Buccaneers Ends in Thrilling Showdown\n\nDate: October 15, 2023\n\nIn a highly anticipated match-up on the gridiron, the San Francisco 49ers squared off against the Tampa Bay Buccaneers on October 15, 2023, in what turned out to be a thrilling display of football prowess. With fans eagerly watching, both teams brought their A-game, resulting in an electrifying contest that will be remembered for years to come.\n\nThe final scoreline of the game read 27-24 in favor of the 49ers, but the journey to that outcome was nothing short of extraordinary.\n\nFirst Quarter Fireworks:\n\nThe first quarter set the tone for the entire game as the Buccaneers, led by their star quarterback, Tom Brady, executed a perfectly choreographed drive ending in a touchdown pass to wide receiver Mike Evans. However, the 49ers' defense, known for its tenacity, responded with a crucial interception.\n\nBack-and-Forth Battle:\n\nAs the game unfolded, both teams traded blows with pinpoin

### TASK: Combine the functions created above to accept a user prompt, retrieve the most similar text (RAG), and inject that text as context for the LLM call.

In [15]:
# CODE HERE

In [16]:
def llm_with_rag(prompt):
    
    rag_text = most_similar_text(prompt)
    
    full_prompt = f"{rag_text}\n\nANSWER THE FOLLOWING QUESTION:\n{prompt}"
    
    body = json.dumps({'inputText': full_prompt})
    
    response = bedrock_runtime.invoke_model(body=body, modelId='amazon.titan-text-express-v1')
    response_body = json.loads(response.get('body').read())
    return response_body['results'][0]['outputText']

In [17]:
llm_with_rag("What was the score of the 49ers Football game?")

'\nThe final score of the 49ers football game was 27-24.'
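
The generation step can also be separated from retrieval so that the prompt and request body are easy to inspect. The sketch below is an assumption-laden variant rather than the original solution: `build_rag_prompt`, `generate_answer`, and `FakeTextClient` are hypothetical names, the `textGenerationConfig` values are illustrative choices, and the fake client returns a canned string instead of calling Bedrock.

```python
import io
import json


def build_rag_prompt(context, question):
    """Place the retrieved context before the question, as in llm_with_rag."""
    return f"{context}\n\nANSWER THE FOLLOWING QUESTION:\n{question}"


def generate_answer(client, context, question, max_tokens=256, temperature=0.0):
    """Send the combined prompt to a Titan text model and return its output text."""
    body = json.dumps({
        "inputText": build_rag_prompt(context, question),
        # Illustrative settings; a low temperature keeps answers close to the context.
        "textGenerationConfig": {
            "maxTokenCount": max_tokens,
            "temperature": temperature,
        },
    })
    response = client.invoke_model(body=body, modelId="amazon.titan-text-express-v1")
    return json.loads(response["body"].read())["results"][0]["outputText"]


class FakeTextClient:
    """Hypothetical stand-in for the boto3 client; returns a canned answer."""

    def invoke_model(self, body, modelId):
        self.last_request = json.loads(body)
        payload = json.dumps({"results": [{"outputText": "canned answer"}]})
        return {"body": io.BytesIO(payload.encode("utf-8"))}


fake_text_client = FakeTextClient()
answer = generate_answer(
    fake_text_client,
    context="The final scoreline of the game read 27-24 in favor of the 49ers.",
    question="What was the score of the 49ers Football game?",
)
print(answer)
```

With the real `bedrock_runtime` client and the text returned by `most_similar_text`, this should behave like `llm_with_rag`, apart from the explicit generation settings.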