# RAG Tweets

This tutorial shows how to build a vector database with OpenAI and MongoDB and retrieve the relevant information from it.

## Set up a Database and API Keys

Create a [MongoDB](https://www.mongodb.com) cluster and store the connection string in a safe place, such as AWS Secrets Manager:
- key name: `connection_string`
- key value: <`the connection string`>, with the database password typed in
- secret name: `mongodb`


You also need to store your OpenAI API key in AWS Secrets Manager:
- key name: `api_key`
- key value: <`your openai api key`>
- secret name: `openai`
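
Secrets Manager stores key/value secrets as a JSON string, which is why the helper later in this tutorial calls `json.loads` on the result. The sketch below shows the shape the two secrets are assumed to have after parsing. The values are placeholders, not real credentials.

```python
import json

# Placeholder secret strings, shaped like the 'SecretString' of a key/value
# secret. The host name and key are made up for illustration.
mongodb_secret_string = '{"connection_string": "mongodb+srv://user:password@cluster.example.mongodb.net"}'
openai_secret_string = '{"api_key": "sk-placeholder"}'

mongodb_secret = json.loads(mongodb_secret_string)
openai_secret = json.loads(openai_secret_string)

# Each secret parses to a dict keyed by the key name chosen above.
print(sorted(mongodb_secret))  # ['connection_string']
print(sorted(openai_secret))   # ['api_key']
```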

## Install Python Libraries

In [1]:
pip install pymongo

Collecting pymongo
  Downloading pymongo-4.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-4.10.1
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install openai

Collecting openai
  Downloading openai-1.53.0-py3-none-any.whl.metadata (24 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Downloading openai-1.53.0-py3-none-any.whl (387 kB)
Downloading distro-1.9.0-py3-none-any.whl (20 kB)
Downloading jiter-0.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (325 kB)
Installing collected packages: jiter, distro, openai
Successfully installed distro-1.9.0 jiter-0.6.1 openai-1.53.0
Note: you may need to restart the kernel to use updated packages.


## Secrets Manager Function

In [3]:
import boto3
from botocore.exceptions import ClientError
import json

def get_secret(secret_name):
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        raise e

    secret = get_secret_value_response['SecretString']
    
    return json.loads(secret)

## Import Python Libraries and Credentials

In [31]:
import pymongo
from pymongo import MongoClient
import json
import re
import os
from openai import OpenAI


openai_api_key  = get_secret('openai')['api_key']
client = OpenAI(api_key=openai_api_key)
mongodb_connect = get_secret('mongodb')['connection_string']


## Connect to the MongoDB cluster

In [10]:
mongo_client = MongoClient(mongodb_connect)
db = mongo_client.demo  # use or create a database named demo
tweet_collection = db.tweet_test  # use or create a collection named tweet_test

## Define utility functions

In [19]:
def clean_tweet(text):
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(url_pattern, '', text)
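
A quick check of what `clean_tweet` does to a sample string. The function is repeated here so the snippet runs on its own. `clean_tweet_normalized` is a hypothetical variant, not part of the original notebook, that also collapses the double space left where the URL was.

```python
import re

URL_PATTERN = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def clean_tweet(text):
    # Same behavior as the function above: delete URLs, keep everything else.
    return re.sub(URL_PATTERN, '', text)

def clean_tweet_normalized(text):
    # Hypothetical variant: also collapse runs of whitespace into one space.
    return ' '.join(clean_tweet(text).split())

sample = "Check this out https://t.co/abc123 now"
print(repr(clean_tweet(sample)))             # 'Check this out  now'
print(repr(clean_tweet_normalized(sample)))  # 'Check this out now'
```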

In [28]:
EMBEDDING_MODEL= 'text-embedding-3-small'

def get_embedding(text):

    try:
        embedding = client.embeddings.create(input=text, model=EMBEDDING_MODEL).data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error in get_embedding: {e}")
        return None

In [41]:
model = 'gpt-4o'
temperature = 0

def openai_help(messages, model=model, temperature=temperature):
    # Send the chat messages to the model and return the text of the first choice
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature
    )
    return response.choices[0].message.content

In [63]:
def vector_search(user_query):

    query_embedding = get_embedding(user_query)
    if query_embedding is None:
        # Return an empty list so callers can still iterate over the result
        return []
    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "tweet.embedding",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 10  # Return the top 10 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "tweet.text": 1, 

            }
        }
    ]

    results = tweet_collection.aggregate(pipeline)
    return list(results)
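
To see how close each match is, the `$project` stage can expose the similarity score through `$meta`. The sketch below builds the same pipeline as a function and adds a `score` field. `build_pipeline` is a hypothetical helper name, and the index name and path are copied from the function above. It only constructs the pipeline and does not query the database.

```python
def build_pipeline(query_embedding, limit=10, num_candidates=150):
    # Hypothetical helper: returns the aggregation pipeline without running it.
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "tweet.embedding",
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "tweet.text": 1,
                # Similarity score computed by the $vectorSearch stage
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]

# A dummy all-zero vector stands in for a real query embedding.
pipeline = build_pipeline([0.0] * 1536, limit=5)
print(pipeline[0]["$vectorSearch"]["limit"])  # 5
```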

## Embed the tweets

In [39]:
from tqdm.auto import tqdm
tweets = tweet_collection.find()

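for tweet in tqdm(tweets):
    try:
        tweet_embedding = get_embedding(clean_tweet(tweet['tweet']['text']))
        if tweet_embedding is None:
            # get_embedding already printed the error; do not store a null embedding
            continue

        tweet_collection.update_one(
            {'tweet.id': tweet['tweet']['id']},
            {"$set": {'tweet.embedding': tweet_embedding}}
        )
    except Exception as e:
        # Use _id here because a missing tweet.id may be what raised the error
        print(f"error in embedding tweet {tweet.get('_id')}: {e}")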


0it [00:00, ?it/s]
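
The loop above makes one API request per tweet. The embeddings endpoint also accepts a list of strings, so texts can be grouped before sending. The sketch below covers only the grouping step. `batched` and the batch size of 3 are illustrative assumptions, and the API call is shown in a comment rather than executed.

```python
def batched(items, batch_size):
    # Yield consecutive slices of at most batch_size items.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"tweet {i}" for i in range(7)]
batches = list(batched(texts, 3))
print([len(b) for b in batches])  # [3, 3, 1]

# Each batch could then be sent in a single request, for example:
# response = client.embeddings.create(input=batch, model=EMBEDDING_MODEL)
# embeddings = [item.embedding for item in response.data]
```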

## Create a vector index

At the time of writing, the vector index can only be created manually on the MongoDB website, using the following JSON:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "tweet.embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}
```
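
The `"similarity": "cosine"` setting means documents are ranked by the cosine of the angle between their embedding and the query embedding. The sketch below illustrates that ranking with a brute-force scan over tiny made-up vectors. It is a conceptual illustration only; the database uses its own index rather than scanning every document, and real embeddings here have 1536 dimensions.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, docs, k):
    # Score every document against the query and keep the k best.
    scored = [(cosine_similarity(query, vec), name) for name, vec in docs.items()]
    return sorted(scored, reverse=True)[:k]

# Toy 2-dimensional "embeddings".
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = top_k([1.0, 0.0], docs, 2)
print([name for _, name in ranked])  # ['a', 'c']
```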

## Query the database 

In [66]:
user_query = 'what do people talk about trump'

for tweet in vector_search(user_query):
    print(tweet['tweet']['text'])

RT @SethAbramson: RETWEET this. Every day. Until Election Day.

Anyone anywhere in America who tells you Donald Trump will be better for th…
RT @SethAbramson: RETWEET this. Every day. Until Election Day.

Anyone anywhere in America who tells you Donald Trump will be better for th…
RT @SethAbramson: RETWEET this. Every day. Until Election Day.

Anyone anywhere in America who tells you Donald Trump will be better for th…
RT @SethAbramson: RETWEET this. Every day. Until Election Day.

Anyone anywhere in America who tells you Donald Trump will be better for th…
RT @SethAbramson: RETWEET this. Every day. Until Election Day.

Anyone anywhere in America who tells you Donald Trump will be better for th…
RT @SethAbramson: RETWEET this. Every day. Until Election Day.

Anyone anywhere in America who tells you Donald Trump will be better for th…
RT @SethAbramson: RETWEET this. Every day. Until Election Day.

Anyone anywhere in America who tells you Donald Trump will be better for th…
RT @SethAbram
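
The results above are dominated by copies of a single retweet. One possible remedy, not part of the original notebook, is to drop repeated texts before printing them or passing them to a model. `dedupe_tweets` is a hypothetical helper that assumes the result shape returned by `vector_search`. The sample data is made up.

```python
def dedupe_tweets(results):
    # Keep the first occurrence of each tweet text, preserving rank order.
    seen = set()
    unique = []
    for doc in results:
        text = doc["tweet"]["text"]
        if text not in seen:
            seen.add(text)
            unique.append(doc)
    return unique

results = [
    {"tweet": {"text": "RT @user: same message"}},
    {"tweet": {"text": "RT @user: same message"}},
    {"tweet": {"text": "a different message"}},
]
print(len(dedupe_tweets(results)))  # 2
```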

## Chatbot

In [67]:
from openai import OpenAI

delimiter = '###'
openai_api_key  = get_secret('openai')['api_key']
client = OpenAI(api_key=openai_api_key)
model = 'gpt-4o'
temperature = 0

chat_history = [
{"role": "system", "content": """You are a chatbot that answers user questions based on the returned tweets."""}
]

def chatbot(prompt):

    chat_history.append({"role": "user", "content": prompt})
    
    tweets = vector_search(prompt)
    chat_history.append({"role": "system", "content": f"Here are the returned tweets, delimited by {delimiter}: {delimiter}{tweets}{delimiter}"})

    response = client.chat.completions.create(
        model=model,  # Use the model you prefer
        messages=chat_history,
        temperature=temperature
    )

    reply = response.choices[0].message.content

    chat_history.append({"role": "assistant", "content": reply})
    
    return reply
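
The `chatbot` function interpolates the raw Python list of result documents into the prompt. A possible alternative is to pass only the tweet texts, one per line, between the delimiters. `format_context` is a hypothetical helper, and the sample tweets are made up.

```python
def format_context(tweets, delimiter="###"):
    # Join the tweet texts one per line and wrap them in the delimiter.
    texts = [doc["tweet"]["text"] for doc in tweets]
    body = "\n".join(f"- {text}" for text in texts)
    return f"{delimiter}\n{body}\n{delimiter}"

tweets = [
    {"tweet": {"text": "first tweet"}},
    {"tweet": {"text": "second tweet"}},
]
context = format_context(tweets)
print(context)
```

The returned string could replace `{delimiter}{tweets}{delimiter}` in the system message that `chatbot` appends.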

In [None]:
while True:
    user_input = input("You: ")
    if user_input.lower() in ['exit', 'quit']:
        print("Chatbot: Goodbye!")
        break
    reply = chatbot(user_input)
    print(f"Chatbot: {reply}")

You:  what do people talk about trump


Chatbot: Based on the returned tweets, it seems that people are discussing the idea that Donald Trump will be better for the country, with a strong emphasis on encouraging others to retweet this message every day until Election Day. The repeated retweets suggest a concerted effort to challenge or debunk this claim about Trump's potential impact on America.


You:  how people compare harris vs trump


Chatbot: Based on the returned tweets, there appears to be some discussion centered around the challenge of finding negative stories about Kamala Harris on Google, as noted by Joe Rogan. This may suggest a perception of media bias or imbalance in coverage. In contrast, there is also an ongoing effort to challenge the notion that Donald Trump is better for America, highlighted by repeated calls to retweet messages against this assertion daily until Election Day. There isn't a direct comparison between Harris and Trump in these tweets, but themes of media coverage and perceptions of each figure's impact or reputation seem to be at play.
