# Create a Vector Database for LBSocial

## Set up a Database and API Keys

Create a [MongoDB](https://www.mongodb.com) cluster and store the connection string in a safe place, such as AWS Secrets Manager:
- key name: `connection_string`
- key value: <`the connection string`>, with your database password typed in
- secret name: `mongodb`


You also need to purchase an [OpenAI](https://openai.com/) API key and store it in AWS Secrets Manager:
- key name: `api_key`
- key value: <`your openai api key`>
- secret name: `openai`

Store your Google API key in the same way:

- key name: `api_key`
- key value: <`your google api key`>
- secret name: `google`
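
The three secrets above can be created in the AWS console. As an alternative, the sketch below shows how one of them might be created from Python. It assumes a boto3 Secrets Manager client is passed in; `build_secret_string` and `store_secret` are hypothetical helper names, not part of the original notebook.

```python
import json


def build_secret_string(key_name, key_value):
    """Serialize one key/value pair as the JSON string Secrets Manager stores."""
    return json.dumps({key_name: key_value})


def store_secret(secrets_client, secret_name, key_name, key_value):
    """Create a secret through an already-constructed Secrets Manager client."""
    return secrets_client.create_secret(
        Name=secret_name,
        SecretString=build_secret_string(key_name, key_value),
    )


# Example (needs AWS credentials, so it is left commented out):
# import boto3
# secrets_client = boto3.client("secretsmanager", region_name="us-east-1")
# store_secret(secrets_client, "openai", "api_key", "<your openai api key>")
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before touching a real AWS account.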

## Install Python Libraries

- pymongo: manage the MongoDB database
- openai: create embeddings and responses
- youtube_transcript_api, google-api-python-client: retrieve YouTube video captions, titles, and URLs
- boto3: read secrets from AWS Secrets Manager

In [None]:
%pip install pymongo openai youtube_transcript_api google-api-python-client boto3 tqdm -q

## Secrets Manager Function

In [None]:
import boto3
from botocore.exceptions import ClientError
import json

def get_secret(secret_name):
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        raise e

    secret = get_secret_value_response['SecretString']
    
    return json.loads(secret)

## Import Python Libraries and Credentials

In [None]:
import pymongo
from pymongo import MongoClient
import json
import re
import os
from openai import OpenAI
openai_api_key  = get_secret('openai')['api_key']
client = OpenAI(api_key=openai_api_key)


mongodb_connect = get_secret('mongodb')['connection_string']


## Connect to the MongoDB cluster

In [None]:
mongo_client = MongoClient(mongodb_connect)
db = mongo_client.video  # use or create a database named video
collection = db.video_collection  # use or create a collection named video_collection

Sample data

```json
{
"title":"LBSocial About",
"text":"At LBSocial, we provide an innovative online learning experience, combining generative AI and cloud computing to help you master data science.",
"url":"https://www.lbsocial.net/about"
}
```
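
A document of this shape could be written to the collection with an upsert keyed on `url`, so that re-running the notebook does not create duplicates. This is a minimal sketch: `REQUIRED_FIELDS`, `missing_fields`, and `upsert_document` are hypothetical names, and the `collection` argument is assumed to be a pymongo collection such as the one created above.

```python
REQUIRED_FIELDS = ("title", "text", "url")


def missing_fields(document):
    """Return the required fields that are absent or empty in the document."""
    return [field for field in REQUIRED_FIELDS if not document.get(field)]


def upsert_document(collection, document):
    """Insert the document, or update the stored one that has the same URL."""
    missing = missing_fields(document)
    if missing:
        raise ValueError(f"document is missing fields: {missing}")
    return collection.update_one(
        {"url": document["url"]},  # match an existing document by URL
        {"$set": document},        # overwrite or add the fields
        upsert=True,               # insert when nothing matches
    )
```

Keying on `url` is a design choice for this sketch; any field that uniquely identifies a page or video would work as well.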

## Retrieve YouTube Videos

In [None]:
from youtube_transcript_api import YouTubeTranscriptApi
from googleapiclient.discovery import build

# YouTube API setup
google_api_key = get_secret('google')['api_key']
youtube = build("youtube", "v3", developerKey=google_api_key)


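def get_video_captions(video_id):
    """Return the caption segments for a video, or a fallback message on failure."""
    try:
        # get_transcript is the static interface of youtube_transcript_api;
        # newer releases of the library may expose a different call.
        captions = YouTubeTranscriptApi.get_transcript(video_id)
        return captions
    except Exception as e:
        print(e)
        return "No captions available"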

def get_playlist_videos(playlist_id):
    videos = []
    next_page_token = None
    
    while True:
        playlist_items = youtube.playlistItems().list(
            part="snippet",
            playlistId=playlist_id,
            maxResults=50,
            pageToken=next_page_token
        ).execute()
        
        for item in playlist_items["items"]:
            video_id = item["snippet"]["resourceId"]["videoId"]
            title = item["snippet"]["title"]
            url = f"https://www.youtube.com/watch?v={video_id}"
            captions = get_video_captions(video_id)
            videos.append({
                "title": title,
                "url": url,
                "video_id": video_id,
                "captions": captions
            })
        
        next_page_token = playlist_items.get("nextPageToken")
        if not next_page_token:
            break
    
    return videos





In [None]:
video_id = 'x5CJTaQJWgs'

get_video_captions(video_id)

In [None]:
# Example usage
playlist_id = "PLHutrxqbP1BzzTi8odV40RhLZQjK8Iy6_"
playlist_videos = get_playlist_videos(playlist_id)

for video in playlist_videos:
    print(f"Title: {video['title']}")
    print(f"URL: {video['url']}")
    print(f"Captions: {video['captions'][:100]}...")  # First 100 caption segments (or characters of the fallback message)
    print("---")
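
The playlist records hold captions as a list of segments rather than as one string, so they do not yet match the `title`/`text`/`url` shape of the sample document. The sketch below shows one way the conversion might look. It assumes each segment is a dict with a `"text"` key, and that the fallback string `"No captions available"` should pass through unchanged; `captions_to_text` and `video_to_document` are hypothetical helper names.

```python
def captions_to_text(captions):
    """Join caption segments into one string; pass plain strings through."""
    if isinstance(captions, str):
        return captions
    return " ".join(segment["text"].strip() for segment in captions)


def video_to_document(video):
    """Reshape a playlist record into the title/text/url document format."""
    return {
        "title": video["title"],
        "text": captions_to_text(video["captions"]),
        "url": video["url"],
    }
```

With the text flattened this way, slicing `text[:100]` really does give the first 100 characters.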
    

## Utility Functions


- the `get_embedding` function uses OpenAI to create tweet embeddings
- the `vector_search` function returns relevant tweets based on a query

In [None]:
embedding_model = 'text-embedding-3-small'

def get_embedding(text):

    try:
        embedding = client.embeddings.create(input=text, model=embedding_model).data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error in get_embedding: {e}")
        return None
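
Embedding models accept only a bounded amount of text per request, and a full video transcript may be longer than that. One common workaround is to split long text into overlapping chunks and embed each chunk separately. The sketch below is an illustration, not part of the original notebook: `chunk_text` is a hypothetical helper, and counting words is only a rough stand-in for counting tokens.

```python
def chunk_text(text, max_words=200, overlap=20):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in the range [0, max_words)")

    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the current chunk already reaches the end of the text
    return chunks
```

Each chunk could then be passed to `get_embedding` on its own. The overlap keeps sentences that straddle a boundary from being cut off in both chunks.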

In [None]:
def vector_search(query):

    query_embedding = get_embedding(query)
    if query_embedding is None:
        return "Invalid query or embedding generation failed."
    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "tweet.embedding",
                "numCandidates": 1000,  # Number of candidate matches to consider
                "limit": 10  # Return top 10 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "tweet.text": 1 # return tweet text

            }
        }
    ]

    results = tweet_collection.aggregate(pipeline)
    return list(results)
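
To make the idea behind `$vectorSearch` concrete, the sketch below ranks a few in-memory documents by cosine similarity to a query vector. It is an exact brute-force illustration only, whereas the database searches an index and may use approximate methods. `cosine_similarity` and `top_k` are hypothetical names, and the flat `"embedding"` key is an assumption made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=10):
    """Return the k documents whose embeddings are closest to the query."""
    scored = [
        (cosine_similarity(query_vector, doc["embedding"]), doc)
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]
```

Here `k` plays the role of `limit` in the pipeline above, which returns only the best-scoring matches.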

## Tweet Embeddings

For more about text embeddings, please read [Introducing text and code embeddings](https://openai.com/index/introducing-text-and-code-embeddings/).
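
The embedding cell in this section calls `clean_tweet`, which is not defined anywhere in this notebook. The sketch below is a hypothetical stand-in, assuming that cleaning means removing URLs and @mentions, dropping the `#` from hashtags, and collapsing whitespace. The original helper may well behave differently.

```python
import re


def clean_tweet(text):
    """Strip URLs, @mentions, and '#' signs, then normalize whitespace."""
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = text.replace("#", "")               # keep hashtag words, drop '#'
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
```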

In [None]:
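from tqdm.auto import tqdm

# Assumes `tweet_collection` (a pymongo collection of tweets) and `clean_tweet`
# (a text-cleaning helper) are already defined.
tweets = tweet_collection.find()

for tweet in tqdm(list(tweets)):
    try:
        tweet_embedding = get_embedding(clean_tweet(tweet['tweet']['text']))
        if tweet_embedding is None:
            continue  # get_embedding already reported the error

        # Store the embedding alongside the tweet it was computed from
        tweet_collection.update_one(
            {'tweet.id': tweet['tweet']['id']},
            {"$set": {'tweet.embedding': tweet_embedding}}
        )
    except Exception as e:
        print(f"error in embedding tweet {tweet['tweet']['id']}: {e}")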


## Create a Vector Index

For more about MongoDB vector databases, please read [What are Vector Databases?](https://www.mongodb.com/resources/basics/databases/vector-databases)

[MongoDB 8.0](https://www.mongodb.com/resources/products/mongodb-version-history?utm_source=Iterable&utm_medium=email&utm_campaign=campaign_11551932#mdbeightzero) will be available on the free tier in Nov 2024. You may create the vector index in Python after the update.

Right now, if you are using a free-tier MongoDB cluster, you can only create a vector index manually on the MongoDB website with the following JSON.

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "tweet.embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}
```
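
On a cluster that does allow index creation from a driver, the same definition could be built and submitted from Python. The sketch below is a hedged illustration: it assumes pymongo 4.7 or later (where `SearchIndexModel` accepts a `type` argument), and `build_vector_index_definition` and `create_vector_index` are hypothetical helper names. The index name `vector_index` matches the one used in `vector_search`.

```python
def build_vector_index_definition(path="tweet.embedding",
                                  num_dimensions=1536,
                                  similarity="cosine"):
    """Return the same index definition as the JSON above, as a dict."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, name="vector_index"):
    """Submit the definition to the cluster (assumes pymongo >= 4.7)."""
    # Imported lazily so the definition builder works without pymongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=build_vector_index_definition(),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model=model)
```

The value of `numDimensions` has to match the length of the stored embeddings, which is 1536 for `text-embedding-3-small` at its default size.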

In [None]:
user_query = 'support Trump'

for tweet in vector_search(user_query):
    print(tweet['tweet']['text'])

## Retrieval-Augmented Generation (RAG) 

For more about RAG, please read [Using OpenAI Latest Embeddings in a RAG System With MongoDB](https://www.mongodb.com/library/use-cases/ai-resource/using-openai-latest-embeddings).

In [None]:
from openai import OpenAI

delimiter = '###'
chat_model = 'gpt-4o'
temperature = 0

chat_history = [{"role": "system", "content": """You are a chatbot that answers user questions based on the returned tweets."""}]

def chatbot(prompt):

    chat_history.append({"role": "user", "content": prompt})
    
    tweets = vector_search(prompt)
    chat_history.append({"role": "system", "content": f"Here are the returned tweets, delimited by {delimiter}: {delimiter}{tweets}{delimiter}"})

    response = client.chat.completions.create(
        model=chat_model,  # Use the model you prefer
        messages=chat_history,
        temperature=temperature  # 0 keeps answers as deterministic as possible
    )

    reply = response.choices[0].message.content

    chat_history.append({"role": "assistant", "content": reply})
    
    return reply
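
Because `chatbot` appends the user prompt, the retrieved tweets, and the reply on every turn, `chat_history` grows without bound during a long session. One possible mitigation, sketched below, is to keep the first (system) message plus only the most recent messages. `trim_history` is a hypothetical helper and the default of 10 messages is an arbitrary assumption.

```python
def trim_history(history, max_messages=10):
    """Keep the first message plus the most recent `max_messages` others."""
    if not history:
        return []
    head, tail = history[:1], history[1:]
    if max_messages <= 0:
        return head
    return head + tail[-max_messages:]
```

It could be applied just before the API call, for example `messages=trim_history(chat_history)`, at the cost of the model forgetting older turns.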

In [None]:
while True:
    user_input = input("You: ")
    if user_input.lower() in ['exit', 'quit']:
        print("Chatbot: Goodbye!")
        break
    reply = chatbot(user_input)
    print(f"Chatbot: {reply}")

## References

- *“Introducing Text and Code Embeddings.”* n.d. OpenAI. Accessed October 31, 2024. https://openai.com/index/introducing-text-and-code-embeddings/.
- Richmond Alake. 2024. *“Using OpenAI Latest Embeddings in a RAG System with MongoDB.”* MongoDB. July 1, 2024. https://www.mongodb.com/library/use-cases/ai-resource/using-openai-latest-embeddings.
- *“What Are Vector Databases?”* n.d. MongoDB. Accessed October 31, 2024. https://www.mongodb.com/resources/basics/databases/vector-databases.
