## Creating notes based on LOS, Summary, and Introduction

In [1]:
# Pre-create the LOS knowledge base
import os
from pathlib import Path

import pandas as pd
import snowflake.connector
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv(override=True)


In [2]:

# Snowflake config
snowflake_user = os.getenv('SNOWFLAKE_USER')
snowflake_password = os.getenv('SNOWFLAKE_PASSWORD')
snowflake_account = os.getenv('SNOWFLAKE_ACCOUNT')
database = os.getenv('SNOWFLAKE_DATABASE')
schema = os.getenv('SNOWFLAKE_SCHEMA')
warehouse = os.getenv('SNOWFLAKE_WAREHOUSE')

def get_3_data():
    """Fetch the topic, introduction, learning outcomes, and summary of each CFA course."""
    df = pd.DataFrame()
    conn = None
    cursor = None
    try:
        conn = snowflake.connector.connect(
            user=snowflake_user,
            password=snowflake_password,
            account=snowflake_account,
            warehouse=warehouse,
            database=database,
            schema=schema,
        )

        # run the SQL query to get data from Snowflake
        cursor = conn.cursor()
        sql = "SELECT NAMEOFTOPIC, INTRODUCTION, LEARNINGOUTCOME, SUMMARY FROM CFA_COURSES;"

        results = cursor.execute(sql)
        if results is not None:
            df = pd.DataFrame(
                results.fetchall(),
                columns=['Topic', 'Introduction', 'Learning outcomes', 'Summary'],
            )
        else:
            print('Failed to get data. Try again.')
    except snowflake.connector.Error as e:
        print(f'Program failed: {e}')
    finally:
        # close only what was actually opened, so a failed connect() does not raise here
        if cursor is not None:
            cursor.close()
        if conn is not None:
            conn.close()

    return df

df = get_3_data()
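
The connection settings in this cell all come from environment variables, so a missing entry in `.env` only surfaces when `connect()` fails. A minimal sketch of checking them up front is shown here; `missing_env` and `REQUIRED_VARS` are hypothetical names introduced for illustration and are not part of the notebook.

```python
import os

# Variable names taken from the Snowflake config cell.
REQUIRED_VARS = [
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_DATABASE",
    "SNOWFLAKE_SCHEMA",
    "SNOWFLAKE_WAREHOUSE",
]

def missing_env(names, env=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing `env` explicitly keeps the helper easy to exercise without touching the real environment.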


In [3]:

def creat_knowledge(topic, introduction, summary, los):

    client = OpenAI()
    response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[
        {"role": "system", "content": """You are a helpful assistant that generates technical notes for CFA Refresher Readings articles. Based on the information provided by the user, create a technical note in Markdown format that summarizes only the key learning outcomes
under the "Summary of Key Learning Outcome" heading. Be sure to include any tables or equations necessary. Do not provide any introduction or conclusion. The "Summary of Key Learning Outcome" section should use a bullets-only format, which will be used as a knowledge base for QA prompts.
Follow the Markdown file structure below:
## Summary of Key Learning Outcome
- **Bullet point for each learning outcome:** 
    Paragraph summarizing this learning outcome.
- **Bullet point for each learning outcome:** 
    Paragraph summarizing this learning outcome."""},
        {"role": "user", "content": f"""## Topic\n{topic}\n\n## Introduction\n{introduction}\n\n## Summary\n{summary}\n\n## Learning Outcomes\n{los}\n"""}
    ]
    )
    # print(response.choices[0].message.content)
    return response

In [43]:

# topic, intro, los, summary = df.loc[0]
# response = creat_knowledge(topic, intro, summary, los)

In [4]:
response_list = []
for i in range(4):
    topic, intro, los, summary = df.loc[i]
    resp = creat_knowledge(topic, intro, summary, los)
    response_list.append(
        {
            'topic': topic,
            'los': los,
            'note':resp.choices[0].message.content
        })
print(response_list)


[{'topic': 'Time Value of Money in Finance', 'los': 'interpret interest rates as required rates of return, discount rates, or opportunity costs;\nexplain an interest rate as the sum of a real risk-free rate and premiums that compensate investors for bearing distinct types of risk;\ncalculate and interpret the effective annual rate, given the stated annual interest rate and the frequency of compounding;\nsolve time value of money problems for different frequencies of compounding;\ncalculate and interpret the future value (FV) and present value (PV) of a single sum of money, an ordinary annuity, an annuity due, a perpetuity (PV only), and a series of unequal cash flows;\ndemonstrate the use of a time line in modeling and solving time value of money problems.', 'note': '## Summary of Key Learning Outcome\n- **Interpretation of Interest Rates:** \n    Understanding interest rates as required rates of return, discount rates, or opportunity costs is essential for evaluating investments and f

In [7]:
print(response_list[3]['los'])

define a random variable, an outcome, and an event;
identify the two defining properties of probability, including mutually exclusive and exhaustive events, and compare and contrast empirical, subjective, and a priori probabilities;
describe the probability of an event in terms of odds for and against the event;
calculate and interpret conditional probabilities;
demonstrate the application of the multiplication and addition rules for probability;
compare and contrast dependent and independent events;
calculate and interpret an unconditional probability using the total probability rule;
calculate and interpret the expected value, variance, and standard deviation of random variables;
explain the use of conditional expectation in investment applications;
interpret a probability tree and demonstrate its application to investment problems;
calculate and interpret the expected value, variance, standard deviation, covariances, and correlations of portfolio returns;
calculate and interpret the

In [8]:
# gather the output and write it to a Markdown file

md_lines = ''
for i in range(4):
    md_lines += response_list[i]['note']
    md_lines += '\n\n'
# print(md_lines)
with open("../data/LOS_summary.md", "w", encoding="utf-8") as file:
    file.write(md_lines)


## Split the notes based on token length

**Split LOS with ';'**
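
The LOS text stored for each topic is a single string with one statement per `;`. A minimal sketch of the split is below; `split_los` is a hypothetical helper name, and the sample string is abbreviated from the learning outcomes printed earlier.

```python
def split_los(los_text):
    """Split a semicolon-delimited LOS string into trimmed, non-empty statements."""
    return [part.strip() for part in los_text.split(";") if part.strip()]

sample = (
    "define a random variable, an outcome, and an event;\n"
    "calculate and interpret conditional probabilities;"
)
for statement in split_los(sample):
    print(statement)
```

Filtering on `part.strip()` drops the empty fragment left by a trailing semicolon.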

**Split Technical Note with token**

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
import tiktoken
from uuid import uuid4
from tqdm.auto import tqdm


In [10]:
tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

In [11]:
# tokenize a str

tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

# def split_text_by_semicolon(data):# -> list[Any]:
#     list_chunks = []
#     # split LOS by ;
#     for i,text in enumerate(data):
#         chunks = [chunk.strip() for chunk in text['los'].split(';') if chunk.strip()]
#         list_chunks.append(chunks)
#     return list_chunks


In [12]:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
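
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking. It is a fixed-stride sliding window, not the recursive separator-based algorithm that `RecursiveCharacterTextSplitter` uses, and whitespace-separated words stand in for tiktoken tokens. `chunk_with_overlap` is a hypothetical helper name.

```python
def chunk_with_overlap(units, chunk_size, chunk_overlap):
    """Fixed-stride sliding window over a list of units (hypothetical helper)."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + chunk_size])
        if start + chunk_size >= len(units):
            break  # the last window already reaches the end
    return chunks

# Words are used as a stand-in for tokens in this illustration.
words = "interest rates can be read as required rates of return".split()
for chunk in chunk_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

Each chunk repeats the last `chunk_overlap` units of the previous one, so a sentence cut at a boundary still appears with some context in the next chunk.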

In [13]:
note_chunks = text_splitter.split_text(md_lines)
print(note_chunks[0])

# los_chunks = split_text_by_semicolon(response_list)

## Summary of Key Learning Outcome
- **Interpretation of Interest Rates:** 
    Understanding interest rates as required rates of return, discount rates, or opportunity costs is essential for evaluating investments and financial decisions.
- **Calculation of Effective Annual Rate:** 
    Being able to calculate and interpret the effective annual rate based on the stated annual interest rate and compounding frequency helps in comparing different investment options accurately.
- **Time Value of Money Calculations:**


## Creating Embedding

In [14]:
import os

# get openai api key from platform.openai.com
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or 'YOUR_API_KEY'

In [15]:
from langchain_openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY  # type: ignore
)

In [16]:

note_res = embed.embed_documents(note_chunks)
# los_res = embed.embed_documents(los_chunks)
len(note_res) # directly convert into a vector

16

## Upload to Pinecone

In [8]:
from pinecone import Pinecone
from dotenv import load_dotenv
load_dotenv()
import os

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.getenv("PINECONE_API_KEY")

# configure client
pc = Pinecone(api_key=api_key)


In [9]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1" 
)

In [10]:
import time

index_name = 'damg-group3-assignment5-step1'
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 25}, 'TechNote': {'vector_count': 44}},
 'total_vector_count': 69}
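
The index is created with `metric='dotproduct'`. As a rough illustration of what that ranking means, the sketch below scores a few toy 2-D vectors against a query by dot product and keeps the top matches. The vectors, ids, and helper names are made up for illustration; this is not how Pinecone implements search internally. For unit-length vectors the dot product equals cosine similarity, which is why the toy vectors are normalized first.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def top_k(query, vectors, k):
    """Return the k (id, score) pairs with the highest dot product."""
    scored = [(vid, dot(query, vec)) for vid, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy data, purely illustrative.
vectors = {
    "a": normalize([1.0, 0.0]),
    "b": normalize([1.0, 1.0]),
    "c": normalize([0.0, 1.0]),
}
query = normalize([1.0, 0.2])
for vid, score in top_k(query, vectors, k=2):
    print(vid, round(score, 3))
```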

### Upload to Pinecone

In [34]:
from tqdm.auto import tqdm
from uuid import uuid4

batch_limit = 100
namespace = 'TechNote'

texts = []
metadatas = []

for i, record in enumerate(tqdm(response_list)):
    # first get metadata fields for this record
    metadata = {
        'topic': record['topic'],
        'LOS': record['los'],
        'Note': record['note']
    }
    # prepend the LOS to the note, then split the combined text into chunks
    # (a local variable is used so re-running the cell does not prepend the LOS twice)
    combined_text = record['los'] + '\n\n' + record['note']
    record_texts = text_splitter.split_text(combined_text)
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit, embed and upload the batch
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        # use the same namespace as the final flush so all chunks land together
        index.upsert(vectors=list(zip(ids, embeds, metadatas)), namespace=namespace)
        texts = []
        metadatas = []

# upload whatever is left after the loop
if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=list(zip(ids, embeds, metadatas)), namespace=namespace)

100%|██████████| 4/4 [00:00<00:00, 387.15it/s]
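
The upload cell flushes a batch whenever the buffer reaches `batch_limit` and once more at the end. The same idea can be sketched as a small generator that slices a list into fixed-size batches; `batched` is a hypothetical helper name here, and the integers stand in for the `(id, embedding, metadata)` tuples.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in data; in the notebook each item would be an (id, embedding, metadata) tuple.
vectors = list(range(5))
for batch in batched(vectors, batch_size=2):
    print(len(batch), batch)
```

With this shape, each batch could be passed to `index.upsert(vectors=batch, namespace='TechNote')`, so the namespace is stated in one place only.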


In [35]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 25}, 'TechNote': {'vector_count': 44}},
 'total_vector_count': 69}

## Query the Data

In [3]:

from langchain_openai import OpenAIEmbeddings
import openai

model_name = 'text-embedding-ada-002'
query = "Give me a learning outcome about Time Value of Money in Finance"


embedres = openai.embeddings.create(
    model=model_name,
    input=query
)


In [11]:

print(type(embedres))
print(embedres)

# retrieve from Pinecone
embedding_vector = embedres.data[0].embedding

# get relevant contexts (including the questions)
res = index.query(vector=embedding_vector, top_k=3, include_metadata=True,namespace='TechNote')

<class 'openai.types.create_embedding_response.CreateEmbeddingResponse'>
CreateEmbeddingResponse(data=[Embedding(embedding=[-0.014256772585213184, -0.011405417695641518, 0.02540297619998455, -0.025571465492248535, -0.03989304229617119, -0.0038136865478008986, -0.014282694086432457, -0.0023912496399134398, -0.038182228803634644, -0.02452164888381958, 0.011301732622087002, 0.022292407229542732, -0.0053009274415671825, 0.0032061536330729723, 0.017846887931227684, 0.009577958844602108, 0.021138904616236687, -0.008249486796557903, 0.001997568178921938, -0.025143763050436974, -0.01778208278119564, -0.010005662217736244, -0.027113789692521095, -0.007296875584870577, 0.0062567791901528835, 0.00643498869612813, 0.008962325751781464, -0.006285940762609243, 0.01226082444190979, 0.009584438987076283, 0.024055063724517822, -0.006412307266145945, -0.017393263056874275, -0.023394066840410233, -0.0032984986901283264, -0.003528551198542118, 0.012228422798216343, 0.006817329209297895, 0.0329201854765415

In [12]:
print(res)

{'matches': [{'id': '7f9affbd-d084-44f5-bd96-e4d992980107',
              'metadata': {'LOS': 'interpret interest rates as required rates '
                                  'of return, discount rates, or opportunity '
                                  'costs;\n'
                                  'explain an interest rate as the sum of a '
                                  'real risk-free rate and premiums that '
                                  'compensate investors for bearing distinct '
                                  'types of risk;\n'
                                  'calculate and interpret the effective '
                                  'annual rate, given the stated annual '
                                  'interest rate and the frequency of '
                                  'compounding;\n'
                                  'solve time value of money problems for '
                                  'different frequencies of compounding;\n'
                           

## Ask a question based on the tech note

In [65]:
def extract_metadata(response):
    # Initialize a list to store all extracted metadata as dictionaries
    extracted_data = []

    # Check if the response contains any matches
    if response and response.get('matches'):
        # Iterate over each match found in the response
        for match in response['matches']:
            # Retrieve the metadata dictionary from the current match
            metadata = match.get('metadata', {})

            # Extract the LOS, Note, and text from the metadata
            # Provide default values if any key is not found
            los = metadata.get('LOS', 'No LOS provided')
            note = metadata.get('Note', 'No Note provided')
            text = metadata.get('text', 'No text provided')

            # Create a dictionary with the extracted data
            data_dict = {
                'LOS': los,
                'Note': note,
                'Text': text
            }

            # Add the dictionary to the list of extracted data
            extracted_data.append(data_dict)
    else:
        print("No matches found in the response.")

    # Return the list containing dictionaries of the extracted data
    return extracted_data



In [64]:
# Call the function and print the results
extracted_metadata = extract_metadata(res)
# for item in extracted_metadata:
#     print(item['LOS'])


In [74]:
def ask_question(notedata, question):

    client = OpenAI()
    response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[
            {"role": "system", "content": f"Use the background knowledge here: {notedata}"},
            {"role": "user", "content": f"Answer the question based on the system content I provide: {question}"}
        ]
    )
    # print(response.choices[0].message.content)
    return response

In [82]:
notedata = ''
for data in extracted_metadata:
    note = data['Note']
    notedata += note


question = 'Tell me what is Time Value of Money in Finance'

response = ask_question(notedata,question)
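
Every chunk uploaded from the same topic carries the full note in its `Note` metadata, so when several of the top matches come from one topic, the loop that builds `notedata` concatenates the same note more than once. A minimal sketch of removing those repeats while keeping the match order follows; `unique_notes` is a hypothetical helper, and the sample list mimics the shape returned by `extract_metadata`.

```python
def unique_notes(items):
    """Return each distinct 'Note' once, in first-seen order (hypothetical helper)."""
    seen = set()
    notes = []
    for item in items:
        note = item["Note"]
        if note not in seen:
            seen.add(note)
            notes.append(note)
    return notes

# Stand-in for extracted_metadata: two matches from the same topic, one from another.
sample_matches = [
    {"LOS": "los A", "Note": "note A", "Text": "chunk 1"},
    {"LOS": "los A", "Note": "note A", "Text": "chunk 2"},
    {"LOS": "los B", "Note": "note B", "Text": "chunk 3"},
]
context = "\n\n".join(unique_notes(sample_matches))
print(context)
```

A shorter context leaves more of the model's token budget for the answer and avoids paying for repeated text.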

In [83]:
print(response.choices[0].message.content)

Time Value of Money (TVM) is a foundational concept in finance that highlights the idea that a sum of money today is worth more than the same sum in the future due to its potential earning capacity. In essence, the concept of TVM recognizes that a dollar received today can be invested and earn a return, making it worth more than a dollar received in the future.

The core principles of Time Value of Money include:

1. **Future Value (FV):** The value of an investment at a specified future date based on the assumption of compound interest over that period.
   
2. **Present Value (PV):** The current value of a future sum of money, considering a specified rate of return or discount rate. PV is crucial for determining the fair value of an investment given its potential returns.

3. **Compounding:** The process by which interest is earned on the original amount invested (principal) as well as on any accumulated interest from previous periods.

4. **Discounting:** The process of determining t