<a href="https://colab.research.google.com/github/tabris1994/datasciencecoursera/blob/main/LLMs_for_researchers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*This is a notebook that combines text and code, used for illustrative purposes.*

*To use it with the OpenAI API, create an API key and store it under the name `API_KEY` in the "Secrets" menu on the left (the one with the key icon).*

# Install dependencies

In [None]:
!python --version

Python 3.10.12


In [None]:
!pip install numpy pandas openai tiktoken py_markdown_table --quiet # cohere


In [None]:
from google.colab import userdata
import os
os.environ["OPENAI_API_KEY"] = userdata.get('API_KEY')

# Imports

In [None]:
from openai import OpenAI
import numpy as np
import pandas as pd
from py_markdown_table.markdown_table import markdown_table

# 1. Embeddings Demo

OpenAI's "text-embedding-3-large" embedder creates 3072-dimensional vectors by default.

In [None]:
def get_embeddings(chunks):
    response = OpenAI().embeddings.create(input=chunks, model="text-embedding-3-large")
    return [np.array(record.embedding) for record in response.data]

In [None]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

In [None]:
def distance(a, b):
    return 1 - cosine_similarity(a, b)
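
To build some intuition for the two functions above, here is a minimal sketch on toy 2-dimensional vectors. The vectors and their names are made up for illustration and are not real embeddings; the two functions are repeated so the cell runs on its own.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def distance(a, b):
    return 1 - cosine_similarity(a, b)

# Toy vectors (hypothetical, chosen so the results are easy to check by hand)
east = np.array([1.0, 0.0])
far_east = np.array([3.0, 0.0])   # same direction as `east`, different length
north = np.array([0.0, 1.0])      # orthogonal to `east`
west = np.array([-1.0, 0.0])      # opposite direction to `east`

d_same = distance(east, far_east)      # 0: only direction matters, not length
d_orthogonal = distance(east, north)   # 1: unrelated directions
d_opposite = distance(east, west)      # 2: largest possible cosine distance
print(d_same, d_orthogonal, d_opposite)
```

Cosine distance therefore ranges from 0 (same direction) to 2 (opposite direction), regardless of vector length.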

Let's create some embeddings...

In [None]:
texts = ['Markup', 'Unemployment']
e0, e1 = get_embeddings(texts)
print(e0)
print(len(e0))

[-0.02849332  0.00799687 -0.02566623 ...  0.00017049  0.00256702
  0.0213144 ]
3072


Let's compute some distances...

In [None]:
e2 = get_embeddings(['Price'])[0]
print(distance(e2, e0))
print(distance(e2, e1))

0.6442938936573992
0.7454496533084105
0.24841288290932673


But of course, how do we weight the different dimensions? Cosine distance treats them all equally.

In [None]:
distance(*get_embeddings(['Price', 'Precio']))

0.24836780863926555

## Going further...

The list below contains several statements about inflation, all merely illustrative. Which statements are most closely related to each other?


In [None]:
texts = ["Deflationary trends are setting in",
         "Inflation rates are moderating",
         "Unexpected inflation is an ongoing concern",
         "Price stability is uncertain",
         "Inflationary pressures are easing",
         "Price levels surged unexpectedly"]

In [None]:
vectors = get_embeddings(texts)

In [None]:
def view_correlations(table, texts):
    # Print the similarity table as Markdown, using the texts as column headers
    df = pd.DataFrame(table)
    index = {i: name for i, name in enumerate(texts)}
    df.rename(columns=index, inplace=True)
    data = df.to_dict(orient='records')
    markdown = markdown_table(data).get_markdown()
    print(markdown)

In [None]:
table = []
for x in vectors:
    table.append([])
    for y in vectors:
        table[-1].append(f'{cosine_similarity(x,y):.02f}')
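
The nested loop above can also be written as a single matrix product, which makes it easy to ask which two statements are closest. The sketch below is one possible way to do it; `cosine_similarity_matrix`, `most_similar_pair`, and the toy vectors are hypothetical names and data introduced here, not part of the notebook.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between the rows of `vectors`."""
    m = np.asarray(vectors, dtype=float)
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)  # scale each row to unit length
    return unit @ unit.T

def most_similar_pair(sim):
    """Return the indices (i, j) of the highest off-diagonal entry of a similarity matrix."""
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)  # ignore each item's similarity with itself
    i, j = np.unravel_index(np.argmax(masked), masked.shape)
    return int(i), int(j)

# Toy stand-ins for embeddings (made up for illustration)
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
sim = cosine_similarity_matrix(toy_vectors)
pair = most_similar_pair(sim)
print(np.round(sim, 2))
print(pair)
```

With the real data, `cosine_similarity_matrix(vectors)` would replace the loop, and `most_similar_pair` would return positions that index into `texts`.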

In [None]:
view_correlations(table, texts)
# https://stackoverflow.com/questions/40887753/display-matrix-values-and-colormap

```
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Deflationary trends are setting in|Inflation rates are moderating|Unexpected inflation is an ongoing concern|Price stability is uncertain|Inflationary pressures are easing|Price levels surged unexpectedly|
+----------------------------------+------------------------------+------------------------------------------+----------------------------+---------------------------------+--------------------------------+
|               1.00               |             0.64             |                   0.54                   |            0.43            |               0.66              |              0.40              |
+----------------------------------+------------------------------+------------------------------------------+----------------------------+---------------------------------+--------------------------------+
```

# 2. RAG Demo

In [None]:
text = '''
A summary of the developments in the four broad categories of vulnerabilities since the last report is as follows:

1. Asset valuations. Equity prices grew faster than expected earnings, pushing the forward price-to-earnings ratio into the upper ranges of its historical distribution. Risk premiums in corporate bond markets narrowed somewhat and remained near the middle of their historical distributions. Prices of residential and commercial properties remained high relative to fundamentals (see Section 1, Asset Valuations).

2. Borrowing by businesses and households. Balance sheets of many nonfinancial businesses and households remained solid. Growth of business debt continued to decline through the first half of the year, although business debt remained high when measured relative to gross domestic product (GDP) or business assets. Measures of the ability of firms to service their debt remained strong. Household debt remained at modest levels relative to GDP, with most of that debt owed by households with strong credit histories or considerable home equity (see Section 2, Borrowing by Businesses and Households).

3. Leverage in the financial sector. The banking sector remains sound and resilient overall, and most banks continued to report capital levels well above regulatory requirements. That said, the increase in interest rates over the past two years has contributed to declines in the fair value of longer-maturity, fixed-rate assets that, for some banks, were sizable. Outside the banking sector, available data suggest that hedge fund leverage remained somewhat elevated, especially for the largest hedge funds. Leverage at life insurance companies remained near the middle of its historical range, while broker-dealer leverage remained historically low (see Section 3, Leverage in the Financial Sector).

4. Funding risks. Most domestic banks have ample liquidity and limited reliance on short-term wholesale funding; nevertheless, some banks continued to face funding strains, likely owing to vulnerabilities associated with high levels of uninsured deposits and declines in the fair value of assets. The Bank Term Funding Program (BTFP) helped mitigate these strains. Structural vulnerabilities remained in other short-term funding markets. Prime and tax-exempt money market funds (MMFs), as well as other cash-investment vehicles and stablecoins, remained vulnerable to runs. Bond and loan funds that hold assets that can become illiquid during periods of stress remained susceptible to large redemptions. Life insurers continued to rely on a higher-than-average share of runnable liabilities (see Section 4, Funding Risks).
'''

In [None]:
# Step 1: split text into chunks

chunks = text.split('\n\n')
print(f'1) Text split into {len(chunks)} chunks')

# Step 2: compute embeddings for each chunk
embeddings = get_embeddings(chunks)
print(f'2) Computed {len(embeddings)} embeddings of size {len(embeddings[0])}')

# Step 3: select relevant embeddings
query = 'How has household debt evolved lately?'
#query = 'household debt'
#query = 'stablecoins'
query_embedding = get_embeddings([query])[0]

for i, chunk in enumerate(chunks):
    dist = 1 - cosine_similarity(query_embedding, embeddings[i])
    print(f'   - Distance of chunk {i} to query: {dist:4.2f}')

distances = [1 - cosine_similarity(query_embedding, e) for e in embeddings]
best_i = np.argmin(distances)
context = chunks[best_i]
print(f'3) Best chunk is number {best_i} ("{context[:30]}...")')

1) Text split into 5 chunks
2) Computed 5 embeddings of size 3072
   - Distance of chunk 0 to query: 0.73
   - Distance of chunk 1 to query: 0.62
   - Distance of chunk 2 to query: 0.36
   - Distance of chunk 3 to query: 0.60
   - Distance of chunk 4 to query: 0.67
3) Best chunk is number 2 ("2. Borrowing by businesses and...")
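
Step 3 keeps only the single closest chunk. A common variation is to keep the k closest chunks and join them into one context. The sketch below illustrates that idea on toy data; `top_k_chunks` and all `toy_*` values are hypothetical and stand in for the real chunks and embeddings.

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks closest to the query, as (distance, chunk) pairs."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(chunk_vecs, dtype=float)
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    distances = 1 - sims
    order = np.argsort(distances)[:k]  # smallest cosine distance first
    return [(float(distances[i]), chunks[i]) for i in order]

# Toy stand-ins for chunks, chunk embeddings, and the query embedding
toy_chunks = ['asset valuations', 'household debt', 'funding risks']
toy_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.2], [0.0, 0.3, 1.0]]
toy_query = [0.0, 1.0, 0.0]

selected = top_k_chunks(toy_query, toy_vecs, toy_chunks, k=2)
toy_context = '\n\n'.join(chunk for _, chunk in selected)
print(toy_context)
```

In the notebook, the call would presumably look like `top_k_chunks(query_embedding, embeddings, chunks, k=2)`. A larger k gives the LLM more material, at the cost of a longer prompt.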


In [None]:
# Step 4: query the LLM
prompt = f'''
You have been tasked to extract information from a report. Text excerpts for this examination have been attached at the end of this text, after the word "CONTEXT:".

Please answer the following question: {query}

CONTEXT:

{context}
'''

response = OpenAI().chat.completions.create(
    model='gpt-4-0125-preview',
    messages=[
        {"role": "user", "content": prompt}
    ]
)

print('4) LLM answer:')
print(response.choices[0].message.content)

4) LLM answer:
Household debt has remained at modest levels relative to GDP, with most of the debt being owed by households that have strong credit histories or considerable home equity.


In [None]:
# Step 4 (alternative): query the LLM for JSON output that is easy to load into Stata
prompt = f'''
You have been tasked to extract information from a report. Text excerpts for this examination have been attached at the end of this text, after the word "CONTEXT:".

Please answer the following question: {query}

Please provide your response using two JSON fields, the first one named 'success' with values True or
False, the second named 'answer', with values 1 to 5, with 1 meaning "great" and 5 meaning "terrible".

CONTEXT:

{context}
'''

response = OpenAI().chat.completions.create(
    model='gpt-4-0125-preview',
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to output JSON."},
        {"role": "user", "content": prompt}
    ]
)

print('4) LLM answer:')
print(response.choices[0].message.content)

4) LLM answer:
{
  "success": true,
  "answer": 2
}
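
Before passing such output on to other software, it helps to parse and check it. The sketch below uses only the standard library; `parse_answer` is a hypothetical helper, and the `raw` string simply copies the output shown above (in the notebook it would come from `response.choices[0].message.content`).

```python
import json

# Copy of the model output shown above, used here so the cell runs without an API call
raw = '{\n  "success": true,\n  "answer": 2\n}'

def parse_answer(raw_json):
    """Parse the model's JSON reply and check the two fields requested in the prompt."""
    record = json.loads(raw_json)
    if not isinstance(record.get('success'), bool):
        raise ValueError('field "success" should be true or false')
    if record.get('answer') not in range(1, 6):
        raise ValueError('field "answer" should be an integer from 1 to 5')
    return record

record = parse_answer(raw)
print(record['success'], record['answer'])
```

Records validated this way could then be collected into a list, turned into a pandas DataFrame, and written out with `DataFrame.to_stata` or `DataFrame.to_csv` for use in Stata.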
