# Leveraging ChatGPT on Domain-Specific Content 
This program shows how to use OpenAI's ChatGPT language model to answer questions in specific domain areas.  

The content used is drawn from the 2023 investment outlook summaries posted on the websites of Morgan Stanley [(here)](https://www.morganstanley.com/ideas/global-investment-strategy-outlook-2023), 
JPMorgan [(here)](https://www.jpmorgan.com/insights/research/market-outlook) and 
Goldman Sachs [(here)](https://www.goldmansachs.com/insights/pages/gs-research/macro-outlook-2023-this-cycle-is-different/report.pdf).  

The notebook first preps the content for the model. 

Three approaches to answering domain-specific questions are then explored: 
1.  Selected content is appended to each question as context before it is fed to the model. 
2.  Rather than using the base GPT-3 model as is, the model is fine-tuned on the domain-specific content.  
3.  The third approach combines the first two.  Context is injected into the prompt (as in approach one), but the fine-tuned model is used to answer questions (approach two).  

For a detailed discussion, see ["Leveraging ChatGPT for Business and Organizational Purposes"](https://github.com/robjm16/domain_specific_ChatGPT/blob/main/DOMAIN_SPECIFIC_CHATGPT.md).

To see the chatbot in action, click [here](https://huggingface.co/spaces/robjm16/domain_specific_ChatGPT).

## Install/Import Libraries 

In [None]:
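! pip install "openai<1.0"  # This notebook uses openai.Completion and openai.Embedding, which were removed in openai 1.0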
! pip install transformers 
! pip install gradio
! pip install PyPDF2
! pip install python-docx
! pip install pandas

In [None]:
import docx
import pandas as pd
import numpy as np
import json 
import openai
import gradio as gr
import pickle
import ast
import os
from transformers import GPT2TokenizerFast
from sklearn.model_selection import train_test_split # only if using validation file 
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

## Variables

In [None]:
use_interface = True  # Change to False to run the code without the Gradio interface and instead see a single pre-supplied question 
filepath = '/content/drive/MyDrive/Colab Notebooks/Data/Compilation_investment_outlook_2023.docx' # Path to document containing domain content.  
data_path="/content/drive/MyDrive/Colab Notebooks/Data/"
completions_model = "text-davinci-003" 
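api_key = 'YOUR OPENAI KEY HERE'
os.environ['API_KEY'] = api_key
OPENAI_API_KEY = os.environ["API_KEY"]
openai.api_key = OPENAI_API_KEY  # The openai library does not read the custom API_KEY variable on its own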
model_name = "curie"
doc_embeddings_model = f"text-search-{model_name}-doc-001"
query_embeddings_model= f"text-search-{model_name}-query-001"
max_section_length=1100  # The API limits total tokens for the prompt (question, context, instructions) and completion (the response or answer) to 2048 tokens, or about 1500 words.  
separator = "\n* "  # A string called separator is defined as the newline character followed by an asterisk and a space. This string will be used as a separator between different pieces of text.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
separator_len = len(tokenizer.tokenize(separator))
completions_api_params= {
    "temperature": 0.0,  # Temperature of 0.0 gives the most predictable, factual answer.
    "max_tokens": 200,
    "model": completions_model, # Base model; ft_model is not defined yet here, so swap in your fine-tuned model name for approach 3
    "stop":[".###"]
}
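
As a rough check on the limits set above, the token arithmetic can be sketched as follows. The 2,048-token total is taken from the comment on `max_section_length`; the constant names and the helper `remaining_prompt_budget` are illustrative and not part of the notebook.

```python
# Values copied from the Variables cell above.
CONTEXT_WINDOW = 2048        # total tokens for prompt + completion (per the comment above)
MAX_SECTION_LENGTH = 1100    # budget for injected context sections
MAX_COMPLETION_TOKENS = 200  # "max_tokens" in completions_api_params


def remaining_prompt_budget(context_window, section_budget, completion_budget):
    """Tokens left for the instruction header and the user's question."""
    return context_window - section_budget - completion_budget


headroom = remaining_prompt_budget(CONTEXT_WINDOW, MAX_SECTION_LENGTH, MAX_COMPLETION_TOKENS)
print(headroom)  # 748
```

Under these assumptions, the instruction header, the separators and the question together need to fit in the remaining headroom.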


## Functions

In [None]:
def load_text(filepath):
  """
  Loads a Microsoft Word document and returns a DataFrame containing the text of each paragraph in the document.

  Input:
    filepath (str): the filepath to the Microsoft Word document.
    
  Returns:
    df (pandas.DataFrame): a DataFrame containing the 'content' column with the text of each paragraph in the document.
  """
  # Open the Word document
  doc = docx.Document(filepath)

  # Create an empty pandas DataFrame
  # df = pd.DataFrame(columns=['Doc_Num', 'Doc_Type', 'Doc_Title', 'Content', 'Para_Num', 'Tokens_Num', 'Embedding'])
  df = pd.DataFrame(columns=['Content'])
  # Iterate through the paragraphs in the document and add each to the df
  for i, p in enumerate(doc.paragraphs):

      # Add the paragraph text [and index to the DataFrame]    
      df.loc[i, 'Content'] = p.text
      # df.loc[i, 'paragraph_index'] = i

  # Delete empty paragraphs
  df['Content'] = df['Content'].replace('', np.nan)
  # df['Doc_Num'] = doc_num
  # df['Doc_Type'] = doc_type
  # df['Doc_Title'] = doc_title
  df = df.dropna(axis=0, subset=['Content']).reset_index(drop=True)

  return df


def explode_text(df, max_tokens=100):
    """
    Given a dataframe `df`, the function splits each row whose 'Content' value has more tokens than `max_tokens` into multiple rows of whole sentences, each holding at most `max_tokens` tokens (a single sentence longer than `max_tokens` is kept whole). Only the 'Content' and 'Tokens_Num' columns are carried into the result.
      
      :param df: pandas dataframe that has 'Content' and 'Tokens_Num' columns
      :param max_tokens: int, maximum number of tokens allowed in each row. Default value is 100.
      :return: pandas dataframe with 'Content' and 'Tokens_Num' columns, split so that rows stay within `max_tokens` tokens where possible.
    """
    exploded_rows = []

    for i in range(len(df)):
        if df['Tokens_Num'][i] > max_tokens:
            text = df['Content'][i]
            # sentences = text.split(".")
            sentences = sent_tokenize(text) 
            truncated_text = ""
            truncated_tokens = 0
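            for sentence in sentences:
                # sent_tokenize keeps each sentence's closing punctuation,
                # so no extra "." is appended (doing so produced ".." in the text)
                tokens = tokenizer.encode(sentence)
                if truncated_tokens + len(tokens) <= max_tokens:
                    truncated_text += sentence + " "
                    truncated_tokens += len(tokens)
                else:
                    # Skip the empty chunk that arises when the first sentence alone exceeds max_tokens
                    if truncated_text:
                        exploded_rows.append({
                            'Content': truncated_text.strip(),
                            'Tokens_Num': truncated_tokens,
                        })
                    truncated_text = sentence + " "
                    truncated_tokens = len(tokens)
            exploded_rows.append({
                'Content': truncated_text.strip(),
                'Tokens_Num': truncated_tokens,
            })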
        else:
            exploded_rows.append({
                'Content': df['Content'][i],
                'Tokens_Num': df['Tokens_Num'][i],
            })
    return pd.DataFrame(exploded_rows)

def count_tokens(row):
    """count the number of tokens in a string"""
    return len(tokenizer.encode(row))
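
The sentence-packing idea behind `explode_text` can be seen in isolation in the minimal sketch below. It is not the notebook's implementation: `split_into_chunks` and `whitespace_token_count` are hypothetical names, the word-count "tokenizer" is a stand-in for the GPT-2 tokenizer so the example runs without downloads, and the sample sentences are made up for illustration.

```python
def split_into_chunks(sentences, max_tokens, count_tokens):
    """Greedily pack whole sentences into chunks of at most max_tokens tokens.

    A single sentence longer than max_tokens becomes its own oversize chunk.
    """
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def whitespace_token_count(text):
    """Stand-in tokenizer (an assumption): counts whitespace-separated words."""
    return len(text.split())


# Made-up sentences, not taken from the corpus.
sentences = ["Bonds may rebound.", "Equities face earnings risk.", "Inflation should slow."]
chunks = split_into_chunks(sentences, max_tokens=7, count_tokens=whitespace_token_count)
print(chunks)
```

With a budget of seven "tokens", the first two sentences share a chunk and the third starts a new one.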


In [None]:
def get_embedding(text, model): 
    """
    Generates an embedding for the given text using the specified OpenAI model.
    
    Args:
        text (str): The text for which to generate an embedding.
        model (str): The name of the OpenAI model to use for generating the embedding.
    
    Returns:
        list[float]: The embedding for the given text.
    """
    result = openai.Embedding.create(
      model=model,
      input=[text]
    )
    return result["data"][0]["embedding"]

def get_doc_embedding(text):
    """
    Generates an embedding for the given text using the OpenAI document embeddings model.
    
    Args:
        text (str): The text for which to generate an embedding.
    
    Returns:
        list[float]: The embedding for the given text.
    """
    return get_embedding(text, doc_embeddings_model)

def get_query_embedding(text):
    """
    Generates an embedding for the given text using the OpenAI query embeddings model.
    
    Args:
        text (str): The text for which to generate an embedding.
    
    Returns:
        list[float]: The embedding for the given text.
    """
    return get_embedding(text, query_embeddings_model)

def compute_doc_embeddings(df): 
    """
    Generate embeddings for each row in a Pandas DataFrame using the OpenAI document embeddings model.
    
    Args:
        df (pandas.DataFrame): The DataFrame for which to generate embeddings.
    
    Returns:
        dict: A dictionary that maps each row index to the embedding vector for that row.
    """
    return {
        idx: get_doc_embedding(r.Content.replace("\n", " ")) for idx, r in df.iterrows() # r here refers to each row 
    }


In [None]:
def vector_similarity(x, y):
    """
    Calculate the similarity between two vectors using dot product.
    
    Args:
        x (iterable): The first vector.
        y (iterable): The second vector.
    
    Returns:
        float: The dot product of the two vectors.
    """
    return np.dot(np.array(x), np.array(y))

def order_document_sections_by_query_similarity(query, contexts):  
    """
    Find the query embedding for the given query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections. 
    
    Args:
        query (str): The query for which to find relevant document sections.
        contexts (dict): A dictionary mapping section indices to their document embeddings.
      
    Returns:
        list: A list of tuples, each containing the similarity score and index of a document section, sorted in descending
        order of relevance.
    """
    query_embedding = get_query_embedding(query)
    # print("GETTING DOC SIMILARIIES.........")  # FOR TESTING PURPOSES
    document_similarities = sorted([(vector_similarity(query_embedding, doc_embedding), doc_index) \
                                    for doc_index, doc_embedding in contexts.items()], \
                                    reverse=True)
    # print("FINISHED DOC SIMILARITIES..............")  # FOR TESTING PURPOSES
    
    return document_similarities
    
def construct_prompt(question, context_embeddings, df):
    """
    Construct a prompt for answering a question using the most relevant document sections.
    
    Args:
      question (str): The question to answer.
      context_embeddings (dict): A dictionary mapping section indices to their document embeddings.
      df (pandas.DataFrame): A DataFrame containing the document sections.
    
    Returns:
      str: The prompt, including the question and the relevant context.
    """
    most_relevant_document_sections = order_document_sections_by_query_similarity(question, context_embeddings)
    
    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []
     
    for _, section_index in most_relevant_document_sections:
        # Add contexts until we run out of space.        
        document_section = df.loc[section_index]
        
        chosen_sections_len += document_section.Tokens_Num + separator_len  # Note that "token" column is used here 
        if chosen_sections_len > max_section_length:
            break
            
        chosen_sections.append(separator + document_section.Content.replace("\n", " ")) # Note that 'content" column is used here 
        chosen_sections_indexes.append(str(section_index))
            
    # Useful diagnostic information  -- FOR TESTING PURPOSES
    print(f"Selected {len(chosen_sections)} document sections for Context:")
    print(",".join(chosen_sections_indexes))
    
    header = """Given the following context, answer the question as truthfully as possible, and if the answer is not contained within the context below, say "Sorry, I don't know."\n\nContext:\n"""

    full_prompt = header + "".join(chosen_sections) + "\n\nQuestion: " + question + "\n\n###\n\n"

    print(f"\nFull Prompt: \n{full_prompt}") # FOR TESTING PURPOSES

    return full_prompt
    

def answer_query_with_context(
    query,
    df,
    document_embeddings,
    show_prompt: bool = False):
    """
    Answer a query using relevant context from a DataFrame.
    
    Args:
        query (str): The query to answer.
        df (pandas.DataFrame): A DataFrame containing the document sections.
        document_embeddings (dict): A dictionary mapping section indices to their document embeddings.
        show_prompt (bool, optional): If `True`, print the prompt before generating a response.
    
    Returns:
        str: The generated response to the query.
    """
    prompt = construct_prompt(
        query,
        document_embeddings,
        df
    )


    if show_prompt:
        print(prompt)

    response = openai.Completion.create(
                prompt=prompt,
                **completions_api_params
            )

    return response["choices"][0]["text"].strip(" \n")

In [None]:
def get_questions(context):
    """
    Takes in a string of text (context) as an argument and returns a string of questions 
    generated based on the context using the OpenAI API. The function uses the 
    "text-davinci-001" engine; the prompt is constructed by inserting the context into
    the string "Write questions based on the text below\n\nText: {context}\n\nQuestions:\n1."  
    The temperature, frequency_penalty and presence_penalty are set to 0, max_tokens to 257,
    top_p to 1, and stop to "\n\n".
    If there is any exception, the function will return an empty string.
    """
    try:
        response = openai.Completion.create(
            engine="text-davinci-001",
            prompt=f"Write questions based on the text below\n\nText: {context}\n\nQuestions:\n1.",
            temperature=0,
            max_tokens=257,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=["\n\n"]
        )
        # print(response)  # FOR TESTING PURPOSES
        # print(response['choices'][0]['text'])  # FOR TESTING PURPOSES
        return response['choices'][0]['text']
    except Exception:  # Best effort: any API error yields an empty string
        return ""

def get_answers(row):
    """
    Takes in a row of a dataframe and returns a string of answers generated based
    on the questions and context in that row using the OpenAI API.
    The function uses the "text-davinci-001" engine; the prompt is constructed by
    combining the row's context and questions.
    The temperature, frequency_penalty and presence_penalty are set to 0,
    max_tokens to 257 and top_p to 1.
    If there is any exception, the function will print the exception and return 
    an empty string.
    """
    try:
        response = openai.Completion.create(
            engine="text-davinci-001",
            prompt=f"Write answers based on the text below\n\nText: {row.Content}\n\nQuestions:\n{row.Questions}\n\nAnswers:\n1.",
            temperature=0,
            max_tokens=257,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
        return response['choices'][0]['text']
    except Exception as e:
        print (e)
        return ""



## Prepare the Content/Corpus

Skip this cell if preloading prepared content with embeddings 
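
One way to make the "skip if already prepared" step automatic is a small load-or-build cache, sketched below. This is an illustration rather than part of the notebook: `load_or_build` and `fake_build` are hypothetical names, and the stand-in builder replaces the real (and billable) embedding step.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build_fn):
    """Return the cached object if cache_path exists; otherwise build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


# Demonstration with a throwaway file and a stand-in for the expensive embedding step.
calls = []


def fake_build():
    calls.append(1)
    return {"Content": ["example section"], "Embedding": [[0.1, 0.2]]}


cache_file = os.path.join(tempfile.mkdtemp(), "corpus.pkl")
first = load_or_build(cache_file, fake_build)   # builds and writes the cache
second = load_or_build(cache_file, fake_build)  # served from the cache
print(len(calls))  # 1
```

In the notebook, the builder would be the preparation cell below and the cache path the pickle file under `data_path`.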


In [None]:
# Assign variables 
doc_num=1
doc_type="Research"
doc_title="Investment_Outlook_2023"
max_tokens=150

# create an empty dataframe with specified column names
df = pd.DataFrame(columns=['Content'])

# Load text into dataframe
df = load_text(filepath)

# Count the tokens; used to size sections when splitting long text
df['Tokens_Num'] = df['Content'].apply(count_tokens)

# Truncate/explode text into new rows if > max_tokens threshold 
df=explode_text(df, max_tokens)

# Add embeddings of "Content" column text to df, to allow repeated use 
document_embeddings = compute_doc_embeddings(df)
df = df.assign(Embedding=df.index.map(document_embeddings))

# Add identifying info to dataframe 
df['Doc_Num'] = doc_num
df['Doc_Type'] = doc_type
df['Doc_Title'] = doc_title
df = df.assign(Paragraph_Num = lambda x: range(1, len(x) + 1))

# Reorder columns
df = df[['Doc_Num',  'Doc_Title',  'Doc_Type',   'Paragraph_Num',
         'Content', 'Tokens_Num', 'Embedding']]

In [None]:
# Save/load file
# df.to_pickle(data_path + "investment_outlook_corpus.pkl")
# df=pd.read_pickle(data_path + "investment_outlook_corpus.pkl")
df.head(30)

| | Doc_Num | Doc_Title | Doc_Type | Paragraph_Num | Content | Tokens_Num | Embedding |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Investment_Outlook_2023 | Research | 1 | Morgan Stanley says: In an environment of slo... | 143 | [0.028803126886487007, -0.007771830074489117, ... |
| 1 | 1 | Investment_Outlook_2023 | Research | 2 | Morgan Stanley says: Bonds—the biggest losers... | 124 | [0.010308968834578991, -0.00835722591727972, -... |
| 2 | 1 | Investment_Outlook_2023 | Research | 3 | Morgan Stanley says: Other key takeaways from ... | 142 | [0.013922976329922676, -0.004230298567563295, ... |
| 3 | 1 | Investment_Outlook_2023 | Research | 4 | Morgan Stanley says: Overall, investors will ... | 113 | [-0.0011054601054638624, -0.015017570927739143... |
| 4 | 1 | Investment_Outlook_2023 | Research | 5 | Morgan Stanley says: In 2023, with interest ra... | 110 | [0.016102489084005356, -0.011134202592074871, ... |
| 5 | 1 | Investment_Outlook_2023 | Research | 6 | Morgan Stanley says: However, investors shoul... | 135 | [0.003058797912672162, -0.009665897116065025, ... |
| 6 | 1 | Investment_Outlook_2023 | Research | 7 | Meanwhile, rising rates are limiting the suppl... | 97 | [0.0020399547647684813, -0.006524039898067713,... |
| 7 | 1 | Investment_Outlook_2023 | Research | 8 | Morgan Stanley says: Equities next year, howev... | 140 | [-0.002090581925585866, -0.023302830755710602,... |
| 8 | 1 | Investment_Outlook_2023 | Research | 9 | “This should ultimately more than offset the 1... | 65 | [0.004823953844606876, -0.02416054718196392, 0... |
| 9 | 1 | Investment_Outlook_2023 | Research | 10 | Morgan Stanley says: This has been a major be... | 150 | [-0.0021850778721272945, -0.011115548200905323... |


## Three Approaches to Question Answering

### 1. Injecting Context into Prompts  
The question, domain-specific content and any other instructions are combined into a "prompt".  

Specifically, an interface asks a user for a question about the 2023 investment outlooks.  The program compares the user's query with the domain content to identify the most useful sections of text. The program answers the question by using the model's powerful underlying natural language capabilities while referencing the specific context supplied in the prompt. 
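
The selection step described above can be sketched without any API calls. The example is a simplified illustration, not the notebook's code: `rank_sections` and `build_prompt` are hypothetical names, the two-dimensional vectors are hand-made stand-ins for real embeddings, and the section texts and token counts are invented.

```python
import numpy as np


def rank_sections(query_vec, section_vecs):
    """Return (score, index) pairs sorted from most to least similar (dot product)."""
    scores = [(float(np.dot(query_vec, vec)), idx) for idx, vec in section_vecs.items()]
    return sorted(scores, reverse=True)


def build_prompt(question, ranked, sections, token_counts, budget):
    """Append the best-ranked sections until the token budget is used up."""
    chosen, used = [], 0
    for _, idx in ranked:
        if used + token_counts[idx] > budget:
            break
        chosen.append("\n* " + sections[idx])
        used += token_counts[idx]
    return "Context:\n" + "".join(chosen) + "\n\nQuestion: " + question


# Toy, hand-made vectors standing in for real embeddings (illustrative only).
section_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.6, 0.8]}
sections = {0: "Section about bonds.", 1: "Section about inflation.", 2: "Section about both."}
token_counts = {0: 4, 1: 4, 2: 4}

ranked = rank_sections([0.0, 1.0], section_vecs)
prompt = build_prompt("What is the outlook for inflation?", ranked, sections, token_counts, budget=8)
print(prompt)
```

Here the budget admits only the two highest-scoring sections, so the least relevant one is left out of the prompt.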

In [None]:
# Read in master content dataframe and excerpt needed columns
df=pd.read_pickle(data_path + "investment_outlook_corpus.pkl")
df=df.reset_index()
df_excerpt = df[['Content', 'Tokens_Num', 'Embedding']].copy()

# Create dictionary of embeddings, by row of df
doc_embeddings = df.set_index('index').to_dict()['Embedding']

In [None]:
# Use OpenAI's prompt/completion approach to answer questions, using the Gradio interface 
if use_interface:
    demo = gr.Interface(
    fn=lambda query: answer_query_with_context(query, df_excerpt, doc_embeddings),
    inputs=gr.Textbox(lines=2,  label="Query", placeholder="Type Question Here..."),
    outputs=gr.Textbox(lines=2, label="Answer"),
    description="Example of a domain-specific chatbot, using ChatGPT with supplemental content and fine-tuning.<br>\
                  Here, the content relates to the investment outlook for 2023, according to Morgan Stanley, JPMorgan and Goldman Sachs.<br>\
                  Sample queries: What is Goldman's outlook for inflation? What about the bond market? What does JPMorgan think about 2023?<br>\
                  NOTE: High-level demo only. Supplemental content used here limited to about 30 paragraphs, due to limits on free-of-charge usage of ChatGPT.<br>\
                  Far more robust domain-specific responses are possible.",
    title="Fine-Tuned Domain-Specific Chatbot",)
    # Launch the interface
    demo.launch() # To show errors in colab notebook, set debug=True in launch()
else:
    prompt = construct_prompt(
        "What is Goldman's outlook for inflation?",
        doc_embeddings,  # Built in the previous cell; document_embeddings exists only if the corpus was prepared in this session
        df_excerpt
    )
    # print("===\n", prompt) # FOR TESTING ONLY
    print(answer_query_with_context("What is Goldman's outlook for inflation?", df_excerpt, doc_embeddings))


Selected 9 document sections for Context:
35,19,50,43,49,46,51,0,36

Full Prompt: 
Given the following context, answer the question as truthfully as possible, and if the answer is not contained within the context below, say "Sorry, I don't know."

Context:

* For 2023, it is no surprise that inflation and Fed rate policy remain top of mind for investors: in the J.P. Morgan Research 2023 Outlook Survey, respondents ranked these two factors as the most important for U.S. fixed income markets in 2023, followed by U.S. recession risks..With inflation already showing signs of softening, the Fed is expected to deliver a 50bp hike in December, before dialing down the tightening pace further and delivering 25bp hikes at both the February and March meetings..It is expected to pause rate hikes thereafter..
* JPMorgan says:  Global consumer price index (CPI) inflation is on track to slow toward 3.5% in early 2023 after approaching 10% in the second half of 2022. “Circumstances warrant considering

### 2. Fine Tuning the Model  
Questions and answers are generated for each section of text, and the GPT-3 model is trained on this additional content. Questions are then fed to the customized model, but with no domain-specific content added.    
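
The training records take a prompt/completion shape, one JSON object per line. The sketch below mirrors the separator and stop marker used later in this notebook; the helper names `to_finetune_record` and `to_jsonl` are hypothetical, and the strings are placeholders rather than corpus text.

```python
import json

PROMPT_SEPARATOR = "\n\n###\n\n"  # marks the end of the prompt
COMPLETION_STOP = "###"           # marks the end of the completion


def to_finetune_record(content, question, answer):
    """Format one question/answer pair as a prompt/completion record."""
    prompt = f"Content: {content}\nQuestion: {question} " + PROMPT_SEPARATOR
    completion = " " + answer.strip() + COMPLETION_STOP  # leading space, trailing stop marker
    return {"prompt": prompt, "completion": completion}


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


# Placeholder strings, not taken from the corpus.
record = to_finetune_record("<section text>", "<question?>", "<answer.>")
jsonl_text = to_jsonl([record])
print(jsonl_text)
```

Because `json.dumps` escapes the newlines inside each prompt, every record stays on a single line of the file.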

In [None]:
# Load the content, generate fine-tuning questions 
df_ft=pd.read_pickle(data_path + "investment_outlook_corpus.pkl")
df_ft=df_ft.filter(['Content'])
df_ft['Questions']= df_ft['Content'].apply(get_questions) # Generates questions for each section
df_ft['Questions'] = "1." + df_ft.Questions  # Adds number to first question
df_ft.head(30)

| | Content | Questions |
|---|---|---|
| 0 | Morgan Stanley says: In an environment of slo... | 1. What is the main difference between 2023 an... |
| 1 | Morgan Stanley says: Bonds—the biggest losers... | 1. What are the global macro trends that Morga... |
| 2 | Morgan Stanley says: Other key takeaways from ... | 1. What is the main reason for the predicted d... |
| 3 | Morgan Stanley says: Overall, investors will ... | 1. What is the main message of the text?\n2. W... |
| 4 | Morgan Stanley says: In 2023, with interest ra... | 1. What is the main reason that Morgan Stanley... |
| 5 | Morgan Stanley says: However, investors shoul... | 1. What is Morgan Stanley's opinion on high-yi... |
| 6 | Meanwhile, rising rates are limiting the suppl... | 1. What is the main reason for the limited sup... |
| 7 | Morgan Stanley says: Equities next year, howev... | 1. What is the main reason Morgan Stanley beli... |
| 8 | “This should ultimately more than offset the 1... | 1. What is the expected decline in earnings-pe... |
| 9 | Morgan Stanley says: This has been a major be... | 1. What are the cyclical winds that are in fav... |


Add answers (to the questions), paragraph numbers and embeddings of the content to the dataframe.  
NOTE:  May need to wait >1 min to run next cell in order to stay under free-of-charge usage limits.  
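
Instead of waiting by hand, API calls can be wrapped in a retry helper with exponential backoff. The sketch below is one possible approach, not something the notebook does: `call_with_retry` and `flaky` are hypothetical names, and for brevity the helper retries on any exception, whereas real code would catch only the library's rate-limit error.

```python
import time


def call_with_retry(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stand-in function that fails twice before succeeding.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


waits = []
result = call_with_retry(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

In the notebook, `fn` could be a lambda that wraps the call to `get_answers` or `get_doc_embedding` for a single row.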


In [None]:
df_ft['Answers']= df_ft.apply(get_answers, axis=1)
df_ft['Answers'] = "1." + df_ft.Answers  # Adds first number to Answers
df_ft = df_ft.dropna().reset_index().drop('index',axis=1)
print(df_ft[['Answers']].values[0][0]) 
# df_ft.tail()  # For testing purposes.....................
# Add paragraph number column for optional use later in generating adversarial context/answer
for i, row in enumerate(df_ft.iterrows()):
    df_ft.loc[i, "Paragraph_Num"] = i
# Add embeddings of "Content" column text to df_ft, to allow repeated use 
document_embeddings = compute_doc_embeddings(df_ft)
df_ft = df_ft.assign(Embeddings=df_ft.index.map(document_embeddings))
# df_ft['embeddings'] = df_ft['embeddings'].apply(lambda x: [float(i) for i in ast.literal_eval(x)]) #Changes string to float
df_ft.head(3) # For testing purposes .........

1. The main difference between 2023 and 2022 is that 2023 will be a good year for income investing, while 2022 was marked by resilient growth and high inflation.
2. Income investing will be a good strategy in 2023 due to the slow growth, lower inflation, and new monetary policies that are expected to prevail.


| | Content | Questions | Answers | Paragraph_Num | Embeddings |
|---|---|---|---|---|---|
| 0 | Morgan Stanley says: In an environment of slo... | 1. What is the main difference between 2023 an... | 1. The main difference between 2023 and 2022 i... | 0.0 | [0.028803126886487007, -0.007771830074489117, ... |
| 1 | Morgan Stanley says: Bonds—the biggest losers... | 1. What are the global macro trends that Morga... | 1. Morgan Stanley believes that global macro t... | 1.0 | [0.010308968834578991, -0.00835722591727972, -... |
| 2 | Morgan Stanley says: Other key takeaways from ... | 1. What is the main reason for the predicted d... | 1. The main reason for the predicted decline i... | 2.0 | [0.013922976329922676, -0.004230298567563295, ... |


In [None]:
# Save/load processed df
# df_ft.to_pickle(data_path + 'invest_outlook_2023_rev1.pkl')
# df_ft = pd.read_pickle(data_path + 'invest_outlook_2023_rev1.pkl')
# df_ft.head(3) # For testing purposes ...............................................

In [None]:
# Create a new dataframe with one question and answer per row
expanded_rows = []
for i, row in df_ft.iterrows():
    Questions = row['Questions'].split("\n")
    Answers = row['Answers'].split("\n")
    # zip stops at the shorter list, so unmatched questions or answers are dropped
    for question, answer in zip(Questions, Answers):
        expanded_rows.append({'Paragraph_Num': row['Paragraph_Num'],
                              'Content': row['Content'], 'Embeddings': row['Embeddings'],
                              'Questions': question, 'Answers': answer})
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
expanded_df_ft = pd.DataFrame(expanded_rows, columns=df_ft.columns)

# Label Questions "original" to distinguish from optional adversarial examples, if added 
expanded_df_ft["Label"] = "Original"
expanded_df_ft.rename(columns={'Answers': 'Completion'}, inplace=True)

# Remove question/completion numbers
expanded_df_ft['Questions'] = expanded_df_ft['Questions'].str[2:].str.strip()
expanded_df_ft['Completion'] = expanded_df_ft['Completion'].str[2:].str.strip()

# Create prompts
expanded_df_ft["Prompt"] = expanded_df_ft.apply(lambda row: f"Content: {row['Content']}\nQuestion: {row['Questions']} ", axis=1) 

# Add whitespace to start of completion and unique identifier to end, per OpenAI
expanded_df_ft['Completion'] = expanded_df_ft['Completion'].str.ljust(1)
expanded_df_ft['Prompt'] = expanded_df_ft['Prompt'] + "\n\n###\n\n"
expanded_df_ft['Completion'] = ' ' + expanded_df_ft['Completion'] + "###"

expanded_df_ft.head(2) # For testing only .................................

| | Content | Questions | Completion | Paragraph_Num | Embeddings | Label | Prompt |
|---|---|---|---|---|---|---|---|
| 0 | Morgan Stanley says: In an environment of slo... | What is the main difference between 2023 and 2... | The main difference between 2023 and 2022 is ... | 0.0 | [0.028803126886487007, -0.007771830074489117, ... | Original | Content: Morgan Stanley says: In an environme... |
| 1 | Morgan Stanley says: In an environment of slo... | What will be a good year for income investing ... | Income investing will be a good strategy in 2... | 0.0 | [0.028803126886487007, -0.007771830074489117, ... | Original | Content: Morgan Stanley says: In an environme... |


In [None]:
# Extract columns needed for fine tuning  
temp_df=expanded_df_ft[["Prompt","Completion"]].copy()

# Writes prompts/completions to jsonl (single lines)
df_train_list = temp_df.to_dict(orient='records')
with open('df_train3.jsonl', 'w', encoding='utf-8') as outfile:
    for row in df_train_list:
        outfile.write(json.dumps(row, ensure_ascii=False))
        outfile.write('\n')
        # print(row)  # For testing purposes............................

df_train_list[0] # For testing purposes.......................................

# # If using validation file
# # Split the expanded dataframe into training and testing sets
# train_df, test_df = train_test_split(temp_df, test_size=0.2)

# train_df = train_df.reset_index(drop=True)
# test_df = test_df.reset_index(drop=True)

# Create (optional) adversarial questions; train dataset only

# for i in range(1, len(train_df), 15):
#     if i < len(train_df):
#         # Randomly select a question from another paragraph
#         adversarial_question = train_df[train_df['paragraph_number'] != train_df.loc[i, 'paragraph_number']].sample(1)
#         # Replace the question with the adversarial value
#         train_df.loc[i, 'completion'] = adversarial_question['completion'].item()
#         train_df.loc[i, 'label'] = "adversarial question"
#     else:
#         break

# for i in range(7, len(train_df), 15):
#     if i < len(train_df):
#         # Randomly select a context from another paragraph
#         adversarial_context = train_df[train_df['paragraph_number'] != train_df.loc[i, 'paragraph_number']].sample(1)
#         # Replace the context with the adversarial value
#         train_df.loc[i, 'context'] = adversarial_context['context'].item()
#         train_df.loc[i, 'label'] = "adversarial context"
#     else:
#         break

# If separate train and test datasets 
# df_train_list = train_df.to_dict(orient='records')
# with open('df_train2.jsonl', 'w', encoding='utf-8') as outfile:
#     for row in df_train_list:
#         outfile.write(json.dumps(row, ensure_ascii=False))
#         outfile.write('\n')
#         print(row)  # For testing purposes

# df_test_list = test_df.to_dict(orient='records')
# with open('df_test2.jsonl', 'w', encoding='utf-8') as outfile:
#     for row in df_test_list:
#         outfile.write(json.dumps(row, ensure_ascii=False))
#         outfile.write('\n')
#         print(row)  # For testing purposes

{'Prompt': 'Content: Morgan Stanley says:  In an environment of slow growth, lower inflation and new monetary policies, expect 2023 to have upside for bonds, defensive stocks and emerging markets. Investors may find themselves a bit whiplashed in 2023 as inflation and some of this year’s other dominant market trends fully reverse themselves, according to the 2023 Strategy Outlook from Morgan Stanley Research.  “For markets, this presents a\xa0very\xa0different backdrop than 2022, which was marked by resilient growth, high inflation and hawkish policy,” says Andrew Sheets, Chief Cross-Asset Strategist for Morgan Stanley Research. “Overall, 2023 will be a good year for income investing.” \nQuestion: What is the main difference between 2023 and 2022 according to Morgan Stanley Research? \n\n###\n\n',
 'Completion': ' The main difference between 2023 and 2022 is that 2023 will be a good year for income investing, while 2022 was marked by resilient growth and high inflation.###'}

In [None]:
# Prepare the data in the JSONL file for fine-tuning
!openai tools fine_tunes.prepare_data -f df_train3.jsonl -q 

Analyzing...

- Your file contains 185 prompt-completion pairs
- The `prompt` column/key should be lowercase
- The `completion` column/key should be lowercase
- All prompts end with suffix `? \n\n###\n\n`
- All prompts start with prefix `Content: `
- All completions end with suffix `.###`

Based on the analysis we will perform the following actions:
- [Necessary] Lower case column name to `prompt`
- [Necessary] Lower case column name to `completion`


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `df_train3_prepared.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "df_train3_prepared.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `? \n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[".###"]` so that the generated texts ends at the expected place.
Once your mo

In [None]:
# Create the fine tuning 
!openai api fine_tunes.create -t "df_train3_prepared.jsonl" 

In [None]:
# Load the fine-tuned model and test it on a few questions; change the index into `question` to try other questions
question = ['Why is inflation so high?', 'What is the outlook for oil?', 'What does JPMorgan think about 2023?', 'What is the view on emerging markets?']
ft_model = ''  # ADD YOUR FINE-TUNED MODEL NAME FROM PRIOR CELL'S OUTPUT; e.g. 'ada:ft-openai-2021-07-30-12-26-20'
# To test on the test dataset, use df_test_list['prompt'][0] in place of question[0]
result = openai.Completion.create(
    model=ft_model,
    prompt=question[0] + ' \n\n###\n\n',  # same ` \n\n###\n\n` separator that ends the training prompts
    max_tokens=120, temperature=0, stop=[".###"],
)
print(result['choices'][0]['text'])



The main cause of inflation is the Federal Reserve's monetary policy. The Fed is trying to keep inflation under control by raising interest rates


After fine-tuning on the question-and-answer pairs, the new model is able to answer questions on the new content -- but only at a very high level.

### 3. Combined Approach: Fine-Tuned Model with Context Added to Prompts  
Specific context is added to the prompts of the fine-tuned model. 

If picking up a previously fine-tuned model, change the model name in the completion parameters under Variables.  
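
A rough sketch of how context might be added to a prompt, modeled on the prompt layout printed in this section's output (an instruction header, a `Context:` label, and one bulleted line per section). The function `build_context_prompt` is hypothetical. The trailing `Q:`/`A:` lines are an assumption, since the end of the printed prompt is not shown. The notebook's own `construct_prompt` is defined elsewhere and may differ.

```python
# Header text copied from the prompt printed in this section's output
HEADER = (
    "Given the following context, answer the question as truthfully as possible, "
    'and if the answer is not contained within the context below, say "Sorry, I don\'t know."'
)


def build_context_prompt(sections, question):
    """Assemble header, bulleted context sections and the question (hypothetical helper)."""
    # One bullet per section; newlines inside a section are flattened to spaces
    bullets = "\n".join("* " + s.replace("\n", " ").strip() for s in sections)
    # The "Q: ... A:" ending is an assumed convention, not confirmed by the source
    return f"{HEADER}\n\nContext:\n\n{bullets}\n\nQ: {question}\nA:"


sections = [
    "JPMorgan says: Global consumer price index (CPI) inflation is on track to slow "
    "toward 3.5% in early 2023 after approaching 10% in the second half of 2022.",
]
prompt = build_context_prompt(sections, "What is the outlook for inflation?")
print(prompt)
```

The resulting string would then be passed as the `prompt` argument of the completion call, with the fine-tuned model name in place of the base model.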

In [None]:
df=pd.read_pickle(data_path + "investment_outlook_corpus.pkl")
df.tail(30)

Unnamed: 0,Doc_Num,Doc_Title,Doc_Type,Paragraph_Num,Content,Tokens_Num,Embedding
22,1,Investment_Outlook_2023,Research,23,JPMorgan says: “Consumers with a cushion of s...,124,"[-0.004307280760258436, -0.005382366944104433,..."
23,1,Investment_Outlook_2023,Research,24,"JPMorgan says: In the first half of 2023, we ...",139,"[-0.0007864225190132856, -0.015624849125742912..."
24,1,Investment_Outlook_2023,Research,25,Upside and downside to this base case will lar...,55,"[-0.0010516275651752949, -0.014208441600203514..."
25,1,Investment_Outlook_2023,Research,26,JPMorgan says: The convergence between the U....,139,"[-0.019465884193778038, -0.018701419234275818,..."
26,1,Investment_Outlook_2023,Research,27,"Tactically, the Asia reopening trade led by Ch...",69,"[-0.014102987945079803, -0.0058497474528849125..."
27,1,Investment_Outlook_2023,Research,28,JPMorgan says: Commodities Outlook..Entering ...,142,"[0.003698094980791211, -0.01282254047691822, 0..."
28,1,Investment_Outlook_2023,Research,29,Despite more pessimistic expectations for bala...,69,"[0.007378492038697004, -0.014822321943938732, ..."
29,1,Investment_Outlook_2023,Research,30,JPMorgan says: Commodity price forecasts 2023...,100,"[0.0006911815726198256, -0.012200652621686459,..."
30,1,Investment_Outlook_2023,Research,31,There is still substantial room for a cyclical...,120,"[0.011535905301570892, -0.016301343217492104, ..."
31,1,Investment_Outlook_2023,Research,32,"Growth from U.S. shale producers, traditionall...",49,"[0.009517782367765903, -0.024347815662622452, ..."


In [None]:
df=df.reset_index()
df_excerpt = df[['Content', 'Tokens_Num', 'Embedding']].copy()
# Create a dictionary of embeddings, keyed by df row index
doc_embeddings = df.set_index('index').to_dict()['Embedding']
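
A minimal sketch of how a dictionary like `doc_embeddings` could be used to pick context sections: rank every section by similarity to the query embedding, then take sections in that order until a token budget is reached. The function names, the `max_tokens` budget and the toy two-dimensional vectors are all assumptions for illustration. The notebook's actual selection logic is defined elsewhere and may differ. Dot product is used as the similarity measure, which matches cosine similarity only if the embeddings are unit-normalized.

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def rank_sections(query_embedding, doc_embeddings):
    """Return (similarity, index) pairs, most similar first."""
    scored = [(dot(query_embedding, emb), idx) for idx, emb in doc_embeddings.items()]
    return sorted(scored, reverse=True)


def select_sections(ranked, tokens_by_index, max_tokens):
    """Take sections in ranked order until the next one would exceed the budget."""
    chosen, used = [], 0
    for _, idx in ranked:
        n_tokens = tokens_by_index[idx]
        if used + n_tokens > max_tokens:
            break
        chosen.append(idx)
        used += n_tokens
    return chosen


# Toy stand-ins for {row index: Embedding} and {row index: Tokens_Num}
toy_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.6, 0.8]}
toy_tokens = {0: 100, 1: 80, 2: 90}

ranked = rank_sections([1.0, 0.0], toy_embeddings)
selected = select_sections(ranked, toy_tokens, max_tokens=200)
print(selected)
```

With a real corpus, the query embedding would come from the same embedding model that produced the `Embedding` column, so that the vectors are comparable.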

Launch the Q/A interface.  
NOTE: Additional information may be printed for validation purposes.

In [None]:
if use_interface:
    demo = gr.Interface(
        fn=lambda query: answer_query_with_context(query, df_excerpt, doc_embeddings),
        inputs=gr.Textbox(lines=2,  label="Query", placeholder="Type Question Here..."),
        outputs=gr.Textbox(lines=2, label="Answer"),
        description="Example of a domain-specific chatbot, using ChatGPT with supplemental content and fine-tuning.<br>\
                  Here, the content relates to the investment outlook for 2023, according to Morgan Stanley, JPMorgan and Goldman Sachs.<br>\
                  Sample queries: What is Goldman's outlook for inflation? What about the bond market? What does JPMorgan think about 2023?<br>\
                  NOTE: High-level demo only. Supplemental content used here limited to about 30 paragraphs, due to limits on free-of-charge usage of ChatGPT.<br>\
                  Far more robust domain-specific responses are possible.",
        title="Fine-Tuned Domain-Specific Chatbot",
    )
    # Launch the interface
    demo.launch(debug=True) # To show errors in colab notebook, set debug=True in launch()
else:
    prompt = construct_prompt(
        'What is the outlook for inflation?',
        doc_embeddings,
        df_excerpt
    )
    # print("===\n", prompt) # FOR TESTING ONLY
    answer_query_with_context("What is Goldman's outlook for inflation?", df_excerpt, doc_embeddings)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.



Selected 9 document sections for Context:
35,19,50,43,0,46,49,51,24

Full Prompt: 
Given the following context, answer the question as truthfully as possible, and if the answer is not contained within the context below, say "Sorry, I don't know."

Context:

* For 2023, it is no surprise that inflation and Fed rate policy remain top of mind for investors: in the J.P. Morgan Research 2023 Outlook Survey, respondents ranked these two factors as the most important for U.S. fixed income markets in 2023, followed by U.S. recession risks..With inflation already showing signs of softening, the Fed is expected to deliver a 50bp hike in December, before dialing down the tightening pace further and delivering 25bp hikes at both the February and March meetings..It is expected to pause rate hikes thereafter..
* JPMorgan says:  Global consumer price index (CPI) inflation is on track to slow toward 3.5% in early 2023 after approaching 10% in the second half of 2022. “Circumstances warrant considering