## Get Packages

The commands you've written are for interacting with Git and Pip, which are commonly used tools for software development. Here's a breakdown of what they do:

1. `! git clone https://github.com/yiqiao-yin/wyn-chatbot-io.git`:
   - `!` is often used in the context of a Jupyter Notebook to execute shell commands.
   - `git clone` is a Git command used to clone (or copy) a repository from an existing URL to your local machine.
   - `https://github.com/yiqiao-yin/wyn-chatbot-io.git` is the URL of the repository you're cloning. In this case, it's the "wyn-chatbot-io" repository from the GitHub user "yiqiao-yin".

2. `! pip install -r /content/wyn-chatbot-io/requirements.txt`:
   - `!` again is used to execute shell commands in a Jupyter Notebook.
   - `pip install` is a command to install Python packages.
   - `-r` is an option for `pip install` which allows you to install multiple packages listed in a file.
   - `/content/wyn-chatbot-io/requirements.txt` is the path to the requirements file in the cloned repository. This file typically contains a list of packages and their versions that are necessary for the project in the repository to run.

After running these commands, you would have cloned the repository into your local environment and installed the necessary Python packages listed in the `requirements.txt` file.

In [None]:
! git clone https://github.com/yiqiao-yin/wyn-chatbot-io.git

In [None]:
! pip install -r /content/wyn-chatbot-io/requirements.txt

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('/content/new.csv')

In [28]:
# df.drop(columns=['Unnamed: 0'], inplace=True)
df['questions'] = df['question']

In [29]:
df.head()

Unnamed: 0,context,question,answers,questions
0,Heliconia : Heliconias are a genus of flowerin...,What is the native range of Heliconia?,"The native range of Heliconia is global, with...",What is the native range of Heliconia?
1,cultures across the Amazon use heliconia leave...,What is the dish called and what is wrapped i...,The dish is called maito and the foods that a...,What is the dish called and what is wrapped i...
2,leaves and cook them over fire or in water. So...,What are the similarities between heliconia f...,\n\nThe similarities between heliconia fruits ...,What are the similarities between heliconia f...
3,"bananas, but tend to be less favorable because...",What is the difference between a banana and a...,Bananas are typically sweeter and have fewer ...,What is the difference between a banana and a...
4,species. The flowers of the heliconias create ...,What is the significance of the flowers of th...,\n\nThe flowers of the heliconias create nurse...,What is the significance of the flowers of th...


## Introduce Functions

Here's a summary of the functions:

1. **calculate_cosine_similarity**
   - Takes two input sentences.
   - Tokenizes the sentences into lowercase words.
   - Creates a set of unique words from both sentences.
   - Creates a frequency vector for each sentence based on the unique words.
   - Calculates the cosine similarity between the frequency vectors.
   - Returns the cosine similarity as a float between 0 and 1.

2. **calculate_sts_score**
   - Takes two input sentences.
   - Loads a pre-trained SentenceTransformer model ("paraphrase-MiniLM-L6-v2").
   - Computes sentence embeddings for each input sentence.
   - Calculates cosine similarity between the embeddings.
   - Returns the similarity score as a float.

3. **openai_text_embedding**
   - Takes a text prompt as input.
   - Returns the text embedding generated by OpenAI's API with model "text-embedding-ada-002".

4. **calculate_sts_openai_score**
   - Takes two input sentences.
   - Computes sentence embeddings using the `openai_text_embedding` function.
   - Converts the embeddings to arrays.
   - Calculates cosine similarity between the embeddings.
   - Returns the similarity score as a float.

5. **palm_text_embedding**
   - Takes a text prompt as input.
   - Returns the text embedding generated by the `palm` function with model "embedding-gecko-001".

6. **calculate_sts_palm_score**
   - Takes two input sentences.
   - Computes sentence embeddings using the `palm_text_embedding` function.
   - Converts the embeddings to arrays.
   - Calculates cosine similarity between the embeddings.
   - Returns the similarity score as a float.

Note: In all functions, the cosine similarity is calculated by subtracting the cosine distance from 1. Also, the term "embedding" typically refers to a numerical representation of text that captures semantic information. The exact nature of these embeddings depends on the model used to create them.

In [9]:
from typing import Dict, List, Union
import google.generativeai as palm
import openai
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

In [7]:
def calculate_cosine_similarity(sentence1: str, sentence2: str) -> float:
    """
    Calculate the cosine similarity between two sentences.

    Args:
        sentence1 (str): The first sentence.
        sentence2 (str): The second sentence.

    Returns:
        float: The cosine similarity between the two sentences, represented as a float value between 0 and 1.
    """
    # Tokenize the sentences into words
    words1 = sentence1.lower().split()
    words2 = sentence2.lower().split()

    # Create a set of unique words from both sentences
    unique_words = set(words1 + words2)

    # Create a frequency vector for each sentence
    freq_vector1 = np.array([words1.count(word) for word in unique_words])
    freq_vector2 = np.array([words2.count(word) for word in unique_words])

    # Calculate the cosine similarity between the frequency vectors
    similarity = 1 - cosine(freq_vector1, freq_vector2)

    return similarity


def calculate_sts_score(sentence1: str, sentence2: str) -> float:
    model = SentenceTransformer(
        "paraphrase-MiniLM-L6-v2"
    )  # Load a pre-trained STS model

    # Compute sentence embeddings
    embedding1 = model.encode([sentence1])[0]  # Flatten the embedding array
    embedding2 = model.encode([sentence2])[0]  # Flatten the embedding array

    # Calculate cosine similarity between the embeddings
    similarity_score = 1 - cosine(embedding1, embedding2)

    return similarity_score


def openai_text_embedding(prompt: str) -> str:
    return openai.Embedding.create(input=prompt, model="text-embedding-ada-002")[
        "data"
    ][0]["embedding"]


def calculate_sts_openai_score(sentence1: str, sentence2: str) -> float:
    # Compute sentence embeddings
    embedding1 = openai_text_embedding(sentence1)  # Flatten the embedding array
    embedding2 = openai_text_embedding(sentence2)  # Flatten the embedding array

    # Convert to array
    embedding1 = np.asarray(embedding1)
    embedding2 = np.asarray(embedding2)

    # Calculate cosine similarity between the embeddings
    similarity_score = 1 - cosine(embedding1, embedding2)

    return similarity_score


def palm_text_embedding(prompt: str) -> str:
    model = "models/embedding-gecko-001"
    return palm.generate_embeddings(model=model, text=prompt)["embedding"]


def calculate_sts_palm_score(sentence1: str, sentence2: str) -> float:
    # Compute sentence embeddings
    embedding1 = palm_text_embedding(sentence1)  # Flatten the embedding array
    embedding2 = palm_text_embedding(sentence2)  # Flatten the embedding array

    # Convert to array
    embedding1 = np.asarray(embedding1)
    embedding2 = np.asarray(embedding2)

    # Calculate cosine similarity between the embeddings
    similarity_score = 1 - cosine(embedding1, embedding2)

    return similarity_score

This function, `add_dist_score_column`, takes a dataframe and a sentence as input arguments. It calculates similarity scores between the input sentence and a column of sentences ("questions") in the dataframe, based on a specified similarity measure. It then adds a column of these similarity scores to the dataframe, sorts the dataframe by the similarity scores in descending order, and returns the top five rows of the sorted dataframe.

Here are the steps the function performs:

1. **Input Parameters:**
   - `dataframe`: a Pandas DataFrame that should have a column named "questions" containing sentences to be compared against the input sentence.
   - `sentence`: a single sentence (string) that will be compared against each sentence in the "questions" column of the dataframe.
   - `similarity_indicator`: a string that specifies the similarity measure to be used. It can be one of "cosine", "levenshtein", "sts", "stsopenai", or "stspalm". If an unsupported value is passed, it defaults to "cosine".

2. **Calculate Similarity Scores and Add to DataFrame:**
   - If `similarity_indicator` is "cosine", the function calculates cosine similarity scores between the input sentence and each sentence in the "questions" column using the `calculate_cosine_similarity` function. It then adds a column named "cosine" to the dataframe containing these scores.
   - If `similarity_indicator` is "levenshtein", it calculates cosine similarity scores (though it should probably be calculating Levenshtein distances) and adds a column named "levenshtein".
   - If `similarity_indicator` is "sts", it calculates similarity scores using the `calculate_sts_score` function and adds a column named "sts".
   - If `similarity_indicator` is "stsopenai", it calculates similarity scores using the `calculate_sts_openai_score` function and adds a column named "stsopenai".
   - If `similarity_indicator` is "stspalm", it calculates similarity scores using the `calculate_sts_palm_score` function and adds a column named "stspalm".
   - If an unsupported value is passed for `similarity_indicator`, it defaults to calculating cosine similarity scores and adding a column named "cosine".

3. **Sort DataFrame and Return Top Rows:**
   - The function sorts the dataframe by the similarity scores in descending order (highest scores at the top).
   - It then returns the top five rows of the sorted dataframe.

Note: The function appears to have an error in the "levenshtein" case where it calculates cosine similarity scores instead of Levenshtein distances. If Levenshtein distances were intended to be used, a separate function for calculating those distances should be called instead.

In [8]:
def add_dist_score_column(
    dataframe: pd.DataFrame, sentence: str, similarity_indicator: str = "cosine"
) -> pd.DataFrame:
    if similarity_indicator == "cosine":
        dataframe["cosine"] = dataframe["questions"].apply(
            lambda x: calculate_cosine_similarity(x, sentence)
        )
    elif similarity_indicator == "levenshtein":
        dataframe["levenshtein"] = dataframe["questions"].apply(
            lambda x: calculate_cosine_similarity(x, sentence)
        )
    elif similarity_indicator == "sts":
        dataframe["sts"] = dataframe["questions"].apply(
            lambda x: calculate_sts_score(x, sentence)
        )
    elif similarity_indicator == "stsopenai":
        dataframe["stsopenai"] = dataframe["questions"].apply(
            lambda x: calculate_sts_openai_score(str(x), sentence)
        )
    elif similarity_indicator == "stspalm":
        dataframe["stspalm"] = dataframe["questions"].apply(
            lambda x: calculate_sts_palm_score(str(x), sentence)
        )
    else:
        dataframe["cosine"] = dataframe["questions"].apply(
            lambda x: calculate_cosine_similarity(x, sentence)
        )

    sorted_dataframe = dataframe.sort_values(by=similarity_indicator, ascending=False)

    return sorted_dataframe.iloc[:5, :]


This code defines a function called `convert_to_list_of_dict`, which takes a pandas DataFrame as input and converts it into a list of dictionaries. Each dictionary represents a question or answer and has two keys: "role" and "content."

Here is a step-by-step explanation of what the code does:

1. **Function Definition:**
   - The function is defined with the name `convert_to_list_of_dict` and takes one argument `df`, which is expected to be a pandas DataFrame.
   - The function is expected to return a list of dictionaries, with each dictionary having a string key-value pair.

2. **Function Documentation:**
   - The function is documented with a docstring that explains its purpose, input parameters, and return value.
   - It reads in a pandas DataFrame with columns named 'questions' and 'answers' and produces a list of dictionaries with two keys each: 'question' and 'answer.'

3. **Initialize an Empty List:**
   - The function initializes an empty list called `result`, which will be used to store the dictionaries created from the DataFrame rows.

4. **Loop through the DataFrame Rows:**
   - The function iterates through each row of the input DataFrame using the `iterrows()` method.
   - For each row, it retrieves the values in the "questions" and "answers" columns.

5. **Create Question and Answer Dictionaries:**
   - For each row, the function creates two dictionaries:
     - `qa_dict_quest`: This dictionary represents a question and has two key-value pairs: "role" is set to "user" and "content" is set to the value in the "questions" column of the current row.
     - `qa_dict_ans`: This dictionary represents an answer and has two key-value pairs: "role" is set to "assistant" and "content" is set to the value in the "answers" column of the current row.

6. **Add Dictionaries to the Result List:**
   - The function adds the two dictionaries created in the previous step to the `result` list. It first adds the question dictionary and then the answer dictionary.

7. **Return the Result List:**
   - After iterating through all the rows of the DataFrame and adding the corresponding dictionaries to the `result` list, the function returns the `result` list.

The returned list can be used to represent a sequence of questions and answers in a chatbot conversation, where each dictionary represents a message in the conversation with information about the sender's role (user or assistant) and the content of the message.

In [10]:
def convert_to_list_of_dict(df: pd.DataFrame) -> List[Dict[str, str]]:
    """
    Reads in a pandas DataFrame and produces a list of dictionaries with two keys each, 'question' and 'answer.'

    Args:
        df: A pandas DataFrame with columns named 'questions' and 'answers'.

    Returns:
        A list of dictionaries, with each dictionary containing a 'question' and 'answer' key-value pair.
    """

    # Initialize an empty list to store the dictionaries
    result = []

    # Loop through each row of the DataFrame
    for index, row in df.iterrows():
        # Create a dictionary with the current question and answer
        qa_dict_quest = {"role": "user", "content": row["questions"]}
        qa_dict_ans = {"role": "assistant", "content": row["answers"]}

        # Add the dictionary to the result list
        result.append(qa_dict_quest)
        result.append(qa_dict_ans)

    # Return the list of dictionaries
    return result

### Enter API Key

In [31]:
openai.api_key = "ENTER API KEY HERE"

In [32]:
palm.configure(api_key = "ENTER API KEY HERE")

In [33]:
def call_chatgpt(prompt: str) -> str:
    """
    Uses the OpenAI API to generate an AI response to a prompt.

    Args:
        prompt: A string representing the prompt to send to the OpenAI API.

    Returns:
        A string representing the AI's generated response.

    """

    # Use the OpenAI API to generate a response based on the input prompt.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0.3,
        max_tokens=800,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )

    # Extract the text from the first (and only) choice in the response output.
    ans = response.choices[0]["text"]

    # Return the generated AI response.
    return ans

In [34]:
def call_palm(prompt: str) -> str:
    completion = palm.generate_text(
        model="models/text-bison-001",
        prompt=prompt,
        temperature=0,
        max_output_tokens=800,
    )

    return completion.result

In [73]:
user_input = "hello"
similarity_indicator = "cosine"

def get_ans_from_no_reference(user_input: str) -> str:
    """
    Uses the LLM to generate an AI response to a prompt.

    Args:
        prompt: A string representing the prompt to send to the LLM.

    Returns:
        A string representing the AI's generated response.
    """
    output = call_chatgpt(user_input)

    return output


def get_answers_from_gpt(user_input: str, similarity_indicator) -> str:
    """
    Uses the LLM to generate an AI response to a prompt.

    Args:
        prompt: A string representing the prompt to send to the LLM.

    Returns:
        A string representing the AI's generated response.
    """

    df_screened_by_dist_score = add_dist_score_column(
        df, user_input, similarity_indicator
    )
    qa_pairs = convert_to_list_of_dict(df_screened_by_dist_score)
    processed_user_question = f"""
        Learn from the context: {qa_pairs}
        Answer the following question as if you are the AI assistant: {user_input}
        Produce a text answer that are complete sentences.
    """
    output = call_chatgpt(processed_user_question)

    return output

## Run Experiments

### Use `.apply` to Generate Answers

This code is using a Python cell magic command, `%%time`, which is a feature of Jupyter notebooks. It is used to time the execution of the code in the cell.

The code snippet applies a function, `get_ans_from_no_reference`, to each row of the "question" column in the dataframe `df`. It does this by using the `apply` method of the Pandas DataFrame. The result is then stored in a new column of the dataframe, "ans_from_gpt_no_ref".

Here is the step-by-step breakdown of the code:

1. **Time the Execution:**
   - The `%%time` magic command is used to measure the time it takes to execute the code in the cell.

2. **Apply Function to DataFrame:**
   - The `apply` method is called on the "question" column of the dataframe `df`.
   - For each row in the "question" column, the function `get_ans_from_no_reference` is applied.
   - The function is applied using a lambda function, which calls `get_ans_from_no_reference` with the current question (`x`) as the argument.

3. **Store Results in New Column:**
   - The results of applying the function to each row of the "question" column are stored in a new column of the dataframe, "ans_from_gpt_no_ref".

The function `get_ans_from_no_reference` is defined as follows:

1. **Function Definition:**
   - The function is named `get_ans_from_no_reference` and takes one argument, `user_input`, which is a string.

2. **Function Documentation:**
   - The function is documented with a docstring that explains its purpose, input parameters, and return value.
   - The function uses the LLM (presumably some kind of language model) to generate an AI response to a prompt.

3. **Call External Function:**
   - The function calls another function, `call_chatgpt`, with `user_input` as the argument.
   - The result of the `call_chatgpt` function call is stored in the variable `output`.

4. **Return the Output:**
   - The function returns the value stored in the variable `output`.

The purpose of this code is to generate AI responses to each question in the dataframe and store the responses in a new column. The time taken to perform this operation is displayed in the Jupyter notebook cell output.

In [74]:
%%time
df['ans_from_gpt_no_ref'] = df.question.apply(lambda x: get_ans_from_no_reference(user_input=x))

CPU times: user 906 ms, sys: 84.7 ms, total: 991 ms
Wall time: 1min 42s


In [39]:
%%time
df['ans_from_cosine+gpt'] = df.question.apply(lambda x: get_answers_from_gpt(user_input=x, similarity_indicator="cosine"))

CPU times: user 1.61 s, sys: 99.9 ms, total: 1.71 s
Wall time: 2min 27s


In [None]:
%%time
df['ans_from_stsopenai+gpt'] = df.question.apply(lambda x: get_answers_from_gpt(user_input=x, similarity_indicator="stsopenai"))

In [41]:
%%time
df['ans_from_stspalm+gpt'] = df.question.apply(lambda x: get_answers_from_gpt(user_input=x, similarity_indicator="stspalm"))

CPU times: user 2min, sys: 9.16 s, total: 2min 9s
Wall time: 2h 2min 8s


In [56]:
df[["answers", "ans_from_cosine+gpt"]].iloc[0, :].apply(lambda x: x[0])

answers                  
ans_from_cosine+gpt    \n
Name: 0, dtype: object

In [57]:
%%time
df["ans_from_cosine+gpt_palmscore"] = df[["answers", "ans_from_cosine+gpt"]].apply(
    lambda x: calculate_sts_palm_score(
        x[0],
        x[1]
    )
)

CPU times: user 58.8 ms, sys: 2.25 ms, total: 61.1 ms
Wall time: 2.91 s


In [78]:
df.columns = ['context', 'question', 'answers', 'questions', 'cosine',
       'ans_from_cosine__gpt', 'stsopenai', 'stspalm', 'ans_from_stspalm__gpt',
       'ans_from_cosine__gpt_palmscore', 'ans_from_gpt_no_ref']

### Check Performance

Use different options: 1) embedding, 2) LLM, 3) similarity scores, to evaluate the a) answers without reference, and b) answers with references.

The goal is to show that we can produce better and more informative answers in the chatbot.

In [67]:
from tqdm import tqdm

In [90]:
scores_0 = []
for i in tqdm(range(len(df))):
    truth = df.answers[i]
    prediction = df.ans_from_gpt_no_ref[i]
    scores_0.append(calculate_sts_palm_score(truth, prediction))

100%|██████████| 83/83 [01:27<00:00,  1.06s/it]


In [91]:
df["ans_from_gpt_no_ref_palmscore"] = scores_0

In [92]:
print(f"""
    The average similarity score based on the following premises:
    1) no context or any reference,
    2) just throw raw question into chatgpt,
    3) and palm embedding to compute the similarity between predicted answer and real answer
    is: {df["ans_from_gpt_no_ref_palmscore"].mean()}
""")


    The average similarity score based on the following premises:
    1) no context or any reference, 
    2) just throw raw question into chatgpt,
    3) and palm embedding to compute the similarity between predicted answer and real answer
    is: 0.754197628682477



In [68]:
scores = []
for i in tqdm(range(len(df))):
    x = df[["answers", "ans_from_cosine__gpt"]].iloc[i, :]
    scores.append(calculate_sts_palm_score(x["answers"], x["ans_from_cosine__gpt"]))

100%|██████████| 83/83 [01:27<00:00,  1.05s/it]


In [70]:
df["ans_from_cosine+gpt_palmscore"] = scores

In [71]:
print(f"""
    The average similarity score based on the following premises:
    1) cosine as embedding,
    2) chatgpt as llm,
    3) and palm embedding to compute the similarity between predicted answer and real answer
    is: {df["ans_from_cosine+gpt_palmscore"].mean()}
""")


    The average similarity score based on the following premises:
    1) cosine as embedding, 
    2) chatgpt as llm,
    3) and palm embedding to compute the similarity between predicted answer and real answer
    is: 0.9182174069649235



In [94]:
scores_1 = []
for i in tqdm(range(len(df))):
    x = df[["answers", "ans_from_stspalm__gpt"]].iloc[i, :]
    scores_1.append(calculate_sts_palm_score(x["answers"], x["ans_from_stspalm__gpt"]))

100%|██████████| 83/83 [01:28<00:00,  1.06s/it]


In [95]:
df["ans_from_stspalm__gpt_palmscore"] = scores_1

In [96]:
print(f"""
    The average similarity score based on the following premises:
    1) cosine as embedding,
    2) palm as llm,
    3) and palm embedding to compute the similarity between predicted answer and real answer
    is: {df["ans_from_stspalm__gpt_palmscore"].mean()}
""")


    The average similarity score based on the following premises:
    1) cosine as embedding, 
    2) palm as llm,
    3) and palm embedding to compute the similarity between predicted answer and real answer
    is: 0.9256671985477677



In [97]:
df.to_csv('new_result_with_pred.csv')