# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import numpy as np
import pandas as pd
import openai
from openai.embeddings_utils import get_embedding, distances_from_embeddings
import tiktoken

In [2]:
openai.api_key = "..."
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

In [3]:
def get_text(df: pd.DataFrame) -> list:
    """
    Generate text from DataFrame.

    Parameters:
        df (pd.DataFrame): Input DataFrame containing URL, Trends, and Source columns.

    Returns:
        list: A list of strings containing text generated from the DataFrame.
    """
    text = []
    for _, row in df.iterrows():
        # Combine the three source columns into one labelled block per row
        text.append(
            f"URL: {row['URL']}\n"
            f"Trends: {row['Trends']}\n"
            f"Source: {row['Source']}"
        )
    return text

In [4]:
def get_embeddings(df: pd.DataFrame, batch_size: int) -> pd.DataFrame:
    """
    Generate embeddings for text data in DataFrame.

    Parameters:
        df (pd.DataFrame): Input DataFrame containing URL, Trends, and Source columns.
            A 'text' column is added to it in place.
        batch_size (int): Number of rows sent to the embedding model per request.

    Returns:
        pd.DataFrame: The same DataFrame with 'text' and 'embeddings' columns added.
    """
    embeddings = []
    df["text"] = get_text(df)
    for i in range(0, len(df), batch_size):
        response = openai.Embedding.create(
            input=df.iloc[i:i+batch_size]["text"].tolist(),
            engine=EMBEDDING_MODEL_NAME
        )

        embeddings.extend([data['embedding'] for data in response["data"]])

    df['embeddings'] = embeddings
    return df
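
The batching in `get_embeddings` can be checked without calling the API. The sketch below is a minimal illustration, not part of the notebook: `iter_batches` and the sample `texts` list are hypothetical names, and the slices it yields correspond to what `df.iloc[i:i+batch_size]` selects in the loop above.

```python
from typing import Iterator, List


def iter_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical stand-in rows; the notebook sends slices of df["text"] instead.
texts = [f"row {i}" for i in range(12)]
batches = list(iter_batches(texts, 5))

# The last batch is shorter when the row count is not a multiple of batch_size.
print([len(batch) for batch in batches])  # [5, 5, 2]
```

Because the final slice is simply shorter, no row is dropped or sent twice, so the collected embeddings line up one-to-one with the DataFrame rows.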

In [5]:
def create_rag_database() -> pd.DataFrame:
    """
    Create a RAG (Retrieval-Augmented Generation) database.

    Returns:
        pd.DataFrame: The first five rows of the RAG database, as a preview.
            The full database is written to 'fashion_trends_embeddings.csv'.
    """
    df = pd.read_csv('data/2023_fashion_trends.csv')
    df = get_embeddings(df, 5)
    df.to_csv('fashion_trends_embeddings.csv', index=False)
    return df.head(5)

In [6]:
create_rag_database()

Unnamed: 0,URL,Trends,Source,text,embeddings
0,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Red. Glossy red hues took ...,7 Fashion Trends That Will Take Over 2023 — Sh...,URL: https://www.refinery29.com/en-us/fashion-...,"[-0.012282825075089931, -0.023432452231645584,..."
1,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Cargo Pants. Utilitarian w...,7 Fashion Trends That Will Take Over 2023 — Sh...,URL: https://www.refinery29.com/en-us/fashion-...,"[-0.0002861183893401176, -0.03149326890707016,..."
2,https://www.refinery29.com/en-us/fashion-trend...,"2023 Fashion Trend: Sheer Clothing. ""Bare it a...",7 Fashion Trends That Will Take Over 2023 — Sh...,URL: https://www.refinery29.com/en-us/fashion-...,"[-0.006564814131706953, -0.022983480244874954,..."
3,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Denim Reimagined. From dou...,7 Fashion Trends That Will Take Over 2023 — Sh...,URL: https://www.refinery29.com/en-us/fashion-...,"[-0.00231093168258667, -0.0138161052018404, 0...."
4,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Shine For The Daytime. The...,7 Fashion Trends That Will Take Over 2023 — Sh...,URL: https://www.refinery29.com/en-us/fashion-...,"[0.00034107582177966833, 0.0012216639006510377..."


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [7]:
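import ast

def load_rag_database() -> pd.DataFrame:
    """
    Load the RAG (Retrieval-Augmented Generation) database.

    Returns:
        pd.DataFrame: DataFrame containing RAG database with text embeddings.
    """
    df = pd.read_csv("fashion_trends_embeddings.csv")
    # The CSV stores each embedding as a string, so parse it back into an array.
    # ast.literal_eval only accepts Python literals, unlike eval, which runs any code.
    df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
    return df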
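
A CSV file stores each embedding as its string representation, so the column has to be parsed back into numbers on load. The sketch below shows that round trip using only the standard library. The sample row and values are made up for illustration, and `ast.literal_eval` is used for parsing here because it accepts only Python literals.

```python
import ast
import csv
import io

# Hypothetical three-dimensional embedding; real ones are much longer.
embedding = [-0.0123, 0.0456, 0.0789]

# Write one row to an in-memory CSV, storing the list as a string.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["Trends: Red", str(embedding)])

# Read it back: the embeddings cell is now plain text, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["embeddings"]

# Parse the text back into a list of floats.
parsed = ast.literal_eval(raw)
print(type(raw).__name__, type(parsed).__name__)  # str list
```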

In [8]:
def get_rows_sorted_by_relevance(question: str, df: pd.DataFrame) -> pd.DataFrame:
    """
    Take a question string and a dataframe containing rows of text and
    associated embeddings, and return a copy of that dataframe sorted
    from most to least relevant to the question.

    Parameters:
        question (str): The question for relevance sorting.
        df (pd.DataFrame): DataFrame containing text and associated embeddings.

    Returns:
        pd.DataFrame: DataFrame sorted by relevance to the question.
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
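
To make the ranking concrete, here is a small sketch of cosine-distance sorting on toy vectors. It assumes cosine distance is defined as one minus cosine similarity; `cosine_distance`, `rank_by_distance`, and the two-dimensional vectors are illustrative names and data, not part of the notebook or the OpenAI library.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(
    query: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
) -> List[Tuple[float, str]]:
    """Return (distance, label) pairs with the closest row first."""
    return sorted((cosine_distance(query, vec), label) for label, vec in rows)


# Toy data: labels describe each vector's direction relative to the query.
query = [1.0, 0.0]
rows = [
    ("opposite", [-1.0, 0.0]),
    ("same direction", [2.0, 0.0]),
    ("orthogonal", [0.0, 3.0]),
]

ranked = rank_by_distance(query, rows)
print([label for _, label in ranked])
# ['same direction', 'orthogonal', 'opposite']
```

Vector length does not affect the result: `[2.0, 0.0]` is at distance 0 from the query even though it is twice as long, which is why the smallest distance marks the most relevant row.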

In [9]:
def create_prompt(question: str, df: pd.DataFrame, max_token_count: int) -> str:
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model

    Parameters:
        question (str): The question to be answered based on the context.
        df (pd.DataFrame): DataFrame containing text and associated embeddings.
        max_token_count (int): Maximum token count for the generated prompt.

    Returns:
        str: Text prompt for the question and context.
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
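
The context-packing loop in `create_prompt` is a greedy fill against a token budget. The sketch below isolates that logic so it can be run without `tiktoken` or the API. As a loudly stated assumption, it counts whitespace-separated words instead of real tokens, and `pack_context`, `count_words`, and the sample rows are hypothetical.

```python
from typing import Callable, List


def pack_context(
    rows: List[str],
    budget: int,
    count_tokens: Callable[[str], int],
    used: int = 0,
) -> List[str]:
    """Add rows in order until the next row would exceed the budget."""
    selected = []
    for text in rows:
        used += count_tokens(text)
        if used > budget:
            break  # stop at the first row that does not fit
        selected.append(text)
    return selected


def count_words(text: str) -> int:
    """Stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


# Rows are assumed to be already sorted from most to least relevant.
rows = ["red is back", "cargo pants return this year", "sheer layers"]

# `used=2` plays the role of the template and question tokens.
print(pack_context(rows, budget=10, count_tokens=count_words, used=2))
# ['red is back', 'cargo pants return this year']
print(pack_context(rows, budget=9, count_tokens=count_words, used=2))
# ['red is back']
```

With a budget of 9, the loop stops at the second row even though the shorter third row would still fit. That matches the `break` in `create_prompt`: it favors keeping the most relevant rows contiguous over filling every remaining token.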

In [24]:
def answer_question(
    question: str, 
    df: pd.DataFrame, 
    max_prompt_tokens: int = 3000, 
    max_answer_tokens: int = 300
) -> str:
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string

    Parameters:
        question (str): The question to be answered.
        df (pd.DataFrame): DataFrame containing text and associated embeddings.
        max_prompt_tokens (int): Maximum token count for the generated prompt.
        max_answer_tokens (int): Maximum token count for the model's response.

    Returns:
        str: Answer to the question generated by the Completion model.
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""


In [25]:
df = load_rag_database()
df.head()

Unnamed: 0,URL,Trends,Source,text,embeddings
0,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Red. Glossy red hues took ...,7 Fashion Trends That Will Take Over 2023 — Sh...,URL: https://www.refinery29.com/en-us/fashion-...,"[-0.012282825075089931, -0.023432452231645584,..."
1,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Cargo Pants. Utilitarian w...,7 Fashion Trends That Will Take Over 2023 — Sh...,URL: https://www.refinery29.com/en-us/fashion-...,"[-0.0002861183893401176, -0.03149326890707016,..."
2,https://www.refinery29.com/en-us/fashion-trend...,"2023 Fashion Trend: Sheer Clothing. ""Bare it a...",7 Fashion Trends That Will Take Over 2023 — Sh...,URL: https://www.refinery29.com/en-us/fashion-...,"[-0.006564814131706953, -0.022983480244874954,..."
3,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Denim Reimagined. From dou...,7 Fashion Trends That Will Take Over 2023 — Sh...,URL: https://www.refinery29.com/en-us/fashion-...,"[-0.00231093168258667, -0.0138161052018404, 0...."
4,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Shine For The Daytime. The...,7 Fashion Trends That Will Take Over 2023 — Sh...,URL: https://www.refinery29.com/en-us/fashion-...,"[0.00034107582177966833, 0.0012216639006510377..."


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [26]:
question = "What was the fashion trend in spring 2023?"

#### Basic Completion

In [27]:
response = openai.Completion.create(
    model=COMPLETION_MODEL_NAME, 
    prompt=question, 
    max_tokens=300)

print(response["choices"][0]["text"].strip())

I'm sorry, I cannot predict fashion trends for future years.


#### Custom Completion

In [28]:
answer = answer_question(question, df)
print(answer)

The fashion trends for spring 2023 included more 3D designs with floral motifs, bold colors and bold prints, minimalist and simple styles, edgy and grunge styles, delicate and sheer fabrics, balloon and puffed shapes, tailored and tailored looks, metallic fabrics and neons, and a return to '90s and '00s fashion. The specific trends mentioned vary among sources.


### Question 2

In [29]:
question = "What was the fashion trend in autumn 2023?"

#### Basic Completion

In [30]:
response = openai.Completion.create(
    model=COMPLETION_MODEL_NAME, 
    prompt=question, 
    max_tokens=300)

print(response["choices"][0]["text"].strip())

It is impossible to predict fashion trends of a specific year in the future. Fashion trends are constantly changing and evolving and are influenced by various factors such as societal changes, cultural influences, and designer creativity. It is best to follow current fashion trends and make your own unique style choices.


#### Custom Completion

In [31]:
answer = answer_question(question, df)
print(answer)

Metallics were commonplace in autumn/winter collections.


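The 2023 fashion trends dataset (`data/2023_fashion_trends.csv`) was chosen for this custom fashion chatbot. It is appropriate for the task because the basic `Completion` model could not answer either question about 2023 trends on its own, while the custom query, which adds the most relevant dataset rows to the prompt, answered both.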