<a href="https://colab.research.google.com/github/timwu64/Generative-AI--Chatbot-On-Custom-Data/blob/main/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

The dataset "2023_fashion_trends.csv" from the designated data directory is used to develop a custom chatbot.

This enhanced chatbot goes beyond what OpenAI's GPT-3.5 model (`gpt-3.5-turbo-instruct`) can answer on its own by supplying knowledge of fashion trends after 2022, which falls outside the model's training data (available up to 2022).

The "2023_fashion_trends.csv" dataset was selected for the following reasons:
- It describes contemporary fashion trends in natural language, which makes the content easy to understand.
- As a result, the custom chatbot can give relevant, up-to-date answers about the latest developments in fashion trends.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
#!pip install openai==0.28
#!pip install tiktoken

In [2]:
import openai

In [3]:
# Set the embedding model, completion model and batch size
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
batch_size = 100

Prepare the API Key File:

1. Ensure you have an API key from OpenAI. If you do not have one, you can obtain it by creating an account on the OpenAI platform and accessing the API section.
2. Create a text file named `api_key.txt` in your project directory.
3. Open the file and paste your OpenAI API key inside it. Save and close the file, then run the code below.
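
As a minimal sketch of a slightly more defensive key loader, the helper below first checks an environment variable and only then falls back to the key file. The function name `load_api_key` and its parameters are hypothetical and not part of this notebook; `OPENAI_API_KEY` is the variable name conventionally used for OpenAI keys.

```python
import os
from pathlib import Path


def load_api_key(file_path="api_key.txt", env_var="OPENAI_API_KEY"):
    """Hypothetical helper: return the key from the environment if set, else from the file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key

    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"No API key in ${env_var} and no file at {path}")

    # .strip() removes a trailing newline or stray spaces around the key
    key = path.read_text().strip()
    if not key:
        raise ValueError(f"{path} is empty")
    return key
```

Raising an explicit error for a missing or empty file surfaces the problem before the first API call, rather than as an authentication failure later.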

In [4]:
# Path to the file containing the API key
api_key_file = 'api_key.txt'

# Function to read the API key from the file
def get_api_key(file_path):
    with open(file_path, 'r') as file:
        return file.read().strip()  # .strip() removes any leading/trailing whitespace

# Retrieve the API key from the file
openai.api_key = get_api_key(api_key_file)
# display(openai.api_key)

In [5]:
import pandas as pd
df=pd.read_csv("./data/2023_fashion_trends.csv")
df.head()

Unnamed: 0,URL,Trends,Source
0,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Red. Glossy red hues took ...,7 Fashion Trends That Will Take Over 2023 — Sh...
1,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Cargo Pants. Utilitarian w...,7 Fashion Trends That Will Take Over 2023 — Sh...
2,https://www.refinery29.com/en-us/fashion-trend...,"2023 Fashion Trend: Sheer Clothing. ""Bare it a...",7 Fashion Trends That Will Take Over 2023 — Sh...
3,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Denim Reimagined. From dou...,7 Fashion Trends That Will Take Over 2023 — Sh...
4,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Shine For The Daytime. The...,7 Fashion Trends That Will Take Over 2023 — Sh...


Project dataset is loaded into a pandas dataframe containing at least 20 rows. Each row in the dataset contains a text sample in a column named "text"

In [6]:
df["text"]=df[["Trends","Source"]].agg(','.join, axis=1)
df.head()

Unnamed: 0,URL,Trends,Source,text
0,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Red. Glossy red hues took ...,7 Fashion Trends That Will Take Over 2023 — Sh...,2023 Fashion Trend: Red. Glossy red hues took ...
1,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Cargo Pants. Utilitarian w...,7 Fashion Trends That Will Take Over 2023 — Sh...,2023 Fashion Trend: Cargo Pants. Utilitarian w...
2,https://www.refinery29.com/en-us/fashion-trend...,"2023 Fashion Trend: Sheer Clothing. ""Bare it a...",7 Fashion Trends That Will Take Over 2023 — Sh...,"2023 Fashion Trend: Sheer Clothing. ""Bare it a..."
3,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Denim Reimagined. From dou...,7 Fashion Trends That Will Take Over 2023 — Sh...,2023 Fashion Trend: Denim Reimagined. From dou...
4,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Shine For The Daytime. The...,7 Fashion Trends That Will Take Over 2023 — Sh...,2023 Fashion Trend: Shine For The Daytime. The...


In [7]:
df = df.drop(columns=["URL", "Trends", "Source"])
df.head()

Unnamed: 0,text
0,2023 Fashion Trend: Red. Glossy red hues took ...
1,2023 Fashion Trend: Cargo Pants. Utilitarian w...
2,"2023 Fashion Trend: Sheer Clothing. ""Bare it a..."
3,2023 Fashion Trend: Denim Reimagined. From dou...
4,2023 Fashion Trend: Shine For The Daytime. The...
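
The wrangling steps above (joining `Trends` and `Source` into `text`, then keeping only `text`) can be sketched on toy data. The rows below are invented for illustration, and `has_valid_text_column` is a hypothetical helper that checks the project requirement of a `"text"` column with at least 20 non-empty rows.

```python
import pandas as pd

# Toy stand-in for the real CSV; the values are invented for illustration.
toy = pd.DataFrame({
    "URL": ["https://example.com/a", "https://example.com/b"],
    "Trends": ["Trend A. Description of trend A.", "Trend B. Description of trend B."],
    "Source": ["Article One", "Article Two"],
})

# Join the two text columns with a comma, then keep only the combined column.
toy["text"] = toy["Trends"] + "," + toy["Source"]
toy = toy[["text"]]


def has_valid_text_column(frame, min_rows=20):
    """Hypothetical check: a "text" column of non-empty strings with enough rows."""
    if "text" not in frame.columns or len(frame) < min_rows:
        return False
    is_text = frame["text"].map(lambda t: isinstance(t, str) and t.strip() != "")
    return bool(is_text.all())
```

On the real dataframe, `has_valid_text_column(df)` would be expected to return `True` once the `text` column has been built.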


### Create Embeddings for the Dataset

In [8]:
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )

    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df.head()

Unnamed: 0,text,embeddings
0,2023 Fashion Trend: Red. Glossy red hues took ...,"[-0.01557738147675991, -0.022012077271938324, ..."
1,2023 Fashion Trend: Cargo Pants. Utilitarian w...,"[0.0011047323932871222, -0.029918808490037918,..."
2,"2023 Fashion Trend: Sheer Clothing. ""Bare it a...","[-0.005143855698406696, -0.020919671282172203,..."
3,2023 Fashion Trend: Denim Reimagined. From dou...,"[-0.007703628856688738, -0.011601298116147518,..."
4,2023 Fashion Trend: Shine For The Daytime. The...,"[0.0005812683957628906, -0.0012148884125053883..."
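
The batching pattern in the loop above can be illustrated without calling the API. This is a minimal sketch; `iter_batches` is a hypothetical helper name, and the toy list stands in for the `text` column.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"row {i}" for i in range(7)]
batch_lengths = [len(batch) for batch in iter_batches(texts, 3)]
print(batch_lengths)  # [3, 3, 1]
```

The last batch is simply shorter when the number of rows is not a multiple of the batch size, so no row is skipped or sent twice.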


In [9]:
df.to_csv("embeddings.csv")

In [10]:
!ls

api_key.txt  data  embeddings.csv  sample_data


In [11]:
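import ast
import numpy as np
import pandas as pd
df = pd.read_csv("embeddings.csv", index_col=0)
# Each embedding is stored in the CSV as a list string; parse it as a literal
# (no arbitrary code execution), then convert it to a NumPy array
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
df.head()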

Unnamed: 0,text,embeddings
0,2023 Fashion Trend: Red. Glossy red hues took ...,"[-0.01557738147675991, -0.022012077271938324, ..."
1,2023 Fashion Trend: Cargo Pants. Utilitarian w...,"[0.0011047323932871222, -0.029918808490037918,..."
2,"2023 Fashion Trend: Sheer Clothing. ""Bare it a...","[-0.005143855698406696, -0.020919671282172203,..."
3,2023 Fashion Trend: Denim Reimagined. From dou...,"[-0.007703628856688738, -0.011601298116147518,..."
4,2023 Fashion Trend: Shine For The Daytime. The...,"[0.0005812683957628906, -0.0012148884125053883..."
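
The CSV round trip for the `embeddings` column can be checked on toy data. This sketch writes to an in-memory buffer instead of a file and parses the stored strings with `ast.literal_eval`, which only accepts Python literals; the two-dimensional vectors are invented for illustration.

```python
import ast
import io

import numpy as np
import pandas as pd

toy = pd.DataFrame({"text": ["a", "b"], "embeddings": [[0.1, 0.2], [0.3, 0.4]]})

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

restored = pd.read_csv(buffer, index_col=0)
# After the round trip each cell is a string such as "[0.1, 0.2]",
# so it has to be parsed back into a list before converting to an array.
restored["embeddings"] = restored["embeddings"].apply(ast.literal_eval).apply(np.array)
```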


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

Create a new list called `distances`, which holds the cosine distances between `question_embeddings` and each value in the `'embeddings'` column of `df`.

The function below uses the `distances` list to update `df`, then sorts `df` to find the most related rows. A shorter distance means more similarity, so we sort in ascending order. Run the cell below as-is.
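
To make the distance measure concrete, here is a minimal NumPy sketch of cosine distance, defined as one minus the cosine similarity. It illustrates the metric only and is not the library's implementation; the two-dimensional vectors and the `cosine_distance` helper are invented for the example.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); 0 means the same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


query = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

distances = {name: cosine_distance(query, vec) for name, vec in rows.items()}
# Ascending order puts the most similar row first.
ranked = sorted(distances, key=distances.get)
```

Because the metric depends only on direction, the vector `[2.0, 0.0]` is at distance 0 from the query even though its length differs.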

In [12]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy


In [13]:
TEST_QUESTION = "What are 2023 Fashion Trend for Denim?"

In [14]:
df_embedded_sorted = get_rows_sorted_by_relevance(TEST_QUESTION, df)
df_embedded_sorted

Unnamed: 0,text,embeddings,distances
3,2023 Fashion Trend: Denim Reimagined. From dou...,"[-0.007703628856688738, -0.011601298116147518,...",0.079226
44,I get it. Some of the trends on this list migh...,"[-0.017416860908269882, -0.006282269023358822,...",0.114483
29,Detailed Denim. I’m very excited for all of sp...,"[-0.017140744253993034, -0.012082819826900959,...",0.120037
19,Baggy Denim. Denim remains just as baggy this ...,"[-0.019026657566428185, -0.016275977715849876,...",0.120132
9,Denim-On-Denim. McKenna’s second most anticipa...,"[-0.015185900032520294, -0.01557119470089674, ...",0.120305
...,...,...,...
28,New Neoclassical. I am always smitten with lad...,"[-0.02945091389119625, -0.006418789271265268, ...",0.212237
74,"I'm not one for necklaces, statement rings or ...","[-0.019201578572392464, 0.00191422738134861, -...",0.212487
36,"Ruby Slippers. Late last year, I suddenly deci...","[0.0002897342201322317, -0.028935503214597702,...",0.213175
18,"Oversized Bags. As cute as they can be, tiny b...","[0.006035938858985901, -0.010934768244624138, ...",0.214821


In [15]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)


In [16]:
print(create_prompt(TEST_QUESTION, df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

2023 Fashion Trend: Denim Reimagined. From double-waisted jeans to carpenter jeans, it's been a while since we were this excited about denim trends. It seems like even the most luxe runway designers agree, sending out strapless dresses, shirting, and even undergarments and shoes (thigh-high-boot-jean hybrids anyone?) in the material. Whatever category you decide on, opt for timeless cuts and silhouettes that can stay in your closet rotation once the novelty wears off.,7 Fashion Trends That Will Take Over 2023 — Shop Them Now

---

Question: What are 2023 Fashion Trend for Denim?
Answer:
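
The context selection inside `create_prompt` is a greedy fill under a token budget. The sketch below isolates that step. It assumes a whitespace word count as a stand-in for the real tokenizer, and `select_context` is a hypothetical helper. Here `budget` represents whatever remains after the template and question have been counted.

```python
def select_context(texts, budget, count_tokens=lambda s: len(s.split())):
    """Greedily keep texts, in order, until the next one would exceed `budget`."""
    used = 0
    selected = []
    for text in texts:
        used += count_tokens(text)
        if used > budget:
            break
        selected.append(text)
    return selected


ranked_texts = ["one two three", "four five", "six seven eight nine"]
print(select_context(ranked_texts, budget=5))  # ['one two three', 'four five']
```

Because the loop stops at the first row that does not fit, a long row near the top of the ranking can crowd out shorter rows that rank below it. This explains why the 200-token example above contains a single row of context.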


In [17]:
def custom_answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model

    If the model produces an error, return an empty string
    """

    prompt = create_prompt(question, df, max_prompt_tokens)

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        #print(f"""Custom Answer:""")
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

In [18]:
def generate_general_answer(question, max_answer_tokens=150):
    """
    Return the Completion model's answer to the question without any
    custom context, as a baseline for comparison
    """
    prompt = f"""
Question: {question}
Answer:
"""
    response = openai.Completion.create(
        model=COMPLETION_MODEL_NAME,
        prompt=prompt,
        max_tokens=max_answer_tokens
    )
    return response["choices"][0]["text"].strip()

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [19]:
TEST_QUESTION = "In what innovative ways is denim being reimagined in 2023?"

In [20]:
general_answer = generate_general_answer(TEST_QUESTION)
print("Original Answer:")
print(general_answer)

Original Answer:
1. Sustainable Materials: In 2023, denim is being reimagined using sustainable materials such as organic cotton, recycled denim, and plant-based fibers like hemp or bamboo. This shift towards eco-friendly materials not only reduces the environmental impact of denim production but also creates a unique and more conscious look.

2. Customizable Denim: With the advancement of technology, denim is being reimagined as customizable clothing. Consumers will be able to choose the fit, color, and style of their denim through virtual reality tools, making it a more personal and unique experience.

3. Digital Prints and Embroidery: Denim is also being reimagined through digital prints and embroidery techniques, giving it a more artistic and unique touch. These designs can


In [21]:
custom_answer = custom_answer_question(TEST_QUESTION, df)
print("Custom Answer:")
print(custom_answer)

Custom Answer:
Denim is being reimagined in various innovative ways in 2023, including as double-waisted jeans, carpenter jeans, strapless dresses, shirting, undergarments, shoes (thigh-high-boot-jean hybrids), baggy denim, detailed denim, denim-on-denim, denim maxi skirts, elevated basics, cobalt blue, and cargo pants, all in timeless cuts and silhouettes.


### Question 2

In [22]:
TEST_QUESTION = "How are red hues influencing the fashion landscape in 2023?"

In [23]:
general_answer = generate_general_answer(TEST_QUESTION)
print(general_answer)

As we look ahead to the year 2023, it's clear that red hues will continue to play a major role in the fashion landscape. From bold, fiery reds to more subdued, earthy tones, this powerful color will be seen on runways, in street style looks, and in everyday fashion choices.

One major trend that we can expect to see is the incorporation of red hues in monochromatic outfits. This means wearing different shades and textures of red from head to toe for a bold and cohesive look. We'll see this trend in tailored suits, casual athleisure wear, and even evening gowns.

Another way red hues will make a statement is through statement pieces like coats, jackets, and dresses. These bold pieces will


In [24]:
custom_answer = custom_answer_question(TEST_QUESTION, df)
print(custom_answer)

Red hues are a popular trend in the fashion industry, with designers incorporating glossy red hues in their collections for Fall 2023. This trend ranges from head-to-toe looks to accent accessory pieces, and is also seen in bold and striking red ensembles on the runway.
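
The side-by-side comparison in the two questions above could be wrapped in a small helper. This is a sketch only: `compare_answers` is a hypothetical name, and the stub answer functions are there so the example runs without an API key.

```python
def compare_answers(questions, general_fn, custom_fn):
    """Return one dict per question holding both answers side by side."""
    results = []
    for question in questions:
        results.append({
            "question": question,
            "general": general_fn(question),
            "custom": custom_fn(question),
        })
    return results


# Stub answer functions stand in for the real model calls.
demo = compare_answers(
    ["What is trending?"],
    general_fn=lambda q: "general: " + q,
    custom_fn=lambda q: "custom: " + q,
)
```

In this notebook the real functions would be passed instead, for example `general_fn=generate_general_answer` and `custom_fn=lambda q: custom_answer_question(q, df)`.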


## Conclusion

The custom answers prove more effective: they cite concrete trends from the 2023 dataset (for example double-waisted jeans, carpenter jeans, and glossy red hues for Fall 2023), whereas the general answers stay generic. This is a significant advantage over querying OpenAI's GPT-3.5 model on its own, which is trained on data up until 2022.

To achieve this, I used embedding-based retrieval to build the prompt. The dataset rows and the question are embedded with `text-embedding-ada-002`, the rows closest to the question by cosine distance are placed in the prompt as context, and the OpenAI completion model answers from that context.

Through this approach, the enhanced chatbot gave more relevant and current answers about the newest fashion trends than the general completion did for both test questions.

## Chat Bot

In [25]:

print('OpenAI: Hello, How can I help you today?\n')
while True:
    question = input('You: ')
    if len(question) > 0:
        print(f'\nGeneral Answer: {generate_general_answer(question)}', end='\n\n')
        print(f'\nCustom Answer: {custom_answer_question(question, df)}', end='\n\n')
    else:
        print('Goodbye!!')
        break

OpenAI: Hello, How can I help you today?

You: How are red hues influencing the fashion landscape in 2023?

General Answer: Red hues are expected to have a strong presence in the fashion landscape in 2023. The color red has always been associated with passion, strength, and power, and it is set to make a bold statement in the fashion world in the upcoming years.

Designers are incorporating various shades of red into their collections, from bright and bold tones to deep and rich hues. This versatile color can be seen in a variety of clothing pieces, including dresses, suits, coats, and accessories.

One of the main ways that red hues are influencing the fashion landscape is through the use of monochromatic looks. This involves wearing various shades of red from head to toe, creating a bold and cohesive outfit. This trend has been seen on the runways of


Custom Answer: Red hues are dominating the fashion trends in 2023, with glossy red shades being showcased on runways and becoming a p