# Custom Chatbot Project

The dataset `2023_fashion_trends.csv` was selected for this project. It contains descriptions of 2023 fashion trends, together with the URL and source article for each one. It suits this task because each row is a self-contained trend description that can be retrieved as context for a fashion-related question. The dataset also spans a variety of styles and sources, which lets us write questions that cover a broad range of fashion topics.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [74]:
import pandas as pd
import numpy as np
import openai
import os
from openai.embeddings_utils import distances_from_embeddings

In [75]:
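# Read the key from the environment when it is set; otherwise fall back to the placeholder
openai.api_key = os.getenv("OPENAI_API_KEY", "YOUR API KEY")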
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
MAX_TOKENS = 1000

In [76]:
df = pd.read_csv('2023_fashion_trends.csv')

In [77]:
df.head()

Unnamed: 0,URL,Trends,Source
0,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Red. Glossy red hues took ...,7 Fashion Trends That Will Take Over 2023 — Sh...
1,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Cargo Pants. Utilitarian w...,7 Fashion Trends That Will Take Over 2023 — Sh...
2,https://www.refinery29.com/en-us/fashion-trend...,"2023 Fashion Trend: Sheer Clothing. ""Bare it a...",7 Fashion Trends That Will Take Over 2023 — Sh...
3,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Denim Reimagined. From dou...,7 Fashion Trends That Will Take Over 2023 — Sh...
4,https://www.refinery29.com/en-us/fashion-trend...,2023 Fashion Trend: Shine For The Daytime. The...,7 Fashion Trends That Will Take Over 2023 — Sh...


In [78]:
df['text'] = 'Source: ' + df['Source'] + ' --> ' + df['Trends']

In [79]:
embeddings = []

In [80]:
# Request one embedding per row of the "text" column
for index, row in df.iterrows():
  response = openai.Embedding.create(
      input=row["text"],
      engine=EMBEDDING_MODEL_NAME
  )
  embeddings.extend([data["embedding"] for data in response["data"]])
df["embeddings"] = embeddings
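The loop above sends one request per row. The embeddings endpoint also accepts a list of inputs, so the rows could be sent in batches instead. The following is a minimal sketch under that assumption, not the method used in this notebook: `embed_in_batches`, `openai_embed_batch`, and `fake_embed_batch` are hypothetical names, and the batch size of 100 is an arbitrary choice.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed texts in order, passing at most batch_size texts per call to embed_batch."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_batch(batch))
    return embeddings


def openai_embed_batch(batch, engine="text-embedding-ada-002"):
    # Hypothetical wrapper around the same call used in the cell above (pre-1.0 openai package)
    import openai
    response = openai.Embedding.create(input=batch, engine=engine)
    # Sort by the "index" field so the output order matches the input order
    ordered = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]


# Offline stand-in so the sketch runs without an API key (values are placeholders, not embeddings)
batch_sizes_seen = []

def fake_embed_batch(batch):
    batch_sizes_seen.append(len(batch))
    return [[float(len(text))] for text in batch]


toy_embeddings = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2)
print(toy_embeddings, batch_sizes_seen)
```

With the real wrapper, the call would look like `df["embeddings"] = embed_in_batches(df["text"].tolist(), openai_embed_batch)`.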

In [81]:
df[["text", "embeddings"]].to_csv("fashion_trends_embeddings.csv")

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [82]:
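import ast

df = pd.read_csv('fashion_trends_embeddings.csv', index_col=0)
# The CSV stores each embedding as a string, so parse it back into a NumPy array.
# ast.literal_eval only accepts Python literals, which makes it safer than eval.
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)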
df.head()

Unnamed: 0,text,embeddings
0,Source: 7 Fashion Trends That Will Take Over 2...,"[-0.013025907799601555, -0.023445330560207367,..."
1,Source: 7 Fashion Trends That Will Take Over 2...,"[-0.0012024756288155913, -0.028465161100029945..."
2,Source: 7 Fashion Trends That Will Take Over 2...,"[-0.005991143640130758, -0.021644994616508484,..."
3,Source: 7 Fashion Trends That Will Take Over 2...,"[-0.012575166299939156, -0.008558006957173347,..."
4,Source: 7 Fashion Trends That Will Take Over 2...,"[-0.0027159699238836765, -0.002417146926745772..."


In [83]:
def question_embeddings(question):
  response = openai.Embedding.create(
      input = question,
      engine = EMBEDDING_MODEL_NAME
  )
  return response["data"][0]["embedding"]

In [84]:
q1 = "Which company has been using the dolly theme for a while?"
q2 = "In which event was the pinstriped version of the suit vest seen?"

q1_embeddings = question_embeddings(q1)
q2_embeddings = question_embeddings(q2)

In [85]:
df['q1_distances'] = distances_from_embeddings(
  q1_embeddings,
  df['embeddings'].values,
  distance_metric="cosine"
)

In [86]:
df['q2_distances'] = distances_from_embeddings(
  q2_embeddings,
  df['embeddings'].values,
  distance_metric="cosine"
)
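For reference, the cosine distance computed above is one minus the cosine similarity between the question embedding and each row embedding. The sketch below reproduces that calculation with NumPy only, which may be useful because, as far as we know, `openai.embeddings_utils` is not included in the 1.x releases of the `openai` package. The function name `cosine_distances` and the toy vectors are illustrative assumptions, not part of the original notebook.

```python
import numpy as np


def cosine_distances(query_embedding, embeddings):
    """Return 1 - cosine similarity between the query vector and each row vector."""
    query = np.asarray(query_embedding, dtype=float)
    matrix = np.vstack([np.asarray(e, dtype=float) for e in embeddings])
    similarities = (matrix @ query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    return (1.0 - similarities).tolist()


# Toy 2-D vectors, made up for illustration (real embeddings have far more dimensions)
toy_query = [1.0, 0.0]
toy_rows = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
toy_distances = cosine_distances(toy_query, toy_rows)
print(toy_distances)
```

A distance near 0 means the row points in the same direction as the question, which is why the rows are later sorted in ascending order.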

In [87]:
# Sort by distance so the rows most relevant to each question come first
dfq1 = df.sort_values(by=["q1_distances"], ascending=True)
dfq2 = df.sort_values(by=["q2_distances"], ascending=True)

In [88]:
dfq1.head()

Unnamed: 0,text,embeddings,q1_distances,q2_distances
62,Source: Spring/Summer 2023 Fashion Trends: 21 ...,"[-0.03368574008345604, -0.004947429522871971, ...",0.233142,0.245333
55,Source: Spring/Summer 2023 Fashion Trends: 21 ...,"[-0.02430758625268936, -0.002528405049815774, ...",0.253011,0.23184
64,Source: Spring/Summer 2023 Fashion Trends: 21 ...,"[-0.01614186353981495, -0.009283903986215591, ...",0.256301,0.236784
75,Source: Spring/Summer 2023 Fashion Trends: 21 ...,"[-0.037023838609457016, -0.010987632907927036,...",0.256869,0.229826
52,Source: Spring/Summer 2023 Fashion Trends: 21 ...,"[-0.01960800401866436, -0.001156726386398077, ...",0.264842,0.243356


In [89]:
dfq2.head()

Unnamed: 0,text,embeddings,q1_distances,q2_distances
38,Source: These Are the Spring 2023 Trends Vogue...,"[-0.012374293059110641, -0.0005402872338891029...",0.265285,0.15058
25,Source: These Are the Spring 2023 Trends Vogue...,"[-0.014497331343591213, -0.02013295143842697, ...",0.274717,0.22523
23,Source: These Are the Spring 2023 Trends Vogue...,"[-0.02915928326547146, -0.023916088044643402, ...",0.276595,0.225361
22,Source: These Are the Spring 2023 Trends Vogue...,"[-0.026542559266090393, -0.01678287796676159, ...",0.278781,0.225457
70,Source: Spring/Summer 2023 Fashion Trends: 21 ...,"[-0.00910821184515953, -0.010854345746338367, ...",0.283255,0.225817


In [90]:
import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base")
prompt_template = """
Answer the question based on the context below, and if the
question can't be answered based on the context, say
"I don't know"

Context:

{}

---

Question: {}
Answer:"""
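def get_prompt(question, df):
  # Start the budget with the tokens used by the template and the question
  token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
  context_list = []
  # df is expected to be sorted by distance, so the closest rows are added first
  for text in df["text"].head(10).values:
    token_count += len(tokenizer.encode(text))
    # Stop as soon as adding another row would exceed MAX_TOKENS
    if token_count > MAX_TOKENS:
      break
    context_list.append(text)
  prompt = prompt_template.format("\n\n###\n\n".join(context_list), question)
  return prompt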
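
The context selection in `get_prompt` is greedy: rows are added in order of relevance, and the loop stops at the first row that does not fit the token budget, even if a shorter row further down would still fit. The sketch below isolates that behavior. It is an illustration under stated assumptions: `select_context` is a hypothetical helper, and `whitespace_tokens` is a crude stand-in for `len(tokenizer.encode(text))` so that the example runs without `tiktoken`.

```python
def select_context(texts, count_tokens, budget):
    """Keep texts in order until the next one would exceed the token budget."""
    selected, used = [], 0
    for text in texts:
        cost = count_tokens(text)
        if used + cost > budget:
            break  # stop at the first text that does not fit
        selected.append(text)
        used += cost
    return selected, used


def whitespace_tokens(text):
    # Hypothetical stand-in for len(tokenizer.encode(text))
    return len(text.split())


toy_texts = ["red is back", "cargo pants are everywhere this year", "sheer layers"]
toy_selected, toy_used = select_context(toy_texts, whitespace_tokens, budget=5)
print(toy_selected, toy_used)
```

Here the second text costs more than the remaining budget, so selection stops there and the short third text is never considered.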


In [91]:
q1_prompt = get_prompt(q1, dfq1)
q2_prompt = get_prompt(q2, dfq2)

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [92]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=q1_prompt,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

Gucci


### Question 2

In [93]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=q2_prompt,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

Coperni SS23 show.
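
The cells above show only the answers from the custom prompts. To also show the baseline asked for in this section, the same question can be sent without any context and the two answers printed side by side. The following is a minimal sketch of one way to do that: `compare_answers`, `openai_complete`, and `fake_complete` are hypothetical names, and the stand-in replies are placeholders rather than model output.

```python
def compare_answers(question, custom_prompt, complete):
    """Return (baseline_answer, custom_answer) using the same completion callable."""
    baseline_answer = complete(question).strip()
    custom_answer = complete(custom_prompt).strip()
    return baseline_answer, custom_answer


def openai_complete(prompt, model="gpt-3.5-turbo-instruct", max_tokens=150):
    # Hypothetical wrapper that mirrors the openai.Completion.create calls in the cells above
    import openai
    response = openai.Completion.create(model=model, prompt=prompt, max_tokens=max_tokens)
    return response["choices"][0]["text"]


def fake_complete(prompt):
    # Offline stand-in so the sketch runs without an API key; replies are placeholders
    return " with context " if "Context:" in prompt else " no context "


demo_answers = compare_answers("Q?", "Context:\n\n...\n\nQuestion: Q?\nAnswer:", fake_complete)
print(demo_answers)
```

In the notebook, the call would be `compare_answers(q1, q1_prompt, openai_complete)`, and likewise for `q2` and `q2_prompt`.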
