# Custom Chatbot Project

I chose the 2023 Fashion Trends dataset for this project because it contains specialized, 
time-specific information that a base language model is unlikely to have seen in its training 
data. That makes it a good candidate for demonstrating the value of custom prompts and 
retrieval-augmented generation. Fashion trends change quickly from year to year, so a curated 
set of 2023 trends lets the chatbot ground its answers in details a standard model would 
struggle to supply. Because each entry describes a specific trend, style, or recommendation, 
the retrieved text can be incorporated directly into the response, which makes the difference 
between generic fashion knowledge and 2023-specific trend insight easy to see.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import pandas as pd

import openai

pd.set_option('display.max_colwidth', None)

In [2]:
df = pd.read_csv('./data/2023_fashion_trends.csv')
print(df.head())

                                                    URL  \
0  https://www.refinery29.com/en-us/fashion-trends-2023   
1  https://www.refinery29.com/en-us/fashion-trends-2023   
2  https://www.refinery29.com/en-us/fashion-trends-2023   
3  https://www.refinery29.com/en-us/fashion-trends-2023   
4  https://www.refinery29.com/en-us/fashion-trends-2023   

   Trends  \
0

In [3]:
# Keep only the 'Trends' column (the name is case-sensitive) and rename it to 'text'.
# rename() returns a new DataFrame, so no column labels are assigned on a slice.
df = df[['Trends']].rename(columns={'Trends': 'text'})

# Verify your dataset meets requirements
print(f"✓ Number of rows: {len(df)}")
print(f"✓ Column name: {df.columns[0]}")
print(f"✓ No missing values: {df['text'].isna().sum() == 0}")

# Now you can check the DataFrame
#print(df)

✓ Number of rows: 82
✓ Column name: text
✓ No missing values: True
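
The three printed checks can also be collected into one reusable helper. The sketch below is only an illustration: `check_text_frame` and its `min_rows` parameter are hypothetical names that do not appear in the course materials, and the toy frame stands in for the real dataset.

```python
import pandas as pd

def check_text_frame(frame, min_rows=20):
    """Return a list of problems found; an empty list means the frame passes."""
    problems = []
    if "text" not in frame.columns:
        # Without the column the remaining checks cannot run
        problems.append('missing "text" column')
        return problems
    if len(frame) < min_rows:
        problems.append(f"only {len(frame)} rows, need at least {min_rows}")
    if frame["text"].isna().any():
        problems.append("missing values in text")
    if (frame["text"].astype(str).str.strip() == "").any():
        problems.append("empty strings in text")
    return problems

# Toy data for illustration only, not rows from the real dataset
toy = pd.DataFrame({"text": [f"trend {i}" for i in range(25)]})
print(check_text_frame(toy))
```

Returning a list of messages rather than printing a check mark on every line makes it harder to miss a failed check.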


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [4]:

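import os

# Read the API key from the environment instead of hard-coding it in the notebook
openai.api_key = os.environ.get("OPENAI_API_KEY", "")
# Note: openai.Embedding and openai.Completion belong to the pre-1.0 interface of the openai package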
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

batch_size = 100
embeddings = []

for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings

print(f"✓ Created embeddings for {len(df)} fashion trends")
print(df.head())



✓ Created embeddings for 82 fashion trends
   text  \
0  2023 Fashion Trend: Red. Glossy red hues took over the Fall 2023 runways ranging from Sandy Liang and PatBo to Tory Burch and Wiederhoeft. Think: Juicy reds with vibrant orange undertones
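
The batching loop in the cell above slices the dataframe in steps of `batch_size`. With 82 rows and a batch size of 100, that works out to a single request. A minimal sketch of the same slicing logic on a plain list, with no API calls, is shown here; `iter_batches` is a hypothetical helper name and the rows are placeholders.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder rows standing in for the 82 trend descriptions
rows = [f"trend {i}" for i in range(82)]

print([len(batch) for batch in iter_batches(rows, 100)])  # [82]
print([len(batch) for batch in iter_batches(rows, 25)])   # [25, 25, 25, 7]
```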

In [5]:
import numpy as np

def cosine_similarity(a, b):
    """Calculate cosine similarity between two vectors"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def create_query_embedding(query):
    """Create embedding for the user's query"""
    response = openai.Embedding.create(
        input=query,
        engine=EMBEDDING_MODEL_NAME
    )
    return response["data"][0]["embedding"]

def find_similar_texts(query_embedding, df, top_n=3):
    """Find the most similar texts from the dataset using pandas"""
    # Calculate similarity scores
    df_copy = df.copy()
    df_copy["similarity"] = df_copy["embeddings"].apply(
        lambda x: cosine_similarity(query_embedding, x)
    )
    
    # Sort and get top results
    top_results = df_copy.sort_values("similarity", ascending=False).head(top_n)
    
    return top_results

def create_custom_prompt(query, relevant_texts):
    """Create a prompt with context from the dataset"""
    context = "\n\n".join(relevant_texts["text"].tolist())
    
    # Build the prompt from separate string pieces so that the indentation of
    # this function does not leak into the text sent to the model
    prompt = (
        "You are a fashion expert with knowledge of 2023 fashion trends.\n"
        "Use the following information about 2023 fashion trends to answer the question.\n\n"
        f"2023 Fashion Trends Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer based on the 2023 fashion trends provided above:"
    )
    
    return prompt
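

The retrieval step can be tried offline with small hand-made vectors before spending API calls. The following sketch is an assumption-laden illustration, not the course's method. `rank_by_cosine` is a hypothetical helper, the 3-dimensional vectors are invented stand-ins for real embeddings, and it assumes no vector is all zeros. It computes all cosine scores with one matrix product instead of a row-by-row `apply`.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_matrix, top_n=3):
    """Return (indices, scores) of the top_n rows most similar to query_vec."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # One matrix-vector product gives every dot product; dividing by the
    # norms turns the dot products into cosine similarities
    scores = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    # Negate so that argsort returns the highest scores first
    order = np.argsort(-scores)[:top_n]
    return order, scores[order]

# Hypothetical 3-dimensional "embeddings", chosen by hand for illustration
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
order, scores = rank_by_cosine([1.0, 0.0, 0.0], docs, top_n=2)
print(order.tolist())  # [0, 2]
```

With the real data, the matrix would be built by stacking the `embeddings` column, for example with `np.vstack(df["embeddings"])`.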

In [6]:
def get_custom_completion(query, df, model="gpt-3.5-turbo-instruct", max_tokens=200):
    """Get completion using custom RAG approach"""
    # Create query embedding
    query_embedding = create_query_embedding(query)
    
    # Find similar texts
    relevant_texts = find_similar_texts(query_embedding, df, top_n=3)
    
    # Create custom prompt
    custom_prompt = create_custom_prompt(query, relevant_texts)
    
    # Get completion
    response = openai.Completion.create(
        engine=model,
        prompt=custom_prompt,
        max_tokens=max_tokens,
        temperature=0.7
    )
    
    return response.choices[0].text.strip()

def get_basic_completion(query, model="gpt-3.5-turbo-instruct", max_tokens=500):
    """Get basic completion without custom context"""
    response = openai.Completion.create(
        engine=model,
        prompt=query,
        max_tokens=max_tokens,
        temperature=0.7
    )
    
    return response.choices[0].text.strip()
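

`get_custom_completion` passes only the top three texts, so the prompt stays short. If `top_n` were raised, the context could grow past what the model accepts. One hedged way to guard against that is a simple size budget, sketched below. `build_context` and `max_chars` are hypothetical names, and character count is only a rough proxy for tokens.

```python
def build_context(texts, max_chars=6000, separator="\n\n"):
    """Join texts in order, stopping before the character budget is exceeded."""
    selected = []
    used = 0
    for text in texts:
        # The separator only costs characters once something is already selected
        extra = len(text) + (len(separator) if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(text)
        used += extra
    return separator.join(selected)

# Placeholder strings; the real input would be the retrieved texts, most similar first
print(repr(build_context(["aaaa", "bbbb", "cccc"], max_chars=10)))  # 'aaaa\n\nbbbb'
```

Because the texts arrive sorted by similarity, stopping at the first one that does not fit keeps the most relevant entries.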

### Question 1

In [None]:
# Question 1
question1 = "What colors should I wear for 2023 fashion trends?"

print("=" * 70)
print("QUESTION 1:", question1)
print("=" * 70)

print("\n--- BASIC COMPLETION (without custom data) ---")
basic_answer1 = get_basic_completion(question1)
print(basic_answer1)

print("\n--- CUSTOM COMPLETION (with 2023 fashion trends data) ---")
custom_answer1 = get_custom_completion(question1, df)
print(custom_answer1)
print("=" * 70)

QUESTION 1: What colors should I wear for 2023 fashion trends?

--- BASIC COMPLETION (without custom data) ---


### Question 2

In [None]:
# Question 2
question2 = "What type of pants are trending in 2023?"

print("=" * 70)
print("QUESTION 2:", question2)
print("=" * 70)

print("\n--- BASIC COMPLETION (without custom data) ---")
basic_answer2 = get_basic_completion(question2)
print(basic_answer2)

print("\n--- CUSTOM COMPLETION (with 2023 fashion trends data) ---")
custom_answer2 = get_custom_completion(question2, df)
print(custom_answer2)
print("=" * 70)