# Custom Chatbot Project

My chosen dataset consists of the Wikipedia pages of the NASDAQ-100 companies. These articles include recent information on company events that is not in the model's original training data.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [133]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np

In [None]:
# read in wiki pages of Nasdaq companies https://en.wikipedia.org/wiki/Nasdaq-100

# get the response in the form of html
wikiurl="https://en.wikipedia.org/wiki/Nasdaq-100"
table_class="wikitable sortable jquery-tablesorter"
response=requests.get(wikiurl)
print(response.status_code)



In [None]:
# parse data from the html into a beautifulsoup object, find nasdaq table
soup = bs(response.text, 'html.parser')
nasdaqtable=soup.find_all('table',{'class':"wikitable"})[-1]
nasdaqtable

In [None]:
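# read the HTML table into a dataframe
# (newer pandas versions expect a file-like object rather than a raw HTML string)
from io import StringIO
df = pd.read_html(StringIO(str(nasdaqtable)))[0]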

# clean the company names so that they correspond to the wiki page titles
# (regex=False treats parentheses and dots as literal characters on every pandas version)
df["Company"] = df["Company"].str.replace(" (Class A)", "", regex=False)
df["Company"] = df["Company"].str.replace(" (Class C)", "", regex=False)
df["Company"] = df["Company"].str.replace("ADP", "ADP (company)", regex=False)
df["Company"] = df["Company"].str.replace("Amazon", "Amazon (company)", regex=False)
df["Company"] = df["Company"].str.replace("Broadcom Inc.", "Broadcom", regex=False)
df["Company"] = df["Company"].str.replace("Advanced Micro Devices Inc.", "AMD", regex=False)
df["Company"] = df["Company"].str.replace("CDW Corporation", "CDW", regex=False)
df["Company"] = df["Company"].str.replace("DexCom", "Dexcom", regex=False)
df["Company"] = df["Company"].str.replace("Lululemon", "Lululemon Athletica", regex=False)
df["Company"] = df["Company"].str.replace("MercadoLibre", "Mercado Libre", regex=False)
df["Company"] = df["Company"].str.replace("Mondelēz International", "Mondelez International", regex=False)
df["Company"] = df["Company"].str.replace("NXP", "NXP Semiconductors", regex=False)
df["Company"] = df["Company"].str.replace("O'Reilly Automotive", "O'Reilly Auto Parts", regex=False)
df["Company"] = df["Company"].str.replace("PDD Holdings", "Pinduoduo", regex=False)
df["Company"] = df["Company"].str.replace("Regeneron", "Regeneron Pharmaceuticals", regex=False)
df["Company"] = df["Company"].str.replace("Verisk", "Verisk Analytics", regex=False)

print(df)

In [None]:
# get text from the wiki page of each Nasdaq100 company

texts = []
for c in df["Company"]:
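    # pass the title via params so that spaces and special characters are URL-encoded
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "prop": "extracts", "exlimit": 1, "titles": c,
                "explaintext": 1, "formatversion": 2, "format": "json"},
    )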
    text = resp.json()["query"]["pages"][0]["extract"].split("\n")
    print(text[:20])
    texts.append(text)
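The loop above assumes that every title resolves to a page with an `extract` field. The following is a minimal sketch of a more defensive parse, assuming the `formatversion=2` response shape requested above (a `pages` list in which an unresolved title carries a `missing` flag). `paragraphs_from_payload` is a hypothetical helper and not part of the notebook.

```python
def paragraphs_from_payload(payload):
    """Return the non-empty lines of the first page's plain-text extract.

    Hypothetical helper: returns an empty list when the title did not
    resolve to a page or the page carries no extract.
    """
    pages = payload.get("query", {}).get("pages", [])
    if not pages or pages[0].get("missing"):
        return []
    extract = pages[0].get("extract", "")
    return [line.strip() for line in extract.split("\n") if line.strip()]


# Hand-written payloads for illustration only (not real API responses)
found = {"query": {"pages": [{"title": "Example",
                              "extract": "First paragraph.\n\n== History ==\nSecond paragraph."}]}}
missing = {"query": {"pages": [{"title": "No such page", "missing": True}]}}

print(paragraphs_from_payload(found))
print(paragraphs_from_payload(missing))
```

A company whose cleaned name does not match a page title would then yield an empty list instead of raising a `KeyError` partway through the loop.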

In [None]:
# add text column and drop the columns we don't need
df["text"] = texts
df = df.drop(columns=['Ticker', 'GICS Sector', 'GICS Sub-Industry'])

In [None]:
# split list into separate entries in df text
df = df.explode('text')

In [113]:
# remove empty strings 
df['text']=df['text'].str.strip().replace('',np.nan)
df.dropna(inplace=True)

# remove headers (starting with ==)
df = df[~df["text"].str.startswith("==")]

# add df["Company"] as prefix to df["text"]
df["companytext"] = df["Company"] + ": " + df["text"]

# reset index
df = df.reset_index(drop=True)

In [114]:
df.head()

| | Company | text | companytext |
|---|---|---|---|
| 0 | Adobe Inc. | Adobe Inc. ( ə-DOH-bee), formerly Adobe System... | Adobe Inc.: Adobe Inc. ( ə-DOH-bee), formerly ... |
| 1 | Adobe Inc. | and headquartered in San Jose, California. It ... | Adobe Inc.: and headquartered in San Jose, Cal... |
| 2 | Adobe Inc. | As of 2022, Adobe has more than 26,000 employe... | Adobe Inc.: As of 2022, Adobe has more than 26... |
| 3 | Adobe Inc. | The company was started in John Warnock's gara... | Adobe Inc.: The company was started in John Wa... |
| 4 | Adobe Inc. | In the mid-1980s, Adobe entered the consumer s... | Adobe Inc.: In the mid-1980s, Adobe entered th... |


In [None]:
# save as csv file
df.to_csv("data/nasdaq.csv")

In [None]:
# read csv file
# df = pd.read_csv("data/nasdaq.csv", index_col=0)

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

### Generate Embeddings

In [137]:
import openai
from dotenv import load_dotenv
import os
import tiktoken

In [138]:
load_dotenv()

True

In [139]:
openai.api_key = os.environ['OPENAI_API_KEY']

In [140]:
client = openai.OpenAI()
EMBEDDING_MODEL_NAME = "text-embedding-3-small"
embedding_encoding = "cl100k_base"
max_tokens = 8000

In [142]:
# gets embedding
def get_embedding(text, model=EMBEDDING_MODEL_NAME):
    text = text.replace("\n", " ")
    return client.embeddings.create(input = [text], model=model).data[0].embedding

In [121]:
encoding = tiktoken.get_encoding(embedding_encoding)

# omit rows whose text is too long to embed
df["n_tokens"] = df.text.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens]
len(df)

3836

In [122]:
df.head()

| | Company | text | companytext | n_tokens |
|---|---|---|---|---|
| 0 | Adobe Inc. | Adobe Inc. ( ə-DOH-bee), formerly Adobe System... | Adobe Inc.: Adobe Inc. ( ə-DOH-bee), formerly ... | 27 |
| 1 | Adobe Inc. | and headquartered in San Jose, California. It ... | Adobe Inc.: and headquartered in San Jose, Cal... | 273 |
| 2 | Adobe Inc. | As of 2022, Adobe has more than 26,000 employe... | Adobe Inc.: As of 2022, Adobe has more than 26... | 63 |
| 3 | Adobe Inc. | The company was started in John Warnock's gara... | Adobe Inc.: The company was started in John Wa... | 461 |
| 4 | Adobe Inc. | In the mid-1980s, Adobe entered the consumer s... | Adobe Inc.: In the mid-1980s, Adobe entered th... | 52 |


In [131]:
# get embeddings, this can take a few minutes
df["embedding"] = df.companytext.apply(lambda x: get_embedding(x, model=EMBEDDING_MODEL_NAME))
df.to_csv("data/nasdaq_with_embeddings.csv")
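Embedding one row per request is the main reason this step is slow. Since the embeddings endpoint accepts a list of inputs, the rows could be sent in batches instead. The following is a sketch under assumptions: `batched` and `embed_in_batches` are hypothetical helpers, the batch size of 100 is an arbitrary choice, and `client` is expected to be the `openai.OpenAI()` instance created above.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, client, model, batch_size=100):
    """Embed `texts` with one API request per batch instead of one per row.

    Assumes `client` is an `openai.OpenAI()` instance.
    """
    embeddings = []
    for batch in batched(texts, batch_size):
        cleaned = [t.replace("\n", " ") for t in batch]
        response = client.embeddings.create(input=cleaned, model=model)
        # Sort by `index` so the result order matches the input order
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings


# The batching itself can be checked without calling the API
print(list(batched(["a", "b", "c", "d", "e"], 2)))
```

In the notebook this might be used as `df["embedding"] = embed_in_batches(df.companytext.tolist(), client, EMBEDDING_MODEL_NAME)`.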

In [136]:
# df = pd.read_csv("data/nasdaq_with_embeddings.csv", index_col=0)
# df["embedding"] = df["embedding"].apply(eval).apply(np.array)
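The commented-out reload above parses the stored embedding strings with `eval`, which executes arbitrary expressions. The standard library's `ast.literal_eval` accepts only Python literals, so it is a safer way to parse the same strings. A minimal sketch, where `parse_embedding` is a hypothetical helper:

```python
import ast


def parse_embedding(cell):
    """Parse the string form of a list of floats as written by `to_csv`."""
    values = ast.literal_eval(cell)
    return [float(v) for v in values]


print(parse_embedding("[-0.06, 0.017, 0.5]"))
```

The reload line could then read `df["embedding"].apply(parse_embedding).apply(np.array)`.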


### Create a Function that Finds Related Pieces of Text for a Given Question

In [159]:
from scipy.spatial.distance import cosine

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, model=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = df_copy["embedding"].apply(lambda x: cosine(question_embeddings, x))
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In [176]:
get_rows_sorted_by_relevance("Did Starbucks workers go on strike in 2023?", df)

| | Company | text | companytext | n_tokens | embedding | distances |
|---|---|---|---|---|---|---|
| 3418 | Starbucks | In 2022, over a period of a few months, Starbu... | Starbucks: In 2022, over a period of a few mon... | 128 | [-0.060130178928375244, 0.017282314598560333, ... | 0.330274 |
| 3379 | Starbucks | As of July 2023: | Starbucks: As of July 2023: | 7 | [-0.037088118493556976, -0.02450413443148136, ... | 0.342081 |
| 3331 | Starbucks | Inspired by their colleagues in Buffalo, worke... | Starbucks: Inspired by their colleagues in Buf... | 608 | [-0.07227139174938202, 0.004459485411643982, 0... | 0.347715 |
| 3419 | Starbucks | In June 2023, Starbucks attracted controversy ... | Starbucks: In June 2023, Starbucks attracted c... | 83 | [-0.007473176810890436, 0.007049917243421078, ... | 0.366105 |
| 3420 | Starbucks | In late 2023, Starbucks faced boycotts followi... | Starbucks: In late 2023, Starbucks faced boyco... | 97 | [-0.0343046598136425, -0.03726150840520859, 0.... | 0.369605 |
| ... | ... | ... | ... | ... | ... | ... |
| 576 | Analog Devices | Radio frequency integrated circuits (RFICs) ad... | Analog Devices: Radio frequency integrated cir... | 150 | [0.013311444781720638, -0.013323535211384296, ... | 1.023664 |
| 868 | Autodesk | Autodesk Life Sciences is an extensible toolki... | Autodesk: Autodesk Life Sciences is an extensi... | 61 | [0.021789707243442535, 0.04394540935754776, 0.... | 1.024080 |
| 585 | Analog Devices | Each reference circuit is documented with test... | Analog Devices: Each reference circuit is docu... | 53 | [0.0030809645541012287, -0.017646614462137222,... | 1.042249 |
| 1480 | Copart | Among the vehicles sold by insurance companies... | Copart: Among the vehicles sold by insurance c... | 30 | [0.035497698932886124, 0.015237448737025261, -... | 1.045679 |
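The `distances` column holds cosine distances, i.e. one minus the cosine similarity of the two vectors, so values lie between 0 (same direction) and 2 (opposite direction). This is why unrelated rows in the output above land slightly above 1. The following pure-Python sketch shows the quantity being computed. It is for illustration only; the notebook itself uses `scipy.spatial.distance.cosine`.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```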


### Create a Function that Composes a Text Prompt

In [167]:
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(encoding.encode(prompt_template)) + \
                            len(encoding.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(encoding.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
    

In [169]:
print(create_prompt("What happened at Tesla in 2023?", df, 1800))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

From July 2019 to June 2020, Tesla reported four consecutive profitable quarters for the first time, which made it eligible for inclusion in the S&P 500. Tesla was added to the index on December 21, 2020. It was the most valuable company ever added, and was the sixth-largest member of the index immediately after it was added. During 2020, the share price increased 740%, and on January 26, 2021, its market capitalization reached $848 billion, more than the next nine largest automakers combined and becoming the fifth most valuable company in the US.Tesla introduced its second mass-market vehicle in March 2019, the Model Y mid-size crossover SUV, based on the Model 3. Deliveries started in March 2020.During this period, Tesla invested heavily in expanding its production capacity, opening three new Gigafactories in quick succession. Construction of Gig
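The context selection in `create_prompt` is a greedy fill: rows are taken in relevance order until the token budget is exhausted. The sketch below isolates that idea. It is a variation, not the notebook's exact logic: it also charges the separator between rows against the budget, which the function above does not. `pack_context` is a hypothetical helper, and its default `count_tokens` (a whitespace word count) is a stand-in so the example runs without `tiktoken`.

```python
def pack_context(texts, budget, count_tokens=lambda s: len(s.split()),
                 separator="\n\n###\n\n"):
    """Greedily keep texts, in order, while the running token count fits.

    `count_tokens` is a stand-in (whitespace word count); in the notebook
    it would be `lambda s: len(encoding.encode(s))`.
    """
    kept, used = [], 0
    for text in texts:
        cost = count_tokens(text)
        if kept:
            cost += count_tokens(separator)  # separators consume budget too
        if used + cost > budget:
            break
        kept.append(text)
        used += cost
    return kept, used


rows = ["one two three", "four five", "six seven eight nine"]
print(pack_context(rows, budget=7))
```

With a budget of 7, the first two rows fit (3 tokens, then 2 plus 1 for the separator) and the third is dropped.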

### Create a Function that Answers a Question

In [172]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
   
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = client.completions.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""

In [173]:
custom_tesla_answer = answer_question("What happened at Tesla in 2023?", df)
print(custom_tesla_answer)

As of 2023, Tesla is the world's most valuable automaker and ranked 69th in the Forbes Global 2000.


In [175]:
custom_starbucks_answer = answer_question("Did Starbucks workers go on strike in 2023?", df)
print(custom_starbucks_answer)

Yes. In June 2023, workers at unionized stores went on strike over the company's stance on in-store LGBT pride decorations.


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [177]:
client = openai.OpenAI()
tesla_prompt = """
Question: "What happened at Tesla in 2023?"
Answer:
"""
initial_tesla_answer = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=tesla_prompt,
    max_tokens=150
).choices[0].text
print(initial_tesla_answer)

I am an AI and I do not have access to specific future events. Therefore, I cannot accurately answer this question.


In [178]:
print(f"""
What happened at Tesla in 2023?

Original Answer: {initial_tesla_answer}
Custom Answer:   {custom_tesla_answer}

""")


What happened at Tesla in 2023?

Original Answer: I am an AI and I do not have access to specific future events. Therefore, I cannot accurately answer this question.
Custom Answer:   As of 2023, Tesla is the world's most valuable automaker and ranked 69th in the Forbes Global 2000.




### Question 2

In [179]:
client = openai.OpenAI()
starbucks_prompt = """
Question: "Did Starbucks workers go on strike in 2023?"
"""
initial_starbucks_answer = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=starbucks_prompt,
    max_tokens=150
).choices[0].text
print(initial_starbucks_answer)


No, there is no evidence that Starbucks workers went on strike in 2023. 


In [180]:
print(f"""
Did Starbucks workers go on strike in 2023?

Original Answer: {initial_starbucks_answer}
Custom Answer:   {custom_starbucks_answer}

""")


Did Starbucks workers go on strike in 2023?

Original Answer: 
No, there is no evidence that Starbucks workers went on strike in 2023. 
Custom Answer:   Yes. In June 2023, workers at unionized stores went on strike over the company's stance on in-store LGBT pride decorations.
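The comparison cells repeat the question as a string literal, so the printed question can drift from the one actually sent to the model. One way to avoid that is to build the report from a single question variable. A minimal sketch, where `format_comparison` is a hypothetical helper and not part of the notebook:

```python
def format_comparison(question, original_answer, custom_answer):
    """Build the side-by-side report from a single question string."""
    return (
        f"{question}\n\n"
        f"Original Answer: {original_answer.strip()}\n"
        f"Custom Answer:   {custom_answer.strip()}\n"
    )


# Placeholder strings for illustration; in the notebook these would be the
# question variable and the two model answers
report = format_comparison("Example question?", "\nbaseline answer ", "custom answer")
print(report)
```

Stripping the answers also removes the leading newline that the baseline completion tends to start with.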


