# phrase-ticker Dataset Creation

This notebook builds a dataset that pairs S&P 500 ticker symbols with natural language phrases describing each company. Company names and tickers are scraped from Wikipedia, and GPT-4 generates the phrases. The result is loaded with the Hugging Face `datasets` library and is intended for financial NLP tasks, in particular training or evaluating models that extract stock tickers from free text.

In [60]:
# Import libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Load environment variables (such as OPENAI_API_KEY) from a .env file
import dotenv
dotenv.load_dotenv()

True

## Extracting S&P 500 Company Information

In this section, we retrieve a list of S&P 500 companies and their stock ticker symbols directly from Wikipedia. Using Python's `requests` library, we access the webpage containing the S&P 500 index composition. With `BeautifulSoup`, we parse the HTML content to locate and extract the data from the relevant table. Each row of the table provides the ticker symbol and company name, which we collect into a structured list for further processing. 


In [61]:
# Get the S&P 500 companies from Wikipedia
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

# Get the page and stop early if the request failed
response = requests.get(url)
response.raise_for_status()

# Parse the page
soup = BeautifulSoup(response.text, 'html.parser')

# Get the table
table = soup.find('table', {'id': 'constituents'})

# Get the rows
rows = table.find_all('tr')[1:]  # Skip the header row

# Get the company name and ticker
data = []
for row in rows:
    cols = row.find_all('td')
    symbol = cols[0].text.strip()
    name = cols[1].text.strip() 
    data.append({'ticker': symbol, 'name': name})

data[:10], len(data)

([{'ticker': 'MMM', 'name': '3M'},
  {'ticker': 'AOS', 'name': 'A. O. Smith'},
  {'ticker': 'ABT', 'name': 'Abbott'},
  {'ticker': 'ABBV', 'name': 'AbbVie'},
  {'ticker': 'ACN', 'name': 'Accenture'},
  {'ticker': 'ADBE', 'name': 'Adobe Inc.'},
  {'ticker': 'AMD', 'name': 'Advanced Micro Devices'},
  {'ticker': 'AES', 'name': 'AES Corporation'},
  {'ticker': 'AFL', 'name': 'Aflac'},
  {'ticker': 'A', 'name': 'Agilent Technologies'}],
 503)
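
Because the scrape depends on Wikipedia's current table layout, a quick sanity check on the collected rows can catch a changed page before any API calls are made. The sketch below is an illustration only and not part of the original notebook: `check_rows` is a hypothetical helper, and `sample_rows` stands in for `data` with two deliberately bad entries.

```python
from collections import Counter

def check_rows(rows):
    """Return a list of human-readable problems found in scraped rows."""
    problems = []
    # Flag rows where either field came back empty
    for row in rows:
        if not row["ticker"] or not row["name"]:
            problems.append(f"empty field in row: {row}")
    # Flag tickers that appear more than once
    ticker_counts = Counter(row["ticker"] for row in rows)
    for ticker, count in ticker_counts.items():
        if ticker and count > 1:
            problems.append(f"duplicate ticker: {ticker}")
    return problems

# Hypothetical sample standing in for the scraped `data` list
sample_rows = [
    {"ticker": "MMM", "name": "3M"},
    {"ticker": "AOS", "name": "A. O. Smith"},
    {"ticker": "MMM", "name": "3M"},     # deliberate duplicate
    {"ticker": "", "name": "Unknown"},   # deliberate empty ticker
]
problems = check_rows(sample_rows)
print(problems)
```

A row count slightly above 500, such as the 503 seen here, is normal: the index lists more than one share class for a few companies.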

In [62]:
# create dataframe
df = pd.DataFrame(data)

df.head()

| | ticker | name |
|---|---|---|
| 0 | MMM | 3M |
| 1 | AOS | A. O. Smith |
| 2 | ABT | Abbott |
| 3 | ABBV | AbbVie |
| 4 | ACN | Accenture |


## Phrase Generation with GPT-4

This section uses the GPT-4 model to generate 50 unique phrases for each S&P 500 company, enhancing our dataset for ticker extraction tasks. The process, facilitated by LangChain and OpenAI APIs, dynamically creates phrases reflecting each company's attributes, making the data more versatile for NLP applications.

In [96]:
import json
import os

import dotenv
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Load OPENAI_API_KEY from a .env file
dotenv.load_dotenv()

# Parse the model's reply as JSON
output_parser = JsonOutputParser()


# Note: the template below refers to a "description", but only name and ticker are passed in
prompt = PromptTemplate(
    template="""
   
    You are now Phrase Generator, your job is to take in a company name, and ticker and generate natural language phrases
    that would be useful for people looking for that company

    the company name is: {name}
    the company ticker is: {ticker}


    For example if the company name is Apple, the description is iPhone maker, and the ticker is AAPL,
    you might generate phrases like "Big tech company" or "steve jobs company" etc

    To generate phrases for the company use your training data and the supplied description.

    Generate 50 casual natural language phrases.
     {format_instructions}

    """,
    input_variables=["name", "ticker"],
    partial_variables={"format_instructions": output_parser.get_format_instructions()},
)

model = ChatOpenAI(model="gpt-4-0125-preview", api_key=os.getenv("OPENAI_API_KEY"))

chain = prompt | model | output_parser

phrases = chain.invoke({"name": df["name"][0], "ticker": df["ticker"][0]})
phrases, type(phrases)

({'companyName': '3M',
  'ticker': 'MMM',
  'phrases': ['Innovative materials giant',
   'Post-it Notes creator',
   'Scotch tape manufacturer',
   'Multinational conglomerate powerhouse',
   'Diversified technology company',
   'Leader in consumer goods',
   'Healthcare product innovator',
   'Safety and industrial products supplier',
   'Famous for its adhesive solutions',
   'Global science company',
   'Makers of N95 respirators',
   'Renowned for its research and development',
   'Enterprise supplying office supplies',
   'Pioneer in home improvement products',
   'Automotive industry supplier',
   'Electronics and energy products manufacturer',
   'Provider of filtration solutions',
   'Expert in manufacturing abrasives',
   'Key player in the construction market',
   'Inventor of reflective road signs',
   'Specialist in dental and orthodontic products',
   'Industry leader in adhesives and tapes',
   'Supplier of personal protective equipment',
   'Innovation-driven conglomerat
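
The prompt does not fix a response schema, so the key names in the parsed reply (`companyName`, `ticker`, `phrases` above) are chosen by the model and may vary between calls. A minimal sketch of a defensive cleanup step follows, assuming the reply is either a dict with a `phrases` key or a bare list. `extract_phrases` and its `limit` parameter are hypothetical and not part of the original notebook.

```python
def extract_phrases(response, limit=None):
    """Pull a clean, de-duplicated list of phrases out of a parsed model reply."""
    if isinstance(response, dict):
        raw = response.get("phrases", [])
    elif isinstance(response, list):
        raw = response
    else:
        raw = []
    if not isinstance(raw, list):
        raw = []

    seen = set()
    phrases = []
    for item in raw:
        if not isinstance(item, str):
            continue  # skip anything that is not text
        text = item.strip()
        key = text.lower()  # compare case-insensitively
        if text and key not in seen:
            seen.add(key)
            phrases.append(text)
    # Optionally cap the list, e.g. when the model returns more than requested
    return phrases if limit is None else phrases[:limit]

# Made-up reply in the same shape as the output above
example = {
    "companyName": "3M",
    "ticker": "MMM",
    "phrases": ["Post-it Notes creator", " post-it notes creator ", "", "Scotch tape manufacturer"],
}
cleaned = extract_phrases(example)
print(cleaned)
```

An empty result from a helper like this signals a reply that did not match the assumed shape, which is a case worth logging separately from API errors.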


## Concurrent Phrase Generation

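This section speeds up phrase generation by running the GPT-4 calls in parallel with `concurrent.futures`. A thread pool sends up to 30 requests at a time, one per S&P 500 company, and the generated phrases are written back to the DataFrame once all calls have finished. A failed call is logged and stored as an empty list so that it can be retried later.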

In [None]:

import concurrent.futures
import json

def generate_phrases(index, name, ticker):
    """Generate phrases for one company; return (index, JSON-encoded list)."""
    try:
        response = chain.invoke({"name": name, "ticker": ticker})
        phrases = response.get("phrases", [])
        print(f"Successfully processed {ticker} - {name}")
        return index, json.dumps(phrases)
    except Exception as e:
        # On failure, store an empty list so the row can be retried later
        print(f"Error processing {ticker} - {name}: {e}")
        return index, json.dumps([])

def update_phrases(df, results):
    """Write the collected phrases into the DataFrame's "phrases" column."""
    for index, phrases in results.items():
        df.at[index, "phrases"] = phrases
    print("Update complete.")

def main(df):
    """Submit one request per company to a thread pool and collect the results."""
    num_workers = 30  # maximum number of concurrent API calls
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        future_to_row = {executor.submit(generate_phrases, i, row['name'], row['ticker']): i for i, row in df.iterrows()}
        for future in concurrent.futures.as_completed(future_to_row):
            index, phrases = future.result()
            results[index] = phrases
            print(f"Row {index} processed.")

    update_phrases(df, results)

# Call main function to process the DataFrame
main(df)

# Check the updated DataFrame
df.head()


## Filling Missing Phrases

This step finds companies whose phrase list came back empty, regenerates their phrases with the GPT-4 model, and saves the completed dataset to a new CSV.

In [32]:
# Find any companies whose phrase list is empty
df[df['phrases'] == '[]']

# In this run, two companies came back with empty phrase lists:
#      ticker  name             phrases
# 284  LRCX    Lam Research     []
# 488  WDC     Western Digital  []

# Regenerate phrases for the companies whose list is empty
for index, row in df.iterrows():
    if row['phrases'] == '[]':
        print(f"Processing {row['ticker']} - {row['name']}")
        response = chain.invoke({"name": row['name'], "ticker": row['ticker']})
        phrases = response.get("phrases", [])
        df.at[index, "phrases"] = json.dumps(phrases)

# Save the DataFrame to a new CSV
df.to_csv('../data/raw.csv', index=False)

## Pivoting Dataset for phrase-ticker Mapping

This section transforms the original dataset into a new format where each row represents a unique combination of a phrase and its corresponding ticker symbol. By iterating over each company's phrases, we create a more granular dataset suitable for NLP tasks focused on ticker extraction. The final structured dataset is saved to CSV, facilitating easy access and further use.

In [33]:
import json
import pandas as pd


# Initialize a list to store the new rows (phrase, ticker)
new_data = []

for index, row in df.iterrows():
    # Decode the JSON string (written with json.dumps) back into a list
    phrases_list = json.loads(row['phrases'])
    
    # For each phrase, append a new row to new_data
    for phrase in phrases_list:
        new_data.append({'phrase': phrase, 'ticker': row['ticker']})

# Create a new DataFrame from the new_data list
new_df = pd.DataFrame(new_data, columns=['phrase', 'ticker'])
new_df.head()

# Optionally save to CSV
new_df.to_csv('../data/data.csv', index=False)
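
Each phrase list is stored as a JSON string, which lets it survive a trip through CSV as a single quoted cell. The sketch below illustrates that round trip and the flattening step using only the standard library. The rows are made-up sample data, and the in-memory buffer is a stand-in for the CSV files the notebook writes.

```python
import csv
import io
import json

# Made-up sample rows in the same shape as the raw dataset
rows = [
    {"ticker": "MMM", "name": "3M",
     "phrases": json.dumps(["Post-it Notes creator", "Scotch tape manufacturer"])},
    {"ticker": "AOS", "name": "A. O. Smith",
     "phrases": json.dumps(["Water heater maker"])},
]

# Write the rows to an in-memory CSV (stand-in for a file on disk)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["ticker", "name", "phrases"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back and flatten to one (phrase, ticker) pair per row
buffer.seek(0)
pairs = []
for row in csv.DictReader(buffer):
    for phrase in json.loads(row["phrases"]):
        pairs.append({"phrase": phrase, "ticker": row["ticker"]})
print(pairs)
```

Because the lists were written with `json.dumps`, `json.loads` is the matching decoder when the column is read back.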


In [35]:
from datasets import Dataset

# Create a Hugging Face dataset from the new DataFrame
dataset = Dataset.from_pandas(new_df)

# Print a summary of the dataset (features and row count)
print(dataset)

# 503 companies x 50 phrases would give 25,150 rows. The extra rows are probably
# because GPT-4 occasionally returned 51 phrases instead of 50.

Dataset({
    features: ['phrase', 'ticker'],
    num_rows: 25181
})
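
With 503 companies and 50 requested phrases each, the expected total is 25,150 rows, so the 25,181 observed leaves a surplus of 31. One way to see where the extra rows come from is to count phrases per ticker and list the tickers that deviate from 50. The sketch below assumes a flat sequence of ticker values, one per dataset row. `phrase_count_outliers` is a hypothetical helper, and the sample uses an expected count of 2 to stay small.

```python
from collections import Counter

def phrase_count_outliers(tickers, expected=50):
    """Return {ticker: count} for tickers whose phrase count differs from expected."""
    counts = Counter(tickers)
    return {ticker: n for ticker, n in counts.items() if n != expected}

# Made-up ticker column: one entry per (phrase, ticker) row
sample_tickers = ["MMM"] * 2 + ["AOS"] * 3 + ["ABT"] * 2
outliers = phrase_count_outliers(sample_tickers, expected=2)
print(outliers)
```

On the real data, the same function could be applied to the `ticker` column of the flattened DataFrame, for example `phrase_count_outliers(new_df["ticker"])`.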
