# Is the language in IPO prospectuses informative?

**Machine Learning/Dynamic Class (ECON 433/434)**


**Project Proposal**


**Taisei Noda**

## Overview


In this project, I will explore how informative the language in Initial Public Offering (IPO) S-1 filings is for predicting IPO performance. An S-1 filing is a registration statement submitted to the U.S. Securities and Exchange Commission (SEC) by a company seeking to go public. These filings are a crucial source of information for market participants, as newly public companies typically lack an established track record in public markets.

S-1 filings contain both quantitative information—such as financial statements and offering terms (e.g., offering price and proceeds)—and qualitative disclosures, including textual descriptions of the company’s business model, market outlook, and strategic risks, as written from management’s perspective. This project investigates whether these textual components provide predictive information about IPO outcomes, beyond what is already captured by the standard financial metrics.

More specifically, this project will address the following two questions:

* Can the language in S-1 filings predict IPO initial returns, such as first-day or first-week performance?

* Can text features from S-1s forecast whether a firm will be delisted or underperform over the next 3 to 5 years?

The second question explores the long-term informativeness of S-1 disclosures, motivated by the observation that some IPO firms—particularly during hot market periods like the dot-com bubble—entered public markets with vague or overly optimistic business plans.
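For concreteness, the initial return targeted by the first question is usually measured relative to the offer price: $R = (P_{\text{close}} - P_{\text{offer}}) / P_{\text{offer}}$, where $P_{\text{close}}$ is the closing price on the first trading day (or at the end of the first week). A minimal sketch of that calculation follows; the function name and the prices are hypothetical and only illustrate the arithmetic.

```python
def initial_return(offer_price: float, close_price: float) -> float:
    """Simple return from the offer price to a later closing price."""
    if offer_price <= 0:
        raise ValueError("offer_price must be positive")
    return (close_price - offer_price) / offer_price


# Hypothetical numbers, for illustration only
first_day = initial_return(offer_price=20.0, close_price=25.0)
print(f"{first_day:.2%}")  # prints 25.00%
```

The same function covers first-week or first-month returns if `close_price` is taken at the corresponding horizon.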

## Target Figures

The main findings will be illustrated with the following figures:

* Model Performance Summary: a scatter plot of actual vs. predicted IPO first-day (or first-week/month) returns.
  * X-axis: Actual first-day return
  * Y-axis: Predicted first-day return

  If the delisting classification model provides meaningful insights, a ROC curve would also be useful to show how well it separates delisted firms from surviving firms relative to random guessing.

* Bar Chart of Top Predictive Words: the phrases or keywords that the model associates with high or low IPO outcomes (e.g., initial returns or delisting risk).
  * X-axis: Importance score (e.g., SHAP value, attention weight)
  * Y-axis: Words (e.g., “uncertain,” “growth,” “profitable”)
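
To make the ROC comparison concrete, here is a minimal sketch of how the curve and its area (AUC) could be computed from predicted delisting scores. It uses only the standard library; the labels and scores are made up for illustration, and in practice a library routine would likely be used instead. A random classifier has an AUC of about 0.5, so values well above that indicate the model separates the two groups.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs obtained by lowering the threshold over scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    # Rank observations from highest to lowest score
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Observations with tied scores cross the threshold together
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical data: 1 = delisted, 0 = survived
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(round(auc(roc_points(labels, scores)), 3))
```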

As for the model, I plan to start with FinBERT, a BERT model pre-trained on financial text, which seems well-suited for this task. However, I may also experiment with a simpler architecture such as a plain LSTM-based model.
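
One practical point with BERT-style models is that they accept at most 512 tokens per input, while an S-1 runs to many thousands of words. A common workaround is to split each filing into overlapping windows, score each window, and aggregate (for example, by averaging). The sketch below shows only the windowing step; `chunk_tokens`, `max_len`, and `stride` are hypothetical names, and a plain Python list stands in for the output of the model's real tokenizer.

```python
def chunk_tokens(tokens, max_len=510, stride=128):
    """Split a token list into overlapping windows of at most max_len tokens.

    max_len=510 leaves room for the two special tokens ([CLS], [SEP]) that
    BERT-style models add; stride is the overlap between consecutive windows.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Stand-in for a tokenized filing: 1,000 integer "tokens"
tokens = list(range(1000))
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])
```

The overlap keeps sentences that straddle a window boundary visible to at least one window in full.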


## Data Collection

* IPO Data
  * [CRSP from wrds](https://wrds-www.wharton.upenn.edu/pages/about/data-vendors/center-for-research-in-security-prices-crsp/)
  * [Jay Ritter's IPO Data](https://site.warrington.ufl.edu/ritter/ipo-data/)
* S-1 Text
  * [EDGAR](https://www.sec.gov/edgar/search/)

I am familiar with the IPO data mentioned above, as I’ve worked with it in my own research project.


### S-1 Text Collection

I have not yet collected S-1 filings for all IPO firms, but I have written code to download filings from EDGAR, extract their text, and run some basic word-level analysis, as shown below. The approach appears to scale to the full sample.

In [2]:
import os
import numpy as np
import re
import pandas as pd
from sec_edgar_downloader import Downloader
from bs4 import BeautifulSoup
from collections import Counter
from nltk.corpus import stopwords
import nltk

def setup(base_path=None, update=True):
    """Return (data_path, figure_path, overleaf_path) under the Dropbox folder."""
    if base_path is None:
        if not update:
            # Fall back to machine-specific default locations
            if os.name == 'posix':
                # MacOS (or Linux)
                base_path = '/Users/taisei/Dropbox/'
            elif os.name == 'nt':
                # Windows
                base_path = 'C:/Users/Taise/Dropbox/'
        else:
            base_path = input("Please enter your dropbox path: (e.g. C:/Users/Taise/Dropbox/)")
    data_path = os.path.join(base_path, "IPOMatch/data/")
    figure_path = os.path.join(base_path, "Apps/Overleaf/IPOMatch/tables_figures/")
    overleaf_path = os.path.join(base_path, "Apps/Overleaf/IPOMatch/")
    print(f"Data path: {data_path}")
    print(f"Figure/Table path: {figure_path}")
    print(f"Overleaf path: {overleaf_path}")
    return data_path,figure_path,overleaf_path

def get_full_submission_path(data_path, cik, form_type='S-1'):
    # Zero-pad the CIK
    cik_padded = f"{int(cik):010d}"
    filing_path = os.path.join(data_path, 'sec-edgar-filings', cik_padded, form_type)

    # List subdirectories (accession number folders)
    if not os.path.exists(filing_path):
        raise FileNotFoundError(f"No such directory: {filing_path}")
    
    subdirs = [d for d in os.listdir(filing_path) if os.path.isdir(os.path.join(filing_path, d))]
    if not subdirs:
        raise FileNotFoundError(f"No subfolders found in: {filing_path}")
    
    # Folder names are accession numbers (filer ID, two-digit year, sequence), so a
    # name sort is not guaranteed to be chronological when the filer IDs differ
    subdirs.sort()
    target_folder = subdirs[0]  # first folder in name order; use [-1] for the last
    
    full_path = os.path.join(filing_path, target_folder, 'full-submission.txt')
    if not os.path.exists(full_path):
        raise FileNotFoundError(f"'full-submission.txt' not found in: {full_path}")
    
    return full_path

def extract_clean_text_from_html(html):
    soup = BeautifulSoup(html, "html.parser")

    # Remove script/style elements completely
    for script_or_style in soup(["script", "style"]):
        script_or_style.decompose()

    text = soup.get_text(separator=' ')
    text = re.sub(r'\s+', ' ', text)  # collapse whitespace
    return text.strip()

def basic_token_cleanup(text, stopwords=None, top_n=50):
    # Lowercase, remove punctuation/numbers
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # keep only letters and spaces

    # Tokenize
    words = text.split()

    # Remove stopwords
    if stopwords:
        words = [w for w in words if w not in stopwords]

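    # Optional: remove short tokens and leftover HTML/CSS fragments (like 'font', 'textindent').
    # Note: re.search matches substrings, so this also drops words that merely
    # contain a pattern, e.g. 'dividend' and 'individual' (both contain 'div')
    words = [w for w in words if len(w) > 2 and not re.search(r'(font|style|display|align|div|inline|textindent|marginright|block|marginleft|width|shall|bold|roman|textdecoration)', w)]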

    return Counter(words).most_common(top_n)
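
One refinement worth considering: `full-submission.txt` bundles the main filing together with its exhibits (contracts, bylaws, and so on), each wrapped in a `<DOCUMENT>` block that carries a `<TYPE>` line. If only the prospectus text is wanted, the main document can be isolated before cleaning. The following is a minimal sketch under that assumption; `extract_primary_document` is a hypothetical helper, not part of the pipeline above, and the sample string is a toy stand-in for a real submission.

```python
import re

DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL | re.IGNORECASE)
TYPE_RE = re.compile(r"<TYPE>([^\s<]+)", re.IGNORECASE)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL | re.IGNORECASE)


def extract_primary_document(submission, form_type="S-1"):
    """Return the <TEXT> body of the first document whose <TYPE> equals form_type.

    Returns None when no document of that type is present.
    """
    for match in DOCUMENT_RE.finditer(submission):
        block = match.group(1)
        type_match = TYPE_RE.search(block)
        if type_match is None or type_match.group(1).upper() != form_type.upper():
            continue  # skip exhibits and other document types
        text_match = TEXT_RE.search(block)
        return text_match.group(1).strip() if text_match else block.strip()
    return None


# Toy submission with one main document and one exhibit
sample = """<SEC-DOCUMENT>
<DOCUMENT>
<TYPE>S-1
<SEQUENCE>1
<TEXT>
<html><body><p>Prospectus summary</p></body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-10.1
<SEQUENCE>2
<TEXT>
<html><body><p>License agreement</p></body></html>
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>"""

main_html = extract_primary_document(sample)
print(main_html)
```

The returned HTML could then be passed to `extract_clean_text_from_html` in place of the full submission.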




In [3]:
data_path,figure_path,overleaf_path = setup(update=False)

Data path: C:/Users/Taise/Dropbox/IPOMatch/data/
Figure/Table path: C:/Users/Taise/Dropbox/Apps/Overleaf/IPOMatch/tables_figures/
Overleaf path: C:/Users/Taise/Dropbox/Apps/Overleaf/IPOMatch/


In [6]:
%run ipo_functions.py
constructor = DataConstructor(data_path)
sdc_us_common = constructor.import_sdc()
all_matches,firm_lead_matches,uw_code_list = constructor.get_firm_uw_match(sdc_us_common)
compustat_data = constructor.get_compustat_data(firm_lead_matches)
text_data = constructor.get_10K_text_data(firm_lead_matches)
crsp_data = constructor.get_crsp_data(firm_lead_matches)
crsp_performance = constructor.compute_performance(crsp_data)
ipo_firms1 = constructor.get_ipo_data()
firms = constructor.merge_firm_data(compustat_data,text_data,crsp_performance,ipo_firms1)
#uw_rank = constructor.construct_uw_rank(all_matches,uw_code_list)
#uw_share = constructor.compute_uw_share(all_matches)
#uw_industry_focus = constructor.describe_industry_focus_stats(all_matches)
#uw_performance = constructor.underwriter_performance(crsp_performance,all_matches)
#uw_data = constructor.merge_uw_data(uw_rank,uw_share,uw_performance,firms,firm_lead_matches)


  sdc = pd.concat(dataframes, ignore_index=True)


Concatenated DataFrame is ready for analysis.
issue_date_table.pkl and sdc_us_common.pkl are saved at C:/Users/Taise/Dropbox/IPOMatch/data/
Loading library list...
Done


In [7]:
#firms = pd.read_pickle(os.path.join(data_path, "firms.pkl"))
ipo_firms = firms.dropna(subset=['IPOname','cik']).reset_index(drop=True)
ipo_firms['cik'] = ipo_firms['cik'].astype(int)
ipo_firms = ipo_firms[ipo_firms['ipo_issue_date'].dt.year >= 2001].reset_index(drop=True)
ciks = ipo_firms.cik.dropna().unique().tolist()

In [None]:
sampled_ciks = [1677077,1167178,864559,1169561,1401680]

In [7]:
from sec_edgar_downloader import Downloader
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

# Initialize downloader
company_name = "Rice University"
email_address = "tn27@rice.edu"
# Your sampled CIKs (as strings, padded to 10 digits)
sampled_ciks_padded = [f"{cik:010d}" for cik in sampled_ciks]
dl = Downloader(company_name=company_name, email_address=email_address,download_folder=data_path)
# Loop and fetch S-1 filings
for cik in sampled_ciks_padded:
    dl.get("S-1", cik)  # No limit set, so all available S-1 filings are fetched; pass limit=N to cap


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/taisei/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [8]:
stop_words = set(stopwords.words("english"))

cik = sampled_ciks_padded[0]  # Or any padded CIK like '0001401680'
file_path = get_full_submission_path(data_path, cik)
with open(file_path, 'r', encoding='utf-8') as f:
    content = f.read()
clean_text = extract_clean_text_from_html(content)
top_words = basic_token_cleanup(clean_text, stopwords=stop_words, top_n=100)
print(top_words)

[('stock', 913), ('may', 754), ('shares', 708), ('company', 612), ('common', 606), ('securities', 351), ('agreement', 330), ('date', 317), ('financial', 310), ('price', 301), ('board', 290), ('offering', 289), ('business', 267), ('corporation', 267), ('per', 266), ('upon', 255), ('clinical', 254), ('share', 253), ('directors', 248), ('product', 245), ('purchase', 240), ('note', 236), ('number', 208), ('additional', 207), ('time', 200), ('april', 200), ('officer', 189), ('could', 189), ('development', 189), ('fda', 189), ('act', 187), ('related', 183), ('stockholders', 182), ('approval', 180), ('including', 179), ('meeting', 178), ('required', 177), ('market', 175), ('future', 169), ('alzheimers', 169), ('executive', 168), ('value', 168), ('license', 166), ('section', 165), ('subject', 161), ('operations', 161), ('regulatory', 160), ('months', 160), ('interest', 158), ('exercise', 157), ('years', 154), ('pursuant', 153), ('compensation', 152), ('director', 151), ('issued', 150), ('notic