### Options Data from WRDS
Link: https://wrds-www.wharton.upenn.edu/pages/about/data-vendors/optionmetrics/


| Column	| Meaning |
| --------- | :------ |
| secid	| Internal security identifier for the underlying asset (unique across time).| 
| date	| Trade/quote date (observation date).| 
| symbol	| Option contract symbol (e.g., `QQQ 220103C265000`: root, expiry, call/put, strike). The underlying ticker comes from `optionm.securd`.| 
| symbol_flag	| Indicator for special cases in the symbol (e.g., corporate actions, temporary tickers).| 
| exdate	| Option expiration date.| 
| last_date	| Date on which the contract last traded (not the expiration date; see exdate).| 
| cp_flag	| Call/Put flag — usually "C" for call, "P" for put.| 
| strike_price	| Strike price multiplied by 1000 in the raw data (e.g., 265000 is a $265 strike); divide by 1000 to get dollars.| 
| best_bid	| Best bid price in the market at the observation time.| 
| best_offer	| Best ask/offer price in the market at the observation time.| 
| volume	| Trading volume (number of contracts traded that day).| 
| open_interest	| Number of outstanding contracts at the end of the day.| 
| impl_volatility	| Implied volatility (annualized, decimal form) derived from option prices.| 
| delta	| Option delta — sensitivity of option price to a $1 change in underlying price.| 
| gamma	| Option gamma — sensitivity of delta to a $1 change in underlying price.| 
| vega	| Option vega — sensitivity of option price to a 1 percentage point change in implied volatility.| 
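| theta	| Option theta: sensitivity of option price to the passage of time. IvyDB is understood to report this in dollars per year (divide by 365 for an approximate per-day value); confirm against the reference manual.| 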
| optionid	| Unique identifier for the specific option contract.| 
| cfadj	| Cumulative factor adjustment for corporate actions (splits, dividends, etc.).| 
| am_settlement	| Flag for AM‑settled options (settled based on morning opening prices).| 
| contract_size	| Number of underlying shares per contract (usually 100 for US equity options).| 
| ss_flag	| Special settlement flag (e.g., early exercise restrictions, special terms).| 
| forward_price	| Forward price of the underlying for the option’s maturity (used in IV calc).| 
| expiry_indicator	| Code indicating standard vs. non‑standard expiry (e.g., weekly, quarterly).| 
| root	| Option root symbol (base symbol before strike/expiry codes).| 
| suffix	| Additional contract code suffix (e.g., for LEAPS, weeklies).| 
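
Because `strike_price` is stored as the dollar strike times 1000, it usually needs converting before it can be compared with quotes or the underlying price. The sketch below shows one way to do that, along with a bid/ask midpoint. The toy values and the `strike` and `mid` column names are illustrative assumptions, not OptionMetrics fields.

```python
import pandas as pd

# Toy rows mimicking the raw layout (values are illustrative only)
raw = pd.DataFrame({
    "strike_price": [265000.0, 270000.0],
    "best_bid": [135.67, 130.67],
    "best_offer": [137.23, 132.23],
})

def add_dollar_strike_and_mid(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the strike in dollars and the bid/ask midpoint."""
    out = df.copy()
    out["strike"] = out["strike_price"] / 1000.0   # assumed scaling: strike x 1000
    out["mid"] = (out["best_bid"] + out["best_offer"]) / 2.0
    return out

converted = add_dollar_strike_and_mid(raw)
```

Working on a copy keeps the raw pull untouched, so the original units remain available for checking.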

__This notebook generates Parquet files from a filtered WRDS options dataset. Rows are filtered on liquidity and tenor.__

_Liquidity_

- Positive bid/ask quotes (exclude zero or crossed markets).
- Reasonable bid‑ask spread (e.g., spread < 50% of mid price).
- Minimum volume or open interest (e.g., OI > 0, volume > 0).
- Delta range (exclude contracts with |delta| < 0.05, which are far OTM, or |delta| > 0.95, which are deep ITM; both tend to be illiquid).

_Tenor_

- 25 to 90 calendar days to expiry

_Reading it back later_

Load everything:

`df_all = pd.read_parquet("options_filtered/", engine="fastparquet")`

Load just one partition (e.g., SPY 2023):

`df_spy_2023 = pd.read_parquet("options_filtered/symbol=SPY/year=2023", engine="fastparquet")`

In [1]:
import wrds
import pandas as pd

In [2]:
db = wrds.Connection(wrds_username='ayansola')
# First time only: create a .pgpass file so later connections do not prompt for a password
# db.create_pgpass_file()

Loading library list...
Done


__Query WRDS for SPY/QQQ (full calendar years 2022 and 2023)__

In [3]:
def load_option_data(table_name: str, params: dict):
    """
    Load option data from a given OptionMetrics table.

    Args:
        table_name (str): e.g. "optionm.opprcd2022" or "optionm.opprcd2023"
        params (dict): dictionary with keys like symbols, from_date, to_date

    Returns:
        pd.DataFrame: query results
    """
    query = f"""
        SELECT o.date,
               o.secid,
               s.ticker,
               o.symbol,
               o.cp_flag,
               o.exdate,
               o.strike_price,
               o.best_bid,
               o.best_offer,
               o.volume,
               o.open_interest,
               o.impl_volatility,
               o.delta,
               o.vega,
               o.theta,
               o.forward_price,
               o.expiry_indicator
        FROM {table_name} o
        JOIN optionm.securd s
          ON o.secid = s.secid
        WHERE s.ticker IN %(symbols)s
          AND o.date BETWEEN %(from_date)s AND %(to_date)s
    """
    return db.raw_sql(query, params=params)

In [4]:
params_2022 = {
    "symbols": ("QQQ","SPY"),
    "from_date": "2022-01-01",
    "to_date": "2022-12-31"
}

params_2023 = {
    "symbols": ("QQQ","SPY"),
    "from_date": "2023-01-01",
    "to_date": "2023-12-31"
}

data_2022 = load_option_data("optionm.opprcd2022", params_2022)
data_2023 = load_option_data("optionm.opprcd2023", params_2023)

# Merge into one dataset
all_data = pd.concat([data_2022, data_2023], ignore_index=True)
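
The per-year parameter dictionaries follow a fixed pattern, so they can be generated rather than written out by hand. The sketch below assumes that every yearly price table is named `optionm.opprcd<year>`, as the two tables used here are. The helper `yearly_query_specs` is a hypothetical name introduced for illustration.

```python
from datetime import date

def yearly_query_specs(symbols, years, table_prefix="optionm.opprcd"):
    """Build (table_name, params) pairs, one per calendar year."""
    specs = []
    for year in years:
        params = {
            "symbols": tuple(symbols),
            "from_date": date(year, 1, 1).isoformat(),
            "to_date": date(year, 12, 31).isoformat(),
        }
        specs.append((f"{table_prefix}{year}", params))
    return specs

specs = yearly_query_specs(["QQQ", "SPY"], [2022, 2023])

# With a live WRDS connection, the frames could then be combined as:
# all_data = pd.concat(
#     [load_option_data(table, params) for table, params in specs],
#     ignore_index=True,
# )
```

Adding another year then only means extending the `years` list.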

In [5]:
all_data.dtypes

date                string[python]
secid                      Float64
ticker              string[python]
symbol              string[python]
cp_flag             string[python]
exdate              string[python]
strike_price               Float64
best_bid                   Float64
best_offer                 Float64
volume                     Float64
open_interest              Float64
impl_volatility            Float64
delta                      Float64
vega                       Float64
theta                      Float64
forward_price       string[python]
expiry_indicator    string[python]
dtype: object
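
The dtypes above show dates and `forward_price` arriving as strings, and identifier and count columns arriving as floats. The following is a minimal sketch of a cleanup step on a toy frame with the same dtypes. The helper name `tidy_dtypes` and the sample values are assumptions for illustration, and the choice of `Int64` for `secid` and `volume` assumes those columns hold whole numbers.

```python
import pandas as pd

# Toy frame with the same dtypes as the raw pull (values are illustrative)
sample = pd.DataFrame({
    "date": pd.array(["2022-01-03", "2022-01-03"], dtype="string"),
    "secid": pd.array([107899.0, 107899.0], dtype="Float64"),
    "volume": pd.array([0.0, 12.0], dtype="Float64"),
    "forward_price": pd.array(["401.25", None], dtype="string"),
})

def tidy_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with dates parsed and numeric columns typed."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    for col in ("secid", "volume"):
        out[col] = out[col].astype("Int64")        # nullable integers
    # Unparseable strings become missing values instead of raising
    out["forward_price"] = pd.to_numeric(out["forward_price"], errors="coerce")
    return out

tidy = tidy_dtypes(sample)
```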

In [6]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6392228 entries, 0 to 6392227
Data columns (total 17 columns):
 #   Column            Dtype  
---  ------            -----  
 0   date              string 
 1   secid             Float64
 2   ticker            string 
 3   symbol            string 
 4   cp_flag           string 
 5   exdate            string 
 6   strike_price      Float64
 7   best_bid          Float64
 8   best_offer        Float64
 9   volume            Float64
 10  open_interest     Float64
 11  impl_volatility   Float64
 12  delta             Float64
 13  vega              Float64
 14  theta             Float64
 15  forward_price     string 
 16  expiry_indicator  string 
dtypes: Float64(10), string(7)
memory usage: 890.0 MB


In [7]:
all_data.head()

|  | date | secid | ticker | symbol | cp_flag | exdate | strike_price | best_bid | best_offer | volume | open_interest | impl_volatility | delta | vega | theta | forward_price | expiry_indicator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2022-01-03 | 107899.0 | QQQ | QQQ 220103C265000 | C | 2022-01-03 | 265000.0 | 135.67 | 137.23 | 0.0 | 0.0 |  |  |  |  |  | w |
| 1 | 2022-01-03 | 107899.0 | QQQ | QQQ 220103C270000 | C | 2022-01-03 | 270000.0 | 130.67 | 132.23 | 12.0 | 12.0 |  |  |  |  |  | w |
| 2 | 2022-01-03 | 107899.0 | QQQ | QQQ 220103C245000 | C | 2022-01-03 | 245000.0 | 155.67 | 157.24 | 0.0 | 0.0 |  |  |  |  |  | w |
| 3 | 2022-01-03 | 107899.0 | QQQ | QQQ 220103C250000 | C | 2022-01-03 | 250000.0 | 150.67 | 152.24 | 0.0 | 0.0 |  |  |  |  |  | w |
| 4 | 2022-01-03 | 107899.0 | QQQ | QQQ 220103C255000 | C | 2022-01-03 | 255000.0 | 145.67 | 147.22 | 0.0 | 0.0 |  |  |  |  |  | w |


- __Filter for liquidity: drop zero volume/OI, wide spreads, extreme deltas.__

- __Filter for tenor: keep options with 25 to 90 calendar days to expiry.__

In [8]:
# Start with your raw WRDS pull
df = all_data.copy()

# Liquidity filters
df = df[(df['volume'] > 0) | (df['open_interest'] > 0)]
df = df[(df['best_bid'] > 0) & (df['best_offer'] > 0)]
df = df[df['best_offer'] >= df['best_bid']]
df = df[(df['best_offer'] - df['best_bid']) / ((df['best_offer'] + df['best_bid'])/2) < 0.5]
df = df[(df['delta'].abs() >= 0.05) & (df['delta'].abs() <= 0.95)]

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
df['exdate'] = pd.to_datetime(df['exdate'])

# Tenor filter: keep 25 to 90 calendar days to expiry
df['dte'] = (df['exdate'] - df['date']).dt.days
df = df[(df['dte'] >= 25) & (df['dte'] <= 90)]
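
When several filters are chained like this, it can help to see how many rows each one removes. The sketch below applies the same liquidity conditions in the same order to a toy frame and records the rows remaining after each step. The helper `apply_filters_with_counts`, the filter labels, and the toy data are hypothetical, introduced only for illustration.

```python
import pandas as pd

def apply_filters_with_counts(df: pd.DataFrame, filters: dict):
    """Apply boolean-mask filters in order; record rows remaining after each."""
    remaining = {"start": len(df)}
    out = df
    for name, make_mask in filters.items():
        out = out[make_mask(out)]
        remaining[name] = len(out)
    return out, pd.Series(remaining, name="rows_remaining")

def relative_spread(d: pd.DataFrame) -> pd.Series:
    """Bid/ask spread as a fraction of the mid price."""
    return (d["best_offer"] - d["best_bid"]) / ((d["best_offer"] + d["best_bid"]) / 2)

liquidity_filters = {
    "traded_or_open": lambda d: (d["volume"] > 0) | (d["open_interest"] > 0),
    "positive_quotes": lambda d: (d["best_bid"] > 0) & (d["best_offer"] > 0),
    "not_crossed": lambda d: d["best_offer"] >= d["best_bid"],
    "spread_lt_50pct": lambda d: relative_spread(d) < 0.5,
    "delta_in_range": lambda d: d["delta"].abs().between(0.05, 0.95),
}

# Toy data: each of rows 0 to 3 fails exactly one of the filters
toy = pd.DataFrame({
    "volume": [0, 10, 5, 3, 7],
    "open_interest": [0, 4, 0, 2, 9],
    "best_bid": [1.0, 0.0, 1.0, 2.0, 3.0],
    "best_offer": [1.2, 0.5, 3.0, 2.1, 3.2],
    "delta": [0.5, 0.5, 0.5, 0.99, -0.4],
})
filtered, counts = apply_filters_with_counts(toy, liquidity_filters)
print(counts)
```

A filter that removes far more rows than expected is often a sign of a units or missing-value problem rather than genuine illiquidity.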

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 781614 entries, 4054 to 6389334
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   date              781614 non-null  datetime64[ns]
 1   secid             781614 non-null  Float64       
 2   ticker            781614 non-null  string        
 3   symbol            781614 non-null  string        
 4   cp_flag           781614 non-null  string        
 5   exdate            781614 non-null  datetime64[ns]
 6   strike_price      781614 non-null  Float64       
 7   best_bid          781614 non-null  Float64       
 8   best_offer        781614 non-null  Float64       
 9   volume            781614 non-null  Float64       
 10  open_interest     781614 non-null  Float64       
 11  impl_volatility   781614 non-null  Float64       
 12  delta             781614 non-null  Float64       
 13  vega              781614 non-null  Float64       
 14  theta

__Save to Parquet: partition by symbol/year for downstream use.__

In [10]:
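# Add the partition key for the year
df['year'] = df['date'].dt.year

# 'symbol' holds the per-contract option symbol (e.g. "QQQ 220103C265000"), so
# partitioning on it would create one directory per contract. Keep it as
# 'option_symbol' and expose the underlying ticker as 'symbol' instead, which
# gives paths such as options_filtered/symbol=SPY/year=2023.
df = df.rename(columns={'symbol': 'option_symbol', 'ticker': 'symbol'})

df.to_parquet(
    "options_filtered/",
    engine="fastparquet",        # or "pyarrow"
    partition_cols=["symbol", "year"],
    index=False
)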