### Options Data from WRDS
Link: https://wrds-www.wharton.upenn.edu/pages/about/data-vendors/optionmetrics/


| Column | Meaning |
| --------- | :------ |
| secid | OptionMetrics security identifier for the underlying asset (unique across time). |
| date | Trade/quote date (observation date). |
| symbol | Option symbol: root, expiry, call/put flag and strike (e.g., `QQQ 230103C210000`), not the underlying ticker. |
| symbol_flag | Flag for the symbol format (old vs. new OSI symbology); check the IvyDB manual for the codes. |
| exdate | Option expiration date. |
| last_date | Date on which the contract last traded. |
| cp_flag | Call/put flag: "C" for call, "P" for put. |
| strike_price | Strike price multiplied by 1000 (e.g., 210000 = $210); divide by 1000 to get dollars. |
| best_bid | Best (highest) bid for the contract on the observation date. |
| best_offer | Best (lowest) ask/offer for the contract on the observation date. |
| volume | Trading volume (number of contracts traded that day). |
| open_interest | Number of outstanding contracts. |
| impl_volatility | Implied volatility (annualized, decimal form) derived from option prices. |
| delta | Option delta: sensitivity of the option price to a $1 change in the underlying price. |
| gamma | Option gamma: sensitivity of delta to a $1 change in the underlying price. |
| vega | Option vega: sensitivity of the option price to a 1 percentage point change in implied volatility. |
| theta | Option theta: sensitivity of the option price to the passage of time (verify in the IvyDB manual whether it is quoted per day or per year). |
| optionid | Unique identifier for the specific option contract. |
| cfadj | Cumulative adjustment factor for corporate actions (e.g., splits). |
| am_settlement | Flag for AM-settled options (settled on opening prices on the morning of expiration). |
| contract_size | Number of underlying shares per contract (usually 100 for US equity options). |
| ss_flag | Special settlement flag (non-standard settlement terms or deliverables). |
| forward_price | Forward price of the underlying for the option's maturity (used in the IV calculation). |
| expiry_indicator | Code indicating standard vs. non-standard expiry (e.g., weekly, quarterly). |
| root | Option root symbol (base symbol before strike/expiry codes). |
| suffix | Additional contract code suffix (e.g., for LEAPS, weeklies). |
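
The sample rows later in this notebook show `symbol` values such as `QQQ 230103C210000`. As a minimal sketch, assuming every symbol follows that same layout (root, space, `YYMMDD` expiry, `C`/`P`, strike × 1000), the pieces can be split out with a regular expression. `parse_option_symbol` is a hypothetical helper, not part of WRDS or OptionMetrics.

```python
import re

import pandas as pd

# Layout assumed from the sample rows, e.g. "QQQ 230103C210000":
# root, whitespace, YYMMDD expiry, C/P flag, strike multiplied by 1000.
SYMBOL_RE = re.compile(
    r"^(?P<root>\S+)\s+(?P<expiry>\d{6})(?P<cp>[CP])(?P<strike>\d+)$"
)


def parse_option_symbol(symbol: str) -> dict:
    """Split an option symbol into root, expiry, call/put flag and strike in dollars."""
    match = SYMBOL_RE.match(symbol.strip())
    if match is None:
        raise ValueError(f"unrecognized option symbol: {symbol!r}")
    return {
        "root": match["root"],
        "expiry": pd.to_datetime(match["expiry"], format="%y%m%d"),
        "cp_flag": match["cp"],
        "strike": int(match["strike"]) / 1000,  # 210000 -> 210.0 dollars
    }


parsed = parse_option_symbol("QQQ 230103C210000")
print(parsed)
```

Symbols that do not match the assumed layout raise a `ValueError` instead of being parsed silently, which makes older or non-standard symbologies easy to spot.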

__This notebook generates Parquet files from a filtered WRDS options dataset. Rows are filtered on liquidity and tenor.__

_Liquidity_

- Positive bid/ask quotes (exclude zero or crossed markets).
- Reasonable bid‑ask spread (e.g., spread < 50% of mid price).
- Minimum activity: volume > 0 or open interest > 0.
- Delta range: exclude absolute deltas below 0.05 (far OTM) or above 0.95 (deep ITM), which tend to be illiquid.
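
As a worked instance of the spread rule, take a quote of 54.53 bid and 54.66 offer (the first QQQ row in the sample output further down):

$$
\text{mid} = \frac{54.53 + 54.66}{2} = 54.595,
\qquad
\frac{\text{offer} - \text{bid}}{\text{mid}} = \frac{0.13}{54.595} \approx 0.0024
$$

A relative spread of about 0.24% is far below the 50% threshold, so this quote passes the spread filter.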

_Tenor_

- 25 to 90 calendar days to expiry

_Reading it back later_

Load everything:

`df_all = pd.read_parquet("options_filtered/", engine="fastparquet")`

Load just one partition (e.g., SPY 2023):

`df_spy_2023 = pd.read_parquet("options_filtered/symbol=SPY/year=2023", engine="fastparquet")`

In [8]:
import wrds
import pandas as pd

In [14]:
db = wrds.Connection(wrds_username='ayansola')
# First-time setup only: create a .pgpass file so later connections do not prompt for a password
# db.create_pgpass_file()

Loading library list...
Done


__Query WRDS for SPY/QQQ (start with 3 months)__

In [3]:
params = {
    "symbols": ("QQQ","SPY",),   # QQQ tracks the Nasdaq-100, SPY the S&P 500
    "from_date": "2023-01-01",
    "to_date": "2023-03-31"
}

data = db.raw_sql(
    """
    SELECT o.date,
           o.secid,
           s.ticker,
           o.symbol,
           o.cp_flag,
           o.exdate,
           o.strike_price,
           o.best_bid,
           o.best_offer,
           o.volume,
           o.open_interest,
           o.impl_volatility,
           o.delta,
           o.vega,
           o.theta,
           o.forward_price,
           o.expiry_indicator
    FROM optionm.opprcd2023 o
    JOIN optionm.securd s
      ON o.secid = s.secid
    WHERE s.ticker IN %(symbols)s
      AND o.date BETWEEN %(from_date)s AND %(to_date)s
    """,
    params=params,
)
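
To go beyond the initial three months, one option is to issue one query per yearly table. This is a sketch only: it assumes other years follow the same `optionm.opprcdYYYY` naming as the 2023 table above, and `yearly_queries` is a hypothetical helper that just builds the SQL and parameters without touching the database.

```python
# Hypothetical helper: selects a subset of the columns used above;
# only the yearly table suffix changes between queries.
QUERY_TEMPLATE = """
    SELECT o.date, o.secid, s.ticker, o.symbol, o.cp_flag, o.exdate,
           o.strike_price, o.best_bid, o.best_offer, o.volume,
           o.open_interest, o.impl_volatility, o.delta, o.vega, o.theta
    FROM optionm.opprcd{year} o
    JOIN optionm.securd s
      ON o.secid = s.secid
    WHERE s.ticker IN %(symbols)s
      AND o.date BETWEEN %(from_date)s AND %(to_date)s
"""


def yearly_queries(symbols, years):
    """Yield one (sql, params) pair per calendar year."""
    for year in years:
        # The year is interpolated into the table name, so accept integers only.
        if not isinstance(year, int):
            raise TypeError(f"year must be an int, got {year!r}")
        params = {
            "symbols": tuple(symbols),
            "from_date": f"{year}-01-01",
            "to_date": f"{year}-12-31",
        }
        yield QUERY_TEMPLATE.format(year=year), params


queries = list(yearly_queries(["QQQ", "SPY"], [2022, 2023]))

# With a live connection, the pieces could be combined like this:
# frames = [db.raw_sql(sql, params=params) for sql, params in queries]
# data = pd.concat(frames, ignore_index=True)
```

Keeping the ticker list and dates as bound parameters, and restricting the interpolated year to integers, avoids building SQL from arbitrary strings.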

In [4]:
data.dtypes

date                string[python]
secid                      Float64
ticker              string[python]
symbol              string[python]
cp_flag             string[python]
exdate              string[python]
strike_price               Float64
best_bid                   Float64
best_offer                 Float64
volume                     Float64
open_interest              Float64
impl_volatility            Float64
delta                      Float64
vega                       Float64
theta                      Float64
forward_price       string[python]
expiry_indicator    string[python]
dtype: object
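
The dtypes above show dates arriving as strings and identifiers and counts as `Float64`. A possible cleanup step, sketched here on a one-row toy frame, is to cast them before filtering. `tidy_dtypes` and the derived `strike` column are hypothetical names, and the division by 1000 assumes the strike scaling noted in the column table.

```python
import pandas as pd


def tidy_dtypes(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with dates parsed, count-like columns as integers, strike in dollars."""
    out = raw.copy()
    for col in ("date", "exdate"):
        out[col] = pd.to_datetime(out[col])
    for col in ("secid", "volume", "open_interest"):
        # Nullable integer dtype keeps missing values as <NA>.
        out[col] = out[col].astype("Int64")
    out["strike"] = out["strike_price"] / 1000  # assumes strike_price is dollars x 1000
    return out


# One-row toy frame using the same dtypes as the WRDS pull.
sample = pd.DataFrame({
    "date": pd.array(["2023-01-03"], dtype="string"),
    "exdate": pd.array(["2023-01-03"], dtype="string"),
    "secid": pd.array([107899.0], dtype="Float64"),
    "strike_price": pd.array([210000.0], dtype="Float64"),
    "volume": pd.array([205.0], dtype="Float64"),
    "open_interest": pd.array([5.0], dtype="Float64"),
})

tidy = tidy_dtypes(sample)
print(tidy.dtypes)
```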

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 889998 entries, 0 to 389997
Data columns (total 17 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   date              889998 non-null  string 
 1   secid             889998 non-null  Float64
 2   ticker            889998 non-null  string 
 3   symbol            889998 non-null  string 
 4   cp_flag           889998 non-null  string 
 5   exdate            889998 non-null  string 
 6   strike_price      889998 non-null  Float64
 7   best_bid          889998 non-null  Float64
 8   best_offer        889998 non-null  Float64
 9   volume            889998 non-null  Float64
 10  open_interest     889998 non-null  Float64
 11  impl_volatility   767732 non-null  Float64
 12  delta             767732 non-null  Float64
 13  vega              767732 non-null  Float64
 14  theta             767732 non-null  Float64
 15  forward_price     0 non-null       string 
 16  expiry_indicator  355302 

In [6]:
data.head()

| | date | secid | ticker | symbol | cp_flag | exdate | strike_price | best_bid | best_offer | volume | open_interest | impl_volatility | delta | vega | theta | forward_price | expiry_indicator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2023-01-03 | 107899.0 | QQQ | QQQ 230103C210000 | C | 2023-01-03 | 210000.0 | 54.53 | 54.66 | 205.0 | 5.0 | | | | | | |
| 1 | 2023-01-03 | 107899.0 | QQQ | QQQ 230103C220000 | C | 2023-01-03 | 220000.0 | 44.53 | 44.66 | 0.0 | 13.0 | | | | | | |
| 2 | 2023-01-03 | 107899.0 | QQQ | QQQ 230103C225000 | C | 2023-01-03 | 225000.0 | 39.53 | 39.66 | 0.0 | 0.0 | | | | | | |
| 3 | 2023-01-03 | 107899.0 | QQQ | QQQ 230103C226000 | C | 2023-01-03 | 226000.0 | 38.41 | 38.75 | 0.0 | 8.0 | | | | | | |
| 4 | 2023-01-03 | 107899.0 | QQQ | QQQ 230103C227000 | C | 2023-01-03 | 227000.0 | 37.41 | 37.75 | 0.0 | 0.0 | | | | | | |


- __Filter for liquidity: drop zero volume/OI, wide spreads, extreme deltas.__

- __Filter for tenor: keep options with 25 to 90 calendar days to expiry.__

In [9]:
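# Work on a copy of the raw WRDS pull
df = data.copy()

# Liquidity filters
# Keep contracts with some activity: traded that day or with open interest
df = df[(df['volume'] > 0) | (df['open_interest'] > 0)]
# Require positive, non-crossed quotes
df = df[(df['best_bid'] > 0) & (df['best_offer'] > 0)]
df = df[df['best_offer'] >= df['best_bid']]
# Relative spread (offer - bid) / mid must be below 50%
df = df[(df['best_offer'] - df['best_bid']) / ((df['best_offer'] + df['best_bid'])/2) < 0.5]
# Keep 0.05 <= |delta| <= 0.95; rows with a missing delta are dropped here too
df = df[(df['delta'].abs() >= 0.05) & (df['delta'].abs() <= 0.95)]

# Convert date strings to datetime
df['date'] = pd.to_datetime(df['date'])
df['exdate'] = pd.to_datetime(df['exdate'])

# Tenor filter: 25 to 90 calendar days to expiry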
df['dte'] = (df['exdate'] - df['date']).dt.days
df = df[(df['dte'] >= 25) & (df['dte'] <= 90)]
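
To see how many rows each liquidity rule removes, the filters can be applied one at a time while counting survivors. The sketch below runs on a made-up six-row frame. `filter_attrition` is a hypothetical helper, and the masks restate the rules from the cell above rather than reuse its code.

```python
import pandas as pd


def filter_attrition(frame, masks):
    """Apply (name, mask_function) pairs in order; return the survivors and a row-count report."""
    rows = [("start", len(frame))]
    current = frame
    for name, make_mask in masks:
        # Treat missing comparisons (e.g. NaN delta) as failing the filter.
        mask = make_mask(current).fillna(False).astype(bool)
        current = current[mask]
        rows.append((name, len(current)))
    return current, pd.DataFrame(rows, columns=["step", "rows_left"])


# Made-up quotes: each row except the first and last trips exactly one rule.
toy = pd.DataFrame({
    "best_bid":      [1.00, 0.00, 2.00, 0.10, 3.00, 0.90],
    "best_offer":    [1.10, 0.05, 2.10, 0.50, 3.10, 1.00],
    "volume":        [10, 1, 0, 5, 2, 0],
    "open_interest": [0, 0, 0, 20, 2, 7],
    "delta":         [0.50, 0.02, 0.40, 0.30, None, -0.25],
})

masks = [
    ("volume or OI > 0", lambda d: (d["volume"] > 0) | (d["open_interest"] > 0)),
    ("positive quotes", lambda d: (d["best_bid"] > 0) & (d["best_offer"] > 0)),
    ("spread < 50% of mid",
     lambda d: (d["best_offer"] - d["best_bid"]) / ((d["best_offer"] + d["best_bid"]) / 2) < 0.5),
    ("0.05 <= |delta| <= 0.95", lambda d: d["delta"].abs().between(0.05, 0.95)),
]

kept, report = filter_attrition(toy, masks)
print(report)
```

A report like this makes it easier to judge whether a threshold, such as the 50% spread cap, is doing most of the pruning.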

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 86972 entries, 4074 to 386868
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   date              86972 non-null  datetime64[ns]
 1   secid             86972 non-null  Float64       
 2   ticker            86972 non-null  string        
 3   symbol            86972 non-null  string        
 4   cp_flag           86972 non-null  string        
 5   exdate            86972 non-null  datetime64[ns]
 6   strike_price      86972 non-null  Float64       
 7   best_bid          86972 non-null  Float64       
 8   best_offer        86972 non-null  Float64       
 9   volume            86972 non-null  Float64       
 10  open_interest     86972 non-null  Float64       
 11  impl_volatility   86972 non-null  Float64       
 12  delta             86972 non-null  Float64       
 13  vega              86972 non-null  Float64       
 14  theta             86972

__Save to Parquet: partition by symbol/year for downstream use.__

In [11]:
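# generate parquet
df['year'] = df['date'].dt.year

# `symbol` holds the full option symbol (e.g. "QQQ 230103C210000"), so partitioning
# on it would create one directory per contract. Use the underlying ticker as the
# partition key instead, renamed to `symbol` so paths look like symbol=SPY/year=2023.
# (`option_symbol` is a column name chosen here for the original contract symbol.)
out = df.rename(columns={"symbol": "option_symbol", "ticker": "symbol"})

out.to_parquet(
    "options_filtered/",
    engine="fastparquet",        # or "pyarrow"
    partition_cols=["symbol", "year"],
    index=False
)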