# Step 1: Construct Quarterly Variables
The calculations for these variables will involve aggregating data from the IBES dataset. You will need the following information from the IBES database:

- ANALYSTS: The number of analysts providing earnings per share (EPS) forecasts.
- DISPERSION: The standard deviation of the analyst EPS forecasts, normalized by the share price at the end of the previous quarter.
- FCSTERROR: The absolute value of the difference between the mean analyst EPS forecast and the actual EPS, normalized by the share price at the end of the previous quarter.

## Step 1.1: Extract Required Data from WRDS IBES
Here, we'll extract the relevant data from the IBES dataset, which includes:

- The number of analyst estimates (numest)
- The mean analyst estimate (meanest)
- The standard deviation of analyst estimates (stdev)
- The actual EPS value (actual)
- The forecast period end date (fpedats)
- The share price from Compustat for the previous quarter (prccq)

In [1]:
import os

import wrds
import pandas as pd
import numpy as np

from Constants import Constants as const
from Utilities import get_fama_french_industry

In [None]:
# Connect to WRDS
conn = wrds.Connection()

In [10]:
begin_date = '2006-01-01'
end_date = '2016-12-31'

import pandas as pd

years = range(2006, 2017)
all_data = []

for year in years:
    print(year)
    begin_date = f'{year}-01-01'
    end_date = f'{year}-12-31'
    ibes_query = f"""
        SELECT a.ticker, a.fpedats, a.anndats_act, a.meanest, a.numest, a.stdev, a.actual, b.prccq, b.fyearq, b.fqtr
        FROM ibes.statsum_epsus AS a
        JOIN comp.funda AS b ON a.ticker = b.tic AND a.fpedats = b.datadate
        WHERE a.measure = 'EPS'
          AND a.fpedats BETWEEN '{begin_date}' AND '{end_date}'
    """
    ibes_data_chunk = conn.raw_sql(ibes_query)
    all_data.append(ibes_data_chunk)

# Combine all chunks into one DataFrame
ibes_data = pd.concat(all_data)



2006


ProgrammingError: (psycopg2.errors.UndefinedColumn) column a.fyearq does not exist
LINE 4: ...     JOIN comp.fundq AS b ON a.ticker = b.tic AND a.fyearq =...
                                                             ^
HINT:  Perhaps you meant to reference the column "b.fyearq".

[SQL: 
        SELECT a.ticker, a.fpedats, a.anndats_act, a.meanest, a.numest, a.stdev, a.actual, b.prccq, b.fyearq, b.fqtr
        FROM ibes.statsum_epsus AS a
        JOIN comp.fundq AS b ON a.ticker = b.tic AND a.fyearq = b.fyearq AND a.fqtr = b.fqtr
        WHERE a.measure = 'EPS'
          AND a.fpedats BETWEEN '2006-01-01' AND '2006-12-31'
    ]
(Background on this error at: https://sqlalche.me/e/20/f405)

In [5]:
# Check available libraries in WRDS
available_libraries = conn.list_libraries()
print("Available Libraries in WRDS:", available_libraries)

# Check available tables in the Compustat library if it's there
if 'compustat' in available_libraries:
    available_tables = conn.list_tables(library='compustat')
    print("Tables in Compustat Library:", available_tables)
elif 'compustat_na' in available_libraries:
    available_tables = conn.list_tables(library='compustat_na')
    print("Tables in Compustat North America Library:", available_tables)
else:
    print("Compustat library not found in WRDS.")


Available Libraries in WRDS: ['aha_sample', 'ahasamp', 'audit', 'audit_audit_comp', 'audit_common', 'audit_corp_legal', 'audit_oia', 'auditsmp', 'auditsmp_all', 'bank', 'bank_all', 'bank_premium_samp', 'banksamp', 'block', 'block_all', 'boardex', 'boardex_na', 'boardex_trial', 'boardsmp', 'bvd_amadeus_trial', 'bvd_bvdbankf_trial', 'bvd_orbis_trial', 'bvdsamp', 'calcbench_trial', 'calcbnch', 'cboe', 'cboe_all', 'cboe_sample', 'cboesamp', 'ciq', 'ciq_common', 'ciqsamp', 'ciqsamp_capstrct', 'ciqsamp_common', 'ciqsamp_keydev', 'ciqsamp_pplintel', 'ciqsamp_ratings', 'ciqsamp_transactions', 'ciqsamp_transcripts', 'cisdmsmp', 'columnar', 'comp', 'comp_bank', 'comp_bank_daily', 'comp_execucomp', 'comp_global', 'comp_global_daily', 'comp_na_annual_all', 'comp_na_daily_all', 'comp_na_monthly_all', 'comp_segments_hist', 'comp_segments_hist_daily', 'compa', 'compb', 'compg', 'compm', 'compsamp', 'compsamp_all', 'compsamp_snapshot', 'compseg', 'contrib', 'contrib_as_filed_financials', 'contrib_ceo_

## Step 1.2: Calculate Quarterly Variables
1. ANALYSTS: The number of analysts providing earnings per share forecasts.
Use the numest column from IBES.

2. DISPERSION: The standard deviation of analyst EPS forecasts normalized by the share price at the end of the previous quarter.
Calculation: DISPERSION = stdev / prccq
Where stdev is the standard deviation of analyst forecasts and prccq is the share price at the end of the previous quarter.

3. FCSTERROR: The absolute forecast error.

Calculation: FCSTERROR = abs(meanest - actual) / prccq
Where meanest is the mean analyst EPS forecast, and actual is the actual EPS value.

In [None]:
# Calculate ANALYSTS, DISPERSION, and FCSTERROR
ibes_data['analysts'] = ibes_data['numest']
ibes_data['dispersion'] = ibes_data['stdev'] / ibes_data['prccq']
ibes_data['fcsterror'] = abs(ibes_data['meanest'] - ibes_data['actual']) / ibes_data['prccq']

# Display the first few rows to verify the calculations
ibes_data[['ticker', 'fpedats', 'analysts', 'dispersion', 'fcsterror']].head()


## Step 2.1: Update SQL Query for Annual Data Extraction

To create annual variables, we will extract data from an annual financial dataset instead of the quarterly fundq dataset. In Compustat, this is often done using the fundamental annual (fundann) dataset.

We will adjust the SQL query to access annual earnings and price data from IBES and Compustat.

Here's the modified version:

In [11]:
# Define the time range for the data extraction
begin_date = '2006-01-01'
end_date = '2016-12-31'

# SQL query to extract annual data from IBES and Compustat, joining on ticker and a date range
ibes_query_annual = f"""
    SELECT a.ticker, a.fpedats, a.anndats_act, a.meanest, a.numest, a.stdev, a.actual, b.prcc_f, b.fyear
    FROM ibes.statsum_epsus AS a
    JOIN comp.funda AS b ON a.ticker = b.tic
    WHERE a.measure = 'EPS'
      AND a.fpedats BETWEEN b.datadate AND b.datadate + interval '1 year' - interval '1 day'
      AND a.fpedats BETWEEN '{begin_date}' AND '{end_date}'
"""

# Extract data using the WRDS connection
ibes_data_annual = conn.raw_sql(ibes_query_annual)

# Convert fpedats and anndats_act to datetime
ibes_data_annual['fpedats'] = pd.to_datetime(ibes_data_annual['fpedats'])
ibes_data_annual['anndats_act'] = pd.to_datetime(ibes_data_annual['anndats_act'])

# Display the first few rows
ibes_data_annual.head()


Unnamed: 0,ticker,fpedats,anndats_act,meanest,numest,stdev,actual,prcc_f,fyear
0,AAAP,2015-12-31,2016-04-29,-0.3,3.0,0.07,-0.25,31.27,2015.0
1,AA,2011-12-31,2012-01-09,3.76,15.0,0.92,2.16,,2011.0
2,AA,2011-12-31,2012-01-09,3.84,15.0,0.81,2.16,,2011.0
3,AA,2011-12-31,2012-01-09,3.92,15.0,0.67,2.16,,2011.0
4,AA,2011-12-31,2012-01-09,4.03,14.0,0.43,2.16,,2011.0


In [94]:
# Find the index of the rows with the maximum 'numest' for each firm-year
idx = ibes_data_annual.groupby(['ticker', 'fyear'])['numest'].idxmax()

# Use the indices to create the deduplicated DataFrame
ibes_data_annual_unique = ibes_data_annual.loc[idx.reset_index(drop=True)]

# Display the first few rows of the deduplicated dataset
ibes_data_annual_unique.head()

Unnamed: 0,ticker,fpedats,anndats_act,meanest,numest,stdev,actual,prcc_f,fyear
80,AA,2012-06-30,2012-07-09,0.33,19.0,0.14,0.18,,2011.0
80,CDTX,2016-03-31,2016-05-12,-16.0,4.0,1.12,-14.2,17.16,2015.0
80,EXPD,2013-12-31,2014-02-25,1.87,21.0,0.09,1.68,44.25,2013.0
80,JJSF,2009-09-30,2009-11-03,0.67,2.0,0.02,0.79,43.19,2009.0
80,NSM,2009-05-31,2009-06-11,1.65,1.0,,0.31,,2008.0


In [96]:
ibes_data_annual_unique.

((155668, 9), (3111534, 9))

In [99]:
ibes_data_annual[(ibes_data_annual['ticker'] == 'EXPD') & (ibes_data_annual['fyear'] == 2013)]

Unnamed: 0,ticker,fpedats,anndats_act,meanest,numest,stdev,actual,prcc_f,fyear
34,EXPD,2013-12-31,2014-02-25,1.81,23.0,0.09,1.68,44.25,2013.0
35,EXPD,2013-12-31,2014-02-25,1.81,23.0,0.09,1.68,44.25,2013.0
36,EXPD,2013-12-31,2014-02-25,1.80,23.0,0.08,1.68,44.25,2013.0
37,EXPD,2013-12-31,2014-02-25,1.80,23.0,0.08,1.68,44.25,2013.0
38,EXPD,2013-12-31,2014-02-25,1.76,23.0,0.05,1.68,44.25,2013.0
...,...,...,...,...,...,...,...,...,...
747,EXPD,2014-09-30,2014-11-04,0.52,14.0,0.02,0.53,44.25,2013.0
748,EXPD,2014-09-30,2014-11-04,0.52,16.0,0.02,0.53,44.25,2013.0
749,EXPD,2014-09-30,2014-11-04,0.52,16.0,0.02,0.53,44.25,2013.0
750,EXPD,2014-09-30,2014-11-04,0.52,15.0,0.02,0.53,44.25,2013.0


In [85]:
ibes_data_annual_unique[ibes_data_annual_unique[['ticker', 'fyear']].duplicated()]

Unnamed: 0,ticker,fpedats,anndats_act,meanest,numest,stdev,actual,prcc_f,fyear,ANALYSTS,lnANALYSTS,DISPERSION,FCSTERROR
9,EXPD,2013-12-31,2014-02-25,1.96,1.0,,1.68,44.25,2013.0,1.0,0.693147,,0.006328
10,JJSF,2010-03-31,2010-04-22,0.46,3.0,0.03,0.48,43.19,2009.0,3.0,1.386294,0.000695,0.000463
11,NSM,2009-05-31,2009-06-11,-0.42,7.0,0.02,-0.28,,2008.0,7.0,2.079442,,
25,NSM,2010-08-31,2010-09-09,0.21,14.0,0.03,0.36,,2009.0,14.0,2.708050,,
27,WGL,2016-06-30,2016-08-03,0.11,4.0,0.19,0.33,57.67,2015.0,4.0,1.609438,0.003295,0.003815
...,...,...,...,...,...,...,...,...,...,...,...,...,...
155663,FOSL,2006-09-30,2006-11-14,0.34,9.0,0.02,0.32,21.51,2005.0,9.0,2.302585,0.000930,0.000930
155664,LF,2010-03-31,2010-05-03,-0.29,5.0,0.04,-0.37,3.91,2009.0,5.0,1.791759,0.010230,0.020460
155665,OXPS,2006-12-31,2007-01-31,0.95,4.0,0.06,1.15,22.69,2006.0,4.0,1.609438,0.002644,0.008814
155666,STGN,2006-12-31,2007-03-06,0.40,2.0,0.03,0.25,7.44,2006.0,2.0,1.098612,0.004032,0.020161


## 2.2 Calculating the Annual Variables:
1. ANALYSTS (Annual): This is simply the number of analysts (numest) for the year.
2. DISPERSION (Annual):
The standard deviation (stdev) of analysts' earnings per share forecasts divided by the price at the end of the fiscal year (prcc_f).
3. FCSTERROR (Annual):
The forecast error is calculated as the absolute difference between the mean analyst EPS forecast (meanest) and the actual EPS (actual), divided by the price at the end of the fiscal year (prcc_f).

In [102]:
# Calculate the annual ANALYSTS variable (simply use numest)
ibes_data_annual_unique['ANALYSTS'] = ibes_data_annual_unique['numest']
ibes_data_annual_unique['lnANALYSTS'] = ibes_data_annual_unique['numest'].apply(lambda x: np.log(x + 1))

# Calculate the annual DISPERSION variable
ibes_data_annual_unique['DISPERSION'] = ibes_data_annual_unique['stdev'] / ibes_data_annual_unique['prcc_f']

# Calculate the annual FCSTERROR variable
ibes_data_annual_unique['FCSTERROR'] = abs(ibes_data_annual_unique['meanest'] - ibes_data_annual_unique['actual']) / ibes_data_annual_unique['prcc_f']

# Display the first few rows with the calculated variables
ibes_data_annual_unique[['ticker', 'fyear', 'ANALYSTS', 'DISPERSION', 'FCSTERROR']].head()


Unnamed: 0,ticker,fyear,ANALYSTS,DISPERSION,FCSTERROR
80,AA,2011.0,19.0,,
80,CDTX,2015.0,4.0,0.065268,0.104895
80,EXPD,2013.0,21.0,0.002034,0.004294
80,JJSF,2009.0,2.0,0.000463,0.002778
80,NSM,2008.0,1.0,,


In [107]:
ibes_data_group = ibes_data_annual_unique.groupby(['ticker', 'fyear'])
analyst = ibes_data_group['ANALYSTS'].max()
lnanalyst = ibes_data_group['lnANALYSTS'].max()
dispersion = ibes_data_group['DISPERSION'].mean()
fcsterror = ibes_data_group['FCSTERROR'].mean()

ibes_result = pd.merge(analyst, lnanalyst, left_index=True, right_index=True).merge(dispersion, left_index=True, right_index=True).merge(fcsterror, left_index=True, right_index=True).reset_index(drop=False)


In [108]:
ibes_result.to_pickle(os.path.join(const.TEMP_PATH, '20241006_analysts_dispersion_fcsterror.pkl'))

# Calculate Synchronicity

In [16]:
import os

import pandas as pd
import numpy as np
import statsmodels.api as sm

from Constants import Constants as const
from Utilities import get_fama_french_industry

In [11]:
data_path = r'D:\Users\wangy\Documents\data'

# Load the CRSP data
data = pd.read_csv(os.path.join(data_path, '20010101_20161231_crsp_stock_return.zip'))

# Convert date to datetime and sort by PERMNO and date
data['date'] = pd.to_datetime(data['date'])
data = data.sort_values(by=['PERMNO', 'date'])

  data = pd.read_csv(os.path.join(data_path, '20010101_20161231_crsp_stock_return.zip'))


In [12]:
data.keys()

Index(['PERMNO', 'date', 'SICCD', 'NCUSIP', 'TICKER', 'PERMCO', 'HSICCD',
       'CUSIP', 'RET', 'vwretd'],
      dtype='object')

In [13]:
# Load the Fama-French industry classification data
industry_returns = pd.read_csv(os.path.join(data_path, '48_Industry_Portfolios_Daily.csv'))
industry_returns['Date'] = pd.to_datetime(industry_returns['Date'], format='%Y%m%d')

In [14]:
data['HSICCD'].replace('Z', np.nan, inplace=True)
data['HSICCD'] = data['HSICCD'].fillna(data['SICCD'].replace('Z', np.nan))
data = data[data['HSICCD'].notnull()]

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data['HSICCD'].replace('Z', np.nan, inplace=True)


In [17]:
data['HSICCD'] = data['HSICCD'].astype(int)

# Apply the mapping to the CRSP data
data['fama_french_industry'] = data['HSICCD'].apply(get_fama_french_industry)

## Calculate Daily Synchronicity

In [40]:
# Convert daily returns to monthly returns (assuming 'RET' is the daily return)
data['RET'] = pd.to_numeric(data['RET'], errors='coerce')

# Prepare the market and industry return variables
# Assuming 'vwretd' is the value-weighted market return and industry_returns contains daily industry returns
industry_returns = industry_returns.melt(id_vars=['Date'], var_name='Industry', value_name='Industry_Return')
industry_returns['date'] = pd.to_datetime(industry_returns['Date'])

data = pd.merge(data, industry_returns, left_on=['date', 'fama_french_industry'], right_on=['date', 'Industry'], how='left')


In [41]:
data.head()

Unnamed: 0,PERMNO,date,SICCD,TICKER,PERMCO,HSICCD,CUSIP,HSICMG,HSICIG,PRC,RET,vwretd,vwretx,ewretd,ewretx,sprtrn,fama_french_industry,Date,Industry,Industry_Return
0,10001,2007-01-03,4920,EWST,7953,4925,36720410,,,11.1,0.0,-0.001338,-0.001502,-0.000159,-0.000273,-0.001199,Util,2007-01-03,Util,0.31
1,10001,2007-01-04,4920,EWST,7953,4925,36720410,,,11.36,0.023423,0.000549,0.000546,0.000591,0.000575,0.001228,Util,2007-01-04,Util,-0.23
2,10001,2007-01-05,4920,EWST,7953,4925,36720410,,,11.25,-0.009683,-0.007297,-0.007302,-0.009809,-0.00983,-0.006085,Util,2007-01-05,Util,-1.69
3,10001,2007-01-08,4920,EWST,7953,4925,36720410,,,-11.345,0.008444,0.002568,0.002355,0.001731,0.001693,0.00222,Util,2007-01-08,Util,-0.02
4,10001,2007-01-09,4920,EWST,7953,4925,36720410,,,11.24,-0.009255,6e-06,5e-06,0.000262,0.00026,-0.000517,Util,2007-01-09,Util,0.21


In [44]:
# Drop rows with NaN values in relevant columns
data = data.dropna(subset=['RET', 'vwretd', 'Industry_Return'])

firm_daily_data = data[['PERMNO', 'date', 'RET', 'vwretd', 'Industry_Return']].drop_duplicates(subset=['PERMNO', 'date'], keep='last')



In [45]:
def calculate_daily_synchrony(group, mkt_only=False, ind_only=False):
    dep_var = 'RET'
    if mkt_only:
        if ind_only:
            raise ValueError('Only one of mkt_only and ind_only should be True')
        ind_vars = ['rm_t-1', 'vwretd', 'rm_t+1']
        suffix = '_MKT'
    elif ind_only:
        ind_vars = ['ri_t-1', 'Industry_Return', 'ri_t+1']
        suffix = '_IND'
    else:
        ind_vars = ['rm_t-1', 'ri_t-1', 'vwretd', 'Industry_Return', 'rm_t+1', 'ri_t+1']
        suffix = ''
        
    
    # Shift the independent variables to match the lag/lead structure as per equation (4)
    group['rm_t-1'] = group['vwretd'].shift(1)
    group['ri_t-1'] = group['Industry_Return'].shift(1)
    group['rm_t+1'] = group['vwretd'].shift(-1)
    group['ri_t+1'] = group['Industry_Return'].shift(-1)
    
    all_vars = [dep_var]
    all_vars.extend(ind_vars)

    # Drop rows with NaN values after shifting
    group = group.dropna(subset=all_vars, how='any')

    # Define the dependent and independent variables
    y = group[dep_var]
    X = group[ind_vars]
    X = sm.add_constant(X)  # Add a constant term to the regression

    try:
        # Run the regression
        model = sm.OLS(y, X, missing='drop').fit()
        r_squared = model.rsquared

        # Calculate IDIOSYN and SYNCHRONICITY
        synchrony = np.log(r_squared / (1 - r_squared)) if r_squared < 1 else np.nan

    except Exception as e:
        # If regression fails for any reason, return NaN
        synchrony = np.nan

    return pd.Series({f'SYNCHRONICITY{suffix}_D': synchrony})


In [46]:
firm_daily_data['year'] = firm_daily_data['date'].dt.year
synchrony = firm_daily_data.groupby(['PERMNO', 'year']).apply(calculate_daily_synchrony).reset_index()
synchrony_ind = firm_daily_data.groupby(['PERMNO', 'year']).apply(calculate_daily_synchrony, ind_only=True).reset_index()
synchrony_mkt = firm_daily_data.groupby(['PERMNO', 'year']).apply(calculate_daily_synchrony, mkt_only=True).reset_index()


  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  synchrony = np.log(r_squared / (1 - r_squared)) if r_squared < 1 else np.nan
  return 1 - self.ssr/self.centered_tss
  synchrony = np.log(r_squared / (1 - r_squared)) if r_squared < 1 else np.nan
  return 1 - self.ssr/self.centered_tss
  synchrony = np.log(r_squared / (1 - r_squared)) if r_squared < 1 else np.nan
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  synchrony = np.log(r_squared / (1 - r_squared)) if r_squared < 1 else np.nan
  return 1 - self.ssr/self.centered_tss
  synchrony = np.log(r_squared / (1 - r_squared)) if r_squared < 1 else np.nan
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  synchrony = np.log(r_squared / (1 - r_squared)) if r_squared < 1 else np.nan
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  synchrony = np.log(r_squared / (1 - r_squared)) if r_squared < 1 else np.nan
  retur

In [47]:
synchrony_daily_df = synchrony.merge(synchrony_mkt, on=['PERMNO', 'year'], how='left').merge(synchrony_ind, on=['PERMNO', 'year'], how='left')
synchrony_daily_df.to_pickle(os.path.join(const.TEMP_PATH, '20241010_synchrony_daily.pkl'))

In [48]:
synchrony_daily_df

Unnamed: 0,PERMNO,year,SYNCHRONICITY_D,SYNCHRONICITY_MKT_D,SYNCHRONICITY_IND_D
0,10001,2007,-3.570597,-3.726721,-3.744804
1,10001,2008,-1.987838,-2.048408,-2.313745
2,10001,2009,-3.654655,-3.772627,-3.809139
3,10001,2010,-3.342663,-3.539507,-3.814449
4,10001,2011,-1.980258,-2.225460,-2.083134
...,...,...,...,...,...
73249,93436,2012,-1.395741,-1.577335,-2.408018
73250,93436,2013,-2.845489,-3.157357,-4.179562
73251,93436,2014,-0.989619,-1.128658,-2.032239
73252,93436,2015,-1.214303,-1.261081,-1.650169


## Calculate Monthly Synchronicity


In [18]:
# Convert daily returns to monthly returns (assuming 'RET' is the daily return)
data['RET'] = pd.to_numeric(data['RET'], errors='coerce')
data['month'] = data['date'].dt.to_period('M')
monthly_returns = data.groupby(['PERMNO', 'month'])['RET'].apply(lambda x: (1 + x).prod() - 1).reset_index()
monthly_returns.rename(columns={'RET': 'monthly_return'}, inplace=True)

In [19]:
# Merge monthly returns back with original data
data = pd.merge(data, monthly_returns, on=['PERMNO', 'month'], how='left')

# Drop rows with NaN monthly returns
data = data.dropna(subset=['monthly_return'])

In [20]:
# Prepare the market and industry return variables
# Assuming 'vwretd' is the value-weighted market return and industry_returns contains daily industry returns
industry_returns = industry_returns.melt(id_vars=['Date'], var_name='Industry', value_name='Industry_Return')
industry_returns['Date'] = pd.to_datetime(industry_returns['Date'])

In [21]:
# Convert industry returns to monthly level
industry_returns['month'] = industry_returns['Date'].dt.to_period('M')
industry_monthly_returns = industry_returns.groupby(['Industry', 'month'])['Industry_Return'].apply(lambda x: (1 + x).prod() - 1).reset_index()

In [22]:
# Convert market returns to weekly level
data['vwretd'] = pd.to_numeric(data['vwretd'], errors='coerce')
market_return = data[['vwretd', 'month', 'date']].drop_duplicates().sort_values(by='date', ascending=True)
monthly_market_returns = market_return.groupby('month')['vwretd'].apply(lambda x: (1 + x).prod() - 1).reset_index()
monthly_market_returns.rename(columns={'vwretd': 'monthly_market_return'}, inplace=True)

In [23]:
# Merge weekly market returns back with original data
data = pd.merge(data, monthly_market_returns, on='month', how='left')

# Merge CRSP data with weekly industry returns
data = pd.merge(data, industry_monthly_returns, left_on=['month', 'fama_french_industry'], right_on=['month', 'Industry'], how='left')

# Drop rows with NaN values in relevant columns
data = data.dropna(subset=['monthly_return', 'monthly_market_return', 'Industry_Return'])

In [24]:
firm_monthly_data = data[['PERMNO', 'month', 'date', 'monthly_return', 'monthly_market_return', 'Industry_Return']].drop_duplicates(subset=['PERMNO', 'month'], keep='last')


In [25]:
def calculate_monthly_synchrony(group, mkt_only=False, ind_only=False):
    dep_var = 'monthly_return'
    if mkt_only:
        if ind_only:
            raise ValueError('Only one of mkt_only and ind_only should be True')
        ind_vars = ['rm_t-1', 'monthly_market_return', 'rm_t+1']
        suffix = '_MKT'
    elif ind_only:
        ind_vars = ['ri_t-1', 'Industry_Return', 'ri_t+1']
        suffix = '_IND'
    else:
        ind_vars = ['rm_t-1', 'ri_t-1', 'monthly_market_return', 'Industry_Return', 'rm_t+1', 'ri_t+1']
        suffix = ''
        
    
    # Shift the independent variables to match the lag/lead structure as per equation (4)
    group['rm_t-1'] = group['monthly_market_return'].shift(1)
    group['ri_t-1'] = group['Industry_Return'].shift(1)
    group['rm_t+1'] = group['monthly_market_return'].shift(-1)
    group['ri_t+1'] = group['Industry_Return'].shift(-1)
    
    all_vars = [dep_var]
    all_vars.extend(ind_vars)

    # Drop rows with NaN values after shifting
    group = group.dropna(subset=all_vars, how='any')

    # Define the dependent and independent variables
    y = group[dep_var]
    X = group[ind_vars]
    X = sm.add_constant(X)  # Add a constant term to the regression

    try:
        # Run the regression
        model = sm.OLS(y, X, missing='drop').fit()
        r_squared = model.rsquared

        # Calculate IDIOSYN and SYNCHRONICITY
        synchrony = np.log(r_squared / (1 - r_squared)) if r_squared < 1 else np.nan

    except Exception as e:
        # If regression fails for any reason, return NaN
        synchrony = np.nan

    return pd.Series({f'SYNCHRONICITY{suffix}': synchrony})


In [26]:
firm_monthly_data['year'] = firm_monthly_data['date'].dt.year
synchrony = firm_monthly_data.groupby(['PERMNO', 'year']).apply(calculate_monthly_synchrony).reset_index()
synchrony_ind = firm_monthly_data.groupby(['PERMNO', 'year']).apply(calculate_monthly_synchrony, ind_only=True).reset_index()
synchrony_mkt = firm_monthly_data.groupby(['PERMNO', 'year']).apply(calculate_monthly_synchrony, mkt_only=True).reset_index()


  return 1 - self.ssr/self.centered_tss
  synchrony = np.log(r_squared / (1 - r_squared)) if r_squared < 1 else np.nan
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  synchrony = np.log(r_squared / (1 - r_squared)) if r_squared < 1 else np.nan
  return 1 - self.ssr/self.centered_tss
  synchrony = np.log(r_squared / (1 - r_squared)) if r_squared < 1 else np.nan
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  r

In [27]:
synchrony_monthly_df = synchrony.merge(synchrony_mkt, on=['PERMNO', 'year'], how='left').merge(synchrony_ind, on=['PERMNO', 'year'], how='left')
synchrony_monthly_df.to_pickle(os.path.join(const.TEMP_PATH, '20250712_synchrony_monthly.pkl'))

In [18]:
synchrony_monthly_df = pd.read_pickle(os.path.join(const.TEMP_PATH, '20241010_synchrony_monthly.pkl'))

In [20]:
synchrony_monthly_df.keys()

Index(['PERMNO', 'year', 'SYNCHRONICITY', 'SYNCHRONICITY_MKT',
       'SYNCHRONICITY_IND'],
      dtype='object')

In [19]:
synchrony_monthly_df[synchrony_monthly_df['PERMNO'] == 10517]

Unnamed: 0,PERMNO,year,SYNCHRONICITY,SYNCHRONICITY_MKT,SYNCHRONICITY_IND
594,10517,2007,0.153897,-1.959758,-1.560775
595,10517,2008,1.395746,0.523615,-0.834738
596,10517,2009,3.511385,-0.512097,2.038146
597,10517,2010,2.86115,1.365538,-1.058951
598,10517,2011,0.214843,-0.946937,-0.360656
599,10517,2012,0.83872,-5.70651,-1.607901
600,10517,2013,1.697661,0.330381,-1.253293
601,10517,2014,1.478653,0.548072,-1.68631
602,10517,2015,-0.501928,-0.851317,-2.024815
603,10517,2016,0.888128,-0.026439,-0.155693


### Calculate monthly variables

In [12]:
monthly_returns['year'] = monthly_returns['month'].dt.year

In [3]:
data.head()

Unnamed: 0,PERMNO,date,SICCD,TICKER,PERMCO,HSICCD,CUSIP,HSICMG,HSICIG,PRC,RET,vwretd,vwretx,ewretd,ewretx,sprtrn
0,10001,2007-01-03,4920,EWST,7953,4925,36720410,,,11.1,0.0,-0.001338,-0.001502,-0.000159,-0.000273,-0.001199
1,10001,2007-01-04,4920,EWST,7953,4925,36720410,,,11.36,0.023423,0.000549,0.000546,0.000591,0.000575,0.001228
2,10001,2007-01-05,4920,EWST,7953,4925,36720410,,,11.25,-0.009683,-0.007297,-0.007302,-0.009809,-0.00983,-0.006085
3,10001,2007-01-08,4920,EWST,7953,4925,36720410,,,-11.345,0.008444,0.002568,0.002355,0.001731,0.001693,0.00222
4,10001,2007-01-09,4920,EWST,7953,4925,36720410,,,11.24,-0.009255,6e-06,5e-06,0.000262,0.00026,-0.000517


In [25]:
test_df = data[(data['PERMNO'] == 14276) & (data['date'].dt.year == 2010)].copy()
test_df.head()

Unnamed: 0,PERMNO,date,SICCD,TICKER,PERMCO,HSICCD,CUSIP,HSICMG,HSICIG,PRC,...,vwretx,ewretd,ewretx,sprtrn,fama_french_industry,month,monthly_return,monthly_market_return,Industry,Industry_Return


In [27]:
synchrony_monthly_df[(synchrony_monthly_df['PERMNO'] == 20626)]

Unnamed: 0,PERMNO,year,SYNCHRONICITY,SYNCHRONICITY_MKT,SYNCHRONICITY_IND
14682,20626,2007,0.292833,-1.204022,-1.080456
14683,20626,2008,4.180854,1.109165,1.189116
14684,20626,2009,4.856123,1.099743,1.520564
14685,20626,2010,3.446621,2.508349,-0.466792
14686,20626,2011,2.617086,2.331971,0.165084
14687,20626,2012,0.821894,-0.225859,-1.837376
14688,20626,2013,1.811536,-0.932521,1.502824
14689,20626,2014,0.776071,-1.28121,-0.208873
14690,20626,2015,2.854316,1.365893,0.735942
14691,20626,2016,2.235017,0.30079,-1.406803


In [4]:
data[const.YEAR] = data['date'].dt.year

In [9]:
data['RET'] = pd.to_numeric(data['RET'], errors='coerce')
data.dropna(subset=['RET'], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['RET'] = pd.to_numeric(data['RET'], errors='coerce')
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.dropna(subset=['RET'], inplace=True)


In [10]:
from scipy.stats import skew, kurtosis

daily_stat = data.groupby(['PERMNO', const.YEAR])['RET'].agg(
    sigma='std',
    skewness=skew,
    kurtosis=kurtosis
)

daily_stat.to_pickle(os.path.join(const.TEMP_PATH, '20250323_daily_stats.pkl'))


  f = lambda x: func(x, *args, **kwargs)


In [18]:
from scipy.stats import skew, kurtosis

monthly_stat = monthly_returns.groupby(['PERMNO', 'year'])['monthly_return'].agg(
    sigma='std',
    skewness=skew,
    kurtosis=kurtosis
)

In [20]:
monthly_stat.to_pickle(os.path.join(const.TEMP_PATH, '20250316_monthly_stats.pkl'))

## Calculate Weekly Synchronicity


In [40]:
# Convert daily returns to weekly returns (assuming 'RET' is the daily return)
data['RET'] = pd.to_numeric(data['RET'], errors='coerce')
data['week'] = data['date'].dt.to_period('W')
weekly_returns = data.groupby(['PERMNO', 'week'])['RET'].apply(lambda x: (1 + x).prod() - 1).reset_index()
weekly_returns.rename(columns={'RET': 'weekly_return'}, inplace=True)

In [41]:
# Merge weekly returns back with original data
data = pd.merge(data, weekly_returns, on=['PERMNO', 'week'], how='left')

In [42]:
# Drop rows with NaN weekly returns
data = data.dropna(subset=['weekly_return'])

In [54]:
# Prepare the market and industry return variables
# Assuming 'vwretd' is the value-weighted market return and industry_returns contains daily industry returns
industry_returns = industry_returns.melt(id_vars=['Date'], var_name='Industry', value_name='Industry_Return')
industry_returns['Date'] = pd.to_datetime(industry_returns['Date'])

In [56]:
# Convert industry returns to weekly level
industry_returns['week'] = industry_returns['Date'].dt.to_period('W')
industry_weekly_returns = industry_returns.groupby(['Industry', 'week'])['Industry_Return'].apply(lambda x: (1 + x).prod() - 1).reset_index()

In [60]:
industry_weekly_returns.head()

Unnamed: 0,Industry,week,Industry_Return
0,Aero,1926-06-28/1926-07-04,-1.96
1,Aero,1926-07-05/1926-07-11,-1.13627
2,Aero,1926-07-12/1926-07-18,-1.667164
3,Aero,1926-07-19/1926-07-25,-1.624224
4,Aero,1926-07-26/1926-08-01,-87.476932


In [58]:
# Convert market returns to weekly level
data['vwretd'] = pd.to_numeric(data['vwretd'], errors='coerce')
market_return = data[['vwretd', 'week', 'date']].drop_duplicates().sort_values(by='date', ascending=True)
weekly_market_returns = market_return.groupby('week')['vwretd'].apply(lambda x: (1 + x).prod() - 1).reset_index()
weekly_market_returns.rename(columns={'vwretd': 'weekly_market_return'}, inplace=True)

In [61]:
# Merge weekly market returns back with original data
data = pd.merge(data, weekly_market_returns, on='week', how='left')

# Merge CRSP data with weekly industry returns
data = pd.merge(data, industry_weekly_returns, left_on=['week', 'fama_french_industry'], right_on=['week', 'Industry'], how='left')

In [None]:
# Drop rows with NaN values in relevant columns
data = data.dropna(subset=['weekly_return', 'weekly_market_return', 'Industry_Return'])

In [64]:
firm_week_data = data[['PERMNO', 'week', 'date', 'weekly_return', 'weekly_market_return', 'Industry_Return']].drop_duplicates(subset=['PERMNO', 'week'], keep='last')


In [65]:
# Create firm-year level data
firm_week_data['year'] = firm_week_data['date'].dt.year

In [73]:
def calculate_synchrony(group):
    # Shift the independent variables to match the lag/lead structure as per equation (4)
    group['rm_t-1'] = group['weekly_market_return'].shift(1)
    group['ri_t-1'] = group['Industry_Return'].shift(1)
    group['rm_t+1'] = group['weekly_market_return'].shift(-1)
    group['ri_t+1'] = group['Industry_Return'].shift(-1)

    # Drop rows with NaN values after shifting
    group = group.dropna(subset=['weekly_return', 'rm_t-1', 'ri_t-1', 'weekly_market_return', 'Industry_Return', 'rm_t+1', 'ri_t+1'])

    # Define the dependent and independent variables
    y = group['weekly_return']
    X = group[['rm_t-1', 'ri_t-1', 'weekly_market_return', 'Industry_Return', 'rm_t+1', 'ri_t+1']]
    X = sm.add_constant(X)  # Add a constant term to the regression

    try:
        # Run the regression
        model = sm.OLS(y, X, missing='drop').fit()
        r_squared = model.rsquared

        # Calculate IDIOSYN and SYNCHRONICITY
        idiosyn = np.log((1 - r_squared) / r_squared) if r_squared < 1 else np.nan
        synchrony = -idiosyn if idiosyn is not np.nan else np.nan

    except Exception as e:
        # If regression fails for any reason, return NaN
        synchrony = np.nan

    return pd.Series({'SYNCHRONICITY': synchrony})


In [78]:
def calculate_synchrony_mkt(group):
    # Shift the independent variables to match the lag/lead structure as per equation (4)
    group['rm_t-1'] = group['weekly_market_return'].shift(1)
    group['rm_t+1'] = group['weekly_market_return'].shift(-1)

    # Drop rows with NaN values after shifting
    group = group.dropna(subset=['weekly_return', 'rm_t-1', 'weekly_market_return', 'rm_t+1'])

    # Define the dependent and independent variables
    y = group['weekly_return']
    X = group[['rm_t-1', 'weekly_market_return', 'rm_t+1']]
    X = sm.add_constant(X)  # Add a constant term to the regression

    try:
        # Run the regression
        model = sm.OLS(y, X, missing='drop').fit()
        r_squared = model.rsquared

        # Calculate IDIOSYN and SYNCHRONICITY
        idiosyn = np.log((1 - r_squared) / r_squared) if r_squared < 1 else np.nan
        synchrony = -idiosyn if idiosyn is not np.nan else np.nan

    except Exception as e:
        # If regression fails for any reason, return NaN
        synchrony = np.nan

    return pd.Series({'SYNCHRONICITY_MKT': synchrony})


In [79]:
def calculate_synchrony_ind(group):
    # Shift the independent variables to match the lag/lead structure as per equation (4)
    group['ri_t-1'] = group['Industry_Return'].shift(1)
    group['ri_t+1'] = group['Industry_Return'].shift(-1)

    # Drop rows with NaN values after shifting
    group = group.dropna(subset=['weekly_return', 'ri_t-1', 'Industry_Return',  'ri_t+1'])

    # Define the dependent and independent variables
    y = group['weekly_return']
    X = group[['ri_t-1', 'Industry_Return', 'ri_t+1']]
    X = sm.add_constant(X)  # Add a constant term to the regression

    try:
        # Run the regression
        model = sm.OLS(y, X, missing='drop').fit()
        r_squared = model.rsquared

        # Calculate IDIOSYN and SYNCHRONICITY
        idiosyn = np.log((1 - r_squared) / r_squared) if r_squared < 1 else np.nan
        synchrony = -idiosyn if idiosyn is not np.nan else np.nan

    except Exception as e:
        # If regression fails for any reason, return NaN
        synchrony = np.nan

    return pd.Series({'SYNCHRONICITY_IND': synchrony})


In [75]:
synchrony = firm_week_data.groupby(['PERMNO', 'year']).apply(calculate_synchrony).reset_index()

  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  idiosyn = np.log((1 - r_squared) / r_squared) if r_squared < 1 else np.nan
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  r

In [80]:
synchrony_ind = firm_week_data.groupby(['PERMNO', 'year']).apply(calculate_synchrony_ind).reset_index()
synchrony_mkt = firm_week_data.groupby(['PERMNO', 'year']).apply(calculate_synchrony_mkt).reset_index()


  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  idiosyn = np.log((1 - r_squared) / r_squared) if r_squared < 1 else np.nan
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  r

In [82]:
synchrony_df = synchrony.merge(synchrony_mkt, on=['PERMNO', 'year'], how='left').merge(synchrony_ind, on=['PERMNO', 'year'], how='left')

In [111]:
link_file = data[['PERMNO', 'TICKER']].drop_duplicates(subset=['PERMNO'])
synchrony_df_ticker = synchrony_df.merge(link_file, on=['PERMNO'], how='left')
synchrony_df_ticker.to_pickle(os.path.join(const.TEMP_PATH, '20241006_synchrony_weekly.pkl'))


In [110]:
data.columns

Index(['PERMNO', 'date', 'SICCD', 'TICKER', 'PERMCO', 'HSICCD', 'CUSIP',
       'HSICMG', 'HSICIG', 'PRC', 'RET', 'vwretd', 'vwretx', 'ewretd',
       'ewretx', 'sprtrn', 'fama_french_industry', 'week', 'weekly_return',
       'weekly_market_return', 'Industry', 'Industry_Return'],
      dtype='object')

# Test WRDS available data

In [17]:
# 连接 WRDS
db = wrds.Connection(wrds_username='wangyouan')


WRDS recommends setting up a .pgpass file.
You can create this file yourself at any time with the create_pgpass_file() function.
Loading library list...
Done


In [4]:

# 获取所有库
libraries = db.list_libraries()

print("\n📚 检查有权限访问的 WRDS 数据库...\n")

# 遍历所有库，尝试列出表
for lib in libraries:
    try:
        datasets = db.list_tables(library=lib)
        if datasets:
            print(f"✅ {lib} ({len(datasets)} datasets):")
            for table in datasets[:10]:  # 如果想全部打印可以删掉 [:10]
                print(f"   - {table}")
            if len(datasets) > 10:
                print(f"   ... ({len(datasets)-10} more tables)\n")
            else:
                print()
    except Exception as e:
        # 没有权限或者其他错误
        pass

WRDS recommends setting up a .pgpass file.
You can create this file yourself at any time with the create_pgpass_file() function.
Loading library list...
Done

📚 检查有权限访问的 WRDS 数据库...

✅ aha_sample (1 datasets):
   - annual_survey

✅ ahasamp (1 datasets):
   - annual_survey

✅ auditsmp (15 datasets):
   - accfiler
   - auditcblock
   - auditchange
   - auditfees
   - auditfeesr
   - auditnonreli
   - auditopin
   - auditors
   - auditorsinfo
   - auditsox302
   ... (5 more tables)

✅ auditsmp_all (15 datasets):
   - accfiler
   - auditcblock
   - auditchange
   - auditfees
   - auditfeesr
   - auditnonreli
   - auditopin
   - auditors
   - auditorsinfo
   - auditsox302
   ... (5 more tables)

✅ bank (43 datasets):
   - _banknames_
   - _leinames_
   - bic_to_lei
   - idrssd_to_lei
   - isin_to_lei
   - lei_legalevents
   - lei_main
   - lei_otheraddresses
   - lei_otherentnames
   - lei_successorentity
   ... (33 more tables)

✅ bank_all (21 datasets):
   - wrds_bank_crsp_link
   - wrds_

In [10]:
full_table = [i for i in libraries if ('samp' not in i) and ('trial' not in i)]
full_table

['auditsmp',
 'auditsmp_all',
 'bank',
 'bank_all',
 'block',
 'block_all',
 'boardsmp',
 'calcbnch',
 'cboe',
 'cboe_all',
 'cisdmsmp',
 'columnar',
 'comp',
 'comp_bank',
 'comp_bank_daily',
 'comp_global',
 'comp_global_daily',
 'comp_na_annual_all',
 'comp_na_daily_all',
 'comp_na_monthly_all',
 'comp_segments_hist',
 'comp_segments_hist_daily',
 'compa',
 'compb',
 'compg',
 'compm',
 'compseg',
 'contrib',
 'contrib_as_filed_financials',
 'contrib_char_returns',
 'contrib_corporate_culture',
 'contrib_general',
 'contrib_global_factor',
 'contrib_intangible_value',
 'contrib_kpss',
 'contrib_liva',
 'crsp',
 'crsp_a_indexes',
 'crsp_a_stock',
 'crsp_a_treasuries',
 'csmar',
 'csmar_af',
 'csmar_cd',
 'csmar_cg',
 'csmar_colc',
 'csmar_financial',
 'csmar_funds',
 'csmar_hld',
 'csmar_ini',
 'csmar_ipo_a',
 'csmar_ma',
 'csmar_rs',
 'csmar_trade',
 'djones',
 'djones_all',
 'dmef',
 'dmef_all',
 'doe',
 'doe_all',
 'ff',
 'ff_all',
 'fjc',
 'fjc_linking',
 'fjc_litigation',
 'frb'

In [5]:
dir(db)

['_Connection__check_schema_perms',
 '_Connection__create_pgpass_file_unix',
 '_Connection__create_pgpass_file_win32',
 '_Connection__get_schema_for_view',
 '_Connection__get_user_credentials',
 '_Connection__make_sa_engine_conn',
 '_Connection__write_pgpass_file',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_connect_args',
 '_dbname',
 '_hostname',
 '_password',
 '_port',
 '_username',
 '_verbose',
 'close',
 'connect',
 'connection',
 'create_pgpass_file',
 'describe_table',
 'engine',
 'get_row_count',
 'get_table',
 'insp',
 'list_libraries',
 'list_tables',
 'load_library_list',
 'raw_sql',
 'schema_perm']

In [12]:
db.list_tables('block')

['block', 'block2', 'blockw']

In [14]:
help(db.raw_sql)

Help on method raw_sql in module wrds.sql:

raw_sql(sql, coerce_float=True, date_cols=None, index_col=None, params=None, chunksize=500000, return_iter=False, dtype=None, dtype_backend='numpy_nullable') method of wrds.sql.Connection instance
    Queries the database using a raw SQL string.
    
    :param sql: SQL code in string object.
    :param coerce_float: (optional) boolean, default: True
        Attempts to convert values of non-string, non-numeric objects
        to floating point. Can result in loss of precision.
    :param date_cols: (optional) list or dict, default: None
        - List of column names to parse as date
        - Dict of ``{column_name: format string}`` where
            format string is:
              strftime compatible in case of parsing string times or
              is one of (D, s, ns, ms, us) in case of parsing
                integer timestamps
        - Dict of ``{column_name: arg dict}``,
            where the arg dict corresponds to the keyword argume

In [19]:
db.raw_sql(f"SELECT COUNT(*) FROM block.block")

Unnamed: 0,count
0,20975


In [22]:
# Step 1: Get all libraries
libraries = db.list_libraries()

# Step 2: Initialize list to hold accessible tables
accessible_tables = []

# Step 3: Loop through libraries and tables
for lib in libraries:
    try:
        tables = db.list_tables(library=lib)
    except:
        continue  # Skip libraries that can't even be listed

    for table in tables:
        qualified_name = f"{lib}.{table}"
        try:
            # Try a minimal query — only 1 row
            db.raw_sql(f"SELECT * FROM {qualified_name} LIMIT 1;")
            accessible_tables.append(qualified_name)
        except Exception as e:
            continue  # Permission denied or other query failure

# Step 4: Output results
print("✅ Fully accessible WRDS datasets:")
print(accessible_tables)

✅ Fully accessible WRDS datasets:
['aha_sample.annual_survey', 'ahasamp.annual_survey', 'auditsmp.accfiler', 'auditsmp.auditcblock', 'auditsmp.auditchange', 'auditsmp.auditfees', 'auditsmp.auditfeesr', 'auditsmp.auditnonreli', 'auditsmp.auditopin', 'auditsmp.auditors', 'auditsmp.auditorsinfo', 'auditsmp.auditsox302', 'auditsmp.auditsox404', 'auditsmp.benefit', 'auditsmp.diroffichange', 'auditsmp.nt', 'auditsmp.revauditopin', 'auditsmp_all.accfiler', 'auditsmp_all.auditcblock', 'auditsmp_all.auditchange', 'auditsmp_all.auditfees', 'auditsmp_all.auditfeesr', 'auditsmp_all.auditnonreli', 'auditsmp_all.auditopin', 'auditsmp_all.auditors', 'auditsmp_all.auditorsinfo', 'auditsmp_all.auditsox302', 'auditsmp_all.auditsox404', 'auditsmp_all.benefit', 'auditsmp_all.diroffichange', 'auditsmp_all.nt', 'auditsmp_all.revauditopin', 'bank.wrds_bank_crsp_link', 'bank.wrds_bank_reg_vars', 'bank.wrds_call_rcfa_1', 'bank.wrds_call_rcfd_1', 'bank.wrds_call_rcfd_2', 'bank.wrds_call_rcfn_1', 'bank.wrds_call

In [24]:
available_tables = [i for i in accessible_tables if ('samp' not in i) and ('trial' not in i) and ('smp' not in i)]
print(available_tables)

['bank.wrds_bank_crsp_link', 'bank.wrds_bank_reg_vars', 'bank.wrds_call_rcfa_1', 'bank.wrds_call_rcfd_1', 'bank.wrds_call_rcfd_2', 'bank.wrds_call_rcfn_1', 'bank.wrds_call_rcfw_1', 'bank.wrds_call_rcoa_1', 'bank.wrds_call_rcon_1', 'bank.wrds_call_rcon_2', 'bank.wrds_call_rcow_1', 'bank.wrds_call_riad_1', 'bank.wrds_call_te_1', 'bank.wrds_holding_bhck_1', 'bank.wrds_holding_bhck_2', 'bank.wrds_holding_other_1', 'bank.wrds_struct_attributes_active', 'bank.wrds_struct_attributes_branches', 'bank.wrds_struct_attributes_closed', 'bank.wrds_struct_relationships', 'bank.wrds_struct_transformations', 'bank_all.wrds_bank_crsp_link', 'bank_all.wrds_bank_reg_vars', 'bank_all.wrds_call_rcfa_1', 'bank_all.wrds_call_rcfd_1', 'bank_all.wrds_call_rcfd_2', 'bank_all.wrds_call_rcfn_1', 'bank_all.wrds_call_rcfw_1', 'bank_all.wrds_call_rcoa_1', 'bank_all.wrds_call_rcon_1', 'bank_all.wrds_call_rcon_2', 'bank_all.wrds_call_rcow_1', 'bank_all.wrds_call_riad_1', 'bank_all.wrds_call_te_1', 'bank_all.wrds_holdi

In [25]:
db.close()