In [4]:
import pandas as pd
import os

# Read only the header row (nrows=0) so the names match what pandas will parse
full_header = pd.read_csv("data/CompFirmCharac.csv", nrows=0).columns.tolist()

print(f"Number of columns in dataset: {len(full_header)}")

# we then define the desired columns to keep 
selected_columns = [
    'datadate', 'cusip', 'sic', 'conm',
    'oibdpy', 'capxy', 'invtq', 'actq', 'ancq',
    'ltq', 'lctq', 'niq', 'cogsq', 'revtq',
    'chechy', 'cshfdy', 'cshpry', 'xintq', 'txty',
    'epspxy', 'dltry', 'wcapcy', 'dpcy', 'saleq', 'atq'
]

# Keep only columns that exist in the actual dataset
existing_columns = [col for col in selected_columns if col in full_header]
print(f"Columns that will be loaded: {existing_columns}")

# Load filtered data in chunks
chunksize = 100_000
filtered_chunks = []

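reader = pd.read_csv(
    "data/CompFirmCharac.csv",
    usecols=existing_columns,
    dtype={'cusip': str},  # read CUSIPs as text in every chunk so leading zeros survive
    chunksize=chunksize,
    low_memory=False,
)
for chunk in reader:
    # Parse dates; unparseable values become NaT
    chunk['datadate'] = pd.to_datetime(chunk['datadate'], errors='coerce')
    # Keep observations from 2000 onward
    chunk = chunk[chunk['datadate'].dt.year >= 2000]
    # Drop rows missing a key, then duplicate firm-date pairs within this chunk
    chunk = chunk.dropna(subset=['cusip', 'datadate'])
    chunk = chunk.drop_duplicates(subset=['cusip', 'datadate'])
    # Drop rows with a missing value in ANY loaded column (strict: can discard most rows)
    chunk = chunk.dropna()
    filtered_chunks.append(chunk)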

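# Combine all chunks, then drop duplicates that span chunk boundaries
df_filtered = pd.concat(filtered_chunks)
df_filtered = df_filtered.drop_duplicates(subset=['cusip', 'datadate'])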

# Check and print dtypes only once after full loading
print("\nFinal data types of selected columns:")
print(df_filtered.dtypes)

# Save output
os.makedirs("data", exist_ok=True)
df_filtered.to_csv("data/filtered_compustat_char.csv", index=False)

print("\nFinal shape:", df_filtered.shape)
print(df_filtered.head())


Number of columns in dataset: 256
Columns that will be loaded: ['datadate', 'cusip', 'conm', 'oibdpy', 'capxy', 'chechy', 'cshfdy', 'cshpry', 'txty', 'epspxy', 'dltry', 'wcapcy', 'dpcy']

Final data types of selected columns:
datadate    datetime64[ns]
cusip               object
conm                object
capxy              float64
chechy             float64
cshfdy             float64
cshpry             float64
dltry              float64
dpcy               float64
epspxy             float64
oibdpy             float64
txty               float64
wcapcy             float64
dtype: object

Final shape: (50, 13)
         datadate      cusip                  conm   capxy  chechy   cshfdy  \
10934  2000-01-31  008489502   AGRA INDUSTRIES LTD  17.181 -12.630   31.445   
558326 2000-03-31  858525132            STELCO INC  30.000  -8.000  105.455   
769061 2000-03-31  405620105  HALEY INDUSTRIES LTD   0.672   3.657   10.506   
769062 2000-06-30  405620105  HALEY INDUSTRIES LTD   1.302   2.157   1
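
The output above shows that only 13 of the 25 requested columns were found in the file header, so the other 12 were skipped without any message. A minimal sketch for listing the skipped names, and for checking whether any differ from a header name only by letter case, could look like the following. The helper name `report_missing_columns` is hypothetical, and the sample header is illustrative only, not the real 256-column header.

```python
def report_missing_columns(requested, header):
    """Split requested names into exact matches, case-only mismatches, and absent names."""
    header_set = set(header)
    # Map lowercased header names back to their actual spelling
    lower_to_actual = {name.lower(): name for name in header}
    found, case_mismatch, absent = [], {}, []
    for col in requested:
        if col in header_set:
            found.append(col)
        elif col.lower() in lower_to_actual:
            case_mismatch[col] = lower_to_actual[col.lower()]
        else:
            absent.append(col)
    return found, case_mismatch, absent


# Illustrative header only (assumption): the real file has 256 columns
sample_header = ['datadate', 'cusip', 'conm', 'CAPXY', 'oibdpy']
requested = ['datadate', 'cusip', 'sic', 'capxy', 'atq']

found, case_mismatch, absent = report_missing_columns(requested, sample_header)
print("Found:", found)
print("Case-only mismatch:", case_mismatch)
print("Absent:", absent)
```

Running the same helper on `selected_columns` and `full_header` would show whether the quarterly variables are truly absent from this extract or only spelled differently.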

| Column Name | Description                                                      | Why It’s Useful                                                                             |
| ----------- | ---------------------------------------------------------------- | ------------------------------------------------------------------------------------------- |
| `datadate`  | The reporting date of the financial statement (quarterly)        | Used to align financial data with stock returns and other time-indexed datasets             |
| `cusip`     | CUSIP security identifier (9 characters in this file)            | Key for merging with CRSP, whose CUSIPs are 8 characters (truncate to the first 8 to match) |
| `sic`       | Standard Industrial Classification code                          | Enables industry classification for sector analysis or industry controls                    |
| `conm`      | Company name                                                     | Useful for readability and validation of merges or analyses                                 |
| `oibdpy`    | Operating income before depreciation (year-to-date)              | Proxy for core operational profitability excluding depreciation                             |
| `capxy`     | Capital expenditures (year-to-date)                              | Indicates investment in future growth, plant, or equipment                                  |
| `invtq`     | Inventory (quarter-end)                                          | Useful for tracking operational efficiency and changes in demand/supply                     |
| `actq`      | Current assets (quarter-end)                                     | Measures liquidity; used in calculating current ratio                                       |
| `ancq`      | Non-current assets (quarter-end)                                 | Represents long-term investment assets of the company                                       |
| `ltq`       | Total liabilities (quarter-end)                                  | Indicator of total financial obligations of the firm                                        |
| `lctq`      | Current liabilities (quarter-end)                                | Used with `actq` to assess short-term liquidity risks                                       |
| `niq`       | Net income (quarterly)                                           | Standard measure of profitability after all expenses                                        |
| `cogsq`     | Cost of goods sold (quarterly)                                   | Used to compute gross margin and evaluate efficiency                                        |
| `revtq`     | Total revenue (quarterly)                                        | Top-line measure of company sales; critical for growth assessment                           |
| `chechy`    | Increase (decrease) in cash and cash equivalents (year-to-date)  | Net change in cash over the period; a flow from the cash flow statement, not a cash balance |
| `cshfdy`    | Common shares used to calculate fully diluted EPS (year-to-date) | Share-count denominator for diluted per-share measures                                      |
| `cshpry`    | Common shares used to calculate basic EPS (year-to-date)         | Share-count denominator for basic per-share measures such as `epspxy`                       |
| `xintq`     | Interest expense (quarterly)                                     | Measures cost of borrowing; related to leverage                                             |
| `txty`      | Total income taxes (year-to-date)                                | Shows tax burden; relevant for after-tax profitability                                      |
| `epspxy`    | Basic EPS excluding extraordinary items (year-to-date)           | Popular investor ratio; allows comparability across firms of different sizes                |
| `dltry`     | Reduction of long-term debt (year-to-date)                       | Cash used to repay long-term debt; a flow, so not a measure of the debt level itself        |
| `wcapcy`    | Change in working capital, other (year-to-date)                  | Cash flow statement item for working-capital changes, not the level of working capital      |
| `dpcy`      | Depreciation and amortization (year-to-date)                     | Affects cash flow and is used to assess investment intensity and asset usage                |
| `saleq`     | Sales/revenue (quarterly)                                        | Similar to `revtq`, often used in ratio calculations                                        |
| `atq`       | Total assets (quarter-end)                                       | Common denominator for ratio analysis (e.g., ROA, leverage)                                 |
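
Several variables in the table are year-to-date, so they accumulate within a fiscal year, while the `q` variables cover a single quarter. One common way to put them on the same footing is to difference the year-to-date values within each firm and fiscal year. The sketch below assumes fiscal-year and fiscal-quarter columns named `fyearq` and `fqtr`, which are not among the columns loaded above; the helper name `ytd_to_quarterly` and the toy numbers are hypothetical.

```python
import pandas as pd


def ytd_to_quarterly(df, ytd_cols, firm_col='cusip', year_col='fyearq', qtr_col='fqtr'):
    """Convert year-to-date columns to single-quarter values by differencing."""
    out = df.sort_values([firm_col, year_col, qtr_col]).reset_index(drop=True)
    grouped = out.groupby([firm_col, year_col])
    # Previous fiscal quarter observed for the same firm and fiscal year
    prev_qtr = grouped[qtr_col].shift(1)
    consecutive = prev_qtr == out[qtr_col] - 1
    for col in ytd_cols:
        prev_val = grouped[col].shift(1)
        # Difference only when the previous quarter is present; otherwise leave NaN
        diff = (out[col] - prev_val).where(consecutive)
        # In the first fiscal quarter the year-to-date value is the quarterly value
        out[col + '_qtr'] = diff.where(out[qtr_col] != 1, out[col])
    return out


# Toy data (assumption): firm B is missing fiscal quarter 2
toy = pd.DataFrame({
    'cusip':  ['A', 'A', 'A', 'A', 'B', 'B'],
    'fyearq': [2000] * 6,
    'fqtr':   [1, 2, 3, 4, 1, 3],
    'capxy':  [10.0, 25.0, 45.0, 70.0, 5.0, 12.0],
})
result = ytd_to_quarterly(toy, ['capxy'])
print(result[['cusip', 'fqtr', 'capxy', 'capxy_qtr']])
```

Leaving a gap as NaN when the preceding quarter is missing avoids attributing two quarters of activity to one.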
