# **Updater** for previously downloaded NASDAQ data using Yahoo! Finance.

Automate the process of downloading and updating historical data for all traded symbols on NASDAQ using the `yfinance` library. The script:

- Retrieves the latest list of traded symbols from NASDAQ.
- Checks for existing data files and updates them by downloading new data starting from an overlapping window (e.g., last 5 days) to ensure consistency.
- Classifies symbols into subfolders (ETF, Stock, or Unknown) based on their ticker information.
- Uses parallel processing to speed up the data retrieval process.
- Automatically commits and pushes changes to a designated GitHub repository.

## Strategy

1. **Fetching Latest NASDAQ Traded Symbols**  
   Download the latest symbol list (`nasdaqtraded.txt`) from the NASDAQ Trader website, filter out test issues, and save the full reference list as a CSV file in the data folder.

2. **Data Download and Update**  
   For each symbol:
   - If a CSV file already exists:
     - Read the existing file and determine the last recorded date.
     - Download new data starting from a few days (overlap period) before the last date.
     - Replace the overlapping data with the newly downloaded data and append any additional rows.
   - If no file exists, download the full historical dataset.
  
3. **Classification**  
   Each symbol is classified as an ETF, Stock, or Unknown based on the `quoteType` field returned by `yfinance`. The data is then stored in the corresponding subdirectory.

4. **Parallel Processing with Progress Reporting**  
   The script leverages `ThreadPoolExecutor` for concurrent downloads. It prints and logs a progress update every 25 symbols, showing:
   - The number of symbols processed.
   - The number remaining.
   - An estimated time remaining.

5. **Git Integration**  
   After all symbols are processed, the script uses Git commands to commit and push any changes made in the data folder to a specified GitHub repository.
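
The overlap-based update from step 2 can be illustrated on synthetic data without any network access. The following is a minimal sketch, not the script's exact code: `merge_with_overlap` is a hypothetical helper name, and the two small DataFrames stand in for an existing CSV file and a fresh download.

```python
import pandas as pd


def merge_with_overlap(old_df, new_df, overlap_days=5):
    """Replace the trailing overlap window of old_df with new_df (hypothetical helper)."""
    last_date = old_df.index.max()
    overlap_start = last_date - pd.Timedelta(days=overlap_days)
    # Keep only the old rows that lie strictly before the overlap window.
    kept = old_df[old_df.index < overlap_start]
    # The fresh data covers the overlap window plus any newer dates.
    merged = pd.concat([kept, new_df[new_df.index >= overlap_start]])
    return merged.sort_index()


# Synthetic stand-in for the existing file: four daily closes.
old = pd.DataFrame(
    {"Close": [10.0, 11.0, 12.0, 13.0]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
)
# Synthetic stand-in for the new download: revised values plus one new day.
new = pd.DataFrame(
    {"Close": [11.0, 12.5, 13.5, 14.0]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]),
)

merged = merge_with_overlap(old, new, overlap_days=2)
print(merged)
```

Here the revised values for 2024-01-03 and 2024-01-04 overwrite the old ones, and the row for 2024-01-05 is appended. Note that an old row inside the overlap window is dropped if the new download does not contain that date.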

## Usage

### Prerequisites

- **Python Version:** 3.6 or higher  
- **Required Packages:** `pandas`, `yfinance`, `requests`  
- **Git:** Must be installed and configured (the script must run inside a local clone of your GitHub repository)

### Customization

- **Overlap Days:**  
  Adjust the `overlap_days` parameter (default is 5) in the `process_symbol` function call to change the overlap period used for refreshing data.

- **Worker Threads:**  
  Modify the `max_workers` parameter to optimize parallel processing for your environment.

- **Git Commit Message:**  
  Update the commit message in the `commit_and_push_changes` function as needed.
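
To experiment with the `max_workers` setting before running against the full symbol list, a small offline sketch can help. It assumes a hypothetical `fake_download` function that merely sleeps in place of a real download, and a hypothetical `run_all` wrapper that mirrors the thread pool and periodic progress estimate described above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(symbol):
    """Hypothetical stand-in for a per-symbol download; it only sleeps briefly."""
    time.sleep(0.01)
    return f"done {symbol}"


def run_all(symbols, max_workers=4, report_every=5):
    """Run fake_download concurrently and record a progress estimate periodically."""
    results = {}
    progress = []  # tuples of (processed, remaining, estimated seconds left)
    start = time.time()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fake_download, s): s for s in symbols}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if done % report_every == 0:
                elapsed = time.time() - start
                remaining = len(symbols) - done
                # Average time per finished symbol times the symbols still pending.
                eta_seconds = elapsed / done * remaining
                progress.append((done, remaining, eta_seconds))
    return results, progress


symbols = [f"SYM{i}" for i in range(20)]
results, progress = run_all(symbols, max_workers=4, report_every=5)
for done, remaining, eta in progress:
    print(f"Processed {done}, {remaining} remaining, about {eta:.2f} s left")
```

Varying `max_workers` in this sketch changes the total run time, which gives a rough feel for the trade-off; with real downloads, the remote server's rate limits may matter more than local thread count.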

## Disclaimer

This script uses data provided by Yahoo! Finance via the `yfinance` library. Data accuracy and timeliness are not guaranteed. Use the script at your own risk.


In [None]:
# In case yfinance is not installed, uncomment the line below.
# !pip install --upgrade --no-cache-dir yfinance

In [None]:
import os
import logging
import pandas as pd
import yfinance as yf
import requests
from io import StringIO
from concurrent.futures import ThreadPoolExecutor, as_completed
from requests.adapters import HTTPAdapter
import subprocess
from datetime import datetime
import time

In [None]:
# ---------------------------
# Global session setup for better connection pooling
# ---------------------------
global_session = requests.Session()
adapter = HTTPAdapter(pool_connections=100, pool_maxsize=100)
global_session.mount("http://", adapter)
global_session.mount("https://", adapter)

In [None]:
# Disable InsecureRequestWarning (only for testing purposes)
requests.packages.urllib3.disable_warnings(requests.packages.urllib3.exceptions.InsecureRequestWarning)

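# Configure logging to "errors.log". The level is INFO so that the periodic
# progress messages written with logging.info() are recorded alongside errors.
logging.basicConfig(filename="errors.log", level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')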

In [None]:
# ---------------------------
# 1. Get the latest list of traded symbols from NASDAQ
# ---------------------------
def get_nasdaq_symbols(data_dir):
    """
    Download the list of NASDAQ traded symbols from the NASDAQ Trader website.
    The file is pipe-separated and contains footer rows and test issues.
    Filters out rows where 'Test Issue' is not 'N', saves the full reference
    DataFrame to the provided directory, and returns the list of symbols.
    """
    url = "http://www.nasdaqtrader.com/dynamic/SymDir/nasdaqtraded.txt"
    try:
        # A timeout prevents the request from hanging indefinitely.
        response = requests.get(url, verify=False, timeout=30)
        response.raise_for_status()
        data = StringIO(response.text)
        df = pd.read_csv(data, sep="|")
        # Filter out rows where 'Test Issue' is not 'N'
        df_clean = df[df['Test Issue'] == 'N']
        df_clean = df_clean[df_clean['Symbol'].notna()]
        # Persist the reference list to a CSV file
        ref_file = os.path.join(data_dir, "nasdaq_symbols_reference.csv")
        df_clean.to_csv(ref_file, index=False)
        symbols = df_clean['Symbol'].tolist()
        return symbols
    except Exception as e:
        logging.error(f"Error fetching NASDAQ traded symbols: {e}")
        return []

In [None]:
# ---------------------------
# 2. Classification helper (ETF/Stock)
# ---------------------------
def classify_symbol(symbol):
    """
    Classify the symbol as 'ETF' or 'Stock' using yfinance Ticker info.
    Returns "Unknown" if classification is unavailable.
    """
    try:
        ticker = yf.Ticker(symbol, session=global_session)
        info = ticker.info
        qtype = info.get("quoteType", None)
        if qtype == "ETF":
            return "ETF"
        elif qtype == "EQUITY":
            return "Stock"
        else:
            logging.error(f"Unknown or missing quoteType for {symbol}. Info: {info}")
            return "Unknown"
    except Exception as e:
        logging.error(f"Error classifying symbol {symbol}: {e}")
        return "Unknown"

In [None]:
# ---------------------------
# 3. Process a single symbol: update or download full data
# ---------------------------
def process_symbol(symbol, folders, overlap_days=5):
    """
    For the given symbol:
    - Classify and determine the correct folder.
    - If a CSV file already exists:
         * Read it with an inferred datetime format.
         * Determine the last date.
         * Download new data starting from (last_date - overlap_days).
         * Replace the overlapping portion with new data and append new rows.
    - If no file exists, download the full historical data.
    Returns a status message.
    """
    classification = classify_symbol(symbol)
    if classification not in folders:
        classification = "Unknown"
    file_path = os.path.join(folders[classification], f"{symbol}.csv")
    
    try:
        if os.path.exists(file_path):
            # File exists: update existing data.
            # parse_dates=True already infers the date format; the separate
            # infer_datetime_format argument is deprecated in pandas 2.x.
            old_df = pd.read_csv(file_path, index_col=0, parse_dates=True)
            if old_df.empty:
                raise ValueError("Existing file is empty.")
            last_date = old_df.index.max()
            # Determine overlap start date as (last_date - overlap_days)
            overlap_start = last_date - pd.Timedelta(days=overlap_days)
            # Download new data starting from overlap_start date.
            new_df = yf.download(symbol,
                                 start=overlap_start.strftime("%Y-%m-%d"),
                                 progress=False,
                                 session=global_session,
                                 auto_adjust=True)
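            if new_df is None or new_df.empty:
                return f"No new data for {symbol}"
            # Some yfinance versions return (field, ticker) MultiIndex columns even
            # for a single symbol. Flatten them so the CSV can be read back later
            # (assumes the field names are on the first level, the default grouping).
            if isinstance(new_df.columns, pd.MultiIndex):
                new_df.columns = new_df.columns.get_level_values(0)
            # Keep only old rows before the overlap window, then append the fresh data.
            updated_df = pd.concat([old_df[old_df.index < overlap_start], new_df])
            updated_df.sort_index(inplace=True)
            updated_df.to_csv(file_path)
            return f"Updated data for {symbol} in folder '{classification}'"
        else:
            # File does not exist: download full historical data.
            data = yf.download(symbol,
                               period="max",
                               progress=False,
                               session=global_session,
                               auto_adjust=True)
            if data is None or data.empty:
                logging.error(f"No data available for {symbol}")
                return f"No data available for {symbol}"
            # Flatten MultiIndex columns for the same reason as in the update branch.
            if isinstance(data.columns, pd.MultiIndex):
                data.columns = data.columns.get_level_values(0)
            data.to_csv(file_path)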
            return f"Downloaded and saved data for {symbol} in folder '{classification}'"
    except Exception as e:
        logging.error(f"Error processing data for {symbol}: {e}")
        return f"Error processing data for {symbol}"

In [None]:
# ---------------------------
# 4. Git commit and push changes
# ---------------------------
def commit_and_push_changes(repo_path, commit_message="Update data"):
    """
    Add changes in the 'data' folder, commit them with a provided message,
    and push the commit to the remote repository.
    """
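    try:
        subprocess.run(["git", "add", "data"], cwd=repo_path, check=True)
        # `git diff --cached --quiet` exits with 0 when nothing is staged.
        # Skip the commit in that case, because `git commit` would fail.
        staged = subprocess.run(["git", "diff", "--cached", "--quiet"], cwd=repo_path)
        if staged.returncode == 0:
            print("No changes to commit.")
            return
        full_message = f"{commit_message} - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
        subprocess.run(["git", "commit", "-m", full_message], cwd=repo_path, check=True)
        subprocess.run(["git", "push"], cwd=repo_path, check=True)
        print("Changes have been committed and pushed to the repository.")
    except subprocess.CalledProcessError as e:
        logging.error(f"Git operation failed: {e}")
        print("Failed to commit and push changes. See errors.log for details.")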

In [None]:
# ---------------------------
# Main routine
# ---------------------------
def main():
    # Create main data directory and subfolders for classifications.
    base_dir = "data"
    os.makedirs(base_dir, exist_ok=True)
    folders = {
        "ETF": os.path.join(base_dir, "ETF"),
        "Stock": os.path.join(base_dir, "Stock"),
        "Unknown": os.path.join(base_dir, "Unknown")
    }
    for folder in folders.values():
        os.makedirs(folder, exist_ok=True)
    
    # Fetch and persist the latest NASDAQ traded symbols.
    symbols = get_nasdaq_symbols(base_dir)
    if not symbols:
        print("No symbols found. Please check errors.log for details.")
        return
    total_symbols = len(symbols)
    print(f"Found {total_symbols} traded symbols on NASDAQ. Reference list saved to {os.path.join(base_dir, 'nasdaq_symbols_reference.csv')}")
    
    # Process each symbol in parallel with progress indicator.
    max_workers = 25  # Adjust as needed.
    processed_count = 0
    start_time = time.time()
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_symbol = {executor.submit(process_symbol, sym, folders, 5): sym for sym in symbols}
        
        for future in as_completed(future_to_symbol):
            symbol = future_to_symbol[future]
            processed_count += 1
            try:
                result = future.result()
                results[symbol] = result
                print(result)
            except Exception as exc:
                logging.error(f"{symbol} generated an exception: {exc}")
                print(f"{symbol} generated an exception. See errors.log for details.")
            
            # Every 25 symbols, log progress.
            if processed_count % 25 == 0:
                elapsed = time.time() - start_time
                avg_time = elapsed / processed_count
                remaining = total_symbols - processed_count
                est_seconds = avg_time * remaining
                est_hours = est_seconds / 3600
                progress_msg = (f"Processed {processed_count} out of {total_symbols} symbols, "
                                f"{remaining} remaining, estimated time left: {est_hours:.2f} hours or {est_hours*60:.2f} minutes")
                logging.info(progress_msg)
                print(progress_msg)
    
    # After processing all symbols, commit and push changes.
    commit_and_push_changes(repo_path=".", commit_message="Update NASDAQ data")

In [None]:
main()