## Ingestion Layer: Financial Data Acquisition

**Objective**: Retrieve raw financial statements (Income, Balance Sheet, Cash Flow) from the Financial Modeling Prep (FMP) API.

**Workflow**:
1.  **Configuration**: Retrieve API keys and session tokens from Databricks Secrets.
2.  **Extraction**: Iteratively query the FMP API for a defined list of tickers.
3.  **Validation**: Execute HTTP requests with error handling to capture and log failures.
4.  **Storage**: Persist raw JSON responses to the Landing Zone (S3) for downstream processing.
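
The extraction step can be sketched as a small URL builder. This is an illustrative alternative, not the notebook's own code: `build_statement_urls` and `STATEMENT_ENDPOINTS` are hypothetical names, and `demo-key` is a placeholder. Keeping each endpoint label next to its URL avoids parsing the label back out of the URL string later.

```python
from urllib.parse import urlencode

# Endpoint names as they appear in the FMP URLs used in this notebook.
STATEMENT_ENDPOINTS = ["income-statement", "balance-sheet-statement", "cash-flow-statement"]

def build_statement_urls(base_url, ticker, api_key, limit=5, period="quarter"):
    """Return {endpoint_label: url} for the three statement endpoints (hypothetical helper)."""
    # urlencode escapes the values and keeps the parameters in insertion order.
    query = urlencode({"symbol": ticker, "limit": limit, "period": period, "apikey": api_key})
    return {name: f"{base_url}/stable/{name}?{query}" for name in STATEMENT_ENDPOINTS}

urls = build_statement_urls("https://financialmodelingprep.com", "AAPL", "demo-key")
for label, url in urls.items():
    # Mask the key before printing so it never reaches the notebook output.
    print(label, url.replace("demo-key", "***"))
```

Iterating over `urls.items()` yields the label and the URL together, so the `statement=<label>` path segment can be filled in directly.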

In [0]:
import requests
import json
from datetime import datetime
dbutils.widgets.text("run_date", datetime.now().strftime("%Y-%m-%d"),"Run date")
run_date = dbutils.widgets.get("run_date")

APIKey = dbutils.secrets.get(scope = "ticker", key = "financialmodelprep_token")
BaseUrl = "https://financialmodelingprep.com"
TICKERS = ["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA", "META", "JPM", "V", "JNJ","PG"]
# TICKERS = ["AAPL"]


In [0]:
def process_ticker(ticker,run_date):
    """
    Fetches financial statements for a specific ticker and prepares the data for S3 storage.
    
    Args:
        ticker (str): The stock symbol to query.
        run_date (str): The date of the execution run, used for partitioning.
        
    Returns:
        list: A list of dictionaries, each containing the file name, JSON data, and status code.
              On failure, returns an error log record routed to the 'audit_failures' path.
    """
    Income_statement_Query = f"{BaseUrl}/stable/income-statement?symbol={ticker}&limit=5&period=quarter&apikey={APIKey}"
    Balance_sheet_statement_Query = f"{BaseUrl}/stable/balance-sheet-statement?symbol={ticker}&limit=5&period=quarter&apikey={APIKey}"
    cashflow_statement_Query = f"{BaseUrl}/stable/cash-flow-statement?symbol={ticker}&limit=5&period=quarter&apikey={APIKey}"

    urls = [Income_statement_Query,Balance_sheet_statement_Query,cashflow_statement_Query]
  
    filename = datetime.now().strftime("%Y%m%d_%H%M%S")
    files = []
    try:
        for Query in urls:
            response = requests.get(Query, timeout=10)
            response.raise_for_status() # Raises an HTTPError for bad responses (4xx or 5xx) 
            data = response.json()
            label = Query.split('/')[4].split('?')[0]

            # Spark partitioning scheme: source=fmp/ticker=<SYMBOL>/date=<DATE>/statement=<LABEL>/
            file_name = f"landing/source=fmp/ticker={ticker}/date={run_date}/statement={label}/{filename}.json"
            
            json_data = json.dumps(data, indent=2)
            files.append({"file_name": file_name, "json_data": json_data,"status" : response.status_code})
        return files
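    except Exception as e:
        # Redact the API key: requests error messages include the full request URL.
        error_message = str(e).replace(APIKey, "***")
        print(f"Error processing ticker {ticker}: {error_message}")
        # Resilience Pattern: Route failures to a quarantine location for later analysis.
        file_name = f"landing/source=fmp/audit_failures/ticker={ticker}/date={run_date}/{filename}.json"
        # Use the response attached to the exception, if any. Checking 'response' in locals()
        # could pick up an earlier successful request when a later one times out.
        failed_response = getattr(e, "response", None)
        status_code = failed_response.status_code if failed_response is not None else 500
        error_log = {
            "error": error_message,
            "ticker": ticker,
            "status_code": status_code
        }
        return [{"file_name": file_name, "json_data": json.dumps(error_log,indent=2), "status" : status_code}]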

        


In [0]:
flat_list = []
for ticker in TICKERS:
    flat_list.extend(process_ticker(ticker,run_date))

print(f"Data collection complete. Total files generated: {len(flat_list)}. Keys: {sorted(record['file_name'] for record in flat_list)}")
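
To see at a glance which tickers succeeded, the records can be grouped by the `ticker=` path segment. A minimal sketch, assuming the key layout produced by `process_ticker`; `summarize` and the sample records (including their dates and file names) are hypothetical:

```python
from collections import Counter

def summarize(records):
    """Count successful and failed records per ticker, based on the key layout."""
    ok, failed = Counter(), Counter()
    for rec in records:
        key = rec["file_name"]
        # Pull the symbol out of the 'ticker=<SYMBOL>' path segment.
        ticker = next(p.split("=", 1)[1] for p in key.split("/") if p.startswith("ticker="))
        # Failure records are the ones routed under 'audit_failures'.
        target = failed if "/audit_failures/" in key else ok
        target[ticker] += 1
    return dict(ok), dict(failed)

# Illustrative records only; real ones come from flat_list.
sample = [
    {"file_name": "landing/source=fmp/ticker=AAPL/date=2024-01-01/statement=income-statement/x.json"},
    {"file_name": "landing/source=fmp/audit_failures/ticker=MSFT/date=2024-01-01/x.json"},
]
ok_counts, failed_counts = summarize(sample)
print("ok:", ok_counts, "failed:", failed_counts)
```

In the notebook this would be called as `summarize(flat_list)`.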


### S3 Persistence Layer
Establish a boto3 connection to S3 using temporary session credentials forwarded from the initialization task. This method ensures secure access without hardcoding long-term credentials in the notebook.
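
When the notebook runs outside a job, `dbutils.jobs.taskValues.get` returns its `debugValue`, so the client would be built from placeholder strings and fail only at upload time. One possible guard, sketched under the assumption that the placeholders are the `debug-*` strings used in the next cell; `s3_client_kwargs` and `DEBUG_PLACEHOLDERS` are hypothetical names:

```python
# Placeholder strings assumed to match the debugValue arguments in the notebook.
DEBUG_PLACEHOLDERS = {"debug-key", "debug-secret", "debug-token"}

def s3_client_kwargs(temp_ak, temp_sk, temp_token):
    """Build boto3.client keyword arguments, rejecting empty or placeholder values."""
    values = {
        "aws_access_key_id": temp_ak,
        "aws_secret_access_key": temp_sk,
        "aws_session_token": temp_token,
    }
    bad = [name for name, value in values.items() if not value or value in DEBUG_PLACEHOLDERS]
    if bad:
        raise ValueError(f"Missing or placeholder credentials: {', '.join(bad)}")
    return values

# Demonstration with the placeholders: the guard raises before any AWS call is made.
try:
    s3_client_kwargs("debug-key", "debug-secret", "debug-token")
except ValueError as err:
    print(err)
```

With real task values the result could be passed on as `boto3.client("s3", **s3_client_kwargs(temp_ak, temp_sk, temp_token))`.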

In [0]:
import json
import boto3

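# Long-lived credentials from Databricks Secrets. They are not used by the client below,
# which relies on the temporary credentials forwarded by the Init_Auth task.
ACCESS_KEY = dbutils.secrets.get(scope = "ticker", key = "access_key")
SECRET_KEY = dbutils.secrets.get(scope = "ticker", key = "secret_key")
SESSION_TOKEN = dbutils.secrets.get(scope = "ticker", key = "session_key")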

# "taskKey" must match the NAME of the task in your Job workflow (e.g., "Init_Auth")
temp_ak = dbutils.jobs.taskValues.get(taskKey="Init_Auth", key="temp_ak", debugValue="debug-key")
temp_sk = dbutils.jobs.taskValues.get(taskKey="Init_Auth", key="temp_sk", debugValue="debug-secret")
temp_token = dbutils.jobs.taskValues.get(taskKey="Init_Auth", key="temp_token", debugValue="debug-token")

# Initialize S3 Client
s3 = boto3.client(
    's3',
    aws_access_key_id=temp_ak,
    aws_secret_access_key=temp_sk,
    aws_session_token=temp_token
)



# S3 access point ARN, passed to boto3 in place of a bucket name
bucket_name = 'arn:aws:s3:us-east-1:180250667274:accesspoint/accesspoint-to-data'

for record in flat_list:
    # The partitioned path built in process_ticker becomes the S3 object key
    s3_object_key = record['file_name']

    # Upload the JSON string
    s3.put_object(
        Bucket=bucket_name,
        Key=s3_object_key,
        Body=record['json_data']
    )

    print(f"Successfully uploaded JSON data to s3://{bucket_name}/{s3_object_key}")
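
The upload loop above stops at the first failed `put_object` call. A minimal sketch of one way to collect failures and keep going, assuming the record shape returned by `process_ticker`; `upload_all`, `fake_put_object`, `example-bucket`, and the sample keys are hypothetical, and the stand-in function exists only so the sketch runs without AWS access:

```python
def upload_all(records, put_object, bucket):
    """Upload every record, collecting failures instead of stopping at the first one.

    `put_object` is any callable accepting the Bucket, Key and Body keyword
    arguments, such as the `put_object` method of a boto3 S3 client.
    """
    uploaded, failed = [], []
    for rec in records:
        try:
            put_object(Bucket=bucket, Key=rec["file_name"], Body=rec["json_data"])
            uploaded.append(rec["file_name"])
        except Exception as exc:  # broad on purpose in this sketch
            failed.append((rec["file_name"], str(exc)))
    return uploaded, failed

def fake_put_object(Bucket, Key, Body):
    """Stand-in for s3.put_object that fails for keys containing 'bad'."""
    if "bad" in Key:
        raise RuntimeError("simulated upload failure")

records = [
    {"file_name": "landing/ok.json", "json_data": "{}"},
    {"file_name": "landing/bad.json", "json_data": "{}"},
]
uploaded, failed = upload_all(records, fake_put_object, "example-bucket")
print("uploaded:", uploaded)
print("failed:", failed)
```

In the notebook the call would be `upload_all(flat_list, s3.put_object, bucket_name)`, and a non-empty `failed` list could be used to fail the task after all uploads have been attempted.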

### Verification & Quality Assurance

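1.  **Resilience**: The pipeline traps exceptions during API calls and routes them to a dedicated `audit_failures` path, so a failure for one ticker does not abort the rest of the batch. Statements already fetched for the failing ticker in the same call are discarded and replaced by a single error record.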
2.  **Observability**: Error logs preserve the exact error message and status code for debugging.
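3.  **Disaster Recovery**: The `run_date` widget lets a run be repeated under any historical date partition, which supports backfilling. It controls only the storage path: the API queries take no date parameter and always use `limit=5` with `period=quarter`.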
4.  **Data Integrity**: Data is stored as valid JSON structures, ready for schema-on-read in the Bronze layer.
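
The integrity claim can be checked before upload. A minimal sketch, assuming the key layout and record shape used in this notebook; `parse_partitions`, `check_record`, and the sample records are hypothetical:

```python
import json

def parse_partitions(key):
    """Extract the 'name=value' path segments of a landing key into a dict."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)

def check_record(record):
    """Return a list of problems found in one {'file_name', 'json_data'} record."""
    problems = []
    parts = parse_partitions(record["file_name"])
    # Both data and audit_failures keys carry these three partitions.
    for required in ("source", "ticker", "date"):
        if required not in parts:
            problems.append(f"missing partition: {required}")
    try:
        json.loads(record["json_data"])
    except json.JSONDecodeError as exc:
        problems.append(f"invalid JSON: {exc.msg}")
    return problems

# Illustrative records only; real ones come from flat_list.
good = {
    "file_name": "landing/source=fmp/ticker=AAPL/date=2024-01-01/statement=income-statement/x.json",
    "json_data": "[]",
}
bad = {"file_name": "landing/source=fmp/ticker=AAPL/x.json", "json_data": "{not json"}
print(check_record(good))
print(check_record(bad))
```

An empty list means the record passed both checks; running `check_record` over `flat_list` before the upload loop would surface malformed keys or payloads early.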