# Ingest CSV Files from GitHub

**Purpose:** Load all 36 monthly CSV files from GitHub into individual bronze tables

**Source:** https://github.com/sulaiman013/sales-analytics-data-dbt-jobs-fabric

**Target:** 36 individual Delta tables in the Lakehouse default schema, named with a `source_` prefix (`source_bronze_YYYY_MM`)

In [5]:
import time
from pyspark.sql.functions import lit

# Configuration
GITHUB_BASE_URL = "https://raw.githubusercontent.com/sulaiman013/sales-analytics-data-dbt-jobs-fabric/master/data"
YEARS = [2023, 2024, 2025]
MONTHS = range(1, 13)
SOURCE_SCHEMA = "source"

print("="*60)
print("INGESTION: Load 36 CSV files from GitHub")
print("="*60)
print(f"Source: {GITHUB_BASE_URL}")
print(f"Target Schema: {SOURCE_SCHEMA}")
print(f"Total files to load: {len(YEARS) * len(MONTHS)}")

INGESTION: Load 36 CSV files from GitHub
Source: https://raw.githubusercontent.com/sulaiman013/sales-analytics-data-dbt-jobs-fabric/master/data
Target Schema: source
Total files to load: 36


In [6]:
# This notebook does not use Lakehouse schemas. Tables are written to the
# default schema in the Lakehouse Tables folder and grouped by a name prefix.
# Format: source.bronze_YYYY_MM is written as source_bronze_YYYY_MM

# Configuration
SOURCE_PREFIX = "source"  # Tables will be: source_bronze_YYYY_MM
print(f"Using table prefix: {SOURCE_PREFIX}")
print("Tables will be written to Lakehouse Tables folder")


Using table prefix: source
Tables will be written to Lakehouse Tables folder
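
The prefix convention above can be captured in a small helper. This is a minimal sketch and not part of the original notebook: `bronze_table_name` is a hypothetical name, and the month range check is an added assumption rather than something the notebook enforces.

```python
def bronze_table_name(year: int, month: int, prefix: str = "source") -> str:
    """Build a prefixed bronze table name such as source_bronze_2023_01.

    Hypothetical helper: the prefix stands in for a schema name.
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Zero-pad the month so names sort chronologically as plain strings.
    return f"{prefix}_bronze_{year}_{month:02d}"


if __name__ == "__main__":
    print(bronze_table_name(2023, 1))
    print(bronze_table_name(2025, 12))
```

Keeping the format in one place means the ingestion loop and the later table listing cannot drift apart on naming.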


In [8]:
# Ingest each CSV file from GitHub using requests and pandas (Spark in Fabric can't read HTTP URLs directly)
import time
import pandas as pd
import requests
from io import StringIO

GITHUB_BASE_URL = "https://raw.githubusercontent.com/sulaiman013/sales-analytics-data-dbt-jobs-fabric/master/data"
YEARS = [2023, 2024, 2025]
MONTHS = range(1, 13)

start_time = time.time()
files_loaded = 0
total_rows = 0

for year in YEARS:
    for month in MONTHS:
        table_name = f"source_bronze_{year}_{month:02d}"
        url = f"{GITHUB_BASE_URL}/sales_{year}_{month:02d}.csv"
        
        try:
            # Download CSV using requests; the timeout keeps a stalled connection from hanging the loop
            response = requests.get(url, timeout=60)
            response.raise_for_status()
            
            # Read into pandas DataFrame
            pdf = pd.read_csv(StringIO(response.text))
            
            # Convert to Spark DataFrame
            df = spark.createDataFrame(pdf)
            row_count = len(pdf)  # same value as df.count(), without running an extra Spark job
            
            # Write to Delta table
            df.write.format("delta").mode("overwrite").saveAsTable(table_name)
            
            files_loaded += 1
            total_rows += row_count
            
            if files_loaded % 6 == 0:
                elapsed = time.time() - start_time
                print(f"  Loaded {files_loaded}/36 files... ({elapsed:.1f}s)")
                
        except Exception as e:
            print(f"  ERROR loading {table_name}: {type(e).__name__}: {e}")

total_time = time.time() - start_time
print(f"\nLoaded {files_loaded}/36 files, {total_rows:,} rows in {total_time:.1f}s")


  Loaded 6/36 files... (98.9s)
  Loaded 12/36 files... (167.2s)
  Loaded 18/36 files... (233.5s)
  Loaded 24/36 files... (297.5s)
  Loaded 30/36 files... (356.1s)
  Loaded 36/36 files... (419.6s)

Loaded 36/36 files, 3,600,000 rows in 419.6s
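
The loop above tries each download once, so a transient network error would skip that month entirely. One possible hardening is a retry wrapper with exponential backoff. The sketch below is an assumption, not the notebook's method: `fetch_with_retry`, its parameters, and the injected `fetch` and `sleep` callables are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    url: str,
    fetch: Callable[[str], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(url), retrying with exponential backoff on any exception.

    Hypothetical helper: `fetch` is supplied by the caller, so this
    function itself has no dependency on a particular HTTP library.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In the ingestion loop, `fetch` could be a small function that calls `requests.get(u, timeout=60)`, then `raise_for_status()`, and returns the response. Putting `raise_for_status()` inside `fetch` is what lets HTTP error statuses trigger a retry.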


In [9]:
# Summary
print("\n" + "="*60)
print("INGESTION COMPLETE")
print("="*60)
print(f"Schema:         {SOURCE_SCHEMA}")
print(f"Files Loaded:   {files_loaded}")
print(f"Total Rows:     {total_rows:,}")
print(f"Total Time:     {total_time:.2f} seconds")
print(f"Avg per file:   {total_time / max(files_loaded, 1):.2f} seconds")  # max() avoids division by zero if nothing loaded
print("="*60)


INGESTION COMPLETE
Schema:         source
Files Loaded:   36
Total Rows:     3,600,000
Total Time:     419.60 seconds
Avg per file:   11.66 seconds


In [11]:
# List all source tables (they're in the default database with prefix "source_bronze_")
print("\nTables with 'source_bronze_' prefix:")
tables = spark.catalog.listTables()
source_tables = [t for t in tables if t.name.startswith("source_bronze_")]
for t in sorted(source_tables, key=lambda x: x.name):
    print(f"  - {t.name}")
print(f"\nTotal: {len(source_tables)} tables")



Tables with 'source_bronze_' prefix:
  - source_bronze_2023_01
  - source_bronze_2023_02
  - source_bronze_2023_03
  - source_bronze_2023_04
  - source_bronze_2023_05
  - source_bronze_2023_06
  - source_bronze_2023_07
  - source_bronze_2023_08
  - source_bronze_2023_09
  - source_bronze_2023_10
  - source_bronze_2023_11
  - source_bronze_2023_12
  - source_bronze_2024_01
  - source_bronze_2024_02
  - source_bronze_2024_03
  - source_bronze_2024_04
  - source_bronze_2024_05
  - source_bronze_2024_06
  - source_bronze_2024_07
  - source_bronze_2024_08
  - source_bronze_2024_09
  - source_bronze_2024_10
  - source_bronze_2024_11
  - source_bronze_2024_12
  - source_bronze_2025_01
  - source_bronze_2025_02
  - source_bronze_2025_03
  - source_bronze_2025_04
  - source_bronze_2025_05
  - source_bronze_2025_06
  - source_bronze_2025_07
  - source_bronze_2025_08
  - source_bronze_2025_09
  - source_bronze_2025_10
  - source_bronze_2025_11
  - source_bronze_2025_12

Total: 36 tables
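
Counting 36 tables does not by itself confirm that every month is present, since a stray table with the same prefix could mask a missing one. A minimal sketch of a stricter check follows. `expected_tables` and `missing_tables` are hypothetical helpers, and the name format is assumed to match the `source_bronze_YYYY_MM` convention used above.

```python
from typing import Iterable, List


def expected_tables(
    years: Iterable[int],
    months: Iterable[int] = range(1, 13),
    prefix: str = "source",
) -> List[str]:
    """List every table name the ingestion is expected to produce."""
    months = list(months)  # allow reuse across years even if given an iterator
    return [f"{prefix}_bronze_{y}_{m:02d}" for y in years for m in months]


def missing_tables(
    found: Iterable[str],
    years: Iterable[int],
    months: Iterable[int] = range(1, 13),
    prefix: str = "source",
) -> List[str]:
    """Return expected table names that are absent from `found`, in order."""
    found_set = set(found)
    return [n for n in expected_tables(years, months, prefix) if n not in found_set]
```

In the notebook this could be called as `missing_tables([t.name for t in spark.catalog.listTables()], YEARS)`, where an empty list means all monthly tables exist.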


In [13]:
# In Fabric Lakehouse, schemas are not supported via SQL
# We'll use table name prefixes instead:
#   - source_bronze_YYYY_MM (36 tables) - DONE
#   - pyspark_bronze_sales (PySpark merge output)
#   - dbt_bronze_sales (dbt merge output)

print("Naming convention for comparison test:")
print("  Source tables:  source_bronze_YYYY_MM (36 tables)")
print("  PySpark output: pyspark_bronze_sales")
print("  dbt output:     dbt_bronze_sales")
print("\nReady for performance comparison test!")


Naming convention for comparison test:
  Source tables:  source_bronze_YYYY_MM (36 tables)
  PySpark output: pyspark_bronze_sales
  dbt output:     dbt_bronze_sales

Ready for performance comparison test!
