# GCP Data Engineering Services

This notebook surveys Google Cloud Platform's core data engineering services (Cloud Storage, Dataflow, Dataproc, BigQuery, and Pub/Sub) with Python examples for each. Together, these managed services cover ingestion, processing, storage, and analytics for scalable, reliable, and cost-effective data pipelines.

## Table of Contents
1. [Google Cloud Storage (GCS)](#google-cloud-storage)
2. [Dataflow](#dataflow)
3. [Dataproc](#dataproc)
4. [BigQuery](#bigquery)
5. [Pub/Sub](#pubsub)
6. [Best Practices & Architecture Patterns](#best-practices)
7. [Takeaways](#takeaways)

---
## GCP Data Engineering Ecosystem Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         GCP Data Engineering Stack                          │
├─────────────────────────────────────────────────────────────────────────────┤
│  INGESTION          PROCESSING           STORAGE           ANALYTICS        │
│  ─────────          ──────────           ───────           ─────────        │
│  • Pub/Sub          • Dataflow           • GCS             • BigQuery       │
│  • Cloud IoT        • Dataproc           • BigQuery        • Looker         │
│  • Transfer Svc     • Cloud Functions    • Bigtable        • Looker Studio  │
│  • Datastream       • Cloud Run          • Firestore       • Vertex AI      │
└─────────────────────────────────────────────────────────────────────────────┘
```

---
<a id='google-cloud-storage'></a>
## 1. Google Cloud Storage (GCS) - Data Lake Foundation

Google Cloud Storage is an object storage service that serves as the foundation for building data lakes on GCP. It provides unified storage for structured, semi-structured, and unstructured data.

### Key Features

| Feature | Description |
|---------|-------------|
| **Storage Classes** | Standard, Nearline, Coldline, Archive |
| **Durability** | 99.999999999% (11 9's) annual durability |
| **Availability** | Up to 99.99% for multi-region |
| **Consistency** | Strong global consistency |
| **Object Size** | Up to 5 TiB per object |

### Storage Classes Comparison

```
┌───────────────┬─────────────────┬─────────────────┬─────────────────┐
│   Standard    │    Nearline     │    Coldline     │    Archive      │
├───────────────┼─────────────────┼─────────────────┼─────────────────┤
│ Hot data      │ Monthly access  │ Quarterly       │ Yearly access   │
│ Frequent      │ 30-day min      │ 90-day min      │ 365-day min     │
│ $0.020/GB/mo  │ $0.010/GB/mo    │ $0.004/GB/mo    │ $0.0012/GB/mo   │
└───────────────┴─────────────────┴─────────────────┴─────────────────┘
```
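
To make the price gaps concrete, the sketch below estimates at-rest cost per class from the per-GB figures in the comparison above. It is only an illustration: the helper name `monthly_storage_cost` is hypothetical, the prices are assumed to be the ones shown (real prices vary by region and change over time), and retrieval, operation, and early-deletion charges are ignored.

```python
# Per-GB monthly prices copied from the comparison above.
# Assumption: illustrative only; check current regional pricing before relying on them.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.020,
    "NEARLINE": 0.010,
    "COLDLINE": 0.004,
    "ARCHIVE": 0.0012,
}


def monthly_storage_cost(size_gb: float, storage_class: str) -> float:
    """Return the at-rest storage cost for one month, in USD."""
    return size_gb * PRICE_PER_GB_MONTH[storage_class]


# Compare the classes for a hypothetical 10 TiB data set.
for storage_class in PRICE_PER_GB_MONTH:
    cost = monthly_storage_cost(10 * 1024, storage_class)
    print(f"{storage_class:<9} ${cost:,.2f}/month")
```

The colder classes are cheaper at rest, but they carry minimum storage durations and retrieval fees, so they are usually a poor fit for data that is read often.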

In [None]:
# Google Cloud Storage - Python SDK Examples
from google.cloud import storage
from datetime import timedelta

# Initialize client
client = storage.Client(project='your-project-id')

# Create a bucket with lifecycle management
def create_data_lake_bucket(bucket_name: str, location: str = 'US') -> storage.Bucket:
    """Create a GCS bucket configured for data lake usage."""
    bucket = client.bucket(bucket_name)
    bucket.storage_class = 'STANDARD'
    # Location is passed to create_bucket() below; assigning bucket.location is deprecated
    
    # Enable versioning for data protection
    bucket.versioning_enabled = True
    
    # Set lifecycle rules for cost optimization
    bucket.lifecycle_rules = [
        {
            'action': {'type': 'SetStorageClass', 'storageClass': 'NEARLINE'},
            'condition': {'age': 30, 'matchesPrefix': ['raw/']}  # Move raw data after 30 days
        },
        {
            'action': {'type': 'SetStorageClass', 'storageClass': 'COLDLINE'},
            'condition': {'age': 90, 'matchesPrefix': ['raw/']}
        },
        {
            'action': {'type': 'Delete'},
            'condition': {'age': 365, 'matchesPrefix': ['temp/']}  # Delete temp data after 1 year
        }
    ]
    
    new_bucket = client.create_bucket(bucket, location=location)
    print(f"Created bucket {new_bucket.name} in {new_bucket.location}")
    return new_bucket

In [None]:
# Data Lake Directory Structure Best Practice
def create_data_lake_structure(bucket_name: str):
    """
    Create a medallion architecture directory structure.
    
    Structure:
    ├── raw/              # Bronze layer - raw ingested data
    │   ├── source1/
    │   └── source2/
    ├── processed/        # Silver layer - cleaned, validated
    │   ├── domain1/
    │   └── domain2/
    ├── curated/          # Gold layer - business-ready
    │   ├── analytics/
    │   └── ml_features/
    └── temp/             # Temporary processing data
    """
    bucket = client.bucket(bucket_name)
    
    directories = [
        'raw/.keep',
        'processed/.keep',
        'curated/.keep',
        'temp/.keep'
    ]
    
    for directory in directories:
        blob = bucket.blob(directory)
        blob.upload_from_string('')
        print(f"Created: gs://{bucket_name}/{directory}")

In [None]:
# Upload Many Files Concurrently with a Thread Pool
import concurrent.futures
from pathlib import Path

def upload_files_parallel(
    bucket_name: str,
    local_directory: str,
    gcs_prefix: str,
    max_workers: int = 10
):
    """Upload multiple files to GCS in parallel."""
    bucket = client.bucket(bucket_name)
    local_path = Path(local_directory)
    files = list(local_path.rglob('*'))
    files = [f for f in files if f.is_file()]
    
    def upload_file(file_path: Path):
        relative_path = file_path.relative_to(local_path)
        blob_name = f"{gcs_prefix}/{relative_path.as_posix()}"  # Forward slashes on every OS
        blob = bucket.blob(blob_name)
        blob.upload_from_filename(str(file_path))
        return blob_name
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(upload_file, f): f for f in files}
        for future in concurrent.futures.as_completed(futures):
            try:
                blob_name = future.result()
                print(f"Uploaded: {blob_name}")
            except Exception as e:
                print(f"Error uploading {futures[future]}: {e}")

---
<a id='dataflow'></a>
## 2. Dataflow - Unified Stream & Batch Processing

Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines. It provides a **unified programming model** for both batch and streaming data processing.

### Key Concepts

| Concept | Description |
|---------|-------------|
| **Pipeline** | A complete data processing task (DAG) |
| **PCollection** | Distributed dataset (immutable) |
| **Transform** | Operations on PCollections (Map, Filter, GroupBy) |
| **Runner** | Execution engine (Dataflow, Direct, Spark, Flink) |
| **Windowing** | Grouping elements by time for streaming |
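
The windowing concept can be illustrated without Beam at all. The following pure-Python sketch assigns events to fixed (tumbling) windows by event time, which is conceptually what a fixed-window strategy does; the helper `fixed_window` and the sample events are hypothetical and not part of the Beam API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def fixed_window(timestamp: float, size_seconds: int) -> Tuple[int, int]:
    """Return the [start, end) bounds of the fixed window containing `timestamp`."""
    start = int(timestamp // size_seconds) * size_seconds
    return start, start + size_seconds


# Hypothetical (event_time_seconds, value) pairs, deliberately out of order.
events = [(305, "e"), (3, "a"), (299, "c"), (61, "b"), (300, "d")]

# Group values by the 5-minute window their event time falls into.
windows: Dict[Tuple[int, int], List[str]] = defaultdict(list)
for event_time, value in events:
    windows[fixed_window(event_time, 300)].append(value)

for bounds in sorted(windows):
    print(bounds, sorted(windows[bounds]))
```

Because assignment depends only on each element's event time, arrival order does not change which window an element lands in. Deciding when a window's result is emitted is the job of watermarks and triggers.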

### Dataflow Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                        Dataflow Pipeline                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Source        Transform 1       Transform 2        Sink          │
│  ┌──────┐       ┌─────────┐       ┌─────────┐      ┌──────┐        │
│  │Pub/Sub│ ───► │ Parse   │ ───► │ Enrich  │ ───► │  BQ  │        │
│  │ GCS   │      │ Filter  │      │ Agg     │      │ GCS  │        │
│  └──────┘       └─────────┘       └─────────┘      └──────┘        │
│                                                                     │
│  ◄──────────────── Autoscaling Workers ─────────────────►          │
└─────────────────────────────────────────────────────────────────────┘
```

In [None]:
# Apache Beam / Dataflow - Batch Processing Example
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions
from apache_beam.io.gcp.bigquery import WriteToBigQuery, BigQueryDisposition

# Configure pipeline options for Dataflow
def get_dataflow_options(project: str, region: str, temp_location: str) -> PipelineOptions:
    """Configure Dataflow runner options."""
    options = PipelineOptions([
        f'--project={project}',
        f'--region={region}',
        f'--temp_location={temp_location}',
        '--runner=DataflowRunner',
        # Batch is the default mode; --streaming is a boolean flag and takes no value
        '--autoscaling_algorithm=THROUGHPUT_BASED',
        '--max_num_workers=10',
        '--disk_size_gb=50',
        '--worker_machine_type=n1-standard-4'
    ])
    return options

In [None]:
# Batch ETL Pipeline Example
import json
from datetime import datetime, timezone

class ParseJsonFn(beam.DoFn):
    """Parse JSON records and handle errors."""
    def process(self, element):
        try:
            record = json.loads(element)
            # Add processing timestamp (timezone-aware; datetime.utcnow() is deprecated)
            record['processed_at'] = datetime.now(timezone.utc).isoformat()
            yield beam.pvalue.TaggedOutput('valid', record)
        except json.JSONDecodeError as e:
            yield beam.pvalue.TaggedOutput('invalid', {'raw': element, 'error': str(e)})

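class EnrichDataFn(beam.DoFn):
    """Enrich records with additional computed fields."""
    def process(self, record):
        # Copy first: Beam expects DoFns not to mutate their input elements
        enriched = dict(record)
        amount = enriched.get('amount')
        # Skip records with a missing or null amount instead of failing on the comparison
        if amount is not None:
            enriched['amount_category'] = 'high' if amount > 1000 else 'low'
        yield enriched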

def run_batch_etl_pipeline(
    input_path: str,
    output_table: str,
    dead_letter_path: str,
    options: PipelineOptions
):
    """Run a batch ETL pipeline from GCS to BigQuery."""
    with beam.Pipeline(options=options) as pipeline:
        # Read from GCS
        raw_data = pipeline | 'ReadFromGCS' >> beam.io.ReadFromText(input_path)
        
        # Parse with error handling (fan-out pattern)
        parsed = raw_data | 'ParseJSON' >> beam.ParDo(ParseJsonFn()).with_outputs('valid', 'invalid')
        
        # Process valid records
        enriched = (
            parsed.valid
            | 'Enrich' >> beam.ParDo(EnrichDataFn())
            | 'FilterNulls' >> beam.Filter(lambda x: x.get('user_id') is not None)
        )
        
        # Write to BigQuery (CREATE_IF_NEEDED needs a schema= argument if the table does not exist yet)
        enriched | 'WriteToBigQuery' >> WriteToBigQuery(
            output_table,
            write_disposition=BigQueryDisposition.WRITE_APPEND,
            create_disposition=BigQueryDisposition.CREATE_IF_NEEDED
        )
        
        # Write errors to dead letter queue
        (parsed.invalid
         | 'FormatErrors' >> beam.Map(json.dumps)
         | 'WriteDeadLetter' >> beam.io.WriteToText(dead_letter_path))

In [None]:
# Streaming Pipeline Example with Windowing
from apache_beam import window
from apache_beam.transforms.trigger import AfterWatermark, AfterProcessingTime, AccumulationMode

def run_streaming_pipeline(
    subscription: str,
    output_table: str,
    options: PipelineOptions
):
    """Real-time streaming pipeline with windowing (options must enable --streaming)."""
    with beam.Pipeline(options=options) as pipeline:
        events = (
            pipeline
            # Read from Pub/Sub
            | 'ReadPubSub' >> beam.io.ReadFromPubSub(
                subscription=subscription,
                with_attributes=True,
                timestamp_attribute='event_time'  # Use event time
            )
            | 'DecodeMessages' >> beam.Map(lambda x: json.loads(x.data.decode('utf-8')))
        )
        
        # Apply windowing for aggregations
        windowed_events = (
            events
            | 'AddTimestamp' >> beam.Map(
                lambda x: beam.window.TimestampedValue(x, x['timestamp'])
            )
            # 5-minute tumbling windows
            | 'Window' >> beam.WindowInto(
                window.FixedWindows(5 * 60),  # 5 minutes
                trigger=AfterWatermark(
                    early=AfterProcessingTime(60),  # Early (speculative) firings about once a minute
                    late=AfterProcessingTime(300)   # Re-fire 5 min after late data arrives
                ),
                accumulation_mode=AccumulationMode.ACCUMULATING,
                allowed_lateness=3600  # Accept data up to 1 hour late (value in seconds)
            )
        )
        
        # Aggregate by key
        aggregated = (
            windowed_events
            | 'KeyByUser' >> beam.Map(lambda x: (x['user_id'], x['amount']))
            | 'SumPerUser' >> beam.CombinePerKey(sum)
            | 'FormatOutput' >> beam.Map(
                lambda x: {'user_id': x[0], 'total_amount': x[1]}
            )
        )
        
        # Write to BigQuery with streaming inserts
        aggregated | 'StreamToBQ' >> WriteToBigQuery(
            output_table,
            method='STREAMING_INSERTS'
        )

---
<a id='dataproc'></a>
## 3. Dataproc - Managed Spark & Hadoop

Google Cloud Dataproc is a fully managed service for running Apache Spark, Hadoop, Presto, and other open-source tools. It's ideal for existing Spark/Hadoop workloads and complex data transformations.

### Dataproc vs Dataflow

| Aspect | Dataproc | Dataflow |
|--------|----------|----------|
| **Use Case** | Existing Spark/Hadoop jobs | New pipelines, Beam-based |
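| **Scaling** | Manual sizing or autoscaling policies | Automatic worker autoscaling |
| **Pricing** | Per-cluster (VM hours plus Dataproc fee) | Per-job (worker vCPU, memory, and storage consumed) |
| **Startup Time** | ~90 seconds per cluster | A few minutes per job to provision workers |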
| **Flexibility** | Full Spark API access | Beam abstractions |
| **Best For** | ML, complex transformations | ETL pipelines, streaming |

In [None]:
# Dataproc Cluster Management
from google.cloud import dataproc_v1
from google.cloud.dataproc_v1.types import Cluster, ClusterConfig, InstanceGroupConfig

def create_dataproc_cluster(
    project_id: str,
    region: str,
    cluster_name: str,
    master_machine_type: str = 'n1-standard-4',
    worker_machine_type: str = 'n1-standard-4',
    num_workers: int = 2
) -> str:
    """Create a Dataproc cluster with autoscaling."""
    
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={'api_endpoint': f'{region}-dataproc.googleapis.com:443'}
    )
    
    cluster_config = {
        'project_id': project_id,
        'cluster_name': cluster_name,
        'config': {
            'master_config': {
                'num_instances': 1,
                'machine_type_uri': master_machine_type,
                'disk_config': {'boot_disk_size_gb': 500}
            },
            'worker_config': {
                'num_instances': num_workers,
                'machine_type_uri': worker_machine_type,
                'disk_config': {'boot_disk_size_gb': 500}
            },
            # Enable component gateway for web UIs
            'endpoint_config': {'enable_http_port_access': True},
            # Optional components
            'software_config': {
                'image_version': '2.1-debian11',
                'optional_components': ['JUPYTER', 'ZEPPELIN'],
                'properties': {
                    'spark:spark.jars.packages': 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.0'
                }
            },
            # Autoscaling policy
            'autoscaling_config': {
                'policy_uri': f'projects/{project_id}/regions/{region}/autoscalingPolicies/default-policy'
            }
        }
    }
    
    operation = cluster_client.create_cluster(
        request={'project_id': project_id, 'region': region, 'cluster': cluster_config}
    )
    
    result = operation.result()
    print(f"Cluster created: {result.cluster_name}")
    return result.cluster_name

In [None]:
# Submit PySpark Job to Dataproc
def submit_pyspark_job(
    project_id: str,
    region: str,
    cluster_name: str,
    main_python_file: str,
    args: list = None
) -> str:
    """Submit a PySpark job to Dataproc cluster."""
    
    job_client = dataproc_v1.JobControllerClient(
        client_options={'api_endpoint': f'{region}-dataproc.googleapis.com:443'}
    )
    
    job = {
        'placement': {'cluster_name': cluster_name},
        'pyspark_job': {
            'main_python_file_uri': main_python_file,
            'args': args or [],
            'properties': {
                'spark.executor.memory': '4g',
                'spark.executor.cores': '2',
                'spark.dynamicAllocation.enabled': 'true'
            }
        }
    }
    
    operation = job_client.submit_job_as_operation(
        request={'project_id': project_id, 'region': region, 'job': job}
    )
    
    result = operation.result()
    print(f"Job finished: {result.status.state.name}")
    return result.reference.job_id

In [None]:
# Example PySpark Job: ETL from GCS to BigQuery
# This would be saved as a .py file and submitted to Dataproc

PYSPARK_ETL_SCRIPT = '''
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, current_timestamp

def main():
    spark = SparkSession.builder \\
        .appName("GCS-to-BigQuery-ETL") \\
        .getOrCreate()
    
    # Read from GCS (Parquet)
    df = spark.read.parquet("gs://your-bucket/raw/events/")
    
    # Transformations
    transformed = df \\
        .filter(col("event_type").isNotNull()) \\
        .withColumn(
            "event_category",
            when(col("event_type").isin(["click", "view"]), "engagement")
            .when(col("event_type") == "purchase", "conversion")
            .otherwise("other")
        ) \\
        .withColumn("processed_at", current_timestamp())
    
    # Write to BigQuery using Spark BigQuery connector
    transformed.write \\
        .format("bigquery") \\
        .option("table", "project.dataset.events_processed") \\
        .option("temporaryGcsBucket", "your-temp-bucket") \\
        .mode("append") \\
        .save()
    
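    # Note: count() runs the transformations a second time; cache() the DataFrame first for large inputs
    print(f"Processed {transformed.count()} records")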
    spark.stop()

if __name__ == "__main__":
    main()
'''

print("PySpark ETL script template created")

---
<a id='bigquery'></a>
## 4. BigQuery - Serverless Data Warehouse

BigQuery is Google's fully managed, serverless data warehouse. It runs SQL queries over large datasets on Google-managed compute, so there are no clusters to provision or tune.

### Key Features

| Feature | Description |
|---------|-------------|
| **Separation of Storage & Compute** | Pay for what you use independently |
| **Columnar Storage** | Optimized for analytical queries |
| **Partitioning** | Time-based and integer-range partitions |
| **Clustering** | Sort data by specified columns |
| **Streaming Inserts** | Real-time data ingestion |
| **BigQuery ML** | ML models using SQL |
| **BI Engine** | In-memory analysis for dashboards |
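
Partitioning matters mainly because a filter on the partition column lets the engine skip whole partitions. The toy model below makes that effect visible: it is not a BigQuery API call, and the table layout (365 daily partitions of 1 GiB each) and the helper `bytes_scanned` are assumptions made purely for illustration.

```python
from datetime import date, timedelta
from typing import Dict

GIB = 1024 ** 3


def bytes_scanned(partitions: Dict[date, int], start: date, end: date) -> int:
    """Sum the sizes of the daily partitions that fall inside [start, end]."""
    return sum(size for day, size in partitions.items() if start <= day <= end)


# Hypothetical table: 365 daily partitions of 1 GiB each, starting 2024-01-01.
first_day = date(2024, 1, 1)
partitions = {first_day + timedelta(days=i): GIB for i in range(365)}

# No partition filter: every partition is read.
full_scan = bytes_scanned(partitions, date.min, date.max)
# Filter on the partition column: only the 30 matching partitions are read.
november_scan = bytes_scanned(partitions, date(2024, 11, 1), date(2024, 11, 30))

print(f"Full scan:     {full_scan / GIB:.0f} GiB")
print(f"November only: {november_scan / GIB:.0f} GiB")
```

Under on-demand pricing, where cost follows bytes scanned, the filtered query in this model reads roughly a twelfth of the data.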

In [None]:
# BigQuery Python SDK Examples
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField, Table, TimePartitioning

bq_client = bigquery.Client(project='your-project-id')

def create_partitioned_table(
    dataset_id: str,
    table_id: str,
    schema: list,
    partition_field: str = 'event_date',
    clustering_fields: list = None
) -> Table:
    """Create a partitioned and clustered BigQuery table."""
    
    table_ref = f"{bq_client.project}.{dataset_id}.{table_id}"
    table = bigquery.Table(table_ref, schema=schema)
    
    # Time-based partitioning
    table.time_partitioning = TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field=partition_field,
        expiration_ms=365 * 24 * 60 * 60 * 1000  # 1 year retention
    )
    
    # Clustering for query optimization
    if clustering_fields:
        table.clustering_fields = clustering_fields
    
    created_table = bq_client.create_table(table, exists_ok=True)
    print(f"Created table: {created_table.full_table_id}")
    return created_table

# Example schema
events_schema = [
    SchemaField('event_id', 'STRING', mode='REQUIRED'),
    SchemaField('user_id', 'STRING', mode='REQUIRED'),
    SchemaField('event_type', 'STRING', mode='REQUIRED'),
    SchemaField('event_date', 'DATE', mode='REQUIRED'),
    SchemaField('event_timestamp', 'TIMESTAMP', mode='REQUIRED'),
    SchemaField('properties', 'JSON', mode='NULLABLE'),
    SchemaField('amount', 'FLOAT64', mode='NULLABLE'),
]

In [None]:
# Loading Data into BigQuery
from google.cloud.bigquery import LoadJobConfig, SourceFormat, WriteDisposition

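def load_gcs_to_bigquery(
    source_uri: str,
    destination_table: str,
    source_format: str = 'PARQUET',
    hive_partition_prefix: str = None
) -> bigquery.LoadJob:
    """Load data from GCS to BigQuery.

    To load a Hive-partitioned layout, pass hive_partition_prefix: the URI
    prefix that precedes the first key=value directory
    (e.g. gs://bucket/events for gs://bucket/events/date=2024-01-01/*.parquet).
    """

    job_config = LoadJobConfig(
        source_format=getattr(SourceFormat, source_format),
        write_disposition=WriteDisposition.WRITE_APPEND,
        # Schema auto-detection for Parquet
        autodetect=(source_format == 'PARQUET')
    )

    if hive_partition_prefix:
        # HivePartitioningOptions takes no constructor arguments; set its fields as attributes
        hive_options = bigquery.HivePartitioningOptions()
        hive_options.mode = 'AUTO'
        hive_options.source_uri_prefix = hive_partition_prefix
        job_config.hive_partitioning = hive_options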
    
    load_job = bq_client.load_table_from_uri(
        source_uri,
        destination_table,
        job_config=job_config
    )
    
    result = load_job.result()  # Wait for completion
    print(f"Loaded {result.output_rows} rows to {destination_table}")
    return load_job

In [None]:
# BigQuery SQL Best Practices

# Optimized query with partition pruning and clustering benefits
OPTIMIZED_QUERY = """
-- Use partition filters to reduce data scanned
SELECT
    user_id,
    event_type,
    COUNT(*) as event_count,
    SUM(amount) as total_amount,
    AVG(amount) as avg_amount
FROM `project.dataset.events`
WHERE 
    -- Partition filter (reduces data scanned)
    event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND CURRENT_DATE()
    -- Clustering column filter (further optimization)
    AND event_type IN ('purchase', 'subscription')
GROUP BY user_id, event_type
HAVING event_count > 5
ORDER BY total_amount DESC
LIMIT 1000
"""

# Incremental processing pattern
INCREMENTAL_MERGE = """
-- Merge pattern for incremental updates
MERGE `project.dataset.dim_users` AS target
USING (
    -- Keep only the most recent staged row per user, so each target row matches at most one source row
    SELECT
        user_id,
        email,
        name,
        updated_at AS last_updated
    FROM `project.dataset.user_events_staging`
    WHERE _PARTITIONDATE = CURRENT_DATE()
    QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) = 1
) AS source
ON target.user_id = source.user_id
WHEN MATCHED AND source.last_updated > target.last_updated THEN
    UPDATE SET
        email = source.email,
        name = source.name,
        last_updated = source.last_updated
WHEN NOT MATCHED THEN
    INSERT (user_id, email, name, last_updated)
    VALUES (source.user_id, source.email, source.name, source.last_updated)
"""

print("BigQuery SQL patterns defined")

---
<a id='pubsub'></a>
## 5. Pub/Sub - Messaging & Event Streaming

Google Cloud Pub/Sub is a fully managed real-time messaging service that allows you to send and receive messages between independent applications.

### Core Concepts

```
┌─────────────────────────────────────────────────────────────────────┐
│                        Pub/Sub Architecture                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Publishers              Topic                 Subscribers         │
│   ──────────             ───────                ───────────         │
│   ┌────────┐            ┌───────┐              ┌──────────┐        │
│   │ App 1  │──┐         │       │──────────────│ Sub A    │        │
│   └────────┘  │         │       │              │ (Push)   │        │
│               ├────────►│ Topic │              └──────────┘        │
│   ┌────────┐  │         │       │              ┌──────────┐        │
│   │ App 2  │──┘         │       │──────────────│ Sub B    │        │
│   └────────┘            └───────┘              │ (Pull)   │        │
│                                                └──────────┘        │
│                                                                     │
│   • Messages are retained for 7 days (configurable)                │
│   • At-least-once delivery guaranteed                              │
│   • Exactly-once processing available                               │
└─────────────────────────────────────────────────────────────────────┘
```
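
Because delivery is at-least-once by default, a subscriber may see the same message more than once and should tolerate it. One common approach is to deduplicate on the message ID. The class below is a minimal sketch of that idea under stated assumptions: `IdempotentHandler` is a hypothetical name, and the in-memory set stands in for the durable, shared store a real deployment would need.

```python
from typing import Callable, List, Set


class IdempotentHandler:
    """Wrap a handler so a redelivered message ID is processed only once."""

    def __init__(self, handler: Callable[[bytes], None]) -> None:
        self._handler = handler
        # Assumption: in-memory only; this is lost on restart and not shared across workers.
        self._seen: Set[str] = set()

    def handle(self, message_id: str, data: bytes) -> bool:
        """Process the message unless its ID was already handled. Return True if processed."""
        if message_id in self._seen:
            return False
        self._handler(data)
        # Record the ID only after success so a failed attempt can be retried.
        self._seen.add(message_id)
        return True


processed: List[bytes] = []
handler = IdempotentHandler(processed.append)

results = [
    handler.handle("m1", b"click"),
    handler.handle("m1", b"click"),  # simulated redelivery of the same message
    handler.handle("m2", b"view"),
]
print(results, processed)
```

Message IDs only cover redelivery of the same published message. If a publisher sends the same event twice, the two copies get different IDs, so deduplicating on a business key may be needed as well.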

### Subscription Types

| Type | Use Case | Delivery |
|------|----------|----------|
| **Pull** | Batch processing, custom rate control | Subscriber pulls messages |
| **Push** | Serverless, webhooks | Pub/Sub pushes to endpoint |
| **BigQuery** | Direct analytics ingestion | Auto-writes to BQ table |
| **Cloud Storage** | Data lake ingestion | Auto-writes to GCS bucket |

In [None]:
# Pub/Sub Publisher Example
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1 import PublisherClient
from google.api_core import retry
import json
from concurrent import futures
from typing import Callable

class PubSubPublisher:
    """High-performance Pub/Sub publisher with batching."""
    
    def __init__(self, project_id: str):
        # Configure batching settings
        batch_settings = pubsub_v1.types.BatchSettings(
            max_messages=100,        # Max messages per batch
            max_bytes=1024 * 1024,   # 1 MB max batch size
            max_latency=0.01         # 10ms max wait time
        )
        
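        self.publisher = PublisherClient(
            batch_settings=batch_settings,
            # Required for publish() calls that pass a non-empty ordering_key
            publisher_options=pubsub_v1.types.PublisherOptions(
                enable_message_ordering=True
            )
        )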
        self.project_id = project_id
        self.futures = []
    
    def get_topic_path(self, topic_id: str) -> str:
        return self.publisher.topic_path(self.project_id, topic_id)
    
    def publish(
        self,
        topic_id: str,
        data: dict,
        ordering_key: str = None,
        **attributes
    ) -> futures.Future:
        """Publish a message with optional ordering key."""
        topic_path = self.get_topic_path(topic_id)
        
        # Serialize data
        message_bytes = json.dumps(data).encode('utf-8')
        
        # Publish (the client library retries transient errors by default)
        future = self.publisher.publish(
            topic_path,
            data=message_bytes,
            ordering_key=ordering_key or '',
            **attributes  # Custom attributes for filtering
        )
        
        self.futures.append(future)
        return future
    
    def flush(self):
        """Wait for all pending publishes to complete."""
        for future in self.futures:
            try:
                message_id = future.result(timeout=30)
            except Exception as e:
                print(f"Publish failed: {e}")
        self.futures.clear()

# Usage example
# publisher = PubSubPublisher('my-project')
# publisher.publish('events-topic', {'user_id': '123', 'event': 'click'}, event_type='click')
# publisher.flush()

In [None]:
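# Pub/Sub Subscriber with Explicit Ack/Nack Handling
from google.cloud.pubsub_v1 import SubscriberClient
from google.cloud.pubsub_v1.subscriber.message import Message
import time

class PubSubSubscriber:
    """Streaming-pull subscriber with at-least-once semantics.

    Exactly-once delivery additionally requires enabling it on the
    subscription and acknowledging with message.ack_with_response().
    """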
    
    def __init__(self, project_id: str):
        self.subscriber = SubscriberClient()
        self.project_id = project_id
        self.streaming_pull_future = None
    
    def get_subscription_path(self, subscription_id: str) -> str:
        return self.subscriber.subscription_path(self.project_id, subscription_id)
    
    def subscribe(
        self,
        subscription_id: str,
        callback: Callable[[Message], None],
        max_messages: int = 100,
        ack_deadline: int = 60
    ):
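        """Start streaming pull subscription.

        Note: ack_deadline is not applied by this method; the acknowledgment
        deadline is configured on the subscription itself.
        """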
        subscription_path = self.get_subscription_path(subscription_id)
        
        # Flow control to prevent overwhelming the subscriber
        flow_control = pubsub_v1.types.FlowControl(
            max_messages=max_messages,
            max_bytes=10 * 1024 * 1024  # 10 MB
        )
        
        def wrapped_callback(message: Message):
            try:
                # Process message
                callback(message)
                # Acknowledge on success
                message.ack()
            except Exception as e:
                print(f"Error processing message: {e}")
                # Negative acknowledge - message will be redelivered
                message.nack()
        
        self.streaming_pull_future = self.subscriber.subscribe(
            subscription_path,
            callback=wrapped_callback,
            flow_control=flow_control
        )
        
        print(f"Listening on {subscription_path}...")
        return self.streaming_pull_future
    
    def stop(self):
        """Stop the subscriber."""
        if self.streaming_pull_future:
            self.streaming_pull_future.cancel()
            self.streaming_pull_future.result()

# Example callback
def process_message(message: Message):
    data = json.loads(message.data.decode('utf-8'))
    print(f"Received: {data}")
    print(f"Attributes: {message.attributes}")
    print(f"Message ID: {message.message_id}")

In [None]:
# Create BigQuery Subscription for Direct Analytics
from google.cloud.pubsub_v1 import SubscriberClient
from google.pubsub_v1.types import BigQueryConfig, Subscription

def create_bigquery_subscription(
    project_id: str,
    topic_id: str,
    subscription_id: str,
    bigquery_table: str
):
    """Create a subscription that writes directly to BigQuery."""
    
    subscriber = SubscriberClient()
    topic_path = subscriber.topic_path(project_id, topic_id)
    subscription_path = subscriber.subscription_path(project_id, subscription_id)
    
    bigquery_config = BigQueryConfig(
        table=bigquery_table,  # Format: project:dataset.table
        use_topic_schema=True,  # Use Pub/Sub schema
        write_metadata=True,    # Include message metadata
        drop_unknown_fields=True
    )
    
    subscription = Subscription(
        name=subscription_path,
        topic=topic_path,
        bigquery_config=bigquery_config
    )
    
    result = subscriber.create_subscription(request=subscription)
    print(f"Created BigQuery subscription: {result.name}")
    return result

---
<a id='best-practices'></a>
## 6. Best Practices & Architecture Patterns

### Reference Architecture: Real-Time Analytics Pipeline

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                    Real-Time Analytics Architecture                          │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   DATA SOURCES              INGESTION              PROCESSING                │
│   ────────────              ─────────              ──────────                │
│                                                                              │
│   ┌─────────┐              ┌─────────┐            ┌──────────┐               │
│   │ Web App │───────┐      │         │            │          │               │
│   └─────────┘       │      │         │   Stream   │ Dataflow │               │
│                     ├─────►│ Pub/Sub │───────────►│ Streaming│               │
│   ┌─────────┐       │      │         │            │ Pipeline │               │
│   │ Mobile  │───────┘      │         │            │          │               │
│   └─────────┘              └─────────┘            └────┬─────┘               │
│                                                        │                     │
│   ┌─────────┐              ┌─────────┐            ┌────▼─────┐               │
│   │ Files   │─────────────►│   GCS   │───────────►│ Dataflow │               │
│   │ (Batch) │              │(Landing)│   Batch    │  Batch   │               │
│   └─────────┘              └─────────┘            └────┬─────┘               │
│                                                        │                     │
│   STORAGE                  SERVING                  CONSUMPTION              │
│   ───────                  ───────                  ───────────              │
│                                                                              │
│               ┌────────────────────────────────────────┐                     │
│               │                                        │                     │
│   ┌───────────▼──┐        ┌───────────┐         ┌──────▼──────┐             │
│   │   BigQuery   │◄──────►│  BI Engine│────────►│   Looker    │             │
│   │  (Warehouse) │        │(In-Memory)│         │ Data Studio │             │
│   └──────────────┘        └───────────┘         └─────────────┘             │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```

### Best Practices by Service

#### Google Cloud Storage
- ✅ Use lifecycle policies for automatic tiering
- ✅ Enable Object Versioning for critical data
- ✅ Use regional buckets for low latency near compute; use dual- or multi-region buckets for higher availability and geo-redundancy
- ✅ Implement naming conventions with partitions (e.g., `year=2024/month=01/`)
- ❌ Avoid small files - aggregate into larger objects
- ❌ Don't use GCS for low-latency key-value lookups
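
The partitioned naming convention above can be sketched as a small helper. This is a minimal illustration, not a GCS API: the function name `partitioned_object_name`, the `events` prefix, and the extra `day=` level are assumptions added for the example.

```python
from datetime import date


def partitioned_object_name(dataset: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned object name for a data lake bucket."""
    # Zero-pad month and day so lexicographic order matches date order.
    return (
        f"{dataset}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


print(partitioned_object_name("events", date(2024, 1, 15), "part-0001.parquet"))
# events/year=2024/month=01/day=15/part-0001.parquet
```

Keeping the `key=value` form in the object name lets engines that understand Hive-style layouts prune by prefix instead of listing the whole bucket.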

#### Dataflow
- ✅ Use Flex Templates for production pipelines
- ✅ Implement dead-letter queues for error handling
- ✅ Use side inputs for small lookup tables
- ✅ Set appropriate windowing and triggering strategies
- ❌ Avoid stateful operations when possible
- ❌ Don't aggregate unbounded data in the global window without a custom trigger
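
The dead-letter idea from the list above can be illustrated in plain Python. In Apache Beam this routing is commonly done with tagged outputs; the sketch below deliberately avoids the Beam API, and the name `split_with_dead_letter` is hypothetical.

```python
import json
from typing import Iterable


def split_with_dead_letter(raw_messages: Iterable[str]) -> tuple[list[dict], list[dict]]:
    """Parse JSON messages; route failures to a dead-letter list instead of raising."""
    parsed, dead_letter = [], []
    for raw in raw_messages:
        try:
            record = json.loads(raw)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
            parsed.append(record)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            # Keep the original payload and the error so it can be replayed later.
            dead_letter.append({"payload": raw, "error": str(exc)})
    return parsed, dead_letter


good, bad = split_with_dead_letter(['{"user": "a"}', "not-json", "[1, 2]"])
print(len(good), len(bad))  # 1 2
```

The point of the pattern is that one malformed record never fails the whole pipeline; the dead-letter output would typically be written to its own table or topic for inspection.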

#### Dataproc
- ✅ Use ephemeral clusters (create → run → delete)
- ✅ Store data in GCS, not HDFS
- ✅ Enable autoscaling policies
- ✅ Use initialization actions for customization
- ❌ Avoid keeping clusters running when idle
- ❌ Don't use local disk for persistent data
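
The create → run → delete lifecycle can be sketched as a context manager that guarantees cleanup. The `create` and `delete` callables are hypothetical stand-ins for whatever actually provisions the cluster (client library, `gcloud`, a workflow tool); no GCP calls are made here.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def ephemeral_cluster(
    name: str,
    create: Callable[[str], None],
    delete: Callable[[str], None],
) -> Iterator[str]:
    """Create a cluster, hand it to the caller, and always delete it afterwards."""
    create(name)
    try:
        yield name
    finally:
        # Runs even if the job raises, so a failed job does not leave a cluster billing.
        delete(name)


# Stand-in callables that only record what happened (hypothetical, no GCP calls).
events: list[str] = []

with ephemeral_cluster(
    "nightly-etl",
    create=lambda n: events.append(f"create {n}"),
    delete=lambda n: events.append(f"delete {n}"),
) as cluster:
    events.append(f"run job on {cluster}")

print(events)
# ['create nightly-etl', 'run job on nightly-etl', 'delete nightly-etl']
```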

#### BigQuery
- ✅ Always use partition filters in queries
- ✅ Cluster tables on high-cardinality filter columns
- ✅ Use `MERGE` for incremental updates
- ✅ Consider BI Engine for dashboard workloads
- ❌ Avoid `SELECT *` - specify needed columns
- ❌ Don't use streaming inserts for batch loads
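
As a rough illustration of the `MERGE` recommendation, the helper below renders an upsert statement. The function name and the table and column names (`analytics.customers`, `customer_id`, and so on) are hypothetical, and identifiers are interpolated directly, so it is only suitable for trusted, hard-coded names.

```python
def build_merge_sql(target: str, source: str, key: str, columns: list[str]) -> str:
    """Render a BigQuery MERGE statement that upserts `columns` keyed on `key`."""
    updates = ", ".join(f"{c} = S.{c}" for c in columns)
    all_cols = [key, *columns]
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"S.{c}" for c in all_cols)
    return (
        f"MERGE `{target}` AS T\n"
        f"USING `{source}` AS S\n"
        f"ON T.{key} = S.{key}\n"
        f"WHEN MATCHED THEN\n  UPDATE SET {updates}\n"
        f"WHEN NOT MATCHED THEN\n  INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


sql = build_merge_sql(
    target="analytics.customers",
    source="analytics.customers_staging",
    key="customer_id",
    columns=["email", "updated_at"],
)
print(sql)
```

Loading a batch into a staging table and merging once is generally cheaper and simpler to reason about than issuing row-by-row updates.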

#### Pub/Sub
- ✅ Use ordering keys for in-order processing
- ✅ Implement idempotent subscribers
- ✅ Set appropriate acknowledgment deadlines
- ✅ Use BigQuery subscriptions for simple analytics
- ❌ Avoid very long-running message processing
- ❌ Don't use Pub/Sub for request-response patterns
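
Because Pub/Sub delivery is at-least-once by default, a subscriber may see the same message more than once. A minimal sketch of an idempotent handler follows; the class name `IdempotentHandler` is hypothetical, and the in-memory set stands in for a durable store that a real deployment would need.

```python
from typing import Callable


class IdempotentHandler:
    """Wrap a handler so redelivered messages (same message_id) are processed once."""

    def __init__(self, handler: Callable[[bytes], None]) -> None:
        self._handler = handler
        self._seen: set[str] = set()  # in-memory only; assumes a single process

    def handle(self, message_id: str, data: bytes) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(data)
        # Record the ID only after the handler succeeds, so failures can be retried.
        self._seen.add(message_id)
        return True


processed: list[bytes] = []
subscriber = IdempotentHandler(processed.append)
print(subscriber.handle("msg-1", b"order created"))  # True
print(subscriber.handle("msg-1", b"order created"))  # False (redelivery)
print(processed)  # [b'order created']
```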

In [None]:
# Infrastructure as Code: Terraform Example for GCP Data Platform

TERRAFORM_CONFIG = '''
# main.tf - GCP Data Engineering Infrastructure

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

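variable "project_id" {}
variable "region" { default = "us-central1" }

# Configure the provider so resources inherit the project and region
provider "google" {
  project = var.project_id
  region  = var.region
}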

# Data Lake Bucket
resource "google_storage_bucket" "data_lake" {
  name                        = "${var.project_id}-data-lake"
  location                    = var.region
  uniform_bucket_level_access = true
  versioning { enabled = true }
  
  lifecycle_rule {
    condition { age = 30 }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }
}

# BigQuery Dataset
resource "google_bigquery_dataset" "analytics" {
  dataset_id    = "analytics"
  friendly_name = "Analytics Dataset"
  location      = var.region
  
  default_partition_expiration_ms = 7776000000  # 90 days
}

# Pub/Sub Topic
resource "google_pubsub_topic" "events" {
  name = "events"
  
  message_retention_duration = "604800s"  # 7 days
}

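# Pub/Sub to BigQuery Subscription
# Prerequisites not defined in this template: the events_raw table must already
# exist, and the Pub/Sub service agent needs write access to it
# (e.g. roles/bigquery.dataEditor).
resource "google_pubsub_subscription" "events_to_bq" {
  name  = "events-to-bigquery"
  topic = google_pubsub_topic.events.name
  
  bigquery_config {
    table            = "${var.project_id}:${google_bigquery_dataset.analytics.dataset_id}.events_raw"
    use_topic_schema = true  # maps topic schema fields to table columns; attach a schema to the topic
    write_metadata   = true  # table also needs the metadata columns (e.g. message_id, publish_time)
  }
}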

# Dataproc Autoscaling Policy
resource "google_dataproc_autoscaling_policy" "default" {
  policy_id = "default-policy"
  location  = var.region

  basic_algorithm {
    yarn_config {
      graceful_decommission_timeout = "30s"
      scale_up_factor               = 0.5
      scale_down_factor             = 0.5
    }
  }

  worker_config {
    min_instances = 2
    max_instances = 10
  }
}
'''

print("Terraform configuration template created")

### Cost Optimization Strategies

| Service | Strategy | Potential Savings |
|---------|----------|-------------------|
| **GCS** | Use lifecycle policies for auto-tiering | 60-80% on cold data |
| **GCS** | Compress files (gzip, snappy) | 50-70% storage |
| **Dataflow** | Use Dataflow Prime (vertical autoscaling, right fitting) | 20-40% compute |
| **Dataproc** | Use preemptible/Spot VMs for secondary workers | 60-80% compute |
| **Dataproc** | Ephemeral clusters (delete when idle) | 30-50% |
| **BigQuery** | Partition and cluster tables | 50-90% query costs |
| **BigQuery** | Use capacity-based pricing (BigQuery editions slot reservations) for predictable workloads | Varies |
| **Pub/Sub** | Use BigQuery subscription (vs. processing) | 50%+ |
| **Multiple** | Use committed use discounts (CUDs) where a service offers them | 25-55% |
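
To make the partitioning row concrete, here is a back-of-the-envelope calculation of how much scanning a partition filter avoids. It is a sketch under stated assumptions: partitions are equally sized, cost is proportional to bytes scanned, and the function name `pruning_savings` is hypothetical.

```python
def pruning_savings(partitions_read: int, partitions_total: int) -> float:
    """Fraction of bytes NOT scanned when a query reads only some partitions.

    Assumes equally sized partitions and billing proportional to bytes scanned.
    """
    if partitions_total <= 0 or not 0 <= partitions_read <= partitions_total:
        raise ValueError("need 0 <= partitions_read <= partitions_total and total > 0")
    return 1 - partitions_read / partitions_total


# A query filtered to the last 90 days of a table with 365 daily partitions.
print(f"{pruning_savings(90, 365):.1%}")  # 75.3%
```

Real savings depend on how skewed the partitions are and on whether queries consistently include the partition filter.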

---
<a id='takeaways'></a>
## 7. Key Takeaways

### Service Selection Guide

```
┌─────────────────────────────────────────────────────────────────────┐
│               When to Use Which Service?                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Need to store files/objects?          ───► Google Cloud Storage    │
│                                                                     │
│  Need real-time messaging?             ───► Pub/Sub                 │
│                                                                     │
│  Need to run SQL analytics?            ───► BigQuery                │
│                                                                     │
│  Need to build new ETL pipelines?      ───► Dataflow                │
│                                                                     │
│  Have existing Spark/Hadoop jobs?      ───► Dataproc                │
│                                                                     │
│  Need ML on structured data?           ───► BigQuery ML             │
│                                                                     │
│  Need ML on unstructured data?         ───► Vertex AI + Dataproc    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Summary

| Service | Primary Use Case | Key Benefit |
|---------|------------------|-------------|
| **GCS** | Data lake storage | Scalable, durable, cost-effective |
| **Dataflow** | Stream/batch ETL | Unified programming, auto-scaling |
| **Dataproc** | Spark/Hadoop workloads | Managed clusters, ecosystem compat |
| **BigQuery** | Data warehouse | Serverless, fast analytics |
| **Pub/Sub** | Event streaming | Real-time, reliable messaging |

### Integration Patterns

1. **Streaming Analytics**: Pub/Sub → Dataflow → BigQuery
2. **Batch ETL**: GCS → Dataflow/Dataproc → BigQuery
3. **Data Lake**: GCS (landing) → GCS (processed) → BigQuery (serving)
4. **ML Pipeline**: BigQuery → Vertex AI → GCS (models)
5. **Log Analytics**: Cloud Logging → Pub/Sub → BigQuery

### Further Resources

- [GCP Data Engineering Documentation](https://cloud.google.com/architecture/data-engineering)
- [BigQuery Best Practices](https://cloud.google.com/bigquery/docs/best-practices-performance-overview)
- [Dataflow Programming Guide](https://cloud.google.com/dataflow/docs/concepts/beam-programming-model)
- [GCP Professional Data Engineer Certification](https://cloud.google.com/certification/data-engineer)