# Cloud Computing and Big Data for Data Science

In this notebook, we explore the fundamental concepts of cloud computing and big data in the context of data science. These technologies let organizations store, process, and analyze datasets that are too large for a single machine.

## Table of Contents
1. [Cloud Computing Definition](#cloud-computing-definition)
2. [Importance of Cloud Data Services](#importance-of-cloud-data-services)
3. [Big Data Implementation Examples](#big-data-implementation-examples)
4. [Data Warehousing Concepts](#data-warehousing-concepts)
5. [Big Data Processing Principles](#big-data-processing-principles)
6. [Application to Lending Club Dataset](#application-to-lending-club-dataset)

## Cloud Computing Definition

Cloud computing is the delivery of computing services—including servers, storage, databases, networking, software, analytics, and intelligence—over the Internet ("the cloud"). It offers faster innovation, flexible resources, and economies of scale.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Create sample data to simulate cloud computing and big data scenarios
np.random.seed(42)
n_samples = 5000

# Simulate Lending Club data (to connect with our main project)
data = {
    'loan_id': range(1, n_samples + 1),
    'loan_amnt': np.random.normal(15000, 10000, n_samples),
    'int_rate': np.random.normal(12, 4, n_samples),
    'annual_inc': np.random.normal(75000, 30000, n_samples),
    'dti': np.random.normal(15, 10, n_samples),
    'fico_score': np.random.normal(700, 50, n_samples),
    'emp_length': np.random.gamma(2, 2, n_samples),
    'loan_status': np.random.choice([0, 1], n_samples, p=[0.8, 0.2]),  # 0: Fully Paid, 1: Charged Off
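    # Grade is simulated from its own score draw (independent of fico_score above).
    # Clip before binning so draws above 850 stay inside the bins instead of becoming NaN.
    'grade': pd.cut(np.clip(np.random.normal(700, 50, n_samples), 300, 850),
                    bins=[0, 580, 620, 660, 700, 740, 780, 850],
                    labels=['G', 'F', 'E', 'D', 'C', 'B', 'A']),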
    'home_ownership': np.random.choice(['MORTGAGE', 'RENT', 'OWN'], n_samples, p=[0.4, 0.3, 0.3]),
    'purpose': np.random.choice(['debt_consolidation', 'credit_card', 'home_improvement', 'major_purchase', 'small_business'], 
                                n_samples, p=[0.3, 0.2, 0.2, 0.2, 0.1]),
    'region': np.random.choice(['Northeast', 'Southeast', 'Midwest', 'Southwest', 'West'], n_samples),
    'application_date': pd.date_range('2010-01-01', periods=n_samples, freq='H')[:n_samples]
}

# Ensure no negative values and realistic ranges
data['loan_amnt'] = np.abs(data['loan_amnt'])
data['annual_inc'] = np.abs(data['annual_inc'])
data['dti'] = np.abs(data['dti'])
data['fico_score'] = np.clip(data['fico_score'], 300, 850)
data['emp_length'] = np.clip(data['emp_length'], 0, 15)

df = pd.DataFrame(data)

print("Cloud Computing and Big Data - Sample Lending Club Dataset")
print(df.head())
print(f"\nDataset Shape: {df.shape}")

## Importance of Cloud Data Services

Cloud data services have become crucial for data science teams due to several key advantages:

1. **Scalability**: Automatically scale resources up or down based on demand
2. **Cost-effectiveness**: Pay only for what you use, reducing infrastructure costs
3. **Flexibility**: Access to a wide range of data processing and machine learning tools
4. **Reliability**: Built-in redundancy and disaster recovery
5. **Collaboration**: Easy sharing of data and models across teams
6. **Innovation**: Access to cutting-edge tools and services
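
To make the scalability and pay-per-use points concrete, the sketch below compares fixed provisioning sized for the peak hour with hourly autoscaling. The demand curve, the per-instance capacity, and the hourly price are hypothetical values chosen for illustration, not figures from any provider.

```python
import math

# Hypothetical hourly demand (requests per second) over one day.
hourly_demand = [120, 100, 90, 90, 110, 180, 320, 520, 680, 740, 760, 780,
                 800, 770, 750, 720, 690, 640, 560, 450, 340, 260, 190, 150]

CAPACITY_PER_INSTANCE = 100      # assumed requests/second one instance can serve
PRICE_PER_INSTANCE_HOUR = 0.20   # assumed price in USD, illustrative only


def instances_needed(demand: float, capacity: int = CAPACITY_PER_INSTANCE) -> int:
    """Smallest number of instances that covers the demand (at least one)."""
    return max(1, math.ceil(demand / capacity))


# Fixed provisioning: run enough instances for the peak hour all day.
peak_instances = instances_needed(max(hourly_demand))
fixed_instance_hours = peak_instances * len(hourly_demand)

# Autoscaling: run only what each hour requires.
scaled_instances = [instances_needed(d) for d in hourly_demand]
scaled_instance_hours = sum(scaled_instances)

fixed_cost = fixed_instance_hours * PRICE_PER_INSTANCE_HOUR
scaled_cost = scaled_instance_hours * PRICE_PER_INSTANCE_HOUR

print(f"Fixed provisioning: {fixed_instance_hours} instance-hours, ${fixed_cost:.2f}/day")
print(f"Autoscaling:        {scaled_instance_hours} instance-hours, ${scaled_cost:.2f}/day")
print(f"Savings:            {100 * (1 - scaled_cost / fixed_cost):.1f}%")
```

The gap between the two totals grows with the difference between peak and average demand, which is why workloads with pronounced daily or seasonal peaks tend to benefit most from elastic resources.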

In [None]:
# Visualization: Benefits of Cloud Computing
benefits = [
    'Scalability',
    'Cost-effectiveness',
    'Flexibility',
    'Reliability',
    'Collaboration',
    'Innovation'
]
importance_scores = [9, 8, 8, 9, 7, 8]  # Illustrative ratings for the chart, not survey data

fig = go.Figure(data=[
    go.Bar(x=benefits, y=importance_scores, 
           text=importance_scores, textposition='auto',
           marker_color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7', '#DDA0DD'])
])

fig.update_layout(
    title='Importance of Cloud Computing Benefits for Data Science',
    xaxis_title='Benefits',
    yaxis_title='Importance Score (1-10)',
    height=500
)

fig.show()

# Cloud service provider comparison (illustrative placeholder counts, not actual service inventories)
providers = ['AWS', 'Google Cloud', 'Microsoft Azure']
data_science_features = [12, 10, 9]
ml_services = [15, 12, 11]
storage_options = [8, 7, 8]

fig = go.Figure(data=[
    go.Bar(name='Data Science Tools', x=providers, y=data_science_features),
    go.Bar(name='ML Services', x=providers, y=ml_services),
    go.Bar(name='Storage Options', x=providers, y=storage_options)
])

fig.update_layout(
    title='Cloud Service Providers - Data Science Capabilities',
    xaxis_title='Cloud Provider',
    yaxis_title='Number of Services/Features',
    barmode='group',
    height=500
)

fig.show()

print("Cloud Computing for Data Science Services:")
print("\n1. Data Storage Solutions:")
print("   - Amazon S3 (Simple Storage Service) - Object storage")
print("   - Google Cloud Storage - Scalable storage solution")
print("   - Azure Blob Storage - Unstructured data storage")

print("\n2. Data Processing Services:")
print("   - Amazon EMR (originally Elastic MapReduce) - Managed Hadoop and Spark clusters")
print("   - Google Dataproc - Managed Spark and Hadoop")
print("   - Azure HDInsight - Managed open-source analytics clusters")

print("\n3. Data Analytics Services:")
print("   - Amazon Redshift - Data warehousing")
print("   - Google BigQuery - Serverless data warehouse")
print("   - Azure Synapse Analytics - Enterprise data analytics")

print("\n4. Machine Learning Services:")
print("   - Amazon SageMaker - End-to-end ML platform")
print("   - Google Vertex AI (successor to AI Platform) - ML development and deployment")
print("   - Azure Machine Learning - ML development and deployment, with a drag-and-drop designer")

## Big Data Implementation Examples

Organizations in many industries use big data to guide operations and decisions. Three representative examples:

1. **Retail and Customer Analytics**: Personalization, inventory management, demand forecasting
2. **Manufacturing and Predictive Maintenance**: Equipment monitoring, predictive maintenance, quality control
3. **Telecommunications and Network Optimization**: Network performance, customer experience, fraud detection
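
As one illustration of the equipment-monitoring example, the sketch below flags sensor readings that deviate sharply from their recent history using a rolling z-score. The synthetic temperature signal, the window of 50 readings, and the threshold of 4 standard deviations are assumptions made for this example; production systems typically rely on richer models.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic temperature readings with three injected spikes (hypothetical data).
readings = pd.Series(rng.normal(loc=50.0, scale=2.0, size=1000))
spike_positions = [300, 600, 900]
readings.iloc[spike_positions] += 25.0

# Compare each reading with the mean and standard deviation of the
# preceding window; shift(1) keeps the current reading out of its own baseline.
window = 50
baseline_mean = readings.rolling(window).mean().shift(1)
baseline_std = readings.rolling(window).std().shift(1)
z_score = (readings - baseline_mean) / baseline_std

# Flag readings more than 4 standard deviations from the recent baseline.
threshold = 4.0
anomalies = readings[z_score.abs() > threshold]

print(f"Flagged {len(anomalies)} of {len(readings)} readings")
print(anomalies.round(2))
```

The first `window` readings have no baseline yet, so they are never flagged. A streaming implementation would apply the same comparison incrementally as each reading arrives.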

In [None]:
# Visualization: Big Data Use Cases by Industry (scores are illustrative, not survey results)
industries = ['Retail', 'Manufacturing', 'Telecom', 'Finance', 'Healthcare', 'Transportation']
big_data_adoption = [8.5, 7.8, 8.9, 9.2, 7.0, 6.5]
roi_scores = [7.5, 8.0, 8.5, 8.8, 6.5, 7.0]

fig = make_subplots(rows=1, cols=2, 
                    subplot_titles=('Big Data Adoption by Industry', 'ROI Scores by Industry'))

fig.add_trace(go.Bar(x=industries, y=big_data_adoption, name='Adoption',
                     marker_color='lightblue'), row=1, col=1)

fig.add_trace(go.Bar(x=industries, y=roi_scores, name='ROI Score',
                     marker_color='lightgreen'), row=1, col=2)

fig.update_layout(title='Big Data Implementation Across Industries',
                  height=500)

fig.show()

# Detailed explanation of each implementation
print("Big Data Implementation Examples:")

print("\n1. Retail and Customer Analytics")
print("   - Personalization: Analyzing customer behavior, purchase history, and preferences")
print("     to offer personalized recommendations and targeted marketing.")
print("   - Inventory Management: Predicting demand patterns to optimize inventory levels")
print("     and reduce waste.")
print("   - Customer Segmentation: Grouping customers based on behavior, demographics, and")
print("     preferences for targeted marketing strategies.")

# Simulate retail data scenario
retail_customers = 10000
purchase_freq = np.random.exponential(0.5, retail_customers)
avg_purchase = np.random.normal(100, 50, retail_customers)
retail_df = pd.DataFrame({'customer_id': range(retail_customers),
                         'purchase_freq': purchase_freq,
                         'avg_purchase': np.abs(avg_purchase)})

fig = px.scatter(retail_df.sample(500), x='purchase_freq', y='avg_purchase', 
                 title='Simulated Retail Customer Data: Purchase Frequency vs Average Purchase',
                 labels={'purchase_freq': 'Purchase Frequency', 'avg_purchase': 'Average Purchase Amount'},
                 opacity=0.6)
fig.update_layout(height=400)
fig.show()

print("\n2. Manufacturing and Predictive Maintenance")
print("   - Equipment Monitoring: Collecting real-time sensor data from machines")
print("     to detect anomalies and predict failures.")
print("   - Predictive Maintenance: Using ML models to predict when equipment")
print("     will fail before it happens, reducing downtime.")
print("   - Quality Control: Analyzing production data to identify defects")
print("     early in the manufacturing process.")

# Simulate manufacturing data scenario
sensors = ['Temp', 'Vibration', 'Pressure', 'Humidity', 'Flow']
sensor_readings = {sensor: np.random.normal(50, 10, 5000) for sensor in sensors}
manufacturing_df = pd.DataFrame(sensor_readings)

# Add anomaly detection simulation
manufacturing_df['anomaly_score'] = np.random.exponential(1, 5000)
manufacturing_df['failure_risk'] = (manufacturing_df['anomaly_score'] > 3).astype(int)

fig = make_subplots(rows=2, cols=2, 
                    subplot_titles=('Temperature', 'Vibration', 'Pressure', 'Anomaly Detection'))

for i, sensor in enumerate(['Temp', 'Vibration', 'Pressure']):
    row = (i // 2) + 1
    col = (i % 2) + 1
    fig.add_trace(go.Scatter(y=manufacturing_df[sensor].iloc[:100], 
                             mode='lines', name=sensor), row=row, col=col)

fig.add_trace(go.Histogram(x=manufacturing_df['anomaly_score'], nbinsx=50, 
                           name='Anomaly Score'), row=2, col=2)

fig.update_layout(title='Simulated Manufacturing Sensor Data Analysis', height=600)
fig.show()

print("\n3. Telecommunications and Network Optimization")
print("   - Network Performance: Analyzing network traffic patterns to optimize")
print("     routing and prevent congestion.")
print("   - Customer Experience: Monitoring service quality and customer usage")
print("     patterns to improve service delivery.")
print("   - Fraud Detection: Identifying unusual usage patterns that might")
print("     indicate fraudulent activity.")

# Simulate telecom data scenario
time_range = pd.date_range('2023-01-01', periods=8760, freq='H')  # One year of hourly data
traffic_volume = 1000 + 500 * np.sin(np.arange(8760) * 2 * np.pi / (24 * 7)) + np.random.normal(0, 100, 8760)  # Weekly pattern
latency = 20 + 5 * np.random.exponential(1, 8760)
bandwidth_usage = 50 + 20 * np.random.beta(2, 5, 8760)

telecom_df = pd.DataFrame({
    'timestamp': time_range,
    'traffic_volume': np.abs(traffic_volume),
    'latency': np.abs(latency),
    'bandwidth_usage': np.abs(bandwidth_usage)
})

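# The 3D scatter in the bottom-right cell needs a 'scene' subplot;
# the default 'xy' subplot type rejects Scatter3d traces.
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=('Traffic Volume', 'Latency', 'Bandwidth Usage', 'Network Performance'),
                    specs=[[{'type': 'xy'}, {'type': 'xy'}],
                           [{'type': 'xy'}, {'type': 'scene'}]])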

fig.add_trace(go.Scatter(x=telecom_df['timestamp'][:168], y=telecom_df['traffic_volume'][:168], 
                         mode='lines', name='Traffic'), row=1, col=1)
fig.add_trace(go.Scatter(x=telecom_df['timestamp'][:168], y=telecom_df['latency'][:168], 
                         mode='lines', name='Latency'), row=1, col=2)
fig.add_trace(go.Scatter(x=telecom_df['timestamp'][:168], y=telecom_df['bandwidth_usage'][:168], 
                         mode='lines', name='Bandwidth'), row=2, col=1)
fig.add_trace(go.Scatter3d(x=telecom_df['traffic_volume'][:500], 
                           y=telecom_df['latency'][:500], 
                           z=telecom_df['bandwidth_usage'][:500], 
                           mode='markers',
                           marker=dict(size=3),
                           name='Performance'), row=2, col=2)

fig.update_layout(title='Simulated Telecom Network Data Analysis', height=700)
fig.show()

## Data Warehousing Concepts

Data warehousing is the foundation for business intelligence and analytics. It involves storing large volumes of data from multiple sources in a way that makes it easy to analyze and report on. Key concepts include:

1. **Fact Tables**: Store quantitative data about business transactions
2. **Dimension Tables**: Store descriptive attributes that provide context to facts
3. **Star Schema**: A logical arrangement of tables centered around a fact table
4. **Snowflake Schema**: A normalized form of star schema
5. **Galaxy Schema (Fact Constellation)**: Multiple fact tables sharing dimension tables
6. **Massively Parallel Processing (MPP) Databases**: Distribute data and query execution across many nodes to speed up analytical queries
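
The relationship between fact and dimension tables can be sketched with a small pandas example. The table names (`fact_loans`, `dim_customer`, `dim_date`), their columns, and their contents are hypothetical. The point is that the fact table holds keys and numeric measures, while descriptive attributes are looked up through joins.

```python
import pandas as pd

# Dimension tables: one row per entity, descriptive attributes only.
dim_customer = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["West", "Midwest", "West"],
    "home_ownership": ["RENT", "OWN", "MORTGAGE"],
})
dim_date = pd.DataFrame({
    "date_id": [20230115, 20230220, 20230405],
    "year": [2023, 2023, 2023],
    "quarter": [1, 1, 2],
})

# Fact table: foreign keys to each dimension plus numeric measures.
fact_loans = pd.DataFrame({
    "loan_id": [101, 102, 103, 104],
    "customer_id": [1, 2, 3, 1],
    "date_id": [20230115, 20230220, 20230405, 20230405],
    "loan_amount": [10000.0, 15000.0, 8000.0, 12000.0],
    "defaulted": [0, 1, 0, 0],
})

# A typical star-schema query: join the fact table to its dimensions,
# then aggregate the measures by descriptive attributes.
report = (
    fact_loans
    .merge(dim_customer, on="customer_id", how="left", validate="many_to_one")
    .merge(dim_date, on="date_id", how="left", validate="many_to_one")
    .groupby(["region", "quarter"], as_index=False)
    .agg(total_amount=("loan_amount", "sum"), default_rate=("defaulted", "mean"))
)
print(report)
```

In a snowflake schema, `region` would move out of `dim_customer` into its own table referenced by a key, which reduces redundancy at the cost of one more join.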

In [None]:
# Create a simulated data warehouse for Lending Club

# Dimension Tables
dates_dim = pd.DataFrame({
    'date_id': range(len(df)),
    'date': df['application_date'],
    'year': df['application_date'].dt.year,
    'month': df['application_date'].dt.month,
    'day': df['application_date'].dt.day,
    'quarter': df['application_date'].dt.quarter
})

customer_dim = pd.DataFrame({
    'customer_id': df['loan_id'],
    'fico_score': df['fico_score'],
    'annual_income': df['annual_inc'],
    'home_ownership': df['home_ownership'],
    'employment_length': df['emp_length'],
    'region': df['region']
})

loan_dim = pd.DataFrame({
    'loan_id': df['loan_id'],
    'loan_amount': df['loan_amnt'],
    'interest_rate': df['int_rate'],
    'grade': df['grade'],
    'purpose': df['purpose'],
    'debt_to_income': df['dti']
})

# Fact Table: foreign keys to the dimensions plus the numeric measures
# (the former 'application_date_key' column duplicated 'date_id' and is dropped)
fact_table = pd.DataFrame({
    'date_id': range(len(df)),
    'customer_id': df['loan_id'],
    'loan_id': df['loan_id'],
    'loan_amount': df['loan_amnt'],
    'interest_rate': df['int_rate'],
    'default_status': df['loan_status']
})

print("Data Warehousing Example - Lending Club Data Warehouse")
print("\nDates Dimension Table:")
print(dates_dim.head())
print(f"Shape: {dates_dim.shape}")

print("\nCustomer Dimension Table:")
print(customer_dim.head())
print(f"Shape: {customer_dim.shape}")

print("\nLoan Dimension Table:")
print(loan_dim.head())
print(f"Shape: {loan_dim.shape}")

print("\nFact Table:")
print(fact_table.head())
print(f"Shape: {fact_table.shape}")

# Visualization of Star Schema

fig = go.Figure()

# Create positions for the star schema
center = (0, 0)  # Fact table position
dimensions = [
    (-1, 1, "Dates Dimension"),
    (1, 1, "Customer Dimension"),
    (1, -1, "Loan Dimension"),
    (-1, -1, "Other Dimensions")
]

# Plot fact table
fig.add_trace(go.Scatter(x=[center[0]], y=[center[1]], 
                         mode='markers+text', 
                         text=['Fact Table'],
                         textposition='middle center',
                         marker=dict(size=40, color='red', symbol='square'),
                         name='Fact Table'))

# Plot dimension tables
for x, y, name in dimensions:
    fig.add_trace(go.Scatter(x=[x], y=[y], 
                             mode='markers+text', 
                             text=[name],
                             textposition='middle center',
                             marker=dict(size=30, color='blue', symbol='circle'),
                             name=name))
    
    # Connect dimension to fact table
    fig.add_trace(go.Scatter(x=[center[0], x], y=[center[1], y],
                             mode='lines',
                             line=dict(color='gray', width=2),
                             showlegend=False))

fig.update_layout(
    title='Star Schema Visualization - Lending Club Data Warehouse',
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    width=800,
    height=600
)

fig.show()

# Explain the schemas
print("\nSchema Explanations:")
print("\n1. Star Schema:")
print("   - Central fact table surrounded by dimension tables")
print("   - Denormalized structure for fast query performance")
print("   - Simple joins for analysis")

print("\n2. Snowflake Schema:")
print("   - Normalized version of star schema")
print("   - Dimension tables further normalized")
print("   - Reduced data redundancy")

print("\n3. Galaxy Schema:")
print("   - Multiple fact tables sharing dimension tables")
print("   - Used when multiple business processes are related")
print("   - Example: Loan applications + Collections + Payments")

# Illustrate MPP principles with hypothetical query timings (hard-coded, not measured)

# Simulate different data processing approaches
data_sizes = [1000, 5000, 10000, 15000, 20000]
traditional_processing = [0.1, 0.8, 2.5, 5.0, 8.5]  # Illustrative times in seconds on a single node
mpp_processing = [0.1, 0.2, 0.4, 0.6, 0.9]  # Illustrative times in seconds with parallel processing

fig = go.Figure()
fig.add_trace(go.Scatter(x=data_sizes, y=traditional_processing, 
                         mode='lines+markers', name='Traditional Processing',
                         line=dict(color='red')))
fig.add_trace(go.Scatter(x=data_sizes, y=mpp_processing, 
                         mode='lines+markers', name='MPP Processing',
                         line=dict(color='green')))

fig.update_layout(
    title='Performance Comparison: Traditional vs MPP Processing',
    xaxis_title='Data Size',
    yaxis_title='Processing Time (seconds)',
    height=500
)

fig.show()

print("\nMPP Database Principles:")
print("1. Parallel Processing: Distribute work across multiple nodes")
print("2. Data Distribution: Partition data across nodes for parallel access")
print("3. Query Optimization: Optimize queries to leverage parallelization")
print("4. Columnar Storage: Store data in columns for analytical queries")
print("5. Compression: Reduce storage requirements and improve I/O")

## Big Data Processing Principles

Big data is characterized by the 5 V's: Volume, Velocity, Variety, Veracity, and Value. Processing big data across multiple nodes or servers requires specialized frameworks and principles:

1. **Distributed Computing**: Breaking down large problems into smaller tasks
2. **MapReduce**: A map, shuffle, and reduce programming model for processing large datasets in parallel
3. **Hadoop Ecosystem**: Framework for distributed storage and processing
4. **Apache Spark**: In-memory computing for big data
5. **Real-time Processing**: Stream processing for immediate insights
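
The distributed computing and MapReduce ideas above can be sketched in plain Python. This is a toy, single-machine illustration: the records are hypothetical, the helper names (`split`, `map_partition`, `shuffle`, `reduce_group`) are invented for this example, and threads stand in for cluster nodes. Frameworks such as Hadoop or Spark handle partitioning, shuffling, and fault tolerance themselves.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical (grade, defaulted) records standing in for a large dataset.
records = [("A", 0), ("B", 1), ("A", 1), ("C", 0),
           ("B", 1), ("A", 0), ("C", 1), ("B", 0)]


def split(data, num_partitions):
    """Divide the records into roughly equal partitions, one per worker."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def map_partition(partition):
    """Map phase: emit a (key, value) pair for every record in a partition."""
    return [(grade, defaulted) for grade, defaulted in partition]


def shuffle(mapped_partitions):
    """Shuffle phase: group all emitted values by key across partitions."""
    grouped = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_group(values):
    """Reduce phase: collapse the values for one key into a single result."""
    return sum(values)


partitions = split(records, num_partitions=3)

# Threads stand in for cluster nodes; each one maps its own partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(map_partition, partitions))

grouped = shuffle(mapped)
defaults_by_grade = {grade: reduce_group(values)
                     for grade, values in sorted(grouped.items())}
print(defaults_by_grade)
```

Because each partition is mapped independently and the reduce step only needs the values for its own key, both phases can run in parallel. The shuffle is the only step that moves data between workers.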

In [None]:
# Visualization of Big Data Processing Concepts (all values below are illustrative placeholders)
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Volume Growth', 'Velocity Processing', 'Variety Sources', '5 V\'s of Big Data'),
    specs=[[{"type": "scatter"}, {"type": "scatter"}],
           [{"type": "bar"}, {"type": "bar"}]]
)

# 1. Volume Growth
years = list(range(2015, 2025))
data_volume = [1, 1.8, 3.2, 5.1, 7.9, 12.0, 17.5, 25.0, 35.0, 48.0]  # Illustrative growth curve in zettabytes, not measured global data volume
fig.add_trace(go.Scatter(x=years, y=data_volume, mode='lines+markers', name='Data Volume'), row=1, col=1)

# 2. Velocity Processing
processing_types = ['Batch', 'Near Real-time', 'Real-time', 'Stream']
processing_time = [3600, 60, 5, 0.1]  # seconds
fig.add_trace(go.Bar(x=processing_types, y=processing_time, name='Processing Speed'), row=1, col=2)

# 3. Variety Sources
data_types = ['Structured', 'Semi-structured', 'Unstructured']
percentage = [25, 35, 40]
fig.add_trace(go.Bar(x=data_types, y=percentage, name='Data Type Distribution'), row=2, col=1)

# 4. 5 V's of Big Data
v_names = ['Volume', 'Velocity', 'Variety', 'Veracity', 'Value']
v_importance = [9, 8, 8, 7, 9]
fig.add_trace(go.Bar(x=v_names, y=v_importance, name='Importance (1-10)'), row=2, col=2)

fig.update_layout(height=700, title_text="Big Data Processing Concepts")
fig.show()

# Simulate distributed processing
print("Big Data Processing Principles Demonstration:")
print("\n1. Distributed Computing Example (Pseudo-code):")
print("   # Instead of processing all data on one machine:")
print("   single_machine_result = process_large_dataset(data)")
print("   ")
print("   # We distribute it across multiple nodes:")
print("   partitions = split_data(data, num_nodes)")
print("   results = []")
print("   for node_idx in range(num_nodes):")
print("       result = process_partition(partitions[node_idx])")
print("       results.append(result)")
print("   final_result = combine_results(results)")

print("\n2. MapReduce Concept:")
print("   - Map: Apply a function to each data element independently, emitting (key, value) pairs")
print("   - Shuffle: Group the emitted values by key")
print("   - Reduce: Combine the grouped values for each key into a single result")
print("   Example for counting loan defaults by grade:")
print("   Map:     records -> (A, 1), (B, 0), (A, 1), (C, 1)")
print("   Shuffle: (A, [1, 1]), (B, [0]), (C, [1])")
print("   Reduce:  (A, sum([1, 1])), (B, sum([0])), (C, sum([1])) -> (A, 2), (B, 0), (C, 1)")

# Demonstrate concept with our data
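print("\n3. Demonstrating with our Lending Club dataset:")
# observed=False keeps every grade category and makes the groupby behavior explicit across pandas versions
default_counts_by_grade = df.groupby('grade', observed=False)['loan_status'].sum()
total_counts_by_grade = df.groupby('grade', observed=False)['loan_status'].count()
default_rates_by_grade = (default_counts_by_grade / total_counts_by_grade) * 100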

print("   Default Rates by Grade:")
for grade, rate in default_rates_by_grade.items():
    print(f"     Grade {grade}: {rate:.2f}%")

# Visualization of MapReduce concept
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Original data
df_sample = df.head(100)
axes[0].scatter(df_sample['fico_score'], df_sample['int_rate'], 
                c=df_sample['grade'].cat.codes, cmap='Set1', alpha=0.7)
axes[0].set_title('Original Dataset')
axes[0].set_xlabel('FICO Score')
axes[0].set_ylabel('Interest Rate')

# After grouping by grade (the shuffle step), shown here as one mean point per grade
grade_means = df_sample.groupby('grade', observed=True)[['fico_score', 'int_rate']].mean()
axes[1].scatter(grade_means['fico_score'], grade_means['int_rate'], 
                s=200, c=range(len(grade_means)), cmap='Set1')
axes[1].set_title('After Grouping by Grade (Group Means)')
axes[1].set_xlabel('Avg FICO Score')
axes[1].set_ylabel('Avg Interest Rate')

# After reducing (aggregated values)
axes[2].bar(default_rates_by_grade.index, default_rates_by_grade.values, 
            color=sns.color_palette("husl", len(default_rates_by_grade)), alpha=0.7)
axes[2].set_title('After Reducing (Default Rates)')
axes[2].set_xlabel('Grade')
axes[2].set_ylabel('Default Rate (%)')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Big Data Technologies
technologies = {
    'Hadoop': ['HDFS', 'MapReduce', 'YARN'],
    'Spark': ['Spark Core', 'Spark SQL', 'MLlib', 'Spark Streaming'],
    'NoSQL': ['MongoDB', 'Cassandra', 'HBase'],
    'Cloud Platforms': ['AWS', 'GCP', 'Azure'],
    'Streaming': ['Kafka', 'Storm', 'Flink']
}

print("\n4. Big Data Technologies:")
for tech, subtechs in technologies.items():
    print(f"   {tech}: {', '.join(subtechs)}")

## Application to Lending Club Dataset

Now let's apply these cloud computing and big data concepts to our Lending Club dataset. We'll demonstrate how these technologies would be used in a real-world scenario to analyze loan data.

In [None]:
# Cloud and Big Data Application to Lending Club Dataset

print("Cloud and Big Data Application to Lending Club Dataset")

# 1. Data Pipeline Simulation
print("\n1. Data Pipeline - Simulating Real-World Processing")

# Simulate data coming from different sources (names match the pipeline metrics below)
print("   - Credit Bureau: FICO scores, credit history")
print("   - Internal Systems: Application data, payment history")
print("   - Employment: Income verification, employment duration")
print("   - External APIs: Additional risk indicators (e.g., social media)")

# Create simulated pipeline metrics (hypothetical throughput and volume figures)
pipeline_metrics = {
    'data_sources': ['Credit Bureau', 'Internal Systems', 'Employment', 'External APIs'],
    'records_per_minute': [50000, 100000, 25000, 10000],
    'data_volume_tb_per_month': [2.5, 5.0, 1.2, 0.5]
}

pipeline_df = pd.DataFrame(pipeline_metrics)

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Records Processed per Minute', 'Data Volume per Month'),
    specs=[[{"type": "bar"}, {"type": "bar"}]]
)

fig.add_trace(go.Bar(x=pipeline_df['data_sources'], y=pipeline_df['records_per_minute'], 
                     name='Records/Minute'), row=1, col=1)
fig.add_trace(go.Bar(x=pipeline_df['data_sources'], y=pipeline_df['data_volume_tb_per_month'], 
                     name='Volume (TB/month)'), row=1, col=2)

fig.update_layout(height=400, title_text="Data Pipeline Metrics - Lending Club")
fig.show()

# 2. Real-time Processing Simulation
print("\n2. Real-time Processing - Credit Risk Assessment")

# Simulate real-time processing of loan applications
time_steps = list(range(1, 11))  # 10 time units
applications_per_unit = [100, 150, 120, 180, 200, 160, 140, 190, 170, 210]
processed_per_unit = [95, 145, 115, 175, 190, 155, 135, 185, 165, 200]  # With some processing delay

fig = go.Figure()
fig.add_trace(go.Scatter(x=time_steps, y=applications_per_unit, 
                         mode='lines+markers', name='Incoming Applications',
                         line=dict(color='blue')))
fig.add_trace(go.Scatter(x=time_steps, y=processed_per_unit, 
                         mode='lines+markers', name='Processed Applications',
                         line=dict(color='green')))

fig.update_layout(
    title='Real-time Loan Application Processing',
    xaxis_title='Time Unit',
    yaxis_title='Number of Applications',
    height=400
)

fig.show()

# 3. Scalability demonstration
print("\n3. Scalability - Cloud Resource Adjustment")

# Simulate resource usage over one day
hours = list(range(24))
base_traffic = np.sin(np.array(hours) * 2 * np.pi / 24)  # Daily pattern
overnight_factor = np.where(np.array(hours) < 5, 0.7, 1.0)  # Reduce load during overnight hours (00:00-04:59)
traffic_load = (1 + base_traffic) * overnight_factor * 100

fig = go.Figure()
fig.add_trace(go.Scatter(x=hours, y=traffic_load, mode='lines+markers', 
                         name='Traffic Load',
                         line=dict(color='purple', width=3),
                         fill='tozeroy'))

fig.update_layout(
    title='Daily Traffic Load - Auto-scaling Simulation',
    xaxis_title='Hour of Day',
    yaxis_title='Traffic Load (Arbitrary Units)',
    height=400
)

fig.show()

# 4. Data Lake Architecture
print("\n4. Data Lake Architecture for Lending Club")
print("   Raw Zone: Original application files, credit bureau reports")
print("   Processed Zone: Cleaned and standardized data")
print("   Curated Zone: Aggregated and model-ready datasets")
print("   Consumption Zone: Ready for reporting and analytics")

# Visualize the data lake architecture
zones = ['Raw Zone', 'Processed Zone', 'Curated Zone', 'Consumption Zone']
zone_contents = [
    ['Application Forms', 'Credit Reports', 'Bank Statements'],
    ['Standardized Data', 'Data Quality Checks', 'Anomaly Detection'],
    ['Feature Sets', 'Model Inputs', 'Aggregations'],
    ['Dashboards', 'Reports', 'Model Predictions']
]
volume = [100, 70, 40, 20]  # Illustrative relative volume; data shrinks as it gets more refined

fig = go.Figure(data=[
    go.Bar(x=zones, y=volume, 
           text=volume, textposition='auto',
           hovertext=[', '.join(items) for items in zone_contents],  # Show each zone's contents on hover
           marker_color=['#FF9999', '#66B2FF', '#99FF99', '#FFD700'])
])

fig.update_layout(
    title='Data Lake Architecture - Volume by Zone',
    xaxis_title='Data Lake Zone',
    yaxis_title='Relative Data Volume',
    height=400
)

fig.show()

# 5. Cost Analysis
print("\n5. Cost Analysis - Traditional vs Cloud Approach")

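# Hypothetical monthly costs in USD for illustration; real costs depend on workload and provider pricing
workloads = ['Data Storage', 'Compute', 'Analytics', 'ML Training', 'Data Transfer']
traditional_cost = [50000, 30000, 20000, 15000, 10000]
cloud_cost = [15000, 10000, 8000, 5000, 3000]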

fig = go.Figure(data=[
    go.Bar(name='Traditional', x=workloads, y=traditional_cost, marker_color='red'),
    go.Bar(name='Cloud', x=workloads, y=cloud_cost, marker_color='green')
])

fig.update_layout(
    title='Cost Comparison: Traditional vs Cloud Infrastructure',
    xaxis_title='Workload',
    yaxis_title='Monthly Cost (USD)',
    barmode='group',
    height=400
)

fig.show()

# Calculate cost savings
traditional_total = sum(traditional_cost)
cloud_total = sum(cloud_cost)
savings = traditional_total - cloud_total
savings_percentage = (savings / traditional_total) * 100

print(f"   Traditional Approach Cost: ${traditional_total:,}/month")
print(f"   Cloud Approach Cost: ${cloud_total:,}/month")
print(f"   Monthly Savings: ${savings:,} ({savings_percentage:.1f}%)")

# Summary
print("\nCloud Computing and Big Data Benefits for Lending Club Analysis:")
print("- Scalability to handle growing loan volumes")
print("- Real-time processing for instant credit decisions")
print("- Cost-effectiveness compared to traditional infrastructure")
print("- Advanced analytics capabilities for risk assessment")
print("- Integration with multiple data sources for comprehensive profiles")
print("- Machine learning capabilities for predictive modeling")

## Conclusion

In this notebook on cloud computing and big data, we explored:

1. **Cloud Computing Definition**: Understanding cloud computing as the delivery of computing services over the Internet.

2. **Importance of Cloud Data Services**: The benefits of scalability, cost-effectiveness, flexibility, reliability, collaboration, and innovation for data science teams.

3. **Big Data Implementation Examples**: How different industries leverage big data for customer analytics, predictive maintenance, and network optimization.

4. **Data Warehousing Concepts**: Fact tables, dimension tables, star schema, snowflake schema, and MPP databases for efficient data analysis.

5. **Big Data Processing Principles**: Distributed computing, MapReduce, and technologies for handling large volumes of data.

6. **Application to Lending Club Dataset**: Demonstrating how these concepts apply to real-world financial data analysis.

These technologies and concepts matter for modern data science projects, especially when working with large-scale datasets like the Lending Club data. Cloud computing and big data technologies let data scientists process and analyze datasets that would be impractical to handle on a single machine.