# Python Academy

## **By Johnson & Johnson | D4U – Platform Development**

### **Data Skills Transformation Program**

As part of J&J's D4U initiative, we are building scalable data workflows. This academy provides training in Python and in PySpark on the Databricks platform.

---

### **Learning Outcomes**

Upon completion of this academy, participants will:
- Master Python fundamentals for healthcare data processing
- Utilize PySpark for distributed computing environments
- Operate effectively within the Databricks platform
- Apply industry best practices for clinical data management
- Develop automated data pipelines for regulatory compliance

---

### **Curriculum Overview**

| **Module** | **Focus Area** | **Duration** | **Key Skills** |
|------------|----------------|--------------|-----------------|
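| **Module 1** | Core Python Foundations | 2h 00m | Variables, Functions, Classes, Data Structures, Exception Handling, APIs, File Operations, Context Managers |
| **Module 2** | PySpark Fundamentals | 2h 15m | Distributed Computing, DataFrames, Transformations and Actions |
| **Module 3** | PySpark Advanced | 1h 30m | Window Functions, Joins and Aggregations, Performance Tuning |

**Total Learning Time: 5h 45m**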

---

### **Module 1: Core Python Foundations - Complete Session List (2h 00m)**

| **Session** | **Title** | **Status** | **Relevance to PySpark** |
|-------------|-----------|------------|---------------------------|
| **1.1** | Basics | Existing | Essential foundation (variables, expressions, types) |
| **1.2** | Modules and Packages | Existing | Supports modularity, used in PySpark imports |
| **1.3** | Data Structures | Existing | Core for manipulating lists, dicts, comprehensions |
| **1.4** | Advanced Data Structures | Expand | Important for handling nested schemas in DataFrames |
| **1.5** | Conditions and Loops | Existing | Critical for control flow and PySpark logic |
| **1.6** | Functions | Existing | Includes lambda, map, filter – core concepts in PySpark transformations |
| **1.7** | Dates and Times | Existing | Important for timestamp data handling |
| **1.8** | Regular Expressions | Existing | Useful for data cleaning and parsing text |
| **1.9** | Classes | Existing | Builds understanding of structure and abstraction |
| **1.10** | Decorators | Existing | Supports advanced functions, relevant for modular pipelines |
| **1.11** | Virtual Environments | Existing | Good practice for environment isolation |
| **1.12** | Exception Handling | Add | Needed for building robust PySpark data pipelines |
| **1.13** | Context Managers and File I/O | Add | Essential for reading/writing files and managing resources |
| **1.14** | Iterators and Generators | Expand | Maps directly to lazy evaluation in PySpark |
| **1.15** | String Processing | Add | Important for cleaning and manipulating string data |
| **1.16** | APIs and JSON | Add | Common for ingesting data from APIs and parsing input files |
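
The table links Sessions 1.6 and 1.14 to PySpark transformations and lazy evaluation. The sketch below illustrates that idea in plain Python only, with no Spark involved. The records, the `read_records` helper, and the 7.0 threshold are hypothetical and exist only for illustration. A pipeline built from a generator, `filter`, and `map` does no work until its result is consumed, which is loosely analogous to Spark deferring transformations until an action runs.

```python
# Hypothetical sample records, for illustration only.
SAMPLE_RECORDS = [
    {"patient_id": 1001, "efficacy_score": 8.5},
    {"patient_id": 1002, "efficacy_score": 7.2},
    {"patient_id": 1003, "efficacy_score": 3.1},
]

rows_read = []  # tracks which rows the generator has actually produced


def read_records(records):
    """Yield records one at a time, logging each read."""
    for record in records:
        rows_read.append(record["patient_id"])
        yield record


# Build the pipeline. Nothing is read yet (loosely like Spark transformations).
high_scores = filter(lambda r: r["efficacy_score"] >= 7.0, read_records(SAMPLE_RECORDS))
patient_ids = map(lambda r: r["patient_id"], high_scores)
rows_before = list(rows_read)  # snapshot taken before any consumption

# Consuming the iterator triggers the work (loosely like a Spark action).
result = list(patient_ids)
```

Inspecting `rows_before` and `rows_read` after running the block shows that the source was only read once `list(patient_ids)` was called.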

---

### **Healthcare Data Applications**

All exercises and examples focus on real-world scenarios:
- Patient data processing and statistical analysis
- Clinical trial data management and validation
- Drug efficacy studies and regulatory reporting
- Hospital operations optimization
- Healthcare analytics and compliance reporting

---

### **Program Features**

This academy provides comprehensive learning materials designed for healthcare professionals:

#### **Practical Training Components**
- Real healthcare scenarios from clinical trial environments
- Interactive coding exercises with immediate validation
- Data visualization techniques for regulatory submissions
- Industry-standard workflows from J&J data engineering teams

#### **Assessment and Validation**
- Topic-specific knowledge assessments
- Hands-on exercises applying concepts to clinical problems
- Progress tracking throughout the learning journey
- Competency validation for each skill area

#### **Professional Support**
- Comprehensive reference documentation
- Code examples for practical implementation
- Review exercises to reinforce critical concepts
- Access to subject matter experts

---

### **Environment Setup & Validation**

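Ensure your Databricks environment is properly configured for the academy. Execute the following cells in order to validate your setup; the Databricks validation cell reuses the `spark` session created by the PySpark validation cell.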

In [None]:
# 1. Python Environment Validation
import sys
import platform
from datetime import datetime

print("Python Environment Validation")
print("=" * 50)
print(f"Python Version: {sys.version}")
print(f"Platform: {platform.system()} {platform.release()}")
print(f"Validation Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\nCore Python environment is ready for use.")

# 2. Test Basic Data Structures
sample_healthcare_data = {
    "patient_id": [1001, 1002, 1003],
    "treatment_group": ["Drug_A", "Drug_B", "Placebo"],
    "efficacy_score": [8.5, 7.2, 3.1]
}

print("\nSample Healthcare Data Structure:")
for field, values in sample_healthcare_data.items():
    print(f"  {field}: {values}")

print("\nData structures are functioning correctly.")
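
The `sample_healthcare_data` dictionary above is column-oriented (one list per field), whereas DataFrame examples are often written as one tuple per row. As a minimal sketch of moving between the two layouts, the hypothetical helper `columns_to_rows` below (not part of the academy materials) uses `zip` to transpose the columns and checks that they have equal length first.

```python
sample_healthcare_data = {
    "patient_id": [1001, 1002, 1003],
    "treatment_group": ["Drug_A", "Drug_B", "Placebo"],
    "efficacy_score": [8.5, 7.2, 3.1],
}


def columns_to_rows(columns):
    """Convert a dict of equal-length columns into a list of row tuples."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("All columns must have the same length")
    # zip(*...) transposes the column lists into per-row tuples
    return list(zip(*columns.values()))


rows = columns_to_rows(sample_healthcare_data)
column_names = list(sample_healthcare_data)  # dict keys keep insertion order
```

The resulting `rows` and `column_names` have the row-plus-column-names shape that row-oriented APIs typically expect.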

In [None]:
# 3. PySpark Environment Validation
try:
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
    
    # Initialize Spark session
    spark = SparkSession.builder \
        .appName("D4U_Academy_Environment_Check") \
        .getOrCreate()
    
    print("PySpark Environment Validation")
    print("=" * 50)
    print(f"Spark Version: {spark.version}")
    
    # For Spark Connect compatibility, use different approach to get app name
    try:
        app_name = spark.sparkContext.appName
        print(f"Application Name: {app_name}")
    except Exception:
        # Spark Connect doesn't support sparkContext, use alternative
        print("Application Name: D4U_Academy_Environment_Check (Spark Connect Mode)")
    
    # Create sample healthcare DataFrame for validation
    healthcare_schema = StructType([
        StructField("patient_id", IntegerType(), True),
        StructField("hospital_site", StringType(), True),
        StructField("primary_diagnosis", StringType(), True),
        StructField("length_of_stay", IntegerType(), True),
        StructField("treatment_cost", DoubleType(), True)
    ])
    
    sample_clinical_data = [
        (1001, "General Hospital", "Type 2 Diabetes", 3, 15750.50),
        (1002, "City Medical Center", "Hypertension", 2, 8920.00),
        (1003, "Regional Medical Center", "Cardiovascular Disease", 7, 42300.75)
    ]
    
    clinical_df = spark.createDataFrame(sample_clinical_data, healthcare_schema)
    
    print("\nSample Clinical Data DataFrame:")
    clinical_df.show()
    
    print("PySpark environment is ready for distributed processing.")
    print("DataFrame operations are functioning correctly.")
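    # Report Spark Connect only when sparkContext is actually unavailable
    try:
        spark.sparkContext
    except Exception:
        print("✅ Spark Connect mode detected - optimized for cloud environments")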
    
except Exception as e:
    print(f"PySpark validation failed: {str(e)}")
    print("Please ensure PySpark is properly configured in your environment.")
    print("Note: If using Spark Connect, some legacy APIs may not be available.")

In [None]:
# 4. Databricks Platform Validation
try:
    import os
    
    print("Databricks Platform Validation")
    print("=" * 50)
    
    # Check Databricks runtime environment
    if 'DATABRICKS_RUNTIME_VERSION' in os.environ:
        print(f"Databricks Runtime Version: {os.environ['DATABRICKS_RUNTIME_VERSION']}")
        print("Running in Databricks managed environment.")
    else:
        print("Not running in Databricks environment (local development mode)")
    
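    # Detect Spark Connect mode
    try:
        # Accessing sparkContext raises in Spark Connect sessions
        spark.sparkContext
        spark_mode = "Standard Spark Session"
    except NameError:
        # No session exists because the PySpark validation cell has not run
        spark_mode = "Unavailable (run the PySpark validation cell first)"
    except Exception:
        spark_mode = "Spark Connect Mode (Cloud-optimized)"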
    
    print(f"Spark Connection Mode: {spark_mode}")
    
    # Test Databricks display function
    try:
        platform_status = [
            {"Component": "Python Core", "Status": "Ready"},
            {"Component": "PySpark Engine", "Status": "Ready"},
            {"Component": "Databricks Utils", "Status": "Ready"},
            {"Component": "Healthcare Data Examples", "Status": "Ready"},
            {"Component": "Spark Mode", "Status": spark_mode}
        ]
        
        status_df = spark.createDataFrame(platform_status)
        
        # Try to use display function (Databricks-specific)
        try:
            display(status_df)
            print("✅ Databricks display functionality is operational.")
        except NameError:
            # Fallback to standard show() if display() not available
            status_df.show(truncate=False)
            print("✅ Standard DataFrame display working (display() function not available)")
        
    except Exception as e:
        print(f"DataFrame creation failed: {str(e)}")
    else:
        # Report success only when the status DataFrame was created and shown
        print("\nEnvironment validation completed successfully.")
        print("Platform is ready for D4U Python Academy training.")
        if spark_mode.startswith("Spark Connect"):
            print("🚀 Spark Connect provides enhanced security and performance for cloud workloads")
    
except Exception as e:
    print(f"Databricks validation encountered an issue: {str(e)}")
    print("This may be normal depending on your execution environment.")
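
The validation cells above repeat the same check: try to access `sparkContext` and treat a failure as Spark Connect. As a sketch only, that check could be factored into one helper. The name `detect_spark_mode` and the stub classes are hypothetical. The stubs stand in for real sessions so the example runs without Spark installed. The sketch relies on the same assumption as the cells above, namely that a Spark Connect session raises when `sparkContext` is accessed.

```python
def detect_spark_mode(session):
    """Return a label for the session type based on sparkContext access."""
    try:
        session.sparkContext  # assumed to raise on Spark Connect sessions
        return "Standard Spark Session"
    except Exception:
        return "Spark Connect Mode"


# Stand-in objects so the sketch runs without a Spark installation.
class _ClassicStub:
    sparkContext = object()


class _ConnectStub:
    @property
    def sparkContext(self):
        raise RuntimeError("sparkContext is not supported in this stub")


classic_mode = detect_spark_mode(_ClassicStub())
connect_mode = detect_spark_mode(_ConnectStub())
```

In a notebook, the call would be `detect_spark_mode(spark)`, made once after the session is created, with the result reused wherever the mode is reported.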

---

### **Structured Learning Path**

Complete modules sequentially for optimal skill development:

#### **Module 1: Core Python Foundations** (2h 00m)
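- **Sessions 1.1-1.3:** Python Basics, Modules and Packages, and Data Structures
- **Sessions 1.4-1.8:** Advanced Data Structures, Conditions and Loops, Functions, Dates and Times, and Regular Expressions
- **Sessions 1.9-1.12:** Classes, Decorators, Virtual Environments, and Exception Handling
- **Sessions 1.13-1.16:** Context Managers and File I/O, Iterators and Generators, String Processing, and APIs and JSON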
- **Assessment:** Comprehensive Python Knowledge Validation

#### **Module 2: PySpark Fundamentals** (2h 15m)
- **Exercise 2.1:** Introduction to Distributed Computing
- **Exercise 2.2:** PySpark DataFrames and Core Operations
- **Exercise 2.3:** Transformations and Actions
- **Assessment:** PySpark Fundamentals Proficiency Check

#### **Module 3: PySpark Advanced** (1h 30m)
- **Exercise 3.1:** Complex Operations and Window Functions
- **Exercise 3.2:** Joins and Aggregations
- **Exercise 3.3:** Performance Tuning and Optimization
- **Assessment:** Final Capstone Project

---

### **Program Initiation**

**Begin with Module 1: Core Python Foundations**

This streamlined training program supports J&J's **D4U initiative** for next-generation healthcare data solutions and regulatory compliance.

---

### **Professional Support Resources**

- **Documentation:** Comprehensive technical guides and API references
- **Community:** Professional network of healthcare data practitioners
- **Expert Consultation:** Direct access to J&J data engineering teams

**Begin your professional data transformation journey.**