- Sqoop does one thing: it moves data into and out of HDFS, to and from relational databases. That's it.

## Sqoop Complete Reference Notes

**1. Importing from SQL Database Table**

• Basic import syntax uses JDBC connection to relational databases
  - `sqoop import --connect jdbc:mysql://localhost/database_name --username user --password password --table table_name --target-dir /user/hdfs/target_directory`
  - Sqoop uses JDBC drivers to connect to all major databases (MySQL, PostgreSQL, Oracle, SQL Server)
  - Data imported as text files by default into HDFS
  - Creates one file per mapper in the target directory
  - Automatically generates Java classes for the imported table structure

• Key connection parameters
  - `--connect`: JDBC URL specifying database type, host, port, and database name
  - `--username` / `--password`: Database authentication credentials  
  - `--table`: Source table name in the database
  - `--target-dir`: Specific HDFS destination directory path
  - `--warehouse-dir`: Parent directory where multiple table imports are stored
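
The parameters above can be assembled programmatically, which keeps each flag next to its value and avoids shell-quoting mistakes. This is a minimal sketch: `build_import_command` is a hypothetical helper, the connection values are the placeholders from these notes, and it uses `--password-file` (section 7) rather than an inline `--password`. It only builds and prints the command; nothing is executed.

```python
import shlex


def build_import_command(jdbc_url, username, password_file, table, target_dir):
    """Return the argument list for a basic single-table Sqoop import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--target-dir", target_dir,
    ]


cmd = build_import_command(
    jdbc_url="jdbc:mysql://localhost/database_name",
    username="user",
    password_file="/user/sqoop/password.txt",
    table="table_name",
    target_dir="/user/hdfs/target_directory",
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The same list could be handed to `subprocess.run(cmd, check=True)` on a machine where Sqoop is installed.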

**2. Mappers and Primary Key Configuration**

• Default mapper behavior
  - Sqoop uses 4 mappers by default for parallel data processing
  - Automatically uses table's primary key for data splitting across mappers
  - Each mapper processes a specific range of primary key values
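  - Import fails if the table has no primary key, more than one mapper is requested, and no `--split-by` column is given
  - Sqoop divides the range between the split column's minimum and maximum values evenly, so mapper workloads are balanced only when the key values are spread uniformly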

• Controlling mapper configuration
  - `--num-mappers 2` or `-m 2`: Specify exact number of mappers
  - `--split-by employee_id`: Use custom column for data splitting instead of primary key
  - `-m 1`: Single mapper mode (no primary key requirement)
    * Useful for small tables or tables without primary keys
    * Slower but safer for problematic data distributions

• Split column requirements and considerations
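  - Works best with a numeric, date, or timestamp data type; text columns can be split but tend to give uneven ranges, and Sqoop 1.4.7 only allows it with `-Dorg.apache.sqoop.splitter.allow_text_splitter=true`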
  - Should have good data distribution across value ranges
  - Avoid columns with many null values or highly skewed data
  - More mappers increase speed but also increase database server load
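  - No fixed rule gives the optimal mapper count: it is capped by cluster capacity and by how many concurrent connections the database tolerates, so raise it gradually from the default and measure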
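
Why skewed values hurt range splitting is easiest to see in a small model. The sketch below is a simplified stand-in, not Sqoop's actual splitter: `split_ranges` and `rows_per_mapper` are hypothetical helpers that cut the min-to-max span of an integer split column into equal-width ranges and count the rows landing in each one.

```python
def split_ranges(min_val, max_val, num_mappers):
    """Cut [min_val, max_val] into num_mappers contiguous, near-equal-width ranges."""
    span = max_val - min_val
    edges = [min_val + (span * i) // num_mappers for i in range(num_mappers)]
    edges.append(max_val)
    return list(zip(edges[:-1], edges[1:]))


def rows_per_mapper(ids, ranges):
    """Count rows per range; every range is [lo, hi) except the last, which is [lo, hi]."""
    counts = []
    last = len(ranges) - 1
    for i, (lo, hi) in enumerate(ranges):
        if i == last:
            counts.append(sum(1 for v in ids if lo <= v <= hi))
        else:
            counts.append(sum(1 for v in ids if lo <= v < hi))
    return counts


# Evenly spread keys: every mapper gets roughly a quarter of the rows.
uniform_ids = list(range(1, 1001))
uniform_counts = rows_per_mapper(uniform_ids, split_ranges(1, 1000, 4))

# One outlier stretches the range: the first mapper does almost all the work.
skewed_ids = list(range(1, 901)) + [1_000_000]
skewed_counts = rows_per_mapper(skewed_ids, split_ranges(1, 1_000_000, 4))

print(uniform_counts)
print(skewed_counts)
```

In the skewed case two of the four ranges contain no rows at all, which is the situation a better `--split-by` column is meant to avoid.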

**3. Importing Partial Data**

• WHERE clause filtering
  - `--where "salary > 50000 AND department = 'IT'"`: Filters data at database source
  - More efficient than importing all data and filtering in Hadoop
  - Reduces network traffic and storage requirements
  - Can use any valid SQL WHERE clause conditions
  - Filtering happens before data transfer, improving performance

• Column selection
  - `--columns "id,name,salary,department"`: Import only specified columns
  - Significantly reduces data transfer and storage needs
  - Column names must match exact database column names
  - Order of columns in output follows the specified sequence
  - Useful for excluding large text or binary columns

• Custom query imports
  - `--query "SELECT id, name, salary FROM employees WHERE salary > 50000 AND \$CONDITIONS"`: Free-form SQL queries
  - Allows complex joins, calculations, and transformations
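  - `$CONDITIONS` placeholder is mandatory in every `--query` import, including single-mapper runs
    * Sqoop replaces this with appropriate conditions for each mapper
    * Required even if query doesn't actually need additional conditions
    * Inside double quotes write `\$CONDITIONS` so the shell does not expand it; inside single quotes write `$CONDITIONS`
  - Parallel runs also need `--split-by`, because a free-form query has no primary key to infer; otherwise use `-m 1`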
  - Must specify `--target-dir` explicitly (cannot use `--warehouse-dir`)
  - Cannot be used with `--table` parameter
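
The role of `$CONDITIONS` becomes clearer when the substitution is written out. This is a rough model under stated assumptions, not Sqoop's code: `mapper_queries` is a hypothetical helper, the split column `id` and the two ranges are made up for illustration, and the exact predicate text Sqoop generates may differ.

```python
def mapper_queries(query, split_col, ranges):
    """Replace $CONDITIONS with one range predicate per mapper."""
    queries = []
    last = len(ranges) - 1
    for i, (lo, hi) in enumerate(ranges):
        # Only the final range includes its upper bound, so no row is read twice.
        upper_op = "<=" if i == last else "<"
        condition = f"( {split_col} >= {lo} ) AND ( {split_col} {upper_op} {hi} )"
        queries.append(query.replace("$CONDITIONS", condition))
    return queries


query = ("SELECT id, name, salary FROM employees "
         "WHERE salary > 50000 AND $CONDITIONS")

queries = mapper_queries(query, "id", [(1, 500), (500, 1000)])
for q in queries:
    print(q)
```

Each mapper ends up with the same query plus its own slice of the split column, which is why the placeholder has to sit inside the `WHERE` clause.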

**4. Field Separators and Delimiters**

• Default delimiter behavior
  - Field separator: Comma (,)
  - Record separator: Newline (\n)
  - Escape character: None by default (set one with `--escaped-by`)
  - No field enclosure by default

• Custom delimiter configuration
  - `--fields-terminated-by '\t'`: Tab-separated values (cleaner for most data)
  - `--lines-terminated-by '\n'`: Custom record separator
  - `--escaped-by '\\'`: Character to escape special characters
  - `--enclosed-by '"'`: Enclose all fields with specified character
  - `--optionally-enclosed-by '"'`: Enclose only fields containing delimiters

• Delimiter selection best practices
  - Choose delimiters that don't appear in your actual data
  - Tab separation (\t) is common choice for cleaner data files
  - Use field enclosure when data contains delimiter characters
  - Proper escaping prevents data corruption during import
  - Consider downstream processing tool requirements
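
The effect of these choices can be tried out without a cluster. The sketch below uses Python's `csv` module as an analogy for Sqoop's text output, not as a reproduction of it: the sample row is invented, and the `fmt` settings are only comparable in spirit to `--optionally-enclosed-by '"'` together with `--escaped-by '\\'`.

```python
import csv
import io

row = ["42", "Smith, Jane", 'said "hi"']

# No enclosure or escaping: the comma inside the name splits one field into two.
naive_fields = ",".join(row).split(",")

# Tab as separator avoids the clash because no field contains a tab.
tab_fields = "\t".join(row).split("\t")

# Enclose fields only when needed and escape embedded quotes with a backslash.
fmt = dict(delimiter=",", quotechar='"', escapechar="\\",
           doublequote=False, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
buf = io.StringIO()
csv.writer(buf, **fmt).writerow(row)
enclosed_line = buf.getvalue()

# A reader configured the same way recovers the original three fields.
parsed = next(csv.reader(io.StringIO(enclosed_line), **fmt))

print(len(naive_fields), len(tab_fields), parsed == row)
```

The point carries over to any downstream tool: whatever reads the files has to be configured with the same delimiter, enclosure, and escape settings that wrote them.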

**5. Sqoop Eval and Connectors**

• Sqoop eval functionality
  - `sqoop eval --connect jdbc:mysql://localhost/testdb --username root --password password --query "SELECT COUNT(*) FROM employees"`
  - Executes simple queries without creating HDFS files
  - Uses single database connection (no parallelism)
  - Not limited to reads: the query can also be a DML statement such as an `INSERT`, so run it with care
  - Lightweight alternative for quick data validation

• Primary use cases
  - Testing database connectivity before full imports
  - Quick data validation and sample queries
  - Schema verification and column analysis
  - Simple aggregations and counts
  - Debugging connection and authentication issues

• Limitations and considerations
  - No parallel execution capabilities
  - Cannot create HDFS files or Java classes
  - Intended for quick evaluation with simple statements, not for production workflows
  - Not suitable for large result sets
  - Results displayed in console only

**6. Incremental Import and Sqoop Jobs**

• Incremental import modes
  - **Append mode**: For insert-only tables that only grow
    * `--incremental append --check-column order_id --last-value 1000`
    * Imports only records where check column > last value
    * Suitable for log tables, transaction tables
    * Check column should be auto-incrementing integer
  - **Last-modified mode**: For tables with updates and inserts
    * `--incremental lastmodified --check-column last_update --last-value "2023-01-01 00:00:00"`
    * Imports records modified after specified timestamp
    * Requires timestamp or date column tracking modifications
    * Handles both new and updated records

• Sqoop jobs for automation
  - `sqoop job --create job_name -- import --table employees --incremental append --check-column id --last-value 0`
  - `sqoop job --exec job_name`: Execute previously created job
  - `sqoop job --list`: Display all created jobs
  - `sqoop job --show job_name`: Show job configuration
  - `sqoop job --delete job_name`: Remove job definition

• Job benefits and considerations
  - Automatically tracks and updates last imported value
  - Saves significant time and bandwidth for large tables
  - Jobs stored in metastore for persistence
  - Check column should be indexed for optimal performance
  - Suitable for regular ETL processes and data synchronization
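
The bookkeeping a saved job does for append mode can be modeled in a few lines. This is a simplified illustration, not Sqoop's implementation: `incremental_append` is a hypothetical function, and the in-memory list of dictionaries stands in for the source table.

```python
def incremental_append(rows, check_column, last_value):
    """Return rows whose check column exceeds last_value, plus the new last value."""
    new_rows = [r for r in rows if r[check_column] > last_value]
    new_last = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, new_last


source = [{"order_id": i, "amount": i * 10} for i in range(998, 1004)]

# First run: only rows above the stored value 1000 are imported.
first_batch, last_value = incremental_append(source, "order_id", 1000)

# Second run with no new rows: nothing is imported and the value is unchanged.
second_batch, last_value_after = incremental_append(source, "order_id", last_value)

print([r["order_id"] for r in first_batch], last_value)
print(second_batch, last_value_after)
```

A saved job persists the equivalent of `last_value` between executions, so each `--exec` picks up where the previous one stopped.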

**7. Password Protection Methods**

• Password file approach
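  - Create password file: `echo -n "mypassword" | hdfs dfs -put - /user/sqoop/password.txt` (the `-n` matters: Sqoop reads the whole file, so a trailing newline would become part of the password)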
  - Set restrictive permissions: `hdfs dfs -chmod 600 /user/sqoop/password.txt`
  - Use in sqoop command: `--password-file /user/sqoop/password.txt`
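  - Password does not appear in the Sqoop command line or in process lists (the `echo` used to create the file still lands in shell history, so create the file another way for real credentials)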
  - File can be stored in HDFS or local filesystem
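
One way to create the file locally without a trailing newline and without the password passing through a shell command is sketched below. It assumes a POSIX filesystem; `write_password_file` is a hypothetical helper, and the temporary directory and the literal password are placeholders.

```python
import os
import stat
import tempfile


def write_password_file(path, password):
    """Write the password with no trailing newline, readable by the owner only."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(password)  # no newline appended
    os.chmod(path, 0o600)  # enforce the mode even if the file already existed


workdir = tempfile.mkdtemp()
password_path = os.path.join(workdir, "password.txt")
write_password_file(password_path, "mypassword")

with open(password_path, "rb") as handle:
    stored = handle.read()
mode = stat.S_IMODE(os.stat(password_path).st_mode)

print(stored, oct(mode))
```

The resulting file could then be uploaded with `hdfs dfs -put` and referenced through `--password-file`.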

• Security best practices
  - Never use plain text passwords in production scripts
  - Set file permissions to 600 (read-write for owner only)
  - Consider using Hadoop credential providers for additional encryption
  - Store password files in secure HDFS locations
  - Regularly rotate passwords and update files accordingly

• Alternative security methods
  - Kerberos authentication for enterprise environments
  - `-P`: prompt for the password on the console at run time instead of passing it as an argument
  - Integration with external key management systems
  - Using database service accounts with minimal privileges

**8. Last Modified Column Handling**

• Implementation requirements
  - `--incremental lastmodified --check-column last_modified_date --last-value "2023-12-01 00:00:00"`
  - Table must have timestamp, datetime, or date column
  - Column should be automatically updated on record modifications
  - `--merge-key product_id`: Specify primary key for handling updates

• Technical considerations
  - Supported data types: timestamp, datetime, date columns
  - Time zone consistency crucial for accurate incremental imports
  - Last-modified column should be indexed for performance
  - Handle database triggers or application-level timestamp updates
  - Consider clock synchronization between database and Hadoop cluster

• Merge operations for updates
  - Use `--merge-key` to handle updated records properly
  - Merge combines new and updated records with existing data
  - Requires additional MapReduce job for merge operation
  - More complex but handles data changes accurately
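
What the merge step achieves can be shown with plain dictionaries. This is a conceptual sketch only: in Sqoop the merge runs as a MapReduce job over HDFS files, whereas the hypothetical `merge_by_key` below works on in-memory rows with made-up product data.

```python
def merge_by_key(existing, incoming, merge_key):
    """Keep one row per key, letting incoming rows replace existing ones."""
    merged = {row[merge_key]: row for row in existing}
    merged.update({row[merge_key]: row for row in incoming})
    return list(merged.values())


existing = [
    {"product_id": 1, "price": 10.0},
    {"product_id": 2, "price": 20.0},
]
incoming = [
    {"product_id": 2, "price": 25.0},  # updated record
    {"product_id": 3, "price": 30.0},  # new record
]

merged = merge_by_key(existing, incoming, "product_id")
print(merged)
```

Without a merge key the updated row for product 2 would simply be appended, leaving two versions of the same record in the dataset.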

**9. File Formats and Storage Options**

• Text format (default)
  - Human readable and debuggable
  - Compatible with all Hadoop ecosystem tools
  - Larger file sizes and slower processing
  - Good for initial development and testing

• Sequence file format
  - `--as-sequencefile`: Binary format optimized for MapReduce
  - Splittable for parallel processing
  - Supports compression efficiently
  - Not human readable but better performance

• Avro format
  - `--as-avrodatafile`: Schema evolution support
  - Cross-platform compatibility
  - Schema embedded in file header
  - Compact binary format good for streaming

• Parquet format
  - `--as-parquetfile`: Columnar storage optimized for analytics
  - Excellent compression ratios
  - Fast query performance for analytical workloads
  - Best choice for read-heavy analytical processing

• ORC format
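  - `--as-orcfile`: Optimized Row Columnar format; this flag is not among the import options documented for Apache Sqoop 1.4.x, so check your distribution before relying on it
  - With stock 1.4.x, ORC output usually goes through the HCatalog integration (`--hcatalog-table` plus `--hcatalog-storage-stanza "stored as orcfile"`)
  - Designed for Hive integration
  - ACID transactions are a feature of Hive tables stored as ORC, not of the file format on its own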
  - Excellent compression and performance in Hive ecosystem
  - Built-in indexing and statistics

• Format selection guidelines
  - **Text**: Development, debugging, simple processing
  - **Sequence**: General MapReduce processing
  - **Avro**: Schema evolution, cross-platform needs
  - **Parquet**: Analytics, BI tools, columnar analysis
  - **ORC**: Hive-centric environments, ACID requirements

**10. Multiple Tables and Null Value Handling**

• Importing all tables
  - `sqoop import-all-tables --connect jdbc:mysql://localhost/testdb --username root --password password --warehouse-dir /user/data/`
  - Imports every table in specified database
  - `--exclude-tables table1,table2`: Skip specific tables
  - Tables imported sequentially, not in parallel
  - Each table creates separate directory under warehouse

• Null value representation
  - `--null-string 'NULL'`: Representation for null string values
  - `--null-non-string '-999'`: Representation for null numeric/date values
  - Choose representations not present in actual data
  - Consistent null handling crucial for downstream processing
  - Different settings for string vs non-string data types
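
How the two settings interact with column types, and why the token choice matters, can be illustrated with a small model. `render_row` is a hypothetical helper and the rows are invented; it only mimics the idea of substituting one token for null strings and another for null non-strings.

```python
def render_row(row, column_types, null_string, null_non_string, sep=","):
    """Render one record as delimited text, substituting tokens for None."""
    fields = []
    for value, col_type in zip(row, column_types):
        if value is None:
            fields.append(null_string if col_type is str else null_non_string)
        else:
            fields.append(str(value))
    return sep.join(fields)


types = (int, str, int)

# A row with a null name and a null bonus...
with_nulls = render_row((8, None, None), types, "NULL", "-999")

# ...is indistinguishable from a row that really contains those values.
real_values = render_row((8, "NULL", -999), types, "NULL", "-999")

# A token that cannot occur in the data removes the ambiguity.
safe = render_row((8, None, None), types, "\\N", "\\N")

print(with_nulls, real_values, safe)
```

The collision between the first two rows is the reason for choosing representations that never occur in the real data.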

• Multi-table considerations
  - Each table import runs as separate job
  - Total time is sum of individual table import times
  - Database connections used efficiently across imports
  - Monitor database server load during multi-table imports
  - Consider foreign key relationships and import order

**11. Sqoop Export with Staging Tables**

• Basic export functionality
  - `sqoop export --connect jdbc:mysql://localhost/testdb --table target_table --export-dir /user/data/employees`
  - Transfers data from HDFS to relational database tables
  - Reverse operation of import process
  - Requires pre-existing target table with compatible schema

• Staging table approach
  - `--staging-table staging_table --clear-staging-table`: Use intermediate staging table
  - Data loaded to staging table first, then moved to production table
  - Provides atomic operation - all data or none
  - If export fails, production table remains unchanged
  - `--clear-staging-table`: Cleans staging table before export

• Export safety and reliability
  - Staging tables prevent partial data in case of failures
  - Production table remains available during export process
  - Sqoop moves staged rows to the target automatically once every mapper succeeds; there is no built-in pause for manual validation
  - Adds overhead but provides data consistency guarantees
  - Essential for critical production systems
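
The staging pattern itself is easy to reproduce with an in-memory SQLite database. This is an analogy under stated assumptions, not Sqoop's code: `export_via_staging` is a hypothetical function, the table names reuse the placeholders from these notes, and per-row commits stand in for independent mapper transactions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target_table (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE staging_table (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO target_table VALUES (1, 'existing')")
conn.commit()


def count(table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def export_via_staging(rows):
    with conn:
        conn.execute("DELETE FROM staging_table")  # role of --clear-staging-table
    try:
        for row in rows:
            with conn:  # each insert commits on its own, like separate mappers
                conn.execute("INSERT INTO staging_table VALUES (?, ?)", row)
    except sqlite3.Error:
        return False  # staging may hold partial data; target_table is untouched
    with conn:  # final move happens in a single transaction
        conn.execute("INSERT INTO target_table SELECT id, name FROM staging_table")
        conn.execute("DELETE FROM staging_table")
    return True


# A duplicate key makes the load fail halfway through.
failed = export_via_staging([(2, "a"), (2, "b")])
target_after_failure = count("target_table")
staging_after_failure = count("staging_table")

# A clean batch is staged and then moved in one step.
succeeded = export_via_staging([(2, "new"), (3, "newer")])
target_after_success = count("target_table")

print(failed, target_after_failure, staging_after_failure)
print(succeeded, target_after_success)
```

After the failed run the leftover staging row is exactly what the clearing step removes at the start of the next attempt.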

• Performance considerations
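  - `--batch`: Use JDBC batch mode so statements are sent to the database in groups
  - `-Dsqoop.export.records.per.statement=1000`: Rows per INSERT statement; set as a Hadoop property right after `sqoop export`, since Sqoop 1.4.x documents no `--records-per-statement` option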
  - Larger batches reduce database round trips
  - Monitor database transaction log growth during exports
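
Generic Hadoop `-D` properties have to come directly after the tool name, before any tool-specific option, which is easy to get wrong when assembling long commands by hand. The sketch below is one way to enforce that ordering: `build_export_command` is a hypothetical helper, the connection values are the placeholders from these notes, and the batch size of 1000 is illustrative rather than a recommendation. Nothing is executed.

```python
import shlex


def build_export_command(jdbc_url, table, export_dir, staging_table=None,
                         records_per_statement=None, batch=False):
    """Return the argument list for a Sqoop export, with -D properties first."""
    cmd = ["sqoop", "export"]
    if records_per_statement is not None:
        # Generic Hadoop arguments must precede tool-specific ones.
        cmd.append(f"-Dsqoop.export.records.per.statement={records_per_statement}")
    cmd += ["--connect", jdbc_url, "--table", table, "--export-dir", export_dir]
    if staging_table:
        cmd += ["--staging-table", staging_table, "--clear-staging-table"]
    if batch:
        cmd.append("--batch")
    return cmd


export_cmd = build_export_command(
    jdbc_url="jdbc:mysql://localhost/testdb",
    table="target_table",
    export_dir="/user/data/employees",
    staging_table="staging_table",
    records_per_statement=1000,
    batch=True,
)

print(shlex.join(export_cmd))
```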

**12. Sqoop Performance Tuning**

• Mapper and connection optimization
  - Optimal mapper count: No fixed rule; it is capped by cluster capacity and by the number of concurrent connections the database tolerates
  - `--num-mappers 8`: Match cluster capacity and database capability
  - `--fetch-size 1000`: Larger fetch sizes reduce database round trips
  - `--split-by id`: Choose well-distributed numeric columns for splitting

• Database-level optimizations
  - Use indexes on split columns and WHERE clause conditions
  - `--connect "jdbc:mysql://localhost/testdb?useSSL=false&rewriteBatchedStatements=true"`: Connection parameter tuning
  - Each mapper opens its own JDBC connection, so the mapper count sets the number of concurrent database connections
  - Database server configuration for concurrent connections

• Native vs JDBC drivers
  - `--direct`: Use native database utilities (MySQL, PostgreSQL)
    * Bypasses JDBC for bulk operations
    * Often 2-3x faster than JDBC approach
    * Uses database-specific bulk loading tools
    * Limited to specific databases with native support
  - JDBC drivers: Universal compatibility but potentially slower
  - Choose based on database type and performance requirements

• Compression and storage optimization
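  - `--compress --compression-codec gzip`: Reduce I/O and storage requirements (`--compress` or `-z` turns compression on; the codec option only selects the algorithm)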
  - Available codecs: gzip, bzip2, snappy, lzo
  - Balance compression ratio vs CPU overhead
  - Essential for large datasets and limited network bandwidth

• Resource management tuning
  - Increase mapper memory for large records or complex processing
  - Consider network bandwidth between database and Hadoop cluster
  - Monitor cluster resource usage during imports
  - Balance database server load with import parallelism

• Statement and transaction optimization
  - `--batch`: Enable JDBC batch mode for exports
  - `-Dsqoop.export.records.per.statement=1000`: Rows per INSERT statement; `-Dsqoop.export.statements.per.transaction` controls how many statements run before each commit
  - Larger transactions reduce overhead but increase lock time
  - Monitor database transaction logs and lock contention

• Performance monitoring and best practices
  - Use evenly distributed numeric columns for data splitting
  - Match mapper allocation to cluster capacity
  - Optimize database for concurrent read operations
  - Always use compression for large datasets
  - Monitor both database and cluster resources during operations
  - Test different mapper counts and fetch sizes for optimal performance
