- Hive can run on MapReduce, Tez, or Spark as its execution engine. Hive itself is a data warehouse layer (SQL interface plus metastore) aimed at warehouse management and batch processing.
- The real comparison is Hive vs Spark SQL.
- Use Hive on the MapReduce engine for slow, long-running batch jobs; use Spark SQL for fast, flexible processing.
- Hive running on the Spark engine is called Hive on Spark.

- SQL processing:
    - Open Source: Apache Hive, Spark SQL (on Spark engine), Presto/Trino
    - AWS: Redshift, Athena
    - Azure: Synapse Analytics
    - Google Cloud: BigQuery
    - Commercial: Snowflake

# Hive Complete Reference Notes

**1. Managed vs External Table Types**

• Managed Tables (Internal Tables)
  - `CREATE TABLE employees (id INT, name STRING, salary DOUBLE);`
  - Hive manages both metadata and data storage
  - Data stored in Hive warehouse directory (`/user/hive/warehouse` by default)
  - When table is dropped, both metadata and data are deleted (data files go to the HDFS trash if trash is enabled; `DROP TABLE ... PURGE` skips the trash)
  - Hive has full control over table lifecycle
    * Creates directories automatically
    * Manages file organization
    * Handles data cleanup on DROP TABLE
  - Best for tables entirely managed by Hive applications
  - Default table type when not specified

• External Tables
  - `CREATE EXTERNAL TABLE ext_employees (id INT, name STRING) LOCATION '/user/data/employees';`
  - Hive manages only metadata, not the actual data files
  - Data stored in user-specified location outside Hive warehouse
  - When table is dropped, only metadata is removed, data remains intact
  - Data can be shared across multiple applications and tools
    * Other systems can read/write the same data
    * Data persists beyond Hive table lifecycle
    * Useful for data lakes and shared storage scenarios
  - Required when data already exists in specific HDFS location
  - Safer option when data should not be accidentally deleted

• Key differences and selection criteria
  - **Data ownership**: Managed tables owned by Hive, external tables owned by user
  - **Drop behavior**: Managed tables delete data, external tables preserve data
  - **Location control**: Managed tables use warehouse, external tables use custom paths
  - **Use managed for**: Temporary tables, Hive-specific processing, ETL intermediates
  - **Use external for**: Shared data, existing datasets, data lake scenarios
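
The drop behavior above can be sketched in HiveQL as follows. This is a minimal illustration, not a definitive setup: the table names and the `/user/data/employees` path come from the examples in this section, while the `ROW FORMAT` clause is an assumption about how the files are laid out.

```sql
-- Managed table: Hive owns the files under the warehouse directory.
CREATE TABLE IF NOT EXISTS employees (id INT, name STRING, salary DOUBLE);

-- External table: Hive only records metadata for the files at LOCATION.
CREATE EXTERNAL TABLE IF NOT EXISTS ext_employees (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','   -- assumption: comma-delimited text files
LOCATION '/user/data/employees';

-- The "Table Type" row in the output shows MANAGED_TABLE or EXTERNAL_TABLE.
DESCRIBE FORMATTED ext_employees;

-- Removes metadata and data files.
DROP TABLE IF EXISTS employees;

-- Removes metadata only; the files under /user/data/employees remain.
DROP TABLE IF EXISTS ext_employees;
```

Running `DESCRIBE FORMATTED` before any `DROP TABLE` is a cheap way to confirm which kind of table is about to be dropped.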

**2. Hive Table Types and Operations**

• Staging tables concept
  - Temporary tables used for data preprocessing and validation
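  - `CREATE TABLE staging_sales (sale_date STRING, amount STRING, product STRING);` (the column is named `sale_date` because `DATE` is a reserved keyword in recent Hive versions)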
  - Often use STRING data types initially for flexible data loading
  - Used to clean, validate, and transform data before loading to production tables
    * Handle data quality issues
    * Perform data type conversions
    * Apply business rules and validations
  - Typically dropped after successful data processing
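
A possible staging-to-production flow, as a sketch only. The target table `sales_clean`, the comma delimiter, and the `yyyy-MM-dd` date format of the incoming strings are assumptions made for illustration.

```sql
-- Staging table: every field lands as STRING so malformed rows still load.
CREATE TABLE IF NOT EXISTS staging_sales (sale_date STRING, amount STRING, product STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Hypothetical typed target table.
CREATE TABLE IF NOT EXISTS sales_clean (sale_date DATE, amount DECIMAL(10,2), product STRING)
STORED AS ORC;

-- CAST returns NULL for values it cannot convert, so the WHERE clause filters out bad rows.
INSERT INTO TABLE sales_clean
SELECT CAST(sale_date AS DATE),
       CAST(amount AS DECIMAL(10,2)),
       TRIM(product)
FROM staging_sales
WHERE CAST(sale_date AS DATE) IS NOT NULL
  AND CAST(amount AS DECIMAL(10,2)) IS NOT NULL;

-- Drop the staging table once the load has been verified.
DROP TABLE IF EXISTS staging_sales;
```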

• Dropping tables safely
  - `DROP TABLE table_name;`: Removes table completely
  - `DROP TABLE IF EXISTS table_name;`: Prevents errors if table doesn't exist
  - `TRUNCATE TABLE table_name;`: Removes all data but keeps table structure
    * Faster than DELETE for removing all records
    * Resets table to empty state
    * Cannot be used with external tables
  - `ALTER TABLE table_name DROP PARTITION (year=2023);`: Remove specific partitions

• Table creation variations
  - `CREATE TABLE new_table LIKE existing_table;`: Copy structure without data
  - `CREATE TABLE new_table AS SELECT ...;` (CTAS): Create table with data from query
  - `CREATE TEMPORARY TABLE temp_data (...);`: Session-specific tables
    * Automatically dropped when session ends
    * Not visible to other users or sessions
    * Useful for intermediate processing steps

**3. Data Loading: Local vs HDFS Operations**

• Loading from local filesystem
  - `LOAD DATA LOCAL INPATH '/home/user/data.txt' INTO TABLE employees;`
  - Copies data from local file system to HDFS
  - Original file remains in local filesystem
  - Data transferred over network to HDFS
    * Slower for large files due to network transfer
    * Useful for small datasets and development
    * File must exist on machine running Hive client
  - `OVERWRITE` option replaces existing data

• Moving from HDFS
  - `LOAD DATA INPATH '/user/data/employees.txt' INTO TABLE employees;`
  - Moves data files from HDFS location to table location
  - Original files are moved, not copied
    * Much faster operation (a file move within HDFS, no data copied)
    * No additional storage space required
    * Source files no longer exist in original location
  - Preferred method for large datasets already in HDFS
  - Files must be compatible with table schema and format

• Loading considerations and best practices
  - **Performance**: HDFS move operations much faster than local copy
  - **Data formats**: Ensure file format matches table definition
  - **Partitioning**: Use partition specification for partitioned tables
  - **File organization**: Multiple files loaded as separate data files
  - **Permissions**: Ensure proper HDFS permissions for data access
  - **Validation**: Test with small datasets before bulk loading
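
The loading variants above, collected into one sketch. The paths and table names are the ones used in these notes; whether the files actually match the table format is assumed, since `LOAD DATA` does not check it.

```sql
-- Copy a local file into the table, replacing its current contents.
LOAD DATA LOCAL INPATH '/home/user/data.txt' OVERWRITE INTO TABLE employees;

-- Move a file that is already in HDFS into the table directory.
-- After this statement the source path no longer exists.
LOAD DATA INPATH '/user/data/employees.txt' INTO TABLE employees;

-- For a partitioned table, name the target partition explicitly.
LOAD DATA INPATH '/user/data/sales_2023_12.txt'
INTO TABLE sales PARTITION (year=2023, month=12);

-- LOAD does not parse or validate rows, so spot-check after loading.
SELECT * FROM employees LIMIT 5;
```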

**4. Schema on Read Features**

• Core concept and benefits
  - Schema applied when data is read, not when data is written
  - Data stored in raw format without immediate schema validation
  - Flexible data ingestion without upfront schema definition
    * Accept data in various formats and structures
    * Handle schema evolution gracefully
    * Support for semi-structured and unstructured data
  - Contrast with traditional databases (schema on write)

• Implementation in Hive
  - `CREATE TABLE flexible_data (col1 STRING, col2 STRING, col3 STRING);`
  - Data type conversion happens during query execution
  - SerDe (Serializer/Deserializer) handles format interpretation
    * JSON SerDe for JSON data
    * Regex SerDe for custom text formats
    * Avro SerDe for Avro files
  - NULL values returned for incompatible data types

• Advantages and considerations
  - **Fast data ingestion**: No validation overhead during loading
  - **Schema flexibility**: Handle varying data structures
  - **Storage efficiency**: Store raw data without transformation
  - **Query-time overhead**: Type conversion during each query
  - **Data quality**: Potential issues discovered only during queries
  - **Best practices**: Validate critical data paths and common queries
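
A small sketch of schema on read in practice. The table `raw_events`, its columns, and its location are hypothetical; the point is that the declared types are applied only when the query runs.

```sql
-- Hypothetical external table over raw comma-delimited text files.
CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (event_id INT, event_time STRING, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/data/raw_events';

-- Fields that cannot be read as INT come back as NULL instead of failing the query,
-- so counting NULLs is a rough data-quality check (it also counts genuinely empty fields).
SELECT COUNT(*) AS suspect_rows
FROM raw_events
WHERE event_id IS NULL;
```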

**5. Creating Tables from Existing Tables**

• CREATE TABLE AS SELECT (CTAS)
  - `CREATE TABLE high_earners AS SELECT * FROM employees WHERE salary > 100000;`
  - Creates new table with data populated from query results
  - Schema automatically inferred from SELECT statement
  - Data copied during table creation process
    * New table independent of source table
    * Snapshot of data at creation time
    * Changes to source table don't affect new table
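  - Column names and types come from the query; storage format and table properties can still be set (`STORED AS`, `TBLPROPERTIES`), but the target cannot be an external table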

• CREATE TABLE LIKE
  - `CREATE TABLE employees_backup LIKE employees;`
  - Copies table structure without copying data
  - Preserves column names, data types, and table properties
  - Useful for creating backup tables or similar structures
    * Maintains partitioning scheme
    * Copies storage format and SerDe properties
    * Preserves bucketing configuration
  - Data loaded separately using INSERT or LOAD statements

• Advanced table creation patterns
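  - `CREATE TABLE monthly_sales PARTITIONED BY (year, month) AS SELECT ...;`: Partitioned CTAS, supported only in newer Hive releases; partition columns are listed without types and must come last in the SELECT list (on older releases, create the partitioned table first and use `INSERT ... SELECT`)
  - Combine CTAS with partitioning for organized data storage
  - Use for creating summary tables, a manual alternative to materialized views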
  - Efficient way to restructure existing data with new organization
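
The two creation styles side by side, as a sketch. It assumes `employees` is an unpartitioned table with the columns shown earlier in these notes; the choice of ORC is illustrative.

```sql
-- CTAS: the schema comes from the query; the storage format is chosen explicitly.
CREATE TABLE high_earners
STORED AS ORC
AS
SELECT id, name, salary
FROM employees
WHERE salary > 100000;

-- LIKE: create an empty copy of the structure, then fill it separately.
CREATE TABLE employees_backup LIKE employees;
INSERT INTO TABLE employees_backup SELECT * FROM employees;
```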

**6. Hive Partitions**

• Partition types and concepts
  - **Static partitioning**: Partition values specified explicitly
  - **Dynamic partitioning**: Partition values determined from data
  - Partitions create separate subdirectories for different partition values
  - Dramatically improves query performance by partition pruning
    * Queries scan only relevant partitions
    * Reduces I/O and processing time
    * Essential for time-series and categorical data organization

• Static partition insertion
  - `INSERT INTO TABLE sales PARTITION (year=2023, month=12) SELECT product, amount FROM staging_sales;`
  - Partition values explicitly specified in INSERT statement
  - All inserted records go to the same partition
  - Simple and predictable partition assignment
    * Full control over data placement
    * Suitable for batch processing known time periods
    * Prevents accidental data placement in wrong partitions

• Static partition loading
  - `LOAD DATA INPATH '/user/data/sales_2023_12.txt' INTO TABLE sales PARTITION (year=2023, month=12);`
  - Data files loaded directly to specific partition
  - Faster than INSERT for bulk data loading
  - File-based partition population
    * Maintains original file structure
    * No data transformation during loading
    * Efficient for pre-organized data files

• Dynamic partitioning configuration
  - `SET hive.exec.dynamic.partition=true;`: Enable dynamic partitioning
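  - `SET hive.exec.dynamic.partition.mode=nonstrict;`: Allow every partition column to be dynamic (strict mode requires at least one static partition value)
  - `SET hive.exec.max.dynamic.partitions=1000;`: Maximum dynamic partitions created in total by one statement
  - `SET hive.exec.max.dynamic.partitions.pernode=100;`: Maximum dynamic partitions created by each mapper or reducer node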

• Dynamic partition insertion
  - `INSERT INTO TABLE sales PARTITION (year, month) SELECT product, amount, year, month FROM staging_sales;`
  - Partition values determined from data columns
  - Automatically creates partitions based on distinct values
  - Partition columns must be last in SELECT statement
    * Order matters for multi-level partitioning
    * Creates hierarchical directory structure
    * Handles multiple partitions in single operation

• Default and advanced partition handling
  - `__HIVE_DEFAULT_PARTITION__`: Default name for NULL partition values
  - `ALTER TABLE sales ADD PARTITION (year=2024, month=1);`: Manual partition creation
  - `ALTER TABLE sales DROP PARTITION (year=2022);`: Remove old partitions
  - Partition pruning optimization automatically applied in queries
  - Regular partition maintenance required for optimal performance
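
An end-to-end dynamic partitioning sketch. The column list of `sales` and the ORC format are assumptions; `staging_sales` is assumed to carry `year` and `month` columns, as in the insert example above.

```sql
-- Partition columns are declared in PARTITIONED BY, not in the regular column list.
CREATE TABLE IF NOT EXISTS sales (product STRING, amount DOUBLE)
PARTITIONED BY (year INT, month INT)
STORED AS ORC;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Dynamic insert: year and month come last, in the same order as PARTITIONED BY.
INSERT INTO TABLE sales PARTITION (year, month)
SELECT product, amount, year, month
FROM staging_sales;

-- List the partitions that were created.
SHOW PARTITIONS sales;

-- Filtering on partition columns lets Hive read only the year=2023/month=12 directory.
SELECT SUM(amount) AS total_amount
FROM sales
WHERE year = 2023 AND month = 12;
```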

**7. Creating Tables from Sequence File Data**

• Sequence file table creation
  - `CREATE TABLE seq_employees (id INT, name STRING, salary DOUBLE) STORED AS SEQUENCEFILE;`
  - Optimized for MapReduce processing and compression
  - Binary format with better performance than text files
  - Splittable format enabling parallel processing
    * Each split processed by separate mapper
    * Maintains record boundaries across splits
    * Efficient for large dataset processing

• Loading sequence file data
  - `LOAD DATA INPATH '/user/data/employees.seq' INTO TABLE seq_employees;`
  - Files must be in proper SequenceFile format
  - Compatible with MapReduce input/output formats
  - Preserves compression and serialization benefits
    * Faster query processing
    * Reduced storage requirements
    * Better network utilization

• Sequence file benefits and use cases
  - **Performance**: Faster read/write operations than text files
  - **Compression**: Efficient compression support (block and record level)
  - **Splittability**: Enables parallel processing across cluster
  - **Integration**: Native MapReduce compatibility
  - **Use cases**: Large batch processing, ETL intermediates, compressed storage
  - **Considerations**: Not human-readable, requires compatible tools
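
One way to produce compressed SequenceFile data from an existing table, shown as a sketch. It assumes a text-format `employees` table with matching columns and a cluster where the Snappy codec is installed; the compression properties are standard Hadoop 2+ settings.

```sql
CREATE TABLE IF NOT EXISTS seq_employees (id INT, name STRING, salary DOUBLE)
STORED AS SEQUENCEFILE;

-- Compress query output and use block-level SequenceFile compression.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Rewrite the rows of the text table into SequenceFile format.
INSERT OVERWRITE TABLE seq_employees
SELECT id, name, salary
FROM employees;
```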

**8. Hive Buckets (Clustering)**

• Bucketing concept and implementation
  - `CREATE TABLE bucketed_sales (id INT, product STRING, amount DOUBLE) CLUSTERED BY (id) INTO 4 BUCKETS;`
  - Distributes data into fixed number of files based on hash function
  - Hash function applied to bucketing column determines file placement
  - Each bucket stored as separate file within table/partition directory
    * Predictable file organization
    * Consistent data distribution
    * Enables sampling and join optimizations

• Bucketing configuration and loading
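  - `SET hive.enforce.bucketing=true;`: Enable automatic bucketing (Hive 1.x only; the property was removed in Hive 2.0, where bucketing is always enforced)
  - `INSERT INTO TABLE bucketed_sales SELECT * FROM raw_sales;`
  - Number of reducers should match number of buckets (Hive sets this automatically when bucketing is enforced)
  - Hash function spreads rows roughly evenly across buckets when the bucketing column has many distinct values
    * Reduces data skew in processing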
    * Enables efficient sampling queries
    * Optimizes join operations between bucketed tables

• Bucketing benefits and optimization
  - **Sampling**: `SELECT * FROM bucketed_sales TABLESAMPLE(BUCKET 1 OUT OF 4);`
  - **Join optimization**: Bucket joins for co-located data
  - **Map-side joins**: Efficient joins when both tables bucketed on join key
  - **Consistent performance**: Predictable query execution times
    * Reduces data skew issues
    * Enables parallel processing optimizations
    * Improves resource utilization

• Advanced bucketing features
  - `CLUSTERED BY (id) SORTED BY (event_time) INTO 4 BUCKETS;`: Combine bucketing with sorting (`event_time` stands in for a timestamp column, since `TIMESTAMP` is a reserved keyword)
  - Sorted buckets enable faster range queries and aggregations
  - Multiple column bucketing for complex data distribution
  - Integration with partitioning for hierarchical data organization
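
A bucketing sketch that ties the pieces above together. `raw_sales` is assumed to have the same three columns as `bucketed_sales`; the ORC format is an illustrative choice.

```sql
CREATE TABLE IF NOT EXISTS bucketed_sales (id INT, product STRING, amount DOUBLE)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC;

-- Only needed on Hive 1.x; Hive 2.0+ removed this property and always enforces bucketing.
-- SET hive.enforce.bucketing=true;

-- Rows are hashed on id and written into 4 bucket files.
INSERT INTO TABLE bucketed_sales
SELECT id, product, amount
FROM raw_sales;

-- Sample one of the four buckets (roughly a quarter of the rows if ids are well spread).
SELECT *
FROM bucketed_sales TABLESAMPLE(BUCKET 1 OUT OF 4 ON id) s;
```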

**9. Schema Evolution**

• Schema evolution capabilities
  - Add new columns to existing tables without data migration
  - `ALTER TABLE employees ADD COLUMNS (department STRING, hire_date DATE);`
  - New columns appear as NULL for existing records
  - Maintains backward compatibility with existing data
    * Old queries continue to work unchanged
    * New queries can access additional columns
    * No downtime required for schema changes

• Column modification operations
  - `ALTER TABLE employees CHANGE COLUMN name full_name STRING;`: Rename columns
  - `ALTER TABLE employees CHANGE COLUMN salary salary DECIMAL(10,2);`: Change data types
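  - `ALTER TABLE` changes only the metadata; existing data files are not rewritten, so compatible type changes preserve existing data
  - Incompatible changes may require data conversion (for example, rewriting the table with `INSERT OVERWRITE` and `CAST`)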
    * String to numeric conversions
    * Date format standardization
    * Precision and scale adjustments

• Schema evolution best practices
  - **Additive changes**: Safest approach - only add new columns
  - **Default values**: Consider default values for new columns
  - **Data validation**: Test schema changes with sample queries
  - **Rollback planning**: Maintain ability to revert schema changes
  - **Documentation**: Track schema evolution for data governance
  - **Version control**: Maintain schema history and change logs

• Handling schema conflicts
  - SerDe-specific schema evolution capabilities
  - Avro SerDe provides excellent schema evolution support
  - JSON SerDe handles flexible schema changes
  - Regular expressions updated for new data formats
  - Partition-level schema variations supported
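
A short schema evolution sequence, as a sketch against the `employees` table used in these notes. The verification queries at the end are a suggestion, not a required step.

```sql
-- Add columns; existing rows return NULL for them.
ALTER TABLE employees ADD COLUMNS (department STRING, hire_date DATE);

-- Rename a column (metadata only; data files are not rewritten).
ALTER TABLE employees CHANGE COLUMN name full_name STRING;

-- Confirm the new schema and check that old rows still read cleanly.
DESCRIBE employees;
SELECT id, full_name, department
FROM employees
LIMIT 5;
```

On a partitioned table, appending `CASCADE` to the `ALTER TABLE` statement also applies the change to the metadata of existing partitions.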

**10. Executing HiveQL as Scripts**

• Script execution methods
  - `hive -f script.hql`: Execute HiveQL script from file
  - `hive -e "SELECT * FROM employees LIMIT 10;"`: Execute single query
  - `hive --hiveconf key=value -f script.hql`: Pass configuration parameters
  - Scripts contain multiple HiveQL statements separated by semicolons
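    * Comments use `--`; `/* */` block comments are not supported by every Hive client version
    * Variable substitution using `${variable}` syntax
    * HiveQL has no control-flow statements; conditional logic lives in the calling shell script or in `--hiveconf` parameters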

• Script organization and best practices
  - Use meaningful file names and directory structure
  - Include header comments with purpose and parameters
  - Organize statements logically (DDL, DML, cleanup)
  - Handle errors gracefully with `IF EXISTS` / `IF NOT EXISTS` clauses and checks in the calling script
    * Check table existence before operations
    * Validate data quality at key checkpoints
    * Include rollback procedures for complex operations

• Variable substitution and parameterization
  - `${hiveconf:start_date}`: Reference configuration variables
  - `hive --hiveconf start_date=2023-01-01 -f monthly_report.hql`
  - Environment variables accessible through ${env:VARIABLE}
  - Dynamic script behavior based on runtime parameters
    * Flexible date ranges for reports
    * Environment-specific configurations
    * Reusable script templates

• Advanced scripting features
  - `SET hive.cli.print.header=true;`: Print column headers
  - `SET hive.exec.mode.local.auto=true;`: Enable local mode for small queries
  - `SOURCE other_script.hql;`: Include other script files for modular design
  - Integration with shell scripts for complex workflows
  - Error handling and logging strategies
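
A parameterized script might look like the sketch below. The file name `monthly_report.hql` and the `start_date` variable come from the example above; the table `daily_sales` and its columns are hypothetical.

```sql
-- monthly_report.hql
-- Purpose: total sales per product from a start date onward.
-- Run as:  hive --hiveconf start_date=2023-01-01 -f monthly_report.hql

SET hive.cli.print.header=true;

-- ${hiveconf:start_date} is replaced with the value passed on the command line.
SELECT product,
       SUM(amount) AS total_amount
FROM daily_sales                      -- hypothetical table (sale_date STRING, product STRING, amount DOUBLE)
WHERE sale_date >= '${hiveconf:start_date}'
GROUP BY product;
```

Because substitution is plain text replacement, the quotes around `${hiveconf:start_date}` belong in the script rather than in the command-line value.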

**11. Joins and Working with Dates**

• Join types and syntax
  - `SELECT a.*, b.* FROM employees a JOIN departments b ON a.dept_id = b.id;`: Inner join
  - `LEFT OUTER JOIN`: Include all records from left table
  - `RIGHT OUTER JOIN`: Include all records from right table
  - `FULL OUTER JOIN`: Include all records from both tables
  - `CROSS JOIN`: Cartesian product of both tables (use carefully)

• Join optimization techniques
  - **Map-side joins**: For small tables that fit in memory
    * `/*+ MAPJOIN(small_table) */`: Hint for map-side join (often unnecessary, since Hive converts joins automatically when `hive.auto.convert.join=true`)
    * Broadcasts small table to all mappers
    * Eliminates shuffle phase for better performance
  - **Bucket map joins**: For bucketed tables on same column
  - **Sort-merge bucket joins**: For sorted and bucketed tables
    * Most efficient join for large tables
    * Requires proper bucketing and sorting strategy

• Date handling functions
  - `SELECT CURRENT_DATE(), CURRENT_TIMESTAMP();`: Current date and time
  - `SELECT YEAR(date_col), MONTH(date_col), DAY(date_col) FROM table_name;`: Extract date parts
  - `SELECT DATE_ADD(date_col, 30), DATE_SUB(date_col, 7) FROM table_name;`: Date arithmetic
  - `SELECT TO_DATE(timestamp_col) FROM table_name;`: Convert timestamp to date
  - `SELECT UNIX_TIMESTAMP(date_string, 'yyyy-MM-dd') FROM table_name;`: Parse date strings into Unix epoch seconds

• Date formatting and conversion
  - `SELECT FROM_UNIXTIME(unix_timestamp, 'yyyy-MM-dd HH:mm:ss');`: Format Unix timestamp
  - `SELECT DATE_FORMAT(date_col, 'yyyy-MM') FROM table_name;`: Custom date formatting
  - `SELECT DATEDIFF(end_date, start_date) FROM table_name;`: Calculate date differences
  - `SELECT WEEKOFYEAR(date_col), DAYOFWEEK(date_col) FROM table_name;`: Week and day functions

• Date-based filtering and partitioning
  - `WHERE date_col BETWEEN '2023-01-01' AND '2023-12-31'`: Date range filtering
  - Use date functions in WHERE clauses for dynamic filtering
  - Partition tables by date columns for optimal performance
  - Consider time zone implications for timestamp data
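
A join combined with date functions, as a sketch. The columns `hire_date`, `dept_id`, and `dept_name` are hypothetical additions to the `employees` and `departments` tables used above.

```sql
-- Employees hired in the last 90 days, with their department name.
SELECT e.id,
       e.name,
       d.dept_name,
       DATEDIFF(CURRENT_DATE, e.hire_date)  AS days_employed,
       DATE_FORMAT(e.hire_date, 'yyyy-MM')  AS hire_month
FROM employees e
JOIN departments d
  ON e.dept_id = d.id
WHERE e.hire_date >= DATE_SUB(CURRENT_DATE, 90);
```

If `departments` is small, Hive will typically turn this into a map-side join on its own, without a hint.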

**12. MSCK REPAIR Command**

• Purpose and functionality
  - `MSCK REPAIR TABLE table_name;`: Synchronize metastore with HDFS
  - Discovers partitions that exist in HDFS but not in Hive metastore
  - Automatically adds missing partition metadata to metastore
  - Essential after external partition creation or data loading
    * Files added directly to HDFS bypass metastore updates
    * External tools may create partitions unknown to Hive
    * Data recovery scenarios require metastore synchronization

• Common use cases
  - After bulk data loading using external tools (Sqoop, Spark, etc.)
  - Following direct HDFS operations that create partition directories
  - Data recovery after metastore corruption or restoration
  - Synchronization in multi-tenant environments
    * Multiple applications writing to same table
    * External ETL processes creating partitions
    * Data lake scenarios with diverse data sources

• MSCK REPAIR operation details
  - Scans table's HDFS directory structure recursively
  - Identifies partition directories based on naming convention
  - Creates metastore entries for discovered partitions
  - Does not validate data quality or schema compliance
    * Only creates metadata entries
    * Assumes proper partition structure exists
    * May create entries for corrupted or incomplete partitions

• Alternative partition management commands
  - `ALTER TABLE table_name ADD PARTITION (year=2023, month=12);`: Manual partition addition
  - `SHOW PARTITIONS table_name;`: List current partitions in metastore
  - `DESCRIBE FORMATTED table_name PARTITION (year=2023);`: Partition details
  - `ALTER TABLE table_name DROP PARTITION (year=2022);`: Remove partition (metadata only for external tables; managed tables also lose the partition's data)

• Best practices and considerations
  - Run MSCK REPAIR after bulk external data operations
  - Monitor partition count to avoid excessive small partitions
  - Validate critical partitions after repair operations
  - Consider automation for regular synchronization needs
  - Use with caution on tables with many partitions (performance impact)
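
A typical repair sequence, sketched for a hypothetical external table `events` whose partition directories (for example `/user/data/events/year=2023/month=12/`) are written by another tool. The Parquet format and the location are assumptions.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS events (event_id INT, payload STRING)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET
LOCATION '/user/data/events';

SHOW PARTITIONS events;      -- directories added outside Hive are not listed yet
MSCK REPAIR TABLE events;    -- registers directories named year=.../month=...
SHOW PARTITIONS events;      -- the discovered partitions now appear

-- For a single known partition, ALTER TABLE avoids scanning the whole directory tree.
ALTER TABLE events ADD IF NOT EXISTS PARTITION (year=2024, month=1);
```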

**13. Performance Tuning in Hive**

• Query optimization techniques
  - **Predicate pushdown**: Move WHERE clauses closer to data source
    * Reduces data scanning and transfer
    * Applied automatically by Hive optimizer
    * More effective with columnar formats (Parquet, ORC)
  - **Projection pruning**: Select only required columns
    * `SELECT id, name FROM employees;` instead of `SELECT * FROM employees;`
    * Reduces I/O and memory usage
    * Critical for tables with many columns

• Join optimization strategies
  - **Map-side joins**: `/*+ MAPJOIN(small_table) */` for small dimension tables
  - **Bucket joins**: Pre-bucket tables on join keys for efficient joins
  - **Sort-merge bucket joins**: Ultimate optimization for large table joins
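  - **Join order**: Place the largest table last in the join sequence
    * Hive buffers the earlier tables and streams the last one through the reducers
    * With the cost-based optimizer enabled, Hive can reorder joins automatically
    * Manual hints such as `/*+ STREAMTABLE(a) */` are available for specific optimization needs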

• File format optimization
  - **ORC format**: `STORED AS ORC` for best compression and performance
    * Columnar storage with predicate pushdown
    * Built-in indexing and statistics
    * ACID transaction support
  - **Parquet format**: Good alternative for cross-platform compatibility
  - **Compression**: Enable compression for storage and I/O optimization
    * `SET hive.exec.compress.output=true;`
    * Choose appropriate codec (Snappy, ZLIB, LZO)

• Partitioning and bucketing optimization
  - **Partition pruning**: Query only relevant partitions
    * Use partition columns in WHERE clauses
    * Avoid functions on partition columns in filters
    * Monitor partition count to prevent small file problem
  - **Bucketing**: Distribute data evenly across files
    * Enable sampling and map-side joins
    * Prevent data skew in processing
    * Optimize for specific query patterns

• Memory and resource tuning
  - **Mapper memory**: `SET mapreduce.map.memory.mb=2048;`
  - **Reducer memory**: `SET mapreduce.reduce.memory.mb=4096;`
  - **JVM heap settings**: Configure based on data size and complexity
  - **Parallel execution**: `SET hive.exec.parallel=true;` for independent operations
    * Multiple stages executed simultaneously
    * Reduces overall job execution time
    * Monitor resource usage to avoid oversubscription

• Advanced optimization settings
  - **Vectorization**: `SET hive.vectorized.execution.enabled=true;`
    * Processes multiple rows together for better CPU utilization
    * Significant performance improvement for analytical queries
    * Works best with ORC file format
  - **Cost-based optimizer**: `SET hive.cbo.enable=true;`
    * Uses table statistics for better execution plans
    * Requires regular statistics updates with `ANALYZE TABLE`
  - **Dynamic partition pruning**: Automatic optimization for star schema queries

• Monitoring and troubleshooting
  - **Query plans**: `EXPLAIN SELECT ...;` to understand execution strategy
  - **Statistics**: `ANALYZE TABLE table_name COMPUTE STATISTICS;` for optimizer
  - **Job tracking**: Monitor MapReduce jobs through Hadoop UI
  - **Identify bottlenecks**: CPU, memory, I/O, or network constraints
    * Profile queries with different data sizes
    * Test various optimization techniques
    * Monitor resource utilization patterns
  - **Small file problem**: Consolidate small files using concatenation or compaction
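
Several of the settings above combined into one tuning session, as a sketch. It assumes a `sales` table partitioned by `year` and `month` with `product` and `amount` columns; the settings are session-level and their benefit depends on the Hive version and file format.

```sql
SET hive.vectorized.execution.enabled=true;
SET hive.cbo.enable=true;
SET hive.exec.parallel=true;

-- Gather table-level and column-level statistics for the cost-based optimizer.
ANALYZE TABLE sales PARTITION (year, month) COMPUTE STATISTICS;
ANALYZE TABLE sales PARTITION (year, month) COMPUTE STATISTICS FOR COLUMNS;

-- Inspect the plan before running the query for real.
EXPLAIN
SELECT product,
       SUM(amount) AS total_amount
FROM sales
WHERE year = 2023 AND month = 12     -- partition columns in the filter allow pruning
GROUP BY product;
```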

# Presto/Trino: