# Table Of Contents
  * [File Format](#File-Format)
    - [Row Based](#Row-Based)
    - [Column Based](#Column-Based)
    - [Text File Format](#Text-File-Format)
  * [Specialized File Formats](#Specialized-File-Formats)
    - [Avro](#Avro)
    - [ORC](#ORC)
    - [Parquet](#Parquet)
  * [Comparison](#Comparision)
  * [File Format Practicals](#File-Format-Practicals)
  * [File Compression Techniques](#File-Compression-Techniques)
    - [Snappy](#Snappy)
    - [Lzo](#Lzo)
    - [Gzip](#Gzip)
    - [Bzip2](#Bzip2)
  * [Few More Optimizations](#Few-More-Optimizations)
    - [Vectorization](#Vectorization)
    - [Changing the Hive engine](#Changing-the-Hive-engine)
  * [Apache Thrift Server](#Apache-Thrift-Server)
  * [MSCK repair](#MSCK-repair)
  * [Miscellaneous](#Miscellaneous)
  * [Slowly Changing Dimensions](#Slowly-Changing-Dimentions)

#### How will the data be stored?

In this tutorial we'll talk about the following two things:
* File Format
* Compression Techniques

# File Format


#### Why do we need different file formats?
1. To save storage.
2. To process the data faster.
3. To spend less time on I/O operations.

The file format we choose helps us achieve all three of these goals.

There are many file formats to choose from, and they differ in what they offer:
* Some give faster reads.
* Some are better when we need faster writes.
* Some are splittable (preferable for Big Data, because the splits can be processed in parallel).
* Schema evolution support (altering the schema after loading or at later stages).
* Advanced compression.
* Compatibility with the technology we use.
    
All the file formats are divided into two broad categories:
### Row Based
In this format, data is stored row by row.
1. New data is appended at the end of the file.
2. Writing is easy because storage is sequential.
3. For a <i>SELECT</i> on a subset of columns, the entire record still has to be read.
4. Because values of different data types sit next to each other, compression is less effective.

Example: any DBMS table. Row-based storage is not a good choice for data warehousing.



### Column Based
In this format, data is stored column by column.
1. All the values of a column are stored together.
2. When a query requests only some columns, only the data for those columns is read.
3. Suitable for faster reads, but writes are costlier than in row-based formats.
4. Suitable for data warehousing.
5. Because each column holds values of a single data type, compression is very effective.
    
Examples: ORC and Parquet, both described later in this tutorial.
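
The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the three order records are the sample rows used later in this tutorial, and the names `row_layout`, `column_layout`, `total_amount_row_based` and `total_amount_column_based` are made up for the example. It counts how many values have to be touched to sum one column.

```python
# Three order records (id, product_id, amount).
rows = [
    (111111, "phone", 1200.0),
    (111112, "camera", 5200.0),
    (111113, "broom", 10.0),
]
columns = ("id", "product_id", "amount")

# Row-based layout: the values of each record sit next to each other.
row_layout = [value for row in rows for value in row]

# Column-based layout: all values of one column sit next to each other.
column_layout = {
    name: [row[i] for row in rows] for i, name in enumerate(columns)
}


def total_amount_row_based(flat, width, amount_pos):
    """Sum the amounts by walking every record in full."""
    total = 0.0
    values_touched = 0
    for start in range(0, len(flat), width):
        record = flat[start:start + width]  # the whole record is read
        values_touched += len(record)
        total += record[amount_pos]
    return total, values_touched


def total_amount_column_based(layout):
    """Sum the amounts by reading only the 'amount' column."""
    amounts = layout["amount"]
    return sum(amounts), len(amounts)


row_total, row_touched = total_amount_row_based(row_layout, len(columns), 2)
col_total, col_touched = total_amount_column_based(column_layout)
print(row_total, row_touched)  # 6410.0 9
print(col_total, col_touched)  # 6410.0 3
```

Both layouts give the same answer, but the column-based one touches a third of the values here. With wide tables the gap grows accordingly.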


## Text File Format

<b>Note:</b> We generally avoid these formats in production.</br>
Examples: JSON, CSV, XML, etc.</br>
These formats are human readable.

* CSV File</br>
1. Everything is stored as text, even an integer. Say the number 1234 is stored as a string: at 2 bytes per character (as in UTF-16) it takes 4 x 2 = 8 bytes, and the size grows with every extra digit. Stored as an int, it would take only 4 bytes.
2. To use such a value as an int, we have to parse the text first. This conversion costs processing time.
3. Larger files mean more I/O.

* XML/JSON</br>

Everything said about CSV applies here as well.
1. These files are generally not splittable (a document that spans many lines cannot be cut at an arbitrary point), so they are not an ideal choice for production use.
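
The storage and parsing cost of text described for CSV can be checked with a small Python sketch. The sample numbers are chosen only for illustration; `struct` packs the value as a fixed-width 4-byte integer, which is roughly what a binary format does.

```python
import struct

number = 1234567890

as_text = str(number).encode("utf-8")  # what a CSV file would hold
as_int = struct.pack("<i", number)     # fixed-width 4-byte integer

print(len(as_text))  # 10 bytes, one per digit
print(len(as_int))   # 4 bytes, regardless of the number of digits

# With 2 bytes per character, even a short number costs more than an int.
print(len("1234".encode("utf-16-le")))  # 8

# Reading the text back needs a parse step; the binary form is just unpacked.
parsed = int(as_text.decode("utf-8"))
unpacked = struct.unpack("<i", as_int)[0]
print(parsed == unpacked)  # True
```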

## Specialized File Formats

These are the file formats we mostly use in Big Data technologies.</br>
All of them are splittable and can be combined with any of the common compression codecs. The codec is recorded in the file metadata, so a reader finds out from the metadata how to decompress the data.
* Serialization - Converting the data into a form that can be transferred over the network and stored in a file.
* Deserialization - Converting the serialized data back into a readable form.
* Agnostic compression - Any compression technique can be used.


#### Avro
* General-purpose file format supported by most programming languages; in other words, it is language neutral.
* It is a row-based file format.
* Supports faster writes but slower reads.
* The schema for this file format is stored as JSON.
* The format is self-describing because the schema is kept with the data, in JSON form.
* The actual data is stored in a compressed binary format, which is quite efficient in terms of storage.
* Schema evolution - Of the formats covered here, Avro has the most mature support for schema evolution. Adding, removing, and renaming columns are all well supported.


Avro is a good fit for storing data in the landing zone of a data lake. It also suits ETL operations, because those typically read all the columns of every record.
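
To make the schema evolution idea concrete, here is a simplified sketch in plain Python. It does not use the Avro library. The schemas follow Avro's JSON style, but the `resolve` function and the `Order` fields are hypothetical and only mimic the basic rules: a renamed field is found through `aliases`, a newly added field takes its `default`, and a field the reader no longer lists is dropped.

```python
import json

# Schema the old files were written with.
writer_schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "product_id", "type": "string"},
        {"name": "qty", "type": "int"},
        {"name": "legacy_note", "type": "string"},
    ],
}

# Schema the reader uses today: one field renamed, one added, one removed.
reader_schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "product_id", "type": "string"},
        {"name": "quantity", "type": "int", "aliases": ["qty"]},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}


def resolve(record, reader):
    """Project a record written with an old schema onto the reader schema."""
    resolved = {}
    for field in reader["fields"]:
        candidates = [field["name"], *field.get("aliases", [])]
        for name in candidates:
            if name in record:
                resolved[field["name"]] = record[name]
                break
        else:
            # Field is new to this record: fall back to the default.
            if "default" not in field:
                raise ValueError(f"no value or default for {field['name']!r}")
            resolved[field["name"]] = field["default"]
    return resolved


old_record = {"id": 111111, "product_id": "phone", "qty": 3, "legacy_note": "x"}
new_view = resolve(old_record, reader_schema)

print(json.dumps(writer_schema["fields"][2]))  # the schema is plain JSON
print(new_view)
```

The old record stays readable without rewriting the file, which is the practical benefit of schema evolution.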



#### ORC

ORC stands for Optimized Row Columnar.
* Column-based file format (writes are less efficient, reads are efficient).
* Highly efficient in terms of storage (takes very little space).
* Uses lightweight compression (dictionary encoding, bit packing, delta encoding, run-length encoding) along with general-purpose compression codecs (Snappy, LZO, Gzip).

<b>Dictionary Encoding</b>: If a column has many repeated values, only the distinct values are stored, and an internal mapping links each row to its value.</br>
<b>Bit Packing</b>: It tries to store values in fewer bits than their data type would normally need.</br>
<b>Delta Encoding</b>: If the values change by small increments, then instead of storing every value in full it stores the difference between the current row and the previous one.</br>
<b>Run-Length Encoding</b>: If a value is repeated, it stores the value once together with the number of repetitions, e.g. ssssstttcc -> s5t3c2.</br>
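
The four lightweight encodings can be sketched in Python as follows. These are textbook versions written for illustration, not ORC's actual on-disk encodings (which are more elaborate), and all function names are made up for the example.

```python
from itertools import groupby


def run_length_encode(text):
    """Replace each run of a repeated character by the character and its count."""
    return "".join(f"{char}{len(list(run))}" for char, run in groupby(text))


def delta_encode(values):
    """Keep the first value, then only the difference to the previous row."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    values, total = [], 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


def dictionary_encode(values):
    """Store each distinct value once and refer to it by a small integer."""
    dictionary, index_of, codes = [], {}, []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def bits_needed(values):
    """Bit packing: bits per value needed for the largest non-negative value."""
    return max(1, max(values).bit_length())


print(run_length_encode("ssssstttcc"))           # s5t3c2
print(delta_encode([111111, 111112, 111113]))    # [111111, 1, 1]
states, codes = dictionary_encode(["CA", "CA", "CT", "NY", "CA"])
print(states, codes)                             # ['CA', 'CT', 'NY'] [0, 0, 1, 2, 0]
print(bits_needed(codes))                        # 2
```

Note how the encodings combine: dictionary encoding turns strings into small integers, and bit packing then stores those integers in 2 bits each instead of a full int.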

* ORC also provides predicate pushdown.
<b>Predicate Pushdown</b>: The conditions we write in the WHERE clause are called predicates. ORC evaluates these predicates at the storage level first, so data that cannot match the WHERE clause is filtered out before it is read.

* It is the best fit for Hive because it supports all the data types available in Hive, including the complex ones. It was designed specifically for Hive.

* Schema evolution: supported, but not as mature as in Avro. ORC stores its metadata using Protocol Buffers, which allows fields to be added and removed.
* An ORC file is divided into three parts.
  - **Header**: This contains only the text "ORC".
  - **Body**: Holds the actual data. It is made up of stripes (default: 250MB; newer Hive releases default to 64MB), and each stripe is further divided into row groups (default: 10K rows).
    - Each stripe also has three parts:
      * Set of indices - Min/max and count of each column for every row group in that stripe
      * Data, broken into row groups
      * Stripe footer - The encoding used for each column
  - **Tail** - Made up of two parts:
    - File footer: Contains the file-level metadata. For each column it gives the min/max at file level as well as at stripe level.
    - Postscript: Records which compression technique was used, plus other information needed to interpret the rest of the file. It is never compressed.
* **Indices**: These exist at three levels:
  1. File level - Min/max values for the entire file
  2. Stripe level - Min/max values for each stripe
  3. Row level - Min/max values for each row group inside a stripe
  
  ```text
    Example:
    There are 100000 records:
    File Level - 1 to 100000
    Stripe level:
      Stripe1 - 1 - 50000
        Row Level:
        RL1: 1 - 10000
        RL2: 10001 - 20000 and so on
      Stripe2 - 50001 - 100000
        Row Level:
        RL1: 50001 - 60000
        RL2: 60001 - 70000 and so on
    ```
* File-level and stripe-level indices are kept in the file footer.
* Row-level indices are kept in the index data of each stripe.
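
The way these three index levels let a reader skip data can be sketched as below, using the 100000-record example. This is a conceptual model only: `build_index` and `row_groups_to_read` are hypothetical helpers, and the ids are assumed to be stored in sorted order so that the min/max ranges do not overlap.

```python
ROWS_PER_GROUP = 10_000
ROWS_PER_STRIPE = 50_000
TOTAL_ROWS = 100_000


def build_index(total_rows, stripe_rows, group_rows):
    """Build min/max statistics for the file, its stripes and their row groups."""
    stripes = []
    for s_start in range(1, total_rows + 1, stripe_rows):
        s_end = min(s_start + stripe_rows - 1, total_rows)
        groups = []
        for g_start in range(s_start, s_end + 1, group_rows):
            g_end = min(g_start + group_rows - 1, s_end)
            groups.append({"min": g_start, "max": g_end})
        stripes.append({"min": s_start, "max": s_end, "row_groups": groups})
    return {"min": 1, "max": total_rows, "stripes": stripes}


def row_groups_to_read(index, wanted):
    """Return (stripe_no, row_group_no) pairs that may contain id == wanted."""
    if not index["min"] <= wanted <= index["max"]:
        return []  # file-level statistics rule out the whole file
    hits = []
    for s_no, stripe in enumerate(index["stripes"], start=1):
        if not stripe["min"] <= wanted <= stripe["max"]:
            continue  # stripe-level statistics rule out this stripe
        for g_no, group in enumerate(stripe["row_groups"], start=1):
            if group["min"] <= wanted <= group["max"]:
                hits.append((s_no, g_no))
    return hits


index = build_index(TOTAL_ROWS, ROWS_PER_STRIPE, ROWS_PER_GROUP)
print(row_groups_to_read(index, 65_000))   # [(2, 2)]
print(row_groups_to_read(index, 250_000))  # []
```

For `WHERE id = 65000` only one row group out of ten has to be read, and for an id outside the file's range nothing is read at all. This is the effect of pushing the predicate down to the storage level.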
  
<span style="color:red">**Key Takeaways**</span>
* Column based
* Data is stored very compactly thanks to its many compression encodings
* Very well suited for Hive
* Predicates are pushed down to the storage level, so less data has to be read
* Schema evolution is okay, but not great

  

#### Parquet
* Column-based file format.
* Writes are less efficient, but reads are efficient.
* Very good at handling nested data.
* Shares many design goals with ORC but is more general purpose (wide compatibility with many platforms).
* Compression is efficient.
* It stores its metadata at the end of the file, which is why it is called self-describing.
* It supports schema evolution to some extent: columns can be added or deleted only at the end, not in the middle.
* A Parquet file has three parts:
  * **Header** - Contains the text "PAR1"
  * **Row groups** - Each row group holds one column chunk per column, and each column chunk is further divided into pages
    * Example: with 5 columns and 100K rows stored in row groups of 10K rows, each row group has 5 column chunks of 10K values each
  * **Footer** - Contains three parts
    * File metadata
    * Length of the file metadata
    * Magic number "PAR1"
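
The header/footer layout can be illustrated with a small Python sketch that builds a fake file with the same outline and reads the footer back. It is a toy model: in a real Parquet file the metadata is Thrift-encoded, while here it is just a placeholder byte string, and `build_file` and `read_footer` are hypothetical names. The length field is a 4-byte little-endian integer.

```python
import struct

MAGIC = b"PAR1"


def build_file(row_group_bytes, metadata_bytes):
    """Lay out: magic, row groups, file metadata, metadata length, magic."""
    footer_length = struct.pack("<I", len(metadata_bytes))
    return MAGIC + row_group_bytes + metadata_bytes + footer_length + MAGIC


def read_footer(data):
    """Locate the file metadata by reading the file from its end."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("magic number missing: not a Parquet-style file")
    (metadata_length,) = struct.unpack("<I", data[-8:-4])
    if metadata_length > len(data) - 12:
        raise ValueError("metadata length does not fit in the file")
    metadata_start = len(data) - 8 - metadata_length
    return data[metadata_start:len(data) - 8]


fake_file = build_file(b"<column chunks>", b"<file metadata>")
print(read_footer(fake_file))  # b'<file metadata>'
```

Because the reader starts from the end of the file, it learns the schema and the position of every column chunk before touching any data. This is why the `dfs -cat` output of a Parquet file both starts and ends with `PAR1`.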
    
<span style="color:red">**Key Takeaways**</span>
* Generic File Format and column based
* Faster reads and slower writes
* Compression is efficient but not as good as ORC
* Supports schema evolution to some extent
* Best suited for nested structures
* Splittable



####  Comparision

<table align=left>
<tr>
<th>Avro</th>
<th>Parquet</th>
</tr>
<tr>
<td>
    
* It is a row-based file format.
* Provides faster writes but slower reads.
* Avro is quite mature in schema evolution.
* Supports nested records, but is not optimized for reading individual nested fields.
    
</td>
<td>
    

* It is a column-based file format.
* Provides faster reads but slower writes.
* Parquet supports schema evolution only at the end (adding or removing trailing columns).
* It provides very good support for nested data structures.
    
</td>
</tr>
</table>


<table align=left>
<tr>
<th>ORC</th>
<th>Parquet</th>
</tr>
<tr>
<td>
  
* Less suited to deeply nested data.
* Good at predicate pushdown.
* Also supports ACID properties to some extent.
* Compression - Better than Parquet.
* Schema evolution - Better than Parquet.
* Supported platforms - Hive, Presto.
  
</td>
<td>
    
* Can handle deeply nested data.
* Compression - ORC is better.
* Parquet supports schema evolution only at the end.
* Supported platforms - Impala, Arrow, Drill, Spark.
    
</td>
</tr>
</table>
    


## File Format Practicals


* Create a table `orders_orc` by specifying **stored as orc**.
* Load the data using an `insert` statement.
* Check the resulting file.

```shell
hive> CREATE TABLE `orders_orc`(
    >   `id` bigint,
    >   `product_id` string,
    >   `customer_id` bigint,
    >   `quantity` int,
    >   `amount` double) stored as orc;
OK
Time taken: 0.058 seconds

hive> dfs -ls warehouse/tushar_test.db/orders_orc;
Found 1 items
-rw-r--r--   3 itv002768 supergroup        658 2022-07-17 08:55 warehouse/tushar_test.db/orders_orc/000000_0

hive> dfs -cat warehouse/tushar_test.db/orders_orc/000000_0;
ORC
P7


��(P@��be!1f%.֤���\!ւ���T    �1

    ���4P+


 
PB�R�bb`R�`���dA0��A�����nddd!phonecamerabroom     F�P    Fb $c`��@C�͠��]�A
�0C��K�����h
                  �\|��h����pJ�%3YYq���";u:ig�,߮�7c�����b�``��`���;�+0��Ć
                                                                                   |\�J\�IE���B��y�
@).A�zN�u��%�L�Bl\|@!6&6    . ��K��YJ��
                                            T�t�I0�k�P�(��`��$���,��Q�K�M�Yɂ�G������U�)3E���(?�4�$��N.-.��M-q8
K�J2K*��s�K�J��8X��J31��
                              s��G�    ^�I'x%6�`�೒��bM*���b-���K�P�HY        �sr�X'(�eb��
�q0     �Ip��V��RҜ
                    `����8H�\;��D&�y���"
                                                  (u0��ORChive>
```

**To get the information in a readable format, use the following command:**

```shell
[itv002768@g02 ~]$ hive --orcfiledump /user/itv002768/warehouse/tushar_test.db/orders_orc/000000_0
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Processing data file /user/itv002768/warehouse/tushar_test.db/orders_orc/000000_0 [length: 658]
Structure for /user/itv002768/warehouse/tushar_test.db/orders_orc/000000_0
File Version: 0.12 with ORC_517
Rows: 3
Compression: ZLIB
Compression size: 262144
Type: struct<id:bigint,product_id:string,customer_id:bigint,quantity:int,amount:double>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 3 hasNull: false
    Column 1: count: 3 hasNull: false bytesOnDisk: 14 min: 111111 max: 111113 sum: 333336
    Column 2: count: 3 hasNull: false bytesOnDisk: 26 min: broom max: phone sum: 16
    Column 3: count: 3 hasNull: false bytesOnDisk: 6 min: 1111 max: 1111 sum: 3333
    Column 4: count: 3 hasNull: false bytesOnDisk: 7 min: 1 max: 3 sum: 5
    Column 5: count: 3 hasNull: false bytesOnDisk: 21 min: 10.0 max: 5200.0 sum: 6410.0

File Statistics:
  Column 0: count: 3 hasNull: false
  Column 1: count: 3 hasNull: false bytesOnDisk: 14 min: 111111 max: 111113 sum: 333336
  Column 2: count: 3 hasNull: false bytesOnDisk: 26 min: broom max: phone sum: 16
  Column 3: count: 3 hasNull: false bytesOnDisk: 6 min: 1111 max: 1111 sum: 3333
  Column 4: count: 3 hasNull: false bytesOnDisk: 7 min: 1 max: 3 sum: 5
  Column 5: count: 3 hasNull: false bytesOnDisk: 21 min: 10.0 max: 5200.0 sum: 6410.0

Stripes:
  Stripe: offset: 3 data: 74 rows: 3 tail: 70 index: 163
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 30
    Stream: column 2 section ROW_INDEX start: 44 length 35
    Stream: column 3 section ROW_INDEX start: 79 length 27
    Stream: column 4 section ROW_INDEX start: 106 length 24
    Stream: column 5 section ROW_INDEX start: 130 length 36
    Stream: column 1 section DATA start: 166 length 14
    Stream: column 2 section DATA start: 180 length 19
    Stream: column 2 section LENGTH start: 199 length 7
    Stream: column 3 section DATA start: 206 length 6
    Stream: column 4 section DATA start: 212 length 7
    Stream: column 5 section DATA start: 219 length 21
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DIRECT_V2
    Encoding column 4: DIRECT_V2
    Encoding column 5: DIRECT

File length: 658 bytes
Padding length: 0 bytes
Padding ratio: 0%
```

**To see the actual data**
```shell
[itv002768@g02 ~]$ hive --orcfiledump /user/itv002768/warehouse/tushar_test.db/orders_orc/000000_0 -d
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Processing data file /user/itv002768/warehouse/tushar_test.db/orders_orc/000000_0 [length: 658]
{"id":111111,"product_id":"phone","customer_id":1111,"quantity":3,"amount":1200}
{"id":111112,"product_id":"camera","customer_id":1111,"quantity":1,"amount":5200}
{"id":111113,"product_id":"broom","customer_id":1111,"quantity":1,"amount":10}
```

**Below are the commands for Parquet file**

```shell

hive> CREATE TABLE `orders_parquet`(
    >   `id` bigint,
    >   `product_id` string,
    >   `customer_id` bigint,
    >   `quantity` int,
    >   `amount` double) stored as parquet;
OK
Time taken: 0.841 seconds
hive> insert into orders_parquet select * from orders_orc;
Query ID = itv002768_20220717090857_137e9a9f-2901-4d44-8068-fcf30d934594
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1650084714510_39755, Tracking URL = http://m02.itversity.com:19088/proxy/application_1650084714510_39755/
Kill Command = /opt/hadoop/bin/mapred job  -kill job_1650084714510_39755
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-07-17 09:09:23,017 Stage-1 map = 0%,  reduce = 0%
2022-07-17 09:09:28,292 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.74 sec
2022-07-17 09:09:32,471 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.17 sec
MapReduce Total cumulative CPU time: 4 seconds 170 msec
Ended Job = job_1650084714510_39755
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://m01.itversity.com:9000/user/itv002768/warehouse/tushar_test.db/orders_parquet/.hive-staging_hive_2022-07-17_09-08-57_887_3847282730333851488-1/-ext-10000
Loading data to table tushar_test.orders_parquet
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.17 sec   HDFS Read: 20127 HDFS Write: 1499 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 170 msec
OK
Time taken: 36.655 seconds


hive> dfs -cat warehouse/tushar_test.db/orders_parquet/000000_0;
PAR1<<,       ��(  ����       �DD,6(phonebroomphonecamerabroomLW,WWWWL,(<<,P�@$@P�@$@��@P�@$@lH
                                                            hive_schema
%id
         %
product_id%%
                customer_id%quantity
%amount\id��<   ��(  ��&�
                                                               
product_id��&�<6(phonebroom&�
                                                       customer_id��&�<WWWW,&�quantity��&�<(,&�
amount��&�<P�@$@P�@$@�writer.time.zoneAmerica/TorontoJparquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)\�PAR1


[itv002768@g02 ~]$ parquet-tools meta /user/itv002768/warehouse/tushar_test.db/orders_parquet/000000_0 -d
Invalid arguments: unknown extra argument "-d"

usage: parquet-meta [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout
[itv002768@g02 ~]$ parquet-tools meta /user/itv002768/warehouse/tushar_test.db/orders_parquet/000000_0
file:        hdfs://m01.itversity.com:9000/user/itv002768/warehouse/tushar_test.db/orders_parquet/000000_0
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:       writer.time.zone = America/Toronto

file schema: hive_schema
--------------------------------------------------------------------------------
id:          OPTIONAL INT64 R:0 D:1
product_id:  OPTIONAL BINARY O:UTF8 R:0 D:1
customer_id: OPTIONAL INT64 R:0 D:1
quantity:    OPTIONAL INT32 R:0 D:1
amount:      OPTIONAL DOUBLE R:0 D:1

row group 1: RC:3 TS:416 OFFSET:4
--------------------------------------------------------------------------------
id:           INT64 UNCOMPRESSED DO:0 FPO:4 SZ:91/91/1.00 VC:3 ENC:PLAIN,BIT_PACKED,RLE
product_id:   BINARY UNCOMPRESSED DO:0 FPO:95 SZ:69/69/1.00 VC:3 ENC:PLAIN,BIT_PACKED,RLE
customer_id:  INT64 UNCOMPRESSED DO:0 FPO:164 SZ:90/90/1.00 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
quantity:     INT32 UNCOMPRESSED DO:0 FPO:254 SZ:75/75/1.00 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
amount:       DOUBLE UNCOMPRESSED DO:0 FPO:329 SZ:91/91/1.00 VC:3 ENC:PLAIN,BIT_PACKED,RLE

[itv002768@g02 ~]$ parquet-tools cat /user/itv002768/warehouse/tushar_test.db/orders_parquet/000000_0
id = 111111
product_id = phone
customer_id = 1111
quantity = 3
amount = 1200.0

id = 111112
product_id = camera
customer_id = 1111
quantity = 1
amount = 5200.0

id = 111113
product_id = broom
customer_id = 1111
quantity = 1
amount = 10.0
  
  ```

* To use the JSON format we need a JSON SerDe, as there is no built-in support for JSON. For this we have to load the SerDe JAR first.
* SerDe is a combination of two things: **ser**ialization and **de**serialization.

## File Compression Techniques

* Compression helps us save storage.
* It helps us process the data faster.
* It reduces the I/O cost.

Compressing and decompressing take time, but compared with the I/O we save, this extra time is usually negligible. The four important compression codecs are described below. Some of them are optimized for storage and some for speed.

There is a trade-off: the more we want a file compressed, the longer the compression takes.

### Snappy
* Developed by Google.
* Snappy is a very fast compression codec.
* However, its compression ratio is only moderate: the compressed files are larger than with the storage-optimized codecs.
* Suitable when quick compression matters.
* It is the codec used in most projects.
* By itself it is not splittable. When it is used with a splittable file format this is not a concern, because the file format takes care of splitting.
* Can be used with Avro, ORC, and Parquet, which are container-based, splittable file formats.


### Lzo
* Optimized for speed, just like Snappy.
* LZO files can be split, so it can be used with file formats that are not splittable on their own.
* Splitting requires an additional indexing step.
* It requires a separate installation for Hadoop.

### Gzip
* Optimized for storage rather than speed (on average about 2.5 times the compression of Snappy).
* Processing speed is slow.
* It should be used with container-based file formats, since it is not splittable.
* When using Gzip we can reduce the block size, so that there are more blocks and we can still achieve parallelism.

### Bzip2
* Optimized for storage, and very slow.
* Its purpose is to compress files as much as possible.
* It is splittable and can be used with text, JSON, and XML files.
* Compresses around 9% better than Gzip, but at a cost: it is about 10 times slower than Gzip.
* Not an ideal choice for Hadoop unless our primary concern is to reduce storage.
* Used for active archival purposes.
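
The storage versus speed trade-off can be tried out with the two codecs that ship with Python's standard library, Gzip and Bzip2 (Snappy and LZO need third-party packages, so they are left out). This is a rough sketch, not a benchmark: the sample data is one CSV-like line repeated many times, which compresses far better than real data, and the timings depend on the machine.

```python
import bz2
import gzip
import time

# Highly repetitive sample data, an assumption made to keep the example short.
line = b"111111,phone,1111,3,1200.0\n"
data = line * 20_000


def measure(compress, payload):
    """Return the compressed bytes and the seconds the compression took."""
    started = time.perf_counter()
    blob = compress(payload)
    return blob, time.perf_counter() - started


gzip_blob, gzip_seconds = measure(gzip.compress, data)
bzip2_blob, bzip2_seconds = measure(bz2.compress, data)

print(f"  raw: {len(data)} bytes")
print(f" gzip: {len(gzip_blob)} bytes in {gzip_seconds:.4f}s")
print(f"bzip2: {len(bzip2_blob)} bytes in {bzip2_seconds:.4f}s")

# Compression is lossless: decompressing returns the original bytes.
restored_from_gzip = gzip.decompress(gzip_blob)
restored_from_bzip2 = bz2.decompress(bzip2_blob)
print(restored_from_gzip == data, restored_from_bzip2 == data)
```

Running it on a sample of your own data is the most reliable way to see which codec fits a given table.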

## Few More Optimizations

### Vectorization
By default a query processes the data row by row, which takes time. To avoid this, the data can be processed in batches.

Vectorized query execution is a Hive feature that greatly reduces CPU usage for typical query operations.
With vectorized execution, rows are processed in batches of 1024 at a time, and the batch size is configurable.

* You must store your data in ORC format to use query vectorization.
* Enable it with `set hive.vectorized.execution.enabled = true` (this is false by default).
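
The difference between row-at-a-time and batch-at-a-time processing can be sketched as below. This only shows the shape of the idea: Python will not reproduce the CPU savings Hive gets from vectorization, and the names `batches`, `count_row_by_row` and `count_vectorized` are invented for the example. What it does show is how many times the filter operator is invoked in each mode.

```python
from itertools import islice

BATCH_SIZE = 1024  # the batch size mentioned above


def batches(rows, size=BATCH_SIZE):
    """Yield the rows in lists of at most `size` elements."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def count_row_by_row(rows, predicate):
    """Invoke the predicate once per row."""
    matched = calls = 0
    for row in rows:
        calls += 1
        if predicate(row):
            matched += 1
    return matched, calls


def count_vectorized(rows, batch_predicate):
    """Invoke the predicate once per batch of rows."""
    matched = calls = 0
    for batch in batches(rows):
        calls += 1
        matched += sum(batch_predicate(batch))
    return matched, calls


quantities = [i % 5 for i in range(10_000)]

row_result = count_row_by_row(quantities, lambda q: q > 2)
batch_result = count_vectorized(quantities, lambda batch: [q > 2 for q in batch])
print(row_result)    # (4000, 10000)
print(batch_result)  # (4000, 10)
```

Both modes find the same 4000 rows, but the batch version enters the operator 10 times instead of 10000 times.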


```shell
hive> set hive.vectorized.execution.enabled=true;
hive> create table vectorizedtable(state string, id int) stored as orc;
OK
Time taken: 0.804 seconds
hive> insert into vectorizedtable values('Karnataka', 1);
Query ID = itv002768_20220717133618_db8414a2-e0f0-4b4d-bb8b-24ef80bfac8a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1650084714510_39814, Tracking URL = http://m02.itversity.com:19088/proxy/application_1650084714510_39814/
Kill Command = /opt/hadoop/bin/mapred job  -kill job_1650084714510_39814
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-07-17 13:36:50,949 Stage-1 map = 0%,  reduce = 0%
2022-07-17 13:36:55,172 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.33 sec
2022-07-17 13:37:00,504 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.81 sec
MapReduce Total cumulative CPU time: 3 seconds 810 msec
Ended Job = job_1650084714510_39814
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://m01.itversity.com:9000/user/itv002768/warehouse/tushar_test.db/vectorizedtable/.hive-staging_hive_2022-07-17_13-36-18_717_326784811544799765-1/-ext-10000
Loading data to table tushar_test.vectorizedtable
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.81 sec   HDFS Read: 16096 HDFS Write: 567 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 810 msec
OK
Time taken: 46.676 seconds
hive> explain select count(*) from vectorizedtable;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: 1
      Processor Tree:
        ListSink

Time taken: 6.96 seconds, Fetched: 10 row(s)
```

### Changing the Hive engine
* Hive supports three execution engines:
 - MapReduce (default)
 - Tez
 - Spark
 
 ```shell
hive> set hive.execution.engine;
hive.execution.engine=mr
hive> set hive.execution.engine=spark;
hive> set hive.execution.engine;
hive.execution.engine=spark
```

## Apache Thrift Server

It is also known as HiveServer.
It lets us connect to Hive remotely from various client programs.
For example, a script written in Java or Python can connect to Hive through the Thrift service.

HiveServer is a service that allows remote clients to submit requests to Hive using a variety of programming languages. The Thrift interface acts as the bridge.

## MSCK repair

Let's say you have created an external table that points to a directory, and someone has added partitions manually by creating directories, e.g. `/data/State=NY`.
When data is loaded through Hive, it goes into the directory of its partition and the partition is registered. A directory created by hand is not registered.

If we run `show partitions table_name`, the manually added partitions will not show up.
The `msck repair table_name` command adds the corresponding metadata for the Hive table, so that the partition information becomes available.
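
Conceptually, the repair compares the partition directories that exist on storage with the partitions the metastore knows about, and registers the missing ones. The sketch below models this with a temporary local directory and a plain Python set standing in for the metastore. It is not how Hive implements the command, and `partitions_on_disk` and `repair` are hypothetical names.

```python
import os
import tempfile


def partitions_on_disk(table_dir):
    """Partition directories look like key=value, e.g. State=NY."""
    return {
        name
        for name in os.listdir(table_dir)
        if "=" in name and os.path.isdir(os.path.join(table_dir, name))
    }


def repair(table_dir, metastore_partitions):
    """Register every partition directory the metastore does not know yet."""
    missing = partitions_on_disk(table_dir) - metastore_partitions
    metastore_partitions |= missing  # updates the caller's set in place
    return sorted(missing)


with tempfile.TemporaryDirectory() as table_dir:
    # Partitions created through Hive: on disk and in the metastore.
    for state in ("CA", "CT"):
        os.mkdir(os.path.join(table_dir, f"State={state}"))
    metastore = {"State=CA", "State=CT"}

    # Partition directory added by hand: on disk only.
    os.mkdir(os.path.join(table_dir, "State=NY"))

    before = sorted(metastore)
    added = repair(table_dir, metastore)
    after = sorted(metastore)

print(before)  # ['State=CA', 'State=CT']
print(added)   # ['State=NY']
print(after)   # ['State=CA', 'State=CT', 'State=NY']
```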

## Miscellaneous

**No Drop table**</BR> If we enable this feature for a table no one will be able to drop the table.</br>

`ALTER TABLE orders enable no_drop`

To disable no_drop, use the following command:

`ALTER TABLE orders disable no_drop`

If you want to enable it for a partition use the following command:

`ALTER TABLE orders partition(department='HR') enable no_drop`

**Offline Feature**</BR> If you enable this feature on a table then you won't be able to query that table.

`ALTER TABLE orders enable offline` <=> `ALTER TABLE orders disable offline`

**Skipping Header**</BR> If a file starts with header lines or other rows that we do not want to treat as data, we can skip the first n rows.

```shell
hive> create table skip_test(name string, score int)
    > row format delimited fields terminated by ','
    > lines terminated by '\n'
    > stored as textfile
    > tblproperties("skip.header.line.count"="3");
OK
Time taken: 0.814 seconds
```
**Making Tables Immutable**</BR> This property allows data to be loaded only the first time, while the table is still empty. You won't be able to append to or update the data. However, you can overwrite it.

`tblproperties("immutable"="true")`

**DROP vs TRUNCATE vs PURGE**</BR>
* When we `DROP` a managed table, both data and metadata are deleted.
* When we `DROP` an external table, only the metadata is deleted, not the data.
* When we `TRUNCATE` a table, all the data is deleted and only the metadata remains.
* If `PURGE` is set to true, deleted data is removed permanently and cannot be recovered. If `PURGE` is set to false, the data can still be recovered.

**Treating Empty strings as NULL**</BR> If a string field in a file has no value, i.e. it is blank, it is read as an empty string rather than NULL. The following property sets the string that Hive interprets as NULL. Setting it to the empty string makes blank fields come back as NULL, and any other marker can be used the same way.

`tblproperties("serialization.null.format"="")`

**Setting hivevar**</BR> We can set the value of a Hive variable and use it in a query.
```shell
hive> describe orders;
OK
id                      string
customer_id             string
product_id              string
quantity                int
amount                  double
zipcode                 char(5)
state                   char(2)

# Partition Information
# col_name              data_type               comment
state                   char(2)
Time taken: 0.181 seconds, Fetched: 11 row(s)
hive> select * from orders ;
OK
o1       c1      p1     NULL    1.11     9011   CA
o2       c2      p2     NULL    2.22     9022   CA
o3       c3      p3     NULL    3.33     9033   CA
o4       c4      p4     NULL    4.44     9044   CA
o10      c10     p10    NULL    10.11    9001   CT
o20      c20     p20    NULL    20.22    9002   CT
o30      c30     p30    NULL    30.33    9003   CT
o40      c40     p40    NULL    40.44    9004   CT
o100     c100    p10    NULL    10.11    9001   NY
o200     c200    p20    NULL    20.22    9002   NY
o300     c300    p30    NULL    30.33    9003   NY
o400     c400    p40    NULL    40.44    9004   NY
Time taken: 6.393 seconds, Fetched: 12 row(s)

hive> set hivevar:zip=9011;
hive> select * from orders where zipcode=${zip};
OK
o1       c1      p1     NULL    1.11     9011   CA
Time taken: 7.471 seconds, Fetched: 1 row(s)
```


**Print table headers**
```shell
hive> set hive.cli.print.header;
hive.cli.print.header=false
hive> set hive.cli.print.header=true;
hive> select * from orders;
OK
orders.id       orders.customer_id      orders.product_id       orders.quantity orders.amount   orders.zipcode  orders.state
o1       c1      p1     NULL    1.11     9011   CA
o2       c2      p2     NULL    2.22     9022   CA
o3       c3      p3     NULL    3.33     9033   CA
o4       c4      p4     NULL    4.44     9044   CA
o10      c10     p10    NULL    10.11    9001   CT
o20      c20     p20    NULL    20.22    9002   CT
o30      c30     p30    NULL    30.33    9003   CT
o40      c40     p40    NULL    40.44    9004   CT
o100     c100    p10    NULL    10.11    9001   NY
o200     c200    p20    NULL    20.22    9002   NY
o300     c300    p30    NULL    30.33    9003   NY
o400     c400    p40    NULL    40.44    9004   NY
Time taken: 9.859 seconds, Fetched: 12 row(s)
```

**Cartesian Product**</BR>
`SELECT * FROM table1, table2`



## Slowly Changing Dimentions
A.K.A Change Data Capture.

Consider a dimension table in MySQL with many columns. This data does not change frequently. Let's say we imported it into HDFS as files using Sqoop and created a Hive table on top of it. **What if the data in MySQL changes (even if not frequently)? How do we make sure the updated data is synced with Hive?** This is where the concept of SCD comes in.

### Types of SCDs:

#### SCD Type 1:
We have to sync the Hive table so that it holds the latest data. We do not maintain the history of previous data.
* Overwrite the old data with the new data.
* Extremely simple, and makes it easy to synchronize the reporting system (OLAP) with the operational system (OLTP).
* You lose history every time you update.

#### SCD Type 2:
We have to sync the Hive table so that it holds the latest data. Here, we also have to maintain the history.
* Add new rows with versions.
* Allows you to track history.
* Dimension tables may become very large.
* Additional reporting views need to be created.

There are three ways to implement SCD Type 2:
* Versioning - The highest version number always represents the latest value.
* Flagging - A flag column (Active/Inactive) is added. Only the Active row is considered the latest, but we cannot tell which of the inactive rows held the previous value.
* Effective date strategy - `start_date` and `end_date` columns record the validity period. A NULL `end_date` marks the current row. This is the most widely used approach.
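
The effective date strategy can be sketched in Python as follows. This is a minimal in-memory model under stated assumptions, not a Hive implementation: the dimension is a list of dicts, `None` plays the role of a NULL `end_date`, and `apply_change` and `current` are hypothetical helper names.

```python
from datetime import date


def apply_change(dimension, key, new_attrs, change_date):
    """Close the current row for `key` (if it changed) and add a new one."""
    for row in dimension:
        if row["key"] == key and row["end_date"] is None:
            if row["attrs"] == new_attrs:
                return dimension  # nothing changed, keep the current row
            row["end_date"] = change_date  # the old version stops being valid
            break
    dimension.append(
        {
            "key": key,
            "attrs": dict(new_attrs),
            "start_date": change_date,
            "end_date": None,  # None (NULL) marks the current version
        }
    )
    return dimension


def current(dimension, key):
    """Return the attributes of the row that is valid right now."""
    for row in dimension:
        if row["key"] == key and row["end_date"] is None:
            return row["attrs"]
    return None


customers = []
apply_change(customers, "c1", {"state": "CA"}, date(2022, 1, 1))
apply_change(customers, "c1", {"state": "NY"}, date(2022, 7, 17))

for row in customers:
    print(row["key"], row["attrs"], row["start_date"], row["end_date"])
print(current(customers, "c1"))  # {'state': 'NY'}
```

The first row keeps the old value together with the period in which it was valid, so the history is preserved. With SCD Type 1 the same change would simply overwrite `CA` with `NY`.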
