# Hive - Partitions

In [None]:
What are Hive Partitions?

-->A Hive partition splits a large table into smaller logical parts based on the values of one or more partition keys. 
These parts are not exposed to users as separate tables; users still query the data through a single table.

-->Partitioning removes the need to create, access, and manage many smaller tables separately.

-->When you load data into a partitioned table, Hive internally splits the records by partition key and 
    stores each partition's data in a sub-directory of the table's directory on HDFS. 
'The directory name is the partition key and its value, for example state=AL.'

+------------+
| partition  |
+------------+
| state=AL   |
| state=AZ   |
| state=FL   |
| state=NC   |
| state=PR   |
| state=TX   |
+------------+

-->Also note that while loading data into a partitioned table, Hive omits the partition key column from the data 
   files stored on HDFS, because it is redundant: the value can be read from the partition directory name.
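
The directory layout described above can be illustrated with a small Python sketch. This is only a toy model of the idea, not Hive's implementation: the function name `partition_rows` is hypothetical, and the sample rows other than the TAMPA record are made up for illustration.

```python
from collections import defaultdict

def partition_rows(header, rows, key):
    """Group rows by the partition key and drop that column from each row."""
    idx = header.index(key)
    partitions = defaultdict(list)
    for row in rows:
        directory = f"{key}={row[idx]}"      # directory name: key and value, e.g. "state=AL"
        data = row[:idx] + row[idx + 1:]     # the partition column is not kept in the data file
        partitions[directory].append(data)
    return dict(partitions)

header = ["RecordNumber", "Country", "City", "Zipcode", "state"]
rows = [
    [1, "US", "SPRINGVILLE", 35146, "AL"],   # illustrative row
    [2, "US", "HOUSTON", 77001, "TX"],       # illustrative row
    [891, "US", "TAMPA", 33605, "FL"],       # row used in the INSERT example
]

layout = partition_rows(header, rows, "state")
for directory in sorted(layout):
    print(directory, layout[directory])
```

Each key of `layout` plays the role of a partition sub-directory, and the rows stored under it have only four columns because the state value lives in the directory name.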


In [None]:
-->To create a Hive table with partitions, use the PARTITIONED BY clause with the name and type of each partition 
   column. The partition column is declared only in PARTITIONED BY, not in the regular column list.
    

CREATE TABLE zipcodes(
RecordNumber int,
Country string,
City string,
Zipcode int)
PARTITIONED BY(state string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';


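#Load Data to Table
#hdfs dfs -put zipcodes.csv /data/
#No PARTITION clause is given here; Hive 3.0+ rewrites such a load as an INSERT ... SELECT and expects the partition column (state) to be the last column in the file
jdbc:hive2://127.0.0.1:10000> LOAD DATA INPATH '/data/zipcodes.csv' INTO TABLE zipcodes;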
    
#Insert a Row into a Static Partition
INSERT INTO zipcodes PARTITION(state='FL') VALUES (891,'US','TAMPA',33605);


jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+------------+
| partition  |
+------------+
| state=AL   |
| state=AZ   |
| state=FL   |
| state=NC   |
| state=PR   |
| state=TX   |
+------------+
6 rows selected (0.074 seconds)



In [None]:
#Add Partition
jdbc:hive2://127.0.0.1:10000> ALTER TABLE zipcodes ADD PARTITION (state='CA') LOCATION '/user/data/zipcodes_ca';

#Rename/update Partition
jdbc:hive2://127.0.0.1:10000> ALTER TABLE zipcodes PARTITION (state='AL') RENAME TO PARTITION (state='NY');

#Renaming Partitions on HDFS
'hdfs dfs -mv /user/hive/warehouse/zipcodes/state=NY /user/hive/warehouse/zipcodes/state=AL'

NOTE:- When you modify partitions directly on HDFS, you need to run 'MSCK REPAIR TABLE' to update the Hive Metastore.
      Otherwise the Metastore and HDFS disagree about which partitions exist, and queries can return inconsistent results.
        

jdbc:hive2://127.0.0.1:10000> 'MSCK REPAIR TABLE zipcodes SYNC PARTITIONS;'
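
To see why the repair step is needed, here is a minimal Python sketch of the reconciliation idea behind SYNC PARTITIONS. It is a toy model under stated assumptions, not Hive's actual code: `plan_sync` is a hypothetical helper, and the two lists model the state after the rename example above (the Metastore still knows state=NY while the directory on HDFS is now state=AL).

```python
def plan_sync(metastore_partitions, hdfs_directories):
    """Return (to_add, to_drop) so the Metastore matches the directories on HDFS."""
    metastore = set(metastore_partitions)
    on_disk = set(hdfs_directories)
    to_add = sorted(on_disk - metastore)    # directories the Metastore does not know about
    to_drop = sorted(metastore - on_disk)   # Metastore entries whose directory is gone
    return to_add, to_drop

# Metastore view after ALTER TABLE ... RENAME TO PARTITION (state='NY')
metastore_partitions = ["state=AZ", "state=FL", "state=NC", "state=NY", "state=PR", "state=TX"]
# HDFS view after the manual 'hdfs dfs -mv' of state=NY back to state=AL
hdfs_directories = ["state=AL", "state=AZ", "state=FL", "state=NC", "state=PR", "state=TX"]

to_add, to_drop = plan_sync(metastore_partitions, hdfs_directories)
print("add:", to_add)
print("drop:", to_drop)
```

Until both differences are resolved, SHOW PARTITIONS and queries rely on the stale Metastore view rather than on what is actually stored on HDFS.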


In [None]:
#Drop a Partition
jdbc:hive2://127.0.0.1:10000> ALTER TABLE zipcodes DROP IF EXISTS PARTITION (state='AL');

NOTE:- 'Without IF EXISTS, dropping a partition that does not exist can fail with an error, depending on the hive.exec.drop.ignorenonexistent setting.'
    
#Drop Partitions on HDFS
user@namenode:~/hive$ 'hdfs dfs -rm -R /user/hive/warehouse/zipcodes/state=AL'
Deleted /user/hive/warehouse/zipcodes/state=AL


jdbc:hive2://127.0.0.1:10000> MSCK REPAIR TABLE zipcodes SYNC PARTITIONS;
jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes;
+-----------------------------------+
|             partition             |
+-----------------------------------+
| state=AZ                          |
| state=FL                          |
| state=NC                          |
| state=PR                          |
| state=TX                          |
+-----------------------------------+




In [None]:
'How to Filter Partitions?'

SHOW PARTITIONS zipcodes PARTITION(state='NC');
+------------+
| partition  |
+------------+
| state=NC   |
+------------+
1 row selected (0.182 seconds)

'Find the HDFS Location of a Specific Partition'

DESCRIBE FORMATTED zipcodes PARTITION(state='PR');
SHOW TABLE EXTENDED LIKE zipcodes PARTITION(state='PR');


In [None]:
#Keywords used with Partitions
PARTITIONED BY --> used to create a partitioned table
ALTER TABLE --> used to add, rename, and drop partitions
SHOW PARTITIONS --> used to list the partitions of a table
MSCK REPAIR --> used to sync the Hive Metastore with the partition directories on HDFS.

# 1. Static Partitions

In [None]:
-->In static partitioning, you specify the partition column value explicitly in each and every LOAD or INSERT statement.
-->Loading input data files individually into specific partitions of a partitioned table is static partitioning. It is 
   usually preferred when loading big files into Hive tables.

-->Suppose table t1(userid, name, occupation, country) is partitioned on the column country; each load must then 
   provide the country value.
    
    #strict is the default in Hive 2.x; earlier versions default to nonstrict
    hive> set hive.mapred.mode = strict;
    hive> LOAD DATA INPATH '/hdfs path of the file' INTO TABLE t1 PARTITION(country="US");
    hive> LOAD DATA INPATH '/hdfs path of the file' INTO TABLE t1 PARTITION(country="UK");
    hive> INSERT INTO zipcodes PARTITION(state='FL') VALUES (891,'US','TAMPA',33605);

-->Static partitioning loads data faster than dynamic partitioning: you “statically” add a partition to the table 
   and move the file into that partition, so Hive does not have to read the rows to decide where they belong.

-->Statically created partitions can be altered with ALTER TABLE.
-->You can usually get the partition column value from the filename, the date, etc. without reading the whole big file.
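
As a rough illustration of taking the partition value from the filename, the Python sketch below builds a static LOAD statement without opening the file. Everything specific here is an assumption made for the example: the helper name `static_load_statement`, the path /data/users_US.csv, and the naming convention that the partition value follows the last underscore.

```python
from pathlib import PurePosixPath

def static_load_statement(hdfs_path, table, key):
    """Build a static-partition LOAD statement from a file name such as users_US.csv."""
    stem = PurePosixPath(hdfs_path).stem          # "users_US"
    parts = stem.rsplit("_", 1)
    if len(parts) != 2 or not parts[1]:
        raise ValueError(f"cannot derive a partition value from {hdfs_path!r}")
    value = parts[1]                              # assumed convention: value after the last "_"
    return (
        f"LOAD DATA INPATH '{hdfs_path}' "
        f"INTO TABLE {table} PARTITION({key}='{value}');"
    )

# Hypothetical file path; only the name is inspected, the contents are never read.
statement = static_load_statement("/data/users_US.csv", "t1", "country")
print(statement)
```

The point of the sketch is that the partition is decided once per file, which is why a static load amounts to moving a file into the right directory.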

# 2. Dynamic Partitions

In [None]:
1.A single INSERT that fills several partitions of a partitioned table, with Hive deriving each row's partition from the data, is known as dynamic partitioning.
2.Dynamic partitioning usually loads the data from a non-partitioned table.
3.Dynamic partitioning takes more time to load data than static partitioning.
4.Dynamic partitioning is suitable when a large amount of data is already stored in a table.
5.It is also suitable when you do not know in advance how many partitions (distinct partition values) the data will produce.
6.With dynamic partitioning, no WHERE clause is required to limit the rows going into each partition. ALTER TABLE partition 
 operations have no dynamic form; they always take explicit partition values.
7.You can use dynamic partitioning on both 'hive external tables and managed tables'. To make every partition column dynamic, 
 the partition mode must be 'nonstrict'; in strict mode at least one partition column must be given a static value. 
 These are the dynamic partition properties to set:

'SET hive.exec.dynamic.partition = true;'
'SET hive.exec.dynamic.partition.mode = nonstrict;'

 hive> INSERT INTO TABLE t2 PARTITION(country) SELECT * FROM t1;   #t1(userid, name, occupation, country)

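NOTE- Make sure the partition column is the last column selected from the non-partitioned source table: Hive assigns dynamic partition values by position, not by name (here country, the last column of t1, feeds the partition column of t2).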
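
The positional rule in the note can be sketched in Python. This is a simplified model rather than Hive's implementation: `dynamic_insert` is a hypothetical helper that treats the last value of every selected row as the partition value, and the sample users are made up.

```python
def dynamic_insert(rows, partition_key):
    """Route each row to a partition named after its LAST value (positional, not by name)."""
    partitions = {}
    for row in rows:
        *data, value = row                           # last column becomes the partition value
        directory = f"{partition_key}={value}"
        partitions.setdefault(directory, []).append(tuple(data))
    return partitions

# Source columns in the expected order: userid, name, occupation, country
good = dynamic_insert(
    [(1, "Ann", "engineer", "US"), (2, "Raj", "doctor", "UK")],
    "country",
)

# Same data with country NOT last: occupation is silently used as the partition value
bad = dynamic_insert([(1, "Ann", "US", "engineer")], "country")

print(good)
print(bad)
```

In the second call the row lands in a partition named after the occupation, which is the kind of silent mix-up the note warns about when the source column order does not put the partition column last.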