## Table Of Contents
* [Introduction](#Introduction)
* [Requirements for a DB](#Requirements-for-a-DB)
* [Hbase Properties](#Hbase-Properties)
* [Hbase vs RDBMS](#Hbase-vs-DBMS)
* [Advantages of a Columnar store](#Advantages-of-a-Columnar-store)
* [4-Dimensional Data Model](#4-Dimentional-Data-Model)
* [Hbase Architecture](#Hbase-Architecture)
* [HBase Read/Write Operations](#HBase-Read/Write-Operations)
  - [HBase Write Operations](#HBase-Write-Operations)
  - [Hbase Read Operations](#Hbase-Read-Operations)
* [Compactions](#Compactions)
* [Update/Delete in Hbase](#Update/Delete-in-Hbase)
* [Practicals](#Practicals)
* [CAP Theorem](#CAP-Theorem)
  - [Consistency](#Consistency)
  - [Availability](#Availability)
  - [Partition Tolerance](#Partition-Toleance)
* [Cassandra](#Cassandra)
* [Cassandra vs Hbase](#Cassandra-vs-Hbase)
* [Hive-Hbase Integration](#Hive-Hbase-Integration)

### Introduction
**HBase** is a distributed database management system that runs on top of Hadoop.
* Stores data in HDFS.
* It is scalable: capacity grows in direct proportion to the number of nodes.
* Fault tolerant.


### Requirements for a DB
* Structured manner - data is organized in rows and columns.
* Random access to your data - through indices.
* Low latency - little time is spent finding n rows within a much larger set of rows.
* ACID compliance
  - Atomicity - All or nothing (either all the operations in a transaction are executed, or none of them is).
  - Consistency - constraints can be placed on DB tables so that the data stays consistent.
  - Isolation - when multiple users operate on a DB at the same time, their transactions must take effect in a well-defined sequence. Examples are the locking mechanism on tables and row versioning.
  - Durability - once changes such as updates and inserts are made, they must persist even if the system fails afterwards. In short, avoiding data loss.
  
  Hadoop is not a database; it does not satisfy the requirements listed above.

### Hbase Properties
**Structured**:
HBase is a loosely structured DB.

**Low Latency**:
Provides real-time access using row-based indices called row keys.

**Random Access**:
Possible based on row keys.

**ACID Compliant**:
To some extent, at row level.

It offers two things: quick ***processing*** and quick ***searching***. Here we are mostly concerned with searching based on row keys.

### Hbase vs DBMS

* A DBMS stores data row by row, while HBase stores it in columnar fashion (NoSQL).
* Normalization is standard in a DBMS, but in HBase it is not preferred.
* A DBMS offers joins, group by, order by etc. In NoSQL we can only perform CRUD operations. Why? Because the data is denormalized.
* DBMSs are ACID compliant. HBase, however, provides ACID compliance only at row level: when we update multiple columns of a single row, either all of them are updated or none is. For updates that span multiple rows there is no guarantee.


### Advantages of a Columnar store
* If the data is sparse, it saves space: no key-value pair is stored for a column that has no value.
* Adding a new column is easy, since it only requires a new key-value pair.

### 4-Dimentional Data Model
* **Row Key** - Unique identifier for a record. Row keys are stored as byte arrays and sorted in ascending order.
* **Column** - The individual columns, e.g. dept, grade, title, Name, SSN.
* **Column family** - Columns grouped in a logical way (work(dept, grade, title) | Personal(Name, SSN)). Each column family is stored in a separate file and is set up at schema definition time.
* **Timestamp** - e.g. Title as AVP on 23455676 and VP on 2445654. Timestamps are used for versioning, and the latest version is the one shown.
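
The four dimensions can be pictured as nested lookups: row key, then column family, then column, then timestamp. Below is a minimal Python sketch of that idea. The `Table` class, its methods, the row key `emp1`, the name `Asha` and the timestamps 100 and 200 are all hypothetical and only illustrate the model; this is not how HBase is implemented.

```python
class Table:
    """Toy model: row key -> column family -> column -> {timestamp: value}."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self.column_families = set(column_families)
        self.rows = {}

    def put(self, row_key, family, column, value, ts):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        versions = (self.rows.setdefault(row_key, {})
                             .setdefault(family, {})
                             .setdefault(column, {}))
        versions[ts] = value  # a new timestamp adds a new version

    def get(self, row_key, family, column):
        versions = self.rows.get(row_key, {}).get(family, {}).get(column, {})
        if not versions:
            return None  # sparse: nothing is stored for a missing cell
        return versions[max(versions)]  # the latest version wins


employee = Table(["work", "personal"])
employee.put("emp1", "work", "title", "AVP", ts=100)
employee.put("emp1", "work", "title", "VP", ts=200)
employee.put("emp1", "personal", "name", "Asha", ts=100)
```

Calling `employee.get("emp1", "work", "title")` returns the value with the highest timestamp, and a column that was never written simply has no entry.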

### Hbase Architecture
Let's look at what a table looks like in HBase:

Table - employee</br>
Column Family - Column groups</br>
personal - Name, Age, Address</br>
professional - Designation, Department, Salary</br>

Row_key = employee_id(1...10000)

These are divided like this:</br>
1-2500 (region 1)</br>
2501 - 5000 (region 2)</br>
5001 - 7500 (region 3)</br>
7501 - 10000 (region 4)</br>

Regions are handled by region servers:

Region server 1 = Region 1 and Region 2</br>
Region server 2 = Region 3 and Region 4</br>
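
Because each region covers a contiguous, sorted range of row keys, finding the region for a key is a search over the region start keys. The sketch below uses the ranges from the example above. The function name `locate` and the use of plain integers are assumptions made for illustration; HBase itself compares row keys as byte arrays.

```python
import bisect

# Start key of each region, taken from the employee_id ranges above.
REGION_START_KEYS = [1, 2501, 5001, 7501]
REGION_NAMES = ["region 1", "region 2", "region 3", "region 4"]
REGION_TO_SERVER = {
    "region 1": "region server 1",
    "region 2": "region server 1",
    "region 3": "region server 2",
    "region 4": "region server 2",
}


def locate(employee_id):
    """Return (region, region server) responsible for an employee_id."""
    if not 1 <= employee_id <= 10000:
        raise ValueError("employee_id outside the table's key range")
    # Rightmost region whose start key is <= employee_id.
    index = bisect.bisect_right(REGION_START_KEYS, employee_id) - 1
    region = REGION_NAMES[index]
    return region, REGION_TO_SERVER[region]
```

For example, `locate(2500)` falls in region 1 and `locate(2501)` in region 2, both on region server 1.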


* If we have 4 data nodes, we will have 4 region servers, one running on each data node. Good practice is **1:1** for **region_server:data_node**.
* Each region server can hold **multiple regions**, and each region holds data in ascending order of row key.
* In each region there is a **memstore** for every column family. New records are first written to the memstore and kept in memory until a defined threshold is reached; the memstore is then flushed to disk and a new file is created. The file that is created is called an HFile (in HDFS).
* **WAL (Write Ahead Log)** - If the server crashes while data is still in the memstore, that data could be lost. The WAL tackles this: before an insert is applied to the memstore it is written to the WAL, which is on disk (HDFS), so the log can be replayed after a crash. One per region server.
* **Block Cache** - We can also call it the read cache. Whenever data is read it is cached in memory, so that the next read can fetch it directly from memory. One per region server.
* **Zookeeper** - An open source coordination service for distributed systems. Every server sends a heartbeat to Zookeeper so that Zookeeper can keep track of which servers are alive. It also holds the location of the meta table (the table that maps row key ranges to regions and region servers). The meta table lives on one of the region servers, and Zookeeper keeps track of which region server is holding it.
* **Hmaster** - HBase has a master-slave architecture: Hmaster is the master and the region servers are the slaves. Hmaster assigns regions to region servers. If a region server fails or its load increases, Hmaster rebalances by reassigning some of its regions to other region servers. HBase can run more than one master process, but only one of them is the active Hmaster at a time; the others are passive.

**Hfiles**
  1. Store data as sorted key-value pairs.
  2. Are immutable: once created they cannot be modified.
  3. Are large, with a size that depends on the memstore size.
  4. Store data in a set of blocks, so that reads can use a block index to find the right block.
  5. Binary search is applied within a block to find the data, since the data is stored in sorted order.

**MetaTable**
  1. A data structure that stores the location of the regions along with their region servers.
  2. It helps the client identify the region server, and the corresponding region, where a specific range of key-value pairs is stored.
  3. The meta table is stored on one of the region servers, and its location is kept in Zookeeper.
  
  
### HBase Read/Write Operations
Both reads and writes start with the same steps:

1. The client contacts Zookeeper to fetch the location of the meta table (if the client doesn't have the latest version of the meta table cached).
2. The client queries the meta table to find the region server that holds a specific key-value pair.
3. The client caches the region information and the meta table location for future interactions.
4. The client can now contact the region server found in step 2. This region server hands the request to the specific region for the read/write operation.
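
The lookup-and-cache steps above can be sketched as follows. This is a toy model, not the HBase client API: the `Client` class, the dictionary standing in for Zookeeper, and the meta table layout of `(start_key, end_key, server)` tuples are all assumptions made for illustration.

```python
class Client:
    """Toy client that caches the meta table after the first lookup."""

    def __init__(self, zookeeper, region_servers):
        self.zookeeper = zookeeper            # hypothetical: {"meta_location": server name}
        self.region_servers = region_servers  # hypothetical: server name -> hosted tables
        self.meta_cache = None
        self.zookeeper_lookups = 0

    def _meta_table(self):
        if self.meta_cache is None:
            # Step 1: ask Zookeeper which region server hosts the meta table.
            self.zookeeper_lookups += 1
            meta_server = self.zookeeper["meta_location"]
            # Steps 2-3: read the meta table from that server and cache it.
            self.meta_cache = self.region_servers[meta_server]["meta"]
        return self.meta_cache

    def find_region_server(self, row_key):
        # Step 4 would send the actual read/write to the server returned here.
        for start_key, end_key, server in self._meta_table():
            if start_key <= row_key <= end_key:
                return server
        raise KeyError(f"no region covers row key {row_key}")


meta = [(1, 5000, "region server 1"), (5001, 10000, "region server 2")]
client = Client(
    zookeeper={"meta_location": "region server 1"},
    region_servers={"region server 1": {"meta": meta}, "region server 2": {}},
)
```

However many keys the client looks up, `zookeeper_lookups` stays at 1 in this sketch, which is the point of caching on the client side.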

#### HBase Write Operations

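1. The data is first written to the WAL.
2. Once the data is in the WAL, it is written to the memstore. When the memstore is full, its data is flushed to HDFS in the form of an HFile.
3. Finally an acknowledgement is sent back to the client. The acknowledgement follows the WAL and memstore writes; it does not wait for the flush to an HFile.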
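
A minimal sketch of this write path, assuming a toy `Region` class with Python lists and dicts standing in for the WAL, memstore and HFiles. The class, the `flush_threshold` parameter and the choice to clear the log on flush are simplifications for illustration, not HBase behaviour in detail.

```python
class Region:
    """Toy write path: WAL first, then memstore, flush to an immutable file."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # stand-in for the on-disk write-ahead log
        self.memstore = {}   # in-memory buffer, lost on a crash
        self.hfiles = []     # each flush adds one immutable, sorted file
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # 1. log the write first
        self.memstore[row_key] = value     # 2. then apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self._flush()
        return "ack"                       # 3. acknowledge to the client

    def _flush(self):
        # Write the memstore out as a sorted, immutable tuple of pairs.
        self.hfiles.append(tuple(sorted(self.memstore.items())))
        self.memstore = {}
        self.wal.clear()  # simplification: flushed entries need no replay

    def crash_and_recover(self):
        self.memstore = {}                # memory is lost in the crash
        for row_key, value in self.wal:   # replay the log to rebuild it
            self.memstore[row_key] = value
```

After two `put` calls and a simulated crash, `crash_and_recover()` rebuilds the memstore from the log; a third `put` reaches the threshold and produces one sorted file.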

#### Hbase Read Operations
1. The region server first checks the block cache, which stores recently accessed data.
2. If the data is not in the block cache, it checks the memstore for the required data.
3. If the data is not in the memstore either, the only way is to read an HFile from disk. The HFile containing the particular key-value pair is identified.

Once the HFile is identified, instead of reading the entire file, the **Data Block Index** of the HFile is scanned to find the data block that holds the key-value pair. A binary search in this data block finally returns the data, or null if the data is not present.
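
The block index plus binary search can be sketched like this. The `HFile` class below is a toy with a hypothetical `block_size` counted in entries; real HFiles use byte-sized blocks (the `describe` output later shows `BLOCKSIZE => '65536'`) and a more elaborate index.

```python
import bisect


class HFile:
    """Toy immutable file: sorted pairs split into blocks, plus a block index."""

    def __init__(self, sorted_pairs, block_size=2):
        self.blocks = [
            sorted_pairs[i:i + block_size]
            for i in range(0, len(sorted_pairs), block_size)
        ]
        # The index holds the first key of every block.
        self.index = [block[0][0] for block in self.blocks]

    def get(self, key):
        # Use the index to pick the only block that could contain the key.
        block_number = bisect.bisect_right(self.index, key) - 1
        if block_number < 0:
            return None  # key sorts before the first block
        block = self.blocks[block_number]
        # Binary search inside that single block.
        keys = [k for k, _ in block]
        position = bisect.bisect_left(keys, key)
        if position < len(keys) and keys[position] == key:
            return block[position][1]
        return None


hfile = HFile([("a", 1), ("c", 3), ("e", 5), ("g", 7), ("i", 9)])
```

Only one block is searched per lookup, so `hfile.get("e")` never touches the blocks that start at `"a"` or `"i"`.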

### Compactions
Each memstore flush creates a new HFile, so HFiles accumulate, especially under heavy incoming writes. This leads to two major problems:

1. Read efficiency drops, because with so many files more disk seeks are required.
2. It leads to dirty data: a large number of HFiles and redundant data.

Compaction solves these problems. Compaction is the process of combining many small HFiles into fewer, larger ones.

There are two types:
1. Minor - HBase picks some of the smaller files and rewrites them into a few larger HFiles.
2. Major - All the smaller files are rewritten into a single large HFile. This is resource intensive because it merges a lot of HFiles into one, so Hadoop admins run it when traffic is low.
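
Since every HFile is already sorted, compaction is essentially a merge of sorted inputs. The sketch below is a simplified illustration: the `compact` function and the `(row_key, timestamp, value)` cell layout are assumptions, and it keeps only the newest version of each key, which corresponds to a column family configured with a single version.

```python
import heapq


def compact(hfiles):
    """Merge sorted files into one, keeping the newest version of each key.

    Each input file is a list of (row_key, timestamp, value) tuples,
    sorted by row_key with newer timestamps first for equal keys.
    """
    # Merge by row key, newest timestamp first within the same key.
    merged = heapq.merge(*hfiles, key=lambda cell: (cell[0], -cell[1]))
    result = []
    for row_key, timestamp, value in merged:
        if result and result[-1][0] == row_key:
            continue  # an older version of a key we already kept
        result.append((row_key, timestamp, value))
    return result


older_file = [("a", 1, "x"), ("c", 1, "y")]
newer_file = [("a", 2, "x2"), ("b", 2, "z")]
compacted = compact([older_file, newer_file])
```

The two input files hold four cells in total; the compacted output holds three, because the older version of `"a"` is dropped.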

### Update/Delete in Hbase
**Updates** are done using timestamps (versioning). An update writes a new version of the value, and reads return only the latest version.

**Delete**: Deletes in HBase are a special type of update: the value for which a delete is requested is not removed immediately. Instead it is masked by a tombstone marker. Every request to read this value returns NULL to the client because of the tombstone marker, so the client sees it as deleted.

This is done because HFiles are immutable. All the values with a tombstone marker are removed during the next major compaction.
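
A small sketch of how a tombstone masks a value until compaction removes it. The `Store` class, the `TOMBSTONE` sentinel and the explicit timestamps are hypothetical and only mirror the behaviour described above on an append-only list.

```python
TOMBSTONE = object()  # sentinel marking a deleted value


class Store:
    """Toy append-only store: updates and deletes only ever add cells."""

    def __init__(self):
        self.cells = []  # list of (row_key, timestamp, value)

    def put(self, row_key, value, ts):
        self.cells.append((row_key, ts, value))

    def delete(self, row_key, ts):
        # Nothing is removed; a marker is written instead.
        self.cells.append((row_key, ts, TOMBSTONE))

    def get(self, row_key):
        versions = [(ts, v) for k, ts, v in self.cells if k == row_key]
        if not versions:
            return None
        _, latest = max(versions, key=lambda version: version[0])
        return None if latest is TOMBSTONE else latest

    def major_compact(self):
        # Keep the newest cell per key, then drop keys whose newest cell
        # is a tombstone.
        newest = {}
        for row_key, ts, value in self.cells:
            if row_key not in newest or ts > newest[row_key][0]:
                newest[row_key] = (ts, value)
        self.cells = [
            (row_key, newest[row_key][0], newest[row_key][1])
            for row_key in sorted(newest)
            if newest[row_key][1] is not TOMBSTONE
        ]
```

Before compaction the deleted row still occupies space in `cells` even though `get` returns `None` for it; after `major_compact()` it is physically gone.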

### Practicals

```shell
[itv002768@g02 ~]$ hbase shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hbase-2.3.4/lib/client-facing-thirdparty/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.3.4, rafd5e4fc3cd259257229df3422f2857ed35da4cc, Thu Jan 14 21:32:25 UTC 2021
Took 0.0005 seconds
hbase(main):001:0>
```

**If services are not running**

Run `service --status-all` and check the status of "hbase master" and "hbase region server".

To restart the services, run `service hbase-master restart` and `service hbase-regionserver restart`.

After restarting the services, start a new session of hbase shell.


```shell
hbase(main):001:0> create 'students', 'personal_details', 'contact_details', 'marks'
Created table students
Took 1.6689 seconds
=> Hbase::Table - students
hbase(main):002:0> list 'students'
TABLE
students
1 row(s)
Took 0.0108 seconds
=> ["students"]

## put 'TABLE_NAME', 'ROW_KEY', 'COLUMN_FAMILY:COLUMN', 'VALUE'
hbase(main):003:0> put 'students','student1','personal_details:name', 'Tushar'
Took 0.4423 seconds
hbase(main):004:0> put 'students','student1','personal_details:email', 'tushar5353@gmail.com'
Took 0.0047 seconds

hbase(main):006:0> scan 'students'
ROW                                      COLUMN+CELL
 student1                                column=personal_details:email, timestamp=2022-07-26T13:06:11.017, value=tushar5353@gmail.com
 student1                                column=personal_details:name, timestamp=2022-07-26T13:05:46.913, value=Tushar
1 row(s)
Took 0.0608 seconds

# To get a subset of a table let's say a particular row key
hbase(main):008:0> get 'students', 'student1'
COLUMN                                   CELL
 personal_details:email                  timestamp=2022-07-26T13:06:11.017, value=tushar5353@gmail.com
 personal_details:name                   timestamp=2022-07-26T13:05:46.913, value=Tushar
1 row(s)
Took 0.0775 seconds

# To get info for a column family
hbase(main):009:0> get 'students', 'student1', {COLUMN => 'personal_details'}
COLUMN                                   CELL
 personal_details:email                  timestamp=2022-07-26T13:06:11.017, value=tushar5353@gmail.com
 personal_details:name                   timestamp=2022-07-26T13:05:46.913, value=Tushar
1 row(s)
Took 0.0114 seconds

#select name from students where id='student1'
hbase(main):010:0> get 'students', 'student1', {COLUMN => 'personal_details:name'}
COLUMN                                   CELL
 personal_details:name                   timestamp=2022-07-26T13:05:46.913, value=Tushar
1 row(s)
Took 0.0063 seconds

# Delete email from students where id='student1'
hbase(main):012:0> delete 'students', 'student1', 'personal_details:email'
Took 0.0218 seconds

hbase(main):015:0* describe 'students'
Table students is ENABLED
students
COLUMN FAMILIES DESCRIPTION
{NAME => 'contact_details', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

{NAME => 'marks', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

{NAME => 'personal_details', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

3 row(s)
Quota is disabled
Took 0.0703 seconds

hbase(main):016:0> exists 'students'
Table students does exist
Took 0.0178 seconds
=> true


# To drop a table we need to disable it first.
# Data may still be in the memstore and not yet flushed. Disabling the table flushes the memstore contents to disk, after which the table can be dropped.
hbase(main):017:0> drop 'students'
ERROR: Table students is enabled. Disable it first.

For usage try 'help "drop"'

Took 0.0121 seconds

hbase(main):018:0> disable 'students'
Took 0.7219 seconds
hbase(main):019:0> drop 'students'
Took 0.3693 seconds

hbase(main):025:0> create 'census_itv002768', 'personal', 'professional'
Created table census_itv002768
Took 1.1288 seconds
=> Hbase::Table - census_itv002768

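# Use LIMIT => 1 in a scan to limit the number of rows returned
# Use STARTROW => '3', STOPROW => '5' in a scan to get row keys from '3' (inclusive) up to '5' (exclusive); row keys are compared as byte strings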

[itv002768@g02 ~]$ hadoop fs -ls -h /hbase/data/default/xyz_itv001044/d3c8336f4005a4b64916ea030eb1aafe/cf1
Found 1 items
-rw-r--r--   3 hbase supergroup      4.8 K 2021-09-27 12:27 /hbase/data/default/xyz_itv001044/d3c8336f4005a4b64916ea030eb1aafe/cf1/dffdd255d0004a97954d9d79809875d6 ----- Hfile

```

### CAP Theorem

CAP theorem applies to the distributed systems that store data.</br>
It stands for:</br>
**C - Consistency**</br>
**A - Availability**</br>
**P - Partition Tolerance**</br>
The CAP theorem says that out of the above three we can only get two. We have to choose two of them according to our requirements and build the system accordingly.</br>
Three Combinations that are possible:</br>
**CA**(Consistency and Availability)</br>
**AP**(Availability and Partition tolerance)</br>
**CP**(Consistency and Partition tolerance)</br>


#### Consistency

Each node holds the latest value. Whenever we request a value we get the latest one, and that is guaranteed: the system cannot return an old value as the result.</br>
Let's say you are sending a message to someone on WhatsApp. If that person is offline, they will not get any message; in other words, they should never get old or garbage data. They should always get the latest one.

#### Availability
The system should always give a response. Even if it is not sure of having the latest value, it should still respond. There is no guarantee that the response carries the latest value.


#### Partition Toleance

A system will continue to operate even if there is a network failure.


Consistency and Availability are provided by an RDBMS.

When we talk about a distributed system, we have to make sure that partition tolerance is there. So the tradeoff is always between Availability and Consistency, which means we have to choose one of the two. All distributed systems therefore fall into two categories: AP and CP.

NoSQL systems: The data is stored in a distributed manner across a cluster of interconnected machines, and the system tolerates network partitions. There are two flavours of NoSQL databases, providing two sets of guarantees:

1. Consistency and Partition Tolerance (HBase and MongoDB)
2. Availability and Partition Tolerance (Cassandra and DynamoDB)

If we choose Availability over Consistency, the system processes the query and returns some information even if the data is not the latest. This suits cases that need immediate results even if they are not the latest, e.g. a hotel booking app.

Let's try to prove the CAP theorem by contradiction.

For the time being, let's assume all three are possible: a distributed data store possesses all three.

The same object is stored on two nodes that are connected by a network. Let's say there is a network issue and the machines are not able to contact each other. One user updates the value on machine 1 from V0 to V1. Another user contacts machine 2 and receives the old value V0. This is availability but not consistency. To return the latest value, machine 2 would have to wait until the network between the two machines is fixed, which gives up availability. Either way one of the three properties is lost, contradicting the assumption.
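
The two-machine scenario can be played out in a few lines. This is a toy simulation under strong assumptions: the `Cluster` class and its `mode` flag are hypothetical, there are only two replicas, and the "CP" mode simply rejects writes during a partition, which is a simplification of what real CP systems do.

```python
class Cluster:
    """Two replicas of one value; behaviour during a partition depends on mode."""

    def __init__(self, mode):
        self.mode = mode              # "AP" or "CP" (hypothetical flag)
        self.values = ["V0", "V0"]    # value held by machine 0 and machine 1
        self.partitioned = False

    def write(self, machine, value):
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse rather than let replicas diverge.
                raise RuntimeError("write rejected during network partition")
            # Availability first: accept the write on the reachable machine only.
            self.values[machine] = value
        else:
            # Healthy network: the write reaches both replicas.
            self.values = [value, value]

    def read(self, machine):
        return self.values[machine]
```

In "AP" mode a write during the partition succeeds but the other machine still returns `V0`; in "CP" mode the write fails, so both machines keep agreeing on `V0`.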



## Cassandra
Cassandra is a distributed, column-oriented database that is highly performant and highly scalable. Like HBase, it is used when we require transactional activities or quick retrieval.

* CAP in case of Cassandra

HBase - CP (Consistency and Partition tolerance)</br>
Cassandra - AP (Availability and Partition tolerance)

LinkedIn - Suppose your post has 100 likes. After one more like it should show 101 likes. Instead, it may show that someone liked the post without increasing the counter, and update the counter after some time. In this case we can use Cassandra: it won't give you strict consistency, but it will give you eventual consistency, and it won't return an error.


* What does a Cassandra cluster look like?

There is no master node; all nodes are peers. It is a decentralized architecture. HBase, in contrast, runs on a Hadoop cluster and has a master-slave architecture where the master is Hmaster and the slaves are the region servers.

To communicate among the peers, Cassandra uses a gossip protocol, through which each node learns what is present on the other machines.

In a master-slave architecture there can be downtime if the master fails (a secondary master can also go down, so some risk is always involved). Cassandra has no master, which makes this model highly available.

* Tunable Read/Write Consistency:</br>
By default Cassandra compromises on consistency to be highly available.
- The client sends a request to get the value of A
- The request goes to one of the machines (e.g. Node 5)
- Node 5 talks to Node 1 to get the result
- Node 5 returns the result to the client whether or not it is the latest

Cassandra provides tunable consistency, e.g. return a result only if n nodes agree on it, or at quorum level (a majority of the replicas must return the same result).
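
The effect of the consistency level on a read can be sketched as below. The `read` function and the replica layout are hypothetical; the level names `ONE`, `QUORUM` and `ALL` match Cassandra's consistency levels, and picking the value with the newest timestamp mirrors its last-write-wins resolution. Which replicas answer first is arbitrary in practice; here it is simply the first ones in the list.

```python
def read(replicas, consistency):
    """Return the value seen when reading at the given consistency level.

    replicas: list of (timestamp, value) pairs, one per replica node.
    consistency: "ONE", "QUORUM" or "ALL".
    """
    total = len(replicas)
    required = {"ONE": 1, "QUORUM": total // 2 + 1, "ALL": total}[consistency]
    # Assumption: the first `required` replicas are the ones that respond.
    responses = replicas[:required]
    # The response with the newest timestamp wins.
    newest = max(responses, key=lambda response: response[0])
    return newest[1]


# Three replicas; only the second has received the latest write so far.
replicas = [(1, "old"), (2, "new"), (1, "old")]
```

With these replicas, a read at `ONE` can return the stale value, while `QUORUM` and `ALL` contact enough replicas to see the newer write.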

### Cassandra vs Hbase

<style type="text/css">
  .reveal p {
    text-align: left;
  }
  .reveal ul {
    display: block;
  }
  .reveal ol {
    display: block;
  }  
</style>


<table>
<tr>
<th>Cassandra</th>
<th>Hbase</th>
</tr>
<tr>
<td>
 
* NoSQL distributed DB that holds data in columnar fashion.
* Highly scalable.
* Can be used for transactions with updates, inserts and reads at low latency.
* Cassandra has a decentralized architecture; there is no master.
* It is highly available as it has no dependence on a master.
* It provides AP (Availability and Partition tolerance) and eventual consistency.
* It provides tunable consistency.
* Cassandra has a separate cluster for keeping data.
* It has its own query language (CQL).
   
</td>
<td>
    
* NoSQL distributed DB that holds data in columnar fashion.
* Highly scalable.
* Can be used for transactions with updates, inserts and reads at low latency.
* It has a master-slave architecture.
* It provides CP (Consistency and Partition Tolerance).
* It runs on top of a Hadoop cluster; data is kept in HDFS.
* Perfect choice if you're working on Hadoop.


</td>
</tr>
</table>


* Sometimes the HBase syntax looks hard; in that case you can use Apache Phoenix. It provides a wrapper on top of HBase so that you can write simple SQL to query HBase.

## Hive-Hbase Integration

We'll create a table that we can access from both Hive and HBase. We want to access this table in Hive to do aggregations or other processing, and from HBase when quick searches or inserts/updates are required.
In these cases we can create an HBase table managed by Hive.

##### Use case of Hive-Hbase table(a.k.a Hbase table managed by hive)
- If we want to do any processing on an HBase table, like group by, aggregation or any kind of map-reduce activity, it's better to create an HBase table managed by Hive.
- If you are doing some processing in Hive and, once the processing is done, you want to dump the data in HBase for quick searches.

##### Practical
- We'll have one dataset kv1.txt and we'll be creating a hive table based on structure of kv1.txt file.
```shell
[itv002768@g02 week7_dataset]$ head kv1-200927-183907.txt
238val_238
86val_86
311val_311
27val_27
165val_165
409val_409
255val_255
278val_278
98val_98
484val_484
[itv002768@g02 week7_dataset]$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = d06750c2-9909-4904-a934-f20c6baa9a37

Logging initialized using configuration in file:/opt/apache-hive-3.1.2-bin/conf/hive-log4j2.properties Async: true
Hive Session ID = ad30fbd4-1caa-490c-9bd6-0912ad7df46d
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> create table pokes(foo int, bar string);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table hive.default.pokes already exists)
hive> use tushar_test;
OK
Time taken: 0.129 seconds
hive> create table pokes(foo int, bar string);
OK
Time taken: 0.361 seconds
```

- Loading the file kv1.txt in the hive table created in previous step.

```shell
hive> load data local inpath '/home/itv002768/week7_dataset/kv1-200927-183907.txt' overwrite into table pokes;
Loading data to table tushar_test.pokes
OK
Time taken: 2.215 seconds
```

- Verifying the data in the table.

```shell
hive> select * from pokes limit 10;
OK
238     val_238
86      val_86
311     val_311
27      val_27
165     val_165
409     val_409
255     val_255
278     val_278
98      val_98
484     val_484
Time taken: 15.371 seconds, Fetched: 10 row(s)
```
- Create a Hbase table managed by Hive which we can access from both Hive and Hbase.

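  The clauses of the `CREATE TABLE hbase_table_1(key int, value string)` statement used for this:

  - `STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'` makes Hive store the table in HBase.
  - `WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")` maps the Hive columns to HBase in order: `:key` maps the Hive column `key` to the HBase row key, and `cf1:val` maps the Hive column `value` to column `val` of column family `cf1`.
  - `TBLPROPERTIES ("hbase.table.name" = "tushar_shared_table")` sets the table name in HBase: in Hive the table is named `hbase_table_1`, and in HBase we see it as `tushar_shared_table`. This property is optional; if it is not provided, the table name in HBase is the same as in Hive.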

```shell
hive> CREATE TABLE hbase_table_1(key int, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
    > TBLPROPERTIES ("hbase.table.name" = "tushar_shared_table");
2022-07-29 07:52:22,113 INFO  [d49fb0d7-016f-4072-b6f0-79f1f1d9e3c2 main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x100bba26 connecting to ZooKeeper ensemble=m01.itversity.com:2181,m02.itversity.com:2181,w01.itversity.com:2181
2022-07-29 07:52:24,434 INFO  [d49fb0d7-016f-4072-b6f0-79f1f1d9e3c2 main] client.HBaseAdmin: Operation: CREATE, Table Name: default:tushar_shared_table completed
OK
Time taken: 2.588 seconds
```


- Load the data from the normal Hive table into the table created in the previous step.

```shell
hive> INSERT OVERWRITE TABLE hbase_table_1 select * from pokes where foo=98;
Query ID = itv002768_20220729075409_265ead69-b770-46ec-a72e-12dce8d6640d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
2022-07-29 07:54:26,484 WARN  [d49fb0d7-016f-4072-b6f0-79f1f1d9e3c2 main] mapreduce.TableMapReduceUtil: The addDependencyJars(Configuration, Class<?>...) method has been deprecated since it is easy to use incorrectly. Most users should rely on addDependencyJars(Job) instead. See HBASE-8386 for more details.
2022-07-29 07:54:26,486 WARN  [d49fb0d7-016f-4072-b6f0-79f1f1d9e3c2 main] mapreduce.TableMapReduceUtil: The addDependencyJars(Configuration, Class<?>...) method has been deprecated since it is easy to use incorrectly. Most users should rely on addDependencyJars(Job) instead. See HBASE-8386 for more details.
2022-07-29 07:54:28,519 INFO  [d49fb0d7-016f-4072-b6f0-79f1f1d9e3c2 main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x7fcbc336 connecting to ZooKeeper ensemble=m01.itversity.com:2181,m02.itversity.com:2181,w01.itversity.com:2181
Starting Job = job_1658918988971_0779, Tracking URL = http://m02.itversity.com:19088/proxy/application_1658918988971_0779/
Kill Command = /opt/hadoop/bin/mapred job  -kill job_1658918988971_0779
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 0
2022-07-29 07:54:49,596 Stage-2 map = 0%,  reduce = 0%
2022-07-29 07:54:54,873 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.87 sec
MapReduce Total cumulative CPU time: 2 seconds 870 msec
Ended Job = job_1658918988971_0779
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1   Cumulative CPU: 2.87 sec   HDFS Read: 20143 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 870 msec
OK
Time taken: 47.877 seconds
hive> select * from pokes where foo=98;
OK
98      val_98
98      val_98
Time taken: 14.213 seconds, Fetched: 2 row(s)
hive> select * from hbase_table_1 where key=98;
OK
2022-07-29 07:56:08,108 INFO  [d49fb0d7-016f-4072-b6f0-79f1f1d9e3c2 main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x45482f82 connecting to ZooKeeper ensemble=m01.itversity.com:2181,m02.itversity.com:2181,w01.itversity.com:2181
2022-07-29 07:56:08,195 INFO  [d49fb0d7-016f-4072-b6f0-79f1f1d9e3c2 main] mapreduce.RegionSizeCalculator: Calculating region sizes for table "tushar_shared_table".
2022-07-29 07:56:08,754 INFO  [d49fb0d7-016f-4072-b6f0-79f1f1d9e3c2 main] client.ConnectionImplementation: Closing zookeeper sessionid=0x3017a13950e0bfd
2022-07-29 07:56:08,790 INFO  [d49fb0d7-016f-4072-b6f0-79f1f1d9e3c2 main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x1d4c6e32 connecting to ZooKeeper ensemble=m01.itversity.com:2181,m02.itversity.com:2181,w01.itversity.com:2181
2022-07-29 07:56:08,810 INFO  [d49fb0d7-016f-4072-b6f0-79f1f1d9e3c2 main] mapreduce.TableInputFormatBase: Input split length: 0 bytes.
2022-07-29 07:56:08,849 INFO  [d49fb0d7-016f-4072-b6f0-79f1f1d9e3c2 main] client.ConnectionImplementation: Closing zookeeper sessionid=0x200000102340024
98      val_98
Time taken: 9.482 seconds, Fetched: 1 row(s)

hbase(main):004:0> scan "tushar_shared_table"
ROW                                      COLUMN+CELL
 98                                      column=cf1:val, timestamp=2022-07-29T07:54:54.096, value=val_98
1 row(s)
Took 0.1556 seconds
```

Note that `pokes` returns two rows for `foo=98`, while `hbase_table_1` holds only one: both rows map to the same HBase row key `98`, so the second write lands in the same cell as the first.

