# Hadoop: Hive and HBase databases

## Intro

__NOTE:__ For this notebook, start your server with the `Hadoop (with YARN) and Spark environment`.

![Hadoop in a box](images/hadoop_env.png)

[The Apache Hadoop software library](https://hadoop.apache.org/) is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

<font color='red'>__VERY IMPORTANT NOTE:__ The Hadoop instance installed in the 'Hadoop (with YARN) and Spark environment' is intended for educational purposes only and DOES NOT PERSIST DATA after you stop your server. You can create or delete files in HDFS and write data during a session, but the next time you start the Jupyter server the filesystem will be empty.</font>

## Hive database

[Apache Hive](https://hive.apache.org/) is a data warehouse software project __built on top of Apache Hadoop__ for data query and analysis. It is a distributed, fault-tolerant data warehouse system that enables analytics at massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage __using SQL__.

### Data for demo

The data source is the Kaggle [dataset on flights](https://www.kaggle.com/datasets/goyaladi/flight-dataset).

In [None]:
# copy source file to HDFS
!hdfs dfs -put ./data/flight_data.csv /jovyan

In [None]:
# list files in HDFS
!hdfs dfs -ls /jovyan

We will use the file with flight data. Preview its first lines:

In [None]:
!hdfs dfs -head /jovyan/flight_data.csv

### Create table in Hive

First we need to create a table in Hive; the data will be loaded into it in the next step. You may want to read about [Hive data types](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types) first.

In [None]:
!hive -e \
    "CREATE TABLE flights ( \
        DepartureCity STRING, \
        ArrivalCity STRING, \
        DepartureDate TIMESTAMP, \
        FlightDuration FLOAT, \
        DelayMinutes INT, \
        CustomerID INT, \
        Name STRING, \
        BookingClass STRING, \
        FrequentFlyerStatus STRING, \
        Route STRING, \
        TicketPrice FLOAT, \
        CompetitorPrice FLOAT, \
        Demand FLOAT, \
        Origin STRING, \
        Destination STRING, \
        Profitability FLOAT, \
        LoyaltyPoints INT, \
        Churned BOOLEAN) \
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' \
    LINES TERMINATED BY '\n' \
    TBLPROPERTIES('skip.header.line.count'='1');" 
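A long statement with backslash continuations inside a shell string is easy to break with a stray quote. As a minimal sketch, the same kind of DDL can be assembled in Python from a mapping of column names to Hive types. `build_create_table` is a hypothetical helper introduced here for illustration (it is not part of Hive), and only a few of the columns are shown.

```python
def build_create_table(table, columns, delimiter=",", skip_header=1):
    """Return a Hive CREATE TABLE statement for a delimited text file."""
    # One "name TYPE" entry per column, in the order given by the mapping
    cols = ",\n    ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE TABLE {table} (\n    {cols})\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        "LINES TERMINATED BY '\\n'\n"
        f"TBLPROPERTIES('skip.header.line.count'='{skip_header}')"
    )

# Subset of the flights columns, for illustration only
columns = {
    "DepartureCity": "STRING",
    "ArrivalCity": "STRING",
    "DelayMinutes": "INT",
    "Churned": "BOOLEAN",
}
ddl = build_create_table("flights", columns)
print(ddl)
```

The resulting string could then be passed to `hive -e` as a single argument, for example through `subprocess.run(["hive", "-e", ddl])`, which avoids shell quoting altogether.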

Hive produces a lot of log output, so it is convenient to create a file and collect query results in it:

In [None]:
!touch result.txt
!echo ---------------------------- >> result.txt

Let's load the data into the table we created:

In [None]:
!hive -e "LOAD DATA INPATH '/jovyan/flight_data.csv' OVERWRITE INTO TABLE flights"

__NOTE__ that `LOAD DATA INPATH` moves the file rather than copying it: after loading, the source file no longer exists at its original HDFS location. It now resides in the Hive data warehouse directory, or in the LOCATION specified when the table was created.
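A quick existence check before and after the load makes this move visible. The sketch below wraps `hdfs dfs -test -e`, which exits with status 0 when the path exists. `hdfs_path_exists` and its `run` parameter are hypothetical names introduced for illustration; the `run` hook only exists so the function can be exercised without a cluster.

```python
import subprocess

def hdfs_path_exists(path, run=subprocess.run):
    """Return True if `hdfs dfs -test -e PATH` exits with status 0."""
    completed = run(["hdfs", "dfs", "-test", "-e", path])
    return completed.returncode == 0

# Usage inside the notebook (requires the Hadoop environment):
# hdfs_path_exists("/jovyan/flight_data.csv")
```

Called on `/jovyan/flight_data.csv` after the load has completed, the check should report that the file is gone from its original location.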

### SQL queries to Hive

A test query against the new table:

In [None]:
!hive -S -e "SELECT * FROM flights LIMIT 5" >> result.txt
!echo ---------------------------- >> result.txt
!cat result.txt
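Since every query appends to the same `result.txt`, separated by dashed lines, it can help to split the file back into one block per query. The following is a minimal sketch that assumes the Hive CLI's default tab-separated output; `split_sections` is a hypothetical helper, and the sample text is made up rather than real query output.

```python
def split_sections(text):
    """Split accumulated output into one list of rows per query."""
    sections, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and set(stripped) == {"-"}:
            # A line made only of dashes closes the current block
            if current:
                sections.append(current)
            current = []
        elif stripped:
            # Assumption: columns are separated by tabs
            current.append(line.split("\t"))
    if current:
        sections.append(current)
    return sections

# Made-up sample standing in for the contents of result.txt
sample = (
    "----------------------------\n"
    "A\t1\n"
    "B\t2\n"
    "----------------------------\n"
    "Gold\t3.5\n"
    "----------------------------\n"
)
sections = split_sections(sample)
print(sections[-1])  # rows of the most recent query
```

In the notebook, the same function could be applied to `open("result.txt").read()`.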

A more complex query with aggregation:

In [None]:
!hive -S -e "SELECT FrequentFlyerStatus, avg(DelayMinutes) FROM flights GROUP BY FrequentFlyerStatus" >> result.txt
!echo ---------------------------- >> result.txt
!cat result.txt
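To make the aggregation concrete, here is a minimal pure-Python sketch of what `GROUP BY FrequentFlyerStatus` combined with `avg(DelayMinutes)` computes. The rows are made-up sample values, not taken from the dataset, and `average_by` is a hypothetical helper; Hive runs the real query as a distributed job rather than an in-memory loop.

```python
from collections import defaultdict

# Made-up sample rows, for illustration only
rows = [
    {"FrequentFlyerStatus": "Gold", "DelayMinutes": 10},
    {"FrequentFlyerStatus": "Gold", "DelayMinutes": 30},
    {"FrequentFlyerStatus": "Silver", "DelayMinutes": 5},
]

def average_by(rows, key, value):
    """Group rows by `key` and return the mean of `value` per group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum, count]
    for row in rows:
        totals[row[key]][0] += row[value]
        totals[row[key]][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}

averages = average_by(rows, "FrequentFlyerStatus", "DelayMinutes")
print(averages)  # {'Gold': 20.0, 'Silver': 5.0}
```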

## HBase database

### Basics

[Apache HBase](https://hbase.apache.org/) is an open-source, non-relational, distributed database. You can try it with a very basic example [based on the reference guide](https://hbase.apache.org/book.html).

To run HBase within the `Hadoop (with YARN) and Spark environment`, open a terminal and type `hbase shell`. Then try the following commands in the HBase shell:
```
hbase> create 'test', 'cf'
hbase> list 'test'
hbase> describe 'test'
hbase> put 'test', 'row1', 'cf:a', 'value1'
hbase> scan 'test'
hbase> get 'test', 'row1'
```
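The commands above rely on the HBase data model: a table has column families, and each row key maps to cells addressed as `family:qualifier`. As a rough mental model only, the sketch below mimics `create`, `put`, `get` and `scan` with an in-memory Python class. `ToyTable` is a hypothetical stand-in, not an HBase client, and it ignores cell versions and timestamps.

```python
from collections import defaultdict

class ToyTable:
    """In-memory stand-in for an HBase table: row key -> {'family:qualifier': value}."""

    def __init__(self, *families):
        self.families = set(families)
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Cells may only be written to column families declared at creation
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value

    def get(self, row_key):
        return dict(self.rows.get(row_key, {}))

    def scan(self):
        # HBase returns rows ordered by row key
        return [(key, dict(cells)) for key, cells in sorted(self.rows.items())]

table = ToyTable("cf")                 # create 'test', 'cf'
table.put("row1", "cf:a", "value1")    # put 'test', 'row1', 'cf:a', 'value1'
print(table.get("row1"))               # {'cf:a': 'value1'}
print(table.scan())                    # [('row1', {'cf:a': 'value1'})]
```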

### Data processing

To filter data in a scan, similar to a SQL `WHERE` clause, a few filter classes [must be imported first](https://sparkbyexamples.com/hbase/hbase-table-filtering-data-like-where-clause/).
```
hbase> import org.apache.hadoop.hbase.filter.SingleColumnValueFilter 
hbase> import org.apache.hadoop.hbase.filter.CompareFilter
hbase> import org.apache.hadoop.hbase.filter.BinaryComparator
hbase> scan 'test', { FILTER => SingleColumnValueFilter.new(Bytes.toBytes('cf'), Bytes.toBytes('a'), CompareFilter::CompareOp.valueOf('EQUAL'),BinaryComparator.new(Bytes.toBytes('value1')))}
```
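The semantics of this filter can be sketched in plain Python. The function below keeps rows whose given column equals the expected value; by default it also keeps rows that lack the column entirely, which mirrors the default behaviour of HBase's `SingleColumnValueFilter` unless "filter if missing" is enabled. The function name and the sample rows `row2` and `row3` are hypothetical.

```python
def single_column_value_filter(rows, column, expected, filter_if_missing=False):
    """Keep rows whose `column` equals `expected`.

    Rows without the column are kept unless filter_if_missing is True.
    """
    kept = {}
    for row_key, cells in rows.items():
        if column not in cells:
            if not filter_if_missing:
                kept[row_key] = cells
        elif cells[column] == expected:
            kept[row_key] = cells
    return kept

# Made-up rows: row3 has no 'cf:a' cell at all
rows = {
    "row1": {"cf:a": "value1"},
    "row2": {"cf:a": "value2"},
    "row3": {"cf:b": "value3"},
}
default_result = single_column_value_filter(rows, "cf:a", "value1")
strict_result = single_column_value_filter(rows, "cf:a", "value1", filter_if_missing=True)
print(sorted(default_result))  # ['row1', 'row3']
print(sorted(strict_result))   # ['row1']
```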

Create a table with several column families in the HBase shell:
```
hbase> create 'geo', 'city', 'provcode', 'provname', 'regcode', 'regname', 'postcode', 'salekey', 'ip'
hbase> list 'geo'
hbase> describe 'geo'
hbase> scan 'geo'
```

Import data from a file by running the following in a terminal (not in the HBase shell):
```
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.separator=',' \
    -Dimporttsv.columns=HBASE_ROW_KEY,city,provcode,provname,regcode,regname,postcode,salekey,ip \
    geo ~/__MANUAL/data/dim_geo.csv
```
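The `importtsv.columns` specification pairs each field of an input line with either the row key or a column. A minimal sketch of that mapping is shown below, assuming a plain split on the separator with no quoting support. `map_line` is a hypothetical helper and the sample line is made up. A bare family name such as `city` carries an empty qualifier, so the cell is addressed as `city:`.

```python
def map_line(line, column_spec, separator=","):
    """Map one input line to (row_key, {column: value}) following the column spec."""
    names = column_spec.split(",")
    fields = line.rstrip("\n").split(separator)
    if len(fields) != len(names):
        # The real tool has its own bad-line handling; this sketch simply raises
        raise ValueError(f"expected {len(names)} fields, got {len(fields)}")
    row_key, cells = None, {}
    for name, value in zip(names, fields):
        if name == "HBASE_ROW_KEY":
            row_key = value
        else:
            # A bare family name means an empty qualifier: 'family:'
            column = name if ":" in name else name + ":"
            cells[column] = value
    return row_key, cells

spec = "HBASE_ROW_KEY,city,provcode,provname,regcode,regname,postcode,salekey,ip"
line = "9,Weston,P1,Province One,R1,Region One,00000,42,10.0.0.1"  # made-up values
row_key, cells = map_line(line, spec)
print(row_key, cells["city:"])  # 9 Weston
```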

Look up a row by key, then search the table by column value:
```
hbase> get 'geo', '9'
hbase> import org.apache.hadoop.hbase.filter.SingleColumnValueFilter 
hbase> import org.apache.hadoop.hbase.filter.CompareFilter
hbase> import org.apache.hadoop.hbase.filter.BinaryComparator
hbase> scan 'geo', { FILTER => SingleColumnValueFilter.new(Bytes.toBytes('city'), Bytes.toBytes(''), CompareFilter::CompareOp.valueOf('EQUAL'),BinaryComparator.new(Bytes.toBytes('Weston')))}
```