In [1]:
%load_ext dockermagic

# Hive
![Hive](https://hive.apache.org/images/hive_logo_medium.jpg)

- https://hive.apache.org/

## Setup

- version 3.1.2

In [2]:
%%bash

# Download package
wget -q -c https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

# Copy installation package to container
docker cp apache-hive-3.1.2-bin.tar.gz hadoop:/opt

In [3]:
%%dockerexec -u hadoop hadoop

# unpack file and create link
tar -zxf /opt/apache-hive-3.1.2-bin.tar.gz -C /opt
ln -s /opt/apache-hive-3.1.2-bin /opt/hive

# update envvars.sh
cat >> /opt/envvars.sh << EOF
# Hive
export HIVE_HOME=/opt/hive
export PATH=\$PATH:\$HIVE_HOME/bin

EOF

# Fix guava and slf4j versions
rm /opt/hive/lib/guava-19.0.jar
cp /opt/hadoop/share/hadoop/hdfs/lib/guava-27.0-jre.jar /opt/hive/lib
rm /opt/hive/lib/log4j-slf4j-impl-2.10.0.jar

sudo rm /opt/apache-hive-3.1.2-bin.tar.gz

## Hadoop configuration (for beeline)

- core-site.xml

```xml
<configuration>
...
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
</configuration>
```

## Hive Metastore

- using local Derby database

In [12]:
%%dockerexec -u hadoop hadoop
source /opt/envvars.sh

# create directory in HDFS
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse

# initialize database
mkdir $HIVE_HOME/hiveserver2
cd $HIVE_HOME/hiveserver2
$HIVE_HOME/bin/schematool -dbType derby -initSchema

# start server
nohup $HIVE_HOME/bin/hive --service hiveserver2 \
--hiveconf hive.security.authorization.createtable.owner.grants=ALL \
--hiveconf hive.root.logger=INFO,console &
echo $! > hiveserver2.pid

mkdir: cannot create directory '/opt/hive/hiveserver2': File exists
Metastore connection URL:	 jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver :	 org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User:	 APP
Starting metastore schema initialization to 3.1.0
Initialization script hive-schema-3.1.0.derby.sql

 
Error: FUNCTION 'NUCLEUS_ASCII' already exists. (state=X0Y68,code=30000)
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!
Underlying cause: java.io.IOException : Schema script failed, errorcode 2
Use --verbose for detailed stacktrace.
*** schemaTool failed ***
2021-01-11 02:42:54: Starting HiveServer2


## Example

In [11]:
%%dockerexec -u hadoop hadoop
source /opt/envvars.sh

[ ! -d "/opt/datasets" ] && mkdir /opt/datasets
cd /opt/datasets

wget -q -c https://tinyurl.com/y5roz8kz -O stations.csv
hdfs dfs -mkdir -p bikeshare/stations
hdfs dfs -put stations.csv bikeshare/stations

wget -q -c https://tinyurl.com/y6ln8smc -O trips.csv.zip
unzip trips.csv.zip
rm trips.csv.zip
hdfs dfs -mkdir -p bikeshare/trips
hdfs dfs -put trips.csv bikeshare/trips

2021-01-11 02:29:52,387 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
Archive:  trips.csv.zip
  inflating: trips.csv               
2021-01-11 02:30:01,807 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-01-11 02:30:02,637 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false


## Connect with beeline

1. Run beeline

```bash
docker exec -it -u hadoop -w /opt hadoop bash -c "source /opt/envvars.sh; beeline -n hadoop -u jdbc:hive2://localhost:10000"
```

2. Configure jobs executor

```sql
SET hive.execution.engine=mr;
SET mapreduce.framework.name=yarn;
```

3. Create bikeshare database

```sql
CREATE DATABASE bikeshare;
SHOW DATABASES;
USE bikeshare;
```

4. Create stations table

```sql
CREATE EXTERNAL TABLE stations (
station_id INT,
name STRING,
lat DOUBLE,
long DOUBLE,
dockcount INT,
landmark STRING,
installation STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs:///user/hadoop/bikeshare/stations';
```

5. Create trips table

```sql
CREATE EXTERNAL TABLE trips (
trip_id INT,
duration INT,
start_date STRING,
start_station STRING,
start_terminal INT,
end_date STRING,
end_station STRING,
end_terminal INT,
bike_num INT,
subscription_type STRING,
zip_code STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs:///user/hadoop/bikeshare/trips';
```

6. Show tables

```sql
SHOW TABLES;
DESCRIBE stations;
DESCRIBE trips;
DESCRIBE FORMATTED stations;
DESCRIBE FORMATTED trips;
```

7. Run query - number of trips per terminal

```sql
SELECT start_terminal, start_station, COUNT(1) AS count
FROM trips
GROUP BY start_terminal, start_station
ORDER BY count
DESC LIMIT 10;
```

8. Run query - join between stations and trips

```sql
SELECT t.trip_id, t.duration, t.start_date, s.name, s.lat, s.long, s.landmark
FROM stations s
JOIN trips t ON s.station_id = t.start_terminal
LIMIT 10;
```

9. Exit beeline

```sql
!quit
```

In [13]:
%%dockerexec -u hadoop hadoop

cd /opt/hive/hiveserver2

# kill hiveserver2
kill $(cat hiveserver2.pid)
rm hiveserver2.pid