# Apache Hive
`Used as a database that a few people can query - not for making small incremental updates`


Overview  
HiveQL  
Lab 6 - Basic Hive Queries  
Lab 7 - Applied Hive Queries  
Lab 7x - Complex Fields


---

# Overview

**Data warehouse** for reading, writing, managing large datasets  
Developed on top of MapReduce (now also supports other engines such as Tez and Spark)  
Uses SQL syntax (`HiveQL`) - also works well with Tableau and PowerBI  

The data for Hive and the computations both run on the Hadoop cluster.  
Hive itself runs on the `client machine`: it translates HiveQL into an execution plan for MapReduce (a batch processing platform)
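
One way to look at this translation step is `EXPLAIN`, which prints the plan Hive would run without executing the query. A minimal sketch, assuming the `dualcore` lab database and its `customers` table (with a `state` column) used later in these notes:

```sql
-- Print the execution plan instead of running the query.
-- Assumes the dualcore.customers table from the labs.
EXPLAIN
SELECT state, COUNT(*) AS cnt
FROM dualcore.customers
GROUP BY state;
```

The output lists the stages and the operators inside each stage; the exact layout depends on the Hive version and the execution engine.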

---


### Storage

**Managed tables** `Default` = fully under Hive, not shared with other tools  
**External tables** = shared between Hive and other apps

1. Metadata - Specifies structure, location of data  
-- Metadata is then stored in an RDBMS metastore (Derby or MySQL), not HDFS  

2. Data - typically in an HDFS directory  
-- eg `/user/../table_name`
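
The difference between the two table kinds shows up most clearly at `DROP TABLE` time. A rough sketch, where the table names `demo_managed` and `demo_external` and the `/tmp/demo_external` directory are hypothetical and only there for illustration:

```sql
-- Managed table: Hive owns both the metadata and the data.
-- DROP TABLE removes the metastore entry and the files.
CREATE TABLE demo_managed (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: Hive owns only the metadata.
-- DROP TABLE removes the metastore entry; files under LOCATION stay.
CREATE EXTERNAL TABLE demo_external (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/demo_external';   -- hypothetical HDFS directory

-- Shows the table type (MANAGED_TABLE or EXTERNAL_TABLE) and the data location
DESCRIBE FORMATTED demo_external;
```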

---

### Pros and Cons

**Benefits of Hive**  
1. More efficient than writing directly in MapReduce (much less code)  
2. Easy for non-programmers to use (uses SQL)  
3. Interacts with many tools (eg Tableau)

**Disadvantage of Hive**  
1. Not a 'real time' interactive platform  




<img src=https://i.imgur.com/ikZOUVJ.png width="500" height="440" align="left">


### Hive Use Cases (sentiment analysis, web logs)

1. Data prep  
2. ETL  
3. Data mining (prepping)  
4. Ad optimization

---

### Using Hive (with Beeline)



---

# HiveQL

### Overview

Subset of SQL-92 but has extensions  
Joins must use equality conditions (`=`); non-equality conditions such as `<>` are not allowed in the `ON` clause

**Managed tables** - aka fully under Hive control (eg `user/hive/customers`)  
Database --> table (directory) --> multiple files

---

### Syntax 

(Using Hue, then Query Editor -> Hive)  
`show databases;`  
`use dualcore;`  
`show tables;`  
`describe customers;`

##### Joins

For joins, list largest table last  
Write outer joins with the `OUTER` keyword (eg `LEFT OUTER JOIN`)

`
SELECT tab1.col1, tab2.col2
FROM tab1
JOIN tab2 ON (tab1.match = tab2.match);`
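
The join notes above can be sketched against the lab tables. This assumes `orders` has a `cust_id` column linking it to `customers` (that column is not shown elsewhere in these notes) and that `orders` is the larger of the two tables, so it is listed last:

```sql
-- Equality condition goes in ON; a non-equality filter goes in WHERE.
-- orders.cust_id is an assumed column name.
SELECT c.cust_id, c.fname, o.order_id
FROM customers c
JOIN orders o ON (c.cust_id = o.cust_id)
WHERE c.state <> 'CA';

-- Outer join: keep every customer, then pick the ones with no matching order.
SELECT c.cust_id, c.fname
FROM customers c
LEFT OUTER JOIN orders o ON (c.cust_id = o.cust_id)
WHERE o.order_id IS NULL;
```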

---

### Data Types

##### Integer
<img src=https://i.imgur.com/rEKC16C.png width="400" height="340" align="left">

##### Float

<img src=https://i.imgur.com/6H4bniU.png width="400" height="340" align="left">

##### Simple Scalars
<img src=https://i.imgur.com/CN2ODRK.png width="400" height="340" align="left">

##### Complex
Each field can hold multiple values, which offers faster access and eliminates the need for big joins  
<img src=https://i.imgur.com/SaJ6cUd.png width="400" height="340" align="left">

**List of all** - tinyint, smallint, int, bigint, float, double, decimal, string, char, varchar, boolean, timestamp, binary, array, map, struct
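
To make the list concrete, here is a small sketch of a table that mixes simple and complex types. The table `demo_types` and all of its columns are made up for illustration:

```sql
-- Hypothetical table combining scalar and complex column types
CREATE TABLE demo_types (
  id          BIGINT,
  qty         SMALLINT,
  price       DECIMAL(10,2),
  rating      DOUBLE,
  name        STRING,
  active      BOOLEAN,
  created_at  TIMESTAMP,
  phones      ARRAY<STRING>,
  attributes  MAP<STRING, STRING>,
  address     STRUCT<street:STRING, city:STRING, state:STRING>
);

-- Arrays are indexed from 0, maps by key, structs with dot notation
SELECT phones[0], attributes['color'], address.city
FROM demo_types;
```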

# Pre-lab

`cd C:\vagrant`  
`vagrant up`  
`vagrant ssh`

`beeline -u jdbc:hive2://` to start Beeline  

Alternatively, open Hue at `http://localhost:8888` (username and password are both `cloudera`)

---

# Lab 6 - Basic Hive Queries
https://pages.github.umn.edu/deliu/bigdata19/03-Hive1/lab06-intro.html  
https://pages.github.umn.edu/deliu/bigdata19/03-Hive1/lab06-intro-solution.html


### Running a Query From Hive Shell (with Beeline)
Note that normally we will use Hue, not Beeline

`show databases;`  
`use dualcore;` set current database  
`show tables;`  
`describe customers;` 

Task: look for the winner  
`select * from customers where fname like "Bridg%" and city = "Kansas City";`  
Note: Case sensitive - `fname like "Bridg%"`  
Note: Case insensitive - `lcase(fname) like "bridg%"`  

`!quit` or `ctrl+c` to exit Beeline

---

### Using Hue (web UI for Hadoop)

`File Browser` at the top right to see all the folders we have  
`Job Browser` at the top right to see what we have done in the past

`Query Editor` --> `Hive` to get to where we run commands

`use dualcore;`  

Which three products are the most expensive?  
`SELECT price, brand, name FROM dualcore.products ORDER BY
price DESC LIMIT 3;` 

Count the number of records in the **customers** table (this counts distinct `cust_id` values, which matches the record count when `cust_id` is unique)  
`SELECT COUNT(DISTINCT cust_id) AS total FROM customers;`  

Which state has the most customers?  
`select state, count(*) as cnt from customers group by state order by cnt desc limit 10;`

---


# Lab 7 - Applied Hive Queries
https://pages.github.umn.edu/deliu/bigdata19/03-Hive1/lab07-hiveql.html  
https://pages.github.umn.edu/deliu/bigdata19/03-Hive1/lab07-hiveql-solution.html

Top 3 selling products  
`SELECT brand, name, COUNT(p.prod_id) AS sold
FROM products p
JOIN order_details d
ON (p.prod_id = d.prod_id)
GROUP BY brand, name, p.prod_id
ORDER BY sold DESC
LIMIT 3;`
  
Gross profit in May 2013  
`SELECT PRINTF("$%.2f", SUM(price - cost) / 100) AS profit
FROM products p
JOIN order_details d
ON (d.prod_id = p.prod_id)
JOIN orders o
ON (d.order_id = o.order_id)
WHERE YEAR(order_date) = 2013
  AND MONTH(order_date) = 05;`
  
Per month sales before and after campaign  
`SELECT substr(order_date,1,7) as year_date, 
count(*) as orders FROM orders 
where substr(order_date,1,7) between '2013-02' and '2013-05'
group by substr(order_date,1,7);`

Ad sales by month  
`SELECT recent_orders.month, COUNT(*) AS num_orders
FROM (SELECT order_id, substr(order_date,1,7) AS month
FROM orders
WHERE substr(order_date,1,7) BETWEEN '2013-02' AND '2013-05') recent_orders
JOIN (SELECT order_id
FROM order_details
WHERE prod_id = 1274348) tablets
ON (recent_orders.order_id = tablets.order_id)
GROUP BY recent_orders.month;`

# Lab 7x - Complex Fields

https://pages.github.umn.edu/deliu/bigdata19/03-Hive1/lab07x-complexfields.html  
https://pages.github.umn.edu/deliu/bigdata19/03-Hive1/lab07x-complexfields-solution.html

`SHOW CREATE TABLE customers;` to show the structure of the table

Accessing a struct within a struct  
`SELECT name
FROM customers
WHERE email_preferences.categories.promos=TRUE;`

Accessing a map  
`SELECT name,addresses['billing']
FROM customers where 
addresses['billing'].state='CA';`

Accessing arrays  
`SELECT name, size(orders) as num_orders, 
orders[0].order_date from customers 
where size(orders)>1;`
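
Indexing with `orders[0]` only reaches one element. To get one row per array element, Hive offers `LATERAL VIEW` with `explode`. A minimal sketch against the same `customers` table, where the aliases `order_rows` and `o` are arbitrary names chosen here:

```sql
-- One output row per (customer, order) pair instead of one row per customer.
-- order_rows and o are arbitrary aliases for the exploded array.
SELECT name, o.order_date
FROM customers
LATERAL VIEW explode(orders) order_rows AS o
WHERE size(orders) > 1;
```

Note that a customer whose `orders` array is empty produces no rows with plain `LATERAL VIEW`; `LATERAL VIEW OUTER` keeps such customers with NULLs instead.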