# Part 1 - Text Analytics (Hive)
# Part 2 - Log Data (Hive)
# Part 3 - Extending Hive

Sept 30 - lab 10  
Oct 2 - lab 11

<a href="#String-Functions">Strings</a>  
<a href="#Regular-Expressions">Regex</a>  
<a href="#Extending-Hive-(JSON,-Custom-Scripts,-and-UDFs)">Extending Hive</a>  
<a href="#Optimizing-Hive-Query-Performance">Optimize</a>  


---

# String Functions

[Regex Cheat Sheet](https://pages.github.umn.edu/deliu/bigdata19/05-Hive3/regular-expressions-cheat-sheet-v2.pdf)

**Essential points**  
`Split` creates an array from a string  
`Explode` creates individual records from an array  
`Regex` functions extract or substitute strings  
An `n-gram` is a sequence of n consecutive words

(inclusive, exclusive)

Splitting converts a string to an array  
Use `SELECT EXPLODE(SPLIT(...))` to get one row per array element  
Insert image (splitting and combining) 
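
As a rough illustration of what the two functions do, here is a Python stand-in (not Hive code). The `split` and `explode` helpers and the sample rows are made up for this sketch; `re.split` is used because Hive's `SPLIT` takes a regex pattern.

```python
import re

def split(text, pattern):
    """Stand-in for Hive's SPLIT: string -> list, splitting on a regex."""
    return re.split(pattern, text)

def explode(rows):
    """Stand-in for Hive's EXPLODE: one output record per array element."""
    for array in rows:
        for element in array:
            yield element

# Hypothetical sample column: one comma-separated string per row.
column = ["a,b,c", "d,e"]
arrays = [split(value, ",") for value in column]   # 2 rows, each an array
records = list(explode(arrays))                    # 5 rows, one per element
print(arrays)
print(records)
```

Two input rows become five output rows, which is the row-count change to expect from `EXPLODE`.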

Create a histogram with 10 bins (we could also use a subquery instead)  
`from products
select explode(histogram_numeric(price,10)) as bin`


`select histogram_numeric(id, 5) from customers`  
returns a single row: an array of structs (each struct has `x`, the bin center, and `y`, the bin height)

---

<img src=https://i.imgur.com/0uRU4rR.png width="400" height="340" align="left">


<img src=https://i.imgur.com/n7gQtlS.png width="400" height="340" align="left">






<img src=https://i.imgur.com/Ptd8sxp.png width="400" height="340" align="left">


<img src=https://i.imgur.com/kkSPzam.jpg width="400" height="340" align="left">

---

# Regular Expressions

**Regex classes** - character, white space, word  
Use `()` to capture a group  
`\\` because there are 2 interpreters (Hive's string parser, then the regex engine)

`[]` list of allowed characters  
`^` negates the list when it is the first character inside `[]`  
`\\d` is a digit  
`+` matches one or more of the preceding item  
`.` any character (unless we are inside `[]`)
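
To see why the backslash is doubled, the sketch below imitates the two passes in Python. The `hive_unescape` helper is a simplification invented here (it only models the `\\` escape), and Python's `re` stands in for the Java regex engine that Hive uses; the two agree on the simple syntax shown.

```python
import re

def hive_unescape(literal):
    """Simplified first pass: the string parser turns \\\\ into \\.
    (Assumption: only this one escape is modeled.)"""
    return literal.replace("\\\\", "\\")

# What is typed inside the quotes of the query: \\d+
typed = r"\\d+"
pattern = hive_unescape(typed)      # what the regex engine receives: \d+

digits = re.findall(pattern, "order 66 shipped in 3 days")
not_digits = re.findall(r"[^\d ]+", "a1 b22")   # ^ inside [] negates
any_char = re.findall(r"a.c", "abc a-c ac")     # . matches any one character
print(pattern, digits, not_digits, any_char)
```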


**Hive's regular expression functions**  
`REGEXP` for comparison  
`REGEXP_EXTRACT` to return the string matching a pattern  
`REGEXP_REPLACE` to replace text 
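
The three functions map roughly onto Python's `re` module. The helpers below are stand-ins written for these notes, not Hive's implementation, and the sample string is made up. What Hive returns when nothing matches is not modeled; returning `None` is a choice of this sketch.

```python
import re

def regexp(text, pattern):
    """Like `text REGEXP pattern`: True if the pattern matches anywhere."""
    return re.search(pattern, text) is not None

def regexp_extract(text, pattern, index):
    """Like REGEXP_EXTRACT: return capture group `index` (0 = whole match)."""
    match = re.search(pattern, text)
    return match.group(index) if match else None

def regexp_replace(text, pattern, replacement):
    """Like REGEXP_REPLACE: replace every match of the pattern."""
    return re.sub(pattern, replacement, text)

line = "prod_id=1274673 color=red"   # hypothetical sample value
print(regexp(line, r"color=red"))
print(regexp_extract(line, r"prod_id=(\d+)", 1))
print(regexp_replace("555-1234", r"\d", "#"))
```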



<img src=https://i.imgur.com/D31V71F.png width="400" height="340" align="left">

<img src=https://i.imgur.com/7YZz4nh.png width="400" height="340" align="left">


---

### CSV SerDe (when too complex for SerDe)

**Hive built-in SerDes** - specify when creating the table, in **row format**  
`LazySimpleSerDe` (default)  
`RegexSerDe` - use for semi-structured data, especially log files  
`OpenCSVSerde`  
`JsonSerDe`

---

### Sentiment Analysis

1 - Tokenize the text with `SENTENCES(input)`  
Outer array - each sentence  
Inner array - each word in each sentence

2 - Find n-grams with `NGRAMS(array, number of words per n-gram, number of top results to return)`  
Output is an array of structs with 2 attributes - ngram, estfrequency  
Can also use `CONTEXT_NGRAMS` to find the words that most often appear in a given context (a chosen combination of words)
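
The two steps can be imitated in plain Python to see the shapes involved. This is a rough stand-in, not Hive's tokenizer: the splitting rules are simplified, the sample text is made up, and the counts are exact where Hive reports an estimated frequency.

```python
import re
from collections import Counter

def sentences(text):
    """Rough stand-in for SENTENCES: outer list = sentences, inner = words."""
    parts = re.split(r"[.!?]+", text)
    tokenized = [re.findall(r"[A-Za-z0-9']+", part) for part in parts]
    return [words for words in tokenized if words]

def ngrams(tokenized, n, k):
    """Rough stand-in for NGRAMS: the k most frequent n-word sequences."""
    counts = Counter()
    for words in tokenized:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common(k)

text = "It costs ten times more. Why is it ten times more? Too pricey!"
tokens = sentences(text.lower())
top = ngrams(tokens, 3, 1)
print(tokens)
print(top)
```

N-grams are counted within a sentence only, so a sequence never spans a sentence boundary.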

# Extending Hive (JSON, Custom Scripts, and UDFs)

### JSON

Use the JSON SerDe (`JsonSerDe`) when each line of the file is a JSON object 

Supports arrays, maps, and nested structures  
JSON has to be loaded through this special SerDe

---

### json file with 3 "topics"

Create a table and load in as 1 field  
`create table raw
(json string)
row format delimited;`

For each target table:  
`insert into users
select get_json_object(json, ...)
from raw;` 

Could also use **Hadoop Streaming** with Python scripts to extract users, reviews, businesses.  
This would be 3 separate MapReduce jobs.
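
The routing idea can be sketched in plain Python (a single process, not an actual Hadoop Streaming job). The `type` field used to tell the three kinds of record apart is an assumption, as are the sample records; the notes do not say how the records are distinguished.

```python
import json

def route(lines, key="type"):
    """Group parsed JSON records by the value of `key` (hypothetical field)."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        record = json.loads(line)
        groups.setdefault(record.get(key), []).append(record)
    return groups

# Made-up sample: one JSON object per line, three kinds of record.
raw = [
    '{"type": "user", "name": "amy"}',
    '{"type": "review", "stars": 4}',
    '{"type": "business", "city": "Minneapolis"}',
    '{"type": "review", "stars": 5}',
]
groups = route(raw)
print({kind: len(records) for kind, records in groups.items()})
```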

---

### Querying json fields (non json tables)

#### Dictionary example 
Can use the "value" as a list of strings and use `get_json_object` to parse a field.  
This is different from a complex field (array, map, struct).  
`input` should be a string in JSON format. 


`$` : root object  
`[ ]`: subscript operator for array  
`.` : child operator

`get_json_object(input, '$.parent.child[index]')`
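
To make the path syntax concrete, here is a small Python evaluator for just the three operators listed above. It is a sketch, not Hive's implementation: it returns Python objects rather than strings, and returning `None` for an unresolved path is a choice made here. The sample document is made up.

```python
import json
import re

def get_json_object(text, path):
    """Evaluate a small subset of JSONPath: $, .child and [index]."""
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(text)
    except ValueError:
        return None
    # Each step is either .name or [number].
    for name, index in re.findall(r"\.([^.\[\]]+)|\[(\d+)\]", path[1:]):
        try:
            node = node[name] if name else node[int(index)]
        except (KeyError, IndexError, TypeError):
            return None
    return node

doc = '{"parent": {"child": ["a", "b", "c"]}}'
print(get_json_object(doc, "$.parent.child[1]"))
print(get_json_object(doc, "$.parent.missing"))
```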

---

### Use external script to transform data

`*` = fields to transform  
`hive> ADD FILE myscript.py;
hive> SELECT TRANSFORM(*) USING 'myscript.py' FROM employees;`

**Example with only 2 fields**  
`hive> SELECT TRANSFORM(product_name, price)
USING 'tax_calculator.py'
AS (item_name STRING, tax INT)
FROM products;`

---

### User defined functions

1. Standard UDF - normal function, single row input, single row output
2. User-defined aggregate functions (UDAF) - many rows in, one value out (sum, max, etc.)
3. User-defined table functions (UDTF) - one row in, many rows out (explode, etc.)
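
The difference between the three kinds is the row shape going in and out. The Python analogies below only illustrate that shape; the function names and sample values are invented, and real Hive UDFs are written against Hive's Java API.

```python
def udf_upper(value):
    """UDF analogue: one input row -> one output value."""
    return value.upper()

def udaf_max(values):
    """UDAF analogue: many input rows -> one aggregated value."""
    return max(values)

def udtf_explode(array):
    """UDTF analogue: one input row -> zero or more output rows."""
    for element in array:
        yield element

names = ["kai", "pedro"]                      # hypothetical column values
print([udf_upper(name) for name in names])    # still 2 rows
print(udaf_max([3, 9, 4]))                    # 3 rows collapse to 1
print(list(udtf_explode(["x", "y", "z"])))    # 1 row expands to 3
```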

**To use a Java UDF**  
1. Copy the function's jar into HDFS  
`hadoop fs -put url-decode-udf.jar /myscripts/` 

2. Register the function (the jar path points at the HDFS copy)  
`CREATE FUNCTION url_decode
AS 'com.example.hive.udf.URLDecode'
USING JAR 'hdfs:///myscripts/url-decode-udf.jar';`

3. Use the function  
`select url_decode(your_url);`

# Lab 10 (Sept 30) - Text Analytics with Hive

[Lab](https://pages.github.umn.edu/deliu/bigdata19/05-Hive3/lab10-text.html)  
[Solution](https://pages.github.umn.edu/deliu/bigdata19/05-Hive3/lab10-text-solution.html)

---

### RUNNING NGRAMS QUERY 

**Trigrams, Top 5**  
`SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(message)), 3, 5))
AS trigrams
FROM ratings
WHERE prod_id = 1274673;`

---


### Investigating after query

**Pattern identified from trigrams**  
`SELECT message
FROM ratings
WHERE prod_id = 1274673
AND message LIKE '%ten times more%'
LIMIT 3;`

**"Red" was a word that popped up. Why?**  
**All messages that have "red"**  
`SELECT message
FROM ratings
WHERE prod_id = 1274673
AND message LIKE '%red%'
LIMIT 3;`

---

**There must be a pricing error.**  
**Check in the products table**  
`SELECT *
FROM products
WHERE prod_id = 1274673;`

**Compare the red product vs the same product in different colors**  
`SELECT *
FROM products
WHERE name LIKE '%16 GB USB Flash Drive%'
AND brand='Orion';`

# Lab 11 (Oct 2) - Log Data


[Lab](https://pages.github.umn.edu/deliu/bigdata19/05-Hive3/lab11-weblog.html)  
[Solution](https://pages.github.umn.edu/deliu/bigdata19/05-Hive3/lab11-weblog-solution.html)

---

Directory for exercise  
`cd $ADIR/exercises/transform`

Look at how the table is defined first - this shows the column types and the SerDe  
`create_web_logs.hql` is a script that already exists  
`beeline -u jdbc:hive2:// -f create_web_logs.hql`

### Step 1 - Local --> HDFS

Make directory in hdfs  
`hadoop fs -mkdir /dualcore/web_logs`

Take file from exercises folder into hdfs (dualcore folder)  
`hadoop fs -put $ADIR/data/access.log /dualcore/web_logs`

Run this query to verify that the data loaded correctly  
`SELECT term, COUNT(term) AS num FROM
     (SELECT LOWER(REGEXP_EXTRACT(request,
        '/search\\?phrase=(\\S+)', 1)) AS term
        FROM web_logs
        WHERE request REGEXP '/search\\?phrase=') terms
   GROUP BY term
   ORDER BY num DESC
   LIMIT 3;`
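
The logic of that query can be traced in Python on a few made-up request strings (these are not lines from `access.log`). The pattern is the same one the regex engine sees after Hive reduces `\\` to `\`; the `top_terms` helper is invented for this sketch.

```python
import re
from collections import Counter

# Same pattern as the query, with single backslashes.
SEARCH = re.compile(r"/search\?phrase=(\S+)")

def top_terms(requests, limit):
    """Count lower-cased search phrases, most frequent first."""
    counts = Counter()
    for request in requests:
        match = SEARCH.search(request)
        if match:                              # mirrors the WHERE ... REGEXP
            counts[match.group(1).lower()] += 1
    return counts.most_common(limit)

# Hypothetical request values.
requests = [
    "GET /search?phrase=Tablet HTTP/1.1",
    "GET /search?phrase=tablet HTTP/1.1",
    "GET /search?phrase=ram HTTP/1.1",
    "GET /cart/checkout/step1-viewcart HTTP/1.1",
]
result = top_terms(requests, 3)
print(result)
```

`\S+` stops at the first whitespace, so the capture ends before the trailing protocol part of the request.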


---

### Step 2 - Identify steps in process

There are 4 steps. We want to count how many people made it to each step.  

 `SELECT COUNT(*), request
 FROM web_logs
 WHERE request REGEXP '/cart/checkout/step\\d.+'
 GROUP BY request;`

---

Who is making it to which step in the process?  
Create a table to show this.

` CREATE TABLE checkout_sessions AS
 SELECT cookie, ip_address, COUNT(request) AS steps_completed
 FROM web_logs
 WHERE request REGEXP '/cart/checkout/step\\d.+'
 GROUP BY cookie, ip_address;`
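
A Python sketch of the same grouping, on made-up rows, shows what ends up in `checkout_sessions`. The step names after `step<digit>` and the cookie and IP values are hypothetical.

```python
import re
from collections import Counter

STEP = re.compile(r"/cart/checkout/step\d.+")

def checkout_sessions(rows):
    """rows: (cookie, ip_address, request) tuples.
    Returns {(cookie, ip_address): steps_completed}."""
    steps = Counter()
    for cookie, ip_address, request in rows:
        if STEP.search(request):
            steps[(cookie, ip_address)] += 1
    return dict(steps)

# Hypothetical sample rows.
rows = [
    ("c1", "10.0.0.1", "GET /cart/checkout/step1-viewcart HTTP/1.1"),
    ("c1", "10.0.0.1", "GET /cart/checkout/step2-shipping HTTP/1.1"),
    ("c2", "10.0.0.2", "GET /cart/checkout/step1-viewcart HTTP/1.1"),
    ("c2", "10.0.0.2", "GET /search?phrase=ram HTTP/1.1"),
]
sessions = checkout_sessions(rows)
reached = Counter(sessions.values())   # number of sessions per step count
print(sessions)
print(reached)
```

Like `COUNT(request)` in the query, this counts matching requests, so a repeated request for the same step is counted twice.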

# Lab 12 (Oct 2) - JSON

[Lab](https://pages.github.umn.edu/deliu/bigdata19/05-Hive3/lab12-extension.html)  
[Solution](https://pages.github.umn.edu/deliu/bigdata19/05-Hive3/lab12-extension-solution.html)

### Create an external table

This is an external table, so we need to specify the `LOCATION` where its data lives  
`CREATE EXTERNAL TABLE json_nested_test (
country string,
languages array<string>,
religions map<string,array<int>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION '...';`  
In Hive 4.0, we can use `stored as jsonfile` instead

---

## Part 1 - get_json_object

Remember: `$.parent.child`  

**Price of bicycle**  
In this example, the word 'price' is a key. The actual price is the associated value

`select get_json_object(
'{"store": 
    {"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
    "bicycle":{"price":19.95,"color":"red"}}, 
"email":"amy@only_for_json_udf_test.net", 
"owner":"amy"}',
'$.store.bicycle.price');`

---

**Name of the first fruit**  
In this example, "fruit" is a list.  
"type" is a key that has the name of the fruit as its value

`select get_json_object(
'{"store": 
    {"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
    "bicycle":{"price":19.95,"color":"red"}}, 
"email":"amy@only_for_json_udf_test.net", 
"owner":"amy"}',
'$.store.fruit[0].type');`
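
As a quick cross-check of which values the two paths point at, the same document can be walked with Python's `json` module. This only mirrors the path navigation; it does not run Hive.

```python
import json

doc = json.loads(
    '{"store": {"fruit": [{"weight": 8, "type": "apple"},'
    ' {"weight": 9, "type": "pear"}],'
    ' "bicycle": {"price": 19.95, "color": "red"}},'
    ' "email": "amy@only_for_json_udf_test.net", "owner": "amy"}'
)

price = doc["store"]["bicycle"]["price"]        # $.store.bicycle.price
first_fruit = doc["store"]["fruit"][0]["type"]  # $.store.fruit[0].type
print(price, first_fruit)
```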


---

## Part 2 - JSON tables

**Download the JSON serde jar file to your home directory**

`cd ~`  
`wget http://idsdl.csom.umn.edu/c/share/msba6330/json-serde-1.3.8-jar-with-dependencies.jar` to download jar  
`ADD JAR /home/cloudera/json-serde-1.3.8-jar-with-dependencies.jar;` to add a jar

**Extract a sample row**  
`cd ~/training_materials/data/chatlogs`  
`head -1 2014-03-15.json`

**Create a hive managed table**  
`create table conversations (
	conversationId INT,
	accountNum INT,
	agentName STRING,
	category STRING,
	messages ARRAY<STRUCT<sender:STRING, time:TIMESTAMP, text:STRING>>
) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';`

**Load local file into managed table**  
`load data local inpath '/home/cloudera/training_materials/data/chatlogs/2014-03-15.json' 
into table conversations;`

**Check**  
`select count(*) from conversations;` number of rows  
`select * from conversations limit 2;` first 2 rows  
`select conversationid, size(messages) from conversations limit 10;` first rows with additional info (size = size of array)

---

## Part 3 - Transforming

**Download a python script**  
`wget http://idsdl.csom.umn.edu/c/share/msba6330/greeting.py`

**Upload to hdfs**  
`hadoop fs -put greeting.py greeting.py`

**Create a table in Hue**  
`create table employees (name string, email string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';`

**Insert values into table**  
`INSERT INTO table employees 
values 
("Antoine","antoine@example.fr"),
("Kai","kai@example.de"),
("Pedro","pedro@example.mx"),
("Joel","joel@example.us");`

**Add and run the transformation file**  
`add file hdfs:///user/cloudera/greeting.py;`

`SELECT TRANSFORM(name,  email)
    USING 'greeting.py' AS greeting
    FROM employees;`
    
---

**This is what the Python script looked like (Python 2 syntax)**  
`import sys
import re
greetings  = {'de':'Hallo','fr':'Bonjour','mx':'Hola'}`

`for line in sys.stdin:
    name, email = line.strip().split('\t')
    match = re.search(r'\.(\w+)', email)
    if match and greetings.has_key(match.group(1)):
        print "{0}\t{1}".format(greetings[match.group(1)],name)
    else:
        print "Hello\t{0}".format(name)`
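
For reference, here is a Python 3 sketch of the same logic, split into functions so it can be tried without Hive. It is not the lab's file: `greet` and `transform` are names invented here, `in` replaces `has_key`, and the sample rows reuse two of the employees inserted earlier.

```python
import re

GREETINGS = {"de": "Hallo", "fr": "Bonjour", "mx": "Hola"}

def greet(line):
    """Turn one tab-separated 'name<TAB>email' row into 'greeting<TAB>name'."""
    name, email = line.strip().split("\t")
    match = re.search(r"\.(\w+)", email)        # same pattern as the original
    if match and match.group(1) in GREETINGS:   # replaces dict.has_key
        return "{0}\t{1}".format(GREETINGS[match.group(1)], name)
    return "Hello\t{0}".format(name)

def transform(stream):
    """Apply greet to every input row, as the original loop does."""
    for line in stream:
        yield greet(line)

# Rows arrive from Hive as tab-separated text, one row per line.
sample = ["Antoine\tantoine@example.fr\n", "Joel\tjoel@example.us\n"]
output = list(transform(sample))
print("\n".join(output))
# Run as a TRANSFORM script, the last lines would instead be:
#     import sys
#     for row in transform(sys.stdin): print(row)
```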
