Data Processing with Pig and Hive
====


Consider the following SQL query:

```sql
SELECT Username, Address
FROM UserTable
WHERE Age > 30;
```

In Pig, the same query is written as a sequence of steps:
* Load UserTable
* For each row, check whether Age > 30
* Return the results
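
As a rough sketch, the three steps above might look as follows in Pig Latin. The file name `UserTable.txt`, the comma delimiter, and the column types are assumptions made for illustration.

```pig
-- Load the table (hypothetical file name, delimiter, and schema)
users = LOAD 'UserTable.txt'
        USING PigStorage(',')
        AS (Username:chararray, Address:chararray, Age:int);

-- Keep only the rows where Age > 30
older = FILTER users BY Age > 30;

-- Return the two requested columns
result = FOREACH older GENERATE Username, Address;
DUMP result;
```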

Pig Latin:
* Interacts with HDFS
* Manipulates data that is stored on HDFS

## Running Pig

Running Pig in local mode: `pig -x local`  
Running in distributed mode: `pig -x mapreduce`

Pig can be used in three ways:

* Using the interactive shell called *grunt*
* Executing a Pig script, e.g. `pig myscript.pig`
* Embedding your Pig query in a Java program



Pig scripts:
* Single-line comments start with `--`; multi-line comments are enclosed in `/* ... */`

Embedded in Java

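```java
import java.io.IOException;
import org.apache.pig.PigServer;

public class idmapreduce {
    public static void main(String[] args) throws IOException {
        // "mapreduce" runs on the cluster; use "local" for local mode
        PigServer pigServer = new PigServer("mapreduce");
        // Register Pig Latin statements with pigServer.registerQuery(...)
    }
}
```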

Compile:
```
javac -cp $PIG_HOME/pig-0.11.0.jar idmapreduce.java
```

Running:

```
java -cp ".." idmapreduce
```

### Grunt shell commands

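* `grunt> exec myscript.pig` runs the script in batch mode, in a separate context: aliases defined in the script are not visible in the shell afterwards
* `grunt> run myscript.pig` runs the script in the current shell context: its aliases remain available in the shell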

## Pig Latin

A high-level data-flow language.

Overview:
* Read/write from/to HDFS
* Data types
* Diagnostic operators (e.g. `DUMP`, `DESCRIBE`)
* Expressions and functions
* Relational operations (UNION, JOIN, FILTER, etc.)
* No commands for inserting, deleting, or updating records

### Pig Latin Workflow

* Load data into an alias  

```pig
-- alias = LOAD filename AS (..);
alias = LOAD 'input.txt' AS (attr1, attr2);

-- comma-separated values
-- mydata = LOAD filename USING PigStorage(',') ..;
mydata = LOAD 'input.txt'
         USING PigStorage(',')
         AS (attr1, attr2, ..);
```

* Manipulate the alias using relational operations

```pig
-- e.g. keep only the first attribute of every row
projected = FOREACH alias GENERATE attr1;
```

If you don't name the attributes, they can be referenced by position as `$0, $1, $2, ..`
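
A minimal sketch of positional references, assuming a comma-separated `input.txt` with at least two fields per line (the alias names are made up for illustration):

```pig
-- No AS clause: the fields are unnamed and have type bytearray
raw = LOAD 'input.txt' USING PigStorage(',');

-- $0 is the first field, $1 the second
firsttwo = FOREACH raw GENERATE $0, $1;
DUMP firsttwo;
```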

#### Example workflow

```pig
data = LOAD '/user/hduser/wikipedia/wiki_edits/txt' 
       USING PigStorage(',')
       AS (rev, aid, rid, article, ts, uname, uid);
       
grp = GROUP data BY article; -- output has two columns: group,data

counts = FOREACH grp GENERATE group, COUNT(data);

results = LIMIT counts 4;
DUMP results;

STORE counts INTO 'output.txt' USING PigStorage(',');
```

### Atomic Data Types

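| Data type | Description |
|:--:|:--|
| int | Signed 32-bit integer |
| long | Signed 64-bit integer |
| float | 32-bit floating point |
| double | 64-bit floating point |
| chararray | Character string (UTF-8) |
| bytearray | Byte array (blob) |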

```pig
data = LOAD 'input.txt' 
       USING PigStorage(',')
       AS (rev:chararray, aid:int, ..);
```

If you don't specify the type, the attributes will be of type *bytearray*, the most generic data type.
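
To check which types Pig assigned, `DESCRIBE` prints the schema of an alias. The sketch below assumes a comma-separated `input.txt` with two fields; the alias names are illustrative.

```pig
typed = LOAD 'input.txt'
        USING PigStorage(',')
        AS (rev:chararray, aid:int);
DESCRIBE typed;     -- reports rev as chararray and aid as int

untyped = LOAD 'input.txt'
          USING PigStorage(',')
          AS (rev, aid);
DESCRIBE untyped;   -- reports both fields as bytearray

-- A bytearray field can be cast explicitly when a specific type is needed
casted = FOREACH untyped GENERATE (chararray)rev AS rev, (int)aid AS aid;
DESCRIBE casted;
```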

#### More complex data types

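| Complex data type | Description |
|:--:|:--|
| Tuple | An ordered set of fields, e.g. `(19,2)` |
| Bag | A collection of tuples, e.g. `{(19,2),(18,1)}` |
| Map | A set of key/value pairs, e.g. `[open#apache]` |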

```
$ hadoop fs -cat data.txt
(1,2,3) (4,5,6)
(4,5,3) (3,3,2)
```

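```pig
-- fields are split on tabs by default, so the two tuples on each
-- line of data.txt must be tab-separated
A = LOAD 'data.txt'
    AS (t1:tuple(t1a:int, t1b:int, t1c:int),
        t2:tuple(t2a:int, t2b:int, t2c:int));

DUMP A;
```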
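
Fields inside a tuple are reached with the dot operator, either by name or by position. A short sketch, reusing the `data.txt` layout from this section (the alias `X` is arbitrary):

```pig
A = LOAD 'data.txt'
    AS (t1:tuple(t1a:int, t1b:int, t1c:int),
        t2:tuple(t2a:int, t2b:int, t2c:int));

-- t1.t1a: first field of t1 by name; t2.$0: first field of t2 by position
X = FOREACH A GENERATE t1.t1a, t2.$0;
DUMP X;
```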

Relational Operators in Pig
======


 * FOREACH
 * FILTER 
 * ORDER BY
 * SPLIT 
 * UNION
 * DISTINCT
 * GROUP
 * JOIN

#### FOREACH

```pig
grades = LOAD 'grades.txt'
         USING PigStorage(',')
         AS (name:chararray, hw1:int, hw2:int, hw3:int);
         
hwtotals = FOREACH grades GENERATE name, hw1+hw2+hw3;

DUMP hwtotals;
```
The result will be:

```
(Alex,67)
(John,79)
(Lee,73)
```

Renaming a column:

```pig
titles = FOREACH movies GENERATE $1 AS title;
```

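Using regular expressions to extract fields:

Assume the input is given as
```
Samsung TV (499.99)
iPhone 6s (650.00)
```
and we want to extract the item names, i.e. everything from the beginning of the line up to the opening parenthesis "(". The built-in `REGEX_EXTRACT` function returns the requested capture group:

```pig
-- assumes each input line was loaded into a chararray field named line (hypothetical alias items)
names = FOREACH items GENERATE REGEX_EXTRACT(line, '^(.*) [(].*$', 1) AS name;
```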

#### FILTER


Equivalent to the *WHERE* clause in SQL:
```sql
SELECT columns FROM tablename
WHERE expression;
```
Pig syntax:
```pig
filtered_result = FILTER tablename BY expression;
```

Example:
```pig
best = FILTER hwtotals BY $1 > 80;
```
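
Filter expressions can also combine several conditions. The following sketch assumes the `grades.txt` layout used in the FOREACH example; the threshold of 50 and the name pattern are arbitrary choices for illustration.

```pig
grades = LOAD 'grades.txt'
         USING PigStorage(',')
         AS (name:chararray, hw1:int, hw2:int, hw3:int);

-- Combine comparisons with AND / OR / NOT
passed_all = FILTER grades BY hw1 >= 50 AND hw2 >= 50 AND hw3 >= 50;

-- matches tests a chararray against a regular expression (whole-string match)
a_names = FILTER grades BY name matches 'A.*';

DUMP passed_all;
DUMP a_names;
```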

#### ORDER BY

Pig syntax:
```pig
sorted_result = ORDER tablename BY column1 DESC [, column2 ASC];
```

Example:

```pig
result = ORDER hwtotals BY name ASC;
```
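
Sorting is often combined with `LIMIT` to obtain a top-N list. A small sketch, assuming the same `grades.txt` layout as before and naming the summed column `total` (a name introduced here for readability):

```pig
grades = LOAD 'grades.txt'
         USING PigStorage(',')
         AS (name:chararray, hw1:int, hw2:int, hw3:int);

hwtotals = FOREACH grades GENERATE name, hw1 + hw2 + hw3 AS total;

-- Highest totals first; ties are broken alphabetically by name
ranked = ORDER hwtotals BY total DESC, name ASC;

-- Keep the first three rows of the sorted table
top3 = LIMIT ranked 3;
DUMP top3;
```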

#### SPLIT

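Splits a table into two or more tables based on conditions.

Rough SQL equivalent (one query per output table):
```sql
SELECT * FROM tablename WHERE col1 > s;
SELECT * FROM tablename WHERE col1 <= s;
```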

Pig syntax (`SPLIT` is not assigned to an alias; it defines `t1` and `t2` directly):
```pig
SPLIT tablename
      INTO t1 IF col1 > s,
           t2 IF col1 <= s;
```
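
As an illustration, the homework totals could be split into two tables. This sketch assumes the `grades.txt` layout from the FOREACH section; the cut-off of 70 and the alias names `high` and `low` are made up.

```pig
grades = LOAD 'grades.txt'
         USING PigStorage(',')
         AS (name:chararray, hw1:int, hw2:int, hw3:int);

hwtotals = FOREACH grades GENERATE name, hw1 + hw2 + hw3 AS total;

-- SPLIT defines the aliases high and low; it is not assigned to an alias itself
SPLIT hwtotals INTO high IF total > 70,
                    low  IF total <= 70;

DUMP high;
DUMP low;
```

Rows whose `total` is null satisfy neither condition and therefore end up in neither table.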

#### UNION

Concatenates the rows of two tables. Unlike SQL's `UNION`, duplicates are not removed.
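
A minimal sketch, assuming two comma-separated files with the same layout (the file names `part1.txt` and `part2.txt` are hypothetical):

```pig
a = LOAD 'part1.txt' USING PigStorage(',') AS (name:chararray, score:int);
b = LOAD 'part2.txt' USING PigStorage(',') AS (name:chararray, score:int);

-- All rows of a followed by all rows of b, in no guaranteed order
combined = UNION a, b;

-- Apply DISTINCT afterwards if duplicate rows should be dropped
uniq = DISTINCT combined;
DUMP uniq;
```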

#### DISTINCT

```pig
uniqs = DISTINCT dups;
```
In Pig, `DISTINCT` is applied to a whole table, not to an expression (as is done in SQL).
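
To get the distinct values of a single column, project that column first and then apply `DISTINCT`. A sketch, assuming a comma-separated `data.txt` with a name and a score per line:

```pig
data = LOAD 'data.txt'
       USING PigStorage(',')
       AS (name:chararray, score:int);

-- Project the column of interest, then remove duplicate rows
names = FOREACH data GENERATE name;
uniq_names = DISTINCT names;
DUMP uniq_names;
```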

#### GROUP

Syntax:
```pig
grp = GROUP tablename BY column1;
```

##### Example:

```pig
data = LOAD 'data.txt'
       USING PigStorage(',')
       AS (name:chararray, score:int);

grps = GROUP data BY name;

-- each row of grps has two fields: group (the name) and data (a bag of matching rows)
totalscores = FOREACH grps
              GENERATE group AS name,
                       SUM(data.score) AS total;
```

#### GROUP ALL
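
`GROUP ... ALL` puts every row of a table into a single group, which is the usual way to compute an aggregate over the whole table. A sketch, assuming the same `data.txt` layout as in the GROUP example (the alias and column names are illustrative):

```pig
data = LOAD 'data.txt'
       USING PigStorage(',')
       AS (name:chararray, score:int);

-- One output row: group is the literal 'all', data is a bag with every input row
everything = GROUP data ALL;

summary = FOREACH everything
          GENERATE COUNT(data) AS n,
                   AVG(data.score) AS mean;
DUMP summary;
```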

#### JOIN

```pig
alias = JOIN table1 BY col1, table2 BY col2;
```

##### Example:

```pig
grades = LOAD 'grades.txt' USING PigStorage(',') 
         AS (name:chararray, hw1:int, hw2:int, hw3:int);

majors = LOAD 'majors.txt' USING PigStorage(',') 
         AS (name:chararray, dept:chararray);

transcript = JOIN majors BY name, grades BY name;
```
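
After a join, columns that exist in both inputs have to be disambiguated with the `alias::column` prefix. The sketch below builds on the same two files; the aliases `report` and `all_students` and the column name `total` are made up for illustration.

```pig
grades = LOAD 'grades.txt' USING PigStorage(',')
         AS (name:chararray, hw1:int, hw2:int, hw3:int);

majors = LOAD 'majors.txt' USING PigStorage(',')
         AS (name:chararray, dept:chararray);

transcript = JOIN majors BY name, grades BY name;

-- name appears in both inputs, so it needs a prefix; dept and hw1..hw3 do not
report = FOREACH transcript
         GENERATE majors::name AS name, dept, hw1 + hw2 + hw3 AS total;
DUMP report;

-- Keep students that have a major but no grades (their grade fields are null)
all_students = JOIN majors BY name LEFT OUTER, grades BY name;
DUMP all_students;
```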