# Pig
![Pig](https://pig.apache.org/images/pig-logo.gif)

- https://pig.apache.org

## Setup

- Version 0.17.0

In [1]:
%%bash

# Download package
cd /opt/pkgs
wget -q -c https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz

# Unpack the archive and create a version-independent symlink
tar -zxf pig-0.17.0.tar.gz -C /opt
ln -s /opt/pig-0.17.0 /opt/pig

# Append the Pig variables to envvars.sh (the backslashes keep the variable references unexpanded in the file)
cat >> /opt/envvars.sh << EOF
# Pig
export PIG_HOME=/opt/pig
export PATH=\${PATH}:\${PIG_HOME}/bin

EOF

cat /opt/envvars.sh

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export PDSH_RCMD_TYPE=ssh

export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_YARN_HOME=${HADOOP_HOME}

export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin     

# Flume
export FLUME_HOME=/opt/flume
export PATH=${PATH}:${FLUME_HOME}/bin

# Sqoop
export SQOOP_HOME=/opt/sqoop
export PATH=${PATH}:${SQOOP_HOME}/bin

# Pig
export PIG_HOME=/opt/pig
export PATH=${PATH}:${PIG_HOME}/bin



In [2]:
# Load environment variables
%load_ext dotenv
%dotenv -o /opt/envvars.sh
%env

{'HOSTNAME': 'hadoop',
 'OLDPWD': '/',
 'PWD': '/opt',
 'HOME': '/home/hadoop',
 'SHELL': '/bin/bash',
 'SHLVL': '1',
 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/hadoop/bin:/opt/hadoop/sbin:/opt/flume/bin:/opt/sqoop/bin:/opt/pig/bin',
 '_': '/usr/bin/nohup',
 'LANGUAGE': 'en.UTF-8',
 'LANG': 'en.UTF-8',
 'JPY_PARENT_PID': '1566',
 'TERM': 'xterm-color',
 'CLICOLOR': '1',
 'PAGER': 'cat',
 'GIT_PAGER': 'cat',
 'MPLBACKEND': 'module://ipykernel.pylab.backend_inline',
 'JAVA_HOME': '/usr/lib/jvm/java-1.8.0-openjdk-amd64',
 'PDSH_RCMD_TYPE': 'ssh',
 'HADOOP_HOME': '/opt/hadoop',
 'HADOOP_COMMON_HOME': '/opt/hadoop',
 'HADOOP_CONF_DIR': '/opt/hadoop/etc/hadoop',
 'HADOOP_HDFS_HOME': '/opt/hadoop',
 'HADOOP_MAPRED_HOME': '/opt/hadoop',
 'HADOOP_YARN_HOME': '/opt/hadoop',
 'FLUME_HOME': '/opt/flume',
 'SQOOP_HOME': '/opt/sqoop',
 'PIG_HOME': '/opt/pig'}
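
As a quick sanity check, the loaded environment can also be inspected from Python. The sketch below is a minimal example, not part of the original setup: the helper name `pig_bin_on_path` is hypothetical, and it only checks that `$PIG_HOME/bin` is listed in `PATH`, not that the `pig` executable actually exists there.

```python
import os


def pig_bin_on_path(env):
    """Return True if $PIG_HOME/bin is one of the PATH entries in `env`."""
    pig_home = env.get("PIG_HOME")
    if not pig_home:
        return False
    expected = os.path.join(pig_home, "bin")
    return expected in env.get("PATH", "").split(os.pathsep)


# Example with values taken from the %env output above (PATH shortened).
sample_env = {"PIG_HOME": "/opt/pig", "PATH": "/usr/bin:/bin:/opt/pig/bin"}
print(pig_bin_on_path(sample_env))
```

In the notebook, passing `os.environ` instead of `sample_env` would check the variables that `%dotenv` just loaded.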

## Example

In [3]:
%%bash

cd /opt/datasets
wget -q -c https://tinyurl.com/y5roz8kz -O stations.csv
hdfs dfs -mkdir stations
hdfs dfs -put stations.csv stations
hdfs dfs -head stations/stations.csv

2,San Jose Diridon Caltrain Station,37.329732,-121.901782,27,San Jose,8/6/2013
3,San Jose Civic Center,37.330698,-121.888979,15,San Jose,8/5/2013
4,Santa Clara at Almaden,37.333988,-121.894902,11,San Jose,8/6/2013
5,Adobe on Almaden,37.331415,-121.8932,19,San Jose,8/5/2013
6,San Pedro Square,37.336721,-121.894074,15,San Jose,8/7/2013
7,Paseo de San Antonio,37.333798,-121.886943,15,San Jose,8/7/2013
8,San Salvador at 1st,37.330165,-121.885831,15,San Jose,8/5/2013
9,Japantown,37.348742,-121.894715,15,San Jose,8/5/2013
10,San Jose City Hall,37.337391,-121.886995,15,San Jose,8/6/2013
11,MLK Library,37.335885,-121.88566,19,San Jose,8/6/2013
12,SJSU 4th at San Carlos,37.332808,-121.883891,19,San Jose,8/7/2013
13,St James Park,37.339301,-121.889937,15,San Jose,8/6/2013
14,Arena Green / SAP Center,37.332692,-121.900084,19,San Jose,8/5/2013
16,SJSU - San Salvador at 9th,37.333955,-121.877349,15,San Jose,8/7/2013
21,Franklin at Maple,37.481758,-122.226904,15,Redwood City,8/12/2013
22,Redwood Cit

2021-01-29 14:31:59,735 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-01-29 14:32:05,611 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false


## Grunt shell

1. Start the Grunt shell in a terminal (`-x local` runs Pig in local mode, against the local file system):
```
source /opt/envvars.sh
cd /opt/datasets
pig -x local 2> /dev/null
```
2. Load the CSV file with an explicit schema:
```
stations = LOAD 'stations.csv' USING PigStorage(',') AS 
(station_id:int, name:chararray, lat:float, long:float, 
 dockcount:int, landmark:chararray, installation:chararray);
```
3. Keep only the station ID and name columns:
```
station_ids_names = FOREACH stations GENERATE station_id, name;
```
4. Sort the result by station name:
```
ordered = ORDER station_ids_names BY name;
```

5. Show the schema of the `stations` relation:
```
DESCRIBE stations;
```
6. Show how each step transforms a small sample of the data:
```
ILLUSTRATE ordered;
```

7. Run the pipeline and print the result to the console:
```
DUMP ordered;
```

8. Exit the Grunt shell:
```
QUIT;
```
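
To make the data flow of the steps above concrete, here is a rough Python equivalent of the `LOAD`, `FOREACH ... GENERATE`, and `ORDER` statements. It is only an illustrative sketch: the function names are hypothetical, the sample rows are copied from the `stations.csv` output shown earlier, and plain Python string sorting stands in for Pig's `chararray` ordering (the two agree for this ASCII sample).

```python
# A few rows copied from the stations.csv output shown earlier.
SAMPLE = """\
2,San Jose Diridon Caltrain Station,37.329732,-121.901782,27,San Jose,8/6/2013
5,Adobe on Almaden,37.331415,-121.8932,19,San Jose,8/5/2013
9,Japantown,37.348742,-121.894715,15,San Jose,8/5/2013
11,MLK Library,37.335885,-121.88566,19,San Jose,8/6/2013
"""


def load_stations(text):
    """Mimic LOAD ... USING PigStorage(',') AS (station_id:int, name:chararray, ...)."""
    stations = []
    for line in text.splitlines():
        fields = line.split(",")  # plain split on the delimiter, no quote handling
        stations.append({
            "station_id": int(fields[0]),
            "name": fields[1],
            "lat": float(fields[2]),
            "long": float(fields[3]),
            "dockcount": int(fields[4]),
            "landmark": fields[5],
            "installation": fields[6],
        })
    return stations


def project_and_order(stations):
    """Mimic FOREACH ... GENERATE station_id, name followed by ORDER ... BY name."""
    projected = [(s["station_id"], s["name"]) for s in stations]
    return sorted(projected, key=lambda row: row[1])


for station_id, name in project_and_order(load_stations(SAMPLE)):
    print(station_id, name, sep="\t")
```

Unlike this eager Python version, Pig only builds a logical plan for these statements and runs it when a `DUMP` or `STORE` is issued.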


## Batch execution

In [4]:
%%bash

cd /opt/src

cat > list_stations.pig << EOF
stations = LOAD 'stations' USING PigStorage(',') AS 
(station_id:int, name:chararray, lat:float, long:float, 
 dockcount:int, landmark:chararray, installation:chararray);
station_ids_names = FOREACH stations GENERATE station_id, name;
ordered = ORDER station_ids_names BY name;
STORE ordered INTO 'ordered';
EOF

# Run the script on the cluster in MapReduce mode
pig -x mapreduce -f list_stations.pig

2021-01-29 14:45:02,743 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
2021-01-29 14:45:02,747 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
2021-01-29 14:45:02,748 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2021-01-29 14:45:02,908 [main] INFO  org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2021-01-29 14:45:02,908 [main] INFO  org.apache.pig.Main - Logging error messages to: /opt/src/pig_1611931502880.log
2021-01-29 14:45:03,822 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2021-01-29 14:45:04,022 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2021-01-29 14:45:04,024 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://hadoop:9000
2021-01-29 14:45:05,621 [main] INFO  org.apache.pig.PigServer - Pig Scri

In [5]:
%%bash

hdfs dfs -cat ordered/*

62	2nd at Folsom
64	2nd at South Park
61	2nd at Townsend
57	5th at Howard
5	Adobe on Almaden
14	Arena Green / SAP Center
56	Beale at Market
82	Broadway St at Battery St
36	California Ave Caltrain Station
32	Castro Street and El Camino Real
72	Civic Center BART (7th at Market)
41	Clay at Battery
45	Commercial at Montgomery
37	Cowper at University
42	Davis at Jackson
54	Embarcadero at Bryant
51	Embarcadero at Folsom
60	Embarcadero at Sansome
48	Embarcadero at Vallejo
30	Evelyn Park and Ride
21	Franklin at Maple
59	Golden Gate at Polk
73	Grant Avenue at Columbus Avenue
50	Harry Bridges Plaza (Ferry Building)
63	Howard at 2nd
9	Japantown
11	MLK Library
67	Market at 10th
76	Market at 4th
77	Market at Sansome
75	Mechanics Plaza (Market at Battery)
83	Mezes Park
28	Mountain View Caltrain Station
27	Mountain View City Hall
34	Palo Alto Caltrain Station
38	Park at Olive
7	Paseo de San Antonio
47	Post at Kearney
39	Powell Street BART
71	Powell at Post (Union Square)
22	Redwood City Caltrain Stat

2021-01-29 14:48:50,236 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false


## WordCount using Pig

In [6]:
%%bash

cd /opt/datasets
wget -q -c https://tinyurl.com/y68jxy7f -O stop-word-list.csv
hdfs dfs -mkdir stopwords
hdfs dfs -put stop-word-list.csv stopwords
hdfs dfs -cat stopwords/stop-word-list.csv

a, able, about, across, after, all, almost, also, am, among, an, and, any, are, as, at, be, because, been, but, by, can, cannot, could, dear, did, do, does, either, else, ever, every, for, from, get, got, had, has, have, he, her, hers, him, his, how, however, i, if, in, into, is, it, its, just, least, let, like, likely, may, me, might, most, must, my, neither, no, nor, not, of, off, often, on, only, or, other, our, own, rather, said, say, says, she, should, since, so, some, than, that, the, their, them, then, there, these, they, this, tis, to, too, twas, us, wants, was, we, were, what, when, where, which, while, who, whom, why, will, with, would, yet, you, your

2021-01-29 14:49:33,526 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-01-29 14:49:39,133 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false


### Run in Grunt

```
source /opt/envvars.sh
cd /opt/datasets
pig -x mapreduce 2> /dev/null
```

```
-- List HDFS content
fs -ls
fs -ls shakespeare

-- Job name to appear in YARN
SET job.name 'Word Count in Pig';

-- Load shakespeare dataset
shakespeare = LOAD 'shakespeare' AS (lineoftext:chararray);

-- Load stopwords
stopwords = LOAD 'stopwords' USING PigStorage() AS (stopword:chararray);

-- Create bag of words
words = FOREACH shakespeare GENERATE
        FLATTEN(TOKENIZE(REPLACE(LOWER(TRIM(lineoftext)),
        '[\\p{Punct},\\p{Cntrl}]',''))) AS word;

-- Remove empty words
realwords = FILTER words BY SIZE(word) > 0;

-- Create bag of stop words
flattened_stopwords = FOREACH stopwords GENERATE
       FLATTEN(TOKENIZE(stopword)) AS stopword;

-- Right outer join: keep every word; the stop word side is NULL when the word is not a stop word
right_joined = JOIN flattened_stopwords
               BY stopword RIGHT OUTER,
               realwords BY word;

-- Remove stop words (keep only rows without a matching stop word)
meaningful_words = FILTER right_joined BY
          (flattened_stopwords::stopword IS NULL);

-- Retrieve remaining words
shakespeare_real_words = FOREACH meaningful_words
          GENERATE realwords::word AS word;

-- Group words
grouped = GROUP shakespeare_real_words BY word;

-- Count grouped words
counted = FOREACH grouped GENERATE group AS word,
          COUNT(shakespeare_real_words) AS wordcount;

-- Sort by word count in descending order
ordered = ORDER counted BY wordcount DESC;

-- Select the first 30 words
top30 = LIMIT ordered 30;

-- Store output
STORE top30 INTO 'shakespeare_top30';

-- Show output from HDFS
fs -cat shakespeare_top30/*

-- Exit
QUIT;
```
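
The script above removes stop words with a right outer join followed by an `IS NULL` filter, which amounts to an anti-join. The Python sketch below mirrors the same pipeline on in-memory data to show the logic. It is an approximation under stated assumptions: the function names and sample inputs are hypothetical, the regular expressions imitate `\p{Punct}`, `\p{Cntrl}`, and the default `TOKENIZE` delimiters for ASCII text only, and ties between equal counts may be ordered differently than in Pig.

```python
import re
import string
from collections import Counter

# ASCII approximation of the Pig pattern '[\\p{Punct},\\p{Cntrl}]'.
_STRIP = re.compile("[" + re.escape(string.punctuation) + r"\x00-\x1f\x7f]")
# TOKENIZE splits on space, double quote, comma, parentheses and asterisk.
_SPLIT = re.compile(r'[ ",()*]+')


def tokenize(text):
    """Split text like TOKENIZE does, dropping empty tokens."""
    return [token for token in _SPLIT.split(text) if token]


def top_words(lines, stopword_lines, n=30):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    # flattened_stopwords: one entry per stop word.
    stop = {word for line in stopword_lines for word in tokenize(line)}
    counts = Counter()
    for line in lines:
        # words: trim, lowercase, strip punctuation and control characters.
        cleaned = _STRIP.sub("", line.strip().lower())
        for word in tokenize(cleaned):
            # Same effect as the RIGHT OUTER JOIN plus the IS NULL filter.
            if word not in stop:
                counts[word] += 1
    # GROUP + COUNT + ORDER ... DESC + LIMIT n.
    return counts.most_common(n)


# Hypothetical sample input, not taken from the actual datasets.
sample_lines = [
    "To be, or not to be: that is the question.",
    "The question remains; the KING knows the question!",
]
sample_stopwords = ["a, be, is, not, or, that, the, to"]
print(top_words(sample_lines, sample_stopwords, n=3))
```

A set lookup is enough here because the stop word list is small; in Pig the join expresses the same filter in a form that also works when both inputs are too large to hold in memory.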