<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#SQOOP-FOR-DATA-IMPORTING" data-toc-modified-id="SQOOP-FOR-DATA-IMPORTING-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>SQOOP FOR DATA IMPORTING</a></span><ul class="toc-item"><li><span><a href="#IMPORT-AS-TEXT-FILES" data-toc-modified-id="IMPORT-AS-TEXT-FILES-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>IMPORT AS TEXT FILES</a></span></li><li><span><a href="#IMPORT-AS-HIVE-TABLES" data-toc-modified-id="IMPORT-AS-HIVE-TABLES-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>IMPORT AS HIVE TABLES</a></span></li><li><span><a href="#AN-IMPORT-EXERCISE" data-toc-modified-id="AN-IMPORT-EXERCISE-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>AN IMPORT EXERCISE</a></span></li></ul></li><li><span><a href="#HIVE" data-toc-modified-id="HIVE-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>HIVE</a></span><ul class="toc-item"><li><span><a href="#CREATE/DROP-DATABASES,-MOVE-AND-COPY-TABLES" data-toc-modified-id="CREATE/DROP-DATABASES,-MOVE-AND-COPY-TABLES-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>CREATE/DROP DATABASES, MOVE AND COPY TABLES</a></span></li><li><span><a href="#A-HIVE-EXERCISE" data-toc-modified-id="A-HIVE-EXERCISE-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>A HIVE EXERCISE</a></span></li></ul></li><li><span><a href="#PIG" data-toc-modified-id="PIG-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>PIG</a></span><ul class="toc-item"><li><span><a href="#PIG-TUTORIAL" data-toc-modified-id="PIG-TUTORIAL-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>PIG TUTORIAL</a></span></li><li><span><a href="#PIG-EXERCISE" data-toc-modified-id="PIG-EXERCISE-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>PIG EXERCISE</a></span></li></ul></li></ul></div>

# SQOOP FOR DATA IMPORTING

We will use sqoop in order to import data
- from RDBMS
- into HDFS as text files
- or as hive tables

Note that we will run all shell commands inside the Docker container over ssh

So first let's open an ssh connection to our sandbox-hdp Docker container:

In [3]:
# ssh root@localhost -p 2223

To be sure we are inside the sandbox-hdp container, check the shell prompt. It should look like this:

`[root@sandbox-hdp ~]#`

The hash sign "#" indicates that we are now the root user

The imdb database that we are familiar with from the SQL sessions has already been imported into the postgresql server running inside the container

Due to limitations on X server display from inside the container, pgadmin3 cannot run.

So we will probe our postgresql databases only through the command line client "psql"

To do that we will first change the system user to postgres

Note that this is not the same as the database user "postgres"

In [2]:
# su postgres

Now the prompt changed to:

`bash-4.1$`

The "$" indicates that we are a regular user again, not root

To connect to a database as a given database user from the command line, the command is:

`psql dbname dbuser`

In [4]:
# psql imdb postgres

Now the prompt should be:

`imdb=#`

We can now query the database

First, let's see whether all tables are imported.

The command to list the tables in a database is:

`\dt`

In [5]:
# \dt

The output should be:

```bash
imdb=# \dt
                 List of relations
 Schema |         Name          | Type  |  Owner   
--------+-----------------------+-------+----------
 public | name_basics           | table | postgres
 public | title_basics          | table | postgres
 public | title_crew            | table | postgres
 public | title_episode         | table | postgres
 public | title_principals_melt | table | postgres
 public | title_ratings         | table | postgres
```

As we remember from our first sessions (see the week_02_commands file), we can get the total size of the database with:

In [6]:
# SELECT pg_size_pretty(pg_database_size('imdb'));

The output should be:

```bash
imdb=# SELECT pg_size_pretty(pg_database_size('imdb'));
 pg_size_pretty 
----------------
 4538 MB
(1 row)
```

We can input any SQL statement that we have created so far directly into the psql prompt.

For example, remember the query:

- Starting with the first option (not the subquery), let's make a three-way join
- First join titles and principal cast on title ids (tconst)
- And then join principal cast and name basics on name ids (nconst)
- Filter only for actors and actresses
- And sort first on names (ascending), then on title years (descending)

```sql
SELECT title_basics.tconst, title_basics.originaltitle, title_basics.startyear, title_basics.runtimeminutes,
	title_basics.genres, title_principals_melt.principalcast, name_basics.primaryname,
	name_basics.birthyear, name_basics.deathyear, name_basics.primaryprofession

FROM title_basics LEFT JOIN title_principals_melt ON title_basics.tconst=title_principals_melt.tconst
	LEFT JOIN name_basics ON title_principals_melt.principalcast=name_basics.nconst

WHERE title_basics.originaltitle ~ '.*Godfather.*Part.*'
	AND title_basics.genres ~ '(?i)drama'
	AND NOT title_basics.genres ~ '(?i)comedy'
	AND title_basics.startyear <= 1990
	AND name_basics.primaryprofession ~'actor|actress'

ORDER BY name_basics.primaryname, title_basics.startyear DESC;
```

The output should be:

```bash
  tconst   |      originaltitle      | startyear | runtimeminutes |   genres    | principalcast |  primaryname   | birthyear | deathyear |     primaryprofession     
-----------+-------------------------+-----------+----------------+-------------+---------------+----------------+-----------+-----------+---------------------------
 tt0099674 | The Godfather: Part III |      1990 |            162 | Crime,Drama | nm0000199     | Al Pacino      |      1940 |           | actor,soundtrack,director
 tt0071562 | The Godfather: Part II  |      1974 |            202 | Crime,Drama | nm0000199     | Al Pacino      |      1940 |           | actor,soundtrack,director
 tt0099674 | The Godfather: Part III |      1990 |            162 | Crime,Drama | nm0000412     | Andy Garcia    |      1956 |           | actor,producer,soundtrack
 tt0099674 | The Godfather: Part III |      1990 |            162 | Crime,Drama | nm0000473     | Diane Keaton   |      1946 |           | actress,producer,director
 tt0071562 | The Godfather: Part II  |      1974 |            202 | Crime,Drama | nm0000473     | Diane Keaton   |      1946 |           | actress,producer,director
 tt0071562 | The Godfather: Part II  |      1974 |            202 | Crime,Drama | nm0000134     | Robert De Niro |      1943 |           | actor,producer,soundtrack
 tt0071562 | The Godfather: Part II  |      1974 |            202 | Crime,Drama | nm0000380     | Robert Duvall  |      1931 |           | actor,producer,soundtrack
 tt0099674 | The Godfather: Part III |      1990 |            162 | Crime,Drama | nm0001735     | Talia Shire    |      1946 |           | actress,producer,director
(8 rows)
```

Now we are pretty sure that our database is completely inside the container

Let's start to import via sqoop!

First let's leave the psql shell

In [8]:
# \q

Then let's leave postgres user and shift to hdfs user:

In [9]:
# exit
# su hdfs
# cd /

Now the prompt is:

`[hdfs@sandbox-hdp /]$`

To connect to a database server, the matching JDBC driver must be downloaded to the correct location

(Note that this step is already done in our setup. It is listed here for your reference.)

The link for the current JDBC file is:

https://jdbc.postgresql.org/download/postgresql-42.1.4.jar

And the path for the file is (inside the sandbox-hdp docker):

`/usr/hdp/current/sqoop-client/lib/`
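For reference, that download step could look like the sketch below. It assumes it is run by a user with write access to the Sqoop lib directory (for example root) and that wget is available. It does nothing if the directory is missing or the jar is already in place.

```bash
# URL and target directory as given above.
JDBC_URL=https://jdbc.postgresql.org/download/postgresql-42.1.4.jar
SQOOP_LIB=/usr/hdp/current/sqoop-client/lib
JAR_NAME=$(basename "$JDBC_URL")

# Download only when the lib directory exists and the jar is not there yet.
if [ -d "$SQOOP_LIB" ] && [ ! -f "$SQOOP_LIB/$JAR_NAME" ]; then
  wget -P "$SQOOP_LIB" "$JDBC_URL"
else
  echo "nothing to do for $JAR_NAME"
fi
```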


Note that we configured the postgresql server inside the container so that it does not prompt for a password, just for convenience.

This is not secure enough for a production environment

Now let's first see whether sqoop can connect to our postgresql database and list the tables:

```bash
sqoop list-tables --connect jdbc:postgresql://localhost:5432/imdb --username postgres
```

The output should be:
```bash

name_basics
title_basics
title_crew
title_episode
title_principals_melt
title_ratings
```

Ok, sqoop recognizes and connects to postgresql server
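On a server that does ask for a password (unlike this sandbox), Sqoop can prompt for it with `-P` or read it from a file with `--password-file`, which keeps it out of the shell history. A minimal sketch, assuming a hypothetical password file path that is not part of our setup. The command is only printed here, not executed:

```bash
# Hypothetical HDFS path of a file that holds only the password.
PASSWORD_FILE=/user/hdfs/.sqoop-pg-password

# Build the command as an array so that every argument stays one word.
list_cmd=(sqoop list-tables
  --connect jdbc:postgresql://localhost:5432/imdb
  --username postgres
  --password-file "$PASSWORD_FILE")

# Print the command for review; run it with "${list_cmd[@]}" once the file exists.
echo "${list_cmd[*]}"
```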

## IMPORT AS TEXT FILES

First, let's import a single table as a text file into hdfs

The target directory must not exist yet. The "--direct" flag asks Sqoop to use the database's own bulk tools instead of plain JDBC, which makes imports faster:

```bash

sqoop import --connect jdbc:postgresql://localhost:5432/imdb \
--username postgres \
--table name_basics \
--target-dir /user/data/imdb_import_1 \
--direct

```

The output should be:

```bash

18/01/04 00:55:35 INFO manager.DirectPostgresqlManager: Performing import of table name_basics from database imdb
18/01/04 00:56:02 INFO manager.DirectPostgresqlManager: Transfer loop complete.
18/01/04 00:56:02 INFO manager.DirectPostgresqlManager: Transferred 452.8851 MB in 24.596 seconds (18.413 MB/sec)
```

Now let's check whether the new directory and file(s) exist:

In [10]:
# hdfs dfs -ls /user/data

`drwxr-xr-x   - hdfs hdfs          0 2018-01-04 00:55 /user/data/imdb_import_1`

In [11]:
# hdfs dfs -ls /user/data/imdb_import_1

`-rw-r--r--   1 hdfs hdfs  475483887 2018-01-04 00:56 /user/data/imdb_import_1/part-m-00000`

The file is there. Let's read the head of the file:

In [13]:
# hdfs dfs -cat /user/data/imdb_import_1/* | head

```bash
nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0120689,tt0027125,tt0028333,tt0050419"
nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0038355,tt0040506,tt0055688,tt0037382"
nm0000003,Brigitte Bardot,1934,,"actress,soundtrack,producer","tt0049189,tt0057345,tt0059956,tt0063715"
nm0000004,John Belushi,1949,1982,"actor,writer,soundtrack","tt0077975,tt0078723,tt0080455,tt0072562"
nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0060827,tt0050976,tt0050986,tt0083922"
nm0000006,Ingrid Bergman,1915,1982,"actress,soundtrack,producer","tt0038787,tt0034583,tt0038109,tt0077711"
nm0000007,Humphrey Bogart,1899,1957,"actor,soundtrack,producer","tt0034583,tt0040897,tt0033870,tt0038355"
nm0000008,Marlon Brando,1924,2004,"actor,soundtrack,director","tt0078788,tt0047296,tt0068646,tt0078346"
nm0000009,Richard Burton,1925,1984,"actor,producer,soundtrack","tt0061184,tt0057877,tt0087803,tt0065207"
nm0000010,James Cagney,1899,1986,"actor,soundtrack,director","tt0042041,tt0029870,tt0031867,tt0035575"
```

Note that by default the imported text format is comma separated values.

We can also import the table in binary formats (Avro, Parquet or sequence files)
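As a sketch of a binary import, the command below swaps in one of Sqoop's file format flags. The target directory name is hypothetical. `--direct` is left out on the assumption that direct mode only writes text files, and `-m 1` keeps the import to a single map task so that no split column is needed.

```bash
# One of Sqoop's file format flags; plain text (--as-textfile) is the default.
FORMAT_FLAG=--as-avrodatafile   # alternatives: --as-sequencefile, --as-parquetfile

# Hypothetical target directory; like any target, it must not exist yet.
TARGET_DIR=/user/data/imdb_import_avro

import_cmd=(sqoop import
  --connect jdbc:postgresql://localhost:5432/imdb
  --username postgres
  --table name_basics
  --target-dir "$TARGET_DIR"
  -m 1
  "$FORMAT_FLAG")

# Run the import where sqoop is installed, otherwise just show the command.
if command -v sqoop >/dev/null 2>&1; then
  "${import_cmd[@]}"
else
  echo "${import_cmd[*]}"
fi
```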

In order to get more info on import options:

In [15]:
# sqoop help import | less

Importing a single table is OK. But what if we want to import multiple tables at once?

sqoop has another command "import-all-tables"

First let's flush the import directory

In [None]:
# hdfs dfs -rm -r /user/data/imdb_import_1

And check the contents:

In [17]:
# hdfs dfs -ls /user/data

Now import all tables. Note that the option for the target path is now "--warehouse-dir":

```bash

sqoop import-all-tables \
--connect jdbc:postgresql://localhost:5432/imdb \
--username postgres \
--warehouse-dir /user/data/imdb_import_1 \
--direct

```

Let's check the files from hdfs. Note that a separate directory is created for each table:

In [18]:
# hdfs dfs -ls /user/data/imdb_import_1/*

```bash
-rw-r--r--   1 hdfs hdfs  475483887 2018-01-04 01:07 /user/data/imdb_import_1/name_basics/part-m-00000
-rw-r--r--   1 hdfs hdfs  475483887 2018-01-04 00:56 /user/data/imdb_import_1/part-m-00000
Found 1 items
-rw-r--r--   1 hdfs hdfs  387758108 2018-01-04 01:07 /user/data/imdb_import_1/title_basics/part-m-00000
Found 1 items
-rw-r--r--   1 hdfs hdfs  133155566 2018-01-04 01:07 /user/data/imdb_import_1/title_crew/part-m-00000
Found 1 items
-rw-r--r--   1 hdfs hdfs   72331204 2018-01-04 01:07 /user/data/imdb_import_1/title_episode/part-m-00000
Found 1 items
-rw-r--r--   1 hdfs hdfs  507177900 2018-01-04 01:08 /user/data/imdb_import_1/title_principals_melt/part-m-00000
Found 1 items
-rw-r--r--   1 hdfs hdfs   13053098 2018-01-04 01:08 /user/data/imdb_import_1/title_ratings/part-m-00000
```

We can control the number of map tasks used for the import with the "-m" flag. Each map task writes its own file (part-m-00000, part-m-00001, ...) inside the target directory, so a higher number means more output files
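A parallel import could be wrapped as in the sketch below. Sqoop divides the rows among the map tasks by the table's primary key unless `--split-by` names another column. The function name, the target directory pattern and the arguments are illustrative assumptions, not part of the setup above.

```bash
# Arguments: table name, number of map tasks, column to split the rows on.
parallel_import() {
  local table=$1 mappers=$2 split_col=$3
  local cmd=(sqoop import
    --connect jdbc:postgresql://localhost:5432/imdb
    --username postgres
    --table "$table"
    --target-dir "/user/data/${table}_m${mappers}"
    -m "$mappers"
    --split-by "$split_col")
  # Run the import where sqoop is installed, otherwise just show the command.
  if command -v sqoop >/dev/null 2>&1; then
    "${cmd[@]}"
  else
    echo "${cmd[*]}"
  fi
}

# Example call with placeholder names:
#   parallel_import some_table 4 some_numeric_column
```

With four map tasks the target directory would hold up to four part files, one per task.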

## IMPORT AS HIVE TABLES

Remember that Hive is a data warehouse system that provides an SQL-like interface to query the data inside HDFS.

Sqoop can import the data as Hive tables with the "--hive-import --create-hive-table" flags

```bash
sqoop import \
--connect jdbc:postgresql://localhost:5432/imdb \
--username postgres \
--table name_basics \
--hive-import --create-hive-table --direct
```

The output should be:
```bash
Time taken: 3.905 seconds
Loading data to table default.name_basics
Table default.name_basics stats: [numFiles=1, numRows=0, totalSize=466334419, rawDataSize=0]
OK
Time taken: 2.267 seconds
```

But where is the file?

It is here:

In [19]:
# hdfs dfs -ls /apps/hive/warehouse/

```bash

Found 5 items
drwxrwxrwx   - hive hadoop          0 2017-11-10 14:59 /apps/hive/warehouse/foodmart.db
drwxrwxrwx   - hdfs hadoop          0 2018-01-04 01:28 /apps/hive/warehouse/name_basics
drwxrwxrwx   - hive hadoop          0 2017-11-10 15:00 /apps/hive/warehouse/sample_07
drwxrwxrwx   - hive hadoop          0 2017-11-10 15:00 /apps/hive/warehouse/sample_08
drwxrwxrwx   - hive hadoop          0 2017-11-10 14:53 /apps/hive/warehouse/xademo.db
```

And we can check from hive that the table is imported:

In [20]:
# hive -e 'show tables'

```bash
OK
name_basics
sample_07
sample_08
Time taken: 3.173 seconds, Fetched: 3 row(s)
```

Let's try to read the file:

In [3]:
# hdfs dfs -cat /apps/hive/warehouse/name_basics/* | head

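It is still a text file, but the commas are gone: fields are now separated by Hive's default delimiter, the non-printing Ctrl-A character (\001), so the columns appear glued together: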

```bash
nm0000001Fred Astaire18991987soundtrack,actor,miscellaneoustt0120689,tt0027125,tt0028333,tt0050419
nm0000002Lauren Bacall19242014actress,soundtracktt0038355,tt0040506,tt0055688,tt0037382
nm0000003Brigitte Bardot1934actress,soundtrack,producertt0049189,tt0057345,tt0059956,tt0063715
nm0000004John Belushi19491982actor,writer,soundtracktt0077975,tt0078723,tt0080455,tt0072562
nm0000005Ingmar Bergman19182007writer,director,actortt0060827,tt0050976,tt0050986,tt0083922
nm0000006Ingrid Bergman19151982actress,soundtrack,producertt0038787,tt0034583,tt0038109,tt0077711
nm0000007Humphrey Bogart18991957actor,soundtrack,producertt0034583,tt0040897,tt0033870,tt0038355
nm0000008Marlon Brando19242004actor,soundtrack,directortt0078788,tt0047296,tt0068646,tt0078346
nm0000009Richard Burton19251984actor,producer,soundtracktt0061184,tt0057877,tt0087803,tt0065207
nm0000010James Cagney18991986actor,soundtrack,directortt0042041,tt0029870,tt0031867,tt0035575
```
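The columns in this file are separated by the non-printing Ctrl-A character (octal 001), Hive's default field delimiter. The sketch below makes that delimiter visible on a made-up sample record built with printf, so it runs without access to hdfs:

```bash
# A sample record with Hive's default field delimiter, Ctrl-A (octal 001).
sample=$(printf 'nm0000001\001Fred Astaire\0011899\0011987')

# Make the delimiter visible by translating it to a pipe.
printf '%s\n' "$sample" | tr '\001' '|'
# → nm0000001|Fred Astaire|1899|1987

# Pull out the second field with cut, using the same delimiter.
printf '%s\n' "$sample" | cut -d "$(printf '\001')" -f 2
# → Fred Astaire
```

The same `tr` filter can be appended to the hdfs command: `hdfs dfs -cat /apps/hive/warehouse/name_basics/* | head | tr '\001' '|'`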

If we want to delete the table, we do it from the hive command, not by manually deleting the file:

In [21]:
# hive -e 'drop table name_basics'

```bash
OK
Time taken: 4.542 seconds
```

In [1]:
# hive -e 'show tables'

No more name_basics:

```bash
OK
sample_07
sample_08
Time taken: 4.746 seconds, Fetched: 2 row(s)
```

In [2]:
# hdfs dfs -ls /apps/hive/warehouse/

And no more name_basics directories

Now let's import all tables into hive

```bash
sqoop import-all-tables \
--connect jdbc:postgresql://localhost:5432/imdb \
--username postgres \
--hive-import --create-hive-table --direct
```

Let's check whether they are imported into hive:

In [4]:
# hive -e 'show tables'

```bash
OK
name_basics
sample_07
sample_08
title_basics
title_crew
title_episode
title_principals_melt
title_ratings
Time taken: 2.768 seconds, Fetched: 8 row(s)
```

In [5]:
# hdfs dfs -ls /apps/hive/warehouse

```bash
Found 10 items
drwxrwxrwx   - hive hadoop          0 2017-11-10 14:59 /apps/hive/warehouse/foodmart.db
drwxrwxrwx   - hdfs hadoop          0 2018-01-04 01:41 /apps/hive/warehouse/name_basics
drwxrwxrwx   - hive hadoop          0 2017-11-10 15:00 /apps/hive/warehouse/sample_07
drwxrwxrwx   - hive hadoop          0 2017-11-10 15:00 /apps/hive/warehouse/sample_08
drwxrwxrwx   - hdfs hadoop          0 2018-01-04 01:42 /apps/hive/warehouse/title_basics
drwxrwxrwx   - hdfs hadoop          0 2018-01-04 01:42 /apps/hive/warehouse/title_crew
drwxrwxrwx   - hdfs hadoop          0 2018-01-04 01:43 /apps/hive/warehouse/title_episode
drwxrwxrwx   - hdfs hadoop          0 2018-01-04 01:43 /apps/hive/warehouse/title_principals_melt
drwxrwxrwx   - hdfs hadoop          0 2018-01-04 01:44 /apps/hive/warehouse/title_ratings
drwxrwxrwx   - hive hadoop          0 2017-11-10 14:53 /apps/hive/warehouse/xademo.db
```

## AN IMPORT EXERCISE

Now you will
- first import a publicly available sql dump into postgresql (inside sandbox-hdp)
- then import the postgresql database into hdfs as text files and hive tables

The database is "World", containing lists of cities, countries and languages. The link is

http://pgfoundry.org/frs/download.php/527/world-1.0.tar.gz

First you should exit the hdfs user and become the postgres user. Note that the dash "-" after "su" starts a login shell, which also takes the user to its home directory:

In [None]:
# exit
# su - postgres

You can download the file using:

`wget _url_`

And extract the tar.gz archive using:

`tar -xzvf _tar.gz_`

Navigate to dbsamples-0.1/world with cd

There you will see world.sql

Now in two simple commands we will create an empty database named world, and import the sql dump into that database

In [6]:
# createdb world

In [7]:
# psql world < world.sql

Now connect to the database through psql:

In [8]:
# psql world postgres

And list the tables

In [9]:
# \dt

```bash
world=# \dt
              List of relations
 Schema |      Name       | Type  |  Owner   
--------+-----------------+-------+----------
 public | city            | table | postgres
 public | country         | table | postgres
 public | countrylanguage | table | postgres
(3 rows)
```

To see the columns in a table:

In [10]:
# \d+ city

```bash
world-# \d+ city
                       Table "public.city"
   Column    |     Type     | Modifiers | Storage  | Description 
-------------+--------------+-----------+----------+-------------
 id          | integer      | not null  | plain    | 
 name        | text         | not null  | extended | 
 countrycode | character(3) | not null  | extended | 
 district    | text         | not null  | extended | 
 population  | integer      | not null  | plain    | 
Indexes:
    "city_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "country" CONSTRAINT "country_capital_fkey" FOREIGN KEY (capital) REFERENCES city(id)
Has OIDs: no
```

And let's view the head of the tables:

In [11]:
# select * from city limit 10;

```bash
world=# select * from city limit 10;
 id |      name      | countrycode |   district    | population 
----+----------------+-------------+---------------+------------
  1 | Kabul          | AFG         | Kabol         |    1780000
  2 | Qandahar       | AFG         | Qandahar      |     237500
  3 | Herat          | AFG         | Herat         |     186800
  4 | Mazar-e-Sharif | AFG         | Balkh         |     127800
  5 | Amsterdam      | NLD         | Noord-Holland |     731200
  6 | Rotterdam      | NLD         | Zuid-Holland  |     593321
  7 | Haag           | NLD         | Zuid-Holland  |     440900
  8 | Utrecht        | NLD         | Utrecht       |     234323
  9 | Eindhoven      | NLD         | Noord-Brabant |     201843
 10 | Tilburg        | NLD         | Noord-Brabant |     193238
(10 rows)
```

You can also do it for country and countrylanguage tables

Now the rest of the task is:

- Exit psql shell
- Exit postgres user and become hdfs user again
- First check that sqoop can list the tables in the database

- Then import all tables as text files into hdfs
- Check that they are imported using hdfs commands (list or cat the head)

- Then import all tables as hive tables, 
- Check that they are imported using hive command

# HIVE

Hive makes it possible to run queries on datasets residing in hdfs using an SQL like language

Hive can be run:
- Interactively through its shell
- Non-interactively as inline commands
- Or in batch mode through hive scripts
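The inline and batch modes can be sketched from the shell as follows (the interactive mode is simply `hive` on its own). The script file name is an arbitrary temporary name chosen for this illustration:

```bash
# Write a small HiveQL script to a temporary file.
script=$(mktemp /tmp/show_tables.XXXXXX)
cat > "$script" <<'EOF'
show databases;
show tables;
EOF

if command -v hive >/dev/null 2>&1; then
  hive -e 'show tables;'   # inline: the statement is passed as a quoted string
  hive -f "$script"        # batch: every statement in the file is run in order
else
  echo "hive not found; script written to $script"
fi
```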

Now that we have imported tables, we can run queries against the tables

First let's run it non-interactively

`hive -e "show tables";`

```bash
OK
city
country
countrylanguage
sample_07
sample_08
title_basics
title_crew
title_episode
title_principals_melt
title_ratings
Time taken: 4.137 seconds, Fetched: 10 row(s)
```

Now for handling more iterative steps, let's start the hive prompt and run interactively:

In [1]:
# hive

This will open the "hive" prompt:

`hive>`

Note that,
- commands should end with a semicolon ";"
- keywords and table or column names are case insensitive (string values are not)
- and HiveQL closely follows the flavor of MySQL


Let's first list the tables

```mysql
show tables;
```

```bash
city
country
countrylanguage
sample_07
sample_08
title_basics
title_crew
title_episode
title_principals_melt
title_ratings
```

This way, our tables are not well organized. It is better to create two databases and attach the tables to the relevant databases

The hive prompt prints an "OK" after each successful command

## CREATE/DROP DATABASES, MOVE AND COPY TABLES

Let's first create a database named "deneme" from hive prompt:

```mysql
create database deneme;
```

And list the databases:

```mysql
show databases;
```

```bash
hive> show databases;
OK
default
deneme
foodmart
xademo
Time taken: 0.059 seconds, Fetched: 4 row(s)
```

Now let's delete the database:

```mysql
drop database deneme;
```

```mysql
show databases;
```

```bash
hive> show databases;
OK
default
foodmart
xademo
```

Now let's create a database for holding imdb tables together:

```mysql
create database imdb;
```

We can now either move the tables into imdb or create copies of them inside imdb

First, let's create copies of the tables inside the imdb database

```mysql
create table imdb.name_basics
as select * from name_basics;
```

```mysql
create table imdb.title_basics
as select * from title_basics;
```

```mysql
create table imdb.title_crew
as select * from title_crew;
```

```mysql
create table imdb.title_episode
as select * from title_episode;
```

```mysql
create table imdb.title_principals_melt
as select * from title_principals_melt;
```

```mysql
create table imdb.title_ratings
as select * from title_ratings;
```
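As an alternative to typing one statement per table, the statements can be generated from the shell. This is a sketch that only prints the HiveQL. It assumes the source tables live in the default database and that the imdb database already exists.

```bash
tables="name_basics title_basics title_crew title_episode title_principals_melt title_ratings"

# Generate one CREATE TABLE ... AS SELECT statement per table.
copy_sql=$(for t in $tables; do
  printf 'create table imdb.%s as select * from default.%s;\n' "$t" "$t"
done)

# Review the statements before running them.
echo "$copy_sql"
```

The printed statements can then be run in a single session with `hive -e "$copy_sql"`, provided the copies have not been created already.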

Now we can list the tables in imdb database:

```mysql
show tables in imdb;
```

```mysql
hive> show tables in imdb;
OK
name_basics
title_basics
title_crew
title_episode
title_principals_melt
title_ratings
Time taken: 0.483 seconds, Fetched: 6 row(s)
```

Now we can run some queries inside the database. First let's check whether the copying is complete

After a "use DBNAME;" statement, DBNAME acts as the default namespace for table names, which keeps the queries compact

```mysql
use imdb; select count(*) from name_basics;
```

```mysql
hive> use imdb; select count(*) from name_basics;
OK
Time taken: 0.335 seconds
OK
8155447
Time taken: 0.576 seconds, Fetched: 1 row(s)
```

You can repeat the query for other tables

Now let's create a second database named imdb_new and MOVE the tables in imdb to this new database

```mysql
create database imdb_new;
show databases;
```

```mysql
alter table imdb.name_basics
rename to imdb_new.name_basics;
```

We can check whether name_basics is moved:

```mysql
hive> show tables in imdb;
OK
title_basics
title_crew
title_episode
title_principals_melt
title_ratings
Time taken: 0.47 seconds, Fetched: 5 row(s)
hive> show tables in imdb_new;
OK
name_basics
Time taken: 0.454 seconds, Fetched: 1 row(s)
```

You can either repeat it for the other tables, or move the table back into imdb

```mysql
alter table imdb_new.name_basics
rename to imdb.name_basics;
```

Now that we have a hive database on hdfs, similar to our postgresql database, we can run similar queries

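For example:
- get the original titles, start years and runtime minutes of
- titles from 1930 or earlier
- longer than 100 minutes
- with Drama as genre
- limited to the first 10 results

Note that PostgreSQL's regex match operator tilde (~) is not available in HiveQL. Here we use LIKE instead, to account for the flavor difference between PostgreSQL and MySQL. Keep in mind that LIKE without wildcards matches the whole string, so `LIKE 'Drama'` returns only titles whose genres field is exactly "Drama"; for a partial match, wrap the pattern in "%" wildcards (`LIKE '%Drama%'`) or use Hive's regex operator RLIKE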

```mysql
use imdb;

SELECT originaltitle, startyear, runtimeminutes
  FROM title_basics
  WHERE startyear <= 1930
  AND runtimeminutes > 100
  AND genres LIKE 'Drama'
  LIMIT 10;
```

```
Atlantis	1913	121
Germinal	1913	150
Les misérables - Époque 2: Fantine	1913	300
Stingaree	1915	250
Who Pays?	1915	360
A Daughter of the Gods	1916	180
Berg-Ejvind och hans hustru	1918	136
La España trágica o Tierra de sangre	1918	240
Vendémiaire	1918	148
Ingmarssönerna	1919	207
Time taken: 0.231 seconds, Fetched: 10 row(s)
```
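Since LIKE without wildcards matches the whole string, a partial match on genres needs either "%" wildcards or Hive's regex operator RLIKE, which is the closest counterpart of the PostgreSQL tilde. The sketch below holds both variants as shell strings and runs them non-interactively where hive is available. The extra genres column is added only to make the matches visible.

```bash
# Variant 1: SQL wildcards, where % matches any run of characters.
like_query="SELECT originaltitle, startyear, runtimeminutes, genres
  FROM imdb.title_basics
  WHERE startyear <= 1930
  AND runtimeminutes > 100
  AND genres LIKE '%Drama%'
  LIMIT 10;"

# Variant 2: RLIKE is true when any part of the string matches the regex.
rlike_query="SELECT originaltitle, startyear, runtimeminutes, genres
  FROM imdb.title_basics
  WHERE startyear <= 1930
  AND runtimeminutes > 100
  AND genres RLIKE 'Drama'
  LIMIT 10;"

if command -v hive >/dev/null 2>&1; then
  hive -e "$like_query"
  hive -e "$rlike_query"
else
  printf '%s\n' "$like_query" "$rlike_query"
fi
```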

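In fact, under the hood, Hive converts each HiveQL query into an execution plan, which for larger queries is a series of distributed jobs (classically map reduce). A simple filter like the one above can be served by a single fetch stage

The plan can be viewed by prefixing the statement with "explain"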

```mysql
explain
SELECT originaltitle, startyear, runtimeminutes
  FROM imdb.title_basics
  WHERE startyear <= 1930
  AND runtimeminutes > 100
  AND genres LIKE 'Drama'
  LIMIT 10;
```

```mysql
OK
Plan not optimized by CBO.

Stage-0
   Fetch Operator
      limit:10
      Limit [LIM_3]
         Number of rows:10
         Select Operator [SEL_2]
            outputColumnNames:["_col0","_col1","_col2"]
            Filter Operator [FIL_5]
               predicate:((startyear <= 1930) and (runtimeminutes > 100) and (genres like 'Drama')) (type: boolean)
               TableScan [TS_0]
                  alias:title_basics

Time taken: 0.091 seconds, Fetched: 14 row(s)
```

Hive is suitable for simpler queries. However, as queries get more complex and need a clearer definition of the dataflow, we should turn to a tool such as Pig

You can exit the hive shell with "quit;"

## A HIVE EXERCISE

In the previous sqoop exercise, you were required to import the tables from the "world" database into hive

Now first please create a "world" database in hive and copy the three tables into that database

Note that in the last example we provided "use imdb;" as the namespace

To refer to tables that are not yet attached to a custom database, run "use default;"

Or you can refer to those tables as default.TABLENAME

You should get:

```mysql
hive> show tables in world;
OK
city
country
countrylanguage
Time taken: 0.638 seconds, Fetched: 3 row(s)
```

In order to get information about tables, you can run:

```mysql
show create table world.city;
```

```mysql
hive> show create table world.city;
OK
CREATE TABLE `world.city`(
  `id` int, 
  `name` string, 
  `countrycode` string, 
  `district` string, 
  `population` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://sandbox-hdp.hortonworks.com:8020/apps/hive/warehouse/world.db/city'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}', 
  'numFiles'='1', 
  'numRows'='4079', 
  'rawDataSize'='140392', 
  'totalSize'='144471', 
  'transient_lastDdlTime'='1515066026')
Time taken: 0.304 seconds, Fetched: 21 row(s)
```

The output returns the schema (column names and types), the file format and location, and size information

You can repeat it for other tables in the database

```mysql
hive> show create table world.country;
OK
CREATE TABLE `world.country`(
  `code` string, 
  `name` string, 
  `continent` string, 
  `region` string, 
  `surfacearea` double, 
  `indepyear` int, 
  `population` int, 
  `lifeexpectancy` double, 
  `gnp` double, 
  `gnpold` double, 
  `localname` string, 
  `governmentform` string, 
  `headofstate` string, 
  `capital` int, 
  `code2` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://sandbox-hdp.hortonworks.com:8020/apps/hive/warehouse/world.db/country'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}', 
  'numFiles'='1', 
  'numRows'='239', 
  'rawDataSize'='30979', 
  'totalSize'='31218', 
  'transient_lastDdlTime'='1515066029')
Time taken: 0.334 seconds, Fetched: 31 row(s)
```

```
hive> show create table world.countrylanguage;
OK
CREATE TABLE `world.countrylanguage`(
  `countrycode` string, 
  `language` string, 
  `isofficial` boolean, 
  `percentage` double)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://sandbox-hdp.hortonworks.com:8020/apps/hive/warehouse/world.db/countrylanguage'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}', 
  'numFiles'='1', 
  'numRows'='984', 
  'rawDataSize'='20965', 
  'totalSize'='21949', 
  'transient_lastDdlTime'='1515066048')
Time taken: 0.305 seconds, Fetched: 20 row(s)
```

Now the next task is:

- get the names (from the country table) of the countries whose official languages (from the countrylanguage table) include English

You should get:

```mysql
OK
American Samoa
Anguilla
Antigua and Barbuda
Australia
Barbados
Belize
Bermuda
United Kingdom
Virgin Islands, British
Cayman Islands
South Africa
Falkland Islands
Gibraltar
Guam
Hong Kong
Ireland
Christmas Island
Canada
Cocos (Keeling) Islands
Lesotho
Malta
Marshall Islands
Montserrat
Nauru
Niue
Norfolk Island
Palau
Northern Mariana Islands
Saint Helena
Saint Kitts and Nevis
Saint Lucia
Saint Vincent and the Grenadines
Samoa
Seychelles
Tokelau
Tonga
Turks and Caicos Islands
Tuvalu
New Zealand
Vanuatu
United States
Virgin Islands, U.S.
Zimbabwe
United States Minor Outlying Islands
Time taken: 11.424 seconds, Fetched: 44 row(s)
```

# PIG

## PIG TUTORIAL

Whereas Hive is declarative, Pig is procedural, so that dataflow steps are more easily defined

In these examples, we will use HCatalog to connect Pig to Hive databases. HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (Pig, MapReduce) to more easily read and write data on the grid

To enable pig with HCatalog:

```bash
pig -useHCatalog
```

We will write down the steps defining the workflow and the plan will be executed when we enter the DUMP command

As a simple example, let's load city table from world database in Hive

```Pig
city = LOAD 'world.city' USING org.apache.hive.hcatalog.pig.HCatLoader();
```

And let's select the cities where the countrycode is TUR

```Pig
cityturkey = filter city by countrycode == 'TUR';
```

And let's select the cities with a population larger than 1 million

```Pig
largecitytur = filter cityturkey by population > 1000000;
```

Now let's execute the plan:

```Pig
DUMP largecitytur;
```

```Pig
(3357,Istanbul,TUR,Istanbul,8787958)
(3358,Ankara,TUR,Ankara,3038159)
(3359,Izmir,TUR,Izmir,2130359)
(3360,Adana,TUR,Adana,1131198)
(3361,Bursa,TUR,Bursa,1095842)
```
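The same dataflow can also be run in batch mode by saving it as a script and passing it to pig with `-f`. In this sketch the two filters are merged into one condition, and the temporary file name is an arbitrary choice for the illustration:

```bash
# Save the dataflow as a Pig script.
pig_script=$(mktemp /tmp/largecity.XXXXXX)
cat > "$pig_script" <<'EOF'
city = LOAD 'world.city' USING org.apache.hive.hcatalog.pig.HCatLoader();
largecitytur = FILTER city BY countrycode == 'TUR' AND population > 1000000;
DUMP largecitytur;
EOF

# Run the script where pig is installed, otherwise just report where it is.
if command -v pig >/dev/null 2>&1; then
  pig -useHCatalog -f "$pig_script"
else
  echo "pig not found; script written to $pig_script"
fi
```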

Now let's load the other tables in the world database

```Pig
country = LOAD 'world.country' USING org.apache.hive.hcatalog.pig.HCatLoader();


lang = LOAD 'world.countrylanguage' USING org.apache.hive.hcatalog.pig.HCatLoader();
```

Now let's define our previous exercise in Hive as a Pig dataflow

First, filter the lang table for rows where English is an official language

```Pig
codeen = filter lang by language == 'English' and isofficial == true;
```

Then we join the filtered countrylanguage and country tables on the country codes

```Pig
joinen = JOIN country by code, codeen by countrycode;
```

And we select only the name field to be dumped

```Pig
names = foreach joinen generate name;
```

Now we can execute the plan to dump the names

```Pig
DUMP names;
```

```Pig
(Anguilla)
(American Samoa)
(Antigua and Barbuda)
(Australia)
(Belize)
(Bermuda)
(Barbados)
(Canada)
(Cocos (Keeling) Islands)
(Christmas Island)
(Cayman Islands)
(Falkland Islands)
(United Kingdom)
(Gibraltar)
(Guam)
(Hong Kong)
(Ireland)
(Saint Kitts and Nevis)
(Saint Lucia)
(Lesotho)
(Marshall Islands)
(Malta)
(Northern Mariana Islands)
(Montserrat)
(Norfolk Island)
(Niue)
(Nauru)
(New Zealand)
(Palau)
(Saint Helena)
(Seychelles)
(Turks and Caicos Islands)
(Tokelau)
(Tonga)
(Tuvalu)
(United States Minor Outlying Islands)
(United States)
(Saint Vincent and the Grenadines)
(Virgin Islands, British)
(Virgin Islands, U.S.)
(Vanuatu)
(Samoa)
(South Africa)
(Zimbabwe)
```

## PIG EXERCISE

Now as an exercise, we will compare the average life expectancy of the whole sample with that of the countries with English as an official language

To get a feel for calculating averages in Pig, below is the solution for the first part:

```Pig
-- filter for null values
countrynotnull = filter country by lifeexpectancy is not null;

-- get lifeexpectancy column

lifeall = foreach countrynotnull generate lifeexpectancy;

-- combine values into a single group 
lifeallg = group lifeall all;

-- calculate the average
avgall = foreach lifeallg generate AVG(lifeall);
-- execute
DUMP avgall;
```

The result is:
```Pig
(66.486036036036)
```

Now adapt the code in the previous example to get the lifeexpectancy values of the countries with English as an official language (there we extracted the name column; just change the column)

Then apply the steps above (you will have the life expectancies of anglophone countries instead of all countries; the rest is the same)

Note that the null elimination step is not strictly necessary, since AVG ignores null values; the result is the same

The result should be:
```Pig
(71.5027027027027)
```
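The null filter is optional because Pig's AVG skips null values. The arithmetic can be sketched with awk on made-up numbers, where an empty line stands in for a null:

```bash
# Three made-up values, one of them missing (empty line = null).
values=$(printf '70\n\n80\n')

# Average over the non-empty lines only, as AVG does with nulls.
printf '%s\n' "$values" | awk 'NF { sum += $1; n++ } END { print sum / n }'
# → 75

# Counting the empty line as a value would give a different answer.
printf '%s\n' "$values" | awk '{ sum += $1; n++ } END { print sum / n }'
# → 50
```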