# HADOOP ECOSYSTEM CONTINUED

First, let's start the daemons again:

First ssh:

In [1]:
sudo service ssh stop
sudo service ssh start

[sudo] password for s: 


In [None]:
service ssh status

And then hdfs:

In [None]:
start-dfs.sh 2>&1 | grep -Pv "^WARNING"

And yarn:

In [None]:
start-yarn.sh

Check whether hdfs and yarn daemons are running:

In [None]:
jps
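
Reading the jps listing by eye gets tedious, so here is a minimal sketch of a helper that reports which daemons are missing. The function name `missing_daemons` is made up for this notebook, and the list of expected daemons assumes a single-node setup where start-dfs.sh and start-yarn.sh were both run.

```bash
# Reads `jps` output on stdin and prints every expected daemon that is absent.
# Assumption: a single-node setup with these five daemons.
missing_daemons() {
  running=$(cat)   # capture the whole jps listing
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # jps prints "<pid> <ClassName>"; match the class name exactly
    printf '%s\n' "$running" | grep -Eq "^[0-9]+ ${daemon}\$" || echo "$daemon"
  done
}

# Demo with a made-up jps listing (PIDs are illustrative)
printf '101 NameNode\n102 DataNode\n103 SecondaryNameNode\n104 Jps\n' | missing_daemons
```

On the sandbox it would be used as `jps | missing_daemons`; an empty output means all five daemons are up.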

And postgresql:

In [None]:
sudo service postgresql start

In [None]:
psql -U postgres -c "\l"

## HIVE

Hive is a data warehouse system that wraps MapReduce behind an SQL interface. SQL programmers can therefore feel at home in the Hadoop Ecosystem, without writing MapReduce jobs explicitly

### IMPORT DATA WITH SQOOP

Sqoop can import the data as Hive tables with "--hive-import --create-hive-table" flags

In [None]:
sqoop import \
--connect jdbc:postgresql://localhost:5432/imdb2 \
--username postgres \
--table name_basics \
--hive-import --create-hive-table --direct \
2>&1 | grep -Pv "^(Warning|Please|WARNING)"
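
The last line of the command above merges stderr into stdout with "2>&1" and then drops the noisy lines with grep. Below is a small sketch of the same filter as a reusable function. The name `quiet` and the sample log lines are made up for illustration, and the sketch uses `-E` instead of `-P` because this pattern needs no Perl-specific syntax.

```bash
# Drops lines that start with Warning, Please or WARNING;
# every other line passes through unchanged.
quiet() {
  grep -Ev '^(Warning|Please|WARNING)'
}

# Demo on made-up log lines (not real Sqoop output)
printf 'Warning: something is not set\nPlease set SOME_HOME\nINFO import started\nWARNING: deprecated option\nRetrieved 10 records\n' | quiet
```

Only the "INFO import started" and "Retrieved 10 records" lines survive. With the function defined, a long command can end in `2>&1 | quiet`.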

And view the imported file:

In [None]:
hdfs dfs -ls /apps/hive/warehouse/

And let's check from hive whether files are imported:

In [None]:
hive -e 'show tables'

Let's try to read the file:

In [None]:
hdfs dfs -cat /apps/hive/warehouse/name_basics/* | head

If we want to delete the table, we do it from the hive command, not by manually deleting the file:

In [21]:
hive -e 'drop table name_basics'

In [1]:
hive -e 'show tables'

In [2]:
hdfs dfs -ls /apps/hive/warehouse/

And the name_basics directory is gone

Now let's import title_basics into hive

In [None]:
sqoop import \
--connect jdbc:postgresql://localhost:5432/imdb2 \
--username postgres \
--table title_basics \
--hive-import --create-hive-table --direct \
2>&1 | grep -Pv "^(Warning|Please|WARNING)"

Let's check whether they are imported into hive:

In [4]:
hive -e 'show tables'

In [5]:
hdfs dfs -ls /apps/hive/warehouse

### A SQOOP TO HIVE IMPORT EXERCISE

Now you'll:
- first import a publicly available sql dump into postgresql (inside sandbox-hdp)
- then import the postgresql database into hdfs as text files and hive tables

The database is "World", containing lists of cities, countries and languages. The link is

http://pgfoundry.org/frs/download.php/527/world-1.0.tar.gz

First download the file

In [None]:
wget http://pgfoundry.org/frs/download.php/527/world-1.0.tar.gz

And extract the archive:

In [None]:
tar -xzvf world-1.0.tar.gz

Create a new database at postgresql:

In [None]:
createdb -U postgres world

And import the dump into the database:

In [None]:
psql -U postgres world < world.sql

View the tables:

In [None]:
psql -U postgres -d world -c "\dt"

And view the fields of the tables:

In [None]:
psql -U postgres -d world -c "\d+ public.*"

And view some of the rows of the tables:

In [None]:
psql -U postgres -d world -c "select * from city limit 10"

In [None]:
psql -U postgres -d world -c "select * from country limit 10"

In [None]:
psql -U postgres -d world -c "select * from countrylanguage limit 10"

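And let's import the three tables into hive. "sqoop import" needs a "--table" flag, so we run it once per table:

In [None]:
for t in city country countrylanguage; do
  sqoop import \
  --connect jdbc:postgresql://localhost:5432/world \
  --username postgres \
  --table "$t" \
  --hive-import --create-hive-table --direct \
  2>&1 | grep -Pv "^(Warning|Please|WARNING)"
done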

### HIVE OPERATIONS

First show tables:

In [None]:
hive -e "show tables"

Then let's create a database from hive:

In [None]:
hive -e "create database deneme"

List the databases:

In [None]:
hive -e "show databases"

Delete deneme database:

In [None]:
hive -e "drop database deneme"

Now let's create a database to hold imdb tables:

In [None]:
hive -e "create database imdb"

Let's create a copy of the title_basics inside imdb database:

In [None]:
hive -e "create table imdb.title_basics
as select * from title_basics"

And show the tables inside imdb database:

In [None]:
hive -e "show tables in imdb"

And let's run a simple query inside a hive database:

In [None]:
hive -e "use imdb; select count(*) from title_basics;"

Now that we have a hive database on hdfs, similar to our postgresql database, we can run similar queries

For example, get the original title, start year and runtime minutes of the titles that:
- started in 1930 or earlier,
- run longer than 100 minutes,
- have Drama among their genres

and limit the output to the first 10 results

Note that PostgreSQL's tilde (~) regex-match operator is not available in HiveQL, whose dialect is closer to MySQL. For a partial match on a string field, use LIKE with % wildcards (or RLIKE for a regular expression); LIKE without wildcards matches only the whole string

In [None]:
hive -e "use imdb;

SELECT originaltitle, startyear, runtimeminutes
  FROM title_basics
  WHERE startyear <= 1930
  AND runtimeminutes > 100
  AND genres LIKE '%Drama%'
  LIMIT 10;"
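
LIKE only does a partial match when the pattern contains % wildcards; without them it compares the whole string. The difference can be tried locally with awk before running anything on the cluster. This is only a rough emulation of the WHERE clause: the rows are hypothetical, and the function names are made up for this sketch.

```bash
# Hypothetical rows: originaltitle<TAB>startyear<TAB>runtimeminutes<TAB>genres
sample_titles() {
  printf 'Film A\t1925\t120\tDrama\n'
  printf 'Film B\t1928\t110\tDrama,Romance\n'
  printf 'Film C\t1929\t130\tComedy\n'
  printf 'Film D\t1950\t140\tDrama\n'
  printf 'Film E\t1920\t90\tDrama\n'
}

# Emulates: genres LIKE 'Drama'  (no wildcards, so the whole string must match)
like_exact() {
  awk -F'\t' '$2 <= 1930 && $3 > 100 && $4 == "Drama" { print $1 }'
}

# Emulates: genres LIKE '%Drama%'  (substring match)
like_contains() {
  awk -F'\t' '$2 <= 1930 && $3 > 100 && index($4, "Drama") > 0 { print $1 }'
}

echo "exact:";    sample_titles | like_exact
echo "contains:"; sample_titles | like_contains
```

The exact version finds only Film A, while the substring version also picks up Film B, whose genres field is "Drama,Romance".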

In fact, under the hood, Hive converts the HiveQL query into a series of MapReduce jobs

The plan of the conversion can be viewed by prefixing the statement with "explain"

In [None]:
hive -e "use imdb;

explain
SELECT originaltitle, startyear, runtimeminutes
  FROM title_basics
  WHERE startyear <= 1930
  AND runtimeminutes > 100
  AND genres LIKE '%Drama%'
  LIMIT 10;"

Hive is suitable for simpler queries. However, as the queries get more complex and need a clearer definition of the dataflow, we should turn to a tool such as Pig

### A HIVE EXERCISE WITH WORLD DATABASE

In the previous sqoop exercise, you were required to import the tables from the "world" database into hive

Now please first create a "world" database in hive and copy the three tables into it

Note that in the last example we provided "use imdb;" to select the namespace

Tables that are not attached to a custom database live in the "default" database. To refer to them, run "use default;" first

Or you can refer to those tables as default.TABLENAME

In [None]:
hive -e "create database world;

use default;
create table world.city as select * from default.city;
create table world.country as select * from default.country;
create table world.countrylanguage as select * from default.countrylanguage;"
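
The three "create table ... as select" statements differ only in the table name, so they can also be generated with a small loop. This is just a sketch: `ctas_statements` is a made-up helper name, and it assumes every source table lives in the default database.

```bash
# Prints one CREATE TABLE ... AS SELECT statement per table.
# $1 = target hive database, remaining arguments = table names in "default"
ctas_statements() {
  db=$1
  shift
  for t in "$@"; do
    printf 'create table %s.%s as select * from default.%s;\n' "$db" "$t" "$t"
  done
}

# Show the generated HiveQL without running it
ctas_statements world city country countrylanguage
```

The output can be passed to hive directly, e.g. `hive -e "create database world; $(ctas_statements world city country countrylanguage)"`.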

See the tables in the world database:

In [None]:
hive -e "show tables in world"

In order to get information about a table, you can run:

In [None]:
hive -e "show create table world.city;"

The output shows the schema (column names and types), the file format and size information

You can repeat it for other tables in the database

Now let's run a query to see the average life expectancy of all countries:

In [None]:
hive -e "use world;
select avg(lifeexpectancy) from country;"

Now the next task is:

- get the names (from the country table) of the countries whose official languages (from the countrylanguage table) include English

You should get:

```mysql
OK
American Samoa
Anguilla
Antigua and Barbuda
Australia
Barbados
Belize
Bermuda
United Kingdom
Virgin Islands, British
Cayman Islands
South Africa
Falkland Islands
Gibraltar
Guam
Hong Kong
Ireland
Christmas Island
Canada
Cocos (Keeling) Islands
Lesotho
Malta
Marshall Islands
Montserrat
Nauru
Niue
Norfolk Island
Palau
Northern Mariana Islands
Saint Helena
Saint Kitts and Nevis
Saint Lucia
Saint Vincent and the Grenadines
Samoa
Seychelles
Tokelau
Tonga
Turks and Caicos Islands
Tuvalu
New Zealand
Vanuatu
United States
Virgin Islands, U.S.
Zimbabwe
United States Minor Outlying Islands
Time taken: 11.424 seconds, Fetched: 44 row(s)
```

In [None]:
hive -e "use world;
select c.name from country c join
countrylanguage l on c.code=l.countrycode
where l.isofficial = true
and l.language = 'English';"

## PIG

Pig is a platform for creating dataflows that run as MapReduce jobs; its scripting language is called Pig Latin

Where Hive queries are declarative, Pig scripts are procedural, so the steps of a dataflow are easier to define one by one

In these examples, we will use HCatalog to connect Pig to Hive databases. HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (such as Pig and MapReduce) to more easily read and write data on the grid

To enable pig with HCatalog:

```bash
pig -useHCatalog
```

We will write down the steps defining the workflow and the plan will be executed when we enter the DUMP command

As a simple example, let's load city table from world database in Hive

First we will list the steps and then run them as a single script in batch mode:

```Pig
city = LOAD 'world.city' USING org.apache.hive.hcatalog.pig.HCatLoader();
```

And let's select the cities where the countrycode is TUR

```Pig
cityturkey = filter city by countrycode == 'TUR';
```

And let's select the cities with a population larger than 1 million

```Pig
largecitytur = filter cityturkey by population > 1000000;
```

Now let's execute the plan:

```Pig
DUMP largecitytur;
```

In [None]:
pig -useHCatalog <<EOF
city = LOAD 'world.city' USING org.apache.hive.hcatalog.pig.HCatLoader();
cityturkey = filter city by countrycode == 'TUR';
largecitytur = filter cityturkey by population > 1000000;
DUMP largecitytur;
EOF
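
Instead of a heredoc, the same steps can be saved to a script file and handed to pig with the -f flag. Below is a minimal sketch; the helper name `write_pig_script` and the file name largecitytur.pig are made up for illustration.

```bash
# Writes the dataflow to the file given as $1.
# The quoted 'EOF' keeps the shell from expanding anything inside the script.
write_pig_script() {
  cat > "$1" <<'EOF'
city = LOAD 'world.city' USING org.apache.hive.hcatalog.pig.HCatLoader();
cityturkey = filter city by countrycode == 'TUR';
largecitytur = filter cityturkey by population > 1000000;
DUMP largecitytur;
EOF
}

write_pig_script largecitytur.pig   # hypothetical file name
cat largecitytur.pig                # review the script before running it
```

The script can then be run in batch mode with `pig -useHCatalog -f largecitytur.pig`. Keeping the dataflow in a file makes it easier to rerun and to version.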

Now let's load the other tables in the world database

```Pig
country = LOAD 'world.country' USING org.apache.hive.hcatalog.pig.HCatLoader();


lang = LOAD 'world.countrylanguage' USING org.apache.hive.hcatalog.pig.HCatLoader();
```

Now let's define our previous example in Hive as a Pig dataflow

First, filter the lang table for the rows where English is an official language

```Pig
codeen = filter lang by language == 'English' and isofficial == true;
```

Then we join the filtered countrylanguage and country tables on the country codes

```Pig
joinen = JOIN country by code, codeen by countrycode;
```

And we select only the name field to be dumped

```Pig
names = foreach joinen generate name;
```

Now we can execute the plan to dump the names

```Pig
DUMP names;
```

In [None]:
pig -useHCatalog <<EOF
country = LOAD 'world.country' USING org.apache.hive.hcatalog.pig.HCatLoader();
lang = LOAD 'world.countrylanguage' USING org.apache.hive.hcatalog.pig.HCatLoader();
codeen = filter lang by language == 'English' and isofficial == true;
joinen = JOIN country by code, codeen by countrycode;
names = foreach joinen generate name;
DUMP names;
EOF

### PIG EXERCISE

Now as an exercise, we will compare the life expectancy of the whole sample with the life expectancy of the countries that have English as an official language

To get a feel for calculating averages in Pig, below is the solution for the first part:

```Pig
-- filter for null values
countrynotnull = filter country by lifeexpectancy is not null;

-- get lifeexpectancy column

lifeall = foreach countrynotnull generate lifeexpectancy;

-- combine values into a single group 
lifeallg = group lifeall all;

-- calculate the average
avgall = foreach lifeallg generate AVG(lifeall);
-- execute
DUMP avgall;
```

The result is:
```Pig
(66.486036036036)
```

In [None]:
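pig -useHCatalog <<EOF
-- each pig run is a fresh session, so the relation must be loaded again
country = LOAD 'world.country' USING org.apache.hive.hcatalog.pig.HCatLoader();
countrynotnull = filter country by lifeexpectancy is not null;
lifeall = foreach countrynotnull generate lifeexpectancy;
lifeallg = group lifeall all;
avgall = foreach lifeallg generate AVG(lifeall);
DUMP avgall;
EOF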

Now adapt the code of the previous example to get the lifeexpectancy values of the countries with English as an official language (there we extracted the name column; just change the column)

Then apply the steps above (you will average the life expectancies of anglophone countries instead of all countries; the rest is the same)

Note that the null elimination step is not necessary: Pig's AVG ignores null values, so the result is the same

The result should be:
```Pig
(71.5027027027027)
```
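
Pig's AVG ignores null values, so filtering them out first does not change the average. The same idea can be tried locally with awk. This is only an illustration: the numbers are made up, an empty line stands in for a null, and `avg_skip_nulls` is a hypothetical helper name.

```bash
# Averages the numbers on stdin (one per line).
# Empty lines stand in for nulls and are skipped, so they do not
# count toward the sum or toward the number of values.
avg_skip_nulls() {
  awk 'NF { sum += $1; n++ } END { if (n) printf "%.2f\n", sum / n }'
}

# Two values and one "null": the average is taken over the two values only
printf '70\n\n80\n' | avg_skip_nulls   # prints 75.00
```

If the empty line were counted as a zero instead, the result would be 50.00, which is why skipping nulls matters for the comparison.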

In [None]:
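pig -useHCatalog <<EOF
    country = LOAD 'world.country' USING org.apache.hive.hcatalog.pig.HCatLoader();
    lang = LOAD 'world.countrylanguage' USING org.apache.hive.hcatalog.pig.HCatLoader();

    -- lifeexpectancy is a column of country, not of countrylanguage,
    -- so it cannot be part of the filter on lang
    codeen = filter lang by language == 'English' and isofficial == true;
    joinen = JOIN country by code, codeen by countrycode;
    lifeen = foreach joinen generate lifeexpectancy;

    lifeeng = group lifeen all;
    avgen = foreach lifeeng generate AVG(lifeen);

    DUMP avgen;
EOF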