# HADOOP ECOSYSTEM CONTINUED

Let's start the daemons again, beginning with ssh:

In [None]:
sudo service ssh stop
sudo service ssh start

In [None]:
service ssh status

And then hdfs:

In [None]:
start-dfs.sh 2>&1 | grep -Pv "^(WARNING|SLF4J)"
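
The `2>&1 | grep -Pv ...` suffix recurs in most cells below, so here is a minimal sketch of what it does. The `noisy_command` function and its messages are made up for illustration. The sketch uses `-E` instead of `-P`, because this particular pattern does not need Perl syntax:

```shell
# Made-up stand-in for a noisy Hadoop command: it writes to both stdout and stderr
noisy_command() {
    echo "WARNING: an example warning line" >&2
    echo "SLF4J: an example logger line" >&2
    echo "Starting namenodes on [localhost]"
    echo "Warning: mixed-case prefix is kept by this pattern"
}

# 2>&1 sends stderr into the pipe; -v keeps only the lines that do NOT match
filtered=$(noisy_command 2>&1 | grep -Ev "^(WARNING|SLF4J)")
echo "$filtered"
```

The match is case sensitive, which is why later cells extend the pattern with `Warning` and `Please`.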

And yarn:

In [None]:
start-yarn.sh

Check whether hdfs and yarn daemons are running:

In [None]:
jps
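
Reading the `jps` list by eye works, but the check can also be scripted. The sketch below assumes a single-node setup where the five daemons listed are expected. The daemon list and the sample output are assumptions; adjust them to your installation:

```shell
# Daemons expected on a single-node setup (assumption; adjust for your cluster)
expected_daemons="NameNode DataNode SecondaryNameNode ResourceManager NodeManager"

# Reads jps-style output ("<pid> <name>") on stdin and prints the missing daemons
missing_daemons() {
    running=$(awk '{ print $2 }')
    for d in $expected_daemons; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode
        printf '%s\n' "$running" | grep -qx "$d" || echo "$d"
    done
}

# Made-up sample; on the real machine use: jps | missing_daemons
sample_jps='4321 NameNode
4450 DataNode
4700 SecondaryNameNode
5012 ResourceManager
5500 Jps'
missing=$(printf '%s\n' "$sample_jps" | missing_daemons)
echo "missing: $missing"
```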

And postgresql:

In [None]:
sudo service postgresql start

In [None]:
psql -U postgres -d imdb2 -c "\dt+"

## HIVE

Hive is a data warehouse system that presents an SQL interface on top of MapReduce. SQL programmers can therefore work in the Hadoop ecosystem through Hive without writing MapReduce jobs explicitly

### IMPORT DATA WITH SQOOP

Sqoop can import a table directly into Hive with the "--hive-import --create-hive-table" flags. With "--create-hive-table", the job fails if the target Hive table already exists

In [None]:
sqoop import \
--connect jdbc:postgresql://localhost:5432/imdb2 \
--username postgres \
--table title_ratings \
--hive-import --create-hive-table --direct \
2>&1 | grep -Pv "^(Warning|Please|WARNING|SLF4J)"

And view the imported file:

In [None]:
hdfs dfs -find / -name "title*" 2>&1 | grep -Pv "^(Warning|Please|WARNING|SLF4J)"

In [None]:
hdfs dfs -ls /user/hive/warehouse/ 2>&1 | grep -Pv "^(Warning|Please|WARNING|SLF4J)"

And let's check from hive whether the table was imported:

In [None]:
hive -e 'show tables' 2>&1 | grep -Pv "^SLF4J"

Let's try to read the file:

In [None]:
hdfs dfs -cat /user/hive/warehouse/title_ratings/* 2>&1 | grep -Pv "^(Warning|Please|WARNING|SLF4J)" | head

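If we want to delete the table, we drop it with the hive command instead of deleting the files manually, so that the metastore entry and the files under the warehouse directory are removed together: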

In [None]:
hive -e 'drop table title_ratings' 2>&1 | grep -Pv "^SLF4J"

In [None]:
hive -e 'show tables' 2>&1 | grep -Pv "^SLF4J"

In [None]:
hdfs dfs -ls /user/hive/warehouse/ 2>&1 | grep -Pv "^(Warning|Please|WARNING|SLF4J)"

### A SQOOP TO HIVE IMPORT EXERCISE

Now you'll:
- first import a publicly available sql dump into postgresql
- then import the postgresql database into hdfs as text files and hive tables

The database is "World", which contains lists of cities, countries and languages. The link is

http://pgfoundry.org/frs/download.php/527/world-1.0.tar.gz

First download the file

In [None]:
wget http://pgfoundry.org/frs/download.php/527/world-1.0.tar.gz

And extract the archive:

In [None]:
tar -xzvf world-1.0.tar.gz

Create a new database in postgresql:

In [None]:
createdb -U postgres world

And import the dump into the database:

In [None]:
psql -U postgres world < dbsamples-0.1/world/world.sql

View the tables:

In [None]:
psql -U postgres -d world -c "\dt"

And view the fields of the tables:

In [None]:
psql -U postgres -d world -c "\d+ public.*"

And view some of the rows of the tables:

In [None]:
psql -U postgres -d world -c "select * from city limit 10"

In [None]:
psql -U postgres -d world -c "select * from country limit 10"

In [None]:
psql -U postgres -d world -c "select * from countrylanguage limit 10"

And let's import into hive:

First get the list of tables:

In [None]:
tables=$(psql -U postgres -d world -t --pset="border=0" -c "\dt" | \
awk -F " " '{ print $2 }')

In [None]:
echo "$tables"
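
To see why `awk` picks the second field, here is a minimal sketch that runs the same extraction on a made-up sample of `\dt` output. The exact layout psql prints is an assumption here. The `NF >= 2` guard is an addition that skips blank lines:

```shell
# Made-up sample of what `psql -t --pset="border=0" -c "\dt"` might print
sample_dt='public city            table postgres
public country         table postgres
public countrylanguage table postgres
'

# Same awk idea as in the cell above, plus NF >= 2 to skip blank lines
table_names() {
    awk 'NF >= 2 { print $2 }'
}

# A separate variable name is used so that $tables is left untouched
sample_tables=$(printf '%s\n' "$sample_dt" | table_names)
echo "$sample_tables"
```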

Create a hive database to import into:

In [None]:
hive -e "create database world" 2>&1 | grep -Pv "^SLF4J"

And import each table in the postgresql database into hive database:

In [None]:
echo "$tables" | while read l;
do
    sqoop import \
    --connect jdbc:postgresql://localhost:5432/world \
    --username postgres \
    --table $l \
    --hive-import \
    --create-hive-table \
    --hive-table world.$l \
    --direct \
    2>&1 | grep -Pv "^(Warning|Please|WARNING|SLF4J)"
done
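
Each Sqoop import takes a while, so it can help to print the commands before launching them. The following is a dry-run sketch only. The helper `build_import_cmd` and the variable `sample_tables` are hypothetical names, and the flags are copied from the cell above:

```shell
# Prints the sqoop command for one table instead of running it.
# Connection details are copied from the cell above.
build_import_cmd() {
    printf 'sqoop import --connect %s --username postgres --table %s --hive-import --create-hive-table --hive-table world.%s --direct\n' \
        "jdbc:postgresql://localhost:5432/world" "$1" "$1"
}

# Made-up stand-in for the $tables variable built earlier
sample_tables='city
country
countrylanguage'

# read -r keeps backslashes literal; the -n test skips empty lines
planned=$(printf '%s\n' "$sample_tables" | while read -r t; do
    [ -n "$t" ] || continue
    build_import_cmd "$t"
done)
echo "$planned"
```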

### HIVE OPERATIONS

Let's create a database from hive:

In [None]:
hive -e "create database deneme" 2>&1 | grep -Pv "^(Warning|Please|WARNING|SLF4J)"

List the databases:

In [None]:
hive -e "show databases" 2>&1 | grep -Pv "^(Warning|Please|WARNING|SLF4J)"

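Delete the deneme database (this works because it is empty; a database that still contains tables has to be dropped with "drop database ... cascade"):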

In [None]:
hive -e "drop database deneme" 2>&1 | grep -Pv "^(Warning|Please|WARNING|SLF4J)"

### A HIVE EXERCISE WITH WORLD DATABASE

See the tables in the world database:

In [None]:
hive -e "show tables in world" 2>&1 | grep -Pv "^SLF4J"

In order to get information about a table, you can run:

In [None]:
hive -e "show create table world.city;" 2>&1 | grep -Pv "^SLF4J"

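The output is the statement that would recreate the table: the schema (column names and types), the storage format, the location on hdfs, and table properties, which may include file count and size information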

You can repeat it for other tables in the database

Now that we have a hive database on hdfs, similar to our postgresql database, we can run similar queries

Now let's run a very simple query:

In [None]:
hive -e "use world;
select * from country limit 1;" 2>&1 | grep -Pv "^SLF4J"

Under the hood, Hive compiles a HiveQL query into an execution plan, typically a series of MapReduce jobs

The plan of the conversion can be viewed by prefixing the statement with "explain"

In [None]:
hive -e "use world;
explain
select * from country limit 1;" 2>&1 | grep -Pv "^SLF4J"

Hive is suitable for simpler queries. However, as the queries get more complex and need a clearer definition of the dataflow, we should turn to a tool such as Pig

Now let's run a query to see the average life expectancy of all countries:

In [None]:
hive -e "use world;
select avg(lifeexpectancy) from country;" 2>&1 | grep -Pv "^SLF4J"

Now the next task is:

- get the names (from the country table) of the countries whose official languages (from the countrylanguage table) include English

You should get:

```mysql
OK
American Samoa
Anguilla
Antigua and Barbuda
Australia
Barbados
Belize
Bermuda
United Kingdom
Virgin Islands, British
Cayman Islands
South Africa
Falkland Islands
Gibraltar
Guam
Hong Kong
Ireland
Christmas Island
Canada
Cocos (Keeling) Islands
Lesotho
Malta
Marshall Islands
Montserrat
Nauru
Niue
Norfolk Island
Palau
Northern Mariana Islands
Saint Helena
Saint Kitts and Nevis
Saint Lucia
Saint Vincent and the Grenadines
Samoa
Seychelles
Tokelau
Tonga
Turks and Caicos Islands
Tuvalu
New Zealand
Vanuatu
United States
Virgin Islands, U.S.
Zimbabwe
United States Minor Outlying Islands
Time taken: 11.424 seconds, Fetched: 44 row(s)
```

In [None]:
hive -e "use world;
select c.name from country c join
countrylanguage l on c.code=l.countrycode
where l.isofficial = true
and l.language = 'English';"  2>&1 | grep -Pv "^SLF4J"

## PIG

Pig is a scripting platform for creating workflows based on MapReduce; its language is called Pig Latin

Where Hive is declarative, Pig Latin is procedural: each statement defines one step of the dataflow, which makes multi-step pipelines easier to express

In these examples, we will use HCatalog to connect Pig to Hive databases. HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (Pig, MapReduce) to more easily read and write data on the grid

To enable pig with HCatalog:

```bash
pig -useHCatalog
```

We will write down the steps defining the workflow and the plan will be executed when we enter the DUMP command

As a simple example, let's load the city table from the world database in Hive

First we will list the steps and then run them as a single script in batch mode:

```Pig
city = LOAD 'world.city' USING org.apache.hive.hcatalog.pig.HCatLoader();
```

And let's select the cities where the countrycode is TUR

```Pig
cityturkey = filter city by countrycode == 'TUR';
```

And let's select the cities with a population larger than 1 million

```Pig
largecitytur = filter cityturkey by population > 1000000;
```

Now let's execute the plan:

```Pig
DUMP largecitytur;
```

In [None]:
echo $HADOOP_CLASSPATH

In [None]:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$PIG_HOME/lib:/opt/pig/lib/hadoop2-runtime

In [None]:
export PIG_CLASSPATH=$PIG_HOME/lib

In [None]:
export HCAT_HOME=$HIVE_HOME/hcatalog

In [None]:
pig -useHCatalog <<EOF
city = LOAD 'world.city' USING org.apache.hive.hcatalog.pig.HCatLoader();
cityturkey = filter city by countrycode == 'TUR';
largecitytur = filter cityturkey by population > 1000000;
DUMP largecitytur;
EOF

Now let's import other tables in the world database

```Pig
country = LOAD 'world.country' USING org.apache.hive.hcatalog.pig.HCatLoader();


lang = LOAD 'world.countrylanguage' USING org.apache.hive.hcatalog.pig.HCatLoader();
```

Now let's define our previous example in Hive as a Pig dataflow

First, filter the lang table for the rows where English is an official language

```Pig
codeen = filter lang by language == 'English' and isofficial == true;
```

Then we join the filtered countrylanguage and country tables on the country codes

```Pig
joinen = JOIN country by code, codeen by countrycode;
```

And we select only the name field to be dumped

```Pig
names = foreach joinen generate name;
```

Now we can execute the plan to dump the names

```Pig
DUMP names;
```

In [None]:
pig -useHCatalog <<EOF
country = LOAD 'world.country' USING org.apache.hive.hcatalog.pig.HCatLoader();
lang = LOAD 'world.countrylanguage' USING org.apache.hive.hcatalog.pig.HCatLoader();
codeen = filter lang by language == 'English' and isofficial == true;
joinen = JOIN country by code, codeen by countrycode;
names = foreach joinen generate name;
DUMP names;
EOF
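
The filter, join and project steps of this dataflow can also be tried locally without a cluster. The sketch below mimics them with `awk` on two tiny tab-separated files. The rows are made up for illustration and are not taken from the real tables. Note that the `awk` lookup keeps each country at most once, which matches the Pig JOIN only as long as each country has a single English row:

```shell
# Hypothetical miniature extracts of the two tables (tab-separated, made-up rows)
workdir=$(mktemp -d)
printf 'GBR\tUnited Kingdom\nTUR\tTurkey\nCAN\tCanada\n' > "$workdir/country.tsv"
printf 'GBR\tEnglish\ttrue\nTUR\tTurkish\ttrue\nTUR\tEnglish\tfalse\nCAN\tEnglish\ttrue\nCAN\tFrench\ttrue\n' > "$workdir/lang.tsv"

# Step 1 (filter): codes of the countries where English is an official language
awk -F '\t' '$2 == "English" && $3 == "true" { print $1 }' \
    "$workdir/lang.tsv" > "$workdir/codeen.txt"

# Steps 2 and 3 (join, then project): keep country rows whose code was selected
# and print only the name column
names=$(awk -F '\t' 'NR == FNR { keep[$1] = 1; next } $1 in keep { print $2 }' \
    "$workdir/codeen.txt" "$workdir/country.tsv")

echo "$names"
rm -r "$workdir"
```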

### PIG EXERCISE

Now as an exercise, we will compare the life expectancy of the whole sample with the life expectancy of the countries that have English as an official language

To get a feel of calculating averages in Pig, below is the solution for the first part:

```Pig
-- filter for null values
countrynotnull = filter country by lifeexpectancy is not null;

-- get lifeexpectancy column

lifeall = foreach countrynotnull generate lifeexpectancy;

-- combine values into a single group 
lifeallg = group lifeall all;

-- calculate the average
avgall = foreach lifeallg generate AVG(lifeall);
-- execute
DUMP avgall;
```

The result is:
```Pig
(66.486036036036)
```

In [None]:
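pig -useHCatalog <<EOF
-- each pig invocation is a new session, so the table has to be loaded again
country = LOAD 'world.country' USING org.apache.hive.hcatalog.pig.HCatLoader();
countrynotnull = filter country by lifeexpectancy is not null;
lifeall = foreach countrynotnull generate lifeexpectancy;
lifeallg = group lifeall all;
avgall = foreach lifeallg generate AVG(lifeall);
DUMP avgall;
EOF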

Now adapt the code from the previous example to get the lifeexpectancy values of the countries where English is an official language (there we extracted the name column; just change the column)

Then apply the steps above (you will have the life expectancies of anglophone countries instead of all countries; the rest is the same)

Note that the null elimination step is not necessary; the result is the same

The result should be:
```Pig
(71.5027027027027)
```
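
The note above, that null elimination does not change the result, is consistent with AVG skipping null values. Here is a minimal local sketch of that behaviour with `awk`, where an empty line stands in for a null. The values are made up, and `avg_ignore_null` is a hypothetical helper:

```shell
# Made-up life expectancy column; the empty line stands in for a NULL
sample_values='70.0

80.0
75.0'

# Average over the non-empty values only, mirroring an AVG that skips nulls
avg_ignore_null() {
    awk 'NF > 0 { sum += $1; n++ } END { if (n > 0) printf "%.2f\n", sum / n }'
}

# With and without an explicit "filter out nulls" step
avg_all=$(printf '%s\n' "$sample_values" | avg_ignore_null)
avg_filtered=$(printf '%s\n' "$sample_values" | grep -v '^$' | avg_ignore_null)
echo "$avg_all $avg_filtered"
```

Both averages are computed over the same three values, so the explicit filter makes no difference.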

In [None]:
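pig -useHCatalog <<EOF
    country = LOAD 'world.country' USING org.apache.hive.hcatalog.pig.HCatLoader();
    lang = LOAD 'world.countrylanguage' USING org.apache.hive.hcatalog.pig.HCatLoader();
    -- lifeexpectancy is a field of country, so it cannot be tested in the lang filter
    codeen = filter lang by language == 'English' and isofficial == true;
    joinen = JOIN country by code, codeen by countrycode;
    lifeen = foreach joinen generate lifeexpectancy;

    lifeeng = group lifeen all;
    avgen = foreach lifeeng generate AVG(lifeen);

    DUMP avgen;
EOF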