# Part 1 - Introduction to Hive
Hive provides an interface for querying and managing large datasets in HDFS. Hive can be used to project structure onto large datasets already residing in distributed storage. Once a structure is defined, we can use largely the same query language we used against a MySQL database (SQL; Hive's dialect is called HiveQL) to generate insights from the existing data.

One of the benefits of Hive is that it hides the complexity of large-scale parallel processing on datasets. Users can write queries in a familiar language (SQL) that they may already have used on smaller-scale data on local machines, and Hive will run the same query across (potentially) thousands of machines and terabytes of data.

In this prac we will look at applying structure to the HR Dataset flat file we uploaded to HDFS in prac 2 so that we can query it using SQL.

## Checking the file in Hadoop
We will first need to SSH into the data7001 node to access the Hive and Hadoop command line interface.

As in Prac 2, use the Terminal program (or PuTTY on Windows) to `SSH` into `clientnode.zones.eait.uq.edu.au`, using your sXXXXXX username as the password. Remember that if you are not on the UQ network, you need to access `remote.labs.eait.uq.edu.au` first and then access the clientnode.

To verify if the `HR_comma_sep.csv` file is in your HDFS folder run the following command:
```
hadoop fs -ls
```
You should get output similar to the following:
```
Found 4 items
drwx------   - sXXXXXXX hadoop          0 2018-08-01 12:00 .Trash
-rw-r--r--   3 sXXXXXXX hadoop     566778 2018-08-29 03:05 HR_comma_sep.csv
-rw-r--r--   3 hdfs     hadoop         17 2018-07-26 00:42 SECRET
-rw-r--r--   3 sXXXXXXX hadoop         19 2018-08-08 04:14 example.txt
```

**If you *don't* see `HR_comma_sep.csv`, run the 2 commands below to get the file from the UQ cloud and then push it to Hadoop:**

```
wget https://stluc.manta.uqcloud.net/mdatascience/public/datasets/HumanResourceAnalytics/HR_comma_sep.csv
hadoop fs -put HR_comma_sep.csv
```
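The check and the fallback download above can also be combined into one step. The following is a minimal sketch, not part of the original prac: the function name `ensure_hr_file` and the two variables are made up for illustration, and it assumes the `hadoop` and `wget` commands are available on the client node.

```shell
#!/bin/sh
# Sketch: upload HR_comma_sep.csv to your HDFS home directory only if it is missing.
# Assumption: `hadoop` and `wget` are on the PATH, as on the client node.
DATA_URL="https://stluc.manta.uqcloud.net/mdatascience/public/datasets/HumanResourceAnalytics/HR_comma_sep.csv"
DATA_FILE="HR_comma_sep.csv"

# ensure_hr_file is a hypothetical helper name, not a Hadoop command.
ensure_hr_file() {
  # `hadoop fs -test -e PATH` exits with status 0 when PATH exists in HDFS.
  if hadoop fs -test -e "$DATA_FILE"; then
    echo "$DATA_FILE is already in HDFS"
  else
    # Download the file locally, then push it to HDFS (same two steps as above).
    wget -q "$DATA_URL" -O "$DATA_FILE" &&
      hadoop fs -put "$DATA_FILE" &&
      echo "$DATA_FILE uploaded to HDFS"
  fi
}
```

To use it, paste the block into your shell on the client node and then run `ensure_hr_file`.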

We are now ready to start Hive.

## Starting Hive

Once you are logged in to `clientnode.zones.eait.uq.edu.au`, you can start Hive by typing `hive` at the command line prompt. It may take a few seconds to initialise. When it's ready to receive commands, you should see a prompt like the following:

```
hive>
```

Hive allows you to think of your files in HDFS as a database, and to query them much as you would in MySQL. In order to create new tables, you will first need to connect to your database. Connect to your database using the following command, where `sXXXXXX` is replaced with your student number.

```
hive> use sXXXXXXX;
```

If this worked, Hive should return an "OK" message along with how long the statement took to run.

Now that we are working within the correct database, the next step is to create a Hive table definition for the HR Dataset which we previously pushed into HDFS. First we need to create the table using the following syntax:

```
create table hr (
satisfaction_level float,
last_evaluation float,
number_project int,
average_monthly_hours int,
time_spend_company int,
Work_accident int,
left_job int,
promotion_last_5years int,
sales string,
salary string
)
row format delimited
fields terminated by ','
lines terminated by '\n';
```
Let's dissect the command:
- `create table hr` creates a Hive table definition called `hr`, with the columns and data types listed in the parentheses that follow.
- `row format delimited` tells Hive that each row of the underlying data is stored as delimited text, using the delimiters given in the clauses that follow.
- `fields terminated by ','` tells Hive that the columns in the underlying data are separated by commas.
- `lines terminated by '\n'` tells Hive that the rows in the underlying data are separated by newline characters.
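To make the delimiter clauses concrete, here is a small sketch you can run in the ordinary shell (not at the `hive>` prompt). It is only an analogy for how a comma-delimited line maps onto the ten columns, not how Hive works internally. The sample row is made up for illustration, and `split_row` is a hypothetical helper name.

```shell
#!/bin/sh
# Column names copied from the `create table hr` statement above.
columns="satisfaction_level,last_evaluation,number_project,average_monthly_hours,time_spend_company,Work_accident,left_job,promotion_last_5years,sales,salary"

# Hypothetical sample row, made up for illustration (not read from the real file).
row="0.38,0.53,2,157,3,0,1,0,sales,low"

# Split the column list and one data line on ',' and pair each name with its value.
split_row() {
  printf '%s\n%s\n' "$columns" "$1" | awk -F',' '
    NR == 1 { split($0, names, ","); next }   # first line: remember the column names
    { for (i = 1; i <= NF; i++) printf "%s = %s\n", names[i], $i }
  '
}

split_row "$row"
```

Each line of output pairs one column name with one value, for example `number_project = 2`. A row with a different number of commas would not line up with the ten declared columns.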

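Once we've created the table, we can load our HR Analytics dataset from HDFS into the `hr` table using the following command, replacing sXXXXXX with your student number. Note that `LOAD DATA INPATH` moves the file from its current HDFS location into the table's storage directory, so it will no longer be listed in your HDFS home folder afterwards.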

```
LOAD DATA INPATH '/user/sXXXXXX/HR_comma_sep.csv' OVERWRITE INTO TABLE hr;
```


Now that we have created a Hive table definition called `hr`, we can query the table just as we did in phpMyAdmin, using SQL. As an interface, Hive is very similar to a traditional database. However, instead of relying on a single database backend, Hive can spread our queries over many machines. Although this is not obvious with a dataset of this size, with a dataset several terabytes in size you would see a significant performance gain over a traditional database, because Hive is backed by a distributed file system and its queries can be distributed across several machines.

For a reference on Hive's built-in operators and functions, see the official documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Answer the following questions and **include the SQL query used to determine the answer**.

Unlike in phpMyAdmin, you will need to end each statement with a `;`.
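Statements do not have to be typed at the `hive>` prompt. The Hive command line client also accepts a query string through its `-e` option, which can be handy for keeping a record of the queries you used. The sketch below assumes that option is available on the client node. `run_hr_query` is a hypothetical helper name, and `sXXXXXXX` is a placeholder for your own database name.

```shell
#!/bin/sh
# Placeholder: replace with your own database name (your student number).
STUDENT_DB="sXXXXXXX"

# run_hr_query is a hypothetical helper: it selects your database, then runs
# the statement passed as the first argument via `hive -e`.
run_hr_query() {
  hive -e "use ${STUDENT_DB}; $1"
}
```

For example, `run_hr_query 'select count(*) from hr;'` runs from the ordinary shell and prints the result without opening the interactive prompt.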

|<center>TASK</center>|
| ---- |
| How many entries were there in the HR dataset? |

Correct answer is: 15000 entries

`SELECT COUNT(*) FROM hr;`

|<center>TASK</center>|
| ---- |
| What was the average number of monthly hours? |

Correct answer is: 201.05 hours

`SELECT AVG(average_monthly_hours) FROM hr;`

|<center>TASK</center>|
| ---- |
| What was the average number of monthly hours by those who left their job? |

Correct answer is: 207.41921030523662

`SELECT AVG(average_monthly_hours) FROM hr WHERE left_job = 1;`

Unlike in MySQL, some of your queries were submitted as *jobs* and may have taken a few seconds to process. Jobs are a common way of describing how processing is submitted to large distributed systems. Unlike your phpMyAdmin database, which was used only by you, large distributed systems are often shared, and processing may take hours or even months. This means that a job often has to be submitted, picked up by some sort of job management process, placed in a job queue, and directed to the appropriate node(s); its processing may be distributed across multiple jobs, and the results have to be collated across multiple nodes (among many other things!). You can see why this complexity adds time, especially since many of these interactions happen over a network. The benefits of a distributed system only become apparent when your dataset is big enough to warrant using one. For our small HR Dataset, it is actually much faster to use R or MySQL!
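As a rough illustration of how little machinery a small file needs, the same kind of aggregate can be computed locally with `awk`. This is a sketch under assumptions: the three data rows are made up, the header reuses the column names from the `hr` table definition, and `avg_hours_left` is a hypothetical helper name. It assumes the input has one header row, which is skipped.

```shell
#!/bin/sh
# Tiny made-up sample in the same column order as the hr table (header row first).
sample_rows='satisfaction_level,last_evaluation,number_project,average_monthly_hours,time_spend_company,Work_accident,left_job,promotion_last_5years,sales,salary
0.38,0.53,2,157,3,0,1,0,sales,low
0.80,0.86,5,262,6,0,1,0,sales,medium
0.58,0.74,4,215,3,0,0,0,technical,low'

# Rough local equivalent of:
#   SELECT AVG(average_monthly_hours) FROM hr WHERE left_job = 1;
# Column 4 holds the monthly hours and column 7 holds the left_job flag.
avg_hours_left() {
  awk -F',' '
    NR > 1 && $7 == 1 { sum += $4; n++ }          # skip header, keep leavers only
    END { if (n > 0) printf "%.2f\n", sum / n }   # average over the matching rows
  '
}

printf '%s\n' "$sample_rows" | avg_hours_left   # prints 209.50
```

On a local copy of the real file, the same function could be fed with `avg_hours_left < HR_comma_sep.csv`. No job is submitted and nothing crosses the network, which is why small datasets are quicker to handle this way.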