## Demonstration notebook for Hive assignments

To run a HiveQL query in the notebook:
1. write the query to a separate file using the `%%writefile [-a] <file>` magic (`-a` appends to the file instead of overwriting it),
2. execute this file in Hive using the `! hive -f <file>` command.

For the grading system to check a task correctly, the execution command must be in a separate cell.

### 1. Creating the database

First, create your Hive database. You can name the database whatever you want.

Drop the database first if it already exists.

In [1]:
%%writefile creation_db.hql

DROP DATABASE IF EXISTS demodb CASCADE;

Overwriting creation_db.hql


And now create it.

In [2]:
%%writefile -a creation_db.hql
CREATE DATABASE demodb LOCATION '/user/jovyan/somemetastore';

Appending to creation_db.hql


Finally, execute the file we have just written.

In [3]:
! hive -f creation_db.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
OK
Time taken: 2.993 seconds
OK
Time taken: 0.332 seconds


On the real Hadoop cluster where your submission is checked, Hive databases are precreated for all users. This avoids database name conflicts. If you are a new user, your database is created during your first Hive assignment submission. The system does not allow you to create your own database on the cluster, so before submitting the final version of a task you should **remove or comment out** all lines that drop or create a database.

You can leave all `USE` lines unchanged. The grading system replaces the database name with the name of your precreated database. In assignments 2 and 3 you'll need to use the `stackoverflow_` database. The grading system does not change this database's name.
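Commenting out the database lines before submission can be scripted instead of done by hand. The following is a minimal Python sketch, assuming each `DROP DATABASE` or `CREATE DATABASE` statement fits on a single line, as it does in this notebook. The helper name `comment_out_db_statements` is hypothetical and is not part of the grading system.

```python
import re

# Matches single-line DROP DATABASE / CREATE DATABASE statements (case-insensitive).
# Assumption: one statement per line; multi-line statements are not handled.
DB_STATEMENT = re.compile(r'^\s*(DROP|CREATE)\s+DATABASE\b', re.IGNORECASE)

def comment_out_db_statements(hql_text):
    """Prefix every DROP/CREATE DATABASE line with the HiveQL comment marker '-- '."""
    lines = []
    for line in hql_text.splitlines():
        if DB_STATEMENT.match(line):
            line = '-- ' + line
        lines.append(line)
    return '\n'.join(lines)

script = """DROP DATABASE IF EXISTS demodb CASCADE;
CREATE DATABASE demodb LOCATION '/user/jovyan/somemetastore';
USE demodb;"""

result = comment_out_db_statements(script)
print(result)
```

The `USE demodb;` line is left untouched, which matches the rule that `USE` lines can stay as they are.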

### 2. Creating the external table

In [None]:
%%writefile external_table.hql

ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;

USE demodb;
DROP TABLE IF EXISTS posts_sample_external;
DROP TABLE IF EXISTS posts_sample;
CREATE EXTERNAL TABLE posts_sample_external (
id int,
posttypeid int,
creationdate string,
tags string,
owneruserid int,
parentid int,
score int,
favoritecount int
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '\<row (?=.*\\bId=\"(\\d+))(?=.*\\bPostTypeId=\"(\\d+))(?=.*\\bCreationDate=\"([0-9-T:.]+)\")(?=.*\\bTags=\"(\\S+)\")?(?=.*\\bOwnerUserId=\"(\\d+))?(?=.*\\bParentId=\"(\\d+))?(?=.*\\bScore=\"(\\d+))(?=.*\\bFavoriteCount=\"(\\d+))?.*$'
)
LOCATION '/data/stackexchange1000/posts/';

CREATE EXTERNAL TABLE posts_sample (
id int,
posttypeid int,
creationdate string,
tags string,
owneruserid int,
parentid int,
score int,
favoritecount int
)
PARTITIONED BY (year STRING,month STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;



In [None]:
! hive -f external_table.hql
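
The `input.regex` above relies on lookaheads so that attributes can be captured regardless of the order in which they appear in a row. The sketch below illustrates that idea in Python with a reduced pattern covering only three attributes. It is an analogue for experimenting in the notebook, not the Java regex that Hive executes: optional attributes are written as `(?:...|)` here instead of `(...)?`, and the sample rows are made up.

```python
import re

# Reduced Python analogue of the table's "input.regex" (three attributes only).
# Every lookahead starts scanning at the same position, right after "<row ",
# so the order of the attributes in the line does not matter.
# The "(?:...|)" wrapper makes ParentId optional: its group is None when absent.
ROW_PATTERN = re.compile(
    r'\s*<row '
    r'(?=.*\bId="(\d+))'
    r'(?=.*\bPostTypeId="(\d+))'
    r'(?:(?=.*\bParentId="(\d+))|)'
    r'.*$'
)

def parse_row(line):
    """Return (id, posttypeid, parentid) as strings or None; None if the line does not match."""
    match = ROW_PATTERN.fullmatch(line)
    return match.groups() if match else None

# Made-up sample rows, not taken from the real dataset.
question = '  <row Id="4" PostTypeId="1" Score="10" />'
answer = '  <row PostTypeId="2" ParentId="4" Id="7" Score="3" />'

print(parse_row(question))  # ('4', '1', None)
print(parse_row(answer))    # ('7', '2', '4')
```

The `\b` word boundary is what keeps `Id=` from matching inside `PostTypeId=` or `ParentId=`, which is why a missing backslash in front of `b` would silently break a column.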

In [None]:
%%writefile external_table2.hql
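USE demodb;
-- Count posts per (year, month) partition and number the rows chronologically.
WITH monthly AS (
    SELECT year, month, COUNT(*) AS cnt FROM posts_sample GROUP BY year, month
)
SELECT year, month, cnt, ROW_NUMBER() OVER (ORDER BY year, month) AS row_num FROM monthly;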
