### A. Data preparation

1. Copy the file `/user/pascepet/data/teplota-usa.zip` from HDFS to your home directory on the local file system.
1. Unzip the file `teplota-usa.zip`. (Hint: try the Linux `unzip` command.)
1. Create a new subdirectory in your HDFS user directory (e.g. `hive_data`).
1. Copy all unzipped files to the HDFS subdirectory you have just created.

### B. Database preparation

1. Start the Hive command-line client (Beeline) using the command  
`beeline -u "jdbc:hive2://hador-c1.ics.muni.cz:10000/default;principal=hive/hador-c1.ics.muni.cz@ICS.MUNI.CZ"`
2. Create your database (if it does not exist yet).
3. Switch to your database so that it becomes the working database.
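
Steps 2 and 3 could look like the following sketch. The database name `username` is a placeholder chosen to match the warehouse path used later in this exercise; substitute your own login.

```SQL
-- Assumption: `username` stands for your own database name.
CREATE DATABASE IF NOT EXISTS username;

-- Make it the working database for the rest of the session.
USE username;

-- Optional check: show which database is currently in use.
SELECT current_database();
```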


### C. External and internal table

1. Create an external table named `temperature_tmp`. The CSV files are comma-separated (`,`) and contain a header row.

| Column name | Data type |
|:------------|:----------|
| stanice     | string    |
| mesic       | int       |
| den         | int       |
| hodina      | int       |
| teplota     | double    |
| flag        | string    |
| latitude    | double    |
| longitude   | double    |
| vyska       | double    |
| stat        | string    |
| nazev       | string    |

2. Create an internal table `temperature`. Store it as Parquet with the Snappy compression codec.

3. Insert data into the internal table from the external one. Convert the temperature from tenths of a degree Fahrenheit to degrees Celsius using the formula $(\frac{F}{10} - 32) \times \frac{5}{9}$.

4. Drop the external table. Then check that the data files are still in your HDFS subdirectory.
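
As a quick sanity check of the conversion in step 3, take a hypothetical raw value of 752, meaning 75.2 °F (this number is illustrative and not taken from the data set):

$$
\left(\frac{752}{10} - 32\right) \times \frac{5}{9} = (75.2 - 32) \times \frac{5}{9} = 43.2 \times \frac{5}{9} = 24.0
$$

so the value stored in `temperature` would be 24.0 °C.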

```SQL
create external table temperature_tmp (
	stanice string,
	mesic int,
	den int,
	hodina int,
	teplota double,
	flag string,
	latitude double,
	longitude double,
	vyska double,
	stat string,
	nazev string)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "/user/username/hive_data"
tblproperties ("skip.header.line.count"="1");

CREATE TABLE IF NOT EXISTS temperature (
  stanice string,
  mesic int,
  den int,
  hodina int,
  teplota double,
  flag string,
  latitude double,
  longitude double,
  vyska double,
  stat string,
  nazev string
)
STORED AS parquet
tblproperties('parquet.compression'='SNAPPY');

INSERT OVERWRITE TABLE temperature
SELECT
  stanice,
  mesic,
  den,
  hodina,
  ((teplota  / 10) - 32) * 5/9,
  flag,
  latitude,
  longitude,
  vyska,
  stat,
  nazev
FROM temperature_tmp where mesic is not NULL;

drop table temperature_tmp;
```

```hdfs dfs -ls /user/username/hive_data/```

### D. Querying

1. Find the state with the highest average temperature in summer (months 6, 7, 8).
1. For each month, find the state with the highest average temperature in that month.

```SQL
-- State with the highest average summer temperature
SELECT
  stat,
  avg(teplota) AS avg_teplota
FROM temperature
WHERE mesic IN (6, 7, 8)
GROUP BY stat
ORDER BY avg_teplota DESC
LIMIT 1;

SELECT mesic, stat, avg_teplota
FROM (SELECT
        stat,
        mesic,
        avg_teplota,
        RANK() OVER (PARTITION BY mesic ORDER BY avg_teplota DESC) AS r
      FROM (
             SELECT
               avg(teplota) AS avg_teplota,
               stat,
               mesic
             FROM temperature
             GROUP BY stat, mesic) mesic_avg) mesic_rank
WHERE r = 1;
```

### E. Create internal partitioned table

1. Create a table `temperature_part` with the same columns as the `temperature` table, but partitioned by month.
1. Insert data into the partitioned table.
1. Inspect the partition directories on HDFS (under `/user/hive/warehouse/username.db/`).

To enable dynamic partitioning, execute these commands:

```
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
```


```SQL
CREATE TABLE IF NOT EXISTS temperature_part (
  stanice string,
  den int,
  hodina int,
  teplota double,
  flag string,
  latitude double,
  longitude double,
  vyska double,
  stat string,
  nazev string
)
partitioned by (mesic int)
STORED AS parquet
tblproperties('parquet.compression'='SNAPPY');

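-- temperature_tmp was dropped in section C, so load from `temperature` instead
INSERT OVERWRITE TABLE temperature_part partition (mesic)
SELECT
  stanice,
  den,
  hodina,
  teplota,  -- already converted to Celsius in `temperature`
  flag,
  latitude,
  longitude,
  vyska,
  stat,
  nazev,
  mesic     -- the dynamic partition column must come last
FROM temperature;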
```
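
After the insert finishes, the partitioning can also be checked from Hive itself. A minimal sketch, assuming `temperature_part` was created and loaded as above:

```SQL
-- List the partitions Hive created; one entry per month is expected (e.g. mesic=1).
SHOW PARTITIONS temperature_part;

-- A filter on the partition column lets Hive read only the matching partition directory.
SELECT stat, avg(teplota) AS avg_teplota
FROM temperature_part
WHERE mesic = 7
GROUP BY stat
ORDER BY avg_teplota DESC
LIMIT 1;
```

Each partition corresponds to a subdirectory such as `mesic=7` inside the table's directory in the warehouse.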