Copy the data from the HDFS folder `/user/pascepet/teplota-usa` to a folder in your HDFS home directory, `/user/username/teplota` (for example with `hdfs dfs -cp`).

Start the Hive CLI (Beeline) with the command:

`beeline -u "jdbc:hive2://hador-c1.ics.muni.cz:10000/default;principal=hive/hador-c1.ics.muni.cz@ICS.MUNI.CZ"`

1 Create external table

- Create your database (if it does not exist yet)
- Make your database the working database
- Create an external table named `temperature_tmp` with the columns below; the CSV file uses "," as the field separator

| Column name | Data type |
|:------------|:----------|
| stanice     | string    |
| mesic       | int       |
| den         | int       |
| hodina      | int       |
| teplota     | double    |
| flag        | string    |
| latitude    | double    |
| longitude   | double    |
| vyska       | double    |
| stat        | string    |
| nazev       | string    |
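
The first two bullets can be sketched as follows. The database name `username` is an assumption (it matches the warehouse path `username.db` used later in this exercise); substitute your own login.

```SQL
-- Assumption: the database is named after your user name.
CREATE DATABASE IF NOT EXISTS username;
USE username;
```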

```SQL
create external table temperature_tmp (
	stanice string,
	mesic int,
	den int,
	hodina int,
	teplota double,
	flag string,
	latitude double,
	longitude double,
	vyska double,
	stat string,
	nazev string)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "/user/username/teplota";
```

2 Create internal table

- Create an internal table named `temperature`, stored as Parquet with the Snappy compression codec
- Insert the data into the internal table. Convert the temperature from tenths of a degree Fahrenheit to degrees Celsius using the formula $ (\frac{F}{10} - 32) \times \frac{5}{9} $, where $F$ is the stored value
- Drop the external table
- Check that the data files are still on HDFS (`hdfs:///user/username/teplota/`)
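
As a quick sanity check of the conversion formula, a stored value of 770 (that is, 77.0 °F) should convert to 25 °C. The literal 770 is an illustrative input, not a value taken from the data set.

```SQL
-- (770 / 10 - 32) * 5 / 9 = 45 * 5 / 9 = 25
SELECT ((770 / 10) - 32) * 5 / 9 AS celsius;
```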

### Create internal table

```SQL

CREATE TABLE IF NOT EXISTS temperature (
  stanice string,
  mesic int,
  den int,
  hodina int,
  teplota double,
  flag string,
  latitude double,
  longitude double,
  vyska double,
  stat string,
  nazev string
)
STORED AS parquet
tblproperties('parquet.compression'='SNAPPY');
```

### Insert data 

```SQL
INSERT OVERWRITE TABLE temperature
SELECT
  stanice,
  mesic,
  den,
  hodina,
  ((teplota  / 10) - 32) * 5/9,
  flag,
  latitude,
  longitude,
  vyska,
  stat,
  nazev
FROM temperature_tmp where mesic is not NULL;

```

### Drop external table

```SQL
drop table temperature_tmp;
```

### Check files

```bash
hdfs dfs -ls /user/username/teplota/
```

3 Find the state with the highest average temperature in summer (months 6, 7, 8)


| State | AVG_TEMP |
|:------|:---------|
|       |          |


```SQL

SELECT
  sub.stat,
  avg(sub.teplota) as avg_teplota
FROM (
  SELECT
    teplota,
    stat
  FROM temperature
  WHERE mesic in (6, 7, 8)) sub
GROUP BY sub.stat
ORDER BY avg_teplota DESC
limit 1;

```

4 Create internal partitioned table

- Create a table partitioned by month that uses Snappy compression
- Insert the data into the partitioned table
- Inspect the partition folders on HDFS (`/user/hive/warehouse/username.db/`)

To enable dynamic partitioning, execute these commands:

```
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
```


### Create partitioned table

```SQL

-- Internal (managed) table; the partition column mesic is declared
-- in PARTITIONED BY, not in the column list.
CREATE TABLE IF NOT EXISTS temperature_part (
  stanice string,
  den int,
  hodina int,
  teplota double,
  flag string,
  latitude double,
  longitude double,
  vyska double,
  stat string,
  nazev string
)
partitioned by (mesic int)
STORED AS parquet
tblproperties('parquet.compression'='SNAPPY');

```

### Insert data 

```SQL
INSERT OVERWRITE TABLE temperature_part PARTITION (mesic)
SELECT
  stanice,
  den,
  hodina,
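  teplota,  -- already converted to Celsius in step 2
  flag,
  latitude,
  longitude,
  vyska,
  stat,
  nazev,
  mesic     -- dynamic partition column must come last
-- temperature_tmp was dropped in step 2, so read from temperature instead
FROM temperature where mesic is not NULL;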

```
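
Once the insert has finished, one way to confirm the result from within Beeline is sketched below. `SHOW PARTITIONS` lists one entry per `mesic` value; the row count per partition is an optional extra check and not part of the original task.

```SQL
-- List the partitions Hive created (entries have the form mesic=<value>).
SHOW PARTITIONS temperature_part;

-- Optional: number of rows stored in each partition.
SELECT mesic, count(*) AS row_count
FROM temperature_part
GROUP BY mesic
ORDER BY mesic;
```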


5 Advanced SQL

II. Find the state with the highest average temperature for each month

| Month | State | AVG_TEMP |
|:------|:------|:---------|
|       |       |          |



```SQL

SELECT mesic, stat, avg_teplota
FROM (SELECT
        stat,
        mesic,
        avg_teplota,
        RANK() OVER (PARTITION BY mesic ORDER BY avg_teplota DESC) AS r
      FROM (
             SELECT
               avg(teplota) AS avg_teplota,
               stat,
               mesic
             FROM (
                    SELECT
                      mesic,
                      teplota,
                      stat
                    FROM temperature) sub
             GROUP BY stat, mesic) mesic_avg) mesic_rank
WHERE r = 1;

```