### **Import the packages we will use**
#### To connect to Hive, call `enableHiveSupport()` on the builder when creating the Spark session

In [1]:
%spark.pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Hive Basics').enableHiveSupport().getOrCreate()


In [2]:
%spark.pyspark
from pyspark.sql import SparkSession
# getOrCreate() returns the session created above if it is still active
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

### **Running SHOW and DESCRIBE commands**


#### To run SQL commands against Hive, use `spark.sql()`. It returns a Spark DataFrame, so we call `show()` to display the result

In [5]:
%spark.pyspark
spark.sql("show databases").show()

In [6]:
%spark.pyspark
spark.sql("describe database default").show(truncate=False)

### **Running CREATE DATABASE**

In [8]:
%spark.pyspark
spark.sql("create database mytest")

In [9]:
%spark.pyspark
spark.sql("describe database mytest").show(truncate=False)

### **Creating a managed table from a DataFrame**

We can create a table from a DataFrame. To do that, we first build the DataFrame.

In [11]:
%spark.pyspark
data = [['Agus','F',100,150,150],['Windy','F',200,150,180],
        ['Budi','B',200,100,150],['Dina','F',150,150,130],
        ['Bayu','F',50,150,100],['Dedi','B',50,100,100]]

kolom = ["nama","kode_jurusan","nilai1","nilai2","nilai3"]
df = spark.createDataFrame(data,kolom)
df.show()

#### To save a DataFrame as a table, use `DataFrameWriter.saveAsTable()`. One of the available options is the save **mode**, which accepts *append*, *overwrite*, *ignore*, *error*, or *errorifexists* (the last two are synonyms and the default)

For this example we use *overwrite* mode and name the table *mahasiswa*.

In [13]:
%spark.pyspark
df.write.mode('overwrite').saveAsTable("mytest.mahasiswa")
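
As a rough mental model of the save modes listed above, the plain-Python sketch below maps each mode to what happens depending on whether the target table already exists. The function name `resolve_save_mode` and the returned action labels are hypothetical and are not part of the Spark API; they only paraphrase the documented behaviour.

```python
VALID_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def resolve_save_mode(mode: str, table_exists: bool) -> str:
    """Toy model (not Spark code): describe the effect of a save mode."""
    mode = mode.lower()
    if mode not in VALID_MODES:
        raise ValueError(f"unknown save mode: {mode}")
    if not table_exists:
        # Every mode creates the table when it does not exist yet.
        return "create table and write rows"
    if mode == "append":
        return "add rows to existing table"
    if mode == "overwrite":
        return "replace existing table contents"
    if mode == "ignore":
        return "do nothing"
    # "error" and "errorifexists" behave the same way.
    return "raise an error"


for m in ["append", "overwrite", "ignore", "error"]:
    print(m, "->", resolve_save_mode(m, table_exists=True))
```

With *overwrite*, re-running the cell above is safe: the existing `mytest.mahasiswa` contents are replaced instead of raising an error.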


In [14]:
%spark.pyspark
spark.sql("show tables from mytest").show()

#### To display the full properties of a table, use the `formatted` or `extended` option

In [16]:
%spark.pyspark
spark.sql("describe formatted mytest.mahasiswa").show(truncate=False)

### **Querying a Hive table**


In [18]:
%spark.pyspark
spark.sql("select * from mytest.mahasiswa").show()

In [19]:
%spark.pyspark
spark.sql("""SELECT  kode_jurusan, count(*) as jumlah_mhs, avg(nilai1) as rata2_nilai1 
            FROM mytest.mahasiswa 
            GROUP BY kode_jurusan""").show()
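
As a sanity check on the `GROUP BY` query above, the same aggregation can be reproduced in plain Python on the rows used to build the DataFrame. This is only a sketch for verifying the numbers by hand; it does not touch Spark or Hive, and the variable names `groups` and `summary` are illustrative.

```python
from collections import defaultdict

# Same rows as the `data` list used to create the DataFrame earlier.
data = [['Agus', 'F', 100, 150, 150], ['Windy', 'F', 200, 150, 180],
        ['Budi', 'B', 200, 100, 150], ['Dina', 'F', 150, 150, 130],
        ['Bayu', 'F', 50, 150, 100], ['Dedi', 'B', 50, 100, 100]]

# Collect nilai1 per kode_jurusan, mirroring GROUP BY kode_jurusan.
groups = defaultdict(list)
for nama, kode_jurusan, nilai1, nilai2, nilai3 in data:
    groups[kode_jurusan].append(nilai1)

# count(*) and avg(nilai1) for each group.
summary = {k: (len(v), sum(v) / len(v)) for k, v in groups.items()}
for kode, (jumlah_mhs, rata2_nilai1) in summary.items():
    print(kode, jumlah_mhs, rata2_nilai1)
```

Both groups happen to have the same average for `nilai1` (125.0), with 4 students in `F` and 2 in `B`, which is what the Spark query should report as well.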

### **Creating an external table from a DataFrame**


In [21]:
%sh
hdfs dfs -mkdir /user/userdev/mydata
hdfs dfs -mkdir /user/userdev/mydata/mahasiswa

In [22]:
%spark.pyspark
df.write.mode('overwrite') \
        .option("path", "hdfs://myzoo/user/userdev/mydata/mahasiswa") \
        .saveAsTable("mytest.mahasiswa_ext")


In [23]:
%spark.pyspark
spark.sql("describe extended mytest.mahasiswa_ext").show(truncate=False)


In [24]:
%spark.pyspark
spark.sql("SELECT * FROM mytest.mahasiswa_ext").show()

In [25]:
%sh
hdfs dfs -ls /user/userdev/mydata/mahasiswa

### **Creating an external table with CREATE TABLE**

#### Preparing the data to load into the external table

In [28]:
%sh
wget https://github.com/urfie/SparkSQL-dengan-Hive/raw/main/datasets/emp_clean.csv


In [29]:
%sh
hdfs dfs -mkdir /user/userdev/mydata/emp

In [30]:
%sh
hdfs dfs -put emp_clean.csv /user/userdev/mydata/emp

In [31]:
%sh
hdfs dfs -ls /user/userdev/mydata/emp

#### Create External Table


In [33]:
%spark.pyspark
spark.sql("""CREATE EXTERNAL TABLE mytest.emp_ext(
firstname STRING,
lastname STRING,
email STRING,
gender STRING,
age INT,
jobtitle STRING,
yearsofexperience BIGINT,
salary INT,
department STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://myzoo/user/userdev/mydata/emp'
""")
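
One thing to keep in mind with `ROW FORMAT DELIMITED FIELDS TERMINATED BY ','`: to our knowledge, Hive's default text format splits each line on every comma and does not interpret double quotes, so a quoted field that itself contains a comma would be spread over two columns. The plain-Python sketch below illustrates the difference on a made-up line; the sample record is hypothetical and is not taken from `emp_clean.csv`.

```python
import csv
import io

# Hypothetical sample line: the job title contains a comma inside quotes.
line = 'Ana,Putri,ana@example.com,F,30,"Engineer, Senior",5,9000,Product'

# Naive split on every comma, similar to a plain delimited text table.
naive = line.split(",")

# Quote-aware parsing, as a CSV-specific reader would do.
quote_aware = next(csv.reader(io.StringIO(line)))

print(len(naive), naive[5:7])
print(len(quote_aware), quote_aware[5])
```

The naive split yields 10 fields instead of 9, which would shift every column after `jobtitle`. If the real data contained such values, a CSV-aware SerDe would be worth considering; for a file without embedded commas, the delimited definition above is sufficient.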


In [34]:
%spark.pyspark
spark.sql("describe extended mytest.emp_ext").show(truncate=False)

In [35]:
%spark.pyspark
spark.sql("select count(*) from mytest.emp_ext").show()

In [36]:
%spark.pyspark
spark.sql("select * from mytest.emp_ext limit 5").show()

### **Running Hive functions**

In [38]:
%spark.pyspark
spark.sql("select lower(firstname), lower(lastname), lower(department) from mytest.emp_ext limit 5").show()

### **Creating a partitioned table**

#### We will use the `department` column as the partition column

In [41]:
%spark.pyspark
spark.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS mytest.emp_part(
firstname STRING,
lastname STRING,
email STRING,
gender STRING,
age INT,
jobtitle STRING,
yearsofexperience BIGINT,
salary INT)
PARTITIONED BY (department STRING)
STORED AS ORC
LOCATION 'hdfs://myzoo/user/userdev/mydata/emp_part'
""")

In [42]:
%spark.pyspark
spark.sql("describe formatted mytest.emp_part").show(truncate=False)

#### Insert data from the non-partitioned table into the partitioned table

In [44]:
%spark.pyspark
spark.sql("insert overwrite table mytest.emp_part partition(department) select * from mytest.emp_ext")
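
In a dynamic-partition insert like the one above, the partition value is taken from the last column of the `SELECT`. `select *` works here only because `department` is the last column of `mytest.emp_ext`. The sketch below builds an equivalent statement with an explicit column list so the ordering does not depend on the source table layout. The helper name `build_partition_insert` is hypothetical and is not part of Spark or Hive.

```python
def build_partition_insert(target, source, columns, partition_col):
    """Build an INSERT OVERWRITE statement with the partition column last."""
    # Keep the data columns in their original order, then append the
    # partition column so it lines up with PARTITION (...).
    data_cols = [c for c in columns if c != partition_col]
    select_list = ", ".join(data_cols + [partition_col])
    return (
        f"INSERT OVERWRITE TABLE {target} PARTITION ({partition_col}) "
        f"SELECT {select_list} FROM {source}"
    )


emp_columns = ["firstname", "lastname", "email", "gender", "age",
               "jobtitle", "yearsofexperience", "salary", "department"]
stmt = build_partition_insert("mytest.emp_part", "mytest.emp_ext",
                              emp_columns, "department")
print(stmt)
```

The resulting string could be passed to `spark.sql(stmt)`. Depending on the cluster configuration, Hive may also require `hive.exec.dynamic.partition.mode` to be set to `nonstrict` before a fully dynamic insert is accepted.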

#### Display the data in the `emp_part` table

In [46]:
%spark.pyspark
spark.sql("select * from mytest.emp_part").show()

#### Display the physical location of the tables in HDFS

In [48]:
%sh
hdfs dfs -ls /apps/spark/warehouse/mytest.db

In [49]:
%sh
hdfs dfs -ls /apps/spark/warehouse/mytest.db/mahasiswa

In [50]:
%sh
hdfs dfs -ls /user/userdev/mydata/emp_part

In [51]:
%sh
hdfs dfs -ls /user/userdev/mydata/emp_part/department=Product
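
The listings above show that each partition value gets its own `department=<value>` subdirectory under the table location. The plain-Python sketch below mimics that layout for a few made-up rows. The function `partition_dirs` and the sample rows are hypothetical (only the `Product` value and the base path appear in this notebook), and real partitions contain ORC files rather than row counts.

```python
import posixpath
from collections import defaultdict


def partition_dirs(base, rows, partition_col):
    """Map each Hive-style partition directory to the number of rows in it."""
    layout = defaultdict(int)
    for row in rows:
        # Hive names partition directories as <column>=<value>.
        path = posixpath.join(base, f"{partition_col}={row[partition_col]}")
        layout[path] += 1
    return dict(layout)


# Hypothetical sample rows, for illustration only.
rows = [
    {"firstname": "Ana", "department": "Product"},
    {"firstname": "Budi", "department": "Sales"},
    {"firstname": "Citra", "department": "Product"},
]

layout = partition_dirs("/user/userdev/mydata/emp_part", rows, "department")
for path, n_rows in sorted(layout.items()):
    print(path, n_rows)
```

Because the partition value is encoded in the directory name, a query that filters on `department` only needs to read the matching subdirectory.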

### **Clean up: delete all tables and data before re-running the notebook**

In [53]:
%spark.pyspark
spark.sql("drop table mytest.emp_ext")
spark.sql("drop table mytest.emp_part")
spark.sql("drop table mytest.mahasiswa_ext")
spark.sql("drop table mytest.mahasiswa")

In [54]:
%spark.pyspark
spark.sql("drop database mytest")

In [55]:
%sh
hdfs dfs -ls /user/userdev/mydata


In [56]:
%sh
hdfs dfs -rm -r -skipTrash /user/userdev/mydata