Import the package we will use

In [None]:
from pyspark.sql import SparkSession

To connect to Hive, call `enableHiveSupport()` on the builder when creating the Spark session

In [None]:
spark = SparkSession.builder.appName('Hive Basics').enableHiveSupport().getOrCreate()
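
One way to confirm that the session is actually backed by the Hive catalog is to read the `spark.sql.catalogImplementation` setting. The sketch below assumes `spark` is the session created above; the helper name `uses_hive_catalog` is hypothetical and not part of PySpark.

```python
def uses_hive_catalog(spark):
    """Return True when the session's catalog is backed by Hive.

    `spark` is assumed to be an existing SparkSession.
    """
    # The setting reads "hive" with Hive support and "in-memory" without it.
    return spark.conf.get("spark.sql.catalogImplementation") == "hive"
```

Note that `getOrCreate()` reuses a session that is already running, in which case `enableHiveSupport()` may have no effect. If `uses_hive_catalog(spark)` returns `False`, restarting the kernel before building the session is one option.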

## Running SHOW and DESCRIBE commands

To run a SQL command against Hive, use `spark.sql()`. It returns a Spark DataFrame, so call `show()` to display the result

In [None]:
spark.sql("show databases").show()

In [None]:
spark.sql("describe database default").show(truncate=False)
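
Because `spark.sql()` returns a DataFrame, the result can also be collected into ordinary Python values instead of being printed. The following is a minimal sketch, assuming `spark` is the session created earlier; `list_databases` is a hypothetical helper name.

```python
def list_databases(spark):
    """Return the database names as a plain Python list.

    `spark` is assumed to be an existing SparkSession with Hive support.
    """
    rows = spark.sql("show databases").collect()
    # Read the first field by position, because the name of the output
    # column is not the same in every Spark version.
    return [row[0] for row in rows]
```

Calling `list_databases(spark)` returns a list that can be used in normal Python logic, for example to check whether a database exists before creating it.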

## Running the CREATE DATABASE command

In [None]:
spark.sql("create database mytest;")

In [None]:
spark.sql("describe database mytest").show(truncate = False)

## Creating a managed table from a DataFrame

We can create a table from a DataFrame. To do that, we first build the DataFrame

In [None]:
data = [['Agus','F',100,150,150],['Windy','F',200,150,180],
        ['Budi','B',200,100,150],['Dina','F',150,150,130],
        ['Bayu','F',50,150,100],['Dedi','B',50,100,100]]

kolom = ["nama","kode_jurusan","nilai1","nilai2","nilai3"]
df = spark.createDataFrame(data,kolom)
df.show()

To save a DataFrame as a table, use `DataFrameWriter.saveAsTable()`. It accepts several parameters, including **mode**, which takes one of these values: *append, overwrite, ignore, error, errorifexists*. The last two are equivalent and are the default

For this example we choose the *overwrite* mode and name the table *mahasiswa*

In [None]:
df.write.mode('overwrite') \
         .saveAsTable("mytest.mahasiswa")
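
The save modes differ only in what happens when the target table already exists. The sketch below summarizes them and wraps the write in a small helper that rejects misspelled mode names before anything reaches Spark. `SAVE_MODES` and `save_table` are hypothetical names introduced for illustration, and `df` is assumed to be a Spark DataFrame.

```python
# Save modes accepted by DataFrameWriter.mode(), with their effect
# when the target table already exists.
SAVE_MODES = {
    "append": "add the new rows to the existing table",
    "overwrite": "replace the existing data",
    "ignore": "skip the write and leave the existing table untouched",
    "error": "raise an error (the default behaviour)",
    "errorifexists": "same as 'error'",
}


def save_table(df, table_name, mode="errorifexists"):
    """Write `df` as a table after checking the mode name locally.

    Hypothetical helper; `df` is assumed to be a Spark DataFrame.
    """
    if mode not in SAVE_MODES:
        raise ValueError(
            f"unknown save mode {mode!r}; expected one of {sorted(SAVE_MODES)}"
        )
    df.write.mode(mode).saveAsTable(table_name)
```

With this helper, `save_table(df, "mytest.mahasiswa", "overwrite")` performs the same write as the cell above.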

In [None]:
spark.sql("show tables from mytest").show()

To display the full properties of a table, add the `formatted` or `extended` option to `describe`

In [None]:
spark.sql("describe formatted mytest.mahasiswa").show(truncate=False)
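
The output of `describe formatted` is itself a DataFrame, so individual properties can be pulled out programmatically. The sketch below assumes the layout produced by recent Spark versions, where the table-level properties follow a row whose first field is `# Detailed Table Information`. `table_details` is a hypothetical helper name.

```python
def table_details(spark, table_name):
    """Return the table-level properties of `table_name` as a dict.

    Assumes `spark` is an existing SparkSession and that the output of
    DESCRIBE FORMATTED contains a "# Detailed Table Information" row.
    """
    details = {}
    in_details = False
    for row in spark.sql(f"describe formatted {table_name}").collect():
        key, value = row[0], row[1]
        if key == "# Detailed Table Information":
            # Everything after this marker describes the table itself.
            in_details = True
            continue
        if in_details and key:
            details[key] = value
    return details
```

For example, `table_details(spark, "mytest.mahasiswa").get("Type")` can be used to check whether the table is managed or external, and `.get("Location")` to find where its files are stored.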

## Querying a Hive table

In [None]:
spark.sql("select * from mytest.mahasiswa").show()

In [None]:
spark.sql("""SELECT  kode_jurusan, count(*) as jumlah_mhs, avg(nilai1) as rata2_nilai1 
            FROM mytest.mahasiswa 
            GROUP BY kode_jurusan""").show()

## Creating an external table from a DataFrame

In [None]:
!hdfs dfs -ls /user/hadoop/mydata

In [None]:
!hdfs dfs -mkdir /user/hadoop/mydata/mahasiswa

In [None]:
df.write.mode('overwrite') \
        .option("path", "hdfs://127.0.0.1:9000/user/hadoop/mydata/mahasiswa") \
        .saveAsTable("mytest.mahasiswa_ext")

In [None]:
spark.sql("describe extended mytest.mahasiswa_ext").show(truncate=False)
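
Because the writer is given an explicit `path`, Spark registers the table as external: dropping it removes only the metadata and leaves the files in place. As an alternative to reading the `describe` output, the table type can be read from the catalog API. The sketch below assumes `spark` is the session created earlier; `table_types` is a hypothetical helper name.

```python
def table_types(spark, database):
    """Map each table name in `database` to its type.

    Uses spark.catalog.listTables(), whose entries expose `name` and
    `tableType` (for example "MANAGED" or "EXTERNAL").
    """
    return {t.name: t.tableType for t in spark.catalog.listTables(database)}
```

Calling `table_types(spark, "mytest")` gives a quick overview of which tables in the database are managed and which are external.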

In [None]:
spark.sql("SELECT * FROM mytest.mahasiswa_ext").show()

In [None]:
!hdfs dfs -ls /user/hadoop/mydata/mahasiswa

## Creating a managed table with CREATE TABLE

In [None]:
#spark.sql("drop table mytest.emp")
#spark.sql("drop table mytest.emp_ext")

In [None]:
spark.sql("""CREATE TABLE IF NOT EXISTS mytest.emp(
firstname STRING,
lastname STRING,
email STRING,
gender STRING,
age INT,
jobtitle STRING,
yearsofexperience BIGINT,
salary INT,
department STRING)
STORED AS ORC;""")

In [None]:
spark.sql("describe extended mytest.emp").show(truncate=False)

In [None]:
spark.sql("select count(*) from mytest.emp").show()

## Creating an external table with CREATE TABLE

In [None]:
!wget https://github.com/urfie/SparkSQL-dengan-Hive/raw/main/datasets/emp_clean.csv

In [None]:
!hdfs dfs -ls /user/hadoop/mydata

In [None]:
!hdfs dfs -mkdir /user/hadoop/mydata/emp

In [None]:
!hdfs dfs -put emp_clean.csv /user/hadoop/mydata/emp

In [None]:
!hdfs dfs -ls /user/hadoop/mydata/emp

Create external table

In [None]:
spark.sql("""CREATE  EXTERNAL TABLE mytest.emp_ext(
firstname STRING,
lastname STRING,
email STRING,
gender STRING,
age INT,
jobtitle STRING,
yearsofexperience BIGINT,
salary INT,
department STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://127.0.0.1:9000/user/hadoop/mydata/emp'""")

In [None]:
spark.sql("describe extended mytest.emp_ext").show(truncate=False)

In [None]:
spark.sql("select count(*) from mytest.emp_ext").show()

In [None]:
spark.sql("select * from mytest.emp_ext limit 5").show()

## Insert into Managed Table from External Table

In [None]:
spark.sql("INSERT INTO mytest.emp SELECT * FROM mytest.emp_ext;")

In [None]:
spark.sql("select count(*) from mytest.emp").show()

In [None]:
spark.sql("select * from mytest.emp limit 5").show()

## Running Hive functions

In [None]:
spark.sql("select lower(firstname), lower(lastname), lower(department) from mytest.emp limit 5").show()

## Creating a partitioned table

We will use the `department` column as the partition column

In [None]:
spark.sql("""CREATE TABLE IF NOT EXISTS mytest.emp_part(
firstname STRING,
lastname STRING,
email STRING,
gender STRING,
age INT,
jobtitle STRING,
yearsofexperience BIGINT,
salary INT)
partitioned by (department string)
STORED AS ORC;""")

In [None]:
spark.sql("describe formatted mytest.emp_part").show(truncate=False)

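Insert data from the non-partitioned table into the partitioned table. In a dynamic partition insert the partition column must come last in the SELECT list; `select *` works here because `department` is the last column of `emp`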

In [None]:
spark.sql("insert overwrite table mytest.emp_part  partition(department) select *  from mytest.emp;")
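
Relying on `select *` only works while the source table happens to list the partition column last. A more explicit variant builds the statement from named columns, placing the partition columns at the end. The sketch below is a hypothetical helper, not part of PySpark. The commented usage also assumes that the cluster may require `hive.exec.dynamic.partition.mode` to be set to `nonstrict` for a fully dynamic insert, which depends on the Hive configuration.

```python
def dynamic_partition_insert_sql(target, source, data_columns, partition_columns):
    """Build an INSERT OVERWRITE statement for a dynamic partition insert.

    The partition columns are appended after the data columns, because
    they must be the last expressions in the SELECT list.
    """
    select_list = ", ".join(list(data_columns) + list(partition_columns))
    partition_spec = ", ".join(partition_columns)
    return (
        f"INSERT OVERWRITE TABLE {target} PARTITION ({partition_spec}) "
        f"SELECT {select_list} FROM {source}"
    )


# Usage in this notebook (assumes `spark` exists):
# spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")  # may be needed
# spark.sql(dynamic_partition_insert_sql(
#     "mytest.emp_part", "mytest.emp",
#     ["firstname", "lastname", "email", "gender", "age",
#      "jobtitle", "yearsofexperience", "salary"],
#     ["department"]))
```

Naming the columns keeps the insert correct even if the column order of the source table changes later.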

Display the data in the `emp_part` table

In [None]:
spark.sql("select * from mytest.emp_part").show()

Show the physical location of the table in HDFS

In [None]:
!hdfs dfs -ls /user/hive/warehouse/mytest.db