SQL is one of the most widely used languages for data processing and analysis. Spark supports running SQL queries against DataFrames.

We will use the same data as in the DataFrame exploration chapter.


In [25]:
%pip install pyspark



In [26]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

Initialize a Spark session to interact with the Spark cluster.

In [27]:
spark = SparkSession.builder.appName('DataFrame Basics').getOrCreate()

Download the dataset.

In [28]:
#!wget https://raw.githubusercontent.com/urfie/SparkSQL-dengan-Hive/main/datasets/indonesia2013-2015.csv
!wget https://github.com/urfie/SparkSQL-dengan-Hive/raw/main/datasets/application_record_header.csv.gz

--2023-10-06 17:29:03--  https://github.com/urfie/SparkSQL-dengan-Hive/raw/main/datasets/application_record_header.csv.gz
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/urfie/SparkSQL-dengan-Hive/main/datasets/application_record_header.csv.gz [following]
--2023-10-06 17:29:04--  https://raw.githubusercontent.com/urfie/SparkSQL-dengan-Hive/main/datasets/application_record_header.csv.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3175443 (3.0M) [application/octet-stream]
Saving to: ‘application_record_header.csv.gz.1’


2023-10-06 17:29:04 (249 MB/s) - ‘application_record_header.csv.gz.1’ sav

Load the file into a DataFrame.

In [29]:
df = spark.read.csv("application_record_header.csv.gz", header=True, inferSchema=True)

Before using SQL, we need to register the DataFrame we want to query as a temporary view.

Call `createOrReplaceTempView(table_name)` on that DataFrame.

In [30]:
df.createOrReplaceTempView("app_record")

We can now refer to the view name we defined inside SQL statements.

To execute a SQL statement, call `sql(sqlstatement)` on the Spark session.

In [31]:
spark.sql("select count(*) from app_record").show()

+--------+
|count(1)|
+--------+
|  438557|
+--------+



In [32]:
spark.sql("select * from app_record limit 5").show()

+-------+-----------+------------+---------------+------------+----------------+--------------------+--------------------+--------------------+-----------------+----------+-------------+----------+---------------+----------+----------+---------------+---------------+
|     ID|CODE_GENDER|FLAG_OWN_CAR|FLAG_OWN_REALTY|CNT_CHILDREN|AMT_INCOME_TOTAL|    NAME_INCOME_TYPE| NAME_EDUCATION_TYPE|  NAME_FAMILY_STATUS|NAME_HOUSING_TYPE|DAYS_BIRTH|DAYS_EMPLOYED|FLAG_MOBIL|FLAG_WORK_PHONE|FLAG_PHONE|FLAG_EMAIL|OCCUPATION_TYPE|CNT_FAM_MEMBERS|
+-------+-----------+------------+---------------+------------+----------------+--------------------+--------------------+--------------------+-----------------+----------+-------------+----------+---------------+----------+----------+---------------+---------------+
|5008804|          M|           Y|              Y|           0|        427500.0|             Working|    Higher education|      Civil marriage| Rented apartment|    -12005|        -4542|         1

In [33]:
spark.sql("select distinct NAME_EDUCATION_TYPE from app_record").show(truncate = False)

+-----------------------------+
|NAME_EDUCATION_TYPE          |
+-----------------------------+
|Academic degree              |
|Incomplete higher            |
|Secondary / secondary special|
|Lower secondary              |
|Higher education             |
+-----------------------------+



In [34]:
# Reference data: map each education type to a numeric level.
mydata = (('Academic degree',3),
    ('Incomplete higher',4),
    ('Secondary / secondary special',2),
    ('Lower secondary',1),
    ('Higher education',5))

# Build the lookup DataFrame, name its columns, and register it as a view.
ref_edu = spark.createDataFrame(mydata).toDF("NAME_EDUCATION_TYPE", "EDU_LEVEL")
ref_edu.createOrReplaceTempView("ref_edu")
spark.sql("select * from ref_edu").show()

+--------------------+---------+
| NAME_EDUCATION_TYPE|EDU_LEVEL|
+--------------------+---------+
|     Academic degree|        3|
|   Incomplete higher|        4|
|Secondary / secon...|        2|
|     Lower secondary|        1|
|    Higher education|        5|
+--------------------+---------+



In [38]:
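# Join each application to its education level, count applications per level,
# and save the result as a table. The count is given an alias (the name
# n_applicants is our choice) because some Spark versions may reject a saved
# column literally named "count(1)". ORDER BY gives a total ordering, whereas
# SORT BY only sorts within each partition.
spark.sql("""SELECT edu_level, count(1) AS n_applicants FROM
              (SELECT ref_edu.EDU_LEVEL AS edu_level
                FROM app_record LEFT JOIN ref_edu
                ON app_record.NAME_EDUCATION_TYPE = ref_edu.NAME_EDUCATION_TYPE) AS joined
             GROUP BY edu_level ORDER BY edu_level""").write.saveAsTable(name="aggregated_edu", mode="overwrite")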