# Resnu Mukti Ismail Hanif (2310506012)

# Hands-On Session 6: Data Processing with Apache Spark

### 1. Introduction to Spark DataFrames
A Spark DataFrame is a distributed collection of rows organized into named columns. Conceptually it is very similar to a DataFrame in pandas or a table in an RDBMS, and its operations are optimized for processing large datasets.

- **Task 1**: Create a simple DataFrame in Spark and explore some of the basic functions available.

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('HandsOnPertemuan6').getOrCreate()

data = [('James', 'Sales', 3000),
        ('Michael', 'Sales', 4600),
        ('Robert', 'Sales', 4100),
        ('Maria', 'Finance', 3000)]
columns = ['EmployeeName', 'Department', 'Salary']

df = spark.createDataFrame(data, schema=columns)
df.show()

+------------+----------+------+
|EmployeeName|Department|Salary|
+------------+----------+------+
|       James|     Sales|  3000|
|     Michael|     Sales|  4600|
|      Robert|     Sales|  4100|
|       Maria|   Finance|  3000|
+------------+----------+------+



### 2. Basic Transformations with DataFrames
Data processing involves transformations such as filtering, column selection, and aggregation. Spark provides efficient ways to carry out these operations.

- **Task 2**: Use the filter, select, and groupBy operations to extract information from the data, and aggregate it with functions such as mean, max, and sum to gain insight into the dataset.

In [2]:
df.select('EmployeeName', 'Salary').show()
df.filter(df['Salary'] > 3000).show()
df.groupBy('Department').avg('Salary').show()

+------------+------+
|EmployeeName|Salary|
+------------+------+
|       James|  3000|
|     Michael|  4600|
|      Robert|  4100|
|       Maria|  3000|
+------------+------+

+------------+----------+------+
|EmployeeName|Department|Salary|
+------------+----------+------+
|     Michael|     Sales|  4600|
|      Robert|     Sales|  4100|
+------------+----------+------+

+----------+-----------+
|Department|avg(Salary)|
+----------+-----------+
|     Sales|     3900.0|
|   Finance|     3000.0|
+----------+-----------+



In [3]:
# Import the module under an alias so that Python's built-in max and sum are not shadowed
from pyspark.sql import functions as F
df.select(F.mean("Salary")).show()
df.select(F.max("Salary")).show()
df.select(F.sum("Salary")).show()

+-----------+
|avg(Salary)|
+-----------+
|     3675.0|
+-----------+

+-----------+
|max(Salary)|
+-----------+
|       4600|
+-----------+

+-----------+
|sum(Salary)|
+-----------+
|      14700|
+-----------+



### 3. Working with Complex Data Types
Spark supports complex data types such as maps, arrays, and structs, which allow richer operations on nested or semi-structured datasets. The cell below starts with a simpler step: deriving new columns from an existing one with `withColumn`.

- **Task 3**: Explore how to process complex data types in Spark DataFrames.

In [4]:
df = df.withColumn('SalaryBonus', df['Salary'] * 0.1)
df = df.withColumn('TotalCompensation', df['Salary'] + df['SalaryBonus'])
df.show()

+------------+----------+------+-----------+-----------------+
|EmployeeName|Department|Salary|SalaryBonus|TotalCompensation|
+------------+----------+------+-----------+-----------------+
|       James|     Sales|  3000|      300.0|           3300.0|
|     Michael|     Sales|  4600|      460.0|           5060.0|
|      Robert|     Sales|  4100|      410.0|           4510.0|
|       Maria|   Finance|  3000|      300.0|           3300.0|
+------------+----------+------+-----------+-----------------+



### 4. Advanced Data Operations
This section uses Spark for advanced operations such as window functions, user-defined functions (UDFs), and query optimization.

- **Task 4**: Implement a window function to compute running totals or rankings.

In [5]:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

# Rank within each department; ascending order means rank 1 is the lowest salary
windowSpec = Window.partitionBy('Department').orderBy('Salary')
df.withColumn('Rank', F.rank().over(windowSpec)).show()

+------------+----------+------+-----------+-----------------+----+
|EmployeeName|Department|Salary|SalaryBonus|TotalCompensation|Rank|
+------------+----------+------+-----------+-----------------+----+
|       Maria|   Finance|  3000|      300.0|           3300.0|   1|
|       James|     Sales|  3000|      300.0|           3300.0|   1|
|      Robert|     Sales|  4100|      410.0|           4510.0|   2|
|     Michael|     Sales|  4600|      460.0|           5060.0|   3|
+------------+----------+------+-----------+-----------------+----+



### 5. Conclusion and Further Exploration
Review what you have learned about data processing with Spark and explore further techniques for optimizing your data processing. The cells below extend the exercise by comparing the join types Spark supports.

- **Task 5**: Summarize all the operations performed and how these techniques can be applied to your own data projects.

In [6]:
from pyspark.sql import SparkSession

# getOrCreate() returns the session started in cell 1 if it is still active
spark = SparkSession.builder.appName("joins").getOrCreate()

employees = spark.createDataFrame(
    [(1, 'James', 'Sales', 3000),
     (2, 'Michael', 'Sales', 4600),
     (3, 'Robert', 'Sales', 4100)],
    ['emp_id', 'EmployeeName', 'Department', 'Salary'])
# emp_id 3 has no address and emp_id 4 has no employee, so each join type gives a different result
addresses = spark.createDataFrame([(1, 'NY'), (2, 'LA'), (4, 'DC')], ['emp_id', 'address'])

result = employees.join(addresses, "emp_id", "inner")
result.show()

+------+------------+----------+------+-------+
|emp_id|EmployeeName|Department|Salary|address|
+------+------------+----------+------+-------+
|     1|       James|     Sales|  3000|     NY|
|     2|     Michael|     Sales|  4600|     LA|
+------+------------+----------+------+-------+



In [7]:
result = employees.join(addresses, "emp_id", "outer")
result.show()

+------+------------+----------+------+-------+
|emp_id|EmployeeName|Department|Salary|address|
+------+------------+----------+------+-------+
|     1|       James|     Sales|  3000|     NY|
|     2|     Michael|     Sales|  4600|     LA|
|     3|      Robert|     Sales|  4100|   null|
|     4|        null|      null|  null|     DC|
+------+------------+----------+------+-------+



In [8]:
result = employees.join(addresses, "emp_id", "left_outer")
result.show()

+------+------------+----------+------+-------+
|emp_id|EmployeeName|Department|Salary|address|
+------+------------+----------+------+-------+
|     1|       James|     Sales|  3000|     NY|
|     2|     Michael|     Sales|  4600|     LA|
|     3|      Robert|     Sales|  4100|   null|
+------+------------+----------+------+-------+



In [9]:
result = employees.join(addresses, "emp_id", "right_outer")
result.show()

+------+------------+----------+------+-------+
|emp_id|EmployeeName|Department|Salary|address|
+------+------------+----------+------+-------+
|     1|       James|     Sales|  3000|     NY|
|     2|     Michael|     Sales|  4600|     LA|
|     4|        null|      null|  null|     DC|
+------+------------+----------+------+-------+



In [10]:
result = employees.join(addresses, "emp_id", "left_semi")
result.show()

+------+------------+----------+------+
|emp_id|EmployeeName|Department|Salary|
+------+------------+----------+------+
|     1|       James|     Sales|  3000|
|     2|     Michael|     Sales|  4600|
+------+------------+----------+------+



In [11]:
result = employees.join(addresses, "emp_id", "left_anti")
result.show()

+------+------------+----------+------+
|emp_id|EmployeeName|Department|Salary|
+------+------------+----------+------+
|     3|      Robert|     Sales|  4100|
+------+------------+----------+------+

